Viperfish 64-Byte Bundle

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped — full C++ symbols). All addresses are virtual addresses; .text and .rodata are mapped 1:1 (VA == file offset). Other wheel versions differ.

Abstract

Viperfish (kViperfish, TpuVersion enum 3, external "TPU v5" / "TPU v5 lite") is the first TPU TensorCore generation whose VLIW issue word is 64 bytes (512 bits) wide, up from Pufferfish's 51 bytes and Jellyfish's 41. Both cloud SKU names map to this one codename: the accelerator-type parser AcceleratorTypeToTpuVersionEnum (@ 0x204cf620) routes v5e and v5lite to enum case 5 and v5p to case 6, but the codec/HAL family for both is the single Viperfish generation — there is no separate per-cloud-name codec. The 64-byte width is not derived by zero-extending a narrower bundle — it is returned (after an a2 == 0 component check) from the codec-metadata virtual ViperfishCodecMetadata::BundleSizeBytes(TpuSequencerType) (@ 0x1ee71320, return 64), one cell of the (TpuVersion, TpuSequencerType)-keyed codec-metadata table described on the Bundle Model page. The 64-byte width recurs in the two later VXC-family generations Ghostlite (kGhostlite, enum 4, external "TPU v6 lite") and 6acc60406 (k6acc60406, enum 5, external "TPU7x"), but each has its own per-generation codec class and slot-encoder set; this page documents Viperfish specifically and notes where the family diverges. ("Trillium" appears nowhere in the binary — the internal name for the TPU7x generation is the obfuscated tag 6acc60406.)

The structural break from the JF/PF model is the encode mechanism, not just the width. Jellyfish and Pufferfish each have a single monolithic encoder — EncoderJf::EncodeBundleInternal (@ 0x1e86c7c0) and EncoderPfTensorCore::EncodeBundleInternal (@ 0x1e8c5c40) — that walks one Bundle object and packs every slot inline. Viperfish has no EncodeBundleInternal. Its top-level entry is the codec dispatcher TpuCodecViperfish::EncodeBundle (@ 0x1e8449a0), which switches on TpuSequencerType and, for the TensorCore case (sequencer 0), constructs the asic_sw::deepsea::vxc::isa::TensorCoreCodecBase<…> (whose template-argument list is the slot-encoder walk order) and calls viperfish::isa::EncoderVfTensorCore::EncodeBundle (@ 0x1d2f7ce0). That function casts the bundle facade to vxc::isa::TensorCoreBundle (or TensorCoreBundleCompact), allocates a zeroed BundleSizeBytes()-long buffer, then invokes the codec's TensorCoreBundle::Encode worker (a vxc TensorCoreCodecBase member, in the 0x1d2f8… range), which serializes each slot through its own <Slot>Encoder::Encode method. Every field write — opcode, operand, predicate — goes through one universal bit-granular packer, BitCopy(dst, dst_bit, src, src_bit, nbits) (@ 0x1fa0a900, LSB-first), so the whole bundle layout is expressed as a flat list of (field, absolute-bit, width) triples rather than a struct cast.

NOTE — the structurally similar codec at EncodeBundle @ 0x1e838cc0 / TensorCoreCodecBase worker 0x1d371540 (ctor 0x1d371e80) is not Viperfish: those addresses live in asic_sw::deepsea::gxc::gfc::isa (the 6acc60406 / gfc TPU7x codec, default arm tpu_codec_6acc60406.cc). Viperfish's per-slot encoders are all in asic_sw::deepsea::vxc::isa; do not cross-reference gfc addresses when reconstructing the vxc bundle.

The rest of the page maps those triples. It covers the slot taxonomy and which bytes each slot occupies, the two MXU slots (VectorExtended0/1) with their matmul / weight-latch / matpush / transpose / lane / cross-lane op families, the VectorResult0/1 pop slots that drain the MXU and the EUP, the EUP/transcendental push that — unlike every other compute op — lives in VALU slot 3 rather than a dedicated slot, the immediate, sequencer, and predicate slot positions, and how V5+ structurally differs from the JF/PF EncodeBundleInternal model.

For reimplementation, the contract is:

  • The 64-byte width and that it is a codec-metadata constant (ViperfishCodecMetadata::BundleSizeBytes → 64), selected by (TpuVersion, TpuSequencerType) lookup — never by extending a 51-byte bundle.
  • The V5+ two-stage encode chain: isa_emitter::EmitX populates a proto submessage; <Slot>Encoder::Encode then BitCopys each proto field to a generation-fixed absolute bit in the 512-bit buffer. The entry is TpuCodecViperfish::EncodeBundle → EncoderVfTensorCore::EncodeBundle; there is no EncodeBundleInternal.
  • The MXU slot model: two VectorExtended slots (one per physical MXU) sharing one 8×6-bit systolic-feed vreg region, with distinct opcode/control regions; the matmul opcode is 7-bit @ bit 57, the data-format sub-discriminator 4-bit @ bit 51, the MXU-id 4-bit @ bit 64.
  • The latch/push/matmul opcode families (LoadMatrixRegister{Gmr,Lmr}{Msra,Msrb}, Pushmatrix<fmt>, MatrixMultiply<fmt>[Lgmr{Msra,Msrb}][Masked]) and the EUP push-pop protocol (VALU3 push → VectorResult PopEupResult).

| Property | Value |
| --- | --- |
| Generation | Viperfish — kViperfish, TpuVersion enum 3, external "TPU v5" / "TPU v5 lite" (per Bundle Model) |
| Cloud SKUs | v5e/v5lite (case 5) and v5p (case 6) in AcceleratorTypeToTpuVersionEnum @ 0x204cf620 — both served by the one Viperfish codec |
| Namespace | asic_sw::deepsea::vxc::isa (TC), asic_sw::deepsea::vxc::vfc::isa (SparseCore TEC/SCS) |
| Bundle width | 64 B / 512 bit; ViperfishCodecMetadata::BundleSizeBytes @ 0x1ee71320, return 64 (after a2==0 component check) |
| Width dispatch | codec_metadata::BundleSizeBytes(TpuVersion, TpuSequencerType) @ 0x1ecf7180 → GetMetadataOrDie @ 0x1ecf6f60 (CodecMetadataRegistry hash-map) → vtable +16 |
| Encode entry | TpuCodecViperfish::EncodeBundle @ 0x1e8449a0 → EncoderVfTensorCore::EncodeBundle @ 0x1d2f7ce0 → vxc TensorCoreCodecBase worker (0x1d2f8…), per-slot Encoder::Encode walk |
| Universal packer | BitCopy(dst, dst_bit, src, src_bit, nbits) @ 0x1fa0a900 — bit-granular, LSB-first (byte = dst_bit>>3, bit = dst_bit&7) |
| MXU slots | TensorCoreVectorExtended0Encoder::Encode @ 0x1efa0f60 (switch on proto opcode, max case 0x66=102 = CoreToCoreMove), …Extended1Encoder::Encode @ 0x1efec020 |
| Result slots | TensorCoreVectorResult0Encoder::Encode @ 0x1f018f40, …Result1Encoder::Encode @ 0x1f019d80 |
| EUP push | EncodeTensorCoreVectorAlu3EupPush @ 0x1ef6e400 (VALU slot 3) |
| MXU op census | 94 VectorExtended0 + 94 VectorExtended1 per-op helpers; 0 Extended2/3Encoder (exactly two MXU issue slots) |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |
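
The width-dispatch chain listed above can be sketched as a registry lookup. This is a minimal Python sketch, not the binary's code: the class, registry, and function names are hypothetical stand-ins, only the Viperfish cell documented on this page is registered, and the error raised for a non-TensorCore sequencer is an assumption (the page only reports that a component check precedes the return).

```python
K_VIPERFISH = 3        # TpuVersion enum value stated on this page
SEQ_TENSOR_CORE = 0    # TpuSequencerType for the TensorCore


class ViperfishCodecMetadata:
    """Hypothetical stand-in for the codec-metadata object."""

    def bundle_size_bytes(self, seq: int) -> int:
        # The page reports a check on the sequencer argument before `return 64`.
        # What the binary does on a mismatch is not documented; raising is an assumption.
        if seq != SEQ_TENSOR_CORE:
            raise ValueError(f"no width recorded here for sequencer {seq}")
        return 64


# Hypothetical registry: one metadata object per TpuVersion.
CODEC_METADATA_REGISTRY = {K_VIPERFISH: ViperfishCodecMetadata()}


def get_metadata_or_die(version: int) -> ViperfishCodecMetadata:
    if version not in CODEC_METADATA_REGISTRY:
        raise KeyError(f"no codec metadata registered for TpuVersion {version}")
    return CODEC_METADATA_REGISTRY[version]


def bundle_size_bytes(version: int, seq: int) -> int:
    """The width is looked up per generation, never derived from a narrower bundle."""
    return get_metadata_or_die(version).bundle_size_bytes(seq)


print(bundle_size_bytes(K_VIPERFISH, SEQ_TENSOR_CORE))
```

The point of the shape is that an unknown generation fails loudly instead of falling back to some other generation's width.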

Slot Taxonomy and Byte Map

A Viperfish bundle is a 512-bit flat buffer. Every functional unit reads one contiguous bit window of it; the windows are fixed per generation and populated by the per-slot encoders. The reader should think of the 64 bytes as a packed bitfield indexed from bit 0 = LSB of byte 0, exactly as BitCopy indexes it.
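
The bit indexing above can be made concrete with a small packer. This is a minimal Python sketch of an LSB-first bit copy with the same argument order as BitCopy; it is an illustration of the indexing rule (byte = bit >> 3, bit-in-byte = bit & 7), not a transcription of the function in the binary.

```python
def bit_copy(dst: bytearray, dst_bit: int, src: bytes, src_bit: int, nbits: int) -> None:
    """Copy nbits from src (starting at src_bit) to dst (starting at dst_bit).

    Bit 0 is the least significant bit of byte 0 in both buffers.
    """
    for i in range(nbits):
        s = src_bit + i
        d = dst_bit + i
        bit = (src[s >> 3] >> (s & 7)) & 1
        mask = 1 << (d & 7)
        if bit:
            dst[d >> 3] |= mask
        else:
            dst[d >> 3] &= ~mask & 0xFF


bundle = bytearray(64)                          # zeroed 512-bit buffer
bit_copy(bundle, 57, bytes([0x01]), 0, 7)       # a 7-bit field at absolute bit 57
bit_copy(bundle, 157, bytes([0x3F]), 0, 6)      # a 6-bit field that straddles bytes 19 and 20
```

A field that crosses a byte boundary, such as the 6-bit window at bit 157, needs no special casing because every bit is placed individually.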

The slot taxonomy is the same Bundle sub-instruction set as Pufferfish (see Bundle Model), with the V5+ doubling of the MXU and the EUP moved into the VALU. The table below gives each slot, the encoder that serializes it, and the bit region it occupies (from the verified BitCopy offsets of the representative ops).

| Slot | Engine | Encoder (asic_sw::deepsea::vxc::isa::…) | Region (bits) |
| --- | --- | --- | --- |
| VectorExtended0 | MXU 0 (matmul / latch / push / transpose / lane / xlane) | TensorCoreVectorExtended0Encoder::Encode @ 0x1efa0f60 | opcode/ctl ~48–64; operand pool 157–293 |
| VectorExtended1 | MXU 1 | TensorCoreVectorExtended1Encoder::Encode @ 0x1efec020 | opcode/ctl ~28–44; shares the 157–293 pool |
| VectorResult0 | MXU/EUP/transpose result pop | TensorCoreVectorResult0Encoder::Encode @ 0x1f018f40 | 11–24 (dest @ 14) |
| VectorResult1 | second result pop | TensorCoreVectorResult1Encoder::Encode @ 0x1f019d80 | mirror of Result0 |
| VectorAlu0…3 | 4 VALU lanes (Alu3 also issues the EUP push) | TensorCoreVectorAlu{0..3}Encoder::Encode | VALU0 opcode @ 299 w7; Alu3 EUP @ 186/197 |
| VectorLoad0/1 | vector memory load | TensorCoreVectorLoad{0,1}Encoder | per-gen repacked (see Memory Load) |
| VectorStore | vector memory store | TensorCoreVectorStoreEncoder | data-vreg @ 170 w4; base @ 157 w6 |
| ScalarAlu0/1 | sequencer / scalar pipe (branch/call/halt/LCC) | TensorCoreScalarAlu0Encoder::Encode @ 0x1eecb900 | opcode-low @ 488 w5; pred @ 499/503 |
| Immediates | 6 immediate slots (branch/call offset home) | TensorCoreImmediatesEncoder::Encode @ 0x1eebee40 | imm0 @ 430 … imm5 @ 330, each w20 |
| Predicates | predicate pool (per-slot field on VF) | per-slot 4+1 field at slot top | TC ScalarAlu0 @ 499/503 |

NOTE — the byte regions are not contiguous per slot in the obvious way. The MXU operand pool (bits 157–293, the eight 6-bit source vregs) lives physically in the middle of the bundle and is shared between VectorExtended0 and VectorExtended1; only the opcode/control words at the top of each MXU slot are distinct. A reimplementation that assumes each slot owns a private contiguous run of bytes will mis-pack the second MXU.

QUIRK — there is no Sparsity slot in the Viperfish TensorCore bundle. The 94 VectorExtended0 op families (matmul, latch, push, transpose, lane-broadcast/rotate, permute, cross-lane reduce, set-pattern-register, core-to-core move) contain no structured-sparsity op. Structured sparsity arrives in a later generation; see Sparsity Slot (the backing finding is still open — do not invent a VF sparsity field).


The V5+ Encode Model — Why There Is No EncodeBundleInternal

Purpose

The single most important structural fact about the Viperfish bundle is how it is built, because the JF/PF mental model breaks. Jellyfish and Pufferfish carry a per-encoder EncodeBundleInternal that takes a Bundle and packs all slots in one function body, mixing and/shl/or (Jellyfish direct-pack) or BitCopy (Pufferfish). Viperfish replaces that with a templated codec whose slot list is fixed at compile time and a per-slot encoder per slot. A reimplementer who looks for EncoderVf::EncodeBundleInternal will not find it.

Entry Point

TpuCodecViperfish::EncodeBundle @0x1e8449a0    ── codec dispatch on TpuSequencerType (vxc)
  ├─ (TC, seq 0)  → vxc::isa::TensorCoreCodecBase<…> ctor (operator new 0xF0)
  │                  EncoderVfTensorCore::EncodeBundle @0x1d2f7ce0
  │                    └─ TensorCoreBundle::Encode worker (0x1d2f8… range)
  │                       ── walks each slot Encoder in template-arg order:
  │             ├─ TensorCoreImmediatesEncoder::Encode      @0x1eebee40
  │             ├─ TensorCoreScalarAlu0Encoder::Encode      @0x1eecb900
  │             ├─ TensorCoreVectorAlu{0..3}Encoder::Encode
  │             ├─ TensorCoreVectorExtended0Encoder::Encode @0x1efa0f60   ── MXU 0
  │             ├─ TensorCoreVectorExtended1Encoder::Encode @0x1efec020   ── MXU 1
  │             ├─ TensorCoreVectorResult0Encoder::Encode   @0x1f018f40
  │             ├─ TensorCoreVectorResult1Encoder::Encode   @0x1f019d80
  │             ├─ TensorCoreVectorStoreEncoder::Encode
  │             └─ TensorCoreVectorLoad{0,1,2}Encoder::Encode
  ├─ (SCS, seq 3) → vxc::vfc::isa::SparseCoreScsCodecBase<…> → EncoderVfSparseCoreScs::EncodeBundle
  ├─ (TAC, seq 4) → vxc::vfc::isa::SparseCoreTacCodecBase<…> → EncoderVfSparseCoreTac::EncodeBundle
  └─ (TEC, seq 5) → vxc::vfc::isa::SparseCoreTecCodecBase<…> → EncoderVfSparseCoreTec::EncodeBundle

NOTE — the TC codec template lists three vector-load slots (TensorCoreVectorLoad0/1/2) and carries Predication as a template type parameter rather than as a dedicated TensorCorePredicatesEncoder slot — that dedicated predicate-slot encoder first appears in the 6acc60406 (gfc) TC codec, consistent with the Bundle Model's note that the predicate slot is a 6acc60406 addition.

Algorithm

The two-stage chain is: upstream the isa_emitter EmitX template populates a proto submessage and sets its present bit; downstream the slot encoder reads that submessage and BitCopys each field to its absolute bit. The encoder bodies inline the per-op packing as a switch on the proto opcode (proto + 0x50); each switch arm is one op family's field layout.

function TpuCodecViperfish::EncodeBundle(out, bundle, seq):   // @0x1e8449a0
    switch (seq):                              // TpuSequencerType
        case 0: codec = vxc::TensorCoreCodecBase<…>()         // TC
                EncoderVfTensorCore::EncodeBundle(out, bundle) // @0x1d2f7ce0
        case 3: …SparseCoreScs…   case 4: …SparseCoreTac…   case 5: …SparseCoreTec…

function EncoderVfTensorCore::EncodeBundle(out, bundle):   // @0x1d2f7ce0
    tc = cast<vxc::isa::TensorCoreBundle>(bundle)           // or TensorCoreBundleCompact
    buf = zeroed(tc.BundleSizeBytes())                      // 64 bytes
    tc.Encode(buf)            // vxc TensorCoreCodecBase worker, walks slot encoders
    // worker, in template-arg order, calls each SlotEncoder.Encode(slot_proto, buf):

function TensorCoreVectorExtended0Encoder::Encode(proto, out):   // @0x1efa0f60
    out_field = *(int*)(proto + 0x1c)
    BitCopy(out, 64, &out_field, 0, 4)         // MXU-id (unit) written FIRST, always
    switch (*(int*)(proto + 0x50)):             // proto opcode — the op-family dispatch
        case 0:    return                       // empty slot — nothing packed
        case 0xC:  EncodeMatrixMultiplyBf16(out, proto)   // = decimal 12
        case 0x41: EncodePushmatrixBf16(out, proto)       // = decimal 65
        case …:    // 94 op families; max case 0x66 (=102) = CoreToCoreMove

GOTCHA — the present-bit guard. Every field BitCopy is gated on a proto present bit ((proto[2] & mask) != 0), and when the field is absent the encoder reads a default-instance global (…_globals_) instead of skipping. The decompiled MatrixMultiplyBf16 body (@ 0x1efa2e40) interleaves the live-proto path and the …MatrixMultiplyBf16_globals_ default path for every field. A reimplementation that omits the default-instance fallback will leave stale bits when a field is unset; the hardware default for an unset matmul source vreg is whatever the proto default carries, not zero.
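
The guard described above can be sketched as "pick the source, then always write". This is a minimal Python sketch under stated assumptions: the MatmulProto dataclass, the SRC0_PRESENT mask value, and the default value 0x2A are all made up for illustration; only the bit position (157, width 6) comes from this page.

```python
from dataclasses import dataclass


def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


@dataclass
class MatmulProto:            # hypothetical stand-in for the proto submessage
    present: int = 0          # bitmask of populated fields
    src0: int = 0


SRC0_PRESENT = 0x1                           # hypothetical mask value
DEFAULT_INSTANCE = MatmulProto(src0=0x2A)    # made-up default, for illustration only


def encode_src0(buf: bytearray, proto: MatmulProto) -> None:
    # Absent field: read the default instance instead of skipping the write.
    source = proto if proto.present & SRC0_PRESENT else DEFAULT_INSTANCE
    put_bits(buf, 157, 6, source.src0)


unset = bytearray(64)
encode_src0(unset, MatmulProto())
populated = bytearray(64)
encode_src0(populated, MatmulProto(present=SRC0_PRESENT, src0=5))
```

The write is unconditional in both cases; only the object the value is read from changes.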


MXU Slots — VectorExtended0 / VectorExtended1

Purpose

Each VectorExtended slot drives one MXU and carries the full matrix-unit op vocabulary: dense matmul (MatrixMultiply<fmt>), the weight-stationary latch (LoadMatrixRegister{Gmr,Lmr}{Msra,Msrb}), the moving-operand push (Pushmatrix<fmt>), the latch-via-LMR fused matmul (MatrixMultiplyLmr / …Lgmr{Msra,Msrb}), plus transpose, lane-broadcast/rotate, permute, cross-lane reduce, and pattern-register ops. There are exactly two slots — the binary has 94 Extended0 and 94 Extended1 per-op helpers and zero Extended2/3.

Encoding — Absolute Bit Map

The MXU control region for Viperfish, verified from the decompiled MatrixMultiplyBf16 (@ 0x1efa2e40), PushmatrixBf16 (@ 0x1efaf820), and LoadMatrixRegisterGmrMsra helpers:

| Field | Abs bit | Width | Written by | Value (Bf16 case) |
| --- | --- | --- | --- | --- |
| MXU-id (unit) | 64 | 4 | dispatcher (proto + 0x1c) | 0 (MXU 0) |
| opcode-HIGH (matmul) | 57 | 7 | MatrixMultiply<fmt> | 0x1 |
| opcode-HIGH (latch) | 57 | 7 | LoadMatrixRegister* | 0x37 (55) |
| opcode-HIGH (push) | 59 | 5 | Pushmatrix<fmt> | 0xe (14) |
| data-format sub-disc | 51 | 4 | per-op | matmul Bf16=1 / push Bf16=3 / latch=0 |
| control (3-bit) | 48 | 3 | per-op | proto + 0x18 |
| done-gains / latch flag | 55 | 2 | per-op | proto + 0x1c |
| Transpose field (push/latch) | 57 | 1 | Pushmatrix* (proto + 0x20) | |
| Target field (push/latch) | 58 | 1 | Pushmatrix* (proto + 0x24) | |
| push-src vreg | 180 | 6 | Pushmatrix* (proto + 0x44) | |
| primary operand (matmul) | 180 | 6 | MatrixMultiply* | proto + 0x3c |
| src vreg #1 (proto + 0x20) | 157 | 6 | matmul | systolic feed |
| src vreg #2 (proto + 0x24) | 282 | 6 | matmul | systolic feed |
| src vreg #3 (proto + 0x28) | 293 | 6 | matmul | systolic feed |
| src vreg #4 (proto + 0x2c) | 248 | 6 | matmul | systolic feed |
| src vreg #5 (proto + 0x30) | 259 | 6 | matmul | systolic feed |
| src vreg #6 (proto + 0x34) | 214 | 6 | matmul | systolic feed |
| src vreg #7 (proto + 0x38) | 225 | 6 | matmul | systolic feed |

QUIRK — the opcode field changes both position and width by op family at the same slot. MatrixMultiply<fmt> writes a 7-bit opcode at bit 57; Pushmatrix<fmt> writes a 5-bit opcode-HIGH at bit 59 (with the 4-bit data-format at bit 51 below it); LoadMatrixRegister* reuses the 7-bit @ 57 window but with value 0x37. On the push path, bits 57 and 58 are repurposed as the 1-bit Transpose and Target fields. Because BitCopy is LSB-first, these are the same physical bits that on the matmul path are the two low bits of the 7-bit opcode. The decoder distinguishes the families by the opcode-HIGH value (a push carries the 5-bit opcode-HIGH 0xe @ 59; a matmul carries 0x1 and a latch 0x37 in the 7-bit window @ 57), so the LSB region (bits 57–58) is free to carry the Transpose/Target control. On the decode side the Pushmatrix*TransposeField reads [base+8]>>0x39&1 = abs57 and the …TargetField reads >>0x3a&1 = abs58.
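
The overlap between the 7-bit window at bit 57 and the 5-bit window at bit 59 is plain arithmetic and can be checked directly. This is a minimal Python sketch: put_bits and get_bits are hypothetical helpers, and the Transpose/Target values are arbitrary; only the bit positions, widths, and opcode constants come from the table above. It does not model how the real decoder classifies a slot.

```python
def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


# Push path: 5-bit opcode-HIGH at 59, with Transpose/Target in bits 57/58.
push = bytearray(64)
put_bits(push, 59, 5, 0xE)   # Pushmatrix opcode-HIGH
put_bits(push, 57, 1, 1)     # Transpose (arbitrary example value)
put_bits(push, 58, 1, 0)     # Target (arbitrary example value)

# Matmul path: one 7-bit opcode at 57.
matmul = bytearray(64)
put_bits(matmul, 57, 7, 0x1)

# Reading each buffer through the *other* family's window shows the overlap.
push_view7 = get_bits(push, 57, 7)       # equals (0xE << 2) | 0b01
matmul_view5 = get_bits(matmul, 59, 5)   # equals 0x1 >> 2
```

Here push_view7 is 0x39 and matmul_view5 is 0, which is why a reader of the raw bits has to know the op family before interpreting bits 57 and 58.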

The Shared Systolic-Feed Operand Pool

The eight source-vreg fields (bits 157/282/293/248/259/214/225 plus the operand @ 180) are the systolic operand stream into the MXU. The decisive structural fact: VectorExtended0 and VectorExtended1 read the same vregs from the same absolute bits — only the opcode/control region differs between the two slots. Both MXUs draw the same vector read ports; the two slots address two physical matrix units over one shared operand pool.

// VectorExtended0 (MXU 0) and VectorExtended1 (MXU 1):
//   opcode/control region: distinct per slot (Extended1 shifted down ~20 bits)
//   operand pool 157..293: IDENTICAL absolute bits in both slots
function EncodeMatrixMultiplyBf16(out, proto):     // @0x1efa2e40, proto opcode 0xc
    BitCopy(out, 57, {0x1}, 0, 7)                  // opcode-HIGH = 1
    BitCopy(out, 51, {0x1}, 0, 4)                  // data-format = bf16
    BitCopy(out, 48, proto.control, 0, 3)
    BitCopy(out, 55, proto.done_gains, 0, 2)
    BitCopy(out, 157, proto.src0, 0, 6)            // present-gated; else read globals_
    BitCopy(out, 282, proto.src1, 0, 6)
    BitCopy(out, 293, proto.src2, 0, 6)
    BitCopy(out, 248, proto.src3, 0, 6)
    BitCopy(out, 259, proto.src4, 0, 6)
    BitCopy(out, 214, proto.src5, 0, 6)
    BitCopy(out, 225, proto.src6, 0, 6)
    BitCopy(out, 180, proto.operand, 0, 6)
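
The same field list can be exercised in Python to confirm that none of the windows collide. This is a minimal sketch, assuming a plain dict in place of the proto submessage and omitting the present-bit gating; put_bits, get_bits, and encode_matmul_bf16 are hypothetical names, while every bit position and width is taken from the listing above.

```python
def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


# (name, absolute bit) for the 6-bit operand-pool fields, in listing order.
OPERAND_POOL = [
    ("src0", 157), ("src1", 282), ("src2", 293), ("src3", 248),
    ("src4", 259), ("src5", 214), ("src6", 225), ("operand", 180),
]


def encode_matmul_bf16(buf: bytearray, mxu_id: int, control: int,
                       done_gains: int, vregs: dict) -> None:
    put_bits(buf, 64, 4, mxu_id)       # MXU-id, written first by the dispatcher
    put_bits(buf, 57, 7, 0x1)          # opcode-HIGH = matmul
    put_bits(buf, 51, 4, 0x1)          # data-format = bf16
    put_bits(buf, 48, 3, control)
    put_bits(buf, 55, 2, done_gains)
    for name, bit in OPERAND_POOL:     # shared systolic-feed pool
        put_bits(buf, bit, 6, vregs[name])


bundle = bytearray(64)
example_vregs = {name: i + 1 for i, (name, _) in enumerate(OPERAND_POOL)}
encode_matmul_bf16(bundle, mxu_id=0, control=5, done_gains=2, vregs=example_vregs)
```

Reading every field back with get_bits returns the value that was written, so the windows are disjoint within the Extended0 layout.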

Op Inventory

The 94 helpers per slot are not 94 distinct layouts — they are one {opcode, sub-format, MXU-id, operand} template specialized by the opcode immediate and the operand-present mask. The op families:

| Family | Members | Role |
| --- | --- | --- |
| MatrixMultiply<fmt> | Bf16, Bf8, If8Bf16, S4, S8, U4, U8, F32Rounded, Lmr | dense matmul step, one helper per data format |
| MatrixMultiply<fmt>Lgmr{Msra,Msrb}[Masked] | per fmt × {Msra, Msrb} × {plain, Masked} | latch-via-LMR fused matmul (multi-pass K-tiling) |
| Pushmatrix<fmt> | Bf16, Bf8, S4, S8, U4, U8, Rounded, PackedIf8Conv | moving-operand push (matprep) |
| LoadMatrixRegister{Gmr,Lmr}{Msra,Msrb} | 4 variants | weight-stationary latch (opcode-HIGH 0x37) |
| *TransposeStart[End], *TransposeContinueAnyType, Segmented*, Packed* | ~10 | systolic-array transpose |
| LaneBroadcast[Packed], LaneRotate, PackedLaneRotate, Permute[Packed] | ~6 | intra-lane data movement |
| Xlane{Add,Max,Min}[Index] | 5 | cross-lane reduction |
| SetPatternRegisterPcr[Bytes,Sublanes], CoreToCoreMove, SupplementalPackedXlu | 4 | pattern-register / inter-core / XLU supplement |

NOTE — the Masked suffix and the Lgmr{Msra,Msrb} suffix are the V5+ realization of the Jellyfish 6-entry GainLatchMode → VEOpcode table: what was a small enum on v3 became a named opcode family on v5. Msra/Msrb select which of the two matrix-staging-register banks the fused matmul accumulates into. See MXU Slot for the cross-generation opcode story and MXU Latency (Viperfish) for the latency of each.


Result Slots — VectorResult0 / VectorResult1

Purpose

The VectorResult slot is the pop side of the push-pop protocols. It drains a finished MXU matmul (PopMxuResult), an EUP transcendental result (PopEupResult), a transpose result (TransposeResult), or — uniquely on Viperfish — a scalar/CRF pop (PopCcrfResult). The slot is a single discriminator-plus-tail layout: a 2-bit sub-type selector (at bit 22) tags which sub-message is present, and a common tail BitCopys the dest vreg.

Encoding — Absolute Bit Map (Viperfish)

Verified from TensorCoreVectorResult0Encoder::Encode (@ 0x1f018f40):

| Field | Abs bit | Width | Notes |
| --- | --- | --- | --- |
| header field (proto + 0x1c) | 24 | 4 | written first, every result-type |
| sub-type selector | 22 | 2 | constant 0/1/2/3 = PopEup / PopMxu / Transpose / PopCcrf |
| result mode (proto + 0x18) | 20 | 2 | per result-type (Mxu/Transpose only) |
| dest vreg | 14 | 6 | common tail |

The four Viperfish sub-messages (proto types under asic_sw::deepsea::vxc::isa), dispatched by the proto + 0x50 opcode and tagged by the 2-bit selector at bit 22:

| Sub-message | proto opcode | bit-22 selector | Role |
| --- | --- | --- | --- |
| TensorCoreVectorResult_PopEupResult | 5 | 0 | drain an EUP/transcendental result |
| TensorCoreVectorResult_PopMxuResult | 6 | 1 | drain a finished MXU matmul accumulator |
| TensorCoreVectorResult_TransposeResult | 7 | 2 | drain a systolic transpose |
| TensorCoreVectorResult_PopCcrfResult | 8 | 3 | scalar/cross-core-register-file pop (vxc-only) |

GOTCHA — the 2-bit selector at bit 22 (values 0/1/2/3) is the field a decoder keys on, and it does not equal the proto + 0x50 opcode (5/6/7/8). The encoder reads the opcode to choose a switch arm, then writes the selector constant. PopEup is reachable from two arms — opcode 5 and opcode 8 (the else of the PopCcrf branch falls through to the PopEup default-instance), so a reimplementation that maps opcode→selector 1:1 will mis-tag those. The dest vreg lands at the common tail bit 14 (w6) regardless of sub-type.
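
A decoder for the VectorResult0 window therefore keys on the bit-22 selector, never on the proto opcode. This is a minimal Python sketch under stated assumptions: decode_result0 and the helper names are hypothetical, the returned dict shape is invented for illustration, and it does not model how an empty slot is recognized; the bit positions and selector values are the ones tabulated above.

```python
def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


# Selector value -> sub-message. Note these are NOT the proto opcodes 5/6/7/8.
SELECTOR_TO_SUBMESSAGE = {
    0: "PopEupResult",
    1: "PopMxuResult",
    2: "TransposeResult",
    3: "PopCcrfResult",
}


def decode_result0(buf: bytes) -> dict:
    return {
        "header": get_bits(buf, 24, 4),
        "sub_message": SELECTOR_TO_SUBMESSAGE[get_bits(buf, 22, 2)],
        "mode": get_bits(buf, 20, 2),
        "dest_vreg": get_bits(buf, 14, 6),   # common tail, any sub-type
    }


# Build a PopMxuResult by hand: selector 1, mode 2, dest vreg 9.
bundle = bytearray(64)
put_bits(bundle, 22, 2, 1)
put_bits(bundle, 20, 2, 2)
put_bits(bundle, 14, 6, 9)
decoded = decode_result0(bundle)
```

Keeping the selector table separate from any opcode table makes the non-1:1 mapping explicit in the code.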

QUIRK — the result sub-message set is generation-specific, not constant: a cross-generation reimplementation must key it on the TpuVersion, not assume Viperfish's four carry over. On Viperfish the set is {PopEup, PopMxu, Transpose, PopCcrf} (the four switch arms above). The Ghostlite and 6acc60406 sets differ (e.g. Ghostlite adds a fused matres+accumulate result for the K>128 multi-pass path) — those deltas are documented on the per-generation bundle pages.


EUP / Transcendental Push — VALU Slot 3

Purpose

The Extended Unary Processor (the transcendental unit) has no dedicated bundle slot. Its push is a VALU slot-3 (Alu3) op, and its pop is the VectorResult PopEupResult. This is the bit-exact realization of the push-pop transcendental model: an Alu3 op pushes a source vreg into the EUP pipeline tagged with a 5-bit function selector, and one or more bundles later a VectorResult op pops the result into a dest vreg. The single-issue EUP means only Alu3 (not Alu0/1/2) sources the push — the transcendental helpers exist only in the Alu3 set.

Encoding — Absolute Bit Map (Viperfish)

Verified from EncodeTensorCoreVectorAlu3EupPush (@ 0x1ef6e400):

| Field | Abs bit | Width | Value |
| --- | --- | --- | --- |
| VALU opcode (EUP-push family) | 197 | 7 | 0x0 |
| EUP-function selector | 186 | 5 | 0x16 (22) for the generic EupPush |
| src vreg | 191 | 6 | proto + 0x18 (present-gated) |

function EncodeTensorCoreVectorAlu3EupPush(out, proto):   // @0x1ef6e400
    BitCopy(out, 197, {0x0}, 0, 7)        // VALU-opcode = EUP-push family
    BitCopy(out, 186, {0x16}, 0, 5)       // generic EUP-push function selector = 22
    if (proto[0x50] == 135 && present):   // proto opcode 135 = EupPush
        BitCopy(out, 191, proto.src, 0, 6)

Push-Pop Protocol

bundle N    :  VALU slot 3 (Alu3)  ── VALU-op=0x0 @197, fn-selector @186, src vreg @191
                 │
                 ▼  (EUP pipeline latency; single-issue XLU hazard)
bundle N+k  :  VectorResult slot   ── PopEupResult (proto opcode 5, selector 0 @22), dest vreg @14

NOTE — Viperfish's named transcendentals (Reciprocal, ReciprocalSqrt, Tanh, ShiftedSigmoid, LogTwo, PowTwo) each have their own Alu3 helper carrying its own 5-bit selector in the same bit-186 field; the 0x16 value above is the generic EupPush whose function is carried elsewhere. The per-function selector value table (Tanh, Reciprocal, Erf, Sin/Cos, …) is documented per-generation on EUP Transcendental Slot. The pop dest vreg lands at the common VectorResult tail (bit 14 w6), so the transcendental result occupies a normal vreg one or more bundles after the push.
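
The push-pop pairing can be sketched as two writes into two different bundles. This is a minimal Python sketch under stated assumptions: the gap k is a caller-supplied parameter because the EUP latency is not given on this page, the intermediate bundles are left all-zero rather than encoded as real empty (kNeverExecute) bundles, and all function names are hypothetical. Bit positions come from the Alu3 and VectorResult tables on this page.

```python
def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


def encode_eup_push(bundle: bytearray, src_vreg: int) -> None:
    put_bits(bundle, 197, 7, 0x0)     # VALU opcode: EUP-push family
    put_bits(bundle, 186, 5, 0x16)    # generic EupPush function selector
    put_bits(bundle, 191, 6, src_vreg)


def encode_pop_eup(bundle: bytearray, dest_vreg: int, header: int = 0) -> None:
    put_bits(bundle, 24, 4, header)   # header field, written for every result type
    put_bits(bundle, 22, 2, 0)        # selector 0 = PopEup
    put_bits(bundle, 14, 6, dest_vreg)


def schedule_transcendental(src_vreg: int, dest_vreg: int, k: int) -> list:
    """Return k + 1 bundles: push in bundle 0, pop in bundle k."""
    if k < 1:
        raise ValueError("the pop must come at least one bundle after the push")
    bundles = [bytearray(64) for _ in range(k + 1)]
    encode_eup_push(bundles[0], src_vreg)
    encode_pop_eup(bundles[k], dest_vreg)
    return bundles


program = schedule_transcendental(src_vreg=7, dest_vreg=12, k=3)
```

Separating the two encoders mirrors the fact that the push and the pop live in different slots of different bundles.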


Sequencer, Immediate, and Predicate Positions

These slots are not MXU-specific but pin the rest of the 512-bit map. They come from the same <Slot>Encoder::Encode + BitCopy mechanism and are documented in full on V5+ EmitX Bit Positions; the Viperfish positions are summarized here for completeness.

Immediate slots — branch/call/sync offset home

TensorCoreImmediatesEncoder::Encode (@ 0x1eebee40) writes six 20-bit immediate slots, each from proto + 0x18 … 0x2c. The branch/call/sync 20-bit signed offset lands in immediate slot 0:

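| imm slot | proto field | VF abs bit | Width |
| --- | --- | --- | --- |
| 0 (branch/call offset) | proto + 0x18 | 430 | 20 |
| 1 | proto + 0x1c | 410 | 20 |
| 2 | proto + 0x20 | 390 | 20 |
| 3 | proto + 0x24 | 370 | 20 |
| 4 | proto + 0x28 | 350 | 20 |
| 5 | proto + 0x2c | 330 | 20 |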
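
The six positions follow a fixed stride of 20 bits downward from bit 430, which a short sketch can capture. This is a minimal Python sketch: the function names are hypothetical, and storing the signed offset as 20-bit two's complement is an assumption (the page says the offset is signed but does not spell out the representation).

```python
def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


IMM_WIDTH = 20
IMM0_BIT = 430


def imm_bit(slot: int) -> int:
    """Absolute bit of immediate slot 0..5 (430, 410, ..., 330)."""
    if not 0 <= slot <= 5:
        raise ValueError("Viperfish has six immediate slots")
    return IMM0_BIT - IMM_WIDTH * slot


def encode_signed_imm(buf: bytearray, slot: int, value: int) -> None:
    if not -(1 << 19) <= value < (1 << 19):
        raise ValueError("offset does not fit in 20 signed bits")
    # Assumption: two's complement within the 20-bit window.
    put_bits(buf, imm_bit(slot), IMM_WIDTH, value & 0xFFFFF)


def decode_signed_imm(buf: bytes, slot: int) -> int:
    raw = get_bits(buf, imm_bit(slot), IMM_WIDTH)
    return raw - (1 << 20) if raw & (1 << 19) else raw


bundle = bytearray(64)
encode_signed_imm(bundle, 0, -3)     # e.g. a backward branch offset in slot 0
```

Range-checking before the write keeps an oversized offset from silently spilling into the neighbouring immediate slot.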

Sequencer slot

TensorCoreScalarAlu0Encoder::Encode (@ 0x1eecb900) writes a common predication header then dispatches on the proto opcode. The control-op layout:

| Field | Abs bit | Width | Notes |
| --- | --- | --- | --- |
| predication reg index | 499 | 4 | proto + 0x20 |
| predication inversion | 503 | 1 | proto + 0x18 (byte) |
| opcode-HIGH / family | 493 | 6 | 0 for branch/call |
| opcode-LOW / discriminator | 488 | 5 | 4/5/6/7 |
| x-target sreg (BranchSreg) | 488 | 5 | sreg x() |
| dest (return-addr) sreg | 477 | 5 | Call dest |

Discriminator: BranchAbsolute=4, BranchRelative=5, CallAbsolute=6, CallRelative=7. The branch offset is a signed 20-bit value in immediate slot 0 (bit 430); there is no dedicated return op — a return is a BranchSreg reading the link sreg.
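
Putting the sequencer fields and immediate slot 0 together, a branch or call can be packed as follows. This is a minimal Python sketch under stated assumptions: encode_control_op and its parameters are hypothetical, the offset is stored as 20-bit two's complement (an assumption), and writing the dest sreg only for calls is inferred from the "Call dest" note in the table; positions and discriminator values are the ones listed above.

```python
def put_bits(buf: bytearray, bit: int, width: int, value: int) -> None:
    """Write `width` bits of `value` at absolute bit `bit` (bit 0 = LSB of byte 0)."""
    mask = ((1 << width) - 1) << bit
    word = int.from_bytes(buf, "little")
    buf[:] = ((word & ~mask) | ((value << bit) & mask)).to_bytes(len(buf), "little")


def get_bits(buf: bytes, bit: int, width: int) -> int:
    return (int.from_bytes(buf, "little") >> bit) & ((1 << width) - 1)


DISCRIMINATOR = {
    "BranchAbsolute": 4,
    "BranchRelative": 5,
    "CallAbsolute": 6,
    "CallRelative": 7,
}


def encode_control_op(buf: bytearray, kind: str, offset: int,
                      pred_reg: int, pred_invert: bool, dest_sreg: int = 0) -> None:
    put_bits(buf, 499, 4, pred_reg)              # predication reg index
    put_bits(buf, 503, 1, int(pred_invert))      # predication inversion
    put_bits(buf, 493, 6, 0)                     # opcode-HIGH: 0 for branch/call
    put_bits(buf, 488, 5, DISCRIMINATOR[kind])   # opcode-LOW
    put_bits(buf, 430, 20, offset & 0xFFFFF)     # immediate slot 0 (two's complement assumed)
    if kind.startswith("Call"):
        put_bits(buf, 477, 5, dest_sreg)         # return-address sreg


bundle = bytearray(64)
encode_control_op(bundle, "CallRelative", offset=-16, pred_reg=3,
                  pred_invert=True, dest_sreg=9)
```

No delay-slot count appears anywhere in the call, which matches the description of delay slots as padding bundles rather than an encoded field.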

GOTCHA — there is no in-bundle delay-slot field on Viperfish. The branch/call helpers write only {opcode-low, offset (imm0), dest}; the delay-slot count is a bundle-packer pad-count (empty kNeverExecute bundles appended after the branch), not an encoded bit field. A reimplementation that looks for a 3-bit delay_slots field in the bundle will not find one.

Predicate slot

Viperfish uses a per-slot 4+1 predicate field (4-bit reg index + 1-bit inversion) at the top of each scalar slot — for TensorCoreScalarAlu0 the reg is @ bit 499 (w4) and inversion @ bit 503 (w1). Predication is a template type parameter of the vxc TC codec, not a standalone slot encoder. This is the encoding the later 6acc60406 (TPU7x) generation replaces with a dedicated TensorCorePredicatesEncoder slot; on Viperfish the predicate index is local to each populated slot. Empty slots carry the kNeverExecute predicate written into the bundle header (see Bundle Model).


How Viperfish Differs from the JF/PF Bundle Model

| Axis | Jellyfish (41 B) / Pufferfish (51 B) | Viperfish (64 B) |
| --- | --- | --- |
| Encode entry | one Encoder<gen>::EncodeBundleInternal packing all slots | TpuCodecViperfish::EncodeBundle → EncoderVfTensorCore::EncodeBundle → per-slot <Slot>Encoder::Encode |
| Field write | JF direct and/shl/or; PF BitCopy | every field via BitCopy; no direct-pack |
| Layout source | per-slot Encode methods on the Bundle | isa_emitter::EmitX proto templates + slot encoders |
| InstBits table | (n/a) | InstBits is all-zero on disk; positions live in the encoders |
| MXU slots | JF 1 VectorExtended; PF 2 (VE0/1) | 2 (VectorExtended0/1), 94 helpers each, 0 Extended2/3 |
| MXU operand model | JF single moving-operand vreg | 8×6-bit systolic-feed pool, shared between the two MXU slots |
| Latch encoding | JF 6-entry GainLatchMode→VEOpcode table | named LoadMatrixRegister{Gmr,Lmr}{Msra,Msrb} opcode family (opcode-HIGH 0x37) |
| EUP transcendental | issued from the VALU slot, popped by result slot | pinned to VALU slot 3 (Alu3); 5-bit fn-selector @ bit 186 |
| Bundle width source | inline constant in the encoder | ViperfishCodecMetadata::BundleSizeBytes → 64 (codec-metadata cell) |

The clean way to state the break: Pufferfish's EncodeBundleInternal is the kIsaTable for its slots, baked into one function. Viperfish has no such function — its kIsaTable is the set of per-slot BitCopy offsets, distributed across ~200 per-op helpers, orchestrated by TpuCodecViperfish::EncodeBundle → EncoderVfTensorCore::EncodeBundle. The InstBits table that LLVM-MC would normally hold the fixed instruction bits in is entirely zero on disk for all V5+ generations, which is the binary's confirmation that the bits come from the emitter path, not from a static table.


| Name | Relationship |
| --- | --- |
| Pufferfish 51B Bundle | The immediate predecessor: monolithic EncodeBundleInternal, 2 MXU slots, no shared-pool / VALU3-EUP model |
| Ghostlite / 6acc60406 | The other two 64-byte VXC-family generations (external "TPU v6 lite" / "TPU7x"); same slot grid, byte-shifted opcode/predicate regions, differing result sub-message sets |
| Bundle Model | The cross-generation width-dispatch and slot-taxonomy reference |
| MXU Slot | The cross-generation matmul/latch opcode story this page specializes for VF |

Cross-References

  • Pufferfish 51B Bundle — the predecessor; the EncodeBundleInternal model Viperfish replaces
  • Ghostlite Bundle — sibling 64-byte gen; PopAddMxu01Result, byte-shifted opcode region
  • Bundle Model — bundle widths, codec-metadata dispatch, kNeverExecute empty-slot convention
  • V5+ EmitX Bit Positions — the isa_emitter::EmitX → BitCopy chain and the full sequencer/immediate/predicate positions
  • Sparsity Slot — the structured-sparsity slot (open; absent from the Viperfish TC bundle)
  • MXU Slot — cross-generation matmul/latch/push opcode families
  • EUP Transcendental Slot — the per-function EUP selector value table and pop semantics
  • IsaEmitter Registry — the (TpuVersion, SequencerType) codec cell census that selects the Viperfish encoders
  • MXU Latency (Viperfish) — per-op MXU cost for the Viperfish matmul/latch families