Pufferfish 51-Byte Bundle
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.
Abstract
The Pufferfish TensorCore VLIW bundle is a 51-byte (408-bit) issue word — the TPU v4 successor to the 41-byte Jellyfish bundle, and the architectural bridge between Jellyfish's direct-pack encoder and the V6+ GXC codec. It is best read as a delta against Jellyfish: same kNeverExecute empty-slot convention, same predicate semantics, same shared operand pool, but four new datapath capabilities (a second MXU, a widened VALU lane, a first-class constant-memory load slot, and a second result-drain lane) and a fundamentally different encoding mechanism.
Where Jellyfish builds the bundle with shl/and/or arithmetic on the qwords of a 53-byte scratch struct and then strips a 12-byte header, Pufferfish has no scratch struct and no header strip. EncoderPfTensorCore::EncodeBundleInternal (0x1e8c5c40) is a thin wrapper: it asks the codec for its byte size (51), allocates a 51-byte buffer with operator new, zeroes it with memset, and hands the buffer to the codec's Encode method. The codec, asic_sw::deepsea::pxc::isa::TensorCoreCodecBase<TensorCoreBundle, …, Predication> (Encode @ 0x1d224300), walks the bundle proto slot-by-slot and, for each slot (substituting the slot's default instance when its has-bit is clear), calls a per-slot TensorCore{Slot}Encoder::Encode that writes each field into the buffer with one call to the shared bit-packing primitive BitCopy(void* dst, int dst_bit, const void* src, int src_bit, int nbits) (0x1fa0a900). The absolute bundle bit of a field is the literal dst_bit argument to its BitCopy call; there is no shift arithmetic to invert and no header offset to subtract. Output byte N is buffer byte N directly.
This page documents the complete 51-byte slot map at absolute-bit precision, the twelve-slot has-bit dispatch keyed on proto+0x10, the per-slot BitCopy field arithmetic, the five shared Load/Store/Cmem operand sub-encoders that all three memory slots inline against one shared addressing submessage and operand pool, the kNeverExecute=31 prefill, and the byte-exact Jellyfish→Pufferfish delta. For reimplementation, the contract is:
- The direct-BitCopy model: a 51-byte buffer, memset to zero, written field-by-field via BitCopy(buf, abs_bit, &field, 0, width); output byte N == buffer byte N, with no 12-byte strip and no scratch struct.
- The twelve-slot has-bit dispatch: a 12-bit slot has-mask at TensorCoreBundle proto+0x10, per-slot submessage pointers at proto+0x18..+0x70, and the codec's test [proto+0x10],bit / cmove default substitution of a _globals_ default on a clear bit.
- The absolute bit base + width of every slot: the low dedicated region (bits 17..240), the shared operand pool (241..353), and the two scalar slots at the top (354..407).
- The five shared memory operand sub-fields: base-address (2b), offset (2b), stride (3b), the 3-entry Y-register selector SlotMap (5b @241/246/251), and the 6-entry immediate SlotMap (16b @256/272/288/304/320/338), bound at proto-build time by VectorYVEncode / ScalarYEncode / SetImmOrDie.
- The kNeverExecute=31 prefill that FillDefaultBundle (0x1d222ee0) stamps into every slot's predicate field, and the four-axis JF→PF delta (2nd MXU, wide VALU0, cmem_load, 2nd result).
| Encode wrapper | EncoderPfTensorCore::EncodeBundleInternal(TensorCoreBundle const&) @ 0x1e8c5c40 |
| Codec engine | TensorCoreCodecBase<…, Predication>::Encode @ 0x1d224300 (12-slot dispatch) |
| Bit primitive | BitCopy(void*, int dst_bit, void const*, int src_bit, int nbits) @ 0x1fa0a900 — dst_bit == absolute bundle bit |
| Wire width | 51 bytes / 408 bits (EncoderPfTensorCore::BundleSizeBytes @ 0x1d227740 returns 0x33) |
| Bit numbering | LSB-first — bit 0 is the LSB of byte 0; BitCopy's dst_bit is an LSB-first absolute index (matches Bundle Model) |
| Buffer model | operator new(51) + memset(buf,0,51); output byte N == buffer byte N (no header strip) |
| Slot has-mask | 12-bit word at TensorCoreBundle proto +0x10; one bit per slot |
| Dispatchable slots | 12 (scalar ×2, vector-ALU ×2, vstore, vload, cmem_load, vextended/MXU ×2, vresult ×2, misc) |
| Shared operand pool | 3× 5-bit Y-register selectors @241/246/251 + 6× 16-bit immediates @256/272/288/304/320/338 |
| Empty-slot mark | predicate 0x1f (31 = kNeverExecute) stamped by FillDefaultBundle @ 0x1d222ee0 |
| MC cross-check | TPUMCCodeEmitter::encodeInstruction @ 0x13c74885 loads 0x198 = 408 bits = 51 bytes |
| TpuVersion | 2 = Pufferfish = TPU v4 (second width source: PufferfishCodecMetadata::BundleSizeBytes @ 0x1ecf7ac0 returns 51) |
| Confidence | CONFIRMED (byte-anchored) unless a row says otherwise |
The Direct-BitCopy Model and the Absolute-Bit Law
Jellyfish encodes a bundle as read-modify-write arithmetic on a scratch struct, then copies a 41-byte window out of it. Pufferfish does neither. EncodeBundleInternal is a five-step wrapper around the codec, and the disassembly is the proof:
// EncoderPfTensorCore::EncodeBundleInternal(TensorCoreBundle const&) @ 0x1e8c5c40
codec = this->codec; // r13 = encoder; codec ptr from this
size = codec->vtable[0x30](); // call [rax+0x30] -> 51 (BundleSizeBytes)
buf = operator new(size); // _Znwm(51) @ 0x1e8c5c6d
memset(buf, 0, size); // zero the whole buffer @ 0x1e8c5c7d
codec->vtable[0x18](buf, bundle, size); // call [rax+0x18] -> TensorCoreCodecBase::Encode
result.data = buf; result.size = 51; // StatusOr<vector<uint8>> of size 51
There is no +0x0C advance and no overlapping vmovups copy-out — the buffer the codec writes is the wire bundle. Every field lands at its absolute bundle bit directly, because the codec writes it there:
// TensorCore{Slot}Encoder::Encode field write (every slot, every field)
BitCopy(/*dst=*/buf, /*dst_bit=*/ABS_BIT, /*src=*/&proto_field, /*src_bit=*/0, /*nbits=*/WIDTH);
BitCopy (0x1fa0a900, mangled _Z7BitCopyPviPKvii) takes the destination buffer in rdi, the absolute destination bit in esi, the source field pointer in rdx, the source start bit in ecx (always 0 for a fresh field), and the width in r8d. The bit index is LSB-first: bit 0 is the least-significant bit of buffer byte 0, bit 8 the LSB of byte 1, and so on — the same convention used everywhere in the encode path (see Bundle Model §bit-numbering). There is no MSB-first ordering and nothing to invert.
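The LSB-first convention is easy to get backwards, so a minimal Python sketch of it follows. It models only the argument semantics described above; the names bit_copy and encode_field, the bytearray buffer, and the 32-bit little-endian view of a proto field are illustrative assumptions, not the library's code.

```python
def bit_copy(dst: bytearray, dst_bit: int, src: bytes, src_bit: int, nbits: int) -> None:
    """Copy nbits from src (starting at src_bit) into dst (starting at dst_bit).

    Bit indices are LSB-first: bit 0 is the least-significant bit of byte 0.
    """
    for i in range(nbits):
        s = src_bit + i
        d = dst_bit + i
        bit = (src[s // 8] >> (s % 8)) & 1
        cleared = dst[d // 8] & ~(1 << (d % 8)) & 0xFF   # clear the target bit
        dst[d // 8] = cleared | (bit << (d % 8))         # then set it from src


def encode_field(buf: bytearray, abs_bit: int, value: int, width: int) -> None:
    """Write an integer field at an absolute bundle bit (hypothetical helper)."""
    src = value.to_bytes(4, "little")  # assumption: field modelled as a 32-bit LE word
    bit_copy(buf, abs_bit, src, 0, width)


buf = bytearray(51)               # zeroed 51-byte bundle buffer
encode_field(buf, 403, 0x1F, 5)   # scalar_0 predicate: abs 403, width 5
# Bits 403..407 live in byte 50 (bits 400..407), so buf[50] == 0xF8.
```

Because the destination index is absolute, a field that straddles a byte boundary needs no special case; the per-bit loop is slow but makes the numbering unambiguous.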
NOTE: this is the single fact that makes the whole bundle decodable. To recover any field's position, disassemble its per-slot encoder and read the mov esi,0xNN (absolute bit) and mov r8d,0xNN (width) immediates that precede each call 0x1fa0a900. The slot map below is the harvest of every such pair across the twelve per-slot encoders. The encode side is authoritative; the decode-side TensorCore{Slot}Decoder::Decode functions read the same bits as the inverse confirmation (see Decode-Side: JF / PF).
The 408-bit width is cross-confirmed from an independent code path: the LLVM MC layer's TPUMCCodeEmitter::encodeInstruction loads mov r15d,0x198 (0x198 = 408) at 0x13c74885 — a separate emitter from the proto codec, agreeing bit-for-bit. The 51 bytes partition into seven qwords by absolute bit:
qword0 = abs 0 .. 63 output bytes 0x00..0x07
qword1 = abs 64 .. 127 output bytes 0x08..0x0F
qword2 = abs 128 .. 191 output bytes 0x10..0x17
qword3 = abs 192 .. 255 output bytes 0x18..0x1F
qword4 = abs 256 .. 319 output bytes 0x20..0x27
qword5 = abs 320 .. 383 output bytes 0x28..0x2F
qword6 = abs 384 .. 407 output bytes 0x30..0x32 (partial, 3 bytes)
Field tops reach exactly bit 407 — the scalar_0 predicate at abs 403 plus its 5-bit width — so all 408 bits are accounted for at the top end.
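The qword partition above is plain integer division. A small sketch, assuming nothing beyond the 408-bit width and LSB-first numbering (the helper name locate is hypothetical):

```python
BUNDLE_BITS = 408  # 51 bytes


def locate(abs_bit: int) -> tuple[int, int, int]:
    """Return (qword index, output byte index, bit within byte) for an absolute bit."""
    if not 0 <= abs_bit < BUNDLE_BITS:
        raise ValueError("bit outside the 408-bit bundle")
    return abs_bit // 64, abs_bit // 8, abs_bit % 8


# scalar_0 predicate base (abs 403) falls in the partial qword6, output byte 0x32.
top = locate(403)   # (6, 50, 3)
pool = locate(241)  # first Y-register selector: (3, 30, 1)
```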
The Twelve-Slot Has-Bit Dispatch
The compiler-side TensorCoreBundle proto carries a 12-bit slot has-mask in the word at proto+0x10 and a typed submessage pointer per slot at proto+0x18..+0x70. The codec Encode (0x1d224300) tests each has-bit in turn; on a set bit it encodes the present submessage, on a clear bit it substitutes that slot's _globals_ default instance (a cmove to the default pointer). The dispatch order and the bit→slot map are byte-exact from the test BYTE/DWORD PTR [r12+0x10],<bit>; cmove rax,<default> ladder:
// TensorCoreCodecBase<...>::Encode dispatch @ 0x1d224300 (decompiled)
mask = *(uint16_t*)(proto + 0x10); // 12-bit slot has-mask @ proto+0x10
for each slot in dispatch order: // scalar_0 (0x1) .. misc (0x800)
sub = (mask & slot.bit) ? *(proto + slot.proto_off) // present submessage ptr
: slot.globals_default; // _globals_ default
if (sub == nullptr) sub = slot.globals_default; // null-ptr also falls back to default
slot.encoder->Encode(sub, buf); // per-slot BitCopy writes
The twelve test [proto+0x10],N immediates appear in strict ascending order — 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800 — confirming a dense 12-bit mask with no gaps. Each slot's encoder writes into a disjoint dedicated bit region, so all twelve slots can co-exist in one bundle; operand-pool conflicts (more register/immediate operands than the 3 Y-reg / 6 imm slots can hold) are resolved earlier, at proto-build time, by the SlotMap packer (see The Five Shared Operand Sub-Encoders).
QUIRK: scalar_1 is encoded conditionally on scalar_0's opcode, not purely on its own has-bit. The dispatch tests the scalar_1 has-bit (0x2) as expected, but the codec gates the entire scalar_1 encode behind the unsigned compare (scalar0.opcode_field − 17) >= 3 (the codec reads scalar_0's opcode oneof at submessage dword +0x50 and skips scalar_1 when it falls in {17,18,19}). Those three scalar_0 opcodes are the wide forms that consume the scalar_1 bit window themselves; emitting a second scalar op alongside them would overwrite those bits. Every other slot is gated purely on its own has-bit. A reimplementer must reproduce this scalar_0→scalar_1 interlock, not treat the two SPU slots as fully independent.
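The has-bit dispatch and the scalar_0→scalar_1 interlock can be sketched as a small planner. This is a minimal Python sketch, not the codec: plan_dispatch, SLOTS, and the 'present' / 'default' / 'skipped' labels are hypothetical names, and the opcode set {17, 18, 19} is taken from the gate described above.

```python
# (has-bit, slot name) in the order of the test-immediate ladder.
SLOTS = [
    (0x001, "scalar_0"), (0x002, "scalar_1"),
    (0x004, "vector_alu_0"), (0x008, "vector_alu_1"),
    (0x010, "vector_store"), (0x020, "vector_load"),
    (0x040, "cmem_load"), (0x080, "vector_extended_0"),
    (0x100, "vector_extended_1"), (0x200, "vector_result_0"),
    (0x400, "vector_result_1"), (0x800, "misc"),
]
WIDE_SCALAR0_OPCODES = {17, 18, 19}  # scalar_0 forms that consume the scalar_1 window


def plan_dispatch(has_mask: int, scalar0_opcode: int) -> list:
    """Return [(slot, source)], source being 'present', 'default', or 'skipped'."""
    plan = []
    for bit, name in SLOTS:
        if name == "scalar_1" and scalar0_opcode in WIDE_SCALAR0_OPCODES:
            plan.append((name, "skipped"))  # scalar_0 owns these bits
            continue
        # A clear has-bit substitutes the slot's default instance.
        plan.append((name, "present" if has_mask & bit else "default"))
    return plan
```

For example, plan_dispatch(0x003, 18) marks scalar_1 as skipped even though its has-bit is set, which is the interlock a reimplementation has to preserve.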
| Has-bit | proto+ | _globals_ default @ | Slot (role) | Per-slot encoder @ | Abs bits (region) |
|---|---|---|---|---|---|
| 0x001 | +0x18 | 0x223fd378 | scalar_0 (SPU) | 0x1ed16dc0 | 381..407 (27b) |
| 0x002 | +0x20 | 0x223fef38 | scalar_1 (SPU) | 0x1ed2a0e0 | 354..380 (27b) |
| 0x004 | +0x28 | 0x22400f90 | vector_alu_0 (VPU, wide lane) | 0x1ed45060 | 198..240 (43b) |
| 0x008 | +0x30 | 0x224036b8 | vector_alu_1 (VPU, narrow lane) | 0x1ed68d80 | 167..197 (31b) |
| 0x010 | +0x38 | 0x22411eb0 | vector_store (mem-store) | 0x1ee3b440 | 142..166 (25b) |
| 0x020 | +0x40 | 0x22410b88 | vector_load (mem-load) | 0x1ee287e0 | 119..140 (22b) |
| 0x040 | +0x48 | 0x223fb410 | cmem_load (const-mem load, NEW) | 0x1ecf89a0 | 103..118 (16b) |
| 0x080 | +0x50 | 0x22407210 | vector_extended_0 / MXU0 | 0x1edb0900 | 83..102 (20b) |
| 0x100 | +0x58 | 0x2240cf50 | vector_extended_1 / MXU1 (−20 twin) | 0x1ee08060 | 63..82 (20b) |
| 0x200 | +0x60 | 0x22410fd8 | vector_result_0 (EUP-pop lane 0) | 0x1ee2b1c0 | 52..62 (11b) |
| 0x400 | +0x68 | 0x22411428 | vector_result_1 (EUP-pop lane 1) | 0x1ee2d180 | 41..51 (11b) |
| 0x800 | +0x70 | 0x223fbce8 | misc (mask / rotate / imm-set) | 0x1ed03500 | 17..40 (24b) |
QUIRK: the proto-pointer order and the on-wire bit order are reversed. The has-bit / proto+ order runs scalar_0 → misc (bit 0x1 to 0x800), but the on-wire bit layout runs the other way: misc sits at the bottom of the bundle (abs 17) and scalar_0 at the very top (abs 381..407). A reimplementer who lays slots into the buffer in has-bit order, low-to-high, inverts the entire bundle. The has-bit is a proto-occupancy index; the absolute bit comes only from the BitCopy dst_bit, never from the slot's ordinal.
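The reversal can be checked mechanically from the slot table. The sketch below copies the has-bits and bit ranges from that table; SLOT_MAP, wire_order, and has_bit_order are hypothetical names used only for illustration.

```python
# (has-bit, slot, first abs bit, last abs bit), copied from the slot table.
SLOT_MAP = [
    (0x001, "scalar_0", 381, 407), (0x002, "scalar_1", 354, 380),
    (0x004, "vector_alu_0", 198, 240), (0x008, "vector_alu_1", 167, 197),
    (0x010, "vector_store", 142, 166), (0x020, "vector_load", 119, 140),
    (0x040, "cmem_load", 103, 118), (0x080, "vector_extended_0", 83, 102),
    (0x100, "vector_extended_1", 63, 82), (0x200, "vector_result_0", 52, 62),
    (0x400, "vector_result_1", 41, 51), (0x800, "misc", 17, 40),
]


def has_bit_order() -> list:
    """Slot names sorted by has-bit (the proto / dispatch order)."""
    return [name for _, name, _, _ in sorted(SLOT_MAP, key=lambda s: s[0])]


def wire_order() -> list:
    """Slot names sorted by first absolute bit (the on-wire order)."""
    return [name for _, name, _, _ in sorted(SLOT_MAP, key=lambda s: s[2])]


def widths() -> dict:
    """Region width in bits per slot, derived from the inclusive ranges."""
    return {name: hi - lo + 1 for _, name, lo, hi in SLOT_MAP}
```

Comparing wire_order() with has_bit_order() reversed gives the same list, so a layout keyed on the slot ordinal would place every slot at the wrong end of the bundle.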
The Three On-Wire Regions
The 408-bit bundle is not twelve contiguous slot fields. It is three regions above a 17-bit prefix that no per-slot encoder writes:
abs 0 .. 16 header / reserved (not written by any per-slot encoder; buffer stays 0)
abs 17 .. 240 LOW dedicated region: one disjoint sub-range per non-scalar slot
abs 241 .. 353 SHARED operand pool: 3 Y-reg selectors + 6 immediate words
abs 354 .. 407 the two scalar slots (SPU), packed at the TOP
The low dedicated region packs the ten non-scalar slots disjointly, low-to-high:
abs 17 .. 40 (24b) misc
abs 41 .. 51 (11b) vector_result_1
abs 52 .. 62 (11b) vector_result_0
abs 63 .. 82 (20b) vector_extended_1 / MXU1
abs 83 .. 102 (20b) vector_extended_0 / MXU0
abs 103 .. 118 (16b) cmem_load (NEW in Pufferfish)
abs 119 .. 140 (22b) vector_load
abs 142 .. 166 (25b) vector_store
abs 167 .. 197 (31b) vector_alu_1 (narrow lane)
abs 198 .. 240 (43b) vector_alu_0 (WIDE lane, +12b)
Because every dedicated sub-range is disjoint, all twelve slots can be populated in a single bundle. What is not disjoint is the shared operand pool: the three memory slots and both VALU lanes all draw their register and immediate operands from the same 3 Y-register selectors (abs 241/246/251) and 6 immediate words (abs 256/272/288/304/320/338). That sharing is what bounds co-issuability — a bundle can name at most three distinct Y-registers and six distinct immediates across all of its present slots combined.
NOTE: three small bit gaps are never written by any of the twelve per-slot encoders: the leading 17 bits (abs 0..16), bit 141 (between vector_load and vector_store), and bits 336..337 (inside the immediate-pool region, between imm slot 4 and imm slot 5). The buffer is memset to zero, so they ship as zero. Whether abs 0..16 carries a bundle-level sequencer/loop prefix (as the SparseCore sequencer bundle does) or is pure alignment was not isolated; no BitCopy in any of the twelve encoders targets it. Marked LOW; see What Is Not Yet Pinned.
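The three gaps fall out of the region map by set subtraction. A minimal sketch, using only the ranges listed on this page (WRITTEN and unwritten_bits are hypothetical names):

```python
WRITTEN = [  # inclusive (first, last) absolute-bit ranges
    (17, 40), (41, 51), (52, 62), (63, 82), (83, 102), (103, 118),
    (119, 140), (142, 166), (167, 197), (198, 240),   # dedicated regions
    (241, 245), (246, 250), (251, 255),               # 3 Y-register selectors
    (256, 271), (272, 287), (288, 303),
    (304, 319), (320, 335), (338, 353),               # 6 immediate words
    (354, 380), (381, 407),                           # scalar_1, scalar_0
]


def unwritten_bits(total_bits: int = 408) -> list:
    """Return the absolute bits that no listed region covers."""
    covered = set()
    for lo, hi in WRITTEN:
        span = set(range(lo, hi + 1))
        if covered & span:
            raise ValueError(f"regions overlap at {lo}..{hi}")
        covered |= span
    return sorted(set(range(total_bits)) - covered)


gaps = unwritten_bits()  # abs 0..16, 141, 336, 337
```

The overlap check doubles as a sanity test that the dedicated regions, the pool, and the scalar slots are pairwise disjoint.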
Per-Slot Field Arithmetic
Each per-slot encoder writes its fields with BitCopy(buf, abs_bit, &proto_field, 0, width). The positions below are the esi/r8d immediates harvested directly from the call 0x1fa0a900 sites in each encoder; they are absolute bundle bits, no transform applied.
Scalar slots (abs 354..407) — 0x1ed16dc0 (scalar_0), 0x1ed2a0e0 (scalar_1)
The two scalar slots are identical in layout, scalar_0 sitting exactly 27 bits above scalar_1. Pufferfish replaces Jellyfish's single ScalarInstruction proto with roughly fifty per-opcode oneof submessages (TensorCoreScalar0_ScalarMove, _ScalarIntAdd, _ScalarDmaSimple, _ScalarHalt, …); the opcode field is written from two of those oneof branches but always lands at the same base, and the scalar's Y/immediate operand is bound through ScalarYEncode into the shared immediate pool, not written inline into the scalar region.
| Field | scalar_0 abs | scalar_1 abs | Width |
|---|---|---|---|
| operand (Scalar Y / operand) | 381 | 354 | 11 |
| X reg | 386 | 359 | 6 |
| opcode | 397 | 370 | 6 |
| predicate | 403 | 376 | 5 |
The scalar_0 harvest reads literally: BitCopy(buf,0x193,…,5) (pred @403), BitCopy(buf,0x18d,…,6) (opcode @397), BitCopy(buf,0x182,…,6) (X @386), BitCopy(buf,0x17d,…,11) (operand @381), plus four 16-bit immediate writes into the shared pool (abs 288/304/320/338).
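Since the two scalar slots share one layout, scalar_1's positions can be derived from the scalar_0 harvest by subtracting the 27-bit slot stride. The sketch below only restates the table; the dictionary and function names are hypothetical.

```python
# (absolute bit, width) pairs as harvested from the scalar_0 encoder.
SCALAR0_FIELDS = {
    "predicate": (0x193, 5),
    "opcode": (0x18D, 6),
    "x_reg": (0x182, 6),
    "operand": (0x17D, 11),
}
SCALAR_SLOT_STRIDE = 27  # scalar_0 sits 27 bits above scalar_1


def scalar1_fields() -> dict:
    """Derive scalar_1 field positions from the scalar_0 harvest."""
    return {
        name: (bit - SCALAR_SLOT_STRIDE, width)
        for name, (bit, width) in SCALAR0_FIELDS.items()
    }
```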
Vector-ALU lanes — 0x1ed45060 (lane 0, wide), 0x1ed68d80 (lane 1, narrow)
Lane 0 is the wide lane (43 bits, abs 198..240); lane 1 is the narrow lane (31 bits, abs 167..197). Both carry a 6-bit opcode matching the 56-value VectorAluOpcode enum (see VPU Slot).
| Field | lane0 (wide) abs | lane1 (narrow) abs | Width |
|---|---|---|---|
| (lane0 first operand) | 198 | — | 5 |
| dest | 203 | 167 | 5 |
| (wide-only field region) | 208..219 | — | 12 |
| Vx | 220 | 177 | 5 |
| y | 225 | 172 | 5 |
| x2 | — | 182 | 5 |
| opcode | 230 | 187 | 6 |
| predicate | 236 | 193 | 5 |
Both lanes also write their register operands into the shared Y-register selectors (abs 241/246/251) and immediates into the shared pool.
QUIRK: the wide VALU0 lane has a 12-bit field region (abs 208..219) that VALU1 lacks, and no primary opcode branch writes it. The 12-bit extent is CONFIRMED (it is the exact width difference between the 43-bit VALU0 region and the 31-bit VALU1 region), but no mov esi,0x.. in the harvested VALU0 BitCopy calls targets bits 208..219; they are written by a per-opcode oneof branch (the wide-only ops, TensorCoreVectorAlu0_*), not the common path. Its role (a second vector source, a wide-immediate selector, or a vmask/select operand) is inferred from the lane-width delta and marked LOW. A reimplementer must reserve the 12 bits but should treat their meaning as unresolved.
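The 12-bit extent can be recovered from the field table alone: subtract the common-path fields from each lane's region and see what is left. A minimal sketch with hypothetical names (LANE0, LANE1, unlisted_bits), using the bases and widths tabulated above:

```python
# name -> (absolute bit, width), common-path fields only.
LANE0 = {"operand0": (198, 5), "dest": (203, 5), "vx": (220, 5),
         "y": (225, 5), "opcode": (230, 6), "predicate": (236, 5)}
LANE1 = {"dest": (167, 5), "y": (172, 5), "vx": (177, 5),
         "x2": (182, 5), "opcode": (187, 6), "predicate": (193, 5)}


def unlisted_bits(fields: dict, first: int, last: int) -> list:
    """Bits inside the inclusive region first..last that no listed field covers."""
    used = set()
    for base, width in fields.values():
        used.update(range(base, base + width))
    return sorted(set(range(first, last + 1)) - used)


wide_only = unlisted_bits(LANE0, 198, 240)   # abs 208..219
narrow_rest = unlisted_bits(LANE1, 167, 197)  # nothing left over
```

The narrow lane is fully covered by its six fields, while the wide lane leaves exactly the 12 bits that a reimplementation has to reserve.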
Vector-Extended / MXU slots — 0x1edb0900 (MXU0), 0x1ee08060 (MXU1)
Pufferfish doubles Jellyfish's single VectorExtended slot into two independent MXU control slots. MXU1 (abs 63..82) is a bit-for-bit twin of MXU0 (abs 83..102), offset exactly −20 bits.
| Field | MXU0 abs | MXU1 abs | Width |
|---|---|---|---|
| sub-op | 83 | 63 | 3 |
| mode / mxu-num | 89 | 69 | 2 |
| opcode | 91 | 71 | 7 |
| predicate | 98 | 78 | 5 |
The MXU0 harvest reads BitCopy(buf,0x62,…,5) (pred @98), BitCopy(buf,0x5b,…,7) (opcode @91), BitCopy(buf,0x59,…,2) (mode @89). The 7-bit opcode field selects the matmul / PushGains / transpose roster; for matmul the field widens to 9 bits (abs 89..97) so the low two bits @89..90 carry the physical MXU number (0..3). The full opcode roster and the −20 twin are documented on the MXU Slot page and the Decode-Side: JF / PF page.
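The relationship between the 7-bit opcode, the 2-bit MXU number, and the −20 twin can be sketched with the bundle held as a Python integer (bit 0 of the integer is bit 0 of byte 0, which matches the LSB-first numbering). The function name and the opcode value used in the usage lines are placeholders, not entries from the real roster.

```python
MXU_BASE = {0: 83, 1: 63}  # control slot -> first absolute bit (MXU1 = MXU0 - 20)


def pack_mxu_matmul(bundle: int, slot: int, opcode7: int,
                    mxu_num: int, predicate: int) -> int:
    """OR a matmul control word into a bundle whose MXU region is still zero."""
    if not (0 <= opcode7 < 128 and 0 <= mxu_num < 4 and 0 <= predicate < 32):
        raise ValueError("field value out of range")
    base = MXU_BASE[slot]
    bundle |= mxu_num << (base + 6)     # abs 89..90 on MXU0: physical MXU number
    bundle |= opcode7 << (base + 8)     # abs 91..97 on MXU0: 7-bit opcode
    bundle |= predicate << (base + 15)  # abs 98..102 on MXU0: predicate
    return bundle


word0 = pack_mxu_matmul(0, 0, 0x55, 2, 15)  # 0x55 is a placeholder opcode
word1 = pack_mxu_matmul(0, 1, 0x55, 2, 15)
raw = word0.to_bytes(51, "little")          # 51-byte wire form
```

Read as one 9-bit field at abs 89, the MXU0 word holds (opcode7 << 2) | mxu_num, and the MXU1 word is the same pattern 20 bits lower.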
Vector-Result slots — 0x1ee2b1c0 (result_0), 0x1ee2d180 (result_1)
Two EUP-result drain lanes, one per VALU lane, each 11 bits. result_0 (abs 52..62) sits exactly 11 bits above result_1 (abs 41..51).
| Field | result_0 abs | result_1 abs | Width |
|---|---|---|---|
| which-destination | 52 | 41 | 2 |
| (mode block) | 54 | 43 | 2 |
| valid | 54 | 43 | 1 |
| result-format | 56 | 45 | 2 |
| predicate | 58 | 47 | 5 |
The which-destination field routes the deferred transcendental/EUP result to a dest VREG drawn from the shared register references (V0_DEST / V1_DEST / VLD_DEST routing); see ResultFifo and EUP / Transcendental Slot.
cmem_load slot (abs 103..118) — 0x1ecf89a0
The first-class constant-memory load slot, new in Pufferfish. Its addressing operand is read from the addressing submessage at proto+0x48 (the same structure vector_load uses) and written with the shared memory operand shapes.
| Field | abs bit | Width |
|---|---|---|
| sublane-mask | 103 | 3 |
| base-address | 106 | 2 |
| offset | 108 | 2 |
| stride | 110 | 3 |
| has-bit | 113 | 1 |
| predicate | 114 | 5 |
The harvest reads BitCopy(buf,0x72,…,5) (pred @114), BitCopy(buf,0x71,…,1) (has @113), BitCopy(buf,0x6e,…,3) (stride @110), BitCopy(buf,0x6c,…,2) (offset @108), BitCopy(buf,0x6a,…,2) (base @106), BitCopy(buf,0x67,…,3) (sublane-mask @103). The field names match the cmem_load Slot page byte-for-byte. The index/destination registers come from the shared Y-register selectors (abs 241/246/251) and up to four 16-bit immediates (abs 256/272/288/304) from the shared pool. Full coverage of the addressing modes and the constant-memory path is on the cmem_load Slot page.
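As a worked example of the six harvested writes, the sketch below packs and unpacks the dedicated cmem_load region only (the shared-pool operands are left out). The function names and the field values in the usage lines are illustrative assumptions.

```python
CMEM_FIELDS = [  # (name, absolute bit, width)
    ("sublane_mask", 103, 3), ("base_address", 106, 2), ("offset", 108, 2),
    ("stride", 110, 3), ("has_bit", 113, 1), ("predicate", 114, 5),
]


def pack_cmem_load(values: dict) -> bytes:
    """Place each field at its absolute bit and return the 51-byte bundle."""
    bundle = 0
    for name, bit, width in CMEM_FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        bundle |= v << bit
    return bundle.to_bytes(51, "little")  # little-endian == LSB-first bit order


def unpack_cmem_load(raw: bytes) -> dict:
    """Inverse of pack_cmem_load: read the same absolute bits back."""
    bundle = int.from_bytes(raw, "little")
    return {name: (bundle >> bit) & ((1 << width) - 1)
            for name, bit, width in CMEM_FIELDS}


example = {"sublane_mask": 5, "base_address": 1, "offset": 2,
           "stride": 3, "has_bit": 1, "predicate": 15}
raw = pack_cmem_load(example)
```

With these values the 16 bits at abs 103..118 read 0x7DCD, and unpack_cmem_load(raw) returns the original dictionary.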
vector_load (abs 119..140) — 0x1ee287e0, vector_store (abs 142..166) — 0x1ee3b440, misc (abs 17..40) — 0x1ed03500
vector_load carries predicate @136 (5b), the addr-mode/oneof discriminator @134 (2b), dest @129 (5b), stride @126 (3b), offset @122 (2b), and base @134 (2b, mode-dependent); it has four addressing-mode oneof branches (Vmem / Shuffled / Indexed-Iar0 / Indexed-Iar1) that reuse the same field bases. vector_store carries source register fields at abs 162/157/152 (5b each), stride/feature fields at abs 149/142 (3b each), base @145 (2b), and offset @147 (2b). misc carries three 3-bit fields at abs 22/25/28, a 5-bit sub-op @31, and predication @36 (5b). All three pull their register operands from the shared Y-register selectors and immediates from the shared pool. The per-mode field-role differences for the four vector_load addressing modes were not separated branch-by-branch (the bases are CONFIRMED; the per-mode semantics are HIGH); see Memory-Load and Memory-Store.
The Five Shared Operand Sub-Encoders
In Jellyfish, the memory-operand sub-fields are written by five standalone EncodeVector{SublaneMask,BaseAddress,Offset,Shuffle,Stride}Encoding helper functions. Pufferfish has no standalone helpers — vector_load, vector_store, and cmem_load write the same five sub-field shapes inline, all reading from one common addressing submessage (proto+0x48 for vector_load / cmem_load) and writing into the same shared operand pool. The five shared sub-fields:
| Sub-field | Width | Abs bit(s) | Source / role |
|---|---|---|---|
| base-address | 2 | vload @134 · vstore @145 · cmem @106 | BaseAddressEncoding (ZERO / vs0 / vs1 / vs2) |
| offset | 2 | vload @122 · vstore @147 · cmem @108 | OffsetEncoding mode selector |
| stride | 3 | vload @126 · vstore @142/149 · cmem @110/103 | stride / feature-length mode |
| index/mask register | 5 | shared @241 / 246 / 251 | 3-entry Y-register selector SlotMap |
| index/offset immediate | 16 | shared @256 / 272 / 288 / 304 / 320 / 338 | 6-entry immediate SlotMap |
The operand source — which of the three Y-register slots or six immediate slots a given memory operand uses — is not decided by the encoder. It is bound earlier, at the proto-build layer, by a family of xla::pufferfish::proto_utils template functions that take a SlotMap<unsigned long, 3>& (the three Y-register selectors) and a SlotMap<ImmValue, 6>& (the six immediates):
// xla::pufferfish::proto_utils — operand-source binding (proto-build layer)
VectorYVEncode<Slot>(VregnoOrImm, SlotMap<ulong,3>&, SlotMap<ImmValue,6>&, Slot*); // vector Y reg-or-imm
VectorYSEncode<Slot>(SregnoOrImm, SlotMap<ulong,3>&, SlotMap<ImmValue,6>&, Slot*); // vector Y sreg-or-imm
ScalarYEncode<Slot>(SregnoOrImm, Slot*, SlotMap<ImmValue,6>&); // scalar Y immediate
SetImmOrDie<...>(int slot, ImmValue, Slot*); // writes one imm slot
VisitImmediateSlots<0..5>(TensorCoreBundle*, optional<uint>); // the 6 imm-slot visitors
The packer assigns each present slot's register/immediate operands to a free SlotMap entry; if more operands are requested than there are slots (more than 3 distinct Y-registers, or more than 6 distinct immediates, across the whole bundle), the binding is rejected at proto-build time. This is the Pufferfish analog of Jellyfish's FindFreeSlot per-lane search — here it is a SlotMap allocation over one shared pool rather than per-lane submessage search.
GOTCHA: the shared pool is the real co-issue constraint, not the dedicated regions. A reimplementation that treats each slot as fully independent (because the dedicated regions are disjoint) will let, say, two memory slots and both VALU lanes each demand a fourth distinct Y-register and silently overwrite each other's selector. The dedicated regions co-exist freely; the operands must fit in 3 Y-register selectors and 6 immediate words combined. The SlotMap allocator is what enforces this, and it runs before Encode.
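The allocation behaviour can be sketched as a fixed-capacity pool. This is an illustrative stand-in for the SlotMap role, not the real template: the class layout, the bind method, and the choice to let equal values share an entry (inferred from the "distinct" wording above) are all assumptions.

```python
class SlotMap:
    """Fixed-capacity pool handing out one entry per distinct value (illustrative)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []

    def bind(self, value) -> int:
        """Return the entry index holding value, allocating a free entry if needed."""
        if value in self.entries:
            return self.entries.index(value)  # assumption: equal operands share an entry
        if len(self.entries) == self.capacity:
            raise ValueError("operand pool exhausted; bundle cannot be built")
        self.entries.append(value)
        return len(self.entries) - 1


Y_REG_BITS = (241, 246, 251)                # 3 selectors, 5 bits each
IMM_BITS = (256, 272, 288, 304, 320, 338)   # 6 immediates, 16 bits each

y_regs = SlotMap(len(Y_REG_BITS))
first = y_regs.bind(7)     # entry 0 -> written at abs 241
second = y_regs.bind(9)    # entry 1 -> written at abs 246
shared = y_regs.bind(7)    # reuses entry 0
```

The point of rejecting at bind time is that the failure surfaces while the proto is being built, before any slot encoder has a chance to overwrite another slot's selector.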
The kNeverExecute Prefill
An absent slot must be a defined no-op, not garbage. EncodeBundleInternal zeroes the whole buffer, and the slot-default mechanism supplies the no-op value: FillDefaultBundle (0x1d222ee0) default-constructs each slot's submessage, sets all twelve has-bits, and stamps the predicate value 0x1f (31 = kNeverExecute) into every slot's predicate field, plus a default Scalar0_ScalarHalt op. The disassembly shows the stamp directly — mov DWORD PTR [rax+0x1c],0x1f and mov DWORD PTR [rax+0x20],0x1f, repeated once per slot, writing the proto predicate field (at submessage offset +0x1c or +0x20 depending on slot type).
When a slot's has-bit is clear, the codec substitutes that slot's _globals_ default instance (via the cmove in the dispatch ladder), whose predicate field is the FillDefaultBundle-stamped 31. The per-slot encoder then writes predicate 31 into the slot's predicate field. A present slot's encoder instead writes the real 5-bit Predication value: 0..14 a predicate-register reference, 15 kAlwaysExecute, +16 the negated form, 31 kNeverExecute.
GOTCHA: empty is predicate-31, not all-zero, even though the buffer starts at zero. The buffer is memset to 0, but predicate 0 is a valid predicate-register reference, not "skip". A reimplementation that fills only active slots and leaves the rest at the zeroed default turns every empty slot into a live op gated on predicate register 0. The empty-slot value is the explicit 0x1f stamp, identical to Jellyfish's empty-slot encoding. See NOP / Unused-Slot Canonical Encoding and Predicate Slot.
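A short sketch of the 5-bit predicate value space and of the empty-slot stamp. It covers only the predicate bases stated on this page (vector_store's predicate base is not listed here, so it is omitted), and it stamps predicates only; a real default slot would also carry its other default fields. All names are hypothetical.

```python
K_ALWAYS_EXECUTE = 15
K_NEVER_EXECUTE = 31


def describe_predicate(value: int) -> str:
    """Render a 5-bit Predication value: 0..14 register, 15 always, +16 negated, 31 never."""
    if not 0 <= value <= 31:
        raise ValueError("predicate is a 5-bit field")
    if value == K_NEVER_EXECUTE:
        return "never"
    if value == K_ALWAYS_EXECUTE:
        return "always"
    negated, reg = divmod(value, 16)
    return f"{'!' if negated else ''}p{reg}"


# Predicate field bases as listed in the per-slot tables on this page.
PREDICATE_BITS = {
    "scalar_0": 403, "scalar_1": 376, "vector_alu_0": 236, "vector_alu_1": 193,
    "vector_load": 136, "cmem_load": 114, "vector_extended_0": 98,
    "vector_extended_1": 78, "vector_result_0": 58, "vector_result_1": 47,
    "misc": 36,
}


def stamp_never_execute(bundle: int = 0) -> int:
    """OR kNeverExecute into every listed predicate field of a zeroed bundle."""
    for bit in PREDICATE_BITS.values():
        bundle |= K_NEVER_EXECUTE << bit
    return bundle
```

describe_predicate(0) returns "p0", which is exactly why a zeroed predicate field is a live op rather than an empty slot.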
The Jellyfish → Pufferfish Delta
Pufferfish grows the bundle from 41 to 51 bytes (+10 bytes / +80 bits) and changes the encoder mechanism. The four datapath additions and the one mechanism change, all byte-exact:
| Change | Jellyfish | Pufferfish | Bit cost |
|---|---|---|---|
| Encoder mechanism | scratch struct + shl/and/or + 12-byte strip | 51-byte buffer + BitCopy absolute-bit, no strip | — |
| 2nd MXU | one VectorExtended slot | MXU0 (abs 83..102) + MXU1 (abs 63..82), −20 twin | +20b |
| Wide VALU lane | two near-symmetric VALU lanes | VALU0 wide (43b) + VALU1 narrow (31b) | +12b |
| cmem_load slot | none | first-class const-mem load (abs 103..118) | +16b |
| 2nd result slot | one VectorResult | result_0 (abs 52..62) + result_1 (abs 41..51) | +11b |
| Operand sub-encoders | 5 standalone helpers | 5 inline BitCopy writes into one shared pool | (larger pool) |
The mechanism change is the deeper one. Jellyfish's encoder is monolithic and personality-specific; Pufferfish's is a generic TensorCoreCodecBase template parameterized over the twelve {Slot}Decoder/{Slot}Encoder pairs plus a Predication policy. That same codec template, with a different slot list (four VALU lanes, two loads, vector-scalar + DMA slots), is the next-generation GXC codec — Pufferfish is the v4 origin of the codec design that carries through to V6+.
QUIRK: the physical MXU array is selected by an opcode field, not by the bundle slot, even though there are two MXU slots. Pufferfish has two MXU control slots in the bundle (MXU0/MXU1, the −20 twin) and four physical MXU arrays. The two are orthogonal: a matmul op's 9-bit opcode field carries the physical MXU number (0..3) in its low two bits (abs 89..90 for MXU0, abs 69..70 for MXU1), so the bundle slot picks the control lane while the opcode picks the physical array. A reimplementer must not conflate "which MXU slot" with "which of the four MXUs".
HBM / DMA Framing
The 51-byte width is the issue width. Stored in HBM, each bundle is framed by the shared codec-metadata table keyed on (TpuVersion, TpuSequencerType), not by a hardcoded Pufferfish constant. The framing helpers on EncoderPfTensorCore all delegate to that table:
| Helper | @ addr | Result / delegate |
|---|---|---|
| BundleSizeBytes() | 0x1d227740 | 0x33 = 51 (on-wire issue width) |
| BundleSizeBytesForDma() | 0x1d227760 | reads version [encoder+0x8] → tail-jumps codec_metadata::BundleSizeBytesForHbm(ver, seqtype) @ 0x1ecf71a0 |
| HasCheckByteForDma() | 0x1d227780 | → codec_metadata::HasCheckByteForHbm @ 0x1ecf71c0 |
| BundleCheckByte() | 0x1d2277a0 | → codec_metadata::BundleCheckByte @ 0x1ecf71e0 |
| BundleCheckByteMask() | 0x1d2277c0 | → codec_metadata::BundleCheckByteMask @ 0x1ecf7200 |
| DmaEncodingBytesRequired(n) | 0x1d2277e0 | n + BundleSizeBytesForHbm() (per-bundle stride) |
| MinimumBundlesRequiredToEncodeToDma() | 0x1d227920 | 0xa = 10 |
| EncodeProgramForHbmInternal | 0x1e8c5ce0 | thin wrapper → codec vtable [+0x20], per-bundle loop over TensorCoreProgram.bundles |
NOTE: the codec-metadata table lives in platforms_deepsea::jellyfish::isa::codec_metadata and is shared across generations. Pufferfish (TpuVersion = 2) is one row of a (TpuVersion, TpuSequencerType) lookup, so its HBM stride and check-byte are not compiled into EncoderPfTensorCore. The exact per-bundle framing bytes (the check byte and any pad) are written by the program-level EncodeProgramForHbmInternal and are not individually pinned on this page.
What Is Not Yet Pinned
- The VALU0 12-bit field region (abs 208..219). Extent CONFIRMED, role LOW: written by a TensorCoreVectorAlu0_* per-opcode oneof branch, not the common path; a second vector source, wide-immediate selector, or vmask is the inferred candidate set.
- The header / reserved bits (abs 0..16, bit 141, abs 336..337). Not written by any of the twelve per-slot encoders; ship as zero. Whether abs 0..16 is a bundle-level sequencer prefix or pure alignment is LOW.
- The per-opcode oneof submessage rosters. Pufferfish replaces each scalar/VALU/MXU opcode field with roughly fifty typed oneof submessages; the opcode-field bit base+width is CONFIRMED, but the enum-value → oneof mapping is a separate (large) deliverable, not enumerated here.
- The four vector_load addressing-mode branches. The shared field bases are CONFIRMED; the per-mode role split (Vmem / Shuffled / Indexed-Iar0 / Indexed-Iar1, discriminated by the 2-bit oneof @134) is HIGH.
Cross-References
- Jellyfish 41B Bundle: the v3 predecessor this page deltas against: the scratch-struct / 12-byte-strip encoder, the slot_mask @ proto+0x10 dispatch, and the single MXU / VALU / result slots Pufferfish doubles or widens.
- Bundle Model: the per-generation bundle widths (41 / 51 / 64) and the slot taxonomy this page instantiates for Pufferfish.
- cmem_load Slot: the new v4 constant-memory load slot (abs 103..118) and its addressing submessage in full.
- MXU Slot: the vector-extended/MXU opcode roster, the −20 twin, and the matmul / PushGains / transpose families the 7-bit opcode @91/71 selects.
- MXU Latency: Pufferfish: the cost model that consumes the MXU-slot fields decoded here.
- Decode-Side: JF / PF: the inverse TensorCore{Slot}Decoder::Decode path that reads these same bits, the independent confirmation of every slot-map base.
- Record Format: the 239-bit MC APInt; why the proto codec path is distinct from the MC layer that cross-confirms the 408-bit width.
- MC-Emitter: TPUMCCodeEmitter::encodeInstruction and the 0x198 = 408-bit constant that independently fixes the bundle width.