Memory-Store Slot
Every address, bit offset, opcode value, and string on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped, full C++ symbols). .text/.rodata are mapped VA == file offset. Other wheel versions differ.
Abstract
The memory-store slot is the bundle slot that moves a compute register out to on-chip memory inside a single VLIW issue word — the write-side mirror of the Memory-Load Slot. It is distinct from intra-chip DMA: the store slot moves one vector register into VMEM (or one scalar register into SMEM) in one bundle cycle, whereas DMA moves tier→tier blocks via descriptors. Like the load slot it carries no tier-selector bit — the destination tier is a function of (slot, sub-opcode).
A store slot is a discriminated union whose sub-opcode selects both the destination tier and the addressing mode. The per-gen <Bundle><Slot>StoreEncoder::Encode(<msg> const&, absl::Span<uint8_t>) is the canonical byte codec, and on Pufferfish and all V5+ generations each field is written with one call to the universal bit packer BitCopy(dst, bit_offset, src, 0, width). Every bit position on this page is LSB-first (matching Bundle Model): bit 0 is the least-significant bit of byte 0, and the literal bit_offset argument is the absolute, LSB-numbered bundle bit, so the store-slot field map is read off the encoder disassembly with no shift arithmetic to invert. The store-data source register is always written first; the sub-opcode discriminator (read from the proto at +0x50 on the TensorCore store, +0x50/+0x58 on the V5e/V6e SparseCore-TEC store) then selects which addressing fields follow.
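The LSB-first convention can be illustrated with a minimal Python sketch. The name bit_copy and its integer src argument are assumptions made for illustration; the real BitCopy works on C++ byte spans and its internals are not reproduced here.

```python
def bit_copy(dst: bytearray, dst_bit: int, src: int, src_bit: int, width: int) -> None:
    """Copy `width` bits of `src`, starting at `src_bit`, to absolute bit `dst_bit` of `dst`.

    LSB-first numbering: bit 0 is the least-significant bit of dst[0].
    """
    for i in range(width):
        bit = (src >> (src_bit + i)) & 1
        byte_index, bit_index = divmod(dst_bit + i, 8)
        if bit:
            dst[byte_index] |= 1 << bit_index
        else:
            dst[byte_index] &= ~(1 << bit_index) & 0xFF


# Pufferfish-sized bundle (51 bytes); write a 5-bit value at absolute bit 162.
bundle = bytearray(51)
bit_copy(bundle, 162, 0b10101, 0, 5)
```

Since 162 = 20*8 + 2, the field occupies bits 2..6 of byte 20, and reading the bundle back as a little-endian integer recovers the value with a plain shift and mask.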
Two structural facts dominate the per-generation story. First, the store slot is the write-side twin of the load slot (same dedicated bundle slot, same paired Encoder/Decoder in the same TensorCoreCodecBase template, same addressing-mode taxonomy: base+offset, base-register NoOffset, strided, indexed/scatter), with three write-only additions: a sublane/vmask write-enable, the SparseCore reduce-add (read-modify-write) family, and the vector-store fence ordering barrier. Second, the per-gen deltas are concrete: the bundle grows 41 → 51 → 64 bytes; Pufferfish adds CMEM-store and 8 hardware vmask registers; Viperfish shrinks the store-data field 5 → 4 bits while growing base/offset 5 → 6 bits and adds the SparseCore TEC tile-SPMEM store with reduce-add; Ghostlite doubles the write ports 1 → 2; 6acc60406 moves the store-fence into the misc slot, reaches a 34-sub-op SparseCore store family, and adds atomic-add scalar SMEM stores.
For reimplementation, the contract is: the LloOpcodeIsVectorStore opcode window, the per-gen store Encoder/Decoder address table, the BitCopy absolute-bit field maps for PF/VF/GFC, the slot-selects-tier model, the two orthogonal masking mechanisms (bundle predicate vs vmask), and the per-gen deltas.
| Property | Value |
|---|---|
| Slot role | move one register's worth of data into VMEM/CMEM/SPMEM (vector) or SMEM (scalar) |
| Tier select | by (slot, sub-opcode) (no tier bit) |
| PXC store encoder | pxc::isa::TensorCoreVectorStoreEncoder::Encode @ 0x1ee3b440 (51-B bundle) |
| VXC store encoder | vxc::isa::TensorCoreVectorStoreEncoder::Encode @ 0x1f01ff60 (64-B bundle) |
| GFC store encoder | gxc::gfc::isa::TensorCoreVectorStoreEncoder::Encode @ 0x1fa08920 (64-B bundle) |
| JF store encoder | jellyfish::isa::EncoderJf::EncodeVectorStoreInstruction @ 0x1e868c40 (41-B, monolithic) |
| Bit primitive | BitCopy(dst, abs_bit, src, 0, width) — abs_bit == absolute bundle bit (PF/V5+) |
| Opcode classifier | xla::jellyfish::LloOpcodeIsVectorStore @ 0x14024920 → window {63,64,65,68,69,70} + 460 |
| Store-data field | 5-bit on JF/PF, 4-bit on VF/GL/GFC; base/offset 5-bit → 6-bit at v5 |
| Write ports | MaxVectorStoreSlots = 1 on JF/PF/VF, 2 on GL/GFC |
The Store Op List (LLO Opcodes)
The store family in the LloOpcode enum (internal k-prefixed names; proto LloOpcodeProto OPCODE_-prefixed names):
| LloOpcode internal | LloOpcodeProto name | Destination tier | Slot |
|---|---|---|---|
| kVectorStore | OPCODE_VECTOR_STORE | VMEM | VECTOR_STORE |
| kVectorStoreMasked | OPCODE_VECTOR_STORE_MASKED | VMEM (masked) | VECTOR_STORE |
| kVectorStoreIndexed | OPCODE_VECTOR_STORE_INDEXED | VMEM (scatter) | VECTOR_STORE |
| kVectorStoreIndexedMasked | OPCODE_VECTOR_STORE_INDEXED_MASKED | VMEM (scatter+mask) | VECTOR_STORE |
| kVectorStoreSublaneShuffle | OPCODE_VECTOR_STORE_SUBLANE_SHUFFLE | VMEM (shuffled) | VECTOR_STORE |
| kVectorStoreEvenOddSublanes | OPCODE_VECTOR_STORE_EVEN_ODD_SUBLANES | VMEM (even/odd) | VECTOR_STORE |
| kVectorStoreFence | OPCODE_VECTOR_STORE_FENCE | (ordering barrier) | VECTOR_MISC (GFC) |
| kVectorCmemStore | OPCODE_VECTOR_CMEM_STORE | CMEM | VECTOR_STORE |
| kVectorCmemStorePseudo | OPCODE_VECTOR_CMEM_STORE_PSEUDO | CMEM (pseudo) | VECTOR_STORE |
| kBarnaCoreVectorStore | OPCODE_BARNA_CORE_VECTOR_STORE | BarnaCore VMEM | (BCS channel) |
| kScalarStore | OPCODE_SCALAR_STORE | SMEM | SCALAR_0/1 |
The classifier xla::jellyfish::LloOpcodeIsVectorStore (@ 0x14024920) is byte-exact: it returns true when (opcode - 63) <= 7 && _bittest(0xE7, opcode - 63) or opcode == 460. 0xE7 = 0b1110_0111 selects offsets {0,1,2,5,6,7}, i.e. internal opcodes {63,64,65,68,69,70} plus 460. Scalar-store, CMEM-store and BarnaCore-store are classified by separate predicates.
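The predicate described above can be sketched in Python. The function name is an illustrative transliteration, not a binding to the library; the explicit lower-bound check stands in for the unsigned comparison in the original.

```python
VECTOR_STORE_BITMASK = 0xE7  # 0b1110_0111: offsets {0, 1, 2, 5, 6, 7} from opcode 63


def llo_opcode_is_vector_store(opcode: int) -> bool:
    """Return True for opcodes inside the vector-store window, or for opcode 460."""
    delta = opcode - 63
    # Python ints are signed, so the lower bound is checked explicitly.
    if 0 <= delta <= 7 and (VECTOR_STORE_BITMASK >> delta) & 1:
        return True
    return opcode == 460


# Enumerate the window over a range wide enough to include 460.
window = [op for op in range(512) if llo_opcode_is_vector_store(op)]
```

Enumerating the predicate this way is a quick check that the bitmask and the stated window agree.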
NOTE: the vector-store fence is not a store; it is an ordering barrier. On 6acc60406 it is encoded in the VectorMisc slot (gxc::gfc::isa::TensorCoreVectorMisc0Compact::vector_store_fence accessor), not the store slot. The bundle packer's RequiresVectorStoreFence (@ 0x14020900) gates it behind Target::SupportsVectorStoreFence(). See Per-Gen Deltas.
Store-Slot Encoder/Decoder Address Table
Each gen has a <Bundle><Slot>StoreEncoder::Encode and a matching <…>StoreDecoder::Decode. The decoder is the symmetric inverse BitExtract of the encoder.
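A minimal sketch of the encoder/decoder symmetry, treating the bundle as a little-endian integer so that absolute bit k is bit k of that integer. The helper names insert_field and extract_field are hypothetical; they model the relationship between the two directions, not the library's actual functions.

```python
def insert_field(bundle: bytes, abs_bit: int, width: int, value: int) -> bytes:
    """Encoder direction: place `value` into `width` bits at absolute bit `abs_bit`."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value} does not fit in {width} bits")
    word = int.from_bytes(bundle, "little")
    mask = ((1 << width) - 1) << abs_bit
    word = (word & ~mask) | (value << abs_bit)
    return word.to_bytes(len(bundle), "little")


def extract_field(bundle: bytes, abs_bit: int, width: int) -> int:
    """Decoder direction: read `width` bits starting at absolute bit `abs_bit`."""
    return (int.from_bytes(bundle, "little") >> abs_bit) & ((1 << width) - 1)


# Round trip on a 64-byte bundle, using a 4-bit field at absolute bit 170.
encoded = insert_field(bytes(64), 170, 4, 9)
decoded = extract_field(encoded, 170, 4)
```

The round trip holds for any field that fits inside the bundle, which is the property a reimplemented decoder would be tested against.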
TensorCore vector-store slot
| Gen | Namespace | Encode @ | Decode @ |
|---|---|---|---|
| Jellyfish | jellyfish::isa::EncoderJf::EncodeVectorStoreInstruction | 0x1e868c40 | (paired in DecoderJf) |
| Pufferfish | pxc::isa::TensorCoreVectorStore{En,De}coder | 0x1ee3b440 | 0x1ee2dae0 |
| Viperfish | vxc::isa::TensorCoreVectorStore{En,De}coder | 0x1f01ff60 | 0x1f01a560 |
| Ghostlite | gxc::glc::isa::TensorCoreVectorStore{En,De}coder | 0x1f3c3200 | 0x1f3bd780 |
| 6acc60406 | gxc::gfc::isa::TensorCoreVectorStore{En,De}coder | 0x1fa08920 | 0x1fa02f00 |
Jellyfish's TensorCore store is not a standalone codec class; it is the monolithic EncoderJf::EncodeVectorStoreInstruction(VectorStoreInstruction const&, Bundle const&) (@ 0x1e868c40) that fills the slot bits inline.
SparseCore TEC tile-SPMEM store slot (V5+ only)
| Gen | Namespace | Encode @ | Decode @ |
|---|---|---|---|
| Viperfish | vxc::vfc::isa::SparseCoreTecVectorStore{En,De}coder | 0x1e9c2760 | 0x1e9bbf40 |
| Ghostlite | gxc::glc::isa::SparseCoreTecVectorStore{En,De}coder | 0x1eb50ac0 | 0x1eb42300 |
| 6acc60406 | gxc::gfc::isa::SparseCoreTecVectorStore{En,De}coder | 0x1eccbe20 | 0x1ecbd7a0 |
There is no Jellyfish/Pufferfish SparseCore TEC store — the SparseCore TEC sequencer exists only from Viperfish onward. The BarnaCore channel vector-store (Pufferfish only) is pxc::pfc::isa::BarnaCoreChannelVectorStore{En,De}coder (Encode 0x1e8c5640, Decode 0x1e8c4d40).
Scalar-store slot (SREG → SMEM)
Routed through the scalar-slot encoder, not the vector-store slot. Jellyfish: ScalarStoreOperands proto, emitted by EncoderJf::EncodeScalarInstruction (@ 0x1e862060). Pufferfish: TensorCoreScalar1_ScalarStoreSmemAbsolute via ScalarYEncode (@ 0x1c471c60) + SetImmOrDie (@ 0x1c550660); BarnaCore-sequencer variant via 0x1c471fc0/0x1c550e00. Viperfish/Ghostlite: …ScalarAlu_ScalarStoreXToSmemY. 6acc60406 adds …ScalarStoreXToSmemSumDestAndY (atomic add), …ScalarStoreCircularBuffer, and SmemFetchAndAdd (via EmitFetchAndAddOp @ 0x13a3a300).
Store Slot Position in the Bundle
The store slot is one fixed slot in the per-gen VLIW bundle; the slot order is the per-gen TensorCoreCodecBase<…> template-arg order:
ScalarAlu0 / ScalarAlu1 / [Predicates] / Immediates / VectorScalar /
Dma / VectorAlu0 / VectorAlu1 / [VectorAlu2 / VectorAlu3] /
**VectorStore** / VectorLoad0 / [VectorLoad1] / VectorMisc /
VectorExtended0 / [VectorExtended1] / VectorResult0 / [VectorResult1]
(Pufferfish carries VectorAlu0/1 + VectorStore + VectorLoad + CmemLoad + ….) The slot-enumeration constants name it: isa::SLOT_VECTOR_STORE (TC), isa::SLOT_SCALAR_0/1 and SLOT_SCALAR_ALU_0/1 (scalar stores), isa::SLOT_VECTOR_MISC (overlay target / store fence), llvm::TPU::SparseCoreMCSlot::SLOT_VST and SLOT_SST (SparseCore vector/scalar store). Bundle widths: Jellyfish/Dragonfish 41 B, Pufferfish 51 B, V5+ TensorCore 64 B; SparseCore SCS 32 B, TEC 64 B.
On Pufferfish the vector_store slot occupies absolute bundle bits 142..166 (see Pufferfish 51B Bundle), disjoint from the vector_load slot at 119..140 — so a load and a store can co-exist in one bundle. The constraint "SCALAR_STORE_ABSOLUTE not allowed in slot 0." forces a scalar-store absolute on Jellyfish/Pufferfish into SCALAR_1, never SCALAR_0. The accounting predicate store_and_misc_slots == module.target().MaxVectorStoreSlots() ties the number of store-bearing slots to the per-gen write-port count.
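The disjointness claim for the Pufferfish slots can be checked with a few lines of Python. The range names and the overlaps helper are illustrative; only the bit bounds and the 51-byte width come from the text above.

```python
PF_BUNDLE_BITS = 51 * 8            # Pufferfish bundle is 51 bytes

# Inclusive bit ranges from the text, expressed as half-open Python ranges.
VECTOR_STORE_BITS = range(142, 167)  # absolute bits 142..166
VECTOR_LOAD_BITS = range(119, 141)   # absolute bits 119..140


def overlaps(a: range, b: range) -> bool:
    """True when two half-open bit ranges share at least one bit."""
    return a.start < b.stop and b.start < a.stop


slots_are_disjoint = not overlaps(VECTOR_STORE_BITS, VECTOR_LOAD_BITS)
```

With these bounds the store slot spans 25 bits and the load slot 22 bits, and both sit inside the 408-bit bundle.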
Bit-Field Layout (BitCopy Field Maps)
Each per-slot Encoder fills the slot via BitCopy(dst, abs_bit, src, 0, width); the store-data source register is written first, then the sub-opcode discriminator selects the addressing fields. The maps below are harvested directly from the BitCopy(a3, …) call sites in the decompiled encoders — the second argument is the absolute bundle bit.
Pufferfish (PXC) TensorCoreVectorStore — Encode @ 0x1ee3b440 (51-B bundle)
Byte-exact from the encoder body (BitCopy(a3, 162, …, 5) etc.):
| Field | abs bit | width | meaning |
|---|---|---|---|
| source vreg | 162 | 5 | store-data vreg (proto +28); = 31 ⇒ NOOP (kNeverExecute) |
| sub-opcode | 157 | 5 | VmemStore=0 / CmemStore≠0 discriminator |
| base address / Vmem offset | 152 | 5 | VregNumber base |
| offset | 149 | 3 | immediate-offset slot index |
| stride | 147 | 2 | stride select |
| sublane-mask / vmask | 145 | 2 | sublane/vmask select |
| (CMEM full-address words) | 304 / 288 / 272 | 16 | CMEM 16-bit immediate address (CmemStore only) |
| (CMEM address regs) | 251 / 246 / 241 | 5 | shared Y-register selectors |
The PF store slot region (abs 142..166) matches the Pufferfish 51B Bundle page exactly. Sub-opcode dispatch (proto +0x50): 0=Noop, 5=NoopAlt, 6=CmemStore, 7=CmemStoreNoOffset, 8=VmemStore, 9=VmemStoreNoOffset, 0xA-0x11=VmemStoreVmsk0..7, 0x12=VmemStoreIndexed, 0x13=VmemStoreIndexedNoOffset, 0x14-0x1B=VmemStoreIndexedVmsk0..7, 0x1C=SetIarLane, 0x1D=SetIarSublane, 0x1E=SetIarRaw, 0x1F=PushV2s. The case values span 0..0x1F (a 32-value range), of which 28 are named here; values 1-4 are not listed.
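The Pufferfish field map above can be exercised with a small Python sketch. The dictionary keys and the encode/decode helpers are hypothetical names; only the (abs bit, width) pairs are taken from the table, and the CMEM-only fields are omitted. The example values are arbitrary and do not correspond to a known instruction.

```python
PF_BUNDLE_BYTES = 51

# (absolute bit, width) pairs copied from the Pufferfish table.
PF_STORE_FIELDS = {
    "source_vreg": (162, 5),
    "sub_opcode": (157, 5),
    "base": (152, 5),
    "offset": (149, 3),
    "stride": (147, 2),
    "mask": (145, 2),
}


def encode_pf_store(**values: int) -> bytes:
    """Pack the given fields into an otherwise zeroed 51-byte bundle."""
    word = 0
    for name, value in values.items():
        abs_bit, width = PF_STORE_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << abs_bit
    return word.to_bytes(PF_BUNDLE_BYTES, "little")


def decode_pf_store(bundle: bytes) -> dict:
    """Read every field in the map back out of the bundle."""
    word = int.from_bytes(bundle, "little")
    return {
        name: (word >> abs_bit) & ((1 << width) - 1)
        for name, (abs_bit, width) in PF_STORE_FIELDS.items()
    }


example = encode_pf_store(source_vreg=7, base=3, offset=5, stride=1)
```

The six fields sum to 22 bits and tile bits 145..166 with no gaps, which is consistent with the 142..166 slot region leaving bits 142..144 unmapped in this table.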
Viperfish (VXC) TensorCoreVectorStore — Encode @ 0x1f01ff60 (64-B bundle)
Byte-exact (BitCopy(a3, 170, …, 4) etc.):
| Field | abs bit | width | meaning |
|---|---|---|---|
| source vreg | 170 | 4 | store-data vreg (proto +28) |
| sub-opcode discriminator | 167 | 3 | addr-mode/family |
| secondary opcode / vmsk | 163 | 4 | mask / sub-variant |
| base address / offset | 157 | 6 | VregNumber base |
| stride (Base variants) | 153 | 4 | stride select |
| (Base-variant field) | 151 | 2 | — |
| (trailing field) | 148 | 3 | Vs select / mask |
| address / mask field | 144 | 4 | — |
The data-vreg @170 w4 and base @157 w6 match the Viperfish 64B Bundle page exactly. Sub-opcode dispatch (proto +0x50): 0=Noop, 5=VectorStore, 6=VectorStoreBase, 7=VectorStoreMasked, 8=VectorStoreBaseMasked, 9=VectorStoreShuffled, 0xA=VectorStoreShuffledBase, 0xB=VectorStoreShuffledMasked, 0xC=VectorStoreShuffledBaseMasked, 0xD=VectorStoreIndexed0, 0xE=VectorStoreIndexed1, 0xF=VectorStoreIndexed0Masked, 0x10=VectorStoreIndexed1Masked, 0x11-0x16=SetLaneIar0/1·SetSublaneIar0/1·SetRawIar0/1, 0x17=PushV2s, 0x18=SetPrng. The case values span 0..0x18 (a 25-value range), of which 21 are named here; values 1-4 are not listed.
6acc60406 (GFC) TensorCoreVectorStore — Encode @ 0x1fa08920 (64-B bundle)
Byte-exact (BitCopy(a3, 169, …, 2) etc.):
| Field | abs bit | width |
|---|---|---|
| sub-opcode top field | 169 | 2 |
| sub-opcode | 166 | 3 |
| secondary opcode | 162 | 4 |
| base address / offset | 156 | 6 |
| stride / mask | 152 | 4 |
| (Base-variant field) | 150 | 2 |
| (field) | 147 | 3 |
| (field) | 143 | 4 |
Sub-opcode dispatch is identical to Viperfish. Ghostlite's TensorCoreVectorStoreEncoder (@ 0x1f3c3200) is structurally identical to VXC/GFC (same sub-op set, same template) with gen-shifted bit offsets; its field map was not separately dumped — HIGH by template identity.
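A short Python check of how the VXC and GFC maps relate, using only the (abs bit, width) pairs from the two tables. The list names and the is_contiguous helper are illustrative. It says nothing about Ghostlite, whose map was not dumped.

```python
# (absolute bit, width), in table order, highest field first.
VXC_FIELDS = [(170, 4), (167, 3), (163, 4), (157, 6), (153, 4), (151, 2), (148, 3), (144, 4)]
GFC_FIELDS = [(169, 2), (166, 3), (162, 4), (156, 6), (152, 4), (150, 2), (147, 3), (143, 4)]


def is_contiguous(fields):
    """True when the fields tile a bit range with no gaps and no overlaps."""
    ordered = sorted(fields)
    return all(lo + width == nxt for (lo, width), (nxt, _) in zip(ordered, ordered[1:]))


# Compare the seven lower fields pairwise (everything after the top field).
shifts = [v_bit - g_bit for (v_bit, _), (g_bit, _) in zip(VXC_FIELDS[1:], GFC_FIELDS[1:])]
same_widths = all(v_w == g_w for (_, v_w), (_, g_w) in zip(VXC_FIELDS[1:], GFC_FIELDS[1:]))
```

On these numbers the seven lower fields keep their widths and sit exactly one bit lower on GFC, while the top field differs (4 bits at 170 on VXC, 2 bits at 169 on GFC).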
6acc60406 (GFC) SparseCoreTecVectorStore — Encode @ 0x1eccbe20 (64-B TEC bundle)
Byte-exact (BitCopy(a3, 359, …, 3) etc.), store slot in the upper bit region — the richest store family:
| Field | abs bit | width |
|---|---|---|
| opcode | 359 | 3 (4 alt) |
| normal-predication | 362 | 1 |
| rotate-predication | 363 | 1 |
| base / source | 353 | 6 |
| offset | 347 | 6 |
| stride | 340 | 3 |
| mask | 337 | 3 |
| (field) | 333 | 4 |
Sub-opcode dispatch (proto +0x4c, 34 cases): 0=Noop, 8=TileSpmemStore, 9=TileSpmemStoreCircularBuffer, then the reduce-add matrix — TileSpmemStoreAdd{S32,S16,F32,Bf16}, *CircularBufferAdd*, *CircularBufferPostUpdateAdd*, TileSpmemIndexedStore, *IndexedCircularBuffer*, *IndexedAdd{S32,S16,F32,Bf16}*, *IndexedReturnValueAdd*, *IndexedCircularBufferReturnValueAdd{S32,S16,F32,Bf16}*. The per-variant inner field widths beyond the shared positions above are not exhaustively dumped (HIGH).
Jellyfish EncodeVectorStoreInstruction — @ 0x1e868c40 (41-B bundle)
Byte-exact dispatch: encodes predication first (EncodePredication<VectorStoreInstruction> @ the call before the switch), then switch(*((_DWORD*)a2 + 12)) — the sub-opcode at proto +12:
- family A {0,6,0x10-0x17}: sublane-mask + stride store (EncodeVectorSublaneMaskEncoding, EncodeVectorStrideEncoding).
- family B {1,7,0x18-0x1F}: offset + base + sublane-mask + stride store (EncodeVectorOffsetEncoding, EncodeVectorBaseAddressEncoding, plus the two above).
- family C {2,3,4}: indexed store (SetIndexedAddressOperands, routed by v36[3]&1).
- case 5: NOOP.
- default: "Not implemented yet." fatal.
The per-field encoders write into shared immediate slots: an inner switch(v12[8]) with cases 0-3 chooses which of the 6 immediate slots holds the 5-bit register number, written at Bundle bytes 31..39. The store-data vreg is packed at byte 22 (5-bit) / byte 13; the TTU operand at byte 19/63. The absolute slot base inside the 41-byte bundle comes from the Bundle byte map (LOW — not pinned here).
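The Jellyfish dispatch families can be summarised as a small Python classifier. The function name and the string labels are illustrative; the case sets are the ones listed above, and the exception stands in for the fatal default branch.

```python
def jf_store_family(sub_opcode: int) -> str:
    """Map a Jellyfish store sub-opcode to the dispatch family described above."""
    if sub_opcode in (0, 6) or 0x10 <= sub_opcode <= 0x17:
        return "A"      # sublane-mask + stride
    if sub_opcode in (1, 7) or 0x18 <= sub_opcode <= 0x1F:
        return "B"      # offset + base + sublane-mask + stride
    if sub_opcode in (2, 3, 4):
        return "C"      # indexed store
    if sub_opcode == 5:
        return "NOOP"
    # Mirrors the default branch, which is fatal in the original.
    raise NotImplementedError("Not implemented yet.")


families = {op: jf_store_family(op) for op in (0, 1, 3, 5, 0x10, 0x1F)}
```

Values 8..0xF fall through to the default branch under this reading of the case sets.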
Addressing Modes
Four addressing modes, selected by sub-opcode — the same taxonomy as the load slot:
- (a) base + immediate-offset (VmemStore/VectorStore with offset): base is a granule-scaled VMEM byte offset; offset is an immediate placed in an immediate slot.
- (b) base + register-offset (VmemStoreNoOffset/VectorStoreBase): base held in a Vs register, no immediate-offset field. Pufferfish PlaceVsRegister<BaseAddress, VmemStore> places the base into one of 3 Vs slots.
- (c) strided: every store carries a stride field. Pufferfish StrideToSyU32<VmemStore> (@ 0x1d2970a0) encodes the stride as either an Sreg number or a U32 immediate (variant<SregNumber, ImmediateValueU32>); the destination address moves by stride granules between sublanes.
- (d) indexed / scatter (VmemStoreIndexed PXC / VectorStoreIndexed0/1 VXC/GXC / TileSpmemIndexedStore SC-TEC): per-lane destination addresses come from an index-address register (IAR) staged via SetIar*; the store scatters vector lanes to per-lane addresses. VXC/GXC have two index ports (Indexed0/Indexed1) for two coexisting scatter streams; PXC has one.
The IAR is set by dedicated store-slot sub-ops: PXC SetIarLane/SetIarSublane/SetIarRaw (cases 0x1C-0x1E); VXC/GXC SetLaneIar0/1/SetSublaneIar0/1/SetRawIar0/1 (cases 0x11-0x16). The guard kVectorStoreEvenOddSublanesIar < target.IarsPerTensorCore() confirms the IAR count is per-gen.
Memory-Tier Destination Encoding
The destination tier is selected by which sub-opcode (not a tier bit), within the single VectorStore slot:
| Destination tier | Sub-op family / path |
|---|---|
| VMEM | VmemStore* (PXC 8-0x1B) / VectorStore* (VXC/GXC 5-0x10) |
| CMEM | CmemStore/CmemStoreNoOffset (PXC cases 6-7; 16-bit imm address @304/288/272) |
| SPMEM | TileSpmemStore* (SparseCore TEC slot) |
| SMEM | ScalarStoreSmemAbsolute/ScalarStoreXToSmemY (SCALAR/SCALAR_ALU slot, not vector-store) |
| BarnaCore VMEM | BarnaCoreChannelVectorStore (Pufferfish PFC channel slot) |
The LLO opcode (kVectorStore vs kVectorCmemStore vs kScalarStore) carries the tier choice down from the IR; the per-gen lowering maps it to the matching sub-opcode in the matching slot. This is the same slot-selects-tier model the load slot uses, keyed off the runtime MemorySpace Enum; the SMEM tier is asserted by the buffer-assignment invariant dest_address->memory_space() == MemorySpace::kSmem. CMEM-store sub-ops exist only on Pufferfish; higher gens fold CMEM into VMEM-class addressing or lower it via kVectorCmemStorePseudo.
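The slot-selects-tier model for the TensorCore VectorStore slot reduces to a range lookup on the sub-opcode. The sketch below uses hypothetical function names and only the case ranges given in the table; sub-ops that do not write memory (Noop, SetIar*, PushV2s) return None.

```python
from typing import Optional


def pxc_store_tier(sub_opcode: int) -> Optional[str]:
    """Destination tier for a Pufferfish VectorStore sub-opcode, or None."""
    if sub_opcode in (6, 7):            # CmemStore / CmemStoreNoOffset
        return "CMEM"
    if 0x08 <= sub_opcode <= 0x1B:      # VmemStore* family
        return "VMEM"
    return None


def vxc_store_tier(sub_opcode: int) -> Optional[str]:
    """Destination tier for a VXC/GXC VectorStore sub-opcode, or None."""
    if 0x05 <= sub_opcode <= 0x10:      # VectorStore* family
        return "VMEM"
    return None
```

No tier bit appears anywhere in the lookup: the answer is determined by which slot's table is consulted and which case value is present.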
Store Granularity
The store moves one full vector register per cycle, modulated by:
- SublaneMask (sublane_mask proto field): a per-sublane write-enable. On VXC/GXC the Masked sub-op variants add an explicit mask; on PXC the 8 VmemStoreVmsk0..7 variants select one of 8 hardware vector-mask registers.
- Even/odd-sublane split (kVectorStoreEvenOddSublanes): store only even or only odd sublanes (sublane-stride 2); VXC/GXC via the sublane-stride field, Jellyfish via family A's stride encoder.
- Indexed/scatter (per-lane): the finest granularity; each lane writes an independent address from the IAR.
- Reduce-add (SparseCore TEC only, V5+): TileSpmemStoreAdd{S32,S16,F32,Bf16} is store-with-accumulate (SPMEM[addr] += vector), plus circular-buffer post-update and return-value variants. This is the only store family that reads the destination (read-modify-write).
There is no per-element count field beyond the sublane mask and stride; the store always processes the full SIMD width. The validation string "VectorStore mask length is not equal to target SIMD width." enforces that the mask covers exactly the lane/sublane count.
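A toy Python model of the two write behaviours described above: a masked store that honours a per-sublane write-enable, and a reduce-add store that accumulates into the destination. The SIMD width of 8, the flat list standing in for memory, and the contiguous one-element-per-sublane addressing are all assumptions chosen for illustration, not hardware facts.

```python
SIMD_WIDTH = 8  # hypothetical width, chosen only to keep the example small


def masked_store(mem, base, data, mask):
    """Write data[i] to mem[base + i] only where mask[i] is set."""
    if len(mask) != SIMD_WIDTH or len(data) != SIMD_WIDTH:
        raise ValueError("VectorStore mask length is not equal to target SIMD width.")
    for i in range(SIMD_WIDTH):
        if mask[i]:
            mem[base + i] = data[i]


def store_add(mem, base, data):
    """Read-modify-write: mem[base + i] += data[i] for every element."""
    for i, value in enumerate(data):
        mem[base + i] += value


mem = [0] * 16
even_only = [True, False] * 4                  # even sublanes enabled
masked_store(mem, 4, list(range(1, 9)), even_only)
store_add(mem, 4, [10] * SIMD_WIDTH)
```

The masked store leaves disabled sublanes untouched, whereas the accumulate variant has to read every destination element before writing it back.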
Masked / Predicated Store
Two orthogonal masking mechanisms:
- Bundle predication: every store slot carries a predicate-register reference (5-bit on Jellyfish via EncodePredication<VectorStoreInstruction>; a predication enum on V5+). Values 0..14 = predicate register, 15 = kAlwaysExecute, 31 = kNeverExecute (NOP); a false predicate suppresses the entire store. The proto field is predication (TensorcorePredication on PXC, Predication on glc/gfc; SparseCore TEC additionally splits normal_predication @362 and rotate_predication @363).
- Vector write-mask (sublane/vmask): a per-sublane (or per-lane) write enable, independent of the bundle predicate. PXC has 8 hardware vmask registers (Vmsk0..7, sub-ops VmemStoreVmsk0..7 and VmemStoreIndexedVmsk0..7). VXC/GXC carry an explicit mask field (SparsecoreVectorMask on TEC) and the Masked sub-op variants.
Predicated store = bundle-predicate gating; masked store = vmask/sublane write-enable. Both can apply at once (e.g. a predicated VmemStoreVmsk3).
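How the two mechanisms compose can be sketched in Python. The function names and the list-of-booleans model of the predicate registers and vmask are assumptions; the value ranges (0..14, 15, 31) are the ones stated for the 5-bit Jellyfish predicate field, and values 16..30 are treated as undefined because this page does not describe them.

```python
K_ALWAYS_EXECUTE = 15
K_NEVER_EXECUTE = 31


def predicate_allows(pred_field, pred_regs):
    """Resolve the bundle predicate field to a single execute/suppress decision."""
    if pred_field == K_ALWAYS_EXECUTE:
        return True
    if pred_field == K_NEVER_EXECUTE:
        return False
    if 0 <= pred_field <= 14:
        return bool(pred_regs[pred_field])
    raise ValueError("predicate value not described on this page")


def write_enables(pred_field, pred_regs, vmask):
    """Per-sublane write enables after applying the predicate, then the vmask."""
    if not predicate_allows(pred_field, pred_regs):
        return [False] * len(vmask)     # a false predicate suppresses the whole store
    return list(vmask)                  # otherwise the vmask decides per sublane


pred_regs = [False] * 15
pred_regs[3] = True
vmask = [True, False, True, True]
```

The predicate is an all-or-nothing gate, and the vmask only matters once that gate is open.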
Per-Gen Deltas
| Property | JF (v2) | PF (v4) | VF (v5p) | GL (v6e) | GFC (TPU7x) |
|---|---|---|---|---|---|
| TC bundle width (B) | 41 | 51 | 64 | 64 | 64 |
| MaxVectorStoreSlots (write ports) | 1 | 1 | 1 | 2 | 2* |
| NumVsSlots (read/base ports) | 3 | 3 | 4 | 4 | 4* |
| Store-data vreg field width | 5 | 5 | 4 | 4 | 4 |
| Base/offset field width | 5 | 5 | 6 | 6 | 6 |
| CanOverlayInMiscSlot | false | (n/a) | a2==1 | true | true* |
| Hardware vmask registers | sublane only | 8 (Vmsk0..7) | mask field | mask field | mask field |
| Indexed/scatter ports | 1 | 1 | 2 (Idx0/1) | 2 | 2 |
| CMEM store sub-ops | no | yes | no | no | no |
| SparseCore TEC tile-SPMEM store | n/a | n/a | yes | yes | yes (+RMW) |
| SPMEM reduce-add (S32/S16/F32/Bf16) | n/a | n/a | yes | yes | yes (34 sub-ops) |
| Scalar-store SMEM atomic add | no | no | no | no | yes (SumDestAndY / FetchAndAdd) |
| SupportsVectorStoreFence | false | false | false | false | misc-slot |
Verified target overrides: JellyfishTarget::MaxVectorStoreSlots=1 (@0x1d4916a0), Pufferfish=1 (@0x1d495be0), Viperfish=1 (@0x1d49c0a0), Ghostlite=2 (@0x1d498ca0). CanOverlayInMiscSlot: Jellyfish=false (@0x1d491680), Ghostlite=true (@0x1d498c80). SupportsVectorStoreFence=false on JF/PF/VF/GL (@0x1d48f660/0x1d493fa0/0x1d499f00/0x1d497060). Dragonfish (v3) shares the Jellyfish codec.
Structural deltas:
- v2→v4 (JF→PF): bundle 41→51 B; CMEM-store sub-ops added; 8 explicit hardware vmask registers (VmemStoreVmsk0..7).
- v4→v5p (PF→VF): bundle 51→64 B; store-data field 5→4 bits while base/offset 5→6 bits; CMEM-store dropped from the TC slot; sublane-shuffle store added; 2 scatter index ports; SparseCore TEC tile-SPMEM store family (with reduce-add) introduced.
- v5p→v6e (VF→GL): MaxVectorStoreSlots 1→2 (two VMEM write ports); CanOverlayInMiscSlot fully true, so the store slot can overlay the misc slot, sharing the vmask (CheckVectorStoreSlotAndMiscSlotShareVmsk in TensorCoreCodecBase).
- v6e→TPU7x (GL→GFC): store-fence moves into the VectorMisc compact message; scalar SMEM stores gain atomic add (ScalarStoreXToSmemSumDestAndY, SmemFetchAndAdd); SparseCore TEC store reaches 34 sub-ops (full reduce-add + circular-buffer + indexed + return-value matrix).
(*) The 6acc60406 (gfc) store reuses the gxc EncoderBase template; the V5+ property values are confirmed for VF/GL via the Target overrides, and the gfc-specific values marked * follow from the shared template plus the verified gfc store-encoder bit map (0x1fa08920).
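The numeric rows of the deltas table can be held as data and diffed between consecutive generations. This is a sketch with hypothetical key names; the values are copied from the table, and the GFC write-port value is one of the entries marked * there (inferred from the shared template rather than a Target override).

```python
GENS = ["JF", "PF", "VF", "GL", "GFC"]

# One list per property, in GENS order.
STORE_PROPERTIES = {
    "bundle_bytes": [41, 51, 64, 64, 64],
    "max_vector_store_slots": [1, 1, 1, 2, 2],   # GFC value is starred in the table
    "store_data_bits": [5, 5, 4, 4, 4],
    "base_offset_bits": [5, 5, 6, 6, 6],
}


def generation_deltas():
    """For each consecutive pair of generations, list the properties that changed."""
    deltas = {}
    for i in range(len(GENS) - 1):
        changed = {
            name: (values[i], values[i + 1])
            for name, values in STORE_PROPERTIES.items()
            if values[i] != values[i + 1]
        }
        deltas[f"{GENS[i]}->{GENS[i + 1]}"] = changed
    return deltas


deltas = generation_deltas()
```

For these four properties the GL to GFC step is empty, which matches the text: the changes at that step are in the fence placement and the sub-op families, not in field widths or port counts.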
Load / Store Slot Symmetry and Asymmetries
The store slot is the structural mirror of the Memory-Load Slot:
- Mirror: both occupy a dedicated bundle slot (SLOT_VECTOR_STORE/SLOT_VECTOR_LOAD on TC; SLOT_VST/SLOT_VLD on SparseCore), decode through paired Encoder/Decoder classes in the same per-gen TensorCoreCodecBase template, and share the addressing-mode taxonomy (base+immediate-offset, base+register-offset NoOffset, strided, indexed). The SetIar* sub-ops live in the store slot but the IAR they set can be consumed by a subsequent indexed load and vice-versa. kVectorLoad and kVectorStore are adjacent in the internal LloOpcode enum (the store window starts at internal opcode 63).
- Asymmetry 1 (port counts): loads have NumVsSlots read ports (3/3/4/4), stores have MaxVectorStoreSlots write ports (1/1/1/2). VMEM is read-wide, write-narrow until Ghostlite doubles the write ports.
- Asymmetry 2 (write-enable): the store slot adds the sublane-mask/vmask write-enable, a write-side concept with no load analogue.
- Asymmetry 3 (RMW): only the store side has the SparseCore TEC reduce-add (read-modify-write) family.
- Asymmetry 4 (fence): store→load ordering to the same VMEM region is protected by the vector-store fence (RequiresVectorStoreFence @ 0x14020900, IsVmemReadPrecededByVmemStoreFence @ 0x14020820). A load following a store needs a fence; the reverse does not.
What Is Not Yet Pinned
- The absolute byte offset of the store slot inside the Jellyfish 41-B bundle. The V5+ BitCopy offsets are bundle-absolute; the Jellyfish per-field encoders route to shared immediate slots (Bundle bytes 31..39) whose absolute base needs the full Jellyfish Bundle byte map. LOW.
- The 34 SparseCore-TEC reduce-add inner field widths. The dispatch and the shared positions (359/353/347/340/337/333) are CONFIRMED; the per-variant Add/CircularBuffer/ReturnValue inner widths are not exhaustively dumped. HIGH.
- The Ghostlite glc store BitCopy field map: structurally identical to vxc/gfc (same sub-ops, same template), not separately dumped. HIGH.
- The full LloOpcode internal→proto integer for kScalarStore/kVectorCmemStore/kBarnaCoreVectorStore/kVectorStoreFence (the kVectorStore* window {63,64,65,68,69,70,460} is CONFIRMED; the others need the ProtoToLloOpcode switch decode).
- Target::IarsPerTensorCore() numeric value per gen (gates the indexed-store IAR count). LOW.
Cross-References
- Memory-Load Slot — the read-side mirror; shared addressing-mode taxonomy, IAR sharing, and the load/store asymmetries above.
- MemorySpace Enum — the 17-value runtime enum the store slot tier-selects on, and the proto↔enum remap.
- Bundle Model — the per-generation bundle widths (41/51/64) and slot taxonomy this slot plugs into.
- Pufferfish 51B Bundle: the absolute bundle bits of the vector_store slot (142..166) on PXC.
- Viperfish 64B Bundle: the V5+ per-slot Encoder::Encode + BitCopy model the VXC/GLC/GFC store slots are written under (data-vreg @170, base @157).
- MC-Emitter: the MC-layer store mnemonics and the register encoding table.
- Memory Subsystem Overview — the tier model (HBM/VMEM/SMEM/CMEM/SPMEM) the store slot writes to.