Memory-Store Slot

Every address, bit offset, opcode value, and string on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d, not stripped — full C++ symbols). .text/.rodata are mapped VA == file offset. Other wheel versions differ.

Abstract

The memory-store slot is the bundle slot that moves a compute register out to on-chip memory inside a single VLIW issue word — the write-side mirror of the Memory-Load Slot. It is distinct from intra-chip DMA: the store slot moves one vector register into VMEM (or one scalar register into SMEM) in one bundle cycle, whereas DMA moves tier→tier blocks via descriptors. Like the load slot it carries no tier-selector bit — the destination tier is a function of (slot, sub-opcode).

A store slot is a discriminated union whose sub-opcode selects both the destination tier and the addressing mode. The per-gen <Bundle><Slot>StoreEncoder::Encode(<msg> const&, absl::Span<uint8_t>) is the canonical byte codec, and on Pufferfish and all V5+ generations each field is written with one call to the universal bit packer BitCopy(dst, bit_offset, src, 0, width). Every bit position on this page is LSB-first (matching Bundle Model): bit 0 is the least-significant bit of byte 0, and the literal bit_offset argument is the absolute, LSB-numbered bundle bit, so the store-slot field map is read off the encoder disassembly with no shift arithmetic to invert. The store-data source register is always written first; the sub-opcode discriminator (read from the proto at +0x50 on the TensorCore store, +0x50/+0x58 on the V5e/V6e SparseCore-TEC store) then selects which addressing fields follow.
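The LSB-first convention can be illustrated with a small stand-in for the bit packer. This is a minimal Python sketch, not the library routine: the name bit_copy and its argument order only mirror the BitCopy(dst, bit_offset, src, 0, width) call shape quoted above, and the value 0b10110 is an arbitrary example.

```python
def bit_copy(dst: bytearray, dst_bit: int, src: int, src_bit: int, width: int) -> None:
    """Copy `width` bits of `src`, starting at `src_bit`, into `dst` at the
    absolute LSB-first bit position `dst_bit` (bit 0 = LSB of byte 0)."""
    for i in range(width):
        bit = (src >> (src_bit + i)) & 1
        byte_index, bit_index = divmod(dst_bit + i, 8)
        if bit:
            dst[byte_index] |= 1 << bit_index
        else:
            dst[byte_index] &= ~(1 << bit_index) & 0xFF


# A Pufferfish-sized (51-byte) bundle with a 5-bit store-data vreg written at
# absolute bit 162, the position given in the Pufferfish field map.
bundle = bytearray(51)
bit_copy(bundle, 162, 0b10110, 0, 5)
```

Bit 162 is bit 2 of byte 20 (162 = 20 × 8 + 2), so the 5-bit value occupies bits 2..6 of that byte and no other byte changes.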

Two structural facts dominate the per-generation story. First, the store slot is the write-side twin of the load slot: same dedicated bundle slot, same paired Encoder/Decoder in the same TensorCoreCodecBase template, same addressing-mode taxonomy (base+offset, base-register NoOffset, strided, indexed/scatter), but with three store-only additions: a sublane/vmask write-enable, the SparseCore reduce-add (read-modify-write) family, and the vector-store fence ordering barrier. Second, the per-gen deltas are concrete: the bundle grows 41 → 51 → 64 bytes; Pufferfish adds CMEM-store and 8 hardware vmask registers; Viperfish shrinks the store-data field 5 → 4 bits while growing base/offset 5 → 6 bits and adds the SparseCore TEC tile-SPMEM store with reduce-add; Ghostlite doubles the write ports 1 → 2; 6acc60406 moves the store-fence into the misc slot, reaches a 34-sub-op SparseCore store family, and adds atomic SMEM scalar stores.

For reimplementation, the contract is: the LloOpcodeIsVectorStore opcode window, the per-gen store Encoder/Decoder address table, the BitCopy absolute-bit field maps for PF/VF/GFC, the slot-selects-tier model, the two orthogonal masking mechanisms (bundle predicate vs vmask), and the per-gen deltas.

| Property | Value |
|---|---|
| Slot role | move one register's worth of data into VMEM/CMEM/SPMEM (vector) or SMEM (scalar) |
| Tier select | by (slot, sub-opcode) (no tier bit) |
| PXC store encoder | pxc::isa::TensorCoreVectorStoreEncoder::Encode @ 0x1ee3b440 (51-B bundle) |
| VXC store encoder | vxc::isa::TensorCoreVectorStoreEncoder::Encode @ 0x1f01ff60 (64-B bundle) |
| GFC store encoder | gxc::gfc::isa::TensorCoreVectorStoreEncoder::Encode @ 0x1fa08920 (64-B bundle) |
| JF store encoder | jellyfish::isa::EncoderJf::EncodeVectorStoreInstruction @ 0x1e868c40 (41-B, monolithic) |
| Bit primitive | BitCopy(dst, abs_bit, src, 0, width); abs_bit == absolute bundle bit (PF/V5+) |
| Opcode classifier | xla::jellyfish::LloOpcodeIsVectorStore @ 0x14024920 → window {63,64,65,68,69,70} + 460 |
| Store-data field | 5-bit on JF/PF, 4-bit on VF/GL/GFC; base/offset 5-bit → 6-bit at v5 |
| Write ports | MaxVectorStoreSlots = 1 on JF/PF/VF, 2 on GL/GFC |

The Store Op List (LLO Opcodes)

The store family in the LloOpcode enum (internal k-prefixed names; proto LloOpcodeProto OPCODE_-prefixed names):

| LloOpcode internal | LloOpcodeProto name | Destination tier | Slot |
|---|---|---|---|
| kVectorStore | OPCODE_VECTOR_STORE | VMEM | VECTOR_STORE |
| kVectorStoreMasked | OPCODE_VECTOR_STORE_MASKED | VMEM (masked) | VECTOR_STORE |
| kVectorStoreIndexed | OPCODE_VECTOR_STORE_INDEXED | VMEM (scatter) | VECTOR_STORE |
| kVectorStoreIndexedMasked | OPCODE_VECTOR_STORE_INDEXED_MASKED | VMEM (scatter+mask) | VECTOR_STORE |
| kVectorStoreSublaneShuffle | OPCODE_VECTOR_STORE_SUBLANE_SHUFFLE | VMEM (shuffled) | VECTOR_STORE |
| kVectorStoreEvenOddSublanes | OPCODE_VECTOR_STORE_EVEN_ODD_SUBLANES | VMEM (even/odd) | VECTOR_STORE |
| kVectorStoreFence | OPCODE_VECTOR_STORE_FENCE | (ordering barrier) | VECTOR_MISC (GFC) |
| kVectorCmemStore | OPCODE_VECTOR_CMEM_STORE | CMEM | VECTOR_STORE |
| kVectorCmemStorePseudo | OPCODE_VECTOR_CMEM_STORE_PSEUDO | CMEM (pseudo) | VECTOR_STORE |
| kBarnaCoreVectorStore | OPCODE_BARNA_CORE_VECTOR_STORE | BarnaCore VMEM | (BCS channel) |
| kScalarStore | OPCODE_SCALAR_STORE | SMEM | SCALAR_0/1 |

The classifier xla::jellyfish::LloOpcodeIsVectorStore (@ 0x14024920) is byte-exact: it returns true when (opcode - 63) <= 7 && _bittest(0xE7, opcode - 63) or opcode == 460. 0xE7 = 0b1110_0111 selects offsets {0,1,2,5,6,7}, i.e. internal opcodes {63,64,65,68,69,70} plus 460. Scalar-store, CMEM-store and BarnaCore-store are classified by separate predicates.
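The window test can be restated as a short Python sketch. The function and constant names are hypothetical; only the constants 63, 0xE7 and 460 come from the text. The original compare is presumably unsigned, so the explicit lower bound below is an assumption made to get the same result with Python integers.

```python
STORE_WINDOW_BASE = 63    # first internal opcode of the store window
STORE_WINDOW_MASK = 0xE7  # 0b1110_0111: offsets {0, 1, 2, 5, 6, 7}
EXTRA_STORE_OPCODE = 460  # the one store opcode outside the window


def llo_opcode_is_vector_store(opcode: int) -> bool:
    """Return True for opcodes the classifier described above accepts."""
    delta = opcode - STORE_WINDOW_BASE
    if 0 <= delta <= 7 and (STORE_WINDOW_MASK >> delta) & 1:
        return True
    return opcode == EXTRA_STORE_OPCODE


# Enumerate every accepted opcode in a range that covers both parts.
accepted = [op for op in range(512) if llo_opcode_is_vector_store(op)]
```

Enumerating the range recovers exactly the seven opcodes named in the text, and shows that 66 and 67 fall inside the window but are rejected by the mask.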

NOTE — the vector-store fence is not a store; it is an ordering barrier. On 6acc60406 it is encoded in the VectorMisc slot (gxc::gfc::isa::TensorCoreVectorMisc0Compact::vector_store_fence accessor), not the store slot. The bundle packer's RequiresVectorStoreFence (@ 0x14020900) gates it behind Target::SupportsVectorStoreFence(). See Per-Gen Deltas.


Store-Slot Encoder/Decoder Address Table

Each gen has a <Bundle><Slot>StoreEncoder::Encode and a matching <…>StoreDecoder::Decode. The decoder is the symmetric inverse BitExtract of the encoder.

TensorCore vector-store slot

| Gen | Namespace | Encode @ | Decode @ |
|---|---|---|---|
| Jellyfish | jellyfish::isa::EncoderJf::EncodeVectorStoreInstruction | 0x1e868c40 | (paired in DecoderJf) |
| Pufferfish | pxc::isa::TensorCoreVectorStore{En,De}coder | 0x1ee3b440 | 0x1ee2dae0 |
| Viperfish | vxc::isa::TensorCoreVectorStore{En,De}coder | 0x1f01ff60 | 0x1f01a560 |
| Ghostlite | gxc::glc::isa::TensorCoreVectorStore{En,De}coder | 0x1f3c3200 | 0x1f3bd780 |
| 6acc60406 | gxc::gfc::isa::TensorCoreVectorStore{En,De}coder | 0x1fa08920 | 0x1fa02f00 |

Jellyfish's TensorCore store is not a standalone codec class; it is the monolithic EncoderJf::EncodeVectorStoreInstruction(VectorStoreInstruction const&, Bundle const&) (@ 0x1e868c40) that fills the slot bits inline.

SparseCore TEC tile-SPMEM store slot (V5+ only)

| Gen | Namespace | Encode @ | Decode @ |
|---|---|---|---|
| Viperfish | vxc::vfc::isa::SparseCoreTecVectorStore{En,De}coder | 0x1e9c2760 | 0x1e9bbf40 |
| Ghostlite | gxc::glc::isa::SparseCoreTecVectorStore{En,De}coder | 0x1eb50ac0 | 0x1eb42300 |
| 6acc60406 | gxc::gfc::isa::SparseCoreTecVectorStore{En,De}coder | 0x1eccbe20 | 0x1ecbd7a0 |

There is no Jellyfish/Pufferfish SparseCore TEC store — the SparseCore TEC sequencer exists only from Viperfish onward. The BarnaCore channel vector-store (Pufferfish only) is pxc::pfc::isa::BarnaCoreChannelVectorStore{En,De}coder (Encode 0x1e8c5640, Decode 0x1e8c4d40).

Scalar-store slot (SREG → SMEM)

Routed through the scalar-slot encoder, not the vector-store slot:

  • Jellyfish: ScalarStoreOperands proto, emitted by EncoderJf::EncodeScalarInstruction (@ 0x1e862060).
  • Pufferfish: TensorCoreScalar1_ScalarStoreSmemAbsolute via ScalarYEncode (@ 0x1c471c60) + SetImmOrDie (@ 0x1c550660); BarnaCore-sequencer variant via 0x1c471fc0/0x1c550e00.
  • Viperfish/Ghostlite: …ScalarAlu_ScalarStoreXToSmemY.
  • 6acc60406: adds …ScalarStoreXToSmemSumDestAndY (atomic add), …ScalarStoreCircularBuffer, and SmemFetchAndAdd (via EmitFetchAndAddOp @ 0x13a3a300).


Store Slot Position in the Bundle

The store slot is one fixed slot in the per-gen VLIW bundle; the slot order is the per-gen TensorCoreCodecBase<…> template-arg order:

ScalarAlu0 / ScalarAlu1 / [Predicates] / Immediates / VectorScalar /
Dma / VectorAlu0 / VectorAlu1 / [VectorAlu2 / VectorAlu3] /
**VectorStore** / VectorLoad0 / [VectorLoad1] / VectorMisc /
VectorExtended0 / [VectorExtended1] / VectorResult0 / [VectorResult1]

(Pufferfish carries VectorAlu0/1 + VectorStore + VectorLoad + CmemLoad + ….) The slot-enumeration constants name it: isa::SLOT_VECTOR_STORE (TC), isa::SLOT_SCALAR_0/1 and SLOT_SCALAR_ALU_0/1 (scalar stores), isa::SLOT_VECTOR_MISC (overlay target / store fence), llvm::TPU::SparseCoreMCSlot::SLOT_VST and SLOT_SST (SparseCore vector/scalar store). Bundle widths: Jellyfish/Dragonfish 41 B, Pufferfish 51 B, V5+ TensorCore 64 B; SparseCore SCS 32 B, TEC 64 B.

On Pufferfish the vector_store slot occupies absolute bundle bits 142..166 (see Pufferfish 51B Bundle), disjoint from the vector_load slot at 119..140 — so a load and a store can co-exist in one bundle. The constraint "SCALAR_STORE_ABSOLUTE not allowed in slot 0." forces a scalar-store absolute on Jellyfish/Pufferfish into SCALAR_1, never SCALAR_0. The accounting predicate store_and_misc_slots == module.target().MaxVectorStoreSlots() ties the number of store-bearing slots to the per-gen write-port count.


Bit-Field Layout (BitCopy Field Maps)

Each per-slot Encoder fills the slot via BitCopy(dst, abs_bit, src, 0, width); the store-data source register is written first, then the sub-opcode discriminator selects the addressing fields. The maps below are harvested directly from the BitCopy(a3, …) call sites in the decompiled encoders — the second argument is the absolute bundle bit.

Pufferfish (PXC) TensorCoreVectorStore — Encode @ 0x1ee3b440 (51-B bundle)

Byte-exact from the encoder body (BitCopy(a3, 162, …, 5) etc.):

| Field | abs bit | width | meaning |
|---|---|---|---|
| source vreg | 162 | 5 | store-data vreg (proto +28); = 31 ⇒ NOOP (kNeverExecute) |
| sub-opcode | 157 | 5 | VmemStore=0 / CmemStore≠0 discriminator |
| base address / Vmem offset | 152 | 5 | VregNumber base |
| offset | 149 | 3 | immediate-offset slot index |
| stride | 147 | 2 | stride select |
| sublane-mask / vmask | 145 | 2 | sublane/vmask select |
| (CMEM full-address words) | 304 / 288 / 272 | 16 | CMEM 16-bit immediate address (CmemStore only) |
| (CMEM address regs) | 251 / 246 / 241 | 5 | shared Y-register selectors |

The PF store slot region (abs 142..166) matches the Pufferfish 51B Bundle page exactly. Sub-opcode dispatch (proto +0x50): 0=Noop, 5=NoopAlt, 6=CmemStore, 7=CmemStoreNoOffset, 8=VmemStore, 9=VmemStoreNoOffset, 0xA-0x11=VmemStoreVmsk0..7, 0x12=VmemStoreIndexed, 0x13=VmemStoreIndexedNoOffset, 0x14-0x1B=VmemStoreIndexedVmsk0..7, 0x1C=SetIarLane, 0x1D=SetIarSublane, 0x1E=SetIarRaw, 0x1F=PushV2s. The case values span 0..0x1F (32 code points); the 28 listed here are named, and cases 1-4 are not listed.
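The Pufferfish dispatch list and field map can be written down as data and cross-checked. The sketch below is a hypothetical Python restatement: the function name and the short field keys are illustrative labels, while the case values and (bit, width) pairs are copied from the text above.

```python
# Case values with fixed names, copied from the dispatch list above.
PF_FIXED_SUBOPS = {
    0x00: "Noop", 0x05: "NoopAlt", 0x06: "CmemStore", 0x07: "CmemStoreNoOffset",
    0x08: "VmemStore", 0x09: "VmemStoreNoOffset",
    0x12: "VmemStoreIndexed", 0x13: "VmemStoreIndexedNoOffset",
    0x1C: "SetIarLane", 0x1D: "SetIarSublane", 0x1E: "SetIarRaw", 0x1F: "PushV2s",
}


def pf_store_subop_name(case: int) -> str:
    """Map a Pufferfish store sub-opcode case value to its listed name."""
    if case in PF_FIXED_SUBOPS:
        return PF_FIXED_SUBOPS[case]
    if 0x0A <= case <= 0x11:
        return f"VmemStoreVmsk{case - 0x0A}"
    if 0x14 <= case <= 0x1B:
        return f"VmemStoreIndexedVmsk{case - 0x14}"
    raise ValueError(f"case {case:#x} is not named in the dispatch list")


# (absolute bit, width) per field; key names are hypothetical labels.
PF_STORE_FIELDS = {
    "source_vreg": (162, 5), "sub_opcode": (157, 5), "base": (152, 5),
    "offset": (149, 3), "stride": (147, 2), "mask": (145, 2),
}

named_cases = []
for case in range(32):
    try:
        pf_store_subop_name(case)
        named_cases.append(case)
    except ValueError:
        pass

lowest_bit = min(bit for bit, _ in PF_STORE_FIELDS.values())
highest_bit = max(bit + width - 1 for bit, width in PF_STORE_FIELDS.values())
```

The six TensorCore fields cover bits 145..166, inside the 142..166 slot region; the CMEM address words and Y-register selectors are omitted because they sit outside that region.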

Viperfish (VXC) TensorCoreVectorStore — Encode @ 0x1f01ff60 (64-B bundle)

Byte-exact (BitCopy(a3, 170, …, 4) etc.):

| Field | abs bit | width | meaning |
|---|---|---|---|
| source vreg | 170 | 4 | store-data vreg (proto +28) |
| sub-opcode discriminator | 167 | 3 | addr-mode/family |
| secondary opcode / vmsk | 163 | 4 | mask / sub-variant |
| base address / offset | 157 | 6 | VregNumber base |
| stride (Base variants) | 153 | 4 | stride select |
| (Base-variant field) | 151 | 2 | |
| (trailing field) | 148 | 3 | Vs select / mask |
| address / mask field | 144 | 4 | |

The data-vreg @170 w4 and base @157 w6 match the Viperfish 64B Bundle page exactly. Sub-opcode dispatch (proto +0x50): 0=Noop, 5=VectorStore, 6=VectorStoreBase, 7=VectorStoreMasked, 8=VectorStoreBaseMasked, 9=VectorStoreShuffled, 0xA=VectorStoreShuffledBase, 0xB=VectorStoreShuffledMasked, 0xC=VectorStoreShuffledBaseMasked, 0xD=VectorStoreIndexed0, 0xE=VectorStoreIndexed1, 0xF=VectorStoreIndexed0Masked, 0x10=VectorStoreIndexed1Masked, 0x11-0x16=SetLaneIar0/1, SetSublaneIar0/1, SetRawIar0/1, 0x17=PushV2s, 0x18=SetPrng. The case values span 0..0x18 (25 code points); the 21 listed here are named, and cases 1-4 are not listed.
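Because the field positions are absolute and LSB-first, an encoder and its inverse decoder can both be driven from one field table. The following is a minimal Python sketch under that assumption; encode_fields, decode_fields and the short field keys are hypothetical names, and the values are raw field contents, not proto case numbers.

```python
# (absolute bit, width) for five Viperfish store-slot fields from the table above.
VF_STORE_FIELDS = {
    "source_vreg": (170, 4),
    "sub_opcode": (167, 3),
    "secondary_opcode": (163, 4),
    "base": (157, 6),
    "stride": (153, 4),
}


def encode_fields(values: dict, fields: dict, bundle_bytes: int = 64) -> bytes:
    """Pack raw field values into a bundle, treating it as one LSB-first integer."""
    word = 0
    for name, value in values.items():
        bit, width = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << bit
    return word.to_bytes(bundle_bytes, "little")


def decode_fields(bundle: bytes, fields: dict) -> dict:
    """Inverse of encode_fields: extract every field listed in `fields`."""
    word = int.from_bytes(bundle, "little")
    return {name: (word >> bit) & ((1 << width) - 1)
            for name, (bit, width) in fields.items()}


example = {"source_vreg": 9, "sub_opcode": 5, "secondary_opcode": 0,
           "base": 33, "stride": 1}
bundle = encode_fields(example, VF_STORE_FIELDS)
decoded = decode_fields(bundle, VF_STORE_FIELDS)
```

The range check makes the 4-bit store-data field visible: a register number of 16 or more is rejected, whereas the 5-bit Pufferfish field would accept it.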

6acc60406 (GFC) TensorCoreVectorStore — Encode @ 0x1fa08920 (64-B bundle)

Byte-exact (BitCopy(a3, 169, …, 2) etc.):

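| Field | abs bit | width |
|---|---|---|
| sub-opcode top field | 169 | 2 |
| sub-opcode | 166 | 3 |
| secondary opcode | 162 | 4 |
| base address / offset | 156 | 6 |
| stride / mask | 152 | 4 |
| (Base-variant field) | 150 | 2 |
| (field) | 147 | 3 |
| (field) | 143 | 4 |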

Sub-opcode dispatch is identical to Viperfish. Ghostlite's TensorCoreVectorStoreEncoder (@ 0x1f3c3200) is structurally identical to VXC/GFC (same sub-op set, same template) with gen-shifted bit offsets; its field map was not separately dumped — HIGH by template identity.
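The "gen-shifted bit offsets" remark can be checked against the two dumped field maps. The Python sketch below pairs the seven lower fields of the Viperfish and GFC tables by row position, which is an assumption since the row labels differ slightly, and also checks that each full map tiles a contiguous bit range. All names are hypothetical.

```python
# (absolute bit, width) rows from the Viperfish and GFC tables, top field first.
VF_MAP = [(170, 4), (167, 3), (163, 4), (157, 6), (153, 4), (151, 2), (148, 3), (144, 4)]
GFC_MAP = [(169, 2), (166, 3), (162, 4), (156, 6), (152, 4), (150, 2), (147, 3), (143, 4)]


def is_contiguous(field_map) -> bool:
    """True if the fields, sorted by start bit, leave no gaps and no overlaps."""
    ordered = sorted(field_map)
    return all(bit + width == next_bit
               for (bit, width), (next_bit, _) in zip(ordered, ordered[1:]))


# Compare the seven fields below the top field, paired by table row.
lower_vf, lower_gfc = VF_MAP[1:], GFC_MAP[1:]
shifts = [g_bit - v_bit for (v_bit, _), (g_bit, _) in zip(lower_vf, lower_gfc)]
widths_match = all(v_w == g_w for (_, v_w), (_, g_w) in zip(lower_vf, lower_gfc))
```

Under that pairing every lower field keeps its width and moves down by exactly one bit from Viperfish to GFC; only the top field differs (4 bits at 170 versus 2 bits at 169).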

6acc60406 (GFC) SparseCoreTecVectorStore — Encode @ 0x1eccbe20 (64-B TEC bundle)

Byte-exact (BitCopy(a3, 359, …, 3) etc.), store slot in the upper bit region — the richest store family:

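| Field | abs bit | width |
|---|---|---|
| opcode | 359 | 3 (4 alt) |
| normal-predication | 362 | 1 |
| rotate-predication | 363 | 1 |
| base / source | 353 | 6 |
| offset | 347 | 6 |
| stride | 340 | 3 |
| mask | 337 | 3 |
| (field) | 333 | 4 |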

Sub-opcode dispatch (proto +0x4c, 34 cases): 0=Noop, 8=TileSpmemStore, 9=TileSpmemStoreCircularBuffer, then the reduce-add matrix — TileSpmemStoreAdd{S32,S16,F32,Bf16}, *CircularBufferAdd*, *CircularBufferPostUpdateAdd*, TileSpmemIndexedStore, *IndexedCircularBuffer*, *IndexedAdd{S32,S16,F32,Bf16}*, *IndexedReturnValueAdd*, *IndexedCircularBufferReturnValueAdd{S32,S16,F32,Bf16}*. The per-variant inner field widths beyond the shared positions above are not exhaustively dumped (HIGH).

Jellyfish EncodeVectorStoreInstruction — @ 0x1e868c40 (41-B bundle)

Byte-exact dispatch: encodes predication first (EncodePredication<VectorStoreInstruction> @ the call before the switch), then switch(*((_DWORD*)a2 + 12)) — the sub-opcode at proto +12:

  • family A {0,6,0x10-0x17}: sublane-mask + stride store (EncodeVectorSublaneMaskEncoding, EncodeVectorStrideEncoding).
  • family B {1,7,0x18-0x1F}: offset + base + sublane-mask + stride store (EncodeVectorOffsetEncoding, EncodeVectorBaseAddressEncoding, plus the two above).
  • family C {2,3,4}: indexed store (SetIndexedAddressOperands, routed by v36[3]&1).
  • case 5: NOOP. default: "Not implemented yet." fatal.

The per-field encoders write into shared immediate slots: an inner switch(v12[8]) with cases 0-3 chooses which of the 6 immediate slots holds the 5-bit register number, written at Bundle bytes 31..39. The store-data vreg is packed at byte 22 (5-bit) / byte 13; the TTU operand at byte 19/63. The absolute slot base inside the 41-byte bundle comes from the Bundle byte map (LOW — not pinned here).


Addressing Modes

Four addressing modes, selected by sub-opcode — the same taxonomy as the load slot:

  • (a) base + immediate-offset (VmemStore / VectorStore with offset): base is a granule-scaled VMEM byte offset; offset is an immediate placed in an immediate slot.
  • (b) base register, no offset (VmemStoreNoOffset / VectorStoreBase): base held in a Vs register, no immediate-offset field. Pufferfish PlaceVsRegister<BaseAddress, VmemStore> places the base into one of 3 Vs slots.
  • (c) strided: every store carries a stride field. Pufferfish StrideToSyU32<VmemStore> (@ 0x1d2970a0) encodes the stride as either an Sreg number or a U32 immediate (variant<SregNumber, ImmediateValueU32>); the destination address moves by stride granules between sublanes.
  • (d) indexed / scatter (VmemStoreIndexed PXC / VectorStoreIndexed0/1 VXC/GXC / TileSpmemIndexedStore SC-TEC): per-lane destination addresses come from an index-address register (IAR) staged via SetIar*; the store scatters vector lanes to per-lane addresses. VXC/GXC have two index ports (Indexed0/Indexed1) for two coexisting scatter streams; PXC has one.

The IAR is set by dedicated store-slot sub-ops: PXC SetIarLane/SetIarSublane/SetIarRaw (cases 0x1C-0x1E); VXC/GXC SetLaneIar0/1/SetSublaneIar0/1/SetRawIar0/1 (cases 0x11-0x16). The guard kVectorStoreEvenOddSublanesIar < target.IarsPerTensorCore() confirms the IAR count is per-gen.
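The four modes can be summarized as a toy address generator. This is a simplified Python model, not the hardware's address logic: the function name, the unit (granules), the per-sublane indexing and the treatment of the IAR as a plain list of addresses are all assumptions made for illustration.

```python
def store_addresses(base: int, stride: int, num_sublanes: int,
                    offset: int = 0, iar=None) -> list:
    """Toy model of the destination address of each sublane, in granules.

    offset=0 models the NoOffset/Base variants (mode b); a non-zero offset
    models base + immediate-offset (mode a); `stride` is applied between
    sublanes (mode c); a non-None `iar` models indexed/scatter stores (mode d),
    where every address comes from the index-address register instead.
    """
    if iar is not None:
        if len(iar) != num_sublanes:
            raise ValueError("IAR must supply one address per sublane")
        return list(iar)
    start = base + offset
    return [start + i * stride for i in range(num_sublanes)]


dense = store_addresses(0x100, 1, 4, offset=8)          # base + immediate offset
strided = store_addresses(0x100, 2, 4)                  # base only, stride 2
scatter = store_addresses(0, 1, 4, iar=[7, 3, 40, 12])  # per-sublane addresses
```

In this model only the scatter mode can produce non-monotonic addresses, which is why it needs a separately staged IAR rather than a single base and stride.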


Memory-Tier Destination Encoding

The destination tier is selected by which sub-opcode (not a tier bit), within the single VectorStore slot:

| Destination tier | Sub-op family / path |
|---|---|
| VMEM | VmemStore* (PXC 8-0x1B) / VectorStore* (VXC/GXC 5-0x10) |
| CMEM | CmemStore/CmemStoreNoOffset (PXC cases 6-7; 16-bit imm address @304/288/272) |
| SPMEM | TileSpmemStore* (SparseCore TEC slot) |
| SMEM | ScalarStoreSmemAbsolute/ScalarStoreXToSmemY (SCALAR/SCALAR_ALU slot, not vector-store) |
| BarnaCore VMEM | BarnaCoreChannelVectorStore (Pufferfish PFC channel slot) |

The LLO opcode (kVectorStore vs kVectorCmemStore vs kScalarStore) carries the tier choice down from the IR; the per-gen lowering maps it to the matching sub-opcode in the matching slot. This is the same slot-selects-tier model the load slot uses, keyed off the runtime MemorySpace Enum; the SMEM tier is asserted by the buffer-assignment invariant dest_address->memory_space() == MemorySpace::kSmem. CMEM-store sub-ops exist only on Pufferfish; higher gens fold CMEM into VMEM-class addressing or lower it via kVectorCmemStorePseudo.
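The slot-selects-tier model amounts to a lookup from LLO opcode to a (slot, tier) pair. The Python sketch below restates part of the store op list as such a table; the dictionary and function names are hypothetical, and the slot and tier strings are the labels used in that list.

```python
# LLO opcode name -> (bundle slot, destination tier), from the store op list.
STORE_ROUTING = {
    "kVectorStore": ("VECTOR_STORE", "VMEM"),
    "kVectorStoreMasked": ("VECTOR_STORE", "VMEM"),
    "kVectorStoreIndexed": ("VECTOR_STORE", "VMEM"),
    "kVectorCmemStore": ("VECTOR_STORE", "CMEM"),
    "kBarnaCoreVectorStore": ("BCS channel", "BarnaCore VMEM"),
    "kScalarStore": ("SCALAR_0/1", "SMEM"),
}


def route_store(opcode_name: str) -> tuple:
    """Return the (slot, tier) pair an LLO store opcode lowers to."""
    return STORE_ROUTING[opcode_name]


def tiers_reachable_from(slot: str) -> set:
    """All destination tiers that some opcode reaches through `slot`."""
    return {tier for s, tier in STORE_ROUTING.values() if s == slot}


vector_store_tiers = tiers_reachable_from("VECTOR_STORE")
```

Two tiers share the VECTOR_STORE slot, so the slot alone does not determine the tier; the opcode (and hence the sub-opcode it lowers to) supplies the rest.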


Store Granularity

The store moves one full vector register per cycle, modulated by:

  • SublaneMask (sublane_mask proto field): a per-sublane write-enable. On VXC/GXC the Masked sub-op variants add an explicit mask; on PXC the 8 VmemStoreVmsk0..7 variants select one of 8 hardware vector-mask registers.
  • Even/odd-sublane split (kVectorStoreEvenOddSublanes): store only even or only odd sublanes (sublane-stride 2) — VXC/GXC via the sublane-stride field, Jellyfish via family A's stride encoder.
  • Indexed/scatter (per-lane): the finest granularity; each lane writes an independent address from the IAR.
  • Reduce-add (SparseCore TEC only, V5+): TileSpmemStoreAdd{S32,S16,F32,Bf16} — store-with-accumulate (SPMEM[addr] += vector), plus circular-buffer post-update and return-value variants. This is the only store family that reads the destination (read-modify-write).

There is no per-element count field beyond the sublane mask and stride; the store always processes the full SIMD width. The validation string "VectorStore mask length is not equal to target SIMD width." enforces that the mask covers exactly the lane/sublane count.
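The write-enable and reduce-add behaviours described above can be modelled on a flat list standing in for memory. This is an illustrative Python sketch only: the function name, the list-based memory and the one-address-per-sublane layout are assumptions, and the length check merely mirrors the quoted validation string.

```python
def masked_store(memory: list, addresses: list, vector: list,
                 sublane_mask: list, accumulate: bool = False) -> None:
    """Write `vector` to `memory`, one element per sublane.

    Sublanes whose mask entry is false are left untouched. With
    accumulate=True the store becomes a read-modify-write
    (memory[addr] += value), as in the reduce-add family.
    """
    if not (len(addresses) == len(vector) == len(sublane_mask)):
        raise ValueError("mask length is not equal to the vector width")
    for addr, value, enabled in zip(addresses, vector, sublane_mask):
        if not enabled:
            continue
        memory[addr] = memory[addr] + value if accumulate else value


mem = [10] * 8
# Even-sublane store: only sublanes 0 and 2 are written.
masked_store(mem, [0, 1, 2, 3], [1, 2, 3, 4], [True, False, True, False])
after_plain = list(mem)
# Reduce-add into the first two locations only.
masked_store(mem, [0, 1, 2, 3], [5, 5, 5, 5], [True, True, False, False],
             accumulate=True)
after_add = list(mem)
```

The plain store never reads `memory`, while the accumulate path does, which is the read-modify-write distinction drawn in the last bullet.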


Masked / Predicated Store

Two orthogonal masking mechanisms:

  1. Bundle predication — every store slot carries a predicate-register reference (5-bit on Jellyfish via EncodePredication<VectorStoreInstruction>; a predication enum on V5+). Values 0..14 = predicate register, 15 = kAlwaysExecute, 31 = kNeverExecute (NOP); a false predicate suppresses the entire store. The proto field is predication (TensorcorePredication on PXC, Predication on glc/gfc; SparseCore TEC additionally splits normal_predication @362 and rotate_predication @363).
  2. Vector write-mask (sublane/vmask) — a per-sublane (or per-lane) write enable, independent of the bundle predicate. PXC has 8 hardware vmask registers (Vmsk0..7, sub-ops VmemStoreVmsk0..7 and VmemStoreIndexedVmsk0..7). VXC/GXC carry an explicit mask field (SparsecoreVectorMask on TEC) and the Masked sub-op variants.

Predicated store = bundle-predicate gating; masked store = vmask/sublane write-enable. Both can apply at once (e.g. a predicated VmemStoreVmsk3).
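The orthogonality of the two mechanisms can be sketched as two nested gates. This hypothetical Python model uses the 5-bit predicate encoding quoted above (0..14, 15, 31); encodings 16..30 are not described in the text, so the sketch rejects them rather than guess, and the function names are illustrative.

```python
K_ALWAYS_EXECUTE = 15
K_NEVER_EXECUTE = 31


def store_executes(pred_field: int, predicate_regs: list) -> bool:
    """Bundle predication: decide whether the store issues at all."""
    if pred_field == K_ALWAYS_EXECUTE:
        return True
    if pred_field == K_NEVER_EXECUTE:
        return False
    if 0 <= pred_field <= 14:
        return bool(predicate_regs[pred_field])
    raise ValueError("predicate encodings 16..30 are not covered here")


def sublanes_written(pred_field: int, predicate_regs: list, vmask: list) -> list:
    """Combine both gates: the predicate gates the whole store,
    the vmask then gates each sublane independently."""
    if not store_executes(pred_field, predicate_regs):
        return [False] * len(vmask)
    return [bool(bit) for bit in vmask]


regs = [False] * 15
regs[3] = True
predicated_and_masked = sublanes_written(3, regs, [1, 0, 1, 1])
suppressed = sublanes_written(4, regs, [1, 0, 1, 1])
```

A false predicate yields an all-false result whatever the mask says, while a true predicate leaves the mask to decide each sublane on its own.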


Per-Gen Deltas

| Property | JF (v2) | PF (v4) | VF (v5p) | GL (v6e) | GFC (TPU7x) |
|---|---|---|---|---|---|
| TC bundle width (B) | 41 | 51 | 64 | 64 | 64 |
| MaxVectorStoreSlots (write ports) | 1 | 1 | 1 | 2 | 2* |
| NumVsSlots (read/base ports) | 3 | 3 | 4 | 4 | 4* |
| Store-data vreg field width | 5 | 5 | 4 | 4 | 4 |
| Base/offset field width | 5 | 5 | 6 | 6 | 6 |
| CanOverlayInMiscSlot | false | (n/a) | a2==1 | true | true* |
| Hardware vmask registers | sublane only | 8 (Vmsk0..7) | mask field | mask field | mask field |
| Indexed/scatter ports | 1 | 1 | 2 (Idx0/1) | 2 | 2 |
| CMEM store sub-ops | no | yes | no | no | no |
| SparseCore TEC tile-SPMEM store | n/a | n/a | yes | yes | yes (+RMW) |
| SPMEM reduce-add (S32/S16/F32/Bf16) | n/a | n/a | yes | yes | yes (34 sub-ops) |
| Scalar-store SMEM atomic add | no | no | no | no | yes (SumDestAndY / FetchAndAdd) |
| SupportsVectorStoreFence | false | false | false | false | misc-slot |

Verified target overrides: JellyfishTarget::MaxVectorStoreSlots=1 (@0x1d4916a0), Pufferfish=1 (@0x1d495be0), Viperfish=1 (@0x1d49c0a0), Ghostlite=2 (@0x1d498ca0). CanOverlayInMiscSlot: Jellyfish=false (@0x1d491680), Ghostlite=true (@0x1d498c80). SupportsVectorStoreFence=false on JF/PF/VF/GL (@0x1d48f660/0x1d493fa0/0x1d499f00/0x1d497060). Dragonfish (v3) shares the Jellyfish codec.

Structural deltas:

  • v2→v4 (JF→PF): bundle 41→51 B; CMEM-store sub-ops added; 8 explicit hardware vmask registers (VmemStoreVmsk0..7).
  • v4→v5p (PF→VF): bundle 51→64 B; store-data field 5→4 bits while base/offset 5→6 bits; CMEM-store dropped from the TC slot; sublane-shuffle store added; 2 scatter index ports; SparseCore TEC tile-SPMEM store family (with reduce-add) introduced.
  • v5p→v6e (VF→GL): MaxVectorStoreSlots 1→2 (two VMEM write ports); CanOverlayInMiscSlot fully true — the store slot can overlay the misc slot, sharing the vmask (CheckVectorStoreSlotAndMiscSlotShareVmsk in TensorCoreCodecBase).
  • v6e→TPU7x (GL→GFC): store-fence moves into the VectorMisc compact message; scalar SMEM stores gain atomic add (ScalarStoreXToSmemSumDestAndY, SmemFetchAndAdd); SparseCore TEC store reaches 34 sub-ops (full reduce-add + circular-buffer + indexed + return-value matrix).

(*) The 6acc60406 (gfc) store reuses the gxc EncoderBase template; the V5+ property values are confirmed for VF/GL via the Target overrides, and the gfc-specific values marked * follow from the shared template plus the verified gfc store-encoder bit map (0x1fa08920).
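The numeric rows of the per-gen table lend themselves to a small queryable structure. The Python sketch below copies six of those rows and computes which properties change between two generations; the key names and the helper are hypothetical, and the GFC entries for write ports and Vs slots are the starred (template-inferred) values.

```python
GENS = ("JF", "PF", "VF", "GL", "GFC")

# One tuple per property, ordered as in GENS; values copied from the table.
PROPS = {
    "bundle_bytes": (41, 51, 64, 64, 64),
    "max_vector_store_slots": (1, 1, 1, 2, 2),
    "num_vs_slots": (3, 3, 4, 4, 4),
    "store_data_bits": (5, 5, 4, 4, 4),
    "base_offset_bits": (5, 5, 6, 6, 6),
    "scatter_ports": (1, 1, 2, 2, 2),
}


def deltas(old_gen: str, new_gen: str) -> dict:
    """Return {property: (old, new)} for every property that differs."""
    i, j = GENS.index(old_gen), GENS.index(new_gen)
    return {name: (row[i], row[j]) for name, row in PROPS.items() if row[i] != row[j]}


pf_to_vf = deltas("PF", "VF")
vf_to_gl = deltas("VF", "GL")
```

Among these numeric properties the VF→GL step changes only the write-port count, and the GL→GFC step changes none; the GFC differences are in the non-numeric rows (fence placement, atomic SMEM stores, TEC sub-op count).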


Load / Store Slot Symmetry and Asymmetries

The store slot is the structural mirror of the Memory-Load Slot:

  • Mirror: both occupy a dedicated bundle slot (SLOT_VECTOR_STORE/SLOT_VECTOR_LOAD on TC; SLOT_VST/SLOT_VLD on SparseCore), decode through paired Encoder/Decoder classes in the same per-gen TensorCoreCodecBase template, and share the addressing-mode taxonomy (base+immediate-offset, base-register NoOffset, strided, indexed). The SetIar* sub-ops live in the store slot, but the IAR they set can be consumed by a subsequent indexed load, and vice versa. kVectorLoad and kVectorStore are adjacent in the internal LloOpcode enum (the store window starts at internal opcode 63).
  • Asymmetry 1 — port counts: loads have NumVsSlots read ports (3/3/4/4), stores have MaxVectorStoreSlots write ports (1/1/1/2). VMEM is read-wide, write-narrow until Ghostlite doubles the write ports.
  • Asymmetry 2 — write-enable: the store slot adds the sublane-mask/vmask write-enable, a write-side concept with no load analogue.
  • Asymmetry 3 — RMW: only the store side has the SparseCore TEC reduce-add (read-modify-write) family.
  • Asymmetry 4 — fence: store→load ordering to the same VMEM region is protected by the vector-store fence (RequiresVectorStoreFence @ 0x14020900, IsVmemReadPrecededByVmemStoreFence @ 0x14020820) — a load following a store needs a fence; the reverse does not.
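The store→load ordering rule in Asymmetry 4 can be illustrated with a toy hazard scan. This Python sketch is not a reconstruction of RequiresVectorStoreFence or IsVmemReadPrecededByVmemStoreFence: the operation tuples, the half-open address ranges and the function name are all assumptions chosen to show the one-directional rule.

```python
def loads_needing_fence(ops: list) -> list:
    """Return indices of loads that read a region written by an earlier
    store with no fence in between.

    `ops` is a list of (kind, start, length) tuples with kind in
    {"store", "load", "fence"}; regions are half-open [start, start+length).
    """
    pending_stores = []  # regions written since the last fence
    flagged = []
    for index, (kind, start, length) in enumerate(ops):
        if kind == "fence":
            pending_stores.clear()
        elif kind == "store":
            pending_stores.append((start, start + length))
        elif kind == "load":
            end = start + length
            if any(s < end and start < e for s, e in pending_stores):
                flagged.append(index)
    return flagged


program = [
    ("store", 0, 8),    # writes [0, 8)
    ("load", 4, 8),     # overlaps the store above: hazard
    ("fence", 0, 0),
    ("load", 4, 8),     # same region, but after a fence
    ("load", 100, 8),
    ("store", 100, 8),  # store after load: not flagged
]
hazards = loads_needing_fence(program)
```

Only the load at index 1 is flagged; the store at index 5 that follows a load to the same region is not, which reflects the rule that the reverse ordering needs no fence.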

What Is Not Yet Pinned

  • The absolute byte offset of the store slot inside the Jellyfish 41-B bundle. The Pufferfish and V5+ BitCopy offsets are bundle-absolute; the Jellyfish per-field encoders route to shared immediate slots (Bundle bytes 31..39) whose absolute base needs the full Jellyfish Bundle byte map. LOW.
  • The per-variant inner field widths of the 34 SparseCore-TEC store sub-ops. The dispatch and the shared positions (359/353/347/340/337/333) are CONFIRMED; the per-variant Add/CircularBuffer/ReturnValue inner widths are not exhaustively dumped. HIGH.
  • The Ghostlite glc store BitCopy field map — structurally identical to vxc/gfc (same sub-ops, same template), not separately dumped. HIGH.
  • The full LloOpcode internal→proto integer for kScalarStore/kVectorCmemStore/kBarnaCoreVectorStore/kVectorStoreFence (the kVectorStore* window {63,64,65,68,69,70,460} is CONFIRMED; the others need the ProtoToLloOpcode switch decode).
  • Target::IarsPerTensorCore() numeric value per gen (gates the indexed-store IAR count). LOW.

Cross-References

  • Memory-Load Slot — the read-side mirror; shared addressing-mode taxonomy, IAR sharing, and the load/store asymmetries above.
  • MemorySpace Enum — the 17-value runtime enum the store slot tier-selects on, and the proto↔enum remap.
  • Bundle Model — the per-generation bundle widths (41/51/64) and slot taxonomy this slot plugs into.
  • Pufferfish 51B Bundle — the absolute bundle bits of the vector_store slot (142..166) on PXC.
  • Viperfish 64B Bundle — the V5+ per-slot Encoder::Encode + BitCopy model the VXC/GLC/GFC store slots are written under (data-vreg @170, base @157).
  • MC-Emitter — the MC-layer store mnemonics and the register encoding table.
  • Memory Subsystem Overview — the tier model (HBM/VMEM/SMEM/CMEM/SPMEM) the store slot writes to.