VectorLoad Slot
Every opcode value, mask immediate, field shift/width, and per-generation count on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d): from the per-op SparseCoreTecVectorLoadTileSpmemLoad<Form>Opcode::Matches() compare immediates, the …<Form><Field>Field::GetConcatenatedValue() accessor shifts, and the SparseCoreTecVectorExtended…SourceOneField / GetVexSourcePortEncoding decode. Addresses apply to this build; other versions differ.
Abstract
VectorLoad is the read-side mirror of the VectorStore slot in the 64-byte TEC bundle: the slot that pulls a gathered row out of per-tile SRAM (TILE_SPMEM) into a VREG so the VectorExtended scan engine can reduce it. Where VectorStore is a type×mode opcode matrix (it must carry the accumulate dtype), VectorLoad is addressing-mode-only: 5 ops over a 3-bit opcode field, all dtype-agnostic. The element type is set by the consuming op, not by the load — a reimplementer who looks for an S32/F32/S16/Bf16 split in the load roster (as exists on the store side) will not find one, and must not synthesize it.
The decisive structural fact is that the load packs every operand field into ONE word — the 8-byte word at struct offset 0x28 — because the load produces a Dest (a VREG written) rather than driving a store-source mux. The store puts its data fields in word 0x30 (Source@27); the load puts its Dest@52 and Index@27 and the address-build fields all in word 0x28. The two slots are deliberately symmetric at the bit level: the load Dest (word0x28 >> 52 & 0x3f) is the exact same mux position the store's IndexedReturnValueAdd fetch-and-add returns its pre-add value through — which is why fetch-and-add returns the pre-add value (it is a load-then-add through the load-Dest mux).
This page also owns three pieces of the embedding-reduce datapath that bracket the load: the SourceOne seed enum (VexSourcePortEncoding, 8 values) that selects the scan's carry-in source bus; the segmented-scan boundary operand (the per-segment reset id, bound as the second SSA operand of SegmentedScanOp); and the fetch-and-add scatter ordering/dedup semantics on the load/return path. These are documented here because all three are read through the load-Dest mux and the V read ports the load feeds; the scan opcode roster itself lives in VectorExtended.
For reimplementation, the contract is:
- The opcode is addressing-mode only; 5 ops, 3-bit field @ word 0x28 bit 58 (gfc/glc). opcode = Matches_cmp_immediate >> 58, contiguous 0..4. There is no dtype in the load opcode: VectorLoad is dtype-agnostic, unlike the VectorStore Add forms. The mode lattice is plain · CircularBuffer · CircularBufferPostUpdate · Indexed · IndexedCircularBuffer.
- All operand fields live in word 0x28, contiguous bit 27..60, no overlap, no gap. Index@27/6 (Indexed forms), Mask@33/5, Stride@38/4, Offset@42/3, BaseAddress@45/3, Cbreg@48/4 (CB forms), Dest@52/6, Opcode@58/3. The store-side sibling proves the bit-position symmetry: load Dest@word0x28 bit52 == store fetch-and-add Dest@word0x28 bit52.
- SourceOne is a source-bus SELECTOR, not a constant. A 3-bit field @ word 0x28 bit 13 selecting a VexSourcePortEncoding ∈ {VST_SOURCE=0, V0_Y_VREG=1, V0_X=2, V1_Y_VREG=3, V1_X=4, V2_Y_VREG=5, V2_X=6, V3_Y_VREG=7}. It routes the EUP scan-input/carry-in mux to a VREG read port; the reduction identity (0, ±inf) is a value placed in that port, not a hardwired code.
- Segmented scans bind the segment id as the second SSA operand. sparse_core::SegmentedScanOp = exactly 2 operands (operand 0 = data, operand 1 = segment_ids); plain ScanOp = at least 1 operand (AtLeastNOperands<1>). The segment id is register-allocated to a free V read port; the per-segment reset reads operand 1.
- The fetch-and-add returns the pre-add value, and dedup makes the ordering moot. The return is captured through the load-Dest mux; the dedup pipeline (Sort → Uniquify → DuplicateCount) collapses duplicate ids before the scatter so each unique row is touched once with its multiplicity folded in.
| Slot | VectorLoad — TecVectorLoad slot of the 64-byte TEC bundle |
| Opcode field | 3-bit @ bit 58 of word 0x28 (gfc/glc); 3-bit @ bit 56 (vfc) |
| Opcode → mnemonic source | per-op SparseCoreTecVectorLoadTileSpmemLoad<Form>Opcode::Matches() immediate (>> 58) |
| Op count (per gen) | vfc 5 · glc 5 · gfc 5 |
| Mode lattice | plain · CircularBuffer · CircularBufferPostUpdate · Indexed · IndexedCircularBuffer |
| Data word | word 0x28 (ALL fields — Dest, Index, address-build, opcode) |
| dtype | none — addressing-mode only (set by the consuming op) |
| SourceOne seed | 3-bit @ word 0x28 bit 13 → VexSourcePortEncoding (8 values) |
| Confidence | CONFIRMED (decompile / Matches-immediate & accessor-shift anchored) unless a row or callout says otherwise |
NOTE: this page owns the VectorLoad opcode roster, field decode, the SourceOne seed enum, the segment-operand binding, and the fetch-and-add return ordering. The 64-byte bundle layout lives in TEC Engine; the VEX scan opcode roster lives in VectorExtended; the dedup pipeline lives in Dedup Multiplicity. They are linked, not repeated.
The Slot Field Map
Purpose
The VectorLoad slot reads one row out of TILE_SPMEM into a Dest VREG. The address is built from {BaseAddress, Offset, Stride} (or a CBREG window via Cbreg), under a lane Mask, optionally gathered per-element by an Index VREG. Unlike the store side there is no Source (no store mux) and no accumulate dtype: the load only chooses where to read and which VREG to write. Every field — including the opcode and the destination — lives in the 8-byte word at struct offset 0x28.
Field Layout
Confirmed byte-exact against the gfc plain TileSpmemLoad field accessors (SparseCoreTecVectorLoadTileSpmemLoad<Field>Field::GetConcatenatedValue), each a single (word0x28 >> shift) & mask. In the decompiled accessors *((_QWORD*)this + 5) is the 8-byte word at byte offset 0x28 (5 × 8 = 0x28):
VectorLoad slot — word 0x28 (8 bytes), gfc plain TileSpmemLoad form
word0x28 bit: 13 27 33 38 42 45 48 52 58
┌───────┬────────┬────────┬──────┬─────┬──────┬──────┬────────┬───────┐
│SrcOne │ Index │ Mask │Stride│Offs.│Base │Cbreg │ Dest │Opcode │
│ 3b │ 6b │ 5b │ 4b │ 3b │ 3b │ 4b │ 6b │ 3b │
│ @13 │ @27 │ @33 │ @38 │ @42 │ @45 │ @48 │ @52 │ @58 │
└───────┴────────┴────────┴──────┴─────┴──────┴──────┴────────┴───────┘
(VEX scan Indexed* ◄──── address build ────► CB* the VREG the
seed sel) only only written mode
NOTE: SourceOne@13 is a VectorExtended (scan) field, not a VectorLoad field; it shares word 0x28 and is drawn here to show the seam. The load proper carries {Dest, Cbreg, BaseAddress, Offset, Stride, Mask, Index} at bits 27..57 and the Opcode at bits 58..60. SourceOne at bit 13 belongs to the scan op that consumes the loaded VREG (see §The SourceOne Seed Enum); it is byte-confirmed in the same word from the …AddScanS32SourceOneField accessor.
| Field | Word | Shift | Width | Present in modes | Accessor (gfc) |
|---|---|---|---|---|---|
| Opcode | 0x28 | 58 | 3 | all | (Matches predicate, byte+0x2f & 0x1c) |
| Dest | 0x28 | 52 | 6 | all (the loaded row's VREG) | …TileSpmemLoadDestField 0x1ecb9b20 |
| Cbreg | 0x28 | 48 | 4 | CircularBuffer* only (→ 16 CBREGs) | …CircularBufferCbregField 0x1ecb9be0 |
| BaseAddress | 0x28 | 45 | 3 | all (TILE_SPMEM tile base / alt to CBREG) | …TileSpmemLoadBaseAddressField 0x1ecb9b40 |
| Offset | 0x28 | 42 | 3 | all (within-window offset) | …TileSpmemLoadOffsetField 0x1ecb9b60 |
| Stride | 0x28 | 38 | 4 | all (per-element address stride) | …TileSpmemLoadStrideField 0x1ecb9b80 |
| Mask | 0x28 | 33 | 5 | all (lane predicate / vmask) | …TileSpmemLoadMaskField 0x1ecb9ba0 |
| Index | 0x28 | 27 | 6 | Indexed* only (per-element gather VREG) | …TileSpmemLoadIndexedIndexField 0x1ecb9dc0 |
The shifts above were read from the plain TileSpmemLoad op-form — the canonical reference, because it carries the address-build fields with no Cbreg/Index. The accessor bodies decode exactly:
// SparseCoreTecVectorLoadTileSpmemLoadDestField::GetConcatenatedValue (gfc 0x1ecb9b20)
return (*((uint64_t *)this + 5) >> 52) & 0x3F; // word0x28 bit52, 6-bit — the loaded-row VREG
// …TileSpmemLoadBaseAddressField::GetConcatenatedValue (gfc 0x1ecb9b40)
return (*((uint64_t *)this + 5) >> 45) & 0x7; // word0x28 bit45, 3-bit — TILE_SPMEM tile base
// …TileSpmemLoadOffsetField::GetConcatenatedValue (gfc 0x1ecb9b60)
return (*((uint64_t *)this + 5) >> 42) & 0x7; // word0x28 bit42, 3-bit — within-window offset
// …TileSpmemLoadStrideField::GetConcatenatedValue (gfc 0x1ecb9b80)
return (*((uint64_t *)this + 5) >> 38) & 0xF; // word0x28 bit38, 4-bit — per-element stride
// …TileSpmemLoadMaskField::GetConcatenatedValue (gfc 0x1ecb9ba0)
return (*((uint64_t *)this + 5) >> 33) & 0x1F; // word0x28 bit33, 5-bit — lane predicate
// …TileSpmemLoadIndexedIndexField::GetConcatenatedValue (gfc 0x1ecb9dc0)
return (*((uint64_t *)this + 5) >> 27) & 0x3F; // word0x28 bit27, 6-bit — gather index VREG
// …TileSpmemLoadCircularBufferCbregField::GetConcatenatedValue (gfc 0x1ecb9be0)
return *((uint16_t *)this + 23) & 0xF; // 16-bit @ byte 0x2e = bits48..63, &0xf => bit48/4
The densest op-form, TileSpmemLoadIndexedCircularBuffer (Index accessor 0x1ecb9ea0, Cbreg 0x1ecb9e20, Dest 0x1ecb9e00), carries all of {Dest@52, Cbreg@48, BaseAddress@45, Offset@42, Stride@38, Mask@33, Index@27} plus Opcode@58 in word 0x28 — verified to fit contiguous bit 27..60 with no overlap and no gap.
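The single-word packing described above can be checked mechanically. The following is a minimal sketch, not code from the binary: the table name `VECTOR_LOAD_FIELDS` and the helpers `decode_vector_load`, `encode_vector_load`, and `covered_bits` are hypothetical, and only the shift/width pairs come from the field table on this page.

```python
# Field table for the gfc VectorLoad slot: name -> (shift, width) inside word 0x28.
# The (shift, width) pairs are the ones listed on this page; the names of the
# helpers below are illustrative assumptions.
VECTOR_LOAD_FIELDS = {
    "Index": (27, 6),
    "Mask": (33, 5),
    "Stride": (38, 4),
    "Offset": (42, 3),
    "BaseAddress": (45, 3),
    "Cbreg": (48, 4),
    "Dest": (52, 6),
    "Opcode": (58, 3),
}


def decode_vector_load(word_0x28: int) -> dict:
    """Extract every field as (word >> shift) & ((1 << width) - 1)."""
    return {
        name: (word_0x28 >> shift) & ((1 << width) - 1)
        for name, (shift, width) in VECTOR_LOAD_FIELDS.items()
    }


def encode_vector_load(**fields: int) -> int:
    """Hypothetical inverse helper: pack named field values into word 0x28."""
    word = 0
    for name, value in fields.items():
        shift, width = VECTOR_LOAD_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def covered_bits() -> list:
    """Sorted list of every bit position claimed by a field (overlap/gap check)."""
    bits = []
    for shift, width in VECTOR_LOAD_FIELDS.values():
        bits.extend(range(shift, shift + width))
    return sorted(bits)
```

If `covered_bits()` equals `list(range(27, 61))`, every bit from 27 through 60 is claimed exactly once, which is the no-overlap, no-gap property stated above.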
QUIRK: every load field is in word 0x28; there is no second data word. The VectorStore slot spreads its fields across word 0x30 (Source/address-build) with Dest borrowed into word 0x28. The load has no store mux to drive, so it packs Dest, Index, the address-build fields, and the opcode into the single word 0x28. A reimplementer porting the store-side decoder must re-base every shift off 0x28, not 0x30, and re-map every position: the field order differs (store has Source@27; load has Index@27, Dest@52).
GOTCHA: the load has no dtype field. The 5 load ops are addressing-mode only; there is no S32/F32/S16/Bf16 split as on the VectorStore Add forms. The loaded element type is fixed by the consuming VectorExtended/VectorAlu op, not by the load. Modeling VectorLoad with a dtype operand will collide with the consuming op's type and mis-decode. Whether a separate load-width control bit exists elsewhere in the bundle (vs word 0x28) was not searched; confidence is LOW for that one negative.
Load/Store bit-position symmetry
The two slots are deliberately mirror-symmetric, and the symmetry is the mechanism behind fetch-and-add:
word packed fields
VectorStore (0x30): Source@27 Mask@8 Stride@13 Offset@17 Base@20 Cbreg@23 Index@2 Opcode@33
+ Dest @ word0x28 bit52 (ReturnValueAdd forms only)
VectorLoad (0x28): Index@27 Mask@33 Stride@38 Offset@42 Base@45 Cbreg@48 Dest@52 Opcode@58
▲
store fetch-and-add Dest @word0x28 bit52 ════════════════════════════════════════╝ SAME mux
The store puts its data fields in word 0x30 because it drives the store mux; the load packs everything in word 0x28. The one field they share at the same bit position is Dest@word0x28 bit52 — the load writes its row there, and the store's IndexedReturnValueAdd family returns its pre-add value through the identical position. All 8 gfc store ReturnValueAdd ops were confirmed (on the store page) to read Dest from word 0x28 bit 52. This bit-position identity is the structural proof that the fetch-and-add is a load-then-add (see §The Fetch-and-Add Return Path).
The 5-Op Addressing-Mode Roster
The opcode-recovery model
Each load op-form is a distinct C++ type SparseCoreTecVectorLoadTileSpmemLoad<Form>Opcode carrying a Matches() const predicate that masks the opcode field out of the decoded-instruction word and compares it to the op's signature. The field is 3-bit @ bit 58 of word 0x28 (mask 0x1c00000000000000), so opcode = cmp_immediate >> 58. The base op (TileSpmemLoad, 0) is tested via testb $0x1c, 0x2f(%rdi); sete — the byte at +0x2f is bits 56..63 of word 0x28, masking 0x1c isolates the 3 opcode bits (58..60), all-zero ⇒ op 0. Byte-exact:
// SparseCoreTecVectorLoadTileSpmemLoadOpcode::Matches (gfc 0x1ecb9a00)
return (*((uint8_t *)this + 47) & 0x1C) == 0; // byte +0x2f, opcode 0
// …TileSpmemLoadIndexedOpcode::Matches (gfc 0x1ecb9a60)
return (*((uint64_t *)this + 5) & 0x1C00000000000000) == 0xC00000000000000; // 0xc…>>58 = 3
*((uint64_t*)this + 5) is the word at byte offset 0x28. All 5 gfc load-op Matches() immediates were enumerated; the resulting opcode set is contiguous 0..4, no gaps, no duplicates.
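The recovery rule can be restated as a short sketch. The function names here are hypothetical; the mask, the shift, the byte test, and the `0x0C00000000000000` immediate are the values quoted above. The byte at `+0x2f` is taken to be the most significant byte of word 0x28, which assumes the little-endian layout of the x86-64 build.

```python
OPCODE_SHIFT = 58
OPCODE_MASK = 0x1C00000000000000  # 3 bits at positions 58..60 of word 0x28


def opcode_from_word(word_0x28: int) -> int:
    """Mask the opcode field out of word 0x28 and right-align it."""
    return (word_0x28 & OPCODE_MASK) >> OPCODE_SHIFT


def opcode_from_matches_immediate(cmp_immediate: int) -> int:
    """opcode = Matches() compare immediate >> 58."""
    return cmp_immediate >> OPCODE_SHIFT


def matches_base_op(word_0x28: int) -> bool:
    """Byte-test form used for op 0: (byte at +0x2f) & 0x1c == 0.

    Assumes little-endian storage, so byte +0x2f holds bits 56..63.
    """
    byte_2f = (word_0x28 >> 56) & 0xFF
    return (byte_2f & 0x1C) == 0
```

The byte test and the 64-bit mask agree because `0x1C << 56 == 0x1C00000000000000`; the compiler simply picked the narrower compare for the all-zero case.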
The full roster (gfc, 5 ops, byte-confirmed)
| op | mnemonic (TileSpmem…) | mode | extra fields | Matches immediate (>>58) |
|---|---|---|---|---|
| 0 | Load | plain | — | byte+0x2f & 0x1c == 0 |
| 1 | LoadCircularBuffer | CB-windowed | Cbreg | 0x04…>>58 = 1 |
| 2 | LoadCircularBufferPostUpdate | CB + advance | Cbreg | 0x08…>>58 = 2 |
| 3 | LoadIndexed | indexed gather | Index | 0x0c…>>58 = 3 |
| 4 | LoadIndexedCircularBuffer | indexed gather, CB | Index,Cbreg | 0x10…>>58 = 4 |
Per-form field sets:
| op-form | field set |
|---|---|
| TileSpmemLoad | {Dest, BaseAddress, Offset, Stride, Mask} |
| +CircularBuffer / +CircularBufferPostUpdate | adds {Cbreg} |
| +Indexed | adds {Index} |
| +IndexedCircularBuffer (densest) | adds {Cbreg, Index} |
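The roster and the per-form field sets can be captured in one lookup, sketched below. `LOAD_FORMS`, `BASE_FIELDS`, and `fields_for_opcode` are hypothetical names; rejecting opcode values 5..7 is an assumption of this sketch (the page only establishes that the assigned values are 0..4).

```python
# Fields every load form carries (no dtype field appears anywhere).
BASE_FIELDS = frozenset({"Dest", "BaseAddress", "Offset", "Stride", "Mask"})

# opcode -> (mnemonic, extra fields added on top of BASE_FIELDS)
LOAD_FORMS = {
    0: ("TileSpmemLoad", frozenset()),
    1: ("TileSpmemLoadCircularBuffer", frozenset({"Cbreg"})),
    2: ("TileSpmemLoadCircularBufferPostUpdate", frozenset({"Cbreg"})),
    3: ("TileSpmemLoadIndexed", frozenset({"Index"})),
    4: ("TileSpmemLoadIndexedCircularBuffer", frozenset({"Index", "Cbreg"})),
}


def fields_for_opcode(opcode: int):
    """Return (mnemonic, full field set) for one of the 5 rostered ops.

    Values outside 0..4 are treated as invalid here; that is this sketch's
    policy, not something read from the binary.
    """
    if opcode not in LOAD_FORMS:
        raise ValueError(f"opcode {opcode} is not in the 5-op load roster")
    name, extra = LOAD_FORMS[opcode]
    return name, BASE_FIELDS | extra
```

Note that no entry pairs `Index` with `PostUpdate`, matching the absence of an indexed post-update load in the roster.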
The mode lattice
| Mode | Marker in name | Behavior | Extra field | Embedding role |
|---|---|---|---|---|
| plain load | (no qualifier) | read addressed word(s) into Dest | — | gathered-row load |
| CircularBuffer | CircularBuffer | address via a CBREG window (16 CBREGs, 4-bit selector) | Cbreg | windowed minibatch tile read |
| +PostUpdate | …PostUpdate | advance the CBREG offset after the load | Cbreg | streaming tile read without a separate pointer bump |
| Indexed | Indexed | per-element gather offset read from a VREG | Index | per-id row gather |
| IndexedCircularBuffer | Indexed…CircularBuffer | per-element gather within a CBREG window | Index,Cbreg | windowed per-id gather |
The load mode lattice is the read-side subset of the VectorStore lattice (plain / CircularBuffer / PostUpdate / Indexed), minus the Add/ReturnValue accumulate semantics — those only make sense on a write. The Indexed forms supply a 6-bit Index VREG (word0x28 bit27) so each lane reads addr + Index[lane] — the per-id gather that pairs with the Stream GATHER stage feeding TILE_SPMEM.
QUIRK: CircularBufferPostUpdate exists for the non-indexed CB form but not for the indexed one. The roster has LoadCircularBufferPostUpdate (op 2) but there is no LoadIndexedCircularBufferPostUpdate. This mirrors the VectorStore sparsity rule (PostUpdate only pairs with non-indexed CB). A reimplementer must not synthesize an indexed post-update load; the 5 entries are exactly the reachable cells.
NOTE: the consuming-stage dispatch counts 18 opcodes mapping to these 5 ops. The TEC bundle consumer (ConsumeOneTecBundleInstruction, gfc utils::band; glc 0x13a08e00) routes an 18-entry inner jump table to these 5 EmitVectorLoadOrStore<…SparseCoreTecVectorLoad_<Form>> leaves. The multiple opcodes per op are pred / non-pred / circular-mask MCInst forms distinguished by the operand fields, not by the op identity. The opcode→op map is 18→5; the field decode above is the same for every form. (Dispatch detail is in the code-gen pages; this page owns the field decode.)
The SourceOne Seed Enum
What it selects
The reduction stage downstream of the load reads its scan seed / accumulator carry-in through a 3-bit field named SourceOne, byte-confirmed at word 0x28 bit 13 (op-invariant across the scan ops; read from …AddScanS32SourceOneField gfc 0x1eca7c40):
// SparseCoreTecVectorExtendedAddScanS32SourceOneField::GetConcatenatedValue (gfc 0x1eca7c40)
return (uint8_t)HIBYTE(*((uint16_t *)this + 20)) >> 5;
// *((uint16_t*)this + 20) is the 16-bit at byte 0x28; HIBYTE is byte 0x29 (bits 8..15 of word0x28);
// >>5 takes the top 3 bits of that byte = bits 13..15 of word0x28 → 3-bit @ bit13.
The value is an asic_sw::deepsea::gxc::gfc::isa::VexSourcePortEncoding. It is NOT a constant-pool index ({0, +inf, -inf}); it is a source-bus selector — it routes the EUP scan-input/carry-in mux to one of the V0/V1/V2/V3 VREG read ports (an X selector or a Y_VREG selector sub-port) or to the VST source bus.
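The accessor's `HIBYTE(...) >> 5` form and the plain `(word >> 13) & 0x7` reading describe the same three bits. The sketch below demonstrates the equivalence on a raw 64-byte buffer; both function names are hypothetical, and little-endian storage (as on the x86-64 build) is assumed.

```python
def source_one_like_accessor(struct_bytes: bytes) -> int:
    """Mirror the decompiled accessor: 16-bit read at byte 0x28, HIBYTE, >> 5."""
    lo16 = int.from_bytes(struct_bytes[0x28:0x2A], "little")
    hibyte = (lo16 >> 8) & 0xFF  # byte 0x29 = bits 8..15 of word 0x28
    return hibyte >> 5           # top 3 bits of that byte = bits 13..15


def source_one_from_word(struct_bytes: bytes) -> int:
    """Same field read the straightforward way from the full 8-byte word."""
    word = int.from_bytes(struct_bytes[0x28:0x30], "little")
    return (word >> 13) & 0x7
```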
The 8 values
| value | enum name | role |
|---|---|---|
| 0 | VEX_SOURCE_PORT_ENCODING_VST_SOURCE | the VST (vector-store) source bus |
| 1 | VEX_SOURCE_PORT_ENCODING_V0_Y_VREG | V0 operand, Y_VREG sub-port |
| 2 | VEX_SOURCE_PORT_ENCODING_V0_X | V0 operand, X sub-port |
| 3 | VEX_SOURCE_PORT_ENCODING_V1_Y_VREG | V1 operand, Y_VREG sub-port |
| 4 | VEX_SOURCE_PORT_ENCODING_V1_X | V1 operand, X sub-port |
| 5 | VEX_SOURCE_PORT_ENCODING_V2_Y_VREG | V2 operand, Y_VREG sub-port |
| 6 | VEX_SOURCE_PORT_ENCODING_V2_X | V2 operand, X sub-port |
| 7 | VEX_SOURCE_PORT_ENCODING_V3_Y_VREG | V3 operand, Y_VREG sub-port |
The enum is cross-confirmed by xla::ghostlite::GhostliteProtoUtils::GetVexSourcePortEncoding(common_proto_utils::VregReadPort) (gfc 0x1c5ee280), a switch that maps the compiler's logical VregReadPort onto these encodings 1:1 — cases 0..7 return encodings 0..7 with an OK status, and the two higher ports are rejected: case 8 (V3_X) returns InvalidArgument ("The V3_X slot (port number 8) cannot be used by a VEX instruction.") and case 9 (MISC_AUX) returns InvalidArgument ("MISC_AUX not supported on GLC"):
// xla::ghostlite::GhostliteProtoUtils::GetVexSourcePortEncoding(common_proto_utils::VregReadPort) (gfc 0x1c5ee280)
switch (port) {
case 0: *(int*)(this+2) = 0; *(uint64_t*)this = 1 /*OK*/; return this; // VST_SOURCE
case 1: *(int*)(this+2) = 1; goto ok; // V0_Y_VREG
case 2: *(int*)(this+2) = 2; goto ok; // V0_X
// … cases 3..7 → encodings 3..7 (V1_Y_VREG, V1_X, V2_Y_VREG, V2_X, V3_Y_VREG)
case 8: /* V3_X — InvalidArgument: cannot be used by a VEX instruction */
case 9: /* MISC_AUX — InvalidArgument: not supported on GLC */
}
The SparsecoreVregReadPort proto enum (the descriptor strings are stored in declaration order) is {VST_SOURCE=0, V0_Y_VREG=1, V0_X=2, V1_Y_VREG=3, V1_X=4, V2_Y_VREG=5, V2_X=6, LANE_ID=7} — the same Y_VREG-before-X order as the encoding table. The wider common_proto_utils::VregReadPort this function actually accepts extends past that proto: ports 8 (V3_X) and 9 (MISC_AUX) exist as inputs and are rejected, and input port 7 maps to encoding 7 (V3_Y_VREG).
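A compact model of the enum and the port mapping follows. It is a sketch of the behavior described above, not the binary's code: the Python class and function names are illustrative, the status object is replaced by exceptions, and the handling of port numbers above 9 is an assumption because the default case was not shown.

```python
from enum import IntEnum


class VexSourcePortEncoding(IntEnum):
    VST_SOURCE = 0
    V0_Y_VREG = 1
    V0_X = 2
    V1_Y_VREG = 3
    V1_X = 4
    V2_Y_VREG = 5
    V2_X = 6
    V3_Y_VREG = 7


def get_vex_source_port_encoding(port: int) -> VexSourcePortEncoding:
    """Map a logical VregReadPort number onto its 3-bit SourceOne encoding."""
    if 0 <= port <= 7:
        return VexSourcePortEncoding(port)  # 1:1 for ports 0..7
    if port == 8:
        raise ValueError(
            "The V3_X slot (port number 8) cannot be used by a VEX instruction."
        )
    if port == 9:
        raise ValueError("MISC_AUX not supported on GLC")
    # Assumption: anything else is also rejected.
    raise ValueError(f"unknown VregReadPort {port}")
```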
Why a selector, not a constant
SourceOne selects which read port (or the VST bus) feeds the scan's first/seed input. The reduction identity element (0 for Add, ±inf for Min/Max) is whatever value is placed in that port — not a hardwired SourceOne code. This is the mechanism that chains a scan's accumulator across tiles / across the V operand frame: write the previous partial to a chosen V port, set SourceOne to that port, and the scan resumes with that carry-in.
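The chaining idea can be illustrated with a software scan that takes its seed as an ordinary value. This is a behavioral sketch only: `inclusive_scan` is a hypothetical helper, and treating the hardware scan as an inclusive left-to-right scan is an assumption made for illustration.

```python
import operator


def inclusive_scan(values, carry_in, op):
    """Inclusive scan whose first input is combined with carry_in.

    carry_in plays the role of the value sitting in the port that
    SourceOne selects: the identity for a fresh scan, or the previous
    partial result when resuming across tiles.
    """
    out = []
    acc = carry_in
    for v in values:
        acc = op(acc, v)
        out.append(acc)
    return out


# Resuming across two tiles: seed the second scan with the first tile's total.
tile0, tile1 = [1, 2, 3], [4, 5]
first = inclusive_scan(tile0, 0, operator.add)           # identity 0 for Add
second = inclusive_scan(tile1, first[-1], operator.add)  # carry-in = previous partial
```

Here `first + second` matches a single scan over `tile0 + tile1`, and a Max scan would pass `float("-inf")` as its initial carry-in instead of `0`.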
QUIRK — there are 4 source-bus encodings (V0..V3) but only 3 V operand pairs in the VEX frame. The VectorExtended operand frame has 3 V operand pairs (V0/V1/V2, each a
Y_VREG+Xselector), yetSourceOnecan selectV3_Y_VREG(enc 7). Whether V3 is a separate physical read port (a chained-accumulator / VST-feedback path reachable only viaSourceOne) or an alias of the VST source isHIGH— structurally confirmed as a reachable encoding, micro-architecturally inferred.HasVexSourceBuses()is a per-target predicate (Ghostlite gfc / Jellyfish / Viperfish / Pufferfish), so the bus set is gen-dependent.
The Segmented-Scan Boundary Operand
The reduction stage the load feeds is a VectorExtended scan; for embedding sum the scan is segmented — it resets the running accumulator at per-sample boundaries. The boundary is bound as an SSA operand at the MLIR level, byte-confirmed from the op trait packs:
| MLIR op | operands | results | reduction kind |
|---|---|---|---|
| sparse_core::ScanOp | AtLeastNOperands<1> (data) | OneResult | reduction_op attr |
| sparse_core::SegmentedScanOp | NOperands<2> (data, segment_ids) | OneResult | reduction_op attr |
SegmentedScanOp::build(OpBuilder, OperationState, Type result, Value input, Value segment, StringAttr reduction_op) (gfc 0x145fd4a0) issues two addOperands calls unconditionally (operand 0 = input, operand 1 = segment) — the exact-2 form NOperands<2>. The sibling ScanOp::build(OpBuilder, OperationState, Type, Value, Value, StringAttr) (gfc 0x145f92e0/0x145f9480) guards its FIRST addOperands behind if (first_value) and adds the second unconditionally — i.e. the first operand is optional, the AtLeastNOperands<1> form. The reduction_op StringAttr is stored as a property on both. So:
- operand 0 = the data / value to scan.
- operand 1 = the segment ids: the scan resets the running accumulator wherever the segment id changes. This is the per-segment boundary.
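The two-operand binding corresponds to the following reference semantics. This is a sketch under assumptions: `segmented_inclusive_scan` is a hypothetical name, the scan is modeled as inclusive, and a boundary is taken to be any position where the segment id differs from the previous lane's id.

```python
import operator


def segmented_inclusive_scan(data, segment_ids, op, identity):
    """Scan `data` (operand 0), resetting wherever `segment_ids` (operand 1) changes."""
    if len(data) != len(segment_ids):
        raise ValueError("data and segment_ids must have the same length")
    out = []
    acc = identity
    prev = None  # no previous segment before the first lane
    for value, seg in zip(data, segment_ids):
        if seg != prev:
            acc = identity  # per-segment reset at the boundary
            prev = seg
        acc = op(acc, value)
        out.append(acc)
    return out


# Two samples packed into one vector: lanes 0..1 belong to sample 0, lanes 2..4 to sample 1.
sums = segmented_inclusive_scan([1, 2, 3, 4, 5], [0, 0, 1, 1, 1], operator.add, 0)
```

The last lane of each segment holds that sample's reduced value, which is what a per-sample embedding sum needs.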
The reduction_op string is decoded by byte-comparison in SegmentedScanOpLowering::matchAndRewrite (gfc 0x13589d40): "sum", "max", "min". The lowering chain splits on the segmented-vs-plain op identity, and the segmentation is carried in the intrinsic name as a .seg. infix:
HLO embedding-sum
→ sparse_core::SegmentedScanOp (2 operands: data, segment_ids)
→ SegmentedScanOpLowering (gfc 0x13589d40) / ScanOpLowering<SegmentedScanOp,…> (0x135f3000)
→ LLVM intrinsic llvm.tpu.{add,max,min}[.full/.half][.seg].scan{1x,2x}[.index] (.seg = segmented)
vs sparse_core::ScanOp (1+ operands: data, optional second)
→ ScanOpLowering (0x1358ab00) / ScanOpLowering<ScanOp,…> (0x135f2580)
→ llvm.tpu.{add,max,min}[.full/.half].scan{1x,2x}[.index] (no .seg infix)
The SC isa_emitter then register-allocates each intrinsic operand to a free V read port via FindAndEmitToUnusedPort<SparsecoreVregReadPort, …> and writes it into the matching slot V-field (a port→slot jump table). So the segment-id operand lands on whichever V0/V1/V2 read port is free at allocation time — a register-ALLOCATED V operand, not a fixed slot. The binding is by SSA operand #1; the placement is by the read-port allocator.
NOTE: the .seg. infix is the segmentation marker; scan1x/scan2x is a width/throughput marker, NOT an operand count. The intrinsic family is llvm.tpu.{add,max,min}[.full/.half][.seg].scan{1x,2x}[.index] (all byte-confirmed in the binary: llvm.tpu.add.seg.scan1x, llvm.tpu.add.full.seg.scan2x, llvm.tpu.max.seg.index.scan2x, etc.). The cleanest structural marker that a scan is segmented (a per-sample embedding sum) is the .seg. infix together with the second SSA operand (the segment ids) on SegmentedScanOp; scan1x vs scan2x and .full vs .half are independent throughput/lane-width axes, not the operand count. The full reduction_op attr set is confirmed as {sum, max, min}; whether mean/sqrtn exist as attrs vs being a post-scan divide was not enumerated, so confidence is LOW for that boundary.
The Fetch-and-Add Return Path
Pre-add return through the load-Dest mux
The VectorStore IndexedReturnValueAdd* family is the fetch-and-add: it accumulates Source into mem[addr + Index] and returns the value present before the add into a Dest VREG. The decisive fact for this page is that the return Dest is read from word 0x28 bit 52 — the identical mux position VectorLoad writes its loaded row through. All 8 gfc store ReturnValueAdd ops were confirmed to read Dest from word 0x28 bit 52 (the store page owns that table). The return therefore captures the pre-add value because the op is structurally a load-then-add:
// fetch-and-add structural model — the Dest is the LOAD mux, so it captures pre-add
function FetchAndAddStore(slot): // e.g. TileSpmemStoreIndexedReturnValueAddF32
addr = BuildAddress(BaseAddress, Offset, Stride, Cbreg?) // word0x30 fields (store side)
addr += Index[lane] // per-element scatter (Index @ word0x30 bit2)
old = mem[addr] // read THROUGH the load-Dest mux (word0x28 bit52)
Dest[lane] = old // return the PRE-add value (== VectorLoad Dest@52)
mem[addr] = old + Source[lane] // atomic accumulate (word0x30 Source @bit27)
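The structural model above can be exercised with a small executable sketch. `fetch_and_add` is a hypothetical helper; the address build is reduced to `base + index[lane]`, and lanes are processed in ascending order, which is purely an assumption of the sketch (the hardware commit order for duplicate indices is not known).

```python
def fetch_and_add(mem, base, index, source):
    """Per lane: return the pre-add value, then accumulate source into memory.

    mem    : mutable list standing in for TILE_SPMEM
    base   : already-built base address (stand-in for BaseAddress/Offset/Stride)
    index  : per-lane scatter offsets (the Index VREG)
    source : per-lane addends (the Source VREG)
    """
    dest = []
    for idx, src in zip(index, source):
        addr = base + idx
        old = mem[addr]        # load step: this value goes to Dest
        dest.append(old)       # pre-add value is what the op returns
        mem[addr] = old + src  # add step: memory receives old + Source
    return dest


mem = [10, 20, 30, 40]
returned = fetch_and_add(mem, base=1, index=[0, 2], source=[5, 7])
```

After the call, `returned` holds the values that were in memory before the update and `mem` holds the accumulated sums, which is the load-then-add behavior the Dest position implies.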
At the MLIR level the fetch-and-add is sparse_core::VectorLoadStoreIdxAddOp, which has a result Type — confirmed from VectorLoadStoreIdxAddOp::build(OpBuilder, OperationState, Type, Value, Value, ValueRange, Value) (gfc 0x1459d840), whose leading Type argument is the result type (the returned pre-add value). The plain scatter-add sparse_core::VectorStoreIdxOp has no result Type. The lowering (VectorLoadStoreIdxAddOpLowering, gfc 0x135c3…) computes a strided element pointer (Mul/Add/InsertElement) then emits LLVM::AddOp (the read-modify-add). The LLVM intrinsic naming confirms: llvm.tpu.vst.msk.idx.ret.add.np / …ret.add.e4m3.np / …ret.add.e5m2.np (all byte-confirmed; ret.add = return-then-add, all carry the .np suffix) vs llvm.tpu.vst.[cb.]msk.idx.add[.e4m3/.e5m2][.np] (plain indexed scatter-add, no ret, and also available in a .cb. circular-buffer form).
Scatter ordering and dedup on the load path
DLRM embedding scatter-add — ordering is made moot by collapsing duplicates first
gathered ids ──► SortInteger ──► Uniquify ──► DuplicateCount ──► scatter-add (fetch-and-add)
(VEX 20/21) (26/27) (24/25) each UNIQUE row touched ONCE
│ │ ▲
WithLaneIds inverse-map count× multiplicity ──┘
(routes result back) (forward: /count mean · backward: ×count grad)
For concurrent same-address fetch-and-adds within one vector (true duplicate indices), the per-lane commit order is hardware behavior not present in the C++. In the embedding path it never fires: the dedup pipeline — SortInteger → Uniquify → DuplicateCount (VEX scan ops, byte-confirmed op set, all NOperands<2>) — collapses duplicate ids before the scatter, so each unique row is fetched/updated exactly once. The WithLaneIds variants (UniqueWithLaneIdsOp / DuplicateCountWithLaneIdsOp, NResults<3>) emit a third result: the per-input → unique-representative lane-id map (an inverse permutation) that routes each per-input result back to its original position. The DuplicateCount multiplicity scales the gradient (×count backward; the mean divisor forward), so a collapsed duplicate accumulates the correct count× contribution. The dedup is the correctness mechanism; the fetch-and-add ordering only matters for a raw, deduplication-disabled scatter, gated by xla_tpu_enable_sparse_core_computation_deduplication.
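A software analogue of the dedup-then-scatter flow is sketched below. All names are hypothetical, and the sketch assumes every occurrence of an id carries the same gradient value, so that multiplying by the count is equivalent to adding once per occurrence; how the real lowering folds the count in was not traced.

```python
def dedup(ids):
    """Sort, uniquify, and count; also return the per-input -> unique-slot map."""
    order = sorted(range(len(ids)), key=lambda lane: ids[lane])  # sort, keeping lane ids
    unique, counts = [], []
    inverse = [0] * len(ids)
    for lane in order:
        if not unique or unique[-1] != ids[lane]:
            unique.append(ids[lane])   # first time this id is seen
            counts.append(0)
        counts[-1] += 1                # multiplicity of the current unique id
        inverse[lane] = len(unique) - 1
    return unique, counts, inverse


def scatter_add_dedup(table, ids, grad):
    """Touch each unique row once, scaled by its multiplicity."""
    unique, counts, _ = dedup(ids)
    for row, count in zip(unique, counts):
        table[row] += count * grad


def scatter_add_raw(table, ids, grad):
    """Reference: one accumulate per occurrence, duplicates included."""
    for row in ids:
        table[row] += grad
```

With duplicates collapsed first, no two updates in the scatter target the same row, so the in-vector commit order never comes into play.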
NOTE: the in-vector duplicate-index commit order is silicon (LOW for raw scatter); the count→weight arithmetic was not bit-traced. Two facts remain inferred: (1) the exact HW serialization of concurrent same-address fetch-and-adds within one vector is not in the C++ (moot for the dedup path, unspecified for a raw scatter); (2) the exact arithmetic that folds the DuplicateCount i32 result into the gradient before the IndexedReturnValueAdd was not bit-traced. The op set and result counts are CONFIRMED; the count×gradient fold is the lowering body and is LOW.
The Embedding-Reduce Composition
VectorLoad is stage 2 of the SparseCore embedding-reduce datapath. The full composition (the other stages owned by the linked pages):
SparseCore embedding lookup + reduce — 5-stage TEC datapath
1. GATHER Stream IndirectStream id list → HBM[base+id*stride] → TILE_SPMEM (stream-gather-scatter)
2. LOAD VectorLoad TileSpmemLoad* TILE_SPMEM rows → VREGs (Dest/Index/Base/Offset/Stride/Mask/Cbreg
all @word0x28) ◄── THIS PAGE
3. REDUCE VectorExtended SegmentedAddScan operand0=data, operand1=segment_ids; SourceOne seed selects the
carry-in source bus (V0..V3 / VST) (vectorextended-vex)
4. DRAIN VectorResult EupResult/PopXrf* XRF (extended-result FIFO) → VRF (vector-opcode-enum)
5. SCATTER VectorStore …Add{dt} atomic scatter-add gradient INTO the row; fetch-and-add returns
pre-add (Dest@word0x28 bit52 — the load mux) (vectorstore-slot)
Stage 2 (this page) pulls the gathered rows into VREGs; stage 3's SourceOne (this page) selects the scan seed, and its segment-id operand (this page) supplies the per-sample reset; stage 5's fetch-and-add returns its pre-add value through this page's load-Dest mux. The load is dtype-agnostic so the same 5 ops serve every embedding width; the type is fixed at stage 3.
Function Map
| Symbol (gfc) | Address | Role |
|---|---|---|
| …VectorLoadTileSpmemLoadOpcode::Matches | 0x1ecb9a00 | op 0 predicate (byte+0x2f & 0x1c == 0), the base op |
| …VectorLoadTileSpmemLoadIndexedOpcode::Matches | 0x1ecb9a60 | op 3 (cmp 0xc00000000000000 → >>58 = 3) |
| …VectorLoadTileSpmemLoadDestField::GetConcatenatedValue | 0x1ecb9b20 | Dest @ word 0x28 >> 52 & 0x3f |
| …VectorLoadTileSpmemLoadBaseAddressField | 0x1ecb9b40 | BaseAddress @ word 0x28 >> 45 & 0x7 |
| …VectorLoadTileSpmemLoadOffsetField | 0x1ecb9b60 | Offset @ word 0x28 >> 42 & 0x7 |
| …VectorLoadTileSpmemLoadStrideField | 0x1ecb9b80 | Stride @ word 0x28 >> 38 & 0xf |
| …VectorLoadTileSpmemLoadMaskField | 0x1ecb9ba0 | Mask @ word 0x28 >> 33 & 0x1f |
| …VectorLoadTileSpmemLoadCircularBufferCbregField | 0x1ecb9be0 | Cbreg @ word 0x28 >> 48 & 0xf (CB forms) |
| …VectorLoadTileSpmemLoadIndexedIndexField | 0x1ecb9dc0 | Index @ word 0x28 >> 27 & 0x3f (Indexed forms) |
| …VectorLoadTileSpmemLoadIndexedCircularBufferDestField | 0x1ecb9e00 | densest-form Dest (same @52) |
| …VectorExtendedAddScanS32SourceOneField::GetConcatenatedValue | 0x1eca7c40 | SourceOne @ word 0x28 >> 13 (3-bit), op-invariant |
| GhostliteProtoUtils::GetVexSourcePortEncoding | 0x1c5ee280 | VregReadPort → VexSourcePortEncoding 1:1 (8 values) |
| mlir::sparse_core::SegmentedScanOp::build | 0x145fd4a0 | 2-operand build (operand0=data, operand1=segment_ids) |
| mlir::sparse_core::VectorLoadStoreIdxAddOp::build | 0x1459d840 | fetch-and-add op; leading Type arg = result (pre-add) |
Cross-gen anchors: the vfc VectorLoad opcode field is 3-bit @ word 0x28 bits[56:58] (base op via testb $0x7, 0x2f; CircularBuffer via mask 0x07<<56 cmp 0x01<<56 → 1), with TileSpmemIndexedLoad* (Indexed-prefix) naming; gfc/glc are 3-bit @ bits[58:60] with TileSpmemLoadIndexed* (Indexed-suffix) naming — a 2-bit upward shift (GF inserted 2 bits below the opcode). The decode VALUES 0..4 and the operand-field semantics are gen-stable; the per-gen count is 5 ops in vfc/glc/gfc. The 2 bits GF inserted below the opcode were not identified — LOW.
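The cross-generation difference reduces to one shift constant, as this sketch shows. `OPCODE_SHIFT_BY_GEN` and `load_opcode` are hypothetical names; the shifts (56 for vfc, 58 for glc/gfc) and the 3-bit width are the values stated above.

```python
# Bit position of the 3-bit VectorLoad opcode inside word 0x28, per generation.
OPCODE_SHIFT_BY_GEN = {"vfc": 56, "glc": 58, "gfc": 58}


def load_opcode(word_0x28: int, gen: str) -> int:
    """Extract the 3-bit load opcode using the generation's field position."""
    return (word_0x28 >> OPCODE_SHIFT_BY_GEN[gen]) & 0x7
```

Decoding a gfc word with the vfc shift (or the reverse) yields a different value from the same bits, so the generation has to be known before the opcode field is read.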
NOTE: TensorCoreVectorLoad* is a different engine, excluded from the 5-count. The 0x1f9e9880+ VectorLoad symbols (TensorCoreVectorLoad0/1, TensorCoreVectorMiscVectorLoad) belong to the TensorCore datapath, not the SparseCore TEC. Only the SparseCoreTecVectorLoad* types contribute to this slot's 5-op roster; a reimplementer grepping for VectorLoad must filter to the SparseCoreTec prefix.
Considerations
- Re-base every shift off word 0x28. Porting the store-side decoder to the load means re-basing all shifts off 0x28 (not 0x30) and re-mapping the order: store Source@27, load Index@27; store opcode @33, load opcode @58. The load's single-word packing is the structural difference.
- No dtype on the load. Decode VectorLoad as addressing-mode only; the element type comes from the consuming VectorExtended/VectorAlu op. A dtype operand on the load is a modeling error.
- SourceOne is the cross-tile scan-chaining seam. It selects the carry-in source bus, so the reduction identity is a value placed in a V port, never a hardwired code. Treating SourceOne as a constant-pool index mis-models the accumulator chaining.
- The segment boundary is operand 1, register-allocated to a free V port. Bind the segment id as the second SSA operand of SegmentedScanOp; the placement (which V0/V1/V2 port) is the read-port allocator's choice, recorded in the slot V-field.
- Fetch-and-add ordering is silicon (LOW for raw scatter). The pre-add return is structurally confirmed (Dest = load mux @52, .ret.add intrinsic, result-typed VectorLoadStoreIdxAddOp); the in-vector duplicate-index commit order is not in the C++ and is undefined here for a non-deduplicated scatter. It is moot for the embedding path (the dedup collapses duplicates first).
- Unmapped (LOW/inferred). The count×gradient multiplicity fold (lowering body); the V3_Y_VREG physical-port identity (4th read port vs VST alias); the 2 bits GF inserted below the opcode (bits[56:58] → [58:60] vs vfc); whether a separate load-width control bit exists outside word 0x28; the full reduction_op attr set beyond {sum, max, min}.
Related Components
| Name | Relationship |
|---|---|
| SparseCoreTecVectorLoadTileSpmemLoad*Opcode::Matches | the 5 per-op predicates that define the opcode values (0..4 @ word 0x28 bit 58) |
| SparseCoreTecVectorStore* (0x1ecc9f40+) | the store counterpart; shares the Dest mux (word 0x28 bit 52) the fetch-and-add returns through |
| SparseCoreTecVectorExtended*::SourceOne (0x1eca7c40 gfc) | the scan seed-port selector field in the same word 0x28 |
| GhostliteProtoUtils::GetVexSourcePortEncoding (0x1c5ee280) | maps logical VregReadPort → the 8 VexSourcePortEncoding values |
| mlir::sparse_core::SegmentedScanOp (build 0x145fd4a0) | the 2-operand segmented scan; operand 1 = the segment-id boundary |
| mlir::sparse_core::VectorLoadStoreIdxAddOp (build 0x1459d840) | the result-typed fetch-and-add; returns the pre-add value |
Cross-References
- TEC (Vector) Engine: owns the 64-byte bundle, the slot bases, and the encoder-dispatch model.
- VectorStore Slot: the store-side mirror; the bit-position symmetry (load Dest@52 == store fetch-and-add Dest@52) and the 33-op type×mode matrix.
- VectorExtended (VEX): the scan/sort/reduce slot the load feeds; the home of the scan opcode roster and the SourceOne/VstSource fields.
- Segmented Scan: the per-segment-reset scan the embedding sum uses; the 2-operand binding documented here in summary.
- TEC Vector Opcode Enumeration: the VectorAlu opcode roster and the opcode-recovery model this page reuses; the VectorResult XRF-drain slot.
- Stream Gather/Scatter: the GATHER stage that fills TILE_SPMEM before the load; the cross-HBM SCATTER_FLOAT_ADD atomic the on-tile scatter-add mirrors.
- CBREG: the 16-CBREG circular-buffer register triple the Cbreg selector indexes; the wrap arithmetic the CB load modes inherit.
- Dedup Multiplicity: the Sort→Uniquify→DuplicateCount path that precedes the scatter-add and makes the fetch-and-add ordering moot.
- SparseCore Overview: the three SC engine classes, per-gen presence, and where the TEC vector slots sit.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part IX, SparseCore & BarnaCore / SparseCore ISA