Dedup Multiplicity
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id
89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
A SparseCore embedding lookup gathers many ids per sample, and the same id usually appears more than once in a sample's window — a DLRM categorical feature is a multiset, not a set. Before the scatter-add writes a gradient back into an embedding row, the pipeline collapses each repeated id to a single unique row and records how many input lanes collapsed onto it: the multiplicity. This page owns that arithmetic — how the DuplicateCount op and the HLO CSR row-pointers produce a per-unique count, how the UniqueWithLaneIds inverse permutation routes the deduplicated result back to every original lane, the in-vector fetch-and-add commit ordering (and why dedup makes it moot), and how the multiplicity binds into the forward combiner (sum/mean/sqrtn) and the backward gradient scale.
The mechanism lives at two altitudes that produce the same number by different means. On the TEC (the SparseCore vector engine), DuplicateCount is a single VectorExtended scan op (opcode 24/25) that emits, per lane, the count of lanes in the vector sharing that lane's id — an in-vector multiplicity. At the HLO level the SparseDenseMatmulGradOpDecomposer lowers the gradient apply into a Sort → COO→CSR → Cumsum-exclusive → Add-combiner scatter chain where the multiplicity is the run length of a unique row in the CSR row_pointers and the count× accumulation falls out of the Add-combiner summing every duplicate into one row. Both reduce to the same MLIR dedup ops (sc_tpu.unique, sc_tpu.duplicate_count, with/without lane ids), which lower through one templated pattern, DuplicateCountUniqueOpLowering<OpT>, to a HW intrinsic (tpu_dupcnt{i,f} / tpu_uniq{i,f}) returning an LLVM struct that the lowering field-extracts.
The decisive structural facts, all byte-confirmed: the HW dedup intrinsic always produces three outputs (two vector<N×i32> maps plus the input-dtype value); the MLIR op chooses whether to surface two or three of them via the ReplaceOpWithExtracts index count (2 plain, 3 WithLaneIds). The struct field order is {i32-vec, i32-vec, input-dtype} and the extract index array re-orders it back to the op-result convention — {2,1} plain, {2,1,0} WithLaneIds. The surfaced third result is the lane→unique-row inverse permutation consumed by a PermuteOp immediately before the scatter. The fetch-and-add op carries no atomic-ordering trait — only LLVM alias-scope — so the intra-vector same-address commit order is silicon; dedup sidesteps it entirely by creating its scatter with unique_indices=TRUE, so no two lanes of a vector ever hit the same address.
For reimplementation, the contract is:
- The dedup intrinsic emits three results; the op surfaces 2 or 3.
inferReturnTypesbuildsresult[0]=input dtype,result[1]=vector<N×i32>, and (WithLaneIds only)result[2]=vector<N×i32>. The lowering builds a three-field struct{i32-vec, i32-vec, input-dtype}(the op-result order reversed) and theReplaceOpWithExtractsindex array ({2,1}or{2,1,0}) selects which struct fields become op results, re-ordering them back. - The multiplicity is a count, computed two ways that agree. TEC: the
DuplicateCountscan op's i32 result = per-lane in-vector id count. HLO:csr_row_pointers[i+1] − csr_row_pointers[i]= run length of unique rowi, produced byCumsumExclusive(Pad+ReduceWindow-Add). - The count× accumulation is the Add-combiner of a
unique_indices=TRUEscatter.DeduplicateGradientVectorsToApplyscatter-ADDs each input's gradient into its unique row withCreateAddComputationas the update computation; summing all duplicates into one row is the count× weighting. No explicit multiply is emitted on the backward path. - The forward combiner divisor is the per-sample
gains.sum → 1,mean → 1/valency,sqrtn → 1/sqrt(valency), wherevalency= the per-sample id count = the multiplicity; realized by the SCReciprocalF32/ReciprocalSqrtF32VectorAlu ops inEmitValencyLoop. - The lane_ids 3rd result routes the deduplicated result back to every input lane. It is the inverse permutation (input-lane domain → unique-row codomain), consumed by
PermuteOp(→tpu_sc_permute/llvm.tpu.vperm.sublane) just before theVectorStoreIdx/ fetch-and-add scatter. - In-vector commit order is silicon, made moot by
unique_indices. The fetch-and-add (tpu_vst_msk_idx_ret_add_np) hasAccessGroup+AliasAnalysistraits and noAtomicOrdering; the dedup scatter is createdunique_indices=TRUEso the per-lane order can never affect the result.
| Dedup ops | sc_tpu.unique · …unique_with_lane_ids · …duplicate_count · …duplicate_count_with_lane_ids |
| Lowering template | DuplicateCountUniqueOpLowering<OpT>::matchAndRewrite — 4 instantiations |
| HW intrinsics | llvm.tpu.dupcnt{i,f} · llvm.tpu.uniq{i,f} (3-field struct, NOperands<2> OneResult) |
| TEC opcodes | DuplicateCount{Integer=24,Float=25} · Uniquify{Integer=26,Float=27} @ word 0x28 bits 16..21 |
| Result types | [0]=input dtype · [1]=vector<N×i32> · [2]=vector<N×i32> (WithLaneIds) |
| HLO decomposer | SparseDenseMatmulGradOpDecomposer (Deduplicate{RowIds,GradientVectors}ToApply, CumsumExclusive, LocalReduction) |
| Fetch-and-add | tpu_vst_msk_idx_ret_add_np — NOperands<4>, OneResult, AccessGroup+AliasAnalysis, no AtomicOrdering |
| Dedup flag | xla_tpu_enable_sparse_core_computation_deduplication (0x223c5f10) |
NOTE — this page owns the count→multiplicity arithmetic, the
WithLaneIdsinverse permutation, and the in-vector commit ordering. The bundle-levelSort/Uniquify/DuplicateCountopcodes live in VectorExtended; the SSA emission sequence (Unique → IdxAdd → Permute → StoreIdx) lives in RankAndPermute; the fetch-and-add return path andSourceOneseed live in VectorLoad Slot; the per-sample valency loop lives in Emit Valency Loop. They are linked, not repeated.
The Dedup Op → Intrinsic → Result Map
Purpose
Four MLIR sc_tpu.* ops express the dedup primitive. unique and duplicate_count differ only in what result[0] means (unique values vs per-position counts); the …_with_lane_ids variants add a third result, the inverse permutation. All four lower through one templated pattern to one HW intrinsic that always produces a three-field struct — the op merely chooses how many fields to extract.
The Four Ops
MLIR op (sc_tpu.*) | operands | results | intrinsic (int / float) | extract idx |
|---|---|---|---|---|
unique | 2 | 2 | uniqi / uniqf | {0,1} |
unique_with_lane_ids | 2 | 3 | uniqi / uniqf | {0,1,2} |
duplicate_count | 2 | 2 | dupcnti / dupcntf | {0,1} |
duplicate_count_with_lane_ids | 2 | 3 | dupcnti / dupcntf | {0,1,2} |
The result types are set by inferReturnTypes (see §The Result Types):
result[0] = INPUT DATA TYPE (unique: the unique-value vector;
duplicate_count: the per-position count vector)
result[1] = vector<N × i32> (N = input shape; secondary i32 map)
result[2] = vector<N × i32> (WithLaneIds only: lane→unique-row INVERSE permutation)
The Lowering — One Template, Three Struct Fields
DuplicateCountUniqueOpLowering<OpT>::matchAndRewrite is instantiated once per op. Each instance reads the op's N result types, convertTypes each, builds an LLVMStructType::getLiteral of those types, emits the HW intrinsic via two type-builder lambdas (lambda #1 → integer intrinsic, lambda #2 → float), and calls ReplaceOpWithExtracts with a 2- or 3-element index array:
function DuplicateCountUniqueOpLowering<OpT>::matchAndRewrite(op, rewriter): // e.g. 0x13599d40
// read the op's result types (3 reads even for the plain op: add edx,0x3)
t0 = convertType(op.getNextResultAtOffset(0)) // input dtype
t1 = convertType(op.getNextResultAtOffset(1)) // vector<N x i32>
t2 = convertType(op.getNextResultAtOffset(2)) // vector<N x i32> (built regardless)
structTy = LLVMStructType::getLiteral({t0,t1,t2}) // 0x17471ae0; header pack 0x600000003
if isIntegerElementType:
v = tpu_dupcnti::create(rewriter, loc, structTy, ...) // 0x146e23c0 (uniqi 0x149895c0)
else:
v = tpu_dupcntf::create(rewriter, loc, structTy, ...) // 0x146e1bc0 (uniqf 0x14988dc0)
indices = {0,1} if plain // mov QWORD [rbp-0x60], 0x2
= {0,1,2} if WithLaneIds // mov QWORD [rbp-0x60], 0x3
ReplaceOpWithExtracts(rewriter, loc, op, structTy, v, indices) // 0x135b82a0
ReplaceOpWithExtracts (@0x135b82a0) is byte-confirmed to loop over the index array — do { ... } while (4*count != offset), stepping 4 bytes per unsigned index — emitting one LLVM::ExtractValueOp::create per index, then calling replaceOp through the rewriter vtable to substitute the extracted SSA values for the original op's results:
// xla::tpu::sparse_core::ReplaceOpWithExtracts (0x135b82a0)
for (i = 0; i < count; ++i) { // count = ArrayRef length (2 or 3)
extractIndex = indices[i]; // v20[2] = *(uint*)(a7 + 4*i)
ev = mlir::LLVM::ExtractValueOp::create(builder); // one extract per surfaced field
results.push_back(ev);
}
rewriter.replaceOp(op, ValueRange(results)); // (**v9)(...) vtable call
QUIRK — the HW intrinsic always emits three lanes of output; the MLIR op picks two or three. Even the plain
unique/duplicate_countlowering reads three result types and builds a three-field struct (add edx, 0x3; threegetNextResultAtOffsetreads), then extracts only{0,1}. Thelane_idsthird field is computed by the silicon either way and discarded by the plain op. A reimplementer who sizes the intrinsic struct to the op's result count (2) will mis-decode the WithLaneIds form, whose third field lives at the same struct offset the plain op leaves unread.
Lowering / Intrinsic Address Map
| Symbol (gfc) | Address | Role |
|---|---|---|
DuplicateCountUniqueOpLowering<DuplicateCountOp>::matchAndRewrite | 0x13599d40 | plain count lowering (extract {0,1}) |
…<DuplicateCountWithLaneIdsOp>::matchAndRewrite | 0x1359a7e0 | count + lane_ids (extract {0,1,2}) |
…<UniqueOp>::matchAndRewrite | 0x1359b280 | plain unique lowering (extract {0,1}) |
…<UniqueWithLaneIdsOp>::matchAndRewrite | 0x1359bd20 | unique + lane_ids (extract {0,1,2}) |
ReplaceOpWithExtracts | 0x135b82a0 | per-index ExtractValueOp + replaceOp |
tpu_dupcnti::create / tpu_dupcntf::create | 0x146e23c0 / 0x146e1bc0 | integer / float count intrinsic |
tpu_uniquei::create / tpu_uniquef::create | 0x149895c0 / 0x14988dc0 | integer / float unique intrinsic |
NOTE — the intrinsic op itself is
NOperands<2>,OneResult,OneTypedResult<Type>. The single result is theLLVMStructTypeliteral; the lowering (not the intrinsic) splits it into 2 or 3 op results. The LLVM intrinsic names arellvm.tpu.dupcnti/.dupcntf/.uniquei/.uniquef. The value semantics ofresult[0]forduplicate_count(the per-position broadcast count) vsunique(the unique values) are read from the op name, not a per-field accessor; the field type is= input dtype.
The Result Types
Purpose
inferReturnTypes is the single source of truth for how many results each op has and what their types are. It is the function a reimplementer must reproduce to wire the dedup op into a type-checked IR. The two-result and three-result forms are byte-identical in structure except for one extra push.
The WithLaneIds Build (3 results)
UniqueWithLaneIdsOp::inferReturnTypes (@0x145a0040) takes the op's ValueRange operands, reads the input data type from operand 0 (& ~7 to strip the pointer tag bits), takes the shape from the last operand, builds a signless i32 VectorType::get(shape, i32), and pushes three result types — the input type once, the vector<N×i32> twice:
// mlir::sparse_core::UniqueWithLaneIdsOp::inferReturnTypes (0x145a0040)
inputTy = dereference_iterator(operands, 0) // v11 → result[0]
shape = VectorType::getShape(operands[last] & ~7) // 0x1d895080
i32 = IntegerType::get(ctx, 32, /*signless*/0) // 0x1d8c60c0
vecI32 = VectorType::get(shape, i32) // 0x1d894100
results.push_back(inputTy & ~7) // result[0] = input data type
results.push_back(vecI32) // result[1] = vector<N x i32>
results.push_back(vecI32) // result[2] = vector<N x i32> (lane→unique map)
return 1
The three grow_pod-guarded stores into the SmallVectorImpl<Type> (a11) are byte-confirmed: v17 (input type) once, v16 (the vector<N×i32> value) twice. DuplicateCountWithLaneIdsOp::inferReturnTypes (@0x1459ff00) is structurally identical (three pushes).
The Plain Build (2 results)
UniqueOp::inferReturnTypes (@0x1459fe00) and DuplicateCountOp::inferReturnTypes (@0x1459fd00) push two types — the input type then vector<N×i32> once:
// mlir::sparse_core::UniqueOp::inferReturnTypes (0x1459fe00)
inputTy = dereference_iterator(operands, 0) & ~7 // v13
results.push_back(inputTy) // result[0] = input data type
shape = VectorType::getShape(operands[last] & ~7)
vecI32 = VectorType::get(shape, IntegerType::get(ctx,32,0))
results.push_back(vecI32) // result[1] = vector<N x i32>
return 1
| inferReturnTypes | Address | Results pushed |
|---|---|---|
UniqueWithLaneIdsOp | 0x145a0040 | input · vec |
DuplicateCountWithLaneIdsOp | 0x1459ff00 | input · vec |
UniqueOp | 0x1459fe00 | input · vec |
DuplicateCountOp | 0x1459fd00 | input · vec |
GOTCHA —
result[1]andresult[2]are the same typevector<N×i32>, but not the same semantics. They are built by one sharedVectorType::getcall and pushed twice, so a type-only check cannot tell them apart.result[2]is thelane_idsinverse permutation (consumed byPermuteOp);result[1]is a secondary i32 map (forward permutation / segment-start / count-index). The role split of[1]vs[2]is read from the op-name suffix and thePermuteOpconsumer convention, not from a distinct per-field accessor.
The Multiplicity Arithmetic — Two Levels
Purpose
The "how many times does this unique id appear" count is produced at two granularities that the pipeline keeps as parallel mechanisms: the TEC in-vector primitive (per-tile, one vector at a time) and the HLO CSR row-pointers (cross-tile, the full minibatch window). Both yield an i32 count; the page documents each and marks the un-traced handoff.
TEC In-Vector — the DuplicateCount Scan Op
On the TEC, DuplicateCount is a unary VectorExtended (EUP scan-family) op. DuplicateCountInteger is opcode 24 (@0x1eca7b20, cmp 0x180000 of word 0x28 bits 16..21 >> 16); DuplicateCountFloat is 25. It is emitted as EmitVectorResultUnop<…DuplicateCountInteger> (@0x13aaf660): one input VREG (GetVregno), a GetVectorMask, a FindAndEmitToUnusedPort to claim a free V read port, and the count result drained through the VectorResult XRF→VRF path — the same drain model as the scan ops it shares the slot with.
// EmitVectorResultUnop<…DuplicateCountInteger> (0x13aaf660)
in = GetVregno(input) // the id vector
mask = GetVectorMask(...)
port = FindAndEmitToUnusedPort(...) // one free V read port
emit DuplicateCountInteger(in, mask) → VectorResult (XRF)
// per lane: result[lane] = #{ j : id[j] == id[lane] } — the in-vector multiplicity
The op gives, per lane, the multiplicity of that lane's value across the vector. This i32 count is the value the combiner uses as the mean divisor (forward) or the gradient scaler (backward).
HLO Decomposition — CSR Row-Pointers and the Add-Combiner
The minibatch gradient apply is lowered by SparseDenseMatmulGradOpDecomposer. The multiplicity here is the run length of a unique id in the CSR row_pointers, and the count× accumulation is the Add-combiner of a unique_indices=TRUE scatter:
SparseDenseMatmulGradOpDecomposer — backward dedup-reduce
Sort (sample,id) pairs SortVectorsAndGainsBySampleIds 0x1367c5a0
→ COO → CSR CooToSparseMatrixFormat 0x13412480
csr_row_pointers[i+1] − csr_row_pointers[i] = multiplicity of unique row i
→ exclusive prefix sum CumsumExclusive = Pad + ReduceWindow(Add) 0x134b3ec0
= each input's destination unique-row index (the multiplicity INDEX)
→ DeduplicateRowIdsToApply 0x134b4560 Scatter(unique=TRUE)+Overwrite+MaxValue+Ternary
collapses duplicate ids onto the unique slot
→ DeduplicateGradientVectorsToApply 0x134b4e40
Scatter(update=AddComputation, unique=TRUE) + LiteralUtil::Zero
scatter-ADDs each input gradient into its unique row → count× accumulation
→ LocalReduction 0x134b01e0 ReduceWindow + Scatter(Add) + Gather/Compare/Ternary
windowed sum over the run of duplicate rows of one unique id
DeduplicateGradientVectorsToApply (@0x134b4e40) is byte-confirmed to: create two HLO parameters (cumsum_exclusive, gradient_vectors); zero-initialize the accumulator via LiteralUtil::Zero + CreateBroadcast (accumulated_gradients_zero_init); build the scatter's update computation with CreateAddComputation; and emit CreateScatter(operand, scatter_indices, updates, AddComputation, dim_numbers, indices_are_sorted=0, unique_indices=…). The trailing scatter bool args are byte-confirmed: push 0x0 = indices_are_sorted=false, push 0x1 = unique_indices=true.
QUIRK — there is no explicit
× counton the backward path. The count× weighting is implicit in the Add-combiner: every one of thecountduplicate ids contributes its gradient once, and the scatter'sAddComputationsums them all into the single unique row, so the collapsed row accumulates exactlycount ×the per-input gradient. A reimplementer looking for aMulOpby the multiplicity on the backward path will not find one — the multiply is the fold of the Add-combiner over the run of duplicates.
Forward Combiner — the gains Divisor
The forward reduction is a segmented sum scaled by a per-id gain. The combiner is one of three, byte-confirmed from the backend-config attr strings "sum" / "mean" / "sqrtn":
| Combiner | gains | realized by |
|---|---|---|
sum | 1 | (no scale) |
mean | 1 / valency | ReciprocalF32 (EmitExtendedVectorVxUnop 0x13a1dc00) |
sqrtn | 1 / sqrt(valency) | ReciprocalSqrtF32 (0x13a1de00) |
valency = the per-sample id count = the multiplicity. The gains are applied per-sample inside SparseDenseMatmulDotCombinerEmitter::EmitValencyLoop (@0x1332cee0); SortVectorsAndGainsBySampleIds (@0x1367c5a0) sorts the gathered rows and their gains by sample id before the segmented reduce. The config fields carrying this are gains, csr_row_pointers, has_reciprocal, has_reciprocal_sqrt, divisor, max_valency, feature_length, valency. Bf16 reciprocal variants (0x13a1dd00 / 0x13a1df00) exist for the half-precision path.
NOTE — whether the TEC in-vector
DuplicateCountand the HLO CSR multiplicity are ever combined is not bit-traced. Both produce a count, but the per-tile in-vector primitive and the cross-minibatch CSR row-pointer count are parallel mechanisms; which one feeds which scatter at which granularity is not resolved. The custom-combiner gains arithmetic (SparseDenseMatmulCustomCombinerOpDecomposer0x1366cf80) defines gains beyondsum/mean/sqrtnand is not traced.
The Uniquify Inverse Permutation
Purpose
The dedup loop computes once per unique row, but the embedding op must produce a result for every input lane. The third result of the …_with_lane_ids ops is the map that closes that loop: for each original input lane, the unique row it collapsed onto — an inverse permutation that a PermuteOp uses to fan the per-unique result back to every input position.
The Map and Its Type
result[2] ("lane_ids") is vector<N×i32>, the same shape as the input (built by the same VectorType::get(inputShape, i32) in inferReturnTypes above). It is the inverse permutation: domain = input-lane index, codomain = unique-row index. Reading lane k gives the unique representative that input id k mapped to.
The Consumer — PermuteOp Before the Scatter
The emission sequence in RankAndPermuteComputeFunction (@0x134039c0) — the SC radix-sort emitter — is byte-confirmed:
UniqueOp create @0x134042a9 (sc_tpu.unique)
UniqueWithLaneIdsOp create @0x13404303 (sc_tpu.unique_with_lane_ids → lane_ids = result[2])
VectorLoadStoreIdxAddOp create @0x1340456c (rank / fetch-and-add over the unique window)
PermuteOp create @0x13404771 (data, lane_ids) → fan result back to input lanes
VectorStoreIdxOp create @0x134049cb (write the permutation index vector)
PermuteOp is NOperands<2>, OneResult (create @0x145f3920; lowering @0x135a1640 → tpu_sc_permute / llvm.tpu.vperm.sublane). It takes (data, lane_ids) and applies the permutation, routing each per-unique-row result back to the original input lane positions, immediately before the VectorStoreIdx / fetch-and-add scatter writes the result. The second caller is (anon)::TransferOperandSlices (@0x13d202e0), within an ElementScatterContext — the scatter consumer that uses the lane-id map to route slices.
NOTE — the plain
uniqueop has nolane_idsand skips thePermuteOp. The emitter selectsUniqueWithLaneIds(3 results) vsUnique(2 results) by a target capability query; in the plain path thelane_idValue is null and thePermuteOpstep is omitted entirely (see RankAndPermute). A reimplementation that always emitsPermuteOpwill fault on a null operand in the no-lane-ids path. The HW datapath oftpu_sc_permute(the sublane-permute network width / cycles) is silicon.
The In-Vector Fetch-and-Add Commit Order
Purpose
When two lanes of one vector scatter-add to the same address, the order in which their read-modify-adds commit determines the result of a raw (non-deduplicated) scatter. This section documents what the IR does and does not specify, and why the dedup path is immune.
No Atomic Ordering — Only Alias Scope
The fetch-and-add op tpu_vst_msk_idx_ret_add_np carries this trait pack, byte-confirmed from its getHasTraitFn template instantiation (@0x14a6ac60):
ZeroRegions · OneResult · OneTypedResult<Type> · ZeroSuccessors · NOperands<4u>
· OpInvariants · BytecodeOpInterface::Trait
· LLVM::AccessGroupOpInterface::Trait · LLVM::AliasAnalysisOpInterface::Trait
There is no AtomicOrdering trait and no MemoryEffectsOpInterface ordering attribute. The only memory interfaces are AccessGroup and AliasAnalysis — LLVM alias-scope metadata for the scheduler, not atomic ordering. The plain scatter-add tpu_vst_msk_idx_add is NOperands<4>, ZeroResult (no return). The fetch-and-add returns the pre-add value (it reads through the load-Dest mux — see VectorLoad Slot); its LLVM intrinsics are llvm.tpu.vst.msk.idx.ret.add.{np,e4m3,e5m2} (no .cb form).
The lowering VectorLoadStoreIdxAddOpLowering::matchAndRewrite (@0x135c3ba0) computes a strided element pointer (MemRefDescriptor::stride + LLVM::MulOp + LLVM::AddOp = base + index*stride), builds the index vector (ShuffleVectorOp / InsertElementOp), reads the mask (getBoolVectorAttr), and emits one tpu_vst_msk_idx_ret_add_np::create. There is no loop, no atomic construct, no per-lane serialization in the lowering — it is a single HW vector op.
intra-vector same-address fetch-and-adds within ONE vector:
IR says nothing about order → it is silicon (a read-modify-add HAZARD)
cross-vector order:
sequential program order of the scatter instructions (each is one ordered HW op)
The hardware detects same-address conflicts: the SC telemetry counter "Address match conflict count" (in the .rodata string pool next to "Stream wait count") and the conflict-stall strings ("must wait more that 16 cycles due to conflicting accesses by Vector Store instructions") establish that the HW serializes conflicting same-address accesses with stall cycles — the combine is an ordered read-modify-add, not a free parallel reduce.
GOTCHA — for a raw scatter the intra-vector commit order is undefined here. The exact per-lane priority (low-lane-first / high-lane-wins / tree-combine) for concurrent same-address fetch-and-adds within one vector is not in the C++ — it is silicon. This matters only for a deduplication-disabled scatter (gated by
xla_tpu_enable_sparse_core_computation_deduplication@0x223c5f10); for the dedup path it is moot.
Why Dedup Makes It Moot
Both DeduplicateRowIdsToApply (@0x134b4560) and DeduplicateGradientVectorsToApply (@0x134b4e40) create their HLO Scatter with unique_indices=TRUE (byte-confirmed: push 0x1 = arg unique_indices; push 0x0 = arg indices_are_sorted). After Uniquify, each unique row is touched exactly once, so no two lanes of a vector ever fetch-and-add the same address — the per-lane commit order can never affect the result. The AliasScopeAssignment / MemrefAliasScopeAnnotationPass (@0x135d2060) attaches LLVM !alias.scope / !noalias metadata to the deduplicated scatter ops (the AliasAnalysisOpInterface trait), encoding the non-aliasing for the scheduler. The dedup is the correctness mechanism; the silicon in-vector order is only reachable on the raw path.
The Complete Dedup-Reduce Pipeline
DuplicateCount and Uniquify sit in the middle of the SparseCore embedding dedup-reduce datapath. The full composition (the other stages owned by the linked pages):
SparseCore embedding dedup-reduce — 8 stages
1. GATHER Stream IndirectStream id list → HBM[base+id*stride] → TILE_SPMEM (stream-gather-scatter)
2. SORT SortVectorsAndGainsBySampleIds → HLO Sort; TEC VectorExtended SortInteger 20/21
3. COO→CSR CooToSparseMatrixFormat csr_row_pointers = run lengths = multiplicity
4. UNIQUIFY UniqueWithLaneIds (op 26/27) collapse duplicates + lane→unique inverse perm (result[2])
5. COUNT DuplicateCount (op 24/25) per-unique multiplicity i32 ◄── THIS PAGE
6. REDUCE forward: SegmentedAddScan sum × gains{1, 1/valency, 1/√valency} ◄── THIS PAGE
backward: scatter-ADD gradient into each unique row ONCE (unique_indices=TRUE,
Add-combiner = count× accumulation) ◄── THIS PAGE
7. PERMUTE PermuteOp + lane_ids → tpu_sc_permute route per-unique result → input lanes ◄── THIS PAGE
8. SCATTER VectorLoadStoreIdxAdd fetch-and-add / VectorStoreIdx; pre-add return;
no atomic ordering; conflict-free by unique_indices + alias-scope (vectorload-slot)
Stages 4–7 (this page) collapse the duplicates, count the multiplicity, fold it into the combiner, and route the result back. The radix-sort SSA emission that realizes stages 2–7 is owned by RankAndPermute; the per-sample valency loop that applies the gains is owned by Emit Valency Loop; the bundle-level opcodes 20/21/24/25/26/27 are owned by VectorExtended.
Per-Generation Table
The dedup op set and the HLO decomposer are gen-stable: SparseDenseMatmulGradOpDecomposer is parameterised by jellyfish::Target, so it covers all SC generations, and the per-gen VectorExtended opcode field position is gen-stable (6-bit @ word 0x28 bits 16..21 across vfc/glc/gfc).
| Quantity | VF (vfc, v5e) | GL (glc, v5p) | GF (gfc, v6e) |
|---|---|---|---|
DuplicateCount{Integer,Float} VEX op | yes | yes | yes |
Uniquify{Integer,Float} VEX op | yes | yes | yes |
DuplicateCountInteger opcode | 24 (0x180000) | 24 | 24 |
UniquifyInteger opcode | 26 (0x1a0000) | 26 | 26 |
tpu_dupcnt{i,f} / tpu_uniq{i,f} intrinsic | yes | yes | yes |
sc_tpu.*_with_lane_ids (3-result) | yes | yes | yes |
lane_ids 3rd result = vector<N×i32> | yes | yes | yes |
tpu_vst_msk_idx_ret_add_np (fetch-and-add) | yes | yes | yes |
| AtomicOrdering on fetch-and-add | NO | NO | NO |
dedup Scatter unique_indices=TRUE | yes | yes | yes |
Per-gen Matches() anchors for the TEC opcodes: gfc DuplicateCountInteger @0x1eca7b20 / UniquifyInteger @0x1eca7b60; vfc @0x1e9ae0c0 / @0x1e9ae100; glc @0x1eb2d320 / @0x1eb2d360.
Function Map
| Symbol (gfc) | Address | Role |
|---|---|---|
DuplicateCountUniqueOpLowering<DuplicateCountOp>::matchAndRewrite | 0x13599d40 | plain count lowering, extract {0,1} |
…<DuplicateCountWithLaneIdsOp>::matchAndRewrite | 0x1359a7e0 | count + lane_ids, extract {0,1,2} |
…<UniqueOp>::matchAndRewrite | 0x1359b280 | plain unique lowering |
…<UniqueWithLaneIdsOp>::matchAndRewrite | 0x1359bd20 | unique + lane_ids |
ReplaceOpWithExtracts | 0x135b82a0 | per-index ExtractValueOp + replaceOp |
UniqueWithLaneIdsOp::inferReturnTypes | 0x145a0040 | 3 results (input · vec |
DuplicateCountWithLaneIdsOp::inferReturnTypes | 0x1459ff00 | 3 results (identical structure) |
UniqueOp::inferReturnTypes | 0x1459fe00 | 2 results (input · vec |
DuplicateCountOp::inferReturnTypes | 0x1459fd00 | 2 results |
EmitVectorResultUnop<…DuplicateCountInteger> | 0x13aaf660 | TEC in-vector count emitter (op 24) |
DeduplicateGradientVectorsToApply | 0x134b4e40 | Add-combiner scatter, unique_indices=TRUE |
DeduplicateRowIdsToApply | 0x134b4560 | Overwrite scatter, collapse ids |
CumsumExclusive | 0x134b3ec0 | Pad + ReduceWindow(Add) → CSR row index |
LocalReduction | 0x134b01e0 | windowed sum over duplicate run |
CooToSparseMatrixFormat | 0x13412480 | COO → CSR; row-pointer run lengths |
SortVectorsAndGainsBySampleIds | 0x1367c5a0 | sort rows + gains by sample id |
EmitValencyLoop | 0x1332cee0 | per-sample gains application |
ReciprocalF32 / ReciprocalSqrtF32 emitters | 0x13a1dc00 / 0x13a1de00 | mean / sqrtn gain |
PermuteOp::create / lowering | 0x145f3920 / 0x135a1640 | apply lane_ids inverse perm |
tpu_vst_msk_idx_ret_add_np trait fn | 0x14a6ac60 | fetch-and-add traits (no AtomicOrdering) |
VectorLoadStoreIdxAddOpLowering::matchAndRewrite | 0x135c3ba0 | one HW op, no serialization |
MemrefAliasScopeAnnotationPass::runOnOperation | 0x135d2060 | !noalias on dedup scatters |
Considerations
- Reproduce
inferReturnTypesexactly. Buildresult[0]=input dtype,result[1..]=vector<N×i32>(twice for WithLaneIds). The two i32 vectors share a type; only the op-name suffix distinguishes thelane_idsmap (result[2]) from the secondary map (result[1]). - Size the intrinsic struct to three fields, not the op's result count. The HW intrinsic always returns
{value, vec<i32>, vec<i32>}; the lowering surfaces 2 or 3 via theReplaceOpWithExtractsindex array. A 2-field struct mis-decodes the WithLaneIds form. - The backward count× is the Add-combiner, not a multiply. Emit the gradient scatter with
unique_indices=TRUEandCreateAddComputation; the count× accumulation falls out of summing duplicates. Do not add a× multiplicityop on the backward path. - The forward divisor is the gains, applied per-sample.
sum → 1,mean → 1/valency,sqrtn → 1/sqrt(valency)via the SC reciprocal ops inEmitValencyLoop;valencyis the multiplicity. - Skip
PermuteOpin the plain-unique path.lane_idsexists only on the…_with_lane_idsops; the plain path stores the rank directly and emits noPermuteOp. - In-vector ordering is silicon; dedup makes it moot. The fetch-and-add has no atomic-ordering trait; rely on
unique_indices=TRUE+ alias-scope, not on a deterministic per-lane commit order. - Unmapped. The raw (dedup-disabled) intra-vector commit priority; the
result[1]vsresult[2]role split ([2]=lane_ids,[1]secondary map); theresult[0]value semantics forduplicate_count(count broadcast); the custom-combiner gains arithmetic; the TEC-in-vector vs HLO-CSR multiplicity handoff; thetpu_sc_permuteHW datapath.
Related Components
| Name | Relationship |
|---|---|
DuplicateCountUniqueOpLowering<OpT> (0x13599d40+) | the one template that lowers all four dedup ops to the 3-field intrinsic |
ReplaceOpWithExtracts (0x135b82a0) | selects 2 vs 3 surfaced results via the index-array length |
SparseDenseMatmulGradOpDecomposer (Deduplicate* 0x134b4560/0x134b4e40) | the HLO multiplicity decomposition (CSR + Add-combiner scatter) |
SparseDenseMatmulDotCombinerEmitter::EmitValencyLoop (0x1332cee0) | applies the sum/mean/sqrtn gains (the forward multiplier) |
tpu_vst_msk_idx_ret_add_np (trait fn 0x14a6ac60) | the fetch-and-add the dedup feeds; no atomic ordering, alias-scope only |
RankAndPermuteComputeFunction (0x134039c0) | the SSA emission sequence (Unique → IdxAdd → Permute → StoreIdx) |
Cross-References
- RankAndPermute and the Radix-Sort Ordering — the SSA emission sequence that creates the Unique / Permute / scatter ops documented here
- VectorExtended (VEX) — the bundle-level
Sort/Uniquify/DuplicateCountopcodes (20/21/24/25/26/27) and the scan-emitter model - VectorLoad Slot — the fetch-and-add return path (pre-add through the load-Dest mux) and the
SourceOnescan seed - Emit Valency Loop — the per-sample valency loop that consumes the multiplicity and applies the gains
- Sample Combiner Emitter — the embedding combiner that drives the sort → dedup → gather → reduce datapath
- SparseCore Overview — where the dedup-reduce sits among the SC engine classes
- Binary:
extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so(build-id89edbbe81c5b328a958fe628a9f2207d) - Index entry: Part IX — SparseCore & BarnaCore / SparseCore datapath (embeddings) — back to index