Scan Datapath
Every address, oneof tag, register-band guard, reduction-string XOR constant, struct offset, and error string on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d; build libtpu_lts_20260413_b_RC00) — from the ScanOpLowering/SegmentedScanOpLowering::matchAndRewrite bodies, the ConsumeOneTecVexBundleInstruction per-arm emit, the GetVectorMask/GetVMDestregno band guards, the ScanOp/SegmentedScanOp::build addOperands order, and mlir::tpu::ScanOp::verify. Addresses apply to this build; other versions differ.
Abstract
The SparseCore scan datapath turns a vector prefix-reduction (sum/min/max) into a single masked VectorExtended bundle slot. It is the reduce stage of the embedding pipeline: rows gathered into VREGs by the VectorLoad slot are scanned in place, with per-lane participation gated by an M-register predicate. The reduction primitive a reimplementer must reproduce is a masked, hardware-segmented prefix scan — not a software loop and not a tree reduction. The hardware consumes the mask directly: the bundle carries a 5-bit M-register selector, and inactive INPUT lanes contribute the reduction identity in silicon, below the binary.
The decisive structural fact is that scan predication is a two-part datapath that the lowering keeps strictly separate. The in-scan mask is carried INTO the VEX bundle as a field of the scan op itself — proto+0x38 (the M-register index) written by every scan-emit arm, unconditionally, from MLIR operand[1]. It is not pre-applied by a VectorSelect before the scan. The inactive-output disposition — what a masked-off lane reads after the scan — is a separate VectorAlu VectorSelect op (select(M, scan_result, else)), and it draws its mask from a different, narrower register file (M0..M15 vs the scan's M0..M31). A third, orthogonal field is the whole-op predicate (the Predication submessage, a P-register) that gates the entire instruction. Three predication fields coexist on one scan bundle; conflating them mis-models the datapath.
This page documents the three layers in order: the MLIR ScanOp lowering (the reduction × dtype × rank switch, the i1→i32 count-active path, the segmented variant); the ISA mask consumption (the emit body that attaches proto+0x38, the encoder that copies it to bundle bit 0x104, the two M-register bands); and the verify contract (mlir::tpu::ScanOp::verify, the constraints a front-end must satisfy). The VEX scan opcode roster and bit positions live in VectorExtended and VEX Mask/Dest-Port/Sub-Opcode; the M-register predicate word layout lives in M-Register Predicate; the segment-boundary operand frame lives in Segmented Scan and Segmented Add-Scan. This page owns the scan datapath: mask consumption, the ScanOp lowering, and the scan-mode roster.
For reimplementation, the contract is:
- The scan is always masked, and the mask is carried to HW, not pre-applied. Every scan-emit arm sets proto+0x38 = GetVectorMask(operand[1]) and proto+0x11 |= 1 (present) unconditionally. The encoder copies proto+0x38 to bundle bit 0x104 as a 5-bit field. There is no pre-scan VectorSelect masking the input — the bundle hands the hardware the M-register index and the HW gates participating lanes.
- The reduction is decoded from a 3-char string by XOR, not a switch on an enum (in the SC dialect). getReductionOp() returns a StringRef; the lowering tests len==3 then (word0 ^ K0) | (byte2 ^ K1) == 0 with sum=0x7573|0x6d, min=0x696d|0x6e, max=0x616d|0x78. The Mosaic tpu.scan dialect instead carries a ReductionKindAttr enum (sum=0, max=1, min=2).
- The intrinsic is a 3-axis function: reduction × element-type × rank. {sum,min,max} × {i32,f32,i16/bf16} → one of tpu_{add,min,max}_scan{1xNi,1xNf} / tpu_*_half_scan2xN / tpu_{min,max}_scan2xN. The 1xN/2xN suffix is rank (1xN = rank-1 lane vector, 2xN = rank-2 packed sublanes); i/f = int/float.
- i1 (boolean) sum is the count-active path. An i1 input with sum lowers to a SINGLE tpu_mprefix (the population-count prefix) whose result is the i32 count vector — tpu_mprefix::create builds an i32-vector result type and replaceOps directly; there is no follow-on convert or tpu_add_scan1xNi. An i1 input forbids a separate mask, forbids non-sum reductions, and requires an i32 output (enforced in verify).
- Segmented scans bind the boundary as operand[1], a V read-port operand — not the M-register mask. SegmentedScanOp::build adds (data, segment) in order; the lowering emits tpu_*_seg_scan* (NOperands<2>). There is no i1/mprefix segmented path.
- Inactive-OUTPUT lanes are a separate VectorSelect VectorAlu op reading a mask from the M0..M15 write/select band, distinct from the scan's M0..M31 read band.
| MLIR ops | sparse_core::ScanOp (sc_tpu.scan), sparse_core::SegmentedScanOp, Mosaic tpu::ScanOp (tpu.scan) |
| SC lowering | ScanOpLowering::matchAndRewrite 0x1358ab00; SegmentedScanOpLowering::matchAndRewrite 0x13589d40 |
| Build (operand order) | ScanOp::build 0x145f92e0 (data, mask); SegmentedScanOp::build 0x145fd4a0 (data, segment) |
| ISA mask consume | ConsumeOneTecVexBundleInstruction 0x13a15ba0 — every arm: proto+0x38 = GetVectorMask(op[1]), proto+0x11 \|= 1 |
| Mask read band | GetVectorMask 0x13a33320 — [0x5f,0x7e] = M0..M31 (32-deep), value regno−0x5f |
| Mask select/write band | GetVMDestregno 0x13a65b20 — [0x5f,0x6e] = M0..M15 (16-deep) |
| Encoder mask field | proto+0x38 → bundle bit 0x104, 5-bit (EncodeAddScanF32 0x1eb32380, gen-stable glc/vfc) |
| Post-scan select | EmitVectorSelect 0x13a1e000 → SparseCoreTecVectorAlu_VectorSelect |
| Verify | mlir::tpu::ScanOp::verify 0x14af7460 |
| Reduction enum | ReductionKindAttr: sum=0, max=1, min=2 (Mosaic); 3-char string (SC) |
| Confidence | CONFIRMED (decompile-anchored) unless a row or callout says otherwise |
NOTE — this page owns the scan datapath: how the mask is consumed, the ScanOp lowering, and the scan-mode roster. The VEX bundle bit positions live in VectorExtended / VEX Mask/Dest-Port; the M-register predicate word layout in M-Register Predicate; the segment-boundary operand binding in Segmented Scan / Segmented Add-Scan. They are linked, not repeated.
The Two-Part Predication Datapath
Purpose
Before the layer-by-layer detail, fix the model, because it is the part a naive reimplementation gets wrong. A masked scan has two questions, and the binary answers them in two different places with two different mechanisms:
- Which INPUT lanes participate in the reduction? Answered by the in-scan mask — an M-register index carried in the scan op's own bundle field (proto+0x38 → bit 0x104). The hardware reads it and lets only active lanes contribute; inactive input lanes contribute the reduction identity (0 for add, ±inf for min/max).
- What does a masked-off OUTPUT lane read after the scan? Answered by a separate VectorAlu VectorSelect op emitted alongside the scan, select(M, scan_result, else), drawing its mask from the narrower M0..M15 write/select file.
SparseCore masked scan — the predication is TWO ops, not one
data (VREGs from VectorLoad) ─┐
▼
┌──────────────────────────────────────────────────────┐
│ VEX scan slot (e.g. AddScanS32) │
│ proto+0x38 = M-reg idx (in-scan mask) → bit0x104 5b │ ◄── gates INPUT lanes in HW
│ FindAndEmitToUnusedPort(data) → V read port │ (M0..M31 read band)
│ EmitPredicationToSlot(P-reg) → Predication │ ◄── whole-op predicate (orthogonal)
└──────────────────────────────────────────────────────┘
│ scan_result
▼
┌──────────────────────────────────────────────────────┐
│ VEX VectorAlu VectorSelect (SEPARATE op) │
│ select(M, then=scan_result, else) M0..M15 band │ ◄── disposes INACTIVE OUTPUT lanes
└──────────────────────────────────────────────────────┘
GOTCHA — do not pre-apply the mask with a VectorSelect before the scan. A reader who assumes "masked scan = select the inputs, then scan" builds the wrong datapath. The mask rides INTO the scan bundle (proto+0x38, set unconditionally in every emit arm), and the HW gates lanes during the reduction. The VectorSelect that does appear is downstream of the scan and disposes the output lanes — a different M-register file (M0..M15, not M0..M31). The two are independent ops with independent mask registers.
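The input-lane half of the model can be written down as a small software reference. This is a minimal sketch in Python, not the hardware algorithm: it assumes the scan is inclusive (the page does not state inclusive versus exclusive), it uses float infinities as the min/max identities, and the names `masked_scan`, `IDENTITY`, and `COMBINE` are hypothetical.

```python
import math

# Reduction identity: what an inactive INPUT lane contributes.
# ASSUMPTION: float-style identities; an integer datapath would use
# the type's extreme values instead of +/-inf.
IDENTITY = {"sum": 0, "min": math.inf, "max": -math.inf}
COMBINE = {"sum": lambda a, b: a + b, "min": min, "max": max}


def masked_scan(data, mask, reduction="sum"):
    """Reference model of a masked prefix scan.

    ASSUMPTION: the scan is inclusive (lane i sees lanes 0..i).
    Lanes whose mask bit is False contribute the reduction identity,
    so they never change the running accumulator.
    """
    if len(data) != len(mask):
        raise ValueError("mask length must equal the lane count")
    combine = COMBINE[reduction]
    identity = IDENTITY[reduction]
    acc = identity
    out = []
    for value, active in zip(data, mask):
        acc = combine(acc, value if active else identity)
        out.append(acc)
    return out
```

For example, `masked_scan([1, 2, 3, 4], [True, False, True, True])` skips the second lane's input and carries the running sum through it. What a masked-off lane finally holds in the output is a separate question that this sketch does not answer.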
The three orthogonal fields
Every VEX scan bundle carries three predication-related fields. They are decoded from three different MLIR sources and written to three different proto locations:
| Field | Emitter | proto / submessage | Scope | Source |
|---|---|---|---|---|
| per-lane VECTOR MASK (bit 0x104, 5b) | scan-emit arm (op[1]) → GetVectorMask (M0..M31) | proto+0x38, present proto+0x11 \|= 1 | which lanes participate in the reduction | ConsumeOneTecVexBundleInstruction 0x13a15ba0 |
| whole-op PREDICATE (Predication submessage) | EmitPredicationToSlot (last MCInst operand) → GetPregno (P0..P13) | Predication submessage | gates the ENTIRE instruction | EmitPredicationToSlot 0x13a4a160 |
| data READ-PORT (V port) | FindAndEmitToUnusedPort (op[2]) → GetVregno | a V read-port slot field | routes the DATA value VREG | 0x13a15ba0 per arm |
QUIRK — the mask SELECTS a predicate register; it is not the predicate itself. proto+0x38 holds a 5-bit M-register index (regno − 0x5f), not a lane bitmask. The 8-byte predicate WORD that the M-register holds — {s_start, l_start, s_end, l_end} sub-/lane bounds, or a synthesized iota-compare — lives in the M-Register Predicate page. This page only shows that the index is carried into the bundle; the word it points at is decoded there.
The two M-register bands
The read and write sides of the mask file are different widths, and the band guards prove it. GetVectorMask (the scan's mask read) accepts M0..M31; GetVMDestregno (the VectorSelect's mask select/write) accepts only M0..M15:
// GetVectorMask<SparsecoreVectorMask>:: (glc 0x13a33320)
// the in-scan per-lane mask READ — 32-deep file
if (!operand.isReg()) LogFatal("operand.isReg()");
regno = operand.getReg();
if (regno <= 0x5E) LogFatal("regno >= llvm::TPU::M0"); // M0 = 0x5f
if (regno >= 0x7F) LogFatal("regno <= llvm::TPU::M31"); // M31 = 0x7e
return regno - 95; // 0x5f → index 0 ⇒ band [0x5f,0x7e] = M0..M31
// GetVMDestregno:: (glc 0x13a65b20)
// the post-scan VectorSelect mask SELECT/WRITE — 16-deep file
if (regno <= 0x5E) LogFatal("regno >= llvm::TPU::M0");
if (regno >= 0x6F) LogFatal("regno <= llvm::TPU::M15"); // M15 = 0x6e
return regno - 95; // band [0x5f,0x6e] = M0..M15
NOTE — the scan reads from M0..M31; the post-scan select reads from M0..M15. A reimplementer allocating mask registers must respect both ceilings: a scan input mask may live in M16..M31, but a VectorSelect mask may not. The asymmetry is byte-confirmed by the two band guards above (< 0x7F vs < 0x6F).
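The two band guards translate directly into range checks. The sketch below is a Python rendering of the guards shown above, with the register numbers taken from this page; the function names are hypothetical, and `ValueError` stands in for the binary's `LogFatal`.

```python
# Register numbers as documented on this page for this build.
M0 = 0x5F    # llvm::TPU::M0
M15 = 0x6E   # llvm::TPU::M15
M31 = 0x7E   # llvm::TPU::M31


def get_vector_mask(regno):
    """In-scan mask READ band: accepts M0..M31, returns the M-register index."""
    if regno < M0:
        raise ValueError("regno >= llvm::TPU::M0")
    if regno > M31:
        raise ValueError("regno <= llvm::TPU::M31")
    return regno - M0


def get_vm_dest_regno(regno):
    """Post-scan select/write band: accepts only M0..M15."""
    if regno < M0:
        raise ValueError("regno >= llvm::TPU::M0")
    if regno > M15:
        raise ValueError("regno <= llvm::TPU::M15")
    return regno - M0
```

A register such as 0x6F (M16) is therefore a valid scan mask but is rejected as a select mask, which is the asymmetry a register allocator has to respect.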
The MLIR Scan-Op Lowering
Purpose
ScanOpLowering (a ConvertToLLVM conversion pattern) rewrites sparse_core::ScanOp into one SC scan intrinsic, chosen by the reduction string and the input element type. It is the single dispatch point where a generic prefix-scan becomes a concrete tpu_*_scan* op that the VEX emitter later realizes as a bundle slot. The body is a flat string-XOR → element-type → rank cascade.
Entry Point
sparse_core::ScanOp (op-name "sc_tpu.scan", AtLeastNOperands<1>, OneResult)
└─ ScanOpLowering::matchAndRewrite (0x1358ab00) ── reduction × dtype × rank → intrinsic
├─ ScanOp::getReductionOp (StringRef, 3-char)
├─ VectorType::getElementType(operand[1]) ── the scanned element type
├─ tpu_mprefix::create / tpu_*_scan*::create
└─ ReplaceWithScanIntrinsic<tpu_*_scan2xN> (0x1358c1c0 / 0x1358bd80 / 0x1358bfa0)
Algorithm
The reduction string is decoded by a constant XOR over the 3 bytes — no strcmp, no enum. The element type is fetched from operand[1] (the data operand of the adaptor) and compared to the builder's canonical i32/f32/i16/bf16 types. The i1 element type is special-cased first:
function ScanOpLowering_matchAndRewrite(op):  // 0x1358ab00
    elt = getElementType(op.operand[1])  // scanned element type
    red = op.getReductionOp()  // StringRef, 3 chars
    // --- i1 (boolean) input: count-active path ---
    if elt == i1Type:  // cmp r13,r14 (ElementType == I1Type)
        if len(red) != 3 || (red ^ "sum") != 0:  // i1 allows ONLY sum
            return failure
        i32vec = vector<i32>  // built up-front from operand[1] lane count
        cnt = tpu_mprefix::create(builder, {i32vec}, i1_input)  // 0x14731a40 — population-count prefix, i32 result
        replaceOp(op, cnt)  // SINGLE op: no convert, no add_scan
        return success
    // --- sum: (word0 ^ 0x7573) | (byte2 ^ 0x6d) == 0 ---
    if len(red) == 3 && red == "sum":
        if elt == i32: emit tpu_add_scan1xNi  // 0x146d57c0
        elif elt == f32: emit tpu_add_scan1xNf  // 0x146d4fc0
        elif elt in {i16, bf16}:
            if !HasGxcHalfScan(): emitError(
                "Currently scan add for i16 and bf16 is only supported for GXC")
            emit tpu_add_half_scan2xN  // 0x146d4400
        else: return failure
    // --- min: (word0 ^ 0x696d) | (byte2 ^ 0x6e) == 0 ---
    elif len(red) == 3 && red == "min":
        if elt == i32: emit tpu_min_scan1xNi  // 0x14731340
        elif elt == f32: emit tpu_min_scan1xNf  // 0x14731180
        elif elt in {i16, bf16}:
            if !HasGxcHalfScan(): emitError(...GXC...)
            return ReplaceWithScanIntrinsic<tpu_min_scan2xN>  // 0x1358bd80
        else: return failure
    // --- max: (word0 ^ 0x616d) | (byte2 ^ 0x78) == 0 ---
    elif len(red) == 3 && red == "max":
        if elt == i32: emit tpu_max_scan1xNi  // 0x14730a80
        elif elt == f32: return ReplaceWithScanIntrinsic<tpu_max_scan1xNf>  // 0x1358bfa0
        elif elt in {i16, bf16}:
            if !HasGxcHalfScan(): emitError(...GXC...)
            return ReplaceWithScanIntrinsic<tpu_max_scan2xN>  // 0x1358c1c0
        else: return failure
    else:
        return failure  // unknown reduction string
The XOR constants are the little-endian byte triples: "sum" = s u=0x7573, m=0x6d; "min" = m i=0x696d, n=0x6e; "max" = m a=0x616d, x=0x78 — read directly off the cmp/xor immediates in the decompiled body (0x1358ab00 lines around the three getReductionOp calls).
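The XOR test is easy to reproduce and to check against the ASCII bytes. The following Python sketch applies the documented `(word0 ^ K0) | (byte2 ^ K1) == 0` test to a 3-byte name; `REDUCTION_KEYS` and `decode_reduction` are hypothetical names, and the constants are the ones listed above.

```python
# (K0, K1) per reduction: K0 is the first two bytes read as a
# little-endian 16-bit word, K1 is the third byte.
REDUCTION_KEYS = {
    "sum": (0x7573, 0x6D),
    "min": (0x696D, 0x6E),
    "max": (0x616D, 0x78),
}


def decode_reduction(name: bytes):
    """Return 'sum', 'min', or 'max', or None for any other string."""
    if len(name) != 3:
        return None
    word0 = int.from_bytes(name[:2], "little")
    byte2 = name[2]
    for reduction, (k0, k1) in REDUCTION_KEYS.items():
        # Both XORs are zero only when all three bytes match.
        if ((word0 ^ k0) | (byte2 ^ k1)) == 0:
            return reduction
    return None
```

Because the two XOR results are OR-ed together, a single differing byte makes the expression non-zero, so the test is equivalent to a 3-byte equality compare without a call to `strcmp`.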
The reduction × dtype × rank → intrinsic map
| reduction | input elt | → intrinsic | create / leaf @ | emit form |
|---|---|---|---|---|
| sum | i1 (count) | tpu_mprefix (i32 result, replaces op) | 0x14731a40 | direct (single op) |
| sum | i32 | tpu_add_scan1xNi | 0x146d57c0 | direct |
| sum | f32 | tpu_add_scan1xNf | 0x146d4fc0 | direct |
| sum | i16/bf16 | tpu_add_half_scan2xN | 0x146d4400 | direct (GXC-gated) |
| min | i32 | tpu_min_scan1xNi | 0x14731340 | direct |
| min | f32 | tpu_min_scan1xNf | 0x14731180 | direct |
| min | i16/bf16 | tpu_min_scan2xN | 0x1358bd80 | ReplaceWithScanIntrinsic (GXC-gated) |
| max | i32 | tpu_max_scan1xNi | 0x14730a80 | direct |
| max | f32 | tpu_max_scan1xNf | 0x1358bfa0 | ReplaceWithScanIntrinsic |
| max | i16/bf16 | tpu_max_scan2xN | 0x1358c1c0 | ReplaceWithScanIntrinsic (GXC-gated) |
Naming: 1xN = rank-1 lane vector; 2xN / half = rank-2, two sublanes packed (bf16/16-bit with an f32-accumulate); i/f = int/float. ReplaceWithScanIntrinsic<T> is the templated path used for the 2xN forms and max-f32; the others call T::create directly into the LLVMStructType literal. Whether the 2xN/half forms map 1:1 onto the VEX *PartialSum* sub-opcodes or carry the accumulate dtype via the VpackFormat attribute is owned by VEX / Segmented Add-Scan — cross-linked, not re-decoded here (LOW for the exact rank-2 sub-opcode binding).
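The table above can be treated as a lookup keyed on (reduction, element type). The sketch below is one way to encode it in Python. The intrinsic names come from the table; `select_scan_intrinsic`, `GxcOnlyError`, and the error wording are hypothetical, and raising an exception is a stand-in for the lowering's `emitError` path.

```python
HALF_TYPES = ("i16", "bf16")

# (reduction, element type) -> intrinsic name, per the table on this page.
INTRINSICS = {
    ("sum", "i1"): "tpu_mprefix",
    ("sum", "i32"): "tpu_add_scan1xNi",
    ("sum", "f32"): "tpu_add_scan1xNf",
    ("min", "i32"): "tpu_min_scan1xNi",
    ("min", "f32"): "tpu_min_scan1xNf",
    ("max", "i32"): "tpu_max_scan1xNi",
    ("max", "f32"): "tpu_max_scan1xNf",
}
for half in HALF_TYPES:
    INTRINSICS[("sum", half)] = "tpu_add_half_scan2xN"
    INTRINSICS[("min", half)] = "tpu_min_scan2xN"
    INTRINSICS[("max", half)] = "tpu_max_scan2xN"


class GxcOnlyError(Exception):
    """Hypothetical stand-in for the lowering's GXC-only diagnostic."""


def select_scan_intrinsic(reduction, elt, has_gxc_half_scan):
    """Return the intrinsic name, or None where the lowering returns failure."""
    name = INTRINSICS.get((reduction, elt))
    if name is None:
        return None
    if elt in HALF_TYPES and not has_gxc_half_scan:
        # Illustrative wording only; not the binary's diagnostic text.
        raise GxcOnlyError(f"{reduction} scan on {elt} requires GXC")
    return name
```

Keeping the capability check outside the table means the same map serves every target, and only the `has_gxc_half_scan` flag changes per generation.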
NOTE — i16/bf16 scan-add is gated on a target capability and emits a "GXC only" error otherwise. The i16/bf16 arms call a vtable predicate ((**(ctx+104)+1920)(ctx+104) in the decompile, a per-target HasGxcHalfScan-style query) and, when false, emitError("Currently scan add for i16 and bf16 is only supported for GXC") (InFlightDiagnostic << "Currently scan add for i16 and bf16 is only supported for " << "GXC"). A reimplementer targeting a non-GXC generation must reject half-precision scans rather than emit an intrinsic. (This gate is not in the prior ScanOp write-ups; CONFIRMED here from the lowering body.)
The i1 count-active path
The i1-sum case is the SparseCore's population-count-prefix primitive — the count of active lanes up to each position, used for ragged-row offset computation in the embedding pipeline. It is the only place a scan op consumes a boolean vector:
i1 vector ──► tpu_mprefix (OneOperand) ──► i32 vector (the op result)
(cross-lane mask prefix-sum;
asm "vmprefix.xlane")
tpu_mprefix (0x14731a40, trait OneOperand, op-name "llvm_tpu.mprefix") takes the i1 input alone — it has no separate mask, which is exactly why verify forbids a mask operand on i1 inputs. The lowering builds an i32-vector result type up-front (from the operand[1] lane count), constructs the single tpu_mprefix op against that type, and replaceOps the original ScanOp directly — there is NO follow-on convert and NO chained tpu_add_scan1xNi. The i1-sum scan is exactly one intrinsic.
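As a software reference for the count-active path, the sketch below computes a running count of set bits over a boolean vector. It is a minimal model under an explicit assumption: the page says "the count of active lanes up to each position" without stating whether the current lane is included, so the inclusive default here is a guess, and `mprefix_reference` is a hypothetical name.

```python
def mprefix_reference(bits, inclusive=True):
    """Running count of True lanes over a boolean vector.

    ASSUMPTION: inclusive by default (lane i counts lanes 0..i).
    With inclusive=False, lane i counts lanes 0..i-1 only.
    The result is a plain list of ints, modeling the i32 result vector.
    """
    count = 0
    out = []
    for bit in bits:
        if inclusive:
            count += 1 if bit else 0
            out.append(count)
        else:
            out.append(count)
            count += 1 if bit else 0
    return out
```

The exclusive form is the one that yields write offsets when compacting active lanes, which is one plausible reading of the ragged-row offset use mentioned above.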
NOTE — tpu_mprefix lowers to a VectorAlu op (VectorMaskPrefixSum), NOT a VectorExtended/VEX op. Tracing past the intrinsic settles the long-open "which VEX sub-opcode does mprefix carry" question: it carries none. llvm_tpu.mprefix (LLVM intrinsic 13389, machine mnemonic scVMPREFIX, asm vmprefix.xlane) instruction-selects to SparseCoreTecVectorAlu_VectorMaskPrefixSum — a cross-lane mask op in the VectorAlu slot, emitted by the shared EmitCrossLaneUnop body (glc 0x13a19d40), not by any EncodeSparseCoreTecVectorExtended<...> encoder (no VectorExtended mprefix/prefix encoder exists in the binary). It is one of three M-register cross-lane unops sharing that emit body — VectorMaskPrefixSum, VectorMaskPopulationCount, VectorMaskCountTrailingZeros (see the VectorAlu opcode roster). The binding essentials: VectorAlu opcode 0x54 (84) with sub-field 1 on vfc (single form, EncodeSparseCoreTecVectorAlu0VectorMaskPrefixSum 0x1e960cc0; Matches: (opcode & 0x7F00)==0x5400 && sub==1 0x1e950e40); opcode 0x80 (128) with sub-field 2 (=B32) / 3 (=B16) on gxc glc/gfc (EncodeSparseCoreTecVectorAlu0VectorMaskPrefixSumB32 0x1eab4a40, …B16 0x1eab4b40) — the bit-width is carried by the 6-bit sub-field, not by a distinct opcode. The single MLIR operand reaches an M-register via GetVMDestregno (the M0..M15 select band — the same band the post-scan VectorSelect uses), and the source is routed through an X read-port (UseVectorXPort). So the i1-sum scan is a VectorAlu cross-lane prefix-sum over an M-register predicate, not a VectorExtended scan. CONFIRMED.
The Segmented Variant
Purpose
SegmentedScanOpLowering is the embedding-sum lowering: a prefix scan that resets the running accumulator at per-sample segment boundaries, so a single scan over a packed ragged batch produces per-row sums. It reuses the identical reduction-string XOR switch but binds a second operand — the segment-boundary vector — and has no i1/mprefix path (a segment scan reduces data, it does not count predicate bits).
Algorithm
function SegmentedScanOpLowering_matchAndRewrite(op):  // 0x13589d40
    red = op.getReductionOp()  // same 3-char StringRef
    elt = getElementType(op.operand[1])
    // same XOR switch: "sum"=0x7573|0x6d, "min"=0x696d|0x6e, "max"=0x616d|0x78
    switch (red, elt):
        sum / i32 -> tpu_add_seg_scan1xNi  // 0x146d5c40
        sum / f32 -> tpu_add_seg_scan1xNf  // 0x146d5a80
        sum / i16,bf16 -> tpu_add_half_seg_scan2xN  // 0x146d45c0 (GXC-gated, "...GXC" error)
        min / i32 -> tpu_min_seg_scan1xNi  // 0x14731880
        min / f32 -> tpu_min_seg_scan1xNf  // 0x147316c0
        max / i32 -> tpu_max_seg_scan1xNi  // 0x14730fc0
        max / f32 -> tpu_max_seg_scan1xNf  // 0x14730e00
        // min/max i16,bf16 → tpu_{min,max}_seg_scan2xN (roster)
        default -> emitError / failure
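To make "resets the running accumulator at segment boundaries" concrete, here is a small Python reference for a segmented scan. It rests on assumptions this page does not settle: the boundary operand is modeled as a per-lane segment id (a new segment starts wherever the id changes), and the scan is inclusive. The actual boundary encoding is owned by the Segmented Scan pages, and `segmented_scan` is a hypothetical name.

```python
COMBINE = {"sum": lambda a, b: a + b, "min": min, "max": max}


def segmented_scan(data, segment_ids, reduction="sum"):
    """Inclusive prefix scan that restarts at every segment boundary.

    ASSUMPTION: segment_ids holds one id per lane, and a change of id
    between adjacent lanes marks the start of a new segment.
    """
    if len(data) != len(segment_ids):
        raise ValueError("segment vector length must equal the lane count")
    combine = COMBINE[reduction]
    out = []
    acc = None
    prev_segment = None
    for lane, (value, segment) in enumerate(zip(data, segment_ids)):
        if lane == 0 or segment != prev_segment:
            acc = value                  # boundary: restart the accumulator
        else:
            acc = combine(acc, value)    # same segment: keep accumulating
        out.append(acc)
        prev_segment = segment
    return out
```

With this model the last lane of each segment holds that segment's full reduction, which is how one scan over a packed ragged batch can produce per-row sums.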
The boundary operand is operand[1], not the mask
The segment boundary is bound as the second SSA operand, and SegmentedScanOp::build proves the order — it issues two addOperands calls, data first then segment, both unconditional:
// SegmentedScanOp::build(OpBuilder, OperationState, Type, Value data, Value segment, StringAttr red)
// (0x145fd4a0)
addOperands(state, &data, 1); // operand[0] = data
addOperands(state, &segment, 1); // operand[1] = segment boundary
state.getOrAddProperties().reduction_op = red; // StringAttr property
Contrast ScanOp::build (0x145f92e0), which guards the data operand (if (data) addOperands(data)) and then adds the mask as operand[1] — so for a plain scan operand[1] is the per-lane VECTOR MASK (the one that becomes proto+0x38), whereas for a segmented scan operand[1] is the SEGMENT BOUNDARY, a V-read-port value. Both are operand[1] at the SSA level, but they are routed to different bundle fields by the VEX emitter.
GOTCHA — operand[1] means two different things for ScanOp vs SegmentedScanOp. In a plain ScanOp, operand[1] is the M-register vector mask (→ proto+0x38, the in-scan predicate). In a SegmentedScanOp, operand[1] is the segment-id boundary (→ a V read port, register-allocated by FindAndEmitToUnusedPort). A reimplementer wiring the operand frame must branch on the op identity, not assume operand[1] is always the mask. The boundary operand frame and VpackFormat capability matrix are owned by Segmented Scan / Segmented Add-Scan.
The ISA Mask Consumption
Purpose
ConsumeOneTecVexBundleInstruction (the glc TEC-VEX bundle emitter) converts an MCInst scan op into the bundle proto. Every scan arm does the same three things: construct the per-op proto submessage (a oneof tag into proto+0x50), attach the per-lane mask from MCInst operand[1], and route the data VREG from operand[2] to a free V read port. The mask attach is the heart of this page — it is what makes the scan masked in hardware.
Algorithm
The AddScanS32 arm, byte-traced (0x13a16ce6..; oneof tag 6):
// ConsumeOneTecVexBundleInstruction — AddScanS32 arm (0x13a15ba0)
clear_inst(vex_proto); // SparseCoreTecVectorExtended::clear_inst
vex_proto.oneof_tag = 6; // [proto+0x50] = 6
sub = Arena::DefaultConstruct<...AddScanS32>(arena); // proto+0x50 submessage
proto.scan = sub;
// --- the in-scan mask: operand[1] → M-register index, ALWAYS attached ---
sub[0x38] = GetVectorMask<SparsecoreVectorMask>(mcinst.operand[1]); // proto+0x38 = regno - 0x5f
sub[0x11] |= 1; // present flag (or [proto+0x11], 1)
// --- the data value: operand[2] → a free V read port ---
vregno = GetVregno(mcinst.operand[2]);
FindAndEmitToUnusedPort<SparsecoreVregReadPort, ...AddScanS32>(status, slot, vregno, sub);
The decompile shows the assignment as a plain mov [sub+0x38], eax followed by or [sub+0x11], 1 — present-bit set unconditionally, in every scan arm. The MCInst operand fetch is *((_QWORD*)inst+2) + 0x10 for operand[1] and +0x20 for operand[2] (the per-operand stride is 0x10).
QUIRK — the mask is attached on EVERY scan arm, not just masked scans. There is no "is this scan masked" branch in the emit body — proto+0x38 = GetVectorMask(op[1]) and proto+0x11 |= 1 run unconditionally for AddScanS32, AddScanBf16PartialSumBf16, AddScanS16PartialSumS16, the SegmentedAddScan* family, DuplicateCount{Integer,Float}, MaxIndexScan{F32,U32}, MaxScanF32, and every other scan/reduce arm. An "unmasked" scan is simply one whose mask M-register selects all lanes; the bundle field is always populated. A reimplementer who makes the mask optional at the encoder level will mis-encode every scan.
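The unconditional attach can be modeled with a small record type. This Python sketch is a simplification, not the protobuf layout: `ScanSubmessage` and `emit_scan_arm` are hypothetical names, the fields stand for the documented proto+0x38 / proto+0x11 / oneof-tag slots, and the data VREG number is treated as an opaque value appended to a port list.

```python
from dataclasses import dataclass, field

M0, M31 = 0x5F, 0x7E      # scan mask read band, from the guards on this page
MASK_PRESENT = 0x1        # the bit OR-ed into proto+0x11


@dataclass
class ScanSubmessage:
    oneof_tag: int                    # models the oneof tag (6 for AddScanS32)
    mask_index: int = 0               # models proto+0x38
    present_bits: int = 0             # models proto+0x11
    read_ports: list = field(default_factory=list)  # models the V read ports


def emit_scan_arm(oneof_tag, mask_regno, data_vregno):
    """Build one scan submessage; the mask is ALWAYS attached."""
    if not (M0 <= mask_regno <= M31):
        raise ValueError("mask register outside M0..M31")
    sub = ScanSubmessage(oneof_tag)
    # There is deliberately no "if masked:" branch here.
    sub.mask_index = mask_regno - M0
    sub.present_bits |= MASK_PRESENT
    sub.read_ports.append(data_vregno)
    return sub
```

An "unmasked" scan in this model is just a call whose `mask_regno` names a register that selects all lanes; the present bit is set either way.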
From proto field to bundle bit
The encoder closes the loop: proto+0x38 (the M-register index) is copied into bundle bit 0x104 as a 5-bit field. The mask arm is the last field the scan encoder writes, after the V read-port array:
; EncodeSparseCoreTecVectorExtendedAddScanF32 (glc 0x1eb32380), mask arm @0x1eb32470
test byte ptr [rax+0x11], 1 ; f6 40 11 01 — mask present?
; if present:
movsxd rax, dword ptr [rax+0x38] ; 48 63 40 38 — sign-extend the M-register index
mov esi, 0x104 ; be 04 01 00 00 — bundle bit 0x104
xor ecx, ecx
mov r8d, 5 ; 41 b8 05 — 5-bit field
call BitCopy
This arm is byte-identical in the vfc EncodeSparseCoreTecVectorExtendedFloatAddScan encoder (0x1e9b14a0, mask arm 0x1e9b1590: same be 04 01 00 00 / 41 b8 05), so the M-register-selector-to-bundle field is gen-stable glc↔vfc. The exact bit position and the surrounding V-port array are owned by VEX Mask/Dest-Port/Sub-Opcode; this page anchors only that proto+0x38 is the source and bit 0x104 (5b) is the destination.
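The field write itself is a plain bit-field insert. The sketch below models the bundle as a Python integer and places a 5-bit value at bit 0x104. The offset and width are from the disassembly above; everything else is an assumption, including the semantics of `BitCopy`, the convention that bundle bit 0 is the integer's least-significant bit, and the helper names.

```python
MASK_BIT_OFFSET = 0x104   # destination bit in the bundle (from the encoder)
MASK_WIDTH = 5            # field width in bits (from the encoder)


def bit_copy(bundle, value, dst_bit, width):
    """Write the low `width` bits of `value` into `bundle` at `dst_bit`.

    ASSUMPTION: this is only a model of what BitCopy does; the real
    routine's argument order and source-offset handling are not shown here.
    """
    field_mask = (1 << width) - 1
    cleared = bundle & ~(field_mask << dst_bit)
    return cleared | ((value & field_mask) << dst_bit)


def encode_scan_mask(bundle, mask_index, present=True):
    """Copy the M-register index into the bundle when the present bit is set."""
    if not present:
        return bundle
    return bit_copy(bundle, mask_index, MASK_BIT_OFFSET, MASK_WIDTH)


def decode_scan_mask(bundle):
    """Read the 5-bit M-register index back out of the bundle."""
    return (bundle >> MASK_BIT_OFFSET) & ((1 << MASK_WIDTH) - 1)
```

Five bits cover indices 0..31, which matches the M0..M31 read band accepted by `GetVectorMask`.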
The post-scan VectorSelect
The inactive-OUTPUT-lane disposition is a separate VectorAlu op, SparseCoreTecVectorAlu_VectorSelect, emitted by EmitVectorSelect (0x13a1e000). It reads three things: the then value (the scan result, decoded by GetOperandAndVsEncoding(op, 2) then GetVregno → proto+0x1c), a mask register via GetVMDestregno (the M0..M15 select band → proto+0x18), and an else value routed through UseVectorXPort (the X read port → proto+0x20):
// EmitVectorSelect<...VectorSelect> (0x13a1e000)
GetOperandAndVsEncoding(op, 2); // X-port = then / scan_result
then_v = GetVregno(op); // → proto+0x1c (a4+28)
mask = GetVMDestregno(op); // M0..M15 select band → proto+0x18 (a4+24)
proto[0x10] |= 3; // present flags for then + mask
else_v = GetVregno(op); // → proto+0x20 (a4+32), via UseVectorXPort
proto[0x10] |= 4; // present flag for else
// result lane = mask[lane] ? then[lane] : else[lane]
The else operand is the zero-vs-preserve choice: a masked-off output lane reads else, which is identity/zero for a fresh result or the prior value for a preserve-old reduction. The exact else wiring per scan family (whether index-scan writes a sentinel, whether duplicate-count zeros) is per-emitter and not exhaustively enumerated here (LOW); the VectorSelect op, its M0..M15 mask, and its two-VREG operand frame are CONFIRMED.
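The output-lane disposition reduces to a per-lane choice. The Python sketch below models `select(M, then, else)` on plain lists; `vector_select` is a hypothetical name, and the zero-fill and preserve-old usages are illustrations of the two `else` choices described above, not a statement of which one any particular scan family uses.

```python
def vector_select(mask, then_v, else_v):
    """Per-lane select: result[lane] = then[lane] if mask[lane] else else[lane]."""
    if not (len(mask) == len(then_v) == len(else_v)):
        raise ValueError("mask, then, and else must have the same lane count")
    return [t if m else e for m, t, e in zip(mask, then_v, else_v)]


# Two illustrative dispositions for masked-off output lanes.
scan_result = [1, 1, 4, 8]
mask = [True, False, True, True]
zero_filled = vector_select(mask, scan_result, [0, 0, 0, 0])
preserved = vector_select(mask, scan_result, [9, 9, 9, 9])
```

Here `zero_filled` is `[1, 0, 4, 8]` and `preserved` is `[1, 9, 4, 8]`: the active lanes are identical, and only the masked-off lane differs with the `else` operand.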
The Verify Contract
Purpose
mlir::tpu::ScanOp::verify (0x14af7460) is the front-end contract: the constraints a tpu.scan op must satisfy before lowering. It is the cleanest single source for the scan's typing rules — core placement, rank, element-type, the i1 special cases, the mask shape, and the reduction enum. Verify failures emit opErrors with the exact strings below.
The constraints (byte-exact strings)
| # | Check | String | Decompile anchor |
|---|---|---|---|
| 1 | parent core type == 2 (SC vector subcore) | Scan is supported only on the SC vector subcore | GetCoreTypeOfParentOp != 2 (line 51) |
| 2 | i1 input → output is i32 | Output element type must be i32 vector for i1 vector inputs. | isInteger(1) then !isInteger(0x20) (60/63) |
| 3 | non-i1 input/output element types match | Input and output element type mismatch. | getElementType cmp (71/72) |
| 4 | input/output shapes match | Input and output shape mismatch. Input shape: ( | getShape + bcmp (78–84, 163) |
| 5 | input rank 1 or 2 (reject ≥3) | Input must be a rank 1 or 2 vector. | rank >= 3 (87) |
| 6 | i1 input → reduction is sum (enum 0) | Only sum reduction is supported for i1 vector inputs. | getValue() != 0 (101–103) |
| 7 | reduction ∈ {sum=0, max=1, min=2} | Only sum, max and min reductions are supported. | getValue() 0/1/2 chain (108–110, 153) |
| 8 | i1 input → no mask operand | Mask is not supported for i1 vector inputs. | mask present + isInteger(1) (116–118) |
| 9 | mask is rank 1 | Mask must be a rank 1 vector. | mask getShape rank != 1 (124, 148) |
| 10 | mask length == input lane count | Mask and input mismatch. Expected mask of length: …, but got … | mask shape cmp (131–141) |
function ScanOp_verify(op):  // 0x14af7460
    if GetCoreTypeOfParentOp(op) != 2:  // SC vector subcore
        return opError("Scan is supported only on the SC vector subcore")
    in_elt = getElementType(op.operand[0])
    out_elt = getElementType(op.result[0])
    if in_elt.isInteger(1):  // i1 input
        if !out_elt.isInteger(32):
            return opError("Output element type must be i32 vector for i1 vector inputs.")
    else if in_elt != out_elt:
        return opError("Input and output element type mismatch.")
    if shape(in) != shape(out):  // bcmp
        return opError("Input and output shape mismatch. Input shape: (")
    if rank(in) >= 3:
        return opError("Input must be a rank 1 or 2 vector.")
    red = op.reduction_kind  // ReductionKindAttr enum
    if in_elt.isInteger(1) && red != 0:  // i1 ⇒ sum only
        return opError("Only sum reduction is supported for i1 vector inputs.")
    if red not in {0,1,2}:  // sum=0, max=1, min=2
        return opError("Only sum, max and min reductions are supported.")
    if op.numOperands == 1 || op.mask == null:  // no mask → done
        return success
    if in_elt.isInteger(1):
        return opError("Mask is not supported for i1 vector inputs.")
    if rank(op.mask) != 1:
        return opError("Mask must be a rank 1 vector.")
    if shape(op.mask)[0] != shape(in)[lane_dim]:
        return opError("Mask and input mismatch. Expected mask of length: <N>, but got <M>.")
    return success
The reduction enum (sum=0, max=1, min=2) is read from the ReductionKindAttr::getValue() je/jne chain (lines 101/108–110) and cross-confirmed by the parse error roster (expected ::mlir::tpu::ReductionKind to be one of: …). The mask-presence test numOperands == 1 || mask == null (decompiled as *((_DWORD*)op+17) == 1 || !*(...op+9)+56)) is the exact gate: only when a real mask operand is present do constraints 8–10 run.
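For a front-end that wants to pre-check ops before handing them to the lowering, the ten constraints can be mirrored as a plain function. This is a sketch under stated assumptions: types are passed as strings and shapes as tuples, the lane dimension is taken to be the last dimension of the input shape (the page only says `lane_dim`), the formatting of the length-mismatch message follows the pseudocode above rather than the binary, and `verify_scan` is a hypothetical name.

```python
SC_VECTOR_SUBCORE = 2      # core type required by constraint 1
SUM, MAX, MIN = 0, 1, 2    # ReductionKindAttr values listed on this page


def verify_scan(core_type, in_elt, out_elt, in_shape, out_shape,
                reduction, mask_shape=None):
    """Return the first failing diagnostic string, or None on success."""
    if core_type != SC_VECTOR_SUBCORE:
        return "Scan is supported only on the SC vector subcore"
    if in_elt == "i1":
        if out_elt != "i32":
            return "Output element type must be i32 vector for i1 vector inputs."
    elif in_elt != out_elt:
        return "Input and output element type mismatch."
    if tuple(in_shape) != tuple(out_shape):
        # Only the prefix of this diagnostic is recorded on this page.
        return "Input and output shape mismatch. Input shape: ("
    if len(in_shape) >= 3:
        return "Input must be a rank 1 or 2 vector."
    if in_elt == "i1" and reduction != SUM:
        return "Only sum reduction is supported for i1 vector inputs."
    if reduction not in (SUM, MAX, MIN):
        return "Only sum, max and min reductions are supported."
    if mask_shape is None:          # no mask operand: constraints 8-10 skipped
        return None
    if in_elt == "i1":
        return "Mask is not supported for i1 vector inputs."
    if len(mask_shape) != 1:
        return "Mask must be a rank 1 vector."
    expected = in_shape[-1]         # ASSUMPTION: lane dimension is the last one
    if mask_shape[0] != expected:
        return ("Mask and input mismatch. Expected mask of length: "
                f"{expected}, but got {mask_shape[0]}")
    return None
```

The checks run in the same order as the pseudocode, so the first diagnostic returned matches the one the documented control flow would reach first.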
NOTE — verify enforces TWO mask-shape constraints the lowering does not re-check. A mask must be rank-1 and its length must equal the input's lane count. These ("Mask must be a rank 1 vector.", "Mask and input mismatch. Expected mask of length: …, but got …") are CONFIRMED in verify here and are additions to the prior mask write-ups, which listed only the i1-no-mask rule. A reimplementer's verifier must reject a rank-2 or mis-sized mask before lowering, because the lowering assumes a well-formed mask.
QUIRK — there are two scan dialects with two reduction encodings. Mosaic tpu.scan (tpu::ScanOp, this verify) carries a ReductionKindAttr ENUM (sum=0/max=1/min=2). The SC sc_tpu.scan (sparse_core::ScanOp, the lowering above) carries a 3-char reduction_op STRING decoded by XOR. The enum→string bridge (the tpu.scan → sc_tpu.scan conversion) is a separate pattern not on this page. A reimplementer must not assume one encoding; the front-end op uses the enum, the SC lowering uses the string.
Function Map
| Symbol | Address | Role |
|---|---|---|
| ScanOpLowering::matchAndRewrite | 0x1358ab00 | reduction × dtype × rank → intrinsic; the i1 count path |
| SegmentedScanOpLowering::matchAndRewrite | 0x13589d40 | segmented variant; same XOR switch, no i1 path |
| ScanOp::build (StringAttr) | 0x145f92e0 | addOperands(data) then addOperands(mask) — operand[0]=data, [1]=mask |
| ScanOp::create | 0x145f93e0 | op-name "sc_tpu.scan"; calls build |
| SegmentedScanOp::build | 0x145fd4a0 | addOperands(data) then addOperands(segment) — operand[1]=boundary |
| SegmentedScanOp::create | 0x145fd5a0 | builds (data, segment, reductionStr) |
| ConsumeOneTecVexBundleInstruction | 0x13a15ba0 | per-arm emit; proto+0x38 = GetVectorMask(op[1]), proto+0x11 \|= 1 |
| GetVectorMask<SparsecoreVectorMask> | 0x13a33320 | in-scan mask read; band [0x5f,0x7e] = M0..M31, value regno−0x5f |
| GetVMDestregno | 0x13a65b20 | VectorSelect mask select/write; band [0x5f,0x6e] = M0..M15 |
| EmitPredicationToSlot<…VectorExtended> | 0x13a4a160 | whole-op predicate (last MCInst operand → Predication submessage); GetPregno band P0..P13 (0x139f1bc0) |
| EmitVectorSelect<…VectorSelect> | 0x13a1e000 | post-scan select(M, then, else); mask via GetVMDestregno |
| EncodeSparseCoreTecVectorExtendedAddScanF32 | 0x1eb32380 | encoder; proto+0x38 → bundle bit 0x104 (5b); glc |
| EncodeSparseCoreTecVectorExtendedFloatAddScan | 0x1e9b14a0 | vfc encoder; mask arm byte-identical to glc |
| BroadcastBoolToVector | 0x13d9bfa0 | getBoolAttr(value) → BroadcastScalarToVector — the scan-mask producer |
| tpu_mprefix::create | 0x14731a40 | i1 cross-lane mask prefix-sum (OneOperand, "llvm_tpu.mprefix", LLVM intrinsic 13389, mnemonic scVMPREFIX) |
| EmitCrossLaneUnop<…VectorMaskPrefixSum…> | 0x13a19d40 (glc) / 0x139adba0 (vfc) | the emit body for tpu_mprefix → VectorAlu VectorMaskPrefixSum; operand → M-reg via GetVMDestregno, source via UseVectorXPort |
| EncodeSparseCoreTecVectorAlu0VectorMaskPrefixSum | 0x1e960cc0 (vfc) / 0x1eab4a40 (glc-B32) / 0x1eab4b40 (glc-B16) | the VectorAlu encoder: opcode 0x54 sub 1 (vfc) / opcode 0x80 sub 2=B32/3=B16 (glc) |
| mlir::tpu::ScanOp::verify | 0x14af7460 | the typing/mask/reduction contract (10 constraints) |
NOTE — the prior write-ups cited tpu::ScanOp::verify at 0x14af7460 and that is correct; the SC-side verifyInvariantsImpl (0x145f9640) does NOT carry the constraint strings. The byte-exact constraint roster (incl. the two NEW mask-shape rules) lives in the Mosaic tpu::ScanOp::verify. A reimplementer searching the SC sparse_core::ScanOp::verifyInvariantsImpl for the messages will not find them.
Considerations
- The scan is always masked at the ISA layer. Encode `proto+0x38` (the M-register selector) and set `proto+0x11 |= 1` on every scan op, not conditionally. An unmasked scan selects an all-lanes M-register; it is not a different encoding.
- Mask consumption is in-HW, not pre-select. Do not lower a masked scan as `VectorSelect(input) → scan`. Lower it as `scan(input, mask)` with the mask carried in the bundle; the HW gates lanes and inactive inputs contribute the reduction identity. The `VectorSelect` that appears is a separate downstream op for the output lanes.
- Two M-register files, two ceilings. Scan input masks may use M0..M31 (`GetVectorMask`); post-scan `VectorSelect` masks may use only M0..M15 (`GetVMDestregno`). Allocate accordingly.
- operand[1] is mask for `ScanOp`, boundary for `SegmentedScanOp`. Branch on op identity; both are SSA operand[1] but route to different bundle fields.
- `i1`-sum is the only boolean scan, and it is a SINGLE `tpu_mprefix` op, with no `add_scan`. It lowers to one `VectorAlu` `VectorMaskPrefixSum` (cross-lane mask prefix-sum), forbids a mask, forbids non-sum, and requires an `i32` output (verify constraints 2/6/8).
- i16/bf16 scan-add is GXC-only. Gate half-precision scans on the target-capability predicate and emit the "GXC only" error otherwise; do not synthesize a `*_half_scan2xN` intrinsic on a non-GXC generation.
- Two scan dialects, two reduction encodings. Mosaic `tpu.scan` uses `ReductionKindAttr` (sum=0/max=1/min=2); SC `sc_tpu.scan` uses a 3-char string (XOR-decoded). Verify operates on the enum; the SC lowering operates on the string.
- Unmapped / LOW. The lowest per-lane HW write-enable of the in-scan mask (physical suppress-on-masked-output vs always-materialized `VectorSelect`); the exact `else` operand per scan family in the post-scan select; the rank-2 `2xN`/`half` → VEX `*PartialSum*` sub-opcode binding. (The `tpu_mprefix` sub-opcode is no longer open: it binds to `VectorAlu` `VectorMaskPrefixSum`, opcode `0x54`/`0x80` by target, resolved in §"The i1 count-active path".)
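The separation between in-scan masking and the post-scan select can be illustrated with a small reference model. This is a sketch under stated assumptions rather than a description of the hardware: the scan is assumed to be inclusive, lanes are modeled as Python floats, all function names are hypothetical, and the `else` operand is taken as a caller-supplied vector because its per-family value is listed above as unmapped. The enum numbering follows the `ReductionKindAttr` values quoted in the list.

```python
from enum import IntEnum
from typing import List, Sequence


class ReductionKind(IntEnum):
    # Numbering as quoted on this page for the Mosaic ReductionKindAttr.
    SUM = 0
    MAX = 1
    MIN = 2


# Reduction identity and combiner per kind (float lanes assumed).
_IDENTITY = {
    ReductionKind.SUM: 0.0,
    ReductionKind.MAX: float("-inf"),
    ReductionKind.MIN: float("inf"),
}
_COMBINE = {
    ReductionKind.SUM: lambda a, b: a + b,
    ReductionKind.MAX: max,
    ReductionKind.MIN: min,
}


def masked_scan(data: Sequence[float], mask: Sequence[bool],
                kind: ReductionKind) -> List[float]:
    """Inclusive prefix scan (assumed); masked-off INPUT lanes contribute
    the reduction identity instead of their value."""
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same lane count")
    identity = _IDENTITY[kind]
    combine = _COMBINE[kind]
    acc = identity
    out = []
    for value, active in zip(data, mask):
        acc = combine(acc, value if active else identity)
        out.append(acc)
    return out


def post_scan_select(mask: Sequence[bool], scan_result: Sequence[float],
                     else_value: Sequence[float]) -> List[float]:
    """Separate select(M, scan_result, else) for inactive OUTPUT lanes."""
    return [r if m else e for m, r, e in zip(mask, scan_result, else_value)]


def mask_prefix_sum(mask: Sequence[bool]) -> List[int]:
    """Model of the i1 count-active path: running count of set mask bits."""
    count = 0
    out = []
    for bit in mask:
        count += 1 if bit else 0
        out.append(count)
    return out


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    mask = [True, False, True, True]
    scanned = masked_scan(data, mask, ReductionKind.SUM)
    print(scanned)                                      # [1.0, 1.0, 4.0, 8.0]
    print(post_scan_select(mask, scanned, [0.0] * 4))   # [1.0, 0.0, 4.0, 8.0]
    print(mask_prefix_sum(mask))                        # [1, 1, 2, 3]
```

In this model the raw scan output at a masked-off lane simply carries the running value forward; what that lane finally reads is decided only by the separate select step, which mirrors the two-part structure described in the considerations.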
Related Components
| Name | Relationship |
|---|---|
| `ScanOpLowering` / `SegmentedScanOpLowering` (`0x1358ab00` / `0x13589d40`) | the reduction × dtype × rank → intrinsic switch; the segmented variant |
| `ConsumeOneTecVexBundleInstruction` (`0x13a15ba0`) | the per-arm emit that attaches the in-scan mask (`proto+0x38`) and routes the data port |
| `GetVectorMask` / `GetVMDestregno` (`0x13a33320` / `0x13a65b20`) | the M0..M31 read band vs the M0..M15 select/write band |
| `EmitVectorSelect` (`0x13a1e000`) | the separate `VectorAlu` op that disposes inactive OUTPUT lanes |
| `mlir::tpu::ScanOp::verify` (`0x14af7460`) | the 10-constraint typing/mask/reduction contract |
| `BroadcastBoolToVector` (`0x13d9bfa0`) | the all-true/all-false scan-mask producer at the MLIR layer |
Cross-References
- VectorExtended (VEX): the scan/sort/reduce slot this datapath emits into; the VEX opcode roster, V read ports, and the `SourceOne` seed the scans realize.
- VEX Mask / Dest-Port / Sub-Opcode: the bundle bit `0x104` (5-bit mask selector), the dest read-port, and the 48-encoder sub-opcode map this page's `proto+0x38` feeds.
- M-Register Predicate Word (M0–M31): the 8-byte predicate WORD the M-register holds (`{s_start,l_start,s_end,l_end}` / iota-compare) and the masked-scan inactive-lane output model.
- Segmented Scan: the per-segment-reset scan; the `SegmentedScanOpLowering` reduction switch and the boundary-operand binding summarized here.
- Segmented Add-Scan: the `SegmentedAddScan` operand frame and the `VpackFormat` dtype-attribute capability matrix the `2xN`/`half` forms ride.
- VectorLoad Slot: the read-side slot that fills the VREGs this datapath scans; the `SourceOne` seed selector and the segment-id operand binding documented there.
- TEC Vector Opcode Enumeration: the `VectorAlu` opcode roster (incl. `VectorSelect`) and the opcode-recovery model.
- SparseCore Overview: the three SC engine classes and where the TEC vector scan datapath sits.
- Binary: `extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so` (build-id `89edbbe81c5b328a958fe628a9f2207d`)
- Index entry: Part IX, SparseCore & BarnaCore / SparseCore datapath (embeddings) (back to index)