LowerToSparseCoreLlvm
All addresses, symbols, and offsets on this page were read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped, .text VA == file offset). The .symtab is not stripped; every claim anchors to a demangled symbol or a decompiled body. Other versions will differ, so treat every VA as version-pinned.
Abstract
LowerToSparseCoreLlvmPass is the terminal MLIR lowering on the SparseCore path: it takes a module that still carries the mlir::sparse_core (ScDialect) op surface — gather/scatter streams, circular-buffer registers, sync-flag waits, simple DMAs, EUP transcendentals, address-space casts — and drops it into the llvm dialect plus the SparseCore-specific llvm_tpu intrinsic dialect (the 1356-intrinsic tpu_* catalog documented in LlvmTpu Intrinsic Catalog). It is the SparseCore sibling of the TensorCore tpu → LLO ODS descent and the stage immediately after the LowerToMlo DMA bridge has expanded tiled memrefs. Where the upstream MLIR convert-to-llvm lowers arith/memref/func to LLVM, this pass adds ~118 SparseCore *OpLowering conversion patterns that select the right tpu_* intrinsic for each ScDialect op, all riding on the type map built by the SCTypeConverter.
The reader who knows MLIR's DialectConversion should hold one structural fact: this is not one applyFullConversion but a three-substage driver inside runOnOperation (0x13566d00). First a standalone scf → cf lowering runs on the whole module (no type conversion); then a per-func::FuncOp ScDialect → llvm/llvm_tpu conversion runs (this is where the 118 patterns and the SCTypeConverter live, inside lowerFunc 0x13568280); then a per-LLVM::LLVMFuncOp assert + memref-finalize pass cleans up cf.assert and any residual memref ops (lowerAsserts). A lower-scf-to-cfg-only test flag can stop after the first substage. This page owns the per-class rewrite bodies (the matchAndRewrite algebra: operand mapping, attribute filtering, intrinsic selection, replaceOp), the scf→cfg lowering (lowerScfToCfg, with its two custom ForLowering/IndexSwitchLowering patterns), and the pass driver itself (the three substages, the factory, the conversion targets).
The pieces this page deliberately does not re-derive, because a sibling owns each: the address-space-ID → !llvm.ptr<N> type map and the sequencer-context flatten, the CheckAddressSpaces legality gate, and the 42-instantiation EUP roster all live on SCTypeConverter; the full AS-id ↔ MemorySpace ↔ pool table is on Fat Pointers (AS7/8/9); the per-cast addrspacecast instruction selection is on addrspacecast ISel. This page documents how the rewrite bodies consume the converted types and reach the intrinsics.
For reimplementation, the contract is:
- The three-substage driver. scf→cf (module-wide, untyped) → per-func ScDialect→llvm/llvm_tpu (typed) → per-func assert/memref-finalize. The order is mandatory: scf ops must already be lowered to cf control flow before the typed conversion runs, and cf.assert must survive into the LLVM-func substage.
- The uniform rewrite shape. Every *OpLowering::matchAndRewrite is "resolve operands to LLVM values → FilterLLVMAttributes (drop access_groups) → select the leaf tpu_* intrinsic by dtype/memspace/predicate → tpu_X::create(b, loc, …) → replaceOp". Six representative bodies are byte-decoded.
- The dispatch keys per class. DMA = (srcMemSpaceID, dstMemSpaceID) tuple table; stream = (dtype, off-tile memspace, verb) lambda table; wait = comparison-predicate attr; sync = local/tile/remote attr; addrspacecast = convertType(src) == convertType(dst) elide-or-fail.
| Property | Value |
|---|---|
| Pass class | xla::tpu::sparse_core::(anon)::LowerToSparseCoreLlvmPass (ODS base mlir::sparse_core::impl::LowerToSparseCoreLlvmBase) |
| Pass kind / scope | builtin.module pass (object size 0x298) |
| Driver | LowerToSparseCoreLlvmPass::runOnOperation @ 0x13566d00 |
| Factory | CreateLowerToSparseCoreLlvmPass(Target&, InlinedVector<long,4>, SparseCoreConfig&, DebugInfoTracker*) @ 0x135667c0 |
| scf→cfg substage | lowerScfToCfg (collector lambda 0x13572640); custom ForLowering 0x135707c0, IndexSwitchLowering 0x13571ec0 |
| per-func typed substage | LowerToSparseCoreLlvmPass::lowerFunc(func::FuncOp) @ 0x13568280 |
| assert/memref substage | lowerAsserts (walks LLVM::LLVMFuncOp); AssertOpLowering::matchAndRewrite @ 0x135b7080 |
| Test-only flag | lower-scf-to-cfg-only ("Only lower scf ops to cf ops (for testing)", 42 chars) |
| Attribute filter | FilterLLVMAttributes @ 0x135b7a20 — drops access_groups (13 chars) |
| Pattern count | ~118 *OpLowering classes (templated families instantiate more) |
| Confidence | CONFIRMED (decompile-anchored) unless a row or callout says otherwise |
The Pass Driver — Three Substages
Purpose
runOnOperation (0x13566d00) is a builtin.module pass driver, not a single conversion. It runs three distinct applyFullConversion campaigns in sequence, each with its own ConversionTarget and pattern set, separated by IR walks. The split exists because the three campaigns operate on three different op universes — structured control flow (scf), the ScDialect op surface (per func.func), and the finalized llvm.func bodies — and each needs a clean legalizer state. A reimplementer who collapses these into one applyFullConversion will deadlock the legalizer: the typed ScDialect patterns expect control flow already lowered, and the assert/memref finalize expects the SC ops already gone.
Entry Point
RegisterAllPhases (0xf849ec0) → Phase 2a (SparseCore custom-call path)
└─ CreateLowerToSparseCoreLlvmPass (0x135667c0) ── builds the pass object (0x298 B)
└─ runOnOperation (0x13566d00) ── module pass driver
├─ [substage 1] lowerScfToCfg ── scf → cf, module-wide, untyped
│ ForLowering (0x135707c0) ── scf.for → cf branches (benefit 2)
│ IndexSwitchLowering (0x13571ec0) ── scf.index_switch → cf (benefit 2)
│ populateSCFToControlFlowConversionPatterns ── upstream scf.{if,while,parallel}
├─ [substage 2] per func::FuncOp: lowerFunc (0x13568280)
│ SCTypeConverter + ~118 *OpLowering patterns → llvm / llvm_tpu
└─ [substage 3] per LLVM::LLVMFuncOp: lowerAsserts
AssertOpLowering (0x135b7080) + populateFinalizeMemRefToLLVMConversionPatterns
Algorithm
function runOnOperation(pass):                                   // 0x13566d00
    module = pass.getOperation()                                 // builtin.module
    ctx = module.getContext()
    // ---- substage 1: scf -> cf, module-wide, NO type conversion ----
    scfPatterns = RewritePatternSet(ctx)
    scfPatterns.add<ForLowering>(benefit=2)                      // 0x135707c0 "scf.for"
    scfPatterns.add<IndexSwitchLowering>(benefit=2)              // 0x13571ec0 "scf.index_switch"
    populateSCFToControlFlowConversionPatterns(scfPatterns)      // upstream scf.{if,parallel,while}
    frozen1 = FrozenRewritePatternSet(scfPatterns)
    // collect every scf.{for,if,parallel,while,index_switch} op into a worklist
    worklist = []
    module.walk(lowerScfToCfg_collect)                           // callback 0x13572640 (lambda #1)
    for op in worklist:                                          // one applyFullConversion PER root op
        target = ConversionTarget(ctx)
        target.addIllegalOp("scf.for")                           // setOpAction(..., 1) via lambda #2
        target.addDynamicallyLegalOp<scf.{If,Parallel,While,IndexSwitch}>(...)
        target.markOpRecursivelyLegal("scf.for","scf.if","scf.parallel","scf.while","scf.index_switch")
        target.markUnknownOpDynamicallyLegal(lambda #3)          // setLegalityCallback (catch-all legal)
        if applyFullConversion(op, target, frozen1) != success:
            pass.signalPassFailure(); return                     // sets bit 2 of pass flags
    if pass.lowerScfToCfgOnly:                                   // test flag, *(pass+456)
        return                                                   // stop after scf->cf
    // ---- substage 2: per-func ScDialect -> llvm / llvm_tpu (TYPED) ----
    pass.target = SparseCoreTargetForModule(pass.cfg, module)    // xla_mlo_util, 0x...
    for fn in module.getOps<func::FuncOp>():
        if !lowerFunc(pass, fn):                                 // 0x13568280
            pass.signalPassFailure(); return
    // ---- substage 3: per-llvm.func assert + memref finalize ----
    if module.walk<LLVM::LLVMFuncOp>(lowerAsserts) == interrupt:
        opts = LowerToLLVMOptions(ctx); opts.useBarePtrCallConv = true
        opts.dataLayout = DataLayout("…-S64")                    // 64-bit stack alignment
        tc = LLVMTypeConverter(ctx, opts)
        tc.registerTypeAttributeConversion(                      // SC memory-space attr -> i64 addrspace
            lowerAsserts_memSpaceLambda)                         // (no sequencer flatten; see SCTypeConverter)
        assertPatterns = RewritePatternSet(ctx)
        assertPatterns.add<AssertOpLowering>(benefit=1,          // 0x135b7080 "cf.assert"
            granuleMask = -1 << (log2(stackBytes) - log2(target.GranuleBytes())))
        populateFinalizeMemRefToLLVMConversionPatterns(tc, assertPatterns)
        llvmTarget = LLVMConversionTarget(ctx)
        llvmTarget.addLegalDialect("llvm_tpu")                   // setDialectAction, 1
        llvmTarget.addIllegalOp("builtin.module")                // setOpAction, 0
        llvmTarget.addDynamicallyLegalOp("llvm.alloca", "builtin.unrealized_conversion_cast")
        if applyFullConversion(module, llvmTarget, assertPatterns) != success:
            pass.signalPassFailure()
    return
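The ordering constraint can be illustrated with a toy model. The sketch below is not the pass and calls no MLIR API: it treats a module as a dict from function name to a list of op names, and the concrete op names (`sc.gather`, `cf.br`, `llvm_tpu.intrinsic`, `llvm.assert_seq`) are hypothetical placeholders chosen only to show the three substages running in order and the early exit on the test flag.

```python
# Toy model of the three-substage ordering. All op names are hypothetical
# placeholders; only the dialect prefixes (scf., cf., sc., llvm_tpu.) echo the text.

def lower_scf_to_cf(module):
    """Substage 1: module-wide and untyped. Every scf.* op becomes a cf op."""
    for ops in module.values():
        ops[:] = ["cf.br" if op.startswith("scf.") else op for op in ops]

def lower_func(ops):
    """Substage 2: per function. Fails if structured control flow survived."""
    if any(op.startswith("scf.") for op in ops):
        return False
    ops[:] = ["llvm_tpu.intrinsic" if op.startswith("sc.") else op for op in ops]
    return True

def lower_asserts(ops):
    """Substage 3: per function. cf.assert is the residue left to finalize."""
    ops[:] = ["llvm.assert_seq" if op == "cf.assert" else op for op in ops]

def run_on_operation(module, lower_scf_to_cfg_only=False):
    """Run the substages in the mandatory order; return False on failure."""
    lower_scf_to_cf(module)
    if lower_scf_to_cfg_only:          # models the test-only flag
        return True
    for ops in module.values():
        if not lower_func(ops):
            return False
    for ops in module.values():
        lower_asserts(ops)
    return True

module = {"f": ["scf.for", "sc.gather", "cf.assert"], "g": ["scf.if"]}
ok = run_on_operation(module)
print(ok, module)
```

Calling `lower_func` directly on a body that still contains an `scf.` op returns `False`, which is the toy equivalent of the typed conversion being run before control flow is flat.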
Function Map
| Function | VA | Role |
|---|---|---|
| runOnOperation | 0x13566d00 | the three-substage driver |
| CreateLowerToSparseCoreLlvmPass | 0x135667c0 | factory; takes Target&, InlinedVector<long,4>, SparseCoreConfig&, DebugInfoTracker* |
| lowerFunc | 0x13568280 | per-func.func typed ScDialect→LLVM conversion (substage 2) |
| lowerScfToCfg collector lambda | 0x13572640 | walk callback; pushes scf.{for,if,parallel,while,index_switch} to worklist |
| ForLowering::matchAndRewrite | 0x135707c0 | scf.for → cf branches (custom, benefit 2) |
| IndexSwitchLowering::matchAndRewrite | 0x13571ec0 | scf.index_switch → cf (custom, benefit 2) |
| AssertOpLowering::matchAndRewrite | 0x135b7080 | cf.assert lowering with granule mask (substage 3) |
| getArgument / getDescription | 0x13566cc0 / 0x13566ce0 | ODS pass registration metadata |
| getDependentDialects | 0x13566c00 | declares the dialects the pass instantiates |
The Factory and the Test Flag
CreateLowerToSparseCoreLlvmPass (0x135667c0) allocates a 0x298-byte pass object whose op-name anchor is "builtin.module" (14 chars, stored at +16). Four pieces of state are captured into the object: the xla::jellyfish::Target& (at +536), an absl::InlinedVector<long,4> of core IDs (at +552, copied via Storage::InitFrom when it spills past the inline 4), a SparseCoreConfig (placement-constructed at +592), and a DebugInfoTracker* (at +656). The one cl::opt<bool> it registers is lower-scf-to-cfg-only (21-char name, description "Only lower scf ops to cf ops (for testing)", 42 chars); when set, the driver returns after substage 1. This is the only tunable on the pass; it exists so a test can inspect the scf→cf output without descending into the (target-dependent) intrinsic selection.
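The quoted offsets can be cross-checked against the object size with a few lines of arithmetic. In this sketch the offsets and the 0x298 size are the values from the text; the field names are illustrative labels, and reading each gap as "the bytes available to that field" is an assumption, since padding or undocumented members may sit in between.

```python
# Offsets and the object size are the values quoted in the text. The field
# names are illustrative, and treating each gap as the field's footprint is
# an assumption (padding or other members may occupy part of a gap).
OBJECT_SIZE = 0x298

FIELDS = [
    ("target_ref", 536),          # xla::jellyfish::Target&
    ("core_ids", 552),            # absl::InlinedVector<long,4>
    ("sparse_core_config", 592),  # SparseCoreConfig
    ("debug_info_tracker", 656),  # DebugInfoTracker*
]

def field_spans(fields, object_size):
    """Map each field name to the bytes between its offset and the next one."""
    spans = {}
    for i, (name, offset) in enumerate(fields):
        end = fields[i + 1][1] if i + 1 < len(fields) else object_size
        spans[name] = end - offset
    return spans

spans = field_spans(FIELDS, OBJECT_SIZE)
for name, size in spans.items():
    print(f"{name}: {size} bytes")
```

The last field gets exactly 8 bytes, so a pointer at +656 ends at the 0x298 object boundary. The 40-byte slot for the core-ID vector is consistent with four inline 8-byte elements plus one 8-byte bookkeeping word, though that internal layout is inferred here rather than decoded.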
NOTE: the pass is a builtin.module pass, but substages 2 and 3 iterate functions internally rather than relying on a func-scoped pass manager. The reason is the substage ordering: a func-pass manager would interleave the three campaigns per-function, but the driver needs all of substage 1 (scf→cf) done module-wide before any function enters substage 2's typed conversion. Keeping the whole thing a module pass with explicit inner loops is what enforces "all control flow flat, then all SC ops lowered, then all asserts finalized."
QUIRK: substage 1 runs applyFullConversion once per collected root scf op, not once over the module. The collector lambda (0x13572640) gathers each top-level scf.{for,if,parallel,while,index_switch} into a worklist, then the driver loops, building a fresh ConversionTarget and calling applyFullConversion on each root individually. A reimplementer who runs a single module-wide applyFullConversion for scf→cf will get the same result on well-formed input but a different failure granularity: the per-root loop lets one malformed loop fail without aborting the others' diagnostics.
The scf → cf Lowering (Substage 1)
Purpose
Before any ScDialect op is typed-converted, all structured control flow must become unstructured cf branches, because the tpu_* intrinsics and the LLVM dialect have no notion of scf.for/scf.if regions. Substage 1 is a plain (untyped) scf → cf legalization: it reuses MLIR's upstream populateSCFToControlFlowConversionPatterns for scf.if/scf.while/scf.parallel but supplies two custom higher-benefit patterns for scf.for and scf.index_switch, which the SparseCore path needs lowered differently from upstream (e.g. SparseCore's loop induction and the sequencer-aware index switch).
The two custom patterns
| Pattern | Source op | VA | Benefit | Anchor string |
|---|---|---|---|---|
| ForLowering | scf.for | 0x135707c0 | 2 | "…ForLowering]" (57 chars) |
| IndexSwitchLowering | scf.index_switch | 0x13571ec0 | 2 | "…IndexSwitchLowering]" (65 chars) |
Both are registered with PatternBenefit(2) — higher than the upstream populateSCFToControlFlowConversionPatterns patterns (default benefit 1) — so they win the match for scf.for and scf.index_switch while upstream handles the rest. The collector lambda (0x13572640) is a function_ref<void(Operation*)> callback that, on each visited op, compares the op's TypeID against scf::{IndexSwitchOp, WhileOp, ParallelOp, ForOp, IfOp} (a five-way TypeIDResolver identity check) and, on a hit, appends the op pointer to the worklist SmallVector.
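The benefit ordering and the five-way collector check can be modelled in a few lines. This is a minimal sketch, not MLIR: the two benefit-2 entries are the custom patterns named above, while the `upstream-*` entries are placeholder names standing in for whatever populateSCFToControlFlowConversionPatterns registers (it is assumed here that the upstream set also offers candidates for scf.for and scf.index_switch). Op names stand in for TypeIDs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    name: str
    root: str      # op name the pattern is rooted on
    benefit: int

# Benefit-2 entries follow the text. The "upstream-*" names are placeholders,
# and their benefit of 1 is the default cited above.
PATTERNS = [
    Pattern("upstream-for", "scf.for", 1),
    Pattern("upstream-index-switch", "scf.index_switch", 1),
    Pattern("upstream-if", "scf.if", 1),
    Pattern("upstream-while", "scf.while", 1),
    Pattern("upstream-parallel", "scf.parallel", 1),
    Pattern("ForLowering", "scf.for", 2),
    Pattern("IndexSwitchLowering", "scf.index_switch", 2),
]

SCF_ROOTS = {"scf.for", "scf.if", "scf.parallel", "scf.while", "scf.index_switch"}

def collect_roots(op_names):
    """Model of the collector callback: keep ops matching one of the five names."""
    return [op for op in op_names if op in SCF_ROOTS]

def pick_pattern(op_name):
    """Return the name of the highest-benefit pattern rooted on op_name."""
    candidates = [p for p in PATTERNS if p.root == op_name]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.benefit).name

for op in collect_roots(["scf.for", "arith.addi", "scf.if", "scf.index_switch"]):
    print(op, "->", pick_pattern(op))
```

The real legalizer also falls back to lower-benefit patterns when a higher-benefit one fails to match; this model only shows which candidate is tried first.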
The per-root conversion target
For each collected root op, the driver builds a ConversionTarget that:
- marks scf.for illegal (setOpAction(..., 1)) so it must be rewritten;
- marks scf.{If, Parallel, While, IndexSwitch} dynamically legal (legal iff already lowered);
- marks all five scf.{for, if, parallel, while, index_switch} recursively legal as containers (markOpRecursivelyLegal) so nested regions are walked;
- installs a catch-all markUnknownOpDynamicallyLegal (legality callback lambda #3) so any non-scf op is legal as-is.
applyFullConversion(rootOp, target, frozenScfPatterns) then drives the rewrite. A failure result signals pass failure (sets bit 2 of the pass flags at pass+40).
Per-Class Rewrite Bodies (Substage 2 — lowerFunc)
The uniform rewrite shape
lowerFunc (0x13568280) builds the SCTypeConverter (see that page) and registers ~118 *OpLowering conversion patterns, then runs applyFullConversion over a single func::FuncOp. Every one of those patterns follows the same five-step matchAndRewrite shape, regardless of class:
function matchAndRewrite(op, adaptor, rewriter):                 // generic ScDialect *OpLowering
    // (a) resolve operands: memref+index -> raw LLVM pointer/offset Value
    ptr = getStridedElementPtr(adaptor.memref, adaptor.indices)  // SC-specialised GEP
    vals = [ptr, adaptor.scalarOperands…]
    // (b) filter the op's attribute dictionary
    attrs = FilterLLVMAttributes(op.getAttrs())                  // 0x135b7a20, drops "access_groups"
    // (c) pick the leaf tpu_* intrinsic by dtype / memspace / predicate
    intr = selectIntrinsic(op, target)                           // class-specific dispatch key
    // (d) create the intrinsic
    res = intr::create(rewriter, op.getLoc(), {resultType}, vals, attrs)
    // (e) replace
    rewriter.replaceOp(op, res)
    return success
The attribute filter is uniform: FilterLLVMAttributes (0x135b7a20) forwards the ScDialect op's attribute dictionary to the new tpu_* intrinsic but drops access_groups — a 13-char attribute name decoded from the inlined string-compare immediates 0x675f737365636361 ("access_g") / 0x7370756f72675f73 ("s_groups"). The intrinsic regenerates its own LLVM AccessGroup metadata at LLVM lowering, so threading the source attribute would double-tag it.
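The two immediates can be decoded by hand to confirm the attribute name. The sketch below reads each 64-bit constant as a little-endian byte string (as x86-64 would load it) and joins the two windows. The assumption that the second window covers the last 8 bytes of the 13-byte name, overlapping the first by 3 bytes, is an inference from the lengths and not something decoded from the binary. The filter function is a model of the described behaviour, not the decompiled body.

```python
# The two constants are the immediates quoted in the text.
FIRST_IMM = 0x675F737365636361
SECOND_IMM = 0x7370756F72675F73

def decode_le(imm):
    """Interpret a 64-bit immediate as 8 little-endian ASCII bytes."""
    return imm.to_bytes(8, "little").decode("ascii")

def overlap_join(first, second, total_len):
    """Join two windows over a total_len-character string.

    Assumes `first` covers the start and `second` covers the final
    len(second) characters, so the windows may overlap in the middle.
    """
    start = total_len - len(second)
    shared = len(first) - start
    if first[start:] != second[:shared]:
        raise ValueError("windows disagree on the overlapping bytes")
    return first + second[shared:]

NAME = overlap_join(decode_le(FIRST_IMM), decode_le(SECOND_IMM), 13)

def filter_llvm_attributes(attrs):
    """Model of the filter: forward every attribute except the decoded name."""
    return {key: value for key, value in attrs.items() if key != NAME}

print(decode_le(FIRST_IMM), decode_le(SECOND_IMM), NAME)
```

The `alignment` key used in a quick check such as `filter_llvm_attributes({"access_groups": [], "alignment": 8})` is only an illustrative attribute name.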
Operand mapping is uniform too: an ScDialect memref+index operand becomes a single raw !llvm.ptr/offset Value via the SC-specialised getStridedElementPtr (or ConvertToLLVMPattern::getStridedElementPtr); scalar operands pass through. The "1:N" character of some classes is op-identity selection, not operand explosion: the SC type system encodes each HW variant as a distinct intrinsic, and the dispatch key picks which one.
GOTCHA: "the rewrite is 1:1" and "the class lowers to N intrinsics" are both true and not contradictory. A single matchAndRewrite call emits exactly one intrinsic (plus any helper ops); the "N" is the static number of candidate intrinsics the dispatch can choose among. DmaSimpleStartOp has one rewrite body but a 12-entry table of candidate tpu_dma_<src>_to_<dst>_sc_simple intrinsics (out of 16 such _sc_simple intrinsics registered in the binary); the body picks one per call. A reimplementer building a 1:1 op-to-intrinsic map will miss the dispatch and emit the wrong DMA.
Table 1 — the six representative rewrite bodies
One representative per functional class (CBREG contributes two rows, advance and read), byte-decoded from matchAndRewrite. → intrinsic is the tpu_*::create the body tail-calls; dispatch key is what selects the leaf.
| Class | Rewriter (matchAndRewrite @VA) | → leaf intrinsic(s) | Dispatch key | Algebra |
|---|---|---|---|---|
| CBREG advance | AdvanceCbOffsetOpLowering 0x1353bf80 | tpu_cbreg_add_offset::create | none (1:1) | descriptor read-modify-write (below) |
| CBREG read | ReadCbOffsetOpLowering 0x1353c400 | tpu_rdcbreg_offset::create 0x14734820 | none (1:1) | read OFFSET sub-register from cbreg → replaceOp |
| sync add | SyncAddOpLowering 0x13591660 | tpu_syncadd / _tile / _remote | SflagCoreType/SflagLocal + sequencer attr | local (V,V) / tile (V,V) / remote (V×5) |
| sync wait | SyncWaitOpLowering 0x13593040 | tpu_wait{ge,eq,ne,lt,le,gt} / waitdone / waitnotdone | comparison-predicate attr | predicate → pick wait intrinsic → replaceOp |
| DMA simple | DmaSimpleStartOpLowering 0x135a9100 | tpu_dma_<src>_to_<dst>_sc_simple (12-lambda table) | (srcMemSpaceID, dstMemSpaceID) tuple | vector<tuple<u32,u32,fn>> dispatch (below) |
| stream gather/scatter | LinearStreamStartOpLowering::rewriteSparseCoreStreamOpToLLVM 0x13542000 | tpu_stream_linear_<verb>_<src>_to_<dst>[_add] | (dtype, off-tile memspace, verb) lambda table | 16-entry dispatch (below) |
| addrspacecast | MemorySpaceCastOpLowering 0x135a5c20 | none (elide) or generic llvm.addrspacecast | convertType(src) == convertType(dst) | elide-or-fail (below) |
CBREG — the descriptor read-modify-write
AdvanceCbOffsetOpLowering (0x1353bf80) is richer than a flat 1:1 replace. The cbreg lives inside a CircularBufferDescriptor (a StructBuilder over the converted memref), so the body is a read-modify-write of that struct:
function AdvanceCbOffset::matchAndRewrite(op, adaptor, rewriter):   // 0x1353bf80
    desc = StructBuilder(adaptor.cbDescriptor)                      // dereference operand 0
    cbreg = desc.CbReg()                                            // CircularBufferDescriptor::CbReg
    delta = adaptor.deltaOperand                                    // operand 1
    newReg = tpu_cbreg_add_offset::create(rewriter, loc,            // 0x146d7e60 (in_place @0x146d7f60)
        cbreg.getType(), cbreg, delta)                              // offset += delta mod size in HW
    out = StructBuilder(adaptor.cbDescriptor)
    out.SetCbReg(newReg)                                            // write new register field back
    out.SetMemRef(desc.MemRef())                                    // carry the memref field through
    rewriter.replaceOp(op, out.value())
    return success
The HW semantics (offset wraps modulo the buffer size, the {base, offset, size} register bit layout) belong to the CBREG page; this body's contribution is the descriptor plumbing — the new offset is written into the same descriptor struct the memref already lives in, so the rest of the function keeps using one SSA value for the circular buffer.
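The descriptor plumbing can be sketched with immutable records. This is an illustrative model under stated assumptions: the `CbReg` fields mirror the {base, offset, size} triple named above, the string `memref` field stands in for the converted memref, and the modulo wrap models the hardware behaviour described in the text, not the intrinsic's actual encoding.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CbReg:
    base: int
    offset: int
    size: int

@dataclass(frozen=True)
class CircularBufferDescriptor:
    memref: str      # placeholder for the converted memref field
    cb_reg: CbReg

def cbreg_add_offset(reg, delta):
    """Stand-in for the intrinsic: advance the offset, wrapping modulo size."""
    return replace(reg, offset=(reg.offset + delta) % reg.size)

def advance_cb_offset(desc, delta):
    """Read the register field, compute the new register, and write it back
    into a fresh descriptor that carries the memref field through unchanged."""
    new_reg = cbreg_add_offset(desc.cb_reg, delta)
    return CircularBufferDescriptor(memref=desc.memref, cb_reg=new_reg)

before = CircularBufferDescriptor("buf", CbReg(base=0, offset=6, size=8))
after = advance_cb_offset(before, 5)
print(before.cb_reg.offset, "->", after.cb_reg.offset)
```

Because the records are frozen, the original descriptor is untouched and the result is a single new value, which mirrors the one-SSA-value-per-circular-buffer property the rewrite preserves.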
Sync / wait — attribute-driven dispatch
SyncWaitOpLowering (0x13593040) reads the op's comparison-predicate attribute and selects one of eight wait intrinsics: tpu_waitge (0x14a30aa0), waiteq (0x14a30a00), waitne (0x14a30dc0), waitlt (0x14a30d20), waitle (0x14a30c80), waitgt (0x14a30be0) — all 2-operand {sflag, threshold} compare forms — plus waitdone (0x14a30960) and waitnotdone (0x14a30e60), the 1-operand {sflag} done forms. SyncAddOpLowering (0x13591660) reads getSflagCoreType/getSflagLocal and the parent sequencer type to pick tpu_syncadd (local, V×2), tpu_syncadd_tile (tile bank, V×2), or tpu_syncadd_remote (ICI peer, V×5 — building a ChipIdOp + arith.IndexCast + LLVM::ConstantOp for the {device, core, id} routing). Both first resolve the sflag memref to a pointer via getStridedElementPtr and run ValidateSyncFlagsIndices.
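The predicate-to-intrinsic selection amounts to a table lookup. In this sketch the eight intrinsic names and their operand counts follow the text (all are assumed to share the `tpu_` prefix), while the predicate keys such as `"ge"` and `"done"` are hypothetical spellings; the actual encoding of the comparison-predicate attribute is not decoded on this page.

```python
# Intrinsic names and operand counts follow the text. The dictionary keys are
# hypothetical spellings of the comparison-predicate attribute.
WAIT_TABLE = {
    "ge": ("tpu_waitge", 2),
    "eq": ("tpu_waiteq", 2),
    "ne": ("tpu_waitne", 2),
    "lt": ("tpu_waitlt", 2),
    "le": ("tpu_waitle", 2),
    "gt": ("tpu_waitgt", 2),
    "done": ("tpu_waitdone", 1),
    "notdone": ("tpu_waitnotdone", 1),
}

def lower_sync_wait(predicate, sflag, threshold=None):
    """Pick the wait intrinsic for a predicate and assemble its operand list."""
    name, arity = WAIT_TABLE[predicate]
    if arity == 1:
        return name, [sflag]
    if threshold is None:
        raise ValueError(f"{name} needs a threshold operand")
    return name, [sflag, threshold]

print(lower_sync_wait("ge", "%sflag", "%limit"))
print(lower_sync_wait("done", "%sflag"))
```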
DMA — the memspace-pair tuple table
DmaSimpleStartOpLowering (0x135a9100) builds a std::vector<std::tuple<u32, u32, std::function<void()>>> of (srcMemSpaceID, dstMemSpaceID, create-lambda) rows, reads getSrcBufferMemorySpace/getDstBufferMemorySpace off the op, and runs the lambda whose (src, dst) pair matches. Each lambda creates the specific tpu_dma_<src>_to_<dst>_sc_simple intrinsic. The decompile of this body carries 12 distinct lambda closures (lambdas #1…#12), each materialized as a __call_func/__large_clone/__large_destroy triple — 36 std::function thunk symbols in the range 0x135ac0c0…0x135acb20 — confirming the table is built inline rather than as a static dispatch array. Before the dispatch, CastTileSmemPointerToSmem normalises any tile-resident endpoint to generic SMEM and the result feeds the CheckAddressSpaces legality gate (the "simple DMA on SMEM ⇒ SCS only" contract) — both owned by SCTypeConverter and the LowerToMlo DMA bridge.
NOTE: the create arity encodes the descriptor tier: tpu_dma_*_sc_simple takes 8 Values, _single_strided takes 11, _general takes 16 (tpu_dma_hbm_to_hbm_sc_simple 0x146d8c60; tpu_dma_hbm_to_hbm_sc_single_strided 0x146d9080; tpu_dma_hbm_to_hbm_sc_general 0x146d8820). The per-field role of those operands (base vs offset vs size vs stride vs sflag) is not decoded here; only the count and the memspace-pair dispatch are confirmed. The field semantics are charged to the descriptor-encoder analysis, not this lowering.
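The memspace-pair dispatch has the shape of a list of (src, dst, create) rows scanned for a matching pair. The sketch below shows only that shape: the numeric memory-space IDs are hypothetical, only three of the twelve rows are modelled, and apart from `tpu_dma_hbm_to_hbm_sc_simple` the generated intrinsic names are illustrative. Returning `None` when no row matches is also an assumption about the no-match behaviour.

```python
# Hypothetical memory-space IDs; the real values live on the Fat Pointers page.
HBM, SMEM = 1, 2

SIMPLE_TIER_ARITY = 8  # operand count of the _sc_simple tier, per the text

def make_create(src_name, dst_name):
    """Build the per-row closure that would create one specific intrinsic."""
    def create(operands):
        if len(operands) != SIMPLE_TIER_ARITY:
            raise ValueError("the _sc_simple tier takes 8 operands")
        return (f"tpu_dma_{src_name}_to_{dst_name}_sc_simple", tuple(operands))
    return create

# Rows of (srcMemSpaceID, dstMemSpaceID, create-lambda). Only the hbm/hbm
# intrinsic name is quoted in the text; the other rows are illustrative.
DMA_TABLE = [
    (HBM, HBM, make_create("hbm", "hbm")),
    (HBM, SMEM, make_create("hbm", "smem")),
    (SMEM, HBM, make_create("smem", "hbm")),
]

def lower_dma_simple(src_id, dst_id, operands):
    """Run the closure of the first row whose (src, dst) pair matches."""
    for src, dst, create in DMA_TABLE:
        if (src, dst) == (src_id, dst_id):
            return create(operands)
    return None  # assumed: no matching row means the pattern does not apply

print(lower_dma_simple(HBM, HBM, list(range(8)))[0])
```

One body with many candidate rows is exactly the "1:1 rewrite, N candidates" situation: each call returns a single intrinsic, chosen by the pair.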
Stream — the (dtype, memspace, verb) lambda table
LinearStreamStartOpLowering::rewriteSparseCoreStreamOpToLLVM (0x13542000) builds a 16-entry vector<tuple<dtype, off-tile-memspace, verb, lambda>> keyed by (dtype, off-tile memspace, verb=gather/scatter) and creates tpu_stream_linear_<verb>_<src>_to_<dst> (gather forms 0x148eb0a0; _hbm4b_ variants 0x148eae40; scatter 0x149368c0; indirect-gather-add 0x147922c0, V×8). It runs adjustOffsetForHbm4b (HBM_4B granularity), CheckLinearStreamIsValid, and CheckStartAndEndWithinMemref (bounds) before the dispatch. The per-leaf HW stream-engine opcode for each of the ~834 stream intrinsics is the SparseCoreStream slot encoding — see Stream Gather/Scatter; this body only selects which intrinsic.
addrspacecast — elide-or-fail
MemorySpaceCastOpLowering (0x135a5c20) is the smallest and most consequential body. It converts the source operand's type and the result type and compares:
function MemorySpaceCast::matchAndRewrite(op, adaptor, rewriter):   // 0x135a5c20
    srcT = TypeConverter::convertType(op.getSource().getType())     // -> !llvm.ptr<N_src>
    resT = TypeConverter::convertType(op.getType())                 // -> !llvm.ptr<N_dst>
    if srcT != resT:
        return failure                                              // -> generic emits llvm.addrspacecast
    rewriter.replaceOp(op, adaptor.source)                          // ELIDE: same ptr type, no cast
    return success
If the two converted types are equal the cast is elided (replaceOp(op, sourceValue) — no instruction emitted); if they differ the pattern returns failure, and MLIR's generic ConvertOpToLLVMPattern for memref.memory_space_cast emits a real llvm.addrspacecast. The decision is therefore entirely the SCTypeConverter's — the sequencer-context flatten (which collapses {sflag_tile, sflag_scs, sflag_tc, sflag} → 204 and {smem_tile, smem_scs, smem} → 0 outside sequencer functions) is what makes two distinct MemorySpaces convert to the same !llvm.ptr<N> and thus elide. That flatten table lives on SCTypeConverter; the downstream llvm.addrspacecast instruction selection is on addrspacecast ISel.
QUIRK: this pattern emits zero or one op, never the intrinsic family the other classes reach. It is the only *OpLowering whose "intrinsic" is sometimes nothing. A reimplementer must register the SC-specific elide pattern at higher benefit than the upstream generic memref.memory_space_cast → llvm.addrspacecast pattern, so the elide gets first refusal; only on its failure does the generic cast fire. Getting the benefit ordering wrong emits a redundant same-address-space cast that LLVM then has to fold away.
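The elide-or-fail decision, and its dependence on the flatten, can be modelled as follows. The flattened targets 204 and 0 are the values quoted above; every un-flattened ID in `RAW_ADDRSPACE`, including the `hbm` entry, is a hypothetical placeholder, since the real table lives on the SCTypeConverter page. The two-pattern chain models "elide first, generic cast on failure" and is a sketch of the ordering, not of MLIR's legalizer.

```python
# Flattened targets (204 and 0) come from the text. All raw IDs below are
# hypothetical placeholders used only to make the spaces distinguishable.
RAW_ADDRSPACE = {
    "sflag_tile": 301, "sflag_scs": 302, "sflag_tc": 303, "sflag": 204,
    "smem_tile": 401, "smem_scs": 402, "smem": 0,
    "hbm": 500,
}
SFLAG_FAMILY = {"sflag_tile", "sflag_scs", "sflag_tc", "sflag"}
SMEM_FAMILY = {"smem_tile", "smem_scs", "smem"}

def convert_type(space, in_sequencer_function):
    """Model of the type map: flatten the two families outside sequencer code."""
    if not in_sequencer_function:
        if space in SFLAG_FAMILY:
            return 204
        if space in SMEM_FAMILY:
            return 0
    return RAW_ADDRSPACE[space]

def elide_pattern(src, dst, in_seq):
    """Higher-benefit pattern: succeed with no op only if the types agree."""
    if convert_type(src, in_seq) != convert_type(dst, in_seq):
        return None                      # failure: let the next pattern try
    return []                            # success: zero ops emitted

def generic_pattern(src, dst, in_seq):
    """Lower-benefit fallback: always emits one cast."""
    return [("llvm.addrspacecast", convert_type(src, in_seq), convert_type(dst, in_seq))]

def lower_memory_space_cast(src, dst, in_seq):
    for pattern in (elide_pattern, generic_pattern):   # benefit order
        result = pattern(src, dst, in_seq)
        if result is not None:
            return result

print(lower_memory_space_cast("smem_tile", "smem_scs", in_seq=False))
print(lower_memory_space_cast("smem_tile", "smem_scs", in_seq=True))
```

Swapping the order of the two patterns in the tuple reproduces the failure mode described above: the generic pattern fires first and emits a cast between identical address spaces.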
The ~118-pattern roster (by class)
lowerFunc registers roughly 118 distinct (anonymous namespace)::*OpLowering classes (templated families — UnaryFloatVectorOpLowering, AluEpOpLowering — instantiate more than their class-name count). Rather than table all 118, the dispatch-dimension view:
| Group | Count | Representative members | Dispatch axis |
|---|---|---|---|
| STREAM | ~10 | LinearStream[Add]Start, StridedStream[Add]Start, IndirectStream[Add]Start, IndirectVectorStream[Add]Start | (dtype, off-tile memspace, verb) |
| DMA | 4 | DmaSimpleStart, DmaSingleStridedStart, DmaGeneralStart, DmaWait | (srcMemSpace, dstMemSpace) tuple |
| SYNC/WAIT | ~10 | SyncAdd, SyncSet, SyncSetRemote, SyncWait, MemoryWait, Sfence, FetchAndAdd, TileWaitScsSmem | predicate / local-tile-remote attr |
| CBREG | ~5 | CreateCb, ReshapeCb, AdvanceCbOffset[InPlace], ReadCbOffset | none (descriptor RMW) |
| TRANSCENDENTAL/EUP | ~42 | UnaryFloatVector (12), AluEp (30) — see SCTypeConverter | packed-operand → 1:1 macro vs 1:N unpack/pack |
| PACK/UNPACK | ~8 | PackF/SI/UI, UnpackF/SI/UI, UnpackI32Pair, CarryOut | element format |
| SCAN/SORT | ~4 | Scan, SegmentedScan, Sort, DuplicateCountUnique | — |
| VECTOR MEM | ~10 | VectorLoad[Idx], VectorStore[Idx], VolatileLoad/Store, ZeroMem, VectorBroadcast/Extract | memspace |
| LANE/PERMUTE | ~6 | Permute, Vlaneseq, VShiftInsert, VectorMaskCount{TrailingZeros,Population} | — |
| ADDRESS/PTR | ~12 | AddressOf, AllocateAtOffset, FoldOffsetIntoPtr, GetRemoteMemRef, MemRefView, MemorySpaceCast, SCSubView, SCView, HbmStackStartOffset | type-converter equality (cast) |
| CONTROL/TASK | ~14 | Barrier, Lock, Unlock, LaunchTileTask, Assert, GetCoreLocation, GetDynamic{DeviceAssignment,DimensionSize}, SetPTState, SetTag, SetTraceMark | — |
| TRACE/TELEMETRY | ~8 | Trace, LogEvent, LogMemRef, Read{Global,Local}CycleCount, SetDmaCredit, SetIndirectFilterValue | — |
The EUP group (UnaryFloatVectorOpLowering 1:1 macro vs AluEpOpLowering 1:N unpack/compute/pack, selected by IsDynamicallyLegal) is fully tabled on SCTypeConverter (Tables B/C) — it is not re-tabled here.
The Assert / MemRef-Finalize Lowering (Substage 3)
Purpose
After substage 2 has turned every func.func body into llvm.func with llvm/llvm_tpu ops, two residues remain: cf.assert ops (which substage 1's scf→cf may have produced, or which were already present) and any unfinalized memref ops. Substage 3 walks each LLVM::LLVMFuncOp, and if the walk signals work is needed, runs one final applyFullConversion that lowers cf.assert and finalizes memref-to-LLVM.
The conversion campaign
function lowerAsserts_finalize(module):                              // tail of runOnOperation
    opts = LowerToLLVMOptions(ctx)
    opts.useBarePtrCallConv = true
    opts.dataLayout = DataLayout("…-S64")                            // "-S64": 64-bit natural stack alignment
    tc = LLVMTypeConverter(ctx, opts)
    tc.registerTypeAttributeConversion(lowerAsserts_memSpaceLambda)  // SC mem-space attr (no flatten)
    patterns = RewritePatternSet(ctx)
    patterns.add<AssertOpLowering>(benefit=1)                        // 0x135b7080 "cf.assert"
    populateFinalizeMemRefToLLVMConversionPatterns(tc, patterns)
    target = LLVMConversionTarget(ctx)
    target.addLegalDialect("llvm_tpu")                               // SC intrinsic dialect is the legal target
    target.addIllegalOp("builtin.module")
    target.addDynamicallyLegalOp("llvm.alloca")                      // alloca legal only post-finalize
    target.addDynamicallyLegalOp("builtin.unrealized_conversion_cast")  // bridge-cast passthrough
    applyFullConversion(module, target, patterns)
AssertOpLowering (0x135b7080) is a ConvertToLLVMPattern over cf.assert that carries the target's GranuleBytes — it computes a granule mask -1 << (log2(stackBytes) − log2(GranuleBytes)) (the _BitScanReverse64 pair on the target's +88 field versus GranuleBytes), which aligns the assert's spill/scratch use to the SparseCore granule. The lowerAsserts memory-space lambda is the same MemorySpaceAttr → i64 addrspace conversion as substage 2's SCTypeConverter but without the sequencer-context flatten override — assertion text wants to name the exact per-tile/per-SCS bank, so it uses the un-flattened IDs. Both lambdas are documented on SCTypeConverter.
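The granule-mask formula can be worked through numerically. In this sketch `floor_log2` returns the index of the highest set bit, which is what a bit-scan-reverse yields; the 64-bit mask width and the example sizes (4096-byte stack, 32-byte granule) are assumptions chosen for illustration, not values read from the binary.

```python
def floor_log2(value):
    """Index of the highest set bit, i.e. what a bit-scan-reverse returns."""
    if value <= 0:
        raise ValueError("value must be positive")
    return value.bit_length() - 1

def granule_mask(stack_bytes, granule_bytes, width=64):
    """Compute -1 << (log2(stack_bytes) - log2(granule_bytes)) as an unsigned mask.

    The 64-bit width is an assumption about the mask's register size.
    """
    shift = floor_log2(stack_bytes) - floor_log2(granule_bytes)
    if shift < 0:
        raise ValueError("granule larger than stack")
    return (-1 << shift) & ((1 << width) - 1)

# Hypothetical sizes: 4096-byte stack, 32-byte granule -> shift of 12 - 5 = 7.
mask = granule_mask(4096, 32)
print(hex(mask))
```

With these example sizes the low 7 bits of the mask are clear, so AND-ing a value with it rounds that value down to a multiple of 128.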
NOTE: the llvm_tpu dialect is marked a legal dialect in this final target, not illegal. By substage 3, the SC intrinsics emitted in substage 2 are the desired output, so the assert/memref finalize must treat them as legal-as-is and only rewrite the remaining cf.assert/memref/builtin.module structure. The builtin.unrealized_conversion_cast is held dynamically legal so any bridge-cast still in transit from the LowerToMlo DMA bridge passes through untouched.
Related Components
| Name | Relationship |
|---|---|
| SCTypeConverter (lowerFunc 0x13568280) | builds the type map and EUP roster this page's rewrite bodies ride on |
| FilterLLVMAttributes (0x135b7a20) | the uniform attribute filter every rewrite body calls (drops access_groups) |
| getStridedElementPtr | memref+index → raw !llvm.ptr used by DMA/sync/stream operand resolution |
| CheckAddressSpaces (0x135b8e00) | the DMA legality gate DmaSimpleStartOpLowering consults (owned by SCTypeConverter) |
| tpu_*::create family | the 1356-intrinsic leaf set the rewrite bodies tail-call (LlvmTpu catalog) |
| populateSCFToControlFlowConversionPatterns | upstream scf→cf patterns the custom ForLowering/IndexSwitchLowering augment |
Cross-References
- SCTypeConverter: the address-space → !llvm.ptr type map, the sequencer-context flatten, CheckAddressSpaces, and the 42-instantiation EUP roster the rewrite bodies consume (do not duplicate).
- Fat Pointers (AS7/8/9): the complete AS-id ↔ MemorySpace ↔ pool table the dispatch keys index.
- addrspacecast ISel: the llvm.addrspacecast instruction selection that MemorySpaceCastOpLowering falls through to on a type mismatch.
- LowerToMlo DMA Bridge-Cast: the prior two-stage DMA lowering whose tile-expanded output this pass finalizes; source of any in-transit bridge-cast.
- DialectConversion Legalizer: the ConversionTarget/applyFullConversion machinery the three substages drive.
- ConversionPatternRewriter: the speculative-apply/rollback rewriter the matchAndRewrite bodies call replaceOp on.
- LlvmTpu Intrinsic Catalog: the 1356 tpu_* intrinsics that are the leaf targets of every rewrite body.
- CBREG: the {base, offset, size} circular-buffer register the CBREG rewrite bodies read and write.
- Stream Gather/Scatter: the SparseCoreStream slot encoding behind the ~834 stream intrinsics the stream dispatch selects.
- SCS Engine: the Scalar Core Sequencer the sync/DMA legality contracts reference.
- The tpu MLIR Dialect: the op-registration ABI for the tpu/sparse_core ops these patterns rewrite.
- Compiler Overview: Part V orientation; where SC lowering sits in the five-phase descent.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part V (Compiler: Lowering & Optimization Passes) / MLIR lowering chain