LowerToMlo DMA Bridge-Cast
All addresses, symbols, and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped, .text VA == file offset; build libtpu_lts_20260413_b_RC00). Other versions will differ; treat every VA as version-pinned.
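Because .text VA equals file offset in this build, a VA quoted on this page can be used directly as a seek offset into the file. The sketch below relies on that property and uses the quoted file size as a cheap pin check. The function names are hypothetical, and a size match is weaker evidence than a build-id match.

```python
import os

# Values quoted from the version note above.
PINNED_SIZE_BYTES = 781_691_048
CREATE_LOWER_TO_MLO_PASS_VA = 0x1322ADC0


def looks_like_pinned_build(path: str, expected_size: int = PINNED_SIZE_BYTES) -> bool:
    """Return True when the file size matches the pinned build.

    This is only a cheap sanity check; it does not verify the build-id.
    """
    return os.path.getsize(path) == expected_size


def read_text_bytes(path: str, va: int, count: int) -> bytes:
    """Read `count` bytes at a .text VA, assuming VA == file offset."""
    with open(path, "rb") as f:
        f.seek(va)
        return f.read(count)
```

If the size check fails, the VAs on this page should not be trusted for that file.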
Abstract
createLowerToMloPass (0x1322adc0) is the MLIR ModuleOp pass that drops the tpu dialect onto the SparseCore mid-level (Mlo / sparse_core) dialect. It is the SparseCore sibling of tpu → LLO ODS (which is the TensorCore FunctionPass descent into LLO). Where LowerToLLO lowers every tpu op to an LLO target in one shot, LowerToMlo is a two-stage lowering: most tpu ops lower directly to sc_tpu.* (sparse_core dialect) / arith / memref ops in this pass, but the DMA-and-sync family — tpu.enqueue_dma, tpu.enqueue_indirect_dma, tpu.wait_dma2, tpu.fetch_and_add_sync — does not lower here. Those ops materialize a tagged builtin.unrealized_conversion_cast (the bridge-cast) that defers the real DMA emission to a downstream SparseCore pass.
This page owns the LowerToMlo-specific machinery of that bridge:
- The 2-stage DMA bridge-cast: how tpu.enqueue_dma (and the three sibling DMA/sync ops) lowers to a per-operand builtin.unrealized_conversion_cast tagged sc.unlowering, with the source op marked sc.unlowered, and what the cast carries forward.
- The block-signature conversion: the func::FuncOp dynamic-legality predicate that detects a SparseCore sequencer function and forces its signature through the TypeConverter, plus the UnrealizedConversionCastOp legality guard that lets the bridge-cast pass through the legalizer in transit.
- The deferred-DMA materialization resolution: how the tagged bridge-cast survives applyFullConversion and is later consumed by substituteUnloweringConversionCastOp (0x134e73e0) inside ExpandTiledMemRefsPass to emit the real SparseCore DMA ops.
The generic speculative-apply / rollback engine that makes the legalization safe (the IRRewrite action log, undoRewrites / resetState / applyRewrites, the UnresolvedMaterializationRewrite record) is not on this page — it lives on ConversionPatternRewriter. This page covers only the LowerToMlo bridge, the LowerToMlo legality predicates, and the deferred-DMA resolution.
| Item | Value |
|---|---|
| Pass factory | mlir::tpu::createLowerToMloPass(xla::jellyfish::Target const&, sparse_core::LowerToMloPassContext*) — 0x1322adc0 |
| Pass entry | mlir::tpu::(anon)::LowerToMloPass::runOnOperation() — 0x1322b200 |
| Pass kind | ModuleOp pass (jaxlib::mlir::Pass<LowerToMloPass, mlir::ModuleOp>) — not a FunctionPass |
| Driver | mlir::applyFullConversion(module, MloConversionTarget, frozenPatterns, ConversionConfig) — not partial |
| Target dialect | sparse_core (the Mlo mid-level) + arith/memref/scf/vector/cf/func/math/LLVM |
| Legal-dialect set | MloConversionTarget ctor 0x13245900 (vtable 0x21903b98): 11 dialect TypeIDs |
| Pattern set | populateTpuToMloConversionPatterns (0x1322c920) — 44 functional ConversionPattern lambdas + 1 catch-all |
| DMA bridge ops | tpu.enqueue_dma (0x13239e00), tpu.enqueue_indirect_dma (0x1323a660), tpu.wait_dma2 (0x1323ae00), tpu.fetch_and_add_sync (0x1323b600) |
| Bridge target | builtin.unrealized_conversion_cast tagged sc.unlowering (per-operand) + source op sc.unlowered |
| Stage-2 consumer | substituteUnloweringConversionCastOp (0x134e73e0) under ExpandTiledMemRefsPass |
| Source provenance | .rodata string platforms/xla/mosaic/dialect/tpu/transforms/lower_to_mlo.cc (line 405 in the EnqueueDMA body) |
Why Two Stages — The Direct-vs-Bridge Split
Purpose
Every tpu op LowerToMlo handles falls into one of two buckets. Establishing the split first is what makes the rest of the page legible: the bridge-cast is not a fallback, it is a deliberate deferral for the one op family whose final shape is not known until tile expansion has run.
The split
populateTpuToMloConversionPatterns (0x1322c920) installs 44 functional ConversionPattern lambdas (signature LogicalResult(SrcOp, SrcOpAdaptor, ConversionPatternRewriter&)). 40 of them lower their source op directly to one or more sc_tpu.* (Mlo / sparse_core dialect) ops — frequently a 1:N explosion (e.g. tpu.iota → sc_tpu.vlaneseq [+ arith.index_cast], tpu.device_id → 6 ops, tpu.delay → an scf.for nest of sc_tpu.sdelay). Those direct lowerings are the SparseCore analogue of the LowerToLLO bodies and are surveyed on tpu → LLO ODS for the TensorCore side; this page does not re-table them.
The remaining four ops are the DMA-and-sync family. They lower to nothing in LowerToMlo. Instead each emits a builtin.unrealized_conversion_cast — the bridge-cast — that carries the original op's operands forward in the converted (Mlo) type system and is tagged for a later pass to find.
| tpu source op | LowerToMlo lambda | Lowers to | Stage |
|---|---|---|---|
| tpu.enqueue_dma | 0x13239e00 | tagged builtin.unrealized_conversion_cast | bridge |
| tpu.enqueue_indirect_dma | 0x1323a660 | tagged builtin.unrealized_conversion_cast | bridge |
| tpu.wait_dma2 | 0x1323ae00 | tagged builtin.unrealized_conversion_cast | bridge |
| tpu.fetch_and_add_sync | 0x1323b600 | tagged builtin.unrealized_conversion_cast | bridge |
| tpu.wait_indirect_dma | 0x1323be00 | sc_tpu.stream_wait (StreamWaitOp) | direct |
NOTE: tpu.wait_indirect_dma (0x1323be00) is not a bridge op; it lowers directly, emitting a sparse_core::StreamWaitOp (sc_tpu.stream_wait) after computing dynamic sizes and granule counts (lowering_util::GetDynamicSizes / AssertAlignmentAndGetNumGranules). Only the four bridge ops above defer. The reason for the split is structural: an enqueue_dma's final SparseCore form (rolled / retiled transfer loop, granule decomposition, host-IOVA vs intra-chip routing) depends on the tiled memref layout, which is not resolved until ExpandTiledMemRefsPass runs. LowerToMlo cannot know the shape, so it preserves the operands behind a typed cast and lets the tile-aware pass finish the job. This is confirmed by the consumer being ExpandTiledMemRefsPass::addPattern<tpu::EnqueueDMAOp> (0x134eef60), addPattern<tpu::WaitDMA2Op> (0x134ef720), addPattern<tpu::FetchAndAddSyncOp> (0x134f1240), and addPattern<tpu::EnqueueIndirectDMAOp> (0x134ef340): the four bridge ops are exactly the four ops that pass re-handles.
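The invariant stated in the note (the bridge set equals the set of ops the stage-2 pass re-handles) can be written down as a small lookup. This is an illustrative sketch: the VAs are copied from this page, while the dictionary and function names are hypothetical.

```python
from typing import Optional

# Stage-1 bridge lambdas (VAs from the table above).
BRIDGE_LAMBDAS = {
    "tpu.enqueue_dma": 0x13239E00,
    "tpu.enqueue_indirect_dma": 0x1323A660,
    "tpu.wait_dma2": 0x1323AE00,
    "tpu.fetch_and_add_sync": 0x1323B600,
}

# Stage-2 ExpandTiledMemRefsPass::addPattern<...> registrations.
STAGE2_PATTERNS = {
    "tpu.enqueue_dma": 0x134EEF60,
    "tpu.enqueue_indirect_dma": 0x134EF340,
    "tpu.wait_dma2": 0x134EF720,
    "tpu.fetch_and_add_sync": 0x134F1240,
}

# Only the one direct lowering named in the table; the other direct ops are omitted.
DIRECT_EXAMPLES = {"tpu.wait_indirect_dma": 0x1323BE00}


def stage_of(op_name: str) -> Optional[str]:
    """Classify an op as 'bridge' or 'direct'; None if not listed here."""
    if op_name in BRIDGE_LAMBDAS:
        return "bridge"
    if op_name in DIRECT_EXAMPLES:
        return "direct"
    return None


def bridge_set_matches_stage2() -> bool:
    """Check that every bridged op is re-handled in stage 2, and vice versa."""
    return set(BRIDGE_LAMBDAS) == set(STAGE2_PATTERNS)
```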
Stage 1 — The DMA Bridge-Cast (tpu.enqueue_dma → tagged cast)
Purpose
This is the heart of the page: the exact emission a reimplementer must reproduce so that the deferred op survives full conversion and is recoverable downstream. The tpu.enqueue_dma lambda body (0x13239e00) is decompiled in full; the other three bridge ops follow the same shape.
Algorithm
The lambda walks the op's operands, replaces each operand whose converted type differs from its source type with a per-operand builtin.unrealized_conversion_cast (tagged sc.unlowering), then marks the whole source op with sc.unlowered and re-inserts it as legal-in-transit. It does not erase or replace the op — the op survives, now type-bridged.
// LowerToMlo EnqueueDMAOp lambda @ 0x13239e00
// (lower_to_mlo.cc:405; same shape for EnqueueIndirectDMA / WaitDMA2 / FetchAndAddSync)
LogicalResult lower_enqueue_dma(EnqueueDMAOp op, EnqueueDMAOpAdaptor adaptor,
                                ConversionPatternRewriter& rw):
    srcOperands = op.getOperands()            // type-pointer list @ op+0x48
    newOperands = adaptor remapped operands   // ValueRange (already type-converted)
    // --- fast pre-scan: are ALL operand types already equal? ---
    // compares (type_ptr & ~7) of src[i] vs remapped[i]   (line 102 / 137)
    if every src operand type == its remapped type:
        goto NO_REWRITE                       // nothing to bridge; op already legal
    rw.startRootUpdate(op)                    // [rw_vtable+0x28]   (line 111)
    for i, (srcVal, newVal) in zip(srcOperands, newOperands):
        if (srcVal.type & ~7) != (newVal.type & ~7):   // type changed
            loc = LocationGenerator::Visitor(op.getLoc())   // sc-location synthesis
            castTy = TypeRange{ newVal.type }               // 1 result type (line 162)
            cast = UnrealizedConversionCastOp::create(rw, loc, castTy, srcVal)   // (line 175)
            cast->setAttr("sc.unlowering", UnitAttr::get(ctx))                   // (line 176-182)
            // splice: replace srcOperand[i] use with the cast result (intrusive list, line 183-199)
            op.setOperand(i, cast.getResult(0))
    op->setAttr("sc.unlowered", UnitAttr::get(ctx))   // mark source op (line 212-219)
    rw.finalizeRootUpdate(op)                 // [rw_vtable+0x30]   (line 221)
NO_REWRITE:
    // trace-region tagging (independent of the bridge)
    if pass.insideTraceRegion:                // v62[71] (line 224)
        op->setAttr("sc.inside_trace_region", UnitAttr::get(ctx))   // (line 226-232)
    return success                            // return 1 (line 234)
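The per-operand selection in the loop above can be modeled in isolation. The sketch below is a toy model, not the decompiled code: types are represented as plain integers standing in for type pointers, the Cast record is hypothetical, and only the selection and the sc.unlowering tag are modeled (the op-level tagging and rewriter bookkeeping are left out).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

LOW_BITS_MASK = ~7  # the comparison ignores the low three bits of each type pointer


@dataclass
class Cast:
    """Stand-in for one emitted builtin.unrealized_conversion_cast."""
    operand_index: int
    result_type: int
    attrs: Dict[str, bool] = field(default_factory=lambda: {"sc.unlowering": True})


def same_type(a: int, b: int) -> bool:
    """Compare two type pointers after masking off the low three bits."""
    return (a & LOW_BITS_MASK) == (b & LOW_BITS_MASK)


def bridge_operands(src_types: Sequence[int], new_types: Sequence[int]) -> List[Cast]:
    """Return one tagged cast per operand whose type actually changed."""
    if len(src_types) != len(new_types):
        raise ValueError("operand lists must have the same length")
    return [
        Cast(i, new & LOW_BITS_MASK)
        for i, (src, new) in enumerate(zip(src_types, new_types))
        if not same_type(src, new)
    ]
```

For example, `bridge_operands([0x1001, 0x2000], [0x1003, 0x3000])` yields a single cast for operand 1: operand 0 differs only in its low bits and is treated as unchanged.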
The three attribute tags
The emission deposits up to three UnitAttr markers — all confirmed as .rodata string literals consumed by the stage-2 pass:
| Attribute | Set on | Meaning |
|---|---|---|
| sc.unlowering | each emitted builtin.unrealized_conversion_cast | "this cast bridges a not-yet-lowered operand into the Mlo type system; the stage-2 pass owns it" |
| sc.unlowered | the source tpu.enqueue_dma op itself | "this op was intentionally left un-lowered by LowerToMlo; expand it during tile expansion" |
| sc.inside_trace_region | the source op, only if the pass is mid-trace-region | propagates trace nesting so the expanded DMA stays inside the trace scope |
NOTE: the bridge resolves in ExpandTiledMemRefsPass, not LowerToSparseCoreLlvm. The decompiled 0x13239e00 body sets the tags at fixed sites: sc.unlowering on the per-operand cast (UnitAttr, line 182), sc.unlowered on the source op (line 219), and sc.inside_trace_region when nested (line 232). The downstream consumer is substituteUnloweringConversionCastOp (0x134e73e0), registered by ExpandTiledMemRefsPass, so the bridge resolves one stage earlier than the LowerToSparseCoreLlvmPass (0x13566d00) that the tag names might suggest.
GOTCHA: the bridge is per-operand selective, not whole-op. The pre-scan (line 95-108) and the inner loop (line 137) compare the type pointer of each source operand, with its low three bits masked off, against that of its remapped operand; only operands whose type actually changed get a cast. An enqueue_dma whose operands all keep their type emits no cast and merely gets the sc.unlowered marker. A reimplementation that casts every operand unconditionally will emit dead identity casts that the stage-2 pass must then strip.
NOTE: the lambda uses startRootUpdate / finalizeRootUpdate (the in-place modification protocol, rw_vtable+0x28 / +0x30), not replaceOp. The mutation is therefore logged as a ModifyOperationRewrite record (the in-place-modification record on ConversionPatternRewriter), and the inserted casts as CreateOperationRewrite + UnresolvedMaterializationRewrite records, so the whole bridge emission is rollback-safe under the speculative legalizer.
Stage 1b — Block-Signature Conversion (the func::FuncOp legality predicate)
Purpose
Before any DMA op is reached, LowerToMlo must decide which functions need their signature rewritten through the TypeConverter. SparseCore code is structured as sequencer functions whose argument/result types (memref memory-space attrs, TupleType 1:N pairs, WordType) must be converted; ordinary functions are left alone. This is the block-signature side of the lowering, and it is driven entirely by a dynamic-legality predicate.
The three dynamic-legality predicates
runOnOperation (0x1322b200) installs three legality callbacks via addDynamicallyLegalOp (the call census shows four addDynamicallyLegalOp invocations + two static setOpAction; the four resolve to the three distinct predicate lambdas below, one of which is registered for two op classes). All three return a 16-bit value where bit 0 = legal-now and bit 8 (0x100) = "answer present" (the form ConversionTarget::isLegal reads).
| Predicate | Lambda | Op(s) it gates |
|---|---|---|
| $_0 FuncOp signature | 0x13231300 | func::FuncOp |
| $_1 cast-in-transit | 0x13231560 | builtin.UnrealizedConversionCastOp |
| $_2 result-type catch-all | 0x132315e0 | the OpResultTypeConversionPattern target |
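The 16-bit return convention shared by the three predicates can be captured in a pair of helpers. This is a minimal sketch of the encoding as described above (bit 0 = legal-now, bit 8 = answer present). Treating a value with bit 8 clear as "no answer" is an assumption, and the helper names are hypothetical.

```python
from typing import Optional

LEGAL_NOW = 0x001       # bit 0
ANSWER_PRESENT = 0x100  # bit 8


def encode_legality(legal: Optional[bool]) -> int:
    """Pack an optional legality verdict into the 16-bit form.

    None (no answer) is encoded as 0; that choice is an assumption.
    """
    if legal is None:
        return 0
    return ANSWER_PRESENT | (LEGAL_NOW if legal else 0)


def decode_legality(value: int) -> Optional[bool]:
    """Unpack the 16-bit form back into True / False / None."""
    if not value & ANSWER_PRESENT:
        return None
    return bool(value & LEGAL_NOW)
```

Under this reading, 0x101 decodes to legal and 0x100 to illegal, which matches the constants used in the predicate bodies below.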
$_0 — func::FuncOp legality (0x13231300)
// func::FuncOp dynamic legality @ 0x13231300
optional<bool> func_is_legal(FuncOp op):
    core  = TPUDialect::GetCoreTypeAttr(op)            // 0x14aa6020
    seqTy = LowerMemrefToMlo::getSequencerType(op)     // 0x13507760
    // is this a SparseCore SEQUENCER function? (op-name compares vs "sc"/"execute")
    if not is_sequencer(core, seqTy):
        return legal                                   // 0x101 (answer present, legal): no signature rewrite
    // it IS a sequencer: walk the signature, every type must already be converted
    for ty in FunctionType.getInputs() ++ getResults():
        if typeConverter.convertType(ty) != ty:
            return illegal                             // signature still needs conversion
    return legal_recursive                             // 0x101 when input/result lists empty
A function is legal as-is (no signature rewrite) unless it is a SparseCore sequencer function — detected by TPUDialect::GetCoreTypeAttr (0x14aa6020) plus LowerMemrefToMlo::getSequencerType (0x13507760), which test the function's core-type attribute and name against the ASCII tokens "sc" and "execute". For a sequencer function, the predicate walks getInputs() ++ getResults() and calls TypeConverter::convertType on each: the function is legal iff every arg and result type already converts to itself. If any type still needs conversion the FuncOp is illegal, which fires the FuncOp signature-conversion pattern (0x13231ca0) to rewrite the block signature through the converter.
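The rule in the paragraph above reduces to a short predicate. The sketch below is a toy model under stated assumptions: types are strings, the sequencer test is passed in as a boolean rather than derived from the core-type attribute and name, and `toy_convert` is a hypothetical stand-in for the real TypeConverter.

```python
from typing import Callable, Sequence


def func_is_legal(
    is_sequencer: bool,
    inputs: Sequence[str],
    results: Sequence[str],
    convert_type: Callable[[str], str],
) -> bool:
    """A function is legal unless it is a sequencer with unconverted types."""
    if not is_sequencer:
        return True  # ordinary functions are left alone
    # Legal iff every argument and result type already converts to itself.
    return all(convert_type(ty) == ty for ty in [*inputs, *results])


def toy_convert(ty: str) -> str:
    """Hypothetical converter: rewrites one made-up source type."""
    return {"memref<tiled>": "memref<mlo>"}.get(ty, ty)
```

Note that a sequencer function with an empty signature is vacuously legal here, since `all` over an empty list is True.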
NOTE: the FuncOp signature-conversion pattern (0x13231ca0) is what performs the actual block-argument rewrite. The 1:N type expansions (a TupleType argument splitting into multiple SparseCore scalars, an I32Pair) drive block-argument multiplication; the rollback of those block-level edits (BlockTypeConversionRewrite) is the block-rewrite half of the action log documented on ConversionPatternRewriter.
$_1 — UnrealizedConversionCastOp legality (0x13231560)
This is the predicate that lets the bridge-cast pass through the legalizer in transit. Without it, the unrealized cast the DMA lambda just emitted would itself be reported illegal and the pass would abort.
// UnrealizedConversionCastOp dynamic legality @ 0x13231560
optional<bool> cast_is_legal(UnrealizedConversionCastOp op):
    attr = op.getInherentAttr("sc.unlowering")   // len 0xd; falls back to DictionaryAttr lookup
    if attr is absent:
        return 0x100          // "answer present", low bit clear → ILLEGAL
    // legal iff that attribute is the UnitAttr the bridge emits
    return (attr.getTypeID() == UnitAttr::id) | 0x100
A cast is legal only when it carries the sc.unlowering UnitAttr — exactly the tag the DMA bridge lambda deposits on each per-operand cast at 0x13239e00. Any unrealized cast without that tag (e.g. a stray reconciliation cast the type-converter inserts at a signature boundary) is reported illegal (0x100 with the low bit clear) and must be materialized by the rewriter. The predicate is thus a direct lock-and-key pairing with the Stage-1 emission: the lambda writes sc.unlowering, this predicate admits exactly those casts.
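The lock-and-key pairing can be illustrated with a toy attribute dictionary. This is a sketch only: the UnitAttr class below is a sentinel standing in for the MLIR attribute, and the attribute lookup is modeled as a plain dict access.

```python
from typing import Any, Dict


class UnitAttr:
    """Sentinel standing in for MLIR's UnitAttr."""


def cast_is_legal(attrs: Dict[str, Any]) -> int:
    """Return 0x101 for a cast tagged with a UnitAttr sc.unlowering, else 0x100."""
    attr = attrs.get("sc.unlowering")
    if attr is None:
        return 0x100  # answer present, low bit clear: illegal
    # Legal only when the tag is the UnitAttr the bridge lambda writes.
    return 0x100 | int(isinstance(attr, UnitAttr))
```

A cast whose sc.unlowering attribute has some other kind (a string, for instance) is still rejected, mirroring the TypeID comparison in the pseudocode.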
$_2 — OpResultTypeConversionPattern target legality (0x132315e0)
// runOnOperation $_2 @ 0x132315e0
optional<bool> result_types_legal(Operation* op):
    // fast accept: op's own dialect is in the target's legal-dialect DenseSet
    if op.getDialect().getTypeID() in MloConversionTarget.legalDialects:
        goto CHECK_RESULTS
    // the four DMA bridge ops are legal iff already tagged sc.unlowered (UnitAttr)
    if op is {EnqueueDMAOp, EnqueueIndirectDMAOp, WaitDMA2Op, FetchAndAddSyncOp}:
        u = op.getInherentAttr("sc.unlowered")   // len 0xc
        return u && (u.getTypeID() == UnitAttr::id) ? 0x101 : 0x100
    // memref-reshaping ops (EraseLayout / MemRefSqueeze / MemRefBitcast /
    //   MemRefSlice / ReinterpretCast / AssumeMultiple) are legal only when
    //   no operand/result still carries a tpu::TiledLayoutAttr memref
    ...                                           // TiledLayoutAttr guards
CHECK_RESULTS:
    for ty in op.getResultTypes():
        if typeConverter.convertType(ty) != ty:
            return 0x100                          // a result still needs conversion
    return 0x101                                  // legal (0x101 when no results)
The catch-all OpResultTypeConversionPattern (the MatchAnyOpTypeTag pattern, vtable 0x21903b40, matchAndRewrite 0x132456e0 → mlir::convertOpResultTypes 0x1c9572c0) is the generic 1:1 result-type fixer for any op the explicit patterns do not cover: when an op's operands were remapped, this pattern converts the op's result types. $_2 is its legality gate — after the dialect / bridge-op / tiled-layout fast paths, an op is legal iff TypeConverter::convertType maps every result type to itself (i.e. no result still needs conversion).
NOTE (low confidence): $_2 is not a pure result-type predicate. The decompiled body at 0x132315e0 first short-circuits on the op's dialect membership in the MloConversionTarget legal-dialect set, then special-cases the four DMA bridge ops (legal once sc.unlowered-tagged) and the memref-reshaping ops (gated on tpu::TiledLayoutAttr), and only the fall-through path runs the convertType(ty) == ty result-type loop. The pseudocode above is the verified shape; the exact branch ordering of the tiled-layout guards is summarized, not byte-decoded.
Stage 2 — Deferred-DMA Materialization Resolution
Purpose
The bridge-cast is only half the story; the deferral has to be resolved eventually. This unit pins where and how the tagged cast becomes a real SparseCore DMA — closing the loop the LowerToMlo lambda opens.
Resolution
applyFullConversion (the LowerToMlo driver) finishes with the four DMA ops still present, now marked sc.unlowered, their changed operands fed by sc.unlowering-tagged builtin.unrealized_conversion_cast ops, and the cast-legality predicate ($_1) having declared those casts legal-in-transit so the pass does not abort on them.
A later SparseCore pass, ExpandTiledMemRefsPass, owns stage 2. It registers conversion patterns for exactly the four bridge ops:
| Stage-2 registration | VA |
|---|---|
| ExpandTiledMemRefsPass::addPattern<tpu::EnqueueDMAOp> | 0x134eef60 |
| ExpandTiledMemRefsPass::addPattern<tpu::EnqueueIndirectDMAOp> | 0x134ef340 |
| ExpandTiledMemRefsPass::addPattern<tpu::WaitDMA2Op> | 0x134ef720 |
| ExpandTiledMemRefsPass::addPattern<tpu::FetchAndAddSyncOp> | 0x134f1240 |
| sparse_core::expandTPUFetchAndAddSync(FetchAndAddSyncOp, adaptor, rw) | 0x134e60c0 |
| sparse_core::substituteUnloweringConversionCastOp(UnrealizedConversionCastOp, adaptor, rw) | 0x134e73e0 |
The cast resolver substituteUnloweringConversionCastOp (0x134e73e0) is the function that finds an sc.unlowering-tagged cast and substitutes it — undoing the bridge once the tiled-memref layout is known, so the now-shaped DMA op can be expanded into the real SparseCore transfer ops (the rolled / retiled / granule-decomposed transfer chain). Because ExpandTiledMemRefsPass runs after the memref tiling is resolved, this is the first point at which the DMA's final form is computable — which is exactly why LowerToMlo deferred it.
NOTE: the real DMA emission (the LloRegionBuilder::EnqueueDmaGeneral / EnqueueDmaInGranules / EnqueueDmaToHostIova family, 0x1d543600 / 0x1d546700 / 0x1d548b20, and the rolled/retiled transfer helpers) is the content of stage 2 and is owned by the SparseCore-LLVM lowering pages. This page's claim stops at the bridge contract: what tag is written, which pass reads it, which resolver substitutes it. The granule-level transfer algebra is cross-referenced, not re-derived.
End-to-end trace
STAGE 1: LowerToMlo (ModuleOp pass, applyFullConversion)
  tpu.enqueue_dma %src, %dst, %sflag, ...        (operands in tpu types)
    └─ lambda 0x13239e00:
         per changed operand → %c = builtin.unrealized_conversion_cast %x
                                     {sc.unlowering}        (→ Mlo type)
         op.setAttr("sc.unlowered")   (+ sc.inside_trace_region?)
    ⇒ op SURVIVES, type-bridged; cast declared legal by $_1 (0x13231560)
  ── applyFullConversion succeeds with the bridge intact ──
STAGE 2: ExpandTiledMemRefsPass (after memref tiling resolved)
  addPattern<EnqueueDMAOp> 0x134eef60
    └─ substituteUnloweringConversionCastOp 0x134e73e0   (resolve sc.unlowering casts)
    └─ expand sc.unlowered DMA → real SparseCore transfer ops
         (EnqueueDmaGeneral / InGranules / ToHostIova: 0x1d543600 / 0x1d546700 / 0x1d548b20)
What Is Not On This Page
- The generic rollback engine: the IRRewrite 12-record action log, undoRewrites (LIFO), resetState, applyRewrites (two-pass commit), and the UnresolvedMaterializationRewrite record that backs every cast the bridge emits are owned by ConversionPatternRewriter. This page references those records but does not decode them.
- The depth-aware legalize-to-fixpoint loop: the cost model and ConversionTarget legality enum that decide which pattern fires and in what order are owned by DialectConversion Legalizer.
- The 40 direct (non-bridge) tpu→Mlo pattern bodies: tpu.iota, tpu.device_id, tpu.semaphore_signal, tpu.delay, and the rest are surveyed for the TensorCore analogue on tpu → LLO ODS; only the four DMA bridge ops are this page's subject.
- The stage-2 transfer algebra: the granule decomposition, rolled/retiled transfer loop nest, and the LloRegionBuilder::EnqueueDma* emission are owned by LowerToSparseCoreLlvm and the DMA pages; this page ends at the bridge-resolution boundary.
- The TypeConverter callback bodies: the 10 registerConversion + 3 registerTypeAttributeConversion rules (memref memory-space attrs, TupleType 1:N, WordType) shared with the SparseCore type system are owned by SCTypeConverter.
Cross-References
- ConversionPatternRewriter: the IRRewrite action log + undoRewrites / resetState / applyRewrites rollback engine that makes the bridge emission speculative-safe; the UnresolvedMaterializationRewrite record behind each cast
- DialectConversion Legalizer: the depth-aware legalize-to-fixpoint cost model and ConversionTarget legality enum this pass drives
- tpu → LLO ODS Lowering: the TensorCore FunctionPass sibling; the per-op ODS signatures and the LowerToLLO EnqueueDMAOp direct realization
- The tpu MLIR Dialect: the source dialect this pass consumes; the tpu.enqueue_dma / tpu.wait_dma2 / tpu.fetch_and_add_sync op surface
- LowerToSparseCoreLlvm: the per-class SparseCore→LLVM rewrite bodies, including the real DMA transfer emission downstream of ExpandTiledMemRefsPass
- SCTypeConverter: the shared SparseCore TypeConverter whose convertType the FuncOp legality predicate calls
- The TPU Compiler: the five-phase dialect descent overview where LowerToMlo sits
- Compile Phases: the ordered phase sequence placing lower-to-mlo before tile expansion
- Rolled / Strided / General Transfer: the granule-level transfer the stage-2 pass emits once the bridge is resolved
- Tile Index Expansion: the tiled-memref resolution that ExpandTiledMemRefsPass performs before substituting the bridge-cast