Mosaic Overview
All addresses, symbols, op-name strings, dump-stage names, and error strings on this page apply to
libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00). Other versions will differ.
Abstract
Mosaic is the tiled-tensor MLIR kernel compiler reached through one HLO opcode — kCustomCall("tpu_custom_call") — and it is the only producer of the mlir::tpu dialect on the TPU device path. A Pallas @pl.kernel (or a hand-written Mosaic kernel) is lowered, outside libtpu, into a tpu-dialect MLIR module; that module is serialized, embedded in the custom-call's backend config, and shipped across the PjRt boundary inside the HLO program. When the compiler reaches the custom-call at HLO→LLO emit time, libtpu parses the embedded module, runs a version-migration round-trip, and drives it through a 16-stage tpu-dialect pass pipeline down to LLO (or, for SparseCore kernels, to LLVM-TPU intrinsics). General HLO never becomes tpu ops — it is emitted straight to LLO by ~3225 xla::jellyfish::*Emitter classes (MHLO → XTile → tpu). Mosaic is therefore a second IR producer feeding the same target dialect, not a parallel back end: it skips the entire HLO optimizer and rejoins the main descent at the tpu dialect (The tpu MLIR Dialect).
This is the orientation page for the Mosaic backend. It fixes what Mosaic is, the import/serde seam that brings a tpu module into the compiler, the CustomCallEmitter::Emit driver that glues the imported module to the surrounding HLO program, the 16-stage RunMLIRPasses pipeline (TensorCore and SparseCore arms), and a map of the Mosaic sub-pages. It does not reproduce the layout algebra — the VectorLayout (sublane, lane) value type lives on Mosaic VectorLayout and the per-op layout-inference rules on Mosaic Layout Inference; both are linked, not duplicated.
For reimplementation, the orientation contract is:
- Mosaic is the only tpu-dialect origin. The binary has zero *ToTpuPass/MhloToTpu/StablehloToTpu conversion passes; the tpu dialect is imported, authored upstream, never lowered from MHLO. A reimplementer should not build an MHLO→tpu legalizer, because the production compiler has none.
- The import + serde seam. GetMlirModuleOpFromCustomCall → GetCachedCustomCallBody (CityHash128-keyed, one parse per unique kernel) → GetMlirModule → ParseModule + a single MosaicSerdePass deserialize/upgrade. The serde is a per-op version-migration engine, not a generic codec.
- The driver entry. CustomCallEmitter::Emit validates the kernel main's ABI against the HLO operands, binds memory-space colors and windows, selects the entry func by core type, runs the pipeline, and stitches the resulting LLO into the parent region.
- The 16-stage pipeline. RunMLIRPasses runs simplify → infer-memref-layout → canonicalize-mosaic → infer-vector-layout → apply-vector-layout → … → lower-to-llo, with a SparseCore branch that diverts to lower-to-mlo → LLVM-TPU instead.
- The sub-page map. Where the layout algebra, the layout-inference rules, the tpu→LLO descent, and the SparseCore lowering each live.
| Item | Detail |
|---|---|
| What Mosaic is | the tiled-tensor MLIR kernel compiler; the only producer of mlir::tpu on the TPU path, via tpu_custom_call import |
| Entry opcode / target | HloOpcode::kCustomCall, custom_call_target = "tpu_custom_call" |
| Driver entry | xla::jellyfish::(anon)::CustomCallEmitter::Emit @ 0x111ef740 (3657 lines; source …/mosaic/python/custom_call_emitter.cc) |
| Import seam | mlir_utils::GetMlirModuleOpFromCustomCall @ 0x13e327a0 → GetCachedCustomCallBody @ 0x13e31860 → GetMlirModule @ 0x13e31220 |
| Serde engine | mlir::tpu::MosaicSerdePass::runOnOperation @ 0x145307a0; jaxlib::mosaic::RunSerde @ 0x14533b20; module attr stable_mosaic.version, default version 11 |
| Pipeline driver | xla::jellyfish::RunMLIRPasses @ 0x111fefa0 (source …/mosaic/python/mosaic_passes.cc); 16 stages, dumps post-<stage> |
| Entry-func selection | mlir::tpu::GetFuncWithCoreType @ 0x14aa61a0 (matches tpu.core_type, falls back to @main) |
| Mem-space bridge | MemorySpaceToColor @ 0x1d6ffb80 / ColorToMemorySpace @ 0x1d6ffb00 |
| TensorCore handoff | mlir::tpu::createLowerToLLOPass @ 0x11203ba0 → llo::createEliminateLLOExtensionsPass @ 0x13e668a0 |
| SparseCore handoff | mlir::tpu::createLowerToMloPass @ 0x1322adc0 → sparse_core::CreateLowerToSparseCoreLlvmPass @ 0x135667c0 |
| MHLO→tpu passes in binary | none (0 *ToTpuPass/MhloToTpu/StablehloToTpu symbols) |
| Confidence | HIGH (symbol/string-anchored) unless a row or callout says otherwise |
What Mosaic Is, and Why It Is a Side Channel
Mosaic is not a stage inside the HLO optimizer; it is an escape hatch. The general TPU compute path lowers HLO to LLO directly through ~3225 xla::jellyfish::*Emitter classes (DmaEmitter, FusionEmitter, ReduceEmitter, GatherEmitter, …) calling into LloRegionBuilder (3037 referencing functions) — no MLIR tpu dialect is ever produced for general compute. The tpu dialect exists in the binary only because Pallas/Mosaic kernels are authored in it upstream and imported. Two independent observations pin this down:
- No conversion pass produces tpu. The function table contains zero symbols matching *ToTpuPass, *ToTPUDialect, MhloToTpu, HloToTpu, StablehloToTpu, LegalizeToTpu, LowerMhloToTpu. (The only ConvertToTpu* hits are ConvertToTpuCompilationCacheGroupEntryProto and ConvertToTpuCoreProgram, which are runtime/SparseCore-ISA symbols and unrelated.)
- The general path bypasses tpu entirely. Each *Emitter is a per-HLO-opcode visitor that builds llo.* ops straight into an LloRegion.
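The symbol-table check behind the first observation can be sketched as a glob scan over demangled names. This is a minimal illustration, assuming the bare names in the list (MhloToTpu, HloToTpu, …) are meant as substring matches; the `sample` list is a hypothetical excerpt, not a dump of the binary.

```python
from fnmatch import fnmatchcase

# Name patterns taken from the list above; bare names are treated as
# substring matches (an assumption of this sketch).
PATTERNS = [
    "*ToTpuPass", "*ToTPUDialect", "*MhloToTpu*", "*HloToTpu*",
    "*StablehloToTpu*", "*LegalizeToTpu*", "*LowerMhloToTpu*",
]

def tpu_producer_hits(symbols):
    """Return the symbols that look like an MHLO/HLO -> tpu conversion pass."""
    return [s for s in symbols if any(fnmatchcase(s, p) for p in PATTERNS)]

# Hypothetical excerpt of a demangled symbol list (illustration only).
sample = [
    "ConvertToTpuCompilationCacheGroupEntryProto",
    "ConvertToTpuCoreProgram",
    "mlir::tpu::createLowerToLLOPass",
    "xla::jellyfish::RunMLIRPasses",
]
hits = tpu_producer_hits(sample)
```

With this sample, `hits` is empty: the two ConvertToTpu* names contain "ToTpu" but match none of the conversion-pass patterns, which mirrors the parenthetical above.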
So the device path is two trees, and only the Mosaic arm ever passes through the tpu dialect:
TREE A1 — general TPU compute
HLO ──[97-row pre-passes]──► HLO ──[~3225 jellyfish *Emitter + LloRegionBuilder]──► LLO ──► bundles
(never becomes tpu-dialect ops)
TREE A2 — Pallas/Mosaic kernels ◄── THIS PAGE
HLO kCustomCall("tpu_custom_call") carrying a SERIALIZED tpu-dialect module
──[import + serde]──► tpu-dialect IR
──[RunMLIRPasses 16-stage pipeline]──► LLO (TensorCore) | LLVM-TPU (SparseCore)
Mosaic kernels skip the HLO optimizer (no algebraic simplification, layout assignment, fusion, or MSA on the kernel body) because they arrive already as tpu ops carrying their own tiling. On the TensorCore arm they rejoin the general path once lowered to LLO and share the LLO→bundle descent from there.
GOTCHA: do not build an MHLO→tpu legalizer. The convergence point the compiler overview draws at Level 2 (tpu dialect) is fed by exactly one producer: the Mosaic custom-call import. General HLO bypasses tpu and goes straight to LLO. A reimplementation that allocates effort to an MHLO/StableHLO→tpu conversion-pattern set is building a stage the production compiler does not contain. See MHLO → XTile → tpu for the full two-tree evidence.
NOTE: "TLP"/"MLO" are IR-layer names, not dialects. The tpu dialect's physical role as the "TPU-Level Program" container is realized only for the imported Mosaic kernel. "Mlo" (mid-level ops) is the SparseCore intermediate inside the tpu/sparse_core lowering (see The SparseCore Arm), not a top-level IR.
The Mosaic Kernel Embedding
A Pallas @pl.kernel is compiled by the JAX frontend (outside libtpu) into a tpu-dialect module, run through MosaicSerdePass in serialize mode (downgrade to a pinned version, set the module's stable_mosaic.version attr), and embedded into one HLO instruction:
HloOpcode::kCustomCall
custom_call_target = "tpu_custom_call"
backend_config = serialized xla::jellyfish::CustomCallConfig proto:
mlir_module (an absl::Cord; the serialized tpu module)
serialization_format (must equal 1)
input_memory_colors / output_memory_colors
has_communication, collective_id, cost_estimate, metadata
Four HLO pre-passes prepare the call before lowering (Custom-Call Lowering): MosaicFusion (fuse the surrounding HLO into operands) → TpuCustomCallLegalizer (TensorCore/SparseCore/Megachip classification) → TpuCustomCallMemorySpacePolicy (fill the memory colors) → TpuCustomCallScopedVmemAdjuster (size scoped VMEM). At HLO→LLO emit time the registered emitter for the "tpu_custom_call" target is CustomCallEmitter::Emit.
Import and the MosaicSerde Round-Trip
The import is a thin caching wrapper that parses the embedded module exactly once per unique kernel and runs a deserialize/upgrade pass.
GetMlirModuleOpFromCustomCall (0x13e327a0)
└─ GetCachedCustomCallBody (0x13e31860)
1. validate body present (else "Custom call body is empty.")
2. key = farmhash::CityHash128WithSeed(Cord→string(mlir_module)) // 128-bit
3. HloModule::GetCacheEntry<MosaicMlirCacheEntry>(module, key)
HIT → reuse the parsed ModuleOp across every duplicate kernel
MISS → build it:
ctx = new mlir::MLIRContext
LoadDialects(ctx) // 0x13e32140
GetMlirModule(config, ctx, verify) // 0x13e31220
cache a MosaicMlirCacheEntry (OWNS ctx + ModuleOp)
4. fingerprint-collision guard → fatal on bcmp mismatch
("Kernel body fingerprint collision detected for key: …")
LoadDialects (0x13e32140) builds the kernel's self-contained MLIRContext (it does not share the host compiler's): it loads stablehlo, arith, func, memref, scf, vector, mlir::tpu::TPUDialect, cf, mlir::llo::LLODialect, mlir::sparse_core::ScDialect (registered as "sc_tpu"), plus RegisterNonSparseCoreDialects and func::registerAllExtensions, then calls allowUnregisteredDialects(true) — required by the serde's serialize side and to let unknown ops survive a round-trip.
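The parse-once behaviour of GetCachedCustomCallBody can be sketched as a content-addressed cache with a collision guard. This is a minimal illustration under stated assumptions: `hashlib.blake2b` with a 16-byte digest stands in for farmhash::CityHash128WithSeed, the `parse` callable is hypothetical, and the class name is invented for the example.

```python
import hashlib

class KernelBodyCache:
    """Parse-once cache keyed by a 128-bit fingerprint of the serialized module."""

    def __init__(self, parse):
        self._parse = parse      # hypothetical callable: bytes -> parsed module
        self._entries = {}       # key -> (original bytes, parsed module)

    @staticmethod
    def fingerprint(body: bytes) -> bytes:
        # Stand-in for CityHash128WithSeed: any 128-bit digest serves the sketch.
        return hashlib.blake2b(body, digest_size=16).digest()

    def get(self, body: bytes):
        if not body:
            raise ValueError("Custom call body is empty.")
        key = self.fingerprint(body)
        entry = self._entries.get(key)
        if entry is None:                 # MISS: parse once and remember
            entry = (body, self._parse(body))
            self._entries[key] = entry
        elif entry[0] != body:            # HIT, but the bytes differ: collision
            raise RuntimeError(
                "Kernel body fingerprint collision detected for key: " + key.hex())
        return entry[1]

# Two identical kernel bodies share one parse.
calls = []
cache = KernelBodyCache(lambda b: calls.append(b) or ("module", len(b)))
first = cache.get(b"module @k {}")
second = cache.get(b"module @k {}")
```

Keeping the original bytes next to the parsed module is what makes the byte-for-byte comparison on a hit possible; a cache that stored only the key could not detect a collision.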
GetMlirModule (0x13e31220) parses and deserializes:
- ParseModule (0x13e30dc0) wraps mlir::parseSourceString, which consumes both textual MLIR and MLIR bytecode through one entry, under a StatusScopedDiagnosticHandler, then constructContainerOpForParserIfNecessary<ModuleOp>. Failure: "Failed to parse the Mosaic module: <diag>".
- If serialization_format == 1, run one MosaicSerdePass(serialize=false) in a PassManager (dump stage "deserialization"); any other value → "Unsupported serialization format: <N>".
MosaicSerdePass is a version-migration engine
MosaicSerdePass::runOnOperation (0x145307a0) is not a generic bytecode codec — it is a per-op version upgrade/downgrade engine. It runs in two modes: serialize=false (deserialize/upgrade, used on import) and serialize=true (serialize/downgrade, used when writing the kernel into the persistent compilation cache). It lazily builds two StringMaps keyed by op-name (upgrade_rules, downgrade_rules), dispatched by jaxlib::mosaic::RunSerde (0x14533b20).
RunSerde reads the module's stable_mosaic.version IntegerAttr, bounds-checks it ("Unsupported version: expected <= <cur> but got <stored>"), removes the attr, then walks every op applying the matching migration lambda. Eight ops carry per-version rules:
| op (key) | schema evolution the rule encodes |
|---|---|
| tpu.enqueue_dma | v2 added the operand-segment-sizes form; v4 added priority; v8 added strict_ordering |
| tpu.wait_dma2 | v7 added device_id/core_id segments |
| tpu.dynamic_gather | v5: singular dimension → multi-dim dimensions |
| tpu.iota | v6: singular dimension → multi-dim dimensions |
| tpu.sem_signal | v2 added core_id |
| vector.multi_reduction | v3: reduction_dims ArrayAttr-of-IntegerAttr → DenseArrayAttr |
| tpu.store | v11 added the fused store-and-accumulate add flag |
| arith.constant | v10 generalized vector-of-i1 constants (pre-v10 must be splat) |
The current dialect version is 11. An out-of-tree compiler reading a JAX-emitted persistent cache must replay the matching downgrade lambdas down to its pinned version. The string constants resolve to kMangledDialect = "stable_mosaic" (the bytecode dialect tag, 13 chars) and kVersionAttrName = "stable_mosaic.version". The per-version field-migration bodies are documented on Mosaic VectorLayout alongside the layout serialization.
NOTE: the bytecode magic is upstream MLIR; the TPU-specific part is the version attr + the 8 rules. Parsing handles standard MLIR text/bytecode; the only TPU-specific deserialization work is reading/removing stable_mosaic.version and replaying the eight per-op upgrade lambdas.
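The upgrade direction of the engine (read the version, bounds-check it, remove the attr, walk the ops) can be sketched over a toy module. This is an illustration only: the dict-based module shape is invented for the example, only the tpu.iota rule from the table is filled in, and the handling of a missing version attr is deliberately left out.

```python
CURRENT_VERSION = 11
VERSION_ATTR = "stable_mosaic.version"

def upgrade_iota(op, stored_version):
    # Mirrors the tpu.iota row: before v6 the op carried a single `dimension`.
    if stored_version < 6 and "dimension" in op["attrs"]:
        op["attrs"]["dimensions"] = [op["attrs"].pop("dimension")]

# One illustrative rule; the table above lists eight ops with rules.
UPGRADE_RULES = {"tpu.iota": upgrade_iota}

def run_serde_upgrade(module):
    """Upgrade a toy module {'attrs': {...}, 'ops': [...]} in place."""
    stored = module["attrs"][VERSION_ATTR]  # missing attr: out of scope here
    if stored > CURRENT_VERSION:
        raise ValueError(
            f"Unsupported version: expected <= {CURRENT_VERSION} but got {stored}")
    del module["attrs"][VERSION_ATTR]       # the attr does not survive import
    for op in module["ops"]:
        rule = UPGRADE_RULES.get(op["name"])  # dispatch by op name
        if rule is not None:
            rule(op, stored)
    return module

old = {
    "attrs": {VERSION_ATTR: 5},
    "ops": [{"name": "tpu.iota", "attrs": {"dimension": 1}},
            {"name": "arith.addf", "attrs": {}}],
}
upgraded = run_serde_upgrade(old)
```

Ops without an entry in the rule map pass through untouched, which is why the engine scales with the number of schema changes rather than the number of ops in the dialect.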
The Driver: CustomCallEmitter::Emit
CustomCallEmitter::Emit (0x111ef740, 3657 lines) is the registered LLO emitter for "tpu_custom_call". It connects the imported tpu module to the surrounding HLO program and drives the whole pipeline. Recovered call structure and verbatim error strings:
- Fetch & validate config: "Custom call does not have a custom call config." / "Failed to retrieve the config. Error: %s".
- Import the body: GetCachedCustomCallBody → cached tpu ModuleOp. Re-entry guard: "Trying to lower the same Mosaic kernel more than once".
- Core-type gate (TensorCore vs SparseCore) from the cache entry's core-type set: "Cannot lower a Mosaic kernel targeting SparseCore on a device without SparseCore", "Cannot lower a Mosaic kernel targeting SparseCore to TensorCore.", "Mosaic is not supported on this hardware version.".
- Validate the kernel main ABI (the HLO↔kernel contract):
  - main.getNumArguments() == iteration_bounds + num_inputs + num_outputs + num_scratch + has_communication
  - main.getNumResults() == 0 (all outputs are by-ref memrefs)
  - main.getRegion().hasOneBlock()
- Bind operands/outputs: element-type allow-list (float8e5m2, float8e4m3fn, float8e4m3b11fnuz, float8e8m0fnu, float4e2m1fn, bfloat16, 32-bit), per-operand memory-space ("Some arguments are missing their memory space attribute"), memory-space→color via MemorySpaceToColor, layout verify via VerifyAndMaybeUpdateShapeLayoutForMosaic ("Missing layout for Mosaic kernel operand %d" / "Failed to verify layout for Mosaic kernel operand "), scratch placement ("Scratch memref allocation only supported for vmem, smem and semaphore_mem."), and semaphore-liveness on exit ("Semaphore (scratch argument %d) has a nonzero value upon exit … Make sure every DMA is awaited …").
- Windowing: PipelineEmitter (39 references in Emit) windows each operand/output into the iteration grid, reading the window_params, iteration_bounds, and dimension_semantics attrs off main ("Wrong number of window_params: expected %d, got %d", "iteration_bounds should be a DenseI64ArrayAttr"). The window-descriptor math (grid flatten, double-buffering, dynamic-offset DMA, window cycle cost) is recovered on Mosaic VectorLayout.
- Select the entry func: GetFuncWithCoreType(module, core_type) (0x14aa61a0) returns the func.func whose tpu.core_type matches the device, falling back to @main ("No function with tpu.core_type = %v nor a main function found").
- Run the pipeline: RunMLIRPasses(...) lowers the module's main to LLO in-place (next section).
- Megacore split: MegacoreAdjuster (4 references) splits the work across the two TensorCores of a megacore chip.
- Stitch into LLO: the lowered body is emitted into the parent LloRegion via LloRegionBuilder (AllocateScopedVmem/Smem, EnqueueDmaLocalInGranules, …); post-emit cleanup "post-finalize-llo-post-emitter". Returns the LLO SSA value(s) for the kernel outputs.
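The entry-func selection step in the list above (match on tpu.core_type, then fall back to main) can be sketched as a two-pass lookup. This is a minimal illustration; the dict-based function records and the "tc"/"sc" core-type labels are hypothetical stand-ins, not the attribute values used by the binary.

```python
def get_func_with_core_type(funcs, core_type):
    """funcs: list of dicts {'name': str, 'core_type': str or None}."""
    # First pass: a function explicitly tagged for the requested core type.
    for f in funcs:
        if f.get("core_type") == core_type:
            return f
    # Second pass: fall back to the function named main.
    for f in funcs:
        if f["name"] == "main":
            return f
    raise LookupError(
        f"No function with tpu.core_type = {core_type} nor a main function found")

# "tc" and "sc" are hypothetical labels used only for this example.
module_funcs = [
    {"name": "main", "core_type": None},
    {"name": "sc_body", "core_type": "sc"},
]
picked_sc = get_func_with_core_type(module_funcs, "sc")
picked_tc = get_func_with_core_type(module_funcs, "tc")
```

Here `picked_sc` is the tagged function and `picked_tc` falls back to main, since nothing in the toy module is tagged "tc".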
NOTE: the driver entry is CustomCallEmitter::Emit, not MosaicEmitter. MosaicEmitter (EmitWindow @ 0xfaadcc0) is a small anonymous-namespace window-emit helper, a sibling of MosaicBroadcast and PipelineEmitter::OperandWindow, used inside the windowing step. The class that actually imports the module, validates the ABI, runs the pipeline, and stitches LLO is CustomCallEmitter::Emit (0x111ef740). "MosaicEmitter" is a loose name for the Mosaic side door; the driver symbol is CustomCallEmitter::Emit.
How the imported module connects to HLO
The connection is by ABI on main plus a memory-space-color bridge:
| Surrounding HLO concept | tpu-module main binding |
|---|---|
| kCustomCall operand i (HLO buffer) | a memref arg (tiled layout + tpu.memory_space attr); color from config.input_memory_colors[i] |
| kCustomCall result | a by-ref output memref arg (num_results == 0) |
| grid / pipeline dims | iteration_bounds (DenseI64ArrayAttr) on main |
| parallel vs reduction grid dims | dimension_semantics attr |
| per-operand window/tiling | window_params attr (one per memref arg) |
| scoped scratch (vmem/smem/sem) | extra main args after the outputs |
| collective barrier | has_communication arg (+ config.collective_id) |
MemorySpaceToColor (0x1d6ffb80) maps the mosaic-tpu memref memory-space enum onto the integer "color" the host BufferAssignment uses, so an imported kernel's inputs/scratch land in the right physical memory. The mosaic-tpu MemorySpace enum (vmem/smem/hbm/cmem/semaphore_mem/vmem_shared/host) and the jellyfish color table values are documented on Mosaic VectorLayout.
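The ABI contract on main can be sketched as a count check. This is a minimal illustration under stated assumptions: the `KernelMain` record is a hypothetical stand-in for the kernel's func.func, and the iteration_bounds term of the argument count is read as one argument per grid dimension, which the source does not spell out.

```python
from dataclasses import dataclass

@dataclass
class KernelMain:
    """Hypothetical stand-in for the kernel's func.func @main."""
    num_arguments: int
    num_results: int
    num_blocks: int

def check_main_abi(main, iteration_bounds, num_inputs, num_outputs,
                   num_scratch, has_communication):
    """Return a list of ABI violations (an empty list means the contract holds)."""
    # Assumption: one main argument per grid dimension in iteration_bounds.
    expected = (len(iteration_bounds) + num_inputs + num_outputs
                + num_scratch + int(has_communication))
    problems = []
    if main.num_arguments != expected:
        problems.append(f"expected {expected} arguments, got {main.num_arguments}")
    if main.num_results != 0:
        problems.append("main must return nothing; outputs are by-ref memrefs")
    if main.num_blocks != 1:
        problems.append("main must have exactly one block")
    return problems

# A 2-D grid, two inputs, one output, one scratch buffer, no communication:
# 2 + 2 + 1 + 1 + 0 = 6 arguments.
ok = check_main_abi(KernelMain(6, 0, 1), [4, 8], 2, 1, 1, False)
bad = check_main_abi(KernelMain(5, 1, 1), [4, 8], 2, 1, 1, False)
```

Returning a list instead of raising on the first failure is a choice made for the example; the driver itself reports these as individual errors.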
The tpu-Dialect Pass Pipeline: RunMLIRPasses
RunMLIRPasses (0x111fefa0, source …/mosaic/python/mosaic_passes.cc) is the actual tpu-dialect pipeline. Each pass runs via RunPass (0x14514d60), which wraps the PassManager in RunAndCaptureDiagnostics and dumps MLIR as post-<stage> (or post-<stage>-failed). The TensorCore pipeline, in execution order with create-functions (all mlir::tpu:: unless noted):
| # | stage (dump name) | create-function / action |
|---|---|---|
| 0 | original | DumpMlir; setup: GetFuncWithCoreType, GetMosaicHardwareGeneration(target); guard "Should not run HLO passes without layout passes" |
| 1 | deserialize/diagnostics | RunAndCaptureDiagnostics |
| 2 | simplify | $_4 canonicalizer → InferMemorySpaces (dump post-infer-memref-space) |
| 3 | hlo-conversion (iff run_hlo_passes) | nest func.func: stablehlo::createStablehloLegalizeToLinalgPass + LinalgVectorizationPass |
| 4 | infer-memref-layout | createInferMemRefLayoutPass (0x132c0f00) (+ -simplify) |
| 5 | assert-insertion | createDebugAssertInsertionPass |
| 6 | canonicalize-operations | createCanonicalizeOperationsPass |
| 7 | pre-canon-optimization | createPreCanonicalizationOptimizationPass |
| 8 | canonicalize-mosaic | createCanonicalizeMosaicPass (0x132a2ac0) (+ -simplify) |
| 9 | tiling-propagation | createTilingPropagationPass |
| 10 | infer-vector-layout | createInferVectorLayoutPass (0x132c2c20) |
| 11 | relayout-insertion | createRelayoutInsertionPass |
| 12 | apply-vector-layout | createApplyVectorLayoutPass (0x1325cda0) (+ -simplify) |
| 13 | logical-to-physical-device-id | createLogicalToPhysicalDeviceIdPass |
| 14 | lower-to-llo | nest func.func: createLowerToLLOPass(target) (0x11203ba0) |
| 15 | eliminate-llo-extensions | nest func.func: llo::createEliminateLLOExtensionsPass (0x13e668a0) |
| 16 | finalize-llo | $_4 canonicalizer |
After finalize-llo the module body is pure llo.* (+ scf/func/memref structural ops) and control returns to CustomCallEmitter::Emit for the LloRegionBuilder stitch. The -simplify sub-stages run the $_4 canonicalize/CSE lambda after the three passes marked (+ -simplify) in the table: infer-memref-layout, canonicalize-mosaic, and apply-vector-layout. IR printing is gated by VLOG; verification is gated by IsMosaicVerificationEnabled.
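The stage-by-stage dump convention (original, then post-<stage> or post-<stage>-failed) can be sketched as a small pipeline runner. This is an illustration only: the stage functions are toy callables over a Python list, and `run_pipeline` and `dump` are names invented for the example.

```python
def run_pipeline(module, stages, dump):
    """Run ordered (name, fn) stages, recording a dump tag after each one."""
    dump("original", module)
    for name, fn in stages:
        try:
            module = fn(module)
        except Exception:
            # A failing stage still leaves a snapshot for debugging.
            dump(f"post-{name}-failed", module)
            raise
        dump(f"post-{name}", module)
    return module

# Toy stages: each one appends a marker to the "module".
dumps = []
stages = [
    ("simplify", lambda m: m + ["simplified"]),
    ("infer-memref-layout", lambda m: m + ["memref-layout"]),
]
result = run_pipeline([], stages, lambda tag, m: dumps.append(tag))
```

After the run, `dumps` holds the tags in execution order, so a failure can be bisected by comparing the last successful snapshot against the failed one.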
The layout heart of the pipeline is stages 10–12. infer-vector-layout annotates every op with in_layout/out_layout attributes (the per-op inference rules — Mosaic Layout Inference); relayout-insertion inserts explicit tpu.relayout wherever a consumer's required layout differs from a producer's, so that apply-vector-layout (applyLayoutFunc @ 0x1325cc80 → per-op applyLayoutOp) materializes each logical vector<NxMxT> into hardware-native vreg tiles via a 49-entry op-name→rule dispatch table. The VectorLayout value type, the textual grammar, the per-vreg packing math, and the relayout driver are all on Mosaic VectorLayout.
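The relayout-insertion idea (insert an explicit conversion wherever a consumer's required layout differs from the producer's) can be sketched over a straight-line op list. This is a toy model, not the pass itself: layouts are opaque strings, each op has a single result, and the dict fields and the `_as_` naming scheme are invented for the example.

```python
def insert_relayouts(ops):
    """ops: dicts in program order with 'name', 'result', 'operands',
    'in_layout' (one required layout per operand) and 'out_layout'."""
    produced = {}   # value name -> layout it is available in
    out = []
    for op in ops:
        new_operands = []
        for operand, want in zip(op["operands"], op["in_layout"]):
            have = produced.get(operand)
            if have is not None and have != want:
                fixed = f"{operand}_as_{want}"
                if fixed not in produced:   # reuse an earlier conversion
                    out.append({"name": "tpu.relayout", "result": fixed,
                                "operands": [operand], "in_layout": [have],
                                "out_layout": want})
                    produced[fixed] = want
                operand = fixed
            new_operands.append(operand)
        out.append({**op, "operands": new_operands})
        produced[op["result"]] = op["out_layout"]
    return out

# Producer emits layout "A"; the consumer requires "B".
ops = [
    {"name": "producer", "result": "x", "operands": [],
     "in_layout": [], "out_layout": "A"},
    {"name": "consumer", "result": "y", "operands": ["x"],
     "in_layout": ["B"], "out_layout": "B"},
]
rewritten = insert_relayouts(ops)
```

Making the conversion an explicit op means the later apply step only ever sees matching producer and consumer layouts, which is the division of labour the paragraph above describes.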
GOTCHA: canonicalize-mosaic runs with allow_unverified=!a12. createCanonicalizeMosaicPass takes a gen + IsMosaicCompatibilityModeEnabled + an allow_unverified flag derived from the verification gate. A reimplementer must not assume the canonicalizer always re-verifies; in compatibility mode it tolerates unverified IR so older kernels survive.
NOTE: four of the 22 inventoried tpu passes are not in the TensorCore arm. CanonicalizeMemorySpace and ConvertIntegerMemrefs belong to other paths, LowerToMlo to the SparseCore branch below, and MosaicSymbolicShapeRefinement runs earlier as a hybrid HLO+MLIR pass. Searching this 16-stage list for them will not find them.
TensorCore vs SparseCore: the Two Lowering Arms
The pipeline forks after logical-to-physical-device-id on the entry func's tpu.core_type.
The TensorCore arm (stages 14–16)
createLowerToLLOPass(target) (0x11203ba0) runs the full applyFullConversion of tpu.* → llo.* (242 patterns — tpu → LLO Lowering); createEliminateLLOExtensionsPass (0x13e668a0) expands LLO "extension" convenience-macro ops into base LLO ops; finalize-llo canonicalizes. After this, every vector value is a native vreg shape that maps 1:1 onto an LLO VregType, and the body is stitched into the parent LloRegion. Downstream, TpuProgram Serialization and the scheduling back end pack bundles.
The SparseCore arm
A kernel whose main carries a SparseCore tpu.core_type is not lowered to LLO. Instead the tpu ops convert to the SparseCore "Mlo" (mid-level-ops) stage and then to LLVM-TPU intrinsics:
tpu (SparseCore core_type) module
→ [mosaic_sc::InferVectorLayoutPass + ApplyVectorLayoutPass] (vreg layout; only op: mosaic_sc.relayout)
→ mlir::tpu::createLowerToMloPass(Target, LowerToMloPassContext) (0x1322adc0)
converts tpu.* (incl. SparseCore-only tpu.barrier, tpu.all_reduce,
tpu.enqueue_indirect_dma, tpu.fetch_and_add_sync, tpu.scan,
tpu.wait_indirect_dma) into mlo/sparse_core/llvm ops
→ MloModuleVerifier
→ sparse_core::CreateLowerToSparseCoreLlvmPass(Target, …) (0x135667c0) → mlir::LLVM / LlvmTpu
→ sparse_core::CreateLlvmIntToPtrSafetyPass(Target) (int↔ptr safety legalize)
→ mlir-translate -mlir-to-llvmir → per-gen SparseCore ISA
The SparseCore Mosaic dialect (mosaic_sc) is intentionally tiny — one op (mosaic_sc.relayout) plus the two layout passes — so the Mosaic vreg-layout phase can run for SparseCore before the tpu → mlo → LlvmTpu lowering. Detail: LowerToMlo DMA Bridge, LowerToSparseCoreLlvm, SCTypeConverter, and the SparseCore Part.
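The fork between the two arms can be sketched as a choice of pipeline tail keyed on the core type. This is a minimal illustration: the stage names are taken from the chain described on this page, while the "tensorcore"/"sparsecore" labels, the `device_has_sparsecore` flag, and the `select_tail` helper are hypothetical.

```python
# Tails after logical-to-physical-device-id, as named on this page.
TENSORCORE_TAIL = ["lower-to-llo", "eliminate-llo-extensions", "finalize-llo"]
SPARSECORE_TAIL = ["lower-to-mlo", "MloModuleVerifier",
                   "lower-to-sparsecore-llvm", "llvm-int-to-ptr-safety"]

def select_tail(core_type, device_has_sparsecore):
    """Pick the lowering tail for a kernel; core_type labels are hypothetical."""
    if core_type == "sparsecore":
        if not device_has_sparsecore:
            raise ValueError(
                "Cannot lower a Mosaic kernel targeting SparseCore "
                "on a device without SparseCore")
        return SPARSECORE_TAIL
    return TENSORCORE_TAIL

tc_tail = select_tail("tensorcore", device_has_sparsecore=False)
sc_tail = select_tail("sparsecore", device_has_sparsecore=True)
```

The sketch only models which tail runs; the mosaic_sc layout passes that precede lower-to-mlo are omitted.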
The Complete Mosaic → tpu → LLO Chain
JAX/Pallas frontend (out of libtpu)
@pl.kernel → tpu-dialect module → MosaicSerde SERIALIZE (downgrade, set stable_mosaic.version)
→ embedded in HloCustomCall("tpu_custom_call").backend_config.mlir_module
─────────────────────────────── libtpu.so ───────────────────────────────
HLO pre-passes: MosaicFusion → TpuCustomCallLegalizer
→ TpuCustomCallMemorySpacePolicy → TpuCustomCallScopedVmemAdjuster
│ at HLO→LLO emit time, target = "tpu_custom_call"
▼
CustomCallEmitter::Emit (0x111ef740)
│ GetMlirModuleOpFromCustomCall (0x13e327a0)
│ └ GetCachedCustomCallBody (0x13e31860): CityHash128 key; on miss
│ new MLIRContext → LoadDialects → GetMlirModule (parse + MosaicSerde deserialize)
│ validate main ABI; bind memref colors (MemorySpaceToColor); set windows (PipelineEmitter)
│ GetFuncWithCoreType (0x14aa61a0) → pick entry func
│ ▼ RunMLIRPasses (0x111fefa0)
│ simplify → infer-memref-layout → canonicalize-mosaic → tiling-propagation
│ → infer-vector-layout → relayout-insertion → apply-vector-layout
│ → logical-to-physical-device-id
│ ├── TensorCore: → lower-to-llo (242 patterns) → eliminate-llo-extensions → finalize-llo
│ └── SparseCore: → (mosaic_sc layout) → lower-to-mlo → MloModuleVerifier
│ → lower-to-sparsecore-llvm → llvm-int-to-ptr-safety ⇒ LLVM-TPU
│ MegacoreAdjuster (split across 2 TensorCores)
│ stitch LLO into parent LloRegion via LloRegionBuilder
▼
LLO instruction stream → bundle packer → ISA bundles (SparseCore: LLVM IR → per-gen SC ISA)
Map of the Mosaic Sub-Pages
| topic | page | owns |
|---|---|---|
| the layout atom | Mosaic VectorLayout | the (sublane, lane) VectorLayout struct + invariants, textual grammar, tilesPerVreg/tileArrayShape, the ImplicitDim model, the relayout driver, the 49-entry applyLayoutOp dispatch table, the 8 serde migration lambdas, the MemorySpaceToColor table values, and the PipelineEmitter window math |
| the layout solver | Mosaic Layout Inference | the per-op VectorLayoutInferer rules that choose each op's in_layout/out_layout (the producer of the attrs applyLayoutOp consumes) |
| the target dialect | The tpu MLIR Dialect | the tpu dialect's ops, attributes, and op-model contract |
| TensorCore handoff | tpu → LLO Lowering | the createLowerToLLOPass 242-pattern descent into LLO |
| SparseCore handoff | LowerToMlo DMA Bridge, LowerToSparseCoreLlvm, SCTypeConverter | the tpu → mlo → LlvmTpu SparseCore lowering |
| the dispatch layer | Custom-Call Lowering | the tpu_custom_call target registration + the four HLO pre-passes |
Confidence Summary
| Claim | Evidence |
|---|---|
| Mosaic (tpu_custom_call import) is the only tpu-dialect producer | 0 *ToTpuPass/MhloToTpu/StablehloToTpu symbols; ~3225 jellyfish *Emitter + 3037 LloRegionBuilder refs for the general path |
| Import seam = GetMlirModuleOpFromCustomCall → GetCachedCustomCallBody → GetMlirModule/ParseModule/LoadDialects | 0x13e327a0, 0x13e31860, 0x13e31220, 0x13e30dc0, 0x13e32140 decompiled; CityHash128 key + MosaicMlirCacheEntry cache |
| Serde is a per-op version-migration engine; 8 ops; default v=11; attr stable_mosaic.version | MosaicSerdePass::runOnOperation 0x145307a0, RunSerde 0x14533b20; kMangledDialect/kVersionAttrName from .data relocs |
| Driver entry is CustomCallEmitter::Emit (3657 lines), calling GetCachedCustomCallBody/GetFuncWithCoreType/MemorySpaceToColor/PipelineEmitter/MegacoreAdjuster/RunMLIRPasses | 0x111ef740 decompiled; 1/1/2/39/4/1 references confirmed; verbatim ABI/memory/window/semaphore error strings |
| MosaicEmitter is a window-emit helper, not the driver | MosaicEmitter::EmitWindow 0xfaadcc0, sibling of MosaicBroadcast/PipelineEmitter::OperandWindow; not called as the lowering driver |
| 16-stage RunMLIRPasses pipeline with the listed stage names + create-functions | 0x111fefa0 (source mosaic_passes.cc); createInferMemRefLayoutPass 0x132c0f00, createCanonicalizeMosaicPass 0x132a2ac0, createInferVectorLayoutPass 0x132c2c20, createApplyVectorLayoutPass 0x1325cda0, createLowerToLLOPass 0x11203ba0, createEliminateLLOExtensionsPass 0x13e668a0 |
| Entry-func selection by tpu.core_type via GetFuncWithCoreType | 0x14aa61a0; error strings "No function with tpu.core_type = %v …" |
| SparseCore arm = mosaic_sc layout → createLowerToMloPass → MloModuleVerifier → CreateLowerToSparseCoreLlvmPass → LLVM-TPU | createLowerToMloPass 0x1322adc0, CreateLowerToSparseCoreLlvmPass 0x135667c0, mosaic_sc::createInferVectorLayoutPass 0x132ecf60 |
| Exact hlo-conversion/megacore/large-2nd-minor arm gating per TPU generation | call sequence recovered; per-gen flag selection not fully traced |
Cross-References
- The TPU Compiler (overview): the five-phase spine; this page details its "Mosaic side channel" section.
- MHLO → XTile → tpu Lowering: the two-tree evidence and the proof that no MHLO→tpu pass exists; the general path's jellyfish *Emitter route.
- Custom-Call Lowering & the Target Registry: the tpu_custom_call target registration and the four HLO pre-passes that prepare the embedding.
- The tpu MLIR Dialect: the target dialect that Mosaic imports and the pipeline lowers.
- Mosaic VectorLayout: the (sublane, lane) layout algebra, the applyLayoutOp rule table, the serde migration lambdas, and the window-descriptor math (linked, not duplicated here).
- Mosaic Layout Inference: the per-op inference rules that produce the in_layout/out_layout attrs.
- tpu → LLO Lowering: the TensorCore createLowerToLLOPass descent.
- LowerToMlo DMA Bridge · LowerToSparseCoreLlvm · SCTypeConverter: the SparseCore lowering arm.
- TpuProgram Serialization: how the lowered LLO is packed and serialized.
- Part VIII — Scheduling · SparseCore Overview: the back ends the two arms hand off to.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part V — Compiler: Lowering & Optimization Passes / MLIR lowering chain