The TPU Compiler
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions will differ.
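Because every address below is tied to one build, it can help to check the build-id before trusting them. The following is a minimal sketch, assuming the raw bytes of the .note.gnu.build-id section have already been extracted (for example with `readelf -n` or `objcopy`); the helper names `parse_gnu_build_id` and `addresses_apply` are hypothetical and not part of libtpu.

```python
import struct
from typing import Optional

# Build-id quoted on this page; addresses are only meaningful for this build.
EXPECTED_BUILD_ID = "89edbbe81c5b328a958fe628a9f2207d"
NT_GNU_BUILD_ID = 3  # note type used by the GNU build-id convention


def _align4(n: int) -> int:
    """ELF note fields are padded to 4-byte boundaries."""
    return (n + 3) & ~3


def parse_gnu_build_id(note: bytes) -> Optional[str]:
    """Return the build-id as hex from raw little-endian note-section bytes."""
    offset = 0
    while offset + 12 <= len(note):
        namesz, descsz, ntype = struct.unpack_from("<III", note, offset)
        offset += 12
        name = note[offset:offset + namesz]
        offset += _align4(namesz)
        desc = note[offset:offset + descsz]
        offset += _align4(descsz)
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return desc.hex()
    return None


def addresses_apply(note: bytes) -> bool:
    """True only if the note carries the build-id this page was written against."""
    return parse_gnu_build_id(note) == EXPECTED_BUILD_ID
```

The parser walks notes generically rather than assuming the build-id is the first entry, since a note section may hold more than one note.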
Abstract
The TPU compiler is the half of libtpu.so that turns a serialized XLA program into the packed VLIW bundles a TensorCore executes. It is not one monolithic lowering. It is a five-phase, multi-dialect descent that begins as portable StableHLO bytecode handed across the PjRt boundary and ends as a per-generation bundle byte stream, threading through six successive IR levels and four kinds of optimization machinery on the way down. The whole descent is reachable from one class — xla::jellyfish::DeepseaCompilerBase — whose RunHloPasses (0x109152a0) and RunBackend family drive the HLO-level work, and from the phase registry xla::TpuCompiler::RegisterAllPhases (0xf849ec0) that wires the five xla::CompilePhase* entry points into the PjRt partial-program pipeline.
The reader who knows LLVM should hold a layered analogy. The first half of the pipeline is XLA proper — an HLO-on-HloInstruction optimizer with layout assignment, fusion, a memory-space allocator (MSA, XLA's analogue of register allocation but for HBM/VMEM tiers), and a latency-hiding scheduler — structurally the same machinery upstream XLA ships for GPU, retargeted with a TPU cost model and a TPU layout model. The second half is MLIR — the program crosses from HloInstruction into the MLIR world (StableHLO/CHLO/MHLO already live here on ingestion; the tpu dialect is the TPU-specific target dialect) and descends through dialect-conversion legalizers to the jellyfish::Llo* low-level IR, the TPU's equivalent of LLVM MachineInstr. From LLO, the scheduling pipeline packs bundles and the per-generation encoder serializes bytes. A side door — the Mosaic kernel compiler — lets hand-written kernels enter the tpu dialect directly through tpu_custom_call HLOs, bypassing the HLO optimizer entirely.
This is the orientation page for Part V. It fixes the IR layer stack, names the five compile phases and the four pass families that run inside them, says where each lives in the binary, and maps the rest of Part V so a reader can jump to the page that owns each phase's detail. It does not reproduce any phase's algorithm — the HLO pass registry, layout assignment, MSA, fusion, the MLIR legalizers, the tpu→LLO ODS lowering, and Mosaic each have their own pages, cross-referenced below.
For reimplementation, the orientation contract is:
- The IR layer stack and its crossing points. Six levels (StableHLO/CHLO/VHLO bytecode → XLA HLO → MHLO/tpu-dialect MLIR (via TLP) → LLO → scheduled Bundles → packed bytes) and the two boundaries that are format crossings, not mere passes: the StableHLO↔HLO legalization on ingestion, and the HLO→MLIR crossing inside Phase 2a.
- The five-phase spine and what each consumes/produces. CompilePhase0StablehloToHlo through CompilePhase3Linking, registered by RegisterAllPhases, each a PjRtPartialProgramProto-in / PjRtPartialProgramProto-out stage.
- The four pass families and where they sit. HLO optimization passes (Phase 1), the analyses that gate codegen (layout assignment, MSA), the lowering legalizers (Phase 2a/2b), and the schedule+pack back end (Part VIII, after LLO).
- The Mosaic side channel. Why a second IR producer exists and how its output rejoins the main descent at the tpu dialect.
| Item | Detail |
|---|---|
| Compiler class | xla::jellyfish::DeepseaCompilerBase (Deepsea = internal TPU backend name) |
| HLO pass driver | public entry xla::jellyfish::DeepseaCompilerBase::RunHloPasses(unique_ptr<HloModule>, StreamExecutor*, CompileOptions const&) @ 0x109152a0 — the xla::Compiler virtual override, which delegates to the internal pipeline-running overload RunHloPasses(Target const&, RunHloPassesConfig const&, …) @ 0x1093a420 that actually builds and runs the HloPassPipeline (the C TpuCompiler_RunHloPasses export @ 0xeabcd80 is the PjRt-side wrapper). See Compile Phases. |
| Phase registry | xla::TpuCompiler::RegisterAllPhases @ 0xf849ec0 |
| Phase entry points | CompilePhase0StablehloToHlo 0xf84de60 · CompilePhase1HloOptimizations 0xf84ee00 · CompilePhase2aTlpLowering 0xf850840 · CompilePhase2bDedupedLowering 0xf852180 · CompilePhase3Linking 0xf852f40 |
| Phase I/O type | absl::Span<const xla::PjRtPartialProgramProto> in / out (PjRt partial-program protocol) |
| IR levels | StableHLO/CHLO/VHLO → XLA HLO → TLP/MHLO + tpu dialect → LLO (jellyfish::Llo*) → Bundle → packed bytes |
| MLIR dialects present | stablehlo, chlo, vhlo, mhlo, tpu (verified in RTTI: mlir::{mhlo,tpu,chlo,vhlo,stablehlo}::*) |
| tpu→LLO lowering | mlir::tpu::createLowerToLLOPass @ 0x11203ba0 (LowerToLLOPass); mlir::tpu::LowerPassBase family |
| LLO → bundles | PackBundles @ 0x10a30a20 (back end; see Part VIII) |
| Mosaic side door | tpu_custom_call HLO → tpu dialect via mlir::tpu::CanonicalizeMosaicPass / MosaicSerdePass; LLO emission driven by jellyfish::CustomCallEmitter::Emit @ 0x111ef740 (MosaicEmitter::EmitWindow @ 0xfaadcc0 is its per-window helper) |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
The IR Layer Stack
A TPU program is not lowered through one IR but descends through a stack of six. The first two (Level 0, the portable MLIR bytecode, and Level 1, XLA HLO) are upstream XLA; the remaining four (Levels 2 through 5) are TPU-specific. The crossing from the HloInstruction world into MLIR is the single most important structural fact on this page: the front half is XLA, the back half is MLIR, and the program is reconstituted across the boundary, not merely walked.
┌──────────────────────────────────────────────────────────────────────────┐
│ LEVEL 0 — PORTABLE WIRE IR (MLIR bytecode, crosses the PjRt boundary) │
│ StableHLO · CHLO (high-level "client" HLO) · VHLO (versioned) │
│ mlir::{stablehlo,chlo,vhlo}::* — what JAX/the bridge actually ships │
└──────────────────────────────────────────────────────────────────────────┘
│ Phase 0 — CompilePhase0StablehloToHlo (format crossing, not a pass)
│ ConvertStablehloToHlo / StablehloLegalizeToHlo / ChloLegalizeToHlo
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ LEVEL 1 — XLA HLO (HloModule / HloInstruction, the HloOpcode enum) │
│ the optimizer's home: HLO opt passes, layout assignment, fusion, MSA │
│ DeepseaCompilerBase::RunHloPasses 0x109152a0 │
└──────────────────────────────────────────────────────────────────────────┘
│ Phase 2a — CompilePhase2aTlpLowering (HLO → MLIR crossing)
│ TlpLowering → TLPFunction (TPU-Level Program); enters MLIR
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ LEVEL 2 — MLIR HIGH/MID (MHLO + TLP) → tpu target dialect │
│ mlir::mhlo::* → mlir::tpu::* ; Mosaic kernels join HERE via │
│ tpu_custom_call (CanonicalizeMosaicPass / MosaicSerdePass) │
└──────────────────────────────────────────────────────────────────────────┘
│ mlir::tpu::LowerPassBase family → createLowerToLLOPass 0x11203ba0
│ (DialectConversion legalizers; ApplyVectorLayoutPass; LowerToMloPass DMA)
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ LEVEL 3 — LLO (jellyfish::LloModule / LloInstruction / LloRegion) │
│ the "MachineInstr" of the TPU: opcodes, allocations, dependency graph │
│ LSRAv2 live-range alloc + MSA placement resolve here │
└──────────────────────────────────────────────────────────────────────────┘
│ Part VIII — schedule + pack (PackBundles 0x10a30a20)
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ LEVEL 4 — SCHEDULED BUNDLES (vector<jellyfish::Bundle>) │
│ MXU/MRB assignment → bundle packing → modulo scheduling for loops │
└──────────────────────────────────────────────────────────────────────────┘
│ Phase 3 — CompilePhase3Linking ; per-gen Encoder*::EncodeBundle
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ LEVEL 5 — PACKED PROGRAM (TpuProgram, serialized byte stream) │
└──────────────────────────────────────────────────────────────────────────┘
The two worlds and the crossing
Levels 0 and 1 are MLIR-bytecode and XLA-HLO respectively; both predate the TPU target and are upstream XLA. StableHLO/CHLO/VHLO are the portable wire IRs that the framework bridge (JAX) actually serializes — they are MLIR dialects, but at this point they are payload, not yet under TPU control. Level 1 is classic XLA HLO: the HloModule/HloInstruction graph that XLA's whole optimizer was written against, with roughly 100 HloOpcodes. Phase 0 is the format crossing that imports the portable bytecode into HLO; it is legalization between equivalent representations, not an optimization, which is why it is a distinct phase rather than a pass inside Phase 1.
Levels 2 through 5 are TPU-specific. Phase 2a (CompilePhase2aTlpLowering, 0xf850840) is the crossing from HloInstruction into MLIR via the TPU-Level Program (TLP) representation. The tpu dialect (covering MXU matmul, DMA enqueue/wait, semaphores, vector relayout, pack/unpack, PRNG, and trace) is the TPU's machine dialect. The Level-2 tpu layer is reached by exactly one producer — imported Mosaic tpu_custom_call kernels — not by lowering general MHLO (the binary has no MHLO→tpu pass); general HLO is emitted straight to LLO by the jellyfish *Emitter family without ever entering the tpu dialect. See the two-tree picture on MHLO → XTile → tpu and Mosaic Overview. The tpu dialect lowers through the mlir::tpu::LowerPassBase family / createLowerToLLOPass (0x11203ba0) into LLO, the jellyfish::Llo* family — LloModule, LloInstruction, LloRegion, LloOpcode, LloAllocation, LloDependencyGraph — which is the genuine "machine instruction" level: pre-bundle, post-MLIR, where live-range allocation (LSRAv2) and memory-space placement resolve. The scheduling back end then packs LLO into Bundles, and Phase 3 (CompilePhase3Linking, 0xf852f40) finalizes and serializes the TpuProgram.
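The layer stack can be written down as a small table of edges. The sketch below mirrors the diagram's downward arrows only; it does not model the direct HLO-to-LLO emitter path described above, and the names `Level`, `EDGES`, `ENTRY_LEVEL`, `descent`, and `format_crossings` are illustrative assumptions rather than anything found in the binary.

```python
from enum import IntEnum
from typing import List, Tuple


class Level(IntEnum):
    WIRE = 0       # StableHLO / CHLO / VHLO bytecode
    HLO = 1        # HloModule / HloInstruction
    MLIR_TPU = 2   # MHLO + TLP, tpu dialect
    LLO = 3        # jellyfish::Llo*
    BUNDLES = 4    # scheduled bundles
    PACKED = 5     # serialized TpuProgram


# One entry per downward edge in the diagram:
# (from, to, driving symbol, is_format_crossing).
EDGES = [
    (Level.WIRE, Level.HLO, "CompilePhase0StablehloToHlo", True),
    (Level.HLO, Level.MLIR_TPU, "CompilePhase2aTlpLowering", True),
    (Level.MLIR_TPU, Level.LLO, "createLowerToLLOPass", False),
    (Level.LLO, Level.BUNDLES, "PackBundles", False),
    (Level.BUNDLES, Level.PACKED, "CompilePhase3Linking", False),
]

# Where each kind of input joins the stack; a Mosaic kernel body joins at Level 2.
ENTRY_LEVEL = {"hlo_program": Level.WIRE, "mosaic_kernel": Level.MLIR_TPU}


def descent(kind: str) -> List[str]:
    """List the driving symbols an input of this kind passes through, in order."""
    start = ENTRY_LEVEL[kind]
    return [driver for src, _dst, driver, _crossing in EDGES if src >= start]


def format_crossings() -> List[Tuple[Level, Level]]:
    """Return the edges that change representation rather than run as a pass."""
    return [(src, dst) for src, dst, _driver, crossing in EDGES if crossing]
```

Keeping the crossing flag on the edge rather than on the level makes the two format boundaries easy to query without hard-coding level numbers.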
NOTE: "TLP" and "MHLO" are the same level, not two. TLP (TPU-Level Program) names the Phase 2a import mechanism; its only direct anchors are the phase symbol CompilePhase2aTlpLowering (0xf850840) and the LLVM-backend predicate llvm::TPU::isTLPFunction (0x13b67a60). MHLO is the MLIR dialect the program is expressed in on arrival. They are not stacked layers: Level 2 is "the MLIR mid-level," and TLP names the import mechanism while MHLO/tpu name the dialects. A reimplementer should treat Phase 2a as "HLO → MLIR" and not search for a separate standalone TLP IR with its own op set; no tlp. op-model vtables or a distinct TlpLowering/TLPFunction type were isolable in the binary (LOW confidence on TLP having a distinct op enum).
QUIRK: the dialect op tables are deceptively large because the bytecode dialects ship in full. RTTI confirms StableHLO, MHLO (including mhlo.async_bundle), CHLO (the high-level "client" HLO), and VHLO (the versioned dialect, with _v1/_v2-suffixed op names) are all present as complete op tables; each contributes thousands of mlir::{stablehlo,mhlo,chlo,vhlo}::* symbols. Only a fraction are reachable on the TPU path: VHLO in particular exists purely for wire-format forward compatibility (older runtimes ignore newer fields), and most VHLO/CHLO ops are legalized away in Phase 0. A reimplementation that builds handlers for every dialect op is building for the import surface, not the codegen surface; the codegen surface is the HLO opcodes plus the tpu dialect ops. (Exact per-dialect op-enum sizes were not byte-anchored here: string scans of the <dialect>. op-name prefixes are contaminated by truncated strings and by type/attribute mnemonics, so no precise op count is asserted.)
The Five Compile Phases
The descent is structured as five phases registered by xla::TpuCompiler::RegisterAllPhases (0xf849ec0). Each is a free function in the xla namespace taking and returning absl::Span<const PjRtPartialProgramProto> — the PjRt partial program protocol, which lets the runtime stop after any phase (for AOT, caching, or staged compilation) and resume from the serialized intermediate. This phase boundary is why each stage is a distinct symbol rather than an inlined block: each is independently invocable and its output is serializable.
| Phase | Entry point | Address | Crosses | Owns |
|---|---|---|---|---|
| 0 — Stablehlo→Hlo | CompilePhase0StablehloToHlo | 0xf84de60 | wire bytecode → XLA HLO | StableHLO/CHLO/VHLO legalization, HLO import |
| 1 — Hlo optimizations | CompilePhase1HloOptimizations | 0xf84ee00 | HLO → HLO | the entire HLO pass pipeline (opt, layout, fusion, MSA, schedule) |
| 2a — Tlp lowering | CompilePhase2aTlpLowering | 0xf850840 | HLO → MLIR (tpu) → LLO | HLO→MLIR crossing, dialect legalization, tpu→LLO |
| 2b — Deduped lowering | CompilePhase2bDedupedLowering | 0xf852180 | LLO → LLO | cross-region deduplication of lowered output |
| 3 — Linking | CompilePhase3Linking | 0xf852f40 | LLO/bundles → packed | bundle finalization, encode, TpuProgram serialization |
The single heaviest phase is Phase 1: it is where DeepseaCompilerBase::RunHloPasses (0x109152a0) runs the full HLO pass pipeline, and it is where the four optimization-pass families described in the next section live. Phases 2a and 2b are the lowering descent — 2a does the bulk crossing into MLIR and down to LLO, and 2b deduplicates: the CompilePhase2bDedupedLowering name and its LLO-in/LLO-out position indicate it collapses identical lowered regions so the back end packs each unique region once (the dedup algorithm itself was not reverse-engineered — see Compile Phases; MEDIUM on the exact dedup key).
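The stop-and-resume property of the phase spine can be illustrated with a toy registry. This is a sketch under stated assumptions: `PartialProgram` is a hypothetical stand-in for PjRtPartialProgramProto, the format tags are invented for illustration, and the real phases do actual compilation work where these only record that they ran.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class PartialProgram:
    """Hypothetical stand-in for PjRtPartialProgramProto: a tagged, serializable blob."""
    fmt: str
    payload: str


Phase = Callable[[List[PartialProgram]], List[PartialProgram]]


def make_phase(name: str, consumes: str, produces: str) -> Phase:
    """Build a toy phase that checks its input format and records that it ran."""
    def run(programs: List[PartialProgram]) -> List[PartialProgram]:
        for p in programs:
            if p.fmt != consumes:
                raise ValueError(f"{name} expects {consumes!r}, got {p.fmt!r}")
        return [PartialProgram(produces, p.payload + "|" + name) for p in programs]
    return run


class PhaseRegistry:
    def __init__(self) -> None:
        self._order: List[str] = []
        self._phases: Dict[str, Phase] = {}

    def register(self, name: str, phase: Phase) -> None:
        self._order.append(name)
        self._phases[name] = phase

    def run(self, programs: List[PartialProgram],
            start: Optional[str] = None,
            stop_after: Optional[str] = None) -> List[PartialProgram]:
        """Run phases in registration order, optionally over a sub-range."""
        first = self._order.index(start) if start else 0
        last = self._order.index(stop_after) if stop_after else len(self._order) - 1
        for name in self._order[first:last + 1]:
            programs = self._phases[name](programs)
        return programs


def register_all_phases() -> PhaseRegistry:
    """Wire the five phases in order; format tags are illustrative only."""
    registry = PhaseRegistry()
    registry.register("Phase0StablehloToHlo", make_phase("P0", "stablehlo", "hlo"))
    registry.register("Phase1HloOptimizations", make_phase("P1", "hlo", "optimized_hlo"))
    registry.register("Phase2aTlpLowering", make_phase("P2a", "optimized_hlo", "llo"))
    registry.register("Phase2bDedupedLowering", make_phase("P2b", "llo", "deduped_llo"))
    registry.register("Phase3Linking", make_phase("P3", "deduped_llo", "tpu_program"))
    return registry
```

Running with `stop_after="Phase1HloOptimizations"` and later resuming with `start="Phase2aTlpLowering"` should give the same result as one uninterrupted run, which is the property the partial-program boundary exists to provide.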
GOTCHA: Phase 1 is not "optimize then lower"; layout assignment changes the program shape that lowering depends on. Inside Phase 1, layout assignment and MSA are analyses that rewrite the HLO (they insert copies, choose tiled layouts, and assign memory spaces), and fusion and the latency-hiding scheduler run against the post-layout program. A reimplementer who treats Phase 1 as a pure optimizer and defers all layout/memory decisions to the MLIR side will produce a tpu-dialect program whose tile shapes and memory spaces are unassigned, which the tpu→LLO legalizer cannot lower. The HLO half commits layout and memory space; the MLIR half realizes them.
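The commit-then-realize contract can be shown with a toy tensor record. This is only a sketch: the `Tensor` type, the `commit` and `lower` helpers, and the tile value used in the example are all hypothetical and chosen for illustration, not taken from the binary.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tensor:
    name: str
    shape: Tuple[int, ...]
    tile: Optional[Tuple[int, int]] = None   # set by layout assignment
    memory_space: Optional[str] = None       # set by memory-space assignment


def commit(t: Tensor, tile: Tuple[int, int], memory_space: str) -> Tensor:
    """Stand-in for the Phase 1 analyses: record the physical decisions."""
    return replace(t, tile=tile, memory_space=memory_space)


def lower(t: Tensor) -> str:
    """Stand-in for lowering: refuse any tensor whose decisions are missing."""
    if t.tile is None or t.memory_space is None:
        raise ValueError(f"{t.name}: layout or memory space unassigned")
    return f"{t.name}@{t.memory_space} tile={t.tile[0]}x{t.tile[1]}"


# Example with a hypothetical tile shape.
x = commit(Tensor("x", (256, 256)), tile=(8, 128), memory_space="VMEM")
lowered = lower(x)
```

The lowering step here only reads the committed fields, which is the division of labor the callout describes.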
The Four Pass Families
Inside the five phases, the actual transformation work clusters into four families. Each family has its own detail pages; this section places them and says what they decide.
Family 1 — HLO optimization passes (Phase 1)
The classic XLA pass pipeline: algebraic simplification, sharding propagation and SPMD partitioning, dynamic-shape handling, optimization barriers, custom-call lowering, and the long tail of canonicalizations. These run on HloInstruction under RunHloPasses and are registered/ordered by the HLO pass registry. They are target-retargeted upstream XLA: the same pass classes XLA ships for every backend, with TPU-specific enablement driven by the xla_jf_* flag family (Jellyfish; the dominant family, hundreds of flags), xla_tpu_* (pan-TPU codegen), and per-generation prefixes (xla_sc_* SparseCore, xla_vf_* Viperfish, xla_gf_* 6acc60406, xla_pf_* Pufferfish). Detail: HLO Pass Registry, HLO Pre-Passes, Algebraic Simplifier, Sharding Propagation, Auto-Sharding / SPMD, Dynamic-Shape Support, Optimization Barrier, Custom-Call Lowering.
Family 2 — codegen-gating analyses: layout assignment + MSA (Phase 1)
Two analyses commit physical decisions the lowering depends on. Layout assignment (LayoutAssignment) chooses the tiled memory layout of every tensor — the TPU analogue of choosing data layout for a SIMD machine, except the tiling interacts with the MXU's 128×128 systolic geometry. MSA (MemorySpaceAssignment) is XLA's memory-space assignment: it decides which buffers live in fast VMEM versus HBM and when to prefetch/evict across the tiers, governed by the xla_msa_* flag family — it is the TPU's closest analogue to register allocation, but for the memory hierarchy rather than registers. Detail: Layout Assignment, MSA Overview, MSA AllocateSegment, MSA Per-Version Defaults, MSA Reservation & HBM Policy.
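To make the register-allocation analogy concrete, here is a deliberately simplified placement sketch. It is not XLA's MSA algorithm: it is a greedy heuristic that assumes each buffer has one contiguous live range, ranks buffers by uses per byte, and ignores prefetch and eviction entirely. All names (`Buffer`, `assign_memory_spaces`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Buffer:
    name: str
    size: int    # bytes
    start: int   # first time step the buffer is live
    end: int     # last time step the buffer is live (inclusive)
    uses: int    # how often it is read or written while live


def assign_memory_spaces(buffers: List[Buffer], vmem_capacity: int) -> Dict[str, str]:
    """Greedily place the most-used-per-byte buffers in VMEM while capacity holds."""
    horizon = max((b.end for b in buffers), default=-1) + 1
    used = [0] * horizon  # VMEM bytes occupied at each time step
    placement: Dict[str, str] = {}
    for b in sorted(buffers, key=lambda b: b.uses / b.size, reverse=True):
        span = range(b.start, b.end + 1)
        if all(used[t] + b.size <= vmem_capacity for t in span):
            for t in span:
                used[t] += b.size
            placement[b.name] = "VMEM"
        else:
            placement[b.name] = "HBM"
    return placement


example = assign_memory_spaces(
    [
        Buffer("a", size=4, start=0, end=3, uses=8),
        Buffer("b", size=4, start=2, end=5, uses=4),
        Buffer("c", size=4, start=4, end=6, uses=2),
    ],
    vmem_capacity=4,
)
```

In this example "a" and "c" fit in the fast tier because their live ranges do not overlap, while "b" overlaps "a" and falls back to HBM. A real allocator would also consider moving a buffer between tiers partway through its lifetime.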
Family 3 — fusion (Phase 1)
Fusion clusters elementwise and reduction ops into single kernels to cut memory traffic, priced by a TPU cost model. HloFusion and the FusionPipeline drive it; the profitability decision reads the Part VII cost model. Detail: Fusion Patterns, Fusion Cost Model, and the dot/conv → MXU lowering that fusion feeds: Dot / Conv → MXU Lowering, RaggedDot → Convolution.
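The clustering idea can be sketched on a straight-line chain of ops. This is a toy, assuming a linear producer-consumer chain and a single byte-count threshold standing in for the cost model; the actual fusion passes and the Part VII cost model are far richer, and the names `Op`, `FUSIBLE`, and `fuse_chain` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    name: str
    kind: str        # "elementwise", "reduce", or "dot"
    out_bytes: int   # size of the tensor this op writes


FUSIBLE = {"elementwise", "reduce"}


def fuse_chain(ops: List[Op], min_saved_bytes: int) -> List[List[str]]:
    """Group a linear chain into fused kernels.

    A consumer joins its producer's group when the producer is elementwise,
    the consumer is elementwise or a reduction, and skipping the intermediate
    write saves at least min_saved_bytes (a stand-in for the cost model).
    """
    groups: List[List[Op]] = []
    for op in ops:
        prev = groups[-1][-1] if groups else None
        can_fuse = (
            prev is not None
            and prev.kind == "elementwise"
            and op.kind in FUSIBLE
            and prev.out_bytes >= min_saved_bytes
        )
        if can_fuse:
            groups[-1].append(op)
        else:
            groups.append([op])
    return [[o.name for o in g] for g in groups]


kernels = fuse_chain(
    [
        Op("dot", "dot", 1024),
        Op("add", "elementwise", 1024),
        Op("mul", "elementwise", 1024),
        Op("sum", "reduce", 4),
        Op("exp", "elementwise", 4),
    ],
    min_saved_bytes=64,
)
```

Here the add, mul, and sum ops fuse into one kernel because each intermediate write is large enough to be worth avoiding, while the dot stays on its own and the reduction closes its group.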
Family 4 — MLIR lowering legalizers + schedule/pack back end (Phases 2a/2b/3)
The MLIR-side lowering is a chain of DialectConversion legalizers centered on tpu → LLO (createLowerToLLOPass, 0x11203ba0), with the vector-layout realization pass (ApplyVectorLayoutPass) and the DMA bridge (createLowerToMloPass). There is no MHLO → tpu legalizer in this binary (0 *ToTpuPass/MhloToTpu/StablehloToTpu symbols): the tpu dialect is imported from Mosaic tpu_custom_calls and lowered to LLO, while general (non-Mosaic) HLO is emitted straight to LLO by the jellyfish *Emitter family — it never becomes tpu ops. The "MHLO → XTile → tpu" reading of Level 2 is two disconnected trees; see the two-tree picture in MHLO → XTile → tpu. After LLO, the schedule-and-pack back end (PackBundles, 0x10a30a20) assigns MXU sequences, packs VLIW bundles, and modulo-schedules loops — this is the subject of Part VIII. Detail on the MLIR side: MHLO → XTile → tpu, The tpu MLIR Dialect, tpu → LLO Lowering, DialectConversion Legalizer, ConversionPatternRewriter, LowerToMlo DMA Bridge, MLIR Op-Model Contract. The scheduling/packing back end lives in Part VIII — Scheduling; its pricing lives in Part VII — Cost Model.
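The shape of a dialect-conversion legalizer can be sketched without MLIR: a legality predicate, a pattern table, and a worklist that rewrites until nothing illegal remains. This is a simplified model of the general technique, not the pass in the binary, and every op name below (the `tpu.ex_*` and `llo.ex_*` strings) is invented for illustration.

```python
from typing import Callable, Dict, List


def legalize(ops: List[str],
             is_legal: Callable[[str], bool],
             patterns: Dict[str, List[str]],
             max_rewrites: int = 1000) -> List[str]:
    """Rewrite illegal ops via patterns until every remaining op is legal."""
    worklist = list(ops)
    out: List[str] = []
    rewrites = 0
    while worklist:
        op = worklist.pop(0)
        if is_legal(op):
            out.append(op)
            continue
        if op not in patterns:
            raise ValueError(f"failed to legalize {op!r}: no pattern")
        rewrites += 1
        if rewrites > max_rewrites:
            raise RuntimeError("pattern set does not converge")
        # Replacement ops go back on the worklist; they may need further rewriting.
        worklist = list(patterns[op]) + worklist
    return out


# Hypothetical op names: the target is "anything in the llo namespace".
PATTERNS = {
    "tpu.ex_copy": ["tpu.ex_dma_start", "tpu.ex_dma_wait"],
    "tpu.ex_dma_start": ["llo.ex_dma_enqueue"],
    "tpu.ex_dma_wait": ["llo.ex_sem_wait"],
}

lowered_ops = legalize(
    ["tpu.ex_copy", "llo.ex_nop"],
    is_legal=lambda op: op.startswith("llo."),
    patterns=PATTERNS,
)
```

An op with no matching pattern is a hard failure rather than a silent pass-through, which matches the behavior of a full conversion where every op must end up legal.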
The Mosaic Side Channel
Not every tpu-dialect program comes from the HLO optimizer. Mosaic is a separate kernel compiler that lets hand-written kernels (Pallas/JAX-side) be expressed directly in the tpu dialect and embedded into the HLO graph as tpu_custom_call ops. On the way down, mlir::tpu::CanonicalizeMosaicPass normalizes the embedded MLIR and mlir::tpu::MosaicSerdePass handles its serialization, so the kernel body rejoins the main descent at the tpu dialect (Level 2). The LLO-level driver that imports the embedded module, validates its ABI, runs the Mosaic pass pipeline, and stitches the result into the parent region is jellyfish::CustomCallEmitter::Emit (0x111ef740); the smaller jellyfish::MosaicEmitter (MosaicEmitter::EmitWindow @ 0xfaadcc0, taking an LloRegionBuilder) is the per-window emit helper used inside it, not the driver. See Mosaic Overview.
The structural point for a reimplementer is that Mosaic is a second IR producer feeding the same target dialect, not a parallel back end. A Mosaic kernel skips the entire HLO optimizer (Family 1–3) — it arrives already as tpu ops — but it converges with the optimized path at the tpu dialect and shares the identical tpu→LLO→bundle descent from there. Mosaic carries its own layout machinery (mlir::tpu::VectorLayout, TiledLayoutAttr) because its kernels declare their own tiling rather than inheriting layout assignment's choices. MosaicFusion lets the surrounding HLO fuse around a custom call, and IsMosaicVerificationEnabled (0x14514d40) gates an extra verifier. Detail: Mosaic Overview, Mosaic Layout Inference, Mosaic VectorLayout.
NOTE: the Mosaic boundary is a real IR seam, and it is where SparseCore also enters. The SparseCore path (mosaic_sc::MosaicSCDialect, LowerToSparseCoreLlvmPass) reuses the same custom-call entry and dialect-conversion machinery to target the SparseCore engines rather than the TensorCore MXU. Its type lowering is an ordinary LLVMTypeConverter augmented with one SC address-space conversion lambda; there is no distinct SCTypeConverter symbol, and the name is a documentary handle for that augmented converter. A reimplementer building only the TensorCore descent will see tpu_custom_calls it cannot lower; the custom-call registry routes each call to its emitter by target. See LowerToSparseCoreLlvm and SCTypeConverter, and the SparseCore Part.
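Routing by target can be sketched as a plain lookup table from target name to emitter. This is an illustrative model only: `CustomCall`, `CustomCallRegistry`, and the SparseCore target string `example_sparsecore_call` are hypothetical, and only the `tpu_custom_call` name comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CustomCall:
    target: str   # the custom-call target name carried by the HLO
    body: str     # stand-in for the embedded kernel module


Emitter = Callable[[CustomCall], str]


class CustomCallRegistry:
    def __init__(self) -> None:
        self._emitters: Dict[str, Emitter] = {}

    def register(self, target: str, emitter: Emitter) -> None:
        if target in self._emitters:
            raise ValueError(f"emitter already registered for {target!r}")
        self._emitters[target] = emitter

    def emit(self, call: CustomCall) -> str:
        """Dispatch the call to the emitter registered for its target."""
        try:
            emitter = self._emitters[call.target]
        except KeyError:
            raise LookupError(f"no emitter registered for {call.target!r}") from None
        return emitter(call)


registry = CustomCallRegistry()
registry.register("tpu_custom_call", lambda c: f"tensorcore:{c.body}")
# Hypothetical second target, standing in for the SparseCore route.
registry.register("example_sparsecore_call", lambda c: f"sparsecore:{c.body}")
```

A descent that registers only the TensorCore emitter fails loudly on the second kind of call, which is the situation the note warns about.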
Map of Part V
The remaining Part V pages own the detail this overview only places. Grouped by IR level:
| Level / family | Pages |
|---|---|
| Front-end / phases | HLO Ingestion, Compile Phases 0–3, Compilation Cache |
| HLO opt passes | HLO Pass Registry, HLO Pre-Passes, Algebraic Simplifier, Sharding Propagation, Auto-Sharding / SPMD, Dynamic-Shape Support, Optimization Barrier, Custom-Call Lowering |
| Layout / memory | Layout Assignment, MSA Overview, MSA AllocateSegment, MSA Per-Version Defaults, MSA Reservation & HBM Policy |
| Fusion / MXU | Fusion Patterns, Fusion Cost Model, Dot / Conv → MXU Lowering, RaggedDot → Convolution, Loop Tiling & Unrolling |
| MLIR lowering | MHLO → XTile → tpu, The tpu MLIR Dialect, tpu → LLO Lowering, DialectConversion Legalizer, ConversionPatternRewriter, LowerToMlo DMA Bridge, MLIR Op-Model Contract, LlvmTpu Intrinsic Catalog |
| Mosaic / SparseCore lowering | Mosaic Overview, Mosaic Layout Inference, Mosaic VectorLayout, LowerToSparseCoreLlvm, SCTypeConverter |
| Output | TpuProgram Serialization |
| Adjacent parts | Part VIII — Scheduling, Part VII — Cost Model, Part — SparseCore |
Confidence Summary
| Claim | Evidence |
|---|---|
| Five compile phases named CompilePhase0..3 (with 2a/2b split) | symbols CompilePhase0StablehloToHlo 0xf84de60, …1HloOptimizations 0xf84ee00, …2aTlpLowering 0xf850840, …2bDedupedLowering 0xf852180, …3Linking 0xf852f40 |
| Phases registered by one registry, take/return PjRtPartialProgramProto spans | TpuCompiler::RegisterAllPhases 0xf849ec0; phase signatures absl::Span<const PjRtPartialProgramProto> |
| HLO pass pipeline driven by DeepseaCompilerBase::RunHloPasses | RunHloPasses 0x109152a0; C wrapper TpuCompiler_RunHloPasses 0xeabcd80; RunBackendOnModuleInternal, LowerHloModuleImpl (seen in XLA_Timer* guard variables) |
| IR stack: StableHLO/CHLO/VHLO → HLO → MHLO/tpu → LLO → bundles | RTTI mlir::{stablehlo,chlo,vhlo,mhlo,tpu}::*; jellyfish::Llo{Module,Instruction,Region,Opcode} |
| Phase 0 is a legalization crossing | ConvertStablehloToHlo, StablehloLegalizeToHlo, ChloLegalizeToHlo, HloLegalizeToStablehlo, VhloToVersion |
| tpu dialect lowers to LLO via LowerToLLOPass/createLowerToLLOPass | mlir::tpu::createLowerToLLOPass 0x11203ba0; mlir::tpu::LowerPassBase, ApplyVectorLayoutPass, createLowerToMloPass 0x1322adc0 |
| Layout assignment, MSA, fusion are the HLO-side codegen families | LayoutAssignment, MemorySpaceAssignment, HloFusion, FusionPipeline symbols present |
| LLO is packed to bundles by the back end | PackBundles 0x10a30a20; jellyfish::BundlePacker, GlobalBundlePacker |
| Mosaic enters the tpu dialect via tpu_custom_call | tpu_custom_call strings, CanonicalizeMosaicPass, CustomCallEmitter::Emit (driver), MosaicEmitter::EmitWindow (helper), MosaicSerdePass, MosaicFusion |
| TLP is the HLO→MLIR import vehicle, same level as MHLO (no separate op enum) | phase symbol CompilePhase2aTlpLowering 0xf850840, llvm::TPU::isTLPFunction 0x13b67a60; no isolable tlp. op-model vtables or standalone TlpLowering/TLPFunction type |
| Phase 2b key is cross-region lowered-output deduplication | name CompilePhase2bDedupedLowering + LLO-in/LLO-out position; dedup key not traced |
Cross-References
- HLO Ingestion: Phase 0, how StableHLO/CHLO/VHLO bytecode is legalized and imported into XLA HLO.
- Compile Phases 0–3: the per-phase detail this page only places, namely each phase's body, the RegisterAllPhases wiring, and the Phase 2b dedup mechanism.
- Compilation Cache: how the PjRtPartialProgramProto boundary is reused for caching and AOT.
- HLO Pass Registry: Family 1, the ordered HLO pass pipeline and its xla_* flag-gated enablement.
- Layout Assignment: Family 2, tiled-layout commitment that the MLIR lowering realizes.
- MSA Overview: Family 2, memory-space assignment across the VMEM/HBM tiers.
- Fusion Patterns: Family 3, kernel clustering and its cost-model profitability gate.
- MHLO → XTile → tpu: Family 4, the HLO→MLIR mid-level lowering inside Phase 2a.
- The tpu MLIR Dialect: the TPU target dialect, the convergence point of the optimizer and Mosaic paths.
- tpu → LLO Lowering: Family 4, the createLowerToLLOPass descent into the LLO machine IR.
- Mosaic Overview: the side-channel kernel compiler that feeds the tpu dialect directly via tpu_custom_call.
- TpuProgram Serialization: Phase 3, how the packed bundle stream serializes into a TpuProgram.
- Part VIII — Scheduling: the LLO→bundle schedule-and-pack back end this page hands off to.
- Part VII — Cost Model: the cycle/latency model that prices fusion, layout, MSA, and scheduling decisions.
- SparseCore Overview: the second target the custom-call/Mosaic seam routes to.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part V — Compiler: Lowering & Optimization Passes / Front-end and pipeline (back to index)