MHLO → XTile → tpu Lowering
All addresses, symbols, and op-name strings on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00). Other versions will differ.
Abstract
This page documents the MHLO/StableHLO → XTile → tpu lowering as it actually exists in the binary — and the most important fact about it is structural and corrective. The phrase "MHLO → XTile → tpu" describes two different lowering trees that share libtpu.so but never connect:
- MHLO/StableHLO → XTile is real. XTile (xla::xtile::XTileDialect) is a tiled-tensor MLIR dialect with six registered ops, a four-pass front-end lowering, and a verbatim StableHLO→arith pattern table. It lives under third_party/.../xla/codegen/xtile/ and lowers to the CPU/LLVM stack via an xtile-cpu-* pipeline.
- XTile → tpu is not real. No pass produces the tpu dialect from XTile. The binary contains zero *ToTpuPass/MhloToTpu/StablehloToTpu conversion-pass functions. XTile's dependent-dialect set is the CPU/LLVM set (xla::cpu::XlaCpuDialect, memref, vector, LLVM); tpu, llo, and sparse_core appear nowhere in any XTile pass.
The tpu dialect that the compiler overview names as Level 2 is therefore not produced by lowering general MHLO. On the TPU device path, general HLO is emitted straight to LLO by ~3225 xla::jellyfish::*Emitter classes; the tpu dialect is only ever imported — authored by the Pallas/Mosaic frontend, embedded in an HLO kCustomCall("tpu_custom_call"), and extracted by xla::jellyfish::mlir_utils::GetMlirModuleOpFromCustomCall. This page documents Tree B (XTile, in full) and pins down where Tree A's tpu dialect actually originates. The "XTile sits between MHLO and tpu" reading is two disconnected trees: the symbol, string, and dependent-dialect evidence in this binary shows no XTile→tpu link (see The Two-Tree Picture).
For reimplementation, the contract this page fixes is:
- The two trees and their boundary. Tree A (TPU device: HLO → jellyfish *Emitter → LLO; tpu only via Mosaic custom-call import) versus Tree B (XLA CPU/GPU codegen: StableHLO/MHLO → XTile → memref/vector → LLVM). Why "XTile → tpu" has no pass.
- The four XTile front-end passes (stablehlo-lower-to-xtile, stablehlo-lower-to-arith, convert-elementwise-0d-tensor-to-scalar, xtile-verify-legal-ops): their CLI names, create-functions, dependent dialects, and ordering inside the xtile-cpu-* pipeline.
- The six registered XTile ops with recovered build() signatures, plus the two attributes (xtile.layout, xtile.tiling_info) and TiledBufferInterface.
- The per-op MHLO/StableHLO → XTile mapping: the verbatim StableHLO→arith template table, the Emit* fusion-emitter helpers, and worked lowerings for elementwise, dot, block-scaled dot, transpose, reshape, and constant.
- Where the tpu dialect actually comes from: the Mosaic import seam (GetMlirModuleOpFromCustomCall / RunMLIRPasses), so a reimplementer does not waste effort building an MHLO→tpu legalizer that does not exist.
| Item | Detail |
|---|---|
| XTile dialect | xla::xtile::XTileDialect (initialize @ 0x1507ec20); source third_party/.../xla/codegen/xtile/ir/xtile_ops.cc |
| Front-end passes | StablehloLowerToXtilePass 0x150602c0 · StablehloLowerToArithPass 0x1505a880 · ConvertElementwise0DTensorToScalarPass 0x15059440 · VerifyLegalXTileOpsPass 0x15062300 |
| Terminal CPU pass | xla::cpu::LowerXTileEntryPass (xtile-cpu-lower-xtile-entry; create @ 0x14d905c0) |
| Top-level emitter | xla::xtile::EmitXTileModule @ 0x14c1c9e0 (HLO fusion → xtile.entry_func module) |
| Registered ops (6) | entry_func, return, extract, insert, mask, dot_scaled (from XTileDialect::initialize's addOperations<>) |
| Attributes (2) | xtile.layout (LayoutAttr) · xtile.tiling_info (TilingInfoAttr) |
| XTile lowering target | CPU/LLVM (memref/vector/scf/LLVM); not tpu or llo |
| tpu-dialect origin | imported only: GetMlirModuleOpFromCustomCall @ 0x13e327a0; pipeline RunMLIRPasses @ 0x111fefa0 |
| MHLO→tpu passes in binary | none (0 *ToTpuPass/MhloToTpu/StablehloToTpu symbols) |
| Confidence | HIGH (symbol/string-anchored) unless a row or callout says otherwise |
The Two-Tree Picture
The single structural fact that governs this page: the string "MHLO → XTile → tpu" names a chain that does not exist as one chain. Two independent lowering trees are linked into libtpu.so, and the XTile → tpu link that would join them is absent.
TREE A — TPU device (the real product path)
┌────────────────────────────────────────────────────────────────────────┐
│ HLO ──[97-row pre-passes]──► HLO │
│ ──[~3225 xla::jellyfish::*Emitter + LloRegionBuilder]──► LLO │
│ ──[bundle packer]──► ISA bundles │
│ │
│ PLUS, for JAX/Pallas kernels ONLY: │
│ HLO kCustomCall("tpu_custom_call") carrying a serialized `tpu` module │
│ ──[GetMlirModuleOpFromCustomCall 0x13e327a0]──► tpu-dialect IR │
│ ──[RunMLIRPasses 16-stage pipeline]──► LLO │
└────────────────────────────────────────────────────────────────────────┘
the `tpu` dialect is authored upstream and imported — never
produced by lowering MHLO
TREE B — XLA CPU/GPU codegen (bundled, off the TPU device path)
┌────────────────────────────────────────────────────────────────────────┐
│ StableHLO/MHLO ──[stablehlo-lower-to-xtile]──► XTile (+arith/math) │
│ ──[stablehlo-lower-to-arith]──► XTile + arith/math │
│ ──[convert-elementwise-0d-tensor-to-scalar]──► scalarized │
│ ──[xtile-verify-legal-ops]──► (gate) │
│ ──[xtile-cpu-bufferization → memref]──► │
│ ──[xtile-cpu-*-to-vector / -to-loops]──► vector/scf/memref │
│ ──[xtile-cpu-lower-xtile-entry]──► func/LLVM ──► CPU kernel │
└────────────────────────────────────────────────────────────────────────┘
XTile lowers to the LLVM/CPU stack, NEVER to `tpu`
So "MHLO → XTile" is Tree B and is fully real; "XTile → tpu" is the broken link. The evidence:
| Observation | Evidence in this binary |
|---|---|
| Every XTile transform pass is xtile-cpu-* | CLI strings recovered: xtile-cpu-bufferization, -fuse-elementwise, -linalg-elementwise-to-vector, -memref-copy-to-loops, -shlo-to-vector, -tensor-ops-to-bufferizable, -unpack-sub-byte-vector-write, -vector-to-scalar, -lower-xtile-entry |
| XTile lives in the XLA codegen tree, not jellyfish | source paths codegen/xtile/ir/xtile_ops.cc, codegen/xtile/ir/transforms/lower_stablehlo_to_xtile.cc, codegen/xtile/codegen/fusion_emitter.cc, …/dot_algorithms.cc, …/emitter_helpers.{cc,h}, …/tiled_emitter_constraints.cc |
| XTile's dependent-dialect set is CPU/LLVM, with no tpu | registry insert<cf, func, math, xla::cpu::XlaCpuDialect, mhlo, scf, LLVM, tensor, vector, xla::Xla, xla::xtile, stablehlo, linalg, memref, ub>; XlaCpuDialect confirmed present; no tpu/llo/sparse_core in any XTile pass |
| XTile ops bufferize through MLIR std One-Shot Bufferize | ExtractTileOp/InsertTileOp implement mlir::bufferization::BufferizableOpInterface; the tpu/llo ops never do (they go through jellyfish allocation) |
| The terminal pass is xla::cpu:: | LowerXTileEntryPass rewrites xtile.entry_func → func.func for the CPU/LLVM emitter |
| No MHLO→tpu pass exists | grep of the functions table: 0 hits for *ToTpuPass, *ToTPUDialect, MhloToTpu, HloToTpu, StablehloToTpu, LegalizeToTpu, LowerMhloToTpu |
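The last row of the table is a negative search result, and it is easy to reproduce against any symbol dump. The snippet below is a minimal sketch of that check, written in Python only for brevity. It assumes the function table has already been exported as a list of demangled names; `find_conversion_passes` and `SAMPLE_SYMBOLS` are hypothetical names, and the sample holds just three symbols quoted on this page rather than the real table.

```python
# Name fragments taken from the evidence table. The "*" wildcards are
# dropped because the search is a plain substring match, as with grep.
CONVERSION_PASS_FRAGMENTS = [
    "ToTpuPass", "ToTPUDialect", "MhloToTpu", "HloToTpu",
    "StablehloToTpu", "LegalizeToTpu", "LowerMhloToTpu",
]


def find_conversion_passes(symbols, fragments=CONVERSION_PASS_FRAGMENTS):
    """Return every symbol that contains at least one fragment."""
    return [s for s in symbols if any(f in s for f in fragments)]


# Illustrative input only: three symbols named on this page.
SAMPLE_SYMBOLS = [
    "xla::xtile::(anonymous namespace)::StablehloLowerToXtilePass",
    "xla::cpu::LowerXTileEntryPass",
    "xla::jellyfish::mlir_utils::GetMlirModuleOpFromCustomCall",
]

if __name__ == "__main__":
    # None of the XTile or jellyfish symbols above matches a fragment.
    print(find_conversion_passes(SAMPLE_SYMBOLS))
```

An empty result over the full function table is what the table row reports; a non-empty result on another libtpu build would be the first sign that the two-tree split no longer holds there.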
NOTE: "TLP" is not a dialect, and it is not XTile. "TLP" (TPU-Level Program) in the overview is the conceptual name for the HLO→MLIR import vehicle of Phase 2a. There is no tlp MLIR dialect and no xla::tlp:: / mlir::tlp:: symbol. On the TPU path, the tpu (Mosaic) dialect is the physical realization of the "TPU-Level Program" container, but only for imported Pallas/Mosaic kernels. General MHLO is lowered straight to LLO by jellyfish *Emitter classes; it never becomes tpu ops. XTile is an unrelated CPU/GPU dialect that happens to ship in the same .so.
The Four XTile Front-End Passes
The MHLO/StableHLO → XTile front-end is four passes in xla::xtile::(anonymous namespace), each generated from the impl::*PassBase TableGen template. The first three rewrite the tiled fusion body; the fourth is a legality gate. All addresses are .text create-function entry points.
| Pass class | create() | runOnOperation | CLI argument | role |
|---|---|---|---|---|
| StablehloLowerToXtilePass | 0x150602c0 | 0x15060560 | stablehlo-lower-to-xtile | structural/tensor-shaped StableHLO → xtile.extract/insert/mask/dot_scaled tile ops + the Emit* tile-dot/transpose/reshape helpers |
| StablehloLowerToArithPass | 0x1505a880 | 0x1505aa60 | stablehlo-lower-to-arith | scalar/elementwise StableHLO → arith/math |
| ConvertElementwise0DTensorToScalarPass | 0x15059440 | 0x15059620 | convert-elementwise-0d-tensor-to-scalar | tensor<f32> (rank-0) → scalar f32 |
| VerifyLegalXTileOpsPass | 0x15062300 | 0x150624c0 | xtile-verify-legal-ops | gate: only XTile/arith/math/tensor/func legal |
The StablehloLowerToXtilePass description string is recovered verbatim: "Lowers stablehlo ops to Xtile."
Dependent dialects (from each *PassBase::getDependentDialects body):
- StablehloLowerToXtilePass: registers arith, stablehlo, plus the shared XTile/tensor/math registry insert. Input = StableHLO (with a small allow-list of MHLO-only ops tolerated); output = XTile + arith + math + tensor.
- StablehloLowerToArithPass: registers arith. Input = StableHLO scalar/elementwise ops; output = arith/math.
- ConvertElementwise0DTensorToScalarPass: an mlir::TypeConverter-driven full conversion mapping rank-0 ranked tensors to scalars, with source+target materialization callbacks and addDynamicallyLegalOp<arith::ConstantOp> to keep 0-D constants legal until the materializers fire.
- VerifyLegalXTileOpsPass: a verifier that walks the module (walk<Operation*>) and fails with "Could not legalize op: " if any op outside the legal XTile/arith/math/tensor/func set survives. This is the "must be gone before XTile codegen" gate.
NOTE: lower-to-xtile and lower-to-arith run together over the tiled fusion body. StablehloLowerToXtilePass handles the shape-level ops (the ones that become tile reads/writes, tile dot, transpose, and reshape), while StablehloLowerToArithPass handles the scalar/elementwise ops over the already-extracted tiles. They are complementary halves of one legalization, not sequential phases that fully legalize independently.
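The legality gate is the simplest of the four passes to model. The following is a minimal sketch of its behavior, not the recovered pass body: ops are represented as plain "dialect.name" strings, the legal set is the XTile/arith/math/tensor/func set named above, and only the error prefix "Could not legalize op: " comes from the binary. The names `verify_legal_ops` and `LegalizationError` are hypothetical, and stopping at the first offender is an assumption.

```python
# Dialects that may survive into XTile codegen, per the pass description.
LEGAL_DIALECTS = {"xtile", "arith", "math", "tensor", "func"}


class LegalizationError(Exception):
    """Raised when an op outside the legal set is still present."""


def verify_legal_ops(op_names, legal_dialects=LEGAL_DIALECTS):
    """Walk the ops in order and fail on the first illegal one."""
    for name in op_names:
        dialect = name.split(".", 1)[0]  # "arith.addf" -> "arith"
        if dialect not in legal_dialects:
            # The prefix is the anchor string recovered from the binary.
            raise LegalizationError("Could not legalize op: " + name)
    return True


if __name__ == "__main__":
    body = ["xtile.extract", "xtile.extract", "arith.addf", "xtile.insert"]
    print(verify_legal_ops(body))
```

A leftover `stablehlo.add` in the list would raise instead, which mirrors the role of the gate: anything the two lower-to-* passes missed is reported before the xtile-cpu-* stages run.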
Position inside the xtile-cpu-* pipeline
The four front-end passes are stages 1–4 of the full CPU codegen pipeline, in lowering order (all CLI names recovered verbatim from the string pool):
1 stablehlo-lower-to-xtile shape ops → xtile.extract/insert/mask/dot_scaled + Emit* tile-dot/transpose/reshape helpers
2 stablehlo-lower-to-arith scalar/elementwise → arith/math
3 convert-elementwise-0d-tensor-to-scalar tensor<f32> → f32
4 xtile-verify-legal-ops gate (only XTile/arith/math/tensor/func legal)
5 xtile-cpu-fuse-elementwise fuse adjacent elementwise tile ops
6 xtile-cpu-shlo-to-vector / -linalg-elementwise-to-vector → vector dialect
7 xtile-cpu-tensor-ops-to-bufferizable normalize stray tensor.* ops
8 xtile-cpu-bufferization tensor → memref (One-Shot Bufferize)
9 xtile-cpu-memref-copy-to-loops memref copies → scf.for
10 xtile-cpu-vector-to-scalar residual vector → scalar
11 xtile-cpu-unpack-sub-byte-vector-write sub-byte stores
12 xtile-cpu-lower-xtile-entry xtile.entry_func → func.func (xla::cpu::LowerXTileEntryPass)
13 → LLVM dialect → CPU object code
GOTCHA: this pipeline is NOT a row in the 97-row HLO pre-pass table. The HLO pre-passes (Compile Phases) belong to the TPU device compiler (DeepseaCompilerBase::RunHloPasses). The xtile-cpu-* pipeline is the XLA CPU backend's per-fusion codegen, invoked from EmitXTileModule after the CPU backend's own StableHLO/MHLO normalization. There is no row in the device pipeline that adds any xtile-* pass. A reimplementer searching the TPU phase list for an XTile stage will not find one; that is the concrete consequence of the two-tree split.
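For a reimplementation it can help to keep the stage order as data and assert ordering constraints against it. The sketch below does only that: the pass names are the CLI strings listed above, while `PIPELINE` and `runs_before` are hypothetical helpers. The relative order of the two stage-6 passes is not established by this page, so their order in the list is an assumption.

```python
# Stages 1-4: the XTile front-end passes, in lowering order.
FRONT_END = [
    "stablehlo-lower-to-xtile",
    "stablehlo-lower-to-arith",
    "convert-elementwise-0d-tensor-to-scalar",
    "xtile-verify-legal-ops",
]

# Stages 5-12: the xtile-cpu-* passes. The two "to-vector" passes share
# stage 6; which of them runs first is an assumption of this sketch.
CPU_STAGES = [
    "xtile-cpu-fuse-elementwise",
    "xtile-cpu-shlo-to-vector",
    "xtile-cpu-linalg-elementwise-to-vector",
    "xtile-cpu-tensor-ops-to-bufferizable",
    "xtile-cpu-bufferization",
    "xtile-cpu-memref-copy-to-loops",
    "xtile-cpu-vector-to-scalar",
    "xtile-cpu-unpack-sub-byte-vector-write",
    "xtile-cpu-lower-xtile-entry",
]

PIPELINE = FRONT_END + CPU_STAGES


def runs_before(first, second, pipeline=PIPELINE):
    """True if `first` is scheduled earlier than `second`."""
    return pipeline.index(first) < pipeline.index(second)


if __name__ == "__main__":
    print(runs_before("xtile-verify-legal-ops", "xtile-cpu-bufferization"))
    print(any("tpu" in name for name in PIPELINE))
```

The second check restates the GOTCHA in executable form: no stage name in this pipeline mentions tpu.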
The Six Registered XTile Ops
XTileDialect::initialize (0x1507ec20) registers six ops via a single addOperations<DotScaledOp, EntryFuncOp, EntryFuncReturnOp, ExtractTileOp, InsertTileOp, MaskOp>(). Each has build/verify/print/parse and an xtile.<name> op-name string (xtile.entry_func, .return, .extract, .insert, .mask, .dot_scaled all recovered from the string pool). There is no xtile.dot op — see the GOTCHA below.
| op-name | C++ class | recovered build() signature | role |
|---|---|---|---|
| xtile.entry_func | xla::xtile::EntryFuncOp | (StringRef name, ArrayRef<Type> argTypes, ArrayRef<NamedAttribute>, ArrayRef<DictionaryAttr>) | tiled-fusion entry function (FunctionOpInterface) |
| xtile.return | xla::xtile::EntryFuncReturnOp | create(OpBuilder&, Location); terminator, no operands | entry_func terminator |
| xtile.extract | xla::xtile::ExtractTileOp | (Type result, Value source, ValueRange offsets, ArrayRef<long> staticOffsets, ArrayRef<long> staticSizes) | read a tile out of a tiled buffer/tensor (BufferizableOpInterface, TiledBufferInterface) |
| xtile.insert | xla::xtile::InsertTileOp | (Value tile, Value dest, ValueRange offsets, ArrayRef<long> staticOffsets, ArrayRef<long> staticSizes) | write a tile back into a tiled buffer/tensor (BufferizableOpInterface, TiledBufferInterface) |
| xtile.mask | xla::xtile::MaskOp | (Value source, ArrayRef<long> maskedDims, Value padValue) | predicate/pad partial boundary tiles; has fold() + inferReturnTypes |
| xtile.dot_scaled | xla::xtile::DotScaledOp | (Type, Value lhs, Value rhs, Value lhsScale, Value rhsScale, bool, bool, bool) | block-scaled (MXFP) tile dot; custom print/parse |
NOTE: xtile.extract / xtile.insert use the canonical mixed static/dynamic slice form. Their operands are static-offset + dynamic-offset (ValueRange) + static-size, exactly the shape of tensor.extract_slice / tensor.insert_slice. That is precisely why they bufferize cleanly to memref.subview-style reads and in-place tile stores under One-Shot Bufferize.
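The mixed static/dynamic form can be illustrated with a small model. In upstream MLIR, tensor.extract_slice stores one static-offset entry per dimension and marks runtime-provided entries with the ShapedType::kDynamic sentinel (the minimum int64), consuming the dynamic operands in order. The sketch below assumes xtile.extract follows the same convention, which this page does not confirm; `resolve_offsets` and `extract_tile` are hypothetical names, and the tile is modeled as a 2-D Python list.

```python
# Sentinel for "offset supplied at runtime"; same value as MLIR's
# ShapedType::kDynamic. Its use by xtile.extract is an assumption.
K_DYNAMIC = -(2 ** 63)


def resolve_offsets(static_offsets, dynamic_offsets):
    """Merge static entries with dynamic operands, consumed in order."""
    resolved, next_dynamic = [], 0
    for offset in static_offsets:
        if offset == K_DYNAMIC:
            if next_dynamic >= len(dynamic_offsets):
                raise ValueError("too few dynamic offsets")
            resolved.append(dynamic_offsets[next_dynamic])
            next_dynamic += 1
        else:
            resolved.append(offset)
    if next_dynamic != len(dynamic_offsets):
        raise ValueError("unused dynamic offsets")
    return resolved


def extract_tile(source, static_offsets, dynamic_offsets, static_sizes):
    """Read a rows x cols tile out of a 2-D list of lists."""
    row0, col0 = resolve_offsets(static_offsets, dynamic_offsets)
    rows, cols = static_sizes
    return [row[col0:col0 + cols] for row in source[row0:row0 + rows]]


if __name__ == "__main__":
    source = [[r * 4 + c for c in range(4)] for r in range(4)]
    # Row offset is dynamic (1), column offset is static (2), tile is 2x2.
    print(extract_tile(source, [K_DYNAMIC, 2], [1], [2, 2]))
```

Because sizes are always static in this form, the tile type is fully known at compile time, which is what lets a bufferization pass turn the read into a fixed-shape subview.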
GOTCHA: there is NO plain xtile.dot op. addOperations<> registers only the six ops above; no xla::xtile::DotOp class, no xtile.dot op-name string, and no DotOp::build/print/parse symbols exist in the binary. A plain (un-scaled) tile dot is not an XTile op; it is emitted directly into linalg/vector by the dot-algorithm emitter EmitSingleTileDot (0x14c277a0, dot_algorithms.cc). Only the block-scaled DotScaledOp is a first-class op. Per-algorithm dot instruction sequences (f32×f32, bf16×bf16→f32, tf32, x3/x6 high-precision) live in dot_algorithms.cc and are not enumerated here; confidence on the exact per-algorithm bodies is LOW.
XTile attributes and interfaces
| symbol | encodes |
|---|---|
| xla::xtile::LayoutAttr (xtile.layout) | minor-to-major layout: minor_to_major : DenseI64ArrayAttr (parse-error string "failed to parse XTile_LayoutAttr parameter 'minor_to_major' …") |
| xla::xtile::TilingInfoAttr (xtile.tiling_info) | tile_count : int32_t, tiles_per_workgroup : int32_t |
| xla::xtile::TileInfo | per-op tile descriptor consumed by EmitParameterExtract |
| xla::xtile::TiledBufferInterface | op interface implemented by ExtractTileOp + InsertTileOp |
| xla::xtile::DotOperands / ScaledDotOperands / DotOperandSide | helper structs for dot operand routing |
The tiles_per_workgroup field, together with the CPU-thunk workgroup strings (NumWorkGroups{%d, %d, %d}, XLA_CPU_NumWorkGroups, xla.cpu.KernelThunkProto), confirms the CPU/GPU workgroup-parallel model: a dimension is split into tile_count tiles, and tiles_per_workgroup are assigned to each parallel workgroup. This is a CPU/GPU codegen concept; the TPU side uses sublane/lane vreg tiling instead (Mosaic Layout Inference).
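The arithmetic behind the two TilingInfoAttr fields can be sketched as follows. Only the field names tile_count and tiles_per_workgroup come from the binary; the assumption that each workgroup takes a contiguous run of tiles, with a shorter final run when the division is not exact, is an illustration and not a recovered scheduling policy. `workgroup_tiles` is a hypothetical helper.

```python
def workgroup_tiles(tile_count, tiles_per_workgroup):
    """Return, per workgroup, the list of tile indices it processes.

    Assumes contiguous assignment: workgroup w owns tiles
    [w * tiles_per_workgroup, (w + 1) * tiles_per_workgroup), clipped
    to tile_count.
    """
    if tile_count < 0 or tiles_per_workgroup <= 0:
        raise ValueError("tile_count must be >= 0 and tiles_per_workgroup > 0")
    # Ceiling division: a partial last group still needs a workgroup.
    num_workgroups = -(-tile_count // tiles_per_workgroup)
    return [
        list(range(w * tiles_per_workgroup,
                   min((w + 1) * tiles_per_workgroup, tile_count)))
        for w in range(num_workgroups)
    ]


if __name__ == "__main__":
    # 10 tiles, 4 per workgroup: two full workgroups and one partial.
    for index, tiles in enumerate(workgroup_tiles(10, 4)):
        print(index, tiles)
```

With 10 tiles and 4 per workgroup this yields three workgroups, the last holding only tiles 8 and 9.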
Per-Op MHLO/StableHLO → XTile Mapping
XTile lowering splits cleanly: scalar/elementwise ops go through a verbatim StableHLO→arith table (StablehloLowerToArithPass), while shape-level ops go through the Emit* fusion-emitter helpers (StablehloLowerToXtilePass, fusion_emitter.cc/emitter_helpers.cc).
The StableHLO → arith table (verbatim)
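Each row is a template instantiation recovered from the demangled name LowerStableHloOpToArith<Src, FloatDst, SignedIntDst[, UnsignedIntDst]>. Float vs int dispatch is selected at match time from the operand element type. The source column abbreviates some ops after their C++ op classes: stablehlo.mul, .div, .rem, .max, and .min denote MulOp, DivOp, RemOp, MaxOp, and MinOp, whose textual mnemonics are stablehlo.multiply, .divide, .remainder, .maximum, and .minimum.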
| StableHLO src | float dst | signed-int dst | unsigned-int dst |
|---|---|---|---|
| stablehlo.add | arith.addf | arith.addi | arith.addi |
| stablehlo.subtract | arith.subf | arith.subi | arith.subi |
| stablehlo.mul | arith.mulf | arith.muli | arith.muli |
| stablehlo.div | arith.divf | arith.divsi | arith.divui |
| stablehlo.rem | arith.remf | arith.remsi | arith.remui |
| stablehlo.max | arith.maximumf | arith.maxsi | arith.maxui |
| stablehlo.min | arith.minimumf | arith.minsi | arith.minui |
| stablehlo.and | arith.andi | arith.andi | arith.andi |
| stablehlo.or | arith.ori | arith.ori | arith.ori |
| stablehlo.xor | arith.xori | arith.xori | arith.xori |
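The table can be read as a three-way dispatch on element type. The sketch below encodes the ten rows as data, keyed by the source names exactly as they appear in the table, and picks a destination op from an element-type string. The type classification (names starting with f, bf, or tf are floats; ui is unsigned; i or si is signed) is an assumption about how such strings would be spelled, not recovered matcher logic, and `lower_binary` is a hypothetical name.

```python
# (float dst, signed-int dst, unsigned-int dst), one entry per table row.
ARITH_TABLE = {
    "stablehlo.add":      ("arith.addf",     "arith.addi",  "arith.addi"),
    "stablehlo.subtract": ("arith.subf",     "arith.subi",  "arith.subi"),
    "stablehlo.mul":      ("arith.mulf",     "arith.muli",  "arith.muli"),
    "stablehlo.div":      ("arith.divf",     "arith.divsi", "arith.divui"),
    "stablehlo.rem":      ("arith.remf",     "arith.remsi", "arith.remui"),
    "stablehlo.max":      ("arith.maximumf", "arith.maxsi", "arith.maxui"),
    "stablehlo.min":      ("arith.minimumf", "arith.minsi", "arith.minui"),
    "stablehlo.and":      ("arith.andi",     "arith.andi",  "arith.andi"),
    "stablehlo.or":       ("arith.ori",      "arith.ori",   "arith.ori"),
    "stablehlo.xor":      ("arith.xori",     "arith.xori",  "arith.xori"),
}


def lower_binary(op_name, element_type):
    """Pick the arith op for a binary StableHLO op and an element type."""
    float_dst, signed_dst, unsigned_dst = ARITH_TABLE[op_name]
    if element_type.startswith(("f", "bf", "tf")):
        return float_dst
    if element_type.startswith("ui"):
        return unsigned_dst
    if element_type.startswith(("i", "si")):
        return signed_dst
    raise ValueError("unsupported element type: " + element_type)


if __name__ == "__main__":
    for element_type in ("f32", "i32", "ui8"):
        print(element_type, lower_binary("stablehlo.div", element_type))
```

Only div, rem, max, and min actually distinguish signed from unsigned; for the other rows the two integer columns are identical, so the unsigned branch matters for four ops.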
Unary ops route through LowerStableHloUnaryOpToMath<Src, MathDst> (recovered instantiation: stablehlo.round_nearest_even → math.roundeven). Additional dedicated pattern classes in the same anonymous namespace:
- LowerCompareOp: stablehlo.compare → arith.cmpf / arith.cmpi (predicate by comparison direction + type).
- LowerConvertOp: stablehlo.convert → arith.ext*/trunc*/sitofp/uitofp/fptosi/fptoui (type-pair table; class confirmed, the full per-pair mapping is inferred; MEDIUM confidence).
- ConstantConversionPattern: stablehlo.constant → arith.constant (DenseElementsAttr passthrough).
- ElementwiseConverter: the generic elementwise dispatcher.
The shape-level emit helpers
These (in xla::xtile:: / its anonymous namespace) build the tile-op structure that the arith ops fill in:
| helper | text addr | builds |
|---|---|---|
| EmitXTileModule | 0x14c1c9e0 | top level: the whole xtile.entry_func module for a fusion |
| EmitScope | 0x15066ec0 | walks an HLO instruction span, emits XTile/arith into a value map |
| EmitParameterExtract | 0x15066e00 | xtile.extract of a parameter tile given a TileInfo |
| EmitConstant | 0x15064da0 | arith.constant/splat for HLO constants |
| EmitElementwise | 0x150630a0 | elementwise HLO → arith/math over extracted tiles |
| EmitSingleTileDot | 0x14c277a0 | one tile of a dot → the dot algorithm (dot_algorithms.cc) |
| EmitSingleTileScaledDot | 0x14c289c0 | one tile of a block-scaled dot → xtile.dot_scaled |
| EmitTiledReshape | 0x150694c0 | reshape over tiled tensors |
| EmitTiledTranspose | 0x15069860 | transpose over tiled tensors (SmallVector<long,6> permutation) |
| EmitTiledComputation / EmitTiledInstructionList / EmitTiledBitcast / EmitReduceComputation / EmitNestedFusion | (multiple syms) | whole computation / per-instruction / bitcast / reduce body / nested-fusion inlining |
EmitXTileModule signature (recovered):
EmitXTileModule(std::string_view name,
HloFusionInstruction const*,
SymbolicTileAnalysis const&,
Tiling const&,
MLIRContext&,
absl::Span<Type>,
std::optional<stream_executor::GpuComputeCapability> const&);
The XTile module is emitted directly from an HLO fusion using the symbolic tile analysis + the chosen tiling, gated by GPU compute capability (CPU passes nullopt). Log anchor at entry: "Emitting XTile IR for fusion". Dot-emit canonicalization uses CanonicalizeDotOperand (0x150692a0), MaskDotOperand, GetDotLoopIterationCount, GetDotAccumulatorType, GetPaddedTileSizes.
Worked lowerings
// (a) elementwise binary — stablehlo.add of two tiled f32 tensors
// %c = stablehlo.add %a, %b : tensor<256x256xf32>
// after tiling + StablehloLowerToArith over one tile:
%ta = xtile.extract %A[%i, %j] [/*static offsets*/] [/*static sizes*/] : tensor<TMxTNxf32>
%tb = xtile.extract %B[%i, %j] [...] [...] : tensor<TMxTNxf32>
%tc = arith.addf %ta, %tb : tensor<TMxTNxf32> // float vs int per the arith table
xtile.insert %tc into %C[%i, %j] [...] [...]
// (b) dot (matmul), one tile — EmitSingleTileDot (dot_algorithms.cc)
// stablehlo.dot_general %lhs, %rhs {contracting = ...}
// per output tile: extract lhs/rhs tiles; loop GetDotLoopIterationCount over the
// contraction tiling; accumulate in GetDotAccumulatorType (f32 acc for bf16/f8 inputs);
// operands canonicalized by CanonicalizeDotOperand, boundary-masked by MaskDotOperand;
// result xtile.insert into the output buffer.
// (c) block-scaled dot (MXFP) — EmitSingleTileScaledDot
// a block_scaled_dot custom-call (scale type f8E8M0FNU):
%d = xtile.dot_scaled %lhsTile, %rhsTile, %lhsScale, %rhsScale
{transpose_lhs, transpose_rhs, acc} // three bool flags
// verifier strings: "expect scale operands dimension 2 to equal C/block_size",
// "block sizes for cast_from_block_scaled and cast_to_block_scaled must match"
// (d) transpose — EmitTiledTranspose(b, permutation, tiledValue)
// stablehlo.transpose {permutation} → tile-level transpose over the tiled
// RankedTensorType using the SmallVector<long,6> permutation.
// (e) reshape — EmitTiledReshape(b, newShape, tiledValue)
// stablehlo.reshape → tile-level reshape; collapse/expand of unit dims via CollapseUnitDims.
// (f) constant — EmitConstant → arith.constant (DenseElementsAttr); 0-D constants are
// later scalarized by ConvertElementwise0DTensorToScalarPass.
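Worked lowering (b) is described only in prose, so a reference model of the loop structure may help. The sketch below is not the recovered EmitSingleTileDot body: it assumes a plain three-level tile loop, ceiling division for the contraction iteration count (the role GetDotLoopIterationCount appears to play), zero padding for boundary tiles (standing in for MaskDotOperand), and a Python float accumulator per output tile. All function names are hypothetical.

```python
def ceil_div(a, b):
    """Number of size-b tiles needed to cover a elements."""
    return -(-a // b)


def masked_tile(matrix, row0, col0, rows, cols, pad=0.0):
    """Read a rows x cols tile; out-of-bounds elements read as `pad`."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [[matrix[r][c] if r < n_rows and c < n_cols else pad
             for c in range(col0, col0 + cols)]
            for r in range(row0, row0 + rows)]


def tiled_dot(lhs, rhs, tile_m, tile_n, tile_k):
    """Compute lhs @ rhs one output tile at a time."""
    m, k, n = len(lhs), len(lhs[0]), len(rhs[0])
    out = [[0.0] * n for _ in range(m)]
    k_iterations = ceil_div(k, tile_k)
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            # One accumulator tile per output tile, kept in float.
            acc = [[0.0] * tile_n for _ in range(tile_m)]
            for step in range(k_iterations):
                a = masked_tile(lhs, i0, step * tile_k, tile_m, tile_k)
                b = masked_tile(rhs, step * tile_k, j0, tile_k, tile_n)
                for i in range(tile_m):
                    for j in range(tile_n):
                        acc[i][j] += sum(a[i][p] * b[p][j]
                                         for p in range(tile_k))
            # Write back only the in-bounds part of the accumulator.
            for i in range(min(tile_m, m - i0)):
                for j in range(min(tile_n, n - j0)):
                    out[i0 + i][j0 + j] = acc[i][j]
    return out


if __name__ == "__main__":
    lhs = [[1, 2, 3], [4, 5, 6]]
    rhs = [[1, 0], [0, 1], [1, 1]]
    print(tiled_dot(lhs, rhs, 2, 2, 2))
```

Padding with zero is what makes the boundary tiles safe to accumulate: out-of-range products contribute nothing, so the contraction loop can run a whole number of tiles even when the contraction size is not a multiple of the tile size.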
NOTE: broadcast / iota / concatenate. BroadcastInDims, Splat, and ConstLike helpers handle broadcast; iota and concatenate go through EmitGeneric / CheckConcatenateOperands. The upstream StableHLO→Linalg IotaConverter / ConcatenateConverter classes also ship in the binary, but they belong to the GPU/CPU stablehlo-to-linalg path, not the XTile pass; do not conflate them.
Tiling and Type Conversion
XTile consumes a tiling; it does not choose one
XTile does not decide tiling. The decision is made earlier by xla::SymbolicTileAnalysis:
- AnalyzeFusion (0x14c35a40) / AnalyzeComputation (0x14c359a0) propagate an IndexingMap through the fusion.
- GetValidTilings (0x14c40280) enumerates valid Tilings; ParametersSatisfyConstraints (0x14c3d040) + tiled_emitter_constraints.cc filter them.
- ComputeOutputTilingInfo (0x14c46960) / ComputeTiledInstructions (0x14c47f80) produce the per-instruction TileInfo.
The chosen Tiling + GpuComputeCapability are passed into EmitXTileModule, which bakes TilingInfoAttr{tile_count, tiles_per_workgroup} and LayoutAttr{minor_to_major} onto the entry function and tiles. Failure anchor: "cannot compute parametric tile sizes for dynamically shaped payload op". This is the GPU/CPU symbolic-tiling search; the TPU analogue (window-config search, Mosaic sublane/lane tiling) is entirely separate code.
Two type conversions
- tensor<f32> → scalar f32 (ConvertElementwise0DTensorToScalarPass): an mlir::TypeConverter full conversion carrying addConversion(Type) (identity), addConversion(RankedTensorType) (rank-0 → element type), addSourceMaterialization (scalar → tensor re-wrap), addTargetMaterialization (tensor → scalar unwrap), plus addDynamicallyLegalOp<arith::ConstantOp> so 0-D constants stay legal until the materializers fire.
- tensor → memref (xtile-cpu-bufferization): MLIR One-Shot Bufferize via the BufferizableOpInterface that ExtractTileOp / InsertTileOp implement: ExtractTileOp::bufferizesToMemoryRead = true on the source and emits a memref.subview-style read; InsertTileOp::bufferizesToMemoryWrite = true on the dest and emits an in-place tile store; bufferizesToAllocation governs scratch allocation. xtile-cpu-tensor-ops-to-bufferizable first normalizes stray tensor.* ops; xtile-cpu-memref-copy-to-loops then lowers copies to scf loops. Entry-function memrefs must use a tiled layout (anchor: "All memref arguments should use the TiledLayoutAttr for layout"); alignment is checked with "Cannot confirm that the memref is memory tile-aligned …".
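The first conversion amounts to two rules: rank-0 tensors map to their element type, and every other type maps to itself. The sketch below models those two rules on MLIR type strings. It is an illustration of the rule pair only; the real pass works on mlir::Type objects through a TypeConverter, and the regular expression, which accepts only simple element types such as f32 or i1, is an assumption of this sketch. `convert_type` is a hypothetical name.

```python
import re

# Matches "tensor<ELEM>" where ELEM has no shape prefix, i.e. rank 0.
# Shaped tensors such as "tensor<4xf32>" start with a digit or "?" and
# therefore do not match.
_RANK0_TENSOR = re.compile(r"^tensor<([A-Za-z][A-Za-z0-9]*)>$")


def convert_type(type_str):
    """Rank-0 tensor -> element type; anything else is left unchanged."""
    match = _RANK0_TENSOR.match(type_str)
    return match.group(1) if match else type_str


if __name__ == "__main__":
    for t in ("tensor<f32>", "tensor<4x4xf32>", "tensor<?xf32>", "f32"):
        print(t, "->", convert_type(t))
```

The identity rule is what keeps tiled tensors untouched: only values that were scalars wrapped in a rank-0 tensor are rewritten, and the materializations re-wrap or unwrap at the boundary where a converted value meets an unconverted use.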
Failing / Unsupported Ops at the XTile Boundary
xtile-verify-legal-ops and the two lower-to-* full conversions reject anything outside the legal set. Recovered failure anchors:
| anchor string | trigger |
|---|---|
| "Could not legalize op: " | an op with no lower-to-xtile/arith pattern survived |
| "cannot compute parametric tile sizes for dynamically shaped payload op" | dynamic-shape payload with no static tiling |
| "Boundaries of the clamp are not legal: " | xtile.mask/clamp bound out of range |
| "All tiles must have the same rank." | mixed-rank tiles into a tile op |
| "At least one tile shape must be specified." | tiling info absent |
| "All tiled squeezed dimensions must be of size 1." | bad squeeze on a tiled dim |
| "All memref arguments should use the TiledLayoutAttr for layout" | entry-func memref lacks tiled layout |
| "Cannot confirm that the memref is memory tile-aligned …" | alignment check could not prove producer is tile-aligned |
| "expect scale operands dimension 2 to equal C/block_size " | xtile.dot_scaled scale operand shape mismatch |
| "Allow XLA's MHLO ops not in StableHLO to remain present after legalization (copy, add_dependency, fusion, etc.)" | flag: MHLO-only ops tolerated; if not lowered they hit "Could not legalize op:" later |
The "MHLO not in StableHLO" flag confirms the input is StableHLO produced from MHLO, with a small allow-list of MHLO-only ops (copy, add_dependency, fusion) permitted to persist into the XTile lowering.
Where the tpu Dialect Actually Comes From
Because there is no MHLO→tpu pass, a reimplementer needs to know the real origin of the tpu dialect on the TPU path — it is imported, not lowered. The seam:
- A Pallas @pl.kernel (or hand-written Mosaic kernel) is compiled by the JAX frontend, outside libtpu, into a serialized tpu-dialect MLIR module.
- That module is embedded in HloOpcode::kCustomCall with custom_call_target = "tpu_custom_call", inside the mlir_module field of the backend config.
- At HLO→LLO emit time, xla::jellyfish::mlir_utils::GetMlirModuleOpFromCustomCall (0x13e327a0) extracts and caches the mlir::ModuleOp (CityHash128-keyed, one parse per unique kernel).
- The imported module runs through the tpu pass pipeline RunMLIRPasses (0x111fefa0): MosaicSerdePass (version upgrade) → memref/vector-layout passes → createLowerToLLOPass (tpu → LLO) → LLO.
The tpu dialect is therefore authored upstream and imported, never produced by lowering general MHLO. See Mosaic Overview for the import/serde/pipeline detail and The tpu MLIR Dialect for the dialect itself.
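The "one parse per unique kernel" behavior of the import seam is a content-addressed cache, which can be sketched as follows. The binary keys the cache with CityHash128; the Python standard library has no CityHash, so this sketch substitutes a 16-byte BLAKE2b digest as a stand-in of the same width. `ModuleCache` and the `parse` callback are hypothetical, and the parsed module is modeled as whatever object the callback returns.

```python
import hashlib


class ModuleCache:
    """Parse each distinct serialized module once and reuse the result."""

    def __init__(self, parse):
        self._parse = parse      # callable: bytes -> parsed module
        self._modules = {}       # 16-byte digest -> parsed module
        self.parse_count = 0

    def get(self, serialized):
        # 128-bit content key. The real code uses CityHash128; BLAKE2b
        # with a 16-byte digest is a stand-in for this sketch.
        key = hashlib.blake2b(serialized, digest_size=16).digest()
        if key not in self._modules:
            self._modules[key] = self._parse(serialized)
            self.parse_count += 1
        return self._modules[key]


if __name__ == "__main__":
    cache = ModuleCache(parse=lambda data: data.decode("utf-8"))
    first = cache.get(b"module @kernel {}")
    second = cache.get(b"module @kernel {}")   # same bytes: cache hit
    cache.get(b"module @other {}")             # new bytes: second parse
    print(first is second, cache.parse_count)
```

Keying on the serialized bytes rather than on the custom-call instruction means that a kernel invoked from many call sites is parsed once, which is the property the list above attributes to GetMlirModuleOpFromCustomCall.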
GOTCHA: do not build an MHLO→tpu legalizer; it does not exist. The convergence point the overview draws at Level 2 (tpu dialect) is fed by exactly one producer: the Mosaic custom-call import (the Pallas/Mosaic branch of Tree A). The general path contributes nothing, because general HLO bypasses tpu entirely and is emitted to LLO directly by jellyfish *Emitter classes (the main branch of Tree A). XTile (Tree B) feeds neither; it targets CPU/LLVM. A reimplementation that allocates effort to an MHLO/StableHLO→tpu conversion pattern set is building a stage that the production compiler does not contain.
Confidence Summary
| Claim | Evidence |
|---|---|
| Four XTile front-end passes with the listed CLI names/create-functions | StablehloLowerToXtilePass 0x150602c0, StablehloLowerToArithPass 0x1505a880, ConvertElementwise0DTensorToScalarPass 0x15059440, VerifyLegalXTileOpsPass 0x15062300; CLI strings recovered |
| Six registered XTile ops with the listed classes | addOperations<DotScaledOp, EntryFuncOp, EntryFuncReturnOp, ExtractTileOp, InsertTileOp, MaskOp> in XTileDialect::initialize; op-name strings xtile.entry_func/return/extract/insert/mask/dot_scaled |
| StableHLO→arith table (10 binary-op rows) | LowerStableHloOpToArith<…> template instantiations recovered from demangled names |
| Emit* fusion-emitter helper map + EmitXTileModule signature | EmitXTileModule 0x14c1c9e0, EmitSingleTileDot 0x14c277a0, EmitTiledTranspose 0x15069860, … ; log anchor "Emitting XTile IR for fusion" |
| XTile attributes xtile.layout/xtile.tiling_info and the workgroup model | XTile_LayoutAttr/minor_to_major and XTile_TilingInfoAttr/tile_count/tiles_per_workgroup parse-error strings; CPU-thunk NumWorkGroups{%d, %d, %d}/XLA_CPU_NumWorkGroups/xla.cpu.KernelThunkProto strings |
| tensor→memref via One-Shot Bufferize on BufferizableOpInterface | ExtractTileOp/InsertTileOp implement BufferizableOpInterface + TiledBufferInterface; xtile-cpu-bufferization CLI string |
| Failure anchors at the XTile boundary | all 10 anchor strings recovered in the decompiled output |
| XTile is NOT on the TPU MXU path | XTile dep-set = XlaCpuDialect/memref/vector/LLVM (no tpu/llo); xtile-cpu-* pipeline; codegen/xtile/ source paths |
| No MHLO/HLO→tpu conversion pass exists | grep of functions table: 0 hits for *ToTpuPass/MhloToTpu/StablehloToTpu/LegalizeToTpu |
| tpu dialect is imported only, via tpu_custom_call | GetMlirModuleOpFromCustomCall 0x13e327a0, RunMLIRPasses 0x111fefa0, MosaicSerdePass, tpu_custom_call all present |
| Plain tile dot is emitted to linalg/vector, not an xtile.dot op | no DotOp class / xtile.dot string in binary; EmitSingleTileDot/dot_algorithms.cc bodies not decompiled per-algorithm |
| Full LowerConvertOp ext/trunc/itofp/fptoi type-pair table | LowerConvertOp class present (119 refs); per-pair mapping inferred, not enumerated |
Cross-References
- The TPU Compiler (overview): the five-phase spine and IR-layer stack; this page refines its Level 2 ("XTile between MHLO and tpu") into the two-tree split.
- Compile Phases 0–3: the device compiler's phase pipeline, into which the xtile-cpu-* pipeline is not wired.
- The tpu MLIR Dialect: the tpu target dialect that XTile does not produce; the convergence point of the optimizer and Mosaic paths on the TPU side.
- tpu → LLO Lowering: the createLowerToLLOPass descent that the imported tpu module runs through (the real path, unlike XTile's CPU/LLVM target).
- DialectConversion Legalizer: the MLIR dialect-conversion machinery underlying both the XTile full conversions and the tpu legalizers.
- Mosaic Overview: the Pallas/Mosaic import seam (GetMlirModuleOpFromCustomCall / RunMLIRPasses) that is the actual origin of the tpu dialect.
- Mosaic Layout Inference: the TPU sublane/lane vreg tiling that is the analogue of XTile's CPU/GPU workgroup tiling.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part V, Compiler: Lowering & Optimization Passes / MLIR lowering chain