MHLO → XTile → tpu Lowering

All addresses, symbols, and op-name strings on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00). Other versions will differ.

Abstract

This page documents the MHLO/StableHLO → XTile → tpu lowering as it actually exists in the binary — and the most important fact about it is structural and corrective. The phrase "MHLO → XTile → tpu" describes two different lowering trees that share libtpu.so but never connect:

  • MHLO/StableHLO → XTile is real. XTile (xla::xtile::XTileDialect) is a tiled-tensor MLIR dialect with six registered ops, a four-pass front-end lowering, and a verbatim StableHLO→arith pattern table. It lives under third_party/.../xla/codegen/xtile/ and lowers to the CPU/LLVM stack via a xtile-cpu-* pipeline.
  • XTile → tpu is not real. No pass produces the tpu dialect from XTile. The binary contains zero *ToTpuPass / MhloToTpu / StablehloToTpu conversion-pass functions. XTile's dependent-dialect set is the CPU/LLVM set (xla::cpu::XlaCpuDialect, memref, vector, LLVM) — tpu, llo, and sparse_core appear nowhere in any XTile pass.

The tpu dialect that the compiler overview names as Level 2 is therefore not produced by lowering general MHLO. On the TPU device path, general HLO is emitted straight to LLO by ~3225 xla::jellyfish::*Emitter classes; the tpu dialect is only ever imported: it is authored by the Pallas/Mosaic frontend, embedded in an HLO kCustomCall("tpu_custom_call"), and extracted by xla::jellyfish::mlir_utils::GetMlirModuleOpFromCustomCall. This page documents Tree B (XTile, in full) and pins down where Tree A's tpu dialect actually originates. The "XTile sits between MHLO and tpu" reading conflates two disconnected trees: the symbol, string, and dependent-dialect evidence in this binary shows no XTile→tpu link (see The Two-Tree Picture).

For reimplementation, the contract this page fixes is:

  • The two trees and their boundary. Tree A (TPU device: HLO → jellyfish *Emitter → LLO; tpu only via Mosaic custom-call import) versus Tree B (XLA CPU/GPU codegen: StableHLO/MHLO → XTile → memref/vector → LLVM). Why "XTile → tpu" has no pass.
  • The four XTile front-end passes (stablehlo-lower-to-xtile, stablehlo-lower-to-arith, convert-elementwise-0d-tensor-to-scalar, xtile-verify-legal-ops): their CLI names, create-functions, dependent dialects, and ordering inside the xtile-cpu-* pipeline.
  • The six registered XTile ops with recovered build() signatures, plus the two attributes (xtile.layout, xtile.tiling_info) and TiledBufferInterface.
  • The per-op MHLO/StableHLO → XTile mapping — the verbatim StableHLO→arith template table, the Emit* fusion-emitter helpers, and worked lowerings for elementwise, dot, block-scaled dot, transpose, reshape, and constant.
  • Where the tpu dialect actually comes from — the Mosaic import seam (GetMlirModuleOpFromCustomCall / RunMLIRPasses), so a reimplementer does not waste effort building an MHLO→tpu legalizer that does not exist.

| Field | Value |
| --- | --- |
| XTile dialect | xla::xtile::XTileDialect (initialize @ 0x1507ec20); source third_party/.../xla/codegen/xtile/ir/xtile_ops.cc |
| Front-end passes | StablehloLowerToXtilePass 0x150602c0 · StablehloLowerToArithPass 0x1505a880 · ConvertElementwise0DTensorToScalarPass 0x15059440 · VerifyLegalXTileOpsPass 0x15062300 |
| Terminal CPU pass | xla::cpu::LowerXTileEntryPass (xtile-cpu-lower-xtile-entry; create @ 0x14d905c0) |
| Top-level emitter | xla::xtile::EmitXTileModule @ 0x14c1c9e0 (HLO fusion → xtile.entry_func module) |
| Registered ops (6) | entry_func, return, extract, insert, mask, dot_scaled (from XTileDialect::initialize's addOperations<>) |
| Attributes (2) | xtile.layout (LayoutAttr) · xtile.tiling_info (TilingInfoAttr) |
| XTile lowering target | CPU/LLVM (memref/vector/scf/LLVM), not tpu or llo |
| tpu-dialect origin | imported only: GetMlirModuleOpFromCustomCall @ 0x13e327a0; pipeline RunMLIRPasses @ 0x111fefa0 |
| MHLO→tpu passes in binary | none (0 *ToTpuPass/MhloToTpu/StablehloToTpu symbols) |
| Confidence | HIGH (symbol/string-anchored) unless a row or callout says otherwise |

The Two-Tree Picture

The single structural fact that governs this page: the string "MHLO → XTile → tpu" names a chain that does not exist as one chain. Two independent lowering trees are linked into libtpu.so, and the tpu-dialect link in the middle is absent.

TREE A — TPU device (the real product path)
┌────────────────────────────────────────────────────────────────────────┐
│  HLO  ──[97-row pre-passes]──►  HLO                                       │
│       ──[~3225 xla::jellyfish::*Emitter + LloRegionBuilder]──►  LLO       │
│       ──[bundle packer]──►  ISA bundles                                   │
│                                                                          │
│  PLUS, for JAX/Pallas kernels ONLY:                                      │
│  HLO kCustomCall("tpu_custom_call")  carrying a serialized `tpu` module   │
│       ──[GetMlirModuleOpFromCustomCall  0x13e327a0]──►  tpu-dialect IR    │
│       ──[RunMLIRPasses 16-stage pipeline]──►  LLO                         │
└────────────────────────────────────────────────────────────────────────┘
        the `tpu` dialect is authored upstream and imported — never
        produced by lowering MHLO

TREE B — XLA CPU/GPU codegen (bundled, off the TPU device path)
┌────────────────────────────────────────────────────────────────────────┐
│  StableHLO/MHLO  ──[stablehlo-lower-to-xtile]──►  XTile (+arith/math)     │
│       ──[stablehlo-lower-to-arith]──►  XTile + arith/math                 │
│       ──[convert-elementwise-0d-tensor-to-scalar]──►  scalarized          │
│       ──[xtile-verify-legal-ops]──►  (gate)                               │
│       ──[xtile-cpu-bufferization → memref]──►                            │
│       ──[xtile-cpu-*-to-vector / -to-loops]──►  vector/scf/memref         │
│       ──[xtile-cpu-lower-xtile-entry]──►  func/LLVM  ──►  CPU kernel       │
└────────────────────────────────────────────────────────────────────────┘
        XTile lowers to the LLVM/CPU stack, NEVER to `tpu`

So "MHLO → XTile" is Tree B and is fully real; "XTile → tpu" is the broken link. The evidence:

| Observation | Evidence in this binary |
| --- | --- |
| Every XTile transform pass is xtile-cpu-* | CLI strings recovered: xtile-cpu-bufferization, -fuse-elementwise, -linalg-elementwise-to-vector, -memref-copy-to-loops, -shlo-to-vector, -tensor-ops-to-bufferizable, -unpack-sub-byte-vector-write, -vector-to-scalar, -lower-xtile-entry |
| XTile lives in the XLA codegen tree, not jellyfish | source paths codegen/xtile/ir/xtile_ops.cc, codegen/xtile/ir/transforms/lower_stablehlo_to_xtile.cc, codegen/xtile/codegen/fusion_emitter.cc, …/dot_algorithms.cc, …/emitter_helpers.{cc,h}, …/tiled_emitter_constraints.cc |
| XTile's dependent-dialect set is CPU/LLVM, with no tpu | registry insert<cf, func, math, xla::cpu::XlaCpuDialect, mhlo, scf, LLVM, tensor, vector, xla::Xla, xla::xtile, stablehlo, linalg, memref, ub>; XlaCpuDialect confirmed present; no tpu/llo/sparse_core in any XTile pass |
| XTile ops bufferize through MLIR std One-Shot Bufferize | ExtractTileOp/InsertTileOp implement mlir::bufferization::BufferizableOpInterface; the tpu/llo ops never do (they go through jellyfish allocation) |
| The terminal pass is xla::cpu:: | LowerXTileEntryPass rewrites xtile.entry_func → func.func for the CPU/LLVM emitter |
| No MHLO→tpu pass exists | grep of the functions table: 0 hits for *ToTpuPass, *ToTPUDialect, MhloToTpu, HloToTpu, StablehloToTpu, LegalizeToTpu, LowerMhloToTpu |
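
The "0 hits" check in the last row can be repeated against any exported function-name list. The following is a minimal Python sketch, assuming the demangled names are available as plain strings (for example from a disassembler export). The helper name find_tpu_conversion_symbols and the sample list are hypothetical illustrations, not a dump of the binary.

```python
import fnmatch

# Patterns from the "No MHLO→tpu pass exists" row; matching is case-sensitive.
TPU_CONVERSION_PATTERNS = [
    "*ToTpuPass*",
    "*ToTPUDialect*",
    "*MhloToTpu*",
    "*HloToTpu*",
    "*StablehloToTpu*",
    "*LegalizeToTpu*",
    "*LowerMhloToTpu*",
]


def find_tpu_conversion_symbols(symbols):
    """Return every symbol that matches at least one conversion-pass pattern."""
    return [
        name
        for name in symbols
        if any(fnmatch.fnmatchcase(name, pat) for pat in TPU_CONVERSION_PATTERNS)
    ]


# Hypothetical sample of a function table, built from names quoted on this page.
sample_symbols = [
    "xla::xtile::StablehloLowerToXtilePass::runOnOperation",
    "xla::cpu::LowerXTileEntryPass::runOnOperation",
    "xla::jellyfish::mlir_utils::GetMlirModuleOpFromCustomCall",
]

hits = find_tpu_conversion_symbols(sample_symbols)
print(len(hits))
```

An empty result over the full function table is what the row above reports; a non-empty result on another libtpu version would be the first sign that the two-tree picture has changed.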

NOTE — "TLP" is not a dialect, and it is not XTile. "TLP" (TPU-Level Program) in the overview is the conceptual name for the HLO→MLIR import vehicle of Phase 2a. There is no tlp MLIR dialect and no xla::tlp::/mlir::tlp:: symbol. On the TPU path, the tpu (Mosaic) dialect is the physical realization of the "TPU-Level Program" container — but only for imported Pallas/Mosaic kernels. General MHLO is lowered straight to LLO by jellyfish *Emitter classes; it never becomes tpu ops. XTile is an unrelated CPU/GPU dialect that happens to ship in the same .so.


The Four XTile Front-End Passes

The MHLO/StableHLO → XTile front-end is four passes in xla::xtile::(anonymous namespace), each generated from the impl::*PassBase TableGen template. The first three rewrite the tiled fusion body; the fourth is a legality gate. All addresses are .text entry points of the create and runOnOperation functions.

| Pass class | create() | runOnOperation | CLI argument | role |
| --- | --- | --- | --- | --- |
| StablehloLowerToXtilePass | 0x150602c0 | 0x15060560 | stablehlo-lower-to-xtile | structural/tensor-shaped StableHLO → xtile.extract/insert/mask/dot_scaled tile ops + the Emit* tile-dot/transpose/reshape helpers |
| StablehloLowerToArithPass | 0x1505a880 | 0x1505aa60 | stablehlo-lower-to-arith | scalar/elementwise StableHLO → arith/math |
| ConvertElementwise0DTensorToScalarPass | 0x15059440 | 0x15059620 | convert-elementwise-0d-tensor-to-scalar | tensor<f32> (rank-0) → scalar f32 |
| VerifyLegalXTileOpsPass | 0x15062300 | 0x150624c0 | xtile-verify-legal-ops | gate: only XTile/arith/math/tensor/func legal |

The StablehloLowerToXtilePass description string is recovered verbatim: "Lowers stablehlo ops to Xtile."

Dependent dialects (from each *PassBase::getDependentDialects body):

  • StablehloLowerToXtilePass — registers arith, stablehlo, plus the shared XTile/tensor/math registry insert. Input = StableHLO (with a small allow-list of MHLO-only ops tolerated); output = XTile + arith + math + tensor.
  • StablehloLowerToArithPass — registers arith. Input = StableHLO scalar/elementwise ops; output = arith/math.
  • ConvertElementwise0DTensorToScalarPass — a mlir::TypeConverter-driven full conversion mapping rank-0 ranked tensors to scalars, with source+target materialization callbacks and addDynamicallyLegalOp<arith::ConstantOp> to keep 0-D constants legal until the materializers fire.
  • VerifyLegalXTileOpsPass — a verifier that walks the module (walk<Operation*>) and fails with "Could not legalize op: " if any op outside the legal XTile/arith/math/tensor/func set survives. This is the "must be gone before XTile codegen" gate.
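
The gate in the last bullet can be modelled in a few lines. This is a sketch only: it assumes ops are identified by their "dialect.name" strings and that the legal set is exactly the five dialects named above. The function names are hypothetical, and only the error prefix is taken from the binary.

```python
# Dialects the page lists as legal after the front-end passes.
LEGAL_DIALECTS = {"xtile", "arith", "math", "tensor", "func"}


class LegalizationError(Exception):
    """Raised when an op outside the legal set survives to the gate."""


def find_illegal_ops(op_names):
    """Return one diagnostic per op whose dialect prefix is not legal."""
    diagnostics = []
    for name in op_names:
        dialect = name.split(".", 1)[0]
        if dialect not in LEGAL_DIALECTS:
            diagnostics.append("Could not legalize op: " + name)
    return diagnostics


def verify_legal_ops(op_names):
    """Walk all ops and fail on the first illegal one, like a verifier pass."""
    diagnostics = find_illegal_ops(op_names)
    if diagnostics:
        raise LegalizationError(diagnostics[0])


# A fully lowered tile body passes; a leftover StableHLO op is reported.
print(find_illegal_ops(["xtile.extract", "arith.addf", "xtile.insert"]))
print(find_illegal_ops(["xtile.extract", "stablehlo.add"]))
```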

NOTE — lower-to-xtile and lower-to-arith run together over the tiled fusion body. StablehloLowerToXtilePass handles the shape-level ops (the ones that become tile reads/writes, tile dot, transpose, and reshape), while StablehloLowerToArithPass handles the scalar/elementwise ops over the already-extracted tiles. They are complementary halves of one legalization, not sequential phases that fully legalize independently.

Position inside the xtile-cpu-* pipeline

The four front-end passes are stages 1–4 of the full CPU codegen pipeline, in lowering order (all CLI names recovered verbatim from the string pool):

 1  stablehlo-lower-to-xtile                 shape ops → xtile.extract/insert/mask/dot_scaled + Emit* tile-dot/transpose/reshape helpers
 2  stablehlo-lower-to-arith                 scalar/elementwise → arith/math
 3  convert-elementwise-0d-tensor-to-scalar  tensor<f32> → f32
 4  xtile-verify-legal-ops                   gate (only XTile/arith/math/tensor/func legal)
 5  xtile-cpu-fuse-elementwise               fuse adjacent elementwise tile ops
 6  xtile-cpu-shlo-to-vector / -linalg-elementwise-to-vector   → vector dialect
 7  xtile-cpu-tensor-ops-to-bufferizable     normalize stray tensor.* ops
 8  xtile-cpu-bufferization                  tensor → memref (One-Shot Bufferize)
 9  xtile-cpu-memref-copy-to-loops           memref copies → scf.for
10  xtile-cpu-vector-to-scalar               residual vector → scalar
11  xtile-cpu-unpack-sub-byte-vector-write   sub-byte stores
12  xtile-cpu-lower-xtile-entry              xtile.entry_func → func.func (xla::cpu::LowerXTileEntryPass)
13  → LLVM dialect → CPU object code
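
The stage list above can be held as ordered data so that ordering assumptions are checked rather than remembered. This is a sketch: the pass names are the recovered CLI strings, but the relative order of the two stage-6 passes is an assumption, and the helper functions are hypothetical.

```python
# Stages 1-12 in lowering order. Stage 6 lists two passes on this page;
# their relative order here is an assumption.
XTILE_CPU_PIPELINE = [
    "stablehlo-lower-to-xtile",
    "stablehlo-lower-to-arith",
    "convert-elementwise-0d-tensor-to-scalar",
    "xtile-verify-legal-ops",
    "xtile-cpu-fuse-elementwise",
    "xtile-cpu-shlo-to-vector",
    "xtile-cpu-linalg-elementwise-to-vector",
    "xtile-cpu-tensor-ops-to-bufferizable",
    "xtile-cpu-bufferization",
    "xtile-cpu-memref-copy-to-loops",
    "xtile-cpu-vector-to-scalar",
    "xtile-cpu-unpack-sub-byte-vector-write",
    "xtile-cpu-lower-xtile-entry",
]


def runs_before(first, second, pipeline=XTILE_CPU_PIPELINE):
    """True if `first` is scheduled strictly before `second`."""
    return pipeline.index(first) < pipeline.index(second)


def gate_precedes_cpu_passes(pipeline=XTILE_CPU_PIPELINE):
    """True if the legality gate runs before every xtile-cpu-* pass."""
    gate = pipeline.index("xtile-verify-legal-ops")
    return all(
        index > gate
        for index, name in enumerate(pipeline)
        if name.startswith("xtile-cpu-")
    )


print(gate_precedes_cpu_passes())
print(runs_before("xtile-cpu-tensor-ops-to-bufferizable", "xtile-cpu-bufferization"))
```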

GOTCHA — this pipeline is NOT a row in the 97-row HLO pre-pass table. The HLO pre-passes (Compile Phases) belong to the TPU device compiler (DeepseaCompilerBase::RunHloPasses). The xtile-cpu-* pipeline is the XLA CPU backend's per-fusion codegen, invoked from EmitXTileModule after the CPU backend's own StableHLO/MHLO normalization. There is no row in the device pipeline that adds any xtile-* pass. A reimplementer searching the TPU phase list for an XTile stage will not find one; that is the concrete consequence of the two-tree split.


The Six Registered XTile Ops

XTileDialect::initialize (0x1507ec20) registers six ops via a single addOperations<DotScaledOp, EntryFuncOp, EntryFuncReturnOp, ExtractTileOp, InsertTileOp, MaskOp>(). Each has build/verify/print/parse and an xtile.<name> op-name string (xtile.entry_func, .return, .extract, .insert, .mask, .dot_scaled all recovered from the string pool). There is no xtile.dot op — see the GOTCHA below.

| op-name | C++ class | recovered build() signature | role |
| --- | --- | --- | --- |
| xtile.entry_func | xla::xtile::EntryFuncOp | (StringRef name, ArrayRef<Type> argTypes, ArrayRef<NamedAttribute>, ArrayRef<DictionaryAttr>) | tiled-fusion entry function (FunctionOpInterface) |
| xtile.return | xla::xtile::EntryFuncReturnOp | create(OpBuilder&, Location); terminator, no operands | entry_func terminator |
| xtile.extract | xla::xtile::ExtractTileOp | (Type result, Value source, ValueRange offsets, ArrayRef<long> staticOffsets, ArrayRef<long> staticSizes) | read a tile out of a tiled buffer/tensor (BufferizableOpInterface, TiledBufferInterface) |
| xtile.insert | xla::xtile::InsertTileOp | (Value tile, Value dest, ValueRange offsets, ArrayRef<long> staticOffsets, ArrayRef<long> staticSizes) | write a tile back into a tiled buffer/tensor (BufferizableOpInterface, TiledBufferInterface) |
| xtile.mask | xla::xtile::MaskOp | (Value source, ArrayRef<long> maskedDims, Value padValue) | predicate/pad partial boundary tiles; has fold() + inferReturnTypes |
| xtile.dot_scaled | xla::xtile::DotScaledOp | (Type, Value lhs, Value rhs, Value lhsScale, Value rhsScale, bool, bool, bool) | block-scaled (MXFP) tile dot; custom print/parse |

NOTE — xtile.extract/xtile.insert use the canonical mixed static/dynamic slice form. Their operands are static-offset + dynamic-offset (ValueRange) + static-size — exactly the shape of tensor.extract_slice/tensor.insert_slice. That is precisely why they bufferize cleanly to memref.subview-style reads and in-place tile stores under One-Shot Bufferize.
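
The mixed static/dynamic form in the note can be illustrated with a small resolver. This sketch borrows the upstream MLIR convention in which a static entry equal to ShapedType::kDynamic (INT64_MIN) is a placeholder filled from the dynamic operand list in order; that XTile uses the same sentinel is an assumption, and both helper functions are hypothetical.

```python
# Upstream MLIR's ShapedType::kDynamic sentinel. Assumed, not confirmed, for XTile.
K_DYNAMIC = -(2**63)


def resolve_offsets(static_offsets, dynamic_offsets):
    """Fill each dynamic placeholder with the next runtime offset value."""
    dynamic = list(dynamic_offsets)
    resolved = []
    next_dynamic = 0
    for entry in static_offsets:
        if entry == K_DYNAMIC:
            if next_dynamic >= len(dynamic):
                raise ValueError("too few dynamic offsets")
            resolved.append(dynamic[next_dynamic])
            next_dynamic += 1
        else:
            resolved.append(entry)
    if next_dynamic != len(dynamic):
        raise ValueError("too many dynamic offsets")
    return resolved


def extract_tile(source, offsets, sizes):
    """Read a 2-D tile out of a row-major nested list (static sizes only)."""
    (row0, col0), (rows, cols) = offsets, sizes
    return [row[col0:col0 + cols] for row in source[row0:row0 + rows]]


# 4x4 source whose element at (r, c) is r * 4 + c.
source = [[r * 4 + c for c in range(4)] for r in range(4)]
offsets = resolve_offsets([K_DYNAMIC, 2], [1])  # row offset is dynamic, column is static
tile = extract_tile(source, offsets, (2, 2))
print(offsets, tile)
```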

GOTCHA — there is NO plain xtile.dot op. addOperations<> registers only the six ops above; no xla::xtile::DotOp class, no xtile.dot op-name string, and no DotOp::build/print/parse symbols exist in the binary. A plain (un-scaled) tile dot is not an XTile op — it is emitted directly into linalg/vector by the dot-algorithm emitter EmitSingleTileDot (0x14c277a0, dot_algorithms.cc). Only the block-scaled DotScaledOp is a first-class op. Per-algorithm dot instruction sequences (f32×f32, bf16×bf16→f32, tf32, x3/x6 high-precision) live in dot_algorithms.cc and are not enumerated here — LOW confidence on the exact per-algorithm bodies.
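
For the one dot that is a first-class op, the block-scaled form, the arithmetic can be sketched for a single row of the contraction dimension. The sketch assumes the usual E8M0 interpretation of the f8E8M0FNU scale type (an unsigned biased power-of-two exponent with bias 127, and 0xFF reserved for NaN) and assumes one scale per block_size consecutive elements. The function names are hypothetical and nothing here reproduces DotScaledOp itself.

```python
import math


def e8m0_to_float(byte):
    """Decode one f8E8M0FNU scale byte: 2**(e - 127), with 0xFF meaning NaN."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("scale must fit in one byte")
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


def apply_block_scales(values, scale_bytes, block_size):
    """Scale a length-C row using one E8M0 scale per block of block_size elements."""
    c = len(values)
    if c % block_size != 0 or len(scale_bytes) != c // block_size:
        # Mirrors the intent of the "C/block_size" verifier message (assumed form).
        raise ValueError("scale count must equal C / block_size")
    return [
        value * e8m0_to_float(scale_bytes[index // block_size])
        for index, value in enumerate(values)
    ]


# Two blocks of two elements: scales 2**0 and 2**1.
scaled = apply_block_scales([1.0, 2.0, 3.0, 4.0], [127, 128], block_size=2)
print(scaled)
```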

XTile attributes and interfaces

| symbol | encodes |
| --- | --- |
| xla::xtile::LayoutAttr (xtile.layout) | minor-to-major layout: minor_to_major : DenseI64ArrayAttr (parse-error string "failed to parse XTile_LayoutAttr parameter 'minor_to_major' …") |
| xla::xtile::TilingInfoAttr (xtile.tiling_info) | tile_count : int32_t, tiles_per_workgroup : int32_t |
| xla::xtile::TileInfo | per-op tile descriptor consumed by EmitParameterExtract |
| xla::xtile::TiledBufferInterface | op interface implemented by ExtractTileOp + InsertTileOp |
| xla::xtile::DotOperands / ScaledDotOperands / DotOperandSide | helper structs for dot operand routing |

The tiles_per_workgroup field, together with the CPU-thunk workgroup strings (NumWorkGroups{%d, %d, %d}, XLA_CPU_NumWorkGroups, xla.cpu.KernelThunkProto), confirms the CPU/GPU workgroup-parallel model: a dimension is split into tile_count tiles, and tiles_per_workgroup are assigned to each parallel workgroup. This is a CPU/GPU codegen concept; the TPU side uses sublane/lane vreg tiling instead (Mosaic Layout Inference).
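
The workgroup model described above reduces to a ceiling division. In this minimal sketch the contiguous assignment of tiles to workgroups is an assumption (the binary only confirms the two field names), and both helpers are hypothetical.

```python
def num_workgroups(tile_count, tiles_per_workgroup):
    """Workgroups needed to cover tile_count tiles (ceiling division)."""
    return -(-tile_count // tiles_per_workgroup)


def tiles_for_workgroup(workgroup_id, tile_count, tiles_per_workgroup):
    """Tile indices handled by one workgroup, assuming contiguous assignment."""
    start = workgroup_id * tiles_per_workgroup
    return range(start, min(start + tiles_per_workgroup, tile_count))


# 10 tiles, 4 per workgroup: three workgroups, the last one only partly full.
print(num_workgroups(10, 4))
print(list(tiles_for_workgroup(2, 10, 4)))
```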


Per-Op MHLO/StableHLO → XTile Mapping

XTile lowering splits cleanly: scalar/elementwise ops go through a verbatim StableHLO→arith table (StablehloLowerToArithPass), while shape-level ops go through the Emit* fusion-emitter helpers (StablehloLowerToXtilePass, fusion_emitter.cc/emitter_helpers.cc).

The StableHLO → arith table (verbatim)

Each row is a template instantiation recovered from the demangled name LowerStableHloOpToArith<Src, FloatDst, SignedIntDst[, UnsignedIntDst]>. Float vs int dispatch is selected at match time from the operand element type.

| StableHLO src | float dst | signed-int dst | unsigned-int dst |
| --- | --- | --- | --- |
| stablehlo.add | arith.addf | arith.addi | arith.addi |
| stablehlo.subtract | arith.subf | arith.subi | arith.subi |
| stablehlo.multiply (MulOp) | arith.mulf | arith.muli | arith.muli |
| stablehlo.divide (DivOp) | arith.divf | arith.divsi | arith.divui |
| stablehlo.remainder (RemOp) | arith.remf | arith.remsi | arith.remui |
| stablehlo.maximum (MaxOp) | arith.maximumf | arith.maxsi | arith.maxui |
| stablehlo.minimum (MinOp) | arith.minimumf | arith.minsi | arith.minui |
| stablehlo.and | arith.andi | arith.andi | arith.andi |
| stablehlo.or | arith.ori | arith.ori | arith.ori |
| stablehlo.xor | arith.xori | arith.xori | arith.xori |
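
The table is a three-way dispatch on the operand element type, which can be written down directly. This sketch transcribes the ten rows and assumes element types are spelled as MLIR type strings ("f32", "bf16", "i32", "ui8"), with signless iN treated as signed. The helper names are hypothetical and the real selection happens inside the C++ pattern.

```python
# (float dst, signed-int dst, unsigned-int dst) per source op.
# Keys are StableHLO op mnemonics; the corresponding C++ classes are
# AddOp, SubtractOp, MulOp, DivOp, RemOp, MaxOp, MinOp, AndOp, OrOp, XorOp.
ARITH_TABLE = {
    "stablehlo.add": ("arith.addf", "arith.addi", "arith.addi"),
    "stablehlo.subtract": ("arith.subf", "arith.subi", "arith.subi"),
    "stablehlo.multiply": ("arith.mulf", "arith.muli", "arith.muli"),
    "stablehlo.divide": ("arith.divf", "arith.divsi", "arith.divui"),
    "stablehlo.remainder": ("arith.remf", "arith.remsi", "arith.remui"),
    "stablehlo.maximum": ("arith.maximumf", "arith.maxsi", "arith.maxui"),
    "stablehlo.minimum": ("arith.minimumf", "arith.minsi", "arith.minui"),
    "stablehlo.and": ("arith.andi", "arith.andi", "arith.andi"),
    "stablehlo.or": ("arith.ori", "arith.ori", "arith.ori"),
    "stablehlo.xor": ("arith.xori", "arith.xori", "arith.xori"),
}


def column_for(element_type):
    """Pick the table column from an MLIR element-type string."""
    if element_type.startswith("ui"):
        return 2  # unsigned integer
    if element_type.startswith(("i", "si")):
        return 1  # signless or signed integer
    if element_type.startswith(("f", "bf")):
        return 0  # floating point
    raise ValueError("unsupported element type: " + element_type)


def lower_binary(op_name, element_type):
    """Return the arith op chosen for a StableHLO binary op and element type."""
    return ARITH_TABLE[op_name][column_for(element_type)]


print(lower_binary("stablehlo.divide", "ui32"))
```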

Unary ops route through LowerStableHloUnaryOpToMath<Src, MathDst> (recovered instantiation: stablehlo.round_nearest_even → math.roundeven). Additional dedicated pattern classes in the same anonymous namespace:

  • LowerCompareOp: stablehlo.compare → arith.cmpf/arith.cmpi (predicate by comparison direction + type).
  • LowerConvertOp: stablehlo.convert → arith.ext*/trunc*/sitofp/uitofp/fptosi/fptoui (type-pair table; class confirmed, the full per-pair mapping is inferred, MEDIUM).
  • ConstantConversionPattern: stablehlo.constant → arith.constant (DenseElementsAttr passthrough).
  • ElementwiseConverter: the generic elementwise dispatcher.
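
Because the per-pair LowerConvertOp mapping is inferred rather than recovered, the sketch below shows only what the arith dialect's own semantics imply for each source/destination pair. It is an assumption-laden illustration, not the recovered table: i1 and same-width float pairs such as f16 ↔ bf16 are deliberately left out, and both function names are hypothetical.

```python
import re

_ELEMENT_RE = re.compile(r"^(bf|f|ui|si|i)(\d+)$")


def parse_element_type(type_str):
    """Split an MLIR element type such as 'bf16' or 'ui8' into (kind, width)."""
    match = _ELEMENT_RE.match(type_str)
    if match is None:
        raise ValueError("unsupported element type: " + type_str)
    prefix, width = match.group(1), int(match.group(2))
    if prefix in ("f", "bf"):
        return "float", width
    return ("uint" if prefix == "ui" else "int"), width


def select_convert_op(src, dst):
    """Pick the arith cast for src -> dst, or None when no cast is needed."""
    (src_kind, src_width) = parse_element_type(src)
    (dst_kind, dst_width) = parse_element_type(dst)
    if 1 in (src_width, dst_width):
        raise NotImplementedError("i1 conversions are not covered by this sketch")
    if src_kind == "float" and dst_kind == "float":
        if dst_width > src_width:
            return "arith.extf"
        if dst_width < src_width:
            return "arith.truncf"
        if src == dst:
            return None
        raise NotImplementedError("same-width float pairs are not covered")
    if src_kind == "float":  # float -> int: signedness of the destination
        return "arith.fptoui" if dst_kind == "uint" else "arith.fptosi"
    if dst_kind == "float":  # int -> float: signedness of the source
        return "arith.uitofp" if src_kind == "uint" else "arith.sitofp"
    if dst_width > src_width:  # int widening: extension follows the source
        return "arith.extui" if src_kind == "uint" else "arith.extsi"
    if dst_width < src_width:
        return "arith.trunci"
    return None


print(select_convert_op("f32", "bf16"), select_convert_op("ui8", "i32"))
```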

The shape-level emit helpers

These (in xla::xtile:: / its anonymous namespace) build the tile-op structure that the arith ops fill in:

| helper | text addr | builds |
| --- | --- | --- |
| EmitXTileModule | 0x14c1c9e0 | top level: the whole xtile.entry_func module for a fusion |
| EmitScope | 0x15066ec0 | walks an HLO instruction span, emits XTile/arith into a value map |
| EmitParameterExtract | 0x15066e00 | xtile.extract of a parameter tile given a TileInfo |
| EmitConstant | 0x15064da0 | arith.constant/splat for HLO constants |
| EmitElementwise | 0x150630a0 | elementwise HLO → arith/math over extracted tiles |
| EmitSingleTileDot | 0x14c277a0 | one tile of a dot → the dot algorithm (dot_algorithms.cc) |
| EmitSingleTileScaledDot | 0x14c289c0 | one tile of a block-scaled dot → xtile.dot_scaled |
| EmitTiledReshape | 0x150694c0 | reshape over tiled tensors |
| EmitTiledTranspose | 0x15069860 | transpose over tiled tensors (SmallVector<long,6> permutation) |
| EmitTiledComputation / EmitTiledInstructionList / EmitTiledBitcast / EmitReduceComputation / EmitNestedFusion | (multiple syms) | whole computation / per-instruction / bitcast / reduce body / nested-fusion inlining |

EmitXTileModule signature (recovered):

EmitXTileModule(std::string_view name,
                HloFusionInstruction const*,
                SymbolicTileAnalysis const&,
                Tiling const&,
                MLIRContext&,
                absl::Span<Type>,
                std::optional<stream_executor::GpuComputeCapability> const&);

The XTile module is emitted directly from an HLO fusion using the symbolic tile analysis + the chosen tiling, gated by GPU compute capability (CPU passes nullopt). Log anchor at entry: "Emitting XTile IR for fusion". Dot-emit canonicalization uses CanonicalizeDotOperand (0x150692a0), MaskDotOperand, GetDotLoopIterationCount, GetDotAccumulatorType, GetPaddedTileSizes.

Worked lowerings

// (a) elementwise binary — stablehlo.add of two tiled f32 tensors
//     %c = stablehlo.add %a, %b : tensor<256x256xf32>
//     after tiling + StablehloLowerToArith over one tile:
%ta = xtile.extract %A[%i, %j] [/*static offsets*/] [/*static sizes*/] : tensor<TMxTNxf32>
%tb = xtile.extract %B[%i, %j] [...] [...]                              : tensor<TMxTNxf32>
%tc = arith.addf %ta, %tb : tensor<TMxTNxf32>      // float vs int per the arith table
      xtile.insert %tc into %C[%i, %j] [...] [...]

// (b) dot (matmul), one tile — EmitSingleTileDot (dot_algorithms.cc)
//     stablehlo.dot_general %lhs, %rhs {contracting = ...}
//     per output tile: extract lhs/rhs tiles; loop GetDotLoopIterationCount over the
//     contraction tiling; accumulate in GetDotAccumulatorType (f32 acc for bf16/f8 inputs);
//     operands canonicalized by CanonicalizeDotOperand, boundary-masked by MaskDotOperand;
//     result xtile.insert into the output buffer.

// (c) block-scaled dot (MXFP) — EmitSingleTileScaledDot
//     a block_scaled_dot custom-call (scale type f8E8M0FNU):
%d = xtile.dot_scaled %lhsTile, %rhsTile, %lhsScale, %rhsScale
       {transpose_lhs, transpose_rhs, acc}    // three bool flags
// verifier strings: "expect scale operands dimension 2 to equal C/block_size",
// "block sizes for cast_from_block_scaled and cast_to_block_scaled must match"

// (d) transpose — EmitTiledTranspose(b, permutation, tiledValue)
//     stablehlo.transpose {permutation} → tile-level transpose over the tiled
//     RankedTensorType using the SmallVector<long,6> permutation.

// (e) reshape — EmitTiledReshape(b, newShape, tiledValue)
//     stablehlo.reshape → tile-level reshape; collapse/expand of unit dims via CollapseUnitDims.

// (f) constant — EmitConstant → arith.constant (DenseElementsAttr); 0-D constants are
//     later scalarized by ConvertElementwise0DTensorToScalarPass.
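
Lowering (b) can be followed numerically with a plain-Python model of one output tile at a time. This is a sketch under stated assumptions: the loop count is taken to be ceil(K / tile_k), boundary masking is modelled as zero padding, and the helper names are hypothetical stand-ins for the roles of GetDotLoopIterationCount and MaskDotOperand, not their recovered bodies.

```python
def ceil_div(a, b):
    return -(-a // b)


def extract_padded(mat, row0, col0, rows, cols, pad=0.0):
    """Read a rows x cols tile; out-of-bounds elements take the pad value."""
    height, width = len(mat), len(mat[0])
    return [
        [mat[r][c] if r < height and c < width else pad
         for c in range(col0, col0 + cols)]
        for r in range(row0, row0 + rows)
    ]


def tiled_matmul(lhs, rhs, tile_m, tile_n, tile_k):
    """Matmul computed tile by tile with an accumulator per output tile."""
    m, k, n = len(lhs), len(lhs[0]), len(rhs[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            acc = [[0.0] * tile_n for _ in range(tile_m)]
            for step in range(ceil_div(k, tile_k)):  # contraction loop
                k0 = step * tile_k
                a = extract_padded(lhs, i0, k0, tile_m, tile_k)
                b = extract_padded(rhs, k0, j0, tile_k, tile_n)
                for i in range(tile_m):
                    for j in range(tile_n):
                        acc[i][j] += sum(a[i][p] * b[p][j] for p in range(tile_k))
            # Insert: write back only the in-bounds part of the tile.
            for i in range(min(tile_m, m - i0)):
                for j in range(min(tile_n, n - j0)):
                    out[i0 + i][j0 + j] = acc[i][j]
    return out


# 3x5 times 5x2 with 2x2x2 tiles exercises every partial boundary tile.
lhs = [[float(r * 5 + c) for c in range(5)] for r in range(3)]
rhs = [[float(r * 2 + c + 1) for c in range(2)] for r in range(5)]
result = tiled_matmul(lhs, rhs, 2, 2, 2)
print(result)
```

Zero is a safe pad value here because padded positions contribute nothing to a sum of products; other reductions would need a different neutral element.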

NOTE — broadcast / iota / concatenate. BroadcastInDims, Splat, and ConstLike helpers handle broadcast; iota and concatenate go through EmitGeneric / CheckConcatenateOperands. The upstream StableHLO→Linalg IotaConverter/ConcatenateConverter classes also ship in the binary, but they belong to the GPU/CPU stablehlo-to-linalg path, not the XTile pass — do not conflate them.


Tiling and Type Conversion

XTile consumes a tiling; it does not choose one

XTile does not decide tiling. The decision is made earlier by xla::SymbolicTileAnalysis:

  • AnalyzeFusion (0x14c35a40) / AnalyzeComputation (0x14c359a0) propagate an IndexingMap through the fusion.
  • GetValidTilings (0x14c40280) enumerates valid Tilings; ParametersSatisfyConstraints (0x14c3d040) + tiled_emitter_constraints.cc filter them.
  • ComputeOutputTilingInfo (0x14c46960) / ComputeTiledInstructions (0x14c47f80) produce the per-instruction TileInfo.

The chosen Tiling + GpuComputeCapability are passed into EmitXTileModule, which bakes TilingInfoAttr{tile_count, tiles_per_workgroup} and LayoutAttr{minor_to_major} onto the entry function and tiles. Failure anchor: "cannot compute parametric tile sizes for dynamically shaped payload op". This is the GPU/CPU symbolic-tiling search; the TPU analogue (window-config search, Mosaic sublane/lane tiling) is entirely separate code.
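
The division of labour (enumerate, filter, then hand one tiling to the emitter) can be sketched without reproducing SymbolicTileAnalysis. Everything in this block is an illustrative assumption: the power-of-two candidate set, the constraint callback, and the tile-count formula are hypothetical stand-ins for GetValidTilings, ParametersSatisfyConstraints, and ComputeOutputTilingInfo.

```python
from itertools import product


def candidate_tile_sizes(dim):
    """Powers of two below the dimension, plus the full dimension itself."""
    sizes = []
    size = 1
    while size < dim:
        sizes.append(size)
        size *= 2
    sizes.append(dim)
    return sizes


def valid_tilings(shape, constraint):
    """Enumerate per-dimension tile sizes and keep those passing the constraint."""
    candidates = product(*(candidate_tile_sizes(dim) for dim in shape))
    return [tiling for tiling in candidates if constraint(tiling)]


def tile_count(shape, tiling):
    """Number of tiles covering the shape, counting partial boundary tiles."""
    count = 1
    for dim, tile in zip(shape, tiling):
        count *= -(-dim // tile)
    return count


# Keep tilings whose tile holds at most four elements.
tilings = valid_tilings((4, 2), lambda tiling: tiling[0] * tiling[1] <= 4)
print(tilings)
print(tile_count((256, 256), (64, 32)))
```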

Two type conversions

  1. tensor<f32> → scalar f32 (ConvertElementwise0DTensorToScalarPass) — a mlir::TypeConverter full conversion carrying addConversion(Type) (identity), addConversion(RankedTensorType) (rank-0 → element type), addSourceMaterialization (scalar → tensor re-wrap), addTargetMaterialization (tensor → scalar unwrap), plus addDynamicallyLegalOp<arith::ConstantOp> so 0-D constants stay legal until the materializers fire.

  2. tensor → memref (xtile-cpu-bufferization): MLIR One-Shot Bufferize via the BufferizableOpInterface that ExtractTileOp/InsertTileOp implement. ExtractTileOp::bufferizesToMemoryRead returns true for the source, and the op emits a memref.subview-style read; InsertTileOp::bufferizesToMemoryWrite returns true for the dest, and the op emits an in-place tile store; bufferizesToAllocation governs scratch allocation. xtile-cpu-tensor-ops-to-bufferizable first normalizes stray tensor.* ops; xtile-cpu-memref-copy-to-loops then lowers copies to scf loops. Entry-function memrefs must use a tiled layout (anchor: "All memref arguments should use the TiledLayoutAttr for layout"); alignment is checked with "Cannot confirm that the memref is memory tile-aligned …".
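
The first conversion is small enough to model on type strings. This sketch mirrors the two addConversion rules described in item 1 (identity for everything, rank-0 ranked tensor → element type) and assumes types are written in MLIR's textual form; the function is hypothetical and the materialization callbacks are not modelled.

```python
import re

# tensor< (dims such as "4x" or "?x")* element-type >
_TENSOR_RE = re.compile(r"^tensor<((?:(?:\d+|\?)x)*)(.+)>$")


def convert_type(type_str):
    """Map rank-0 tensors to their element type; leave every other type alone."""
    match = _TENSOR_RE.match(type_str)
    if match is None:
        return type_str  # identity conversion for non-tensor types
    dims, element_type = match.group(1), match.group(2)
    if dims == "":
        return element_type  # rank-0 tensor becomes a scalar
    return type_str  # ranked tensor with dimensions is unchanged


for sample in ("tensor<f32>", "tensor<4x8xf32>", "f32", "tensor<index>"):
    print(sample, "->", convert_type(sample))
```

Dimensions are matched as digits or "?" followed by "x", so an element type that itself contains an "x", such as index, is not mistaken for a dimension list.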


Failing / Unsupported Ops at the XTile Boundary

xtile-verify-legal-ops and the two lower-to-* full conversions reject anything outside the legal set. Recovered failure anchors:

| anchor string | trigger |
| --- | --- |
| "Could not legalize op: " | an op with no lower-to-xtile/arith pattern survived |
| "cannot compute parametric tile sizes for dynamically shaped payload op" | dynamic-shape payload with no static tiling |
| "Boundaries of the clamp are not legal: " | xtile.mask/clamp bound out of range |
| "All tiles must have the same rank." | mixed-rank tiles into a tile op |
| "At least one tile shape must be specified." | tiling info absent |
| "All tiled squeezed dimensions must be of size 1." | bad squeeze on a tiled dim |
| "All memref arguments should use the TiledLayoutAttr for layout" | entry-func memref lacks tiled layout |
| "Cannot confirm that the memref is memory tile-aligned …" | alignment check could not prove producer is tile-aligned |
| "expect scale operands dimension 2 to equal C/block_size " | xtile.dot_scaled scale operand shape mismatch |
| "Allow XLA's MHLO ops not in StableHLO to remain present after legalization (copy, add_dependency, fusion, etc.)" | flag: MHLO-only ops tolerated; if not lowered they hit "Could not legalize op:" later |

The "MHLO not in StableHLO" flag confirms the input is StableHLO produced from MHLO, with a small allow-list of MHLO-only ops (copy, add_dependency, fusion) permitted to persist into the XTile lowering.


Where the tpu Dialect Actually Comes From

Because there is no MHLO→tpu pass, a reimplementer needs to know the real origin of the tpu dialect on the TPU path — it is imported, not lowered. The seam:

  1. A Pallas @pl.kernel (or hand-written Mosaic kernel) is compiled by the JAX frontend, outside libtpu, into a serialized tpu-dialect MLIR module.
  2. That module is embedded in HloOpcode::kCustomCall with custom_call_target = "tpu_custom_call", inside the mlir_module field of the backend config.
  3. At HLO→LLO emit time, xla::jellyfish::mlir_utils::GetMlirModuleOpFromCustomCall (0x13e327a0) extracts and caches the mlir::ModuleOp (CityHash128-keyed, one parse per unique kernel).
  4. The imported module runs through the tpu pass pipeline RunMLIRPasses (0x111fefa0) — MosaicSerdePass (version upgrade) → memref/vector-layout passes → createLowerToLLOPass (tpu → LLO) → LLO.

The tpu dialect is therefore authored upstream and imported, never produced by lowering general MHLO. See Mosaic Overview for the import/serde/pipeline detail and The tpu MLIR Dialect for the dialect itself.
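
Step 3's "one parse per unique kernel" behaviour is a content-addressed cache. The sketch below shows the idea only: CityHash128 is not in the Python standard library, so a 128-bit BLAKE2b digest is swapped in as a stand-in, and the ModuleCache class and its parse callback are hypothetical.

```python
import hashlib


class ModuleCache:
    """Cache parsed modules by a 128-bit hash of their serialized bytes."""

    def __init__(self, parse_fn):
        self._parse = parse_fn
        self._modules = {}
        self.parse_calls = 0

    def get(self, serialized):
        # Stand-in for CityHash128: any stable 128-bit content hash works here.
        key = hashlib.blake2b(serialized, digest_size=16).digest()
        if key not in self._modules:
            self.parse_calls += 1
            self._modules[key] = self._parse(serialized)
        return self._modules[key]


# The "parser" here just decodes the bytes; a real one would build an MLIR module.
cache = ModuleCache(lambda data: data.decode("utf-8"))
first = cache.get(b"module { }")
second = cache.get(b"module { }")
other = cache.get(b"module { func.func @k() }")
print(cache.parse_calls)
```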

GOTCHA — do not build an MHLO→tpu legalizer; it does not exist. The convergence point the overview draws at Level 2 (tpu dialect) has exactly one producer: the Mosaic custom-call import (the Pallas-only branch of Tree A). The general path feeds it nothing, because general HLO bypasses tpu entirely and is emitted to LLO directly by jellyfish *Emitter classes (the main branch of Tree A). XTile (Tree B) does not feed it either; it targets CPU/LLVM. A reimplementation that allocates effort to an MHLO/StableHLO→tpu conversion pattern set is building a stage that the production compiler does not contain.


Confidence Summary

| Claim | Evidence |
| --- | --- |
| Four XTile front-end passes with the listed CLI names/create-functions | StablehloLowerToXtilePass 0x150602c0, StablehloLowerToArithPass 0x1505a880, ConvertElementwise0DTensorToScalarPass 0x15059440, VerifyLegalXTileOpsPass 0x15062300; CLI strings recovered |
| Six registered XTile ops with the listed classes | addOperations<DotScaledOp, EntryFuncOp, EntryFuncReturnOp, ExtractTileOp, InsertTileOp, MaskOp> in XTileDialect::initialize; op-name strings xtile.entry_func/return/extract/insert/mask/dot_scaled |
| StableHLO→arith table (10 binary-op rows) | LowerStableHloOpToArith<…> template instantiations recovered from demangled names |
| Emit* fusion-emitter helper map + EmitXTileModule signature | EmitXTileModule 0x14c1c9e0, EmitSingleTileDot 0x14c277a0, EmitTiledTranspose 0x15069860, … ; log anchor "Emitting XTile IR for fusion" |
| XTile attributes xtile.layout/xtile.tiling_info and the workgroup model | XTile_LayoutAttr/minor_to_major and XTile_TilingInfoAttr/tile_count/tiles_per_workgroup parse-error strings; CPU-thunk NumWorkGroups{%d, %d, %d}/XLA_CPU_NumWorkGroups/xla.cpu.KernelThunkProto strings |
| tensor → memref via One-Shot Bufferize on BufferizableOpInterface | ExtractTileOp/InsertTileOp implement BufferizableOpInterface + TiledBufferInterface; xtile-cpu-bufferization CLI string |
| Failure anchors at the XTile boundary | all 10 anchor strings recovered in the decompiled output |
| XTile is NOT on the TPU MXU path | XTile dep-set = XlaCpuDialect/memref/vector/LLVM (no tpu/llo); xtile-cpu-* pipeline; codegen/xtile/ source paths |
| No MHLO/HLO→tpu conversion pass exists | grep of functions table: 0 hits for *ToTpuPass/MhloToTpu/StablehloToTpu/LegalizeToTpu |
| tpu dialect is imported only, via tpu_custom_call | GetMlirModuleOpFromCustomCall 0x13e327a0, RunMLIRPasses 0x111fefa0, MosaicSerdePass, tpu_custom_call all present |
| Plain tile dot is emitted to linalg/vector, not an xtile.dot op | no DotOp class / xtile.dot string in binary; EmitSingleTileDot/dot_algorithms.cc bodies not decompiled per-algorithm |
| Full LowerConvertOp ext/trunc/itofp/fptoi type-pair table | LowerConvertOp class present (119 refs); per-pair mapping inferred, not enumerated |

Cross-References

  • The TPU Compiler (overview) — the five-phase spine and IR-layer stack; this page refines its Level 2 ("XTile between MHLO and tpu") into the two-tree split.
  • Compile Phases 0–3 — the device compiler's phase pipeline, into which the xtile-cpu-* pipeline is not wired.
  • The tpu MLIR Dialect — the tpu target dialect that XTile does not produce; the convergence point of the optimizer and Mosaic paths on the TPU side.
  • tpu → LLO Lowering — the createLowerToLLOPass descent that the imported tpu module runs through (the real path, unlike XTile's CPU/LLVM target).
  • DialectConversion Legalizer — the MLIR dialect-conversion machinery underlying both the XTile full conversions and the tpu legalizers.
  • Mosaic Overview — the Pallas/Mosaic import seam (GetMlirModuleOpFromCustomCall / RunMLIRPasses) that is the actual origin of the tpu dialect.
  • Mosaic Layout Inference — the TPU sublane/lane vreg tiling that is the analogue of XTile's CPU/GPU workgroup tiling.
  • Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
  • Index entry: Part V — Compiler: Lowering & Optimization Passes / MLIR lowering chain — back to index