The tpu MLIR Dialect: Ops and the Op-Model Contract
All addresses, symbol names, and counts on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00). The .symtab is not stripped, so every claim below is anchored to a demangled symbol or a relocation addend. Other versions will differ; the 23-slot ordering in particular is pinned to this binary's LLVM SHA.
Abstract
The tpu dialect is the TPU's machine dialect — the MLIR target dialect that the optimizer path (HLO → MHLO → tpu) and the Mosaic side channel (tpu_custom_call → tpu) both converge on before the descent into LLO (see Compiler Overview). This page documents the dialect from the angle that survives in a stripped-but-symbol-bearing binary: not the source-level ODS .td files (which are not in the wheel), but the runtime op-registration ABI that every op in every MLIR dialect is compiled into.
Concretely, MLIR registers an op not by C++ inheritance but by instantiating a type-erased concept class — mlir::RegisteredOperationName::Model<Op> — once per op, and inserting it as the op's OperationName::Impl. Each Model<Op> is a vtable with a fixed 23-slot contract: a shared base destructor, a per-op deleting destructor, and 21 dispatch slots (fold, canonicalize, verify, parse/print, the four inherent-attribute accessors, and nine property-management methods). The binary holds 6,050 such Model<Op> vtables — one per registered op across 60-plus dialect namespaces — of which the tpu dialect owns 86.
The binary-wide Model<Op> ABI is owned by the sibling MLIR Op-Model Contract page; this page is the worked tpu-dialect instantiation of it and reproduces the 23-slot walk only as far as needed to read the 86 tpu ops against it. This page owns three things:
- The Model<Op> 23-slot contract: every slot labelled to its demangled Model<Op>::<method> symbol, the 1-shared + 22-per-op layout, and what each slot does. (Verified on Model<mlir::tpu::IotaOp>: all 21 dispatch-slot symbols present.)
- The Op↔Model binding: how a concrete tpu op fills its 23 slots, via the Dialect::addOperations<…> registrar, the single RegisteredOperationName::insert, and the three-level delegation from the type-erased Model thunk to the op's ODS-generated statics.
- The tpu op-family taxonomy: the 86 ops grouped by their OpInterface signature, the fold/canonicalize census (the dialect's entire in-dialect rewrite surface is 15 methods), and the 9 ops that carry a variadic AttrSizedOperandSegments size array (a subset of the 53 ops whose Properties struct is non-trivial).
| Property | Value |
|---|---|
| Op-registration concept | mlir::RegisteredOperationName::Model<Op> (one per registered op) |
| Model vtable count (all dialects) | 6,050 across 60-plus namespaces (= total registered-op count) |
| tpu dialect Model count | 86 (exact addOperations<> arity; corrects the ~157 that a naive *Op-string scan over-counts; see below) |
| Vtable slots per Model | 23 = slot 0 shared base dtor + slot 1 per-op dtor + slots 2–22 dispatch |
| Shared slot-0 dtor | mlir::OperationName::Impl::~Impl (addend 0xfea8820, identical across all 6,050 Models) |
| tpu op registrar | Dialect::addOperations<mlir::tpu::AllReduceOp, …> @ 0x14aa2c40 (86 args) |
| Registration sink | RegisteredOperationName::insert(unique_ptr<OperationName::Impl>, ArrayRef<StringRef>) @ 0x1d8c57a0 |
| Dialect ctor | mlir::tpu::TPUDialect::TPUDialect(MLIRContext*) @ 0x14a96d40 (registers the name "tpu", passed with length 3, then calls addAttributes<13>/addType<3>/addOperations<86> inline; no separate initialize()) |
| tpu MemoryEffect interface fan-out | 59 ops (cross-checked two ways) |
| tpu in-dialect rewrite surface | 6 real fold() + 9 real getCanonicalizationPatterns() = 15 methods |
| tpu ops with non-trivial Properties | 53 (slot 14 getOpPropertyByteSize > 0); 9 of those carry an AttrSizedOperandSegments size array |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
The Model<Op> 23-Slot Contract
MLIR's OperationName is a handle into an interned, per-op OperationName::Impl. For a registered op (one the dialect declared, as opposed to an opaque/unregistered op parsed from text), that Impl is a RegisteredOperationName::Model<Op> — a C++ template instantiated once per op type. The Model<Op> is a classic type-erasure concept: it derives from the abstract OperationName::Impl interface and overrides every virtual with a thunk that knows the concrete Op's ODS-generated statics. There is exactly one Model<Op> object — and therefore one vtable — per registered op, and nm counts 6,050 of them (_ZTVN4mlir23RegisteredOperationName5ModelI…EEE).
The vtable follows the Itanium C++ ABI layout, emitted here as position-independent data: vtable+0x00 is offset-to-top (0), vtable+0x08 points at the _ZTI…Model<Op>… typeinfo, and vtable+0x10 is the address point, the value an object's vptr stores and the location of slot 0. Slot i sits at vtable+0x10+8i. In the on-disk image the slot words are zero; they are filled at load by R_X86_64_RELATIVE relocations whose addends are the method addresses. Resolving each addend against the (sorted) symbol table recovers the method name for every slot.
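The slot arithmetic and the addend lookup can be sketched in a few lines of Python. This is a minimal sketch, not the tooling used for this page: it assumes the quoted 0x219abb28 is the start of the Model<mlir::tpu::IotaOp> vtable (the _ZTV symbol value, not the address point), and the symbol and relocation tables are small hand-built stand-ins holding only addresses quoted on this page.

```python
import bisect

SLOT_SIZE = 8                # one 64-bit pointer per slot
ADDRESS_POINT_OFFSET = 0x10  # offset-to-top (8 bytes) + typeinfo pointer (8 bytes)


def slot_address(vtable_start, slot):
    """Address of slot `slot`, counted from the address point."""
    return vtable_start + ADDRESS_POINT_OFFSET + SLOT_SIZE * slot


def resolve(addend, symbols):
    """Name of the last symbol at or below `addend` (symbols sorted by address)."""
    addresses = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addresses, addend) - 1
    return symbols[i][1] if i >= 0 else None


# Stand-in symbol table: only addresses quoted on this page.
SYMBOLS = sorted([
    (0x0FEA8820, "mlir::OperationName::Impl::~Impl"),
    (0x14AC1580, "Model<mlir::tpu::IotaOp>::foldHook"),
    (0x14AC16E0, "Model<mlir::tpu::IotaOp>::printAssembly"),
    (0x14AC1760, "Model<mlir::tpu::IotaOp>::verifyInvariants"),
    (0x14AC18A0, "Model<mlir::tpu::IotaOp>::getInherentAttr"),
])

IOTA_VTABLE = 0x219ABB28  # assumption: vtable start, not the address point

# Stand-in relocation table: slot address -> R_X86_64_RELATIVE addend.
KNOWN_SLOTS = {0: 0x0FEA8820, 2: 0x14AC1580, 7: 0x14AC16E0,
               8: 0x14AC1760, 10: 0x14AC18A0}
RELOCS = {slot_address(IOTA_VTABLE, s): a for s, a in KNOWN_SLOTS.items()}

for where, addend in sorted(RELOCS.items()):
    slot = (where - IOTA_VTABLE - ADDRESS_POINT_OFFSET) // SLOT_SIZE
    print(f"slot {slot:2d} @ {where:#x} -> {resolve(addend, SYMBOLS)}")
```

A real run would replace SYMBOLS with the parsed .symtab and RELOCS with the parsed relocation section; the lookup logic stays the same.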
The reference walk below is Model<xla::PureCallOp> @ 0x219d4e38, reproduced byte-identically in ordering and signature on Model<mlir::tpu::IotaOp> @ 0x219abb28, Model<mlir::llo::ConstantOp> @ 0x2193dba0, and Model<mlir::mosaic_sc::RelayoutOp> @ 0x21905130 — three dialects, one ABI.
RegisteredOperationName::Model<Op> vtable (address point = +0x10 = slot 0)

| slot | off | method (Model<Op>::…) | nature |
|---|---|---|---|
| 0 | 0x00 | OperationName::Impl::~Impl | SHARED base dtor (0xfea8820 ∀ ops) |
| 1 | 0x08 | ~Model [deleting] | per-op dtor |
| 2 | 0x10 | foldHook(Operation*, ArrayRef<Attribute>, SmallVectorImpl<OpFoldResult>&) | per-op; wraps Op<>::getFoldHookFn lambda (UniqueFunction) |
| 3 | 0x18 | getCanonicalizationPatterns(RewritePatternSet&, MLIRContext*) | per-op; no-op unless op defines it |
| 4 | 0x20 | hasTrait(TypeID) | trait / interface membership test |
| 5 | 0x28 | getParseAssemblyFn() | returns the op's custom parser fn |
| 6 | 0x30 | populateDefaultAttrs(OperationName const&, NamedAttrList&) | default-attr init |
| 7 | 0x38 | printAssembly(Operation*, OpAsmPrinter&, StringRef) | per-op; wraps getPrintAssemblyFn |
| 8 | 0x40 | verifyInvariants(Operation*) | wraps Op<>::verifyInvariants |
| 9 | 0x48 | verifyRegionInvariants(Operation*) | region-structure verify |
| 10 | 0x50 | getInherentAttr(Operation*, StringRef) | → Op::getInherentAttr (TAIL-JMP) |
| 11 | 0x58 | setInherentAttr(Operation*, StringAttr, Attribute) | → Op::setInherentAttr |
| 12 | 0x60 | populateInherentAttrs(Operation*, NamedAttrList&) | → Op::populateInherentAttrs |
| 13 | 0x68 | verifyInherentAttrs(OperationName, NamedAttrList&, function_ref<…>) | inherent-attr verify |
| 14 | 0x70 | getOpPropertyByteSize() | INLINED sizeof(Op::Properties) |
| 15 | 0x78 | initProperties(OperationName, PropertyRef, PropertyRef) | placement-construct props |
| 16 | 0x80 | deleteProperties(PropertyRef) | destruct props |
| 17 | 0x88 | populateDefaultProperties(OperationName, PropertyRef) | default props |
| 18 | 0x90 | setPropertiesFromAttr(OperationName, PropertyRef, Attribute, fn_ref<…>) | attr → props |
| 19 | 0x98 | getPropertiesAsAttr(Operation*) | props → attr (print / bytecode) |
| 20 | 0xa0 | copyProperties(PropertyRef, PropertyRef) | clone props |
| 21 | 0xa8 | compareProperties(PropertyRef, PropertyRef) | props equality (for CSE) |
| 22 | 0xb0 | hashProperties(PropertyRef) | props hash (for CSE / dedup) |
Setting aside the two destructor slots (0 and 1), the remaining 21 slots cluster into three functional groups:
- Slots 2–9: the dispatch surface. Folding, canonicalization, the trait/interface gate (hasTrait), custom parse/print, default-attribute population, and the two verifiers. Slots 2, 3, 7, and 8 are the optional hooks: they are present for every op but wrap a no-op unless the op actually defines the method (see the binding section).
- Slots 10–13: inherent-attribute accessors. Inherent attributes are the op's declared attributes (e.g. iota's dimensions), as opposed to discardable attributes. These four slots tail-jump straight to the op's ODS statics.
- Slots 14–22: property management. MLIR's Properties is the inline storage for an op's inherent attributes and variadic operand-segment sizes. These nine slots construct, destruct, copy, compare, hash, and (de)serialize that storage. For ops without a Properties struct they are trivial (see slot 14 below).
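For tooling that walks many vtables, the slot names and their grouping can be kept as a small lookup table. The sketch below transcribes the method names from the slot walk above; the group labels ("destructor", "dispatch", "inherent-attr", "properties") are shorthand chosen for this example, not names taken from the binary.

```python
# Slot index -> method name, transcribed from the 23-slot walk above.
SLOTS = [
    "OperationName::Impl::~Impl", "~Model",
    "foldHook", "getCanonicalizationPatterns", "hasTrait",
    "getParseAssemblyFn", "populateDefaultAttrs", "printAssembly",
    "verifyInvariants", "verifyRegionInvariants",
    "getInherentAttr", "setInherentAttr", "populateInherentAttrs",
    "verifyInherentAttrs",
    "getOpPropertyByteSize", "initProperties", "deleteProperties",
    "populateDefaultProperties", "setPropertiesFromAttr",
    "getPropertiesAsAttr", "copyProperties", "compareProperties",
    "hashProperties",
]


def slot_group(index):
    """Classify a slot index into its functional group (labels are illustrative)."""
    if not 0 <= index < len(SLOTS):
        raise IndexError(f"slot {index} is outside the 23-slot contract")
    if index <= 1:
        return "destructor"
    if index <= 9:
        return "dispatch"
    if index <= 13:
        return "inherent-attr"
    return "properties"


def slot_offset(index):
    """Byte offset of the slot from the address point."""
    return 8 * index


for i, name in enumerate(SLOTS):
    print(f"{i:2d}  {slot_offset(i):#04x}  {slot_group(i):13s}  {name}")
```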
NOTE: the "23-slot Model" is structurally 1-shared + 22-per-op, not 23 per-op. Slot 0, which sits at the address point that every object's vptr stores, is the same relocation addend 0xfea8820 (mlir::OperationName::Impl::~Impl) across all 6,050 Models, verified on a 30-Model stratified sample spanning tpu, llo, sparse_core, mosaic_sc, and xla. It is the destructor of the common OperationName::Impl base subobject. Only slots 1–22 are per-op template instantiations: slot 1 is Model<Op>::~Model (the deleting dtor) and slots 2–22 are the dispatch thunks. A reimplementer should model Model<Op> as one shared base destructor plus 22 op-specific entries.
QUIRK: getOpPropertyByteSize (slot 14) is an inlined sizeof(Op::Properties) constant, not a function call. The compiler folds the slot to a literal. Model<mlir::tpu::IotaOp>::getOpPropertyByteSize is mov $0x8; ret (8 bytes, the single dimensions DenseArrayAttr handle). Model<mlir::tpu::BarrierOp>::getOpPropertyByteSize (@ 0x14ab1960) is xor %eax,%eax; ret (0 bytes, because BarrierOp carries no Properties). This is the cheapest possible diagnostic for "does this op have inline properties": read slot 14's two-instruction body. The nine property-management slots (14–22) only do real work for ops carrying inherent attributes or AttrSizedOperandSegments.
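Reading slot 14's body can be automated by pattern-matching the two instruction shapes. The sketch below assumes the usual x86-64 encodings (31 c0 or 33 c0 for xor %eax,%eax, b8 imm32 for mov $imm,%eax, c3 for ret); the page quotes the mnemonics, not the raw bytes, so the byte patterns are an assumption, and anything else is reported as unknown rather than guessed.

```python
def property_byte_size(body):
    """Decode sizeof(Op::Properties) from a slot-14 function body.

    Returns the constant, or None if the body is neither of the two
    expected shapes (assumed encodings, see the lead-in).
    """
    # xor %eax,%eax ; ret  -> the op has no Properties
    if body[:3] in (b"\x31\xc0\xc3", b"\x33\xc0\xc3"):
        return 0
    # mov $imm32,%eax ; ret -> imm32 is the Properties size
    if len(body) >= 6 and body[0] == 0xB8 and body[5] == 0xC3:
        return int.from_bytes(body[1:5], "little")
    return None


# Assumed encodings of the two bodies quoted above.
iota_body = bytes.fromhex("b808000000c3")   # mov $0x8,%eax ; ret
barrier_body = bytes.fromhex("31c0c3")      # xor %eax,%eax ; ret

print(property_byte_size(iota_body), property_byte_size(barrier_body))
```

Returning None for unrecognised bodies keeps a census honest: an op whose slot 14 was compiled differently shows up as a gap to inspect by hand instead of a silent zero.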
The Op↔Model Binding
A concrete op is bound to its 23-slot Model not by overriding virtuals in a hand-written subclass, but by template instantiation + delegation, in three levels.
Level 1 — registration: addOperations → insert
Each dialect registers its ops with a single variadic Dialect::addOperations<Op…> template instantiation. For tpu, this is called directly from the dialect constructor body (TPUDialect::TPUDialect @ 0x14a96d40 — there is no separate initialize() symbol; the constructor runs Dialect::Dialect(this, "tpu", 3, …), then addAttributes<…>, addType<…> ×3, then addOperations<…>):
mlir::Dialect::addOperations<mlir::tpu::AllReduceOp, mlir::tpu::AllocaSemaphoreOp,
mlir::tpu::AssumeLayoutOp, …> @ 0x14aa2c40
with exactly 86 type arguments — confirming the tpu dialect's op count from the registrar arity, not from a string scan. For each Op in the pack, addOperations materialises a Model<Op> (an instance whose vtable holds the 23 Model<Op>::* thunks) and hands it to the single, non-template registration sink:
mlir::RegisteredOperationName::insert(
std::unique_ptr<mlir::OperationName::Impl, …>, ← the Model IS the Impl
llvm::ArrayRef<llvm::StringRef>) @ 0x1d8c57a0
The Model<Op> is the OperationName::Impl: registration means "intern this Impl under the op's name(s)." Because initialize() is fully inlined into the TPUDialect constructor, the attribute (addAttributes<13>), type (addType<…> ×3), and op (addOperations<86>) registrations all run in the constructor body, and there is no standalone TPUDialect::initialize symbol. Binary-wide there are 243 distinct addOperations<…> symbols, 205 unique after demangling (38 collapse to identical demangled signatures). Each registered dialect has at least one, and some dialects split registration into batches. The TPU-path registrars are tpu 0x14aa2c40 (86), llo 0x13e5f940 (325), sparse_core ScDialect 0x14594f60 (115), and mosaic_sc 0x132f9ec0 (1, RelayoutOp).
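The registration pattern (one type-erased Impl per op, built by a generic Model wrapper and interned through a single non-generic sink) can be illustrated with a small Python analogue. This is a conceptual sketch only: the class names mirror the MLIR ones, but the two-method interface, the op names "tpu.iota" and "tpu.barrier", and the PROPERTIES_SIZE attribute are simplifications invented for the example.

```python
from abc import ABC, abstractmethod


class Impl(ABC):
    """Abstract interface every registered op is erased to."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def fold(self, op): ...

    @abstractmethod
    def property_byte_size(self): ...


class Model(Impl):
    """Generic wrapper: one instance per concrete op class."""

    def __init__(self, op_cls):
        super().__init__(op_cls.NAME)
        self.op_cls = op_cls

    def fold(self, op):
        hook = getattr(self.op_cls, "fold", None)
        return hook(op) if hook else None  # no-op when the op defines no fold

    def property_byte_size(self):
        return self.op_cls.PROPERTIES_SIZE


class Registry:
    def __init__(self):
        self._impls = {}

    def insert(self, impl):
        """Single non-generic sink: intern the Impl under the op's name."""
        if impl.name in self._impls:
            raise ValueError(f"{impl.name} is already registered")
        self._impls[impl.name] = impl

    def add_operations(self, *op_classes):
        for op_cls in op_classes:
            self.insert(Model(op_cls))

    def lookup(self, name):
        return self._impls[name]


class IotaOp:          # illustrative stand-ins, not the real op classes
    NAME = "tpu.iota"
    PROPERTIES_SIZE = 8


class BarrierOp:
    NAME = "tpu.barrier"
    PROPERTIES_SIZE = 0


registry = Registry()
registry.add_operations(IotaOp, BarrierOp)
print(registry.lookup("tpu.iota").property_byte_size())
```

Callers only ever see the Impl interface, which is why the binary can dispatch through a fixed slot layout without knowing the concrete op type.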
Level 2 — per-slot delegation to ODS statics
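Each Model<Op>::slotN thunk does two things: it locates the op's Properties storage, which trails the fixed 0x40-byte Operation header at an offset selected by one bit of the op flag word, then tail-jumps to the ODS-generated static. In upstream MLIR's Operation layout that bit is hasOperandStorage, and the 0x10 it contributes is the size of the trailing OperandStorage that precedes the properties; that reading is assumed here rather than proven from this binary. The property-offset idiom is identical across ops:
; Model<tpu::IotaOp>::getInherentAttr (slot 10 @ 0x14ac18a0)
mov 0x2c(%rdi),%ecx ; load the Operation flag word
shr $0x13,%ecx ; move flag bit 23 down to bit 4
and $0x10,%ecx ; → 0x10 if the bit is set, 0 otherwise
lea 0x40(%rdi,%rcx,1),%rsi ; %rsi = &Properties = op + 0x40 (+ 0x10 if the bit is set)
jmp 0x14b220e0 ; TAIL-JMP to the ODS static, below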
The jump target is the concrete op's ODS static, recovered as a demangled symbol:
mlir::tpu::IotaOp::getInherentAttr(
mlir::MLIRContext*,
mlir::tpu::detail::IotaOpGenericAdaptorBase::Properties const&,
llvm::StringRef) @ 0x14b220e0
The four inherent-attribute slots (10–13) all use this direct tail-jmp shape.
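The address arithmetic in the slot-10 thunk is small enough to restate as a function. The sketch below mirrors the four instructions literally and takes no position on what the flag bit means; the operation address used in the example is arbitrary.

```python
FLAG_WORD_OFFSET = 0x2C   # where the thunk loads the flag word from
HEADER_SIZE = 0x40        # base displacement used by the lea


def properties_address(op_addr, flag_word):
    """Mirror of: shr $0x13 ; and $0x10 ; lea 0x40(%rdi,%rcx,1)."""
    extra = (flag_word >> 0x13) & 0x10   # 0x10 when bit 23 is set, else 0
    return op_addr + HEADER_SIZE + extra


op = 0x1000  # arbitrary example address
print(hex(properties_address(op, 1 << 23)))  # bit 23 set
print(hex(properties_address(op, 0)))        # bit 23 clear
```

Only bit 23 of the flag word influences the result; every other bit is masked away by the and.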
Level 3 — optional hooks via UniqueFunction callback-holders
The four optional hooks, foldHook (2), getCanonicalizationPatterns (3), printAssembly (7), and verifyInvariants (8), sit one level further from the op's own methods. foldHook and printAssembly go through an llvm::UniqueFunctionBase<…>::CallbacksHolder<Op<Op,Traits…>::get<Hook>Fn()::lambda>: the slot loads the holder pointer and does call *(%rax) (or call *0x10(%rax) for the print/parse holders). In the listing below, verifyInvariants resolves to the Op<…>::verifyInvariants instantiation rather than to a holder. For IotaOp:
Model<tpu::IotaOp>::foldHook (slot 2 @ 0x14ac1580) → holder @ 0x223413f0
Model<tpu::IotaOp>::printAssembly (slot 7 @ 0x14ac16e0) → holder @ 0x22341400
Model<tpu::IotaOp>::verifyInvariants (slot 8 @ 0x14ac1760) → Op<…>::verifyInvariants @ 0x14ac1de0
This third level is where the full trait and interface list of the op is encoded: the holder symbol carries the entire Op<OpName, Trait1, Trait2, …, Interface::Trait, …> template signature. Reading those template arguments off the getFoldHookFn holder symbol is how the per-op interface fan-out (next section) is recovered — and it is cross-validated against an independent count (the second-layer interface Models).
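Recovering the interface list from a holder symbol comes down to splitting the Op<…> template arguments at top-level commas. The sketch below shows one way to do that; the sample signature is an abbreviated, hypothetical stand-in for a demangled holder symbol (the real one carries more traits), and the abbreviation map follows the names used in the taxonomy section.

```python
def split_template_args(signature):
    """Split the arguments of the first <...> at top-level commas."""
    start = signature.index("<") + 1
    depth, args, current = 1, [], []
    for ch in signature[start:]:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth == 0:
                break            # end of the outer template argument list
        if ch == "," and depth == 1:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    args.append("".join(current).strip())
    return args


ABBREVIATIONS = {
    "MemoryEffectOpInterface": "MemEff",
    "ConditionallySpeculatable": "CondSpec",
    "BytecodeOpInterface": "Bytecode",
    "InferTypeOpInterface": "InferType",
    "OpAsmOpInterface": "OpAsm",
    "RegionBranchTerminatorOpInterface": "RBTerm",
}


def interface_signature(signature):
    """Sorted interface abbreviations found among the template arguments."""
    found = set()
    for arg in split_template_args(signature)[1:]:   # arg 0 is the op itself
        for needle, short in ABBREVIATIONS.items():
            if needle in arg:
                found.add(short)
    return sorted(found)


# Hypothetical, abbreviated signature for illustration only.
SAMPLE = ("mlir::Op<mlir::tpu::IotaOp, mlir::OpTrait::ZeroRegions, "
          "mlir::OpTrait::OneTypedResult<mlir::VectorType>::Impl, "
          "mlir::BytecodeOpInterface::Trait, "
          "mlir::ConditionallySpeculatable::Trait, "
          "mlir::MemoryEffectOpInterface::Trait>")

print(interface_signature(SAMPLE))
```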
GOTCHA: two Model layers, not one. The 23-slot RegisteredOperationName::Model<Op> is the op-registration concept. OpInterface dispatch (e.g. "what side effects does this op have?") goes through a separate per-(op, interface) concept table, mlir::detail::<Iface>InterfaceTraits::Model<Op>, for example MemoryEffectOpInterfaceTraits::Model<mlir::tpu::IotaOp>::getEffects. The 23-slot Model's slot-4 hasTrait is only the gate that answers "does this op implement interface X?"; the interface body lives in the second-layer Model. A reimplementer building op dispatch must build both: the registration vtable here, and the per-interface concept tables (the latter are catalogued for the TPU path but their slot layouts are out of scope for this page; they are the subject of the second-layer Model work).
Caller-side join
The dispatch is consumed at, e.g., mlir::Operation::fold @ 0x1d8cd480, which loads the Impl/Model, then the vptr, then calls slot 2:
mov 0x30(%rdi),%rdi ; %rdi = OperationName::Impl (= the Model)
mov (%rdi),%rax ; %rax = vptr (address point)
call *0x10(%rax) ; +0x10 from address point = slot 2 = foldHook (@ 0x1d8cd4ad)
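The three loads can be replayed against a toy memory image to check the offsets end to end. This is a sketch under stated assumptions: the operation and Impl addresses are made up, and 0x219abb28 is again taken to be the start of the IotaOp vtable, so the vptr is that value plus 0x10.

```python
def fold_hook_target(memory, op_addr):
    """Replay: mov 0x30(%rdi),%rdi ; mov (%rdi),%rax ; call *0x10(%rax)."""
    impl = memory[op_addr + 0x30]   # Operation -> OperationName::Impl (the Model)
    vptr = memory[impl]             # Model object -> address point
    return memory[vptr + 0x10]      # slot 2 = foldHook


OP_ADDR = 0x7000_0000      # hypothetical Operation address
IMPL_ADDR = 0x6000_0000    # hypothetical Model<IotaOp> object address
VTABLE_START = 0x219ABB28  # assumption: _ZTV symbol value for Model<IotaOp>
ADDRESS_POINT = VTABLE_START + 0x10

memory = {
    OP_ADDR + 0x30: IMPL_ADDR,         # op's name field points at the Impl
    IMPL_ADDR: ADDRESS_POINT,          # object's first word is the vptr
    ADDRESS_POINT + 0x10: 0x14AC1580,  # slot 2 holds Model<IotaOp>::foldHook
}

print(hex(fold_hook_target(memory, OP_ADDR)))
```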
The tpu Op-Family Taxonomy
The 86 tpu ops (Model vtables in the range 0x219aa468 … 0x219aed**) group naturally by their OpInterface signature — the set of interfaces each op implements, read from the Op<OpName, Traits…> template arguments in the op's getFoldHookFn holder. The interface abbreviations are: MemEff = MemoryEffectOpInterface, CondSpec = ConditionallySpeculatable, Bytecode = BytecodeOpInterface, InferType = InferTypeOpInterface, OpAsm = OpAsmOpInterface, RBTerm = RegionBranchTerminatorOpInterface.
| #ops | OpInterface signature | representative ops |
|---|---|---|
| 21 | Bytecode, CondSpec, MemEff | AllReduce, BroadcastInSublanes, Gather, Iota, Matmul, MemRefSlice, Pack*, FPToSI, SIToFP, Transpose, TruncF, Unpack*, … |
| 16 | CondSpec, MemEff | AssumeLayout, Bitcast, BitcastVreg, CreateMask, EraseLayout, ExtF, MaskCast, MemRefBitcast/Reshape/Squeeze, Reshape, RollVectors, UnrollVectors, Weird, … |
| 13 | Bytecode (only) | EnqueueDMA, EnqueueIndirectDMA, Log, LogBuffer, MatmulAccLhs/Pop/PushRhs, Scan, SemaphoreSignal, Trace*, WaitDMA2 |
| 11 | (none: pure structural / side-effecting) | AllocaSemaphore, Barrier, Delay, GetBarrierSemaphore, GetInternalScratch, PRNGRandomBits, PRNGSeed32, Region, SemaphoreWait, TraceStop, WaitIndirectDMA |
| 10 | Bytecode, MemEff | Load, ShuffledLoad/Store, Store, StridedLoad/Store, VectorLoad/Store(Idx) |
| 6 | Bytecode, CondSpec, InferType, MemEff | AssumeMultiple, Concatenate, DynamicGather, DynamicRotate, Reciprocal, Rotate |
| 2 | CondSpec, InferType, MemEff | DeviceId, Relayout |
| 2 | Bytecode, InferType | GetIterationBound, SublaneShuffle |
| 1 | InferType, MemEff | FetchAndAddSync |
| 1 | CondSpec, InferType, MemEff, OpAsm | ScanCount |
| 1 | InferType (only) | SemaphoreRead |
| 1 | Bytecode, CondSpec, MemEff, OpAsm | Sort |
| 1 | CondSpec, MemEff, RBTerm | Yield (the tpu.region terminator) |
Aggregating across the 86 ops, the interface fan-out is: MemoryEffect 59, Bytecode 53, ConditionallySpeculatable 48, InferType 13, OpAsm 2 (ScanCount, Sort), RegionBranchTerminator 1 (Yield). The MemoryEffect count is cross-validated: counting the independent second-layer Models MemoryEffectOpInterfaceTraits::Model<mlir::tpu::*> also yields 59.
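The aggregate fan-out follows mechanically from the family table, which makes it easy to re-derive as a consistency check. The sketch below transcribes the thirteen (count, signature) rows and sums them per interface; nothing in it comes from outside the table.

```python
from collections import Counter

# (op count, interface signature) rows transcribed from the family table.
FAMILIES = [
    (21, {"Bytecode", "CondSpec", "MemEff"}),
    (16, {"CondSpec", "MemEff"}),
    (13, {"Bytecode"}),
    (11, set()),
    (10, {"Bytecode", "MemEff"}),
    (6, {"Bytecode", "CondSpec", "InferType", "MemEff"}),
    (2, {"CondSpec", "InferType", "MemEff"}),
    (2, {"Bytecode", "InferType"}),
    (1, {"InferType", "MemEff"}),
    (1, {"CondSpec", "InferType", "MemEff", "OpAsm"}),
    (1, {"InferType"}),
    (1, {"Bytecode", "CondSpec", "MemEff", "OpAsm"}),
    (1, {"CondSpec", "MemEff", "RBTerm"}),
]


def fan_out(families):
    """Number of ops implementing each interface."""
    totals = Counter()
    for count, interfaces in families:
        for interface in interfaces:
            totals[interface] += count
    return totals


total_ops = sum(count for count, _ in FAMILIES)
totals = fan_out(FAMILIES)
print(total_ops, dict(totals))
```

The row counts sum to 86 and the per-interface totals reproduce the figures quoted above, so the table and the aggregate agree with each other.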
The op families read as the TPU's machine-instruction classes: MXU matmul (Matmul, MatmulAccLhs/Pop/PushRhs), DMA (EnqueueDMA, EnqueueIndirectDMA, WaitDMA2, WaitIndirectDMA), semaphores (AllocaSemaphore, SemaphoreSignal/Wait/Read, GetBarrierSemaphore, FetchAndAddSync), vector relayout / pack-unpack (Relayout, RollVectors, UnrollVectors, Pack*, Unpack*, BitcastVreg), memref view ops (MemRefSlice/Bitcast/Reshape/Squeeze, EraseLayout, AssumeLayout), PRNG (PRNGSeed32, PRNGRandomBits), conversions (ExtF, TruncF, FPToSI, FPToUI, SIToFP, UIToFP), and trace/region structure (Trace*, Region, Yield, Barrier, Delay).
Properties-carrying ops
Reading slot 14 (getOpPropertyByteSize) across all 86 ops, 53 tpu ops carry a non-trivial Properties struct (slot 14 returns a non-zero sizeof; the remaining 33 return 0 via xor %eax,%eax;ret). The Properties sizes range from 8 to 40 bytes — most are the inline storage for an op's inherent attributes (e.g. IotaOp's 8-byte dimensions DenseArrayAttr handle, confirmed by the demangled mlir::tpu::IotaOp::getDimensions, readProperties, and writeProperties statics).
A distinguished subset of 9 ops additionally carries an AttrSizedOperandSegments variadic operand-segment size array (verified by the AttrSizedOperandSegments trait on the demangled Op<…> signature): EnqueueDMA, MemRefSlice, SemaphoreSignal, Store, VectorLoad, VectorLoadIdx, VectorStore, VectorStoreIdx, WaitDMA2. The largest Properties structs — EnqueueDMA and Store at 40 bytes, then Matmul/Rotate/UnpackSubelements/VectorStore/WaitDMA2 at 32 bytes — combine several inherent attributes (and, for the segment-bearing ops, the size array) in one inline blob.
The in-dialect rewrite surface is 15 methods
The fold and canonicalize slots (2 and 3) are present on all 86 ops, but each wraps a no-op unless the op defines the method. By demangled-symbol presence of Op::fold / Op::getCanonicalizationPatterns, the entire in-dialect rewrite surface of the tpu dialect is:
- 6 real fold(): BitcastVregOp, EraseLayoutOp, ExtFOp, MemRefSliceOp, ReshapeOp, TruncFOp. (All six confirmed by symbol; TruncFOp::fold shows two symbols: the fold method plus an inner operator()(APFloat const&, bool&) lambda doing the per-element conversion.)
- 9 real getCanonicalizationPatterns(): FPToSIOp, MatmulOp, MemRefBitcastOp, MemRefSliceOp, MemRefSqueezeOp, ShuffledLoadOp, ShuffledStoreOp, UnpackSubelementsOp, UnrollVectorsOp.
Everything else in the tpu dialect is transformed structurally by the dialect-conversion lowering (tpu → LLO Lowering), not rewritten in place. The fold/canonicalize bodies (the rewrite algebra inside, e.g., MatmulOp::getCanonicalizationPatterns) are not decompiled on this page — only the census (which ops rewrite) is established here.
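The census distinguishes methods from ops, and the two lists overlap, which is easy to miss. A short set computation over the two lists above makes the distinction explicit: 15 methods, but fewer distinct ops, because MemRefSliceOp defines both hooks.

```python
# Ops with a real fold() and a real getCanonicalizationPatterns(), as listed above.
FOLD_OPS = {
    "BitcastVregOp", "EraseLayoutOp", "ExtFOp",
    "MemRefSliceOp", "ReshapeOp", "TruncFOp",
}
CANONICALIZE_OPS = {
    "FPToSIOp", "MatmulOp", "MemRefBitcastOp", "MemRefSliceOp",
    "MemRefSqueezeOp", "ShuffledLoadOp", "ShuffledStoreOp",
    "UnpackSubelementsOp", "UnrollVectorsOp",
}
TOTAL_TPU_OPS = 86

rewrite_methods = len(FOLD_OPS) + len(CANONICALIZE_OPS)
ops_with_both = FOLD_OPS & CANONICALIZE_OPS
ops_with_any = FOLD_OPS | CANONICALIZE_OPS
ops_lowered_only = TOTAL_TPU_OPS - len(ops_with_any)

print(rewrite_methods, sorted(ops_with_both), len(ops_with_any), ops_lowered_only)
```

By this count 14 distinct ops carry in-dialect rewrites, leaving 72 of the 86 with neither hook.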
NOTE: the "86" corrects a naive *Op-string scan that over-counts to ~157. A surface RTTI scan for mlir::tpu::*Op symbols returns ~157 hits, but the exact figure from the addOperations<> registrar arity is 86 registered ops, each with a 23-slot Model. The roughly 71 extra hits are symbols the loose pattern matches that are not registered ops: Pass classes and helpers (e.g. LinalgVectorizationPass, MosaicSerdePass, PreCanonicalization, applyLayout) that never reach addOperations and so have no Model. The Compiler Overview deliberately asserts no tpu op count for exactly this reason. For dispatch-ABI purposes, 86 is the authoritative count.
The tpu Dialect in the 6,050-Model Census
The tpu dialect's 86 ops sit inside a binary-wide census of 6,050 Model<Op> instances across 60-plus dialect namespaces (the exact dialect count depends on how the xla sub-namespaces — xla, xla::cpu, xla::xtile — are partitioned), where each dialect's Model count equals its registered-op count. The TPU-relevant dialects:
| dialect | Models | role on the TPU path |
|---|---|---|
| sparse_core | 1471 | SparseCore: 115 ScDialect ops + 1356 LlvmTpu tpu_* intrinsics |
| llo | 325 | LLO: TensorCore low-level machine ops (the layer below tpu) |
| LLVM | 294 | LLVM dialect: SparseCore tail / off the TensorCore path |
| linalg | 99 | Mosaic hlo-conversion stage |
| tpu | 86 | TPU: TensorCore high-level (Mosaic) target dialect ← THIS PAGE |
| arith / math / memref | 52 / 46 / 32 | scalar/math/memref helpers used by tpu, llo, sc |
| xtile | 6 | XLA CPU/GPU tile IR (off the TPU device path) |
| mosaic_sc | 1 | SparseCore Mosaic: RelayoutOp only |
The other 50-odd namespaces (TF 834, spirv 376, ROCDL 350, NVVM 197, the HLO input languages mhlo/vhlo/stablehlo/chlo, etc.) are either input languages legalized away early or off-path build fallout. Two TPU-adjacent points worth flagging for a reimplementer:
- llo (325 ops) is the dialect directly below tpu, and the same 23-slot Model ABI applies. Its bulk (164 ops) carries CondSpec, InferType, MemEff (the scalar/vector ALU), and only llo::ConstantOp defines a real fold() (the LLO constant folder); no llo op defines a canonicalizer, because LLO is the bottom MLIR layer: lowered, not rewritten.
- sparse_core (1471 ops) decomposes into 115 high-level ScDialect ops plus 1356 LlvmTpuDialect tpu_* LLVM-intrinsic ops, the largest single op family in the binary's MLIR layer, each carrying the LLVM-op interfaces (LLVM::AccessGroup, LLVM::AliasAnalysis) because each lowers to an LLVM op. These are catalogued in LlvmTpu Intrinsic Catalog and consumed by LowerToSparseCoreLlvm. They share the identical 23-slot Model ABI documented here.
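A per-namespace census like the one above can be rebuilt by grouping the Model vtable symbols by the namespace of their template argument. The sketch below assumes demangled nm output of the form "address type vtable for mlir::RegisteredOperationName::Model<…>"; the four vtable addresses are the ones quoted on this page, while the symbol-type letters and the exact line format are assumptions.

```python
import re
from collections import Counter

VTABLE_RE = re.compile(
    r"vtable for mlir::RegisteredOperationName::Model<([\w:]+)>\s*$"
)


def census(nm_lines):
    """Count Model<Op> vtables per namespace of the op type."""
    counts = Counter()
    for line in nm_lines:
        match = VTABLE_RE.search(line)
        if match:
            namespace = match.group(1).rsplit("::", 1)[0]
            counts[namespace] += 1
    return counts


# Assumed line format; addresses are the ones quoted on this page.
SAMPLE = [
    "00000000219abb28 V vtable for mlir::RegisteredOperationName::Model<mlir::tpu::IotaOp>",
    "00000000219d4e38 V vtable for mlir::RegisteredOperationName::Model<xla::PureCallOp>",
    "000000002193dba0 V vtable for mlir::RegisteredOperationName::Model<mlir::llo::ConstantOp>",
    "0000000021905130 V vtable for mlir::RegisteredOperationName::Model<mlir::mosaic_sc::RelayoutOp>",
    # A Model method, not a vtable: must not be counted.
    "0000000014ab1960 W mlir::RegisteredOperationName::Model<mlir::tpu::BarrierOp>::getOpPropertyByteSize()",
]

for namespace, count in sorted(census(SAMPLE).items()):
    print(f"{namespace:20s} {count}")
```

Anchoring the pattern on the "vtable for" prefix is what keeps Model method symbols out of the count; a looser match on the Model<…> substring alone would count each op more than twenty times.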
Confidence Summary
| Claim | Evidence |
|---|---|
| Model<Op> is a 23-slot vtable; 6,050 instances binary-wide | nm count of _ZTVN4mlir23RegisteredOperationName5ModelI…EEE = 6,050; reference vtable Model<xla::PureCallOp> @ 0x219d4e38 |
| All 21 dispatch-slot symbols present on a tpu op | every Model<mlir::tpu::IotaOp>::<method> (slots 2–22) resolves to a demangled symbol |
| Slot 0 is one shared base dtor across all Models | addend 0xfea8820 = OperationName::Impl::~Impl on 30-Model stratified sample (tpu/llo/sc/mosaic_sc/xla) |
| tpu dialect has exactly 86 registered ops | Dialect::addOperations<mlir::tpu::AllReduceOp, …> @ 0x14aa2c40 has 86 type args |
| Registration sink is RegisteredOperationName::insert | demangled insert(unique_ptr<OperationName::Impl,…>, ArrayRef<StringRef>) @ 0x1d8c57a0 |
| Inherent-attr slots tail-jmp to ODS statics | Model<IotaOp>::getInherentAttr (slot 10) tail-jmp → mlir::tpu::IotaOp::getInherentAttr(MLIRContext*, …IotaOpGenericAdaptorBase::Properties const&, StringRef) @ 0x14b220e0 |
| Optional hooks wrapped in UniqueFunction callback-holders | Model<IotaOp>::foldHook/printAssembly load holder ptr + call *(%rax)/*0x10(%rax); holders @ 0x223413f0/0x22341400 |
| getOpPropertyByteSize is an inlined sizeof(Properties) | Model<IotaOp> = mov $0x8;ret; Model<BarrierOp> @ 0x14ab1960 = xor %eax,%eax;ret |
| tpu MemoryEffect fan-out = 59 ops | Op<> trait scan = 59; independent MemoryEffectOpInterfaceTraits::Model<mlir::tpu::*> distinct count = 59 |
| tpu in-dialect rewrite surface = 6 fold + 9 canon | demangled mlir::tpu::*Op::fold (6) and *Op::getCanonicalizationPatterns (9) present |
| 53 tpu ops carry non-trivial Properties; 9 of those carry AttrSizedOperandSegments | slot 14 getOpPropertyByteSize non-zero on 53/86 ops (read directly: mov $0xN;ret vs xor;ret); the 9 segment-bearing ops carry the AttrSizedOperandSegments trait; IotaOp = 8-byte dimensions property |
| tpu op interface signatures (family grouping) | read from Op<OpName,Traits…> template args in each op's getFoldHookFn holder symbol |
| Naive *Op-scan ~157 vs 86 registered | 86 = registrar arity; the ~71 extra are unregistered Pass/helper *Op symbols with no Model |
Cross-References
- Compiler Overview: where the tpu dialect sits in the five-phase descent and the six-level IR stack; the convergence point of the optimizer and Mosaic paths.
- MLIR Op-Model Contract: the binary-wide Model<Op> ABI and the full 6,050-Model census; that page owns the contract, this page is its worked tpu-dialect instantiation (relate, do not duplicate).
- tpu → LLO Lowering: how the 86 tpu ops are lowered structurally into the 325-op llo dialect, the layer below this one.
- MHLO → XTile → tpu: the mid-level lowering that produces the tpu ops this page registers.
- DialectConversion Legalizer: the conversion-pattern machinery that consumes tpu ops during lowering.
- tpu Program Serialization: how the property storage (slots 18/19, setPropertiesFromAttr/getPropertiesAsAttr) round-trips through bytecode.
- LlvmTpu Intrinsic Catalog: the 1356 sparse_core::tpu_* intrinsic ops that share this Model ABI.
- LowerToSparseCoreLlvm: the pass that consumes the tpu_* intrinsic Models on the SparseCore path.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part V (Compiler: Lowering & Optimization Passes) / MLIR lowering chain