LlvmTpu Intrinsic Catalog

Every address, symbol, offset, and string on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped, .text/.rodata VMA == file offset). Other versions will differ; treat every VA as version-pinned.

Abstract

The TPU backend has two distinct op surfaces. The TensorCore path is the 86-op tpu target dialect, which lowers to the 322 mlir::llo::*Op classes that descend to LLO bundles (tpu → LLO ODS Lowering covers that descent). This page documents the other surface: the LlvmTpuDialect, a separate MLIR dialect of 1356 tpu_* intrinsic ops that is the bottom-of-stack instruction surface for the SparseCore — the embedding-processing co-processor. Each op is registered as mlir::sparse_core::tpu_X_Y_Z and prints as the LLVM-intrinsic name llvm.tpu.X.Y.Z (underscore→dot). These are not lowered to LLO; they descend through Lower to SparseCore LLVM into honest LLVM-IR intrinsic calls and reach the SparseCore LLVM backend, where addrspacecast ISel and the per-engine ISel match them.

The reader who knows LLVM should hold one analogy: this is a TableGen IntrinsicsTPU.td surface materialised as an MLIR dialect. Every llvm.tpu.* name is an LLVM intrinsic; every name is also a registered MLIR op carrying the standard 23-slot RegisteredOperationName::Model<Op> ABI. The two views are the same set — recovered two independent ways (1356 Model<tpu_*> vtables and 1356 llvm.tpu.* printed-name strings, set-identical under the underscore→dot map). The dominant structural fact is that op identity, not an attribute, encodes the hardware variant: 834 of the 1356 (62 %) are the tpu_stream_* embedding gather/scatter family, a {pattern × verb × dtype × memspace} cross-product where each cell is its own registered op. There is no dtype attribute on a stream op; the dtype is in its name.

This page owns the family taxonomy, the representative per-family signatures, and the ID-table organisation (the 10-batch registrar). It does not re-derive the per-cast address-space ISel (that is addrspacecast ISel), the tpu-dialect ODS (that is tpu → LLO ODS Lowering), or the SparseCore back-end engine encodings (those are the SparseCore ISA slot pages). For reimplementation, the contract is:

The Model↔printed-name correspondence — the mechanical tpu_X_Y ↔ llvm.tpu.X.Y map and why it is 1:1 with zero mismatch.
The 20-class functional taxonomy — a name-family grouping covering 1348 of the 1356 by prefix (the clean-prefix class sizes are byte-confirmed individually, the boundary-class sizes are name-family estimates; see The 20 Classes); each class mapped to its SparseCore engine or LLO/EmitX target.
The ID-table organisation — the 10-batch registerLlvmTpuDialectOperations0..9 registrar, why it is split, and how it differs from the 115-op ScDialect path.
The ODS shape recovery rule — how the typed tpu_*::create arg list is the operand/result declaration, with the verified per-family arities.


Dialect class	`mlir::sparse_core::LlvmTpuDialect` (ctor `LlvmTpuDialectC1EPNS_11MLIRContextE`)
Intrinsic count	1356 (`Model<tpu_*>` vtables == `llvm.tpu.*` strings, set-identical)
Registrar	`registerLlvmTpuDialectOperations` @ `0x146d0560` → calls `…Operations0..8`, then tail-calls `…Operations9`
Printed-name rule	C++ `tpu_X_Y_Z` ↔ intrinsic `llvm.tpu.X.Y.Z` (`_`→`.`); `comm -23` == 0
Address spaces	`LlvmTpuDialect::AddressSpaceDescription(int)` @ `0x135462c0` (base 0xc9=201, span 0x18)
Consuming pass	`LowerToSparseCoreLlvmPass::runOnOperation` @ `0x13566d00`; `lowerFunc` @ `0x13568280`
Dominant class	`tpu_stream_*`: 834 / 1356 (SparseCore embedding gather/scatter engine)
Typed-create coverage	466 / 1356 carry a typed `create` (ODS shape read off the symbol); 890 default-builder
Confidence	CONFIRMED (byte-anchored) for enumeration/taxonomy/registration; per-class HW target as marked

The Two-Surface Model

Purpose

Before the catalog, the reader must understand why a second TPU op dialect exists at all, and why the page that owns the tpu-dialect→LLO descent does not own this one. The split is structural, not incidental: TensorCore and SparseCore are different machines with different ISAs, and the compiler models them as different MLIR dialects.

The split

The tpu dialect (86 ops) is the TensorCore target: matmul/MXU, VPU vector ops, sequencer, descended through the 322 llo.*Op classes (tpu → LLO ODS Lowering) and packed into VLIW bundles. LlvmTpuDialect (1356 ops, this page) is the SparseCore target: an embedding-table stream processor whose primitives are indirect gather/scatter, sflag atomics, circular-buffer registers (CBREG), and address-space casts between the SparseCore memory pools. The SparseCore ops are not packed into LLO bundles — they become LLVM-IR intrinsic calls and are code-generated by the SparseCore LLVM backend.

NOTE — the 1356 intrinsics do not register through the 115-op ScDialect addOperations path (0x14594f60); a dialect inventory that scans only that path sees LlvmTpuDialect as empty. The intrinsics register through the separate 10-batch registerLlvmTpuDialectOperations path documented in ID-Table Organisation — that registrar is where their RegisteredOperationName::Model<Op> vtables are materialised and verified.

The Model↔printed-name correspondence

Every intrinsic exists in the binary twice, and the two forms are mechanically interconvertible:

// (a) the registered MLIR op — a RegisteredOperationName::Model<Op> vtable:
//        _ZTVN4mlir23RegisteredOperationName5ModelINS_11sparse_core11tpu_syncaddEEE
//     1356 such vtables for mlir::sparse_core::tpu_* classes.
// (b) the printed LLVM-intrinsic name string, in .rodata:
//        "llvm.tpu.syncadd"
//     1356 such strings.
//
// The map is purely lexical:
//     class  tpu_X_Y_Z   ->   intrinsic  llvm.tpu.X.Y.Z      // s/^tpu_/llvm.tpu./ ; s/_/./g
// Applying it to the class list and diffing against the string list:
//     comm -23 (mapped-classes) (string-list)  ==  0         // zero mismatch, both ways

So each registered op has exactly one printed llvm.tpu.* name and vice-versa. This is the canonical IntrinsicsTPU.td-derived surface: the C++ class is the ODS-registered MLIR op; the llvm.tpu.* string is the LLVM-intrinsic name it carries into the backend.

GOTCHA — the underscore→dot rewrite is character-level, not token-aware: after the tpu_ → llvm.tpu. prefix swap, every remaining _ becomes a dot, including the ones inside hbm_to_tilespmem. For example, tpu_dma_hbm_to_tilespmem_sc_simple ↔ llvm.tpu.dma.hbm.to.tilespmem.sc.simple. A reimplementer that special-cases _to_ will desync the two name spaces.
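The lexical map and the zero-mismatch check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used to produce the counts; the function names are hypothetical, and the inputs are assumed to be the class-name list and the printed-name string list already extracted from the binary.

```python
CLASS_PREFIX = "tpu_"
INTRINSIC_PREFIX = "llvm.tpu."

def class_to_intrinsic(cls: str) -> str:
    """tpu_X_Y_Z -> llvm.tpu.X.Y.Z; every underscore becomes a dot."""
    if not cls.startswith(CLASS_PREFIX):
        raise ValueError(f"not a tpu_* class name: {cls!r}")
    return INTRINSIC_PREFIX + cls[len(CLASS_PREFIX):].replace("_", ".")

def intrinsic_to_class(name: str) -> str:
    """llvm.tpu.X.Y.Z -> tpu_X_Y_Z; the inverse, equally token-blind."""
    if not name.startswith(INTRINSIC_PREFIX):
        raise ValueError(f"not an llvm.tpu.* name: {name!r}")
    return CLASS_PREFIX + name[len(INTRINSIC_PREFIX):].replace(".", "_")

def mismatches(classes, strings):
    """Return (mapped classes with no string, strings with no class)."""
    mapped = {class_to_intrinsic(c) for c in classes}
    printed = set(strings)
    return sorted(mapped - printed), sorted(printed - mapped)
```

Both returned lists being empty corresponds to the "zero mismatch, both ways" result: the first list is the `comm -23` direction, the second is its mirror image.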

Functional-Class Taxonomy

Purpose

The 1356 intrinsics group into 20 functional classes by name-family prefix. The large clean-prefix classes (stream 834, pack/unpack 87, vld/vst 74, wait 47, dma 40, scan 32, vcvt 28, sync 25, alloca 24, rdreg/setreg 21, addrspacecast 16, sort 10, i1 5) are byte-confirmed by rg -c over the printed-name string set; the smaller boundary classes (transcendental/EUP, scalar-ALU, ptr/addressing, shuffle, trace, CBREG, task) are name-family estimates whose exact membership is not cleanly grep-isolable. The 20 sizes are not an exact partition — they cover 1348 of 1356 (see The 20 Classes for the eight-op shortfall). This is the page's central reimplementation artifact: it tells a reader which SparseCore hardware unit each named intrinsic drives, without dumping 1356 rows. The taxonomy is recovered by name-family grouping cross-checked against the printed-name string set; the per-class hardware target is joined to the SparseCore engine pages and the V5+ EmitX bit-position work.

The 20 classes

#ops = the class size; HW target = the SparseCore engine / LLO slot it lowers to; st = C (encoding byte-confirmed in a sibling page) or I (engine confirmed, per-leaf opcode inferred).

#ops	functional class (`tpu_*` prefix)	HW target / lowering	st
834	sparsecore-stream (`stream_*`)	SparseCore stream/scatter engine descriptor (linear/strided/indirect × gather/scatter/vreg, CBREG-windowed); → LLVM call via LowerToSparseCoreLlvm	I
87	pack / unpack (`pack`, `unpack`)	VPU pack/unpack slot → `llo.vpack`/`llo.vunpack`; sub-byte staging (b32→b16→b8→b4→b2→b1, e4m3/e5m2/s8/u8) for MXU/quant	C
74	vector load/store (`vld`, `vst`)	VPU mem slot → `llo.vector_{load,store}[_masked]`; `_idx` indexed, `_cb` CBREG-windowed, `_strided`, `_add` scatter, `_np` no-predicate	C
47	semaphore wait/watch (`wait`, `watch`)	sflag VWait slot → `llo.vwait.{eq,ne,lt,le,gt,ge}[.done]`; `_yieldable`=seq-yield, `_imem`=instr-mem, `ordone`=or-done	C
40	DMA descriptor (`dma_*_sc_*`)	SparseCore DMA engine cmd (simple/single_strided/general tiers); 8/10/16-operand descriptor; cf ScDialect DmaSimpleStart	C
32	scan / segment-scan / reduce (`{add,min,max}_scan`)	SparseCore scan unit (add/max/min × full/half/1xN/2xN × seg × index) — embedding-aggregation primitives	I
28	vector convert (`vcvt`, `cvt`)	VPU convert slot → `llo.vcvt.*`; f32↔bf16/bf8/hf16/if8/s4/s8/u4/u8; `_sr`=stochastic-round, `_pr`=probabilistic	C
25	semaphore set/add (`syncadd`, `syncset`, `sfence`, `sync{donemov,pamov,readpa,setpa}`)	sflag VSync slot → `llo.vsync.{add,set}[.done,.remote]`; `_remote`=ICI peer, `_tile`/`_pa`=tile/public bank, `_doneinv`=invert done	C
24	transcendental / EUP (`rcp`,`rsqrt`,`tanh`,`sin`,`cos`,`erf`,`log2`,`pow2`,`sigshft`)	EUP VALU3 push (Alu3 op0 + 5-bit selector) + PopEupResult; each bare + `_macro` push+pop pair	C
24	alloca / allocate (`alloca`, `allocate`)	SparseCore allocator (smem/spmem/vmem/sflag/hbm/iova/timem/tilesmem/tilespmem/dreg/cbreg + `_dyn` + `_any`); cf ScDialect Alloca	C
21	control register rd/set (`rdreg_`, `setreg_`)	scalar RdReg/SetReg → SCS scalar slot; cycle counters, tid/scid/tag id regs, fsr/ddr/dmacrdt/sflagrange	C
19	pointer / addressing / loop-bc (`inttoptr`,`ptrtoint`,`bc_*`,…)	LLVM inttoptr/ptrtoint/addrspace + loop bytecode (`bc_loop_*`, `bc_load/store_aliaddr`, `bc_select_predicate`, `make_restrict_ptr`)	C
16	addrspacecast (`addrspacecast*`)	LLVM addrspacecast → SparseCore address-space ID (base 201; AddressSpaceDescription @ `0x135462c0`)	C
14	scalar ALU / scalar mem (`shll`,`shra`,`sadd_ov`,`addcarry`,…)	SCS scalar slot: shll/shra/shrl/sshllo shifts, sadd_ov/ssub_ov/sshla_ov overflow-add, addcarry, add_{high,low}_f32	C
14	lane / sublane shuffle / permute (`vrot_sublane`,`vperm`,`sc_permute`)	VPU cross-lane slot → `llo.vrot.slane`/`llo.vperm.sublane`/`llo.vslaneseq`/`vshift_insert`/`sc_mask_permute`	C
12	CBREG / circular-buffer (`rdcbreg`,`wrcbreg`,`cbreg_add`,…)	scalar CBREG ops: rd/wr/add/copy CBREG {base,offset,size}; 16 CBREGs/bank, dual smem/tilespmem base, allocate_cbreg	C
12	trace / telemetry / sc-control (`sc_`,`strace`,`event`,`read_cycle`)	SparseCore control/trace: strace/event/spill_debug/log/mprefix, read_{global,local}_cycle_count, ssetpstate/ssettm	C
10	sort / unique / dupcount (`sort*`,`uniquef`,`dupcntf`,`vmpcnt`)	SparseCore sort/unique unit: sort_{asc,dsc}d{f,i}, uniquef/i, dupcntf/i, vmctz/vmpcnt_ones (embedding dedup)	I
10	task / control / structural (`task_`,`loop_`,`barrier`,`nop`,…)	SparseCore tile-task + structural: task_dispatch, loop_{name,parallel}, barrier, nop, delay, clear_ibuf, halt_trap, tileid	C
5	i1-mask width conversion (`i1_to_i1`)	VPU mask slot: `tpu_{8,16,32}i1_to_{8,16,32}i1` — vector-mask width re-pack	C

834+87+74+47+40+32+28+25+24+24+21+19+16+14+14+12+12+10+10+5  =  1348   (of 1356)

NOTE — the clean-prefix class sizes are byte-confirmed individually (rg -c over the llvm.tpu.* string set) and the boundary-class sizes are name-family estimates; in both cases the 20 rows are a name-family grouping, not an exact partition: they sum to 1348, eight short of the 1356 total. The shortfall is in the soft boundary classes (transcendental/EUP, scalar-ALU, ptr/addressing, trace, task), where a handful of ops (e.g. exponent, *_macro pairs, and miscellaneous control/structural intrinsics) fall outside any single prefix row. Treat the taxonomy as covering 1348 of 1356 by prefix, not as a verified 1:1 partition. The sync class breakdown: 9 syncadd* + 12 syncset* + sfence + 4 singletons, less one because syncsetpa matches both the syncset* prefix and the singleton list, = 25.

QUIRK — the five largest classes (stream, pack/unpack, vld/vst, wait, dma) are 834+87+74+47+40 = 1082 of 1356: 80 % of the surface is SparseCore data movement and its synchronisation, not compute. The transcendental/EUP class is 24 ops and the scalar-ALU class 14; SparseCore is a gather/scatter/sflag machine that does very little arithmetic of its own. A reimplementer budgeting effort should expect the stream-engine descriptor to dominate the work, not the math.
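The arithmetic behind the coverage figure and the 80 % share can be re-derived from the table. The sketch below simply transcribes the 20 class sizes from the table above; the short labels are informal names chosen for this example, not identifiers from the binary.

```python
TOTAL_INTRINSICS = 1356

# (informal label, #ops) transcribed from the 20-class table
CLASS_SIZES = [
    ("stream", 834), ("pack/unpack", 87), ("vld/vst", 74), ("wait/watch", 47),
    ("dma", 40), ("scan", 32), ("vcvt", 28), ("sync", 25),
    ("transcendental", 24), ("alloca", 24), ("rdreg/setreg", 21),
    ("ptr/addressing", 19), ("addrspacecast", 16), ("scalar-alu", 14),
    ("shuffle", 14), ("cbreg", 12), ("trace", 12), ("sort", 10),
    ("task", 10), ("i1-mask", 5),
]

covered = sum(n for _, n in CLASS_SIZES)            # ops reached by prefix
shortfall = TOTAL_INTRINSICS - covered              # ops outside every row
largest = sorted(CLASS_SIZES, key=lambda kv: kv[1], reverse=True)[:5]
top5 = sum(n for _, n in largest)
top5_share = top5 / TOTAL_INTRINSICS
stream_share = 834 / TOTAL_INTRINSICS

print(f"covered {covered}, shortfall {shortfall}")
print(f"top five: {top5} ({top5_share:.1%}), stream alone: {stream_share:.1%}")
```

The exact shares are slightly under the rounded figures quoted in the text (about 79.8 % and 61.5 %), which is worth keeping in mind when comparing against another build.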

The Stream Family (834 ops)

Purpose

The tpu_stream_* family is 62 % of the intrinsic surface and is the single most important SparseCore primitive: the embedding-table gather/scatter engine. Its 834-way explosion is not 834 unrelated ops — it is a four-axis cross-product where each cell is a distinct registered intrinsic so the per-(pattern,verb,dtype,memspace) stream-engine command is selected by op identity at the IR level, not by an attribute decoded at runtime.

The cross-product axes

llvm.tpu.stream.<pattern>.[vreg.]{gather|scatter}.[cb.][add.]<dtype>.<src>.to.<dst>

  axis            values                                          how it splits the 834
  --------------------------------------------------------------------------------------
  pattern         linear (180) · strided (114) · indirect (540)   address-generation mode
  verb            vreg (360) · gather (246) · scatter (228)        direction / lane-spread
  dtype           bf16 · e4m3 · e5m2 · f32 · s16 · s32 (6)         exactly 111 ops each (6×111=666 dtyped)
  transfer        _to_tilespmem 399 · _to_spmem 210 · _to_smem 27  src→dst memspace pair
                  · _to_hbm4b 15 · _to_hbm 15
  modifier        _cb_ (CBREG-windowed, 556) · _add (scatter-add) · _np (no-predicate)

The indirect pattern dominates (540) because that is the embedding-lookup mode: an index stream selects rows. The _cb_ (CBREG-windowed) modifier appears on 556 of the 834 — the windowed forms use the INDIRECT_OFFSET_SOURCE_CBREG source for embedding-table windowing (see CBREG). gather pulls rows into tile-local SPMEM (hence _to_tilespmem is the largest transfer bucket); scatter/scatter-add pushes accumulated gradients back out.
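Because the hardware variant is carried by the op name, a reimplementation needs a name decoder rather than an attribute reader. The following is a sketch of such a decoder for the class-name form of the grammar shown above. The regular expression is an assumption derived only from that grammar line: names that do not fit it (for instance any form without a `<src>_to_<dst>` suffix, or with a modifier in a position the grammar does not show) are returned as `None` rather than guessed at. The tallies are transcribed from the axis table.

```python
import re

# Class-name form of: stream.<pattern>.[vreg.]{gather|scatter}.[cb.][add.]<dtype>.<src>.to.<dst>
# ASSUMPTION: dtype is treated as optional because only 666 of the 834 are dtyped.
STREAM_RE = re.compile(
    r"^tpu_stream_(?P<pattern>linear|strided|indirect)_"
    r"(?P<vreg>vreg_)?(?P<verb>gather|scatter)_"
    r"(?P<cb>cb_)?(?P<add>add_)?"
    r"(?:(?P<dtype>bf16|e4m3|e5m2|f32|s16|s32)_)?"
    r"(?P<src>[a-z0-9]+)_to_(?P<dst>[a-z0-9]+)$"
)

def parse_stream(cls):
    """Split a tpu_stream_* class name into its axes, or None if it does not fit."""
    m = STREAM_RE.match(cls)
    if m is None:
        return None
    return {
        "pattern": m["pattern"],
        "vreg": m["vreg"] is not None,
        "verb": m["verb"],
        "cb": m["cb"] is not None,
        "add": m["add"] is not None,
        "dtype": m["dtype"],
        "src": m["src"],
        "dst": m["dst"],
    }

# Per-axis tallies from the table; each full axis should account for all 834.
AXIS_TALLIES = {
    "pattern": (180, 114, 540),
    "verb": (360, 246, 228),
    "dtype": (111,) * 6,
    "transfer": (399, 210, 27, 15, 15),
}
```

Summing the tallies shows that the pattern and verb axes each cover all 834 ops, while the dtype and transfer axes both cover the same 666, which is consistent with the dtyped ops being the ones that name a memspace pair.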

The pipeline this encodes

        HBM embedding table                         tile-local SPMEM
        ┌───────────────┐    indirect gather        ┌──────────────┐
        │ row 0         │ ── (CBREG-windowed) ─────▶ │ gathered rows│
        │ row 1         │    indexed by offset       │              │
        │ …             │    stream                  └──────┬───────┘
        └───────────────┘                                   │ compute (forward)
                ▲                                            ▼
                │  scatter-add (_add)                 ┌──────────────┐
                └──────── accumulate gradients ────── │ updated rows │
                                                      └──────────────┘
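The data flow in the diagram can be modelled functionally in plain Python. This is only a semantic sketch of indirect gather and scatter-add over a list-of-rows table; it makes no claim about CBREG windowing, dtypes, memory spaces, or the ordering and atomicity the hardware provides, and the function names are hypothetical.

```python
def indirect_gather(table, indices):
    """Copy the rows selected by an index stream (the embedding lookup)."""
    return [list(table[i]) for i in indices]

def scatter_add(table, indices, updates):
    """Accumulate each update row into the table row its index selects.

    Repeated indices accumulate, which is what gradient write-back needs
    when the same embedding row was looked up more than once.
    """
    for i, row in zip(indices, updates):
        table[i] = [a + b for a, b in zip(table[i], row)]
    return table

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
rows = indirect_gather(table, [2, 0, 2])          # forward: pull rows 2, 0, 2
grads = [[1.0, 1.0] for _ in rows]                # stand-in gradients
scatter_add(table, [2, 0, 2], grads)              # backward: push them back
```

After the round trip, row 2 has received two updates and row 1 none, which is the accumulate-on-duplicate behaviour the `_add` forms exist for.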

Representative signature

// tpu_stream_linear_gather_add_f32_hbm_to_tilespmem
//   ::create(OpBuilder&, Location, Value, Value, Value, Value, Value, Value)   // 6 operands
// tpu_stream_indirect_gather_add_bf16_hbm_to_tilespmem
//   ::create(OpBuilder&, Location, Value ×8)                                    // 8 operands
//
// The operand count tracks the pattern: linear/strided forms carry 6 SSA operands
// (src/dst handle + base/offset/size + sflag); the indirect forms carry 8
// (the extra two are the index/offset-stream + its CBREG window).

NOTE — the per-leaf mapping from intrinsic to numeric stream-engine command is INFERRED at the class→engine level only, for all 834. The class maps to the SparseCore stream engine and the descriptor operand shape is recovered byte-exactly; the per-(pattern,verb,dtype,memspace) HW command value belongs to the SparseCore stream ISA (a sibling of the DMA descriptor) and is not individually byte-dumped for each of the 834. See Stream Gather/Scatter.

ODS Shape Recovery

Purpose

The operand/result declaration of each intrinsic is recoverable from the binary without TableGen source, the same way tpu → LLO ODS Lowering recovers LLO signatures: the typed create symbol's demangled argument list is the ODS declaration in source order.

The recovery rule

// create(OpBuilder&, Location, <ODS args in declaration order>)
//   mlir::Type   -> 1 explicit result type (inference-free op)
//   mlir::Value  -> 1 SSA operand
// Cross-checked against the op's NOperands<N> trait, which pins the operand count
// independently of the create arg list (e.g. tpu_syncadd carries NOperands<2>).
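Applied mechanically, the rule amounts to counting parameter types in a demangled `create` signature. The sketch below assumes the signature text looks like typical `c++filt` output, with fully qualified `mlir::Type` and `mlir::Value` parameters; the example strings are written by hand for illustration, not copied from the binary, and the function name is hypothetical.

```python
def ods_shape(demangled):
    """Return (n_results, n_operands) read off a demangled create(...) signature."""
    args = demangled[demangled.index("(") + 1 : demangled.rindex(")")]
    params = [p.strip() for p in args.split(",") if p.strip()]
    # The first two parameters are always the builder and the location.
    if len(params) < 2 or "OpBuilder" not in params[0] or "Location" not in params[1]:
        raise ValueError("not a create(OpBuilder&, Location, ...) signature")
    ods_args = params[2:]
    n_results = sum(p == "mlir::Type" for p in ods_args)    # explicit result types
    n_operands = sum(p == "mlir::Value" for p in ods_args)  # SSA operands
    return n_results, n_operands

# Hand-written example signatures (ASSUMED formatting):
SYNCADD = ("mlir::sparse_core::tpu_syncadd::create("
           "mlir::OpBuilder&, mlir::Location, mlir::Value, mlir::Value)")
ADDRSPACECAST = ("mlir::sparse_core::tpu_addrspacecast::create("
                 "mlir::OpBuilder&, mlir::Location, mlir::Type, mlir::Value)")
```

For `tpu_syncadd` the operand count obtained this way can then be compared with the `NOperands<2>` trait as the independent cross-check.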

466 of the 1356 carry a typed create; the other 890 use the generic (TypeRange, ValueRange, ArrayRef<NamedAttribute>) default builder (result type inferred via SameOperandsAndResultType/InferType, operand count by name-family arity). The typed forms verified against the binary:

intrinsic (class)	`create` args (after Location)	operands
`tpu_addrspacecast` (addrspacecast)	`Type, Value`	1 res, 1 opnd
`tpu_addrspacecast_smem` / `_spmem` / `_*_tec`	`Type, Value, Value`	+ tile-window/base
`tpu_dma_*_sc_simple` (DMA)	`Value ×8`	8-field descriptor
`tpu_dma_*_sc_single_strided` (DMA)	`Value ×10`	+ stride
`tpu_dma_*_sc_general` (DMA)	`Value ×16`	multi-dim
`tpu_stream_linear_*` (stream)	`Value ×6`	stream descriptor
`tpu_stream_indirect_*` (stream)	`Value ×8`	+ index/offset stream
`tpu_syncadd` (sync)	`Value, Value`	sflag, delta
`tpu_fetch_and_add` (sync)	`Value, Value, Value`	sflag, addr, value
`tpu_*_macro` (EUP, e.g. `tpu_sin_macro`)	`Type, Value`	1 res, 1 opnd
`tpu_rdcbreg_offset` / `_size` (CBREG)	`Type, Value`	result, cbreg
`tpu_wrcbreg_offset` (CBREG)	`Type, Value, Value`	cbreg, value
`tpu_inttoptr` / `tpu_ptrtoint` (ptr)	`Type, Value`	result, val

The DMA _single_strided tier carries 10 SSA operands: the demangled tpu_dma_hbm_to_hbm_sc_single_strided::create arg list is Value + 9×S5_ (a back-reference to the prior Value type), placing the stride tier two operands above the _simple 8 and below the _general 16. The stream family is likewise not uniform — linear/strided patterns carry 6 operands, but indirect patterns carry 8, the two extras being the index/offset stream and its CBREG window.
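The family-level arities above lend themselves to a small lookup that a reimplementation could use as a sanity check when building ops. This sketch encodes only the DMA tiers and stream patterns stated on this page; it expresses the family rule, not a per-op verification, and the function name is hypothetical.

```python
# Operand counts per DMA tier and per stream pattern, as stated above.
DMA_TIER_OPERANDS = {"simple": 8, "single_strided": 10, "general": 16}
STREAM_PATTERN_OPERANDS = {"linear": 6, "strided": 6, "indirect": 8}

def expected_operands(cls):
    """Family-level SSA operand count for a tpu_dma_* or tpu_stream_* class, else None."""
    if cls.startswith("tpu_dma_"):
        for tier, count in DMA_TIER_OPERANDS.items():
            if cls.endswith("_sc_" + tier):
                return count
        return None
    if cls.startswith("tpu_stream_"):
        pattern = cls.split("_")[2]          # tpu, stream, <pattern>, ...
        return STREAM_PATTERN_OPERANDS.get(pattern)
    return None
```

Returning `None` for anything outside the two families keeps the helper from implying an arity for the default-builder ops, whose operand counts are not exhaustively recovered.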

ID-Table Organisation

Purpose

A reimplementer must reproduce how 1356 ops register into one dialect. The answer is the 10-batch registrar — a TableGen pattern that splits one logical addOperations<…> into ten functions to bound per-function instantiation size. This is the page's "ID-table" structure: the intrinsic IDs are assigned in registration order across the ten batches.

The registrar

LlvmTpuDialect::initialize
  └─ registerLlvmTpuDialectOperations   @0x146d0560   ── calls 10 sub-registrars, the last as a tail call:
       ├─ registerLlvmTpuDialectOperations0   @0x146d05c0
       ├─ registerLlvmTpuDialectOperations1   @0x1472bea0
       ├─ registerLlvmTpuDialectOperations2   @0x1478b500
       ├─ registerLlvmTpuDialectOperations3   @0x147e1c40
       ├─ registerLlvmTpuDialectOperations4   @0x14835b80
       ├─ registerLlvmTpuDialectOperations5   @0x148891c0
       ├─ registerLlvmTpuDialectOperations6   @0x148dc3c0
       ├─ registerLlvmTpuDialectOperations7   @0x1492d5c0
       ├─ registerLlvmTpuDialectOperations8   @0x14982d00
       └─ registerLlvmTpuDialectOperations9   @0x149d88e0   (last batch)
                                                ── 10 × ~135 ops  =  1356

Each batch is a variadic addOperations<tpu_X, tpu_Y, …> that materialises one RegisteredOperationName::Model<Op> per template argument — the standard MLIR op-registration ABI, just split across ten functions. There is no separate numeric "intrinsic ID enum" table in this dialect: the op's identity is its TypeID/RegisteredOperationName, and the printed llvm.tpu.* name is the ID the LLVM translator emits. The Model vtables span a contiguous .data.rel.ro region (tpu_16i1_to_32i1 … tpu_wrcbreg_tilespmem_base).

QUIRK — the ≤256-op batch split is a TableGen artifact, not a semantic grouping. Batch N does not correspond to functional class N; ops from the same family (e.g. the 834 streams) are scattered across all ten batches in alphabetical-class order. A reimplementer must not assume batch membership carries meaning beyond "register these ~135 ops here."
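The point that batch membership carries no meaning can be shown with a toy registry. In this sketch the op names are synthetic placeholders and the even chunking is an assumption (the real per-batch sizes are not stated here); what matters is that the resulting name-keyed registry is the same however the list is split.

```python
def split_batches(names, n_batches=10):
    """Chunk a name list into n_batches roughly equal batches (ASSUMED even split)."""
    size = -(-len(names) // n_batches)      # ceiling division
    return [names[i:i + size] for i in range(0, len(names), size)]

def register_all(batches):
    """Register every op by name; identity is the name, not the batch."""
    registry = {}
    for batch in batches:
        for cls in batch:
            if cls in registry:
                raise ValueError(f"duplicate registration: {cls}")
            printed = "llvm.tpu." + cls[len("tpu_"):].replace("_", ".")
            registry[cls] = printed
    return registry

# Synthetic stand-ins for the 1356 class names (hypothetical, for illustration only).
names = [f"tpu_op_{i:04d}" for i in range(1356)]
ten = register_all(split_batches(names, 10))
three = register_all(split_batches(names, 3))
```

Registering through ten batches or through three yields identical registries, which is the sense in which the split is a build artifact rather than a semantic grouping.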

Registration binding

function registerLlvmTpuDialectOperations(LlvmTpuDialect *d):   // 0x146d0560
    registerLlvmTpuDialectOperations0(d);                       // each batch:
    …                                                           //   addOperations<tpu_A, tpu_B, …>(d)
    return registerLlvmTpuDialectOperations9(d);                //   -> RegisteredOperationName::insert ×~135

The dialect class is mlir::sparse_core::LlvmTpuDialect (ctor LlvmTpuDialectC1EPNS_11MLIRContextE). Its companion helpers — AddressSpaceDescription(int) @ 0x135462c0, GetAnyTypeFromAddressSpace(int), the attr parse/print pair, and the target-CPU attribute predicates — are the dialect-level machinery the intrinsics rely on for memory-space and target resolution.
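The base and span quoted for `AddressSpaceDescription(int)` suggest the usual compiled-switch shape: subtract the base, bounds-check against the span, then index a jump table. The sketch below models only that normalisation step. It rests on the assumption that span 0x18 means 24 consecutive case IDs starting at 201, and it does not attempt to name the address space behind each index.

```python
AS_BASE = 0xC9   # 201, the switch base quoted for AddressSpaceDescription(int)
AS_SPAN = 0x18   # 24; ASSUMPTION: the number of consecutive case IDs

def switch_index(address_space):
    """Zero-based case index, or None when the ID is outside the assumed window."""
    idx = address_space - AS_BASE
    return idx if 0 <= idx < AS_SPAN else None
```

Under that assumption the described IDs run from 201 through 224, and anything else falls to the default arm.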

Bridging into the LLVM Backend

Purpose

The intrinsics do not stop at MLIR. The LowerToSparseCoreLlvmPass rewrites each tpu_* op into its LLVM-dialect form so the SparseCore LLVM backend can code-generate it. This section names the bridge and points to the pages that own each lowering arm.

The consuming pass

xla::tpu::sparse_core::CreateLowerToSparseCoreLlvmPass     @0x135667c0  ── factory
  └─ LowerToSparseCoreLlvmPass::runOnOperation             @0x13566d00  ── driver
       └─ lowerFunc                                        @0x13568280  ── per-op rewrite

Per class, the lowering arm differs:

class	lowers to	owned by
addrspacecast	LLVM `addrspacecast` / `INTRINSIC_WO_CHAIN` keyed by intrinsic ID	addrspacecast ISel
transcendental / EUP	EUP VALU3 push (`Alu3` op0 + 5-bit selector) + `PopEupResult`	EUP Transcendental Slot
CBREG	scalar CBREG ops (ReadCbreg 0x36 / WriteCbreg 0x35 / AddCbreg 0x33)	CBREG
sparsecore-stream	per-(pattern,verb,dtype,memspace) stream-engine descriptor	Stream Gather/Scatter
sync / wait	sflag VSync/VWait LLO ops	VPU Slot
pack/unpack, vcvt, vld/vst, permute	VPU slot ops (`llo.v*`)	VPU Slot
scan / sort / unique	SparseCore scan/sort/dedup units	Scan Datapath, Rank & Permute / Radixsort

NOTE — the addrspacecast arm is not an ISD::ADDRSPACECAST(0xf4) lowering. The SC cast intrinsics survive into LLVM-IR as intrinsic calls (INTRINSIC_WO_CHAIN), distinct from the TensorCore front-end's real addrspacecast instructions. See addrspacecast ISel for the full per-cast from→to map and why wiring the SC casts into LowerADDRSPACECAST produces a backend that traps with CannotYetSelect.
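For navigation, the per-class table can be turned into a prefix lookup from a class name to the page that owns its lowering arm. This sketch covers only prefixes that the tables on this page spell out; boundary classes such as the transcendental ops have no clean prefix and deliberately come back as `None`. The function and table names are hypothetical.

```python
# (name prefixes after "tpu_", owning page), checked in order.
LOWERING_OWNER = [
    (("addrspacecast",), "addrspacecast ISel"),
    (("stream_",), "Stream Gather/Scatter"),
    (("rdcbreg", "wrcbreg", "cbreg_"), "CBREG"),
    (("syncadd", "syncset", "sfence", "wait", "watch"), "VPU Slot"),
    (("pack", "unpack", "vcvt", "vld", "vst"), "VPU Slot"),
]

def lowering_owner(cls):
    """Page owning the lowering arm for a tpu_* class, or None if not prefix-isolable."""
    if not cls.startswith("tpu_"):
        raise ValueError(f"not a tpu_* class name: {cls!r}")
    stem = cls[len("tpu_"):]
    for prefixes, owner in LOWERING_OWNER:
        if stem.startswith(prefixes):       # str.startswith accepts a tuple
            return owner
    return None
```

A `None` result is a prompt to consult the taxonomy table by hand rather than a statement that the op has no lowering.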

What Is Not Recovered

Honest gaps, for the reimplementer:

Per-leaf stream-engine command (834 ops). Class→engine confirmed and the descriptor operand shape recovered; the numeric per-(pattern,verb,dtype,memspace) HW command value is not individually byte-dumped. This is the SparseCore stream ISA, sibling to the DMA descriptor.
Per-intrinsic LLVM IntrProperties bitset. Which OpInterfaces each op carries is observable from the interface-Model symbols (AliasAnalysis 546, MemoryEffect 285, Bytecode 188, AccessGroup 180); the exact per-op IntrNoMem/IntrArgMemOnly/IntrWillReturn bitset that drives backend scheduling/aliasing is not transcribed.
The 890 default-builder ops' exact arity + result TypeConstraint. Recovered by name family and the NOperands<N> trait where present; the per-op verifyInvariantsImpl byte-decode (1-vs-2 operands, Vreg/Mask/Scalar/Ptr result predicate) is not exhaustively walked.
The complete numeric address-space ID table. The AddressSpaceDescription switch base (201) and sampled case strings (HBMAny, SflagAny, SflagTile, TileSmem, plus Smem/Sflag/HBM/Dreg) are decoded; the full ID↔space map and the 16 addrspacecast from→to ID pairs are completed on addrspacecast ISel.
The scan/sort/unique-engine opcode encodings (32+10 ops). Mapped to the SparseCore scan/sort/dedup units by name; the per-op HW command bit layout is not decoded (these are SparseCore-specific compute units, not TensorCore LLO slots).

Cross-References

tpu → LLO ODS Lowering — the other surface: the 86-op tpu dialect lowered to the 322 llo.*Op classes; the ODS-from-create recovery rule this page reuses
The TPU Compiler — Part V orientation; where the SparseCore lowering sits in the five-phase descent
Lower to SparseCore LLVM — the consuming LowerToSparseCoreLlvmPass that turns these intrinsics into LLVM-IR
Dot/Conv MXU Lowering — the TensorCore matmul lowering, contrast to the SparseCore stream engine
addrspacecast ISel — the 16 addrspacecast intrinsics at instruction selection; full from→to address-space map
Stream Gather/Scatter — the 834-op stream family's engine-side detail
CBREG — the 16-CBREG-per-bank circular buffer the CBREG and _cb_ stream intrinsics drive
Scan Datapath — the scan/segment-scan engine the 32 scan intrinsics target
Rank & Permute / Radixsort — the sort/unique/dedup unit the 10 sort intrinsics target
EUP Transcendental Slot — the EUP VALU3 push/pop the 24 transcendental intrinsics lower to
VPU Slot — the VPU slot ops the pack/unpack, convert, load/store, permute, and sync/wait intrinsics target
MXU Slot — the MXU bundle slot the pack/unpack staging feeds

libtpu Internals — Reverse-Engineering Reference