LlvmTpu Intrinsic Catalog

Every address, symbol, offset, and string on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d, 781,691,048 bytes, not stripped, .text/.rodata VMA == file offset). Other versions will differ; treat every VA as version-pinned.

Abstract

The TPU backend has two distinct op surfaces. The TensorCore path is the 86-op tpu target dialect, which lowers to the 322 mlir::llo::*Op classes that descend to LLO bundles (tpu → LLO ODS Lowering covers that descent). This page documents the other surface: the LlvmTpuDialect, a separate MLIR dialect of 1356 tpu_* intrinsic ops that is the bottom-of-stack instruction surface for the SparseCore — the embedding-processing co-processor. Each op is registered as mlir::sparse_core::tpu_X_Y_Z and prints as the LLVM-intrinsic name llvm.tpu.X.Y.Z (underscore→dot). These are not lowered to LLO; they descend through Lower to SparseCore LLVM into honest LLVM-IR intrinsic calls and reach the SparseCore LLVM backend, where addrspacecast ISel and the per-engine ISel match them.

The reader who knows LLVM should hold one analogy: this is a TableGen IntrinsicsTPU.td surface materialised as an MLIR dialect. Every llvm.tpu.* name is an LLVM intrinsic; every name is also a registered MLIR op carrying the standard 23-slot RegisteredOperationName::Model<Op> ABI. The two views are the same set — recovered two independent ways (1356 Model<tpu_*> vtables and 1356 llvm.tpu.* printed-name strings, set-identical under the underscore→dot map). The dominant structural fact is that op identity, not an attribute, encodes the hardware variant: 834 of the 1356 (62 %) are the tpu_stream_* embedding gather/scatter family, a {pattern × verb × dtype × memspace} cross-product where each cell is its own registered op. There is no dtype attribute on a stream op; the dtype is in its name.

This page owns the family taxonomy, the representative per-family signatures, and the ID-table organisation (the 10-batch registrar). It does not re-derive the per-cast address-space ISel (that is addrspacecast ISel), the tpu-dialect ODS (that is tpu → LLO ODS Lowering), or the SparseCore back-end engine encodings (those are the SparseCore ISA slot pages). For reimplementation, the contract is:

  • The Model↔printed-name correspondence: the mechanical tpu_X_Y → llvm.tpu.X.Y map and why it is 1:1 with zero mismatch.
  • The 20-class functional taxonomy: a name-family grouping covering 1348 of the 1356 by prefix (clean-prefix class sizes byte-confirmed individually, boundary-class sizes estimated; see The 20 Classes), with each class mapped to its SparseCore engine or LLO/EmitX target.
  • The ID-table organisation: the 10-batch registerLlvmTpuDialectOperations0..9 registrar, why it is split, and how it differs from the 115-op ScDialect path.
  • The ODS shape recovery rule: how the typed tpu_*::create arg list is the operand/result declaration, with the verified per-family arities.

| Field | Value |
|---|---|
| Dialect class | mlir::sparse_core::LlvmTpuDialect (ctor LlvmTpuDialectC1EPNS_11MLIRContextE) |
| Intrinsic count | 1356 (Model<tpu_*> vtables == llvm.tpu.* strings, set-identical) |
| Registrar | registerLlvmTpuDialectOperations @ 0x146d0560 → tail-calls …Operations0..9 |
| Printed-name rule | C++ tpu_X_Y_Z ↔ intrinsic llvm.tpu.X.Y.Z (_ → .); comm -23 == 0 |
| Address spaces | LlvmTpuDialect::AddressSpaceDescription(int) @ 0x135462c0 (base 0xc9=201, span 0x18) |
| Consuming pass | LowerToSparseCoreLlvmPass::runOnOperation @ 0x13566d00; lowerFunc @ 0x13568280 |
| Dominant class | tpu_stream_*: 834 / 1356 (SparseCore embedding gather/scatter engine) |
| Typed-create coverage | 466 / 1356 carry a typed create (ODS shape read off the symbol); 890 default-builder |
| Confidence | CONFIRMED (byte-anchored) for enumeration/taxonomy/registration; per-class HW target as marked |

The Two-Surface Model

Purpose

Before the catalog, the reader needs to understand why a second TPU op dialect exists at all, and why the page that owns the tpu-dialect→LLO descent does not own this one. The split is structural, not incidental: TensorCore and SparseCore are different machines with different ISAs, and the compiler models them as different MLIR dialects.

The split

The tpu dialect (86 ops) is the TensorCore target: matmul/MXU, VPU vector ops, sequencer, descended through the 322 llo.*Op classes (tpu → LLO ODS Lowering) and packed into VLIW bundles. LlvmTpuDialect (1356 ops, this page) is the SparseCore target: an embedding-table stream processor whose primitives are indirect gather/scatter, sflag atomics, circular-buffer registers (CBREG), and address-space casts between the SparseCore memory pools. The SparseCore ops are not packed into LLO bundles — they become LLVM-IR intrinsic calls and are code-generated by the SparseCore LLVM backend.

NOTE — the 1356 intrinsics do not register through the 115-op ScDialect addOperations path (0x14594f60); a dialect inventory that scans only that path sees LlvmTpuDialect as empty. The intrinsics register through the separate 10-batch registerLlvmTpuDialectOperations path documented in ID-Table Organisation — that registrar is where their RegisteredOperationName::Model<Op> vtables are materialised and verified.

The Model↔printed-name correspondence

Every intrinsic exists in the binary twice, and the two forms are mechanically interconvertible:

// (a) the registered MLIR op — a RegisteredOperationName::Model<Op> vtable:
//        _ZTVN4mlir23RegisteredOperationName5ModelINS_11sparse_core11tpu_syncaddEEE
//     1356 such vtables for mlir::sparse_core::tpu_* classes.
// (b) the printed LLVM-intrinsic name string, in .rodata:
//        "llvm.tpu.syncadd"
//     1356 such strings.
//
// The map is purely lexical:
//     class  tpu_X_Y_Z   ->   intrinsic  llvm.tpu.X.Y.Z      // s/^tpu_/llvm.tpu./ ; s/_/./g
// Applying it to the class list and diffing against the string list:
//     comm -23 (mapped-classes) (string-list)  ==  0         // zero mismatch, both ways

So each registered op has exactly one printed llvm.tpu.* name and vice-versa. This is the canonical IntrinsicsTPU.td-derived surface: the C++ class is the ODS-registered MLIR op; the llvm.tpu.* string is the LLVM-intrinsic name it carries into the backend.

GOTCHA: the underscore→dot rewrite is purely character-level, not a tokenizer. tpu_dma_hbm_to_tilespmem_sc_simple → llvm.tpu.dma.hbm.to.tilespmem.sc.simple: every _ becomes ., including the ones inside hbm_to_tilespmem. A reimplementer that special-cases _to_ will desync the two name spaces.
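The lexical map and the two-way diff described above can be sketched in a few lines of Python. The function names here are illustrative, not symbols from the binary, and the vtable-symbol helper simply extrapolates the Itanium mangling pattern from the single tpu_syncadd symbol quoted earlier, on the assumption that every op class lives directly in mlir::sparse_core.

```python
def class_to_intrinsic(cls: str) -> str:
    """tpu_X_Y_Z -> llvm.tpu.X.Y.Z: every underscore becomes a dot."""
    if not cls.startswith("tpu_"):
        raise ValueError(f"not a tpu_* class name: {cls!r}")
    return "llvm." + cls.replace("_", ".")


def intrinsic_to_class(name: str) -> str:
    """llvm.tpu.X.Y.Z -> tpu_X_Y_Z: the inverse map."""
    if not name.startswith("llvm.tpu."):
        raise ValueError(f"not an llvm.tpu.* name: {name!r}")
    return name[len("llvm."):].replace(".", "_")


def model_vtable_symbol(cls: str) -> str:
    """Mangled vtable symbol for RegisteredOperationName::Model<sparse_core::cls>.

    Assumption: pattern extrapolated from the one symbol shown on this page.
    """
    return ("_ZTVN4mlir23RegisteredOperationName5ModelINS_11sparse_core"
            f"{len(cls)}{cls}EEE")


def unmatched(classes, strings):
    """Both directions of the comm check: names present on one side only."""
    mapped = {class_to_intrinsic(c) for c in classes}
    strings = set(strings)
    return sorted(mapped - strings), sorted(strings - mapped)
```

Fed the full class list and the full string list, `unmatched` returning two empty lists corresponds to the zero-mismatch result reported above. Because the map never inspects tokens, the `_to_` inside `hbm_to_tilespmem` needs no special handling.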


Functional-Class Taxonomy

Purpose

The 1356 intrinsics group into 20 functional classes by name-family prefix. The large clean-prefix classes (stream 834, pack/unpack 87, vld/vst 74, wait 47, dma 40, scan 32, vcvt 28, sync 25, alloca 24, rdreg/setreg 21, addrspacecast 16, sort 10, i1 5) are byte-confirmed by rg -c over the printed-name string set; the smaller boundary classes (transcendental/EUP, scalar-ALU, ptr/addressing, shuffle, trace, CBREG, task) are name-family estimates whose exact membership is not cleanly grep-isolable. The 20 sizes are not an exact partition — they cover 1348 of 1356 (see The 20 Classes for the eight-op shortfall). This is the page's central reimplementation artifact: it tells a reader which SparseCore hardware unit each named intrinsic drives, without dumping 1356 rows. The taxonomy is recovered by name-family grouping cross-checked against the printed-name string set; the per-class hardware target is joined to the SparseCore engine pages and the V5+ EmitX bit-position work.

The 20 classes

#ops = the class size; HW target = the SparseCore engine / LLO slot it lowers to; st = C (encoding byte-confirmed in a sibling page) or I (engine confirmed, per-leaf opcode inferred).


| #ops | functional class (tpu_* prefix) | HW target / lowering | st |
|---|---|---|---|
| 834 | sparsecore-stream (stream_*) | SparseCore stream/scatter engine descriptor (linear/strided/indirect × gather/scatter/vreg, CBREG-windowed); → LLVM call via LowerToSparseCoreLlvm | I |
| 87 | pack / unpack (pack*, unpack*) | VPU pack/unpack slot → llo.vpack*/llo.vunpack*; sub-byte staging (b32→b16→b8→b4→b2→b1, e4m3/e5m2/s8/u8) for MXU/quant | C |
| 74 | vector load/store (vld*, vst*) | VPU mem slot → llo.vector_{load,store}[_masked]; _idx indexed, _cb CBREG-windowed, _strided, _add scatter, _np no-predicate | C |
| 47 | semaphore wait/watch (wait*, watch*) | sflag VWait slot → llo.vwait.{eq,ne,lt,le,gt,ge}[.done]; _yieldable=seq-yield, _imem=instr-mem, ordone=or-done | C |
| 40 | DMA descriptor (dma_*_sc_*) | SparseCore DMA engine cmd (simple/single_strided/general tiers); 8/10/16-operand descriptor; cf ScDialect DmaSimpleStart | C |
| 32 | scan / segment-scan / reduce ({add,min,max}_*scan*) | SparseCore scan unit (add/max/min × full/half/1xN/2xN × seg × index); embedding-aggregation primitives | I |
| 28 | vector convert (vcvt*, cvt*) | VPU convert slot → llo.vcvt.*; f32↔bf16/bf8/hf16/if8/s4/s8/u4/u8; _sr=stochastic-round, _pr=probabilistic | C |
| 25 | semaphore set/add (syncadd*, syncset*, sfence, sync{donemov,pamov,readpa,setpa}) | sflag VSync slot → llo.vsync.{add,set}[.done,.remote]; _remote=ICI peer, _tile/_pa=tile/public bank, _doneinv=invert done | C |
| 24 | transcendental / EUP (rcp,rsqrt,tanh,sin,cos,erf,log2,pow2,sigshft) | EUP VALU3 push (Alu3 op0 + 5-bit selector) + PopEupResult; each bare + _macro push+pop pair | C |
| 24 | alloca / allocate (alloca*, allocate*) | SparseCore allocator (smem/spmem/vmem/sflag/hbm/iova/timem/tilesmem/tilespmem/dreg/cbreg + _dyn + _any); cf ScDialect Alloca | C |
| 21 | control register rd/set (rdreg_*, setreg_*) | scalar RdReg/SetReg → SCS scalar slot; cycle counters, tid/scid/tag id regs, fsr/ddr/dmacrdt/sflagrange | C |
| 19 | pointer / addressing / loop-bc (inttoptr,ptrtoint,bc_*,…) | LLVM inttoptr/ptrtoint/addrspace + loop bytecode (bc_loop_*, bc_load/store_aliaddr, bc_select_predicate, make_restrict_ptr) | C |
| 16 | addrspacecast (addrspacecast*) | LLVM addrspacecast → SparseCore address-space ID (base 201; AddressSpaceDescription @ 0x135462c0) | C |
| 14 | scalar ALU / scalar mem (shll,shra,sadd_ov,addcarry,…) | SCS scalar slot: shll/shra/shrl/sshllo shifts, sadd_ov/ssub_ov/sshla_ov overflow-add, addcarry, add_{high,low}_f32 | C |
| 14 | lane / sublane shuffle / permute (vrot_sublane,vperm,sc_permute) | VPU cross-lane slot → llo.vrot.slane/llo.vperm.sublane/llo.vslaneseq/vshift_insert/sc_mask_permute | C |
| 12 | CBREG / circular-buffer (rdcbreg,wrcbreg,cbreg_add,…) | scalar CBREG ops: rd/wr/add/copy CBREG {base,offset,size}; 16 CBREGs/bank, dual smem/tilespmem base, allocate_cbreg | C |
| 12 | trace / telemetry / sc-control (sc_*,strace,event,read_*cycle) | SparseCore control/trace: strace/event/spill_debug/log/mprefix, read_{global,local}_cycle_count, ssetpstate/ssettm | C |
| 10 | sort / unique / dupcount (sort*,uniquef,dupcntf,vmpcnt) | SparseCore sort/unique unit: sort_{asc,dsc}d{f,i}, uniquef/i, dupcntf/i, vmctz/vmpcnt_ones (embedding dedup) | I |
| 10 | task / control / structural (task_*,loop_*,barrier,nop,…) | SparseCore tile-task + structural: task_dispatch, loop_{name,parallel}, barrier, nop, delay, clear_ibuf, halt_trap, tileid | C |
| 5 | i1-mask width conversion (*i1_to_*i1) | VPU mask slot: tpu_{8,16,32}i1_to_{8,16,32}i1; vector-mask width re-pack | C |

834+87+74+47+40+32+28+25+24+24+21+19+16+14+14+12+12+10+10+5  =  1348   (of 1356)

NOTE: the clean-prefix class sizes are byte-confirmed individually (rg -c over the llvm.tpu.* string set) and the boundary-class sizes are name-family estimates, so the 20 rows are a name-family grouping, not an exact partition: they sum to 1348, eight short of the 1356 total. The shortfall is in the soft boundary classes (transcendental/EUP, scalar-ALU, ptr/addressing, trace, task), where a handful of ops (e.g. exponent, *_macro pairs, and miscellaneous control/structural intrinsics) fall outside any single prefix row. Treat the taxonomy as covering 1348 of 1356 by prefix, not as a verified 1:1 partition. The sync class is listed as 9 syncadd* + 12 syncset* + sfence + 4 singletons; those components add up to 26, one more than the 25 recorded for the class, so one component count is off by one and the breakdown should be read as approximate.

QUIRK — the five largest classes (stream, pack/unpack, vld/vst, wait, dma) are 834+87+74+47+40 = 1082 of 1356 — 80 % of the surface is SparseCore data movement, not compute. The transcendental/EUP class is 24 ops and the scalar-ALU class 14; SparseCore is a gather/scatter/sflag machine that does very little arithmetic of its own. A reimplementer budgeting effort should expect the stream-engine descriptor to dominate the work, not the math.
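The arithmetic behind the coverage and data-movement figures is easy to re-check. The sketch below copies the 20 class sizes from the table; the short dictionary keys are labels chosen here for brevity, not names from the binary.

```python
TOTAL = 1356  # registered tpu_* intrinsics

# Class sizes as listed in the 20-row taxonomy table.
CLASS_SIZES = {
    "stream": 834, "pack/unpack": 87, "vld/vst": 74, "wait/watch": 47,
    "dma": 40, "scan": 32, "vcvt": 28, "sync": 25, "transcendental": 24,
    "alloca": 24, "rdreg/setreg": 21, "ptr/addressing": 19,
    "addrspacecast": 16, "scalar-alu": 14, "shuffle": 14, "cbreg": 12,
    "trace": 12, "sort": 10, "task": 10, "i1": 5,
}

covered = sum(CLASS_SIZES.values())       # ops covered by a prefix row
shortfall = TOTAL - covered               # ops outside every row
top5 = sorted(CLASS_SIZES.values(), reverse=True)[:5]
data_movement_share = sum(top5) / TOTAL   # stream + pack + vld/vst + wait + dma

print(f"covered {covered} of {TOTAL}, shortfall {shortfall}")
print(f"top five classes: {sum(top5)} ops, {data_movement_share:.1%} of the surface")
```

Keeping the sizes in one table like this makes it cheap to re-run the check whenever a boundary class is re-counted against a different libtpu build.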


The Stream Family (834 ops)

Purpose

The tpu_stream_* family is 62 % of the intrinsic surface and is the single most important SparseCore primitive: the embedding-table gather/scatter engine. Its 834-way explosion is not 834 unrelated ops — it is a four-axis cross-product where each cell is a distinct registered intrinsic so the per-(pattern,verb,dtype,memspace) stream-engine command is selected by op identity at the IR level, not by an attribute decoded at runtime.

The cross-product axes

llvm.tpu.stream.<pattern>.[vreg.]{gather|scatter}.[cb.][add.]<dtype>.<src>.to.<dst>

  axis            values                                          how it splits the 834
  --------------------------------------------------------------------------------------
  pattern         linear (180) · strided (114) · indirect (540)   address-generation mode
  verb            vreg (360) · gather (246) · scatter (228)        direction / lane-spread
  dtype           bf16 · e4m3 · e5m2 · f32 · s16 · s32 (6)         exactly 111 ops each (6×111=666 dtyped)
  transfer        _to_tilespmem 399 · _to_spmem 210 · _to_smem 27  src→dst memspace pair
                  · _to_hbm4b 15 · _to_hbm 15
  modifier        _cb_ (CBREG-windowed, 556) · _add (scatter-add) · _np (no-predicate)

The indirect pattern dominates (540) because that is the embedding-lookup mode: an index stream selects rows. The _cb_ (CBREG-windowed) modifier appears on 556 of the 834 — the windowed forms use the INDIRECT_OFFSET_SOURCE_CBREG source for embedding-table windowing (see CBREG). gather pulls rows into tile-local SPMEM (hence _to_tilespmem is the largest transfer bucket); scatter/scatter-add pushes accumulated gradients back out.
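As a rough illustration of the naming scheme, the sketch below splits a stream name into the axes listed above. It assumes the token order shown in the name template and a single-token source memspace; `StreamName` and `parse_stream` are hypothetical helpers, not recovered symbols. Names without a `to` transfer pair are rejected rather than guessed at.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

PATTERNS = ("linear", "strided", "indirect")
VERBS = ("vreg", "gather", "scatter")
DTYPES = ("bf16", "e4m3", "e5m2", "f32", "s16", "s32")
MODIFIERS = ("cb", "add", "np")

# Axis counts quoted above: each axis splits the same 834 ops.
assert 180 + 114 + 540 == 834 and 360 + 246 + 228 == 834


@dataclass(frozen=True)
class StreamName:
    pattern: str
    verbs: Tuple[str, ...]
    modifiers: Tuple[str, ...]
    dtype: Optional[str]
    src: str
    dst: str
    extra: Tuple[str, ...]  # tokens this sketch does not recognise


def parse_stream(name: str) -> StreamName:
    """Split a tpu_stream_* class name or llvm.tpu.stream.* string into axes."""
    tokens = name.replace("_", ".").split(".")
    if tokens and tokens[0] == "llvm":
        tokens = tokens[1:]
    if tokens[:2] != ["tpu", "stream"] or "to" not in tokens:
        raise ValueError(f"not a stream transfer name: {name!r}")
    if tokens[2] not in PATTERNS:
        raise ValueError(f"unknown pattern in {name!r}")
    to = len(tokens) - 1 - tokens[::-1].index("to")  # position of the last "to"
    if to < 4 or to == len(tokens) - 1:
        raise ValueError(f"missing src or dst in {name!r}")
    middle = tokens[3:to - 1]  # everything between the pattern and the src
    known = set(VERBS) | set(MODIFIERS) | set(DTYPES)
    return StreamName(
        pattern=tokens[2],
        verbs=tuple(t for t in middle if t in VERBS),
        modifiers=tuple(t for t in middle if t in MODIFIERS),
        dtype=next((t for t in middle if t in DTYPES), None),
        src=tokens[to - 1],
        dst="_".join(tokens[to + 1:]),
        extra=tuple(t for t in middle if t not in known),
    )
```

For example, `parse_stream("tpu_stream_indirect_gather_add_bf16_hbm_to_tilespmem")` yields the indirect pattern, the gather verb, the add modifier, dtype bf16, and the hbm → tilespmem transfer. The parser only reads what the name already encodes, which is the point of the design: there is no attribute to consult.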

The pipeline this encodes

        HBM embedding table                         tile-local SPMEM
        ┌───────────────┐    indirect gather        ┌──────────────┐
        │ row 0         │ ── (CBREG-windowed) ─────▶ │ gathered rows│
        │ row 1         │    indexed by offset       │              │
        │ …             │    stream                  └──────┬───────┘
        └───────────────┘                                   │ compute (forward)
                ▲                                            ▼
                │  scatter-add (_add)                 ┌──────────────┐
                └──────── accumulate gradients ────── │ updated rows │
                                                      └──────────────┘

Representative signature

// tpu_stream_linear_gather_add_f32_hbm_to_tilespmem
//   ::create(OpBuilder&, Location, Value, Value, Value, Value, Value, Value)   // 6 operands
// tpu_stream_indirect_gather_add_bf16_hbm_to_tilespmem
//   ::create(OpBuilder&, Location, Value ×8)                                    // 8 operands
//
// The operand count tracks the pattern: linear/strided forms carry 6 SSA operands
// (src/dst handle + base/offset/size + sflag); the indirect forms carry 8
// (the extra two are the index/offset-stream + its CBREG window).

NOTE — the per-leaf intrinsic→numeric-stream-engine-command for all 834 is INFERRED at the class→engine level only. The class maps to the SparseCore stream engine and the descriptor operand shape is recovered byte-exactly; the per-(pattern,verb,dtype,memspace) HW command value is the SparseCore stream ISA — sibling to the DMA descriptor — and is not individually byte-dumped for each of the 834. See Stream Gather/Scatter.


ODS Shape Recovery

Purpose

The operand/result declaration of each intrinsic is recoverable from the binary without TableGen source, the same way tpu → LLO ODS Lowering recovers LLO signatures: the typed create symbol's demangled argument list is the ODS declaration in source order.

The recovery rule

// create(OpBuilder&, Location, <ODS args in declaration order>)
//   mlir::Type   -> 1 explicit result type (inference-free op)
//   mlir::Value  -> 1 SSA operand
// Cross-checked against the op's NOperands<N> trait, which pins the operand count
// independently of the create arg list (e.g. tpu_syncadd carries NOperands<2>).

466 of the 1356 carry a typed create; the other 890 use the generic (TypeRange, ValueRange, ArrayRef<NamedAttribute>) default builder (result type inferred via SameOperandsAndResultType/InferType, operand count by name-family arity). The typed forms verified against the binary:


| intrinsic (class) | create args (after Location) | operands |
|---|---|---|
| tpu_addrspacecast (addrspacecast) | Type, Value | 1 res, 1 opnd |
| tpu_addrspacecast_smem / _spmem / _*_tec | Type, Value, Value | + tile-window/base |
| tpu_dma_*_sc_simple (DMA) | Value ×8 | 8-field descriptor |
| tpu_dma_*_sc_single_strided (DMA) | Value ×10 | + stride |
| tpu_dma_*_sc_general (DMA) | Value ×16 | multi-dim |
| tpu_stream_linear_* (stream) | Value ×6 | stream descriptor |
| tpu_stream_indirect_* (stream) | Value ×8 | + index/offset stream |
| tpu_syncadd (sync) | Value, Value | sflag, delta |
| tpu_fetch_and_add (sync) | Value, Value, Value | sflag, addr, value |
| tpu_*_macro (EUP, e.g. tpu_sin_macro) | Type, Value | 1 res, 1 opnd |
| tpu_rdcbreg_offset / _size (CBREG) | Type, Value | result, cbreg |
| tpu_wrcbreg_offset (CBREG) | Type, Value, Value | cbreg, value |
| tpu_inttoptr / tpu_ptrtoint (ptr) | Type, Value | result, val |


The DMA _single_strided tier carries 10 SSA operands: the demangled tpu_dma_hbm_to_hbm_sc_single_strided::create arg list is Value + 9×S5_ (a back-reference to the prior Value type), placing the stride tier two operands above the _simple 8 and below the _general 16. The stream family is likewise not uniform — linear/strided patterns carry 6 operands, but indirect patterns carry 8, the two extras being the index/offset stream and its CBREG window.
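The recovery rule can be applied mechanically to a demangled symbol. The sketch below assumes the demangler prints the signature in the usual `Class::create(mlir::OpBuilder&, mlir::Location, ...)` form with back-references already expanded; the exact input strings are illustrative, and `ods_shape` is a hypothetical helper.

```python
import re


def ods_shape(demangled_create: str):
    """Return (n_results, n_operands) from a demangled typed create signature."""
    m = re.search(r"::create\((.*)\)\s*$", demangled_create)
    if not m:
        raise ValueError("not a create signature")
    args = [a.strip() for a in m.group(1).split(",")]
    # The first two arguments are always the builder and the location.
    if len(args) < 2 or "OpBuilder" not in args[0] or "Location" not in args[1]:
        raise ValueError("unexpected leading arguments")
    ods = args[2:]
    results = sum(a == "mlir::Type" for a in ods)    # explicit result types
    operands = sum(a == "mlir::Value" for a in ods)  # SSA operands
    if results + operands != len(ods):
        raise ValueError("argument kind not handled by this sketch")
    return results, operands


def typed_create(cls: str, *ods_args: str) -> str:
    """Build an illustrative demangled signature for a given ODS arg list."""
    head = ["mlir::OpBuilder&", "mlir::Location"]
    return f"mlir::sparse_core::{cls}::create({', '.join(head + list(ods_args))})"


# Per-tier DMA operand counts from the table above.
DMA_TIER_OPERANDS = {"simple": 8, "single_strided": 10, "general": 16}
```

`ods_shape(typed_create("tpu_syncadd", "mlir::Value", "mlir::Value"))` gives zero explicit results and two operands, which is the count the `NOperands<2>` trait pins independently. The helper raises on any argument kind other than Type or Value so that attribute-carrying builders are not silently miscounted.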


ID-Table Organisation

Purpose

A reimplementer must reproduce how 1356 ops register into one dialect. The answer is the 10-batch registrar — a TableGen pattern that splits one logical addOperations<…> into ten functions to bound per-function instantiation size. This is the page's "ID-table" structure: the intrinsic IDs are assigned in registration order across the ten batches.

The registrar

LlvmTpuDialect::initialize
  └─ registerLlvmTpuDialectOperations   @0x146d0560   ── tail-calls 10 sub-registrars:
       ├─ registerLlvmTpuDialectOperations0   @0x146d05c0
       ├─ registerLlvmTpuDialectOperations1   @0x1472bea0
       ├─ registerLlvmTpuDialectOperations2   @0x1478b500
       ├─ registerLlvmTpuDialectOperations3   @0x147e1c40
       ├─ registerLlvmTpuDialectOperations4   @0x14835b80
       ├─ registerLlvmTpuDialectOperations5   @0x148891c0
       ├─ registerLlvmTpuDialectOperations6   @0x148dc3c0
       ├─ registerLlvmTpuDialectOperations7   @0x1492d5c0
       ├─ registerLlvmTpuDialectOperations8   @0x14982d00
       └─ registerLlvmTpuDialectOperations9   @0x149d88e0   (last batch)
                                                ── 10 batches × ~135 ops each, 1356 total

Each batch is a variadic addOperations<tpu_X, tpu_Y, …> that materialises one RegisteredOperationName::Model<Op> per template argument: the standard MLIR op-registration ABI, just split across ten functions. There is no separate numeric "intrinsic ID enum" table in this dialect: the op's identity is its TypeID/RegisteredOperationName, and the printed llvm.tpu.* name is the ID the LLVM translator emits. The Model vtables span a contiguous .data.rel.ro region (tpu_16i1_to_32i1 through tpu_wrcbreg_tilespmem_base).

QUIRK — the ≤256-op batch split is a TableGen artifact, not a semantic grouping. Batch N does not correspond to functional class N; ops from the same family (e.g. the 834 streams) are scattered across all ten batches in alphabetical-class order. A reimplementer must not assume batch membership carries meaning beyond "register these ~135 ops here."
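To make the "~135 ops per batch" figure concrete, the sketch below splits a sorted op list into ten near-equal contiguous batches. This is only an even split for illustration: the real batch boundaries are not listed on this page, and the placeholder op names are synthetic.

```python
def split_batches(names, n_batches=10):
    """Split sorted op names into n contiguous batches of near-equal size."""
    names = sorted(names)
    base, extra = divmod(len(names), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)  # first `extra` batches get one more
        batches.append(names[start:start + size])
        start += size
    return batches


# Synthetic stand-ins for the 1356 registered class names.
placeholder_ops = [f"tpu_op_{i:04d}" for i in range(1356)]
sizes = [len(b) for b in split_batches(placeholder_ops)]
print(sizes)
```

Under this even split, 1356 ops fall into six batches of 136 and four of 135. Whatever the actual boundaries are, the batch index is a build artifact and should not be used as a key for anything.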

Registration binding

function registerLlvmTpuDialectOperations(LlvmTpuDialect *d):   // 0x146d0560
    registerLlvmTpuDialectOperations0(d);                       // each batch:
    …                                                           //   addOperations<tpu_A, tpu_B, …>(d)
    return registerLlvmTpuDialectOperations9(d);                //   -> RegisteredOperationName::insert ×135

The dialect class is mlir::sparse_core::LlvmTpuDialect (ctor LlvmTpuDialectC1EPNS_11MLIRContextE). Its companion helpers — AddressSpaceDescription(int) @ 0x135462c0, GetAnyTypeFromAddressSpace(int), the attr parse/print pair, and the target-CPU attribute predicates — are the dialect-level machinery the intrinsics rely on for memory-space and target resolution.


Bridging into the LLVM Backend

Purpose

The intrinsics do not stop at MLIR. The LowerToSparseCoreLlvmPass rewrites each tpu_* op into its LLVM-dialect form so the SparseCore LLVM backend can code-generate it. This section names the bridge and points to the pages that own each lowering arm.

The consuming pass

xla::tpu::sparse_core::CreateLowerToSparseCoreLlvmPass     @0x135667c0  ── factory
  └─ LowerToSparseCoreLlvmPass::runOnOperation             @0x13566d00  ── driver
       └─ lowerFunc                                        @0x13568280  ── per-op rewrite

Per class, the lowering arm differs:


| class | lowers to | owned by |
|---|---|---|
| addrspacecast | LLVM addrspacecast / INTRINSIC_WO_CHAIN keyed by intrinsic ID | addrspacecast ISel |
| transcendental / EUP | EUP VALU3 push (Alu3 op0 + 5-bit selector) + PopEupResult | EUP Transcendental Slot |
| CBREG | scalar CBREG ops (ReadCbreg 0x36 / WriteCbreg 0x35 / AddCbreg 0x33) | CBREG |
| sparsecore-stream | per-(pattern,verb,dtype,memspace) stream-engine descriptor | Stream Gather/Scatter |
| sync / wait | sflag VSync/VWait LLO ops | VPU Slot |
| pack/unpack, vcvt, vld/vst, permute | VPU slot ops (llo.v*) | VPU Slot |
| scan / sort / unique | SparseCore scan/sort/dedup units | Scan Datapath, Rank & Permute / Radixsort |

NOTE — the addrspacecast arm is not an ISD::ADDRSPACECAST(0xf4) lowering. The SC cast intrinsics survive into LLVM-IR as intrinsic calls (INTRINSIC_WO_CHAIN), distinct from the TensorCore front-end's real addrspacecast instructions. See addrspacecast ISel for the full per-cast from→to map and why wiring the SC casts into LowerADDRSPACECAST produces a backend that traps with CannotYetSelect.
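For the classes whose members share a clean prefix, the owning page can be looked up from the class name alone. The sketch below covers only those arms; the boundary classes (EUP, CBREG, permute, scan, sort) are deliberately left unmatched because their membership is not prefix-isolable. The table and function names are illustrative.

```python
# (prefixes after "tpu_", owning page) for the clean-prefix lowering arms only.
LOWERING_ARMS = [
    (("addrspacecast",), "addrspacecast ISel"),
    (("stream_",), "Stream Gather/Scatter"),
    (("syncadd", "syncset", "wait", "watch"), "VPU Slot"),
    (("pack", "unpack", "vcvt", "vld", "vst"), "VPU Slot"),
]


def owning_page(cls: str):
    """Return the page that owns the lowering arm for cls, or None if unknown."""
    if not cls.startswith("tpu_"):
        raise ValueError(f"not a tpu_* class name: {cls!r}")
    stem = cls[len("tpu_"):]
    for prefixes, page in LOWERING_ARMS:
        if stem.startswith(prefixes):  # str.startswith accepts a tuple
            return page
    return None  # boundary class: needs an explicit membership list
```

Returning None for anything outside the clean-prefix set keeps the lookup honest: `owning_page("tpu_sin_macro")` yields None rather than a guess, and a caller has to supply an explicit membership list for those ops.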


What Is Not Recovered

Honest gaps, for the reimplementer:

  • Per-leaf stream-engine command (834 ops). Class→engine confirmed and the descriptor operand shape recovered; the numeric per-(pattern,verb,dtype,memspace) HW command value is not individually byte-dumped. This is the SparseCore stream ISA, sibling to the DMA descriptor.
  • Per-intrinsic LLVM IntrProperties bitset. Which OpInterfaces each op carries is observable from the interface-Model symbols (AliasAnalysis 546, MemoryEffect 285, Bytecode 188, AccessGroup 180); the exact per-op IntrNoMem/IntrArgMemOnly/IntrWillReturn bitset that drives backend scheduling/aliasing is not transcribed.
  • The 890 default-builder ops' exact arity + result TypeConstraint. Recovered by name family and the NOperands<N> trait where present; the per-op verifyInvariantsImpl byte-decode (1-vs-2 operands, Vreg/Mask/Scalar/Ptr result predicate) is not exhaustively walked.
  • The complete numeric address-space ID table. The AddressSpaceDescription switch base (201) and sampled case strings (HBMAny, SflagAny, SflagTile, TileSmem, plus Smem/Sflag/HBM/Dreg) are decoded; the full ID↔space map and the 16 addrspacecast from→to ID pairs are completed on addrspacecast ISel.
  • The scan/sort/unique-engine opcode encodings (32+10 ops). Mapped to the SparseCore scan/sort/dedup units by name; the per-op HW command bit layout is not decoded (these are SparseCore-specific compute units, not TensorCore LLO slots).

Cross-References

  • tpu → LLO ODS Lowering — the other surface: the 86-op tpu dialect lowered to the 322 llo.*Op classes; the ODS-from-create recovery rule this page reuses
  • The TPU Compiler — Part V orientation; where the SparseCore lowering sits in the five-phase descent
  • Lower to SparseCore LLVM — the consuming LowerToSparseCoreLlvmPass that turns these intrinsics into LLVM-IR
  • Dot/Conv MXU Lowering — the TensorCore matmul lowering, contrast to the SparseCore stream engine
  • addrspacecast ISel — the 16 addrspacecast intrinsics at instruction selection; full from→to address-space map
  • Stream Gather/Scatter — the 834-op stream family's engine-side detail
  • CBREG — the 16-CBREG-per-bank circular buffer the CBREG and _cb_ stream intrinsics drive
  • Scan Datapath — the scan/segment-scan engine the 32 scan intrinsics target
  • Rank & Permute / Radixsort: the sort/unique/dedup unit the 10 sort intrinsics target
  • EUP Transcendental Slot — the EUP VALU3 push/pop the 24 transcendental intrinsics lower to
  • VPU Slot — the VPU slot ops the pack/unpack, convert, load/store, permute, and sync/wait intrinsics target
  • MXU Slot — the MXU bundle slot the pack/unpack staging feeds