Tileiras - MLIR-Based Optimizing Assembler
Tileiras is NVIDIA's CUDA TileIR optimizing assembler, shipped with CUDA 13.1 as a separate compiler binary. It consumes serialized MLIR bytecode for a tile program, lowers that program through NVIDIA tile dialects and NVPTX code generation, invokes the assembler toolchain, and writes a host relocatable object containing the compiled GPU payload.
Tileiras is best understood neither as a C++ compiler nor as a replacement for cudafe++. It starts after a frontend has already described the GPU work in MLIR, and its job is to make that tile-level program executable on Blackwell-family GPUs.
This wiki is written for two practical readers:
- If you use or integrate tileiras, the driver, option, bytecode, and subprocess pages explain what inputs the tool accepts, which target modes are valid, which external tools must be available, and how failures should be interpreted.
- If you are reimplementing compatible tooling, the subsystem pages describe observable contracts: bytecode structure, dialect schemas, pass ordering, scheduler invariants, lowering decisions, diagnostics, and pseudocode-level algorithms.
If you arrive with a specific question rather than wanting a topic tour, jump to Frequently Asked Questions, which maps common scenarios to a starting page.
At a Glance
| Item | Value |
|---|---|
| Program role | MLIR bytecode to host ELF relocatable with embedded GPU code |
| CUDA release | 13.1, toolkit build V13.1.80 |
| LLVM lineage | Internal LLVM main-branch snapshot identifying as LLVM21.0.0git |
| Default GPU target | sm_100 |
| Accepted driver targets | sm_100, sm_103, sm_110, sm_120, sm_121 |
| Default output | elf.o |
| Main input language | Binary MLIR bytecode carrying cuda_tile programs |
| Main output path | Tile dialects -> NVVM/LLVM -> NVPTX -> ptxas -> host object |
What Tileiras Does
Tileiras is an optimizing assembler in the MLIR sense. It accepts an already-formed module, validates that the module uses the dialect versions it understands, runs a target-specific lowering pipeline, schedules and legalizes tile operations, emits PTX through the NVPTX backend, and delegates final machine-code assembly to ptxas.
The input is not CUDA C++, and tileiras does not perform preprocessing, C++ parsing, EDG lowering, template instantiation, or host-stub generation. Those responsibilities belong to other CUDA tools. Tileiras is the compiler for a lower-level tile IR surface.
The broad flow is:
    tileiras bytecode
      -> parse builtin.module
      -> load cuda_tile, nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass
      -> lower tile program toward LLVM and NVVM
      -> run TileAS scheduling, layout, TMA, pipeline, and cluster passes
      -> run NVPTX code generation
      -> run ptxas
      -> optionally run nvdisasm -c for annotated disassembly payloads
      -> emit host ELF relocatable
Public Contract
For integration work, treat tileiras as a narrow bytecode-to-object compiler.
- Produce MLIR bytecode for a builtin.module whose dialect tables match the CUDA 13.1 tile dialect schema.
- Select one of the supported Blackwell-family targets.
- Provide host, optimization, debug, line-info, output, and sanitizer options through the driver interface.
- Ensure ptxas is available. Some configurations also require nvdisasm because the compile pipeline shells out to it.
- Consume the produced object file, normally elf.o, as a host relocatable carrying the device payload.
The driver has a deliberately small option surface compared with nvcc or cicc: target GPU, host architecture, host OS, optimization level, line info, device debug, sanitizer mode, and output path. Most of the complexity is inside the bytecode reader and pass pipeline, not in command-line dispatch.
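As an illustration of how small that option surface is, the target check can be sketched in a few lines. This is a minimal sketch, not the driver's code: the DriverConfig record and its field names are hypothetical, while the target list, the defaults, and the rejection text are the ones this wiki documents.

```cpp
#include <string>

// Targets listed in the "At a Glance" table.
static const char *const kAcceptedTargets[] = {
    "sm_100", "sm_103", "sm_110", "sm_120", "sm_121"};

// Hypothetical configuration record; the field names are illustrative only.
struct DriverConfig {
    std::string gpu = "sm_100";    // documented default GPU target
    std::string output = "elf.o";  // documented default output name
};

// Returns an empty string when the target is accepted,
// otherwise the user-facing diagnostic text.
std::string validate_target(const DriverConfig &cfg) {
    for (const char *target : kAcceptedTargets) {
        if (cfg.gpu == target) {
            return "";
        }
    }
    return "invalid GPU architecture: " + cfg.gpu;
}
```

A default-constructed DriverConfig passes the check; any name outside the list yields the diagnostic with the rejected name appended.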
Compiler Model
Tileiras lowers across nine dialect layers. The early dialects preserve tile semantics; the middle dialects make layout, memory, and scheduling explicit; the late dialects bridge into NVVM and LLVM.
| Dialect | Role |
|---|---|
| cuda_tile | Public bytecode-facing tile program surface: blocks, tiles, async operations, atomics, and high-level tensor actions. |
| nv_tileaa | Alias-aware layer with typed pointer, token, and view operations. It makes memory-space and aliasing facts explicit enough for later rewriting. |
| nv_tileas | Assembler-near layer for schedules, layouts, execution units, TMA descriptors, pipeline state, and resource decisions. |
| cute | Layout algebra and tile decomposition primitives. |
| cute_nvgpu | NVIDIA GPU atom layer: MMA atoms, TMA, WGMMA, tcgen05, LDSM/STSM, and cluster-specific operations. |
| cutlass | Pipeline, scheduler, sequence-barrier, and block-striped primitives reused from the CUTLASS programming model. |
| mlir::nvgpu | Generic NVIDIA GPU bridge dialect used before NVVM lowering. |
| NVVM | LLVM IR with NVPTX intrinsics and NVIDIA memory-space semantics. |
| llvm | Final LLVM IR representation consumed by the NVPTX backend. |
The central reimplementation point is that every stage has a structural contract. The bytecode reader must recognize the same dialect and operation tags. The pass manager must preserve the same invariants. The scheduler must obey the same resource and dependency model. The NVPTX lowering must emit the same param-space and memory-space conventions expected by ptxas.
End-to-End Algorithm
The top-level compiler can be modeled as this pipeline:
    TileirasResult compile_tileiras(ByteBuffer input, TileirasConfig cfg) {
        validate_config(cfg);
        if (!is_tileiras_bytecode(input)) {
            if (looks_like_plain_mlir_bytecode(input))
                return error("failed to parse IR bytecode (it looks like MLIR bytecode instead)");
            return error("failed to parse IR bytecode");
        }
        MLIRContext ctx = create_context();
        register_tileiras_dialects(&ctx);
        Module module = parse_tileiras_bytecode(&ctx, input);
        verify_module_contract(module, cfg.gpu);
        PassManager pm = build_tileiras_pipeline(cfg);
        pm.run(module);
        LLVMModule llvm = lower_to_llvm_and_nvvm(module, cfg);
        PTXText ptx = emit_nvptx(llvm, cfg);
        Cubin sass = run_ptxas(ptx, cfg);
        Optional<Disassembly> disasm = none();
        if (cfg.requires_disassembly_payload)
            disasm = run_nvdisasm_c(sass, cfg);
        return assemble_host_object(sass, disasm, cfg.output_file);
    }
The overview intentionally keeps this algorithm coarse. The detailed pages define the bytecode grammar, pass families, scheduler resource model, NVVM lowering, call-lowering ABI, and code-emission helpers at reimplementation depth.
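The first branch of that pipeline, which picks between the two parse diagnostics, can be sketched as a prefix check. This is a sketch under stated assumptions: the four-byte magic of upstream MLIR bytecode ('M', 'L', 0xEF, 'R') is public, but the TileIR container magic is not given on this page, so the caller supplies it as a parameter. InputKind and classify_input are illustrative names.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

enum class InputKind { TileirasBytecode, PlainMlirBytecode, Unknown };

// Upstream MLIR bytecode begins with the bytes 'M' 'L' 0xEF 'R'.
static const unsigned char kMlirMagic[4] = {0x4D, 0x4C, 0xEF, 0x52};

static bool has_prefix(const unsigned char *buf, std::size_t len,
                       const unsigned char *magic, std::size_t magic_len) {
    return len >= magic_len && std::memcmp(buf, magic, magic_len) == 0;
}

// tile_magic is an assumption supplied by the caller; this page does not
// state the real TileIR container magic.
InputKind classify_input(const unsigned char *buf, std::size_t len,
                         const unsigned char *tile_magic, std::size_t tile_magic_len) {
    if (has_prefix(buf, len, tile_magic, tile_magic_len))
        return InputKind::TileirasBytecode;
    if (has_prefix(buf, len, kMlirMagic, sizeof kMlirMagic))
        return InputKind::PlainMlirBytecode;
    return InputKind::Unknown;
}

// Maps the classification to the diagnostics quoted in the pipeline above.
std::string parse_error_for(InputKind kind) {
    switch (kind) {
    case InputKind::PlainMlirBytecode:
        return "failed to parse IR bytecode (it looks like MLIR bytecode instead)";
    case InputKind::Unknown:
        return "failed to parse IR bytecode";
    case InputKind::TileirasBytecode:
        break;
    }
    return "";  // accepted input carries no parse error
}
```

Checking the TileIR prefix first keeps the friendlier "looks like MLIR bytecode" message reserved for inputs that are well-formed upstream bytecode but not TileIR.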
Position in CUDA 13.1
In CUDA 13.1, tileiras is best understood as a sibling device compiler to cicc, not as a child of it. Both paths eventually produce PTX and rely on the same downstream assembler, but they start from different frontends:
CUDA C++ source path:
    CUDA C++ -> cudafe++ / cicc -> LLVM/NVVM -> PTX -> ptxas
TileIR path:
    MLIR bytecode -> tileiras -> LLVM/NVVM -> PTX -> ptxas
That distinction matters for debugging. If tileiras rejects a program, the failure is normally in bytecode schema, dialect verification, tile lowering, scheduling, NVVM conversion, or PTX assembly. It is not a C++ frontend failure.
How to Read This Wiki
The wiki is dense. Pick a path based on what you need.
For reimplementers — building a compatible CUDA TileIR compiler:
- Start with Boundaries to fix tileiras's position in the CUDA toolchain.
- Read Program Layout for the executable shape and subsystem map.
- Read Pipeline Overview for the top-to-bottom cascade.
- Drill into whichever subsystem you are implementing: dialects, scheduler, lowering, codegen, or NVPTX passes. Each subsystem page is a reimplementation-grade contract.
For users — running tileiras or diagnosing failures:
- Start with Driver Overview for the public C-API and CLI surface.
- Read CLI Options for the full driver option list.
- Read Full Pass List by Opt Level to see which passes run at each -O level.
- Jump to the specific pass page for whatever behavior you are investigating. The Reading Map curates ordered sequences for common subsystems.
For a guided tour — sample the writing quality and depth:
- Modulo Scheduler and Rau — the scheduler exemplar, reimplementation depth.
- MLIR Bytecode Format — the wire-format contract.
- cuda_tile Overview — the public input IR.
- Lowering Overview — the conversion cascade in one page.
For specific topics — see the Specialized Topics cluster and the Reading Map for curated reader paths through scheduler, codegen, dialect lowering, and OSS comparison.
Reference catalogs such as the function map, opcode rosters, and sentinel tables are intentionally denser. They are for lookup and audit work; the subsystem pages are the narrative documentation.
Documentation Style
Public pages describe behavior first. Internal recovery anchors, binary offsets, and raw analysis notes are treated as authoring evidence, not as the reader-facing API. When a recovered implementation detail matters for compatibility, the page names the semantic role first and gives pseudocode or a data-structure contract before any low-level identifier.
Code blocks use C-like pseudocode for algorithms and explicit tables for externally visible contracts. The goal is that a reader can both operate tileiras and build a compatible implementation without having to reverse the prose back into an algorithm.
Program Layout
Abstract
Tileiras is a single ELF executable with the entire MLIR-and-LLVM compiler statically linked inside it. Its segments are conventional - .text, .rodata, .data, .bss, plus the usual support sections - but the content of each segment is structured by subsystem boundary. The .text segment is partitioned into bands that correspond to the driver, the bytecode reader, dialect implementations, conversion patterns, the scheduler, codegen, and the LLVM and NVPTX libraries. The .rodata segment carries dialect string pools, pass and diagnostic strings, the bitcode writer's tag tables, and the XOR-3 obfuscated NVPTX mnemonic pool used by the asm printer. The .data segment carries cl::opt globals, dialect registration tables, and the encoded mnemonic pool that the printer walks at runtime. The .bss segment carries the singletons every subsystem accumulates: the StorageUniquer hash tables, the TypeID Meyers-cache slots, the registered dialect instances, and the per-thread caches reached through TLS. This page describes what lives in each band and why each band has the shape it does. Addresses are deliberately omitted; reimplementers care about the structure, not the offsets.
Identity
| Property | Value |
|---|---|
| Tool role | CUDA TileIR optimizing assembler |
| CUDA release | 13.1 |
| Toolkit banner | Cuda compilation tools, release 13.1, V13.1.80 |
| LLVM lineage | Internal LLVM mainline snapshot identifying as LLVM21.0.0git |
| Input format | TileIR MLIR bytecode |
| Primary output | Host relocatable object containing compiled GPU code |
| Default output name | elf.o |
| Default GPU family | Blackwell-family target, normally sm_100 |
Tileiras is not a C++ frontend. It does not parse CUDA C++, instantiate templates, or generate host stubs. It consumes serialized MLIR, lowers it through the TileIR dialect stack, and produces PTX text that is then assembled by ptxas.
.text Bands
Code is grouped by subsystem; each band is a contiguous region containing the functions of one cohesive responsibility. The order, top to bottom, roughly tracks the runtime path through the compiler.
| Band | Contents |
|---|---|
| Driver text | Command-line parsing, target validation, CUDA-toolkit discovery, subprocess harness, file emission. |
| Bytecode reader text | Container header parsing, section walkers, varint and string-table reconstruction, operation/region rebuilder, post-load verifier driver. |
| Dialect-registration text | The register_operations, register_types, register_attributes, and register_interfaces bodies for each of the nine dialects, plus the per-operation verify, fold, print, and parse hooks. |
| Lowering text | The populate_*_patterns functions and pattern bodies for every conversion edge in the TileIR stack; ConversionTarget configuration; type converters; full and partial conversion drivers. |
| Scheduler text | RRT construction and probing, MII-bound computation, group placement, modulo-schedule solve, pipe/mutex materialization. |
| Codegen text | MLIR-to-LLVM translation, libdevice link logic, LLVM IR pipeline driver, NVPTX selection-DAG driver, MachineIR verifier hooks, asm printer. |
| LLVM/NVPTX library text | The statically linked LLVM core, LLVM IR analysis and transform passes, the NVPTX backend, the SelectionDAG infrastructure, and the asm-printer support. |
| C++ runtime and libstdc++ text | The standard-library bodies pulled in by the static link (the std::sort introsort, hash-table primitives, allocator support, the exception runtime). |
Each band has a stable internal shape: a small number of public entry points called by neighbouring bands, surrounded by a much larger population of pattern bodies and helper functions called only within the band. The lowering band is the heaviest; the scheduler band is the densest in algorithmic content per byte.
.rodata Bands
Read-only data is structured by purpose. The largest bands carry strings; the smaller bands carry the constant tables that drive the dispatch machinery.
| Band | Contents |
|---|---|
| Per-dialect mnemonic pools | The operation, type, and attribute mnemonics for each dialect, kept in registration order. The bytecode reader and the printer both index into these pools through OperationName. |
| Per-pass diagnostic strings | The text of every pass-emitted diagnostic, plus the verifier strings used by verify hooks. |
| Conversion-pattern descriptors | The static descriptor structures (OpRewritePattern / OpConversionPattern instances) for the lowering patterns, including their root mnemonics, benefits, and source-dialect tags. |
| NVPTX printer string table | The PTX mnemonic and operand-format strings consumed by the asm printer. These are stored XOR-3 encoded (each byte XORed with three) and decoded on first use; the encoded form keeps the readable PTX vocabulary out of the output of the strings utility at the cost of a one-time decode. |
| Bitcode writer string blob | The fixed blob of attribute, type, and metadata tag names that LLVM's bitcode writer would otherwise embed in every output; here it is interned into one rodata region and referenced by offset. |
| cl::opt description text | The help text and default-value descriptions for every command-line option, including the LLVM-inherited options and the Tileiras-specific knobs. |
| C++ typeinfo and vtables | The __cxxabiv1 typeinfo nodes and the vtables for every polymorphic class - dialects, passes, patterns, conversion-target objects, the diagnostic engine, and so on. |
| libdevice bitcode blob | The bundled libdevice bitcode for the supported target families, embedded as a rodata blob and parsed at compile time. |
Each rodata band has a single dominant access pattern: mnemonic pools are read by hash lookup, descriptor arrays are walked sequentially during pass registration, vtables are indexed by slot, and the libdevice blob is read end-to-end exactly once per compilation.
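The XOR-3 encoding described for the NVPTX printer string table is simple enough to show directly. The sketch below assumes only what the table states, namely that each byte is XORed with three and decoded on first use. The LazyMnemonic record is a hypothetical illustration of that first-use behaviour, not a recovered structure.

```cpp
#include <string>

// XOR every byte with the constant 3. The transform is its own inverse,
// so the same routine both encodes and decodes.
std::string xor3(const std::string &in) {
    std::string out = in;
    for (char &c : out) {
        c = static_cast<char>(static_cast<unsigned char>(c) ^ 0x03);
    }
    return out;
}

// Hypothetical table entry that decodes its text once, on first access.
struct LazyMnemonic {
    std::string encoded;  // bytes as stored in the read-only pool
    std::string decoded;  // working copy, filled on first use
    bool ready = false;

    const std::string &get() {
        if (!ready) {
            decoded = xor3(encoded);
            ready = true;
        }
        return decoded;
    }
};
```

Because the transform only flips the two low bits of each byte, encoded mnemonics stay printable but no longer match a search for PTX vocabulary.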
.data Bands
Writable but statically initialised data carries the global state that registration and option handling populate at startup.
| Band | Contents |
|---|---|
| cl::opt globals | The mutable storage for every command-line option (boolean flags, integers, enums, paths). Parsed values land here. |
| Dialect-registration tables | The static arrays the dialect registrar walks during register_operations / register_types. Their entries point into the .rodata mnemonic pools and into the .text hook functions. |
| Pass-registration tables | The list of static PassRegistration and PassPipelineRegistration entries that the pipeline builder consults to assemble each opt-level pipeline. |
| XOR-3 mnemonic walking-cipher pool | The runtime working copy of the NVPTX mnemonic table the printer decodes on first reference. The pool is initialised from rodata and walked entry-by-entry as PTX is emitted. |
| Global constant initialisers | Initialised static objects for the LLVM core, including the LLVM context manager, the target registry, and the intrinsic table. |
The data segment is small relative to text and rodata. Most truly mutable state lives in .bss; the data segment exists primarily to give registration tables and cl::opt storage a stable load-time address.
.bss Bands
Zero-initialised state is where every subsystem keeps its singletons and lazy caches. These are the structures that grow during a compilation and are not meant to outlive the process.
| Band | Contents |
|---|---|
| StorageUniquer hash tables | The hash-set storage that uniques types and attributes. One bucket array per uniquer kind; populated lazily on first lookup. |
| TypeID Meyers-cache singletons | The static local slots used by mlir::TypeID::get<T>() to give each concrete type a stable identity. Each instantiation lands in its own zero-initialised slot guarded by a one-time-init flag. |
| Dialect-singleton storage | The single Dialect instance per dialect kind, owned by the MLIRContext once registration runs. |
| Operation-name registry | The mnemonic-to-AbstractOperation lookup table, populated by dialect registration and read on every operation construction. |
| Pass and pattern statistics | The counter slots that pass and pattern implementations bump for Statistic reporting. |
| Per-thread caches | The TLS slots used by the diagnostic engine, the pass timer, the threaded pass manager, and the pattern applicator's local rewrite buffers. |
| LLVM context state | The LLVMContext singletons and intrinsic caches the codegen path relies on. |
Every entry in the .bss bands is logically owned by exactly one subsystem and is reset (or simply discarded) between compilations of independent modules.
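The TypeID Meyers-cache row can be made concrete with a small stand-in. This is a sketch of the general technique (one function-local static per template instantiation, with its address used as the identity); TypeKey is an illustrative class, not mlir::TypeID itself, and FooOp and BarOp are placeholder types.

```cpp
// Illustrative stand-in for a type-identity handle.
class TypeKey {
public:
    template <typename T>
    static TypeKey get() {
        // One zero-initialised slot per instantiation of T. Its address is
        // unique and stable for the lifetime of the process.
        static char slot;
        return TypeKey(&slot);
    }

    bool operator==(const TypeKey &other) const { return id_ == other.id_; }
    bool operator!=(const TypeKey &other) const { return id_ != other.id_; }

private:
    explicit TypeKey(const void *id) : id_(id) {}
    const void *id_;
};

// Placeholder types used only to exercise the sketch.
struct FooOp {};
struct BarOp {};
```

Repeated calls to TypeKey::get<FooOp>() compare equal, while TypeKey::get<FooOp>() and TypeKey::get<BarOp>() differ, which is the property the slot-per-instantiation layout exists to provide.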
Data Lifetimes
Three lifetime classes span the bands above. Knowing which lifetime a piece of data belongs to is what keeps the layout coherent across a long compilation.
| Lifetime | Data | Ends when |
|---|---|---|
| Static-init lifetime | .rodata pools, .data registration tables, vtables, libdevice blob, cl::opt defaults | Never; constructed during process startup. |
| Compile-unit lifetime | Bytecode buffer, MLIR module, dialect operations, scheduler analyses, LLVM module, libdevice working copy, PTX text | The host relocatable for the unit has been written. |
| Per-pass lifetime | Pattern applicator state, rewrite buffers, conversion target snapshots, diagnostic temporaries | The pass manager finishes the pass. |
Keeping these lifetimes separate matters in practice. cuda_tile operations are not meaningful after their conversion edge has run; scheduler analyses are not meaningful after lowering reaches LLVM; LLVM MachineIR is not meaningful before instruction selection; the XOR-3 mnemonic working copy is not meaningful before the asm printer is invoked.
Reimplementation Notes
A compatible implementation does not have to reproduce the executable's segment layout. It does have to reproduce the contracts that the layout exists to support: a stable static-init order for dialect, type, attribute, and pass registration; a uniquer that returns the same value for equal operands across the process; a .rodata-style discipline that keeps mnemonic pools, diagnostic strings, and printer tables immutable; and a .bss-style discipline that scopes all growth-only structures to the compilation that allocated them. The detailed pages under each subsystem describe these contracts as algorithms and invariants rather than as offsets.
For the reader-side view — file identity, the tools the wiki was produced with, and a recipe for verifying any individual wiki claim against the binary — see Binary Anatomy and RE Methodology.
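The uniquer contract from the reimplementation notes (equal operands always yield the same value) can be sketched as an interning table. This is a minimal sketch under simplifying assumptions: the key is a (kind, width) pair and the table is an ordered map, whereas the wiki describes hash-set storage. Uniquer and TypeStorage are illustrative names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Simplified storage object; real type storage carries richer parameters.
struct TypeStorage {
    std::string kind;
    unsigned width;
};

class Uniquer {
public:
    // Returns the one canonical storage object for (kind, width),
    // creating it on the first request.
    const TypeStorage *get_or_create(const std::string &kind, unsigned width) {
        auto key = std::make_pair(kind, width);
        auto it = table_.find(key);
        if (it == table_.end()) {
            auto storage = std::make_unique<TypeStorage>(TypeStorage{kind, width});
            it = table_.emplace(key, std::move(storage)).first;
        }
        return it->second.get();
    }

    std::size_t size() const { return table_.size(); }

private:
    std::map<std::pair<std::string, unsigned>, std::unique_ptr<TypeStorage>> table_;
};
```

With this shape, type equality reduces to pointer equality, and destroying the Uniquer releases everything it allocated, which matches the compile-scoped growth discipline described above.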
Methodology
This wiki documents TileIR and tileiras from the behavior outward. The private reverse-engineering corpus is used as authoring evidence, but the public pages are written as technical documentation: what the program accepts, what it produces, which invariants it enforces, and how to reimplement compatible pieces.
The most important editorial rule is that implementation archaeology is not the API. A binary address, symbol nickname, or scan artifact may explain how a fact was discovered, but it usually does not help a reader operate tileiras or build compatible tooling. Public pages therefore prefer semantic names, data contracts, algorithms, and pseudocode.
Evidence Standard
Claims in this wiki are expected to rest on one of three kinds of evidence:
- direct behavior exposed by the driver, bytecode format, diagnostics, or emitted output,
- public or reconstructable MLIR/LLVM concepts such as dialect operations, pass contracts, verifier rules, and target attributes,
- repeated implementation evidence strong enough to describe a stable semantic contract.
When evidence is uncertain, pages should narrow the claim instead of exposing the uncertainty machinery. For example, a page should say "this pass materializes producer/consumer pipeline regions" only when that behavior is established. It should not ask the reader to reason from private scan identifiers.
Writing Rules
Public prose should answer the reader's practical questions first:
- What is this component for?
- Where does it sit in the pipeline?
- What inputs does it accept?
- What output or IR shape does it produce?
- What invariants must hold before and after it runs?
- What algorithm would a compatible reimplementation need?
- Which neighboring pages explain the surrounding system?
Code blocks should be reimplementation-grade pseudocode. They should name data structures and steps clearly, avoid private identifiers, and keep each line within 120 characters so the rendered wiki remains readable.
Reverse-Engineering Details
Low-level evidence still matters when it changes behavior. If a recovered detail affects compatibility, document the behavior it implies. Prefer this:
    bool rrt_probe(const RRT *rrt, const NodeRRT *node, int start_cycle) {
        for (int i = 0; i < node->duration; ++i) {
            int row = (start_cycle + i) % rrt->initiation_interval;
            if ((rrt->rows[row] & node->rows[i]) != 0) {
                return false;
            }
        }
        return true;
    }
Avoid presenting the same fact as a list of private function names or address ranges. That form is useful while doing the investigation, but it is not useful documentation for readers who only see the wiki.
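For readers who want to run the probe, a self-contained version needs concrete table types and a reserve step. The sketch below assumes each row is a 64-bit resource bitmask and adds a hypothetical rrt_reserve companion. The struct layouts are illustrative and are not recovered definitions.

```cpp
#include <cstdint>
#include <vector>

// Assumed layout: one resource bitmask per modulo slot.
struct RRT {
    int initiation_interval;
    std::vector<std::uint64_t> rows;  // size == initiation_interval
};

// Assumed layout: one resource bitmask per cycle the node occupies.
struct NodeRRT {
    int duration;
    std::vector<std::uint64_t> rows;  // size == duration
};

// Wraps a cycle into the table, also handling negative cycles.
static int modulo_row(const RRT &rrt, int cycle) {
    int ii = rrt.initiation_interval;
    return ((cycle % ii) + ii) % ii;
}

// True when the node fits at start_cycle without a resource conflict.
bool rrt_probe(const RRT &rrt, const NodeRRT &node, int start_cycle) {
    for (int i = 0; i < node.duration; ++i) {
        if ((rrt.rows[modulo_row(rrt, start_cycle + i)] & node.rows[i]) != 0)
            return false;
    }
    return true;
}

// Hypothetical companion: commits the node's resources if the probe passes.
bool rrt_reserve(RRT &rrt, const NodeRRT &node, int start_cycle) {
    if (!rrt_probe(rrt, node, start_cycle))
        return false;
    for (int i = 0; i < node.duration; ++i)
        rrt.rows[modulo_row(rrt, start_cycle + i)] |= node.rows[i];
    return true;
}
```

With an initiation interval of 2, a one-cycle node reserved at cycle 0 blocks the same resource at cycle 2 (both map to row 0) but leaves cycle 1 free.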
Confidence and Corrections
Claims in this wiki carry confidence tags drawn from the three-tier HIGH/MED/LOW scheme defined in String-Evidence and Confidence Policy. The wiki's working convention is:
- inline tags (HIGH, MED, LOW) are allowed and encouraged next to individual claims when the evidence basis is worth signalling to a reader,
- a deliberately authored "Confidence" or "Evidence" section is acceptable on pages whose subject is genuinely uncertain or contested, but it is not required on every page,
- core prose should still be written from the behavior outward; if a claim is only LOW, prefer to omit it or rephrase it with hedging built into the sentence rather than parking a weak claim behind a tag.
If a later investigation changes a page, update the page inline and remove the stale claim rather than preserving a historical dispute in the main prose. When a contradiction surfaces between two analyses of the same construct, prefer the one with the stronger structural anchor regardless of recency, and restamp dependent pages.
For compatibility-sensitive behavior, prefer executable-looking algorithms and explicit invariant lists. Those are easier to review and easier to port than paragraphs describing how the fact was found.
Page Structure
Most subsystem pages should follow this shape:
- A short purpose statement.
- A pipeline-position diagram or paragraph.
- The public data model or operation families.
- Pseudocode for the central algorithm.
- Invariants and verifier expectations.
- Reimplementation checklist.
- Cross-links to neighboring public pages.
Reference pages may be denser, but they should still avoid private raw evidence paths and address-heavy prose. If a table is too large to help a reader understand behavior, split it into a smaller public table and move the purely investigative detail out of the public wiki.
Subsystem Function Map
Abstract
Tileiras is a single executable, but it behaves as a stack of cooperating subsystems with a fairly rigid call-graph shape. This page is the function map for that stack at the role level: for each subsystem, the conceptual entry points it exposes, the callers it answers to, the callees it dispatches into, and the canonical wiki page where the mechanism is described in depth. The page is meant to answer the question "if I want to find where X happens in this compiler, where do I look?" without requiring a reader to know any internal address.
Call-Graph Shape at the Subsystem Level
The driver is the only entry. It builds a compile configuration, opens the input file, and calls into the bytecode reader. The reader returns an MLIR module rooted in cuda_tile. The driver hands that module, together with the configuration, to the pipeline builder. The pipeline builder asks the dialect registry for the operations and types it will see, asks the lowering subsystem for the conversion patterns it will use, and asks the scheduler for the analysis passes it must register. Pass execution then walks the registered passes in order: dialect-to-dialect lowering rewrites operations in place; the scheduler runs as an embedded phase that produces a ScheduleAnalysis without changing the IR; subsequent materialization passes consume that analysis to emit pipe and mutex operations. After the IR reaches the LLVM/NVVM dialects, the codegen subsystem translates it to an llvm::Module, links libdevice bitcode, runs the LLVM IR pipeline, and hands the result to the NVPTX backend. The backend produces PTX text. The driver then invokes ptxas as a subprocess, captures its object output, and writes the final host relocatable.
In short: driver -> bytecode -> dialects (via the registry) -> lowering -> scheduler (inside lowering) -> codegen -> libdevice link -> NVPTX backend -> ptxas. The MLIR infrastructure subsystem is orthogonal to that chain; every layer above calls into it for operations, types, attributes, regions, contexts, diagnostics, pattern rewriting, and pass management.
Driver
The driver owns process-level behavior: command-line parsing, target validation, CUDA tool discovery (ptxas, fatbinary), output naming, subprocess invocation via posix_spawn, and error reporting. Note: libdevice is not discovered here — it is embedded in the tileiras binary as _mlir_embedded_libdevice.
| Conceptual entry point | Role |
|---|---|
| parse_command_line | Read argv into a structured compile configuration; reject malformed flags. |
| resolve_cuda_installation | Resolve CUDA_HOME / CUDA_PATH to find the ptxas and fatbinary binaries; the libdevice bitcode is embedded in the tileiras binary itself, not loaded from the toolkit. |
| validate_target | Check that the requested GPU architecture is supported, emitting invalid GPU architecture: <name> on rejection. |
| open_input_module | Map the input file and hand its buffer to the bytecode reader. |
| run_compilation | Drive the pipeline against the parsed module and the configuration. |
| invoke_ptxas | Spawn ptxas via posix_spawn with the derived option set and capture its output. |
| invoke_fatbinary | Spawn fatbinary to package one or more .cubin outputs into a fat binary container. |
| report_error | Format a subprocess or pipeline failure as a user-facing diagnostic. |
Callers: process entry. Callees: bytecode reader, pipeline builder, codegen, subprocess harness (ptxas, fatbinary). See Driver Overview, CLI Options, Subprocess Harness.
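The subprocess step can be sketched with the POSIX calls the driver is described as using. This is a minimal sketch, not the driver's harness: run_tool is a hypothetical helper, it passes the parent's environment through unchanged, and it only reports the exit status. Output capture and the actual ptxas option set are left out because this page does not specify them.

```cpp
#include <spawn.h>
#include <sys/wait.h>

#include <string>
#include <vector>

extern "C" char **environ;

// Spawns args[0] (an absolute path) with the given arguments and waits for it.
// Returns the child's exit status, or -1 if the spawn or wait failed or the
// child did not exit normally.
int run_tool(const std::vector<std::string> &args) {
    if (args.empty())
        return -1;

    // posix_spawn expects a null-terminated array of mutable char pointers.
    std::vector<char *> argv;
    for (const std::string &arg : args)
        argv.push_back(const_cast<char *>(arg.c_str()));
    argv.push_back(nullptr);

    pid_t pid = 0;
    if (posix_spawn(&pid, argv[0], nullptr, nullptr, argv.data(), environ) != 0)
        return -1;

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A non-zero return is what an error-reporting layer would turn into a user-facing diagnostic; a fuller harness would also redirect the child's output through posix_spawn file actions.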
Bytecode Reader
The reader owns the on-disk TileIR format. It validates the container, reconstructs the string and attribute tables, and rebuilds MLIR operations and regions in memory.
| Conceptual entry point | Role |
|---|---|
| read_container_header | Check the bytecode magic and version, locate the section table. |
| read_string_section | Restore the interned string pool from its compressed form. |
| read_dialect_section | Resolve each referenced dialect against the dialect registry. |
| read_type_and_attribute_sections | Reconstruct uniqued type and attribute values. |
| read_operation_section | Walk the operation stream, building regions and blocks. |
| verify_module | Invoke the MLIR verifier on the reconstructed module. |
Callers: driver open_input_module. Callees: dialect registry, MLIR infrastructure (storage uniquer, operation builder). See MLIR Bytecode Format.
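Section walkers of this kind lean on a variable-width integer decoder. The sketch below implements the prefix-varint scheme of upstream MLIR bytecode, in which the number of trailing zero bits in the first byte gives the number of extra bytes. Whether the TileIR container uses the same encoding is an assumption here; see MLIR Bytecode Format for the actual contract.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one upstream-MLIR-style prefix varint from buf.
// Returns the number of bytes consumed, or 0 if the buffer is truncated.
std::size_t decode_varint(const std::uint8_t *buf, std::size_t len, std::uint64_t *out) {
    if (len == 0)
        return 0;
    std::uint8_t first = buf[0];

    // Low bit set: a single byte carrying 7 value bits.
    if (first & 1) {
        *out = first >> 1;
        return 1;
    }

    // All-zero prefix byte: the next 8 bytes hold a little-endian 64-bit value.
    if (first == 0) {
        if (len < 9)
            return 0;
        std::uint64_t value = 0;
        for (int i = 0; i < 8; ++i)
            value |= static_cast<std::uint64_t>(buf[1 + i]) << (8 * i);
        *out = value;
        return 9;
    }

    // Otherwise the count of trailing zero bits is the number of extra bytes.
    unsigned extra = 0;
    while (((first >> extra) & 1) == 0)
        ++extra;
    if (len < extra + 1)
        return 0;
    std::uint64_t raw = first;
    for (unsigned i = 1; i <= extra; ++i)
        raw |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
    *out = raw >> (extra + 1);  // drop the prefix bits
    return extra + 1;
}
```

Returning the consumed byte count lets a section walker advance its cursor and treat 0 as a truncated-input error without a separate status channel.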
Dialect Stack
The dialect subsystem registers operations, types, attributes, interfaces, verifiers, folders, and printers for every dialect the compiler understands. Lookup goes through OperationName and RegisteredOperationName, which together bridge a mnemonic and its implementation.
| Dialect | Conceptual role | Page |
|---|---|---|
| cuda_tile | Public tile input dialect; what the bytecode actually contains. | cuda_tile Overview |
| nv_tileaa | Alias-aware memory, token, queue, and pointer layer. | nv_tileaa Overview |
| nv_tileas | Operationally-scheduled async memory and TMA layer. | nv_tileas Overview |
| cute | Target-neutral layout algebra. | cute Overview |
| cute_nvgpu | NVIDIA-architecture atom registry. | cute_nvgpu Overview |
| cutlass | CUTLASS pipeline and tile-scheduler abstractions. | cutlass Overview |
| nvgpu | Stock MLIR GPU bridge layer. | nvgpu Overview |
| nvvm | PTX-facing intrinsic dialect. | NVVM Overview |
Each dialect exposes the same role-level entry points: register_operations, register_types, register_attributes, register_interfaces, and per-operation verify / fold / print / parse hooks. Callers: bytecode reader (to resolve mnemonics on load), lowering (to identify source-dialect operations), MLIR infrastructure (to build operations and unique types).
Lowering
The lowering subsystem owns dialect-to-dialect conversion. It is a set of pattern populations driven by ConversionTarget and DialectConversion. Each lowering page describes the patterns for one source-to-target hop.
| Conceptual entry point | Role |
|---|---|
| populate_<src>_to_<tgt>_patterns | Register the rewrite patterns for one conversion edge. |
| configure_conversion_target | Mark legal and illegal operations for the current hop. |
| build_type_converter | Wire source-dialect types to target-dialect types and addr-space rules. |
| apply_full_conversion | Run the pattern set to fixpoint or fail with a diagnostic. |
| apply_partial_conversion | Run a target-bounded conversion where some operations remain. |
Conversion edges, in order along the main path: cuda_tile -> nv_tileaa -> nv_tileas -> LLVM/NVGPU -> NVVM. See Lowering Overview for the conversion cascade.
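The difference between full and partial conversion on one edge can be shown with a toy model. This sketch follows the general MLIR dialect-conversion rule (partial conversion tolerates operations of unknown legality, full conversion does not) but simplifies heavily: legality is per dialect, patterns are plain functions, and there is no rollback on failure. All type and function names are illustrative.

```cpp
#include <set>
#include <string>
#include <vector>

struct Op {
    std::string dialect;
    std::string name;
};

enum class Legality { Legal, Illegal, Unknown };

// Toy conversion target: legality is decided by dialect name only.
struct TargetSketch {
    std::set<std::string> legal;
    std::set<std::string> illegal;

    Legality classify(const Op &op) const {
        if (illegal.count(op.dialect)) return Legality::Illegal;
        if (legal.count(op.dialect)) return Legality::Legal;
        return Legality::Unknown;
    }
};

// A pattern rewrites the op in place and returns true if it applied.
using Pattern = bool (*)(Op &);

// Returns false if an illegal op survives, or, in full mode,
// if any op is left that is not explicitly legal.
bool apply_conversion(std::vector<Op> &ops, const TargetSketch &target,
                      const std::vector<Pattern> &patterns, bool full) {
    for (Op &op : ops) {
        if (target.classify(op) == Legality::Legal) continue;
        for (Pattern pattern : patterns) {
            if (pattern(op)) break;
        }
        Legality after = target.classify(op);
        if (after == Legality::Illegal) return false;
        if (full && after != Legality::Legal) return false;
    }
    return true;
}

// Illustrative pattern for the cuda_tile -> nv_tileaa edge: only the dialect
// changes; a real pattern would also rewrite operands and types.
bool tile_to_tileaa(Op &op) {
    if (op.dialect != "cuda_tile") return false;
    op.dialect = "nv_tileaa";
    return true;
}
```

Running each hop as its own target-plus-pattern-set pair is what lets the cascade above be described one edge at a time.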
Scheduler
The scheduler is a phase inside lowering. It does not modify operations; it computes a ScheduleAnalysis (stage, order, and resource placement per operation) that later materialization passes consume.
| Conceptual entry point | Role |
|---|---|
| build_constraints | Walk the loop region, build dependence and resource constraints. |
| compute_mii_bounds | Compute resource, recurrence, fine-density, and dependence lower bounds on II. |
| place_groups | Choose an II and place operation groups into the resource reservation table. |
| solve | Run the linear schedule solve over a fixed placement. |
| materialize_pipes_and_mutexes | Emit Pipe_ and Mutex_ values once placement is fixed. |
Callers: the TileAA-to-TileAS conversion edge, plus the warp-specialize pipelines. Callees: MLIR infrastructure (analysis manager, attribute storage). See Scheduler Overview and Modulo Scheduler.
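Two of the lower bounds named in compute_mii_bounds have textbook forms from modulo scheduling, sketched below. This is the classic formulation rather than the tileiras implementation: the fine-density and dependence bounds listed in the table are not modelled, and the struct names are illustrative.

```cpp
#include <algorithm>
#include <vector>

// How often one iteration uses a resource, and how many units of it exist.
struct ResourceUse {
    int uses_per_iteration;
    int units;
};

// One dependence cycle: summed latency and summed iteration distance.
struct Recurrence {
    int total_latency;
    int distance;
};

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Resource bound: the busiest resource limits how often iterations can start.
int resource_mii(const std::vector<ResourceUse> &uses) {
    int bound = 1;
    for (const ResourceUse &u : uses)
        bound = std::max(bound, ceil_div(u.uses_per_iteration, u.units));
    return bound;
}

// Recurrence bound: each cycle needs II * distance >= total latency.
int recurrence_mii(const std::vector<Recurrence> &cycles) {
    int bound = 1;
    for (const Recurrence &c : cycles)
        bound = std::max(bound, ceil_div(c.total_latency, c.distance));
    return bound;
}

// The minimum initiation interval is the larger of the two bounds.
int minimum_ii(const std::vector<ResourceUse> &uses, const std::vector<Recurrence> &cycles) {
    return std::max(resource_mii(uses), recurrence_mii(cycles));
}
```

The result is only a lower bound; group placement may still have to raise the II when the reservation table cannot be filled at that value.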
Codegen and NVPTX Backend
Codegen is the boundary between MLIR and the LLVM toolchain. It produces an llvm::Module, links libdevice, runs the LLVM IR pipeline, then hands control to the NVPTX backend, which selects instructions and prints PTX.
| Conceptual entry point | Role |
|---|---|
| translate_module_to_llvm | Convert the LLVM/NVVM-dialect module into an in-memory llvm::Module. |
| link_libdevice | Resolve libdevice math calls against the bundled bitcode. |
| run_llvm_passes | Run the LLVM IR pipeline (NVVM reflection, address-space opt, arg lowering, etc.). |
| select_nvptx_instructions | Match LLVM IR against the NVPTX matcher table to produce MachineIR. |
| verify_nvptx_machine_ir | Run LLVM's MachineVerifierPass over the NVPTX MachineIR before printing. |
| emit_ptx_text | Print the final PTX module. |
Callers: pipeline driver, after lowering reaches LLVM dialect. Callees: libdevice subsystem, MLIR translation infrastructure. See Codegen Overview, NVPTX Backend Passes.
libdevice
libdevice is the bitcode that supplies math intrinsics. It is embedded as the symbol _mlir_embedded_libdevice inside the tileiras binary (not loaded from the CUDA toolkit), parsed once per compilation, and integrated by the LLVM IR pipeline.
| Conceptual entry point | Role |
|---|---|
| load_libdevice_bitcode | Parse the embedded libdevice bitcode (symbol _mlir_embedded_libdevice linked into the tileiras binary) into an llvm::Module. |
| resolve_nvvm_reflect | Replace __nvvm_reflect calls with the compile-time reflection answer. |
| inline_selected_math | Inline the math functions whose bodies are pulled into the kernel module. |
| prune_unused_bodies | Drop libdevice functions that survived as unused declarations. |
Callers: codegen during the LLVM IR pipeline. See libdevice Overview, NVVM Reflect Mechanism.
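The reflection step can be pictured as a substitution of constants for query calls. The sketch below works on source-like text rather than LLVM IR, so it is only an analogy for resolve_nvvm_reflect. The `resolve_reflect` helper is hypothetical, and defaulting unknown names to 0 mirrors upstream LLVM's NVVMReflect pass, which is assumed here to match tileiras.

```python
import re

# Matches __nvvm_reflect("NAME") and captures NAME.
REFLECT_CALL = re.compile(r'__nvvm_reflect\("([^"]+)"\)')


def resolve_reflect(text, answers):
    """Replace each __nvvm_reflect("name") with its integer answer.

    Names missing from `answers` resolve to 0 (assumption: same default as
    upstream LLVM's NVVMReflect pass).
    """
    return REFLECT_CALL.sub(lambda m: str(answers.get(m.group(1), 0)), text)
```

Once the query is a constant, ordinary constant folding and dead-code elimination remove the branch that was not selected, which is why reflection has to run before the libdevice bodies are inlined and pruned.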
MLIR Infrastructure
The infrastructure is the shared runtime model every other subsystem depends on.
| Conceptual entry point | Role |
|---|---|
| MLIRContext::getOrLoadDialect | Look up or create a dialect in the current context. |
| StorageUniquer::getOrCreate | Intern a type or attribute storage object. |
| OperationName::resolve | Bind a mnemonic to its registered operation hook table. |
| OpBuilder::create | Build a new operation with operands, results, attributes, and regions. |
| PatternApplicator::matchAndRewrite | Drive pattern matching against an operation. |
| PassManager::run | Execute the configured pass pipeline against a module. |
| Diagnostic::emit | Surface an error or warning attached to a location. |
Callers: every other subsystem in this map. See MLIR Infrastructure Overview, Storage Uniquer, Operation Layout.
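Interning is the idea behind the StorageUniquer entry: two requests with the same parameters return the same storage object, so equality reduces to a pointer comparison. The sketch below is a minimal Python model of that contract; the class and method names are illustrative and do not reflect the C++ layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storage:
    """Immutable storage record for one uniqued type or attribute."""
    kind: str
    params: tuple


class Uniquer:
    """Toy interning table keyed by (kind, params)."""

    def __init__(self):
        self._table = {}

    def get_or_create(self, kind, *params):
        key = (kind, params)
        storage = self._table.get(key)
        if storage is None:
            # First request for this key: allocate and remember the storage.
            storage = self._table[key] = Storage(kind, params)
        return storage
```

Requesting `("tile", (128, 64), "f16")` twice yields one object, so `a is b` holds for the two results; a different shape yields a distinct object.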
How to Use This Map
Locate the behavior you want to understand by role: input parsing belongs to the driver and bytecode reader; semantic transformation belongs to lowering; placement decisions belong to the scheduler; libdevice math belongs to libdevice; PTX selection and emission belong to codegen and the NVPTX backend. The "page" link in each subsystem section is the canonical entry; child pages drill into one specific mechanism per page.
Versions and Fingerprints
This page records the version identifiers that matter for users and compatible implementations. It avoids private evidence anchors and focuses on the public compatibility contract: which CUDA release, LLVM lineage, dialect version family, and backend behavior this wiki describes.
Version Table
| Component | Version or identity | Compatibility meaning |
|---|---|---|
| CUDA toolkit | CUDA 13.1, toolkit banner V13.1.80, build tag local.local.36836380_ | The documented driver, dialects, and target defaults describe the CUDA 13.1 tileiras binary. The build tag identifies the exact NVIDIA-internal snapshot. |
| LLVM base | Host C++ link target: internal LLVM snapshot identifying as LLVM21.0.0git | LLVM IR, MLIR infrastructure, NVVM lowering, and NVPTX backend behavior should be read as LLVM-21-era behavior plus NVIDIA patches. |
| MLIR base | Co-tracked with the LLVM 21 snapshot | Operation, type, attribute, pass-manager, and bytecode infrastructure follow the corresponding MLIR generation. The bytecode reader's AttrTag numbering is wire-format-forked from upstream; see MLIR Bytecode Format. |
| Embedded device-bitcode producer (recent) | clang version 16.0.0 (NVIDIA internal), subtarget triple nvptx64-nvidia-gpulibs | The fp128 / __nv_fp128 softfloat module embedded inside the binary was compiled by an NVIDIA-internal clang-16 toolchain. Different from the host LLVM-21 link target. |
| Embedded device-bitcode producer (legacy) | clang version 7.1.0 git-630d6c22278, subtarget triple nvptx64-nvidia-gpulibs | A second, older embedded LLVM IR blob (the __nv_*128 integer family) carries this producer string. A compatible reimplementation must accept that the binary ships two embedded IR generations side by side. |
| Embedded soft-math providers | Berkeley SoftFloat (extF80 / f128M_* / softfloat_*), Sleef (Sleef_*, Sleef_rempitabqp, qp_cuda_sleefq) | fp128 arithmetic, fp128 transcendentals, and the payne_hanek argument reduction table are sourced from these third-party libraries linked into the embedded gpulibs IR, not from the main __nv_* libdevice math set. |
| Primary input dialect | cuda_tile TileIR bytecode | The accepted input is serialized MLIR bytecode carrying the public tile dialect. |
| Main target family | Blackwell-family targets, defaulting to sm_100; PTX produced through the LLVM-21 NVPTX AsmPrinter (header line Based on LLVM 21.0.0git) | Many docs assume Hopper/Blackwell-era TMA, WGMMA, and tensor-memory features. The user-target triple is nvptx64-nvidia-cuda; the gpulibs subtarget triple is nvptx64-nvidia-gpulibs. |
| NVPTX backend | LLVM 21 NVPTX with NVIDIA-internal extensions | Backend pass and intrinsic behavior extends stock upstream LLVM. |
| libdevice | CUDA 13.1 libdevice bitcode, exported as _mlir_embedded_libdevice | Device math calls are linked, reflected, inlined, and optimized before PTX emission. The bitcode is embedded as an MLIR-side resource, not loaded from disk. |
| Content hashing | BLAKE3-style construction (internal use only) | Used for IR object interning, deduplication, and caching. Not a public ABI. |
LLVM and MLIR Lineage
The key compatibility fact is that tileiras uses an LLVM/MLIR stack aligned with LLVM 21 development. That affects:
- MLIR bytecode reader behavior,
- operation, type, attribute, and interface mechanics,
- pass-manager and rewrite-pattern infrastructure,
- LLVM bitcode writing,
- NVVM intrinsic naming and lowering,
- NVPTX instruction selection and PTX emission.
A compatible reimplementation does not need to reproduce every linked LLVM helper. It does need to match the observable LLVM/NVVM contracts: data layout, target attributes, intrinsic lowering, kernel ABI, libdevice handling, and PTX backend expectations.
The binary also embeds device-side LLVM IR that was produced by older clang generations (16.0.0 and 7.1.0) running against the nvptx64-nvidia-gpulibs subtarget. The host LLVM-21 framework consumes that prebuilt IR through the standard bitcode reader; a reimplementation only needs to honor the producer-string and subtarget-triple shapes, not rebuild the embedded IR from source.
NVIDIA Extensions
The backend is not just stock upstream LLVM. It includes NVIDIA extensions for newer NVVM operations, Blackwell tensor-memory support, target-specific verifiers, NVVM reflection handling, parameter-space lowering, address-space specialization, and NVPTX machine-level cleanup.
The practical rule is:
Treat generic LLVM behavior as LLVM-21-era behavior.
Treat NVVM, NVPTX, TileIR, and tensor-memory behavior as NVIDIA-extended behavior.
When a page documents tcgen05, TMA, WGMMA, cluster launch control, TileAS scheduling, or CUTLASS pipeline lowering, assume NVIDIA-specific semantics unless the page explicitly names an upstream MLIR or LLVM feature.
External Dependency Surface
A reimplementation must account for every third-party or NVIDIA-internal component that crosses the binary's compatibility surface, not only the LLVM host link. The table below pins each one to a concrete integration point.
| Dependency | Where it crosses into tileiras | Anchor inside the wiki |
|---|---|---|
| LLVM 21 host library | C++ link target. Provides IR types, pass manager, bitcode reader/writer, NVPTX backend, AsmPrinter. | LLVM Fingerprint Table |
| MLIR (LLVM-21 generation) | C++ link target. Operation/type/attribute/interface mechanics, pass-manager, dialect registration, bytecode reader. | MLIR Bytecode Format, Dialect Asm-Printer Status |
| NVPTX backend extensions | Inside the LLVM host library, but with NVIDIA-internal passes, intrinsics, and Matcher tables. | NVPTX Backend Passes, LLVM Fingerprint Table §6, §8 |
| Embedded libdevice bitcode | Linked at module construction via _mlir_embedded_libdevice. CUDA 13.1 generation. | libdevice Overview, NVPTX Bring-up and Target Init |
| Embedded clang-16 device IR | Bitcode resource compiled by NVIDIA-internal clang 16.0.0 against nvptx64-nvidia-gpulibs. Carries the __nv_fp128 softfloat family. | Math Pass Pipeline and Crosswalk |
| Embedded clang-7.1 device IR | Bitcode resource compiled by NVIDIA-internal clang 7.1.0 (git-630d6c22278). Carries the __nv_*128 integer family. | Math Pass Pipeline and Crosswalk |
| Berkeley SoftFloat | Statically linked inside the embedded gpulibs IR. Drives fp128 arithmetic (f128M_*, softfloat_*). | Math Pass Pipeline and Crosswalk |
| Sleef | Statically linked inside the embedded gpulibs IR. Drives fp128 transcendentals (Sleef_*, qp_cuda_sleefq, Sleef_rempitabqp). | Math Pass Pipeline and Crosswalk |
| BLAKE3 content hashing | Internal interning, deduplication, and caching. Not a public ABI. | (no public surface) |
| Host C runtime | libpthread, libdl, GLIBC 2.3.4-baseline. Used for synchronization and dynamic loading; no CUDA driver linkage. | (no public surface) |
The integration points worth checking when a new CUDA release lands are concentrated in three places:
- Bytecode envelope and AttrTag numbering: the wire-format fork from upstream MLIR.
- _mlir_embedded_libdevice and the gpulibs subtarget triples: the device-side IR contract.
- NVPTX AsmPrinter header and MatcherTable: the PTX emission contract.
Bytecode and Dialect Compatibility
The bytecode reader expects a TileIR-specific MLIR bytecode container. The public input dialect is cuda_tile; internal dialects such as nv_tileaa, nv_tileas, cute, cute_nvgpu, and cutlass are normally constructed by the pipeline or by frontend-specific producers.
Compatible tooling should preserve these boundaries:
- bytecode producers emit valid cuda_tile programs,
- dialect conversion lowers toward internal dialects in one direction,
- internal dialects are not treated as stable standalone file formats unless a page explicitly describes a textual debugging surface,
- target-specific dialects are verified against the selected compute capability.
Content Hashing
BLAKE3-style content hashing is used internally for IR object identity, deduplication, and caching. Equivalent IR objects receive stable identities within a compiler run, but the hashes are not a public ABI; treat them as implementation support.
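The sketch below shows content-addressed deduplication in general terms. Python's standard library has no BLAKE3, so `hashlib.blake2b` stands in for it; the serialization scheme, the `content_id` helper, and the `DedupCache` class are all assumptions made for illustration and say nothing about the actual hash inputs used by tileiras.

```python
import hashlib


def content_id(kind, fields):
    """Hash a canonical serialization so equal content gets an equal id."""
    h = hashlib.blake2b(digest_size=16)  # stand-in for a BLAKE3-style hash
    h.update(kind.encode())
    for name in sorted(fields):  # sort so field order does not matter
        h.update(b"\0" + name.encode() + b"=" + repr(fields[name]).encode())
    return h.hexdigest()


class DedupCache:
    """Keeps one object per distinct content id within a single run."""

    def __init__(self):
        self.objects = {}

    def intern(self, kind, fields):
        key = content_id(kind, fields)
        return self.objects.setdefault(key, (kind, dict(fields)))
```

Two layout records that differ only in field order hash to the same id and share one cached object, which is the "stable identity within a compiler run" property described above.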
Version-Sensitive Pages
Some pages are especially tied to CUDA 13.1 and the LLVM 21-era backend:
- NVVM Dialect Overview
- NVPTX Backend Passes
- Codegen Overview
- libdevice Overview
- MLIR Bytecode Format
- Position in nvcc 13.1
If a future CUDA release changes the bytecode schema, dialect roster, target defaults, or NVPTX intrinsic set, these pages should be reviewed first.
Glossary
This glossary defines the public terms used throughout the tileiras wiki. It focuses on behavior, data models, dialects, passes, and target concepts. Detailed operation rosters live in their dialect pages.
Core Tools
| Term | Meaning |
|---|---|
| tileiras | CUDA TileIR optimizing assembler. It consumes TileIR MLIR bytecode and produces a host object containing compiled GPU code. See Driver Overview. |
| ptxas | NVIDIA assembler invoked after PTX emission to produce the final GPU binary payload. See ptxas Handoff Protocol. |
| nvdisasm | NVIDIA disassembler optionally invoked to produce annotated SASS output. |
| cicc | CUDA C++ device compiler. It shares the LLVM/NVPTX backend family with tileiras but starts from CUDA frontend output, not TileIR bytecode. See cicc Comparison. |
| TileIR | NVIDIA's MLIR-based tile program representation consumed by tileiras. The serialized bytecode form carries builtin.module containers whose gpu.module payloads are expressed in the cuda_tile dialect; passing through the full lowering cascade it becomes nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass, nvgpu, nvvm, and finally llvm. See Pipeline Overview. |
| TileAS | The pass family and dialect-family name covering scheduling, layout, async pipeline, CTA cluster, and buffer-management work over nv_tileas IR. The CLI prefix and option names use the lowercase form (tileas-*); prose uses TileAS. See TileAS Pass Families. |
Dialects
Each dialect occupies one layer of the lowering pipeline. The early dialects preserve tile semantics, the middle dialects make layout and scheduling explicit, and the late dialects bridge to NVVM and LLVM.
| Term | Meaning |
|---|---|
| cuda_tile | Public input dialect for tile programs. It describes tile arithmetic, memory, control flow, tokens, tensor views, and kernel entries. The dialect is the only public surface; the rest of the cascade is NVIDIA-private. See cuda_tile Overview. |
| nv_tileaa | Alias-aware internal dialect below cuda_tile. It introduces explicit memory references, pointer provenance, tokens, queues, and reuse markers so later passes can reason about aliasing without re-deriving it. See nv_tileaa Overview. |
| nv_tileas | Operational async-scheduling dialect. It represents producer/consumer pipelines, TMA-ready memory operations, layout conversion, and scheduled regions. The TileAS pass family runs on this dialect. See nv_tileas Overview. |
| cute | Target-neutral layout algebra dialect derived from CuTe concepts: shape, stride, layout, tile, coord, swizzle, and tiled atom descriptors. Used to express layout transformations and tile partitioning. See cute Overview. |
| cute_nvgpu | NVIDIA architecture atom dialect for MMA, WGMMA, TMA, TMEM, ldmatrix, stmatrix, and target-specific copy operations. Each atom is parameterised by SM tier (SM70..SM120). See cute_nvgpu Overview. |
| cutlass | CUTLASS pipeline dialect for tile schedulers, sequence barriers, pipeline roles, block-striped operations, and persistent kernel structure. Models the CUTLASS programming-model abstractions as MLIR ops. See cutlass Overview. |
| nvgpu | Stock MLIR NVIDIA GPU bridge dialect used before NVVM conversion. Acts as an intermediate between high-level GPU intent and concrete NVVM intrinsics. See nvgpu Overview. |
| nvvm | MLIR dialect representing NVVM/PTX-facing intrinsics and target operations before LLVM IR materialization. See NVVM Overview. |
| llvm | MLIR LLVM dialect used as the last MLIR form before creating an LLVM module. |
Tile and Layout Terms
| Term | Meaning |
|---|---|
| Tile | A logical block of tensor data operated on as a unit. |
| Shape | Extents of a tile, tensor view, or coordinate tuple. |
| Stride | Offset step associated with each coordinate dimension. |
| Layout | Mapping from logical coordinates to physical offsets, usually shape plus stride and optional swizzle. |
| Swizzle | Bit permutation used to match hardware layout requirements or avoid memory-bank conflicts. |
| Coord | Coordinate value used to index a shape or layout. |
| View | Pointer or memref plus shape, stride, element type, and memory-space metadata. |
| Tensor view | High-level view of a tensor region with shape and stride semantics. |
| Partition view | View that partitions a tensor or tile among program dimensions, lanes, warps, or agents. |
| Atom | A hardware-sized operation descriptor such as a copy atom, MMA atom, or TMA atom. |
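The Layout and Swizzle entries can be made concrete with a few lines of arithmetic. In the sketch below, a layout maps a coordinate to an offset as the dot product of coordinate and stride, and the swizzle is an XOR-based bit permutation in the style of CuTe's Swizzle<bits, base, shift>. The function names are illustrative, and the XOR form is one common swizzle, not necessarily the only one the cute dialect supports.

```python
def layout_offset(coord, stride):
    """Offset of a logical coordinate: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def xor_swizzle(offset, bits, base, shift):
    """XOR `bits` bits starting at (base + shift) into the bits starting at base.

    Assumes shift > 0. The high bits are left unchanged, so the mapping is a
    permutation of offsets (it is its own inverse).
    """
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)
```

For a row-major 4x8 tile (stride (8, 1)), coordinate (2, 3) maps to offset 19; applying `xor_swizzle(19, bits=2, base=0, shift=3)` moves it to 17, and every offset in the tile still maps to a distinct location.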
Scheduling Terms
| Term | Meaning |
|---|---|
| Stage | Logical software-pipeline stage assigned by the TileAS scheduler. Operations in stage k start k iterations of the prologue ahead of the steady-state. |
| Order | Deterministic tie-breaker within a stage. Together with stage, it forms the (stage, order) pair downstream materialization consumes. |
| Initiation interval (II) | Number of cycles between starts of successive software-pipeline iterations. The minimum II respects both data-dependence and resource constraints. |
| RRT | Resource Reservation Table. A bitset table with one row per cycle modulo the candidate II, where each row is a bitset of resource classes. Used to test whether an operation can occupy a candidate modulo cycle. See Resource Constraint Builder and RRT. |
| Resource footprint | Per-operation resource occupancy over one or more cycles. The scheduler reads it before probing an RRT slot. |
| ScheduleAnalysis | Preserved MLIR analysis carrying the fixed schedule from TileASGenerateSchedule to TileASMaterializeSchedule. The two-pass split is what lets the scheduler decide once and the materializer apply once. |
| MaterializeSchedule | The TileAS pass that consumes the cached ScheduleAnalysis and emits Pipe_ / Mutex_ SSA values along with the cute_nvgpu.arch.agent_switch partitioning at warp-specialized boundaries. See Async/Pipeline Family. |
| Pipe_ | Concrete producer/consumer coordination value emitted after schedule placement. Models a depth-d ring buffer with bounded slack between producer and consumer stages. See Pipe_ and Mutex_ Value-Header Layout. |
| Mutex_ | Concrete mutual-exclusion coordination value emitted after schedule placement. Models a zero-slack serialization edge: iteration i of the protected region must complete before iteration i+1 starts. |
| Schedule::solve | Materialization algorithm that groups producers and consumers into Pipe_ values after placement is fixed. See Schedule::solve and Cost Evaluators. |
| VLIW | Very Long Instruction Word. Used in the scheduler context to describe how multiple operations get bundled into a single issue slot — the modulo scheduler emits VLIW-style packed schedules when the target pipeline has multiple parallel function units. |
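The RRT and resource-footprint entries describe a probe-then-reserve protocol that is easy to model. The sketch below keeps one integer bitset per modulo cycle and treats a footprint as a list of bitmasks, one per cycle the operation occupies. The class name, method names, and resource bit assignments are hypothetical; only the "one row per cycle modulo II, each row a bitset of resource classes" shape comes from the glossary.

```python
class ReservationTable:
    """Toy resource reservation table for a candidate II."""

    def __init__(self, ii):
        self.ii = ii
        self.rows = [0] * ii  # rows[c] = bitset of resource classes busy at cycle c mod II

    def fits(self, cycle, footprint):
        """True if every mask in the footprint lands on free resources."""
        rows = list(self.rows)  # scratch copy also catches self-overlap on wraparound
        for delta, mask in enumerate(footprint):
            slot = (cycle + delta) % self.ii
            if rows[slot] & mask:
                return False
            rows[slot] |= mask
        return True

    def reserve(self, cycle, footprint):
        """Reserve the footprint if it fits; report whether it was placed."""
        if not self.fits(cycle, footprint):
            return False
        for delta, mask in enumerate(footprint):
            self.rows[(cycle + delta) % self.ii] |= mask
        return True
```

With II = 2 and a TMA bit of 1, reserving TMA at cycle 0 blocks a second TMA at cycle 2 because both fall on modulo cycle 0. When no slot fits, a scheduler of this kind would typically retry with a larger II.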
Async Pipeline Terms
| Term | Meaning |
|---|---|
| Producer | Agent or region that fills a pipeline stage. |
| Consumer | Agent or region that waits for and reads a produced pipeline stage. |
| Pipeline stage | Rotating buffer slot shared by producer and consumer agents. |
| Producer acquire | Operation that grants producer ownership of a stage. |
| Producer commit | Operation that publishes a filled stage to consumers. |
| Consumer wait | Operation that waits for a committed stage. |
| Consumer release | Operation that returns a consumed stage to the pipeline. |
| Pipeline iterator | SSA value identifying the current rotating stage. |
| Agent switch | Operation that selects producer or consumer agent regions under warp specialization. The nv_tileas.async.pipeline.agent_switch op is the IR-visible form. |
| AWS | Agent-Warp-Specialized. The dispatch mode MaterializeSchedule selects when distinct producer and consumer agents are partitioned across warps; the alternative AUS (Agent-Unspecialized) has a single SIMT agent owning both. The nv_tile.aws.* attribute family threads scheduling keys back into the AsyncValue headers. |
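The four handshake operations in the table form a small state machine over a ring of stages. The sketch below models it single-threaded: where real hardware would block on a barrier, these methods return None instead. The `Pipeline` class and its method names are illustrative only and are not the nv_tileas op signatures.

```python
class Pipeline:
    """Depth-d ring of stages shared by one producer and one consumer."""

    def __init__(self, depth):
        self.depth = depth
        self.full = [False] * depth
        self.data = [None] * depth
        self.head = 0  # producer's pipeline iterator
        self.tail = 0  # consumer's pipeline iterator

    def producer_acquire(self):
        """Grant the next stage to the producer, or None if it is still full."""
        stage = self.head % self.depth
        return None if self.full[stage] else stage

    def producer_commit(self, stage, value):
        """Publish a filled stage to the consumer and advance the iterator."""
        self.data[stage] = value
        self.full[stage] = True
        self.head += 1

    def consumer_wait(self):
        """Return the next committed stage, or None if nothing is ready."""
        stage = self.tail % self.depth
        return stage if self.full[stage] else None

    def consumer_release(self, stage):
        """Read the stage, hand it back to the producer, advance the iterator."""
        value = self.data[stage]
        self.full[stage] = False
        self.tail += 1
        return value
```

With depth 2 the producer can run at most two stages ahead: a third acquire fails until the consumer releases stage 0.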
GPU Architecture Terms
| Term | Meaning |
|---|---|
| SM (Streaming Multiprocessor) | The basic GPU compute unit. Each SM owns a register file, a shared-memory bank, warp schedulers, and one or more tensor-core pipelines. Targets are named by SM tier: tileiras emits for the Blackwell family (sm_100, sm_103, sm_110, sm_120, sm_121). See GPU Execution Model. |
| CTA (Cooperative Thread Array) | The PTX-level name for a thread block. A CTA contains 1 to 1024 threads grouped into warps; threads in the same CTA share an SMEM allocation and can synchronize through CTA-local barriers. See GPU Execution Model. |
| Warp | 32 threads executing in SIMT lockstep on the same SM. The warp is the unit of instruction issue, divergence, and most synchronization primitives. See GPU Execution Model. |
| Warp-group | Four contiguous warps, 128 threads. The unit of cooperation for WGMMA on Hopper and for several Blackwell tensor-memory operations. See WGMMA Emission Protocol. |
| Cluster | A grouping of CTAs, introduced with SM90 (Hopper), whose members share distributed shared memory and can use cluster-scope barriers. The portable cluster size is 1 to 8 CTAs. See Cluster Sync and DSMEM Handshake. |
| Grid | The whole kernel launch — a 1D/2D/3D array of CTAs scheduled together by the driver. |
| Register file | The per-SM bank of 32-bit registers, partitioned among resident warps. Tileiras's register-pressure heuristics and the modulo scheduler both reason about this resource. |
| SMEM (Shared Memory) | Per-CTA on-chip memory. Around 228 KB usable per SM on H100-class parts; bandwidth on the order of tens of TB/s. Used for tiles, mbarriers, and TMA staging. |
| GMEM (Global Memory) | Device-wide off-chip DRAM. Tens to hundreds of GB on data-center parts. Accessed through ld.global, cp.async, or TMA. |
| DSMEM (Distributed Shared Memory) | Cross-CTA shared memory inside a cluster: each cluster member can address shared memory of every peer through nvvm.mapa plus llvm.addrspacecast. The handshake pairs nvvm.cluster.arrive and nvvm.cluster.wait with optional fences. See Cluster Sync and DSMEM Handshake. |
| TMEM (Tensor Memory) | SM100+ on-chip memory used as the operand and accumulator store for tcgen05.mma. A separate address space (addrspace 6 in upstream LLVM's NVPTX numbering, where addrspace 4 is the constant space) with its own load/store and copy primitives. See tcgen05 Tensor Memory Model. |
| TMA (Tensor Memory Accelerator) | SM90+ async bulk tensor-copy engine. Driven by 128-byte tensormap descriptors and the cp.async.bulk.tensor family. See TMA TensorMap and cp.async.bulk. |
| S2T copy | Shared-to-tensor-memory copy. Blackwell-specific transfer from SMEM to TMEM, used to stage tcgen05.mma operands. The cute_nvgpu.atom.copy_make_s2t_copy_op family models it. |
| WGMMA | Warp-group matrix multiply-accumulate, introduced for Hopper tensor cores. Issued by a 128-thread warp group cooperatively against an SMEM-resident B descriptor and a register or SMEM A descriptor. See WGMMA Emission Protocol. |
| UMMA | Unified MMA family used by Blackwell tensor-memory operations. Issued through tcgen05.mma with accumulator and operands in TMEM. |
| IMMA | Integer matrix multiply-accumulate. The PTX instruction family for integer MMA tiles; appears in mixed-precision MMA paths alongside the floating-point families. |
| GMMA descriptor | Synonym for SMEM descriptor in the WGMMA context. The 64-bit shared-memory descriptor that encodes the SMEM base address (low 14 bits, in 16-byte units) plus leading and stride byte offsets pinning the 2D tile shape into shared memory. WGMMA operand B is always an SMEM descriptor; operand A is either a register fragment or an SMEM descriptor. |
| SMEM descriptor | See GMMA descriptor. |
| f8E8M0FNU | 8-bit floating-point variant used as the scale-factor type in block-scaled MMA. Encodes a pure exponent (no mantissa, no sign), giving microscale factors a wide dynamic range from a single byte. See also e8m0 under Math and Precision. |
| Microscale | Block-scaled MMA where each tile of operand data carries a small shared scale factor (typically f8E8M0FNU). Allows narrow operand types (FP4 and FP8 mantissa) to express a wide effective dynamic range. See Fast-Math and Numerical Precision. |
| collector::a | The tcgen05.mma accumulator-mode parameter selecting how the accumulator participates: use reads and writes, fill writes only (zero-init equivalent), discard writes only with no read dependency. The kind-word verifier at sub_1AD26A0 packs this into the same bitfield as cta_group. |
| tcgen05 | Blackwell tensor-memory instruction family exposed through NVVM/NVPTX lowering. Covers tcgen05.mma, tcgen05.cp, tcgen05.commit, and the synchronizing primitives. See tcgen05 Tensor Memory Model. |
| mma.sync | Warp-cooperative matrix multiply-accumulate on SM70 through SM89. Operands and accumulator live in registers; the whole warp issues the operation together. Superseded by WGMMA on Hopper and tcgen05.mma on Blackwell, but still emitted for older targets. |
| ldmatrix | Synchronous instruction family that loads matrix fragments from shared memory into per-thread registers shaped for mma.sync/WGMMA consumption. The SMEM-to-RF companion to cp.async/cp.async.bulk. |
| stmatrix | Synchronous matrix-fragment store from registers back to shared memory. The store-side counterpart to ldmatrix. |
| cp.async | Ampere (SM80+) asynchronous global-to-shared copy family. Decouples the load issue from the data-ready point through commit-and-wait groups. |
| cp.async.bulk | SM90+ bulk async copy family covering both tensor and non-tensor variants. The tensor variant is the TMA path; the non-tensor variant carries plain byte ranges. |
| cp.async.bulk.tensor | Hopper/Blackwell bulk tensor-memory copy family used by TMA, driven by tensormap descriptors. |
| mbarrier | Transactional barrier object held in shared memory. Used by TMA, async copy, and the producer/consumer handshake to coordinate arrivals and byte-count transactions across warps. See mbarrier State Machine. |
| NamedBarrier (bar.sync N) | The CTA-local barrier pool indexed by a small integer (0-15). Distinct from mbarriers: bar.sync is a hardware-implemented synchronous barrier with no transactional state, used for sub-CTA synchronization at warp-specialized boundaries. |
PTX and SASS
| Term | Meaning |
|---|---|
| PTX | NVIDIA's virtual ISA and target-independent intermediate representation. Tileiras emits PTX text that ptxas then translates to a concrete SM's SASS. See ptxas Handoff Protocol. |
| SASS | NVIDIA's hardware ISA, generated by ptxas from PTX and specific to one SM tier. Tileiras itself does not emit SASS; it relies on ptxas for instruction selection at that level. See PTX Version and Target Selection. |
| State space | PTX's address-space designation on a load/store or pointer: global, shared, local, constant, param, or the unspecified generic. State spaces map to MLIR memory spaces and to LLVM address spaces in the NVVM target. See AddrSpace Vote Lattice. |
| Inline PTX | LLVM inline assembly carrying PTX text and operand constraints. Tileiras emits inline PTX for primitives the NVVM intrinsics layer does not cover directly. |
Backend Terms
| Term | Meaning |
|---|---|
| NVVM intrinsic | LLVM intrinsic in the llvm.nvvm.* family. |
| LLVM module | LLVM IR representation produced after MLIR lowering. |
| MachineIR | LLVM target-specific machine representation after instruction selection. |
| Parameter space | PTX address space used for kernel parameters. |
| Address space | Memory-space classification such as generic, global, shared, constant, local, or parameter. |
| libdevice | NVIDIA device math bitcode library linked into modules that call __nv_* math functions. See libdevice Overview. |
| __nvvm_reflect | Compile-time configuration query used by libdevice and NVVM support code. The reflect pass replaces __nvvm_reflect("name") calls with the resolved integer value at compile time. See NVVMReflect Mechanism. |
| __grid_constant__ | Kernel-parameter attribute indicating a value that is constant per grid launch. The TMA descriptor pass uses it to mark TMA descriptors passed by kernel parameter, so codegen can place the descriptor into a read-only constant slot without proving constancy from scratch. |
| Descriptor | Generic name for a structured operand passed to a hardware primitive. Each architecture family has its own descriptor type: TMA descriptors are 128-byte records for cp.async.bulk.tensor; GMMA/SMEM descriptors are 64-bit records for WGMMA. |
| Intrinsic | A function-like name that lowers to one or a few target instructions rather than a regular call. PTX intrinsics surface in MLIR as nvvm.* ops. |
| Pass | An MLIR transformation that runs on an operation kind (builtin.module, gpu.module, nv_tileaa.func, etc.). Tileiras runs about fifty passes per device module at -O3. See Full Pass List by Opt Level. |
| Dialect | An MLIR namespace owning a set of operations, types, attributes, and interfaces. Tileiras registers nine dialects across the lowering cascade plus upstream MLIR dialects (arith, math, scf, builtin, etc.). |
| NCL | NVPTX Common Library — the family of nv-* and nvptx-* helper passes that perform common-base elimination, dead-sync elimination, kernel attribute stamping, and other NVPTX-specific cleanups in the backend. |
MLIR Infrastructure
| Term | Meaning |
|---|---|
| MLIR (Multi-Level IR) | The LLVM-project IR-of-IRs framework that hosts tileiras's whole lowering cascade. Dialects, operations, types, attributes, and passes are all MLIR concepts. See Architecture Evolution and Design Decisions. |
| Operation | An instruction-level IR node in MLIR. Carries operands, results, attributes, regions, a source location, and an OperationName. The whole MLIR program is a tree of operations. See Operation Layout. |
| Attribute | Compile-time-known data attached to an operation: integers, strings, types, dictionaries, dialect-defined records, etc. Attributes are uniqued in the MLIRContext. See Attribute System and Lowering. |
| Type | An MLIR value's type. Types are uniqued through the StorageUniquer per context and carry a TypeID plus optional dialect-defined storage. See Storage Uniquer and ContextImpl. |
| Region | A container of basic blocks living inside an operation. Functions, loops, branches, and structured constructs each own one or more regions. |
| OperationName | The per-op-kind runtime identity that every concrete operation refers to. Holds the dialect pointer, the operation's TypeID, its interface table, and folding/verification hooks. See Operation Layout. |
| TypeID | Per-class runtime identity assigned by MLIR's TypeID machinery. Used to key attribute storage, type storage, interface dispatch, and pass IDs. RTTI is disabled in LLVM/MLIR, so TypeID plays the role that typeid would in standard C++. See TypeID Sentinels and Anchors. |
| TableGen | LLVM's declarative DSL (extension .td) for describing instructions, registers, intrinsics, and other compiler tables. A backend reads the .td files and emits C++ headers and tables at build time. |
| ODS (Operation Definition Specification) | The MLIR-specific use of TableGen. Each dialect's operations, types, attributes, and interfaces are declared in .td files; mlir-tblgen emits the C++ classes and definitions consumed by the dialect implementation. |
Math and Precision
| Term | Meaning |
|---|---|
| FP32 / f32 | IEEE 754 single-precision binary32. 1 sign + 8 exponent + 23 mantissa bits. The reference precision for tile arithmetic that is not explicitly narrowed. |
| FP16 / f16 / half | IEEE 754 half-precision binary16. 1 + 5 + 10 bits. Common as MMA operand and accumulator on pre-Hopper tensor cores. |
| BF16 / bf16 | brain-float-16. 1 + 8 + 7 bits. Same exponent range as FP32 but only 7-bit mantissa; the standard low-precision training format on Hopper and Blackwell tensor cores. |
| FP8 (e4m3, e5m2) | 8-bit floating-point types from the OFP8 family. e4m3 has 4 exponent + 3 mantissa bits (used for forward activations and weights), e5m2 has 5 + 2 (wider range, used for gradients). MMA operand type on SM89+. |
| FP4 (e2m1) | 4-bit floating-point type with 2 exponent + 1 mantissa bit. Used as MMA operand in Blackwell block-scaled MMA. |
| e8m0 | 8-bit exponent, 0-bit mantissa, no sign. Used as the per-block scale factor in MX-FP block formats. In MLIR this is f8E8M0FNU. |
| Block-scaled FP | An MX-FP-style format: a block of N narrow values (FP4 or FP8 mantissa) plus a shared e8m0 scale factor. Lets narrow operands cover a wide effective dynamic range. See Fast-Math and Numerical Precision. |
| FTZ (Flush to Zero) | Hardware option that flushes subnormal inputs and results to sign-preserving zero. Controlled per-module through NVVM-Reflect, per-call through libdevice fast variants, and per-instruction through the PTX .ftz modifier. |
| Denormal / Subnormal | An IEEE 754 number with the implicit leading 1 absent, allowing magnitudes below the smallest normal at the cost of reduced relative precision. GPU pipelines often FTZ them for throughput. |
| FMA (Fused Multiply-Add) | The operation a*b + c computed with a single rounding step. Lower error and higher throughput than separate multiply and add. See Fast-Math and Numerical Precision. |
| Fast-math flags | The LLVM IR flag set carried on floating-point ops: nnan (no NaNs), ninf (no Infs), nsz (no signed zero), arcp (allow reciprocal), contract (allow FMA contraction), afn (approximate function), reassoc (allow reassociation). Tileiras propagates these through NVVM lowering. |
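The narrow formats in this table are easier to follow with a decoder in hand. The sketch below assumes the OCP Microscaling conventions (e2m1 with exponent bias 1, e8m0 with bias 127 and 0xFF reserved for NaN); those encodings and the helper names `decodeE2M1`, `decodeE8M0`, and `decodeBlockScaled` are illustrative assumptions, not routines recovered from tileiras.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a 4-bit e2m1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit.
// Assumption: exponent bias 1, as in the OCP Microscaling FP4 encoding.
inline float decodeE2M1(uint8_t bits) {
  assert(bits < 16 && "e2m1 uses only the low four bits");
  const float sign = (bits & 0x8) ? -1.0f : 1.0f;
  const int exp = (bits >> 1) & 0x3;
  const int man = bits & 0x1;
  // exp == 0 is the subnormal row: no implicit leading 1, so only 0 and 0.5.
  const float magnitude =
      exp == 0 ? 0.5f * man : std::ldexp(1.0f + 0.5f * man, exp - 1);
  return sign * magnitude;
}

// Decode an e8m0 scale: an unsigned power of two.
// Assumption: bias 127, with 0xFF reserved for NaN.
inline float decodeE8M0(uint8_t bits) {
  if (bits == 0xFF)
    return std::nanf("");
  return std::ldexp(1.0f, static_cast<int>(bits) - 127);
}

// One element of a block-scaled operand: narrow element times shared scale.
inline float decodeBlockScaled(uint8_t element, uint8_t sharedScale) {
  return decodeE2M1(element) * decodeE8M0(sharedScale);
}
```

The largest e2m1 magnitude is 6, so all of the dynamic range in a block-scaled operand comes from the shared e8m0 factor.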
Reverse Engineering and Binary
| Term | Meaning |
|---|---|
| ELF (Executable and Linkable Format) | The standard Linux binary format. The tileiras driver shared object is an ELF file, and so is the cubin that ptxas emits; ptxas's input, by contrast, is PTX text. |
| Stripped | A binary with its symbol table removed. Tileiras ships stripped, which is why the wiki refers to internal routines by sub_ADDR instead of source names. See Binary Anatomy and RE Methodology. |
| `sub_ADDR` | IDA Pro's auto-generated name for an unnamed function at virtual address ADDR (hex). The wiki uses this convention to cite specific routines in the stripped binary. |
| IDA Pro | The commercial disassembler and decompiler used to recover tileiras's behavior from its stripped shared object. See Binary Anatomy and RE Methodology. |
| vtable | The per-class table of virtual-function pointers a C++ object carries when it has virtual methods. The wiki cites vtable layouts when discussing dialect interfaces, pass classes, and pattern rewriters. |
| RTTI (Run-Time Type Information) | The standard-C++ mechanism for runtime type identification via typeid/dynamic_cast. LLVM and MLIR disable RTTI for code size; tileiras uses MLIR's TypeID machinery instead. See TypeID Sentinels and Anchors. |
CUDA Toolchain
| Term | Meaning |
|---|---|
| nvcc | The top-level CUDA compiler driver. Invokes the host compiler, cudafe++, cicc/tileiras, ptxas, fatbinary, and the host linker. See nvcc 13.1 Position. |
| ptxas | The PTX → SASS assembler. Receives PTX text from tileiras and emits a cubin for one SM target. See ptxas Handoff Protocol. |
| cudafe++ | NVIDIA's CUDA C++ frontend. Splits a CUDA source file into host and device translation units before either side is compiled. See cudafe Non-Relationship. |
| cicc | The older LLVM-based device compiler that lowers CUDA C++ device IR to PTX. Shares the NVPTX backend family with tileiras but starts from cudafe++ output rather than TileIR bytecode. See cicc Comparison. |
| libdevice | NVIDIA's device-side math bitcode library, linked into device modules that call __nv_* math functions. Configured through NVVM-Reflect at link time. See libdevice Overview. |
| NVVM | NVIDIA's variant of LLVM IR for device code. Tileiras's final MLIR form lowers into NVVM-flavored LLVM IR, which is then translated to PTX. |
| NVVM-Reflect | The mechanism that resolves environment-style integer queries (__CUDA_FTZ, __CUDA_PREC_SQRT, SM version, etc.) into compile-time constants, controlling which libdevice variants survive optimization. See NVVMReflect Mechanism. |
| Fatbin | A container format holding multiple cubin and/or PTX images for different SM targets in one file. Produced by fatbinary and consumed by the CUDA runtime for JIT or load-time selection. |
| Cubin | A compiled CUDA binary for one SM target, produced by ptxas. The unit packaged into a fatbin. |
Scheduler Coordination Values
| Term | Meaning |
|---|---|
| AsyncValue | The umbrella value family the TileAS scheduler emits to model async coordination resources after placement. Pipe_ and Mutex_ are the two concrete shapes; both are interned and fingerprinted (BLAKE3) so identical synchronization patterns share storage. See AsyncValue and BLAKE3 Interning. |
| `Pipe_` | A depth-d producer/consumer ring buffer with bounded slack between producer and consumer stages. Emitted by Schedule::solve after placement. See Pipe_ and Mutex_ Value-Header Layout. |
| `Mutex_` | A zero-slack mutual-exclusion edge between successive iterations of a protected region. Iteration i must complete before iteration i+1 starts. See Pipe_ and Mutex_ Value-Header Layout. |
| Rau scheduling | The Rau 1994 modulo-scheduling algorithm: search an initiation interval, place each operation into a cycle modulo II, and respect both recurrence and resource constraints. Tileiras's TileASGenerateSchedule is a Rau-style placement engine. See Modulo Scheduler and Rau. |
| RRT (Resource Reservation Table) | A per-cycle bitset table indexed modulo the candidate II, where each row records which resource classes are occupied. The scheduler probes the RRT before committing an operation to a cycle. See Resource Constraint Builder and RRT. |
| Modulo Initiation Interval (II) | The number of cycles between starts of successive software-pipeline iterations under a modulo schedule. Smaller II raises throughput; the scheduler searches upward from the maximum of the resource, recurrence, and dependence lower bounds. See Modulo Scheduler and Rau. |
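The RRT entry above can be pictured as a small table of bitmasks indexed by cycle modulo II. The following is a minimal, generic Rau-style sketch; `ReservationTable`, `placeOp`, and the 32-bit resource mask are hypothetical names and widths chosen for illustration, and the real layout in tileiras may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// One row per cycle modulo II; each bit marks an occupied resource class.
class ReservationTable {
public:
  explicit ReservationTable(unsigned ii) : rows_(ii) {
    assert(ii > 0 && "initiation interval must be positive");
  }

  unsigned ii() const { return static_cast<unsigned>(rows_.size()); }

  // True when none of the requested resource bits are taken in that row.
  bool isFree(unsigned cycle, uint32_t resources) const {
    return (rows_[cycle % ii()] & resources) == 0;
  }

  void reserve(unsigned cycle, uint32_t resources) {
    rows_[cycle % ii()] |= resources;
  }

private:
  std::vector<uint32_t> rows_;
};

// Probe II consecutive cycles starting at `earliest`; commit the first fit.
// An empty result means the candidate II is too small for this operation.
inline std::optional<unsigned> placeOp(ReservationTable &rrt, unsigned earliest,
                                       uint32_t resources) {
  for (unsigned cycle = earliest; cycle < earliest + rrt.ii(); ++cycle) {
    if (rrt.isFree(cycle, resources)) {
      rrt.reserve(cycle, resources);
      return cycle;
    }
  }
  return std::nullopt;
}
```

Probing only II consecutive cycles is sufficient because rows repeat with period II; a scheduler that gets an empty result would typically retry with a larger II.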
Common Options and Environment
| Term | Meaning |
|---|---|
| `--gpu-name` | Driver target GPU option. |
| `--host-arch` | Host architecture option used when producing the host object. |
| `--host-os` | Host operating-system option used when producing the host object. |
| `--opt-level` / `-O` | Optimization level controlling the pass pipeline. |
| `--lineinfo` | Requests line-number information when input debug information exists. |
| `--device-debug` / `-g` | Requests device debug information when input debug information exists. |
| `--sanitize` | Enables supported sanitizer mode. |
| `CUDA_ROOT`, `CUDA_HOME`, `CUDA_PATH` | Environment variables used to locate CUDA tools when needed. |
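A toolkit-root lookup over those three variables might look like the sketch below. The precedence shown (CUDA_ROOT, then CUDA_HOME, then CUDA_PATH) simply follows the order in which the table lists them and is an assumption; `resolveCudaRoot` and `EnvLookup` are hypothetical names, and the lookup is injected so the logic can be exercised without touching the process environment.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <initializer_list>
#include <optional>
#include <string>

using EnvLookup = std::function<const char *(const char *)>;

// Return the first non-empty variable, or nothing if none is set.
// Assumption: the precedence order below is illustrative only.
inline std::optional<std::string> resolveCudaRoot(
    const EnvLookup &lookup = [](const char *name) -> const char * {
      return std::getenv(name);
    }) {
  assert(lookup && "lookup callback must be callable");
  for (const char *name : {"CUDA_ROOT", "CUDA_HOME", "CUDA_PATH"}) {
    const char *value = lookup(name);
    if (value != nullptr && *value != '\0')
      return std::string(value);
  }
  return std::nullopt;
}
```

Calling `resolveCudaRoot()` with no argument reads the real environment through `std::getenv`.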
Reading Notes
Operation names are written in backticks, for example `nv_tileas.async.pipeline.produce_one`. Dialect names are also written in backticks. Pseudocode uses C-like syntax but is descriptive rather than ABI-exact unless the page explicitly says otherwise.
Reading Map
This page collects curated reader paths. Each path is an ordered sequence of pages with a one-sentence rationale for why the next page follows. Use these when the question is "I want to understand X; what do I read, and in what order?" rather than browsing the SUMMARY.
Driver and Integration Path
For running tileiras, embedding it, or diagnosing a driver failure:
- Driver Overview — what the binary does and which public entry points exist.
- Main Entry — how `main()` builds the configuration and dispatches the four phases.
- Program Handle — the 104-byte handle threaded through create / compile / get-output / release.
- CLI Options — the option surface, separating user-facing flags from internal `cl::opt` plumbing.
- Env Vars and Runtime Gates — environment-driven knobs that bypass the CLI.
- Host Launch and ptxas Knobs — how the driver shells out to `ptxas`.
- ptxas Handoff Protocol — the exact PTX surface `ptxas` accepts.
- Position in nvcc 13.1 — where tileiras fits in the larger CUDA toolchain.
Bytecode Producer Path
For producing valid TileIR bytecode that tileiras will accept:
- MLIR Bytecode Format — the container grammar and section layout.
- Dialect Reader/Writer Status — which dialects have custom bytecode readers and what coverage looks like.
- AsmPrinter Status — printer-side companion (the textual round-trip is partial).
- cuda_tile Overview — the public input dialect.
- cuda_tile Op Roster — every op the public surface accepts.
- cuda_tile Types and Attrs — types and attributes those ops use.
- cuda_tile Verifiers — what gets checked at parse time.
- TypeID Sentinel Table — lookup table when you need the exact identity of a sentinel.
Dialect Lowering Chain
For understanding how the IR cascades from public input to LLVM:
- cuda_tile — public tile-compute surface.
- cuda_tile to tileaa — first conversion: introduce alias awareness.
- nv_tileaa — alias-aware memory, tokens, queues.
- tileaa to tileas — second conversion: make scheduling explicit.
- nv_tileas — operational async-scheduling dialect.
- cute — target-neutral layout algebra.
- cute_nvgpu — NVIDIA architecture atoms (MMA, TMA, tcgen05).
- cutlass — pipeline scheduler, sequence barriers, persistent kernels.
- tileas to LLVM — final MLIR-side conversion.
- cute and cute_nvgpu to LLVM — atom lowering to LLVM intrinsics.
- nvgpu and gpu to NVVM — bridge to PTX-facing dialect.
- Lowering Overview — top-down summary tying these conversions together.
Scheduler Deep-Dive
For understanding how TileAS turns dependence graphs into placed schedules:
- Scheduler Overview — the two-pass GenerateSchedule / MaterializeSchedule split.
- Schedule Constraint Attributes — the nine `tileas.schedule.constraint.*` attributes that drive placement.
- Resource Constraint Builder and RRT — how per-op footprints become RRT bits.
- Modulo Scheduler and Rau — the modulo-scheduling exemplar (read this one carefully).
- Modulo Driver and 4-Arm OR-Chain — the four placement arms (PERMUTE / FUSE / RETRY / CBS).
- Serial vs Cost-Based Generators — the two generator implementations and when each fires.
- Schedule::solve and Cost Evaluators — the materialization algorithm.
- Pipe and Mutex Value Layout — the IR-visible coordination values.
- Buffer Assignment and Named Barriers — the 32-slot named-barrier pool and how Mutex_ values consume it.
- Blackwell Pipeline 15-Slot Model — the target pipeline model the scheduler reasons against.
TileAS Pass Families
For the per-family pass roster running on nv_tileas IR:
- Async/Pipeline Family — MaterializeSchedule, AUS vs AWS, agent materialization.
- Layout and Buffer Family — layout assignment, slicing, and shared-memory handoffs.
- TMA and Memops Family — TMA-descriptor and bulk-copy lowering.
- CTA Cluster Family — cluster geometry, DynamicPersistent, PlanCTA, PrepareForScheduling, ResolveAgentBoundary.
- Scheduling Glue — the small passes wiring schedule data into surrounding IR.
Codegen Deep-Dive
For the NVPTX backend that consumes the lowered LLVM IR:
- Codegen Overview — pipeline shape from LLVM IR to PTX.
- NVPTX Bring-up and Target Init — how the target gets registered and initialized.
- NVPTX Subtarget and Feature Matrix — per-SM feature gating.
- NVPTX Target Lowering, Call and Args — calling convention, parameter space, byval handling.
- ISelDAG and MatcherTable — DAG-to-DAG instruction selection.
- Per-SM Emission Templates — emission templates parameterised by SM tier.
- AsmPrinter Monster and Windows — final PTX text emission.
- tcgen05, WGMMA, mbarrier, Cluster — emission of the Blackwell-era instruction families.
- TMA, Tensormap and cp.async.bulk — TMA-descriptor emission.
- ldmatrix, stmatrix and Register Class Vtables — matrix-fragment movement.
NVPTX Custom Pass Family
For the NVIDIA-private passes layered onto the NVPTX backend:
- NVPTX Backend Passes Overview — pipeline position and shared state.
- Kernel, CDP, Inline, Pretreat — entry-side stamping and inline forcing.
- Lower-Args, Aggr, Struct — byval lowering and parameter-space pointer materialization.
- MemorySpaceOpt and process-restrict — concrete address-space inference and noalias scope generation.
- Printf Lowering and vprintf — printf-to-vprintf rewrite.
- DeadSyncElim and CommonBaseElim — barrier removal and SCEV-keyed GEP CSE.
- Peephole MIR and Image Handles — post-ISel MIR rewriting.
- NVVMIRVerifier — kernel-ABI invariants enforced before backend handoff.
libdevice and NVVM Reflect
For modules that link against libdevice math functions:
- libdevice Overview — the bitcode library and what it covers.
- NVVMReflect Mechanism — how compile-time reflect calls get resolved.
- Intrinsic ID Switch and Name Table — `__nv_*` name to intrinsic ID mapping.
- Math Pass Pipeline and Crosswalk — pass ordering around the math expansion.
MLIR Infrastructure Tour
For the MLIR-side mechanics referenced by dialect and lowering pages:
- MLIR Infra Overview — what the infra layer covers.
- Operation Layout — the 48+ byte `Operation` record and its slots.
- StorageUniquer and Context Impl — type and attribute uniquing.
- Pattern Vtables and Shapes — rewrite-pattern shapes and dispatch.
- Interface Vtables — op and type interface mechanics.
- TypeID Sentinels and Anchors — how TypeIDs are interned and addressed.
- Container Fingerprints — recognizing MLIR container shapes in the binary.
- Diagnostic ABI and Helpers — diagnostic emission, severity packing.
- AsyncValue and BLAKE3 Interning — the 808-byte AsyncValue record backing `Pipe_`/`Mutex_`.
OSS Comparison Tour
For comparing tileiras against the public cuda-tile repository:
- OSS Comparison Overview — what the public tree covers vs what tileiras adds.
- cuda_tile Tree Mapping — file-by-file mapping between public source and tileiras behavior.
- .td Files Delta — TableGen differences.
- Transforms, FuseFMA, SynthDbg — public transform passes and where they live in tileiras.
Cross-cutting Infra
For low-level mechanics referenced from multiple pages:
| Topic | Page |
|---|---|
| Data section decryption | Data Section Decryption |
| Vtable banks | Binary Vtable Banks and Static Ctors |
| Threading | Threading and Synchronization |
| Allocators | Allocator BumpPtr and Slab Sizes |
| String mechanics | Twine, StringRef, format |
| Diagnostic helpers | Diagnostic Helpers |
| GlobalValue flags | GlobalValue Flag Bits |
End-to-End Reimplementation Path
For a single linear read through every contract you must reproduce:
index
-> binary-layout
-> boundaries/nvcc-13-1-position
-> pipeline/overview
-> bytecode/mlir-bc-format
-> dialects/cuda_tile/overview
-> lowering/cuda-tile-to-tileaa
-> dialects/nv_tileaa/overview
-> lowering/tileaa-to-tileas
-> dialects/nv_tileas/overview
-> passes/tileas/scheduling-glue
-> scheduler/overview
-> scheduler/modulo-scheduler-and-rau
-> lowering/tileas-to-llvm
-> codegen/overview
-> nvptx-passes/overview
-> libdevice/overview
Then return to the detailed operation, verifier, and pass-family pages for the subsystem you are implementing.
Driver Overview
Abstract
tileiras is NVIDIA's TileIR optimizing assembler. It takes a TileIR
bytecode module, lowers it through the TileIR and NVVM pipeline, emits PTX,
invokes ptxas, and writes a host relocatable object. It is not a CUDA C++
front-end — no EDG, no cudafe, no host stub synthesis, no CUDA source parser
lives in this tool. Those stages must already have produced the TileIR
bytecode this driver consumes.
From the command line the driver behaves like a compact LLVM-style compiler:
tileiras [driver options] <tileir-bytecode>
-> parse TileIR bytecode as an MLIR builtin.module
-> run TileIR, NVVM, and NVPTX lowering
-> serialize PTX
-> assemble PTX with ptxas
-> optionally dump SASS through nvdisasm -c
-> write a host relocatable object, default elf.o
The public contract stays deliberately small. Users select the GPU architecture, host architecture, host OS, optimization/debug mode, optional memcheck instrumentation, CUDA toolkit root, and output file. The large pass inventory hiding behind that surface is catalogued in the Pipeline Overview and the Full Pass List by Opt Level.
What the driver does
One translation unit per process invocation. The input is a TileIR
bytecode buffer (magic 7f 54 69 6c 65 49 52 00, version 13.1.x); the
output is a host relocatable object the driver writes to --output-file
or, by default, elf.o. Exit status is 0 on success or one of the five
error codes documented in
Driver Program Handle; no partial
output is ever written.
The driver distinguishes TileIR bytecode from generic upstream MLIR
bytecode at the magic-number level. A stream that opens with the MLIR
framing prefix 06 03 80 0a 4d 4c 49 52 and the "\nMLIR" payload tag —
rather than the TileIR "Tile\0" tag in the same slot — is rejected with
a separate diagnostic that names MLIR bytecode explicitly, so the user
can route the input to the right tool instead of guessing whether a
parser failure means a corrupt file.
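The magic-number discrimination described above can be sketched as a prefix comparison. The byte sequences are the ones quoted on this page; `BytecodeKind` and `classifyBytecode` are hypothetical names, and treating the decision as a plain 8-byte comparison is an assumption about how the check is structured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class BytecodeKind { TileIR, UpstreamMLIR, Unknown };

// Classify a buffer by its first eight bytes.
inline BytecodeKind classifyBytecode(const uint8_t *data, size_t size) {
  // TileIR magic as documented: 7f 54 69 6c 65 49 52 00.
  static const uint8_t kTileIRMagic[8] = {0x7f, 0x54, 0x69, 0x6c,
                                          0x65, 0x49, 0x52, 0x00};
  // Upstream MLIR framing prefix as documented: 06 03 80 0a 4d 4c 49 52.
  static const uint8_t kMlirPrefix[8] = {0x06, 0x03, 0x80, 0x0a,
                                         0x4d, 0x4c, 0x49, 0x52};
  assert((data != nullptr || size == 0) && "null data requires zero size");
  if (size < 8)
    return BytecodeKind::Unknown;
  if (std::memcmp(data, kTileIRMagic, 8) == 0)
    return BytecodeKind::TileIR;
  if (std::memcmp(data, kMlirPrefix, 8) == 0)
    return BytecodeKind::UpstreamMLIR;
  return BytecodeKind::Unknown;
}
```

A caller would map `UpstreamMLIR` to the diagnostic that names MLIR bytecode explicitly and `Unknown` to the generic bytecode-mismatch diagnostic.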
Validation runs before any pipeline construction. It rejects null buffers,
non-TileIR bytecode, unsupported GPU names, optimization levels above 3,
and --device-debug paired with any nonzero optimization level. The
verbatim diagnostic strings and their error codes live in
Driver CLI Options.
Supported Targets
| Surface | Accepted values | Default / effect |
|---|---|---|
| `--gpu-name` | sm_100, sm_103, sm_110, sm_120, sm_121 | Defaults to sm_100. |
| `--host-arch` | x86_64, aarch64, arm64ec | Selects the host triple fragment. |
| `--host-os` | linux, windows | Selects the object and triple OS fragment. |
| `--sanitize` | memcheck | Adds TileIR memcheck instrumentation when present. |
| `--opt-level` / `-O` | 0, 1, 2, 3 | Driver default is 3. |
| `--lineinfo` | boolean | Emits line information without full device debug. |
| `--device-debug` / `-g` | boolean | Requires -O0; enables full device debug mode. |
| `--output-file` / `-o` | path | Defaults to elf.o. |
The target set is Blackwell-oriented. A clean-room implementation should treat unsupported SM names as hard validation errors rather than silently remap them to the closest known architecture.
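For a clean-room implementation, the hard-error rule amounts to an exact-match whitelist with no fallback. A minimal sketch follows, in which `parseGpuName` and `GpuTarget` are hypothetical names; the numeric ids mirror the accepted SM names listed in the table.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

struct GpuTarget {
  std::string_view name;
  int id;
};

// Exact-match lookup: anything outside the accepted set yields no value,
// which the caller should turn into a validation error rather than a remap.
inline std::optional<int> parseGpuName(std::string_view name) {
  constexpr GpuTarget kTargets[] = {
      {"sm_100", 100}, {"sm_103", 103}, {"sm_110", 110},
      {"sm_120", 120}, {"sm_121", 121},
  };
  for (const GpuTarget &target : kTargets) {
    if (target.name == name)
      return target.id;
  }
  return std::nullopt;
}
```

Note that the lookup is case-sensitive and never rounds an unknown name such as sm_90 to a neighbouring architecture.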
Driver Flow
The compile path is linear and has no user-visible subcommands:
main
parse argv against the cl::opt registry
read positional TileIR bytecode file
resolve CUDA toolkit root
validate buffer, target, optimization level
create an MLIRContext and register dialects
parse bytecode into builtin.module
attach host/GPU target tuple
build the TileIR pass pipeline for the requested optimization level
lower to NVVM and LLVM
serialize PTX text
invoke ptxas with PTX passed as --input-as-string
optionally write cubin to a temporary file and run nvdisasm -c
write the relocatable object bytes to disk
The only external tools on the default path are CUDA toolkit binaries.
ptxas receives PTX through --input-as-string and returns assembled
cubin bytes on stdout. The SASS dump path writes that cubin to a temporary
file, runs the configured disassembler command, and removes the temporary
file when the driver created it.
Failure Model
Every failure prints a diagnostic and returns a nonzero exit status; the driver never writes a partial output file. The user-visible categories are:
| Category | Typical trigger |
|---|---|
| Input missing | No positional TileIR bytecode file was provided. |
| Read failure | The input file cannot be opened or mapped. |
| Bytecode mismatch | The buffer is not TileIR bytecode. |
| Unsupported target | --gpu-name, --host-arch, or --host-os is outside the accepted set. |
| Invalid options | --opt-level > 3 or --device-debug with nonzero optimization. |
| Toolkit failure | CUDA root cannot be resolved for an operation that requires the toolkit. |
| Compile failure | MLIR parsing, pass execution, PTX emission, or ptxas failed. |
| Dump failure | The configured SASS dump command failed or could not be executed. |
Errors are terminal for the current invocation by design. The driver makes no attempt at partial output recovery after a pipeline or assembler failure — a fresh invocation with corrected input is always cheaper than guessing how much of a half-finished artifact is trustworthy.
Related pages
Driver main() Entry walks the entry-point code path in detail; Driver CLI Options catalogues every option and its validator; Driver Program Handle defines the public error-code numbering; Host Launch ABI and ptxas Knobs covers the kernel-launch metadata the driver emits into the produced object.
Driver main() Entry
Abstract
tileiras is a conventional LLVM-style compiler driver. The process entry
point parses argv against an option schema registered during static
initialization, reads the positional TileIR bytecode file, validates the
buffer and the requested target, runs the TileIR-to-object pipeline inside
a fresh MLIRContext, and writes the resulting host relocatable object —
default path elf.o — to disk. The artifact is a relocatable object, not
a raw PTX file or a standalone cubin.
The end-to-end contract is narrow. The input is an argv vector whose
first non-option positional element is a TileIR bytecode file (magic
7f 54 69 6c 65 49 52 00, version 13.1.x, with at least the String and
Func sections present). The output is either a host relocatable object
written to --output-file and exit status 0, or one of the five integer
error codes from
Driver Program Handle with a
verbatim diagnostic on stderr and no output file written. A failed compile
leaves the filesystem in its prior state — partial output never happens.
No lowering happens in the outer driver frame. The entry point is pure orchestration: it owns option lifetime, error routing, file I/O, and the call sequence into the TileIR compiler proper.
Static-init Option Registration
Every command-line flag is an llvm::cl::opt<T> global constructed during
C++ static initialization, long before main runs. Each global registers
itself with the LLVM CommandLine library's process-wide option parser from
its constructor, so by the time main reaches cl::ParseCommandLineOptions
the parser already knows every flag, every alias, every default, and every
cl::values(...) mapping table. Option objects live for the entire process
lifetime and are not torn down by the driver; their storage belongs to the
LLVM CommandLine library.
The four enum-valued flags (--gpu-name, --host-arch, --host-os,
--sanitize) are cl::opt instantiations over an enumerated value type,
each configured with the cl::ValuesClass modifier that cl::values(...)
returns. They share a single parser shape: a table that maps each
accepted string to an int32 code, plus a default integer. The driver
never sees the raw spelling: by the time the parser returns, each option
holds an integer the rest of the compiler can switch on.
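The self-registration pattern can be illustrated without LLVM. The sketch below is a standalone mimic, not the LLVM CommandLine API: `EnumOption` and `optionRegistry` are hypothetical names, the host-arch and host-os codes follow the values documented in Driver Program Handle, and the default code of 0 for both is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

class EnumOption;

// Function-local static so the registry exists before any option registers.
inline std::vector<EnumOption *> &optionRegistry() {
  static std::vector<EnumOption *> registry;
  return registry;
}

// An option that maps accepted spellings to integer codes and registers
// itself from its constructor, the way a global option object would.
class EnumOption {
public:
  EnumOption(std::string flag, std::map<std::string, int> values,
             int defaultCode)
      : flag_(std::move(flag)), values_(std::move(values)),
        code_(defaultCode) {
    optionRegistry().push_back(this);
  }

  // Returns false and keeps the current code when the spelling is unknown.
  bool parse(const std::string &spelling) {
    auto it = values_.find(spelling);
    if (it == values_.end())
      return false;
    code_ = it->second;
    return true;
  }

  const std::string &flag() const { return flag_; }
  int code() const { return code_; }

private:
  std::string flag_;
  std::map<std::string, int> values_;
  int code_;
};

// Constructed during static initialization, before main() runs.
static EnumOption HostArch("host-arch",
                           {{"x86_64", 0}, {"aarch64", 1}, {"arm64ec", 2}}, 0);
static EnumOption HostOs("host-os", {{"linux", 0}, {"windows", 1}}, 0);
```

After parsing, the rest of the program only reads `code()`, which matches the point that downstream code switches on integers and never on the raw spelling.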
main()
main is pure orchestration. It runs the LLVM command-line parser, reads
the positional bytecode file, builds an MLIRContext with the dialects the
TileIR stack can parse or lower, dispatches to the compile entry, writes
the resulting object bytes, and returns one of the five public error codes
defined in Driver Program Handle.
The compile dispatcher is structured so every failure path returns before any artifact reaches disk. There is no partial-output mode and no rollback logic — a failed compile leaves the filesystem in its prior state, with diagnostics already on stderr.
int main(int argc, char **argv) {
    cl::ParseCommandLineOptions(argc, argv,
        "tileiras: NVIDIA (R) Cuda Tile IR optimizing assembler\n");

    // No positional bytecode file on the command line.
    if (InputFile.empty()) {
        errs() << "error: no input file provided\n";
        return DRIVER_ERR_INPUT_MISSING;             // code 1
    }

    // getFile returns ErrorOr<std::unique_ptr<MemoryBuffer>>.
    auto buffer = MemoryBuffer::getFile(InputFile);
    if (!buffer) {
        errs() << "error: cannot read '" << InputFile
               << "': " << buffer.getError().message() << "\n";
        return DRIVER_ERR_INPUT_READ;                // code 4
    }

    // Magic check, GPU support, opt range, debug/opt compatibility.
    int err = validate_driver_options((*buffer)->getMemBufferRef());
    if (err != 0)
        return err;                                  // codes 2, 3

    MLIRContext ctx;
    register_tileiras_dialects(ctx);

    OwningOpRef<ModuleOp> module = parseSourceString<ModuleOp>(
        (*buffer)->getBuffer(), &ctx);
    if (!module) {
        errs() << "error: failed to parse IR bytecode\n";
        return DRIVER_ERR_COMPILE;                   // code 5
    }

    attach_target(*module, GpuName, HostArch, HostOs);

    PassManager pm(&ctx);
    build_tileir_pipeline(pm, make_pipeline_options());
    if (failed(pm.run(*module))) {
        errs() << "error: failed to compile Tile IR program\n";
        return DRIVER_ERR_COMPILE;                   // code 5
    }

    // The object is built fully in memory so nothing reaches disk on failure.
    SmallString<0> object;
    raw_svector_ostream os(object);
    if (failed(emit_relocatable_object(*module, os))) {
        errs() << "error: failed to emit relocatable object\n";
        return DRIVER_ERR_COMPILE;                   // code 5
    }

    auto write_err = writeToOutput(OutputFile.empty() ? "elf.o" : OutputFile,
                                   object);
    return write_err ? DRIVER_ERR_OUTPUT_WRITE : 0;
}
The branches above are the real branches: every diagnostic the driver can
emit before reaching the pipeline maps to one of them. The compile path
itself produces only the generic compile-failure diagnostic — finer-grained
pipeline errors print from inside the pass that failed, through MLIR's
diagnostic engine, before control returns to main.
Opt-level Dispatch
-O is an alias for --opt-level. The driver default is 3, the accepted
range is 0..3, and the validated integer is copied into the pipeline
options before the pass manager is built.
The embedded pass pipeline carries its own opt-level field with default
2, used when the pipeline runs outside the driver wrapper. The V2
pipeline carries a third field, v2-opt-level, defaulting to 0. These
are three distinct axes — collapsing them produces silently different
behavior for integrators who embed the pipeline directly.
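Keeping the three levels as separate fields makes the distinction hard to lose. In the sketch below, `OptLevels` and `fromDriver` are hypothetical names; the defaults come from the text above, and leaving the V2 level untouched when the driver level is copied is an assumption rather than recovered behavior.

```cpp
#include <cassert>

// Three independent optimization-level axes with their documented defaults.
struct OptLevels {
  unsigned driver = 3;   // --opt-level / -O as seen by the driver
  unsigned pipeline = 2; // opt-level field of the embedded pass pipeline
  unsigned v2 = 0;       // v2-opt-level field of the V2 pipeline
};

// Mirror the driver path: the validated level is copied into the pipeline
// options. Assumption: the V2 level keeps its own default here.
inline OptLevels fromDriver(unsigned validatedLevel) {
  assert(validatedLevel <= 3 && "driver accepts only 0..3");
  OptLevels levels;
  levels.driver = validatedLevel;
  levels.pipeline = validatedLevel;
  return levels;
}
```

An integrator who builds the pipeline directly starts from `OptLevels{}` and therefore runs at pipeline level 2, not the driver's 3, unless the field is set explicitly.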
Full device debug carries one hard invariant: it cannot coexist with a
nonzero optimization level. With --device-debug set, the driver demands
-O0, and the NVVM option string then carries debug-preserving options
such as -g, --dont-merge-basicblocks, and --return-at-end. The
validator rejects the combination rather than silently degrading an
optimized build, because the user's intent — preserved control flow for a
debugger — is incompatible with the transforms -O>0 would run.
Diagnostics Before the Pipeline
Two diagnostics fire from main before the compile dispatcher even
constructs an MLIR context. A missing positional argument produces
error: no input file provided and returns the input-missing code. A
file that cannot be opened or mapped produces a diagnostic carrying both
the path and the operating-system error message and returns the read-error
code. The full validator — bytecode magic check, GPU support, optimization
range, debug/optimization compatibility — runs immediately after the file
is in memory and emits the verbatim diagnostics catalogued in
Driver CLI Options.
--sanitize=memcheck is the only sanitizer selector accepted. Setting it
appends the memcheck and tensor-memory access-check options to the
downstream tool configuration.
MLIRContext and Dialect Registration
A single MLIRContext lives for the duration of one compile and owns every
op, type, and attribute the pipeline allocates. Before bytecode parsing
starts, the driver eagerly loads every dialect the TileIR stack can parse
or lower:
| Dialect family | Purpose |
|---|---|
| `cuda_tile` | Input TileIR operations and target metadata. |
| `nv_tileaa` | Tile-level analysis and allocation representation. |
| `nv_tileas` | Tile assembler scheduling and memory-operation representation. |
| `cute_nvgpu` | Cute/NVGPU atoms and Blackwell copy/MMA forms. |
| `cutlass` | CUTLASS-style scheduling and pipeline abstractions. |
| `gpu`, `llvm`, `nvvm` | Upstream lowering targets for host/device IR. |
Eager registration matters because MLIR bytecode references dialects by
name from its symbol table. If a dialect is not loaded by the time the
parser hits its first op, parsing fails with an unresolved-dialect
diagnostic. The driver therefore loads every dialect the pipeline could
need, and includes a late-registration fallback specifically for
cuda_tile to cover bytecode that references the dialect through a
form the eager path has not yet materialized.
After the module parses, the driver attaches the host/GPU target tuple
through a target-attribute setter on builtin.module. The pass manager is
then constructed against the module, and the pipeline is built with the
options object derived from the parsed command line.
Teardown Semantics
Driver-owned cleanup is strictly local. File buffers, the MLIR context,
the parsed module, and any output bytes go out of scope when main
returns; LLVM's CommandLine library owns the cl::opt globals and tears
them down through static destructors. MLIR's dialect registry and uniqued
type/attribute storage are global runtime objects destroyed by their normal
destructors after main exits — they are not part of the driver phase
graph and not modeled as extra compile phases.
The distinction matters for a reimplementer building a long-lived embedding of the driver. The driver should free exactly the resources it owns and never manually destroy global dialect singletons owned by MLIR support code. Doing so corrupts the global registry for any subsequent compile in the same process.
Related pages
Driver Overview frames the surrounding pipeline
from the user's perspective; Driver CLI Options
catalogues every option main consumes, including the validator error codes
referenced here; Driver Program Handle
documents the public error-code numbering returned through main's exit
status.
Driver Program Handle
Abstract
A single 104-byte program handle represents one TileIR translation unit in the tileiras driver. The public C-API surface exposes only an opaque pointer, but the recovered allocation is a fixed 0x68-byte block built by sub_57A480 (tileirasProgramCreate) and consumed by sub_57A8E0 (tileirasProgramCompile), sub_57A850 (tileirasProgramGetOutput), and sub_57A7C0 (tileirasProgramRelease). Every offset is reachable from the four public entry points, and the layout stays stable across the three create/compile/release call sites in the driver binary.
The handle stores the validated driver configuration, a small ownership bit, and an inline byte view that doubles as a CUDA-root pointer during early lifetime and as the compiled-output byte span after compile. Storage at +0x48 lives a two-phase life: the slot holds the resolved CUDA install root while the front end runs, then the same 16 bytes are repurposed to track the compiled output buffer once sub_57A8E0 finishes.
Public Error Codes
Every entry point in the driver's public C API returns a small integer status. Five non-zero codes are emitted across the four functions, and every diagnostic routes through sub_578D40 with a packed severity byte. The severity values 259, 260, and 2563 are the (class | (op_prefix << 8) | (trace << 9)) form documented in Diagnostic ABI and Helpers: 259 and 260 are fatal driver errors, 2563 is a user-input rejection.
| Code | Trigger | Verbatim diagnostic | Severity |
|---|---|---|---|
| 0 | success | (none) | — |
| 1 | allocation failure (sub_44A8C20(0x68) returns NULL) | failed to allocate memory for program | 259 |
| 2 | null config (out_program ok) | configuration is null | 2563 |
| 2 | null inputBuffer | input buffer is null | 2563 |
| 2 | opt_level != 0 && device_debug == 1 | optimized debugging is not supported, change optimization level to 0 or disable full debug info | 2563 |
| 2 | invalid GPU id (not in {100, 103, 110, 120, 121}) | unsupported GPU target | 2563 |
| 2 | opt_level > 3 | invalid optimization level | 2563 |
| 2 | unsupported host arch (not in {0, 1, 2}) | unsupported host architecture | 2563 |
| 2 | unsupported host OS (not in {0, 1}) | unsupported host operating system | 2563 |
| 3 | parse failure on TileIR bytecode magic | input does not correspond to Tile IR bytecode | 260 |
| 3 | parse failure with MLIR fall-through | failed to parse IR bytecode (it looks like MLIR bytecode instead) | 260 |
| 4 | null program (Compile, GetOutput, Release) | program is null | 2563 |
| 4 | null output pointer (GetOutput) | output pointer is null | 2563 |
| 4 | output requested before compile (GetOutput) | program has not been compiled | 2563 |
| 5 | compile failure (tileirasProgramCompile) | failed to compile Tile IR program | 259 |
Code 1 is reserved for the single allocation site at sub_44A8C20(0x68). Code 2 covers every configuration-level rejection. Code 3 is reserved for bytecode-parse failures and is the only code with a conditional suffix appended by a magic-tail heuristic. Code 4 is the shared null-handle / not-compiled rejection: Compile, GetOutput, and Release return it for a null or not-yet-compiled program, and create returns it when out_program itself is null. Code 5 is the compile-time failure code emitted by sub_57A8E0.
The MLIR fall-through on code 3 is a small heuristic inside sub_57A480. When the bytecode magic check fails, the function scans the input for the 8-byte ASCII tail e IR byt — the suffix of MLIR bytecode minus the leading ML — and on a match appends (it looks like MLIR bytecode instead) to the base code-3 diagnostic so the user can route their bytecode to the right tool.
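The three packed severity values in the error table can be checked against the (class | (op_prefix << 8) | (trace << 9)) form with a few lines of arithmetic. In this sketch the field widths (class in the low 8 bits, op_prefix as a single bit, trace in everything above bit 8) are assumptions inferred from the shift amounts, and `packSeverity` / `unpackSeverity` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

struct SeverityFields {
  uint32_t cls;      // low byte: diagnostic class
  uint32_t opPrefix; // bit 8
  uint32_t trace;    // bits 9 and up
};

constexpr uint32_t packSeverity(uint32_t cls, uint32_t opPrefix,
                                uint32_t trace) {
  return cls | (opPrefix << 8) | (trace << 9);
}

constexpr SeverityFields unpackSeverity(uint32_t packed) {
  return SeverityFields{packed & 0xFFu, (packed >> 8) & 0x1u, packed >> 9};
}

// The three values quoted on this page, decomposed under the assumed widths.
static_assert(packSeverity(3, 1, 0) == 259, "259 = class 3, op_prefix 1");
static_assert(packSeverity(4, 1, 0) == 260, "260 = class 4, op_prefix 1");
static_assert(packSeverity(3, 0, 5) == 2563, "2563 = class 3, trace 5");
```

Under these assumptions 259 and 2563 share the same class value and differ only in the op_prefix and trace fields.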
tileirasProgramCreate Validation Order
sub_57A480 is an 820-byte routine that funnels every diagnostic through sub_578D40. Validation order is fixed and observable from the call sites — a caller can rely on each check happening before the next, because every early return is a separate diagnostic with a separate severity field.
int tileirasProgramCreate(TileirasProgram **out_program, const TileirasConfig *config) {
    if (out_program == NULL)                    return 4;  // "program is null"
    if (config == NULL)                         return 2;  // "configuration is null"
    if (config->input_buffer_data == NULL)      return 2;  // "input buffer is null"
    if (!sub_57FF40(config->input_buffer_data,
                    config->input_buffer_size)) return 3;  // parse probe; MLIR tail check appends suffix
    if (!is_supported_gpu(config->gpu_id))      return 2;  // "unsupported GPU target"
    if ((uint32_t)config->opt_level > 3)        return 2;  // "invalid optimization level"
    if (config->opt_level != 0 &&
        config->device_debug == 1)              return 2;  // "optimized debugging is not supported, ..."
    if ((uint32_t)config->host_arch > 2)        return 2;  // "unsupported host architecture"
    if ((uint32_t)config->host_os > 1)          return 2;  // "unsupported host operating system"

    TileirasProgram *p = (TileirasProgram *)sub_44A8C20(0x68);
    if (p == NULL)                              return 1;  // "failed to allocate memory for program"

    copy_config_into_handle(p, config);
    p->owning_flag = 0;
    *out_program = p;
    return 0;
}
The nine predicate gates depend only on the caller's arguments and the input buffer they reference. Only after every predicate passes does sub_57A480 request the 104-byte block from the allocator. Allocation is the last possible failure point, so a successful return from tileirasProgramCreate guarantees the handle is initialized end-to-end.
tileirasConfig Layout
The configuration is a 16-byte aligned block whose first 80 bytes mirror the front of the program handle. sub_57A480 reads it through five _mm_loadu_si128 loads plus one scalar slot, pinning the layout to exactly five 16-byte rows.
typedef struct TileirasConfig {
/*+0x00*/ const void *input_buffer_data; // bytecode bytes
/*+0x08*/ size_t input_buffer_size; // bytes in buffer
/*+0x10*/ int32_t gpu_id; // 100/103/110/120/121 (sub_57A450 whitelist)
/*+0x14*/ int32_t opt_level; // 0..3
/*+0x18*/ int32_t device_debug; // 0 or 1
/*+0x1C*/ int32_t lineinfo; // 0 or 1
/*+0x20*/ int32_t host_arch; // 0=x86_64, 1=aarch64, 2=arm64ec
/*+0x24*/ int32_t host_os; // 0=linux, 1=windows
/*+0x28*/ int32_t sanitize; // 0=off, 1=memcheck
/*+0x2C*/ uint32_t pad_2C;
/*+0x30*/ /* std::string SSO */ // output file name (driver side)
/*+0x40*/ const char *cuda_root_ptr; // resolved by sub_5773C0
/*+0x48*/ size_t cuda_root_len;
} TileirasConfig;
The fields at +0x10..+0x28 are the validated driver options. The CUDA root string at +0x40 is the resolution of the CUDA_ROOT / CUDA_HOME / CUDA_PATH environment chain (with /proc/self/exe fallback) performed by sub_5773C0; tileiras carries it into the program handle because the compile dispatch needs an installation path to invoke nvdisasm.
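A reimplementation can pin the configuration offsets with compile-time checks. The sketch below assumes a 64-bit host with 8-byte pointers and size_t. The struct name and the opaque 16-byte stand-in for the string slot at +0x30 are hypothetical; the offsets are the ones listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ConfigLayoutSketch {
    const void   *input_buffer_data;    /* +0x00 */
    size_t        input_buffer_size;    /* +0x08 */
    int32_t       gpu_id;               /* +0x10 */
    int32_t       opt_level;            /* +0x14 */
    int32_t       device_debug;         /* +0x18 */
    int32_t       lineinfo;             /* +0x1C */
    int32_t       host_arch;            /* +0x20 */
    int32_t       host_os;              /* +0x24 */
    int32_t       sanitize;             /* +0x28 */
    uint32_t      pad_2C;               /* +0x2C */
    unsigned char output_name_slot[16]; /* +0x30: opaque stand-in (assumption) */
    const char   *cuda_root_ptr;        /* +0x40 */
    size_t        cuda_root_len;        /* +0x48 */
} ConfigLayoutSketch;

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit host");
static_assert(offsetof(ConfigLayoutSketch, gpu_id) == 0x10, "gpu_id offset");
static_assert(offsetof(ConfigLayoutSketch, sanitize) == 0x28, "sanitize offset");
static_assert(offsetof(ConfigLayoutSketch, cuda_root_ptr) == 0x40, "root ptr offset");
static_assert(offsetof(ConfigLayoutSketch, cuda_root_len) == 0x48, "root len offset");
static_assert(sizeof(ConfigLayoutSketch) == 0x50, "five 16-byte rows");
```

If any assertion fails on a given toolchain, the struct needs explicit packing or padding before it can be exchanged with the recovered layout.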
TileirasProgram Layout
The 104-byte program handle reuses the first 80 bytes of the configuration almost verbatim, with one deliberate field reorder: gpu_id moves to +0x28 so the validated 32-bit fields at +0x10..+0x28 form a contiguous scalar block that the compile dispatch reads via aligned 32-bit loads.
typedef struct TileirasProgram {
/*+0x00*/ const void *input_buffer_data; // ptr to bytecode bytes
/*+0x08*/ size_t input_buffer_size; // bytes in buffer
/*+0x10*/ int32_t opt_level; // 0..3
/*+0x14*/ int32_t device_debug; // 0 or 1
/*+0x18*/ int32_t lineinfo; // 0 or 1
/*+0x1C*/ int32_t host_arch; // 0=x86_64, 1=aarch64, 2=arm64ec
/*+0x20*/ int32_t host_os; // 0=linux, 1=windows
/*+0x24*/ int32_t sanitize; // 0 or 1 (memcheck)
/*+0x28*/ int32_t gpu_id; // 100/103/110/120/121
/*+0x2C*/ uint32_t pad_2C;
/*+0x40*/ const char *cuda_root_ptr; // resolved at compile time
/*+0x48*/ size_t cuda_root_len; // ── overlapping slot ──
/*+0x48*/ uint8_t *output_data; // SSO: same 16 bytes, second life
/*+0x50*/ uint64_t output_capacity;
/*+0x58*/ size_t output_length;
/*+0x60*/ uint32_t owning_flag; // 1 = handle owns output_data
} TileirasProgram;
Five unaligned SSE copies from the configuration populate the block. The first copy moves input_buffer_data and input_buffer_size; the next two move the eight 32-bit option fields; the fourth clears the slot at +0x30; the fifth installs the CUDA-root SSO. owning_flag at +0x60 clears to zero before sub_57A480 returns, so the handle starts life with no output buffer to free.
SSO Overlap at +0x48
The slot at +0x48 is the only address in the handle that hosts two different fields across the program lifetime. While the front end runs, +0x40..+0x4F carries a (cuda_root_ptr, cuda_root_len) pair pointing at the resolved CUDA install path. The compile dispatcher reads the pair once, uses it to locate the nvdisasm binary, and never touches it again. The same 16 bytes are then overwritten to hold the compiled-output buffer descriptor: output_data at +0x48, output_capacity at +0x50, output_length at +0x58.
The offsets used by the consumers make this observable. sub_57A8E0 writes output_data at +72 (0x48), output_capacity at +80 (0x50), and output_length at +88 (0x58). sub_57A850 returns {ptr=+0x48, length=+0x58} to the caller. The lifetimes never overlap: by the time sub_57A8E0 installs the output bytes, the CUDA-root string has already been consumed and is no longer needed by any subsequent stage.
A reimplementation should keep this overlap as an internal storage trick and never expose it to callers. The public contract is simply that the program-output getter is invalid until compile has succeeded.
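One way to express the overlap without exposing it is a union at +0x48 behind a getter. This is a sketch under the same 64-bit assumption as the layout above. The struct, member, and function names are hypothetical, and the getter returns the table's code 4 for every rejection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ProgramLayoutSketch {
    const void   *input_buffer_data; /* +0x00 */
    size_t        input_buffer_size; /* +0x08 */
    int32_t       opt_level;         /* +0x10 */
    int32_t       device_debug;      /* +0x14 */
    int32_t       lineinfo;          /* +0x18 */
    int32_t       host_arch;         /* +0x1C */
    int32_t       host_os;           /* +0x20 */
    int32_t       sanitize;          /* +0x24 */
    int32_t       gpu_id;            /* +0x28 */
    uint32_t      pad_2C;            /* +0x2C */
    unsigned char slot_30[16];       /* +0x30: cleared during create */
    const char   *cuda_root_ptr;     /* +0x40 */
    union {                          /* +0x48: two sequential lifetimes */
        size_t    cuda_root_len;     /* before compile */
        uint8_t  *output_data;       /* after a successful compile */
    } slot_48;
    uint64_t      output_capacity;   /* +0x50 */
    size_t        output_length;     /* +0x58 */
    uint32_t      owning_flag;       /* +0x60 */
} ProgramLayoutSketch;

static_assert(offsetof(ProgramLayoutSketch, slot_48) == 0x48, "overlap slot");
static_assert(offsetof(ProgramLayoutSketch, owning_flag) == 0x60, "flag offset");
static_assert(sizeof(ProgramLayoutSketch) == 0x68, "104-byte handle");

/* Getter that hides the overlap: valid only once owning_flag is set. */
static int program_output_view(const ProgramLayoutSketch *p,
                               const uint8_t **data, size_t *len) {
    if (p == NULL || data == NULL || len == NULL)
        return 4;                    /* null program / null output pointer */
    if (p->owning_flag == 0)
        return 4;                    /* program has not been compiled */
    *data = p->slot_48.output_data;
    *len  = p->output_length;
    return 0;
}
```

Because callers only ever see the getter, the union member in use is always the one that matches the handle's current lifetime phase.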
Ownership and Release
A single flag at +0x60, owning_flag, controls release behavior. It starts at 0 immediately after tileirasProgramCreate. sub_57A8E0 sets it to 1 once the compiled byte buffer has landed at +0x48..+0x58. sub_57A7C0 (tileirasProgramRelease) reads the flag before tearing down: 1 makes the release path call the buffer's deleter on output_data; 0 leaves the slot alone. Either way, the 104-byte handle itself is freed via the matching sub_4580C60 deallocator after the conditional output free.
The ownership rule means calling tileirasProgramRelease on a never-compiled handle is safe and touches no output memory. It also means the only legal way to retain compiled bytes after release is to copy them out via tileirasProgramGetOutput first.
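The ownership rule can be exercised with a small mock. Everything below is a stand-in: MockProgram, mock_create, mock_compile, and mock_release are hypothetical names, malloc and free replace the recovered allocator pair, and the mock compile simply copies caller-supplied bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct MockProgram {
    uint8_t *output_data;   /* valid only while owning_flag == 1 */
    size_t   output_length;
    uint32_t owning_flag;   /* 0 after create, 1 after a successful compile */
} MockProgram;

static MockProgram *mock_create(void) {
    return calloc(1, sizeof(MockProgram));   /* flag starts at 0 */
}

/* Stand-in for compile: installs a private copy of bytes as the output. */
static int mock_compile(MockProgram *p, const uint8_t *bytes, size_t n) {
    if (p == NULL)
        return 4;
    assert(p->owning_flag == 0);             /* the sketch compiles at most once */
    uint8_t *buf = malloc(n > 0 ? n : 1);
    if (buf == NULL)
        return 5;
    if (n > 0)
        memcpy(buf, bytes, n);
    p->output_data   = buf;
    p->output_length = n;
    p->owning_flag   = 1;                    /* handle now owns the buffer */
    return 0;
}

/* Frees the output only when the handle owns it, then frees the handle. */
static int mock_release(MockProgram *p) {
    if (p == NULL)
        return 4;
    if (p->owning_flag == 1)
        free(p->output_data);
    free(p);
    return 0;
}
```

A caller that needs the bytes after release copies them out first; the pointer obtained from the getter dangles once mock_release returns.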
Public-API Surface
Four C-API entry points operate on the handle. Each one is a small wrapper that validates its arguments, walks a fixed offset path, and routes diagnostics through sub_578D40.
| Symbol | Identity | Role |
|---|---|---|
| sub_57A480 | tileirasProgramCreate | Validates (out_program, config, inputBuffer), allocates 0x68 bytes via sub_44A8C20, copies the configuration body, clears owning_flag. |
| sub_57A8E0 | tileirasProgramCompile | Reads option fields at +0x10..+0x28, runs the compile dispatcher, writes the output buffer at +0x48..+0x58, sets owning_flag at +0x60 to 1. |
| sub_57A850 | tileirasProgramGetOutput | Returns {ptr = *(uint8_t**)(handle+0x48), length = *(size_t*)(handle+0x58)}; emits program has not been compiled (code 4) if owning_flag is 0. |
| sub_57A7C0 | tileirasProgramRelease | If owning_flag is 1, frees output_data; then frees the 104-byte handle via sub_4580C60. |
All four entry points share the same C-API error space. The diagnostic emitter at sub_578D40 is the single sink for every public message, which is why codes overlap across functions: code 4 covers program is null, output pointer is null, and program has not been compiled from the getter and release sites, while code 5 is reserved for the compile-time failure path inside sub_57A8E0.
Lifecycle
The normal lifetime runs create → compile → get output → release, and the driver's main follows that sequence exactly. The handle is opaque at the public surface; the offset layout above is an implementation detail that reimplementers should reproduce for binary compatibility with the recovered driver but should never surface to consumers.
TileirasProgram *program = NULL;
int err = tileirasProgramCreate(&program, &config);
if (err == 0) {
err = tileirasProgramCompile(program);
}
if (err == 0) {
TileirasByteView out;
err = tileirasProgramGetOutput(program, &out);
if (err == 0) {
write_output_file(opts.output_path, out.data, out.length);
}
}
tileirasProgramRelease(program);
Compile is allowed to fail after a successful create. When it does, sub_57A8E0 returns code 5 (failed to compile Tile IR program), owning_flag stays 0, the output slot still holds the CUDA-root pair, and release frees only the handle. The getter rejects subsequent calls with code 4 (program has not been compiled), preserving the invariant that a returned byte view stays valid until the next mutating call on the handle.
Reimplementation Invariants
The handle is 104 bytes. The option fields at +0x10..+0x28 are eight contiguous 32-bit integers in the order (opt_level, device_debug, lineinfo, host_arch, host_os, sanitize, gpu_id, pad). The slot at +0x48..+0x57 is a single 16-byte region with two sequential lifetimes: CUDA-root SSO during the front end, output-buffer descriptor afterwards. The 32-bit flag at +0x60 is the ownership flag, the only field tileirasProgramRelease inspects when deciding whether to free the output buffer. Allocation goes through sub_44A8C20(0x68) and deallocation through sub_4580C60; no intermediate reallocation occurs on the create or compile paths.
Cross-References
Driver main() Entry documents how the handle is threaded through the four-phase driver and how the configuration is built from cl::opt storage. The compile dispatcher that mutates the handle is described in the TileIR Pipeline Overview.
Driver CLI Options
Abstract
The tileiras command-line surface has two layers. Normal users see only
the first — a small driver layer with input file, output file, target
selection, optimization level, debug mode, line info, and memcheck. The
second is the TileIR pipeline option structure, which surfaces when the
driver constructs the pass pipeline or when an integrator embeds the
pipeline directly.
The two layers reuse a few names on purpose. Driver --opt-level defaults
to 3; the embedded pipeline option named opt-level defaults to 2.
Treat them as separate axes unless the driver has explicitly copied the
command-line choice into the pipeline options.
Driver Options
| Option | Values | Default | Effect |
|---|---|---|---|
| <tileir-bytecode> | path | required | Input bytecode buffer parsed as TileIR MLIR bytecode. |
| --output-file, -o | path | elf.o | Host relocatable output path. |
| --gpu-name | sm_100, sm_103, sm_110, sm_120, sm_121 | sm_100 | GPU target selected for lowering and ptxas. |
| --host-arch | x86_64, aarch64, arm64ec | platform-dependent | Host architecture used for target triples and callbacks. |
| --host-os | linux, windows | platform-dependent | Host operating-system component of the generated target. |
| --sanitize | memcheck | unset | Enables memcheck-oriented TileIR instrumentation. |
| --opt-level, -O | integer 0..3 | 3 | Driver optimization level. Values above 3 are rejected. |
| --lineinfo | boolean | false | Emits line information without requiring full debug mode. |
| --device-debug, -g | boolean | false | Enables full device debug; valid only with -O0. |
The driver parses these with LLVM command-line semantics — aliases are exact aliases, boolean flags follow LLVM's normal spelling rules, and unknown options are rejected before any compilation work starts.
Enum-valued Options as int32 Codes
The four enum-valued driver options — --gpu-name, --host-arch,
--host-os, --sanitize — are wired through cl::opt<cl::ValuesClass>
template instantiations that share one parser shape. Each option carries
its own cl::values(...) mapping table that pairs an accepted spelling
with an int32 code, plus a default integer to use when the option is
absent. The parser walks the table once at command-line time, stores the
resulting integer, and downstream code never sees the original string.
--gpu-name maps spellings to the corresponding SM number and defaults
to 100:
| String | int32 code | Notes |
|---|---|---|
| "sm_100" | 100 | Datacenter Blackwell (default) |
| "sm_103" | 103 | Blackwell variant |
| "sm_110" | 110 | Jetson Thor |
| "sm_120" | 120 | Consumer RTX 50** / Pro |
| "sm_121" | 121 | DGX Spark |
The driver surface accepts only the bare sm_NN spelling — a and
f variant suffixes are not parsed here. The architecture-specific
selection happens one level up, on the nv_tileaa.compute_capability
module attribute set by the frontend. A frontend that lowers WGMMA,
tcgen05.mma, or block-scaled mma.sync carries the matching
target_spec field on the module; the backend reads both fields
when constructing the NVPTX target machine, picks sm_100a (for
example) instead of sm_100, and emits .target sm_100a
accordingly. --gpu-name is therefore a defaulting hint for the
major SM number, not the final word on the .target line. The
full subtarget-construction mechanism — including how --gpu-name
combines with +ptxNN feature flags to drive the
.version/.target header — is documented in PTX Version and
Target Selection.
Two practical consequences follow. First, a kernel emitted by a
frontend that requires arch-conditional instructions cannot be
redirected to a plain sm_NN target by changing --gpu-name
alone — the lowering will fail in the selector when no legal
MachineInstr is found. Second, this driver does not list sm_90:
its primary deployment surface is Blackwell, and Hopper targets are
reachable only through the frontend's own attribute writes plus a
host environment that pins the build to an sm_90-capable
subtarget table.
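The spelling-to-code mapping for --gpu-name amounts to an exact-match table lookup. The sketch below is not the cl::opt machinery itself; the type name, function name, and the -1 sentinel for rejected spellings are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct EnumEntry {
    const char *spelling;
    int32_t     code;
} EnumEntry;

/* Accepted spellings and their int32 codes, as listed in the table. */
static const EnumEntry kGpuNames[] = {
    {"sm_100", 100}, {"sm_103", 103}, {"sm_110", 110},
    {"sm_120", 120}, {"sm_121", 121},
};

static_assert(sizeof kGpuNames / sizeof kGpuNames[0] == 5, "five accepted targets");

/* Exact-match lookup. NULL models an absent option and yields the default
 * code 100; unknown spellings yield -1 (hypothetical sentinel). */
static int32_t parse_gpu_name(const char *text) {
    if (text == NULL)
        return 100;
    for (size_t i = 0; i < sizeof kGpuNames / sizeof kGpuNames[0]; ++i) {
        if (strcmp(text, kGpuNames[i].spelling) == 0)
            return kGpuNames[i].code;
    }
    return -1;
}
```

Because the match is exact, sm_100a and sm_90 both fall through to the rejection path, which matches the behavior described above.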
--host-arch defaults to 0:
| String | int32 code | Notes |
|---|---|---|
| "x86_64" | 0 | Linux/Windows x86-64 |
| "aarch64" | 1 | ARM 64-bit |
| "arm64ec" | 2 | ARM64EC (Windows on ARM); reuses the aarch64 record at a sub-entry |
--host-os defaults to 0:
| String | int32 code |
|---|---|
| "linux" | 0 |
| "windows" | 1 |
--sanitize defaults to 0 and is the only option whose unset state
carries semantic weight downstream:
| String | int32 code | Notes |
|---|---|---|
| (unset) | 0 | No sanitizer |
| "memcheck" | 1 | Activates the -sanitize=memcheck -g-tmem-access-check nvdisasm tail |
The host-architecture lookup table is keyed by code and walked with two
strides — 39 for the x86_64 record and 36 for both aarch64 entries.
arm64ec reuses the aarch64 record at a distinct sub-entry; that
sub-entry is the only place the two ARM modes diverge in the host-side
code path. The host-OS index resolves to 7 for linux and 15 for
windows, both of which select a target-triple OS fragment and the
matching object-file format.
Each parser exposes an 8-slot vtable shared by all four options. The
slots are: typeinfo helper, destructor, parse (string → int32 map
probe), print (int32 → string lookup for --help), valuesDefault
initialiser, and three reserved slots. parse is the only operation
invoked at command-line time; print fires only when the user requests
help text.
Validation Algorithm
The option validator is deliberately strict. It checks the bytecode buffer and the requested target before allocating the program handle, keeping failure paths simple and steering clear of partially initialized session state.
int validate_driver_options(const ByteSpan *input, const DriverOptions *opts) {
if (input == NULL || input->data == NULL)
return error("input buffer is null"); // code 2
if (!is_tileir_bytecode(*input)) {
if (looks_like_mlir_bytecode(*input))
return error("failed to parse IR bytecode (it looks like MLIR bytecode instead)"); // code 3
return error("input does not correspond to Tile IR bytecode"); // code 3
}
if (!is_supported_gpu(opts->gpu_name))
return error("unsupported GPU target"); // code 2
if ((uint32_t)opts->opt_level > 3)
return error("invalid optimization level"); // code 2
if (opts->device_debug && opts->opt_level != 0)
return error("optimized debugging is not supported, "
"change optimization level to 0 or disable full debug info"); // code 2
return 0;
}
The diagnostic strings above are the verbatim messages emitted by the
validator entry point; the full error-code table with severity bytes
lives in Driver Program Handle.
The debug rule is not cosmetic — full device debug mode injects NVVM
debug options that disable several code-motion and block-merge
transforms, so the driver demands -O0 rather than silently degrading
an optimized build.
Pipeline Options
The TileIR pass pipeline carries a much larger option structure. These options matter most to integrators who build a pass pipeline directly or expose advanced tuning flags in a higher-level tool.
| Pipeline option | Default | Effect |
|---|---|---|
| opt-level | 2 | TileIR pipeline optimization level when invoked outside the driver wrapper. |
| v2-opt-level | 0 | Separate optimization level for the TileIR V2 path. |
| num-warps | 4 | Logical warps per CTA for scheduling and partitioning. |
| num-ctas | 1 | CTAs per cluster used by cluster-aware launch metadata. |
| pipeline-strategy | none | Selects no software pipeline, unspecialized, or warp-specialized flow. |
| unspecialized-pipeline-num-stages | 4 | Stage count for the unspecialized pipeline. |
| dynamic-persistent | false | Enables the dynamic persistent-kernel rewrite. |
| emit-line-info | none | Selects the IR snapshot used to build source line records. |
| schedule-trace-file | empty | Writes scheduler trace JSON when non-empty. |
| dump-host | empty | Dumps generated host-side callback code when non-empty. |
| host-triple | native | Host triple used by host-code generation. |
| rrt-size-threshold | 4096 | Resource-reservation-table compression threshold. |
| max-constraint-iterations | 10 | Iteration cap for resource-constraint generation. |
| approx | false | Allows approximate math in eligible lowerings. |
| ftz | false | Enables flush-to-zero math behavior. |
| index-bitwidth | 32 | Bit width used for MLIR index lowering; 0 means host word size. |
| enable-random-delay | false | Stress option for scheduler delay injection. |
| enable-debug-logging | false | Enables TileIR callback debug logging paths. |
| use-nvgpucomp-libnvvm | false | Routes NVVM compilation through NVGpuComp when enabled. |
The two scheduler knobs — rrt-size-threshold and
max-constraint-iterations — are compile-time controls. Lower thresholds
compress the resource reservation table earlier; lower iteration caps make
the solver stop sooner and fall back to conservative scheduling when
constraints remain unresolved.
Effective Option Merge
A TileIRPipelineOptions value is the resolved configuration that reaches
the pass manager. The driver builds it in three layers, applied in order;
each layer overwrites only the fields it explicitly touches, and a later
layer wins wherever two layers touch the same field, so the precedence is
unambiguous.
The first layer is the TableGen-declared per-field default. Every option
in the pipeline has a default literal written into the pass definition —
opt-level = 2, num-warps = 4, rrt-size-threshold = 4096, and so on.
Constructing a fresh TileIRPipelineOptions populates every field with
this baseline.
The second layer is the per-pass override that arrives through MLIR's
--pass-pipeline="tileir{key=value, ...}" syntax. When the user (or an
integrator) invokes the pipeline through that surface, MLIR's option
parser walks the brace-enclosed key=value list and writes each value into
the matching pipeline field, leaving every other field at its TableGen
default.
The third layer is driver-level legacy propagation. The command-line
driver predates the per-pass options syntax, and several user-facing
flags — --opt-level, --gpu-name, --lineinfo, --device-debug,
--sanitize, --host-arch, --host-os — must continue to work for
users who never type a --pass-pipeline argument. The driver therefore
copies each of those into the corresponding pipeline field after the
first two layers have settled.
TileIRPipelineOptions make_pipeline_options(const DriverFlags &flags) {
TileIRPipelineOptions opts; // TableGen defaults
if (flags.pass_pipeline_set)
parsePassPipelineOptions(opts, flags.pass_pipeline_text); // brace-list overrides
opts.opt_level = flags.opt_level; // legacy propagation
opts.compute_capability = sm_number_of(flags.gpu_name);
opts.emit_line_info = flags.lineinfo ? LineInfo::FromInput : LineInfo::None;
opts.device_debug = flags.device_debug;
opts.sanitize_memcheck = flags.sanitize == Sanitizer::Memcheck;
opts.host_arch = flags.host_arch;
opts.host_os = flags.host_os;
return opts;
}
The propagation exists because a single --opt-level=2 should still
configure the pipeline correctly without forcing the user to spell out
--pass-pipeline="tileir{opt-level=2}". A reimplementer who skips the
propagation step ends up with a tool whose driver flags never reach the
pipeline: the driver's default -O3, for example, silently runs the
pipeline at its own default of 2. That is a subtle divergence that
turns up only when an integrator's regression suite compares produced
SASS across versions.
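The effect of skipping the propagation layer can be shown with a reduced model. The sketch below keeps only three pipeline fields and a single brace-list override. The struct and function names are hypothetical, and the propagate switch exists only to contrast the two behaviors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct PipelineOpts {
    int opt_level;
    int num_warps;
    int rrt_size_threshold;
} PipelineOpts;

/* Reduced stand-in for a parsed tileir{...} brace list. */
typedef struct BraceOverrides {
    bool has_num_warps;
    int  num_warps;
} BraceOverrides;

static PipelineOpts resolve_options(const BraceOverrides *ov,
                                    int driver_opt_level, bool propagate) {
    assert(driver_opt_level >= 0 && driver_opt_level <= 3);
    PipelineOpts o = {2, 4, 4096};            /* layer 1: per-field defaults */
    if (ov != NULL && ov->has_num_warps)
        o.num_warps = ov->num_warps;          /* layer 2: brace-list override */
    if (propagate)
        o.opt_level = driver_opt_level;       /* layer 3: driver propagation */
    return o;
}
```

With propagation enabled, a driver level of 3 reaches the pipeline; with it disabled, the pipeline stays at its own default of 2 regardless of the driver flag.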
Do not collapse v2-opt-level into driver --opt-level. The two are
independent axes: v2-opt-level defaults to 0 and is only meaningful
inside the V2 pipeline, which the driver does not select on its own.
Diagnostics Surface
Four options produce artifacts useful for debugging:
| Option | Artifact |
|---|---|
| --lineinfo | Source line records in the generated device code. |
| emit-line-info=<stage> | A selected IR snapshot used as the line-info source. |
| schedule-trace-file=<path> | Chrome-timeline-style scheduler trace JSON. |
| dump-host=<path> | Generated host callback code. |
The driver does no semantic check on these paths beyond ordinary file I/O. When a path is set, the corresponding pipeline stage owns the write and reports failure through the normal compile error path.
Related pages
Driver main() Entry shows how main consumes the parsed
options; Driver Overview frames the overall
compile contract; Driver Program Handle
defines the public error-code numbering returned through the exit status;
Host Launch ABI and ptxas Knobs
covers --knobs-file=, the only ptxas-side option the driver forwards.
Debugging and Introspection is the
user-facing debugging surface: it ties the four diagnostic options in the
table above (--lineinfo, emit-line-info, schedule-trace-file,
dump-host) to the MLIR-side print, timing, and stack-trace flags and gives
a symptom-to-flag decision matrix.
Troubleshooting and Known Issues
catalogs the verbatim rejection strings produced by the validator above
(unsupported GPU target, invalid optimization level, optimized debugging is not supported, could not find libdevice), pairs each with
its root cause, and lists the gotchas that the strict CLI parser surfaces —
notably that --gpu-name does not accept the a/f arch-conditional
suffix and that sm_90 is not in the accept table.
Driver Env Vars + Runtime Gates
Abstract
tileiras carries two configuration surfaces beyond ordinary command-line
options. First, a small set of environment variables drives toolkit
discovery, ptxas knob forwarding, TMA policy, swizzle selection, and a
TileAS shared-memory debug escape hatch. Second, a family of pass-level
runtime gates lives as LLVM command-line options — not process environment
variables, but worth documenting here because they shape the same
compile-time behavior.
Environment-variable parsing is uneven by design. Some variables are presence-only, some require an exact value, and one parses a base-10 integer. A faithful reimplementation should preserve those differences rather than collapse every variable into a generic boolean parser.
Per-Feature Runtime Gates
Each variable has a known consumer sub_ADDR and a known parse mode. The
parse modes vary on purpose: several are presence-only, one demands the
literal string "1", one accepts the strings "true" and "false", one parses
a base-10 integer through strtol, and one is a path forwarded verbatim into
an argv slot.
| Variable | Consumer | Parse mode | Default behaviour | Hazard |
|---|---|---|---|---|
| TILEIR_DELAY_TMA_STORE_WAIT | (varies) | strtol | 0 (no delay) | strtol failure on non-numeric input throws a std::stoi-shaped exception that aborts the process |
| TILEIR_DEBUG_DUMP_BC | (varies) | presence | false | Setting to anything (including "0") enables BC dump |
| TILEIR_DEBUG_DUMP_LLVM | (varies) | presence | false | Setting to anything enables LLVM-IR dump |
| TILEIR_PREFER_TMA_FOR_LOAD_STORE | (varies) | string boolean ("true" / "false") | false | Prefers TMA lowering over the default cp.async / vector heuristic |
| TILEIR_ALWAYS_SWIZZLE | (varies) | presence | normal swizzle heuristic | Setting to anything (including "0") forces the swizzled-layout path |
| TILE_AS_DEBUG_UNLIMITED_SMEM | RCB | presence | SMEM cap = 232448 B | Setting to anything overrides the cap to INT_MAX; on hardware with less than 232448 B SMEM this miscompiles |
| TILE_AS_DEBUG_VERBOSE | (varies) | string-eq "1" | false | Compared to "1" only; other truthy strings like "true" are ignored |
| MLIR_ENABLE_EVO | (varies) | presence | false | Internal experimental switch; behaviour is undocumented and subject to silent change |
| PTX_KNOBS_PATH | ptxas subprocess | path string | (none) | Forwarded verbatim as --knobs-file=<path> into the ptxas argv |
Three operational details matter. First, MLIR_ENABLE_EVO and
PTX_KNOBS_PATH form an AND gate inside the ptxas-argv builder — setting
only one does not forward a knob file. Second, the presence-only gates do
not strip the value: assigning "0" or "false" still enables them,
opposite of what a deployment-script reader usually expects. Third,
TILE_AS_DEBUG_VERBOSE is a string-equality compare against "1", so
"true" and "yes" are silently ignored.
The TILEIR_DELAY_TMA_STORE_WAIT hazard deserves its own paragraph because
the failure mode is brutal. The consumer parses the value with strtol,
but the calling path treats parse failure as an exception (a
std::stoi-shaped std::invalid_argument) that percolates up unhandled.
The result is SIGABRT on terminate. Setting
TILEIR_DELAY_TMA_STORE_WAIT=foo instead of TILEIR_DELAY_TMA_STORE_WAIT=0
produces no warning and no fallback to default — it aborts the compile.
Deployment scripts should either omit the variable entirely or set a
decimal integer.
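A deployment script can check values before exporting them. The following is a hypothetical pre-flight helper, not part of tileiras: the enum names, the function names, and the choice of which values count as surprising are all assumptions. The decimal check is deliberately stricter than std::stoi, because it also rejects trailing characters.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef enum { MODE_PRESENCE, MODE_EQUALS_ONE, MODE_DECIMAL } ParseMode;
typedef enum { LINT_OK, LINT_SURPRISING, LINT_FATAL } LintResult;

/* True when s is entirely a base-10 integer that fits in int. */
static bool is_decimal_int(const char *s) {
    if (s == NULL || *s == '\0')
        return false;
    errno = 0;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (end == s || *end != '\0')
        return false;                        /* no digits, or trailing junk */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return false;
    return true;
}

/* value == NULL models an unset variable, which is always safe. */
static LintResult lint_env_value(ParseMode mode, const char *value) {
    if (value == NULL)
        return LINT_OK;
    switch (mode) {
    case MODE_PRESENCE:
        /* Any value enables the gate, so a falsy-looking one is a trap. */
        if (*value == '\0' || strcmp(value, "0") == 0 ||
            strcmp(value, "false") == 0)
            return LINT_SURPRISING;
        return LINT_OK;
    case MODE_EQUALS_ONE:
        /* Only "1" enables; "0" is harmless; anything else is ignored. */
        if (strcmp(value, "1") == 0 || strcmp(value, "0") == 0)
            return LINT_OK;
        return LINT_SURPRISING;
    case MODE_DECIMAL:
        return is_decimal_int(value) ? LINT_OK : LINT_FATAL;
    }
    assert(!"unknown parse mode");
    return LINT_FATAL;
}
```

Running such a check in the launch wrapper turns the abort described above into an ordinary configuration error before the compiler starts.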
CUDA-Root Resolution
CUDA-root resolution is trickier than it looks: two separate resolvers live
in the binary and they disagree on the miss path. The driver-side resolver
sub_5773C0 fires early during command-line processing; the NVVM-side
resolver sub_1A41D30 fires later from inside libdevice and libnvvm lookup.
Both walk the same env-var chain — CUDA_ROOT, then CUDA_HOME, then
CUDA_PATH — but they part ways when every variable is unset.
| Resolver | sub_ADDR | Chain | Miss behaviour |
|---|---|---|---|
| Driver | sub_5773C0 | CUDA_ROOT → CUDA_HOME → CUDA_PATH | Walks /proc/self/exe up two directories; aborts with "cannot find CUDA installation" if that path does not exist |
| NVVM | sub_1A41D30 | CUDA_ROOT → CUDA_HOME → CUDA_PATH | Returns the empty string; no /proc/self/exe fallback, no abort |
The divergence is the hazard. A deployment that leaves CUDA_ROOT unset
but keeps /proc/self/exe resolvable inside the expected toolkit layout
sees the driver succeed silently. Later, when the NVVM path tries to locate
libdevice.10.bc, its resolver returns "", and the libdevice loader
joins that empty string against nvvm/libdevice/libdevice.10.bc. The user
gets a confusing "libdevice.10.bc not found in $CUDA_ROOT/nvvm/libdevice"
error even though the driver itself reported no problem. The workaround is
mechanical: always export CUDA_ROOT explicitly in production deployments,
even when /proc/self/exe would have been enough for the driver alone.
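A launch wrapper can detect the risky configuration before invoking the compiler. The helper below is a hypothetical pre-flight check, not code from the binary. It walks candidate values in priority order and reports the first one that is set, using the same non-NULL test as the resolver chain described above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the first non-NULL entry, or NULL when every entry is unset.
 * Callers pass values in priority order: CUDA_ROOT, CUDA_HOME, CUDA_PATH. */
static const char *first_set(const char *const values[], size_t count) {
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (values[i] != NULL)
            return values[i];
    }
    return NULL;
}
```

In practice the array would be filled from getenv("CUDA_ROOT"), getenv("CUDA_HOME"), and getenv("CUDA_PATH"). A NULL result means only the driver-side fallback can succeed, which is the case where the two resolvers disagree.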
Runtime-gate globals (static-ctor populated)
The runtime-gate layer is ordinary LLVM option storage for individual
passes. These flags help when debugging pass behavior, but getenv never
reads them.
| Gate family | Representative flags | Default behavior |
|---|---|---|
| CDP inline pretreat | CDP launch-name table | Recognizes the known CDP launch helper names. |
| Unsafe algebra | -opt-unsafe-algebra | Enabled. |
| Dead barrier elimination | -basic-dbe | Disabled unless requested. |
| SCEV-CGP | -scev-cgp-*, -do-scev-cgp, -do-function-scev-cgp | Enabled with bounded search budgets. |
| Base-address strength reduction | -do-base-address-strength-reduce | Enabled at level 4. |
| DOT and FFMA fusion | -enable-dot, -enable-fma-to-ffma2, -balance-dot-chain | DOT is on; FFMA2 fusion is off. |
| IPMSP | -do-clone-for-ip-msp, -dump-ip-msp | Clone budget is automatic; dump is off. |
| LSA | -lsa-opt | Enabled. |
| Memory-space optimization | -track-indir-load, -dump-ir-before-memory-space-opt | Tracking is on; dumps are off. |
| ProcessRestrict | -process-restrict, -apply-multi-level-restrict | Base restrict processing is on. |
| NVPTX printf lowering | -nvvm-lower-printf | Enabled. |
| Kernel selection | -select-kernel-range, -select-kernel-list | Empty selection means no narrowing. |
Consumers
Each environment variable is consumed close to the operation it affects:
PtxasArgs build_ptxas_args(const DriverState *state) {
PtxasArgs args = default_ptxas_args(state);
if (getenv("MLIR_ENABLE_EVO") != NULL) {
const char *knobs = getenv("PTX_KNOBS_PATH");
if (knobs != NULL)
args_append(&args, concat("--knobs-file=", knobs));
}
return args;
}
bool unlimited_smem_debug(void) {
return getenv("TILE_AS_DEBUG_UNLIMITED_SMEM") != NULL;
}
bool verbose_debug(void) {
const char *value = getenv("TILE_AS_DEBUG_VERBOSE");
return value != NULL && strcmp(value, "1") == 0;
}
int delay_tma_store_wait(void) {
const char *value = getenv("TILEIR_DELAY_TMA_STORE_WAIT");
if (value == NULL)
return 0;
// Parse path raises std::invalid_argument on bad input;
// calling frame does not catch, so the process aborts.
return std::stoi(value);
}
The two CUDA-root resolvers look structurally similar, but their miss paths are not equivalent. The pseudocode below is the contract a reimplementation must preserve byte-for-byte — in particular, the NVVM resolver must keep its empty-string return, because downstream libdevice loading relies on that sentinel to defer the actual lookup to a layered fallback that lives outside this function.
const char *resolveCudaRoot_driver(void) { // sub_5773C0
if (const char *p = getenv("CUDA_ROOT")) return p;
if (const char *p = getenv("CUDA_HOME")) return p;
if (const char *p = getenv("CUDA_PATH")) return p;
return walkSelfExeUpTwo(); // /proc/self/exe -> ../../
}
const char *resolveCudaRoot_nvvm(void) { // sub_1A41D30
if (const char *p = getenv("CUDA_ROOT")) return p;
if (const char *p = getenv("CUDA_HOME")) return p;
if (const char *p = getenv("CUDA_PATH")) return p;
return ""; // hazard: no fallback
}
Related pages
Host Launch ABI and ptxas Knobs
covers how the PTX_KNOBS_PATH value lands in the ptxas argv;
Subprocess Harness shows the
ptxas-launcher argv shape that consumes the forwarded knob path;
Driver CLI Options documents the
companion option flags whose semantics overlap with these gates;
Resource Constraint Builder and RRT
is the RCB consumer of TILE_AS_DEBUG_UNLIMITED_SMEM.
Host Launch ABI + ptxas Knobs
Abstract
tileiras is assembler-side. It never calls cuLaunchKernel,
cuLaunchKernelEx, or cuKernelSetAttribute directly — instead it emits
kernel launch metadata into IR attributes and PTX directives, and ptxas
lifts that information into the cubin metadata consumed by the CUDA runtime
or driver.
The host-visible launch ABI splits across three channels:
- PTX directives in each kernel .entry header.
- MLIR nvvm.* attributes on lowered LLVM functions.
- gpu.launch_func and nv_tileaa.launch_func properties that carry dynamic launch operands through the lowering pipeline.
The ptxas --knobs-file=<path> path is separate. tileiras forwards the
argument only when the environment gate is enabled; ptxas owns the file
grammar and every diagnostic.
Host-side launch ABI
Since the driver never synthesizes CUDA-driver launch calls, the compiled
cubin carries static launch metadata and leaves dynamic launch assembly to
the consumer. Static metadata flows from nvvm.* attributes and PTX
directives; dynamic metadata rides on launch-operation properties and
segment-size arrays during MLIR lowering.
The split that matters:
| Channel | Carrier | Purpose |
|---|---|---|
| Static thread shape | nvvm.maxntid, nvvm.reqntid, .maxntid, .reqntid | Communicates block shape constraints. |
| Static cluster shape | nvvm.cluster_dim, .reqnctapercluster, .maxclusterrank | Communicates SM90+ cluster launch constraints. |
| Static register budget | nvvm.maxnreg, .maxnreg | Communicates register budget to ptxas. |
| Static CTA residency hint | nvvm.minctasm, .minnctapersm | Communicates minimum CTAs per SM. |
| Dynamic operands | operandSegmentSizes | Preserves launch operand partitioning through lowering. |
| Dynamic shared memory | launch operand segment | Eventually drives %dynamic_smem_size in PTX/SASS. |
Cluster directives are gated to SM90 and newer. On older targets the
compiler suppresses .blocksareclusters, .explicitcluster,
.reqnctapercluster, and .maxclusterrank even when cluster-shaped
metadata is present upstream.
gpu.launch_func carries kernelFunc, kernelModule, and
operandSegmentSizes. The setter also accepts the older
operand_segment_sizes spelling for compatibility with MLIR v17-era IR.
By the nv_tileaa.launch_func stage the kernel reference flattens into a
single kernel property alongside the same operand segment sizing.
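How a segment-size array partitions a flat operand list can be sketched as follows. This is a minimal illustration, not the tileiras implementation: OperandRange and segment_range are hypothetical names, and the sizes in the usage note are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open range [begin, end) of one operand group in the flat list. */
typedef struct { size_t begin, end; } OperandRange;

/* Hypothetical helper: segment k starts where segments 0..k-1 end. */
OperandRange segment_range(const int32_t *segment_sizes, size_t num_segments,
                           size_t index) {
    OperandRange r = {0, 0};
    if (index >= num_segments)
        return r;                       /* out of range: empty */
    for (size_t i = 0; i < index; i++)
        r.begin += (size_t)segment_sizes[i];
    r.end = r.begin + (size_t)segment_sizes[index];
    return r;
}
```

With sizes {0, 3, 3, 1, 4}, segment 3 covers operand index 6 only and segment 4 covers indices 7 through 10. A zero-sized segment marks an absent optional group without shifting the meaning of later segments, which is why the array has to survive lowering intact.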
nvvm.* Annotations and PTX Directives
The nvvm.* attribute family is the canonical in-IR carrier of launch
metadata. Legacy !nvvm.annotations tuples still parse and can be
transplanted into attribute form; an internal marker prevents repeated
legacy scans after the transplant.
The verifier enforces the shape rules that matter:
- Dimensional attributes contain one to three i32 values, except cluster dimensions, which require three values.
- Scalar resource attributes are integer attributes.
- nvvm.blocksareclusters requires both nvvm.reqntid and nvvm.cluster_dim on the same function.
| Kind | Attribute name | Shape | PTX projection | Target gate |
|---|---|---|---|---|
| kernel | nvvm.kernel | UnitAttr | .entry instead of .func | all SMs |
| maxntid | nvvm.maxntid | 1..3 i32 values | .maxntid | all SMs |
| reqntid | nvvm.reqntid | 1..3 i32 values | .reqntid | all SMs |
| cluster_dim | nvvm.cluster_dim | exactly 3 i32 values | .explicitcluster, .reqnctapercluster | SM90+ |
| minctasm | nvvm.minctasm | integer | .minnctapersm | all SMs |
| maxnreg | nvvm.maxnreg | integer | .maxnreg | all SMs |
| maxclusterrank | nvvm.maxclusterrank | integer | .maxclusterrank | SM90+ |
| blocksareclusters | nvvm.blocksareclusters | UnitAttr | .blocksareclusters | SM90+ |
| grid_constant | nvvm.grid_constant | 1-based argument index list | Drives constant-argument layout | all SMs |
| annotations_transplanted | nvvm.annotations_transplanted | UnitAttr | Internal marker only | all SMs |
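The shape rules the verifier enforces can be sketched as a single predicate. This is an illustrative sketch under stated assumptions: LaunchAttrs is a hypothetical flattened view of one function's attributes (a count of 0 means the attribute is absent), and the real verifier works on MLIR attributes rather than on a struct like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the launch attributes on one function. */
typedef struct {
    size_t maxntid_count;      /* number of i32 values, 0 = absent */
    size_t reqntid_count;
    size_t cluster_dim_count;
    bool blocks_are_clusters;  /* nvvm.blocksareclusters present */
} LaunchAttrs;

/* Absent, or one to three values. */
static bool dim_count_ok(size_t n) { return n <= 3; }

bool verify_launch_attrs(const LaunchAttrs *a) {
    if (!dim_count_ok(a->maxntid_count) || !dim_count_ok(a->reqntid_count))
        return false;
    /* Cluster dimensions, when present, need exactly three values. */
    if (a->cluster_dim_count != 0 && a->cluster_dim_count != 3)
        return false;
    /* blocksareclusters needs both reqntid and cluster_dim. */
    if (a->blocks_are_clusters &&
        (a->reqntid_count == 0 || a->cluster_dim_count == 0))
        return false;
    return true;
}
```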
Several invariants matter to reimplementers. nvvm.maxclusterrank is
stored as an integer-valued function attribute, unlike the string-shaped
legacy forms used by some older launch metadata. local_maxnreg has no
new nvvm.* mirror — it stays legacy-only and is never printed as a PTX
directive by this stage. When updating dimensional attributes, write
every axis back together so the new attribute form stays coherent even
when the legacy source used split per-axis tuples.
PTX emission walks the verified attribute set in a fixed order so that
related directives stay adjacent in the kernel header. The thread-shape
group (.maxntid, .reqntid) emits first, followed by the residency
hints (.minnctapersm, .maxnreg), and finally the cluster group
(.blocksareclusters, .explicitcluster, .reqnctapercluster,
.maxclusterrank) when the target supports clusters. Both .maxntid
and .reqntid may appear on the same kernel — the PTX semantics make
them complementary: .reqntid declares an exact block shape the kernel
relies on, .maxntid declares an upper bound for register-pressure
budgeting. The verifier checks shape consistency but does not collapse
or override either directive, and the emitter prints both as written
when both are set.
void emit_launch_directives(LLVMFuncOp fn, Target target, PTXWriter &out) {
if (auto dims = get_dim_attr(fn, "nvvm.maxntid"))
out.directive(".maxntid", *dims);
if (auto dims = get_dim_attr(fn, "nvvm.reqntid"))
out.directive(".reqntid", *dims);
if (auto n = get_int_attr(fn, "nvvm.minctasm"))
out.directive(".minnctapersm", *n);
if (auto n = get_int_attr(fn, "nvvm.maxnreg"))
out.directive(".maxnreg", *n);
if (!target_supports_clusters(target))
return; // suppress all cluster directives pre-SM90
if (fn->hasAttr("nvvm.blocksareclusters"))
out.directive(".blocksareclusters");
if (auto dims = get_dim_attr(fn, "nvvm.cluster_dim")) {
out.directive(".explicitcluster");
out.directive(".reqnctapercluster", *dims);
}
if (auto n = get_int_attr(fn, "nvvm.maxclusterrank"))
out.directive(".maxclusterrank", *n);
}
Two structural invariants keep this emitter simple.
nvvm.blocksareclusters is verified to require both nvvm.reqntid and
nvvm.cluster_dim on the same function, so by the time emission runs
the three directives are guaranteed to form a coherent triple. Cluster
directives are suppressed wholesale on pre-SM90 targets; the verifier
permits the attributes upstream so a single IR module can lower for
multiple targets, but the per-target emitter refuses to print them when
ptxas would reject the result.
How tileiras chooses each directive
Verifying an attribute is well-formed is not the same as choosing its value. The well-formedness rules above guard against malformed PTX; the choice of value is what determines whether the kernel runs at all and how fast it runs when it does. Each directive has its own input channel — the kernel-spec attribute the upstream lowering attaches, a user-supplied annotation that survives the front-end, or a constraint imposed by an instruction the compiler emitted later. The table below walks the policy for each directive.
| Directive | Primary input | Policy |
|---|---|---|
| .entry kernel_name | nvvm.kernel marker on the LLVM function | always emit when the marker is present; the function name is the symbol the cubin exposes |
| .maxntid X, Y, Z | upper-bound hint from kernel-spec or DSL annotation | emitted when the bound is not also a hard contract; lets ptxas size the register fragment without pinning the launch shape |
| .reqntid X, Y, Z | hard contract from kernel-spec: warp-specialized split, warp-group requirement, or named-warp partition | emitted when the lowering depends on an exact thread count (every WGMMA or tcgen05 user, every kernel with named producer/consumer warps) |
| .minnctapersm N | occupancy floor from kernel-spec | emitted when the user requested a minimum residency, usually for kernels whose throughput is sensitive to warp-scheduler latency hiding |
| .maxnreg N | per-thread register budget from kernel-spec | emitted to let ptxas trade registers for occupancy; typical values come from a kernel-specific computation of accumulator_regs + working_regs + slack |
| .explicitcluster | implied by nvvm.cluster_dim presence | always emitted with .reqnctapercluster when the kernel is cluster-shaped on SM90+ |
| .reqnctapercluster X, Y, Z | cluster shape from kernel-spec | emitted on SM90+ when nvvm.cluster_dim is present; suppressed wholesale on older targets |
| .maxclusterrank N | portability cap from kernel-spec | emitted on SM90+ when the user wants a portable launch shape, capping cluster size below the device-specific maximum |
| .blocksareclusters | only legal when nvvm.reqntid and nvvm.cluster_dim are also present | emitted on SM90+ for kernels that opt into the single-CTA-cluster convention; lets cluster-aware code paths execute on a degenerate cluster shape |
The .maxntid versus .reqntid distinction is the policy decision
that affects the most kernels. .maxntid is an upper bound the launch
must respect but does not have to saturate; the driver accepts any
launch shape with X, Y, Z components no larger than the declared
maxima. .reqntid is a hard contract — the driver rejects any launch
whose block shape does not match the declared values exactly. Tileiras
emits .reqntid whenever the lowering has already baked in a specific
thread count: any kernel that emits WGMMA needs 128 threads per CTA
(four warps form one warp group), any kernel with warp-specialized
producer/consumer splits needs the exact named warp count, and any
kernel with named-warp NamedBarrier slots needs the exact thread count
the slot binding assumed. For kernels that adapt to launch shape —
elementwise kernels, kernels that use only synchronous mma.sync
forms, kernels with no warp specialization — tileiras emits only
.maxntid so the same cubin works for a range of launch shapes.
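The two acceptance rules described above can be written as a pair of predicates. This is a sketch of the policy as this page states it (component-wise bound for .maxntid, exact match for .reqntid); Dim3 and both function names are hypothetical and not part of any CUDA API.

```c
#include <stdbool.h>

typedef struct { unsigned x, y, z; } Dim3;

/* .maxntid: each component may be at most the declared maximum. */
bool satisfies_maxntid(Dim3 launch, Dim3 limit) {
    return launch.x <= limit.x && launch.y <= limit.y && launch.z <= limit.z;
}

/* .reqntid: the block shape must match the declared values exactly. */
bool satisfies_reqntid(Dim3 launch, Dim3 required) {
    return launch.x == required.x && launch.y == required.y &&
           launch.z == required.z;
}
```

A 64-thread launch passes a 256-thread .maxntid bound but fails a 128-thread .reqntid contract, which is why kernels that adapt to launch shape carry only the former.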
The .maxnreg choice is similarly central to performance. A
WGMMA-using kernel must leave room for the accumulator fragment: an
m64n256k16 FP32 WGMMA needs 128 FP32 registers per thread just for
the accumulator (64 × 256 elements spread across the 128 threads of a
warp group), plus the working set for descriptors, loop indices,
and any other live values. Setting .maxnreg too low forces ptxas to
spill the accumulator to local memory, which silently regresses
throughput by an order of magnitude. Setting it too high reduces
occupancy and hurts latency hiding. The kernel-spec carries the
result of a per-kernel computation that balances both — usually
accumulator_regs + descriptor_regs + slack, with slack calibrated
to the SM's register file size and the desired CTAs per SM.
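The budget arithmetic can be sketched as below. The accumulator term follows from spreading an M × N FP32 tile over the 128 threads of one warp group, one element per 32-bit register. Both function names are hypothetical, and the descriptor and slack figures in the usage note are invented for illustration rather than taken from any kernel-spec.

```c
#include <stdint.h>

/* FP32 accumulator registers per thread for an M x N tile held by one
 * warp group of 128 threads. */
uint32_t accumulator_regs_per_thread(uint32_t m, uint32_t n) {
    return (m * n) / 128u;
}

/* Hypothetical budget: the three terms named in the text, summed. */
uint32_t maxnreg_budget(uint32_t accumulator_regs, uint32_t descriptor_regs,
                        uint32_t slack) {
    return accumulator_regs + descriptor_regs + slack;
}
```

For a 64 × 128 tile the accumulator term is 64 registers per thread. With an assumed 16 descriptor registers and 24 registers of slack the budget comes to 104.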
Cluster-shape directives are an all-or-nothing group. When the
kernel-spec carries nvvm.cluster_dim, the lowering emits
.explicitcluster, .reqnctapercluster, and any nvvm.maxclusterrank
or nvvm.blocksareclusters markers; when the spec is silent, no
cluster directive is emitted. The verifier rule that
nvvm.blocksareclusters requires both nvvm.reqntid and
nvvm.cluster_dim means the three-directive triple is always coherent
by the time the emitter sees it.
GPU Execution Model is the canonical reference for how the five tiers (thread, warp, CTA, cluster, grid) consume these directives at runtime, with a worked example that traces the directive emission from kernel-spec to PTX header for a Hopper GEMM.
ptxas Knobs File Format
When both MLIR_ENABLE_EVO and PTX_KNOBS_PATH are set, tileiras
forwards --knobs-file=<path> to ptxas. It does not parse or validate the
file — the grammar belongs to ptxas.
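The two-variable gate can be sketched as a small helper. This is an assumption-laden illustration: build_knobs_arg is a hypothetical name, and treating "set" as "non-NULL and non-empty" is a guess, since the source does not say how an empty value is handled.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes "--knobs-file=<path>" into buf when both gate values are present.
 * Returns 1 if the argument should be forwarded, 0 otherwise. */
int build_knobs_arg(const char *enable_evo, const char *knobs_path,
                    char *buf, size_t cap) {
    if (enable_evo == NULL || enable_evo[0] == '\0')
        return 0;                       /* gate variable missing */
    if (knobs_path == NULL || knobs_path[0] == '\0')
        return 0;                       /* no path to forward */
    int n = snprintf(buf, cap, "--knobs-file=%s", knobs_path);
    return n > 0 && (size_t)n < cap;    /* refuse a truncated argument */
}
```

A caller would pass getenv("MLIR_ENABLE_EVO") and getenv("PTX_KNOBS_PATH") directly. Nothing here opens the file, matching the point that the path is forwarded unchecked.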
The file format is:
arbitrary preamble
[knobs]
command command command
The [knobs] sentinel is case-sensitive; text before it is ignored.
After the sentinel, whitespace, ~, and ;; separate commands. The
command stream has no quoting, no escaping, no comment syntax.
Commands have three forms:
| Form | Meaning |
|---|---|
| identifier=value | Assign a knob value. The = is accepted but not always required by ptxas. |
| WHEN ... | Parse a conditional knob clause. |
| INJECTSTRING ... ;; | Parse an internal SASS-splice string terminated by ;;. |
Values parse per the knob descriptor type. The recovered parser accepts
signed and unsigned integers, integer ranges, integer lists, 32-bit and
64-bit floats, strings, pointers, opcode lists, opcode-pair lists, and
WHEN clauses. Integer parsing is decimal — a string like 0x10 parses
as zero, with the trailing text ignored by the numeric conversion path.
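The 0x10 behavior is what a base-10 conversion does with a hex-looking string. The sketch below reproduces it with the C library's strtoll, assuming the ptxas numeric path behaves comparably; parse_decimal_knob is a hypothetical name and the actual ptxas parser is not shown in the source.

```c
#include <stdlib.h>

/* Base-10 conversion: consumes leading decimal digits and stops at the
 * first character that is not one. *rest receives the unparsed tail. */
long long parse_decimal_knob(const char *text, const char **rest) {
    char *end = NULL;
    long long value = strtoll(text, &end, 10);
    if (rest != NULL)
        *rest = end;
    return value;
}
```

Given "0x10", the conversion consumes only the leading 0 and leaves "x10" behind, so the knob silently receives zero instead of 16.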
Malformed knob files are fatal to the ptxas child process. Duplicate assignments follow a last-wins policy: the later command overwrites the earlier runtime value. Identifier matching is case-insensitive from the user's point of view.
tileiras runs no preflight check that the path exists, contains
[knobs], or uses valid identifiers. Every knob-file diagnostic comes
from ptxas and surfaces through the normal subprocess diagnostic buffer.
Related pages
Driver Overview covers how the produced kernel
directives travel into the relocatable object;
Driver CLI Options catalogues the
user-visible flags that map into pipeline options;
ptxas Handoff Protocol
documents the ptxas-side knob-file grammar in detail;
Attribute System and Lowering
documents the full lifecycle of each launch-shape attribute from frontend
hint through nvvm.* directive carrier, including which transitions
silently drop the fact and produce a degraded kernel.
Subprocess Harness
Abstract
One POSIX subprocess harness drives every external tool tileiras invokes.
The same launcher serves ptxas, nvdisasm, and any helper tool — argument
construction is tool-specific but process behavior is shared: argv/envp
setup, stdio redirection, optional timeout, child status decoding, stderr
capture, and resource-usage collection.
The harness itself is CUDA-agnostic. CUDA-specific arguments like
--input-as-string, --knobs-file=<path>, --nv-host=<path>, and
temporary cubin paths are assembled by the driver before it calls the
generic launcher.
POSIX subprocess launcher
The launcher accepts an executable path, argv vector, envp vector, redirection descriptors for stdin/stdout/stderr, optional timeout information, and output buffers for diagnostics and resource usage.
| Path | Used when | Behavior |
|---|---|---|
| posix_spawn | No session or resource-limit hook is needed. | Fast path with file actions and retry on interrupted spawn attempts. |
| fork + exec | setsid or process resource limits are requested. | Child applies redirection, limits, optional setsid, then executes the target. |
The fork path uses shell-compatible exit codes for exec failure — 127
means the program was not found, 126 means the program existed but could
not be executed. The wait decoder maps those codes back into user-facing
diagnostics.
int launch_child(ProcessSpec *spec, ChildProcess *child, string *error) {
if (!spec->setsid && !spec->has_resource_limits)
return spawn_with_posix_spawn(spec, child, error);
pid_t pid = fork();
if (pid < 0)
return system_error(error, "Couldn't fork");
if (pid == 0)
exec_child_or_exit(spec);
child->pid = pid;
return 0;
}
stderr-merge optimization
When stderr and stdout target the same sink, the launcher redirects stderr
with dup2(stdout, stderr) instead of opening the same destination twice.
This is the common ptxas shape — both streams fold into one in-memory
diagnostic buffer so the parent can report assembler failure with full
context.
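On the posix_spawn path the same merge can be expressed with file actions. The sketch below is a minimal illustration rather than the tileiras launcher: run_merged is a hypothetical name, the capture goes to a caller-supplied buffer instead of the real diagnostic buffer, and interrupted reads are not retried.

```c
#include <spawn.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Runs argv[0] with stdout and stderr merged into one pipe and copies up
 * to cap-1 bytes of the merged output into buf. Returns the raw wait
 * status, or -1 on setup failure. */
int run_merged(char *const argv[], char *buf, size_t cap) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, fds[1], STDOUT_FILENO);
    /* stderr is duplicated from the already-redirected stdout. */
    posix_spawn_file_actions_adddup2(&fa, STDOUT_FILENO, STDERR_FILENO);
    posix_spawn_file_actions_addclose(&fa, fds[0]);
    posix_spawn_file_actions_addclose(&fa, fds[1]);

    pid_t pid;
    int rc = posix_spawn(&pid, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    close(fds[1]);                      /* parent keeps only the read end */
    if (rc != 0) {
        close(fds[0]);
        return -1;
    }

    size_t len = 0;
    ssize_t n;
    while (len + 1 < cap && (n = read(fds[0], buf + len, cap - 1 - len)) > 0)
        len += (size_t)n;
    buf[len] = '\0';
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Because both descriptors share one pipe, output from the two streams arrives in the order the child wrote it, which keeps assembler diagnostics next to the output they refer to.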
SIGALRM wait4 timeout
Timeout handling rides on wait4 and SIGALRM. With a timeout enabled,
the parent installs a temporary alarm handler, arms alarm(seconds), and
waits on the child. If wait4 is interrupted by the alarm, the parent
sends SIGKILL, disarms the alarm, restores the old signal handler, and
reaps the child.
int wait_child(pid_t pid, unsigned timeout_s, ProcessResult *result) {
install_alarm_handler();
if (timeout_s != 0)
alarm(timeout_s);
int status = 0;
int rc = wait4(pid, &status, 0, &result->usage);
if (rc < 0 && errno == EINTR && timeout_s != 0) {
kill(pid, SIGKILL);
alarm(0);
restore_alarm_handler();
waitpid(pid, &status, 0);
return child_timed_out(result);
}
alarm(0);
restore_alarm_handler();
return decode_child_status(status, result);
}
Status decoding follows normal POSIX rules. A signaled child reports the
signal name and whether a core file was produced. Exit code 126 means the
program could not be executed; exit code 127 means command-not-found.
Other codes are returned directly to the caller.
ptxas launcher
The ptxas adapter assembles a fixed argv shape around the serialized PTX text:
| Argv token | Origin | Purpose |
|---|---|---|
-arch sm_NN | --gpu-name | Selects sm_100, sm_103, sm_110, sm_120, or sm_121. |
--opt-level N | --opt-level | Forwards the driver optimization level, default 3. |
--input-as-string <PTX> | PTX serializer | Passes PTX text by argv instead of through a temporary PTX file. |
--knobs-file=<path> | MLIR_ENABLE_EVO and PTX_KNOBS_PATH | Forwards a ptxas internal knob file when both env vars are set. |
--nv-host=<path> | host-code path | Points ptxas at companion host-side code when present. |
The normal ptxas call takes the posix_spawn fast path and merges stdout
and stderr into one capture buffer. The assembled cubin comes back through
the child's stdout, not via a temporary output file.
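The argv shape in the table can be sketched as follows. Several details are assumptions made for illustration: the token order simply follows the table, -arch and its value are pushed as two tokens (the table does not say whether they are one or two), and PtxasArgv, build_ptxas_argv, and the buffer sizes are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define PTXAS_MAX_ARGS 12

typedef struct {
    const char *argv[PTXAS_MAX_ARGS];
    size_t argc;
    char knobs_arg[256];   /* backing storage for optional tokens */
    char host_arg[256];
} PtxasArgv;

/* knobs_path and host_path may be NULL when the optional token is absent. */
void build_ptxas_argv(PtxasArgv *out, const char *sm, const char *opt_level,
                      const char *ptx_text, const char *knobs_path,
                      const char *host_path) {
    size_t n = 0;
    out->argv[n++] = "ptxas";
    out->argv[n++] = "-arch";
    out->argv[n++] = sm;
    out->argv[n++] = "--opt-level";
    out->argv[n++] = opt_level;
    out->argv[n++] = "--input-as-string";
    out->argv[n++] = ptx_text;          /* PTX travels in argv, not a file */
    if (knobs_path != NULL) {
        snprintf(out->knobs_arg, sizeof out->knobs_arg,
                 "--knobs-file=%s", knobs_path);
        out->argv[n++] = out->knobs_arg;
    }
    if (host_path != NULL) {
        snprintf(out->host_arg, sizeof out->host_arg,
                 "--nv-host=%s", host_path);
        out->argv[n++] = out->host_arg;
    }
    out->argv[n] = NULL;                /* argv must be NULL-terminated */
    out->argc = n;
}
```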
nvdisasm driver
The SASS dump adapter takes a command string, splits it into argv words,
and expects the first word to resolve through PATH. The documented
default is:
nvdisasm -c
The adapter writes the ptxas-produced cubin to a temporary file, appends that file path as the final argv element, launches the command, captures stdout, and removes the temporary cubin if the driver created it. When the caller provided an existing cubin path, lifetime management stays with the caller.
int dump_sass(const DumpOptions *opts, ByteSpan cubin, string *out) {
Argv argv = split_command(opts->dump_sass_command);
if (argv.len == 0)
return error("Please provide a valid dumpSassCommand. For example, `nvdisasm -c`.");
TempFile temp = write_temp_cubin(opts->input_base_name, cubin);
argv_push(&argv, temp.path);
int rc = run_child(argv, CAPTURE_STDOUT_AND_STDERR, out);
remove_temp_file(&temp);
return rc;
}
Dump-command failures surface as compile-path failures because the driver treats SASS dumping as part of the requested output action.
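The command-splitting step in dump_sass can be sketched as a plain whitespace splitter. This is an assumption: the source says only that the command string is split into argv words, so the lack of quoting or escaping support here is a simplification, and this split_command is a hypothetical stand-in with its own signature.

```c
#include <ctype.h>
#include <stddef.h>

/* Splits cmd in place on whitespace. Stores up to max_words pointers in
 * words[] and returns the number of words found. */
size_t split_command(char *cmd, char **words, size_t max_words) {
    size_t count = 0;
    char *p = cmd;
    while (*p != '\0' && count < max_words) {
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;                        /* skip separators */
        if (*p == '\0')
            break;
        words[count++] = p;             /* start of a word */
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        if (*p != '\0')
            *p++ = '\0';                /* terminate the word in place */
    }
    return count;
}
```

An all-whitespace command yields zero words, which corresponds to the argv.len == 0 error branch.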
nvdisasm command-line construction
Once tileiras finishes its MLIR pipeline and PTX emission, the compile dispatcher sub_57A8E0 shells out to two
external programs: ptxas (PTX to SASS) and nvdisasm (SASS to disassembled text, embedded as an ELF section). Both
invocations route through the subprocess helpers sub_44A36C0 and sub_44A3430, with the raw-ostream sink sub_6CF9C0
draining the child's stdout back into the parent.
Every nvdisasm invocation starts with the literal 5-byte flag block "-uumn" stored at rodata 0x57BB97. The literal
packs four flags into one argv token: -u (unsigned offsets), -u (literal second occurrence, triggering the
canonical re-entry behaviour known from nvdisasm pre-13.1), -m (mnemonic-only emission), and -n (no header). All
five bytes are pushed as one argv element rather than four separate flags.
With config.sanitize == 1 — the only currently-defined sanitize mode, memcheck — the helper appends a 41-byte tail
starting with a leading space: -sanitize=memcheck -g-tmem-access-check. The trailing flag -g-tmem-access-check
instruments tensor-memory (TMEM) loads and stores, a Blackwell-and-newer concern consistent with the
sm_100..sm_121 target table. The full sanitize-on argv therefore consists of the 5-byte "-uumn" token followed by the
41-byte tail token.
The dispatcher writes the literal "nvdisasm -c" at rodata 0x57B6BB as the command-line prefix before the flag block.
Program path resolution leaves nvdisasm to PATH; the -c flag asks nvdisasm to emit a section-friendly format
suitable for embedding inside the final ELF relocatable.
| Rodata address | Literal contents | Role |
|---|---|---|
| 0x57B6BB | nvdisasm | Program name, resolved through PATH. |
| 0x57B6C8 | -c | Section-friendly output format. |
| 0x57BB97 | -uumn | Combined flag block (unsigned / re-entry / mnemonic / no-header). |
| 0x57BB9C | -sanitize=memcheck -g-tmem-access-check | Sanitize tail, leading space included, appended only when sanitize == 1. |
The sibling ptxas invocation assembles its argv differently. The prefix is "ptxas", followed by --input-as-string
and the PTX text as a sized string, then --knobs-file= with the optional knobs path from the env-var registry, and
finally --nv-host= with the host triple. The boundary-spec page
ptxas Handoff Protocol
covers the ptxas side in detail; this section sticks to how the parent assembles the nvdisasm argv.
void buildNvdisasmCmd(const TileirasProgram *p, ArgvBuilder *out) {
argvPush(out, "nvdisasm"); // 0x57B6BB
argvPush(out, "-c"); // 0x57B6C8
argvPush(out, "-uumn"); // 0x57BB97 — 5 bytes, single token
if (p->sanitize == 1) {
argvPush(out, " -sanitize=memcheck -g-tmem-access-check"); // 0x57BB9C — leading space included
}
}
With the argv vector built, sub_44A3430 spawns the child via posix_spawn and wires stdout and stderr through pipes back to
the parent. The parent drains the disassembly text from the stdout pipe via sub_6CF9C0 (the raw_ostream sink) and
concatenates the captured bytes into the final ELF relocatable as a .nvdisasm section. The temporary cubin path
written by the SASS dump adapter is passed as the final argv element, exactly as the previous section described.
Related pages
ptxas Handoff Protocol
documents the ptxas side of the boundary including the PTX text protocol and
cubin return path; Host Launch ABI and ptxas Knobs
covers the --knobs-file= grammar consumed by ptxas;
Driver Env Vars and Runtime Gates
catalogues the env-var registry that produces the optional PTX_KNOBS_PATH
forwarded into the ptxas argv;
Driver CLI Options
documents the --sanitize=memcheck option that adds the nvdisasm sanitize
tail.
TILEIR_CALLBACKS ABI
Abstract
Tileiras emits a callback ABI that lets a runtime shim discover and patch TileIR launch hooks by symbol name. The ABI is not debug logging — it is a compile-time-inserted module table, a pre-load trampoline, a per-function callback table, and an argument-change callback the runtime can invoke when kernel launch arguments or TMA descriptors change.
The public symbol set is:
| Symbol | Role |
|---|---|
| __CUDA_TILEIR_CALLBACKS | Module-level ABI entry vector. |
| __CUDA_TILEIR_ON_PRE_LOAD | Compiler-emitted function returning the module callback vector. |
| __CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD | Weak runtime-patched pre-load callback slot. |
| __CUDA_TILEIR_FUNC_CALLBACKS | Per-function callback table, fixed 64 bytes. |
| __CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE | Per-function hook called when launch arguments are updated. |
Global linkage tuples
Every __CUDA_TILEIR_* global is created with an (isConstant, linkage) tuple at definition time. The driver
loader inspects those tuples to decide which symbols it may claim and which slots are mandatory. Linkage codes are
LLVM's: 0 denotes External (strong), 7 denotes Weak / WeakODR (optional override).
| Global | Size | (isConstant, linkage) | Notes |
|---|---|---|---|
| __CUDA_TILEIR_CALLBACKS | 9 × i64 = 72 B | (1, 0): const, External | Top-level callback table. |
| __CUDA_TILEIR_FUNC_CALLBACKS | 8 × i64 = 64 B (fixed) | (1, 0): const, External | Per-function callback table. Fixed 64 B, one global per kernel. |
| __CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD | 7 × i64 = 56 B | (0, 7): mutable, Weak/WeakODR | Optional pre-load hook table. |
The 64-byte size of __CUDA_TILEIR_FUNC_CALLBACKS is fixed at the type level. The pass emits one such global per
kernel function in the module, and the struct shape inside each global is constant. An older note that called
the layout variable-size with one slot per emitted func-callback is wrong; the struct shape is pinned and
populated by exactly eight insertvalue operations at indices 0..7.
Callback vector ABI
The module-level __CUDA_TILEIR_CALLBACKS global is the entry vector: a constant external object with nine
64-bit slots, 72 bytes in total. The current ABI revision is 0x40 (decimal 64). The
binary stores it as the immediate 0x40, so references to revision 64 and to revision 0x40 name the same
value.
| Slot | u64 | Semantic |
|---|---|---|
| 0 | __cuda_tileir_init fn pointer | Called on dlopen. |
| 1 | __cuda_tileir_fini fn pointer | Called on dlclose. |
| 2 | __cuda_tileir_compile_begin fn pointer | |
| 3 | __cuda_tileir_compile_end fn pointer | |
| 4 | ABI_REVISION = 0x40 | Verified at load. |
| 5 | reserved (zero) | |
| 6 | MUL multiplier A | NOT a flag — see correction below. |
| 7 | MUL multiplier B | NOT a flag — see correction below. |
| 8 | sentinel | Always zero; marks end of table. |
typedef struct TileirCallbackVector {
uint64_t init_fn;
uint64_t fini_fn;
uint64_t compile_begin_fn;
uint64_t compile_end_fn;
uint64_t abi_revision; /* = 0x40 */
uint64_t reserved_zero;
uint64_t mul_multiplier_a;
uint64_t mul_multiplier_b;
uint64_t sentinel_zero;
} TileirCallbackVector;
Slots 6 and 7 are MUL multipliers, not flags
Earlier wiki revisions treated slots 6 and 7 as OR-able bit flags. They are not. The body emitter at
sub_870430 multiplies a runtime counter by the slot value and writes the product back into the lowered IR. The
multiplication routes through sub_868170, the llvm.mul helper. In the generated IR the operation appears as
%v = mul i64 %counter, <slot value>
with <slot value> taken verbatim from slot 6 or slot 7. Treating these slots as OR-combined flag bits would
silently miscompile programs that set more than one: a | b and a * b agree only in degenerate cases (both
operands zero, or one operand equal to 1 and the other odd). Reimplementations must preserve the
multiplicative semantics.
Body emission
Three sub-emitters cooperate to produce the callback objects:
- sub_8689C0 (4 293 B) emits the per-revision constant data block: the nine i64 slots for revision 0x40, including the addresses of __CUDA_TILEIR_FUNC_CALLBACKS and __CUDA_TILEIR_ON_PRE_LOAD.
- sub_86DAD0 (~22 KB) emits the full callback dispatch trampoline including the type converter and the nvvm.kernel attribute lift. The related cute.kernel -> nvvm.kernel rename is performed by the downstream D08 pattern CuteKernelToNvvmRewrite at sub_1698C20, documented in TileAS to LLVM Lowering; it is not part of the callback ABI itself.
- sub_870430 is the body emitter that wires the MUL multipliers from slots 6 and 7 into the dispatch path via sub_868170 (the llvm.mul helper).
ON_PRE_LOAD trampoline
At each TileIR launch site, the compiler can emit a pre-load callback call. The runtime-patched symbol is weak and null-guarded, so binaries remain executable even when no runtime shim has installed a callback.
void maybe_call_on_pre_load(void *arg_desc, int64_t sm_num, void *tma_arena) {
TileirOnPreLoadSlot *slot = &__CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD;
if (slot == NULL || slot->fn == NULL)
return;
slot->fn(arg_desc, sm_num, tma_arena);
}
The argument descriptor is an 11-slot, 88-byte block. The callback receives the descriptor pointer, the sign-extended SM number, and a pointer to the TMA descriptor arena.
Callback signatures
| Callback | C signature | Calling convention | Who emits | Who calls |
|---|---|---|---|---|
| __CUDA_TILEIR_ON_PRE_LOAD | TileirCallbackVector (*)(void) | cdecl, no args | compiler | driver at module load |
| (*__CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD.fn) | void (*)(void *arg_desc, int64_t sm_num, void *tma_arena) | cdecl, 3 args | runtime slot | compiled launch sites |
| __CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE | int32_t (*)(void *cookie, void *arg_buf, void *tma_arena, <kernel args...>) | cdecl, 3+N args | compiler | runtime on argument-buffer change |
__CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE returns i32, not void. Its first three arguments are pointers — a
runtime cookie, an argument-buffer pointer, a TMA-arena pointer. The user kernel arguments follow that prefix in
their lowered kernel ABI order.
TMA descriptor shape
Each TMA argument occupies 64 bytes — eight 64-bit words. That shape matches the SM90+ tensor-map descriptor used
by cp.async.bulk.tensor. Debug printing splits the descriptor into two four-word rows for readability, but the
storage object is one contiguous 64-byte descriptor.
typedef struct TileirTmaDescriptor {
uint64_t word[8];
} TileirTmaDescriptor;
typedef struct TileirTmaArena {
uint64_t header[2];
TileirTmaDescriptor descriptor[8];
} TileirTmaArena;
Device-side descriptor storage must be 64-byte aligned. The compiler relies on the runtime to provide a correctly aligned arena. Non-SM90 targets take the zero path and do not require a populated TMA arena.
Version and backwards-compatibility handling
Revision tracking lives in slot 4 of the module vector. The compiler writes revision 0x40 (= 64 decimal)
unconditionally. The runtime checks both the sentinel zero in slot 8 and the revision in slot 4 before installing
hooks. No versioned alias symbols like __CUDA_TILEIR_CALLBACKS_v1 exist, so every compatibility check has to go
through the vector contents.
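The load-time check described above can be sketched against the vector layout. The struct repeats the layout given earlier on this page so the block stands alone; tileir_vector_accepted and TILEIR_ABI_REVISION are hypothetical names, and the actual runtime check is not shown in the source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TileirCallbackVector {
    uint64_t init_fn;
    uint64_t fini_fn;
    uint64_t compile_begin_fn;
    uint64_t compile_end_fn;
    uint64_t abi_revision;      /* slot 4 */
    uint64_t reserved_zero;
    uint64_t mul_multiplier_a;
    uint64_t mul_multiplier_b;
    uint64_t sentinel_zero;     /* slot 8 */
} TileirCallbackVector;

_Static_assert(sizeof(TileirCallbackVector) == 72, "nine 64-bit slots");

enum { TILEIR_ABI_REVISION = 0x40 };

/* Accept the vector only if the revision matches and the sentinel is zero. */
bool tileir_vector_accepted(const TileirCallbackVector *v) {
    if (v == NULL)
        return false;
    return v->abi_revision == TILEIR_ABI_REVISION && v->sentinel_zero == 0;
}
```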
The weak runtime-patched pre-load slot and the null guard form the primary backwards-compatibility mechanism.
Older runtimes can leave the slot unresolved or null, and the compiled launch site simply skips the callback. The
per-function callback table is (1, 0) const+External and fixed at 64 bytes; absent specialized hooks, the
argument-change hook is the default callback target.
Related pages
Host Launch ABI and ptxas Knobs
covers the kernel-attribute side of launch metadata that the per-function
callback table sits alongside;
TileAS to LLVM Lowering
documents the downstream CuteKernelToNvvmRewrite pass that finalises the
nvvm.kernel attribute the callback ABI references;
Driver Overview frames where callback emission sits
in the larger compile sequence.
MLIR Bytecode Format
Abstract
tileiras consumes a private TileIR bytecode container for cuda_tile modules. The format borrows MLIR bytecode's spirit — unsigned LEB128 integers, section-local offset tables, indexed cross-section references, self-contained attribute payloads — but it is not upstream MLIR bytecode. The magic, version block, section enum, type tags, attribute tags, and cuda_tile opcode space are all private to this binary.
Wire-format divergence vs upstream MLIR. The shipped AttrTag numbering inside sub_59F100 breaks wire compatibility with upstream mlir/Bytecode/BytecodeEnums.h::AttributeTag. Tag 1 is StringAttr here (upstream IntegerAttr=1), tag 4 is DenseElementsAttr (upstream TypeAttr=4), tag 5 is DenseElementsAttr<string> (upstream StringAttr=5), and tag 13 is AssumePredicateAttr — a slot upstream does not define at all. Bytecode emitted by stock MLIR with stock numbering decodes wrong through this binary; bytecode emitted by this binary cannot be consumed by stock MLIR. Any external tool that needs to round-trip through both must freeze the tileiras tag assignments rather than the upstream header. The full 13-tag table sits below in Self-Contained Attribute Dispatch. At the envelope level, the partition signal is magic byte 7 — 0x00 here, '\n' upstream.
Reader contract. The bytecode reader is an in→out transform: a byte buffer goes in, a builtin.module containing a cuda_tile body comes out. Failure paths return nullptr/false with one of the verbatim diagnostics enumerated in the dispatcher sections below.
ModuleOp readTileIRBytecode(ByteSpan input, MLIRContext *ctx) {
if (!sub_5838A0_validate_header(input)) // magic + version + dialect-list + blob preamble
return nullptr; // see "Header Parser (sub_5838A0)" below
SectionSpans spans = scan_section_table(input);
StringTable strings = read_string_section(spans.string);
TypeTable types = read_type_section(spans.type, ctx);
ConstantTable constants = read_constant_section(spans.constant, ctx); // routes payloads through sub_59F100
DebugTable debug = read_debug_section(spans.debug, ctx); // routes payloads through sub_589B90
ModuleOp module = create_builtin_module(ctx);
read_globals(spans.global, module, strings, types, constants);
read_functions(spans.func, module, strings, types, constants, debug); // bodies invoke sub_5B13D0 per op
return module;
}
The rest of the page walks the container at reimplementation depth: file envelope, the six payload sections, the inter-section reference model, validation behavior, and the separate LLVM bitcode path used when tileiras hands an NVVM module to libNVVM.
Overall File Format
The container starts with an 8-byte magic. The first three bytes are the upstream MLIR-bytecode framing prefix (0x06 0x03 0x80); bytes 3–6 spell "Tile"; byte 7 is 0x00 — the tileiras / upstream MLIR split (upstream writes "\nMLIR" starting at byte 7):
06 03 80 54 69 6c 65 00 // MLIR-bc framing + "Tile" + 0x00 terminator
The trailing zero is a sentinel byte inside the magic, not a C-string terminator. Three unsigned-LEB128 VarInts follow — major, minor, optional patch — encoding the Tile version triple. The CUDA 13.1 reader accepts Tile version 13.1.x for any patch value. A mismatched major or minor produces an unsupported Tile version ... diagnostic rather than falling back to upstream MLIR bytecode parsing. See Header Parser (sub_5838A0) for the full magic and version table.
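The envelope check can be sketched in a few lines. has_tile_magic and tile_version_supported are hypothetical names; the sketch covers only the magic comparison and the major/minor acceptance rule stated above, and leaves decoding the version VarInts to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 06 03 80, then "Tile", then the 0x00 sentinel byte. */
static const uint8_t TILE_MAGIC[8] = {0x06, 0x03, 0x80, 'T', 'i', 'l', 'e', 0x00};

bool has_tile_magic(const uint8_t *buf, size_t len) {
    return len >= sizeof TILE_MAGIC &&
           memcmp(buf, TILE_MAGIC, sizeof TILE_MAGIC) == 0;
}

/* The CUDA 13.1 reader accepts 13.1.x for any patch value. */
bool tile_version_supported(uint64_t major, uint64_t minor) {
    return major == 13 && minor == 1;
}
```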
After the version block comes a sequence of sections. Each section starts with one ID byte: the low seven bits are the section ID, the high bit signals an alignment field in the header. A zero ID marks end-of-bytecode.
typedef struct SectionHeader {
uint8_t id_and_alignment_flag;
uleb128 length;
optional<uleb128> alignment;
uint8_t padding[];
uint8_t payload[];
} SectionHeader;
length covers the optional alignment field, padding, and payload. Padding bytes are 0xcf. The end marker carries no length, alignment, padding, or payload, and its high bit must be clear.
The reader scans section headers first and records the byte span for each section, then decodes bodies in a second pass. Physical section order is therefore flexible — required sections must exist and all offsets must land inside captured spans, but the order on disk is the producer's choice.
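The first pass of that two-pass scan can be sketched in C. This is a minimal sketch, not the reader's actual code: `Span`, `scan_sections`, and `read_varint_compact` are hypothetical names, the VarInt helper follows the rule given under VarInt Encoding without its canonical-form check, and returning -1 when no end marker is found is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_SECTION_ID = 0x7f };

typedef struct { size_t begin, length; int present; } Span;

/* Compact VarInt reader (no canonical-form check).
 * Returns the number of bytes consumed, or 0 if the input is truncated. */
size_t read_varint_compact(const uint8_t *p, size_t avail, uint64_t *out) {
    if (avail == 0) return 0;
    unsigned extra = 0;                          /* run of low-order 1-bits */
    while (extra < 8 && ((p[0] >> extra) & 1u)) ++extra;
    if (avail < (size_t)extra + 1) return 0;
    uint64_t v = extra < 8 ? (uint64_t)p[0] >> (extra + 1) : 0;
    unsigned shift = extra < 8 ? 7 - extra : 0;
    for (unsigned i = 1; i <= extra; ++i, shift += 8)
        v |= (uint64_t)p[i] << shift;
    *out = v;
    return (size_t)extra + 1;
}

/* Pass 1: record the byte span of every section, in whatever order they
 * appear on disk. Returns 0 on success, -1 on a malformed section stream. */
int scan_sections(const uint8_t *buf, size_t len, size_t start,
                  Span spans[MAX_SECTION_ID + 1]) {
    assert(buf != NULL && spans != NULL);
    size_t cur = start;
    while (cur < len) {
        uint8_t head = buf[cur++];
        uint8_t id = head & 0x7f;                /* low seven bits: section ID */
        if (id == 0)                             /* end marker: high bit must be clear */
            return (head & 0x80) ? -1 : 0;
        uint64_t length;
        size_t used = read_varint_compact(buf + cur, len - cur, &length);
        if (used == 0) return -1;
        cur += used;
        if (length > len - cur) return -1;       /* span must stay inside the buffer */
        spans[id] = (Span){cur, (size_t)length, 1};
        cur += (size_t)length;                   /* skip alignment field, padding, payload */
    }
    return -1;                                   /* assumed: a missing end marker is an error */
}
```

Because length already covers the alignment field and the padding, the first pass never has to interpret them; the body decoders strip them in the second pass.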
VarInt Encoding
Every multi-byte integer in the container — section length, offset, type reference, operand index, opcode — uses the same unsigned LEB128 variant as upstream MLIR bytecode. The encoding rule is a leading-byte trick: the number of low-order 1-bits in the first byte indicates how many additional bytes follow, the bits above the run-of-ones in that byte form the low bits of the integer, and the subsequent bytes contribute eight bits each in little-endian order.
| First byte | Total bytes | Payload bits | Value range |
|---|---|---|---|
| xxxxxxx0 | 1 | 7 | 0..127 |
| xxxxxx01 | 2 | 14 | 0..16383 |
| xxxxx011 | 3 | 21 | 0..2097151 |
| xxxx0111 | 4 | 28 | up to 2^28 - 1 |
| xxx01111 | 5 | 35 | up to 2^35 - 1 |
| xx011111 | 6 | 42 | up to 2^42 - 1 |
| x0111111 | 7 | 49 | up to 2^49 - 1 |
| 01111111 | 8 | 56 | up to 2^56 - 1 |
| 11111111 | 9 | 64 | up to 2^64 - 1 (all 8 trailing bytes form the integer) |
The decoder counts the low-order 1-bits of the leading byte, masks off that run together with the 0 bit that terminates it, and shifts the surviving high bits down to the bottom of the decoded integer; the remaining bytes are appended above those bits. A 10-byte encoding is rejected as overlong, because the canonical 9-byte form already covers the entire 64-bit range. Signed payloads (the location_index slot in particular) wrap the unsigned VarInt in zig-zag: (n << 1) ^ (n >> 63) going out, (u >> 1) ^ -(u & 1) coming back.
Three concrete decoded examples make the bit pattern unambiguous:
0x0a → first byte 0000_1010, low bit clear → 1-byte form,
            payload = 0x0a >> 1 = 5
0x01 0x02 → first byte 0000_0001, trailing "01" → 2-byte form,
payload bits = (byte0 >> 2) | (byte1 << 6)
= (0x01 >> 2) | (0x02 << 6)
= 0 | 0x80 = 128
0xfb 0xff 0x0f → first byte 1111_1011, trailing "011" → 3-byte form,
payload bits = (byte0 >> 3) | (byte1 << 5) | (byte2 << 13)
= (0xfb >> 3) | (0xff << 5) | (0x0f << 13)
= 0x1f | 0x1fe0 | 0x1e000 = 0x1ffff = 131071
Producers must always emit the canonical (shortest) encoding for a given integer. An overlong but otherwise well-formed encoding would decode to the same integer, but the reader flags it as "non-canonical VarInt" and rejects the section.
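The decoding rule, the shortest-form check, and the zig-zag wrapper can be exercised with a small C sketch. It follows this page's description only: `varint_decode`, `VarIntStatus`, `zigzag_encode`, and `zigzag_decode` are hypothetical names, and the shortest-form test (a form with k extra bytes must not fit in 7k bits) is inferred from the table rather than taken from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { VARINT_OK, VARINT_TRUNCATED, VARINT_NON_CANONICAL } VarIntStatus;

/* Decode one VarInt from buf[0..len). On VARINT_OK, *value holds the integer
 * and *used the number of bytes consumed. */
VarIntStatus varint_decode(const uint8_t *buf, size_t len,
                           uint64_t *value, size_t *used) {
    assert(buf != NULL && value != NULL && used != NULL);
    if (len == 0) return VARINT_TRUNCATED;

    /* The run of low-order 1-bits in the first byte gives the extra byte count. */
    unsigned extra = 0;
    while (extra < 8 && ((buf[0] >> extra) & 1u)) ++extra;
    if (len < (size_t)extra + 1) return VARINT_TRUNCATED;

    /* Bits above the run and its terminating 0 are the low payload bits. */
    uint64_t v = 0;
    unsigned shift = 0;
    if (extra < 8) {
        v = (uint64_t)buf[0] >> (extra + 1);
        shift = 7 - extra;
    }
    for (unsigned i = 1; i <= extra; ++i, shift += 8)
        v |= (uint64_t)buf[i] << shift;          /* little-endian continuation bytes */

    /* Inferred shortest-form rule: k extra bytes must not fit in 7*k bits. */
    if (extra > 0 && (v >> (7 * extra)) == 0) return VARINT_NON_CANONICAL;

    *value = v;
    *used = (size_t)extra + 1;
    return VARINT_OK;
}

/* Zig-zag mapping for signed payloads, written without relying on the
 * implementation-defined right shift of a negative value. */
uint64_t zigzag_encode(int64_t n) {
    return ((uint64_t)n << 1) ^ (n < 0 ? UINT64_MAX : 0);
}

int64_t zigzag_decode(uint64_t u) {
    return (int64_t)((u >> 1) ^ (0 - (u & 1)));
}
```

Feeding the three byte sequences from the examples through `varint_decode` yields 5, 128, and 131071, while the two-byte sequence `01 00` (an overlong zero) comes back as `VARINT_NON_CANONICAL`.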
Section Walker Algorithm
Once sub_5838A0 has accepted the envelope and built the blob-section descriptor array, the top-level driver invokes the section walker. The walker is not a recursive parser — it is a fixed, dependency-ordered dispatch over the descriptor array. Earlier sections build the lookup tables that later sections cross-reference, so the order is required even though the on-disk order is the producer's choice.
⚡ QUIRK — on-disk section order is free, walker order is fixed The bytecode envelope lets the producer write sections in any order, but the reader's
walk_sections always dispatches them in the fixed dependency-ordered sequence (STRING → TYPE → ATTRIBUTE → IR → optional RESOURCE/DEBUG). Two byte-different files can therefore decode to identical IR, and the reader uses descriptor-array lookup rather than position to find each section. A reimplementation that follows on-disk order to drive parsing will read references against half-built tables and emit unknown <kind> errors for files the official reader accepts.
ParseResult walk_sections(BytecodeReader *r, const BlobSectionDesc *desc, size_t n_desc,
MLIRContext *ctx, ModuleOp *out_module) {
SectionSpans spans = {0};
for (size_t i = 0; i < n_desc; ++i)
spans.by_kind[desc[i].section_kind] = (Span){desc[i].offset, desc[i].length};
/* Sections are decoded in dependency order, whatever their on-disk order:
* 1. STRING section (referenced by every later section)
* 2. DIALECT section (declares the dialects this bytecode uses; implicit here:
* the dialect list lives in the envelope, not in its own section)
* 3. TYPE section (references STRING)
* 4. ATTRIBUTE / CONSTANT section (references TYPE)
* 5. IR section: FUNC + GLOBAL records (references all of the above)
* 6. RESOURCE section (optional)
* 7. DEBUG section (optional, references the IR section via location slots)
*/
StringTable strings = read_string_section (r, spans.by_kind[SEC_STRING]);
TypeTable types = read_type_section (r, spans.by_kind[SEC_TYPE], ctx, strings);
ConstantTable constants = read_constant_section (r, spans.by_kind[SEC_CONSTANT], ctx, strings, types);
DebugTable debug = spans.by_kind[SEC_DEBUG].length
? read_debug_section (r, spans.by_kind[SEC_DEBUG], ctx, strings, types)
: empty_debug_table();
ModuleOp module = create_builtin_module(ctx);
read_globals (r, spans.by_kind[SEC_GLOBAL], module, strings, types, constants);
read_functions(r, spans.by_kind[SEC_FUNC], module, strings, types, constants, debug);
*out_module = module;
return success();
}
The walker traverses each section by seeking the reader cursor to span.begin, decoding records until the cursor reaches span.begin + span.length, then asserting that no bytes were left over. A short read inside a section emits a per-section truncation diagnostic; a long read — cursor past span.length after the final record — emits a section-overflow diagnostic. Both fire before any cross-section index from that section is exposed to the next reader, so a corrupt section never poisons downstream lookups with a half-populated table.
The walker trusts each descriptor's offset and length pair to describe disjoint sections; it does not check for overlapping spans itself. The validation that prevents two sections from claiming the same bytes happens earlier, inside sub_5838A0's preamble loop, where each new descriptor's [offset, offset+length) range is checked against the union of previously accepted ranges.
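The disjointness check can be sketched as a pairwise comparison against every previously accepted range, which is equivalent to testing against their union. How the binary actually represents that union is not described here, so the quadratic loop, the bounds check, and the names `Range`, `ranges_overlap`, and `first_bad_descriptor` are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t offset, length; } Range;

/* Returns 1 if the half-open ranges share at least one byte.
 * Callers must pass ranges whose offset + length does not overflow. */
int ranges_overlap(Range a, Range b) {
    if (a.length == 0 || b.length == 0) return 0;   /* empty ranges claim no bytes */
    return a.offset < b.offset + b.length && b.offset < a.offset + a.length;
}

/* Returns the index of the first descriptor that falls outside the buffer or
 * overlaps an earlier accepted one, or -1 if every descriptor is acceptable. */
long first_bad_descriptor(const Range *r, size_t n, uint64_t buf_len) {
    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        /* Bounds check first; it also guarantees offset + length cannot wrap. */
        if (r[i].offset > buf_len || r[i].length > buf_len - r[i].offset)
            return (long)i;
        for (size_t j = 0; j < i; ++j)
            if (ranges_overlap(r[i], r[j])) return (long)i;
    }
    return -1;
}
```

Running this once over the descriptor array before any body is decoded is what lets the walker skip overlap checks later.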
Header Parser (sub_5838A0)
sub_5838A0 parses the bytecode header. It is invoked from sub_57FF40 once the top-level reader confirms the input buffer starts with the tileiras magic, and it validates the magic prefix, the Tile version field, and the dialect-list / blob-section preamble before handing control to the per-section sub-readers. Everything downstream — section dispatch, attribute decoding, opcode decoding, debug decoding — assumes the header parser has already consumed and accepted this envelope.
Magic prefix. Every TileIR bytecode container opens with an 8-byte magic prefix, compared byte-for-byte against the static literal at rodata 0x45EBF08. The bytes flow through sub_57F420, the bounded-read helper that refuses to walk past the end of the input buffer.
| Offset | Byte | Meaning |
|---|---|---|
| 0 | 0x06 | MAGIC_LEN_HI — MLIR-bytecode-compatible framing |
| 1 | 0x03 | MAGIC_LEN_LO — MLIR-bytecode-compatible framing |
| 2 | 0x80 | MAGIC_FLAGS — MLIR-bytecode-compatible framing |
| 3 | 'T' (0x54) | tileiras dialect tag, byte 1 |
| 4 | 'i' (0x69) | tileiras dialect tag, byte 2 |
| 5 | 'l' (0x6C) | tileiras dialect tag, byte 3 |
| 6 | 'e' (0x65) | tileiras dialect tag, byte 4 |
| 7 | 0x00 | tileiras terminator (upstream MLIR writes "\nMLIR" instead) |
The trailing 0x00 is the byte that separates tileiras-MLIR-bytecode from upstream MLIR-bytecode at the magic level. Upstream MLIR fills the same slot with "\nMLIR", so the shared MLIR-bytecode framing prefix combined with the tileiras-specific terminator cleanly partitions the two dialects. On mismatch the parser emits "invalid magic number at position " concatenated with the buffer offset, then ", got " concatenated with the offending byte sequence, then " expected " concatenated with the expected literal — three verbatim fragments joined into one diagnostic.
Tile version table. Three VarInts follow the magic and encode the Tile version triple. The minimum and maximum supported versions live in a static table at rodata 0x45EBF10:
static const TileVersion supported_versions[] = {
/*min:*/ { .major = 13, .minor = 1, .patch = 0 }, // inclusive
/*max:*/ { .major = 13, .minor = 1, .patch = UINT32_MAX }, // inclusive (only 13.1.x)
};
Major and minor VarInts are mandatory. The patch VarInt is optional and defaults to zero when absent — the parser reads it for forward compatibility but never gates on its value. The version-range predicates are:
| Predicate | Decision |
|---|---|
| major < 13 | reject |
| major == 13 && minor < 1 | reject |
| major > 13 | reject |
| major == 13 && minor > 1 | reject |
| major == 13 && minor == 1 | accept (any patch) |
Only 13.1.x decodes. On reject, the parser emits "unsupported Tile version " followed by the parsed version (formatted via "{0}.{1}.{2}" when the patch field was present, or "{0}.{1}" when absent), then "this reader supports versions [" followed by the rendered min..max range. The two-string format mirrors the patch-optional encoding: a producer that omits the patch VarInt sees its rejection echoed back without a synthetic .0 suffix.
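A minimal C sketch of the acceptance predicate and the patch-aware version rendering follows. The `has_patch` field, the struct name `TileVersionSketch`, and both function names are hypothetical additions for illustration; the sketch renders only the version number and does not reproduce the rest of the diagnostic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t major, minor, patch;
    int has_patch;                /* hypothetical: records whether a patch VarInt was read */
} TileVersionSketch;

/* The five table rows collapse to a single test: only 13.1.x is accepted. */
int version_supported(const TileVersionSketch *v) {
    assert(v != NULL);
    return v->major == 13 && v->minor == 1;
}

/* Renders "{0}.{1}.{2}" when the patch field was present, "{0}.{1}" otherwise.
 * Returns what snprintf returns: the length the full text needs. */
int format_version(const TileVersionSketch *v, char *out, size_t cap) {
    assert(v != NULL && out != NULL);
    if (v->has_patch)
        return snprintf(out, cap, "%u.%u.%u", (unsigned)v->major,
                        (unsigned)v->minor, (unsigned)v->patch);
    return snprintf(out, cap, "%u.%u", (unsigned)v->major, (unsigned)v->minor);
}
```

Keeping the presence flag next to the patch value is one way to echo a rejected version back without a synthetic .0 suffix.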
Dialect list and blob preamble. A VarInt dialect-count comes next, then for each dialect a VarInt length-prefixed string followed by a VarInt op-count. Each dialect string is checked against the reader's registered dialect set; unknown dialects emit "unregistered dialect: " concatenated with the offending name. The blob-section preamble that follows is a fixed 24-byte record per section:
typedef struct BlobSectionDesc {
uint8_t section_kind; // one of the seven section IDs listed under Per-Section Grammar
uint8_t pad[3]; // alignment padding
uint64_t offset; // byte offset from the start of the buffer
uint64_t length; // payload length in bytes
uint32_t alignment; // required alignment of the payload base
} BlobSectionDesc;
The descriptor array lets the reader build the section-span table without scanning section headers in order: every section's location is known once the preamble is consumed, so the per-section sub-readers can run in any order the driver finds convenient.
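One caveat for reimplementers: laid out with natural C alignment on common 64-bit ABIs, the struct above would occupy 32 bytes, because the uint64_t offset field would be aligned to 8. A 24-byte record therefore implies a packed layout. The sketch below sidesteps the layout question by decoding at explicit byte offsets. It assumes little-endian fields, which this page does not state, and `BlobDescFields`, `load_le`, and `read_blob_desc` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOB_DESC_SIZE = 24 };     /* wire size of one descriptor record */

typedef struct {                  /* in-memory form; need not match the wire layout */
    uint8_t  section_kind;
    uint64_t offset;
    uint64_t length;
    uint32_t alignment;
} BlobDescFields;

/* Assemble an nbytes-wide little-endian integer (assumed byte order). */
uint64_t load_le(const uint8_t *p, unsigned nbytes) {
    uint64_t v = 0;
    for (unsigned i = 0; i < nbytes; ++i)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Decode record `index` of a packed descriptor array.
 * Returns 0 on success, -1 if the record would run past the buffer. */
int read_blob_desc(const uint8_t *buf, size_t len, size_t index,
                   BlobDescFields *out) {
    assert(buf != NULL && out != NULL);
    if (index >= len / BLOB_DESC_SIZE) return -1;
    const uint8_t *p = buf + index * BLOB_DESC_SIZE;
    out->section_kind = p[0];                    /* bytes 1..3 are padding */
    out->offset       = load_le(p + 4, 8);       /* bytes 4..11 */
    out->length       = load_le(p + 12, 8);      /* bytes 12..19 */
    out->alignment    = (uint32_t)load_le(p + 20, 4);   /* bytes 20..23 */
    return 0;
}
```

Decoding field by field also avoids unaligned 64-bit loads, which a direct cast of the buffer to a packed struct would perform.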
Section alignment / ordering invariants. Two structural invariants carry verbatim diagnostics. The end-of-bytecode marker section must have alignment == 0 — otherwise the parser emits "end section should not have alignment flag set". The end marker must also be the last section in the preamble; otherwise the parser emits "end section is not the last section". Both checks fire before any per-section body is decoded, so a malformed envelope is rejected before the reader commits a single byte of section-payload allocation.
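The two invariants can be expressed as a single pass over the descriptor array. In this sketch the diagnostic strings are the verbatim ones quoted above, while the check order, the treatment of a missing end marker (not flagged here), and the names `DescSummary` and `check_end_marker` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  section_kind;        /* 0 marks the end-of-bytecode section */
    uint32_t alignment;
} DescSummary;

/* Returns NULL when both invariants hold, otherwise the diagnostic text. */
const char *check_end_marker(const DescSummary *d, size_t n) {
    assert(d != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (d[i].section_kind != 0) continue;    /* only the end marker is checked */
        if (d[i].alignment != 0)
            return "end section should not have alignment flag set";
        if (i + 1 != n)
            return "end section is not the last section";
    }
    return NULL;
}
```

Returning the message rather than printing it keeps the check free of side effects, so a driver can run it before allocating anything for section payloads.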
Pseudocode shape. The full parser is large, but its skeleton is a magic-and-version prefix followed by the dialect/blob loop. The prefix half looks like:
LogicalResult parseBytecodeHeader(Reader *r, TileVersion *out_ver) {
uint8_t magic[8];
if (!readBytes(r, magic, 8)) return emit("failed to read magic"), failure();
if (memcmp(magic, kMagicLit, 8) != 0) return emitMagicMismatch(magic), failure();
TileVersion v;
if (!readVarInt(r, &v.major)) return emit("failed to read major"), failure();
if (!readVarInt(r, &v.minor)) return emit("failed to read minor"), failure();
if (!readOptionalVarInt(r, &v.patch)) v.patch = 0;
if (!versionInRange(v)) return emitVersionReject(v), failure();
*out_ver = v;
return success();
}
The dialect-list and blob-preamble loops follow the same shape: every VarInt read pairs with a verbatim diagnostic, every structural invariant is checked before the next field is consumed, and the function returns failure() on the first mismatch so the surrounding driver surfaces a single precise error rather than a cascade.
Per-Section Grammar
The seven section IDs are:
| ID | Name | Required | Body layout | Reference width | Outbound references |
|---|---|---|---|---|---|
| 0x00 | EndOfBytecode | yes | Single zero byte; must be last. | n/a | none |
| 0x01 | String | yes | numStrings, aligned u32 stringOffsets[], UTF-8 string data. Strings are not NUL-terminated. | u32 offsets | none |
| 0x02 | Func | yes | Sequential function records, each with name, signature, flags, location, optional optimization hints, body length, and body bytes. | sequential | String, Type, Debug, Constant |
| 0x03 | Debug | no | Parallel debug-op, debug-index, and debug-attribute tables plus debug payload data. | u32 and u64 offsets | String, Type, Debug |
| 0x04 | Constant | no | numConstants, aligned u64 constantOffsets[], and self-contained attribute payloads. | u64 offsets | String, Type, Constant, Debug |
| 0x05 | Type | no | numTypes, aligned u32 typeOffsets[], and type-tag payloads. | u32 offsets | Type |
| 0x06 | Global | no | Sequential global records: symbol name, type, constant initializer, alignment. | sequential | String, Type, Constant |
Function records are the richest records in the container:
typedef struct FunctionRecord {
uleb128 name_string_index;
uleb128 function_type_index;
uint8_t flags;
uleb128 location_index;
optional<Attribute> optimization_hints;
uleb128 body_length;
uint8_t body[body_length];
} FunctionRecord;
The low three flag bits encode visibility, entry/kernel kind, and whether optimization_hints is present. Function bodies contain regions, blocks, block arguments, and operation records. Each operation record starts with an opcode and a location reference, followed by opcode-specific operands, attributes, regions, and result type references.
typedef struct RegionBody {
uleb128 block_count;
Block blocks[block_count];
} RegionBody;
typedef struct Block {
uleb128 arg_count;
uleb128 arg_type_index[arg_count];
uleb128 op_count;
OperationRecord ops[op_count];
} Block;
typedef struct OperationRecord {
uleb128 opcode; // cuda_tile opcode, 0..109 in CUDA 13.1
sleb128 location_index; // -1 means unknown location
uint8_t payload[]; // decoded by the opcode-specific reader
} OperationRecord;
Cross-section references are width-sensitive. String and Type offset tables use u32 slots; Constant offsets and some Debug cross-reference tables widen to u64 so large dense payloads remain addressable. Function and Global records are sequential rather than table-indexed.
VarInts follow the rules in VarInt Encoding: the longest valid form is 9 bytes, and overlong encodings are rejected. Signed integer payloads use zig-zag decoding; APInt multiword payloads apply zig-zag per word.
Type Tag Dispatch
sub_59C710 decodes the Type-section payload table. The routine is 6 474 bytes long and switches on a one-byte TypeTag at 0x59C79D; it is the type-side sibling of the Op, Attr, and Debug dispatchers covered later. Each type-table entry triggers one call from sub_58C0C0 (the cached Type-by-index lookup), itself reached via sub_58C400 whenever a downstream consumer needs to resolve a type reference. The TypeTag numbering is independent of upstream MLIR's BytecodeTypeOpcodes.td and uses the dense 0..18 assignment shown below.
Type records start with a TypeTag followed by a tag-specific payload:
| Tag | Type |
|---|---|
| 0..4 | i1, i8, i16, i32, i64 |
| 5..11 | f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2 |
| 12 | Pointer type |
| 13 | Tile type |
| 14 | Tensor-view type |
| 15 | Partition-view type |
| 16 | Function type |
| 17 | Token type |
| 18 | f8E8M0FNU extension, reachable through the extensible registered-type path |
Tile types carry an element type and shape. Tensor views add strides. Partition views point at a tensor-view type, then add tile shape, dimension map, and partition mode. Function types carry parameter and result type-index vectors. The per-tag operand layout in wire order is:
| Tag | Operands read in order | Total VarInts (worst case) |
|---|---|---|
| 0..11 | none (integer/float width fully determined by tag) | 0 |
| 12 Pointer | one type-ref | 1 |
| 13 Tile | one type-ref, then a read_i64_shape (count VarInt + count signed-LEB128 dims) | 2 + dim_count |
| 14 Tensor view | one type-ref, then a shape VarInt list, then a stride VarInt list | 3 + dim_count + stride_count |
| 15 Partition view | one type-ref, then a shape, then a dim-map list, then a partition-mode byte | 4 + dim_count + map_count |
| 16 Function | a type-ref list for inputs, a type-ref list for results | 2 + input_count + result_count |
| 17 Token | none | 0 |
| 18 f8E8M0FNU | extension prefix (registered-type path), then per-extension payload | varies |
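Type read_type(Reader *r) { // sub_59C710
    uint64_t tag = read_uleb128(r); // (1) one TypeTag VarInt
    // C does not fix the evaluation order of call arguments, so multi-field
    // payloads are read into locals first to pin the wire order. The local
    // type names (Shape, Strides, DimMap, PartitionMode, TypeList) are placeholders.
    switch (tag) {
    case TYPE_I1: case TYPE_I8: case TYPE_I16: case TYPE_I32: case TYPE_I64:
        return integer_type(width_for_tag(tag));
    case TYPE_F16: case TYPE_BF16: case TYPE_F32: case TYPE_TF32:
    case TYPE_F64: case TYPE_F8E4M3FN: case TYPE_F8E5M2:
        return float_type_for_tag(tag);
    case TYPE_POINTER:
        return pointer_type(read_type_ref(r));
    case TYPE_TILE: {
        Type elem = read_type_ref(r);
        Shape shape = read_i64_shape(r);
        return tile_type(elem, shape);
    }
    case TYPE_TENSOR_VIEW: {
        Type elem = read_type_ref(r);
        Shape shape = read_i64_shape(r);
        Strides strides = read_i64_strides(r);
        return tensor_view_type(elem, shape, strides);
    }
    case TYPE_PARTITION_VIEW: {
        Type view = read_type_ref(r);
        Shape tile_shape = read_shape(r);
        DimMap dim_map = read_dim_map(r);
        PartitionMode mode = read_partition_mode(r);
        return partition_view_type(view, tile_shape, dim_map, mode);
    }
    case TYPE_FUNCTION: {
        TypeList inputs = read_type_ref_list(r);
        TypeList results = read_type_ref_list(r);
        return function_type(inputs, results);
    }
    case TYPE_TOKEN:
        return token_type();
    default:
        return registered_extension_type(tag, r);
    }
}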
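The tag numbering and the operand-count column can be captured in a compilable form. This is a sketch under stated assumptions: the enum assigns tags 5..11 in the order the float types are listed in the table, `width_for_tag` is a guess at the behavior of the helper named in the pseudocode, and `payload_varints` is a hypothetical function that only restates the "Total VarInts" column.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TYPE_I1 = 0, TYPE_I8, TYPE_I16, TYPE_I32, TYPE_I64,           /* 0..4   */
    TYPE_F16, TYPE_BF16, TYPE_F32, TYPE_TF32, TYPE_F64,           /* 5..9   */
    TYPE_F8E4M3FN, TYPE_F8E5M2,                                   /* 10..11 */
    TYPE_POINTER, TYPE_TILE, TYPE_TENSOR_VIEW, TYPE_PARTITION_VIEW, /* 12..15 */
    TYPE_FUNCTION, TYPE_TOKEN, TYPE_F8E8M0FNU                     /* 16..18 */
} TypeTag;

/* Bit width for the integer tags 0..4; 0 for every other tag. */
unsigned width_for_tag(uint64_t tag) {
    static const unsigned widths[] = {1, 8, 16, 32, 64};
    return tag <= (uint64_t)TYPE_I64 ? widths[tag] : 0;
}

/* Worst-case number of payload fields after the tag, per the operand table.
 * n0 and n1 are the two list lengths the tag uses (dims/strides, dims/map,
 * inputs/results). Returns -1 when the payload is extension-defined. */
long payload_varints(uint64_t tag, size_t n0, size_t n1) {
    assert(n0 <= LONG_MAX / 4 && n1 <= LONG_MAX / 4);   /* keep the sums in range */
    switch (tag) {
    case TYPE_POINTER:        return 1;
    case TYPE_TILE:           return 2 + (long)n0;
    case TYPE_TENSOR_VIEW:    return 3 + (long)n0 + (long)n1;
    case TYPE_PARTITION_VIEW: return 4 + (long)n0 + (long)n1;  /* includes the mode byte */
    case TYPE_FUNCTION:       return 2 + (long)n0 + (long)n1;
    default:                  return tag <= (uint64_t)TYPE_TOKEN ? 0 : -1;
    }
}
```

For instance, a rank-2 tensor view with two strides needs at most 3 + 2 + 2 = 7 fields after its tag.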
Operation Opcode Dispatch
One master dispatcher decodes the entire cuda_tile public opcode space: sub_5B13D0, 10 650 bytes, jump table at 0x5B158E. It takes one operation record at a time, reads the opcode as a VarInt, attaches a source location, switches on the opcode integer, and either inlines the canonical op-builder skeleton or tail-calls a dedicated per-op parser. It returns true if the opcode was recognized and the resulting MLIR Operation* passed verification, false otherwise.
Calls come from sub_57FF40, the bytecode-parse-into-scratch path that walks a function body and stages each operation into per-block operand and location tables before final placement into the materialized region. sub_5B13D0 itself stays op-local: no block-structure work, no region allocation, and no cursor advances outside its argument context.
The canonical body is a five-step sequence repeated for every operation record:
bool parse_operation(BytecodeReader *r, OpBuilder *b, ValueList *operands,
LocAttrList *locs, AttrList *attrs, Operation **out) {
uint64_t opcode;
if (!read_opcode(r, &opcode)) // (1) read opcode VarInt (opcode-reader helper; sub_5847D0 is the create-and-verify helper)
return error(r, "failed to read operation opcode.");
Location loc;
if (!read_location(r, locs, &loc)) // (2) resolve source location
return error(r, "failed to read operation location.");
switch (opcode) { // (3) dispatch on opcode 0..109
case 0x00: return sub_58C5C0(r, b, operands, attrs, loc); // (4a) delegate to per-op parser
case 0x09: return build_inline_bitcast(r, b, operands, attrs, loc, out); // (4b) inline skeleton
/* ... 108 more cases ... */
default: return error(r, "unknown or unimplemented opcode: ", opcode);
}
/* (5) common LABEL_280 cleanup chain frees scratch buffers and returns the verifier result */
}
Step (1) calls the opcode reader, which returns the VarInt and a position-advanced cursor. Step (2) resolves the location either by reading a LocAttr index from the current location-table slot or, when no location was emitted, by synthesizing an UnknownLoc from the MLIR context. Step (3) is the 110-entry jump table at 0x5B158E. Step (4) splits two ways: a tail call to the per-op parser, which reads its own result types, operand indices, and attributes before calling sub_5847D0; or an inline ODS-shaped skeleton that reads the result type via sub_58C400, reads zero or more operand indices via sub_585AD0, then calls sub_5847D0 with the verbatim mnemonic literal embedded in the dispatcher binary. Step (5) is shared by every case path that does not tail-call: it frees the SSO op-name scratch and the transient attribute vector, then returns the verifier's verdict on the freshly built operation.
Four verbatim diagnostic strings are emitted by this dispatcher and its helpers:
- "failed to read operation opcode."
- "failed to read operation location."
- "unknown or unimplemented opcode: "
- "failed to create operation '…' due to verification error."
The first three live inside sub_5B13D0 itself. The fourth comes from sub_5847D0, the create-and-verify helper every case path eventually reaches; its prefix and suffix wrap the mnemonic the current case passed in, so a verifier rejection on, say, cuda_tile.addf surfaces as failed to create operation 'cuda_tile.addf' due to verification error..
Two reserved opcode ranges sit in the jump table and fall through to the default arm. Opcodes 25–36 inclusive (0x19–0x24) and opcodes 52–57 inclusive (0x34–0x39) are unassigned in CUDA 13.1 and emit the "unknown or unimplemented opcode: " diagnostic if a producer encodes them. These gaps line up with corresponding holes in the public ODS opcode assignment — room for future op additions without invalidating already-shipped bytecode.
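A reimplementation can screen opcodes before dispatch with two range tests. The sketch below encodes only the ranges stated above; the function names and `OPCODE_COUNT` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { OPCODE_COUNT = 110 };      /* jump-table entries 0..109 */

/* Opcodes 25..36 and 52..57 are reserved and fall through to the default arm. */
int opcode_is_reserved(uint64_t op) {
    return (op >= 25 && op <= 36) || (op >= 52 && op <= 57);
}

/* An opcode is assigned when it lies inside the table and is not reserved. */
int opcode_is_assigned(uint64_t op) {
    return op < (uint64_t)OPCODE_COUNT && !opcode_is_reserved(op);
}

/* Counts the assigned opcodes by walking the whole table range. */
unsigned count_assigned_opcodes(void) {
    unsigned n = 0;
    for (uint64_t op = 0; op < (uint64_t)OPCODE_COUNT; ++op)
        n += opcode_is_assigned(op) ? 1u : 0u;
    assert(n <= (unsigned)OPCODE_COUNT);
    return n;
}
```

With 12 + 6 = 18 reserved slots, 92 of the 110 table entries carry an operation.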
The full 110-row dispatch table follows. Each row gives the decimal opcode, the cuda_tile mnemonic, and either the dedicated parser address or inline for cases handled by the inline ODS-shaped skeleton inside sub_5B13D0. Region-bearing ops (entry, for, if, loop, module, reduce, scan) delegate to dedicated parsers because they additionally consume nested region bodies before calling sub_5847D0.
| Opcode | Mnemonic | Handler | Notes |
|---|---|---|---|
| 0 | cuda_tile.absf | sub_58C5C0 | |
| 1 | cuda_tile.absi | sub_58C930 | |
| 2 | cuda_tile.addf | sub_58CCA0 | |
| 3 | cuda_tile.addi | sub_58D3A0 | |
| 4 | cuda_tile.andi | sub_58D7B0 | |
| 5 | cuda_tile.assert | sub_587B50 | |
| 6 | cuda_tile.assume | sub_5A1CA0 | |
| 7 | cuda_tile.atomic_cas_tko | sub_58DB20 | |
| 8 | cuda_tile.atomic_rmw_tko | sub_58EF30 | |
| 9 | cuda_tile.bitcast | inline | |
| 10 | cuda_tile.break | sub_5AC120 | |
| 11 | cuda_tile.broadcast | sub_590280 | |
| 12 | cuda_tile.cat | sub_5A8300 | |
| 13 | cuda_tile.ceil | inline | |
| 14 | cuda_tile.cmpf | sub_590560 | |
| 15 | cuda_tile.cmpi | sub_590F00 | |
| 16 | cuda_tile.constant | sub_5AFE90 | |
| 17 | cuda_tile.continue | sub_5AB850 | |
| 18 | cuda_tile.cos | inline | |
| 19 | cuda_tile.cosh | inline | |
| 20 | cuda_tile.divf | sub_591400 | |
| 21 | cuda_tile.divi | sub_591B00 | |
| 22 | cuda_tile.entry | sub_5BAD00 | region-op |
| 23 | cuda_tile.exp | sub_592670 | |
| 24 | cuda_tile.exp2 | sub_5920A0 | |
| 25–36 | — | default | reserved; emits "unknown or unimplemented opcode: " |
| 37 | cuda_tile.exti | sub_592950 | |
| 38 | cuda_tile.extract | sub_5A8B60 | |
| 39 | cuda_tile.floor | sub_593930 | |
| 40 | cuda_tile.fma | sub_593C10 | |
| 41 | cuda_tile.for | sub_5BBFF0 | region-op |
| 42 | cuda_tile.ftof | sub_592E80 | |
| 43 | cuda_tile.ftoi | sub_5933B0 | |
| 44 | cuda_tile.get_global | sub_59E980 | |
| 45 | cuda_tile.get_index_space_shape | sub_5A9D70 | |
| 46 | cuda_tile.get_num_tile_blocks | inline | |
| 47 | cuda_tile.get_tensor_shape | sub_5AA6E0 | |
| 48 | cuda_tile.get_tile_block_id | inline | post-switch fallthrough loop at 0x5B2C05 |
| 49 | cuda_tile.global | sub_5B0720 | GlobalOp |
| 50 | cuda_tile.if | sub_5BCCD0 | region-op |
| 51 | cuda_tile.int_to_ptr | inline | |
| 52–57 | — | default | reserved; emits "unknown or unimplemented opcode: " |
| 58 | cuda_tile.iota | inline | |
| 59 | cuda_tile.itof | sub_594400 | |
| 60 | cuda_tile.join_tokens | sub_5AAF80 | |
| 61 | cuda_tile.load_ptr_tko | sub_5A30D0 | |
| 62 | cuda_tile.load_view_tko | sub_5A4420 | |
| 63 | cuda_tile.log | inline | |
| 64 | cuda_tile.log2 | sub_594980 | |
| 65 | cuda_tile.loop | sub_5BDA00 | region-op |
| 66 | cuda_tile.make_partition_view | sub_594CF0 | |
| 67 | cuda_tile.make_tensor_view | sub_5AE190 | |
| 68 | cuda_tile.make_token | inline | |
| 69 | cuda_tile.maxf | sub_594FD0 | |
| 70 | cuda_tile.maxi | sub_595630 | |
| 71 | cuda_tile.minf | sub_595B60 | |
| 72 | cuda_tile.mini | sub_5961C0 | |
| 73 | cuda_tile.mmaf | sub_5966F0 | |
| 74 | cuda_tile.mmai | sub_596A60 | |
| 75 | cuda_tile.module | sub_5BE6E0 | region-op |
| 76 | cuda_tile.mulf | sub_596EE0 | |
| 77 | cuda_tile.mulhii | sub_5979F0 | |
| 78 | cuda_tile.muli | sub_5975E0 | |
| 79 | cuda_tile.negf | inline | |
| 80 | cuda_tile.negi | inline | |
| 81 | cuda_tile.offset | sub_597CD0 | |
| 82 | cuda_tile.ori | inline | |
| 83 | cuda_tile.permute | sub_59E060 | |
| 84 | cuda_tile.pow | sub_597FB0 | |
| 85 | cuda_tile.print | sub_5AD2C0 | renamed from upstream print_tko |
| 86 | cuda_tile.ptr_to_int | sub_598290 | |
| 87 | cuda_tile.ptr_to_ptr | sub_598570 | |
| 88 | cuda_tile.reduce | sub_5BF2E0 | region-op |
| 89 | cuda_tile.remf | inline | |
| 90 | cuda_tile.remi | sub_598850 | |
| 91 | cuda_tile.reshape | sub_598D80 | |
| 92 | cuda_tile.return | sub_5A9400 | |
| 93 | cuda_tile.rsqrt | sub_599110 | |
| 94 | cuda_tile.scan | sub_5B9B20 | region-op |
| 95 | cuda_tile.select | inline | |
| 96 | cuda_tile.shli | sub_599700 | |
| 97 | cuda_tile.shri | sub_599B10 | |
| 98 | cuda_tile.sin | sub_59A3B0 | |
| 99 | cuda_tile.sinh | sub_59A040 | |
| 100 | cuda_tile.sqrt | sub_59A690 | |
| 101 | cuda_tile.store_ptr_tko | sub_5A55B0 | |
| 102 | cuda_tile.store_view_tko | sub_5A6790 | |
| 103 | cuda_tile.subf | sub_59B0E0 | |
| 104 | cuda_tile.subi | sub_59B7E0 | |
| 105 | cuda_tile.tan | inline | |
| 106 | cuda_tile.tanh | sub_59BBF0 | |
| 107 | cuda_tile.trunci | sub_59BF60 | |
| 108 | cuda_tile.xori | sub_59C3A0 | |
| 109 | cuda_tile.yield | sub_5AC9F0 | |
Opcode 0x6E, which is atan2 in the public 13.2 opcode space, is absent from this binary. The dispatcher has no case for it and embeds no cuda_tile.atan2 mnemonic; encoding the op would land on the default arm and surface the "unknown or unimplemented opcode: " diagnostic. This is consistent with a 13.1-vintage reader that predates the atan2 addition.
Worked encode example. Take the operation
%c = cuda_tile.addi %a, %b : tile<8xi32>
and assume the surrounding context has already populated the per-section tables so that %a is value-table entry 4, %b is value-table entry 5, and tile<8xi32> is type-table entry 3. The function body's operation-record encoder writes seven fields in fixed order, each as a single VarInt:
| Field | VarInt | Byte | Decoded |
|---|---|---|---|
| opcode | 0x03 | 03 | 3 → cuda_tile.addi (dispatch row 3 above) |
| location index (signed LEB128) | 0x7f | 7f | -1 → UnknownLoc (no --lineinfo) |
| result type-ref | 0x03 | 03 | 3 → tile<8xi32> from the type table |
| operand count | 0x02 | 02 | 2 operands |
| operand 0 value-ref | 0x04 | 04 | 4 → %a from the value table |
| operand 1 value-ref | 0x05 | 05 | 5 → %b from the value table |
| attribute-dict ref | 0x00 | 00 | 0 → empty dict (no inline attrs) |
The final on-wire byte stream for this operation record is therefore exactly seven bytes:
03 7f 03 02 04 05 00
A run with --lineinfo replaces the 0x7f sentinel with a non-negative LocAttr index encoded as a positive zig-zagged VarInt, typically one byte (0x00 for index 0, 0x02 for index 1, 0x04 for index 2, and so on). With a one-byte index the record stays at seven bytes; it grows only when the index needs a multi-byte VarInt. A run with a non-empty inline attribute dictionary replaces the trailing 0x00 with a VarInt index into the attribute table, again typically one byte for small modules.
The per-operation cost in the IR section is therefore linear in the number of operands (one VarInt each) plus a small constant for the bookkeeping fields, and is independent of the mnemonic string. The mnemonic cuda_tile.addi lives once, in the dispatcher's per-opcode string literal at dispatch case 0x03, and never appears in the per-operation byte stream.
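The worked example can be reproduced with a small encoder. This sketch copies the byte values from the field table verbatim (one byte per field, 0x7f for the unknown location) rather than deriving them from the VarInt rules, and it rejects any field larger than 0x7e instead of attempting a multi-byte form. `encode_simple_op` and `LOC_UNKNOWN_BYTE` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LOC_UNKNOWN_BYTE = 0x7f, ONE_BYTE_MAX = 0x7e };

/* Writes opcode, location, result type, operand count, operands, and the
 * empty attribute-dict reference, in that order. Returns the number of bytes
 * written, or 0 if a field is too large for this sketch or `out` is too small. */
size_t encode_simple_op(uint8_t opcode, uint8_t result_type,
                        const uint8_t *operands, size_t n_operands,
                        uint8_t *out, size_t cap) {
    assert(out != NULL && (operands != NULL || n_operands == 0));
    if (opcode > ONE_BYTE_MAX || result_type > ONE_BYTE_MAX ||
        n_operands > ONE_BYTE_MAX)
        return 0;
    size_t need = 5 + n_operands;        /* five fixed fields plus one byte per operand */
    if (need > cap) return 0;

    size_t k = 0;
    out[k++] = opcode;
    out[k++] = LOC_UNKNOWN_BYTE;         /* no location recorded */
    out[k++] = result_type;
    out[k++] = (uint8_t)n_operands;
    for (size_t i = 0; i < n_operands; ++i) {
        if (operands[i] > ONE_BYTE_MAX) return 0;
        out[k++] = operands[i];
    }
    out[k++] = 0x00;                     /* empty attribute dictionary */
    return k;
}
```

Calling it with opcode 3, result type 3, and operands {4, 5} produces the seven bytes 03 7f 03 02 04 05 00; each further operand adds exactly one byte.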
The three functions around this dispatcher fit together cleanly. sub_5847D0 is the create-and-verify helper: every case of sub_5B13D0 either inlines the ODS skeleton that ends in a sub_5847D0 call or tail-calls a per-op parser that itself ends in sub_5847D0. The opcode reader supplies the integer the master switch keys on. The Location decoder runs once per operation, between the opcode read and the switch, and writes the resolved Location into the per-op slot every case path reads when populating its OperationState. One layer up, sub_57FF40 is the bytecode-parse-into-scratch path the driver invokes per function body; it calls sub_5B13D0 in a loop for each operation record while maintaining the operand, location, and attribute vectors the dispatcher consumes through its argument context.
Self-Contained Attribute Dispatch
Every attribute payload is self-contained. Constants, function optimization hints, and the inline attribute slots on operations all funnel through the same decoder. The 13-case dispatcher inside sub_59F100 recognizes string, float, type, dense-elements (int/float and string variants), divisibility, dense-i64-array (two layout variants), same-elements, bounded (three discriminator variants), and assume-predicate attributes. The integer, bool, array, dictionary, and optimization-hints attribute kinds are not handled by this dispatcher — they arrive through the upstream MLIR builtin dispatcher path on a different code path. A Global initializer must resolve to a dense integer-or-floating elements attribute even though the Constant section can store a broader attribute set.
Anywhere the reader encounters an attribute payload that does not come pre-resolved through the Constant offset table — operation attribute dictionaries, type-attribute slots, location-attribute slots, the Constant payloads themselves — the bytes route through sub_59F100. This dispatcher is the attribute-side sibling of the 110-case opcode switch. Roughly 8 KB, it dispatches on a uint32_t AttrTag through a jump table at the entry switch and returns either a heap-allocated Attribute on success or nullptr on failure. The caller pushes the result into the bytecode reader's attribute table; failures propagate up to the section-level error path that aborts the load.
The shipped tileiras tag numbering is wire-format-breaking versus upstream MLIR. The two numberings are reproduced side by side so the divergence is unambiguous:
| AttrTag | Upstream MLIR BytecodeEnums.h::AttributeTag | Tileiras sub_59F100 |
|---|---|---|
| 0 | (reserved / sentinel) | (default-arm; emits "unsupported AttributeTag") |
| 1 | IntegerAttr | StringAttr |
| 2 | FloatAttr | FloatAttr |
| 3 | BoolAttr | TypeAttr |
| 4 | TypeAttr | DenseElementsAttr (int/float) |
| 5 | StringAttr | DenseElementsAttr (string) |
| 6 | ArrayAttr | DivByAttr |
| 7 | DenseElements | DenseI64ArrayAttr (variant A) |
| 8 | DivByAttr | DenseI64ArrayAttr (variant B) |
| 9 | SameElementsAttr | SameElementsAttr |
| 10 | Dictionary | BoundedAttr (variant 0) |
| 11 | OptimizationHints | BoundedAttr (variant 1) |
| 12 | BoundedAttr | BoundedAttr (variant 2) |
| 13 | (no upstream slot) | AssumePredicateAttr |
Two tags coincide with upstream exactly: tag 2 (FloatAttr in both) and tag 9 (SameElementsAttr in both). Tag 12 is a BoundedAttr on both sides, although tileiras reserves it for variant 2 of three. Every other tag in the 1..13 range disagrees: tag 1 is StringAttr here versus upstream IntegerAttr, tag 4 lands on DenseElementsAttr instead of TypeAttr, tag 5 lands on DenseElementsAttr<string> instead of StringAttr, tag 6 lands on DivByAttr instead of ArrayAttr, and so on. Going the other direction, an AssumePredicateAttr emitted by tileiras at tag 13 has no destination in upstream's table at all; upstream's reader rejects the tag with its own default-arm diagnostic.
The structural consequence is sharper than tag-by-tag remapping: tileiras's bytecode reader cannot consume upstream MLIR's bytecode files when those files carry attributes, and tileiras-emitted bytecode (when a future build links a writer) cannot be loaded by stock MLIR. The textual MLIR asm is still interoperable through the printer / parser, but the bytecode wire format is a hard fork. Any external tool that wants to round-trip MLIR bytecode through both tileiras and upstream MLIR must freeze the tag assignments used by this binary rather than the ones in the upstream header. The upstream numbering is reserved for future stock cuda_tile builds; the shipped binary stays compatible with an earlier frozen scheme.
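The side-by-side table can be checked mechanically. The sketch below transcribes both columns as attribute-kind names, dropping the variant and element-kind annotations, and reports which tags name the same kind on both sides. The array and function names are hypothetical, and "DenseElements" is kept exactly as the upstream column spells it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ATTR_TAG_MAX = 13 };

/* Upstream column of the comparison table; NULL marks a tag with no kind. */
static const char *const kUpstreamKinds[ATTR_TAG_MAX + 1] = {
    NULL, "IntegerAttr", "FloatAttr", "BoolAttr", "TypeAttr", "StringAttr",
    "ArrayAttr", "DenseElements", "DivByAttr", "SameElementsAttr",
    "Dictionary", "OptimizationHints", "BoundedAttr", NULL
};

/* Tileiras column, with variant annotations stripped. */
static const char *const kTileirasKinds[ATTR_TAG_MAX + 1] = {
    NULL, "StringAttr", "FloatAttr", "TypeAttr", "DenseElementsAttr",
    "DenseElementsAttr", "DivByAttr", "DenseI64ArrayAttr", "DenseI64ArrayAttr",
    "SameElementsAttr", "BoundedAttr", "BoundedAttr", "BoundedAttr",
    "AssumePredicateAttr"
};

/* Returns 1 when both numberings assign the same attribute kind to `tag`. */
int tag_kinds_agree(unsigned tag) {
    if (tag > ATTR_TAG_MAX) return 0;
    if (kUpstreamKinds[tag] == NULL || kTileirasKinds[tag] == NULL) return 0;
    return strcmp(kUpstreamKinds[tag], kTileirasKinds[tag]) == 0;
}

/* Counts the tags in 0..13 on which the two numberings agree. */
unsigned count_agreeing_tags(void) {
    unsigned n = 0;
    for (unsigned tag = 0; tag <= ATTR_TAG_MAX; ++tag)
        n += tag_kinds_agree(tag) ? 1u : 0u;
    assert(n <= ATTR_TAG_MAX + 1);
    return n;
}
```

At kind granularity the two numberings agree on tags 2, 9, and 12 only, so a decoder written against one table misreads nearly every attribute produced under the other.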
The thirteen recognized tag values, the attribute kinds they construct, and the per-tag builder functions are:
| Tag | Attribute kind | Builder | Notes |
|---|---|---|---|
| 1 | StringAttr | inline | Reads SSO + raw bytes |
| 2 | FloatAttr | inline | Reads u32 type-ref + f64 value |
| 3 | TypeAttr | inline | Reads u32 type-ref |
| 4 | DenseElementsAttr (int/float) | sub_59FB80 | Reads shape + elem-type + payload |
| 5 | DenseElementsAttr (string) | sub_59FCD0 | Reads shape + length-prefixed strings |
| 6 | DivByAttr | sub_59FE40 | Reads divisor + verify-with-assume payload |
| 7 | DenseI64ArrayAttr (variant A) | sub_59FF60 | Inline-cap layout |
| 8 | DenseI64ArrayAttr (variant B) | sub_5A0080 | Sidecar-cap layout |
| 9 | SameElementsAttr | sub_5A01A0 | Reads canonical-form payload |
| 10 | BoundedAttr (variant 0) | sub_5A02C0 | Reads lower-bound payload |
| 11 | BoundedAttr (variant 1) | sub_5A03E0 | Reads upper-bound payload |
| 12 | BoundedAttr (variant 2) | sub_5A0500 | Reads lower+upper payload |
| 13 | AssumePredicateAttr | sub_5A0620 | Reads packed predicate |
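As a compact restatement of the table, a reimplementation might keep the numbering in a single lookup array. The sketch below is illustrative only: the names `kAttrTagNames` and `tileiras_attr_tag_name` are hypothetical, and only the tag numbers and kind names are taken from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kind names indexed by AttrTag, copied from the table above.
 * Slot 0 is NULL because tag 0 falls into the reader's default arm. */
static const char *const kAttrTagNames[] = {
    NULL,
    "StringAttr",
    "FloatAttr",
    "TypeAttr",
    "DenseElementsAttr (int/float)",
    "DenseElementsAttr (string)",
    "DivByAttr",
    "DenseI64ArrayAttr (variant A)",
    "DenseI64ArrayAttr (variant B)",
    "SameElementsAttr",
    "BoundedAttr (variant 0)",
    "BoundedAttr (variant 1)",
    "BoundedAttr (variant 2)",
    "AssumePredicateAttr",
};

/* Hypothetical helper: returns the kind name for a tag, or NULL when the
 * shipped reader would take its default arm (tag 0 or any tag above 13). */
static const char *tileiras_attr_tag_name(uint32_t tag) {
    size_t n = sizeof(kAttrTagNames) / sizeof(kAttrTagNames[0]);
    assert(n == 14);  /* slots 0..13 */
    return tag < n ? kAttrTagNames[tag] : NULL;
}
```

Keeping slot 0 empty means the array index and the wire tag stay identical, which avoids off-by-one mistakes when the table is compared against the upstream header.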
Tags 1, 2, and 3 decode inline in sub_59F100 itself. Tag 1 resolves a string via sub_59AD90 and wraps it in StringAttr. Tag 2 reads a type reference, validates it as a FloatType via sub_58C400, reads an inline APFloat payload via sub_586200, and dispatches through the sub_4462700-family float-type-builder to produce a FloatAttr. Tag 3 reads a type reference via sub_58BDE0 and wraps it in TypeAttr. Every other tag tail-calls a dedicated sub-decoder in the sub_59FB80–sub_5A0620 cluster; those decoders read the tag-specific payload, build the corresponding attribute, and either return the new attribute or emit the per-decoder error string and return nullptr. The default arm covers tag 0 and every tag above 13: it emits "unsupported AttributeTag " (verbatim, trailing space included) concatenated with the tag integer and the suffix " for self-contained attribute", then returns nullptr. Diagnostics route through the standard emitter chain sub_57EA50 / sub_581460 at severity 0x103.
The canonical body is the entry prologue plus the 13-way switch and the default-arm error path:
Attribute *parseSelfContainedAttribute(BytecodeReader *r) {
  uint32_t tag;
  if (!read_varint_u32(r, &tag)) {                    // (1) read AttrTag VarInt
    emit_error(r, "failed to read AttributeTag for self-contained attribute.");
    return NULL;
  }
  switch (tag) {                                      // (2) 13-case dispatch
  case 1:  return read_string_attr_inline(r);         // StringAttr via sub_59AD90
  case 2:  return read_float_attr_inline(r);          // FloatAttr (type-ref + APFloat)
  case 3:  return read_type_attr_inline(r);           // TypeAttr (type-ref only)
  case 4:  return sub_59FB80(r);                      // DenseElementsAttr int/float
  case 5:  return sub_59FCD0(r);                      // DenseElementsAttr string
  case 6:  return sub_59FE40(r);                      // DivByAttr
  case 7:  return sub_59FF60(r);                      // DenseI64ArrayAttr variant A
  case 8:  return sub_5A0080(r);                      // DenseI64ArrayAttr variant B
  case 9:  return sub_5A01A0(r);                      // SameElementsAttr
  case 10: return sub_5A02C0(r);                      // BoundedAttr variant 0
  case 11: return sub_5A03E0(r);                      // BoundedAttr variant 1
  case 12: return sub_5A0500(r);                      // BoundedAttr variant 2
  case 13: return sub_5A0620(r);                      // AssumePredicateAttr
  default:                                            // (3) default-arm error path
    emit_error(r, "unsupported AttributeTag ", tag, " for self-contained attribute");
    return NULL;
  }
}
Twenty-one verbatim diagnostic strings are reachable from sub_59F100 and its inline tag arms. They are reproduced below exactly as they appear in the binary; the trailing period or trailing space is part of the string. The "Tag" column identifies which switch arm emits each string, and the "Trigger" column states the failure condition that surfaces the diagnostic.
| Tag | Verbatim string | Trigger |
|---|---|---|
| any | "failed to read AttributeTag for self-contained attribute." | Entry-prologue VarInt read of the AttrTag failed. |
| 1 | "string index " | Out-of-range string-table index reported by sub_59AD90; concatenated with the offending index. |
| 1 | "failed to read StringAttr." | StringAttr decode reached the SSO read but the underlying byte slice was short. |
| 2 | "failed to read valid FloatType for FloatAttr" | Type-ref for the FloatAttr resolved to something that is not a FloatType. |
| 2 | "failed to cast parsed attribute to FloatAttr" | Builder produced a non-FloatAttr (post-construction invariant guard). |
| 3 | "failed to get referenced type for TypeAttr" | Type-table lookup for TypeAttr's type-ref returned null. |
| 4 | "failed to read valid MLIR Type for self-contained DenseElementsAttr" | Element-type reference for the dense attribute did not resolve. |
| 4 | "array contains unsupported value " | Dense int/float bulk-element loop hit a payload word it cannot decode; concatenated with the value. |
| 5 | "failed to read number of string attrs in DenseElementsAttr" | String-variant count-prefix read failed. |
| 5 | "failed to read string in DenseElementsAttr" | String-variant per-element string read failed. |
| 6 | "failed to read divisor for DivByAttr" | DivByAttr divisor field VarInt read failed. |
| 6 | "failed to read flags byte for DivByAttr" | DivByAttr flags byte (verify-with-assume + covariance bits) read failed. |
| 6 | "failed to read value for 'every' in DivByAttr" | DivByAttr every predicate-covariance field read failed. |
| 6 | "failed to read value for 'along' in DivByAttr" | DivByAttr along predicate-covariance field read failed. |
| 7,8 | "failed to read DenseI64ArrayAttr values." | DenseI64ArrayAttr bulk i64 value read failed in either layout variant. |
| 9 | "failed to read DenseI64ArrayAttr for SameElementsAttr" | SameElementsAttr canonical-form payload (which is itself a DenseI64ArrayAttr) failed to decode. |
| 10,11,12 | "failed to read flags byte for BoundedAttr" | BoundedAttr variant-discriminator flags byte read failed. |
| 10,11,12 | "failed to read lower bound for BoundedAttr" | BoundedAttr lower-bound payload read failed (variants 0 and 2). |
| 10,11,12 | "failed to read upper bound for BoundedAttr" | BoundedAttr upper-bound payload read failed (variants 1 and 2). |
| default | "unsupported AttributeTag " | Default arm; concatenated with the tag integer. |
| default | " for self-contained attribute" | Default-arm suffix; concatenated after the tag integer to complete the diagnostic. |
The "unsupported AttributeTag " / " for self-contained attribute" pair is the canonical sentinel for forward-incompatible bytecode: any future tileiras that adds AttrTag values 14+ will be rejected by this CUDA 13.1 reader with that exact pair of fragments wrapping the offending integer. Producers that need to stay compatible with the shipped binary must restrict themselves to the thirteen tags above.
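A wrapper that drives tileiras may want to recognize this sentinel in captured error output and report the offending tag. The following is a minimal sketch under stated assumptions: the helper name `parse_unsupported_attr_tag` is hypothetical, and the tag is assumed to be rendered in decimal, which the source does not state (it only says the integer is concatenated between the two fragments).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static const char kPrefix[] = "unsupported AttributeTag ";
static const char kSuffix[] = " for self-contained attribute";

/* Hypothetical helper: returns true and stores the offending tag when `msg`
 * contains prefix + decimal integer + suffix. Decimal rendering of the tag
 * is an assumption, not something the binary is known to guarantee. */
static bool parse_unsupported_attr_tag(const char *msg, uint32_t *tag_out) {
    assert(msg != NULL && tag_out != NULL);
    const char *p = strstr(msg, kPrefix);
    if (p == NULL) return false;
    p += sizeof(kPrefix) - 1;                    /* skip past the prefix */
    if (*p < '0' || *p > '9') return false;      /* need at least one digit */
    char *end = NULL;
    unsigned long v = strtoul(p, &end, 10);
    if (strncmp(end, kSuffix, sizeof(kSuffix) - 1) != 0) return false;
    if (v > UINT32_MAX) return false;            /* tag is a 32-bit VarInt */
    *tag_out = (uint32_t)v;
    return true;
}
```

Matching both fragments, rather than the prefix alone, keeps the check from firing on unrelated messages that merely mention an attribute tag.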
This dispatcher relates to the rest of the bytecode reader the same way the opcode dispatcher does. Callers are sub_5A0A50 (the Constant-section attribute-table populator), sub_5A1CA0 (the cuda_tile.assume parser, which carries a self-contained AssumePredicateAttr payload), sub_5A2410, and sub_5A7AD0. Each one passes a reader cursor and receives a heap-allocated Attribute or a propagated failure back; no caller attempts to recover from a nullptr return. The per-tag builders in the sub_59FB80–sub_5A0620 cluster share callees with the inline arms — sub_57BCF0 for VarInts, sub_58C400 for Type-by-index, sub_58BDE0 for Attr-by-index, sub_456A580 for vector reservation, sub_586200 for inline APInt/APFloat decoding — so the dispatcher's behavior is fully described by the tag table above plus the diagnostic table.
This section is the attribute-side companion to the Operation Opcode Dispatch above. The corresponding dialect-level question — which cuda_tile attributes are recognized by this binary at all — is summarized in Dialect Bytecode Reader/Writer Status — Status Matrix. The cuda_tile bytecode reference details the attribute-readers per dialect, including the exact payload layout each sub_59FB80–sub_5A0620 builder consumes.
Debug-Info Attribute Dispatch
Debug information — DICompileUnit, DIFile, DILexicalBlock, DILoc, DISubprogram, CallSite — does not flow through the AttrTag dispatcher above. It has its own third dispatcher: sub_589B90. The function is 8 779 bytes, switches on a uint32_t DebugTag at 0x589E05, and fires whenever a DistinctAttr-class or LLVM DI* attribute appears in any Location slot, in the Debug section's parallel attribute table, or — through recursion — as a nested scope reference inside another debug attribute. It rounds out the trio of bytecode dispatchers alongside the 13-case AttrTag switch in sub_59F100 and the 110-case Op opcode switch in sub_5B13D0. All three are reached from the top-level bytecode-parse routine sub_57FF40, which walks function bodies and routes each encountered tag through the appropriate dispatcher based on its section context.
The seven recognized tag values, the attribute kinds they construct, and the per-tag builder strategy are:
| Tag | Attribute kind | Builder | Notes |
|---|---|---|---|
| 0 | (default) | inline | Hit when the tag is 0 or unknown; emits the "fail to read kind" diagnostic and returns nullptr |
| 1 | DICompileUnitAttr | inline | File ref + producer string + flags |
| 2 | DIFileAttr | inline | Name string + directory string |
| 3 | DILexicalBlockAttr | inline | Scope ref + file ref + line + column; recursive on scope |
| 4 | DILocAttr | inline | Scope ref + file ref + line + column + optional inlined-at ref; recursive on scope and inlined-at |
| 5 | DISubprogramAttr | sub_588E60 (3 375 B) | CU ref + name + linkage name + scope + line + flags + sp-purpose + optional spFlags |
| 6 | CallSiteAttr | inline | Callee subprogram ref + caller subprogram ref + line + column; recursive on both subprogram refs |
Tag 5 is the only case that tail-calls a dedicated sub-parser. DISubprogramAttr carries the heaviest payload of the seven — compile-unit reference, function name string, linkage-name string, enclosing scope reference, line number, generic flags, special-purpose discriminator, and an optional spFlags word — and the body is large enough that the dispatcher delegates rather than inlining it. The 8 779-byte main body of sub_589B90 open-codes every smaller case: tags 1, 2, 3, 4, and 6 each have their full read sequence inline in the switch arm, with one diagnostic per failable field read.
Tags 3, 4, and 6 decode scope-like cross-references and call sub_589B90 recursively. A DILexicalBlock references its enclosing scope, itself another DI* attribute. A DILoc references both its containing scope and, if inlined, an inlined-at location, both of which route back through the same dispatcher. A CallSite references its caller subprogram and its callee subprogram, again as full debug attributes. The recursion has no cycle detection: the bytecode writer never emits cycles and the reader trusts the input, so a malformed stream with a transitive scope cycle recurses until the stack is exhausted. Producers that hand-craft debug bytecode must topologically order the debug table so every reference points at an attribute emitted earlier in the stream.
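A producer can check the ordering rule before serializing. The sketch below uses a hypothetical in-memory model (`DebugEntry`, `NO_REF`, and `first_non_topological_entry` are illustrative names, not part of tileiras) in which each entry records at most two cross-references by table index; it flags the first entry that refers to itself or to a later entry.

```c
#include <assert.h>
#include <stddef.h>

#define NO_REF (-1)

/* Hypothetical model of one debug-table entry: only the cross-references
 * matter for the ordering check. */
typedef struct {
    int tag;       /* DebugTag 1..6 as in the table above */
    int refs[2];   /* indices of referenced debug attributes, or NO_REF */
} DebugEntry;

/* Returns the index of the first entry that references itself, a later
 * entry, or an invalid index; returns -1 when every reference points at
 * an entry emitted earlier in the table. */
static int first_non_topological_entry(const DebugEntry *table, size_t n) {
    assert(table != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < 2; ++k) {
            int ref = table[i].refs[k];
            if (ref != NO_REF && (ref < 0 || (size_t)ref >= i))
                return (int)i;
        }
    }
    return -1;
}
```

Because a table that passes this check has no forward or self references, the reader's recursion on it is bounded by the table length.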
Worked example — recursive scope cycle. A small input that exercises the recursion (and shows the cycle failure mode) starts from a DILexicalBlock whose scope reference points back at itself. Upstream MLIR forbids self-cycles, but a hand-crafted bytecode stream can emit them; the reader's behavior on such input is observable and worth documenting.
debug attribute table, index 0:
tag = 3 // DILexicalBlock
scope_ref = attr_index(0) // self-reference (the cycle)
file_ref = attr_index(1) // forward ref to the DIFile below
line = 42
column = 0
debug attribute table, index 1:
tag = 2 // DIFile
name_string_index = 7
directory_string_index = 8
Loading this table walks sub_589B90 like so:
1. Top-level call: tag VarInt 0x03 → DILexicalBlock arm.
2. Read scope-ref VarInt 0x00 → recurse into sub_589B90 at attribute index 0.
3. The top-level frame is still in flight; the recursive call reads tag VarInt 0x03 again and recurses again.
4. The cycle has no fixed point: step (3) repeats until the C stack overflows. Empirically this fires after a few thousand frames on a default 8 MiB stack; the process dies with SIGSEGV, not with a tileiras diagnostic.
A well-formed equivalent of the same intent emits the inner attribute first and indexes into it from the outer one:
debug attribute table, index 0:
tag = 2 // DIFile, emitted first
name_string_index = 7
directory_string_index = 8
debug attribute table, index 1:
tag = 3 // DILexicalBlock, now references a *prior* attr
scope_ref = attr_index(0) // resolves cleanly
file_ref = attr_index(0)
line = 42
column = 0
The constructed mlir::LocationAttr is a DILexicalBlockAttr whose scope and file both point at the DIFileAttr at index 0. The recursion bottoms out on the first call because tag 2 (DIFile) has no scope-shaped fields and decodes inline.
The takeaway is asymmetric: well-formed input always terminates in a single bounded recursion sweep because attributes are emitted in topological order; ill-formed input that introduces a cycle is detected only by stack exhaustion. A future reader that wants to harden this path would maintain a visited-set keyed by attribute index during the recursive walk and emit a "cyclic debug attribute reference at index " diagnostic on revisit. The shipped CUDA 13.1 reader does not.
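The hardening idea can be sketched as follows. This is not code from the shipped reader: `ScopeEntry`, `VisitState`, and `resolve_scope` are hypothetical names, the table model keeps only a single scope reference per entry, and the diagnostic text is the one proposed in the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NO_REF (-1)

/* Hypothetical model: each entry keeps only its enclosing-scope index. */
typedef struct {
    int scope_ref;   /* index of the enclosing scope, or NO_REF */
} ScopeEntry;

typedef enum { UNSEEN = 0, IN_FLIGHT, RESOLVED } VisitState;

/* Resolves entry `idx` and everything it references. Returns false on a
 * dangling index, and prints the proposed diagnostic and returns false
 * when the walk revisits an index that is still being resolved. */
static bool resolve_scope(const ScopeEntry *table, size_t n, int idx,
                          VisitState *state) {
    assert(table != NULL && state != NULL);
    if (idx < 0 || (size_t)idx >= n) return false;   /* dangling reference */
    if (state[idx] == RESOLVED) return true;          /* already decoded */
    if (state[idx] == IN_FLIGHT) {
        fprintf(stderr, "cyclic debug attribute reference at index %d\n", idx);
        return false;
    }
    state[idx] = IN_FLIGHT;
    int next = table[idx].scope_ref;
    bool ok = (next == NO_REF) || resolve_scope(table, n, next, state);
    state[idx] = ok ? RESOLVED : UNSEEN;
    return ok;
}
```

With the in-flight marker, recursion depth is bounded by the number of table entries, so a cyclic stream produces a diagnostic instead of exhausting the stack.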
The canonical dispatcher body is the entry-prologue VarInt read plus the 7-arm switch:
Attribute *parseDebugAttribute(BytecodeReader *r) {
  uint32_t tag;
  if (!read_varint_u32(r, &tag)) {                    // (1) read DebugTag VarInt
    emit_error(r, "failed to read kind tag");
    return NULL;
  }
  switch (tag) {                                      // (2) 7-arm dispatch
  case 0:                                             // default-equivalent inline arm
    emit_error(r, "unknown debug attribute tag");
    return NULL;
  case 1: return read_di_compile_unit_inline(r);      // DICompileUnitAttr
  case 2: return read_di_file_inline(r);              // DIFileAttr
  case 3: return read_di_lexical_block_inline(r);     // recurses on scope
  case 4: return read_di_loc_inline(r);               // recurses on scope + inlined-at
  case 5: return sub_588E60(r);                       // DISubprogramAttr (delegated)
  case 6: return read_call_site_inline(r);            // recurses on caller and callee
  default:                                            // forward-incompatibility sentinel
    emit_error(r, "unsupported DebugTag ", tag);
    return NULL;
  }
}
Fifteen verbatim diagnostic strings are reachable from sub_589B90 and its inline tag arms: fourteen field-specific messages plus the shared "string index " prefix. They are reproduced below exactly as they appear in the binary; the trailing punctuation, if any, is part of the string. The "Tag" column identifies which switch arm emits each string, and the "Trigger" column states the failure condition that surfaces the diagnostic.
| Tag | Verbatim string | Trigger |
|---|---|---|
| any | "string index " | Out-of-range string-table index reported by the shared string lookup; concatenated with the offending index. |
| 1 | "failed to read file attribute when parsing DICompileUnitAttr" | DICompileUnitAttr file-reference field read failed. |
| 1 | "failed to read producer for DICompileUnitAttr" | DICompileUnitAttr producer string field read failed. |
| 2 | "failed to read file name attribute when parsing DIFileAttr" | DIFileAttr file-name string field read failed. |
| 2 | "failed to read directory attribute when parsing DIFileAttr" | DIFileAttr directory string field read failed. |
| 3 | "failed to read scope attribute when parsing DILexicalBlockAttr" | DILexicalBlockAttr enclosing-scope recursive read failed. |
| 3 | "failed to read file attribute when parsing DILexicalBlockAttr" | DILexicalBlockAttr file-reference field read failed. |
| 3 | "failed to read line number when parsing DILexicalBlockAttr" | DILexicalBlockAttr line-number VarInt read failed. |
| 3 | "failed to read column number when parsing DILexicalBlockAttr" | DILexicalBlockAttr column-number VarInt read failed. |
| 4 | "failed to read scope attribute when parsing DILocAttr" | DILocAttr containing-scope recursive read failed. |
| 4 | "failed to read file name attribute when parsing FileLineColLoc" | Inner FileLineColLoc file-name field read failed inside the DILocAttr arm. |
| 4 | "failed to read line number when parsing FileLineColLoc" | Inner FileLineColLoc line-number VarInt read failed inside the DILocAttr arm. |
| 4 | "failed to read column number when parsing FileLineColLoc" | Inner FileLineColLoc column-number VarInt read failed inside the DILocAttr arm. |
| 6 | "failed to read callee attribute when parsing CallSiteLoc" | CallSiteAttr callee-subprogram recursive read failed. |
| 6 | "failed to read caller attribute when parsing CallSiteLoc" | CallSiteAttr caller-subprogram recursive read failed. |
Two structural observations follow from the table. The tag 4 arm emits its inner errors under the FileLineColLoc name rather than DILocAttr, which reflects how DILocAttr is built on top of MLIR's FileLineColLoc primitive: line and column flow through a sub-helper shared with plain locations, and that sub-helper emits its own diagnostics under its own attribute name. The tag 6 arm spells CallSiteLoc rather than CallSiteAttr for the same reason: the location form predates the attribute form in MLIR's debug-info subsystem, and the embedded diagnostic literals carry the older name. Consumers parsing tileiras error output must accept both spellings as referring to the same sub_589B90 switch arms.
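For a consumer that classifies tileiras error output, the two spellings can be folded into one lookup. The sketch below is an assumption-laden illustration: `DiagName`, `kDiagNames`, and `debug_tag_for_diagnostic` are hypothetical names, and the mapping covers only the attribute names that appear in the diagnostic table above (the tag 5 strings emitted by sub_588E60 are not listed there, so they are not covered).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *spelling; int tag; } DiagName;

/* Attribute names as they appear inside the diagnostic literals, mapped to
 * the switch arm that emits them. */
static const DiagName kDiagNames[] = {
    { "DICompileUnitAttr",  1 },
    { "DIFileAttr",         2 },
    { "DILexicalBlockAttr", 3 },
    { "DILocAttr",          4 },
    { "FileLineColLoc",     4 },  /* inner helper name used by the tag 4 arm */
    { "CallSiteLoc",        6 },  /* older location spelling used by tag 6 */
};

/* Hypothetical helper: returns the DebugTag arm a message belongs to,
 * or 0 when no known attribute name appears in it. */
static int debug_tag_for_diagnostic(const char *msg) {
    assert(msg != NULL);
    for (size_t i = 0; i < sizeof(kDiagNames) / sizeof(kDiagNames[0]); ++i)
        if (strstr(msg, kDiagNames[i].spelling) != NULL)
            return kDiagNames[i].tag;
    return 0;
}
```

The shared "string index " prefix intentionally maps to 0, since any arm can surface it.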
sub_589B90's relationship with the rest of the bytecode reader mirrors the other two dispatchers. Callers are sub_588E60 itself (when the DISubprogramAttr body needs to recurse for its CU or scope reference), sub_58BDE0 (the Attr-by-index cached lookup, which routes any DistinctAttr-class index through the debug dispatcher), and itself recursively as described above. Every caller passes a reader cursor and receives a heap-allocated Attribute or a propagated nullptr; nobody recovers from a failed debug read, so a single corrupt DebugTag aborts the entire module load. Shared callees with the other two dispatchers are sub_57BCF0 for VarInts, sub_58BDE0 for Attr-by-index, the sub_57EA50 / sub_581460 emitter pair at severity 0x103 for diagnostics, and the string-table lookup path that emits the shared "string index " prefix on out-of-range references.
This section is the debug-info companion to the Operation Opcode Dispatch and Self-Contained Attribute Dispatch sections above. The dialect-level question of which debug attributes the bytecode writer actually emits in CUDA 13.1 is tracked in Dialect Bytecode Reader/Writer Status — Status Matrix; the seven tags above correspond to the strict subset of LLVM DI* attributes that survive round-tripping through this reader.
Ordering and Diagnostics
Physical section order stays flexible because the reader captures section spans before decoding bodies. Required sections are String and Func. Structural errors live in a separate channel from mandatory-section errors and per-section decode errors — the distinction lets tools tell "not a TileIR file" apart from "valid TileIR envelope with a malformed Type section".
The driver also distinguishes TileIR bytecode from upstream MLIR bytecode. When the input looks like ordinary MLIR bytecode, the diagnostic says so rather than emitting only a generic magic-number failure.
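A wrapper can pre-screen inputs in a similar spirit. Upstream MLIR bytecode begins with the four bytes 'M' 'L' 0xEF 'R'; the sketch below tests only for that upstream form, because the TileIR magic constant is not reproduced in this section. The function name is hypothetical, and the check assumes the TileIR magic does not begin with the same four bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Upstream MLIR bytecode files start with the bytes 'M' 'L' 0xEF 'R'. */
static const uint8_t kUpstreamMlirMagic[4] = { 0x4D, 0x4C, 0xEF, 0x52 };

/* Hypothetical pre-screen: true when the buffer starts with the upstream
 * MLIR bytecode magic. It says nothing about whether a buffer that fails
 * the test is valid TileIR. */
static bool looks_like_upstream_mlir_bytecode(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    return len >= sizeof(kUpstreamMlirMagic) &&
           memcmp(buf, kUpstreamMlirMagic, sizeof(kUpstreamMlirMagic)) == 0;
}
```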
NVPTX LLVM Bitcode Path
When tileiras takes the in-process libNVVM path, it also serializes LLVM bitcode — a different format entirely from the TileIR bytecode described above. The NVPTX64 data layout is stamped onto the LLVM module unconditionally before serialization:
const char *NVPTX64_DATA_LAYOUT =
"e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-"
"i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-"
"v64:64:64-v128:128:128-n16:32:64";
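The override discards whatever data layout MLIR's LLVM dialect translation left behind. The NVPTX ABI fact worth pinning down is the p3:32:32:32 entry: pointers into address space 3 (shared memory) are 32 bits wide, while the default p:64:64:64 entry keeps every other pointer at 64 bits.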
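The pointer widths can be read straight out of the layout string. The sketch below is a small standalone parser, not anything tileiras links: `pointer_bits` is a hypothetical helper that handles only the `p[N]:<size>:...` specs and, like LLVM, falls back to the address-space-0 entry when an address space has no spec of its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static const char kNvptx64Layout[] =
    "e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-"
    "i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-"
    "v64:64:64-v128:128:128-n16:32:64";

/* Returns the pointer size in bits for `addrspace`, falling back to the
 * address-space-0 entry, or 0 when the string has no pointer spec at all. */
static unsigned pointer_bits(const char *layout, unsigned addrspace) {
    assert(layout != NULL);
    unsigned fallback = 0;
    for (const char *spec = layout; *spec != '\0'; ) {
        const char *end = strchr(spec, '-');
        if (end == NULL) end = spec + strlen(spec);
        if (spec[0] == 'p') {
            char *after = NULL;
            /* "p:" means address space 0; "pN:" names address space N. */
            unsigned long as = strtoul(spec + 1, &after, 10);
            if (after < end && *after == ':') {
                unsigned bits = (unsigned)strtoul(after + 1, NULL, 10);
                if (as == addrspace) return bits;
                if (as == 0) fallback = bits;
            }
        }
        spec = (*end == '-') ? end + 1 : end;   /* step to the next spec */
    }
    return fallback;
}
```

For the layout above, address space 3 resolves to 32 bits and address spaces without an explicit entry resolve to the 64-bit default.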
NVPTX modules ship as bare LLVM bitcode, not the wrapper-framed object format used for some host triples. The bitcode identifies itself as LLVM 21-era output and reaches libNVVM under the module name mlir-input.
BitcodeBuffer serialize_for_libnvvm(LLVMModule *m) {
  module_set_data_layout(m, NVPTX64_DATA_LAYOUT);
  BitcodeBuffer bc = write_llvm_bitcode(
      m,
      /*wrapper=*/false,
      /*producer=*/"LLVM21.0.0git");
  return bc;
}
NvvmResult compile_with_libnvvm(BitcodeBuffer bc) {
  NvvmProgram prog = nvvmCreateProgram();
  nvvmAddModuleToProgram(prog, bc.data, bc.size, "mlir-input");
  nvvmCompileProgram(prog, options);
  nvvmVerifyProgram(prog, options);
  return nvvmGetCompiledResult(prog);
}
The default command-line path still goes through PTX and ptxas. The bitcode path only matters when the pipeline is wired to use libNVVM directly. The NVPTX target initialization and the data-layout stamping documented above are covered end-to-end in NVPTX Bring-up and Target Init; the libdevice bitcode that gets linked into the same module is documented in libdevice Overview — Link, inline, simplify.
Cross-References
This page documents the wire format the bytecode reader consumes; four companion pages cover the reader from complementary angles.
- Frontend Contract and Tile IR Emission documents the producer-side rules a frontend must satisfy to emit conformant bytecode (the dialect list, the magic and version constants, the AttrTag numbering, and the canonical VarInt encoding) and catalogues the common emission mistakes that produce buffers this reader rejects.
- Dialect Bytecode Reader/Writer Status restricts the wire format to the dialects that actually ship a reader (cuda_tile is the only TileIR dialect with one, and no TileIR dialect ships a writer) and frames the asymmetry as a deliberate input-only driver contract.
- Dialect Asm-Printer Status documents the textual side of the same contract, because round-trip workflows on intermediate dialects rely on the asm-printer rather than the bytecode writer this binary does not link.
- cuda_tile Bytecode Reader zooms back in on the cuda_tile-private dispatchers: the 18-case TypeTag dispatcher, the cuda_tile-specific AttrTag payload shapes that route through the 13-case dispatcher documented above, and the 110-case Op opcode dispatcher whose dispatch table is reproduced in Operation Opcode Dispatch.
The wire-format-breaking AttrTag divergence is the most consequential single fact across all four pages: a bytecode file containing attributes is not portable between tileiras and stock MLIR, and any reimplementation must freeze the tileiras numbering reproduced in Self-Contained Attribute Dispatch.
Dialect Bytecode Reader/Writer Status
Abstract
tileiras consumes TileIR bytecode in one direction only. It accepts serialized cuda_tile modules at the driver boundary, lowers them through several internal dialects, and emits PTX or object code — none of those dialects ever round-trip back to MLIR bytecode. The compatibility rule is simple: cuda_tile is the only TileIR dialect with a linked bytecode reader, and no TileIR dialect in this binary ships a bytecode writer. Downstream dialects — nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass — are in-memory pipeline representations.
Reader architecture
A single reader-driver walks the bytecode container in a fixed order: file envelope (magic, version, dialect-list), string section, type section, attribute section, IR section (functions and globals), resource section, and an optional debug section. The driver reads each section header, validates the recorded byte count against the available span, and hands the section body to a per-section dispatcher. Section dispatchers iterate the body, reading one record at a time and routing each record through a tag-keyed switch onto a typed handler.
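The byte-count validation step is easy to get wrong when the recorded count is attacker-controlled, because a naive `offset + count <= len` comparison can wrap. The following is a minimal sketch of an overflow-safe check; `SectionSpan` here is a hypothetical two-field struct and `section_fits` a hypothetical helper, not the reader's actual types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical span record: where a section body starts and how long it is. */
typedef struct { size_t begin; size_t size; } SectionSpan;

/* Accepts a section whose recorded byte count fits in what remains of the
 * input after its header. The comparison is arranged so that no addition
 * can wrap, whatever value the header recorded. */
static bool section_fits(size_t input_len, size_t body_offset,
                         uint64_t recorded_count, SectionSpan *out) {
    assert(out != NULL);
    if (body_offset > input_len) return false;
    if (recorded_count > (uint64_t)(input_len - body_offset)) return false;
    out->begin = body_offset;
    out->size = (size_t)recorded_count;
    return true;
}
```

Capturing the span before decoding the body is also what lets a reader report a malformed section separately from a malformed envelope.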
The reader is decoder-only at every level. Four wire-format dispatchers carry the work — one for operation opcodes (110 cases over the cuda_tile public opcode space), one for self-contained attribute payloads (13 cases, wire-format-breaking versus upstream MLIR), one for type tags (18 cases), and one for debug attributes (7 cases). None has a sibling writer dispatcher linked in. A reimplementation that wants to produce TileIR bytecode must build its own encoder against the tag numberings documented in MLIR Bytecode Format; the shipped reader is the only source of truth for the wire-format constants, and the attribute-tag numbering deliberately diverges from upstream MLIR (tag 1 is StringAttr, tag 13 is AssumePredicateAttr, magic byte 7 is 0x00).
Status Matrix
| Dialect | Bytecode reader | Bytecode writer | Public meaning |
|---|---|---|---|
| cuda_tile | Present | Absent | Input wire format accepted by the driver. |
| nv_tileaa | Absent | Absent | Produced by lowering from cuda_tile; not loadable from bytecode. |
| nv_tileas | Absent | Absent | Produced by TileAA-to-TileAS conversion; not loadable from bytecode. |
| cute | Absent | Absent | Persisted through textual asm only when dumped. |
| cute_nvgpu | Absent | Absent | Persisted through textual asm only when dumped. |
| cutlass | Absent | Absent | Frontend scheduling dialect inside the pipeline, not a bytecode format. |
Upstream MLIR builtin bytecode support is still linked in because the file container uses MLIR infrastructure for built-in types and attributes. That does not mean the TileIR dialects themselves provide general MLIR bytecode round-tripping.
Reader contract
The cuda_tile reader is the only path that materializes IR at the driver boundary. Its top-level loop validates the envelope, scans the section table, then dispatches each section in container order — the later sections reference indices into the earlier ones, so reordering would break cross-section lookups.
ModuleOp read_tileir_module(ByteSpan input, MLIRContext *ctx) {
BytecodeReader r = bytecode_reader_init(input);
/* envelope: 8-byte magic, LEB128 version, dialect-list, blob preamble */
if (!read_and_verify_magic(&r)) { diag(r, "invalid TileIR magic"); return nullptr; }
uint64_t version = read_leb128(&r);
if (!version_is_supported(version)) { diag(r, "unsupported version"); return nullptr; }
DialectList dialects = read_dialect_list(&r); /* requires "cuda_tile" entry */
/* sections appear in a fixed order; later sections index into earlier ones */
SectionTable sections = scan_section_table(&r);
StringTable strings = read_string_section (&r, sections.string);
TypeTable types = read_type_section (&r, sections.type, ctx, strings);
AttributeTable attrs = read_attribute_section(&r, sections.attribute, ctx, strings, types);
ResourceTable resources = read_resource_section (&r, sections.resource, ctx, strings);
DebugTable debug = sections.debug.present
? read_debug_section (&r, sections.debug, ctx, strings, attrs)
: empty_debug_table();
ModuleOp module = create_builtin_module(ctx);
read_ir_section(&r, sections.ir, module, strings, types, attrs, resources, debug);
return module;
}
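The `read_leb128` call in the envelope step can be sketched as below, assuming the version field uses standard unsigned LEB128 (seven payload bits per byte, high bit as the continuation flag). That encoding is an assumption drawn from the "LEB128" label in the comment above; `read_uleb128` is a hypothetical helper, and the exact VarInt variant used elsewhere in the container is not confirmed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one unsigned LEB128 value from buf[*pos..len). On success stores
 * the value, advances *pos past it, and returns true. Returns false, leaving
 * *pos untouched, on truncated input or an encoding longer than ten bytes. */
static bool read_uleb128(const uint8_t *buf, size_t len, size_t *pos,
                         uint64_t *out) {
    assert(buf != NULL && pos != NULL && out != NULL);
    uint64_t value = 0;
    size_t p = *pos;
    for (unsigned shift = 0; shift < 64; shift += 7) {
        if (p >= len) return false;                /* ran off the input */
        uint8_t byte = buf[p++];
        value |= (uint64_t)(byte & 0x7F) << shift; /* low 7 bits are payload */
        if ((byte & 0x80) == 0) {                  /* high bit clear: last byte */
            *pos = p;
            *out = value;
            return true;
        }
    }
    return false;                                  /* continuation past 10 bytes */
}
```

Leaving the cursor unchanged on failure lets the caller attach the diagnostic to the offset where the field started.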
Each section dispatcher follows the same shape — read a small header (record count, dialect index, optional flags), then iterate record bodies, routing each record's lead byte through the appropriate tag switch:
ParseResult read_attribute_section(BytecodeReader *r, SectionSpan span, ...) {
bytecode_reader_seek(r, span.begin);
uint64_t count = read_leb128(r);
for (uint64_t i = 0; i < count; ++i) {
uint8_t tag = read_byte(r);
switch (tag) {
case ATTR_STRING: parse_string_attr(r, /*has_dialect=*/false); break; // tag 1
case ATTR_FLOAT: parse_float_attr(r); break; // tag 2
case ATTR_TYPE: parse_type_attr(r); break; // tag 3
case ATTR_DENSE_ELT: parse_dense_elements_attr(r); break; // tag 4
case ATTR_DENSE_ELT_STRING: parse_dense_elements_string_attr(r); break; // tag 5
case ATTR_DIV_BY: parse_div_by_attr(r); break; // tag 6
case ATTR_DENSE_I64_ARRAY_A: parse_dense_i64_array_attr_a(r); break; // tag 7
case ATTR_DENSE_I64_ARRAY_B: parse_dense_i64_array_attr_b(r); break; // tag 8
case ATTR_SAME_ELEMENTS: parse_same_elements_attr(r); break; // tag 9
case ATTR_BOUNDED_LO: parse_bounded_attr(r, /*variant=*/0); break; // tag 10
case ATTR_BOUNDED_HI: parse_bounded_attr(r, /*variant=*/1); break; // tag 11
case ATTR_BOUNDED_LO_HI: parse_bounded_attr(r, /*variant=*/2); break; // tag 12
case ATTR_ASSUME_PREDICATE: parse_assume_predicate_attr(r); break; // tag 13
default: diag_unsupported_attr_tag(r, tag); break;
}
}
return success();
}
The tag-to-attribute-kind mapping in the case names above is the wire-format-breaking tileiras numbering documented in MLIR Bytecode Format — Self-Contained Attribute Dispatch. The upstream MLIR kinds IntegerAttr, BoolAttr, ArrayAttr, DictionaryAttr, and OptimizationHintsAttr have no arm in this dispatcher; they arrive instead through the upstream MLIR builtin reader path (the builtin dialect's own bytecode arms).
Operation records inside the IR section follow the same dispatch shape with the 110-case opcode switch in place of the 13-case attribute switch. Type records use the 18-case type switch; debug records use the 7-case debug switch. The four switches are independent — they share no fallthrough — and each one terminates a section: when the section's byte count is exhausted, the reader returns to the driver loop.
Non-cuda_tile TileIR bytecode is rejected at the driver boundary. When the input looks like ordinary upstream MLIR bytecode (different magic byte 7, different attribute-tag numbering), the driver reports that shape explicitly instead of silently reinterpreting it as TileIR.
Reader-only contract
The missing writers are user-visible. A tool can hand tileiras a cuda_tile bytecode module and ask for compiled output, but it cannot ask this binary to emit optimized TileIR bytecode or any intermediate-dialect bytecode. The capability surface reduces to a pair of one-line predicates:
bool tileiras_can_read_bytecode (const char *dialect) { return strcmp(dialect, "cuda_tile") == 0; }
bool tileiras_can_write_bytecode(const char *dialect) { (void)dialect; return false; }
The asymmetry is deliberate. Round-trip workflows use textual IR dumps for inspection; cacheable intermediate artifacts require an external writer linked against compatible dialect implementations. Treating the intermediate dialects as encoder-absent on purpose lets the pipeline evolve without freezing a stable wire format for every internal representation — the only stable surface is the cuda_tile input boundary.
Several driver behaviors fall out of this asymmetry: the command-line input must be TileIR bytecode (not generic MLIR bytecode); the driver exposes no --emit-bytecode or --write-bytecode mode; intermediate IR dumps, when enabled, are textual MLIR asm rather than bytecode; the internal dialect stack can change shape between releases without breaking external tooling.
Cross-references
The detailed wire format consumed by the reader-driver — file envelope, section ordering, every tag enumeration, the validation diagnostics, and the LLVM bitcode path used for NVVM modules — lives in MLIR Bytecode Format. The textual-asm side of the reader contract, including how intermediate dialects are inspected when bytecode serialization is unavailable, is covered in Dialect Asm-Printer Status. The dialect-level semantics that the bytecode reader materializes are documented in the per-dialect references — cuda_tile bytecode reference and cuda_tile Overview for the input dialect, plus the corresponding overviews for nv_tileaa, nv_tileas, cute, cute_nvgpu, and cutlass.
Dialect Asm-Printer Status
Abstract
An MLIR asm printer turns in-memory operations, attributes, types, and values into the textual .mlir form used for human inspection and crash dumps. Each dialect contributes to the result through three mechanisms: a printType / parseType pair that handles dialect-specific type bodies, a printAttribute / parseAttribute pair that handles dialect-specific attribute bodies, and an OpAsmDialectInterface that supplies short readable aliases plus per-value SSA-name hints. Per-operation pretty-printing is layered on top through OpAsmOpInterface hooks attached at ODS time. When a dialect leaves a hook unimplemented, MLIR's default trampoline takes over and emits the verbose generic form — "dialect.op"(operands) : (types) -> types for operations, !ns<...stored_payload...> for types, #ns<...stored_payload...> for attributes.
Textual assembly is the only inspection path for the non-input dialects inside tileiras, since this binary never serializes them as bytecode. The printer surface is intentionally uneven: dialects near the user boundary invest in custom spelling and aliases, while short-lived pipeline dialects fall back on the generic printer. Expect polished textual forms for cuda_tile, cute, and cute_nvgpu; expect generic MLIR for most nv_tileaa, nv_tileas, and cutlass operations, with a few aliases or SSA-name hints sprinkled in to keep large dumps legible.
Alias resolution
OpAsmDialectInterface exposes two virtual hooks the printer consults before falling back on the generic form: getAlias(Type, raw_ostream&) and getAlias(Attribute, raw_ostream&). Each hook returns an AliasResult (NoAlias, OverridableAlias, or FinalAlias) and, when the result is not NoAlias, writes the alias name into the stream. The printer queries every loaded dialect in registration order. A FinalAlias answer wins immediately and short-circuits the remaining dialects; an OverridableAlias answer is kept only as a provisional name that a later dialect may replace.
bool emit_type_with_alias(AsmPrinter *p, Type t) {
    SmallString<32> name;                          /* best alias accepted so far */
    bool found = false;
    for (Dialect *d : p->context()->loadedDialects()) {
        OpAsmDialectInterface *iface = d->getRegisteredInterface<OpAsmDialectInterface>();
        if (!iface) continue;
        SmallString<32> candidate;                 /* fresh buffer per dialect */
        raw_svector_ostream os(candidate);
        AliasResult r = iface->getAlias(t, os);
        if (r == AliasResult::NoAlias) continue;   /* discard anything the hook wrote */
        name = candidate;                          /* later answers replace overridable ones */
        found = true;
        if (r == AliasResult::FinalAlias) break;   /* no later dialect may override */
    }
    if (!found) return false;                      /* caller falls back to generic !ns<...> form */
    register_alias_decl(p, t, name);               /* emit `!name = type ...` at top of module */
    p->os() << "!" << name;
    return true;
}
When emit_type_with_alias returns false the printer writes the generic form — !ns<storage-blob> for parametric types, the registered mnemonic plus storage for ODS-generated types. Attribute printing follows the same shape with # in place of !.
Per-dialect feature matrix
The table below summarizes which textual-IR hook each dialect installs. "ODS/default" means the slot is wired by the TableGen-generated dialect registration to MLIR's default trampoline, which reads the registered mnemonic and storage and emits the canonical form. "empty/default" and "disabled" mark stub slots: the body either does nothing or emits a "parsing in dialect '<ns>' is disabled" diagnostic. "real" means a hand-written dispatcher of non-trivial size.
| Dialect | printType | parseType | printAttribute | parseAttribute | OpAsmDialectInterface | per-op OpAsmOpInterface |
|---|---|---|---|---|---|---|
| cuda_tile | ODS/default | ODS/default | ODS/default | ODS/default | full aliasing and constant names | yes, including constants and selected TKO ops |
| nv_tileaa | ODS/default | ODS/default | ODS/default | ODS/default | absent | yes on six operations |
| nv_tileas | ODS/default | ODS/default | ODS/default | ODS/default | attribute and type aliases | none |
| cute | handled through printable type interfaces | disabled | ODS/default for registered attributes | real keyword parser | absent | none |
| cute_nvgpu | real type printer | real type parser | empty/default | disabled | type aliases | none |
| cutlass | empty/default | disabled | empty/default | disabled | absent | none |
cuda_tile — user-facing input syntax
cuda_tile has the richest textual surface. Constants receive stable SSA-name
hints — cst, true, false, cst_NaN, cst_<int> — that keep debug dumps
legible. Selected TKO load/store and atomic operations carry hand-written
printers and parsers instead of generic MLIR spelling.
nv_tileaa — generic dialect with a few name hints
nv_tileaa installs no dialect-wide asm aliases — most operations print in
generic MLIR form. Six operations attach per-op asm interfaces, and the only
pretty-name behavior worth knowing lives on nv_tileaa.load: the value result
is named result, and the optional memory-token result is named resultMemToken.
nv_tileas — aliases for scheduling concepts
nv_tileas falls back on generic operation printing for most ops but ships
useful dialect-level aliases for scheduling attributes and types. Attribute
aliases cover memory-space layouts, copy atoms, reduction atoms, MMA atoms,
and resource requirements. Type aliases cover pipeline and role-qualified
iterator types such as producer and consumer iterators.
cute — attributes are the serialized type surface
cute disables standalone type parsing. Its canonical textual form represents
types as #cute.<keyword> attributes rather than !cute.<keyword> types. The
attribute parser recognizes layout-algebra terms — coord, stride, shape,
tile, swizzle, layout, composed_layout, ptr, memref, coord_tensor —
along with constrained integer forms.
cute_nvgpu — architecture atom spelling
cute_nvgpu ships a full type parser/printer for architecture-specific MMA, copy, TMA, shared-memory descriptor, and tensor-memory atoms. Its OpAsmDialectInterface aliases collapse the otherwise unwieldy stored type body — element-type triples, operand orderings, instruction shapes, swizzle patterns — into a short label that survives in 10 000-line dumps.
The alias hook is a discriminated dispatch on the type's class id. Each branch builds the alias name from the type's own accessors rather than from the encoded storage body, so the alias survives parameter reordering inside the storage struct.
AliasResult cute_nvgpu_type_alias(Type t, raw_ostream &os) {
    if (auto m = dyn_cast<MemRefAtomType>(t)) {
        os << "memref_" << element_type_keyword(m.getElementType())
           << "_" << m.getRank();
        return AliasResult::OverridableAlias;
    }
    if (auto c = dyn_cast<CopyAtomType>(t)) {
        os << "Cp(" << element_type_keyword(c.getElementType()) << ","
           << c.getShape().M << "x" << c.getShape().N << ","
           << layout_keyword(c.getLayout()) << ")";
        return AliasResult::FinalAlias;
    }
    if (auto m = dyn_cast<MmaAtomType>(t)) {
        os << "Mma(m" << m.getInstShape().M
           << "n" << m.getInstShape().N
           << "k" << m.getInstShape().K << ","
           << element_type_keyword(m.getElementTypeA()) << ","
           << layout_keyword(m.getLayoutA()) << ","
           << layout_keyword(m.getLayoutB()) << ")";
        return AliasResult::FinalAlias;
    }
    if (isa<TmaDescriptorType, SharedDescriptorType, TensorMemoryType>(t))
        return tma_family_alias(t, os);
    return AliasResult::NoAlias;
}
The MMA alias exposes the instruction shape and element-type/layout triple without expanding the full atom storage body. A printed cute_nvgpu dump that would otherwise contain !cute_nvgpu.mma_atom<inst_shape = <m = 16, n = 8, k = 16>, a = <element = f16, layout = row>, b = <element = f16, layout = col>, c = <element = f32>> instead reads !Mma(m16n8k16,f16,row,col). The 23 typed copy and MMA atoms collapse to one short label each.
cutlass — generic spelling
cutlass leaves textual assembly to the framework on purpose. Its operations
carry registered attributes and opaque types, so the generic ODS printer
produces sufficient IR without dialect-wide aliases or per-op pretty names.
Practical rules
A reimplementer who wants compatible textual dumps starts at the cuda_tile boundary because every input module passes through it: stable constant SSA hints (cst, true, false, cst_NaN, cst_<int>) and the hand-written TKO load/store/atomic printers carry the largest readability payoff. The nv_tileaa.load result names result and resultMemToken are a fixed contract, because downstream regression dumps reference them. The cute invariant that all type-like syntax appears as #cute.<keyword> attributes rather than !cute.<keyword> types must be preserved because the textual parser refuses standalone cute types outright. The cute_nvgpu aliases for memref, copy, and MMA atoms cover the bulk of the alias-driven readability gain; per-atom custom printers can come later. Operations in cutlass and most of nv_tileas ship without aliases on purpose: the generic ODS form is precise enough and avoids divergence between the textual dialect description and the dialect's stored representation.
Cross-references
The companion page Dialect Bytecode Reader/Writer Status covers the input-side wire format that produces the IR these printers later spell. Layout-algebra spelling for the cute attribute family is documented in cute Layout Algebra and Descriptor Grammar; tensor-memory and TMA atom encodings are described under cute_nvgpu Asm Printer and Mnemonic Hash and cute_nvgpu TMA Atoms. The architectural rationale for keeping intermediate dialects on the generic textual form lives in cuda_tile Asm Printer.
cuda_tile Dialect Overview
Frontends write cuda_tile and the compiler promises to accept it. It is the
public input contract of tileiras — the only dialect a producer ever has to
construct — and the gate before lowering descends into the private TileAA,
TileAS, CuTe, CUTLASS, NVGPU, LLVM, and NVVM layers. In practice it is a
compact tile-programming IR: structured control flow, shaped tile values,
view-based memory access, token-threaded side effects, tensor-core operations,
and just enough attributes to preserve numeric and memory semantics until
target-specific lowering takes over.
Producers generate cuda_tile; reimplementers treat it as an ABI boundary. A
module that verifies here flows through the rest of the compiler without the
frontend ever touching nv_tileaa, nv_tileas, or any backend dialect.
Programming Model
A normal input module is rooted in cuda_tile.module and contains one or more
cuda_tile.entry operations that each become a GPU kernel. Inside each entry,
the dialect carries its own structured control flow (if, for, loop,
yield, break, continue, return) so frontends never have to lower into
scf or func first.
Values fall into four broad categories:
| Category | Role |
|---|---|
| Tiles | Shaped SSA values with static rank and element type. |
| Views | ptr, tensor_view, and partition_view values that describe memory. |
| Tokens | Ordering edges for memory operations with side effects. |
| Scalars and attributes | Numeric operands, predicates, rounding modes, padding values, and optimization hints. |
The dialect is target-aware but not target-lowered. Accepted element types are
f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2, and the integer
widths i1, i8, i16, i32, i64. Architecture-specific choices —
MMA atom selection, TMA materialization, register allocation, FP4/FP6
microscaling, final PTX features — all come later, in the private lowering
pipeline.
Operation Families
The operation surface is best understood by family rather than by registration order:
| Family | Examples | Contract |
|---|---|---|
| Arithmetic and logic | addf, addi, mulf, cmpf, cmpi, shli, xori, fma | Operate on scalar or tile-shaped values while preserving explicit signedness, overflow, comparison, rounding, and fast-math attributes. |
| Math intrinsics | exp, exp2, log, log2, pow, rsqrt, sin, cos, sqrt, tanh | Preserve source-level numeric intent until lowered to math, NVVM, or backend intrinsics. |
| Memory and pointers | load_ptr_tko, load_view_tko, store_ptr_tko, store_view_tko, atomic_cas_tko, atomic_rmw_tko, offset | Express typed global-memory access and atomics through explicit token dependencies. |
| Structured control flow | module, entry, if, for, loop, yield, break, continue, return | Keep kernel structure and region control flow in the public dialect. |
| Tile shape algebra | broadcast, cat, extract, permute, reshape, iota, select | Transform tile shapes and values without choosing hardware layout yet. |
| Reductions and scans | reduce, scan | Carry reduction dimensions, identities, and pure body regions. |
| MMA | mmaf, mmai | Describe matrix multiply-accumulate intent before atom selection and schedule generation. |
| Conversion | exti, trunci, itof, ftoi, ftof, bitcast, int_to_ptr, ptr_to_int, ptr_to_ptr | Make type changes explicit so the first lowering pass can preserve legality. |
| Diagnostics and assumptions | assert, assume, print, constant, global, get_global | Preserve compile-time constants, diagnostics, globals, and optimization assumptions. |
The exact roster is maintained in Operation Roster. Two practical
version deltas matter for producers targeting this binary: the emitted mnemonic
is cuda_tile.print, not the open-source cuda_tile.print_tko, and the build
rejects cuda_tile.atan2 outright.
Type Contracts
cuda_tile types describe the source-level shape and memory model. They should
be treated as verifier-backed contracts, not as backend storage layouts.
| Type | Meaning | Main verifier contract |
|---|---|---|
| cuda_tile.tile | Static shaped value with an element type. | Dimensions are positive powers of two; total element count is capped. |
| cuda_tile.ptr | Typed global pointer to a numeric scalar element. | Pointee type is numeric; pointer-to-pointer is rejected. |
| cuda_tile.tensor_view | Element type plus tensor shape and stride metadata. | Shape and stride ranks match; static dimensions and strides are positive. |
| cuda_tile.partition_view | Tile partition over a tensor view. | Tile rank matches tensor rank; dim_map covers each tile dimension exactly once; padding is type-compatible. |
| cuda_tile.token | Zero-runtime ordering marker. | Used as an SSA dependency for side-effecting operations. |
| cuda_tile.string | Observed binary type for string-like handles. | Treat as implementation-specific unless the producer is targeting this exact binary contract. |
The tile-shape verifier walks the shape, rejecting non-positive or non-power-of-two dimensions and enforcing a ceiling of 16 × 1024 × 1024 (16,777,216) elements. Before each multiplication it compares the running element count against the ceiling divided by the next dimension, so an oversized product is rejected before it can overflow.
LogicalResult verify_tile_shape(ArrayRef<int64_t> shape) {
    const int64_t max_elements = 16 * 1024 * 1024;
    int64_t elements = 1;
    for (int64_t dim : shape) {
        if (dim <= 0) {
            return emit_error("tile dimensions must be positive");
        }
        if ((dim & (dim - 1)) != 0) {
            return emit_error("tile dimensions must be powers of two");
        }
        if (elements > max_elements / dim) {
            return emit_error("tile would exceed the maximum element count");
        }
        elements *= dim;
    }
    return success();
}
tensor_view uses dynamic shape and stride slots, but each dynamic slot is
still part of a fixed-rank type. The verifier rejects rank mismatches between
shape and stride and rejects any non-positive static dimension or static stride.
LogicalResult verify_tensor_view(Type element_type,
                                 ArrayRef<int64_t> shape,
                                 ArrayRef<int64_t> stride) {
    if (shape.size() != stride.size()) {
        return emit_error("tensor_view shape and stride must have the same rank");
    }
    for (int64_t dim : shape) {
        if (dim != kDynamic && dim <= 0) {
            return emit_error("static tensor_view dimensions must be positive");
        }
    }
    for (int64_t step : stride) {
        if (step != kDynamic && step <= 0) {
            return emit_error("static tensor_view strides must be positive");
        }
    }
    return success();
}
partition_view is the bridge between logical tensors and tile-shaped access.
The verifier checks rank agreement, validates dim_map as an injective function
into tensor axes, enforces the power-of-two tile shape rule, and gates special
padding values on floating-point element types.
LogicalResult verify_partition_view(ArrayRef<int32_t> tile_shape,
                                    TensorViewType tensor,
                                    ArrayRef<int32_t> dim_map,
                                    Optional<PaddingValue> padding) {
    if (tile_shape.empty()) {
        return emit_error("partition tiles must have rank");
    }
    if (tile_shape.size() != tensor.rank()) {
        return emit_error("partition tile rank must match tensor rank");
    }
    if (dim_map.size() != tile_shape.size()) {
        return emit_error("dim_map must cover every tile dimension");
    }
    BitSet used_tensor_dims(tensor.rank());
    for (size_t tile_dim = 0; tile_dim < dim_map.size(); ++tile_dim) {
        if (tile_shape[tile_dim] <= 0) {
            return emit_error("partition tile dimensions must be positive");
        }
        if (!is_power_of_two(tile_shape[tile_dim])) {
            return emit_error("partition tile dimensions must be powers of two");
        }
        int32_t tensor_dim = dim_map[tile_dim];
        if (tensor_dim < 0 || tensor_dim >= (int32_t)tensor.rank()) {
            return emit_error("dim_map target must be inside the tensor rank");
        }
        if (used_tensor_dims.test(tensor_dim)) {
            return emit_error("dim_map must not map two tile dimensions to one tensor dimension");
        }
        used_tensor_dims.set(tensor_dim);
    }
    if (padding.has_value() && padding->is_nan_or_infinity_or_negative_zero()) {
        if (!tensor.element_type().is_float()) {
            return emit_error("special padding values require a floating-point element type");
        }
    }
    return success();
}
Memory and Tokens
The _tko suffix means token-ordered. Memory effects ride on dataflow: the
token is an SSA value, and a pass may reorder memory operations only when it
preserves the dependency graph that ties them together.
struct Token {};
struct LoadResult {
    Value value;
    Token token;
};
LoadResult load_ptr_tko(Pointer ptr, Indices indices, Token in);
LoadResult load_view_tko(PartitionView view, Indices indices, Token in);
Token store_ptr_tko(Pointer ptr, Indices indices, Value value, Token in);
Token store_view_tko(PartitionView view, Indices indices, Value value, Token in);
struct AtomicResult {
    Value old_or_result;
    Token token;
};
AtomicResult atomic_rmw_tko(Pointer ptr, AtomicOp op, Value value, Token in);
AtomicResult atomic_cas_tko(Pointer ptr, Value expected, Value desired, Token in);
A pass may delete, merge, or reorder token-ordered operations only when the observable token order survives intact. That is the source-level memory contract that later TileAA and TileAS passes refine into schedulable memory operations.
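One way to read the reordering rule is as a reachability question: two token-ordered operations are unordered only when neither can be reached from the other along token edges. The sketch below assumes a hypothetical `TokenGraph` model (it is not a tileiras data structure) in which node i is a memory operation and an edge runs from the producer of a token to each consumer of that token. It checks token ordering only; alias analysis and other legality conditions are outside its scope.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: node i is a token-ordered op; succ[i] lists the ops
// that consume the token produced by op i.
struct TokenGraph {
    std::vector<std::vector<std::size_t>> succ;

    explicit TokenGraph(std::size_t op_count) : succ(op_count) {}

    void add_edge(std::size_t producer, std::size_t consumer) {
        assert(producer < succ.size() && consumer < succ.size());
        succ[producer].push_back(consumer);
    }

    // Iterative depth-first search: is `to` reachable from `from`?
    bool reaches(std::size_t from, std::size_t to) const {
        std::vector<bool> seen(succ.size(), false);
        std::vector<std::size_t> stack;
        stack.push_back(from);
        while (!stack.empty()) {
            std::size_t n = stack.back();
            stack.pop_back();
            if (n == to) return true;
            if (seen[n]) continue;
            seen[n] = true;
            for (std::size_t s : succ[n]) stack.push_back(s);
        }
        return false;
    }

    // Two ops may swap only if no token path orders one before the other.
    bool may_reorder(std::size_t a, std::size_t b) const {
        return !reaches(a, b) && !reaches(b, a);
    }
};
```

With edges 0→1→2 and 0→3, operations 2 and 3 sit on independent chains and may swap, while 0 and 2 may not because a token path connects them.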
Semantic Attributes
The attribute set is small but consequential. Most attributes are not decoration — they constrain legal lowering:
| Attribute family | Used by | Meaning |
|---|---|---|
| Comparison predicate/order | cmpf, cmpi, select-like rewrites | Ordered/unordered floating compares and integer predicate selection. |
| Signedness and overflow | Integer arithmetic, shifts, conversions | Whether integer operations are signed and whether overflow has defined assumptions. |
| Rounding and padding | Floating conversions, partition views | Rounding mode selection and legal fill value for out-of-bounds view reads. |
| Optimization hints | Entries, memory ops, layout-sensitive ops | Producer-supplied scheduling and target hints keyed by architecture or operation kind. |
| Assumption predicates | assume and related transforms | Facts such as divisibility, boundedness, and same-elements properties. |
| Debug info | source locations and lexical scopes | Optional provenance carried through lowering when debug/line info is enabled. |
Key design choice: public because it's the API
cuda_tile is public because it is the producer-facing API. Every dialect
below it is an implementation detail. A frontend should construct valid
cuda_tile, serialize it as TileIR bytecode, and hand it to tileiras —
never touching internal TileAA or TileAS operations.
The lowering direction is one-way. The driver runs in three phases.
Phase 1: the verifier rejects modules that contain operations from any
non-public dialect. A producer that emits IR through this entry point must
restrict itself to cuda_tile, builtin, and a small set of supporting
upstream dialects (arith constants, func symbol references, debug-info
attributes). Any other dialect at this point is a producer bug.
Phase 2: a partial dialect conversion drives the rewrite. The conversion
target marks cuda_tile illegal, marks the destination dialects (arith,
math, func, gpu, scf, nv_tileaa) legal, and registers a dynamic
legality check on ub.poison so untyped poison values pick up legal
TileAA types as they flow through. Each cuda_tile op carries a conversion
pattern that emits the corresponding TileAA shape; the type converter
rewrites scalar, tile, pointer, view, and token types in parallel.
Phase 3: a post-conversion verifier confirms that no cuda_tile operation
survived the conversion. After this point, ordinary producers will never see
cuda_tile again; the rest of the pipeline works in progressively more
hardware-facing internal dialects (TileAA → TileAS → CuTe → NVGPU → NVVM →
LLVM).
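The phase 1 and phase 3 gates can be pictured as simple filters over fully qualified operation names. The sketch below is illustrative only: the helper names are hypothetical, and the allowed-dialect list (`cuda_tile`, `builtin`, `arith`, `func`) is an assumption drawn from the description above, which does not pin down the exact set or the per-op restrictions inside the supporting dialects.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Dialect namespace of an op name: "cuda_tile.addf" -> "cuda_tile".
// Names without a '.' are treated as builtin in this sketch.
inline std::string dialect_of(const std::string &op_name) {
    std::string::size_type dot = op_name.find('.');
    return dot == std::string::npos ? std::string("builtin") : op_name.substr(0, dot);
}

// Phase 1 (assumed allow-list): every op must come from an input dialect.
inline bool passes_input_gate(const std::vector<std::string> &ops) {
    static const std::vector<std::string> allowed = {"cuda_tile", "builtin", "arith", "func"};
    return std::all_of(ops.begin(), ops.end(), [](const std::string &op) {
        return std::find(allowed.begin(), allowed.end(), dialect_of(op)) != allowed.end();
    });
}

// Phase 3: no cuda_tile op may survive the conversion.
inline bool passes_post_conversion_gate(const std::vector<std::string> &ops) {
    return std::none_of(ops.begin(), ops.end(), [](const std::string &op) {
        return dialect_of(op) == "cuda_tile";
    });
}
```

A module that already contains `nv_tileaa.load` fails the input gate, and a converted module that still contains `cuda_tile.print` fails the post-conversion gate.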
The driver is structured as a single greedy pass rather than a per-family
sweep because the rewrite patterns produce IR that immediately matches further
patterns: a cuda_tile.load_view_tko lowers into a TileAA tiled_load that
exposes new shape and layout structure to the next op's lowering. A
per-family sweep would force a fixed phase order; the greedy pass lets pattern
match order respond to the IR as the conversion produces it.
Open-source cross-reference
The public cuda_tile source distribution is the best reference for syntax,
ODS definitions, operation classes, type definitions, and dialect interfaces.
The binary follows that public surface with the practical deltas noted above:
print_tko is exposed as print, atan2 is absent, and this binary also
contains an implementation-specific cuda_tile.string type.
The useful public source anchors are:
| Area | Public source role |
|---|---|
| Dialect initialization | Registers attributes, types, operations, and dialect interfaces. |
| Operation definitions | TableGen records for the accepted cuda_tile.* operation surface. |
| Type definitions | TableGen and C++ verifier/printer code for tile, pointer, tensor view, partition view, and token types. |
| Interfaces | Inlining and asm-printing behavior. |
| Optimizer transforms | Public cleanup transforms that overlap conceptually with, but do not fully describe, the binary's private lowering pipeline. |
AbstractOperation Record
Every registered op in cuda_tile carries one AbstractOperation descriptor.
The dialect constructor walks its 92-op roster, allocates one descriptor per
op, fills it from that op's registration thunk, and appends it to the
dialect's registered-op vector. An Operation* resolves through its
OperationName slot into this descriptor to reach the dialect's interface
tables and fold callback.
The descriptor's logical layout:
| Slot | Purpose |
|---|---|
| op vtable | Per-op dispatch (operand/result accessors, asm-printer hooks). |
| mnemonic | An embedded StringRef pointing at a read-only literal in the binary's .rodata. |
| inliner interface | Inlining policy for this op. |
| asm interface | Custom asm-printer/parser behavior. |
| fold interface | Operation-fold concept model. |
| type-inference interface | Result-type inference. |
| bytecode interface | Bytecode round-trip. |
| memory-effects interface | Whether the op reads, writes, or allocates memory. |
| destination-style interface | Tensor-style operand/result mapping. |
| extra interface slots | Reserved for future concept models. |
| fold callback | Per-op rewriter that runs during the canonicalize step. |
The descriptor slab is zero-initialized, so unused interface slots stay null and
the dispatcher probes them without a presence flag. The mnemonic field is an
embedded StringRef that points at the binary's read-only literal, not a
heap-interned copy — the ASM printer and the verifier read it back verbatim.
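The two properties described above, null-means-absent interface slots and a mnemonic that aliases a static literal, can be illustrated with a small stand-in structure. Everything here is hypothetical: `OpDescriptor`, `LiteralRef`, the slot enumeration, and the field order are invented for illustration and do not reproduce the binary's actual layout.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for an embedded StringRef: pointer plus length, no ownership.
struct LiteralRef {
    const char *data;
    std::size_t size;
};

// Illustrative slot names taken from the logical layout table.
enum InterfaceSlot : std::size_t {
    kInliner, kAsm, kFold, kTypeInference, kBytecode,
    kMemoryEffects, kDestinationStyle, kNumInterfaceSlots
};

struct OpDescriptor {
    const void *op_vtable;
    LiteralRef mnemonic;
    const void *interfaces[kNumInterfaceSlots];
    bool (*fold_callback)(void *op);
};

// Build a descriptor whose mnemonic points at the caller's literal.
template <std::size_t N>
OpDescriptor make_descriptor(const char (&literal)[N]) {
    OpDescriptor d{};             // value-initialization zeroes every slot
    d.mnemonic.data = literal;    // alias the literal, do not copy it
    d.mnemonic.size = N - 1;      // exclude the trailing NUL
    return d;
}

// A null slot means "interface not implemented"; no presence flag needed.
inline bool has_interface(const OpDescriptor &d, InterfaceSlot slot) {
    assert(slot < kNumInterfaceSlots);
    return d.interfaces[slot] != nullptr;
}
```

Because the descriptor starts zeroed, a dispatcher can probe any slot directly and treat a null pointer as absence.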
The descriptors sit consecutively in a statically-allocated array. The dialect
indexes the array by mnemonic hash through the registration helper documented
in TypeID Sentinels and Anchors;
live Operation* instances reach the descriptor through their OperationName
slot — the resolution path documented in
Operation Layout — Pointer-Identity Dispatch. The
per-op fold-callback assignments for the rest of the roster are catalogued in
Operation Roster — Op Method Surface.
Cross-links
- Frontend Contract and Tile IR Emission — producer-facing rules for kernel signatures, attribute namespaces, operand-order conventions, and the bytecode-format constraints a conformant frontend must satisfy.
- Operation Roster — operation families, producer contract, and version-specific mnemonic notes.
- Types and Attributes — public types, element predicates, semantic attributes, assumption predicates, and optimization hints.
- Verifiers — numeric, memory, region, aggregate, and MMA verification contracts.
- Canonicalizers and Folds — public folds, select and if rewrites, and the recursive simplifier contract.
- Assembly Printer — textual assembly, token-memory syntax, attribute elision, enum spellings, and SSA result-name hints.
cuda_tile Operation Roster
Abstract
A frontend emitting cuda_tile is writing tile values, structured kernel
control flow, token-ordered memory effects, tensor views, matrix
multiply-accumulate intent, and source-level numeric attributes — everything
the compiler will subsequently lower into private implementation dialects.
This page is the producer and reimplementation reference: operation families,
the behavior each family promises, and how a compiler should lower the surface
without leaning on internal registration details.
In this build, the token-ordered print operation is spelled cuda_tile.print.
The newer cuda_tile.atan2 is rejected outright, so a frontend that supports
multiple TileIR revisions should gate it behind explicit version logic.
Operation Families
| Family | Operations | Contract |
|---|---|---|
| Floating and integer arithmetic | absf, absi, addf, addi, ceil, cmpf, cmpi, cos, cosh, divf, divi, exp, exp2, floor, fma, log, log2, maxf, maxi, minf, mini, mulf, mulhii, muli, negf, negi, pow, remf, remi, rsqrt, sin, sinh, sqrt, subf, subi, tan, tanh | Operate elementwise on scalar or tile values while preserving rounding, signedness, overflow, comparison, and fast-math choices. |
| Integer logic | andi, ori, shli, shri, xori | Bitwise and shift operations over integer scalar or tile values. |
| Token-ordered memory | load_ptr_tko, load_view_tko, store_ptr_tko, store_view_tko, atomic_cas_tko, atomic_rmw_tko, make_token, join_tokens, offset, global, get_global, make_tensor_view, make_partition_view | Express pointer, view, global, token, and atomic memory behavior without committing to backend layout or scheduling. |
| Structured control flow | module, entry, if, for, loop, yield, break, continue, return, assert, assume | Keep kernel structure in the source dialect and verify region arity, yielded values, and early-exit ancestry. |
| Shape algebra | broadcast, cat, extract, get_index_space_shape, get_num_tile_blocks, get_tensor_shape, get_tile_block_id, iota, permute, reshape | Transform tile rank, tile extents, launch geometry, and indexing without choosing hardware layout. |
| Reductions and scans | reduce, scan | Carry the reduction dimension, identities, input/result types, and pure combiner body. |
| Matrix multiply-accumulate | mmaf, mmai | Preserve floating and integer MMA intent until atom selection and scheduler lowering. |
| Type conversion | bitcast, exti, ftof, ftoi, int_to_ptr, itof, ptr_to_int, ptr_to_ptr, trunci | Make widening, narrowing, bit reinterpretation, float/int conversion, and pointer casts explicit. |
| Constants, selection, diagnostics | constant, select, print | Materialize literal values, value selection, and token-ordered runtime diagnostics. |
The family boundaries are semantic, not syntactic. fma is arithmetic because
it is elementwise; mmaf and mmai are MMA because they contract matrix
dimensions. assert and assume live with control flow because regions and
dominance scope their meaning, even though their payload is an attribute or
predicate.
Producer Contract
A valid producer should build modules with this shape:
cuda_tile.module {
  cuda_tile.entry @kernel(%arg0 : !cuda_tile.tensor_view<...>) {
    %tok0 = cuda_tile.make_token : !cuda_tile.token
    %tile, %tok1 = cuda_tile.load_view_tko %view[%i, %j] token=%tok0
    %acc = cuda_tile.mmaf %a, %b, %c : ...
    %tok2 = cuda_tile.store_view_tko %view[%i, %j], %acc token=%tok1
    cuda_tile.return
  }
}
The exact textual syntax is described in Assembly Printer, but the contract is independent of formatting:
- memory effects are threaded through cuda_tile.token;
- tile values have static rank and element type;
- view values carry shape and stride metadata;
- structured control flow yields values rather than branching through cf;
- numeric choices such as rounding and signedness are attributes, not implicit frontend assumptions;
- debug info and optimization hints may be present but must not be required for semantic correctness.
Lowering Sketch
The first lowering stage converts public cuda_tile into alias-aware TileAA.
Arithmetic and shape operations keep their mathematical meaning intact. Memory
operations gain explicit memref and token structure. Control flow is rewritten
only once region and token legality are already proven.
Module lower_cuda_tile_to_tileaa(Module module, Target target) {
    require(module.only_uses_dialect("cuda_tile", "builtin", "arith"));
    verify_cuda_tile_module(module, target);
    TypeConverter types;
    types.add(convert_scalar_type);
    types.add(convert_tile_type);
    types.add(convert_pointer_type);
    types.add(convert_view_type);
    types.add(convert_token_type);
    RewritePatternSet patterns;
    add_arithmetic_patterns(patterns, types);
    add_shape_patterns(patterns, types);
    add_memory_patterns(patterns, types);
    add_control_flow_patterns(patterns, types);
    add_mma_patterns(patterns, types);
    apply_conversion(module, patterns);
    require(!module.contains_dialect("cuda_tile"));
    return module;
}
Lowering must not erase source-level facts prematurely. A load_view_tko
becomes an operation with explicit view, index, mask, fallback, memory
ordering, memory scope, and token dependencies — not an unstructured pointer
load until the alias and layout passes have the context to handle it safely.
Numeric Operations
Arithmetic ops accept scalar or tile-shaped operands. Tile operands must agree on shape and element type unless the op has an explicit shape-changing contract. Floating operations carry rounding mode and flush-to-zero policy forward until a lower dialect decides whether the target instruction can encode those choices directly.
Value lower_elementwise_arith(ArithOp op) {
    require_same_shape(op.operands);
    require_legal_element_type(op);
    NumericPolicy policy = {
        .rounding = op.rounding_mode,
        .flush_to_zero = op.flush_to_zero,
        .signedness = op.signedness,
        .overflow = op.overflow,
    };
    return tileaa_elementwise(op.kind, op.operands, policy);
}
mulhii returns the high half of a signed integer product. Implement it as a
wide multiply followed by a high-half extract — never as ordinary
multiplication that relies on target-width overflow.
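As a concrete illustration of the wide-multiply rule, here is a minimal sketch for 32-bit elements. The function name `mulhi_s32` and the choice of 32-bit width are assumptions made for the example; the same pattern applies to other widths with a correspondingly wider intermediate.

```cpp
#include <cassert>
#include <cstdint>

// High 32 bits of the signed 64-bit product of two 32-bit operands.
inline std::int32_t mulhi_s32(std::int32_t a, std::int32_t b) {
    // Widen both operands first so the product cannot overflow.
    std::int64_t wide = static_cast<std::int64_t>(a) * static_cast<std::int64_t>(b);
    // Right shift of a negative value is arithmetic from C++20 onward and on
    // mainstream compilers before that, so the sign is preserved.
    return static_cast<std::int32_t>(wide >> 32);
}
```

For example, `mulhi_s32(0x40000000, 4)` yields 1 because the full product is 2^32, whereas a 32-bit multiply that wraps would leave no trace of the high half.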
Operand and Result Tables
The most heavily emitted ops carry the following operand/attribute/result
shape. The _tko family threads a cuda_tile.token through every memory
effect.
cuda_tile.load_view_tko
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | partition_view | yes | source tile view |
| operand 1..R | indices | index | yes (R = tile rank) | per-axis tile coordinate |
| operand R+1 | mask | tile<S × i1> | optional | per-lane predicate |
| operand R+2 | other | tile<S × element> | optional | fallback value when masked off |
| operand R+3 | token | cuda_tile.token | yes | input ordering edge |
| result 0 | value | tile<S × element> | yes | matches view element type |
| result 1 | token | cuda_tile.token | yes | successor ordering edge |
| attr mem_semantic | enum | weak\|relaxed\|acquire | optional | acquire requires scope |
| attr mem_scope | enum | tl_blk\|cluster\|gpu\|sys | conditional | required for non-weak |
| attr optimization_hints | dict | architecture-keyed | optional | |
| attr operandSegmentSizes | dense i32 | length 5 | yes | {view, indices, mask, other, token} |
A representative two-dimensional tile load with a predicate mask:
%tile, %t1 = cuda_tile.load_view_tko %view[%i, %j], %mask, %fallback, %t0
    { mem_semantic = #cuda_tile<mem_semantic relaxed>,
      mem_scope = #cuda_tile<mem_scope gpu>,
      operandSegmentSizes = array<i32: 1, 2, 1, 1, 1> }
    : !cuda_tile.partition_view<128x64xf32>, index, index,
      tile<128x64xi1>, tile<128x64xf32>, !cuda_tile.token
    -> tile<128x64xf32>, !cuda_tile.token
The mask and fallback shapes equal the result tile shape; the view element type matches the result element type. The token chain threads %t0 in and %t1 out, ordering the load against any preceding or following memory effect that consumes the same chain.
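The operandSegmentSizes attribute for this op follows directly from which optional operands are present. A minimal sketch, assuming the segment order {view, indices, mask, other, token} given in the table; the helper names are hypothetical, and the sketch does not decide whether a fallback without a mask is legal.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Segment order: {view, indices, mask, other, token}.
inline std::array<std::int32_t, 5> load_view_segment_sizes(std::int32_t tile_rank,
                                                           bool has_mask,
                                                           bool has_other) {
    assert(tile_rank >= 0);
    return {{1,                      // exactly one view operand
             tile_rank,              // one index per tile axis
             has_mask ? 1 : 0,       // optional per-lane predicate
             has_other ? 1 : 0,      // optional fallback tile
             1}};                    // exactly one input token
}

// Total operand count implied by the segment sizes.
inline std::int32_t total_operands(const std::array<std::int32_t, 5> &segments) {
    std::int32_t total = 0;
    for (std::int32_t s : segments) total += s;
    return total;
}
```

For the rank-2 masked load with a fallback, this produces 1, 2, 1, 1, 1 and a total of six operands, matching the example's attribute.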
cuda_tile.store_view_tko
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | partition_view | yes | destination view |
| operand 1 | value | tile<S × element> | yes | element type matches view |
| operand 2..R+1 | indices | index | yes | per-axis tile coordinate |
| operand R+2 | mask | tile<S × i1> | optional | |
| operand R+3 | token | cuda_tile.token | yes | input ordering edge |
| result 0 | token | cuda_tile.token | yes | successor ordering edge |
| attr mem_semantic | enum | weak \| relaxed \| release | optional | acquire variants rejected |
| attr mem_scope | enum | as above | conditional | |
| attr operandSegmentSizes | dense i32 | length 5 | yes | |
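The mem_semantic and mem_scope rows of the load and store tables amount to a small pairing rule. The sketch below encodes only what those rows state; the enum, its numbering, and the function names are assumptions, and it deliberately does not decide whether a scope is permitted alongside weak, since the tables do not say:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative enum: names mirror the table values, numbering is assumed. */
typedef enum { SEM_WEAK, SEM_RELAXED, SEM_ACQUIRE, SEM_RELEASE } MemSemantic;

/* Loads accept weak, relaxed, acquire; non-weak semantics need a scope. */
bool load_policy_ok(MemSemantic sem, bool has_scope) {
    assert(sem <= SEM_RELEASE);
    if (sem == SEM_RELEASE) return false;
    return sem == SEM_WEAK || has_scope;
}

/* Stores accept weak, relaxed, release; acquire is rejected. */
bool store_policy_ok(MemSemantic sem, bool has_scope) {
    assert(sem <= SEM_RELEASE);
    if (sem == SEM_ACQUIRE) return false;
    return sem == SEM_WEAK || has_scope;
}
```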
cuda_tile.atomic_rmw_tko
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | pointers | tile<S × ptr> | yes | per-lane address |
| operand 1 | value | tile<S × element> | yes | RMW operand |
| operand 2 | mask | tile<S × i1> | optional | |
| operand 3 | token | cuda_tile.token | yes | |
| result 0 | old | tile<S × element> | yes | |
| result 1 | token | cuda_tile.token | yes | |
| attr kind | enum | add \| addf \| and \| or \| xor \| xchg \| min \| max \| umin \| umax | yes | |
| attr ordering | enum | full | yes | |
| attr scope | enum | full | conditional | |
cuda_tile.mmaf / cuda_tile.mmai
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | A | tile<[B ×] M × K × elem_a> | yes | rank 2 or 3 (batched) |
| operand 1 | B | tile<[B ×] K × N × elem_b> | yes | K agrees with A |
| operand 2 | C | tile<[B ×] M × N × elem_c> | yes | accumulator |
| result 0 | D | tile<[B ×] M × N × elem_c> | yes | shape equals C shape |
| attr signedness_a | enum | signed \| unsigned | integer MMA | required for mmai |
| attr signedness_b | enum | signed \| unsigned | integer MMA | required for mmai |
| attr rounding | enum | IEEE basic | optional | mmaf only |
A 16×16×16 floating MMA with an f32 accumulator and f16 inputs:
%d = cuda_tile.mmaf %a, %b, %c
: tile<16x16xf16>, tile<16x16xf16>, tile<16x16xf32>
-> tile<16x16xf32>
A batched integer MMA with explicit signedness attributes:
%d = cuda_tile.mmai %a, %b, %c
{ signedness_a = #cuda_tile<signedness signed>,
signedness_b = #cuda_tile<signedness unsigned> }
: tile<4x16x32xi8>, tile<4x32x16xi8>, tile<4x16x16xi32>
-> tile<4x16x16xi32>
A's M dimension and B's N dimension agree with C; the shared K dimension is contracted. The verifier rejects rank mismatch, K disagreement, accumulator/result type mismatch, missing signedness on mmai, and any input/accumulator combination that lies outside the target's legal MMA element-type tuples.
cuda_tile.if
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | condition | i1 | yes | scalar predicate |
| region 0 | then | terminated by yield | yes | yields result_types |
| region 1 | else | terminated by yield | required when results non-empty | yields result_types |
| result 0.. | values | any non-view type | optional | view-typed results rejected |
cuda_tile.for
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | lower | integer | yes | |
| operand 1 | upper | integer | yes | same width as lower |
| operand 2 | step | integer | yes | same width as lower |
| operand 3.. | iter args | any non-view | optional | types equal result_types |
| region 0 | body | terminated by yield | yes | block arg 0 = induction var |
| result 0.. | yielded iter args | any non-view | optional | |
Memory and Token Operations
The _tko suffix means token ordered. Every token-ordered memory op consumes
an input token and produces a successor. Loads and atomics also produce data;
stores produce only the successor token. That discipline is the public memory
model — later passes refine it into barriers, async copies, and backend memory
instructions.
LoadResult lower_load_ptr_tko(LoadPtrTkoOp op) {
MemRef ref = make_memref_from_pointer(op.pointer, op.indices);
MemoryPolicy policy = memory_policy(op.ordering, op.scope, op.hints);
Value data = tileaa_load(ref, op.mask, op.padding, policy, op.input_token);
Token next = token_after(data.memory_effect, op.input_token);
return (LoadResult){ .value = data, .token = next };
}
Atomics check both memory ordering and element type. Integer bitwise modes are integer-only; floating add is floating-only; compare-and-swap is restricted to element widths the backend can update atomically.
Structured Control Flow
cuda_tile ships its own region operations because frontends need a stable
kernel-level API. Later lowering may translate these regions into SCF, CFG, or
private control-flow dialects, but the verifier enforces these rules first:
- if: result types match every non-empty yielding branch.
- for: induction, bounds, step, iter args, and results are type-consistent.
- loop: iter args and results are type-consistent.
- break: exits the nearest compatible loop.
- continue: exits to the next iteration of a compatible for or loop.
- return: appears in an entry context and matches the entry function type.
- yield: appears only in a parent op that expects region yields.
MMA Operations
mmaf and mmai are deliberately narrow public abstractions: they describe
matrix multiply-accumulate intent, not final tensor-core instruction selection.
The verifier checks shape compatibility and element-type legality. Choosing
WGMMA, smaller MMA atoms,
tensor-memory paths, or emulation is left to the
lowering pipeline.
LogicalResult verify_mma_shape(Tile lhs, Tile rhs, Tile acc, Tile result) {
require(lhs.rank == 2 || lhs.rank == 3);
require(rhs.rank == lhs.rank);
require(acc.rank == lhs.rank);
require(result.rank == lhs.rank);
if (lhs.rank == 3) {
require(lhs.dim(0) == rhs.dim(0));
require(lhs.dim(0) == acc.dim(0));
require(lhs.dim(0) == result.dim(0));
}
require(lhs.k_dim == rhs.k_dim);
require(lhs.m_dim == acc.m_dim);
require(rhs.n_dim == acc.n_dim);
require(acc.shape == result.shape);
return success();
}
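The pseudocode above can be made concrete with a plain shape record. This is a sketch under stated assumptions: TileShape and the helper names are hypothetical, M/K/N are taken to be the last two dimensions of each operand, and element-type legality is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a tile type: rank 2 (rows x cols) or rank 3 (batch first). */
typedef struct { int rank; int64_t dims[3]; } TileShape;

static int64_t mma_rows(const TileShape *t) { return t->dims[t->rank - 2]; }
static int64_t mma_cols(const TileShape *t) { return t->dims[t->rank - 1]; }

bool mma_shapes_ok(const TileShape *a, const TileShape *b,
                   const TileShape *c, const TileShape *d) {
    assert(a && b && c && d);
    if (a->rank != 2 && a->rank != 3) return false;
    if (b->rank != a->rank || c->rank != a->rank || d->rank != a->rank)
        return false;
    if (a->rank == 3 &&
        (a->dims[0] != b->dims[0] || a->dims[0] != c->dims[0] ||
         a->dims[0] != d->dims[0]))
        return false;                              /* batch must agree */
    if (mma_cols(a) != mma_rows(b)) return false;  /* K is contracted */
    if (mma_rows(a) != mma_rows(c)) return false;  /* M matches C */
    if (mma_cols(b) != mma_cols(c)) return false;  /* N matches C */
    for (int i = 0; i < c->rank; ++i)
        if (c->dims[i] != d->dims[i]) return false; /* D equals C */
    return true;
}
```

Both MLIR examples earlier on this page (the 16×16×16 mmaf and the batched 4×16×32 by 4×32×16 mmai) pass this check.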
Version Notes
- Emit cuda_tile.print for runtime diagnostic printing in this build.
- Do not emit cuda_tile.print_tko unless targeting a source tree that uses that mnemonic.
- Do not emit cuda_tile.atan2 for this build; guard it behind a newer TileIR version check.
- Treat cuda_tile.string as implementation-specific unless the target contract explicitly documents it.
Op Method Surface
Every cuda_tile.* op exposes four registered functions to the framework:
a builder (or textual parse entry), a registration thunk that interns the
mnemonic and installs the op vtable, a verifier hook, and a lowering pattern
that rewrites the op during the first conversion stage. The functions follow
predictable shapes by op family.
| Family | Builder shape | Verifier shape | Lowering arm |
|---|---|---|---|
| Trivial unary (absf, absi, ceil, floor, negf, negi, sqrt, cos, sin, transcendentals) | Default trampoline; constructs result from one operand and forwards rounding/flush-to-zero attributes. | Generic trait-only verification with element-type and rank checks. | Arithmetic-group conversion pattern. |
| Floating binary (addf, subf, mulf, divf, maxf, minf, remf) | Forwards rounding mode and flush-to-zero. | Type-equality, shape-equality, and rounding-mode legality (see Verifiers — Type-Compatibility Diagnostics). | Arithmetic-group conversion pattern. |
| Integer binary (addi, subi, muli, divi, maxi, mini, mulhii, remi, andi, ori, xori, shli, shri) | Forwards signedness and overflow attributes. | Type-equality, shape-equality, signedness-presence. | Arithmetic-group conversion pattern (integer max routes through a dedicated arm). |
| Conversion (exti, trunci, ftof, ftoi, itof, bitcast, int_to_ptr, ptr_to_int, ptr_to_ptr) | Builds result from operand element type and target element type. | Width-direction and rounding-mode checks; identity conversions are rejected. | Pointer-cast specialty arm for the four pointer-family ops; arithmetic-group arm for the rest. |
| Shape (broadcast, cat, extract, permute, reshape, iota) | Builds result from result shape, source shape, and axis attributes. | Rank, element-count, and axis legality. | Arithmetic-group conversion pattern. |
| Token-ordered memory (load_*_tko, store_*_tko, atomic_*_tko, make_token, join_tokens) | Builds the result tile plus the successor token; threads the input token through. | Token presence, pointer/value element-type match, mask shape match, ordering/scope pairing. | Arithmetic-group arm; the lowering produces a TileAA tiled_load/tiled_store/atomic_rmw and threads the new mem-token chain. |
| View construction (make_tensor_view, make_partition_view, offset, global, get_global) | Builds the view type from element type, shape, stride, and dynamic operands. | Dynamic-operand count match, element-type compatibility, partition dim_map injectivity. | Arithmetic-group arm; lowers to TileAA make_memref plus address-space metadata. |
| Structured control flow (module, entry, if, for, loop, yield, break, continue, return) | Builds region(s) plus block argument types from result types and iter-arg types. | Region structure, terminator arity, yield-type match, view-result rejection. | Routes through the control-flow conversion arm that produces SCF/CF dialect output. |
| Aggregate (reduce, scan) | Builds the result-type list plus the combiner body region. | Body purity, rank-zero block argument types, identity-vs-input element-type match. | Arithmetic-group arm; produces a TileAA reduce with the same body region. |
| MMA (mmaf, mmai) | Builds result from A/B/C tile types; signedness attributes preserved. | Rank, K/M/N dimension agreement, accumulator/result type match, signedness presence for integer MMA. | Arithmetic-group arm; lowers to TileAA dot with optional scale-factor operands. |
| Constants and diagnostics (constant, select, assert, assume, print) | Constants carry a typed attribute; select carries condition plus two values; diagnostics carry a message and operand list. | Constant-attribute type match; select arm-type match; assume-predicate interface checks. | Constant-and-select arm (constants are constant-folded into the TileAA constant pool). |
The default builder for trivial unary ops shares one trampoline that constructs the op from a single operand and forwards rounding/flush-to-zero attributes; the default verifier hook installs a no-op stub when the op's contract is fully covered by trait-level checks. The control-flow lowering routes through one driver that owns cuda_tile.if, cuda_tile.for, cuda_tile.loop, cuda_tile.continue, and cuda_tile.return together so it can preserve region nesting and the structured-exit ancestry contract.
One count discrepancy is worth flagging. The roster in this build is 92 mnemonics. It differs from the open-source documentation in two names: cuda_tile.atan2 is excluded entirely from this binary, and cuda_tile.print_tko is renamed to cuda_tile.print. Producers should follow the version notes above and emit only the 92 mnemonics this dialect accepts.
The dialect constructor walks the registration thunks in roster order; each thunk interns the mnemonic into the dialect's OperationName table and installs the op's vtable, fold callback, and verifier hook through the slots described in overview — AbstractOperation Record and Operation Layout — Pointer-Identity Dispatch. Lowering patterns are matched as conversion patterns by the arithmetic-group and pointer-cast dispatchers during the first lowering stage; the conversion is documented in Cuda Tile to TileAA.
Cross-References
Overview describes the dialect's role as the public producer-facing API and the AbstractOperation record structure. Verifiers details the verbatim verifier diagnostics each family emits. Canonicalizers and Folds describes the rewrites applied after verification. Bytecode Reader and Writer documents the on-wire encoding the opcode dispatcher consumes.
cuda_tile Types and Attributes
Abstract
cuda_tile types are the public shape and memory vocabulary of TileIR: tile
values, typed pointers, tensor views, partitioned views, and ordering tokens.
Attributes layer on the numeric, memory, target, padding, assumption,
optimization, and debug facts that make lowering deterministic. This page lays
out those contracts in the terms a frontend or reimplementation needs — which
values may be constructed, which attributes are semantic, and which facts are
verified before the module enters private lowering.
Concrete Types
| Type | Meaning | Contract |
|---|---|---|
| cuda_tile.tile | Static shaped value with an element type. | Rank and dimensions are part of the type; dimensions are positive powers of two; total element count is bounded. |
| cuda_tile.ptr | Typed pointer to a numeric scalar element. | Pointee is integer or floating; pointer-to-pointer is rejected. |
| cuda_tile.tensor_view | Global-memory view with element type, shape, and stride. | Shape and stride ranks match; static dimensions and strides are positive. |
| cuda_tile.partition_view | Tile partition over a tensor view. | Tile shape, tensor view, dimension map, and optional padding describe one legal tiled access pattern. |
| cuda_tile.token | Memory-ordering edge. | Carries dependency ordering between token-ordered memory effects and has no user-visible payload. |
| cuda_tile.string | Implementation-specific string handle. | Treat as nonportable unless the target contract explicitly accepts it. |
The first five types form the stable public surface. cuda_tile.string is
useful for reading this build's dumps; portable producers must not depend on
it.
TypeStorage and TypeID Singletons
Each of the six concrete types is a normal MLIR Type subclass backed by its
own TypeStorage derivative. Construction flows through the MLIRContext's
StorageUniquer gateway documented in
Storage Uniquer and Context Impl — getOrCreate Gateway:
the dialect's self-registration ctor hands a TypeID singleton and a build
hook to the uniquer, which hashes the storage payload, looks up an existing
instance, and either returns the cached pointer or allocates a fresh storage
block in the context arena. Every public Type handle in the IR is a
24-bit-tagged pointer into that arena.
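The hash, look up, then return-or-allocate flow can be illustrated with a toy uniquer. Everything here is an assumption made for illustration (the Toy* names, the fixed bucket count, the hash mix, heap allocation in place of the context arena); it is not the MLIR StorageUniquer API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_RANK 4
#define TOY_BUCKETS 16

/* Toy storage record: element tag plus a small static shape. */
typedef struct ToyTileStorage {
    int element;
    size_t rank;
    int64_t shape[TOY_MAX_RANK];
    struct ToyTileStorage *next; /* hash-bucket chain */
} ToyTileStorage;

typedef struct { ToyTileStorage *buckets[TOY_BUCKETS]; } ToyUniquer;

/* FNV-style mix over 64-bit words; any stable hash would do. */
static size_t toy_hash(int element, const int64_t *shape, size_t rank) {
    uint64_t h = 0xcbf29ce484222325ULL ^ (uint64_t)element;
    for (size_t i = 0; i < rank; ++i) {
        h ^= (uint64_t)shape[i];
        h *= 0x100000001b3ULL;
    }
    return (size_t)(h % TOY_BUCKETS);
}

/* Return the cached storage for this payload, or allocate a fresh one. */
const ToyTileStorage *toy_get_or_create(ToyUniquer *u, int element,
                                        const int64_t *shape, size_t rank) {
    assert(u != NULL && shape != NULL && rank <= TOY_MAX_RANK);
    size_t b = toy_hash(element, shape, rank);
    for (ToyTileStorage *s = u->buckets[b]; s != NULL; s = s->next)
        if (s->element == element && s->rank == rank &&
            memcmp(s->shape, shape, rank * sizeof *shape) == 0)
            return s; /* hit: same pointer as every earlier request */
    ToyTileStorage *fresh = calloc(1, sizeof *fresh);
    if (fresh == NULL) return NULL;
    fresh->element = element;
    fresh->rank = rank;
    memcpy(fresh->shape, shape, rank * sizeof *shape);
    fresh->next = u->buckets[b];
    u->buckets[b] = fresh;
    return fresh;
}
```

Two requests with equal payloads return the same pointer, which is what makes type equality a pointer comparison.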
All six types share a 24-byte BaseStorage header (vtable, context pointer,
hash-bucket pointer) and then append a type-specific payload. Storage sizes
are byte-exact across the build and stable under bytecode round-trip.
| Type | TypeID singleton | Storage size | Self-ctor |
|---|---|---|---|
| cuda_tile.tile | &unk_5B38BC0 | 0x30 | sub_6C5870 |
| cuda_tile.tensor_view | &unk_5B38BB8 | 0x40 | sub_6C5C40 |
| cuda_tile.partition_view | &unk_5B38BB0 | 0x40 | sub_6C5E80 |
| cuda_tile.ptr | &unk_5B38BC8 | 0x20 | sub_6C5630 |
| cuda_tile.token | off_5A2E208 slot | 0x18 | sub_6C6240 |
| cuda_tile.string | off_5A2DB38 slot | 0x18 | sub_6C6500 |
The PointerType singleton &unk_5B38BC8 is the exact value the
TileElementType predicate (sub_6C4E20) tests against in its final arm when
it accepts a typed pointer as a tile element — the registration ctor's TypeID
slot is observably the same one driving tile-element verification.
TileType storage
cuda_tile.tile carries a static shape and an element type. The shape is held
as an ArrayRef<int64_t> (begin pointer plus size, 16 bytes); the element
type is a single tagged pointer.
typedef struct TileTypeStorage {
/*+0x00*/ BaseStorage base; // vtable=&unk_5B38BC0, ctx, hash bucket
/*+0x18*/ const int64_t *shape_begin; // pointer into context-owned int64 array
/*+0x20*/ uint64_t shape_size; // dimension count
/*+0x28*/ Type element_type; // scalar element or cuda_tile.ptr
} TileTypeStorage;
The shape array is interned alongside the storage block, and copies returned
to callers re-use that pointer. The element type field accepts the
TileElementType palette — f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2, i1,
i8, i16, i32, i64, and cuda_tile.ptr. Total storage 0x30 bytes.
TensorViewType storage
cuda_tile.tensor_view carries an element type, a shape, and a stride. Both
shape and stride are full ArrayRef<int64_t> records, and both accept the
shared MLIR kDynamic = INT64_MIN sentinel independently per dimension.
typedef struct TensorViewTypeStorage {
/*+0x00*/ BaseStorage base; // vtable=&unk_5B38BB8
/*+0x18*/ Type element_type;
/*+0x20*/ const int64_t *shape_begin;
/*+0x28*/ uint64_t shape_size;
/*+0x30*/ const int64_t *stride_begin;
/*+0x38*/ uint64_t stride_size;
} TensorViewTypeStorage;
Element type must satisfy the bare-Number predicate (sub_6C2A10): the
tile-element palette minus the cuda_tile.ptr arm. Total storage 0x40 bytes.
PartitionViewType storage
cuda_tile.partition_view overlays a power-of-two tile grid on a tensor view
and optionally selects a padding value for out-of-bounds reads. The tile shape
is held as a DenseI32ArrayAttr attribute pointer (interned independently by
the attribute uniquer); the dimension map is a raw ArrayRef<int32_t>; the
padding value is a nullable attribute slot.
typedef struct PartitionViewTypeStorage {
/*+0x00*/ BaseStorage base; // vtable=&unk_5B38BB0
/*+0x18*/ DenseI32ArrayAttr tile_shape; // interned attr
/*+0x20*/ TensorViewType tensor_view;
/*+0x28*/ const int32_t *dim_map_begin;
/*+0x30*/ uint64_t dim_map_size;
/*+0x38*/ PaddingValueAttr padding_value; // nullable, attr pointer
} PartitionViewTypeStorage;
At registration time, the self-registration ctor also wires the TileView
interface concept-model pointer and method table into the TypeStorage+0x88
slot. Op verifiers such as verifyViewLoadStoreCommon reach the partition
view through that interface to read getViewIndexRank() and
getViewTileType(), so the interface vtable is consulted on every view load
and store. Total storage 0x40 bytes.
PointerType storage
cuda_tile.ptr is a typed pointer to a single scalar element. This build
carries no explicit address-space field: the pointer is always a global-memory
typed reference, and any address-space variation rides on the tensor_view or
partitioned access it flows through, not on the pointer type itself.
typedef struct PointerTypeStorage {
/*+0x00*/ BaseStorage base; // vtable=&unk_5B38BC8
/*+0x18*/ Type pointee_type; // bare-Number palette only
} PointerTypeStorage;
The pointee type must satisfy the bare-Number predicate
(sub_6C2840); pointer-to-pointer is forbidden by that arm. Total storage
0x20 bytes.
TokenType storage
cuda_tile.token is parameter-free. It carries no payload beyond the shared
BaseStorage header — its only job is to thread ordering edges between
token-producing and token-consuming memory operations.
typedef struct TokenTypeStorage {
/*+0x00*/ BaseStorage base; // vtable from off_5A2E208 slot
} TokenTypeStorage;
Total storage 0x18 bytes. Two cuda_tile.token SSA values produced by
different ops are unequal IR values, but their storage instance is unique —
every !cuda_tile.token type in a context resolves to the same TypeStorage.
StringType storage
cuda_tile.string is also parameter-free at the storage layer in this build.
It is an internal handle used by debug and diagnostic plumbing; public
producers should treat it as nonportable.
typedef struct StringTypeStorage {
/*+0x00*/ BaseStorage base; // vtable from off_5A2DB38 slot
} StringTypeStorage;
Total storage 0x18 bytes. Like cuda_tile.token, the type resolves to a
single canonical storage instance per context.
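The documented sizes and offsets can be cross-checked by compiling the structs and asking the compiler. This sketch assumes an LP64 target and models every Type or attribute handle as a single pointer (the Handle typedef is hypothetical); the field names follow the structs above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: every Type/attribute handle is one 8-byte pointer. */
typedef void *Handle;

typedef struct { void *vtable; void *context; void *hash_bucket; } BaseStorage;

typedef struct {
    BaseStorage base;
    const int64_t *shape_begin;
    uint64_t shape_size;
    Handle element_type;
} TileTypeStorage;

typedef struct {
    BaseStorage base;
    Handle element_type;
    const int64_t *shape_begin;
    uint64_t shape_size;
    const int64_t *stride_begin;
    uint64_t stride_size;
} TensorViewTypeStorage;

typedef struct {
    BaseStorage base;
    Handle tile_shape;
    Handle tensor_view;
    const int32_t *dim_map_begin;
    uint64_t dim_map_size;
    Handle padding_value;
} PartitionViewTypeStorage;

typedef struct { BaseStorage base; Handle pointee_type; } PointerTypeStorage;
typedef struct { BaseStorage base; } TokenTypeStorage;

/* Returns 1 when the compiled layout matches the documented numbers. */
int storage_layout_matches(void) {
    assert(sizeof(void *) == 8); /* the sketch only makes sense on LP64 */
    return sizeof(BaseStorage) == 0x18 &&
           sizeof(TileTypeStorage) == 0x30 &&
           offsetof(TileTypeStorage, element_type) == 0x28 &&
           sizeof(TensorViewTypeStorage) == 0x40 &&
           offsetof(TensorViewTypeStorage, stride_begin) == 0x30 &&
           sizeof(PartitionViewTypeStorage) == 0x40 &&
           offsetof(PartitionViewTypeStorage, padding_value) == 0x38 &&
           sizeof(PointerTypeStorage) == 0x20 &&
           sizeof(TokenTypeStorage) == 0x18;
}
```

StringTypeStorage has the same shape as TokenTypeStorage, so it is covered by the last check.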
Element-type dispatch and the 13-arm predicate
The TileType self-registration ctor at sub_6C5870 wires the
TileElementType predicate at sub_6C4E20 into the parameter-trait verifier
emitted by TableGen. That predicate is an unrolled AnyTypeOf<> switch over
the element-type singleton table — each arm a direct vtable-pointer test or
a width-keyed isInteger(N) call. The accepted set has thirteen arms in the
binary (f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2, i1, i8, i16, i32, i64,
and cuda_tile.ptr) and emits the verbatim failure string
"failed to verify 'elementType': f16 or bf16 or f32 or tf32 or f64 or f8E4M3FN or f8E5M2 or i1 or i8 or i16 or i32 or i64 or Pointer type"
on mismatch. The bare-Number predicate at sub_6C2A10 is the same dispatch
table minus the final cuda_tile.ptr arm — which is how the verifier forces
tensor_view element types to be scalar while still letting pointer elements
live inside tiles.
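The relationship between the two predicates is easy to see in miniature. The enum below is an illustrative stand-in for the singleton table (its ordering and the EL_OTHER member are assumptions), and the functions model only the accept/reject decision, not the binary's vtable-pointer tests:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative tags for the thirteen accepted kinds, plus one rejected
   kind so the predicates have something to refuse. */
typedef enum {
    EL_F16, EL_BF16, EL_F32, EL_TF32, EL_F64, EL_F8E4M3FN, EL_F8E5M2,
    EL_I1, EL_I8, EL_I16, EL_I32, EL_I64,
    EL_PTR,
    EL_OTHER
} ElemKind;

/* Bare-Number: the scalar palette, without the pointer arm. */
bool is_bare_number(ElemKind k) {
    int v = (int)k;
    return v >= EL_F16 && v <= EL_I64;
}

/* TileElementType: the same palette plus the final cuda_tile.ptr arm. */
bool is_tile_element(ElemKind k) {
    return is_bare_number(k) || k == EL_PTR;
}

/* Count how many kinds a predicate accepts. */
int count_accepted(bool (*pred)(ElemKind)) {
    assert(pred != 0);
    int n = 0;
    for (int k = EL_F16; k <= EL_OTHER; ++k)
        if (pred((ElemKind)k)) ++n;
    return n;
}
```

count_accepted gives 13 for the tile-element predicate and 12 for the bare-Number predicate, matching the arm counts in the text.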
Dynamic-dimension sentinel and tile cap
Both tensor_view shape/stride and the inner check in
PartitionViewType::verify use the MLIR-wide kDynamic = INT64_MIN sentinel
to mark an unknown-at-IR-build-time dimension; the parser accepts ? and
stores INT64_MIN. The tile element-count cap is 0x1000000 = 16777216
elements, enforced by the overflow-safe numElems > kMaxElems / dim check in
verifyTileSize before each multiplication. Both constants are part of the
storage-level contract: a reimplementation that picks a different sentinel
collides with the positivity check in shape verification, and a looser tile
cap admits tiles that exceed shared-memory capacity on Blackwell.
Tile Type
Tiles are shaped SSA values. Shape is static; element type is one of the accepted integer, floating, or pointer element types. The verifier is simple by design: every dimension must be a positive power of two, and the element-count product must not exceed the compiler's tile-size ceiling.
LogicalResult verify_tile_type(Shape shape, ElementType element) {
require(is_tile_element_type(element));
int64_t elements = 1;
int64_t max_elements = 16 * 1024 * 1024;
for (int64_t dim : shape) {
require(dim > 0);
require(is_power_of_two(dim));
require(elements <= max_elements / dim);
elements *= dim;
}
return success();
}
That strong shape rule pays off downstream. Tile lowerings routinely assume powers of two when picking warp lanes, vector widths, and layout factors, and the verifier guarantees those assumptions never fail.
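A compilable version of the shape half of that verifier, as a sketch: the function names are hypothetical, the element-type check is omitted, and the cap is the 0x1000000 constant quoted earlier on this page:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TILE_ELEMENTS ((int64_t)0x1000000) /* 16777216 */

bool tile_dim_is_pow2(int64_t v) { return v > 0 && (v & (v - 1)) == 0; }

bool tile_shape_ok(const int64_t *shape, size_t rank) {
    assert(shape != NULL || rank == 0);
    int64_t elements = 1;
    for (size_t i = 0; i < rank; ++i) {
        int64_t dim = shape[i];
        if (!tile_dim_is_pow2(dim)) return false;
        /* Divide before multiplying so the product cannot overflow. */
        if (elements > MAX_TILE_ELEMENTS / dim) return false;
        elements *= dim;
    }
    return true;
}
```

A 4096x4096 tile sits exactly at the cap and is accepted; 8192x4096 is rejected before the multiplication happens.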
Pointer and View Types
cuda_tile.ptr is a typed pointer to a numeric element. The pointer itself is
not a tensor; tensor structure is introduced by tensor_view and
partition_view.
LogicalResult verify_pointer_type(ElementType pointee) {
require(is_integer_type(pointee) || is_float_type(pointee));
require(!is_pointer_type(pointee));
return success();
}
tensor_view stores element type, rank, shape, and stride. Dynamic dimensions
and strides are allowed, but the rank is fixed.
LogicalResult verify_tensor_view(Type element, Shape shape, Strides strides) {
require(is_numeric_type(element));
require(shape.rank == strides.rank);
for (int axis = 0; axis < shape.rank; ++axis) {
require(shape[axis] == dynamic_dim() || shape[axis] > 0);
require(strides[axis] == dynamic_stride() || strides[axis] > 0);
}
return success();
}
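The same check in compilable form, as a sketch with hypothetical names. It covers only the rank and extent rules; the element-type test is left out, and K_DYNAMIC stands for the INT64_MIN sentinel described earlier:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define K_DYNAMIC INT64_MIN /* dynamic-extent sentinel, parsed from ? */

/* An extent or stride is legal when it is dynamic or strictly positive. */
bool view_extent_ok(int64_t v) { return v == K_DYNAMIC || v > 0; }

bool tensor_view_ok(const int64_t *shape, size_t shape_rank,
                    const int64_t *strides, size_t stride_rank) {
    assert(shape != NULL && strides != NULL);
    if (shape_rank != stride_rank) return false;
    for (size_t i = 0; i < shape_rank; ++i)
        if (!view_extent_ok(shape[i]) || !view_extent_ok(strides[i]))
            return false;
    return true;
}
```

Note that -1 is rejected here: a producer that encodes dynamic dimensions with any value other than INT64_MIN falls into the positivity check.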
partition_view describes how a tile-shaped access maps onto a tensor view. It
is where padding legality and dimension mapping are checked.
LogicalResult verify_partition_view(PartitionViewType view) {
TensorViewType tensor = view.tensor;
Shape tile_shape = view.tile_shape;
require(tile_shape.rank == tensor.rank);
require(view.dim_map.length == tile_shape.rank);
BitSet used_tensor_dims(tensor.rank);
for (int tile_axis = 0; tile_axis < tile_shape.rank; ++tile_axis) {
int tensor_axis = view.dim_map[tile_axis];
require(tile_shape[tile_axis] > 0);
require(is_power_of_two(tile_shape[tile_axis]));
require(0 <= tensor_axis && tensor_axis < tensor.rank);
require(!used_tensor_dims.contains(tensor_axis));
used_tensor_dims.insert(tensor_axis);
}
if (view.padding.has_value && view.padding.value.requires_float()) {
require(is_float_type(tensor.element_type));
}
return success();
}
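The injectivity test is the part most worth seeing concretely. The sketch below uses a 64-bit mask in place of the BitSet, which limits it to rank 64 or less; that limit, the function name, and the omission of the padding check are all simplifications of this example, not properties of the verifier:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool dim_map_ok(const int32_t *tile_shape, const int32_t *dim_map,
                size_t tile_rank, size_t tensor_rank) {
    assert(tile_shape != NULL && dim_map != NULL);
    if (tile_rank != tensor_rank || tensor_rank > 64) return false;
    uint64_t used = 0; /* bit i is set once tensor axis i is claimed */
    for (size_t i = 0; i < tile_rank; ++i) {
        int32_t extent = tile_shape[i];
        int32_t axis = dim_map[i];
        if (extent <= 0 || (extent & (extent - 1)) != 0) return false;
        if (axis < 0 || (size_t)axis >= tensor_rank) return false;
        uint64_t bit = (uint64_t)1 << axis;
        if (used & bit) return false; /* two tile axes hit one tensor axis */
        used |= bit;
    }
    return true;
}
```

An identity map {0, 1} and a transposed map {1, 0} both pass; {0, 0} fails because tensor axis 0 is claimed twice.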
Element-Type Palette
The public element palette includes the integer widths i1, i8, i16,
i32, and i64; the floating types f16, bf16, f32, tf32, and f64;
and the FP8 formats used by current tile operations. Lower-precision FP4, FP6,
and block-scale helper formats are introduced in lower internal dialects rather
than as general cuda_tile element types.
| Predicate family | Accepted types | Typical users |
|---|---|---|
| Any integer | i1, i8, i16, i32, i64 | Integer arithmetic, logic, indices, predicates. |
| Any float | f16, bf16, f32, tf32, f64, FP8 formats | Floating arithmetic, MMA, conversion, padding. |
| Numeric | Integers or floats | Pointers, tensor views, constants. |
| Tile element | Numeric or pointer | Tile values and pointer tiles. |
| Pointer tile | Tile whose element is cuda_tile.ptr | Gather, scatter, and pointer-tile memory forms. |
Attribute Families
| Family | Attributes | Contract |
|---|---|---|
| Integer mode | signedness, overflow | Select signed or unsigned interpretation and optional overflow assumptions. |
| Floating mode | rounding, comparison_ordering, comparison_predicate | Preserve rounding and ordered or unordered comparison semantics. |
| Atomic and memory model | atomic_rmw_mode, memory_scope, memory_ordering_semantics | Define legal atomic operation, visibility scope, and ordering semantics. |
| Padding | padding_value | Select the fill value for out-of-bounds partitioned view reads. |
| Assumption predicates | div_by, same_elements, bounded | Attach verifier-checked facts to cuda_tile.assume. |
| Optimization hints | optimization_hints | Carry optional architecture-keyed tuning hints for entries and memory ops. |
| Debug and location | di_loc, di_compile_unit, di_file, di_lexical_block, di_subprogram | Preserve source provenance when debug info is enabled. |
Parse enum-like attributes as closed sets. Validate data attributes' payload shape at parse time and, where needed, again in the consuming operation's verifier.
Assumption Predicates
div_by, bounded, and same_elements implement the assumption-predicate
contract. They mean anything only when attached to cuda_tile.assume. Later
passes can lean on them for simplification — but only because the verifier
type-checks the constrained value first.
LogicalResult verify_assume_predicates(AssumeOp op) {
Type type = op.value.type;
for (Attribute attr : op.predicates) {
switch (attr.kind) {
case DIV_BY:
require(attr.divisor > 0);
require(is_power_of_two(attr.divisor));
require(is_integer_like(type) || is_pointer_like(type));
require(optional_pair_is_complete(attr.every, attr.along));
break;
case BOUNDED:
require(is_integer_like(type));
require_bounds_fit_integer_width(attr.lower, attr.upper, type);
require_lower_not_greater_than_upper(attr.lower, attr.upper);
break;
case SAME_ELEMENTS:
require(attr.values.length == ranked_shape(type).rank);
require_each_value_fits_axis(attr.values, ranked_shape(type));
break;
default:
break;
}
}
return success();
}
The dispatch above is the public contract. The implementation lives in three per-attribute verifier bodies
that the bytecode reader reaches through a small fan-out of trampolines. The next sections document those
bodies as they appear in the binary — together they cover the only cuda_tile attributes that carry a
non-trivial verifier. The remaining attributes in the family table are simple key-value records that the
generic attribute parser accepts without a dedicated verify slot.
DivByAttr Verifier
DivByAttr is the divisibility assumption used on cuda_tile.assume. Its verifier lives at sub_15107A0,
the largest attribute-verifier body in the binary at roughly 1,467 lines of decompiled C, almost all of it
type-universe dispatch and overflow bookkeeping. The symbol-table name reads DivByAttr::verifyWithAssumeOp,
and the diagnostics it emits sometimes spell the attribute as nv_tileaa.div_by rather than
cuda_tile.div_by — the dialect was renamed mid-binary and the diagnostic strings were never refreshed.
Treat both spellings as the same attribute when matching error output.
The verifier opens by checking that the divisor is positive. A non-positive divisor is rejected
immediately; the verifier emits a diagnostic suffixed with the verbatim "' divisor must be a power of 2"
phrase (the leading ' closes the quoted attribute-name prefix the diagnostic prints first). It then
bound-checks the magnitude against 2^62. The ceiling is chosen so the divisor can be multiplied by a
signed 64-bit residue without overflow during downstream simplification — the primary reason a divisibility
fact gets consulted.
After the magnitude check, the verifier walks the constrained value's type universe. Four branches, all
structural rather than nominal: the dispatch keys on the value's TypeKind, not on the printed type name,
so retypings during canonicalization do not change which branch runs.
| Branch | Type-class | Verifier action |
|---|---|---|
| 0 | Integer (any width) | Bound-check divisor against 2^62; accept any positive integer divisor. |
| 1 | Float (f16/bf16/f32/tf32/f64 and FP8) | Reject — divisibility is not defined for floating point and the attribute is refused with a diagnostic. |
| 2 | Pointer | Bound-check divisor against the pointee element size in bytes; alignment must be a multiple of sizeof(pointee). |
| 3 | Aggregate (cuda_tile.tile, cuda_tile.tensor_view) | Recurse into the element type; the same dispatch then runs against the element. |
The aggregate branch is what lets div_by apply to a tile uniformly: the verifier descends through the tile
type and rechecks the leaf element. A pointer-of-pointer or tile-of-tile terminates in the rejection arm
because each recursion is guarded by the same dispatch.
DivByAttr carries two optional covariant fields, every and along. every asserts the fact for every
dimension of a multi-dim divisor; along restricts the assertion to a single axis. The two fields obey a
joint-presence contract policed by three verbatim binary diagnostics — "' 'every'/'along' must be used in combination",
"' 'every'/'along' cannot be used if the constrained value is a tensor_view", and
"' 'every'/'along' cannot be used if the constrained value is a 0D tile" (each with the leading '
closing the quoted attribute-name prefix). When every is present on a multi-dim divisor the verifier
requires every dim of the divisor to divide cleanly into the corresponding tile extent; when along is
present it checks divisibility only along the named axis and leaves the other axes unconstrained.
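The joint-presence contract can be sketched as a function that returns the diagnostic suffix for the first violated rule. The three strings are the ones quoted above; the enum, the function name, and in particular the order in which the rules are tested are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { VAL_SCALAR, VAL_TILE, VAL_TENSOR_VIEW } ConstrainedKind;

/* Returns the diagnostic suffix for a violated rule, or NULL when the
   every/along usage is acceptable. Check order is assumed. */
const char *check_every_along(bool has_every, bool has_along,
                              ConstrainedKind kind, int tile_rank) {
    assert(tile_rank >= 0);
    if (!has_every && !has_along) return NULL; /* nothing to police */
    if (has_every != has_along)
        return "' 'every'/'along' must be used in combination";
    if (kind == VAL_TENSOR_VIEW)
        return "' 'every'/'along' cannot be used if the constrained value is a tensor_view";
    if (kind == VAL_TILE && tile_rank == 0)
        return "' 'every'/'along' cannot be used if the constrained value is a 0D tile";
    return NULL;
}
```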
BoundedAttr Verifier
BoundedAttr is the integer-range assumption verified at sub_150EB90. It runs much shorter than the
divisibility verifier because there is no type-universe walk — bounds only apply to integer-typed values,
and the verifier rejects everything else up front. The primary check is the consistency relation
lo <= hi, emitted as a diagnostic when it fails. The verifier also checks that both bounds fit in the
integer width of the constrained value; an out-of-range bound is reported with the offending width and
value.
Three optional fields tune the relation. min provides the minimum permitted value and defaults to
INT64_MIN; max provides the maximum and defaults to INT64_MAX; strict, when true, switches the
relation from inclusive (lo <= v <= hi) to strict (lo < v < hi) on both ends. The strict flag changes
only the predicate emitted to downstream passes; the verifier itself enforces the same lo <= hi
consistency regardless of the flag.
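A sketch of the two halves, verification and the downstream predicate. It assumes bounds are compared against the signed range of the value's width and handles explicit bounds only; how the defaulted INT64_MIN/INT64_MAX bounds interact with the width check is not stated in the text and is not modelled. All names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a bound must fit the signed range of the given width. */
bool fits_signed_width(int64_t v, unsigned width) {
    assert(width >= 1 && width <= 64);
    if (width == 64) return true;
    int64_t hi = ((int64_t)1 << (width - 1)) - 1;
    int64_t lo = -hi - 1;
    return v >= lo && v <= hi;
}

/* Verifier side: both bounds fit, and lo <= hi regardless of strictness. */
bool bounded_attr_ok(int64_t lo, int64_t hi, unsigned width) {
    return fits_signed_width(lo, width) && fits_signed_width(hi, width) &&
           lo <= hi;
}

/* Downstream side: strict switches both ends from <= to <. */
bool value_in_bounds(int64_t v, int64_t lo, int64_t hi, bool strict) {
    return strict ? (lo < v && v < hi) : (lo <= v && v <= hi);
}
```

The pair lo = hi = 4 passes verification in either mode, but under strict the resulting predicate admits no value at all.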
SameElementsAttr Verifier
SameElementsAttr is the splat-form assumption verified at sub_150D3F0. It applies to attributes shaped
like DenseElementsAttr and asserts that every element of the dense payload equals one canonical value. The
verifier confirms the underlying DenseElementsAttr really is splat-form — its dense storage collapses to a
single value — and then stores only that canonical value rather than the full payload. The optimizer reads
the stored canonical value to fold splat-multiply-x patterns into element-multiply-x, which is the main
reason the attribute exists.
A non-splat payload is rejected outright. There is no per-element scan in the verifier itself; the splat check is a constant-time query on the dense attribute's internal layout.
Verifier Trampolines
The bytecode reader does not call the three verifier bodies directly. Each one sits behind a small
trampoline that the reader installs as the attribute kind's verify slot. The trampolines at sub_1517B70,
sub_1517B90, and sub_1517BB0 sit 0x20 bytes apart, so each is at most 32 bytes long, and they are byte-identical apart from the inner call target; they dispatch to
DivByAttr::verifyWithAssumeOp at sub_15107A0, BoundedAttr at sub_150EB90, and SameElementsAttr at
sub_150D3F0 respectively. The thunks exist because the bytecode reader stores a uniform function pointer
in each attribute's vtable slot, and each trampoline adapts the generic call signature to its verifier's
specific argument layout.
Optimization Hints
optimization_hints is a dictionary keyed by architecture name, then by
operation-specific hint name. The contents are advisory but still verified:
unknown architectures and unknown keys must be rejected so producers never
think a hint was honored when it was actually ignored.
LogicalResult verify_optimization_hints(Operation op, DictAttr hints) {
for (NamedDict arch_entry : hints.entries) {
require(is_allowed_architecture_key(arch_entry.name));
for (NamedAttribute hint : arch_entry.value.entries) {
require(is_allowed_hint_for_operation(op.name, hint.name));
require(hint_value_has_expected_type(op.name, hint.name, hint.value));
}
}
return success();
}
Common hint concepts include occupancy, CTA clustering, latency, and whether TMA is allowed for a view load or store. A missing hint means the compiler is free to choose.
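A runnable sketch of the same two-level walk, using standard containers in place of MLIR attributes. The allow-lists are placeholders: the key names "occupancy" and "allow_tma", the use of sm_100 and sm_120 as dictionary keys, and the bool/integer value kinds are assumptions made for illustration, not names taken from the tool.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <variant>

using HintValue = std::variant<bool, long>;         // hypothetical value kinds
using HintDict = std::map<std::string, HintValue>;  // hint name -> value
using ArchHints = std::map<std::string, HintDict>;  // arch name -> hints

// Hypothetical allow-lists; the real key sets are not given in this section.
inline const std::set<std::string> kKnownArchs = {"sm_100", "sm_120"};
// hint name -> true when the value must be a bool, false when an integer
inline const std::map<std::string, bool> kHintIsBool = {
    {"occupancy", false}, {"allow_tma", true}};

// Returns an empty string on success, otherwise the reason for rejection.
inline std::string verify_hints(const ArchHints& hints) {
  for (const auto& [arch, dict] : hints) {
    if (kKnownArchs.count(arch) == 0) return "unknown architecture: " + arch;
    for (const auto& [name, value] : dict) {
      auto it = kHintIsBool.find(name);
      if (it == kHintIsBool.end()) return "unknown hint: " + name;
      if (std::holds_alternative<bool>(value) != it->second)
        return "wrong value type for hint: " + name;
    }
  }
  return "";
}
```

Rejecting unknown keys rather than skipping them is the point of the design: a producer that misspells a hint gets an error instead of silently losing the hint.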
Invariants
- Tile dimensions are static, positive powers of two.
- Pointer pointees are numeric, never pointers.
- Tensor views have matching shape and stride ranks.
- Partition views map tile dimensions injectively into tensor dimensions.
- Special padding values such as NaN or infinity require floating-point element types.
- Tokens are ordering values, not runtime data visible to the program.
- Assumption predicates are verifier-checked before they can justify a rewrite.
- Optimization hints must be explicit and known to the verifier.
cuda_tile Verifiers
Abstract
cuda_tile verification is the public gate before TileIR enters private lowering. The verifier rejects malformed tile shapes, illegal numeric policies, broken memory-ordering pairs, structurally invalid control flow, view construction errors, MMA shape mismatches, malformed assumption predicates, and inconsistent optimization hints. Generic ODS-style checks for operand counts, result counts, region counts, attribute presence, and trait-driven type constraints run first; domain-specific verifiers run only against operations that already pass generic checks.
The diagnostic strings emitted by the verifier are part of the public contract. Frontends and tests key off the wording to distinguish "I emitted illegal IR" from "I hit a compiler bug." This page enumerates the verbatim diagnostics by category and shows the branch logic that decides which message a given operation receives.
Verification Pipeline
LogicalResult verify_cuda_tile_operation(Operation op, Target target) {
if (failed(verify_generic_traits(op))) {
return failure();
}
switch (op.family) {
case FAMILY_FLOAT_ARITH: return verify_float_arith(op);
case FAMILY_CONVERSION: return verify_conversion(op);
case FAMILY_TOKEN_MEMORY: return verify_token_memory(op);
case FAMILY_CONTROL_FLOW: return verify_control_flow(op);
case FAMILY_VIEW_AND_SHAPE: return verify_view_and_shape(op);
case FAMILY_AGGREGATE: return verify_aggregate(op);
case FAMILY_MMA: return verify_mma(op, target);
case FAMILY_ASSUMPTION: return verify_assume(op);
default: return success();
}
}
Generic verification covers operand count, result count, region count, required attribute presence, and trait-driven type constraints. The first failure short-circuits the rest of the pipeline so a malformed operand list never reaches a domain verifier that would dereference it.
Block-Structure Diagnostics
The strongest invariant covers region structure. Region-bearing operations require a non-empty entry block, block argument types that obey the operation's region contract, and a terminator whose operand list lines up with both the block-argument list and the enclosing op's result list.
LogicalResult verify_region_structure(Operation op) {
for (Region region : op.regions) {
Block entry = region.entry_block;
if (entry.operations.empty()) {
return op.emit_error("expect non-empty block");
}
for (size_t i = 0; i < entry.arguments.size(); ++i) {
Type arg_type = entry.arguments[i].type;
if (!isa<TileType>(arg_type)) {
return op.emit_error("expected TileType for block arguments but got types: ")
.append_type_list(entry.argument_types());
}
if (cast<TileType>(arg_type).rank != 0) {
return op.emit_error("expect 0-rank tile type at index: ").append(i);
}
}
Operation terminator = entry.terminator;
if (terminator.operands.size() != op.expected_terminator_arity()) {
return op.emit_error("expect number of terminators operands (")
.append(terminator.operands.size())
.append(") to match expected (")
.append(op.expected_terminator_arity())
.append(")");
}
for (size_t i = 0; i < terminator.operands.size(); ++i) {
Type tt = terminator.operands[i].type;
Type bt = entry.arguments[i].type;
if (tt != bt) {
return op.emit_error("expected TileType for operand and terminator types but got: ")
.append(tt).append(" vs ").append(bt);
}
}
}
return success();
}
The diagnostics emitted by this code path are:
| Diagnostic | Cause |
|---|---|
| "expect non-empty block" | A region's entry block contains no operations (no terminator either). |
| "expect 0-rank tile type at index: " followed by the index | A reduction or scan body block argument is not the required rank-zero tile. |
| "expected TileType for block arguments but got types: " followed by the argument-type list | A body block argument carries a non-tile type. |
| "expect number of terminators operands (" followed by actual-vs-expected | The terminator's operand list does not match the parent's result-type list. |
| "expected TileType for operand and terminator types but got: " followed by both types | A terminator operand's element type does not match the corresponding block-argument element type. |
| "only pure operations allowed" | A reduction or scan body contains an operation with memory effects. |
| "invalid op: " followed by the operation name | A body region contains an operation kind the family rejects categorically (for example, a cuda_tile.return inside a reduce body). |
These messages name the failing slot in the contract, not the implementation function that produced them. A producer reading them can locate the operand or argument by index without consulting compiler internals.
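Because the strings are stable, a frontend test harness can bucket failures by prefix. The sketch below assumes the harness receives the bare message text without any location or severity prefix; the BlockDiag enum and the function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical categories a frontend test harness might track.
enum class BlockDiag {
  EmptyBlock, NonScalarArg, NonTileArg, TerminatorArity,
  TerminatorType, ImpureBody, InvalidOp, Unknown
};

// True when `s` begins with `prefix`.
inline bool starts_with(const std::string& s, const std::string& prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

// Maps a verbatim block-structure diagnostic to a category.
inline BlockDiag classify(const std::string& msg) {
  if (msg == "expect non-empty block") return BlockDiag::EmptyBlock;
  if (msg == "only pure operations allowed") return BlockDiag::ImpureBody;
  if (starts_with(msg, "expect 0-rank tile type at index: "))
    return BlockDiag::NonScalarArg;
  if (starts_with(msg, "expected TileType for block arguments but got types: "))
    return BlockDiag::NonTileArg;
  if (starts_with(msg, "expect number of terminators operands ("))
    return BlockDiag::TerminatorArity;
  if (starts_with(msg,
                  "expected TileType for operand and terminator types but got: "))
    return BlockDiag::TerminatorType;
  if (starts_with(msg, "invalid op: ")) return BlockDiag::InvalidOp;
  return BlockDiag::Unknown;
}
```

Anything that falls into Unknown is a candidate for "compiler bug" rather than "illegal IR", which is the distinction the stable wording exists to support.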
Type-Compatibility Diagnostics
Floating arithmetic preserves the producer's numeric choices. The verifier rejects illegal rounding modes, illegal flush-to-zero applications, and mismatched element types in the same arithmetic op.
LogicalResult verify_float_arith(Operation op) {
if (!operand_and_result_types_equal(op)) {
return op.emit_error("expected matching operand and result element types");
}
if (op.has_flush_to_zero && op.element_type != f32_type()) {
return op.emit_error("flush-to-zero is only legal for f32 element type");
}
if (op.kind == DIVF) {
if (!rounding_is_valid_for_division(op.rounding)) {
return op.emit_error("invalid rounding mode for divf");
}
if ((op.rounding == APPROX || op.rounding == FULL)
&& op.element_type != f32_type()) {
return op.emit_error("approx and full rounding modes require f32");
}
} else if (!rounding_is_ieee_basic(op.rounding)) {
return op.emit_error("rounding mode not allowed outside divf");
}
return success();
}
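Conversions check width direction and rounding mode. exti must widen, trunci must narrow, ftof rejects identity conversions, and bitcast requires equal widths. ftof and itof accept only nearest-even rounding, and ftoi accepts only truncation toward zero.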
LogicalResult verify_conversion(Operation op) {
int from = bit_width(op.input.element_type);
int to = bit_width(op.result.element_type);
switch (op.kind) {
case EXTI:
if (from >= to) {
return op.emit_error("exti requires the result width to be strictly greater than the input width");
}
break;
case TRUNCI:
if (from <= to) {
return op.emit_error("trunci requires the result width to be strictly less than the input width");
}
break;
case FTOF:
if (from == to) {
return op.emit_error("ftof rejects identity conversions");
}
if (op.rounding != NEAREST_EVEN) {
return op.emit_error("ftof requires nearest-even rounding");
}
break;
case ITOF:
if (op.rounding != NEAREST_EVEN) {
return op.emit_error("itof requires nearest-even rounding");
}
break;
case FTOI:
if (op.rounding != NEAREST_INT_TO_ZERO) {
return op.emit_error("ftoi requires truncation toward zero");
}
break;
case BITCAST:
if (from != to) {
return op.emit_error("bitcast requires equal-width source and destination element types");
}
break;
}
return success();
}
Operand-Shape Diagnostics
Token-ordered memory operations check three independent contracts: the pointer or view's pointee/element type matches the value type, the mask shape matches the tile shape, and the memory ordering and scope form a legal pair.
LogicalResult verify_token_memory(Operation op) {
if (op.pointer.pointee != op.value.element_type) {
return op.emit_error("pointer pointee element type must match value element type");
}
if (!shapes_compatible(op.pointer_tile, op.value)) {
return op.emit_error("pointer tile shape must match value shape");
}
if (op.has_mask && op.mask.shape != op.value.shape) {
return op.emit_error("mask shape must match value shape");
}
if (op.kind == LOAD) {
if (op.ordering != WEAK && op.ordering != RELAXED && op.ordering != ACQUIRE) {
return op.emit_error("load ordering must be weak, relaxed, or acquire");
}
} else if (op.kind == STORE) {
if (op.ordering != WEAK && op.ordering != RELAXED && op.ordering != RELEASE) {
return op.emit_error("store ordering must be weak, relaxed, or release");
}
}
if (op.ordering == WEAK && op.scope.has_value) {
return op.emit_error("weak memory ordering must not carry a scope");
}
if (op.ordering != WEAK && !op.scope.has_value) {
return op.emit_error("non-weak memory ordering requires an explicit scope");
}
return success();
}
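The ordering and scope rules from verify_token_memory can be exercised in isolation. The sketch below copies the diagnostic strings from the pseudocode above; the enum definitions and the scope names (Cta, Gpu, System) are assumptions made so the example compiles on its own.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class Kind { Load, Store };
enum class Ordering { Weak, Relaxed, Acquire, Release };
enum class Scope { Cta, Gpu, System };  // hypothetical scope names

// Returns an empty string when the (kind, ordering, scope) triple is legal,
// otherwise the diagnostic the pseudocode above would emit.
inline std::string check_ordering(Kind kind, Ordering ord,
                                  std::optional<Scope> scope) {
  if (kind == Kind::Load && ord != Ordering::Weak &&
      ord != Ordering::Relaxed && ord != Ordering::Acquire)
    return "load ordering must be weak, relaxed, or acquire";
  if (kind == Kind::Store && ord != Ordering::Weak &&
      ord != Ordering::Relaxed && ord != Ordering::Release)
    return "store ordering must be weak, relaxed, or release";
  if (ord == Ordering::Weak && scope.has_value())
    return "weak memory ordering must not carry a scope";
  if (ord != Ordering::Weak && !scope.has_value())
    return "non-weak memory ordering requires an explicit scope";
  return "";
}
```

The ordering check runs before the scope check, so a load with release ordering and no scope reports the ordering error, not the missing scope.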
Atomic RMW further restricts the element type by mode. Bitwise and integer-arithmetic modes accept only i32 or i64; floating add requires a floating type the target can atomically update; exchange and compare-and-swap require an element width the hardware supports atomically.
LogicalResult verify_atomic_rmw(AtomicRmwOp op) {
if (failed(verify_token_memory(op))) {
return failure();
}
switch (op.mode) {
case AND: case OR: case XOR:
case ADD: case MAX: case MIN: case UMAX: case UMIN:
if (op.value.element_type != i32_type() && op.value.element_type != i64_type()) {
return op.emit_error("integer rmw mode requires i32 or i64 element type");
}
break;
case ADDF:
if (!is_supported_atomic_float(op.value.element_type)) {
return op.emit_error("floating-add rmw requires a target-supported floating width");
}
break;
case XCHG:
if (!is_supported_atomic_xchg(op.value.element_type)) {
return op.emit_error("xchg rmw requires a target-supported atomic width");
}
break;
}
return success();
}
View and Shape Diagnostics
View construction is verified before memory lowering. make_tensor_view checks that the base-pointer pointee matches the view's element type and that the dynamic shape/stride operand count matches the type's dynamic-slot count. make_partition_view checks that the operand tensor view matches the tensor view embedded in the partition type.
Shape operations carry tight element-count and rank rules:
LogicalResult verify_reshape(ReshapeOp op) {
if (op.source.element_type != op.result.element_type) {
return op.emit_error("reshape requires matching element types");
}
int64_t in = num_elements(op.source.shape);
int64_t out = num_elements(op.result.shape);
if (in != out) {
return op.emit_error("reshape element-count mismatch: source has ")
.append(in).append(" elements, result has ").append(out);
}
return success();
}
A concrete reshape failure walk illustrates the contract. Consider the input
%dst = cuda_tile.reshape %src : tile<8x8xf32> -> tile<7x9xf32>
The verifier computes 8·8 = 64 elements in and 7·9 = 63 elements out, the equality check fails, and the diagnostic identifies both counts so the producer knows whether to fix the source rank/shape, the result rank/shape, or both. The op never reaches lowering, so no downstream pass needs to defend against a partially-typed reshape.
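The element-count arithmetic in that walk can be reproduced with a few lines of C++. This is a sketch, not the verifier: shapes are modelled as plain vectors, and check_reshape is a hypothetical helper that returns the message text from verify_reshape above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Product of all dimensions; a rank-0 shape has one element.
inline int64_t num_elements(const std::vector<int64_t>& shape) {
  int64_t n = 1;
  for (int64_t d : shape) {
    assert(d > 0);  // tile dimensions are static and positive
    n *= d;
  }
  return n;
}

// Returns an empty string when the counts agree, otherwise the diagnostic.
inline std::string check_reshape(const std::vector<int64_t>& src,
                                 const std::vector<int64_t>& dst) {
  const int64_t in = num_elements(src);
  const int64_t out = num_elements(dst);
  if (in != out) {
    return "reshape element-count mismatch: source has " + std::to_string(in) +
           " elements, result has " + std::to_string(out);
  }
  return "";
}
```

For the 8x8 to 7x9 case the helper reports 64 source elements against 63 result elements; an 8x8 to 4x16 reshape passes because both sides hold 64.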
extract, cat, permute, and iota carry parallel contracts. extract divides every source dimension by the corresponding result dimension and rejects non-integer ratios. cat accepts only inputs that agree on rank and element type and differ on the concatenation axis. permute requires a dense permutation of the input rank with no repeated indices. iota produces a one-dimensional integer tile whose extent fits in the chosen integer width.
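Two of these contracts, the permute index check and the extract divisibility check, are simple enough to sketch directly. The function names are hypothetical and the sketch assumes extract keeps the source rank; the real verifiers also check element types and emit diagnostics, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True when `perm` lists every index in [0, rank) exactly once.
inline bool is_dense_permutation(const std::vector<int64_t>& perm,
                                 std::size_t rank) {
  if (perm.size() != rank) return false;
  std::vector<bool> seen(rank, false);
  for (int64_t p : perm) {
    if (p < 0 || static_cast<std::size_t>(p) >= rank) return false;
    if (seen[static_cast<std::size_t>(p)]) return false;  // repeated index
    seen[static_cast<std::size_t>(p)] = true;
  }
  return true;
}

// True when every source dimension is an integer multiple of the
// corresponding result dimension.
inline bool extract_shape_ok(const std::vector<int64_t>& src,
                             const std::vector<int64_t>& dst) {
  if (src.size() != dst.size()) return false;
  for (std::size_t i = 0; i < src.size(); ++i) {
    if (dst[i] <= 0 || src[i] % dst[i] != 0) return false;
  }
  return true;
}
```

For example, the index list 2, 0, 1 is a valid rank-3 permutation while 0, 0, 1 is not, and an 8x16 source can be extracted into 4x16 pieces but not 3x16 pieces.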
Structured Control Flow
Region verifiers reject view-typed loop-carried values and view-typed branch results. Views describe memory that may outlive the region that constructs them; allowing them as iter-args or yields would let a frontend smuggle aliasing past the verifier and into later passes that assume views never escape a region.
LogicalResult verify_if(IfOp op) {
if (!op.then_region.exists()) {
return op.emit_error("if requires a then-region");
}
if (op.results.empty()) {
return success();
}
if (!op.else_region.exists()) {
return op.emit_error("if with results requires an else-region");
}
if (failed(verify_region_yields(op.then_region, op.result_types))) {
return failure();
}
if (failed(verify_region_yields(op.else_region, op.result_types))) {
return failure();
}
for (Type t : op.result_types) {
if (is_view_type(t)) {
return op.emit_error("view-typed if results are not permitted");
}
}
return success();
}
LogicalResult verify_for(ForOp op) {
if (op.induction_var.type != op.lower_bound.type
|| op.lower_bound.type != op.upper_bound.type
|| op.lower_bound.type != op.step.type) {
return op.emit_error("for induction, lower, upper, and step must share an integer type");
}
if (op.init_values.types != op.region_iter_arg_types) {
return op.emit_error("for init values must match region iter-arg types");
}
if (op.result_types != op.region_iter_arg_types) {
return op.emit_error("for result types must match region iter-arg types");
}
for (Type t : op.result_types) {
if (is_view_type(t)) {
return op.emit_error("view-typed for results are not permitted");
}
}
return success();
}
For break and continue, the verifier walks outward through enclosing if regions until it reaches the nearest non-if ancestor, which must be a compatible loop or for. An early-exit op whose nearest non-if ancestor is an unrelated parent is rejected.
LogicalResult verify_early_exit(Operation op, Set<OpKind> allowed_parents) {
Operation parent = op.parent;
while (parent.kind == IF) {
parent = parent.parent;
}
if (!allowed_parents.contains(parent.kind)) {
return op.emit_error("early-exit must be enclosed by a compatible loop or for");
}
if (op.operands.types != expected_exit_types(parent, op.kind)) {
return op.emit_error("early-exit operand types must match the enclosing region contract");
}
return success();
}
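The parent walk in verify_early_exit can be modelled with a bare parent-pointer chain. The Op struct and OpKind enum below are hypothetical stand-ins for the real operation hierarchy, and the operand-type check is left out.

```cpp
#include <cassert>
#include <set>

enum class OpKind { Func, If, For, Loop, Break, Continue };

// Hypothetical minimal operation: a kind plus a link to the enclosing op.
struct Op {
  OpKind kind;
  const Op* parent;  // nullptr at the top level
};

// Skips every enclosing `if` and returns the first ancestor of another kind.
inline const Op* first_non_if_ancestor(const Op& op) {
  const Op* p = op.parent;
  while (p != nullptr && p->kind == OpKind::If) p = p->parent;
  return p;
}

// True when the early-exit op sits, possibly through nested ifs, directly
// inside one of the allowed parent kinds.
inline bool early_exit_is_enclosed(const Op& exit_op,
                                   const std::set<OpKind>& allowed) {
  const Op* p = first_non_if_ancestor(exit_op);
  return p != nullptr && allowed.count(p->kind) > 0;
}
```

A break nested two ifs deep inside a for passes, while a break inside an if that hangs directly off a function body fails, because the walk stops at the function rather than at a loop.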
Special-Op Diagnostics
Reductions and scans share a verifier. The body region must be pure, must consume pairs of rank-zero tile arguments, and must yield one value per input.
LogicalResult verify_aggregate(AggregateOp op) {
if (op.inputs.empty()) {
return op.emit_error("reduce/scan requires at least one input");
}
if (op.inputs.length != op.results.length) {
return op.emit_error("reduce/scan must produce one result per input");
}
if (op.dim < 0 || op.dim >= op.inputs[0].rank) {
return op.emit_error("reduction dimension is out of range");
}
for (size_t i = 0; i < op.identities.length; ++i) {
if (op.identities[i].type != op.inputs[i].element_type) {
return op.emit_error("identity element type must match input element type");
}
}
Block body = op.body.entry_block;
if (body.arguments.length != 2 * op.inputs.length) {
return op.emit_error("reduce/scan body must take two rank-zero arguments per input");
}
for (size_t i = 0; i < body.arguments.length; ++i) {
TileType arg = cast<TileType>(body.arguments[i].type);
if (arg.rank != 0) {
return op.emit_error("expect 0-rank tile type at index: ").append(i);
}
}
for (Operation nested : op.body.operations) {
if (!nested.is_memory_effect_free()) {
return op.emit_error("only pure operations allowed");
}
if (!nested.kind_legal_in_aggregate_body()) {
return op.emit_error("invalid op: ").append(nested.name);
}
}
return verify_region_yields(op.body, op.result_types);
}
The diagnostic "only pure operations allowed" fires the moment the body walk encounters an op with memory effects; "invalid op: " fires for an op kind the body region rejects categorically (for example, a cuda_tile.return inside a reduce body).
MMA Diagnostics
Floating and integer MMA share their shape rules but diverge on element type. Operands are rank-2 or rank-3 (the rank-3 form is batched). Contracting dimensions must agree, accumulator and result shapes must match, and integer MMA must carry explicit signedness attributes.
LogicalResult verify_mma(MmaOp op, Target target) {
if (op.lhs.rank != 2 && op.lhs.rank != 3) {
return op.emit_error("mma operand A must be rank-2 or rank-3");
}
if (op.rhs.rank != op.lhs.rank
|| op.acc.rank != op.lhs.rank
|| op.result.rank != op.lhs.rank) {
return op.emit_error("mma operands must share rank");
}
if (op.lhs.k_dim != op.rhs.k_dim) {
return op.emit_error("mma contracting dimension mismatch");
}
if (op.lhs.m_dim != op.acc.m_dim || op.rhs.n_dim != op.acc.n_dim) {
return op.emit_error("mma accumulator must agree with A and B on M and N");
}
if (op.acc.shape != op.result.shape) {
return op.emit_error("mma accumulator and result shapes must match");
}
if (op.is_integer) {
if (!op.has_signedness_lhs || !op.has_signedness_rhs) {
return op.emit_error("expect signedness attribute for operand A");
}
if (op.acc.element_type != i32_type() || op.result.element_type != i32_type()) {
return op.emit_error("integer mma accumulator and result must be i32");
}
if (!is_legal_integer_mma_input(op.lhs.element_type)
|| op.lhs.element_type != op.rhs.element_type) {
return op.emit_error("integer mma inputs must share a legal integer element type");
}
} else {
if (!is_legal_float_mma_tuple(op.lhs.element_type, op.acc.element_type)) {
return op.emit_error("floating mma input/accumulator pair is not supported on the target");
}
if (op.acc.type != op.result.type) {
return op.emit_error("floating mma accumulator and result must share type");
}
}
return success();
}
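The shape half of verify_mma can be tried out on plain dimension vectors. The sketch assumes the conventional layout, A as [batch,] M x K, B as [batch,] K x N, and the accumulator as [batch,] M x N, with the last two dimensions holding the matrix; that layout is an assumption, since this section does not define m_dim, k_dim, or n_dim. Batch-dimension agreement and the result operand are not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Shape = std::vector<int64_t>;

// Returns an empty string when the shapes are compatible, otherwise the
// matching diagnostic from verify_mma above.
inline std::string check_mma_shapes(const Shape& a, const Shape& b,
                                    const Shape& acc) {
  if (a.size() != 2 && a.size() != 3)
    return "mma operand A must be rank-2 or rank-3";
  if (b.size() != a.size() || acc.size() != a.size())
    return "mma operands must share rank";
  const std::size_t r = a.size();
  // Assumed layout: the last two dimensions are the matrix dimensions.
  const int64_t a_m = a[r - 2], a_k = a[r - 1];
  const int64_t b_k = b[r - 2], b_n = b[r - 1];
  if (a_k != b_k) return "mma contracting dimension mismatch";
  if (a_m != acc[r - 2] || b_n != acc[r - 1])
    return "mma accumulator must agree with A and B on M and N";
  return "";
}
```

A 16x8 by 8x32 product with a 16x32 accumulator passes; changing B to 4x32 trips the contracting-dimension check before the accumulator is examined.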
Assumption Predicates
assume accepts a value and a list of predicate attributes. Each predicate runs against the constrained value's type and its own parameters; the first failure short-circuits the rest.
LogicalResult verify_assume(AssumeOp op) {
Type t = op.value.type;
for (Attribute pred : op.predicates) {
if (auto verifier = dyn_cast_assume_predicate(pred)) {
if (failed(verifier.verify(pred, t, op))) {
return failure();
}
}
}
return success();
}
div_by, bounded, and same_elements each implement that interface. Their per-predicate diagnostics are documented alongside the attribute surface for the dialect they share with nv_tileaa.
Diagnostic Stability
Verifier diagnostics are a producer-facing contract. Keep them specific — name the failing slot, the offending type, or the integer index of the bad argument. Avoid generic phrases that force the producer to read compiler source to interpret the failure. The verbatim strings above (and the typo-preserving variants from later TileAS verification) survive across builds because external infrastructure has been matching them; silently rewording a diagnostic breaks downstream log capture.
Invariants
- Generic ODS-style checks run before any domain-specific verifier.
- Floating rounding and flush-to-zero policies are target- and type-checked.
- Conversions are not allowed to hide no-op casts.
- Weak memory ordering rejects scope; non-weak memory ordering requires it.
- Token-ordered operations preserve token inputs and outputs.
- Structured control-flow results cannot smuggle view lifetimes across regions.
- Aggregate bodies are pure and yield the expected result types per input.
- MMA verifiers check shape and element-type legality before lowering.
- Diagnostic wording is stable; reimplementations must reproduce it byte-for-byte.
Cross-References
Operation Roster catalogues the operations these verifiers run against. Types and Attributes describes the underlying tile, view, pointer, and token types and their structural contracts. Canonicalizers and Folds describes the rewrites that run after verification succeeds. The nv_tileaa verifier set in nv_tileaa Types, Attributes, Verifiers reuses the assumption-predicate contract documented here.
cuda_tile Canonicalizers and Folds
Abstract
cuda_tile keeps the public IR simple and leaves the heavy lifting to private
lowering. Its public fold surface is deliberately small: constants fold to
their value, a guarded floating add folds when both operands are safe
constants, if inverts a negated condition by swapping branches, and select
carries the usual identity, constant-condition, boolean, compare, and
nested-select folds. A separate expression simplifier handles deeper
recursive cleanup before conversion; the rules below capture the public
semantics.
Beneath the fold surface sits a larger pattern set. cuda_tile.if registers
eight rewrite patterns, and cuda_tile.select adds three more. The eleven
patterns drive structural canonicalization for control-flow tile ops; they run
during the dialect's own canonicalize step, not as part of dialect-to-dialect
conversion. The two non-trivial entries — CombineIfs and CombineNestedIfs
— rewrite across more than one operation and are documented below as
input/output IR pairs.
Fold Surface
| Operation | Fold | Safety condition |
|---|---|---|
| constant | Returns the literal attribute. | Always safe. |
| addf | Constant plus constant becomes a constant sum. | Both operands are finite constants and the fold can preserve the expected floating semantics. |
| if | if (xori(cond, true)) then A else B becomes if (cond) then B else A. | The else region is present and the RHS of xori is the boolean constant true. |
| select | Applies identity, constant-condition, boolean, compare, invert, and nested-select rules. | Rules must preserve result type and avoid materializing side effects. |
The small surface is deliberate. Public cuda_tile folding strips obvious
redundancy without committing to target-specific numeric or memory choices
too early.
Constant and AddF
cuda_tile.constant is the canonical literal operation. Folding it returns the
attribute value directly.
OpFoldResult fold_constant(ConstantOp op) {
return op.value;
}
addf folds only when both operands are constants that need no special NaN
or infinity handling. Algebraic identities like x + 0 wait for later
canonicalization because floating flags and target lowering can change their
legality.
Optional<Attribute> fold_addf(AddFOp op) {
Optional<FloatAttr> lhs = finite_float_constant(op.lhs);
Optional<FloatAttr> rhs = finite_float_constant(op.rhs);
if (!lhs.has_value || !rhs.has_value) {
return none();
}
FloatAttr sum = add_with_declared_semantics(lhs.value, rhs.value, op.rounding);
return cast_to_result_element_type(sum, op.result.type);
}
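The finite-operands guard is easy to demonstrate on host floats. The sketch below is an f32-only stand-in: a std::optional models "operand is a known constant", the function name is hypothetical, and the addition uses the host's default round-to-nearest-even rather than a rounding mode carried by the op.

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Folds lhs + rhs only when both operands are known, finite constants.
// Returns std::nullopt to decline the fold.
inline std::optional<float> fold_addf_f32(std::optional<float> lhs,
                                          std::optional<float> rhs) {
  if (!lhs.has_value() || !rhs.has_value()) {
    return std::nullopt;  // at least one operand is not a constant
  }
  if (!std::isfinite(*lhs) || !std::isfinite(*rhs)) {
    return std::nullopt;  // NaN or infinity: leave the op for later stages
  }
  return *lhs + *rhs;  // host default rounding, an assumption of this sketch
}
```

Declining on a NaN or infinite operand keeps forms such as inf + (-inf) in the IR, so the producer's chosen semantics are decided by later lowering rather than by the folder.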
If Inversion
The if fold spots a boolean negation expressed as xori(cond, true),
rewrites the condition to cond, and swaps the then and else regions. The
rewrite is in-place and produces no replacement values.
RewriteResult fold_if_negated_condition(IfOp op) {
XorIOp xori = dyn_cast_xori(op.condition.defining_op);
if (!xori.valid) {
return no_change();
}
if (!is_constant_true(xori.rhs)) {
return no_change();
}
if (op.else_region.empty()) {
return no_change();
}
op.condition = xori.lhs;
swap_regions(op.then_region, op.else_region);
return changed();
}
The fold is correct because if (!c) A else B is equivalent to
if (c) B else A whenever both branches exist and region result types already
match the verifier contract.
IfOp Canonicalization Pattern Set
Eight patterns registered for cuda_tile.if cover the structural rewrites the
dialect performs on its own ops before any conversion driver runs:
| Pattern | Action |
|---|---|
| RemoveUnusedResults | Drops if results that have no uses. |
| ReplaceYieldWithValue | When both then- and else-branches yield the same SSA value, replaces the if with that value. |
| RemoveStaticCondition | When the condition is a compile-time constant, replaces the if with the chosen branch's contents. |
| ConvertToSelect | When both branches are single-op yield-only, rewrites into cuda_tile.select. |
| RemoveEmptyElseBranch | Drops empty else-branches that yield no values. |
| CombineIfs | Two adjacent ifs with the same condition merged into one. |
| CombineNestedIfs | Nested if (a) { if (b) { ... } } rewritten as if (a && b) { ... }. |
| MoveTerminatorToParent | When a branch ends with a cuda_tile.return, hoists it past the if. |
The two structural combiners (CombineIfs and CombineNestedIfs) rewrite
across more than one operation. Each is documented below as an input/output
pair plus a precise match predicate.
CombineIfs
The pattern fires on two adjacent cuda_tile.if ops in the same block whose
conditions are pointer-identical SSA values. Identity (not value equality) is
the match criterion: identity comparison is constant-time, and any earlier
canonicalization that normalized one of the conditions has already replaced
the SSA value at every use site.
Input IR:
%a, %b = cuda_tile.if %cond -> (tile<128xf32>, tile<128xi1>) {
%x = cuda_tile.mulf %lhs, %rhs : tile<128xf32>
%m = cuda_tile.cmpf olt, %x, %thr : tile<128xf32>
cuda_tile.yield %x, %m : tile<128xf32>, tile<128xi1>
} else {
cuda_tile.yield %zero, %fmask : tile<128xf32>, tile<128xi1>
}
%c = cuda_tile.if %cond -> (tile<128xf32>) {
%y = cuda_tile.addf %a, %a : tile<128xf32>
cuda_tile.yield %y : tile<128xf32>
} else {
cuda_tile.yield %a : tile<128xf32>
}
Output IR:
%a, %b, %c = cuda_tile.if %cond -> (tile<128xf32>, tile<128xi1>, tile<128xf32>) {
%x = cuda_tile.mulf %lhs, %rhs : tile<128xf32>
%m = cuda_tile.cmpf olt, %x, %thr : tile<128xf32>
%y = cuda_tile.addf %x, %x : tile<128xf32>
cuda_tile.yield %x, %m, %y : tile<128xf32>, tile<128xi1>, tile<128xf32>
} else {
cuda_tile.yield %zero, %fmask, %zero : tile<128xf32>, tile<128xi1>, tile<128xf32>
}
Match predicate (all must hold):
- first and second are both cuda_tile.if.
- first.condition and second.condition are the same SSA value (pointer-identical).
- second immediately follows first in the same block; no other op separates them.
- Each use of a result of first that lives inside second's regions has already been rewritten; otherwise the merge would create a dominance violation.
The two then-regions are concatenated in source order; the two else-regions are concatenated in source order; each yield-list is the concatenation of the original yield-lists. Uses of the original results redirect to the corresponding slice of the combined result list before either original is erased.
CombineNestedIfs
The pattern fires on an outer cuda_tile.if whose then-region is exactly one
inner cuda_tile.if followed by a yield that forwards the inner op's results,
and whose else-region yields poison or undef for every outer result. Under
those preconditions, the two condition tests fold into one cuda_tile.andi
without changing observable semantics: the outer else-branch was already
producing an undefined value.
Input IR:
%r = cuda_tile.if %a -> (tile<64xi32>) {
%inner = cuda_tile.if %b -> (tile<64xi32>) {
%v = cuda_tile.muli %x, %y : tile<64xi32>
cuda_tile.yield %v : tile<64xi32>
} else {
cuda_tile.yield %poison : tile<64xi32>
}
cuda_tile.yield %inner : tile<64xi32>
} else {
cuda_tile.yield %poison : tile<64xi32>
}
Output IR:
%conj = cuda_tile.andi %a, %b : i1
%r = cuda_tile.if %conj -> (tile<64xi32>) {
%v = cuda_tile.muli %x, %y : tile<64xi32>
cuda_tile.yield %v : tile<64xi32>
} else {
cuda_tile.yield %poison : tile<64xi32>
}
Match predicate:
- outer.then_region has exactly two ops: an inner cuda_tile.if and a yield that forwards the inner op's results unchanged.
- The inner op's result-type list matches outer's result-type list.
- outer.else_region's yield supplies poison/undef for every result, so the outer-else value is already unobservable.
- Both outer.condition and inner.condition are i1 scalars.
The rewrite emits cuda_tile.andi %outer.cond, %inner.cond : i1 to build the
combined predicate, hoists the inner then-region's body into a new outer-shaped
if, and forwards the original else-region (still yielding poison). Because the
outer else-branch was already undefined, the rewrite preserves every legitimate
observation. The combined predicate may itself fold later if either input is a
constant.
Select Rules
select tries value-preserving folds in a fixed order. Same-arm folding
runs before constant-condition folding; boolean identity folding runs before
the more expensive rewrites. Swap that order and folds shadow each other.
Optional<Value> fold_select(SelectOp op) {
if (op.true_value == op.false_value) {
return op.true_value;
}
Optional<bool> cond = constant_bool(op.condition);
if (cond.has_value) {
return cond.value ? op.true_value : op.false_value;
}
if (is_i1_tile(op.result.type)) {
if (is_true(op.true_value) && is_false(op.false_value)) {
return op.condition;
}
}
Optional<Value> cmp_fold = fold_select_with_compare(op);
if (cmp_fold.has_value) {
return cmp_fold.value;
}
Optional<Value> xor_fold = fold_select_with_inverted_condition(op);
if (xor_fold.has_value) {
return xor_fold.value;
}
return fold_select_with_nested_select(op);
}
The inverted-condition case may mutate the op by replacing the condition with
the underlying value and swapping arms. The nested-select case collapses
patterns like select(c, select(c, a, b), d) whenever doing so erases a
duplicate condition test.
SelectOp Canonicalization Pattern Set
Alongside the fold logic above, cuda_tile.select registers three standalone
rewrite patterns:
| Pattern | Action |
|---|---|
| ReplaceConstantSelect | select(true, a, b) becomes a; select(false, a, b) becomes b. |
| ReplaceIdenticalSelect | select(c, a, a) becomes a. |
| InverseConditionSelect | select(not c, a, b) becomes select(c, b, a). |
The constant and identical patterns overlap with the corresponding fold rules but stay registered because the canonicalize driver applies them to operations the fold path skips — for example, after a peer rewrite materializes a constant where there was previously a variable condition.
Recursive Expression Simplifier
A private expression simplifier handles repeated scalar, mask, and integer-like cleanup. It lowers expression fragments into a compact tree, memoizes simplified nodes, and rebuilds canonical operations. The private expression IR, kind table, and memoization caches are documented in cuda_tile Simplifier Walker. The useful algorithm is ordinary fixed-point simplification:
Value simplify_expr(Node node, Mode mode, int depth) {
if (depth > max_simplifier_depth()) {
return rebuild_without_simplifying(node);
}
CacheKey key = { .node = node, .mode = mode };
if (cache_contains(key)) {
return cache_get(key);
}
SmallVector<Value> operands;
for (Node child : node.operands) {
operands.push(simplify_expr(child, mode, depth + 1));
}
Value simplified = simplify_by_kind(node.kind, operands, node.flags);
cache_put(key, simplified);
return simplified;
}
Typical rules include boolean negation cleanup, variadic and/or folding,
integer comparison folding, arithmetic simplification, bit-vector constants,
and select simplification. Keep the simplifier deterministic and bounded —
every recursive path needs a depth limit and a memoization cache.
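A toy version of the memoized, depth-bounded walk, restricted to boolean not/and nodes. Everything here is illustrative: the Node layout, the Simplifier struct, and the rule set are assumptions, the cache is keyed on the node alone because the sketch has no mode, and the private expression IR is documented elsewhere.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

enum class Kind { Var, Const, Not, And };
struct Node;
using NodePtr = std::shared_ptr<const Node>;
struct Node {
  Kind kind;
  bool value = false;             // meaningful for Const
  int id = 0;                     // meaningful for Var
  std::vector<NodePtr> operands;  // one for Not, any number for And
};

inline NodePtr make_const(bool v) { return std::make_shared<Node>(Node{Kind::Const, v, 0, {}}); }
inline NodePtr make_var(int id) { return std::make_shared<Node>(Node{Kind::Var, false, id, {}}); }
inline NodePtr make_not(NodePtr x) { return std::make_shared<Node>(Node{Kind::Not, false, 0, {std::move(x)}}); }
inline NodePtr make_and(std::vector<NodePtr> xs) { return std::make_shared<Node>(Node{Kind::And, false, 0, std::move(xs)}); }

struct Simplifier {
  int max_depth = 64;                     // hypothetical depth cap
  std::map<const Node*, NodePtr> cache;   // memoized results per input node

  NodePtr run(const NodePtr& n, int depth = 0) {
    if (depth > max_depth || n->operands.empty()) return n;  // cap hit or leaf
    auto hit = cache.find(n.get());
    if (hit != cache.end()) return hit->second;
    std::vector<NodePtr> ops;
    for (const NodePtr& child : n->operands) ops.push_back(run(child, depth + 1));
    NodePtr out = rebuild(*n, std::move(ops));
    cache.emplace(n.get(), out);
    return out;
  }

  static NodePtr rebuild(const Node& n, std::vector<NodePtr> ops) {
    if (n.kind == Kind::Not) {
      if (ops[0]->kind == Kind::Const) return make_const(!ops[0]->value);
      if (ops[0]->kind == Kind::Not) return ops[0]->operands[0];  // not(not x)
      return make_not(ops[0]);
    }
    if (n.kind == Kind::And) {
      std::vector<NodePtr> kept;
      for (const NodePtr& o : ops) {
        if (o->kind == Kind::Const && !o->value) return make_const(false);
        if (o->kind != Kind::Const) kept.push_back(o);  // drop `true`
      }
      if (kept.empty()) return make_const(true);
      if (kept.size() == 1) return kept[0];
      return make_and(std::move(kept));
    }
    return std::make_shared<Node>(n);  // other kinds pass through unchanged
  }
};
```

Simplifying and(true, not(not(x))) returns the original x node, and a second call on the same root is answered from the cache. Nodes below the depth cap are returned unsimplified, which bounds the recursion at the cost of missing deep rewrites.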
Canonicalization Driver
The public canonicalization set is small and predictable. The driver applies fold rules and rewrite patterns in a greedy fixed-point loop, then runs the recursive expression simplifier once over the residual IR.
The rewrite pipeline carries three contracts:
- Pure tile rewrites never reorder, duplicate, or erase a token-ordered memory operation. combine_adjacent_ifs may merge two ifs only when no side-effecting op sits between them in the parent block, because the merge reshuffles the operation's position relative to the surrounding token chain.
- Floating folds preserve declared rounding mode and flush-to-zero policy. fold_addf is the only constant-folding rule that touches floats and runs only when both operands are finite constants, so the rule does not silently rewrite an inf - inf form whose semantics the producer chose.
- Region rewrites preserve verifier-approved branch and yield types. The CombineIfs and CombineNestedIfs rewrites grow the combined op's result list rather than reordering it, so dominance and result-list slices stay consistent across the pattern boundary.
The greedy driver iterates until no pattern fires, then hands control to the recursive expression simplifier, which performs deeper boolean, integer, and mask cleanup with memoization and a depth cap. The expression simplifier never rewrites region-bearing ops; that responsibility belongs to the rewrite pattern set above.
Dense Constant Printing
Debug replay paths may print dense constants as comma-separated element lists. That output is fine for diagnostics, but treat it as throwaway, since it omits shape and dialect typing context. Round-trippable IR must come from the normal operation printer.
Invariants
- Public folding never drops a token-ordered memory effect.
- Floating folds avoid NaN and infinity cases unless the exact semantics are modeled.
- Region rewrites preserve verifier-approved branch and yield types.
- select folds preserve result type and condition dominance.
- Recursive simplification is memoized and depth-bounded.
- Debug dense-element printing is not a serialization format.
Cross-References
Verifiers describes the legality contracts these rewrites must preserve. Operation Roster catalogues the operations the patterns target. The TileAS-side counterpart in nv_tileas Folds and Memory Consistency describes the rewrite shapes that operate on the next dialect down.
cuda_tile Assembly Printer
Abstract
cuda_tile installs a textual assembly surface so public TileIR reads like a source-level dialect instead of generic quoted MLIR. Most operations use declarative assembly formats. Custom printing is concentrated around token-ordered memory operations, aggregate regions, constant result-name hints, enum spelling, and cuTe layout attributes carried by tile metadata. This page documents the behavior a printer and parser must preserve.
The dialect installs the CudaTileOpAsmInterface concept on its operation models. Three consumers matter: a 5-path result-name algorithm on cuda_tile.constant, five custom *_tko printer callbacks for token-producing memory and atomic ops, and a set of seven namespace-getter trampolines that route concept-model interface calls back to the dialect's "cuda_tile" StringRef.
Printing Model
The printer should prefer concise, typed operation syntax:
%tok0 = cuda_tile.make_token : !cuda_tile.token
%v, %tok1 = cuda_tile.load_view_tko acquire gpu %view[%i, %j] token=%tok0
: !cuda_tile.partition_view<...> -> !cuda_tile.tile<...>, !cuda_tile.token
The general rules are:
- print the cuda_tile. prefix at dialect boundaries;
- allow the default dialect shortcut inside cuda_tile regions when supported;
- print inherent attributes in the structured syntax and elide them from the trailing attribute dictionary;
- keep memory ordering, memory scope, token operands, masks, and result token types visible for token-ordered operations;
- use SSA result-name hints only as hints, never as semantic identifiers.
Token-Ordered Memory Syntax
The _tko family carries the most important custom syntax — memory ordering
and token edges ride the operation, not a trailing attribute dictionary. A
reimplementation may tweak spacing, but every field must remain visible and
parseable.
%old, %tok1 = cuda_tile.atomic_cas_tko acq_rel gpu %ptrs, %cmp, %val, %mask
token=%tok0 : !cuda_tile.tile<ptr>, !cuda_tile.tile<i32>, !cuda_tile.tile<i1>
-> !cuda_tile.tile<i32>, !cuda_tile.token
%old, %tok1 = cuda_tile.atomic_rmw_tko acquire gpu add %ptrs, %val, %mask
token=%tok0 : !cuda_tile.tile<ptr>, !cuda_tile.tile<i32>, !cuda_tile.tile<i1>
-> !cuda_tile.tile<i32>, !cuda_tile.token
%tok1 = cuda_tile.store_view_tko release gpu %view[%i, %j], %value, %mask
token=%tok0 : !cuda_tile.partition_view<...>, !cuda_tile.tile<f32>
-> !cuda_tile.token
load_ptr_tko and store_ptr_tko mirror the view forms but take pointer
tiles instead of view-plus-indices. make_token and join_tokens use the
declarative formats — their syntax is pure token construction and merging.
Attribute Elision
Custom printers must not duplicate inherent attributes. When ordering, scope, mode, token segment sizes, or optimization hints already appear in the operation's structured syntax, drop them from the trailing attribute dictionary.
void print_optional_attrs(OpAsmPrinter *printer, Operation op) {
    // Attributes already spelled out in the structured syntax.
    Set<StringRef> elided = {
        "memory_ordering_semantics",
        "memory_scope",
        "operandSegmentSizes",
        "mode",
        "optimization_hints",
    };
    printer->print_optional_attr_dict(op.attributes, elided);
}
The parser reconstructs the same attributes from the structured fields so round-tripping loses nothing.
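The elision rule can be modeled in a few lines of Python. This is only a sketch: the attribute dictionary is a plain `dict` of pre-rendered strings, the function name is hypothetical, and the output shape (a leading space and a brace-delimited `key = value` list, nothing at all when every attribute is elided) follows the usual MLIR optional-attribute-dictionary convention rather than anything recovered from the binary.

```python
# Keys taken from the elision list above.
ELIDED = frozenset({
    "memory_ordering_semantics",
    "memory_scope",
    "operandSegmentSizes",
    "mode",
    "optimization_hints",
})


def format_optional_attr_dict(attrs, elided=ELIDED):
    """Render the trailing attribute dictionary, skipping elided keys."""
    kept = {name: value for name, value in attrs.items() if name not in elided}
    if not kept:
        return ""                      # nothing left: print no dictionary at all
    body = ", ".join(f"{name} = {value}" for name, value in sorted(kept.items()))
    return " {" + body + "}"
```

For an op whose only attributes are the ordering and scope, the function returns an empty string, so the printed form ends right after the type signature.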
Enum Spellings
Enum attributes print as short stable keywords:
| Attribute | Example spellings |
|---|---|
| Memory ordering | weak, relaxed, acquire, release, acq_rel, seq_cst |
| Memory scope | tl_blk, cluster, gpu, sys |
| Atomic RMW mode | add, addf, addu, and, or, xor, xchg, min, umin, max, umax, cmpxchg |
| Signedness | signed, unsigned |
| Rounding | nearest_even, zero, negative_inf, positive_inf, approx, full |
Printers emit canonical spellings even when parsers accept legacy aliases.
cuTe Layout Attributes
cuda_tile may carry cuTe layout attributes on tile metadata. Those attributes
use cuTe's basis-vector notation: a stride and dimension pair prints as
N@dim, and tuples print as parenthesized comma-separated lists.
#cute.layout<(1@0, 16@1)>
#cute.shape<(128@0, 64@1)>
#cute.stride<(1@0, 128@1)>
cuda_tile treats these as attributes, not first-class cuTe types. When
layout parsing fails, report the malformed parameter — never fall back to an
opaque string.
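A small Python sketch of the `N@dim` tuple notation follows. It handles only flat tuples such as `(1@0, 16@1)`; nested tuples, the `#cute.layout<...>` wrapper, and the exact diagnostic wording are outside its scope, and both function names are hypothetical. The point it illustrates is the last rule above: a malformed parameter is reported, not preserved as an opaque string.

```python
import re

_PAIR = re.compile(r"\s*(\d+)@(\d+)\s*")


def parse_basis_tuple(text):
    """Parse '(1@0, 16@1)' into [(1, 0), (16, 1)] or raise on malformed input."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"malformed layout parameter: {text!r}")
    pairs = []
    for item in text[1:-1].split(","):
        match = _PAIR.fullmatch(item)
        if match is None:
            # Report the offending element; never keep it as an opaque string.
            raise ValueError(f"malformed layout parameter: {item.strip()!r}")
        pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


def print_basis_tuple(pairs):
    """Print [(1, 0), (16, 1)] back as '(1@0, 16@1)'."""
    return "(" + ", ".join(f"{n}@{dim}" for n, dim in pairs) + ")"
```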
SSA Result Names and the 5-Path Constant Algorithm
SSA names are not semantic, but good hints make dumps readable. cuda_tile.constant is the most visible case. When the IR printer formats a constant op, it asks the op for a preferred name through getAsmResultNames. The cuda_tile implementation walks five cases keyed on the constant's value and writes the hint into a SmallString<32> (32-byte inline buffer plus heap spillover, the usual LLVM ADT SmallString layout with the capacity marker at offset +0).
| Path | Trigger | Result name |
|---|---|---|
| 1 | value is boolean true (i1 constant 1) | "true" |
| 2 | value is boolean false (i1 constant 0) | "false" |
| 3 | value is a floating-point NaN, by any bit pattern matching the type's NaN encoding | "cst_NaN" |
| 4 | value is i1 and is neither 0 nor 1 (poison, defensive only) | "cst_i1" |
| 5 | default: any integer prints "cst_<int>" with the decimal value; non-integer fallback prints "cst" and lets the IR printer pick a numeric suffix | "cst_42", "cst" |
Path 4 never fires on well-formed IR, but the printer handles it anyway so a poisoned i1 constant still produces a deterministic dump.
StringRef constant_result_name(ConstantOp op, SmallString<32> *scratch) {
    // Paths 1 and 2: boolean splats.
    if (is_i1_splat(op.value, true)) {
        return "true";
    }
    if (is_i1_splat(op.value, false)) {
        return "false";
    }
    // Path 3: floating-point NaN.
    if (is_splat_float_nan(op.value)) {
        return "cst_NaN";
    }
    // Path 4: any remaining i1 payload (defensive).
    if (is_i1_dense(op.value)) {
        return "cst_i1";
    }
    // Path 5: integer splat, else the bare fallback.
    if (Optional<int64_t> value = splat_integer(op.value)) {
        scratch->assign(format("cst_%lld", value.value));
        return scratch->view();
    }
    return "cst";
}
Other useful hints include view, bcast, red, and mm for view construction, broadcast, reductions, and matrix multiply results. A parser must never rely on those names — they exist only for humans reading dumps.
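The five-path ordering for cuda_tile.constant can be exercised with a small Python model. The representation is an assumption made for the example: a constant is an element-type string plus a list of element values, and the float-type test is a simple prefix check. Only the order of the checks and the resulting names are taken from the table above.

```python
import math


def constant_result_name(elem_type, values):
    """Return the SSA name hint for a constant, following the 5-path order."""
    is_i1 = elem_type == "i1"
    is_float = elem_type.startswith(("f", "bf", "tf"))
    if is_i1 and values and all(v == 1 for v in values):
        return "true"                                  # path 1
    if is_i1 and values and all(v == 0 for v in values):
        return "false"                                 # path 2
    if is_float and values and all(math.isnan(v) for v in values):
        return "cst_NaN"                               # path 3
    if is_i1:
        return "cst_i1"                                # path 4 (defensive)
    if not is_float and values and all(v == values[0] for v in values):
        return f"cst_{values[0]}"                      # path 5, integer splat
    return "cst"                                       # path 5, fallback
```

The NaN check has to run before any equality-based splat test, because NaN never compares equal to itself.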
Custom *_tko Printer Callbacks
Five cuda_tile operations carry custom AsmPrinter callbacks instead of declarative assembly. The _tko suffix marks ops that produce a cuda_tile.token; the callbacks emit specialised syntax with structured fields and a tight elided-attribute set, keeping the trailing attribute dictionary compact.
| Address | Op | Elided attrs |
|---|---|---|
| 0x664920 | cuda_tile.load_view_tko | addr_space, align, token |
| 0x664E90 | cuda_tile.store_view_tko | addr_space, align, token |
| 0x66C1A0 | cuda_tile.atomic_cas_tko | success_ordering, failure_ordering, token |
| 0x66C810 | cuda_tile.atomic_rmw_tko | ordering, kind, token |
| 0x677250 | cuda_tile.make_token | (none; emits the token name directly) |
The elided lists suppress the verbose attribute dictionaries the default printer would otherwise emit. Structured fields in each printer (memory ordering keyword, memory scope keyword, RMW kind keyword, token operand) carry the same information in keyword form, so round-tripping through the parser rebuilds every elided attribute.
OpAsmPrinter Vtable Slots
Custom printers touch only four OpAsmPrinter vtable slots:
| Offset | Slot | Notes |
|---|---|---|
| +16 | getStream | Returns the raw raw_ostream being written to |
| +48 | printAttr | Prints a single Attribute |
| +128 | printOperandList | Prints an SSA-value list with formatting |
| +192 | printOptionalAttrDict | Prints the attribute dictionary, optionally eliding listed keys |
A reimplementation that targets the same binary ABI keeps these offsets stable. The printOptionalAttrDict slot at +192 consumes every elided-attr list in the *_tko table above.
Namespace-Getter Trampolines
Seven 8-byte trampolines sit between 0x61F5D0 and 0x61FB20, each returning the cached pointer to the cuda_tile dialect's "cuda_tile" StringRef. They route concept-model interface calls to a single namespace string:
| Address | Concept |
|---|---|
| 0x61F5D0 | Dialect::getDialectNamespace |
| 0x61F640 | InlinerInterface::getDialectNamespace |
| 0x61F6B0 | OpAsmInterface::getDialectNamespace |
| 0x61F720 | BytecodeInterface::getDialectNamespace |
| 0x61F790 | FoldInterface::getDialectNamespace |
| 0x61F800 | TypeInfererInterface::getDialectNamespace |
| 0x61FB20 | MemoryEffectsInterface::getDialectNamespace |
Each trampoline is a cached pointer load followed by a single ret, eight bytes in total. Seven distinct copies exist because every concept-model instantiation generates its own getDialectNamespace; the model classes are template-instantiated separately, each keeping an independent code address, so the linker does not fold them below these seven slots. Tail-calling all of them into one shared body would shrink the binary, but the existing layout matches what an MLIR concept-model emitter produces when no explicit deduplication is applied.
Region Printing
Structured control-flow printers keep region layout readable:
%r = cuda_tile.if %cond -> (!cuda_tile.tile<...>) {
^then:
  cuda_tile.yield %a : !cuda_tile.tile<...>
} else {
^else:
  cuda_tile.yield %b : !cuda_tile.tile<...>
}
for, loop, reduce, and scan likewise name body blocks and print
yield types explicitly enough that verifier failures stay easy to diagnose.
Dense Constant Debug Output
Some debug paths print dense element payloads as comma-separated values
without full MLIR type or shape wrappers. That format is for logs and replay
debugging, not persistent IR. Round-trippable IR always uses the normal
cuda_tile.constant operation and its typed attribute.
Invariants
- Memory ordering, memory scope, atomic mode, masks, token operands, and token result types remain visible for _tko operations.
- Attributes printed in custom syntax are elided from the optional attribute dictionary and reconstructed by the parser; the elided sets match the per-op lists in the *_tko printer table.
- The constant result-name algorithm preserves its 5-path ordering ("true", "false", "cst_NaN", "cst_i1", then the integer/default fallback) so dumps stay stable across rebuilds.
- The OpAsmPrinter vtable slots at +16, +48, +128, and +192 are the only entry points the custom printers use.
- All seven getDialectNamespace trampolines return the same "cuda_tile" StringRef.
- SSA result-name hints are deterministic and non-semantic.
- cuTe layout values are parsed and printed as attributes.
- Debug dense-element output is not treated as stable assembly.
cuda_tile Bytecode Reader
Abstract
The cuda_tile dialect ships its own bytecode reader; no bytecode writer is linked into this binary (see Dialect Bytecode Reader/Writer Status — Status Matrix). The reader does not parse a standalone container — the top-level TileIR envelope is handled by the generic MLIR bytecode header parser documented in MLIR Bytecode Format. What cuda_tile contributes is the dialect-private Op-opcode dispatcher plus the cuda_tile-introduced arms of three otherwise-shared dispatchers (TypeTag, AttributeTag, DebugTag). Only the Op-opcode dispatcher is exclusively cuda_tile; the other three carry both builtin and cuda_tile cases.
The split:
| Dispatcher | Cases | Owner |
|---|---|---|
| TypeTag | 19 (0..18) | shared sub_59C710; tags 12..18 are cuda_tile-introduced types |
| AttributeTag | 13 (1..13) | shared sub_59F100; cuda_tile attrs route through tags 4..13 |
| DebugTag | 7 (0..6) | shared sub_589B90 |
| Op opcode | 110 (0..109) | cuda_tile-private sub_5B13D0 |
The private Op-opcode dispatcher is reached from the top-level bytecode-parse-into-scratch path. The three shared dispatchers come in through that same path and through other dialects' readers; they hold no per-dialect state, so the same Type, Attribute, and Location results round-trip through either entry point.
TypeTag Dispatcher
The TypeTag dispatcher (sub_59C710) reads a single VarInt tag and switches on it across a dense [0..18] namespace shared between builtin element types and the cuda_tile-introduced aggregate types. Tags 0..11 are builtin integer/float element types resolved without any further reads; tags 12..17 are the cuda_tile aggregate types (Pointer, Tile, TensorView, PartitionView, Function, Token); tag 18 is the microscale f8E8M0FNU element type reachable only as a leaf inside a tile shape. The full byte-for-byte table lives in Wire-Format Constants — Layer 2: TypeTag Namespace and the dispatcher walk in MLIR Bytecode Format — Type Tag Dispatch. The summary the cuda_tile-side reader cares about:
| Tag | Type | Payload |
|---|---|---|
| 0..4 | i1, i8, i16, i32, i64 | none (width fully determined by tag) |
| 5..11 | f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2 | none |
| 12 | PointerType | pointee type-ref + VarInt address space |
| 13 | TileType | element type-ref + VarInt rank + VarInt-encoded shape |
| 14 | TensorViewType | element type-ref + shape + strides |
| 15 | PartitionViewType | element type-ref + shape + dim-map + partition-mode byte |
| 16 | FunctionType | input type-ref list + result type-ref list |
| 17 | TokenType | no payload |
| 18 | f8E8M0FNU | parameterless; reachable only as a leaf via the extension path |
TileType is the workhorse of the cuda_tile-introduced cluster. Its payload is a TypeRef for the element type, a VarInt rank, and a VarInt-encoded shape. The reader shares its shape parser with TensorViewType and PartitionViewType, keeping the three Tile-family decoders byte-compatible across the shape prefix. (No writer ships in this binary; the shape format is documented as a wire-format contract rather than a writer-side helper.) PointerType carries a TypeRef for the pointee and a VarInt address space; TokenType is payload-free.
The dispatcher's contract with its caller is uniform: every case path returns a valid MLIR Type on success or nullptr on failure. This single null-check convention lets the bytecode reader push results straight into the Type-section table without case-specific handling.
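To make the tag-then-payload shape concrete, the following Python sketch decodes the payload-free element tags, TileType (tag 13), TokenType (tag 17), and the tag-18 leaf. It rests on several assumptions that the text does not confirm: VarInts are treated as unsigned LEB128 (consistent with the one-byte values in the encoding walk later on this page), each shape extent is one VarInt, the type table is a plain Python list, and `None` stands in for the nullptr failure result. All function names are hypothetical.

```python
ELEMENT_TYPES = ["i1", "i8", "i16", "i32", "i64", "f16", "bf16",
                 "f32", "tf32", "f64", "f8E4M3FN", "f8E5M2"]   # tags 0..11


def read_varint(buf, pos):
    """Decode one unsigned LEB128 value (assumed VarInt encoding)."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VarInt")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_type(buf, pos, type_table):
    """Return (type, new_pos); type is None when the tag is not handled."""
    tag, pos = read_varint(buf, pos)
    if tag < len(ELEMENT_TYPES):
        return ELEMENT_TYPES[tag], pos          # no payload to read
    if tag == 13:                               # TileType
        elem_ref, pos = read_varint(buf, pos)
        rank, pos = read_varint(buf, pos)
        shape = []
        for _ in range(rank):
            extent, pos = read_varint(buf, pos)
            shape.append(extent)
        if elem_ref >= len(type_table):
            return None, pos                    # dangling element type-ref
        return ("tile", tuple(shape), type_table[elem_ref]), pos
    if tag == 17:
        return "token", pos                     # TokenType is payload-free
    if tag == 18:
        return "f8E8M0FNU", pos
    return None, pos                            # other tags omitted in this sketch
```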
Six Enum-Attr Readers
Six attribute kinds defined by cuda_tile carry one-of-N enum payloads — Comparison, Overflow, PaddingValue, Rounding, Signedness, Width. Each has its own dedicated reader body, byte-identical to the others except for the embedded enum-value-to-name lookup table. Each body decodes the enum payload, validates it against the table, and emits a per-enum diagnostic on out-of-range values.
The byte-identity is a consequence of the table-driven layout: every reader reads a VarInt, indexes into its embedded (name, value) array, and either constructs the enum attribute or emits the diagnostic. Since the only thing that differs between the six readers is the table they consult, a future deduplication could collapse them into a single shared body plus six table pointers without touching the wire format. The shipped binary keeps them separate.
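The hypothetical deduplicated form (one shared body plus a table per enum) might look like the Python sketch below. The spellings are taken from the Enum Spellings table on the printer page, but their index order, the helper names, and the diagnostic wording are assumptions; the shipped per-enum diagnostic strings are not reproduced here. The reader takes an already-decoded VarInt index to keep the example short.

```python
# Spellings come from the Enum Spellings table; their index order is assumed.
SIGNEDNESS = ("signed", "unsigned")
ROUNDING = ("nearest_even", "zero", "negative_inf",
            "positive_inf", "approx", "full")


def make_enum_reader(attr_name, table):
    """Build a reader that differs from its siblings only in the table it uses."""
    def read(index):
        if 0 <= index < len(table):
            return (attr_name, table[index]), None
        # Hypothetical diagnostic text for an out-of-range enum value.
        return None, f"invalid {attr_name} value: {index}"
    return read


read_signedness = make_enum_reader("Signedness", SIGNEDNESS)
read_rounding = make_enum_reader("Rounding", ROUNDING)
```

Each reader returns an (attribute, diagnostic) pair, with exactly one side set, so callers can treat all six enums uniformly.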
F8E8M0FNU Tag 18 Fallback
The cuda_tile builder normally emits f8E4M3FN and f8E5M2 as tagged FloatTypes through the upstream MLIR builtin reader. Those two element types have stock TypeTag values in the upstream Type space and the upstream reader resolves them without ever entering the cuda_tile dispatcher.
The microscale f8E8M0FNU element type is the exception. Used by the microscale FP8 attention path, it has no upstream tag, the upstream reader doesn't recognize it, and the cuda_tile-private dispatcher catches it on the fallback path through tag 18. Tag 18 fires only when f8E8M0FNU appears as the element type of a TileType, TensorViewType, or PartitionViewType — that is, only as a leaf type inside a tile shape. A standalone f8E8M0FNU outside any tile shape cannot be emitted because the cuda_tile builder does not expose it as a top-level type; tag 18 is a leaf-only fallback, not a general-purpose tag.
Op Opcode Dispatcher
The op-opcode dispatcher reads a VarInt opcode and switches on it. The 110 opcodes cover the 92-op user-visible roster (some opcodes use private fallthrough variants). The full opcode table is reproduced on MLIR Bytecode Format.
Each opcode arm decodes the operation's expected payload: location reference (optional), result type-refs from the type table, operand value-refs from the value table, attribute-dictionary reference, and any op-specific region bodies. The dispatcher returns the constructed Operation* on success or nullptr on failure.
AttrTag Payloads
The cross-dialect attribute dispatcher accepts cuda_tile-owned attributes alongside attributes owned by builtin and other dialects. The cuda_tile attribute families fall into five payload shapes:
| Attribute family | Payload shape |
|---|---|
| Enum attrs (Comparison, Overflow, PaddingValue, Rounding, Signedness, Width) | VarInt enum index; resolved through the dedicated table-driven reader described above. |
| Optimization hint dict | VarInt entry count, then (architecture-key, value) pairs where each value is an AttributeRef into the attribute table. |
| Assumption predicate (div_by, bounded, same_elements) | Predicate-kind VarInt, then predicate-specific payload (divisor + optional every/along, lower/upper bounds, or shape extents respectively). |
| Operand-segment array | Dense i32 array encoded as VarInt rank + N signed VarInts; reused by every op with operand segments. |
| Tile-shape attribute | VarInt rank + N VarInt extents; reused by ops that carry a shape attribute independent of result type. |
These payload shapes are reader-side contracts; no bytecode writer is linked into this binary, so the producer side must be supplied by an external encoder that targets the same shapes exactly.
Per-Tag Builder Cluster
The 13-case sub_59F100 dispatcher described in MLIR Bytecode Format — Self-Contained Attribute Dispatch is the entry point. Each tag's builder consumes a tag-specific payload shape, constructs the corresponding mlir::Attribute, and either returns it or emits a per-builder diagnostic and returns nullptr. The per-tag wire-format / builder / failure-mode triple is the practical reference a reimplementation needs:
Tag 1 — StringAttr (inline)
- Wire-format bytes. SSO (short-string-optimization) length VarInt, then the raw UTF-8 bytes; alternatively, when the length VarInt encodes a string-table index, a single VarInt that points into the String section. The discriminator between the two encodings is the low bit of the length VarInt: even → embedded bytes (length = len >> 1), odd → string-table index (index = len >> 1).
- Builder. sub_59AD90 resolves the string and wraps the result in StringAttr via the inline arm of sub_59F100 itself.
- Failure modes. Out-of-range string-table index → "string index " concatenated with the offending index. SSO read past the section span → "failed to read StringAttr.".
Tag 2 — FloatAttr (inline)
- Wire-format bytes. Type-ref VarInt (must resolve to a FloatType), then an inline APFloat payload: an IEEE-754 bit pattern packed by sub_586200 according to the float type's bit width.
- Builder. Inline arm of sub_59F100. sub_58C400 resolves the type ref; sub_586200 reads the bit pattern; the sub_4462700-family float-type-builder casts the pattern into an APFloat of the right semantics and wraps it in FloatAttr.
- Failure modes. Type ref does not resolve to a FloatType → "failed to read valid FloatType for FloatAttr". Post-construction cast guard → "failed to cast parsed attribute to FloatAttr".
Tag 3 — TypeAttr (inline)
- Wire-format bytes. A single type-ref VarInt.
- Builder. Inline arm of sub_59F100. sub_58BDE0 looks up the type and wraps it in TypeAttr.
- Failure modes. Null type lookup → "failed to get referenced type for TypeAttr".
Tag 4 — DenseElementsAttr int/float variant (sub_59FB80)
- Wire-format bytes. Shape: VarInt rank + N VarInt extents; element-type ref VarInt (must resolve to an integer or float type); payload: a flat run of element-typed words, total count = product of extents. Integer payload words use VarInt-zig-zag encoding; float payload words use the same IEEE-754 bit pattern as tag 2.
- Builder. sub_59FB80 allocates a result vector via sub_456A580, fills it from the payload run, and wraps it in DenseIntOrFPElementsAttr.
- Failure modes. Element type does not resolve to an int/float type → "failed to read valid MLIR Type for self-contained DenseElementsAttr". Payload word fails to decode → "array contains unsupported value " concatenated with the offending VarInt.
Tag 5 — DenseElementsAttr string variant (sub_59FCD0)
- Wire-format bytes. Shape, then a per-element VarInt count (total element count = product of extents), then that many length-prefixed strings. Each string follows the same SSO rule as tag 1.
- Builder. sub_59FCD0 builds the DenseStringElementsAttr from the per-element string vector.
- Failure modes. Count prefix read failed → "failed to read number of string attrs in DenseElementsAttr". Per-element string read failed → "failed to read string in DenseElementsAttr".
Tag 6 — DivByAttr (sub_59FE40)
- Wire-format bytes. Divisor VarInt; flags byte (low two bits: verify_with_assume, predicate_covariance); on flags bit 1, two extra VarInts every and along.
- Builder. sub_59FE40 constructs the DivByAttr (div_by assumption predicate) with the populated divisor and the optional every/along covariance fields.
- Failure modes. Divisor VarInt failed → "failed to read divisor for DivByAttr". Flags byte failed → "failed to read flags byte for DivByAttr". every field failed → "failed to read value for 'every' in DivByAttr". along field failed → "failed to read value for 'along' in DivByAttr".
Tags 7 / 8 — DenseI64ArrayAttr two layout variants (sub_59FF60, sub_5A0080)
- Wire-format bytes. Both variants encode the same logical content, a length-prefixed i64 array, but with two physical layouts. Variant A (sub_59FF60) keeps the array inline next to the dispatch tag (suitable for short arrays). Variant B (sub_5A0080) emits a sidecar offset VarInt that points into a shared i64 pool elsewhere in the Constant section (suitable for arrays that recur across many attributes). Both layouts start with a VarInt rank, then either inline or sidecar-resolved i64 values.
- Builder. Both arms build the same DenseI64ArrayAttr; only the source of the i64 stream differs.
- Failure modes. Either layout's bulk value read failed → "failed to read DenseI64ArrayAttr values.".
Tag 9 — SameElementsAttr (sub_5A01A0)
- Wire-format bytes. A nested DenseI64ArrayAttr-shaped payload encoding the shape extents (the "all elements equal" invariant means the attribute carries only the shape and a single splat value, but the wire format reuses the dense-array codec for the shape).
- Builder. sub_5A01A0 constructs the SameElementsAttr after decoding the canonical-form payload.
- Failure modes. Nested decode failed → "failed to read DenseI64ArrayAttr for SameElementsAttr".
Tags 10 / 11 / 12 — BoundedAttr three discriminator variants (sub_5A02C0, sub_5A03E0, sub_5A0500)
- Wire-format bytes. All three variants share a flags byte that selects between lower-only, upper-only, and lower+upper layouts. Variant 0 (tag 10): flags + lower-bound payload. Variant 1 (tag 11): flags + upper-bound payload. Variant 2 (tag 12): flags + lower-bound + upper-bound payload. Each bound is a VarInt-encoded i64.
- Builder. Each arm constructs the BoundedAttr with the populated bound fields.
- Failure modes. Flags byte read failed → "failed to read flags byte for BoundedAttr". Lower bound read failed (variants 0, 2) → "failed to read lower bound for BoundedAttr". Upper bound read failed (variants 1, 2) → "failed to read upper bound for BoundedAttr".
Tag 13 — AssumePredicateAttr (sub_5A0620)
- Wire-format bytes. A packed predicate header (predicate kind + size) followed by the predicate-specific payload — typically a nested AttributeRef into the attribute table, plus an integer condition word.
- Builder. sub_5A0620 constructs the AssumePredicateAttr carrying the predicate body. This is the slot that has no upstream MLIR equivalent and is the most visible piece of the wire-format-breaking divergence.
- Failure modes. The packed predicate decode shares its prefix with DivByAttr and the BoundedAttr family; failures here surface through the same string-table-index, divisor, and bound diagnostics those decoders emit.
The complete cross-dialect numbering — including the side-by-side comparison with upstream MLIR mlir/Bytecode/BytecodeEnums.h::AttributeTag — lives in MLIR Bytecode Format — Self-Contained Attribute Dispatch. The default arm of sub_59F100 rejects every tag outside the 1..13 range with the "unsupported AttributeTag " / " for self-contained attribute" sentinel; producers that need to remain forward-compatible with the shipped CUDA 13.1 reader must restrict themselves to those 13 tags.
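As one worked instance of the per-tag contracts in this section, the Tag 1 low-bit discriminator can be sketched in Python as follows. The VarInt is assumed to be unsigned LEB128, the string table is a plain list, and the function names are hypothetical; the two diagnostic strings are the ones quoted under Tag 1, with the offending index appended in decimal as an assumption about formatting.

```python
def read_varint(buf, pos):
    """Decode one unsigned LEB128 value (assumed VarInt encoding)."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VarInt")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_string_attr(buf, pos, string_table):
    """Return (string, new_pos, diagnostic); string is None on failure."""
    try:
        header, pos = read_varint(buf, pos)
    except ValueError:
        return None, pos, "failed to read StringAttr."
    payload = header >> 1
    if header & 1:                               # odd: string-table index
        if payload >= len(string_table):
            return None, pos, f"string index {payload}"
        return string_table[payload], pos, None
    end = pos + payload                          # even: embedded bytes
    if end > len(buf):
        return None, pos, "failed to read StringAttr."
    return buf[pos:end].decode("utf-8"), end, None
```

An encoder targeting this scheme would write `len << 1` before embedded bytes and `(index << 1) | 1` for a table reference.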
Encoding Walk: cuda_tile.addi
A concrete byte-level walk closes the loop on the format. Consider the operation
%c = cuda_tile.addi %a, %b : tile<8 × i32>
assuming %a and %b occupy entries 4 and 5 of the current value table and tile<8 × i32> occupies entry 3 of the type table. The opcode for cuda_tile.addi is 3 (dispatch case 0x03 in MLIR Bytecode Format — Operation Opcode Dispatch). The on-wire encoding contains seven VarInt fields, each fitting in one byte at these table indices:
| Bytes | Field | VarInt | Hex | Decoded |
|---|---|---|---|---|
| 1 | Opcode | 0x03 | 03 | 3 → cuda_tile.addi |
| 1 | Location index (signed LEB128) | 0x7f | 7f | -1 → UnknownLoc (no --lineinfo) |
| 1 | Result-type ref | 0x03 | 03 | 3 → tile<8 × i32> |
| 1 | Operand count | 0x02 | 02 | 2 operands |
| 1 | Operand 0 ref | 0x04 | 04 | 4 → %a |
| 1 | Operand 1 ref | 0x05 | 05 | 5 → %b |
| 1 | Attribute-dict ref | 0x00 | 00 | 0 → empty dict |
The final on-wire byte stream is therefore exactly seven bytes:
03 7f 03 02 04 05 00
With --lineinfo enabled the 0x7f sentinel becomes a non-negative LocAttr index (one byte for any module with fewer than 64 distinct locations, since a one-byte signed LEB128 value covers -64..63). With a non-empty inline attribute dictionary the trailing 0x00 becomes a VarInt index into the attribute table; if the dict is built from cuda_tile-private enum attributes (Comparison, Overflow, …), each entry routes through the dedicated table-driven reader documented above before reaching the dispatch described in the Self-Contained Attribute Dispatch section of MLIR Bytecode Format.
All references are positional into per-section tables; the bytecode never embeds operand SSA names or string mnemonics in the operation stream. Each mnemonic is stored exactly once per operation kind in the dialect's mnemonic table, so the per-op encoding cost is independent of mnemonic length.
An external encoder targeting this reader must emit the same fields in the same order. The shape parser for TileType resolves the result-type reference before the op-opcode dispatcher fires, so the type-table index already exists by the time cuda_tile.addi's opcode arm runs. The result type's element width — i32 — is recovered through the type-table lookup, not through the op opcode.
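The seven-byte stream from the walk can be decoded with a short Python sketch. The field order is the one shown in the table for this specific op; treating the location index as signed LEB128 and every other field as unsigned LEB128 follows the table's annotations, and the function and dictionary key names are hypothetical.

```python
def read_uleb(buf, pos):
    """Decode one unsigned LEB128 value."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_sleb(buf, pos):
    """Decode one signed LEB128 value."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:              # sign bit of the final group
                result -= 1 << shift
            return result, pos


def decode_addi_like(buf):
    """Decode opcode, location, result type, operands, and attr-dict ref."""
    pos = 0
    opcode, pos = read_uleb(buf, pos)
    location, pos = read_sleb(buf, pos)
    result_type, pos = read_uleb(buf, pos)
    count, pos = read_uleb(buf, pos)
    operands = []
    for _ in range(count):
        ref, pos = read_uleb(buf, pos)
        operands.append(ref)
    attr_dict, pos = read_uleb(buf, pos)
    fields = {"opcode": opcode, "location": location, "result_type": result_type,
              "operands": operands, "attr_dict": attr_dict}
    return fields, pos


fields, consumed = decode_addi_like(bytes.fromhex("03 7f 03 02 04 05 00"))
```

After the call, `fields` holds opcode 3, location -1, result type 3, operands [4, 5], and attribute dictionary 0, with all seven bytes consumed.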
Missing Op 0x6E (atan2)
The op-opcode dispatcher covers 110 cases numbered 0..109. The underlying cuda_tile dialect advertises 111 ops to the MLIR registry, so exactly one op has no dispatcher case. The missing op is cuda_tile.atan2, removed from this binary as documented in cuda_tile Overview — Operation Families.
The wire-level consequence: opcode 110 lands on the default arm of the dispatcher and surfaces the "unknown or unimplemented opcode: " diagnostic. A producer that hand-encodes opcode 110 against the next-version opcode space sees its module load fail at that exact opcode. A future-version reader accepts the opcode by adding the 111th case at the end of the dispatch table; this reader has no path to do so.
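A minimal Python model of the default-arm behavior is shown below. The case labels and the decimal rendering of the offending opcode are assumptions for illustration; only the 0..109 range and the diagnostic prefix come from the text.

```python
NUM_DISPATCH_CASES = 110   # opcodes 0..109, per the text above


def dispatch_opcode(opcode):
    """Return (case_label, diagnostic); exactly one of the two is None."""
    if 0 <= opcode < NUM_DISPATCH_CASES:
        return f"case_{opcode:#04x}", None
    # Default arm: anything outside the table, including 0x6E (110).
    return None, f"unknown or unimplemented opcode: {opcode}"
```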
Version-13.1 vs 13.2 Compatibility
The bytecode header version check accepts only 13.1.x. The version-range table is encoded as an inclusive [13.1.0 .. 13.1.UINT32_MAX] window, and the predicate major == 13 && minor == 1 is the only one that yields acceptance.
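The inclusive window and the `major == 13 && minor == 1` predicate describe the same set of versions, which a few lines of Python can confirm. The function name and the tuple representation are illustrative assumptions, not the reader's actual data layout.

```python
UINT32_MAX = 0xFFFFFFFF

LOWEST = (13, 1, 0)
HIGHEST = (13, 1, UINT32_MAX)


def accepts_version(major, minor, patch):
    """Inclusive range check over (major, minor, patch) tuples."""
    return LOWEST <= (major, minor, patch) <= HIGHEST
```

Tuple comparison is lexicographic, so any 32-bit patch level under 13.1 passes and every 13.0.x, 13.2.x, or 12.x version fails.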
A 13.2.0 file emitted by a future tileiras would carry additional TypeTag, AttributeTag, and DebugTag values: at minimum a 14th AttributeTag for any new attribute kind, a 20th TypeTag for any new Type subclass, and an 8th DebugTag for any new debug attribute. The 13.1 reader never sees those tag values: it rejects the version block before any section body decoding begins. The forward-incompatibility guarantee is therefore stronger than tag-by-tag rejection, because a single header-block check shields the entire downstream pipeline from unknown payloads.
Cross-References
MLIR Bytecode Format is the parent reference for the wire format consumed by this dialect's dispatchers. The four wire-format dispatchers — Type Tag Dispatch, Operation Opcode Dispatch, Self-Contained Attribute Dispatch, and Debug-Info Attribute Dispatch — together cover every byte the reader looks at after the envelope is accepted. The wire-format-breaking AttrTag numbering and the side-by-side comparison with upstream MLIR live in the third of those sections; the per-builder failure modes documented above expand the same numbering with the payload bytes each builder reads.
Dialect Bytecode Reader/Writer Status restricts the parent reference to the dialects that actually ship a reader. The status matrix shows that cuda_tile is the only TileIR dialect with a linked bytecode reader and that no TileIR dialect ships a writer, which is why this page only documents the reader half of the contract.
Types and Attributes — Concrete Types documents the underlying cuda_tile Type and Attribute subclasses that the TypeTag and AttributeTag dispatchers construct. Operation Roster lists the 92 user-visible ops that the opcode dispatcher covers, alongside the small set of private-region ops.
nv_tileaa Dialect Overview
nv_tileaa exists only inside the tileiras binary. There is no
open-source counterpart: no header, no TableGen file, no entry in any
public NVIDIA component ships under this name. In the lowering cascade
it sits one step below the user-facing cuda_tile dialect and one step
above the assembler-near nv_tileas dialect, and its job is to expose
the alias and memory-space information that the lower tiers need before
they commit to layouts and TMA descriptors. The "aa" stands for
alias-aware: this dialect reifies buffer lifetime, pointer arithmetic,
and tile provenance as first-class IR ops, so downstream passes can run
ordinary dataflow analyses over them.
Purpose
The dialect bridges two very different worlds. Above it, cuda_tile
speaks in user-level terms — tiles, dot products, reductions, control
flow. Below it, nv_tileas already commits to async pipelines, TMA
descriptors, and per-agent register budgets. nv_tileaa is what fits
between them. It keeps the high-level operation set largely intact
(addf, dot, reduce, scan, broadcast, extract_slice) and
layers on three orthogonal kinds of structural information the upper
dialect lacks: explicit pointer arithmetic (addptr, int_to_ptr,
ptr_to_int, make_memref), explicit memory-token lifetimes
(create_mem_token, join_mem_token, mark_for_reuse), and a
launch/queue skeleton (launch_func, execute, plugin, func,
queue.get, queue.put, queue.yield). The result is an IR tier
where alias relationships, buffer reuse, and structural decomposition
all show up as plain SSA edges — ready for the layout assignment, async
materialization, and pipeline-region passes that run later in
nv_tileas.
In-memory only
nv_tileaa is strictly an internal pass-to-pass IR. The dialect registers no
BytecodeDialectInterface, so the binary contains neither a bytecode reader
nor a writer for it: the cascade consumes cuda_tile bytecode, lowers in
memory, and never serializes an nv_tileaa module. The dialect also
installs no OpAsmDialectInterface: no custom textual printer, no
custom parser, no type or attribute aliases, nothing that would let it
round-trip through generic MLIR text. A handful of ops (func, load,
atomic_cas, atomic_rmw, tiled_load, tiled_atomic_rmw) install
the per-op OpAsmOpInterface purely for SSA-name pretty-printing and a
getDefaultDialect shortcut; everything else falls through to the
ODS-emitted generic form. The takeaway: any textual dump of an
nv_tileaa module is a lossy debugging artifact, never a stable wire
format.
Semantic Surface
The dialect has a small named type surface, a target and memory attribute surface, and a compact operation set arranged around alias-aware tile computation. The operation roster is catalogued in Operation Roster; the useful overview is by semantic family:
| Family | Examples | Purpose |
|---|---|---|
| Pointer and memref construction | addptr, bitcast, int_to_ptr, ptr_to_int, make_memref | Turn public view/pointer concepts into explicit addressable objects. |
| Memory operations | load, store, tiled_load, tiled_store, gather_load, scatter_store, atomic_cas, atomic_rmw | Express memory access with visible provenance, reuse, and token dependencies. |
| Tile compute | addf, subf, mulf, divf, fma, dot, conv_dot, reduce, scan, histogram | Preserve tile math while making alias and layout preconditions explicit. |
| Shape and view transforms | broadcast, extract, extract_slice, expand_dims, permute, view, cat, make_range | Carry shape manipulation into the internal pipeline before TileAS layout assignment. |
| Program-grid queries | get_program_id, get_num_programs, get_dim_size, is_valid_program_id | Represent kernel-grid structure without committing to NVVM builtins yet. |
| Memory-token protocol | create_mem_token, join_mem_token, mark_for_reuse | Encode ordering and reuse information as SSA dataflow. |
| Structural operations | func, call, return, yield, execute, plugin, launch_func, queues, globals, diagnostics | Provide the internal function, queue, launch, and extension shell used by later passes. |
The dialect installs only the interfaces needed for internal inlining and generic IR handling. There is no public bytecode or text format — by design.
Alias-Aware Contract
nv_tileaa is the first stage where the compiler reasons about memory
provenance as IR rather than as implicit frontend intent. Three contracts
matter:
MemRef make_memref(Pointer base, Shape shape, Stride stride,
MemorySpace space, AliasScope scope);
Token create_mem_token();
Token join_mem_token(ArrayRef<Token> inputs);
Value load(MemRef ref, Indices indices, Token token);
Token store(MemRef ref, Indices indices, Value value, Token token);
Op signatures differ by operation family, but the discipline is uniform:
- memory references carry element type, address space, shape, stride, and alias provenance;
- memory effects are ordered through token SSA edges;
- reuse intent is explicit through mark_for_reuse;
- queues and plugin hooks remain structural until TileAS and companion lowering decide how to materialize them;
- math and shape operations keep tile semantics stable while alias information becomes available to downstream analyses.
That is why the dialect sits between cuda_tile and nv_tileas: it has
enough information to refine memory and alias behavior, but has not yet
committed to the final schedule, layout, async pipeline, or TMA descriptor
form.
Position in the cascade
nv_tileaa is the central waypoint of the three-dialect tile
lowering. Conceptually:
cuda_tile
-> nv_tileaa
-> nv_tileas
-> llvm/nvvm
The conversion into TileAA is pattern driven: public cuda_tile arithmetic,
shape, view, token, and memory operations rewrite into internal TileAA forms.
The conversion out of TileAA lowers those forms into TileAS, where layouts,
schedules, async pipelines, TMA descriptors, and CTA/cluster behavior become
explicit.
nv_tileaa is never serialized, so it is purely transient. The conversion in
produces it, the conversion out consumes it, and users must not depend on its
textual spelling or on seeing it on disk.
Lowering Invariants
- A verified nv_tileaa module has no remaining cuda_tile operations.
- Memory references carry enough provenance for alias and reuse analysis.
- Token-producing and token-consuming memory operations preserve ordering dependencies as SSA dataflow.
- Tile compute still describes mathematical intent; target-specific layout and scheduling are deferred to TileAS.
- Queue and plugin operations are structural bridges, not final backend ABI.
- The dialect may appear in debug dumps, but those dumps are not a stable file format.
AbstractOperation Record
Every registered op in nv_tileaa carries a single 0x70-byte AbstractOperation record — eight bytes wider
than the cuda_tile record. The dialect ctor allocates each slab with sub_44A8C20(0x70) and uses the extra
slot at +0x68 for the alias-token concept pointer that gives the dialect its alias-aware identity. The
shape is otherwise the same descriptor that an Operation* resolves through its OperationName slot to
reach the dialect's interface tables and fold callback.
typedef struct AbstractOperation {
/*+0x00*/ void **vtable; // dispatch for the op
/*+0x08*/ StringRef mnemonic; // e.g. "nv_tileaa.addptr"
/*+0x18*/ ConceptModel *interface_inliner;
/*+0x20*/ ConceptModel *interface_opasm;
/*+0x28*/ ConceptModel *interface_fold;
/*+0x30*/ ConceptModel *interface_typeinfer;
/*+0x38*/ ConceptModel *interface_bytecode;
/*+0x40*/ ConceptModel *interface_memeffects;
/*+0x48*/ ConceptModel *interface_destinationstyle;
/*+0x50*/ ConceptModel *interface_alias; // alias-aware concept (nv_tileaa-only)
/*+0x58*/ ConceptModel *interface_extra1;
/*+0x60*/ FoldCallback fold_canon; // op-fold and canonicalize hook
/*+0x68*/ ConceptModel *interface_alias_token; // extra slot for the alias-token concept
} AbstractOperation;
The allocator zero-initializes the slab, so unused interface slots stay null and the dispatcher probes them
without a presence flag. The mnemonic field is an embedded StringRef pointing at a .rodata literal owned
by the binary, not a heap-interned copy. The interface-concept pointers at +0x18..+0x58 are the MLIR
concept-model singletons that wire inlining, asm printing, folding, type inference, bytecode, memory effects,
destination-style, and — at +0x50 — the alias-aware concept that nv_tileaa uses to track buffer
provenance through ordinary dataflow. The fold callback at +0x60 is the op's per-op rewriter; the extra
concept pointer at +0x68 is the alias-token model backing create_mem_token, join_mem_token, and
mark_for_reuse. nv_tileaa.addptr, for instance, is registered with vtable &unk_59E0238 and a fold/canon
record at &unk_5B46F60, both populated by its reg thunk in sub_1543B70.
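The offsets in the record can be checked mechanically. The following is a minimal sketch, assuming a 64-bit target; `StringRefSketch`, `AbstractOperationSketch`, `FoldCallback`'s signature, and `has_alias_token_model` are hypothetical names introduced for illustration, and only the field order and offsets are taken from the listing above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration; only the offsets come from the text.
struct StringRefSketch {
  const char *data;
  std::size_t length;
};
struct ConceptModel;                      // opaque interface model
using FoldCallback = void (*)(void *op);  // assumed callback shape

struct AbstractOperationSketch {
  void **vtable;                             // +0x00
  StringRefSketch mnemonic;                  // +0x08
  ConceptModel *interface_inliner;           // +0x18
  ConceptModel *interface_opasm;             // +0x20
  ConceptModel *interface_fold;              // +0x28
  ConceptModel *interface_typeinfer;         // +0x30
  ConceptModel *interface_bytecode;          // +0x38
  ConceptModel *interface_memeffects;        // +0x40
  ConceptModel *interface_destinationstyle;  // +0x48
  ConceptModel *interface_alias;             // +0x50
  ConceptModel *interface_extra1;            // +0x58
  FoldCallback fold_canon;                   // +0x60
  ConceptModel *interface_alias_token;       // +0x68
};

static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(AbstractOperationSketch, mnemonic) == 0x08, "mnemonic");
static_assert(offsetof(AbstractOperationSketch, interface_alias) == 0x50, "alias");
static_assert(offsetof(AbstractOperationSketch, fold_canon) == 0x60, "fold");
static_assert(offsetof(AbstractOperationSketch, interface_alias_token) == 0x68,
              "alias token");
static_assert(sizeof(AbstractOperationSketch) == 0x70, "record size");

// A null slot means "interface not implemented"; there is no presence flag.
inline bool has_alias_token_model(const AbstractOperationSketch &rec) {
  return rec.interface_alias_token != nullptr;
}
```

A value-initialized record models the zero-filled slab: every probe such as `has_alias_token_model` returns false until the registration thunk fills the slot.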
The records sit consecutively in a statically-allocated array in .data.rel.ro that mirrors the layout of
cuda_tile's bank: one slab per op, walked from the dialect base by mnemonic hash. The exact range for this
build is the bank that holds the 61 registered ops; the surrounding fold-record cluster sits at
0x5B46D28..0x5B46F68, which is the secondary index that the registrar threads through the slab's +0x28
fold-interface pointer. The end-of-registered-ops boundary is marked by the same null sentinel as
cuda_tile, 0x5BE6138; lookup helpers stop walking the bank when they hit it.
This is the static-sentinel idiom described in
TypeID Sentinels and Anchors: the bank is
allocated once, lives for the entire process, and is indexed by mnemonic hash from the dialect base. Live
Operation* instances reach this record through their OperationName slot — the resolution path documented
in Operation Layout — Pointer-Identity Dispatch. The per-op vtable and
fold-callback pairs for the rest of the roster are catalogued in Operation Roster.
Cross-references
- Operation Roster: operation families and behavioral contracts.
- Types, Attributes, Verifiers: the type surface, the target and memory attribute surface, the compute-capability / compute_capability spelling pair, the parametric div_by / same_elements / bounded trio, and the dialect's verifier contracts.
- Folds, Canonicalizers, Tokens: per-op fold and canonicalization records, plus the create_mem_token / join_mem_token / mark_for_reuse linear-token protocol that gives the dialect its alias-awareness.
nv_tileaa Operation Roster
Abstract
nv_tileaa is the alias-aware tile dialect between public cuda_tile IR and
the lower nv_tileas scheduling dialect. Its operations keep the mathematical
shape of the original program while making pointer provenance, memory
ordering, queue flow, and plugin boundaries explicit enough for later passes
to schedule and materialize. The operation surface is a reimplementation
contract — what each family represents, which operands and attributes matter,
and which invariants a verifier or lowering must preserve.
The roster groups operations by behavior, not by binary registration order. A reimplementation should track the family contracts and textual mnemonics, not the incidental internal layout used by one compiler build.
Semantic Families
| Family | Operations | Contract |
|---|---|---|
| Floating and integer tile math | addf, subf, mulf, divf, fma, sqrt, rsqrt, exp2, clampf, mulhiui | Elementwise arithmetic over scalar or tile-shaped values. Floating ops accept rounding or NaN propagation attributes where applicable. |
| Dot, convolution, and collectives | dot, conv_dot, reduce, scan, histogram | Preserve high-level collective math until TileAS can choose MMA, tensor-memory, or reduction pipelines. |
| Shape and tile construction | splat, broadcast, expand_dims, extract, extract_slice, view, cat, permute, make_range, generate, block_tile, conv_tile, get_dim_size | Express rank changes, indexing, view reinterpretation, generated tiles, and convolution blocking without hiding shape dependencies. |
| Pointer, memref, and type conversion | addptr, int_to_ptr, ptr_to_int, make_memref, bitcast, fp_to_fp, extern_elementwise, elementwise_inline_asm, call_elementwise_intrinsic | Convert public pointer and element-type concepts into explicit addressable objects and per-element escape hatches. |
| Memory effects | get_global, load, store, tiled_load, tiled_store, gather_load, scatter_store, atomic_cas, atomic_rmw, tiled_atomic_rmw | Represent memory traffic with visible masks, volatility, TMA eligibility, bounds facts, and token dependencies. |
| Tokens, assumptions, and lifetime hints | create_mem_token, join_mem_token, mark_for_reuse, assert, assume, optimization_barrier, pragma, message, print | Carry ordering, alias, reuse, diagnostics, and optimizer constraints as SSA-visible IR. |
| Functions, plugins, and launch structure | func, call, return, yield, global, launch_func, execute, plugin, inject_ir | Provide the internal symbol, function, launch, and extension shell that survives until queue and plugin lowering. |
| Grid and queue flow | get_program_id, get_num_programs, is_valid_program_id, cancel_next_program_id, create_queue, queue.get, queue.put, queue.yield | Model program-grid queries and queue dataflow before they become TileAS async pipeline regions. |
Core Operation Contracts
Arithmetic
Elementwise arithmetic is shape-preserving. The verifier rejects operand sets
that cannot be broadcast or matched by the dialect's normal shape rules. For
floating operations, the result element type is the operand element type;
fp_to_fp is the explicit conversion boundary and must be the only operation
that changes floating-point width.
TileValue verify_elementwise_arithmetic(Op op) {
Shape result_shape = infer_common_shape(op.operands);
ElementType type = infer_common_element_type(op.operands);
require_all_operands_compatible(op.operands, result_shape, type);
require_rounding_mode_if_needed(op);
return TileValue(type, result_shape);
}
mulhiui is an unsigned high-half multiply: for each lane, multiply the
zero-extended operands at double width and return the upper half. Never model
it as ordinary signed multiplication followed by a shift.
uintN mulhiui(uintN a, uintN b) {
uint2N wide = zero_extend(a) * zero_extend(b);
return truncate_to_N_bits(wide >> N);
}
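For a concrete instance of the pseudocode, the sketch below fixes N = 32 and contrasts the unsigned high half with the signed model the text warns against. The function names are illustrative only and do not come from the binary.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned high-half multiply for N = 32, following the pseudocode above.
inline std::uint32_t mulhiui32(std::uint32_t a, std::uint32_t b) {
  const std::uint64_t wide =
      static_cast<std::uint64_t>(a) * static_cast<std::uint64_t>(b);
  return static_cast<std::uint32_t>(wide >> 32);
}

// The incorrect signed model, kept only to show where the results diverge.
inline std::uint32_t mulhi_signed_wrong(std::uint32_t a, std::uint32_t b) {
  const std::int64_t wide =
      static_cast<std::int64_t>(static_cast<std::int32_t>(a)) *
      static_cast<std::int64_t>(static_cast<std::int32_t>(b));
  return static_cast<std::uint32_t>(static_cast<std::uint64_t>(wide) >> 32);
}
```

With both operands equal to 0xFFFFFFFF, the unsigned form yields 0xFFFFFFFE, while the signed model treats the inputs as -1 and yields 0.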
Dot and Convolution
nv_tileaa.dot
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | A | tile<M × K × elem_a> | yes | rank-2 or rank-3 (batched) |
| operand 1 | B | tile<K × N × elem_b> | yes | K dimension agrees with A |
| operand 2 | C | tile<M × N × elem_c> | yes | accumulator |
| operand 3 | sfa | tile<scale × E8M0> | block-scaled only | scale factors for A |
| operand 4 | sfb | tile<scale × E8M0> | block-scaled only | scale factors for B |
| result 0 | D | tile<M × N × elem_c> | yes | same shape as C |
| attr signedness_a | enum | signed\|unsigned | integer MMA | |
| attr signedness_b | enum | signed\|unsigned | integer MMA | |
| attr propagate_nan | bool | optional | floating-point only | |
| attr operandSegmentSizes | dense i32 | length 5 | yes | {A, B, C, sfa, sfb} |
dot abstracts matrix multiply-accumulate. It accepts ordinary float and
integer MMA shapes and carries the scale-factor operands needed for
Blackwell-style block-scaled MMA. The verifier owns four properties:
- A and B use a legal paired element type.
- The accumulator/result type is legal for that pair.
- Integer operands agree on bit width and signedness rules.
- Block-scaled forms are gated on a target that supports them.
LogicalResult verify_dot(DotOp op, Target target) {
MmaShape shape = infer_mma_shape(op.a, op.b, op.c);
require_compatible_contracting_dims(shape);
if (op.has_scale_factors) {
require(target.supports_block_scaled_mma);
require_legal_scale_factor_types(op.sfa, op.sfb, shape);
}
require_legal_mma_element_tuple(op.a.type, op.b.type, op.c.type, op.result.type);
require_signedness_attrs_for_integer_mma(op);
return success();
}
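The contracting-dimension check can be illustrated with plain shape vectors. This is a sketch under stated assumptions, not the binary's verifier: `infer_dot_result_shape` is a hypothetical helper, and placing the batch axis first for rank-3 operands is an assumption the source does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Returns the expected D shape, or nullopt when A, B, and C disagree.
// Assumption: rank-3 operands carry the batch axis in position 0.
inline std::optional<Shape> infer_dot_result_shape(const Shape &a,
                                                   const Shape &b,
                                                   const Shape &c) {
  const std::size_t rank = a.size();
  if ((rank != 2 && rank != 3) || b.size() != rank || c.size() != rank)
    return std::nullopt;
  const std::size_t batch = rank - 2;  // 0 for rank-2, 1 for rank-3
  for (std::size_t i = 0; i < batch; ++i)
    if (a[i] != b[i] || a[i] != c[i]) return std::nullopt;
  const std::int64_t m = a[batch], k = a[batch + 1];
  const std::int64_t kb = b[batch], n = b[batch + 1];
  if (k != kb) return std::nullopt;  // K must agree between A and B
  if (c[batch] != m || c[batch + 1] != n) return std::nullopt;
  return c;  // D has the same shape as the accumulator C
}
```

A 128x32 tile times a 32x128 tile with a 128x128 accumulator passes; changing B to 64x128 breaks the K agreement and is rejected.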
conv_dot, conv_tile, and block_tile keep convolution lowering
structured. The key behavior is not a special address calculation — it is
preserving padding, activation layout, and tile-blocking facts until the
memory layout pass can pick the right producer and consumer layouts.
The element-type rules that govern legal (A, B, C) tuples — FP8 e4m3 and
e5m2, block-scaled MX-FP and NV-FP4, and the f32 accumulator requirement on
narrow-precision inputs — are documented in
Fast-Math and Numerical Precision.
Shape Operations
Shape operations stay cheap, explicit, and canonicalizable. view changes
interpretation without changing elements, expand_dims inserts size-one axes,
broadcast repeats values across larger axes, and extract or
extract_slice projects subshapes. splat is the scalar-to-tile constructor
and the canonical sink for many reshape folds.
Shape infer_shape(Op op) {
switch (op.kind) {
case SPLAT:
return op.result_shape;
case EXPAND_DIMS:
return insert_axes(op.input.shape, op.axes, size_one_axes());
case BROADCAST:
require_can_broadcast(op.input.shape, op.result_shape);
return op.result_shape;
case VIEW:
require_same_element_count(op.input.shape, op.result_shape);
return op.result_shape;
case EXTRACT:
return remove_indexed_axes(op.input.shape, op.indices);
case EXTRACT_SLICE:
return op.slice_shape;
default:
return infer_from_traits(op);
}
}
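Three of the cases above reduce to small checks on extent vectors. The sketch below is illustrative only: the helper names are hypothetical, and `can_broadcast` assumes same-rank inputs (rank changes go through expand_dims first), which is an assumption rather than a documented rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

using Shape = std::vector<std::int64_t>;

inline std::int64_t element_count(const Shape &s) {
  return std::accumulate(s.begin(), s.end(), std::int64_t{1},
                         std::multiplies<std::int64_t>());
}

// EXPAND_DIMS: insert a size-one axis at `axis` (0 <= axis <= rank).
inline Shape expand_dims(Shape s, std::size_t axis) {
  s.insert(s.begin() + static_cast<std::ptrdiff_t>(axis), std::int64_t{1});
  return s;
}

// BROADCAST: each input extent either matches the result extent or is 1.
inline bool can_broadcast(const Shape &in, const Shape &out) {
  if (in.size() != out.size()) return false;
  for (std::size_t i = 0; i < in.size(); ++i)
    if (in[i] != out[i] && in[i] != 1) return false;
  return true;
}

// VIEW: reinterpretation is legal only when the element count is unchanged.
inline bool view_is_legal(const Shape &in, const Shape &out) {
  return element_count(in) == element_count(out);
}
```

For example, a 128-element vector expanded at axis 1 becomes 128x1, which can then broadcast to 128x64; a 128x64 value can be viewed as 64x128 but not as 128x32.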
Pointer and Memref Construction
nv_tileaa.addptr
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or tile<ptr> | yes | preserves address space |
| operand 1 | offset | index or tile<index> | yes | shape must broadcast against base |
| result 0 | ptr | same as base | yes | element type and address space inherited |
nv_tileaa.make_memref
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr | yes | base pointer of the memref |
| operand 1 | offset | index | optional | byte offset added to base |
| operand 2..R+1 | sizes | index | yes per-dynamic-dim | matches dynamic slots in result shape |
| operand R+2.. | strides | index | yes per-dynamic-stride | matches dynamic stride slots |
| result 0 | memref | memref | yes | element type, shape, stride packed into result type |
| attr alias_scope | scope id | u32 | optional | provenance tag consumed by alias analysis |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {base, offset, sizes, strides} |
addptr is the primary pointer-arithmetic operation. It accepts scalar or
tile-shaped offsets and preserves the base pointer's address-space and
provenance. Canonicalization collapses chained additions into a single
addition with a folded offset expression.
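PointerValue addptr(PointerValue base, IndexValue offset) {
  // Scale the offset by the pointee element size to obtain a byte offset.
  ByteOffset bytes = scale_offset_by_element_size(offset, base.element_type);
  // Element type, address space, and provenance are inherited from the base.
  return PointerValue(base.address + bytes, base.element_type, base.space, base.provenance);
}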
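The chained-addition fold relies on addptr being additive in its offset. A minimal sketch of that property follows; `PointerSketch` and its fields are hypothetical, and the element-size scaling mirrors the pseudocode above rather than any confirmed implementation detail.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a scalar pointer value.
struct PointerSketch {
  std::uint64_t address;
  std::uint32_t element_bytes;
  std::uint32_t address_space;
  std::uint32_t provenance;
};

// Offsets are scaled by the element size; everything else is inherited.
inline PointerSketch addptr(PointerSketch base, std::int64_t offset_elems) {
  const std::int64_t bytes =
      offset_elems * static_cast<std::int64_t>(base.element_bytes);
  base.address += static_cast<std::uint64_t>(bytes);  // wraps for negatives
  return base;
}
```

Because the operation is additive, `addptr(addptr(p, a), b)` and `addptr(p, a + b)` produce the same address, which is what lets canonicalization collapse a chain into one op with a folded offset.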
A representative scalar addptr:
%p1 = nv_tileaa.addptr %p0, %off
: !nv_tileaa.ptr<f32, 1>, index -> !nv_tileaa.ptr<f32, 1>
A tile-shaped addptr produces per-lane addresses for a gather:
%pp = nv_tileaa.addptr %pbase, %lanes
: !nv_tileaa.ptr<f16, 1>, tile<128xindex>
-> tile<128x!nv_tileaa.ptr<f16, 1>>
make_memref packages a base pointer with offset, sizes, strides, element
type, memory space, and alias provenance. Later TMA descriptor generation
depends on this object being structurally complete — never hide strides or
bounds behind opaque pointer arithmetic.
%mr = nv_tileaa.make_memref %pbase, %off, %sz0, %sz1, %st0, %st1
{ alias_scope = 7,
operandSegmentSizes = array<i32: 1, 1, 2, 2> }
: (!nv_tileaa.ptr<f32, 1>, index, index, index, index, index)
-> !nv_tileaa.memref<?x?xf32, 1>
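Because offset, sizes, and strides are variadic, a consumer has to use operandSegmentSizes to split the flat operand list. The sketch below shows one way to do that under stated assumptions: `ValueId`, `MakeMemrefOperands`, and `split_make_memref_operands` are hypothetical names, and SSA values are modeled as plain integers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

using ValueId = int;  // hypothetical stand-in for an SSA value

struct MakeMemrefOperands {
  ValueId base;
  std::optional<ValueId> offset;
  std::vector<ValueId> sizes;
  std::vector<ValueId> strides;
};

// Splits the flat operand list by the {base, offset, sizes, strides} segments.
inline std::optional<MakeMemrefOperands> split_make_memref_operands(
    const std::vector<ValueId> &operands,
    const std::array<std::int32_t, 4> &segments) {
  std::size_t total = 0;
  for (std::int32_t n : segments) {
    if (n < 0) return std::nullopt;
    total += static_cast<std::size_t>(n);
  }
  // Exactly one base, at most one offset, and the sum must cover all operands.
  if (segments[0] != 1 || segments[1] > 1 || total != operands.size())
    return std::nullopt;

  MakeMemrefOperands out{};
  std::size_t cursor = 0;
  out.base = operands[cursor++];
  if (segments[1] == 1) out.offset = operands[cursor++];
  for (std::int32_t i = 0; i < segments[2]; ++i)
    out.sizes.push_back(operands[cursor++]);
  for (std::int32_t i = 0; i < segments[3]; ++i)
    out.strides.push_back(operands[cursor++]);
  return out;
}
```

With six operands and segments {1, 1, 2, 2}, the split yields one base, one offset, two sizes, and two strides, matching the example above.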
Memory Effects
Scalar memory ops and tiled memory ops share one discipline: every memory effect consumes the incoming token and returns a token representing the effect after the access. Loads return a value too; stores and atomics still return the updated token even when their data result is unused.
nv_tileaa.load / nv_tileaa.tiled_load
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or memref | yes | source address |
| operand 1..R | indices | index | yes (R = base rank) | per-axis index |
| operand R+1 | mask | tile<S × i1> | optional | per-lane predicate |
| operand R+2 | other | tile<S × element> | optional | fallback value when masked off |
| token slot | token | mem_token | optional | drives ordering |
| result 0 | value | tile<S × element> or scalar | yes | element type matches base pointee |
| result 1 | token | mem_token | optional | mirrors token operand |
| attr cache_modifier | enum | none\|ca\|cg\|cs\|lu\|cv | optional | |
| attr eviction_policy | enum | none\|first\|last\|normal | optional | |
| attr mem_semantic | enum | weak\|relaxed\|acquire | optional | |
| attr mem_scope | enum | tl_blk\|cluster\|gpu\|sys | required when semantic > weak | |
| attr in_bounds | dense bool | per-axis | optional | |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {base, indices, mask, other} |
nv_tileaa.store / nv_tileaa.tiled_store
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or memref | yes | destination |
| operand 1 | value | tile<S × element> or scalar | yes | element type matches base pointee |
| operand 2..R+1 | indices | index | yes | per-axis index |
| operand R+2 | mask | tile<S × i1> | optional | per-lane predicate |
| token slot | token | mem_token | optional | |
| result 0 | token | mem_token | optional | mirrors token operand |
| attr mem_semantic | enum | weak\|relaxed\|release | optional | acquire/acq_rel rejected |
| attr mem_scope | enum | as above | required when semantic > weak | |
| attr cache_modifier | enum | as above | optional | |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {base, value, indices, mask} |
nv_tileaa.atomic_cas / nv_tileaa.atomic_rmw / nv_tileaa.tiled_atomic_rmw
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or memref | yes | atomic target |
| operand 1 | value or compare | scalar/tile | yes | RMW operand or CAS compare |
| operand 2 | replacement | scalar/tile | CAS only | |
| operand 3.. | indices, mask | index, tile<i1> | optional | per-axis index; predicate |
| token slot | token | mem_token | optional | |
| result 0 | old value | scalar/tile | yes | |
| result 1 | token | mem_token | optional | |
| attr rmw_mode | enum | full set | RMW only | add\|and\|or\|xor\|xchg\|min\|max\|umin\|umax\|addf |
| attr mem_semantic | enum | full set | optional | |
| attr mem_scope | enum | full set | required when semantic > weak | |
LogicalResult verify_memory_op_common(MemoryOp op) {
require_operand_segments(op);
require_indices_match_base_rank(op.base(), op.indices());
require_mask_shape_matches_result(op.mask(), op.result(0).shape());
require_token_arity_matches_segment(op);
if (op.mem_semantic() != WEAK) {
require(op.mem_scope().has_value(),
"non-weak memory ordering requires explicit scope");
} else {
require(!op.mem_scope().has_value(),
"weak memory ordering must not carry a scope");
}
return success();
}
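The semantic and scope rules from the tables and the verifier can be combined into one predicate. This is a sketch, not the dialect's verifier: the enum and function names are hypothetical, and treating the atomic "full set" as the five semantics listed here is an assumption.

```cpp
#include <cassert>
#include <optional>

enum class MemSemantic { Weak, Relaxed, Acquire, Release, AcqRel };
enum class MemScope { TlBlk, Cluster, Gpu, Sys };
enum class MemOpKind { Load, Store, Atomic };

// Per-family semantic sets, as listed in the attribute tables.
inline bool semantic_allowed(MemOpKind kind, MemSemantic sem) {
  switch (kind) {
    case MemOpKind::Load:
      return sem == MemSemantic::Weak || sem == MemSemantic::Relaxed ||
             sem == MemSemantic::Acquire;
    case MemOpKind::Store:
      return sem == MemSemantic::Weak || sem == MemSemantic::Relaxed ||
             sem == MemSemantic::Release;
    case MemOpKind::Atomic:
      return true;  // assumed: every semantic is accepted
  }
  return false;
}

// A scope must be present exactly when the ordering is stronger than weak.
inline bool ordering_is_legal(MemOpKind kind, MemSemantic sem,
                              std::optional<MemScope> scope) {
  if (!semantic_allowed(kind, sem)) return false;
  const bool needs_scope = sem != MemSemantic::Weak;
  return needs_scope == scope.has_value();
}
```

An acquire load at gpu scope passes, an acquire store is rejected by the family check, and a weak access that carries a scope is rejected by the scope rule.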
The mem_semantic / mem_scope pair on atomic_cas, atomic_rmw, and tiled_atomic_rmw is the user-facing entry point into the layered memory model documented in Concurrency and Sync Semantics. That page enumerates which (semantic, scope) combinations each op family accepts, how the pair survives every lowering stage down to the PTX .sem / .scope modifiers, and how the implicit release/acquire pair on mbarrier.arrive.expect_tx and mbarrier.try_wait.parity fits into a producer/consumer pipeline.
LoadResult lower_memory_read(MemRef ref, Indices indices, Token token, Mask mask) {
require_indices_in_rank(ref, indices);
if (mask.is_constant_false()) {
return LoadResult(mask.other_value_or_undef(), token);
}
Value value = masked_or_unmasked_load(ref, indices, mask);
Token next = sequence_after(token, value.memory_effect);
return LoadResult(value, next);
}
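The per-lane meaning of mask and other can be shown on a host-side array. The sketch below is a simplified model under stated assumptions: `masked_load` is a hypothetical helper over a flat `float` buffer, and it ignores tokens, memory spaces, and multi-dimensional indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per lane: read memory when the predicate is set, else yield `other`.
// Masked-off lanes never dereference their index.
inline std::vector<float> masked_load(const std::vector<float> &memory,
                                      const std::vector<std::size_t> &lane_index,
                                      const std::vector<bool> &mask,
                                      float other) {
  std::vector<float> result(lane_index.size(), other);
  for (std::size_t lane = 0; lane < lane_index.size(); ++lane) {
    if (mask[lane]) result[lane] = memory[lane_index[lane]];
  }
  return result;
}
```

The important property is that a masked-off lane may hold an out-of-range index without causing an access, which is why an all-false mask can fold to the fallback value with the token passed through unchanged.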
tiled_load, tiled_store, and tiled_atomic_rmw are the TMA-aware forms.
Their attributes record whether TMA is allowed and whether each dimension is
known in bounds. The TileAA verifier validates the structural facts; the
TileAS lowering decides whether a concrete TMA instruction is profitable and
legal for the selected layout.
Worked Example: addptr → tiled_load → dot → tiled_store
A representative GEMM-style fragment combines the four operations and threads its memory operations through a single token chain. Each memory operation consumes the incoming token and produces a new one; later passes may reorder operations only when the token graph stays intact.
// Initial token at the function entry
%t0 = nv_tileaa.create_mem_token : !nv_tileaa.mem_token
// Compute the per-stage base pointers
%pa = nv_tileaa.addptr %a_base, %off_a
: !nv_tileaa.ptr<f16, 1>, index -> !nv_tileaa.ptr<f16, 1>
%pb = nv_tileaa.addptr %b_base, %off_b
: !nv_tileaa.ptr<f16, 1>, index -> !nv_tileaa.ptr<f16, 1>
// Wrap each pointer in a memref describing shape and stride
%mr_a = nv_tileaa.make_memref %pa, %off_a, %M, %K, %s_a_row, %s_a_col
: (!nv_tileaa.ptr<f16, 1>, index, index, index, index, index)
-> !nv_tileaa.memref<?x?xf16, 1>
%mr_b = nv_tileaa.make_memref %pb, %off_b, %K, %N, %s_b_row, %s_b_col
: (!nv_tileaa.ptr<f16, 1>, index, index, index, index, index)
-> !nv_tileaa.memref<?x?xf16, 1>
// Token-ordered tile loads
%av, %t1 = nv_tileaa.tiled_load %mr_a[%i, %k], %t0
{ in_bounds = array<i1: true, true>,
operandSegmentSizes = array<i32: 1, 2, 0, 0> }
: !nv_tileaa.memref<?x?xf16, 1>, index, index, !nv_tileaa.mem_token
-> tile<128x32xf16>, !nv_tileaa.mem_token
%bv, %t2 = nv_tileaa.tiled_load %mr_b[%k, %j], %t1
{ in_bounds = array<i1: true, true>,
operandSegmentSizes = array<i32: 1, 2, 0, 0> }
: !nv_tileaa.memref<?x?xf16, 1>, index, index, !nv_tileaa.mem_token
-> tile<32x128xf16>, !nv_tileaa.mem_token
// Plain (non-block-scaled) dot into an f32 accumulator; both scale-factor segments are empty
%d = nv_tileaa.dot %av, %bv, %c_in
{ operandSegmentSizes = array<i32: 1, 1, 1, 0, 0> }
: tile<128x32xf16>, tile<32x128xf16>, tile<128x128xf32>
-> tile<128x128xf32>
// Token-ordered tile store; %t3 succeeds the store in the token chain
%mr_c = nv_tileaa.make_memref %c_base, %off_c, %M, %N, %s_c_row, %s_c_col
: (!nv_tileaa.ptr<f32, 1>, index, index, index, index, index)
-> !nv_tileaa.memref<?x?xf32, 1>
%t3 = nv_tileaa.tiled_store %mr_c[%i, %j], %d, %t2
{ in_bounds = array<i1: true, true>,
operandSegmentSizes = array<i32: 1, 1, 2, 0> }
: !nv_tileaa.memref<?x?xf32, 1>, tile<128x128xf32>, index, index,
!nv_tileaa.mem_token
-> !nv_tileaa.mem_token
The three token-carrying memory operations form one continuous chain %t0 → %t1 → %t2 → %t3.
The dot consumes no token because it is a pure tile-on-tile computation; the
operations on either side of it commit their memory effects through the chain.
A reordering pass may swap the two tiled_loads only if it also rewires their
token edges, because the verifier rejects any load whose token input is not
produced by an operation that dominates it. The discipline lets
later TileAS scheduling reorder TMA loads aggressively without ever losing
the producer/consumer ordering between memory and compute.
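Within a single block, the dominance requirement on token edges reduces to "defined before use". The following sketch checks that property on a straight-line list of memory ops; `MemOpSketch` and `token_chain_is_well_formed` are hypothetical, and tokens are modeled as integer ids.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical straight-line model: each memory op reads one token id and
// defines one new token id.
struct MemOpSketch {
  int token_in;
  int token_out;
};

// Every token input must be defined earlier in the block, and every token
// must be defined exactly once (SSA).
inline bool token_chain_is_well_formed(int entry_token,
                                       const std::vector<MemOpSketch> &ops) {
  std::set<int> defined{entry_token};
  for (const MemOpSketch &op : ops) {
    if (defined.count(op.token_in) == 0) return false;   // use before def
    if (!defined.insert(op.token_out).second) return false;  // redefinition
  }
  return true;
}
```

The chain from the example, modeled as {0→1, 1→2, 2→3}, is accepted. Moving the second load ahead of the first without rewiring gives {1→2, 0→1, 2→3}, which is rejected because token 1 is used before it is defined.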
Tokens and Lifetime
create_mem_token creates an initial memory-order value. join_mem_token
merges several order edges into one. mark_for_reuse tells buffer allocation
that a value's lifetime extends beyond naive SSA liveness. The token value
carries no user-visible data — it is an ordering edge that later lowering can
map to barrier phase state.
Token join_mem_token(ArrayRef<Token> inputs) {
if (inputs.empty()) {
return create_mem_token();
}
Token result = inputs[0];
for (Token token : inputs.drop_front()) {
result = merge_order_edges(result, token);
}
return result;
}
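One way to picture `merge_order_edges` is as a node in a dependency graph whose predecessors are the joined tokens. The sketch below is an illustrative model only: `TokenGraph`, `after`, and `ordered_before` are hypothetical names, and the graph representation is an assumption, not the binary's data structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class TokenGraph {
 public:
  using Token = std::size_t;

  // Fresh token with no predecessors.
  Token create() {
    preds_.emplace_back();
    return preds_.size() - 1;
  }

  // Token produced by a memory effect that consumed `input`.
  Token after(Token input) {
    preds_.push_back(std::vector<Token>{input});
    return preds_.size() - 1;
  }

  // Mirrors the pseudocode: empty -> fresh token, one input -> that input.
  Token join(const std::vector<Token> &inputs) {
    if (inputs.empty()) return create();
    if (inputs.size() == 1) return inputs.front();
    preds_.push_back(inputs);
    return preds_.size() - 1;
  }

  // True when `later` depends (reflexively, transitively) on `earlier`.
  bool ordered_before(Token earlier, Token later) const {
    std::vector<Token> stack{later};
    std::vector<bool> seen(preds_.size(), false);
    while (!stack.empty()) {
      const Token t = stack.back();
      stack.pop_back();
      if (t == earlier) return true;
      if (seen[t]) continue;
      seen[t] = true;
      for (Token p : preds_[t]) stack.push_back(p);
    }
    return false;
  }

 private:
  std::vector<std::vector<Token>> preds_;
};
```

Two effects that both consume the entry token are unordered relative to each other, but a join of their output tokens is ordered after both.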
Plugins and Queues
plugin and execute give opaque kernel fragments a structured extension
point. They carry function-like operands, layout metadata, and resource
requirements such as registers, shared memory, tensor memory, and named
barriers. queue.get, queue.put, and queue.yield form a dataflow shell
that TileAS later collapses into producer and consumer pipeline regions.
void lower_queue_region(QueueRegion region, PipelineBuilder builder) {
for (QueueOp op : region.ops) {
switch (op.kind) {
case QUEUE_GET:
builder.emit_consumer_wait(op.queue, op.consumer_index);
bind_queue_results(op);
break;
case QUEUE_PUT:
builder.emit_producer_commit(op.queue, op.values);
break;
case QUEUE_YIELD:
builder.close_region(op.yielded_values);
break;
}
}
}
Verification Invariants
- A TileAA module should contain no remaining cuda_tile operations.
- Every memory op with effects participates in the token protocol.
- Pointer and memref operations preserve address space, element type, and alias provenance.
- Shape-changing ops preserve element count unless their semantics explicitly create or remove repetition.
- yield, queue.yield, and return operands match the parent region or function contract.
- func, call, plugin, and execute symbol references resolve inside the containing module.
- Block-scaled MMA and tensor-memory features require a target that supports the needed Blackwell instruction family.
- TMA eligibility attributes are promises to later lowering, not proof that TMA must be emitted.
Cross-References
Types, Attributes, Verifiers catalogues the type and attribute surface these operations use and the verbatim verifier diagnostics they emit. Folds, Canonicalizers, Tokens describes the rewrites applied after verification succeeds. The TileAS-side counterpart in nv_tileas Operation Roster and Builders extends this surface with async pipeline and TMA operations.
nv_tileaa Types, Attributes, Verifiers
Abstract
nv_tileaa carries just enough type and attribute structure to make alias, memory, layout, and target facts explicit between cuda_tile and nv_tileas. The type system covers pointer-like values, queues, memrefs, tiled views, program identifiers, and memory tokens. The attribute system covers target capability, memory policy, atomic mode, arithmetic rounding, convolution layout, and assumption predicates. Verification is concentrated in the few places where a wrong fact would make later scheduling unsound.
The diagnostic strings emitted by the dialect's verifier are part of the producer contract. They are reproduced here verbatim so a reimplementation matches the exact text the binary emits.
Type Surface
| Type | Purpose |
|---|---|
| nv_tileaa.ptr | Pointer value with element type and memory space. |
| nv_tileaa.program_id | Grid program index value. |
| nv_tileaa.queue | Typed queue handle for producer and consumer regions. |
| nv_tileaa.memref | Strided memory reference with base, offset, sizes, strides, element type, memory space, and alias scope. |
| nv_tileaa.tiled_view | Tile-shaped view over a value or memory object. |
| nv_tileaa.mem_token | Memory-order token; ordering only, no payload. |
memref and tiled_view are the structural types that matter most. memref answers "where is this data and how is it strided?"; tiled_view answers "how should tile-level computation interpret this value?" Keeping those two questions separate lets layout assignment swap a view without rewriting the underlying pointer provenance.
typedef struct {
Pointer base;
Index offset;
Shape sizes;
Strides strides;
ElementType element_type;
MemorySpace memory_space;
AliasScope alias_scope;
} TileAAMemRef;
typedef struct {
Value source;
Shape shape;
Layout layout;
ElementType element_type;
} TileAATiledView;
Type Storage and Uniquing
Every nv_tileaa type is a normal MLIR Type subclass backed by its own TypeStorage derivative routed through the context StorageUniquer documented in Storage Uniquer and Context Impl — getOrCreate Gateway. The uniquer key for each type names the fields the hasher consumes and the equality check compares.
| Type | Uniquer key |
|---|---|
| nv_tileaa.ptr | (pointee_type, address_space) |
| nv_tileaa.program_id | parameterless (single canonical storage per context) |
| nv_tileaa.queue | (result_types: ArrayRef<Type>, isolated_flag: bool) |
| nv_tileaa.memref | (element_type, shape: ArrayRef<int64_t>, stride: ArrayRef<int64_t>, address_space, alias_scope_id) |
| nv_tileaa.tiled_view | (source_type, shape, layout_attr) |
| nv_tileaa.mem_token | parameterless |
Shape, stride, and result-type arrays are interned alongside the storage block; copies returned to callers reuse those pointers. Pointer identity on the resulting Type* is the dispatch key every walker and type converter in the cascade consumes, so a reimplementation must intern through one StorageUniquer per context rather than allocating fresh storage per call site.
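The pointer-identity requirement can be illustrated with a toy uniquer for the `(pointee_type, address_space)` key. This is a sketch, not MLIR's StorageUniquer: `PtrTypeStorage` and `PtrTypeUniquer` are hypothetical, the pointee type is modeled as an integer id, and an ordered map stands in for the hashed lookup.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Hypothetical storage for the ptr type; the key fields come from the table.
struct PtrTypeStorage {
  int pointee_type_id;
  unsigned address_space;
};

// One uniquer per context: equal keys always yield the same storage pointer.
class PtrTypeUniquer {
 public:
  const PtrTypeStorage *get(int pointee_type_id, unsigned address_space) {
    const auto key = std::make_pair(pointee_type_id, address_space);
    auto it = storage_.find(key);
    if (it == storage_.end()) {
      it = storage_
               .emplace(key, std::make_unique<PtrTypeStorage>(
                                 PtrTypeStorage{pointee_type_id, address_space}))
               .first;
    }
    return it->second.get();
  }

 private:
  std::map<std::pair<int, unsigned>, std::unique_ptr<PtrTypeStorage>> storage_;
};
```

Two requests with the same key return the same pointer, so downstream code can compare types by address; allocating fresh storage per call site would break that comparison.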
Attribute Surface
The dialect has eighteen logical attributes plus a legacy spelling of compute_capability retained for compatibility with older text and bytecode producers.
| Group | Attributes | Meaning |
|---|---|---|
| Target and kernel configuration | compute_capability, compute-capability, target_spec, kernel_spec | Select architectural features, launch shape, and kernel-level policy. |
| Memory policy | cache_modifier, eviction_policy, mem_semantic, mem_scope | Annotate loads, stores, and atomics with cache, eviction, ordering, and scope facts. |
| Atomic and arithmetic modes | rmw_mode, rounding_mode, propagate_nan, signedness | Select atomic operation, floating rounding, NaN behavior, and integer MMA signedness. |
| Convolution and layout | padding_value, activation_layout, conv_params | Preserve convolution padding, activation order, and structured convolution parameters. |
| Assumption predicates | div_by, bounded, same_elements | Attach verifier-checked facts to assume so later passes can simplify safely. |
Most attributes are enum-like or data containers. Parsing validates their spelling and payload; the consuming op's verifier runs a second pass when it matters. The three assumption predicates implement a runtime verification interface against the value constrained by nv_tileaa.assume.
Dot and Block-Scaled MMA Diagnostics
nv_tileaa.dot is the densest verifier in the dialect. It runs five phase checks against the A/B/C element-type tuple, the optional sfa/sfb scale-factor operands, the integer signedness attributes, the operand ranks, and the result shape. Each phase emits a specific diagnostic.
LogicalResult verify_dot(DotOp op, Target target) {
bool all_int = is_integer(op.a.elem_t) && is_integer(op.b.elem_t)
&& is_integer(op.c.elem_t) && is_integer(op.d.elem_t);
bool all_float = is_float(op.a.elem_t) && is_float(op.b.elem_t)
&& is_float(op.c.elem_t) && is_float(op.d.elem_t);
if (!all_int && !all_float) {
return op.emit_error("expected the element types of A, B, C, and D to be either all integers or all floats.");
}
if (all_int) {
if (bit_width(op.a.elem_t) != bit_width(op.b.elem_t)) {
return op.emit_error("expects #A and #B have same bit width but got ")
.append(bit_width(op.a.elem_t)).append(" vs ").append(bit_width(op.b.elem_t));
}
if (!op.has_signedness_a()) {
return op.emit_error("expect signedness attribute for operand A");
}
}
if (all_float) {
if (is_fp4(op.a.elem_t)) {
if (op.c.elem_t != f32_type() && op.c.elem_t != f16_type()) {
return op.emit_error("expects #C element type to be either f32 or f16, but got ")
.append(op.c.elem_t);
}
} else if (op.c.elem_t != f32_type()) {
return op.emit_error("expects #C element type to be f32, but got ").append(op.c.elem_t);
}
}
int rank = op.d.rank;
if (rank != 2 && rank != 3) {
return op.emit_error("expects rank-2 or rank-3 tensor for result, but got (")
.append(rank).append(")");
}
if (!shapes_compatible_for_mma(op.a, op.b)) {
return op.emit_error("expects the shape of operand #A and #B to be compatible");
}
if (!shapes_compatible_for_acc(op.a, op.b, op.c)) {
return op.emit_error("expects the shape of operand #C is compatible with operands #A and #B");
}
if (op.has_sfa || op.has_sfb) {
if (!op.has_sfa || !op.has_sfb) {
return op.emit_error("expects both SFA and SFB to be present");
}
}
return success();
}
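The two shape helpers called by the verifier are not spelled out on this page. The sketch below shows one plausible reading, under the assumption that A is laid out as [batch?, M, K], B as [batch?, K, N], and C as [batch?, M, N]; the actual operand layout is not confirmed by the source, so treat the index choices as illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Assumed layout: A is [batch?, M, K] and B is [batch?, K, N].
bool shapes_compatible_for_mma(const Shape &a, const Shape &b) {
    if (a.size() != b.size() || (a.size() != 2 && a.size() != 3)) {
        return false;
    }
    const std::size_t r = a.size();
    if (r == 3 && a[0] != b[0]) {
        return false;                 // batch extents must agree
    }
    return a[r - 1] == b[r - 2];      // contracting dimension K must agree
}

// Assumed layout: C is [batch?, M, N], with M taken from A and N from B.
bool shapes_compatible_for_acc(const Shape &a, const Shape &b, const Shape &c) {
    if (!shapes_compatible_for_mma(a, b) || c.size() != a.size()) {
        return false;
    }
    const std::size_t r = a.size();
    if (r == 3 && c[0] != a[0]) {
        return false;
    }
    return c[r - 2] == a[r - 2] && c[r - 1] == b[r - 1];
}
```

Under these assumptions a 64x32 A with a 32x128 B passes the first check, and only a 64x128 accumulator passes the second.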
The diagnostics verify_dot emits:
| Diagnostic | Cause |
|---|---|
| "expected the element types of A, B, C, and D to be either all integers or all floats." | A mixed integer/floating tuple was supplied. |
| "expects #A and #B have same bit width but got " followed by both widths | Integer MMA inputs at unequal widths. |
| "expect signedness attribute for operand A" | Integer MMA without a signedness_a attribute. |
| "expects #C element type to be f32, but got " followed by the actual type | Floating accumulator is not f32. |
| "expects #C element type to be either f32 or f16, but got " followed by the actual type | FP4 inputs with an accumulator that is neither f32 nor f16. |
| "expects rank-2 or rank-3 tensor for result, but got (" followed by the actual rank | Result rank outside the legal range. |
| "expects the shape of operand #A and #B to be compatible" | Contracting dimensions disagree. |
| "expects the shape of operand #C is compatible with operands #A and #B" | Accumulator/result shape disagrees with the M/N derived from A/B. |
| "expects both SFA and SFB to be present" | Block-scaled MMA with only one of sfa/sfb. |
Convolution and Tile Blocking
nv_tileaa.block_tile and nv_tileaa.conv_tile preserve convolution structure until the memory layout pass can pick producer and consumer layouts. Their verifier rejects malformed tileSizes, dimGroups, filter sizes, and convolution parameter tuples.
LogicalResult verify_block_tile(BlockTileOp op) {
if (op.tile_sizes.empty()) {
return op.emit_error("expects non-empty tileSizes");
}
if (op.dim_groups.empty()) {
return op.emit_error("expects non-empty dimGroups");
}
if (op.tile_sizes.size() != op.dim_groups.size()) {
return op.emit_error("expects rank of tileSizes be equal to rank of dimGroups, but got ")
.append(op.tile_sizes.size()).append(" vs ").append(op.dim_groups.size());
}
for (int64_t s : op.tile_sizes) {
if (s <= 0) {
return op.emit_error("expects all tile size bigger than zero");
}
}
Set<int> seen;
for (DimGroup g : op.dim_groups) {
for (int axis : g.dims) {
if (!seen.insert(axis)) {
return op.emit_error("expects dim is being tiled only one time, but got ").append(axis);
}
}
}
return success();
}
LogicalResult verify_conv_tile(ConvTileOp op) {
for (int s : op.filter_sizes) {
if (s < 1 || s > 3) {
return op.emit_error("expects filter size must be in range 1 to 3");
}
}
if (!conv_params_consistent(op.conv_params)) {
return op.emit_error("expects conv params size matched with each other");
}
if (!any_group_contains_channel(op.dim_groups, op.channel_axis)) {
return op.emit_error("expects channel must be a dimGroup");
}
if (!channel_group_is_singleton(op.dim_groups, op.channel_axis)) {
return op.emit_error("expects channel dimGroup should contain only channel");
}
return success();
}
The diagnostics:
| Diagnostic | Cause |
|---|---|
| "expects non-empty tileSizes" | tileSizes attribute is missing or empty. |
| "expects non-empty dimGroups" | dimGroups attribute is missing or empty. |
| "expects dim is being tiled only one time, but got " followed by the duplicated axis | A spatial axis appears in more than one dim group. |
| "expects rank of tileSizes be equal to rank of dimGroups, but got " followed by both ranks | Rank disagreement between the two attributes. |
| "expects all tile size bigger than zero" | A tile-size entry is zero or negative. |
| "expects filter size must be in range 1 to 3" | A convolution filter size falls outside the supported range. |
| "expects conv params size matched with each other" | The convolution-parameter tuples disagree on rank. |
| "expects channel must be a dimGroup" | The convolution's channel axis is not assigned to any dim group. |
| "expects channel dimGroup should contain only channel" | The channel dim group contains additional non-channel axes. |
Region Terminator Diagnostics
Region operations route through a yield verifier. The terminator must be nv_tileaa.yield and must sit inside an nv_tileaa.func parent.
LogicalResult verify_region_terminator(Operation op) {
Operation term = op.region(0).front().terminator;
if (term.name != "nv_tileaa.yield") {
return op.emit_error("expects regions to end with 'nv_tileaa.yield'");
}
return success();
}
LogicalResult verify_yield_parent(YieldOp op) {
Operation parent = op.parent;
if (parent.name != "nv_tileaa.func") {
return op.emit_error("expects parent op 'nv_tileaa.func'");
}
return success();
}
The diagnostics:
| Diagnostic | Cause |
|---|---|
| "expects regions to end with '" (binary string; the required terminator op-name and a closing ' are appended at print time, e.g. nv_tileaa.yield) | A region-bearing op's terminator is the wrong op kind. |
| "expects parent op " (binary string; the required parent op-name is appended at print time, e.g. 'nv_tileaa.func') | A yield, return, or function-scoped op appears outside its required parent. |
Assumption Predicate Verification
nv_tileaa.assume accepts a value and a list of predicate attributes. Each predicate that implements the assumption interface verifies the value's type and its own parameters. The first failing predicate emits the diagnostic; later predicates never run.
LogicalResult verify_assume(AssumeOp op) {
Type constrained_type = op.value.type;
for (Attribute predicate : op.predicates) {
AssumePredicate verifier = dyn_cast_assume_predicate(predicate);
if (verifier == nullptr) {
continue;
}
if (failed(verifier.verify_with_assume_op(predicate, constrained_type, op))) {
return failure();
}
}
return success();
}
div_by
div_by states that every constrained element is divisible by a positive power-of-two divisor. Optional every and along fields refine the statement to a periodic subset of an axis; they must appear together.
LogicalResult verify_div_by(DivByAttr attr, Type type) {
if (!is_integer_like(type) && !is_pointer_like(type) && !is_memref_like(type)) {
return emit_diag("div_by requires an integer-, pointer-, or memref-like value");
}
if (attr.divisor <= 0 || !is_power_of_two(attr.divisor)) {
return emit_diag("div_by divisor must be a positive power of two");
}
bool has_every = attr.every.has_value;
bool has_along = attr.along.has_value;
if (has_every != has_along) {
return emit_diag("div_by every and along must appear together");
}
if (has_every) {
if (attr.every.value <= 0) {
return emit_diag("div_by every must be positive");
}
if (!axis_is_valid(type, attr.along.value)) {
return emit_diag("div_by along must reference a valid axis");
}
}
return success();
}
bounded
bounded states that the constrained integer-like value falls within an inclusive range. Bounds are interpreted using the constrained element width, so the verifier checks both order and representable range.
LogicalResult verify_bounded(BoundedAttr attr, Type type) {
ElementType element = integer_element_type(type);
if (!element.is_integer) {
return emit_diag("bounded requires an integer-like element type");
}
IntegerRange range = signed_integer_range(element.bit_width);
if (attr.lower.has_value && !range.contains(attr.lower.value)) {
return emit_diag("bounded lower exceeds the element's representable range");
}
if (attr.upper.has_value && !range.contains(attr.upper.value)) {
return emit_diag("bounded upper exceeds the element's representable range");
}
if (attr.lower.has_value && attr.upper.has_value
&& attr.lower.value > attr.upper.value) {
return emit_diag("bounded lower must not exceed upper");
}
return success();
}
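The `signed_integer_range` helper used above is left undefined on this page. A minimal sketch, assuming two's-complement interpretation and widths from 1 to 64 bits, could look like the following; `IntegerRange` here is a hypothetical plain struct, and a width of zero is mapped to an empty range purely as a defensive choice.

```cpp
#include <cstdint>
#include <limits>

struct IntegerRange {
    std::int64_t lo;
    std::int64_t hi;
    bool contains(std::int64_t v) const { return lo <= v && v <= hi; }
};

// Two's-complement range for a signed integer of the given width (1..64 bits).
IntegerRange signed_integer_range(unsigned bit_width) {
    if (bit_width == 0) {
        return {0, -1};  // empty range: nothing is representable
    }
    if (bit_width >= 64) {
        return {std::numeric_limits<std::int64_t>::min(),
                std::numeric_limits<std::int64_t>::max()};
    }
    const std::int64_t hi = (std::int64_t{1} << (bit_width - 1)) - 1;
    return {-hi - 1, hi};
}
```

For an 8-bit element this gives [-128, 127], so a `bounded` attribute with an upper bound of 200 would be rejected by the range check before the ordering check runs.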
same_elements
same_elements records a shape fact: each listed axis must have exactly the specified extent. The attribute earns its keep after rank-changing canonicalization, when a later pass needs to prove that two views still cover the same logical tile.
LogicalResult verify_same_elements(SameElementsAttr attr, Type type) {
Shape shape = ranked_shape(type);
if (attr.values.length != shape.rank) {
return emit_diag("same_elements length must match the constrained value's rank");
}
for (int axis = 0; axis < shape.rank; ++axis) {
if (attr.values[axis] < 0 || attr.values[axis] > shape.dim(axis)) {
return emit_diag("same_elements axis bound is out of range");
}
}
return success();
}
Operation-Level Verifier Dispatch
Most operations rely on generic trait checks. The operations that need domain-specific verification on top route through this dispatch:
LogicalResult verify_tileaa_operation(Operation op, Target target) {
if (failed(verify_generic_traits(op))) {
return failure();
}
switch (op.kind) {
case DOT: return verify_dot(cast<DotOp>(op), target);
case BLOCK_TILE: return verify_block_tile(cast<BlockTileOp>(op));
case CONV_TILE: return verify_conv_tile(cast<ConvTileOp>(op));
case FP_TO_FP: return verify_float_conversion(cast<FpToFpOp>(op), target);
case FUNC: return verify_function_contract(cast<FuncOp>(op));
case EXECUTE: return verify_execute_contract(cast<ExecuteOp>(op), target);
case YIELD:
case QUEUE_YIELD: return verify_yield_parent(op);
case ASSUME: return verify_assume(cast<AssumeOp>(op));
case TILED_LOAD:
case TILED_STORE:
case TILED_ATOMIC: return verify_tiled_memop(op);
default: return success();
}
}
Element-Type Contract
The dialect reuses the ordinary MLIR integer and floating families and adds the low-precision formats needed by FP8, FP4, and block-scaled MMA. Legality is decided by a finite table lookup, not by ad hoc string tests.
| Element family | Typical use |
|---|---|
| f16, bf16, tf32, f32 | Standard MMA input and accumulator paths. |
| FP8 E4M3 and E5M2 | Low-precision MMA inputs and conversion targets. |
| E8M0 scale factors | Block-scaled MMA scale-factor operands. |
| FP4 (OCP MX-FP4 and NVFP4) | Blackwell-era block-scaled MMA input paths. |
| Integer widths | Integer MMA, pointer arithmetic, predicates, and indices. |
LogicalResult verify_float_conversion(FpToFpOp op, Target target) {
if (!is_supported_float_element(op.source.element_type)) {
return op.emit_error("fp_to_fp source element type is not supported");
}
if (!is_supported_float_element(op.result.element_type)) {
return op.emit_error("fp_to_fp result element type is not supported");
}
if ((uses_block_scaled_format(op.source) || uses_block_scaled_format(op.result))
&& !target.supports_block_scaled_mma) {
return op.emit_error("fp_to_fp block-scaled format requires a target that supports block-scaled MMA");
}
return success();
}
Invariants
- compute-capability and compute_capability parse to one logical target-capability concept; new IR emits the canonical underscore spelling.
- Enum-like attributes are validated by parser tables and by consuming ops.
- div_by, bounded, and same_elements are meaningful only through nv_tileaa.assume.
- Memory-policy attributes do not create ordering by themselves; tokens and memory effects do.
- Low-precision element formats are target-gated where the hardware requires it.
- Function, plugin, and queue attributes remain structured until their symbols and resource requirements have been resolved.
- Diagnostic strings are stable across builds. Reimplementations reproduce them verbatim.
Cross-References
Operation Roster catalogues the operations these verifiers run against and shows complete IR examples. Folds, Canonicalizers, Tokens describes the rewrites that run after verification succeeds. The nv_tileas block-scaled MMA verifier in nv_tileas Verifiers — Block-Scaled MMA extends the dot contract documented here with Blackwell-specific atom catalog checks.
nv_tileaa Folds, Canonicalizers, Tokens
Abstract
nv_tileaa is where the tile pipeline turns high-level intent into a cleaner,
alias-aware program. Verification proves the IR is structurally legal;
canonicalization makes it useful. The folds that matter strip redundant shape
wrappers, fuse pointer arithmetic, prune dead queue and pragma results,
simplify masked memory operations, reduce atomic identities, and preserve
memory ordering through a small token algebra.
This page lays out those transformations as algorithms. A reimplementation doesn't need the same pattern classes or registration order, but it does need the same observable rewrites and the same safety conditions.
Canonicalization Surface
| Area | Rewrite | Safety condition |
|---|---|---|
| Dot folding | Fold constant dot operands into a constant result or an accumulator identity. | Both multiplicands and the accumulator path are compile-time constants or identity values. |
| Pointer arithmetic | addptr(addptr(base, a), b) becomes addptr(base, a + b). | The two offsets use the same element-size interpretation and address space. |
| Assumptions over splats | assume(splat(x), pred) becomes splat(assume(x, pred)). | The predicate is elementwise and does not depend on lane identity. |
| Select over splats | select(splat(c), splat(t), splat(f)) becomes splat(select(c, t, f)). | All three splats have the same result shape. |
| Masked load | Constant-true mask becomes an unmasked load; constant-false mask becomes the fallback value. | The fallback value has the exact load result type. |
| Masked store | Constant-true mask becomes an unmasked store; constant-false mask erases the store. | The store has no other required side effect besides the memory write. |
| Shape wrappers | Fuse nested view; fold view, broadcast, and expand_dims around splat. | Element count and result shape remain equal to the original result type. |
| Extract motion | Move extract through elementwise operations and through matching expand_dims. | The extracted lane maps one-to-one to the source lane. |
| Queue result pruning | Drop unused queue.get results and update the matching queue.yield. | Result order for the remaining values is preserved. |
| Pragma result pruning | Drop unused pragma-carried results and rewrite the region terminator. | The pragma's semantic payload is independent of the removed result. |
| Atomic identities | Reduce no-op or identity atomics to cheaper reads or preserved tokens. | The selected atomic mode has a true algebraic identity for the operand type. |
Pattern Driver
Implement the canonicalization pass as an ordinary greedy MLIR-style rewrite loop. The trick is to register shape and memory folds together, since a shape fold often exposes a memory fold on the next iteration.
void populate_tileaa_canonicalizers(PatternSet *patterns) {
add(patterns, fold_constant_dot);
add(patterns, fuse_addptr_chain);
add(patterns, push_assume_through_splat);
add(patterns, push_select_through_splat);
add(patterns, canonicalize_masked_load);
add(patterns, canonicalize_masked_store);
add(patterns, fold_expand_dims_of_splat);
add(patterns, fold_view_chain);
add(patterns, fold_broadcast_of_splat);
add(patterns, hoist_extract_through_elementwise);
add(patterns, prune_queue_get_results);
add(patterns, prune_pragma_results);
add(patterns, fold_atomic_cas);
add(patterns, fold_atomic_rmw);
}
void canonicalize_tileaa(Module module) {
PatternSet patterns;
populate_tileaa_canonicalizers(&patterns);
run_greedy_rewrite(module, patterns);
}
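The reason for registering shape and memory folds in one set can be shown with a toy fixpoint driver. This is a sketch only: the "IR" is a list of strings, `run_greedy_rewrite` is a simplified stand-in for a real greedy driver, and the two pattern functions are hypothetical.

```cpp
#include <functional>
#include <string>
#include <vector>

using Program = std::vector<std::string>;        // toy IR: one op per entry
using Pattern = std::function<bool(Program &)>;  // returns true if it rewrote something

// Rewrites the first op equal to `from`; reports whether anything changed.
bool replace_first(Program &p, const std::string &from, const std::string &to) {
    for (std::string &op : p) {
        if (op == from) {
            op = to;
            return true;
        }
    }
    return false;
}

// Toy shape fold: a view around a splat mask collapses to the splat.
bool fold_view_of_splat_mask(Program &p) {
    return replace_first(p, "masked_load(view(splat(true)))", "masked_load(splat(true))");
}

// Toy memory fold: a constant-true mask becomes an unmasked load.
bool fold_constant_true_mask(Program &p) {
    return replace_first(p, "masked_load(splat(true))", "load");
}

// Sweeps all patterns until a full sweep changes nothing; returns the rewrite count.
int run_greedy_rewrite(Program &program, const std::vector<Pattern> &patterns) {
    int rewrites = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Pattern &pattern : patterns) {
            if (pattern(program)) {
                changed = true;
                ++rewrites;
            }
        }
    }
    return rewrites;
}
```

Starting from `masked_load(view(splat(true)))`, a driver holding both patterns reaches `load` in two rewrites, because the shape fold exposes the mask constant on the following sweep. A driver holding only the memory fold makes no progress at all.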
Constant Dot Folding
dot is the only expensive arithmetic fold in the dialect, and it respects
the same element-type and accumulator rules as the verifier. Folding is legal
when all inputs needed for the multiply-accumulate are constant-like, or when
a zero multiplicand proves the result equals the accumulator unchanged.
Optional<Value> fold_dot(DotOp op) {
if (is_zero_tile(op.a) || is_zero_tile(op.b)) {
return op.accumulator;
}
ConstantTile a = dyn_cast_constant_tile(op.a);
ConstantTile b = dyn_cast_constant_tile(op.b);
ConstantTile c = dyn_cast_constant_tile(op.accumulator);
if (!a.valid || !b.valid || !c.valid) {
return none();
}
ConstantTile result = c;
for (int m = 0; m < op.m; ++m) {
for (int n = 0; n < op.n; ++n) {
for (int k = 0; k < op.k; ++k) {
result[m, n] += convert(a[m, k]) * convert(b[k, n]);
}
}
}
return materialize_constant(result, op.result.type);
}
For integer MMA, the convert step honors the operation's signedness
attributes. For floating MMA, it honors the accumulator type — never the
narrow input format, which would silently drop precision.
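A small worked case makes the signedness point concrete. The sketch below assumes 8-bit integer inputs, 32-bit accumulators, and row-major constant tiles; `widen` and `fold_int8_dot` are hypothetical helper names, not functions from the dialect.

```cpp
#include <cstdint>
#include <vector>

// Widens a raw 8-bit payload according to the operand's signedness attribute.
std::int32_t widen(std::uint8_t raw, bool is_signed) {
    return is_signed ? static_cast<std::int32_t>(static_cast<std::int8_t>(raw))
                     : static_cast<std::int32_t>(raw);
}

// Row-major tiles: A is m x k, B is k x n, C (the accumulator) is m x n.
std::vector<std::int32_t> fold_int8_dot(const std::vector<std::uint8_t> &a, bool a_signed,
                                        const std::vector<std::uint8_t> &b, bool b_signed,
                                        std::vector<std::int32_t> c, int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int p = 0; p < k; ++p) {
                c[i * n + j] += widen(a[i * k + p], a_signed) * widen(b[p * n + j], b_signed);
            }
        }
    }
    return c;
}
```

With A = [0xFF, 1], B = [2, 3], and an accumulator of 10, treating A as signed gives 10 + (-1)(2) + (1)(3) = 11, while treating it as unsigned gives 10 + (255)(2) + (1)(3) = 523. The same bit pattern folds to different constants, so the folder has to read the signedness attributes rather than guess.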
Pointer and Shape Folds
Pointer-arithmetic canonicalization keeps addressing expressions shallow. The fold is safe only when both offsets are measured in the same logical element units. If one offset has already been converted to bytes and the other has not, the rewriter normalizes them before adding.
Optional<AddPtrOp> fuse_addptr_chain(AddPtrOp outer) {
AddPtrOp inner = dyn_cast_addptr(outer.base);
if (!inner.valid) {
return none();
}
require(inner.result.address_space == outer.result.address_space);
IndexValue lhs = normalize_offset(inner.offset, inner.result.element_type);
IndexValue rhs = normalize_offset(outer.offset, outer.result.element_type);
IndexValue fused = add_index_values(lhs, rhs);
return rebuild_addptr(inner.base, fused, outer.result.type);
}
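The normalization step can be sketched in isolation. The `Unit` tag, the `Offset` struct, and both functions below are hypothetical; the sketch assumes offsets are normalized to element units and that a byte offset which is not a whole number of elements blocks the fold.

```cpp
#include <cstdint>
#include <optional>

enum class Unit { Elements, Bytes };

struct Offset {
    std::int64_t value;
    Unit unit;
};

// Converts an offset to element units; fails if a byte offset is not a whole
// number of elements.
std::optional<std::int64_t> normalize_offset(Offset off, std::int64_t element_bytes) {
    if (off.unit == Unit::Elements) {
        return off.value;
    }
    if (element_bytes <= 0 || off.value % element_bytes != 0) {
        return std::nullopt;
    }
    return off.value / element_bytes;
}

// Fuses two offsets only after both are expressed in the same unit.
std::optional<std::int64_t> fuse_offsets(Offset inner, Offset outer, std::int64_t element_bytes) {
    const auto lhs = normalize_offset(inner, element_bytes);
    const auto rhs = normalize_offset(outer, element_bytes);
    if (!lhs || !rhs) {
        return std::nullopt;
    }
    return *lhs + *rhs;
}
```

For 4-byte elements, an inner offset of 4 elements and an outer offset of 32 bytes fuse to 12 elements. Adding the raw numbers would give 36, which is wrong in either unit.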
Shape folds all follow one rule: remove wrappers that don't change the logical element stream, but keep the final result type exactly as the original op requested.
Optional<Value> fold_shape_wrapper(Op op) {
if (op.kind == VIEW && producer_is_view(op.input)) {
return rebuild_view(op.input.source, op.result.type);
}
if (op.kind == VIEW && producer_is_splat(op.input)) {
return rebuild_splat(op.input.scalar, op.result.type);
}
if (op.kind == BROADCAST && producer_is_splat(op.input)) {
return rebuild_splat(op.input.scalar, op.result.type);
}
if (op.kind == EXPAND_DIMS && producer_is_splat(op.input)) {
return rebuild_splat(op.input.scalar, op.result.type);
}
return none();
}
Masked Memory Folds
Masked load and store folds look simple but are easy to get wrong: memory effects and token results must stay valid. A constant-false load performs no read, so it returns the fallback data and the original token. A constant-false store performs no write, so the op disappears and its token users get rewired to the incoming token.
RewriteResult canonicalize_masked_load(LoadOp op) {
if (!op.mask.is_constant()) {
return no_change();
}
if (op.mask.is_true()) {
return replace_with_unmasked_load(op);
}
Value fallback = op.other.has_value ? op.other.value : undef(op.result.type);
replace_value(op.data_result, fallback);
replace_value(op.token_result, op.input_token);
erase(op);
return changed();
}
RewriteResult canonicalize_masked_store(StoreOp op) {
if (!op.mask.is_constant()) {
return no_change();
}
if (op.mask.is_true()) {
return replace_with_unmasked_store(op);
}
replace_value(op.token_result, op.input_token);
erase(op);
return changed();
}
Atomic Folds
Atomic folds are strength reductions, not permission to erase memory ordering. Even when the data operation becomes a load or no-op, token users still need to see the correct ordering edge.
RewriteResult fold_atomic_cas(AtomicCasOp op) {
Optional<Constant> compare = constant_value(op.compare);
Optional<Constant> replacement = constant_value(op.replacement);
if (!compare.has_value || !replacement.has_value) {
return no_change();
}
if (constants_equal(compare.value, replacement.value)) {
Value loaded = atomic_load(op.address, op.ordering, op.scope);
replace_value(op.data_result, loaded);
replace_value(op.token_result, sequence_after(op.input_token, loaded));
erase(op);
return changed();
}
return rebuild_with_constants(op, compare, replacement);
}
RewriteResult fold_atomic_rmw(AtomicRmwOp op) {
Optional<Constant> rhs = constant_value(op.value);
if (!rhs.has_value) {
return no_change();
}
if (is_identity_for_rmw(op.mode, rhs.value)) {
Value loaded = atomic_load(op.address, op.ordering, op.scope);
replace_value(op.data_result, loaded);
replace_value(op.token_result, sequence_after(op.input_token, loaded));
erase(op);
return changed();
}
return no_change();
}
For add, or, and xor, the identity is zero. For and, it is all bits
set. For exchange, the fold is legal only when another proof says the stored
value is already present — constant equality with a compare operand is not
enough unless that compare participates in the same atomic contract.
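The identity rules above translate into a small predicate. This is a sketch restricted to 32-bit integer operands; the `RmwMode` enumerators are hypothetical names for the modes mentioned in the text, and other modes or floating-point operands are deliberately left out.

```cpp
#include <cstdint>

enum class RmwMode { Add, Or, Xor, And, Exchange };

// True when applying `mode` with `rhs` leaves every 32-bit value unchanged.
bool is_identity_for_rmw(RmwMode mode, std::uint32_t rhs) {
    switch (mode) {
    case RmwMode::Add:
    case RmwMode::Or:
    case RmwMode::Xor:
        return rhs == 0u;
    case RmwMode::And:
        return rhs == 0xFFFFFFFFu;
    case RmwMode::Exchange:
        return false;  // needs a separate proof; the constant alone is never enough
    }
    return false;
}
```

Note that `And` with zero is not an identity: it clears the location, so it must stay a real atomic.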
Queue and Pragma Pruning
Queue and pragma ops often carry multiple results because earlier lowering doesn't yet know which values get consumed. Once ordinary DCE has marked some results unused, canonicalization shrinks the result list and updates the region terminator to yield only the survivors.
RewriteResult prune_region_results(RegionOp op, Terminator terminator) {
BitSet live = live_result_indices(op);
if (live.count == op.results.count) {
return no_change();
}
SmallVector<Type> new_types;
SmallVector<Value> new_yields;
for (int i = 0; i < op.results.count; ++i) {
if (!live.contains(i)) {
continue;
}
new_types.push(op.results[i].type);
new_yields.push(terminator.operands[i]);
}
RegionOp replacement = clone_with_result_types(op, new_types);
replacement.terminator.operands = new_yields;
replace_live_results(op, replacement, live);
erase(op);
return changed();
}
This fold preserves relative order. Reordering live queue results changes the meaning of downstream consumers, even when the types happen to match.
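The order-preservation rule is easiest to see on plain vectors. In this sketch, `keep_live` and `remap_result_indices` are hypothetical helpers: the first filters yields by liveness, the second computes where each old result index lands, with -1 marking a dropped result.

```cpp
#include <cstddef>
#include <vector>

// Keeps the entries whose index is live, in their original relative order.
template <typename T>
std::vector<T> keep_live(const std::vector<T> &items, const std::vector<bool> &live) {
    std::vector<T> out;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (live[i]) {
            out.push_back(items[i]);
        }
    }
    return out;
}

// Maps each old result index to its new index, or -1 if the result was dropped.
std::vector<int> remap_result_indices(const std::vector<bool> &live) {
    std::vector<int> remap(live.size(), -1);
    int next = 0;
    for (std::size_t i = 0; i < live.size(); ++i) {
        if (live[i]) {
            remap[i] = next++;
        }
    }
    return remap;
}
```

With liveness {true, false, true, true}, yields {10, 20, 30, 40} shrink to {10, 30, 40} and the index map is {0, -1, 1, 2}. Users of old result 2 must be rewired to new result 1, and no surviving pair ever swaps places.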
Memory Token Lowering
At TileAA level, mem_token is an abstract SSA value. By TileAS and NVVM
level it has become a compact phase-bearing integer tied to async barrier
state. The exact integer encoding is a backend choice; the required
semantics are that a joined token cannot be considered complete until every
input token is complete, and that every memory effect produces a successor
token.
typedef struct {
int barrier_id;
int phase;
} LoweredToken;
LoweredToken lower_create_mem_token(BarrierAllocator *allocator) {
int barrier = allocator->allocate();
emit_mbarrier_init(barrier);
return (LoweredToken){ .barrier_id = barrier, .phase = 0 };
}
LoweredToken lower_join_mem_token(ArrayRef<LoweredToken> inputs) {
LoweredToken result = inputs[0];
for (LoweredToken token : inputs.drop_front()) {
result = later_of(result, token);
}
return result;
}
LoweredToken sequence_memory_effect(LoweredToken input, MemoryEffect effect) {
emit_effect_after_token(effect, input);
return toggle_phase_when_needed(input, effect);
}
Mem-Token Lifecycle
nv_tileaa.mem_token is the linear-type SSA value that threads memory-ordering edges through the IR. It is produced by
create_mem_token, consumed by join_mem_token, and ultimately materialised as an mbarrier physical handle by the
downstream lowering passes. The mem_token is a pure ordering edge: it carries no user-visible data, only a proof that every
preceding memory effect on the edge has completed before any successor effect observes it. That mechanism is what
lets the scheduler reason about async copy, WGMMA, and TMA completion without baking specific barrier hardware into
the upper-dialect IR.
The TypeID slot for nv_tileaa.mem_token is the static-sentinel at &unk_5B46F78. Pointer-identity dispatch against
this slot is how walkers, type converters, and verifiers recognise mem_token values without parsing their printed
form. Anchor the type with one stable address per process, not a per-context allocation — cross-pass machinery
compares the slot by pointer.
The mem_token reaches its mbarrier physical form in two lowering hops. Both are pattern-driven and leave the
token-graph topology intact, changing only the underlying carrier type.
cuda_tile.make_token   --hop 1-->  nv_tileaa.create_mem_token  --hop 2-->  nvvm.mbarrier.init (i32 handle)
cuda_tile.join_tokens  --hop 1-->  nv_tileaa.join_mem_token    --hop 2-->  nvvm.mbarrier.try_wait.parity.shared
hop 1: CudaTile -> TileAA, sub_5F8DC0
hop 2: TileAA -> TileAS, sub_110B730
Hop one runs inside the Part-B populator of ConvertCudaTileToTileAA at sub_5F8DC0. The routine rewrites every
cuda_tile.make_token into an nv_tileaa.create_mem_token and every cuda_tile.join_tokens into the matching
nv_tileaa.join_mem_token. Op kinds receive new TypeIDs at the rewrite boundary, but operand counts, result
counts, and ordering semantics survive one-for-one. No barrier resource is allocated yet; the token is still an
abstract SSA value.
Hop two runs inside the TileAA-to-TileAS conversion driver at sub_110B730. Two conversion patterns dominate:
- CreateMemTokenOpConversion, dispatched through vtable off_59D53C0, turns each nv_tileaa.create_mem_token into an mbarrier.init LLVM intrinsic call. The op's SSA result becomes a 32-bit handle that mbarrier hardware tracks per CTA. The init's phase-count operand is the number of producers the original token expected to merge into the barrier, threaded from the op's producer attribute.
- JoinMemTokenOpConversion, dispatched through vtable off_59D5410, turns each nv_tileaa.join_mem_token into a chain of mbarrier.try_wait.parity.shared calls, one per producer in the join. The chain encodes the spin loop on the parity bit so the joined token cannot retire until every input producer has flipped its share of the phase.
When a builder emits create_mem_token without an explicit result type, the op's inferReturnTypes hook supplies
one. The hook implements a five-step algorithm:
- Walk the operands to find the alias-set that defines the token's scope.
- Look up the scope's existing token type via the surrounding function's local TypeConverter.
- If no existing token type exists for the scope, create a fresh MemTokenType carrying the inferred scope.
- Stash the inferred type in the function's TypeConverter cache so later builders share it.
- Return the cached type as the op's result type.
That caching keeps mem_token types pointer-equal across a function even when the builder fires from many different rewrite sites. A reimplementation that re-derives the type per call site fragments the cache and breaks the pointer-identity dispatch above.
After hop two completes, the mem_token is fully replaced. The successor i32 value holds the per-CTA mbarrier
index; the mbarrier slot itself comes from D-pass buffer-assignment out of the per-CTA 32-mbarrier pool. Each
create_mem_token op claims one slot from that pool — see
Buffer Assignment and Mbarriers — Phase 2 for the
allocation details. Pool exhaustion is a hard failure of the lowering pipeline, never a fallback to software
ordering.
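The hard-failure rule can be sketched as a tiny allocator. The pool size of 32 comes from the text above; everything else is an assumption for illustration, including the class name `MbarrierPool`, the first-free slot policy, and the use of an exception to model a pipeline failure.

```cpp
#include <bitset>
#include <stdexcept>

// Toy model of the per-CTA mbarrier pool.
class MbarrierPool {
public:
    static constexpr int kSlots = 32;  // pool size stated in the text

    // Claims the lowest free slot; exhaustion is an error, never a fallback.
    int allocate() {
        for (int i = 0; i < kSlots; ++i) {
            if (!used_.test(i)) {
                used_.set(i);
                return i;
            }
        }
        throw std::runtime_error("mbarrier pool exhausted");
    }

    int in_use() const { return static_cast<int>(used_.count()); }

private:
    std::bitset<kSlots> used_;
};
```

A kernel that lowers 32 create_mem_token ops fills the pool exactly; the thirty-third request throws instead of silently degrading to software ordering.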
Value lowerCreateMemToken(Op op, ConversionPatternRewriter &rw) {
// The phase count is the number of producers the original token expected.
Value mbarrier = rw.create<nvvm::MbarrierInitOp>(op.getLoc(), /*phaseCount=*/op.getNumProducers());
return mbarrier;
}
Value lowerJoinMemToken(Op op, ConversionPatternRewriter &rw, ArrayRef<Value> tokens) {
Value tryWait = rw.create<nvvm::MbarrierTryWaitParitySharedOp>(op.getLoc(), tokens.front(), /*phase=*/0);
/* spin loop on parity bit, repeated for each remaining input token */
return tryWait;
}
The cross-reference pages cover the supporting machinery: Operation Roster — Tokens and Lifetime for the create_mem_token
and join_mem_token op rosters, Cuda Tile to TileAA — Tokens and Atomics for hop
one, TileAA to TileAS — Three Populators for hop two, and
Buffer Assignment and Mbarriers — Phase 2 for mbarrier
slot allocation.
Plugin and Queue Contract
Plugin operations carry resource requirements the scheduler must honor: register budget, shared-memory scratch, tensor-memory scratch, named barriers, input layouts, output layouts. Queue lowering consumes those requirements while turning queue regions into TileAS producer and consumer pipeline regions.
LogicalResult lower_plugin_execute(ExecuteOp execute, ResourceModel model) {
require(model.registers_available(execute.max_registers));
require(model.named_barriers_available(execute.named_barriers));
require(model.shared_memory_available(execute.shared_memory_bytes));
require(model.tensor_memory_available(execute.tensor_memory_bytes));
PipelineRegion region = materialize_agent_region(execute);
attach_plugin_layouts(region, execute.input_layouts, execute.output_layouts);
return success();
}
The queue conversion fails loudly when a producer or consumer cannot map to a pipeline slot. Silent fallback to unordered memory traffic loses the main correctness property the queue was carrying.
Invariants
- Canonicalization must preserve memory tokens even when it removes data work.
- Shape folds may change producer structure but not the final result type.
- Pointer folds must normalize offsets before adding them.
- Queue and pragma pruning preserve live-result order.
- Atomic folds must preserve memory ordering and volatility semantics.
- Token lowering may choose any compact representation that preserves join and successor ordering.
nv_tileas Dialect Overview
nv_tileas is the operational async-scheduling dialect in the TileIR lowering stack, sitting below nv_tileaa — where queues and agents still describe intent — and above the LLVM/NVVM dialects, where the same work is spelled out as barriers, bulk tensor-memory transfers, register-budget changes, and target intrinsics.
The dialect's job is to make a warp-specialized kernel explicit enough to schedule. Queue edges become producer and consumer regions. Stage movement becomes an SSA iterator. Asynchronous copies become token-producing memory operations. Layout choices become concrete tiled loads, stores, descriptor construction, and conversions. Once a kernel is in nv_tileas, later passes can ask precise questions: which agent owns this region, which pipeline stage this operation occupies, which async value orders this consumer, and which memory operation must not be reordered past another.
Users of the compiler pipeline treat nv_tileas as an internal form — write cuda_tile or nv_tileaa and let the pipeline materialize it. Reimplementers treat it as the main contract between high-level tile semantics and target-specific code generation.
Position in the Cascade
cuda_tile
|
| lift public tile operations into agent-aware TileIR
v
nv_tileaa
|
| materialize queues, agents, async regions, and pipeline iterators
v
nv_tileas
|
| schedule stages, assign layouts, lower async memory and barriers
v
llvm + nvvm
nv_tileaa is declarative: it says a producer puts a value into a queue and a consumer later gets it. nv_tileas is operational: it says how that edge is represented — producer acquire/write/commit, consumer wait/read/release, stage iteration, region yields. That distinction is why the scheduler and final lowering passes operate on nv_tileas.
Programming Model
The central abstraction is a bounded asynchronous pipeline shared by one or more producer agents and one or more consumer agents. Each pipeline has a fixed stage count, stage-local storage, producer and consumer token types, a rotating iterator identifying the active stage, and optional async result tokens for operations whose completion is observed separately from the producer/consumer handshake.
The normal pipeline lifecycle:
Pipeline pipe = create_pipeline(stages, storage, producer_group, consumer_group);
PipelineIter iter = create_iterator(pipe);
for (int logical_stage = 0; logical_stage < work_items; ++logical_stage) {
    if (current_agent_is_producer()) {
        ProducerToken ready = producer_acquire(pipe, iter);
        ProducerToken written = producer_write(ready, iter) {
            // Fill the stage-local buffer: TMA load, tiled load, layout conversion,
            // MMA preparation, or another producer-side action.
            yield(next_producer_values);
        };
        producer_commit(written);
    }
    if (current_agent_is_consumer()) {
        ConsumerToken ready = consumer_wait(pipe, iter, consumer_idx);
        ConsumerToken consumed = consumer_read(ready, iter) {
            // Read values produced for this stage and run the consumer work:
            // MMA, reductions, stores, or downstream async launches.
            yield(next_consumer_values);
        };
        consumer_release(consumed);
    }
    iter = inc_iter(iter);
}
The real IR uses MLIR regions rather than C blocks. The pseudocode is the semantic contract — acquire grants ownership of a producer stage, write fills it, commit publishes it, wait observes a committed stage, read consumes it, release returns the stage to the pipeline.
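The ownership contract can be read as a small state machine per stage. The sketch below is illustrative only: `StageState`, `Stage`, and `run_round_trip` are hypothetical names, and the model captures just the ordering rule, not the tokens or barriers the dialect actually lowers to.

```cpp
#include <cassert>

// Hypothetical model of one pipeline stage. The states mirror the
// acquire/write/commit/wait/read/release contract described above.
enum class StageState { Free, Acquired, Written, Committed, Waited, Read };

struct Stage {
    StageState state = StageState::Free;

    // A step is accepted only from the state its predecessor leaves behind.
    bool step(StageState from, StageState to) {
        if (state != from) return false;  // out-of-order handshake
        state = to;
        return true;
    }
    bool producer_acquire() { return step(StageState::Free, StageState::Acquired); }
    bool producer_write()   { return step(StageState::Acquired, StageState::Written); }
    bool producer_commit()  { return step(StageState::Written, StageState::Committed); }
    bool consumer_wait()    { return step(StageState::Committed, StageState::Waited); }
    bool consumer_read()    { return step(StageState::Waited, StageState::Read); }
    bool consumer_release() { return step(StageState::Read, StageState::Free); }
};

// Runs one full producer/consumer round trip and reports whether every
// step was accepted in order.
inline bool run_round_trip(Stage &s) {
    bool ok = s.producer_acquire() && s.producer_write() && s.producer_commit() &&
              s.consumer_wait() && s.consumer_read() && s.consumer_release();
    assert(!ok || s.state == StageState::Free);  // a full round trip frees the stage
    return ok;
}
```

In this model a consumer that calls `consumer_wait` on a stage that was acquired but never committed is rejected, which is the property the scheduler relies on when it orders consumers after the commit.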
Operation Families
The dialect's operations fall into three related families.
| Family | Representative operations | Purpose |
|---|---|---|
| Async pipeline | async.pipeline.create_pipeline, create_iterator, inc_iter, produce_one, consume_one, producer_acquire, producer_write, producer_commit, consumer_wait, consumer_read, consumer_release, agent_switch, yield | Represents warp-specialized producer/consumer execution as explicit regions and tokens. |
| Memory and movement | tiled_load, tiled_store, tiled_atomic_rmw, async.tiled_load, async.tiled_tma_load, async.tiled_tma_store, async.gather_tma_load, async.scatter_tma_store, make_tiled_tma_desc, copy, dot, load, store | Represents layout-aware data movement, TMA descriptors, asynchronous copies, and tiled compute edges. |
| Tile structure | alloc_tensor, convert_layout, view, reinterpret, extract_slice, insert_slice, expand_dims, shuffle, reduce, scan, generate, pragma | Represents local tile storage, shape/view manipulation, layout conversion, reductions, scans, generation, and optimizer directives. |
The atom attribute connects tiled memory operations to the later cute_nvgpu atom selection. padding_value controls out-of-bounds behavior for gather-like operations. consumer_idx distinguishes consumers inside a consumer group. ocgEnterDirectives and ocgLeaveDirectives carry optimizer directive payloads through structured regions.
Queue-to-Pipeline Lowering
The primary producer of nv_tileas is the lowering from nv_tileaa. It turns an abstract queue program into a pipeline program the scheduler can reason about.
void lower_tileaa_to_tileas(Module module) {
    for (ExecuteOp exec : module.execute_ops()) {
        replace exec with agent_switch {
            clone_each_agent_body(exec);
            preserve_agent_group_ids_and_register_budgets(exec);
        };
    }
    for (QueueOp queue : module.queues()) {
        Pipeline pipe = create_pipeline(
            queue.stage_count,
            queue.stage_storage,
            queue.producer_group,
            queue.consumer_group);
        replace queue.create with pipe;
        for (QueuePutOp put : queue.puts()) {
            replace put with produce_one(pipe.producer_token, pipe.iterator) {
                ProducerToken t0 = producer_acquire(pipe, iterator);
                ProducerToken t1 = producer_write(t0, iterator) {
                    clone_producer_body(put);
                    yield(produced_values);
                };
                producer_commit(t1);
                yield(updated_pipeline_values);
            };
        }
        for (QueueGetOp get : queue.gets()) {
            replace get with consume_one(pipe.consumer_token, pipe.iterator) {
                ConsumerToken t0 = consumer_wait(pipe, iterator, get.consumer_idx);
                ConsumerToken t1 = consumer_read(t0, iterator) {
                    clone_consumer_body(get);
                    yield(consumed_values);
                };
                consumer_release(t1);
                yield(updated_pipeline_values);
            };
        }
    }
    propagate_pipeline_iterator_types_through_scf(module);
    erase_dead_queue_scaffolding(module);
}
The lowering is deliberately one-way. After this point the compiler no longer needs the original queue identity, only the pipeline stages, tokens, agent regions, and iterator values, because those are the handles that scheduling, layout assignment, and final lowering work with.
Iterator Propagation
PipelineIteratorType is the type-level wrapper that lets an iterator travel through structured control flow without losing the element type it indexes. The propagation pass rewrites loop and branch signatures until every path agrees on the same iterator type.
Type propagate_iterator_type(Value value, Type expected) {
    if (!is_pipeline_iterator(value.type)) {
        value.type = PipelineIteratorType(expected);
    }
    for (Use use : value.uses) {
        Operation owner = use.owner;
        if (owner is scf.for) {
            rewrite_loop_iter_arg(owner, use.index, value.type);
            rewrite_loop_yield(owner, use.index, value.type);
        } else if (owner is scf.if) {
            Type then_type = propagate_branch_yield(owner.then_region, use.index, value.type);
            Type else_type = propagate_branch_yield(owner.else_region, use.index, value.type);
            assert(then_type == else_type);
            owner.result(use.index).type = then_type;
        } else {
            rewrite_operand(owner, use.index, value.type);
        }
    }
    return value.type;
}
Branch agreement is the key invariant: when a pipeline iterator yields from both sides of an scf.if, the two yielded values must have the same iterator type. Without that rule, later schedule and lowering passes cannot assign a single stage meaning to the merged SSA value.
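The branch-agreement rule can be expressed as a merge function that either returns the shared type or reports a mismatch. This is a minimal sketch: `IterType` and `merge_branch_yields` are hypothetical stand-ins, and the iterator type is reduced to the printed form of the element type it indexes.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for PipelineIteratorType: only the element type the
// iterator indexes is modeled, as a string such as "tile<128x128xf16>".
struct IterType {
    std::string element;
    bool operator==(const IterType &other) const { return element == other.element; }
};

// Returns the merged result type for an scf.if-like branch, or nullopt when
// the two arms yield different iterator types (the verifier error case).
inline std::optional<IterType> merge_branch_yields(const IterType &then_ty,
                                                   const IterType &else_ty) {
    if (!(then_ty == else_ty)) return std::nullopt;
    return then_ty;
}
```

Returning an empty optional rather than picking one arm's type keeps the failure visible to the caller, which matches the text's point that a merged value without a single stage meaning must not reach scheduling.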
Memory and Layout Contract
Tiled memory operations carry two independent contracts. The layout contract demands that value shape, chosen atom, and any convert_layout operations agree on how the tile is partitioned across registers, shared memory, tensor memory, or global memory. The ordering contract demands that async memory operations and descriptor construction carry memory-consistency information so canonicalization cannot reorder them across a visible synchronization edge.
A correct lowering treats TMA descriptor construction as a separate operation, not a side effect of a TMA load or store. Later passes then have a real SSA value for the descriptor, verifier logic can check descriptor alignment and capture restrictions, and the scheduler can account for descriptor-producing work independently from the transfer that consumes it.
TmaDesc make_tiled_tma_desc(TensorView global, TileShape box, Atom atom) {
    require(atom.kind == TMA_LOAD || atom.kind == TMA_STORE);
    require(box.rank == atom.box_rank);
    require(global.element_stride == 1);
    require(descriptor_pointer_is_aligned());
    return encode_descriptor(global.base, global.shape, global.strides, box, atom);
}
AsyncToken async_tiled_tma_load(TmaDesc desc, SmemTile dst, PipelineStage stage) {
    require(dst.layout.is_tma_compatible());
    require(stage.is_owned_by_current_producer());
    return launch_bulk_tensor_copy(desc, dst, stage.barrier);
}
Verifier Invariants
A correct nv_tileas implementation enforces these invariants before the scheduler runs:
- produce_one and consume_one regions end with nv_tileas.async.pipeline.yield.
- Producer region argument types match the element types carried by the producer token.
- Consumer region argument types match the element types carried by the consumer token.
- Region result types match the operation result types.
- consumer_idx identifies a valid consumer in the consumer group.
- Pipeline iterator values yielded from both arms of a branch have the same type.
- TMA operations use the correct atom kind for load or store.
- TMA box dimensions and atom box dimensions agree.
- TMA element stride is one.
- Shared-memory layouts consumed by TMA operations are TMA-compatible.
- Special padding values are only used for floating-point element types.
- Memory operations that carry acquire, release, or stronger semantics are not reordered across their synchronization boundary.
None of these checks is cosmetic. The scheduler assumes region types, iterator types, consumer indices, and memory ordering are already valid. Invalid nv_tileas that reaches scheduling can produce a (stage, order) assignment that looks well-formed while representing an impossible pipeline.
AbstractOperation Record
Every registered op in nv_tileas carries a single 0x70-byte AbstractOperation record — same layout as
nv_tileaa, eight bytes wider than the cuda_tile record. The dialect ctor allocates each slab with
sub_44A8C20(0x70) and uses the extra slot at +0x68 for the alias-token concept pointer the dialect
inherits from its alias-aware sibling. Otherwise the shape matches the descriptor an Operation* resolves
through its OperationName slot to reach the dialect's interface tables and fold callback.
typedef struct AbstractOperation {
    /*+0x00*/ void **vtable;                          // dispatch for the op
    /*+0x08*/ StringRef mnemonic;                     // e.g. "nv_tileas.async.tiled_tma_load"
    /*+0x18*/ ConceptModel *interface_inliner;
    /*+0x20*/ ConceptModel *interface_opasm;
    /*+0x28*/ ConceptModel *interface_fold;
    /*+0x30*/ ConceptModel *interface_typeinfer;
    /*+0x38*/ ConceptModel *interface_bytecode;
    /*+0x40*/ ConceptModel *interface_memeffects;
    /*+0x48*/ ConceptModel *interface_destinationstyle;
    /*+0x50*/ ConceptModel *interface_alias;          // alias-aware concept inherited from nv_tileaa
    /*+0x58*/ ConceptModel *interface_extra1;
    /*+0x60*/ FoldCallback fold_canon;                // op-fold and canonicalize hook
    /*+0x68*/ ConceptModel *interface_alias_token;    // extra slot for the alias-token concept
} AbstractOperation;
The slab is zero-initialized by the allocator, so unused interface slots stay null and the dispatcher can
probe them without a presence flag. The mnemonic field is an embedded StringRef whose pointer references
a .rodata literal owned by the binary, not a heap-interned copy. The interface-concept pointers at
+0x18..+0x58 are the MLIR concept-model singletons that implement inlining, asm printing, folding, type
inference, bytecode, memory effects, destination-style, and the alias-aware concept at +0x50. The fold
callback at +0x60 is the op's per-op rewriter, and the extra concept pointer at +0x68 is the alias-token
model. The per-op class vtables are dense in the unk_59DC... neighbourhood: for example,
nv_tileas.alloc_tensor is registered with vtable &unk_59DC860, nv_tileas.async.pipeline.create_pipeline
with &unk_59DC9F0, and nv_tileas.async.tiled_tma_load with the corresponding &unk_59DD... slot, each
populated by its inline reg stanza in sub_147CAC0.
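The offsets in the record can be checked mechanically. The following sketch assumes an LP64 target and uses placeholder field types (`void *` for every concept pointer and for the fold callback); `StringRefModel`, `AbstractOperationModel`, and `lacks_interface` are hypothetical names, and only the offsets and the 0x70 total come from the text.

```cpp
#include <cstddef>

// Placeholder for the embedded StringRef: pointer plus length, 16 bytes on LP64.
struct StringRefModel {
    const char *data;    // points at a .rodata literal
    std::size_t length;
};

// Layout-only model of the record; field types are stand-ins.
struct AbstractOperationModel {
    void **vtable;                       // +0x00
    StringRefModel mnemonic;             // +0x08
    void *interface_inliner;             // +0x18
    void *interface_opasm;               // +0x20
    void *interface_fold;                // +0x28
    void *interface_typeinfer;           // +0x30
    void *interface_bytecode;            // +0x38
    void *interface_memeffects;          // +0x40
    void *interface_destinationstyle;    // +0x48
    void *interface_alias;               // +0x50
    void *interface_extra1;              // +0x58
    void *fold_canon;                    // +0x60
    void *interface_alias_token;         // +0x68
};

static_assert(sizeof(void *) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(AbstractOperationModel, mnemonic) == 0x08, "mnemonic");
static_assert(offsetof(AbstractOperationModel, interface_inliner) == 0x18, "inliner");
static_assert(offsetof(AbstractOperationModel, interface_alias) == 0x50, "alias");
static_assert(offsetof(AbstractOperationModel, fold_canon) == 0x60, "fold");
static_assert(offsetof(AbstractOperationModel, interface_alias_token) == 0x68, "token");
static_assert(sizeof(AbstractOperationModel) == 0x70, "one 0x70-byte slab per op");

// True when an optional interface slot was left null by zero-initialization,
// which is how the dispatcher can probe a slot without a presence flag.
inline bool lacks_interface(const void *slot) { return slot == nullptr; }
```

Keeping the checks as `static_assert` means a layout drift shows up at compile time rather than as a misread pointer at run time.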
The records sit consecutively in a statically-allocated array in .data.rel.ro that mirrors the layout of
the other tile dialects: one 0x70 slab per op, walked from the dialect base by mnemonic hash. The
end-of-registered-ops boundary is marked by the same null sentinel as cuda_tile and nv_tileaa, 0x5BE6138;
lookup helpers stop walking the bank when they hit it.
This is the static-sentinel idiom described in
TypeID Sentinels and Anchors: the bank is
allocated once, lives for the entire process, and is indexed by mnemonic hash from the dialect base. Live
Operation* instances reach this record through their OperationName slot, which is the resolution path
documented in Operation Layout — Pointer-Identity Dispatch. The per-op vtable
and fold-callback pairs for the rest of the roster are catalogued in
Operation Roster and Builders.
Cross-links
- Op Roster and Builders — complete operation list and low-level builder notes.
- Types — pipeline tokens, iterator types, agent-like interfaces, and yield interface behavior.
- Verifiers — detailed verifier behavior for pipeline regions, TMA operations, tiled memory operations, and layout constraints.
- Folds and Memory Consistency — canonical rewrites, memory consistency interface behavior, and ordering-sensitive rewrite rules.
Appendix: Pass-Object PIMPL Layout
In the binary, most TileAS passes appear as a thin mlir::Pass shim wrapping a single heap-allocated PIMPL
object. The shim's run method dispatches through the vtable at offset +0x00 of the PIMPL; the secondary
interface pointer at +0x08 is the analysis-and-statistics object the pass framework calls into when the
run method asks for an option store, a dominator analysis, or a symbol-table cache. Every pass shares this
skeleton — find one pass, find all of them by following the same offsets.
The objects range from 0x150 to 0x3C0 bytes. Beyond the two leading pointers, one field matters for
verification: the pass-failure bit at *(pass+40) & 4, which the framework reads after the run method
returns to decide whether to mark the pipeline failed. Options the pass forwards from the command line sit
at predictable offsets, typically +0x1D0 (464), +0x2A0 (672), or +0x370 (880), depending on how many
string and integer options the pass carries.
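A tool that inspects a raw copy of a pass object could encode these offsets as constants. The sketch below is hedged: reading the failure flag as a single byte is an assumption (the text gives only the expression `*(pass+40) & 4`), and the constant and function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Offsets quoted in the text. The access width of the failure flag is an
// assumption of this sketch.
constexpr std::size_t kFailureOffset = 40;
constexpr std::uint8_t kFailureMask = 4;
constexpr std::size_t kOptionSlab0 = 0x1D0;
constexpr std::size_t kOptionSlab1 = 0x2A0;
constexpr std::size_t kOptionSlab2 = 0x370;

// The hexadecimal and decimal spellings used in the text name the same offsets.
static_assert(kOptionSlab0 == 464 && kOptionSlab1 == 672 && kOptionSlab2 == 880,
              "hex and decimal option-slab offsets agree");

// Reports whether the failure bit is set in a raw byte copy of a pass object.
// Buffers too short to contain the flag are treated as "not failed".
inline bool pass_failed(const std::vector<std::uint8_t> &pass_bytes) {
    return pass_bytes.size() > kFailureOffset &&
           (pass_bytes[kFailureOffset] & kFailureMask) != 0;
}
```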
Three passes from the TileAS pipeline are visible in the binary with clean vtable cross-references; their sizes and the location of their option slabs are:
| Pass | Size | Vtable | Notes |
|---|---|---|---|
| D08 MaterializeConvertLayout | 752 B (0x2F0) | off_59B4688 | String options at +0x1D0 and +0x2A0; failure bit at +40 |
| D09 MaterializeSchedule | 960 B (0x3C0) | &unk_59B4768 | Three option slabs at +464, +672, +880; failure bit at +40 |
| D11 UnspecializedPipeline | — | — | numStages option lives at +464; the rest of the object is shared with D09 |
The pattern is uniform enough to reimplement as a single C++ base class with offsets parameterised by a
template argument. The vtable's first slot is the standard runOnOperation entry; the second is
getPassName; later slots carry the mlir::Pass clone, dependent-dialect, and options-printing hooks. The
secondary interface at +0x08 is the analysis-manager facet — touching it after runOnOperation returns
is undefined behavior, because the framework tears it down before invoking the next pass.
nv_tileas Operation Roster and Builders
Abstract
nv_tileas is the operational surface for async scheduling, tiled memory movement, layout conversion, TMA descriptor use, and scheduled tile compute. This page lists the operation families, explains which attributes belong to the public contract, and describes the builder helpers used by scheduling and materialization passes.
The useful reference is semantic. The binary holds plenty of generated registration thunks, but a reimplementation only needs the operation names, operand/result contracts, attributes, and builder behavior described here.
Operation Families
| Family | Operations | Purpose |
|---|---|---|
| async pipeline | async.pipeline.create_pipeline, create_iterator, inc_iter, produce_one, produce_one_async, consume_one, consume_one_async, producer_acquire, producer_write, producer_commit, consumer_wait, consumer_read, consumer_release, agent_switch, async.pipeline.yield | producer/consumer pipeline regions, stage iteration, ownership handshakes, and agent partitioning |
| async tokens | async.wait, async.future_wait, async.to_async, async.token_to_async, create_none | async completion, token bridging, and placeholder values |
| tiled memops | tiled_load, tiled_store, tiled_atomic_rmw, async.tiled_load, async.copy, async.load, async.store, copy, load, store, gather_load, scatter_store | token-ordered and async memory movement |
| tensor slices | alloc_tensor, extract_slice, insert_slice, async.extract_slice, async.insert_slice | local tile storage and shape manipulation |
| layout | convert_layout, view, expand_dims, reinterpret, shuffle, generate | layout conversion, value views, and generated tile bodies |
| TMA | make_tiled_tma_desc, async.tiled_tma_load, async.tiled_tma_store, async.gather_tma_load, async.scatter_tma_store | TMA descriptor construction and async tensor bulk copies |
| compute | dot, async.dot, reduce, scan | MMA and region-bearing reduction operations |
| control and metadata | yield, pragma, cancel_next_program_id, async.cancel_next_program_id | region termination, optimizer directives, and scheduling control |
Attribute Roster
| Attribute | Owner concepts | Meaning |
|---|---|---|
| atom | copy, dot, tiled memory, TMA, gather/scatter | selects copy, MMA, TMA, or reduce atom |
| padding_value | gather/load/store variants | value used when an access is out of bounds |
| consumer_idx | consumer wait/read paths | selects a consumer inside a consumer group |
| ocgEnterDirectives | pragma | optimizer-control directives active on entry |
| ocgLeaveDirectives | pragma | optimizer-control directives active on exit |
| operandSegmentSizes | segmented memops and descriptor ops | separates view, coordinate, offset, token, and metadata operands |
| memory semantic/scope attrs | tiled memory operations | ordering and visibility contract |
| in-bounds attrs | loads and stores | per-dimension bounds information |
Attributes belong to the operation contract. Pattern rewrites may remove stale caches, but they must preserve semantic attributes unless they replace the operation with a semantically equivalent form.
PipelineOp Enum
The nv_tileas.async.pipeline.* op family is a closed 16-entry enumeration. Each entry pairs with a single builder helper and a fixed OperationState shape, so a reimplementation can drive the entire family from one indexed dispatch instead of per-op registration code. Entries 0..14 are active; entry 15 is reserved.
| # | Mnemonic | OperationState |
|---|---|---|
| 0 | nv_tileas.async.pipeline.create_pipeline | 6 named operands: numStages (i32), bufferView, producerGroupId (u8), consumerGroupId (u8), sharedMem (bool), dynamic (bool) |
| 1 | nv_tileas.async.pipeline.produce_one | 1 region op |
| 2 | nv_tileas.async.pipeline.produce_one_async | 1 region op |
| 3 | nv_tileas.async.pipeline.consume_one | 1 region op + consumer_idx i32 attr |
| 4 | nv_tileas.async.pipeline.consume_one_async | 1 region op |
| 5 | nv_tileas.async.pipeline.consumer_read | scalar op + consumer_idx i32 attr |
| 6 | nv_tileas.async.pipeline.producer_write | scalar op |
| 7 | nv_tileas.async.pipeline.producer_acquire | scalar op |
| 8 | nv_tileas.async.pipeline.producer_commit | scalar op |
| 9 | nv_tileas.async.pipeline.consumer_wait | scalar op |
| 10 | nv_tileas.async.pipeline.consumer_release | scalar op |
| 11 | nv_tileas.async.pipeline.yield | variadic terminator |
| 12 | nv_tileas.async.pipeline.inc_iter | scalar op |
| 13 | nv_tileas.async.pipeline.create_iterator | scalar op |
| 14 | nv_tileas.async.pipeline.agent_switch | variadic body builder: num_agents_per_group i32, max_regs per-agent list, isolated bool |
| 15 | (reserved) | — |
Two builders deserve individual notes. create_pipeline is the largest builder because each of its six named operands runs through the named-operand helper before the state populates; the names ride along with the operation so they reappear in IR-printed form rather than as positional %0..%5 references. agent_switch is variadic in agent-body count: the emitted operation state carries an arbitrary number of regions, one per agent, plus the num_agents_per_group count, a DenseI32ArrayAttr of per-agent max_regs budgets, and an isolated boolean that controls whether an agent's region sees the surrounding SSA scope.
The region-op verifiers attached to the produce/consume variants and the yield are documented in Verifiers — Region-Op Verifier Template. The operation-state trailing-objects layout each builder fills in is documented in Operation Layout — TrailingObjects Decoder.
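The indexed dispatch over the PipelineOp family can be sketched as an enum plus a parallel mnemonic table. The enumerator names and helper functions below are invented for illustration; the indices and short mnemonics follow the table above, with the `nv_tileas.async.pipeline.` prefix omitted.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical C++ spelling of the closed 16-entry family.
enum class PipelineOp : unsigned {
    CreatePipeline = 0, ProduceOne, ProduceOneAsync, ConsumeOne, ConsumeOneAsync,
    ConsumerRead, ProducerWrite, ProducerAcquire, ProducerCommit, ConsumerWait,
    ConsumerRelease, Yield, IncIter, CreateIterator, AgentSwitch, Reserved
};

// Short mnemonics indexed by enum value; entry 15 is reserved and left empty.
constexpr std::array<std::string_view, 16> kPipelineMnemonics = {
    "create_pipeline", "produce_one", "produce_one_async", "consume_one",
    "consume_one_async", "consumer_read", "producer_write", "producer_acquire",
    "producer_commit", "consumer_wait", "consumer_release", "yield",
    "inc_iter", "create_iterator", "agent_switch", ""
};

// Returns the short mnemonic for an entry.
constexpr std::string_view mnemonic(PipelineOp op) {
    return kPipelineMnemonics[static_cast<std::size_t>(op)];
}

// Entries 0..14 are active; 15 is reserved.
constexpr bool is_active(PipelineOp op) { return op != PipelineOp::Reserved; }

static_assert(static_cast<unsigned>(PipelineOp::Reserved) == 15,
              "the family has exactly 16 entries");
```

A reimplementation could hang the per-entry builder and OperationState shape off the same index, so adding the reserved entry later would touch one table row rather than new registration code.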
Worked Example: Producer/Consumer Pipeline Region
A representative two-stage pipeline: the producer region loads a tile through TMA, and the consumer region waits for that tile and feeds it to a dot:
// Build the pipeline. numStages=2, one producer, one consumer.
%prod_tok, %cons_tok = nv_tileas.async.pipeline.create_pipeline %buf_view
{ numStages = 2 : i32,
producerGroupId = 0 : i8,
consumerGroupId = 1 : i8,
sharedMem = true,
dynamic = false }
: !nv_tileaa.tiled_view<2x128x128xf16>
-> !nv_tileas.async.pipeline.producer_token, !nv_tileas.async.pipeline.consumer_token
// Stage iterator
%iter = nv_tileas.async.pipeline.create_iterator %prod_tok
: !nv_tileas.async.pipeline.producer_token -> !nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
// Producer region — TMA loads, one per stage
%prod_tok2 = nv_tileas.async.pipeline.produce_one %prod_tok, %iter
{ producer_types = [tile<128x128xf16>] } : (
!nv_tileas.async.pipeline.producer_token,
!nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
) -> !nv_tileas.async.pipeline.producer_token {
^bb0(%stage_buf : tile<128x128xf16>):
%async_tok = nv_tileas.async.tiled_tma_load
%tma_desc, %stage_buf[%k_outer]
{ atom = #nv_tileas<atom tma_load_2d>,
operandSegmentSizes = array<i32: 1, 1, 1, 1> }
: !cute_nvgpu.tma_descriptor_tiled, !nv_tileaa.tiled_view<128x128xf16>,
index, !nv_tileaa.mem_token
-> !async.value<tile<128x128xf16>>
nv_tileas.async.pipeline.yield %stage_buf : tile<128x128xf16>
}
// Consumer region — wait for stage, dot, release
%cons_tok2 = nv_tileas.async.pipeline.consume_one %cons_tok, %iter
{ consumer_idx = 0 : i32,
consumer_types = [tile<128x128xf16>] } : (
!nv_tileas.async.pipeline.consumer_token,
!nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
) -> !nv_tileas.async.pipeline.consumer_token {
^bb0(%a_tile : tile<128x128xf16>):
%waited = nv_tileas.async.pipeline.consumer_wait %cons_tok, %iter
{ consumer_idx = 0 : i32 }
: !nv_tileas.async.pipeline.consumer_token,
!nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
-> !nv_tileas.async.pipeline.consumer_token
%d = nv_tileas.dot %a_tile, %b_tile, %acc
{ atom = #nv_tileas<atom mma_f16_f16_f32> }
: tile<128x128xf16>, tile<128x128xf16>, tile<128x128xf32>
-> tile<128x128xf32>
%released = nv_tileas.async.pipeline.consumer_release %waited
: !nv_tileas.async.pipeline.consumer_token
-> !nv_tileas.async.pipeline.consumer_token
nv_tileas.async.pipeline.yield %a_tile : tile<128x128xf16>
}
The pipeline state attribute on create_pipeline records the stage count, the producer/consumer agent group ids, the buffer view, and the sharedMem flag that pins per-stage storage to shared memory. The producer_types and consumer_types attributes on the region ops match the payload type lists of the producer and consumer tokens respectively, which is what the region-op verifier checks before lowering. The mbarrier slot the TMA load deposits into is the consumer's stage barrier; consumer_wait observes the same barrier and consumer_release returns the stage to the producer pool. The iterator rotates through numStages stages and is incremented once per outer loop iteration through nv_tileas.async.pipeline.inc_iter.
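The rotation of the iterator through numStages stages can be modeled with a wrapping counter. This is a sketch only: `PipelineIterModel` and its `wraps` field are hypothetical, and the real iterator is an SSA value whose lowered encoding is not described here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical iterator model: a stage index that wraps at num_stages.
struct PipelineIterModel {
    std::uint32_t num_stages;
    std::uint32_t stage = 0;  // active stage, always < num_stages
    std::uint32_t wraps = 0;  // completed rotations (illustrative only)
};

// Advances to the next stage, wrapping back to stage 0 after the last one.
inline PipelineIterModel inc_iter(PipelineIterModel it) {
    assert(it.num_stages > 0);
    if (++it.stage == it.num_stages) {
        it.stage = 0;
        ++it.wraps;
    }
    return it;
}
```

With `num_stages = 2`, as in the worked example, successive outer-loop iterations alternate between stage 0 and stage 1.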
TMA Op Operand/Result Tables
nv_tileas.make_tiled_tma_desc
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | global view | tiled_view with GMEM residency tag | yes | residency is read from the view's address-space attribute, not the SSA type; element stride must equal 1 |
| operand 1..R | box dims | index | yes (R = atom box rank) | per-axis box size |
| result 0 | descriptor | nv_tileas.tma_desc | yes | consumed by async.tiled_tma_load/_store |
| attr atom | atom | TMA load or store atom | yes | drives kind selection |
| attr swizzle_mode | enum | none\|32B\|64B\|128B | optional | shared-memory swizzle |
| attr oob_mode | enum | zero\|nan\|constant | optional | out-of-bounds behavior |
nv_tileas.async.tiled_tma_load
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | descriptor | tma_desc | yes | from make_tiled_tma_desc |
| operand 1 | shared destination | tiled_view with SMEM residency tag | yes | residency read from the view's address-space attribute; TMA-compatible swizzled layout |
| operand 2..R+1 | coords | index | yes | per-axis source coordinate |
| operand R+2 | barrier | mem_token | yes | mbarrier for completion |
| result 0 | async token | AsyncTokenType | yes | observed by async.wait |
| attr atom | atom | TMA load atom | yes | matches descriptor atom kind |
| attr padding_value | typed attr | element-typed scalar | optional | floating-point only |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {desc, dst, coords, barrier} |
nv_tileas.async.tiled_tma_store
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | descriptor | tma_desc | yes | TMA store kind |
| operand 1 | shared source | tiled_view (shared) | yes | TMA-compatible swizzled layout |
| operand 2..R+1 | coords | index | yes | per-axis destination coordinate |
| result 0 | async token | AsyncTokenType | yes | |
| attr atom | atom | TMA store atom | yes | |
| attr operandSegmentSizes | dense i32 | length 3 | yes | {desc, src, coords} |
nv_tileas.async.gather_tma_load / scatter_tma_store
The discontiguous TMA variants take a per-lane coordinate tile (gather) or
per-lane address tile (scatter) on top of the contiguous operands, and reject
modes the descriptor doesn't support. Their attribute sets mirror the
contiguous variants — gather_tma_load accepts padding_value,
scatter_tma_store rejects it.
LogicalResult verify_make_tiled_tma_desc(MakeTmaDescOp op) {
    require(op.atom().is_tma());
    require(op.box_dims().size() == op.atom().box_rank());
    require(op.global_view().element_stride() == 1);
    require_descriptor_alignment(op.global_view().base());
    require_captures_are_descriptor_abi_compatible(op);
    return success();
}
Pipeline Op Operand/Result Tables
nv_tileas.async.pipeline.create_pipeline
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | buffer view | tiled_view | yes | stage-local storage view |
| result 0 | producer token | PipelineProducerTokenType | yes | feeds producer_acquire |
| result 1 | consumer token | PipelineConsumerTokenType | yes | feeds consumer_wait |
| attr numStages | attribute | i32 | yes | stage count |
| attr producerGroupId | attribute | u8 | yes | agent group emitting producers |
| attr consumerGroupId | attribute | u8 | yes | agent group emitting consumers |
| attr sharedMem | attribute | bool | optional | stage storage lives in shared memory |
| attr dynamic | attribute | bool | optional | dynamic stage indexing |
nv_tileas.async.pipeline.produce_one / produce_one_async
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | producer token | PipelineProducerTokenType | yes | input ownership |
| operand 1 | iterator | PipelineIteratorType | yes | stage indexing |
| region 0 | body | producer | yes | terminated by async.pipeline.yield |
| result 0 | producer token | PipelineProducerTokenType | yes | returned to caller |
| result 1 | async token | AsyncTokenType | async variant only | completion of async producer work |
| attr producer_types | attribute | typed array | yes | element-type list yielded by body |
nv_tileas.async.pipeline.consume_one / consume_one_async
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | consumer token | PipelineConsumerTokenType | yes | input ownership |
| operand 1 | iterator | PipelineIteratorType | yes | |
| region 0 | body | consumer | yes | terminated by async.pipeline.yield |
| result 0 | consumer token | PipelineConsumerTokenType | yes | |
| result 1 | async token | AsyncTokenType | async variant only | |
| attr consumer_idx | attribute | i32 | yes | selects a consumer in consumer group |
| attr consumer_types | attribute | typed array | yes | element-type list yielded by body |
nv_tileas.async.pipeline.producer_acquire / producer_commit / consumer_wait / consumer_release
| Op | Operand 0 | Result 0 | Notes |
|---|---|---|---|
| producer_acquire | producer token + iterator | producer token | grants stage ownership |
| producer_commit | producer token | producer token | publishes stage |
| consumer_wait | consumer token + iterator | consumer token | observes commit |
| consumer_release | consumer token | consumer token | returns stage to pool |
consumer_wait and consumer_read additionally carry the consumer_idx i32 attribute that maps the wait to a specific consumer inside the consumer group.
nv_tileas.async.pipeline.yield
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0.. | yielded values | variadic | yes | operand types match enclosing region's result types |
nv_tileas.async.pipeline.create_iterator / inc_iter
| Op | Operand 0 | Result 0 | Notes |
|---|---|---|---|
| create_iterator | pipeline value | PipelineIteratorType | rotates through numStages stages |
| inc_iter | iterator | iterator | advances to next stage |
LogicalResult verify_pipeline_handshake(Operation op) {
    require_token_kind(op, op.operand(0));
    require_iterator_type_payload_matches(op.region(0), op.producer_types_attr());
    require_region_terminator_is(op.region(0), "nv_tileas.async.pipeline.yield");
    require_yield_operand_types_match_results(op.region(0), op.result_types());
    return success();
}
Pipeline Builders
Pipeline builders create region operations and token handshakes. A good implementation exposes small helper functions instead of forcing every pass to build raw operation states.
ProduceOneOp build_produce_one(Rewriter *rw,
                               Location loc,
                               ProducerToken token,
                               PipelineIterator iter,
                               TypeRange result_types,
                               RegionBuilder body) {
    ProduceOneOp op = rw->create<ProduceOneOp>(loc, result_types, token, iter);
    body(op.body(), op.region_arguments());
    ensure_pipeline_yield(op.body());
    return op;
}
ConsumeOneOp build_consume_one(Rewriter *rw,
                               Location loc,
                               ConsumerToken token,
                               PipelineIterator iter,
                               uint32_t consumer_idx,
                               TypeRange result_types,
                               RegionBuilder body) {
    ConsumeOneOp op = rw->create<ConsumeOneOp>(loc, result_types, token, iter);
    op.set_consumer_idx(consumer_idx);
    body(op.body(), op.region_arguments());
    ensure_pipeline_yield(op.body());
    return op;
}
agent_switch is variadic in agent body count and carries per-agent register-budget data. The builder keeps body regions, group counts, and max-register lists together so execution-unit propagation can reason about them.
Tiled Memop Operand/Result Tables
The tiled memory family shares one segmented operand layout. operandSegmentSizes separates view, coordinate, offset, token, and optional padding/mask operands so the verifier walks each slice without re-parsing the op.
Throughout the tables below, the SSA operand type is tiled_view<…> (a TileAS dialect type, not the MLIR built-in memref). Residency — RMEM, SMEM, TMEM, or GMEM — is an attribute on the tiled_view type, not encoded in the SSA type name. Verifier rules that say "shared" or "global" inspect that address-space tag, not the SSA type; two operands that both type-print as tiled_view<128x128xf16> can disagree on residency and be rejected by the memory-space-pair check.
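Walking a segmented operand list amounts to turning the segment sizes into half-open index ranges. The helper below is a hypothetical sketch, not the MLIR API: `slice_segments` is an invented name, operands are represented only by their count, and a size list that does not cover the operand list exactly is treated as a verifier failure.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

using SegmentRange = std::pair<std::size_t, std::size_t>;  // [begin, end)

// Splits a flat operand list into one [begin, end) range per segment.
// Returns nullopt when a size is negative or the sizes do not sum to the
// operand count.
inline std::optional<std::vector<SegmentRange>>
slice_segments(std::size_t operand_count, const std::vector<int> &segment_sizes) {
    std::vector<SegmentRange> ranges;
    std::size_t cursor = 0;
    for (int size : segment_sizes) {
        if (size < 0) return std::nullopt;
        std::size_t end = cursor + static_cast<std::size_t>(size);
        ranges.emplace_back(cursor, end);
        cursor = end;
    }
    if (cursor != operand_count) return std::nullopt;
    return ranges;
}
```

For a rank-2 load with one view, two coordinates, no offsets, and one token, the sizes `{1, 2, 0, 1}` yield the ranges `[0,1)`, `[1,3)`, `[3,3)`, and `[3,4)`; the empty offsets segment still occupies a slot, so later segments keep a fixed position in the size array.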
nv_tileas.tiled_load
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | nv_tileaa.tiled_view or nv_tileas.tiled_view | yes | source tile view |
| operand 1..R | coords | index | yes (R = view rank) | per-axis coordinate |
| operand R+1.. | offsets | index | optional | per-axis offset; segment may be empty |
| token slot | token | mem_token or async_token | optional | one or zero |
| result 0 | tile | tile<S × element> | yes | shape S = atom box shape |
| result 1 | token | mem_token or async_token | optional | present when token slot was supplied |
| attr atom | atom | AtomAttr | yes | selects copy/TMA atom |
| attr mem_semantic | enum | weak\|relaxed\|acquire | optional | acquire_release rejected |
| attr mem_scope | enum | tl_blk\|cluster\|gpu\|sys | required when semantic > weak | rejected when semantic = weak |
| attr in_bounds | dense bool | per-axis | optional | defaults to false |
| attr padding_value | typed attr | element-typed scalar | optional | only with in_bounds=false |
| attr operandSegmentSizes | dense i32 | length 4 or 5 | yes | {view, coords, offsets, token[, mask]} |
LogicalResult verify_tiled_load(TiledLoadOp op) {
    require_operand_segments(op, {1, op.view().rank(), -1, /*token*/ -1});
    require_optional_token(op);
    require_coordinate_types_match_index(op);
    require_tile_shape_matches_atom_box(op.atom(), op.result(0));
    require_tile_dimensions_power_of_two(op.result(0).shape());
    if (op.mem_semantic() == ACQUIRE_RELEASE) {
        return op.emit_error("tiled_load rejects acquire_release semantic");
    }
    require_scope_iff_non_weak(op.mem_semantic(), op.mem_scope());
    require_padding_only_when_not_in_bounds(op);
    return success();
}
nv_tileas.tiled_store
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | tiled_view | yes | destination tile view |
| operand 1 | value | tile<S × element> | yes | element type matches view element type |
| operand 2..R+1 | coords | index | yes | per-axis coordinate |
| operand R+2.. | offsets | index | optional | per-axis offset |
| token slot | token | mem_token or async_token | optional | |
| result 0 | token | mem_token or async_token | optional | mirrors input token slot |
| attr atom | atom | AtomAttr | yes | TMA store, register-to-global, etc. |
| attr mem_semantic | enum | weak, relaxed, release | optional | acquire and acquire_release rejected |
| attr mem_scope | enum | as above | required when semantic > weak | |
| attr in_bounds | dense bool | per-axis | optional | |
| attr padding_value | typed attr | element-typed scalar | optional | only with in_bounds=false |
| attr operandSegmentSizes | dense i32 | length 4 or 5 | yes | |
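The memory-ordering columns of the load and store tables can be summarized in a short sketch. It assumes the enum sets shown in the tables and applies the load table's "scope present exactly when the semantic is stronger than weak" rule to both ops, which is an assumption for stores. The type and function names are hypothetical.

```cpp
#include <cassert>

enum class Semantic { Weak, Relaxed, Acquire, Release, AcquireRelease };
enum class MemOp { Load, Store };

// A semantic is accepted only if it appears in the op's enum set:
// loads take weak/relaxed/acquire, stores take weak/relaxed/release.
bool semantic_allowed(MemOp op, Semantic s) {
    switch (s) {
    case Semantic::Weak:
    case Semantic::Relaxed:
        return true;
    case Semantic::Acquire:
        return op == MemOp::Load;
    case Semantic::Release:
        return op == MemOp::Store;
    case Semantic::AcquireRelease:
        return false;
    }
    return false;
}

// mem_scope must be present exactly when the semantic is stronger than weak.
bool scope_consistent(Semantic s, bool has_scope) {
    return (s != Semantic::Weak) == has_scope;
}
```

Keeping the two predicates separate mirrors the tables: one column restricts the semantic, the other ties mem_scope to it.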
nv_tileas.tiled_atomic_rmw
| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | tiled_view | yes | atomic destination |
| operand 1 | value | tile<S × element> | yes | RMW operand |
| operand 2..R+1 | coords | index | yes | per-axis coordinate |
| token slot | token | mem_token | optional | |
| result 0 | tile | tile<S × element> | yes | old value tile |
| result 1 | token | mem_token | optional | |
| attr atom | atom | AtomAttr | yes | |
| attr rmw_mode | enum | add, and, or, xor, xchg, min, max, umin, umax, cmpxchg, addf | yes | |
| attr mem_semantic | enum | full set | optional | matches CAS semantics |
| attr mem_scope | enum | as above | required when semantic > weak | |
| attr operandSegmentSizes | dense i32 | length 4 | yes | |
The atomic verifier also rejects 8-bit element types across all modes and rejects 16-bit integer atomics; 16-bit floating atomics restrict the mode set to add, max, min. The shared invariants for memory semantics, scope, and tile-shape validation appear in Verifiers.
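The element-type rules in the paragraph above can be sketched as a single predicate. This is an illustration under stated assumptions: ElemType and check_atomic are hypothetical, the mode is passed as a plain string, and the returned reasons are paraphrases rather than the verbatim diagnostics.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of an atomic element type.
struct ElemType {
    unsigned bits;
    bool is_float;
};

// Returns an empty string when the combination is accepted, otherwise a
// short paraphrased reason (not the verbatim diagnostic text).
std::string check_atomic(ElemType t, const std::string& mode) {
    if (t.bits == 8) return "8-bit types not supported";
    if (t.bits == 16 && !t.is_float) return "16-bit integer not supported";
    if (t.bits == 16 && t.is_float &&
        mode != "add" && mode != "max" && mode != "min") {
        return "16-bit float only supports add, max, min";
    }
    return "";
}
```

The checks run from the broadest rejection (all 8-bit types) to the narrowest (mode restrictions on 16-bit floats), so each rejected input reports one reason.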
Tiled Load and Store Builders
The most common composite builders emit a view followed by a tiled memory operation. They normalize rank and coordinate widths, attach operand segment sizes, and carry memory-ordering attributes through.
TiledLoadOp build_view_then_tiled_load(Rewriter *rw,
Location loc,
Value source,
TileViewSpec view,
TiledLoadAttrs attrs) {
Value tile_view = rw->create<ViewOp>(loc, view.type, source, view.indices);
return rw->create<TiledLoadOp>(
loc,
attrs.result_types,
tile_view,
attrs.coords,
attrs.offsets,
attrs.token,
attrs.semantic_attrs());
}
TiledStoreOp build_view_then_tiled_store(Rewriter *rw,
Location loc,
Value value,
Value destination,
TileViewSpec view,
TiledStoreAttrs attrs) {
Value tile_view = rw->create<ViewOp>(loc, view.type, destination, view.indices);
return rw->create<TiledStoreOp>(
loc,
tile_view,
value,
attrs.coords,
attrs.offsets,
attrs.token,
attrs.semantic_attrs());
}
Scheduling preparation and materialization passes lean on these builders because they repeatedly need the same view-plus-memory-operation shape.
Dot and Mask Builders
Dot builders cover several recurring patterns:
- allocate a zero accumulator and emit a dot;
- wrap dot emission in scf.for and scf.if when a predicate or stage guard is needed;
- synthesize a predicate mask, convert layout, and emit dot;
- install dot simplification patterns for select-constant cases.
Value build_zero_accumulator_dot(Rewriter *rw,
Location loc,
DotInputs inputs,
Type acc_type,
AtomAttr atom) {
Value acc = rw->create<AllocTensorOp>(loc, acc_type);
Value zero = rw->create<arith::ConstantOp>(loc, zero_attr(acc_type));
initialize_accumulator(rw, acc, zero);
return rw->create<DotOp>(loc, inputs.a, inputs.b, acc, atom).result();
}
Dot builders preserve the atom and signedness attributes — later NVGPU/NVVM lowering uses them to pick the actual instruction.
Arithmetic Helper Builders
The builder library also ships thin wrappers for common arith operations: constants, add, multiply, subtract, signed division, signed max, and select. These helpers let composite TileAS builders materialize index math without depending on caller-specific boilerplate.
Value build_index_expr(Rewriter *rw, Value base, Value lane, Value stride) {
Value scaled = rw->create<arith::MulIOp>(lane.get_loc(), lane, stride);
return rw->create<arith::AddIOp>(base.get_loc(), base, scaled);
}
Wrappers must not add overflow or fast-math attributes unless the caller explicitly asks for them. Defaults belong to the arith dialect operation itself.
Schedule Infrastructure Builders
After schedule generation, three helper algorithms convert analysis into concrete IR:
| Helper | Purpose |
|---|---|
| materialize schedule | partitions resident and pending loads/stores/async roots from schedule analysis |
| build stages | turns union constraints into stage-ordered producer/consumer pairs |
| expand single tiled op | clones a tiled operation for each scheduled stage and rewires operands |
ScheduleMaterialization materialize_schedule(ScheduleAnalysis analysis, MaterializeOptions options) {
ScheduleMaterialization out = {};
out.resident_loads = compute_resident_loads(analysis, options);
out.resident_stores = compute_resident_stores(analysis, options);
out.pending_loads = expand_iteration_arguments(analysis, Side::Read);
out.pending_stores = expand_iteration_arguments(analysis, Side::Write);
out.resident_async = filter_async_eligible(out.resident_loads, options);
out.pending_async = filter_async_eligible(out.pending_loads, options);
return out;
}
Stage expansion needs two maps: one from original operands to their source operation, and one from each source operation to the per-stage replica. Those two maps are what let a single scheduled tiled operation become several stage-specific SSA operations without mixing operands from different stages.
void expand_single_tiled_op(TiledOp op, StageMap stages, Rewriter *rw) {
OperandSourceMap sources = collect_operand_sources(op);
ReplicaMap replicas = clone_op_per_stage(op, stages, rw);
for (Operation *replica : replicas.values()) {
for (OpOperand &operand : replica->get_op_operands()) {
if (Value repl = lookup_stage_replacement(operand, sources, replicas)) {
operand.set(repl);
}
}
}
}
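The two-map rewiring can be modeled in a few lines of standalone C++. This sketch assumes values and ops are identified by plain strings and that every source op has one replica value per stage. Op, SourceMap, ReplicaMap, and expand_per_stage are hypothetical names, not part of the TileAS builder library.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Op {
    std::string name;
    std::vector<std::string> operands;
};

// operand value -> op that defines it
using SourceMap = std::map<std::string, std::string>;
// defining op -> replica value for each stage
using ReplicaMap = std::map<std::string, std::vector<std::string>>;

// Clone `op` once per stage and rewire each operand to the replica that
// belongs to the same stage. Operands without a known source stay unchanged.
std::vector<Op> expand_per_stage(const Op& op, std::size_t stages,
                                 const SourceMap& sources,
                                 const ReplicaMap& replicas) {
    std::vector<Op> out;
    for (std::size_t s = 0; s < stages; ++s) {
        Op clone{op.name + "@" + std::to_string(s), op.operands};
        for (std::string& operand : clone.operands) {
            auto src = sources.find(operand);
            if (src == sources.end()) continue;
            auto rep = replicas.find(src->second);
            if (rep != replicas.end() && s < rep->second.size()) {
                operand = rep->second[s];
            }
        }
        out.push_back(clone);
    }
    return out;
}
```

Because the replica lookup is indexed by the clone's own stage, a stage-1 clone can never pick up a stage-0 operand.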
Cross-References
Verifiers describes the verbatim diagnostics the operations defined here must satisfy. Types describes the pipeline-token, iterator, and agent types that ride on these ops. Folds and Memory Consistency describes the rewrite shapes applied to the slice and structured-control scaffolding. The TileAA-side counterpart in nv_tileaa Operation Roster feeds these scheduling operations through the alias-aware lowering boundary.
nv_tileas Types
Abstract
The nv_tileas type system carries the state needed to make asynchronous tile pipelines explicit: producer and consumer tokens, generic async completion tokens, pipeline iterators, agent metadata, and layout-bearing value conventions. These types let passes reason about stage ownership, agent boundaries, region yields, and memory ordering before the IR is flattened into LLVM and NVVM operations.
Most TileAS types are control and scheduling types, not runtime heap objects — SSA-level contracts that verifiers, schedulers, and lowerings consume.
Pipeline Types
| Type | Role |
|---|---|
| PipelineProducerTokenType | producer-side ownership token; acquired before writing a stage and consumed by commit |
| PipelineConsumerTokenType | consumer-side ownership token; produced by wait and consumed by release |
| AsyncTokenType | generic completion token for async copy, async dot, and other asynchronous work |
| PipelineIteratorType | rotating stage iterator that carries the element type and stage position through control flow |
Producer and consumer tokens carry no payload data — they represent ordering and ownership. Payload values move through region arguments and yields.
typedef struct PipelineState {
uint32_t stage_count;
uint32_t producer_group;
uint32_t consumer_group;
Value storage;
} PipelineState;
typedef struct PipelineIterator {
Type element_type;
uint32_t stage;
uint32_t phase;
IteratorKind kind;
} PipelineIterator;
PipelineIteratorType is the only pipeline type with meaningful structural payload. Producer-side and consumer-side iterators stay distinct because they participate in different handshakes, but both unwrap to the element type yielded through the pipeline region.
Type Storage and Uniquing
Pipeline types are routed through the context StorageUniquer documented in Storage Uniquer and Context Impl — getOrCreate Gateway. Producer/consumer tokens and the generic async token are parameterless and resolve to a single canonical storage per context; the iterator type carries a wrapped element type and is keyed on that pointer.
| Type | Uniquer key |
|---|---|
| PipelineProducerTokenType | parameterless (single canonical storage per context) |
| PipelineConsumerTokenType | parameterless |
| AsyncTokenType | parameterless |
| PipelineIteratorType | (element_type) pointer |
Producer-side and consumer-side token classes share storage shape but carry distinct TypeIDs, so pointer-identity dispatch in the verifier and lowering tells them apart without parsing names. The iterator TypeID is consulted by the region-op verifier template (see Verifiers) before producer-type comparison; the unwrap always runs on the block-argument side, never on the producer-type list.
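A pointer-keyed uniquer of the kind described here can be sketched as follows. It is a simplified model, not the MLIR StorageUniquer: ElementType, IteratorStorage, and IteratorUniquer are hypothetical stand-ins, and the key is just the element type's address.

```cpp
#include <cassert>
#include <map>
#include <memory>

struct ElementType {};  // stand-in for a wrapped, already-uniqued element type

struct IteratorStorage {
    const ElementType* element;
};

class IteratorUniquer {
public:
    // Returns the canonical storage for `element`, creating it on first use.
    const IteratorStorage* get_or_create(const ElementType* element) {
        std::unique_ptr<IteratorStorage>& slot = storages_[element];
        if (!slot) {
            slot = std::make_unique<IteratorStorage>(IteratorStorage{element});
        }
        return slot.get();
    }

private:
    std::map<const ElementType*, std::unique_ptr<IteratorStorage>> storages_;
};
```

Two requests with the same element pointer return the same storage address, which is what makes pointer-identity comparison of iterator types sound in this model.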
Iterator Propagation
Pipeline iterators must survive structured control flow. Loops carry them as iter-args; branches must yield the same iterator type from both arms.
The iterator type encodes four logical fields:
| Field | Meaning |
|---|---|
| element_type | The payload type the iterator carries (typically the tile type yielded by the producer region). |
| count | The number of distinct stages the iterator rotates through, fixed by numStages on the enclosing pipeline. |
| stride | The advance step taken by inc_iter (always one in current TileAS). |
| address_space | The memory space the iterator's payload references (shared, tensor, or register). |
Propagation through structured control flow obeys explicit rules:
- Async producer/consumer ops preserve count and stride but may transform address_space. A producer region that materializes its payload into shared memory exposes a shared-space iterator to the consumer region; a consumer region that copies the payload into registers exposes a register-space iterator to whatever consumes the consumer's yield.
- Reduction and scan ops divide count by the reduction factor when the reduction collapses an entire stage dimension. The verifier rejects a reduction whose factor does not evenly divide count.
- Structured branches must yield iterators that agree on all four fields. scf.if with a PipelineIteratorType result requires both arms' yields to match.
- Loops carry the iterator unchanged as an iter-arg. The loop-coalescing pattern in Folds and Memory Consistency — Coalesce Perfectly Nested Loops rejects coalescing a loop that carries an iterator iter-arg because the merged loop's iteration count would no longer match the iterator's count.
LogicalResult verify_iterator_merge(Value lhs, Value rhs) {
if (!isa<PipelineIteratorType>(lhs.get_type())) {
return failure();
}
if (lhs.get_type() != rhs.get_type()) {
return failure();
}
return success();
}
PipelineIteratorType propagate_through_async(PipelineIteratorType in,
AddressSpace producer_space) {
return PipelineIteratorType(in.element_type, in.count, in.stride, producer_space);
}
PipelineIteratorType propagate_through_reduction(PipelineIteratorType in,
uint32_t reduction_factor) {
require(in.count % reduction_factor == 0);
return PipelineIteratorType(in.element_type,
in.count / reduction_factor,
in.stride,
in.address_space);
}
Treat iterator propagation as part of queue-to-pipeline lowering. Delaying it until final lowering means the scheduler cannot reliably assign stage meaning to merged SSA values, and the verifier loses the ability to reject a malformed reduction-over-stages pattern at the right phase.
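The propagation rules can be exercised with a small standalone model. This sketch drops element_type for brevity and represents failure with std::optional instead of a verifier diagnostic; IterInfo, through_async, and through_reduction are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class Space { Shared, Tensor, Register };

// Model of the iterator's logical fields, minus element_type.
struct IterInfo {
    std::uint32_t count;
    std::uint32_t stride;
    Space space;
    bool operator==(const IterInfo& o) const {
        return count == o.count && stride == o.stride && space == o.space;
    }
};

// Async ops keep count and stride and may change the address space.
IterInfo through_async(IterInfo in, Space producer_space) {
    in.space = producer_space;
    return in;
}

// Reductions divide count by the factor; a factor that does not evenly
// divide count (or a zero factor) is rejected.
std::optional<IterInfo> through_reduction(IterInfo in, std::uint32_t factor) {
    if (factor == 0 || in.count % factor != 0) return std::nullopt;
    in.count /= factor;
    return in;
}
```

For example, a four-stage shared-space iterator reduced by a factor of two becomes a two-stage iterator, while a factor of three is rejected.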
Agent Types
Agent metadata describes warp-specialized execution regions. It rides on agent_switch and related execute operations rather than appearing as ordinary SSA values.
| Agent field | Meaning |
|---|---|
| agent body regions | One region per logical agent; each region runs on a disjoint subset of the warp budget. |
| num_agents_per_group | Number of agents in the group; controls how the launch's warp budget partitions. |
| max_regs | Per-agent register budget hint; quantizes to a warp-count-like unit. |
| isolated | Whether an agent's region sees the surrounding SSA scope or runs in an isolated value-space. |
| warp count | Derived from register budget or inherited from enclosing launch metadata. |
The register budget quantizes to a warp-count-like unit. A sentinel value means "inherit the enclosing budget"; the scheduler and execution-unit propagation passes resolve that placeholder against the actual kernel configuration later.
uint32_t quantize_agent_warps(uint32_t max_regs) {
if (max_regs == INHERIT_REGISTER_BUDGET) {
return INHERIT_REGISTER_BUDGET;
}
return 8 * ceil_div(max_regs, 8);
}
The agent verifier (see Verifiers) checks two structural facts: all agent regions in one group agree on their warp count (or inherit it), and the sum of resolved warp counts does not exceed the enclosing launch budget. Once resolved, the warp count drives both the launch geometry recorded in the GPU module attributes and the per-agent register-allocation decisions taken by NVGPU lowering.
Layout-Carrying Values
nv_tileas does not lean on one monolithic layout type. Layout rides on the value type plus attributes such as atom, layout descriptors, memory-space information, and operand segment sizes.
| Layout carrier | Purpose |
|---|---|
| value type | element type, rank, shape, and memory-space view |
| atom attribute | selects the copy, MMA, TMA, or reduce atom used by the operation |
| layout descriptor | describes register/shared/tensor-memory arrangement |
| operand segments | separate view operands, coordinate operands, offsets, and tokens |
One operation describes both a logical tile and the hardware atom that will eventually move or compute it.
Producer Interface
Producer-like operations expose their producer region through a private interface. The behavior is simple:
- produce_one and produce_one_async expose the region that generates producer values.
- producer_write exposes the region that writes into pipeline storage.
- A producer marker lets later passes identify producer boundaries without rediscovering the operation shape.
Region *get_producer_region(Operation *op) {
if (isa<ProduceOneOp>(op) || isa<ProduceOneAsyncOp>(op)) {
return &op->region(0);
}
if (isa<ProducerWriteOp>(op)) {
return &op->region(0);
}
return NULL;
}
Agent-Like Interface
Agent-like operations expose body regions and warp-count information. agent_switch is the primary TileAS user; the upstream execute operation shares the same conceptual interface before queue-to-pipeline lowering.
SmallVector<uint32_t> get_agent_warp_counts(AgentLikeOp op) {
SmallVector<uint32_t> counts;
for (AgentBody body : op.agent_bodies()) {
counts.push_back(resolve_or_inherit_warp_count(body));
}
return counts;
}
Verification must ensure every path crossing an agent boundary agrees on the agent budget lowering will use.
Yield Terminator Interface
Both ordinary TileAS yield and async pipeline yield act as region-branch terminators. Their successor regions and operands delegate to the enclosing region operation.
The rule stays local: a pipeline region decides what its yield values mean; the terminator just supplies the yielded operands.
SuccessorInfo get_successors(YieldOp yield) {
Operation *parent = yield.parent_region_op();
return parent->region_branch_successors(yield.operands());
}
Cross-References
Operation Roster and Builders shows the operations that consume and produce each type. Verifiers — Region-Op Verifier Template details the region-op verifier template that validates iterator unwrap and producer-type agreement. Folds and Memory Consistency describes the rewrites that respect the iterator-propagation rules above.
nv_tileas Verifiers
Abstract
The nv_tileas verifier layer shields the scheduler from impossible pipeline, memory, layout, TMA, and MMA shapes. Two broad families fall under it: async pipeline operations with region and token contracts, and target-facing operations — tiled memory ops, TMA descriptors, layout conversions, copies, dots, and block-scaled MMA.
These verifiers belong to the public reimplementation contract. Scheduling assumes they have already run. A malformed TileAS operation may still look like valid MLIR, but it can describe a pipeline or memory operation the target cannot execute.
Async Pipeline Verification
Async pipeline verification is mostly structural. Region-bearing operations need matching block argument types, result types, and terminators. Token-only operations fall under ordinary operand/result arity and type rules.
| Operation | Required invariant |
|---|---|
| create_pipeline | results form the producer/consumer token pair for the pipeline |
| produce_one | producer region arguments match producer-token element types |
| produce_one_async | same as produce_one, plus async result token shape |
| consume_one | consumer region arguments match consumer-token element types |
| consume_one_async | same as consume_one, plus async result token shape |
| producer_write | producer body region arguments match the write payload |
| producer_acquire | operand is a producer token |
| producer_commit | operand is a producer token produced by the write/acquire path |
| consumer_wait | operand is a consumer token and consumer_idx is valid |
| consumer_release | operand is a consumer token produced by read/wait |
| async.pipeline.yield | operands match the enclosing pipeline region result contract |
LogicalResult verify_pipeline_region(PipelineRegionOp op) {
Region &region = op.body();
if (!region.front().has_terminator()
|| region.front().terminator().name != "nv_tileas.async.pipeline.yield") {
return op.emit_error("expects regions to end with 'nv_tileas.async.pipeline.yield'");
}
if (!block_args_match_token_elements(region, op.input_token_type())) {
return op.emit_error("expects region arguement types to match with producer types");
}
if (!yield_operands_match_results(region, op.result_types())) {
return op.emit_error("expects region result types to be match with operation result types");
}
return success();
}
The verifier also checks iterator agreement across structured control flow. When two branch arms yield a pipeline iterator, both yielded values must have the same iterator type. The diagnostic emitted on mismatch is "branch arms must yield matching pipeline iterator types".
The diagnostics this routine emits:
| Diagnostic | Cause |
|---|---|
| "expects regions to end with '" (binary stores the closing op name separately) | A pipeline region's terminator is the wrong op kind; the trailing op-name fragment is appended at print time. |
| "expects region arguement types to match with producer types [" (typo preserved, trailing bracket opens the printed type list) | Region block-argument types disagree with the producer-type list. |
| "expects region result types to be match with operation result types [" (phrasing preserved, trailing bracket opens the printed type list) | Yield operand types disagree with the parent's result types. |
| "expects region yield types to match with result types [" | A region's yield operand types disagree with the parent op's result types. |
| "expected 'consumer_idx' less than the number of consumer " | A consumer_wait or consumer_read carries an index outside the consumer group's bounds. |
| "expected 'consumer_idx' in token to be the same as 'consumer_idx' attribute of this operation " | The token operand and the op's consumer_idx attribute disagree. |
Region-Op Verifier Template
Five region-bearing pipeline ops share one verifier template: produce_one, produce_one_async, consume_one, consume_one_async, and async.pipeline.yield. Each op installs the template against its own OperationName and producer-type accessor, so the per-op bodies remain distinct in the binary even though their algorithm is identical.
The shared algorithm has four steps:
- Fetch producer types. Each pipeline op carries a producer_types: ArrayAttr<Type> attribute encoding the type-list the producer agreed to emit. The verifier reads it from the op's attribute dictionary.
- Iterator-arg unwrap. Block arguments of type PipelineIteratorType wrap a payload type. The verifier unwraps each block argument before comparing it against the producer-type entry; the producer-type list is already in payload form, so a double unwrap would compare payload against payload-of-payload and accept type-incoherent regions.
- Arity and type match. The verifier walks the region's block-argument list and the producer-type list in parallel. On length or per-position mismatch, it emits "expects region arguement types to match with producer types [" (verbatim, including the typo "arguement"), followed by the producer-type list, "], but got: [", the actual types, and a closing "]".
- Terminator-yield match. The region's terminator, nv_tileas.async.pipeline.yield, carries its own operand types. These must equal the parent op's result-type list. On mismatch, the verifier emits "expects region result types to be match with operation result types [" (verbatim, with the additional grammatical oddity).
LogicalResult verify_pipeline_region_op(Operation *op) {
ArrayRef<Type> producers = op->getAttr("producer_types").cast<ArrayAttr>().getValues();
Region &body = op->getRegion(0);
BlockArgListType args = body.front().getArguments();
if (args.size() != producers.size()) {
return emit(op, "expects region arguement types to match with producer types [",
producers, "], but got: [", args.getTypes(), "]");
}
for (size_t i = 0; i < args.size(); ++i) {
Type bodyArg = unwrap_pipeline_iterator(args[i].getType());
if (bodyArg != producers[i]) {
return emit(op, "expects region arguement types to match with producer types [",
producers, "], but got: [", args.getTypes(), "]");
}
}
Operation *term = body.front().getTerminator();
ArrayRef<Type> termTypes = term->getOperandTypes();
ArrayRef<Type> resultTypes = op->getResultTypes();
if (termTypes != resultTypes) {
return emit(op, "expects region result types to be match with operation result types [",
resultTypes, "], but got: [", termTypes, "]");
}
return success();
}
Two diagnostic invariants are worth preserving. The typo "arguement" and the phrasing "result types to be match" are stable across all five verifiers — error-scraping infrastructure downstream has been matching them exactly, and silently fixing them breaks log capture. The iterator-unwrap step always runs on the block-arg side, never on the producer-type side.
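A reimplementation that wants byte-identical output can assemble the argument-mismatch message from the fragments listed earlier. The sketch below assumes types are already rendered as strings and joined with ", "; that separator and the helper names join_types and region_arg_mismatch are assumptions, while the fixed fragments come from the diagnostics described in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Join printed type names with ", " (the separator is an assumption).
std::string join_types(const std::vector<std::string>& types) {
    std::string out;
    for (std::size_t i = 0; i < types.size(); ++i) {
        if (i != 0) out += ", ";
        out += types[i];
    }
    return out;
}

// Build the block-argument mismatch message. The misspelling "arguement"
// is intentional and must not be corrected.
std::string region_arg_mismatch(const std::vector<std::string>& producers,
                                const std::vector<std::string>& actual) {
    return std::string("expects region arguement types to match with producer types [") +
           join_types(producers) + "], but got: [" + join_types(actual) + "]";
}
```

Keeping the fixed fragments as literals in one place makes it harder for a well-meaning cleanup to fix the spelling by accident.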
⚡ QUIRK — two preserved English errors are part of the public diagnostic contract. The verifier emits "region arguement types to match" (noun typo) and "region result types to be match with" (verb-form mistake) verbatim across all five pipeline ops. These look like obvious bugs but are wire-format-stable strings: log scrapers, frontends, and golden tests downstream key on the exact text. Silently correcting either string is a contract break with the same blast radius as renaming an op, so reimplementations must keep both errors byte-identical, and any fix has to roll out at the consumer side first.
Agent Switch Verification
agent_switch has two region groups — one leaving an agent context, one entering another. The verifier checks that the regions agree on warp count and that the sum of requested warps doesn't exceed the enclosing launch budget.
LogicalResult verify_agent_switch(AgentSwitchOp op, GpuModuleInfo module) {
SmallVector<uint32_t> counts = op.agent_warp_counts();
if (!all_equal_or_inherited(counts)) {
return op.emit_error("agent regions disagree on warp count");
}
if (resolved_warp_count(counts) > module.available_warps()) {
return op.emit_error("agent warp count exceeds module budget");
}
return success();
}
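The two agent checks can be modeled without any MLIR types. This sketch assumes a hypothetical sentinel kInherit for "inherit the enclosing budget" and sums the resolved counts, as the Agent Types section describes; the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sentinel meaning "inherit the enclosing warp count".
constexpr std::uint32_t kInherit = std::numeric_limits<std::uint32_t>::max();

// True when every explicit count is the same; inherited entries are ignored.
bool all_equal_or_inherited(const std::vector<std::uint32_t>& counts) {
    std::uint32_t seen = kInherit;
    for (std::uint32_t c : counts) {
        if (c == kInherit) continue;
        if (seen != kInherit && c != seen) return false;
        seen = c;
    }
    return true;
}

// Sum of warp counts after replacing inherited entries with `inherited`.
std::uint64_t total_warps(const std::vector<std::uint32_t>& counts,
                          std::uint32_t inherited) {
    std::uint64_t total = 0;
    for (std::uint32_t c : counts) total += (c == kInherit) ? inherited : c;
    return total;
}

bool within_budget(const std::vector<std::uint32_t>& counts,
                   std::uint32_t inherited, std::uint64_t budget) {
    return all_equal_or_inherited(counts) &&
           total_warps(counts, inherited) <= budget;
}
```

Accumulating in 64 bits keeps the sum from wrapping when several agents carry large explicit counts.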
TMA Verification
TMA operations get checked against atom kind, descriptor shape, box dimensions, memory layout, and padding behavior.
| Operation | Required invariant |
|---|---|
| async.tiled_tma_load | atom is a TMA load atom; box dimensions match; element stride is one |
| async.tiled_tma_store | atom is a TMA store atom; box dimensions and layout are store-compatible |
| async.tiled_atomic_rmw TMA mode | atom is a TMA reduce atom; unsupported scatter modes are rejected |
| make_tiled_tma_desc | descriptor pointer is aligned; captured values are representable; structured-control dependencies are rejected |
LogicalResult verify_tma_load(TmaLoadOp op) {
if (!op.atom().is_tma_load()) {
return op.emit_error("expected a TMA load atom");
}
if (op.box_dims().size() != op.atom().box_dims().size()) {
return op.emit_error("TMA box dimensions do not match atom box dimensions");
}
if (op.element_stride() != 1) {
return op.emit_error("TMA descriptors require unit element stride");
}
if (!op.shared_layout().is_tma_compatible()) {
return op.emit_error("shared-memory layout is not TMA-compatible");
}
return success();
}
The TMA diagnostic surface (verbatim strings carried by the binary):
| Diagnostic | Cause |
|---|---|
| "tma box-dim and copy atom box-dim mismatch" | The op's box-dim count differs from the atom's. |
| "tma leading box-dim bit-width is not 16 bytes aligned" | The leading box-dim of a TMA descriptor is not a multiple of 16 bytes. |
| "TmaLoad only support zero padding now" | A non-zero padding value was supplied to a TMA load. |
| "expected MakeTiledTMADescOp not depends on scf" | A descriptor builder captures an SSA value defined inside a structured-control region. |
| "expect lower MakeTiledTMADescOp" | A make_tiled_tma_desc op survived past the lowering point that should have erased it. |
Descriptor capture is deliberately conservative. A descriptor moved to the host or passed through the descriptor ABI must not depend on values the ABI cannot represent — that constraint is enforced by the "expected MakeTiledTMADescOp not depends on scf" check above.
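One plausible reading of the leading box-dim alignment rule is that the leading dimension's extent in elements, multiplied by the element bit width, must be a multiple of 128 bits (16 bytes). The sketch below encodes that reading only; the exact quantity the binary tests is an assumption, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Box-dim count of the op must equal the atom's box-dim count.
bool box_rank_matches(const std::vector<std::uint64_t>& op_box,
                      const std::vector<std::uint64_t>& atom_box) {
    return op_box.size() == atom_box.size();
}

// Assumed interpretation: leading extent (in elements) times the element
// width (in bits) must be a multiple of 128 bits, i.e. 16 bytes.
bool leading_box_dim_aligned(std::uint64_t leading_elems,
                             std::uint32_t elem_bits) {
    return (leading_elems * elem_bits) % 128 == 0;
}
```

Under this reading, eight f16 elements (128 bits) pass, while four f16 elements (64 bits) fail.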
Tiled Memop Verification
tiled_load, tiled_store, and tiled_atomic_rmw share a base shape:
- operand segments are {view, coords, offsets, token};
- token segment has zero or one value;
- coordinate count matches the view rank, plus any descriptor-specific coordinate;
- coordinate operands are MLIR index-typed (the dialect uses tiled_view, not the upstream memref type; the rank and element type come from the tiled_view, and the address-space tag on it pins residency to RMEM/SMEM/TMEM/GMEM);
- the SSA result tile<…> shape matches the tiled_view shape, regardless of whether the view's address space is RMEM, SMEM, TMEM, or GMEM;
- the SSA result tile<…> element type matches the tiled_view element type;
- tile dimensions are positive powers of two and do not exceed the implementation limit.
Load and store differ in allowed memory semantics.
| Operation | Additional rules |
|---|---|
| tiled_load | release and acquire-release semantics are rejected |
| tiled_store | acquire and acquire-release semantics are rejected; padding and in-bounds flags must agree |
| tiled_atomic_rmw | rmw_mode is required; 8-bit types and 16-bit integer atomics are rejected |
LogicalResult verify_tiled_memop(TiledMemOp op) {
if (failed(verify_operand_segments(op))) {
return op.emit_error("operandSegmentSizes does not match the op schema");
}
if (op.has_token() && op.token_segment_size() > 1) {
return op.emit_error("tiled memop token segment must hold zero or one value");
}
if (op.coord_count() != op.view().rank() + op.descriptor_coord_count()) {
return op.emit_error("tiled memop coordinate count does not match the view rank");
}
if (!coords_are_index_typed(op)) {
return op.emit_error("tiled memop coordinates must be index-typed");
}
if (op.tile_shape() != op.view().shape()) {
return op.emit_error("tile shape must match tensor shape");
}
if (op.tile_element_type() != op.view().element_type()) {
return op.emit_error("tile element type must match view element type");
}
for (int64_t d : op.tile_shape()) {
if (d <= 0 || (d & (d - 1)) != 0) {
return op.emit_error("tile dimensions must be positive powers of two");
}
}
if (total_elements(op.tile_shape()) > MAX_TILE_ELEMENTS) {
return op.emit_error("tile total element count exceeds the implementation limit");
}
return verify_memory_semantics(op);
}
The atomic-RMW diagnostic surface (verified verbatim strings in the binary):
| Diagnostic | Cause |
|---|---|
| "requires attribute 'rmw_mode'" | The atomic op is missing its rmw_mode attribute. |
| "tiled_atomic_rmw not supported for 8-bit types" (async variant: "async_tiled_atomic_rmw not supported for 8-bit types") | An 8-bit element was passed to an atomic op. |
| "tiled_atomic_rmw not supported for 16-bit integer" (async variant: "async_tiled_atomic_rmw not supported for 16-bit integer") | A 16-bit integer atomic was attempted. |
| "tiled_atomic_rmw for 16-bit float only supports add, max, min operations" (async variant: "async_tiled_atomic_rmw for 16-bit float only supports add, max, min operations") | A 16-bit floating atomic uses an unsupported mode. |
| "tiled_atomic_rmw op cannot use fadd operation, please use add instead for both int and float types" | An atomic op uses fadd where the verifier requires add. |
| "tiled_atomic_rmw op cannot use xchg operation" | An atomic op uses an unsupported xchg mode. |
| "tiled_atomic_rmw's tiled_view type must be produced by block_tile directly" | The tiled_view operand is not the direct SSA result of block_tile. |
The companion rules for tiled_load and tiled_store (segment-size partitioning, coordinate rank match, index-typed coordinates, tile-shape / element-type agreement with the tiled_view, positive power-of-two tile dimensions, and the load/store-specific memory-ordering restrictions described above) are enforced by trait verifiers and shared helpers, and their user-visible diagnostics are emitted from outside this op family. The strings the binary reserves locally for load and store are the analogues of the tiled_view provenance rule: "tiled_load's tiled_view type must be produced by block_tile directly" and "tiled_store's tiled_view type must be produced by block_tile directly". MED confidence: a few of the trait-level diagnostics in the binary are emitted through printf-style format helpers and are not stored as one verbatim string, which is why the table above does not enumerate them.
Atomic RMW carries stricter element-type rules. Sixteen-bit floating-point atomics are limited to add, max, and min. Per the diagnostics above, the verifier also rejects the fadd spelling (callers use add for both integer and floating-point types) and the xchg mode, so the lowering can pick a supported target operation without ambiguity.
Layout, Copy, and Dot Verification
convert_layout checks that source and destination tiles have the same element type, the same total element count, and layouts that the materialization pass knows how to decompose.
copy and async.copy require an atom attribute and a legal source/destination memory-space pair. The pair is read from the address-space tag carried on each tiled_view operand — the SSA type alone (tiled_view<…>) does not pin residency, so the verifier inspects the view's residency attribute (RMEM/SMEM/TMEM/GMEM) rather than the SSA type to compute the pair. Legal pairs include GMEM/RMEM, GMEM/SMEM, RMEM/GMEM, RMEM/SMEM, RMEM/TMEM, SMEM/GMEM, SMEM/RMEM, SMEM/TMEM, and TMEM/RMEM (named here by residency, not by SSA type).
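The pair check can be sketched as a lookup keyed on residency rather than on SSA type. The set below contains exactly the pairs named in the paragraph above; whether the real verifier accepts additional pairs is not established here, and Residency and is_legal_copy_pair are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <utility>

enum class Residency { GMEM, RMEM, SMEM, TMEM };

// True when (src, dst) is one of the source/destination pairs listed above.
bool is_legal_copy_pair(Residency src, Residency dst) {
    using R = Residency;
    static const std::set<std::pair<R, R>> legal = {
        {R::GMEM, R::RMEM}, {R::GMEM, R::SMEM},
        {R::RMEM, R::GMEM}, {R::RMEM, R::SMEM}, {R::RMEM, R::TMEM},
        {R::SMEM, R::GMEM}, {R::SMEM, R::RMEM}, {R::SMEM, R::TMEM},
        {R::TMEM, R::RMEM},
    };
    return legal.count(std::make_pair(src, dst)) != 0;
}
```

Because the key is the residency tag, two views that print as the same tiled_view<…> type can still produce different answers.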
dot and async.dot require an atom, compatible A/B element types, the right signedness attributes for integer MMA, and a Float32 accumulator for floating-point paths.
Block-Scaled MMA Verification
Block-scaled MMA is the Blackwell-specific correctness gate driving the tcgen05.mma::block_scale family. Every nv_tileas.block_scaled_mma op flows through one verifier function shared between the op builder, the ConvertTileAAToTileAS MMA lowering, and the dialect builder.
The verifier takes seven typed handles — A type, B type, accumulator type, scale-factor-A (sfa) type, scale-factor-B (sfb) type, the MMA atom kind handle, and the destination tile type — followed by a selector that picks between the 2-CTA and 1-CTA atom catalogs. On success it returns a packed (atom_K << 32) | vecSize; on failure it returns zero and the diagnostic is already on the op. Callers treat 0 as "verification rejected", not as a legal (0, 0) shape.
Type Resolution
MLIR built-in types are pointer-comparable singletons. The verifier resolves every type predicate by comparing the incoming handle against the canonical entries for Float32, Float8E8M0FNU, Float8E5M2, Float8E4M3FN, and the two FP4 variants (Float4E2M1FN and FloatNV4E0M3F). The FP4 variants share an internal type slot intentionally: NVIDIA reuses the same logical tile element for the OCP MX-FP4 and NVFP4 paths, and the scale-factor ratio is what resolves which Blackwell instruction kind to emit. A verifier that tries to disambiguate FP4 by element type alone rejects legal NVFP4 programs.
Phase-Ordered Diagnostics
Eleven diagnostics cover five phases. The phase order is fixed: presence, agreement, accumulator, K-extent, catalog. Reordering the phases changes which diagnostic the user sees when more than one phase is wrong, and breaks downstream test expectations.
| Phase | Diagnostic | Cause |
|---|---|---|
| 1 — scale-factor presence | "fp4 mma should expect scaling factors" | A type pair landed on the FP4 slot but sfa or sfb is missing |
| 2 — scale-factor agreement | "expects sfa/sfb element types to be the same" | sfa and sfb resolve to different types |
| 3 — accumulator type | "expects c type to be Float32" | The destination/accumulator type is not Float32 |
| 4 — K-extent agreement | "Scale factor vector size mismatch:" followed by two formatted K extents | A and B disagree on the scale-factor K dimension after vectorisation |
| 5 — atom catalog | per-combo expectation diagnostics — "expects A and B element types are valid 4bit types, such asFloat4E2M1FNType or FloatNV4E0M3FType , when (atom_K=64 && vecSize=16)", "expects sfa/sfb element types to be Float8E8M0FNUType or Float8E4M3FNType when (atom_K=64 && vecSize=16)", "expects A/B element types to be Float4E2M1FNType and sfa/sfb element types to be Float8E8M0FNUType when (atom_K=64 && vecSize=32)", plus "invalid block scale vector size. Expecting 32, but got " for the vector-size axis, "mma block scale is not supported by compute capability < sm100" for the SM-tier gate, and "Block scale is not supported for f16, tf32, f8f6f4, and i8 types" / "Block scale not supported for f16, tf32, f8f6f4 and int8 types" for the element-type gate | The resolved (atom_K, vecSize) does not appear in the legal catalog |
The trailing colon in the phase-4 diagnostic signals that two integers follow on the same line. Reimplementations that print the integers on a separate line break log-scrapers.
Legal Atom Catalog
Three (atom_K, vecSize) pairs survive verification. Each maps to exactly one Blackwell MMA kind, and each has a fixed packed return value:
| (atom_K, vecSize) | Type pattern | PTX kind | Return |
|---|---|---|---|
| (32, 32) | FP8 (E5M2 or E4M3FN) tiles with E8M0 scales | tcgen05.mma.kind::f8f6f4 | 0x2000000020 |
| (64, 16) | FP4 tiles with E8M0 or E4M3FN scales | tcgen05.mma.kind::mxf4 (OCP MX-FP4) | 0x4000000010 |
| (64, 32) | FP4 tiles with E8M0 scales, block size 32 | tcgen05.mma.kind::mxf4nvf4 (NVFP4) | 0x4000000020 |
Shape (64, 16) discriminates OCP MX-FP4 from NVFP4. OCP requires scale block size 16 and tolerates an E4M3FN scale; NVFP4 pins block size to 32 and demands E8M0 scales over a 64-K tile. The 2-CTA selector further narrows the catalog — 1-CTA accepts all three rows, 2-CTA rejects the NVFP4 row because Blackwell has no mxf4nvf4 2-CTA atom.
uint64_t verify_block_scaled_mma(Type a, Type b, Type c,
Type sfa, Type sfb,
MmaAtomKind atom, Type dst,
bool two_cta) {
bool is_fp4 = (a == Float4E2M1FN) || (a == FloatNV4E0M3F);
bool is_fp8 = (a == Float8E5M2) || (a == Float8E4M3FN);
if (is_fp4 && (!sfa || !sfb)) {
emit_diag(op, "fp4 mma should expect scaling factors");
return 0;
}
if (sfa && sfb && sfa != sfb) {
emit_diag(op, "expects sfa/sfb element types to be the same");
return 0;
}
if (c != Float32) {
emit_diag(op, "expects c type to be Float32");
return 0;
}
uint32_t atom_k = resolve_atom_k(a, b, atom);
uint32_t vec_size = resolve_vec_size(a, sfa);
uint32_t k_a = scale_factor_k_extent(a, sfa);
uint32_t k_b = scale_factor_k_extent(b, sfb);
if (k_a != k_b) {
emit_diag(op, "Scale factor vector size mismatch: ", k_a, ", ", k_b);
return 0;
}
if (is_fp8 && atom_k == 32 && vec_size == 32) {
return ((uint64_t)32 << 32) | 32;
}
if (is_fp4 && atom_k == 64 && vec_size == 16) {
return ((uint64_t)64 << 32) | 16;
}
if (is_fp4 && atom_k == 64 && vec_size == 32 && !two_cta) {
return ((uint64_t)64 << 32) | 32;
}
/* Emit the per-combo expectation diagnostic matching the failing axis
* (element type, scale-factor type, vecSize, SM tier) — see the
* phase-5 table above for the verbatim binary strings. */
return 0;
}
The packed return uses atom_K in the high word and vecSize in the low word, both as 32-bit unsigned values. Zero is reserved for failure; legal shapes always have at least the vecSize field set. The op builder reads the low 32 bits as vecSize and the high 32 bits as atom_K before writing the result into the op's atom attribute — any other return encoding silently corrupts the op.
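The encoding is easy to check by hand. The following sketch packs and unpacks the value under the layout described above; the helper names `pack_atom_shape` and `unpack_atom_shape` are hypothetical and exist only for illustration.

```python
MASK32 = 0xFFFFFFFF


def pack_atom_shape(atom_k, vec_size):
    """Pack (atom_K, vecSize) as (atom_K << 32) | vecSize."""
    if not (0 <= atom_k <= MASK32 and 0 < vec_size <= MASK32):
        raise ValueError("fields must fit in 32 bits and vecSize must be non-zero")
    return (atom_k << 32) | vec_size


def unpack_atom_shape(packed):
    """Return (atom_K, vecSize), or None for the reserved failure value 0."""
    if packed == 0:
        return None
    return (packed >> 32) & MASK32, packed & MASK32
```

Applied to the three catalog rows, `pack_atom_shape` reproduces the Return column of the table, and `unpack_atom_shape(0)` yields `None` rather than a (0, 0) shape, matching the rule that callers treat zero as rejection.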
Worked Failure: sfa/sfb Element Type Mismatch
A concrete walk illustrates the phase-2 diagnostic. Consider the input
%d = nv_tileas.block_scaled_mma %a, %b, %c, %sfa, %sfb
{ atom = #nv_tileas<atom mxf4>,
operandSegmentSizes = array<i32: 1, 1, 1, 1, 1> }
: tile<128x64xf4E2M1FN>, tile<64x128xf4E2M1FN>, tile<128x128xf32>,
tile<128x4xf8E8M0FNU>, tile<4x128xf8E4M3FN>
-> tile<128x128xf32>
The atom selects MX-FP4 with (atom_K=64, vecSize=16). Phase 1 passes because both sfa and sfb are present. Phase 2 fails: sfa resolves to Float8E8M0FNU while sfb resolves to Float8E4M3FN. The verifier emits "expects sfa/sfb element types to be the same" and returns 0. Phases 3 through 5 never run; the op never reaches lowering. Fixing the input requires the producer to choose a single scale-factor element type (typically Float8E8M0FNU for MX-FP4, since Float8E4M3FN is legal only on the OCP MX-FP4 path that further constrains the block size).
A correct reimplementation therefore enforces:
- Phase order is presence, agreement, accumulator, K-extent, catalog.
- The FP4 element-type slot is shared. Disambiguation is by (atom_K, vecSize) and the 2-CTA selector, never by element identity alone.
- The packed return uses atom_K in the high word and vecSize in the low word, both as 32-bit unsigned values.
- Zero is reserved for failure.
Shared Helper Rules
Several checks are reused across the dialect:
| Helper concept | Rule |
|---|---|
| tile dimensions | every dimension must be a positive power of two; total tile size is capped |
| memory semantics and scope | scope is required when semantic is stronger than weak; weak semantic must not carry scope |
| store padding | padding value is allowed only when in-bounds is false |
| special padding | NaN, infinities, and negative zero are valid only for floating-point elements |
| operand segments | segment-size attribute must match the op schema |
| pipeline terminators | pipeline regions must end in async.pipeline.yield |
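Three of the helper rules above reduce to small predicates. The sketch below is a hedged illustration, not the binary's implementation: the function names are hypothetical, the tile-size cap is passed in as a parameter because the text does not state its value, and the only semantic name taken from the source is "weak".

```python
def is_positive_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and other values."""
    return n > 0 and (n & (n - 1)) == 0


def check_tile_dims(dims, max_elements):
    """Every dimension is a positive power of two and the total stays under the cap."""
    total = 1
    for d in dims:
        if not is_positive_power_of_two(d):
            return False
        total *= d
    return total <= max_elements


def check_semantic_scope(semantic, scope):
    """Weak semantic must not carry a scope; anything stronger requires one."""
    if semantic == "weak":
        return scope is None
    return scope is not None


def check_store_padding(in_bounds, padding):
    """A padding value is allowed only when in-bounds is false."""
    return padding is None or not in_bounds
```

Keeping each rule as a standalone predicate matches the table's framing: the same check is reused by several ops instead of being restated per op.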
OpTrait::nv_tile Inventory
Verifier behavior in this dialect is partly templated by op traits — small mixin classes the
OpTrait::nv_tile namespace declares once and that every op in the families above stamps onto its
declaration. The trait family is closed: twenty-three traits, each with a single semantic job, and
verification dispatches on the trait set attached to a concrete op rather than on a per-op
switch. Recovering this table from the binary is straightforward because every trait emits a
typeinfo string of the form OpTrait::nv_tile::<TraitName> that the verifier framework reads at
registration time.
| Trait | Role in Verification |
|---|---|
| FirstOperandIsNonAliasingQueue | the queue operand must not alias any other op input or output |
| MemoryModelReadTrait, MemoryModelWriteTrait, MemoryModelReadWriteTrait | tags the op for the memory-effect collector; orthogonal to the side-effect interface |
| MustHaveMemLayoutAmongOperandsAndResult | at least one operand or the result must carry an explicit memory layout |
| PipelineAcquireOpTrait, PipelineReleaseOpTrait | marks the op as a producer-acquire or producer-release boundary in the async pipeline region |
| ResultsAreSharedEncoding | every result inherits the shared-memory encoding of the producing tile |
| SameLoadStoreOperandsAndResultEncoding, SameLoadStoreOperandsAndResultShape | tiled load/store must agree on both encoding and shape between operands and result |
| SameLoadStoreOperandsEncoding, SameLoadStoreOperandsShape | weaker form, used by ops with no result (stores) |
| SameOperandsAndResultEncoding, SameOperandsEncoding | encoding-only invariants for ops that touch tile values without changing shape |
| SameOperandsAndResultsAtom, SameOperationAndResultsAtom | every operand or result that carries a copy/MMA atom must report the same atom identity |
| SameTiledViewAndTensorShapeTrait | shape on a tiled view must match the producing tensor's shape on the same axes |
| TensorSizeTrait | total tile size must be representable as a positive power of two within the per-dialect cap |
| TensorTypeHavingLayout | every tensor operand must already carry a layout attribute when the verifier runs |
| TiledLoadStoreOpTrait, TiledLoadStoreOpSameElementTypeTrait | grouping trait that pulls in the tiled memop helper bundle; the element-type variant adds the elementType-match check |
| TiledPaddingValueTrait | padding value is allowed only when the in-bounds attribute is false (also enforced by the Shared Helper Rules table) |
| ValidTileASLoadOperandsAndResultEncoding | combined operand/result encoding check specific to tiled_load and its async variant |
A correct reimplementation declares each trait as a mixin whose verifyTrait returns
LogicalResult and chains into the op's bespoke verifier. The trait order is irrelevant — every
trait either succeeds standalone or emits its own diagnostic — and the framework runs them all
before the op's own verify method gets a chance.
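The dispatch shape described above can be sketched with plain functions standing in for trait mixins. Everything here is an assumption made for illustration: ops are modeled as dictionaries, the diagnostic strings are placeholders and not the binary's spellings, the size cap is omitted because the text gives no value, and the sketch assumes that any trait failure keeps the bespoke verifier from running.

```python
def tensor_size_trait(op):
    """Stand-in for TensorSizeTrait: total tile size is a positive power of two."""
    total = 1
    for d in op["shape"]:
        total *= d
    if total <= 0 or total & (total - 1):
        return "placeholder: tile size must be a positive power of two"
    return None


def tiled_padding_value_trait(op):
    """Stand-in for TiledPaddingValueTrait: padding only when in-bounds is false."""
    if op.get("padding") is not None and op.get("in_bounds", False):
        return "placeholder: padding value requires in-bounds to be false"
    return None


def verify_op(op, traits, op_verify):
    """Run every trait check, then the op's own verifier if all traits passed."""
    diagnostics = [msg for msg in (trait(op) for trait in traits) if msg is not None]
    if diagnostics:
        return diagnostics          # bespoke verifier is skipped (assumption)
    msg = op_verify(op)
    return [msg] if msg is not None else []
```

Because each trait check reads only the op and returns its own diagnostic, permuting `traits` changes the order of reported messages but not the set, which is the sense in which trait order is irrelevant.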
Cross-References
Operation Roster and Builders catalogues the operations these verifiers run against, with full operand/result tables and a worked producer/consumer pipeline example. Types describes the pipeline-token and iterator types the region-op verifier template inspects. Folds and Memory Consistency describes the rewrites that run after verification succeeds. The nv_tileaa block-scaled MMA contract documented here is grounded by the dot verifier in nv_tileaa Types, Attributes, Verifiers — Dot Diagnostics. The OpInterface side of the dispatch story — including the BasicPtxBuilderInterface and PtxBuilderOpInterface families consumed by NVVM lowering — is inventoried in Interface Vtables and Dispatch — Interface Inventory.
nv_tileas Folds and Memory Consistency
Abstract
nv_tileas canonicalization is deliberately split. Pure tile-structure rewrites simplify alloc_tensor, insert_slice, extract_slice, view, and structured control-flow scaffolding. Memory-ordering operations sit behind MemoryConsistencyOpInterface — pure canonicalizations must not reorder, duplicate, or erase them.
The rewrite shapes, the legality conditions that gate them, and the separation rule that keeps folding apart from ordering-sensitive transformations appear in the sections below.
Folding Model
Most useful TileAS simplification lives in rewrite patterns rather than per-operation constant folds. The interesting cases are structural — typically an scf.for or scf.if plus tile slice operations — not a single operation with constant operands. The canonicalize driver runs all seven patterns to fixed point against the entire module; the recursive expression simplifier handles deeper boolean and integer cleanup elsewhere.
Pipeline-related lowering may still invoke ordinary MLIR folding during one-to-N conversion. Treat those folds as local simplifications only. Larger layout-chain removal belongs to the layout-conversion removal pass, not to a hidden convert_layout fold.
Canonicalization Patterns
The dialect installs seven canonicalization patterns. Each is documented below as an input/output pair plus the matching legality condition.
| Pattern | Root | Summary |
|---|---|---|
| simplify extract slice | nv_tileas.extract_slice | Constant offsets/sizes/strides collapse into a static-shape view. |
| decompose loop iter args | scf.for | Sinks alloc_tensor into the loop body; removes redundant iter args. |
| decompose if by insert slice | scf.if | Duplicates allocation/insertion chains into each branch. |
| decompose if by extract slice | scf.if | Sinks extraction into each branch. |
| swap view and extract slice | nv_tileas.extract_slice | Rewrites extract_slice(view(x)) into view(extract_slice(x)). |
| coalesce perfectly nested loops | scf.for | Flattens compatible nested loops. |
| simplify extract from insert | nv_tileas.extract_slice | Replaces exact extract-after-insert with the inserted source. |
Simplify Extract Slice
The pattern collapses a slice operation whose offset, size, and stride operands are all arith.constant values into a view whose result type bakes those values into the static shape.
Input IR:
%c0 = arith.constant 0 : index
%c64 = arith.constant 64 : index
%c1 = arith.constant 1 : index
%slice = nv_tileas.extract_slice %src[%c0, %c0][%c64, %c64][%c1, %c1]
: tensor<128x128xf32> to tensor<?x?xf32>
Output IR:
%slice = nv_tileas.extract_slice %src[0, 0][64, 64][1, 1]
: tensor<128x128xf32> to tensor<64x64xf32>
Legality: every offset, size, and stride operand must resolve to a non-negative integer constant. The rewrite preserves the slice's memory ordering attributes (there are none on the pure slice op) and uses fold-aware constant indexing the canonicalizer already trusts.
Decompose Loop Iter Args
The pattern recognizes a loop iter-arg whose init traces back to alloc_tensor through a chain of insert_slice operations, and whose yielded value traces back to the same allocation through a parallel chain. It sinks the allocation into the loop body, re-emits the insertion chain inside the body, and drops the iter-arg from the loop's signature.
Input IR:
%init = nv_tileas.alloc_tensor : tensor<128x128xf32>
%init_v = nv_tileas.insert_slice %seed into %init[%i0, %j0][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<128x128xf32>
%out = scf.for %k = %k0 to %k1 step %k_step
iter_args(%buf = %init_v) -> tensor<128x128xf32> {
%step_v = "produce_tile"(%buf, %k) : (tensor<128x128xf32>, index) -> tensor<?x?xf32>
%next = nv_tileas.insert_slice %step_v into %buf[%i_k, %j_k][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<128x128xf32>
scf.yield %next : tensor<128x128xf32>
}
Output IR:
scf.for %k = %k0 to %k1 step %k_step {
%buf = nv_tileas.alloc_tensor : tensor<128x128xf32>
%step_v = "produce_tile"(%buf, %k) : (tensor<128x128xf32>, index) -> tensor<?x?xf32>
nv_tileas.insert_slice %step_v into %buf[%i_k, %j_k][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<128x128xf32>
scf.yield
}
Legality (all must hold):
- The iter-arg init traces through a chain of pure tile-structure ops to a single alloc_tensor.
- The yielded value traces through a parallel chain to the same allocation.
- No operation in either chain implements MemoryConsistencyOpInterface.
- No use of the iter-arg outside the loop body depends on the loop-carried value (the rewrite eliminates that result).
The rewrite is safe because alloc_tensor is a pure tile constructor: re-emitting it inside the loop body produces a value with the same SSA semantics for each iteration, and the loop's signature contracts by exactly one iter-arg.
Decompose If by Insert / Extract Slice
The two branch-decomposition patterns rewrite scf.if results whose chains involve insert_slice or extract_slice so that each branch performs its own allocation and slice work rather than yielding a shared mutable tile.
Input IR:
%init = nv_tileas.alloc_tensor : tensor<64x64xf32>
%init_v = nv_tileas.insert_slice %seed into %init[0, 0][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<64x64xf32>
%r = scf.if %cond -> tensor<64x64xf32> {
%v = nv_tileas.insert_slice %a into %init_v[%i, %j][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<64x64xf32>
scf.yield %v : tensor<64x64xf32>
} else {
%v = nv_tileas.insert_slice %b into %init_v[%i, %j][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<64x64xf32>
scf.yield %v : tensor<64x64xf32>
}
Output IR:
%r = scf.if %cond -> tensor<64x64xf32> {
%ta = nv_tileas.alloc_tensor : tensor<64x64xf32>
%ta_v = nv_tileas.insert_slice %seed into %ta[0, 0][%m, %n][1, 1] : ...
%v = nv_tileas.insert_slice %a into %ta_v[%i, %j][%m, %n][1, 1] : ...
scf.yield %v : tensor<64x64xf32>
} else {
%tb = nv_tileas.alloc_tensor : tensor<64x64xf32>
%tb_v = nv_tileas.insert_slice %seed into %tb[0, 0][%m, %n][1, 1] : ...
%v = nv_tileas.insert_slice %b into %tb_v[%i, %j][%m, %n][1, 1] : ...
scf.yield %v : tensor<64x64xf32>
}
Legality: both branches' yielded values trace back to the same allocation through pure tile-structure chains, no chain crosses a memory-consistency op, and the allocation has no live use outside the scf.if. Duplicating the allocation per branch is what makes the rewrite safe — the rewrite never creates a shared mutable tile across the two branches.
Swap View and Extract Slice
The pattern rewrites extract_slice(view(x)) into view(extract_slice(x)) when the slice can be performed on the underlying storage at the same offset/stride and then re-viewed. The legality condition is that the view's layout transformation commutes with the slice operation — that is, applying the slice to the underlying tensor and then taking the view produces the same SSA value as applying the view to the underlying tensor and then taking the slice.
This rewrite typically fires after coalesce-perfectly-nested-loops produces fresh extract_slice ops over a view-shaped source.
Coalesce Perfectly Nested Loops
The pattern flattens an outer scf.for and an inner scf.for when the inner loop is the only operation in the outer loop's body, neither loop has live iter-args, and the inner loop's bounds and step are constant. The merged loop carries the product of the two iteration ranges and re-derives the original induction variables inside the body via arith.divsi/arith.remsi.
Legality: the outer body must contain only the inner loop plus a terminator. Any other operation in the outer body forbids coalescing because it would have to run a different number of times after the merge.
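The index recovery in the merged loop can be checked with a small model. This is a sketch under stated assumptions: loops are given as hypothetical (lower, upper, step) triples with positive steps, and Python's `//` and `%` stand in for arith.divsi and arith.remsi, which agree with them for the non-negative flat index used here.

```python
def trip_count(lb, ub, step):
    """Number of iterations of a for-loop with a positive step."""
    return max(0, -(-(ub - lb) // step))  # ceiling division


def coalesced(outer, inner):
    """Yield (i, j) from one flat loop, re-deriving both induction variables."""
    olb, _, ostep = outer
    ilb, _, istep = inner
    n_outer, n_inner = trip_count(*outer), trip_count(*inner)
    for flat in range(n_outer * n_inner):
        i = olb + (flat // n_inner) * ostep   # divsi recovers the outer index
        j = ilb + (flat % n_inner) * istep    # remsi recovers the inner index
        yield i, j
```

For example, `list(coalesced((0, 6, 2), (1, 4, 1)))` visits the same (i, j) pairs, in the same order, as the nested loops `for i in range(0, 6, 2): for j in range(1, 4, 1)`.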
Simplify Extract from Insert
The pattern recognizes extract_slice(insert_slice(src, dst, offsets), offsets) → src when the extract and insert use the exact same offsets, sizes, and strides. The fold returns the inserted source directly, bypassing the storage round-trip.
Input IR:
%t = nv_tileas.insert_slice %x into %dst[%i, %j][%m, %n][1, 1]
: tensor<?x?xf32> into tensor<64x64xf32>
%y = nv_tileas.extract_slice %t[%i, %j][%m, %n][1, 1]
: tensor<64x64xf32> to tensor<?x?xf32>
Output IR:
%y = %x
Legality: the offsets, sizes, and strides must be equal as SSA values (or as constants after fold-aware comparison); no other operation may insert or extract a slice into the same storage region between the matched pair.
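A minimal model of the match makes the equality requirement concrete. The sketch assumes ops are dictionaries with hypothetical keys (`source`, `offsets`, `sizes`, `strides`) and compares operands by SSA name or constant value. It does not model the second legality clause about intervening inserts or extracts on the same storage region.

```python
def fold_extract_of_insert(extract, insert):
    """Return the inserted source when the extract reads back exactly the
    region the insert wrote; return None when the fold does not apply."""
    if extract["source"] is not insert:
        return None
    for key in ("offsets", "sizes", "strides"):
        if extract[key] != insert[key]:
            return None
    return insert["source"]
```

With the IR above, the insert has source `%x` and both ops use `[%i, %j][%m, %n][1, 1]`, so the function returns `%x`; changing any one offset, size, or stride makes it return `None`.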
Memory Consistency Interface
MemoryConsistencyOpInterface marks operations whose ordering matters. Canonicalization may inspect them, but pure tile rewrites must not move across them or erase them.
| Operation group | Why it participates |
|---|---|
| async load/store/copy/dot | has visible async memory ordering |
| async waits | observes completion of async work |
| async TMA load/store/reduction/gather/scatter | consumes descriptor and memory-ordering semantics |
| synchronous copy | may observe or publish data relevant to async regions |
| make_tiled_tma_desc | descriptor result is consumed by TMA operations |
| reduce and scan | region bodies may carry ordering-sensitive operations |
Pure tile-shaping operations are intentionally excluded:
- alloc_tensor
- insert_slice
- extract_slice
- view
- async.future_wait (ordering rides on the future token itself)
- async pipeline region plumbing (ordering rides on producer/consumer interfaces and tokens)
Safe-Rewrite Predicate
A canonicalization pattern is safe when every operation it moves, duplicates, or erases lies outside the memory-consistency set and is reached only through SSA chains of pure tile-structure operations. The match driver walks each chain from its root toward the defining op of the rewrite source, rejecting the match the moment it encounters a memory-consistency op or any op that is neither pure tile structure nor a constant.
The walk terminates at a block boundary, at the first non-pure operation, or at a fixed-point sink (an alloc_tensor for the iter-arg decomposition, an insert_slice for the extract-from-insert fold). A chain that hits a memory-consistency op aborts immediately so the rewrite never even considers reshuffling ordered ops.
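The walk can be sketched as a loop over a linear chain of defining ops. The model is deliberately small and rests on assumptions: `Op` is a hypothetical record with a name and a single followed operand, the allowed set is the pure tile-structure ops plus constants named in this page, and a `None` source stands for a block boundary.

```python
from dataclasses import dataclass
from typing import Optional

# Ops a match chain may pass through, per the pure tile-structure list.
PURE_CHAIN_OPS = {"alloc_tensor", "insert_slice", "extract_slice", "view", "constant"}


@dataclass
class Op:
    name: str
    source: Optional["Op"] = None  # defining op of the operand the chain follows


def walk_to_sink(root, sink_name):
    """Follow the chain from root; return the sink op, or None to reject."""
    op = root
    while op is not None:
        if op.name not in PURE_CHAIN_OPS:
            return None            # memory-consistency op or other non-pure op
        if op.name == sink_name:
            return op              # reached the sink for this pattern
        op = op.source
    return None                    # block boundary reached without a sink
```

Rejecting on the first op outside the allowed set, instead of enumerating the memory-consistency ops, keeps the predicate conservative: an unknown op aborts the match the same way an ordered op does.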
Layout Conversion Folding
The identity convert_layout(convert_layout(x)) belongs to the layout-conversion removal pass, not to a local convert_layout fold. The legality of commuting or deleting a layout conversion depends on whether the value lives in register space, shared memory, tensor memory, or crosses a pipeline boundary — and only the pass has that context.
The pass-level rewrite trims a chain when the composition reduces to identity in the target's atom catalog. Two factors decide whether the composition reduces:
- The inner conversion's source layout and the outer conversion's destination layout must lie in compatible storage classes. A register-to-shared conversion followed by a shared-to-register conversion is identity only when the register layout on both sides agrees on lane assignment and vector width.
- The atom catalog must contain a direct atom from the inner source to the outer destination. If it does not, the pass keeps the chain because materializing the intermediate layout is what makes the round-trip legal at all.
When both conditions hold, the pass replaces the outer conversion's result with the inner conversion's source, leaving the inner conversion as dead code that ordinary DCE picks up.
Keeping this in a pass rather than a fold lets the compiler consult target atom plans and memory-space rules.
Ordering Invariants
- Canonicalization roots may be pure tile ops or structured control-flow ops.
- Match chains may include alloc_tensor, insert_slice, extract_slice, view, and constants.
- Match chains must reject copy, async memory operations, TMA operations, reductions, scans, and descriptor builders.
- Rewrites must not alter memory semantic, memory scope, in-bounds, padding, or RMW attributes.
- Branch decomposition duplicates allocations per branch rather than sharing a mutable tile across arms.
- Layout-chain removal belongs to the layout-conversion pass, where target layout plans are available.
Cross-References
Operation Roster and Builders catalogues the operations these rewrites target. Verifiers describes the legality contracts that survive the rewrites. Types describes the iterator and async-token types that anchor the memory-consistency interface.
cute Dialect Overview
Provenance vs Upstream MLIR
cute is NVIDIA-introduced and has no upstream MLIR equivalent. Upstream MLIR has no dialect that models CUTLASS cuTe layout algebra as first-class IR — the open-source CUTLASS library expresses the same algebra in C++ templates, not in MLIR. Tileiras lifts those templates into an MLIR dialect so passes can inspect, compose, verify, and lower layout values rather than expand them at C++ compile time. Without this dialect the pipeline would have no in-IR carrier for shape/stride/swizzle/atom data between layout assignment and the architecture-specific cute_nvgpu binding step.
Abstract
cute is tileiras's MLIR form of CUTLASS cuTe layout algebra. It encodes shapes, strides, layouts, swizzles, coordinates, tiles, pointer views, copy atoms, and MMA atoms — together with the operations that compose, divide, complement, coalesce, and filter them — and stops short of binding any of it to NVIDIA hardware. That binding is the job of cute_nvgpu. Every later GPU-specific dialect (cute_nvgpu, nvgpu, nvvm) reads layout values produced here.
cute is not a code-generation dialect. Its values describe structure: how a logical tile maps to physical coordinates, how coordinates become offsets, how one layout composes with another, how a tiled copy or tiled MMA partitions work across lanes, warps, and memory spaces. That makes it the common language shared by CUTLASS pipeline modeling, TileAS layout assignment, TMA descriptor construction, and MMA lowering.
Role in the Cascade
cuda_tile / nv_tileaa / nv_tileas
|
| choose tile shapes, views, and partitioning
v
cute
|
| attach target-specific atoms and SM-tier constraints
v
cute_nvgpu
|
| normalize to nvgpu and nvvm
v
PTX
cute is a compact typed form of the same algebra that CUTLASS C++ expresses with templates. The templates become values and attributes that passes inspect, compose, verify, and lower.
Core Concepts
| Concept | Meaning | Typical use |
|---|---|---|
| Shape | Extents of a logical tile or nested coordinate tuple | Describes the iteration space of a tile. |
| Stride | Offset step for each coordinate dimension | Converts coordinates into linear offsets. |
| Layout | Shape plus stride, optionally decorated with swizzle | Maps logical coordinates to storage locations. |
| Tile | A grouped shape/layout fragment | Represents a fragment moved or computed as a unit. |
| Coord | A point in a shape or tile | Indexes layouts, views, and partitioned fragments. |
| Swizzle | Bit permutation applied to low address bits | Avoids bank conflicts or matches hardware layout rules. |
| View | Pointer or memref plus layout metadata | Describes an addressed object without losing its layout. |
| Tiled copy / MMA | Layout plus atom-level partitioning | Feeds target-specific copy or matrix-multiply lowering. |
The key invariant is that cute values remain algebraic. A layout should be composable and queryable without knowing whether it will eventually become a TMA descriptor, an ldmatrix load, a WGMMA operand, or a Blackwell tensor-memory operation.
Layout Semantics in One Line
A layout maps a coordinate to an offset. The simplest model is (shape, stride); the real dialect adds nested tuples, composition, complement, divide, product, and swizzles on top of that single primitive. The algebraic rules and the concrete compose/complement/divide/product definitions live on the algebra page below; this overview only states the kernel.
int64_t layout_offset(Layout L, Coord c) {
int64_t offset = 0;
for (int d = 0; d < rank(c); ++d) offset += c[d] * L.stride[d];
return apply_swizzle(L.swizzle, offset);
}
For a reimplementation, the storage class the original compiler picks does not matter. What does matter: equivalent layouts canonicalize consistently, nested tuple layouts preserve rank and dimension identity, and swizzle composition stays explicit until a target-specific lowering consumes it.
Where to Find What
The dialect is split across four pages by concern. Use this map to find the exact place a topic is documented; the overview does not duplicate any of these.
| Topic | Page |
|---|---|
| Layout algebra rules (composition, complement, divide, product, coalesce, filter) | Layout Algebra and Descriptor Grammar — Algebra Rules on Shape and Stride Tuples |
| Tuple-shape grammar, swizzle composition, descriptor round-trip | Layout Algebra and Descriptor Grammar — Descriptor Grammar |
| Tile partitioning ops (local_tile, local_partition, group_modes, dice, slice) | Tile and Divide Ops — Builder Operations |
| Atom builders (make_atom, make_tiled_copy, make_tiled_mma) and desugar rewrites | Atom Builders and Desugar — Atom Builder Contract |
| cute.make_int_tuple hub, make_layout desugaring shape | Atom Builders and Desugar — make_int_tuple Hub |
| Kernel-entry ABI (cute.kernel → nvvm.kernel, grid-constant arg-attrs) | Atom Builders and Desugar — Kernel-entry ABI |
| Verbatim verifier diagnostics (every error string the dialect emits) | Verifiers — Verbatim Diagnostics |
| Mode-range, divide, product, tuple-arithmetic verifier algorithms | Verifiers — Mode and Rank Checks |
| crd2idx weak-congruence walk, worked diagnostic example | Verifiers — Worked Example: crd2idx Weak Congruence Violation |
| LayoutTypeInterface kind discriminator and per-kind dispatch tables | Verifiers — LayoutTypeInterface Kind Discriminator |
In-Memory IR Tier
Treat cute as an in-memory compiler tier. It exists so passes can exchange rich layout objects without serializing every intermediate shape into the public input format. Textual rendering helps with debugging and documentation; production input normally enters through cuda_tile, nv_tileaa, cutlass, or another higher-level dialect, and the pipeline constructs cute objects internally.
Practical consequence: do not build tooling that depends on cute bytecode as a stable interchange format unless the serializer is explicitly provided. Textual dumps are for inspecting the compiler; they are not a user-facing artifact.
If You Know CUTLASS (open source) — cross-walk
The open-source cute/ C++ headers map almost directly onto this dialect:
| CUTLASS C++ (cute namespace) | tileiras cute IR |
|---|---|
| cute::Shape<...> and cute::Stride<...> | hierarchical (shape, stride) tuples in a !cute.layout |
| cute::Layout<Shape, Stride> | !cute.layout type, kind-discriminated through the seven-entry sentinel table |
| cute::Swizzle<B, M, S> | !cute.swizzle value composed into a layout via make_composed_layout |
| cute::make_tile, cute::make_layout | cute.make_tile, cute.make_layout ops |
| cute::Tensor<Engine, Layout> | cute.make_view ties a pointer/memref to a layout |
| composition, complement, logical_divide, logical_product | identically-named cute.* ops |
| cute::make_tiled_copy, cute::make_tiled_mma | cute.make_tiled_copy, cute.make_tiled_mma (target binding deferred to cute_nvgpu) |
| Compile-time integer arithmetic in C++ templates | cute.make_int_tuple + tuple_div/mod/mul/sub ops |
The main difference is where the target boundary sits. The open-source cute/ library compiles SM-specific MMA_Atom and Copy_Atom traits straight into the same headers; tileiras keeps the SM-neutral atoms in cute and pushes every target-specific atom into cute_nvgpu. A pass running inside cute should never need to ask which SM tier is in use. If it does, the layout choice belongs on the cute_nvgpu side.
Cross-links
- Layout Algebra and Descriptor Grammar — Descriptor Grammar covers the concrete grammar and Round Trip rules.
- Tile and Divide Ops — Divide Variants covers tile partitioning operations.
- Atom Builders and Desugar — Per-Atom Desugar Rewrites covers construction of copy and MMA atoms.
- Verifiers — Verbatim Diagnostics covers layout and atom verifier behavior.
cuTe Layout Algebra and Descriptor Grammar
Abstract
A cute layout is a hierarchical pair: a shape tree paired with a stride tree, together mapping logical coordinates to physical offsets. The algebra over those pairs describes tensor views, tile partitions, swizzles, MMA operands, copy atoms, and the layout conversions that later become NVGPU and TileAS code. The rest of this page covers the mathematical model, textual descriptor grammar, parser behavior, composition algorithm, and round-trip invariants.
Layout Model
A cuTe layout is a function from a coordinate domain to an offset domain. It is stored as two congruent trees:
Layout = (Shape, Stride)
Shape = integer leaf | tuple of Shape
Stride = integer leaf | tuple of Stride
size = product of all Shape leaves
offset = sum(coord_leaf[i] * stride_leaf[i])
cosize = maximum reachable offset plus one
For a flat two-dimensional row-major tile:
Shape = (2, 2)
Stride = (2, 1)
coord(row, col) -> row * 2 + col
For a column-major tile:
Shape = (2, 2)
Stride = (1, 2)
coord(row, col) -> row + col * 2
Hierarchy matters. A mode can itself contain a sub-layout, so a shape like ((2, 2), 4) is not a flattened rank-three vector. The inner 2 x 2 structure survives composition, divide, product, filtering, and swizzling. That is why most cute verifier and folder code is a structural tree walk rather than a flat affine-matrix calculation.
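The size, offset, and cosize formulas above can be checked with a small sketch. It is a simplification, not the tileiras data structure: the hypothetical `FlatLayout` type lists the leaves left to right and drops the grouping, which leaves every offset unchanged but loses the hierarchy this section says must survive. Non-negative strides are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: leaves listed left to right, hierarchy dropped.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

// size = product of all shape leaves.
int64_t layout_size(const FlatLayout& l) {
    int64_t n = 1;
    for (int64_t s : l.shape) n *= s;
    return n;
}

// offset = sum(coord_leaf[i] * stride_leaf[i]).
int64_t layout_offset(const FlatLayout& l, const std::vector<int64_t>& coord) {
    assert(coord.size() == l.shape.size() && l.shape.size() == l.stride.size());
    int64_t off = 0;
    for (std::size_t i = 0; i < coord.size(); ++i) {
        assert(coord[i] >= 0 && coord[i] < l.shape[i]);
        off += coord[i] * l.stride[i];
    }
    return off;
}

// cosize = maximum reachable offset plus one (non-negative strides assumed).
int64_t layout_cosize(const FlatLayout& l) {
    assert(l.shape.size() == l.stride.size());
    int64_t max_off = 0;
    for (std::size_t i = 0; i < l.shape.size(); ++i) {
        assert(l.shape[i] >= 1 && l.stride[i] >= 0);
        max_off += (l.shape[i] - 1) * l.stride[i];
    }
    return max_off + 1;
}
```

With the row-major tile `FlatLayout{{2, 2}, {2, 1}}`, coordinate (1, 0) maps to offset 2; with the column-major tile `FlatLayout{{2, 2}, {1, 2}}`, coordinate (1, 1) maps to offset 3.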
Descriptor Grammar
Textual layout descriptors use basis-vector entries of the form N@dim. A
descriptor is a parenthesized list; entries may nest, and one basis may name
more than one output dimension.
layout ::= group ;
group ::= "(" ws [ entry { ws "," ws entry } ] ws ")" ;
entry ::= group | basis ;
basis ::= count ws "@" ws dim { ws "@" ws dim } ;
count ::= int | int "/" int ;
dim ::= uint ;
int ::= [ "-" ] digit { digit } ;
uint ::= digit { digit } ;
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
ws ::= { " " | "\t" } ;
Examples:
(1@0, 1@1)
(16@0, 1@1)
((1@0, 8@1), 1@2)
(1/2@0, 4@1)
(1@0@1)
The grammar ignores whitespace. Empty groups are legal and stand for the degenerate empty layout. Fractional counts parse cleanly at the syntactic level, but normalization must drop or reject impossible bases so no divide-by-zero or non-integral layout reaches lowering.
Parser Algorithm
Parsing is recursive descent over groups, basis counts, and dimension lists. Malformed descriptors must surface as invalid attributes with a precise diagnostic — never as a silently manufactured default layout.
ParseResult parse_layout(StringRef text) {
    Parser parser = { .input = text, .pos = 0 };
    LayoutNode root = parse_group(&parser);
    // Report a malformed group before looking for trailing text, so the
    // diagnostic points at the first real error.
    if (!root.valid) {
        return parse_error("failed to parse layout descriptor");
    }
    skip_ws(&parser);
    if (!parser.at_end()) {
        return parse_error("unexpected trailing layout text");
    }
    return ParseResult(root);
}
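LayoutNode parse_group(Parser *parser) {
    require_char(parser, '(');
    SmallVector<LayoutEntry> entries;
    skip_ws(parser);
    if (peek(parser) == ')') {          // empty group: the degenerate empty layout
        consume(parser);
        return LayoutNode(entries);
    }
    while (true) {
        entries.push(parse_entry(parser));
        skip_ws(parser);
        if (peek(parser) == ')') {
            consume(parser);
            return LayoutNode(entries);
        }
        // Anything other than ',' here, including end of input, is a parse error.
        require_char(parser, ',');
        skip_ws(parser);                // the grammar allows whitespace after a comma
    }
}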
The basis parser reads an integer or fraction, then one or more @dim pieces:
LayoutEntry parse_basis(Parser *parser) {
    Rational count = parse_count(parser);
    // Reject a zero denominator before reading any dimensions.
    require(count.denominator != 0);
    SmallVector<uint32_t> dims;
    skip_ws(parser);                    // grammar: count ws "@"
    do {
        require_char(parser, '@');
        skip_ws(parser);                // grammar: "@" ws dim
        dims.push(parse_uint(parser));
        skip_ws(parser);
    } while (peek(parser) == '@');
    return LayoutEntry(count, dims);
}
Composition
Composition is the core layout fold. Given layouts A and B,
composition(A, B) is the layout that applies B first and A second, matching
cute::composition in the open-source library. In functional notation:
C(i) = A(B(i))
The fold is legal when the image of B fits inside the domain of A. The result
takes B's shape tree, with strides drawn from A. With static layouts, the
result can be computed and interned immediately.
Optional<Layout> compose_layout(Layout a, Layout b) {
    // b runs first, so every offset b produces must be a valid index into a.
    if (cosize(b) > size(a)) {
        return none();
    }
    Shape shape = b.shape;                       // the result is indexed like b
    Stride stride = compose_stride_tree(a, b);   // a's strides, walked in b's order
    Layout result = normalize_layout(Layout(shape, stride));
    if (!is_valid_layout(result)) {
        return none();
    }
    return intern_layout(result);
}
Composition preserves hierarchy. Flatten too early and you lose information that divide/product regrouping and swizzle-aware atom selection still need.
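A brute-force table is a convenient oracle for any composition fold. The sketch below is an illustration under stated assumptions, not the tileiras algorithm: the hypothetical `FlatLayout` type drops hierarchy, a linear index is unflattened with the leftmost mode varying fastest (the open-source cuTe convention), and the operands are named `outer` and `inner` so the code does not depend on argument order. It returns the offsets `outer(inner(i))`, or nothing when the inner image leaves the outer domain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical flat layout: leaves listed left to right, hierarchy dropped.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

int64_t size_of(const FlatLayout& l) {
    int64_t n = 1;
    for (int64_t s : l.shape) n *= s;
    return n;
}

// Evaluate a layout at a linear index; the leftmost mode varies fastest.
int64_t eval(const FlatLayout& l, int64_t idx) {
    assert(idx >= 0 && idx < size_of(l));
    int64_t off = 0;
    for (std::size_t i = 0; i < l.shape.size(); ++i) {
        off += (idx % l.shape[i]) * l.stride[i];
        idx /= l.shape[i];
    }
    return off;
}

// Table of outer(inner(i)) for every i in the inner domain, or nullopt when
// some inner offset falls outside the outer domain.
std::optional<std::vector<int64_t>> compose_table(const FlatLayout& outer,
                                                  const FlatLayout& inner) {
    std::vector<int64_t> table;
    for (int64_t i = 0; i < size_of(inner); ++i) {
        int64_t k = eval(inner, i);
        if (k < 0 || k >= size_of(outer)) return std::nullopt;
        table.push_back(eval(outer, k));
    }
    return table;
}
```

For example, `compose_table(FlatLayout{{8}, {2}}, FlatLayout{{4}, {1}})` yields the offsets 0, 2, 4, 6, which is the table of the one-mode layout <4 : 2>; swapping the operands yields no table because the image of <8 : 2> leaves the four-element domain.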
Layout Primitives
The cute dialect represents tile layouts as nested-tuple (Shape, Stride) pairs. Six primitive operations compute layout
transformations. Each branches at entry on the 7-sentinel kind tag at *(type + 0x88) of the operand Layout-class Type
to handle the per-kind variation, then delegates to a per-kind handler resolved through the 16-entry dispatch table at
0x59B1DE0. The shape and stride trees stored inside a Layout share a single Tuple representation:
typedef struct Tuple {
/*+0x00*/ uint8_t kind; // 0 = leaf, 1 = tuple, 2 = dynamic
/*+0x08*/ union {
int64_t i; // leaf value
Tuple *t; // tuple children
};
/*+0x10*/ uint32_t n; // children count (for tuple kind)
} Tuple;
typedef struct Layout {
/*+0x00*/ Tuple shape;
/*+0x18*/ Tuple stride;
} Layout;
crd2idx(coord, Shape, Stride) -> idx converts a multi-dimensional coordinate into a linear memory offset. The walk
mirrors the congruence between the coordinate, shape, and stride trees: at a leaf, the coordinate is multiplied by the
matching stride leaf; at a tuple, the sum is taken over the children.
int64_t crd2idx(Tuple coord, Tuple shape, Tuple stride) {
if (isLeaf(coord)) return coord.i * stride.i;
int64_t sum = 0;
for (size_t i = 0; i < coord.n; ++i) sum += crd2idx(coord.t[i], shape.t[i], stride.t[i]);
return sum;
}
shape_div(Shape, divisor) -> Shape performs element-wise integer division across the shape tree with a rounding mode
(ceil, floor, or exact). Verifier sub_18B4200 (1 114 B) reports failure when any shape leaf does not divide cleanly
under the chosen rounding mode. ceil_div(a, b) -> ceil(a / b), verified by sub_18AC960 (1 432 B), is the helper used
by shape_div in ceil mode and by stage-count math elsewhere in the compiler.
int64_t ceil_div(int64_t a, int64_t b) {
    // Valid for a >= 0 and b > 0; a + b - 1 must not overflow int64_t.
    return (a + b - 1) / b;
}
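The three rounding modes can be sketched over a flattened list of shape leaves. This is an assumption-laden illustration: the `Rounding` enum and the `std::optional` failure signal are hypothetical names, hierarchy is dropped, and only exact mode is modelled as able to fail.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical rounding-mode tag for shape_div.
enum class Rounding { Ceil, Floor, Exact };

// ceil(a / b) for a >= 0 and b > 0.
int64_t ceil_div(int64_t a, int64_t b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

// Divide every leaf of a flattened shape by `divisor`.
// Returns nullopt when Exact mode meets a leaf that does not divide cleanly.
std::optional<std::vector<int64_t>> shape_div(const std::vector<int64_t>& shape,
                                              int64_t divisor, Rounding mode) {
    assert(divisor > 0);
    std::vector<int64_t> out;
    for (int64_t leaf : shape) {
        assert(leaf >= 0);
        switch (mode) {
        case Rounding::Ceil:
            out.push_back(ceil_div(leaf, divisor));
            break;
        case Rounding::Floor:
            out.push_back(leaf / divisor);
            break;
        case Rounding::Exact:
            if (leaf % divisor != 0) return std::nullopt;
            out.push_back(leaf / divisor);
            break;
        }
    }
    return out;
}
```

Dividing the shape (128, 40) by 32 gives (4, 2) in ceil mode and (4, 1) in floor mode, and fails in exact mode because 40 is not a multiple of 32.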
filter_zeros(Layout) at sub_18B3510 (3 298 B, the largest primitive) eliminates every Stride == 0 axis from a
layout. Zero-stride axes are broadcasting axes that do not address memory, and removing them is a prerequisite for
coalescing and for emitting correct TMA descriptors. The walk is recursive: a leaf with zero stride collapses to the
unit scalar layout (shape 1); a tuple keeps only those children whose recursive result is not the unit scalar layout.
Layout filter_zeros(Layout L) {
if (isLeaf(L)) return (L.stride == 0) ? scalarLayout(1) : L;
Tuple newShape, newStride;
for (size_t i = 0; i < L.n; ++i) {
Layout sub = filter_zeros(L.children[i]);
if (sub != scalarLayout(1)) { newShape.push(sub.shape); newStride.push(sub.stride); }
}
return Layout{newShape, newStride};
}
group_modes(Layout, indices) -> Layout at sub_18C5F40 (2 329 B) collapses the specified mode indices into a single
nested tuple, converting for example (M, N, K) into ((M, N), K). The operation is purely a regrouping: the leaves
and their order are preserved, only the tree shape changes.
coalesce(Layout) -> Layout merges adjacent axes when the inner (left, faster-varying) axis's stride times its shape
equals the outer (right) axis's stride, which is precisely the condition for the two axes to be contiguous in memory.
After filter_zeros has removed broadcast axes, coalesce reduces the remaining hierarchy as far as the contiguity test
allows without changing the function the layout computes.
Layout coalesce(Layout L) {
    Layout out = emptyLayout();
    for (size_t i = 0; i < L.n; ++i) {
        Layout next = L.children[i];   // candidate outer axis
        // out.back() is the inner axis accumulated so far.
        if (!out.empty() && out.back().stride * out.back().shape == next.stride) {
            out.back().shape = out.back().shape * next.shape;
        } else {
            out.push(next);
        }
    }
    return out;
}
complement(Layout, total_size) -> Layout returns a layout that addresses the elements of [0, total_size) not
already covered by the input. It is the stride remainder used by partition operations: given a tile layout that names
part of a tensor, the complement names the surrounding storage so the two together tile the whole array exactly once.
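The gap-filling idea can be sketched for flat layouts. The code follows the sorted-by-stride construction used by the open-source cuTe headers; the hypothetical `FlatLayout` type, the `std::optional` failure signal, and the decision to reject zero strides (run filter_zeros first) are assumptions of this sketch, not confirmed tileiras behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical flat layout: leaves listed left to right, hierarchy dropped.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

// Returns the layout that fills the gaps A leaves inside [0, total_size),
// or nullopt when A's modes overlap and no complement exists.
std::optional<FlatLayout> complement(const FlatLayout& a, int64_t total_size) {
    assert(a.shape.size() == a.stride.size() && total_size > 0);
    std::vector<std::pair<int64_t, int64_t>> modes;  // (stride, shape)
    for (std::size_t i = 0; i < a.shape.size(); ++i) {
        if (a.shape[i] == 1) continue;               // unit modes address nothing
        modes.emplace_back(a.stride[i], a.shape[i]);
    }
    std::sort(modes.begin(), modes.end());

    FlatLayout out;
    int64_t covered = 1;  // offsets in [0, covered) are already tiled
    for (const auto& [stride, shape] : modes) {
        if (stride <= 0 || stride % covered != 0) return std::nullopt;
        if (stride / covered > 1) {                  // gap below this stride
            out.shape.push_back(stride / covered);
            out.stride.push_back(covered);
        }
        covered = shape * stride;
    }
    int64_t rest = (total_size + covered - 1) / covered;
    if (rest > 1) {                                  // repeat up to total_size
        out.shape.push_back(rest);
        out.stride.push_back(covered);
    }
    return out;
}
```

For the tile <32 : 1> inside 128 elements this returns <4 : 32>; for <4 : 32> inside 256 elements it returns (32, 2) : (1, 128).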
Algebra Rules on Shape and Stride Tuples
The transformations above are not opaque routines. Each has an algebraic definition over the shape/stride tuple representation small enough to type-check by hand. Treat these rules as the canonical specification and the recursive walkers as one possible implementation.
Notation: S = (s_0, ..., s_{n-1}) is a shape tuple, D = (d_0, ..., d_{n-1}) is a stride tuple; both are
hierarchical (leaves may be sub-tuples). |S| is product(s_i) taken over the flattened leaves.
// composition(A, B) : domain(B) -> codomain(A), defined when image(B) fits inside domain(A).
//   (A ∘ B)(c) = A(B(c)); the result is Layout(S_B, D_C),
//   where D_C is obtained by walking A with the offset stream B produces.
//
// complement(A, M) : produces the unique layout C such that
//   |A| * |C| == M  AND  image(A) ∩ image(C) == {0}  AND  image(A) ⊕ image(C) covers [0, M).
//   Algorithm: take the sorted-by-stride flatten of A as boundary points (b_i, s_i),
//   then emit the missing-interval layout between consecutive boundaries.
//
// logical_divide(A, T)  = A ∘ (T, complement(T, |A|)) = (A ∘ T, A ∘ complement(T, |A|)) per divided mode.
// logical_product(A, B) = (A, complement(A, |A| * cosize(B)) ∘ B).
// coalesce(A): merge adjacent leaves (s_i, d_i), (s_{i+1}, d_{i+1}) when d_{i+1} == s_i * d_i.
// filter_zeros(A): replace every leaf with d_i == 0 by the scalar leaf shape(1).
Four rules — composition, complement, divide, product — generate the rest of the layout algebra. tiled_divide, flat_divide, zipped_divide, raked_product, and blocked_product are the same operation seen through different regrouping permutations of the resulting mode tree; their algebraic content matches logical_divide / logical_product exactly.
A useful sanity invariant: coalesce ∘ filter_zeros is idempotent and preserves layout meaning. Two layouts that become identical after this canonicalisation compare equal in any verifier-level equivalence check.
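The idempotence claim is easy to exercise on flat layouts. The sketch below is a simplified stand-in for the real tree walkers: `FlatLayout` and `canonicalize` are hypothetical names, hierarchy is dropped, zero-stride modes are removed outright, and unit-shape modes are skipped during coalescing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: leaves listed left to right, hierarchy dropped.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

bool operator==(const FlatLayout& a, const FlatLayout& b) {
    return a.shape == b.shape && a.stride == b.stride;
}

// Drop every broadcast (zero-stride) mode.
FlatLayout filter_zeros(const FlatLayout& l) {
    assert(l.shape.size() == l.stride.size());
    FlatLayout out;
    for (std::size_t i = 0; i < l.shape.size(); ++i) {
        if (l.stride[i] == 0) continue;
        out.shape.push_back(l.shape[i]);
        out.stride.push_back(l.stride[i]);
    }
    return out;
}

// Merge a mode into its left neighbour when it starts where the neighbour ends.
FlatLayout coalesce(const FlatLayout& l) {
    assert(l.shape.size() == l.stride.size());
    FlatLayout out;
    for (std::size_t i = 0; i < l.shape.size(); ++i) {
        if (l.shape[i] == 1) continue;  // unit modes address nothing new
        if (!out.shape.empty() &&
            out.shape.back() * out.stride.back() == l.stride[i]) {
            out.shape.back() *= l.shape[i];
        } else {
            out.shape.push_back(l.shape[i]);
            out.stride.push_back(l.stride[i]);
        }
    }
    return out;
}

// Canonical form used for equivalence checks in this sketch.
FlatLayout canonicalize(const FlatLayout& l) {
    return coalesce(filter_zeros(l));
}
```

Canonicalising (4, 3, 2) : (1, 0, 4) first drops the broadcast mode and then fuses the two remaining modes into <8 : 1>; running `canonicalize` a second time leaves that result unchanged.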
Worked Algebra Examples
The rules above are short enough to verify by hand, but every worked example below shows the full derivation so the results double as a regression oracle. Notation: a flat 1-D layout is written <shape : stride>; a 2-D layout is written (s0, s1) : (d0, d1). The coordinate-to-offset map at a leaf is coord -> coord * stride; at a tuple, it is the sum of the per-leaf products.
Composition, dense one-mode case
Inputs:
- A = <8 : 2>: domain [0, 8), offset map i -> 2i, image {0, 2, 4, 6, 8, 10, 12, 14}, cosize 15.
- B = <4 : 1>: domain [0, 4), offset map j -> j, image {0, 1, 2, 3}, cosize 4.
composition(A, B) applies B first and A second: C(j) = A(B(j)). Legality test: cosize(B) = 4 must be <= size(A) = 8. It is, so the fold is legal. The composite is C(j) = A(j) = 2j over domain [0, 4), yielding the one-mode layout <4 : 2>. The reversed fold composition(B, A) fails the same test: cosize(A) = 15 exceeds size(B) = 4, so A's image leaves B's domain and the verifier emits a domain-mismatch diagnostic.
Composition, hierarchical case
Inputs:
- A = (4, 2) : (1, 4): 4x2 column-major over [0, 8). At coord (r, c), offset is r + 4c.
- B = (2, 2) : (1, 2): 2x2 column-major selector, image {0, 1, 2, 3}, cosize 4.
Walk B and look up A at each image element. B(0,0) = 0 -> A(0,0) = 0. B(1,0) = 1 -> A(1,0) = 1. B(0,1) = 2 -> A(2,0) = 2. B(1,1) = 3 -> A(3,0) = 3. The resulting offsets {0, 1, 2, 3} are linear in the 2x2 B domain, so composition(A, B) = (2, 2) : (1, 2). The composite forgets the second mode of A because B's image never crosses the second-mode stride. This is the structural reason composition preserves mode hierarchy: the result mode tree is B's, populated by A's strides.
Complement, gap-filling case
Input:
- A = <4 : 32>: image {0, 32, 64, 96} inside the target domain [0, 256).
complement(A, 256) returns the layout C whose image, added to A's, reaches every offset in [0, 256) exactly once. The sorted-by-stride flatten of A has one leaf, with shape 4 and stride 32. Below that stride lies a gap of 32 contiguous offsets, which yields the mode <32 : 1>. A and that mode together tile [0, 4 * 32) = [0, 128), so reaching 256 takes 256 / 128 = 2 repetitions at stride 128, which yields the mode <2 : 128>. The complement is (32, 2) : (1, 128), with image {0, ..., 31} ∪ {128, ..., 159}. Check: |A| * |C| = 4 * 64 = 256, as the complement rule requires, and every offset in [0, 256) is a unique sum a + c with a in image(A) and c in image(C). Truncating the result to <32 : 1> would cover only [0, 128); that shorter layout is complement(A, 128). This is the form in which complement is consumed inside logical_divide, where the tiler's complement supplies the rest mode.
Logical divide, one-mode case
Inputs:
- A = <128 : 1>: contiguous run of 128 elements.
- T = <32 : 1>: tile of 32 contiguous elements.
logical_divide(A, T) produces (A ∘ T, A ∘ complement(T, |A|)) per divided mode. A ∘ T = <32 : 1> because composing two unit-stride layouts gives a unit-stride layout over the inner domain. complement(<32:1>, 128) returns the residual layout <4 : 32>, four copies stepping by 32 elements, and composing it with the unit-stride A leaves it unchanged. Combined, logical_divide(<128:1>, <32:1>) = ((32:1), (4:32)): mode 0 is the tile of size 32 at stride 1; mode 1 is the rest mode of size 4 at stride 32. Total size: 32 * 4 = 128, preserved.
Logical product, one-mode case
Inputs:
- A = <128 : 1>: contiguous run of 128 elements.
- B = <4 : 32>: four copies, indexed at 0, 32, 64, and 96; cosize 97.
logical_product(A, B) produces a 2-mode layout (A, A') where A' places one copy of A per element of B. The standard cuTe construction is (A, complement(A, |A| * cosize(B)) ∘ B). Here |A| * cosize(B) = 128 * 97 = 12416. complement(<128:1>, 12416) is <97 : 128>: 97 slots, each starting at a multiple of 128. Composing with B = <4:32> selects slots 0, 32, 64, and 96, giving <4 : 32 * 128> = <4 : 4096>. The result is ((128:1), (4:4096)): the inner tile is the 128-element A; the outer mode places 4 copies of that tile, 4096 elements apart. Total size: 512; the image is four disjoint 128-element runs rather than one contiguous range.
The reader who walks the same derivation with B = <4 : 1> arrives at the dense result: cosize(B) = 4, complement(<128:1>, 512) = <4 : 128>, and the product is ((128:1), (4:128)) with image [0, 512). In both cases B's strides are scaled by |A| through the complement, so the copies never overlap; the complement-then-compose construction enforces that automatically. Writing B's stride into the outer mode unscaled, as in ((128:1), (4:1)), would be illegal because the copies would overlap.
Coalesce, contiguous-mode case
Inputs:
- L = (4, 2) : (1, 4): two modes; the inner (left) mode steps by 1, the outer (right) mode by 4.
Contiguity test: stride(inner) * shape(inner) = 1 * 4 = 4 = stride(outer). The two modes are adjacent in memory and can fuse. coalesce returns <8 : 1>, one mode of shape 8 at stride 1, semantically identical but representationally flat.
A non-coalescible counter-example: L = (4, 2) : (1, 8). stride(inner) * shape(inner) = 4, but stride(outer) = 8. There is a 4-element gap per outer step; the modes cannot fuse. coalesce returns L unchanged. The row-major layout (2, 4) : (4, 1) is also left unchanged: its left mode has 4 * 2 = 8, which does not equal the right mode's stride of 1, and fusing the two would reorder the elements.
Filter-zeros, broadcasting-mode case
Input:
- L = (4, 3) : (1, 0): 4 contiguous elements broadcast 3 times.
filter_zeros walks the tree and replaces any leaf with stride 0 by the scalar shape-1 leaf, then drops the resulting scalar from the tuple. Result: <4 : 1>. The broadcast count of 3 is correct semantically but contributes no addressable memory; downstream coalesce and complement walks must not see it, or they would over-count the footprint.
Swizzle Operator
A swizzle<B, M, S> permutes bits of a linear SMEM offset so that consecutive elements in a tile land in different SMEM banks, eliminating conflicts when a warp issues 32-way parallel loads. The operator is a pure XOR-based permutation; it preserves the offset domain, so any layout L composed with a swizzle has the same size and cosize as L.
Bit-Manipulation Formula
uint32_t swizzle_apply(uint32_t B, uint32_t M, uint32_t S, uint32_t offset) {
    uint32_t mask = ((1u << B) - 1u) << (M + S);   // source field, S bits above the destination
    uint32_t bits_to_xor = (offset & mask) >> S;   // align the source field with bits [M, M + B)
    return offset ^ bits_to_xor;
}
The parameter triple has the following meanings (S >= 0 is assumed, matching cute::Swizzle<B, M, S> in the open-source headers):
- B: number of bits to swizzle. The mask (1 << B) - 1 is the field width.
- M: destination bit position. The XOR pattern lands at bits [M, M + B); the M bits below it stay constant.
- S: shift distance. The XOR pattern is read from bits [M + S, M + S + B) of the offset.
The forward transform writes output[M..M+B) = input[M..M+B) XOR input[M+S..M+S+B). The source field itself is never modified, so when S >= B (the two fields do not overlap) the transform is self-inverse: applying it twice returns the original offset, because XOR is its own inverse. That self-inverse property is why the same swizzle<B, M, S> mediates both stores and loads; there is no separate "unswizzle" operator.
Worked Example: swizzle<3, 4, 3>(0x40)
Step through swizzle<3, 4, 3>(offset = 0x40 = 0b1000000):
- Compute the mask: ((1 << 3) - 1) << (4 + 3) = 0b111 << 7 = 0x380.
- Extract source bits: (0x40 & 0x380) >> 3 = 0x0.
- Apply: 0x40 ^ 0x0 = 0x40. Output: 0x40.
This input lies outside the swizzle's active range: bit 6 (0x40) sits inside the destination field [4, 7) but below the source field [7, 10), so nothing is XORed in. Now try offset = 0x180 = 0b110000000:
- Mask: 0x380 (unchanged).
- Source extract: (0x180 & 0x380) >> 3 = 0x180 >> 3 = 0x30.
- Apply: 0x180 ^ 0x30 = 0x1B0.
Offset 0x180 swizzles to 0x1B0. The next row start 0x200 = 0b1000000000 swizzles to:
- Source extract: (0x200 & 0x380) >> 3 = 0x200 >> 3 = 0x40.
- Apply: 0x200 ^ 0x40 = 0x240.
Offsets 0x000, 0x080, 0x100, 0x180, 0x200, 0x280, 0x300, 0x380 (the starts of eight consecutive 128-byte rows) swizzle to 0x000, 0x090, 0x120, 0x1B0, 0x240, 0x2D0, 0x360, 0x3F0. Modulo 128 (32 banks times 4 bytes per bank word), the swizzled bank indices are 0, 4, 8, 12, 16, 20, 24, 28: all distinct, so the same 16-byte column of eight consecutive rows no longer collides on one bank group.
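The swizzle arithmetic can be checked mechanically. The sketch below assumes the open-source cute::Swizzle<B, M, S> semantics for non-negative S, where the source field sits S bits above the destination field at bits [M + S, M + S + B); `is_self_inverse` and `bank_of` are hypothetical helpers added for illustration, and `bank_of` assumes 32 banks of 4 bytes.

```cpp
#include <cassert>
#include <cstdint>

// XOR the B-bit field at [M + S, M + S + B) into the field at [M, M + B).
uint32_t swizzle_apply(uint32_t B, uint32_t M, uint32_t S, uint32_t offset) {
    assert(M + S + B < 32);
    uint32_t src_mask = ((1u << B) - 1u) << (M + S);
    return offset ^ ((offset & src_mask) >> S);
}

// True when applying the swizzle twice restores every offset below `limit`.
bool is_self_inverse(uint32_t B, uint32_t M, uint32_t S, uint32_t limit) {
    assert(S >= B);  // source and destination fields must not overlap
    for (uint32_t offset = 0; offset < limit; ++offset) {
        uint32_t once = swizzle_apply(B, M, S, offset);
        if (swizzle_apply(B, M, S, once) != offset) return false;
    }
    return true;
}

// Bank index of a byte offset, assuming 32 banks of 4 bytes each.
uint32_t bank_of(uint32_t byte_offset) {
    return (byte_offset % 128u) / 4u;
}
```

Under these assumptions `swizzle_apply(3, 4, 3, 0x180)` returns 0x1B0, and the start of row 3 of a 128-byte-wide tile lands in bank 12 instead of bank 0.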
Canonical SMEM Swizzle Modes
The Hopper SMEM swizzle modes use three triples. Each entry below names the byte-row width it accepts and the bit-manipulation parameters it carries.
| Swizzle | Bytes per row | B | M | S | Used by |
|---|---|---|---|---|---|
| swizzle<0, 0, 0> | any | 0 | 0 | 0 | no swizzle, plain row-major SMEM |
| swizzle<1, 4, 3> | 32 | 1 | 4 | 3 | 32-B WGMMA operand sub-tile |
| swizzle<2, 4, 3> | 64 | 2 | 4 | 3 | 64-B WGMMA operand sub-tile |
| swizzle<3, 4, 3> | 128 | 3 | 4 | 3 | canonical 128-B Hopper SMEM swizzle |
Reading the table: M = 4 and S = 3 are fixed across all non-zero modes; only B varies. B = 3 produces an 8-row × 8-bank rotation; B = 2 produces 4-row × 4-bank; B = 1 produces 2-row × 2-bank. The WGMMA SMEM descriptor's swizzle_mode field encodes the four table rows, top to bottom, as 0/3/2/1 (no swizzle = 0, 32-B = 3, 64-B = 2, 128-B = 1); see the SMEM-Descriptor Construction section of MMA Atoms SM70-SM120 for the descriptor packing.
A swizzle is part of the layout's identity. Two layouts L1 and L2 with the same (shape, stride) but different swizzles produce different memory access patterns; verifier equivalence treats them as distinct.
Cross-references: Verifiers — LayoutTypeInterface Kind Discriminator for the seven LayoutTypeInterface sentinels read from
*(type + 0x88) and the per-primitive verifier table, and cute_nvgpu TMA Atoms — Descriptor Builder for the descriptor
builders that consume these primitives. Tile and Divide Operations — Divide Variants and Product Variants show how the worked examples above re-group when the divide/product result is named through the cute.* op surface. MMA Atoms SM70-SM120 — SMEM-Descriptor Construction consumes the swizzle parameters in the canonical Hopper descriptor word.
Candidate Records
The implementation carries two conceptual layout-candidate records:
| Record | Role | Fields |
|---|---|---|
| Simple candidate | Parser-local basis entry. | Count, denominator, dimension list, layout-kind tag. |
| Rich candidate | Layout-assignment candidate. | Basis list, layout kind, optional swizzle, normalized stride, uniqued layout identity, reference state. |
The simple candidate must never escape parsing. The rich candidate is what layout assignment, convert-layout materialization, and atom planning consume. Splitting the two keeps every parsed token from dragging along state only the layout search ever reads.
Round Trip
A valid descriptor should round-trip through parse, normalize, print, and parse again without changing the represented layout. Whitespace and redundant grouping may change; basis order and meaning must not.
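bool round_trips(StringRef descriptor) {
    // `ok` is a hypothetical success flag on ParseResult; a descriptor that
    // fails to parse cannot round-trip.
    ParseResult first = parse_layout(descriptor);
    if (!first.ok) return false;
    StringRef printed = print_layout(first.value);
    ParseResult second = parse_layout(printed);
    return second.ok && layouts_equivalent(first.value, second.value);
}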
For diagnostic output, printers should prefer the basis notation users can read:
(1@0, 1@1) is better than dumping the internal tree object.
Invariants
- Shape and stride trees are congruent.
- size is the product of shape leaves.
- cosize is the maximum reachable offset plus one.
- Composition is defined only when the inner image fits the outer domain.
- Normalization may simplify hierarchy but must not change the layout function.
- Parser output is either a valid candidate or a precise parse failure.
- Rich layout candidates carry swizzle and assignment metadata; simple parser candidates do not.
cute Tile and Divide Operations
Abstract
Tile and divide ops are the layout-partitioning toolkit cute exposes before any hardware atom is selected. They build shapes, coordinates, layouts, and views; simplify layouts via coalesce, filter, and complement; split layouts into tile and rest modes; form Cartesian products; and compose layouts into new coordinate maps. None of them lower straight to PTX. They shape the layout algebra that later cute_nvgpu, NVGPU, and TileAS passes consume.
Builder Operations
| Operation | Contract |
|---|---|
| cute.make_shape | Build a shape or integer tuple from integer leaves. |
| cute.make_coord | Build a coordinate tuple from integer leaves. |
| cute.make_layout | Build a layout from shape and optional stride. |
| cute.make_identity_layout | Build a unit-stride identity layout for a shape. |
| cute.make_identity_tensor | Build an identity coordinate tensor for a shape. |
| cute.make_ordered_layout | Build a layout with stride order determined by an order tuple. |
| cute.make_tuple | General tuple constructor used by textual and desugared builders. |
| cute.make_view | Bind a pointer or iterator to a layout-backed view. |
Builder verification is mostly kind checking. Shapes must be shape-like, coords coord-like, layouts must carry compatible shape and stride structure, and views must bind a valid layout to an addressable object.
LogicalResult verify_make_layout(MakeLayoutOp op) {
require(is_shape_like(op.shape));
if (op.stride.has_value) {
require(is_stride_like(op.stride.value));
require(weakly_congruent(op.shape.type, op.stride.value.type));
}
return success();
}
Canonicalizers
coalesce, filter_zeros, and complement normalize layouts before divide
and product operations consume them.
| Operation | Contract |
|---|---|
| cute.coalesce | Merge contiguous modes into the smallest equivalent rank. |
| cute.filter_zeros | Collapse zero-stride broadcast dimensions to shape-one modes. |
| cute.complement | Compute the layout that covers the target domain not covered by the input. |
Layout filter_zeros(Layout input, Optional<Profile> target_profile) {
Layout result = input;
for (Mode &mode : result.modes) {  // iterate by reference so the write below is kept
if (mode.stride == 0) {
mode.shape = 1;
}
}
if (target_profile.has_value) {
require(profile_matches(result, target_profile.value));
}
return normalize_layout(result);
}
Divide Variants
Divide operations split an input layout A by a tiler T. Each divided mode
produces a tile component and a rest component. The variants differ only in how
they regroup those components.
| Operation | Regrouping |
|---|---|
| cute.logical_divide | Each divided mode becomes (tile_i, rest_i) in place. |
| cute.tiled_divide | The first mode is the tuple of all tile modes; rest modes follow. |
| cute.flat_divide | Tile modes, rest modes, and untouched outer modes are flattened. |
| cute.zipped_divide | Tile modes and rest modes are grouped into sibling tuples. |
| cute.stencil_divide | Sliding-window divide with window, stride, dilation, and padding-like bounds. |
DividedLayout divide_layout(Layout input, Layout tiler, DivideMode mode) {
require(rank(tiler) <= rank(input));
SmallVector<Mode> tile_modes;
SmallVector<Mode> rest_modes;
SmallVector<Mode> untouched_modes;
for (int axis = 0; axis < rank(input); ++axis) {
if (axis < rank(tiler)) {
Division part = divide_mode(input.mode(axis), tiler.mode(axis));
tile_modes.push(part.tile);
rest_modes.push(part.rest);
} else {
untouched_modes.push(input.mode(axis));
}
}
return regroup_division(tile_modes, rest_modes, untouched_modes, mode);
}
Inner and outer divide are one partition viewed from opposite ends of the mode tree. The cleanest implementation normalises outer divide by reversing the relevant modes, running inner divide, then reversing the regrouped result.
DividedLayout outer_divide(Layout input, Layout tiler, DivideMode mode) {
Layout flipped_input = reverse_modes(input);
Layout flipped_tiler = reverse_modes(tiler);
DividedLayout divided = divide_layout(flipped_input, flipped_tiler, mode);
return reverse_modes(divided);
}
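The per-mode step that divide_layout delegates to divide_mode can be sketched for the simplest case. The sketch assumes the tiler mode is a contiguous extent (unit stride) and that the input mode is a single leaf; the `Mode` and `Division` structs are hypothetical, and a non-dividing tile is reported through `std::optional` rather than a verifier diagnostic.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical single-leaf mode: extent and stride.
struct Mode {
    int64_t shape;
    int64_t stride;
};

// Tile and rest components produced by dividing one mode.
struct Division {
    Mode tile;
    Mode rest;
};

// Split one mode by a contiguous tile of `tile_extent` elements.
// The tile keeps the input stride; the rest mode steps over whole tiles.
std::optional<Division> divide_mode(Mode input, int64_t tile_extent) {
    assert(input.shape > 0);
    if (tile_extent <= 0 || input.shape % tile_extent != 0) return std::nullopt;
    Mode tile{tile_extent, input.stride};
    Mode rest{input.shape / tile_extent, input.stride * tile_extent};
    return Division{tile, rest};
}
```

Dividing the mode 128 : 128 by a tile of 64 gives the tile 64 : 128 and the rest 2 : 8192; a tile of 40 does not divide 128 and is rejected.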
Stencil Divide
stencil_divide is the convolution and sliding-window form. For each selected dimension it counts the output positions a window produces:
int64_t stencil_output_len(int64_t input,
int64_t window,
int64_t stride,
int64_t dilation) {
require(input > 0);
require(window > 0);
require(stride > 0);
require(dilation > 0);
int64_t effective_window = (window - 1) * dilation + 1;
require(input >= effective_window);
return 1 + (input - effective_window) / stride;
}
The result mode carries both the window coordinate and the output coordinate. Lowering then maps the window coordinate to per-lane fetches and the output coordinate to the destination tile.
Product Variants
Product operations compute a Cartesian product of layouts and regroup the result. They are the symmetric counterpart of divide.
| Operation | Regrouping |
|---|---|
| cute.logical_product | Pair corresponding modes from the two operands. |
| cute.tiled_product | Gather the tiler modes into a leading tuple. |
| cute.flat_product | Flatten input and tiler modes into one mode list. |
| cute.zipped_product | Group input modes and tiler modes as sibling tuples. |
| cute.raked_product | Interleave modes for raked replication patterns. |
| cute.blocked_product | Replicate blocks as tile-of-tile structure. |
Layout product_layout(Layout lhs, Layout rhs, ProductMode mode) {
require(is_layout_like(lhs));
require(is_layout_like(rhs));
SmallVector<Mode> lhs_modes = modes(lhs);
SmallVector<Mode> rhs_modes = modes(rhs);
return regroup_product(lhs_modes, rhs_modes, mode);
}
If You Know CUTLASS (open source) — cross-walk
The divide and product family maps almost one-to-one onto the open-source cute/ C++ headers:
| CUTLASS C++ (cute::) | tileiras cute.* op |
|---|---|
| logical_divide(layout, tiler) | cute.logical_divide |
| zipped_divide(layout, tiler) | cute.zipped_divide |
| tiled_divide(layout, tiler) | cute.tiled_divide |
| flat_divide(layout, tiler) | cute.flat_divide |
| local_tile(tensor, tiler, coord, mode) | cute.local_tile |
| local_partition(tensor, tiler, coord, mode) | cute.local_partition |
| logical_product(A, B) | cute.logical_product |
| zipped_product, tiled_product, flat_product | same names under cute.* |
| blocked_product, raked_product | same names under cute.* |
| composition(A, B) | cute.composition |
| coalesce(A) | cute.coalesce |
| filter(A) (zero-stride filter) | cute.filter_zeros |
| complement(A, total_size) | cute.complement |
Each op's algebraic semantics match the open-source library: ranks, modes, tile shapes, and result mode-tree structure are preserved. The differences are representational — hierarchy lives in nested (shape, stride) trees rather than C++ template parameter packs, and verification happens through an MLIR verifier rather than a static_assert chain.
Builder Op IR Signatures
The four most common builders carry one operand kind and one result kind each. The MLIR signatures and a worked before/after let a reader trace the IR shape end-to-end.
cute.make_shape
%shape = cute.make_shape [%m, %n, %k]
: (index, index, index) -> !cute.shape<3>
Operands are integer leaves (or nested tuples produced by an inner cute.make_shape); the result is a rank-3 shape value. Verifier rule: each operand must be index-typed or a !cute.shape of compatible rank, and the result rank must equal the operand count for the top-level builder.
cute.make_layout
%layout = cute.make_layout(%shape, %stride)
: (!cute.shape<3>, !cute.stride<3>) -> !cute.layout<3>
The stride operand is optional; when absent, the builder synthesises a column-major identity stride from the shape. Verifier rule: weakly_congruent(shape, stride) — the two trees must match in mode count at every level, though leaf values may be dynamic.
cute.make_identity_layout
%layout = cute.make_identity_layout(%shape) : (!cute.shape<2>) -> !cute.layout<2>
Synthesises a layout whose offset map is the identity over [0, size(shape)). For shape = (4, 2) the synthesised stride is (1, 4) — column-major identity. The result is congruent with the shape, has size equal to the product of shape leaves, and is verified by the same weakly_congruent predicate as make_layout.
cute.tiled_divide
%divided = cute.tiled_divide(%layout, %tiler)
: (!cute.layout<R>, !cute.tiler<T>) -> !cute.layout<...>
The result rank depends on the regrouping (see Divide Variants). The verifier enforces rank(tiler) <= rank(layout) and, per partitioned mode, the divisibility predicate shape(layout, axis) % shape(tiler, axis) == 0 when both are static.
Worked Example: tiled divide of a 128x128 column-major tensor
Input IR:
%shape = cute.make_shape [%c128, %c128] : (index, index) -> !cute.shape<2>
%layout = cute.make_identity_layout(%shape) : (!cute.shape<2>) -> !cute.layout<2>
// %layout has shape (128, 128) and stride (1, 128).
%tile_shape = cute.make_shape [%c64, %c64] : (index, index) -> !cute.shape<2>
%tile = cute.make_layout(%tile_shape) : (!cute.shape<2>) -> !cute.layout<2>
// %tile has shape (64, 64) and stride (1, 64).
%divided = cute.tiled_divide(%layout, %tile)
: (!cute.layout<2>, !cute.layout<2>) -> !cute.layout<3>
After divide the result layout has the form ((tile_M, tile_N), rest_M, rest_N) — the tile modes group into a leading tuple per the tiled_divide regrouping. With M = N = 128 and tile 64 x 64:
- tile_M = (64 : 1), tile_N = (64 : 128): the tile carries its own M and N strides.
- rest_M = (2 : 64), rest_N = (2 : 8192): two tiles along M (stride = tile_M extent * M stride = 64 * 1), two tiles along N (stride = tile_N extent * N stride = 64 * 128).
Result layout: ((64, 64), 2, 2) : ((1, 128), 64, 8192). Size: 64 * 64 * 2 * 2 = 16384 = 128 * 128. The verifier checks the divisibility predicate: 128 % 64 == 0 on both axes; the rank table (rank(layout)=2, rank(tile)=2 -> rank(result) in {2, 3}) from the Tiled partition verifier is satisfied with rank(result) = 3.
A failure case, tile = (40, 64): the divisibility predicate fails on the M axis (128 % 40 != 0); the verifier emits the format-string prefix "expects same size in rank 0 but got srcShape: " followed by the printed source and destination shapes, and the op never lowers.
Worked Example: logical divide preserving hierarchy
Same %layout = (128, 128) : (1, 128), but with %tile_logical = (32, 16) : (1, 32) and the logical_divide regrouping:
%divided_logical = cute.logical_divide(%layout, %tile_logical)
: (!cute.layout<2>, !cute.layout<2>) -> !cute.layout<2>
logical_divide keeps the original mode count. Each mode splits into (tile_i, rest_i):
- Mode 0: tile_0 = (32 : 1), rest_0 = (4 : 32), i.e. 4 tiles of 32 along M.
- Mode 1: tile_1 = (16 : 128), rest_1 = (8 : 2048), i.e. 8 tiles of 16 along N.
Result: ((32, 4), (16, 8)) : ((1, 32), (128, 2048)). The mode tree retains its rank-2 outer shape; the tile and rest live inside each mode as a nested pair. tiled_divide of the same inputs would produce a flatter ((32, 16), 4, 8) regrouping — same image, different mode tree.
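A divide must leave the covered offsets unchanged, which gives a cheap brute-force check for the two worked results. The sketch below is a test oracle under stated assumptions, not verifier code: the hypothetical `FlatLayout` type flattens the mode tree left to right, and a linear index is unflattened with the leftmost mode varying fastest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: leaves listed left to right, hierarchy dropped.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

// True when the layout reaches every offset in [0, total) exactly once.
bool covers_exactly_once(const FlatLayout& l, int64_t total) {
    assert(l.shape.size() == l.stride.size() && total > 0);
    int64_t size = 1;
    for (int64_t s : l.shape) size *= s;
    if (size != total) return false;

    std::vector<bool> seen(total, false);
    for (int64_t idx = 0; idx < size; ++idx) {
        int64_t rem = idx;
        int64_t off = 0;
        for (std::size_t i = 0; i < l.shape.size(); ++i) {
            off += (rem % l.shape[i]) * l.stride[i];
            rem /= l.shape[i];
        }
        if (off < 0 || off >= total || seen[off]) return false;
        seen[off] = true;
    }
    return true;
}
```

Flattened, the logical_divide result is (32, 4, 16, 8) : (1, 32, 128, 2048) and the tiled_divide result is (64, 64, 2, 2) : (1, 128, 64, 8192); both cover all 16384 offsets exactly once, while a rest_N stride of 4096 instead of 8192 produces collisions.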
Composition
cute.composition is the binary layout-function composition primitive.
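Optional<Layout> verify_and_compose(Layout lhs, Layout rhs) {
    require(is_layout_like(lhs));
    require(is_layout_like(rhs));
    // lhs is the outer layout and rhs the inner one: result(c) = lhs(rhs(c)).
    // The inner image must fit inside the outer domain.
    if (cosize(rhs) > size(lhs)) {
        return none();
    }
    return compose_layout(lhs, rhs);
}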
Composition underlies most divide and product rewrites. Divide uses the tiler's inverse and complement to split the input; product uses composition with a regrouping permutation.
Invariants
- rank(tiler) <= rank(input) for divide operations.
- Divide does not change the covered coordinate set; it only exposes tile and rest coordinates.
- Product expands the coordinate set as a Cartesian product.
- Coalesce, filter, and complement preserve layout meaning while changing representation.
- Stencil divide requires positive window, stride, and dilation values.
- Composition is legal only when the inner image fits the outer domain.
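The last invariant can be illustrated with a pointwise model of composition. This is a minimal sketch, assuming flat layouts with extents of at least 1 and non-negative strides; `Layout`, `layout_size`, `layout_cosize`, `apply_layout`, `composition_fits`, and `compose_at` are hypothetical names. It evaluates outer(inner(i)) one index at a time and does not build the composed layout the way the real algebra does.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct Mode {
    std::int64_t extent;
    std::int64_t stride;
};
using Layout = std::vector<Mode>;  // flat layout, leftmost mode fastest

// Number of coordinates in the layout's domain.
std::int64_t layout_size(const Layout& l) {
    std::int64_t n = 1;
    for (const Mode& m : l) n *= m.extent;
    return n;
}

// One past the largest offset the layout can produce.
std::int64_t layout_cosize(const Layout& l) {
    std::int64_t last = 0;
    for (const Mode& m : l) last += (m.extent - 1) * m.stride;
    return last + 1;
}

// Map a flat index to an offset.
std::int64_t apply_layout(const Layout& l, std::int64_t index) {
    std::int64_t offset = 0;
    for (const Mode& m : l) {
        offset += (index % m.extent) * m.stride;
        index /= m.extent;
    }
    return offset;
}

// The inner image fits the outer domain when cosize(inner) <= size(outer).
bool composition_fits(const Layout& outer, const Layout& inner) {
    return layout_cosize(inner) <= layout_size(outer);
}

// Evaluate outer(inner(index)), or nothing when the composition is illegal.
std::optional<std::int64_t> compose_at(const Layout& outer, const Layout& inner,
                                       std::int64_t index) {
    assert(index >= 0 && index < layout_size(inner));
    if (!composition_fits(outer, inner)) return std::nullopt;
    return apply_layout(outer, apply_layout(inner, index));
}
```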
Tiled partition verifier
sub_196AFF0 is the shared verifier for cute.copy, cute.tiled_partition, cute.tiled_divide, and the other partition-emitting ops in this family. One routine, 13 349 bytes, 27 distinct diagnostic strings — and despite the size it walks a single linear pipeline. The verifier never selects an atom and never inspects target-specific state; it only checks that operand shapes, the predicate operand, and the residual atom-v-rank line up with the op's declared partitioning contract.
Phase one is the rank cross-check. For cute.copy(A, C) and its tiled-partition siblings, source and destination ranks satisfy a small relation rather than strict equality, because partition ops legally drop or fold one rank between input and output:
| rank(A) | Legal rank(C) |
|---|---|
| 1 | 1 or 2 |
| 2 | 2 or 3 |
| 3 | 3 |
When the pair falls outside this table the verifier emits the format-string prefix "expects same size in rank" followed by the disagreeing rank, then " but got srcShape: " and the printed source and destination shape tuples. The diagnostic keys on the disagreeing rank, not the operand pair, so a rank-3-to-rank-1 failure reports the first rank that cannot be reconciled rather than the overall pair.
Phase two runs only when the op carries the optional pred operand. The predicate is a tile-shaped mask that suppresses out-of-bounds lanes inside a partitioned copy, and it must share the same memref-shaped envelope as the data tiles. Concretely: pred must be a CuteMemRefType, its memory space must be one of rmem, smem, gmem, or generic, and its layout's swizzle component must be the identity. Bit-reversal swizzles are rejected here because a swizzled predicate would reorder mask bits relative to the data lanes they gate, breaking the per-lane correspondence the lowering relies on. On failure the verifier emits the matching diagnostic verbatim: "pred must be a CuteMemRefType", "pred memory space invalid", or the swizzle-identity message.
Phase three handles restAtomVRank retiling. When the op replicates an atom multiple times across the tile, the residual atom-v-rank is the set of dimensions the atom's natural shape does not consume. The verifier walks each residual dimension and checks that it tiles cleanly into the corresponding operand layout extent — that is, the operand extent is a multiple of the atom extent along that axis. This is the same divisibility check cute.tiled_divide enforces on its tiler argument, lifted into the partition verifier so copy and partition ops share one feasibility predicate.
The ordering is deliberate: phase one rejects rank-shape mismatches before phase two looks at predicate type, and both run before phase three touches the atom-v-rank walk. A reimplementation should keep that ordering. It lets the diagnostics name the first thing that went wrong rather than the deepest layer, and it lets the residual-rank walk assume rank and predicate have already been normalised.
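The three-phase ordering can be sketched as a function that returns the first diagnostic it finds. `PartitionOp` and its fields are a hypothetical, flattened stand-in for the real operands, not the structure sub_196AFF0 reads. The rank table and the first three diagnostic strings come from the text above; the swizzle and residual-rank messages are placeholders because their wording is not quoted here.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified view of what the verifier inspects.
struct PartitionOp {
    int src_rank;
    int dst_rank;
    bool has_pred;
    bool pred_is_memref;         // stands in for "pred is a CuteMemRefType"
    bool pred_space_ok;          // rmem, smem, gmem, or generic
    bool pred_swizzle_identity;  // swizzle component is the identity
    std::vector<long> operand_extents;  // operand extents on the residual axes
    std::vector<long> atom_extents;     // atom extents on the same axes
};

// Phase-one rank relation from the table above.
bool rank_pair_is_legal(int src, int dst) {
    switch (src) {
        case 1: return dst == 1 || dst == 2;
        case 2: return dst == 2 || dst == 3;
        case 3: return dst == 3;
        default: return false;
    }
}

// Returns the first diagnostic prefix, or std::nullopt on success.
std::optional<std::string> verify_partition(const PartitionOp& op) {
    // Phase one: rank cross-check.
    if (!rank_pair_is_legal(op.src_rank, op.dst_rank))
        return std::string("expects same size in rank");
    // Phase two: predicate checks, only when pred is present.
    if (op.has_pred) {
        if (!op.pred_is_memref) return std::string("pred must be a CuteMemRefType");
        if (!op.pred_space_ok) return std::string("pred memory space invalid");
        if (!op.pred_swizzle_identity)
            return std::string("<swizzle-identity message>");  // placeholder text
    }
    // Phase three: every residual extent must be a multiple of the atom extent.
    assert(op.operand_extents.size() == op.atom_extents.size());
    for (std::size_t i = 0; i < op.atom_extents.size(); ++i) {
        if (op.atom_extents[i] <= 0 || op.operand_extents[i] % op.atom_extents[i] != 0)
            return std::string("<residual atom-v-rank message>");  // placeholder text
    }
    return std::nullopt;
}
```

An op that fails both the rank relation and a later phase reports only the rank diagnostic, which is the behaviour the ordering is meant to guarantee.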
Cross-References
Layout Algebra and Descriptor Grammar — Worked Algebra Examples derives the same logical_divide and tiled_divide results at the shape/stride tuple level without the MLIR op wrapper, complementing the IR-level walkthroughs in this page. Algebra Rules on Shape and Stride Tuples gives the canonical specification of composition, complement, divide, and product that every cute.* op in this page implements. Verifiers — LayoutTypeInterface Kind Discriminator covers the per-kind dispatch that the divide and product verifiers route through. SM Tier Roster and Copy Atom Registry — Atom TypeID Registry shows the copy and MMA atoms whose tile-shape contracts these divide and product ops feed.
cute Atom Builders and Desugar
Abstract
cute uses atoms to stand in for hardware-sized copy, prefetch, and MMA instructions before any target-specific lowering runs. High-level builders — make_atom, make_tiled_copy, make_tiled_mma — construct typed atom values. CuteDesugar expands syntactic sugar into primitive cute, arith, scf, memref, and LLVM-compatible operations. The final cute-to-LLVM conversion strips out the remaining target-neutral layout helpers. The contract that ties the three stages together is the result-type interface: the desugar pass and the LLVM conversion both ask the result type what operands and attributes to produce.
Atom Builder Contract
cute.make_atom is generic. The result type decides whether the atom is an MMA, copy, prefetch, or other atom-like value; the builder queries the result-type interface rather than guessing from operand count. The dispatcher reads a TypeID slot off the result type, looks up the matching atom interface, and forwards the operand and attribute bundles to the per-atom builder.
The atom call ops — cute.copy_atom_call and cute.mma_atom_call — wrap the underlying atom value with the operand layouts the call site supplies. Verification covers layout rank, operand rank, and atom-instance compatibility against the target; the selected atom type carries the SM-specific rules, so the generic cute dialect never branches on SM tier directly.
Per-Atom Desugar Rewrites
CuteDesugar rewrites convenience syntax into smaller primitives. It does not select SM-specific instructions — that is the next pass. Its job is to make layout, coordinate, view, and atom construction explicit enough for ordinary conversion patterns to handle, in three ordered phases per op family: shape-eval, stride-eval, and composed-layout construction.
make_layout → make_int_tuple + make_layout_raw
cute.make_layout consumes a shape and a stride and produces a !cute.layout. The desugar pass rewrites it into the tuple-construction primitives so later passes can fold equal shapes and equal strides without re-parsing the sugar.
// Before
%L = cute.make_layout(shape = (M, N, K), stride = (s_M, s_N, s_K))
: !cute.layout
// After
%shape = cute.make_int_tuple %M, %N, %K : !cute.int_tuple
%stride = cute.make_int_tuple %s_M, %s_N, %s_K : !cute.int_tuple
%L = cute.make_layout_raw %shape, %stride : !cute.layout
The rewrite preserves operand order. make_int_tuple is the shared constructor for compile-time integer tuples; canonicalisation later compares equal-valued tuples structurally so two layouts built from the same shape produce identical SSA values.
make_shape and make_stride → tuple construction
These two are pure shape-eval and stride-eval respectively.
// Before
%S = cute.make_shape (M, N) : !cute.shape
// After
%S = cute.make_int_tuple %M, %N : !cute.shape
The result type narrows from !cute.int_tuple to !cute.shape (or !cute.stride) through a kind-discriminator field on the tuple type. The desugar pass writes the kind directly on the constructor; no separate cast is inserted.
make_coord and make_tile → tuple construction + kind tag
Both rewrite the same way, distinguished only by the trailing kind tag.
%C = cute.make_int_tuple %i, %j, %k : !cute.coord
%T = cute.make_int_tuple %M, %N, %K : !cute.tile
make_composed_layout → outer + inner + offset
cute.make_composed_layout expands into three components: the inner layout (typically a swizzle), the outer layout (the value layout the swizzle composes with), and an integer-tuple offset. The rewrite builds the inner layout and the offset and passes the outer layout through unchanged.
// Before
%C = cute.make_composed_layout(outer = %outer,
swizzle = swizzle<B, M, S>,
offset = (0, 0))
: !cute.layout
// After
%inner = cute.make_swizzle B, M, S : !cute.swizzle
%offset = cute.make_int_tuple %c0, %c0 : !cute.int_tuple
%C = cute.make_composed_layout_raw %outer, %inner, %offset : !cute.layout
The rewrite preserves the upstream CUTLASS rule that the inner component is the address-bit permutation and the outer component is the value-layout — flipping their order produces a different layout and the verifier catches it as a compose_layout failure.
View construction → layout extraction + make_view
A cute.make_tensor or cute.tensor form expands into three primitives: take the pointer/memref, take the layout, and reassemble through make_view.
%layout = cute.get_layout %tensor : !cute.layout
%iter = cute.get_iter %tensor : !cute.iter
%v = cute.make_view %iter, %layout : !cute.view
equal over views or layouts → shape equality + stride equality + andi
The sugar form cute.equal lifts a logical equality test over composite types. Desugaring expands it into the per-field equalities the IR can fold.
%s1 = cute.get_shape %L1 : !cute.shape
%s2 = cute.get_shape %L2 : !cute.shape
%t1 = cute.get_stride %L1 : !cute.stride
%t2 = cute.get_stride %L2 : !cute.stride
%seq = cute.tuple_eq %s1, %s2 : i1
%teq = cute.tuple_eq %t1, %t2 : i1
%r = arith.andi %seq, %teq : i1
make_atom with atom interface → result-type-driven rebuild
Atom sugar is rebuilt through the result type's atom interface. The desugar pass walks the result type's TypeID, dispatches to the per-atom builder, and replaces the sugar op with the atom-specific construction shape. For MMA atoms the rebuild expands into an atom value plus an attribute bundle (MMA shape, element types, accumulator type); for copy atoms it expands into an atom value plus a copy-shape attribute. The actual instruction selection still belongs to the cute_nvgpu lowering pass that runs later.
Dynamic print → coord loop
Dynamic print is the most involved desugaring because it builds an scf.for over the flattened coordinate domain. It is strictly a debugging transform — not a data-layout optimisation.
void rewrite_dynamic_print(PrintOp op) {
Shape shape = infer_runtime_shape(op.value);
int64_t total = product(shape);
scf_for(0, total, 1, [&](Value flat_index) {
Coord coord = flat_to_coord(flat_index, shape);
Value element = cute_memref_load(op.value, coord);
emit_scalar_print(op.format, coord, element);
});
erase(op);
}
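The `flat_to_coord` step in the pseudocode above is not spelled out. A minimal sketch, assuming the flattened domain is traversed with the leftmost mode varying fastest (the usual CuTe convention, but an assumption here), could look like this. `flat_domain_size` is a hypothetical stand-in for `product`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Total number of coordinates in a flat shape.
std::int64_t flat_domain_size(const std::vector<std::int64_t>& shape) {
    std::int64_t total = 1;
    for (std::int64_t extent : shape) total *= extent;
    return total;
}

// Convert a flat index into a coordinate, leftmost mode fastest.
std::vector<std::int64_t> flat_to_coord(std::int64_t flat,
                                        const std::vector<std::int64_t>& shape) {
    assert(flat >= 0 && flat < flat_domain_size(shape));
    std::vector<std::int64_t> coord;
    coord.reserve(shape.size());
    for (std::int64_t extent : shape) {
        assert(extent > 0);
        coord.push_back(flat % extent);  // position inside this mode
        flat /= extent;                  // carry the remainder to the next mode
    }
    return coord;
}
```

With shape (2, 3, 4) the loop bound is 24, and flat index 23 maps to coordinate (1, 2, 3).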
Rewrite Order
The order matters for three reasons, one per phase. Shape-eval runs first because every later step reads the shape tuple to drive its own rewrite: composed-layout construction needs the shape to size the offset tuple, and view construction needs the shape to validate the layout against the iter. Stride-eval runs second because composed-layout construction needs both shape and stride to call make_composed_layout_raw. Composed-layout construction runs last because all earlier ops have already been replaced with their primitive forms and the layout constructor sees only stable SSA values for its operands.
void run_cute_desugar(Module module) {
for (Operation op : module.walk()) {
// Phase 1: shape-eval sugar.
if (is_make_shape_sugar(op)) rewrite_make_shape(op);
else if (is_make_coord_sugar(op)) rewrite_make_coord(op);
else if (is_make_tile_sugar(op)) rewrite_make_tile(op);
}
for (Operation op : module.walk()) {
// Phase 2: stride-eval sugar.
if (is_make_stride_sugar(op)) rewrite_make_stride(op);
}
for (Operation op : module.walk()) {
// Phase 3: layout / view / atom construction.
if (is_make_layout_sugar(op)) rewrite_make_layout(op);
else if (is_make_composed_sugar(op)) rewrite_make_composed_layout(op);
else if (is_view_sugar(op)) rewrite_view_construction(op);
else if (is_equal_sugar(op)) rewrite_equal(op);
else if (is_dynamic_print(op)) rewrite_dynamic_print(op);
else if (is_atom_builder(op)) rewrite_atom_builder(op);
}
}
Target-Neutral LLVM Conversion
Once desugaring is done, target-neutral cute helpers lower into stock MLIR and LLVM ops. The conversion covers tuple construction, layout field access, integer tuple arithmetic, descriptor iterators, pointer casts, pointer loads and stores, and descriptor dereferencing. SM-specific copies, MMA atoms, TMA, and WGMMA stay in cute_nvgpu and later target passes — they do not belong here.
void populate_cute_to_llvm_patterns(PatternSet *patterns) {
add(patterns, lower_make_int_tuple);
add(patterns, lower_make_shape);
add(patterns, lower_make_layout);
add(patterns, lower_get_shape);
add(patterns, lower_get_stride);
add(patterns, lower_tuple_arithmetic);
add(patterns, lower_descriptor_iterator);
add(patterns, lower_pointer_casts);
add(patterns, lower_pointer_load_store);
}
The descriptor-iterator lowering materializes an LLVM struct carrying base pointer, shape, stride, swizzle metadata, and rank. Model this as a typed descriptor object — never as a bag of unrelated scalars threaded through the pipeline.
DescriptorIterator lower_make_desc_iter(MakeDescIterOp op) {
DescriptorIterator desc;
desc.base = op.base_pointer;
desc.shape = materialize_shape(op.layout);
desc.stride = materialize_stride(op.layout);
desc.swizzle = encode_swizzle(op.layout);
desc.rank = rank(op.layout);
return desc;
}
make_int_tuple Hub
cute.make_int_tuple is the shared constructor for compile-time integer tuples. Most layout operations reach for it whenever they need a static rank, shape, permutation, coordinate, or mode list.
Value make_int_tuple(OpBuilder *builder, ArrayRef<int64_t> values) {
Type type = infer_int_tuple_type(values.length);
SmallVector<Value> constants;
for (int64_t value : values) {
constants.push(builder->create_index_constant(value));
}
return builder->create("cute.make_int_tuple", type, constants);
}
Desugaring canonicalizes equivalent static tuples so later layout folds can compare them structurally.
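One way to picture canonical static tuples is an interning pool: equal-valued tuples map to one stored object, so a structural comparison reduces to an identity check. This is an illustrative model only. `IntTuplePool` is a hypothetical class, and nothing here claims that tileiras stores tuples this way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Interning pool for static integer tuples (hypothetical helper).
class IntTuplePool {
public:
    using Tuple = std::vector<std::int64_t>;

    // Returns a pointer that is identical for all equal-valued tuples.
    // std::set is node-based, so the pointer stays valid as the pool grows.
    const Tuple* intern(const Tuple& values) {
        const Tuple& stored = *pool_.insert(values).first;
        assert(stored == values);
        return &stored;
    }

    std::size_t size() const { return pool_.size(); }

private:
    std::set<Tuple> pool_;
};
```

Interning the tuple (64, 64) twice returns the same pointer and leaves one entry in the pool, which is the property a later fold relies on when it compares two shapes.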
Error Handling
A builder failure caused by a missing dialect or missing operation is a fatal compiler configuration error. A verification failure for illegal operands, layouts, or atom instances is a normal MLIR diagnostic. Keeping the two classes separate keeps frontend mistakes debuggable and stops broken pass registration from hiding behind them.
Invariants
- Atom kind is determined by result type interfaces.
- Atom call verification checks both structural layout compatibility and target-specific atom legality.
- Desugar expands syntax but does not choose SM-specific instructions.
- Descriptor iterators lower to typed aggregate state.
- Static integer tuples are canonical intermediate values.
- Missing operation registration is a compiler setup bug, not a recoverable rewrite miss.
Kernel-entry ABI
The CuteKernelToNvvmRewrite pass runs downstream of MaterializeConvertLayout, after the type converter has produced LLVM-legal function arguments. Each kernel function gets two related rewrites: a kernel-attribute rename so NVPTX codegen recognises the entry, and a per-argument lift of each grid-constant arg-attr into the LLVM-dialect triple the backend emits as a PTX .param constant-space descriptor.
The first rewrite renames cute.kernel to nvvm.kernel. Kernel functions enter the pass tagged with a cute.kernel UnitAttr left over from the front end; the rewrite drops it and writes nvvm.kernel in its place. NVPTX codegen recognises kernel entries by nvvm.kernel, so after this rewrite the function is visible to the downstream NVVM lowering as a real kernel entry rather than a plain device function. Nothing about the function body changes; only the function-level attribute does.
The second rewrite walks every function argument carrying the cute_nvgpu.grid_constant arg-attribute. For each such argument it deletes the cute_nvgpu.grid_constant arg-attr and installs the LLVM-dialect triple {llvm.align = 16, llvm.byval, nvvm.grid_constant}. Each component of the triple has a specific job in the final ABI:
| Attribute | Role at the kernel boundary |
|---|---|
| llvm.align = 16 | matches the TMA descriptor's 16-byte alignment requirement; Hopper TMA hardware refuses unaligned descriptors. |
| llvm.byval | tells the LLVM backend to pass the descriptor by value, in .param space, rather than as a pointer to host memory. |
| nvvm.grid_constant | persists through NVVM lowering to the final PTX as constant-space placement on the kernel parameter. |
Ordering matters. The pass must run after the type converter has produced LLVM-legal function arguments — llvm.byval is only meaningful on an LLVM-dialect aggregate type and would attach to a non-LLVM type if the rewrite ran earlier. It must also run after MaterializeConvertLayout has finalised the descriptor argument types, because the alignment requirement is keyed off the descriptor's concrete layout. Encode both ordering constraints in the pass-manager pipeline rather than relying on the textual order of pass registration.
Together the two rewrites make a kernel function self-describing to the NVVM backend. The function-level attribute tells the backend "this is a kernel entry, emit .entry"; the per-argument triple tells the backend "place this descriptor in .param constant space, 16-byte aligned, by value". After this pass the kernel is ready for plain NVVM-to-PTX translation, and no later pass touches the kernel-entry ABI.
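The two rewrites can be modelled on a toy attribute dictionary. The attribute names are the ones given above; `AttrMap`, `KernelFunc`, and `rewrite_kernel_entry` are hypothetical, and the sketch assumes that functions without cute.kernel are left untouched.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Attribute name -> printed value; unit attributes use an empty value.
using AttrMap = std::map<std::string, std::string>;

// Hypothetical stand-in for a function with per-argument attributes.
struct KernelFunc {
    AttrMap func_attrs;
    std::vector<AttrMap> arg_attrs;
};

// Returns true when the function was a kernel and has been rewritten.
bool rewrite_kernel_entry(KernelFunc& fn) {
    // Rewrite 1: cute.kernel becomes nvvm.kernel.
    if (fn.func_attrs.erase("cute.kernel") == 0) return false;  // assumption: skip non-kernels
    fn.func_attrs["nvvm.kernel"] = "";

    // Rewrite 2: lift each grid-constant argument into the LLVM-dialect triple.
    for (AttrMap& arg : fn.arg_attrs) {
        if (arg.erase("cute_nvgpu.grid_constant") == 0) continue;
        arg["llvm.align"] = "16";
        arg["llvm.byval"] = "";
        arg["nvvm.grid_constant"] = "";
    }
    assert(fn.func_attrs.count("cute.kernel") == 0);
    return true;
}
```

Arguments that never carried cute_nvgpu.grid_constant keep their attribute dictionaries unchanged.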
Cross-References
Verifiers — Verbatim Diagnostics lists every verbatim diagnostic the verifier surface emits, including the atom-call diagnostics that fire on the desugared forms above. Layout Algebra and Descriptor Grammar — Layout Primitives covers the layout primitives the shape-eval and stride-eval phases produce. SM Tier Roster and Copy Atom Registry — Atom TypeID Registry documents the atom interfaces the result-type-driven atom rebuild dispatches against.
cute Verifiers
Abstract
The cute verifier surface guards layout algebra. It checks that shapes, coordinates, layouts, composed layouts, views, tuples, divide/product operands, memrefs, atom fragments, and tuple arithmetic stay compatible before lowering picks a target instruction. The mental model is short: verifiers guard kind, rank, congruence, staticness, and algebraic validity.
Verification Categories
| Category | Examples | Verbatim diagnostic prefix |
|---|---|---|
| Layout builders | make_layout, make_shape, make_stride, make_composed_layout | structural — checked by the type-kind discriminator |
| Layout queries | get_shape, get_stride, get_layout | "expects `input` to be a layout or a view, got " |
| Algebra | composition, right_inverse, coalesce, filter_zeros, recast_layout | "expects an input of type layout or composed layout, but got " |
| Divide / product | logical_divide, tiled_divide, flat_divide, stencil_divide | "invalid tiler type, got" / "expects rank(tiler) <= rank(input), but got input=" |
| Tile and mode | local_tile, local_partition, group_modes, select, size, cosize | "unexpected tiler type, got " |
| Coordinates | crd2idx, make_fragment_like, make_view | "unexpected coordinate type, got " / "expected a coordinate of rank " |
| Tile-to-shape | tile_to_shape | "invalid input types for tile_to_shape, got " |
| Atom call sites | copy_atom_call, mma_atom_call | covered by the atom-interface verifier |
Verbatim Diagnostics
Every cute op emits diagnostics through Op::emitOpError(<verbatim string>). The strings below are the user-visible contract; tests match diagnostics by string and a reimplementation must preserve them byte-for-byte. Grouped by op family:
cute.local_tile
- "unexpected tiler type, got ": tiler is not a LayoutType or TileType
- "unexpected coordinate type, got ": coord is not a CoordType
- "expected a coordinate of rank " (followed by <n> and " but got " and the coord's type): coord rank does not match the selected mode
- "Failed to dice " (followed by the tile and the coord): dice_view returned a malformed view
- "failed to construct a valid coordinate from " (followed by the coord print)
- "expected a view as an input but got " (followed by the input type)
cute.local_partition
- "expects LayoutType tiler, but got "
- "expects LayoutType tiler with static shape, but got "
- "expects `input` to be a layout or a view, got "
- "expects `target_profile` be CoordType, but got "
- "unable to coalesce `input` of type "
- "unable to construct a coordinate for local_partition"
cute.make_fragment_like
- "expects `input` is CuteMemRefType or CuteLayoutTypeInterface, but got "
- "expects `src` is LayoutTypeInterface or CuteMemRefType, but got "
- "expects `src.layout` is LayoutType or ComposedLayoutType, but got "
- "unable to make fragment-like layout"
- "unable to make fragment-like layout from "
cute.group_modes
- "expects begin in the range of [-rank , rank-1], but got begin [{0}] and rank [{1}]"
- "expects end in the range of [-rank+1 , rank], but got end [{0}] and rank [{1}]"
- "expects begin < end, but got begin [{0}] ([{1}]) and end [{2}] ([{3}])"
- "expects view or layout type, but got "
- "unable to infer return type with inputs "
cute.size and cute.cosize
- "input type [{0}] has invalid values."
- "mode [{0}] has invalid values for input type {1}"
- "unable to compute size for input {0} and mode [{1}]"
- "can't derive meaningful cosize of composed layout when inner is affine: "
- "mode [{0}] is invalid for input type {1}"
- "unable to compute cosize for input {0} and mode [{1}]"
cute.select
"Invalid results for select(). Modes: ["
Out-of-range and duplicate modes are reported through the shared mode-list helper. The helper formats the offending mode and the rank into a "vector::_M_range_check: __n (which is %zu) >= this->size() (which is %zu)" failure when the underlying small-vector lookup throws.
cute.tile_to_shape
- "invalid input types for tile_to_shape, got "
- "Target and kernel shapes must be congruent."
- "Lower padding shape must be congruent with target shape"
- "Upper padding shape must be congruent with target shape"
- "Traversal stride must be congruent with target shape"
- "expects target shape and order operands have same rank, but got "
- "expects only static order modes, but got "
cute.logical_divide / tiled_divide / flat_divide / stencil_divide
- "invalid tiler type, got" (followed by the tiler type print)
- "invalid input type, got "
- "expects rank(tiler) <= rank(input), but got input=" (followed by both ranks)
- "failed to perform a valid division of " (followed by <input> and <tiler>)
cute.recast_layout
- "expects `src` is LayoutTypeInterface, but got "
- "unable to recast layout "
cute.right_inverse
- "expects an input of type layout or composed layout, but got "
- "expects an input with static shape, but got "
- "unable to compute a right inverse for input "
cute.filter_zeros
- "expects a ShapeType for the target profile, but got "
- "Expects target_profile has the same profile with the src, but src_profile is: "
- "and target_profile is: "
- "unable to filter zeros for input "
Layout Builder Checks
make_layout accepts a shape and a stride, checks rank congruence, and produces a LayoutType. make_composed_layout accepts an outer layout, an inner layout or swizzle, and an integer-tuple offset, then runs the full compose(outer, inner) algebra to confirm the result is a well-formed layout.
LogicalResult verify_make_composed_layout(MakeComposedLayoutOp op) {
if (!implements_layout_interface(op.outer))
return op.emitOpError("expects an input of type layout or composed layout, but got ") << op.outer.getType();
if (!is_int_tuple(op.offset))
return op.emitOpError("expects `target_profile` be CoordType, but got ") << op.offset.getType();
if (!implements_layout_interface(op.inner) && !is_swizzle(op.inner))
return op.emitOpError("expects `input` to be a layout or a view, got ") << op.inner.getType();
Optional<Layout> layout = compose_layout(op.outer, op.inner);
if (!layout)
return op.emitOpError("failed to perform a valid division of ") << op.outer << " " << op.inner;
if (!offset_is_valid_for_layout(op.offset, *layout))
return op.emitOpError("unable to construct a coordinate for local_partition");
return success();
}
Mode and Rank Checks
cute mode-range operations accept Python-style ranges. Negative modes are normalised relative to rank before the range check, and the three boundary checks emit distinct diagnostics so users can see which side failed.
LogicalResult verify_mode_range(int begin, int end, int rank, Op op) {
int nb = begin < 0 ? begin + rank : begin;
int ne = end < 0 ? end + rank : end;
if (nb < 0 || nb >= rank)
return op.emitOpError(
"expects begin in the range of [-rank , rank-1], but got begin [")
<< begin << "] and rank [" << rank << "]";
if (ne < 0 || ne > rank)
return op.emitOpError(
"expects end in the range of [-rank+1 , rank], but got end [")
<< end << "] and rank [" << rank << "]";
if (nb >= ne)
return op.emitOpError(
"expects begin < end, but got begin [")
<< begin << "] ([" << nb << "]) and end [" << end << "] ([" << ne << "])";
return success();
}
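A standalone version of the normalisation makes the boundary behaviour easy to try out. This sketch keeps the same three checks in the same order but drops the diagnostics and returns the normalised half-open range instead. `normalise_mode_range` is a hypothetical name.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Normalise a Python-style [begin, end) mode range against `rank`.
// Returns std::nullopt when any of the three boundary checks fails.
std::optional<std::pair<int, int>> normalise_mode_range(int begin, int end, int rank) {
    assert(rank > 0);
    const int nb = begin < 0 ? begin + rank : begin;
    const int ne = end < 0 ? end + rank : end;
    if (nb < 0 || nb >= rank) return std::nullopt;  // begin out of range
    if (ne < 0 || ne > rank) return std::nullopt;   // end out of range
    if (nb >= ne) return std::nullopt;              // empty or reversed range
    return std::make_pair(nb, ne);
}
```

For rank 3, the range (-2, -1) normalises to (1, 2), while (1, 1) and (3, 3) are both rejected.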
The select-family mode list rejects out-of-range modes and duplicates by formatting the offending set into the "Invalid results for select(). Modes: [" prefix.
LogicalResult verify_mode_list(ArrayRef<int32_t> modes, int rank, Op op) {
BitSet seen(rank);
for (int32_t m : modes) {
if (m < 0 || m >= rank || seen.contains(m))
return op.emitOpError("Invalid results for select(). Modes: [")
<< format_mode_list(modes) << "]";
seen.insert(m);
}
return success();
}
Divide and Product Checks
Divide requires a layout-like input and a tile-like tiler, with tiler rank at most input rank. The verifier actually runs the algebra during verification so an invalid regrouping fails inside verify rather than producing an ill-formed result type. The algorithm has three gates and one algebraic step:
LogicalResult verify_logical_divide(LogicalDivideOp op) {
Type tiler = op.tiler.getType();
if (!implements_tile_like(tiler) && !implements_layout_like(tiler))
return op.emitOpError("invalid tiler type, got") << tiler;
Type input = op.input.getType();
if (!implements_layout_like(input) && !implements_view_like(input))
return op.emitOpError("invalid input type, got ") << input;
if (rank(tiler) > rank(input))
return op.emitOpError("expects rank(tiler) <= rank(input), but got input=")
<< rank(input) << " and tiler=" << rank(tiler);
Layout in_layout = layout_of(op.input);
Optional<Layout> result = try_logical_divide(in_layout, op.tiler);
if (!result || result->getType() != op.result.getType())
return op.emitOpError("failed to perform a valid division of ")
<< in_layout << " by " << op.tiler;
return success();
}
tiled_divide, flat_divide, and stencil_divide follow the same skeleton with a different kind impl in the algebraic step. Product variants share the rank gate but build through the layout-product algebra instead.
Tuple Arithmetic Checks
Tuple arithmetic is structural. The operands must have the same tuple kind, and each leaf operation must be defined. Division and modulo reject zero divisors before any leaf walk runs because a zero divisor would propagate a hard error through every later layout fold.
LogicalResult verify_tuple_arithmetic(TupleArithOp op) {
if (!same_tuple_kind(op.lhs.getType(), op.rhs.getType()))
return op.emitOpError("input type [") << op.lhs.getType() << "] has invalid values.";
for (LeafPair leaf : zip_leaves(op.lhs, op.rhs)) {
if ((op.kind == TUPLE_DIV || op.kind == TUPLE_MOD) && is_zero(leaf.rhs))
return op.emitOpError("mode [") << index_of(leaf) << "] has invalid values for input type "
<< op.lhs.getType();
if (!arithmetic_supported_for_leaf(leaf))
return op.emitOpError("unable to compute size for input ")
<< op.lhs.getType() << " and mode [" << index_of(leaf) << "]";
}
return success();
}
to_int_tuple rejects scaled bases, underscores, error leaves, and non-tuple sources because the LLVM lowering downstream expects a plain integer tuple.
Coordinates, Local Tiles, and Slices
Coordinate-based operations check weak congruence: the coordinate profile must fit the layout or view profile but may be less specific wherever the input has dynamic structure. The local_tile verifier runs five gates in fixed order:
LogicalResult verify_local_tile(LocalTileOp op) {
if (!is_tile_like(op.tiler) && !is_shape_like(op.tiler))
return op.emitOpError("unexpected tiler type, got ") << op.tiler.getType();
if (!is_coord(op.coord))
return op.emitOpError("unexpected coordinate type, got ") << op.coord.getType();
if (op.mode.length < rank(op.coord))
return op.emitOpError("expected a coordinate of rank ")
<< op.mode.length << " but got " << op.coord.getType();
if (!is_view(op.input) && !is_layout_like(op.input))
return op.emitOpError("expected a view as an input but got ") << op.input.getType();
Optional<View> v = dice_view(op.input, op.tiler, op.coord, op.mode);
if (!v)
return op.emitOpError("Failed to dice ") << op.tiler << " with " << op.coord;
return success();
}
local_partition shares the input-kind gate but applies a stricter tiler check (LayoutType with static shape) and asks for a CoordType target profile.
Worked Example: crd2idx Weak Congruence Violation
A coordinate that does not satisfy weak congruence against the layout's shape fails at the rank gate. Consider a rank-3 layout indexed by a rank-2 coordinate:
%shape = cute.make_int_tuple %c4, %c8, %c16 : !cute.shape
%stride = cute.make_int_tuple %c128, %c16, %c1 : !cute.stride
%layout = cute.make_layout_raw %shape, %stride : !cute.layout
%coord = cute.make_int_tuple %ci, %cj : !cute.coord<rank=2>
%idx = cute.crd2idx %coord, %layout : !cute.coord, !cute.layout -> index
cute.crd2idx desugars into a local_tile-style walk for verification — the coord profile must be weakly congruent with the layout's shape profile. The rank gate fires first because the coord type stores rank 2 while the layout's shape stores rank 3:
error: expected a coordinate of rank 3 but got !cute.coord<rank=2>
The diagnostic uses the verbatim "expected a coordinate of rank " prefix from the local_tile ladder; crd2idx shares the same helper so the message is identical. The mode list (%mode = [] here) has length 0, so the rank check reduces to mode.length(0) < rank(coord)(2), which fails and selects this diagnostic.
A second variant of the same bug — a same-rank coord whose shape does not weakly fit — fails at the dicing step instead:
%coord = cute.make_int_tuple %ci, %cj, %ck : !cute.coord<rank=3>
// %ck has static value 32, but layout's shape[2] = 16
%idx = cute.crd2idx %coord, %layout : !cute.coord, !cute.layout -> index
error: Failed to dice !cute.layout<((4,8,16),(128,16,1))> with !cute.coord<(0,0,32)>
The rank gate passes (mode.length == rank(coord) == 3), but dice_view rejects the coord because its third component (32) is out of bounds for the layout's third shape element (16). The diagnostic uses the verbatim "Failed to dice " prefix and prints both the tile and the offending coord.
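Both failure modes can be reproduced on static values with a small model of crd2idx. This is a sketch under simplifying assumptions: flat shapes and strides, fully static coordinates, and an exact rank match in place of weak congruence. `Crd2IdxStatus`, `Crd2IdxResult`, and `crd2idx_static` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Crd2IdxStatus { Ok, RankMismatch, OutOfBounds };

struct Crd2IdxResult {
    Crd2IdxStatus status;
    std::int64_t index;  // meaningful only when status == Ok
};

// Index = sum(coord[i] * stride[i]) after a rank check and a bounds check.
Crd2IdxResult crd2idx_static(const std::vector<std::int64_t>& coord,
                             const std::vector<std::int64_t>& shape,
                             const std::vector<std::int64_t>& stride) {
    assert(shape.size() == stride.size());
    // Rank gate: corresponds to the "expected a coordinate of rank " failure.
    if (coord.size() != shape.size()) return {Crd2IdxStatus::RankMismatch, 0};
    std::int64_t index = 0;
    for (std::size_t i = 0; i < coord.size(); ++i) {
        // Bounds gate: corresponds to the "Failed to dice " failure.
        if (coord[i] < 0 || coord[i] >= shape[i]) return {Crd2IdxStatus::OutOfBounds, 0};
        index += coord[i] * stride[i];
    }
    return {Crd2IdxStatus::Ok, index};
}
```

With shape (4, 8, 16) and stride (128, 16, 1), a rank-2 coordinate stops at the rank gate, the coordinate (0, 0, 32) stops at the bounds gate, and (1, 2, 3) maps to index 163.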
Memref and Scaled-Index Checks
cute.memref.load and related pointer helpers validate element type, bit width, address space, and coordinate congruence. Boolean element loads are accepted only in the memory space where the implementation can represent them safely.
LogicalResult verify_memref_load(MemrefLoadOp op) {
MemrefType memref = op.memref.type;
require(is_supported_element_type(memref.element_type));
require(is_power_of_two(bit_width(memref.element_type)));
require(is_supported_address_space(memref.address_space));
require(is_coord(op.coord));
require(weakly_congruent(profile(op.coord), profile(memref.layout)));
if (memref.element_type == i1_type()) {
require(memref.address_space == register_memory_space());
}
return success();
}
load_scaled_index adds two requirements: a cute pointer type and an integer-tuple stride. Non-power-of-two element widths are rejected because scaled-index math would otherwise need a slow path the lowering does not provide.
Atom and Fragment Checks
Tiled copy and tiled MMA builders confirm that the result atom type matches the operand atom type. cute.mma.make_fragment is stricter — it checks operand role, atom type, input profile, vector-mode staticness, and the inferred result type.
LogicalResult verify_mma_fragment(MmaFragmentOp op, Target target) {
    // Operand role and atom interface.
    require(is_mma_operand_id(op.operand_id));
    require(op.atom.type.implements_mma_atom());
    require(is_memref_like(op.source) || is_shape_like(op.source));
    // Input profile and vector-mode staticness.
    Profile profile = infer_profile(op.source);
    require(profile.rank >= 3);
    require(vector_mode(profile).is_static);
    require(vector_mode(op.atom.type.profile).is_static);
    require(vector_modes_compatible(profile, op.atom.type.profile));
    // The declared result type must equal the inferred fragment type.
    Type inferred = infer_fragment_type(op.atom, op.source, op.operand_id, target);
    require(inferred == op.result.type);
    return success();
}
The fragment verifier reaches the target only through the atom interface. The generic cute dialect must not hard-code every SM instruction variant.
LayoutTypeInterface Kind Discriminator
Every cute Type carries a kind-discriminator slot inside its TypeStorage block, separate from the upstream-MLIR TypeID at the head of the storage. The slot is one of seven static sentinels; sentinel identity, not content, drives dispatch. Walkers, verifiers, builders, parsers, and folders all dispatch on this slot by pointer-identity against the seven-entry table, exactly the same way upstream MLIR dispatches on TypeID at the type header. The duplicated slot exists because the upstream TypeID carries the LayoutTypeInterface interface-id, and the cute dialect needs a separate, denser tag for the seven-kind switch.
| Ordinal | Kind | Meaning |
|---|---|---|
| 0 | ComposedLayout | compose(L1, L2) — a layout formed by composing two sub-layouts |
| 1 | Layout | Plain (Shape, Stride) pair |
| 2 | Swizzle | swizzle<B, M, S>, an XOR-based bit swizzle layout |
| 3 | Tile | Tile-shape descriptor (shape only, no stride) |
| 4 | Shape | Pure shape tuple (no stride) |
| 5 | Coord | Coordinate tuple (no stride) |
| 6 | IntTuple | Pure-integer tuple |
Four parallel function-pointer tables index by the kind ordinal — the row position in the seven-entry table — and each holds one handler per kind. Together they cover the lifecycle of every cute Type: verification, asm printing, bytecode parsing, and folding.
| Table | Role |
|---|---|
| verify | per-kind verifier callback |
| print | per-kind asm printer |
| parse | per-kind bytecode reader |
| fold | per-kind canonicalisation |
A separate nine-entry operand-kind table records the expected kind discriminator for each operand slot of multi-operand ops. The arity of nine covers the widest cute op: cute.partition consumes up to nine sub-layouts (input view, tiler, coord, mode list, and up to five auxiliary atom-binding slots). Narrower ops such as cute.compose (two operands) and cute.zipped_divide (three) leave the trailing entries unused but read the same table layout, which keeps the verifier-side per-operand checks index-uniform. The partition op page documents the consumer side of this slot table — see Tile and Divide Ops — Tiled partition verifier.
Dispatch is a linear scan over the seven sentinels followed by an indexed call into the appropriate table. Pointer-identity comparison keeps the inner loop to a single compare per kind; falling off the end is a hard error because every well-formed cute Type must carry one of the seven sentinels.
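Handler dispatch_by_kind(const CuteType *t, const Handler *table) {
    /* Compare by pointer identity against each of the seven sentinels. */
    for (int i = 0; i < 7; ++i) {
        if (t->kind == kSentinels[i]) {
            return table[i]; /* the same ordinal indexes every table */
        }
    }
    abort(); /* unknown kind; should be unreachable */
}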
A reimplementation should keep the kind ordering and the four-table convention. Reordering the sentinels silently mis-routes verify and fold to the wrong handler because every table is indexed by the same ordinal.
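The convention can be sketched in a self-contained form. Everything here is a stand-in: `KindSentinel`, `HandlerTable`, and the `handler<TableId, Kind>` template are hypothetical, the handlers only return a number that identifies which table and row were reached, and the sketch throws where the pseudocode above aborts.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

constexpr std::size_t kNumKinds = 7;

// One static object per kind; only its address is meaningful.
struct KindSentinel { char unused; };
static const KindSentinel kSentinels[kNumKinds] = {};

// Ordinals follow the row order of the kind table above.
enum KindOrdinal : std::size_t {
    kComposedLayout, kLayout, kSwizzle, kTile, kShape, kCoord, kIntTuple
};

struct CuteType { const KindSentinel* kind; };

using Handler = int (*)(const CuteType&);
using HandlerTable = std::array<Handler, kNumKinds>;

// Hypothetical handler: returns TableId * 10 + Kind so a caller can see
// which table and which row the dispatch reached.
template <int TableId, int Kind>
int handler(const CuteType&) { return TableId * 10 + Kind; }

template <int TableId>
HandlerTable make_table() {
    return {handler<TableId, 0>, handler<TableId, 1>, handler<TableId, 2>,
            handler<TableId, 3>, handler<TableId, 4>, handler<TableId, 5>,
            handler<TableId, 6>};
}

// Four parallel tables, all indexed by the same ordinal.
const HandlerTable kVerifyTable = make_table<0>();
const HandlerTable kPrintTable  = make_table<1>();
const HandlerTable kParseTable  = make_table<2>();
const HandlerTable kFoldTable   = make_table<3>();

// Linear scan by pointer identity, then an indexed read of the table.
Handler dispatch_by_kind(const CuteType& t, const HandlerTable& table) {
    for (std::size_t i = 0; i < kNumKinds; ++i) {
        if (t.kind == &kSentinels[i]) {
            return table[i];
        }
    }
    throw std::logic_error("cute type carries no known kind sentinel");
}
```

Because every table is built from the same ordinal sequence, permuting `kSentinels` without permuting all four tables would route a Swizzle to another kind's handlers in every table at once.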
Side Effects
Most layout algebra is pure. Copy atoms, local partitions, fragments, and view construction may allocate or read/write resources through MLIR side-effect interfaces. Model effects explicitly — otherwise canonicalizers will reorder memory-meaningful operations past each other.
Invariants
- Kind checks are interface-based where possible, not string-based.
- Shape and stride operands are weakly congruent when paired.
- Divide and product run the algebra enough to prove their result type.
- Tuple division and modulo reject zero divisors.
- Coordinates are weakly congruent with the layout or view they index.
- Memref operations reject unsupported element widths and address spaces.
- Atom fragments verify through atom interfaces and target profiles.
- Pure layout algebra remains movable; effectful atom and view operations do not.
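As an illustration of the weak-congruence invariants, the sketch below follows the definition used by open-source CuTe: a leaf in the first tuple matches any sub-profile of the second, while a nested node must match a node of the same rank, recursively. Whether tileiras applies exactly this rule is an assumption, and `IntTuple`, `leaf`, and `node` are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical nested integer tuple: a leaf holds a value, a node holds children.
struct IntTuple {
    bool is_leaf;
    std::int64_t value;
    std::vector<IntTuple> children;
};

IntTuple leaf(std::int64_t v) { return IntTuple{true, v, {}}; }
IntTuple node(std::vector<IntTuple> c) { return IntTuple{false, 0, std::move(c)}; }

// True when the profile of `a` is no more refined than the profile of `b`.
bool weakly_congruent(const IntTuple& a, const IntTuple& b) {
    if (a.is_leaf) return true;    // a leaf in `a` matches anything in `b`
    if (b.is_leaf) return false;   // `a` is nested where `b` is flat
    if (a.children.size() != b.children.size()) return false;
    for (std::size_t i = 0; i < a.children.size(); ++i) {
        if (!weakly_congruent(a.children[i], b.children[i])) return false;
    }
    return true;
}
```

Under this rule a rank-2 coordinate (0, 1) is weakly congruent with the shape (4, (2, 8)), but the relation is not symmetric: the shape is not weakly congruent with the coordinate.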
Cross-References
TypeID Sentinels and Anchors — Idiom 1: Static Pointer-Identity Sentinel documents the upstream-MLIR sentinel idiom that the cute kind discriminator mirrors at the dialect level. Layout Algebra and Descriptor Grammar — Layout Primitives covers the layout primitives whose Types carry these sentinels. Atom Builders and Desugar — Per-Atom Desugar Rewrites covers the desugar pass whose rewrites the verifiers run against. Tile and Divide Ops — Tiled partition verifier documents the higher-level tile partitioning ops whose verifiers reuse the rank, mode, and coordinate gates listed here.
cute_nvgpu Dialect Overview
Provenance vs Upstream MLIR
cute_nvgpu is NVIDIA-introduced and has no upstream MLIR counterpart. Upstream MLIR exposes NVIDIA hardware operations only through nvgpu (a thin bridge dialect) and nvvm (typed intrinsics). Neither models the SM-tier-qualified atom catalogue — WGMMA, UMMA, TMA, TMEM lifecycle, ldmatrix/stmatrix, block-scaled MMA forms — that tileiras needs to keep around between cute layout algebra and nvgpu lowering. Without this dialect the layout-to-intrinsic step would have to collapse atom selection, SM-tier verification, and intrinsic emission into one rewrite; the dialect splits those concerns so the SM gate can run before NVVM conversion. See nvgpu for the upstream-linked bridge below this layer.
Abstract
cute_nvgpu is the NVIDIA architectural atom dialect sitting on top of cute. It hosts MMA atoms from SM70 through SM120 (WGMMA and UMMA included), TMA descriptor and transfer atoms, TMEM lifecycle operations, LDSM/STSM matrix-load atoms, and SMEM descriptor views. Every operation passes through an explicit SM-tier verifier, so an invalid (shape, element type, target) triple is rejected before NVVM emission. This page is the dialect-level map; per-family detail lives in the linked sub-pages.
Where cute describes target-neutral layout algebra, cute_nvgpu binds that algebra to real NVIDIA operations — MMA, WGMMA, TMA, TMEM allocation, ldmatrix, stmatrix, async bulk copies, SM-specific copy atoms. It is the seam where a layout stops being merely algebraic and starts requesting a specific GPU instruction family.
The dialect is organised by architecture tier. Older tiers describe classic tensor-core MMA and copy atoms. Hopper-era tiers add WGMMA and TMA descriptor movement. Blackwell-era tiers add tensor-memory lifecycle and block-scaled MMA forms. Tier names live in the operation spellings so verifiers and lowerings can reject invalid shape, element-type, or target combinations before NVVM conversion.
Position in the Cascade
cute
|
| select target-specific copy, MMA, TMA, and tensor-memory atoms
v
cute_nvgpu
|
| normalize architecture atoms
v
nvgpu
|
| emit NVVM intrinsics
v
PTX
cute_nvgpu preserves the high-level atom boundary. It sits below pure layout algebra and above raw NVVM intrinsics — the natural place to enforce SM-tier constraints and descriptor compatibility while keeping the final intrinsic selection simple.
Architecture Tiers
| Tier | Main operations | Meaning |
|---|---|---|
| SM70 | universal FMA and copy fallbacks | Baseline tensor-core-era atom vocabulary. |
| SM80 | sm80.mma, sparse MMA | Ampere MMA and structured sparsity forms. |
| SM89 | FP8-oriented MMA variants | Ada-generation element-type extensions. |
| SM90 | WGMMA, TMA descriptor views | Hopper warpgroup MMA and TMA (Tensor Memory Accelerator) async bulk tensor movement. |
| SM100 | UMMA, TMEM lifecycle, block-scaled MMA | Blackwell datacenter tensor-memory and tcgen05-style operations. |
| SM120 | consumer Blackwell block-scaled forms | Consumer Blackwell microscaling and per-lane scale metadata. |
Tier spelling is part of the IR contract. Lowering must not silently reinterpret an operation as a different tier just because another instruction shape looks similar. If a target does not support the tier named by the operation, verification fails before NVVM lowering.
Atom Families
The major atom families are:
- MMA atoms, including dense, sparse, FP8, WGMMA, UMMA, and block-scaled forms.
- TMA atoms for bulk-tensor load, store, gather, scatter, prefetch, and descriptor use.
- Copy atoms for register, shared-memory, global-memory, and tiled partition movement.
- TMEM lifecycle operations for allocation, deallocation, permit transfer, and pointer retrieval.
- Descriptor view operations that connect cute layouts to hardware descriptor operands.
- Kernel-marker lowering that turns a cute kernel marker into the entry-point marker expected by NVVM.
Each family consumes cute layout values and emits lower-level operations whose shapes, element types, and memory spaces are visible to the target.
Kernel Lowering
The kernel boundary stays deliberately simple. A function marked as a cute kernel becomes an NVVM kernel entry, and every architecture atom in the body lowers or normalises toward nvgpu and nvvm.
void lower_cute_kernel_to_nvvm(Function func, Target target) {
    // Rewrite the kernel marker first.
    if (has_attr(func, "cute.kernel")) {
        remove_attr(func, "cute.kernel");
        set_attr(func, "nvvm.kernel");
    }
    // Gate each architecture atom on the target, then lower it.
    for (Operation *op : func.walk()) {
        if (is_cute_nvgpu_mma(op)) {
            require(target_supports_mma_tier(target, op));
            lower_mma_atom(op, target);
        } else if (is_cute_nvgpu_tma(op)) {
            require(target_supports_tma(target, op));
            lower_tma_atom(op, target);
        } else if (is_cute_nvgpu_tmem(op)) {
            require(target_supports_tmem(target, op));
            lower_tmem_lifecycle_op(op, target);
        } else if (is_cute_layout_carrier(op)) {
            rewrite_descriptor_or_view(op, target);
        }
    }
}
The rewrite preserves the semantic shape of the atom. A WGMMA atom lowers through a warpgroup MMA op, not a scalarized loop that happens to compute the same value. A TMA atom lowers through descriptor construction and async bulk-tensor copy ops, not through ordinary elementwise loads, unless an explicit fallback path exists.
Verifier Invariants
A correct verifier should reject invalid target combinations early:
- the selected target supports the SM tier named by the operation,
- MMA tile shapes are supported by that tier,
- operand element types match the tier and the chosen MMA mode,
- sparse MMA forms include valid metadata and selector attributes,
- block-scaled MMA forms include valid scale-vector layout and per-lane scale ids,
- TMA descriptor operands agree with the source or destination layout,
- tcgen05.mma kind words clear the 13-rule mutual-exclusion ladder before opcode selection,
- TMA partition ops clear the 11-step ladder (type, layout-kind, integer-stride, swizzle, static, shape-equiv, G-basis, layout, tensor-type, multicast),
- tensor-memory operations respect allocation, deallocation, and permit-transfer order,
- descriptor views preserve address space, element type, shape, and swizzle requirements,
- kernel entry markers are rewritten before NVVM emission.
These invariants are easiest to enforce while the atom name is still present. Once the op has become an NVVM intrinsic the diagnostic context shrinks, and the original layout intent may already be gone.
If You Know CUTLASS (open source) — cross-walk
For readers fluent in cutlass/arch/*.hpp and the per-SM atom traits in open-source CUTLASS:
| CUTLASS C++ | tileiras IR (cute_nvgpu) |
|---|---|
| cutlass::arch::Mma<...> SM70/SM80/SM89 specialisations | atom.universal_fma, sm80.mma, sm89.mma (plus sm80.sparse_mma) |
| cutlass::arch::Wmma<...> traits | accessed through atom.universal_fma and tier-generic paths |
| Hopper GMMA::ss/rs/sr descriptor builders | cute_nvgpu.smem_desc_view + the descriptor packer at sub_17DD6A0 |
| Hopper WGMMA atom + make_smem_desc | cute_nvgpu.sm90.mma op consuming a !smem_desc_view typed operand |
| Hopper TMA cp.async.bulk.tensor family | atom.tma_load, atom.tma_store, atom.tma_reduce plus the non-exec variants |
| Hopper cuTensorMapEncodeTiled | tma_descriptor_tiled type + the TMA descriptor builder |
| Blackwell TCGEN / UMMA atoms | sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp |
| Blackwell TMEM allocation / lifecycle | atom.tmem_load, atom.tmem_store, atom.s2t_copy, the TMEM lifecycle ops |
| cutlass::arch::Sm120BlockScaledMma<...> | SM120.mma_bs (uppercase SM is required) |
| Shared-memory matrix loads (ldmatrix) | atom.ldsm, atom.stsm with the mode/size pattern matrix in Mode Pattern Verifiers — LDSM and STSM Matrix |
Two departures from the open-source surface matter. First, SM120.mma_bs is the only SM120 entry — no SM120.mma, no sparse variant — matching the consumer-Blackwell FP4 surface where sparse MMA is not exposed. Second, the SMEM descriptor is a first-class IR type (!smem_desc_view) rather than an i64 immediate, so the verifier can re-check the descriptor's swizzle and tile-stride encoding against the same layout that produced it.
Cross-links
- SM Tier Roster and Copy Atom Registry — Atom Surface by Tier lists the atom families by target tier.
- MMA Atoms SM70-120 — Per-Arch MMA Shape Lattice covers matrix-multiply atom semantics.
- TMA Atoms — Atom Family covers tensor-memory descriptor and transfer atoms.
- Mode Pattern Verifiers — UMMA Canonical Layout Verifier covers shape, element-type, and mode checks.
- Asm Printer and Mnemonic Hash — Mnemonic Perfect-Hash Dispatch covers textual spelling and parser dispatch.
SM Tier Roster and Copy Atom Registry
Abstract
cute_nvgpu registers MMA, copy, prefetch, TMA, tensor-memory, and descriptor atom types per SM tier, then exposes them through common atom interfaces. The rest of the compiler asks uniform questions through those interfaces: what shape does this atom operate on, what element types are legal, where do operands live, what resources does the target need? The registry below is the product contract — every atom mnemonic, the interfaces it implements, the SM tier that registers it, and the residencies its operands accept.
Registry Model
The dialect uses interface-driven atom records:
| Interface | Implemented by | Purpose |
|---|---|---|
| MMA atom | Universal FMA, SM80, SM89, SM90, SM100/SM103, SM120/SM121 MMA families | Reports MMA shape, operand element types, accumulator type, atom class, and verifier hooks. |
| Copy atom | TMEM load/store, S2T copy, universal copy, async copy, LDSM/STSM, TMA atoms | Reports copy shape, value type, memory spaces, vector width, and legality. |
| Prefetch atom | TMA load atom (the only PrefetchAtom implementer listed in the Atom TypeID Registry) | Reports descriptor, prefetch tile, and cache-hint behavior. |
| Descriptor type | SMEM descriptor views and TMA descriptors | Carries hardware descriptor state as typed IR. |
The design point that matters: generic cute code dispatches through interfaces, not through string comparisons on target names.
LogicalResult verify_atom_instance(Atom atom, Target target, Shape use_shape) {
    // Try each atom interface in turn; the first match owns verification.
    if (MmaAtomInterface mma = dyn_cast_mma_atom(atom.type)) {
        return mma.verify_instance(atom, target, use_shape);
    }
    if (CopyAtomInterface copy = dyn_cast_copy_atom(atom.type)) {
        return copy.verify_instance(atom, target, use_shape);
    }
    if (PrefetchAtomInterface prefetch = dyn_cast_prefetch_atom(atom.type)) {
        return prefetch.verify_prefetch(atom, target, use_shape);
    }
    return failure("atom type does not implement a known cute_nvgpu interface");
}
Atom Surface by Tier
| Tier | MMA atoms | Copy and descriptor atoms | Notes |
|---|---|---|---|
| All tiers | atom.universal_fma | atom.universal_copy | Generic fallback atom vocabulary. |
| SM75+ | No dedicated MMA mnemonic | atom.ldsm | Turing introduces ldmatrix-style shared-memory matrix loads. |
| SM80 | sm80.mma, sm80.sparse_mma | atom.simt_async_copy, atom.ldsm | Ampere dense and sparse mma.sync, plus cp.async-style copy atoms. |
| SM89 | sm89.mma | SM80 copy atoms | Ada extends the dense register-MMA surface with FP8 inputs. |
| SM90 | sm90.mma, smem_desc_view | atom.tma_load, atom.tma_store, atom.tma_reduce, atom.stsm, non-exec TMA atoms | Hopper WGMMA, SMEM descriptors, and TMA descriptor traffic. |
| SM100/SM103 | sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp | atom.tmem_load, atom.tmem_store, atom.s2t_copy, TMA atoms | Datacenter Blackwell UMMA, sparse UMMA, block-scaled MMA, sparse block-scaled MMA, and tensor memory. |
| SM110 (Jetson Thor) | — | inherited universal atoms | SM110 is registered as a target tier (sm_110 / sm_110a / sm_110f) but has no dedicated MMA mnemonic; see note below. |
| SM120/SM121 | SM120.mma_bs | Register-based copy and scale-factor paths | Consumer Blackwell block-scaled MMA with uppercase SM120 spelling. |
The uppercase spelling in SM120.mma_bs is part of the textual contract. A parser that lowercases it cannot round-trip IR for this dialect.
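A small sketch of the round-trip hazard: the lookup below matches mnemonics case-sensitively, so a parser that normalises case before lookup loses the SM120 entry. The list holds the tier-specific MMA mnemonics named in the table above; `is_registered_mma` and `to_lower` are hypothetical helpers, not part of tileiras.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Tier-specific MMA mnemonics, spelled exactly as the dialect registers them.
const std::array<const char*, 9> kMmaMnemonics = {{
    "sm80.mma", "sm80.sparse_mma", "sm89.mma", "sm90.mma",
    "sm100.mma", "sm100.mma_sp", "sm100.mma_bs", "sm100.mma_bs_sp",
    "SM120.mma_bs"}};

// Case-sensitive membership test.
bool is_registered_mma(const std::string& mnemonic) {
    return std::any_of(kMmaMnemonics.begin(), kMmaMnemonics.end(),
                       [&](const char* known) { return mnemonic == known; });
}

// What a case-normalising parser would do before lookup.
std::string to_lower(std::string text) {
    std::transform(text.begin(), text.end(), text.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return text;
}
```

`is_registered_mma("SM120.mma_bs")` succeeds, while the lowercased spelling is not found, so printed IR would no longer parse back to the same atom.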
⚡ QUIRK — SM110 (Jetson Thor) registers a target tier but exposes no dedicated MMA surface (HIGH)
The compiler's SM-target roster enumerates sm_110, sm_110a, and sm_110f alongside the other Blackwell tiers, and lowering will accept those targets as legal architecture flags. The dialect does NOT register an sm110.mma atom mnemonic: every MMA mnemonic in the registry is sm80.mma, sm89.mma, sm90.mma, sm100.mma{,_sp,_bs,_bs_sp}, or SM120.mma_bs. Kernels targeting SM110 fall through to the universal-FMA atom or to whichever earlier-tier MMA atom the architecture-conditional gate accepts. No WGMMA, no tcgen05.mma, no block-scaled register MMA is dialect-side dispatched for SM110 in this compiler. A reimplementation that expects SM110 to carry its own MMA / TMEM / WGMMA atom family will find none here, and a kernel that wants tensor-core throughput on Thor must either route through a non-MMA path or accept that the dispatcher will not synthesise a Thor-specific machine form.
Atom TypeID Registry
The dialect registers one MLIR TypeID per atom kind. Generic cute code never sees the per-atom C++ class; it sees a typed value whose TypeID resolves to an interface vtable, and that vtable carries the verifier, the asm printer, the bytecode round-trip, and the per-atom legality predicates. The registry below lists the contract per atom: which interfaces it implements, which residencies its operands accept, and which SM tier first registers it.
| Atom mnemonic | Min tier | Interfaces implemented | Residency contract |
|---|---|---|---|
| atom.universal_fma | all | MmaAtom | A, B, D in rmem |
| atom.universal_copy | all | CopyAtom | any source-destination pair the target supports |
| atom.ldsm | SM75 | CopyAtom | src=smem, dst=rmem, shape ∈ LDSM matrix |
| atom.stsm | SM90 | CopyAtom | src=rmem, dst=smem, shape ∈ STSM matrix |
| atom.simt_async_copy | SM80 | CopyAtom, AsyncCopyAtom | cp.async gmem → smem |
| sm80.mma | SM80 | MmaAtom | A, B, D in rmem; one element type per operand |
| sm80.sparse_mma | SM80 | MmaAtom, SparseMetadataAtom | A, B, D in rmem; metadata operand alongside A |
| sm89.mma | SM89 | MmaAtom | A, B, D in rmem; FP8 element types added |
| sm90.mma (WGMMA) | SM90 | MmaAtom, SmemDescriptorAtom | A in rmem or smem_desc_view; B in smem_desc_view; D in rmem |
| smem_desc_view | SM90 | DescriptorType | typed view over an SMEM descriptor |
| atom.tma_load | SM90 | CopyAtom, TmaAtom, PrefetchAtom, AsyncCopyAtom | descriptor-driven gmem → smem |
| atom.tma_store | SM90 | CopyAtom, TmaAtom, AsyncCopyAtom | descriptor-driven smem → gmem |
| atom.tma_reduce | SM90 | CopyAtom, TmaAtom, AsyncCopyAtom | descriptor-driven reduce into gmem |
| atom.non_exec_tiled_tma_* | SM90 | TmaAtom (non-exec) | partition-verified TMA atom waiting on mbarrier and cache binding |
| sm100.mma (UMMA) | SM100 | MmaAtom, TmemAtom | A in memref/smem_desc_view; B in smem_desc_view; D in memref (tmem) |
| sm100.mma_sp | SM100 | MmaAtom, TmemAtom, SparseMetadataAtom | UMMA contract + structurally-sparse A and metadata operand |
| sm100.mma_bs | SM100 | MmaAtom, TmemAtom, BlockScaleAtom | UMMA contract + scale-factor operand |
| sm100.mma_bs_sp | SM100 | MmaAtom, TmemAtom, BlockScaleAtom, SparseMetadataAtom | UMMA block-scale + sparsity |
| atom.tmem_load | SM100 | CopyAtom | src=tmem, dst=rmem |
| atom.tmem_store | SM100 | CopyAtom | src=rmem, dst=tmem |
| atom.s2t_copy | SM100 | CopyAtom, AsyncCopyAtom | src=smem, dst=tmem |
| SM120.mma_bs | SM120 | MmaAtom, BlockScaleAtom | A, B, D in rmem; two scale-factor operands (one per A, B) |
The Interfaces implemented column is the dispatch contract. A pass that walks every atom and asks "do you support prefetch?" calls dyn_cast<PrefetchAtomInterface> on each atom value; the SM90+ TMA load atom is the only positive hit, and the call collapses to a TypeID compare. The Residency contract column lists the legality bounds the per-atom verifier enforces; it is the same checklist a CUTLASS C++ user reads from Copy_Traits<> and MMA_Traits<> headers.
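The interface query can be mimicked with a small table. This is only a sketch: the real dispatch resolves through MLIR TypeIDs and interface vtables, whereas the bitmask, `AtomRecord`, and `atoms_implementing` below are hypothetical stand-ins. The rows copy the Interfaces implemented column for a subset of the registry.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical interface bits standing in for interface TypeIDs.
constexpr std::uint32_t kMmaAtom       = 1u << 0;
constexpr std::uint32_t kCopyAtom      = 1u << 1;
constexpr std::uint32_t kAsyncCopyAtom = 1u << 2;
constexpr std::uint32_t kTmaAtom       = 1u << 3;
constexpr std::uint32_t kPrefetchAtom  = 1u << 4;

struct AtomRecord {
    std::string mnemonic;
    std::uint32_t interfaces;
};

// Subset of the registry rows above.
const std::vector<AtomRecord> kRegistry = {
    {"atom.universal_fma", kMmaAtom},
    {"atom.universal_copy", kCopyAtom},
    {"atom.ldsm", kCopyAtom},
    {"atom.stsm", kCopyAtom},
    {"atom.simt_async_copy", kCopyAtom | kAsyncCopyAtom},
    {"atom.tma_load", kCopyAtom | kTmaAtom | kPrefetchAtom | kAsyncCopyAtom},
    {"atom.tma_store", kCopyAtom | kTmaAtom | kAsyncCopyAtom},
    {"atom.tma_reduce", kCopyAtom | kTmaAtom | kAsyncCopyAtom},
};

// Returns the mnemonics whose record carries every bit in `iface`.
std::vector<std::string> atoms_implementing(std::uint32_t iface) {
    std::vector<std::string> hits;
    for (const AtomRecord& record : kRegistry) {
        if ((record.interfaces & iface) == iface) {
            hits.push_back(record.mnemonic);
        }
    }
    return hits;
}
```

Querying for the prefetch bit returns only atom.tma_load, matching the single positive hit described above.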
Copy Atom Operand-Layout Contracts
Every copy atom carries an operand-layout contract that the verifier checks before lowering. The contract pins source and destination residency, the per-thread fragment shape, the natural shape one atom invocation transfers, and the PTX (or NVVM intrinsic) instruction the lowering emits. The table below is the per-tier catalog; each row is one atom mnemonic.
SM70 and SM75 register copy atoms
| Atom | Source | Destination | Natural shape | Element width | Per-thread fragment | Lowering target |
|---|---|---|---|---|---|---|
| atom.universal_copy | any | any (target-supported) | one element | any | one value | scalar ld/st of matching width |
| atom.ldsm<m8n8> (SM75) | smem | rmem | 8 x 8 matrix tile | 16 bits | 2 elements per lane | ldmatrix.sync.aligned.m8n8.x1.shared.b16 |
| atom.ldsm<m8n8.x2> (SM75) | smem | rmem | 8 x 16 matrix tile | 16 bits | 4 elements per lane | ldmatrix.sync.aligned.m8n8.x2.shared.b16 |
| atom.ldsm<m8n8.x4> (SM75) | smem | rmem | 8 x 32 matrix tile | 16 bits | 8 elements per lane | ldmatrix.sync.aligned.m8n8.x4.shared.b16 |
| atom.ldsm<m8n8.x4.trans> (SM75) | smem | rmem | 8 x 32 transposed tile | 16 bits | 8 elements per lane | ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 |
The x1/x2/x4 suffix is the number of 8x8 sub-tiles the atom fetches in one instruction. The transposed variants load each 8x8 sub-tile transposed, so each lane receives the fragment a column-major MMA operand expects; the verifier checks the transpose flag against the consuming MMA atom's expected operand layout.
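The per-lane fragment sizes in the table follow from simple arithmetic: each 8x8 sub-tile holds 64 16-bit elements spread over the 32 lanes of a warp. The helper names below are illustrative only.

```cpp
constexpr int kLanesPerWarp = 32;
constexpr int kSubTileRows = 8;
constexpr int kSubTileCols = 8;
constexpr int kElementBits = 16;
constexpr int kRegisterBits = 32;

// 16-bit elements each lane receives for x1, x2, or x4.
constexpr int ldsm_elements_per_lane(int num_subtiles) {
    return num_subtiles * kSubTileRows * kSubTileCols / kLanesPerWarp;
}

// 32-bit registers each lane needs to hold that fragment.
constexpr int ldsm_registers_per_lane(int num_subtiles) {
    return ldsm_elements_per_lane(num_subtiles) * kElementBits / kRegisterBits;
}

static_assert(ldsm_elements_per_lane(1) == 2, "x1: 2 elements per lane");
static_assert(ldsm_elements_per_lane(2) == 4, "x2: 4 elements per lane");
static_assert(ldsm_elements_per_lane(4) == 8, "x4: 8 elements per lane");
```

So the x4 form fills four 32-bit registers per lane, one per sub-tile.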
SM80 and SM86 async-copy and matrix-load atoms
| Atom | Source | Destination | Natural shape | Element width | Per-thread fragment | Lowering target |
|---|---|---|---|---|---|---|
| atom.simt_async_copy<4> | gmem | smem | 4-byte element | 32 bits | 1 i32 per lane | cp.async.ca.shared.global (4 bytes) |
| atom.simt_async_copy<8> | gmem | smem | 8-byte element | 64 bits | 1 i64 per lane | cp.async.ca.shared.global (8 bytes) |
| atom.simt_async_copy<16> | gmem | smem | 16-byte element | 128 bits | 1 i128-equivalent per lane | cp.async.cg.shared.global (16 bytes; bypass L1) |
| atom.ldsm<m8n8.*> | smem | rmem | inherited from SM75 | 16 bits | inherited | ldmatrix.sync.aligned.* |
The 4/8/16 vector widths are the only legal cp.async granularities; any other width is rejected before lowering. The binary stores no dedicated diagnostic for the width-out-of-range case, so the failure surfaces through the standard '{0}' cannot vectorize copy to {1} elements (static strides must be 1) / '{0}' cannot vectorize copy to {1} elements (static strides must match) template that the copy-vectorization helper emits for any vectorisation failure. The 16-byte variant uses the cg cache policy, which caches at L2 only and bypasses L1; PTX allows cg only for 16-byte copies. The 4- and 8-byte variants use ca (cache at all levels). Lowering chooses the cache policy from the atom's width alone; there is no per-op cache hint.
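The width-to-policy rule is small enough to write out. The enum and function names here are hypothetical; only the mapping (4 and 8 bytes to ca, 16 bytes to cg, everything else rejected) comes from the table above.

```cpp
#include <optional>
#include <string>

enum class CachePolicy { CacheAll, CacheGlobal };  // ca, cg

// Policy for a cp.async copy of `width_bytes`, or std::nullopt when the
// width is not one of the legal granularities.
std::optional<CachePolicy> cp_async_policy(int width_bytes) {
    switch (width_bytes) {
        case 4:
        case 8:
            return CachePolicy::CacheAll;
        case 16:
            return CachePolicy::CacheGlobal;
        default:
            return std::nullopt;
    }
}

// Instruction stem as spelled in the lowering-target column;
// empty when the width is rejected.
std::string cp_async_stem(int width_bytes) {
    std::optional<CachePolicy> policy = cp_async_policy(width_bytes);
    if (!policy) return "";
    return *policy == CachePolicy::CacheAll ? "cp.async.ca.shared.global"
                                            : "cp.async.cg.shared.global";
}
```

No per-op hint enters the function: the width is the only input, which matches the rule that lowering derives the policy from the atom alone.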
SM90 TMA atom family
TMA atoms are descriptor-driven; the per-lane fragment layout is implicit in the descriptor word rather than in the atom's MLIR operand types.
| Atom | Source | Destination | Descriptor kind | Natural shape | Lowering target |
|---|---|---|---|---|---|
| atom.tma_load | gmem (descriptor) | smem | TMA tile descriptor | rank-1..rank-5 box | cp.async.bulk.tensor.NDIM.shared::cluster.global |
| atom.tma_load_multicast | gmem (descriptor) | smem (multi-CTA) | TMA tile descriptor + CTA mask | rank-1..rank-5 box | cp.async.bulk.tensor.NDIM.shared::cluster.global.multicast::cluster |
| atom.tma_load_im2col | gmem (descriptor) | smem | TMA im2col descriptor | rank-3..rank-5 spatial | cp.async.bulk.tensor.NDIM.shared::cluster.global.im2col |
| atom.tma_store | smem | gmem (descriptor) | TMA tile descriptor | rank-1..rank-5 box | cp.async.bulk.tensor.NDIM.global.shared |
| atom.tma_reduce<op> | smem | gmem (descriptor) | TMA tile + reduce kind | rank-1..rank-5 box | cp.reduce.async.bulk.tensor.NDIM.global.shared.OP |
| atom.stsm<m8n8.*> | rmem | smem | none (register copy) | 8 x 8 matrix tile per sub-tile | stmatrix.sync.aligned.m8n8.x[1,2,4].shared.b16 |
The TMA atoms accept rank-1 through rank-5 boxes; the descriptor word encodes the per-dimension extents, strides, and box edges (see the dedicated TMA atom page). The multicast variant adds a 16-bit CTA mask that names which CTAs in the cluster receive the loaded data, enabling one-to-many fanout from a single GMEM read. The im2col variant rewrites the descriptor's box coordinates through a convolution-style spatial reshape so a single load presents the data already in NCHW-to-window form for convolution kernels.
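A sketch of how a 16-bit multicast mask could be assembled, assuming bit i selects the CTA with rank i inside the cluster. `make_cta_mask` is a hypothetical helper, not a tileiras or CUDA API.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Builds the multicast mask from destination CTA ranks.
// Returns std::nullopt when a rank does not fit in the 16-bit mask.
std::optional<std::uint16_t> make_cta_mask(const std::vector<unsigned>& ranks) {
    std::uint16_t mask = 0;
    for (unsigned rank : ranks) {
        if (rank >= 16) {
            return std::nullopt;
        }
        mask = static_cast<std::uint16_t>(mask | (1u << rank));
    }
    return mask;
}
```

Under that assumption, delivering one load to CTAs 0, 2, and 3 uses the mask 0x000D.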
atom.stsm mirrors atom.ldsm from SM75 in reverse — rmem -> smem rather than smem -> rmem — and shares the same sub-tile multiplicity convention.
SM100 and SM103 TMEM copy atoms
Datacenter Blackwell adds tensor memory as a fourth memory class alongside register, shared, and global. The copy atom family covers every legal direction between TMEM and the other three classes.
| Atom | Source | Destination | Natural shape | Element width | Lowering target |
|---|---|---|---|---|---|
| atom.tmem_load | tmem | rmem | one TMEM column tile per atom | 32/16/8 bits | tcgen05.ld.sync.aligned.shape.b32 |
| atom.tmem_store | rmem | tmem | one TMEM column tile per atom | 32/16/8 bits | tcgen05.st.sync.aligned.shape.b32 |
| atom.s2t_copy | smem | tmem | TMA-box-shaped SMEM slice | 8/16/32/64 bits | tcgen05.cp.shared::cta.async |
| atom.tmem_to_smem_copy | tmem | smem | TMEM column tile | 32/16/8 bits | tcgen05.cp.async.shared::cta (reverse direction) |
| atom.tcgen05.cp | tmem | tmem (cross-CTA) | column tile inside one cluster | 32 bits | tcgen05.cp.async (cluster-scope) |
TMEM is column-organised: an atom transfers one or more TMEM columns at a time. The verifier checks that the operand layout addresses TMEM columns in a contiguous range matching the natural shape, and that the column count matches the destination tile. SM100 splits the tcgen05 family into tcgen05.ld / tcgen05.st (register-mediated, synchronous-looking) and tcgen05.cp (cluster-scope async); the atom mnemonics make that distinction explicit.
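The column check can be sketched as below, assuming the operand layout has already been resolved to an explicit list of TMEM column indices. `columns_match_tile` is a hypothetical helper; the real verifier reasons over layouts rather than enumerated columns.

```cpp
#include <cstddef>
#include <vector>

// True when `columns` is a non-empty contiguous ascending range whose
// length equals the column count of the destination tile.
bool columns_match_tile(const std::vector<unsigned>& columns,
                        std::size_t tile_columns) {
    if (columns.empty() || columns.size() != tile_columns) {
        return false;
    }
    for (std::size_t i = 1; i < columns.size(); ++i) {
        if (columns[i] != columns[i - 1] + 1) {
            return false;
        }
    }
    return true;
}
```

A range with a gap, or one whose length differs from the tile's column count, fails before lowering.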
A TMEM-resident MMA accumulator does not move out of TMEM until an atom.tmem_load retires its column range into registers. Lowering must keep that retire op alive across any consumer that reads the accumulator from registers; eliding it produces undefined values.
MMA Records
MMA records carry:
- architecture tier;
- operand A, B, and accumulator element types;
- tile shape, usually expressed as (M, N, K);
- operand residency, such as register memory, shared-memory descriptor, or tensor memory;
- sparse or block-scaled metadata, when present;
- verifier and lowering hooks.
typedef struct {
    SmTier min_tier;
    Shape mnk;                  /* (M, N, K) tile shape */
    ElementType a_type;
    ElementType b_type;
    ElementType c_type;         /* accumulator element type */
    Residency a_residency;
    Residency b_residency;
    Residency d_residency;
    bool supports_sparse;
    bool supports_block_scale;
} MmaAtomContract;
Copy Records
Copy atoms carry copy width, source and destination residency, optional async behaviour, and any descriptor or prefetch behaviour. TMA atoms add a descriptor flavour and a prefetch interface on top.
typedef struct {
    SmTier min_tier;
    Residency source;
    Residency destination;
    int value_bits;             /* copy width in bits */
    bool is_async;
    bool uses_tma_descriptor;
    bool supports_prefetch;
} CopyAtomContract;
Per-Tier Semantics
SM70 and SM75
Volta and Turing mostly use generic atoms. SM75 introduces the shared-memory matrix-load family, where ldsm becomes tier-gated. Older MMA forms route through universal or backend intrinsic paths — there is no dedicated cute_nvgpu.sm70.mma spelling.
SM80
Ampere is the first full register-MMA tier. sm80.mma covers dense mma.sync forms; sm80.sparse_mma covers the structured-sparse forms with metadata operands. simt_async_copy models Ampere asynchronous copies. The verifier's anchors here are register-resident MMA operands, supported integer and floating input types, valid sparse metadata, and legal copy vector widths.
SM89
Ada keeps the SM80 register-resident model but adds FP8 input combinations. Sparse FP8 is not part of this tier's atom surface.
SM90
Hopper introduces WGMMA and TMA. sm90.mma accepts shared-memory descriptor operands; B is always descriptor-backed, A is either register- or descriptor-backed depending on mode. TMA load/store/reduce atoms are descriptor-driven and often start as non-executing tiled atoms, then bind to mbarrier and cache state to form executable atoms.
SM100 and SM103
Datacenter Blackwell introduces UMMA and tensor memory. sm100.mma is the plain tensor-memory MMA family; sm100.mma_sp adds structured sparsity, and sm100.mma_bs and sm100.mma_bs_sp carry block-scale and sparse block-scale metadata. TMEM load/store and shared-to-tmem copy atoms move values between register, shared, global, and tensor-memory domains. SM103 reuses the same dialect surface; the distinction is a target flag, not a new atom family.
SM120 and SM121
Consumer Blackwell block-scaled MMA has no TMEM dependency. It carries two scale-factor operands — one for A, one for B — and keeps the accumulator in register memory. SM121 shares the same surface.
MMA Atom Verifier Diagnostics
Every MMA atom registers one verifier through the dialect. The verifier emits verbatim diagnostics so test suites can match by string. The strings below are the user-visible contract.
Layout-shape verifier (all UMMA / SM90 / SM100 variants)
- "expects Mma atom layout of " (binary string, with the canonical reference layout streamed in after the trailing space): strict equality between the op's declared per-operand layout and the canonical layout the atom's traits table reconstructs.
- "expects static and no scaled basis layout for" (printed layout follows): the stride basis must be static integer; scaled-basis layouts are rejected because the descriptor packer cannot encode them.
Element-type ladder (UMMA family)
- "expects operand a with element type {0}, but get {1}."
- "expects operand b with element type {0}, but get {1}."
- "expects operand c with element type {0}, but get {1}." (the verifier uses c for both C and D operand slots)
The element-type check happens before the residency check. The {0} slot prints the expected element type from the atom's traits; {1} prints the actual operand element type.
Residency ladder (UMMA family)
- "expects memref/smem_desc_view for operand A, but gets A:{0}."
- "expects smem_desc_view for operand B, but gets B: {0}."
- "expects memref for operand D, but gets D: {0}."
UMMA B is always an SMEM descriptor. A is either an SMEM descriptor or a TMEM memref; per verify_sm100_umma an RMEM A is rejected (see MMA Atoms SM70-120 — SM100 and SM103 UMMA). D is a memref in the tensor-memory address space (result.residency == TENSOR_MEMORY): the verifier diagnostic spells the SSA type "memref", but the residency contract pins the accumulator to TMEM, not register memory. The verifier emits the first mismatched operand and stops.
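The ordering of the two ladders can be sketched as follows: element types are checked first, residency second, and the first mismatch ends verification. The diagnostic strings are the verbatim ones listed above; the `Operand`, `UmmaTraits`, and `verify_umma` names and the printed operand forms are hypothetical.

```cpp
#include <optional>
#include <string>

enum class Residency { Rmem, Tmem, SmemDescView };

struct Operand {
    std::string element_type;  // e.g. "f16"
    Residency residency;
    std::string printed;       // hypothetical text substituted for {0}
};

struct UmmaTraits { std::string a, b, c; };

std::optional<std::string> element_type_error(const char* slot,
                                              const std::string& expected,
                                              const std::string& actual) {
    if (expected == actual) return std::nullopt;
    return std::string("expects operand ") + slot + " with element type " +
           expected + ", but get " + actual + ".";
}

// Returns the first diagnostic, or std::nullopt when the use is valid.
std::optional<std::string> verify_umma(const Operand& a, const Operand& b,
                                       const Operand& d, const UmmaTraits& traits) {
    // Element-type ladder runs first; the D slot is reported as "c".
    if (auto e = element_type_error("a", traits.a, a.element_type)) return e;
    if (auto e = element_type_error("b", traits.b, b.element_type)) return e;
    if (auto e = element_type_error("c", traits.c, d.element_type)) return e;
    // Residency ladder: A is SMEM descriptor or TMEM, B is SMEM descriptor, D is TMEM.
    if (a.residency == Residency::Rmem)
        return "expects memref/smem_desc_view for operand A, but gets A:" + a.printed + ".";
    if (b.residency != Residency::SmemDescView)
        return "expects smem_desc_view for operand B, but gets B: " + b.printed + ".";
    if (d.residency != Residency::Tmem)
        return "expects memref for operand D, but gets D: " + d.printed + ".";
    return std::nullopt;
}
```

When a use has both a wrong accumulator element type and a wrong B residency, only the element-type diagnostic appears, because that ladder runs first.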
Layout composability (UMMA family)
- "invalid layout of A/B/D. A: {0}, B: {1}, D:{2}": emitted when one of the three per-operand canonical-layout checks fails before the composability step runs.
- "layoutA, layoutB and layoutD fail to form a gemm. A: {0}, B: {1}, D: {2}": emitted when the three layouts pass individually but their composition does not encode a valid (M, N, K) triple.
Non-UMMA shared verifier (SM70 / 75 / 80 / 89)
"expects all mma operands to have element type""expects rmem for input operands, but got A: "(binary string; printed A/B/D operand types follow)"expects operand a with element type "(followed by expected and actual types)
The non-UMMA path enforces the simpler rule that A, B, and D all share one element type and all live in register memory. This is the only path SM70-89 use.
Composed-layout rejection (tiled-copy / tiled-mma builders)
"doesn't support composed layout for ""A/B/D, but got: A: {0}, B: {1}, D:{2}""expects A, B to have the same rank, but got A: ""expects C, D to have the same rank, but got C: ""expects C to have rank 1 or 2, but got C: ""expects C to have rank 2 or 3, but got C: ""expects C to have rank 3, but got C: "
Registry Invariants
- Atom names encode the minimum architecture tier or intentionally remain tier-generic.
- Generic tiling code dispatches through interfaces, not mnemonic switches.
- Sparse and block-scaled atoms expose their metadata through typed operands or attributes.
- TMA atoms that prefetch descriptors implement the prefetch interface.
- Descriptor view types remain explicit until the backend has emitted the corresponding WGMMA, TMA, or TCGEN instruction sequence.
- Target verification rejects atoms whose tier exceeds the selected target.
Cross-References
Mode Pattern Verifiers — LDSM and STSM Matrix documents the LDSM/STSM, UMMA Canonical Layout Verifier, tcgen05.mma Kind-Word Verifier, and SM120 Block-Scaled Lattice verifiers each atom registers. TMA Atoms — Atom Family covers the descriptor-driven TMA family in depth. MMA Atoms SM70-120 — Per-Arch MMA Shape Lattice covers the per-tier MMA shape lattice. MMA Atoms SM70-SM120 — Operand Contract by Tier cross-references the consumer side of every copy atom in the table above. Layout Algebra and Descriptor Grammar — Swizzle Operator covers the bit-manipulation formula the SMEM-resident atoms (atom.ldsm, atom.stsm, TMA descriptors, atom.s2t_copy) rely on for bank-conflict-free placement.
MMA Atoms SM70-SM120
Abstract
cute_nvgpu MMA atoms describe every NVIDIA matrix multiply-accumulate family from classic register MMA through Hopper WGMMA and Blackwell UMMA. Each atom records target tier, tile shape, operand element types, operand residency, sparsity, block scaling, and descriptor requirements. The compiler verifies layout legality against the atom and picks the correct NVGPU/NVVM lowering — all without losing the higher-level tile algebra.
Cross-Tier Summary
| Tier | Instruction family | Operand residency | Main element families |
|---|---|---|---|
| SM70/SM75 | Legacy mma.sync forms | Register fragments | f16, bf16, f32 accumulators. |
| SM80 | Dense and sparse mma.sync | Register fragments | f16, bf16, tf32, integer low-bit modes. |
| SM89 | FP8 register MMA | Register fragments | FP8 E4M3/E5M2 inputs with f32 accumulators. |
| SM90 | WGMMA async | A in registers or SMEM descriptor; B in SMEM descriptor; D in registers | f16, bf16, tf32, FP8, integer modes. |
| SM100/SM103 | TCGEN/UMMA | A in SMEM descriptor or TMEM; B in SMEM descriptor; D in TMEM | FP8, FP6/FP4-like formats, f16, tf32, integer modes. |
| SM120/SM121 | Consumer block-scaled MMA | Register operands and per-input scale factors | MXFP8, MXFP4, NVFP4-style inputs with E8M0 scale factors. |
Per-Arch MMA Shape Lattice
The table below summarises the (M, N, K) tile shapes and element-type tuples each tier accepts. Lowering reads this lattice as the first feasibility gate, before any descriptor or operand-layout check runs. Empty cells mean the shape is not exposed for that tier.
| Shape (M, N, K) | sm_70 | sm_75 | sm_80 | sm_89 | sm_90 (WGMMA) | sm_100 (UMMA) | sm_120 (block-scaled) |
|---|---|---|---|---|---|---|---|
| 8x8x4 (legacy) | f16/f32 acc | — | — | — | — | — | — |
| 16x8x8 | — | f16/bf16 | f16/bf16/tf32 | — | — | — | — |
| 16x8x16 | — | — | f16/bf16, sparse | — | — | — | — |
| 16x8x32 (int/FP8) | — | — | s8/u8, sparse | e4m3/e5m2 | — | — | f4/f6/f8 + E8M0 scales |
| 16x8x64 (int4) | — | — | s4/u4 | — | — | — | f4 + E8M0 scales |
| 64x{8..256}x{8..32} | — | — | — | — | f16/bf16/tf32/FP8/int (B in SMEM desc; A reg or SMEM desc) | — | — |
| 64x{8..256}x{16..64} | — | — | — | — | — | f16/tf32/FP8/FP6/FP4 (A: SMEM desc or TMEM; B: SMEM desc; D: TMEM) | — |
| 128x{N}xK (2-CTA UMMA) | — | — | — | — | — | cluster-coop variant | — |
Notes on the lattice:
- M for SM90 WGMMA is fixed at 64 per warp-group instruction; N ranges over {8, 16, 24, ..., 256} in steps of 8; the canonical K per element type is 256 / elem_bits (see the table below).
- M for SM100 UMMA is 64 (single-CTA) or 128 (2-CTA cooperative). N is a multiple of 8 up to 256, and K matches 512 / elem_bits for the UMMA_K orientation or 256 / elem_bits for UMMA_MN.
- SM120 block-scaled MMA accepts only K = 32 (FP4/FP6/FP8 inputs with E8M0 scales, vec_size = 32) or K = 64 (FP4 only, vec_size in {16, 32}).
- Sparse variants halve the structurally-sparse operand and add a metadata operand; the shape entry above applies to the dense operand. SM100 carries both a dense-sparse atom (sm100.mma_sp) and a block-scaled-sparse atom (sm100.mma_bs_sp); the former keeps the UMMA element-type set, the latter overlays the FP4/FP8 microscale lattice.
LogicalResult check_shape_in_lattice(SmTier tier, Shape mnk,
                                     ElementType a, ElementType b, ElementType c) {
    const ShapeLatticeRow *row = lookup_lattice_row(tier, mnk);
    require(row != NULL);
    require(in_set(a, row->legal_a_types));
    require(in_set(b, row->legal_b_types));
    require(in_set(c, row->legal_acc_types));
    return success();
}
If You Know CUTLASS (open source) — what is different here
Coming from the open-source cutlass/cute C++ headers, the differences are representational rather than semantic.
| CUTLASS C++ concept | tileiras IR form |
|---|---|
| cute::MMA_Atom<MMA_Traits<sm90_64x128x16_F16F16F32_SS>> | cute_nvgpu.sm90.mma op with shape_MNK, a_type, b_type, c_type attributes plus operand-residency-typed values |
| cute::Layout<Shape, Stride> template | !cute.layout type with hierarchical (shape, stride) trees and a 7-kind discriminator (see LayoutTypeInterface Kind Discriminator in cute Verifiers) |
| cute::TiledCopy / cute::TiledMMA | cute.make_tiled_copy / cute.make_tiled_mma builders consuming atom values |
| cutlass::PipelineTmaAsync<Stages> class template | cutlass.pipeline.create + cutlass.pipeline.init ops with explicit producer/consumer participant attributes |
| cutlass::PersistentTileScheduler class template | cutlass.tile_scheduler.create_static_persistent_params op returning a typed scheduler handle |
| WGMMA descriptor packed by make_smem_desc | cute_nvgpu.smem_desc_view type (see WGMMA descriptor construction) |
| Sparse metadata operand on mma.sp | Dedicated sparse_metadata value with its own layout, slot 3 of the synthesised layout result |
| Block-scaled scale_factor_a/b template arguments | scale_a/scale_b operands typed as E8M0 fragments (SM120) or TMEM-resident scale vectors (SM100) |
Two practical consequences for porters: every template-time decision becomes an op attribute the verifier can re-check, and every operand residency (register / SMEM descriptor / TMEM) becomes a typed value the lowering routes through a dedicated atom path. The library's make_smem_desc is the per-atom call to sub_17DD6A0; the open-source cute_tile_scheduler is the cutlass.tile_scheduler.* family.
Common Atom Contract
LogicalResult verify_mma_atom(MmaAtom atom, Target target, MmaUse use) {
    /* Tier gate first: an atom above the selected target never verifies. */
    require(target.supports(atom.min_tier));
    require(use.shape == atom.shape || shape_is_compatible(use.shape, atom.shape));
    /* Element types, then residency, for A, B, and the accumulator. */
    require(in_set(use.a.element_type, atom.legal_a_types));
    require(in_set(use.b.element_type, atom.legal_b_types));
    require(in_set(use.acc.element_type, atom.legal_accumulator_types));
    require(in_set(use.a.residency, atom.legal_a_residency));
    require(in_set(use.b.residency, atom.legal_b_residency));
    require(in_set(use.result.residency, atom.legal_result_residency));
    if (atom.requires_sparse_metadata) {
        require(use.sparse_metadata.valid);
    }
    if (atom.requires_scale_factors) {
        require(use.scale_factors.valid);
        require(scale_factor_layout_is_legal(atom, use.scale_factors));
    }
    return success();
}
Check layout and residency in the verifier — not after lowering. Once an atom has become a raw NVVM intrinsic or an inline PTX fragment, diagnostics can no longer explain the original layout mismatch clearly.
Operand Contract by Tier
Each tier pins its operands to a specific memory space and presents a specific kind of typed value to the lowering. The table below lays this out per tier so a reimplementation can carry one operand-type classifier per row.
| Tier / atom | A operand | B operand | D / accumulator | Predicate | Extra |
|---|---|---|---|---|---|
| SM70 universal FMA | register fragment | register fragment | register fragment | none | — |
| SM80 dense sm80.mma | register fragment | register fragment | register fragment | none | f16/bf16/tf32/s8/s4 family |
| SM80 sparse sm80.sparse_mma | structurally-sparse register fragment | register fragment | register fragment | none | u32 metadata fragment (slot 3) |
| SM89 FP8 sm89.mma | register fragment (e4m3 or e5m2) | register fragment | f32 register fragment | none | — |
| SM90 WGMMA sm90.mma | register fragment or SMEM descriptor (!cute_nvgpu.smem_desc_view) | SMEM descriptor | register fragment (async; not ready until wait) | none | mbarrier for completion; scale-D selector |
| SM100 UMMA sm100.mma | SMEM descriptor or TMEM pointer | SMEM descriptor | TMEM pointer | none | mbarrier; 2-CTA mask when clustered |
| SM100 block-scaled sm100.mma_bs | SMEM descriptor / TMEM | SMEM descriptor | TMEM pointer | none | scale-factor vectors in TMEM, E8M0 |
| SM100 sparse block-scaled sm100.mma_bs_sp | sparse SMEM/TMEM | SMEM descriptor | TMEM pointer | none | metadata vector + scale vectors |
| SM120 block-scaled sm120.mma_bs | register fragment | register fragment | register fragment | none | scale_a and scale_b register fragments (E8M0) |
Reading the table:
- register fragment means the operand is an SSA value typed as a !cute.layout-shaped register slice.
- SMEM descriptor means a packed 64-bit descriptor word built by the constructor at sub_17DD6A0 and surfaced in IR as !cute_nvgpu.smem_desc_view<src, layout>.
- TMEM pointer means a Blackwell tensor-memory tile address, typed by the TMEM allocation lifecycle.
- mbarrier for SM90/SM100 means the atom's completion is observed by a separate mbarrier.wait or wgmma.wait_group op; no register-side operand carries the completion token.
The Predicate column reads none in every row, and that is deliberate. MMA atoms here do not carry per-lane predicates; masking is the job of the producer/consumer pipeline of the enclosing region (see the Pipeline Operations section of cutlass Pipeline and Tile Scheduler).
SM70 and SM75
Older tensor-core tiers travel through universal or backend intrinsic paths — no dedicated per-tier cute_nvgpu mnemonic. The public contract is:
- register-resident input and accumulator fragments;
- classic mma.sync shapes;
- f16 and bf16 style input families depending on tier;
- no WGMMA descriptor, TMA descriptor, TMEM, or block-scale operands.
These atoms remain useful as compatibility targets, but most modern layout-selection logic starts at SM80 or later.
SM80 and SM89 Reference-Layout Synthesizer
sub_1854CF0 (6 640 bytes) is the per-mma_atom builder that emits the canonical Layout for SM80 and SM89 register-MMA tile-fragment placement. It keys on a 5-tuple (K, M, sparse, fp8, trans_a) and routes to one of seven arms; each arm composes shape/stride triples that match the PTX form the lowering will eventually emit. The output Layouts feed straight into the operand-layout verifier, so the synthesiser and the verifier share one source of truth for fragment placement.
Seven-Arm Dispatch
Each MMA atom carries its tile shape and element type in the 5-tuple key. The synthesiser reads the key out of the atom descriptor and routes to the arm whose tuple matches exactly. No fallthrough between arms — an unmatched key already failed verification earlier in the pipeline.
| Arm | K | M | sparse | fp8 | trans_a | PTX form |
|---|---|---|---|---|---|---|
| 0 | 16 | 16 | no | no | no | mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 |
| 1 | 16 | 16 | no | no | yes | mma.sync.aligned.m16n8k16.row.row.f16 |
| 2 | 16 | 16 | yes | no | no | mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 |
| 3 | 32 | 16 | no | no | no | mma.sync.aligned.m16n8k32.row.col.s8.s8.s32 |
| 4 | 32 | 16 | no | yes | no | mma.sync.aligned.m16n8k32.row.col.e4m3.e4m3.f32 (SM89) |
| 5 | 32 | 16 | yes | no | no | mma.sp.sync.aligned.m16n8k32.row.col.s8.s8.s32 |
| 6 | 8 | 16 | no | no | no | mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 |
Arm 4 is the SM89-only FP8 path. The remaining arms apply at SM80 and above. Arms 2 and 5 are the structured-sparse forms, and they select the four-slot return path described below.
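As an illustration of the exact-match dispatch, the table above can be transcribed into a lookup array. The names `ArmKey`, `kArms`, `select_arm`, `slot_count_for_arm`, and `arm_requires_sm89` are hypothetical; the real synthesiser at sub_1854CF0 is not necessarily structured this way. The sketch only shows that an unmatched key selects no arm and that arms 2 and 5 take the four-slot path.

```c
#include <assert.h>
#include <stdbool.h>

/* The 5-tuple key (K, M, sparse, fp8, trans_a) from the dispatch table. */
typedef struct {
    int  k, m;
    bool sparse, fp8, trans_a;
} ArmKey;

/* Rows transcribed from the seven-arm table; the index is the arm number. */
static const ArmKey kArms[7] = {
    {16, 16, false, false, false},
    {16, 16, false, false, true },
    {16, 16, true,  false, false},
    {32, 16, false, false, false},
    {32, 16, false, true,  false},
    {32, 16, true,  false, false},
    { 8, 16, false, false, false},
};

/* Returns the arm index on an exact match, or -1 when no arm matches.
 * There is no fallthrough: every field must agree. */
int select_arm(ArmKey key) {
    for (int i = 0; i < 7; ++i) {
        const ArmKey *a = &kArms[i];
        if (a->k == key.k && a->m == key.m && a->sparse == key.sparse &&
            a->fp8 == key.fp8 && a->trans_a == key.trans_a)
            return i;
    }
    return -1;
}

/* Sparse arms (2 and 5) return four layout slots; dense arms return three. */
int slot_count_for_arm(int arm) {
    assert(arm >= 0 && arm < 7);
    return kArms[arm].sparse ? 4 : 3;
}

/* Only the FP8 arm (arm 4) is restricted to SM89. */
bool arm_requires_sm89(int arm) {
    assert(arm >= 0 && arm < 7);
    return kArms[arm].fp8;
}
```

A key such as (K=16, M=16, sparse, fp8) has no row, so `select_arm` returns -1; in the real pipeline such a key would already have been rejected by verification.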
Stride Triples
Each arm assembles its output Layout from one of three stride triples. The triples land verbatim in the result Layouts and get matched against PTX-encoded offsets at lowering time.
| Triple | Stride values | Used by |
|---|---|---|
| dense.A | {128, 256, 1024} | dense-MMA A-operand |
| dense.B | {2048, ...} | dense-MMA B-operand |
| sparse.metadata | {0x200000, 0x4000000, 0x8000000} | metadata stride for sparse arms 2 and 5 |
The sparse-metadata triple encodes per-warp metadata-buffer offsets at the 21-, 26-, and 27-bit positions. Those bit positions match the metadata-stride field of the mma.sp PTX form, so the synthesised Layout surfaces the PTX wire format directly rather than as an abstract description awaiting translation.
Result-Slot Encoding
Output Layouts are stored consecutively in a 152-byte stride array. Each entry holds the shape vector, the stride vector, and 24 bytes of decoration: per-element-type metadata, padding, and alignment information that the verifier compares against the declared operand layout. Slot zero through slot two always carry the A, B, and C Layouts. When the arm is sparse, the four-slot helper at sub_1854130 writes the metadata Layout into slot three at offset +456 of the result buffer.
typedef struct {
    Layout   slots[4];   /* 152 bytes each; slot[3] valid only on sparse arms. */
    uint32_t slot_count; /* 3 for dense arms, 4 for arms 2 and 5. */
} MmaLayoutResult;
The dispatcher picks between the three-slot and four-slot paths by inspecting the metadata-stride field of the input atom: a non-zero stride forces the sparse path. The caller-provided return buffer is fixed-size, so callers must read the slot count alongside the buffer rather than infer it from buffer width.
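The offset arithmetic and the slot-count rule can be checked with a few lines of C. The helper names below (`slot_offset`, `slot_count_for`, `valid_result_bytes`) are hypothetical; only the 152-byte slot size, the +456 offset of slot three, and the non-zero-stride rule come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LAYOUT_SLOT_BYTES = 152, MAX_SLOTS = 4 };

/* Byte offset of a slot inside the result buffer: slots are consecutive. */
size_t slot_offset(unsigned slot) {
    assert(slot < MAX_SLOTS);
    return (size_t)slot * LAYOUT_SLOT_BYTES;
}

/* A non-zero metadata stride on the input atom forces the sparse path. */
unsigned slot_count_for(uint32_t metadata_stride) {
    return metadata_stride != 0 ? 4u : 3u;
}

/* Number of bytes of the fixed-size buffer that actually hold valid data. */
size_t valid_result_bytes(uint32_t metadata_stride) {
    return (size_t)slot_count_for(metadata_stride) * LAYOUT_SLOT_BYTES;
}
```

Because the buffer is always four slots wide, a caller that reads slot three on a dense arm sees stale bytes; checking the slot count first avoids that.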
Warp-Fragment Element Counts
Each arm also returns the per-thread fragment element count. The calling layout pass uses it to size the warp's register-file allocation. The counts come straight from dividing the tile size across the 32-thread warp tile:
| Arm class | Per-thread elements | Reasoning |
|---|---|---|
| Dense f16 | 8 | 16 * 8 * 16 / 256; equivalently, the 16 x 16 A tile divided across the 32-thread warp |
| Dense s8 | 16 | wider K and narrower element width |
| Dense FP8 | 16 | same K and lane footprint as the s8 dense path |
| Sparse | half of the dense count | the structured-sparse input layout is halved, metadata replaces the missing half |
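The counts in this table are reproduced by dividing the A tile across the warp. This reading is an assumption made for illustration (the table does not say which operand the count refers to), and `a_fragment_elements` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

enum { WARP_LANES = 32 };

/* Per-thread element count for the A fragment, ASSUMING the count is the
 * M x K tile of A split evenly across the 32 lanes of one warp.
 * Sparse arms hold half of the dense count. */
unsigned a_fragment_elements(unsigned m, unsigned k, bool sparse) {
    assert((m * k) % WARP_LANES == 0);
    unsigned dense = (m * k) / WARP_LANES;
    return sparse ? dense / 2 : dense;
}
```

Under that assumption, m16 with K = 16 gives 8 elements per thread and m16 with K = 32 gives 16, matching the dense f16 and dense s8/FP8 rows.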
Atom Verifier Contract
The verifier consumes the synthesised Layouts directly. Residency, shape, and element-type tuples are checked together, and the sparse-metadata layout participates in the same equivalence check.
LogicalResult verify_sm80_mma(MmaUse use, bool sparse) {
    require(use.a.residency == REGISTER_MEMORY);
    require(use.b.residency == REGISTER_MEMORY);
    require(use.result.residency == REGISTER_MEMORY);
    require(is_supported_sm80_mma_shape(use.shape));
    require(is_supported_sm80_element_tuple(use.a.type, use.b.type, use.acc.type));
    MmaLayoutResult expected = synthesize_sm80_layouts(use.atom);
    require(layouts_equivalent(use.a.layout, expected.slots[0]));
    require(layouts_equivalent(use.b.layout, expected.slots[1]));
    require(layouts_equivalent(use.acc.layout, expected.slots[2]));
    if (sparse) {
        require(use.sparse_metadata.valid);
        require(expected.slot_count == 4);
        require(layouts_equivalent(use.sparse_metadata.layout, expected.slots[3]));
    }
    return success();
}
SM80 sparse metadata is part of the atom contract. A lowering that drops it is not equivalent to dense MMA, and a verifier that skips the slot-three Layout comparison lets a mis-sized metadata buffer reach lowering undetected.
SM89
SM89 extends the register-MMA model with FP8 E4M3 and E5M2 inputs and f32 accumulators. Mixed FP8 input pairs are legal as long as both operands pick supported FP8 types.
LogicalResult verify_sm89_fp8_mma(MmaUse use) {
    require(use.a.residency == REGISTER_MEMORY);
    require(use.b.residency == REGISTER_MEMORY);
    require(is_fp8_e4m3_or_e5m2(use.a.type));
    require(is_fp8_e4m3_or_e5m2(use.b.type));
    require(use.acc.type == f32_type());
    require(use.shape.k == 32);
    return success();
}
There is no sparse FP8 companion in this tier.
SM90 WGMMA
SM90 WGMMA is a warp-group asynchronous operation. B always rides an SMEM descriptor; A is either a register fragment or another SMEM descriptor. The result lives in registers, but it is not ready until the WGMMA wait sequence completes.
void lower_sm90_wgmma(WgmmaAtom atom, WgmmaUse use) {
    require(use.b.is_smem_descriptor);
    require(use.a.is_register_fragment || use.a.is_smem_descriptor);
    emit_wgmma_fence();
    for (MmaTile tile : split_into_wgmma_tiles(use)) {
        emit_wgmma_mma_async(atom, tile);
    }
    emit_wgmma_commit_group();
    emit_wgmma_wait_group();
}
A correct lowering preserves asynchronous ordering. Reading accumulators before the wait is a correctness bug even if the IR dependency graph looks fine.
The SMEM descriptor carries base address, leading byte offset, stride byte offset, base offset, and swizzle mode. Build it from the same layout algebra the operand verifier uses; otherwise descriptor construction and verification can drift apart.
SMEM-Descriptor Construction
sub_17DD6A0 (4 984 bytes) packs the 64-bit SMEM descriptor that each wgmma.mma_async.sync.aligned instruction consumes for its A and B operands. The descriptor is built once per operand before the WGMMA tile loop, then threaded through the inline-asm fragment as an l-constraint i64 input. The same bit layout serves every Hopper WGMMA shape, so the constructor is one routine fed by per-atom shape and swizzle metadata — not a family of per-shape variants.
The 64-bit packing layout is a bitfield over the canonical Hopper descriptor word:
typedef union WgmmaDescriptor {
    uint64_t raw;
    struct {
        uint64_t start_addr   : 14; /* bits  0-13 : low 14 bits of SMEM byte offset (>>4) */
        uint64_t lbo          : 16; /* bits 14-29 : leading byte offset (per-warp tile size) */
        uint64_t sbo          : 16; /* bits 30-45 : stride byte offset (between warp tiles) */
        uint64_t base_offset  : 3;  /* bits 46-48 : base offset (per-CTA SMEM offset, divided by 8) */
        uint64_t reserved     : 3;  /* bits 49-51 : reserved, always zero */
        uint64_t swizzle_mode : 2;  /* bits 52-53 : 0=none, 1=128-B, 2=64-B, 3=32-B */
        uint64_t pad          : 10; /* bits 54-63 : padding */
    };
} WgmmaDescriptor;
The start_addr field stores the low 14 bits of (smem_offset >> 4). WGMMA only accepts 16-byte-aligned SMEM addresses, so the constructor shifts and masks the raw SMEM byte offset rather than embedding it unshifted. lbo and sbo together encode the two-dimensional tile-stride layout for an A or B operand: lbo is the leading byte offset between rows of a single warp tile, and sbo is the stride byte offset between consecutive warp tiles along K. base_offset is a per-CTA offset scaled by eight. The reserved field must be zero per the Hopper ISA, and the constructor masks it explicitly.
The swizzle-mode field picks the SMEM bit-reversal pattern that lets two warps in the warp-group read the same SMEM region without bank conflicts:
| swizzle_mode | Bytes-per-row | Used for |
|---|---|---|
| 0 | none | Plain row-major SMEM |
| 1 | 128 | Hopper canonical 128-B swizzle |
| 2 | 64 | 64-B swizzle (smaller TC operand) |
| 3 | 32 | 32-B swizzle (sub-tile WGMMA) |
The 128-B mode is the canonical Hopper choice for full-width A and B tiles. The 64-B and 32-B modes kick in when the operand element width or warp-tile footprint is smaller than a canonical 128-B row.
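The packing can also be written with explicit shifts and masks, which avoids relying on the compiler's bit-field ordering (implementation-defined in C). The sketch below follows the bit positions documented on this page; `pack_wgmma_descriptor` and the accessor names are hypothetical, and the lbo, sbo, and base_offset arguments are assumed to be already in their encoded field units.

```c
#include <assert.h>
#include <stdint.h>

/* Swizzle-mode encodings from the table above. */
typedef enum {
    SWIZZLE_NONE = 0,
    SWIZZLE_128B = 1,
    SWIZZLE_64B  = 2,
    SWIZZLE_32B  = 3
} SwizzleMode;

/* Packs the 64-bit descriptor word using the field positions given on this
 * page. ASSUMPTION: lbo, sbo and base_offset arrive as raw field values. */
uint64_t pack_wgmma_descriptor(uint32_t smem_byte_offset, uint32_t lbo,
                               uint32_t sbo, uint32_t base_offset,
                               SwizzleMode swizzle) {
    assert((smem_byte_offset & 0xFu) == 0);  /* 16-byte alignment required */
    uint64_t raw = 0;
    raw |= (uint64_t)((smem_byte_offset >> 4) & 0x3FFFu);  /* bits  0-13 */
    raw |= (uint64_t)(lbo & 0xFFFFu) << 14;                /* bits 14-29 */
    raw |= (uint64_t)(sbo & 0xFFFFu) << 30;                /* bits 30-45 */
    raw |= (uint64_t)(base_offset & 0x7u) << 46;           /* bits 46-48 */
    /* bits 49-51 stay zero (reserved) */
    raw |= (uint64_t)((unsigned)swizzle & 0x3u) << 52;     /* bits 52-53 */
    return raw;
}

/* Reads the swizzle field back out of a packed descriptor. */
SwizzleMode descriptor_swizzle(uint64_t raw) {
    return (SwizzleMode)((raw >> 52) & 0x3u);
}

/* Recovers the (truncated) SMEM byte offset from the start_addr field. */
uint32_t descriptor_start_bytes(uint64_t raw) {
    return (uint32_t)(raw & 0x3FFFu) << 4;
}
```

A verifier that decodes the swizzle field with the same accessor the packer's tests use has one less place where the two sides can disagree.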
GMMA_K and MN Constraints
Per element type, the canonical K-size one WGMMA instruction consumes is 256 / elem_bits, with one exception (b1 rides a .xor.popc / .and.popc reduction over 256 bits of K). The MN extent must be a multiple of 8 in every case — a WGMMA hardware constraint on the output-tile size, independent of input element type. The dialect-side atom verifier rejects each unsupported input pair with a dedicated diagnostic: "expects A/B of type s8/u8 and D of type i32", "expects A/B of type s4/u4 and D of type i32", "expects A/B of type u1 and D of type i32", and "expects A/B of the e5m2/e4m3 type" for the asymmetric FP8 mix.
| Element type | K-size (canonical) | MN multiple |
|---|---|---|
| f16 × f16 (acc f16/f32) | 16 | 8 |
| bf16 × bf16 (acc f32) | 16 | 8 |
| tf32 × tf32 (acc f32) | 8 | 8 |
| e4m3 × e4m3 (FP8, acc f32) | 32 | 8 |
| e5m2 × e5m2 (FP8, acc f32) | 32 | 8 |
| Mixed e4m3 × e5m2 (acc f32) | 32 | 8 |
| s8 × s8 / u8 × u8 (acc s32) | 32 | 8 |
| s4 × s4 / u4 × u4 (acc s32) | 64 | 8 |
| b1 × b1 popcount (acc s32) | 256 | 8 |
The constructor derives lbo and sbo byte counts from the abstract tile shape via this table. An m64n128k16.f16 tile uses K = 16 because 256 / 16 = 16, and the leading byte offset is K * sizeof(f16) scaled by the swizzle mode. See Topics → WGMMA Emission Protocol — Per-Shape Lattice for the full (M, N, K) legal-shape product and the cross-tier comparison; the table above is the same lattice surfaced from the dialect side so the descriptor packer and the lowering see one source of truth.
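The K and MN rules in the table reduce to two small predicates. The function names below are hypothetical; the constants (M fixed at 64, N a multiple of 8 up to 256, K = 256 / elem_bits) are the ones stated on this page.

```c
#include <assert.h>
#include <stdbool.h>

/* Canonical K consumed by one WGMMA instruction for a given element width. */
unsigned wgmma_canonical_k(unsigned elem_bits) {
    assert(elem_bits != 0 && 256 % elem_bits == 0);
    return 256 / elem_bits;
}

/* N ranges over {8, 16, ..., 256} in steps of 8. */
bool wgmma_n_is_legal(unsigned n) {
    return n >= 8 && n <= 256 && n % 8 == 0;
}

/* Shape gate: M is fixed at 64 and K must equal the canonical value. */
bool wgmma_shape_is_legal(unsigned m, unsigned n, unsigned k,
                          unsigned elem_bits) {
    return m == 64 && wgmma_n_is_legal(n) && k == wgmma_canonical_k(elem_bits);
}
```

For the m64n128k16.f16 tile mentioned above, `wgmma_shape_is_legal(64, 128, 16, 16)` holds, while the same tile with K = 32 is rejected.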
Inline-Asm Template
sub_17DD6A0 ends by emitting an inline-asm fragment whose PTX body has the canonical WGMMA form. For m64n128k16.f32.f16.f16 the emitted string is:
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
    { %f0, %f1, ... },
    %r2, %r3, %p4
The accumulator register list expands to the per-thread fragment count for the chosen tile shape. The constraint string is =f,=r,l,r,n in argument order:
- =f marks each float output register in the accumulator fragment;
- =r marks the i32 output register used for the descriptor's scale-D return slot;
- l is the i64 descriptor input that the constructor produced;
- r is the i32 scale input that selects the accumulator-update mode;
- n is the immediate predicate input that conditions the MMA on a compile-time-known flag.
A correct lowering threads the same WgmmaDescriptor.raw value into the l slot for the operand-B descriptor and, when A is SMEM-resident rather than register-resident, into a second l slot for operand A. The constructor and the verifier must read the descriptor layout from the same table — if the verifier expects 128-B swizzle but the constructor emits 64-B, the inline-asm fragment runs against the wrong SMEM region and produces silently wrong results.
⚡ QUIRK: descriptor swizzle mismatch fails silently at runtime, not at compile time. The WGMMA descriptor is shipped as an opaque i64 into the inline-asm fragment. The verifier and the constructor each compute the swizzle bits from their own table; if those tables drift apart (verifier expects 128-B, constructor emits 64-B), the IR verifies, ptxas accepts, and the kernel launches, but the inline-asm fragment reads from a different SMEM region than the producer wrote to, so the accumulator absorbs unrelated bytes. There is no fence, no compile-time check, and no runtime trap. The only symptom is silently wrong numerics across an MMA tile.
SM100 and SM103 UMMA
SM100 introduces tensor memory and TCGEN-style MMA. The output accumulator lives in TMEM; A comes from an SMEM descriptor or from TMEM; B always comes from an SMEM descriptor. Sparse and block-scaled variants add metadata and scale-factor operands.
LogicalResult verify_sm100_umma(MmaUse use, UmmaKind kind) {
    require(use.result.residency == TENSOR_MEMORY);
    require(use.b.is_smem_descriptor);
    require(use.a.is_smem_descriptor || use.a.residency == TENSOR_MEMORY);
    require(is_supported_umma_shape(use.shape));
    require(is_supported_umma_element_tuple(use, kind));
    if (kind.is_sparse) {
        require(use.sparse_metadata.valid);
    }
    if (kind.is_block_scaled) {
        require(use.scale_factors.valid);
        require(use.scale_factors.type == e8m0_type());
    }
    return success();
}
Two-CTA and cluster variants belong to the UMMA contract too — they affect TMEM allocation, write-disable behaviour, and barrier transaction counts.
SM100 UMMA Block-Scaled (atom_K, vecSize) Atoms
SM100 UMMA's block-scaled MMA atom family covers FP4 and FP8 microscale matrix multiplication with per-block scale factors in tensor memory. The verifier sub_14B71C0 enumerates exactly three legal (atom_K, vecSize) pairs and returns a packed encoding (atom_K << 32) | vecSize (or zero on error). Callers mask the result with ~7 to extract a 3-bit tag from the low bits, and the atom builder records that tag to track which block-scaled variant the op carries.
| (atom_K, vecSize) | A type x B type | Scale type | PTX kind | Packed return |
|---|---|---|---|---|
| (32, 32) | FP8 x FP8 | E8M0 | kind::f8f6f4 | 0x2000000020 |
| (64, 16) | FP4 x FP4 | E8M0 / E4M3FN | kind::mxf4 (OCP MX-FP4) | 0x4000000010 |
| (64, 32) | FP4 x FP4 | E8M0 | kind::mxf4nvf4 (NVFP4 block-64) | 0x4000000020 |
The accumulator type is hard-locked to Float32 across all three variants, regardless of input element type. Any other accumulator type triggers "expects c type to be Float32" and the op fails before lowering.
cute_nvgpu carries two 4-bit element-type TypeIDs sharing the same .data.rel.ro slot at &unk_5BE6068: Float4E2M1FN is the IEEE-style OCP MX-FP4 encoding (2 exponent, 1 mantissa, finite-only), and FloatNV4E0M3F is NVIDIA's NVFP4 fixed-point encoding (0 exponent, 3 mantissa). They share the slot because both are 4-bit packed types, but the dispatcher in sub_14B71C0 distinguishes them by the sf_a and sf_b scale-factor element types. When sf_a == sf_b == E8M0 the layout is NVFP4 and selects kind::mxf4nvf4. When the scale-factor element type is E4M3FN the layout is OCP MX-FP4 and selects kind::mxf4. A mismatch between sf_a and sf_b triggers "expects sfa/sfb element types to be the same".
The verifier's accept set is the conjunction of four predicates:
- c.elementType == Float32 always.
- (a.elementType, b.elementType, atom_K) matches one of (FP8, FP8, 32) or (FP4, FP4, 64).
- (sf.elementType, vecSize) matches one of (E8M0, 32), (E8M0, 16), or (E4M3FN, 16).
- sf_a.elementType == sf_b.elementType.
Every other combination is rejected by the per-combo expectation diagnostics listed in the nv_tileas page and returns 0. See the Block-Scaled MMA Verification section of nv_tileas Verifiers for the broader verifier context this table summarises, and the Cached Tensor-Memory Predicate section of NVPTX Subtarget Feature Matrix for the tmem feature that gates SM100 atoms.
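The accept set and the packed return value can be sketched as one function. Everything named here (`InputKind`, `ScaleKind`, `pack_block_scaled`) is hypothetical and is not the shape of sub_14B71C0. One further assumption is marked in the code: the four predicates on their own would also admit FP8 inputs with vecSize 16, which is not one of the three tabulated pairs, so the sketch rejects it explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { ELEM_FP4, ELEM_FP8, ELEM_OTHER } InputKind;
typedef enum { SF_E8M0, SF_E4M3FN, SF_OTHER } ScaleKind;

/* Returns (atom_K << 32) | vecSize for a legal combination, 0 otherwise. */
uint64_t pack_block_scaled(InputKind a, InputKind b, bool acc_is_f32,
                           ScaleKind sf_a, ScaleKind sf_b,
                           uint32_t atom_k, uint32_t vec_size) {
    if (!acc_is_f32) return 0;   /* "expects c type to be Float32" */
    if (sf_a != sf_b) return 0;  /* sfa/sfb element types must agree */

    bool fp8 = a == ELEM_FP8 && b == ELEM_FP8 && atom_k == 32;
    bool fp4 = a == ELEM_FP4 && b == ELEM_FP4 && atom_k == 64;
    if (!fp8 && !fp4) return 0;

    bool sf_ok = (sf_a == SF_E8M0 && (vec_size == 32 || vec_size == 16)) ||
                 (sf_a == SF_E4M3FN && vec_size == 16);
    if (!sf_ok) return 0;

    /* ASSUMPTION: restrict to the three tabulated pairs, so the FP8 row
     * only accepts vecSize 32. */
    if (fp8 && vec_size != 32) return 0;

    return ((uint64_t)atom_k << 32) | vec_size;
}
```

The three legal calls reproduce the packed values in the table: 0x2000000020, 0x4000000010, and 0x4000000020.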
SM120 and SM121 Block-Scaled MMA
SM120 keeps block-scaled MMA register-resident and uses two scale-factor operands — one for A, one for B. That sets it apart from SM100, where block-scaled forms are tied to the tensor-memory path.
LogicalResult verify_sm120_block_scaled(MmaUse use) {
    require(use.a.residency == REGISTER_MEMORY);
    require(use.b.residency == REGISTER_MEMORY);
    require(use.result.residency == REGISTER_MEMORY);
    require(use.scale_a.valid);
    require(use.scale_b.valid);
    require(use.scale_a.type == e8m0_type());
    require(use.scale_b.type == e8m0_type());
    require(use.shape.k == 32 || use.shape.k == 64);
    require(is_supported_sm120_input_type(use.a.type));
    require(is_supported_sm120_input_type(use.b.type));
    return success();
}
For K = 32, FP4, FP6-like, and FP8-like input families are allowed with a fixed scale-vector shape. For K = 64, the accepted input family narrows to FP4-style operands, and the scale-fragment width must match the selected vector size.
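The K-dependent narrowing can be written as a single predicate. The vector sizes used here (32 for K = 32; 16 or 32 for K = 64) are taken from the lattice notes earlier on this page; the enum `Sm120Input` and the function `sm120_k_accepts` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { IN_FP4, IN_FP6, IN_FP8, IN_UNSUPPORTED } Sm120Input;

/* True when the (K, input family, vec_size) combination is accepted:
 *   K = 32: FP4, FP6 or FP8 inputs with vec_size == 32
 *   K = 64: FP4 inputs only, vec_size in {16, 32} */
bool sm120_k_accepts(unsigned k, Sm120Input a, Sm120Input b,
                     unsigned vec_size) {
    if (a == IN_UNSUPPORTED || b == IN_UNSUPPORTED) return false;
    if (k == 32) return vec_size == 32;
    if (k == 64)
        return a == IN_FP4 && b == IN_FP4 &&
               (vec_size == 16 || vec_size == 32);
    return false;
}
```

This covers only the shape and input-family part of the contract; the residency and E8M0 scale-type checks in verify_sm120_block_scaled still apply.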
Per-Atom Operand-Layout Contracts
The tables below document the per-thread fragment counts and per-operand layout pieces every MMA atom records. Each row corresponds to one PTX instruction shape; the verifier emits the exact same numbers when reconstructing the canonical reference layout. All entries assume a 32-thread warp unless otherwise noted; SM90 WGMMA and SM100 UMMA also reference a 128-thread warp-group footprint.
SM70 / SM75 m16n8k8 f16
| Operand | Memory class | Per-thread elements | Per-thread layout footprint |
|---|---|---|---|
| A | register | 4 (f16, packed as 2 x i32) | (2, 2, 2) : (8, 1, 16) — 4 rows x 2 mode-K lanes |
| B | register | 2 (f16, packed as 1 x i32) | (2, 2) : (1, 16) — 2 cols x 2 mode-K lanes |
| C / D | register | 4 (f32 or f16) | (2, 2, 2) : (4, 1, 8) |
The legacy SM70 m8n8k4 form keeps the same memory class but uses smaller fragments — 4 elements per thread total across A and B combined.
SM80 dense m16n8k16 f16
| Operand | Memory class | Per-thread elements | Per-thread layout footprint |
|---|---|---|---|
| A | register | 8 (f16, packed as 4 x i32) | (2, 2, 2, 2) : (1, 16, 8, 128) |
| B | register | 4 (f16, packed as 2 x i32) | (2, 2, 2) : (1, 8, 16) |
| C / D | register | 4 (f32) | (2, 2, 2) : (4, 1, 8) |
These per-thread counts match the seven-arm dispatch table — dense f16 rests at 8 elements per thread, dense s8 and FP8 paths jump to 16 by widening K from 16 to 32 against the same lane footprint.
SM80 INT8 sparse m16n8k32 s8/s32
| Operand | Memory class | Per-thread elements | Per-thread layout footprint |
|---|---|---|---|
| A (structurally sparse) | register | 8 (s8, packed as 2 x i32) — half the dense count | (2, 2, 2) : (1, 32, 128) |
| B | register | 8 (s8, packed as 2 x i32) | (2, 2, 2) : (1, 16, 32) |
| C / D | register | 4 (s32) | (2, 2, 2) : (4, 1, 8) |
| Sparse metadata | register | 1 (u32) — 16 metadata pairs per warp | (1) : (1) with metadata-stride encoded via the (0x200000, 0x4000000, 0x8000000) triple |
| Sparsity selector | immediate | implicit — selector 0 means alternating-pair pattern | not represented as an operand |
The sparse A fragment carries 8 packed s8 values rather than the dense 16; the metadata operand encodes which two of every four positions are non-zero. The selector is not a separate operand at the IR level — it lives in the atom's textual mnemonic and is folded into the PTX form at lowering time. Slot 3 of the synthesised MmaLayoutResult (152 bytes per slot) holds the metadata layout; verification compares it against the declared layout under the same equivalence predicate it uses for A, B, and D.
SM89 FP8 m16n8k32 e4m3/e5m2
| Operand | Memory class | Per-thread elements | Per-thread layout footprint |
|---|---|---|---|
| A | register | 16 (e4m3 or e5m2, packed as 4 x i32) | (2, 2, 2, 2) : (1, 32, 16, 256) |
| B | register | 8 (FP8, packed as 2 x i32) | (2, 2, 2) : (1, 16, 32) |
| C / D | register | 4 (f32) | (2, 2, 2) : (4, 1, 8) |
Both operands may pick e4m3 or e5m2 independently. The verifier checks each operand's type against the FP8 union; mixed FP8 input pairs (one e4m3, one e5m2) are legal as long as the accumulator is f32.
SM90 WGMMA m64nNk16 f16 (canonical Hopper)
| Operand | Memory class | Per-warp-group elements | Per-thread layout / descriptor source |
|---|---|---|---|
| A | SMEM descriptor or register fragment | 64 * 16 = 1024 (across the 128-thread warp-group) | descriptor encodes (64, 16) : (16, 1) row-major tile with 128-B swizzle |
| B | SMEM descriptor | 16 * N per WGMMA instance | descriptor encodes (16, N) : (N, 1) with matching swizzle |
| C / D | register fragment (warp-group) | 64 * N / 128 per thread (e.g., N=128 -> 64 elements per thread) | (2, 2, ..., 2) : (...) derived from the warp-group canonical fragment |
Per-thread fragment count for C/D is the tile area divided by the 128-thread warp-group footprint: 64 * N / 128 = N / 2. For N = 128 each thread holds 64 accumulator elements; for N = 256, 128 elements; for N = 8, 4 elements. The SMEM descriptors carry the swizzle field (128-B / 64-B / 32-B per the canonical table) so two warps in the group can stream operands without bank conflicts.
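The accumulator arithmetic is small enough to check directly. `wgmma_acc_elems_per_thread` is a hypothetical helper that encodes the 64 * N / 128 rule stated above.

```c
#include <assert.h>

enum { WGMMA_M = 64, WARP_GROUP_THREADS = 128 };

/* Accumulator elements held by each thread of the 128-thread warp-group
 * for an m64nN tile: 64 * N / 128 = N / 2. */
unsigned wgmma_acc_elems_per_thread(unsigned n) {
    assert(n >= 8 && n <= 256 && n % 8 == 0);
    return (WGMMA_M * n) / WARP_GROUP_THREADS;
}
```

A register allocator sizing the accumulator fragment would call this once per tile shape; N = 256 is the worst case at 128 elements per thread.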
SM100 UMMA m64nNk16 f16 (single-CTA)
| Operand | Memory class | Per-warp-group elements | Per-thread layout / descriptor source |
|---|---|---|---|
| A | SMEM descriptor or TMEM | 64 * 16 per instance | descriptor or TMEM column range; layout (64, 16) : (16, 1) |
| B | SMEM descriptor | 16 * N per instance | descriptor; layout (16, N) : (N, 1) |
| D | TMEM | 64 * N per instance | TMEM column-range; persists across wait |
For the 2-CTA cooperative variant the M extent doubles to 128 and the TMEM accumulator is striped across two CTAs in the cluster; the verifier checks the cluster-shape attribute against the atom's cluster requirement.
SM100 UMMA block-scaled (atom_K=64, vecSize=32) FP4 / NVFP4
| Operand | Memory class | Per-warp-group elements | Notes |
|---|---|---|---|
| A | SMEM descriptor or TMEM | M * 64 per instance, 4-bit packed | Float4E2M1FN (OCP MX-FP4) or FloatNV4E0M3F (NVFP4) depending on sf_a type |
| B | SMEM descriptor | 64 * N per instance, 4-bit packed | same FP4 encoding as A |
| C / D | TMEM | M * N per instance, f32 | accumulator hard-locked to Float32 |
| Scale factor A | TMEM column | M * (64 / vecSize) = M * 2 per instance | E8M0 for NVFP4; E4M3FN rejected at this vecSize |
| Scale factor B | TMEM column | (64 / vecSize) * N = 2 * N per instance | matches A's scale-factor element type |
Scale factor vectors live in TMEM columns next to the accumulator; the layout walk for each scale-factor operand mirrors the consumer's vec-size walk through the K axis. The verifier rejects any combination outside the three legal (atom_K, vecSize) pairs documented earlier in this page, using the per-combo expectation diagnostics listed under nv_tileas Verifiers — Block-Scaled MMA Verification.
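The scale-factor counts in the table follow directly from the number of vecSize-wide blocks along K. A minimal sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>

/* Scale factors for operand A: one per vecSize-wide K block for each of M rows. */
int scale_factor_a_count(int m, int atom_k, int vec_size) {
    assert(vec_size > 0 && atom_k % vec_size == 0);
    return m * (atom_k / vec_size);
}

/* Scale factors for operand B: one per vecSize-wide K block for each of N columns. */
int scale_factor_b_count(int n, int atom_k, int vec_size) {
    assert(vec_size > 0 && atom_k % vec_size == 0);
    return (atom_k / vec_size) * n;
}
```

With atom_K = 64 and vecSize = 32 each row or column carries two scale factors, which is the M * 2 and 2 * N in the table.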
SM120 block-scaled m16n8k32 FP4 / FP8 (register-resident)
| Operand | Memory class | Per-thread elements | Notes |
|---|---|---|---|
| A | register | 8 (fp4) or 16 (fp8, packed as 4 x i32) | per-thread layout from the SM80 dispatch table, narrowed for FP4 |
| B | register | 4 (fp4) or 8 (fp8) | same pack convention |
| C / D | register | 4 (f32) | accumulator hard-locked to f32 |
| Scale factor A | register | 1 (E8M0, packed as 1 x i32 per warp tile) | per-A-block scale vector |
| Scale factor B | register | 1 (E8M0, packed as 1 x i32 per warp tile) | per-B-block scale vector |
The consumer Blackwell path keeps every operand in registers — no TMEM dependency. The two scale-factor operands enter the inline-asm fragment as two extra r-constraint inputs alongside the A, B, and D register vectors.
Operand Layout Grammar
MMA atoms use cute layout algebra to record which thread owns which fragment element. A verifier reconstructs the expected layout for the atom and compares it against the declared one:
LogicalResult verify_operand_layout(MmaAtom atom, OperandRole role, Layout layout) {
    Layout expected = expected_mma_layout(atom, role);
    require(layouts_equivalent(normalize_layout(layout), normalize_layout(expected)));
    require(layout_is_static(layout));
    require(!layout_has_scaled_basis(layout));
    return success();
}
For WGMMA and UMMA the layout often lives in a descriptor rather than a lane-by-lane register layout. The verifier still derives the descriptor from layout algebra and rejects descriptors the declared layout cannot explain.
Invariants
- The target supports the tier named by the atom.
- Operand residency matches the tier: registers, SMEM descriptor, or TMEM.
- MMA shape and element-type tuples are checked together.
- Sparse atoms carry valid metadata.
- Block-scaled atoms carry valid scale factors and scale-vector parameters.
- WGMMA lowering emits fence, async MMA, commit, and wait in order.
- UMMA lowering preserves TMEM allocation and CTA-group semantics.
- SM120 uses two scale-factor operands and preserves the uppercase SM120 spelling.
Cross-References
SM Tier Roster and Copy Atom Registry — Atom TypeID Registry lists every MMA atom alongside the copy atoms that feed it, and Copy Atom Operand-Layout Contracts documents the LDSM/STSM/TMA/TMEM-copy atoms that move operands into the residencies these MMA atoms require. Mode Pattern Verifiers — UMMA Canonical Layout Verifier and SM120 Block-Scaled Lattice cover the verifier ladders that consume the operand-layout contracts in this page. Layout Algebra and Descriptor Grammar — Swizzle Operator covers the bit-manipulation formula that feeds the WGMMA descriptor's swizzle_mode field. TMA Atoms — Atom Family covers the descriptor-driven TMA family that produces the SMEM tiles every WGMMA and UMMA atom in this page reads through descriptors.
TMA Atoms
Abstract
The cute_nvgpu TMA atom family surfaces Hopper and Blackwell tensor-memory transfers as descriptor-driven IR. A TMA descriptor records the global tensor, tile box, strides, rank, swizzle, fill behaviour, and cache policy. Executable TMA atoms bind that descriptor to coordinates, an mbarrier, optional multicast state, and cache hints, then lower to asynchronous tensor copy or reduce instructions. This page documents the atom family, the descriptor contract, the verifier rules, and the lowering shape.
Atom Family
| Operation | Role |
|---|---|
| atom.tma_load | Execute asynchronous global-to-shared tensor load. |
| atom.tma_store | Execute asynchronous shared-to-global tensor store. |
| atom.tma_reduce | Execute asynchronous tensor reduction into global memory. |
| atom.non_exec_tiled_tma_load | Describe a tiled TMA load before mbarrier/cache binding. |
| atom.non_exec_tiled_tma_store | Describe a tiled TMA store before execution binding. |
| atom.non_exec_tiled_tma_reduce | Describe a tiled TMA reduce before execution binding. |
| prefetch_tma_desc | Prefetch descriptor state before a transfer. |
| tma_descriptor_tiled | Descriptor type for ordinary tiled tensor movement. |
| tma_descriptor_im2col | Descriptor type for im2col tensor movement. |
| atom.make_exec_tma | Bind a non-exec atom with mbarrier, multicast, and cache mode. |
The non-exec atoms pay off because layout and partitioning can be verified before any pass commits to a runtime barrier or cache policy.
Partition Op and Mode Enums
The TMA atom family rooted at cute_nvgpu.atom.tma_partition routes every executable and non-exec TMA atom through one partition op — the canonical place where descriptor shape, transfer mode, multicast cardinality, and reduce kind are validated together. The partition verifier enforces eleven invariants on every TMA partition op and, on success, returns a packed result record per partitioned tile.
Three mode enums select the transfer variant. Load-mode covers single-CTA, two-CTA cooperative, and warp-multicast loads at two granularities; store-mode covers tiled stores and im2col-flavour stores; reduce-kind covers the asynchronous reduces the hardware supports.
typedef enum TmaLoadMode {
    TMA_LOAD_NO_MULTICAST   = 0,  // single-CTA load
    TMA_LOAD_TWO_CTA        = 1,  // 2-CTA cluster cooperative load
    TMA_LOAD_W_MULTICAST    = 2,  // warp multicast (16-thread)
    TMA_LOAD_W128_MULTICAST = 3,  // wide warp multicast (128-thread)
} TmaLoadMode;

typedef enum TmaStoreMode {
    TMA_STORE_TILED       = 0,  // tiled SMEM -> GMEM
    TMA_STORE_IM2COL      = 1,  // im2col-flavor tiled store
    TMA_STORE_IM2COL_W    = 2,  // im2col + warp multicast
    TMA_STORE_IM2COL_W128 = 3,  // im2col + wide warp multicast (128-thread)
} TmaStoreMode;

typedef enum TmaReduceKind {
    TMA_REDUCE_ADD = 0, TMA_REDUCE_MIN = 1, TMA_REDUCE_MAX = 2,
    TMA_REDUCE_INC = 3, TMA_REDUCE_DEC = 4,
    TMA_REDUCE_AND = 5, TMA_REDUCE_OR  = 6, TMA_REDUCE_XOR = 7,
} TmaReduceKind;
The enums are part of the verifier's input contract. Consistency between load mode, store mode, and reduce kind is checked together with rank and swizzle in the eleven-step walk below.
Partition Result ABI
The partition verifier returns one packed TmaPartitionResult per partitioned tile into a SmallVector owned by the verifier and forwarded to the executable-atom builder. The 24-byte record carries the interned TMA tensor type, the tile element count, a flags word, and the non-exec atom body that downstream lowering consumes.
typedef struct TmaPartitionResult {
    /*+0x00*/ uint64_t tma_tensor_type;     // interned MLIR Type * (TmaLoad/Store/ReduceAtomType)
    /*+0x08*/ uint32_t tile_element_count;  // size(canonical_smem) * num_multicast
    /*+0x0C*/ uint16_t flags;               // see "Flags Word" below
    /*+0x0E*/ uint8_t  swizzle_mode;        // 0=none, 2=128B, 3=32B/64B blend
    /*+0x0F*/ uint8_t  rank;                // descriptor rank (1..5)
    /*+0x10*/ uint64_t non_exec_atom_body;  // interned non-exec atom Attribute *
} TmaPartitionResult;
Only the tensor type and atom body get consumed during executable-atom binding. The flags word, swizzle mode, and rank are echoed back so the executable-atom builder and downstream prefetch logic do not have to re-derive them from the descriptor type.
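A reimplementation can pin the 24-byte layout at compile time. The sketch below restates the struct and checks the documented offsets with static_assert from assert.h (C11). The helper tma_result_rank_in_range is a hypothetical convenience and is not part of the recovered ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TmaPartitionResult {
    uint64_t tma_tensor_type;     /* +0x00 */
    uint32_t tile_element_count;  /* +0x08 */
    uint16_t flags;               /* +0x0C */
    uint8_t  swizzle_mode;        /* +0x0E */
    uint8_t  rank;                /* +0x0F */
    uint64_t non_exec_atom_body;  /* +0x10 */
} TmaPartitionResult;

/* Fail the build if the natural C layout drifts from the documented record. */
static_assert(sizeof(TmaPartitionResult) == 24, "record must be 24 bytes");
static_assert(offsetof(TmaPartitionResult, tile_element_count) == 0x08, "count at +0x08");
static_assert(offsetof(TmaPartitionResult, flags) == 0x0C, "flags at +0x0C");
static_assert(offsetof(TmaPartitionResult, swizzle_mode) == 0x0E, "swizzle at +0x0E");
static_assert(offsetof(TmaPartitionResult, rank) == 0x0F, "rank at +0x0F");
static_assert(offsetof(TmaPartitionResult, non_exec_atom_body) == 0x10, "body at +0x10");

/* Hypothetical helper: the echoed rank must stay inside the 1..5 descriptor range. */
bool tma_result_rank_in_range(const TmaPartitionResult *r) {
    assert(r != NULL);
    return r->rank >= 1 && r->rank <= 5;
}
```

Because every field sits at its natural alignment, no packing pragma is needed on common ABIs; the static assertions catch any platform where that does not hold.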
Flags Word
The 16-bit flags word records every property the partition verifier learned about the tile while it was walking the layout — the multicast mode, the im2col shape, the sparsity tier, the two-CTA cooperative bit, and a handful of operand-source bits used to short-circuit later checks. Downstream passes read this word bit-by-bit rather than re-running the partition algorithm.
| Bit | Field | Meaning |
|---|---|---|
| 0 | multicast | tile lowers to a multicast TMA load (W or W128) |
| 1 | im2col | tile is im2col-flavoured (rank reduced before transfer) |
| 2 | im2col_w | im2col with warp-cooperative offset table |
| 3 | im2col_w128 | im2col with wide warp-cooperative offset table (128-thread) |
| 4 | two_cta | 2-CTA cooperative load; CTA V-map has been folded into the SMEM layout |
| 5 | sparse | metadata operand present; sparsity-aware stride walk |
| 6 | static_smem | SMEM layout passed the static-shape predicate |
| 7 | static_vmap | CTA V-map passed the static-shape predicate |
| 8 | gmem_int_stride | GMEM layout passed the integer-stride walk |
| 9 | smem_int_stride | SMEM layout passed the integer-stride walk |
| 10 | shape_equiv | top-level shape equivalence between SMEM and V-map held |
| 11 | g_basis_ok | G-basis computation returned a valid layout |
| 12 | s2t_descriptor | result wraps a get_copy_s2t_smem_desc view (Blackwell SMEM-to-tmem descriptor) |
| 13 | prefetch_eligible | descriptor handle survives prefetch (no per-axis dynamism that would invalidate it) |
| 14 | reserved | — |
| 15 | reserved | — |
Bits 6 through 13 mirror the predicate checks the eleven-step verifier ran in steps 4 through 7 and 11. Folding the outcomes back into the flags word lets the executable-atom builder skip the equivalent predicates entirely — the partition verifier is the only place where these layout invariants get checked.
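Reading the flags word bit-by-bit might look like the sketch below. The bit positions are copied from the table; the enumerator and function names are illustrative, and the expectation that reserved bits stay clear is an assumption rather than a documented rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the flags table; names are illustrative. */
typedef enum TmaFlagBit {
    TMA_FLAG_MULTICAST = 0,        TMA_FLAG_IM2COL = 1,
    TMA_FLAG_IM2COL_W = 2,         TMA_FLAG_IM2COL_W128 = 3,
    TMA_FLAG_TWO_CTA = 4,          TMA_FLAG_SPARSE = 5,
    TMA_FLAG_STATIC_SMEM = 6,      TMA_FLAG_STATIC_VMAP = 7,
    TMA_FLAG_GMEM_INT_STRIDE = 8,  TMA_FLAG_SMEM_INT_STRIDE = 9,
    TMA_FLAG_SHAPE_EQUIV = 10,     TMA_FLAG_G_BASIS_OK = 11,
    TMA_FLAG_S2T_DESCRIPTOR = 12,  TMA_FLAG_PREFETCH_ELIGIBLE = 13
} TmaFlagBit;

/* Test one documented bit of the 16-bit flags word. */
bool tma_flag_test(uint16_t flags, TmaFlagBit bit) {
    assert(bit <= TMA_FLAG_PREFETCH_ELIGIBLE);  /* bits 14 and 15 are reserved */
    return ((flags >> bit) & 1u) != 0;
}

/* Assumption: a well-formed word leaves the two reserved bits clear. */
bool tma_flags_reserved_clear(uint16_t flags) {
    return (flags & 0xC000u) == 0;
}
```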
Eleven-Step Partition Verifier
The partition verifier walks eleven invariants in fixed order. Each gate emits a verbatim diagnostic on failure; the strings are part of the user-visible contract and a reimplementation must preserve them byte-for-byte.
| # | Step | Verbatim diagnostic |
|---|---|---|
| 1 | Type gate on the SMEM and GMEM operands | "invalid operand types, got " |
| 2 | SMEM layout-kind gate (LayoutType or ComposedLayoutType) | "invalid smem layout type, expected LayoutType or ComposedLayoutType, got " |
| 3 | GMEM layout-kind gate | "unsupported layout for the GMEM tensor, got " |
| 4 | Integer-stride walk on both layouts | "expected the GMEM and SMEM layouts to have integer stride elements, but got " |
| 5 | SMEM layout must be a swizzle layout | "expected the SMEM layout to be a swizzle layout, but got " |
| 6 | SMEM layout and CTA V-map must be static | "expected the SMEM layout and the CTA V-map to be static, but got " |
| 7 | Top-level shape equivalence between SMEM and V-map | "expected top-level shape equivalence between the SMEM layout and the CTA V-map, but got " |
| 8 | TMA G-basis computation | "failed to compute the TMA G-basis, got " |
| 9 | Final TMA layout validity check | "Computed TMA layout is invalid, got " |
| 10 | TMA tensor-type construction | "Failed to construct the TMA tensor type" |
| 11 | Multicast-count consistency (load variant only) | "missing or invalid num_multicast for a multicast TMA load" |
Order matters. The cheap type and structural gates — steps 1 through 6 — run before the more expensive G-basis and layout-product computations in steps 8 and 9. Step 11 is specific to the load variant; the store and reduce variants skip it because TMA store and reduce never take a multicast operand.
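The fixed ordering can be modelled as a table walk that stops at the first failing gate. In this sketch the diagnostic prefixes are copied from the table above, while the `passed` bitmask input and the function name are hypothetical stand-ins for the real per-gate predicates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Diagnostic prefixes from the eleven-step table, indexed by step - 1. */
static const char *const kTmaPartitionDiag[11] = {
    "invalid operand types, got ",
    "invalid smem layout type, expected LayoutType or ComposedLayoutType, got ",
    "unsupported layout for the GMEM tensor, got ",
    "expected the GMEM and SMEM layouts to have integer stride elements, but got ",
    "expected the SMEM layout to be a swizzle layout, but got ",
    "expected the SMEM layout and the CTA V-map to be static, but got ",
    "expected top-level shape equivalence between the SMEM layout and the CTA V-map, but got ",
    "failed to compute the TMA G-basis, got ",
    "Computed TMA layout is invalid, got ",
    "Failed to construct the TMA tensor type",
    "missing or invalid num_multicast for a multicast TMA load",
};

/* Bit (step - 1) of `passed` is set when that gate passes (hypothetical input).
 * Returns the diagnostic of the first failing gate, or NULL when all pass.
 * Step 11 is consulted only for the load variant. */
const char *first_failing_gate(uint16_t passed, int is_load) {
    assert((passed >> 11) == 0);  /* only eleven gates exist */
    int last = is_load ? 11 : 10;
    for (int step = 1; step <= last; ++step) {
        if (((passed >> (step - 1)) & 1u) == 0) {
            return kTmaPartitionDiag[step - 1];
        }
    }
    return NULL;
}
```

A store or reduce op whose first ten gates pass returns NULL even when bit 10 is clear, which mirrors the rule that only loads run step 11.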
A twelfth string "got num_multicast of " is emitted as a companion to step 11 when the load mode is non-multicast (mode 0 or mode 1 in TmaLoadMode) but the supplied num_multicast value is not 1. The two error paths share the same FAIL label and treat the pair as one diagnostic: a missing or zero multicast count for a multicast mode, or a non-unit count for a non-multicast mode.
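The paired check can be sketched as one function that returns the diagnostic prefix to emit, or NULL on success. The sketch assumes the W and W128 load modes are the multicast modes, as bit 0 of the flags word describes, and represents a missing count as 0; both choices and the function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum TmaLoadMode {
    TMA_LOAD_NO_MULTICAST = 0,
    TMA_LOAD_TWO_CTA = 1,
    TMA_LOAD_W_MULTICAST = 2,
    TMA_LOAD_W128_MULTICAST = 3
} TmaLoadMode;

/* Assumption: only the W and W128 modes multicast. */
bool tma_load_mode_is_multicast(TmaLoadMode mode) {
    return mode == TMA_LOAD_W_MULTICAST || mode == TMA_LOAD_W128_MULTICAST;
}

/* Returns NULL when num_multicast is consistent with the mode,
 * otherwise the diagnostic prefix for the failing path. */
const char *check_num_multicast(TmaLoadMode mode, int num_multicast) {
    assert(mode <= TMA_LOAD_W128_MULTICAST);
    if (tma_load_mode_is_multicast(mode)) {
        /* Missing (modelled as 0) or non-positive counts are invalid. */
        return num_multicast >= 1
                   ? NULL
                   : "missing or invalid num_multicast for a multicast TMA load";
    }
    /* Non-multicast modes accept only a unit count. */
    return num_multicast == 1 ? NULL : "got num_multicast of ";
}
```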
Treat only the descriptor base pointer, per-axis dimension sizes, and non-leading strides as device-mutable. Rank, element type, swizzle, multicast count, and mode are descriptor-construction facts and cannot change once the partition op has verified.
Worked Example: Rank-6 Rejection
A TMA descriptor builder consuming a rank-6 input lands on step 1 of the partition verifier. The SMEM and GMEM types print as a rank-6 layout, which is outside the accepted LayoutType / ComposedLayoutType union the partition core requires, and the diagnostic chain emits the verbatim ladder shown below before the verifier returns failure.
// Input op — rank 6 is one above the TMA hardware cap.
%bad = cute_nvgpu.atom.non_exec_tiled_tma_load
%desc_r6, %tile_r6, %cta_map_r6 : !cute_nvgpu.tma_descriptor_tiled
error: invalid operand types, got !cute.layout<(a,b,c,d,e,f),...>, !cute.layout<(a,b,c,d,e,f),...>, and !cute.layout<...>
The rank-6 tile prints into the first <smem_ty> slot, the rank-6 GMEM type into the second <gmem_ty> slot, and the CTA V-map into the trailing <v_map_ty> slot. The verifier prints all three because the failing condition is a combination — the type gate runs on the trio as a unit, so the diagnostic must show every operand that participated.
A stride-4-byte (not 16-byte-aligned) input fails later, at step 4. Steps 1 through 3 pass because the layout kinds are accepted; step 4 walks the GMEM stride tuple, finds a non-integer or below-16-byte entry, and emits:
error: expected the GMEM and SMEM layouts to have integer stride elements, but got !cute.layout<...>, and !cute.layout<...>
The two printed layouts are the GMEM and SMEM layouts in the same order step 1 printed them. The " and " separator between the two arguments comes from the same shared format helper the type-gate diagnostic uses.
Descriptor Builder
Descriptor construction consumes a global tensor, a layout, dynamic shapes, dynamic strides, padding values, TMA mode, store mode, element width, multicast metadata, and operand segment sizes.
TmaDescriptor build_tma_descriptor(Tensor tensor,
                                   Layout layout,
                                   ArrayRef<Value> shapes,
                                   ArrayRef<Value> strides,
                                   TmaMode mode,
                                   TmaStoreMode store_mode) {
    require(tensor.memory_space == GLOBAL_MEMORY);
    require(rank(tensor) >= 1 && rank(tensor) <= 5);
    require(!is_composed_layout(layout));
    require(layout_is_static_enough_for_tma(layout));
    TmaDescriptor desc;
    desc.base = tensor.base;
    desc.element_bits = bit_width(tensor.element_type);
    desc.rank = rank(tensor);
    desc.box = compute_box_sizes(layout, shapes);
    desc.strides = compute_tma_strides(layout, strides);
    desc.mode = mode;
    desc.store_mode = store_mode;
    desc.cache_policy = default_cache_policy();
    return desc;
}
The first box dimension multiplied by the element bit width must be a whole multiple of the TMA transfer granularity. Padding values are restricted: non-zero padding requires a mode that explicitly supports it.
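The divisibility rule is easy to express once a granularity is fixed. This page does not state the value, so the sketch below assumes 16 bytes (128 bits), which is the figure the CUDA driver API documents for the innermost box dimension of non-interleaved tensor maps. The constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed transfer granularity: 16 bytes. Not stated on this page. */
enum { TMA_GRANULARITY_BITS = 128 };

/* True when box0 * element_bits is a whole multiple of the granularity. */
bool tma_first_box_dim_ok(uint32_t box0, uint32_t element_bits) {
    assert(element_bits > 0);
    return ((uint64_t)box0 * element_bits) % TMA_GRANULARITY_BITS == 0;
}
```

Under that assumption a 64-element f16 box row (1024 bits) passes, while a 4-element f16 row (64 bits) is rejected.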
Non-Exec Atom Verification
The shared non-exec verifier checks the tuple of shared-memory layout, global layout, partitioner tile, and CTA value map. Success yields a TMA tensor type and a non-executing atom body ready to bind to runtime state later.
LogicalResult verify_non_exec_tma(NonExecTmaAtom atom) {
    require(is_smem_layout(atom.smem_layout));
    require(is_global_layout(atom.global_layout));
    require(is_tile_like(atom.partitioner));
    require(is_cta_value_map(atom.cta_v_map));
    require(smem_layout_uses_supported_swizzle(atom.smem_layout));
    require(layouts_are_statically_resolvable(atom.smem_layout, atom.cta_v_map));
    require(tma_partition_is_valid(atom));
    return success();
}
Load, store, and reduce variants add mode-specific checks. TMA reduce accepts only the reductions the target instruction family supports.
Executable Atom Binding
atom.make_exec_tma turns a non-exec atom into an executable atom by attaching runtime state:
ExecTmaAtom make_exec_tma(NonExecTmaAtom atom,
                          MBarrier barrier,
                          CacheMode cache,
                          Optional<MulticastMask> multicast) {
    require(atom.verified);
    require(barrier.memory_space == SHARED_MEMORY);
    ExecTmaAtom exec;
    exec.atom = atom;
    exec.barrier = barrier;
    exec.cache_mode = cache;
    exec.multicast = multicast;
    return exec;
}
Executable TMA lowering increments the barrier transaction count by the number of bytes the transfer will complete.
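The byte count added to the barrier is the tile's element count times the element width. A minimal sketch of that arithmetic follows; the function name is hypothetical, and requiring the tile to fill whole bytes is an assumption made so that sub-byte types such as FP4 stay well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes one transfer completes: tile elements times element width, in bytes. */
uint32_t tma_expect_tx_bytes(uint32_t tile_elements, uint32_t element_bits) {
    uint64_t bits = (uint64_t)tile_elements * element_bits;
    assert(bits % 8 == 0);  /* sub-byte element types must fill whole bytes */
    return (uint32_t)(bits / 8);
}
```

For example, a 64 x 64 tile of f16 values adds 8192 bytes to the expected transaction count.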
Lowering Shape
void lower_tma_load(ExecTmaAtom atom, MemRef dst, Coord coord) {
    require(atom.atom.kind == TMA_LOAD);
    require(dst.memory_space == SHARED_MEMORY);
    require(coord.rank == atom.atom.descriptor.rank);
    prefetch_descriptor_if_requested(atom.atom.descriptor);
    emit_cp_async_bulk_tensor_load(atom.atom.descriptor,
                                   dst,
                                   coord,
                                   atom.barrier,
                                   atom.cache_mode,
                                   atom.multicast);
}

void lower_tma_store(ExecTmaAtom atom, MemRef src, Coord coord) {
    require(atom.atom.kind == TMA_STORE);
    require(src.memory_space == SHARED_MEMORY);
    emit_cp_async_bulk_tensor_store(atom.atom.descriptor, src, coord, atom.cache_mode);
}
TMA load completes through an mbarrier — a consumer must wait on the barrier before using the destination tile. TMA store and reduce follow the target's async-bulk ordering rules and must not be reordered across conflicting memory effects.
Descriptor Mutation
Device-side descriptor mutation is limited to three fields: the global base pointer, per-axis dimension extents, and non-leading strides (the leading stride is implicit element-size and never written). The atom dialect exposes those three changes as dedicated update kinds rather than as a general byte write, so verification can reject any other mutation at IR construction time:
void update_tma_descriptor(TmaDescriptor *desc, TmaUpdate update) {
    switch (update.kind) {
    case UPDATE_BASE_POINTER:
        desc->base = update.base;
        break;
    case UPDATE_DIM:
        desc->shape[update.axis] = update.value;
        break;
    case UPDATE_STRIDE:
        require(update.axis > 0);
        desc->strides[update.axis] = update.value;
        break;
    default:
        fail("TMA descriptor field is not device-mutable");
    }
}
The three update kinds map directly to the tensormap.replace.tile.{global_address, global_dim, global_stride} PTX mutator family. The device-side rebind sequence (acquire fence, one address write, one dimension write per axis for rank writes in total, rank - 1 stride writes, release fence) and the proxy-fence ordering that pairs each rebind with its cp.async.bulk.tensor.* consumer are documented in TMA Tensormap and cp.async.bulk — TMA Descriptor Mutators. The descriptor builder above is the partition-time view; the atom-lowering page covers how the runtime side issues those three mutators in the contractually mandated order.
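The size of a full rebind follows from the rank alone. The sketch below counts the mutator writes between the two fences; the function name is hypothetical and the fences themselves are not counted.

```c
#include <assert.h>

/* Mutator writes in one full rebind: 1 address + rank dims + (rank - 1) strides. */
int tma_rebind_mutator_count(int rank) {
    assert(rank >= 1 && rank <= 5);  /* TMA descriptor rank range */
    return 1 + rank + (rank - 1);
}
```

A rank-3 descriptor therefore needs six mutator writes, and a rank-1 descriptor needs two because it has no non-leading stride.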
If You Know CUTLASS (open source) — cross-walk
Coming from CUTLASS Hopper/Blackwell TMA usage:
| CUTLASS C++ | tileiras IR (cute_nvgpu) |
|---|---|
| cuTensorMapEncodeTiled(&tmap, ...) (host-side CUDA driver API, called at run time) | nv_tileas.make_tiled_tma_desc op materialising a !tma_descriptor_tiled typed value |
| cuTensorMapEncodeIm2col(&tmap, ...) | nv_tileas.make_tiled_tma_desc with im2col mode → !tma_descriptor_im2col |
| cute::SM90_TMA_LOAD::copy(...) | cute_nvgpu.atom.tma_load op (after make_exec_tma binding) |
| cute::SM90_TMA_STORE::copy(...) | cute_nvgpu.atom.tma_store op |
| cute::SM90_TMA_REDUCE_ADD::copy(...) | cute_nvgpu.atom.tma_reduce with kind = TMA_REDUCE_ADD |
| Multicast TMA (SM90_TMA_LOAD_MULTICAST) | tma_load_mode attribute on the partition op |
| cute::prefetch_tma_descriptor(tmap) | cute_nvgpu.prefetch_tma_desc op |
| mbarrier::arrive_and_expect_tx(mbar, bytes) paired with TMA | barrier operand + expect_tx attribute on the executable TMA op |
The structural difference: in CUTLASS the descriptor is an opaque CUtensorMap blob bound at runtime. Tileiras carries rank, element width, swizzle mode, box shape, and stride layout as typed IR attributes the partition verifier re-checks before each TMA op lowers. Device-side mutation is restricted to base pointer, per-axis dimension, and non-leading stride (see Descriptor Mutation above) — the same surface the hardware allows, exposed through dedicated ops rather than raw byte writes.
Worked Example
%desc = nv_tileas.make_tiled_tma_desc %tensor, %layout
    shapes(%m, %n, %k) strides(%sn, %sk) paddings()
    {mode = #cute_nvgpu.tma_load_mode<tiled>,
     elementBitWidth = 16} : !cute_nvgpu.tma_descriptor_tiled
%atom = cute_nvgpu.atom.non_exec_tiled_tma_load %desc, %tile, %cta_map
    {num_multicast = 1}
%exec = cute_nvgpu.atom.make_exec_tma %atom, %mbar
    {cache_mode = #cute_nvgpu.load_cache_mode<cg>}
cute_nvgpu.atom.tma_load %exec, %smem_tile, %coord
    {allow_tma = true, inBounds = true}
After lowering the executable load becomes a cp.async.bulk.tensor-style op with descriptor, coordinate, destination, barrier, and optional cache or multicast modifiers.
Invariants
- TMA rank is between one and five.
- Descriptor pointers are aligned to the hardware descriptor requirement.
- Composed layouts are rejected where the descriptor builder needs a plain static layout.
- Shared-memory layouts use supported swizzle modes.
- Global and shared layouts agree with the partitioner and CTA value map.
- Descriptor base, dimensions, and strides are the only mutable device fields.
- TMA load completion is ordered through an mbarrier.
- Im2col and multicast modes are architecture-gated.
Cross-References
Mode Pattern Verifiers — Swizzle Legality documents the swizzle-legality verifier, which the TMA partition core composes with alongside the UMMA Canonical Layout Verifier and the tcgen05.mma Kind-Word Verifier described on the same page. SM Tier Roster and Copy Atom Registry — Atom TypeID Registry covers the SM90/SM100/SM120 atom interfaces TMA atoms implement. cute Atom Builders and Desugar — Kernel-entry ABI covers the kernel-entry ABI that hoists TMA descriptors as .param constant-space arguments.
Mode Pattern Verifiers
Abstract
Mode-pattern verifiers sit between target-neutral layout algebra and architecture-specific atom lowering. They check LDSM/STSM modes, register fragment sizes, SMEM descriptor layouts, SM120 block-scaled mode parameters, swizzle legality, and TMA rank constraints. The checks are small individually but together they stop invalid atom shapes from reaching NVVM, where the original layout intent would be much harder to diagnose.
LDSM and STSM Matrix
LDSM and STSM atoms accept only a finite set of shape, transpose, size-pattern, and matrix-count combinations.
| Mode | Shape | num_matrices | Accepted size patterns | Transpose |
|---|---|---|---|---|
| .M88 | 8 x 8 | 1, 2, 4 | u16 | no |
| .MT88 | 8 x 8 | 1, 2, 4 | u16 | yes |
| .M816 | 8 x 16 | 1, 2, 4 | u4to8, s4to8, packed 4/6-bit to 8-bit modes | no |
| .M832 | 8 x 32 | 1, 2, 4 | u2to4, s2to4 | no |
| .MT1616 | 16 x 16 | 1, 2 | u8, packed 4/6-bit to 8-bit modes | yes |
LogicalResult verify_ldsm_mode(LdsmMode mode,
                               LdsmSizePattern size,
                               int num_matrices,
                               bool transpose,
                               Shape result_shape) {
    require(num_matrices == 1 || num_matrices == 2 || num_matrices == 4);
    require(transpose == mode.requires_transpose);
    require(size in mode.accepted_sizes);
    require(result_shape.rank == 1);
    require(result_shape.dim(0) == expected_ldsm_extent(mode, num_matrices));
    return success();
}
For binary-compatible diagnostic tests, keep the exact legacy strings where the test suite expects them. For new user-facing documentation and errors, prefer clear corrected wording.
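The matrix table can be encoded as data so that the per-mode matrix counts and transpose requirement are checked together. This is a sketch under assumptions: the enum, struct, and function names are hypothetical, size patterns are left out, and the count mask is simply a compact encoding of the num_matrices column.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum LdsmMode {
    LDSM_M88, LDSM_MT88, LDSM_M816, LDSM_M832, LDSM_MT1616, LDSM_MODE_COUNT
} LdsmMode;

typedef struct LdsmModeInfo {
    int rows, cols;
    unsigned count_mask;  /* bit k set => num_matrices == k is accepted */
    bool transpose;       /* value the transpose flag must have */
} LdsmModeInfo;

/* One row per mode, transcribed from the LDSM/STSM table. */
static const LdsmModeInfo kLdsmModes[LDSM_MODE_COUNT] = {
    [LDSM_M88]    = {8, 8, 0x16, false},   /* 1, 2, 4 */
    [LDSM_MT88]   = {8, 8, 0x16, true},    /* 1, 2, 4 */
    [LDSM_M816]   = {8, 16, 0x16, false},  /* 1, 2, 4 */
    [LDSM_M832]   = {8, 32, 0x16, false},  /* 1, 2, 4 */
    [LDSM_MT1616] = {16, 16, 0x06, true},  /* 1, 2 */
};

/* True when the matrix count and transpose flag are legal for the mode. */
bool ldsm_shape_ok(LdsmMode mode, int num_matrices, bool transpose) {
    assert(mode < LDSM_MODE_COUNT);
    if (num_matrices < 1 || num_matrices > 4) {
        return false;
    }
    const LdsmModeInfo *info = &kLdsmModes[mode];
    return ((info->count_mask >> num_matrices) & 1u) != 0 &&
           transpose == info->transpose;
}
```

Keeping the counts per mode rejects .MT1616 with four matrices, a case the mode-independent 1, 2, or 4 test would let through.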
Shared-Memory Matrix Movement
Load-side matrix atoms move shared memory into registers; store-side atoms move the other way. The verifier checks both memory spaces and the fragment shape.
LogicalResult verify_matrix_space_copy(MatrixCopyOp op) {
    if (op.is_load) {
        require(op.src.memory_space == SHARED_MEMORY);
        require(op.dst.memory_space == REGISTER_MEMORY);
    } else {
        require(op.src.memory_space == REGISTER_MEMORY);
        require(op.dst.memory_space == SHARED_MEMORY);
    }
    require(fragment_shape_matches_mode(op.mode, op.result_shape));
    require(pointer_alignment_meets_atom_requirement(op.shared_operand));
    return success();
}
Register-space copy atoms additionally verify that the register count matches the layout cosize:
LogicalResult verify_register_fragment(Layout layout, int register_count) {
    int expected_bits = 32 * cosize(layout);
    int actual_bits = 32 * register_count;
    require(actual_bits == expected_bits);
    return success();
}
UMMA Canonical Layout Verifier
UMMA atoms require canonical UMMA_MN (matrix-major) or UMMA_K (k-major) layouts for their A and B operands. The UMMA layout verifier enforces those invariants on every mma_atom op before it can lower to PTX. Each gate emits a specific diagnostic, so a layout that survives this pass is structurally valid for the descriptor packer that runs immediately after.
The verifier takes four inputs: a direction that is either UMMA_MN or UMMA_K; an elem_bits width of 4, 8, 16, or 32; a swz_triple (swz_mode, B, M) read from the swizzled descriptor; and the cute.layout being verified. Direction selects the canonical operand orientation, element width sets the expected K-extent, and the swizzle triple picks one of a small accepted set of bit-mask shapes. The layout may be a plain Layout or a ComposedLayout whose inner component is a swizzle — both forms walk uniformly once they pass the first gate.
Seven verbatim diagnostics fire from this verifier. Each is emitted at most once per verification; a failure stops further checking. The strings are part of the user-visible contract, and reproducing them byte-for-byte is required for test suites that match diagnostics by string:
- "unsupported swizzle, got "
- "Not a canonical UMMA_MN Layout: Expected K-size 256/sizeof_bits<T> or 512/sizeof_bits(T) in sparse gemm kernels."
- "Not a canonical UMMA_MN Layout: No flat offset mode"
- "Not a canonical UMMA_MN Layout: Expected stride failure."
- "Not a canonical UMMA_K Layout: Expected MN-size multiple of "
- "Not a canonical UMMA_K Layout: No flat offset mode"
- "Not a canonical UMMA_K Layout: Expected stride failure."
The verifier walks the same shape extraction first, then forks on direction into a UMMA_MN branch and a UMMA_K branch. Each branch reads a per-element-size encoding table that maps elem_bits to two integers (per_lane_count, stride_multiplier) consumed by the rebuilt expected layout; the table also encodes the SM100 TMEM rule that element widths above 32 bits are rejected outright.
1. Entry: classify the swizzle triple. Accepted triples are (0, 2, 5) (no swizzle), (2, 5, 2) (128-byte swizzle), (n, 4, 3) for n in {0..3} (compact/canonical path), and (2, 5, 2) with direction == UMMA_K. Any other triple emits "unsupported swizzle, got " followed by the serialised swizzle.
2. Shape extraction: build a small vector of shape/stride pairs limited to 128 entries (the hard cap on tile dimensions UMMA_MN and UMMA_K accept).
3. Element-size decode: map elem_bits through a 4-byte classification table into per_lane_count and stride_multiplier. The fp4 path produces (4, 8); the default path produces (8, computed); element widths outside the table land on an undefined stride and stop later steps from succeeding.
4. Direction split: direction == 1 enters UMMA_MN; direction == 0 enters UMMA_K; any other value is a bug.
5. UMMA_MN branch:
   a. Read the K-mode size; require K_elements == 256/elem_bits or K_elements == 512/elem_bits (the latter is the sparse-gemm path with doubled K). Failure emits "Not a canonical UMMA_MN Layout: Expected K-size 256/sizeof_bits<T> or 512/sizeof_bits(T) in sparse gemm kernels.".
   b. Synthesize the expected (1-shape, stride_multiplier-stride) / (1-shape, per_lane_count-stride) pair, build the flattened expected layout, and walk a 152-byte-per-slot work vector comparing it to the op's actual modes.
   c. Require every mode to have exactly 80 bytes of flat-mode storage. Failure emits "Not a canonical UMMA_MN Layout: No flat offset mode".
   d. Verify each rebuilt mode's stride matches the (stride_multiplier, per_lane_count) pair from step 3. Failure emits "Not a canonical UMMA_MN Layout: Expected stride failure.".
6. UMMA_K branch:
   a. Read the MN-mode size; require MN_size % per_lane_count == 0. Failure emits "Not a canonical UMMA_K Layout: Expected MN-size multiple of " followed by the decimal value of per_lane_count and a terminating ".".
   b. Synthesize the expected (1, per_lane_count) / (2, 1) pair, walk the same 152-byte work vector, and require the 80-byte flat-mode condition. Failure emits "Not a canonical UMMA_K Layout: No flat offset mode".
   c. Stride check on the rebuilt modes. Failure emits "Not a canonical UMMA_K Layout: Expected stride failure.".
7. On success, pack (elem_class, k_size, mn_size) as the verifier's result.
LogicalResult verify_umma_canonical_layout(UmmaDirection direction,
                                           uint32_t elem_bits,
                                           SwizzleTriple swz,
                                           LayoutLike layout) {
    if (!is_accepted_swizzle(swz, direction)) {
        return emit("unsupported swizzle, got ") << serialize(swz);
    }
    ElementClass ec = decode_element_class(elem_bits);
    if (!ec.valid) {
        return failure();  // element width above 32 bits — caller diagnoses
    }
    if (direction == UMMA_MN) {
        uint64_t k_elements = product_of(shape_of_k_mode(layout));
        uint64_t expected_dense = 256u / elem_bits;
        uint64_t expected_sparse = 512u / elem_bits;
        if (k_elements != expected_dense && k_elements != expected_sparse) {
            return emit("Not a canonical UMMA_MN Layout: Expected K-size "
                        "256/sizeof_bits<T> or 512/sizeof_bits(T) in sparse "
                        "gemm kernels.");
        }
        if (!has_flat_offset_mode(layout)) {
            return emit("Not a canonical UMMA_MN Layout: No flat offset mode");
        }
        if (!strides_match_expected(layout, ec)) {
            return emit("Not a canonical UMMA_MN Layout: Expected stride failure.");
        }
    } else /* UMMA_K */ {
        uint64_t mn_size = product_of(shape_of_mn_mode(layout));
        if (mn_size % ec.per_lane_count != 0) {
            return emit("Not a canonical UMMA_K Layout: Expected MN-size multiple of ")
                   << ec.per_lane_count << ".";
        }
        if (!has_flat_offset_mode(layout)) {
            return emit("Not a canonical UMMA_K Layout: No flat offset mode");
        }
        if (!strides_match_expected(layout, ec)) {
            return emit("Not a canonical UMMA_K Layout: Expected stride failure.");
        }
    }
    return success();
}
The accepted swizzle set is the small closed enumeration the descriptor packer can express in shared-memory descriptors. (0, 2, 5) is the no-swizzle case; (2, 5, 2) is the 128-byte swizzle; the (n, 4, 3) family with n in {0, 1, 2, 3} covers the 32-, 64-, and 128-byte interleaved variants whose choice depends on operand element width. Any other triple is rejected before any size check runs, keeping the diagnostic specific to the swizzle field rather than blaming a downstream size mismatch.
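The closed enumeration can be written as a direct membership test. The sketch below ignores the direction argument and treats (2, 5, 2) as accepted for both directions, which is an assumption; the struct field names and the function name are also illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* (swz_mode, B, M) as read from the swizzled descriptor. */
typedef struct SwizzleTriple {
    uint32_t mode, b, m;
} SwizzleTriple;

/* Membership test over the accepted set; direction is not modelled here. */
bool swizzle_triple_is_accepted(const SwizzleTriple *s) {
    assert(s != NULL);
    if (s->mode == 0 && s->b == 2 && s->m == 5) return true;  /* no swizzle */
    if (s->mode == 2 && s->b == 5 && s->m == 2) return true;  /* 128-byte swizzle */
    if (s->mode <= 3 && s->b == 4 && s->m == 3) return true;  /* (n, 4, 3), n in 0..3 */
    return false;
}
```

Running this test before any size check keeps the failure attributable to the swizzle field alone, as the paragraph above describes.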
The 152-byte work-vector stride matches the dense per-mode record size used throughout this dialect: shape, stride, and a per-mode decoration word giving three slots per element. The sparse path doubles the K-extent budget (the 512-bit case in step 5a) but the verifier still walks the same 152-byte stride; the metadata operand is verified by a sibling pass once this layout walk succeeds.
A sister verifier runs the same algorithm for arbitrary layout shapes and is invoked by ops taking non-MMA layouts. The two verifiers share most of their bodies, but the MMA-side verifier is specialised for the MMA path with hard-coded k_size formulas keyed off direction and elem_bits. The split exists because callers that already know they have an MMA operand pay no dispatch cost, and the larger sibling only runs for layouts whose K-extent must be derived rather than computed.
tcgen05.mma Kind-Word Verifier
The Blackwell tcgen05.mma op family packs several orthogonal attributes into a 9-bit kind word, and the verifier checks that the bits are mutually consistent before any lowering pass sees the op. The kind word carries the CTA-group selector, the scale-vector size, the scale-input-accumulator bit, the block-scale bit, and a 3-bit selector that picks one of seven concrete mma_kind enum values. A separate weight-stationary flag overlays bit 0 of the same word and is read as a 1-bit predicate (its cta_group::1 requirement is enforced as a cross-field rule). The verifier walks the mutual-exclusion rules below and returns an NVPTX opcode index from the closed range 10521..10530 on success, so the lowering pass can branch directly on the result.
typedef union Tcgen05MmaKind {
uint32_t raw : 9;
struct {
uint32_t cta_group : 2; // bits 0-1: 0=reserved, 1=1-CTA, 2=2-CTA, 3=4-CTA
uint32_t scale_vector_size : 2; // bits 2-3: 0=1X (16), 1=2X (32), 2=4X (64), 3=reserved
uint32_t scale_input_acc : 1; // bit 4: 1 = scale applied to accumulator
uint32_t block_scale : 1; // bit 5: 1 = block-scaled (FP4/FP8 microscale)
uint32_t mma_kind : 3; // bits 6-8: one of the seven enum values below
};
} Tcgen05MmaKind;
The weight-stationary variant reuses bit 0 of the same word and is materialized by the lowering pass as a boolean predicate ws = (raw & 1) != 0. The two views are mutually exclusive at the encoding layer: a kind word with ws == 1 always has cta_group == 1 (single-CTA), so rule 8 below rejects every other cta_group value the moment the WS bit is set.
⚡ QUIRK —
cta_group is in the low bits, mma_kind in the high bits; swapping the order silently dispatches a different opcode
The bitfield order is cta_group at bits 0..1, then scale_vector_size, scale_input_acc, block_scale, and finally mma_kind at bits 6..8. A frontend that constructs the kind word with the field order reversed (mma_kind in the low bits, cta_group in the high bits, which is the natural reading order in human-facing documentation) builds a word that the verifier still accepts: the resulting cta_group bits land inside the mma_kind enum range (0..7), and the resulting mma_kind bits land inside the cta_group range (0..3). The verifier walks its 13 rules over the wrong field interpretations, may pass them all, and select_tcgen05_opcode returns an opcode index in 10521..10530 for an entirely different instruction. No diagnostic fires. A reimplementation must reproduce the exact bit layout shown in the Tcgen05MmaKind union (cta_group low, mma_kind high), or every emitted tcgen05.mma is the wrong opcode.
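Because C leaves the allocation order of bitfields to the implementation, a reimplementation may prefer explicit shifts and masks over the union. The sketch below decodes and re-encodes the 9-bit word using the bit positions given above; the `KindFields` struct and both function names are hypothetical helpers, not names from the binary.

```c
#include <stdint.h>

/* Plain-integer view of the kind word. Field names follow the
   Tcgen05MmaKind union; the struct itself is illustrative only. */
typedef struct KindFields {
    unsigned cta_group;          /* bits 0-1 */
    unsigned scale_vector_size;  /* bits 2-3 */
    unsigned scale_input_acc;    /* bit 4 */
    unsigned block_scale;        /* bit 5 */
    unsigned mma_kind;           /* bits 6-8 */
    unsigned ws;                 /* overlay of bit 0 */
} KindFields;

KindFields decode_kind_word(uint32_t raw) {
    KindFields f;
    raw &= 0x1FFu;                              /* keep the 9 encoded bits */
    f.cta_group         = raw & 0x3u;
    f.scale_vector_size = (raw >> 2) & 0x3u;
    f.scale_input_acc   = (raw >> 4) & 0x1u;
    f.block_scale       = (raw >> 5) & 0x1u;
    f.mma_kind          = (raw >> 6) & 0x7u;
    f.ws                = raw & 0x1u;           /* same bit as cta_group bit 0 */
    return f;
}

/* Re-packs the five bitfield members; ws is an overlay and is not
   written separately. */
uint32_t encode_kind_word(KindFields f) {
    return (f.cta_group & 0x3u)
         | ((f.scale_vector_size & 0x3u) << 2)
         | ((f.scale_input_acc & 0x1u) << 4)
         | ((f.block_scale & 0x1u) << 5)
         | ((f.mma_kind & 0x7u) << 6);
}
```

Spelling the shifts out makes a reversed field order visible in review, which the bitfield declaration alone does not.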
The mma_kind field picks one of seven enum values. Each implies a different element type and a different valid range for the rest of the kind word; the verifier uses it as the primary dispatch key for type-specific rules.
| Value | mma_kind | Notes |
|---|---|---|
| 0 | mxf4nvf4 | NVFP4 with block-scale |
| 1 | i8 | Signed 8-bit integer matmul |
| 2 | mxf8f6f4 | OCP MX-FP8/FP6/FP4 microscale |
| 3 | f16 | Half-precision float |
| 4 | tf32 | TensorFloat-32 (8-exp, 10-mantissa) |
| 5 | f8f6f4 | (alias of mxf8f6f4 for backward compat) |
| 7 | mxf4 | OCP MX-FP4 (no NVFP4 distinction) |
The 13 verbatim diagnostics below fire in the order shown. Each rule is independent; the verifier walks them in fixed sequence and reports the first failure rather than collecting all violations, so a kind word that clears one rule is not yet globally valid until the whole ladder completes. The "colletor" typo in rule 10 is preserved verbatim — reproducing it byte-for-byte is required for test suites that match diagnostics by string.
⚡ QUIRK —
colletor typo + fail-first walk masks later violations
Rule 10's diagnostic spells the noun colletor (missing c) instead of collector, and the ladder bails on the first failure rather than collecting every violation. Two surprises compose: a kind word that fails rule 3 may also trip rules 7 and 10, but the user sees only the rule-3 message; iteratively patching one symptom at a time is the only debugging path. Combined with the typo, log scrapers that search for the corrected spelling silently miss every rule-10 hit even when the verifier fires.
| # | Diagnostic | Trigger condition |
|---|---|---|
| 1 | "INT8 type is supported only on arch-conditional variants." | mma_kind == i8 outside an arch-conditional / family-conditional variant |
| 2 | "MXF4 and MXF4NVF4 types with Sparsity are supported only on arch-conditional variants." | mma_kind in {mxf4nvf4, mxf4} with sparsity bit set, non-arch-conditional |
| 3 | "Explicit scale vector size is supported only on arch-conditional variants." | scale_vector_size != 0 outside an arch-conditional variant |
| 4 | "Scale input accumulator is not supported on this architecture." | scale_input_acc == 1 on an ISA strictly below SM100a |
| 5 | "Scale input accumulator can only be used with f16 and tf32 types" | scale_input_acc == 1 && mma_kind not in {f16, tf32} |
| 6 | "Block scale is not supported for f16, tf32, f8f6f4, and i8 types" | block_scale == 1 && mma_kind in {i8, f16, tf32, f8f6f4} |
| 7 | "ashift is not supported with tcgen05.mma.block_scale variants" | ashift bit set on a block-scale opcode (10521 / 10526) |
| 8 | "cta_group::2 is not supported with weight stationary" | (raw & 3) == 3, i.e. the cta_group::2 selector bit (bit 1) and the WS bit (bit 0) are both set |
| 9 | "Cannot use weight stationary with mxf8f6f4 and fp4 types" | ws == 1 && mma_kind in {mxf8f6f4, f8f6f4, mxf4} |
| 10 | "Cannot use collector::a::use or colletor::a::fill with ashift" | collector-a use/fill combined with ashift |
| 11 | "Cannot use 2X or 4X as scale vector size for mxf8f6f4 type" | mma_kind == mxf8f6f4 && scale_vector_size > 1 |
| 12 | "Cannot use 1X as scale vector size for mxf4nvf4 type" | mma_kind == mxf4nvf4 && scale_vector_size == 0 (1X) |
| 13 | "Cannot use 1X or 4X as scale vector size for mxf4 type" | mma_kind == mxf4 && scale_vector_size in {0, 2} |
Rules 1, 2, 3, and 4 are architecture gates: the corresponding type/scale combinations only exist as arch-conditional or family-conditional variants of tcgen05.mma. Rule 5 narrows the scale-input-accumulator option to the two floating types that actually support it. Rule 6 expresses the inverse: the block-scale microscale path is defined for the FP4 / FP6 / FP8 narrow types, not for FP16, TF32, the legacy f8f6f4, or INT8. Rules 8 and 9 fence the weight-stationary variant: cta_group::2 and the wider mxf8f6f4/f8f6f4/mxf4 selectors are not part of the WS dispatch table. Rules 11, 12, and 13 each pin a single type's scale_vector_size to the one encoding the corresponding NVPTX instruction supports.
LogicalResult verify_tcgen05_mma_kind(Tcgen05MmaKind k,
uint32_t collector,
uint32_t opcode,
bool is_arch_cond,
uint32_t isa_version) {
bool ws = (k.raw & 1) != 0;
if (k.mma_kind == I8 && !is_arch_cond) {
return emit("INT8 type is supported only on arch-conditional variants.");
}
if ((k.mma_kind == MXF4NVF4 || k.mma_kind == MXF4)
&& sparsity_bit(k) && !is_arch_cond) {
return emit("MXF4 and MXF4NVF4 types with Sparsity are "
"supported only on arch-conditional variants.");
}
if (k.scale_vector_size != 0 && !is_arch_cond) {
return emit("Explicit scale vector size is supported only on "
"arch-conditional variants.");
}
if (k.scale_input_acc != 0 && isa_version < SM100A) {
return emit("Scale input accumulator is not supported on this architecture.");
}
if (k.scale_input_acc != 0
&& k.mma_kind != F16 && k.mma_kind != TF32) {
return emit("Scale input accumulator can only be used with f16 and tf32 types");
}
if (k.block_scale != 0
&& (k.mma_kind == I8 || k.mma_kind == F16
|| k.mma_kind == TF32 || k.mma_kind == F8F6F4)) {
return emit("Block scale is not supported for f16, tf32, f8f6f4, and i8 types");
}
if (is_block_scale_opcode(opcode) && (collector & ASHIFT) != 0) {
return emit("ashift is not supported with tcgen05.mma.block_scale variants");
}
if ((k.raw & 3) == 3) {
return emit("cta_group::2 is not supported with weight stationary");
}
if (ws && (k.mma_kind == MXF8F6F4 || k.mma_kind == F8F6F4 || k.mma_kind == MXF4)) {
return emit("Cannot use weight stationary with mxf8f6f4 and fp4 types");
}
if ((collector & COLLECTOR_A_USE_OR_FILL) != 0 && (collector & ASHIFT) != 0) {
return emit("Cannot use collector::a::use or colletor::a::fill with ashift");
}
if (k.mma_kind == MXF8F6F4 && k.scale_vector_size > 1) {
return emit("Cannot use 2X or 4X as scale vector size for mxf8f6f4 type");
}
if (k.mma_kind == MXF4NVF4 && k.scale_vector_size == 0) {
return emit("Cannot use 1X as scale vector size for mxf4nvf4 type");
}
if (k.mma_kind == MXF4 && (k.scale_vector_size == 0 || k.scale_vector_size == 2)) {
return emit("Cannot use 1X or 4X as scale vector size for mxf4 type");
}
return select_tcgen05_opcode(k); // returns one of 10521..10530
}
On success the verifier hands back an opcode index in the closed range 10521..10530. Each of the ten NVPTX MI opcodes (tcgen05.mma, tcgen05.mma.sp, tcgen05.mma.block_scale, tcgen05.mma.sp.block_scale, and their weight-stationary siblings) corresponds to exactly one combination of the cta_group, weight-stationary, sparsity, and block-scale bits that the lowering pass needs in order to pick a final instruction encoding. Returning the index from the verifier keeps the kind-word decode in one place and prevents the lowering pass from rederiving the dispatch table from raw bits.
Worked Example: Kind Word 0x42
A concrete kind word makes the bit packing and the ladder order easier to follow. Take Tcgen05MmaKind.raw = 0x42. In 9-bit binary, with bit 0 on the right, this is
bit: 8 7 6 5 4 3 2 1 0
raw: 0 0 1 0 0 0 0 1 0 = 0x42
Reading the fields out of the bitfield declared above:
| Field | Bits | Value | Decoded |
|---|---|---|---|
| cta_group | 0-1 | 10 | 2: cta_group::2 (two-CTA dispatch) |
| scale_vector_size | 2-3 | 00 | 0: 1X (16-element scale vector) |
| scale_input_acc | 4 | 0 | not set |
| block_scale | 5 | 0 | not set |
| mma_kind | 6-8 | 001 | 1: i8 |
The overlaid weight-stationary predicate is ws = (raw & 1) != 0; for 0x42, bit 0 is 0, so ws = false. The sparsity bit (raw & 0x20) is also 0: it overlays bit 5 of the encoding, the same bit the bitfield names block_scale, and reads zero here.
Walking the verifier ladder against this kind word, with is_arch_cond = false and isa_version = SM100 (not the arch-conditional variant):
- Rule 1: k.mma_kind == I8 && !is_arch_cond. Both predicates hold. The verifier fires "INT8 type is supported only on arch-conditional variants." and stops. No later rule runs.
Lifting the gate by setting is_arch_cond = true lets the kind word continue down the ladder. Rules 2 and 3 short-circuit (mma_kind != mxf4nvf4/mxf4, scale_vector_size == 0). Rule 4 short-circuits (scale_input_acc == 0). Rule 5 short-circuits for the same reason. Rule 6 short-circuits (block_scale == 0). Rule 7 short-circuits (no block-scale opcode in play). Rule 8 checks (raw & 3) == 3 — for 0x42, raw & 3 = 2, so the rule does not fire. Rule 9 reads the weight-stationary view, finds ws = 0, and short-circuits. Rules 10-13 all short-circuit on the same field-clear conditions. The ladder reaches select_tcgen05_opcode, which picks tcgen05.mma (opcode 10522, the dense, non-block-scale, non-WS path) on cta_group::2.
A symmetric example flips the gate the other direction. Take raw = 0xE2 (0b011100010):
| Field | Bits | Value | Decoded |
|---|---|---|---|
| cta_group | 0-1 | 10 | 2 |
| scale_vector_size | 2-3 | 00 | 0 |
| scale_input_acc | 4 | 0 | not set |
| block_scale | 5 | 1 | set |
| mma_kind | 6-8 | 011 | 3: f16 |
The ladder walks rules 1-5 without firing (mma_kind is neither i8 nor mxf4*, scale_vector_size == 0, scale_input_acc == 0). Rule 6 sees block_scale == 1 && mma_kind in {i8, f16, tf32, f8f6f4} — mma_kind == f16 matches the set and the verifier fires "Block scale is not supported for f16, tf32, f8f6f4, and i8 types".
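Both worked examples can be replayed with a reduced version of the ladder. The sketch below is an assumption-laden subset: it keeps only the rules whose inputs are the kind word and the arch-conditional flag (1, 3, 5, 6, 8, 9, 11, 12, 13) and omits rules 2, 4, 7, and 10, which need sparsity, ISA version, or collector flags. The function name and the "return the rule number" convention are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* mma_kind values from the enum table; 6 is unassigned there. */
enum { MXF4NVF4 = 0, I8 = 1, MXF8F6F4 = 2, F16 = 3, TF32 = 4, F8F6F4 = 5, MXF4 = 7 };

/* Returns the number of the first rule that fires, or 0 when this reduced
   ladder completes. Fail-first: later rules are never evaluated. */
int first_failing_rule(uint32_t raw, bool is_arch_cond) {
    unsigned svs  = (raw >> 2) & 0x3u;   /* scale_vector_size */
    unsigned sia  = (raw >> 4) & 0x1u;   /* scale_input_acc */
    unsigned bs   = (raw >> 5) & 0x1u;   /* block_scale */
    unsigned kind = (raw >> 6) & 0x7u;   /* mma_kind */
    bool ws = (raw & 1u) != 0;           /* weight-stationary overlay */

    if (kind == I8 && !is_arch_cond) return 1;
    if (svs != 0 && !is_arch_cond) return 3;
    if (sia && kind != F16 && kind != TF32) return 5;
    if (bs && (kind == I8 || kind == F16 || kind == TF32 || kind == F8F6F4)) return 6;
    if ((raw & 3u) == 3u) return 8;
    if (ws && (kind == MXF8F6F4 || kind == F8F6F4 || kind == MXF4)) return 9;
    if (kind == MXF8F6F4 && svs > 1) return 11;
    if (kind == MXF4NVF4 && svs == 0) return 12;
    if (kind == MXF4 && (svs == 0 || svs == 2)) return 13;
    return 0;
}
```

With this subset, 0x42 stops at rule 1 when `is_arch_cond` is false and clears the ladder when it is true, while 0xE2 stops at rule 6 either way.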
Two takeaways follow from the worked examples. First, the bit packing is order-sensitive: cta_group sits in the low two bits, mma_kind in the high three, with single-bit predicates between them — a writer that confuses bit order silently changes the dispatched opcode. Second, the ladder is fail-first: once any rule fires the verifier stops, so a kind word that passes rule 6 has not been proven globally valid until every later rule clears too. The 13-rule sequence is the complete witness.
SM120 Block-Scaled Lattice
SM120 block-scaled MMA verifies shape, input type, scale-factor type, scale-vector size, and scale-fragment width as one combined gate.
LogicalResult verify_sm120_scale_lattice(Sm120ScaleParams p) {
require(p.scale_vector_size == 16 || p.scale_vector_size == 32);
require(p.k == 32 || p.k == 64);
if (p.k == 32) {
require(is_fp4_fp6_or_fp8(p.a_type));
require(is_fp4_fp6_or_fp8(p.b_type));
require(p.sf_type == e8m0_type());
require(p.scale_vector_size == 32);
require(p.scale_fragment_bits == 8);
return success();
}
require(p.a_type == fp4_e2m1_type());
require(p.b_type == fp4_e2m1_type());
require(p.scale_fragment_bits * p.scale_vector_size == 512);
return success();
}
The K = 64 row deliberately narrows the accepted input set. Do not reuse the K = 32 FP6/FP8 allow-list there.
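A compilable rendering of the same gate makes the two rows easy to probe with concrete inputs. The element-type tags below are hypothetical stand-ins (the source names only the FP4/FP6/FP8 families, e8m0, and fp4_e2m1), and the boolean return replaces the `require` / `LogicalResult` machinery.

```c
#include <stdbool.h>

/* Hypothetical element-type tags; the source only names the families. */
typedef enum ElemType { FP4_E2M1, FP6, FP8, E8M0, OTHER_TYPE } ElemType;

typedef struct Sm120ScaleParams {
    int k;
    ElemType a_type, b_type, sf_type;
    int scale_vector_size;
    int scale_fragment_bits;
} Sm120ScaleParams;

static bool is_fp4_fp6_or_fp8(ElemType t) {
    return t == FP4_E2M1 || t == FP6 || t == FP8;
}

bool sm120_scale_lattice_ok(Sm120ScaleParams p) {
    if (p.scale_vector_size != 16 && p.scale_vector_size != 32) return false;
    if (p.k == 32) {
        /* K = 32 row: wide input allow-list, fixed scale configuration. */
        return is_fp4_fp6_or_fp8(p.a_type) && is_fp4_fp6_or_fp8(p.b_type)
            && p.sf_type == E8M0
            && p.scale_vector_size == 32
            && p.scale_fragment_bits == 8;
    }
    if (p.k == 64) {
        /* K = 64 row: FP4 only; fragment width follows from the 512 product. */
        return p.a_type == FP4_E2M1 && p.b_type == FP4_E2M1
            && p.scale_fragment_bits * p.scale_vector_size == 512;
    }
    return false;
}
```

Under the product rule in the K = 64 row, a scale-vector size of 16 pairs with 32-bit fragments and a size of 32 pairs with 16-bit fragments.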
Swizzle Legality
apply_swizzle and add_offset do not commute freely. The verifier rejects rewrites that assume:
add_offset(apply_swizzle(x), k) == apply_swizzle(add_offset(x, k))
unless the selected swizzle is identity for the affected address bits.
LogicalResult verify_swizzle_offset_commutation(Swizzle swizzle, Offset offset) {
if (swizzle.is_identity()) {
return success();
}
require(offset_preserves_swizzle_partition(swizzle, offset));
return success();
}
Accepted swizzle modes are a closed target-aware enum. Unknown modes must not silently fold to identity after parsing.
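Why the commutation cannot be assumed is easiest to see by brute force. The sketch below uses a CuTe-style XOR swizzle parameterised as (bits, base, shift); that formula and parameter order are an assumption made for illustration and are not taken from the tileiras binary, whose triples may be ordered differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* CuTe-style XOR swizzle: the `bits` address bits starting at base + shift
   are XORed into the bits starting at base. Assumed form, for illustration. */
uint32_t apply_swizzle(uint32_t x, unsigned bits, unsigned base, unsigned shift) {
    uint32_t mask = ((1u << bits) - 1u) << (base + shift);
    return x ^ ((x & mask) >> shift);
}

/* Checks add_offset(apply_swizzle(x), k) == apply_swizzle(add_offset(x, k))
   for every x below limit. */
bool offset_commutes(unsigned bits, unsigned base, unsigned shift,
                     uint32_t k, uint32_t limit) {
    for (uint32_t x = 0; x < limit; ++x) {
        uint32_t lhs = apply_swizzle(x, bits, base, shift) + k;
        uint32_t rhs = apply_swizzle(x + k, bits, base, shift);
        if (lhs != rhs) return false;
    }
    return true;
}
```

With (3, 4, 3), an offset of 128 changes one of the source bits and breaks the identity at x = 0, while an offset of 1024 touches only bits above the swizzled range and commutes. With zero swizzle bits the function is the identity and every offset commutes.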
TMA Rank and Mode Gates
TMA bulk tensor operations support ranks one through five. Im2col and scatter variants tighten the rank requirements, and some modes are Blackwell-only.
LogicalResult verify_tma_rank_and_mode(TmaMode mode, int rank, Target target) {
require(1 <= rank && rank <= 5);
if (mode == IM2COL || mode == IM2COL_W || mode == IM2COL_W128) {
require(rank >= 3);
}
if (mode == SCATTER4) {
require(rank == 2);
}
if (mode == IM2COL_W || mode == IM2COL_W128) {
require(target.supports_blackwell_tma_modes);
}
return success();
}
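The same gate can be written as a plain boolean function for experimentation. In this sketch `TMA_TILED` is a hypothetical name for the ordinary tiled mode (the pseudocode above names only the im2col and scatter modes), and the `blackwell_modes` flag stands in for `target.supports_blackwell_tma_modes`.

```c
#include <stdbool.h>

/* TMA_TILED is a hypothetical tag for the plain tiled mode. */
typedef enum TmaMode { TMA_TILED, IM2COL, IM2COL_W, IM2COL_W128, SCATTER4 } TmaMode;

bool tma_rank_and_mode_ok(TmaMode mode, int rank, bool blackwell_modes) {
    if (rank < 1 || rank > 5) return false;                 /* ranks 1..5 only */
    bool im2col_w = (mode == IM2COL_W || mode == IM2COL_W128);
    if ((mode == IM2COL || im2col_w) && rank < 3) return false;
    if (mode == SCATTER4 && rank != 2) return false;
    if (im2col_w && !blackwell_modes) return false;         /* target gate */
    return true;
}
```

The rank check runs before the target gate, so an out-of-range rank is rejected the same way on every target.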
Invariants
- LDSM/STSM mode, transpose, size pattern, and matrix count are verified as one tuple.
- Shared-memory matrix movement checks memory-space direction and alignment.
- Register fragment size is derived from layout cosize.
- UMMA canonical layouts emit one of seven verbatim diagnostics on failure, keyed on direction and on flat-mode / stride structure.
- tcgen05.mma kind words are gated by 13 mutual-exclusion diagnostics over a 9-bit packed encoding plus a separate weight-stationary predicate.
- SM120 block-scaled validation distinguishes K = 32 from K = 64.
- Swizzle and offset rewrites must prove commutation.
- TMA ranks and special modes are target-gated before PTX emission.
Cross-References
TMA Atoms — Eleven-Step Partition Verifier documents the partition verifier whose eleven-step ladder these mode verifiers compose with. SM Tier Roster and Copy Atom Registry — MMA Atom Verifier Diagnostics lists the MMA atom verifier diagnostics that the layout walker emits before the canonical-layout check runs.
cute_nvgpu Assembly Printer and Type Mnemonics
Abstract
cute_nvgpu textual assembly is primarily a type surface. Atom and descriptor types print as compact mnemonics — sm90.mma, atom.tma_load, SM120.mma_bs. Parameterized types add angle-bracket payloads for descriptor views, universal copy atoms, or sub-byte integer fragments. The dialect provides no dialect-scoped attributes in text; operation attributes use ordinary builtin attribute syntax. This page covers the printer, the parser-facing mnemonics, their parameters, enum spellings, alias hints, and the 27-entry length-keyed packed-XOR perfect-hash dispatcher (sub_1826CF0, 6164 B) that resolves them all.
Type Mnemonics
| Family | Mnemonics |
|---|---|
| MMA atoms | atom.universal_fma, sm80.mma, sm80.sparse_mma, sm89.mma, sm90.mma, sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp, SM120.mma_bs |
| SMEM descriptors | smem_desc, smem_desc_view |
| TMEM and copy atoms | atom.tmem_load, atom.tmem_store, atom.s2t_copy, atom.universal_copy, atom.simt_async_copy, atom.ldsm, atom.stsm |
| TMA descriptors and atoms | tma_descriptor_tiled, tma_descriptor_im2col, atom.tma_load, atom.tma_store, atom.tma_reduce, atom.non_exec_tiled_tma_load, atom.non_exec_tiled_tma_store, atom.non_exec_tiled_tma_reduce |
SM120.mma_bs is case-sensitive. It prints and parses with uppercase SM.
Parameterized Types
smem_desc_view
smem_desc_view wraps a source type and a layout attribute:
!cute_nvgpu.smem_desc_view<memref<128xf16, 3>, #cute.layout<(4@0, 32@1)>>
The source type describes the shared-memory object. The layout attribute tells WGMMA or UMMA how to interpret that shared-memory tile.
atom.universal_copy
Universal copy atoms carry value type, optional bit width, optional distributed shared-memory allowance, and optional PTX-like memory order and scope:
!cute_nvgpu.atom.universal_copy<f16>
!cute_nvgpu.atom.universal_copy<f16, 128 b>
!cute_nvgpu.atom.universal_copy<f16, 128 b, allow_dsmem>
!cute_nvgpu.atom.universal_copy<f16, mem_order=acquire, mem_scope=cluster>
The b suffix means bits. Keep the space before b to match this dialect's printer exactly.
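A printer that elides absent optional fields could look like the sketch below. The `UniversalCopy` struct and function name are hypothetical, and the field order (bit width, `allow_dsmem`, `mem_order`, `mem_scope`) is an assumption inferred from the four examples above, which never show all fields together.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form; 0 or NULL marks an absent optional field. */
typedef struct UniversalCopy {
    const char *value_type;   /* e.g. "f16" */
    unsigned bits;            /* 0 = absent */
    bool allow_dsmem;
    const char *mem_order;    /* NULL = absent */
    const char *mem_scope;    /* NULL = absent */
} UniversalCopy;

/* Writes the textual type into buf (cap must be at least 1). Absent fields
   are skipped entirely; no sentinel keyword is printed for them. */
void print_universal_copy(char *buf, size_t cap, const UniversalCopy *t) {
    snprintf(buf, cap, "!cute_nvgpu.atom.universal_copy<%s", t->value_type);
    if (t->bits != 0)   /* note the space before the b suffix */
        snprintf(buf + strlen(buf), cap - strlen(buf), ", %u b", t->bits);
    if (t->allow_dsmem)
        snprintf(buf + strlen(buf), cap - strlen(buf), ", allow_dsmem");
    if (t->mem_order != NULL)
        snprintf(buf + strlen(buf), cap - strlen(buf), ", mem_order=%s", t->mem_order);
    if (t->mem_scope != NULL)
        snprintf(buf + strlen(buf), cap - strlen(buf), ", mem_scope=%s", t->mem_scope);
    snprintf(buf + strlen(buf), cap - strlen(buf), ">");
}
```

Each optional field is appended only when present, which reproduces the four example spellings.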
Atom Integer Type
Sub-byte and microscaling fragments print as integer widths with an optional division factor:
!cute_nvgpu.i4
!cute_nvgpu.i6
!cute_nvgpu.i8
!cute_nvgpu.i4<divby 2>
!cute_nvgpu.i2<divby 4>
The division factor controls packed-lane interpretation, not integer arithmetic.
Enum Spellings
Universal copy atoms use PTX-like memory order and scope spellings.
| Enum | Spellings |
|---|---|
| Memory order | relaxed, acquire, release, acq_rel, sc, mmio, constant, volatile |
| Memory scope | cluster, gpu, sys |
An omitted order or scope is not the same as printing a default value. Printers elide absent fields rather than emit sentinel keywords.
Alias Hints
The dialect may provide human-readable SSA aliases for common atom families. Aliases are non-semantic and may be overridden by operation-level result naming.
StringRef alias_for_type(Type type) {
if (is_memref_family(type)) {
return format("memref_%s_%d", element_name(type), rank(type));
}
if (is_copy_atom(type)) {
return format("copy_%s", copy_atom_suffix(type));
}
if (is_mma_atom(type)) {
return format("mma_%s_%s_%s_%s",
a_element_name(type),
b_element_name(type),
c_element_name(type),
shape_name(type));
}
return "";
}
Examples:
%copy_sm90_tma_load = ...
%mma_f16_f16_f32_16x8x16 = ...
%memref_f16_3 = ...
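The alias shapes in the examples follow directly from the two format strings in the pseudocode. A minimal sketch, assuming the caller supplies the element and shape names as strings and owns the output buffer (the pseudocode's `format` helper and its storage are not specified in the source):

```c
#include <stdio.h>

/* Builds "mma_<a>_<b>_<c>_<shape>" into a caller-owned buffer. */
int format_mma_alias(char *buf, size_t cap, const char *a, const char *b,
                     const char *c, const char *shape) {
    return snprintf(buf, cap, "mma_%s_%s_%s_%s", a, b, c, shape);
}

/* Builds "memref_<element>_<rank>" into a caller-owned buffer. */
int format_memref_alias(char *buf, size_t cap, const char *element, int rank) {
    return snprintf(buf, cap, "memref_%s_%d", element, rank);
}
```

Writing into a caller-owned buffer avoids returning a reference to temporary storage, which a literal C translation of the pseudocode would have to address.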
Attribute Text
The dialect does not parse #cute_nvgpu.* attributes. Operation attributes
such as a_type, b_type, shape_MNK, cache modes, scale-vector size, and
thread or byte identifiers should be represented with ordinary builtin or
operation-specific attribute syntax.
// Good: op-owned attributes.
%atom = cute_nvgpu.make_sm120_mma_bs
{shape_MNK = [16, 8, 32], vec_size = 32}
// Not a supported dialect attribute surface.
// #cute_nvgpu.some_attribute<...>
Parser Strategy
Any efficient mnemonic dispatch will do, as long as it behaves as if it recognises the exact case-sensitive set above. Unknown type mnemonics produce a clear diagnostic naming both the bad token and the cute_nvgpu dialect.
Type parse_cute_nvgpu_type(Parser *parser) {
StringRef mnemonic = parser->parse_keyword();
if (mnemonic == "smem_desc_view") {
return parse_smem_desc_view(parser);
}
if (mnemonic == "atom.universal_copy") {
return parse_universal_copy_atom(parser);
}
if (starts_with_atom_integer_prefix(mnemonic)) {
return parse_atom_integer_type(parser, mnemonic);
}
Type type = lookup_zero_parameter_type(mnemonic);
require(type != NULL);
return type;
}
Mnemonic Perfect-Hash Dispatch
The compiled mnemonic dispatcher (sub_1826CF0, 6164 bytes) is a hand-written length-keyed perfect-hash walk. It fetches a single token via parseOptionalKeyword through the AsmParser vtable, then compares the token against 27 precomputed entries in a fixed order. Each entry is a tuple (length, first_qword, second_qword, tail_bytes); the comparison uses 8-byte unaligned loads XORed against the stored qword literal — the classic packed-XOR memcmp collapse. The walk order is the order the entries appear in the table below, and preserving it matters: the printer's slot index must match the parser's branch order so type-name resolution round-trips through identical decision-chain offsets.
The hash is perfect because no two mnemonics in the set share the full (length, first qword, tail) key. Length and first qword separate almost every entry on their own; the one pair that agrees on both, sm100.mma_sp and sm100.mma_bs (length 12, first qword "sm100.mm"), is split by the tail compare. The compiler emits a chained linear walk rather than a switch table: each arm gates on len == LEN first, then on a fused XOR of one or two qwords, then on any remaining dword, word, and byte tail. A miss falls through to the next arm; a hit calls the per-mnemonic builder and sets HIBYTE(v44) = 1 as the success sticky bit.
| # | Length | First qword (LE) | q0 literal | Second qword / tail | Mnemonic |
|---|---|---|---|---|---|
| 0 | 18 | 0x696E752E6D6F7461 | "atom.uni" | q1 0x665F6C6173726576 "versal_f", w@+16 "ma" | atom.universal_fma |
| 1 | 8 | 0x616D6D2E30386D73 | "sm80.mma" | — | sm80.mma |
| 2 | 15 | 0x6170732E30386D73 | "sm80.spa" | d@+8 "rse_", w@+12 "mm", b@+14 'a' | sm80.sparse_mma |
| 3 | 8 | 0x616D6D2E39386D73 | "sm89.mma" | — | sm89.mma |
| 4 | 9 | 0x7365645F6D656D73 | "smem_des" | b@+8 'c' | smem_desc |
| 5 | 14 | 0x7365645F6D656D73 | "smem_des" | d@+8 "c_vi", w@+12 "ew" | smem_desc_view |
| 6 | 8 | 0x616D6D2E30396D73 | "sm90.mma" | — | sm90.mma |
| 7 | 9 | 0x6D6D2E3030316D73 | "sm100.mm" | b@+8 'a' | sm100.mma |
| 8 | 12 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_sp" | sm100.mma_sp |
| 9 | 12 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_bs" | sm100.mma_bs |
| 10 | 15 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_bs", w@+12 "_s", b@+14 'p' | sm100.mma_bs_sp |
| 11 | 12 | 0x6D6D2E3032314D53 | "SM120.mm" | d@+8 "a_bs" | SM120.mma_bs |
| 12 | 14 | 0x656D742E6D6F7461 | "atom.tme" | d@+8 "m_lo", w@+12 "ad" | atom.tmem_load |
| 13 | 15 | 0x656D742E6D6F7461 | "atom.tme" | d@+8 "m_st", w@+12 "or", b@+14 'e' | atom.tmem_store |
| 14 | 13 | 0x7432732E6D6F7461 | "atom.s2t" | d@+8 "_cop", b@+12 'y' | atom.s2t_copy |
| 15 | 20 | 0x637365645F616D74 | "tma_desc" | q1 0x745F726F74706972 "riptor_t", d@+16 "iled" | tma_descriptor_tiled |
| 16 | 21 | 0x637365645F616D74 | "tma_desc" | q1 0x695F726F74706972 "riptor_i", d@+16 "m2co", b@+20 'l' | tma_descriptor_im2col |
| 17 | 13 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_loa", b@+12 'd' | atom.tma_load |
| 18 | 14 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_sto", w@+12 "re" | atom.tma_store |
| 19 | 15 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_red", w@+12 "uc", b@+14 'e' | atom.tma_reduce |
| 20 | 19 | 0x696E752E6D6F7461 | "atom.uni" | q1 0x635F6C6173726576 "versal_c", w@+16 "op", b@+18 'y' | atom.universal_copy |
| 21 | 20 | 0x6D69732E6D6F7461 | "atom.sim" | q1 0x5F636E7973615F74 "t_async_", d@+16 "copy" | atom.simt_async_copy |
| 22 | 9 | 0x73646C2E6D6F7461 | "atom.lds" | b@+8 'm' | atom.ldsm |
| 23 | 9 | 0x7374732E6D6F7461 | "atom.sts" | b@+8 'm' | atom.stsm |
| 24 | 28 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "load" | atom.non_exec_tiled_tma_load |
| 25 | 29 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "stor", b@+28 'e' | atom.non_exec_tiled_tma_store |
| 26 | 30 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "redu", w@+28 "ce" | atom.non_exec_tiled_tma_reduce |
Entries 4, 15, and 16 (smem_desc, tma_descriptor_tiled, tma_descriptor_im2col) are zero-parameter sugared types that build straight through the MLIR type uniquer with a fixed TypeID global. Entries 5 and 20 (smem_desc_view, atom.universal_copy) carry inline sub-parsers for layout, scope, and order. Every other entry routes to a per-mnemonic builder trampoline whose address acts as a typed nonce in the dialect's registration table — the trampoline itself is a 3-byte xor eax, eax; ret, and the linker pins the function pointer as the opcode tag.
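The first-qword literals in the table can be reproduced by packing the first eight bytes of a mnemonic in little-endian order. The helper below is a hypothetical checker for anyone rebuilding the table; it is byte-order independent and assumes the input has at least eight characters.

```c
#include <stdint.h>

/* Packs s[0..7] into a qword with s[0] in the least significant byte,
   matching what an 8-byte little-endian load of the token would produce.
   Precondition: s points at 8 or more readable bytes. */
uint64_t pack_le64(const char *s) {
    uint64_t q = 0;
    for (int i = 7; i >= 0; --i) {
        q = (q << 8) | (uint8_t)s[i];
    }
    return q;
}
```

For example, packing "atom.uni" yields 0x696E752E6D6F7461 and packing "SM120.mm" yields 0x6D6D2E3032314D53, the literals listed for entries 0 and 11.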
atom.i<N>_divby_<M> Prefix Branch
Once all 27 literal arms miss, the dispatcher checks whether the token starts with 'i' (0x69). If it does, the walk switches into a four-step sub-walk that reads the bare i<decimal> form first, then optionally consumes the divby keyword followed by a second decimal. This is the dialect's generic packed sub-byte fragment encoding — i4, i6, i8 for plain widths; i4_divby_2, i2_divby_4 for NVFP4-style microscaling fragments where the divisor controls packed-lane interpretation rather than arithmetic.
The walk is:
- Read N as a decimal suffix after the leading 'i'. The digit run is delimited by std::find_if_not(tail, end, isdigit) and converted with StringRef::getAsInteger(10, &value).
- Check that the remainder of the token is exhausted and that N fits in uint32_t.
- Probe parseOptionalKeyword("divby", 5, &cursor) through the parser vtable. If the keyword is absent, the result is a plain i<N> atom integer type.
- Read M as a second decimal and attach both integers to the OperationState via the OperationState::addAttribute("divby", i<M>) helper.
The bare divby keyword path lives entirely inside step 3 — not a separate dispatcher arm, but a sub-keyword that only ever follows a successful i<N> consumption. Reimplementers must place this prefix branch after every literal arm: anything starting with 'i' followed by an all-digit tail lands here regardless of whether the digits form a semantically valid bit-width.
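The first two steps of the sub-walk (digit run, exhaustion, 32-bit fit) can be sketched without any MLIR machinery. The function name is hypothetical, and rejecting an empty digit run is an assumption: the text says only that N is read as a decimal.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parses a token of the form i<decimal>. Succeeds only when every byte
   after the leading 'i' is a digit and the value fits in uint32_t.
   Assumption: a bare "i" with no digits is rejected. */
bool parse_atom_integer_width(const char *tok, size_t len, uint32_t *width) {
    if (len < 2 || tok[0] != 'i') return false;
    uint64_t value = 0;
    for (size_t i = 1; i < len; ++i) {
        if (tok[i] < '0' || tok[i] > '9') return false;   /* token not exhausted */
        value = value * 10 + (uint64_t)(tok[i] - '0');
        if (value > UINT32_MAX) return false;             /* does not fit */
    }
    *width = (uint32_t)value;
    return true;
}
```

The check is purely lexical: a token such as i999 parses here even though it is not a meaningful bit-width, which matches the note above that semantic validity is not decided at this stage.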
SM120 Uppercase Quirk
Every SM80/89/90/100 arm uses a lowercase prefix (sm80., sm89., sm90., sm100.). The SM120 arm uppercases it. The packed first-qword literal for entry 11 is 0x6D6D2E3032314D53, decoding little-endian as "SM120.mm": the low-order bytes 53 4D ('S', 'M') sit at offsets 0 and 1 of the qword in memory, encoding the uppercase SM head. This is a genuine binary quirk preserved through the CUDA 13.1 release and confirmed against the verbatim qword constant in the decompilation. Any reimplementation must key the SM120 arm case-sensitively and never case-fold the prefix.
The table has exactly one SM120 variant: SM120.mma_bs. No SM120.mma, no _sp, no _bs_sp — the SM120 code path is gated to block-scaled MMA only, matching the consumer-Blackwell FP4 surface where sparse MMA is not exposed.
Diagnostic
When all 27 arms and the 'i'-prefix branch miss, the dispatcher emits:
unknown  type `<mnem>` in dialect `<dialect>`
The literal carries two spaces between unknown and type — verbatim in the binary at .rodata offset 0x04CF76F7. A compatible reimplementation must preserve it. Bad token and dialect name are both wrapped in backticks; the diagnostic is stitched from three separate .rodata fragments concatenated through the InFlightDiagnostic::operator<<(StringRef) chain.
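For test suites that match diagnostics by string, the double space is easy to lose in a reimplementation. A minimal formatter, assuming a single format string is acceptable in place of the fragment-by-fragment concatenation described above (the function name is hypothetical):

```c
#include <stdio.h>

/* Formats the unknown-type diagnostic. The two spaces between "unknown"
   and "type" are intentional and mirror the literal described in the text. */
int format_unknown_type(char *buf, size_t cap,
                        const char *mnemonic, const char *dialect) {
    return snprintf(buf, cap, "unknown  type `%s` in dialect `%s`",
                    mnemonic, dialect);
}
```

Building the string in one place keeps the spacing and the backtick wrapping from drifting apart across call sites.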
Perfect-Hash Compare Pseudocode
typedef struct HashEntry {
    size_t length;                 // exact token length; gates every arm
    uint64_t q0;                   // bytes 0..7, little-endian
    uint64_t q1;                   // bytes 8..15; compared only when length >= 16
    const char *tail;              // bytes after the last full qword compared
    void (*build)(ParseResult *);  // per-mnemonic builder
} HashEntry;
int parseCuteNvgpuMnemonic(StringRef m, ParseResult *out) {
    static const HashEntry table[27] = {
        {18, 0x696E752E6D6F7461ULL, 0x665F6C6173726576ULL, "ma", build_atom_universal_fma},
        { 8, 0x616D6D2E30386D73ULL, 0ULL, "", build_sm80_mma},
        {15, 0x6170732E30386D73ULL, 0ULL, "rse_mma", build_sm80_sparse_mma},
        { 8, 0x616D6D2E39386D73ULL, 0ULL, "", build_sm89_mma},
        /* ...rows 4..10 in walk order... */
        {12, 0x6D6D2E3032314D53ULL, 0ULL, "a_bs", build_sm120_mma_bs},
        /* ...rows 12..25 in walk order... */
        {30, 0x6E6F6E2E6D6F7461ULL, 0x69745F636578655FULL,
         "led_tma_reduce",
         build_atom_non_exec_tiled_tma_reduce},
    };
    for (size_t i = 0; i < 27; ++i) {
        const HashEntry *e = &table[i];
        if (m.size != e->length) continue;   // every literal is at least 8 bytes long
        uint64_t q0 = unaligned_load_u64(m.data + 0);
        if ((q0 ^ e->q0) != 0) continue;
        size_t off = 8;                      // where the tail compare starts
        if (e->length >= 16) {               // a full second qword is available
            uint64_t q1 = unaligned_load_u64(m.data + 8);
            if ((q1 ^ e->q1) != 0) continue;
            off = 16;
        }
        size_t tail_len = e->length - off;
        if (tail_len && memcmp(m.data + off, e->tail, tail_len) != 0) continue;
        e->build(out);
        return 1;
    }
    if (m.size > 0 && m.data[0] == 'i' && all_digits(m.data + 1, m.size - 1)) {
        return parse_atom_integer_with_optional_divby(m, out);
    }
    // Two spaces between "unknown" and "type", as in the binary's literal.
    emit_error("unknown  type `%.*s` in dialect `cute_nvgpu`",
               (int)m.size, m.data);
    return 0;
}
The length gate lets most arms be rejected with a single integer compare before any qword is loaded. The compiler chains the arms as if/else if rather than building a switch table because the combined length, first-qword, and tail key is collision-free across the entire mnemonic set, and the walk order must stay stable to preserve the printer-to-parser slot correspondence the dialect's round-trip self-test depends on.
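A compilable miniature of the walk, restricted to four rows from the table, shows the length gate, the packed-XOR compare, and the tail compare working together. The `Entry` layout, the function names, and the use of row numbers as return values are hypothetical; the real dispatcher calls per-mnemonic builders instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct Entry {
    size_t length;     /* exact token length */
    uint64_t q0;       /* bytes 0..7, little-endian */
    const char *tail;  /* bytes 8..length-1 (all rows here are under 16 bytes) */
    int row;           /* row number in the 27-entry table */
} Entry;

/* Byte-order-independent stand-in for an unaligned 8-byte load. */
static uint64_t load_le64(const char *p) {
    uint64_t q = 0;
    for (int i = 7; i >= 0; --i) q = (q << 8) | (uint8_t)p[i];
    return q;
}

/* Four rows from the table, kept in walk order. */
static const Entry kSubset[] = {
    { 9, 0x7365645F6D656D73ULL, "c",     4},  /* smem_desc */
    {12, 0x6D6D2E3030316D73ULL, "a_sp",  8},  /* sm100.mma_sp */
    {12, 0x6D6D2E3030316D73ULL, "a_bs",  9},  /* sm100.mma_bs */
    {12, 0x6D6D2E3032314D53ULL, "a_bs", 11},  /* SM120.mma_bs */
};

/* Returns the table row of the matching mnemonic, or -1 on a miss. */
int match_mnemonic(const char *tok) {
    size_t len = strlen(tok);
    for (size_t i = 0; i < sizeof kSubset / sizeof kSubset[0]; ++i) {
        const Entry *e = &kSubset[i];
        if (len != e->length) continue;                /* length gate first */
        if ((load_le64(tok) ^ e->q0) != 0) continue;   /* packed-XOR compare */
        if (memcmp(tok + 8, e->tail, len - 8) != 0) continue;
        return e->row;
    }
    return -1;
}
```

Rows 8 and 9 share both length and first qword, so only the tail separates them, and a lowercase sm120.mma_bs misses because no case folding is applied.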
Invariants
- Type mnemonics are case-sensitive.
- SM120.mma_bs prints with uppercase SM; the packed qword literal 0x6D6D2E3032314D53 enforces this case-sensitively.
- The 27-entry perfect-hash walk order is preserved across builds so that printer slot indices match the parser's branch order.
- The 'i'-prefix branch runs only after every literal arm has missed.
- Parameterized type printers and parsers are symmetric.
- Operation attributes do not require a dialect-scoped attribute parser.
- Alias hints are deterministic but never semantic.
- The unknown-type diagnostic uses two literal spaces between unknown and type and wraps both the bad token and the dialect name in backticks.
cutlass Dialect Overview
Provenance vs Upstream MLIR
cutlass is NVIDIA-introduced and has no upstream MLIR counterpart. Upstream MLIR has no dialect that models CUTLASS-style asynchronous producer/consumer pipelines, persistent tile schedulers, sequence barriers, or block-striped shared-memory movement — the open-source CUTLASS library expresses all of this in C++ templates that the compiler instantiates per kernel. Tileiras lifts those template-time constructs into IR so the scheduler and the architecture-atom dialects can see them as ordinary ops with verifier-checked operands. Without this dialect, pipeline shape, scheduler kind, and barrier identity would have to be inferred from instantiation patterns rather than stated by the producer.
Abstract
The cutlass dialect packs seventy ops across eight operation families — pinned: 31 out-of-line thunks plus 39 inline registrations in the trampoline sub_1761D90, cross-validated against exactly 70 multi-segment "cutlass.X.Y[.Z]" strings in the binary's string pool with zero overlap between the two sets. Four cover the large-scale orchestration concerns (pipeline, tile_scheduler, seq_bar, block_striped); the MODS sidecar lives under the cutlass.tile_scheduler.mods_* prefix but registers and verifies as its own family; three smaller families (named_barrier, generic_barrier, and a single async-exec op) round out the dialect. It models the structure CUTLASS C++ templates normally generate: asynchronous producer/consumer pipelines, persistent tile schedulers, ordered sequence barriers, named/generic barriers, and block-striped shared-memory movement. The dialect constructor at sub_1761D90 registers all seventy ops in a single thunk-chain (thirty-nine ops inline plus thirty-one delegated to per-op helper thunks sub_175E920..sub_1761C20), then installs two op-level verifiers and the post-verify arrive-count builder. All seventy registrations go through the same RegisteredOperationName::insert entry point (sub_4461CA0); none of the slots register an attribute or a type — the dialect's attribute and type tables are wired separately and contribute no ops to this count.
cutlass sits above cute_nvgpu and nv_tileas. cute_nvgpu provides hardware atoms — MMA, TMA. nv_tileas provides operational async scheduling. cutlass connects the two at a larger granularity: it names which agents participate in the pipeline, how tiles are assigned to CTAs, how producers and consumers synchronise, and how persistent kernels advance through their work.
Position in the Cascade
cutlass
|
| lower pipeline, scheduler, barrier, and block-striped abstractions
v
nv_tileas + cute + cute_nvgpu
|
| schedule, assign layouts, emit architecture atoms
v
nvgpu + nvvm
|
| emit LLVM IR and PTX
v
PTX
For users, cutlass is a frontend-oriented dialect — useful when the source program already has CUTLASS pipeline structure. For reimplementers, it is a bridge: preserve the CUTLASS semantics long enough to lower them into the scheduler and atom dialects without losing synchronisation or tile-scheduler intent.
Operation Roster
The seventy ops split into eight families. Tile-scheduler is the largest at thirty-one ops, carrying one op per scheduler kind plus an extensive set of per-variant accessors, fixup hooks, and parameter builders. Pipeline (twenty ops) covers the full producer/consumer state machine plus the per-CTA executor switch and the cutlass.pipeline.state.* cursor accessors. Seq-bar and block-striped are smaller and more regular. The MODS async-dispatch family is a four-op sidecar wired to an alternate async-call ABI. Three small barrier and async families register alongside the orchestration families.
| Family | Count | Examples |
|---|---|---|
| pipeline | 20 | cutlass.pipeline.create, cutlass.pipeline.init, cutlass.pipeline.make_participants, cutlass.pipeline.producer_{acquire,try_acquire,commit,tail}, cutlass.pipeline.consumer_{wait,try_wait,release}, cutlass.pipeline.{produce,consume}, cutlass.pipeline.get_producer_{barrier,mask}, cutlass.pipeline.state.{create,increment,get_count,get_index,get_phase}, cutlass.pipeline.switch_by_executor |
| tile_scheduler (non-MODS) | 31 | scheduler-kind constructors (create_{dp,static_persistent,streamk}_params, create_SM100_scheduler), per-variant param builders (make_{dp,static_persistent,streamk}_params), work-tile-info constructors and accessors (create_*_work_tile_info, work_tile_info_{get,set}_value, work_tile_info_to_{coord_mnkl,cta_coord}, initial_work_tile_info), the streamk fixup trio (fixup, fixup_increment, fixup_wait), persistent-state mutators (advance_to_next_work, query_next_work, {,static_}fetch_next_work, get_current_work, get_workid_response_ptr), workspace plumbing (initialize_workspace, get_workspace_sizes, get_grid_shape), and the K-tile boundary accessors (get_work_k_tile_{count,start}), plus compute_epilogue and params_get_value |
| seq_bar | 5 | cutlass.seq_bar.create, cutlass.seq_bar.init, cutlass.seq_bar.arrive, cutlass.seq_bar.wait, cutlass.seq_bar.state.create |
| block_striped | 4 | cutlass.block_striped.load, cutlass.block_striped.load_add, cutlass.block_striped.store, cutlass.block_striped.reduce |
| MODS (nested under tile_scheduler) | 4 | cutlass.tile_scheduler.mods_report_mainloop_start, cutlass.tile_scheduler.mods_report_mainloop_end, cutlass.tile_scheduler.mods_report_smid, cutlass.tile_scheduler.mods_throttle (four ops covering the alternate async-call ABI used by the MODS telemetry path) |
| named_barrier | 2 | cutlass.named_barrier.arrive, cutlass.named_barrier.arrive_and_wait |
| generic_barrier | 3 | cutlass.generic_barrier.arrive_increment, cutlass.generic_barrier.wait_eq, cutlass.generic_barrier.wait_less_than |
| async | 1 | cutlass.async.exec |
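The roster arithmetic is easy to check mechanically. The sketch below hard-codes the per-family counts from the table above; the struct and function names are illustrative and do not correspond to symbols in the binary.

```c
#include <assert.h>
#include <stddef.h>

/* Per-family op counts copied from the roster table. */
typedef struct {
    const char *family;
    int count;
} FamilyCount;

static const FamilyCount k_roster[] = {
    {"pipeline", 20},
    {"tile_scheduler (non-MODS)", 31},
    {"seq_bar", 5},
    {"block_striped", 4},
    {"MODS", 4},
    {"named_barrier", 2},
    {"generic_barrier", 3},
    {"async", 1},
};

/* Number of families listed in the roster. */
size_t roster_family_count(void) {
    return sizeof(k_roster) / sizeof(k_roster[0]);
}

/* Sum of the per-family counts; every family must contribute at least one op. */
int roster_total_ops(void) {
    int total = 0;
    for (size_t i = 0; i < roster_family_count(); ++i) {
        assert(k_roster[i].count > 0);
        total += k_roster[i].count;
    }
    return total;
}
```

The total should agree with both the seventy-op claim and the 31 + 39 registration split documented in the per-thunk map.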
Two Verifiers Carry Pipeline Correctness
Of the seventy ops, only two carry non-trivial verifier code. The rest lean on type-system structural checks plus the operand-layout helpers below. Both non-trivial verifiers target the pipeline family and both gate the rest of the lowering pipeline.
PipelineInitOp::verify at sub_1771F40 is a 3,406-byte routine that verifies the cutlass.pipeline.init op's operands match the declared pipeline shape. It reads numStages from the op attribute, checks that numStages > 0, then reads the participants list via sub_172E930 and checks that its length matches numProducers. It reads the consumer list via sub_172E940 and checks that its length matches numConsumers. It then checks that barrier_id_base falls within the per-CTA NamedBarrier pool [0, 32), and that producer_group_id and consumer_group_id are distinct so producer and consumer groups do not overlap. The diagnostics it emits include "cutlass.pipeline.init: invalid numStages" and "cutlass.pipeline.init: participants length mismatch" among the per-field messages.
PipelineSwitchByExecutorOp::verify at sub_1775780 is a 4,848-byte routine that verifies the cutlass.pipeline.switch_by_executor op's branch dispatch. It walks the executor-mode arms and checks that each arm has matching numProducers and numConsumers counts so the participant accounting is consistent across modes. It then checks that the per-arm participant lists are disjoint so no participant is double-counted across executor arms. It reads num_producers via sub_172E930, num_consumers via sub_172E940, the participant list via sub_172E950, and the executor mode via sub_172E960.
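The check sequence for cutlass.pipeline.init can be sketched as a plain function. This is a minimal model, not the decompiled routine: PipelineInitView is a hypothetical flattened view of the op, and the result codes stand in for the per-field diagnostics, only two of which are quoted in the text.

```c
#include <assert.h>
#include <stddef.h>

enum { NAMED_BARRIER_POOL_SIZE = 32 };

/* Hypothetical flattened view of a cutlass.pipeline.init op. */
typedef struct {
    int num_stages;
    int num_producers;
    int num_consumers;
    size_t producer_list_len;   /* length of the participants list */
    size_t consumer_list_len;   /* length of the consumer list */
    int barrier_id_base;
    int producer_group_id;
    int consumer_group_id;
} PipelineInitView;

typedef enum {
    INIT_OK,
    INIT_BAD_STAGES,          /* "invalid numStages" */
    INIT_PRODUCER_LEN,        /* "participants length mismatch" */
    INIT_CONSUMER_LEN,
    INIT_BARRIER_RANGE,
    INIT_GROUP_OVERLAP
} InitVerifyResult;

/* Applies the checks in the order the prose lists them. */
InitVerifyResult verify_pipeline_init(const PipelineInitView *op) {
    assert(op != NULL);
    if (op->num_stages <= 0)
        return INIT_BAD_STAGES;
    if (op->num_producers < 0 ||
        op->producer_list_len != (size_t)op->num_producers)
        return INIT_PRODUCER_LEN;
    if (op->num_consumers < 0 ||
        op->consumer_list_len != (size_t)op->num_consumers)
        return INIT_CONSUMER_LEN;
    if (op->barrier_id_base < 0 ||
        op->barrier_id_base >= NAMED_BARRIER_POOL_SIZE)
        return INIT_BARRIER_RANGE;
    if (op->producer_group_id == op->consumer_group_id)
        return INIT_GROUP_OVERLAP;
    return INIT_OK;
}
```

Returning on the first failure is an assumption made for brevity; the real verifier may report several fields in one pass.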
Both verifiers run before any pipeline lowering pass, so a malformed cutlass.pipeline.init or cutlass.pipeline.switch_by_executor never reaches the TileAS scheduler.
Post-Verify Arrive-Count Builder
Once PipelineInitOp::verify passes, the post-verify builder at sub_1772C90 computes the per-stage arrive-count and stamps it as a derived attribute on the op. The arrive-count is a function of the participants list length, the consumer count, and the executor-mode mask. The ConvertPipelineToNVVM pass reads the attribute on every per-stage emit path and uses it to size the per-stage NamedBarrier arrive count. Without it, the lowering would have to recompute the count at every emit site by walking the same participant tables the verifier already read.
Block-Striped Operand Checkers
Four operand-layout checkers serve the block-striped family, one per variant: sub_176E670 for load, sub_176EE10 for store, sub_176F5B0 for reduce-add, sub_176FD50 for reduce-max. Each one checks operand-layout compatibility with the per-CTA tile shape — register-memory operand width, global-memory pointer or memref shape, element-type width (at least sixteen bits), and static stripe shape. The checkers fire from the relevant op's verify thunk and reject malformed operand combinations before lowering picks a vector width and copy atom.
Cutlass-Bar Warp-Cooperative Diagnostic
BarOpLowering at sub_15FC250 is a ~5.5 KB routine that handles cutlass.named_barrier.* and cutlass.generic_barrier.* lowering and emits the warp-cooperative diagnostic. It fires when an arrive-count is not a multiple of warp size, or when the op sits outside warp-cooperative scope. The diagnostic catches the misuse pattern where a thread-level barrier lands in a kernel region the rest of the dialect expects to coordinate warps as a unit.
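The two trigger conditions reduce to a small predicate. A minimal sketch, assuming the scope test is already available as a boolean and that the scope check runs first (the ordering is not stated in the source); the enum and function names are illustrative.

```c
#include <stdbool.h>

enum { WARP_SIZE = 32 };  /* threads per warp on NVIDIA GPUs */

typedef enum {
    BAR_OK,
    BAR_COUNT_NOT_WARP_MULTIPLE,
    BAR_OUTSIDE_WARP_COOPERATIVE_SCOPE
} BarCheck;

/* Mirrors the two conditions under which the diagnostic fires. */
BarCheck check_warp_cooperative(unsigned arrive_count,
                                bool in_warp_cooperative_scope) {
    if (!in_warp_cooperative_scope)
        return BAR_OUTSIDE_WARP_COOPERATIVE_SCOPE;
    if (arrive_count % WARP_SIZE != 0)
        return BAR_COUNT_NOT_WARP_MULTIPLE;
    return BAR_OK;
}
```

An arrive count of 128 (four warps) passes, while 48 (one and a half warps) is rejected.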
Barrier-Id Helper
The barrier-id helper at sub_1771850 allocates per-CTA NamedBarrier slots from the thirty-two-slot pool. Both PipelineInitOp::verify and cutlass.seq_bar.init call it to claim barrier IDs — the pool is the same physical resource on Hopper and Blackwell, so the helper is shared. Allocation order is deterministic and follows declaration order in the parent module, so two builds of the same IR produce the same barrier-id assignment.
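Deterministic allocation in declaration order can be modelled as a bump allocator over the thirty-two-slot pool. This is a sketch under that assumption: the source only states that the order is deterministic, so the consecutive-range policy and the names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NAMED_BARRIER_POOL_SIZE = 32 };

/* Per-CTA pool state: the next unclaimed barrier ID. */
typedef struct {
    int next_free;
} BarrierPool;

/* Claims `count` consecutive IDs in call order. Returns the base ID,
   or -1 when the request is invalid or does not fit in the pool. */
int barrier_pool_claim(BarrierPool *pool, int count) {
    assert(pool != NULL);
    if (count <= 0 || pool->next_free + count > NAMED_BARRIER_POOL_SIZE)
        return -1;
    int base = pool->next_free;
    pool->next_free += count;
    return base;
}
```

Because the allocator has no state other than the cursor, replaying the same sequence of claims always yields the same assignment, which is the property the text relies on.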
Pipeline Lowering
The central lowering takes CUTLASS pipeline objects to TileAS pipeline regions. The CUTLASS dialect models the state machine in the same terms as the C++ library; TileAS needs explicit producer and consumer regions, stage iterators, and token flow.
void lower_cutlass_pipeline(CutlassPipeline pipeline) {
    NvTileAsPipeline as_pipeline = create_tileas_pipeline(
        pipeline.stage_count,
        pipeline.shared_storage,
        pipeline.producer_group,
        pipeline.consumer_group);
    for (CutlassPipelineOp op : pipeline.ops) {
        switch (op.role) {
        case PIPELINE_PRODUCER_ACQUIRE:
            replace_with_producer_acquire(as_pipeline, op.stage);
            break;
        case PIPELINE_PRODUCER_COMMIT:
            replace_with_producer_commit(as_pipeline, op.stage);
            break;
        case PIPELINE_CONSUMER_WAIT:
            replace_with_consumer_wait(as_pipeline, op.stage, op.consumer_idx);
            break;
        case PIPELINE_CONSUMER_RELEASE:
            replace_with_consumer_release(as_pipeline, op.stage);
            break;
        case PIPELINE_SWITCH_BY_EXECUTOR:
            replace_with_agent_switch(as_pipeline, op.executor);
            break;
        }
    }
}
The lowering preserves stage identity and executor identity. If a producer acquire/commit pair is lowered independently of its consumer wait/release pair, without a shared pipeline object, the scheduler can no longer prove that the two pairs coordinate the same stage.
Tile Scheduler Semantics
CUTLASS tile schedulers decide which CTA owns which tile of work. The dialect preserves both the scheduling policy and the current scheduler state. Data-parallel scheduling maps CTAs straight to tiles. StreamK and split-K scheduling bring partial work, fixup paths, and a reduction workspace. Static persistent scheduling keeps CTAs resident and hands them new tiles in sequence. The SM100 scheduler forms layer Blackwell-specific persistent-scheduling details on top.
WorkTileInfo next_tile(TileScheduler *scheduler, CtaId cta) {
    switch (scheduler->kind) {
    case SCHEDULER_DATA_PARALLEL:
        return data_parallel_tile_for_cta(scheduler, cta);
    case SCHEDULER_STREAM_K:
        return stream_k_next_tile(scheduler, cta);
    case SCHEDULER_STATIC_PERSISTENT:
        return persistent_next_tile(scheduler, cta);
    case SCHEDULER_SM100:
        return sm100_next_tile_with_fixup(scheduler, cta);
    }
    // Every known scheduler kind returns above. An unknown kind means a
    // malformed module; invalid_work_tile_info() is a hypothetical sentinel.
    return invalid_work_tile_info();
}
The work-tile-info value is not a convenience wrapper. Downstream code derives problem coordinates, mainloop bounds, reduction participation, and epilogue fixup behaviour from it.
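To make the difference between the two simplest policies concrete, the sketch below maps a CTA to output-tile coordinates. It is a model under stated assumptions, not the tileiras algorithm: the M-major linearisation of the tile grid and the grid-stride walk for persistent CTAs are assumptions, all names are hypothetical, and StreamK and SM100 are left out because they need workspace and fixup state.

```c
#include <stdbool.h>

/* Simplified stand-in for work-tile info: tile coordinates plus validity. */
typedef struct {
    int m;
    int n;
    bool valid;
} WorkTile;

/* Converts a linear tile index to (m, n), assuming M varies fastest. */
static WorkTile tile_from_linear(int linear, int tiles_m, int tiles_n) {
    WorkTile t = {0, 0, false};
    if (linear >= 0 && linear < tiles_m * tiles_n) {
        t.m = linear % tiles_m;
        t.n = linear / tiles_m;
        t.valid = true;
    }
    return t;
}

/* Data-parallel: one CTA per tile, the CTA index is the tile index. */
WorkTile data_parallel_tile(int cta, int tiles_m, int tiles_n) {
    return tile_from_linear(cta, tiles_m, tiles_n);
}

/* Static persistent: a resident CTA takes tiles cta, cta + grid, cta + 2*grid... */
WorkTile static_persistent_tile(int cta, int grid_size, int iteration,
                                int tiles_m, int tiles_n) {
    return tile_from_linear(cta + iteration * grid_size, tiles_m, tiles_n);
}
```

With a 4 x 3 tile grid and five resident CTAs, CTA 2 would process linear tiles 2 and 7 and then receive an invalid tile, which is the signal to leave the persistent loop.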
If You Know CUTLASS (open source) — cross-walk
The cutlass dialect is the IR shape of the orchestration classes living in cutlass/pipeline/*.hpp, cutlass/gemm/kernel/tile_scheduler/*.hpp, cutlass/arch/barrier.h, and the related epilogue plumbing.
| CUTLASS C++ class / template | tileiras IR (cutlass.*) |
|---|---|
| PipelineTmaAsync<Stages>, PipelineAsync<Stages> | cutlass.pipeline.create + cutlass.pipeline.init with numStages/numProducers/numConsumers attrs |
| PipelineState<Stages> member tuple | !cutlass.pipeline_state typed value (phase, index, count) |
| producer_acquire / commit / tail | cutlass.pipeline.producer_{acquire,commit,tail} ops |
| consumer_wait / release | cutlass.pipeline.consumer_{wait,release} ops |
| Warp-specialized executor partition | cutlass.pipeline.switch_by_executor |
| OrderedSequenceBarrier<Stages, ...> | cutlass.seq_bar.{create,init,arrive,wait,state.create} (five-op family) |
| arch::NamedBarrier::sync(id, threads) | cutlass.named_barrier.arrive, cutlass.named_barrier.arrive_and_wait, cutlass.generic_barrier.{arrive_increment,wait_eq,wait_less_than}, cutlass.generic_barrier_sync (warp-cooperative-only; gated by the BarOpLowering diagnostic) |
| PersistentTileScheduler | cutlass.tile_scheduler.create_static_persistent_params (with companion create_static_persistent_work_tile_info) |
| StreamKScheduler | cutlass.tile_scheduler.create_streamk_params (with companion create_streamk_work_tile_info; SM100 variant body sub_R01) |
| DataParallelScheduler | cutlass.tile_scheduler.create_dp_params (with companion create_dp_work_tile_info) |
| BlockStriped<T>::load/store/reduce | cutlass.block_striped.{load,load_add,store,reduce} (four-op family) |
| MODS telemetry hooks (cutlass::mods::*) | cutlass.tile_scheduler.mods_* ops (side-effecting) |
Two structural points. First, most of CUTLASS's class-template instantiations turn into op attributes on a small set of ops, so a kernel using three pipelines and two schedulers is described by a few dozen ops rather than by template specialisations in a thousand-line header. Second, the participant model — producers, consumers, warp-specialized executors — lives in explicit lists on the init op, cross-checked by PipelineInitOp::verify at sub_1771F40 before the lowering pass ever runs.
Per-Thunk Op-Name Map
The trampoline sub_1761D90 (file offset around L5680050 in tileiras_full.c) calls each of the 31 out-of-line thunks once, in registration order. Each thunk wraps exactly one sub_4461CA0(..., "cutlass.<NAME>", <len>, ..., &<TypeID-singleton>, ...) call. The 39 inline registrations sit directly between thunk calls in the same function body, each also a single sub_4461CA0(...) invocation against a distinct TypeID singleton. The table below is the verbatim mapping from thunk address to registered op name; the inline table that follows lists the 39 names in trampoline-walk order.
Thirty-One Out-of-Line Thunks
| Thunk address | Registered op |
|---|---|
| sub_175E920 | cutlass.async.exec |
| sub_175EAB0 | cutlass.block_striped.load_add |
| sub_175ECE0 | cutlass.block_striped.load |
| sub_175EF10 | cutlass.block_striped.reduce |
| sub_175F140 | cutlass.block_striped.store |
| sub_175F370 | cutlass.named_barrier.arrive_and_wait |
| sub_175F500 | cutlass.pipeline.consume |
| sub_175F690 | cutlass.pipeline.produce |
| sub_175F820 | cutlass.pipeline.consumer_try_wait |
| sub_175F9E0 | cutlass.pipeline.producer_try_acquire |
| sub_175FBA0 | cutlass.tile_scheduler.work_tile_info_set_value |
| sub_175FD60 | cutlass.pipeline.create |
| sub_175FF20 | cutlass.pipeline.get_producer_barrier |
| sub_1760090 | cutlass.pipeline.get_producer_mask |
| sub_1760290 | cutlass.pipeline.make_participants |
| sub_1760450 | cutlass.pipeline.state.create |
| sub_1760610 | cutlass.pipeline.state.get_count |
| sub_1760780 | cutlass.pipeline.state.get_index |
| sub_17608F0 | cutlass.pipeline.state.get_phase |
| sub_1760A60 | cutlass.pipeline.state.increment |
| sub_1760BD0 | cutlass.tile_scheduler.create_dp_params |
| sub_1760D90 | cutlass.tile_scheduler.create_static_persistent_params |
| sub_1760F50 | cutlass.tile_scheduler.fetch_next_work |
| sub_1761110 | cutlass.tile_scheduler.get_grid_shape |
| sub_1761310 | cutlass.tile_scheduler.get_work_k_tile_count |
| sub_1761480 | cutlass.tile_scheduler.get_work_k_tile_start |
| sub_17615F0 | cutlass.tile_scheduler.get_workspace_sizes |
| sub_17617D0 | cutlass.tile_scheduler.initial_work_tile_info |
| sub_1761940 | cutlass.tile_scheduler.static_fetch_next_work |
| sub_1761AB0 | cutlass.tile_scheduler.work_tile_info_to_coord_mnkl |
| sub_1761C20 | cutlass.tile_scheduler.work_tile_info_to_cta_coord |
Each thunk is a 60..90 byte function whose body is dominated by malloc(0x70) for the per-op record (constant 0x70 = 112 bytes — the registered-op stride), a small constructor sequence (sub_44A8C20, then a TypeID setup), the sub_4461CA0 call with the op-name string and its length passed as char**, and sub_63F370 cleanup. Pulling the registrations out of line keeps the trampoline below the per-function code-cache budget and lets the compiler emit them as cold; the 39 inline cases are the ones whose construction sequence inlined small enough to stay in the parent.
Thirty-Nine Inline Registrations
The 39 inline registrations, listed in walk order from sub_1761D90:
cutlass.generic_barrier.arrive_increment,
cutlass.generic_barrier.wait_eq,
cutlass.generic_barrier.wait_less_than,
cutlass.named_barrier.arrive,
cutlass.pipeline.consumer_release,
cutlass.pipeline.consumer_wait,
cutlass.pipeline.init,
cutlass.pipeline.producer_acquire,
cutlass.pipeline.producer_commit,
cutlass.pipeline.producer_tail,
cutlass.pipeline.switch_by_executor,
cutlass.seq_bar.arrive,
cutlass.seq_bar.create,
cutlass.seq_bar.init,
cutlass.seq_bar.state.create,
cutlass.seq_bar.wait,
cutlass.tile_scheduler.advance_to_next_work,
cutlass.tile_scheduler.compute_epilogue,
cutlass.tile_scheduler.create_dp_work_tile_info,
cutlass.tile_scheduler.create_SM100_scheduler,
cutlass.tile_scheduler.create_static_persistent_work_tile_info,
cutlass.tile_scheduler.create_streamk_params,
cutlass.tile_scheduler.create_streamk_work_tile_info,
cutlass.tile_scheduler.fixup,
cutlass.tile_scheduler.fixup_increment,
cutlass.tile_scheduler.fixup_wait,
cutlass.tile_scheduler.get_current_work,
cutlass.tile_scheduler.get_workid_response_ptr,
cutlass.tile_scheduler.initialize_workspace,
cutlass.tile_scheduler.make_dp_params,
cutlass.tile_scheduler.make_static_persistent_params,
cutlass.tile_scheduler.make_streamk_params,
cutlass.tile_scheduler.mods_report_mainloop_end,
cutlass.tile_scheduler.mods_report_mainloop_start,
cutlass.tile_scheduler.mods_report_smid,
cutlass.tile_scheduler.mods_throttle,
cutlass.tile_scheduler.params_get_value,
cutlass.tile_scheduler.query_next_work,
cutlass.tile_scheduler.work_tile_info_get_value.
Every entry is a real op registration. cutlass.tile_scheduler.create_SM100_scheduler is the sm_100 dispatch constructor and has the same call shape as the other 38 inline cases (sub_4461CA0 against a dedicated TypeID singleton, unk_5B47568). It is therefore not a thunk-local helper but the registered op for the Blackwell tile-scheduler factory.
Union: thirty-one thunk-registered ops + thirty-nine inline-registered ops = seventy distinct op names, no duplicates. Dropping any one name from either set breaks the match against the seventy-string string-pool partition; that structural check is what pins the count.
Cross-links
- Pipeline and Tile Scheduler: Pipeline Model covers pipeline roles, and Tile Scheduler Kinds covers persistent tile scheduling.
- Seq Bar and Block Striped: Sequential Barrier Model covers sequence barriers, and Block-Striped Model covers block-striped movement.
- MODS Async Dispatch: Dispatch Model and Producer/Consumer Integration cover telemetry and async-dispatch operations.
cutlass Pipeline and Tile Scheduler
Abstract
cutlass.pipeline.* and cutlass.tile_scheduler.* solve the two large-scale orchestration problems in CUTLASS-style GEMM kernels: how producer and consumer agents coordinate asynchronous work, and how CTAs receive tiles of the output problem. The pipeline family covers stage state, barriers, producer acquire/commit, consumer wait/release, and executor switching. The tile-scheduler family covers data-parallel, static-persistent, and StreamK work assignment, including workspace-based fixup for partial K splits.
The rest of this page documents the contracts and algorithms a lowering must preserve.
Pipeline Model
A CUTLASS pipeline is a staged producer/consumer state machine. Each stage has a barrier-like slot, a phase bit, an index, and a participant policy.
typedef struct {
    int phase;
    int index;
    int count;
} PipelineState;
PipelineState pipeline_state_increment(PipelineState state, int stage_count) {
    state.index += 1;
    if (state.index == stage_count) {
        state.index = 0;
        state.phase ^= 1;
    }
    state.count += 1;
    return state;
}
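A short trace helps confirm how the three fields relate. The sketch below repeats the increment so it compiles on its own and adds a helper, pipeline_state_after, that is not part of tileiras; it assumes an agent starts from the all-zero state.

```c
#include <assert.h>

typedef struct {
    int phase;
    int index;
    int count;
} PipelineState;

/* Same transition as above: advance index, flip phase on wraparound. */
static PipelineState pipeline_state_increment(PipelineState state,
                                              int stage_count) {
    state.index += 1;
    if (state.index == stage_count) {
        state.index = 0;
        state.phase ^= 1;
    }
    state.count += 1;
    return state;
}

/* Applies `steps` increments to the zero state (hypothetical helper). */
PipelineState pipeline_state_after(int steps, int stage_count) {
    assert(steps >= 0 && stage_count > 0);
    PipelineState s = {0, 0, 0};
    for (int i = 0; i < steps; ++i)
        s = pipeline_state_increment(s, stage_count);
    return s;
}
```

Starting from zero, the state satisfies index == count % stage_count and phase == (count / stage_count) & 1 after every step, so index and phase are both recoverable from count alone.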
The main handshake is:
- Producer acquires an empty stage.
- Producer issues async work for that stage.
- Producer commits, optionally with expected transaction bytes.
- Consumer waits for that stage.
- Consumer reads the produced data.
- Consumer releases the stage so it can be reused.
void lower_pipeline_handshake(Pipeline pipeline, Stage stage) {
    Value slot_addr = pipeline.barrier_addr(stage.index);
    // Producer acquire: wait until the slot is empty.
    emit("nvvm.mbarrier.try_wait.parity.shared",
         /*addr=*/slot_addr,
         /*phase=*/stage.phase ^ 1,
         /*timeout=*/k_default_timeout);
    emit_producer_body(stage);
    // Producer commit: arrive with expect_tx for TMA-backed producers,
    // or plain arrive for non-TMA work.
    if (stage.transaction_bytes > 0)
        emit("nvvm.mbarrier.arrive.expect_tx.shared",
             slot_addr, /*tx_bytes=*/stage.transaction_bytes);
    else
        emit("nvvm.mbarrier.arrive.shared",
             slot_addr, /*count=*/pipeline.num_producers);
    // Consumer wait: wait until the slot is full.
    emit("nvvm.mbarrier.try_wait.parity.shared",
         slot_addr, /*phase=*/stage.phase,
         /*timeout=*/k_default_timeout);
    emit_consumer_body(stage);
    // Consumer release.
    emit("nvvm.mbarrier.arrive.shared",
         slot_addr, /*count=*/pipeline.num_consumers);
}
Unified pipeline_step State Machine
The lowering above is per-stage. A compact way to specify the full state machine (what one iteration of one agent does given its current state and role) is the pipeline_step function below. A verifier or a model-checker can read this directly: every transition on every role is explicit, and the only side channel between roles is the barrier slot selected by state.index.
typedef enum { ROLE_PRODUCER, ROLE_CONSUMER } AgentRole;
PipelineState pipeline_step(Pipeline p, AgentRole role,
                            PipelineState state, StageBody body) {
    Value slot = p.barrier_addr(state.index);
    switch (role) {
    case ROLE_PRODUCER:
        // 1. acquire: empty-side parity is the inverse of full-side parity.
        emit("nvvm.mbarrier.try_wait.parity.shared", slot, state.phase ^ 1);
        // 2. issue async work (TMA / async copy / WGMMA / ...).
        run_producer_body(body);
        // 3. commit: arrive with expect_tx for TMA-backed producers.
        if (body.transaction_bytes > 0)
            emit("nvvm.mbarrier.arrive.expect_tx.shared",
                 slot, /*tx_bytes=*/body.transaction_bytes);
        else
            emit("nvvm.mbarrier.arrive.shared",
                 slot, /*count=*/p.num_producers);
        break;
    case ROLE_CONSUMER:
        // 1. wait: spin until the slot is full (parity matches state.phase).
        emit("nvvm.mbarrier.try_wait.parity.shared", slot, state.phase);
        // 2. read the stage's SMEM / TMEM / register fragments.
        run_consumer_body(body);
        // 3. release: arrive on the empty-side counter.
        emit("nvvm.mbarrier.arrive.shared", slot, /*count=*/p.num_consumers);
        break;
    }
    // 4. advance: increment index, flip phase on wrap.
    return pipeline_state_increment(state, p.stage_count);
}
Three invariants keep this state machine model-checkable: every transition is local to one (role, state) pair; the only inter-role communication runs through the barrier slot; and the parity-bit flip in pipeline_state_increment is what lets each barrier slot be reused on every pass around the ring without aliasing the previous pass. Break any of the three and you lose the ability to prove progress and safety for the producer/consumer ring.
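The parity protocol can be exercised on the host with a small model. This is a sketch under explicit assumptions, not a description of the hardware or of the tileiras lowering: each stage gets a separate full-side and empty-side barrier, every barrier has an arrival count of one, and a wait on parity p succeeds once the phase with parity p has completed. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { DEPTH = 3 };  /* ring depth, i.e. stage count (illustrative value) */

typedef struct { int phase; int index; int count; } PipelineState;

/* Stand-in for one mbarrier: `phase` is the parity still in progress. */
typedef struct { int phase; } Barrier;

typedef struct {
    Barrier full[DEPTH];   /* producer arrives, consumer waits */
    Barrier empty[DEPTH];  /* consumer arrives, producer waits */
    int data[DEPTH];
} Ring;

/* True once the phase with the given parity has completed. */
static bool try_wait_parity(const Barrier *b, int parity) {
    return b->phase != parity;
}

/* A single arrival completes the current phase and flips the parity. */
static void arrive(Barrier *b) { b->phase ^= 1; }

static PipelineState advance(PipelineState s) {
    s.index += 1;
    if (s.index == DEPTH) { s.index = 0; s.phase ^= 1; }
    s.count += 1;
    return s;
}

/* Runs one producer and one consumer to completion. Returns the number of
   items consumed in order, or -1 on a reordering or a deadlock. */
int simulate_ring(int items, int *max_in_flight) {
    assert(items >= 0 && max_in_flight != NULL);
    Ring r = {0};
    PipelineState prod = {0, 0, 0}, cons = {0, 0, 0};
    *max_in_flight = 0;
    while (cons.count < items) {
        bool progressed = false;
        /* Producer runs ahead as far as the empty-side parity allows. */
        while (prod.count < items &&
               try_wait_parity(&r.empty[prod.index], prod.phase ^ 1)) {
            r.data[prod.index] = prod.count;   /* stands in for async work */
            arrive(&r.full[prod.index]);       /* commit */
            prod = advance(prod);
            progressed = true;
            if (prod.count - cons.count > *max_in_flight)
                *max_in_flight = prod.count - cons.count;
        }
        /* Consumer takes one stage if its full-side phase has completed. */
        if (try_wait_parity(&r.full[cons.index], cons.phase)) {
            if (r.data[cons.index] != cons.count)
                return -1;                     /* slot was overwritten */
            arrive(&r.empty[cons.index]);      /* release */
            cons = advance(cons);
            progressed = true;
        }
        if (!progressed)
            return -1;                         /* neither side can move */
    }
    return cons.count;
}
```

In this model the producer never gets more than DEPTH stages ahead of the consumer, and removing the phase flip from advance makes the run fail, which illustrates why the third invariant matters.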
If You Know CUTLASS (open source) — cross-walk
For readers fluent in the open-source cutlass::PipelineTmaAsync<Stages> and friends:
| CUTLASS C++ | tileiras IR |
|---|---|
| PipelineTmaAsync<Stages>::producer_acquire(state) | cutlass.pipeline.producer_acquire %pipe, %state |
| PipelineTmaAsync<Stages>::producer_commit(state, bytes) | cutlass.pipeline.producer_commit %pipe, %state {transaction_bytes = N} |
| PipelineTmaAsync<Stages>::consumer_wait(state) | cutlass.pipeline.consumer_wait %pipe, %state |
| PipelineTmaAsync<Stages>::consumer_release(state) | cutlass.pipeline.consumer_release %pipe, %state |
| PipelineState<Stages> member object | !cutlass.pipeline_state typed value with phase/index/count |
| cutlass::arch::NamedBarrier::sync(id, threads) | cutlass.named_barrier.* / cutlass.generic_barrier.* ops + warp-cooperative diagnostic |
| Cluster-wide barrier on Hopper / Blackwell | nvvm.cluster.arrive / nvvm.cluster.wait pair |
| Template parameter Stages | numStages attribute on cutlass.pipeline.init |
| Template parameter ClusterShape | cluster_shape_x/y/z fields on CutlassTileSchedulerParams |
Two differences are worth flagging. The IR carries num_producers and num_consumers as explicit attributes the verifier cross-checks against the participants list, where the C++ template collapses them into a single ThreadCategory enum. And the executor axis (warp-specialized vs cooperative) is an op-level attribute selected by cutlass.pipeline.switch_by_executor rather than a compile-time template specialisation.
Pipeline Operations
| Operation area | Contract |
|---|---|
| pipeline.create | Allocate or bind the shared barrier storage for a staged pipeline. |
| pipeline.init | Initialize all stage barriers with the correct participant count. |
| pipeline.state.create | Construct the phase/index/count tuple for an agent. |
| pipeline.state.increment | Advance index and flip phase on wraparound. |
| producer_acquire / producer_try_acquire | Wait or probe for an empty producer stage. |
| producer_commit | Signal that produced data is ready, with transaction byte count when needed. |
| producer_tail | Drain outstanding WGMMA or async groups before leaving the mainloop. |
| consumer_wait / consumer_try_wait | Wait or probe for a ready consumer stage. |
| consumer_release | Signal that the consumer has released the stage. |
| switch_by_executor | Split a region by executor role or warp-specialized agent mask. |
| async.exec | Contain executor-specific regions that will become scheduled pipeline roles. |
switch_by_executor verifies that masks are contiguous, cover the enclosing executor set, and match the operand groups they select. It is a semantic partition, not just a branch.
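The mask contract can be sketched with plain bit arithmetic. This is an illustration under assumptions: executor sets are modelled as 32-bit masks (the source does not state a width), the function names are hypothetical, and disjointness is included because the verifier description requires that no participant be counted in two arms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the set bits of `mask` form one unbroken run. */
bool mask_is_contiguous(uint32_t mask) {
    if (mask == 0)
        return false;
    uint32_t filled = mask | (mask - 1);   /* set every bit below the run */
    return (filled & (filled + 1)) == 0;   /* filled must be all low ones */
}

/* True when the arm masks are each contiguous, pairwise disjoint, and
   together cover exactly the enclosing executor set. */
bool arms_partition_executors(const uint32_t *arm_masks, size_t arm_count,
                              uint32_t enclosing) {
    assert(arm_masks != NULL || arm_count == 0);
    uint32_t seen = 0;
    for (size_t i = 0; i < arm_count; ++i) {
        if (!mask_is_contiguous(arm_masks[i]))
            return false;
        if (arm_masks[i] & seen)
            return false;                  /* executor claimed twice */
        seen |= arm_masks[i];
    }
    return seen == enclosing;
}
```

For example, arms 0x1 and 0xE partition a four-executor set 0xF, whereas 0x5 and 0xA cover the same set but fail the contiguity requirement.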
ConvertPipelineToNVVM
ConvertPipelineToNVVM rewrites every cutlass.pipeline.* op into a sequence of nvvm.* intrinsics. It runs after the MaterializeAsync pass (the D07 stage of the cutlass→nvvm pipeline) and is the single point where the abstract producer/consumer state machine becomes a concrete mbarrier program on shared memory. Every later pass sees only NVVM intrinsics for synchronisation.
The driver is sub_15EC600. Its runOnOperation body builds the conversion target — marking eight dialects fully legal (builtin, arith, nvvm, llvm, scf, cute, cute_nvgpu, cutlass) — then calls sub_15E9940 to populate the pattern set and invokes applyPartialConversion. The target legality is partial on purpose: cute and cutlass stay legal so any op outside the pipeline family (TMA descriptors, WGMMA tiles, copy atoms) passes through untouched. One specific cute op, cute_nvgpu.arch.make_warp_uniform, is reserved as legal even within cute because the warp-uniformity anchor must survive past this pass — the downstream codegen relies on it to broadcast scheduler state across the warp.
The pattern set has 22 OpLowering subclasses, splitting into four functional clusters. Initialization: PipelineInitOpLowering, BarrierInitOpLowering, PipelineSwitchByExecutorOpLowering. Producer/consumer handshake: PipelineProducerAcquireOpLowering, PipelineProducerCommitOpLowering, PipelineProducerTailOpLowering, PipelineConsumerWaitOpLowering, PipelineConsumerReleaseOpLowering. State arithmetic: PipelineStateOpLowering, PipelineStateIncrementOpLowering, PipelineStateBumpOpLowering, BarOpLowering. Async-future plumbing: five AsyncWait* and AsyncFutureWait* variants, the two cast bridges (AsyncToAsyncOpConversion, TokenToAsyncOpConversion), CreateNoneOpConversion, and the two block-striped load/store helpers.
The table below lists each pattern with its matchAndRewrite slab address (where known), the vtable bank base under which the RTTI and method table live, the slab size in bytes, and the NVVM emit set the match produces.
| Pattern | matchAndRewrite | vtable bank base | Slab size | Emit set |
|---|---|---|---|---|
| PipelineInitOpLowering | (varies) | 0x59E4520 | 0x70 | nvvm.mbarrier.init + nvvm.barrier.cta.arrive (expect-tx form) |
| PipelineSwitchByExecutorOpLowering | — | 0x59E4520 | 0x70 | conditional branch on executor mode |
| PipelineProducerAcquireOpLowering | sub_15EFAB0 (17 KB) | 0x59E42F0 | 0x70 | 12-op emit set (see Producer Acquire 4-Phase Lowering) |
| PipelineProducerCommitOpLowering | — | 0x59E4340 | 0x78 | nvvm.mbarrier.arrive.expect_tx + arrives |
| PipelineProducerTailOpLowering | — | 0x59E4340 | 0x78 | nvvm.cp.async.bulk.wait_group { count = 0 } (fast path) or scf.for (slow path) |
| PipelineConsumerWaitOpLowering | — | 0x59E4570 | 0x70 | nvvm.mbarrier.try_wait.parity.shared spin loop |
| PipelineConsumerReleaseOpLowering | — | 0x59E4570 | 0x70 | nvvm.mbarrier.arrive |
| PipelineStateOpLowering | — | (varies) | 0x70 | builds the per-stage state |
| PipelineStateIncrementOpLowering | — | 0x59E45C0 | 0x70 | arith.addi plus modulo wrap |
| PipelineStateBumpOpLowering | — | 0x59E45C0 | 0x70 | sibling of above |
| BarOpLowering | sub_15FC250 (~5.5 KB) | 0x59E4610 | 0x78 | named-barrier emission |
| BarrierInitOpLowering | — | 0x59E4660 | 0x70 | nvvm.mbarrier.init (per-barrier initializer) |
| AsyncWaitOpConversionMbarrier | — | off_59D5DD8 | 0x70 | nvvm.mbarrier.try_wait.parity.shared |
| AsyncWaitOpConversionTMASTGAndTMAREDG | — | off_59D5E28 | 0x70 | TMA store-and-reduce wait |
| AsyncWaitOpConversionGMMA | — | off_59D5E78 | 0x70 | nvvm.wgmma.wait.group.sync.aligned |
| AsyncToAsyncOpConversion | — | off_59D5EC8 | 0x70 | builtin.unrealized_conversion_cast |
| CreateNoneOpConversion | — | off_59D5F18 | 0x70 | llvm.mlir.poison |
| AsyncFutureWaitMbarrier | — | off_59D5F68 | 0x70 | nvvm.mbarrier.try_wait.parity.shared (different spin form) |
| AsyncFutureWaitGroup | — | off_59D5FB8 | 0x70 | nvvm.cp.async.bulk.wait_group |
| TokenToAsyncOpConversion | — | off_59D6008 | 0x70 | builtin.unrealized_conversion_cast |
| BlockStripedLoadOpLowering | — | (varies) | 0x70 | cutlass.block_striped.load |
| BlockStripedStoreOpLowering | — | (varies) | 0x70 | cutlass.block_striped.store |
A few details in the table are worth unpacking. The vtable banks cluster patterns that share a base class: the three handshake-side acquire, wait, and release patterns occupy the 0x59E42F0 and 0x59E4570 banks; the commit and tail patterns share 0x59E4340 with a slightly larger 0x78 slab to hold extra emit-set state; and the eight async-plumbing patterns (five AsyncWait*/AsyncFutureWait* variants, the two cast bridges, and CreateNoneOpConversion) occupy the contiguous off_59D5DD8–off_59D6008 range at a 0x50 stride. The 0x70 default slab is the standard OpRewritePattern footprint plus one type-converter pointer; the 0x78 patterns carry one extra field, usually a precomputed attribute (transaction byte count for commit, fast/slow-path flag for tail).
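The off_ addresses in the pattern table are evenly spaced, which a few lines of arithmetic confirm. A minimal sketch, assuming the eight table rows from off_59D5DD8 through off_59D6008 form one contiguous bank; the 0x50 stride is read off the table, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

enum { ASYNC_PATTERN_COUNT = 8 };  /* table rows in the off_59D5... bank */

/* Address of the i-th vtable slot in the bank, counting from zero. */
uint32_t async_vtable_slot(unsigned i) {
    assert(i < ASYNC_PATTERN_COUNT);
    return 0x59D5DD8u + 0x50u * i;
}
```

Slot 0 is AsyncWaitOpConversionMbarrier and slot 7 is TokenToAsyncOpConversion; every intermediate address matches a row of the table.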
The producer tail emits nvvm.cp.async.bulk.wait_group { count = 0 } on the fast path — when the pipeline depth is known statically and all outstanding TMA stores can be drained with a single group wait. The slow path falls back to an scf.for loop that iterates over remaining stages and arrives on each barrier in turn, and gets taken when the analysis cannot prove a single-group drain is safe. The producer commit is the canonical nvvm.mbarrier.arrive.expect_tx site — the only place in the pass that emits an expect_tx attribute — and the byte count comes verbatim from the producer_commit op's transaction-bytes operand.
The two try_wait.parity.shared emit sets are not identical. PipelineConsumerWaitOpLowering emits the canonical spin form with a single phase operand and a fixed timeout. AsyncFutureWaitMbarrier emits a variant that pulls the phase from the future handle and uses a different timeout constant. The two paths cannot be unified because the future-handle form must survive a type round-trip through builtin.unrealized_conversion_cast — the same mechanism AsyncToAsyncOpConversion and TokenToAsyncOpConversion use to bridge the async-token type system before the LLVM lowering finally collapses the casts.
The pass must keep the state tuple coherent across the rewrite. If PipelineStateIncrementOpLowering lowers a wrap that does not flip phase, the resulting try_wait.parity observes stale barrier state on the very next iteration. The increment pattern emits arith.addi plus a modulo compare-and-select that XORs the phase bit on wrap; PipelineStateBumpOpLowering is a sibling pattern used when the pipeline carries a side-channel counter that must advance in lockstep without a phase flip.
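The add, compare, select, and XOR shape described above can be written without branches. The sketch below is a host-side C rendering of that shape, not the emitted IR; the function name is hypothetical and the comments map each line to the operation kind named in the text.

```c
#include <assert.h>

typedef struct {
    int phase;
    int index;
    int count;
} PipelineState;

/* Straight-line increment: no control flow, one select, one XOR. */
PipelineState state_increment_select(PipelineState s, int stage_count) {
    assert(stage_count > 0);
    int next = s.index + 1;                /* add */
    int wrapped = (next == stage_count);   /* compare: 1 on wrap, else 0 */
    s.index = wrapped ? 0 : next;          /* select */
    s.phase ^= wrapped;                    /* XOR flips phase only on wrap */
    s.count += 1;
    return s;
}
```

Dropping the XOR line reproduces the failure mode the paragraph warns about: the index wraps to zero while the phase stays put, so the next parity wait would test the wrong phase.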
Producer Acquire 4-Phase Lowering
PipelineProducerAcquireOpLowering at sub_15EFAB0 is the largest OpLowering in ConvertPipelineToNVVM — roughly 17 KB of compiled code. It splits cleanly into four phases: operand unpack, mask-matrix construction dispatched on the pipeline mode, per-stage arrive emission, and a NamedBarriers tail. All four phases share one rewriter and one running operand vector, so phase ordering is part of the contract; each phase reads state produced by the previous one.
Phase A — Operand Unpack
The acquire op carries three operand bundles. The pipeline state record %pipe unpacks via sub_15E0190(adaptor, k) into {phase, index, count, base_ptr}. The per-stage record %state unpacks into {phase, index}. The last operand is an i1 %unacquired_state flag separating a first-time acquire from a re-entry — first-time entries skip the parity-flip path the steady-state phase masks use.
After the unpack, the lowering materialises two constants — i32 0 and the parity-flip mask i32 -2 — via sub_170B220, and reads the six pipeline shape fields (P, C, num_producers, num_consumers, participants, mode) through sub_172E920, sub_172E930, sub_172E940, sub_172E950, and sub_172E980. The participants field gives the matrix size; the mode field picks between five mask-construction strategies in Phase B.
struct AcquireOperands {
    Value phase;
    Value index;
    Value count;
    Value base_ptr;
    Value state_phase;
    Value state_index;
    Value unacquired_state;
};
struct PipelineShape {
    uint32_t P;
    uint32_t C;
    uint32_t num_producers;
    uint32_t num_consumers;
    uint32_t participants;
    uint32_t mode;
};
Phase B — 16×16 Mask Matrix Dispatch
Phase B builds four 16-element mask buffers — one per parity/role combination (producer-even, producer-odd, consumer-even, consumer-odd) — and dispatches on the pipeline mode read from sub_172E7A0. The 16-row buffer is sized for the worst-case participant matrix; smaller pipelines leave trailing rows zero.
void phase_b_build_masks(Rewriter *r, PipelineShape s, MaskMatrix *out) {
    uint32_t mode = read_mode(s);
    switch (mode) {
    case MODE_COOPERATIVE_ARRIVAL:
        emit_cooperative_mask_reduction(r, s, out);
        break;
    case MODE_WARP_SPECIALIZED_1x1:
        emit_unrolled_select_or_chain(r, s, out);
        break;
    case MODE_STRIDED_UREM:
    case MODE_STRIDED_UDIV:
        emit_scf_for_strided_mask(r, s, mode, out);
        break;
    case MODE_CLUSTER:
        emit_cluster_arrive(r, s);
        break;
    }
}
Mode 0 is CooperativeArrival. The emitter walks the P × C participants matrix and, for each cell, emits an llvm.extractvalue to pull the participant slot, an llvm.and to apply the per-stage mask, an llvm.icmp ne against zero, an llvm.select to pick between the parity-flipped and unflipped mask, and an llvm.or reduction into the running mask. This collapses the per-cell predicates into a single u32 per mask without any control flow: the entire matrix is straight-line code.
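The per-cell and/icmp/select/or chain of mode 0 can be modelled on the host as a plain integer reduction. The sketch below is an illustration only: the function name, the array-based inputs, and the idea that each cell supplies its own flipped and unflipped candidate are assumptions made for the example, not recovered signatures. The loop stands in for the emitter's walk; the emitted IR is fully unrolled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host model of the mode 0 reduction; not the emitted IR. */
uint32_t cooperative_mask(const uint32_t *slots,
                          const uint32_t *flipped,
                          const uint32_t *unflipped,
                          size_t cells,
                          uint32_t stage_mask) {
    assert(cells == 0 || (slots && flipped && unflipped));
    uint32_t running = 0;
    for (size_t i = 0; i < cells; ++i) {
        uint32_t slot = slots[i];                        /* llvm.extractvalue */
        uint32_t masked = slot & stage_mask;             /* llvm.and          */
        int hit = (masked != 0);                         /* llvm.icmp ne 0    */
        uint32_t pick = hit ? flipped[i] : unflipped[i]; /* llvm.select       */
        running |= pick;                                 /* llvm.or           */
    }
    return running;
}
```

Every cell contributes through a select rather than a branch, which is why the lowered form needs no control flow.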
Mode 1 is WarpSpecialized P=C=1. Both dimensions are 1, so the participants matrix degenerates to a single column. The emitter unrolls a 16-stage select-and-or chain through sub_15E0F00 and sub_15E69E0 rather than building a matrix walk: fewer ops, fewer values, no per-cell extractvalues.
Modes 2 and 3 are strided urem and strided udiv. Both build an scf.for whose body emits one llvm.urem (mode 2) or one llvm.udiv (mode 3) per stage. The loop carries the running mask as its iter-arg. The strided modes kick in when the participant count outgrows the unrolled chain but the access pattern stays regular.
Mode 4 is cluster. The mask matrix collapses to a cluster-wide barrier — sub_611A10 emits a single nvvm.cluster.arrive. No 16-element buffer is materialised; Phase C below adapts to the cluster case by substituting the cluster-arrive intrinsic for the per-stage arrives.
Phase C — Per-Stage Arrive
With the masks built, Phase C iterates g ∈ [0, P) and emits four arrives per stage through sub_15E28B0(builder, %pipe, V_arr, ..., mask, 16, 1). This is the 9-argument TMA-aware arrive helper; the ninth argument toggles expect_tx, picking between a plain mbarrier.arrive and an mbarrier.arrive.expect_tx carrying the transaction-bytes hint for TMA-backed producers.
The four arrives per stage cover producer-side and consumer-side masks under both phase parities. Producer-even and consumer-even masks match the current phase; the odd variants are pre-staged for the next phase flip, so a subsequent acquire on the same stage finds its mask already on the barrier.
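void phase_c_emit_arrives(Builder *b, Value pipe, MaskMatrix *m, uint32_t P) {
    // Four arrives per stage g: {producer, consumer} x {even, odd} parity.
    // The argument list is abbreviated here (the helper takes nine arguments);
    // the trailing 1 is the expect_tx toggle described above.
    for (uint32_t g = 0; g < P; ++g) {
        sub_15E28B0(b, pipe, V_arr, /*role*/ PRODUCER, /*parity*/ EVEN, m->prod_even[g], 16, /*expect_tx*/ 1);
        sub_15E28B0(b, pipe, V_arr, /*role*/ PRODUCER, /*parity*/ ODD, m->prod_odd[g], 16, /*expect_tx*/ 1);
        sub_15E28B0(b, pipe, V_arr, /*role*/ CONSUMER, /*parity*/ EVEN, m->cons_even[g], 16, /*expect_tx*/ 1);
        sub_15E28B0(b, pipe, V_arr, /*role*/ CONSUMER, /*parity*/ ODD, m->cons_odd[g], 16, /*expect_tx*/ 1);
    }
}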
Phase D — NamedBarriers Tail
Phase D emits one trailing arith.addi per NamedBarrier barrier-id base. The op type is &unk_5BE5898 and the builder is sub_42D92B0. The barrier-id offset comes from sub_17346A0(op, 3), where the literal 3 is the offset constant identifying NamedBarrier slots in the operand bundle. NamedBarriers piggyback on the acquire so warp-specialized named regions stay synchronised with the staged pipeline without a separate lowering pass.
Once all four phases complete, the lowering finalises with sub_36C67C0(rewriter, op, results, 1u) — the single-result commit. The 1u is the result count, not a flag: producer acquire returns the updated pipeline state record and nothing else.
TypeIDs Used
The emit set spans twelve op types across the four phases. The first nine cover mask-matrix and arrive emission; the last three cover the scf.for strided modes and the NamedBarriers tail.
| TypeID | Op |
|---|---|
| &unk_5BA8EB0 | llvm.extractvalue |
| &unk_5BA8E20 | llvm.icmp ne |
| &unk_5BA8D60 | llvm.select |
| &unk_5BA8DA8 | llvm.or |
| &unk_5BA8F28 | llvm.and |
| &unk_5BA8D18 | llvm.urem |
| &unk_5BA8D28 | llvm.udiv |
| &unk_5BA8F50 | llvm.add |
| &unk_5BA8E00 | llvm.insertvalue |
| &unk_5BE4008 | scf.for |
| &unk_5BE3FC0 | scf.yield |
| &unk_5BE5898 | arith.addi |
nv_tileas.async.* Populate Roster
The nv_tileas.async.* family is the alias-aware twin of the cutlass.pipeline family. It offers similar producer/consumer, wait, and future-wait operations, but typed at the nv_tileaa layer where buffer aliasing lives in the type system rather than as a side fact recovered by analysis. ConvertPipelineToNVVM rewrites this family through eight populator-registered OpConversionPatterns. The pattern class vtables occupy a contiguous range at off_59D5DD8..off_59D6008; each pattern is a 120-byte (0x78) record allocated through sub_44A8C20(0x78) and pushed onto the RewritePatternSet via sub_367D330. The full populate roster lives in sub_1189A50 — about 10.5 KB of pattern registration plus the surrounding constructor wiring.
| Pattern | Op | vtable | Lowering |
|---|---|---|---|
| AsyncWaitOpConversionMbarrier | nv_tileas.async.wait (mbarrier flavor) | off_59D5DD8 | nvvm.mbarrier.try_wait.parity.shared spin loop |
| AsyncWaitOpConversionTMASTGAndTMAREDG | nv_tileas.async.wait (TMA flavor) | off_59D5E28 | nvvm.cp.async.bulk.commit.group + nvvm.cp.async.bulk.wait_group |
| AsyncWaitOpConversionGMMA | nv_tileas.async.wait (GMMA flavor) | off_59D5E78 | nvvm.wgmma.commit.group.sync.aligned + nvvm.wgmma.wait.group.sync.aligned |
| AsyncToAsyncOpConversion | nv_tileas.async.to_async | off_59D5EC8 | builtin.unrealized_conversion_cast |
| CreateNoneOpConversion | nv_tileas.create_none | off_59D5F18 | llvm.mlir.poison |
| AsyncFutureWaitMbarrier | nv_tileas.async.future_wait (mbarrier) | off_59D5F68 | nvvm.mbarrier.try_wait.parity.shared (spin form) |
| AsyncFutureWaitGroup | nv_tileas.async.future_wait (group) | off_59D5FB8 | nvvm.cp.async.bulk.wait_group |
| TokenToAsyncOpConversion | nv_tileas.async.token_to_async | off_59D6008 | builtin.unrealized_conversion_cast |
The mbarrier, TMA, and GMMA wait patterns share an operation name but discriminate at match time on the source of the token they wait on. Mbarrier waits resolve to a parity spin against a shared-memory barrier; TMA waits commit and drain a bulk async group; GMMA waits commit and drain a WGMMA group. The two builtin.unrealized_conversion_cast patterns let nv_tileaa-typed values flow through later lowering passes without losing their alias typing — the cast disappears in subsequent type-conversion folds. CreateNoneOpConversion lowers nv_tileas.create_none to llvm.mlir.poison because the only legal use of a none value is to be consumed by an op that will itself be erased once its data dependence is materialised.
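The vtable column of the roster can be sanity-checked arithmetically: the eight addresses should be evenly spaced if the range really is contiguous. The sketch below copies the addresses from the table and checks the spacing; the array and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Vtable addresses copied from the roster table, in table order. */
static const uint64_t kAsyncPatternVtables[8] = {
    0x59D5DD8, 0x59D5E28, 0x59D5E78, 0x59D5EC8,
    0x59D5F18, 0x59D5F68, 0x59D5FB8, 0x59D6008,
};

/* Returns the common distance between neighbours, or 0 if it is not uniform. */
uint64_t uniform_stride(const uint64_t *addrs, int n) {
    assert(addrs != 0 && n >= 2);
    uint64_t stride = addrs[1] - addrs[0];
    for (int i = 2; i < n; ++i) {
        if (addrs[i] - addrs[i - 1] != stride) {
            return 0;
        }
    }
    return stride;
}
```

For the addresses above the stride comes out as 0x50 bytes, and first + 7 × 0x50 lands exactly on the last entry, off_59D6008.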
Wait-Group Deduplication Walker
A separate pass-body emitter, sub_1181940 (about 30 KB), walks each function's regions after the eight patterns have run. The walker dedupes nv_tileas.async.wait ops by group-id and emits exactly one wait at each region's tail, separately per flavor. It uses 184-byte per-region records keyed by Operation * in open-addressed hash maps with sentinels -4096 (empty) and -8192 (tombstone), and a key hash of (op >> 9) ^ (op >> 4). Three identical re-hash-on-grow blocks cover the three co-allocated tables: the region map at offset +0, the group-id map at offset +16, and the per-flavor cohort map at offset +32.
typedef struct GroupWaitState {
/*+0x00*/ uint32_t flavor; // 0 = wgmma, 1 = cp.async.bulk
/*+0x04*/ uint32_t base; // group-id base for this scope
/*+0x08*/ uint32_t count; // current count of outstanding ops
/*+0x0C*/ uint32_t cursor; // next group-id to claim
} GroupWaitState;
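The sentinels and key hash quoted for the walker's tables match the usual pointer-keyed open-addressing recipe. The following is a minimal lookup sketch under stated assumptions: the bucket layout, the power-of-two table size, and the triangular probe sequence are illustrative choices, since the text above only fixes the hash expression and the two sentinel values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_KEY     ((intptr_t)-4096)   /* never-used bucket */
#define TOMBSTONE_KEY ((intptr_t)-8192)   /* erased bucket     */

/* Hypothetical bucket: the real records are larger. */
typedef struct { intptr_t key; uint32_t value; } Bucket;

static uint32_t hash_op_pointer(intptr_t op) {
    uintptr_t p = (uintptr_t)op;
    return (uint32_t)((p >> 9) ^ (p >> 4));
}

/* Returns the bucket holding key, or NULL when the key is absent.
   num_buckets must be a power of two (assumption). */
Bucket *lookup(Bucket *buckets, uint32_t num_buckets, intptr_t key) {
    assert(num_buckets != 0 && (num_buckets & (num_buckets - 1)) == 0);
    uint32_t idx = hash_op_pointer(key) & (num_buckets - 1);
    for (uint32_t probe = 1; probe <= num_buckets; ++probe) {
        Bucket *b = &buckets[idx];
        if (b->key == key) return b;           /* hit */
        if (b->key == EMPTY_KEY) return NULL;  /* probe chain ends here */
        /* Tombstone or foreign key: keep probing (assumed triangular step). */
        idx = (idx + probe) & (num_buckets - 1);
    }
    return NULL;
}
```

A tombstone does not terminate the search, only an empty bucket does; that is what allows erased entries to sit in the middle of a probe chain.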
At each region exit the walker emits one wait per active flavor. For bulk async it builds nvvm.cp.async.bulk.wait_group (TypeID &unk_5B8DAC8, builder sub_2E6CFB0, flavor == 1). For WGMMA it builds nvvm.wgmma.wait.group.sync.aligned (TypeID &unk_5B8D610, builder sub_2E78330, flavor == 0). The mbarrier handshake never appears in this walker — mbarrier waits come entirely from AsyncWaitOpConversionMbarrier in the eight-pattern roster above, because each mbarrier wait ties to its own barrier slot rather than a group-id cohort that needs a region-tail drain.
The walker is what makes the wait patterns in the eight-pattern roster safe to fire eagerly. A pattern can emit commit.group and wait_group in isolation, and the walker then collapses redundant waits across a scope without the patterns having to coordinate among themselves.
Tile Scheduler Kinds
| Scheduler | Work assignment | Runtime state |
|---|---|---|
| Data-parallel | One output tile per CTA or CGA. | Linear tile id and raster unpacking. |
| Static persistent | Resident CTAs walk a closed-form tile iterator. | Current tile id, validity bit, pipe increment bit. |
| StreamK | Split K work across CTAs, then fix up partial accumulators. | K range, split state, workspace pointer, barrier counters. |
Scheduler handles carry the selected kind. Work-tile-info values carry the fields downstream mainloop and epilogue code needs.
Scheduler Bodies
The runtime work-distribution layer in Tileiras is not one routine. Six cooperating subs decide which CTA handles which tile: four scheduler bodies — one per cutlass.tile_scheduler.* op variant, with two specialisations of StreamK for SM100 vs generic — plus two helpers (workspace sizing and a Params struct factory). Every kernel using CUTLASS-style work distribution picks one of the four body subs based on the dialect op in its module and the target SM, and links in both helpers unconditionally for setup.
| Sub | Scheduler variant | Workspace | Notes |
|---|---|---|---|
| sub_R01 | SM100 StreamK | needs workspace global | Blackwell-specific StreamK with cluster-level coordination |
| sub_R02 | StaticPersistent | small workspace | 1-CTA-per-SM persistent kernel; works on all SMs |
| sub_R03 | StreamK (generic) | needs workspace | Hopper-style StreamK; the SM100 variant supersedes when targetSM >= 100 |
| sub_R04 | DataParallel | no workspace | Pure data-parallel: no work-stealing, simplest case |
| sub_R05 | (helper) getWorkspaceSize | n/a | Computes the per-scheduler workspace requirement |
| sub_R06 | (helper) Params struct factory | n/a | Builds the Params struct each scheduler reads |
The symbols sub_R01 .. sub_R06 are the canonical names used throughout this wiki for the six bodies. Each body exposes the same external entry shape — (Params *params, int linear_id) -> WorkTileInfo — so the dialect lowering can fix on one indirect call site and dispatch by kind at op-selection time.
Any scheduler that needs cross-CTA coordination state allocates a global buffer in the kernel's parameter space. sub_R05 computes the workspace size from (num_ctas, num_stages, tile_count) and the result lives in the kernel's workspace-global-offset attribute (read back through cutlass.tile_scheduler.get_workspace_sizes); DataParallel returns zero and the kernel skips the allocation. StaticPersistent needs only a small counter region for the persistent-advance bookkeeping. Both StreamK variants need partial-accumulator plus barrier regions, whose layout is described under StreamK Workspace below.
The shared Params struct, built by sub_R06 and passed to every body, is a 48-byte record:
typedef struct CutlassTileSchedulerParams {
/*+0x00*/ uint32_t num_tiles_m;
/*+0x04*/ uint32_t num_tiles_n;
/*+0x08*/ uint32_t num_tiles_k;
/*+0x0C*/ uint32_t cluster_shape_x;
/*+0x10*/ uint32_t cluster_shape_y;
/*+0x14*/ uint32_t cluster_shape_z;
/*+0x18*/ uint32_t num_ctas_per_cluster;
/*+0x1C*/ uint32_t total_ctas;
/*+0x20*/ uint64_t workspace_ptr; // 0 if scheduler is workspace-free
/*+0x28*/ uint32_t k_split_count; // StreamK-only; 0 for others
/*+0x2C*/ uint32_t reserved;
} CutlassTileSchedulerParams;
The struct is passed as a cute_nvgpu.grid_constant argument, which lets the compiler hoist all field loads into scalar registers at kernel entry. workspace_ptr is the only 64-bit field because it carries a global address; everything else is a count or shape index and fits in 32 bits. k_split_count is zero for DataParallel and StaticPersistent; the two StreamK bodies are its only consumers, reading it to decide how many K partials to expect at fixup time. Eight 32-bit fields precede workspace_ptr, so it lands at +0x20 on its natural 8-byte alignment; the trailing reserved word rounds the struct up to 48 bytes, a multiple of 8, so no hidden tail pad is needed.
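A reimplementation can pin the layout down at compile time. The sketch below restates the struct with fixed-width types and asserts the offsets and total size quoted above; the helper params_uses_workspace is a hypothetical convenience added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct CutlassTileSchedulerParams {
    uint32_t num_tiles_m;
    uint32_t num_tiles_n;
    uint32_t num_tiles_k;
    uint32_t cluster_shape_x;
    uint32_t cluster_shape_y;
    uint32_t cluster_shape_z;
    uint32_t num_ctas_per_cluster;
    uint32_t total_ctas;
    uint64_t workspace_ptr;   /* 0 if the scheduler is workspace-free */
    uint32_t k_split_count;   /* StreamK only; 0 otherwise            */
    uint32_t reserved;
} CutlassTileSchedulerParams;

/* Compile-time checks against the documented layout. */
_Static_assert(offsetof(CutlassTileSchedulerParams, total_ctas) == 0x1C, "total_ctas at +0x1C");
_Static_assert(offsetof(CutlassTileSchedulerParams, workspace_ptr) == 0x20, "workspace_ptr at +0x20");
_Static_assert(offsetof(CutlassTileSchedulerParams, k_split_count) == 0x28, "k_split_count at +0x28");
_Static_assert(offsetof(CutlassTileSchedulerParams, reserved) == 0x2C, "reserved at +0x2C");
_Static_assert(sizeof(CutlassTileSchedulerParams) == 48, "48-byte record");

/* Hypothetical helper: a zero pointer marks a workspace-free scheduler. */
int params_uses_workspace(const CutlassTileSchedulerParams *p) {
    assert(p != NULL);
    return p->workspace_ptr != 0;
}
```

If a compiler inserted padding anywhere in the record, one of the static assertions would fail at build time rather than silently shifting field offsets.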
Scheduler op selection in the cutlass dialect happens at lowering time. Each cutlass.tile_scheduler.* op declares which scheduler variant it backs. The StreamK family (cutlass.tile_scheduler.create_streamk_params / create_streamk_work_tile_info) resolves to sub_R01 on SM100 and sub_R03 otherwise; the StaticPersistent family (cutlass.tile_scheduler.create_static_persistent_params / create_static_persistent_work_tile_info) always resolves to sub_R02; the DataParallel family (cutlass.tile_scheduler.create_dp_params / create_dp_work_tile_info) always resolves to sub_R04. The dialect verifier enforces the inverse direction too: SM100 streamk is illegal on sm_90 (R01 uses Blackwell cluster barriers that do not exist on Hopper), and the generic streamk op is illegal on sm_100 because R01 supersedes it and a kernel must not link both.
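The selection rule reduces to a small decision function. The sketch below is one way to express it; the enum names are invented for the example, and the SM100 test uses the targetSM >= 100 comparison from the Scheduler Bodies table.

```c
#include <assert.h>

/* Hypothetical tags for the three op families and the four body subs. */
typedef enum {
    FAMILY_STREAMK,
    FAMILY_STATIC_PERSISTENT,
    FAMILY_DATA_PARALLEL
} SchedulerFamily;

typedef enum { BODY_R01 = 1, BODY_R02 = 2, BODY_R03 = 3, BODY_R04 = 4 } SchedulerBody;

/* Maps (op family, target SM number) to the body sub that backs it. */
SchedulerBody select_scheduler_body(SchedulerFamily family, int target_sm) {
    assert(target_sm > 0);
    if (family == FAMILY_STREAMK) {
        /* Only StreamK depends on the target. */
        return target_sm >= 100 ? BODY_R01 : BODY_R03;
    }
    if (family == FAMILY_STATIC_PERSISTENT) {
        return BODY_R02;
    }
    assert(family == FAMILY_DATA_PARALLEL);
    return BODY_R04;
}
```

Because exactly one body is returned for any input, a kernel built through this rule can never link both StreamK variants, which is the property the verifier protects.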
The SM100 streamk body (sub_R01) is the only one using cluster-level coordination. It emits nvvm.cluster.arrive / nvvm.cluster.wait pairs — the Blackwell 2-CTA and 4-CTA cooperative MMA protocol — so each cluster can claim a contiguous range of (M, N, K) tiles, distribute them across the cluster's CTAs, and coordinate K-split partial-reductions through an inter-CTA barrier. The generic streamk body (sub_R03) reaches the same logical result with per-CTA atomics on the barrier workspace. The SM100 variant exists because the cluster-barrier path is far cheaper on Blackwell — at high cluster counts the atomic path's runtime cost would dominate.
Data-Parallel Scheduler
Data-parallel assignment is the simplest mapping: linear CTA id maps to an output tile by raster order and swizzle policy.
WorkTileInfo data_parallel_tile(SchedulerParams p, int linear_id) {
int total = p.tiles_m * p.tiles_n * p.tiles_l;
require(0 <= linear_id && linear_id < total);
RasterCoord r = raster_unpack(linear_id,
p.tiles_m,
p.tiles_n,
p.tiles_l,
p.raster_order,
p.swizzle_size);
WorkTileInfo info;
info.m = r.m * p.cga_shape_m;
info.n = r.n * p.cga_shape_n;
info.l = r.l;
info.linearized_id = linear_id;
return info;
}
Data-parallel schedulers need a WorkID response pointer when running the runtime pull model.
Static Persistent Scheduler
Static persistent scheduling keeps CTAs resident. Each CTA advances by the grid width every iteration until no work remains.
AdvanceResult persistent_advance(StaticPersistentParams p, WorkTileInfo t) {
WorkTileInfo next = t;
next.linearized_id += p.grid_width;
bool valid = next.linearized_id < p.total_cga_tiles;
bool increment_pipe = (t.tile_idx + 1) == p.tiles_per_pipeline_round;
next.tile_idx = increment_pipe ? 0 : t.tile_idx + 1;
return (AdvanceResult){
.tile = next,
.is_valid_tile = valid,
.increment_pipe = increment_pipe,
};
}
advance_to_next_work belongs inside an async execution region because it advances the scheduler and may clock the enclosing pipeline.
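To see the grid-stride walk in action, the advance step can be run on the host. The sketch below restates persistent_advance with trimmed-down structs (only the fields the step touches) and adds a hypothetical driver, count_tiles_for_cta, that counts how many tiles one resident CTA visits.

```c
#include <assert.h>

typedef struct { int grid_width; int total_cga_tiles; int tiles_per_pipeline_round; } StaticPersistentParams;
typedef struct { int linearized_id; int tile_idx; } WorkTileInfo;
typedef struct { WorkTileInfo tile; int is_valid_tile; int increment_pipe; } AdvanceResult;

AdvanceResult persistent_advance(StaticPersistentParams p, WorkTileInfo t) {
    WorkTileInfo next = t;
    next.linearized_id += p.grid_width;  /* stride by the grid width */
    int increment_pipe = (t.tile_idx + 1) == p.tiles_per_pipeline_round;
    next.tile_idx = increment_pipe ? 0 : t.tile_idx + 1;
    AdvanceResult r = { next, next.linearized_id < p.total_cga_tiles, increment_pipe };
    return r;
}

/* Hypothetical driver: number of tiles the CTA that starts at first_id processes. */
int count_tiles_for_cta(StaticPersistentParams p, int first_id) {
    assert(p.grid_width > 0 && first_id >= 0);
    WorkTileInfo t = { first_id, 0 };
    int valid = first_id < p.total_cga_tiles;
    int visited = 0;
    while (valid) {
        ++visited;
        AdvanceResult r = persistent_advance(p, t);
        t = r.tile;
        valid = r.is_valid_tile;
    }
    return visited;
}
```

With a grid width of 4 and 10 tiles, CTAs 0 and 1 each visit three tiles (0, 4, 8 and 1, 5, 9) while CTAs 2 and 3 visit two, so every tile is covered exactly once.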
StreamK Scheduler
StreamK splits work along the K dimension. Some CTAs compute full output tiles; others compute partial K ranges and stash partial accumulators in a workspace. A reducer CTA then waits for all partials, accumulates them, and runs the final epilogue.
StreamKParams compute_streamk_params(Problem problem, TileShape tile, int target_units) {
    int cga_tiles_m = ceil_div(problem.m, tile.m);
    int cga_tiles_n = ceil_div(problem.n, tile.n);
    int output_tiles = cga_tiles_m * cga_tiles_n * problem.l;
    int total_k_work = output_tiles * problem.k_tiles_per_output;
    int units = choose_streamk_units(target_units, output_tiles, total_k_work);
    // Tiles in the data-parallel tail are computed whole, one unit each.
    int dp_tiles = choose_data_parallel_tail(output_tiles, units);
    int sk_tiles = output_tiles - dp_tiles;
    int sk_units = units - dp_tiles;
    int total_sk_k = sk_tiles * problem.k_tiles_per_output;
    // Guard the pure data-parallel case, where no StreamK units remain.
    int small = sk_units > 0 ? total_sk_k / sk_units : 0;
    int big = sk_units > 0 ? total_sk_k % sk_units : 0;
    return (StreamKParams){
        .sk_tiles = sk_tiles,
        .sk_units = sk_units,
        .k_tiles_per_small_unit = small,  // K tiles every StreamK unit receives
        .big_units = big,                 // remainder left after the even split
    };
}
Per-CTA dispatch splits a linear worker id into either a StreamK slice or a data-parallel tail tile:
WorkTileInfo streamk_tile(StreamKParams p, int worker_id) {
if (worker_id >= p.sk_units) {
int tail = worker_id - p.sk_units;
return data_parallel_tail_tile(p, p.sk_tiles + tail);
}
int k_start = streamk_flat_k_start(p, worker_id);
int k_count = streamk_k_count(p, worker_id);
WorkTileInfo info;
info.tile_idx = k_start / p.k_tiles_per_output;
info.k_tile_start = k_start % p.k_tiles_per_output;
info.k_tile_count = k_count;
info.is_separate_reduction = k_count != p.k_tiles_per_output;
return info;
}
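streamk_tile relies on two helpers that are not spelled out above. A minimal sketch follows, assuming the conventional split in which the first big_units workers each take one K tile more than k_tiles_per_small_unit; that ordering is an assumption, as is the trimmed StreamKSplit struct used to keep the example self-contained.

```c
#include <assert.h>

/* Hypothetical subset of StreamKParams needed by the two helpers. */
typedef struct { int sk_units; int k_tiles_per_small_unit; int big_units; } StreamKSplit;

/* K tiles owned by one worker: big workers carry one extra tile. */
int streamk_k_count(StreamKSplit p, int worker_id) {
    assert(0 <= worker_id && worker_id < p.sk_units);
    return p.k_tiles_per_small_unit + (worker_id < p.big_units ? 1 : 0);
}

/* Flat K offset at which a worker's range starts. */
int streamk_flat_k_start(StreamKSplit p, int worker_id) {
    assert(0 <= worker_id && worker_id < p.sk_units);
    /* Each big worker ahead of this one shifts the start by one tile. */
    int extra = worker_id < p.big_units ? worker_id : p.big_units;
    return worker_id * p.k_tiles_per_small_unit + extra;
}
```

For 10 K tiles over 4 units (small = 2, big = 2) the workers get 3, 3, 2, 2 tiles starting at flat offsets 0, 3, 6, 8, so the ranges tile the K space without gaps or overlap.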
StreamK Workspace
StreamK uses two workspace regions:
- reduction workspace: partial accumulator tiles;
- barrier workspace: per-output-tile counters or synchronization words.
WorkspaceSizes streamk_workspace_sizes(StreamKParams p, int accumulator_tile_bytes) {
int splits = ceil_div(p.k_tiles_per_output, p.k_tiles_per_small_unit);
int reduction = p.sk_tiles * (splits - 1) * accumulator_tile_bytes;
int barriers = p.sk_tiles * sizeof(uint32_t);
int total = align_to(reduction + barriers, 128);
return (WorkspaceSizes){
.total = total,
.reduction = reduction,
.barriers = barriers,
};
}
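The workspace pointer must be aligned for vectorised global-memory access. Data-parallel schedulers return zero workspace sizes; static-persistent schedulers need neither a reduction nor a barrier region, only the small counter region noted under Scheduler Bodies.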
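A worked instance makes the sizing formula concrete. The sketch below restates streamk_workspace_sizes with plain integer arguments in place of StreamKParams (a simplification for the example) and uses 64-bit arithmetic so large accumulator tiles do not overflow.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t total; int64_t reduction; int64_t barriers; } WorkspaceSizes;

static int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }
static int64_t align_to(int64_t value, int64_t alignment) {
    return ceil_div(value, alignment) * alignment;
}

WorkspaceSizes streamk_workspace_sizes(int sk_tiles,
                                       int k_tiles_per_output,
                                       int k_tiles_per_small_unit,
                                       int64_t accumulator_tile_bytes) {
    assert(sk_tiles >= 0 && k_tiles_per_small_unit > 0);
    int64_t splits = ceil_div(k_tiles_per_output, k_tiles_per_small_unit);
    WorkspaceSizes s;
    /* The final split keeps its accumulator in registers, hence splits - 1. */
    s.reduction = sk_tiles * (splits - 1) * accumulator_tile_bytes;
    s.barriers = sk_tiles * (int64_t)sizeof(uint32_t);
    s.total = align_to(s.reduction + s.barriers, 128);
    return s;
}
```

With 6 StreamK tiles, 8 K tiles per output, 3 K tiles per small unit, and a 65536-byte accumulator tile, there are 3 splits per tile: the reduction region is 6 × 2 × 65536 = 786432 bytes, the barrier region is 24 bytes, and the 128-byte-aligned total is 786560 bytes.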
Fixup Protocol
void streamk_fixup(StreamKWorkspace ws, WorkTileInfo info, Accumulator partial) {
if (!info.is_final_split) {
store_partial(ws.reduction, info.tile_idx, info.split_idx, partial);
atomic_add_release(ws.barrier, info.tile_idx, 1);
return;
}
wait_until(ws.barrier[info.tile_idx] == info.expected_splits - 1);
Accumulator total = partial;
for (int split = 0; split < info.expected_splits - 1; ++split) {
total += load_partial(ws.reduction, info.tile_idx, split);
}
run_epilogue(total);
reset_barrier(ws.barrier, info.tile_idx);
}
Fixup ops are epilogue ops. They do not need to live inside the pipeline async region, but they must preserve memory ordering on the workspace.
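The ordering requirement can be illustrated with host-side C11 atomics standing in for the GPU's global-memory release and acquire operations. This is a sketch under stated assumptions, not the emitted code: TileWorkspace, MAX_SPLITS, and the non-blocking try_reduce (the GPU body spins instead of returning) are invented for the example, and a single float stands in for an accumulator tile.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_SPLITS 8

/* Hypothetical per-output-tile workspace: partial slots plus one counter. */
typedef struct {
    float partials[MAX_SPLITS];
    atomic_uint arrived;
} TileWorkspace;

/* Non-final split: store the partial, then publish it with a release add. */
void publish_partial(TileWorkspace *ws, int split_idx, float partial) {
    assert(ws != 0 && 0 <= split_idx && split_idx < MAX_SPLITS);
    ws->partials[split_idx] = partial;
    atomic_fetch_add_explicit(&ws->arrived, 1u, memory_order_release);
}

/* Final split: returns 1 and writes the sum once every partial has arrived,
   otherwise returns 0 so the caller can retry. */
int try_reduce(TileWorkspace *ws, int expected_splits, float own, float *total) {
    assert(ws != 0 && total != 0 && 1 <= expected_splits && expected_splits <= MAX_SPLITS + 1);
    /* The acquire load pairs with the release adds above. */
    unsigned seen = atomic_load_explicit(&ws->arrived, memory_order_acquire);
    if (seen != (unsigned)(expected_splits - 1)) {
        return 0;
    }
    float sum = own;
    for (int split = 0; split < expected_splits - 1; ++split) {
        sum += ws->partials[split];
    }
    *total = sum;
    /* Reset the counter so the slot can be reused. */
    atomic_store_explicit(&ws->arrived, 0u, memory_order_relaxed);
    return 1;
}
```

The release on the counter increment is what guarantees the reducer sees the partial that was stored just before it; dropping either side of the pair would let the reducer read stale accumulator data.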
Invariants
- Pipeline stage count, producer count, consumer count, and participant masks are mutually consistent.
- Pipeline state increment flips phase on index wraparound.
- Executor masks are contiguous and cover the enclosing async execution mask.
- Data-parallel schedulers have a WorkID response source when the runtime pull model is selected.
- Static-persistent and StreamK schedulers do not use the data-parallel WorkID pointer path.
- advance_to_next_work appears inside async execution.
- StreamK workspace sizes and scheduler kind agree.
- StreamK fixup uses release/acquire-style ordering around partials.
cutlass Seq-Bar and Block-Striped Operations
Abstract
cutlass.seq_bar.* models CUTLASS ordered sequence barriers — a ring of mbarrier slots plus a phase/index state cursor. cutlass.block_striped.* models per-thread striped memory movement and partial reduction across a CTA. Together they form the synchronisation and cooperative-movement substrate for warp-specialized GEMM mainloops and StreamK/split-K epilogues. The seq-bar family contributes five of the seventy ops the dialect ctor registers at sub_1761D90; the block-striped family contributes four.
Sequential Barrier Model
A sequence barrier is a circular array of barrier slots. Each participant carries a state cursor:
typedef struct {
int phase;
int index;
int count;
} SeqBarState;
The slot is index % depth. Arrival advances the cursor; waiting uses the current phase.
BarrierSlot seq_bar_slot(SeqBar bar, SeqBarState state) {
int slot_index = state.index % bar.depth;
return bar.base[slot_index];
}
SeqBarState seq_bar_advance(SeqBar bar, SeqBarState state) {
state.index += 1;
state.count += 1;
if ((state.index % bar.depth) == 0) {
state.phase ^= 1;
}
return state;
}
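Stepping the cursor a few times shows when the phase bit flips. The sketch below restates the advance rule with the ring depth passed as a plain integer (a simplification; the pseudocode above reads it from the SeqBar handle) and adds a hypothetical seq_bar_advance_n helper that applies it repeatedly.

```c
#include <assert.h>

typedef struct { int phase; int index; int count; } SeqBarState;

/* One arrival: bump index and count, flip phase when the ring wraps. */
SeqBarState seq_bar_advance(int depth, SeqBarState s) {
    assert(depth > 0);
    s.index += 1;
    s.count += 1;
    if (s.index % depth == 0) {
        s.phase ^= 1;
    }
    return s;
}

/* Hypothetical helper: apply `arrivals` consecutive advances. */
SeqBarState seq_bar_advance_n(int depth, SeqBarState s, int arrivals) {
    assert(arrivals >= 0);
    for (int i = 0; i < arrivals; ++i) {
        s = seq_bar_advance(depth, s);
    }
    return s;
}
```

For a ring of depth 3 starting at phase 0, four arrivals leave the cursor on slot 1 with phase 1, and six arrivals return it to slot 0 with phase 0: one phase flip per full trip around the ring.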
Seq-Bar Operations
The five seq-bar ops are cutlass.seq_bar.create, cutlass.seq_bar.init, cutlass.seq_bar.arrive, cutlass.seq_bar.wait, and the per-state-machine cutlass.seq_bar.state.create. The init op claims a slot range from the per-CTA NamedBarrier pool via the same barrier-id allocator at sub_1771850 that PipelineInitOp::verify uses. The thirty-two-slot pool serves both cutlass.pipeline and cutlass.seq_bar, so pipeline-init and seq-bar-init compete for the same physical resource and must agree on allocation order.
| Operation | Contract |
|---|---|
| cutlass.seq_bar.create | Allocate the ring's typed handle; lowering attaches the barrier-id range claimed by init. |
| cutlass.seq_bar.init | Initialize every barrier slot in the ring; allocates slot IDs from the per-CTA NamedBarrier pool. |
| cutlass.seq_bar.arrive | Arrive on the current slot and advance state. |
| cutlass.seq_bar.wait | Wait for the current slot and phase. |
| cutlass.seq_bar.state.create | Materialize the per-thread state record (slot index, phase) consumed by arrive/wait. |
void lower_seq_bar_wait(SeqBar bar, SeqBarState state) {
BarrierSlot slot = seq_bar_slot(bar, state);
emit_mbarrier_try_wait_parity(slot, state.phase);
}
SeqBarState lower_seq_bar_arrive(SeqBar bar, SeqBarState state) {
BarrierSlot slot = seq_bar_slot(bar, state);
emit_mbarrier_arrive(slot, bar.participant_count);
return seq_bar_advance(bar, state);
}
The verifier checks that wait and arrive use a state whose depth matches the sequence-barrier type. Mismatched depths usually mean producer and consumer cursors were constructed for different rings.
Relationship to cutlass.pipeline
seq_bar is the simple ordered-ring path. cutlass.pipeline carries richer producer/consumer participant masks and transaction-byte accounting; its init op is verified by the 3 406-byte PipelineInitOp::verify at sub_1771F40, which checks numStages > 0, that the participants list length (read via sub_172E930) matches numProducers, that the consumer list length (read via sub_172E940) matches numConsumers, that barrier_id_base falls within [0, 32), and that producer_group_id and consumer_group_id are distinct. Seq-bar init skips all of these checks because its participant model is a single flat ring. Reach for seq_bar when ordered arrival and wait are enough; reach for pipeline when TMA, WGMMA, executor masks, or participant roles must be explicit.
Once PipelineInitOp::verify passes, the post-verify builder at sub_1772C90 stamps a derived arrive-count attribute on the pipeline-init op. Seq-bar init has no equivalent post-verify step — its arrive-count is the static participant count from the op attribute and needs no recomputation.
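The five checks listed for PipelineInitOp::verify can be sketched as one straight-line function. The struct, field names, status codes, and check order below are illustrative assumptions; only the conditions themselves come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of the attributes the verifier reads. */
typedef struct {
    int num_stages;
    int num_producers;
    int num_consumers;
    int participant_list_len;
    int consumer_list_len;
    int barrier_id_base;
    int producer_group_id;
    int consumer_group_id;
} PipelineInitAttrs;

typedef enum {
    PIPE_OK,
    PIPE_ERR_STAGES,
    PIPE_ERR_PRODUCERS,
    PIPE_ERR_CONSUMERS,
    PIPE_ERR_BARRIER_BASE,
    PIPE_ERR_GROUP_IDS
} PipelineInitStatus;

PipelineInitStatus verify_pipeline_init(const PipelineInitAttrs *a) {
    assert(a != NULL);
    if (a->num_stages <= 0) return PIPE_ERR_STAGES;
    if (a->participant_list_len != a->num_producers) return PIPE_ERR_PRODUCERS;
    if (a->consumer_list_len != a->num_consumers) return PIPE_ERR_CONSUMERS;
    /* The NamedBarrier pool has thirty-two slots, so the base is in [0, 32). */
    if (a->barrier_id_base < 0 || a->barrier_id_base >= 32) return PIPE_ERR_BARRIER_BASE;
    if (a->producer_group_id == a->consumer_group_id) return PIPE_ERR_GROUP_IDS;
    return PIPE_OK;
}
```

Seq-bar init has no counterpart to any of these checks, which is the practical meaning of its flat single-ring participant model.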
Block-Striped Model
Block-striped movement partitions a contiguous tile across T threads. Thread i touches indices:
i, T + i, 2T + i, ...
The result is a trivial mapping from thread id to element index, and on every pass adjacent threads touch adjacent elements, so the accesses coalesce.
SmallVector<int> block_striped_indices(int thread_id, int threads, int elements) {
    SmallVector<int> result;
    // Thread i visits i, T + i, 2T + i, ... while the index stays in range.
    for (int index = thread_id; index < elements; index += threads) {
        result.push_back(index);
    }
    return result;
}
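Two closed forms fall out of the striped mapping and are handy when checking a lowering: which thread owns a given element, and how many elements a thread's stripe holds. The helper names below are invented for this sketch.

```c
#include <assert.h>

/* Thread that touches element `index` under the striped mapping. */
int block_striped_owner(int index, int threads) {
    assert(threads > 0 && index >= 0);
    return index % threads;
}

/* Number of elements in the stripe of `thread_id`. */
int block_striped_count(int thread_id, int threads, int elements) {
    assert(threads > 0 && 0 <= thread_id && thread_id < threads && elements >= 0);
    if (thread_id >= elements) {
        return 0;  /* more threads than elements: this stripe is empty */
    }
    /* Count of i, T + i, 2T + i, ... that stay below `elements`. */
    return (elements - thread_id + threads - 1) / threads;
}
```

With 4 threads and 10 elements the stripes hold 3, 3, 2, and 2 elements, and element 9 belongs to thread 1. Every element has exactly one owner, which is what makes the later reduce step deterministic per thread.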
Block-Striped Operations
The four block-striped ops are cutlass.block_striped.load, cutlass.block_striped.load_add (fused load-then-add), cutlass.block_striped.store, and cutlass.block_striped.reduce. Type specialisation (half, bfloat, packed, integer, float) is carried as an attribute on each op rather than as separate op-name entries: each variant needs distinct atom selection at lowering time, but the op registration is shared.
| Operation | Contract |
|---|---|
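| cutlass.block_striped.load | Load a per-thread stripe from global memory into register memory. |
| cutlass.block_striped.load_add | Load a per-thread stripe from global memory, add it to the register stripe, and keep the sum in register memory. |
| cutlass.block_striped.store | Store a per-thread stripe from register memory to global memory. |
| cutlass.block_striped.reduce | Atomically add or max each register stripe into a global workspace. |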
Every variant requires one register-memory memref, one global-memory pointer or memref, an element width of at least sixteen bits, and a static shape. Static shape lets the lowering pick a vector width and atom shape.
Four Operand-Layout Checkers
The block-striped family uses four operand-layout checkers — one per variant: sub_176E670 for load, sub_176EE10 for store, sub_176F5B0 for reduce-add, sub_176FD50 for reduce-max. Each one checks register-memory operand width, global-memory pointer or memref shape, element-type width (at least sixteen bits), and static stripe shape. They reject malformed operand combinations before lowering picks a vector width and copy atom.
LogicalResult verify_block_striped(BlockStripedOp op) {
require(has_rmem_operand(op));
require(has_gmem_operand(op));
require(bit_width(op.element_type) >= 16);
require(shape_is_static(op.stripe_shape));
require(op.n_threads_in_block > 0);
return success();
}
Reduce Lowering
block_striped.reduce is the side-effecting atomic path. It emits a global atomic add or max appropriate for the element type and packed width.
void lower_block_striped_reduce(BlockStripedReduceOp op) {
    // Each thread walks its own stripe; every element becomes one global atomic.
    for (int element_index : block_striped_indices(thread_id(), op.threads, op.elements)) {
        Pointer dst = op.workspace + element_index * sizeof(op.element_type);
        Value value = load_register_fragment(op.rmem, element_index);
        // Add form shown; the max form emits the matching atomic max instead.
        emit_global_atomic_add(dst, value, GPU_SCOPE);
    }
}
Half and bfloat packed reductions use the target's packed no-flush path when available. Integer and floating scalar reductions select the matching global-add op. The five type-specialised forms exist so the lowering can route straight to the correct atom without rederiving the element type from a generic op.
Cutlass-Bar Warp-Cooperative Diagnostic
BarOpLowering at sub_15FC250 is a ~5.5 KB routine handling cutlass.bar lowering plus the warp-cooperative diagnostic. The diagnostic fires when a cutlass.bar op carries an arrive-count that is not a multiple of warp size, or when the op sits outside warp-cooperative scope. Block-striped and seq-bar lowering both depend on the warp-cooperative scope guarantee the diagnostic enforces — partial-warp arrivals on a cutlass.bar would otherwise corrupt the per-stage arrive count the pipeline-init post-verify builder stamped.
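The two trigger conditions for the diagnostic amount to a short predicate. The sketch below assumes the standard NVIDIA warp size of 32; the status enum, the function name, and the order in which the two conditions are tested are hypothetical.

```c
#include <assert.h>

#define WARP_SIZE 32u

typedef enum { BAR_OK, BAR_ERR_PARTIAL_WARP, BAR_ERR_SCOPE } BarDiag;

/* Mirrors the two conditions under which the diagnostic fires. */
BarDiag check_cutlass_bar(unsigned arrive_count, int in_warp_cooperative_scope) {
    assert(in_warp_cooperative_scope == 0 || in_warp_cooperative_scope == 1);
    if (arrive_count % WARP_SIZE != 0) {
        return BAR_ERR_PARTIAL_WARP;  /* arrive-count is not whole warps */
    }
    if (!in_warp_cooperative_scope) {
        return BAR_ERR_SCOPE;         /* op sits outside the required scope */
    }
    return BAR_OK;
}
```

An arrive-count of 64 inside the scope passes, 48 is rejected as a partial-warp arrival, and 64 outside the scope is rejected on scope alone.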
Load/Add/Store Lowering
The non-atomic variants lower through cute copy atoms and vector load/store helpers.
void lower_block_striped_load_add(BlockStripedLoadAddOp op) {
CopyAtom atom = make_block_striped_copy_atom(op.element_type, op.stripe_shape);
RegisterFragment old_value = cute_load_vec(op.gmem, atom);
RegisterFragment partial = cute_load_vec(op.rmem, atom);
RegisterFragment merged = add_fragments(old_value, partial, op.element_type);
cute_store_vec(op.rmem, merged, atom);
}
Use floating addition for floating element types and integer addition for integer element types. The verifier nails that choice down by requiring a concrete element type.
StreamK and Split-K Use
StreamK and split-K reach for block-striped ops in their epilogues. Partial CTAs write accumulator fragments into a workspace. The final reducer CTA loads the partials, accumulates them, and stores or atomically reduces the result. The block-striped mapping pins every thread to a deterministic stripe of the accumulator tile.
Invariants
- Sequence-barrier state depth matches the sequence-barrier value.
- Sequence-barrier arrive advances index and flips phase on wraparound.
- Block-striped operands include one register-memory and one global-memory object.
- Block-striped element widths are at least sixteen bits.
- Block-striped shapes are static.
- Reduce is side-effecting and must not be commoned or removed.
- Load/add/store variants choose integer or floating addition from element type.
- cutlass.bar arrive-count is a multiple of warp size and the op is in warp-cooperative scope.
Cross-References
cutlass Pipeline and Tile Scheduler — Pipeline Model covers the richer producer/consumer protocol the simpler seq-bar ring sits next to, and Producer Acquire 4-Phase Lowering covers the NamedBarrier-pool consumer that seq-bar init shares slots with. mbarrier State Machine covers the per-stage mbarrier protocol the seq-bar slots reuse. cutlass Dialect Overview — Block-Striped Operand Checkers lists the four block-striped operand-layout checkers this page's verifier walks.
MODS Async Dispatch
Abstract
cutlass.tile_scheduler.mods_* is a four-op sidecar family for the MODS (Multi-Op Dispatch) async-dispatch path used by CUTLASS-style persistent GEMM kernels. The four ops attach to the tile-scheduler boundary of a persistent mainloop: two mark the start and end of the steady-state pipeline, one reads the current SM id, and one inserts a runtime throttle point. None of the four moves data, computes a tile, or participates in producer/consumer synchronisation. They exist to give the persistent-kernel runtime an explicit handshake with the SM hardware that ordinary cutlass.pipeline.* and cutlass.tile_scheduler.* ops do not provide.
The family is small but the placement is exact. MODS ops appear inside cutlass.async.exec regions alongside the rest of the persistent mainloop's pipeline plumbing. They lower to single-instruction NVVM intrinsics plus one ABI side effect: the mainloop-start and mainloop-end probes also drop arrive/wait pairs against the cluster-wide barrier that coordinates the alternate async-call ABI MODS uses for cross-CTA progress reporting.
Position in the cutlass Dialect
The cutlass dialect groups its seventy ops into eight families. Pipeline (twenty ops), tile_scheduler-non-MODS (thirty-one ops), seq_bar (five ops), and block_striped (four ops) cover the four large-scale orchestration concerns; three smaller barrier/async families (named_barrier, generic_barrier, and the single cutlass.async.exec op) account for the remaining six. The MODS family is a smaller cluster of its own: four ops that target the same persistent-kernel structure the other tile_scheduler ops build, but for runtime reporting rather than work assignment.
| Op | Role | Side effect |
|---|---|---|
| cutlass.tile_scheduler.mods_report_mainloop_start | Mark the entry into the persistent mainloop. | Cluster barrier arrive; optional timestamp read. |
| cutlass.tile_scheduler.mods_report_mainloop_end | Mark the exit from the persistent mainloop after pipeline drain. | Cluster barrier wait; optional timestamp read. |
| cutlass.tile_scheduler.mods_report_smid | Read the SM id assigned to the current CTA. | Special-register read. |
| cutlass.tile_scheduler.mods_throttle | Insert a backoff point to relieve hardware queue pressure. | Side-effecting throttle hook. |
Calling all four "telemetry" overstates what the start/end probes do. The probes carry the cluster handshake the MODS dispatch ABI relies on — the actual mainloop-completion signal — and removing them breaks the persistent kernel's progress contract, not just its diagnostic output.
The Persistent-Kernel Setting
The MODS ops only make sense in the context of a persistent CUTLASS kernel. A persistent kernel launches one CTA per SM (or one cluster per SM tier) and walks an internal tile iterator until no work remains. The structure is:
1. Kernel entry → arrive at the persistent setup barrier.
2. Pipeline init (cutlass.pipeline.init, cutlass.seq_bar.init).
3. Tile-scheduler init (cutlass.tile_scheduler.{streamk,static_persistent,data_parallel}).
4. mods_report_mainloop_start → mainloop entry barrier.
5. Steady-state mainloop: per-tile producer/consumer handshake plus pipeline advance.
6. mods_report_mainloop_end → mainloop exit barrier after the pipeline tail drains.
7. Epilogue and kernel exit.
Steps 4 and 6 are the MODS-specific additions to a standard CUTLASS persistent kernel. They open and close the cluster-coordinated MODS execution window in between which the alternate dispatch ABI is active. Outside the window, the kernel behaves as an ordinary CUTLASS persistent kernel; inside the window, certain cross-CTA progress queries are valid that are not valid elsewhere.
mods_report_smid is independent — it can appear anywhere inside the window and lowers to a special-register read. mods_throttle is also window-internal but is conditionally emitted: the scheduler-state computation decides per-tile whether the pipeline depth is large enough that a throttle is worth inserting.
Op Signatures
The four ops carry minimal operand bundles. Mainloop probes take a participant mask used to size the cluster-barrier arrive count, an is_2cta_mma flag that determines whether participants are CTAs or cluster halves, and an optional timestamp output. mods_report_smid takes no operands and returns one i32. mods_throttle takes one constant operand: the throttle profile to apply.
%t_start = cutlass.tile_scheduler.mods_report_mainloop_start %sched, %participants
{is_2cta_mma = true} : !cutlass.tile_scheduler<streamk>, i64 -> i64
%smid = cutlass.tile_scheduler.mods_report_smid : i32
cutlass.tile_scheduler.mods_throttle {profile = 1 : i32}
cutlass.tile_scheduler.mods_report_mainloop_end %sched, %participants, %t_start
{is_2cta_mma = true} : !cutlass.tile_scheduler<streamk>, i64, i64 -> ()
The start probe optionally returns the timestamp value the end probe consumes. The two probes form a matched pair: the end probe verifies that the start probe it pairs with has the same is_2cta_mma flag and the same participant mask. Mismatch produces a verifier diagnostic rather than a runtime hang.
Dispatch Model and Producer/Consumer Integration
MODS does not replace the producer/consumer protocol the rest of the cutlass dialect implements. It runs in parallel with it: ordinary cutlass.pipeline.producer_acquire, cutlass.pipeline.producer_commit, cutlass.pipeline.consumer_wait, and cutlass.pipeline.consumer_release continue to drive the per-stage mbarrier handshake inside the mainloop. MODS adds one outer layer on top, scoped to the entire mainloop:
mods_report_mainloop_start ─┐
│ cluster-scoped MODS dispatch window
pipeline.init │
for each tile: │
producer.acquire / commit │
consumer.wait / release │
pipeline.producer.tail │
│
mods_report_mainloop_end ─┘
Inside the window, the alternate async-call ABI is active: a producer CTA can issue a progress query that another CTA in the same cluster will pick up at its next mods_throttle point. Outside the window, the same query is rejected. The verifier checks that no MODS op whose behaviour depends on cluster-wide state (such as a mods_throttle with a cluster-aware profile) appears outside a matched start/end pair.
The participant model passed to the start probe interacts with the cluster shape declared on the enclosing kernel. For single-CTA MMA, the participant mask is one bit per CTA in the cluster. For two-CTA MMA, the participant mask is one bit per cluster half, and the start probe's is_2cta_mma flag forces the lowering to read cluster.ctarank rather than tid.x when computing each participant's arrive contribution. The two paths produce different NVVM emit sets, not just different attribute values.
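The relationship between cluster shape, the is_2cta_mma flag, and the participant mask can be written down as a small host-side sketch. It assumes that in two-CTA mode a participant is a CTA pair, so the count is half the CTA count; that assumption matches the 2x2 example given under Verifier Rules but is not confirmed for other shapes. All function names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of barrier participants for a cluster_x * cluster_y cluster.
// Assumption: two-CTA MMA pairs CTAs, so each participant is one pair.
constexpr std::uint32_t participant_count(std::uint32_t cluster_x,
                                          std::uint32_t cluster_y,
                                          bool is_2cta_mma) {
    const std::uint32_t ctas = cluster_x * cluster_y;
    return is_2cta_mma ? ctas / 2 : ctas;
}

// Dense i64 mask with one bit per participant (bit i = participant i).
constexpr std::uint64_t full_participant_mask(std::uint32_t count) {
    return count >= 64 ? ~std::uint64_t{0}
                       : ((std::uint64_t{1} << count) - 1);
}

// Counts set bits by clearing the lowest set bit each iteration.
constexpr std::uint32_t popcount64(std::uint64_t v) {
    std::uint32_t n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

// True when the mask has no more bits set than the cluster shape supports.
constexpr bool mask_fits_cluster(std::uint64_t mask, std::uint32_t cluster_x,
                                 std::uint32_t cluster_y, bool is_2cta_mma) {
    return popcount64(mask) <=
           participant_count(cluster_x, cluster_y, is_2cta_mma);
}
```

Under these assumptions a 2x2 cluster in two-CTA mode has two participants and a full mask of 0x3, while a 4x1 cluster in single-CTA mode has four participants and a full mask of 0xF.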
NamedBarrier and mbarrier Integration
The mainloop-start probe and mainloop-end probe both touch the cluster barrier, not the per-stage mbarrier slots. The cluster barrier is allocated from the NamedBarrier pool by the same cutlass.pipeline.init / cutlass.seq_bar.init barrier-id allocator the rest of the dialect uses; MODS does not claim its own pool slots. The mainloop-start probe arrives on the cluster barrier with a count derived from the participant mask; the mainloop-end probe waits for the matching arrive count from every participant before continuing.
This is why removing the probes breaks the kernel rather than just its reporting. The cluster barrier participates in the persistent-kernel progress contract — without the matched arrive/wait pair, late CTAs can re-enter the mainloop while early CTAs have already finished, and the mods_throttle hooks downstream observe inconsistent cluster-wide state.
The per-stage mbarrier slots used by cutlass.pipeline.* are untouched by MODS. The two synchronization domains are separate by design: one coordinates producer/consumer agents within a CTA, the other coordinates progress across CTAs in a cluster. MODS sits at the cluster-scoped level alongside the nvvm.cluster.arrive / nvvm.cluster.wait pair, not at the per-stage level.
Verifier Rules
The dialect contract for the MODS family covers four distinct checks:
- Matched probe pair. Every mods_report_mainloop_start op must be paired with exactly one mods_report_mainloop_end op in the same enclosing cutlass.async.exec region. The pair must agree on the is_2cta_mma flag and on the participant mask.
- Window enclosure. Any mods_throttle or mods_report_smid op appearing inside a region must lie between a matched start/end pair if its operand bundle depends on cluster-wide state. Smid reads (no cluster dependency) can appear anywhere; throttle ops with cluster-aware profiles cannot.
- Cluster shape coherence. The participant mask passed to the start probe must agree with the cluster shape declared on the enclosing kernel's gpu.module. A 2x2 cluster with is_2cta_mma = true declares two participants; a 4x1 cluster with is_2cta_mma = false declares four. The verifier rejects participant masks that have more bits set than the cluster shape supports.
- Side-effect preservation. MODS ops carry the MemoryEffects::Write side-effect on the cluster barrier. Optimizers that drop ops based on read/written-value analysis must observe this and leave the probes in place even when their return values appear unused.
The first three checks fire at op verify time. The fourth is a property of the op's memory-effect declaration and is enforced by the standard MLIR optimizer machinery rather than by a dedicated verifier.
LogicalResult verify_mods_probe_pair(MlirOperation start, MlirOperation end) {
require(start->is_2cta_mma == end->is_2cta_mma);
require(start->participants == end->participants);
require(same_async_exec_region(start, end));
require(no_other_probe_op_between(start, end));
return success();
}
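The window-enclosure rule can be sketched the same way. The block below is a simplified model that assumes the ops of one async region have been flattened into a linear sequence; the `ModsOp` enum and `window_enclosure_ok` are hypothetical names, and a real verifier would walk MLIR regions rather than a vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flattened view of the ops in one async region.
enum class ModsOp {
    MainloopStart,
    MainloopEnd,
    ReportSmid,
    ThrottleLocal,         // profile with no cluster-wide dependency
    ThrottleClusterAware,  // profile that reads cluster-wide state
    Other
};

// Returns true when start/end probes are properly paired and every
// cluster-aware throttle sits inside an open start/end window.
bool window_enclosure_ok(const std::vector<ModsOp>& ops) {
    bool window_open = false;
    for (ModsOp op : ops) {
        switch (op) {
        case ModsOp::MainloopStart:
            if (window_open) return false;   // second start before an end
            window_open = true;
            break;
        case ModsOp::MainloopEnd:
            if (!window_open) return false;  // end without a matching start
            window_open = false;
            break;
        case ModsOp::ThrottleClusterAware:
            if (!window_open) return false;  // cluster state used outside window
            break;
        default:
            break;                           // smid reads and local throttles
        }
    }
    return !window_open;                     // every start must be closed
}
```

A sequence of start, cluster-aware throttle, end passes; a lone cluster-aware throttle or an unclosed start fails.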
Lowering
The MODS lowering runs as part of ConvertPipelineToNVVM — the same pass that lowers cutlass.pipeline.* to mbarrier intrinsics. It is not a separate pass because MODS shares its barrier-allocation state with the rest of the cutlass dialect; running it later would require re-reading the NamedBarrier pool state the pipeline-init lowering already consumed.
| Op | Lowered emit set |
|---|---|
| mods_report_mainloop_start | nvvm.cluster.arrive + optional nvvm.read.ptx.sreg.globaltimer |
| mods_report_mainloop_end | nvvm.cluster.wait after cp.async.bulk.wait_group { count = 0 } drain + optional second nvvm.read.ptx.sreg.globaltimer and subtraction |
| mods_report_smid | nvvm.read.ptx.sreg.smid |
| mods_throttle | llvm.inline_asm emitting nanosleep.u32 with profile-derived constant, or nvvm.barrier.cta.sync for the cooperative-throttle profile |
The mods_report_mainloop_end lowering is the most intricate. Before it can issue the cluster-barrier wait, it must drain any outstanding TMA stores from the pipeline tail; the lowering inserts a cp.async.bulk.wait_group { count = 0 } between the pipeline's producer.tail op and the cluster wait. This is one of the few places the cutlass-to-NVVM lowering inserts a wait that does not correspond directly to a cutlass.pipeline.consumer_wait op — it exists because the cluster barrier observes the full mainloop, not a single stage, and the tail drain is what makes that observation safe.
The mods_throttle lowering picks one of three profiles. Profile 0 is a no-op — the op is dropped during lowering when the surrounding pipeline depth analysis decides the kernel does not need a throttle. Profile 1 emits a fixed-duration llvm.inline_asm wrapping the nanosleep.u32 PTX instruction. Profile 2 emits a cooperative-throttle path: the throttle becomes a participation point in a per-cluster barrier whose count adapts to current hardware queue pressure measured at runtime.
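The three-way profile choice amounts to a small dispatch. The sketch below only classifies the outcome; the enum and function names are hypothetical, and it deliberately does not model the pipeline-depth analysis or the concrete ops each path emits.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical classification of what a mods_throttle profile lowers to.
enum class ThrottleLowering {
    Dropped,             // profile 0: op removed during lowering
    NanosleepInlineAsm,  // profile 1: fixed-duration nanosleep.u32
    CooperativeBarrier   // profile 2: cooperative-throttle barrier path
};

ThrottleLowering lower_mods_throttle(int profile) {
    switch (profile) {
    case 0: return ThrottleLowering::Dropped;
    case 1: return ThrottleLowering::NanosleepInlineAsm;
    case 2: return ThrottleLowering::CooperativeBarrier;
    default:
        // Only the three profiles described above are modelled here.
        throw std::invalid_argument("unknown mods_throttle profile");
    }
}
```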
Relationship to the cutlass C++ Library
The OSS CUTLASS library exposes a cutlass::mods::* namespace with mainloop_begin, mainloop_end, report_smid, and throttle helpers. The four MLIR ops correspond one-to-one with these helpers, but the IR shape collapses what the C++ surface presents as templated function calls into named ops with explicit operand bundles. The participant mask the C++ helpers compute at compile time from the ClusterShape template parameter becomes an explicit operand on the IR ops; the is_2cta_mma flag the C++ helpers derive from the MMA::Tile policy becomes an explicit attribute.
The same simplification the cutlass dialect applies to other CUTLASS templates applies here: template specialisation chains in cutlass/mods/*.hpp turn into op attributes the verifier cross-checks, and the runtime behavior is preserved through the dialect contract rather than through the C++ template instantiation surface.
Invariants
- The four MODS ops appear only inside nv_tileas.async.exec regions, never at module or kernel scope.
- The mainloop-start and mainloop-end probes form matched pairs scoped to one async region.
- The pair agrees on is_2cta_mma and on the participant mask.
- The participant mask agrees with the cluster shape declared on the enclosing gpu.module.
- Cluster-aware throttle profiles appear only between a matched start/end pair.
- The probes' side effects on the cluster barrier are not removable by ordinary optimization passes.
- mods_report_smid and the profile-0 mods_throttle are the only MODS ops that can be safely dropped if their results are unused; the other three carry barrier-side-effect semantics that pin them in place.
Cross-References
- cutlass Dialect Overview — Operation Roster — the family roster and verifier inventory.
- cutlass Pipeline and Tile Scheduler — Pipeline Model — the producer/consumer protocol MODS runs alongside.
- Cluster Sync and DSMEM Handshake — Plain Cluster Barrier — the cluster-barrier mechanism the mainloop probes use.
- mbarrier State Machine — State Machine — the per-stage barrier protocol MODS does not touch.
nvgpu Dialect Overview
Abstract
nvgpu is the bridge dialect between MLIR's generic gpu dialect and NVPTX-specific nvvm. It names the NVIDIA kernel patterns that gpu cannot express — warp shuffle, MMA and WGMMA, cp.async, mbarrier, TMA — without committing yet to a concrete NVVM intrinsic. Tileiras links the upstream dialect unchanged. cute_nvgpu feeds it from above; convert-nvgpu-to-nvvm drains it from below.
About thirty ops live here. The conversion pass installs one OpConversionPattern per op and rewrites the module in a single sweep, each pattern emitting a small fixed body of nvvm.* ops or, in a few exceptional cases, an expanded memref / llvm / llvm.inline_asm sequence. The pass mnemonic is convert-nvgpu-to-nvvm; in the O3 pipeline it runs immediately after the broad convert-to-llvm step and before convert-vector-to-llvm, arith-expand, and convert-memref-to-llvm (see Pass List by Optimization Level — O3), so by the time it fires every operand is already in LLVM-dialect or memref form.
Position in the Cascade
cute_nvgpu
|
| lower architecture atoms into stock GPU operations
v
nvgpu
|
| convert-nvgpu-to-nvvm: ~30 patterns, one sweep
v
nvvm
|
| translate to LLVM IR and the NVPTX backend
v
PTX
cute_nvgpu ops still speak SM-tier vocabulary — TMA atoms, WGMMA atoms, Blackwell tensor-memory operations. nvgpu strips the source-level atom naming and re-presents the same behaviour over MLIR memrefs, vectors, descriptors, barrier groups, and async tokens. That makes the NVVM conversion mechanical: every nvgpu op below has a fixed nvvm (or llvm.inline_asm) lowering.
Op Roster
populateNVGPUToNVVMConversionPatterns installs one OpConversionPattern per op. Tileiras links the upstream populator unchanged. The rewriter callbacks branch on source memory space to pick the generic or .shared form of the mbarrier and cp.async intrinsics — address space 3 always selects .shared.
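The address-space branch is easy to picture as a name selector. The sketch below is a simplified stand-in for that decision: `select_intrinsic` is a hypothetical helper that works on op-name strings, whereas the real rewriter callbacks choose between op classes.

```cpp
#include <cassert>
#include <string>

constexpr unsigned kSharedAddressSpace = 3;  // NVPTX shared memory

// Appends ".shared" when the barrier or copy target lives in address
// space 3; every other address space keeps the generic form.
std::string select_intrinsic(const std::string& generic_name,
                             unsigned address_space) {
    return address_space == kSharedAddressSpace ? generic_name + ".shared"
                                                : generic_name;
}
```

For example, `select_intrinsic("nvvm.mbarrier.init", 3)` yields `nvvm.mbarrier.init.shared`, while address space 0 leaves the name unchanged.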
The "Status" column distinguishes ops whose mnemonic string appears verbatim in this binary's string table from upstream-MLIR patterns that the linked populator carries but whose mnemonic was never interned (because the op was renamed, was dropped, or is reached only through gpu-dialect routing).
| nvgpu op | NVVM op(s) emitted | Status |
|---|---|---|
| nvgpu.device_async_copy | nvvm.cp.async.shared.global | interned |
| nvgpu.device_async_create_group | nvvm.cp.async.commit.group | interned |
| nvgpu.device_async_wait | nvvm.cp.async.wait.group | interned |
| nvgpu.mbarrier.create | (no nvvm.*; memref.global + memref.get_global) | interned |
| nvgpu.mbarrier.init | nvvm.mbarrier.init / nvvm.mbarrier.init.shared (address-space-driven) | interned |
| nvgpu.mbarrier.arrive | nvvm.mbarrier.arrive / .shared | interned |
| nvgpu.mbarrier.arrive.nocomplete | nvvm.mbarrier.arrive.nocomplete / .shared | interned |
| nvgpu.mbarrier.arrive.expect_tx | nvvm.mbarrier.arrive.expect_tx / .shared | interned |
| nvgpu.mbarrier.test.wait | nvvm.mbarrier.test.wait / .shared | interned |
| nvgpu.mbarrier.try_wait.parity | nvvm.mbarrier.try_wait.parity.shared | interned |
| nvgpu.mbarrier.inval | nvvm.mbarrier.inval[.shared] | absent from this binary; NVVM mbarrier.inval.shared is reached through the lower-level pattern |
| nvgpu.tma.async.load | nvvm.cp.async.bulk.tensor.shared.cluster.global | interned |
| nvgpu.tma.async.store | nvvm.cp.async.bulk.tensor.global.shared.cta | interned |
| nvgpu.tma.async.reduce | nvvm.cp.async.bulk.tensor.reduce | absent from this binary; NVVM reduce intrinsic still ships, but no nvgpu wrapper interns the mnemonic |
| nvgpu.tma.prefetch.descriptor | nvvm.prefetch.tensormap | interned |
| nvgpu.tma.fence.descriptor | nvvm.fence.proxy.acquire | interned |
| nvgpu.tma.create.descriptor | llvm.alloca + GEP/store sequence + llvm.call @cuTensorMapEncodeTiled | interned |
| nvgpu.tensormap.create.descriptor (device-side replace path) | nvvm.tensormap.cp.async.shared + nvvm.tensormap.replace.* | absent from this binary; described here for completeness against upstream MLIR |
| nvgpu.tensormap.update.{global_address,box_dim,element_stride} | nvvm.tensormap.replace.* per field | absent from this binary; same upstream-only status |
| gpu.warp_execute_on_lane_0 (consumed at this stage) | nvvm.shfl.sync + conditional region | routed through the upstream gpu dialect; no nvgpu mnemonic interned |
| nvgpu.warpgroup.descriptor (generate) | integer shl/or chain producing a 64-bit value (see WGMMA descriptor bit layout) | interned (both nvgpu.warpgroup.descriptor and nvgpu.warpgroup.generate.descriptor) |
| nvgpu.warpgroup.mma | nvvm.wgmma.fence.aligned + N× nvvm.wgmma.mma_async + nvvm.wgmma.commit.group.sync.aligned + nvvm.wgmma.wait.group.sync.aligned | interned |
| nvgpu.warpgroup.mma.store | per-thread llvm.store decomposition of the accumulator | interned |
| nvgpu.warpgroup.mma.init.accumulator | llvm.mlir.undef (or zero) accumulator aggregate | interned |
| nvgpu.mma.sync | nvvm.wmma.mma.sync.aligned (sm_70..sm_89) or nvvm.wgmma.mma_async (sm_90+) | interned |
| nvgpu.mma.sp.sync | llvm.inline_asm with mma.sp.sync.aligned.m... template | interned |
| nvgpu.ldmatrix | nvvm.ldmatrix + register repack | interned |
| (no nvgpu.stmatrix mnemonic in this binary; the upstream stmatrix lowering targets nvvm.stmatrix directly) | nvvm.stmatrix (when available) or llvm.inline_asm | NVVM op present, nvgpu wrapper absent |
| nvgpu.rcp | nvvm.rcp.approx.ftz.f family or libdevice call | interned |
| nvgpu.cvt_fpext / nvgpu.cvt_fptrunc | nvvm.cvt.packfloat.f32 family | interned |
| nvgpu.fma.packed.f32x2 / nvgpu.mul.packed.f32x2 | packed nvvm.fma.packed.f32x2 / nvvm.mul.packed.f32x2 | interned |
Patterns are registered at benefit = 1. The 64-bit values consumed by nvgpu.warpgroup.mma's descriptorA / descriptorB operands are the same SMEM descriptor words that flow through cute_nvgpu MMA atoms — the canonical bitfield decode is on the cute_nvgpu MMA atoms page.
Operand and Attribute Tables
The tables below pin every op family in the dialect to its operand list, attribute list, result list, and the NVVM rewrite the conversion pattern emits. SM gating lives in the per-arch availability table further down.
nvgpu.device_async_copy
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | memref<...> in addr-space 3 (shared) | minor dim must be unit-stride |
| operand 1 | src | memref<...> in addr-space 1 (global) | minor dim must be unit-stride |
| operand 2 | dstIndices | variadic index | rank == dst rank |
| operand 3 | srcIndices | variadic index | rank == src rank |
| operand 4 | srcElements | optional index | runtime element count for predicated case |
| attribute | dstElements | i64 (IntegerAttr) | element count per lane; 4, 8, 16 |
| attribute | bypassL1 | optional UnitAttr | selects .cg cache modifier |
| result 0 | token | !nvgpu.device.async.token | passed to commit / wait |
Rewriter emits nvvm.cp.async.shared.global with cp_size = dstElements * eltbytes and cp_modifier = bypassL1 ? cg : ca.
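The two derived parameters can be checked with a small host-side sketch. `CpAsyncParams` and `cp_async_params` are hypothetical names; the only facts taken from the text are the `cp_size` product and the `bypassL1` selector, plus the PTX rule that `cp.async` copies 4, 8, or 16 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical bundle of the values the rewriter derives for cp.async.
struct CpAsyncParams {
    std::uint32_t cp_size;   // bytes copied by one cp.async
    std::string   modifier;  // "ca" (cache at all levels) or "cg" (bypass L1)
};

CpAsyncParams cp_async_params(std::uint32_t dst_elements,
                              std::uint32_t element_bytes,
                              bool bypass_l1) {
    const std::uint32_t cp_size = dst_elements * element_bytes;
    // PTX cp.async only accepts copy sizes of 4, 8, or 16 bytes.
    assert(cp_size == 4 || cp_size == 8 || cp_size == 16);
    return {cp_size, bypass_l1 ? "cg" : "ca"};
}
```

Four f32 elements with bypassL1 set give a 16-byte copy with the cg modifier; two f16 elements without it give a 4-byte copy with ca.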
nvgpu.device_async_create_group
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0..N | inputTokens | variadic !nvgpu.device.async.token | tokens to commit as a group |
| result 0 | groupToken | !nvgpu.device.async.token | feeds device_async_wait |
Rewriter emits a single nvvm.cp.async.commit.group. Input tokens are erased; the SSA edge survives only as a happens-before constraint.
nvgpu.device_async_wait
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | asyncDependencies | !nvgpu.device.async.token | group token to wait on |
| attribute | numGroups | optional i32 | passed verbatim as the wait_group immediate |
Rewriter emits nvvm.cp.async.wait.group N where N = numGroups (default 0).
nvgpu.mbarrier.create
| Position | Name | Type | Notes |
|---|---|---|---|
| attribute | numBarriers | i64 | requested mbarrier count in the group |
| result 0 | barriers | !nvgpu.mbarrier.group | wraps the shared-memory slot |
Rewriter emits no nvvm.* op. It generates a memref.global "private" @__mbarrier ... : memref<numBarriers x i64, 3> and a memref.get_global returning a base pointer.
nvgpu.mbarrier.init.shared (alias of mbarrier.init over addr-space 3)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | barriers | !nvgpu.mbarrier.group | the group from mbarrier.create |
| operand 1 | mbarId | index | barrier index within the group |
| operand 2 | count | index | participant count |
| attribute | (none) | — | the address space drives the .shared selector |
Rewriter emits nvvm.mbarrier.init.shared against the GEP-resolved slot address.
nvgpu.mbarrier.arrive
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | barriers | !nvgpu.mbarrier.group | wraps the shared-memory slot |
| operand 1 | mbarId | index | barrier index within the group |
| result 0 | token | !nvgpu.mbarrier.token | feeds mbarrier.test.wait |
Rewriter emits nvvm.mbarrier.arrive or nvvm.mbarrier.arrive.shared based on the slot's address space.
nvgpu.mbarrier.arrive.expect_tx
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | barriers | !nvgpu.mbarrier.group | mbarrier slot |
| operand 1 | mbarId | index | barrier index within the group |
| operand 2 | txCount | index | expect-tx byte count |
Rewriter emits nvvm.mbarrier.arrive.expect_tx[.shared]. No SSA result; the side effect is on the shared-memory mbarrier slot.
nvgpu.mbarrier.try_wait.parity
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | barriers | !nvgpu.mbarrier.group | mbarrier slot |
| operand 1 | mbarId | index | barrier index within the group |
| operand 2 | phase | index | phase parity (0 or 1) |
| operand 3 | ticks | index | retry budget |
Rewriter emits nvvm.mbarrier.try_wait.parity.shared returning an i1 polled in a loop.
nvgpu.mbarrier.inval (absent in this binary)
The mnemonic nvgpu.mbarrier.inval is not interned in this tileiras string table; the inval-side intrinsic (nvvm.mbarrier.inval.shared) is still present and is reached through the lower-level NVVM lowering. The operand list below is documented for upstream-MLIR parity and as a reference for reimplementers that choose to surface the wrapper.
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | barriers | !nvgpu.mbarrier.group | mbarrier slot |
| operand 1 | mbarId | index | barrier index within the group |
Upstream rewriter shape: emit nvvm.mbarrier.inval[.shared].
nvgpu.tma.async.load
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | memref<...> in addr-space 3 | TMA destination |
| operand 1 | barrier | !nvgpu.mbarrier.group | arrives expect-tx on completion |
| operand 2 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | from tma.create.descriptor |
| operand 3..7 | coordinates | variadic i32, rank 1..5 | tile origin in tensor space |
| operand 8 | multicastMask | optional i16 | cluster multicast bitmap |
| operand 9 | l2CacheHint | optional i64 | maps to .L2::cache_hint |
| attribute | predicate | optional i1 | gated TMA issue |
Rewriter emits a single nvvm.cp.async.bulk.tensor.shared.cluster.global. See Lowering: TMA Async Load — Operand Mapping for the operand-slot mapping.
nvgpu.tma.async.store
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | src | memref<...> in addr-space 3 | TMA source tile in SMEM |
| operand 1 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | global tensor map |
| operand 2..6 | coordinates | variadic i32, rank 1..5 | tile origin in tensor space |
| operand 7 | l2CacheHint | optional i64 | maps to .L2::cache_hint |
| attribute | predicate | optional i1 | gated TMA issue |
Rewriter emits nvvm.cp.async.bulk.tensor.global.shared.cta. No barrier is involved: the producer side does not wait.
nvgpu.tma.prefetch.descriptor
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | descriptor to prefetch |
Rewriter emits nvvm.prefetch.tensormap [%tmap].
nvgpu.tma.fence.descriptor
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | descriptor being made visible |
Rewriter emits nvvm.fence.proxy.acquire.sync.cluster — the proxy-acquire fence that the WGMMA descriptor consumer needs.
nvgpu.tma.async.reduce (absent in this binary)
The nvgpu.tma.async.reduce mnemonic is not interned in this tileiras build. The underlying NVVM op (nvvm.cp.async.bulk.tensor.reduce) is present and consumed by cute_nvgpu lowerings directly; no nvgpu wrapper surfaces the reduce variant. The operand layout below documents the upstream wrapper for parity.
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | src | memref<...> in addr-space 3 | SMEM source tile |
| operand 1 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | global tensor map |
| operand 2..6 | coordinates | variadic i32, rank 1..5 | tile origin in tensor space |
| operand 7 | l2CacheHint | optional i64 | L2 hint |
| attribute | redop | enum tma_redux_kind | add / min / max / inc / dec / and / or / xor |
Upstream rewriter shape: emit nvvm.cp.async.bulk.tensor.reduce with red_op decoded from the attribute.
nvgpu.tma.create.descriptor
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensor | memref<...> | global tensor whose layout the descriptor encodes |
| operand 1..N | boxDimensions | variadic index | TMA tile shape per rank |
| attribute | swizzle | enum tma_swizzle | none / 32B / 64B / 128B |
| attribute | l2Promotion | enum tma_l2_promotion | none / 64B / 128B / 256B |
| attribute | oobFill | enum tma_oob_fill | none / nan |
| attribute | interleave | enum tma_interleave | none / 16B / 32B |
| result 0 | descriptor | !nvgpu.tensormap.descriptor | global-memory pointer to the 128-byte CUtensorMap |
Rewriter emits no nvvm.* op. It allocates a 128-byte CUtensorMap on the host stack via llvm.alloca, fills it through llvm.getelementptr + llvm.store, and calls cuTensorMapEncodeTiled.
nvgpu.tensormap.create.descriptor (device-side replace path; absent in this binary)
This op family is not interned in this tileiras build. The device-side descriptor replace path is reached directly through cute_nvgpu -> nvvm.tensormap.* without going through an nvgpu.tensormap.create.descriptor wrapper. Operand layout documented below for upstream-MLIR parity.
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | !nvgpu.tensormap.descriptor in shared | destination mailbox |
| operand 1 | src | !nvgpu.tensormap.descriptor in global | source descriptor |
Upstream rewriter shape: emit nvvm.tensormap.cp.async.shared followed by a sequence of nvvm.tensormap.replace.* ops.
nvgpu.tensormap.update.global_address / box_dim / element_stride (absent in this binary)
These per-field update wrappers are also not interned in this build. Field-level descriptor updates lower directly through the nvvm.tensormap.replace.* family.
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | descriptor | !nvgpu.tensormap.descriptor in shared | descriptor being edited |
| operand 1 | value | i64 or i32 | new field value |
| attribute | ord | i32 | rank index for box_dim / element_stride |
Upstream rewriter shape: each maps to the matching nvvm.tensormap.replace.* op against the SMEM-resident descriptor.
nvgpu.warpgroup.descriptor (also spelled warpgroup.generate.descriptor)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensor | memref<...> in addr-space 3 | SMEM tile origin |
| attribute | layout | enum (row / col) | matrix layout |
| attribute | swizzle | enum (none / 32B / 64B / 128B) | SMEM swizzle pattern |
| result 0 | descriptor | !nvgpu.warpgroup.descriptor | 64-bit SMEM descriptor |
Rewriter packs the descriptor bits inline. The result is an i64 LLVM value built by an shl/or chain; the bit layout (start_addr[14] | lbo[16] | sbo[16] | base_offset[3] | reserved[3] | swizzle_mode[2] | pad[10]) is documented on the cute_nvgpu MMA atoms page.
nvgpu.warpgroup.mma
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | descriptorA | !nvgpu.warpgroup.descriptor | SMEM descriptor for A |
| operand 1 | descriptorB | !nvgpu.warpgroup.descriptor | SMEM descriptor for B |
| operand 2 | matrixC | !nvgpu.warpgroup.accumulator | input accumulator tile |
| attribute | transposeA | optional UnitAttr | wired into the WGMMA layout enum |
| attribute | transposeB | optional UnitAttr | wired into the WGMMA layout enum |
| attribute | waitGroup | optional i32 | controls the wait-group depth |
| result 0 | matrixD | !nvgpu.warpgroup.accumulator | output accumulator tile |
Rewriter expands to the canonical four-op WGMMA sequence: nvvm.wgmma.fence.aligned, one nvvm.wgmma.mma_async per accumulator tile, nvvm.wgmma.commit.group.sync.aligned, then nvvm.wgmma.wait.group.sync.aligned waitGroup. See WGMMA Emission Protocol — The Four-Op Sequence for the timing rules and accumulator lifetime. The rewriter validates the GMMA layout up front and reports failures with the canonical "Not a canonical GMMA_MN Layout" wording lifted from CUTLASS's gmma.hpp.
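The shape of that expansion can be sketched as a list of op names. `wgmma_sequence` is a hypothetical helper that records only ordering and count; operands, layouts, and types are omitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the op names in emission order for num_tiles accumulator tiles.
std::vector<std::string> wgmma_sequence(unsigned num_tiles,
                                        unsigned wait_group) {
    std::vector<std::string> ops;
    ops.push_back("nvvm.wgmma.fence.aligned");
    // One asynchronous MMA is issued per accumulator tile.
    for (unsigned i = 0; i < num_tiles; ++i) {
        ops.push_back("nvvm.wgmma.mma_async");
    }
    ops.push_back("nvvm.wgmma.commit.group.sync.aligned");
    // The waitGroup attribute becomes the wait immediate.
    ops.push_back("nvvm.wgmma.wait.group.sync.aligned " +
                  std::to_string(wait_group));
    return ops;
}
```

For two accumulator tiles the list has five entries: the fence, two mma_async ops, the commit, and the wait.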
nvgpu.warpgroup.mma.store
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | matrixD | !nvgpu.warpgroup.accumulator | accumulator to drain |
| operand 1 | dst | memref<...> in addr-space 3 | SMEM destination tile |
Rewriter decomposes the accumulator into per-thread llvm.store operations against the destination memref. No nvvm.* op is emitted.
nvgpu.warpgroup.mma.init.accumulator
| Position | Name | Type | Notes |
|---|---|---|---|
| result 0 | accumulator | !nvgpu.warpgroup.accumulator | zero-valued accumulator |
Rewriter emits llvm.mlir.zero (or llvm.mlir.undef followed by per-field zero stores) producing the accumulator aggregate.
nvgpu.mma.sync
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | matrixA | vector<...> register fragment | A operand fragment |
| operand 1 | matrixB | vector<...> register fragment | B operand fragment |
| operand 2 | matrixC | vector<...> register fragment | accumulator fragment |
| attribute | mmaShape | ArrayAttr<i64> of length 3 | [m, n, k] |
| attribute | tf32Enabled | optional UnitAttr | enables tf32 element-type lowering |
| result 0 | matrixD | vector<...> register fragment | D = A * B + C |
Rewriter emits nvvm.wmma.mma.sync.aligned (Ampere/Ada) or routes through nvvm.wgmma.mma_async (Hopper) based on the active SM.
nvgpu.mma.sp.sync
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | matrixA | vector<...> register fragment | structurally sparse A operand |
| operand 1 | matrixB | vector<...> register fragment | dense B operand |
| operand 2 | matrixC | vector<...> register fragment | accumulator fragment |
| operand 3 | sparseMetadata | vector<2xi16> | sparse selector word |
| operand 4 | sparsitySelector | i32 | 0 or 1 — selects which packed pair |
| attribute | mmaShape | ArrayAttr<i64> of length 3 | [m, n, k] |
| result 0 | matrixD | vector<...> register fragment | sparse MMA result |
Rewriter emits llvm.inline_asm with the mma.sp.sync.aligned.m... template; upstream NVVM exposes no sparse-MMA op in the snapshot tileiras tracks.
nvgpu.ldmatrix
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | src | memref<...> in addr-space 3 | SMEM tile origin |
| operand 1..N | indices | variadic index | rank-matched indices |
| attribute | numTiles | i32 | 1, 2, or 4 |
| attribute | transpose | UnitAttr (optional) | selects .trans form |
| result 0 | res | vector<NxNxi32> | repacked register fragment |
Rewriter emits nvvm.ldmatrix.sync.aligned returning an llvm.struct<(i32, i32, ...)>, then a pack-struct-into-vector repack to match the result type.
nvgpu.stmatrix (absent in this binary)
There is no nvgpu.stmatrix mnemonic in this tileiras build's string table. The stmatrix store path is reached from the upstream MLIR vector / nvvm populators directly into nvvm.stmatrix. The operand layout below mirrors the upstream wrapper.
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | memref<...> in addr-space 3 | SMEM destination |
| operand 1..N | indices | variadic index | rank-matched indices |
| operand N+1 | src | vector<NxNxi32> | per-thread fragment |
| attribute | transpose | UnitAttr (optional) | selects .trans form |
Upstream rewriter shape: emit nvvm.stmatrix.sync.aligned on sm_90+ targets, or llvm.inline_asm with the matching stmatrix... template otherwise.
gpu.warp_execute_on_lane_0 (routed through upstream gpu dialect)
There is no nvgpu.warp.execute_on_lane_0 mnemonic; the corresponding op is the upstream gpu dialect's gpu.warp_execute_on_lane_0, which convert-nvgpu-to-nvvm rewrites in passing.
| Position | Name | Type | Notes |
|---|---|---|---|
| region | body | one block | runs on lane 0 only |
| result 0..N | results | any LLVM-typed values | shuffled to every lane after the region |
Rewriter emits a region predicate against lane == 0, runs the body, and broadcasts each result with nvvm.shfl.sync (idx, 0, 0xffffffff).
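The predicate-then-broadcast shape can be modelled on the host. The sketch below assumes a 32-lane warp and represents each lane's result slot as an array element; `warp_execute_on_lane_0` here is a hypothetical C++ stand-in for the lowered code, not the op itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

constexpr unsigned kWarpSize = 32;

// Host-side model: only lane 0 runs the body, then its result is copied
// to every lane, mirroring a shfl.sync idx broadcast from source lane 0.
std::array<std::uint32_t, kWarpSize>
warp_execute_on_lane_0(const std::function<std::uint32_t()>& body) {
    std::array<std::uint32_t, kWarpSize> lanes{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane) {
        if (lane == 0) {
            lanes[lane] = body();  // the region predicate: lane == 0
        }
    }
    const std::uint32_t from_lane0 = lanes[0];
    for (auto& value : lanes) {
        value = from_lane0;        // broadcast with a full-warp member mask
    }
    return lanes;
}
```

Calling it with a body that returns 7 leaves every one of the 32 slots holding 7, even though the body ran once.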
Packed conversion and arithmetic helpers
| Op | Operands | Result | NVVM emission |
|---|---|---|---|
| nvgpu.rcp | f32 | f32 | nvvm.rcp.approx.ftz.f or libdevice call |
| nvgpu.cvt_fpext | packed i32 of FP4/FP8 | vector<2xf16> / vector<2xf32> | nvvm.cvt.packfloat.f32 family |
| nvgpu.cvt_fptrunc | vector<2xf16> / vector<2xf32> | packed i32 | nvvm.cvt.packfloat.f32 family |
| nvgpu.fma.packed.f32x2 | three vector<2xf32> | vector<2xf32> | nvvm.fma.rn.f32x2 |
| nvgpu.mul.packed.f32x2 | two vector<2xf32> | vector<2xf32> | nvvm.mul.f32x2 |
Each packed op carries a rnd enum (rn, rz, rm, rp) and, where applicable, a packed_kind enum that selects between MXFP / NVFP packing modes.
Lowering-Target Table
What each rewriter emits. The middle column gives the concrete NVVM op (or the expanded form when the pattern bypasses NVVM on purpose); the right column is what the NVPTX backend ultimately prints, not anything nvgpu itself emits.
| nvgpu op | NVVM op (or expansion) | Final PTX (after NVVM lowering) |
|---|---|---|
| nvgpu.device_async_copy | nvvm.cp.async.shared.global | cp.async.{ca,cg}.shared.global [%dst], [%src], N; |
| nvgpu.device_async_create_group | nvvm.cp.async.commit.group | cp.async.commit_group; |
| nvgpu.device_async_wait | nvvm.cp.async.wait.group | cp.async.wait_group N; |
| nvgpu.mbarrier.create | memref.global "private" + memref.get_global | (no PTX; allocates SMEM slot) |
| nvgpu.mbarrier.init.shared | nvvm.mbarrier.init.shared | mbarrier.init.shared.b64 [%mbar], %count; |
| nvgpu.mbarrier.arrive | nvvm.mbarrier.arrive[.shared] | mbarrier.arrive.shared.b64 %tok, [%mbar]; |
| nvgpu.mbarrier.arrive.expect_tx | nvvm.mbarrier.arrive.expect_tx[.shared] | mbarrier.arrive.expect_tx.shared.b64 %tok, [%mbar], %tx; |
| nvgpu.mbarrier.try_wait.parity | nvvm.mbarrier.try_wait.parity.shared | mbarrier.try_wait.parity.shared.b64 %p, [%mbar], %ph, %ns; |
| nvgpu.mbarrier.inval (absent in this binary) | nvvm.mbarrier.inval[.shared] | mbarrier.inval.shared.b64 [%mbar]; |
| nvgpu.tma.async.load | nvvm.cp.async.bulk.tensor.shared.global | cp.async.bulk.tensor.{1..5}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes [%dst], [%tmap, {%c0,..}], [%mbar]; |
| nvgpu.tma.async.store | nvvm.cp.async.bulk.tensor.global.shared | cp.async.bulk.tensor.{1..5}d.global.shared::cta.tile [%tmap, {%c0,..}], [%src]; |
| nvgpu.tma.async.reduce (absent in this binary) | nvvm.cp.async.bulk.tensor.reduce | cp.async.bulk.tensor.{1..5}d.global.shared::cta.tile.{redop}.bulk_group [%tmap, {%c0,..}], [%src]; |
| nvgpu.tma.prefetch.descriptor | nvvm.prefetch.tensormap | prefetch.tensormap [%tmap]; |
| nvgpu.tma.fence.descriptor | nvvm.fence.proxy.acquire.sync.cluster | fence.proxy.async.shared::cluster; |
| nvgpu.tma.create.descriptor | llvm.alloca + GEP/store sequence + llvm.call @cuTensorMapEncodeTiled | (no PTX; host-side encode of a 128-byte CUtensorMap) |
| nvgpu.tensormap.create.descriptor (absent in this binary) | nvvm.tensormap.cp.async.shared + tensormap.replace.* | tensormap.cp.async.shared::cta.bulk_group [%dst], [%src]; then tensormap.replace.tile.{global_address,box_dim,elem_stride,...}.[%tmap], ...; |
| gpu.warp_execute_on_lane_0 (upstream gpu dialect; no nvgpu.warp.execute_on_lane_0 mnemonic) | nvvm.shfl.sync + conditional region | shfl.sync.idx.b32 %r, %v, 0, 0x1f, 0xffffffff; |
| nvgpu.warpgroup.descriptor | integer shl/or chain, no NVVM op | (no PTX; the 64-bit SMEM descriptor is built by ordinary integer ops; the PTX side sees the materialised b64 value) |
| nvgpu.warpgroup.mma | nvvm.wgmma.fence.aligned → N× nvvm.wgmma.mma_async → nvvm.wgmma.commit.group.sync.aligned → nvvm.wgmma.wait.group.sync.aligned | wgmma.fence.sync.aligned; then wgmma.mma_async.sync.aligned.m64nXkY.f32.{f16,bf16,e4m3,e5m2}.{f16,bf16,e4m3,e5m2} {...}, %da, %db, p, 1, 1, %la, %lb; then wgmma.commit_group.sync.aligned; then wgmma.wait_group.sync.aligned N; |
| nvgpu.warpgroup.mma.store | per-thread llvm.store decomposition | st.shared.b32 [%dst+off], %r; per fragment lane |
| nvgpu.mma.sync | nvvm.wmma.mma.sync.aligned (sm_70..sm_89) or nvvm.wgmma.mma_async (sm_90+) | mma.sync.aligned.m16n8kK.{row,col}.{row,col}.{...} {...}, %a, %b, %c; |
| nvgpu.mma.sp.sync | llvm.inline_asm with mma.sp.sync.aligned.m... template | mma.sp.sync.aligned.m16n8k{16,32}.row.col.{f16,bf16,...} {...}, %a, %b, %c, %meta, 0x0; |
| nvgpu.ldmatrix | nvvm.ldmatrix.sync.aligned + repack | ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.b16 {...}, [%addr]; |
| nvgpu.stmatrix (absent in this binary; upstream wrapper shape) | nvvm.stmatrix.sync.aligned or inline asm | stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.b16 [%addr], {...}; |
| nvgpu.cvt_fpext / nvgpu.cvt_fptrunc | nvvm.cvt.packfloat.f32.* | cvt.{rn,rz,...}.{f16,bf16,e4m3,e5m2}.f32 %r, %f; (per lane) |
| nvgpu.fma.packed.f32x2 | nvvm.fma.rn.f32x2 | fma.rn.f32x2 %r, %a, %b, %c; |
The sparse-MMA path reaches PTX through llvm.inline_asm because the snapshot's upstream NVVM does not yet expose a sparse-MMA op. The template, constraint string, and result type live in the pattern body and drop verbatim into the LLVM module — see Inline-PTX templates and constraint strings on the NVVM overview for the constraint-string form.
Per-Arch Availability
convert-nvgpu-to-nvvm runs unconditionally on every target — the gates live inside the patterns and in NVVM verification, not in pass scheduling. The first column gives the lowest SM that accepts each pattern, the second the form it emits at that floor, the third the lowest PTX ISA version that defines the resulting instruction.
| nvgpu op | SM floor | Emits at floor | ptx_min |
|---|---|---|---|
| nvgpu.device_async_copy | sm_80 | cp.async.{ca,cg}.shared.global | 7.0 |
| nvgpu.device_async_create_group | sm_80 | cp.async.commit_group | 7.0 |
| nvgpu.device_async_wait | sm_80 | cp.async.wait_group | 7.0 |
| nvgpu.mbarrier.{create,init,arrive,try_wait.parity,inval} | sm_80 | shared-memory mbarrier | 7.0 (base set on 7.0; cluster-aware forms 7.8) |
| nvgpu.mbarrier.arrive.expect_tx | sm_90 | mbarrier.arrive.expect_tx.shared.b64 | 7.8 |
| nvgpu.tma.async.{load,store} | sm_90 | cp.async.bulk.tensor.{Nd,shared,global} | 8.0 |
| nvgpu.tma.async.reduce (absent in this binary) | sm_90 | cp.async.bulk.tensor.reduce | 8.0 |
| nvgpu.tma.prefetch.descriptor | sm_90 | prefetch.tensormap | 8.0 |
| nvgpu.tma.fence.descriptor | sm_90 | fence.proxy.async.shared::cluster | 8.0 |
| nvgpu.tma.create.descriptor | sm_90 | runtime call to cuTensorMapEncodeTiled | (host) |
| nvgpu.tensormap.create.descriptor (absent in this binary) | sm_90 | tensormap.cp.async.shared + tensormap.replace.* | 8.3 |
| nvgpu.tensormap.update.* (absent in this binary) | sm_90 | tensormap.replace.* | 8.3 |
| nvgpu.warpgroup.mma | sm_90a | wgmma.mma_async.sync.aligned.m64nXkY.* | 8.0 |
| nvgpu.warpgroup.mma.store | sm_90a | per-thread st.shared.* | 8.0 |
| nvgpu.warpgroup.mma.init.accumulator | sm_90a | llvm.mlir.zero (no PTX) | 8.0 |
| nvgpu.warpgroup.descriptor | sm_90a | (no PTX; SMEM descriptor synthesis) | n/a |
| nvgpu.mma.sync (Ampere/Ada path) | sm_80 | mma.sync.aligned.m16n8k{16,32}.* | 7.0 |
| nvgpu.mma.sync (Hopper path) | sm_90 | redirects through nvvm.wgmma.mma_async.* | 8.0 |
| nvgpu.mma.sp.sync | sm_80 | inline mma.sp.sync.aligned.m16n8k{16,32}.* | 7.1 |
| nvgpu.ldmatrix | sm_75 | ldmatrix.sync.aligned.m8n8.x{1,2,4} | 6.5 |
| nvgpu.stmatrix (absent in this binary) | sm_90 | stmatrix.sync.aligned.m8n8.x{1,2,4} | 8.0 |
| gpu.warp_execute_on_lane_0 (no nvgpu.warp.execute_on_lane_0 mnemonic) | sm_70 | shfl.sync.idx.b32 + region predicate | 6.0 |
| nvgpu.cvt_fpext / nvgpu.cvt_fptrunc (FP4 / FP8) | sm_89 (FP8) / sm_100a (FP4) | cvt.{rn,rz,...}.{e4m3,e5m2}.f32 | 7.8 / 8.6 |
| nvgpu.fma.packed.f32x2 / nvgpu.mul.packed.f32x2 | sm_100a | fma.rn.f32x2 / mul.rn.f32x2 | 8.6 |
| nvgpu.rcp | sm_70 | rcp.approx.ftz.f32 | 6.0 |
sm_90a is the architecture-qualified variant that wgmma requires; plain sm_90 rejects the warpgroup ops at NVVM verification, while the TMA ops are accepted from plain sm_90, as the table shows. The dialect has no sm_100 op of its own: the Blackwell tcgen05 surface lives entirely in nvvm, accessed through cute_nvgpu atoms that lower past nvgpu. See Per-SM Emission Templates for the per-tier capability matrix.
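A minimal sketch of how the "SM floor" column could be checked against a target name. It is an assumption-laden illustration: `SmTarget`, `parse_sm`, and `meets_floor` are hypothetical helpers, only a trailing `a` qualifier is modelled (not the `f` suffix), and ceilings are ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct SmTarget {
  int number = 0;               // 90 for sm_90 and sm_90a
  bool arch_qualified = false;  // true for the trailing 'a'
};

// Parses "sm_<digits>[a]"; returns false for anything else.
bool parse_sm(const std::string &name, SmTarget &out) {
  if (name.compare(0, 3, "sm_") != 0)
    return false;
  std::size_t i = 3;
  int number = 0;
  while (i < name.size() && std::isdigit(static_cast<unsigned char>(name[i])))
    number = number * 10 + (name[i++] - '0');
  if (i == 3)
    return false;  // no digits after "sm_"
  bool qualified = false;
  if (i < name.size()) {
    if (name[i] != 'a' || i + 1 != name.size())
      return false;  // only a single trailing 'a' is modelled here
    qualified = true;
  }
  out.number = number;
  out.arch_qualified = qualified;
  return true;
}

// True when `target` is at or above `floor` and carries the architecture
// qualifier whenever the floor demands one.
bool meets_floor(const std::string &floor, const std::string &target) {
  SmTarget f, t;
  if (!parse_sm(floor, f) || !parse_sm(target, t))
    return false;
  if (f.arch_qualified && !t.arch_qualified)
    return false;
  return t.number >= f.number;
}
```

With this model `meets_floor("sm_90a", "sm_90")` is false while `meets_floor("sm_90", "sm_90a")` is true, matching the split between the warpgroup rows and the TMA rows.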
Pattern-Set Construction
populateNVGPUToNVVMConversionPatterns is a flat populator: one OpConversionPattern per nvgpu.* op, each registered with benefit = 1. The patterns are stateless — they read operands and attributes through their OpAdaptor, emit a fixed sequence of nvvm.* (or llvm.* / memref.*) ops, and replace the root.
Tileiras consumes this populator unchanged from upstream MLIR. Reimplementations should match the same one-pattern-per-op shape; the rewriter's branch on source memory space is the only piece of policy the patterns carry. See Lowering: nvgpu / gpu to NVVM — Pattern Shapes for the rewrite primitives the patterns share.
Lowering Contract
The conversion never reinfers layout intent. By the time IR reaches nvgpu, descriptor shape, memory space, vector shape, MMA tile shape, sparse metadata, and barrier identity already live in operands and attributes. Pattern bodies stay small as a result.
The mbarrier family branches on memory space and emits one nvvm.mbarrier.*[.shared] intrinsic per op. See mbarrier State Machine for the slot transitions and NVVM mbarrier Ops for the per-op intrinsic mapping. TMA load and store each emit a single nvvm.cp.async.bulk.tensor.* intrinsic, threading the variadic coordinates, multicast mask, and L2 cache hint through unchanged. The largest pattern is nvgpu.warpgroup.mma: it emits the four-stage Hopper WGMMA sequence — fence, async MMA, commit, wait — and validates GMMA layout up front with the canonical "Not a canonical GMMA_MN Layout" wording lifted from CUTLASS's gmma.hpp.
A handful of patterns emit no nvvm.* op at all. nvgpu.mbarrier.create emits a memref.global with "private" visibility plus a memref.get_global, allocating the __mbarrier slot in shared memory. nvgpu.tma.create.descriptor emits an llvm.alloca for a 128-byte CUtensorMap, fills it via llvm.getelementptr+llvm.store sequences, then calls the CUDA driver's cuTensorMapEncodeTiled. nvgpu.warpgroup.descriptor is a pure shl/or chain over the WGMMA descriptor bitfield. nvgpu.mma.sp.sync emits an llvm.inline_asm with the verbatim "mma.sp.sync.aligned.m..." PTX template; at the snapshot revision tileiras tracks, upstream NVVM has no sparse-MMA op yet, and inline-asm is the upstream design.
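The shl/or chain for the WGMMA descriptor can be sketched as plain integer code. The field positions used here are an assumption based on the publicly documented Hopper shared-memory matrix descriptor (14-bit fields at bits 0, 16, and 32 holding byte quantities in 16-byte units, and a 2-bit swizzle mode at bit 62); they were not read out of tileiras, and `pack_wgmma_descriptor` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMED encoding: byte quantities are stored in 16-byte units, 14 bits wide.
inline std::uint64_t encode14(std::uint64_t byte_value) {
  return (byte_value >> 4) & 0x3FFFu;
}

// Builds the 64-bit descriptor with nothing but shifts and ors, mirroring the
// "pure shl/or chain" described above. Bit positions are assumptions.
inline std::uint64_t pack_wgmma_descriptor(std::uint64_t smem_addr,
                                           std::uint64_t leading_byte_offset,
                                           std::uint64_t stride_byte_offset,
                                           std::uint64_t swizzle_mode) {
  std::uint64_t desc = 0;
  desc |= encode14(smem_addr);                  // bits [0, 14)
  desc |= encode14(leading_byte_offset) << 16;  // bits [16, 30)
  desc |= encode14(stride_byte_offset) << 32;   // bits [32, 46)
  desc |= (swizzle_mode & 0x3u) << 62;          // bits [62, 64)
  return desc;
}
```

Because the result is an ordinary i64, nothing about it needs an NVVM op; the PTX side only sees the materialised value.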
Verification Invariants
The interesting nvgpu verifier checks are semantic, not lexical. TMA ops demand valid descriptor types, compatible source or destination memrefs, supported tensor-map ranks, and a legal shared-memory layout. WGMMA demands rank-2 matrix fragments, compatible M/N/K, a supported tile shape, matching accumulator and result types, and legal transpose flags. MMA and sparse MMA add element-type checks, sparse-selector bounds, and a guard that tf32 only pairs with valid floating-point operands. Device async copy requires matching element types, unit-stride minor dimensions, supported transfer sizes, and correct alignment when L1 bypass is requested.
The boundary matters because NVVM conversion assumes the op is already legal for the selected target. Invalid shapes slipping through here resurface later as much less useful intrinsic-selection or backend diagnostics.
Reimplementation Checklist
A practical reimplementation needs the operation families above, typed descriptor and barrier values, shape-aware verifiers, and a deterministic conversion table to NVVM. Keep the layer transient. Independent scheduling, high-level layout algebra, and CUDA Tile semantics all belong above nvgpu. The dialect's job is to normalise hardware operations, verify their low-level shape contracts, and hand them to NVVM with as little policy as possible.
The minimum useful surface: tensor-map descriptor creation and async TMA load/store/reduce; shared-memory barrier groups and barrier tokens with expect_tx and try_wait.parity; WGMMA accumulator init / mma / store; the WGMMA descriptor packer; MMA ops with explicit shape attributes; sparse MMA metadata and selector validation; ldmatrix and stmatrix; SM80 cp.async device-async copy; packed conversion and arithmetic helpers; a complete nvgpu-to-nvvm conversion table; and target-aware verification before conversion.
NVVM Dialect Overview
Provenance vs Upstream MLIR
nvvm is the upstream MLIR dialect, linked unchanged from the LLVM/MLIR snapshot tileiras tracks (the same dialect described in mlir/Dialect/LLVMIR/NVVMOps.td). Tileiras adds no nvvm.* op of its own — every op listed below comes from upstream. What the binary does override is usage: the inline-PTX templates, sparse-MMA path, and a few tcgen05 lowerings emit forms not yet exposed as upstream NVVM intrinsics in this snapshot. Those gaps are called out per family below.
Abstract
Every nvvm.* op exists to print one PTX instruction (or one inline-asm template). nvvm is the bottom MLIR dialect in TileIR's lowering stack — a typed intrinsic layer, not a programming model. Earlier dialects decide tiling, scheduling, pipeline stages, layouts, and target atoms; nvvm preserves those decisions in a form LLVM and the NVPTX backend understand.
Three lowering paths cover the whole dialect. Most ops become a call @llvm.nvvm.X intrinsic that the NVPTX backend prints as the matching PTX instruction. A smaller set lowers to llvm.inline_asm with a fixed PTX template — sparse MMA, a handful of TMA replace variants, a few cluster ops. The third path expands into ordinary llvm dialect ops (alloca, GEP, store, call). No nvvm.* op survives NVVM-to-LLVM conversion.
Position in the Cascade
nvgpu
|
| convert GPU operations to NVVM operations and LLVM helper IR
v
nvvm
|
| convert NVVM operations to LLVM intrinsics or inline assembly
v
llvm
|
| optimize, verify, select instructions, print PTX
v
PTX
nvgpu is the last MLIR layer that still looks like a GPU dialect. nvvm looks more like LLVM IR: pointer types, vector types, memory-order attributes, target attributes, and intrinsic operand shapes have to be explicit by the time IR arrives. Most verifier failures here are best read as "the previous lowering didn't finish specifying the target operation." See Lowering: nvgpu / gpu to NVVM for the per-op rewrite contract.
Per-Family Pages
Rather than a single op count, the dialect's footprint is best stated as a per-category breakdown:
- 128 ops (one per mlir::NVVM::*Op class, as measured by distinct mlir::NVVM::detail::*OpGenericAdaptorBase template instantiations in the string pool; each TableGen-generated Op class instantiates exactly one OpGenericAdaptorBase to project its operand and attribute layout, so this count is the authoritative Op-class count);
- 64 attrs (one per mlir::NVVM::*Attr class: AtomicOpKindAttr, CacheEvictionPriorityAttr, ConvertFP{4,6,8}TypeAttr, CpAsyncBulkTensorLoadModeAttr, FPRoundingModeAttr, LoadCacheModifierKindAttr, MBarrierScopeKindAttr, MMAB1OpAttr, MMALayoutAttr, MMAShapeAttr, MMATypesAttr, MemScopeKindAttr, Tcgen05*Attr × 9, WGMMA*Attr × 3, and the rest; these supply the enum-attribute mnemonics most ops carry);
- 0 user-visible Type classes (the seven mlir::NVVM::*Type strings in the binary, namely ConvertFP{4,6,8}Type, DotAccumulateType, MMAType, ReductionType, and WGMMAType, name enum-attribute kinds, not MLIR Type classes; nvvm does not introduce new MLIR Type subclasses);
- ~70 enum-mnemonic strings (the enum-attribute kind names the parser accepts and the printer emits, e.g. f16, bf16, e4m3, e5m2, row, col, cta, cluster, acquire, release; these are the values inside the Attr classes above, not separate ops);
- 19 module- and function-metadata keys (the single-segment "nvvm.X" strings the NVPTX backend reads as LLVM module/function metadata after MLIR-to-LLVM translation: nvvm.kernel, nvvm.target, nvvm.annotations, nvvm.annotations_transplanted, nvvm.reqntid, nvvm.maxntid, nvvm.minctasm, nvvm.maxnreg, nvvm.maxclusterrank, nvvm.cluster_dim, nvvm.cluster_max_blocks, nvvm.blocksareclusters, nvvm.grid_constant, nvvm.hidden, nvvm.reflection, nvvm.restrict_keyword, nvvm.restrict_processed, nvvm.restrict_scope, nvvm.exit).
The PTX form count is several times larger than the Op-class count because most ops carry attribute-driven cross-products: shape × layout × element type for WMMA, rank × mode for TMA, kind × cta_group × collector for tcgen05, generic vs .shared splits for mbarrier, and so on. They split cleanly into eight large families plus a long tail of small ones. The bulk of each family is documented on its own page; this overview lists the families, their roster sizes, the SM floor, and one example op so the cross-link table doubles as an index. (WMMA is three ops in MLIR; the PTX shape × layout × element-type cross-product is reached through attributes on nvvm.wmma.{load,store,mma} rather than per-combination ops.)
Diagnosing the prior oscillation. Earlier waves had quoted three different NVVM op counts: 213/218, 124, and 86. None were wrong about what they measured; each measured a different thing. The 213-vs-218 pair counts every op-name-shaped string in the binary, including the LLVM intrinsic names (llvm.nvvm.barrier0, llvm.nvvm.cp.async.bulk.tensor.*, the full llvm.nvvm.tcgen05.* family) that NVVM-to-LLVM lowering emits — these aren't nvvm.* ops, they're the lowered IR that follows. The 1704-byte / 8-stride TypeID slab 0x5B8D610..0x5B8DCB8 documented in Op Mnemonic Master Table — §8 NVVM.* reaches 213 because the slab includes one entry per RegisteredOperationName::insert call, and a handful of mnemonics (the nvvm.read.ptx.sreg.envregN series for N=0..31, the per-axis splits of cluster registers) register separate slots even though they share an Op class. The 124 count was Op-classes-minus-an-internal-subset; the 86 count was a broken single-xref heuristic that walked one cross-reference table and missed the families reached through TypeID dispatch. The 128 above is the only number that survives all three cross-checks: mlir::NVVM::*OpGenericAdaptorBase strings (template-side), distinct mlir::NVVM::*Op class names (RTTI-side), and unique Op classes registered through sub_4461CA0 from the NVVM dialect constructor.
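The 213 figure follows directly from the slab bounds and stride quoted above. As a small compile-time check (the constant and function names are illustrative, the numbers are the ones in the paragraph):

```cpp
#include <cassert>
#include <cstdint>

// Bounds and stride exactly as quoted in the paragraph above.
constexpr std::uint64_t kSlabBegin = 0x5B8D610;
constexpr std::uint64_t kSlabEnd = 0x5B8DCB8;
constexpr std::uint64_t kEntryStride = 8;

constexpr std::uint64_t slab_bytes() { return kSlabEnd - kSlabBegin; }
constexpr std::uint64_t slab_entries() { return slab_bytes() / kEntryStride; }

static_assert(slab_bytes() == 1704, "slab size should match the quoted 1704 bytes");
static_assert(slab_entries() == 213, "one 8-byte entry per registered mnemonic");
```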
| Family | Count | SM floor | Example op | Page |
|---|---|---|---|---|
| WMMA: warp-synchronous register MMA | 3 | sm_70 | nvvm.wmma.mma | WMMA Ops |
| WGMMA: warp-group async MMA (Hopper) | 4 | sm_90a | nvvm.wgmma.mma_async | WGMMA Ops |
| TMA: bulk tensor copy, prefetch, reduce | 9 dialect ops (rank 1..5 and mode in attributes) | sm_90 | nvvm.cp.async.bulk.tensor.shared.cluster.global | TMA Ops |
| tcgen05: Blackwell tensor memory + MMA | 15 dialect ops (kind / cta_group / collector / layout / sparsity / block-scale in attributes; the cross-product reaches several thousand PTX forms) | sm_100a | nvvm.tcgen05.mma.block_scale | tcgen05 Ops |
| mbarrier: shared-memory barrier state machine | 12 dialect ops (generic vs .shared address-space split adds the second variant on most ops) | sm_80 | nvvm.mbarrier.arrive.expect_tx.shared | mbarrier Ops |
| Cluster: thread-block cluster sync | 9 | sm_90 | nvvm.cluster.wait, nvvm.mapa | Cluster Ops |
| Synchronisation: barrier0, barrier.cta.sync, bar.warp.sync, barrier.{arrive,sync} helpers | 8 | sm_70 | nvvm.barrier.cta.sync | (this page) |
| cp.async (Ampere SM80 async-copy queue) | 6 dialect ops (vector width {4,8,16} and .ca/.cg cache modifier are attributes on nvvm.cp.async.shared.global) | sm_80 | nvvm.cp.async.shared.global | (this page) |
| Special registers: tid, ctaid, ntid, etc. | 7 | sm_70 | nvvm.read.ptx.sreg.tid.x | (this page) |
| shfl / vote / elect.sync | 5 | sm_70 | nvvm.shfl.sync | (this page) |
| Other (mapa, fences, ldmatrix/stmatrix, redux, prefetch) | 8 | varies | nvvm.ldmatrix | (this page) |
The family page is the normative spec: it pins each op to its operand list, LLVM intrinsic, PTX template, constraint string for inline-asm variants, and SM floor. The roster table below covers the smaller families that don't justify their own page.
Roster — Small Families
Synchronisation
| Op | LLVM intrinsic | PTX printed |
|---|---|---|
| nvvm.barrier0 | llvm.nvvm.barrier0 | bar.sync 0; |
| nvvm.bar.warp.sync | llvm.nvvm.bar.warp.sync | bar.warp.sync %m; |
| nvvm.barrier | llvm.nvvm.barrier | barrier.cta.sync.aligned %b, %n; |
| nvvm.barrier.cta.sync | llvm.nvvm.barrier.cta.sync | barrier.cta.sync %b, %n; |
| nvvm.barrier.cta.arrive | llvm.nvvm.barrier.cta.arrive | barrier.cta.arrive %b, %n; |
| nvvm.barrier.cta.red | llvm.nvvm.barrier.cta.red | barrier.cta.red.{op} %p, %b, %n, %src; |
| nvvm.barrier.arrive | llvm.nvvm.barrier.arrive | bar.arrive %b, %n; |
nvvm.elect.sync | llvm.nvvm.elect.sync | `elect.sync %p |
Special-register reads
| Op | LLVM intrinsic | PTX printed |
|---|---|---|
| nvvm.read.ptx.sreg.tid.x (.y, .z) | llvm.nvvm.read.ptx.sreg.tid.{x,y,z} | mov.u32 %r, %tid.{x,y,z}; |
| nvvm.read.ptx.sreg.ntid.x (.y, .z) | llvm.nvvm.read.ptx.sreg.ntid.{x,y,z} | mov.u32 %r, %ntid.{x,y,z}; |
| nvvm.read.ptx.sreg.ctaid.x (.y, .z) | llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z} | mov.u32 %r, %ctaid.{x,y,z}; |
| nvvm.read.ptx.sreg.nctaid.x (.y, .z) | llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z} | mov.u32 %r, %nctaid.{x,y,z}; |
| nvvm.read.ptx.sreg.warpid | llvm.nvvm.read.ptx.sreg.warpid | mov.u32 %r, %warpid; |
| nvvm.read.ptx.sreg.laneid | llvm.nvvm.read.ptx.sreg.laneid | mov.u32 %r, %laneid; |
| nvvm.read.ptx.sreg.smid | llvm.nvvm.read.ptx.sreg.smid | mov.u32 %r, %smid; |
cp.async (Ampere)
| Op | LLVM intrinsic | PTX printed |
|---|---|---|
| nvvm.cp.async.shared.global | llvm.nvvm.cp.async.{ca,cg}.shared.global.{4,8,16} | cp.async.{ca,cg}.shared.global [%dst], [%src], N; |
| nvvm.cp.async.commit.group | llvm.nvvm.cp.async.commit.group | cp.async.commit_group; |
| nvvm.cp.async.wait.group | llvm.nvvm.cp.async.wait.group | cp.async.wait_group N; |
| nvvm.cp.async.bulk.wait_group | llvm.nvvm.cp.async.bulk.wait_group | cp.async.bulk.wait_group N; |
| nvvm.cp.async.mbarrier.arrive[.shared] | llvm.nvvm.cp.async.mbarrier.arrive[.shared] | cp.async.mbarrier.arrive[.shared].b64 [%mbar]; |
| nvvm.cp.async.mbarrier.arrive.noinc[.shared] | llvm.nvvm.cp.async.mbarrier.arrive.noinc[.shared] | cp.async.mbarrier.arrive.noinc[.shared].b64 [%mbar]; |
shfl / vote
| Op | LLVM intrinsic | PTX printed |
|---|---|---|
| nvvm.shfl.sync | llvm.nvvm.shfl.sync.{idx,up,down,bfly}.{i32,f32} | shfl.sync.{idx,up,down,bfly}.b32 %r, %v, %lane, %m, %mask; |
| nvvm.vote.sync (kind ballot) | llvm.nvvm.vote.ballot.sync | vote.sync.ballot.b32 %r, %p, %mask; |
| nvvm.vote.sync (kinds all/any/uni, selected by nvvm.vote_sync_kind) | llvm.nvvm.vote.{all,any,uni}.sync | vote.sync.{all,any,uni}.pred %p, %src, %mask; |
| nvvm.match.sync | llvm.nvvm.match.{any,all}.sync.{i32,i64} | match.{any,all}.sync.b{32,64} %r, %v, %mask; |
| nvvm.redux.sync | llvm.nvvm.redux.sync.{op}.{type} | redux.sync.{op}.{type} %r, %v, %mask; |
ldmatrix / stmatrix and miscellaneous
| Op | LLVM intrinsic | PTX printed |
|---|---|---|
| nvvm.ldmatrix | llvm.nvvm.ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.{b16,b8x16,...} | ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.{b16,b8x16,...} {...}, [%addr]; |
| nvvm.stmatrix | llvm.nvvm.stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.{b16,b8x16} | stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.{b16,b8x16} [%addr], {...}; |
| nvvm.prefetch.tensormap | llvm.nvvm.prefetch.tensormap | prefetch.tensormap [%tmap]; |
| nvvm.fence.proxy.acquire | llvm.nvvm.fence.proxy.acquire | fence.proxy.async.shared::cluster; |
| nvvm.fence.mbarrier.init | llvm.nvvm.fence.mbarrier.init | fence.mbarrier_init.release.cluster; |
| nvvm.cvt.packfloat.f32 | llvm.nvvm.cvt.{rn,rz,rm,rp}.{f16x2,bf16x2,e4m3x2,e5m2x2}.f32 | cvt.{rnd}.{f16,bf16,e4m3,e5m2}x2.f32 %r, %fhi, %flo; |
| nvvm.mma.sync (Ampere/Ada dense) | llvm.nvvm.mma.m{8,16}n{8,16}k{...}.row.col.{...} | mma.sync.aligned.m16n8kK.{row,col}.{row,col}.{...} {...}, %a, %b, %c; |
| nvvm.mma.block_scale | llvm.nvvm.mma.block_scale.m16n8k.{kind} | mma.sync.aligned.m16n8k.{kind}.scale::vec::{16,32} {...}, %a, %b, %c, %sa, %sb; |
Inline-PTX Templates and Constraint Strings
A handful of ops bypass call @llvm.nvvm.X and lower to llvm.inline_asm with a fixed PTX template plus a verbatim constraint string. The backend rejects the asm node unless template and constraint match the operand list exactly; reimplementers must reproduce both byte-for-byte.
The constraint codes used in this dialect:
| Code | Meaning |
|---|---|
| r | 32-bit integer register (i32 / f32 / i16 / i8) |
| l | 64-bit integer register (i64, including pointer-typed operands) |
| f | 32-bit floating-point register (f32) |
| h | 16-bit integer register (i16 / f16 / bf16) |
| n | compile-time integer immediate |
| =r / =l / =f / =h | output-only register of the matching width |
Sparse MMA
template: "mma.sp.sync.aligned.m{M}n{N}k{K}.row.col.{aType}.{bType}.{cType}.{dType}
{ %0, %1, %2, %3 }, // D (output)
{ %4, %5, %6, %7 }, // A (sparse halved)
{ %8, %9, %10, %11, %12, %13, %14, %15 }, // B
{ %16, %17, %18, %19 }, // C
%20, 0x{selector};" // sparse metadata, selector immediate
constraint: "=r,=r,=r,=r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r"
The first four =r slots are the output D fragment; the trailing r slots are the input fragments and the metadata word. The selector immediate is baked into the template literal at lowering time rather than passed as an operand; the same op emits 0x0 or 0x1 depending on the sparsitySelector attribute.
For shape m16n8k16.row.col.f16.f16.f16.f16 the constraint string above expands to four =r outputs (D fragment) and seventeen r inputs (A=4, B=8, C=4, plus the sparse-metadata word), matching the template's %0..%20 slot range. For m16n8k32.row.col.s32.s8.s8.s32 the printed PTX is {$0..$3}, {$4,$5}, {$6,$7}, {$8..$11}, $12 — four s32 outputs and nine inputs (A=2, B=2, C=4, metadata=1). The verifier rejects any combination not listed in the PTX ISA.
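Since the constraint string must match the template's slot range exactly, it can be derived from the fragment register counts instead of being typed by hand. A minimal sketch, assuming the D-then-A-B-C-then-metadata operand order described above; `sparse_mma_constraints` and `sparse_mma_slot_count` are hypothetical helper names.

```cpp
#include <cassert>
#include <string>

// Total number of template slots: D outputs, then A, B, C inputs, then the
// single sparse-metadata word.
int sparse_mma_slot_count(int d_regs, int a_regs, int b_regs, int c_regs) {
  return d_regs + a_regs + b_regs + c_regs + 1;
}

// Builds "=r,...,r,...": one "=r" per D register, one "r" per input register
// and one more "r" for the metadata word.
std::string sparse_mma_constraints(int d_regs, int a_regs, int b_regs,
                                   int c_regs) {
  assert(d_regs > 0 && a_regs > 0 && b_regs > 0 && c_regs > 0);
  std::string out;
  auto append = [&out](const char *code, int count) {
    for (int i = 0; i < count; ++i) {
      if (!out.empty())
        out += ',';
      out += code;
    }
  };
  append("=r", d_regs);
  append("r", a_regs + b_regs + c_regs + 1);
  return out;
}
```

For the f16 shape, `sparse_mma_constraints(4, 4, 8, 4)` yields four `=r` entries followed by seventeen `r` entries (21 slots); for the s8 shape, `sparse_mma_constraints(4, 2, 2, 4)` yields four outputs and nine inputs (13 slots).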
im2col TMA store with L2 cache hint
template: "cp.async.bulk.tensor.{N}d.global.shared::cta.im2col.bulk_group.L2::cache_hint
[%0, { %1, %2, ..., %{N} }],
[%{N+1}],
%{N+2};"
constraint: "l,r,r,r,r,r,l,l" // N=5 example
Operand 0 is the i64 descriptor pointer; the next N operands (one per rank) are 32-bit coordinates; the SMEM source pointer is l; the cache hint is l. Rank-3 and rank-4 forms drop coordinate operands and shrink the constraint string accordingly.
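The rank dependence can be sketched the same way. The helper below is hypothetical (`Im2colStoreOperands` and `im2col_store_operands` are illustrative names) and covers only the ranks 3 to 5 mentioned above.

```cpp
#include <cassert>
#include <string>

struct Im2colStoreOperands {
  std::string constraints;  // e.g. "l,r,r,r,r,r,l,l" for rank 5
  int smem_index;           // template slot of the SMEM source, %{N+1}
  int cache_hint_index;     // template slot of the cache hint, %{N+2}
};

Im2colStoreOperands im2col_store_operands(int rank) {
  assert(rank >= 3 && rank <= 5 && "only ranks 3..5 are described above");
  Im2colStoreOperands ops;
  ops.constraints = "l";      // %0: descriptor pointer
  for (int i = 0; i < rank; ++i)
    ops.constraints += ",r";  // %1..%N: one 32-bit coordinate per rank
  ops.constraints += ",l,l";  // SMEM source pointer, then cache hint
  ops.smem_index = rank + 1;
  ops.cache_hint_index = rank + 2;
  return ops;
}
```

For rank 5 this reproduces the `"l,r,r,r,r,r,l,l"` string shown above; rank 3 shrinks it to `"l,r,r,r,l,l"`.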
tcgen05.cp
template: "tcgen05.cp.{shape}.{multicast}.{src_fmt} [%0], [%1];"
constraint: "r,r"
The two r operands are the destination and source TMEM column indices. The shape, multicast, and src_fmt tokens are baked into the template literal at pattern-build time.
stmatrix fallback (pre-sm_90)
When nvvm.stmatrix.sync.aligned is targeted at a pre-sm_90 SM that exposes ldmatrix but not stmatrix directly, the op lowers through llvm.inline_asm:
template: "stmatrix.sync.aligned.m8n8.x{num}{.trans,}.shared::cta.b16
[%0], { %1, %2, ..., %{num} };"
constraint: "l,r,r,...,r" // one l for addr, num× r for fragment regs
l is the ptr addrspace(3) destination; the trailing r slots are the fragment registers.
WGMMA scale-D selector (when the immediate form is rejected)
Most wgmma.mma_async.sync.aligned variants reach PTX through the LLVM intrinsic, which carries scale_d as a compile-time argument. The few ops that drop to inline-asm use:
template: "wgmma.mma_async.sync.aligned.m64n{N}k{K}.{accT}.{aT}.{bT}
{ %0, %1, ..., %{accW-1} },
%da, %db, %p,
1, 1, %la, %lb;"
constraint: "=f,=f,...,=f,l,l,n,n,n"
Each output accumulator register is =f (for f32 accumulator types) or =h (f16). The two descriptor inputs are l. The %p predicate and the two trailing n slots are compile-time immediates. The =r slot used in some upstream snapshots for the scale-D return value does not appear on this constraint string because the immediate form is the only one tileiras emits.
Per-Arch Availability
Registration is uniform across targets; the gate lives in the verifier and the backend. The table is the practical "what runs where" view. ptx_min is the lowest PTX ISA version the final printed instruction requires.
| Family | SM floor | SM ceiling (observed) | ptx_min | Notes |
|---|---|---|---|---|
| Synchronisation | sm_70 | unbounded | 6.0 / 7.0 | aligned forms require 7.0 |
| Special registers | sm_70 | unbounded | 6.0 | always legal |
| shfl / vote | sm_70 | unbounded | 6.0 | only the .sync forms are emitted |
| cp.async (Ampere) | sm_80 | unbounded | 7.0 | Ampere async-copy queue |
| mbarrier | sm_80 (base), sm_90 (.expect_tx) | unbounded | 7.0 / 7.8 | shared-memory variant on Ampere; cluster-aware extensions on Hopper |
| WMMA | sm_70 | sm_89 (Hopper redirects through WGMMA) | 6.0 | the only MMA path on Turing/Ampere |
| WGMMA | sm_90a | sm_90a (no Blackwell WGMMA) | 8.0 | architecture-qualified; plain sm_90 is rejected |
| TMA | sm_90 | unbounded | 8.0 / 8.3 | descriptor lives in global memory |
| Cluster | sm_90 | unbounded | 8.0 | requires barrier.cluster.* PTX |
| ldmatrix / stmatrix | sm_75 (ldmatrix), sm_90 (stmatrix) | unbounded | 6.5 / 8.0 | width-4 .trans form requires 7.8 |
| tcgen05 | sm_100a | sm_100a (+ sm_100f for f-suffixed copy variants) | 8.6 | Blackwell tensor-memory family |
| Block-scaled MMA (mma.block_scale) | sm_100a | sm_100a | 8.6 | the only sm_100 form in the legacy nvvm.mma namespace |
| redux / barrier-id helpers | sm_80 (redux.sync) / sm_70 (bar.{arrive,sync}) | unbounded | 7.0 / 6.0 | redux.sync requires Ampere |
Lowering Contract
NVVM-to-LLVM conversion is deliberately mechanical. Each nvvm.X op has a single registered lowering: a direct call @llvm.nvvm.X intrinsic when LLVM exposes a matching intrinsic, or an llvm.inline_asm with a hard-coded PTX template otherwise. A third path expands into ordinary llvm dialect ops for the few cases that aren't a single instruction (e.g. nvvm.shfl.sync synthesised broadcast loops).
The choice is fixed per op at registration time. The conversion driver walks each nvvm.* op, looks up the op's OperationName in the dispatch map, and invokes the matching rewrite:
LogicalResult lower_nvvm_op(Operation *op) {
  // One entry per registered nvvm.* op, keyed by OperationName.
  const NvvmLowering *entry = lookup_by_operation_name(op->getName());
  require(entry != nullptr);
  switch (entry->kind) {
  case NVVM_DIRECT_INTRINSIC:
    return replace_with_llvm_intrinsic_call(op, entry->intrinsic_id);
  case NVVM_INLINE_ASM:
    return replace_with_inline_asm(op, entry->ptx_template, entry->constraints);
  case NVVM_LLVM_EXPANSION:
    return entry->custom_expand(op);
  }
  return failure();  // unknown kind: treat as a missing lowering
}
The dispatch map is built once at dialect-load time from the TableGen records: each record declares dialect="nvvm", an op mnemonic, an LLVM intrinsic ID (or an inline-asm template + constraint string), and the kind. Lowering reads each field straight out of the entry. No layout, scheduling, or pipeline policy is reinferred here — earlier dialects must already have committed to the target operation.
After the sweep, no nvvm.* op survives. The verifier check that follows the sweep treats any remaining nvvm.* op as a missing pattern, not as a default-illegal op.
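The sweep and its "missing pattern" check can be modelled with an ordinary map. This is a self-contained sketch, not the tileiras dispatcher: `DispatchMap`, `LoweringEntry`, `build_dispatch_map`, and `sweep` are hypothetical, ops are reduced to their mnemonic strings, and the three map entries are illustrative picks from the tables on this page.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

enum class LoweringKind { DirectIntrinsic, InlineAsm, LlvmExpansion };

struct LoweringEntry {
  LoweringKind kind;
  std::string payload;  // intrinsic name, constraint string, or a note
};

using DispatchMap = std::unordered_map<std::string, LoweringEntry>;

// Built once; the lowering kind is fixed per op. Entries are illustrative.
DispatchMap build_dispatch_map() {
  return {
      {"nvvm.barrier0", {LoweringKind::DirectIntrinsic, "llvm.nvvm.barrier0"}},
      {"nvvm.tcgen05.cp", {LoweringKind::InlineAsm, "r,r"}},
      {"nvvm.shfl.sync", {LoweringKind::LlvmExpansion, "llvm dialect expansion"}},
  };
}

struct SweepResult {
  std::vector<std::string> emitted;  // payload chosen for each lowered op
  std::vector<std::string> missing;  // ops with no registered lowering
};

// Walks the op list once; anything without an entry is reported as a missing
// pattern instead of being silently kept.
SweepResult sweep(const DispatchMap &map, const std::vector<std::string> &ops) {
  SweepResult result;
  for (const std::string &name : ops) {
    auto it = map.find(name);
    if (it == map.end()) {
      result.missing.push_back(name);
      continue;
    }
    result.emitted.push_back(it->second.payload);
  }
  return result;
}
```

A non-empty `missing` list after the sweep corresponds to the verifier failure described above.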
Verifier Invariants
The verifier rejects anything that cannot be legally translated to the selected target:
- intrinsic operand counts and result counts match the selected intrinsic;
- pointer address spaces are explicit and legal for the operation;
- memory scopes and memory-ordering attributes are compatible;
- MMA and WGMMA shapes are supported by the target;
- sparse and block-scaled MMA forms carry the required metadata operands;
- TMA and async-copy operands have valid descriptor, barrier, and memory-space types;
- special-register reads are valid for the target and execution model;
- inline-PTX operations have complete constraint strings and result types;
- operations requiring a newer SM generation are not emitted for an older target.
This is the last MLIR-level diagnostic point before LLVM IR and the machine backend. A good error here names the semantic mismatch, not just the intrinsic.
Target Attributes
nvvm carries the target attributes that make the LLVM handoff meaningful: architecture (nvvm.target = "sm_90a"), PTX version, feature flags (+ptx80, +tmem, ...), kernel markers, launch bounds, cluster dimensions, and assorted function- and module-level properties. Earlier passes set these through gpu.module and conversion interfaces; by the time the LLVM module materialises, the NVPTX backend has to recover a concrete subtarget from them.
The attributes are plain string / integer / array attributes attached to the gpu.module or func.func parents — the NVPTX backend reads them from the LLVM module's metadata after MLIR-to-LLVM translation. Missing or contradictory attributes here are silent disasters: the backend still receives syntactically valid LLVM IR, but generates code for the wrong target contract. The verifier rejects the obvious cases (no sm, no ptx), and the NVPTX subtarget feature matrix lists which features each SM accepts.
NVVM Properties Blob and Attr Parsers
Abstract
Every nvvm.* op that carries inline-data attributes gets a uniform Properties record bump-allocated next to its Operation*. The NVVM-to-LLVM lowering dispatcher shares one blob layout across the whole dialect, and five access patterns (A..E) cover every family: tcgen05.mma, ldmatrix / stmatrix, wgmma / mma.sync, cp.async.bulk / TMA, atomicrmw / red, the prefetch / fence / elect.sync triad, and block-scaled MMA.
NVVMDialect::initialize installs a 67-element enum-attr registrar chain that registers the enum namespaces those patterns consume. The slot tables below are normative — they pin each op mnemonic to its enum / unit / int / array slot positions, so a reimplementer can wire getAttrOfType<EnumAttr> (or OpAdaptor) to the offsets the upstream dispatcher already reads.
Properties record layout
Every dispatcher arm opens with the same property fetch. The Operation* header's discriminator byte at op[46] carries one high bit that selects inline Properties storage versus out-of-line bump-allocator storage. The bit is set for every NVVM op in this binary, so the effective offset is 16 and the fetch collapses to a single pointer load:
props_ptr = *(qword*)(op + 16 * (op[46] >> 7) + 0) // high bit set, so this reads op + 16
The first 16 bytes hold an OperandSegmentSizes inline buffer for ops with variadic operand groups; the bytes are zero otherwise. +16..+47 is reserved padding. Attribute slots start at +64 and march on an 8-byte stride. Each slot is an Attribute* — either null (optional attribute absent) or a pointer to an AttrStorage header whose i32 payload sits at +8:
+0..+15 OperandSegmentSizes (16 B, inline; zero when op has no segments)
+16..+47 zero-pad / reserved inline storage
+64 slot 0 Attribute* (8 B)
+72 slot 1 Attribute*
+80 slot 2 Attribute*
+88 slot 3 Attribute*
+96 slot 4 Attribute*
+104 slot 5 Attribute*
+112 slot 6 Attribute*
+120 slot 7 Attribute*
+128 slot 8 Attribute*
+136 slot 9 Attribute*
+144 slot 10 Attribute*
+152 slot 11 Attribute*
+160 slot 12 Attribute* (rare; only block-scaled MMA reaches here)
The biggest record observed reads 13 slots: nvvm.mma.block_scale touches +64..+136 plus +144. Every per-arm offset lands on 64 + 8*k for k ∈ [0,12] — no half-pointer storage, no odd alignment. Other observed slot counts: nvvm.ldmatrix=4, tcgen05.mma=9, wgmma.mma_async=12. The out-of-line bump-allocator path goes unused for this NVVM op set; lowering rejects any op whose discriminator says its properties are out-of-line.
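The slot arithmetic above is small enough to check mechanically. The following is a minimal sketch, assuming only the constants stated in the layout (base +64, 8-byte stride, slot indices 0 through 12); the function names are illustrative and do not come from the binary.

```python
SLOT_BASE = 64    # first Attribute* slot, per the layout table
SLOT_STRIDE = 8   # one pointer per slot
MAX_SLOT = 12     # highest slot index listed in the layout table


def slot_offset(k: int) -> int:
    """Byte offset of slot k inside the Properties record."""
    if not 0 <= k <= MAX_SLOT:
        raise ValueError(f"slot index {k} outside [0, {MAX_SLOT}]")
    return SLOT_BASE + SLOT_STRIDE * k


def slot_index(offset: int) -> int:
    """Inverse mapping; rejects offsets that are not on the 8-byte lattice."""
    k, rem = divmod(offset - SLOT_BASE, SLOT_STRIDE)
    if rem != 0 or not 0 <= k <= MAX_SLOT:
        raise ValueError(f"offset +{offset} is not a slot boundary")
    return k
```

A reimplementation could run slot_index over every offset in its per-op tables as a cheap guard against half-pointer or misaligned entries.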
The five access patterns
All 199 dispatcher arms reach into their Properties record through one of five inline-templated helpers. They share the slot-fetch arithmetic from the layout above and differ only in what they do with the slot's Attribute* and which payload they pull out.
| Pattern | Attribute kind | Slot read | Stored result | Used for |
|---|---|---|---|---|
| A | EnumAttr | load slot pointer, then read the padded i32 enum payload | uint32_t enum | shape, typeA/B/C, layout, trans, eltype, scale_in/out, kind, sparsity, cta_group, collectorA/B, cp_size, cache_modifier, red_op, red_type, mem_order |
| B | Optional<EnumAttr> | Pattern A, but null-tolerant | tagged uint64_t with present flag plus value | optional layout, trans, sparsity |
| C | UnitAttr / BoolAttr | test whether the slot pointer is non-null | bool | satfinite, transA/B, has_write_disable, tcgen05.fence direction, prefetch L2 marker |
| D | IntegerAttr | read the APInt value | u32, or u64 when active bits exceed 32 | mask on elect.sync, cache_level on prefetch, num on ldmatrix |
| E | ArrayAttr | read the first element of the array | first i32 of array | kernel bool, maxntid first element |
Pattern A is the workhorse: more than half of all per-arm slot reads follow it, because NVVM EnumAttrs uniformly pad their payload to a full 32-bit word at +8 of the attribute storage regardless of cardinality. Pattern B's tagged-int return is what feeds the present-flag inspections scattered through the dispatcher. Pattern C never touches the attribute payload at all. Pattern D bottoms out in IntegerAttr::getValue, which yields the APInt that is then narrowed to u32 or u64. Pattern E is the rarest: only the nvvm.kernel / nvvm.maxntid function-attribute decoders use it.
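The five patterns can be modelled as one small decode function. This is a sketch under stated assumptions: Attr is a hypothetical stand-in for an AttrStorage whose payload field models the data at +8, a null slot is modelled as None, and the position of the Pattern B present flag (bit 32 here) is an assumption made for illustration, not something read from the binary.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Attr:
    """Hypothetical stand-in for an AttrStorage; payload models the data at +8."""
    payload: Any


PRESENT_BIT = 1 << 32  # assumed tag position for Pattern B (illustrative only)


def decode(slot: Optional[Attr], pattern: str):
    if pattern == "A":   # EnumAttr: padded i32 enum payload
        return slot.payload & 0xFFFFFFFF
    if pattern == "B":   # Optional<EnumAttr>: Pattern A, but null-tolerant
        if slot is None:
            return 0
        return PRESENT_BIT | (slot.payload & 0xFFFFFFFF)
    if pattern == "C":   # UnitAttr / BoolAttr: presence only, payload untouched
        return slot is not None
    if pattern == "D":   # IntegerAttr: integer value
        return int(slot.payload)
    if pattern == "E":   # ArrayAttr: first element of the array
        return slot.payload[0]
    raise ValueError(f"unknown access pattern {pattern!r}")
```

Keeping the present flag outside the low 32 bits means a caller can test presence and extract the enum value with two independent masks.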
Per-op-family Properties slot maps
The 199 dispatcher arms divide into the seven families below. Access patterns reuse the A..E labels from the table above.
tcgen05.mma family (Blackwell sm_100a / sm_100f, 16 arms)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.tcgen05.mma | +64 / +72 / +80 / +88 | A / A / A / A | typeA/cType, collectorA, scale_d, layout-bits |
| nvvm.tcgen05.mma.block_scale | +64 / +72 / +80 / +88 | A / A / A / A | cType, collectorA, scale_d, layout, kindA, kindB |
| nvvm.tcgen05.mma.sp | +64 / +72 / +80 / +88 | A / A / A / A | same as tcgen05.mma plus the metadata operand slot |
| nvvm.tcgen05.mma.ws | — | — | operand-only |
| nvvm.tcgen05.mma.ws.sp | — | — | operand-only |
| nvvm.tcgen05.mma.sp.block_scale | +64 / +72 / +80 / +88 | A / A / A / A | merge of sp and block_scale fields |
| nvvm.tcgen05.shift | — | — | operand-only |
| nvvm.tcgen05.commit | +64 | A | cta_group |
| nvvm.tcgen05.commit.arrive | +64 | A | cta_group |
| nvvm.tcgen05.cp | +64 / +72 / +80 | A / A / A | multicast, shape, src_fmt |
| nvvm.tcgen05.alloc | +64 | A | cta_group |
| nvvm.tcgen05.dealloc | +64 | A | cta_group |
| nvvm.tcgen05.relinquish_alloc_permit | +64 | A | cta_group |
| nvvm.tcgen05.wait | +64 | A | wait_kind (load or store) |
| nvvm.tcgen05.fence | +64 | C | fence-kind marker |
| nvvm.tcgen05.{ld,st}matrix | — | — | indexed-operand walker only |
tcgen05.mma is the only family where the first 16 Properties bytes aren't idle. The op carries a variable-arity operand list, so the dispatcher reserves +0..+15 for a packed OperandSegmentSizes buffer plus a second 16-byte running-offset buffer at +96..+111.
ldmatrix / stmatrix (tensor-core fragment ops: ldmatrix on Turing sm_75+, stmatrix on Hopper sm_90+, 3 arms)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.ldmatrix | +64 / +72 / +80 / +88 | A / D / A / A | eltype/size, num, shape, trans |
| nvvm.stmatrix | +64 / +72 | A / A | shape, trans; num is the SSA-vector cardinality, not a property |
| nvvm.stmatrix alternate selector | +64 | A | trans encoded as a 0/1 enum |
The alternate stmatrix selector disambiguates intrinsic variants from the trans enum alone. It fires when the operand vector matches the narrower selector shape.
wgmma / mma.sync (Hopper sm_90a, 4 arms)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.wgmma.mma_async | +64 / +72 / +80 / +88 / +96 / +112 / +120 / +128 / +136 | A × 8, D × 1 | typeA, b1Op, typeB, shape, typeC, scaleIn, scaleOut, layoutA, layoutB |
| nvvm.wgmma.commit.group.sync.aligned | +64 / +72 | A / A | wgmma_type, wgmma_layout |
| nvvm.wgmma.wait.group.sync.aligned | +64 / +72 / +88 | A / A / A | type, layout, shape-N selector |
| nvvm.mma.sync | +64 / +72 / +80 / +88 / +96 / +104 | A × 6 | b1Op, multiplicandAPtxType, layoutA, layoutB, multiplicandBPtxType, intOverflowBehavior |
| nvvm.wmma family | — | — | operand-only; eltype/k/m/n/layout are baked into the resolved intrinsic name at build time |
cp.async.bulk / TMA (Hopper+ sm_90a / Blackwell sm_100, 8 arms)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.cp.async.bulk.tensor.reduce | +64 / +72 + rank-dependent slot | C / C / A | multicast presence, cache-hint, reduce_kind |
| nvvm.cp.async.bulk.tensor.prefetch | +64 / +72 | A / C | im2col-type, multicast |
| nvvm.cp.async.bulk.tensor.shared.cta.to.global | — | — | operand-only |
| nvvm.cp.async.bulk.tensor.shared.cta.to.global.ext | — | — | im2col / cache-hint operands |
| nvvm.cp.async.bulk.tensor.shared.cluster.to.global | — | — | operand-only |
| nvvm.cp.async.bulk.tensor.base | +64 / +72 / +80 | C / C / C | has_im2col, has_multicast, has_cache_hint |
| nvvm.cp.async.shared.*.global | +64 / +80 / +72 | A / A / C | cp_size, ca/cg cache modifier, L2-hint presence |
| nvvm.cp.async.commit_group | +64 | A | ca/cg modifier |
atomicrmw / red (sm_60+, 5 arms)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.atomicrmw | +64 / +72 | A / A | mem_order, atomic_op |
| nvvm.red variant 1 | +64 / +72 | A / A | red_op, red_type |
| nvvm.red variant 2 | +64 / +72 | A / A | red_op, red_type |
| nvvm.red variant 3 | +64 / +72 | A / A | red_op, red_type |
| nvvm.atomic.cas / nvvm.red.b128 (parser-arm shorthand; neither string appears in the binary — both arms are reached by TypeID dispatch from the nvvm.cmpxchg / 128-bit reduction lowerings) | +64 / +72 / +80 / +88 | A / A / A / A | four enum slots |
prefetch / fence / elect.sync (5 arms)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.tcgen05.fence before | +64 | C | before unit marker |
| nvvm.tcgen05.fence after | +64 | C | after unit marker |
| nvvm.elect.sync | +64 | D | mask |
| nvvm.prefetch / nvvm.prefetch.tensormap | +64 / +72 / +80 / +88 / +96 | D / C / A / A / A | cache_level, L2 marker, to-tensormap flag, evict-priority, prefetch-mode |
| nvvm.cvt.packfloat.f32 | helper-decoded | A | property-decoded and emitted through a helper |
Block-scaled MMA (nvvm.mma.block_scale, sm_100a, 1 arm)
| Op mnemonic | Slot | Pattern | Field |
|---|---|---|---|
| nvvm.mma.block_scale | +64 / +72 / +80 / +88 / +96 / +112 / +120 / +128 / +136 | A × 9 | typeA, b1Op, typeB, shape, typeC, scaleAFmt, scaleBFmt, scale_vec, layoutA |
Block-scaled MMA reuses the wgmma.mma_async prologue shape but rewires the slots: +112/+120 swap scaleIn/scaleOut for scaleAFmt/scaleBFmt, +128 swaps layoutA for scale_vec, and +136 carries layoutA in place of layoutB. The slot index, not the byte offset, is the canonical identifier.
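The rewiring can be read directly off the two slot maps. The sketch below transcribes the offsets and field names from the nvvm.wgmma.mma_async and nvvm.mma.block_scale rows and diffs them; the dictionary and function names are illustrative.

```python
# Offsets and field names transcribed from the two slot-map tables.
WGMMA_MMA_ASYNC = {
    64: "typeA", 72: "b1Op", 80: "typeB", 88: "shape", 96: "typeC",
    112: "scaleIn", 120: "scaleOut", 128: "layoutA", 136: "layoutB",
}
MMA_BLOCK_SCALE = {
    64: "typeA", 72: "b1Op", 80: "typeB", 88: "shape", 96: "typeC",
    112: "scaleAFmt", 120: "scaleBFmt", 128: "scale_vec", 136: "layoutA",
}


def rewired_slots(base: dict, variant: dict) -> dict:
    """Map each differing offset to its (base field, variant field) pair."""
    return {
        off: (base[off], variant[off])
        for off in sorted(base)
        if base[off] != variant[off]
    }


def slot_index(offset: int) -> int:
    """Slot index for a byte offset on the 64 + 8*k lattice."""
    return (offset - 64) // 8
```

Under these tables the diff touches offsets +112, +120, +128, and +136, which are slot indices 6 through 9; the first five slots are shared.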
The 67-element enum-attr registrar chain
NVVMDialect::initialize installs 68 attribute registrars. Sixty-seven are single-namespace EnumAttr registrars; the sixty-eighth is the NVVMTargetAttr registrar carrying chip, features, link-files, and flags. Every enum registrar has the same shape — assemble an attribute-class definition tuple, add it to the dialect, attach the printer/parser pair to the attribute-name table.
The 67 namespaces cover every enum-typed Properties slot read by the dispatcher. Grouped by family, the chain registers:
- cache / memory hints: cache_eviction_priority, load_cache_modifier, load_cache_modifier_ext, store_cache_modifier, l2_prefetch, evict_kind, prefetch_cache_level
- address spaces and scopes: state_space, shared_space, mem_scope, mbar_scope, mbar_space
- memory ordering and fences: mem_order, proxy_kind, action, tcgen05_fence, tcgen05_wait
- warp-level collectives: shfl_kind, vote_sync_kind, match_sync_kind, redux_kind, barrier_redux_kind
- mbarrier / FP / cvt: mbar_txn_kind, mbar_wait, fp_rnd_mode, sat_mode, rnd, sat, convert_fp4_type, convert_fp6_type, convert_fp8_type, packfloat_type
- MMA / WMMA / WGMMA: shape, mma_layout, mma_type, mma_frag, mma_b1op, mma_int_overflow, mma_cta_count, sparsity_format, load_shape, store_shape, load_src_format, wgmma_scale_in, wgmma_scale_out, wgmma_type
- block-scaled and tcgen05: scale_vec_size, block_scale_format, tcgen05_mma_kind, tcgen05_mma_collectorop, tcgen05_mma_scale_vec, tcgen05_mma_collectorb, TmemLayout, TCBarParam, tcgen05_group, tcgen05_cp_shape, tcgen05_cp_multicast, tcgen05_cp_src_fmt, tcgen05_ldst_shape, load_mode
- TMA / atomic / reduction: tma_store_mode, tma_redux_kind, red_op, red_type, mul_mode, atomic_op, dot_accumulate_type
These namespaces are exactly the enums whose i32 payloads Pattern A pulls from the slot trailers above. The chain only registers parse-side machinery; constant materialization is a later lowering concern. During the NVVM-to-LLVM rewrite, any enum payload that needs to become an SSA constant materializes as llvm.mlir.constant %c : i32. For inline-asm slots that bypass the intrinsic table, see NVVM Overview — Inline-PTX Templates and Constraint Strings.
Reimplementation Notes
A clean implementation drives off a generated slot schema, not a hand-written switch on every op:
for op in nvvm_ops:
    props = read_inline_properties(op)
    schema = schema_for(op.name)
    for field in schema.fields:
        slot = props.slots[field.index]
        value = decode(slot, field.pattern)
        emit_lowering_operand_or_intrinsic_selector(field.name, value)
The invariants are small. Properties are inline for this op set. Slots start at byte 64 and advance by one pointer. Enum attributes decode through padded 32-bit payloads. Optional enum attributes carry a presence bit separate from the value.
Position in the cross-stage attribute system
The Properties blob is the terminal carrier for the memory-ordering, cache-modifier, and MMA-shape attributes that ride down from the higher dialects. Earlier stages keep these facts in the op-attribute dictionary; by the time the NVVM dispatcher sees the op, the attribute has folded into a positional slot in the blob. Attribute System and Lowering documents that journey across the full pipeline — which carrier each fact lives in at each stage, which transitions are intentionally lossy, and which silent drops are wrong-output bugs that ptxas will not catch.
NVVM WMMA Ops
Abstract
The nvvm.wmma.* family is the warp-synchronous matrix-multiply-accumulate path used on every NVIDIA target from sm_70 through sm_89. The dialect carries three MLIR ops — nvvm.wmma.load, nvvm.wmma.store, and nvvm.wmma.mma — each parameterised by attributes (shape, fragment role, layout, element types). The full PTX shape × layout × element-type cross-product is reached by attribute combinations on these three ops, not by enumerating dozens of ops.
Hopper (sm_90+) does not extend this family. Warp-group MMA on Hopper lives in nvvm.wgmma.*; Blackwell MMA lives in nvvm.tcgen05.*.
Op Layout
The dialect registers exactly three op classes — nvvm.wmma.load, nvvm.wmma.store, nvvm.wmma.mma — and that is the count visible in the binary's interned mnemonic strings. The attribute cross-product on those three ops expands to roughly 64 distinct LLVM-intrinsic / PTX-instruction targets at lowering time; the right-hand column counts intrinsic targets reachable through the op, not separate dialect ops.
| Op (dialect op) | Role | Attribute axes | Reachable LLVM intrinsics |
|---|---|---|---|
| nvvm.wmma.load | A / B / C fragment load | fragment ∈ {a,b,c} × shape × layout ∈ {row,col} × element type | ~36 |
| nvvm.wmma.store | D fragment store | shape × layout × element type ∈ {f16,f32,s32} | ~12 |
| nvvm.wmma.mma | tile MMA | shape × A-layout × B-layout × (aT,bT,cT,dT) × .xor.popc/.and.popc for b1 | ~16 |
Tile shapes legal in PTX: m16n16k16, m8n32k16, m32n8k16 for f16/bf16; m16n16k8 for tf32; m16n16k16, m8n32k16, m32n8k16 for s8/u8; m8n8k128 for b1; m8n8k32 for s4/u4. The verifier rejects any attribute tuple not in this table.
Operand Tables
nvvm.wmma.load.{a,b,c}.sync.aligned.mXnYkZ.{row,col}.{T}
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) | shared-memory tile origin |
| operand 1 | stride | i32 | row or column stride in elements |
| result 0 | frag | !llvm.struct<(T, T, ..., T)> | per-thread register fragment; cardinality fixed by shape and element type |
Each shape pins the fragment length: an m16n16k16.f16 A fragment is struct<(vector<2xf16>, vector<2xf16>, ..., vector<2xf16>)> of length 8; an m16n16k8.tf32 A fragment is struct<(i32, i32, i32, i32)>; an m16n16k16.s8 A fragment is struct<(i32, i32)>. The verifier rejects any other arity for the chosen shape/type pair.
nvvm.wmma.store.d.sync.aligned.mXnYkZ.{row,col}.{T}
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) | shared-memory destination |
| operand 1..N | frag | per shape/type | D fragment elements, expanded into one operand per register |
| operand N+1 | stride | i32 | row or column stride in elements |
store.d flattens the fragment into separate operands rather than re-packing into a struct, which mirrors LLVM's intrinsic signature.
nvvm.wmma.mma.sync.aligned.mXnYkZ.{layoutA}.{layoutB}.{aT}.{bT}.{cT}.{dT}
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0..p | A fragment | per shape and aT | flattened |
| operand p+1..q | B fragment | per shape and bT | flattened |
| operand q+1..r | C fragment | per shape and cT | accumulator input |
| result 0 | D fragment | !llvm.struct<(dT, ..., dT)> | accumulator output |
For m16n16k16.row.col.f16.f16.f16.f16 the operand bag is A=8 f16x2, B=8 f16x2, C=4 f16x2, and the result is struct<(vector<2xf16>) x 4>. The verifier cross-checks the operand counts against the shape and types and rejects any mismatch.
LLVM Intrinsic Mapping
Every nvvm.wmma.* op lowers to a single call to an @llvm.nvvm.wmma.mXnYkZ.{op}.{layout}.{...} intrinsic. The set of intrinsic names is fixed at TableGen registration time by concatenating the shape, op, layout, and type tokens. Because three dialect ops fan out to roughly 64 intrinsics, the lowering pattern reads the op's attributes both to verify them and to select one intrinsic ID from that build-time table; it never assembles a name string at run time.
| Op | LLVM intrinsic |
|---|---|
| nvvm.wmma.load (frag = A, m16n16k16, row, f16) | llvm.nvvm.wmma.m16n16k16.load.a.row.stride.f16 |
| nvvm.wmma.load (frag = B, m16n16k16, col, f16) | llvm.nvvm.wmma.m16n16k16.load.b.col.stride.f16 |
| nvvm.wmma.load (frag = C, m16n16k16, row, f32) | llvm.nvvm.wmma.m16n16k16.load.c.row.stride.f32 |
| nvvm.wmma.store (frag = D, m16n16k16, row, f32) | llvm.nvvm.wmma.m16n16k16.store.d.row.stride.f32 |
| nvvm.wmma.mma (m16n16k16, row, col, f16→f16) | llvm.nvvm.wmma.m16n16k16.mma.row.col.f16.f16 |
| nvvm.wmma.mma (m16n16k16, row, col, f16→f32) | llvm.nvvm.wmma.m16n16k16.mma.row.col.f32.f32 |
| nvvm.wmma.mma (m8n8k128, row, col, b1→s32) | llvm.nvvm.wmma.m8n8k128.mma.row.col.b1 |
| nvvm.wmma.mma (m8n8k32, row, col, s4→s32) | llvm.nvvm.wmma.m8n8k32.mma.row.col.s4 |
Shape, fragment, layout pair, and element types all live as attributes on the three canonical dialect ops (nvvm.wmma.load, nvvm.wmma.store, nvvm.wmma.mma); the matching intrinsic name is reconstructed at NVVM-to-LLVM time by concatenating the attribute tokens.
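The token concatenation can be sketched as two small name builders. This is an illustration that reproduces only the name shapes shown in the mapping table; the function names are hypothetical, and which type tokens a given mma variant carries is taken from the table rows rather than derived.

```python
def wmma_load_store_intrinsic(shape: str, op: str, frag: str,
                              layout: str, eltype: str) -> str:
    """Name shape used by the load/store rows: shape, op, fragment, layout, type."""
    return f"llvm.nvvm.wmma.{shape}.{op}.{frag}.{layout}.stride.{eltype}"


def wmma_mma_intrinsic(shape: str, layout_a: str, layout_b: str,
                       *type_tokens: str) -> str:
    """Name shape used by the mma rows: shape, layout pair, then type tokens."""
    return ".".join(
        ["llvm.nvvm.wmma", shape, "mma", layout_a, layout_b, *type_tokens]
    )
```

For example, wmma_load_store_intrinsic("m16n16k16", "load", "a", "row", "f16") yields the first row of the table, and wmma_mma_intrinsic("m16n16k16", "row", "col", "f16", "f16") yields the f16→f16 mma row.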
PTX Templates
Once the LLVM intrinsic is selected, the NVPTX backend emits one PTX instruction. The templates below cover the canonical shape/type combinations; other combinations substitute the shape and type tokens without changing the skeleton.
| Op | PTX printed |
|---|---|
| wmma.load.a.sync (f16, row) | wmma.load.a.sync.aligned.m16n16k16.row.shared::cta.f16 {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, [%addr], %stride; |
| wmma.load.b.sync (f16, col) | wmma.load.b.sync.aligned.m16n16k16.col.shared::cta.f16 {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, [%addr], %stride; |
| wmma.load.c.sync (f32, row) | wmma.load.c.sync.aligned.m16n16k16.row.shared::cta.f32 {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, [%addr], %stride; |
| wmma.store.d.sync (f32, row) | wmma.store.d.sync.aligned.m16n16k16.row.shared::cta.f32 [%addr], {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, %stride; |
| wmma.mma.sync (row.col.f16.f16.f16.f16) | wmma.mma.sync.aligned.m16n16k16.row.col.f16.f16 {%d0..%d3}, {%a0..%a7}, {%b0..%b7}, {%c0..%c3}; |
| wmma.mma.sync (row.col.f16.f16.f32.f32) | wmma.mma.sync.aligned.m16n16k16.row.col.f32.f32 {%d0..%d7}, {%a0..%a7}, {%b0..%b7}, {%c0..%c7}; |
| wmma.mma.sync (row.col.s8.s8.s32.s32) | wmma.mma.sync.aligned.m16n16k16.row.col.s8 {%d0..%d7}, {%a0..%a1}, {%b0..%b1}, {%c0..%c7}; |
| wmma.mma.sync (row.col.b1.b1.s32.s32, popc) | wmma.mma.sync.aligned.m8n8k128.row.col.popc.b1 {%d0..%d1}, {%a0}, {%b0}, {%c0..%c1}; |
The .xor.popc and .and.popc modifiers on the b1 form are encoded as a boolean attribute on nvvm.wmma.mma (and selected through the op's element-type discriminator). The verifier rejects any combination not listed in the PTX ISA.
Per-Arch Availability
| Sub-family | SM floor | ptx_min | Notes |
|---|---|---|---|
| f16/f32 accumulators | sm_70 | 6.0 | universal across Volta and later |
| bf16 | sm_80 | 7.0 | Ampere extension |
| tf32 | sm_80 | 7.0 | only with the m16n16k8 shape |
| s8 / u8 | sm_72 | 6.3 | mobile + datacenter Turing onwards |
| s4 / u4 | sm_75 | 6.3 | m8n8k32 shape only |
| b1 (xor.popc / and.popc) | sm_75 | 6.3 | m8n8k128 shape only |
Hopper (sm_90+) backends accept nvvm.wmma.* for backward compatibility but Tileiras prefers nvvm.wgmma.mma_async once the target hits sm_90a. Blackwell (sm_100+) keeps WMMA legal for short-K tiles only — long-K paths go through nvvm.tcgen05.mma. See Per-SM Emission Templates — SM70 / SM75 for the Volta/Turing PTX templates.
Verification Invariants
- Tile shape and element-type tuple must match a row of the PTX ISA's WMMA shape table.
- A and B fragment cardinalities are derived from the shape; the verifier rejects mismatched operand counts.
- C and D layouts (row/col) must agree.
- .xor.popc/.and.popc are legal only on the b1 form.
- f64 WMMA does not exist in this dialect; the FP64 MMA path uses nvvm.mma.sync with the m8n8k4.f64 shape/type attribute combination.
NVVM WGMMA Ops
Abstract
nvvm.wgmma.* is the warp-group asynchronous MMA family used on Hopper (sm_90a). A warp group is four contiguous warps cooperating on one m64nNkK accumulator tile, with B always resident in shared memory through a 64-bit SMEM descriptor and A either in registers or in SMEM through a second descriptor. The four ops in this family pair into a four-stage pipeline: fence, mma_async, commit, wait. See WGMMA Emission Protocol — The Four-Op Sequence for the pipeline timing and WGMMA Emission for the codegen side.
Blackwell (sm_100+) does not extend this family. The Hopper WGMMA path is the only wgmma.* PTX surface; Blackwell MMA lives in nvvm.tcgen05.*.
Op Roster
The "Properties slots used" column tracks where each op stores its attribute payload in the inline Properties record; see Properties Blob — Per-op-family slot maps for the exact byte offsets.
| Op | Role | Properties slots used |
|---|---|---|
| nvvm.wgmma.fence.aligned | producer-side fence before mma_async | none |
| nvvm.wgmma.mma_async | the MMA itself | typeA, b1Op, typeB, shape, typeC, scaleIn, scaleOut, layoutA, layoutB |
| nvvm.wgmma.commit.group.sync.aligned | close the current MMA group | wgmma_type, wgmma_layout |
| nvvm.wgmma.wait.group.sync.aligned | wait for the group with depth N | wgmma_type, wgmma_layout, shape-N |
commit.group and wait.group carry type+layout attributes even though no register operand survives — the suffix selects which earlier mma_async group the wait drains.
Operand Tables
nvvm.wgmma.fence.aligned
No operands and no result. Lowers to a single PTX wgmma.fence.sync.aligned; instruction.
nvvm.wgmma.mma_async
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | descA | i64 | WGMMA SMEM descriptor for A — or, for the A-in-registers form, an !llvm.struct register fragment |
| operand 1 | descB | i64 | WGMMA SMEM descriptor for B (always SMEM-resident) |
| operand 2 | accumIn | !llvm.struct<(T, ..., T)> of accumulator regs | accumulator input tile |
| attribute | typeA | enum wgmma_type | f16 / bf16 / tf32 / e4m3 / e5m2 / s8 / u8 / s4 / u4 / b1 |
| attribute | typeB | enum wgmma_type | mirror of typeA |
| attribute | typeC | enum wgmma_type | usually f32; f16 allowed for f16xf16 |
| attribute | shape | enum shape | m64nNkK selector — N ∈ {8, 16, 24, ..., 256} step 8 |
| attribute | scaleIn | enum wgmma_scale_in | +1 / -1 for A and B |
| attribute | scaleOut | enum wgmma_scale_out | 0 (init) or 1 (accumulate) |
| attribute | layoutA | enum mma_layout | row / col |
| attribute | layoutB | enum mma_layout | row / col |
| attribute | b1Op | enum mma_b1op | xor_popc / and_popc / none |
| result 0 | accumOut | same struct type as accumIn | accumulator after the MMA |
The accumulator struct width depends on N and typeC. For m64n128k16.f32.f16.f16 the accumulator is 64 f32 registers laid out as struct<(f32) x 64>; for m64n64k16.f32.f16.f16 it is 32 f32. Each of the 128 threads in the warp group holds N / 2 accumulator elements, so the verifier rejects any struct width that does not match N * typeC_bits / 64 32-bit registers.
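The two worked widths above (64 registers for n128, 32 for n64) follow from spreading the 64×N tile evenly across the 128 threads of a warp group. A minimal sketch of that arithmetic, assuming sub-32-bit accumulator types such as f16 are packed two per 32-bit register; the function name is illustrative:

```python
WARPGROUP_THREADS = 128  # four warps of 32 threads
TILE_M = 64              # every WGMMA shape is m64nNkK


def accumulator_regs(n: int, type_c_bits: int) -> int:
    """32-bit accumulator registers held by each thread for an m64nN tile."""
    elems_per_thread = TILE_M * n // WARPGROUP_THREADS  # equals n / 2
    return elems_per_thread * type_c_bits // 32
```

With f32 accumulators this gives N / 2 registers; with f16 accumulators it gives N / 4.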
nvvm.wgmma.commit.group.sync.aligned
| Position | Name | Type | Notes |
|---|---|---|---|
| attribute | wgmma_type | enum | echoes the mma_async typeA/typeB selector |
| attribute | wgmma_layout | enum | echoes the layout pair |
No operands; closes the current outstanding-MMA group.
nvvm.wgmma.wait.group.sync.aligned
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | groupDepth | i32 | maximum number of most recently committed groups allowed to remain pending |
| attribute | wgmma_type / wgmma_layout / shape-N | enums | propagated through to the PTX suffix |
A depth-zero wait drains every outstanding group; a non-zero depth N lets the N most recently committed groups stay in flight while guaranteeing that every older group is complete.
WGMMA SMEM Descriptor
The 64-bit value passed as descA (when A is SMEM-resident) and descB packs the SMEM tile origin and stride into a single word. The bit layout is shared with cute_nvgpu's WGMMA descriptor construction:
typedef union WgmmaDescriptor {
    uint64_t raw;
    struct {
        uint64_t start_addr   : 14; /* bits 0..13: low 14 bits of (smem_byte_offset >> 4) */
        uint64_t              : 2;  /* bits 14..15: unused */
        uint64_t lbo          : 14; /* bits 16..29: leading byte offset (per-warp tile) */
        uint64_t              : 2;  /* bits 30..31: unused */
        uint64_t sbo          : 14; /* bits 32..45: stride byte offset (between warp tiles) */
        uint64_t              : 3;  /* bits 46..48: unused */
        uint64_t base_offset  : 3;  /* bits 49..51: matrix base offset */
        uint64_t              : 10; /* bits 52..61: always zero */
        uint64_t swizzle_mode : 2;  /* bits 62..63: 0=none, 1=128B, 2=64B, 3=32B */
    };
} WgmmaDescriptor;
start_addr requires 16-byte SMEM alignment because the field stores the offset shifted right by 4. lbo and sbo together encode the two-dimensional warp-tile stride layout. The swizzle field selects the canonical Hopper 128-byte mode, with 64-byte and 32-byte modes available for sub-tile widths.
The descriptor reaches nvvm.wgmma.mma_async as a plain i64 operand. The pattern that builds it sits in nvgpu.warpgroup.descriptor (see the nvgpu overview).
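The start_addr encoding is the one part of the descriptor that a caller can get wrong silently, because a misaligned origin simply loses its low bits. A minimal sketch of the encoding, assuming only what is stated above (a 14-bit field in the low bits of the word, holding the byte offset shifted right by 4); the helper names are illustrative:

```python
START_ADDR_MASK = 0x3FFF  # 14-bit field in the low bits of the descriptor


def encode_start_addr(smem_byte_offset: int) -> int:
    """Encode an SMEM byte offset into the 14-bit start_addr field."""
    if smem_byte_offset % 16 != 0:
        raise ValueError("SMEM tile origin must be 16-byte aligned")
    return (smem_byte_offset >> 4) & START_ADDR_MASK


def with_start_addr(raw: int, smem_byte_offset: int) -> int:
    """Return the 64-bit descriptor word with its start_addr field replaced."""
    return (raw & ~START_ADDR_MASK) | encode_start_addr(smem_byte_offset)
```

Rejecting misaligned offsets up front is a design choice of this sketch; the text above only states that the field cannot represent them.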
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
| nvvm.wgmma.fence.aligned | llvm.nvvm.wgmma.fence.sync.aligned |
| nvvm.wgmma.mma_async (m64n128k16, f32.f16.f16) | llvm.nvvm.wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 |
| nvvm.wgmma.mma_async (m64n256k32, f32.e4m3.e4m3) | llvm.nvvm.wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3 |
| nvvm.wgmma.commit.group.sync.aligned | llvm.nvvm.wgmma.commit.group.sync.aligned |
| nvvm.wgmma.wait.group.sync.aligned | llvm.nvvm.wgmma.wait.group.sync.aligned |
The intrinsic name is built by concatenating the shape, accumulator type, A type, and B type tokens. Tile shapes (m64nNkK) are enumerated: every N ∈ {8, 16, 24, ..., 256} exposes a separate intrinsic. The verifier rejects any N outside that lattice.
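A sketch of the name construction together with the N-lattice check, written to reproduce the two mma_async rows in the table. The function name is hypothetical, and K is passed in by the caller rather than derived.

```python
VALID_N = range(8, 257, 8)  # 8, 16, 24, ..., 256


def wgmma_mma_async_intrinsic(n: int, k: int, acc_t: str,
                              a_t: str, b_t: str) -> str:
    """Concatenate shape, accumulator type, A type, and B type tokens."""
    if n not in VALID_N:
        raise ValueError(f"N={n} is outside the m64nNkK lattice")
    return (
        "llvm.nvvm.wgmma.mma_async.sync.aligned."
        f"m64n{n}k{k}.{acc_t}.{a_t}.{b_t}"
    )
```

The lattice has 32 values of N, so each (K, type triple) combination accounts for 32 distinct intrinsic names.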
PTX Templates
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64nNkK.{accT}.{aT}.{bT}
    { %d0, %d1, ..., %d{accW-1} },
    %da, %db, %p,
    %scale_a, %scale_b,
    %trans_a, %trans_b;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned N;
%da and %db are the 64-bit SMEM descriptors. %p is the immediate scale-D predicate (compile-time 0 or 1) that selects between init (overwrite accumulator) and accumulate. %scale_a and %scale_b are the immediate +1/-1 selectors that bind to scaleIn. %trans_a and %trans_b are the immediate transpose flags bound to layoutA / layoutB. The accumulator register list %d0..%d{accW-1} expands per N and accumulator type.
For the canonical m64n128k16.f32.f16.f16 shape:
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
    { %f0, %f1, ..., %f63 },
    %da, %db, %p, 1, 1, %la, %lb;
Per-Arch Availability
| Op | SM floor | ptx_min |
|---|---|---|
| wgmma.fence.aligned | sm_90a | 8.0 |
| wgmma.mma_async.sync.aligned | sm_90a | 8.0 |
| wgmma.commit.group.sync.aligned | sm_90a | 8.0 |
| wgmma.wait.group.sync.aligned | sm_90a | 8.0 |
Plain sm_90 is rejected; the WGMMA family requires the architecture-qualified sm_90a variant. Blackwell (sm_100+) does not extend WGMMA; the Blackwell tensor-memory MMA path is nvvm.tcgen05.mma. See Per-SM Emission Templates — SM90 for the Hopper PTX templates and WGMMA Descriptor Round-Trip for the descriptor hex walk-through.
Verifier Invariants
- shape is m64nNkK with N ∈ {8, 16, ..., 256} and K = 256 / typeA_bits (8 for tf32, which occupies a 32-bit container).
- descA is i64 only when layoutA matches an SMEM tile; an A-in-registers fragment must be a typed struct.
- descB is always i64.
- Accumulator struct width equals N * sizeof(typeC) / 8, counted in 32-bit registers.
- scaleOut is a compile-time i1; runtime values are rejected.
- commit.group and wait.group carry the same wgmma_type and layout as the in-flight mma_async.
- Wait depth is non-negative and fits in 6 bits.
NVVM TMA Ops
Abstract
nvvm.cp.async.bulk.* covers the Hopper Tensor Memory Accelerator (TMA) surface: tile loads from global to shared, tile stores from shared to global, prefetches, reductions, group commit / wait, and the descriptor-fence helper that pairs with an in-SMEM CUtensorMap. Rather than enumerating one op per (mode, direction, cache-hint, multicast, rank) combination, this dialect carries a small set of canonical mnemonics; mode (tile / im2col / im2col_w / im2col_w_128), rank, cache-hint presence, and multicast presence are all encoded as attributes the printer reads at PTX emit time. See TMA Tensormap and cp.async.bulk Codegen for the per-template emission catalog and Lowering: nvgpu / gpu to NVVM — TMA Async Load for the operand-slot mapping.
TMA descriptors live in global memory as 128-byte CUtensorMap structs encoded by the CUDA driver. The device-side ops in this family consume the descriptor as an opaque global pointer, with cache-hint and multicast attributes wiring into optional intrinsic operand slots.
Op Roster
| Sub-family | Count | Mnemonic stem |
|---|---|---|
| Tile load (global → shared via cluster) | 1 op, rank/mode in attributes | nvvm.cp.async.bulk.tensor.shared.cluster.global |
| Tile load (cta-direct) | 1 op, rank/mode in attributes | nvvm.cp.async.bulk.tensor.shared.cta.global |
| Tile store (shared → global) | 2 (base + ext) | nvvm.cp.async.bulk.tensor.global.shared.cta, …ext |
| Reduce | 1 op, redop in attribute | nvvm.cp.async.bulk.tensor.reduce |
| Prefetch | 1 op | nvvm.cp.async.bulk.tensor.prefetch |
| Group control | 2 | nvvm.cp.async.bulk.commit.group, nvvm.cp.async.bulk.wait_group |
| Descriptor copy / fence | 1 | nvvm.tensormap.cp_fenceproxy |
Each canonical mnemonic above parameterises rank, mode, cache-hint presence, and multicast presence through op attributes. The NVVM-to-LLVM printer expands a single dialect op into one of the family of llvm.nvvm.cp.async.bulk.tensor.{1..5}d.<dir>.<mode> intrinsics at lowering time; the IR layer stays compact.
Operand Tables
nvvm.cp.async.bulk.tensor.shared.cluster.global (rank in rank attribute, mode = #tile)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dstAddr | ptr addrspace(3) | SMEM destination tile origin |
| operand 1 | tensorMap | ptr (global, opaque) | 128-byte CUtensorMap pointer |
| operands 2..N+1 | coords | variadic i32, rank N | tile origin in tensor space |
| operand N+2 | barrier | ptr addrspace(3) | mbarrier slot for expect-tx completion |
| operand N+3 | multicastMask | optional i16 | cluster multicast bitmap (positional slot in the intrinsic call) |
| operand N+4 | cacheHint | optional i64 | L2 cache hint (positional slot in the intrinsic call) |
| attribute | cacheHintEnable | UnitAttr | gates the .L2::cache_hint modifier |
| attribute | multicastEnable | UnitAttr | gates the .multicast modifier |
| attribute | mode | enum load_mode | tile / im2col / im2col_w / im2col_w_128 |
The two UnitAttrs gate the corresponding optional operand. When cacheHintEnable is absent the cacheHint operand position is left empty in the LLVM intrinsic call; when present the operand must be supplied. The same pattern applies to multicastEnable and multicastMask. See Lowering: TMA Async Load — Operand Mapping (rank N) for the operand-slot mapping the nvgpu-to-nvvm rewriter performs.
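The gating rule can be sketched as a positional operand builder that follows the operand table for the cluster load. This is an illustration, not the rewriter itself: the function name is hypothetical, operands are represented by their table names, and None stands for an optional position that is left empty.

```python
def tma_load_operands(rank: int, multicast_enable: bool = False,
                      cache_hint_enable: bool = False) -> list:
    """Positional operand list for the shared.cluster.global tile load."""
    if not 1 <= rank <= 5:
        raise ValueError("TMA tensor rank must be in 1..5")
    ops = ["dstAddr", "tensorMap"]                # operands 0 and 1
    ops += [f"coord{i}" for i in range(rank)]     # operands 2..N+1
    ops.append("barrier")                         # operand N+2
    # Optional operands keep their positions; a missing UnitAttr leaves None.
    ops.append("multicastMask" if multicast_enable else None)  # operand N+3
    ops.append("cacheHint" if cache_hint_enable else None)     # operand N+4
    return ops
```

Keeping both optional positions in the list, even when empty, means the cacheHint index never shifts depending on whether multicast is enabled.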
nvvm.cp.async.bulk.tensor.{N}d.global.shared.cta.tile
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMap | ptr (global, opaque) | 128-byte CUtensorMap |
| operands 1..N | coords | variadic i32, rank N | tile origin |
| operand N+1 | srcAddr | ptr addrspace(3) | SMEM source tile |
| operand N+2 | cacheHint | optional i64 | L2 cache hint |
| attribute | cacheHintEnable | UnitAttr | gates the .L2::cache_hint modifier |
No barrier: the producer issues the store and continues; the consumer side observes completion via cp.async.bulk.wait_group.
nvvm.cp.async.bulk.tensor.{N}d.global.shared.cta.tile.reduce
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMap | ptr (global) | CUtensorMap |
| operands 1..N | coords | variadic i32, rank N | tile origin |
| operand N+1 | srcAddr | ptr addrspace(3) | SMEM source |
| operand N+2 | cacheHint | optional i64 | L2 cache hint |
| attribute | redop | enum tma_redux_kind | add / min / max / inc / dec / and / or / xor |
nvvm.cp.async.bulk.tensor.{N}d.tile.prefetch
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMap | ptr (global) | CUtensorMap |
| operands 1..N | coords | variadic i32, rank N | tile origin |
| attribute | mode | enum load_mode | tile / im2col (matches the load form) |
| attribute | cacheHintEnable | UnitAttr (optional) | gates a cache-hint operand |
nvvm.cp.async.bulk.commit.group / nvvm.cp.async.bulk.wait_group
| Position | Name | Type | Notes |
|---|---|---|---|
| (wait_group) operand 0 | groupDepth | i32 | maximum number of most recently committed groups allowed to remain in flight |
| (commit.group) | — | — | no operands |
nvvm.tensormap.cp.async.shared
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | ptr addrspace(3) | SMEM destination descriptor mailbox |
| operand 1 | src | ptr (global) | source descriptor |
nvvm.tensormap.replace.tile.global_address (and .box_dim, .element_stride, .box_corner, .elem_type, .swizzle, .fill)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | descriptor | ptr addrspace(3) | SMEM-resident descriptor being mutated |
| operand 1 | value | i64 / i32 / enum | replacement value for the named field |
| attribute | ord | i32 | rank index for box_dim / element_stride |
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
| nvvm.cp.async.bulk.tensor.shared.cluster.global | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.shared.cluster.global.{tile,im2col,im2col_w,im2col_w_128} (rank/mode in attrs) |
| nvvm.cp.async.bulk.tensor.shared.cta.global | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.shared.cta.global.tile |
| nvvm.cp.async.bulk.tensor.global.shared.cta | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.global.shared.cta.tile |
| nvvm.cp.async.bulk.tensor.reduce | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.global.shared.cta.tile.reduce.{redop} |
| nvvm.cp.async.bulk.tensor.prefetch | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.tile.prefetch |
| nvvm.cp.async.bulk.commit.group | llvm.nvvm.cp.async.bulk.commit.group |
| nvvm.cp.async.bulk.wait_group | llvm.nvvm.cp.async.bulk.wait_group |
| nvvm.tensormap.cp_fenceproxy | llvm.nvvm.cp.async.bulk.tensor.shared.cluster.tensormap.cta paired with llvm.nvvm.fence.proxy.tensormap.generic.release.cta |
The reduction intrinsic concatenates the redop name into the intrinsic ID; eight distinct intrinsics exist per rank.
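The name concatenation can be sketched in a few lines of Python. This is an illustration only: the helper name `reduce_intrinsic_name` is hypothetical, and the name pattern is copied from the mapping table above rather than taken from the Tileiras lowering code.

```python
# Illustrative sketch: enumerate the reduction intrinsic names implied by the
# mapping table. The helper name is hypothetical, not a Tileiras API.
REDOPS = ("add", "min", "max", "inc", "dec", "and", "or", "xor")
RANKS = range(1, 6)  # the table lists ranks 1..5


def reduce_intrinsic_name(rank: int, redop: str) -> str:
    """Build the intrinsic name for one (rank, redop) combination."""
    if rank not in RANKS:
        raise ValueError(f"rank must be 1..5, got {rank}")
    if redop not in REDOPS:
        raise ValueError(f"unknown redop: {redop!r}")
    return (
        f"llvm.nvvm.cp.async.bulk.tensor.{rank}d"
        f".global.shared.cta.tile.reduce.{redop}"
    )


# Eight names per rank, five ranks.
ALL_REDUCE_INTRINSICS = [
    reduce_intrinsic_name(rank, op) for rank in RANKS for op in REDOPS
]
```

With eight redop values and ranks 1 through 5, the enumeration yields 40 names in total.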
PTX Templates
cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes
[%dst], [%tmap, {%c0, %c1, ..., %c{N-1}}], [%mbar];
cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.multicast::cluster
[%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar], %multicastMask;
cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.L2::cache_hint
[%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar], %hint;
cp.async.bulk.tensor.{N}d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes
[%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar];
cp.async.bulk.tensor.{N}d.global.shared::cta.tile
[%tmap, {%c0, ..., %c{N-1}}], [%src];
cp.async.bulk.tensor.{N}d.global.shared::cta.tile.{redop}.bulk_group
[%tmap, {%c0, ..., %c{N-1}}], [%src];
cp.async.bulk.prefetch.tensor.{N}d.global.tile
[%tmap, {%c0, ..., %c{N-1}}];
cp.async.bulk.commit_group;
cp.async.bulk.wait_group N;
tensormap.cp.async.shared::cta.bulk_group [%dst], [%src];
tensormap.replace.tile.global_address [%tmap], %addr;
tensormap.replace.tile.box_dim.k [%tmap], %dim;
tensormap.replace.tile.element_stride.k [%tmap], %stride;
tensormap.replace.tile.elemtype [%tmap], %elt_id;
tensormap.replace.tile.swizzle [%tmap], %mode;
The multicast and L2::cache_hint suffix variants are picked per template by the presence flags. The reductions all flow through a single dialect op (nvvm.cp.async.bulk.tensor.reduce — only this mnemonic is interned) whose redop enum attribute selects between {add, min, max, inc, dec, and, or, xor}; the LLVM-intrinsic name baked at lowering time enumerates eight distinct intrinsics, one per PTX modifier.
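A minimal sketch of how the presence flags could select a load template, written in Python for illustration. The function name `tma_load_ptx` is hypothetical. The single-flag outputs follow the templates listed above; the form with both flags set is not listed on this page, so its suffix and operand order here are an assumption.

```python
# Illustrative sketch: pick a TMA load template from presence flags.
# tma_load_ptx is a hypothetical helper, not part of Tileiras.
def tma_load_ptx(rank: int, mode: str = "tile",
                 multicast: bool = False, cache_hint: bool = False) -> str:
    if not 1 <= rank <= 5:
        raise ValueError(f"rank must be 1..5, got {rank}")
    if mode == "im2col" and rank < 3:
        raise ValueError("im2col forms require rank 3, 4, or 5")

    mnemonic = (
        f"cp.async.bulk.tensor.{rank}d.shared::cluster.global.{mode}"
        ".mbarrier::complete_tx::bytes"
    )
    coords = ", ".join(f"%c{i}" for i in range(rank))
    operands = ["[%dst]", f"[%tmap, {{{coords}}}]", "[%mbar]"]

    # Each flag appends its suffix and its trailing operand.
    # Assumption: when both flags are set, multicast precedes the cache hint.
    if multicast:
        mnemonic += ".multicast::cluster"
        operands.append("%multicastMask")
    if cache_hint:
        mnemonic += ".L2::cache_hint"
        operands.append("%hint")

    return f"{mnemonic} {', '.join(operands)};"
```

For example, `tma_load_ptx(2, cache_hint=True)` produces the `.L2::cache_hint` template with `%hint` as the trailing operand.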
Inline-PTX Variants
A few TMA paths reach PTX through llvm.inline_asm because no LLVM intrinsic exists at the snapshot revision Tileiras tracks. The most common is the im2col cache-hint store variant:
asm template: "cp.async.bulk.tensor.{N}d.global.shared::cta.im2col.bulk_group.L2::cache_hint
[%tmap, {%c0, ..., %c{N-1}}], [%src], %hint;"
constraints : "l,l,r,r,r,...,l"
l is the 64-bit descriptor pointer and the cache-hint operand; r is each 32-bit coordinate; the source SMEM pointer is also l (an opaque pointer). Tileiras retains the upstream constraint string verbatim; reimplementers must not rearrange operand order, because the NVPTX backend matches positional registers against the template's % slots.
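As a rough illustration, the constraint string for a given rank can be generated as follows. The helper name `im2col_store_constraints` is hypothetical. The sketch assumes the two leading `l` slots are the descriptor and source pointers (the page does not say which comes first), followed by one `r` per coordinate and a final `l` for the cache hint.

```python
# Illustrative sketch: rebuild the "l,l,r,...,l" constraint string by rank.
# im2col_store_constraints is a hypothetical helper name.
def im2col_store_constraints(rank: int) -> str:
    if rank not in (3, 4, 5):
        raise ValueError("im2col forms require rank 3, 4, or 5")
    slots = ["l", "l"]        # 64-bit descriptor and SMEM source pointers
    slots += ["r"] * rank     # one 32-bit register per coordinate
    slots.append("l")         # 64-bit cache-hint operand
    return ",".join(slots)
```

Because the slots are positional, any generator along these lines has to emit them in exactly the order the template expects.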
Per-Arch Availability
| Op family | SM floor | ptx_min |
|---|---|---|
| Tile load / store / reduce | sm_90 | 8.0 |
| Im2col forms | sm_90 | 8.0 |
| Multicast / cluster forms | sm_90 (sm_90a for cluster mode) | 8.0 |
| Prefetch | sm_90 | 8.0 |
| Group commit / wait | sm_90 | 8.0 |
| tensormap.cp.async.shared | sm_90 | 8.3 |
| tensormap.replace.* | sm_90 | 8.3 |
Blackwell extends the cache-hint and OOB-fill modes but keeps the same op surface and the same intrinsic shape; verification accepts sm_100+ for every op in the family. See TMA Descriptor Shape for the CUtensorMap layout and cp.async.bulk Template Catalog for the per-rank PTX templates.
Verifier Invariants
- tensorMap is a global-memory pointer; the descriptor itself is opaque.
- dstAddr (loads) is in addr-space 3; srcAddr (stores) is in addr-space 3.
- Coordinate operand count equals the rank in the op mnemonic.
- multicastEnable and cacheHintEnable agree with the operand list: presence of the attribute requires the operand to be supplied.
- For reductions, redop is one of the eight legal values.
- For im2col forms, the rank is 3, 4, or 5; lower ranks are rejected.
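The invariants above can be restated as a small checker. This is a sketch only: the `TmaOp` record and `verify_tma` function are hypothetical stand-ins for the real op and verifier, and the error strings are invented for illustration.

```python
# Illustrative sketch of the TMA verifier invariants listed above.
# TmaOp and verify_tma are hypothetical; error messages are not Tileiras text.
from dataclasses import dataclass
from typing import List, Optional

REDOPS = {"add", "min", "max", "inc", "dec", "and", "or", "xor"}


@dataclass
class TmaOp:
    rank: int                      # rank encoded in the op mnemonic
    num_coords: int                # number of coordinate operands supplied
    smem_addrspace: int            # address space of dstAddr / srcAddr
    mode: str = "tile"
    multicast_enable: bool = False
    has_multicast_mask: bool = False
    cache_hint_enable: bool = False
    has_cache_hint: bool = False
    redop: Optional[str] = None    # set only for reductions


def verify_tma(op: TmaOp) -> List[str]:
    errors = []
    if op.smem_addrspace != 3:
        errors.append("SMEM pointer must be in addr-space 3")
    if op.num_coords != op.rank:
        errors.append("coordinate count must equal the rank")
    if op.multicast_enable and not op.has_multicast_mask:
        errors.append("multicastEnable requires multicastMask")
    if op.cache_hint_enable and not op.has_cache_hint:
        errors.append("cacheHintEnable requires cacheHint")
    if op.redop is not None and op.redop not in REDOPS:
        errors.append("redop must be one of the eight legal values")
    if op.mode.startswith("im2col") and op.rank not in (3, 4, 5):
        errors.append("im2col forms require rank 3, 4, or 5")
    return errors
```

Returning a list of messages rather than raising on the first failure keeps the sketch easy to test against several violations at once.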
NVVM tcgen05 Ops
Abstract
nvvm.tcgen05.* covers the Blackwell (sm_100+) tensor-memory family. Tensor memory (TMEM) is a per-SM scratchpad allocated and freed through the dialect's alloc / dealloc ops, accessed through ld / st and the long-K MMA path, and torn down before the kernel exits. The roster below is the only path to TMEM from MLIR; Hopper's WGMMA family (nvvm.wgmma.*) does not reach Blackwell tensor cores. See tcgen05 Tensor Memory Model for the TMEM allocation discipline and the variant taxonomy, and tcgen05 Machine Validation for the codegen-side verifier rules.
tcgen05.mma carries a control-word modifier table that selects element-type interpretation, sparsity, block-scaling, and collector behaviour. Block-scaled UMMA exposes scale-vector size and scale-format enums; the cross-product produces several thousand legal PTX forms from a single dialect op.
Op Roster
The "Properties slots used" column tracks where each op stores its attribute payload in the inline Properties record; see Properties Blob — Per-op-family slot maps for the exact byte offsets.
| Op | Role | Properties slots used |
|---|---|---|
| nvvm.tcgen05.alloc / .shared | request a TMEM range | cta_group |
| nvvm.tcgen05.dealloc | release a TMEM range | cta_group |
| nvvm.tcgen05.relinquish_alloc_permit | drop the alloc-permit token | cta_group |
| nvvm.tcgen05.ld | load from TMEM to registers | (none — operand-typed) |
| nvvm.tcgen05.st | store from registers to TMEM | (none — operand-typed) |
| nvvm.tcgen05.cp | copy TMEM tile across CTAs | multicast, shape, src_fmt |
| nvvm.tcgen05.mma | MMA into TMEM accumulator | typeA/cType, collectorA, scale_d, layout-bits |
| nvvm.tcgen05.mma.sp | sparse-input variant of above | same + sparse metadata operand |
| nvvm.tcgen05.mma.block_scale | block-scaled variant | cType, collectorA, scale_d, layout, kindA, kindB |
| nvvm.tcgen05.mma.sp.block_scale | sparse + block-scaled | merge of sparse and block-scaled fields |
| nvvm.tcgen05.mma.ws | weight-stationary variant | operand-only |
| nvvm.tcgen05.commit / .commit.arrive | close a group; optionally signal a barrier | cta_group |
| nvvm.tcgen05.wait | wait on load or store group | wait_kind |
| nvvm.tcgen05.shift | shift register fragment across TMEM | operand-only |
| nvvm.tcgen05.fence | producer / consumer fence | tcgen05_fence (before / after) |
Operand Tables
nvvm.tcgen05.alloc[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | ptr addrspace(3) (or generic) | output slot for the allocated TMEM base |
| operand 1 | n | i32 | column count to allocate (must be a multiple of 32) |
| attribute | cta_group | enum tcgen05_group | cta_1 or cta_2 for 1-CTA or 2-CTA cooperative allocation |
nvvm.tcgen05.dealloc / .relinquish_alloc_permit
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tmem_base | i32 | TMEM column index returned by alloc |
| operand 1 (dealloc) | n | i32 | column count being released |
| attribute | cta_group | enum tcgen05_group | matches the alloc's cta_group |
nvvm.tcgen05.ld / nvvm.tcgen05.st
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tmem_addr | i32 | TMEM column address |
| operand 1 (st) | frag | !llvm.struct<(i32, ...)> | register fragment to store |
| result 0 (ld) | frag | !llvm.struct<(i32, ...)> | register fragment loaded |
| attribute (encoded into mnemonic) | shape | m32n8 / m32n16 / m32n32 / ... | tile shape that fixes the fragment width |
| attribute (encoded into mnemonic) | num | x1 / x2 / x4 / ... | replication factor |
| attribute (encoded into mnemonic) | pack | pack / unpack | per-thread packing mode |
nvvm.tcgen05.cp
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tmem_dst | i32 | destination TMEM column |
| operand 1 | tmem_src | i32 | source TMEM column |
| attribute | shape | enum tcgen05_cp_shape | tile shape selector |
| attribute | multicast | enum tcgen05_cp_multicast | none / warp_x2 / warp_x4 |
| attribute | src_fmt | enum tcgen05_cp_src_fmt | source element format |
nvvm.tcgen05.mma (dense)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tmem_d | i32 | TMEM accumulator column |
| operand 1 | desc_a | i64 | SMEM descriptor for A, or TMEM column for a_in_tmem form |
| operand 2 | desc_b | i64 | SMEM descriptor for B |
| operand 3 | scale | i32 | accumulator-update scale (compile-time 0 or 1) |
| attribute | kind | enum tcgen05_mma_kind | f8f6f4 / mxf4 / mxf4nvf4 / f16 / tf32 / i8 |
| attribute | cta_group | enum | cta_1 / cta_2 |
| attribute | collectorA | enum tcgen05_mma_collectorop | discard / fill / use / last_use |
| attribute | scale_d | enum | controls how scale selects between init and accumulate |
| attribute | layout | enum TmemLayout | TMEM tile layout |
nvvm.tcgen05.mma.sp (sparse)
Adds one operand:
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 4 | sparse_metadata | i32 | TMEM column holding the sparse selectors |
nvvm.tcgen05.mma.block_scale
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0..3 | same as mma | — | same accumulator + descriptors + scale |
| operand 4 | scale_a_vec | i32 | TMEM column for A scale vector |
| operand 5 | scale_b_vec | i32 | TMEM column for B scale vector |
| attribute | kindA / kindB | enum block_scale_format | E8M0 / E4M3FN |
| attribute | scale_vec_size | enum scale_vec_size | 16 or 32 |
The (atom_K, vecSize) pairs accepted by the verifier are documented on the cute_nvgpu MMA atoms page (SM100 UMMA block-scaled).
nvvm.tcgen05.commit / .commit.arrive
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 (commit.arrive) | barrier | ptr addrspace(3) | mbarrier slot to signal |
| attribute | cta_group | enum | matches the in-flight MMA's cta_group |
nvvm.tcgen05.wait
| Position | Name | Type | Notes |
|---|---|---|---|
| attribute | wait_kind | enum tcgen05_wait | load (drain TMEM loads) or store (drain TMEM stores) |
nvvm.tcgen05.fence
| Position | Name | Type | Notes |
|---|---|---|---|
| attribute | tcgen05_fence | enum | before (producer) or after (consumer) |
Control-Word Modifier Table
The PTX form tcgen05.mma.sync.aligned.{kind}.cta_group::{1,2}.{layout}.{collector} packs several modifiers into the mnemonic. See tcgen05 Tensor Memory Model — Control Word Layout for the bit-level encoding and tcgen05 Machine Validation — Control-Word Bit Layout for the codegen-side checks. The table below pairs each modifier with its NVVM attribute and the legal value range.
| PTX modifier | NVVM attribute | Values |
|---|---|---|
| {kind} | kind | f8f6f4 / mxf4 / mxf4nvf4 / f16 / tf32 / i8 |
| cta_group::{1,2} | cta_group | cta_1 (single-CTA) / cta_2 (cluster-coop 2-CTA) |
| {layout} | layout | mn (row-major) / kn (canonical K-major) |
| {collector} | collectorA | discard / fill / use / last_use |
| .sp | (op mnemonic carries .sp) | sparse (.sp) vs dense |
| .block_scale | (op mnemonic carries .block_scale) | block-scaled vs unscaled |
| .scale::vec::{16,32} | scale_vec_size | 16 / 32 |
| .{sfA}.{sfB} | kindA / kindB | scale-factor element format |
The collector modifier controls how the MMA pipeline reuses register-file data across iterations: discard evicts on commit, fill accumulates without evicting, use consumes a previously-filled buffer, last_use consumes and then evicts.
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
| nvvm.tcgen05.alloc (addrspace=3, shared SMEM dest) | llvm.nvvm.tcgen05.alloc.cta_group.{1,2}.shared |
| nvvm.tcgen05.alloc (addrspace=0/1, generic/global dest) | llvm.nvvm.tcgen05.alloc.cta_group.{1,2} |
| nvvm.tcgen05.dealloc | llvm.nvvm.tcgen05.dealloc.cta_group.{1,2} |
| nvvm.tcgen05.ld | llvm.nvvm.tcgen05.ld.{shape}.{num} |
| nvvm.tcgen05.st | llvm.nvvm.tcgen05.st.{shape}.{num} |
| nvvm.tcgen05.mma | llvm.nvvm.tcgen05.mma.{kind}.cta_group.{1,2}.{collector} |
| nvvm.tcgen05.mma.sp | llvm.nvvm.tcgen05.mma.sp.{kind}.cta_group.{1,2}.{collector} |
| nvvm.tcgen05.mma.block_scale | llvm.nvvm.tcgen05.mma.block_scale.{kind}.{scale_vec}.cta_group.{1,2}.{collector} |
| nvvm.tcgen05.cp | llvm.nvvm.tcgen05.cp.{shape}.{multicast}.{src_fmt} |
| nvvm.tcgen05.commit | llvm.nvvm.tcgen05.commit.cta_group.{1,2} |
| nvvm.tcgen05.commit.arrive | llvm.nvvm.tcgen05.commit.arrive.cta_group.{1,2} |
| nvvm.tcgen05.wait | llvm.nvvm.tcgen05.wait.{load,store} |
| nvvm.tcgen05.fence | llvm.nvvm.tcgen05.fence.{before,after}.thread |
PTX Templates
tcgen05.alloc.cta_group::{1,2}.shared::cta.b32 [%tmem], %n;
tcgen05.dealloc.cta_group::{1,2}.b32 [%tmem], %n;
tcgen05.relinquish_alloc_permit.cta_group::{1,2};
tcgen05.ld.sync.aligned.{shape}.{num}.b32 {%r0, %r1, ...}, [%tmem];
tcgen05.st.sync.aligned.{shape}.{num}.b32 [%tmem], {%r0, %r1, ...};
tcgen05.mma.sync.aligned.{kind}.cta_group::{1,2}.{layout}.{collector}
[%tmem_d], %desc_a, %desc_b, %scale;
tcgen05.mma.sp.sync.aligned.{kind}.cta_group::{1,2}.{layout}.{collector}
[%tmem_d], %desc_a, %desc_b, [%sparse_meta], %scale;
tcgen05.mma.block_scale.sync.aligned.{kind}.scale::vec::{16,32}.cta_group::{1,2}.{layout}.{collector}.{sfA}.{sfB}
[%tmem_d], %desc_a, %desc_b, [%sf_a], [%sf_b], %scale;
tcgen05.cp.{shape}.{multicast}.{src_fmt} [%tmem_dst], [%tmem_src];
tcgen05.commit.cta_group::{1,2};
tcgen05.commit.arrive.cta_group::{1,2}.b64 [%mbar];
tcgen05.wait::{load,store}.sync.aligned;
tcgen05.fence::{before,after}.thread;
The descriptor operands %desc_a and %desc_b are 64-bit SMEM descriptors when the operand is SMEM-resident, or TMEM column indices when the operand is TMEM-resident.
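The way the modifiers compose into an MMA mnemonic can be sketched as below. The function `tcgen05_mma_mnemonic` is a hypothetical helper. It follows the dense, sparse, and block-scaled templates listed above and refuses the sparse plus block-scaled combination because that template is not shown on this page. The scale-factor tokens are passed through as given, since their exact PTX spelling is not stated here.

```python
# Illustrative sketch: compose a tcgen05.mma mnemonic from its modifiers.
# tcgen05_mma_mnemonic is a hypothetical helper, not a Tileiras API.
KINDS = ("f8f6f4", "mxf4", "mxf4nvf4", "f16", "tf32", "i8")
LAYOUTS = ("mn", "kn")
COLLECTORS = ("discard", "fill", "use", "last_use")


def tcgen05_mma_mnemonic(kind, cta_group, layout, collector, *,
                         sparse=False, block_scale=None):
    """block_scale is None or a (vec_size, sf_a, sf_b) tuple."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if cta_group not in (1, 2):
        raise ValueError("cta_group must be 1 or 2")
    if layout not in LAYOUTS:
        raise ValueError(f"unknown layout: {layout!r}")
    if collector not in COLLECTORS:
        raise ValueError(f"unknown collector: {collector!r}")
    if sparse and block_scale is not None:
        raise ValueError("sparse + block-scaled template is not listed here")

    parts = ["tcgen05", "mma"]
    if sparse:
        parts.append("sp")
    if block_scale is not None:
        parts.append("block_scale")
    parts += ["sync", "aligned", kind]
    if block_scale is not None:
        vec_size, sf_a, sf_b = block_scale
        if vec_size not in (16, 32):
            raise ValueError("scale vector size must be 16 or 32")
        parts.append(f"scale::vec::{vec_size}")
    parts += [f"cta_group::{cta_group}", layout, collector]
    if block_scale is not None:
        parts += [sf_a, sf_b]   # scale-factor formats, spelled by the caller
    return ".".join(parts)
```

For instance, `tcgen05_mma_mnemonic("f16", 1, "mn", "discard")` gives the dense form with single-CTA grouping.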
Inline-PTX Variants
nvvm.tcgen05.cp reaches PTX through llvm.inline_asm when the multicast / src_fmt combination has no matching LLVM intrinsic at the snapshot revision Tileiras tracks:
asm template: "tcgen05.cp.{shape}.{multicast}.{src_fmt} [%dst], [%src];"
constraints : "r,r"
The two r slots are the destination and source TMEM column indices. The shape, multicast, and src_fmt tokens are baked into the template literal at lowering time; the constraint string never changes.
Per-Arch Availability
| Op family | SM floor | ptx_min |
|---|---|---|
| alloc / dealloc / relinquish_alloc_permit | sm_100a | 8.6 |
| ld / st | sm_100a | 8.6 |
| cp | sm_100a (+ sm_100f for the f-suffixed variants) | 8.6 |
| mma / mma.sp | sm_100a | 8.6 |
| mma.block_scale / mma.sp.block_scale | sm_100a | 8.6 |
| commit / commit.arrive / wait / fence | sm_100a | 8.6 |
sm_100a is the architecture-qualified Blackwell target; the family is also legal on sm_100f for the few f-suffixed copy variants. Datacenter Blackwell (sm_100) is the only sub-arch the dialect exposes; Blackwell Ultra (sm_103) and Jetson Thor (sm_110) reuse the same op surface. See Per-SM Emission Templates — SM100 / SM103 for the codegen-side templates and NVPTX Subtarget Feature Matrix for the feature gating.
Verifier Invariants
- TMEM column counts are multiples of 32.
- cta_group agrees between matched alloc / dealloc and between the in-flight MMA and its commit / wait.
- scale is a compile-time immediate.
- Block-scaled (atom_K, vecSize) matches one of (32, 32), (64, 16), (64, 32); other combinations are rejected by the per-combo expectation diagnostics listed under nv_tileas Verifiers — Block-Scaled MMA Verification (e.g. "expects A/B element types to be Float4E2M1FNType and sfa/sfb element types to be Float8E8M0FNUType when (atom_K=64 && vecSize=32)").
- Sparse metadata column must be valid TMEM and have a non-zero stride.
- Accumulator element type is f32 for every block-scaled variant.
- kindA and kindB agree (no mixed scale-factor formats).
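A few of these invariants reduce to simple membership and equality checks, sketched here in Python. The function names and error strings are hypothetical and do not reproduce the real diagnostics. The column check only tests the multiple-of-32 rule, because this page states no other bound.

```python
# Illustrative sketch of selected tcgen05 verifier invariants.
# Function names and messages are hypothetical.
from typing import List

LEGAL_BLOCK_SCALE_PAIRS = {(32, 32), (64, 16), (64, 32)}  # (atom_K, vecSize)


def verify_tmem_columns(n: int) -> bool:
    """TMEM column counts must be multiples of 32."""
    return n % 32 == 0


def verify_block_scaled(atom_k: int, vec_size: int,
                        kind_a: str, kind_b: str, acc_type: str) -> List[str]:
    errors = []
    if (atom_k, vec_size) not in LEGAL_BLOCK_SCALE_PAIRS:
        errors.append(f"illegal (atom_K, vecSize) = ({atom_k}, {vec_size})")
    if kind_a != kind_b:
        errors.append("kindA and kindB must agree")
    if acc_type != "f32":
        errors.append("block-scaled accumulator must be f32")
    return errors
```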
NVVM mbarrier Ops
Abstract
nvvm.mbarrier.* covers the sm_80+ mbarrier (memory barrier) state machine — a 64-bit shared-memory slot that counts arrivals, tracks an expected-transaction byte count, advances a phase parity, and lets warps wait for the slot to flip. The ops in this family each implement one transition of that state machine and emit the matching mbarrier.* PTX instruction. See mbarrier State Machine for the cross-op semantics and mbarrier Emission for the codegen side.
Two slot variants exist for almost every op: a generic-pointer form for completeness and a .shared form for the common case where the slot lives in shared memory. Lowering picks the .shared form whenever the operand address space is 3; the generic form remains so kernels that explicitly cast through __cvta_to_shared round-trip.
State Machine
Each mbarrier slot carries four fields packed into a 64-bit word:
| Field | Bits | Role |
|---|---|---|
| participant_count | low 20 | total arrivals that complete one phase |
| pending_count | mid 20 | arrivals remaining before the phase completes |
| tx_count | next 20 | bytes still expected (for TMA expect-tx variant) |
| phase | high 1 | toggles each time the phase completes |
The state transitions are:
| Op | Transition |
|---|---|
| init | participant_count := N, pending_count := N, tx_count := 0, phase := 0 |
| arrive | pending_count -= 1; if zero, complete the phase: flip phase, reset pending_count := participant_count |
| arrive.nocomplete | pending_count -= 1 but suppress completion |
| arrive.expect_tx | arrive plus tx_count += k (for the TMA producer side) |
| try_wait.parity | potentially blocking: may wait up to a time limit, then returns true if the phase with the expected parity has completed |
| test.wait | non-blocking: returns true if the phase identified by the token has completed |
| inval | mark the slot uninitialised |
The expect_tx op is the producer-side handshake for TMA tile loads: the consumer initialises the slot with the participant count, the producer issues arrive.expect_tx with the byte count the load will deliver, the TMA load retires those bytes against tx_count through its mbarrier::complete_tx::bytes completion, and the consumer waits on the phase flip. See mbarrier State Machine — Phase Parity for the parity bit semantics and Kinds: Ordinary, Transaction, Cluster for the per-kind transitions.
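The transitions can be exercised with a small software model. This is a sketch under stated assumptions, not the hardware behaviour: the class `MBarrier` is hypothetical, the token is modelled as the phase value at arrival time, and the phase is assumed to complete only when both pending_count and tx_count reach zero. Fields are kept as separate integers rather than packed into one 64-bit word.

```python
# Illustrative software model of the mbarrier transitions described above.
# MBarrier is hypothetical; the completion rule (pending == 0 and tx == 0)
# and the token representation are modelling assumptions.
from dataclasses import dataclass

MAX_20_BIT = (1 << 20) - 1  # counts are 20-bit fields


@dataclass
class MBarrier:
    participant_count: int = 0
    pending_count: int = 0
    tx_count: int = 0
    phase: int = 0

    def init(self, n: int) -> None:
        if not 0 < n <= MAX_20_BIT:
            raise ValueError("participant count must fit in 20 bits")
        self.participant_count = n
        self.pending_count = n
        self.tx_count = 0
        self.phase = 0

    def _maybe_complete(self) -> None:
        if self.pending_count == 0 and self.tx_count == 0:
            self.phase ^= 1                      # flip the phase parity
            self.pending_count = self.participant_count

    def arrive(self, count: int = 1) -> int:
        if count > self.pending_count:
            raise ValueError("more arrivals than pending participants")
        token = self.phase                       # phase observed at arrival
        self.pending_count -= count
        self._maybe_complete()
        return token

    def arrive_expect_tx(self, nbytes: int) -> int:
        self.tx_count += nbytes                  # register bytes before arriving
        return self.arrive()

    def complete_tx(self, nbytes: int) -> None:
        self.tx_count -= nbytes                  # bytes delivered by the copy
        self._maybe_complete()

    def test_wait(self, token: int) -> bool:
        return self.phase != token               # True once the phase advanced
```

In this model a single-participant barrier that expects 128 bytes stays incomplete after `arrive_expect_tx(128)` and flips phase only after `complete_tx(128)`.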
Op Roster
| Op | Variants |
|---|---|
| nvvm.mbarrier.init | generic + .shared |
| nvvm.mbarrier.inval | generic + .shared |
| nvvm.mbarrier.arrive | generic + .shared |
| nvvm.mbarrier.arrive.nocomplete | generic + .shared |
| nvvm.mbarrier.arrive.expect_tx | generic + .shared |
| nvvm.mbarrier.test.wait | generic + .shared |
| nvvm.mbarrier.wait | (one op) — blocking phase wait |
| nvvm.mbarrier.wait.parity | (one op) — phase-parity blocking wait |
| nvvm.mbarrier.try_wait.parity | generic + .shared + .timelimit variant |
| nvvm.mbarrier.try_wait.timelimit | (one op) — try-wait with deadline |
| nvvm.fence.mbarrier.init | (one op) — proxy fence before init |
| nvvm.mbarrier.txn / nvvm.mbarrier.txn.cta | tx-count transaction handles |
Most of these ops carry a .shared variant (the address-space split adds matching .shared entries); the wait family and the transaction-handle ops are single-variant.
Operand Tables
nvvm.mbarrier.init[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) (.shared) or generic ptr | mbarrier slot |
| operand 1 | count | i32 | participant count |
nvvm.mbarrier.inval[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
nvvm.mbarrier.arrive[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | count | optional i32 | arrival weight (default 1) |
| result 0 | token | i64 | phase token consumed by test.wait |
The "drop participant" variant of arrive (mbarrier.arrive.drop in PTX) is not surfaced as a separate dialect op in this binary's string table; the upstream way to reach it is through nvvm.mbarrier.arrive with a count attribute encoding the drop semantics, or through inline asm.
nvvm.mbarrier.arrive.expect_tx[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | txCount | i32 | expect-tx byte count |
| result 0 | token | i64 | phase token |
nvvm.mbarrier.test.wait[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | token | i64 | from arrive |
| result 0 | complete | i1 | phase-match outcome |
nvvm.mbarrier.try_wait.parity[.shared]
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | phase | i32 | parity (0 or 1) |
| operand 2 | ticks | i32 | retry budget |
| result 0 | complete | i1 | phase-match outcome |
nvvm.fence.mbarrier.init
| Position | Name | Type | Notes |
|---|---|---|---|
| (no operands) | — | — | release fence at cluster scope emitted before mbarrier.init |
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
| nvvm.mbarrier.init.shared | llvm.nvvm.mbarrier.init.shared.b64 |
| nvvm.mbarrier.init | llvm.nvvm.mbarrier.init.b64 |
| nvvm.mbarrier.inval.shared | llvm.nvvm.mbarrier.inval.shared.b64 |
| nvvm.mbarrier.arrive.shared | llvm.nvvm.mbarrier.arrive.shared.b64 |
| nvvm.mbarrier.arrive | llvm.nvvm.mbarrier.arrive.b64 |
| nvvm.mbarrier.arrive.nocomplete.shared | llvm.nvvm.mbarrier.arrive.noComplete.shared.b64 |
| nvvm.mbarrier.arrive.expect_tx.shared | llvm.nvvm.mbarrier.arrive.expect_tx.shared.b64 |
| nvvm.mbarrier.test.wait.shared | llvm.nvvm.mbarrier.test.wait.shared.b64 |
| nvvm.mbarrier.try_wait.parity.shared | llvm.nvvm.mbarrier.try.wait.parity.shared.b64 |
| nvvm.fence.mbarrier.init | llvm.nvvm.fence.mbarrier.init.release.cluster |
The intrinsic ID is selected at TableGen registration time; lowering does not re-derive it from operand types.
PTX Templates
mbarrier.init.shared.b64 [%mbar], %count;
mbarrier.inval.shared.b64 [%mbar];
mbarrier.arrive.shared.b64 %tok, [%mbar];
mbarrier.arrive.noComplete.shared.b64 %tok, [%mbar], %count;
mbarrier.arrive.expect_tx.shared.b64 %tok, [%mbar], %tx;
mbarrier.test_wait.shared.b64 %p, [%mbar], %tok;
mbarrier.try_wait.parity.shared.b64 %p, [%mbar], %ph, %ns;
fence.mbarrier_init.release.cluster;
The non-.shared forms drop the address-space token: mbarrier.init.b64 [%mbar], %count; and so on. The verifier rejects mixing — a .shared op with a generic pointer operand, or a generic op with a shared-pointer operand.
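The shared versus generic split can be sketched as a template lookup plus a variant check. The helper names `mbarrier_ptx` and `check_variant` are hypothetical. Address space 3 is shared, as stated above; treating address space 0 as the generic pointer space is an assumption of this sketch.

```python
# Illustrative sketch: derive the generic template from the .shared one and
# reject mixed variants. Helper names are hypothetical.
_SHARED_TEMPLATES = {
    "init": "mbarrier.init.shared.b64 [%mbar], %count;",
    "inval": "mbarrier.inval.shared.b64 [%mbar];",
    "arrive": "mbarrier.arrive.shared.b64 %tok, [%mbar];",
    "test.wait": "mbarrier.test_wait.shared.b64 %p, [%mbar], %tok;",
}


def mbarrier_ptx(op: str, addrspace: int) -> str:
    """Return the PTX template for op given the slot pointer's address space."""
    shared = _SHARED_TEMPLATES[op]
    if addrspace == 3:
        return shared
    if addrspace == 0:   # assumption: 0 denotes the generic address space
        return shared.replace(".shared", "", 1)  # drop the address-space token
    raise ValueError(f"unsupported address space: {addrspace}")


def check_variant(is_shared_op: bool, addrspace: int) -> None:
    """Reject a .shared op on a non-shared pointer and vice versa."""
    if is_shared_op != (addrspace == 3):
        raise ValueError("op variant and pointer address space disagree")
```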
Per-Arch Availability
| Op | SM floor | ptx_min |
|---|---|---|
| init, arrive, arrive.nocomplete, inval, test.wait, try_wait | sm_80 | 7.0 |
| try_wait.parity | sm_80 | 7.8 |
| try_wait.parity.timelimit / try_wait.timelimit | sm_90 | 8.0 |
| wait / wait.parity | sm_90 | 8.0 |
| txn / txn.cta (tx-count transaction handles) | sm_90 | 8.0 |
| arrive.expect_tx | sm_90 | 7.8 |
| fence.mbarrier.init | sm_90 | 8.0 |
| Cluster-aware variants (.cluster, .release.cluster) | sm_90 | 8.0 |
The expect_tx form is the TMA producer-side handshake and requires sm_90, as do the wait, timelimit, transaction-handle, fence, and cluster-aware rows above; only the first two rows are legal on sm_80. See TMA Ops for the producer side and Cluster Sync and DSMEM Handshake — DSMEM Transaction Handshake for the cluster-aware transaction protocol.
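The availability table lends itself to a lookup keyed by op name. The sketch below is illustrative: `AVAILABILITY` and `is_available` are hypothetical names, the entries are a subset transcribed from the table above, and PTX versions are stored as (major, minor) tuples so that ordinary tuple comparison applies.

```python
# Illustrative sketch: availability lookup transcribed from the table above.
# AVAILABILITY and is_available are hypothetical names.
AVAILABILITY = {
    # op: (SM floor, minimum PTX version as (major, minor))
    "init": (80, (7, 0)),
    "arrive": (80, (7, 0)),
    "arrive.nocomplete": (80, (7, 0)),
    "inval": (80, (7, 0)),
    "test.wait": (80, (7, 0)),
    "try_wait.parity": (80, (7, 8)),
    "try_wait.timelimit": (90, (8, 0)),
    "wait": (90, (8, 0)),
    "wait.parity": (90, (8, 0)),
    "arrive.expect_tx": (90, (7, 8)),
    "fence.mbarrier.init": (90, (8, 0)),
}


def is_available(op: str, sm: int, ptx: tuple) -> bool:
    """True if op is legal on compute capability sm at PTX version ptx."""
    sm_floor, ptx_min = AVAILABILITY[op]
    return sm >= sm_floor and ptx >= ptx_min
```

For example, `is_available("arrive.expect_tx", 80, (8, 0))` is false because the SM floor is not met, even though the PTX version is sufficient.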
Verifier Invariants
- .shared ops require operand 0 in addr-space 3.
- count and txCount are 32-bit unsigned; values larger than 20 bits are rejected.
- test.wait and try_wait.parity require an i1 result type.
- arrive.expect_tx is rejected on sm_80; it requires sm_90 or later.
- fence.mbarrier.init carries a release.cluster scope; rewriting it to acquire is rejected.
NVVM Cluster Ops
Abstract
nvvm.cluster.* and the adjacent cluster-aware helpers cover Hopper's thread-block-cluster surface: a small group of CTAs running on neighbouring SMs that share a logical cluster-wide barrier and a mapa-addressable view of their peer CTAs' shared memory. The ops in this family handle cluster-wide arrival, wait, and rank queries; they pair with mbarrier ops in nvvm.mbarrier.* for the data-side handshake. See Cluster Sync and DSMEM Handshake for the cross-CTA protocol and Cluster Sync Emission for the codegen side.
Blackwell (sm_100+) keeps the cluster surface; the same op set is the access path on every sm_90+ target.
Op Roster
| Op | Role |
|---|---|
| nvvm.cluster.arrive | arrive at the cluster-wide barrier (acquire-release semantics) |
| nvvm.cluster.arrive.relaxed | relaxed-memory variant of cluster.arrive |
| nvvm.cluster.wait | wait for every CTA in the cluster to arrive |
| nvvm.mapa | translate a peer-CTA SMEM pointer to a cluster-mapped address |
| nvvm.read.ptx.sreg.clusterid.x / .y / .z | read cluster-rank index |
| nvvm.read.ptx.sreg.nclusterid.x / .y / .z | read cluster-rank dimension |
| nvvm.read.ptx.sreg.cluster.ctarank | per-CTA rank within the cluster |
| nvvm.read.ptx.sreg.cluster.nctarank | total CTAs in the cluster |
| nvvm.barrier.cluster.arrive / .wait (alias spellings used by gpu.barrier lowering) | same ops, different mnemonic |
The cluster rank reads sit alongside the special-register family; the dialect exposes them under both nvvm.read.ptx.sreg.* and the cluster-specific names so kernels written against either spelling round-trip.
Operand Tables
nvvm.cluster.arrive / nvvm.cluster.arrive.relaxed / nvvm.cluster.wait
No operands and no result. Each lowers to a single PTX barrier.cluster.* instruction.
nvvm.mapa
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) | local-CTA SMEM pointer |
| operand 1 | ctaRank | i32 | peer CTA index within the cluster |
| result 0 | mapped | ptr addrspace(3) | cluster-mapped address that aliases peer-CTA SMEM |
The mapped pointer is dereferenceable by ordinary ld.shared / st.shared instructions and behaves as a view into the peer CTA's slot.
nvvm.read.ptx.sreg.clusterid.{x,y,z} and family
| Position | Name | Type | Notes |
|---|---|---|---|
| result 0 | r | i32 | the requested cluster coordinate |
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
| nvvm.cluster.arrive | llvm.nvvm.barrier.cluster.arrive |
| nvvm.cluster.arrive.relaxed | llvm.nvvm.barrier.cluster.arrive.relaxed |
| nvvm.cluster.wait | llvm.nvvm.barrier.cluster.wait |
| nvvm.mapa | llvm.nvvm.mapa.shared.cluster.i64 |
| nvvm.read.ptx.sreg.clusterid.x | llvm.nvvm.read.ptx.sreg.clusterid.x |
| nvvm.read.ptx.sreg.cluster.ctarank | llvm.nvvm.read.ptx.sreg.cluster.ctarank |
| nvvm.read.ptx.sreg.cluster.nctarank | llvm.nvvm.read.ptx.sreg.cluster.nctarank |
PTX Templates
barrier.cluster.arrive;
barrier.cluster.arrive.relaxed;
barrier.cluster.wait;
mapa.shared::cluster.u64 %r, %addr, %cta_rank;
mov.u32 %r, %clusterid.x;
mov.u32 %r, %clusterid.y;
mov.u32 %r, %clusterid.z;
mov.u32 %r, %nclusterid.x;
mov.u32 %r, %nclusterid.y;
mov.u32 %r, %nclusterid.z;
mov.u32 %r, %cluster_ctarank;
mov.u32 %r, %cluster_nctarank;
mapa accepts a 64-bit shared-cluster address; the u64 variant is the only one the dialect emits even when the result is a 32-bit pointer in source code — LLVM widens at type-conversion time.
Per-Arch Availability
| Op family | SM floor | ptx_min |
|---|---|---|
| cluster.arrive / wait | sm_90 | 8.0 |
| cluster.arrive.relaxed | sm_90 | 8.1 |
| mapa | sm_90 | 8.0 |
| clusterid / nclusterid reads | sm_90 | 8.0 |
| cluster.ctarank / nctarank | sm_90 | 8.0 |
The relaxed-memory variant of cluster.arrive is the only op in the family that requires ptx 8.1; everything else is legal on 8.0.
Verifier Invariants
- mapa requires the operand pointer in addr-space 3; generic pointers are rejected.
- ctaRank is a 32-bit unsigned value; values outside [0, nctarank) cause undefined behaviour at runtime but the verifier does not reject them.
- Cluster ops carry no operands and no result; verification rejects any attempt to attach attributes other than location info.
- cluster.arrive and cluster.wait must appear in pairs across cooperating CTAs; the verifier cannot prove pairing but rejects clearly-unpaired uses inside non-cluster kernels (no cluster attribute on the parent gpu.module).
Compilation Pipeline Overview
Abstract
Tileiras consumes a builtin.module carrying one or more gpu.module payloads expressed in the cuda_tile dialect and produces a relocatable object containing assembled cubin. The work splits cleanly across a host-side outer pipeline that operates on the enclosing module and a device-side inner pipeline that runs once per gpu.module. The outer pipeline registers dialects, resolves a single #nvvm.target per device module, and walks each gpu.module through dialect lowering. The inner pipeline pushes TileIR through TileAA, TileAS, CuTe/CUTLASS, NVGPU, and finally the MLIR llvm+nvvm dialect pair, then hands the result to an embedded LLVM 21 NVPTX backend that emits PTX. The driver invokes ptxas on that PTX and stitches the cubin into the output object. Each cascade page underneath this one documents one stage of that chain; this page is the contract between them.
Full cascade
MLIR bytecode (input)
↓
cuda_tile dialect (public surface)
↓
nv_tileaa dialect (analysis)
↓
nv_tileas + cute + cute_nvgpu + cutlass dialects
↓
mlir::nvgpu intermediate
↓
llvm + nvvm dialects
↓
libNVVM linkage
↓
NVPTX backend (LLVM 21 fork)
↓
PTX assembly
↓
ptxas (downstream)
↓
cubin
The descent is driven by three driver responsibilities:
- Register the dialect universe needed by the pipeline.
- Build a pass manager from resolved pipeline options.
- Run the MLIR pipeline, translate the resulting GPU module to LLVM/NVVM, and serialize it through the NVPTX backend.
Instrumentation exposes two major scopes: CompileNVVM for the MLIR lowering work and SerializeGPUModule for the LLVM/NVPTX serialization work. Those two scopes are a useful mental boundary: above them the program is still MLIR; below them it is LLVM IR, PTX, and finally cubin/object data.
Dialect handoff points
Each row is one boundary in the cascade. The "Boundary operation" column describes the step that introduces the lower-level form; the "Key invariant on entry" column states what must hold on entry to that step.
| From | To | Boundary operation | Key invariant on entry |
|---|---|---|---|
| cuda_tile | nv_tileaa | Convert public TileIR to alias-aware TileAA. | Module is fresh from bytecode loading; one gpu.module is present. |
| nv_tileaa | nv_tileas | Lower typed, alias-aware operations into assembler-near TileAS operations. | Per-function TileAA cleanup has settled canonical forms. |
| nv_tileas plus cute*/cutlass | nvgpu | Materialize schedules, layouts, TMA descriptors, and hardware-facing operations. | TileAS scheduling and layout passes have made execution structure explicit. |
| nvgpu | llvm plus nvvm | Convert NVIDIA GPU dialect operations to NVVM intrinsics and LLVM dialect operations. | Memref, vector, and math lowering have removed higher-level abstractions. |
| Untargeted gpu.module | Targeted gpu.module with #nvvm.target | Attach the resolved NVPTX target attribute. | Kernel metadata and target options are still available. |
| MLIR llvm dialect | llvm::Module | Translate MLIR LLVM dialect to an LLVM module. | Exactly one GPU target has been resolved. |
| llvm::Module | linked llvm::Module | Link external bitcode/blob libraries. | Any libdevice surrogate payloads have been attached. |
| linked llvm::Module | optimized llvm::Module | Run the LLVM optimization pipeline. | Target machine and optimization level are known. |
| optimized llvm::Module | PTX text | Run the NVPTX backend. | NVPTX subtarget information is populated. |
| PTX text | cubin/object payload | Invoke ptxas and package the result. | PTX has been emitted for the resolved target. |
The first six rows are "tier-1" boundaries (MLIR-on-MLIR passes inside the visible PassManager). The remaining four rows are "tier-2" boundaries (libNVVM linkage + NVPTX codegen). The split between the two tiers is described below.
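For tooling that needs to reason about the cascade, the handoff table can be held as ordered data with the tier derived from the row index. This is a sketch only: `BOUNDARIES` and `tier_of` are hypothetical names, and the row labels are abbreviated from the table above.

```python
# Illustrative sketch: the handoff table as ordered data.
# BOUNDARIES and tier_of are hypothetical names.
BOUNDARIES = [
    ("cuda_tile", "nv_tileaa"),
    ("nv_tileaa", "nv_tileas"),
    ("nv_tileas + cute*/cutlass", "nvgpu"),
    ("nvgpu", "llvm + nvvm"),
    ("untargeted gpu.module", "targeted gpu.module"),
    ("MLIR llvm dialect", "llvm::Module"),
    ("llvm::Module", "linked llvm::Module"),
    ("linked llvm::Module", "optimized llvm::Module"),
    ("optimized llvm::Module", "PTX text"),
    ("PTX text", "cubin/object payload"),
]


def tier_of(index: int) -> int:
    """Rows 0..5 are tier-1 boundaries; rows 6..9 are tier-2."""
    if not 0 <= index < len(BOUNDARIES):
        raise IndexError(index)
    return 1 if index < 6 else 2


TIER_1 = [b for i, b in enumerate(BOUNDARIES) if tier_of(i) == 1]
TIER_2 = [b for i, b in enumerate(BOUNDARIES) if tier_of(i) == 2]
```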
Pass Pipeline Shape
At maximum optimization the visible MLIR cascade is a long nested pass manager, but the important shape is easier to understand as phase groups:
| Phase | Purpose | Typical scope |
|---|---|---|
| Frontend conversion | Convert input cuda_tile operations into nv_tileaa. | gpu.module |
| Early debug and cleanup | Attach debug scopes, canonicalize, and remove obvious redundancy. | top-level and gpu.module |
| TileAA to TileAS | Lower alias-aware operations into assembler-near TileAS functions. | nested nv_tileaa.func |
| Host/callback materialization | Emit host wrapper and callback plumbing. | gpu.module |
| TileAS scheduling and layout | Materialize async pipeline, convert layouts, assign buffers, plan CTA/cluster behavior, and generate schedules. | gpu.module |
| LLVM/NVGPU lowering | Convert TileAS/CuTe/CUTLASS operations toward nvgpu, llvm, and nvvm. | gpu.module |
| Kernel legalization/finalization | Normalize kernel attributes, target metadata, and debug scopes. | top-level and gpu.module |
| Post-lowering cleanup | Canonicalize and run CSE/DCE after the largest rewrites. | gpu.module |
| LLVM translation | Translate MLIR LLVM dialect to llvm::Module. | whole module |
| LLVM optimization | Run the LLVM PassBuilder pipeline for the selected optimization level. | llvm::Module |
| NVPTX emission | Emit PTX and assemble it downstream. | target module |
The detailed pass-count page remains the right place for exact pass ordering and opt-level deltas. This overview is the semantic contract: each phase must leave the module in the form expected by the next phase.
Outer and Inner Pipelines
The driver runs two pass managers in sequence on a single MLIR context. The outer pass manager is anchored on builtin.module. It registers every dialect that any later stage might need, parses the bytecode, and runs only a small amount of work directly on the top-level module: dialect normalization, host-wrapper attribute resolution, and the gpu.module walk that distributes per-device work. The inner pass manager is anchored on gpu.module. It is constructed once and reused for each device module the walk discovers. The two managers share an OperationName cache and an analysis manager, but they keep separate verifier-each settings because the outer pipeline runs cheap structural checks and the inner pipeline runs expensive type-and-region checks that fire on every TileIR mutation.
LogicalResult run_tileiras_pipeline(ModuleOp top, PipelineOptions opts) {
  // getContext() already returns an MLIRContext pointer, so no address-of.
  PassManager outer = make_pass_manager(top->getName(), top->getContext());
  populate_outer_pipeline(&outer, opts);
  // The inner pipeline is nested under the outer one, anchored on gpu.module.
  OpPassManager *inner = &outer.nest<GpuModuleOp>();
  populate_inner_pipeline(inner, opts);
  return outer.run(top);
}
The outer pipeline guarantees three invariants before the inner pipeline starts. First, every gpu.module carries exactly one resolved #nvvm.target attribute giving SM name, PTX feature string, and launch-shape constants. Second, each kernel symbol has a normalized linkage attribute and a populated parent symbol table. Third, target-machine options that the inner passes read by name (num-warps, num-ctas, index-bitwidth) have been attached to the device module so that nested passes pick them up through MLIR's standard attribute lookup rather than through driver globals.
State hand-off between the two pipelines is therefore purely attribute-based: there are no thread-local or driver-side dictionaries that the inner pipeline reads at run time. This rule is what makes the inner pipeline reentrant when the outer walk crosses multiple gpu.module ops with different targets in the same compile.
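The attribute-based hand-off can be illustrated with a minimal standalone sketch. This is not the tileiras implementation: `DeviceModule`, `attach_target_options`, and `read_int_option` are hypothetical names, and a `std::map` stands in for MLIR's attribute dictionary. It only shows why two device modules with different settings never observe each other's values.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a gpu.module and its attribute dictionary.
struct DeviceModule {
  std::string name;
  std::map<std::string, long> attrs;
};

// Outer-pipeline step: stamp the options that inner passes read by name.
inline void attach_target_options(DeviceModule &m, long numWarps, long numCtas,
                                  long indexBitwidth) {
  assert(numWarps > 0 && numCtas > 0 && "launch-shape options must be positive");
  m.attrs["num-warps"] = numWarps;
  m.attrs["num-ctas"] = numCtas;
  m.attrs["index-bitwidth"] = indexBitwidth;
}

// Inner-pipeline step: resolve an option from the module being processed.
// Returns false when the attribute is absent; no global state is consulted.
inline bool read_int_option(const DeviceModule &m, const std::string &key,
                            long *out) {
  std::map<std::string, long>::const_iterator it = m.attrs.find(key);
  if (it == m.attrs.end()) return false;
  *out = it->second;
  return true;
}
```

Because the lookup takes the module as its only source of truth, the same inner pipeline object can be reused across modules without resetting anything between runs.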
Kernel-Attribute Lift
A cute.kernel attribute marks a function as a GPU entry point while the module is still in the Tile/CuTe half of the inner pipeline. The lift converts that marker into a chain of three downstream attributes: the function gains nvvm.kernel, the parent gpu.module gains #nvvm.target, and after MLIR-to-LLVM translation the corresponding llvm.func gains the ptx_kernel calling convention plus the launch-shape function attributes that the NVPTX backend reads.
void lift_kernel_attributes(GpuModuleOp gpu, NvvmTargetAttr target) {
  require(!gpu->hasAttr("nvvm.target"),
          "gpu.module already carries a conflicting target");
  for (FuncOp fn : gpu.getOps<FuncOp>()) {
    if (!fn->hasAttr("cute.kernel")) {
      continue;
    }
    fn->removeAttr("cute.kernel");
    fn->setAttr("nvvm.kernel", UnitAttr::get(gpu.getContext()));
    propagate_launch_shape(fn, target);
  }
  gpu->setAttr("nvvm.target", target);
}
The lift is the line at which target selection stops being implicit. Above it, architecture information lives in Tile-level attributes and pipeline options. Below it, only the triple, CPU string, feature string, and per-function attributes derived from #nvvm.target remain.
Serialization Boundary
When the inner pipeline finishes, the gpu.module contains only llvm and nvvm dialect operations plus the resolved target attribute. The driver then runs serialization, which is not a pass — it is a context-level translation that walks the gpu.module, builds an llvm::Module through MLIR's translateModuleToLLVMIR, links the embedded libdevice surrogate, runs an LLVM PassBuilder pipeline at the driver's chosen OptimizationLevel, emits PTX through the NVPTX backend, and invokes ptxas to produce cubin.
ByteBuffer serialize_gpu_module(GpuModuleOp gpu, PipelineOptions opts) {
  NvvmTargetAttr target = cast<NvvmTargetAttr>(gpu->getAttr("nvvm.target"));
  LLVMModulePtr llvm = translate_to_llvm_ir(gpu);
  link_libdevice_surrogate(llvm, target);
  run_llvm_passbuilder_pipeline(llvm, target, opts.opt_level);
  // Own the PTX text; a StringRef here would refer to a temporary.
  std::string ptx = emit_ptx_with_nvptx_backend(llvm, target);
  return invoke_ptxas(ptx, target, opts);
}
Two consequences of this boundary matter when debugging. First, the MLIR pass timing report and the action handler trace cover only the work above the boundary; below it, all timing comes from LLVM's --time-passes and from ptxas profiling output. Second, the verifier layers reset across the boundary: between-pass verification stops, and what replaces it is the LLVM module verifier plus the NVVM kernel-launch verifier that runs at module-finalize time.
Δ vs cicc
cicc and Tileiras meet at the LLVM/NVVM layer, not at the source-language layer. cicc enters from CUDA C++ front-end output: textual LLVM IR or bitcode already expressed with NVVM intrinsics and CUDA device ABI conventions. Tileiras enters from CUDA TileIR bytecode and owns a much larger upper half — Tile dialect parsing, TileAA analysis, TileAS scheduling, CuTe/CUTLASS materialization, GPU layout decisions, and MLIR-to-LLVM lowering. Once both compilers hold an LLVM module with NVVM intrinsics, their remaining responsibilities converge.
| Area | tileiras | cicc | Shared after convergence |
|---|---|---|---|
| Input language | CUDA TileIR bytecode | CUDA front-end LLVM IR/bitcode | no |
| Tile/CuTe/CUTLASS dialect cascade | yes | no | no |
| Tile scheduling and layout materialization | yes | no | no |
| LLVM/NVVM module optimization | yes | yes | yes |
| NVPTX target and asm printer | yes | yes | yes |
| PTX-to-cubin handoff through ptxas | yes | yes | yes |
Cross-References
Driver Entry and Optimization Levels covers how --opt-level resolves to a concrete pipeline. Pass Manager Internals documents the nesting and dispatch rules these two pipelines rely on. Pipeline Invariants and Verifiers names the three verifier layers that guard the cascade. Pass List by Optimization Level is the right place to look for exact pass ordering. Options Mapping traces how driver flags resolve to PassBuilder calls. Instrumentation and Action Handler describes the MLIR action trace and pass-timing surface. PassBuilder Mega-Registry catalogues the LLVM-side pass registry used after the MLIR-to-LLVM boundary. Backend-side documentation lives under the NVPTX Backend Passes overview, the Codegen overview, and the libdevice overview.
Driver Entry and Optimization Levels
Abstract
The Tileiras driver chooses a single MLIR pass pipeline for each compilation. The choice is a pure function of four inputs: the resolved compute target, the requested opt-level, the v2-opt-level axis that gates the newer TileAS lowering, and the pipeline-strategy flag that selects the warp-specialization variant. The output is a fully-constructed PassManager whose pass list and analysis-preservation contract are fixed before any IR mutates. Decoupling pipeline construction from pipeline execution is what lets the driver report the exact pipeline it is about to run, lets the textual --pass-pipeline parser produce the same pass graph, and lets diagnostics name each pass that contributed to a failure.
Entry Chain
The driver entry point is a small, linear state machine. It registers dialects, parses bytecode, builds the pipeline, runs it, and serializes. Each phase has a defined failure mode that cannot leak state into a later phase.
int compile_tileir(ByteSpan input, TileirasConfig config, ByteBuffer *out) {
  MLIRContext ctx;
  register_tileiras_dialects(&ctx);
  OwningOpRef<ModuleOp> module = parse_tileir_bytecode(&ctx, input);
  if (!module) {
    return TILEIR_ERROR_BAD_BYTECODE;
  }
  PipelineOptions opts = resolve_pipeline_options(config);
  PassManager pm(&ctx, ModuleOp::getOperationName());
  populate_pipeline(&pm, opts);
  if (failed(pm.run(*module))) {
    return TILEIR_ERROR_COMPILE_FAILED;
  }
  return serialize_gpu_module(*module, config, out);
}
populate_pipeline is the only place that consults opts.opt_level, opts.v2_opt_level, and opts.pipeline_strategy. Once it returns, the pass manager is immutable; no later phase decides which passes run.
Optimization Tiers
| Tier | Role | Typical use |
|---|---|---|
| O0 | Verifier-only skeleton. | Debugging bytecode ingestion and early IR validity. |
| O1 | Frontend conversion and light cleanup. | Fast checks of cuda_tile to TileAA lowering. |
| O2 | Default TileIR lowering through TileAS and first LLVM/NVGPU conversions. | Normal compilation. |
| O3 | Full conversion stack, extra canonicalization, target finalization, and debug-scope synthesis. | Highest quality output and late-stage validation. |
v2-opt-level is a second axis. The primary opt-level selects the tier; v2-opt-level enables or suppresses the newer TileAS scheduling and specialization stages independently of that tier. The driver propagates both values into the pass manager as separate attributes so that the textual --pass-pipeline parser sees the same configuration the driver sees.
The recovered dispatcher uses the following effective structure:
| Requested tier | Base adders | Extra behavior |
|---|---|---|
| O0 | none | Only automatic verifier slots run. |
| O1 | frontend adder | Convert cuda_tile to TileAA, insert debug scopes, canonicalize. |
| O2 | frontend + TileAS adder | Add TileAA-to-TileAS, host wrapper, TileAS-to-LLVM, CSE, TileAS-to-NVGPU. |
| O3 | O2 + full conversion adder | Add TileIR verification, LLVM conversion, NVGPU/NVVM conversion, finalization. |
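The cumulative structure of the dispatcher table can be sketched as a small function. This is an illustration under stated assumptions, not recovered code: `adders_for_tier` and the adder labels `frontend`, `tileas`, and `full-conversion` are hypothetical names chosen to mirror the table rows.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical adder labels; the real adders append concrete passes.
// Each tier includes every adder of the tier below it.
inline std::vector<std::string> adders_for_tier(int optLevel) {
  assert(optLevel >= 0 && optLevel <= 3 && "tiers are O0 through O3");
  std::vector<std::string> adders;
  if (optLevel >= 1) adders.push_back("frontend");
  if (optLevel >= 2) adders.push_back("tileas");
  if (optLevel >= 3) adders.push_back("full-conversion");
  return adders;  // O0 stays empty: only automatic verifier slots run
}
```

Writing the tiers as cumulative thresholds rather than a four-way switch keeps the "O3 is O2 plus one adder" relationship visible in the code.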
Two snapshot printers are conditional on emit-line-info. The first runs after frontend conversion; the second runs at the TileAS/LLVM boundary. Both are pure diagnostics — they print textual IR for line-info correlation and never mutate the module.
Pipeline Strategy
pipeline-strategy selects how aggressively the compiler specializes producer/consumer execution. The TileAS-side rewrites these strategies select between are documented in the Async Pipeline Family.
| Strategy | Meaning |
|---|---|
| none | Do not add TileAS pipeline-specialization passes. |
| unspecialize | Use the unspecialized pipeline path with configurable stage count. |
| warp-specialize | Split work across producer and consumer agents and schedule resource use. |
For warp specialization, rrt-size-threshold chooses between lighter and heavier scheduling behavior. A zero threshold selects the lighter path; a nonzero threshold enables resource-reservation-table compression and the heavier scheduler preparation passes.
The heavy path is the one that prepares scheduling, specializes agents, checks register pressure, and rewrites layouts around the schedule. The light path still inserts boundaries and barriers, but avoids the full resource-reservation machinery.
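The strategy and threshold rules above reduce to a two-input decision. The following sketch assumes hypothetical names (`Strategy`, `SchedPath`, `select_sched_path`); only the zero versus nonzero threshold rule and the three strategy values come from the text.

```cpp
#include <cassert>

enum class Strategy { None, Unspecialize, WarpSpecialize };
enum class SchedPath { NoPipelinePasses, Unspecialized, WarpLight, WarpHeavy };

// The threshold only matters for warp specialization: zero selects the light
// path, any nonzero value selects the heavy path.
inline SchedPath select_sched_path(Strategy s, unsigned rrtSizeThreshold) {
  switch (s) {
  case Strategy::None:
    return SchedPath::NoPipelinePasses;
  case Strategy::Unspecialize:
    return SchedPath::Unspecialized;
  case Strategy::WarpSpecialize:
    return rrtSizeThreshold == 0 ? SchedPath::WarpLight : SchedPath::WarpHeavy;
  }
  assert(false && "unknown strategy");
  return SchedPath::NoPipelinePasses;
}
```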
Schedule Analysis Ordering
TileAS scheduling does not happen in one pass. The work splits across a constraint-generation pass that builds a ScheduleAnalysis and stores it in the analysis manager, a configurable run of cleanup passes that promise to preserve ScheduleAnalysis, and a materialization pass that retrieves the analysis, runs the modulo scheduler, and rewrites IR to express the solved schedule. The separation matters because a cleanup pass that does not declare ScheduleAnalysis as preserved would otherwise cause the analysis to be invalidated and recomputed, which both lengthens compile times and produces a different schedule than the one any earlier diagnostic referred to.
The contract reduces to a dependency map. Each pass declares what it requires, what it produces, and what it preserves; the pass manager enforces ordering and invalidation from those declarations.
| Pass | Requires | Produces / Modifies | Preserves |
|---|---|---|---|
| tileas-generate-schedule-constraints | TileAS IR with stable function shape | ScheduleAnalysis | TileAA, DominanceInfo |
| canonicalize (between generate and materialize) | none | none | ScheduleAnalysis, TileAA |
| cse (between generate and materialize) | none | none | ScheduleAnalysis, TileAA |
| tileas-materialize-schedule | ScheduleAnalysis | TileAS schedule attributes, pipe IR | none |
LogicalResult run_schedule_pipeline(FuncOp fn, AnalysisManager am) {
  ScheduleAnalysis &constraints =
      am.getAnalysis<ScheduleAnalysis>(fn);
  for (Pass *cleanup : cleanup_between_schedule_and_materialize) {
    PreservedAnalyses preserved = cleanup->run(fn);
    if (!preserved.isPreserved<ScheduleAnalysis>()) {
      return fn.emitError(
          "cleanup pass invalidated ScheduleAnalysis; "
          "rerun constraint generation or remove the pass");
    }
  }
  Schedule solved = solve_modulo_schedule(constraints);
  if (!solved.feasible) {
    return fn.emitError("modulo scheduler returned no feasible II");
  }
  return materialize_schedule(fn, solved);
}
The hard-failure rule on invalidation is deliberate. A silent recompute would hide the underlying mistake that some cleanup pass was added to the pipeline without declaring ScheduleAnalysis as preserved, and the symptom would surface much later as a mismatched schedule.
Serialization Scopes
Two outer instrumentation scopes give profilers and callback integrations stable handles.
| Scope | Covers |
|---|---|
CompileNVVM | Running the MLIR-to-NVVM/NVPTX compilation pipeline. |
SerializeGPUModule | Translating the GPU module to PTX/cubin and invoking downstream tools. |
These scope names are part of the public ABI for embedders. Fine-grained pass scopes underneath them can change between releases, but external profilers rely on the outer names being durable.
Cross-References
Pipeline Options Mapping — Option-to-Pass Map is the lookup table that resolves each option to its consuming pass. Pass List by Optimization Level names the exact pass sequence per tier. Pass Manager Internals — Anchor Hierarchy explains the nesting model the driver populates. Modulo Scheduler and Rau-Style Placement — Placement Arms is the scheduler that consumes the preserved ScheduleAnalysis.
Pass Manager Internals
Abstract
Tileiras's pass manager is upstream MLIR's PassManager plus a small set of local conventions that make the nested structure predictable enough to reason about by inspection. This page documents those conventions: the anchor hierarchy that fixes which op type each nested pipeline targets, the OperationName-identity dispatch that adaptors use instead of string compare, the analysis-preservation discipline that the scheduling pipeline relies on, and the threading model that determines when the pass manager fans out across operations.
Anchor Hierarchy
The pipeline nests three deep, with each level targeting one op type. The outermost level is the driver's PassManager itself, anchored on builtin.module. The next level is an OpPassManager reached through pm.nest<GpuModuleOp>(), anchored on gpu.module. The innermost level is reached through gpu_pm.nest<...>() for each function-shaped op the inner stages operate on; in practice that resolves to one of nv_tileaa.func, gpu.func (TileAS-stage), or llvm.func depending on the stage of the cascade.
| Anchor | Role | Adaptor enters via |
|---|---|---|
| builtin.module | Driver root; dialect normalization, host-wrapper, gpu.module walk. | PassManager::run |
| gpu.module | Device-module lowering, scheduling, codegen preparation. | OpToOpPassAdaptor walking builtin.module |
| nv_tileaa.func | Per-function TileAA cleanup. | OpToOpPassAdaptor walking gpu.module |
| gpu.func (TileAS-stage) | Per-function TileAS scheduling and lowering. | OpToOpPassAdaptor walking gpu.module |
| llvm.func | Function-scoped MLIR-LLVM cleanup before translation. | OpToOpPassAdaptor walking gpu.module |
Adding a pass with a mismatched anchor is rejected at pass-manager construction time rather than at run time. The check uses the anchor OperationName already stored on the pass:
void OpPassManager::addPass(std::unique_ptr<Pass> pass) {
  Optional<OperationName> required = pass->getOpName(getContext());
  if (required && *required != getOpAnchor()) {
    llvm::report_fatal_error(
        Twine("pass '") + pass->getName() +
        "' anchored on '" + required->getStringRef() +
        "' added to pipeline anchored on '" +
        getOpAnchor().getStringRef() + "'");
  }
  passes.push_back(std::move(pass));
}
OperationName Dispatch
Adaptors do not compare op-name strings at run time. Each OperationName carries a TypeID that uniquely identifies its registered op class within the MLIRContext. The adaptor caches the anchor's TypeID once at construction and compares pointers during the walk. This makes the inner-loop check a single integer compare per op visited, which matters because the outer adaptor walks the entire builtin.module and the inner adaptor walks every nested operation under each gpu.module.
LogicalResult OpToOpPassAdaptor::run(Operation *root) {
  TypeID anchorId = nestedAnchor.getTypeID();
  for (Region &region : root->getRegions()) {
    for (Block &block : region) {
      for (Operation &op : block) {
        if (op.getName().getTypeID() != anchorId) {
          continue;
        }
        if (!op.hasTrait<OpTrait::IsIsolatedFromAbove>()) {
          return op.emitOpError(
              "nested pipeline anchor must be IsolatedFromAbove");
        }
        if (failed(runOnOperation(&op))) {
          return failure();
        }
      }
    }
  }
  return success();
}
IsIsolatedFromAbove is what makes the dispatch sound. Without it, a nested pass could read or mutate SSA values defined above the anchor, and the threading model described below could then race on those values.
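The identity-based dispatch can be demonstrated without MLIR. In this standalone sketch, `NameContext` and `count_anchor_matches` are hypothetical; an interned string's address plays the role the text assigns to the per-op TypeID, so matching an op against the anchor is one pointer compare rather than a string compare.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical context that interns op names. Each distinct name gets one
// stable address, because std::set never relocates its elements.
class NameContext {
public:
  const std::string *intern(const std::string &name) {
    return &*names_.insert(name).first;
  }

private:
  std::set<std::string> names_;
};

// Count the ops an adaptor would dispatch to: one pointer compare per op.
inline int count_anchor_matches(const std::vector<const std::string *> &ops,
                                const std::string *anchor) {
  assert(anchor && "anchor must be interned before the walk");
  int matches = 0;
  for (const std::string *op : ops)
    if (op == anchor) ++matches;
  return matches;
}
```

The cost of the string comparison is paid once, at interning time, which is the same trade the adaptor makes by caching the anchor identity at construction.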
Analysis Preservation Discipline
Each anchor level owns its own AnalysisManager. Analyses computed at the gpu.module level (target queries, kernel symbol tables, NVVM target attribute caches) outlive the function-scoped passes that consume them; analyses computed at the gpu.func (TileAS-stage) level (ScheduleAnalysis, register-pressure estimates) live only as long as their function passes do not invalidate them.
The pass manager invalidates everything not explicitly listed in the PreservedAnalyses set the pass returns. Tileiras follows a strict rule for the scheduling pipeline: any pass placed between tileas-generate-schedule-constraints and tileas-materialize-schedule must declare ScheduleAnalysis as preserved or the build is rejected (see Driver Entry — Schedule Analysis Ordering). The check is enforced at pipeline construction:
void verify_schedule_preservation(OpPassManager &pm) {
  bool inScheduleRegion = false;
  for (Pass &pass : pm.getPasses()) {
    if (pass.getArgument() == "tileas-generate-schedule-constraints") {
      inScheduleRegion = true;
      continue;
    }
    if (pass.getArgument() == "tileas-materialize-schedule") {
      inScheduleRegion = false;
      continue;
    }
    if (inScheduleRegion &&
        !pass.preserves<ScheduleAnalysis>()) {
      llvm::report_fatal_error(
          Twine("pass '") + pass.getName() +
          "' between schedule generation and materialization "
          "does not preserve ScheduleAnalysis");
    }
  }
}
This check moves a class of scheduling bugs from rare runtime symptoms (mismatched schedule, wrong II) to a deterministic pipeline-construction failure.
Threading Model
When the outer adaptor is constructed with parallelism enabled and the anchor type is IsolatedFromAbove, the pass manager runs the nested pipeline on different gpu.module ops concurrently using its thread pool. Each thread takes a clone of the pass list and a fresh AnalysisManager; the only shared state is the MLIRContext (which is thread-safe by construction) and the PassInstrumentation chain (which serializes its own callbacks).
Tileiras enables parallelism for the outer builtin.module → gpu.module adaptor only. The inner gpu.module → function adaptors run sequentially because the per-function scheduling pipeline already saturates the thread pool through its own parallel solvers and because pass instrumentation is easier to read when function-level events from one device module do not interleave with another's.
LogicalResult run_with_threading(OpToOpPassAdaptor &adaptor,
                                 Operation *root) {
  SmallVector<Operation *> targets;
  collect_anchor_operations(root, adaptor.anchor, targets);
  if (!adaptor.runInParallel) {
    for (Operation *op : targets) {
      if (failed(adaptor.runOnOperation(op))) {
        return failure();
      }
    }
    return success();
  }
  return parallelForEach(root->getContext(), targets,
      [&](Operation *op) { return adaptor.runOnOperation(op); });
}
One isolation guarantee holds across both modes: a pass run on one anchor operation observes only that operation and its regions. Cross-anchor effects must travel through the shared context's symbol tables or through attributes attached to operations the outer pipeline visits.
Cross-References
Pipeline Invariants and Verifiers — Verifier Layers describes the verifier layers that run between the pass manager's pass invocations. Driver Entry and Optimization Levels — Entry Chain is where the pass manager is populated. LLVM PassBuilder Registry — Textual Resolution covers the textual resolution that produces the same pass graph at the LLVM tier. Compilation Pipeline Overview — Outer and Inner Pipelines describes how the two nesting levels are constructed and chained.
Pipeline Invariants and Verifiers
Abstract
Tileiras wraps three concentric verifier layers around its pass pipeline. The innermost layer is the OperationName verifier, which fires every time an op is built or modified and checks operand counts, result types, region structure, and trait predicates such as IsolatedFromAbove. The middle layer is the pass-manager between-pass verifier, which runs the full verify() on the anchor operation after each pass when verify-each is enabled and catches the broader class of failures that involve interactions between newly mutated ops. The outermost layer is the explicit module-level verifier suite that runs at fixed pipeline points and checks semantic rules requiring whole-module or target context, including the NVVM kernel-parameter overflow check. Each layer catches a class of bug the others cannot: per-op invariants surface immediately; cross-op invariants surface after the pass that introduced them; cross-pass invariants surface at named checkpoints.
A fourth layer — the NVVM IR verifier — runs after MLIR-to-LLVM conversion and catches NVPTX-specific errors that upstream LLVM's generic verifier ignores; a fifth, ptxas, closes the loop after PTX serialization. The full five-layer model and the bug-class-to-layer table are documented in Correctness Layers; this page covers the three in-pipeline layers in the order the pass manager invokes them.
Verifier Layers
The three layers run in strict order around each pass invocation. The innermost layer is always active and cannot be disabled because it is part of op construction itself. The middle layer is on by default for non-Release builds and is gated on verify-each otherwise. The outermost layer is scheduled as named passes in the pipeline and runs only at the points the pipeline builder places them.
LogicalResult run_pass_with_three_verifier_layers(
    OpPassManager &pm, Pass &pass, Operation *anchor) {
  // Layer 1: OperationName verifiers fired implicitly during op
  // construction inside the pass body. There is no scheduling step
  // for this layer; mutation through OpBuilder triggers the per-op
  // verifier and may fail before pass.run returns.
  if (failed(pass.run(anchor))) {
    return anchor->emitError("pass failed; per-op verifier may have fired");
  }
  // Layer 2: pass-manager between-pass verifier.
  if (pm.getVerifyEach()) {
    if (failed(verify(anchor, /*verifyRecursively=*/true))) {
      return anchor->emitError(
                 "between-pass verifier rejected IR after '")
             << pass.getName() << "'";
    }
  }
  // Layer 3 runs only at explicit verifier passes (TileIR ops
  // analysis, agent verifier, NVVM verifier); those passes appear
  // in the pipeline's pass list like any other pass.
  return success();
}
The single ordering rule that ties the layers together: layer 1 fires before pass.run returns; layer 2 fires immediately after; layer 3 fires only when its named pass is reached. A failure at any layer halts the pipeline with the originating pass and operation attached to the diagnostic.
Layer-1 Example: Per-Op Structural Rejection
A builder that constructs nv_tileas.async.tiled_tma_load with the wrong coordinate count is rejected immediately by the per-op verifier. The op's verify body reads the descriptor operand's box rank, walks the coordinate operand slot range, and compares counts. The diagnostic is emitted before OpBuilder::create returns:
'nv_tileas.async.tiled_tma_load' op expects 3 coordinates, but got 2
The literal partial string in the binary is " coordinates, but got "; the expected count is derived from the view's rank plus an optional +1 when the view carries a TMA descriptor with a leading offset. The diagnostic surfaces inside the pass that built the op, so the failure points at the rewrite pattern that emitted the wrong shape rather than at a later consumer that would have seen the malformed IR.
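The count rule described above can be written out as a small check. This is a sketch, not the verifier body: `expected_coordinates` and `check_tma_load_coordinates` are hypothetical helpers, and the precondition that a view has at least one dimension is an assumption made for the example. Only the message shape and the "rank plus optional one" rule come from the text.

```cpp
#include <cassert>
#include <string>

// Expected count per the text: the view rank, plus one when the descriptor
// carries a leading offset.
inline unsigned expected_coordinates(unsigned viewRank, bool hasLeadingOffset) {
  assert(viewRank > 0 && "assumed: a TMA view has at least one dimension");
  return viewRank + (hasLeadingOffset ? 1u : 0u);
}

// Returns an empty string on success, otherwise the diagnostic text.
inline std::string check_tma_load_coordinates(unsigned viewRank,
                                              bool hasLeadingOffset,
                                              unsigned actual) {
  unsigned expected = expected_coordinates(viewRank, hasLeadingOffset);
  if (actual == expected) return std::string();
  return "'nv_tileas.async.tiled_tma_load' op expects " +
         std::to_string(expected) + " coordinates, but got " +
         std::to_string(actual);
}
```

A rank-2 view with a leading offset and only two coordinates reproduces the diagnostic quoted above, with 3 expected and 2 supplied.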
Layer-2 Example: Partial-Rewrite Detection
MaterializeAsync rewrites every pipeline op into a pair of producer and consumer regions. The pass-level verifier catches a half-rewritten state — a consume_one whose paired produce_one was rewritten but whose region block-argument types still match the pre-rewrite producer-type list. The region-op verifier walks the block-argument list against the producer-type attribute and emits:
'nv_tileas.async.pipeline.consume_one' op expects region arguement types to match with producer types [...], but got: [...]
The typo arguement is preserved verbatim — downstream log-scraping infrastructure matches on it. The diagnostic fires at the boundary of MaterializeAsync, not at the next pass that would have consumed the inconsistent region. See nv_tileas Verifiers — Region-Op Verifier Template for the verifier body.
Layer-3 Example: Missing Kernel Metadata
A late LLVM-tier cleanup pass can strip function attributes after KernelAttrPass has stamped nvvm.kernel but before the NVVM verifier runs. The module-level verifier walks the function list, sees a kernel-shaped function without the kernel attribute, and rejects the module. The verifier predicate is the canonical isKernelFunction four-criteria disjunction documented in Kernel Identity; when none of the four criteria fires for a function the rest of the pipeline treats as a kernel, the inconsistency surfaces here rather than as a missing-.entry directive in the emitted PTX.
Explicit Verifier Passes
| Verifier | Stage | Contract |
|---|---|---|
| TileIR operation analysis | Before LLVM conversion in the full pipeline. | Check TileIR region, atom, schedule, and metadata invariants. |
| TileAA agent verifier | Warp-specialized TileAA path. | Check producer/consumer agent graph shape. |
| NVVM IR verifier | After target conversion and before NVPTX backend lowering. | Check kernel launches and formal parameter-space usage. |
The TileIR verifier runs before high-level operations are erased — once convert-tileas-to-llvm removes the Tile schedule attributes, the verifier has nothing to inspect. The NVVM verifier runs after kernel metadata and address-space attributes have been attached because the parameter-space check depends on the resolved data layout.
The NVVM verifier enforces two behaviors that matter to users. A device launch target must be a kernel (a non-kernel call through a launch op is rejected at this layer rather than at the backend; see Launch-Argument Address-Space Check). A kernel's formal parameter buffer must fit the selected target's parameter-space limit; the verifier walks the argument list, applies the target's data layout, and compares the cumulative size against the limit (the per-SM limits are listed in ParamSpaceLimit by SM Family). It also emits a warning when a child launch receives a pointer to parent-local or CTA-shared memory: the warning is non-fatal because the IR is well-formed, but the child dereference is undefined behavior and the warning is the only place users see it.
Ordering Invariants
| Invariant | Required order |
|---|---|
| Frontend conversion | cuda_tile to TileAA before any TileAA function pass. |
| TileAA lowering | TileAA to TileAS before TileAS-to-LLVM and TileAS-to-NVGPU consumers. |
| TileAS lowering | TileAS-to-LLVM before consumers that expect LLVM-compatible values. |
| TileIR semantic verification | Before LLVM conversion erases TileIR structure. |
| Cleanup bracketing | Canonicalizer and CSE around major dialect conversions. |
| NVVM verification | After kernel metadata and address-space conversion. |
| Target serialization | Only after no high-level TileIR ops remain. |
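The ordering table can be treated as a list of "A before B" rules checked against a candidate pass list. The sketch below is one possible way to do that, not the tileiras mechanism; `OrderRule` and `first_order_violation` are hypothetical, and the caller supplies whatever pass names it uses.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A rule says: the first pass must appear before the second.
typedef std::pair<std::string, std::string> OrderRule;

// Returns the index of the first violated rule, or -1 when the order is valid.
// A rule whose passes are not both present is treated as satisfied.
inline int first_order_violation(const std::vector<std::string> &pipeline,
                                 const std::vector<OrderRule> &rules) {
  auto position = [&](const std::string &name) -> int {
    for (std::size_t i = 0; i < pipeline.size(); ++i)
      if (pipeline[i] == name) return static_cast<int>(i);
    return -1;
  };
  for (std::size_t r = 0; r < rules.size(); ++r) {
    assert(rules[r].first != rules[r].second && "a rule names two passes");
    int before = position(rules[r].first);
    int after = position(rules[r].second);
    if (before >= 0 && after >= 0 && before > after)
      return static_cast<int>(r);
  }
  return -1;
}
```

Treating absent passes as satisfied matches the way lower tiers omit whole phases: a rule about TileAS-to-LLVM is irrelevant to a pipeline that never adds that conversion.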
NVVM Parameter Verification
The kernel-parameter overflow check is the most user-visible piece of layer-3 verification because it is the first place where a target-specific limit can reject otherwise valid TileIR. The verifier walks each kernel argument, asks the target data layout for size and ABI alignment, accumulates an aligned offset, and compares the total against the target's parameter-space limit.
LogicalResult verify_kernel_parameters(LLVMFuncOp kernel,
                                       NvvmTargetAttr target,
                                       const DataLayout &dl) {
  uint64_t total = 0;
  for (auto [idx, argType] : llvm::enumerate(kernel.getArgumentTypes())) {
    TypeSize size = dl.getTypeSize(argType);
    Align align = dl.getABITypeAlign(argType);
    if (size.isScalable()) {
      return kernel.emitOpError("argument ") << idx
          << " has scalable type; not supported in NVVM kernels";
    }
    total = llvm::alignTo(total, align.value());
    total += size.getFixedValue();
  }
  if (total > target.getParameterSpaceLimit()) {
    return kernel.emitOpError("formal parameter space overflowed: ")
        << total << " > " << target.getParameterSpaceLimit()
        << " bytes for " << target.getChip();
  }
  return success();
}
The limit is target-dependent. Pre-Hopper SM versions allow 4096 bytes; Hopper and later allow 32760 bytes. The verifier reads the limit from the resolved #nvvm.target attribute so that a kernel rejected on one architecture can succeed on another without touching the verifier code.
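The align-then-add accumulation is easy to get wrong by hand, so a standalone worked example may help. `ArgLayout`, `align_to`, `param_buffer_size`, and `fits_param_space` are hypothetical names; sizes and alignments are supplied directly instead of being queried from a data layout.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ArgLayout {
  uint64_t size;   // bytes
  uint64_t align;  // bytes, power of two
};

// Round offset up to the next multiple of align.
inline uint64_t align_to(uint64_t offset, uint64_t align) {
  assert(align != 0 && (align & (align - 1)) == 0 &&
         "alignment must be a power of two");
  return (offset + align - 1) & ~(align - 1);
}

// Total parameter-buffer size: pad to each argument's alignment, then add
// its size, mirroring the loop in the verifier pseudocode.
inline uint64_t param_buffer_size(const std::vector<ArgLayout> &args) {
  uint64_t total = 0;
  for (const ArgLayout &a : args) {
    total = align_to(total, a.align);
    total += a.size;
  }
  return total;
}

inline bool fits_param_space(const std::vector<ArgLayout> &args,
                             uint64_t limit) {
  return param_buffer_size(args) <= limit;
}
```

For arguments of (size 4, align 4), (size 8, align 8), and (size 1, align 1), the running total is 4 after the first, 16 after the second because the offset is first padded from 4 to 8, and 17 after the third. Six hundred 8-byte arguments total 4800 bytes, which exceeds a 4096-byte limit but fits a 32760-byte one.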
Cross-References
Correctness Layers is the canonical overview that places the three in-pipeline layers covered here alongside the NVVM IR verifier and ptxas, and gives the bug-class-to-layer table. Pass Manager Internals — Anchor Hierarchy documents the anchor and dispatch model the verifier layers run inside. Pass List by Optimization Level is where each explicit verifier pass appears in the pipeline. Pipeline Options Mapping covers the options that enable or disable verify-each behavior. NVVM IR Verifier is the LLVM-tier sibling that re-checks parameter-space and address-space constraints after the MLIR-to-LLVM translation.
Pipeline Options Mapping
Abstract
TileIRPipelineOptions is the configuration object that parameterizes the MLIR-tier pipeline. It is filled from the driver and from --pass-pipeline="tileir{...}" syntax, then read while building the pass manager. Every public option has a single consuming pass plus a well-defined access pattern: the pipeline builder either passes the option into the pass's constructor (compile-time binding) or attaches it as a module attribute the pass reads from inside its runOnOperation body (run-time binding). This page maps each option to its consumer and its access pattern, then describes the layered defaulting strategy that decides what each option holds when the user does not set it explicitly.
Core Options
| Option | Type | Default | Used for |
|---|---|---|---|
| num-warps | integer | 4 | Warp count used by TileAA/TileAS scheduling and launch metadata. |
| num-ctas | integer | 1 | CTA count per cluster. |
| compute-capability | string | driver target | SM target such as sm_100, sm_103, sm_110, sm_120, or sm_121. |
| opt-level | integer | 2 | MLIR-tier optimization tier. |
| v2-opt-level | integer | 0 | Secondary TileAS scheduling/lowering axis. |
| pipeline-strategy | enum | none | Selects none, unspecialized, or warp-specialized pipeline behavior. |
| index-bitwidth | integer | 32 | Index type width used by LLVM conversion passes. |
| unspecialized-pipeline-num-stages | integer | 4 | Stage count for the unspecialized software pipeline path. |
Math and Target Options
| Option | Type | Default | Used for |
|---|---|---|---|
| approx | boolean | false | Approximate math behavior in target conversion and NVVM reflection. |
| ftz | boolean | false | Flush-to-zero behavior for floating-point lowering. |
| use-nvgpucomp-libnvvm | boolean | false | Route target conversion through NVGpuComp/libNVVM integration. |
| emit-line-info | enum | none | Select the IR stage used for line-info snapshots. |
Scheduler Options
| Option | Type | Default | Used for |
|---|---|---|---|
| dynamic-persistent | boolean | false | Enable dynamic persistent-kernel transformation. |
| schedule-trace-file | string | empty | Write a Chrome-style scheduler trace to the given path. |
| enable-random-delay | boolean | false | Stress-test scheduler ordering with random delays. |
| rrt-size-threshold | unsigned | 4096 | Resource-reservation-table compression threshold. |
| max-constraint-iterations | unsigned | 10 | Iteration cap for resource constraint generation. |
Host Wrapper Options
| Option | Type | Default | Used for |
|---|---|---|---|
| enable-debug-logging | boolean | false | Enable extra host-wrapper logging. |
| host-triple | string | native | Target triple for generated host callback code. |
| dump-host | string | empty | Write generated host code to a file. |
Option-to-Pass Map
Each option resolves to one or more consuming passes and a specific access pattern. "Constructor" means the pipeline builder reads the option and passes the value as a PassOption to the pass's factory; the pass then reads it through its own option field. "Module attribute" means the pipeline builder attaches the value to the gpu.module and the pass reads it through op->getAttrOfType<...> inside runOnOperation. "Both" means the pipeline builder writes the attribute and also wires the option through the pass constructor; this is used for options consumed both inside MLIR passes and across the MLIR-to-LLVM serialization boundary.
| Option | Consuming pass | Access pattern |
|---|---|---|
| num-warps | convert-cudatile-to-tileaa, tileas-generate-schedule-constraints | Both |
| num-ctas | convert-cudatile-to-tileaa, tileir-gpu-module-prepare | Module attribute |
| compute-capability | convert-target-to-nvvm, tileir-post-nvvm-finalize | Module attribute (via resolved #nvvm.target) |
| opt-level | Pipeline builder | Decides which passes are added |
| v2-opt-level | tileas-generate-schedule-constraints, tileas-materialize-schedule | Constructor |
| pipeline-strategy | Pipeline builder (gates warp-specialization adders) | Decides which passes are added |
| index-bitwidth | convert-tileas-to-llvm, convert-to-llvm, convert-memref-to-llvm | Constructor |
| unspecialized-pipeline-num-stages | unspecialized-pipeline | Constructor |
| approx | convert-target-to-nvvm (NVVM reflect map) | Module attribute |
| ftz | convert-target-to-nvvm (NVVM reflect map) | Module attribute |
| use-nvgpucomp-libnvvm | Serialization driver | Read at serialize time |
| emit-line-info | Snapshot printers in O1 and O2 | Constructor (printer enable + tag) |
| dynamic-persistent | tileir-gpu-module-prepare | Module attribute |
| schedule-trace-file | DumpTraceImpl instrumentation | Read at instrumentation install |
| enable-random-delay | tileas-generate-schedule-constraints | Constructor |
| rrt-size-threshold | Pipeline builder + ResourceConstraintBuilder | Both |
| max-constraint-iterations | tileas-generate-schedule-constraints | Constructor |
| enable-debug-logging | tileir-emit-host-wrapper | Constructor |
| host-triple | tileir-emit-host-wrapper | Constructor |
| dump-host | tileir-emit-host-wrapper | Constructor |
Pipeline Builder
The pipeline builder reads opt-level, pipeline-strategy, and emit-line-info to decide which pass-list segments to append, then forwards the remaining options (including v2-opt-level, which is bound through pass constructors) into the passes themselves. Two segments are conditional on opt-level (TileAS lowering for >= 2, full LLVM/NVVM conversion for >= 3); one is conditional on pipeline-strategy (warp-specialization adders); two are conditional on emit-line-info (snapshot printers).
```cpp
void populate_pipeline(PassManager &pm, const PipelineOptions &opts) {
  OpPassManager &gpu_pm = pm.nest<GpuModuleOp>();

  // Module-attribute bindings are written before any pass is scheduled.
  attach_target_attributes(pm, opts);
  add_frontend_segment(gpu_pm, opts);

  // Optional snapshot after the frontend segment.
  if (opts.emit_line_info == EmitLineInfo::Frontend) {
    add_snapshot_printer(gpu_pm, "after-frontend");
  }

  // opt-level gates TileAS lowering.
  if (opts.opt_level >= 2) {
    add_tileas_lowering_segment(gpu_pm, opts);
  }

  // pipeline-strategy gates the warp-specialization adders.
  if (opts.pipeline_strategy != PipelineStrategy::None) {
    add_warp_specialization_segment(gpu_pm, opts);
  }

  // Optional snapshot at the TileAS-to-LLVM boundary.
  if (opts.emit_line_info == EmitLineInfo::TileasBoundary) {
    add_snapshot_printer(gpu_pm, "tileas-llvm-boundary");
  }

  // opt-level gates the full LLVM/NVVM conversion.
  if (opts.opt_level >= 3) {
    add_full_conversion_segment(gpu_pm, opts);
  }
}
```
The attach_target_attributes step is what turns the module-attribute access pattern into a real binding: it writes compute-capability, num-ctas, approx, ftz, and dynamic-persistent onto every gpu.module so that downstream passes pick them up uniformly.
Defaulting Strategy
Defaults are layered. The driver applies command-line defaults first (its opt-level default is 3, its compute-capability default points at the newest supported Blackwell SM). The pipeline-options parser applies its own defaults if the driver did not (its opt-level default is 2, its compute-capability default is older). The TileGen front end applies a final tier of defaults for options the user never touches.
| Layer | Sets | Wins when |
|---|---|---|
| Driver CLI | opt-level=3, compute-capability=<latest Blackwell> | User invokes the tileiras binary directly. |
| --pass-pipeline parser | opt-level=2, compute-capability=<older default> | Pipeline is built from a textual --pass-pipeline=tileir{...} string with no driver wrapping. |
| TileGen front end | Scheduler-trace path, debug-logging flag | Driver did not set them and parser does not see them. |
Tests should set every option they care about explicitly, because the driver and parser defaults disagree on both opt-level and target.
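The disagreement between the two layers is easy to see in a small model. This is a sketch under stated assumptions: resolveOptLevel and its parameters are hypothetical, and only the two opt-level defaults (3 for the driver, 2 for the parser) come from the table above.

```cpp
#include <optional>

// Hypothetical resolver for opt-level. `userValue` is the explicitly supplied
// value, if any; `viaDriver` is true when the tileiras driver built the
// pipeline and false when it came from a bare --pass-pipeline string.
inline int resolveOptLevel(std::optional<int> userValue, bool viaDriver) {
  constexpr int kDriverDefault = 3;  // Driver CLI layer, per the table
  constexpr int kParserDefault = 2;  // --pass-pipeline parser layer, per the table
  if (userValue) return *userValue;  // an explicit value beats every layer
  return viaDriver ? kDriverDefault : kParserDefault;
}
```

An unset opt-level resolves to 3 through the driver and to 2 through the parser, so a test that relies on the default exercises different pass lists depending on how it was launched. Setting the value explicitly removes that dependency.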
Unconsumed Options
When an option is set but its consuming pass is not in the active pipeline (for example unspecialized-pipeline-num-stages=8 is set but pipeline-strategy=none so the unspecialized pass is never added), the option is silently ignored. The pipeline builder does not emit a warning because the textual parser cannot distinguish a redundant option from a user-supplied override that will become relevant on a later pipeline rebuild. Driver invocations that combine incompatible flags should be rejected at the driver layer, not at the pipeline builder.
Cross-References
Driver Entry and Optimization Levels — Optimization Tiers explains how opt-level and pipeline-strategy decompose into the pass-list segments above. Pass List by Optimization Level names the passes each segment contains. LLVM PassBuilder Registry covers options consumed past the MLIR-to-LLVM boundary. Driver Entry and Optimization Levels — Schedule Analysis Ordering covers the ScheduleAnalysis preservation contract that rrt-size-threshold and the scheduler options feed.
Instrumentation and Action Handling
Abstract
Tileiras exposes four orthogonal tracing surfaces wired into one shared PassManager. Pass instrumentation records named scopes around pipeline stages and forms the spine of pass-timing and IR-printing. The IRPrinter instrumentation layered on top emits *** IR Dump Before/After *** banners for --mlir-print-ir-before-all / --mlir-print-ir-after-all. PassTiming consumes the same scope tree and renders either a flat list or a nested tree, in text or JSON, under --mlir-timing. MLIR actions are a lower-level mechanism for tracing individual rewrites and pattern applications; they are visible through mlir::ApplyPatternAction / GreedyPatternRewriteIteration records and through the context's action-handler callback. All four surfaces interact with a single PassInstrumentation chain that the pipeline builder owns.
This page covers the chain itself: the algorithm an instrumentation hook follows, the canonical scope tree the binary actually emits, the MLIR action surface and its printable records, the pass-timing report grammar, the IR-printing banner protocol, the crash reproducer, opt-bisect, and debug-counter — plus the four NVIDIA-private hook points where the chain ties into the scheduler.
Confidence on the existence and ordering of the inner scopes is HIGH (every scope name is a verbatim .rodata string referenced from the scheduler band of .text). Confidence on the exact algorithmic decomposition of runOnOperation between the instrumentation chain and the action handler is MED — the upstream MLIR shape is preserved unchanged in tileiras, and the reverse-engineered call graph matches it bucket-for-bucket, but specific function offsets are not the public contract.
The PassInstrumentation chain
PassManager owns a PassInstrumentor whose only payload is an ordered list of PassInstrumentation* hooks. Each hook is a polymorphic object with seven virtual entry points that the pass manager calls at well-defined moments around every pass invocation, every pipeline run, and every analysis computation:
```cpp
struct PassInstrumentation {
  // Pass-execution boundary callbacks.
  virtual void runBeforePass(Pass *, Operation *) = 0;
  virtual void runAfterPass(Pass *, Operation *) = 0;
  virtual void runAfterPassFailed(Pass *, Operation *) = 0;

  // Pipeline-nesting callbacks (one per nested OpPassManager).
  virtual void runBeforePipeline(OperationName, const PipelineParentInfo &) = 0;
  virtual void runAfterPipeline(OperationName, const PipelineParentInfo &) = 0;

  // Analysis-cache callbacks.
  virtual void runBeforeAnalysis(StringRef, TypeID, Operation *) = 0;
  virtual void runAfterAnalysis(StringRef, TypeID, Operation *) = 0;
};
```
The pass-execution callbacks bracket every Pass::runOnOperation call. The pipeline callbacks fire once at the entry and exit of each nested OpPassManager invocation, regardless of how many passes run inside it. The analysis callbacks fire around any AnalysisManager::getAnalysis<T> cache miss; cache hits fire neither.
The instrumentor walks its hook list in registration order for the Before* callbacks and in reverse order for the After* callbacks. The reverse traversal is what gives nested timing the right semantics — an outer timer's runAfterPass fires after every inner timer has already closed.
```cpp
LogicalResult run_pass_with_instrumentation(
    PassManager &pm, Pass &pass, Operation *anchor) {
  PassInstrumentor *instr = pm.getInstrumentor();

  // Before* callbacks run in registration order.
  for (auto &h : instr->hooks_forward()) h->runBeforePass(&pass, anchor);

  LogicalResult result = pass.runOnOperation(anchor);

  // After* callbacks run in reverse registration order.
  if (failed(result)) {
    for (auto &h : instr->hooks_reverse()) h->runAfterPassFailed(&pass, anchor);
  } else {
    for (auto &h : instr->hooks_reverse()) h->runAfterPass(&pass, anchor);
  }
  return result;
}
```
When no instrumentation is installed, the instrumentor's hook list is empty, so the forward and reverse loops each reduce to a length test. The cost of an empty PassInstrumentation chain is therefore two array-length tests per pass invocation, which is far below the cost of any non-trivial pass.
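The forward-then-reverse traversal described above can be checked with a toy chain that only records call order. RecordingHook, runBracketed, and traceTwoHooks are hypothetical names for this sketch; nothing here is the MLIR API.

```cpp
#include <string>
#include <vector>

// Hypothetical hook that records which callbacks it receives, in order.
struct RecordingHook {
  std::string name;
  std::vector<std::string> *log;
  void before() const { log->push_back("before:" + name); }
  void after() const { log->push_back("after:" + name); }
};

// Before callbacks in registration order, then the body, then After
// callbacks in reverse registration order.
template <typename Body>
void runBracketed(const std::vector<RecordingHook> &hooks, Body body) {
  for (auto it = hooks.begin(); it != hooks.end(); ++it) it->before();
  body();
  for (auto it = hooks.rbegin(); it != hooks.rend(); ++it) it->after();
}

// Registers "outer" first and "inner" second, then runs one fake pass.
inline std::vector<std::string> traceTwoHooks() {
  std::vector<std::string> log;
  std::vector<RecordingHook> hooks = {{"outer", &log}, {"inner", &log}};
  runBracketed(hooks, [&log] { log.push_back("pass"); });
  return log;
}
```

The recorded sequence is before:outer, before:inner, pass, after:inner, after:outer. The first-registered hook closes last, which is the property that lets an outer timer include the cost of every hook registered after it.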
Scope tree the binary emits
Every named scope the binary actually emits as a .rodata string is enumerated below. The scopes form a strict tree: outer scopes cover whole compilation phases; inner scopes cover scheduling and TileAS preparation substages. The order in which they fire on a typical compile is top-to-bottom in the table — CompileNVVM opens first, SerializeGPUModule closes last, and every scheduler scope nests inside TileASGenerateSchedule or TileASPrepareForScheduling.
| Scope | Layer | Purpose |
|---|---|---|
| CompileNVVM | Outer | Entire MLIR-to-NVVM/NVPTX compile run. |
| SerializeGPUModule | Outer | GPU module serialization and downstream assembler handoff. |
| IRWalk::findTargetForLoops | Schedule prep | Search the IR for loops eligible for schedule materialization. |
| Schedule::unrollStaticForLoop | Schedule prep | Emit static loop unrolling during schedule materialization. |
| TileASGenerateSchedule | Schedule | Schedule constraint generation. |
| TileASPrepareForScheduling | Schedule | TileAS preparation before schedule solving. |
| legalizeLoopScheduleForMaterialization | Schedule | Loop-shape cleanup before materializing a schedule. |
| DumpTraceImpl::run | Schedule | Write a scheduler trace when schedule-trace-file is set. |
| unrollSmallLoopsForScheduling | Schedule prep | Unroll small loops before schedule construction. |
| decomposeSingleOp | Schedule prep | Decompose a single complex op for schedule-friendly IR. |
| loopUnrollByFactor | Schedule prep | Apply an explicit unroll factor. |
| loopUnrollByHeuristic | Schedule prep | Apply heuristic loop unrolling. |
| decomposeTiledLoadStoreView | Schedule prep | Split tiled view loads/stores into scheduler-friendly forms. |
| refineVecSizeOfAtoms | Schedule prep | Refine vector sizes for atom operations. |
| sliceAndFuse | Schedule prep | Slice and fuse loops or regions for scheduling. |
| TileASSliceAndFuse | Schedule prep | Wrapper scope around the TileAS-specific slice-and-fuse stage. |
| runCanonicalizer | Schedule prep | Run canonicalization inside a scheduler preparation stage. |
| compactMemLayout | Schedule prep | Compact memory layout metadata. |
| refreshBoxDim | Schedule prep | Refresh box dimensions after layout changes. |
| ResourceConstraintBuilder::tryAddConstraintToAvoidRegSpilling | Schedule constraint | Add scheduling constraints to avoid spills (see Resource Constraint Builder and RRT). |
These names are part of the stable surface — external timing reports and callback integrations match on them verbatim. TileASSliceAndFuse and TileASGenerateSchedule are the two most expensive scopes on a typical O2 compile; together they dominate the wall-clock budget for any schedule-intensive kernel. The DumpTraceImpl::run scope only opens when schedule-trace-file is set in TileIRPipelineOptions.
⚡ QUIRK — typo-stable scope names Several scopes carry their upstream typos verbatim.
tryAddConstraintToAvoidRegSpilling is the canonical example: the spelling is the cross-build contract, not a guideline. Downstream timing-report ingestion matches on this exact string. A reimplementation that silently corrects the spelling breaks every log-scraping pipeline that consumes the timing tree.
Scope Algorithm
Instrumentation scopes are exception-safe and nest correctly through an RAII helper. The exit always fires, even when the body throws — the chain's runAfterPassFailed hooks are the failure path.
```cpp
class ScopeGuard {
public:
  ScopeGuard(PassInstrumentor *p, StringRef name)
      : p_(p), token_(p ? p->enter(name) : INVALID_TOKEN) {}

  ~ScopeGuard() {
    // The second argument reports whether the scope is closing normally.
    if (p_) p_->exit(token_, std::uncaught_exceptions() == 0);
  }

private:
  PassInstrumentor *p_;
  ScopeToken token_;
};

// Caller side:
{
  ScopeGuard g(instr, "TileASGenerateSchedule");
  if (failed(generate_schedule_constraints(...))) return failure();
  // ScopeGuard destructor fires runAfterPass/runAfterPassFailed as appropriate.
}
```
When no instrumentation handler is installed enter returns INVALID_TOKEN and exit is a single integer compare against that sentinel — entering and exiting a scope costs roughly the same as a function call's prologue and epilogue.
⚡ QUIRK — scope tokens are not pointer-comparable Scope tokens are opaque integers, not pointers into the hook chain. Two independent scopes can return the same token after one closes — the chain reuses retired slots. Code that tries to memoize the token across multiple scope-aware components must use the scope name, not the token, as the key.
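Slot reuse is the reason a token cannot serve as a long-lived key. The sketch below models one way such a table could behave; ScopeTokenTable and its free-list policy are assumptions made for illustration, not the layout recovered from the binary.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token table that hands retired slots back out.
class ScopeTokenTable {
public:
  using Token = std::uint32_t;

  // Open a scope: reuse a retired slot if one exists, else grow the table.
  Token enter(const std::string &name) {
    if (!free_.empty()) {
      Token token = free_.back();
      free_.pop_back();
      names_[token] = name;
      return token;
    }
    names_.push_back(name);
    return static_cast<Token>(names_.size() - 1);
  }

  // Close a scope: the slot becomes available for the next enter().
  void exit(Token token) {
    assert(token < names_.size() && "unknown token");
    free_.push_back(token);
  }

  // Name currently stored in a slot; only meaningful while the scope is open.
  const std::string &nameOf(Token token) const {
    assert(token < names_.size() && "unknown token");
    return names_[token];
  }

private:
  std::vector<std::string> names_;
  std::vector<Token> free_;
};
```

In this model, opening TileASGenerateSchedule, closing it, and then opening TileASSliceAndFuse yields the same token value twice for two unrelated scopes. A cache keyed on the token would conflate them, while a cache keyed on the scope name would not.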
Pass Timing Report Format
--mlir-timing enables a TimingInstrumentation hook that accumulates per-scope wall time and user+system time. The hook prints its report at process exit through an llvm::Timer-shaped grammar. Two display modes and two output formats are available:
| mlir-timing-display | mlir-output-format | Report grammar |
|---|---|---|
| list (default) | text (default) | Flat list of `Wall time |
| tree | text | Nested tree, indentation per scope depth; preserves pass-manager nesting. |
| list | json | JSON array of {wall, user_system, name} triples. |
| tree | json | JSON tree of {wall, user_system, name, children} nodes. |
The header strings are stable across builds and are matched by downstream log-scrapers: ---User Time---, ----User Time----, ---Wall Time---, ----Wall Time---- ----Name----, and ---User+System--. The leading-whitespace count is part of the contract.
===-------------------------------------------------------------------------===
Pass execution timing report
===-------------------------------------------------------------------------===
Total Execution Time: 4.812 seconds
---Wall Time--- ---User+System-- Name
2.013 (41.8%) 2.009 (41.7%) TileASGenerateSchedule
1.144 (23.8%) 1.143 (23.8%) TileASSliceAndFuse
0.401 ( 8.3%) 0.401 ( 8.3%) Canonicalizer
...
4.812 (100.0%) 4.811 (100.0%) Total
The list-mode report's percentage column is computed against the total, which is the root scope's wall time, not the sum of children — Canonicalizer invocations nested under TileASGenerateSchedule contribute to both rows. The tree mode disambiguates by showing the percentages relative to each parent.
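The double counting in list mode follows directly from how the percentages are defined. The sketch below uses a hypothetical TimingNode type whose wall time includes its children, and computes both views as described above; it does not reproduce the real report printer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical timing node; `wall` is in seconds and includes all children.
struct TimingNode {
  std::string name;
  double wall;
  std::vector<TimingNode> children;
};

// List mode: every row is a percentage of the root scope's wall time.
inline double percentOfRoot(const TimingNode &root, const TimingNode &node) {
  assert(root.wall > 0.0);
  return 100.0 * node.wall / root.wall;
}

// Tree mode, as described above: a percentage of the direct parent's wall time.
inline double percentOfParent(const TimingNode &parent, const TimingNode &child) {
  assert(parent.wall > 0.0);
  return 100.0 * child.wall / parent.wall;
}

// Flatten every non-root scope into its list-mode percentage.
inline void collectListPercents(const TimingNode &root, const TimingNode &node,
                                std::vector<double> &out) {
  for (const TimingNode &child : node.children) {
    out.push_back(percentOfRoot(root, child));
    collectListPercents(root, child, out);
  }
}
```

For a 4.0 s root with two 2.0 s children, one of which contains a 0.5 s canonicalizer run, the list-mode rows are 50%, 12.5%, and 50%. They add up to 112.5% because the nested 0.5 s is counted in its own row and again in its parent's row. Relative to its parent, the same canonicalizer run is 25%.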
⚡ QUIRK — typo-preserved enum value The
--mlir-timing-display=tree description string in the binary reads "display the results ina with a nested tree view"; the dropped 'i' after "in" is preserved verbatim from upstream MLIR. The cl::opt help text matches the upstream Pass.cpp byte-for-byte. Do not "fix" this when reimplementing; downstream regression suites match the typo.
--enable-statistics and --stats-json enable a parallel statistics pass that walks every registered llvm::Statistic after compilation and prints counters and frequencies. This surface is independent of the timing instrumentation — a build can have timing on while statistics are off, or vice versa.
IR-Printing Instrumentation
The IR-printer hook intercepts the same pass-execution callbacks the timing hook uses. Where the timing hook accumulates a duration, the IR-printer emits a textual MLIR snapshot of the anchor operation either before or after the pass body runs. The hook gates its emission on a per-pass filter, a per-anchor scope, and a change detector. The full surface partitions into eight knobs the user enables through MLIRContext::setPrintIR… or through the corresponding driver flags.
| Knob | Default | Effect |
|---|---|---|
| --mlir-print-ir-before-all | off | Snapshot before every pass that runs. |
| --mlir-print-ir-after-all | off | Snapshot after every pass that runs. |
| --mlir-print-ir-after-change | off | Suppress the after-snapshot unless the pass actually mutated the anchor. |
| --mlir-print-ir-after-failure | off | Snapshot only when a pass returns failure. |
| --mlir-print-ir-module-scope | off | Use the enclosing builtin.module as the snapshot root rather than the anchor. |
| filter-print-funcs=<name> | empty | Limit the per-function snapshot to named functions. |
| --mlir-elide-elementsattrs-if-larger=<n> | per-MLIR | Truncate large DenseElementsAttr payloads. |
| --mlir-print-skip-regions | off | Skip every region body and print operations alone. |
Each snapshot is bracketed by a verbatim banner: *** IR Dump Before <pass-name> *** or *** IR Dump After <pass-name> ***. The pass name is the registered argument name (e.g., cse, tileas-generate-schedule-constraints). When --mlir-print-ir-module-scope is enabled, the banner is followed by the full module text; without module-scope, only the anchor operation is printed, which on a multi-kernel module can suppress context that the user actually needs to see (see Debugging and Introspection — IR Printing for the user-facing recipe).
```cpp
void IRPrinterInstrumentation::runAfterPass(Pass *pass, Operation *anchor) {
  // Gate on the per-pass filter and, if requested, on the change detector.
  if (suppressed_by_filter(pass)) return;
  if (require_change_ && !pass_changed_anchor(anchor)) return;

  // Module scope widens the snapshot root from the anchor to its module.
  Operation *root = use_module_scope_ ? containing_module(anchor) : anchor;
  raw_ostream &os = printer_stream();
  os << "*** IR Dump After " << pass->getArgument() << " ***\n";

  // Translate the process-wide printing knobs into per-snapshot flags.
  OpPrintingFlags flags;
  if (print_debug_info_) flags.enableDebugInfo();
  if (print_op_generic_) flags.printGenericOpForm();
  if (skip_regions_) flags.skipRegions();
  if (elide_threshold_ >= 0) flags.elideLargeElementsAttrs(elide_threshold_);

  root->print(os, flags);
  os << "\n";
}
```
The pass_changed_anchor predicate compares a hash captured at runBeforePass against a hash captured at runAfterPass. Hash collisions are not handled — an IR change that produces an identical hash will appear as "no change" to the change detector, but the failure mode is conservative (snapshots are suppressed, never duplicated).
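The before/after comparison reduces to an inequality on two fingerprints. The sketch below assumes the fingerprint is a std::hash of the printed IR text; the source only says that a hash is captured at each callback, so the hashed representation and the names fingerprint and passChangedAnchor are assumptions of this example.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical fingerprint of the anchor: a hash over its printed form.
inline std::size_t fingerprint(const std::string &printedIR) {
  return std::hash<std::string>{}(printedIR);
}

// The detector reports a change only when the two fingerprints differ.
// Two different IR texts that collide on the same hash therefore read as
// "no change", which suppresses the snapshot rather than duplicating it.
inline bool passChangedAnchor(std::size_t before, std::size_t after) {
  return before != after;
}
```

A pass that leaves the printed text untouched always produces equal fingerprints, so the snapshot is skipped. The converse does not hold, which is the collision caveat stated above.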
The AsmPrinter knobs that govern the textual form of the printed IR are documented under Dialect Asm-Printer Status — Alias Resolution; they include mlir-print-debuginfo, mlir-print-local-scope, mlir-print-op-generic, mlir-print-skip-regions, mlir-print-unique-ssa-ids, mlir-print-value-users, and mlir-use-nameloc-as-prefix. Each one is a process-wide cl::opt that the IRPrinter hook reads to construct its OpPrintingFlags per snapshot.
⚡ QUIRK —
print-changed requires a baseline The IRPrinter only emits an "after" snapshot under --mlir-print-ir-after-change if runBeforePass ran first and stashed a baseline. If both --mlir-print-ir-after-change and --mlir-print-ir-after-failure are passed without --mlir-print-ir-before-all, the change detector has no baseline to compare against and falls back to "always emit after a failure." This is invisible in the help text, but it is the actual behavior any reproducer must match.
MLIR Actions
Actions are a finer-grained surface than pass instrumentation. Where instrumentation brackets pass-level work, an action brackets a single transformation event — typically one greedy-rewrite iteration or one pattern application — and gives a registered handler the option to inspect, skip, log, or replace the work. Three action types ship in the binary:
| Action type | Class | Fires per |
|---|---|---|
| mlir::ApplyPatternAction | RTTI string in .rodata | Single rewrite-pattern invocation inside the greedy or dialect-conversion driver. |
| mlir::GreedyPatternRewriteIteration | RTTI string in .rodata | One sweep of the greedy worklist. |
| mlir::SetMaxRegisterActionAttr | RTTI string in .rodata | NVVM setmaxnreg rewrite at NVVM lowering time. |
An action has an identity, a tag, and an optional payload pointer that the handler interprets. A context-level handler installed through MLIRContext::registerActionHandler intercepts every action that runs inside that context:
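```cpp
LogicalResult MLIRContext::executeAction(
    const Action &action, function_ref<void()> work) {
  // Fast path: no handler registered, run the work directly.
  if (!action_handler_) {
    work();
    return success();
  }
  // Otherwise the handler decides whether and when to call the closure.
  return action_handler_(action, [&] { work(); return success(); });
}
```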
The handler receives the action's IRPrinting state and may print, defer, drop, or stack-trace the work before calling the supplied work closure. Three call sites in the binary thread actions through this dispatch: the greedy-pattern-rewrite driver (one action per iteration plus one per pattern attempt), the dialect-conversion driver (one action per attempted conversion), and the NVVM lowering pass (one action per setmaxnreg rewrite).
The action-record dump format begins with the verbatim banner >> Action Record . The handler then prints Action: <name> followed by the action's payload formatted by the action class itself. Greedy-rewrite actions print their root op and the matched pattern's RTTI name; dialect-conversion actions print the source-and-target op pair. The handler can also emit Action: cleanup for the deferred-rewrites cleanup pass.
⚡ QUIRK — actions are independent of pass instrumentation A pass with no instrumentation hooks installed still emits actions when a handler is registered. Conversely, a pass with instrumentation but no action handler emits scopes but no action records. The two surfaces share neither a thread nor a buffer — they are wired in parallel through
MLIRContext and PassManager respectively. This is what lets a build run pass timing without the per-pattern noise of action tracing, or trace patterns without timing overhead.
Opt-Bisect
opt-bisect-limit is LLVM's bisect-by-pass-index harness. Set to a positive integer, it caps the number of passes that run; passes above the cap are skipped, and the IR captured at the cap is the bisect output. Tileiras inherits the LLVM-side instrumentation hook unchanged. Four cl::opts control its behavior:
| Knob | Type | Default | Effect |
|---|---|---|---|
| opt-bisect-limit | int | INT_MAX | Skip every pass whose index exceeds the limit. |
| opt-bisect-funcs | string list | empty | Restrict bisection to a comma-separated function list. |
| opt-bisect-verbose | bool | false | Print each pass index and decision to stderr. |
| opt-bisect-print-ir-path | string | empty | Write the IR to this path when the limit is reached. |
Opt-bisect runs in the same instrumentation chain as the IR printer and the timer. Its runBeforePass hook increments a counter, compares against the limit, and either lets the pass through or replaces it with a no-op. The reduced search space — pass-by-pass binary search instead of textual diff — is what makes opt-bisect viable on the 200+-pass tileiras pipeline.
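The counter-and-compare step, and the binary search a user runs on top of it, can be modelled in a few lines. BisectGate and firstBadPass are hypothetical names for this sketch, and the 1-based pass index is an assumption; the INT_MAX default is the one listed in the table above.

```cpp
#include <cassert>
#include <climits>

// Hypothetical model of the runBeforePass gate: count passes, compare
// against the limit, and report whether the pass may run.
class BisectGate {
public:
  explicit BisectGate(int limit = INT_MAX) : limit_(limit) {}

  // Call once per pass. Indices start at 1 in this sketch.
  bool shouldRun() { return ++index_ <= limit_; }

  int lastIndex() const { return index_; }

private:
  int limit_;
  int index_ = 0;
};

// Binary search for the smallest limit at which the failure reproduces.
// `failsAt(limit)` stands for "compile with opt-bisect-limit=limit and
// check the result"; it must be false below some index and true from that
// index upward, and true at `passCount`.
template <typename FailsAt>
int firstBadPass(int passCount, FailsAt failsAt) {
  assert(passCount >= 1);
  int lo = 1, hi = passCount;
  while (lo < hi) {
    int mid = lo + (hi - lo) / 2;
    if (failsAt(mid)) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}
```

On a 200-pass pipeline the search needs at most eight compiles to isolate the first bad pass, where a linear walk could need up to 200.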
A related but independent surface is debug-counter, which gates individual transformations within a pass by name (e.g., --debug-counter=dce-counter=10 lets DCE run for ten elimination attempts before short-circuiting). Two knobs apply:
| Knob | Type | Default | Effect |
|---|---|---|---|
| print-debug-counter | bool | false | Print the post-run counter summary. |
| debug-counter-break-on-last | bool | false | Trap into the debugger on the final allowed attempt. |
The warning "Requested --debug-counter in LLVM build without assertions. This is a no-op." is emitted at startup when the build does not carry the assertion-time counter table. The tileiras release build is one such build, so debug-counter has no effect there beyond that warning unless the binary is rebuilt with assertions.
Crash Reproducer
When a pass aborts or fails late in the pipeline, the pass manager can emit a self-contained reproducer — the IR captured immediately before the failing pass plus a --pass-pipeline= string that names every pass run up to the failure. The reproducer is written when the pipeline builder calls enableCrashReproducerGeneration with a path; the file path then surfaces as the verbatim diagnostic reproducer generated at \
Crash-reproducer output has two flavors:
| Flavor | When | What it captures |
|---|---|---|
| Local | Default | The single pass that failed plus the IR seen by that pass; minimal context. |
| Pipeline | genLocalReproducer=false | The full pipeline from the failure backward to the last cse or canonicalize checkpoint; replays the path that produced the failing IR. |
Neither flavor captures host-side cl::opt state — a reproducer built against the binary needs to be re-run with the same CLI flags. The "Reproducer" banner in the captured file is emitted before the IR payload as a textual delimiter.
⚡ QUIRK — reproducer paths are per-PassManager, not per-context Each
PassManager instance carries its own reproducer-path setting. Two pass managers running in the same MLIRContext can have different reproducer paths, or one can have it enabled and the other not. This is intentional: it lets the outer driver capture a reproducer for the device-module pipeline without bloating the host-wrapper pipeline's reproducer with kilobytes of unrelated MLIR.
NVIDIA-Private Hook Points
Four hook points on the instrumentation chain are NVIDIA-private:
- DumpTraceImpl::run: Activated by schedule-trace-file. The hook installs itself on the inner gpu.func (TileAS-stage) adaptor and writes a per-pass scheduler trace in the Chrome chrome://tracing format. Each pass invocation produces one event with name, timestamp, and duration; each scheduler decision produces a child event under its enclosing pass. The trace is closed at process exit through __cxa_atexit.
- schedule-trace-file consumer: Reads the option at pipeline-construction time and refuses to install the hook if the file cannot be opened. The diagnostic failed to legalizeLoopScheduleForMaterialization fires when the legalisation phase rejects a loop the scheduler expected to consume; it is the most common visible scheduler-side failure and rides in the same instrumentation tree.
- TileIR Callbacks: Five __CUDA_TILEIR_* env vars (see Env Vars and Runtime Gates and the TILEIR_CALLBACKS ABI) modify the instrumentation surface at module load time. __CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD registers a host-side pre-load hook that the instrumentation chain notifies before any compilation begins; __CUDA_TILEIR_FUNC_CALLBACKS and __CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE register per-function callbacks that fire on argument-buffer mutation.
- Action attributes: mlir::NVVM::SetMaxRegisterActionAttr is the only NVIDIA-private Action class. It fires when the NVVM lowering rewrites a gpu.func to add the setmaxnreg PTX-level directive; the action wraps the rewrite so an outside observer can log the per-function maxnreg decision without modifying the pass itself.
Verify-Each and Action Composition
The verify-each knob (off by default for release builds, on by default for assert builds) runs the full verify() on the anchor operation between every pair of passes. Verify-each is implemented as a PassInstrumentation hook that fires from runAfterPass; it is in the same chain as the IR printer and the timer, so its cost is roughly proportional to the size of the anchor operation. See Pipeline Invariants and Verifiers — Verifier Layers for the layered verifier model and why between-pass verification catches a class of bugs the explicit verifier passes cannot.
Mixing all five hooks — timer, IR printer, verify-each, opt-bisect, and a custom action handler — is supported by the chain. The interleaving order matters for one specific case: if both --mlir-print-ir-after-all and verify-each are enabled, the IR printer's snapshot is taken before verify-each runs, so a failed verification produces both the post-pass IR snapshot (from the printer hook) and the failure diagnostic (from verify-each). The snapshot reflects the IR that triggered the verification failure, which is exactly what the user wants when bisecting a between-pass invariant violation.
Callback Integration
The same compile instrumentation surface feeds the TileIR callback emission path. Callback emission materialises well-known module symbols and launch-site hooks so a runtime can patch instrumentation at module load time. The host-side symbols __CUDA_TILEIR_CALLBACKS, __CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD, __CUDA_TILEIR_FUNC_CALLBACKS, __CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE, and __CUDA_TILEIR_ON_PRE_LOAD are wired through the same pipeline-builder logic that installs DumpTraceImpl; the driver-level ABI is documented in TILEIR_CALLBACKS ABI, and the format strings the callbacks emit ([TileIR Callback] Argument %d: offset = %ld, size = %ld, [TileIR Callback] CUdeviceptr: %p, [TileIR Callback] DESC_TMA512: ...) are part of the public log format.
Reimplementation Notes
```
build_instrumentation_chain(pm, opts):
    chain = PassInstrumentor()
    if opts.timing_enabled:
        chain.push(TimingInstrumentation(opts.timing_display, opts.timing_format))
    if opts.ir_printing_enabled:
        chain.push(IRPrinterInstrumentation(opts.print_filter, opts.print_flags))
    if pm.verify_each:
        chain.push(VerifyEachInstrumentation(pm))
    if opts.opt_bisect_limit < INT_MAX:
        chain.push(OptBisectInstrumentation(opts.opt_bisect_limit, opts.opt_bisect_funcs))
    if opts.schedule_trace_file:
        chain.push(DumpTraceImpl(opts.schedule_trace_file))
    pm.set_instrumentor(chain)

run_pass(pm, pass, anchor):
    for h in chain.forward(): h.run_before(pass, anchor)
    result = pass.run_on_operation(anchor)
    for h in chain.reverse():
        if failed(result): h.run_after_failed(pass, anchor)
        else: h.run_after(pass, anchor)
    return result
```
The ordering rule: the timer goes first because it must envelop every other hook's cost; the IR printer goes second so its snapshot is taken before verify-each potentially rejects; verify-each goes third; opt-bisect goes fourth so it can short-circuit before the scheduler-trace hook runs; the trace hook goes last so it sees only the passes opt-bisect allows through. Reversing this order silently changes the meaning of every reported number.
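The traversal discipline in the pseudocode above can be exercised with a small Python model. This is a sketch rather than the Tileiras implementation: the RecordingHook class and the hook labels are illustrative stand-ins, and only the forward-before, reverse-after traversal is taken from the pseudocode.

```python
class RecordingHook:
    """Illustrative stand-in for one instrumentation; records when it fires."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def run_before(self, pass_name):
        self.log.append(("before", self.name, pass_name))

    def run_after(self, pass_name):
        self.log.append(("after", self.name, pass_name))

    def run_after_failed(self, pass_name):
        self.log.append(("after_failed", self.name, pass_name))


def run_pass(chain, pass_name, body):
    """Fire before-hooks in push order, run the pass, fire after-hooks in reverse."""
    for hook in chain:
        hook.run_before(pass_name)
    ok = body()
    for hook in reversed(chain):
        if ok:
            hook.run_after(pass_name)
        else:
            hook.run_after_failed(pass_name)
    return ok


log = []
chain = [RecordingHook(n, log) for n in ("timer", "ir-printer", "verify-each")]
run_pass(chain, "canonicalize", lambda: True)
for event in log:
    print(event)
```

Because the timer is pushed first, its before-hook is the first event and its after-hook is the last, which is what lets it envelop the cost of every other hook in this model.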
Cross-References
Driver Entry and Optimization Levels — Serialization Scopes names the two outer scopes (CompileNVVM, SerializeGPUModule) that this page enumerates. Modulo Scheduler and Rau-Style Placement is the scheduler whose stages drive most of the inner scopes. Resource Constraint Builder and RRT is what ResourceConstraintBuilder::tryAddConstraintToAvoidRegSpilling is part of. Debugging and Introspection is the user-facing guide that turns the scope tree and action surface documented here into a workflow for diagnosing pipeline issues. Testing and Observability covers how external test suites consume the timing and snapshot streams. Pass Manager Internals — Threading Model covers the threading contract this chain runs inside. cl::opt Full Catalog — Layer 6 is the canonical catalog for every --mlir-* knob mentioned here.
Pass List by Optimization Level
Abstract
Each optimization level in Tileiras selects a different MLIR-tier pass pipeline. The four levels - O0, O1, O2, O3 - are arranged as a strict superset chain: each level runs everything the previous level ran and then adds passes that justify their compile-time cost. This page lists the passes that each level schedules, explains what the additions buy, and describes the IR shape at each stage boundary. LLVM IR and MachineIR passes that run after the MLIR pipeline are documented under NVPTX Backend Passes.
The reader's working question is "if I build with -O2, what runs, in what order, and what does each pass do?" The tables answer the first two parts; the prose around them answers the third.
Stage Vocabulary
The MLIR pipeline can be read as four stages regardless of opt level:
- Frontend cleanup. Convert the public cuda_tile surface into the alias-aware TileAA form, insert debug scopes, fold trivial operations. The IR after this stage is in TileAA with cuda_tile removed.
- Architecture-aware lowering. Lower TileAA into the operationally-scheduled TileAS dialect; emit host-wrapper metadata; bring in NVGPU-compatible forms. The IR after this stage carries explicit pipes, mutexes, and TMA-ready memory.
- Standard lowering. Convert vector, memref, math, and arithmetic dialects toward LLVM; legalize kernel ABIs; canonicalize and CSE. The IR after this stage is in the LLVM and NVGPU dialects.
- Target finalization. Convert NVGPU to NVVM; attach target metadata; synthesize debug-info scopes; clean and finalize for the NVPTX backend. The IR after this stage is ready for MLIR-to-LLVM translation.
O0 skips all four stages and runs only the verifier. O1 exits after stage 1. O2 exits midway through stage 2. O3 runs all four stages.
O0 - Verify Only
O0 is the validation-only path. No transformation passes run; the pass manager schedules only its built-in verifier, between each parsed-module load and the codegen handoff.
| Order | Pass | Purpose |
|---|---|---|
| 1 | Verifier slots | Check IR validity at pass boundaries. |
The IR shape at O0 is whatever the bytecode reader produced: cuda_tile operations, intact, with no lowering applied. O0 is appropriate when the user wants to round-trip bytecode through the front end without touching it - for example, to confirm that a TileIR producer's output is well-formed. The O0 result is not a valid input to the NVPTX backend; downstream codegen is unreachable from this level because no LLVM dialect ever appears.
O1 - Minimal Lowering
O1 performs the minimum useful TileIR lowering. It clears the public surface and produces a TileAA module that is well-formed for inspection but not yet lowered to anything LLVM understands.
| Order | Pass | Purpose |
|---|---|---|
| 1 | convert-cudatile-to-tileaa | Translate the public cuda_tile surface into TileAA. |
| 2 | Optional snapshot printer | Emit a textual IR snapshot when the selected line-info mode requests it. |
| 3 | tileir-insert-debug-scope | Add debug scopes used by later diagnostics and line-info emission. |
| 4 | canonicalize | Clean simple folds and canonical forms before deeper lowering. |
What O0 -> O1 adds: a single semantic hop (cuda_tile -> TileAA), debug-scope annotations, and the cheap canonicalisation pass. The hop is required because every later stage assumes that cuda_tile operations have been replaced; without it the rest of the pipeline cannot run. Debug-scope insertion is placed early because later passes rely on its scope tree being present, and because synthesising scopes after lowering would require chasing through rewritten operations to find the original locations. Canonicalisation runs last so that the patterns operate on freshly-lowered TileAA and remove the trivial garbage that direct dialect conversion can leave behind.
Invariant: after O1, no cuda_tile operation remains in the module. Cost: a single dialect-conversion pass plus a cheap fold-and-clean. Debuggability: preserved end-to-end; the snapshot printer is the explicit hook for that.
O2 - Default Pipeline
O2 is the default compilation pipeline. It is the lowest level at which Tileiras produces a module that the NVPTX backend can lower, because it brings the IR into the LLVM and NVGPU dialects.
| Order | Pass | Purpose |
|---|---|---|
| 1 | O1 passes | Establish TileAA and clean the frontend IR. |
| 2 | convert-tileaa-to-tileas | Lower architecture-aware TileAA operations to scheduled TileAS forms (see Modulo Scheduler and Rau-Style Placement). |
| 3 | tileir-emit-host-wrapper | Build host-side wrapper metadata and launch glue. |
| 4 | convert-tileas-to-llvm | Lower TileAS memory, control, and async constructs toward LLVM. |
| 5 | cse | Remove redundant values produced by lowering. |
| 6 | Optional snapshot printer | Capture the TileAS/LLVM boundary when the later line-info mode requests it. |
| 7 | convert-tileas-to-nvgpu | Lower remaining target GPU operations to NVGPU-compatible forms. |
What O1 -> O2 adds: three lowering hops (TileAA -> TileAS, TileAS -> LLVM, TileAS -> NVGPU), host-wrapper emission, and one CSE pass. The TileAA-to-TileAS hop is where the modulo scheduler runs: it builds resource constraints, computes the placement, and stores the result as a ScheduleAnalysis. The TileAS-to-LLVM hop materialises pipes and mutexes against that schedule, lowers memory operations to LLVM-dialect ones, and converts async constructs to their LLVM-dialect equivalents. The TileAS-to-NVGPU hop catches the architecture-specific operations (asynchronous copies, TMA descriptors, named barriers) that need NVGPU-dialect shapes before NVVM lowering. Host-wrapper emission produces the launch-side glue the host runtime expects. CSE runs once after the heaviest lowering because lowering patterns frequently produce duplicate index or offset computations.
The order is meaningful: TileAA-to-TileAS must precede every other hop in the stage because everything downstream assumes the schedule already exists. Host-wrapper emission has to land before the TileAS-to-LLVM conversion erases TileAS launch operations. The optional snapshot lands between TileAS-to-LLVM and TileAS-to-NVGPU so users can inspect the intermediate state with both LLVM-style and NVGPU-style operations visible.
Invariants: after O2, no TileAA or TileAS operation remains; the module is in the LLVM and NVGPU dialects with the scheduler's decisions baked into pipe and mutex values. Cost: the scheduler is the dominant pass; CSE is cheap. Debuggability: still preserved; the snapshot point is the natural inspection window for users diagnosing lowering bugs.
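The superset relationship between the levels covered so far can be written down as data. The sketch below is illustrative only: the pass names are copied from the O1 and O2 tables, the optional snapshot printers are left out, and LEVEL_ADDITIONS and pipeline_for are hypothetical names, not Tileiras symbols.

```python
# Passes each level adds on top of the previous one (from the tables above).
# Optional snapshot printers are omitted because they run only on request.
LEVEL_ADDITIONS = [
    ("O0", []),
    ("O1", ["convert-cudatile-to-tileaa",
            "tileir-insert-debug-scope",
            "canonicalize"]),
    ("O2", ["convert-tileaa-to-tileas",
            "tileir-emit-host-wrapper",
            "convert-tileas-to-llvm",
            "cse",
            "convert-tileas-to-nvgpu"]),
]


def pipeline_for(level):
    """Concatenate the additions of every level up to and including `level`."""
    passes = []
    for name, additions in LEVEL_ADDITIONS:
        passes.extend(additions)
        if name == level:
            return passes
    raise ValueError(f"unknown optimization level: {level}")


for level in ("O0", "O1", "O2"):
    print(level, pipeline_for(level))
```

Each level's list is a prefix of the next one, which is the property the abstract calls a strict superset chain; O3 would extend the same list with the passes from its own table.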
O3 - Full Pipeline
O3 adds the full conversion and finalisation stack. It is the level the production driver uses by default for non-debug builds and the only level that exercises every NVVM target attachment.
| Order | Pass | Purpose |
|---|---|---|
| 1 | O2 passes | Run the default lowering sequence. |
| 2 | tileir-verify-ops-analysis | Check TileIR operation invariants before they are erased. |
| 3 | host-device-assert-enable | Enable host/device assertion handling when configured. |
| 4 | O3 debug-scope insertion | Insert the second debug-scope pass used by the full pipeline. |
| 5 | tileir-gpu-module-prepare | Prepare the gpu.module for final lowering. |
| 6 | canonicalize and cse | Clean before conversion to LLVM. |
| 7 | unspecialized-pipeline | Apply the unspecialized pipeline path when selected. |
| 8 | test-convert-to-llvm | Exercise the conversion-interface stack for selected dialects. |
| 9 | tileir-legalize-llvm-kernel | Normalize kernel entry ABI before target conversion. |
| 10 | tileir-finalize-llvm-kernel | Finalize kernel argument and metadata conventions. |
| 11 | convert-to-llvm | Convert standard MLIR dialects to LLVM dialect. |
| 12 | canonicalize | Clean after the broad LLVM conversion. |
| 13 | convert-nvgpu-to-nvvm | Lower NVGPU operations to NVVM operations. |
| 14 | convert-vector-to-llvm | Lower vector dialect operations. |
| 15 | convert-math-to-funcs | Route math operations through callable/library forms where required. |
| 16 | arith-expand | Expand arithmetic operations unsupported by later conversion. |
| 17 | convert-memref-to-llvm | Lower memref types and operations to LLVM-compatible forms. |
| 18 | synthesize-debug-info-scopes | Create final debug-info scopes for line tables. |
| 19 | convert-target-to-nvvm | Attach NVVM target metadata and libNVVM options. |
| 20 | canonicalize and cse | Clean the post-NVVM IR. |
| 21 | tileir-post-nvvm-finalize | Make the module ready for LLVM/NVPTX serialization. |
What O2 -> O3 adds: invariant verification, the full standard-dialect-to-LLVM conversion stack, the NVGPU-to-NVVM and target-NVVM hops, the kernel-ABI legalisation pair, debug-info-scope synthesis, and a final cleanup pair. The block from tileir-verify-ops-analysis through tileir-gpu-module-prepare exists to make late lowering safe: invariants are checked while TileIR-specific operations are still present, asserts are wired so device-side assert calls survive lowering, and the gpu.module is reshaped to the form the standard MLIR lowering machinery expects. The convert-to-llvm block (passes 11 through 17) covers the standard MLIR dialects - vector, memref, math, arith - which O2 does not touch because the default pipeline does not need them; O3 includes them to handle the full surface a TileIR producer can generate. The legalize/finalize kernel-ABI pair is what makes the kernel entry function look like a CUDA kernel rather than a generic LLVM function: argument layout, address-space attributes, calling convention, and nvvm.kernel metadata all land here. synthesize-debug-info-scopes produces line-table-quality debug info (LineTablesOnly in the configured mode) that the backend can lift directly into PTX .loc directives. convert-target-to-nvvm attaches the libNVVM option blob and the target-triple metadata the NVPTX backend reads to choose its subtarget. The closing canonicalise/CSE pair and tileir-post-nvvm-finalize ensure the module is in the exact shape the LLVM/NVPTX translator expects.
Order matters in this stage too. tileir-legalize-llvm-kernel must precede convert-to-llvm because the broad conversion erases the very TileIR markers the legaliser depends on. convert-vector-to-llvm must precede convert-memref-to-llvm because vector lowering can introduce memref accesses but memref lowering does not introduce vector forms. arith-expand runs before convert-memref-to-llvm because some memref index expressions become arith operations that the lowering then expects to find in expanded form. convert-target-to-nvvm is the final lowering step because it is what binds the module to a specific sm_* and to specific libNVVM options - anything that runs after it would have to be target-aware.
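The precedence rules in the paragraph above can be checked mechanically against the table order. This is a minimal sketch: O3_ORDER lists rows 9 through 19 of the O3 table, MUST_PRECEDE encodes the three pairwise rules stated above, and both names (and the violated helper) are hypothetical.

```python
O3_ORDER = [
    "tileir-legalize-llvm-kernel",
    "tileir-finalize-llvm-kernel",
    "convert-to-llvm",
    "canonicalize",
    "convert-nvgpu-to-nvvm",
    "convert-vector-to-llvm",
    "convert-math-to-funcs",
    "arith-expand",
    "convert-memref-to-llvm",
    "synthesize-debug-info-scopes",
    "convert-target-to-nvvm",
]

# (earlier, later) pairs taken from the ordering paragraph.
MUST_PRECEDE = [
    ("tileir-legalize-llvm-kernel", "convert-to-llvm"),
    ("convert-vector-to-llvm", "convert-memref-to-llvm"),
    ("arith-expand", "convert-memref-to-llvm"),
]


def violated(order, constraints):
    """Return the constraints that `order` does not satisfy."""
    position = {name: i for i, name in enumerate(order)}
    return [(a, b) for a, b in constraints if position[a] >= position[b]]


print(violated(O3_ORDER, MUST_PRECEDE))

# convert-target-to-nvvm should be the last convert-* step in the list.
last_convert = [p for p in O3_ORDER if p.startswith("convert-")][-1]
print(last_convert)
```

A check like this is cheap to run whenever a pipeline is reordered, and it fails loudly instead of letting a swapped pair change lowering behaviour unnoticed.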
Invariants: after O3, the module contains only LLVM and NVVM dialect operations, the kernel ABI is in NVPTX form, line-table debug info is present, and target metadata is attached. The TileIR verifier has confirmed pre-lowering invariants. Cost: the dominant passes are the modulo scheduler (inherited from O2) and the broad LLVM conversion. Debuggability: degraded relative to O2 because the kernel ABI has been rewritten and most TileIR operations are gone; the early snapshot printer and the tileir-verify-ops-analysis pass are the standard inspection points.
Warp-Specialised Adders
Warp-specialised scheduling is layered on top of the base tier when pipeline-strategy=warp-specialize. The adder replaces the modulo-schedule stage with a warp-specialisation pipeline that partitions the loop body across agents.
| Variant | Trigger | Purpose |
|---|---|---|
| Light | rrt-size-threshold=0 | Insert boundaries, run light warp-specialization rewrites, and add barriers. |
| Heavy | rrt-size-threshold nonzero | Prepare scheduling, specialize agents, check register budgets, and compact layouts. |
The light variant is used when the resource reservation table would dominate compile time; it produces a correct but conservative schedule. The heavy variant is the normal path for kernels where modulo scheduling, register-pressure checks, and layout canonicalisation determine final quality. Both variants slot into stage 2 (architecture-aware lowering); the choice is independent of opt level above O1.
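The trigger column of the table reduces to a two-way decision. The following sketch assumes the option values arrive as a string and an integer; the function name is hypothetical and the code only mirrors the table, not the driver's actual option handling.

```python
def select_warp_specialize_variant(pipeline_strategy, rrt_size_threshold):
    """Return 'light', 'heavy', or None when warp specialisation is not requested."""
    if pipeline_strategy != "warp-specialize":
        return None
    # Per the table: a zero threshold selects the light variant,
    # any nonzero threshold selects the heavy variant.
    return "light" if rrt_size_threshold == 0 else "heavy"


print(select_warp_specialize_variant("warp-specialize", 0))
print(select_warp_specialize_variant("warp-specialize", 64))
```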
Handoff to LLVM/NVPTX
The pass list above ends at the MLIR-to-LLVM/NVVM boundary. After that, the backend runs LLVM IR and MachineIR passes such as NVVM reflection, address-space optimisation, argument lowering, aggregate-copy lowering, image-handle replacement, and NVPTX instruction cleanup. The LLVM-tier pipeline is documented under NVPTX Backend Passes, which describes each pass at the same level of detail as the entries above.
Cross-References
Driver Entry and Optimization Levels describes how the requested tier turns into the segments listed above. Pipeline Options Mapping maps each option a user can set to the consuming pass in this list. Pipeline Invariants and Verifiers covers the verifier passes interleaved between the lowerings. Performance and Cost Model explains the compile-time and runtime trade-offs the four levels expose.
LLVM PassBuilder Registry
Abstract
Tileiras embeds an LLVM PassBuilder registry that resolves textual pass names to factory callbacks. It is the mechanism by which strings inside --pass-pipeline="..." arguments turn into pass instances, and it is the same mechanism the driver uses when the inner serialization stage builds its LLVM optimization pipeline from a named OptimizationLevel. The registry holds 551 entries: 478 reached through templated getTypeName<T>() keys derived from C++ pass class names, 66 reached through bare string keys for passes that opt out of the templated path, 5 pipeline aliases that expand to multi-pass sequences, and 2 specials for analysis printing and verification. The registrar populates the global StringMap at static-init time before any compilation begins, so the table is read-only during compile and the textual resolver runs as a single hash lookup.
The registry is a menu, not a schedule. An entry being present means a user could ask for the pass by name; it does not mean the default Tileiras pipeline runs that pass.
Registry Families
| Family | Examples | Role |
|---|---|---|
| Module analyses | call graph, profile summary, verifier analysis | Query module-wide facts. |
| Module transforms | inlining, internalization, global optimization | Rewrite whole modules. |
| CGSCC passes | inliner and call-graph transforms | Optimize call-graph components. |
| Function analyses | alias analysis, dominators, loops, scalar evolution | Query function-local facts. |
| Function transforms | instcombine, GVN, vectorization, NVVM cleanup | Rewrite LLVM IR functions. |
| Loop passes | LICM, rotate, unswitch, unroll | Rewrite loops. |
| Machine passes | register allocation, scheduling, MIR cleanup | Rewrite MachineIR. |
NVIDIA-Specific Entries
| Pass name | Stage | Purpose |
|---|---|---|
| check-gep-index | Module | Validate constant GEP indices after frontend cleanup. |
| check-kernel-functions | Module | Normalize kernel and non-kernel function linkage (see Kernel/CDP/Inline/Pretreat — Kernel Identity). |
| cnp-launch-check | Module | Validate CUDA dynamic-parallelism launch calls (see Kernel/CDP/Inline/Pretreat — CDP Launch Expansion). |
| ipmsp | Module | Specialize generic-pointer callees by memory space. |
| nv-early-inliner | Module | Run an NVIDIA-tuned early inliner (see Kernel/CDP/Inline/Pretreat). |
| nv-inline-must | Module | Force-inline functions whose ABI cannot survive as calls (see Kernel/CDP/Inline/Pretreat). |
| nvvm-pretreat | Module | Canonicalize raw NVVM IR before verification and optimization (see Kernel/CDP/Inline/Pretreat). |
| nvvm-verify | Module | Check NVVM kernel launches and parameter-space usage (see NVVM IR Verifier). |
| printf-lowering | Module | Lower device printf to the vprintf ABI (see printf Lowering and vprintf). |
| select-kernels | Module | Restrict processing to selected kernels for diagnostics/testing. |
| nvvm-aa | Function analysis | Provide address-space-aware alias information. |
| kernel-info | Function | Emit per-kernel diagnostic metrics. |
| nvvm-peephole-optimizer | Function | Simplify NVVM IR and address arithmetic before selection (see Peephole MIR and Image Handles). |
| propagate-alignment | Function | Propagate alignment facts through memory operations. |
| reuse-local-memory | Function | Reuse non-overlapping local-memory slots. |
| memory-space-opt | Function | Infer and rewrite concrete address spaces (see Memory-Space-Opt and Process-Restrict). |
| lower-aggr-copies | Function | Expand unsupported aggregate memory intrinsics (see lower-args, lower-aggr-copies, lower-struct-args). |
| lower-struct-args | Function | Lower by-value struct kernel parameters (see lower-args, lower-aggr-copies, lower-struct-args). |
| process-restrict | Function | Materialize __restrict__ alias metadata (see Memory-Space-Opt and Process-Restrict). |
The same registry also exposes stock LLVM names such as default, thinlto, lto, verify,
inline, function-simplification, and machine-pipeline passes like greedy, regallocfast,
machine-scheduler, and virt-reg-rewriter. Those names are useful for textual LLVM pipeline
experiments, but Tileiras' normal MLIR pipeline reaches the NVPTX backend through its own target
handoff rather than through an arbitrary user-supplied LLVM text pipeline.
Registry Depth
The registry breaks down into four disjoint groups. The 478 templated entries dominate; they exist because LLVM's pass framework derives a string from the C++ class name through getTypeName<T>() and registers the factory under that string. The 66 naked-class entries are passes whose registered name is hand-written (usually because the templated name would be unwieldy or would collide with a stock LLVM pass). The 5 aliases are short names that expand to multi-pass sequences. The 2 specials wire up the analysis-print and verify infrastructure that LLVM's pass parser depends on.
| Group | Count | Examples | Why this group |
|---|---|---|---|
| getTypeName<T>() keys | 478 | InstCombinePass, LICMPass, MachineCSEPass | Default registration; key derived from C++ type name. |
| Naked-class string keys | 66 | printf-lowering, nvvm-pretreat, select-kernels | Manually named NVIDIA passes plus a handful of upstream passes that opt out of the templated key. |
| Pipeline aliases | 5 | default<O2>, thinlto<O3>, lto<O2> | Expand to multi-pass strings inside the parser. |
| Specials | 2 | print<analysis>, verify | Wire up the analysis-print and verify infrastructure. |
| Total | 551 | | |
The 478 + 66 + 5 + 2 = 551 sum is the registry's full depth. There is no overflow path; an unknown name is a parser error, not a fallback to a generic factory.
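The arithmetic can be reproduced directly from the table. The dictionary below is only a restatement of the group counts given above; REGISTRY_GROUPS is a name chosen for this sketch.

```python
# Group counts copied from the Registry Depth table.
REGISTRY_GROUPS = {
    "getTypeName<T>() keys": 478,
    "naked-class string keys": 66,
    "pipeline aliases": 5,
    "specials": 2,
}

total = sum(REGISTRY_GROUPS.values())
print(total)  # 551
```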
Static-Init Registrar
The registrar is a single function that calls PassBuilder::registerPipelineParsingCallback once per entry. It runs at static initialization time because each RegisterPass<T> global constructor inserts into the same StringMap. The map's load factor and bucket count are sized at first use; the registrar uses StringMap::insert which is O(1) amortized. After static-init the map is never mutated.
void register_passbuilder_entries(PassBuilder &pb) {
  register_templated_module_passes(pb);   // contributes to the 478
  register_templated_function_passes(pb); // contributes to the 478
  register_templated_loop_passes(pb);     // contributes to the 478
  register_templated_machine_passes(pb);  // contributes to the 478
  register_named_nvvm_passes(pb);         // contributes to the 66
  register_named_nvptx_passes(pb);        // contributes to the 66
  register_pipeline_aliases(pb);          // the 5
  register_print_and_verify(pb);          // the 2
}
The split between templated and named registration is what allows NVIDIA-private passes to be interleaved with upstream LLVM passes in the same registry: upstream passes register under their templated keys; NVIDIA passes register under hand-chosen names from the nv-*, nvvm-*, cnp-*, check-*, and lower-* families.
Textual Resolution
The parser resolves pass names in the context of the current pipeline level. The same string can appear at module, CGSCC, function, loop, and machine-function levels without collision because the parser only consults the registry slice for the manager it is currently constructing.
Expected<PassPlugin> parse_pass(PipelineLevel level, StringRef text) {
  auto [name, options] = split_name_and_options(text); // "pass{key=value,...}"
  auto *info = lookup_registry_for_level(level, name);
  if (!info) {
    return make_error<StringError>(
        "unknown pass '" + name + "' at " + describe_level(level));
  }
  PassOptions parsed_opts;
  if (failed(parse_option_block(options, info->schema, parsed_opts))) {
    return make_error<StringError>(
        "invalid option block for pass '" + name + "'");
  }
  return info->construct(parsed_opts);
}
Name matching is case-sensitive and exact on the name portion. The optional {key=value,...} block is parsed against the per-pass schema after the lookup succeeds; an unrecognized option key is rejected rather than silently ignored.
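The resolution rules (level-scoped lookup, exact case-sensitive names, strict option keys) can be tried out with a toy model. Everything below is a sketch under stated assumptions: the REGISTRY contents are a tiny hypothetical slice that reuses pass names from the NVIDIA-specific table, the max-bytes option key is invented for illustration, and PassParseError is not a Tileiras type.

```python
class PassParseError(ValueError):
    pass


# Hypothetical registry slice: level -> pass name -> accepted option keys.
# The "max-bytes" key is invented for this sketch.
REGISTRY = {
    "module": {"printf-lowering": set(), "nvvm-verify": set()},
    "function": {"memory-space-opt": set(), "lower-aggr-copies": {"max-bytes"}},
}


def split_name_and_options(text):
    """Split 'pass{key=value,...}' into the pass name and an options dict."""
    if not text.endswith("}"):
        return text, {}
    name, brace, block = text[:-1].partition("{")
    if not brace:
        raise PassParseError(f"unbalanced option block in '{text}'")
    options = {}
    for item in filter(None, block.split(",")):
        key, eq, value = item.partition("=")
        if not eq:
            raise PassParseError(f"option '{item}' is missing '=value'")
        options[key] = value
    return name, options


def parse_pass(level, text):
    name, options = split_name_and_options(text)
    # Only the slice for the current level is consulted; matching is exact.
    schema = REGISTRY.get(level, {}).get(name)
    if schema is None:
        raise PassParseError(f"unknown pass '{name}' at {level} level")
    if set(options) - schema:
        # Unrecognised keys are rejected rather than silently ignored.
        raise PassParseError(f"invalid option block for pass '{name}'")
    return name, options


print(parse_pass("function", "lower-aggr-copies{max-bytes=128}"))
```

In this model, asking for memory-space-opt at module level, or for Memory-Space-Opt at function level, raises the same unknown-pass error, because neither lookup finds an exact key in the slice being consulted.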
Relationship to TileIR Passes
TileIR MLIR passes are scheduled by the Tileiras pipeline builder against MLIR's own registry, which is independent of this LLVM PassBuilder registry. The LLVM registry is consulted only after the inner pipeline reaches LLVM/NVVM IR or when a user supplies a textual LLVM pipeline through --passes= to the embedded NVPTX backend. A pass appearing here is not scheduled by Tileiras's default flow unless populate_pipeline or run_llvm_passbuilder_pipeline references it explicitly.
Cross-References
Compilation Pipeline Overview — Serialization Boundary describes where in the outer/inner split this registry is consulted. Pipeline Options Mapping names the options that the inner LLVM pipeline reads through the registry's factories. Pass List by Optimization Level — Handoff to LLVM/NVPTX is the MLIR-tier pass list that runs before this registry comes into play. NVPTX Backend Passes overview is where the NVIDIA-specific entries above are described pass by pass.
TileAS Async and Pipeline Family
Abstract
The async pipeline family turns token-ordered tile work into explicit producer/consumer scaffolding. Starting from queue-like TileAA/TileAS ops, it materializes nv_tileas.async.pipeline.* regions, threads async handshakes through every producer and consumer site, picks between unspecialized and warp-specialized execution, and finally trims each pipeline region to the minimal backward slice of its yielded values.
Downstream LLVM/NVVM lowering converts that scaffold into mbarrier waits, async bulk-copy waits, WGMMA group waits, named barriers, and ordinary LLVM control flow. Phase, stage, iterator, and producer/consumer ownership must survive every transformation in this family bit-for-bit — downstream synchronization assumes them.
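The "minimal backward slice of its yielded values" mentioned above is a standard worklist computation over use-def edges. The sketch below works on a toy graph with hypothetical op names; what the real pass does with ops that fall outside the slice is not stated here, so the sketch only reports them.

```python
def backward_slice(uses, yielded):
    """Return every op reachable from `yielded` by following operand edges.

    `uses` maps an op name to the op names whose results it consumes.
    """
    needed = set()
    worklist = list(yielded)
    while worklist:
        op = worklist.pop()
        if op in needed:
            continue
        needed.add(op)
        worklist.extend(uses.get(op, []))
    return needed


def trim_region(region_ops, uses, yielded):
    """Split a region into ops inside and outside the slice, keeping order."""
    keep = backward_slice(uses, yielded)
    inside = [op for op in region_ops if op in keep]
    outside = [op for op in region_ops if op not in keep]
    return inside, outside


# Toy region: only "scale" is yielded, so "unrelated_index" is not needed.
region_ops = ["addr", "load", "scale", "unrelated_index"]
uses = {"load": ["addr"], "scale": ["load"], "unrelated_index": []}
inside, outside = trim_region(region_ops, uses, yielded=["scale"])
print(inside, outside)
```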
Pass Roster
| Pass or family | Purpose |
|---|---|
| queue-to-pipeline rewrite | rewrites nv_tileaa.queue.* and execute into async pipeline operations |
| TileASMaterializeAsync | injects async tokens, future waits, producer and consumer handshakes |
| TileASMaterializeConvertLayout | decomposes layout conversions that cross pipeline boundaries |
| TileASMaterializeSchedule | consumes ScheduleAnalysis and selects AUS or AWS materialization |
| TileASUnspecializedPipeline | software-pipelines single-agent loops with prologue/body/epilogue |
| TileASOptimizePipelineRegion | shrinks produce_one and consume_one regions to minimal scopes |
| block-scaled MMA verifier | checks Blackwell microscale MMA invariants before lowering |
| pipeline-to-NVVM lowerings | convert async pipeline ops to NVVM and LLVM operations |
Pipeline Operation Surface
The nv_tileas.async.pipeline dialect exposes operations for creating a pipeline, switching agents, producing and consuming one stage, reading and writing through the pipeline slot, acquiring and releasing producer/consumer ownership, advancing iterators, and yielding region results.
| Operation concept | Role |
|---|---|
| create pipeline | builds stage count, producer group, consumer group, and memory-mode state |
| create iterator / increment iterator | tracks stage and phase progression |
| producer acquire / commit | claims and publishes a producer stage |
| consumer wait / release | waits for and releases a consumer stage |
| producer write / consumer read | transfers values through the logical pipeline slot |
| produce one / consume one | region operations that scope producer or consumer work |
| agent switch | partitions the function into producer, consumer, and compute agents |
| pipeline yield | returns region values and iterator state |
Region-op verifiers force block argument types and yielded result types to match the pipeline iterator and operation result contract. When an if or loop yields a pipeline iterator value, both arms must agree on the iterator type — there is no implicit merge.
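The "tracks stage and phase progression" role of the iterator can be pictured as a circular stage index paired with a parity bit. This page does not spell out the iterator's representation, so the model below is an assumption based on the usual mbarrier convention (the phase flips each time the stage index wraps); PipelineIterator and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineIterator:
    """Assumed model: circular stage index plus a one-bit phase (parity)."""
    stage: int
    phase: int
    num_stages: int

    def increment(self):
        next_stage = self.stage + 1
        if next_stage == self.num_stages:
            # Wrapping around the stage ring flips the parity bit.
            return PipelineIterator(0, self.phase ^ 1, self.num_stages)
        return PipelineIterator(next_stage, self.phase, self.num_stages)


it = PipelineIterator(stage=0, phase=0, num_stages=3)
for _ in range(4):
    it = it.increment()
    print(it.stage, it.phase)
```

Making the iterator an immutable value mirrors the SSA style of the dialect: each increment produces a new value, which is why both arms of an if or loop must yield the same iterator type.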
Queue to Pipeline Rewrite
The queue-to-pipeline rewrite bridges TileAA queue ops onto the pipeline surface.
LogicalResult rewrite_queue_program(ModuleOp module) {
  RewritePatternSet patterns(module.context());
  patterns.add(rewrite_execute_to_agent_switch);
  patterns.add(rewrite_create_queue_to_pipeline);
  patterns.add(rewrite_queue_put_to_produce_one);
  patterns.add(rewrite_queue_get_to_consume_one);
  patterns.add(rewrite_mark_for_reuse_passthrough);
  if (failed(apply_patterns_greedily(module, std::move(patterns)))) {
    return failure();
  }
  return propagate_pipeline_iterator_types(module);
}
Iterator propagation is not cleanup — it is part of the contract. Every downstream pass assumes producer and consumer regions carry consistent iterator values across structured control flow.
Materialize Async
TileASMaterializeAsync (CLI: tileas-materialize-async) takes synchronous tile loops still carrying nv_tileaa.queue.* and execute ops and rewrites them into the full nv_tileas.async.pipeline.* producer/consumer scaffold. It runs at function scope (OperationPass<FunctionOpInterface>) and depends on the SymbolTable trait. Every async-bearing loop receives a token iter-arg, every async-defining value gets wrapped through to_async, and producer and consumer handshakes ring the original tile work.
The pass body lives at sub_8174C0 (10 137 B, 463 BB). It walks LoopLikeOpInterface operations via sub_8172F0 with callback sub_819C60 and delegates per-loop rewriting to an assembler. Once the walk finishes, a reconciler verifies that pipeline types stay coherent across every producer-like user.
| Sub | Size | Role |
|---|---|---|
| sub_813DC0 | 5 518 B | per-loop rewriter; emits create_none, to_async wrappers, the reissued scf.for with token iter-arg, and tail future_wait + async.wait |
| sub_81A290 | 6 314 B | consumer emitter; emits consume_one -> consumer_read -> consume_one_async -> consumer_release -> async.wait |
| sub_81BB40 | 5 051 B | producer emitter; emits produce_one_async -> producer_commit -> token_to_async -> async.wait |
| sub_815AD0 | 6 176 B | post-walk reconciler; verifies one produce_one-like writer per pipeline across all AllocationOpInterface ops |
Consumer sequence:
consume_one -> consumer_read -> consume_one_async -> consumer_release -> async.wait
Producer sequence:
produce_one_async -> producer_commit -> token_to_async -> async.wait
Exactly one produce_one-like op may write data into a given pipeline. On conflict the reconciler emits the following diagnostic verbatim (full sentence, trailing period included) through sub_446CE00 at severity 259 (0x103): "there are two `produce-one-like` operations using different instructions to generate data into the same pipeline. It's a bug of MaterializeAsync Pass."
Errors never call signalPassFailure() directly. They set *(self + 40) |= 4, the cross-pass failure handshake documented in Pass-Failure Handshake — Convention. The driver inspects it once the walk completes and lifts it to a top-level failure.
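The single-writer check and the failure-bit handshake can be sketched together. The model below is an assumption-laden illustration: it treats a conflict as two writers to the same pipeline that use different instructions (following the wording of the diagnostic), the instruction labels are hypothetical, and reconcile is not the name of any Tileiras function. Only the diagnostic text and the `|= 4` bit are taken from the description above.

```python
FAILURE_BIT = 4  # the bit set by `*(self + 40) |= 4`

DIAGNOSTIC = (
    "there are two `produce-one-like` operations using different "
    "instructions to generate data into the same pipeline. "
    "It's a bug of MaterializeAsync Pass."
)


def reconcile(writers, state_flags=0):
    """Check (pipeline, instruction) pairs, one per produce_one-like op.

    Returns the updated flag word and the list of diagnostics emitted.
    """
    first_writer = {}
    diagnostics = []
    for pipeline, instruction in writers:
        seen = first_writer.setdefault(pipeline, instruction)
        if seen != instruction:
            diagnostics.append((pipeline, DIAGNOSTIC))
            state_flags |= FAILURE_BIT  # recorded, not raised
    return state_flags, diagnostics


flags, diags = reconcile([("p0", "tma_load"), ("p1", "tma_load"), ("p0", "cp_async")])
print(flags, len(diags))

# The driver inspects the bit only after the whole walk has finished.
pass_failed = bool(flags & FAILURE_BIT)
print(pass_failed)
```

Recording the failure in a flag instead of aborting lets the walk finish and report every conflicting pipeline before the driver lifts the bit to a top-level failure.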
Per-Loop Rewrite Body
The per-loop body at sub_813DC0 builds the rewritten loop in a single pass over the original region. It seeds the initial token, walks the body to classify each async-bearing op, clones the loop with one extra iter-arg, dispatches to the producer or consumer emitter for each classified op, and tails the new loop with a future_wait plus async.wait so the function-level user observes a fully synchronized value.
LogicalResult materialize_async_loop(ScfForOp loop, Rewriter *rw) {
  Value initial_token = rw->create("nv_tileas.create_none").result(0);
  SmallVector<Value> async_defs = collect_async_defining_values(loop);
  for (Value v : async_defs) {
    Value storage = rw->create("nv_tileas.async.to_async", v, AS_STORAGE).result(0);
    Value tok = rw->create("nv_tileas.async.to_async", v, AS_TOKEN).result(0);
    rw->replace_uses_inside_loop(v, storage, tok);
  }
  ScfForOp rewritten = clone_loop_with_extra_iter_arg(loop, initial_token, rw);
  for (Operation *op : rewritten.get_body_ops()) {
    if (is_pipeline_consumer(op)) {
      emit_consumer_handshake(op, rw); /* sub_81A290 */
    } else if (is_pipeline_producer(op)) {
      emit_producer_handshake(op, rw); /* sub_81BB40 */
    }
  }
  Value final_token = rewritten.get_loop_result(TOKEN_RESULT_IDX);
  rw->create("nv_tileas.async.future_wait", final_token);
  rw->create("nv_tileas.async.wait", final_token);
  return success();
}
Each async-defining value is wrapped twice through to_async — once for the storage side, once for the token side. Both wrappers stay live until the tail future_wait collapses them back into a synchronized result.
Input and Output IR Shapes
The input is a synchronous tile loop in which loads, MMAs, and stores are sequenced through ordinary SSA values and nv_tileaa.queue.* ops. The shapes the per-loop rewriter expects are precise: a scf.for body containing one or more TMA-eligible load chains, with execute ops marking the producer side and ordinary tile-compute ops on the consumer side.
// Input: synchronous tile loop, TMA-eligible loads on producer side.
%out = scf.for %i = %c0 to %n step %c1 iter_args(%acc = %init) -> tensor<...> {
  %a = nv_tileaa.queue.get %qa[%i] : tensor<...,#smem>
  %b = nv_tileaa.queue.get %qb[%i] : tensor<...,#smem>
  %c = nv_tileas.dot %a, %b, %acc : tensor<...>
  scf.yield %c : tensor<...>
}
After the rewrite, the loop carries a token iter-arg, async-defining values are wrapped through to_async, the body splits into producer and consumer regions with a pipeline_stage attribute on each, and a tail future_wait + async.wait synchronises the loop result for the function-level user.
// Output: async pipeline scaffold. Producer region issues async copies;
// consumer region runs compute under an mbarrier try-wait.
%tok0 = nv_tileas.create_none : !nv_tileas.token
%out, %tok = scf.for %i = %c0 to %n step %c1
iter_args(%acc = %init, %t = %tok0)
-> (tensor<...>, !nv_tileas.token) {
// Producer region: emits the async TMA issue and a stage-tagged commit.
%ap, %ta = nv_tileas.async.pipeline.produce_one_async %i,
{ pipeline_stage = 0 : i32 } : !nv_tileas.token
%bp, %tb = nv_tileas.async.pipeline.produce_one_async %i,
{ pipeline_stage = 0 : i32 } : !nv_tileas.token
nv_tileas.async.pipeline.producer_commit %ta, %tb : !nv_tileas.token
// Consumer region: waits for the mbarrier parity flip, reads, computes.
%a, %tac = nv_tileas.async.pipeline.consume_one %ta { pipeline_stage = 1 : i32 }
%b, %tbc = nv_tileas.async.pipeline.consume_one %tb { pipeline_stage = 1 : i32 }
%c = nv_tileas.dot %a, %b, %acc : tensor<...>
nv_tileas.async.pipeline.consumer_release %tac, %tbc : !nv_tileas.token
scf.yield %c, %t : tensor<...>, !nv_tileas.token
}
nv_tileas.async.future_wait %tok : !nv_tileas.token
nv_tileas.async.wait %tok : !nv_tileas.token
Attribute Hand-Off
D07 produces three attributes that downstream passes in the family consume. The contract is one-way: D07 writes them once during rewrite, and no later pass touches the schema, only the values.
| Attribute | Producer | Consumer | Meaning |
|---|---|---|---|
| pipeline_stage | D07 | D09 MaterializeSchedule, D11 UnspecializedPipeline | integer index naming which pipeline stage this producer or consumer region belongs to; drives modulo-schedule placement |
| token_iter_idx | D07 | D09, D11 | position of the token iter-arg in the rewritten loop; lets stage materialisation thread token state through prologue and epilogue |
| producer_kind | D07 | D08 MaterializeConvertLayout, D14 AssignLoadStoreLayouts | tag distinguishing TMA bulk, generic async-copy, and synchronous-fallback producers; drives layout selection on the consumer side |
The pipeline_stage attribute is what binds D07's region split to the scheduler. D09 reads it during stage materialisation: each stage's produce_one and consume_one ops must agree on stage index, otherwise the prologue and epilogue peel-piece builders cannot match producer to consumer across iterations. A mismatch trips the alias-check diagnostic "Alias is not expected here." in D09's helper pipeline.
Failure Modes
D07 has three structural failure paths, all of which set *(self + 40) |= 4 and leave the driver to inspect the bit after the walk completes:
- Two producers for one pipeline. The post-walk reconciler verifies one produce_one-like writer per pipeline across all AllocationOpInterface ops. On conflict it emits the verbatim diagnostic "there are two `produce-one-like` operations using different instructions to generate data into the same pipeline. It's a bug of MaterializeAsync Pass." at severity 259. The conflicting writers usually come from an earlier pass that duplicated a queue producer without updating the alias map.
- No TMA-eligible chain. When the per-loop classifier cannot find a load chain that terminates in a tile compute op, the rewriter leaves the loop synchronous and flags the loop as not-pipelinable. D09 reads the flag and routes the loop through the synchronous fallback. This is recoverable and emits no diagnostic.
- Iterator-type disagreement. If two arms of an if or scf.for yield iterator values of different types, the post-walk iterator-type propagation fails. The verifier on nv_tileas.async.pipeline.create_iterator rejects the merge, and the pass bubbles the failure up through the standard handshake.
Anonymous Rewrite Patterns
Two anonymous RewritePattern instances are allocated through sub_44A8C20 + sub_4481530 and registered into the local pattern set. Both are 0x60 B and use the 5-slot vtable shape A documented in Pattern Vtables and Shapes — Five-Slot RewritePattern Vtable: {matchAndRewrite, anchor/match, getDebugName, nullsub_11937 (slot 3), dtor/clone}. The debug-name string pair sits at offsets +0x40 and +0x48 of each pattern object.
| Pattern | Anchor op | Vtable | Debug-name string |
|---|---|---|---|
| AsyncWaitOpRemoval | nv_tileas.async.wait | off_59B4500 | mlir::nv_tile_ir::as::{anonymous}::AsyncWaitOpRemoval] |
| ExtractSliceOpToAsync | nv_tileas.extract_slice | off_59B4538 | mlir::nv_tile_ir::as::{anonymous}::ExtractSliceOpToAsync] |
AsyncWaitOpRemoval drops redundant async.wait ops that follow another wait on the same token with no intervening async consumer. ExtractSliceOpToAsync rewrites synchronous extract_slice into its async form whenever the slice source already carries an async token.
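The redundancy rule for AsyncWaitOpRemoval can be illustrated on a flat op list. The sketch below is a toy model, not the recovered pattern body: ToyOp, Kind, and remove_redundant_waits are hypothetical names, and real IR would reason about dominance and regions rather than a linear sequence.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy stand-ins for IR ops; none of these names come from the binary.
enum class Kind { AsyncWait, AsyncConsumer, Other };

struct ToyOp {
    Kind kind;
    int token;  // token id the op waits on or consumes; ignored for Other
};

// Drop an async.wait when the same token was already waited on and no
// async consumer of that token has appeared since the earlier wait.
std::vector<ToyOp> remove_redundant_waits(const std::vector<ToyOp>& ops) {
    std::set<int> already_waited;
    std::vector<ToyOp> kept;
    for (const ToyOp& op : ops) {
        if (op.kind == Kind::AsyncWait) {
            if (already_waited.count(op.token)) continue;  // redundant wait
            already_waited.insert(op.token);
        } else if (op.kind == Kind::AsyncConsumer) {
            already_waited.erase(op.token);  // a later wait is meaningful again
        }
        kept.push_back(op);
    }
    return kept;
}
```

Feeding it the sequence wait(1), other, wait(1), consumer(1), wait(1) keeps four ops: only the second wait is dropped, because the third wait follows an intervening consumer.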
Interface TypeID Caching
Interface lookups intern TypeIDs through sub_44A6CA0 and cache the resulting TypeID pointer in three globals so later loop walks and trait checks skip the interning hash. All three must populate before the pass can claim its OperationPass<FunctionOpInterface> anchor and verify the SymbolTable trait.
| Interface | Cache slot |
|---|---|
| FunctionOpInterface | qword_5B37670 |
| SymbolTable | qword_5B37798 |
| LoopLikeOpInterface | qword_5B38E18 |
Multiple producer-like ops writing into the same pipeline must agree on the instruction family that generates the data. Mixing incompatible producers is a hard error: the downstream wait and barrier sequence would be ambiguous.
Materialize Convert Layout
Pipeline boundaries demand layout conversion between register, shared-memory, and tensor-memory views. TileASMaterializeConvertLayout (CLI: tileas-materialize-convert-layout) decomposes every surviving nv_tileas.convert_layout into a sequence of alloc, view, copy, and shuffle ops, picking register-to-register staging or shared-memory staging based on what the target specification reports as feasible for the source and destination layouts.
The pass object is a 752-B (0x2F0) PIMPL allocated through sub_44A8C20(0x2F0). Two CLI-visible options sit at fixed offsets inside the body; two callback slots in the pass-object header back them through the standard pass-option registration helper.
| Field | Offset | Type | Default | Meaning |
|---|---|---|---|---|
| reg2reg-vec-size | +0x1D0 | u32 | 16 | Cap on register-to-register copy atom width |
| reinterpret-to-i8 | +0x2A0 | bool | 0 | Reinterpret source and destination tensors as tensor<...xi8> for sub-byte fp formats so staging happens at byte granularity |
Option-callback slots live at header offsets +65 and +91; the vtable pair is (off_59B4688, unk_5B38E50). Both constructors sub_8206C0 and sub_820940 register the two options through sub_6D3140 (the pass-option registration helper) and run sub_5FED40 (the pass-init helper) to wire the pass into the global registry.
Pass Body
The pass body at sub_820D30 (10 359 B) reads the two options, walks the function bottom-up via sub_81EA30 to collect every nv_tileas.convert_layout op (TypeID &unk_5B44FD8), and asks the target-specification driver sub_91A9B0 for a decomposition plan per op. The plan is a SmallVector<AtomPlan> of 32-byte entries:
struct AtomPlan {
uint32_t tag; /* 0 = reg-to-reg, 1 = via-SMEM */
uint32_t smem_layout; /* descriptor index when tag == 1 */
uint32_t atom_layout; /* per-atom layout descriptor */
uint128_t atom_descriptor;/* CuTe-style atom encoding */
};
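As a sanity check on the 32-byte figure, the layout can be written out in standard C++. This is a sketch under one assumption not stated above: the 128-bit descriptor is 16-byte aligned, which is what places it at offset 16 and leaves four padding bytes after atom_layout. AtomPlanSketch is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout sketch only. Standard C++ has no uint128_t, so the descriptor is
// modelled as a 16-byte array with (assumed) 16-byte alignment.
struct AtomPlanSketch {
    std::uint32_t tag;          // 0 = reg-to-reg, 1 = via-SMEM
    std::uint32_t smem_layout;  // descriptor index when tag == 1
    std::uint32_t atom_layout;  // per-atom layout descriptor
    alignas(16) std::uint8_t atom_descriptor[16];  // CuTe-style atom encoding
};

static_assert(sizeof(AtomPlanSketch) == 32, "entries are 32 bytes");
static_assert(offsetof(AtomPlanSketch, atom_descriptor) == 16,
              "four padding bytes follow atom_layout");
```

If the real descriptor were only 4- or 8-byte aligned, the three u32 fields plus 16 descriptor bytes would still round up to 32 on an 8-byte-aligned struct, so the size alone does not confirm the alignment.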
The option-read sequence for one op is:
LogicalResult materialize_one(ConvertLayoutOp op, PassState *self, Rewriter *rw) {
uint32_t vec_cap = *(uint32_t *)((uint8_t *)self + 0x1D0);
bool reinterpret = *(bool *)((uint8_t *)self + 0x2A0);
SmallVector<AtomPlan> plans;
if (failed(sub_91A9B0(op, vec_cap, &plans))) {
sub_446CE00(op.loc(), "failed to query target spec for convert_layout");
*(uint32_t *)((uint8_t *)self + 40) |= 4;
return failure();
}
sub_8200D0(plans.data(), plans.size()); /* stable sort by efficiency */
if (reinterpret && is_sub_byte_fp(op.source().type())) {
op = sub_81F8C0(op, rw); /* src -> tensor<...xi8> */
op = sub_81F9F0(op, rw); /* dst -> tensor<...xi8> */
}
return apply_first_feasible_plan(op, plans, rw);
}
The pass-failure handshake matches the rest of the TileAS family: errors set *(self + 40) |= 4 instead of calling signalPassFailure() directly, and the driver inspects the bit after the walk completes. Option-misuse and target-spec lookup failures share the verbatim diagnostic failed to query target spec for convert_layout via sub_446CE00.
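A minimal model of the deferred-failure handshake follows. It assumes the status word at +40 is 32 bits wide (the width is a guess; only the offset and the mask come from the text), and mark_failed and walk_failed are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kStatusOffset = 40;  // from the text: *(self + 40)
constexpr std::uint32_t kFailureBit = 4;   // mask 4, i.e. bit 2

// Pass side: record a failure without aborting the walk.
void mark_failed(unsigned char* self) {
    std::uint32_t word;
    std::memcpy(&word, self + kStatusOffset, sizeof word);
    word |= kFailureBit;
    std::memcpy(self + kStatusOffset, &word, sizeof word);
}

// Driver side: inspect the bit once the walk is done.
bool walk_failed(const unsigned char* self) {
    std::uint32_t word;
    std::memcpy(&word, self + kStatusOffset, sizeof word);
    return (word & kFailureBit) != 0;
}
```

Because the flag is OR-ed in, several failing ops in one walk leave the same state as a single failure, which is why the driver only needs one check at the end.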
Plan Sort and Apply
sub_8200D0 sorts candidate plans by descending efficiency — cost-per-byte transferred through the chosen staging shape — so the first plan that clears the per-op constraint set also carries the highest expected throughput. The merge is the libc++ std::stable_sort over 32-B entries, recognisable by the same __buffered_inplace_merge shape that drives the FUSE arm in the modulo scheduler.
Once sorted, the dispatcher walks the plan vector and accepts the first plan whose tag is feasible for the op's source layout, destination layout, and current vector cap. Tag 0 expands into a sequence of register-to-register nv_tileas.copy ops bounded by reg2reg-vec-size, with an optional nv_tileas.shuffle when the atom needs a cross-lane permutation. Tag 1 stages through shared memory: allocate a tensor<...,#smem>, view the source through the plan's source view, copy into SMEM, then read the SMEM tile back at the destination layout.
Value apply_plan(ConvertLayoutOp op, AtomPlan plan, Rewriter *rw) {
if (plan.tag == 1) {
Value smem = rw->create("nv_tileas.alloc_tensor", plan.smem_type()).result(0);
Value view = rw->create("nv_tileas.view", op.source(), plan.source_view()).result(0);
rw->create("nv_tileas.copy", view, smem, plan.atom_descriptor);
return rw->create("nv_tileas.convert_layout", smem, op.dst_layout()).result(0);
}
if (plan.can_convert_directly()) {
return rw->create("nv_tileas.convert_layout", op.source(), plan.dst_layout()).result(0);
}
return rw->create("nv_tileas.shuffle", op.source(), plan.atom_descriptor).result(0);
}
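The sort-then-scan selection described in this section can be sketched independently of the IR. ToyPlan, its efficiency and vec_width fields, and the feasibility rule (SMEM plans always fit, register plans must fit under the cap) are simplifying assumptions, not the recovered feasibility gate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct ToyPlan {
    std::uint32_t tag;        // 0 = reg-to-reg, 1 = via-SMEM
    double efficiency;        // hypothetical score; higher is better
    std::uint32_t vec_width;  // hypothetical copy-atom width for tag 0
};

// Stable-sort by descending efficiency, then take the first feasible plan.
std::optional<ToyPlan> pick_plan(std::vector<ToyPlan> plans, std::uint32_t vec_cap) {
    std::stable_sort(plans.begin(), plans.end(),
                     [](const ToyPlan& a, const ToyPlan& b) {
                         return a.efficiency > b.efficiency;
                     });
    for (const ToyPlan& p : plans) {
        const bool feasible = (p.tag == 1) || (p.vec_width <= vec_cap);
        if (feasible) return p;
    }
    return std::nullopt;  // no plan cleared the gate
}
```

The stable sort matters when two plans tie on efficiency: the one the target-spec driver returned first stays first, so plan selection is deterministic for a given catalogue order.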
Reinterpret Builders
With reinterpret-to-i8 set, the pass rewrites source and destination layouts into byte-granular form before consulting the target-spec driver. The two builders look textually similar but each operates on a different end of the op; keeping them separate enables asymmetric reinterpretation — a bytewise source view paired with a native destination, for instance.
| Builder | Operand | Role |
|---|---|---|
| sub_81F8C0 | source | rewrites the source tensor type into tensor<...xi8> and inserts a matching view |
| sub_81F9F0 | destination | rewrites the destination tensor type into tensor<...xi8> and inserts a matching view |
Byte reinterpretation kicks in for NVFP4, FP6, and FP8 source or destination tensors. The SMEM staging plan then runs over a normal byte-granular tile, sidestepping the otherwise mandatory sub-byte SMEM atoms and letting one SMEM staging path serve every sub-byte fp format.
Input and Output IR Shapes
The input is IR carrying one or more nv_tileas.convert_layout ops between producer and consumer with incompatible layouts. The producer's output layout (typically a #smem or #tmem layout coming out of a TMA load) and the consumer's input layout (typically a WGMMA fragment layout or a register layout for a downstream nv_tileas.dot) do not match the target's native copy-atom catalogue.
// Input: a convert_layout op crossing pipeline boundaries.
%a_smem = nv_tileas.async.pipeline.consume_one %ta : tensor<128x64xf16, #smem>
%a_frag = nv_tileas.convert_layout %a_smem : tensor<128x64xf16, #smem>
-> tensor<128x64xf16, #wgmma_a>
%c = nv_tileas.dot %a_frag, %b_frag, %acc : tensor<128x128xf32>
After materialisation, the chosen plan expands into either a register-to-register sequence (tag 0) or an SMEM-staging sequence (tag 1). The SMEM staging case allocates a private SMEM tile, copies through it using the plan's atom descriptor, and reads back at the destination layout:
// Output (tag 1, SMEM staging): the convert_layout is replaced by an
// alloc + view + copy + read sequence; the atom_plan attribute survives
// onto the final reader op for D14 AssignLoadStoreLayouts.
%tmp = nv_tileas.alloc_tensor : tensor<128x64xf16, #smem_swizzled>
%vsrc = nv_tileas.view %a_smem : tensor<128x64xf16, #smem>
-> tensor<128x64xf16, #smem_byte_view>
nv_tileas.copy %vsrc, %tmp { atom_descriptor = #cute<atom "TiledCopy<...>"> }
%a_frag = nv_tileas.convert_layout %tmp
{ atom_plan = #nv_tileas.atom_plan<tag = 1, smem = #smem_swizzled, ...> }
: tensor<128x64xf16, #smem_swizzled>
-> tensor<128x64xf16, #wgmma_a>
Tag 0 collapses to a chain of nv_tileas.copy ops bounded by reg2reg-vec-size, optionally fronted by a nv_tileas.shuffle when the atom requires a cross-lane permutation.
Attribute Hand-Off
The atom_plan attribute survives onto the final reader op as a fully-resolved AtomPlan record. D14 AssignLoadStoreLayouts is the primary downstream consumer: it reads the plan to bind concrete copy-atom shapes onto the load and store ops the plan expanded into, completing the lowering toward the LLVM/NVVM backend.
| Attribute | Producer | Consumer | Meaning |
|---|---|---|---|
| atom_plan | D08 | D14 AssignLoadStoreLayouts | 32-byte AtomPlan record (tag, smem layout, atom layout, atom descriptor); names the concrete copy atom that lowering must instantiate |
| reinterpret_byte | D08 | D14, downstream lowering | flag set when the pass rewrote source or destination as tensor<...xi8> so that downstream passes do not re-fold the byte view back to the sub-byte layout |
D08 does not invent atom choices; it consults the target-specification driver, sorts plans by descending efficiency, and records the chosen plan onto the op so D14 has a stable handle. D14, in turn, may refine the plan when target-specific constraints emerge (cluster-shared atoms, multi-CTA descriptors) but it must not silently change the plan tag — the verifier rejects tag changes after D08 has run.
Failure Modes
The two structural failures both emit "failed to query target spec for convert_layout" and set the failure bit:
- Target-spec lookup miss. When the source and destination layouts do not appear in the target's atom catalogue, sub_91A9B0 returns an empty plan vector. The dispatcher emits the diagnostic and tears the op down without rewriting it; the verifier on the unchanged op trips at the next pass.
- All plans infeasible at current vector cap. When every returned plan exceeds the reg2reg-vec-size cap (typical for sub-byte FP formats without reinterpret-to-i8 set), no plan passes the feasibility gate. The same diagnostic fires; the recommended fix is to raise the cap or enable reinterpret-to-i8.
Failure Handling and Cross-References
Both option-misuse cases and target-spec lookup failures share the verbatim diagnostic above. Pass-level failure sets *(self + 40) |= 4. Successful expansion replaces the original nv_tileas.convert_layout with the plan's final result value and erases the op.
The SM-specific atom catalogues that sub_91A9B0 reads to build plans are documented in MMA Atoms SM70..SM120 — Per-Arch MMA Shape Lattice. The 8-slot pattern vtable convention that off_59B4688 uses is documented in Pattern Vtables and Shapes — Eight-Slot Vtable. The nv_tileas.convert_layout op definition itself, including its layout-attribute schema and verifier, is documented in nv_tileas Op Roster and Builders.
Materialize Schedule
TileASMaterializeSchedule (CLI: tileas-materialize-schedule) consumes a ScheduleAnalysis and dispatches to one of two driver flavours: AUS (Agent-Unspecialized — one SIMT agent owns producer and compute work) or AWS (Agent-Warp-Specialized — distinct producer and consumer agents partitioned by nv_tileas.async.pipeline.agent_switch). CLI options and a heuristic over the schedule's work-vs-stage shape gate the choice; the pass invents no schedule, it materialises an existing one onto the function.
| Mode | Meaning |
|---|---|
| AUS | single-agent materialization; all stages share the same warp group |
| AWS | warp-specialized materialization with one or two compute agents and one producer agent |
The pass identity triple is sub_8235B0 / sub_8235C0 / sub_8235D0. The name slot returns the literal "MaterializeSchedule"; the description slot returns "Meterialize the pipeline schedule to generate warp-specialized or unspecialized IR" verbatim — the leading typo Meterialize lives in the binary and must survive bit-for-bit in tool output. The factory sub_825050 takes a 3-byte packed option mask whose bits feed the offsets listed below. Dependent dialect registration runs through sub_8235E0, which inserts nv_tileaa, nv_tileas, and scf into the dependency set.
⚡ QUIRK: pass name spells Materialize, but pass description spells Meterialize
Slot sub_8235B0 returns the correctly-spelled "MaterializeSchedule" while the neighbouring description slot sub_8235C0 returns "Meterialize the pipeline schedule to generate warp-specialized or unspecialized IR" with a leading Mete- typo. The two slots disagree on a single byte, and --help output (which reads the description) therefore looks misspelled while the CLI option name (which reads the identifier) does not. The typo is binary-stable, and a reimplementation has to reproduce the asymmetry to keep snapshot-based golden tests passing.
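A reimplementation might pin both strings in a compile-time check so the asymmetry cannot be corrected by accident. The constant names below are illustrative; the string contents are the two literals quoted above.

```cpp
#include <cassert>
#include <string_view>

// Both literals are copied from the slots described above; names are hypothetical.
constexpr std::string_view kPassName = "MaterializeSchedule";
constexpr std::string_view kPassDescription =
    "Meterialize the pipeline schedule to generate warp-specialized or unspecialized IR";

// The name is spelled correctly, the description keeps the binary's typo.
static_assert(kPassName.substr(0, 11) == "Materialize", "name slot spelling");
static_assert(kPassDescription.substr(0, 11) == "Meterialize", "description slot typo");
```

A spell-checking cleanup of the description then fails at compile time rather than surfacing later as a golden-test diff.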
Pass Object and CLI Options
The pass body is a 960-B (0x3C0) PIMPL allocated through sub_44A8C20(0x3C0). Three boolean CLI-visible options sit at fixed offsets inside the body, mirroring the option layout in TileASMaterializeConvertLayout.
| Field | Offset | Default | Meaning |
|---|---|---|---|
| use-AUS | +464 | false | forces the AUS driver; bypasses the dual-SIMT heuristic entirely |
| use-dual-simt | +672 | true | AWS-only: splits compute into two SIMT agents of 4 warps each when feasible |
| enable-schedule-rewrite | +880 | true | gates sub_8D6700's re-folding of expanded stages back onto the original scf.for |
Each option threads through the standard pass-option apply thunk at offset +728, the same indirect-call shape used elsewhere in the family. The thunk receives the address of the option storage (a1 + 704 for the dual-SIMT triple), so heuristic updates flow back into the option store without bypassing CLI parsing.
Dispatcher Body
The dispatcher at sub_824000 (4 175 B, 133 BB) opens by resolving the surrounding FunctionOpInterface through the mlir::FunctionOpInterface] interned TypeID cached in qword_5B37670, falling back to a sorted binary search over the operation-info trait table when the host op stores its interfaces in the secondary array form — the same dual lookup every other TileAS pass uses. Errors set *(self + 40) |= 4 and the driver inspects the bit after the walk completes.
With the function handle resolved, the dispatcher loads the cached ScheduleAnalysis from the AnalysisManager DenseMap. Its key is "mlir::nv_tile_ir::as::schedule_utils::ScheduleAnalysis]" (54 chars, trailing ] preserved), interned through sub_44A6CA0 and cached at qword_5B38E78. The probe is the canonical Tileiras ((h>>9) ^ (h>>4)) & (cap-1) pattern with linear-step rehashing; a slot holding the empty-key sentinel -4096 ends the search, while tombstone slots (-8192) are stepped over. Two loader shims sit behind the probe: sub_8FDE40 is the entry point and forwards to sub_8FCC10 when an analysis was found in the map, or sub_8FD850 when it must be created from defaults. On failure the dispatcher sets the failure bit and falls through to the cleanup tail without allocating a driver.
LogicalResult materialize_schedule(FuncOp func, PassState *self) {
Operation *op = func.getOperation();
FunctionOpInterface fi = lookup_interface(op, qword_5B37670);
ScheduleAnalysis *sched = analysis_manager_lookup(
self->parent, qword_5B38E78,
/* h>>9 ^ h>>4 probe */ &sub_8FCC10, &sub_8FD850);
if (!sub_8FDE40(/* slot */, sched, /* present */)) {
*(uint64_t *)((uint8_t *)self + 40) |= 4;
return failure();
}
...
}
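The probe loop behind the analysis lookup can be modelled as follows. This sketch follows the upstream LLVM DenseMap convention, which is an assumption here: the two shifted copies of the key are XORed before masking, the probe step grows by one per collision, an empty slot ends the search, and tombstones are stepped over. Bucket and find_bucket are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::int64_t kEmptyKey = -4096;      // never-used slot
constexpr std::int64_t kTombstoneKey = -8192;  // erased slot

struct Bucket {
    std::int64_t key = kEmptyKey;
    int value = 0;
};

// Returns the index of the bucket holding `key`, or -1 if it is absent.
// buckets.size() must be a power of two and at least one bucket must be empty.
int find_bucket(const std::vector<Bucket>& buckets, std::int64_t key) {
    const std::uint64_t mask = buckets.size() - 1;
    const std::uint64_t h = static_cast<std::uint64_t>(key);
    std::uint64_t idx = ((h >> 9) ^ (h >> 4)) & mask;
    for (std::uint64_t step = 1;; ++step) {
        const Bucket& b = buckets[idx];
        if (b.key == key) return static_cast<int>(idx);
        if (b.key == kEmptyKey) return -1;  // empty slot: key was never inserted
        idx = (idx + step) & mask;          // tombstone or other key: keep probing
    }
}
```

With eight buckets, the keys 0x1000 and 0x2000 both hash to slot 0, so a tombstone in slot 0 forces the lookup for 0x2000 to continue to slot 1 instead of reporting a miss.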
Driver Allocation
The dispatcher picks one of two driver flavours based on use-AUS and the dual-SIMT heuristic, then allocates the driver, invokes its prepare() slot, and finally runs the shared materialisation pipeline.
| Driver | Size | Vtable | Extra state |
|---|---|---|---|
| AUS | 0x68 B | &unk_59DBBE8 | three SmallVector<Op*> slots for stages, allocations, and tokens |
| AWS | 0xC8 B | &unk_59DBBA8 | agent-partition map at +96, numWarps at +112, warpId at +128, useDualSimt byte at +192, shape sentinel 0x600000000 at +104 (qword index 13) |
The dispatcher picks AUS whenever use-AUS is true, or when the dual-SIMT heuristic doesn't pay off. The heuristic is one floating-point compare: useDualSimt = (double)N > (totalWork / iters) * 0.6, with N, totalWork, and iters read from three schedule-header fields at *(v23 + 5), *(v23 + 4), and *(v23 + 3). Before the integer division, iters is clamped to 1 when it is non-positive, because idiv would otherwise fault on a zero divisor. The computed bit lands at *(self + 672) and re-applies through the option thunk, so the option store reflects the final decision, not just the parsed CLI default.
if (*((uint8_t *)self + 464)) { /* use-AUS */
driver = alloc(0x68);
driver->vtable = &unk_59DBBE8; /* AUS */
} else {
if (analysis_present) {
ScheduleHeader *h = (ScheduleHeader *)(slot + 8);
int iters = h->iters > 0 ? h->iters : 1;
int total_work = h->total_work;
double n_double = (double)h->stage_count;
bool useDual = n_double > (double)(total_work / iters) * 0.6;
*((uint8_t *)self + 672) = useDual;
(*(self->option_apply))(self + 704, &useDual, /* arg */);
}
driver = alloc(0xC8);
driver->vtable = &unk_59DBBA8; /* AWS */
*((uint8_t *)driver + 192) = *(uint8_t *)((uint8_t *)self + 672);
*((uint64_t *)driver + 13) = 0x600000000ULL; /* shape sentinel */
}
The shape sentinel 0x600000000 (stamped into the dispatcher's local frame at slot v147 and again at v150) encodes a default (numStages=6, stageWidth=0) pair that the AWS prepare slot overwrites with the real schedule header. With no schedule analysis present, the dispatcher skips the heuristic block entirely, allocates the AWS object with useDualSimt = 0, and leaves the fail/succeed decision to prepare().
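The packed default can be decoded with two shifts. The sketch below only demonstrates the arithmetic; which half holds which field is inferred from the value itself (only the upper 32 bits can hold the 6), and PackedPair and unpack are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kShapeSentinel = 0x600000000ULL;

struct PackedPair {
    std::uint32_t high;  // upper 32 bits
    std::uint32_t low;   // lower 32 bits
};

// Split a 64-bit word into its two 32-bit halves.
constexpr PackedPair unpack(std::uint64_t v) {
    return PackedPair{static_cast<std::uint32_t>(v >> 32),
                      static_cast<std::uint32_t>(v & 0xFFFFFFFFULL)};
}

static_assert(unpack(kShapeSentinel).high == 6, "default numStages");
static_assert(unpack(kShapeSentinel).low == 0, "default stageWidth");
```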
Prepare and Materialisation Pipeline
The driver then receives its prepare() call through (*driver->vtable[0])(driver). AUS and AWS share the prepare slot offset; the vtable dispatch picks the right body. On failure the dispatcher sets the failure bit and invokes the destructor through (*driver->vtable[5])(driver) (offset +40, the standard 8-slot Tileiras driver dtor slot).
Once prepare() succeeds, control passes into the shared materialisation pipeline. The entry helper sub_8F1AA0 (248 B) sequences six fixed-order passes plus the alias-materialisation pass:
| Stage | Helper | Notes |
|---|---|---|
| 1 | sub_8E4510 | producer-region setup |
| 2 | sub_8E2790 | alias check; emits "Alias is not expected here." on contract violation |
| 3 | sub_8E2F00 | consumer-region setup |
| 4 | sub_8F19D0 | iterator threading |
| 5 | sub_8EC560 | release-op insertion |
| 6 | sub_8E1900 | barrier-token wiring |
| 7 | sub_8E4F10 | 10 430-B alias-materialisation pass; the heaviest body in the sequence |
The alias-check diagnostic is severity-259 (0x103) like the rest of the family. It fires when an earlier pass leaves a pipeline-aliased value reaching schedule materialisation, which would corrupt the producer/consumer ownership graph. The error is fatal: it sets the failure bit and tears the driver down.
On the AWS path only, the dispatcher next calls the agent-switch materialiser sub_9130B0 (4 047 B, 114 BB). For each agent boundary detected in the schedule, it emits one nv_tileas.async.pipeline.agent_switch op whose payload encodes the agent id, the warp count partition, and the resource window. The same body carries the "Building op `" ... "` but it isn't known in this MLIRContext: the dialect may no" diagnostic pair used by every generic op builder in the dialect.
Stage Materialisation
The expanded stage IR then re-folds onto the loop. sub_90C600 (85 B, single basic block) is the entry point: it prepares the per-stage SmallVector frames and forwards into the heavy sub_8D6700 (10 399 B, 506 BB). That body walks the schedule's stage list, builds an scf.if guard per pipeline-stage prelude through sub_8CE1B0, and constructs the big-tensor MLIR ops through the 13 858-B sub_8D30D0 nest. Each per-stage construction allocates 64-B-strided records into the driver's stage SmallVector; tombstone slots tagged -4096 or -8192 let the cleanup loop skip them without dereferencing freed payloads.
When enable-schedule-rewrite is false, the stage builder still expands stages but skips the final re-fold over the original scf.for, leaving the expanded form for downstream passes to consume directly. Debugging dumps take this path, as does the AUS driver when the heuristic prefers a non-pipelined fallback.
Epilogue handling at sub_8F1F40 (918 B, 62 BB) picks off consumer-side release ops that survived the stage rewrite. It walks the post-loop region, finds consumer_release ops whose pipeline argument escapes the rewritten loop, and re-anchors them onto the AWS agent boundary or the AUS post-loop sequence depending on the active driver. The same helper carries the LABEL_86 cleanup tail in the dispatcher: epilogue failure falls through into the SmallVector teardown the success path uses.
Schedule Cleanup
sub_823B60 (1 183 B, 59 BB) is the schedule-state destructor. It frees eight 24-strided per-stage SmallVectors (producer, consumer, and intermediate slot arrays) plus two 48-strided SmallVectors at +216 and +240 (the alias-materialisation work-lists). DenseMap rows whose first qword equals -4096 (empty) or -8192 (tombstone) are skipped; live rows release their inner SmallVector payloads (*(row + 40), *(row + 16)) through the standard 16-stride deallocator sub_4560420. The dispatcher calls sub_823B60 once on success and once on failure, sharing one cleanup tail to keep the failure handshake symmetric with success.
Strategy Routing
The dispatcher reads its strategy enum (NONE / UNSPECIALIZED / WARP_SPECIALIZED) through sub_6D3460. The enum drives the top-level pass-manager: UNSPECIALIZED routes to the TileASUnspecializedPipeline pass below, while WARP_SPECIALIZED stays inside TileASMaterializeSchedule with use-AUS=false. NONE short-circuits both — the dispatcher tears the driver down immediately and returns success without emitting any pipeline IR, leaving the loop synchronous for downstream NVVM lowering.
Scheduler Hand-Off
The schedule analysis itself does not live in this pass. The modulo scheduler computes II, places ops modulo II, and emits a ScheduleAnalysis record into the AnalysisManager; D09 is the consumer side of that split. The boundary between analysis and materialisation is documented in Schedule Solve and Cost Evaluators — Pass Boundary: the scheduler is forbidden from touching IR directly, and D09 is forbidden from inventing schedules. Every field D09 reads — stage count, total work, iteration count, per-op stage tag, per-op iteration offset — was written by the scheduler. The dispatcher's role is to translate that record into producer and consumer regions, agent boundaries, and peel-piece copies of the loop body.
The strategy enum returned by sub_6D3460 is what binds D09 to the scheduler's chosen strategy:
| Strategy | Meaning | D09 path |
|---|---|---|
| NONE | scheduler found no profitable pipeline | tear down driver, return success, leave loop synchronous |
| SERIAL | one-stage serial schedule (II == latency) | AUS driver with single-stage degenerate path; no peeling |
| COST_BASED | cost-evaluated multi-stage schedule | AUS or AWS driver per CLI options and dual-SIMT heuristic |
| FAST | first-feasible-II schedule | same driver path as COST_BASED; only the schedule values differ |
| DEFAULT | platform default for current SM | resolves to COST_BASED on Blackwell, SERIAL on pre-Hopper |
The strategy enum does not change D09's algorithm — it only selects which ScheduleAnalysis record was published into the AnalysisManager. The dispatcher reads whichever record is present; the strategy tag travels with the record for diagnostic purposes.
Dual-SIMT FP Heuristic
The dual-SIMT heuristic (double)N > (totalWork / iters) * 0.6 is one floating-point compare, but its three inputs encode a specific shape question: does the schedule have more pipeline stages than work-per-iteration, scaled by a 0.6 efficiency floor? When the answer is yes, splitting the compute warp group into two 4-warp SIMT agents keeps both agents busy; the producer agent issues TMAs for both consumers in parallel, and the second consumer hides instruction-issue latency on the first.
The heuristic fires only under a specific shape combination: an FP-heavy MMA body (where the per-iteration work is dominated by tensor-core throughput, not memory bandwidth) and an SM with dual-issue capability (Hopper, Blackwell, Blackwell Ultra). For integer-dominated or memory-dominated loops the heuristic typically fails the 0.6 threshold and falls back to single-SIMT. The floating-point compare is intentional. The quotient totalWork / iters is already truncated by integer division, so applying the 0.6 scale in integer arithmetic as well would round the threshold a second time and hide the difference between balanced and imbalanced shapes.
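The compare is small enough to restate as a standalone function. This is a sketch of the formula as quoted in this page, with the iteration clamp applied first; the function and parameter names are illustrative, and 32-bit signed header fields are an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the stage count exceeds 60% of the per-iteration work.
bool use_dual_simt(std::int32_t stage_count, std::int32_t total_work, std::int32_t iters) {
    const std::int32_t safe_iters = iters > 0 ? iters : 1;       // avoid a zero divisor
    const std::int32_t work_per_iter = total_work / safe_iters;  // integer division
    return static_cast<double>(stage_count) >
           static_cast<double>(work_per_iter) * 0.6;
}
```

For example, 4 stages against 60 units of work over 10 iterations gives a threshold of 3.6, so the split is taken; with 3 stages the same loop stays single-SIMT. A zero iteration count is treated as one iteration rather than faulting.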
Peel-Piece Emission
After prepare() succeeds, the stage materialisation pipeline emits the modulo schedule's overlapping iterations as explicit IR. The schedule's stage list becomes a sequence of peel-piece copies: a prologue that fills the pipeline before the steady-state body, a steady-state body that runs one iteration per pipeline stage, and an epilogue that drains the pipeline after the loop exits.
void emit_peel_pieces(Schedule *sched, ScfForOp loop, Rewriter *rw) {
uint32_t num_stages = sched->stage_count;
/* Prologue: stages 0 .. num_stages-2 of iteration 0,
stages 0 .. num_stages-3 of iteration 1, and so on.
At the end, the pipeline has one in-flight iteration per stage. */
for (uint32_t k = 0; k < num_stages - 1; ++k) {
emit_stage_peel(sched, k, /*iter=*/0, rw);
}
/* Steady state: rebuild the scf.for body so each iteration
carries one stage-k op for k in 0..num_stages-1. */
ScfForOp rebuilt = rebuild_with_overlapped_stages(loop, sched, rw);
/* Epilogue: drain the pipeline. After loop exit there are num_stages-1
in-flight iterations; emit consume-only copies that finish them. */
for (uint32_t k = 0; k < num_stages - 1; ++k) {
emit_stage_drain(sched, k, /*iter=*/N - k, rw); /* N: original trip count of loop */
}
}
Each peel-piece copy reuses the stage-mapped clone helper sub_8307E0, which reissues each op with operands rewritten to the corresponding stage's value mapping. The rebuilt loop's trip count is N - (num_stages - 1), matching the prologue's pre-execution of the first num_stages - 1 iterations. When enable-schedule-rewrite is false, the pass emits prologue and epilogue but skips the steady-state re-fold; the unrolled stage IR is left in place for debugging or for downstream passes that expect the expanded form.
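The prologue, steady-state, and epilogue split can be checked by enumerating every (iteration, stage) pair. The model below is the textbook software-pipelining schedule in which stage k of iteration i runs at step i + k; it is offered as an illustration of the trip-count arithmetic, not as the recovered emitter, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Piece { Prologue, Steady, Epilogue };

struct Slot {
    Piece piece;
    std::int64_t iter;
    std::int64_t stage;
};

// Enumerate every (iteration, stage) pair in execution order.
// Requires num_stages >= 1 and trip_count >= num_stages - 1.
std::vector<Slot> enumerate_slots(std::int64_t trip_count, std::int64_t num_stages) {
    assert(num_stages >= 1 && trip_count >= num_stages - 1);
    std::vector<Slot> slots;
    const std::int64_t last_step = trip_count + num_stages - 2;
    for (std::int64_t t = 0; t <= last_step; ++t) {
        const Piece piece = t < num_stages - 1 ? Piece::Prologue
                          : t < trip_count     ? Piece::Steady
                                               : Piece::Epilogue;
        for (std::int64_t k = 0; k < num_stages; ++k) {
            const std::int64_t iter = t - k;  // stage k works on iteration t - k
            if (iter >= 0 && iter < trip_count) slots.push_back({piece, iter, k});
        }
    }
    return slots;
}

// Trip count of the rebuilt loop once the prologue has been peeled.
std::int64_t steady_trip_count(std::int64_t trip_count, std::int64_t num_stages) {
    return trip_count - (num_stages - 1);
}
```

With five iterations and three stages, the enumeration yields 15 slots in total: three in the prologue, nine across the three steady-state steps, and three in the epilogue, so every stage of every iteration runs exactly once.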
Failure Modes
D09 has four structural failure paths, all of which set *(self + 40) |= 4:
- No ScheduleAnalysis in the AnalysisManager. The analysis loader returns false; the dispatcher tears down the driver and bubbles failure up. This indicates the scheduler pass never ran or failed silently, which is usually a pass-pipeline ordering bug.
- prepare() rejection. When the driver's prepare() slot fails (mismatch between schedule header and function shape, or AWS agent partition impossible), the dispatcher calls the driver's destructor and bubbles failure up.
- Alias-check trip. The helper pipeline's stage-2 alias check emits "Alias is not expected here." at severity 259 when an earlier pass left a pipeline-aliased value reaching schedule materialisation. The error is fatal; the driver tears down.
- agent_switch emission failure (AWS path only). When the agent-switch materialiser cannot find a valid warp-count partition for the SM, the AWS path fails and the dispatcher falls back to the synchronous path. The fallback is silent (no diagnostic) because the schedule analysis is intact and the loop can still execute correctly without warp specialisation.
Unspecialized Pipeline
TileASUnspecializedPipeline (CLI: tileas-unspecialized-pipeline) software-pipelines loops in the single-agent AUS flow. It peels a prologue, builds the steady-state body, and emits an epilogue drain. A two-stage pipeline takes a simpler shape; three or more stages introduce a repeating middle stage. The pass runs after D09 has chosen AUS over AWS and never fires on warp-specialized functions — AWS materialization owns its own pipelining and partitions the function into producer/consumer agents long before this pass would see it.
LogicalResult pipeline_unspecialized_loop(ScfForOp loop, uint32_t num_stages) {
if (num_stages <= 1) {
return success();
}
ScheduleMap map = extract_schedule_map(loop);
if (!has_valid_pipeline_schedule(map)) {
return failure();
}
SmallVector<Operation *> prologue = build_prologue(loop, map, num_stages);
ScfForOp body = rebuild_body_loop(loop, map, num_stages);
SmallVector<Operation *> epilogue = build_epilogue(loop, map, num_stages);
splice_pipeline_pieces(loop, prologue, body, epilogue);
return success();
}
The pass identity triple is sub_826530 / sub_826540 / sub_826550. The option num-stages sits at pass + 464 (u32, default 2); the driver-level switch unspecialized-pipeline-num-stages from sub_6D3460 overrides it. The pass early-exits when numStages <= 1 — the schedule expander has nothing to peel.
Pass Body
The pass body at sub_8337F0 (9 774 B, 290 BB) walks the top-level region with sub_827610 plus the callback sub_827000, collecting candidate scf.for and scf.while loops. Each candidate runs through a two-stage legality vtable v239 = {sub_8274D0, sub_826C50} (hasPipelinableOps + hasValidSchedule). Loops failing either gate pass through unchanged, honoring the schedule-map contract earlier scheduling passes published.
LogicalResult run_unspecialized_pipeline(FuncOp func, PassState *self) {
uint32_t num_stages = *(uint32_t *)((uint8_t *)self + 464);
if (num_stages <= 1) {
return success();
}
SmallVector<LoopLikeOp> candidates;
sub_827610(func, &candidates, sub_827000); /* region walk */
for (LoopLikeOp loop : candidates) {
if (!sub_8274D0(loop) || !sub_826C50(loop)) { /* legality vtable */
continue; /* leave loop bit-for-bit unchanged */
}
if (failed(expand_loop_schedule(loop, num_stages))) {
sub_446CE00(loop.loc(), "Failed to pipeline loop", /*severity=*/3); /* 3 = Remark */
*(uint32_t *)((uint8_t *)self + 40) |= 4;
}
}
return success();
}
On the failure remark, the loop stays bit-for-bit unchanged and *(self+40) |= 4 flags the recoverable miss so downstream passes can react. The verbatim diagnostic "Failed to pipeline loop" (23 chars) fires at 0x834F26 with LODWORD(severity) = 3 (Remark). D13 OptimizePipelineRegion is the primary consumer of that bit — it checks bit 2 of the same word to skip un-pipelined loops rather than chase regions that were never materialized.
LoopScheduleExpander
The schedule expander at sub_82CC30 is LoopScheduleExpander::expand (10 341 B, 505 BB). It extracts a ScheduleMap consisting of a 0x28-B header followed by 16-B StageEntry {Operation*, i32 stage, i32 iterOffset} records. The map itself is an open-addressed DenseMap with 72-B slots — the same shape as DenseMap<Operation*, SmallVector<Value, 4>> used elsewhere in the schedule layer — with sentinels -4096 and -8192 and identity-pointer hashing. Stage and iter-offset values come off operations through inherent-attribute classIDs: stage-attr classID at &unk_5B44F90, iteration-offset classID at &unk_5B44ED0.
struct StageEntry { /* 16 B */
Operation *op;
int32_t stage;
int32_t iter_offset;
};
struct ScheduleMap { /* header 0x28 B + 72-B slots */
uint8_t header[0x28];
Slot *slots; /* sentinel keys: -4096 (empty), -8192 (tombstone) */
};
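The record size and the two sentinel values can be checked with a few lines of C++. This is a minimal sketch, assuming a 64-bit target; StageEntrySketch, empty_key, tombstone_key, and slot_is_free are hypothetical names, and the mapping of -4096 to the empty key and -8192 to the tombstone follows upstream LLVM's pointer DenseMapInfo rather than anything recovered from the binary.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the 16-byte StageEntry record; assumes a 64-bit
// target where pointers are 8 bytes.
struct StageEntrySketch {
    void   *op;           // stands in for Operation*
    int32_t stage;
    int32_t iter_offset;
};
static_assert(sizeof(StageEntrySketch) == 16, "record must stay 16 bytes");
static_assert(offsetof(StageEntrySketch, stage) == 8, "stage follows the pointer");

// LLVM-style pointer sentinels: -1 and -2 shifted left by 12 bits.
inline intptr_t empty_key() {
    return static_cast<intptr_t>(static_cast<uintptr_t>(-1) << 12);
}
inline intptr_t tombstone_key() {
    return static_cast<intptr_t>(static_cast<uintptr_t>(-2) << 12);
}

// A slot can accept a new key when it holds either sentinel.
inline bool slot_is_free(intptr_t key) {
    return key == empty_key() || key == tombstone_key();
}
```

Both sentinels have their low 12 bits clear, so they can never collide with a real, suitably aligned Operation* unless the allocator hands out addresses in the last two pages of the address space.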
Peel-Piece Builders
LoopScheduleExpander::expand invokes three peel-piece builders in a fixed order. The interior-stage selector is v227 = 2 * (numStages != 2): at numStages == 2 it collapses to a single-copy prologue and single-copy drain with no interior stage; for three or more stages it expands stage 1 into the repeating middle.
| Builder | Stage | Role |
|---|---|---|
| sub_82F650 | prologue | emits the lead-in iterations that fill the pipeline before the steady-state body |
| sub_829440(stage=0) | stage-0 body | emits the first repeating slice of the steady-state body |
| sub_829440(stage=1) | stage-1 body | emits the second slice; for numStages >= 3 this slice becomes the repeating middle |
| sub_827E10 | rewrite | rebuilds the original scf.for with adjusted trip count N - (numStages - 1) |
After sub_827E10 produces the rebuilt loop, sub_82CC30 runs a second time on the new body to rebuild the ScheduleMap against the fresh operations. The 9-argument rewrite driver sub_82BB80 (induction var, new loop op, prologue ops + count, stage map, stage0 body, stage context, epilogue ops + count) splices prologue, body, and epilogue into place. Per-op SSA wiring goes through sub_8307E0, the stage-mapped clone helper that reissues each op with operands rewritten to the corresponding stage's value mapping. The schedule remapper sub_82A860 updates per-op stage tags so the rebuilt body still matches the published schedule.
Failure Handling
Failure leaves the original loop bit-for-bit unchanged and tags the pass result so later pipeline-region optimization skips it. The flag bit at *(self + 40) |= 4 is the only signal the downstream pipeline reads; the pass itself returns success() because a missed pipelining opportunity is recoverable, not a hard verifier error. D13 OptimizePipelineRegion is the primary downstream consumer and checks bit 2 to skip un-pipelined loops.
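The handshake amounts to one OR on the producer side and one AND on the consumer side. The sketch below assumes the word at offset 40 is a 32-bit integer, as the pseudocode casts suggest; PassStateSketch and the helper names are hypothetical stand-ins for the opaque pass object.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint32_t    kRecoverableMiss = 0x4;  // bit 2 of the handshake word
constexpr std::size_t kResultOffset    = 40;   // byte offset quoted in the text

// Hypothetical stand-in for the opaque pass object; only offset 40 matters.
struct PassStateSketch {
    alignas(8) unsigned char bytes[64] = {};
};

inline uint32_t load_word(const PassStateSketch &s) {
    uint32_t w;
    std::memcpy(&w, s.bytes + kResultOffset, sizeof w);
    return w;
}

// Producer side (D11): record the miss without disturbing other bits.
inline void flag_pipeline_miss(PassStateSketch &s) {
    uint32_t w = load_word(s) | kRecoverableMiss;
    std::memcpy(s.bytes + kResultOffset, &w, sizeof w);
}

// Consumer side (D13): test bit 2 only.
inline bool pipeline_missed(const PassStateSketch &s) {
    return (load_word(s) & kRecoverableMiss) != 0;
}
```

Using memcpy instead of a pointer cast keeps the sketch free of aliasing problems; the decompiled code performs the equivalent raw load and store.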
Why "Unspecialized"
The name distinguishes this pass from the warp-specialized pipeliner inside D09 MaterializeSchedule. Both produce software-pipelined loops, but they target different execution models:
| Pass | Execution model | When it runs |
|---|---|---|
| D09 AWS driver | warp-specialized: one or two compute agents plus a separate producer agent partitioned by agent_switch | scheduler chose WARP_SPECIALIZED strategy and use-AUS is false |
| D11 UnspecializedPipeline | single-agent: producer and consumer share one warp group; pipelining happens through token iter-args and stage tags | scheduler chose UNSPECIALIZED strategy, or D09 ran with use-AUS=true |
D11 never fires on warp-specialized functions. Its filter callback checks the function for the presence of nv_tileas.async.pipeline.agent_switch ops and skips any function that carries them — AWS materialisation owns its own pipelining and would corrupt the agent boundaries if D11 re-pipelined on top.
IR Before and After Expansion
The input is a function carrying nv_tileas.async.pipeline.* ops whose stage and iteration-offset attributes were placed by the scheduler but whose loop body has not yet been peeled. The output is the same function with prologue, body, and epilogue copies of the body fused into the surrounding region.
// Input: a stage-tagged but un-peeled scf.for. Each pipeline op carries
// stage = 0..num_stages-1; iter_offset names how far back this op should run.
%out = scf.for %i = %c0 to %n step %c1 iter_args(%t = %tok0) -> !nv_tileas.token {
%ap, %ta = nv_tileas.async.pipeline.produce_one_async %i
{ stage = 0 : i32, iter_offset = 0 : i32 } : !nv_tileas.token
%a, %tac = nv_tileas.async.pipeline.consume_one %ta
{ stage = 2 : i32, iter_offset = -2 : i32 } : !nv_tileas.token
...
scf.yield %t : !nv_tileas.token
}
After expansion the producer side leads the consumer side by num_stages - 1 iterations in the prologue, the steady-state body runs one stage of each pipeline phase per iteration, and the epilogue drains the producer-side state:
// Output (num_stages = 3, simplified). Prologue fires two producer
// iterations before the steady-state body opens.
%ap0, %ta0 = nv_tileas.async.pipeline.produce_one_async %c0
{ stage = 0 } : !nv_tileas.token
%ap1, %ta1 = nv_tileas.async.pipeline.produce_one_async %c1
{ stage = 0 } : !nv_tileas.token
%out = scf.for %i = %c0 to %n_minus_2 step %c1
iter_args(%t = %ta1) -> !nv_tileas.token {
// Steady state: one producer and one consumer per iteration, staggered.
%a, %tac = nv_tileas.async.pipeline.consume_one %t
{ stage = 2 } : !nv_tileas.token
%ap, %ta = nv_tileas.async.pipeline.produce_one_async (%i + %c2)
{ stage = 0 } : !nv_tileas.token
scf.yield %ta : !nv_tileas.token
}
// Epilogue: drain remaining in-flight iterations.
%a_drain0, %td0 = nv_tileas.async.pipeline.consume_one %out
{ stage = 2 } : !nv_tileas.token
%a_drain1, %td1 = nv_tileas.async.pipeline.consume_one %td0
{ stage = 2 } : !nv_tileas.token
Peel-Piece Builder Sequence
The three peel-piece builders run in the fixed order prologue → body → epilogue, and the interior-stage selector v227 = 2 * (numStages != 2) controls how stage 1 is treated. For numStages == 2 the selector evaluates to zero: the prologue emits one producer-only iteration, the body alternates producer and consumer once per iteration, and the epilogue emits one consumer-only iteration. For numStages >= 3 the selector evaluates to two: stage 1 becomes the repeating middle, and the steady-state body fires one op from each of stages 0, 1, and 2 per iteration.
LogicalResult expand_pipeline(ScfForOp loop, uint32_t num_stages) {
const int64_t N = original_trip_count(loop); /* hypothetical helper: trip count before peeling */
ScheduleMap map = extract_schedule_map(loop);
if (!has_valid_pipeline_schedule(map)) {
return failure(); /* sister legality vtable rejected */
}
/* Prologue: emit num_stages - 1 lead-in iterations. */
SmallVector<Operation *> prologue;
for (uint32_t k = 0; k < num_stages - 1; ++k) {
emit_stage_peel(&prologue, map, /*iter=*/k, /*stages_through=*/k);
}
/* Body: rebuild with overlapped stages. The interior-stage selector
collapses stage 1 to a no-op when num_stages == 2. */
uint32_t interior = 2 * (num_stages != 2);
ScfForOp body = rebuild_body_loop(loop, map, num_stages, interior);
/* Epilogue: drain in-flight iterations in stage order. */
SmallVector<Operation *> epilogue;
for (uint32_t k = 0; k < num_stages - 1; ++k) {
emit_stage_drain(&epilogue, map, /*iter=*/N - num_stages + 1 + k,
/*stages_from=*/k + 1);
}
splice_pipeline_pieces(loop, prologue, body, epilogue);
return success();
}
The body builder rebuilds the loop with the adjusted trip count N - (num_stages - 1). The schedule remapper sub_82A860 then walks the rebuilt body and updates per-op stage tags to point at the freshly cloned ops, so the post-rebuild ScheduleMap matches the published schedule.
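The arithmetic behind the three pieces is small enough to state directly. The following sketch computes the prologue count, the rebuilt trip count, the epilogue count, and the interior-stage selector for a given N and numStages. PipelineShape and plan_pipeline are hypothetical names, and the rejection of loops shorter than the peel depth is an assumption added for safety, not something the text attributes to the pass.

```cpp
#include <cstdint>

struct PipelineShape {
    uint32_t prologue_iters;   // lead-in iterations peeled before the loop
    int64_t  steady_trip;      // trip count of the rebuilt loop
    uint32_t epilogue_iters;   // drain iterations emitted after the loop
    uint32_t interior;         // v227 = 2 * (numStages != 2)
};

// Returns false when there is nothing to pipeline or the loop is too short.
inline bool plan_pipeline(int64_t n, uint32_t num_stages, PipelineShape &out) {
    if (num_stages <= 1) return false;               // same early exit as the pass
    const int64_t peeled = static_cast<int64_t>(num_stages) - 1;
    if (n < peeled) return false;                    // assumption: short loops are left alone
    out = {num_stages - 1, n - peeled, num_stages - 1, 2u * (num_stages != 2)};
    return true;
}
```

For N = 10 and numStages = 3 this yields two prologue iterations, a steady-state trip count of 8, two drain iterations, and a selector of 2; with numStages = 2 the selector drops to 0 and the trip count is 9.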
Optimize Pipeline Region
TileASOptimizePipelineRegion (CLI: tileas-optimize-pipeline-region) shrinks every nv_tileas.async.pipeline.produce_one and consume_one region to the minimal backward slice of the ops actually feeding the region's yielded values. It runs immediately after TileASUnspecializedPipeline (D11) and reads D11's pass-result bit so it skips loops D11 left synchronous.
The pass identity triple is sub_83BAE0 / sub_83BAF0 / sub_83BB00. The description string is the verbatim "Optimize the region scope of tileas.async.pipeline.produce_one/consume_one ops" (no leading typo, no trailing punctuation). The pass exposes no CLI options; behaviour is deterministic given the input IR and the D11 bit.
Pass Body
The pass body at sub_840EF0 (2 657 B, 104 BB) is a thin region-walk driver. It collects every produce_one and consume_one op in the function into a SmallVector<Operation*, 48> and iterates that vector back-to-front, calling the region shrinker on each candidate. The walk dispatches through sub_83C190 (the standard Tileiras region-walk driver) with sub_83C100 as the per-op filter callback; the filter classifies each visited op by reading its OperationName* slot at op+48 and matching it against two interned pointers.
bool filter_pipeline_region_ops(Operation *op, void *bucket) {
const void *opname = *(const void **)((uint8_t *)op + 48);
if (opname == &unk_5BE6138) {
return false; /* unregistered sentinel: skip */
}
if (opname == &unk_5B44F70 || /* consume_one */
opname == &unk_5B44F38) { /* produce_one */
smallvector_push_back((SmallVector *)bucket, op);
}
return true;
}
The sentinel &unk_5BE6138 guards against unregistered op shells that share storage with registered dialect ops but must not be visited. After the walk, the driver inspects D11's failure-remark bit and walks the bucket back-to-front:
void run_optimize_pipeline_region(FuncOp func, PassState *self, PassState *d11) {
if ((*(uint32_t *)((uint8_t *)d11 + 40)) & 4) { /* D11 "Failed to pipeline loop" remark */
return; /* schedule expander never materialised; nothing to shrink */
}
SmallVector<Operation *, 48> ops;
sub_83C190(func, &ops, &filter_pipeline_region_ops);
Operation **end = ops.end();
for (Operation **cur = end; cur != ops.begin();) {
cur -= 1; /* v2 -= 8 in the binary */
sub_83E1B0(*cur, self); /* region shrinker */
}
}
Back-to-front iteration is contract, not preference. The region walk pushes ops in source order, but the shrinker rewrites the region in place by creating a fresh op next to the original and moving slice ops into the new region; processing siblings last-first keeps every earlier op's defining-op chain stable until the shrinker reaches it. Front-to-back walking would invalidate later vector entries the first time a slice contains an unvisited sibling.
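The traversal itself is an ordinary reverse walk over the collected vector. The sketch below shows the decrement-before-use shape in portable C++; visit_back_to_front and the Shrink callback are hypothetical names standing in for the driver loop and the region shrinker.

```cpp
#include <cstddef>
#include <vector>

// Visits candidates last-first, mirroring the pointer decrement in the text.
// `Shrink` is a hypothetical callback standing in for the region shrinker.
template <typename T, typename Shrink>
void visit_back_to_front(std::vector<T> &candidates, Shrink shrink) {
    for (std::size_t i = candidates.size(); i != 0;) {
        --i;                      // decrement before use, like `cur -= 1`
        shrink(candidates[i]);
    }
}
```

Indexing from size() down to zero avoids the unsigned wrap-around that a `for (i = size - 1; i >= 0; --i)` loop would hit on an empty vector.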
D11 Bit-2 Gate
D11's pass state at *(d11_state + 40) is the failure-handshake word every TileAS pass shares; bit 2 (0x4) is the recoverable "Failed to pipeline loop" remark emitted at 0x834F26. D13 reads it through the standard pass-result lookup and skips the shrinker on functions whose loops D11 refused to pipeline. The reasoning is direct: when D11 leaves a loop synchronous, its produce_one / consume_one regions were never materialised and have no surplus ops to remove. Touching them would still be safe, but the region walker would find no candidates and the shrinker would never fire.
Three Structural Boundary Checks
The shrinker's correctness rests on three predicates evaluated for every defining op the backward slice walker reaches. Each predicate names a different reason an op must not be pulled into the new region body, and the walker terminates on the first predicate that fires.
| Predicate | Boundary it enforces | Why it terminates the walk |
|---|---|---|
| getDefiningOp() == nullptr | the value is a block argument (region-external) | the producer lives outside the region; merging it would change SSA scoping |
| def->parentRegion() != parent | the value is region-external, typically a loop iter-arg or surrounding scf.for live-in | pulling a region-external op into the new region would extend its lifetime past its original scope |
| def->name() == &unk_5B44F38 | the defining op is a sibling produce_one | pipeline ownership crosses producer/consumer pairs through tokens, not SSA edges; merging a producer into a consumer slice would erase the ownership boundary |
The order of the checks matters: nullptr-first avoids dereferencing a null Operation* in the second predicate's region lookup, and the region check ahead of the op-name check skips the OperationName* interning probe for the common region-external case. Broadening the set (for instance, stopping at any nv_tileas.async.pipeline.* op) over-shrinks consumer regions that legitimately read through pipeline view ops; narrowing it (for instance, dropping the region-parent check) pulls live-in values into the new region and loses them at the next verifier pass.
Region Shrinker
The region shrinker at sub_83E1B0 (11 571 B, 512 BB) is the heavy body. For each candidate op it computes the minimal backward slice of the op's yielded values, allocates a fresh op with an empty region, builds a nv_tileas.async.pipeline.yield terminator, moves every slice op into the new region in source order, and erases the original. The slice walker tracks visited ops in a DenseSet keyed by Operation* using the verbatim LLVM DenseMapInfo<const void*>::getHashValue constants — CityHash multiplier 0x9DDFEA08EB382D69 and seed 0xAE502812AA7333 — sized from 64 buckets and grown at the standard load factor 4 * (size + 1) >= 3 * num_buckets. The CityHash constants serve double duty: they identify the DenseSet as a region-identity cache so repeat invocations on the same region hash to the same bucket, and they keep the dedup probe collision-free across the typical pipeline region size of 50-200 ops.
LogicalResult shrink_consume_one(Operation *op, Rewriter *rw) {
if ((op->flags & 0x7FFFFF) != 0) { /* malformed op */
llvm::report_fatal_error(/* trap */);
}
Region *parent = op->parentRegion();
DenseSet<Operation *> slice; /* CityHash 0x9DDFEA08EB382D69 / seed 0xAE502812AA7333 */
Worklist work(op->getResults()); /* seeded with yielded values */
while (!work.empty()) {
Value v = work.pop();
Operation *def = v.getDefiningOp();
if (def == nullptr) {
continue; /* block argument: external boundary */
}
if (def->parentRegion() != parent) {
continue; /* outside the current region */
}
if (def->name() == &unk_5B44F38) {
continue; /* produce_one: cross-pipeline boundary */
}
if (slice.insert(def)) {
work.push(def->getOperands());
}
}
Operation *fresh = rw->create( /* see line 1927 fatal-error branch */
"nv_tileas.async.pipeline.consume_one", op->getResultTypes(), op->getOperands());
Region *body = sub_43FCA60(fresh); /* allocate region */
rw->create_in(body, "nv_tileas.async.pipeline.yield", yielded_values); /* line 2177, length 30 */
sub_448E010(slice, body, &sub_83BC00); /* moveInto with per-op callback */
sub_446E1E0(op); /* eraseOp */
return success();
}
The three boundary checks in the pseudocode mirror the table in Three Structural Boundary Checks. The verbatim op name "nv_tileas.async.pipeline.consume_one" is the string passed to the OperationName lookup when the fresh op is built; the terminator name "nv_tileas.async.pipeline.yield" (length 30) is the second registered name the shrinker emits.
The op-flag sanity check (op->flags & 0x7FFFFF) == 0 traps on malformed ops whose 23-bit op-properties word is non-zero. Tileiras's pipeline region ops carry their properties on the region body, not on the wrapper op, so a non-zero properties word means an earlier pass broke the contract. The trap is intentional: continuing would silently lose the properties.
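The mask arithmetic can be verified in isolation. This sketch assumes the flags field is a 32-bit word whose low 23 bits hold the properties; kPropertiesMask and wrapper_is_well_formed are hypothetical names.

```cpp
#include <cstdint>

constexpr uint32_t kPropertiesMask = 0x7FFFFF;  // low 23 bits
static_assert(kPropertiesMask == (1u << 23) - 1, "mask covers exactly 23 bits");

// True when the wrapper op carries no properties, which is the contract
// the shrinker relies on before it rebuilds the op.
inline bool wrapper_is_well_formed(uint32_t flags) {
    return (flags & kPropertiesMask) == 0;
}
```

Bits 23 and above are ignored by the check, so a word such as 0x800000 still passes while any value with a low bit set trips the trap.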
Transitive Operand Closure
The slice walker reaches every region-internal defining op by chasing operands transitively. The closure helper at sub_83CB40 (2 306 B) builds an op's transitive operand set into a fresh DenseSet, stopping at the same three boundaries (producer-op name, region change, block argument). It runs from sub_83DAB0, the per-operand closure walker the shrinker uses to expand the worklist one yielded value at a time without materialising the full operand DAG in memory.
void transitive_operand_closure(Operation *root, Region *region, DenseSet<Operation *> *out) {
Worklist work(root->getOperands());
while (!work.empty()) {
Value v = work.pop();
Operation *def = v.getDefiningOp();
if (def == nullptr || def->parentRegion() != region || def->name() == &unk_5B44F38) {
continue;
}
if (out->insert(def)) {
work.push(def->getOperands());
}
}
}
Keeping the closure walker separate from the shrinker lets the produce_one path reuse the same boundary logic without dragging in the yield-rebuild and op-replace machinery.
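The closure can be exercised on a toy IR without any MLIR dependency. In the sketch below a value is represented by its defining op, nullptr stands for a block argument, and regions are plain integer ids; ToyOp and backward_slice are hypothetical names, and the string comparison replaces the interned-pointer comparison the binary performs.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy IR, not the MLIR API: a value is represented by its defining op,
// and nullptr stands for a block argument.
struct ToyOp {
    std::string name;
    int region;                            // id of the enclosing region
    std::vector<const ToyOp *> operands;
};

inline std::set<const ToyOp *> backward_slice(
        const std::vector<const ToyOp *> &yielded, int region) {
    std::set<const ToyOp *> slice;
    std::vector<const ToyOp *> work(yielded.begin(), yielded.end());
    while (!work.empty()) {
        const ToyOp *def = work.back();
        work.pop_back();
        if (def == nullptr) continue;               // block argument
        if (def->region != region) continue;        // region-external
        if (def->name == "produce_one") continue;   // sibling producer
        if (slice.insert(def).second)               // expand on first visit only
            work.insert(work.end(), def->operands.begin(), def->operands.end());
    }
    return slice;
}
```

Seeding the walk with a single yielded dot op whose operands are two reads and an external accumulator returns exactly the dot and the two reads; a second dot that nothing yields never enters the set.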
produce_one Shrinker
The produce_one path runs an inlined slice walker that uses the same three boundaries as the consume_one path but skips the yield rebuild — produce_one regions have no result values to thread through a terminator, so the original yield op suffices once unused defining ops are gone. Instead of sub_448E010's moveInto callback, the inlined walker dispatches through three moveBefore variants — sub_446E270, sub_446E300, sub_446E390 — depending on whether each slice op lands before the original op, before a specific anchor, or before the region's first non-terminator op. The three-variant fan-out matches the structural cases produce_one regions present after D11 expands stages: bare producer ops, ops anchored to a producer_acquire lifetime, and ops live across an scf.if guard prelude.
ClassID Dispatch Table
The shrinker's per-op behaviour branches on the op's OperationName* pointer, not on a dynamic type query. The dispatch table sits implicit in the filter callback and the boundary checks, but it can be read straight out of the binary:
| OperationName* slot | Op | Role in shrinker |
|---|---|---|
| &unk_5B44F38 | nv_tileas.async.pipeline.produce_one | candidate for produce_one shrinker; backward-slice boundary in consume_one walks |
| &unk_5B44F70 | nv_tileas.async.pipeline.consume_one | candidate for consume_one shrinker |
| &unk_5BE6138 | unregistered sentinel | skipped by filter callback; guards against unregistered op shells |
No other op-name pointer reaches the shrinker. Every defining op encountered during the backward walk joins the slice unconditionally as long as it lives in the same region and isn't a produce_one. The shrinker therefore needs no knowledge of the rest of the pipeline op surface — producer_acquire, consumer_release, producer_commit, consumer_wait, and the various read/write ops all move into the new region as ordinary slice members.
Diagnostics and Failure Handling
The shrinker emits no diagnostics on the success path. The op-flag sanity check is a fatal trap, not a recoverable error: a malformed op indicates an earlier-pass bug, not user IR Tileiras can reject gracefully. The pass itself never sets *(self + 40) |= 4 and never returns failure() — every candidate either shrinks successfully or is structurally unshrinkable (slice equals the original region) and stays bit-for-bit unchanged.
Input and Output IR Shapes
The input is a consume_one or produce_one region whose body contains every op the upstream pass placed into it, including ops whose results no longer reach the region's yielded values. After D11 expansion the region's body typically contains the union of all stages' compute work; the shrinker reduces it to just the slice that contributes to the yielded values.
// Input: a consume_one region with surplus ops left over from D11
// expansion. %x_dead and %y_dead are produced but never reach the yield.
%r = nv_tileas.async.pipeline.consume_one %t {
%a = nv_tileas.async.pipeline.consumer_read %t : tensor<128x64xf16>
%b = nv_tileas.async.pipeline.consumer_read %t : tensor<64x128xf16>
%x_dead = nv_tileas.dot %a, %b, %acc_dead : tensor<128x128xf32> // unused
%y_dead = nv_tileas.shuffle %a, %perm : tensor<128x64xf16> // unused
%c = nv_tileas.dot %a, %b, %acc : tensor<128x128xf32>
nv_tileas.async.pipeline.consumer_release %t
nv_tileas.async.pipeline.yield %c : tensor<128x128xf32>
} : tensor<128x128xf32>
After shrinking, the new consume_one region contains only the backward slice of %c and the original consumer_release. The CityHash-keyed DenseSet ensures each defining op enters the slice exactly once even when multiple yielded values reach it through different operand chains.
// Output: surplus ops dropped; only the slice reaching %c remains.
%r = nv_tileas.async.pipeline.consume_one %t {
%a = nv_tileas.async.pipeline.consumer_read %t : tensor<128x64xf16>
%b = nv_tileas.async.pipeline.consumer_read %t : tensor<64x128xf16>
%c = nv_tileas.dot %a, %b, %acc : tensor<128x128xf32>
nv_tileas.async.pipeline.consumer_release %t
nv_tileas.async.pipeline.yield %c : tensor<128x128xf32>
} : tensor<128x128xf32>
When the slice equals the original region body, the shrinker is a no-op: it builds the fresh op, finds the slice covers every original member, and erases the fresh op instead of the original. This is the "structurally unshrinkable" case the failure handling section refers to.
Why Back-to-Front Matters
The region walker pushes candidates in source order; the shrinker iterates v2 -= 8 (a pointer decrement over an 8-byte-stride SmallVector<Operation*, 48>). When a function contains several sibling consume_one regions in the same parent — typical of an AUS-pipelined loop body with multiple compute stages — shrinking the last sibling first keeps earlier siblings' slice walks looking at a consistent operand DAG. Front-to-back shrinking would erase a defining op a later sibling's slice still references, and the later walk would either skip a live op (silently dropping work) or trap on a dangling Operation* — the DenseSet probe touches the pointer's hash, not its body, but the subsequent operand expansion dereferences the erased op's storage.
Reimplementation Notes
A reimplementation must key its DenseSet by Operation*, not by SSA value or op index: the slice walker reaches the same defining op multiple times, once per yielded value whose operand chain leads to it, and only pointer-keyed dedup keeps the closure linear. The boundary set must be exactly the three checks listed under Three Structural Boundary Checks; that section also describes how broadening the set over-shrinks consumer regions and how narrowing it pulls live-in values into the new region.
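For sizing the set, the growth predicate quoted under Region Shrinker can be simulated directly. The sketch below starts at 64 buckets and applies 4 * (size + 1) >= 3 * num_buckets before each insertion; doubling on growth is an assumption borrowed from upstream LLVM's DenseMap, and must_grow and buckets_after are hypothetical names.

```cpp
#include <cstdint>

// Growth predicate quoted in the Region Shrinker section.
inline bool must_grow(uint64_t size, uint64_t num_buckets) {
    return 4 * (size + 1) >= 3 * num_buckets;
}

// Number of buckets after inserting `n` distinct keys, starting from 64.
// Doubling on growth is an assumption borrowed from upstream LLVM's DenseMap.
inline uint64_t buckets_after(uint64_t n) {
    uint64_t buckets = 64;
    for (uint64_t size = 0; size < n; ++size)
        if (must_grow(size, buckets)) buckets *= 2;
    return buckets;
}
```

Under these assumptions the table stays at 64 buckets through 47 entries, doubles on the 48th, and reaches 512 buckets by the time a 200-op region has been walked.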
Block-Scaled MMA Verification
Blackwell block-scaled MMA must satisfy a small catalog of shape and type invariants before lowering:
- FP4 MMA requires scale factors.
- Scale-factor element types for A and B must agree with the MMA kind.
- The accumulator must be Float32.
- Scale-factor vector size must match the K extent.
- Only supported (atom_k, vector_size) combinations are accepted.
- One-CTA and two-CTA variants must use compatible shapes.
The verifier returns the selected atom shape for lowering. Zero or failure means the op is invalid and must not proceed to NVVM.
Pipeline to NVVM
Pipeline lowerings consume the logical pipeline surface and emit fixed NVVM/LLVM sequences.
| Pipeline concept | NVVM/LLVM lowering |
|---|---|
| producer acquire | participant masks, cluster arrive, mbarrier arrive, and state update |
| producer commit or tail | async bulk wait or named-barrier synchronization |
| async wait on TMA | nvvm.cp.async.bulk.commit.group and nvvm.cp.async.bulk.wait_group |
| async wait on GMMA | nvvm.wgmma.commit.group.sync.aligned and wait-group sync |
| async wait on mbarrier | nvvm.mbarrier.try_wait.parity.shared loop |
| create none | LLVM poison value |
| token/async casts | temporary unrealized conversion casts |
| named barrier | nvvm.barrier.cta.sync or warp/cluster barrier sequence |
The TMA bulk-copy templates are documented in TMA, Tensormap, and cp.async.bulk Emission — cp.async.bulk Template Catalog; the WGMMA emission protocol that produces the commit-group / wait-group sequence is in WGMMA Emission Protocol — The Four-Op Sequence; the mbarrier state machine that anchors the arrive/try-wait loop is in mbarrier State Machine; cluster-arrive / cluster-wait pairs and DSMEM transactions are documented in Cluster Sync and DSMEM Handshake. The shared codegen surface for the tcgen05 / WGMMA / mbarrier / cluster families lives in tcgen05, WGMMA, mbarrier, and Cluster Sync.
Ordering Invariants
- Queue-to-pipeline rewrite must run before async materialization.
- Async materialization must run before schedule materialization.
- Convert-layout materialization must run before schedule consumers rely on final stage counts.
- AWS materialization emits agent_switch; unspecialized pipeline must skip AWS-partitioned functions.
- Pipeline-region optimization must run after producer/consumer regions are in their final form.
- Block-scaled MMA verification runs whenever the op is built or transformed.
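The pure ordering constraints in this list can be checked mechanically against a candidate pass pipeline. The sketch below is one way to do that; the pass labels are hypothetical shorthand for the bullets rather than registered pass names, and the third pair encodes the assumption that producer/consumer regions reach their final form once the unspecialized pipeline has run.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Order = std::pair<std::string, std::string>;  // first must precede second

// Returns true when every constrained pair present in `pipeline` appears in order.
inline bool respects_ordering(const std::vector<std::string> &pipeline,
                              const std::vector<Order> &constraints) {
    auto index_of = [&](const std::string &name) {
        return std::find(pipeline.begin(), pipeline.end(), name) - pipeline.begin();
    };
    const auto missing = static_cast<std::ptrdiff_t>(pipeline.size());
    for (const Order &c : constraints) {
        auto a = index_of(c.first), b = index_of(c.second);
        if (a == missing || b == missing) continue;  // pass not scheduled
        if (a >= b) return false;
    }
    return true;
}

// Labels are hypothetical shorthand for the bullets, not registered pass names.
inline std::vector<Order> tileas_constraints() {
    return {{"queue-to-pipeline", "async-materialization"},
            {"async-materialization", "schedule-materialization"},
            {"unspecialized-pipeline", "optimize-pipeline-region"}};
}
```

Constraints that mention a pass absent from the pipeline are skipped, so the same table works for both the AWS and AUS flows.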
Cross-References
The pipeline-op surface this family consumes and produces is catalogued in nv_tileas Op Roster and Builders — Pipeline Op Operand/Result Tables; the worked producer/consumer region example aligns with the pre-shrink IR shape D13 sees. The boundary between scheduler analysis and D09 materialisation, including the ScheduleAnalysis record and the strategy enum, is documented in Schedule Solve and Cost Evaluators — Pass Boundary; the modulo-scheduling algorithm that fills the record is in Modulo Scheduler and Rau-Style Placement. The mbarrier try_wait.parity loop the consumer side eventually lowers into is described in mbarrier State Machine — Phase Parity; the WGMMA commit-group / wait-group pair the dot ops lower into is in WGMMA Emission Protocol — The Four-Op Sequence. The cross-pass failure-bit convention every TileAS pass uses for recoverable errors is in Pass-Failure Handshake — Convention.
TileAS Layout and Buffer Family
Abstract
The layout and buffer passes decide where tile values live, remove redundant layout conversions, canonicalize buffer aliases, prune dead region arguments, materialize shared-memory handoffs between agents, and split sliced loops. They run after async and schedule materialization has exposed producer/consumer structure, but before final scheduling and lowering demand stable memory layouts.
The family is internal to the TileAS pipeline, but its public contract is concrete: load/store-class operations come out with assigned layouts, buffer aliases are explicit, agent boundaries cross through shared memory when needed, and sliced loops expose independent per-slice regions.
Pass Roster
| Pass | Purpose |
|---|---|
| TileASAssignLoadStoreLayouts | assigns register, shared-memory, tensor-memory, and tiled layouts for load/store groups |
| TileASRemoveLayoutConversions | commutes and deletes redundant convert_layout operations |
| TileASRemoveBufferAliasPass | rewrites aliased SMEM/TMEM allocs through selects and loops into canonical buffers |
| TileASRemoveDeadArgs | removes unused block arguments from region-branch operations |
| TileASResolveAgentBoundary | legalises values crossing agent_switch boundaries (documented under CTA Cluster Family — D20 aux passes) |
| TileASSlicingPass | splits loops carrying a sliceCount attribute into per-slice loop regions |
Assign Load/Store Layouts
D14 picks concrete memory layouts — shared, blocked, dot-operand, or linear — for every loadable or storable value flowing through a pipelined kernel. It runs at function scope through three cooperating layers. The outer driver walks the function and partitions ops into pipeline alias groups by following producer/consumer edges between produce_one/consume_one pairs and convert_layout seeds. The per-group candidate collector enumerates every layout each op in the group could legally accept, keyed by (memKind, sub_layout_axis, alignment). The pipeline-layout assigner scores the surviving candidates against a three-term hardware-cost model and writes the winning nv_tileas.layout attribute back onto each op.
The four sub-layout axes (A, B, C, D) of a dot-product pipeline have specialised emitters because a candidate for operand A of a WGMMA carries different alignment and stride constraints than the accumulator. A-axis and B-axis emitters handle the operand-broadcast paths, the C-axis emitter handles the accumulator, and the D-axis emitter handles the result; D never participates in operand-broadcast paths and is inlined directly into the candidate collector.
When the candidate collector returns an empty set for a group, the assigner emits the verbatim diagnostic " can not find common memKind among pipeline alias group\n" (the leading space and trailing newline are part of the constant). The terseness is intentional — the upstream candidate collector has already attached per-op diagnostics for every other failure shape, and by the time control reaches the group-level emitter only the cross-op memKind disagreement remains to report.
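Because the leading space and trailing newline belong to the constant, a log scraper is safer with a byte-exact substring match than with an anchored pattern. The sketch below illustrates the difference; kNoCommonMemKind, is_memkind_failure, and starts_with_can are hypothetical names, and the code assumes C++17 for std::string_view.

```cpp
#include <string_view>

// Byte-exact copy of the constant quoted above, including both edge bytes.
constexpr std::string_view kNoCommonMemKind =
    " can not find common memKind among pipeline alias group\n";

// Substring match tolerates whatever prefix the surrounding emitter adds.
inline bool is_memkind_failure(std::string_view log_chunk) {
    return log_chunk.find(kNoCommonMemKind) != std::string_view::npos;
}

// What an anchored `^can` pattern effectively tests on a raw line.
inline bool starts_with_can(std::string_view line) {
    return line.substr(0, 3) == "can";
}
```

The substring form matches regardless of any location or severity prefix, while the anchored form rejects the raw constant because its first byte is a space.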
⚡ QUIRK — diagnostic constant carries a leading space and a trailing newline. The string " can not find common memKind among pipeline alias group\n" includes both a leading space and an embedded \n; both bytes are part of the string-pool constant, not formatter side-effects. A grep that anchors with ^can misses the message; a frontend that wraps diagnostics with its own newline produces a double blank line. The composition is intentional (the upstream emitter assumes a trailing punctuation slot was already consumed), but reproducing it byte-for-byte matters for log scrapers.
The driver dispatches each op in a group on its op kind:
| Op kind | Candidate-collector behaviour |
|---|---|
| erased sentinel | skip without dispatch |
| produce_one | emit producer-side memory candidates |
| consume_one | emit consumer-side memory candidates |
| view | thread existing layout through without new candidates |
| convert_layout | seed register-side candidates from the target encoding |
Each candidate also carries a layout family — one of shared, blocked, dot-operand, or linear — and the cost scorer dispatches its family-specific cost function on this tag. Candidates whose family disagrees with the rest of the group are pruned before scoring rather than penalised, keeping the scoring loop's branch profile flat. The op-kind and layout-family dispatch both use the pointer-identity convention described in TypeID Sentinels and Anchors — Idiom 1 — Static Pointer-Identity Sentinel.
LogicalResult assignLayouts(FunctionOpInterface fn) {
  SmallVector<PipelineGroup> groups = collectPipelineGroups(fn);   // Layer 1
  for (PipelineGroup &g : groups) {
    SmallVector<LayoutCandidate> cands = collectCandidates(g);     // Layer 2
    if (cands.empty()) {
      return emitDiag(" can not find common memKind among pipeline alias group\n");
    }
    Layout best = pickByCost(cands, hwModel());                    // Layer 3
    applyLayout(g, best);
  }
  return success();
}
The per-operation rewrite dispatcher covers ordinary loads and stores, tiled loads and stores, tiled atomics, gather/scatter ops, register-layout index math, and TMA-preferred paths. An environment switch biases eligible load/store ops toward TMA form, but verifier checks remain authoritative.
Three-Layer Cost Model
Layer 1 enumerates candidate atoms per op. Layer 2 filters by structural legality: operand shape must match the atom's accepted shape, the memory space must match the atom's source/destination domains, and the alignment of each operand must satisfy the atom's minimum. Layer 3 scores the remaining candidates against three additive cost terms:
- SMEM bank-conflict cost: the number of 32-byte transactions required to service the chosen swizzle without two threads of a warp hitting the same shared-memory bank in the same cycle. The cost is the count of conflict-free transactions; an atom that needs four transactions to deliver a tile row costs more than one that needs one.
- TMEM bandwidth cost: for SM100 and SM103 paths only, the number of tensor-memory cycles per tile row consumed by the chosen tcgen05 atom. The cost is denominated in cycles directly.
- Register pressure cost: the count of live registers across the atom's window. Atoms that materialise fragments in registers (ldmatrix.sync family) add the fragment size to the cost; atoms that keep the value memory-resident (cp.async.bulk.tensor family) contribute zero register cost but pay in SMEM and TMEM terms.
The scorer sums the three terms with a fixed weight vector, breaks ties on register pressure first then SMEM bank-conflict count, and returns the candidate with the lowest score that the structural filter has not already pruned.
Worked Example: Load of tensor<128x64xf16> from SMEM
Consider a single tiled_load of a tensor<128x64xf16> value out of shared memory, with three structurally legal candidate atoms reaching layer 3:
| Atom | SMEM transactions | TMEM cycles/row | Live registers |
|---|---|---|---|
| LDSM_M88 (ldmatrix.sync.aligned.m8n8.x4) | 4 | n/a | 32 |
| LDSM_M816 (ldmatrix.sync.aligned.m8n16.x4) | 2 | n/a | 64 |
| CP_ASYNC_BULK_TENSOR (cp.async.bulk.tensor.2d.shared) | 1 | 0 | 0 |
With the default weight vector w = (1, 4, 0.25) on (SMEM, TMEM, registers):
- LDSM_M88: 1·4 + 4·0 + 0.25·32 = 12
- LDSM_M816: 1·2 + 4·0 + 0.25·64 = 18
- CP_ASYNC_BULK_TENSOR: 1·1 + 4·0 + 0.25·0 = 1
CP_ASYNC_BULK_TENSOR wins because it keeps the value memory-resident, avoiding the register-fragment cost the two LDSM atoms pay. If the surrounding context already binds the consumer to a register-fragment layout (a downstream WGMMA, for example), structural filtering eliminates the bulk-tensor candidate at layer 2 and the scorer chooses between the two LDSM atoms. LDSM_M88 then wins outright, 12 against 18, with no tie-break needed: its register term (0.25·32 = 8) is half that of LDSM_M816 (0.25·64 = 16), which more than offsets its two extra SMEM transactions.
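The arithmetic of the worked example can be reproduced with a short Python sketch. The `Atom` record, the `register_fragment` flag, and the `pick` helper are illustrative names rather than tileiras internals, and the n/a TMEM entries are assumed to contribute zero, as the sums above imply.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    name: str
    smem_transactions: int
    tmem_cycles: int          # 0 where the table lists n/a (assumption)
    live_registers: int
    register_fragment: bool   # True if the atom materialises a register fragment

# Weight vector from the worked example, ordered (SMEM, TMEM, registers).
WEIGHTS = (1.0, 4.0, 0.25)

ATOMS = [
    Atom("LDSM_M88", 4, 0, 32, True),
    Atom("LDSM_M816", 2, 0, 64, True),
    Atom("CP_ASYNC_BULK_TENSOR", 1, 0, 0, False),
]

def score(atom, weights=WEIGHTS):
    """Layer 3: weighted sum of the three additive cost terms."""
    w_smem, w_tmem, w_reg = weights
    return (w_smem * atom.smem_transactions
            + w_tmem * atom.tmem_cycles
            + w_reg * atom.live_registers)

def pick(atoms, consumer_needs_fragment=False):
    """Stand-in for the layer 2 filter followed by layer 3 scoring."""
    legal = [a for a in atoms
             if a.register_fragment or not consumer_needs_fragment]
    # Lowest total first; ties break on register pressure, then SMEM count.
    return min(legal, key=lambda a: (score(a), a.live_registers, a.smem_transactions))
```

Here `pick(ATOMS)` selects the bulk-tensor atom, while `pick(ATOMS, consumer_needs_fragment=True)` models the register-bound consumer and selects LDSM_M88 on its lower total (12 versus 18).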
Input / Output Shape
Input — a function with tiled_load/tiled_store ops carrying no layout attribute, grouped by upstream pipelining:
%a = nv_tileas.tiled_load %src : memref<128x64xf16, #smem> -> tensor<128x64xf16>
%b = nv_tileas.convert_layout %a : tensor<128x64xf16> -> tensor<128x64xf16, #dot_a>
%r = nv_tileas.wgmma %b, %w, %acc : ...
Output — every load and store now carries a chosen layout, and conversions that the layout pass made redundant fold away in the next pass:
%a = nv_tileas.tiled_load %src
{nv_tileas.layout = <(128, 64), (64, 1), swizzle<3, 4, 3>>}
: memref<128x64xf16, #smem> -> tensor<128x64xf16, #dot_a>
%r = nv_tileas.wgmma %a, %w, %acc : ...
See Pipe / Mutex Value Layout for the downstream consumer of the assigned nv_tileas.layout attribute and Buffer Assignment and mbarriers for how the chosen memKind feeds buffer materialisation.
Candidate Records
Each operation contributes candidates in four conceptual buckets:
| Bucket | Meaning |
|---|---|
| A register | source or destination is register-backed for operand A |
| A memory | source or destination is memory-backed for operand A |
| B register | source or destination is register-backed for operand B |
| B memory | source or destination is memory-backed for operand B |
The assignment pass picks one compatible memory kind across the alias group. With no common kind available, it fails rather than guessing a conversion.
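The "one compatible memory kind or fail" rule amounts to a set intersection across the alias group. The sketch below is a hedged Python model of that rule only: the kind names and the preference order in `assign_mem_kind` are assumptions made for illustration, not values recovered from the binary.

```python
DIAG = " can not find common memKind among pipeline alias group\n"

def common_mem_kinds(group):
    """Intersect the memory kinds that each op in an alias group can accept."""
    kinds = None
    for op_kinds in group:
        kinds = set(op_kinds) if kinds is None else kinds & set(op_kinds)
    return kinds or set()

def assign_mem_kind(group, preference=("smem", "tmem", "register")):
    """Return (kind, None) on success, or (None, DIAG) when no kind is shared.

    The preference tuple is hypothetical; the real pass scores candidates
    with its cost model instead of a fixed order.
    """
    shared = common_mem_kinds(group)
    if not shared:
        return None, DIAG            # fail rather than guess a conversion
    for kind in preference:
        if kind in shared:
            return kind, None
    return sorted(shared)[0], None
```

A group such as `[{"smem", "register"}, {"smem"}]` resolves to `"smem"`, while `[{"smem"}, {"tmem"}]` has an empty intersection and yields the group-level diagnostic.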
Remove Layout Conversions
TileASRemoveLayoutConversions shrinks the nv_tileas.convert_layout population by alternating two directional propagators with a greedy cleanup driver. The two propagators read in opposite directions because layout demand flows one way through buffer-backed values and the other way through register-backed values, and neither single direction reaches a fixed point on its own.
Two-Way Propagation
The buffer-side propagator walks backwards from each convert_layout whose source is a buffer-typed value (SMEM or TMEM). For each such conversion, it asks the producer whether its result type can be rebuilt at the conversion's target layout; if yes, it re-types the producer, redirects every other use of the original result through a fresh view, and deletes the conversion. The buffer side is the natural direction for this rewrite because SMEM and TMEM allocations carry their layout in their result type, so retyping a producer's result is a local edit rather than a transitive rewrite.
The register-side propagator walks forwards from each convert_layout whose source is a register-typed value. For each such conversion, it visits every elementwise or layout-transparent consumer and asks whether the consumer can adopt the conversion's target layout instead of the source layout; if yes, it absorbs the layout into the consumer's result type and re-points downstream uses. Forward propagation continues until it meets either a layout-fixing consumer (a wgmma, tcgen05, or a tiled load/store with an assigned memKind) or an unfusable boundary, at which point the propagator leaves the conversion in place. Each propagator can fail without aborting the pass; the recorded failure flag only blocks the final success() return.
Propagate-Rewrite-Cleanup Cycle
The pass runs the propagators once, then runs three greedy cleanup sweeps with a single rewrite-layout-sensitive-ops sweep between the first and the second. The structure is propagate → cleanup → rewrite → cleanup → cleanup. The first cleanup folds the conversions the propagators have already identified as redundant. The rewrite sweep visits the layout-sensitive ops (scf.if, paired produce_one/consume_one, paired pragma ops) and commutes adjacent conversions or unifies layouts across arms. The two trailing cleanups converge the rewrite sweep's output: the second cleanup folds the conversions the rewrite sweep turned into identities, and the third catches the new commute opportunities that scf.if unification exposes by sinking a conversion past the merge point.
Convergence is bounded because every cycle either folds at least one conversion (strictly reducing the conversion population) or makes no rewrite at all (terminating). The 3-cleanup count is an empirical upper bound — the cleanup pattern set is closed under one cycle of scf.if unification, and the third cleanup is the safety margin that absorbs interactions between elementwise propagation and scf.if unification on the same value.
LogicalResult remove_layout_conversions(FuncOp func) {
  bool propagation_failed = false;
  propagation_failed |= failed(propagate_buffer_layouts(func));
  propagation_failed |= failed(propagate_register_layouts(func));
  apply_greedy_cleanup(func);
  rewrite_layout_sensitive_ops(func);
  apply_greedy_cleanup(func);
  apply_greedy_cleanup(func);
  return propagation_failed ? failure() : success();
}
LogicalResult rewrite_layout_sensitive_op(Operation *op, Rewriter *rw) {
  switch (op_kind(op)) {
  case OP_CONVERT_LAYOUT:       return fold_identity_or_commute(op, rw);
  case OP_PIPELINE_CONSUME_ONE: return propagate_through_consumer_region(op, rw);
  case OP_PRAGMA:               return rewrite_paired_pragma(op, rw);
  case OP_SCF_IF:               return unify_layouts_across_arms(op, rw);
  default:
    if (is_elementwise(op) || preserves_encoding(op)) {
      return propagate_operand_layout_to_result(op, rw);
    }
    return failure();
  }
}
Failure Modes
Semantic layout changes survive every cycle. A convert_layout whose source and destination disagree on memKind (register ↔ SMEM, or SMEM ↔ TMEM) never folds: the buffer- and register-side propagators both refuse to retype across memKinds. A conversion between two encodings within the same memKind survives when a memory-consistency op (a nv_tileas.fence, an async_wait, or a paired produce_one/consume_one whose pipeline depth is non-trivial) lies between the conversion and the producer or consumer the propagator would otherwise retype — the consistency op pins the value's layout at the boundary and the propagator backs off.
The pattern set produces no diagnostics on these failures; surviving conversions are valid IR, just not optimal. The pass returns failure() only when one of the directional propagators trips an internal invariant (a re-type produces an op the verifier rejects, for example), which surfaces through the standard MLIR pass-failure diagnostic rather than a custom emitter.
Input / Output Shape
Input — an SMEM producer followed by a layout conversion before a WGMMA operand:
%t = nv_tileas.tiled_load %src : memref<...> -> tensor<128x64xf16, #smem_blocked>
%c = nv_tileas.convert_layout %t : tensor<128x64xf16, #smem_blocked>
-> tensor<128x64xf16, #smem_swizzled>
%r = nv_tileas.wgmma %c, %w, %acc : ...
Output — the producer has been retyped to the conversion's target layout, and the conversion folds away:
%t = nv_tileas.tiled_load %src : memref<...> -> tensor<128x64xf16, #smem_swizzled>
%r = nv_tileas.wgmma %t, %w, %acc : ...
Remove Buffer Aliases
TileASRemoveBufferAliasPass collapses alias chains over SMEM and TMEM allocations into a canonical allocation plus, when the alias was renaming the layout, an explicit nv_tileas.copy or nv_tileas.view. Two alias shapes appear in practice.
Select-on-Condition Aliases
The first shape is tile.select %c, %a, %b (the dialect's variant of arith.select) on a 1-bit condition where both operands are SMEM- or TMEM-typed buffers. Both branches refer to the same underlying allocation through different SSA values, typically because a double-buffered pipeline names its two slots and a control-flow path selects between them. When both operands trace back to the same alloc_tensor, the select collapses.
// Before
%a = nv_tileas.alloc_tensor : tensor<128x64xf16, #smem>
%b = nv_tileas.view %a {offset = 8192} : tensor<128x64xf16, #smem>
%buf = tile.select %flag, %a, %b : tensor<128x64xf16, #smem>
// After — both arms share the canonical allocation %a; the select is gone.
%a = nv_tileas.alloc_tensor : tensor<128x64xf16, #smem>
%buf = nv_tileas.view %a {offset = tile.select(%flag, 0, 8192)} : tensor<128x64xf16, #smem>
Loop-Carried Aliases
The second shape is scf.for whose iter-arg is initialised from a buffer SSA value and whose yield in the loop body produces the same underlying allocation; the buffer is threaded through the loop body for legibility but adds no temporal storage. When the iter-arg and yield trace back to the same allocation, the iter-arg drops out and consumers inside the body refer directly to the canonical allocation.
// Before — %buf is loop-carried but every iteration yields the same allocation.
%a = nv_tileas.alloc_tensor : tensor<128x64xf16, #smem>
%r = scf.for %i = %c0 to %n step %c1 iter_args(%buf = %a) -> tensor<128x64xf16, #smem> {
%x = use %buf
scf.yield %a : tensor<128x64xf16, #smem>
}
// After — %a is referenced directly inside the body; the iter-arg is gone.
%a = nv_tileas.alloc_tensor : tensor<128x64xf16, #smem>
scf.for %i = %c0 to %n step %c1 {
%x = use %a
}
Canonical-Allocation Tracer
The driver walks the function looking for these shapes. For each, it traces back through view, select, and the loop-carried path to the nv_tileas.alloc_tensor that produced storage; this is the canonical allocation. If the alias preserved the layout, the pass replaces the alias with a view of the canonical allocation; if the alias changed layout (the rare case where a select chose between buffers laid out differently), the pass inserts a copy first so the consumer's view sees the expected layout.
AllocTensorOp find_last_written_alloc(Value v) {
  while (Operation *def = v.getDefiningOp()) {
    if (auto alloc = dyn_cast<AllocTensorOp>(def)) return alloc;
    if (auto view = dyn_cast<ViewOp>(def)) { v = view.source(); continue; }
    if (auto copy = dyn_cast<CopyOp>(def)) { v = copy.destination(); continue; }
    if (auto sel = dyn_cast<SelectOp>(def)) {
      AllocTensorOp lhs = find_last_written_alloc(sel.true_value());
      AllocTensorOp rhs = find_last_written_alloc(sel.false_value());
      if (lhs == rhs && lhs) return lhs;
      return nullptr;  // arms disagree; not an alias
    }
    emitDiag("Cannot find last written SSA.");
    return nullptr;
  }
  // v is a block argument: walk back through region predecessors
  return trace_through_region_predecessors(v);
}
LogicalResult rewrite_buffer_select(SelectOp select, Rewriter *rw) {
  if (!is_smem_or_tmem(select.result().get_type())) return failure();
  if (!select.condition().get_type().is_i1()) return failure();
  AllocTensorOp true_alloc = find_last_written_alloc(select.true_value());
  AllocTensorOp false_alloc = find_last_written_alloc(select.false_value());
  if (!true_alloc || !false_alloc) return failure();
  AllocTensorOp canonical = choose_canonical_alloc(true_alloc, false_alloc);
  if (layouts_differ(canonical, select.result())) {
    rw->create("nv_tileas.copy", select.result(), canonical);
  } else {
    rw->create("nv_tileas.view", canonical, select.result().get_type());
  }
  rw->replace_op(select, canonical);
  return success();
}
Convergence Bound
The pass iterates the rewrite until the function reaches a fixed point. Convergence is bounded by N, the depth of the deepest alias chain in the function — each iteration strictly reduces that depth, since every rewrite eliminates one alias hop on the path from a use to the canonical allocation. A program whose deepest alias chain is select(view(view(alloc, ...), ...), ...) converges in three iterations.
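The depth argument can be checked on a toy model. In the Python sketch below an alias chain is a nested tuple whose first element is the op name; this deliberately simplifies `select` to a single traced arm and is only meant to illustrate the bound, not the real rewrite.

```python
def depth(chain):
    """Number of alias hops between a use and its canonical alloc."""
    return 0 if chain[0] == "alloc" else 1 + depth(chain[1])

def rewrite_once(chain):
    """Eliminate the outermost alias hop; an alloc is already canonical."""
    return chain if chain[0] == "alloc" else chain[1]

def iterations_to_fixed_point(chain):
    """Count rewrite iterations until only the canonical alloc remains."""
    iterations = 0
    while chain[0] != "alloc":
        chain = rewrite_once(chain)
        iterations += 1
    return iterations

# select(view(view(alloc))) from the text, modelled as nested tuples.
CHAIN = ("select", ("view", ("view", ("alloc",))))
```

Each call to `rewrite_once` lowers `depth` by exactly one, so the iteration count equals the initial depth, which is three for `CHAIN`.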
Failure Modes
The tracer fails when its walk reaches a defining op that is neither a pure tile-structure op (view, copy, select) nor an alloc_tensor. Typical culprits are an affine.apply that synthesises a buffer pointer, a call returning an SMEM buffer from another function, or a select whose two arms trace to different allocations. The failure emits the verbatim diagnostic "Cannot find last written SSA." and the alias stays in the IR. Downstream passes that identify each tensor allocation by SSA value (notably the buffer-assignment pass in the scheduler family) will then see the alias and refuse to compute a barrier layout for it.
Remove Dead Region Arguments
TileASRemoveDeadArgs is the hygiene pass that follows layout assignment. Once the layout passes have rebuilt op signatures around the chosen memKinds, some block arguments and the matching region init operands fall out of use — most often because a convert_layout that was producing one of the loop-carried values has been folded into an equivalent in-place use. The pass walks every op that implements RegionBranchOpInterface — scf.for, scf.while, scf.if, and the nv_tileas.async.pipeline.* region ops — and drops each block-argument-plus-incoming-operand pair where the block argument has no use inside the region.
The two sides must move together: deleting a block argument without deleting the corresponding incoming operand leaves the region-branch interface in an inconsistent state and trips the next verifier the IR meets. The pass therefore reads the incoming operand index from the interface before the erase, then erases both in one transactional step. Block arguments that still have uses, even uses that only feed the region terminator, are preserved — this pass eliminates only the strictly dead ones.
void remove_dead_region_args(RegionBranchOpInterface op) {
  for (Region &region : op.regions()) {
    SmallVector<unsigned> dead_indices;
    for (BlockArgument arg : region.entry_block().arguments()) {
      if (arg.use_empty()) dead_indices.push_back(arg.index());
    }
    for (unsigned idx : llvm::reverse(dead_indices)) {
      unsigned incoming = op.incoming_operand_index(region, idx);
      region.entry_block().erase_argument(idx);
      op.erase_incoming_operand(incoming);
    }
  }
}
Input / Output Shape
Input — a scf.for whose %pre_acc iter-arg has been left unreferenced because a downstream pass folded its single use into an in-place update on %acc:
%r:2 = scf.for %i = %c0 to %n step %c1
iter_args(%acc = %init_acc, %pre_acc = %init_pre)
-> (tensor<128x128xf32>, tensor<128x128xf32>) {
%x = nv_tileas.wgmma %a, %b, %acc : tensor<128x128xf32>
scf.yield %x, %pre_acc : tensor<128x128xf32>, tensor<128x128xf32>
}
%out = use %r#0
Output — %pre_acc and its %init_pre incoming operand are gone, the loop's result arity drops to one, and the yield carries only the live value:
%r = scf.for %i = %c0 to %n step %c1
iter_args(%acc = %init_acc) -> tensor<128x128xf32> {
%x = nv_tileas.wgmma %a, %b, %acc : tensor<128x128xf32>
scf.yield %x : tensor<128x128xf32>
}
%out = use %r
Iterating in reverse index order matters: erasing argument index i shifts every higher index down by one, and recording the indices ascending then erasing descending keeps the indices valid throughout the inner loop. The RegionBranchOpInterface query for the matching incoming-operand index is asked before the erase, while the indexing is still consistent.
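The index-shifting hazard is easy to demonstrate with plain lists. This Python sketch assumes, for simplicity, that the incoming operand sits at the same index as its block argument; the real pass asks the interface for that mapping. Both function names are illustrative.

```python
def erase_dead(args, operands, dead_indices):
    """Erase block arguments and their matching operands, highest index first."""
    args, operands = list(args), list(operands)
    for idx in sorted(dead_indices, reverse=True):
        del args[idx]        # indices below idx are unaffected by this erase
        del operands[idx]    # both sides move together
    return args, operands

def erase_ascending_buggy(args, dead_indices):
    """Counter-example: ascending erase reuses stale indices after a shift."""
    args = list(args)
    for idx in sorted(dead_indices):
        if idx < len(args):
            del args[idx]    # later indices now point one slot too far
    return args
```

With arguments `a, b, c, d` and dead indices 1 and 2, the descending erase keeps `a` and `d`, while the ascending variant wrongly keeps `a` and `c`.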
Resolve Agent Boundaries
TileASResolveAgentBoundary runs in this family's ordering window — after layout assignment and buffer canonicalization, before slicing — but its contract and rewriter belong to the CTA/cluster family and are documented under CTA Cluster Family — D20 aux passes. The only invariant the rest of the layout-and-buffer family relies on is the handoff shape: every value crossing an nv_tileas.async.pipeline.agent_switch either remains a direct SSA value (when the destination agent can consume it in place) or has been materialised through a shared-memory alloc_tensor / copy / convert_layout chain that delivers it in the destination agent's expected layout. Named-barrier emission stays deferred to Buffer Assignment and mbarriers — Phase 2 — Assign Named Barriers.
Slicing
TileASSlicingPass splits loops carrying a sliceCount attribute into independent per-slice loop regions, exposing parallelism the scheduler can later interleave across warps or async pipeline stages. The pass walks the function looking for scf.for (and, on warp-specialized programs, the matching pipeline region ops) that carry a positive sliceCount IntegerAttr. For each match, it builds a slice plan: divide the iteration space by the slice count, propagate the divided extent through every tiled operand inside the body, and materialize one cloned region per slice with a fresh induction range and rewritten insert_slice ops.
LogicalResult slice_loop(ScfForOp loop, IntegerAttr count_attr, Rewriter *rw) {
  if (!count_attr) return loop.emitOpError() << "The `sliceCount` need to be a `IntegerAttr`";
  if (!has_supported_blocked_layout(loop)) return failure();
  SlicePlan plan = build_slice_plan(loop, count_attr.getInt());
  if (!plan.valid()) return failure();   // diagnostics already attached
  for (uint32_t s = 0; s < plan.count; ++s) {
    ScfForOp slice = clone_loop_for_slice(loop, s, plan, rw);
    rewrite_slice_operands(slice, s, plan, rw);
  }
  rw->erase_op(loop);
  return success();
}
Diagnostics
The slicing transform attaches six verbatim diagnostics to the loop op it is rewriting. Each fires from a different stage of the pass.
"The `sliceCount` need to be a `IntegerAttr`" — fires when the sliceCount attribute on the candidate loop is present but is not an IntegerAttr. Valid input is scf.for ... attributes {sliceCount = 4 : i32}; the diagnostic triggers if sliceCount is, for example, a StringAttr carrying a stringified count, an ArrayAttr of per-stage counts (an old-style encoding the parser still accepts), or an IntegerAttr whose underlying value does not fit the loop's iteration space.
"unsupported op in Slicing pass" — fires while the plan-builder walks the loop body and meets an op it cannot clone per-slice. Valid input contains only loads, stores, copies, math, control flow, and the pipeline produce_one/consume_one pair. The diagnostic triggers on ops the rewriter has no clone strategy for — typically a custom dialect op the pipeline was never extended to handle, or a func.call to an unknown callee.
"unsupported op to be a lower bound in slicing pass " — fires while the plan-builder traces the loop's lower bound. Valid input is a lower bound of the form affine.apply over the induction variable of an enclosing loop, or a constant. The diagnostic triggers when the lower bound resolves to an arbitrary SSA value (an arith.muli whose operand history the pass cannot decode, a call result, or a block argument the pass cannot trace through). The trailing space is part of the constant.
"fail to get an initial forOperand in slicing pass" — fires when the plan-builder needs the initial iter_arg value to clone into each slice's prologue and the value's defining op either escapes the function (a func.return reaches the value first) or is itself loop-carried from an outer region the pass does not traverse.
"is not expected inside sliced part in SlicingPass" — fires from the rewrite phase, not the plan-builder. The plan-builder records the set of ops the rewriter expects to clone; if the rewriter walks a cloned slice and finds an op outside that set, the IR has been mutated unexpectedly between plan and rewrite (usually because an earlier match-and-rewrite that the pass tolerated has reshaped the body). The pass refuses to continue.
"unsupported atom of copyOp in slicing pass" — fires when the rewriter visits a copy op whose CopyAtomAttrInterface does not resolve to a concrete CopyAtom. This is almost always a sequencing fault: layout assignment did not finish on the op (no nv_tileas.layout attribute was written), so the copy's atom is still abstract and slicing cannot pick the right per-slice atom variant.
Input / Output Shape
Input — a single loop with sliceCount = 2, carrying one tiled_load and one wgmma over the full iteration space:
scf.for %i = %c0 to %c64 step %c1 iter_args(%acc = %init) -> tensor<...>
attributes {sliceCount = 2 : i32} {
%a = nv_tileas.tiled_load %src[%i] : ... -> tensor<...>
%x = nv_tileas.wgmma %a, %b, %acc : tensor<...>
scf.yield %x : tensor<...>
}
Output — two cloned loops, each over half the iteration space, with the operand-side tiled_load repointed at the corresponding half of the source:
scf.for %i = %c0 to %c32 step %c1 iter_args(%acc0 = %init) -> tensor<...> {
%a0 = nv_tileas.tiled_load %src[%i] : ... -> tensor<...>
%x0 = nv_tileas.wgmma %a0, %b, %acc0 : tensor<...>
scf.yield %x0 : tensor<...>
}
scf.for %j = %c32 to %c64 step %c1 iter_args(%acc1 = %init) -> tensor<...> {
%a1 = nv_tileas.tiled_load %src[%j] : ... -> tensor<...>
%x1 = nv_tileas.wgmma %a1, %b, %acc1 : tensor<...>
scf.yield %x1 : tensor<...>
}
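The induction ranges in this example follow from dividing the iteration space by the slice count. A minimal Python sketch of that division is shown here; `slice_bounds` is a hypothetical helper, and rejecting an extent that does not divide evenly is an assumption, since the text does not say how a remainder is handled.

```python
def slice_bounds(lower, upper, slice_count):
    """Split [lower, upper) into slice_count equal half-open ranges."""
    if slice_count <= 0:
        raise ValueError("sliceCount must be positive")
    extent = upper - lower
    if extent % slice_count != 0:
        # Assumption: uneven splits are rejected; remainder handling is unspecified.
        raise ValueError("iteration space does not divide evenly")
    per_slice = extent // slice_count
    return [(lower + s * per_slice, lower + (s + 1) * per_slice)
            for s in range(slice_count)]
```

For the loop above, `slice_bounds(0, 64, 2)` gives the two ranges `(0, 32)` and `(32, 64)` used by the cloned loops.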
Layout Descriptor Grammar
nv_tileas.layout is serialised as a literal whose parser accepts a shape tuple, a parallel stride tuple, an optional swizzle clause, and an optional named-element-type clause. The shape and stride tuples can nest — nested groups give the parser everything it needs to reconstruct a CuTe-style hierarchical layout — and the swizzle clause is the bit-mask triple <B, M, S> that the descriptor packer later threads into shared-memory descriptors. The named-element-type clause overrides the element type inferred from the operand for paths where the descriptor's internal element type differs from the tensor's element type (the NVFP4 and microscaled paths are the visible callers).
layout-desc := "<" shape "," stride swizzle-opt elem-opt ">"
shape := tuple
stride := tuple
tuple := integer | "(" tuple-item ("," tuple-item)* ")"
tuple-item := tuple | integer
swizzle-opt := ("," "swizzle" "<" integer "," integer "," integer ">")?
elem-opt := ("," "elem" "=" elem-name)?
elem-name := ident -- e.g. "nvfp4", "mxf4", "bf16"
integer := decimal-uint
The swizzle triple's three integers are the descriptor packer's (B, M, S) parameters — base-2 log of the swizzle period, the mode width, and the swizzle shift respectively — and the closed accepted set of triples matches the swizzle predicate documented under Mode Pattern Verifiers — UMMA Canonical Layout Verifier. When the elem clause is absent the layout inherits its element type from the value carrying it; when present the named-element-type is looked up against the dialect's element-type registry, with unknown names rejected by the parser before any other validation runs.
Examples
| Descriptor | Reading |
|---|---|
| <(1,1),(0,0)> | identity 1×1 tile; both strides zero, the degenerate base case |
| <(16,16),(1,16)> | 16×16 column-major tile, inner stride 1, outer stride 16 |
| <(16,16),(1,16),swizzle<2,5,2>> | 16×16 column-major tile with 128-byte swizzle (B=2, M=5, S=2) |
| <(16,(8,2)),(1,(16,8))> | hierarchical 2-D layout: 16 outer, inner split as 8 sub-tiles of 2 |
| <(128,64),(64,1),swizzle<3,4,3>,elem=nvfp4> | 128×64 row-major tile, 128-byte swizzle, descriptor reads NVFP4 elements |
| <((4,32),64),((1,512),16),elem=mxf4> | hierarchical layout with named-element override for MXF4 microscaled path |
The first three forms cover the bulk of WGMMA and tcgen05 operand paths. Hierarchical forms appear when a tile is partitioned across warps or warp groups before reaching the descriptor packer — the outer group is the warp partition, the inner group is the per-warp slab. The elem= clause appears only on paths where the tensor element type and the descriptor's internal element type differ; NVFP4 and MXF4 are the production callers because the value-carrying tensor is f8e4m3 or bf16 but the descriptor's packed payload is sub-byte.
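The grammar is small enough for a hand-written recursive-descent parser. The Python sketch below accepts the descriptor forms listed in the table and returns nested tuples; it is an illustration under stated assumptions, not the tileiras parser. It does not check that shape and stride nest identically, that the swizzle triple belongs to the accepted set, or that the element name is registered. `LayoutParser` and `parse_layout` are hypothetical names.

```python
import re

_TOKEN = re.compile(r"\s*(\d+|[A-Za-z_][A-Za-z0-9_]*|[<>(),=])")

def tokenize(text):
    """Split a descriptor literal into integers, identifiers, and punctuation."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class LayoutParser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self, ahead=0):
        j = self.i + ahead
        return self.toks[j] if j < len(self.toks) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ValueError(f"expected {tok!r}, got {self.peek()!r}")
        self.i += 1

    def integer(self):
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise ValueError(f"expected integer, got {tok!r}")
        self.i += 1
        return int(tok)

    def tuple_(self):
        # tuple := integer | "(" tuple-item ("," tuple-item)* ")"
        if self.peek() != "(":
            return self.integer()
        self.expect("(")
        items = [self.tuple_()]
        while self.peek() == ",":
            self.expect(",")
            items.append(self.tuple_())
        self.expect(")")
        return tuple(items)

    def layout(self):
        # layout-desc := "<" shape "," stride swizzle-opt elem-opt ">"
        self.expect("<")
        shape = self.tuple_()
        self.expect(",")
        stride = self.tuple_()
        swizzle = elem = None
        if self.peek() == "," and self.peek(1) == "swizzle":
            self.i += 2
            self.expect("<")
            b = self.integer()
            self.expect(",")
            m = self.integer()
            self.expect(",")
            s = self.integer()
            self.expect(">")
            swizzle = (b, m, s)
        if self.peek() == ",":
            self.expect(",")
            self.expect("elem")
            self.expect("=")
            elem = self.peek()
            if elem is None or not re.fullmatch(r"[A-Za-z_]\w*", elem):
                raise ValueError(f"expected element name, got {elem!r}")
            self.i += 1
        self.expect(">")
        if self.peek() is not None:
            raise ValueError("trailing input after descriptor")
        return {"shape": shape, "stride": stride, "swizzle": swizzle, "elem": elem}

def parse_layout(text):
    return LayoutParser(text).layout()
```

Nested groups come back as nested tuples, so the hierarchical shape `((4,32),64)` keeps its grouping for later processing.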
TileAS TMA and Memops Family
Abstract
The TMA and memops family owns Tensor Memory Accelerator lowering, token-ordered tiled memory ops, TMA descriptor ABI construction, host-side descriptor separation, and Blackwell tensor-memory copy legalization. The passes share descriptor indices, host/device TMA counts, kernel argument updates, and the host-code module that prepares CUDA tensor maps at launch time.
The core contract splits along the host/device line: device IR uses TileAS memory operations and TMA descriptor handles; the host side may pre-encode tensor maps and pass descriptor pointers as hidden grid-constant kernel arguments. Later NVVM lowering consumes those descriptors through cp.async.bulk.tensor.*, tcgen05, and related tensor-map operations.
Pass Roster
| Pass or family | Purpose |
|---|---|
| memops verifiers | validate tiled_load, tiled_store, and tiled_atomic_rmw shape and attributes |
| LowerTMALoadStoreToAsync | rewrites eligible tiled memory ops into async TMA operations |
| SeparateHostTMA | hoists descriptor creation into host code and attaches object bytes to the module |
| AttachTMADescriptorArgs | extends kernel ABI with descriptor arguments and descriptor-count attributes |
| TileASLegalizeTmemCopy | rewrites TMEM-crossing copies into layouts legal for tcgen05 lowering |
| TMA descriptor builders/verifiers | build and validate make_tiled_tma_desc before lowering |
| tensormap mutators | update device-side tensor-map fields when descriptors are device-born |
The intended order is:
AssignLoadStoreLayouts
LowerTMALoadStoreToAsync
SeparateHostTMA
AttachTMADescriptorArgs
TileASLegalizeTmemCopy
TileAS TMA Operations
The TMA operation family covers async tiled load/store, async tiled reduction and atomic-like variants, gather/scatter TMA ops, the descriptor producer, and an opaque metadata type binding the TileAS descriptor to its CuTe layout and host/device index.
| Operation concept | Role |
|---|---|
| async tiled TMA load | copies tensor tiles from global tensor memory into shared or tensor memory |
| async tiled TMA store | copies tensor tiles back to global tensor memory |
| async tiled atomic/reduction | emits TMA reduction-style traffic when the atom supports it |
| gather/scatter TMA | handles non-contiguous tensor access patterns |
| make tiled TMA descriptor | captures tensor shape, strides, layout, and descriptor storage |
| tiled TMA metadata | links descriptor uses to host/device descriptor accounting |
LowerTMALoadStoreToAsync
LowerTMALoadStoreToAsync converts synchronous tiled load/store ops carrying a TMA copy-atom into the four-op async sequence the downstream NVVM lowering expects: descriptor bind, async bulk-tensor op, mbarrier wait, fence. The CLI mnemonic is lower-tma-load-store-to-async and the description string registered with the pass infrastructure reads "lowering TiledLoad or TiledStore which with tma atom to async tiled load or tiled store". The pass runs once over each function, walking eight phases in fixed order.
The eight phases are:
1. KernelSpec gate. Read the function-scoped KernelSpecAttr (the same attribute that anchors kernel identity through the rest of TileAS). Without it the function cannot host TMA descriptors at all; the pass exits with "LowerTMALoadStoreToAsync: missing or invalid KernelSpecAttr on function".
2. TMA-eligibility scan. Iterate every nv_tileas.tiled_load, nv_tileas.tiled_store, and nv_tileas.tiled_atomic_rmw op in the function. An op is eligible when it carries allow_tma = true (the per-op hint inherited from the public dialect's cuda_tile.allow_tma) or when the environment switch TILEIR_PREFER_TMA_FOR_LOAD_STORE is set. The atom referenced by the op must also implement TmaAtomTypeInterface; atoms that don't (plain ldg, stg, ldgsts) are skipped without rewrite.
3. tmaIdx assignment. Assign a monotonically increasing tmaIdx IntegerAttr to each surviving op. The counter is per-function, starting at zero, and the assignment walk is a single pre-order recursion so descriptor uses receive indices in the order the function would emit them.
4. Descriptor bind. For each op, emit (or look up) an nv_tileas.make_tiled_tma_desc whose result feeds the async op. The bind captures tensor shape, stride, padding mode, descriptor mode (tiled / im2col / im2col_at / tiled_at / gather4 for loads, store / reduce / scatter4 for stores), the element type, and the tma_internal_type attribute when the descriptor's internal element type differs from the tensor's element type.
5. Async op materialization. Replace the synchronous tiled op with its nv_tileas.async.tiled_tma_load / tiled_tma_store / tiled_atomic_rmw (TMAREDG atom) / gather_tma_load / scatter_tma_store counterpart. The new op carries the same coordinates and tile, plus the descriptor handle, a fresh mbarrier SSA value the load variants will wait on, and a tx_count IntegerAttr giving the per-atom byte transfer count. Load variants additionally enforce zero padding: non-zero padding fires "TmaLoad only support zero padding now", and the gather equivalent fires "GatherTmaLoad only support zero padding now".
6. mbarrier emission. Each async load reserves an mbarrier in the function's SMEM arena and emits the arrive/wait skeleton. The arrival side is cutlass.pipeline.get_producer_mask plus the bind from phase 4; the wait side is cutlass.pipeline.create and an async.wait token. Store variants don't reserve their own mbarrier: the PipelineWaitGroupEmitter aggregates TMA stores with co-located GMMA stages, with the gate documented under Async and Pipeline Family — Pipeline to NVVM. The mbarrier state machine itself is documented in mbarrier State Machine.
7. Wait sinking. When TILEIR_DELAY_TMA_STORE_WAIT is set, the matching async.wait for store variants may sink past the next barrier, letting the next stage's compute overlap the store's final commit. The pass records the option on the produced op so the wait-group emitter respects it.
8. Diagnostic finalization. Any op left unresolved by phases 4-6, typically because its atom couldn't be located in the function's atom table, fires "failed to find smem buffer address for mbarrier", "failed to get expected tx-count", "failed to init mbarrier", or "failed to get MBarrier object" depending on which sub-step lost track of its operand.
LogicalResult lower_tma_load_store(FuncOp func, TmaOptions options) {
  KernelSpec spec = read_kernel_spec(func);
  if (!spec.valid()) {
    return func.emitOpError() << "LowerTMALoadStoreToAsync: missing or invalid KernelSpecAttr on function";
  }
  uint32_t next_tma_index = 0;
  for (MemoryOp op : func.tiled_memops()) {
    if (!op.allow_tma() && !options.prefer_tma) continue;
    if (!op.copy_atom().implements<TmaAtomTypeInterface>()) continue;
    TmaDescriptor desc = bind_descriptor(op, next_tma_index++);
    MbarrierOp mb = (op.kind() == LOAD) ? reserve_mbarrier(func, op) : nullptr;
    AsyncTmaOp async = rewrite_to_async_tma(op, desc, mb);
    emit_wait_skeleton(async, options.delay_store_wait);
    replace_op(op, async);
  }
  return success();
}
The input IR shape is a nv_tileas.tiled_load / tiled_store carrying a TMA copy-atom; the output is the four-op sequence — make_tiled_tma_desc (or a reuse of an existing one), the async tiled_tma_*, an mbarrier wait for load variants, and the matching fence inserted by the wait-group emitter. The tmaIdx attribute stamped on each async op is read later by SeparateHostTMA and AttachTMADescriptorArgs to wire host descriptor preparation back to device descriptor consumption.
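The pre-order numbering that ties device consumption back to host preparation can be illustrated on a toy operation tree. This is a minimal sketch, not the pass's code: ToyOp, its fields, and assign_tma_indices are hypothetical names, and regions are flattened into a plain child list.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an IR operation (not an MLIR type).
struct ToyOp {
    std::string name;
    bool uses_tma = false;        // survived the TMA eligibility filter
    int64_t tma_idx = -1;         // -1 means no tmaIdx has been stamped
    std::vector<ToyOp> children;  // nested ops in program order
};

// Pre-order walk: stamp the op itself first, then descend into its children.
void assign_tma_indices(ToyOp& op, uint32_t& next_index) {
    if (op.uses_tma)
        op.tma_idx = next_index++;
    for (ToyOp& child : op.children)
        assign_tma_indices(child, next_index);
}

// The counter is per-function and starts at zero.
// Returns how many ops received an index.
uint32_t assign_tma_indices(ToyOp& func) {
    assert(func.tma_idx == -1 && "the function op itself carries no tmaIdx");
    uint32_t next_index = 0;
    assign_tma_indices(func, next_index);
    return next_index;
}
```

An op nested inside a loop body is numbered between the ops that precede and follow the loop, which is the emission order the later passes rely on.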
Token-Ordered Memops
tiled_load, tiled_store, and tiled_atomic_rmw are the three token-ordered memory ops the TileAS layer exposes. They consume and produce nv_tileaa.mem_token SSA values so that ordering between overlapping asynchronous transfers is expressed at the IR level rather than through fences or barriers, and they carry a memory-consistency enum (weak, relaxed, acquire, release, acq_rel) plus an optional mem_scope (cta, cluster, gpu, sys) for the full ordering contract. The three op verifiers share an almost line-for-line skeleton with three small specialisations: load produces a tile plus an out-token, store produces an out-token only, and atomic_rmw produces both a pre-image tile and an out-token. The diagnostics each emits are grouped below by op family; every string is part of the verifier's user-visible contract and reproduced verbatim.
Structural diagnostics — all three ops
These fire before any semantic check. They guard the op's structural shape: regions, successors, the operandSegmentSizes attribute that partitions the variadic operand list, and the per-segment type constraints.
- "requires zero regions": any token-ordered memop has zero regions; the dispatcher rejects regions before reading any operand.
- "requires 0 successors but found ": same shape, no successors permitted.
- "operand group starting at #" and " requires 0 or 1 element, but found ": paired diagnostic when the optional token operand segment carries more than one value.
- "result group starting at #": counterpart on the result side when the optional token result holds more than one value.
The operandSegmentSizes attribute is parsed against the dialect-interned key string "operandSegmentSizes" (19 characters). The four operand segments are view, coords, offsets, and token in that order; the token segment accepts zero or one element, and segment shape mismatches fall back to the standard MLIR operand group diagnostic above.
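The token-segment rule can be sketched as a check over the four segment sizes. The function and enum names below are hypothetical; the message is assembled from the two partial strings quoted above, with the group's starting operand index taken as the sum of the three preceding segments (the usual MLIR convention, assumed here).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Segment order quoted in the text: view, coords, offsets, token.
enum Segment { kView = 0, kCoords = 1, kOffsets = 2, kToken = 3 };

// Returns the diagnostic for an over-full token segment, or nullopt if legal.
std::optional<std::string> check_token_segment(const std::array<int32_t, 4>& sizes) {
    int32_t start = 0;  // operand index at which the token group begins
    for (int i = 0; i < kToken; ++i) {
        assert(sizes[i] >= 0 && "segment sizes are non-negative");
        start += sizes[i];
    }
    if (sizes[kToken] > 1)
        return "operand group starting at #" + std::to_string(start) +
               " requires 0 or 1 element, but found " + std::to_string(sizes[kToken]);
    return std::nullopt;
}
```

With one view, three coordinates, one offset, and two tokens, the token group starts at operand 5.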
Coordinate and shape diagnostics — all three ops
Coordinate-count, coordinate-type, and tile-vs-tensor consistency are checked by an identical 1176-byte verifier instantiated once per op. The numeric segment index differs (load reaches into segment 2 for the memref operand, store into segment 1, atomic_rmw into segment 2) but the message set is the same.
- "expects <N> coordinates, but got <M>": the literal partial string in the binary is " coordinates, but got "; the count before it is the expected coordinate count, derived from the view's rank plus an optional +1 when the view carries a TMA descriptor that requires a leading offset.
- "expects CoordType is same as memref index type, but got ": every coordinate must match the memref's index type after masking off the low-3-bit type-uniquer tag.
- "requires the same size for tileSize and tensor": emitted by the coord-type check when the product of the tile-size dims disagrees with the view's tensor shape.
- "requires the same shape for tileSize and tensor value": the parallel diagnostic from the dedicated shape-equality helper.
- "view elementType not equal with tensor element type: ": the tile's element type must match the view's element type; the diagnostic is followed by two printType outputs separated by " != ".
Tile-dimension invariants — all three ops
A dialect-shared helper enforces three invariants on every tile dim. These do not belong to the TMA family per se (they apply across the dialect), but they fire on this op family more than any other.
- "all dimensions must be positive constants, got ": every tile-size dim must be a positive integer constant.
- "all dimensions must be powers of two, got ": every dim must additionally be a power of two.
- "tile would exceed the maximum of ": the product of tile dims must stay below 0x1000000 (16 M elements), the dialect-wide cap.
Memory semantics — all three ops, op-specific tables
Each op runs its own mem_semantic / mem_scope / in_bounds validator. The shared rules are the scope-vs-semantic compatibility:
- "mem_scope not supported when mem_semantic is weak" (snake_case, emitted by load and store)
- "mem_scope required when mem_semantic is not weak" (load and store)
- "memScope not supported when memSemantic is weak" (camelCase, emitted by atomic_rmw)
- "memScope required when memSemantic is not weak" (atomic_rmw)
Each op rejects the consistency modes that don't make sense for its access pattern. Load forbids acquire and acq_rel:
- "unsupported mem_semantic: acquire" (load)
- "unsupported mem_semantic: acq_rel" (load)
Store forbids release and acq_rel:
- "unsupported mem_semantic: release" (store)
- "unsupported mem_semantic: acq_rel" (store)
Across all three ops, the in_bounds DenseI1ArrayAttr length must equal the tile rank:
- "incorrect number of in_bounds elements: expected " followed by ", but found ".
Store-only diagnostics
Store cross-validates the optional padding_value operand against in_bounds:
- "inbounds must be true when paddingValue is not set"
- "inbounds must be false when paddingValue is set"
A separate float-typing helper guards the special padding values:
- "special padding values (nan, pos_inf, neg_inf, neg_zero) only for float-like element types"
atomic_rmw-only diagnostics
The atomic op is checked first for the presence of its rmw_mode attribute, then for its bit-width ban list:
- "requires attribute 'rmw_mode'": fires before any other attribute check when the rmw_mode IntegerAttr is missing entirely.
- "tiled_atomic_rmw not supported for 8-bit types": no SM target supports tile-granular 8-bit atomics.
- "tiled_atomic_rmw not supported for 16-bit integer": same hardware reality for 16-bit integer atomics.
- "tiled_atomic_rmw for 16-bit float only supports add, max, min operations": when the element is bf16 or f16 and the rmw_mode is outside the three-element set the hardware natively supports.
- "tiled_atomic_rmw op cannot use fadd operation, please use add instead for both int and float types": the dialect normalises floating-point adds to the same add opcode the integer path uses; the dispatcher decides int-vs-fp at lowering time.
- "tiled_atomic_rmw op cannot use xchg operation": xchg has no meaningful tile-granular implementation because the pre-image would only be valid for one lane.
Skeleton
LogicalResult verify_tiled_memop(TiledMemOp op) {
  verify_zero_regions(op);
  verify_zero_successors(op);
  verify_result_count(op); // 2 for load and atomic_rmw, 1 for store
  verify_operand_segments(op, "operandSegmentSizes", /*width=*/19);
  verify_segment_types_and_attributes(op); // also enforces rmw_mode presence for atomic_rmw
  verify_tile_size_matches_tensor_shape(op);
  verify_coord_count_and_type(op);
  verify_tile_element_type_matches_view(op);
  verify_tile_dimensions_positive_pow2_bounded(op);
  verify_mem_semantics_in_bounds_and_extras(op); // padding for store, bitwidth ban for atomic_rmw
  return success();
}
TMA-backed views add one extra expected coordinate to the count check above — typically the im2col leading offset on SM100 — by reading the descriptor's leading-dim count from the view type and adding it to the rank-derived baseline.
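The count rule reduces to a small piece of arithmetic. In this sketch ViewInfo and both functions are hypothetical; tma_leading_dims models the descriptor's leading-dim count (0 for plain views, 1 for the im2col case named above), and the "expects " prefix is taken from the diagnostic as quoted in the coordinate section.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical summary of a view type.
struct ViewInfo {
    std::size_t rank;              // rank-derived baseline
    std::size_t tma_leading_dims;  // extra leading coordinates demanded by the descriptor
};

std::size_t expected_coord_count(const ViewInfo& v) {
    assert(v.tma_leading_dims <= 1 && "the text describes at most one extra coordinate");
    return v.rank + v.tma_leading_dims;
}

std::optional<std::string> check_coord_count(const ViewInfo& v, std::size_t got) {
    std::size_t want = expected_coord_count(v);
    if (got == want) return std::nullopt;
    return "expects " + std::to_string(want) + " coordinates, but got " + std::to_string(got);
}
```

A rank-3 TMA-backed view with one leading offset therefore expects four coordinates, and supplying three triggers the diagnostic.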
Descriptor ABI
AttachTMADescriptorArgs flips the kernel ABI from "the device builds every descriptor" to "the host or runtime passes descriptor pointers to the kernel." It counts host-side and device-side descriptors, appends descriptor pointer arguments, marks them grid constants, hides existing arguments from the public ABI view, and writes descriptor-count attributes.
LogicalResult attach_tma_descriptor_args(FuncOp kernel) {
  TmaCounts counts = count_tma_descriptors(kernel);
  FunctionType old_type = kernel.get_function_type();
  SmallVector<Type> args = old_type.inputs();
  for (uint32_t i = 0; i < counts.device; ++i) {
    args.push_back(device_tma_descriptor_pointer_type(kernel.context()));
  }
  for (uint32_t i = 0; i < counts.host; ++i) {
    args.push_back(host_tma_descriptor_pointer_type(kernel.context()));
  }
  kernel.set_function_type(FunctionType::get(args, old_type.results()));
  mark_appended_descriptor_args_grid_constant(kernel, old_type.inputs().size());
  mark_existing_args_hidden(kernel, old_type.inputs().size());
  kernel.set_attr("nv_tileas.num-device-tmas", i32_attr(counts.device));
  kernel.set_attr("nv_tileas.num-host-tmas", i32_attr(counts.host));
  return success();
}
Descriptor-index verification confirms that every descriptor use holds a valid tmaIdx within the recorded host or device descriptor count.
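A minimal sketch of that index check follows. The type and function names are hypothetical, the returned strings are the diagnostics this page lists under the tmaIdx and descriptor-count rules, and treating a single missing count attribute as zero is an assumption rather than observed behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

enum class TmaSide { Host, Device };

struct DescriptorCounts {
    std::optional<uint32_t> num_host_tmas;    // nv_tileas.num-host-tmas, if present
    std::optional<uint32_t> num_device_tmas;  // nv_tileas.num-device-tmas, if present
};

std::optional<std::string> check_tma_idx(std::optional<uint32_t> tma_idx, TmaSide side,
                                         const DescriptorCounts& counts) {
    if (!counts.num_host_tmas && !counts.num_device_tmas)
        return "funcOp lack tmaDeviceNum and tmaHostNum attr";
    if (!tma_idx)
        return "not find tmaIdx.";
    // Assumption: one missing count attribute is read as zero.
    uint32_t limit = side == TmaSide::Host ? counts.num_host_tmas.value_or(0)
                                           : counts.num_device_tmas.value_or(0);
    if (*tma_idx >= limit)
        return side == TmaSide::Host ? "tmaIdx exceed tmaHostNum."
                                     : "tmaIdx exceed tmaDeviceNum.";
    return std::nullopt;
}
```

An index equal to the recorded count is already out of range, since valid indices run from zero to count minus one.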
Separate Host TMA
SeparateHostTMA hoists descriptor construction into a paired host module. The host module builds CUDA tensor maps, compiles to an in-memory object, and attaches that object as module data. Device code receives pointers or runtime callback hooks instead of constructing every descriptor inline.
The pass phases are:
- Find the enclosing kernel function.
- Read host and device TMA counts.
- Enforce the device-descriptor count limit.
- Read compute capability.
- Convert the device function signature for callback use.
- Reject unsupported math dialect operations in host descriptor code.
- Emit callback functions and descriptor globals.
- Lower host-side descriptor creation to LLVM.
- Emit pre-load callback plumbing.
- Compile the host module to object code.
- Attach the object bytes as host-code metadata.
LogicalResult separate_host_tma(ModuleOp module, FuncOp kernel) {
  TmaCounts counts = read_tma_counts(kernel);
  if (counts.empty()) {
    return success();
  }
  if (counts.device > MAX_DEVICE_TMA_DESCRIPTORS) {
    return kernel.emit_error("too many device TMA descriptors");
  }
  ModuleOp host = create_host_descriptor_module(kernel);
  emit_tileir_callback_globals(host, kernel, counts);
  lower_tma_descriptor_builders_to_host_calls(host);
  emit_on_preload_callback(host, kernel, counts);
  ObjectBytes object = compile_host_module_to_object(host);
  module.set_attr("nv_tileas.host-code", bytes_attr(object));
  return success();
}
Host separation rejects descriptor builders that depend on structured control flow. Any descriptor builder moved to the host must depend only on values the callback ABI can represent.
D15: AttachTMADescriptorArgs + SeparateHostTMA
D15 splits a tile kernel into a host module that builds and ships TMA descriptors and a device module that consumes them. The pass triple sits at sub_7BDF00, sub_7BDF10, and sub_7BDF20; the identity strings match the description "Attach TMA descriptor arguments and separate host TMA bookkeeping". The run body at sub_7BE450 spans roughly 2 487 bytes of machine code.
The body walks the function once looking for nv_tileas.make_tma_descriptor ops. For each match, it asks the counter callback at sub_7BE1D0 whether the descriptor is built outside the kernel boundary (host-side) or inside it (device-side), then bumps the matching tally. Once the walk finishes, two integer attributes stamp the function with the split, and each TMA-descriptor-typed kernel argument gets marked so NVPTX codegen places it in .param space rather than .global.
| Attribute | Type | Where | Meaning |
|---|---|---|---|
| nv_tileas.host-code | UnitAttr | inherent on function op | function is the host-emitter twin (vs device) |
| nv_tileas.num-device-tmas | i32 | inherent on function op | count of descriptors the device side consumes |
| nv_tileas.num-host-tmas | i32 | inherent on function op | count of descriptors the host side builds |
| cute_nvgpu.grid_constant | UnitAttr | argument attribute | TMA-descriptor-typed argument lives in .param |
The host-code options helper sub_7BF4B0 (1 472 bytes) reads the always-on --enable-extended-smem=true flag from the pass-option block and threads it onto the host module's CLI tail, so host-side compilation sees the same shared-memory configuration the device side was tuned for.
The two twin modules share a parent builtin.module. Layout offsets +56 and +16 then +56 on the parent op carry the host-twin and device-twin module references; both modules ship in the same bytecode artifact but compile separately downstream. The cute_nvgpu.grid_constant argument attribute is consumed later in the cute-to-llvm lowering at sub_1698C20, which lifts it to nvvm.grid_constant on the lowered function so ptxas places the descriptor in .param space.
LogicalResult attachTmaArgs(FunctionOpInterface fn) {
  MLIRContext *ctx = fn->getContext();
  Type i32 = IntegerType::get(ctx, 32);
  int host = 0, device = 0;
  // Single walk: classify each descriptor op as host-built or device-built.
  fn.walk([&](Operation *op) {
    if (op->getName().getStringRef() != "nv_tileas.make_tma_descriptor") return;
    bool isHost = isOutsideKernel(op);
    if (isHost) ++host;
    else ++device;
  });
  // Stamp both counts on the same function op the argument attributes land on.
  fn->setAttr("nv_tileas.num-host-tmas", IntegerAttr::get(i32, host));
  fn->setAttr("nv_tileas.num-device-tmas", IntegerAttr::get(i32, device));
  for (BlockArgument arg : fn.getArguments()) {
    if (isTmaDescriptorType(arg.getType())) {
      fn.setArgAttr(arg.getArgNumber(), "cute_nvgpu.grid_constant", UnitAttr::get(ctx));
    }
  }
  return success();
}
The walk-once-then-stamp shape matters for reimplementation. Counting and ABI rewriting can't split into separate passes without re-walking the function — the descriptor-count attributes must land on the same op the argument attributes do, and downstream consumers expect both sides of the split (the host-code module under nv_tileas.host-code and the device-side argument decorations) visible in a single IR view.
Callback ABI
The host-code path uses a small callback ABI that lets the runtime prepare TMA descriptors before each launch without changing the device-facing kernel signature. The host module emitted by SeparateHostTMA registers two callbacks with the __CUDA_TILEIR_CALLBACKS instrumentation hook: a one-shot SM-count / scratch-size query and a per-descriptor 64-byte payload emitter. Both are printf-style emitters that the runtime parses; their format strings are part of the binary-compatible contract and reproduced verbatim below.
| Callback | Format string | Calls per launch | Purpose |
|---|---|---|---|
| SM count / scratch size | "[TileIR Callback] SmNum: %ld deviceTMAMemorySize: 0x%lx" | 1 | tells the runtime how many SMs the kernel targets and how many bytes of per-SM descriptor scratch to allocate |
| Descriptor payload | "[TileIR Callback] DESC_TMA512: 0x%016lx %016lx %016lx %016lx" | N (= num-device-tmas) | dumps each descriptor's 64-byte payload as four i64 words, in the order matching the kernel's tmaIdx numbering |
The 64-byte payload (DESC_TMA512 — 512 bits) matches NVIDIA's published cp.async.bulk.tensor.Nd descriptor layout: global address, dim sizes, dim strides, format, swizzle, fill mode, element type, and rank, packed into four i64 words. The descriptors are emitted in tmaIdx order so the runtime can index them directly when patching descriptor pointers into the launch frame.
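Because the runtime parses these lines textually, a reimplementation has to reproduce the format strings byte for byte. The sketch below formats and re-parses them with the C standard library; the function names and the DescWords alias are hypothetical, and it assumes a platform where unsigned long is 64 bits wide, as the %lx conversions in the quoted strings imply.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

using DescWords = std::array<unsigned long, 4>;  // the four words one DESC_TMA512 line carries

std::string format_sm_line(long sm_num, unsigned long scratch_bytes) {
    char buf[128];
    int n = std::snprintf(buf, sizeof buf,
                          "[TileIR Callback] SmNum: %ld deviceTMAMemorySize: 0x%lx",
                          sm_num, scratch_bytes);
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    return std::string(buf, static_cast<std::size_t>(n));
}

std::string format_desc_line(const DescWords& w) {
    char buf[128];
    int n = std::snprintf(buf, sizeof buf,
                          "[TileIR Callback] DESC_TMA512: 0x%016lx %016lx %016lx %016lx",
                          w[0], w[1], w[2], w[3]);
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    return std::string(buf, static_cast<std::size_t>(n));
}

// Consumer side: recover the four words from one descriptor line.
bool parse_desc_line(const std::string& line, DescWords& out) {
    return std::sscanf(line.c_str(),
                       "[TileIR Callback] DESC_TMA512: 0x%lx %lx %lx %lx",
                       &out[0], &out[1], &out[2], &out[3]) == 4;
}
```

Only the first word carries the 0x prefix; the remaining three are bare zero-padded hex, so a parser that expects a prefix on every word will fail on the second field.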
The host module attaches three pieces of metadata to the parent builtin.module. The compiled host object is stored under the nv_tileas.host-code attribute (a UnitAttr on the function plus the raw object bytes on the module). The descriptor counts are stored under the inherent attributes nv_tileas.num-device-tmas and nv_tileas.num-host-tmas on the kernel function. Each descriptor pointer argument the kernel ABI grew through AttachTMADescriptorArgs carries a cute_nvgpu.grid_constant argument attribute that the later cute-to-llvm lowering lifts to nvvm.grid_constant, so ptxas places the descriptor in .param memory rather than .global. The combination keeps the device-facing kernel signature stable across host-code revisions: the host module's compiled object lives entirely in the nv_tileas.host-code blob, and any change to descriptor preparation logic is contained in that blob without disturbing the device side.
Tensor-Memory Copy Legalization
TileASLegalizeTmemCopy (pass D18, CLI mnemonic "tileas-legalize-tmem-copy" at rodata 0x46018DF) is the Blackwell-specific rewriter that turns nv_tileas.copy ops crossing the TMEM boundary into pairs of legal tcgen05.ld / tcgen05.st plus ldmatrix / stmatrix sequences. It runs after D08 (MaterializeConvertLayout) has chosen the staging path — which memory space the values travel through — and before ConvertTileASToLLVM emits the corresponding NVVM intrinsics. By that point each copy carries stable source and destination memory-space tags, so the pass dispatches on a concrete TMEM-paired memory-space relation rather than rerunning layout inference.
The pass body sits at sub_7C8920 (0x267 bytes, 615 B). runOnOperation performs a function walk using sub_7C8B90 as the filter callback; the callback gates on classID &unk_5B44FD8 (the nv_tileas.copy op type) and any other op falls through untouched. The legalization core sub_7C78A0 (0xF8A bytes, 3 978 B) runs once per matched copy. It first reads the source and destination memory-space tags through sub_13C5C50, which returns a 4-bit enum: 0 generic (the value the dispatch table below labels rmem), 1 local, 2 shared, 3 global, 4 tmem, 5 constant. It then infers a register-side layout from the TMEM layout and a source-side layout from the TMEM layout. The two failure paths emit verbatim diagnostics "failed to infer register layout from tmem layout" (rodata 0x4601948) and "failed to infer source layout from tmem layout" (rodata 0x4601980); both abort the rewrite for the current copy without touching neighbouring ops.
With both layouts inferred, the rewriter dispatches on the (srcMS, dstMS) pair. The table below is exhaustive for the TMEM-crossing cases; every other pair was already legal after D08 and the callback leaves it alone.
| srcMS → dstMS | Legalised sequence |
|---|---|
| 4 (tmem) → 0 (rmem) | one tcgen05.ld per register tile |
| 0 (rmem) → 4 (tmem) | one tcgen05.st per register tile |
| 4 (tmem) → 2 (smem) | tcgen05.ld into registers, then stmatrix.sync.aligned to smem |
| 2 (smem) → 4 (tmem) | ldmatrix.sync.aligned into registers, then tcgen05.st to tmem |
| any other pair | pass through; D08 has already lowered or rejected it |
LogicalResult legalizeTmemCopy(FunctionOpInterface fn) {
  fn.walk([&](Operation *op) {
    if (op->getName().getTypeID() != /*&unk_5B44FD8*/ COPY_TID) return;
    uint32_t srcMS = sub_13C5C50(op->getOperand(0).getType());
    uint32_t dstMS = sub_13C5C50(op->getOperand(1).getType());
    Layout regLayout, srcLayout;
    if (failed(inferRegLayoutFromTmem(op, &regLayout))) return emit("failed to infer register layout from tmem layout");
    if (failed(inferSrcLayoutFromTmem(op, &srcLayout))) return emit("failed to infer source layout from tmem layout");
    if (srcMS == 4 && dstMS == 0/*RMEM*/) emitTcgen05Ld(op);
    else if (srcMS == 0 && dstMS == 4) emitTcgen05St(op);
    else if (srcMS == 4 && dstMS == 2/*SMEM*/) { emitTcgen05Ld(op); emitStMatrix(op); }
    else if (srcMS == 2 && dstMS == 4) { emitLdMatrix(op); emitTcgen05St(op); }
    else /* pass through */;
  });
  return success();
}
The pass gates on the Blackwell tmem subtarget feature — feature index 80 in the NVPTX subtarget table. On any target that doesn't advertise that bit, the walk still runs but the dispatch table finds no work, because no nv_tileas.copy op references a TMEM-tagged operand. See NVPTX Subtarget and Feature Matrix — The 81 Feature Indices for the feature table layout. The split between layout inference and tile emission lines up with the rest of the Blackwell lowering path: Pipe / Mutex Value Layout describes the per-stage value layout the inferred register layout must match, tcgen05, WGMMA, mbarrier, and Cluster Sync — tcgen05 Machine Validation covers the tcgen05.ld / tcgen05.st instruction family this pass emits, and ldmatrix/stmatrix and Register-Class Vtables — Matrix-Copy Templates documents the ldmatrix / stmatrix companion path for the SMEM-paired cases.
Descriptor Builders and Verifiers
make_tiled_tma_desc records element bit-width, tensor rank, shape, strides, padding, descriptor mode (tiled / im2col / im2col_at / tiled_at), and operand segments. Its pre-lowering verifier and the closely related AttachTMADescriptorArgs and MakeTiledTMADescOpCaptureVerifier diagnostics share a common error surface; the rules below are organised by which structural property they guard.
tmaIdx and descriptor-count rules
AttachTMADescriptorArgs validates the descriptor-index attribute against the host and device descriptor counts it records on the function.
- "tmaIdx exceed tmaHostNum.": the op's tmaIdx is at or beyond the count recorded in nv_tileas.num-host-tmas.
- "tmaIdx exceed tmaDeviceNum.": same against nv_tileas.num-device-tmas.
- "not find tmaIdx.": the op has no tmaIdx attribute at all.
- "funcOp lack tmaDeviceNum and tmaHostNum attr": the function is missing both descriptor-count attributes the pass needs to validate any tmaIdx.
Pointer-alignment and structural rules
- "expected tma descriptor pointer to have alignment at least ": TMA descriptor pointers must be at least 128-byte aligned; the diagnostic ends with the alignment value expected.
- "tma boxDims[0] * elemTypeBitWidth is not a multiple of 16 bytes": the leading box-dim's bit-width must be a 16-byte multiple.
- "tma leading box-dim bit-width is not 16 bytes aligned": the equivalent invariant from the descriptor-pointer side.
- "tmaBoxDim and atomBoxDim length mismatch": descriptor box-dim count must match atom box-dim count.
- "tmaBoxDim and atomBoxDim mismatch": same but for any per-dim disagreement.
- "tma box-dim and copy atom box-dim mismatch": equivalent diagnostic from the copy-atom-side check.
- "smem layout is not TMA compatible": the shared-memory layout's swizzle and rank must fall in the TMA-accepted set documented under Mode Pattern Verifiers — TMA Rank and Mode Gates.
- "only support element_stride = 1 tma desc": element stride above 1 is not implemented for any TMA mode.
Mode and multicast rules
- "unsupported tma load mode '": the descriptor's mode value, when serialised, falls outside the accepted enum range (tiled, im2col, im2col_at, tiled_at, gather4).
- "mcast is not supported for TMA load with less than 128bytes per atom": multicast requires at least 128-byte atoms.
- " but the return TMA load type does not support multicast": the atom's return type cannot carry multicast metadata.
- "missing or invalid num_multicast for a multicast TMA load": the num_multicast attribute must be present and well-typed when multicast is requested.
Padding rules
- "TmaLoad only support zero padding now": non-zero padding is not implemented for any TMA load path.
- "GatherTmaLoad only support zero padding now": same for gather variants.
- "padding value is not supported for TMA load with non-zero padding value": the explicit-padding-value form is rejected end-to-end.
Atom-type rules
The verifier checks that every TMA-bearing op's atom operand has the right family (load vs store vs reduce) and falls inside the per-mode allow-list.
- "expect a tma_store atom type"
- "expect a tma_load atom type"
- "expect a tma_redg atom type"
- "expect a stg, tma_store or unknown_copy atom type": the broader allow-list for store-side ops that may also be plain stg.
- "expect a ldgsts, tma_load, ldg or unknown_copy atom type": load-side equivalent.
- "TmaReduceOp do not support SCATTER4 mode": the reduce path cannot run in scatter4 because the scatter mode has no reduction operator.
- "invalid TMA atom type": fallthrough when none of the allow-lists matches.
Capture-walker rules
MakeTiledTMADescOpCaptureVerifier walks back from each operand through RegionBranchOpInterface to check that the dependency closure only uses ops the host-side lowering can replay.
- "values depended by MakeTiledTMADescOp are not supportedbecause " followed by " matches more than 1 captured values.": the operand SSA graph reaches a value with multiple capture sources, which the host module cannot reproduce.
- "expected MakeTiledTMADescOp not depends on scf": the descriptor builder depends on an scf op that the host pass cannot lower. SeparateHostTMA refuses to run when this dependency exists.
- "expect lower MakeTiledTMADescOp": the residual device-side make_tiled_tma_desc op that the host-conversion pass expected to be gone is still present after host lowering.
- "math dialect not suppourt in separateHostTMA pass in the moment." (verbatim typo preserved): the descriptor's capture closure reaches a math dialect op the host module cannot emit.
Composed-layout, descriptor-construction, and partitioning rules
These diagnostics fire from the TMA descriptor's lowering patterns and the partition verifier.
- "unable to partition input tensors for TMA": the TMA partition step couldn't find a partition that satisfies the atom's box-dim constraints.
- "failed to compute the TMA G-basis, got ": the descriptor's G-basis (the global-tensor stride pattern) could not be computed from the supplied shape/stride pair.
- "Computed TMA layout is invalid, got ": the synthesised layout failed the layout verifier downstream.
- "Failed to construct the TMA tensor type": the descriptor's !cute.tensor result type couldn't be built from the supplied operand types.
- "doesn't support composed layout for ": the composed-layout path is restricted to the set of swizzle modes the descriptor packer can express.
Skeleton
LogicalResult verify_tma_descriptor(MakeTiledTmaDescOp op) {
  require_global_memref(op.tensor());
  reject_unsupported_composed_layout(op.layout());
  require_rank_at_most(op.tensor(), MAX_TMA_RANK);
  require_descriptor_alignment(op.descriptor_pointer(), /*bytes=*/128);
  verify_tma_stride_contract(op);
  verify_cache_mode(op);
  verify_atom_type_in_allow_list(op);
  return verify_descriptor_capture(op);
}
Tensormap Mutators
The CUDA driver encodes host-born descriptors once. Device-born descriptors use a fixed three-mutator subset — tensormap.replace.global_address, tensormap.replace.dim_size, and tensormap.replace.stride_size — driven in the order address → dim[0..rank-1] → stride[1..rank-1]. The mutable fields are precisely the three the runtime needs to vary across launches without re-encoding a descriptor: the tensor's base pointer, its per-dim sizes, and its per-dim strides. Everything else is immutable construction state:
| Field | Mutable on device | Notes |
|---|---|---|
| global base address | yes | tensormap.replace.global_address |
| global dim sizes (per dim) | yes | tensormap.replace.dim_size, one per dim |
| global strides (per dim) | yes | tensormap.replace.stride_size, one per dim |
| element type | no | fixed at construction |
| rank | no | fixed at construction |
| format (tiled / im2col / im2col_at / tiled_at) | no | fixed at construction |
| box shape | no | fixed at construction |
| swizzle mode | no | fixed at construction; the descriptor packer chooses from a closed set |
| fill mode (zero / constant) | no | fixed at construction |
| oob fill value | no | fixed at construction |
| interleave layout | no | fixed at construction |
| im2col offsets | no | fixed at construction |
The proxy fence rule pairs every device rebind with its consumer's read. Mutators write through generic memory; cp.async.bulk.tensor.* reads through the TMA proxy. A device-side rebind sequence therefore terminates in fence.proxy.tensormap::generic.release.{cta,gpu} before any TMA op consumes the mutated descriptor; the consumer's side emits the paired fence.proxy.tensormap::generic.acquire.* before its first read. The 64-byte descriptor payload, the .b1024 mutator write width that forces 128-byte allocation alignment, and the exact inline-asm templates the three mutators emit are documented end-to-end in TMA, Tensormap, and cp.async.bulk Emission — TMA Descriptor Mutators.
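The ordering contract (address, then dim[0..rank-1], then stride[1..rank-1], then the release fence) can be made concrete by generating the op list for a given rank. This is a sketch only: rebind_sequence is a hypothetical helper, the mutator names are spelled as this page spells them, and the bracketed suffixes are illustrative labels for the dimension being replaced rather than PTX operand syntax.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the device-side rebind sequence for one descriptor, in issue order.
std::vector<std::string> rebind_sequence(unsigned rank, const std::string& scope) {
    assert(rank >= 1);
    assert(scope == "cta" || scope == "gpu");
    std::vector<std::string> ops;
    ops.push_back("tensormap.replace.global_address");
    for (unsigned d = 0; d < rank; ++d)   // every dimension gets a size
        ops.push_back("tensormap.replace.dim_size[" + std::to_string(d) + "]");
    for (unsigned d = 1; d < rank; ++d)   // strides start at dimension 1
        ops.push_back("tensormap.replace.stride_size[" + std::to_string(d) + "]");
    // The sequence terminates in the release fence before any TMA op may consume it.
    ops.push_back("fence.proxy.tensormap::generic.release." + scope);
    return ops;
}
```

For a rank-2 tensor this yields five entries: one address write, two size writes, one stride write, and the closing fence.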
⚡ QUIRK: only three TMA descriptor fields are mutable on device
The CUDA driver encodes a tensormap descriptor once; only the global base address, per-dim sizes, and per-dim strides can be replaced on device through tensormap.replace.{global_address,dim_size,stride_size}. Element type, rank, format, box shape, swizzle mode, fill mode, OOB value, interleave layout, and im2col offsets are construction-time-only. Kernels that need to vary any of the immutable fields across launches have to ship multiple descriptors, because there is no in-place rebind path for them, and the mutators silently no-op (the descriptor reads back unchanged) if a reimplementation routes an immutable field through one of the three replace ops.
This pass's specific contract is narrower: it materialises a make_tiled_tma_desc op carrying the rank, box, stride, padding, and cache attributes the partition verifier later re-checks, and tags every TMA-descriptor-typed kernel argument with cute_nvgpu.grid_constant so the kernel ABI carries the descriptor as a .param constant. Downstream NVVM lowering reads those attributes and emits the rebind sequence the codegen page documents.
TileAS CTA / Cluster Family
Abstract
The CTA/cluster family is the cluster-aware tier of the nv_tileas lowering pipeline. Where schedule and layout passes shape work inside a single CTA, this family shapes how multiple CTAs in a Hopper or Blackwell cluster cooperate and how a single CTA cycles through program-IDs across the grid. It bundles OptimizeExecutionUnitMapping (D12), DynamicPersistent (D16), InsertOCGKnobs (D17), PlanCTA (D19), and the D20 aux cluster (RemoveBufferAlias, RemoveLayoutConversions, Slicing, PrepareForScheduling, ResolveAgentBoundary). The SinkNegF sibling rides with InsertOCGKnobs — the binary places them adjacent and they share the same Pass SSO layout. All of these run after agent materialization and before final schedule and TMA-descriptor generation.
Cluster geometry comes from upstream: the nv_tileaa.kernel_spec attribute (read via sub_152FDF0 / sub_152FE00 in D19) carries num_ctas and an auxiliary scalar. This family propagates the consequences through layouts (PlanCTA), register/warp groups (OptimizeExecutionUnitMapping), per-CTA work distribution (DynamicPersistent), and scheduler-knob pragmas baked into late IR (InsertOCGKnobs). The Blackwell 4-CTA MMA path, the DSMEM cluster handshake, and the 2-CTA TMEM copy all consume the IR shape established here — they live in the ConvertTileASToLLVM boundary and the cute_nvgpu rewriter family, but the conditions driving them are set right here.
Ordering Context
The family sits between agent materialization (MaterializeSchedule, see Async and Pipeline Family — Materialize Schedule) and the scheduling-glue passes (Scheduling Glue). D12 needs agent_switch ops on every agent-bearing region. D16 needs a freshly-lowered nv_tileas.kernel. D17 needs MMA-family ops and async-pipeline fence/barrier anchors already lowered. D19 needs the kernel_spec attribute on the function. The D20 aux cluster expects D08 to have assigned per-op layouts and D11 to have either pipelined or skipped each loop. The Blackwell 4-CTA / DSMEM / 2-CTA paths run later, inside ConvertTileASToLLVM, and consume the cluster decisions recorded here.
OptimizeExecutionUnitMapping (D12)
OptimizeExecutionUnitMapping (CLI optimize-execution-unit-mapping, description "Optimize the numWarps and warpId alignment for each agent") rewrites warp-specialized IR from TileASUnspecializedPipeline so every AgentLikeOpInterface op becomes an nv_tileas.async.pipeline.agent_switch with consistent num_warps, warp_id, and agent_strides ArrayAttrs across its successor regions. It runs at ModuleOp scope through three workhorses: sub_83AC70 runs the post-order tree walk, sub_83BA80 dispatches each leaf, and sub_839240 (6 697 bytes, 239 BBs) is the rewriter that builds the agent_switch. A sub-pass PropagateExecutionUnit (CLI propagate-execution-unit, description "propagate the numWarps for each agent") lives at sub_836E70 — the non-agent path of the dispatcher calls it to fold numWarps upward through scf.for, scf.if, and nv_tileaa.func.
The agent rewriter at sub_839240 opens with eleven SmallVector scratch buffers (inline-12 and inline-6 mixed) for partition state and stable-partitions agents into "normal" vs "interleaved", where an agent is normal when (s[i] & 3) == 0. It then accumulates warp strides, rounding each agent's starting warp up to a multiple of its group size via v44 = ((v39 != 0) + (v39 - (v39 != 0)) / group) * group, a ceiling division that leaves a zero cursor at zero. When the rounded cursor diverges from the running cursor, the rewriter emits a hole record (stride=delta, warpId=prevWarpId, opPtr=0, numWarps=1), which downstream lowering treats as an empty slot. A final pad rounds total warps up to a multiple of 4. Group size comes from each agent's own nv_tileas.num_warps, so Hopper WGMMA (4), Blackwell 1-CTA UMMA (4), and 2-CTA UMMA (8) flow through the same logic with no target-specific branches.
Partitioning done, the rewriter compacts duplicate warp-ids via the sub_15D4300 / sub_8369D0 / sub_836510 triple, builds the new op via sub_44624C0(&unk_5B44F80, ctx) (RegisteredOperationName lookup for nv_tileas.async.pipeline.agent_switch), populates an OperationState with num_warps[], warp_id[], and agent_strides[] ArrayAttrs, materialises it via sub_43FFC20, splices each old child region in through ilist surgery, and erases the original with sub_446E1E0. The lone visible diagnostic — "inconsistent numWarps in the agent switch, maybe it is called in different agents with different numWarps" — fires from the propagator when two child regions disagree on numWarps.
LogicalResult optimize_execution_unit_mapping(ModuleOp module) {
    module.walk_post_order([&](Operation *op) {
        if (!implements_agent_like(op)) {
            propagate_execution_unit_upward(op); /* sub_836E70 */
            return;
        }
        SmallVector<AgentDesc> agents = collect_agents(op);
        /* "normal" agents first, "interleaved" agents after */
        stable_partition(agents, [](AgentDesc &a) { return (a.kind & 3) == 0; });
        uint32_t cursor = 0;
        SmallVector<int32_t> num_warps, warp_id, strides;
        for (AgentDesc &a : agents) {
            uint32_t group = read_attr_i32(a.op, "nv_tileas.num_warps");
            /* round cursor up to a multiple of group; 0 stays 0 */
            uint32_t rounded = ((cursor != 0) + (cursor - (cursor != 0)) / group) * group;
            if (rounded != cursor) {
                push_hole(num_warps, warp_id, strides, rounded - cursor, cursor);
            }
            num_warps.push_back(group);
            warp_id.push_back(rounded);
            strides.push_back(a.stride);
            cursor = rounded + group;
        }
        cursor = round_up_to(cursor, 4); /* final 4-warp pad */
        compact_duplicate_warp_ids(num_warps, warp_id, strides);
        Operation *fresh = build_op("nv_tileas.async.pipeline.agent_switch",
                                    num_warps, warp_id, strides);
        splice_regions_into(fresh, op);
        erase_op(op);
    });
    return success();
}
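The cursor arithmetic is easier to check in isolation. The following is a minimal standalone sketch, not the recovered rewriter: WarpSlot, round_up_to_group, and assign_warps are hypothetical names, and recording the hole at the pre-rounding cursor follows the pseudocode above rather than a confirmed field layout.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One entry of the warp layout. Holes carry the gap size in `delta`.
struct WarpSlot {
    uint32_t warp_id;
    uint32_t num_warps;
    uint32_t delta;   // gap skipped by rounding (0 for real agents)
    bool is_hole;
};

// Same arithmetic as the decompiled expression: round `cursor` up to a
// multiple of `group`, leaving a zero cursor at zero.
inline uint32_t round_up_to_group(uint32_t cursor, uint32_t group) {
    assert(group != 0 && "group size must be positive");
    uint32_t nz = cursor != 0 ? 1u : 0u;
    return (nz + (cursor - nz) / group) * group;
}

// Hypothetical helper: place each agent at a group-aligned warp and
// record a hole whenever alignment skips warps.
inline std::vector<WarpSlot> assign_warps(const std::vector<uint32_t> &groups,
                                          uint32_t *total_warps) {
    std::vector<WarpSlot> slots;
    uint32_t cursor = 0;
    for (uint32_t group : groups) {
        uint32_t rounded = round_up_to_group(cursor, group);
        if (rounded != cursor) {
            slots.push_back({cursor, 1u, rounded - cursor, true});
        }
        slots.push_back({rounded, group, 0u, false});
        cursor = rounded + group;
    }
    *total_warps = round_up_to_group(cursor, 4u);  // final 4-warp pad
    return slots;
}
```

With group sizes {4, 8, 4}, the 8-warp agent cannot start at warp 4, so a 4-warp hole is recorded and the agent lands at warp 8; the last agent starts at 16 and the total stays at 20 because it is already a multiple of 4.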
DynamicPersistent (D16)
TileASDynamicPersistent (CLI tileas-dynamic-persistent, description "Make the kernel into dynamic persistent kernels") implements the compiler side of the persistent-grid idiom. The host launches a grid with one CTA (or a small multiple) per SM; the device-side kernel must keep pulling fresh program-IDs from the runtime tile scheduler until the scheduler signals exhaustion. The pass rewrites a freshly-lowered nv_tileas.kernel body into the form
scf.while (%pid) {
  cond: %v = is_valid_program_id %pid
        scf.condition(%v) %pid
} {
  body: <original body with programID remapped>
        %next = cancel_next_program_id
        scf.yield %next
}
The pass body at sub_7C1800 (10 269 bytes, 322 BBs) is a six-step state machine.
- Step 1 finds the KernelOp (TypeID &unk_5B46E50) via the predicate at sub_7BFC80 plus walker sub_7BFCB0.
- Step 2 runs the idempotence guard sub_7C0DF0 → sub_7C0C40, which inspects every nested scf.while (TypeID &unk_5B44FE0) and, when its condition region already contains is_valid_program_id, emits the warning "Kernel is already dynamic persistent" and returns.
- Step 3 invokes sub_7C0600 (walkKernelAndCollectProgramIDs) to gather every nv_tileaa.get_program_id value-number into an inlined SetVector<uint32>, the probe-stride-37 open-addressing layout shared across the nv_tileaa cluster-A passes.
- Step 4 builds the scf.while head (AbstractOperation 0x5BE3FC8); the condition region emits nv_tileaa.is_valid_program_id and scf.condition.
- Step 5 clones the kernel body into the scf.while's after region using the SetVector-driven IRMapping; the clone rebuilds five ops from scratch (nv_tileas.alloc_tensor, nv_tileas.convert_layout, nv_tileaa.extract, arith.constant, arith.negf) because their attributes (layout, element type) need to change for per-iteration re-evaluation.
- Step 6 emits nv_tileas.cancel_next_program_id immediately before the scf.yield.
sub_7BFB90 registers the dependent dialects (nv_tileaa, nv_tileas, scf). The pass is target-agnostic and runs uniformly on sm_80+ — the persistent-grid idiom is a CTA-shape transform whose target-specific scheduler implementation (StaticPersistent / StreamK / SM100_scheduler) lives in CUTLASS host code. No barrier or fence sits between iterations; the scheduler arbitrates persistent-CTA synchronisation through its own cancel_next_program_id body.
LogicalResult dynamic_persistent(KernelOp kernel) {
    if (already_dynamic_persistent(kernel)) { /* sub_7C0DF0 idempotence */
        emit_warning(kernel.loc(), "Kernel is already dynamic persistent");
        return success();
    }
    SetVector<uint32_t> program_ids;
    walk_kernel_and_collect_program_ids(kernel, &program_ids); /* sub_7C0600 */
    ScfWhileOp loop = build_scf_while(kernel.loc());
    Block *cond = loop.before_block();
    Block *body = loop.after_block();
    OpBuilder cb(cond);
    Value valid = cb.create<IsValidProgramIdOp>(/*pid=*/cond->arg(0));
    cb.create<ScfConditionOp>(valid, cond->arg(0));
    IRMapping map = build_program_id_mapping(program_ids, body->arg(0));
    clone_kernel_body_into(kernel, body, map); /* rebuilds 5 op kinds */
    OpBuilder bb(body, body->getTerminator());
    Value next = bb.create<CancelNextProgramIdOp>();
    body->getTerminator()->setOperands(next);
    return success();
}
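The runtime behaviour of the rewritten kernel can be modelled on the host. The sketch below is a toy model under stated assumptions: ToyTileScheduler and run_persistent_cta are hypothetical names, the real scheduler lives outside the compiler, and a sentinel of -1 for "no more work" is an illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the runtime tile scheduler. It hands out
// program IDs [0, total) and returns kInvalid once they are exhausted.
class ToyTileScheduler {
public:
    static constexpr int64_t kInvalid = -1;
    explicit ToyTileScheduler(int64_t total) : total_(total) {
        assert(total >= 0 && "tile count must be non-negative");
    }
    // Plays the role of cancel_next_program_id: claim the next tile.
    int64_t next_program_id() { return next_ < total_ ? next_++ : kInvalid; }
    // Plays the role of is_valid_program_id.
    static bool is_valid(int64_t pid) { return pid != kInvalid; }

private:
    int64_t total_;
    int64_t next_ = 0;
};

// Mirrors the scf.while shape: test the ID, run the body, fetch the next ID.
inline std::vector<int64_t> run_persistent_cta(
        ToyTileScheduler &sched, const std::function<void(int64_t)> &body) {
    std::vector<int64_t> processed;
    int64_t pid = sched.next_program_id();     // initial program ID
    while (ToyTileScheduler::is_valid(pid)) {  // "before" region
        body(pid);                             // original kernel body
        processed.push_back(pid);
        pid = sched.next_program_id();         // next ID feeds scf.yield
    }
    return processed;
}
```

A second call on an exhausted scheduler returns immediately, which matches the loop shape: the condition region is evaluated before any body iteration.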
InsertOCGKnobs (D17)
TileASInsertOCGKnobs (CLI tileas-insert-OCG-knobs, description "This pass emits OCG knobs as specific optimization hints for the backend OCG compiler") bakes llvm.inline_asm ops carrying .pragma "..." directives into late IR so OCG — the closed-source PTX→SASS scheduler inside ptxas — sees them as scheduler knobs. Two knobs come out of this pass.
The first knob, emitted by sub_7C6870, is .pragma "global knob SchedResBusyXU64=1";\n (42 bytes). Two conditions gate it: the function contains at least one MMA-family op (TypeID &unk_5B46EB8, discovered by walker sub_7C6150 invoking predicate sub_7C5FD0), and the module-level nv_tileaa.target_spec value falls in {100, 101, 102, 103, 110} — Blackwell sm100..sm103 plus Jetson Thor sm110. The arithmetic at 0x7C6A5D..0x7C6A68 is the literal target_spec - 100 <= 3u || target_spec == 110. The llvm.inline_asm op lands at function entry with empty operand and constraint strings and has_side_effects=true; it tells OCG to treat U64 issue-slots as extra-busy, throttling the scalar 64-bit lane against tcgen05 MMA on TMEM-heavy Blackwell kernels.
The second knob, emitted by sub_7C6DA0, is .pragma "next knob FenceCode";\n (31 bytes). It lands before every op whose class-info matches &unk_5B44F28 (async-pipeline fence / arrive, collected by sub_7C63D0 + sub_7C6220) or &unk_5B44F58 (mbarrier / cluster-barrier, collected by sub_7C6000 + sub_7C6300). Both walkers post-filter through sub_13FDD70, sub_1496CE0, and sub_1497290. The knob applies to the next PTX instruction, so OCG won't reorder the lowered fence/barrier across surrounding memory ops. A second emission path, emitFenceCodePragmaBefore at sub_1162CF0, fires from three tileas-to-LLVM conversion patterns (sub_123DC20, sub_123E6B0, sub_123F090) so individual op lowerings can emit FenceCode inline during dialect conversion. With has_side_effects=true blocking DCE and CSE, the inline-asm op survives every downstream lowering until NVPTXAsmPrinter writes it into the PTX text stream. A parallel ocgEnterDirectives / ocgLeaveDirectives ODS-property family (~12 op-property converters across nv_tileaa / nv_tileas / cute_nvgpu) reaches the same end result through structured attributes rather than inline-asm.
LogicalResult insert_ocg_knobs(FuncOp fn) {
    uint32_t target = read_target_spec(fn->getParentOfType<ModuleOp>());
    bool busy_xu64 = (target - 100u) <= 3u || target == 110u; /* sm100..103, sm110 */
    if (busy_xu64 && function_has_mma(fn)) { /* sub_7C6150 walk */
        OpBuilder b(fn.entry_block(), fn.entry_block().begin());
        emit_inline_asm(b, /*asm=*/".pragma \"global knob SchedResBusyXU64=1\";\n",
                        /*has_side_effects=*/true); /* sub_7C6870 */
    }
    fn.walk([&](Operation *op) {
        if (op->name() != &unk_5B44F28 &&           /* async fence/arrive */
            op->name() != &unk_5B44F58) return;     /* mbarrier/cluster-barrier */
        OpBuilder b(op);
        emit_inline_asm(b, /*asm=*/".pragma \"next knob FenceCode\";\n",
                        /*has_side_effects=*/true); /* sub_7C6DA0 */
    });
    return success();
}
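Two details of this pass can be checked mechanically: the byte lengths of the pragma payloads and the unsigned-wrap range test on target_spec. The following is a small sketch of both; the constant and function names are hypothetical, only the string contents and the comparison come from the text above.

```cpp
#include <cstdint>

// Pragma payloads as quoted in the text; sizeof counts the trailing NUL,
// so subtract one to get the emitted byte length.
constexpr char kBusyXU64Pragma[] = ".pragma \"global knob SchedResBusyXU64=1\";\n";
constexpr char kFenceCodePragma[] = ".pragma \"next knob FenceCode\";\n";
static_assert(sizeof(kBusyXU64Pragma) - 1 == 42, "first knob is 42 bytes");
static_assert(sizeof(kFenceCodePragma) - 1 == 31, "second knob is 31 bytes");

// Unsigned subtraction wraps to a huge value for target_spec < 100, so a
// single compare covers the closed range [100, 103].
constexpr bool wants_busy_xu64(uint32_t target_spec) {
    return (target_spec - 100u) <= 3u || target_spec == 110u;
}
```

The wrap is why the check needs no separate lower bound: 99 - 100 becomes 0xFFFFFFFF, which fails the <= 3 test.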
SinkNegF (D17 sibling)
TileASSinkNegF (CLI sink-negf-through-shapes, description "Move negf before shape operations") is a single-pattern greedy rewriter at sub_7C44E0. Its only pattern is {anonymous}::ExchangeNegWithBroadcastPattern (verbatim from the llvm::getTypeName<T>() cache at 0x4601750). The matchAndRewrite at sub_7C5270 accepts arith.negf ops whose defining op carries AbstractOperation handle &unk_5B46F28 (nv_tileas.broadcast) or &unk_5B44FB8 (nv_tileas.expand_dim), rebuilds the arith.negf on the pre-broadcast operand, then re-broadcasts; on mismatch it emits the note "no broadcast/expand_dim op". Sinking the sign-flip exposes it to downstream MMA selectors that can fold the negation into a .neg operand modifier of mma.sync / wgmma / tcgen05.mma. Arch-agnostic.
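The rewrite is sound because negation commutes with replication. The sketch below checks that identity on plain vectors; broadcast_rows and negf are hypothetical stand-ins for nv_tileas.broadcast and arith.negf, not the dialect ops themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy broadcast: replicate a row `copies` times.
inline std::vector<float> broadcast_rows(const std::vector<float> &row,
                                         std::size_t copies) {
    assert(copies > 0 && "need at least one copy");
    std::vector<float> out;
    out.reserve(row.size() * copies);
    for (std::size_t c = 0; c < copies; ++c) {
        out.insert(out.end(), row.begin(), row.end());
    }
    return out;
}

// Elementwise sign flip.
inline std::vector<float> negf(std::vector<float> v) {
    for (float &x : v) x = -x;
    return v;
}
```

Both orders give the same tile, but negating before the broadcast touches row.size() elements instead of row.size() * copies, and it leaves the negation adjacent to the original producer.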
PlanCTA (D19 + BF10)
TileASPlanCTA (CLI tileas-plan-cta, description "propagate CTA related layouts") propagates cluster-aware layouts. runOnOperation at sub_7D4090 binds to FunctionOpInterface (intern key "mlir::FunctionOpInterface]", length 25, cached in qword_5B37670 via sub_44A8A10 / sub_44A8AC0), reads num_ctas and an auxiliary u32 from the function's nv_tileaa.kernel_spec (sub_152FDF0 / sub_152FE00), and short-circuits when num_ctas == 1. For multi-CTA clusters — Hopper 2-CTA MMA, Blackwell 2-CTA UMMA, 4-CTA copy-atom — the analysis at sub_7C9600 constructs a 160-byte state object that interns three StringAttrs ("plancta.direction", "backward", "forward") at +16/+24/+32. A 64-byte chunk-list backs a std::deque<Operation*> worklist whose iterator state fills slots +48..+120.
Seeding happens in two phases. sub_7CB2C0 → sub_7CB1E0 → sub_7CB300 walks the function post-order and pushes every nv_tileas.convert_layout op (classID &unk_5B44FC0) into the worklist via the 2 071-byte seeder sub_7CA9C0, which inspects the convert's src/dst encodings and tags it forward, backward, or both. When the direction byte at analysis+40 is unset, sub_7CE010 runs a forward-seed walker (sub_7C94B0 with filter sub_7C9400) to gather TMA-load and alloc-tensor anchors, synthesising placeholder nv_tileas.convert_layout ops via sub_7C9C80. Both flows meet inside the same worklist.
The propagation loop in sub_7D3F90 pops an op and dispatches to sub_7D3F50 (isBackward(op) ? stepBackward : stepForward). stepBackward (sub_7D3B50, 1 012 bytes, 62 BBs) walks operands and either retags the producer backward and re-enqueues, or — when the producer is itself a tagged convert_layout — invokes the merge at sub_7CC3E0. stepForward (sub_7D1C60, 850 bytes) does the dual on users. The merge commits the CTA-layout decision: it splices the producer's operands into the consumer's operand list via doubly-linked pointer surgery, then calls sub_7C9FB0 (deque bulk-pop helper) with flag 1 to erase both convert_layouts.
LogicalResult plan_cta(FuncOp fn) {
    KernelSpec spec = read_kernel_spec(fn);            /* sub_152FDF0 / sub_152FE00 */
    if (spec.num_ctas == 1) return success();          /* trivial cluster: skip */
    PlanCtaState st;                                   /* 160-byte analysis */
    init_direction_attrs(&st);                         /* "plancta.direction" et al */
    seed_from_convert_layouts(fn, &st);                /* sub_7CB2C0 + sub_7CA9C0 */
    if (!st.direction_set) seed_from_anchors(fn, &st); /* sub_7CE010 forward seed */
    while (Operation *op = pop_front(&st.deque)) {     /* std::deque worklist */
        if (is_backward(op, &st)) {
            step_backward(op, &st);                    /* sub_7D3B50 */
        } else {
            step_forward(op, &st);                     /* sub_7D1C60 */
        }
        if (matches_partner(op, &st)) merge_pair(op, &st); /* sub_7CC3E0 */
    }
    return success();
}
A 28-row LOW cluster between 0x7CB8B0 and 0x7D1640 holds the per-classID arm table for these two direction handlers. It splits into 14 backward arms, 9 forward arms, 2 shared broadcast primitives, 2 iter-arg trampolines, and 1 arith-helper-owned deque-consume leaf. The shared primitives at sub_7CD230 (broadcastEncodingOne — adds one forward-tagged convert_layout per Value use, re-enters the worklist via sub_7CC540) and sub_7CD3A0 (broadcastEncodingAll — fans out across an op's operand vector and its DPS result vector, with the inline/sidecar split at op - 16*(i+1) for slots 0..5 and op - 96 - 24*(i-5) for slots ≥6) are the fanout workhorses every arm eventually reaches.
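The two slot formulas in broadcastEncodingAll join up at slot 5. The following is a minimal sketch of the offset computation, assuming only the two expressions quoted above; result_slot_offset is a hypothetical name and the returned value is the byte distance subtracted from the op pointer.

```cpp
#include <cstddef>

// Byte distance from the op pointer back to slot `i`:
// 16-byte inline slots for i in 0..5, then 24-byte sidecar slots.
constexpr std::size_t result_slot_offset(std::size_t i) {
    return i <= 5 ? 16 * (i + 1) : 96 + 24 * (i - 5);
}
```

Slot 5 sits at op - 96, which is exactly the base the sidecar formula starts from, so slot 6 lands at op - 120 with no gap.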
Each arm specialises by encoding classID. sub_7CD0E0 (scatter / extract_slice) reads operand[0]'s encoding, computes a 6-or-rankOfVector+6 rank tag, looks up the parent op's CTA encoding via sub_4435F20, returns 0 on a match, otherwise drives the reduction handler sub_7CCC20 and clears the direction with sub_7C9940. sub_7CDA50 (iter_arg / scf.while) and sub_7CDCE0 (scf.for) stack the same skeleton and additionally call sub_14314D0 to resolve which iter_arg slot in the outer loop maps to the operand being rewritten. The biggest arm — sub_7D0D80 at 2 234 bytes — is the scf.for region sinker: three-path rank dispatch with a rank-4 branch for 2-CTA MMA atoms (sub_13D2140, sub_14F1150, sub_18664A0). The other heavy LOW, sub_7CFAE0 (1 397 bytes), is the forward direction's alias-in/alias-out helper, using sub_1427100 for forward MmaAtomLayout selection and sub_14265B0 for backward.
PlanCTA 28-arm table
| Addr | Dir | Role |
|---|---|---|
| 0x7CB8B0 | aux | dequeBulkConsume(dst, src_begin, src_end, dst_deque) — std::deque memmove leaf |
| 0x7CD0E0 | B | scatter_operand_encoding — extract_slice / view path |
| 0x7CD230 | shared | broadcastEncodingOne(anal, ctx, opOperand, encoding) — one-Value forward leaf |
| 0x7CD3A0 | shared | broadcastEncodingAll(anal, op, ops, n, dests) — operand + DPS fanout dispatcher |
| 0x7CD8F0 | B | broadcast_to_all_operands — generic broadcast across operand vector |
| 0x7CDA50 | B | iter_arg_or_scf_while_propagator — uses sub_14314D0 for iter_arg slot resolution |
| 0x7CDCE0 | B | scf_for_propagator — symmetric backward+forward, also called by sub_7CE010 seeder |
| 0x7CE4F0 | B | forward-seed helper using sub_13E9790 transform |
| 0x7CE570 | F | operand_prev_block_arg (13EADF0 path) |
| 0x7CEDE0 | F | operand_prev_block_arg_v2 (13F5210 path) |
| 0x7CEE70 | B | extract_slice_propagator — uses sub_1570430 slice-map reader + sub_13E9920 |
| 0x7CF030 | B | broadcast_Y (13EADF0 path) |
| 0x7CF0B0 | F | prev_block_arg_v3 (13EADF0 path, block-arg) |
| 0x7CF140 | B | encoding_helper_A — backward full-shape recompute via sub_13F6490 + sub_14265B0 |
| 0x7CF470 | F | encoding_helper_B — forward full-shape recompute via sub_13EB2C0 |
| 0x7CF7A0 | B | encoding_helper_C — 5-arg variant with cached result-encoding |
| 0x7CFAE0 | F | dual_encoding_helper — alias-in/alias-out fork, two MmaAtomLayout pickers |
| 0x7D0060 | iter-trampoline | inner_scf_for_iter_broadcast — 4-slot fanout (input/next/DPS-op/DPS-res) |
| 0x7D01B0 | B | yield_slot_router — scf.yield index dispatch into sub_7D0060 |
| 0x7D0270 | F | yield_index_router — companion router |
| 0x7D02F0 | B | return_yield_handler — switch over ModuleOp / &unk_5BE4008 / &unk_5B44F70 |
| 0x7D0510 | iter-trampoline | forward iter-args fanout using sub_4191730 / sub_41918D0 |
| 0x7D0600 | B | yield_slot2 — mirror of 7D01B0 dispatching into 7D0510 |
| 0x7D06C0 | B | while_region_body — scf.while before-region propagator |
| 0x7D0820 | F | DPS_result_propagator — use-list pointer surgery for &unk_5B44F38 / &unk_5B44F70 |
| 0x7D0B80 | F | DPS_operand_propagator — 4-way classID switch (ModuleOp / 5BE4008 / 5BE3FF8 / 5B44F10) |
| 0x7D0D80 | B | scf_region_sinker — biggest arm; rank-2/rank-2 fast path, reduction, rank-4 (2-CTA) |
| 0x7D1640 | F | scf_while_body_propagator — forward direction's scf.while companion |
Eleven of these arms carry no static caller edge in tileiras_callgraph.json — IDA treats indirect calls through the 5-entry dispatch table at sub_7C8DA0 .. sub_7C8E20 as untracked. The edges were recovered from disassembly of sub_7D1C60 and sub_7CD3A0.
D20 aux passes
The D20 group bundles the rest of the per-FunctionOp cluster-aware transforms.
TileASRemoveBufferAliasPass (sub_7DACE0, 11 402 bytes) iterates a worklist of nv_tileas.alloc_tensor ops to a fixed point, collapsing aliases introduced by arith.select / scf.while. Convergence failure emits "TileASRemoveBufferAliasPass failed to converge"; unsupported ops yield "RemoveBufferAlias: not supported operation type"; scf.while-yielded aliases yield "Yielded alias not implemented yet".
TileASRemoveLayoutConversionsPass (sub_7E6210, 10 124 bytes) delegates to the 11 728-byte worker sub_7E3440 for buffer-side propagation ("failed to rewrite in buffer layout propagation"), register-side propagation ("failed to rewrite in reg layout propagation"), and three rounds of greedy cleanup ("failed to apply patterns greedily"). The worker switches on op names nv_tileas.convert_layout, nv_tileas.async.pipeline.consume_one, nv_tileas.pragma, and scf.if.
TileASSlicingPass (sub_7FE6C0, 12 298 bytes) materialises the sliceCount IntegerAttr on scf.for / scf.while loops via the 10 289-byte pattern sub_7F8DC0. Failure strings are "unsupported op in Slicing pass", "unsupported op to be a lower bound in slicing pass ", " fail to get an initial forOperand in slicing pass ", and "is not expected inside sliced part in SlicingPass\n". The attribute parser at sub_7F7480 emits "The sliceCountneed to be aIntegerAttr" on malformed input.
TileASPrepareForScheduling (sub_8C4F80) fetches compute capability via sub_13FB490, threads it through an argv bundle, and invokes walker sub_8C4590 with leaf sub_8C4710. When the leaf finds FunctionOpInterface on both op and parent, it fires the 9 943-byte per-function kernel at sub_8C1EB0, which runs six serial sub-passes (names baked at 0x4606C6D onward): decomposeTiledLoadStoreView, refineVecSizeOfAtoms, sliceAndFuse, runCanonicalizer, compactMemLayout, refreshBoxDim. Step 2 picks between ld.global.v2/v4/v8 based on compute capability; step 6 is Blackwell-mandatory and recomputes TMA box dimensions for every tiled_load / tiled_store whose view step 1 modified — cp.async.bulk.tensor.Nd traps on SM100+ when the descriptor's boxDim vector mismatches the final view.
TileASResolveAgentBoundaryPass (CLI tileas-resolve-agent-boundary) is the cluster-family pass that legalises values crossing an nv_tileas.async.pipeline.agent_switch boundary. Warp-specialized programs partition work across producer, consumer, and compute agents; values that flow from one agent's region into another's cannot always stay as direct SSA values because the consuming agent runs on a different warp set and reads operands from a different physical register file or shared-memory bank. This pass inserts the IR shape that delivers those values across the boundary — typically a shared-memory handoff combined with a layout conversion sized to the destination agent's expected shape.
The pass body has no strong string anchor in the surveyed range, so the exact rewriter shape is not pinned. The contract, however, is fixed by what the downstream lowering passes assume on input: after this pass runs, every value that crosses an agent_switch boundary either stays as a direct SSA value (when the destination agent can consume it in place) or has been materialised through a nv_tileas.alloc_tensor / nv_tileas.copy / nv_tileas.convert_layout chain that delivers it in the destination layout. Named-barrier emission stays deferred — that is a separate pass's responsibility. The pass scope is FunctionOpInterface, the gate is post-agent-materialisation (i.e. after MaterializeSchedule has emitted the agent_switch ops), and the only invariant downstream consumers rely on is the handoff shape itself, not the precise op sequence used to materialise it.
Blackwell 4-CTA MMA path (BG06)
The tcgen05.mma instruction family has no cta_group::4 opcode — its verifier at sub_1AD26A0 packs cta_group into two bits and accepts only values 1 and 3. Blackwell's 4-CTA semantics are a copy-side notion: four cooperating CTAs hold the A/D TMEM tiles, and the SMEM→TMEM copy atom that stages the A operand fans data out across the cluster before one tcgen05.mma runs per peer. The architectural background — copy-side ownership, rank predicates, sibling pairing — is collected in Blackwell 2-CTA and 4-CTA MMA.
The relevant pattern is {anonymous}::AtomCopyMakeS2tCopyOpLowering::matchAndRewrite at sub_119B710. Its shape dispatch reads the Cute_nvgpu_S2tCopy_Shape field at +0x20 via sub_13C5F30; the shape-3 arm at 0x119CB98 writes the constant 4 into stack slot var_508 (vs 2 on the shape-1/2 arm at 0x119CA40). That 4 is the multicast width and propagates into cute.tiled.copy.partition_S and the cta_group field of the eventual nvvm.tcgen05.cp. The 4-CTA path also builds a rank-parity predicate, arith.andi(arith.remsi(%rank, %mcw), 1), with %rank from nvvm.read.ptx.sreg.cluster.ctarank and %mcw the multicast width, so exactly two of the four CTAs (those with an odd rank) issue the actual copy while the others receive the multicast. The predicate is wrapped in cute_nvgpu.arch.make_warp_uniform (sub_1134460 → sub_1165D80) before scf.if. TMEM distribution happens at cute.tiled.copy.partition_D (TypeID &unk_5B48078, built at 0x119C296), slicing the destination TMEM MemRef into four quarter-slices keyed off %rank.
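The shape dispatch and the rank-parity predicate reduce to a few integer operations. The sketch below restates them on plain integers; both function names are hypothetical, returning 0 for shapes outside the two described arms is an assumption, and unsigned remainder stands in for arith.remsi because cluster ranks are non-negative.

```cpp
#include <cassert>
#include <cstdint>

// Multicast width per shape arm as described in the text:
// shape 3 -> 4, shapes 1 and 2 -> 2.
inline uint32_t multicast_width_for_shape(uint32_t shape) {
    if (shape == 3u) return 4u;
    if (shape - 1u <= 1u) return 2u;  // unsigned wrap rejects shape 0
    return 0u;                        // assumption: arm not described here
}

// (rank % width) & 1: true for the CTAs that issue the copy themselves.
inline bool issues_copy(uint32_t cta_rank, uint32_t multicast_width) {
    assert(multicast_width != 0 && "width must be positive");
    return ((cta_rank % multicast_width) & 1u) != 0;
}
```

For a width of 4, ranks 1 and 3 pass the predicate and ranks 0 and 2 do not, which is the "two of four" split the paragraph describes.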
DSMEM handshake (S01)
sub_11420D0 is the cluster-scope DSMEM handshake emitter inside ConvertTileASToLLVM. Given a barrier/pipeline op and a destination mnemonic root, it emits one of two shapes. When the op-walk finds no multi-CTA parent, it lays down a bare nvvm.cluster.arrive{.relaxed} + nvvm.cluster.wait pair. Otherwise it emits the full DSMEM handshake — nvvm.mapa → llvm.addrspacecast → optional llvm.inline_asm "fence.release.cluster;" (gated on the a4 relaxed flag) → nvvm.mbarrier.txn (expect_tx) → arith.cmpi → scf.if { llvm.load / llvm.store / arith.xori; } → nvvm.cluster.arrive{.relaxed} → nvvm.cluster.wait. The fence.release.cluster; literal lives at byte_4FA453E; the relaxed arrive wins when an explicit release fence sits upstream. A fast-path bypass at sub_1141120 skips the handshake when sub_152FDF0 returns 1 (trivial single-CTA shape). Ten different rewriters reach this emitter through the sub_11435E0 thunk, whose a5 flag switches between this consumer-side body and the producer-side sub_11420B0. The handshake protocol is documented end-to-end in Cluster Sync and DSMEM Handshake — DSMEM Transaction Handshake.
Blackwell 2-CTA TMEM copy (S02)
The Blackwell 2-CTA TMEM copy is the (v36 - 1) <= 1 arm of the same sub_119B710 shape dispatch. Field +0x20 (sub_13C5F30) selects shape; field +0x18 (sub_13C5F20) carries the numeric cta-group (1 / 2 / 4). The 2-CTA arm sets multicast width 2 and takes the direct-build predicate fork at sub_1134400 rather than constructing the arith.remsi / arith.andi chain — the downstream nvvm.tcgen05.cp's symmetric 2-CTA handshake covers the peer-CTA half of the TMEM tile through the mma instruction's own cta_group::2 encoding. Phase 4 resolves the TMEM coord-to-offset map via sub_116A8D0, phase 5 constructs the destination TMEM MemRef (memory-space tag 4) via sub_116AF90, phase 6 builds cute.tiled.copy.partition_S, and phase 7 emits the scf.if-wrapped cute.tiled.copy(atom, coord, mbarrier). The 2-CTA mbarrier-init helper sub_1147B40 is told isCtaGroup2 = (v25 == 2); failure emits "Failed to init mbarrier". See Blackwell 2-CTA and 4-CTA MMA — CTA Group Control Word for the encoded cta_group::2 bit and tcgen05 Tensor Memory Model for the underlying TMEM model.
Late rewriter sub_99A940 (DD02)
sub_99A940 (10 278 bytes, zero string literals) is the post-Schedule late IR rewriter that fires on every nv_tileas.async.pipeline.create_pipeline op. Its sole caller is the sub_99D170 / sub_99D2E0 walker pair, dispatching on the OperationName sentinel &unk_5B45060. The rewriter emits no new MLIR ops — it is a type rewriter. It walks the create_pipeline's operand list and inner consumer/producer regions, unwrapping PipelineIteratorType (TypeID &unk_5B45A60) values via sub_1496C90 at five sites, gated on the type's kind field (sub_1496C80) returning 11 (consumer iterator subclass) or 1 (producer iterator subclass). It also rematerialises new PipelineIteratorType values via sub_1498180(ctx, depth, elem) and seats them into one of six per-pass DenseMaps. This is the final cluster-aware cleanup once Schedule::solve (sub_8EEE70) has committed iteration counts: each producer/consumer region's iterator types unwrap to their element types, so subsequent lowering passes see plain SSA values rather than wrapped iterator handles.
Per-pass roster
| Pass | Body (sub_) | Scope | Gate | Output |
|---|---|---|---|---|
| OptimizeExecutionUnitMapping | 0x83AC70 / 0x839240 | ModuleOp | AgentLikeOpInterface (&unk_5B44F80) | agent_switch with num_warps[], warp_id[], agent_strides[] |
| PropagateExecutionUnit (helper) | 0x836E70 | per-op | non-agent op classID | folded numWarps upward through scf / func |
| TileASDynamicPersistent | 0x7C1800 | nv_tileas.kernel | not already wrapped in scf.while + is_valid_program_id | body wrapped in scf.while + cancel_next_program_id |
| TileASInsertOCGKnobs (U64) | 0x7C6870 | FunctionOpInterface | target_spec ∈ {100..103, 110} AND MMA present | .pragma "global knob SchedResBusyXU64=1"; at entry |
| TileASInsertOCGKnobs (Fence) | 0x7C6DA0 | FunctionOpInterface | classID ∈ {&unk_5B44F28, &unk_5B44F58} | .pragma "next knob FenceCode"; before each anchor |
| TileASSinkNegF | 0x7C44E0 | FunctionOpInterface | arith.negf over broadcast / expand_dim | arith.negf moved before the shape op |
| TileASPlanCTA | 0x7D4090 / 0x7D3F90 | FunctionOpInterface | kernel_spec.num_ctas != 1 | folded / moved convert_layouts; cleared direction tags |
| TileASRemoveBufferAliasPass | 0x7DACE0 | FunctionOpInterface | aliased alloc_tensor users | canonical alloc + rebuilt users |
| TileASRemoveLayoutConversionsPass | 0x7E6210 + 0x7E3440 | FunctionOpInterface | redundant convert_layout pairs | propagated layouts; folded convert chains |
| TileASSlicingPass | 0x7FE6C0 + 0x7F8DC0 | FunctionOpInterface | scf.for with sliceCount: IntegerAttr | N slice regions + optional residual loop |
| TileASPrepareForScheduling | 0x8C4F80 + 0x8C1EB0 | FunctionOpInterface | valid compute-cap pointer | six stages: decomposeTiledLoadStoreView → refineVecSizeOfAtoms → sliceAndFuse → runCanonicalizer → compactMemLayout → refreshBoxDim |
| TileASResolveAgentBoundaryPass | (unpinned) | FunctionOpInterface | post-agent-materialisation | renumbered / legalised agent_switch edges |
Ordering Invariants
- OptimizeExecutionUnitMapping (D12) runs after MaterializeSchedule has emitted AgentLikeOpInterface ops.
- DynamicPersistent (D16) runs once per KernelOp, before scheduling; it is idempotent on scf.while-wrapped kernels.
- InsertOCGKnobs (D17) runs after MMA-family ops and async-pipeline fence/barrier anchors have reached their final form; the .pragma inline-asm ops must survive every downstream lowering, so D17 must not run before passes that DCE inline-asm with no side effects.
- SinkNegF is order-insensitive relative to D17 but must run before MMA-operand-modifier folding sees the negation.
- PlanCTA (D19) requires nv_tileaa.kernel_spec on the function; it short-circuits when num_ctas == 1.
- The D20 aux cluster expects layouts already assigned by D08 and pipelining decided by D11.
- The 4-CTA / DSMEM / 2-CTA emitters in ConvertTileASToLLVM run after every pass in this family has committed its cluster-shape decisions.
TileAS Scheduling Glue Passes
Abstract
The scheduling glue passes prepare IR for schedule generation, publish schedule analysis, and tag loops that LLVM must unroll later. They sit between layout/buffer preparation and async-pipeline materialization. Their job: make the IR schedulable, run the scheduler, preserve the result for materialization, and keep register-tensor loops from surviving as runtime-indexed loops in PTX.
This chapter also records the LLVM/NVVM cleanup passes that run after TileIR has been lowered. They schedule no TileAS IR, but they protect the backend handoff — selecting kernels, printing kernel metrics, propagating address spaces, and demoting non-kernel functions before PTX emission.
Pass Roster
| Pass | Layer | Purpose |
|---|---|---|
| TileASPrepareForScheduling | TileAS MLIR | canonicalizes tiled views, atom widths, loop hints, memory layout, and TMA box dimensions |
| TileASGenerateSchedule | TileAS MLIR | runs serial or cost-based scheduling and publishes ScheduleAnalysis |
| TileASUnrollRegisterLoops | TileAS/LLVM bridge | tags register-tensor loops for full LLVM unrolling |
| PretreatPass | LLVM/NVVM | canonicalizes NVVM LLVM IR after inlining and before verification |
| CheckGepIndexPass | LLVM/NVVM | validates constant GEP indices after canonicalization |
| SelectKernelsPass | LLVM/NVVM | filters the module to selected kernels |
| KernelInfoPrinter | LLVM/NVVM | prints kernel metrics for diagnostics and tuning |
| IPMSPPass | LLVM/NVVM | propagates concrete NVPTX address-space facts interprocedurally |
| NVVMAA | LLVM analysis | provides alias facts from NVPTX address spaces |
| NVPTXSetFunctionLinkagesPass | LLVM/NVVM | demotes non-kernel functions and marks kernel visibility |
Prepare for Scheduling
TileASPrepareForScheduling is the last TileAS cleanup before schedule generation. Six stages run in fixed order.
| Stage | Purpose |
|---|---|
| tiled load/store view decomposition | rewrites reshaped tiled views into direct views of the underlying allocation |
| atom vector-size refinement | chooses target-aware vector widths for copy, TMA, and MMA atoms |
| slice/fuse marking | annotates loops with slicing and fusion hints already discovered upstream |
| scheduling canonicalizer | unrolls tiny loops and decomposes single-op wrappers around atom operations |
| memory layout compaction | assigns shared-memory offsets to non-overlapping live ranges |
| TMA box-dimension refresh | recomputes runtime box dimensions used by Blackwell tensor-copy operations |
The stage order is part of the contract. TMA box dimensions, for instance, must refresh after view decomposition and memory layout compaction — the descriptor dimensions have to reflect the final view and allocation shape.
LogicalResult prepare_for_scheduling(FuncOp func, TargetInfo target) {
    if (failed(decompose_tiled_load_store_views(func))) {
        return failure();
    }
    if (failed(refine_atom_vector_sizes(func, target))) {
        return failure();
    }
    mark_slice_and_fuse_candidates(func);
    if (failed(run_scheduling_canonicalizer(func))) {
        return failure();
    }
    if (failed(compact_memory_layout(func, target))) {
        return failure();
    }
    return refresh_tma_box_dimensions(func, target);
}
Generate Schedule
TileASGenerateSchedule reads compute capability, builds the machine model, picks a scheduler strategy, runs it on each schedulable region, legalizes the result for materialization, and publishes ScheduleAnalysis.
The scheduler option accepts the serial baseline and cost-based variants. Unknown values fall back to the cost scheduler. The cost path also reads the maximum constraint-iteration count and the RRT-size threshold — it can refine constraints repeatedly before convergence.
LogicalResult generate_schedule(FuncOp func, SchedulerOptions options) {
    ComputeCapability cc = read_compute_capability(func);
    if (!cc.valid()) {
        return func.emit_error("failed to get compute capability");
    }
    MachineModel model = build_machine_model(cc);
    Scheduler scheduler = make_scheduler(options.scheduler, model, options);
    for (Region *region : find_schedulable_regions(func)) {
        ScheduleAnalysis analysis;
        if (failed(scheduler.schedule(region, &analysis))) {
            return region->emit_error("failed to find a schedule");
        }
        if (failed(legalize_loop_schedule_for_materialization(region, analysis))) {
            return region->emit_error("failed to legalize schedule");
        }
        preserve_schedule_analysis(region, analysis);
    }
    return success();
}
With trace output enabled, the pass emits a Chrome tracing JSON stream. The useful events: issue timing, atom attributes, dependency edges, minimal latency, actual latency, and thread order. Tracing is diagnostic only — it must not affect scheduling.
Register-Loop Unroll Tagging
TileASUnrollRegisterLoops (D10) is the smallest pass in this family. It attaches a single #llvm.loop_annotation<unroll = <enable>> attribute to ops D08 already marked as the root of a reg-to-reg copy sequence. The pass unrolls nothing itself — it writes metadata so LLVM's LoopUnrollPass fires on the resulting loop later in the NVPTX backend, after convert-scf-to-cf and LowerLoopAnnotation turn the scf.for body into an llvm.br/llvm.phi cycle carrying the !llvm.loop MD node.
The pass triple lives at 0x825590 (getName returning "TileASUnrollRegisterLoops"), 0x8255A0 (getArgument returning "tileas-unroll-register-loops"), and 0x8255B0 (getDescription returning "Unroll loops that access slices of register tensors"). getDependentDialects at 0x8256B0 loads nv_tileaa and nv_tileas. The pass exposes no options: no Option<...> registrations appear in the range, and no command-line wrappers reference it beyond the argument string.
The core rewriter is sub_825BF0 (1 272 B). It walks the function with runOnOperation at sub_8261E0, and for each op carrying D08's "nv_tileas.root_reg_to_reg_copy_op" UnitAttr marker it interns the "loop_annotation" key via sub_440A060(ctx, "loop_annotation", 15) and attaches a LoopAnnotationAttr whose only populated field is unroll = <enable>. The cached global Identifier slot for the llvm.loop_annotation attribute name is qword_591E3B8; the rewriter loads it once and reuses the interned key across every tagged op in the function.
LogicalResult unrollRegisterLoops(FunctionOpInterface fn) {
    MLIRContext *ctx = fn->getContext();
    // Intern the attribute name once and reuse it for every tagged op.
    StringAttr key = StringAttr::get(ctx, "loop_annotation"); // sub_440A060
    fn.walk([&](Operation *op) {
        if (!op->hasAttr("nv_tileas.root_reg_to_reg_copy_op")) {
            return;
        }
        op->setAttr(key, LoopAnnotationAttr::get(ctx, /*unroll=*/UnrollAttr::enable()));
    });
    return success();
}
The resulting attribute is the single-element LoopAnnotationAttr form #llvm.loop_annotation<unroll = <enable>>. LLVM's LoopUnrollPass reads this through the !llvm.loop metadata convention and picks the unroll factor from its own cost model (computeUnrollCount, PartialThreshold, register pressure). D10 deliberately writes neither llvm.loop.unroll.count nor llvm.loop.unroll.full: the policy is "enable, and let LLVM decide", keeping the factor decision in one place rather than splitting it between the TileAS scheduler and the NVPTX backend.
Interaction with D08
D10 pairs with D08 (MaterializeConvertLayout). With reg2reg-vec-size at or below 15, D08 cannot vectorise the layout conversion and falls back to a per-element copy sequence. The root op of that sequence carries the "nv_tileas.root_reg_to_reg_copy_op" UnitAttr; downstream lowerings turn that root into a scf.for whose body shuffles register slices one element at a time. D10 recognises the marker and adds the unroll hint so the loop fully unrolls in PTX, where register variables have no runtime addressing — a rolled loop would either spill to local memory or be miscompiled by the tile-to-vector lowering.
The two passes stay split rather than fused because their context requirements pull in opposite directions. D08 needs layout-decomposition state — the cute atom catalogue, the per-target vector-size table, the layout-conversion analysis — to decide which ops are reg-to-reg copy roots. D10 needs the final IR shape: every layout-affecting pass between D08 and PTX emission must already have run so the marker still sits on the op that becomes the actual loop. Fold D10 into D08 and the marker decision has to win against every subsequent rewrite; fold D08 into D10 and D10 has to redo the layout analysis. Splitting them lets each pass run at its natural point in the pipeline and reduces the marker to a one-bit handoff.
LLVM/NVVM Backend Glue
The LLVM/NVVM passes run after TileAS has lowered to LLVM and NVVM dialects.
PretreatPass performs NVIDIA-specific LLVM IR cleanup after inlining: normalising pointer casts, byval aggregates, lifetime intrinsics, intrinsic placeholders, and NVVM metadata before verification and lowering.
CheckGepIndexPass validates that constant GEP indices are in bounds once canonicalization has exposed the final aggregate shapes.
SelectKernelsPass filters the module by a configured list or ordinal range. Non-selected kernels disappear before code generation.
KernelInfoPrinter emits tuning metrics — stack allocation count, direct and indirect calls, inline assembly calls, invokes, and flat-address-space accesses. Flat address-space use matters: generic pointers block tensor-memory and bulk-copy lowering.
IPMSPPass propagates concrete pointer address spaces through calls. When a helper receives a generic pointer but every caller supplies the same concrete space, the pass rewrites or clones the helper so later NVPTX code skips the generic address-space cast.
NVVMAA is the companion alias analysis: generic address space may alias, distinct non-generic spaces do not alias, and equal non-generic spaces may alias.
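The three alias rules above fit in a single predicate. The sketch below is illustrative only: the `AddrSpace` and `AliasResult` enums and the function name are hypothetical stand-ins, not the types NVVMAA actually uses.

```cpp
// Hypothetical address-space tags; the set and naming are assumptions.
enum class AddrSpace { Generic, Global, Shared, Constant, Local };
enum class AliasResult { NoAlias, MayAlias };

// Generic may alias anything; distinct concrete spaces never alias;
// equal concrete spaces may alias.
AliasResult alias_by_address_space(AddrSpace a, AddrSpace b) {
    if (a == AddrSpace::Generic || b == AddrSpace::Generic) {
        return AliasResult::MayAlias;
    }
    return a == b ? AliasResult::MayAlias : AliasResult::NoAlias;
}
```

The rule only ever proves NoAlias for two distinct concrete spaces, which is why propagating concrete address spaces before alias-sensitive optimization pays off.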
NVPTXSetFunctionLinkagesPass identifies kernels by calling convention, nvvm.kernel, or kernel metadata, then demotes non-kernel definitions to internal linkage.
Schedule Support Bodies
Three support algorithms serve every schedule consumer (MaterializeSchedule, UnspecializedPipeline, and the downstream stage expander). They live as free functions in the schedule-utils namespace rather than methods on ScheduleAnalysis — the same code then serves both AUS and AWS drivers without dragging analysis state into per-driver vtables.
| Helper | Role |
|---|---|
| materialize schedule | partitions loads, stores, pending async operations, and iteration data into materialization maps keyed by the producing op |
| build stages | converts union constraints into stage-ordered producer/consumer pairs that drive prologue/body/epilogue construction |
| expand single tiled op | replicates a tiled operation per stage and rewires operands to stage-local values |
StageMap materialize_schedule(Region *region, ScheduleAnalysis &an) {
StageMap m;
for (Operation *op : region_ops(region)) {
Stage s = an.stage_of(op);
if (is_load(op)) m.loads[s].push_back(op);
else if (is_store(op)) m.stores[s].push_back(op);
else if (is_async(op)) m.pending_async[s].push_back(op);
else m.iter_data[s].push_back(op);
}
return m;
}
void build_stages(ScheduleAnalysis &an, SmallVector<Stage> *out) {
for (Constraint &c : an.union_constraints()) {
Stage producer = c.producer_stage();
Stage consumer = c.consumer_stage();
emit_producer_consumer_pair(*out, producer, consumer);
}
stable_sort_by_stage(*out);
}
void expand_single_tiled_op(TiledOp op, ScheduleAnalysis &an, Rewriter *rw) {
Map<Value, Value> operand_source = collect_operand_sources(op);
Map<Stage, Operation *> stage_anchor = collect_stage_anchors(an);
for (Stage stage : an.stages_for(op)) {
Operation *replica = clone_for_stage(op, stage, rw);
for (Operand operand : replica->operands()) {
if (Value replacement = lookup_stage_value(operand, stage, operand_source, stage_anchor)) {
operand.replace_with(replacement);
}
}
}
}
Ordering Invariants
- PrepareForScheduling must run before GenerateSchedule.
- GenerateSchedule must preserve ScheduleAnalysis for materialization.
- Register-tensor unroll tagging must run before LLVM loop-unroll passes.
- LLVM/NVVM pretreatment and verification must run before PTX emission.
- Kernel selection and linkage demotion must not run before kernel attributes are finalized.
- Address-space propagation should run before alias-sensitive optimization and final NVPTX lowering.
Cross-References
Layout and Buffer Family — Assign Load/Store Layouts documents D08 (MaterializeConvertLayout), which emits the "nv_tileas.root_reg_to_reg_copy_op" marker that D10 consumes. nv_tileas Op Roster and Builders — Attribute Roster lists the attribute as part of the dialect's UnitAttr inventory. The upstream LLVM dialect loop_annotation attribute documents the metadata shape that D10 attaches and that the NVPTX backend's LoopUnrollPass later reads. The schedule analysis this family publishes is consumed by Materialize Schedule; the resource model behind it is documented in Modulo Scheduler and Rau-Style Placement and Resource Constraint Builder and RRT. Backend address-space alias facts that NVVMAA exposes are described in NVPTX Bring-up and Target Init.
Scheduler Overview
The TileAS scheduler turns an operational nv_tileas block into a staged pipeline. Its visible output is a stable (stage, order) assignment for the operations in the scheduled block, followed by explicit async coordination values such as Pipe_ and Mutex_. Downstream lowering reads that assignment to decide which operations belong to the same software-pipeline stage, which values cross stage boundaries, and where barrier-like coordination must appear.
Two responsibilities split into two passes. TileASGenerateSchedule chooses the schedule: it builds dependence and resource constraints, searches for a feasible initiation interval, and records per-operation stage/order information in ScheduleAnalysis. MaterializeSchedule then consumes that analysis and rewrites the IR: it builds the concrete async coordination graph, emits Pipe_ and Mutex_ values, and verifies that the scheduled region still satisfies the chosen ordering.
The split is part of the contract. The first pass works on operations, dependence edges, opaque async handles, and resource footprints; the second consumes the already-chosen schedule and materializes SSA values. A faithful reimplementation must not merge these phases — fusing them makes the resource search depend on temporary pipe identities that the materializer is free to rewrite.
Mental Model
The scheduler answers two questions. Placement: for each operation, at what logical stage and order should it run so that dependencies and hardware resource budgets are respected? Communication: after placement, which producer and consumer operations need an explicit async value between them?
The placement pass is a modulo scheduler. It issues loop iterations at a fixed initiation interval, written as II. A Resource Reservation Table (RRT) tracks hardware usage: each row corresponds to one cycle modulo II, and each bit in a row represents a resource class. An operation carries its own footprint table. Placing an operation at a cycle is legal when every footprint row is disjoint from the corresponding global row.
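The row-disjointness test can be sketched with one bitmask per cycle. The type and function names below are hypothetical, the 64-bit row width is an assumption, and the sketch assumes a footprint is no longer than II so that it cannot wrap onto itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bitmask per cycle modulo II; each bit stands for one resource class.
struct ReservationTable {
    std::vector<uint64_t> rows;
    explicit ReservationTable(unsigned ii) : rows(ii, 0) {}
    unsigned ii() const { return static_cast<unsigned>(rows.size()); }
};

// Footprint row k holds the resource classes an op occupies k cycles after issue.
using Footprint = std::vector<uint64_t>;

// Legal when every footprint row is disjoint from the matching global row.
bool fits(const ReservationTable &rrt, const Footprint &fp, unsigned cycle) {
    for (std::size_t k = 0; k < fp.size(); ++k) {
        if (rrt.rows[(cycle + k) % rrt.ii()] & fp[k]) return false;
    }
    return true;
}

// Commits the footprint by OR-ing each row into the global table.
void reserve(ReservationTable &rrt, const Footprint &fp, unsigned cycle) {
    for (std::size_t k = 0; k < fp.size(); ++k) {
        rrt.rows[(cycle + k) % rrt.ii()] |= fp[k];
    }
}
```

Because rows are indexed modulo II, a footprint issued near the end of the interval wraps onto the first rows, which is how the table accounts for overlap with the next iteration.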
The materialization pass is not another modulo scheduler. Its Schedule::solve step is a greedy disjoint-set pass over producer and consumer groups. It reads the fixed (stage, order) relation, groups operations that must communicate through the same async value, and emits the concrete Pipe_ values. It never searches for a new II and never runs the RRT feasibility test.
Pipeline Shape
nv_tileas block
|
| TileASGenerateSchedule
| - build dependence graph
| - build resource constraints
| - search initiation interval
| - assign stage/order
v
ScheduleAnalysis
|
| MaterializeSchedule
| - recover scheduled depths
| - seed Pipe_/Mutex_ skeletons
| - solve producer/consumer groups
| - rebuild and verify scheduled IR
v
scheduled nv_tileas block
The handoff object is ScheduleAnalysis. Conceptually, it contains the scheduled blocks, validity state, per-operation depth information, resource footprints, and the opaque handles that let the materializer connect async producers and consumers before final Pipe_ SSA values exist.
Pass 1: GenerateSchedule
TileASGenerateSchedule starts from a scheduled candidate block, picks out the operations that participate in the pipeline, and refines constraints until the schedule is feasible or the configured iteration limit is reached. The pass option max-constraint-iterations bounds the outer refinement loop so pathological inputs cannot drive compile time without bound; it defaults to 16.
Scheduling policy enters the algorithm through the constraint builder. The builder reads register pressure, resource-footprint density, pipeline depth, and structural grouping, then emits constraints that restrict the search space. Recurring constraint families include SameDepthConstraint for operations that must remain at the same pipeline depth, MaxDepthConstraint for operations that must not drift beyond a depth limit, ForceSerialExecutionConstraint for regions that must execute single-lane, and structural grouping constraints that tie operations into one scheduling unit. Each constraint family carries the rationale that justifies the restriction so a later refinement round can lift it when the schedule becomes infeasible.
The modulo scheduler then tries candidate placements. The dependence graph enforces legal order; the RRT enforces resource feasibility. The two checks stay separate: dependences answer when an operation may run relative to other operations, the RRT answers whether the machine has capacity at a candidate cycle. The probe-and-commit mechanics are covered in Modulo Scheduler and Rau — Resource Reservation Table; this page does not duplicate them.
Each refinement round records a 2-bit status: bit 0 marks converged, bit 1 marks budget_exceeded. When the iteration budget runs out without convergence, the scheduler raises the budget-truncated flag on Schedule.flags so the materializer can distinguish a clean schedule from a truncated one and emit the matching diagnostic.
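A bounded refinement loop with the two status bits might look like the sketch below. The constant names and the callback shape are assumptions made for illustration; only the bit positions and the default budget of 16 come from the text above.

```cpp
#include <cstdint>
#include <functional>

constexpr uint8_t kConverged      = 1u << 0;  // bit 0
constexpr uint8_t kBudgetExceeded = 1u << 1;  // bit 1

// Runs `round` until it reports convergence or the iteration budget runs out.
// `round` is a hypothetical callback returning true once the schedule is feasible.
uint8_t refine_until_converged(unsigned max_iterations,
                               const std::function<bool(unsigned)> &round) {
    for (unsigned i = 0; i < max_iterations; ++i) {
        if (round(i)) return kConverged;
    }
    return kBudgetExceeded;  // caller raises the budget-truncated flag
}
```

Called as `refine_until_converged(16, round)`, a result of `kBudgetExceeded` is the case in which the materializer should report a truncated schedule rather than a clean one.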
Pass 2: MaterializeSchedule
MaterializeSchedule consumes the fixed schedule and turns it into IR. It rebuilds two maps from the analysis — async handle to producer operation, and operation to scheduled depth — then walks the scheduled region and seeds preliminary Pipe_ and Mutex_ skeletons. Once the skeletons are in place, the materializer iterates over the producer/consumer candidate pairs the walker discovered and runs Schedule::solve on each. After the per-pair solves complete, the materializer collapses duplicate pipe skeletons, splices the surviving SSA values back into the region in stage/order, and verifies the rewritten region against the postcondition that producer and consumer placement still agrees with the analysis.
Pass separation is the keystone invariant. Before materialization, scheduling edges live behind opaque async handles because no final Pipe_ SSA value exists yet. After materialization, those handles resolve to concrete Pipe_ and Mutex_ values whose identity is stable for the rest of the pipeline. A reimplementation that fuses the two passes ties the resource search to handle identities the materializer is free to rewrite, which breaks the cache invalidation contract documented below.
Analysis Handoff
A single cached analysis couples the two passes. ScheduleGenerator allocates and populates ScheduleAnalysis; MaterializeSchedule retrieves the same slot through the AnalysisManager. Neither pass touches the other's internals — every cross-pass datum flows through the cached analysis.
The analysis is keyed by the RTTI string "mlir::nv_tile_ir::as::schedule_utils::ScheduleAnalysis]" and registered through the Meyers-cached TypeID idiom documented in TypeID Sentinels and Anchors. The AnalysisManager keys on the resulting TypeID* so the second pass picks up the exact slot the first pass wrote.
Schedule-Data Map
Several data structures cross the analysis-vs-materialization boundary. Each one has a single canonical home page that documents its layout and probing rules; this overview only names them and points outward.
| Structure | Owner page | Role across the seam |
|---|---|---|
| ScheduleAnalysis | this page | Cached handoff record; AnalysisManager key |
| Schedule view | Modulo Scheduler and Rau | Internal view materializer reconstructs from the analysis |
| RRT and NodeRRT | Modulo Scheduler and Rau — Resource Reservation Table | Resource feasibility check; consumed only by Pass 1 |
| Per-op footprint rows | Resource Constraint Builder | Built in Pass 1, frozen by the time materialization starts |
| ConstraintMap + DSU | Schedule Constraint Attributes — ConstraintMap Layout | Parsed attribute state consulted by placement and by Schedule::solve |
| Pending-set SwissTable | Serial vs Cost-Based Generators — G1: Pending-Set Membership | Gate-G1 retry filter, seeded by the attribute parser |
| origMap + second-table | Schedule::solve and Cost Evaluators | Producer/consumer resolution during materialization |
| Buffer-assignment record | Buffer Assignment and Named Barriers | Bridge between schedule and physical SMEM/TMEM allocation |
| Pipe_ / Mutex_ headers | Pipe_ and Mutex_ Value-Header Layout | Final SSA shape emitted by materialization |
Schedule::solve
Schedule::solve is not a solver in the integer-programming sense and not a second modulo scheduler. It is the inner producer/consumer grouping pass that runs during materialization. For each candidate pair the walker discovers, it sorts the relevant ops by (stage, order), classifies producers and consumers, closes the producer set over the live-at-consumer relation, unions producers that share the same raw value through a disjoint-set forest, sweeps each root to pick the earliest scheduled owner, and emits a Pipe_ value per group. It never changes a chosen stage, never picks a new II, and never asks whether a placement fits the RRT.
The deep treatment lives in Schedule::solve and Cost Evaluators. The point for this overview is the invariant: any reimplementation that ends up doing resource search inside Schedule::solve has blurred the pass boundary and broken the handoff contract.
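The grouping step can be pictured as a small disjoint-set pass. The sketch below is not the recovered implementation: the `Producer` record, the integer raw-value id, and `group_owners` are hypothetical, and only the union-by-shared-value and earliest-owner steps are modelled.

```cpp
#include <cstdint>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

struct Producer {
    int raw_value;   // hypothetical id of the raw value this producer writes
    uint32_t stage;
    uint32_t order;
};

struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Unions producers that share a raw value, then returns one owner index per
// group: the member with the smallest (stage, order).
std::vector<int> group_owners(const std::vector<Producer> &ps) {
    const int n = static_cast<int>(ps.size());
    DisjointSet dsu(n);
    std::map<int, int> first_with_value;
    for (int i = 0; i < n; ++i) {
        auto res = first_with_value.emplace(ps[i].raw_value, i);
        if (!res.second) dsu.unite(i, res.first->second);
    }
    std::map<int, int> owner_of_root;
    for (int i = 0; i < n; ++i) {
        auto res = owner_of_root.emplace(dsu.find(i), i);
        if (res.second) continue;  // first member seen for this root
        const Producer &cur = ps[res.first->second];
        if (std::make_pair(ps[i].stage, ps[i].order) <
            std::make_pair(cur.stage, cur.order)) {
            res.first->second = i;
        }
    }
    std::vector<int> owners;
    for (const auto &kv : owner_of_root) owners.push_back(kv.second);
    return owners;
}
```

Nothing in this pass reads or writes a stage, an II, or a reservation table, which is the property the pass boundary depends on.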
Stage/Order Invariant
Stage and order together form a total order inside every scheduled block. Two operations may share a stage, but their order value tie-breaks deterministically. The materializer leans on that determinism when sorting producers and consumers before emitting pipe groups — reordering even a single pair changes the producer/consumer identity each Pipe_ value sees and breaks the cache contract every later pass relies on.
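A deterministic sort on the pair is enough to reproduce that total order. The `Seat` record and function name below are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct Seat {
    uint32_t stage;
    uint32_t order;
    int op_id;  // hypothetical handle for the scheduled operation
};

// Sorts by stage first and order second. Because (stage, order) is unique
// within a block, the result does not depend on the input arrangement.
void sort_by_stage_order(std::vector<Seat> &seats) {
    std::sort(seats.begin(), seats.end(), [](const Seat &a, const Seat &b) {
        return std::tie(a.stage, a.order) < std::tie(b.stage, b.order);
    });
}
```

If two operations could share both stage and order, `std::sort` would no longer guarantee a reproducible sequence, which is one way to see why the order value must break every tie.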
Usage and Contract
The TileAS pipeline invokes the two passes in fixed order. TileASGenerateSchedule consumes the nv_tileas block, its operand axis-analysis facts, the buffer-lifetime records published by the layout passes, and the nine tileas.schedule.constraint.* and tileas.* attributes parsed by Schedule Constraint Attributes. It writes a populated ScheduleAnalysis into the AnalysisManager slot keyed by its RTTI TypeID, sets validity bits on the analysis, and stores the chosen II and stage count on the per-block records.
MaterializeSchedule consumes only that cached analysis; it never inspects upstream constraint state directly. Its output is the rewritten nv_tileas block with Pipe_ and Mutex_ SSA values inserted between producer and consumer regions and cute_nvgpu.arch.agent_switch partitioning emitted along the warp-specialized boundaries. Downstream passes must not invalidate the analysis between the two passes — the PassManager preservation contract is what lets the second pass pick up exactly the slot the first pass wrote.
Cross-links
- Modulo Scheduler and Rau covers the initiation-interval search and RRT mechanics.
- Resource Constraint Builder and RRT covers resource-pressure constraints and table construction.
- Schedule::solve and Cost Evaluators covers pipe materialization and pass separation.
- Modulo Driver Or-Chain covers the placement arm selector that the generator runs at each II.
- Pipe/Mutex Value Layout covers the SSA shape of the coordination values that the materializer emits.
- Serial vs Cost-Based Generators contrasts the fallback and high-optimization scheduling paths.
- Blackwell Pipeline 15-Slot Model explains the target-specific slot model used by Blackwell pipeline scheduling.
- Schedule Constraint Attributes catalogues the tileas.schedule.constraint.* and related attributes the generator consumes.
- Buffer Assignment and Named Barriers bridges schedule output to SMEM/TMEM buffer allocation and named-barrier binding.
Serial and Cost-Based Schedule Generators
Abstract
Tileiras carries two schedule generators with the same output shape and very different ambitions. The serial generator is a deterministic baseline — it walks operations in dataflow order, emits edges, and validates the resulting topological order. The cost-based generator is the full modulo-scheduling path: it ranks candidates with resource constraints, structural distances, and RRT probes, then retries with heavier strategies when cheaper placement fails.
Downstream passes consume either generator through the same schedule analysis interface. Picking a generator changes compile-time cost and schedule quality, not the public IR contract after generation succeeds.
Generator Roles
| Generator | Intended use | Algorithmic shape |
|---|---|---|
| serial | deterministic baseline, forced-serial regions, low optimization paths | one walk, no II search, no RRT placement |
| cost-based | optimized TileAS scheduling for warp-specialized and resource-heavy loops | iterative placement with resource gates and cost ranking |
The serial generator earns its place by giving the compiler a simple, predictable schedule when the region does not need modulo scheduling or when a constraint asks for serial execution. The cost-based generator takes over when the compiler wants throughput and must reason about Blackwell issue slots, tensor memory, shared memory, barriers, and async pipelines.
Serial Generator
The serial generator is a single greedy walk over the dependence DAG. It builds the dependence graph from the region, computes in-degree for every operation, seeds a ready queue with the zero-in-degree roots, and then repeatedly pops a ready operation, emits it at the next free (stage, order) position, and decrements its successors' in-degree counters — pushing each newly-ready successor onto the queue. Tie-breaking inside the ready queue is by program order so the output is bitwise deterministic across builds.
bool generate_serial_schedule(Schedule *out, Operation *region) {
DependenceGraph g = build_dependence_graph(region);
InDegreeMap in = compute_in_degree(g);
ReadyQueue ready = collect_zero_in_degree(g); // initial roots
uint32_t order = 0;
while (!ready.empty()) {
Operation *op = ready.pop_in_program_order();
out->stage[op] = 0; // single steady-state stage
out->order[op] = order++;
for (Operation *succ : g.successors(op)) {
if (--in[succ] == 0) {
ready.push(succ);
}
}
}
return order == g.node_count(); // false ⇒ dependence cycle
}
The serial generator never builds an RRT, never searches for II, and never ranks candidate seats. It produces a schedule in which every operation lives at stage 0 and runs strictly after its dependences, which is the correct shape for forced-serial regions and for the low-optimisation paths that do not need software pipelining. When the walk does not visit every operation, the input has a dependence cycle and the caller falls back to a stronger strategy or reports failure.
Cost-Based Generator
The cost-based generator runs a multi-arm strategy over candidate placement orderings. At each iteration it collects the ready set, dispatches it through the four placement arms documented in Modulo Scheduler and Rau — permute, fuse, retry, cost-based — and seats the candidate whose arm produces the lowest cost. Each arm independently proposes a candidate schedule; the lowest-cost legal one wins. When every arm rejects, the generator returns failure.
bool generate_cost_based_schedule(ScheduleGenState *state) {
while (!all_candidates_scheduled(state)) {
CandidateList ready = collect_ready_candidates(state);
if (ready.empty()) return false; // dependence cycle
// Each arm proposes a candidate seat. Costs share a common origin so
// the arm comparison is meaningful.
ArmResult arms[4] = {
try_permute (state, ready),
try_fuse (state, ready),
try_retry (state, ready, state->snapshot),
try_cost_based (state, ready, state->snapshot),
};
const ArmResult *best = best_cost_arm(arms); // skips arms that rejected
if (best == NULL) return false;
commit_seat(state, best->candidate, best->cycle);
}
return true;
}
The cost vector itself is lexicographic:
| Component | Role |
|---|---|
| hard resource gate | rejects candidates that violate depth, resource mask, or already-scheduled constraints |
| pipeline-slot pressure | prefers placements that reduce issue-slot and transport pressure |
| structural distance | breaks ties using dependence distance and critical-path shape |
Do not collapse this into one scalar without proving equivalence. The hard gate decides whether a candidate is legal; the later components only rank legal candidates.
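One way to keep the components separate is to compare them as a tuple. The field layout below is an assumption made for illustration; the recovered CostVector may carry more components or different widths.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical three-component cost; fields are listed in priority order.
struct CostVector {
    bool gate_rejected;            // hard resource gate: true means illegal
    uint32_t slot_pressure;        // pipeline-slot pressure
    uint32_t structural_distance;  // tie-breaker
};

// Lexicographic comparison: a later component only matters when all earlier
// components are equal. Since false < true, every legal candidate sorts
// ahead of every rejected one.
bool cost_lexless(const CostVector &a, const CostVector &b) {
    return std::tie(a.gate_rejected, a.slot_pressure, a.structural_distance) <
           std::tie(b.gate_rejected, b.slot_pressure, b.structural_distance);
}
```

A weighted sum would let a very low pressure score offset a gate rejection unless the weights are proven large enough, which is the equivalence the warning above asks for.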
Placement Arms in Detail
The cost-based generator's four arms each implement a different placement heuristic. They share the same input — a ready set of candidate ops — and the same output shape — an ArmResult that either holds a chosen (op, cycle) seat or marks the arm as having rejected. Cost is compared across arms by the same lexicographic vector documented in Schedule Solve and Cost Evaluators, so the cheapest legal seat wins regardless of which arm proposed it.
Permute Arm
The permute arm enumerates every permutation of the ready set that respects the partial order from the dependence graph, scores each permutation by seating its ops greedily, and picks the permutation whose total cost is lowest. The arm bails out as soon as the permutation count rises past a threshold; for small ready sets it explores exhaustively, for larger ones it samples a fixed number of random permutations.
ArmResult try_permute(ScheduleGenState *state, CandidateList ready) {
    ArmResult best = { .cost = COST_INFINITY, .accepted = false };
    PermutationEnumerator perm = enumerate_topo_permutations(ready, state->dep_graph);
    for (uint32_t i = 0; i < perm.count && i < PERMUTE_BUDGET; ++i) {
        Permutation order = perm.next();
        // Trial seats go into a private copy, so the live state is never
        // mutated and needs no restore between permutations.
        ScheduleSnapshot snap = snapshot_state(state);
        bool legal = true;
        for (Operation *op : order) {
            uint32_t t = find_earliest_legal_cycle(&snap, op);
            if (t == NO_LEGAL_CYCLE) { legal = false; break; }
            commit_seat_in_snapshot(&snap, op, t);
        }
        if (!legal) continue;
        CostVector cost = score_snapshot(&snap);
        if (cost_lexless(cost, best.cost)) {
            best = (ArmResult){ .cost = cost, .order = order, .accepted = true };
        }
    }
    return best;
}
A worked example: the ready set {tiled_tma_load, smem_write, smem_read} has six permutations, two of which respect the dependence edges that place the load before both the write and the read. The permute arm seats each of those two permutations greedily and picks the one whose total resource pressure is lowest: typically (load, write, read) over (load, read, write), because the latter delays the SMEM write into a stage where the next iteration's TMA load already claims tp_smem_wr.
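The count in the worked example can be checked by brute force. The sketch below numbers the ops 0 (load), 1 (write), and 2 (read) and assumes the load must precede both other ops, which is what the two listed orders imply. `legal_orders` is a hypothetical helper, not the recovered enumerator.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Returns every ordering of ops 0..n-1 in which, for each edge (a, b),
// op a appears before op b.
std::vector<std::vector<int>> legal_orders(
        int n, const std::vector<std::pair<int, int>> &edges) {
    std::vector<int> perm(n);
    for (int i = 0; i < n; ++i) perm[i] = i;
    std::vector<std::vector<int>> out;
    do {
        std::vector<int> pos(n);  // pos[op] = index of op in this ordering
        for (int i = 0; i < n; ++i) pos[perm[i]] = i;
        bool ok = true;
        for (const auto &e : edges) {
            if (pos[e.first] > pos[e.second]) { ok = false; break; }
        }
        if (ok) out.push_back(perm);
    } while (std::next_permutation(perm.begin(), perm.end()));
    return out;
}
```

With edges {(0, 1), (0, 2)} the function returns two orderings out of six. Brute force grows factorially with the ready-set size, which is why the arm needs a permutation budget.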
Fuse Arm
The fuse arm merges adjacent compatible ops that can share a resource slot in the same cycle. Two SMEM reads from the same buffer can fuse if their pool counts sum below the pool cap; a TMA load and an unrelated SMEM write cannot fuse because they claim different rows and the fusion would not reduce pressure. The arm is the only one that emits a single (op-pair, cycle) seat for two source ops.
ArmResult try_fuse(ScheduleGenState *state, CandidateList ready) {
ArmResult best = { .cost = COST_INFINITY, .accepted = false };
for (uint32_t i = 0; i < ready.size; ++i) {
for (uint32_t j = i + 1; j < ready.size; ++j) {
Operation *a = ready.ops[i];
Operation *b = ready.ops[j];
if (!can_fuse(a, b, state)) continue;
FusedOp fused = compose_resource_vectors(a, b);
uint32_t t = find_earliest_legal_cycle_for(state, fused.vec);
if (t == NO_LEGAL_CYCLE) continue;
CostVector cost = score_with_fusion(state, fused, t);
if (cost_lexless(cost, best.cost)) {
best = (ArmResult){ .cost = cost, .fused = fused,
.cycle = t, .accepted = true };
}
}
}
return best;
}
bool can_fuse(Operation *a, Operation *b, ScheduleGenState *state) {
if (has_dependence(state, a, b)) return false;
if (different_slot_groups(a, b)) return false;
if (combined_pool_pressure(a, b) > caps()) return false;
return true;
}
Worked example: two smem_read ops r1, r2 reading disjoint buffers from the same SMEM bank. The fuse arm composes their resource vectors into a single triple (slot=15, duration=7, occupancy=2); pool index 4 for the tp_smem_rd cap allows up to 5 simultaneous reads, so the fused op is legal at cycle 0 where either op alone would have been legal. The arm wins over permute when the buffer pair shares an SMEM bank but differs only in offset — the cost reducer scores the fused seat as one row contribution instead of two.
Retry Arm
The retry arm consumes the snapshot overlay maintained by the driver and re-attempts ops that earlier arms marked dead. It does not re-score; it simply re-probes the same (op, cycle) candidate against a fresh RRT in case an earlier rejection was caused by transient pressure that has since cleared. The arm is the cheapest of the four — no permutation, no fusion, no cost reduction.
ArmResult try_retry(ScheduleGenState *state, CandidateList ready,
RetrySnapshot *snap) {
for (uint32_t i = 0; i < ready.size; ++i) {
Operation *op = ready.ops[i];
if (!snapshot_is_dead(snap, op)) continue; // skip live ops
uint32_t t = find_earliest_legal_cycle(state, op);
if (t == NO_LEGAL_CYCLE) {
continue; // still dead
}
snapshot_mark_live(snap, op);
CostVector cost = score_seat(state, op, t);
return (ArmResult){ .cost = cost, .op = op, .cycle = t, .accepted = true };
}
return (ArmResult){ .cost = COST_INFINITY, .accepted = false };
}
The arm returns on the first revived op rather than scanning the full snapshot. This is intentional — the snapshot is small and the cost-based generator runs the arm again on the next iteration if more revived ops are available. Walking the full snapshot in one pass would burn time on ops that are guaranteed to remain dead until later state changes.
Worked example: an smem_write was marked dead by the permute arm because the TMA load occupied tp_smem_wr at cycle 0. After the TMA load committed at cycle 0 of stage 0 and the modulo wrap exposed cycle 8 as a fresh seat, the retry arm finds tp_smem_wr clear at the new candidate cycle and revives the write.
Cost-Based Arm
The cost-based arm is the most expensive of the four. It enumerates every legal (op, cycle) pair across the entire ready set, scores each with the full lexicographic cost vector, and picks the global minimum. The arm runs only when permute, fuse, and retry have all rejected — they cover the common cases, and the cost-based arm exists to find seats that the cheaper heuristics miss.
ArmResult try_cost_based(ScheduleGenState *state, CandidateList ready,
                         RetrySnapshot *snap) {
    ArmResult best = { .cost = COST_INFINITY, .accepted = false };
    for (Operation *op : ready) {
        if (snapshot_is_dead(snap, op)) continue;
        bool op_has_seat = false;  // tracked per op, independent of the running best
        for (uint32_t t = 0; t < state->ii; ++t) {
            if (!gate_g3_rrt_clean(state, op, t)) continue;
            if (!gate_g4_leader_gid_consistent(state, op,
                                               leader_gid_of(state, op))) continue;
            op_has_seat = true;
            CostVector cost = sub_988080_search(state, op, t);
            if (cost_lexless(cost, best.cost)) {
                best = (ArmResult){ .cost = cost, .op = op, .cycle = t,
                                    .accepted = true };
            }
        }
        // Mark dead only the ops that found no legal cycle at all.
        if (!op_has_seat) snapshot_mark_dead(snap, op);
    }
    return best;
}
The two inner calls, sub_988080_search and the gate ladder, pull from the same cost tables that the slot model documents at rodata 0x4CC9D10..0x4CC9D70. The arm's per-iteration cost is O(|ready| × II) probes, against O(|ready|) for the cheaper arms. Worked example: a ready set of 8 ops at II = 16 produces 128 candidate (op, cycle) pairs; the cost-based arm probes each and returns the global minimum, while the permute arm would have explored 8! / 6 = 6720 permutations of a fixed seating order without varying the cycle.
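The two figures in the worked example can be confirmed at compile time. The helper names below are hypothetical and exist only to make the arithmetic explicit.

```cpp
#include <cstdint>

constexpr uint64_t factorial(unsigned n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// One probe per (op, cycle) pair within a single initiation interval.
constexpr uint64_t cost_arm_probes(unsigned ready_ops, unsigned ii) {
    return static_cast<uint64_t>(ready_ops) * ii;
}

static_assert(cost_arm_probes(8, 16) == 128, "8 ops at II = 16");
static_assert(factorial(8) / 6 == 6720, "8! / 6");
```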
The arm's worst case is exactly the case the cost reducer was designed for: small ready sets where every op claims a different slot and the right packing depends on aligning the SMEM transports across stages. The cheaper arms reject because their heuristics cannot see the cross-stage interaction; the cost-based arm sees it because it evaluates the full lexicographic vector for every candidate.
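The lexicographic ranking that cost_lexless applies can be sketched as follows. This is a minimal illustration, not the recovered implementation: the component count of four and the names CostVec, cost_vec_less, and cost_vec_argmin are assumptions. Only the ordering rule comes from the text: an earlier component dominates every later one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: component 0 is the most significant term. */
#define COST_COMPONENTS 4

typedef struct {
    uint32_t c[COST_COMPONENTS];
} CostVec;

/* Returns true when a sorts strictly before b in lexicographic order. */
static bool cost_vec_less(const CostVec *a, const CostVec *b) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < COST_COMPONENTS; ++i) {
        if (a->c[i] != b->c[i]) {
            return a->c[i] < b->c[i];
        }
    }
    return false; /* equal vectors: neither is less */
}

/* Index of the minimum vector; ties keep the earliest candidate. */
static size_t cost_vec_argmin(const CostVec *v, size_t n) {
    assert(v != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        if (cost_vec_less(&v[i], &v[best])) {
            best = i;
        }
    }
    return best;
}
```

Because a difference in an early component decides the comparison outright, a candidate with a slightly worse first component loses even when all of its later components are better.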
Admission Gates
Before the placement driver sub_981D50 commits a seat for a candidate op, four ordered gates run against every candidate the cost-sort surfaces. All four must pass for the seat to commit; failure at any one gate triggers a specific recovery path rather than rejecting the entire candidate set. Gate order stays fixed across all four placement arms (permute, fuse, retry, cost-based), so the same predicates execute in the same sequence no matter which arm is in play. The Rau termination proof depends on it: G3 (the RRT veto) must run strictly after G1/G2 but strictly before G4 so the resource snapshot it sees is the one the cost-sort produced.
The four gates draw on the cost tables documented in the Blackwell Pipeline 15-Slot Model. G2 reads the constraint-attribute table parsed by sub_97B770, G3 reads the global RRT alongside the per-op latency view, and G4 walks the DSU at offset +112 of the scheduler state. G1 fires first because it costs a single SwissTable probe.
G1: Pending-Set Membership
The first gate is a membership probe against an Abseil-layout SwissTable rooted at offset 49 * 8 = 392 of the scheduler state. The table is seeded by the attribute parser alongside the DSU at state + 112; the full seeding picture lives in Schedule Constraint Attributes — Twin Seeding. The probe runs first because it costs a single hash plus a 16-byte slot stride, and rejection on this gate holds the op over to the next placement attempt rather than killing it.
bool gate_g1_pending_set_clean(SchedulerState *state, Op *op) {
// Membership probe on the carry-state SwissTable seeded by the attribute
// parser. Empty sentinel -4096, tombstone -8192; see container-fingerprints.md.
return !pending_set_contains(state->pending_set, op);
}
G2: Max-Depth Viability
The second gate consults the ConstraintMap that the attribute parser sub_97B770 built from tileas.schedule.constraint.max_depth. The decompiled expression reads *((int*)sub_94A550(state, op) + 2) <= 1. The probe sub_94A550 returns a pointer to the constraint slot; its third i32 (offset +8) is the max_depth field the parser wrote from the MLIR attribute. The literal bound 1 is hard-coded into the cost-sort body.
bool gate_g2_max_depth_viable(SchedulerState *state, Op *op) {
/* ConstraintMap lookup; the max_depth field at byte offset +8
* is written by the attribute parser sub_97B770 from
* `tileas.schedule.constraint.max_depth`. The decompiled
* expression `*((int*)slot + 2) <= 1` reads that same field. */
ConstraintSlot *slot = sub_94A550(state, op);
return slot->max_depth <= 1;
}
Failure on G2 means the op is unreachable at the current depth level. The placement driver marks the op dead in the snapshot for the current attempt; the retry arm picks it up once the depth horizon expands.
G3: RRT Veto
The third gate is the resource veto !sub_94A450(state+88, op). The probe at offset +88 = 11 * 8 delegates to the canonical Rau RRT test in sub_12D0800. The op's per-op RRT footprint at *(u64*)(op + 96) must AND-clean against globalRRT[(t + i) mod II] for every cycle i of the footprint duration. This is the hard gate — lexicographic component one in the cost-model decomposition. No lattice element can sit above a state that fails G3.
bool gate_g3_rrt_clean(SchedulerState *state, Op *op, uint32_t t) {
/* Canonical Rau RRT probe. The per-op footprint at op+96
* must not collide with the global RRT row mask at any of
* the duration cycles starting at modulo cycle t. */
const uint64_t *node_rows = op->footprint_rows; /* op+96 */
const RRT *global = state->global_rrt; /* state+88 */
for (uint32_t i = 0; i < op->duration; ++i) {
uint32_t row = (t + i) % global->ii;
if ((global->rows[row] & node_rows[i]) != 0) {
return false;
}
}
return true;
}
Failure on G3 bumps the seat time forward by one cycle and reruns the same gate ladder against the next candidate cycle; the cost-sort itself does not change ordering on a G3 miss.
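The bump-and-retry response to a G3 miss can be sketched as a bounded loop. The types Rrt and Footprint and the function first_clean_cycle are hypothetical stand-ins, and the bound of II attempts is an assumption: after II bumps every modulo row has been revisited, so further bumps cannot succeed at the same II.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_II 64
#define MAX_DURATION 16

typedef struct {
    uint32_t ii;                 /* initiation interval, 1..MAX_II */
    uint64_t rows[MAX_II];       /* one resource bitset per modulo cycle */
} Rrt;

typedef struct {
    uint32_t duration;           /* occupied cycles, 0..MAX_DURATION */
    uint64_t rows[MAX_DURATION]; /* footprint bitset per occupied cycle */
} Footprint;

/* True when the footprint seated at cycle t collides with no committed row. */
static bool rrt_clean(const Rrt *g, const Footprint *f, uint32_t t) {
    for (uint32_t i = 0; i < f->duration; ++i) {
        if ((g->rows[(t + i) % g->ii] & f->rows[i]) != 0) {
            return false;
        }
    }
    return true;
}

/* Bump the seat one cycle at a time, trying at most II candidates.
 * On success stores the first clean seat time in *seat. */
static bool first_clean_cycle(const Rrt *g, const Footprint *f,
                              uint32_t t0, uint32_t *seat) {
    assert(g->ii > 0 && g->ii <= MAX_II && f->duration <= MAX_DURATION);
    for (uint32_t step = 0; step < g->ii; ++step) {
        if (rrt_clean(g, f, t0 + step)) {
            *seat = t0 + step;
            return true;
        }
    }
    return false; /* every modulo cycle collides at this II */
}
```

With II = 4, rows 0 and 1 occupied on bit 0, and a two-cycle footprint on the same bit, the loop rejects cycles 0 and 1 and seats the op at cycle 2.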
G4: Leader-Group DSU Consistency
The fourth gate is sub_96A7D0(state, &candidate, 1, &leader_gid, 1). It walks the DSU at offset +112 of the scheduler state (parent-pointer table, find is sub_976BE0, union is sub_976DE0) and returns non-zero when the candidate's leader-gid find-root coincides with every already-committed group leader that shares the target cycle. The leader gids are parsed by sub_97B770 from tileas.schedule.constraint.gid and tileas.schedule.constraint.leader_gid.
bool gate_g4_leader_gid_consistent(SchedulerState *state, Op *op,
uint32_t leader_gid) {
/* DSU consistency check at scheduler state offset +112.
* Two ops with the same leader_gid must share the same
* depth (= start_cycle / II) for the seat to be legal. */
return sub_96A7D0(state, &op, 1, &leader_gid, 1) != 0;
}
G4 is slot-agnostic at the bit-mask level but slot-dependent at the timing level — two ops in the same group must share the same depth. Fine-slot ties trigger most G4 rejections, for example two tp_tmem_rd candidates belonging to different leader gids competing for the same cycle. Failure on G4 forces the cost-sort to reorder the group rather than reject any single candidate; the driver retries with a different leader ordering before moving on to the next op.
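A parent-pointer DSU with a per-root depth record is one way to picture the same-depth rule. The sketch below is illustrative only: GroupDsu, its fixed capacity, and the helper names are assumptions, and it does not reproduce the argument layout of sub_96A7D0. The depth definition start_cycle / II is taken from the G4 pseudocode comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_GIDS 32
#define DEPTH_UNSET UINT32_MAX

typedef struct {
    uint32_t parent[MAX_GIDS]; /* parent-pointer table */
    uint32_t depth[MAX_GIDS];  /* committed depth per root, or DEPTH_UNSET */
} GroupDsu;

static void dsu_init(GroupDsu *d) {
    for (uint32_t i = 0; i < MAX_GIDS; ++i) {
        d->parent[i] = i;
        d->depth[i] = DEPTH_UNSET;
    }
}

/* Find with path halving. */
static uint32_t dsu_find(GroupDsu *d, uint32_t x) {
    assert(x < MAX_GIDS);
    while (d->parent[x] != x) {
        d->parent[x] = d->parent[d->parent[x]];
        x = d->parent[x];
    }
    return x;
}

/* Merge two groups; the surviving root inherits a depth if it had none. */
static void dsu_union(GroupDsu *d, uint32_t a, uint32_t b) {
    uint32_t ra = dsu_find(d, a);
    uint32_t rb = dsu_find(d, b);
    if (ra == rb) return;
    if (d->depth[ra] == DEPTH_UNSET) d->depth[ra] = d->depth[rb];
    d->parent[rb] = ra;
}

/* Consistent when the group has no committed depth yet, or when
 * start_cycle / ii equals the depth the group already committed to. */
static bool group_depth_consistent(GroupDsu *d, uint32_t leader_gid,
                                   uint32_t start_cycle, uint32_t ii) {
    assert(ii > 0);
    uint32_t root = dsu_find(d, leader_gid);
    return d->depth[root] == DEPTH_UNSET ||
           d->depth[root] == start_cycle / ii;
}

static void group_commit_depth(GroupDsu *d, uint32_t leader_gid,
                               uint32_t start_cycle, uint32_t ii) {
    assert(ii > 0);
    d->depth[dsu_find(d, leader_gid)] = start_cycle / ii;
}
```

After gids 3 and 7 are unioned and gid 3 commits at cycle 17 with II = 16 (depth 1), a candidate in gid 7 is consistent at cycle 20 but not at cycle 5.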
Gate Recovery Summary
Each gate has a distinct failure response. Treating them uniformly would either lose useful candidates (by killing on a recoverable G1) or waste retries (by reordering on a structurally impossible G3).
| Gate | Predicate | On Failure |
|---|---|---|
| G1 | !sub_7E30D0(state+392, op) | hold the op over to the next attempt |
| G2 | *((int*)sub_94A550(state, op) + 2) <= 1 | mark the op dead in the snapshot for the current attempt |
| G3 | !sub_94A450(state+88, op) | bump seat time by one cycle and retry |
| G4 | sub_96A7D0(state, &candidate, 1, &leader_gid, 1) != 0 | force a different group ordering |
The G3 RRT veto ties the gate ladder to the cost tables in the slot model — the same global RRT the per-cycle pressure summariser sub_12CEBF0 reads through the 9-element pool capacity vector is what G3 probes for resource conflicts. The latency view that sub_12C8DF0 writes into the per-op pool is what the cost reducer reads to produce the ranking the gate ladder iterates over.
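The gate-to-recovery mapping in the table can be written as a small dispatch. The enum and function names below are hypothetical; the fixed G1..G4 order and the four recovery actions are the ones the table lists.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { GATE_G1, GATE_G2, GATE_G3, GATE_G4, GATE_COUNT } Gate;

typedef enum {
    RECOVER_HOLD_OVER,     /* G1: keep the op for the next placement attempt */
    RECOVER_MARK_DEAD,     /* G2: dead in the snapshot for this attempt */
    RECOVER_BUMP_CYCLE,    /* G3: advance the seat by one cycle and retry */
    RECOVER_REORDER_GROUP  /* G4: try a different group ordering */
} Recovery;

/* Gates run in fixed order; GATE_COUNT means every gate passed. */
static Gate first_failed_gate(const bool pass[GATE_COUNT]) {
    for (int g = GATE_G1; g < GATE_COUNT; ++g) {
        if (!pass[g]) {
            return (Gate)g;
        }
    }
    return GATE_COUNT;
}

/* Maps a failing gate to its recovery action. */
static Recovery recovery_for(Gate failed) {
    static const Recovery table[GATE_COUNT] = {
        [GATE_G1] = RECOVER_HOLD_OVER,
        [GATE_G2] = RECOVER_MARK_DEAD,
        [GATE_G3] = RECOVER_BUMP_CYCLE,
        [GATE_G4] = RECOVER_REORDER_GROUP,
    };
    assert((int)failed >= 0 && (int)failed < GATE_COUNT);
    return table[failed];
}
```

Only the first failing gate matters: a candidate that would fail both G3 and G4 gets the G3 response, because G4 never runs against it at that cycle.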
Generator Selection
The driver chooses between the serial and cost-based generators with three ordered tests: a force-serial attribute, an op-count threshold, and a resource-bearing-op check. Small regions and forced-serial regions take the serial path because pipelining cannot help there, so the compile-time savings outweigh any throughput improvement the cost-based generator could buy. Regions with at least serial_threshold operations and at least one resource-bearing op enter the cost-based path because their pipelined throughput dominates compile time.
| Trigger | Selected generator |
|---|---|
| force_serial_execution attribute on the region | serial |
| op count below serial_threshold (default 8) | serial |
| no async pipeline, TMA, or WGMMA op present | serial |
| otherwise | cost-based |
The thresholds are conservative. Falling back from cost-based to serial is correct but slow; the inverse — taking the cost-based path on a region that the serial generator would have handled — is also correct but burns compile time on a search whose result is identical to the serial walk's.
Selector Predicate
The driver reads the region's MLIR attributes and op-count summary and applies a single ordered predicate. The first matching rule wins.
ScheduleStrategy select_strategy(Region *region, ScheduleOptions *opts) {
if (region->attrs.force_serial_execution) {
return STRATEGY_SERIAL; // attribute trumps everything
}
if (region->op_count < opts->serial_threshold /* 8 */) {
return STRATEGY_SERIAL; // not worth the cost-based price
}
if (!region->summary.has_async_pipeline &&
!region->summary.has_tma &&
!region->summary.has_wgmma) {
return STRATEGY_SERIAL; // no resource-bearing op
}
return STRATEGY_COST_BASED;
}
The has_async_pipeline, has_tma, and has_wgmma flags are byproducts of the per-block summary that the constraint builder produces — the same pass that computes the per-op resource vectors documented in Resource Constraint Builder and RRT. Reusing that data is the only practical way to keep the selector's cost below the serial generator's own cost; running the selector on every region for free is what allows the conservative thresholds.
A region with an async pipeline but force_serial_execution = true still picks serial because the attribute is tested first and overrides every other rule. A region with no resource-bearing ops and the attribute unset also picks serial because the cost-based path's benefit comes entirely from packing tensor-memory and SMEM transports; with neither present, the cost-based path's cost vector reduces to the structural-distance term alone, which the serial generator's program-order traversal already satisfies.
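The edge cases above can be checked against a self-contained restatement of the selector. RegionFacts is a hypothetical flattened view of the region attributes and summary flags, and pick_strategy is an illustrative name; the rule order and the default threshold of 8 follow the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { STRATEGY_SERIAL, STRATEGY_COST_BASED } Strategy;

/* Hypothetical flattened view of the inputs the selector reads. */
typedef struct {
    bool force_serial_execution;
    uint32_t op_count;
    bool has_async_pipeline;
    bool has_tma;
    bool has_wgmma;
} RegionFacts;

/* First matching rule wins, in the same order as the trigger table. */
static Strategy pick_strategy(const RegionFacts *r, uint32_t serial_threshold) {
    assert(r != NULL);
    if (r->force_serial_execution) return STRATEGY_SERIAL;
    if (r->op_count < serial_threshold) return STRATEGY_SERIAL;
    if (!r->has_async_pipeline && !r->has_tma && !r->has_wgmma) {
        return STRATEGY_SERIAL;
    }
    return STRATEGY_COST_BASED;
}
```

A region with exactly 8 ops and a TMA op takes the cost-based path, since the comparison is strict; the same region with 7 ops stays serial.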
Strategy Orchestration
Inside the cost-based path the driver runs a fixed strategy ladder rather than a single attempt. Cheap strategies run first: a Rau-style refinement, then a deepest-depth retry, then the initial placement. The driver escalates to heavier cost-based placement only when those refuse the candidate. When even the cost-based pass fails, the driver clears intermediate scheduling state and reruns initial and cost-based placement from a known-empty starting point. Each rung returns success immediately on a match, so the ladder short-circuits at the first strategy that produces a feasible schedule.
bool run_schedule_strategies(ScheduleGenState *state) {
if (try_rau_refinement (state)) return true;
if (try_deepest_retry (state)) return true;
if (try_initial_placement (state)) return true;
if (try_cost_based_placement (state)) return true;
clear_intermediate_schedule_state(state);
if (try_initial_placement (state)) return true;
return try_cost_based_placement(state);
}
The order is pragmatic. Cheap strategies run first, cost-based placement is the most expensive fallback, and the clear-and-retry tail handles the case where intermediate state accumulated during earlier strategies blocks a feasible schedule that the initial placement would have found from scratch.
Constraints Consumed
Several constraint families shape which candidates the cost-based path even considers. Hard constraints — force-serial execution, max depth, resource footprint — gate legality. Soft constraints — same-depth, group unions, structural shape — rank only candidates that already cleared the hard gates. The serial path consumes only the force-serial-execution constraint; every other family is silently ignored.
⚡ QUIRK: the serial generator silently drops every constraint except force-serial-execution.
The serial scheduler accepts the same Constraint set as the cost-based path but consults only force-serial-execution; same-depth, group-union, structural-shape, max-depth, and resource-footprint constraints are all dropped without warning when the serial generator runs. A frontend that pins a critical resource bound expecting both paths to honour it sees the cost-based schedule respect it and the serial schedule violate it, and the user-facing diagnostic stream is identical in both cases. Bug reports of "my constraint stopped working after --force-serial-schedule" land here.
| Constraint | Effect |
|---|---|
| force-serial execution | selects or emulates serial ordering |
| max depth | prevents seating a candidate beyond a configured depth |
| same depth | forces related operations to share a depth or stage relation |
| union/group constraints | tie operations into shared scheduling groups |
| structural constraints | rank or reject candidates based on dependency shape |
| resource constraints | reject candidates whose RRT footprint conflicts |
Resolution of these constraints happens before materialization. The later Schedule::solve pass should see only the final analysis, not the live constraint-search state.
Output Contract
Both generators publish the same logical analysis so the downstream materializer can consume either result without dispatch. The analysis carries the operation-to-node map, an ordered operation/node list, the per-op (stage, order) assignment, the dependency edges, optional slot/depth/resource annotations populated only by the optimized path, and a success or failure flag. The materializer should not need to know which generator produced the analysis — except for diagnostics or instrumentation.
Usage and Contract
Callers select a generator by setting the schedule strategy field on the ScheduleOptions record before invoking TileASGenerateSchedule. The serial generator consumes only the operation tree of the scheduled block plus the tileas.schedule.constraint.force_serial_execution attribute; it ignores the per-op slot, latency, and capacity inputs. The cost-based generator additionally reads the tileas.schedule.constraint.gid, leader_gid, and max_depth attributes parsed by Schedule Constraint Attributes, the per-op footprint vectors from the Resource Constraint Builder, and the 9-element pool-capacity vector from the Blackwell Pipeline 15-Slot Model — Pool Capacity Vector. Both paths produce ScheduleAnalysis with the same field set — the optimized path simply fills the optional slot/depth/resource cells that the serial path leaves zeroed. Consumers must treat the (stage, order) pair as the public ordering key and ignore the optional cells unless they are explicitly probing the optimized path's annotations.
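A consumer that honours this contract orders placements by (stage, order) and never branches on the optional cells. The Placement record below is a hypothetical reduction of ScheduleAnalysis to the fields the paragraph names; the real field layout is not shown in this page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t op_id;
    uint32_t stage; /* public ordering key, major */
    uint32_t order; /* public ordering key, minor */
    uint32_t slot;  /* optional cell: left zero by the serial path */
    uint32_t depth; /* optional cell: left zero by the serial path */
} Placement;

/* Compare on the public key only; optional cells are ignored. */
static int placement_cmp(const void *lhs, const void *rhs) {
    const Placement *a = lhs;
    const Placement *b = rhs;
    if (a->stage != b->stage) return a->stage < b->stage ? -1 : 1;
    if (a->order != b->order) return a->order < b->order ? -1 : 1;
    return 0;
}

static void sort_by_public_key(Placement *p, size_t n) {
    assert(p != NULL || n == 0);
    if (n > 0) {
        qsort(p, n, sizeof *p, placement_cmp);
    }
}
```

Sorting this way gives the same result whether the slot and depth cells were populated by the optimized path or left zeroed by the serial path.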
Resource Constraint Builder and RRT
Abstract
The resource constraint builder is the pipeline that produces per-op NodeRRT footprints and commits chosen placement rows back into the global RRT during TileASGenerateSchedule. The reservation-table model itself — bitset rows per cycle, probe-and-commit semantics, the lower-bound formula, and the galloping-plus-binary II search — lives in Modulo Scheduler and Rau. This page picks up where that one leaves off: how the builder constructs the footprints, how the MII split is computed, and how the apply-mode driver writes accepted rows back into the bitset.
The builder lives in schedule generation, not pipe materialization. MaterializeSchedule consumes the completed schedule analysis and never reruns the II search.
Slot Encoding
The scheduling model uses one-based pipeline slot identifiers. The RRT row bit for slot slot_id is 1 << (slot_id - 1). Blackwell currently uses up to 24 slot identifiers, which fits in one 64-bit row; coarse slots group broad resource families while fine slots model concrete issue and transport pressure. Blackwell Pipeline 15-Slot Model documents the fine-slot taxonomy.
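The one-based encoding is easy to get wrong by one bit, so a short sketch helps. The helper name slot_mask and the MAX_SLOT_ID constant are illustrative; the formula 1 << (slot_id - 1) and the limit of 24 come from the paragraph above. The shift uses a 64-bit constant so that it stays well defined across the whole row width.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLOT_ID 24u

/* Row bit for a one-based slot identifier. */
static uint64_t slot_mask(uint32_t slot_id) {
    assert(slot_id >= 1 && slot_id <= MAX_SLOT_ID);
    return UINT64_C(1) << (slot_id - 1);
}

/* Row mask for an op that holds two slots in the same cycle. */
static uint64_t slot_mask_pair(uint32_t first, uint32_t second) {
    return slot_mask(first) | slot_mask(second);
}
```

Slot 1 maps to bit 0 (0x1), slot 12 to bit 11 (0x800), slot 16 to bit 15 (0x8000), and slot 24 to bit 23 (0x800000); an op holding slots 12 and 16 together occupies 0x8800.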
Per-Block Summaries
The builder visits each scheduled block and records the operation tags it uses. Two structures coexist: an open-addressed set de-duplicates tags, while a deterministic list preserves iteration order for stable diagnostics and repeatable scheduling.
BlockSummary summarize_block(Block *block) {
BlockSummary summary = {};
for (OperationNode *node : block->scheduled_nodes()) {
if (summary.tags.insert(node->slot_id)) {
summary.ordered_tags.push_back(node->slot_id);
}
}
return summary;
}
Once every block is summarised, the builder reduces them into per-resource pressure counts and feeds those counts into the lower-bound calculation and the feasibility probe.
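The dedup-plus-ordered-list pattern and the pressure reduction can be sketched together. This version swaps the open-addressed set for a 64-bit mask, which is only possible because slot identifiers stay at or below 24; that substitution, the SlotSummary type, and the per-slot use count as the pressure measure are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOT_ID 24u

typedef struct {
    uint64_t seen;                      /* bit (slot_id - 1) set once recorded */
    uint32_t ordered[MAX_SLOT_ID];      /* distinct slot ids, first-seen order */
    uint32_t n_ordered;
    uint32_t pressure[MAX_SLOT_ID + 1]; /* use count indexed by slot id */
} SlotSummary;

/* Record one use of slot_id: dedup into the ordered list, count every use. */
static void slot_summary_add(SlotSummary *s, uint32_t slot_id) {
    assert(s != NULL && slot_id >= 1 && slot_id <= MAX_SLOT_ID);
    uint64_t bit = UINT64_C(1) << (slot_id - 1);
    if ((s->seen & bit) == 0) {
        s->seen |= bit;
        s->ordered[s->n_ordered++] = slot_id;
    }
    s->pressure[slot_id] += 1;
}

static SlotSummary slot_summary_build(const uint32_t *slot_ids, size_t n) {
    SlotSummary s = {0};
    for (size_t i = 0; i < n; ++i) {
        slot_summary_add(&s, slot_ids[i]);
    }
    return s;
}
```

Feeding the tags 16, 12, 16, 15 yields the ordered list 16, 12, 15 and a pressure of 2 on slot 16, independent of any hash iteration order.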
Constraint-Builder Pipeline
The pipeline that populates the per-op NodeRRT footprints before scheduling has a small, fixed shape. The top-level entry point sub_98BBE0 is a 2-way dispatcher keyed on its third argument, the build-mode flag: a3 == 0 selects build mode and tail-calls into sub_98A3B0 (the 1 296 LOC builder body that walks the dependence graph, materializes per-op slot footprints, and stages them on the per-block constraint state); a3 == 1 selects apply mode and tail-calls into sub_988710, which consumes the staged state and writes resource bits into the global RRT. Both modes share the same per-block constraint record, so the dispatcher is purely a phase selector — no per-call setup beyond the branch.
Before placement starts, the driver computes the minimum feasible initiation interval from a three-way split in sub_989380. Each component is its own helper: sub_9890C0 returns RecMII by walking the dependence graph for cycles that cross loop iterations, sub_989160 returns FineMII from fine-grained dependence distances within a single iteration, and sub_989340 returns DepMII by reading the cached per-op depth at field offset +0x48 on the op record. The split helper takes the maximum, and that becomes the starting value of Schedule.ii for the placement driver sub_981D50 documented in Modulo Driver and 4-Arm OR-Chain — Driver Signature.
uint32_t compute_min_ii(const ScheduleState *schedule) {
uint32_t rec = compute_rec_mii(schedule); // sub_9890C0
uint32_t fine = compute_fine_mii(schedule); // sub_989160
uint32_t dep = compute_dep_mii(schedule); // sub_989340, reads op[+0x48]
uint32_t mii = rec;
if (fine > mii) mii = fine;
if (dep > mii) mii = dep;
return mii;
}
Per-Op Resource-Vector Encoding
Each op enters the builder as an MLIR operation with operand types and dialect attributes; it leaves as a resource vector — a small array of (slot_id, duration, occupancy) triples plus an optional capacity-pool count. The builder reads the op's opcode to pick the primary slot, reads the operand and result types to pick the transport slot, and reads the latency family to pick the duration. Occupancy stays at 1 for every singleton transport and rises only for the capacity pools whose caps are above 1.
The triples are what the cost reducer ranks and what the apply driver writes into the qword row stack. They are the single canonical input/output of the builder body.
Concrete Encodings
The four ops most worth pinning down are the TMA tiled load, the WGMMA matmul, the SMEM write, and the SMEM read. Every other op in the Blackwell dialect either reduces to one of these four or composes them.
// nv_tileas.async.tiled_tma_load — TMA descriptor parked on slot 12 (tma),
// tensor payload flowing through slot 16 (tp_smem_wr). Both stay live for the
// full TMA round-trip duration of 8 cycles. Occupancy is 1 on each row —
// singleton transports cannot share.
ResourceVector encode_tiled_tma_load(Operation *op) {
return (ResourceVector){
.triples = { { .slot = 12, .duration = 8, .occupancy = 1 },
{ .slot = 16, .duration = 8, .occupancy = 1 } },
.n_triples = 2,
.pool_counts = { /* no pool pressure beyond singleton rows */ },
};
}
// nv_tileas.async.wgmma — issue stage on slot 11 (tc_and_mma), transport on
// slot 19 (tp_mma). The MMA accumulator latency is 16 cycles, but the
// scheduler models only the 8-cycle issue window in which the warpgroup
// holds the slots; the rest is dependence latency, not slot occupancy.
ResourceVector encode_wgmma(Operation *op) {
return (ResourceVector){
.triples = { { .slot = 11, .duration = 8, .occupancy = 1 },
{ .slot = 19, .duration = 8, .occupancy = 1 } },
.n_triples = 2,
.pool_counts = { /* TMEM-bank pressure pushed through pool index 1 */ },
};
}
// nv_tileas.async.smem_write — single-row footprint on slot 16 (tp_smem_wr)
// for 7 cycles. The only op that participates directly in the SMEM byte
// budget pool (pool index 5) so the SMEM byte-budget cap of 232 448 sees the
// store size accumulated across all live writes.
ResourceVector encode_smem_write(Operation *op, uint32_t store_bytes) {
return (ResourceVector){
.triples = { { .slot = 16, .duration = 7, .occupancy = 1 } },
.n_triples = 1,
.pool_counts = { [5] = store_bytes }, // SMEM byte budget
};
}
// nv_tileas.async.smem_read — symmetric mirror of smem_write on slot 15
// (tp_smem_rd). 7-cycle hold, no pool pressure.
ResourceVector encode_smem_read(Operation *op) {
return (ResourceVector){
.triples = { { .slot = 15, .duration = 7, .occupancy = 1 } },
.n_triples = 1,
.pool_counts = { /* read transport is row-only */ },
};
}
The builder's classifier is a flat switch on the dialect opcode; every case sets up a triple list of length one or two and copies the duration from the latency-family table. Multi-triple encodings exist exclusively for ops that issue on one slot while transporting on another — TMA and WGMMA. A tiled_tma_load cannot be reduced to a single-row footprint because the descriptor must stay parked on the tma row even when the tensor payload is in flight on the SMEM transport; if either row is occupied, the candidate is rejected.
Apply-Side Lowering
The per-op triples lower to qword rows by accumulating bits across slots before the apply driver writes them into the cycle stack.
NodeRRT lower_resource_vector(const ResourceVector *vec) {
NodeRRT rrt = { .duration = 0 };
for (uint32_t i = 0; i < vec->n_triples; ++i) {
uint32_t bit = vec->triples[i].slot - 1;
if (vec->triples[i].duration > rrt.duration) {
rrt.duration = vec->triples[i].duration;
}
for (uint32_t k = 0; k < vec->triples[i].duration; ++k) {
rrt.rows[k] |= (1ull << bit);
}
}
return rrt;
}
The duration is the maximum over all triples — two triples with different durations co-occupy the same set of cycles for as long as the longer one runs. Slots that drop out earlier leave their bits clear on the trailing cycles; the OR-fold makes that automatic.
The triple list is also the unit of diagnostics: when the placement driver reports an admission failure, it prints the triple that caused the conflict together with the global RRT row at the failing modulo cycle. Two triples merged into one qword would hide which slot rejected the candidate.
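A worked instance of the OR-fold makes the row values concrete. The sketch restates the lowering with fixed-size hypothetical types (Triple, Vec, Rows). The first input is the tiled TMA load encoding from this page; the second, with durations 8 and 4, is an invented vector used only to show trailing cycles losing a bit.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRIPLES 2
#define MAX_DURATION 16

typedef struct { uint32_t slot, duration, occupancy; } Triple;
typedef struct { Triple triples[MAX_TRIPLES]; uint32_t n_triples; } Vec;
typedef struct { uint32_t duration; uint64_t rows[MAX_DURATION]; } Rows;

/* OR each triple's slot bit into every cycle the triple occupies. */
static Rows lower_vec(const Vec *v) {
    Rows r = {0};
    assert(v->n_triples <= MAX_TRIPLES);
    for (uint32_t i = 0; i < v->n_triples; ++i) {
        const Triple *t = &v->triples[i];
        assert(t->slot >= 1 && t->slot <= 64 && t->duration <= MAX_DURATION);
        if (t->duration > r.duration) {
            r.duration = t->duration;
        }
        for (uint32_t k = 0; k < t->duration; ++k) {
            r.rows[k] |= UINT64_C(1) << (t->slot - 1);
        }
    }
    return r;
}
```

For the TMA load (slots 12 and 16, 8 cycles each) every one of the 8 rows is 0x8800. For the invented vector (slot 11 for 8 cycles, slot 19 for 4) rows 0 to 3 are 0x40400 and rows 4 to 7 drop to 0x400.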
24-Slot Apply Driver
Apply mode walks a 24-bit resource row stored as a qword at field offset +80 on each block record. Bit i set in that qword means resource class i is occupied by the current op on cycle 0 of its footprint. Multi-cycle footprints occupy companion qwords at +88, +96, and so on — one qword per footprint cycle, contiguous and in cycle order. sub_989410 is the per-block apply driver, iterating over the staged op list for one block and updating the qword row stack. sub_989BE0 is the per-op variant that runs the same update for a single op record without the block-level iteration.
The active class count matches the Blackwell pipeline-resource model documented in Blackwell Pipeline 15-Slot Model: 8 bits for TMEM/SMEM banks, 4 bits for WGMMA queue slots, 4 bits for TMA descriptors, 4 bits for named barriers, and 4 bits for cp.async queues. That partitioning is why a single 64-bit qword covers each cycle row, and why the apply drivers can read and write each row with a single load/store rather than a vector spread.
Bit Extraction Idiom
The decompilation tests slot occupancy with the x86 idiom shl rax, cl followed by bt rdx, rax, where cl == slot_id - 1. The -1 bias is the canonical fingerprint — the dispatcher uses 1-based slot identifiers in its public interface and 0-based bit positions in the qword. Any code that performs a (slot_id - 1) shift before a bt-style test against a resource qword belongs to the constraint pipeline.
static inline bool slot_occupied(uint64_t row, uint32_t slot_id) {
uint32_t bit = slot_id - 1; // shl rax, cl
return ((row >> bit) & 1ull) != 0; // bt rdx, rax
}
Soft Constraints and Bit-Row Geometry
When the builder detects that an op would force a register spill if seated at its earliest legal cycle, it adds a soft constraint that biases the placement driver away from that cycle without making it illegal. The constraint is a cost term, not a legality predicate — the placement driver may still seat the op at the original cycle if no cheaper alternative is feasible, and the bias only ranks candidates that already cleared the hard resource and dependence gates.
The cost term encodes as a small integer surcharge attached to the candidate cycle for that specific op. The cost-based arm reads the surcharge as a separate component of its lexicographic cost vector, ranked below the hard resource gate but above structural distance. Multiple spill-bias surcharges for the same op accumulate by addition — the builder caps the accumulated bias so a single op cannot push every cycle out of the feasible region.
void tryAddConstraintToAvoidRegSpilling(ScheduleState *state, Op *op,
uint32_t earliest_cycle) {
PressureEstimate p = estimate_register_pressure_at(state, op, earliest_cycle);
if (p.peak <= p.budget) {
return; // no spill predicted; no constraint needed
}
// Encode bias as a cost surcharge on the (op, cycle) pair. Range and cap
// keep accumulated surcharges from saturating the cost vector.
uint32_t surcharge = clamp((p.peak - p.budget) * SPILL_SURCHARGE_WEIGHT,
0, SPILL_SURCHARGE_CAP);
cost_surcharge_add(state->cost, op, earliest_cycle, surcharge);
}
The surcharge is a hint that ranks otherwise-equivalent candidates; it never rejects a seat by itself. A placement that satisfies every hard constraint but carries spill surcharges at every cycle still commits — the schedule is correct, only the register-pressure heuristic is unhappy.
Cost-Term Formula
The surcharge attached to an (op, cycle) pair is a linear function of predicted register-pressure excess, clamped to a fixed cap so a single op cannot saturate the cost vector.
// pressure: per-op register-pressure estimate at the candidate seat cycle.
// budget: register-file budget for the current SM partition.
// W: SPILL_SURCHARGE_WEIGHT (= 17, sourced from the cost-table seeder).
// CAP: SPILL_SURCHARGE_CAP (= 4096, the cost-vector saturation cap).
uint32_t spill_surcharge(uint32_t pressure, uint32_t budget) {
if (pressure <= budget) return 0;
uint64_t raw = (uint64_t)(pressure - budget) * W;
return raw > CAP ? CAP : (uint32_t)raw;
}
A schedule's accumulated spill surcharge is the sum of these per-(op, cycle) terms across every op that received a surcharge. The cost-based arm reads the sum as the third component of its lexicographic cost vector, immediately after the hard resource gate and the pipeline-slot pressure. Two schedules with identical resource and slot-pressure components tie-break on this sum; the schedule with the smaller surcharge wins.
The cap matters for the proof obligation: without it, a sufficiently large pressure overshoot could push the surcharge above the budget the structural-distance term reserves at the bottom of the lexicographic vector, and the ranking would no longer respect the intended priority. The 4 096 cap leaves three orders of magnitude of headroom for the structural-distance term to express its preferences inside.
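Plugging in the documented constants shows where the clamp engages. The sketch reuses W = 17 and CAP = 4096 from the formula above; the function names are illustrative, and the pressure and budget values in the note are made-up inputs.

```c
#include <assert.h>
#include <stdint.h>

enum { SPILL_W = 17, SPILL_CAP = 4096 };

/* Linear surcharge on the pressure excess, clamped to SPILL_CAP. */
static uint32_t spill_surcharge_demo(uint32_t pressure, uint32_t budget) {
    if (pressure <= budget) return 0;
    uint64_t raw = (uint64_t)(pressure - budget) * SPILL_W;
    return raw > SPILL_CAP ? SPILL_CAP : (uint32_t)raw;
}

/* Smallest excess whose raw surcharge exceeds the cap. */
static uint32_t saturation_excess(void) {
    uint32_t e = SPILL_CAP / SPILL_W + 1; /* 4096 / 17 = 240, so 241 */
    assert((uint64_t)e * SPILL_W > SPILL_CAP);
    return e;
}
```

An excess of 10 registers costs 170, an excess of 240 costs 4080, and any excess of 241 or more is clamped to 4096, so the surcharge can distinguish at most 241 overshoot levels.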
The same bit-row geometry that drives the per-op footprints resurfaces in the schedule analyser when it computes stage counts and emits diagnostics. The 24-bit width and the per-cycle qword layout therefore belong to the schedule's serialisation contract, not an apply-mode-only detail.
Helper Table
| Function | Size | Role |
|---|---|---|
| sub_98BBE0 | — | 2-way build/apply dispatcher keyed on a3 |
| sub_98A3B0 | 1 296 LOC | Build-mode body, populates per-op footprints |
| sub_988710 | — | Apply-mode body, writes staged state into the global RRT |
| sub_989380 | — | MII split: max(RecMII, FineMII, DepMII) |
| sub_9890C0 | — | RecMII from recurrence cycles |
| sub_989160 | — | FineMII from fine-grained dependence distances |
| sub_989340 | — | DepMII from per-op depth at +0x48 |
| sub_989410 | — | 24-slot per-block apply driver |
| sub_989BE0 | — | 24-slot per-op apply driver |
| sub_9762E0 | — | tryAddConstraintToAvoidRegSpilling soft-constraint hook |
Usage and Contract
The builder runs inside TileASGenerateSchedule, invoked twice per schedule attempt — once in build mode (a3 == 0) to materialise per-op footprints and once in apply mode (a3 == 1) to commit accepted rows. Build mode consumes the per-op slot identifier, duration, and capacity-pool counts produced by the Blackwell slot classifier, plus the dependence graph for the MII split. Apply mode consumes the accepted (stage, order) placement and writes the chosen footprint rows into the global RRT at qword offsets +80, +88, +96, ... on each block record. The builder publishes the smallest feasible II, the per-op start cycles, and the populated RRT into the surrounding ScheduleState; downstream consumers — the placement driver, the cost evaluators, and the materializer — read those fields without rerunning the search.
Cross-References
Modulo Scheduler and Rau consumes the II and the populated RRT this builder produces. Blackwell Pipeline 15-Slot Model defines the slot identifiers and capacity pools the footprints reference. Modulo Driver and 4-Arm OR-Chain probes the global RRT through Arms 1 and 3's commit paths. Schedule Solve and Cost Evaluators consumes the tryAddConstraintToAvoidRegSpilling hints during cost ranking.
Modulo Scheduler and Rau-Style Placement
Abstract
Tileiras software-pipelines TileAS loops with a Rau-style modulo scheduler. The scheduler searches for an initiation interval, places operations into cycles modulo that interval, and respects both data dependences and target-machine resource capacity. This is the throughput-critical scheduler for loops that use Blackwell tensor memory, shared memory, WGMMA, TMA, named barriers, and async-copy queues.
The output of this pass is schedule analysis: stage, order, resource placement, and depth information. A later materialization pass consumes that analysis to emit Pipe_ and Mutex_ IR.
Scheduling Model
A software-pipelined loop overlaps multiple logical iterations. If the initiation interval is II, the steady-state loop starts one new iteration every II cycles. Smaller II means higher throughput, but only if every recurrence and resource constraint can be satisfied.
Tileiras starts from the usual lower bound:
II >= max(resource_mii, recurrence_mii, fine_density_mii, dependency_mii)
| Bound | Meaning |
|---|---|
| resource MII | lower bound from total resource demand and per-cycle capacity |
| recurrence MII | lower bound from dependence cycles crossing loop iterations |
| fine-density MII | lower bound from resource groups with fractional or pooled capacity |
| dependency MII | lower bound from known longest dependence depth |
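The resource bound in the first row is usually computed as the largest per-resource ratio of demand to capacity, rounded up. The sketch below uses that textbook formulation; whether Tileiras computes it exactly this way is an assumption, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t ceil_div(uint32_t a, uint32_t b) {
    assert(b > 0);
    return a / b + (a % b != 0);
}

/* Largest ceil(demand / capacity) over all resources, at least 1. */
static uint32_t resource_mii(const uint32_t *demand, const uint32_t *capacity,
                             size_t n) {
    uint32_t mii = 1; /* an initiation interval of 0 is meaningless */
    for (size_t i = 0; i < n; ++i) {
        uint32_t bound = ceil_div(demand[i], capacity[i]);
        if (bound > mii) mii = bound;
    }
    return mii;
}

/* Combined lower bound over the four terms in the table. */
static uint32_t max4(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}
```

With 9 uses of a resource that has capacity 2 per cycle and 4 uses of a singleton resource, the resource bound is max(5, 4) = 5; if the fine-density bound is 6, the search starts at II = 6.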
Resource Reservation Table
The scheduler represents resource occupancy with a Resource Reservation Table. The global table has one row per cycle modulo II; each row is a bitset of resources. Each operation has a footprint table with one row per occupied cycle.
bool rrt_probe(const RRT *global, const NodeRRT *node, uint32_t start) {
for (uint32_t k = 0; k < node->duration; ++k) {
uint32_t row = (start + k) % global->ii;
if ((global->rows[row] & node->rows[k]) != 0) {
return false;
}
}
return true;
}
void rrt_commit(RRT *global, const NodeRRT *node, uint32_t start) {
for (uint32_t k = 0; k < node->duration; ++k) {
uint32_t row = (start + k) % global->ii;
global->rows[row] |= node->rows[k];
}
}
Probe-before-commit is mandatory. Committing a partially probed footprint can make retry behavior nondeterministic.
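One way to make the rule hard to violate is to expose a single entry point that runs the full probe before it writes anything. The wrapper rrt_try_place and the fixed-size Rrt and Footprint types below are hypothetical; the two-pass structure is the point of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_II 64
#define MAX_DURATION 16

typedef struct { uint32_t ii; uint64_t rows[MAX_II]; } Rrt;
typedef struct { uint32_t duration; uint64_t rows[MAX_DURATION]; } Footprint;

/* Probes the whole footprint first; the table is written only on success. */
static bool rrt_try_place(Rrt *g, const Footprint *f, uint32_t start) {
    assert(g->ii > 0 && g->ii <= MAX_II && f->duration <= MAX_DURATION);
    /* Pass 1: read-only probe over every occupied cycle. */
    for (uint32_t k = 0; k < f->duration; ++k) {
        if ((g->rows[(start + k) % g->ii] & f->rows[k]) != 0) {
            return false; /* table untouched */
        }
    }
    /* Pass 2: commit, now that no row can collide. */
    for (uint32_t k = 0; k < f->duration; ++k) {
        g->rows[(start + k) % g->ii] |= f->rows[k];
    }
    return true;
}
```

A single fused loop that ORs rows as it probes would leave the first rows written when a later row collides, which is exactly the partial state the rule forbids.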
Candidate-II Search
The scheduler tries the lower bound first, then increases or searches until it finds a feasible interval or reaches the configured cap.
bool schedule_loop(Schedule *sched, RauBounds bounds) {
uint32_t lower = max4(bounds.resource_mii,
bounds.recurrence_mii,
bounds.fine_density_mii,
bounds.dependency_mii);
for (sched->ii = lower; sched->ii <= bounds.ii_cap; ++sched->ii) {
clear_resource_table(&sched->global_rrt, sched->ii);
if (place_all_groups_at_current_ii(sched)) {
sched->stage_count = compute_stage_count(sched);
return true;
}
}
sched->failed = true;
return false;
}
Some subproblems use galloping plus binary search rather than a linear increment. Both forms obey the same semantic contract: find the smallest feasible value according to the probe.
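A galloping search followed by a binary search can be sketched as below. It assumes the probe is monotone (once an interval is feasible, every larger one is too), which the linear loop does not need; without that property only the linear scan is guaranteed to return the smallest feasible value. FeasibleFn, smallest_feasible_ii, and the demo probe feasible_at_least are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*FeasibleFn)(uint32_t ii, void *ctx);

/* Demo probe (hypothetical): feasible once ii reaches a threshold. */
static bool feasible_at_least(uint32_t ii, void *ctx) {
    return ii >= *(const uint32_t *)ctx;
}

/* Smallest feasible ii in [lower, cap], assuming a monotone probe. */
static bool smallest_feasible_ii(uint32_t lower, uint32_t cap,
                                 FeasibleFn feasible, void *ctx,
                                 uint32_t *out) {
    assert(feasible != NULL && out != NULL && lower >= 1 && lower <= cap);
    if (feasible(lower, ctx)) { *out = lower; return true; }

    uint32_t lo = lower; /* invariant: lo is infeasible */
    uint32_t hi = lower;
    uint32_t step = 1;
    for (;;) {           /* gallop: double the step until a hit or the cap */
        if (lo >= cap) return false;
        hi = (cap - lo < step) ? cap : lo + step;
        if (feasible(hi, ctx)) break;
        lo = hi;
        if (step <= UINT32_MAX / 2) step *= 2;
    }
    while (hi - lo > 1) { /* binary search in (lo, hi] */
        uint32_t mid = lo + (hi - lo) / 2;
        if (feasible(mid, ctx)) hi = mid; else lo = mid;
    }
    *out = hi;
    return true;
}
```

With a lower bound of 3 and a true minimum of 11, the gallop probes 4, 6, 10, and 18, then the binary search narrows (10, 18] to 11.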
Placement Arms
For a fixed II, Tileiras tries a deterministic sequence of placement arms:
| Arm | Role |
|---|---|
| permute | cheap priority order and earliest legal seat |
| fuse | merge compatible groups to improve packing |
| retry | use a snapshot overlay to skip or reconsider failed operations |
| cost-based | choose the lowest-cost legal placement with RRT-backed scoring |
The first successful arm wins for the current group. If the full group pass fails, the driver clears temporary state and reruns heavier retry/fallback arms.
bool place_group(Schedule *sched, Group group, RetrySnapshot *snapshot) {
if (try_permute(sched, group)) {
return true;
}
if (try_fuse(sched, group)) {
return true;
}
if (try_retry(sched, group, snapshot)) {
return true;
}
return try_cost_based(sched, group, snapshot);
}
Permute Arm
The permute arm sorts the current group by a predecessor-priority map and seats each operation at the earliest legal cycle.
bool try_permute(Schedule *sched, Group group) {
PriorityMap priority = build_predecessor_priority(group);
stable_sort(group.nodes, [&](Node *a, Node *b) {
return priority.less(a, b);
});
for (Node *node : group.nodes) {
if (!seat_earliest_legal_cycle(sched, node)) {
return false;
}
}
return true;
}
Fuse Arm
The fuse arm merges compatible scheduling groups. It must reject fusions that violate dependence order or MLIR nesting relationships.
bool groups_can_fuse(OperationSet a, OperationSet b) {
for (Operation *left : a) {
for (Operation *right : b) {
if (has_required_order(left, right)) {
return false;
}
if (is_proper_ancestor(left, right) || is_proper_ancestor(right, left)) {
return false;
}
}
}
return true;
}
Stable sorting is part of the contract. It keeps tied fusions deterministic across builds and platforms.
Retry and Snapshot Overlay
Retry uses a snapshot map instead of mutating the canonical group state. A failed attempt can mark a candidate dead in the overlay, and later strategies can clear or rebuild the overlay without corrupting the schedule's operation list.
bool try_retry(Schedule *sched, Group group, RetrySnapshot *snapshot) {
for (Node *node : group.nodes) {
if (snapshot_is_dead(snapshot, node)) {
continue;
}
if (!seat_earliest_legal_cycle(sched, node)) {
snapshot_mark_dead(snapshot, node);
return false;
}
}
return true;
}
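The overlay idea can be shown in isolation. In this sketch the snapshot is a side set of dead nodes, and the group's node list is never modified; `Node`, `RetrySnapshot`, and `live_ids` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

struct Node { int id; };

// Hypothetical overlay: records dead nodes beside the group, never inside it.
struct RetrySnapshot {
    std::unordered_set<const Node *> dead;
    bool is_dead(const Node *n) const { return dead.count(n) != 0; }
    void mark_dead(const Node *n) { dead.insert(n); }
    void clear() { dead.clear(); }
};

// Collects the ids the overlay has not ruled out; the group is read-only.
std::vector<int> live_ids(const std::vector<Node> &group, const RetrySnapshot &snap) {
    std::vector<int> out;
    for (const Node &n : group) {
        if (!snap.is_dead(&n)) out.push_back(n.id);
    }
    return out;
}
```

Clearing the overlay restores the full candidate list at no cost to the canonical state, which is what lets a later strategy start over.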
Cost-Based Arm
The cost-based arm ranks legal placements instead of accepting the first legal seat.
bool try_cost_based(Schedule *sched, Group group, RetrySnapshot *snapshot) {
for (Node *node : group.nodes) {
Placement best = find_lowest_cost_legal_placement(sched, node, snapshot);
if (!best.valid) {
snapshot_mark_dead(snapshot, node);
return false;
}
rrt_commit(&sched->global_rrt, &node->footprint, best.start);
node->start = best.start;
}
return true;
}
Cost ranking is lexicographic. Resource legality is a hard gate; pressure and structural distance rank candidates that already pass.
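A lexicographic ranking with a hard legality gate can be sketched with `std::tie`. The field order (pressure first, then structural distance, then start cycle as a tie-break) is an assumption for illustration; the text only says that legality gates and the other terms rank.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical candidate record for one possible seat.
struct Candidate {
    uint32_t start;
    bool legal;         // result of the resource probe (hard gate)
    uint32_t pressure;  // assumed first ranking key
    uint32_t distance;  // assumed second ranking key
};

// Returns the index of the best legal candidate, or -1 when none is legal.
int pick_lowest_cost(const std::vector<Candidate> &cands) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(cands.size()); ++i) {
        const Candidate &c = cands[i];
        if (!c.legal) continue;  // illegal seats are never ranked
        if (best < 0 ||
            std::tie(c.pressure, c.distance, c.start) <
                std::tie(cands[best].pressure, cands[best].distance,
                         cands[best].start)) {
            best = i;
        }
    }
    return best;
}
```

Because the gate runs before any comparison, an illegal candidate can never win on cost, no matter how low its pressure is.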
Prologue and Stage Count
After successful placement, the scheduler computes the latest occupied cycle. This determines the number of overlapped stages in the steady-state loop.
uint32_t compute_stage_count(const Schedule *sched) {
uint32_t end = 0;
for (Node *node : sched->nodes) {
if (node->start < 0) {
return UINT32_MAX;
}
uint32_t finish = (uint32_t)node->start + node->latency;
end = max(end, finish);
}
return ceil_div(end + 1, sched->ii);
}
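As a worked instance of the formula above, the following sketch recomputes the stage count for a small hand-written schedule. `PlacedNode` is a hypothetical reduction of the node record to the two fields the computation reads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// start < 0 means the node was never placed.
struct PlacedNode { int32_t start; uint32_t latency; };

uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Mirrors the pseudocode above: UINT32_MAX flags an unplaced node.
uint32_t stage_count(const std::vector<PlacedNode> &nodes, uint32_t ii) {
    uint32_t end = 0;
    for (const PlacedNode &n : nodes) {
        if (n.start < 0) return UINT32_MAX;
        end = std::max(end, static_cast<uint32_t>(n.start) + n.latency);
    }
    return ceil_div(end + 1, ii);
}
```

For nodes finishing at cycles 2, 7, and 6, the latest finish is 7, so II = 4 gives ceil(8 / 4) = 2 stages and II = 3 gives ceil(8 / 3) = 3 stages.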
Diagnostics
When resource placement fails, useful diagnostics should include:
- candidate II;
- operation duration and requested slot footprint;
- global RRT rows at the conflicting modulo cycles;
- dependency window for the operation;
- active capacity pool that rejected the candidate, if any;
- group and stage/order information for the failed operation.
These diagnostics let users distinguish an impossible loop body from a heuristic failure.
Cross-References
Resource Constraint Builder and RRT documents footprint construction and feasible-II probing. Blackwell Pipeline 15-Slot Model documents the resource slots. Schedule Solve and Cost Evaluators explains the materialization boundary.
Modulo Driver and 4-Arm OR-Chain
Abstract
The modulo scheduler makes one placement attempt per candidate initiation interval. sub_981D50 drives that attempt — a 215-byte trampoline that builds two per-attempt SwissTable scratches, walks the ready-group worklist, and runs a four-arm OR-chain on every ready group. Arms fire cheapest-first: a priority permutation, a DSU-based group fusion, a snapshot-driven retry, and a cost-based fallback. When every arm refuses, the driver clears its DSU snapshot and reruns the two heaviest arms from scratch. Only when the desperate retry also fails does the driver return zero and force the outer caller sub_982210 to bump II.
Modulo Scheduler and Rau covers the placement algorithm at fixed II, the resource reservation table, and the cost-based arm's scoring. This page sticks to the driver itself and the dispatch shape.
Driver Signature
__int64 sub_981D50(__int64 *schedulerState, /* a1 */
__int64 ***opListPtr, /* a2 */
__int64 ctx, /* a3 */
__int64 groupHdr, /* a4 */
__int64 aux); /* a5 */
Triple indirection on opListPtr is deliberate: **a2 is the first ScheduleBlock* of the outer worklist. groupHdr is a small header { groups*, _, numGroups, _ } where each group occupies two i32 slots — first the group id, then a readiness flag. ctx is the ScheduleBlock carrying the active II, the stage count, and the operation vector. aux is an opaque cookie passed through to every arm unchanged.
The 215-byte frame holds two SwissTable headers (v36 / v37 / v38 for the outer "op-seen" table, v39 / v40 / v41 for the inner DSU snapshot) plus a re-usable scratch v42 / v43 that the desperate-retry block resets.
Per-Attempt SwissTable Scratches
Both tables are rebuilt every driver call before any arm runs. They never persist across II values, and they never persist across calls within the same II — every placement attempt gets a fresh pair.
/* Per-attempt SwissTable header, used twice in the driver frame. */
struct SwissTableHdr {
void *slots; /*+0x00 */ /* flat bucket array */
uint32_t size; /*+0x08 */ /* live entry count */
uint32_t capacity; /*+0x0C */ /* slot count (power of two) */
};
/* Outer "op-seen" bucket: 16 bytes, stride 16. */
struct OuterBucket {
uint64_t key; /*+0x00 */ /* op handle; -4096 empty, */
/* -8192 tombstone */
uint32_t size_mark; /*+0x08 */ /* counter slot, or 0x7FFFFFFF */
/* once an arm marks the op dead */
uint32_t reserved; /*+0x0C */
};
/* Inner DSU-snapshot bucket: 16 bytes, stride 16. */
struct DsuBucket {
uint64_t op_handle; /*+0x00 */ /* same sentinels as OuterBucket */
uint32_t depth; /*+0x08 */ /* recorded schedule depth, or */
/* 0x7FFFFFFF = "dead, skip in */
/* retry" */
uint32_t pad; /*+0x0C */
};
sub_95BF00 populates the outer table with the per-op callback sub_95BEF0. The driver packs { &v36, ctx } into a __m128i, hands it plus &v33 to the visitor, and the visitor walks every op on the first block, writing each op key into a slot at stride 16. The probe sequence runs the fmix64 finalizer through the helper sub_944B20 (carrying the 0x9DDFEA08EB382D69 size-class constant), uses the high 57 bits of the finalized hash to pick a group, and walks entries within the group with stride-16 linear probing. This variant stores its sentinels in the entry-key slot rather than in a separate control-byte group — -4096 marks empty, -8192 marks tombstone — so the SIMD control-byte scan documented in Container Fingerprints — Control-Byte Sentinels collapses to a key-load comparison in this scheduler-local table. Insertion routes through the standard Abseil load-factor rule:
if (4 * (size + 1) < 3 * capacity) {
/* fast direct insert */
} else if (capacity - tombstones - (size + 1) <= capacity / 8) {
/* rehash in place — tombstones dominate the table */
} else {
/* double capacity via sub_8FC010 / sub_8F9240,
* clear new slots to -4096, rehash each live entry */
}
The outer table is the "op-seen header" arms consult to recognise ops already on the worklist. The inner DSU snapshot is a { op_handle -> depth } map — the only piece of state the driver hands sideways into Arms 3 and 4, which read it to decide whether each op is still placeable at the current attempt.
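The sentinel-in-key scheme described above can be illustrated with a small open-addressing table. This is a sketch only: the hash is a placeholder multiplier rather than the helper the driver calls, growth is omitted, and the type names are hypothetical. What it keeps from the text is the 16-byte bucket, the -4096 empty key, the -8192 tombstone key, and linear probing that treats tombstones as occupied during lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kEmpty = -4096;      // sentinel values quoted in the text
constexpr int64_t kTombstone = -8192;

struct Bucket { int64_t key; uint32_t value; uint32_t reserved; };
static_assert(sizeof(Bucket) == 16, "stride-16 bucket");

struct KeySentinelTable {
    std::vector<Bucket> slots;         // capacity must be a power of two
    explicit KeySentinelTable(std::size_t cap) : slots(cap, Bucket{kEmpty, 0, 0}) {}

    std::size_t mask() const { return slots.size() - 1; }
    // Placeholder hash, not the driver's real mixing function.
    std::size_t home(int64_t key) const {
        return ((static_cast<uint64_t>(key) * 0x9E3779B97F4A7C15ull) >> 32) & mask();
    }
    // Linear probe: stop at the key or at the first empty slot.
    Bucket *find(int64_t key) {
        std::size_t p = home(key);
        for (std::size_t i = 0; i < slots.size(); ++i, p = (p + 1) & mask()) {
            if (slots[p].key == key) return &slots[p];
            if (slots[p].key == kEmpty) return nullptr;  // tombstones do not stop the probe
        }
        return nullptr;
    }
    // No growth in this sketch: the caller keeps at least one slot free.
    void insert(int64_t key, uint32_t value) {
        if (Bucket *b = find(key)) { b->value = value; return; }
        std::size_t p = home(key);
        while (slots[p].key != kEmpty && slots[p].key != kTombstone) p = (p + 1) & mask();
        slots[p] = Bucket{key, value, 0};
    }
    void erase(int64_t key) {
        if (Bucket *b = find(key)) b->key = kTombstone;
    }
};
```

Erasing writes a tombstone instead of an empty key so that entries inserted further along the same probe chain remain reachable.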
4-Arm OR-Chain
For every group whose readiness flag clears the (unsigned)gid > 0xFFFFFFFD and flag < 0 filters, the driver runs four arms in fixed dispatch order. The chain short-circuits on the first arm that returns true. The 6th argument carrying the DSU snapshot reaches only Arms 3 and 4 — Arms 1 and 2 consult the scheduler-level DSU at schedulerState + 112 directly.
for (int *g = groupArrayBase; g != groupArrayEnd; g += 2) {
if (skip_unready_or_sentinel(g)) {
continue;
}
int gid = *g; /* a4 of every arm is the group id, not a mode */
if ( sub_978B50(state, opListPtr, ctx, gid, aux) /* arm 1: PERMUTE */
|| sub_97D050(state, opListPtr, ctx, gid, aux) /* arm 2: FUSE */
|| sub_9661B0(state, opListPtr, ctx, gid, aux, &dsuSnap) /* arm 3: RETRY */
|| sub_980290(state, opListPtr, ctx, gid, aux, &dsuSnap) ) {/* arm 4: CBS */
continue; /* group placed; advance worklist */
}
/* All four arms refused. Desperate retry with a fresh empty
* snapshot — re-runs only the two heavy completeness arms. */
SwissTableHdr fresh = { .slots = NULL, .size = 0, .capacity = 0 };
bool ok = sub_9661B0(state, opListPtr, ctx, gid, aux, &fresh);
sub_4560420(fresh.slots, 16ULL * fresh.capacity, 8);
if (!ok) {
fresh = (SwissTableHdr){ NULL, 0, 0 };
ok = sub_980290(state, opListPtr, ctx, gid, aux, &fresh);
sub_4560420(fresh.slots, 16ULL * fresh.capacity, 8);
if (!ok) {
return 0; /* outer caller sub_982210 bumps II */
}
}
}
return 1;
Arm signatures differ only in arity. Arms 1 and 2 are (state, opListPtr, ctx, gid, aux); Arms 3 and 4 add a 6th SwissTableHdr * parameter for the snapshot. Every arm receives the same gid value — the long-standing "pass-mode enum" reading of a4 was wrong. The pass mode is purely positional, decided by which arm address the driver calls.
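The dispatch shape, stripped of decompiler detail, can be sketched with callable arms. Everything here is illustrative: the `Snapshot` alias, the `Arms` bundle, and `run_or_chain` are hypothetical names, and the arms themselves are supplied by the caller rather than modelled.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <vector>

using Snapshot = std::unordered_map<int, unsigned>;  // op -> depth (hypothetical)
using LightArm = std::function<bool(int gid)>;
using HeavyArm = std::function<bool(int gid, Snapshot &)>;

struct Arms { LightArm permute, fuse; HeavyArm retry, cbs; };

// Returns true when every group was placed; false tells the caller to bump II.
bool run_or_chain(const std::vector<int> &gids, const Arms &arms, Snapshot &snap) {
    for (int gid : gids) {
        if (arms.permute(gid) || arms.fuse(gid) ||
            arms.retry(gid, snap) || arms.cbs(gid, snap)) {
            continue;  // first success wins for this group
        }
        Snapshot fresh;  // desperate retry starts from an empty snapshot
        if (arms.retry(gid, fresh)) continue;
        fresh.clear();
        if (arms.cbs(gid, fresh)) continue;
        return false;
    }
    return true;
}
```

The short-circuit `||` is what makes the order observable: an arm later in the chain is never called for a group that an earlier arm already placed.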
Arm 2 carries one runtime-dispatched branch on gid == 4. Line 682 of sub_97D050 reads v390 = (a4 == 4), then propagates the flag through a 3-tuple aux bundle { ctx, &gid, &is_combiner } to comparator sub_962F50 lines 138-139, which selects between the inline vec0 (gid ≠ 4) and vec1 (gid == 4) views within a 208-byte SwissTable bucket. A second, independent runtime choice picks between sub_9706D0 (with-buffer stable_sort, taken after a successful get_temporary_buffer) and sub_97CFD0 (no-buffer stable_sort); that choice depends on buffer availability, not on gid. The 4 is the combiner-mode sentinel gid; no other gid value reaches the vec1 path.
Per-Arm Roster
Each arm pulls from a distinct set of placement primitives, which is what makes them statically separable in the disassembly. PERMUTE reads the scheduler-level DSU directly and commits through sub_96A820; FUSE walks pairwise group merges through sub_95AFC0 / sub_95B220; RETRY consumes the snapshot via the Abseil fmix64 probe with the depth-viability gate; CBS, the heaviest, delegates to the cost-evaluator ladder described separately. The table below records each arm's entry point, size, role, and the callees that disambiguate it from its neighbours.
| Arm | Address | Size | Role | Distinguishing callees |
|---|---|---|---|---|
| 1 PERMUTE | sub_978B50 | 11 KB | Priority-bucket permutation heuristic; the cheapest arm, fires first. Sorts each priority bucket by predecessor distance via comparator sub_962F50 with the 4*A vs B priority-delta clamp; commit-probes via sub_96A820; on success calls sub_976BE0 (DSU find) at line 1235 and sub_976DE0 (DSU union) at line 1298 to merge the placed group into the fixed schedule. Reads the scheduler-level DSU at schedulerState + 112 directly — no snapshot. | sub_9621A0, sub_966070, sub_96A820, sub_96A8B0, sub_962F50 |
| 2 FUSE | sub_97D050 | 12.3 KB | Combiner-mode pairwise group fusion via DSU. Walks every two placed groups and tries to merge them, gated on MLIR isProperAncestor + property-typeID sentinels. The only arm with a runtime-dispatched specialisation (gid == 4 combiner branch). | sub_95AFC0, sub_95B220, sub_94C4F0, sub_94C5A0, sub_9706D0, sub_97CFD0 |
| 3 RETRY | sub_9661B0 | 11 KB | Multi-phase DSU-snapshot retry. Phase A consults the snapshot's H2 probe to short-circuit on sentinel 0x7FFFFFFF ("op already dead, skip in retry"). Phase B sorts each op's predecessor list via in-place sub_94B190 or out-of-place sub_94B8C0 with a sub_44A8C50-allocated spill. Phase C runs the canonical three-round Abseil fmix64 probe 0x9DDFEA08EB382D69 * (HIDWORD(x) ^ ((8*x & 0x7FFFFFFF8) - 0xAE502812AA7333)) with a depth-viability gate *((u32*)sub_94A550(ctx, &op) + 2) <= v343. Phase D commits via three calls to the RRT teardown-and-re-emit primitive sub_8DDEF0 at lines 1001, 1136, 1171. Phase E re-runs B–D over the remaining unplaced ops. 2171 decompiled lines. | sub_8DDEF0, sub_94A550, sub_94B190, sub_94B8C0, sub_44A8C50, sub_448DF00 |
| 4 CBS | sub_980290 | 6.8 KB | CostBasedScheduleGenerator. Same 6-argument signature as Arm 3; consumes the same DSU snapshot. Heaviest of the four — delegates to the cost-evaluator ladder documented under Schedule::solve and Cost Evaluators. | (see Schedule::solve page) |
The four arms keep non-overlapping callee sets outside the shared readyList / priority / reseed quintet (sub_958C00, sub_959B30, sub_959A10, sub_93E7D0, sub_93E480) and the shared allocator group (sub_45603F0, sub_4560420, sub_456A580). Those distinguishing callees are how the driver's static disassembly partitions into four arms in the first place.
Arm 2 FUSE: stable_sort Cluster and StructConstraint Comparator
Arm 2 at sub_97D050 performs combiner-mode pairwise group fusion via DSU, and it is the only arm with a runtime-dispatched specialisation. Its Phase B sorts a vector of 64-byte StructConstraint records via libc++ std::stable_sort, dispatching at runtime between a with-buffer driver and a no-buffer driver depending on whether std::get_temporary_buffer succeeded.
The libc++ std::stable_sort template fully inlines against the 64-B StructConstraint element type, producing seven recognisable functions. The 7.2-KB body at sub_96EAC0 is the recursive merge step the with-buffer driver invokes — its size comes from HexRays expanding the libc++ template against a 64-B element, not from template explosion in the original C++.
| Address | Size | Role in libc++ idiom |
|---|---|---|
| sub_950050 | 306 B | std::get_temporary_buffer. Canonical halving-on-failure malloc loop plus SSO-aware __uninitialized_copy of 64-B StructConstraint elements. Each element is initialised with *ptr = ptr+2; *(ptr+1) = 0x600000000, the cap=6/size=0 inline-SmallVector tail. |
| sub_9706D0 | 90 B | __stable_sort_move with-buffer driver. Entered when get_temporary_buffer returned a non-null buffer. |
| sub_97CFD0 | 80 B | __stable_sort_with_no_buffer no-buffer driver. 14-element insertion-sort switch matching libc++'s _LIBCPP_STABLE_SORT_SWITCH value 896 / 64. |
| sub_96EAC0 | 7.2 KB | __buffered_inplace_merge. The recursive merge step the with-buffer driver invokes. |
| sub_97BBF0 | 7.5 KB | __merge_without_buffer. Bufferless rotate-merge that calls the GCD-cycle std::rotate at sub_944840. |
| sub_964AC0 | ~360 B | Chunk insertion plus bottom-up merge. |
| sub_963B10 | 2.0 KB | Insertion-sort body for runs below the switch threshold. |
Inside the 7.2-KB body at sub_96EAC0 sit eight repeated StructConstraint] TypeID intern stubs, each guarded by __cxa_guard_acquire (sub_44A8A10) / __cxa_guard_release (sub_44A8AC0) around the TypeID factory sub_44A6CA0. The eight stubs match the eight call sites of the comparator probe sub_962B10 inside the merge recursion. The StructConstraint TypeID is interned exactly once via sub_44A6CA0("mlir::nv_tile_ir::as::schedule_utils::StructConstraint]", 54); subsequent calls hit the guard and return the cached pointer.
The comparator chain runs sub_962B10 → sub_957D10. sub_957D10 reads the 3-tuple aux bundle { ctx, &gid, &is_combiner = (gid == 4) } the FUSE arm assembled before entering the sort. Phase B's sort threads the is_combiner flag through, but only the priority-bucket inner comparator sub_962F50 at lines 138-139 consumes it — selecting between the inline vec0 view (gid ≠ 4) and the inline vec1 view (gid == 4) within a 208-byte SwissTable bucket. The two inline views are the first and second SmallVector tails of the same 64-B StructConstraint record; gid == 4 switches which tail the comparator keys on.
That gid == 4 test is the only magic number in the four-arm OR-chain. It flips a single bit that routes the comparator to read the second SmallVector tail (the combiner / fused subgroup view) rather than the first. Every other reference to gid in the chain is positional — a small integer the driver hands to each arm verbatim. No enum, no mode field, no second magic value: the entire combiner-vs-non-combiner distinction collapses to one equality test against the literal 4.
Phase B as a self-contained pseudocode reduction:
bool sortConstraints(StructConstraint *v, size_t n, Schedule *S, uint32_t gid) {
bool is_combiner = (gid == 4);
AuxTuple aux = { .ctx = S, .gid = &gid, .is_combiner = &is_combiner };
void *buf = sub_950050(v, n); /* std::get_temporary_buffer */
if (buf) {
sub_9706D0(v, buf, n, &aux); /* with-buffer stable_sort */
} else {
sub_97CFD0(v, n, &aux); /* no-buffer stable_sort */
}
return true;
}
Both paths produce bit-identical output for tied keys — the whole reason libc++ exposes the buffer-failure fallback as a separate driver rather than degrading to a non-stable sort. Tileiras leans on it: ties in the StructConstraint order must resolve identically across builds, because the next pass commits fusions in iteration order over the sorted vector.
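The tie-handling guarantee is easy to observe with the standard library alone. In this sketch `Constraint` is a hypothetical two-field stand-in for the real 64-byte record: `key` is what the comparator sees and `seq` records the original position.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Constraint { int key; int seq; };  // seq = original position (hypothetical)

// Sorts by key only; std::stable_sort keeps equal keys in input order.
void sort_constraints(std::vector<Constraint> &v) {
    std::stable_sort(v.begin(), v.end(),
                     [](const Constraint &a, const Constraint &b) {
                         return a.key < b.key;
                     });
}
```

Sorting keys {2, 1, 2, 1} yields the two key-1 records in their original order followed by the two key-2 records in theirs, whichever internal merge strategy the library picks.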
Desperate Retry and DSU Reset Semantics
The block at the bottom of the OR-chain is not a true DSU reset. The scheduler-level DSU at schedulerState + 112 — the group-leader parent-pointer table populated upstream by sub_97B770's attribute parser — stays untouched between attempts. Only the per-attempt snapshot resets.
/* Same II, same attempt, but the four-arm chain refused.
* Clear the local snapshot header, free its slot array, and
* re-run the two heavy completeness arms from a known-empty
* starting state. The scheduler-level DSU is unchanged. */
LODWORD(v43) = 0;
v42 = _mm_setzero_si128(); /* fresh empty SwissTableHdr */
ok = sub_9661B0(state, opListPtr, ctx, gid, aux, v42.m128i_i64);
sub_4560420(v42.m128i_i64[0], 16ULL * (uint32_t)v43, 8);
A second identical block reruns sub_980290 when Arm 3b fails. Clearing the snapshot wipes the 0x7FFFFFFF "skip" markers previous arms wrote into snapshot depth slots — the pre-built snapshot accumulates "ops already considered dead" as the chain progresses, so by the time Arm 3 sees it many ops are already off-limits. The fresh-state retry trades that pre-seeded knowledge for from-scratch semantics: the same arms place more aggressively when they aren't constrained by earlier arms' negative results.
If even the fresh-state retry returns false, the driver writes v15 = 0 and falls through to the cleanup label. Both SwissTable arrays are freed unconditionally via sub_4560420(slots, 16 * capacity, 8) on the way out, and the zero return propagates to sub_982210, which bumps II and reruns the whole pass.
State Machine of One Driver Call
[ENTRY] sub_981D50(schedulerState, opListPtr, ctx, groupHdr, aux)
|
v
[SETUP] sub_95BF00 + sub_95BEF0 -> outer "op-seen" table (v36)
Abseil SwissTable scan -> inner DSU snapshot (v39)
|
v
[for each ready group g in groupHdr, skip sentinels]
|
v
[OR_CHAIN] arm_1 PERMUTE sub_978B50(state, ops, ctx, gid, aux)
succeed? --> next group
arm_2 FUSE sub_97D050(state, ops, ctx, gid, aux)
succeed? --> next group
arm_3 RETRY sub_9661B0(state, ops, ctx, gid, aux, &v39)
succeed? --> next group
arm_4 CBS sub_980290(state, ops, ctx, gid, aux, &v39)
succeed? --> next group
else --v
|
v
[DESPERATE RETRY] fresh empty snapshot &v42:
arm_3b RETRY2 sub_9661B0(state, ops, ctx, gid, aux, &v42)
succeed? --> next group
arm_4b CBS2 sub_980290(state, ops, ctx, gid, aux, &v42)
succeed? --> next group
else --> v15 = 0, fall through to cleanup
|
v
[CLEANUP] free outer + inner SwissTable arrays on every exit path;
return 1 if all groups placed, else 0 (outer sub_982210 bumps II)
Usage and Contract
The modulo scheduler sub_982210 invokes the driver once per candidate II. It consumes the scheduler state (carrying the active II, stage count, global RRT, and the DSU at state + 112 seeded by the constraint-attribute parser), the worklist of ready operations, the group header describing per-group readiness, and an opaque cookie threaded to every arm. Its boolean return value tells the outer scheduler whether to keep the current II or bump it; on success, Arms 1, 3, or 4's commit paths have mutated the schedule state in place, and the materializer reads the final (stage, order) placement off the per-op slot. The driver itself owns no persistent state — every SwissTable scratch it allocates is freed before return on every exit path.
Cross-References
Modulo Scheduler and Rau documents the placement algorithm at fixed II and the arm-level scheduling model. Resource Constraint Builder and RRT covers the footprint tables that Arms 1 and 3 commit through. Schedule::solve and Cost Evaluators covers Arm 4's cost ranking. Serial vs Cost-Based Generators explains where the cost-based arm fits in the broader generator family. Schedule Constraint Attributes documents the DSU at state + 112 that the FUSE arm consults directly.
Schedule Constraint Attributes
Abstract
Before the modulo scheduler runs, sub_97B770 parses every op carrying a tileas.schedule.constraint.* or tileas.* attribute. The 1037-byte routine extracts nine well-known attribute strings off the op and folds them into a ConstraintMap consulted at scheduling time. The map drives two subsystems: the placement driver reads four pipeline-control fields per op, and the rematerialisation passes read five remat-policy fields. The parser also seeds a disjoint-set-union structure at state + 112, unifying ops that share a leader_gid so the driver later treats them as a single fused group.
This page covers the attribute strings, the storage layout of the ConstraintMap slot, the two-step inherent-then-discardable lookup, and the DSU seeding that ties the parser to the driver.
Parsed Attribute Strings
Nine attribute strings come off every op, split between two consumer groups. Four feed the placement driver — gid, leader_gid, max_depth, and the force_serial_execution unit attribute. The remaining five govern rematerialisation policy: preferred_atom_size, max_num_slices_for_non_reduce_axis, max_num_of_recomputations, plus the unit attributes enable_defusion_if_fusion_extending_liveness and recomputable. Frontends may emit any subset; absent strings leave the matching slot field at its zero-fill default.
| String | Type | Consumer | Role |
|---|---|---|---|
| tileas.schedule.constraint.gid | i32 | placement driver | op's group id |
| tileas.schedule.constraint.leader_gid | i32 | placement driver | group-leader gid for DSU union |
| tileas.schedule.constraint.max_depth | i32 | placement driver | viability gate for retry arm (G2 admission) |
| tileas.schedule.constraint.force_serial_execution | UnitAttr | placement driver | forces sequential placement of this op |
| tileas.preferred_atom_size | i32 | remat pass | preferred atom size for slicing |
| tileas.max_num_slices_for_non_reduce_axis | i32 | remat pass | per-axis slice cap |
| tileas.max_num_of_recomputations | i32 | remat pass | recomputation budget |
| tileas.enable_defusion_if_fusion_extending_liveness | UnitAttr | remat pass | allows defusion when fusion grows liveness |
| tileas.recomputable | UnitAttr | remat pass | marks the op as recomputable |
The parser keeps the verbatim attribute strings in its read-only string table and matches them by pointer-or-content compare against the op's attribute dictionary keys.
Two-Step Lookup
sub_97B770 tries the inherent attribute dictionary first, then falls back to the discardable dictionary. Inherent attributes live in the op's Properties storage and are part of the op's definition; discardable attributes sit in a DictionaryAttr on the op and carry no such guarantee, since a pass that does not understand them is free to drop them. Frontends emit scheduling constraints as inherent properties when the op definition reserves a property slot for them, and as discardable attributes otherwise.
Attribute lookupAttr(Op *op, StringRef key) {
if (Attribute a = sub_446DC50(op, key)) /* inherent dict */
return a;
return sub_440E370(op, key); /* discardable dict */
}
sub_446DC50 is the inherent-attribute accessor; sub_440E370 is the discardable one. The parser invokes the pair once per attribute string and takes the first non-null return as the value.
⚡ QUIRK: inherent and discardable can disagree, inherent always wins silently
If an op carries the same constraint key in both its inherent properties slot and its discardable dictionary with different values (which can happen when a pass copies an attribute forward without removing the source), the parser commits the inherent value and never reads the discardable one. There is no diagnostic or warning, and the discardable side stays on the op as a dangling shadow that the next dump pretty-prints alongside the value actually in force. A frontend that round-trips IR through textual form and re-parses risks promoting the shadow into the inherent slot on the second pass and silently flipping the scheduler decision.
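The lookup order and the shadowing it produces can be reproduced with two ordinary maps. `FakeOp` and `lookup_attr` are hypothetical stand-ins and do not use the MLIR API; integer values stand in for attribute payloads.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for an op: two string-keyed dictionaries.
struct FakeOp {
    std::map<std::string, long long> inherent;
    std::map<std::string, long long> discardable;
};

// Inherent dictionary first; the discardable one is read only on a miss.
std::optional<long long> lookup_attr(const FakeOp &op, const std::string &key) {
    auto it = op.inherent.find(key);
    if (it != op.inherent.end()) return it->second;
    it = op.discardable.find(key);
    if (it != op.discardable.end()) return it->second;
    return std::nullopt;
}
```

When the same key holds 3 in the inherent map and 9 in the discardable map, the lookup returns 3 and the 9 is never examined.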
Integer-valued attributes are unwrapped through IntegerAttr::getInt(): the stored value is read as a signed 64-bit integer, then truncated to 32 bits and stored as an unsigned field in the slot. UnitAttr keys are tested for presence only. The parser does not read the unit attribute's content, so a non-UnitAttr value living under one of the unit keys (which the verifier should reject upstream) still trips the flag bit. The six integer fields default to zero when their attribute is absent; the three flag bits default to clear. The parser does not distinguish "explicitly set to zero" from "absent" for the integer fields, so a max_depth = 0 attribute behaves identically to a missing one: both make the G2 admission gate fire on every retry-arm attempt.
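The narrowing step and the zero-versus-absent ambiguity can be checked with plain integer conversions. The two helper names below are hypothetical; a null pointer models an absent attribute.

```cpp
#include <cassert>
#include <cstdint>

// Narrow a signed 64-bit attribute value into a 32-bit slot word.
// Conversion to an unsigned type is modulo 2^32, so the high bits are dropped.
uint32_t to_slot_word(int64_t value) {
    return static_cast<uint32_t>(value);
}

// A present-but-zero attribute and an absent one yield the same slot word.
uint32_t slot_word_or_default(const int64_t *attr) {
    return attr ? to_slot_word(*attr) : 0u;
}
```

A value of -1 lands in the slot as 0xFFFFFFFF, and a value of 2^32 + 5 lands as 5, so out-of-range inputs are silently wrapped rather than rejected.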
ConstraintMap Layout
The ConstraintMap keys on the op handle. sub_94A550(state, op) returns a pointer to a 16-byte record carrying the placement-driver fields, plus three i32 fields immediately after it for the remat numerics:
/* Slot returned by sub_94A550. Stride 28 bytes; placement driver reads */
/* the first 16, remat passes read the trailing 12. */
struct ConstraintSlot {
uint32_t gid; /*+0x00 */ /* tileas.schedule.constraint.gid */
uint32_t leader_gid; /*+0x04 */ /* leader gid for DSU union */
uint32_t max_depth; /*+0x08 */ /* viability gate (G2) */
uint32_t flags; /*+0x0C */ /* bit 0: force_serial_execution */
/* bit 1: recomputable */
/* bit 2: enable_defusion_if_ */
/* fusion_extending_ */
/* liveness */
uint32_t preferred_atom_size; /*+0x10 */
uint32_t max_num_slices_for_non_reduce_axis; /*+0x14 */
uint32_t max_num_of_recomputations; /*+0x18 */
};
The placement driver reads max_depth via *((u32*)slot + 2) <= 1 — that direct word load is the G2 admission gate documented in Serial and Cost-Based Schedule Generators — G2: Max-Depth Viability. All three UnitAttr flags share the same i32 so the driver can probe them with a single masked compare.
The 28-byte stride is just seven uint32_t words, but the field order is not a compiler-chosen or cache-motivated arrangement: it is the layout the placement driver hard-codes through direct word indices. sub_94A550 returns a base pointer and the consumers index it with *((u32*)slot + n) for n ∈ {0..6}. There is no struct definition shared between parser and consumers; the layout exists only as a calling convention spelled out in word offsets at every read site. A reimplementation that reorders these fields must update every consumer simultaneously; otherwise a read site keeps its old word index, so the flags masked compare can land on the max_depth word and the retry gate on the wrong field.
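The split between the placement-driver words (+0x00–+0x0C) and the remat-pass words (+0x10–+0x18) matches the consumer split for the integer fields: the placement driver reads only the first 16 bytes, and the remat numerics all sit in the trailing 12. The one overlap is the flags word at +0x0C, which packs the two remat unit flags (recomputable, enable_defusion_if_fusion_extending_liveness) next to force_serial_execution, so a remat pass that tests those bits has to read that word as well. The parser zero-fills the slot before writing, so a placement-driver read of a remat-only-tagged op sees zeroed gid/leader_gid/max_depth and a clear force_serial_execution bit, and falls through every gate to the default behaviour.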
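A reimplementation can pin the word offsets down at compile time. The sketch below restates the slot as a struct, asserts the offsets listed above, and reads a word by raw index the way the consumers are described as doing; `slot_word` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct ConstraintSlot {
    uint32_t gid, leader_gid, max_depth, flags;
    uint32_t preferred_atom_size;
    uint32_t max_num_slices_for_non_reduce_axis;
    uint32_t max_num_of_recomputations;
};
static_assert(sizeof(ConstraintSlot) == 28, "seven 32-bit words");
static_assert(offsetof(ConstraintSlot, max_depth) == 0x08, "word 2");
static_assert(offsetof(ConstraintSlot, flags) == 0x0C, "word 3");
static_assert(offsetof(ConstraintSlot, preferred_atom_size) == 0x10, "word 4");
static_assert(offsetof(ConstraintSlot, max_num_of_recomputations) == 0x18, "word 6");

// Reads word n (0..6) by raw index; memcpy avoids aliasing problems.
uint32_t slot_word(const ConstraintSlot &slot, unsigned n) {
    uint32_t w;
    std::memcpy(&w, reinterpret_cast<const unsigned char *>(&slot) + 4 * n, sizeof w);
    return w;
}
```

If a field is reordered, the static assertions fail at build time instead of leaving a consumer to read the wrong word at run time.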
DSU Seeding at state+112
A union-find structure sits at offset +112 from the scheduler state base. sub_976BE0 is the find primitive with path compression; sub_976DE0 is the union primitive. The parser uses both to fold every op sharing a leader_gid into the same group:
void parseConstraints(Op *op, void *state, ConstraintMap *map) {
ConstraintSlot s = {0};
if (Attribute a = lookupAttr(op, "tileas.schedule.constraint.gid"))
s.gid = a.getInt();
if (Attribute a = lookupAttr(op, "tileas.schedule.constraint.leader_gid"))
s.leader_gid = a.getInt();
if (Attribute a = lookupAttr(op, "tileas.schedule.constraint.max_depth"))
s.max_depth = a.getInt();
if (lookupAttr(op, "tileas.schedule.constraint.force_serial_execution"))
s.flags |= 1u << 0;
if (Attribute a = lookupAttr(op, "tileas.preferred_atom_size"))
s.preferred_atom_size = a.getInt();
if (Attribute a = lookupAttr(op, "tileas.max_num_slices_for_non_reduce_axis"))
s.max_num_slices_for_non_reduce_axis = a.getInt();
if (Attribute a = lookupAttr(op, "tileas.max_num_of_recomputations"))
s.max_num_of_recomputations = a.getInt();
if (lookupAttr(op, "tileas.enable_defusion_if_fusion_extending_liveness"))
s.flags |= 1u << 2;
if (lookupAttr(op, "tileas.recomputable"))
s.flags |= 1u << 1;
if (s.leader_gid != s.gid) {
sub_976DE0((char *)state + 112, s.gid, s.leader_gid); /* DSU union */
}
map->insert(op, s);
}
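DSU seeding is one of the parser's two side effects outside the map; the other is the pending-set at state + 392 described under Twin Seeding: DSU and Pending-Set. It runs once per op during parsing, so the driver sees a fully-built DSU before its first arm fires.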
⚡ QUIRK: leader_gid defaults to zero, which is itself a valid gid
When an op carries no leader_gid attribute, the zero-fill default leaves s.leader_gid == 0. The parser then compares against s.gid, and any op whose gid is non-zero ends up unioned with the phantom gid-0 root, silently grafted onto whichever real gid-0 group exists in the same scheduler state. A frontend that uses gid 0 for "the entry group" and then forgets to set leader_gid on a gid-7 op will see that op fused with the entry group and scheduled as if it belonged there. The fix at the frontend is to always emit leader_gid equal to gid for ops that lead their own groups, since the parser cannot tell "absent" from "explicitly zero". Tileiras' own emitters do this; ad-hoc IR test inputs frequently do not.
⚡ QUIRK: DSU union direction is gid → leader_gid, not symmetric
sub_976DE0 takes (state+112, child, parent) in that order and grafts the child root under the parent root before path compression runs. The parser passes (s.gid, s.leader_gid), so the gid root becomes a child of the leader_gid root and every later find(gid) returns the leader's root. If two ops in the same intended group disagree on which side is "leader" (op_a says leader_gid = 7, gid = 3 while op_b says leader_gid = 3, gid = 7), the two unions request opposite edges, 3 → 7 and 7 → 3, and which root ends up parenting the other depends on parse order. The placement driver's leader-consistency check at G4 can then pick the wrong leader and treat the group as split when probed against a third op. There is no diagnostic; the symptom is schedule output that changes with op order, so the same source can schedule differently depending on which passes reordered ops upstream.
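Both quirks can be reproduced with a small parent-pointer forest. This sketch assumes that a union whose two roots already coincide is a no-op, which is the usual behaviour but is not confirmed for sub_976DE0; `Dsu` and `seed` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Parent-pointer forest with path compression and no rank, per the text.
struct Dsu {
    std::unordered_map<uint32_t, uint32_t> parent;

    uint32_t find(uint32_t x) {
        auto it = parent.find(x);
        if (it == parent.end() || it->second == x) return x;
        uint32_t root = find(it->second);
        parent[x] = root;  // path compression
        return root;
    }
    // Directional: the child's root is grafted under the parent's root.
    void unite(uint32_t child, uint32_t par) {
        uint32_t c = find(child), p = find(par);
        if (c != p) parent[c] = p;  // assumed no-op when roots already match
    }
};

// Mirrors the parser's seeding step for one op.
void seed(Dsu &dsu, uint32_t gid, uint32_t leader_gid) {
    if (leader_gid != gid) dsu.unite(gid, leader_gid);
}
```

Seeding a gid-7 op with the zero default for leader_gid makes find(7) return 0. Seeding (gid 3, leader 7) before (gid 7, leader 3) leaves 7 as the root, while the opposite order leaves 3 as the root, which is the order sensitivity described above.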
Twin Seeding: DSU and Pending-Set
The parser does not seed one scheduler-state structure but two, and the pair is the full picture of how the attribute pass primes downstream scheduling. The DSU at state + 112 is one half; an Abseil-layout SwissTable pending-set at state + 392 (49 * 8 bytes past the state base) is the other. The parser fills both in the same walk, and both stay frozen for the rest of the schedule.
| Structure | Offset | Shape | Consumer |
|---|---|---|---|
| Disjoint-set forest | state + 112 | Parent-pointer DSU, find with path compression, directional union(child=gid, parent=leader_gid) — no rank, no size heuristic | Placement arms — fuse and retry consult it to keep group leaders consistent |
| Pending-set | state + 392 | SwissTable, control-byte sentinels 0x80 / 0xFE / 0xFF, fmix64 group hash | Cost-based generator's gate G1 |
The DSU records the must-fuse equivalence classes implied by leader_gid. Every op whose leader_gid differs from its gid is unioned with its leader, so the resulting forest's roots are the actual scheduling groups. Placement arms walk the DSU through find whenever they need to know whether two candidate ops belong to the same group; the gate-G4 leader-consistency check in Serial vs Cost-Based Generators — G4: Leader-Group DSU Consistency is the highest-traffic consumer.
The pending-set records ops that have been temporarily removed from consideration — the carry state the cost-based generator uses to hold a candidate over to the next placement attempt without permanently failing it. The gate-G1 membership probe is a single SwissTable find against this table; rejection means "skip this op for this iteration, try again next round." The parser populates the table once at scheduler-init time so the very first gate-G1 probe has a fully-built table to consult.
A reimplementation must seed both structures from the same walk. Splitting the seeding into two passes risks the gate-G1 probe seeing a partially-built pending-set or the gate-G4 check seeing a partial DSU, and either bug surfaces only intermittently when the order of pipeline values happens to expose the seam.
Parse Order and Determinism
The parser is called once per op as the scheduler-init pass walks the region in MLIR's intrinsic block-then-operation order — the same order Operation::walk yields. That order determines two things the downstream consumers rely on: the order DSU unions execute (and therefore which gid wins as the root when two ops disagree about leadership, as the second QUIRK above notes), and the order ops are inserted into the ConstraintMap. Both surfaces are stable across re-runs on the same input IR because MLIR's walk is deterministic, but they are not stable across IR transformations that reorder ops within a block. A pass that hoists or sinks a constraint-bearing op between the front-end and the scheduler can flip the DSU root for groups whose members carry inconsistent leader_gid values, with the symptom that the same source produces different schedules depending on which passes ran upstream. The cure is to enforce leader_gid == gid for group leaders at the frontend so the DSU root choice is no longer order-sensitive.
Usage and Contract
The parser runs once per op at scheduler-init time, before any placement arm fires. It consults the op's inherent properties dictionary first and falls back to the discardable attributes dictionary, reading only the nine string keys listed above — every other attribute on the op is ignored. Two outputs reach the rest of the scheduler. The first is the per-op ConstraintSlot keyed by op handle inside the ConstraintMap, retrieved by every later consumer through sub_94A550(state, op). The second is the seeded disjoint-set forest at state + 112, written only when an op's leader_gid differs from its gid. Frontends emitting the constraint attributes must keep leader_gid consistent across every op in a fusion group — the parser does no symmetry check, and a divergent group will produce two DSU roots that the placement driver treats as independent.
Cross-References
Modulo Driver and 4-Arm OR-Chain documents the placement driver that reads the max_depth G2 admission gate and consults the DSU built here. Schedule::solve and Cost Evaluators documents the cost-based arm that honours force_serial_execution. Serial and Cost-Based Schedule Generators — G2: Max-Depth Viability explains the G2 viability check that gates retry.
Schedule Solve and Cost Evaluators
Abstract
Schedule::solve is not the modulo scheduler and not an optimization solver. It is the materialization step that consumes an already-computed schedule analysis and emits the SSA pipe values the TileAS program needs. Resource search, cost ranking, and initiation-interval work all happen earlier during TileASGenerateSchedule.
The cost evaluators on this page belong to that earlier generation pass. They use paired RRT probes and structural distance constraints to rank candidate placements. Schedule::solve runs none of that machinery — it performs classification, closure, disjoint-set merging, and pipe emission.
Pass Boundary
| Phase | Pass | Main job |
|---|---|---|
| schedule generation | TileASGenerateSchedule | compute stage/order, resource placement, and schedule analysis |
| schedule materialization | MaterializeSchedule | emit Pipe_ and Mutex_ values from the preserved analysis |
The split matters for reimplementation. Treating Schedule::solve as the place where II search happens tangles the compiler architecture and ties the materialization pass to mutable generation state. In Tileiras, generation publishes analysis; materialization reads it.
Schedule::solve
Schedule::solve fires once for each relevant outer operation and candidate consumer. Its output is a set of pipe values that connect producer groups to consumer positions. The body is a greedy disjoint-set propagation pass over six phases — never choosing a new II, never probing the RRT. It sorts candidate operations by (stage, order), builds maps from operation to schedule info, filter state, and owning outer operation, closes the producer set under the "live at this consumer" relation, links producers that share the same original value through a disjoint-set forest, sweeps each disjoint-set root to pick the canonical owner by schedule order, and emits Pipe_ values for producer-to-consumer groups — plus a consumer-only fallback when the live set is empty.
void solve_schedule(Schedule *schedule,
RawValue raw_value,
Operation *consumer) {
Worklist work = collect_candidate_producers(schedule, raw_value, consumer);
stable_sort(work, compare_stage_then_order);
Map<Operation *, ScheduleInfo> info = build_schedule_info(work);
Map<Operation *, Operation *> owner = build_owner_map(work);
DisjointSet dsu = {};
ProducerSet live = close_under_live_at_consumer(work, consumer, info);
for (Operation *producer : live) {
for (Value operand : producer->operands()) {
Operation *def = operand.defining_op();
if (def != nullptr && origin(schedule, operand) == raw_value) { // skip operands with no defining op
dsu.union_nodes(producer, def);
}
}
}
for (DsuRoot root : dsu.roots()) {
SmallVector<Operation *> members = dsu.members(root);
Operation *canonical = choose_earliest_owner(members, info);
emit_pipe_for_group(schedule, canonical, members, consumer);
}
if (live.empty()) {
emit_consumer_only_pipe(schedule, consumer);
}
}
The comparator is lexicographic — lower stage first, then lower order. This solver never consults a scalar cost function, a resource row, or an initiation interval.
Pipe and Mutex Emission
Before Schedule::solve emits final pipes, materialization builds auxiliary maps from the preserved schedule analysis. One path emits mutex values for exclusion relationships, another emits preliminary pipe placeholders, and the solver reconciles those placeholders into final pipe SSA values from the disjoint-set groups. The naming convention is deliberately visible in the IR: Pipe_ values model dataflow between scheduled producer and consumer regions, while Mutex_ values model exclusion or ordering constraints that ordinary value dependencies cannot express.
void emit_pipe_and_mutex(Schedule *S, const ScheduleAnalysis *analysis) {
build_orig_map(S, analysis); // sub_8E2790 probe target
build_second_map(S, analysis); // sub_8E2F00 probe target
seed_mutex_placeholders(S); // exclusion edges first
seed_pipe_placeholders(S); // dataflow edges next
for (CandidatePair p : S->consumer_worklist) {
solve_schedule(S, p.raw_value, p.consumer);
}
collapse_skeleton_pipes(S); // dedup producer groups
rebuild_scheduled_region(S); // splice Pipe_/Mutex_ ops in
verify_scheduled_region(S); // hard postcondition
}
Mutex placeholders go first because their exclusion semantics are stricter than the pipe placeholders. A missing mutex is a correctness bug; a missing pipe edge is a missed-optimization bug. The materializer commits the harder constraint before relaxing into the softer one.
Cost Evaluators
Cost evaluators run during generation. They answer one question: can this candidate fit at this interval or cycle, and how expensive is that choice? Tileiras uses two paired evaluators.
| Evaluator | Pair being modeled | Role |
|---|---|---|
| bank-pressure evaluator | current-iteration and next-iteration resource shadows | checks bank and carry-over resource conflicts |
| pipe-slot evaluator | resource RRT and structural distance matrix | checks resource slots and dependence distance together |
A generic feasible-search driver calls both evaluators. They return a boolean success flag and, on success, a candidate placement state the caller can commit.
bool evaluate_candidate(SearchOutput *out,
Candidate candidate,
uint32_t ii,
ResourceModel *resources,
DistanceMatrix *distances) {
if (!resource_rows_are_free(resources, candidate, ii)) {
return false;
}
if (!distance_window_allows(distances, candidate, ii)) {
return false;
}
out->placement = candidate.placement;
out->cost = compute_lexicographic_cost(candidate, resources, distances);
return true;
}
The cost is lexicographic. Resource feasibility is the hard gate; pipeline-slot utilization and structural distance rank only the candidates that clear it.
Structural Distance Matrix
The pipe-slot evaluator builds an all-pairs distance matrix for the candidate interval. Each edge encodes how far apart two operations must sit after dependence latency, iteration distance, and skew are accounted for. A transitive closure then lets the evaluator query a legal placement window in constant time.
void build_distance_closure(DistanceMatrix *matrix, Graph graph, uint32_t ii) {
matrix->fill(INFINITE_DISTANCE);
for (Edge edge : graph.edges()) {
int32_t distance = edge.latency - (int32_t)(ii * edge.iteration_distance);
matrix->set(edge.src, edge.dst, distance);
}
for (Node k : graph.nodes()) {
for (Node i : graph.nodes()) {
for (Node j : graph.nodes()) {
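int32_t ik = matrix->get(i, k);
int32_t kj = matrix->get(k, j);
if (ik == INFINITE_DISTANCE || kj == INFINITE_DISTANCE) continue; // no path through k
int32_t through = ik + kj;
if (through < matrix->get(i, j)) { // keep the smaller distance, matching the pminsd relaxation in sub_98BEE0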
matrix->set(i, j, through);
}
}
}
}
}
The closure is generation-only state. It is not carried into Schedule::solve.
Search Driver
The evaluators share the same galloping-plus-binary search shape the resource builder uses. The outer search chooses II; the inner search may pick a cycle threshold or candidate row. That nesting lets generation find a feasible schedule without linearly scanning every interval and cycle.
bool search_with_probe(SearchOutput *out,
ProbeFn probe,
uint32_t lower,
uint32_t upper) {
uint32_t hi = (lower > 0) ? lower : 1; // doubling from 0 would never advance
while (hi < upper && !probe(hi, out)) {
hi = min(upper, hi * 2);
}
if (!probe(hi, out)) {
return false;
}
uint32_t lo = lower;
while (lo < hi) {
uint32_t mid = lo + (hi - lo) / 2;
SearchOutput candidate = {};
if (probe(mid, &candidate)) {
hi = mid;
*out = candidate;
} else {
lo = mid + 1;
}
}
return true;
}
Shared Helpers
Several helpers run alongside Schedule::solve on the scheduling-adjacent lowering path. Persistent-loop construction emits the canonical widened-index scf.for used by persistent kernels; shape verification checks that problem shape, tile shape, and cluster-group agree; result-type verification keeps work-tile info results in one consistent representation; generic work-tile info construction adapts async values and CUTLASS work-tile descriptors. None of them participate in the solve itself, but each enforces a contract that scheduling and materialization assume — failing any of these checks marks the schedule invalid before the solver runs.
Schedule::solve Body (sub_8EEE70)
The previous section described Schedule::solve as six abstract phases. The actual implementation in sub_8EEE70
is a 2 269-line, six-phase state machine that materialises four Op*-keyed hashtables and a Union-Find bucket array on the
stack, runs a generator dispatcher, falls back to a cost-based generator on failure, and emits one trivial-schedule
fallback when every generator path fails. The trampoline sub_8F19D0 invokes the body, and that trampoline reaches
the body from Schedule::buildAndSolve (sub_8F1AA0) inside the materialization pass. The constraint graph and
schedule analysis arrive from ResourceConstraintBuilder upstream; sub_8EEE70 consumes that analysis and produces
the final op-to-stage and op-to-order mapping.
Stack-Local State
Four hashtables and one Union-Find bucket array live on the solver stack frame. Bucket strides vary with the payload
size — schedInfo carries a 32-byte payload of stage/order/RRT-row metadata, while the three pointer-typed maps
store a single 8-byte payload after the key.
| Table | Bucket stride | Key | Value stored | Producer |
|---|---|---|---|---|
| schedInfo | 40 B | Op* | current stage, order, RRT row, classification flag bytes | filled by phase 2 topo walk |
| filterMap | 16 B | Op* | "dead in retry" sentinel byte at +8 | written by retry-arm failures |
| parentMap | 16 B | Op* | DSU group leader pointer | seeded from constraint-graph slot |
| opToOwner | 16 B | Op* | Pipe_ / Mutex_ owner reference | written by phase 3 generator |
| UF buckets | 72 B | u64 | {list_ptr, tail_cap32, tail_size32, Op* inline[3]} | seeded from Schedule+0x70 |
All four Op*-keyed maps use the standard llvm::DenseMap open-addressing layout with hash = (op>>9) ^ (op>>4),
empty sentinel 0xFFFFFFFFFFFFF000, tombstone 0xFFFFFFFFFFFFE000, and the 4*(size+1) >= 3*capacity grow rule.
The Union-Find bucket layout is custom: the inline three-slot tail amortises the common case where a DSU group has
at most three members, and overflow spills to the heap via list_ptr. The shape matches the empirical group sizes
ResourceConstraintBuilder produces, which rarely group more than three ops together.
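The hash and grow rule quoted above are small enough to state as code. This is a minimal sketch with hypothetical helper names (`dense_ptr_hash`, `home_bucket`, `must_grow`). It assumes a power-of-two capacity and assumes the pointer is truncated to 32 bits before shifting, as upstream LLVM's pointer hash does; the text above does not state the width.

```cpp
#include <cassert>
#include <cstdint>

// Pointer hash as quoted in the text: (op >> 9) ^ (op >> 4).
// Assumption: the pointer is truncated to 32 bits first.
inline uint32_t dense_ptr_hash(uint64_t op) {
    uint32_t low = static_cast<uint32_t>(op);
    return (low >> 9) ^ (low >> 4);
}

// Home bucket for a power-of-two capacity (assumed; masking needs a power of two).
inline uint32_t home_bucket(uint64_t op, uint32_t capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
    return dense_ptr_hash(op) & (capacity - 1);
}

// Grow rule from the text: 4 * (size + 1) >= 3 * capacity,
// i.e. the insert that would reach three-quarters load triggers a rehash.
inline bool must_grow(uint32_t size, uint32_t capacity) {
    return 4ull * (static_cast<uint64_t>(size) + 1ull) >= 3ull * capacity;
}
```

With a 64-bucket table the rule fires on the insert that would bring the table to 48 entries: `must_grow(47, 64)` holds while `must_grow(46, 64)` does not.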
Six-Phase State Machine
void schedule_solve_body(Schedule *schedule,
uint64_t raw_value,
Operation *consumer) {
// Phase 1: init. Construct five tables and the ready-queue staging vector.
DenseMap sched_info = dense_map_create(/*bucket_stride=*/40);
DenseMap filter_map = dense_map_create(/*bucket_stride=*/16);
DenseMap parent_map = dense_map_create(/*bucket_stride=*/16);
DenseMap op_to_owner = dense_map_create(/*bucket_stride=*/16);
UfBuckets uf = uf_buckets_create(/*bucket_stride=*/72);
SmallVector<Operation *, 0> ready_queue = {};
uf_seed_from_constraint_dsu(&uf, /*src=*/schedule->dsu_at_0x70);
// Phase 2: topo prep. Stamp parentMap[op] = leader for every op.
for (Operation *op : schedule_ops_topological(schedule)) {
ConstraintSlot *slot = sub_94A550(schedule, op);
parent_map_set(&parent_map, op, slot->leader);
}
// Phase 3: generator dispatch. Cheap placement driver runs first.
if (sub_981D50(schedule, &sched_info, &parent_map, &op_to_owner)) {
goto finalize;
}
// Phase 4: cost-based fallback. CostBasedScheduleGenerator with pre-warmed UF.
if (cost_based_schedule_generator(schedule,
&sched_info,
&parent_map,
&op_to_owner,
&uf,
/*cmp=*/sub_8F7900,
/*sort=*/sub_8F7EF0)) {
goto finalize;
}
// Phase 5: zero-producers fallback. Trivial Pipe_ flavour-A schedule.
sub_8E9450(schedule, raw_value, /*producers=*/NULL, /*n_producers=*/0,
/*consumers=*/&consumer, /*n_consumers=*/1);
schedule->flags |= 4u; // mark trivial-fallback exit
finalize:
// Phase 6: materialise Op.start and Op.stage per op, then free everything.
for (Operation *op : schedule_ops(schedule)) {
SchedInfoRow *row = dense_map_find(&sched_info, op);
op->stage = row->stage;
op->start = row->order;
}
dense_map_destroy(&sched_info); // sub_4560420 — bucket slab
dense_map_destroy(&filter_map); // sub_4560420
dense_map_destroy(&parent_map); // sub_4560420
dense_map_destroy(&op_to_owner); // sub_4560420
uf_buckets_destroy(&uf); // sub_4560420 per overflow tail + slab
}
The eight distinct sub_4560420 aligned-free sites in the finalize phase match the four DenseMap bucket
slabs, the UF bucket slab, the UF per-bucket overflow tails, and two scratch SmallVector tails. An inline-vs-heap
check on the SSO capacity word guards every free so the deallocator never runs on stack-resident storage.
Comparator and Heap-Sort
sub_8F7900 is the lexicographic (stage, order) comparator. It reads schedInfo[op].stage first and tie-breaks
on schedInfo[op].order. No resource row enters the comparison — ResourceConstraintBuilder already consulted the
RRT upstream. sub_8F7EF0 is a textbook libc++ __push_heap / __pop_heap pair operating on the 24-strided Op*
ready-queue vector. Both functions appear only in phase 4; phase 3's generator dispatch uses its own internal
ordering driven by the constraint-graph topology.
int sub_8F7900(Operation *a, Operation *b, DenseMap *sched_info) {
SchedInfoRow *ra = dense_map_find(sched_info, a);
SchedInfoRow *rb = dense_map_find(sched_info, b);
if (ra->stage != rb->stage) {
return (ra->stage > rb->stage) ? +1 : -1;
}
return (ra->order > rb->order) ? +1 : -1;
}
The comparator never returns zero. Ties on (stage, order) are impossible at this point because phase 2's topo
walk assigns a unique order to every op, which is what lets the heap-sort produce one deterministic order without
an explicit tie-breaker.
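One consequence for reimplementers: a comparator that never returns zero reports -1 when an op is compared with itself, so adapting it as `cmp(a, b) < 0` does not give the irreflexive strict weak ordering that `std::sort` and the standard heap algorithms require. The sketch below shows one way to express the same (stage, order) ranking safely. `SchedKey` and the helper names are hypothetical, and `op_id` is an illustrative handle rather than a real `Operation*`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical row: only the two fields the comparator reads, plus a handle.
struct SchedKey {
    uint32_t stage;
    uint32_t order;
    int op_id;  // illustrative handle, not a real Operation*
};

// Strict weak ordering over (stage, order): lower stage first, then lower order.
// Unlike a -1/+1 comparator, this returns false when a and b are the same key.
inline bool stage_order_less(const SchedKey& a, const SchedKey& b) {
    return std::tie(a.stage, a.order) < std::tie(b.stage, b.order);
}

// Sorts a copy of the keys and returns the handles in schedule order.
inline std::vector<int> sorted_op_ids(std::vector<SchedKey> keys) {
    std::sort(keys.begin(), keys.end(), stage_order_less);
    std::vector<int> ids;
    ids.reserve(keys.size());
    for (const SchedKey& k : keys) ids.push_back(k.op_id);
    return ids;
}
```

Because every op has a unique (stage, order) pair at this point, the sorted result is the same whichever sorting algorithm is used.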
Callee Inventory (Sampled)
sub_8EEE70 calls 41 distinct functions. The ten most relevant beyond the dispatchers and the eight free sites
appear below.
| Callee | Role |
|---|---|
| sub_94A550 | Constraint-slot lookup: returns the parentMap leader for an op |
| sub_8E4510 | Constraint propagator: walks the DSU forest to finalise group leaders |
| sub_8E2790 | origMap probe at Schedule+80..96: used by phase 3 to find raw-value producers |
| sub_8E2F00 | Second-table fmix64 probe at Schedule+104..120: depth-keyed lookup |
| sub_8F19D0 | Per-pair solve trampoline: caller of sub_8EEE70 |
| sub_8EC560 | Union-Find coalesce: merges groups that share a producer post-generator |
| sub_8E1900 | DSU snapshot copy-out: preserves the final group leaders for the materializer |
| sub_8E4F10 | Alias materialisation, 10 430 bytes: rewrites operand references through Pipe_ SSA |
| sub_8FB180 | parseFromAttrs reading nv_tile.aws.stage and nv_tile.aws.order |
| sub_981D50 | Placement-driver entry: generator dispatched in phase 3 |
sub_8E4F10 is the heaviest callee — it materialises the Pipe_ and Mutex_ operand rewrites that bring the IR
into final form. The 10 430-byte size reflects every operand shape the constraint graph can produce, including the
asymmetric cases where a Pipe_ consumer sits at a different stage than its producer.
Zero-Producers Fallback Semantics
The phase 5 fallback fires only when both the placement driver and the cost-based generator report failure. The body emits a single Pipe_ flavour-A value (the scalar-shaped pipe constructor — see Pipe and Mutex Value Layout for the flavour-A/flavour-B split) with zero producers and exactly one consumer — the consumer argument — then sets the trivial-schedule flag on Schedule.flags. Neither the Mutex_ constructor nor the Pipe_ flavour-B constructor runs from this branch; those primitives are materialised earlier by the walker that precedes the per-pair solve trampoline. The trivial schedule means "ship the consumer with no producers and let later passes diagnose the missing dataflow", not a recovery attempt.
Dual-RRT Cost Evaluators
Once Schedule::solve invokes the placement driver, the cost-based fallback ranks candidate placements through two
dual-RRT cost evaluators. Both are wrapped in std::function<bool(int)>-shaped lambda thunks and route through a shared
exponential-then-binary search driver. The two evaluators answer disjoint feasibility questions: the first measures
actual hardware-resource occupancy under the current II, the second measures structural distance between a
candidate op and its data dependencies. The cost-based scheduler runs both and combines them lexicographically —
pipe-slot first as the legality gate, then bank-pressure as the preference signal.
Exponential-Then-Binary Search Driver sub_988080
The driver accepts a candidate-cost predicate lambda. It expands the cost threshold exponentially
(1, 2, 4, 8, ...) until the predicate flips from false to true, then binary-searches inside the bracketing range
to pin down the smallest threshold at which the candidate becomes feasible. Two lambda thunks ride the driver:
sub_987E70 for the bank-pressure evaluator and sub_987EE0 for the pipe-slot evaluator. Each captures a pointer
to the surrounding Schedule state, the candidate Op*, and the cycle index being probed; the threshold value
flows through the lambda's single integer argument.
uint32_t sub_988080(std::function<bool(int)> probe, uint32_t lower, uint32_t upper) {
uint32_t hi = (lower > 0) ? lower : 1;
// Exponential expansion: double hi until probe flips true or we hit the cap.
while (hi < upper && !probe((int)hi)) {
hi = (hi * 2u <= upper) ? hi * 2u : upper;
}
if (!probe((int)hi)) {
return UINT32_MAX; // candidate never becomes feasible
}
// Binary search inside [lower, hi] for the smallest threshold that satisfies probe.
uint32_t lo = lower;
while (lo < hi) {
uint32_t mid = lo + (hi - lo) / 2u;
if (probe((int)mid)) {
hi = mid;
} else {
lo = mid + 1u;
}
}
return lo; // smallest feasible threshold
}
The return value is the cost-threshold cap at which the candidate first becomes feasible. The cost-based generator treats that value as the candidate's score and picks the minimum across all candidates at the current cycle.
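To see the driver's behaviour on a concrete predicate, the sketch below restates the same exponential-then-binary shape with a probe counter. `SearchResult` and `smallest_feasible` are illustrative names, not symbols from the binary. The sketch assumes the predicate is monotone (once true, true for every larger threshold), which is what makes the binary phase valid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

struct SearchResult {
    uint32_t threshold;  // UINT32_MAX when nothing in [lower, upper] is feasible
    uint32_t probes;     // number of predicate evaluations performed
};

inline SearchResult smallest_feasible(const std::function<bool(int)>& feasible,
                                      uint32_t lower, uint32_t upper) {
    uint32_t probes = 0;
    auto probe = [&](uint32_t t) {
        ++probes;
        return feasible(static_cast<int>(t));
    };
    if (lower > upper) return {UINT32_MAX, probes};

    // Exponential phase: double hi until the predicate holds or the cap is reached.
    uint32_t hi = std::min(upper, std::max<uint32_t>(lower, 1u));
    bool ok = probe(hi);
    while (!ok && hi < upper) {
        hi = (hi <= upper / 2u) ? hi * 2u : upper;
        ok = probe(hi);
    }
    if (!ok) return {UINT32_MAX, probes};

    // Binary phase: smallest threshold in [lower, hi] that satisfies the predicate.
    uint32_t lo = lower;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2u;
        if (probe(mid)) {
            hi = mid;
        } else {
            lo = mid + 1u;
        }
    }
    return {lo, probes};
}
```

For a predicate that first holds at 13 with bounds [1, 64], the exponential phase probes 1, 2, 4, 8, 16 and the binary phase then narrows to 13, using fewer probes than a linear scan from 1 would need.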
Bank-Pressure Evaluator sub_98C440
The bank-pressure evaluator probes a dual-RRT pair: rrt0 holds in-iteration occupancy (resources used by ops
placed at the current iteration of the kernel), rrt1 holds cross-iteration carry (resources still occupied
from the previous iteration's tail at the same modulo cycle). The evaluator reads pool caps 4 (in-iteration)
and 3 (cross-iteration) from the 9-element pool-capacity vector at indices 1 and 6 — TMEM and named-barrier
pools respectively (see Blackwell Pipeline 15-Slot Model — Pool Capacity Vector). The trampoline
sub_98E6A0 wires the evaluator into the driver, and the thunk sub_987E70 adapts the call site to the
std::function<bool(int)> shape sub_988080 expects.
bool sub_98C440(const Schedule *S, const Op *op, uint32_t t, uint32_t cost_cap) {
// rrt0 = in-iteration occupancy; rrt1 = cross-iteration carry.
uint32_t in_iter = countRrtBits(S->rrt0, op->node_rrt, t);
uint32_t cross_it = countRrtBits(S->rrt1, op->node_rrt, t);
// Pool caps from the 9-element capacity vector: index 1 = TMEM, index 6 = named-barrier.
// Hard cap per pool, then a combined budget that the cost driver tightens via cost_cap.
if (in_iter > 4u) return false; // TMEM bank pressure
if (cross_it > 3u) return false; // named-barrier carry
if (in_iter + cross_it > cost_cap) return false;
return true;
}
The combined in_iter + cross_it <= cost_cap term is what the exponential-then-binary search inside sub_988080
walks. The two per-pool caps (4 and 3) are hard gates no amount of cost relaxation can lift — they reflect
physical hardware limits on TMEM banks and named-barrier slots, baked into the binary as immediate constants.
Pipe-Slot Evaluator sub_98E6C0
The pipe-slot evaluator probes a single RRT against an N×N all-pairs distance matrix produced upstream by
sub_98BEE0 — an SSE2-unrolled Floyd-Warshall over the dependence graph. sub_12D0EA0 initialises the matrix,
filling every cell with the sentinel 0x7FFFFFFF before the relaxation loops run. The trampoline sub_990C20
wires the evaluator into the driver, and the thunk sub_987EE0 adapts the call site to the
std::function<bool(int)> shape.
bool sub_98E6C0(const Schedule *S, const Op *op, uint32_t t, uint32_t slot_cap) {
// Single-RRT probe: does the candidate's footprint fit in the current pipe slot?
if (!rrt_probe(&S->rrt_pipe, &op->footprint, t)) {
return false;
}
// All-pairs structural distance gate: every predecessor must reach `op` within slot_cap.
const int32_t *D = S->dist_matrix; // n_ops x n_ops i32, row-major
uint32_t n = S->n_ops;
for (uint32_t p = 0; p < n; ++p) {
if (!is_predecessor(S, p, op)) continue;
int32_t d = D[p * n + op->index];
if (d == 0x7FFFFFFF) continue; // no path between these ops — vacuous
if ((uint32_t)d > slot_cap) return false;
}
return true;
}
All-Pairs Distance Matrix sub_98BEE0
The matrix is n_ops × n_ops i32 cells stored row-major, allocated by sub_44A8C20(4 * n_ops * n_ops). The
inner kernel processes four cells at a time with SSE2 unrolling. The surrounding outer-k / middle-i /
inner-j triple loop is the canonical Floyd-Warshall shape — recognisable in the disassembly by the three-deep
nest with a cmp + cmov (or pminsd after vectorisation) sequence on each inner iteration. The initialiser
sub_12D0EA0 is a memset-shaped fill that writes the 0x7FFFFFFF infinity sentinel into every cell before
the relaxation loops run; the loop then relaxes only edges that exist in the dependence graph to finite distances.
void sub_98BEE0(int32_t *D, const Graph *g, uint32_t ii) {
uint32_t n = g->n_nodes;
sub_12D0EA0(D, /*value=*/0x7FFFFFFF, /*n_cells=*/n * n);
for (Edge e : g->edges) {
int32_t d = e.latency - (int32_t)(ii * e.iter_distance);
D[e.src * n + e.dst] = d;
}
// Floyd-Warshall with SSE2-unrolled inner loop (4 cells per iteration).
for (uint32_t k = 0; k < n; ++k) {
for (uint32_t i = 0; i < n; ++i) {
int32_t dik = D[i * n + k];
if (dik == 0x7FFFFFFF) continue; // skip rows with no path through k
uint32_t j = 0;
for (; j + 4u <= n; j += 4u) {
// Vector block: load 4 D[k*n+j..j+3], add dik, min against D[i*n+j..j+3]. _mm_min_epi32 (pminsd) is SSE4.1, declared in <smmintrin.h>.
__m128i dkj = _mm_loadu_si128((const __m128i *)&D[k * n + j]);
__m128i dij = _mm_loadu_si128((const __m128i *)&D[i * n + j]);
__m128i thru = _mm_add_epi32(dkj, _mm_set1_epi32(dik));
_mm_storeu_si128((__m128i *)&D[i * n + j], _mm_min_epi32(dij, thru));
}
for (; j < n; ++j) { // scalar tail
int32_t thru = D[k * n + j] + dik;
if (thru < D[i * n + j]) D[i * n + j] = thru;
}
}
}
}
The unrolled block uses pminsd (SSE4.1) where available and falls back to a scalar cmp + cmov pair on the
SSE2-only path; the binary carries both code paths under a CPU-feature dispatch handled higher in the resource
builder. The infinity sentinel is meant to survive the relaxation untouched for unreachable pairs. The
dik != 0x7FFFFFFF guard at the top of the middle loop covers the D[i][k] operand only; in the listing as written, a sentinel D[k][j] plus a positive dik wraps in 32-bit arithmetic, so a reimplementation should test both operands before adding.
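For checking a vectorised kernel against known-good output, a portable scalar reference is useful. The sketch below is an assumption-laden reimplementation, not a transcription of sub_98BEE0: `DepEdge`, `distance_closure`, and `kNoPath` are hypothetical names, and the kernel tests both operands against the sentinel and widens the sum to 64 bits so the sentinel can never take part in an addition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int32_t kNoPath = 0x7FFFFFFF;  // sentinel value quoted in the text

struct DepEdge {  // hypothetical edge record
    uint32_t src;
    uint32_t dst;
    int32_t latency;
    uint32_t iter_distance;
};

// Scalar min-plus closure; returns an n x n row-major matrix.
inline std::vector<int32_t> distance_closure(uint32_t n,
                                             const std::vector<DepEdge>& edges,
                                             uint32_t ii) {
    std::vector<int32_t> D(static_cast<std::size_t>(n) * n, kNoPath);
    for (const DepEdge& e : edges) {
        // Edge weight: latency minus II times iteration distance.
        D[e.src * n + e.dst] = e.latency - static_cast<int32_t>(ii * e.iter_distance);
    }
    for (uint32_t k = 0; k < n; ++k) {
        for (uint32_t i = 0; i < n; ++i) {
            int32_t dik = D[i * n + k];
            if (dik == kNoPath) continue;       // no path i -> k
            for (uint32_t j = 0; j < n; ++j) {
                int32_t dkj = D[k * n + j];
                if (dkj == kNoPath) continue;   // no path k -> j
                int64_t thru = static_cast<int64_t>(dik) + dkj;  // widened: cannot wrap
                if (thru < D[i * n + j]) D[i * n + j] = static_cast<int32_t>(thru);
            }
        }
    }
    return D;
}
```

With three nodes, an edge 0 → 1 of latency 8 and iteration distance 0, and an edge 1 → 2 of latency 4 and iteration distance 1 at II = 8, the closure yields D[0][1] = 8, D[1][2] = -4, and D[0][2] = 4, while D[2][0] keeps the sentinel.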
Why Two Evaluators
The two evaluators answer different questions and feed different terms of the lexicographic cost. Bank-pressure
ranks placements by actual hardware-resource occupancy under the current II — the signal that stops the
scheduler from overcommitting TMEM banks or named-barrier slots when several candidates are otherwise tied.
Pipe-slot ranks by structural distance between the candidate op and its data dependencies, using the all-pairs
matrix as a constant-time legality oracle — the signal that rejects placements which would force a dependence
edge to span more than the available pipeline depth. The cost-based scheduler runs both and combines them
lexicographically: pipe-slot is the hard legality gate, bank-pressure is the preference between the candidates
that survive the gate.
Worked Scoring Example
The clearest way to see how the lexicographic cost vector ranks candidate schedules is to walk two concrete placements of the four-op loop body from Blackwell Pipeline 15-Slot Model — Worked Example. Both candidates seat the same four ops; both target II = 8; they differ only in whether the SMEM write seats at the same cycle as the TMA load or one cycle later. The cost vector has four lexicographic components, ordered from hardest to softest:
| Position | Component | Source |
|---|---|---|
| 1 | resource feasibility | RRT row-OR test, capacity-pool caps |
| 2 | pipe-slot legality | structural distance matrix, pre-deps inside the slot |
| 3 | bank pressure | SMEM bank-conflict count |
| 4 | structural distance | dependence-shape preference vs original order |
Candidate A seats every op at II = 8 with a single-stage pipeline:
| op | stage | order | cycle | slots claimed at cycle 0..7 |
|---|---|---|---|---|
| tiled_tma_load | 0 | 0 | 0 | tma + tp_smem_wr |
| smem_write | 0 | 1 | 0 | tp_smem_wr ← collision |
| wgmma | 0 | 2 | 0 | tc_and_mma + tp_mma |
| smem_read | 0 | 3 | 0 | tp_smem_rd |
The RRT probe at cycle 0 finds tp_smem_wr already claimed by tiled_tma_load. Component 1 fails: cost vector is (∞, *, *, *). The candidate is rejected before any later component matters.
Candidate B spreads the SMEM write into stage 1 by seating it at order 1, cycle 0 of the next iteration's modulo window:
| op | stage | order | cycle | slots claimed |
|---|---|---|---|---|
| tiled_tma_load | 0 | 0 | 0 | tma + tp_smem_wr [cycles 0..7] |
| smem_write | 1 | 0 | 8 | tp_smem_wr [cycles 0..6 of next iter, modulo 8] |
| wgmma | 0 | 1 | 0 | tc_and_mma + tp_mma [cycles 0..7] |
| smem_read | 0 | 2 | 0 | tp_smem_rd [cycles 0..6] |
Component 1 passes — every slot has at most one claimant per modulo cycle. The cost reducer moves to component 2.
Component 2 walks the all-pairs distance matrix produced by sub_98BEE0. The tiled_tma_load → smem_read edge has latency 8 and iteration distance 0; the distance matrix reads D[load, read] = 8. The pipe-slot threshold for tp_smem_rd at II = 8 is also 8, so the gate passes with zero slack. Cost contribution from this component is 0 — equal to the threshold means no preference penalty.
Component 3 counts SMEM bank pressure. The bank-pressure evaluator sub_98C440 reads rrt0 and rrt1 at the modulo cycle and checks each count against its own pool cap: in-iteration occupancy against 4 (TMEM) and cross-iteration carry against 3 (named-barrier). For Candidate B the in-iteration occupancy is 2 (load + read on different rows of the SMEM bank) and the cross-iteration carry is 1 (the SMEM write spilling from iteration n−1), so both hard gates pass. The combined sum 2 + 1 = 3 is the smallest cost_cap at which the combined-budget test passes, and that raw sum 3 is the contribution to the cost vector.
Component 4 computes structural distance from the original program order. The original order is (load, write, mma, read) and Candidate B emits (load, mma, read, write) after the modulo wrap — the SMEM write moved past two later ops. The distance penalty is the Kendall-tau metric 2, the number of inversions.
Candidate B's full cost vector is therefore (0, 0, 3, 2). Compare against any alternative that pulls the SMEM write back into stage 0 by raising II to 9: that alternative would have cost vector (0, 0, 2, 0) on its own resources but pays a +1 in the outer II search; the outer driver penalises larger II directly and rejects it before this inner cost reducer ever runs. Among candidates that share the same outer II, Candidate B wins because every alternative either fails component 1 (like Candidate A) or accumulates a larger component-3 or component-4 cost.
The lexicographic comparison is strict: a candidate that improves component 4 at the price of component 3 always loses. This is what keeps the cost-based generator deterministic — the order in which the components rank is fixed at the binary level, and the cost reducer never sums or normalises across components.
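The strict ordering can be sketched with a fixed-size array, since `std::array` already compares lexicographically. The names `CostVector`, `kInfeasible`, `is_feasible`, and `better` are hypothetical, and the maximum `uint32_t` stands in for the ∞ entry of a rejected candidate. The actual representation inside the cost reducer is not established by this page.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>

// Four components, hardest first, in the order of the component table.
using CostVector = std::array<uint32_t, 4>;

// Stand-in for the infinite first component of a rejected candidate.
constexpr uint32_t kInfeasible = std::numeric_limits<uint32_t>::max();

inline bool is_feasible(const CostVector& c) {
    return c[0] != kInfeasible;
}

// std::array's operator< is lexicographic: an earlier component always
// dominates every later one, and nothing is summed or normalised.
inline bool better(const CostVector& a, const CostVector& b) {
    return a < b;
}
```

Candidate B's vector (0, 0, 3, 2) beats a hypothetical alternative (0, 0, 4, 0) even though the alternative has the smaller component sum, because component 3 is compared before component 4.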
Sentinel 0x7FFFFFFF
The constant 0x7FFFFFFF plays two distinct roles inside the scheduler, and both stay correct because the value
never appears as a real cost in either context. Inside the distance matrix it means "no path between these two
ops" and survives the Floyd-Warshall relaxation; elsewhere in the scheduler the same value marks snapshot-dead
retry-arm entries. Reimplementations must keep the two roles separate — some downstream cost combiners would
otherwise treat the snapshot-dead value as a finite cost and mis-rank candidates.
Usage and Contract
Schedule::solve runs inside MaterializeSchedule once per producer-consumer candidate pair. It consumes the cached ScheduleAnalysis produced by TileASGenerateSchedule, the constraint-attribute DSU at Schedule+0x70 seeded by the parser, and the per-pair raw-value handle plus consumer pointer the materializer walker discovered. It produces the per-op stage and order fields on the schedule record and the Pipe_ and Mutex_ SSA values emitted by the alias materialiser sub_8E4F10. Callers must invoke the cost-based fallback only after the cheap placement driver sub_981D50 has declined — inverting that order makes every schedule pay the cost-evaluator price even when the cheap path would have succeeded.
The dual-RRT cost evaluators live inside the cost-based fallback. They consume the candidate op, the cycle index t, the current II, the per-op footprint at op+96, the global RRT pair rrt0 / rrt1 from the schedule state, the all-pairs distance matrix built by sub_98BEE0, and the per-pool caps 4 and 3 at indices 1 and 6 of the 9-element pool-capacity vector. Their output is a smallest-feasible cost threshold per candidate that the cost reducer then ranks lexicographically. The evaluators never mutate the RRT — they only probe.
Cross-References
Modulo Scheduler and Rau-Style Placement documents the surrounding modulo
scheduler and the placement-arm sequence that invokes these evaluators.
Resource Constraint Builder and RRT documents the RRT bit-counting
primitives consumed by the bank-pressure evaluator.
Blackwell Pipeline 15-Slot Model — Pool Capacity Vector documents the 9-element pool-capacity
vector and the TMEM / named-barrier slots referenced by the per-pool caps 4 and 3.
Pipe_ and Mutex_ Value-Header Layout documents the 808-byte header that the alias materialiser writes into the IR.
Blackwell Pipeline 15-Slot Model
Abstract
The Blackwell pipeline model is the target-machine resource vocabulary that drives Tileiras scheduling. It maps scheduled operations onto issue, transport, tensor-memory, shared-memory, and MMA resource slots. Those slots become row bits in the modulo scheduler's Resource Reservation Table, so a candidate operation can be seated only when its footprint does not overlap resources already claimed in the same modulo cycle.
The model defines 24 slot identifiers. Eight are coarse families used for classification and grouping; fifteen fine slots, which include the unknown, omitted_simt, test_simt, and test_mma rows, feed the optimized scheduler's fine-grained pressure model; the one remaining identifier, test_dma, is a test-only row.
Slot Identifiers
Slot identifiers are one-based. The RRT row bit for a slot is slot_id - 1.
| Slot | Name | Kind | Role |
|---|---|---|---|
| 1 | issue | coarse | generic issue family |
| 2 | xu | coarse | transcendental or special-function unit family |
| 3 | xu64 | coarse | 64-bit special-function variant |
| 4 | fp32x2_fp16ultra | fine | paired FP32x2 and packed FP16 issue |
| 5 | alu | coarse | scalar ALU family |
| 6 | alu_or_fmaheavy | fine | ALU or FMA-heavy issue group |
| 7 | dual_alu | fine | dual-ALU datapath |
| 8 | lsu | coarse | load/store family |
| 9 | tmem | coarse | tensor-memory family |
| 10 | mma | coarse | MMA family |
| 11 | tc_and_mma | fine | TensorCore and legacy MMA issue stage |
| 12 | tma | coarse | tensor memory accelerator family |
| 13 | tp_gnic_rd | fine | generic interconnect transport read |
| 14 | tp_gnic_wr | fine | generic interconnect transport write |
| 15 | tp_smem_rd | fine | shared-memory transport read |
| 16 | tp_smem_wr | fine | shared-memory transport write |
| 17 | tp_tmem_rd | fine | tensor-memory transport read |
| 18 | tp_tmem_wr | fine | tensor-memory transport write |
| 19 | tp_mma | fine | MMA transport |
| 20 | unknown | fine | unclassified fallback |
| 21 | omitted_simt | fine | deliberately omitted SIMT operation |
| 22 | test_simt | fine | scheduler self-test SIMT row |
| 23 | test_mma | fine | scheduler self-test MMA row |
| 24 | test_dma | test | scheduler self-test DMA row |
The fifteen primary fine slots are:
fp32x2_fp16ultra
alu_or_fmaheavy
dual_alu
tc_and_mma
tp_gnic_rd
tp_gnic_wr
tp_smem_rd
tp_smem_wr
tp_tmem_rd
tp_tmem_wr
tp_mma
unknown
omitted_simt
test_simt
test_mma
Tensor-memory read and write slots are the clearest Blackwell markers — tcgen05-style tensor-memory load, store, copy, and MMA paths all hinge on them.
Operation Footprints
Each scheduled operation has:
- a slot identifier or slot group;
- a duration in cycles;
- a row footprint describing which resources it occupies at each cycle;
- optional capacity pressure in a pool that cannot be represented by one bit.
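#include <stdint.h>
#define MAX_DURATION 8  /* assumed bound for this sketch */
#define MAX_POOLS 9     /* matches the 9-element pool-capacity vector */
typedef struct ScheduledResourceUse {
    uint32_t slot_id;                 /* one-based slot identifier */
    uint32_t duration;                /* cycles the footprint occupies */
    uint64_t rows[MAX_DURATION];      /* row bits claimed at each cycle */
    uint32_t pool_counts[MAX_POOLS];  /* optional capacity-pool pressure */
} ScheduledResourceUse;
/* Slot ids are one-based, so slot_id must be in 1..24. */
uint64_t resource_row_bit(uint32_t slot_id) {
    return 1ull << (slot_id - 1);
}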
The RRT probes the footprint. Dependency and cost calculation read the latency. The two concepts are related but distinct: a long-latency value can occupy an issue slot for only a moment, while a transport operation can hold transport resources across several cycles.
Latency Families
The model groups operations into latency families that match the scheduler's rough machine model.
| Family | Typical latency | Typical slots |
|---|---|---|
| dual ALU | 2 cycles | dual_alu |
| ordinary ALU or FMA-heavy | 4 cycles | alu_or_fmaheavy, fp32x2_fp16ultra |
| shared-memory or tensor-memory transport | 7 cycles | tp_smem_*, tp_tmem_*, tp_gnic_*, tp_mma |
| TensorCore or MMA issue | 8 cycles | tc_and_mma |
| far memory or cross-cluster anchors | thousands of cycles | modeled as scheduling anchors, not ordinary row duration |
Treat these values as scheduling weights — not a complete microarchitectural latency table, and never source-language semantics.
Capacity Pools
Some resources are modeled by capacities in addition to row bits. The important capacity pools are:
| Pool | Meaning |
|---|---|
| ALU-or-FMA-heavy issue width | limits how many FMA-heavy operations can be admitted in one cycle |
| dual-ALU issue width | limits dual-ALU pressure |
| shared-memory byte budget | constrains shared-memory allocation and spill pressure |
| tensor-memory budget | constrains tensor-memory-backed operations |
| register-bank pairing | models paired register-bank pressure |
| transport singleton slots | keep shared, tensor, and interconnect transports from overbooking |
A debug configuration treats shared memory as effectively unbounded. It isolates whether a scheduling failure comes from shared-memory pressure or from a different resource.
The capacity-pool probe mirrors the row-bit check: count current usage at the modulo cycle, add the candidate's requested count, compare against the cap. Pools with cap 1 behave as singleton resources — the second op claiming the same pool in the same cycle is rejected outright.
bool capacity_pools_allow(const ResourceTable *table,
const ScheduledResourceUse *use,
uint32_t t) {
for (uint32_t k = 0; k < use->duration; ++k) {
uint32_t row = (t + k) % table->ii;
for (uint32_t p = 0; p < table->n_pools; ++p) {
uint32_t requested = use->pool_counts[p];
if (requested == 0) continue;
if (table->pool_usage[row][p] + requested > table->pool_caps[p]) {
return false;
}
}
}
return true;
}
Rau Cost Weight Tables
Six lookup tables anchor the cost model. The placement driver sub_981D50 and its four arms consult them before reserving a slot in the global RRT. The tables do not live in a single rodata blob — they split across the forward-walk seeder sub_12C8DF0, the backward-walk variant sub_12CBDD0, the dispatcher sub_12CEBF0, and the slot-id hashmap builder sub_12CF910. Three tables hold per-op latencies, two hold per-slot cycle anchors, and one holds the per-resource-class capacities. The placement arms read these tables through stable offsets into the 444-byte SchedulerResourcePool and through XMM-word loads from rodata at 0x4CC9980..0x4CC9D70.
Per-Op Latency Table
A contiguous 12-byte stride array seeded by sub_12C8DF0 starting at offset 0 of the resource pool holds the per-op latency table. It carries 23 entries laid out as {u32 latency; u32 op_tag; u32 reserved} within offsets 0..395. That span is 33 stride slots (33 × 12 = 396 bytes), so the 23 entries do not occupy every slot. Each entry pairs a Blackwell op tag (the dialect's internal opcode classifier) with the cycles the cost model charges for a single issue of that tag. The per-op latency assigner reads this table to fill in Op.latency for every candidate before the cost-sort runs.
| Offset | Latency | Op Tag | Slot | Family |
|---|---|---|---|---|
| +0 | 4 | 0x0B | 6 | FMA-heavy |
| +12 | 2 | 0x0C | 7 | dual ALU |
| +24 | 4 | 0x09 | 6 | FMA-heavy |
| +36 | 4 | 0x0A | 6 | FMA-heavy |
| +48 | 4 | 0x0D | 4 | paired FP32x2 / FP16 |
| +60 | 4 | 0x0E | 4 | paired FP32x2 / FP16 |
| +72 | 4 | 0x0F | 4 | paired FP32x2 / FP16 |
| +84 | 4 | 0x10 | 4 | paired FP32x2 / FP16 |
| +216 | 2 | 0x03 | 7 | dual ALU |
| +228 | 2 | 0x01 | 7 | dual ALU |
| +240 | 2 | 0x02 | 7 | dual ALU |
| +252 | 4 | 0x15 | 6 | FMA-heavy |
| +264 | 4 | 0x16 | 6 | FMA-heavy |
| +276 | 4 | 0x17 | 6 | FMA-heavy |
| +288 | 7 | 0x1C | 15 / 16 | SMEM transport |
| +300 | 7 | 0x1E | 17 | TMEM read transport |
| +312 | 7 | 0x1D | 16 | SMEM write transport |
| +324 | 7 | 0x1F | 18 | TMEM write transport |
| +336 | 8 | 0x18 | 11 | TC+MMA issue |
| +348 | 8 | 0x19 | 11 | TC+MMA issue |
| +360 | 7 | 0x20 | 19 | MMA transport |
| +372 | 7 | 0x21 | 13 | gnic read transport |
| +384 | 7 | 0x22 | 14 | gnic write transport |
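The 12-byte entry layout and its stride arithmetic can be sketched as below. The struct follows the {u32 latency; u32 op_tag; u32 reserved} layout given above; the names LatencyEntry, latency_entry_offset, and latency_for_tag are hypothetical, and the linear scan is an assumption, since this page does not document how the assigner actually searches the table.

```c
#include <stddef.h>
#include <stdint.h>

/* 12-byte entry, per the {u32 latency; u32 op_tag; u32 reserved} layout. */
typedef struct LatencyEntry {
    uint32_t latency;
    uint32_t op_tag;
    uint32_t reserved;
} LatencyEntry;

/* Byte offset of stride slot `index` from the start of the table. */
size_t latency_entry_offset(size_t index) {
    return index * sizeof(LatencyEntry);
}

/* Linear scan for op_tag; returns its latency, or 0 when the tag is absent.
   Assumption: the real lookup strategy may differ. */
uint32_t latency_for_tag(const LatencyEntry *table, size_t count,
                         uint32_t op_tag) {
    for (size_t i = 0; i < count; ++i) {
        if (table[i].op_tag == op_tag) {
            return table[i].latency;
        }
    }
    return 0;
}
```

With this layout the last listed entry at +384 is stride slot 32, and its final byte is offset 395.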
High tag ids that the linear stride cannot reach (0x11..0x1B plus a handful of secondary tags) live in XMM-word continuations at rodata 0x4CC9980..0x4CC9A10 for the forward walk and 0x4CC9A20..0x4CC9AE0 for the backward walk. Each XMM word packs two {lat, op_tag} pairs into four i32 lanes; the backward-walk table mirrors the forward-walk encoding but carries reverse-dataflow weights consumed by sub_12CBDD0.
Cycle Anchor Table
Rodata 0x4CC9D10..0x4CC9D70 holds the cycle anchor table — per-slot cycle weights that fix the stage-start cycle every candidate seat time must clear before its slot stays admissible. The slot-cycle-weight reader sub_12CBDD0 consults this table during the dispatcher pass and enforces two architectural ceilings: 5000 cycles for HBM3e bandwidth saturation, 7000 cycles for cross-cluster transfers. Both ceilings land as inline 3-word vectors in self[16] and self[20] of the resource pool. Secondary fence anchors at 1600 and 2000 cycles cover near-SM SMEM spill and intra-cluster fences.
| Rodata | Slot Range | Cycle Weights |
|---|---|---|
| 0x4CC9D10 | 1..4 (issue, xu, xu64, fp32x2_fp16ultra) | 100, 100, 110, 101 |
| 0x4CC9D20 | 5..8 (alu, alu_or_fmaheavy, dual_alu, lsu) | 102, 102, 103, 103 |
| 0x4CC9D30 | 9..12 (tmem, mma, tc_and_mma, tma) | 120, 104, 121, 104 |
| 0x4CC9D40 | 13..16 (gnic rd/wr, smem rd/wr) | 200, 400, 800, 900 |
| 0x4CC9D50 | 17..20 (tmem rd/wr, mma transport, unknown) | 1500, 2000, 2400, 3000 |
| 0x4CC9D60 | misc (test_* and omitted_simt scratch) | 50, 100, 200, 360 |
| 0x4CC9D70 | secondary anchors | 600, 800, 1000, 1200 |
The 5000-cycle HBM3e ceiling is the absolute round-trip budget the scheduler attributes to a worst-case far-memory dependence; the 7000-cycle ceiling is the same budget for TMA traffic that crosses the cluster boundary. Both serve as Big-M terms — every candidate's accumulated latency must stay below them or the placement is rejected outright before any RRT probe runs.
Pool Capacity Vector
A 9-element capacity vector {37, 4, 7, 37, 5, 1, 3, 6, 2} tells the per-cycle pressure summariser sub_12CEBF0 how much of each resource is available in a single modulo cycle. Pool construction installs the vector through nine explicit calls to the capacity rounder sub_12BB050.
| Pool | Capacity | Role |
|---|---|---|
| 0 | 37 | op-tag → latency table, first 37 distinct op tags |
| 1 | 4 | ALU-or-FMA-heavy issue cap |
| 2 | 7 | dual_alu slot fan-out |
| 3 | 37 | shadow of pool 0 for backward-walk |
| 4 | 5 | per-slot issue-width for coarse families |
| 5 | 1 | singleton global lock for SMEM capacity |
| 6 | 3 | dual-issue cap |
| 7 | 6 | alu_or_fmaheavy slot fan-out |
| 8 | 2 | register-bank pairing |
Caps of 1 on transport pools and the SMEM byte budget are what make the tp_smem_*, tp_tmem_*, tp_gnic_*, and tp_mma slots behave as singleton resources — the modulo scheduler rejects any second op claiming the same transport row in the same RRT cycle. The capacity rounder rounds each request up to the next power of two times four-thirds before allocation, so the rodata values are the live counts before rounding.
Tier-2 Global Capacity Struct
sub_12C8DF0 also writes a small struct into the same resource pool. The struct holds five hard inequalities that every committed schedule must respect, lives at the start of the pool's secondary section, and encodes per-tag caps as packed u64 words.
| Offset | Op Tag | Cap | Interpretation |
|---|---|---|---|
| ptr[ 0] | 2 | 262144 | TMEM / register-file byte budget |
| ptr[ 8] | 1 | 3 | max concurrent ALU issue per warp slot |
| ptr[16] | — | 232448 or INT_MAX | SMEM byte budget per SM |
| ptr[20] | 1 | 4 | max concurrent ALU-or-FMA-heavy issue |
| ptr[28] | 8 | 2048 | register-bank width across 8 banks |
The SMEM byte budget at ptr[16] is the 227 KiB Blackwell floor (232448 bytes). Setting TILE_AS_DEBUG_UNLIMITED_SMEM="1" toggles this cell to INT_MAX, letting the developer isolate whether a scheduling failure comes from SMEM pressure or from another resource. The check runs once at pass-init time inside sub_12C8DF0; later admission attempts read the rewritten cell directly.
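The one-time toggle can be sketched as a pure function plus a thin environment wrapper. The function names are hypothetical, and treating only the exact string "1" as enabled is an assumption drawn from the quoted setting above, not a confirmed parsing rule.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SMEM_BUDGET_BYTES 232448  /* 227 KiB = 227 * 1024 */

/* Maps the raw flag value to the number stored in the SMEM budget cell.
   Assumption: only the exact string "1" lifts the limit. */
int smem_budget_for_flag(const char *flag) {
    if (flag != NULL && strcmp(flag, "1") == 0) {
        return INT_MAX;
    }
    return SMEM_BUDGET_BYTES;
}

/* Reads the environment once, as a pass-init step would. */
int smem_budget_from_env(void) {
    return smem_budget_for_flag(getenv("TILE_AS_DEBUG_UNLIMITED_SMEM"));
}
```

Keeping the decision in a pure function makes the init-time behaviour easy to check without touching the process environment.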
Cost Table Consumers
Each of the three readers pulls from a single table and produces one class of cost-model input. The split keeps the per-op latency view, the per-slot cycle anchor view, and the per-class capacity view independently addressable from both placement arms and the cost reducer.
| Cost lookup table | Rodata / Offset | Consumer | Role |
|---|---|---|---|
| Per-op latency, 23 packed entries | SchedulerResourcePool +0..+395 | sub_12C8DF0 | per-op latency assigner; fills Op.latency |
| Forward-walk XMM continuations | 0x4CC9980..0x4CC9A10 | sub_12C8DF0 | high-tag latency lookups (tags 0x11..0x1B) |
| Backward-walk XMM continuations | 0x4CC9A20..0x4CC9AE0 | sub_12CBDD0 | reverse-dataflow latency view |
| Per-slot cycle anchor weights | 0x4CC9D10..0x4CC9D70 | sub_12CBDD0 | slot-cycle-weight reader; applies 5000/7000 ceilings |
| 9-element pool capacity vector | inline arguments to sub_12BB050 | sub_12CEBF0 | per-cycle pressure summariser |
| Tier-2 global capacity struct | SchedulerResourcePool +0..+28 | sub_12C8DF0 | installs hard inequalities (TMEM, ALU, SMEM, regbank) |
The cost reducer that drives CostBasedScheduleGenerator::generateOrRefineScheduleWithConstraint (sub_980290) reaches all three views through the same resource-pool pointer, so the lexicographic cost vector it produces — latency, slot pressure, structural distance — comes from a single coherent snapshot of the tables.
Axis and Buffer Inputs
Names alone do not classify operations. The scheduler consumes analysis facts:
- contiguity, divisibility, and constancy from axis analysis;
- scalar bounds and memory ranges for index expressions;
- buffer lifetime records for shared memory, tensor memory, and auxiliary scratch;
- leader groups and pipeline identifiers from buffer assignment;
- allocation sizes and live ranges from the layout and buffer passes.
Axis analysis decides whether a vector load, TMA coordinate, or pointer expression is aligned and compact enough for a particular resource class. Buffer lifetime decides whether two memory operations share a live resource and must be coupled or separated.
Worked Example: Four-Op Loop Body
The clearest way to read the slot model is to walk a loop body small enough to fit in one RRT and rich enough to touch the transport, MMA, and SMEM rows simultaneously. The body below is the steady-state shape of a software-pipelined matmul inner loop:
%0 = nv_tileas.async.tiled_tma_load %desc, %coord : !smem_ref
%1 = nv_tileas.async.smem_write %src : !smem_ref
%2 = nv_tileas.async.wgmma %a, %b, %c : !tmem_ref
%3 = nv_tileas.async.smem_read %0 : !reg
Each op's resource vector is the triple (slot_id, duration, occupancy) produced by the constraint builder. The classifier reads the op's MLIR opcode plus its operand types, picks the slot from the table at the top of this page, and reads the duration from the latency family.
| Op | Slot | Duration | Occupancy | Family |
|---|---|---|---|---|
| tiled_tma_load %0 | 12 (tma) + 16 (tp_smem_wr) | 8 cycles | 1 each | TMA + SMEM write transport |
| smem_write %1 | 16 (tp_smem_wr) | 7 cycles | 1 | SMEM write transport |
| wgmma %2 | 11 (tc_and_mma) + 19 (tp_mma) | 8 cycles | 1 each | MMA issue + transport |
| smem_read %3 | 15 (tp_smem_rd) | 7 cycles | 1 | SMEM read transport |
The TMA load claims two slots simultaneously: the descriptor stays parked on the tma row while the tensor payload flows through the SMEM write transport. (The wgmma op likewise pairs tc_and_mma with tp_mma, but neither of those rows contends with the SMEM transports.) The cost reducer sees two row contributions for the TMA load, which is why the per-op latency table at offset +288 of the resource pool charges both 0x1C (SMEM transport) and 0x1D (SMEM write transport) variants for the same source-level operation.
Suppose the candidate II is 8. The scheduler probes the four ops in dataflow order and seats each at the earliest legal cycle. The resulting RRT — one 24-bit row per modulo cycle, drawn here only over the slots the example touches — is:
cycle tc_and_mma tma tp_smem_rd tp_smem_wr tp_mma
0 . X . X . ← tiled_tma_load occupies tma + smem_wr
1 . X . X .
2 . X . X .
3 . X . X .
4 . X . X .
5 . X . X .
6 . X . X .
7 . X . X .
// smem_write seats at cycle 0 of next iteration; in the modulo
// view it overlays the same RRT, claiming tp_smem_wr at cycles
// [0..6] mod 8. The probe fails — tp_smem_wr is already busy.
//
// The placement driver bumps smem_write forward; the only legal
// start is cycle 8 mod 8 = 0 of the iteration *after* the TMA
// tail drains, which the modulo scheduler models as a stage-1
// seat with order 0.
The example shows two things at once: (i) singleton transports (tp_smem_wr pool cap = 1) force the modulo scheduler to spread overlapping iterations across stages rather than packing them onto the same cycle, and (ii) the per-op latency table's split between 0x1C and 0x1D exists precisely so the TMA load and the loose SMEM write can be charged at different per-cycle weights — the TMA load's 8-cycle hold is what makes it the structural bottleneck, while the SMEM write's 7-cycle hold lets it slip into the gap one cycle later.
The cost reducer ranks this schedule against any alternative by reading the per-slot cycle weights from rodata 0x4CC9D40 for slots 13..16: 200 for gnic-rd, 400 for gnic-wr, 800 for smem-rd, 900 for smem-wr. A schedule that doubled-up on tp_smem_wr would multiply that 900 by the second user's surcharge; a schedule that kept the SMEM transports balanced pays the base weight once and clears the gate.
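The tp_smem_wr conflict in this example can be reproduced with a small row-bit simulation. The sketch assumes II = 8, an 8-cycle TMA hold seated at cycle 0, and a 7-cycle smem_write, all taken from the table above; the helper names are hypothetical and capacity pools are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define II 8              /* candidate initiation interval from the example */
#define SLOT_TMA 12
#define SLOT_TP_SMEM_WR 16

static uint64_t row_bit(uint32_t slot_id) {
    return 1ull << (slot_id - 1);  /* slot ids are one-based */
}

/* True when `bits` is free for `duration` cycles starting at cycle t. */
static bool probe(const uint64_t rrt[II], uint64_t bits,
                  uint32_t t, uint32_t duration) {
    for (uint32_t k = 0; k < duration; ++k) {
        if (rrt[(t + k) % II] & bits) return false;
    }
    return true;
}

/* Seats the TMA load at cycle 0 for 8 cycles, then counts the start cycles
   at which the 7-cycle smem_write could still claim tp_smem_wr. */
uint32_t legal_smem_write_starts(void) {
    uint64_t rrt[II] = {0};
    uint64_t tma_bits = row_bit(SLOT_TMA) | row_bit(SLOT_TP_SMEM_WR);
    for (uint32_t k = 0; k < 8; ++k) rrt[k % II] |= tma_bits;

    uint32_t legal = 0;
    for (uint32_t t = 0; t < II; ++t) {
        if (probe(rrt, row_bit(SLOT_TP_SMEM_WR), t, 7)) ++legal;
    }
    return legal;
}
```

The count comes out as zero: with the TMA load holding tp_smem_wr in all eight modulo rows, no start cycle within the same stage is free, which matches the write being pushed to a later stage.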
Admission Rule
An operation is legal at cycle t when every occupied row is conflict-free.
bool resource_admit(ResourceTable *table,
const ScheduledResourceUse *use,
uint32_t t) {
for (uint32_t k = 0; k < use->duration; ++k) {
uint32_t row = (t + k) % table->ii;
if ((table->rows[row] & use->rows[k]) != 0) {
return false;
}
}
return capacity_pools_allow(table, use, t);
}
Commit is the same loop with OR assignment after all probes pass.
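A minimal sketch of probe-then-commit, assuming a simplified table with row bits only. MiniTable, MiniUse, and probe_and_commit are hypothetical names, the two size bounds are arbitrary, and the capacity-pool check is omitted for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_II 64        /* assumed upper bound for this sketch */
#define MAX_DURATION 8   /* assumed upper bound for this sketch */

typedef struct MiniTable {
    uint32_t ii;               /* initiation interval, 1..MAX_II */
    uint64_t rows[MAX_II];     /* one row of slot bits per modulo cycle */
} MiniTable;

typedef struct MiniUse {
    uint32_t duration;
    uint64_t rows[MAX_DURATION];
} MiniUse;

/* Probes every occupied row first; only if all are free does it OR the
   footprint in. A failed probe leaves the table untouched. */
bool probe_and_commit(MiniTable *table, const MiniUse *use, uint32_t t) {
    for (uint32_t k = 0; k < use->duration; ++k) {
        if (table->rows[(t + k) % table->ii] & use->rows[k]) {
            return false;
        }
    }
    for (uint32_t k = 0; k < use->duration; ++k) {
        table->rows[(t + k) % table->ii] |= use->rows[k];
    }
    return true;
}
```

Separating the two loops is the design point: a commit that wrote bits while still probing would leave a partial footprint behind whenever a later row conflicted.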
Cross-References
Resource Constraint Builder and RRT consumes the slot identifiers documented here as row bits in its qword footprint stack. Modulo Scheduler and Rau drives the RRT probe and commit against these slots. Modulo Driver and 4-Arm OR-Chain consults the per-op latency table and the 9-element pool-capacity vector during cost ranking. Schedule Solve and Cost Evaluators reads the per-pool caps 4 (TMEM) and 3 (named-barrier) from indices 1 and 6 of the pool-capacity vector. Performance and Cost Model walks the roofline calculation that turns these slot costs into a stage count.
Pipe_ and Mutex_ Value-Header Layout
Abstract
Tileiras's warp-specialized scheduler emits two families of IR-visible SSA values that name the producer/consumer handshakes flowing between agents: Pipe_<N> for streaming dataflow and Mutex_<N> for exclusive access. The same 808-byte (0x328) heap record backs both flavours; the canonical field-by-field layout, the three constructors that allocate it, the shared zero-fill plus self-pointer prologue, and the four-state lifecycle (ZEROED → SKELETON → CONSTRUCTED → PAYLOADED) all live in AsyncValue and BLAKE3 Interning — AsyncValueImpl Header. This page is the scheduler-side companion: it covers what each family means to the schedule DAG, when a coordination value is born, how it transitions through arrival and wait, the nv_tile.aws.stage / nv_tile.aws.order attribute parser that threads scheduling keys back into the header, and the four upstream invariants that decide whether a handshake survives later passes.
The three constructors are anchored at three call sites in the binary: sub_8E0070 writes the literal "Mutex_" into the IR name slot, and sub_8E9450 and sub_8EA0B0 each write "Pipe_" for the scalar and tile flavours respectively (the two Pipe_ constructors stay distinct because the consumer payload initialiser and the verifier they invoke differ). The shared AWS-attribute parser is sub_8FB180, called from sub_8FCD40 (the MaterializeSchedule entry point for Mutex_) and sub_8FD260 (the Pipe_ entry).
Pipe_ vs Mutex_ Semantics
Pipe_ and Mutex_ are the two coordination primitives the warp-specialized model uses. They look similar at the storage level — same 808-byte record, same allocator, same attribute parser — but the contract they enforce is different and the scheduler ranks them at different positions in the dependence DAG.
A Pipe_ value models a producer/consumer handshake over a ring buffer of depth d. The producer arrives on slot k mod d after writing its payload; the consumer waits on the same slot and then reads. The ring buffer allows the producer to run up to d-1 iterations ahead of the consumer, which is what gives software pipelining its overlap. The scheduler treats a Pipe_ as a directed edge with bounded slack: the producer's stage can precede the consumer's stage by any amount up to d-1, and that slack feeds into the dependence-MII computation in Resource Constraint Builder.
A Mutex_ value models an exclusive-access handshake on a single counter slot. The acquiring side bumps the counter, performs the protected work, and releases it; any second acquirer must wait until the release before entering. The scheduler treats a Mutex_ as a serialisation edge with no slack: the protected work in iteration i must complete before the protected work in iteration i+1 can start. That zero-slack semantics is what makes named-barrier allocation in Buffer Assignment and Named Barriers the central handshake mechanism — every Mutex_ consumes exactly one slot from the 32-slot pool and holds it for the full live range.
| Property | Pipe_ | Mutex_ |
|---|---|---|
| Payload | producer-supplied scalar or tile | none — synchronisation only |
| Slack across iterations | up to depth - 1 | none |
| Hardware backing | ring buffer in SMEM or TMEM, optionally with mbarrier object | named barrier slot from the per-CTA 32-slot pool |
| Buffer-assignment cost | depth × per-slot SMEM/TMEM footprint | 1 named barrier slot |
| Schedule-edge semantics | partial order with bounded skew | strict serialisation |
| Constructor address | sub_8E9450 (scalar) / sub_8EA0B0 (tile) | sub_8E0070 |
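The slack column of the table can be expressed as a small helper. This is a sketch only: CoordKind and both function names are hypothetical, and the depth - 1 bound is taken from the table rather than from any recovered code.

```c
#include <stdint.h>

typedef enum { COORD_PIPE, COORD_MUTEX } CoordKind;  /* hypothetical tag */

/* Largest number of iterations the producer may run ahead of the consumer. */
uint32_t max_iterations_ahead(CoordKind kind, uint32_t depth) {
    if (kind == COORD_MUTEX || depth == 0) {
        return 0;          /* strict serialisation: no slack */
    }
    return depth - 1;      /* ring of `depth` slots */
}

/* Ring slot used by iteration `iter` of a Pipe_ with the given depth.
   The caller guarantees depth > 0. */
uint32_t ring_slot(uint32_t iter, uint32_t depth) {
    return iter % depth;
}
```

For a scalar Pipe_ of depth 2, the producer can be at most one iteration ahead, and iterations alternate between slots 0 and 1.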
Lifecycle
Each coordination value passes through three observable phases. The scheduler emits a value into the IR during materialisation; the executing program then arrives on it from the producer side and waits on it from the consumer side. Identity is stable across all three phases.
// Phase 1: construction — emitted once during MaterializeSchedule.
auto value = construct_pipe_or_mutex(producer, consumer, stage, order, depth);
// Phase 2: arrival — producer signals payload-ready or work-complete.
for (int iter = 0; iter < N; ++iter) {
write_payload(value, iter % value.depth); // Pipe_ only
arrive(value, iter % value.depth);
}
// Phase 3: wait — consumer blocks until the matching arrival.
for (int iter = 0; iter < N; ++iter) {
wait(value, iter % value.depth);
use_payload(value, iter % value.depth); // Pipe_ only
}
The scheduler's job is to choose the producer and consumer stages so that the wait in phase 3 is satisfied by an earlier arrival in phase 2 — never a future one. The (stage, order) pair attached to the header at construction time is what makes that scheduling decision durable through every subsequent pass.
Three Constructors
Three constructors share the zero-fill plus self-pointer prologue and then specialise: a Mutex_ flavour at sub_8E0070 backed by one named-barrier slot, a scalar Pipe_ flavour at sub_8E9450 backed by a small ring of scalar values (default depth 2), and a tile-shaped Pipe_ flavour at sub_8EA0B0 backed by a ring buffer with an attached Layout slot for tile traffic. All three end up routing the IR-visible name through the same SSO short-string append helper (sub_44E1740) with literal length 6 for "Mutex_" and 5 for "Pipe_"; both Pipe_ flavours print under the same name Pipe_<N> because the trailing <N> is a per-function monotone counter appended at print time, not a stored field. The schedule comparator never needs to disambiguate them because the payload value's type already encodes the scalar-versus-tile split. The two Pipe_ flavours stay as separate constructors rather than one templated body because the parameter shape, the consumer-payload initialiser, and the verifier they each invoke differ.
The header itself comes from a bump-pointer arena, which guarantees pointer stability. The embedded DenseMap<Operation*, T> instances rely on that stability because they hash on the header address itself; relocating a header after construction silently breaks every later probe. The per-constructor field initialisation, the kind byte that distinguishes the three at runtime, the optional-flag pair that drives the verifier dispatch, and the shared tail that promotes a header from CONSTRUCTED to PAYLOADED are all spelled out in AsyncValue and BLAKE3 Interning — Construction Prologue.
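The print-time naming described above can be illustrated with a short formatter. It is a sketch under the assumption that the counter is passed in by the printer; format_coordination_name is a hypothetical name and does not model the SSO append helper itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "Mutex_<N>" or "Pipe_<N>" into buf. The counter is supplied at
   print time and is not read from the header. Returns the snprintf result,
   i.e. the length the full name needs, excluding the terminator. */
int format_coordination_name(char *buf, size_t cap, bool is_mutex,
                             unsigned counter) {
    const char *prefix = is_mutex ? "Mutex_" : "Pipe_";
    return snprintf(buf, cap, "%s%u", prefix, counter);
}
```

Both Pipe_ flavours go through the same prefix here, which mirrors why the scalar and tile variants are indistinguishable by name alone.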
Attribute Parser
The AWS-attribute parser is the single function sub_8FB180. It walks the parent operation's attribute dictionary looking for two named integer attributes — nv_tile.aws.order (queried first) and nv_tile.aws.stage — both expected to point at the canonical i32 TypeID marker at unk_5BE5F40. stage is the producer's stage index in the steady-state pipeline; order is its intra-stage order. Together they form the lexicographic key (stage, order) that the schedule comparator reads to decide producer-before-consumer in the final emit order.
The parser uses a two-step lookup that reflects how MLIR stores attributes. It first checks a one-byte flag at offset 47 of the parent op — set when the op carries the inline-attribute fast path — and on a hit calls sub_446DC50 to read the inline slot; only on a miss does it fall back to the full dictionary scan via sub_440E370(op + 56, "nv_tile.aws.order", 17). The literal length 17 is the strlen of nv_tile.aws.order and nv_tile.aws.stage (both are exactly 17 bytes), which is why one length argument serves both probes. Any attribute that resolves to a type pointer different from unk_5BE5F40 is rejected as a typed-attribute mismatch — the parser nulls the slot rather than reading the wrong width.
When both attributes are present and type-checked, the parser writes them into the header's status slot. Absent attributes do not fail at parse time — the slot stays at the zero-fill default and the structural check shifts to the AWS verifier, which decides whether the missing keys are tolerable for this flavour or a hard failure. Mutex flavours require both keys; pipe flavours can tolerate one absent key when the matching schedule slot is also absent.
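The lookup-plus-type-guard behaviour can be sketched with stand-in types. Everything named here (Attr, I32_MARKER, read_aws_key) is hypothetical; the sketch models only the dictionary-scan path with a fixed 17-byte compare, not the inline-attribute fast path or MLIR's real attribute storage.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AWS_KEY_LEN 17  /* length of both nv_tile.aws.* keys */

/* Hypothetical stand-in for one attribute dictionary entry. */
typedef struct Attr {
    const char *name;
    size_t name_len;
    const void *type_id;
    int32_t value;
} Attr;

static const char I32_MARKER = 0;  /* its address plays the i32 TypeID role */

/* Finds `key` (exactly AWS_KEY_LEN bytes) and stores its value in *out.
   Missing and wrongly typed attributes both leave *out at zero. */
bool read_aws_key(const Attr *attrs, size_t n, const char *key, int32_t *out) {
    *out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (attrs[i].name_len == AWS_KEY_LEN &&
            memcmp(attrs[i].name, key, AWS_KEY_LEN) == 0) {
            if (attrs[i].type_id != &I32_MARKER) {
                return false;  /* type guard: slot stays zero */
            }
            *out = attrs[i].value;
            return true;
        }
    }
    return false;
}
```

Because both failure paths leave the slot at zero, a later verifier working only from the slot cannot tell a missing key from a mistyped one.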
Verifiers
Three structural verifiers run after construction. Each pins a different invariant; any failure sets the pass-level failure flag so downstream emit-phase consumers skip the corrupt header.
| Verifier | Invariant | Scheduler consequence on failure |
|---|---|---|
| Type-match | producer and consumer view the same SSA value type across the handshake | upstream lowering bug; the schedule never had a coherent producer-consumer pair |
| Depth | the ring-buffer depth is within the hardware limit for the flavour | buffer-assignment over-allocated; the merged pipeline class is wider than the hardware can ring |
| AWS-attribute | nv_tile.aws.stage / nv_tile.aws.order are present when the flavour requires them | Schedule::solve failed to write its result; the schedule comparator has no key to sort by |
The type-match verifier is the strictest of the three: producer and consumer view the same SSA value, so a type mismatch points to an upstream lowering bug rather than a user error. The AWS-attribute verifier is the dispatch hub: it reads the (stage, order) pair the parser wrote and decides whether the schedule is internally consistent.
Schedule-Resource Invariants
Four invariants tie this header back to the scheduler analysis and must hold for a MaterializeSchedule output to be coherent.
- (stage, order) matches the analysis. The pair written into the status slot must be exactly the pair ScheduleAnalysis recorded for the producer op. The schedule comparator reads only this slot at emission time; any drift between analysis and header is invisible until late-pass verification fails.
- Pipe_ depth fits the buffer-assignment record. The ring-buffer depth on a Pipe_ is bounded by the buffer-assignment record's allocated slot count. Phase 4 of buffer assignment may have merged pipelines so two pipes share one physical buffer; the depth on the surviving header must reflect the merged class, not the pre-merge values. The merge path emits "unable to merge two different pipelines: " if the survivors disagree on depth or (stage, order).
- Mutex_ named-barrier index is in [0, 31]. The named barrier index written into the Mutex_ header is one of the slots Phase 2 allocated from the 32-slot pool. Indices outside that range mean a buffer-assignment bug, never a frontend error.
- Mbarrier shape matches the producer-tail flavour. A Pipe_ whose producer_tail lowers to cutlass.pipeline.producer_tail must use a non-transactional mbarrier; the transactional flavour is gated by sub_145AFD0, which emits "using transaction mbarrier is not supported". A Pipe_ whose producer carries a non-empty payload must not be backed by a named barrier alone; the named-barrier path emits "the producer_tail implemented with named barrier does not yet support user payload" because there is nowhere to stage the payload byte count.
A reimplementation must preserve all four. Violating (1) corrupts the per-op stage/order map; violating (2) emits a ring buffer that overflows under steady-state pressure; violating (3) emits a bar.sync against a slot the hardware does not have; violating (4) emits an arrive/wait pair that the runtime accepts but that drops the payload silently.
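Invariants (2) and (3) reduce to simple range checks, sketched below. The function names are hypothetical, and requiring a depth of at least 1 is an added assumption rather than something stated on this page.

```c
#include <stdbool.h>
#include <stdint.h>

#define NAMED_BARRIER_SLOTS 32  /* pool size stated on this page */

/* Invariant (2): the ring depth must fit the allocated slot count.
   Assumption: a usable ring has at least one slot. */
bool pipe_depth_fits(uint32_t depth, uint32_t allocated_slots) {
    return depth >= 1 && depth <= allocated_slots;
}

/* Invariant (3): the named-barrier index must lie in [0, 31]. */
bool mutex_barrier_index_valid(int32_t index) {
    return index >= 0 && index < NAMED_BARRIER_SLOTS;
}
```

A reimplementation might run checks like these right after materialisation, so that an out-of-range index is caught before any bar.sync is emitted.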
Failure Handling
The scheduler sees one of two outcomes per coordination value: a fully constructed header attached to the parent operation, or a verifier-set failure flag that gates the emit phase. The constructor side of failure handling (allocation aborting on OOM, the partially constructed header that the verifier still returns) is layout-mechanics — see the canonical page. The schedule side is the gating: a single malformed handshake sets the pass-level failure bit and skips the emit phase entirely, so a corrupt header never feeds into later passes that walk the schedule DAG. This split keeps the diagnostic stream useful: one focused error per malformed handshake, no cascade.
The failure diagnostics partition into three layers, each emitted by a different function. The constructor layer signals allocation failure through the standard arena OOM path. The attribute-parser layer is silent — a missing or type-mismatched attribute writes zero into the status slot and the verifier reports later. The downstream-pass layer emits the user-visible strings: buffer-assignment Phase 2 emits "fails to assign named barrier" from sub_13692E0 when it runs out of 32-slot pool entries; the pipeline-merge pass (sub_135CD10) emits "mbarrier has wait-like users, cannot share pipeline buffer." when a Pipe_ candidate for sharing already has consumer-side mbarrier waits in flight, and "unable to merge two different pipelines: " when two merge candidates carry mismatched headers.
Usage and Contract
MaterializeSchedule is the only caller and invokes the three constructors after Schedule::solve has emitted its producer-consumer groupings. Each constructor takes the parent operation pointer (the source of the AWS attribute dictionary), the producer-side scheduling info already written by the modulo scheduler, and — for the two Pipe_ flavours — the ring-buffer depth requested by upstream buffer assignment. The IR-visible name string, the (stage, order) pair written by the AWS-attribute parser, and the consumer payload are the public outputs downstream verification, printing, and lowering passes read.
Three things have to happen at materialisation time for a coordination value to be coherent with the rest of the schedule. First, every Pipe_ value emitted into the IR has to outlive the producer's last arrival and the consumer's last wait; the arena lifetime gives this for free, but a pass that drops a Pipe_ value before materialisation finishes leaves dangling references in the schedule's per-op stage map. Second, the value's (stage, order) pair has to be set before any later pass walks the IR — the AWS-attribute parser is responsible for this, and the verifier emits a hard diagnostic if it finds a Mutex_ value with an unset pair. Third, the named-barrier index inside a Mutex_ header has to come from BufferAssignment's 32-slot allocation table; any other source produces a bar.sync against a slot the hardware does not have.
Quirks
The literal length 17 covers both attribute keys exactly. sub_8FB180 passes 0x11u as the length argument to sub_440E370 for both the nv_tile.aws.order and nv_tile.aws.stage lookups. The two attribute names happen to be the same length to the byte (17 ASCII characters each), so the parser hard-codes one literal length and reuses it across both probes. Renaming either attribute in a reimplementation requires changing the corresponding length constant; a rename that drifts the two lengths apart breaks the second probe silently because sub_440E370 does a strict bounded compare.
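A reimplementation can turn the shared-length coincidence into a compile-time check. The sketch below is illustrative only: the constant names and `bounded_key_match` are hypothetical, and the helper stands in for the strict bounded compare attributed to sub_440E370 on the assumption that it requires an exact length match.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Attribute keys as quoted in the text above.
constexpr std::string_view kOrderKey = "nv_tile.aws.order";
constexpr std::string_view kStageKey = "nv_tile.aws.stage";

// The single literal passed for both probes (0x11 == 17).
constexpr std::size_t kAwsKeyLen = 0x11;

// A rename that changes either length fails to compile here instead of
// silently breaking one probe at run time.
static_assert(kOrderKey.size() == kAwsKeyLen, "order key length drifted");
static_assert(kStageKey.size() == kAwsKeyLen, "stage key length drifted");

// Hypothetical stand-in for the bounded compare: the stored name must be
// exactly `len` bytes long and equal to the first `len` bytes of `key`.
inline bool bounded_key_match(std::string_view stored, std::string_view key,
                              std::size_t len) {
  assert(key.size() >= len);  // the probe key must supply at least len bytes
  return stored.size() == len && stored.compare(0, len, key.substr(0, len)) == 0;
}
```

With the two `static_assert` lines in place, renaming one attribute without updating the length constant becomes a build error rather than a silently dead probe.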
The i32 type guard rejects i64 stage indices outright. Both attribute probes compare the returned attribute's type pointer against the single global unk_5BE5F40 and null the slot on mismatch — i64 or index-typed values for stage/order are dropped before they ever reach the header. Combined with the fact that absent attributes also produce a zero slot, the verifier cannot distinguish "attribute was wrong type" from "attribute was missing"; both surface as the same AWS-verifier failure, and the diagnostic loses fidelity.
Named-barrier Pipe_ and user payloads are mutually exclusive. When buffer-assignment chooses the named-barrier backing for a Pipe_ (the cheap path — one slot instead of an mbarrier object), the materializer refuses any payload-carrying producer. The diagnostic "the producer_tail implemented with named barrier does not yet support user payload" is the only signal; a reimplementation that silently coerces payload-carrying pipes into the named-barrier path will compile but lose every arrived payload because bar.sync has no transaction byte count to track.
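The mutual-exclusion rule above can be expressed as a small guard. This is a sketch under assumptions: `PipeBacking` and `check_pipe_backing` are hypothetical names, and only the diagnostic text is taken from the page.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical enum for the two Pipe_ backings discussed above.
enum class PipeBacking { NamedBarrier, MBarrier };

// Returns the diagnostic text when the combination must be rejected, or
// std::nullopt when materialisation may proceed.
inline std::optional<std::string> check_pipe_backing(PipeBacking backing,
                                                     bool producer_has_payload) {
  if (backing == PipeBacking::NamedBarrier && producer_has_payload) {
    return std::string("the producer_tail implemented with named barrier "
                       "does not yet support user payload");
  }
  return std::nullopt;
}
```

Rejecting the combination up front is the safe choice for a reimplementation: the alternative, quietly routing a payload-carrying pipe through the named-barrier path, compiles but drops the payload.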
Cross-References
AsyncValue and BLAKE3 Interning is the field-by-field layout of the 808-byte header these constructors allocate, including the prologue, the three constructor specialisations, the lifecycle state machine, and the failure-handling path. Modulo Scheduler and Rau-Style Placement documents the schedule that supplies the (stage, order) pairs the AWS-attribute parser threads into the header. Schedule Solve and Cost Evaluators describes the materialisation boundary where these headers are emitted into IR. Buffer Assignment and Named-Barrier Binding supplies the ring-buffer depth and the named-barrier index that constrain the schedule-resource invariants above, and is where the pipeline-merge pass sub_135CD10 decides whether two Pipe_ headers can share one physical buffer.
Buffer Assignment and Named-Barrier Binding
Abstract
Once the modulo scheduler has fixed II and the steady-state stage count, a post-pipelining pass binds each pipelined value to a concrete physical buffer (an SMEM region, a TMEM region, or a TMA descriptor slot) and to a named-barrier slot for the producer/consumer handshake. The pass is sub_13692E0. It runs in four phases over the loop body and over every nv_tileas.async.pipeline.create_pipeline op.
It consumes the schedule analysis published by the modulo scheduler and produces a per-pipeline-value allocation record. Later materialization passes lower those records into the Pipe_ and Mutex_ IR documented in Pipe_ and Mutex_ Value-Header Layout.
Phase Outline
The four phases run in a fixed order, and a failure in any of Phases 1 through 3 stops the pass. Phase 1 and Phase 2 are gating: either failure aborts the pass before any physical buffer is committed. Phase 3 walks pipeline values once and dispatches to the SMEM or TMEM binder. Phase 4 merges disjoint-lifetime pipelines so they can share one physical buffer.
| Phase | Worker | Diagnostic on failure |
|---|---|---|
| 1. resolve lifetime | sub_1367080 | "fails to resolve lifetime" |
| 2. assign named barriers | sub_13692A0 → sub_1368BF0 | "fails to assign named barrier" |
| 3. pick buffer class and bind | sub_13606F0; SMEM via sub_1356650 + sub_13513A0; TMEM via sub_1360730 | "fails to assign smem buffer" / "fails to assign tmem buffer" |
| 4. share buffers (union-find) | sub_1361790 | (no failure path; emits "share pipeline buffer") |
LogicalResult bufferAssign(FunctionOpInterface fn) {
    if (failed(resolveLifetime(fn)))     return emit("fails to resolve lifetime");     // Phase 1
    if (failed(assignNamedBarriers(fn))) return emit("fails to assign named barrier"); // Phase 2
    for (PipelineValue *pv : pipelineValues(fn)) {                                     // Phase 3
        BufferClass cls = classify_buffer(pv);
        if (cls == SMEM && failed(assignSmem(pv))) return emit("fails to assign smem buffer");
        if (cls == TMEM && failed(assignTmem(pv))) return emit("fails to assign tmem buffer");
    }
    sharePipelineBuffers(fn);                                                          // Phase 4
    return success();
}
Phase 1 — Resolve Lifetime
Phase 1 walks every nv_tileas.async.pipeline.create_pipeline op and computes the live range of its produced values across the loop body. The walk starts at the producer op and follows the SSA use-def chain through every consumer in the same region, terminating at the last use before the end of the loop body. For pipelined producers the live range crosses the iteration boundary in modulo space, so the walker normalises endpoints into (stage, cycle) pairs that Phase 4 can later compare.
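The endpoint normalisation can be pictured as the usual modulo-scheduling split of an absolute schedule time into a stage and a cycle within the stage. The sketch below assumes that convention (stage = t / II, cycle = t mod II); the names `StageCycle`, `normalise`, and `linearise` are hypothetical and not recovered from the binary.

```cpp
#include <cassert>

// A point in the steady-state schedule.
struct StageCycle {
  int stage;  // which pipeline stage the point falls in
  int cycle;  // offset inside that stage, in [0, II)
};

// Split an absolute, non-negative schedule time into (stage, cycle) for
// initiation interval `ii`.
inline StageCycle normalise(int absolute_cycle, int ii) {
  assert(ii > 0 && absolute_cycle >= 0);
  return StageCycle{absolute_cycle / ii, absolute_cycle % ii};
}

// Inverse mapping, convenient when two endpoints must be ordered.
inline int linearise(StageCycle sc, int ii) {
  assert(ii > 0 && sc.cycle >= 0 && sc.cycle < ii);
  return sc.stage * ii + sc.cycle;
}
```

For example, with II = 4 an endpoint at absolute cycle 11 normalises to stage 2, cycle 3, and `linearise` maps it back to 11.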
Alongside the live-range walk, Phase 1 builds an alias table that records which pipeline values must share storage because they refer to the same underlying buffer. The table is keyed on a memory-flow root — the upstream AllocationOpInterface op that produced the buffer — and seeded by walking every pipeline value's back-cone of allocation ops, then probing the table with find-or-insert semantics.
LogicalResult resolveLifetime(FunctionOpInterface fn) {
    AliasTable alias = alias_table_create();
    for (PipelineValue *pv : pipelineValues(fn)) {
        Operation *root = walk_memory_flow_to_alloc(pv->producer);
        if (root == NULL) return failure();  // unanchored back-cone
        // Find-or-insert. probe() returns the existing slot or installs a new one.
        AliasSlot *slot = alias_table_probe(&alias, root);
        if (slot->head == NULL) {
            slot->head = pv;
        } else {
            link_into_alias_chain(slot->head, pv);
        }
        if (failed(compute_modulo_lifetime(pv))) return failure();
    }
    return success();
}
The alias-table probe uses the same DenseMap shape every scheduler intern table uses: hash (root>>9) ^ (root>>4) against a power-of-two capacity, stride-1 linear probe, and the canonical -4096 / -8192 sentinels in slot byte 0 (see Container Fingerprints — LLVM DenseMap and DenseSet). Phase 4 reads the resulting chains to decide which pipeline values are eligible for buffer sharing — two values that share an alias chain trivially share storage.
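A fixed-capacity version of the find-or-insert probe, following the hash, sentinel values, and stride-1 probing exactly as described above, might look like the sketch below. It is not the recovered implementation: `AliasTable` and `AliasSlot` here are simplified stand-ins, the table never grows or erases (so the tombstone sentinel is only guarded against, never written), and `head` is an integer id rather than a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sentinel keys as described above: -4096 marks empty, -8192 marks a tombstone.
constexpr std::uintptr_t kEmptyKey = static_cast<std::uintptr_t>(-4096);
constexpr std::uintptr_t kTombstoneKey = static_cast<std::uintptr_t>(-8192);

struct AliasSlot {
  std::uintptr_t root = kEmptyKey;  // allocation-op address, or a sentinel
  int head = -1;                    // first pipeline-value id in the chain
};

class AliasTable {
 public:
  explicit AliasTable(std::size_t capacity) : slots_(capacity) {
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);  // power of two
  }

  // Find-or-insert: returns the slot already holding `root`, or claims the
  // first empty slot on the probe sequence.
  AliasSlot& probe(std::uintptr_t root) {
    assert(root != kEmptyKey && root != kTombstoneKey);
    const std::size_t mask = slots_.size() - 1;
    std::size_t index = ((root >> 9) ^ (root >> 4)) & mask;
    for (std::size_t step = 0; step < slots_.size(); ++step) {
      AliasSlot& slot = slots_[index];
      if (slot.root == root) return slot;  // existing entry
      if (slot.root == kEmptyKey) {        // install a new entry
        slot.root = root;
        return slot;
      }
      index = (index + 1) & mask;          // stride-1 probe
    }
    throw std::length_error("alias table full (this sketch never grows)");
  }

 private:
  std::vector<AliasSlot> slots_;
};
```

A caller mirrors the pseudocode above: probe with the allocation root, then either set `head` on a fresh slot or link the value into the existing chain.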
A lifetime that resists normalisation — a cyclic producer chain, a missing iteration anchor, or a producer with no consumer — is fatal. The pass emits "fails to resolve lifetime" and aborts before any barrier or buffer is committed.
Phase 2 — Assign Named Barriers
Phase 2 walks the pipeline-value list and hands each producer/consumer pair one NamedBarrier slot. NamedBarriers are the 32-slot bar.sync mechanism per CTA — distinct from the transactional mbarrier object that other pipeline pages discuss. See mbarrier State Machine for the structural disambiguation. The slot index is encoded as a small integer that the later materializer turns into a bar.sync operand.
The 32-slot pool is the binding constraint. The binder maintains a 32-entry table of currently-bound (stage, cycle) lifetime ranges, one per slot. For each pipeline value, it scans slots in index order looking first for an unbound slot (the fresh-allocate path), then for a slot whose recorded lifetime does not overlap the candidate's lifetime in steady-state (stage, cycle) space (the reuse path). The overlap test is the standard interval check on the modulo-normalised endpoints Phase 1 produced.
LogicalResult assignNamedBarriers(FunctionOpInterface fn) {
    SlotState slots[32] = {0};  // all slots start unbound
    for (PipelineValue *pv : pipelineValues(fn)) {
        // Fresh-allocate pass: pick the lowest-indexed unbound slot.
        int chosen = -1;
        for (int s = 0; s < 32; ++s) {
            if (!slots[s].bound) { chosen = s; break; }
        }
        // Reuse pass: pick the lowest-indexed slot whose lifetime is disjoint.
        if (chosen < 0) {
            for (int s = 0; s < 32; ++s) {
                if (lifetimes_disjoint(slots[s].lifetime, pv->lifetime)) {
                    chosen = s;
                    break;
                }
            }
        }
        if (chosen < 0) return failure();  // pool exhausted
        slots[chosen].bound = true;
        slots[chosen].lifetime = merge_lifetimes(slots[chosen].lifetime, pv->lifetime);
        pv->namedBarrier = chosen;
    }
    return success();
}
Index-order scanning keeps the allocation stable across builds — two compilations of the same function produce the same slot assignments. Reuse stays correct because lifetimes_disjoint works on the modulo-normalised endpoints: two pairs whose live ranges never coexist in the steady state can share one hardware slot without producing a barrier collision.
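The two helpers the pseudocode relies on, lifetimes_disjoint and merge_lifetimes, are not spelled out on this page. One plausible shape, assuming lifetimes are half-open ranges over linearised schedule time (stage * II + cycle) and that a slot records a single covering range, is sketched below. The `Lifetime` struct and the covering-range merge are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Half-open live range [begin, end) in linearised schedule time.
// begin == end denotes an empty range (an unbound slot).
struct Lifetime {
  int begin = 0;
  int end = 0;
};

// Standard interval check: two ranges are disjoint when one ends at or
// before the point where the other begins.
inline bool lifetimes_disjoint(Lifetime a, Lifetime b) {
  assert(a.begin <= a.end && b.begin <= b.end);
  if (a.begin == a.end || b.begin == b.end) return true;  // empty never overlaps
  return a.end <= b.begin || b.end <= a.begin;
}

// Conservative merge: the smallest single range covering both inputs.
inline Lifetime merge_lifetimes(Lifetime a, Lifetime b) {
  if (a.begin == a.end) return b;
  if (b.begin == b.end) return a;
  return Lifetime{std::min(a.begin, b.begin), std::max(a.end, b.end)};
}
```

The covering-range merge is deliberately pessimistic: after a slot has hosted [0, 4) and [6, 8) it reports [0, 8) and will refuse a value living in [4, 6). That can cost a reuse opportunity but never produces a barrier collision.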
When neither fresh allocation nor reuse succeeds for some pair, the pass emits "fails to assign named barrier" and aborts. The named-barrier index later lands in the Mutex_ header documented in Pipe_ and Mutex_ Value-Header Layout.
Phase 3 — Pick Buffer Class and Bind
Phase 3 decides whether each pipeline value lives in SMEM or TMEM, then dispatches to the matching binder. The SMEM path first selects a region inside the SMEM allocation pool, then assigns an offset within that region. The TMEM path allocates a handle from the TMEM region and writes it into the pipeline-value record.
Buffer-class selection is a deterministic function of the value's shape, element type, and producer/consumer pattern. The class names the storage domain; the producer/consumer pattern picks the correct binder mode within that domain.
BufferClass classify_buffer(const PipelineValue *pv) {
    Shape s = pv->tile_shape;
    Type  e = pv->element_type;
    // Subtarget gate: without the Blackwell tmem feature there is no TMEM domain.
    if (!subtarget_has(TMEM_FEATURE)) {
        return SMEM;
    }
    // Tile-shaped values with byte-element types and a footprint above the
    // TMEM threshold land in TMEM; everything else stays in SMEM.
    bool   tile_shaped   = s.rank >= 2 && shape_is_2d_tile(s);
    bool   byte_elements = element_bits(e) >= 8;
    size_t footprint     = shape_bytes(s, e);
    if (tile_shaped && byte_elements && footprint > TMEM_FOOTPRINT_THRESHOLD) {
        return TMEM;
    }
    return SMEM;
}
The threshold reflects Blackwell's TMEM geometry. TMEM is the high-capacity tile store and is too coarse for sub-tile or small-element traffic, so anything that is not a full byte-element tile drops back to SMEM. The Blackwell tmem subtarget feature is the gate documented in NVPTX Subtarget and Feature Matrix; subtargets without it collapse the classifier to SMEM-only.
Once the class is fixed, the binder allocates a per-value record. The record carries the producer-op pointer, the variadic list of consumer-op pointers, the buffer-class enum (SMEM, TMEM, or named-barrier-only), the SMEM byte offset or TMEM handle, the named-barrier index from Phase 2, the steady-state stage count, and the (stage, cycle) lifetime endpoints. TMA descriptor traffic also lands in this record; the TMA path is documented in TMA, Tensormap and cp.async.bulk.
A binder failure is fatal. The pass emits "fails to assign smem buffer" or "fails to assign tmem buffer" and aborts. Common causes are SMEM exhaustion at the chosen stage count, an oversize tile that exceeds the TMEM region, or an alignment requirement that cannot be satisfied at the candidate offset.
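To make the SMEM failure causes concrete, the sketch below models offset assignment as an aligned bump allocation over one region. This is an assumption made for illustration, not the recovered binder: `SmemRegion` and `assign_smem` are hypothetical, and the footprint is assumed to scale as bytes per stage times stage count.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical bump allocator over one SMEM region.
struct SmemRegion {
  std::size_t capacity;    // bytes available in the region
  std::size_t cursor = 0;  // next free byte
};

// Returns the assigned byte offset, or std::nullopt when the request cannot
// be satisfied (the case the pass reports as "fails to assign smem buffer").
// Overflow of bytes_per_stage * stage_count is not checked in this sketch.
inline std::optional<std::size_t> assign_smem(SmemRegion& region,
                                              std::size_t bytes_per_stage,
                                              std::size_t stage_count,
                                              std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);  // power of two
  const std::size_t aligned = (region.cursor + alignment - 1) & ~(alignment - 1);
  const std::size_t total = bytes_per_stage * stage_count;
  if (aligned > region.capacity || total > region.capacity - aligned) {
    return std::nullopt;  // exhausted, or alignment pushed the offset too far
  }
  region.cursor = aligned + total;
  return aligned;
}
```

In this model, lowering the stage count or the tile footprint shrinks `total`, which is why those are the knobs that usually turn a binder failure into a success.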
Phase 4 — Share Buffers
Phase 4 pools pipeline values into shared physical buffers. The pool is a union-find keyed on pipeline-value identity; each equivalence class names one physical buffer. Two pipeline values qualify to merge when their buffer class, element type, and footprint agree exactly and their (stage, cycle) lifetimes are disjoint in the steady-state schedule. Buffer-class agreement is the legality gate; lifetime disjointness is the correctness gate.
The lifetime overlap test mirrors Phase 2's: a merged class records the union of its members' live ranges, and a new member joins only when its live range stays disjoint from that union. That keeps the merge associative — merging (a, b) and then (ab, c) produces the same outcome as merging (b, c) first.
void sharePipelineBuffers(FunctionOpInterface fn) {
    UnionFind uf = uf_init(pipelineValueCount(fn));
    for (auto [a, b] : candidatePairs(fn)) {
        if (a->bufferClass  != b->bufferClass)  continue;  // legality gate
        if (a->element_type != b->element_type) continue;
        if (a->footprint    != b->footprint)    continue;
        Lifetime la = uf_class_lifetime(&uf, a);
        Lifetime lb = uf_class_lifetime(&uf, b);
        if (!lifetimes_disjoint(la, lb)) continue;         // correctness gate
        uf_union(&uf, a, b);
        emit("share pipeline buffer");
    }
}
Each successful merge emits the diagnostic "share pipeline buffer". Failures here are not fatal — an unmerged pipeline simply keeps its own buffer. Phase 4 exists to recover SMEM and TMEM capacity in deep pipelines, where the modulo scheduler can produce many pipeline values whose lifetimes never actually coexist at any one cycle.
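The union-find with per-class lifetimes can be sketched in a few dozen lines. The version below covers only the correctness gate (lifetime disjointness); the buffer-class, element-type, and footprint checks are left to the caller. `SharingClasses`, `Range`, and `try_share` are hypothetical names, and keeping every member range per class (rather than one merged range) is a design choice of this sketch, not something the page states.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct Range { int begin; int end; };  // half-open, linearised schedule time

inline bool overlaps(const Range& a, const Range& b) {
  return a.begin < b.end && b.begin < a.end;
}

class SharingClasses {
 public:
  explicit SharingClasses(const std::vector<Range>& lifetimes)
      : parent_(lifetimes.size()), members_(lifetimes.size()) {
    std::iota(parent_.begin(), parent_.end(), 0);  // every value starts alone
    for (std::size_t i = 0; i < lifetimes.size(); ++i) {
      members_[i].push_back(lifetimes[i]);
    }
  }

  // Representative of the class containing value v (with path halving).
  int find(int v) {
    assert(v >= 0 && static_cast<std::size_t>(v) < parent_.size());
    while (parent_[v] != v) {
      parent_[v] = parent_[parent_[v]];
      v = parent_[v];
    }
    return v;
  }

  // Merge the classes of a and b only if no member range of one overlaps a
  // member range of the other. Returns true when a merge happened.
  bool try_share(int a, int b) {
    const int ra = find(a), rb = find(b);
    if (ra == rb) return false;  // already share a buffer
    for (const Range& x : members_[ra])
      for (const Range& y : members_[rb])
        if (overlaps(x, y)) return false;
    parent_[rb] = ra;
    members_[ra].insert(members_[ra].end(), members_[rb].begin(), members_[rb].end());
    members_[rb].clear();
    return true;
  }

 private:
  std::vector<int> parent_;
  std::vector<std::vector<Range>> members_;
};
```

With lifetimes [0, 4), [4, 8), and [2, 6), the first two values merge into one class and the third is refused because it overlaps a member of that class.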
Per-Record Allocation
The 0x348-byte record is the canonical unit of buffer-assignment state. Phase 1 allocates it up front, Phases 2 and 3 populate it, and Phase 4 may merge it with another.
| Field | Source phase |
|---|---|
| producer-op pointer | Phase 1 |
| consumer-op pointers (variadic) | Phase 1 |
| (stage, cycle) lifetime endpoints | Phase 1 |
| stage count | Phase 1 (from schedule analysis) |
| named-barrier index | Phase 2 |
| buffer-class enum | Phase 3 |
| SMEM offset / TMEM handle | Phase 3 |
| union-find parent | Phase 4 |
The record is consumed downstream by the Pipe_ and Mutex_ materializer, which copies the named-barrier index and buffer-class enum into the 808-byte value header documented in Pipe_ and Mutex_ Value-Header Layout.
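As a reading aid, the field inventory in the table can be written down as a plain struct. This is a field list only: it does not reproduce the 0x348-byte layout, and the type choices, the `Unassigned` enumerator, the -1 sentinels, and `ready_for_materialisation` are all assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class BufferClass : std::uint8_t { Unassigned, Smem, Tmem, NamedBarrierOnly };

struct StageCycle {
  int stage = 0;
  int cycle = 0;
};

// Field inventory only; offsets and sizes of the real record are not modelled.
struct AllocationRecord {
  // Phase 1
  const void* producer = nullptr;            // producer-op pointer
  std::vector<const void*> consumers;        // consumer-op pointers (variadic)
  StageCycle live_begin, live_end;           // (stage, cycle) lifetime endpoints
  int stage_count = 0;                       // from schedule analysis
  // Phase 2
  int named_barrier = -1;                    // -1 until a slot in [0, 32) is bound
  // Phase 3
  BufferClass buffer_class = BufferClass::Unassigned;
  std::uint32_t offset_or_handle = 0;        // SMEM byte offset or TMEM handle
  // Phase 4
  int share_parent = -1;                     // union-find parent; -1 = own buffer
};

// True once Phases 1 through 3 have filled in what the materializer copies.
inline bool ready_for_materialisation(const AllocationRecord& r) {
  return r.producer != nullptr && r.named_barrier >= 0 && r.named_barrier < 32 &&
         r.buffer_class != BufferClass::Unassigned;
}
```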
Usage and Contract
The pass runs once per function, after TileASGenerateSchedule produces a valid ScheduleAnalysis and before MaterializeSchedule rewrites IR. It consumes the per-op (stage, order) assignment, the steady-state II and stage count, every nv_tileas.async.pipeline.create_pipeline op in the function body, and the Blackwell tmem subtarget feature flag from the target description. It emits two things: the 0x348-byte per-pipeline-value allocation records (one per pipeline value, populated incrementally across the four phases) and the union-find merge map that tells the materializer which records share a physical buffer. A failure in any of Phases 1 through 3 aborts the function before any IR is rewritten. Phase 4 has no failure path: a declined merge yields a less efficient but still correct schedule.
Diagnostics
A buffer-assignment failure should include enough state to distinguish the four phases:
- the candidate II and stage count;
- the failing phase and the matching diagnostic string ("fails to resolve lifetime", "fails to assign named barrier", "fails to assign smem buffer", "fails to assign tmem buffer");
- the pipeline-value id and its computed (stage, cycle) endpoints;
- the current occupancy of the 32-slot named-barrier pool;
- the SMEM region or TMEM region offset map at the point of failure;
- the buffer-class decision and the element-type / footprint inputs that produced it.
Together they let users separate an impossible loop body from a heuristic failure that can be retuned by changing the stage count, the tile size, or the buffer-class threshold.
Cross-References
Modulo Scheduler and Rau-Style Placement publishes the II and stage count consumed here. Pipe_ and Mutex_ Value-Header Layout documents the 808-byte header that carries the buffer-class enum and named-barrier index downstream. NVPTX Subtarget and Feature Matrix defines the Blackwell tmem gate consulted by Phase 3. TMA, Tensormap and cp.async.bulk covers the TMA descriptors that share this allocation record.
Conversion / Lowering Overview
A verified TileIR module reaches NVVM-ready MLIR through a staged dialect-conversion pipeline:
cuda_tile -> nv_tileaa -> nv_tileas -> llvm/nvvm -> targeted gpu.module
Every stage shares the same shape: declare which dialects and operations are legal, populate a rewrite pattern set, convert types through the stage's type converter, run conversion, and verify that the previous abstraction level has not leaked through. The public contract is the sequence of legality boundaries — not the identity of any recovered helper in the binary.
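The final step of each stage, checking that the previous abstraction level has not leaked through, can be modelled without MLIR at all. The toy below treats an op as its printed name and a legality boundary as a set of illegal dialect prefixes. It is a sketch for intuition only: `dialect_of` and `leaked_ops` are hypothetical helpers, and the real pipeline performs this check through the MLIR conversion framework rather than string matching.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Dialect prefix of a printed op name: the text before the first '.'.
inline std::string dialect_of(const std::string& op_name) {
  return op_name.substr(0, op_name.find('.'));
}

// Ops that violate the boundary: any op whose dialect is declared illegal
// for the stage that has just finished.
inline std::vector<std::string> leaked_ops(
    const std::vector<std::string>& op_names,
    const std::set<std::string>& illegal_dialects) {
  std::vector<std::string> leaked;
  for (const std::string& name : op_names) {
    if (illegal_dialects.count(dialect_of(name)) != 0) leaked.push_back(name);
  }
  return leaked;
}
```

After the first stage the illegal set would contain `cuda_tile`; a non-empty result corresponds to the stage failing its legality target.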
Provenance vs Upstream MLIR
The four NVIDIA-specific stages (ConvertCudaTileToTileAA, ConvertTileAAToTileAS, ConvertTileASToLLVM, ConvertCuteAndCuteNvgpuToLLVM) and the TranslateDebugInfo rewrite have no upstream MLIR counterpart — they exist because cuda_tile, nv_tileaa, nv_tileas, cute, cute_nvgpu, and cutlass are NVIDIA-introduced dialects (see each dialect's Provenance vs Upstream MLIR section). The two stages that touch only upstream-linked dialects — ConvertNVGPUAndGPUToNVVM and AttachNVVMTarget — reuse the upstream populators populateNVGPUToNVVMConversionPatterns and populateGpuToNVVMConversionPatterns essentially unchanged; the SM-feature gates and bare-pointer ABI choices ride on configuration, not on rewritten patterns. The LLVM type converter the cascade shares is upstream MLIR's LLVMTypeConverter with one tileiras override (async/pipeline token width fixed at i32 with the low bit carrying parity).
cute, cute_nvgpu, and cutlass are companion dialects rather than a single linear rung. They may survive one lowering stage when a later sister pass owns their conversion. The arrangement is intentional: TileAS handles scheduling and layout, while the CuTe/CUTLASS families carry atom, descriptor, and pipeline structure until NVVM conversion can emit the right target intrinsics.
Cascade
TileIR bytecode
|
v
cuda_tile
|
v public tile IR -> alias-aware tile IR
nv_tileaa
|
v scheduling/layout/materialization
nv_tileas + cute/cute_nvgpu/cutlass
|
v ABI and intrinsic lowering
llvm + nvvm
|
v kernel and target finalization
gpu.module with #nvvm.target
|
v
LLVM/NVPTX serialization
Dialect Roles
| Dialect | Role in lowering | Exit condition |
|---|---|---|
| cuda_tile | Public input dialect: tile math, views, tokens, entry ops, and structured control flow. | No cuda_tile operations remain after the first conversion. |
| nv_tileaa | Alias-aware tile algebra that keeps tile semantics explicit while introducing internal memory and token forms. | TileAA compute and memory ops are either lowered to TileAS or explicitly kept as legal bridge ops. |
| nv_tileas | Scheduling-aware tile IR: async pipelines, TMA descriptors, CTA/cluster behavior, layouts, buffers, and staged execution. | Hardware-facing TileAS ops become LLVM/NVVM, inline asm, or companion-dialect constructs. |
| cute, cute_nvgpu, cutlass | Companion dialects for layout algebra, MMA/copy atoms, and pipeline abstractions. | Lowered by their dedicated passes when enough target information and LLVM-compatible types exist. |
| gpu | Standard MLIR GPU container and builtin GPU queries. | Thread/block/cluster queries, barriers, launches, and GPU functions become NVVM/LLVM operations. |
| llvm, nvvm | Terminal MLIR form before translation to llvm::Module. | The module has kernel attributes, target metadata, ABI-ready arguments, and no high-level tile operations. |
Stage Contracts
cuda_tile -> nv_tileaa
The producer-facing legality boundary. Elementwise math may become standard arith or math; tile, view, token, memory, reduction, scan, MMA, and entry operations become TileAA operations. This stage also establishes the type converter that maps public tile/view/token types to internal equivalents.
Key invariant: cuda_tile is illegal after this pass. A producer bug should surface here, while the IR is still close to the public dialect.
nv_tileaa -> nv_tileas
TileAA describes what the program means; TileAS begins describing how the program will execute. This stage introduces layout-aware constants, schedulable memory operations, async pipeline structure, and TileAS function forms, while preserving the ordinary arith, math, and bridge operations that later passes still own.
Key invariant: tile-level memory and compute are now in the dialect the scheduler and layout passes understand.
nv_tileas -> llvm/nvvm
Scheduled tile execution becomes ABI-ready LLVM and NVVM. Loads, stores, allocas, layout conversions, async pipeline ops, cluster barriers, TMA operations, and target-specific helpers turn into LLVM dialect operations, NVVM intrinsics, or tightly scoped inline assembly.
Key invariant: once this stage completes, TileAS no longer owns executable semantics. Any surviving companion-dialect operations must be explicitly legal because a sister pass will lower them.
Companion and GPU Lowering
The CuTe/CUTLASS/NVGPU path lowers layout atoms, TMA copies, WGMMA/tcgen05 operations, grid-constant argument attributes, and kernel markers. The standard GPU lowering path handles thread/block/cluster IDs, barriers, dynamic shared memory, printf, subgroup operations, GPU functions, returns, and launch packing.
Key invariant: before serialization, the surviving gpu.module contains only LLVM/NVVM-compatible operations and exactly one resolved target.
Kernel Entry ABI
Kernel tagging is staged because the function is not ABI-ready until after function-type conversion. Early lowering marks the intended entry point with a dialect-level kernel marker and carries launch metadata — requested threads, cluster dimensions, CTA count, occupancy, register limits. Final NVVM lowering rewrites that marker to nvvm.kernel and migrates argument attributes such as grid-constant semantics onto LLVM-compatible function arguments.
void finalize_kernel_entry(Function fn, KernelSpec spec, TargetInfo target) {
    require(fn.has_attr("cute.kernel") || fn.has_attr("tile.kernel"));
    // Drop whichever dialect-level marker is present so only nvvm.kernel
    // remains (remove_attr is assumed to be a no-op for an absent attribute).
    fn.remove_attr("cute.kernel");
    fn.remove_attr("tile.kernel");
    fn.set_attr("nvvm.kernel", true);
    fn.set_attr("nvvm.reqntid", dim3(32 * spec.num_warps, 1, 1));
    fn.set_attr("nvvm.minctasm", spec.num_ctas);
    if (target.supports_cluster_launch() && spec.cluster_product > 1) {
        fn.set_attr("nvvm.cluster_dim", spec.cluster_dim);
        fn.set_attr("nvvm.blocksareclusters", true);
    }
    if (spec.max_registers) {
        fn.set_attr("nvvm.maxnreg", *spec.max_registers);
    }
    for (Argument arg : fn.arguments()) {
        if (arg.has_attr("cute_nvgpu.grid_constant")) {
            arg.remove_attr("cute_nvgpu.grid_constant");
            arg.set_attr("nvvm.grid_constant", true);
        }
    }
}
The separation is practical: entry-point intent exists before LLVM argument types exist, but final NVVM attributes must attach to the exact arguments the backend will see.
Type ABI
One LLVM type converter spans the TileAS, Tile function, and companion NVGPU/CuTe lowering paths, so every pass agrees on ABI shape. The important rules:
| Source concept | LLVM/NVVM representation |
|---|---|
| Ranked memref or internal tile memory reference | Descriptor {allocatedPtr, alignedPtr, offset, sizes[N], strides[N]} unless bare-pointer kernel ABI applies. |
| Kernel memref argument under bare-pointer ABI | The aligned pointer becomes the formal argument; sizes, strides, and launch metadata are carried separately. |
| index | Target index integer width, normally i64 unless configured otherwise. |
| Vectors | LLVM vector type with converted element type. |
| Async/pipeline token | i32; the low bit carries producer/consumer phase for parity-sensitive waits. |
| Tiled view descriptor | Small LLVM struct containing the base pointer and packed layout/rank metadata. |
| Memory spaces | Address-space-qualified pointers; global, shared, constant, local, and tensor memory remain distinct. |
These conversions are ABI commitments. A reimplementation may rearrange the internal pass structure, but it must not silently change descriptor field order, token width, address-space classification, or kernel argument lowering.
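Two of the table rows translate directly into host-side C++ for anyone writing test harnesses against the ABI. The sketch below assumes 64-bit index values and uses the field order listed in the descriptor row; `MemRefDescriptor`, `Descriptor2D`, `element_ptr`, and `token_phase` are illustrative names. The page only specifies the low bit of the token, so the helper reads nothing else.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Ranked memref descriptor in the field order listed in the table above.
template <typename T, std::size_t N>
struct MemRefDescriptor {
  T* allocatedPtr;          // pointer returned by the allocator
  T* alignedPtr;            // pointer used for element access
  std::int64_t offset;      // element offset from alignedPtr
  std::int64_t sizes[N];    // extent of each dimension
  std::int64_t strides[N];  // element stride of each dimension
};

using Descriptor2D = MemRefDescriptor<float, 2>;

// Address of element (i, j): aligned pointer plus offset plus strided index.
inline float* element_ptr(const Descriptor2D& d, std::int64_t i, std::int64_t j) {
  assert(i >= 0 && i < d.sizes[0] && j >= 0 && j < d.sizes[1]);
  return d.alignedPtr + d.offset + i * d.strides[0] + j * d.strides[1];
}

// Async/pipeline token: a 32-bit value whose low bit carries the phase.
inline std::uint32_t token_phase(std::uint32_t token) { return token & 1u; }
```

Under the bare-pointer kernel ABI only `alignedPtr` would appear as the formal argument; the remaining fields travel separately, as the second table row states.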
Lowering Stages
Lowering runs as four named conversion passes plus a small cluster of companion passes that prepare the module for NVPTX serialization. Each stage hands a specific kind of state to the next: TileAA hands aliasing-aware tile algebra to TileAS, TileAS hands scheduled-and-laid-out tile execution to the LLVM stage, the LLVM stage hands ABI-ready LLVM IR plus a populated gpu.module to the target-attribute and debug-info passes, and those leave a module the GPU-to-binary serializer can consume directly.
Stage 1 — ConvertCudaTileToTileAA
Rewrites the public input dialect. Three populators run in fixed order: Part A covers arithmetic and structured control flow, Part B covers memory, pointer, token, and view operations, Part C lowers the four specialists (mmaf, mmai, reduce, scan) whose shapes depend on decisions made by A and B. The pass installs three type-converter functor pairs that bridge the public TileType, PointerType, and TokenType to their nv_tileaa equivalents.
Hand-off: every cuda_tile.* op has been rewritten. nv_tileaa.* carries the alias-aware tile algebra. Tokens are still SSA values with explicit memory dependences.
Stage 2 — ConvertTileAAToTileAS
Lowers TileAA's "what the program means" view into TileAS's "how the program will execute" view. CopyAtom and ReduceAtom witnesses attach to memory operations during this stage and ride verbatim onto their TileAS replacements; the downstream LLVM stage reads them to pick the concrete hardware primitive. The kernel-spec attribute mirrors onto the function so SM-gated rewrites (notably the SM100 block-scaled MMA path) have a target spec to consult.
Hand-off: TileAS operations carry async-pipeline, layout, and TMA-descriptor structure. The TileAS scheduling and layout-assignment passes (D07 through D22) now own the module.
Stage 3 — ConvertTileFuncToLLVM then ConvertTileASToLLVM
Function-boundary conversion runs first. ConvertTileFuncToLLVM rewrites nv_tileaa.func and nv_tileaa.return into func.func and func.return, applies the bare-pointer ABI, and translates kernel-spec fields into nvvm.* attributes (nvvm.reqntid, nvvm.cluster_dim, nvvm.minctasm, nvvm.maxnreg). A kernel that returns operands fails the pass with an explicit diagnostic; non-kernel functions may return arbitrary value lists.
ConvertTileASToLLVM then rewrites bodies in nine phases (decompose-print, bufferization analysis, main TileAA/TileAS rewrites, bulk supplementary, cute/cute_nvgpu, async.pipeline, arith/llvm cleanup, reconcile-unrealized-casts, late materializer). The shared-memory scratch global @global_smem is emitted before any pattern runs when the kernel requested extended shared memory. The PDL-to-PDLInterp fallback compiles embedded PDL bytecode immediately before the conversion engine runs.
Hand-off: nv_tileaa and nv_tileas no longer appear in executable positions. llvm.* and nvvm.* carry the kernel; cute.*, cute_nvgpu.*, and cutlass.* survive only where a companion pass is responsible for them.
Stage 4 — Companion lowering and target attachment
ConvertCuteAndCuteNvgpuToLLVM desugars layout sugar, lowers primitive CuTe descriptor and tuple operations, then dispatches architectural atoms (SM90 WGMMA, SM100 IMMA, SM100 shared-to-tensor copy) to their dedicated rewriters. ConvertNVGPUAndGPUToNVVM rewrites the standard gpu dialect and the nvgpu architectural surface into nvvm.* intrinsics. AttachNVVMTarget reads the module's compute-capability and target-spec attributes and writes a populated #nvvm.target attribute onto the gpu.module. TranslateDebugInfo rewrites debuginfo.value chains into LLVM debug intrinsics with the NVIDIA-specific llvm.nvvm.move value pin.
Hand-off: every executable op is llvm.* or nvvm.*, the gpu.module carries exactly one resolved target, and debug metadata is in LLVM form. The module is ready for GPU-to-binary serialization.
Stage Sequence
ModuleOp lower_to_nvvm(ModuleBytecode input, CompileOptions options) {
    ModuleOp module = parse_tileir(input);
    run_pass<ConvertCudaTileToTileAA>(module, options);
    run_pass<ConvertTileAAToTileAS>(module, options);
    run_pass<ConvertTileFuncToLLVM>(module, options);
    run_pass<ConvertTileASToLLVM>(module, options);
    run_pass<ConvertCuteAndCuteNvgpuToLLVM>(module, options);
    run_pass<ConvertNVGPUAndGPUToNVVM>(module, options);
    run_pass<AttachNVVMTarget>(module, options);
    run_pass<TranslateDebugInfo>(module, options);
    return module;
}
Each pass owns one boundary. The driver does not interleave them — Tile-function conversion must complete before TileAS bodies lower, body lowering must complete before companion CuTe/NVGPU passes run, and target attachment is last because it depends on a fully-lowered gpu.module.
Pattern population, type conversion, and pattern-bank structure are described in Pattern Sets and Type Conversion. This overview leans on the invariant that each stage has a complete legality target and a type converter that agrees with the next stage.
Options and Placement
The conversion cascade runs at every optimization level because later backend stages cannot consume high-level TileIR. Optimization level and pipeline strategy mainly choose auxiliary cleanup, scheduling, async pipeline, debug-info, and snapshot behavior around the mandatory conversions.
| Option family | Effect on lowering |
|---|---|
| Optimization level | Selects cleanup intensity and LLVM/NVVM optimization level, but does not remove the dialect cascade. |
| Pipeline strategy | Changes async pipeline materialization and scheduling choices before TileAS-to-NVVM lowering. |
| Debug and line info | Enables debug-info conversion and preserves source scopes into LLVM metadata. |
| Target GPU / PTX version | Feeds #nvvm.target, feature strings, cluster attributes, and target-gated intrinsic selection. |
| use-nvgpucomp-libnvvm style switches | Select whether serialization uses the bundled open NVPTX path or a libNVVM/NVGPUComp path when available. |
Lowering Invariants
- No cuda_tile operations survive after cuda_tile -> nv_tileaa.
- No executable TileAA compute or memory operations survive after TileAA-to-TileAS, except explicitly legal bridge operations owned by later passes.
- cute, cute_nvgpu, and cutlass operations may remain only when a later companion pass declares them legal.
- Memrefs lower to LLVM descriptors unless a kernel bare-pointer ABI rule applies.
- Async and pipeline tokens lower to i32; parity-sensitive tokens use the low bit as phase state.
- Kernel metadata is staged: tile-level entry metadata first, final nvvm.kernel only after LLVM-compatible arguments exist.
- Target metadata must exist before serialization: triple, chip, feature string, optimization level, and libNVVM/NVPTX flags.
- The final gpu.module must be serializable without consulting any cuda_tile, TileAA, or TileAS verifier.
Cross-Links
- cuda_tile to nv_tileaa covers the public input-dialect conversion.
- nv_tileaa to nv_tileas covers the analysis-to-scheduled-tile transition.
- nv_tileas to LLVM covers async, memory, layout, and TileAS lowering.
- cute / cute_nvgpu to LLVM covers companion dialect lowering.
- nvgpu / gpu to NVVM covers standard GPU/NVGPU lowering.
- Target and Debug Info covers #nvvm.target and debug metadata.
Lowering: cuda_tile to nv_tileaa
Abstract
ConvertCudaTileToTileAA is the first lowering pass in the tileiras pipeline and the only one that translates from a publicly-defined dialect. It rewrites cuda_tile — the bytecode-input form users author against — into the internal nv_tileaa dialect every subsequent pass operates on. No cuda_tile.* operation may survive this pass.
The conversion is partial. The pass loads six legal dialects, marks cuda_tile illegal, attaches a dynamic-legality predicate to ub.poison, registers three type-conversion functor pairs, and applies a pattern bank assembled by three independent populators in a fixed order.
Boundary Contract
| Dimension | Specification |
|---|---|
| Allowed input ops | every cuda_tile.* executable op (accepted as input, then forced through rewriting because addIllegalDialect<cuda_tile> marks the whole dialect illegal); ub.poison accepted under a dynamic-legality predicate that requires an nv_tileaa-primitive result type |
| Allowed input types / attributes | cuda_tile::TileType, cuda_tile::PointerType, cuda_tile::TokenType; the arith.fastmath-shaped fastmath property carried on cuda_tile arithmetic ops, plus axis, inclusive, and per-op attribute dictionaries; module-level --compute-capability option must parse |
| Guaranteed output ops | only ops from arith, nv_tileaa, func, gpu, scf, math (the six legal dialects); no cuda_tile.* survives; bridge builtin.unrealized_conversion_cast may remain pending downstream reconciliation |
| Guaranteed output types / attributes | tile → llvm.struct<...> descriptor, pointer → llvm.ptr, token → llvm.token through the materialiser triple; region block-arg types rewritten through the same TypeConverter; the fastmath property is propagated unchanged onto the matching nv_tileaa arithmetic op |
| Violation behavior | residual cuda_tile.* op → applyPartialConversion fails with "failed to convert cuda_tile to nv_tileaa"; malformed compute capability → "invalid or missing --compute-capability option"; mismatched region block-arg types → next-stage verifier rejects the parent op (no localised diagnostic from this pass) |
Pass Driver
runOnOperation reads the stored --compute-capability option, builds the conversion target, populates three pattern groups in order, runs the PDL fallback, and invokes applyPartialConversion. Two user-facing diagnostics escape: "invalid or missing --compute-capability option" when the option parses as malformed, and "failed to convert cuda_tile to nv_tileaa" when partial conversion fails to legalise every cuda_tile.* op.
LogicalResult convertCudaTileToTileAA(ModuleOp mod, ComputeCapability cc) {
  // Reject a missing or malformed --compute-capability before touching the IR.
  if (!cc.valid()) {
    return emit("invalid or missing --compute-capability option");
  }
  // A RewritePatternSet is bound to the MLIR context of the module it rewrites.
  RewritePatternSet patterns(mod.getContext());
  populatePartA(patterns); // arithmetic, comparison, conversion, indexing, control flow
  populatePartB(patterns); // memory, pointer, token, view, partition
  populatePartC(patterns); // mma, reduce, scan, transcendental specialists
  ConversionTarget target = buildConversionTarget(mod);
  // PDL fallback: compile any PDL patterns into the frozen set.
  FrozenRewritePatternSet frozen;
  compilePDLPatterns(patterns, &frozen);
  if (failed(applyPartialConversion(mod, target, frozen))) {
    return emit("failed to convert cuda_tile to nv_tileaa");
  }
  return success();
}
The pass walks all cuda_tile.module operations nested in the input module before conversion. The walker is a recursive op-tree walk filtered by TypeID; collected modules land in a small inline-allocated vector sized for the common case of one nested module per bytecode input.
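The validity check at the top of the driver can be sketched as a small parser. This is a minimal illustration under stated assumptions: the textual form "sm_<digits>" is assumed (the real option syntax is not given here), the accepted list is taken from the At a Glance table, and parseComputeCapability is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical parser: accepts "sm_<digits>" and returns the capability as
// major * 10 + minor (for example "sm_103" -> 103). The textual format of
// the real --compute-capability option is an assumption.
inline std::optional<int> parseComputeCapability(std::string_view text) {
  constexpr std::string_view prefix = "sm_";
  if (text.size() <= prefix.size() || text.substr(0, prefix.size()) != prefix)
    return std::nullopt;
  int value = 0;
  for (char c : text.substr(prefix.size())) {
    if (c < '0' || c > '9')
      return std::nullopt;
    value = value * 10 + (c - '0');
    if (value > 9999) // reject overlong digit strings before they overflow
      return std::nullopt;
  }
  // Targets listed as accepted by the driver in the At a Glance table.
  constexpr std::array<int, 5> accepted = {100, 103, 110, 120, 121};
  for (int cc : accepted)
    if (cc == value)
      return value;
  return std::nullopt;
}
```

An empty optional corresponds to the case where the pass would report "invalid or missing --compute-capability option".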
Conversion Target
The conversion target builder marks six dialects legal, declares cuda_tile fully illegal, and attaches dynamic legality to ub.poison. The same target object is reused across all three populators.
ConversionTarget buildConversionTarget(ModuleOp mod) {
  ConversionTarget target(*mod.getContext());
  // Fully legal: accept any op of these dialects without further checks.
  target.addLegalDialect<arith::ArithDialect,
                         nv_tileaa::TileAADialect,
                         func::FuncDialect,
                         gpu::GPUDialect,
                         scf::SCFDialect,
                         math::MathDialect>();
  // Fully illegal: every cuda_tile op must be rewritten away.
  target.addIllegalDialect<cuda_tile::CudaTileDialect>();
  // Dynamic legality: ub.poison is legal once its result type is an nv_tileaa primitive.
  target.addDynamicallyLegalOp<ub::PoisonOp>([](ub::PoisonOp op) {
    return isLegalTileAAType(op.getResult().getType());
  });
  return target;
}
The type-converter materialisers handle the residual cases where partial conversion needs a bridge value while the IR is mid-rewrite. Source materialisers run when an nv_tileaa-typed value is needed but only the original cuda_tile-typed value exists; target materialisers run for the reverse direction. Both produce builtin.unrealized_conversion_cast operations that the next pass's reconciliation phase erases.
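The three legality classes of the conversion target can be modelled without MLIR. The sketch below is a toy: ops are plain strings, ToyOp and ToyTarget are hypothetical names, and the "nv_tileaa primitive" test is approximated as "the result type does not mention cuda_tile".

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Toy stand-in for an MLIR ConversionTarget. Ops are plain strings of the
// form "dialect.mnemonic"; none of these names are MLIR or tileiras APIs.
struct ToyOp {
  std::string name;       // e.g. "cuda_tile.addi"
  std::string resultType; // e.g. "tensor<8xf32>"
};

struct ToyTarget {
  std::set<std::string> legalDialects;
  std::set<std::string> illegalDialects;
  std::string dynamicOp; // the one op that carries a legality predicate
  std::function<bool(const ToyOp &)> dynamicRule;

  // Op-specific rules are checked before dialect-wide rules. Ops matching no
  // rule are reported as not legal here; a real partial conversion would
  // simply leave such ops alone.
  bool isLegal(const ToyOp &op) const {
    if (op.name == dynamicOp)
      return dynamicRule(op);
    std::string dialect = op.name.substr(0, op.name.find('.'));
    if (illegalDialects.count(dialect))
      return false;
    return legalDialects.count(dialect) != 0;
  }
};

// Builds the target described in the text: six legal dialects, cuda_tile
// illegal, ub.poison legal only for an already-converted result type.
inline ToyTarget makeCudaTileToTileAATarget() {
  ToyTarget t;
  t.legalDialects = {"arith", "nv_tileaa", "func", "gpu", "scf", "math"};
  t.illegalDialects = {"cuda_tile"};
  t.dynamicOp = "ub.poison";
  // Assumption: approximate "nv_tileaa primitive" by rejecting any type that
  // still mentions cuda_tile.
  t.dynamicRule = [](const ToyOp &op) {
    return op.resultType.find("cuda_tile") == std::string::npos;
  };
  return t;
}
```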
Input and Output Dialects
| Direction | Surface |
|---|---|
| input ops | cuda_tile.* (all executable ops), ub.poison (dyn-legal) |
| input types | cuda_tile::TileType, cuda_tile::PointerType, cuda_tile::TokenType |
| output ops (legal after this pass) | arith, nv_tileaa, func, gpu, scf, math, plus already-legal llvm.struct and llvm.ptr shapes produced by type materialisation |
| output types | tile types become llvm.struct<...>, pointer types become llvm.ptr, token types become llvm.token (via the materialiser triple) |
The canonical rewrite shape for a one-to-one Part-A pattern is:
input : %r = cuda_tile.addi %a, %b : <tile shape>
output : %r = nv_tileaa.addi %a, %b : <tile shape>
Region-bearing ops (cuda_tile.reduce, cuda_tile.scan) keep their region intact; only block-argument types and yielded values flow through the TypeConverter.
Three-Populator Structure
Three populators build the pattern set in fixed order. Parts A and B are mutually independent at the source level; they run sequentially so the resulting pattern-set composition stays reproducible. Part C runs after both because its patterns depend on the type-conversion and layout decisions A and B have already published.
| Part | Patterns | Role |
|---|---|---|
| A | ~45 | Arithmetic, comparison, conversion, indexing, structured control flow |
| B | ~34 | Memory, pointer, token, view, partition |
| C | 4 | mmaf, mmai, reduce, scan — specialists whose lowering depends on layout choices A and B locked in |
Part A registers hand-written OpConversion patterns (AddIOpConversion, ReduceOpConversion, and so on). Part B mixes template-generated GenericConversion<cuda_tile::XOp, nv_tileaa::YOp> patterns with custom view/token/entry patterns. Part C is four specialists for operations whose rewrite shape varies with the parent op's element type, accumulator location, or combiner-region structure.
Singleton Pattern Adders
Eight pattern classes register through dedicated singleton adders rather than through the main populator bodies, because downstream callers (the CudaTileOptimizer test driver and the rsqrt/fma fusion pass) need to install them into private pattern sets without pulling in the full Part-A/B/C registration. Each adder is a single-purpose helper that allocates one OpConversionPattern and pushes it onto the supplied RewritePatternSet.
| cuda_tile op | Pattern class | Role |
|---|---|---|
| cuda_tile.trunci | TruncIOpConversion | Integer truncation, lowered through arith.trunci retyped over nv_tileaa operand shapes |
| cuda_tile.rsqrt | RsqrtOpConversion | Reciprocal square root, rewrites to nv_tileaa.rsqrt |
| cuda_tile.maxi | MaxIOpConversion | Integer max (signed or unsigned), lowered through arith.maxsi / arith.maxui over nv_tileaa operand shapes |
| cuda_tile.itof | IToFOpConversion | Integer-to-float conversion, lowered through arith.sitofp / arith.uitofp over nv_tileaa operand shapes |
| cuda_tile.global | GlobalOpConversion | Global symbol declaration, rewrites to nv_tileaa.global |
| cuda_tile.fma | FmaOpConversion | Fused multiply-add, rewrites to nv_tileaa.fma |
| cuda_tile.constant | ConstantOpConversion | Tile constant, rewrites to nv_tileaa.splat (with a constant scalar) or to arith.constant carrying a dense tensor for static aggregates |
| cuda_tile.assume | AssumeOpConversion | Assumption hint, rewrites to nv_tileaa.assume |
Each rewrite has the same one-to-one shape:
%r = cuda_tile.rsqrt %x : !cuda_tile.tile<8x64xf32>
↓
%r = nv_tileaa.rsqrt %x : tensor<8x64xf32>
The eight ops never appear in the main populator rosters; the singleton adders are the only registration path that brings them into a pattern set.
cuda_tile.trunci Walk
TruncIOpConversion is the canonical type-narrowing rewrite. The operand is an integer tile and the result is a narrower integer tile of the same shape. The rewrite keeps the operand SSA value verbatim, swaps the op mnemonic, and asks the TypeConverter for the result type:
// Before
%narrow = cuda_tile.trunci %wide : !cuda_tile.tile<128xi32> to !cuda_tile.tile<128xi8>
// After
%narrow = nv_tileaa.trunci %wide : tensor<128xi32> to tensor<128xi8>
The operand %wide flows through the source materialiser when its definition has not yet rewritten — applyPartialConversion inserts a builtin.unrealized_conversion_cast %wide : !cuda_tile.tile<128xi32> to tensor<128xi32> that the downstream cast-reconciliation phase erases once both ends are TileAA-typed. No attribute hand-off is needed: trunci carries only its result type.
cuda_tile.fma Walk
FmaOpConversion is a three-operand floating-multiply-add. All three operands share the source tile type and the result has the same shape:
// Before
%r = cuda_tile.fma %a, %b, %c { fastmath = #arith.fastmath<contract> }
: !cuda_tile.tile<8x64xf32>
// After
%r = nv_tileaa.fma %a, %b, %c { fastmath = #arith.fastmath<contract> }
: tensor<8x64xf32>
The fastmath attribute carries verbatim through the rewrite. Both dialects accept the shared arith.fastmath enum (the same one the MLIR arith dialect publishes), so no attribute kind translation is required — the same typed attribute object is re-attached to the rewritten op. A later lowering past nv_tileaa translates it to llvm.fastmath when descending into the LLVM dialect, but at this stage the attribute is dialect-shared rather than dialect-private.
Type-Converter Materialisers
Three type-converter functor pairs register before the populators run. Each pair combines an addConversion callback (called when the converter sees the source type) with an addMaterialization callback (called when partial conversion needs a bridge value while the IR is mid-rewrite). Materialisations should not survive later canonicalisation — the reconciliation phase in the next pass erases them.
| Source type | Target type | Materialiser direction |
|---|---|---|
| cuda_tile::TileType | llvm.struct<...> (descriptor shape) | source: produces an nv_tileaa value when only a cuda_tile value exists |
| cuda_tile::PointerType | llvm.ptr | target: produces a cuda_tile value when only an nv_tileaa value exists |
| cuda_tile::TokenType | llvm.token | source: produces an nv_tileaa value when only a cuda_tile value exists |
Splitting source from target materialisers preserves token ordering and view identity for the scheduler, which still needs to reason about memory dependences before later NVVM lowering flattens tokens into integers. A purely-symmetric materialiser pair would lose the directional information the dialect-conversion engine uses to pick the right cast.
Block-Argument Type Flow
Region-bearing operations (cuda_tile.reduce, cuda_tile.scan, structured control flow that carries cuda_tile-typed iteration arguments) need block-argument types converted in the same step as their parent op. The standard inline-region helper does not see the pass's type converter, so the region-rewriting patterns construct their replacement operations explicitly:
LogicalResult lowerRegionOp(Operation *src, OperationName dst,
                            ConversionPatternRewriter &rw,
                            const TypeConverter &types) {
  // Pseudocode helper: upstream MLIR hands already-converted operands to the
  // pattern through its adaptor rather than through the TypeConverter.
  SmallVector<Value> operands;
  if (failed(types.convertOperands(src->getOperands(), operands)))
    return failure();
  SmallVector<Type> resultTypes;
  if (failed(types.convertTypes(src->getResultTypes(), resultTypes)))
    return failure();
  OperationState state(src->getLoc(), dst);
  state.addOperands(operands);
  state.addTypes(resultTypes);
  state.addAttributes(src->getAttrs());
  // Move each region across, then convert its block-argument types.
  for (Region &region : src->getRegions()) {
    Region *newRegion = state.addRegion();
    rw.inlineRegionBefore(region, *newRegion, newRegion->begin());
    if (failed(rw.convertRegionTypes(newRegion, types)))
      return failure();
  }
  Operation *replacement = rw.create(state);
  rw.replaceOp(src, replacement->getResults());
  return success();
}
convertRegionTypes walks the block-argument list of every block in the region and rewrites types through the same converter the parent op uses. Without this step, the parent op verifies against post-conversion operand types but its region terminator yields pre-conversion types — a signature mismatch the next-stage verifier reports without enough context to diagnose properly.
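The effect of running block arguments through the same converter as the parent op can be shown with a toy model over textual types. The mapping used here (shaped tiles become tensors, rank-0 tiles unwrap to the element type) follows the worked examples on this page; convertToyType, ToyRegion, and convertToyRegionTypes are hypothetical names, not tileiras or MLIR APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy type converter over textual types: shaped tiles become tensors,
// rank-0 tiles unwrap to the bare element type. Any other type is returned
// unchanged.
inline std::string convertToyType(const std::string &type) {
  const std::string prefix = "!cuda_tile.tile<";
  if (type.size() <= prefix.size() + 1 ||
      type.compare(0, prefix.size(), prefix) != 0 || type.back() != '>')
    return type;
  std::string payload =
      type.substr(prefix.size(), type.size() - prefix.size() - 1);
  // A leading digit or '?' means the payload starts with a dimension.
  char first = payload.front();
  bool shaped = (first >= '0' && first <= '9') || first == '?';
  return shaped ? "tensor<" + payload + ">" : payload;
}

// Toy region: one entry per block, each holding its block-argument types.
using ToyRegion = std::vector<std::vector<std::string>>;

// Plays the role convertRegionTypes has in the text: every block argument
// of every block goes through the same converter as the parent op.
inline void convertToyRegionTypes(ToyRegion &region) {
  for (auto &block : region)
    for (auto &argType : block)
      argType = convertToyType(argType);
}
```

Skipping convertToyRegionTypes in this model leaves the block arguments in their cuda_tile form while the parent signature is already converted, which is the mismatch the paragraph above describes.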
Part C Specialists
Part C registers four specialists that depend on layout decisions made by Parts A and B. Each takes a cuda_tile op whose lowering shape is parameterised by element type, layout intent, or combiner-region structure, and emits the matching nv_tileaa form.
cuda_tile.mmaf and cuda_tile.mmai
The float and integer matrix-multiply-accumulate ops rewrite to nv_tileaa.dot with the element-type-specific attribute set. The rewriter selects FP rounding mode and accumulator precision from the source op's attributes.
%d = cuda_tile.mmaf %a, %b, %c { fastmath = "contract" }
    : tensor<128x64xf16>, tensor<64x128xf16>, tensor<128x128xf32>
↓
%d = nv_tileaa.dot %a, %b, %c { input_precision = "tf32", fastmath = "contract" }
    : tensor<128x64xf16>, tensor<64x128xf16>, tensor<128x128xf32>
cuda_tile.reduce Worked Example
cuda_tile.reduce carries a combiner region whose block arguments are accumulator-typed and whose terminator yields the next accumulator value. The rewriter walks the region, converts block-argument types through the shared TypeConverter, and rebuilds the op as nv_tileaa.reduce with the converted region body.
Input:
%sum = cuda_tile.reduce %values { axis = 1 : i32 } : !cuda_tile.tile<8x64xf32> -> !cuda_tile.tile<8xf32> {
^bb0(%acc: !cuda_tile.tile<f32>, %val: !cuda_tile.tile<f32>):
  %s = cuda_tile.addf %acc, %val : !cuda_tile.tile<f32>
  cuda_tile.yield %s : !cuda_tile.tile<f32>
}
The pattern converts the parent op's operand and result types, inlines the region, then walks the new region's blocks to convert each block-argument type:
%sum = nv_tileaa.reduce %values { axis = 1 : i32 } : tensor<8x64xf32> -> tensor<8xf32> {
^bb0(%acc: f32, %val: f32):
  %s = nv_tileaa.addf %acc, %val : f32
  nv_tileaa.yield %s : f32
}
Block-argument types !cuda_tile.tile<f32> become f32 because the TileType conversion strips the dialect wrapper; the terminator and combiner body rewrite recursively under the same partial-conversion driver, since cuda_tile.addf and cuda_tile.yield are in the illegal dialect and match Part A patterns.
If the rewriter forgot to convert block-argument types, the parent nv_tileaa.reduce would have f32 operands at the outer signature but the inner region's ^bb0 would still bind !cuda_tile.tile<f32> — the verifier would reject the operation with a signature mismatch the next-stage diagnostics cannot localise back to this pass.
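For readers less familiar with region-carried combiners, the following sketch shows what a reduction along axis 1 computes once the combiner operates on bare element values. It is a plain C++ model, not tileiras code: reduceAxis1 is a hypothetical name, and seeding each row with its first element is an assumption because the text does not say how the accumulator is initialised.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy model of a reduction along axis 1 of a row-major rows x cols tile.
// The combiner plays the role of the region body: it receives the running
// accumulator and the next value and returns the next accumulator.
// Assumption: each row is non-empty and is seeded with its first element.
inline std::vector<float>
reduceAxis1(const std::vector<float> &tile, std::size_t rows, std::size_t cols,
            const std::function<float(float, float)> &combiner) {
  assert(cols > 0 && tile.size() == rows * cols);
  std::vector<float> out(rows);
  for (std::size_t r = 0; r < rows; ++r) {
    float acc = tile[r * cols];
    for (std::size_t c = 1; c < cols; ++c)
      acc = combiner(acc, tile[r * cols + c]);
    out[r] = acc;
  }
  return out;
}
```

With an addition combiner this mirrors the addf body in the example: an 8x64 input would yield 8 row sums.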
cuda_tile.scan Worked Example
cuda_tile.scan follows the same shape as reduce but produces a tensor of the same rank as the input — every output element is the cumulative reduction of the prefix of input elements along the scan axis. The rewriter applies identical region-conversion logic, only changing the parent op's mnemonic and keeping the result rank equal to the input rank.
Input:
%prefix = cuda_tile.scan %values { axis = 1 : i32, inclusive = true }
    : !cuda_tile.tile<8x64xf32> -> !cuda_tile.tile<8x64xf32> {
^bb0(%acc: !cuda_tile.tile<f32>, %elem: !cuda_tile.tile<f32>):
  %sum = cuda_tile.addf %acc, %elem : !cuda_tile.tile<f32>
  cuda_tile.yield %sum : !cuda_tile.tile<f32>
}
Output:
%prefix = nv_tileaa.scan %values { axis = 1 : i32, inclusive = true }
    : tensor<8x64xf32> -> tensor<8x64xf32> {
^bb0(%acc: f32, %elem: f32):
  %sum = nv_tileaa.addf %acc, %elem : f32
  nv_tileaa.yield %sum : f32
}
The axis and inclusive attributes carry verbatim; block-argument types unwrap from !cuda_tile.tile<f32> to f32 via the TileType converter, and the inner cuda_tile.addf / cuda_tile.yield rewrite recursively under Part A patterns matched by the same partial-conversion driver.
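The semantics of the inclusive attribute can be sketched on a single row. This is an illustrative model only: scanRow is a hypothetical name, both modes seed the accumulator with a caller-supplied identity, and the exclusive convention shown (out[0] equals the identity) is an assumption, since the text only shows inclusive = true.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy prefix scan over one row, seeded with `identity` in both modes.
// inclusive == true:  out[i] combines elements 0..i.
// inclusive == false: out[i] combines elements 0..i-1, so out[0] is the
// identity (assumed convention, not confirmed by the text).
inline std::vector<float>
scanRow(const std::vector<float> &row, bool inclusive, float identity,
        const std::function<float(float, float)> &combiner) {
  std::vector<float> out(row.size());
  float acc = identity;
  for (std::size_t i = 0; i < row.size(); ++i) {
    if (inclusive) {
      acc = combiner(acc, row[i]);
      out[i] = acc;
    } else {
      out[i] = acc;
      acc = combiner(acc, row[i]);
    }
  }
  return out;
}
```

The output always has the same length as the input row, matching the same-rank result described above.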
Transcendental Specialists
The transcendental specialists (cuda_tile.exp2, cuda_tile.log2, cuda_tile.sin, cuda_tile.cos, cuda_tile.tanh) rewrite to nv_tileaa counterparts but additionally attach the fastmath flag derived from the source op's attribute dictionary. The flag controls whether downstream lowering selects the __nv_* precise libdevice variant or the __nv_fast_* approximate variant.
Tokens and Atomics
Token-aware operations stay explicit in the IR rather than collapsing immediately to NVVM. Loads, stores, atomic compare-and-swap, atomic read-modify-write, token creation, and token join all become nv_tileaa operations that still expose memory dependences. The downstream scheduler and async-pipeline passes reason about those dependences before LLVM/NVVM lowering flattens tokens into integers.
%t = cuda_tile.token.join [%t0, %t1, %t2] : !cuda_tile.token
↓
%t = nv_tileaa.join_mem_token [%t0, %t1, %t2] : !nv_tileaa.mem_token
Singleton joins skip the join_mem_token op and pass the single token through unchanged; empty joins lower to nv_tileaa.create_mem_token (with empty operand list), the same producer the downstream nv_tileas.async.pipeline.create_null_token later consumes, so every downstream op still has a token operand to consume.
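The three join cases can be summarised as a dispatch on operand count. The sketch below is a toy description of the outcome, not the rewriter itself; ToyJoinLowering, lowerTokenJoin, and the placeholder SSA names "%fresh" and "%joined" are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy description of what a token join lowers to, by operand count.
struct ToyJoinLowering {
  std::string op;                    // emitted op name, empty for pass-through
  std::vector<std::string> operands; // tokens consumed by the emitted op
  std::string result;                // name of the resulting token value
};

// "%fresh" and "%joined" are placeholder SSA names for this sketch only.
inline ToyJoinLowering lowerTokenJoin(const std::vector<std::string> &tokens) {
  if (tokens.empty()) // empty join: create a token so consumers have one
    return {"nv_tileaa.create_mem_token", {}, "%fresh"};
  if (tokens.size() == 1) // singleton join: forward the token unchanged
    return {"", {}, tokens.front()};
  return {"nv_tileaa.join_mem_token", tokens, "%joined"};
}
```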
Pipeline Handoff
The pass establishes the alias and view shapes that warp-specialized producer/consumer rewriting relies on later, but assigns no final layouts. It keeps enough structure around load/store views, atomic-token operations, and tensor partitions for TileAS layout assignment to insert nv_tileas.view and nv_tileas.convert_layout at producer and consumer boundaries. The invariant: a view produced here must still identify the same memory object, shape, layout intent, and token ordering when it reaches TileAS layout assignment.
Failure Modes
The pass fails with a user-facing diagnostic when:
- compute capability is missing or malformed ("invalid or missing --compute-capability option");
- partial conversion leaves a residual cuda_tile.* op ("failed to convert cuda_tile to nv_tileaa");
- a type materialisation cannot bridge a value across the boundary;
- a region rewrite would produce mismatched block arguments or terminators.
Cross-References
Conversion / Lowering Overview describes this pass's position in the four-stage cascade. Shared LLVM Type Converter documents the shared LLVM type converter that the materialiser triple here registers into. TileAA to TileAS is the next lowering stage; the CopyAtom and ReduceAtom witnesses attached there preserve information this pass made explicit. cuda_tile Op Roster lists the lowering-arm classification per family that the populator order in this pass reflects; nv_tileaa Op Roster — Memory Effects gives the operand and attribute tables for the token-aware operations the singleton adders produce. DSL to PTX End-to-End — Stage 1: cuda_tile IR and Stage 2: nv_tileaa IR trace a single GEMM kernel across the boundary this pass enforces, showing the IR shape on either side for a representative cuda_tile.mmaf op.
Lowering: nv_tileaa to nv_tileas
Abstract
ConvertTileAAToTileAS lowers the alias-aware typed-pointer dialect nv_tileaa into the assembler-near dialect nv_tileas. It runs after ConvertCudaTileToTileAA and before the TileAS family of passes (D07 through D22). Above this boundary tile algebra is target-independent and described in terms of typed pointers and abstract memory; below it, operations carry CopyAtom and ReduceAtom witnesses, the kernel-spec is mirrored as an attribute on the function, and SM100-only forms such as block-scaled MMA become legal.
Structurally this is a textbook MLIR partial conversion. A single driver assembles a RewritePatternSet from three fixed-order populators, attaches kernel-spec metadata onto the function, builds the conversion target, and runs applyPartialConversion. There is no second pipeline stage — canonicalization of slice scaffolding is left to the following passes.
Boundary Contract
| Dimension | Specification |
|---|---|
| Allowed input ops | every executable nv_tileaa.* op (illegal-dialect), plus arith.* and math.*; nv_tileaa.func, nv_tileaa.return, nv_tileaa.mark_for_reuse explicitly stay legal (owned by ConvertTileFuncToLLVM) |
| Allowed input types / attributes | nv_tileaa::memref, nv_tileaa::ptr, nv_tileaa::mem_token; CopyAtomAttrInterface witness on memory ops, ReduceAtomAttrInterface on reduce/scan; mem_semantic, mem_scope, operandSegmentSizes, in_bounds; cute.kernel attribute on the function (mirrored to nv_tileaa.kernel_spec) |
| Guaranteed output ops | nv_tileas.* plus arith.*/math.* lowered to TileAS-compatible form; combiner body internals lowered to arith.* (not nv_tileas.*) because the arith populator runs first |
| Guaranteed output types / attributes | tile types preserved as tile<S × element>; memref → nv_tileas.tiled_view<...> (static shape); CopyAtom and ReduceAtom witnesses carry verbatim; mem_semantic/mem_scope re-keyed with nv_tileas prefix but identical discriminant; SM100 block-scaled MMA emits an atom = #nv_tileas<atom umma_bs_...> attribute |
| Violation behavior | residual nv_tileaa.* executable op → applyPartialConversion fails with "failed to convert nv_tileaa to nv_tileas"; block-scale variant on cc ≤ sm_89 → "mma block scale is not supported by compute capability < sm100" before rewrite; missing target spec → "failed to get the target spec"; layout failures emit "missing source layout" / "failed to infer source layout" / "fails to assign layout"; queue-typed mismatch in mark_for_reuse → "expect operands with queue types" |
Pass Driver
runOnOperation populates three pattern groups in fixed order, attaches the kernel-spec attribute onto the function, constructs the conversion target, and applies it.
LogicalResult convertTileAAToTileAS(ModuleOp mod) {
  // A RewritePatternSet is bound to the MLIR context of the module it rewrites.
  RewritePatternSet patterns(mod.getContext());
  populateArithPatterns(patterns);      // 43-instantiation GenericOpPattern bank
  populateMathPatterns(patterns);       // math.* → nv_tileas.* with arith fallback
  populateTileAACorePatterns(patterns); // queue, execute, alias_token, memory ops
  attachKernelSpecAttributes(mod);      // mirrors cute.kernel onto nv_tileaa.kernel_spec
  ConversionTarget target = buildConversionTarget(mod);
  if (failed(applyPartialConversion(mod, target, std::move(patterns)))) {
    return emit("failed to convert nv_tileaa to nv_tileas");
  }
  return success();
}
ConversionTarget buildConversionTarget(ModuleOp mod) {
  ConversionTarget target(*mod.getContext());
  target.addLegalDialect<nv_tileas::TileASDialect,
                         arith::ArithDialect,
                         math::MathDialect,
                         func::FuncDialect,
                         gpu::GPUDialect,
                         scf::SCFDialect>();
  target.addIllegalDialect<nv_tileaa::TileAADialect>();
  // nv_tileaa.func, nv_tileaa.return, and nv_tileaa.mark_for_reuse stay legal:
  // they are owned by ConvertTileFuncToLLVM, which has not yet run.
  target.addLegalOp<nv_tileaa::FuncOp,
                    nv_tileaa::ReturnOp,
                    nv_tileaa::MarkForReuseOp>();
  return target;
}
The arith populator runs first because the math populator falls back to arith for any non-NVPTX-specific operation. Both run before the nv_tileaa core populator so the core sees already-lowered subexpressions when it walks operand types during rewrite. The kernel-spec attachment runs before the partial-conversion driver because the SM100 block-scale guard reads compute capability through the attached attribute.
Input and Output Dialects
| Direction | Surface |
|---|---|
| input ops | nv_tileaa.* (illegal after pass), arith.*, math.* |
| output ops | nv_tileas.* plus arith.* and math.* lowered to TileAS-form when the generic bank applies |
| attribute carriers | CopyAtomAttrInterface on memory ops, ReduceAtomAttrInterface on reduce / scan ops, nv_tileaa.kernel_spec on the function |
The shared rewrite shape for a memory op is:
input : %t = nv_tileaa.tiled_load %src, layout = #layout {copy_atom = #cute.copy_atom<...>}
output : %t = nv_tileas.tiled_load %src, layout = #layout {copy_atom = #cute.copy_atom<...>}
The witness attribute carries verbatim across the rewrite; the next stage (TileAS to LLVM) picks the concrete hardware primitive (cp.async, cp.async.bulk, tcgen05.cp, ldmatrix, stmatrix) from it.
Three Populators
| Populator | Size | Dialect family | Patterns |
|---|---|---|---|
| sub_733EF0 | 12.6 KB | arith | ~30 (the GenericOpPattern bank documented in the 43-instantiation arith bank) |
| sub_730C50 | 13.1 KB | math | ~25 (math.* to nv_tileas equivalents) |
| sub_72D810 | 13.0 KB | nv_tileaa core | ~35 (queue, execute, alias_token, memory ops) |
Each populator is a flat sequence: allocate a 0x68-byte pattern object, fill its vtable and OperationName, push into the pattern vector. The pattern bodies themselves live in the named pattern bank described below; the populators only materialize them.
Named Pattern Bank
Sixteen-plus TileAAToTileAS*OpPattern classes spanning sub_72A1C0 through sub_73C710 make up the dedicated patterns. Each is a 0x68-byte OpConversionPattern of the shape described in Pattern Categories: vtable pointer, interned OperationName, PatternBenefit, captured TypeConverter*, typeinfo-name string, and a small per-pattern tail. The vtables sit at consecutive offsets in 0x59B9000..0x59B9700, one slot per pattern, with the standard eight-entry RewritePattern dispatch order (destructor, deleting destructor, getRootKind, root-kind init, match, rewrite, clone, move helper).
Pattern bodies known by their op names are the global / memref family (nv_tileaa.global, get_global, make_memref,
block_tile, tiled_load) at sub_72A1C0, the copy-atom load/store/atomic family (load, store, tiled_load,
tiled_store, tiled_atomic_rmw, gather_load, scatter_store) at sub_7263C0..sub_728F50, the
extract_slice/convert_layout rewriter at sub_7297B0, the cat rewriter at sub_729D30, the plugin rewriter at
sub_7254B0, the generate rewriter at sub_738E70, the reduce and scan rewriters at sub_739A50 and sub_739FE0,
the mark_for_reuse verifier-style pattern at sub_73C190, and the SM100-gated dot lowering at sub_72C180. The copy
patterns each look up the mlir::nv_tile_ir::as::CopyAtomAttrInterface TypeID once via a double-checked init guarded by
byte_5B38C18 and binary-search the op's attribute dictionary for the resolved CopyAtom witness; the reduce and scan
patterns do the same against ReduceAtomAttrInterface cached in qword_5B38C00. Selection of a concrete hardware
primitive (cp.async, cp.async.bulk, LDGSTS, TMA tile or im2col, tcgen05.cp, ldmatrix, stmatrix) happens later in
the TileAS materialization pipeline; the attachment point is here.
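The lookup pattern described above (a value initialised once, then a binary search over a sorted attribute dictionary) can be sketched in standard C++. This is a simplification under stated assumptions: the real patterns key on an interface TypeID, whereas this toy keys on the attribute name "atom" taken from the tiled_load example on this page; ToyAttrDict, witnessKey, and findWitness are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy attribute dictionary: (name, value) pairs kept sorted by name.
using ToyAttr = std::pair<std::string, std::string>;
using ToyAttrDict = std::vector<ToyAttr>;

// One-time initialisation: a function-local static is initialised exactly
// once and thread-safely, the standard C++ counterpart of a hand-written
// double-checked guard. Using a name as the key is a simplification.
inline const std::string &witnessKey() {
  static const std::string key = "atom";
  return key;
}

// Binary search over the sorted dictionary; returns nullptr when absent.
inline const std::string *findWitness(const ToyAttrDict &dict) {
  auto it = std::lower_bound(
      dict.begin(), dict.end(), witnessKey(),
      [](const ToyAttr &entry, const std::string &key) {
        return entry.first < key;
      });
  if (it == dict.end() || it->first != witnessKey())
    return nullptr;
  return &it->second;
}
```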
A handful of diagnostics from this layer outline the bank: "TODO: only reg and smem layouts are supported at the moment" from sub_7297B0, "missing source layout" and "failed to infer source layout" from sub_729D30, "plugin has unsupported feature" and "fails to assign layout" from sub_7254B0, "failed to convert block signature" from sub_738E70, and "expect operands with queue types" from sub_73C190.
Per-Pattern Walks
tiled_load Witness Hand-Off
The TileAA tiled_load already carries a CopyAtomAttrInterface witness chosen by the layout-assignment pre-pass; the TileAS rewrite preserves both the witness and the surrounding operand vector verbatim. The mnemonic changes, the operand layout stays one-for-one, and the result-type stays tile<S × element>. The witness slot is still an AtomAttr, but the TileAS verifier reads it through CopyAtomAttrInterface rather than through the TileAA accessor:
// Before
%v, %t1 = nv_tileaa.tiled_load %mr_a[%i, %k], %t0
    { atom = #cute.copy_atom<sm90_tma_load_2d>,
      in_bounds = array<i1: true, true>,
      mem_semantic = #nv_tileaa<mem_semantic relaxed>,
      mem_scope = #nv_tileaa<mem_scope cluster>,
      operandSegmentSizes = array<i32: 1, 2, 0, 1> }
    : !nv_tileaa.memref<?x?xf16, 1>, index, index, !nv_tileaa.mem_token
    -> tile<128x32xf16>, !nv_tileaa.mem_token
// After
%v, %t1 = nv_tileas.tiled_load %mr_a[%i, %k], %t0
    { atom = #cute.copy_atom<sm90_tma_load_2d>,
      in_bounds = array<i1: true, true>,
      mem_semantic = #nv_tileas<mem_semantic relaxed>,
      mem_scope = #nv_tileas<mem_scope cluster>,
      operandSegmentSizes = array<i32: 1, 2, 0, 1> }
    : !nv_tileaa.tiled_view<128x32xf16>, index, index, !nv_tileaa.mem_token
    -> tile<128x32xf16>, !nv_tileaa.mem_token
The view-typed operand changes shape: nv_tileaa.memref<?x?xf16, 1> becomes nv_tileaa.tiled_view<128x32xf16> because TileAS represents the access through the static tile box rather than the parent dynamic memref. The tiled_view type itself is declared in the alias-aware dialect and survives the rewrite untouched; only the producer mnemonic changes. The TypeConverter materialises a nv_tileas.view operation upstream so the rewritten tiled_load consumes an already-typed view; the materialiser is not visible at the call site of the rewrite, but its output feeds the operand slot during partial conversion.
The mem_semantic and mem_scope enum attributes change their dialect prefix but retain identical discriminant values. The CopyAtomAttrInterface witness is the only attribute that is dialect-neutral — #cute.copy_atom<sm90_tma_load_2d> carries through unchanged because the cute dialect publishes the witness interface for both consumers (the SM-specific field attributes that implement it live in cute_nvgpu).
dot Dispatch by Compute Capability
nv_tileaa.dot lowers to a single nv_tileas.dot op in the general case, but the SM100 block-scale guard at sub_72C180 redirects the variant that consumes per-block scale factors to nv_tileas.block_scaled_mma. The dispatcher reads the compute_capability integer encoded as major * 10 + minor from the attached target spec and the is_block_scale_variant flag the validator sets after MMA-shape inspection:
// Before (plain dot, every compute capability ≥ sm70)
%d = nv_tileaa.dot %av, %bv, %c_in
    { operandSegmentSizes = array<i32: 1, 1, 1, 0, 0> }
    : tile<128x32xf16>, tile<32x128xf16>, tile<128x128xf32>
    -> tile<128x128xf32>
// After (SM90 path, Hopper warpgroup MMA)
%d = nv_tileas.dot %av, %bv, %c_in
    { atom = #nv_tileas<atom mma_f16_f16_f32>,
      operandSegmentSizes = array<i32: 1, 1, 1, 0, 0> }
    : tile<128x32xf16>, tile<32x128xf16>, tile<128x128xf32>
    -> tile<128x128xf32>
For the block-scaled variant on SM100 the rewrite uses a different op:
// Before (block-scaled MMA, requires sm_100+)
%d = nv_tileaa.mma_block_scale %av, %bv, %c_in, %scale_a, %scale_b
    : tile<128x32xe4m3>, tile<32x128xe4m3>, tile<128x128xf32>,
      tile<128x1xui8>, tile<1x128xui8>
    -> tile<128x128xf32>
// After (sm_100)
%d = nv_tileas.block_scaled_mma %av, %bv, %c_in, %scale_a, %scale_b
    { atom = #nv_tileas<atom umma_bs_e4m3_e4m3_f32>,
      cta_group = 1 : i32 }
    : tile<128x32xe4m3>, tile<32x128xe4m3>, tile<128x128xf32>,
      tile<128x1xui8>, tile<1x128xui8>
    -> tile<128x128xf32>
The atom attribute attached on the way out names the concrete MMA instruction family the materialiser will eventually pick. Capability ≤ 89 fails with "mma block scale is not supported by compute capability < sm100" before any rewrite is attempted; capabilities 90 through 99 fall through to the plain nv_tileas.dot path, which lowers to nvvm.wgmma.* downstream.
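The dispatch just described can be condensed into a small decision function. This is one reading of the prose, not recovered code: ToyDotDecision and dispatchDot are hypothetical names, the capability is encoded as major * 10 + minor as stated above, and the treatment of the band between 90 and 99 for the block-scale variant is an interpretation.

```cpp
#include <cassert>
#include <string>

// Outcome of the toy dispatcher: either an op name or a diagnostic.
struct ToyDotDecision {
  bool ok;
  std::string text; // op mnemonic when ok, diagnostic message otherwise
};

// computeCapability is encoded as major * 10 + minor (sm_100 -> 100).
inline ToyDotDecision dispatchDot(int computeCapability,
                                  bool isBlockScaleVariant) {
  if (!isBlockScaleVariant) // plain dot needs no capability guard here
    return {true, "nv_tileas.dot"};
  if (computeCapability <= 89) // rejected before any rewrite is attempted
    return {false,
            "mma block scale is not supported by compute capability < sm100"};
  if (computeCapability < 100) // falls through to the plain dot path
    return {true, "nv_tileas.dot"};
  return {true, "nv_tileas.block_scaled_mma"};
}
```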
reduce and scan Region Hand-Off
Region-bearing operations preserve their combiner body across the rewrite. The TileAS forms accept the same block-argument types because TileAA already published them as bare element types — no region-types conversion runs here:
// Before
%sum = nv_tileaa.reduce %values { axis = 1 : i32 }
    : tensor<8x64xf32> -> tensor<8xf32> {
^bb0(%acc: f32, %val: f32):
  %s = nv_tileaa.addf %acc, %val : f32
  nv_tileaa.yield %s : f32
}
// After
%sum = nv_tileas.reduce %values
    { atom = #nv_tileas<reduce_atom warp_shfl_xor_f32>,
      axis = 1 : i32 }
    : tensor<8x64xf32> -> tensor<8xf32> {
^bb0(%acc: f32, %val: f32):
  %s = arith.addf %acc, %val : f32
  nv_tileas.yield %s : f32
}
Two changes come in alongside the mnemonic swap. First, a ReduceAtomAttrInterface witness is attached on the way out — selected by the layout-assignment pre-pass and looked up through the cached TypeID at qword_5B38C00. Second, the combiner body's nv_tileaa.addf rewrites to upstream arith.addf rather than a nv_tileas.addf: the arith populator that runs first in the populator order has already lowered all body-internal arithmetic, and the core populator picks up the parent reduce only after the body is in arith form. scan follows exactly the same shape, only differing in mnemonic and in producing a same-rank cumulative result.
func Boundary Stays Legal
nv_tileaa.func, nv_tileaa.return, and nv_tileaa.mark_for_reuse are explicitly listed as legal in the conversion target. The pass leaves them untouched — ConvertTileFuncToLLVM owns the boundary rewrite. As a result an entry function survives this pass with nv_tileaa types on its signature even though the body has been fully lowered to TileAS:
nv_tileaa.func @kernel(%a: !nv_tileaa.ptr<f16, 1>, %b: !nv_tileaa.ptr<f16, 1>,
                       %c: !nv_tileaa.ptr<f32, 1>)
    attributes { cute.kernel,
                 nv_tileaa.kernel_spec = #nv_tileaa.kernel_spec<numWarps=4, clusterDim=[2,1,1]> } {
  // body: every executable op now TileAS-typed
  ...
  nv_tileaa.return
}
The kernel-spec attribute attached by attachKernelSpecAttributes is a mirror of the cute.kernel attribute set; the function signature still carries TileAA-typed pointers until the next stage lifts them through the bare-pointer ABI.
137 realloc_insert Trampolines
137 near-identical 343-byte trampolines fill 0x7000E0..0x70FC80, one per push into the pattern vector. Each is a distinct instantiation of std::vector<std::unique_ptr<RewritePattern>>::_M_realloc_insert, byte-identical apart from the move-constructor vtable offset the inlined relocation loop calls for the unique_ptr's Pattern::T destructor. The count is 137 because the three populators add inserts at multiple PatternBenefit levels: only about 90 distinct pattern classes exist, but several get registered through more than one trampoline. The trampolines defer capacity growth to sub_6E6530, whose sole string is "vector::_M_realloc_insert".
SM100 MMA Block-Scale Guard
sub_72C180 (2 970 B) wraps the nv_tileaa.mma_block_scale to nv_tileas.block_scaled_mma lowering with a target-spec check. The pattern reads the kernel-spec and target-spec from the module, asserts both are present (otherwise emits "failed to get the target spec"), runs the MMA shape validator at sub_14B71C0, then guards the block-scaled variant on compute capability:
v82 = validate_mma_shape(...);                  // sub_14B71C0
v84 = get_compute_capability(target_spec);      // sub_152FDA0
if (is_block_scale_variant(v82) && cc_int(v84) <= 99)
    return emit("mma block scale is not supported by compute capability < sm100");
The integer encoding is major * 10 + minor, so the inclusive <= 99 gate rejects every capability below sm_100, including sm_90 (encoded as 90), and admits sm_100, sm_103, sm_110, sm_120, and sm_121. The default compute capability baked into the pass constructor (sub_738810) is "sm_80", which means the gate is closed on the default invocation: the pipeline driver must raise the capability through the --compute-capability option before the block-scale path becomes reachable. The same function then validates the MMA partition ("failed to find available mma partition") and infers the 2D layout ("failed to infer 2d layout") before building nv_tileas.dot. The atom-K and vector-size triple table the validator consults is documented in MMA Atoms sm70-120 — Operand Contract by Tier.
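The capability arithmetic behind the gate can be sketched as follows. This is a minimal illustration that assumes only the major * 10 + minor encoding and the <= 99 comparison shown in the pseudocode above; the names encode_cc and block_scale_rejected are hypothetical and are not symbols from the binary.

```cpp
#include <cassert>

// Hypothetical helper: encode a compute capability as major * 10 + minor,
// so sm_90 becomes 90 and sm_103 becomes 103.
constexpr int encode_cc(int major, int minor) { return major * 10 + minor; }

// Hypothetical mirror of the guard: a block-scaled MMA is rejected whenever
// the encoded capability is <= 99.
constexpr bool block_scale_rejected(int cc) { return cc <= 99; }
```

Under these assumptions the default sm_80 and also sm_90 land on the rejecting side, while every accepted driver target from sm_100 upward passes.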
Kernel-Spec Attachment
sub_72B8E0 walks the function looking for cute.kernel attributes emitted by ConvertTileFuncToLLVM and attaches mirroring nv_tileaa.kernel_spec attributes. The mirror lets downstream TileAS passes read kernel parameters such as numWarps, clusterDim, and occupancy directly from the operation's attribute dictionary, without traversing back to the LLVM-level function attributes. The reader interns the attribute name "nv_tileaa.kernel_spec" (length 21) once through the StringAttr getter and walks the op's attribute dictionary at offset +56. A close variant sub_72BCD0 does the same work while also touching the SymbolTable trait. Both are read-only; writes to the kernel-spec attribute happen through the verifier in Strand C.
Conversion Invariants
Executable nv_tileaa operations must not survive the pass — applyPartialConversion reports failure if any illegal-dialect operation remains. CopyAtom and ReduceAtom witnesses on nv_tileaa memory operations must be preserved exactly onto their nv_tileas replacements, because later passes use them to pick the concrete hardware primitive. The kernel-spec attribute must attach before the first pattern that reads compute capability runs, so the sm100 guard in sub_72C180 has a non-null target spec to consult. Populator order has to stay arith, math, nv_tileaa core — both for the math-to-arith fallback and so the core populator's operand-type walks see already-lowered subexpressions.
Cross-References
- Pattern Categories documents the dedicated OpConversionPattern layout and the 43-instantiation arith bank shared with the arith populator.
- Convert cuda_tile to TileAA covers the previous boundary that produces the nv_tileaa input this pass consumes.
- TileAS to LLVM — Tile Memory and Descriptor Lowering is the downstream materialization that resolves the CopyAtom and ReduceAtom witnesses attached here into concrete instructions.
- MMA Atoms sm70-120 — Operand Contract by Tier lists the atom-K and vector-size triples consulted by the SM100 block-scale validator.
- nv_tileas Op Roster — Tiled Memop Operand/Result Tables gives the operand-and-attribute tables the per-pattern walks here build against.
- DSL to PTX End-to-End — Stage 3: nv_tileas IR (after scheduling) renders a single GEMM kernel just after this pass plus the async-pipeline family runs; the nv_tileaa.dot rewritten here surfaces as the nv_tileas.dot inside an async.pipeline consumer region, carrying the SM90 WGMMA atom this pass selected.
Lowering: nv_tileas to LLVM
Abstract
ConvertTileASToLLVM is the terminal Tileiras lowering stage. It consumes a module already scheduled, layout-assigned, pipelined, and materialized in TileAS, then rewrites the remaining TileAA, TileAS, CuTe, CUTLASS, arithmetic, math, vector, and utility operations into LLVM and NVVM dialect operations. The output is a GPU module ready to translate to LLVM IR and then to NVPTX.
Function-boundary conversion runs before body conversion. The order matters: kernel function signatures and attributes must be LLVM-compatible before body patterns lower pointers, descriptors, async tokens, barriers, and target-specific operations.
Boundary Contract
| Dimension | Specification |
|---|---|
| Allowed input ops | residual nv_tileaa.*, every nv_tileas.* including async.pipeline.*, cute.*, cute_nvgpu.*, cutlass.*, surviving arith.* and math.*; nv_tileaa.func, nv_tileaa.return, nv_tileaa.mark_for_reuse enter legal because the sister pass already handled the function boundary |
| Allowed input types / attributes | TileAA / TileAS memref, view, token, layout, atom types; CopyAtom / ReduceAtom witnesses; nv_tileaa.kernel_spec, nv_tileaa.compute_capability, nv_tileaa.target_spec on the function/module (required when kernel-translation path is active); extended-SMEM byte budget in pass[6] |
| Guaranteed output ops | llvm.* and nvvm.* plus statically-legal gpu.module containers; surviving cute.*/cute_nvgpu.*/cutlass.* only when consumed by the companion lowering; no executable nv_tileaa.* or nv_tileas.* op remains; builtin.unrealized_conversion_cast stripped by phase 8 reconciliation |
| Guaranteed output types / attributes | LLVM descriptor structs, address-space-qualified llvm.ptr, i32 async-pipeline tokens, i1 mbarrier results; kernel-spec mirrored to nvvm.reqntid / nvvm.cluster_dim / nvvm.blocksareclusters / nvvm.minctasm / nvvm.maxnreg per the eight-row translation table; global_smem synthesised at addr-space 3, align 16 when pass[6] > 0 |
| Violation behavior | absent compute_capability or target_spec → "Failed to get ComputeCapability" (sev 0x103); kernel return with operands → "Kernel functions do not support return with operands"; per-phase populator failures emit "fails to decompose print ops" / "fails to do bufferization analysis" / "region types conversion failed"; PDL fallback failure → "failed to lower PDL pattern module to the PDL Interpreter"; residual TileAA/TileAS op → applyPartialConversion fails (sticky failure bit 2 of pass[5] is set) |
Input and Output Dialects
| Direction | Surface |
|---|---|
| input ops | residual nv_tileaa.*, all nv_tileas.*, cute.*, cute_nvgpu.*, cutlass.*, plus remaining arith.* and math.* that did not lower in earlier stages |
| input types | TileAA / TileAS memref, view, token, layout, and atom types |
| output ops (legal) | llvm.*, nvvm.*, gpu.module (container only), surviving cute.* / cute_nvgpu.* / cutlass.* consumed by companion lowering |
| output types | LLVM descriptor structs, LLVM pointers (address-space-qualified), i32 async tokens, i1 mbarrier results |
The canonical rewrite shapes for the major TileAS families are:
nv_tileas.alloc_tensor -> llvm.mlir.addressof @global_smem + llvm.getelementptr
nv_tileas.convert_layout -> sequence of llvm.extractvalue / llvm.insertvalue, possibly via stmatrix / ldmatrix
nv_tileas.tiled_load -> nvvm.cp.async / cp.async.bulk / tma.tile (selected from CopyAtom witness)
nv_tileas.tiled_store -> nvvm.cp.async.bulk.s2g / stmatrix (selected from CopyAtom witness)
nv_tileas.async.pipeline.* -> i32 token phase + nvvm.mbarrier.* arrive / wait
nv_tileas.dot -> nvvm.wgmma.* (SM90) or nvvm.tcgen05.mma (SM100)
Pass Ordering
The LLVM lowering stage is two passes that the driver runs in sequence:
1. ConvertTileFuncToLLVM rewrites nv_tileaa.func and nv_tileaa.return into func.func and func.return, applies the bare-pointer kernel ABI, and translates the kernel-spec attribute set into nvvm.* discardable attributes.
2. ConvertTileASToLLVM rewrites bodies in nine phases (described below), starting with shared-memory global synthesis and ending with cast reconciliation.
Function-boundary conversion runs first because body conversion needs LLVM-typed function arguments: every body pattern that reads or writes a kernel argument depends on the bare-pointer ABI having been applied. Reversing the order would produce body lowerings against argument types that are still nv_tileaa-typed, and the cast-reconciliation phase would have nothing to reconcile against.
Function Boundary Conversion
ConvertTileFuncToLLVM (CLI: convert-nv-tile-func-to-llvm) is the function-boundary lifter that runs before ConvertTileASToLLVM. It rewrites nv_tileaa.func and nv_tileaa.return into func.func and func.return, performing full kernel-attribute translation against the active target. The body at sub_1159990 is 6 172 B across 239 basic blocks and calls 51 distinct helpers. sub_1156310 registers the dependent dialects llvm, func, and cutlass — cutlass is pulled in so residual cutlass.* markers carried alongside kernel metadata stay legal during the rewrite.
The pass runs as a four-phase state machine. Three module/function attributes gate the kernel-translation path:
| Attribute | Reader | Reader size |
|---|---|---|
| nv_tileaa.kernel_spec | sub_13FE910 | 0x574 B |
| nv_tileaa.compute_capability | sub_13FB490 | 0x179 B |
| nv_tileaa.target_spec | sub_13FB490 | 0x179 B |
If any of the three is absent, the kernel path is skipped and only the plain signature rewrite runs; the function is a host-side helper as far as this pass is concerned.
Phase 1 — TypeConverter build
sub_15685F0 reads argument types, sub_4419090 rewrites each one, and the rewritten types collect into a SmallVector<Type, 6> with inline header 0x600000000 (inline cap=6, size=0). sub_43FE7A0 then pins the vector back into the new function-type slot. The inline-6 choice is sized for typical Tileiras kernels — argument-buffer pointers plus a handful of scalar launch parameters fit without spilling to the heap.
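The packed header value can be read as two 32-bit fields. The sketch below assumes the low half holds the element count and the high half holds the inline capacity, which is what the "(inline cap=6, size=0)" reading above implies; InlineHeader and decode_header are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the packed size/capacity word.
struct InlineHeader {
  std::uint32_t size;      // low 32 bits (assumed)
  std::uint32_t capacity;  // high 32 bits (assumed)
};

// Split the 64-bit word into its two halves.
constexpr InlineHeader decode_header(std::uint64_t word) {
  return InlineHeader{static_cast<std::uint32_t>(word & 0xFFFFFFFFu),
                      static_cast<std::uint32_t>(word >> 32)};
}
```

Decoding 0x600000000 this way gives size 0 and capacity 6, matching the empty inline-6 vector the phase starts with.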
Phase 2 — Three conversion patterns
Pattern construction goes through sub_1158660 (4 052 B, 213 BB). It installs three OpConversionPattern subclasses, each a 0x68-B object — 8 B wider than upstream OpConversionPattern to accommodate an RTTI string parked at slots +64/+72. The trio:
| Pattern | Vtable | Role |
|---|---|---|
| FuncOpConversion | off_59D57E8 | Rewrites nv_tileaa.func into func.func with the converted signature and transfers kernel attributes when the kernel-spec triple is present. |
| ReturnOpConversion | off_59D5838 | Rewrites nv_tileaa.return into func.return and enforces the kernel-return policy described in Phase 4. |
| CastOpElimination | off_59D5888 | Eliminates the unrealized casts that the signature rewrite introduces between the old and new argument SSA values. |
The PDL fallback sub_36F9730 runs unconditionally so pattern authors can express auxiliary rewrites in PDL; sub_36CB0C0 then drives applyPartialConversion. A failure raises "region types conversion failed" and sets bit 2 of pass+40, matching the sticky failure-reported scheme the sister pass uses.
Phase 3 — Kernel-attribute transfer
This phase runs only when all three attribute reads return non-empty. It performs BarePtr-style ABI translation: argument-buffer slots become pointer-sized LLVM args, and every kernel-spec field becomes an nvvm.* discardable attribute. The full eight-row translation table:
| Source (nv_tileaa.*) | Destination | Type | Emission predicate |
|---|---|---|---|
| kernel_spec (presence) | cute.kernel | UnitAttr | always when kernel_spec valid |
| kernel_spec.numWarps | nvvm.reqntid | IntegerAttr<i32> | always; value = 32 * numWarps |
| kernel_spec.clusterDim{X,Y,Z} | nvvm.cluster_dim | IntegerAttr<i32> | targetSM > 89 && clusterProduct > 1 |
| (cluster gating) | nvvm.blocksareclusters | UnitAttr | same predicate as nvvm.cluster_dim |
| (constant 1) | nvvm.minctasm | IntegerAttr<i32> | always |
| nv_tileaa.occupancy | nvvm.maxnreg | IntegerAttr<i32> | iff occupancy set; value from sub_13FDB70 (per-SM occupancy → maxnreg table) |
| nv_tileaa.compute_capability | (consumed) | IntegerAttr | gates SM-dependent emission only |
| nv_tileaa.target_spec | (consumed) | StringAttr | gates SM-dependent emission only |
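The emission predicates in the table can be sketched as a small function. This is an illustrative model, not the pass's implementation: KernelSpec and translate_kernel_attrs are hypothetical names, attribute values are rendered as strings for simplicity, and the maxnreg value is taken as an input because the occupancy table in sub_13FDB70 is not reproduced here.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified view of the kernel-spec fields named in the table.
struct KernelSpec {
  int numWarps;
  int clusterDim[3];
  bool hasOccupancy;  // true when nv_tileaa.occupancy is set
  int maxnreg;        // assumed pre-resolved; the real lookup table is not modelled
};

// Returns the attributes the table says are emitted for a given target SM.
std::map<std::string, std::string> translate_kernel_attrs(const KernelSpec &spec,
                                                          int targetSM) {
  std::map<std::string, std::string> out;
  out["cute.kernel"] = "unit";                               // presence marker
  out["nvvm.reqntid"] = std::to_string(32 * spec.numWarps);  // always emitted
  const long clusterProduct =
      1L * spec.clusterDim[0] * spec.clusterDim[1] * spec.clusterDim[2];
  if (targetSM > 89 && clusterProduct > 1) {                 // cluster gating
    out["nvvm.cluster_dim"] = std::to_string(spec.clusterDim[0]) + "," +
                              std::to_string(spec.clusterDim[1]) + "," +
                              std::to_string(spec.clusterDim[2]);
    out["nvvm.blocksareclusters"] = "unit";
  }
  out["nvvm.minctasm"] = "1";                                // constant 1
  if (spec.hasOccupancy)                                     // only when occupancy is set
    out["nvvm.maxnreg"] = std::to_string(spec.maxnreg);
  return out;
}
```

With numWarps = 4 and clusterDim = [2,1,1], this model emits nvvm.reqntid = 128 and the two cluster attributes on an SM100 target, and drops the cluster attributes when targetSM is 89.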
The translation uses dual-path DictionaryAttr lookup. The interned StringAttr "mlir::FunctionOpInterface]" — cached at qword_5B37670 behind the double-checked lock byte_5B37668 — drives the fast-path pointer comparison; the slow path goes through sub_43F70F0 to rebuild the StringAttr key and sub_446DC50 / sub_446DC70 to search the dictionary and install the new entry. The split exists because the interned key is cheaper than a string compare, but the cache may be cold on the first kernel in the module.
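The fast-path / slow-path split described above can be modelled with a toy string interner. This is a sketch under stated assumptions: Interner, Dict, lookup_fast, and lookup_slow are hypothetical names, and the real code works on MLIR StringAttr and DictionaryAttr storage rather than on std::string.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy interner: equal strings always map to the same stable pointer.
class Interner {
 public:
  const std::string *intern(const std::string &s) {
    return &*pool_.insert(s).first;  // element addresses survive rehashing
  }

 private:
  std::unordered_set<std::string> pool_;
};

// Toy attribute dictionary keyed by interned names.
using Dict = std::vector<std::pair<const std::string *, int>>;

// Fast path: the caller already holds the interned key, so pointer
// equality is enough and no string comparison happens.
inline const int *lookup_fast(const Dict &dict, const std::string *key) {
  for (const auto &entry : dict)
    if (entry.first == key) return &entry.second;
  return nullptr;
}

// Slow path: rebuild the key by interning the name, then search.
inline const int *lookup_slow(const Dict &dict, Interner &pool,
                              const std::string &name) {
  return lookup_fast(dict, pool.intern(name));
}
```

The design point is that a cached interned key turns each dictionary probe into a pointer comparison, while the slow path pays for one interning step when no cached key is available.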
The cute.kernel placeholder is deliberately not renamed to nvvm.kernel here. That rewrite lives in the downstream CuteKernelToNvvmRewrite pass at sub_1698C20. The split exists because the downstream pass also lifts cute_nvgpu.grid_constant argument attributes to nvvm.grid_constant, and that lift needs the LLVM-legal function arguments this pass has just produced. Doing both rewrites in one pass would force grid-constant migration against not-yet-lowered argument types.
Phase 4 — Kernel-return policy
ReturnOpConversion::matchAndRewrite at sub_11565D0 enforces the kernel-return policy. If the parent op is a kernel (*(parent+46) < 0) and the return carries any operands, the pattern emits "Kernel functions do not support return with operands" at severity 259 and fails. Empty returns become func.return unconditionally. Non-kernel nv_tileaa.return is rewritten with whatever operands it carried, since regular func.return accepts arbitrary value lists.
Dynamic legality
nv_tileaa.func dynamic legality at sub_1156400 (0x1E4 B) is a four-way unrolled operand walk that returns "illegal — must rewrite" whenever any argument or result type carries a non-null operand-value pointer — i.e. is still nv_tileaa-typed. Purely LLVM signatures are already legal and skip the pattern entirely, so a function that has already been lifted (for instance because the producer emitted LLVM types directly) doesn't pay for a redundant rewrite.
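LogicalResult lower_kernel_return(ReturnOp op, Rewriter *rw) {
    // Kernels may only return void; reject any operand-carrying return.
    if (is_kernel(op.parent_function()) && !op.operands().empty()) {
        return op.emit_error("Kernel functions do not support return with operands");
    }
    // Empty kernel returns and all non-kernel returns keep whatever operands they carried.
    rw->replace_op_with_new_op(op, "func.return", op.operands());
    return success();
}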
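The dynamic-legality rule for nv_tileaa.func described in this section can be modelled as a predicate over a function signature. The sketch below is a toy stand-in that represents types by their textual spelling and treats any type beginning with "!nv_tileaa." as not yet lifted; is_tileaa_type and func_signature_is_legal are hypothetical names, and the real predicate inspects type storage rather than strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy check: a type still needs rewriting if it is spelled as a TileAA type.
inline bool is_tileaa_type(const std::string &type) {
  return type.rfind("!nv_tileaa.", 0) == 0;  // prefix match at position 0
}

// A signature is legal only when no argument or result type is TileAA-typed.
inline bool func_signature_is_legal(const std::vector<std::string> &args,
                                    const std::vector<std::string> &results) {
  for (const std::string &t : args)
    if (is_tileaa_type(t)) return false;
  for (const std::string &t : results)
    if (is_tileaa_type(t)) return false;
  return true;
}
```

A function whose signature already consists of LLVM-compatible types passes the predicate and is skipped by the pattern, which is the "no redundant rewrite" behaviour the text describes.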
Body Conversion Phases
ConvertTileASToLLVM::runOnOperation lives at sub_11547D0, a 20 KB, 180-basic-block body whose only state argument is the pass instance itself. The MLIRContext is recovered from pass[5] & ~7; the three low bits of that word encode skip-pipeline, an index-bitwidth != 0 marker, and a sticky failure-reported flag the pass body sets in place of calling signalPassFailure() directly. The pass-manager wrapper inspects the bit after return. Compute capability and target spec are fetched via sub_13FB490, which keys off the nv_tileaa.compute_capability and nv_tileaa.target_spec module attributes; absence produces "Failed to get ComputeCapability" (severity 259/0x103) routed through sub_446CE00 before the failure bit is set and the pass returns.
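The pointer-plus-flags packing of that word can be sketched as below. This is an illustration only: the flag names and helper functions are hypothetical, bits 0 and 2 follow the skip-pipeline and failure-reported uses described here, and placing the index-bitwidth marker on bit 1 is an assumption drawn from the order in which the three flags are listed.

```cpp
#include <cassert>
#include <cstdint>

// Flags packed into the low three bits of an 8-byte-aligned pointer word.
enum PassFlag : std::uintptr_t {
  kSkipPipeline = 1,      // bit 0
  kIndexBitwidthSet = 2,  // bit 1 (assumed position)
  kFailureReported = 4,   // bit 2, sticky failure flag
};

// Recover the pointer by masking off the three flag bits (word & ~7).
inline void *context_of(std::uintptr_t word) {
  return reinterpret_cast<void *>(word & ~static_cast<std::uintptr_t>(7));
}

inline bool has_flag(std::uintptr_t word, PassFlag flag) {
  return (word & flag) != 0;
}

// Set the sticky failure bit without disturbing the pointer or other flags.
inline std::uintptr_t mark_failed(std::uintptr_t word) {
  return word | kFailureReported;
}
```

The scheme works only because the pointed-to object is at least 8-byte aligned, which leaves the low three bits of its address free.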
Before any rewrite pattern installs, the pass body emits the shared-memory scratch global via sub_1144DA0 (749 bytes, 14 basic blocks). The helper looks up an existing global_smem symbol with sub_1144CC0; if none exists, it synthesises llvm.mlir.global @global_smem ... addr_space(3) align 16 : !llvm.array<N x i8> with N = pass[6] >> 2. Kernels with no extended shared memory request (pass[6] <= 0) short-circuit the helper and emit no global. The body also exercises the standard MLIR registration-probe diagnostic. The binary stores the message as two halves, "Building op `" and "` but it isn't known in this MLIRContext: the dialect may not be loaded or this operation hasn't been added by the dialect. See also https://mlir.llvm.org/getting_started/Faq/#registered-loaded-dependent-whats-up-with-dialects-management", which the helper concatenates around the op name to distinguish "op registered" from "op missing" without aborting. This is a deliberate use of MLIR's diagnostic infrastructure as a registration test, not an error path.
LogicalResult run_convert_tileas_to_llvm(Pass *pass) {
    MLIRContext *ctx = (MLIRContext *)(pass[5] & ~7uLL);
    TargetSpec *spec = sub_13FB490(pass);                       // compute_capability / target_spec
    if (!spec) {
        emit_error(ctx, "Failed to get ComputeCapability");     // diag 0x103
        pass[5] |= 4uLL;                                        // failure bit, no signalPassFailure()
        return failure();
    }
    if (failed(sub_1151450(ctx, sub_1151520, &patterns))) {     // (1) decompose-print
        emit_error(ctx, "fails to decompose print ops");
        pass[5] |= 4uLL; return failure();
    }
    if (failed(sub_11523A0(ctx, sub_1152460, &patterns))) {     // (2) bufferization analysis
        emit_error(ctx, "fails to do bufferization analysis");
        pass[5] |= 4uLL; return failure();
    }
    sub_1144DA0(pass);                                          // global_smem emission (if pass[6] > 0)
    sub_114F970(ctx, sub_114D1B0, &patterns);                   // (3) main nv_tileaa/nv_tileas
    sub_114F880(ctx, sub_1150300, &patterns);                   // (4a) bulk supplementary patterns
    for (Slot *slot = barrier_map.slots; slot < barrier_map.end; ++slot) {
        if (slot->key == -4096 || slot->key == -8192) continue; // empty / tombstone sentinels
        sub_114BB00(slot, ctx, &patterns);                      // cluster/barrier replay
    }
    sub_114FA40(ctx, sub_114DC50, &patterns);                   // (5) cute / cute_nvgpu
    if (!(pass[5] & 1uLL)) {                                    // skip-pipeline gate
        sub_114FB10(ctx, sub_114EA50, &patterns);               // (6) async.pipeline
    }
    sub_1153E30(ctx, sub_11540E0, &patterns);                   // (7) arith / llvm / math cleanup
    sub_1154530(ctx, sub_1155A80, &patterns);                   // (8) reconcileUnrealizedCasts
    sub_11508F0(ctx, sub_1150580, &patterns);                   // (9) late materializer
    sub_115E240(&target, &type_converter);                      // configureConversionTarget
    if (failed(sub_36F9730(&patterns))) {                       // PDL → PDLInterp fallback
        emit_error(ctx, "failed to lower PDL pattern module to the PDL Interpreter");
        pass[5] |= 4uLL; return failure();
    }
    if (failed(sub_36CB0C0(module, &target, &patterns))) {      // applyPartialConversion engine
        pass[5] |= 4uLL; return failure();
    }
    teardown_dense_maps(pass);                                  // 8 slot-array free loops + ~Op() vtable+8
    return success();
}
Pattern installation is six identical trampoline/body pairs plus three additional driver/populator pairs in the same shape. Every trampoline is a 199-230 byte, 14-basic-block skeleton that captures the conversion target and the shared TypeConverter, tail-calls its inner populator, and bubbles the resulting bool. Every populator body is a 54-basic-block emplace_back chain over std::vector<std::unique_ptr<RewritePattern>>; each emplace_back resolves to one sub_44A8C20(0x68) arena allocation paired with one indirect pattern-vtable construction call, so the 54 blocks correspond to 54 distinct pattern classes per phase.
| # | Phase | Trampoline | Body | Role | Diagnostic | Driver |
|---|---|---|---|---|---|---|
| 1 | decompose-print | sub_1151450 | sub_1151520 | Decomposes nv_tileaa print operations under a FunctionOpInterface classof guard | "fails to decompose print ops" | applyPartialConversion |
| 2 | bufferize | sub_11523A0 | sub_1152460 | Bufferization-analysis driver; assigns buffer forms required by later memory rewrites | "fails to do bufferization analysis" | applyPartialConversion |
| 3 | main TileAA/TileAS | sub_114F970 | sub_114D1B0 | Main nv_tileaa/nv_tileas → llvm/nvvm rewrites: tid-arith, ctaid/gridDim, warp shuffle, mbarrier init, TMA load/store, atomic RMW | — | applyPartialConversion |
| 4 | bulk supplementary | sub_114F880 | sub_1150300 | Additional lowerings populated in a nested scope between the main roster and the cluster/barrier replay | — | applyPartialConversion |
| 5 | cute / cute_nvgpu | sub_114FA40 | sub_114DC50 | cute.* and cute_nvgpu.* layout, copy, and SM100 arch helpers including cute_nvgpu.arch.sm100.retrieve_tmem_ptr | — | applyPartialConversion |
| 6 | async.pipeline | sub_114FB10 | sub_114EA50 | nv_tileas.async.pipeline.{create_pipeline, produce_one, consume_one, yield}; skipped when skip-pipeline is set | — | applyPartialOneToNConversion |
| 7 | arith / llvm cleanup | sub_1153E30 | sub_11540E0 | Final upstream arith / math → llvm cleanup | — | applyPartialConversion |
| 8 | reconcile-unrealized-casts | sub_1154530 | sub_1155A80 | reconcileUnrealizedCasts: strips leftover builtin.unrealized_conversion_cast ops | — | applyFullConversion |
| 9 | late materializer | sub_11508F0 | sub_1150580 | Fill-remainder helpers; emits PDL-fallback-friendly type materialisers | — | applyPartialConversion |
Between phases 4 and 5 the pass body walks an 80-byte DenseMap slot array using the standard LLVM ADT sentinels (-4096 empty, -8192 tombstone) and invokes sub_114BB00 (5.3 KB) on every non-sentinel slot. This is the cluster/barrier pattern replay that carries the multi-variant barrier lowerings — nvvm.barrier, nvvm.cluster.arrive.relaxed, nvvm.cluster.wait — each registered as a distinct pattern class so the slot-array walk can install them all without reflowing the main populator.
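The sentinel-skipping walk can be sketched as follows. The Slot layout here is a simplified stand-in (the real slots are 80 bytes and their payload is not modelled), and replay_live_slots is a hypothetical name; only the two sentinel values come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sentinel keys that mark slots the walk must skip.
constexpr std::int64_t kEmptyKey = -4096;
constexpr std::int64_t kTombstoneKey = -8192;

// Simplified slot: a key plus an illustrative payload.
struct Slot {
  std::int64_t key;
  int payload;
};

// Visit every live slot and return how many were visited.
template <typename Fn>
int replay_live_slots(const std::vector<Slot> &slots, Fn visit) {
  int visited = 0;
  for (const Slot &slot : slots) {
    if (slot.key == kEmptyKey || slot.key == kTombstoneKey) continue;
    visit(slot);
    ++visited;
  }
  return visited;
}
```

Walking the raw slot array this way touches every bucket, live or not, which is why the two sentinel comparisons appear on each iteration.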
After every populator phase is installed, sub_115E240 (2.2 KB, 13 basic blocks, 34 string literals) builds the conversion target. Three sub-helpers split the op-by-op work. sub_115CDA0 adds nv_tileas.{alloc_tensor, convert_layout, load, store} and nv_tileaa.plugin as dynamic-legal "holes punched in the illegal dialect", accepted only once their operands have already been converted to LLVM types. sub_115DDB0 adds llvm.{getelementptr, load, inline_asm, mlir.global, extractelement} as statically legal. sub_115D280 marks every nv_tileas.async.pipeline.* op together with nv_tileaa.mark_for_reuse and the sister-pass-owned nv_tileaa.func / nv_tileaa.return as legal, so the surface ConvertTileFuncToLLVM owns survives this pass untouched. The arith dialect is registered with dynamic legality through sub_36B52E0(target, "arith", 5, {sub_115B300, sub_115D940}); the predicate pair asks the TypeConverter whether the operation's result and operand types are already LLVM-typed.
Once populators are installed and the target is configured, the PDL-to-PDLInterp fallback runs through sub_36F9730 (carrying the diagnostic "failed to lower PDL pattern module to the PDL Interpreter"), letting pattern authors express catch-all rewrites declaratively in PDL rather than C++. The conversion engine sub_36CB0C0 then drives applyPartialConversion. Teardown is eight DenseMap slot-array free loops that hand each storage buffer to sub_4560420 with the right stride, followed by per-pattern ~Op() destructor calls dispatched through the vtable slot at offset +8. The pipeline phase is skipped only when the caller has deliberately picked a path that does not require TileAS async pipeline lowering — typically for IR introspection, since the upstream OneToNTypeConverter folds intermediate casts during type-split and that elision obscures pipeline structure in debug dumps. In normal compilation, every one of the nine phases runs.
Dynamic Shared Memory
Before body patterns run, the pass may create a shared-memory byte array symbol. The symbol is emitted only when the kernel requested extended shared memory, and uses the shared address space with conservative alignment so later GEPs can carve typed views out of it.
void ensure_global_smem(ModuleOp module, uint32_t bytes, Rewriter *rw) {
    if (bytes == 0 || module.lookup_symbol("global_smem")) {
        return;
    }
    Type element = rw->i8_type();
    Type array = rw->llvm_array_type(element, bytes);
    rw->create("llvm.mlir.global", {
        .name = "global_smem",
        .type = array,
        .address_space = AddressSpace::Shared,
        .alignment = 16,
    });
}
global_smem Synthesis
Before any conversion pattern fires, ConvertTileASToLLVM emits a single shared-memory scratch global named global_smem. Emission happens in sub_1144DA0 (749 bytes, 14 basic blocks), once per pass invocation. The array length is N = pass[6] >> 2, where pass[6] is the upstream-computed extended-SMEM byte budget produced by the TileAS scheduler; the right shift by two converts that budget into the i8-array length the LLVM dialect expects on the synthesised global. Kernels that did not request extended shared memory (pass[6] <= 0) short-circuit at the entry test and emit nothing.
sub_1144CC0 probes for existing symbols. If a global_smem already exists on the module — for instance, emitted by an earlier pass running on the same module — synthesis is skipped and the existing symbol is reused. The synthesised IR has the canonical shape:
llvm.mlir.global @global_smem () { addr_space = 3 : i32, alignment = 16 : i64,
linkage = #llvm.linkage<internal> } : !llvm.array<N x i8>
Address space 3 is the CUDA shared address space (__shared__). Alignment 16 matches the maximum natural alignment for any vectorisable load or store hitting the global, so later GEPs carving typed views out of the i8 backing array need not widen the alignment in place. Internal linkage keeps the symbol private to the module. The classic MLIR registration-probe diagnostic is wired in as a sanity probe: the binary stores it as the two-fragment pair "Building op `" and "` but it isn't known in this MLIRContext: the dialect may not be loaded or this operation hasn't been added by the dialect. See also https://mlir.llvm.org/getting_started/Faq/#registered-loaded-dependent-whats-up-with-dialects-management", concatenated around the op name. If the llvm.mlir.global op is not registered in the MLIRContext, the helper emits the diagnostic and returns failure rather than crashing.
LogicalResult emitGlobalSmem(PassContext *ctx, ModuleOp module) {
    if (ctx->pass[6] <= 0) return success();                                        // skip
    if (Operation *existing = sub_1144CC0(module, "global_smem")) return success(); // reuse
    uint64_t n_bytes = (uint64_t)ctx->pass[6] >> 2;
    OperationName globalName("llvm.mlir.global", ctx->mlirCtx);
    if (!globalName.isRegistered()) {
        emit("Building op `llvm.mlir.global` but it isn't known in this MLIRContext");
        return failure();
    }
    OpBuilder b(module.getBodyRegion());
    b.create<llvm::GlobalOp>(/*loc=*/loc, /*type=*/llvmArrayI8(n_bytes), /*sym_name=*/"global_smem",
                             /*linkage=*/Internal, /*addr_space=*/3, /*alignment=*/16);
    return success();
}
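The two-fragment storage of the registration-probe message can be illustrated with a small helper that rebuilds the full diagnostic around an op name. The function name registration_probe_message is hypothetical; the two fragments are the strings quoted above.

```cpp
#include <cassert>
#include <string>

// Rebuild the diagnostic by placing the op name between the two stored halves.
inline std::string registration_probe_message(const std::string &op_name) {
  static const char *const kPrefix = "Building op `";
  static const char *const kSuffix =
      "` but it isn't known in this MLIRContext: the dialect may not be "
      "loaded or this operation hasn't been added by the dialect. See also "
      "https://mlir.llvm.org/getting_started/Faq/"
      "#registered-loaded-dependent-whats-up-with-dialects-management";
  return std::string(kPrefix) + op_name + kSuffix;
}
```

Calling it with "llvm.mlir.global" yields a message that begins with the same text the emitGlobalSmem pseudocode passes to emit.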
ConvertTileASToLLVM runs once per kernel, and each invocation gets its own pass[6], so the per-pass emission shape is natural — the same module may host multiple kernels, each with its own extended-SMEM budget, and each kernel's pass instance carries its own byte count. The reuse path through sub_1144CC0 fires only when two kernels happen to share the same scratch, which is rare in practice.
Conversion Target
The terminal conversion target legalizes LLVM and NVVM while treating TileAA and TileAS as illegal except for explicitly dynamic bridge operations. Some arith and math operations are dynamically legal once their operands and results are already LLVM-compatible; otherwise cleanup patterns lower them.
| Legal or dynamic surface | Reason |
|---|---|
| llvm and nvvm | terminal executable dialects |
| gpu.module containers | consumed by GPU-to-binary serialization |
| selected arith operations | legal only after type conversion |
| selected math operations | legal only after cleanup-compatible type conversion |
| selected TileAS bridge ops | legal only when operands are already LLVM-typed |
| async pipeline bridge ops | legal only for the pipeline lowering phase |
| unrealized casts | temporary, removed by reconciliation |
The target must reject every executable TileAS operation after the body phase. Accepting one would shift a compiler bug into backend translation.
Conversion-Target Legality
sub_115E240 (2.2 KB, 13 basic blocks, 34 string literals) assembles the ConversionTarget applyPartialConversion consults during the main ConvertTileASToLLVM pass. It runs once, after every populator phase is installed and before the PDL fallback. The job is purely declarative: tell the partial-conversion driver which dialects are fully legal, which dialect is uniformly illegal, and which individual operations are legal only when a type-converter predicate accepts them. No rewrite work happens here.
Two static-legality vectors carry the bulk of the configuration, passed through sub_36B4F90(target, vec, count, kind). A seven-entry vector with kind = 0 ("fully legal — accept any op of these dialects without further checks") carries the terminal executable dialects together with the CuTe and CUTLASS surface that survives lowering verbatim. A single-entry vector with kind = 2 ("illegal") names nv_tileaa, so every nv_tileaa op is a rewrite target by default; the dynamic predicates that follow are the "holes punched in the illegal dialect" that let specific bridge ops slip through when their operands have already been converted.
| Vector | kind | Entries |
|---|---|---|
| static legal dialects | 0 | arith, gpu, nvvm, scf, cute, cute_nvgpu, cutlass |
| static illegal dialects | 2 | nv_tileaa |
Three sub-helpers refine the per-op surface. They are independently callable but share a fixed call order inside sub_115E240: the static "fully legal" bucket fills first, the illegal dialect is declared next, and only then are the dynamic and per-op exceptions layered on top.
| Sub-helper | Size | Operations | Legality |
|---|---|---|---|
| sub_115CDA0 | 1.2 KB | nv_tileas.{alloc_tensor, convert_layout, load, store}, nv_tileaa.plugin | dynamic; accepted only when operands already carry LLVM-legal types |
| sub_115DDB0 | 1.1 KB | llvm.{getelementptr, load, inline_asm, mlir.global, extractelement} | static |
| sub_115D280 | 737 B | nv_tileas.async.pipeline.*, nv_tileaa.{mark_for_reuse, func, return} | static |
sub_115D280 carries the sister-pass contract. nv_tileaa.func and nv_tileaa.return are owned by ConvertTileFuncToLLVM, which has already run; marking them legal here keeps the surface the function-boundary pass produced untouched. nv_tileaa.mark_for_reuse is a scheduling annotation that must survive into NVVM translation. The nv_tileas.async.pipeline.* family is legal because the pipeline phase (phase 6, sub_114FB10 → sub_114EA50) uses applyPartialOneToNConversion rather than applyPartialConversion, and the bookkeeping ops it leaves behind must not be re-attacked by the main driver.
Dynamic legality for the remaining arith, math, and ub.poison operations installs through sub_36B52E0(target, name, count, predicates) and sub_36B50E0(target, op). Each predicate pair is a (legality, materialization) callback: the first asks the TypeConverter whether the operation's result and operand types are already LLVM-typed, the second is the cast-materialization hook invoked when the answer is "almost — convert these operands first".
| Op or family | Registration | Predicate pair |
|---|---|---|
| arith (5 ops) | sub_36B52E0(target, "arith", 5, …) | {sub_115B300, sub_115D940} |
| math.absi, math.ctlz, math.ctpop, math.cttz, math.trunc | sub_36B50E0(target, op) per op | inherited from the dialect's default dynamic predicate |
| ub.poison | sub_36B52E0(target, "ub.poison", 1, …) | {sub_115B360, sub_115B250} |
Those five math.* operations are exactly the ones with direct NVPTX equivalents — handled either by direct intrinsics or by short inline-PTX templates (absi → integer-abs sequence, ctlz/cttz → bfind PTX, ctpop → popc PTX, trunc → integer-truncating cast); the upstream math patterns lower them in the cleanup phase only when their operands are already LLVM-typed.
void buildConversionTarget(ConversionTarget *target) {
sub_36B4F90(target,
/*legalDialects=*/{arith, gpu, nvvm, scf, cute, cute_nvgpu, cutlass},
/*count=*/7,
/*kind=*/0);
sub_36B4F90(target,
/*illegalDialects=*/{nv_tileaa},
/*count=*/1,
/*kind=*/2);
sub_115CDA0(target); // dynamic-legal nv_tileas.{alloc_tensor, convert_layout, load, store} + nv_tileaa.plugin
sub_115DDB0(target); // statically-legal llvm.{getelementptr, load, inline_asm, mlir.global, extractelement}
sub_115D280(target); // legal nv_tileas.async.pipeline.* + nv_tileaa.{mark_for_reuse, func, return}
sub_36B52E0(target, "arith", 5, /*predicates=*/{sub_115B300, sub_115D940});
sub_36B50E0(target, "math.absi");
sub_36B50E0(target, "math.ctlz");
sub_36B50E0(target, "math.ctpop");
sub_36B50E0(target, "math.cttz");
sub_36B50E0(target, "math.trunc");
sub_36B52E0(target, "ub.poison", 1, /*predicates=*/{sub_115B360, sub_115B250});
}
The fixed declaration order matters. Dynamic predicates registered later take precedence over the dialect-wide kind = 0 decision, so a reimplementation that registers the arith predicates (the sub_36B52E0 call) before the static-legal-dialect call would accidentally make every arith op statically legal and skip the cleanup phase's type-conversion gate. The driver also relies on nv_tileaa being declared illegal before the per-op holes are punched: punching holes first and declaring dialect-wide illegality second would override the dynamic predicates and force every nv_tileaa.plugin invocation to rewrite unconditionally, which the plugin contract does not support.
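The precedence rule described above can be modelled as a last-registration-wins table. The sketch below is a toy model of that stated rule only, not of MLIR's ConversionTarget or of the binary's data structures; LegalityTable, Rule, documented_order, and swapped_order are hypothetical names introduced for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical legality states for the toy model.
enum class Legality { Legal, Dynamic, Illegal, Unknown };

struct Rule {
    std::string name;    // dialect name ("nv_tileaa") or full op name
    bool whole_dialect;  // true: the rule matches every op of the dialect
    Legality legality;
};

// Assumption: registrations are kept in order and the newest match decides.
class LegalityTable {
public:
    void add_dialect(const std::string &dialect, Legality l) {
        rules_.push_back({dialect, true, l});
    }
    void add_op(const std::string &op, Legality l) {
        rules_.push_back({op, false, l});
    }
    Legality query(const std::string &op) const {
        // Walk newest-first so later registrations override earlier ones.
        for (auto it = rules_.rbegin(); it != rules_.rend(); ++it) {
            if (it->whole_dialect) {
                const std::string prefix = it->name + ".";
                if (op.compare(0, prefix.size(), prefix) == 0) return it->legality;
            } else if (op == it->name) {
                return it->legality;
            }
        }
        return Legality::Unknown;
    }
private:
    std::vector<Rule> rules_;
};

// Registration in the order the text describes: dialects first, holes last.
inline LegalityTable documented_order() {
    LegalityTable t;
    t.add_dialect("arith", Legality::Legal);
    t.add_dialect("nv_tileaa", Legality::Illegal);
    t.add_op("nv_tileaa.plugin", Legality::Dynamic);
    t.add_op("nv_tileaa.func", Legality::Legal);
    return t;
}

// Same rules with the illegal-dialect declaration moved after the holes.
inline LegalityTable swapped_order() {
    LegalityTable t;
    t.add_dialect("arith", Legality::Legal);
    t.add_op("nv_tileaa.plugin", Legality::Dynamic);
    t.add_op("nv_tileaa.func", Legality::Legal);
    t.add_dialect("nv_tileaa", Legality::Illegal);
    return t;
}
```

Under this model, documented_order() keeps nv_tileaa.plugin dynamically legal, while swapped_order() reports it as illegal, which is the failure mode the paragraph warns about.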
Async Pipeline Lowering
TileAS async pipeline operations carry producer/consumer phase, stage index, and queue structure. Terminal lowering distills that structure into integer tokens, barriers, waits, and memory operations.
LogicalResult lower_pipeline_consume(ConsumeOp op, Rewriter *rw, PipelineState *state) {
Value token = state->token_for(op.pipeline(), op.stage());
Value phase = rw->and_i(token, rw->constant_i32(1));
Value ready = rw->create("nvvm.mbarrier.try_wait.parity.shared", {
op.barrier(),
phase
}).result(0);
rw->replace_op(op, ready);
return success();
}
The exact intrinsic varies by operation; the invariant is stable: pipeline tokens are integer phase carriers, not heap objects.
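To make the "integer phase carrier" idea concrete, here is a minimal sketch of how a parity bit can be derived from an integer token. It assumes, purely for illustration, that the token counts completed passes over a ring of depth stages; the text only establishes that the low bit of the token is the phase. StageAndPhase, phase_parity, and stage_and_phase are hypothetical names.

```cpp
#include <cstdint>

// The low bit of the integer token is the parity handed to the wait.
inline uint32_t phase_parity(uint32_t token) { return token & 1u; }

struct StageAndPhase {
    uint32_t stage;   // slot in the stage ring
    uint32_t parity;  // 0 or 1, flips each time the ring wraps
};

// Assumption: a loop counter walks a ring of `depth` stages (depth > 0),
// and the number of completed wraps plays the role of the token.
inline StageAndPhase stage_and_phase(uint32_t iteration, uint32_t depth) {
    StageAndPhase r;
    r.stage = iteration % depth;
    r.parity = phase_parity(iteration / depth);
    return r;
}
```

Nothing in this model is heap-allocated or stateful: the stage and parity are plain integer arithmetic on the loop counter.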
Tile Memory and Descriptor Lowering
The main TileAA/TileAS body patterns lower:
- thread, warp, CTA, grid, and cluster index arithmetic;
- tile loads, stores, gathers, scatters, and atomics;
- TMA descriptor creation and tiled TMA load/store operations;
- mbarrier initialization, arrive, wait, and transaction-count operations;
- layout conversion and view operations;
- tensor-memory helper operations for Blackwell paths;
- inline assembly only where no first-class NVVM operation exists.
Prefer first-class NVVM operations over inline assembly. Inline assembly is appropriate only for target instructions absent from the NVVM dialect snapshot.
Per-Pattern Walks
async.tiled_tma_load → nvvm.cp.async.bulk.tensor.shared.cluster.global
The TileAS TMA-load lowering is a five-step rewrite. The TMA descriptor (an nv_tileas.make_tiled_tma_desc result, carrying a !nv_tileas.tma_descriptor_iter value) becomes an llvm.ptr<1> to the descriptor's global-memory home, the destination view becomes the shared-memory base address, the per-axis coordinates pass through as i32 indices with their values unchanged, and the mbarrier slot is the shared-memory address of the completion barrier. The async-token result becomes an i32 placeholder that keeps the SSA edge to the consumer-side mbarrier.try_wait intact:
// Before
%tok = nv_tileas.async.tiled_tma_load
%desc, %dst_view[%coord_y, %coord_x], %mbar
{ atom = #nv_tileas<atom tma_load_2d>,
operandSegmentSizes = array<i32: 1, 1, 2, 1> }
: !nv_tileas.tma_descriptor_iter, !nv_tileaa.tiled_view<128x64xf16>,
index, index, !nv_tileaa.mem_token
-> !nv_tileas.async.token
// After
%dst_addr = llvm.extractvalue %dst_view_struct[0]
: !llvm.struct<(ptr<3>, ptr<3>, i64, array<2 x i64>, array<2 x i64>)>
%desc_addr = llvm.bitcast %desc : !llvm.ptr -> !llvm.ptr<1>
%mbar_addr = llvm.extractvalue %mbar_struct[0]
: !llvm.struct<(ptr<3>, i32)>
nvvm.cp.async.bulk.tensor.shared.cluster.global %dst_addr, %desc_addr,
%mbar_addr, box[%coord_x, %coord_y]
{ mode = #nvvm.tma_load_mode<tile> }
: !llvm.ptr<3>, !llvm.ptr<1>, !llvm.ptr<3>
%tok = llvm.mlir.constant(0 : i32) : i32
The intrinsic name selection is driven by the atom attribute and the mode attribute on nvvm.cp.async.bulk.tensor.shared.cluster.global. A 2D tile load with no multicast and no L2 cache hint emits the basic form above; multicast variants set multicast = true and append a multicast_mask operand; the im2col atom variants set mode = #nvvm.tma_load_mode<im2col> and prepend a per-axis offset vector before the coordinate list.
Coordinate order also flips. The TileAS surface lists coordinates in row-major (outer-axis-first) order to match the way layout-assignment writes them, but cp.async.bulk.tensor consumes them in column-major (inner-axis-first) order to match the PTX instruction. The rewrite reverses the coordinate operand list as part of the emission.
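The reversal itself is small. A minimal sketch, with to_inner_axis_first as a hypothetical helper name and plain integers standing in for the SSA coordinate operands:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Turn an outer-axis-first (row-major) coordinate list into the
// inner-axis-first order described for cp.async.bulk.tensor.
inline std::vector<int32_t> to_inner_axis_first(std::vector<int32_t> outer_first) {
    std::reverse(outer_first.begin(), outer_first.end());
    return outer_first;
}
```

For the 2D walk above, the surface order (coord_y, coord_x) becomes (coord_x, coord_y), matching the box[%coord_x, %coord_y] operand order in the lowered op.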
The !nv_tileas.async.token result becomes an i32 zero constant. Async-token values do not carry hardware state — only data-dependence edges in the IR — so the lowering replaces them with a placeholder whose only purpose is keeping the SSA dataflow connected for the consumer pattern. The consumer side (an nv_tileas.async.pipeline.consumer_wait) lowers to an nvvm.mbarrier.try_wait.parity.shared that reads its phase from the loop iterator's stage index, not from the async-token operand.
async.dot (Hopper WGMMA atom) → Four-Op NVVM Protocol
Hopper warpgroup MMA lowers to a strict four-op NVVM sequence: fence, mma_async, commit_group, wait_group. The fence pins the boundary the consumer cannot reorder past, the mma_async issues the warpgroup compute, and the commit/wait pair drains the accumulator before the next consumer reads it. The 64-bit SMEM descriptors for A and B are built upstream in the cute_nvgpu lowering and arrive as already-packed i64 SSA values:
// Before — nv_tileas.async.dot carrying a sm90 WGMMA atom witness
%c_out = nv_tileas.async.dot %desc_a, %desc_b, %c_in
{ atom = #nv_tileas<atom mma_f16_f16_f32>,
group_id = 0 : i32 }
: i64, i64, tile<128x128xf32>
-> tile<128x128xf32>
// After
nvvm.wgmma.fence.aligned
%c0 = llvm.extractvalue %c_in_struct[0] : !llvm.struct<(f32, f32, ..., f32)>
...
%cN = llvm.extractvalue %c_in_struct[63] : !llvm.struct<(f32, f32, ..., f32)>
%r0, ..., %rN = nvvm.wgmma.mma_async.sync.aligned
%desc_a, %desc_b, %c0, ..., %cN
{ shape = #nvvm.shape<m = 64, n = 128, k = 16>,
typeA = #nvvm.wgmma_type<f16>,
typeB = #nvvm.wgmma_type<f16>,
typeD = #nvvm.wgmma_type<f32>,
scaleA = 1 : i32, scaleB = 1 : i32,
scaleD = #nvvm.wgmma_scale_out<one> }
: i64, i64, f32, ..., f32 -> f32, ..., f32
nvvm.wgmma.commit.group.sync.aligned
nvvm.wgmma.wait.group.sync.aligned 0
%c_out_struct = llvm.insertvalue %r0, %undef[0]
: !llvm.struct<(f32, f32, ..., f32)>
...
The accumulator tile becomes an LLVM struct with one element per register lane — for m64n128.f32 the lane count is 64 per thread, so the struct has 64 f32 fields, each held in a separate register at runtime. The rewrite splits the tile into per-lane SSA values with extractvalue, feeds them into the mma_async op as positional operands, and reassembles the result tile with insertvalue. NVVM canonicalisation later folds the extractvalue/insertvalue chain when the accumulator lives in a register for the full WGMMA loop.
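The 64-field figure follows from dividing the accumulator tile across the 128 threads of a warpgroup. A short sketch of that arithmetic, with f32_accumulators_per_thread as a hypothetical helper name:

```cpp
#include <cstdint>

// A warpgroup is four warps of 32 threads.
constexpr uint32_t kWarpgroupThreads = 128;

// Number of f32 accumulator values each thread holds for an m x n tile,
// which is also the field count of the per-thread LLVM struct.
inline uint32_t f32_accumulators_per_thread(uint32_t m, uint32_t n) {
    return (m * n) / kWarpgroupThreads;
}
```

For the m = 64, n = 128 shape used above this gives 64 fields per thread; narrower n values shrink the struct proportionally.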
The wait.group 0 waits for every outstanding WGMMA group — the simplest correct lowering. A pipelined variant emits commit.group after every mma_async and wait.group N with N equal to the depth of in-flight groups the scheduler tracks; that path is taken when the upstream nv_tileas.async.dot (the producer of the WGMMA atom payload) carries a pipeline_depth attribute on its atom witness. The four-op protocol is fixed; only the wait-group depth varies.
async.mbarrier_init → nvvm.mbarrier.init.shared
mbarrier initialisation is a one-to-one rewrite. The barrier value lives in shared memory, gets allocated upstream by an alloc_tensor lowering that carves it out of global_smem, and arrives as an llvm.ptr<3> to a 64-bit barrier slot. The tick count — the number of arrivals the barrier expects before phase advance — is an i32:
// Before
%mbar_init = nv_tileas.async.mbarrier_init %mbar, %ticks
: !nv_tileaa.mem_token, i32 -> !nv_tileaa.mem_token
// After
%mbar_addr = llvm.extractvalue %mbar_struct[0]
: !llvm.struct<(ptr<3>, i32)>
nvvm.mbarrier.init.shared %mbar_addr, %ticks : !llvm.ptr<3>, i32
%mbar_init = llvm.mlir.constant(0 : i32) : i32
The TileAS mem-token result is again a placeholder i32 — the actual ordering edge to the matching nvvm.mbarrier.arrive / nvvm.mbarrier.try_wait.parity.shared pair is carried by the producer/consumer-side pattern that issues those intrinsics, not by an explicit operand chain.
The nvvm.mbarrier.init.shared intrinsic is the unconditional emission. There is no global-memory init variant — mbarrier storage must be in shared memory on every supported architecture, and the rewrite asserts the source view's address space before emitting. Initialising a global-memory barrier address would produce a PTX error at SASS translation, well after the conversion target has accepted the IR; the assertion catches it at this pass instead.
Arith Template Cleanup
The cleanup path includes generic arithmetic patterns plus a higher-priority constant conversion. Generic arithmetic conversion maps compare, add, multiply, division, shifts, select, casts, and min/max into the target dialect under the shared type converter. Constants get special handling so tensor constants become TileAA or LLVM aggregate materializations rather than scalar-only constants.
LogicalResult lower_arith_constant(ConstantOp op, Rewriter *rw, TypeConverter *types) {
Attribute value = op.value();
if (isa<DenseElementsAttr>(value) || isa<SplatElementsAttr>(value)) {
return lower_tensor_constant(op, rw, types);
}
return lower_scalar_constant(op, rw, types);
}
Conversion Invariants
- Function signatures must be LLVM-compatible before body lowering starts.
- Kernel functions may not return values.
- Kernel metadata must be transferred to NVVM-compatible attributes without losing launch semantics.
- Shared-memory globals are emitted only when requested and must use shared address space.
- Async pipeline tokens lower to integer phase carriers.
- TileAA and TileAS executable operations must not survive terminal conversion.
- Temporary unrealized casts must be reconciled before serialization.
- Inline assembly must remain narrowly scoped to missing NVVM dialect coverage.
Cross-References
Conversion / Lowering Overview places this pass at the LLVM-lowering stage between TileAS scheduling and companion-dialect lowering. TileAA to TileAS — Named Pattern Bank is the upstream producer whose CopyAtom and ReduceAtom witnesses this pass resolves into concrete hardware primitives. CuTe and CuTe-NVGPU to LLVM and nvgpu / gpu to NVVM are the companion passes that lower the surviving cute.*, cute_nvgpu.*, gpu.*, and nvgpu.* operations after this pass. Shared LLVM Type Converter describes the shared LLVM type converter every pattern in this pass threads through. MMA Atoms sm70-120 — SM90 WGMMA carries the bit-level SMEM descriptor layout the four-op wgmma walk above consumes verbatim. nv_tileas Op Roster — TMA Op Operand/Result Tables gives the operand and attribute tables for the TileAS surface the per-pattern walks here lower into NVVM. DSL to PTX End-to-End — Stage 4: LLVM IR with NVVM intrinsics shows the kernel exit shape of this pass — nv_tileas.dot, async.pipeline.*, and tiled_tma_load collapsed to nvvm.wgmma.*, integer phase tokens, and nvvm.cp.async.bulk.tensor.* for one representative GEMM iteration.
Lowering: cute / cute_nvgpu to LLVM
Abstract
The cute and cute_nvgpu dialects carry layout algebra, tuple manipulation, descriptor iterators, and architecture-specific MMA or copy atoms. They sit beside the TileAA and TileAS pipeline rather than forming a single linear rung. Their lowering desugars high-level CuTe constructs into a primitive vocabulary, lowers layout and descriptor operations into LLVM-compatible values, then rewrites Hopper and Blackwell atom builders into the NVGPU/NVVM path.
The public contract: layout algebra stays inspectable until enough target information exists, and no CuTe-only executable operation may reach final NVPTX serialization.
Lowering Stages
The CuTe lowering pipeline is three passes that run in order. Each pass owns a different layer of abstraction, and the next pass relies on the prior pass having normalised its input.
| Stage | Responsibility |
|---|---|
| CuteDesugar | Expands sugar into primitive cute, scf, arith, and memref operations. Target-neutral. |
| cute -> LLVM pattern set | Lowers layout tuples, descriptor iterators, pointer casts, and primitive helpers into LLVM-dialect values. |
| cute_nvgpu atom lowering | Rewrites SM90 and SM100 atom builders into target-specific NVVM and tcgen05 IR. |
Stage order simplifies high-level CuTe layout manipulation before architectural operations are selected. Desugaring must run first because the primitive CuTe lowering bank can only see what the desugarer has reduced; atom lowering runs last because its target gates depend on having LLVM-typed operands available.
Layout Descriptors to LLVM
The translation from CuTe layout algebra to LLVM follows a single rule: each cute tuple becomes an llvm.struct, and each Layout becomes a sequence of llvm.insertvalue operations on a fresh undef of that struct type. The components of a layout (shape, stride, swizzle) translate independently, and hierarchical modes compose by struct nesting. A flat rank-2 layout, for example, packs as !llvm.struct<(i32, i32, i32, i32)> with fields (shape0, stride0, shape1, stride1); a mode that is itself a layout becomes a nested struct in place of its scalar pair.
The descriptor-iterator primitive sits at the heart of this lowering. CuTe represents iteration over a layout via cute.get_iter (paired with cute.deref_desc_iter for dereference), which the bank rewrites into a four-step LLVM sequence: a ceildivsi for the total iteration count, an alloca for the iterator state slot, an undef to initialise the struct value, and three insertvalue operations that populate the base pointer, current index, and stride fields.
%iter = cute.get_iter %base, %extent, %tile_shape, %stride
↓
%count = arith.ceildivsi %extent, %tile_shape : i32
%storage = llvm.alloca %c1 x !llvm.struct<(ptr, i32, i32)> : (i32) -> !llvm.ptr
%init = llvm.mlir.undef : !llvm.struct<(ptr, i32, i32)>
%s0 = llvm.insertvalue %base, %init[0] : !llvm.struct<(ptr, i32, i32)>
%s1 = llvm.insertvalue %c0, %s0[1] : !llvm.struct<(ptr, i32, i32)>
%iter = llvm.insertvalue %stride, %s1[2] : !llvm.struct<(ptr, i32, i32)>
The companion cute.deref_desc_iter and the cute.add_offset advance/rewind helpers work directly on the resulting three-field struct: extractvalue to read the index, add or sub to update it, insertvalue to write it back. The iterator is small (16 bytes with 64-bit pointers: one pointer plus two i32 fields) and the LLVM optimiser usually eliminates the stack slot through SROA after inlining; emitting the alloca up front gives the optimiser a stable target to fold.
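The three-field iterator and its advance/rewind helpers can be mirrored in plain C++. This is a sketch under assumptions: DescIter, make_iter, advance, rewind, and ceil_div are hypothetical names, the step amount is an illustrative parameter, and ceil_div handles positive operands only.

```cpp
#include <cstdint>

// Mirrors !llvm.struct<(ptr, i32, i32)>: base pointer, current index, stride.
struct DescIter {
    const void *base;
    int32_t index;
    int32_t stride;
};

// Iteration count for positive extent and tile size (the ceildivsi step).
inline int32_t ceil_div(int32_t extent, int32_t tile) {
    return (extent + tile - 1) / tile;
}

// undef + three insertvalue: base, zero index, stride.
inline DescIter make_iter(const void *base, int32_t stride) {
    DescIter it = {base, 0, stride};
    return it;
}

// extractvalue + add + insertvalue on the index field.
inline DescIter advance(DescIter it, int32_t steps) {
    it.index += steps;
    return it;
}

// extractvalue + sub + insertvalue on the index field.
inline DescIter rewind(DescIter it, int32_t steps) {
    it.index -= steps;
    return it;
}
```

Because every helper takes and returns the struct by value, the model has the same shape as the SSA form: no helper needs the stack slot, which is why SROA can usually remove it.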
Per-shape escape hatches sit at the edges of the bank. cute.print desugars to an element-wise loop with coordinate materialisation and a scalar print call. cute.make_atom dispatches to atom-interface-specific construction. cute.filter_zeros, cute.group_modes, cute.coalesce, and cute.complement carry layout-algebra semantics: they rewrite to layout reconstruction sequences that compute new shape and stride tuples from the input layout.
Desugaring Contract
CuteDesugar rewrites high-level layout construction and inspection into primitive operations later conversion patterns can lower mechanically.
| Sugar operation | Desugared shape |
|---|---|
| cute.make_layout | structured loop over grouped shape and stride modes |
| cute.make_shape | loop-driven construction from iterator leaves |
| cute.make_stride | loop-driven static stride construction |
| cute.make_tile | primitive tile construction and dice operations |
| cute.make_coord | flat-coordinate extraction |
| view equality and projection | shape and stride reads followed by boolean conjunction |
| cute.print | element loop with coordinate materialization and scalar print |
| cute.make_atom | atom-interface-specific primitive atom construction |
The pass is target-neutral. It must not branch on compute capability — target selection belongs to the atom-lowering bodies and NVGPU conversion.
Input and Output Dialects
| Direction | Surface |
|---|---|
| input ops | cute.* (layout, tuple, descriptor, copy, partition), cute_nvgpu.* (atoms, SM100 tcgen05 helpers) |
| input types | cute::LayoutType, cute::ShapeType, cute::StrideType, cute::AtomType, descriptor iterator types |
| output ops | llvm.* (alloca, insertvalue, extractvalue, load, store, struct construction), nvvm.* (tcgen05, wgmma, cp.async.bulk), arith and scf for residual control structure, cutlass.* for atoms forwarded into companion lowering |
| output types | layout and shape tuples become integers or !llvm.struct; descriptor iterators become a 3-field struct (ptr, i32, i32); atoms become opaque struct payloads consumed by the next stage |
Bulk cute -> LLVM Pattern Bank
Forty-four OpConversionPattern classes cover the primitive CuTe surface. The first sixteen anchor the bank by lowering layout construction, tuple manipulation, and the descriptor-iterator primitives every later pattern reaches into; the remaining twenty-eight extend it with copy and partition helpers, fast-division specialisations, pointer-cast bridges, and cute_nvgpu helper operations. Registration is a flat linear sweep with no conditional branches on target — a faithful reimplementation mirrors the bank as a single pattern list.
The sixteen anchor classes:
| Class | Source op | Rewrite target |
|---|---|---|
| MakeDescriptorIteratorOpLowering | cute.get_iter | ceildivsi + alloca + undef + three insertvalue (the sequence above) |
| DescriptorAdvanceOpLowering | cute.add_offset (advance arm) | extractvalue + add + insertvalue |
| DescriptorRewindOpLowering | cute.add_offset (rewind arm) | extractvalue + sub + insertvalue |
| MakeLayoutOpLowering | cute.make_layout | undef + recursive insertvalue over shape and stride |
| MakeCoordOpLowering | cute.make_coord | Flat coordinate-tuple construction |
| CrdToIdxOpLowering | cute.crd2idx | Dot product of coordinate and stride tuples |
| TiledDivOpLowering | cute.tiled_divide | Per-mode divsi |
| TiledModOpLowering | cute.tiled_divide (remainder arm) | Per-mode remsi |
| ShapeDivOpLowering | cute.shape_div | Layout reconstruction after shape divide |
| CeilDivOpLowering | cute.ceil_div | ceildivsi lifted to LLVM scalars |
| FilterZerosOpLowering | cute.filter_zeros | Layout reconstruction skipping zero-extent modes |
| GroupModesOpLowering | cute.group_modes | Layout reconstruction with grouped modes |
| CoalesceOpLowering | cute.coalesce | Layout reconstruction with adjacent compatible modes merged |
| ComplementOpLowering | cute.complement | Layout-complement construction |
| PartitionOpLowering | cute.local_partition | Pointer-offset GEP plus layout adjustment |
| TilePartitionOpLowering | cute.tiled.copy.partition_D / partition_S | Tiled-partition iteration emission |
A secondary registrar runs after the main bank and adds two DerefineOpLowering patterns (layout-projection-to-coord and layout-flatten) plus the ConvertGPUFuncSignature rewrite that downgrades gpu.func signatures to LLVM-compatible func.func. Running this registrar second matters: it lets the cute_nvgpu helper rewrites assume the primitive CuTe operations are already convertible, so they can compose with the bank's outputs rather than racing them.
Per-Pattern Walks
cute.layout Value to LLVM Struct
A cute.layout<<8:1, 4:8>> describes a rank-2 mapping with shape (8, 4) and stride (1, 8), the canonical column-major layout for an 8-by-4 tile. The lowering packs it as a 4-tuple LLVM struct using two insertvalue operations per mode. Static layouts fold into a single LLVM constant; dynamic layouts emit the insertvalue chain so the optimiser can hoist the construction out of loops:
// Before
%l = cute.make_layout shape = <8 : i32, 4 : i32>, stride = <1 : i32, 8 : i32>
: !cute.layout<<8:1, 4:8>>
// After (static — folded to constant)
%l = llvm.mlir.constant(
dense<[8, 1, 4, 8]> : tensor<4xi32>)
: !llvm.struct<(i32, i32, i32, i32)>
// After (dynamic — same shape with dynamic stride)
%shape0 = arith.constant 8 : i32
%shape1 = arith.constant 4 : i32
%init = llvm.mlir.undef : !llvm.struct<(i32, i32, i32, i32)>
%s0 = llvm.insertvalue %shape0, %init[0] : !llvm.struct<(i32, i32, i32, i32)>
%s1 = llvm.insertvalue %stride0, %s0[1] : !llvm.struct<(i32, i32, i32, i32)>
%s2 = llvm.insertvalue %shape1, %s1[2] : !llvm.struct<(i32, i32, i32, i32)>
%l = llvm.insertvalue %stride1, %s2[3] : !llvm.struct<(i32, i32, i32, i32)>
The struct field order is (shape_dim0, stride_dim0, shape_dim1, stride_dim1). Interleaving shape and stride per mode rather than packing all shapes then all strides keeps the per-mode pair adjacent in memory, which the SROA pass treats as one local-variable group when it scalarises the alloca that holds a layout iterator. A rank-3 layout cute.layout<<S0:T0, S1:T1, S2:T2>> lowers to a 6-tuple !llvm.struct<(i32, i32, i32, i32, i32, i32)> on the same principle.
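The interleaved field order can be sketched as a small packing routine. pack_layout_fields is a hypothetical name, a flat vector stands in for the LLVM struct, and only flat (non-hierarchical) layouts are covered.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a flat layout as (shape0, stride0, shape1, stride1, ...).
// Returns an empty vector when the shape and stride ranks disagree.
inline std::vector<int32_t> pack_layout_fields(const std::vector<int32_t> &shape,
                                               const std::vector<int32_t> &stride) {
    std::vector<int32_t> fields;
    if (shape.size() != stride.size()) return fields;
    fields.reserve(shape.size() * 2);
    for (std::size_t i = 0; i < shape.size(); ++i) {
        fields.push_back(shape[i]);   // even slot: extent of mode i
        fields.push_back(stride[i]);  // odd slot: stride of mode i
    }
    return fields;
}
```

For the layout <8:1, 4:8> this yields the field sequence 8, 1, 4, 8, the same element order as the folded constant in the walk above.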
Hierarchical layouts (a mode that is itself a layout) lower as nested structs. A rank-2 layout whose shape is ((2, 2), 4) packs as !llvm.struct<(struct<(i32, i32, i32, i32)>, i32, i32)>: the nested first mode becomes its own 4-tuple of two (shape, stride) pairs, and the second outer mode contributes a flat (shape, stride) pair.
cute.compose of Two Layouts
cute.compose %l1, %l2 computes the functional composition (i) → l2(l1(i)). When both operands are static, the composition folds at conversion time into a single layout constant. When at least one is dynamic, the rewriter emits a sequence that extracts the source layout's shape and stride, multiplies stride trees, and packs the result struct:
// Before
%c = cute.compose %l1, %l2
: !cute.layout<<8:1, 4:8>>, !cute.layout<<2:1, 8:2>>
-> !cute.layout<<((2, 4), 4):((1, 16), 8)>>
// After (both static — folded)
%c = llvm.mlir.constant(
dense<[2, 1, 4, 16, 4, 8]> : tensor<6xi32>)
: !llvm.struct<(i32, i32, i32, i32, i32, i32)>
// After (dynamic stride on %l1)
%s1_d0 = llvm.extractvalue %l1[0] : !llvm.struct<(i32, i32, i32, i32)>
%t1_d0 = llvm.extractvalue %l1[1] : !llvm.struct<(i32, i32, i32, i32)>
%s2_d0 = llvm.extractvalue %l2[0] : !llvm.struct<(i32, i32, i32, i32)>
%t2_d0 = llvm.extractvalue %l2[1] : !llvm.struct<(i32, i32, i32, i32)>
%composed0 = llvm.mul %t1_d0, %t2_d0 : i32
%init = llvm.mlir.undef : !llvm.struct<(i32, i32, i32, i32)>
%c0 = llvm.insertvalue %s2_d0, %init[0] : !llvm.struct<(i32, i32, i32, i32)>
%c1 = llvm.insertvalue %composed0, %c0[1] : !llvm.struct<(i32, i32, i32, i32)>
...
The static fold first checks that the cosize of %l1 fits inside the size of %l2, which the algebra requires; see Layout Algebra — Composition. Failing that check produces an arith.constant 0 for an invalid layout, and the verifier on the consumer op rejects the result. The dynamic path emits an llvm.icmp that the optimiser folds away once both shapes are constant-propagated.
The mode count of the result is not always the sum of input mode counts; composition can introduce nested modes when the strides of %l2 are not divisible by the shape sums of %l1. The lowering walks the inputs structurally and synthesises one struct field per leaf in the result tree, so nested-mode composition lowers to nested structs rather than flat ones.
cute_nvgpu.arch.copy.SM100.copy_s2t — SMEM-to-TMEM Copy
The Blackwell shared-to-tensor-memory copy lowers in two stages. First, the CuTe atom packaging stage (Sm100S2tCopyAtom) materialises a TMEM destination and any cluster-rank arithmetic, then emits a CuTe atom payload that carries the source SMEM view, destination TMEM pointer, mbarrier, and partition info. Second, the TileAS-to-LLVM lowering converts that payload into an nvvm.cp.async.bulk.tensor.shared.cluster.shared.cta op:
// Before (after CuTe atom packaging)
%tok = cute_nvgpu.cp_async.s2t %src_smem_view, %dst_tmem_ptr, %mbar, %partition
{ atom = #cute.copy_atom<sm100_s2t_b8x128>,
cta_group = 2 : i32 }
: !nv_tileaa.tiled_view<128x128xi8, smem>, !llvm.ptr<6>, !llvm.ptr<3>, i32
-> !nv_tileas.async.token
// After
%src_addr = llvm.extractvalue %src_smem_view[0]
: !llvm.struct<(ptr<3>, ptr<3>, i64, array<2 x i64>, array<2 x i64>)>
%rank_mod = llvm.urem %cluster_rank, %cta_group_2 : i32
%mask = llvm.and %rank_mod, %cta_group_minus_1 : i32
%cond = llvm.icmp "eq" %mask, %c0_i32 : i32
llvm.cond_br %cond, ^do_copy, ^skip
^do_copy:
nvvm.cp.async.bulk.tensor.shared.cluster.shared.cta
%dst_tmem_ptr, %src_addr, %mbar
{ mode = #nvvm.tma_load_mode<tile>, shape = #nvvm.shape<128x128> }
: !llvm.ptr<6>, !llvm.ptr<3>, !llvm.ptr<3>
llvm.br ^join
^skip:
llvm.br ^join
^join:
%tok = llvm.mlir.constant(0 : i32) : i32
Only the CTA whose rank in the cluster matches the partition selector issues the copy; the others branch around it. The partition selector is (rank_in_cluster mod cta_group) AND (cta_group - 1), and the CTA that owns the partition is the one for which it evaluates to zero. When cta_group is a power of two the rewriter folds the AND into the address arithmetic, so the conditional branch typically collapses to a single comparison against zero.
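The selector arithmetic can be checked with a few lines of host code. This is a sketch with hypothetical helper names (partition_selector, issues_copy, is_power_of_two); it assumes cta_group is non-zero.

```cpp
#include <cstdint>

inline bool is_power_of_two(uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

// (rank mod cta_group) AND (cta_group - 1), as described in the text.
inline uint32_t partition_selector(uint32_t rank, uint32_t cta_group) {
    return (rank % cta_group) & (cta_group - 1);
}

// The CTA issues the copy only when its selector is zero.
inline bool issues_copy(uint32_t rank, uint32_t cta_group) {
    return partition_selector(rank, cta_group) == 0;
}
```

When cta_group is a power of two, rank % cta_group already lies in the range the mask preserves, so the selector equals rank & (cta_group - 1). That redundancy is what lets the AND fold away.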
The destination address space is 6, which is the tensor-memory address space in the NVVM dialect's address-space convention (0 = generic, 1 = global, 3 = shared, 4 = constant, 5 = local, 6 = tmem). TMEM is not addressable from generic pointers — every TMEM access must go through the cp.async.bulk.tensor or tcgen05 paths, and the address-space sentinel keeps that contract explicit through the entire pipeline.
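A reimplementation may find it convenient to name these address-space numbers once and query them. The enum values below are the ones listed in the paragraph above; AddrSpace and the two helper predicates are hypothetical names, and the mbarrier check restates the shared-memory-only rule from the mbarrier_init walk.

```cpp
#include <cstdint>

// Address-space numbers as listed in the text.
enum class AddrSpace : uint32_t {
    Generic = 0,
    Global = 1,
    Shared = 3,
    Constant = 4,
    Local = 5,
    Tmem = 6
};

// Hypothetical helper: TMEM pointers must go through the dedicated
// cp.async.bulk.tensor or tcgen05 paths rather than generic loads/stores.
inline bool needs_dedicated_tmem_path(AddrSpace as) {
    return as == AddrSpace::Tmem;
}

// Hypothetical helper: mbarrier storage is only valid in shared memory.
inline bool valid_mbarrier_space(AddrSpace as) {
    return as == AddrSpace::Shared;
}
```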
The !nv_tileas.async.token value (produced by ops in the nv_tileas.async.* family) again becomes a placeholder i32. Completion observation runs through the mbarrier the cp.async.bulk.tensor increments on its way out — the consumer side reads its phase from the matching nvvm.mbarrier.try_wait.parity.shared, and the i32 token's only purpose is the IR-level data-dependence edge.
cute.tiled.copy.partition_D Pointer-Offset Walk
cute.tiled.copy.partition_D (and its companion partition_S) carves a tile-sized window out of a larger layout, producing a GEP and a residual layout for the sub-tile:
// Before
%sub_ptr, %sub_layout =
cute.tiled.copy.partition_D %base_ptr, %layout, %tile_coord
: !cute.ptr<f16, 3>, !cute.layout<<128:1, 64:128>>, !cute.coord<2>
-> !cute.ptr<f16, 3>, !cute.layout<<32:1, 32:128>>
// After
%coord0 = llvm.extractvalue %tile_coord[0] : !llvm.struct<(i32, i32)>
%coord1 = llvm.extractvalue %tile_coord[1] : !llvm.struct<(i32, i32)>
%stride0 = llvm.extractvalue %layout[1] : !llvm.struct<(i32, i32, i32, i32)>
%stride1 = llvm.extractvalue %layout[3] : !llvm.struct<(i32, i32, i32, i32)>
%off0 = llvm.mul %coord0, %stride0 : i32
%off1 = llvm.mul %coord1, %stride1 : i32
%offset = llvm.add %off0, %off1 : i32
%sub_ptr = llvm.getelementptr %base_ptr[%offset]
: (!llvm.ptr<3>, i32) -> !llvm.ptr<3>
%sub_layout = llvm.mlir.constant(
dense<[32, 1, 32, 128]> : tensor<4xi32>)
: !llvm.struct<(i32, i32, i32, i32)>
The pointer offset is crd2idx(tile_coord, layout.shape, layout.stride) — the dot product of the coordinate tuple with the stride tuple of the parent layout. The sub-tile layout is computed at conversion time from the parent layout's shape and the partition tile size, then emitted as a constant if both are static or as a fresh struct construction sequence otherwise. The result pointer keeps the parent's address-space tag (here 3, shared memory) because tile partitioning does not cross address spaces.
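The offset computation is the mul/mul/add chain from the walk, generalised to any rank. A minimal sketch for flat, in-bounds coordinates, where the shape argument is not needed; the function name follows the text but the signature is an assumption:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Dot product of a flat coordinate tuple with the parent layout's strides.
// Extra entries in the longer tuple are ignored.
inline int64_t crd2idx(const std::vector<int32_t> &coord,
                       const std::vector<int32_t> &stride) {
    const std::size_t rank = coord.size() < stride.size() ? coord.size() : stride.size();
    int64_t offset = 0;
    for (std::size_t i = 0; i < rank; ++i) {
        offset += static_cast<int64_t>(coord[i]) * stride[i];
    }
    return offset;
}
```

With the parent layout <128:1, 64:128> from the walk, coordinate (1, 2) gives 1 * 1 + 2 * 128 = 257 elements, which is the index fed to the getelementptr.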
Dialect Registration Semantics
The cute dialect publishes a broad operation set that falls into a small number of semantic classes:
- pure layout algebra and tuple operations;
- memory-effecting load, store, and print operations;
- type-inference operations such as pointer casts and atom construction;
- verifier-heavy layout operations that reject non-positive or malformed tuple leaves;
- no-interface helper operations used as desugaring intermediates.
Model these classes explicitly in any reimplementation. The verifier is not optional: malformed CuTe tuple leaves can otherwise survive until descriptor packing, where the error becomes much harder to explain.
Architecture-Specialized Atoms
Three atom rewriters carry the architectural split. They register as independent OpConversionPattern subclasses rather than one switch, so the dialect-conversion engine selects among them by op kind rather than by runtime dispatch inside a shared rewriter.
| Atom | Architecture | Accumulator location | Critical state |
|---|---|---|---|
| Sm90WgmmaAtom | SM90 Hopper | warpgroup register file | GMMA shared-memory descriptors, WGMMA fence |
| Sm100ImmaAtom | SM100 Blackwell | tensor memory | TMEM pointer plus mbarrier ownership |
| Sm100S2tCopyAtom | SM100 Blackwell | tensor memory | Cluster CTA rank for multi-CTA copy partition |
Accumulator location is the structural distinction. Hopper WGMMA accumulates in the warpgroup register file, so the rewriter materialises a register-allocated accumulator and packages it as the atom's result. Blackwell IMMA accumulates in tensor memory and the S2T copy writes into it, so their rewriters materialise tensor-memory references and any required mbarrier ownership before emitting the atom.
Hopper WGMMA Contract
The SM90 WGMMA atom rewriter builds operand descriptors for shared-memory matrices, creates a register accumulator, emits the WGMMA fence, and packages the atom for later NVGPU/NVVM lowering. Descriptor packing is deterministic integer arithmetic over the shared-memory base pointer, leading-byte offset, matrix stride, swizzle mode, and base offset — see the canonical bit layout in MMA Atoms sm70-120 — SM90 WGMMA. The packer is side-effect-free so common-subexpression elimination can hoist redundant descriptor construction across loop iterations.
%atom = cute_nvgpu.atom.sm90.wgmma %a_smem, %b_smem, %shape, %elt
↓
%desc_a = cute_nvgpu.gmma.descriptor %a_smem : i64 // packed bitfield
%desc_b = cute_nvgpu.gmma.descriptor %b_smem : i64
%acc = cute_nvgpu.register.accumulator %shape, %elt // warpgroup registers
nvvm.wgmma.fence.aligned
%atom = cute_nvgpu.atom %desc_a, %desc_b, %acc
The atom is a CuTe payload, not an executable WGMMA. The fence sits between descriptor materialisation and the atom packaging because schedulers can move descriptor construction freely but cannot reorder it past the fence; emitting the fence here pins the boundary the consumer pass relies on.
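Descriptor packing as side-effect-free integer arithmetic can be sketched as follows. The field positions are an assumption taken from the publicly documented PTX wgmma matrix-descriptor format (14-bit fields holding byte values shifted right by 4, a 3-bit base offset, a 2-bit swizzle mode). Whether tileiras uses exactly this layout should be checked against the MMA Atoms page. pack_smem_descriptor and encode14 are hypothetical names.

```cpp
#include <cstdint>

// Assumed encoding of address-like fields: keep 18 bits, drop the low 4.
inline uint64_t encode14(uint64_t bytes) { return (bytes & 0x3FFFF) >> 4; }

// Assumed field positions (not confirmed by this document):
//   bits  0..13  shared-memory start address
//   bits 16..29  leading-dimension byte offset
//   bits 32..45  stride-dimension byte offset
//   bits 49..51  base offset
//   bits 62..63  swizzle mode
inline uint64_t pack_smem_descriptor(uint32_t smem_addr,
                                     uint32_t leading_byte_offset,
                                     uint32_t stride_byte_offset,
                                     uint32_t base_offset,
                                     uint32_t swizzle_mode) {
    uint64_t d = 0;
    d |= encode14(smem_addr);
    d |= encode14(leading_byte_offset) << 16;
    d |= encode14(stride_byte_offset) << 32;
    d |= (static_cast<uint64_t>(base_offset) & 0x7) << 49;
    d |= (static_cast<uint64_t>(swizzle_mode) & 0x3) << 62;
    return d;
}
```

The function depends only on its arguments, so two calls with the same inputs always produce the same i64. That is the property that lets common-subexpression elimination merge redundant descriptor construction.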
Blackwell IMMA and S2T Contract
Blackwell IMMA lowers through tensor memory rather than the register file. The rewrite validates operand element types, retrieves a tensor-memory destination via the retrieve_tmem_ptr lowering (see SM100 retrieve_tmem_ptr Lowering), initialises any required mbarriers, and emits a CuTe atom payload that the tcgen05 path later consumes.
%atom = cute_nvgpu.atom.sm100.imma %a_smem, %b_smem, %shape, %elt
↓
%tmem = cute_nvgpu.arch.sm100.retrieve_tmem_ptr %handle, %cols
%mbar = cute_nvgpu.mbarrier.init %ticks
%atom = cute_nvgpu.atom %a_smem, %b_smem, %tmem, %mbar
S2T copy follows the same shape and additionally owns cluster-rank arithmetic. For multi-CTA shapes, the rewriter reads the cluster CTA rank, computes the rank modulo the participating CTA group, and emits the conditional copy structure for the selected partition. The partition computation reduces to two integer operations: rank % cta_group for the local index and a bitwise mask (rank % cta_group) & (cta_group - 1) that the rewriter folds into the destination address arithmetic when cta_group is a power of two. In that power-of-two case the mask alone is sufficient, because rank & (cta_group - 1) equals rank % cta_group.
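The partition arithmetic can be sketched in plain C++. This is only a model of the integer math described above; the helper names isPowerOfTwo and localPartitionIndex are hypothetical and do not come from the binary.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when g is a nonzero power of two.
inline bool isPowerOfTwo(uint32_t g) { return g != 0 && (g & (g - 1)) == 0; }

// Hypothetical helper: local partition index of a CTA inside its group.
// The mask form is used when the group size is a power of two; otherwise
// the function falls back to the modulo.
inline uint32_t localPartitionIndex(uint32_t rank, uint32_t ctaGroup) {
  assert(ctaGroup != 0 && "cta_group must be positive");
  if (isPowerOfTwo(ctaGroup)) return rank & (ctaGroup - 1);
  return rank % ctaGroup;
}
```

For a two-CTA group, ranks 4 and 5 map to local indices 0 and 1, and both forms agree for every rank.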
SM100 retrieve_tmem_ptr Lowering
cute_nvgpu.arch.sm100.retrieve_tmem_ptr converts a TMEM handle — a 32-bit token returned by tcgen05.alloc.shared — into a typed i32* pointing into the per-CTA tensor-memory file. Multiple consumers in the same kernel call retrieve_tmem_ptr against the same handle, and emitting tcgen05.alloc more than once for one handle is illegal hardware behaviour. A per-function cache keyed by the handle SSA value is therefore the primary correctness mechanism: the first retrieval emits the alloc, subsequent retrievals reuse the cached pointer.
On a cache hit the rewrite is a no-op replacement with the cached pointer. On a miss the rewriter emits a four-op LLVM sequence and inserts the resulting pointer into the cache under the handle key:
%tmem_ptr = cute_nvgpu.arch.sm100.retrieve_tmem_ptr %tmem_handle, %num_columns
↓
%handle = nvvm.tcgen05.alloc.shared {num_columns = N : i32} : i32
llvm.store %handle, %tmem_alloc_handle_slot : !llvm.ptr // for later relinquish
nvvm.tcgen05.relinquish_alloc_permit // permit other CTAs to alloc
%tmem_ptr = llvm.load %tmem_handle_addr : !llvm.ptr -> !llvm.ptr<3>
The kernel-entry prologue emits tmem_alloc_handle_slot and tmem_handle_addr earlier, both living in the function's stack frame, so the retrieval lowering reads them as already-allocated stack slots rather than constructing them on demand.
Value lowerRetrieveTmemPtr(RetrieveTmemPtrOp op, Value handle,
                           ConversionPatternRewriter &rw,
                           DenseMap<Value, Value> &cache) {
  // Cache hit: reuse the pointer produced by the first retrieval.
  if (Value cached = cache.lookup(handle)) return cached;
  Location loc = op.getLoc();
  // Cache miss: allocate once, record the handle, release the alloc permit.
  Value h = rw.create<nvvm::Tcgen05AllocSharedOp>(loc, op.getNumColumns());
  rw.create<llvm::StoreOp>(loc, h, getTmemHandleSlot(op));
  rw.create<nvvm::Tcgen05RelinquishAllocPermitOp>(loc);
  Value ptr = rw.create<llvm::LoadOp>(loc, llvmPtr(/*as=*/3), getTmemHandleAddr(op));
  cache.insert({handle, ptr});
  return ptr;
}
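The once-per-handle guarantee can be checked in isolation with a toy model. The sketch below uses integers as stand-ins for SSA values; TmemPtrCache, allocsEmitted, and the 1000 + handle pointer id are illustrative assumptions, not names or values from Tileiras.

```cpp
#include <cassert>
#include <unordered_map>

// Toy model of the per-function cache: handle id -> pointer id.
struct TmemPtrCache {
  std::unordered_map<int, int> ptrByHandle;
  int allocsEmitted = 0;  // counts simulated tcgen05.alloc emissions

  int retrieve(int handle) {
    auto it = ptrByHandle.find(handle);
    if (it != ptrByHandle.end()) return it->second;  // hit: no new alloc
    ++allocsEmitted;                                 // miss: alloc exactly once
    int ptr = 1000 + handle;                         // stand-in pointer value
    ptrByHandle.emplace(handle, ptr);
    return ptr;
  }
};
```

Retrieving the same handle twice returns the same pointer and leaves the alloc counter at one, which is the property the real cache has to preserve.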
The SM100 populator installs fifteen patterns in one call. The roster covers retrieve_tmem_ptr, tmem_load, tmem_store, tmem_alloc, tmem_dealloc, and ten further tcgen05 operations including load_b8x256 and store_b8x256. The populator gates on the tmem subtarget feature (see NVPTX Subtarget and Feature Matrix — Cached Tensor-Memory Predicate); on non-Blackwell or consumer-Blackwell builds the populator is invoked with a no-op flag and registers nothing, so the conversion target never accepts cute_nvgpu.arch.sm100.* operations and any surviving op fails legalisation with a clean diagnostic.
Conversion Invariants
- Desugaring must run before primitive CuTe conversion.
- Desugaring is target-neutral.
- Descriptor iterators must lower to a stable LLVM aggregate layout.
- CuTe tuple and layout verifiers must reject malformed non-positive leaves before descriptor construction.
- SM90 WGMMA uses register accumulators; SM100 IMMA and S2T copy use tensor-memory-backed structures.
- Atom lowerings should emit explicit diagnostics for unsupported architecture or operand type combinations.
- No CuTe-only executable operation may reach final NVPTX serialization.
Cross-References
Conversion / Lowering Overview places CuTe lowering in the companion-dialect stage that runs after TileAS bodies have lowered to LLVM. nvgpu / gpu to NVVM — NVGPU Dialect Lowering is the sister pass that consumes the architectural atoms this pass emits — WGMMA atoms go to nvvm.wgmma.*, IMMA and S2T copy atoms go to nvvm.tcgen05.*. TileAS to LLVM — Body Conversion Phases emits the residual cute_nvgpu.arch.sm100.retrieve_tmem_ptr operations this pass resolves through the per-function TMEM cache. MMA Atoms sm70-120 — SM90 WGMMA carries the canonical bit layout for GMMA descriptors. Layout Algebra — Composition gives the mathematical definition the cute.compose walk above lowers. SM Tier Roster — Copy Atom Registry lists the S2T copy atom variants the SM100 walk dispatches by.
Lowering: nvgpu / gpu to NVVM
Abstract
This lowering family is the final MLIR-side step. It strips the standard gpu and nvgpu dialects from a Tileiras kernel module: portable GPU concepts (thread indices, barriers, dynamic shared memory, subgroup operations, printf) and NVIDIA-specific operations (async-copy, tensor-memory, mbarrier, WGMMA, sparse MMA, packed arithmetic) all become NVVM and LLVM operations the NVPTX backend can consume.
The contract is semantic, not archaeological: once these conversions run, no executable gpu.* or nvgpu.* operation should remain. The resulting module contains llvm.*, nvvm.*, and a small set of explicitly legal container or bridge operations that later serialization already understands.
Boundary Contract
Two related but distinct jobs share this pass.
gpu -> nvvm lowers the standard MLIR GPU dialect: thread and block index queries, cluster index queries, barriers, GPU function boundaries, GPU returns, dynamic shared memory, shuffle/reduce operations, printf, and math operations that need libdevice calls.
nvgpu -> nvvm lowers NVIDIA architectural operations: mbarrier operations, TMA tensor copy operations, descriptor construction and prefetching, WGMMA descriptor and accumulator operations, synchronous MMA, ldmatrix, SM80-style cp.async, sparse MMA, reciprocal approximation, packed float conversion, and packed f32x2 arithmetic.
The conversion target is strict:
| Input concept | Output form |
|---|---|
| gpu.thread_id, gpu.block_id, dimension queries | nvvm.read.ptx.sreg.* and integer arithmetic |
| gpu.barrier | nvvm.barrier0 |
| cf.assert in GPU code | guarded call to CUDA-compatible __assertfail |
| gpu.printf | vprintf call with lowered format and argument buffer |
| math.* operations that require device helpers | scalarized libdevice __nv_* calls |
| nvgpu.mbarrier.* | nvvm.mbarrier.*, usually with shared-memory variants |
| nvgpu.tma.* | nvvm.cp.async.bulk.tensor.*, tensor-map helpers, and proxy fences |
| nvgpu.warpgroup.* | WGMMA NVVM operations plus LLVM value packing |
| nvgpu.mma.sync, nvgpu.ldmatrix | matching NVVM matrix intrinsics plus LLVM repacking |
| nvgpu.device_async_* | SM80 nvvm.cp.async.* group operations |
| nvgpu.mma.sp.sync | llvm.inline_asm carrying the PTX sparse-MMA instruction |
| SM100 packed arithmetic and conversion ops | dedicated nvvm.* packed operations |
Violation behavior is uniform across the two halves of the pass: any executable gpu.* or nvgpu.* op remaining after the partial conversion is a hard failure — applyPartialConversion reports the unconverted op and the pass fails. An nvgpu.mbarrier.* op whose operand does not resolve to a shared-memory pointer is rejected by the typed-operand trait check (which surfaces as the verbatim " must be mbarrier barrier type, but got " diagnostic, prefixed by the operand label and suffixed with the printed offending type) rather than implicitly inserting an address-space cast, because the cast would change the semantic memory space tcgen05 lowering relies on. A vector-typed math.* operation that reaches libdevice dispatch without prior scalarisation is rejected by the conversion target rather than dispatched lane-by-lane silently. A cf.assert whose message globals cannot be materialised falls through to the upstream LLVM diagnostic. The gpu.module container itself is the only legal gpu.* surface on output; any other surviving gpu.* op signals a missing pattern in this bank.
GPU Dialect Lowering
The standard GPU pass builds a conversion target that legalises LLVM and NVVM, keeps gpu.module and gpu.yield legal so kernel bodies can be rewritten in place, marks the rest of the GPU dialect illegal, and adds libdevice-backed math operations and cf.assert to the illegal set. A surviving gpu.* op after this pass means either no pattern was registered or the pattern rejected the operation; the strict target makes the failure mode visible.
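A minimal sketch of the legality rules, assuming operation names are available as plain strings. The Legality enum and classify function are hypothetical; a real pass would express the same rules on an MLIR conversion target, and the libdevice-backed math.* cases are left out of this model.

```cpp
#include <cassert>
#include <string>

enum class Legality { Legal, Illegal, Unspecified };

// True when s begins with prefix.
inline bool startsWith(const std::string &s, const std::string &prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical classifier mirroring the strict target described above.
inline Legality classify(const std::string &opName) {
  // Container and terminator stay legal so bodies can be rewritten in place.
  if (opName == "gpu.module" || opName == "gpu.yield") return Legality::Legal;
  // Every other op from the source dialects must have been rewritten.
  if (startsWith(opName, "gpu.") || startsWith(opName, "nvgpu."))
    return Legality::Illegal;
  // cf.assert is illegal because it needs the __assertfail lowering.
  if (opName == "cf.assert") return Legality::Illegal;
  // LLVM and NVVM are the legal output dialects.
  if (startsWith(opName, "llvm.") || startsWith(opName, "nvvm."))
    return Legality::Legal;
  // Anything else is left to other passes in a partial conversion.
  return Legality::Unspecified;
}
```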
Index Queries
Thread, block, cluster, and grid index queries each rewrite to one NVVM special-register read plus an i32-to-index cast. The shape is uniform across the family — only the special-register name varies.
%i = gpu.thread_id x : index
↓
%r = nvvm.read.ptx.sreg.tid.x : i32
%i = arith.index_cast %r : i32 to index
The full mapping covers nine source operations:
| Source | Special register |
|---|---|
| gpu.thread_id {x,y,z} | nvvm.read.ptx.sreg.tid.{x,y,z} |
| gpu.block_id {x,y,z} | nvvm.read.ptx.sreg.ctaid.{x,y,z} |
| gpu.block_dim {x,y,z} | nvvm.read.ptx.sreg.ntid.{x,y,z} |
| gpu.grid_dim {x,y,z} | nvvm.read.ptx.sreg.nctaid.{x,y,z} |
| gpu.cluster_id {x,y,z} | nvvm.read.ptx.sreg.clusterid.{x,y,z} |
| gpu.cluster_dim {x,y,z} | nvvm.read.ptx.sreg.nclusterid.{x,y,z} |
| gpu.cluster_block_id {x,y,z} | nvvm.read.ptx.sreg.cluster.ctaid.{x,y,z} |
| gpu.subgroup_size | nvvm.read.ptx.sreg.warpsize |
| gpu.lane_id | nvvm.read.ptx.sreg.laneid |
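Because only the register name varies, the dimensioned rows of the table reduce to a stem lookup plus a dimension suffix. The following sketch assumes a hypothetical IndexQuery enum and sregName helper; the register spellings are copied from the table, and the two dimensionless queries (subgroup size and lane id) are omitted.

```cpp
#include <cassert>
#include <string>

enum class IndexQuery {
  ThreadId, BlockId, BlockDim, GridDim, ClusterId, ClusterDim, ClusterBlockId
};

// Hypothetical helper: special-register op name for a query and dimension.
inline std::string sregName(IndexQuery q, char dim) {
  assert(dim == 'x' || dim == 'y' || dim == 'z');
  const char *stem = "";
  switch (q) {
    case IndexQuery::ThreadId:       stem = "tid"; break;
    case IndexQuery::BlockId:        stem = "ctaid"; break;
    case IndexQuery::BlockDim:       stem = "ntid"; break;
    case IndexQuery::GridDim:        stem = "nctaid"; break;
    case IndexQuery::ClusterId:      stem = "clusterid"; break;
    case IndexQuery::ClusterDim:     stem = "nclusterid"; break;
    case IndexQuery::ClusterBlockId: stem = "cluster.ctaid"; break;
  }
  return std::string("nvvm.read.ptx.sreg.") + stem + "." + dim;
}
```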
Barrier
The CTA-wide barrier rewrite is one-to-one and must not introduce control flow — schedulers downstream rely on a barrier appearing exactly where the source op did.
gpu.barrier
↓
nvvm.bar.sync.aligned %c0 : i32
The aligned variant is mandatory: tileiras kernels always launch with warp-aligned thread counts, and the non-aligned barrier would force a fallback path the scheduler has not budgeted for.
Assert
cf.assert preserves CUDA's runtime contract. Message, source file, and function name become module-level global strings; the original predicate controls a conditional branch where the failing edge calls __assertfail and the passing edge falls through.
cf.assert %cond, "message" : i1
↓
llvm.cond_br %cond, ^cont, ^fail
^fail:
%msg = llvm.mlir.addressof @.assert_msg : !llvm.ptr
%file = llvm.mlir.addressof @.assert_file : !llvm.ptr
%func = llvm.mlir.addressof @.assert_func : !llvm.ptr
llvm.call @__assertfail(%msg, %file, %line, %func, %c0_i64) : ...
llvm.br ^cont
^cont:
...
__assertfail is the CUDA runtime symbol the linker resolves; the signature (char*, char*, i32 line, char*, i64 charSize) is fixed by the runtime ABI and any reimplementation must call it with exactly those argument types in that order.
Libdevice Math
Vector lanes are scalarised before libdevice dispatch because libdevice functions are scalar and downstream cleanup folds scalar LLVM operations far more reliably than dialect-vector calls. The rewriter walks vector results, emits a per-lane libdevice call selected by element type, and reconstructs the vector with vector.from_elements.
%r = math.sqrt %v : vector<4xf32>
↓
%v0 = vector.extract %v[0] : f32 from vector<4xf32>
%v1 = vector.extract %v[1] : f32 from vector<4xf32>
%v2 = vector.extract %v[2] : f32 from vector<4xf32>
%v3 = vector.extract %v[3] : f32 from vector<4xf32>
%r0 = llvm.call @__nv_sqrtf(%v0) : (f32) -> f32
%r1 = llvm.call @__nv_sqrtf(%v1) : (f32) -> f32
%r2 = llvm.call @__nv_sqrtf(%v2) : (f32) -> f32
%r3 = llvm.call @__nv_sqrtf(%v3) : (f32) -> f32
%r = vector.from_elements %r0, %r1, %r2, %r3 : vector<4xf32>
The callee name comes from a (MathOpKind, ElementType) table: math.sqrt of f32 selects __nv_sqrtf, of f64 selects __nv_sqrt, of f16 selects __nv_sqrtf with operand promotion. Reflection-resolved variants for fast-math and unsafe-math intrinsics (__nv_fast_sinf, __nv_unsafe_divf) attach via the fastmath attribute on the source op.
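The table lookup can be sketched as a small map keyed on the operation kind and element type. Only the three math.sqrt rows named above are filled in; the MathOpKind and ElementType enums, the LibdeviceCallee struct, and selectCallee are hypothetical names, and the fast-math variants are not modelled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class MathOpKind { Sqrt };
enum class ElementType { F16, F32, F64 };

struct LibdeviceCallee {
  std::string name;   // libdevice symbol to call per lane
  bool promoteToF32;  // true when the operand is widened before the call
};

// Hypothetical lookup: returns nullptr when no table entry exists.
inline const LibdeviceCallee *selectCallee(MathOpKind op, ElementType ty) {
  static const std::map<std::pair<MathOpKind, ElementType>, LibdeviceCallee>
      table = {
          {{MathOpKind::Sqrt, ElementType::F32}, {"__nv_sqrtf", false}},
          {{MathOpKind::Sqrt, ElementType::F64}, {"__nv_sqrt", false}},
          {{MathOpKind::Sqrt, ElementType::F16}, {"__nv_sqrtf", true}},
      };
  auto it = table.find({op, ty});
  return it == table.end() ? nullptr : &it->second;
}
```

A rewriter built on such a table would call selectCallee once per vector op and then emit one call per lane with the returned symbol.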
NVGPU Dialect Lowering
The NVGPU conversion is a table-driven pattern set. Each pattern has one root operation and a typed matchAndRewrite body. Most emit a single NVVM operation. A handful are structural: tensor-map descriptor construction writes an LLVM stack object, WGMMA store decomposes an accumulator into per-thread stores, and sparse MMA emits inline assembly because the dialect snapshot doesn't model that instruction as a first-class NVVM op.
| Source family | Lowering behavior |
|---|---|
| nvgpu.mbarrier.create | creates or references a private shared-memory barrier object |
| nvgpu.mbarrier.init | initializes the barrier with the requested participant count |
| nvgpu.mbarrier.arrive* | emits arrival, no-complete, and expect-transaction NVVM intrinsics |
| nvgpu.mbarrier.test.wait | tests and waits on a phase or token |
| nvgpu.mbarrier.try_wait.parity | emits the parity-sensitive wait primitive |
| nvgpu.tma.async.load | emits tensor bulk copy from global tensor memory into shared memory |
| nvgpu.tma.async.store | emits tensor bulk copy from shared memory back to global tensor memory |
| nvgpu.tma.create.descriptor | builds the tensor-map descriptor that the CUDA driver can encode |
| nvgpu.tma.prefetch.descriptor | emits tensor-map prefetch |
| nvgpu.tma.fence.descriptor | emits proxy acquire fence for descriptor visibility |
| nvgpu.warpgroup.generate.descriptor | packs the GMMA shared-memory descriptor bitfields |
| nvgpu.warpgroup.mma | emits WGMMA fence, async MMA, commit, and wait operations |
| nvgpu.warpgroup.mma.store | maps accumulator fragments to per-thread stores |
| nvgpu.warpgroup.mma.init.accumulator | builds the zero or poison accumulator aggregate |
| nvgpu.mma.sync | emits synchronous MMA NVVM intrinsic |
| nvgpu.ldmatrix | emits ldmatrix and repacks the returned fragments |
| nvgpu.device_async_copy | emits SM80 cp.async.shared.global |
| nvgpu.device_async_create_group | emits cp.async.commit.group |
| nvgpu.device_async_wait | emits cp.async.wait.group |
| nvgpu.mma.sp.sync | emits sparse MMA inline assembly |
| nvgpu.rcp | emits reciprocal approximation |
| nvgpu.cvt_fptrunc, nvgpu.cvt_fpext | emits packed float conversion |
| nvgpu.fma.packed.f32x2, nvgpu.mul.packed.f32x2 | emits packed f32x2 arithmetic |
Each entry above is a distinct OpConversionPattern subclass registered against its root op. The conversion engine selects among them by op kind; there is no shared dispatcher inside a single rewriter.
Pattern Shapes
Every NVGPU pattern in this stage shares one outer shape: match on a root NVGPU op, convert its operands through the shared LLVM type converter, emit one or more NVVM ops plus any packing arithmetic, and replace the root. The shapes below cover the families that need more than a single emission step; the remaining one-to-one patterns reduce to generic_remap from the 43-instantiation arith bank.
Mbarrier
The mbarrier family rewrites the five nvgpu.mbarrier.* operations into matching nvvm.mbarrier.* intrinsics. The shared-memory variants take a !llvm.ptr<3> barrier address; the rewriter never casts a generic pointer to shared and instead rejects any operand that does not resolve to a shared-memory pointer.
%bar = nvgpu.mbarrier.create : !nvgpu.mbarrier
↓
%bar = llvm.mlir.addressof @mbar_storage : !llvm.ptr<3>
nvgpu.mbarrier.init %bar, %count : !nvgpu.mbarrier, i32
↓
nvvm.mbarrier.init.shared %bar, %count : !llvm.ptr<3>, i32
nvgpu.mbarrier.arrive %bar : !nvgpu.mbarrier -> !nvgpu.token
↓
%tok = nvvm.mbarrier.arrive.shared %bar : !llvm.ptr<3> -> i64
nvgpu.mbarrier.arrive.expect_tx %bar, %tx_count : !nvgpu.mbarrier, i32
↓
nvvm.mbarrier.arrive.expect_tx.shared %bar, %tx_count : !llvm.ptr<3>, i32
%t = nvgpu.mbarrier.try_wait.parity %bar, %phase, %ticks
↓
%t = nvvm.mbarrier.try_wait.parity.shared %bar, %phase, %ticks : !llvm.ptr<3>, i1, i32 -> i1
// `nvgpu.mbarrier.inval` is not interned in this binary; the lower-level
// `nvvm.mbarrier.inval.shared` intrinsic is still emitted directly by
// callers (e.g. CTAExit cleanup) without an `nvgpu` wrapper.
nvvm.mbarrier.inval.shared %bar : !llvm.ptr<3>
If the source operand does not resolve to a shared-memory pointer, the rewriter fails via the typed-operand trait check, surfacing the verbatim " must be mbarrier barrier type, but got " diagnostic (prefixed by the operand label and followed by the printed offending type). The pattern rejects rather than inserts an implicit cast because the cast would change the semantic memory space and downstream tcgen05 lowering depends on shared-memory residence.
TMA Async Load
nvgpu.tma.async.load rewrites to nvvm.cp.async.bulk.tensor.shared.cluster.global with the descriptor pointer, coordinate operands, and barrier. The optional multicastMask and l2CacheHint operands wire into the intrinsic's optional argument slots when present.
nvgpu.tma.async.load %desc, %smem, %coords[%c0, %c1, %c2, %c3, %c4], %barrier
{ multicastMask = 0x000F : i16, l2CacheHint = 0xCAFE : i64 }
↓
nvvm.cp.async.bulk.tensor.shared.cluster.global.5d
%smem, %desc, %barrier, %c0, %c1, %c2, %c3, %c4,
multicast_mask = %mask, l2_cache_hint = %hint
: !llvm.ptr<3>, !llvm.ptr<1>, !llvm.ptr<3>, i32 x 5, i16, i64
Operand mapping (rank N)
The intrinsic signature for nvvm.cp.async.bulk.tensor.{N}d.shared.cluster.global.tile is rank-parameterised; the multicastMask and l2CacheHint operands are optional. The rewriter maps the flat nvgpu operand list onto positional intrinsic slots and sets two Unit-typed enable attributes that gate the optional slots.
| nvgpu.tma.async.load operand | NVVM intrinsic slot |
|---|---|
| dst (SMEM memref, addr-space 3) | slot 0, dstAddr : ptr addrspace(3) |
| tensorMapDescriptor | slot 1, tensorMap : ptr to the 128-byte CUtensorMap |
| coordinates[0..N-1] | slots 2..N+1, coords : i32, one per rank |
| barrier | slot N+2, barrier : ptr addrspace(3) |
| multicastMask (optional) | slot N+3, multicastMask : i16 |
| l2CacheHint (optional) | slot N+4, cacheHint : i64 |
The two Unit attributes (multicastEnable, cacheHintEnable) are not nvgpu attributes — they are produced by the rewriter from operand presence. When multicastMask is supplied, the rewriter sets multicastEnable = unit on the new nvvm.* op; otherwise it leaves both operand and enable absent. The same rule applies to l2CacheHint / cacheHintEnable.
Worked example, 3-D TMA load with both optional operands:
%smem : memref<128x128xf16, 3>
%bar : !nvgpu.mbarrier.group
%tmap : !nvgpu.tensormap.descriptor
%c0,%c1,%c2 : i32
%mask : i16
%hint : i64
input :
nvgpu.tma.async.load %smem[%c0,%c1,%c2], %bar, %tmap,
multicastMask = %mask,
l2CacheHint = %hint
output :
%smem_ptr = unrealized_conversion_cast %smem : memref<128x128xf16, 3> to !llvm.ptr<3>
%bar_ptr = ... : !llvm.ptr<3>
%tmap_ptr = ... : !llvm.ptr
nvvm.cp.async.bulk.tensor.3d.shared.cluster.global.tile
%smem_ptr, // slot 0
%tmap_ptr, // slot 1
%c0, %c1, %c2, // slots 2..4
%bar_ptr, // slot 5
%mask, // slot 6 (multicast)
%hint // slot 7 (cache hint)
{ multicastEnable, cacheHintEnable, mode = #nvvm.load_mode<tile> }
If %mask is absent, slot 6 is dropped and multicastEnable is not set; slot 7 (if %hint is present) shifts left into slot 6 of the actually-emitted call. The intrinsic ID stays the same; only the operand bag changes width. Absent operands leave slots unset rather than emitting zero constants — a zero cacheHint would force a non-default code path in the backend.
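The slot-shifting rule can be modelled with a small operand-bag builder. This is a sketch under the slot order given in the mapping table; TmaLoadOperands and buildTmaLoadOperands are hypothetical names, SSA values are represented as strings, and a null pointer stands for an absent optional operand.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct TmaLoadOperands {
  std::vector<std::string> slots;  // positional operands of the emitted op
  bool multicastEnable = false;    // set only when a mask operand exists
  bool cacheHintEnable = false;    // set only when a hint operand exists
};

// Hypothetical builder: mask and hint are null when the operand is absent.
inline TmaLoadOperands buildTmaLoadOperands(
    const std::string &dst, const std::string &tensorMap,
    const std::vector<std::string> &coords, const std::string &barrier,
    const std::string *mask, const std::string *hint) {
  TmaLoadOperands out;
  out.slots.push_back(dst);        // slot 0
  out.slots.push_back(tensorMap);  // slot 1
  out.slots.insert(out.slots.end(), coords.begin(), coords.end());
  out.slots.push_back(barrier);    // slot N+2
  // Absent operands are skipped entirely; no zero constant is emitted.
  if (mask) { out.slots.push_back(*mask); out.multicastEnable = true; }
  if (hint) { out.slots.push_back(*hint); out.cacheHintEnable = true; }
  return out;
}
```

With three coordinates and both optional operands the bag has eight entries; dropping the mask leaves seven, with the hint in slot 6.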
TMA Async Store
nvgpu.tma.async.store is the symmetric reverse direction. The descriptor and coordinates appear in the same operand positions; the source becomes shared memory and the destination becomes global memory. There is no barrier — the producer issues the store and continues.
nvgpu.tma.async.store %smem, %desc, %coords[%c0, %c1, %c2, %c3, %c4]
↓
nvvm.cp.async.bulk.tensor.global.shared.cta.5d
%desc, %smem, %c0, %c1, %c2, %c3, %c4
: !llvm.ptr<1>, !llvm.ptr<3>, i32 x 5
Operand mapping (rank N):
| nvgpu.tma.async.store operand | NVVM intrinsic slot |
|---|---|
| tensorMapDescriptor | slot 0, tensorMap : ptr |
| coordinates[0..N-1] | slots 1..N, coords : i32 |
| src (SMEM memref, addr-space 3) | slot N+1, srcAddr : ptr addrspace(3) |
| l2CacheHint (optional) | slot N+2, cacheHint : i64, gated by cacheHintEnable |
An nvgpu.tma.async.reduce wrapper is not interned in this binary. Reduce-variant lowerings are reached through cute_nvgpu straight into nvvm.cp.async.bulk.tensor.reduce, where the red_op enum selects the intrinsic ID at registration time — eight distinct intrinsics per rank, one per reduction kind. Operand layout mirrors the store form; the upstream wrapper, when present, would copy redop into the red_op slot verbatim.
The fence pattern nvgpu.tma.fence.descriptor rewrites to nvvm.fence.proxy.acquire.sync so descriptor updates from the CUDA host become visible to the device proxy before the next async load.
WGMMA Pipeline
nvgpu.warpgroup.mma expands into the four-op WGMMA protocol the hardware expects: fence, async issue, commit, wait. The accumulator is an aggregate the pattern emits as register-file values; the matching nvgpu.warpgroup.generate.descriptor pattern pre-packs the GMMA descriptors.
%acc' = nvgpu.warpgroup.mma %desc_a, %desc_b, %acc
↓
nvvm.wgmma.fence.aligned
%acc' = nvvm.wgmma.mma_async %desc_a, %desc_b, %acc : i64, i64, !llvm.struct<(f32, f32, ...)>
nvvm.wgmma.commit.group.sync.aligned
nvvm.wgmma.wait.group.sync.aligned 0
The four ops must appear in order: the fence ensures prior shared-memory stores are visible to the WGMMA pipeline; mma_async issues the operation; commit.group packages it into a group the warpgroup tracks; wait.group 0 blocks until the in-flight group count reaches zero. Reordering any pair changes the semantics — a missing fence loses input-dependence guarantees, and a missing wait races the accumulator into downstream reads.
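A reimplementation could guard the ordering with a simple post-rewrite check. The sketch below assumes the emitted ops are available as a list of names in program order; isValidWgmmaSequence is a hypothetical helper and the op spellings are the ones used in the example above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical check: the expansion must be exactly fence, issue, commit, wait.
inline bool isValidWgmmaSequence(const std::vector<std::string> &ops) {
  static const char *const expected[] = {
      "nvvm.wgmma.fence.aligned",
      "nvvm.wgmma.mma_async",
      "nvvm.wgmma.commit.group.sync.aligned",
      "nvvm.wgmma.wait.group.sync.aligned"};
  if (ops.size() != 4) return false;  // a missing or extra op is rejected
  for (std::size_t i = 0; i < 4; ++i)
    if (ops[i] != expected[i]) return false;  // any reordering is rejected
  return true;
}
```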
Ldmatrix and Repack
nvgpu.ldmatrix rewrites to nvvm.ldmatrix.sync and repacks the returned register fragments into the LLVM-typed vector the consumer expects. The shape and transpose attributes pass through verbatim onto the intrinsic.
%v = nvgpu.ldmatrix %smem, num=4, transpose=false : memref<*xi32, 3>, vector<4xi32>
↓
%p = nvvm.ldmatrix.sync %smem, num=4, trans=false
: !llvm.ptr<3> -> !llvm.struct<(i32, i32, i32, i32)>
%v0 = llvm.extractvalue %p[0] : !llvm.struct<(i32, i32, i32, i32)>
%v1 = llvm.extractvalue %p[1] : !llvm.struct<(i32, i32, i32, i32)>
%v2 = llvm.extractvalue %p[2] : !llvm.struct<(i32, i32, i32, i32)>
%v3 = llvm.extractvalue %p[3] : !llvm.struct<(i32, i32, i32, i32)>
%v = vector.from_elements %v0, %v1, %v2, %v3 : vector<4xi32>
The fragment count (1, 2, or 4) selects the struct shape: num=1 returns a single i32, num=2 returns !llvm.struct<(i32, i32)>, num=4 returns !llvm.struct<(i32, i32, i32, i32)>. The repack always uses extractvalue + vector.from_elements so the consumer sees a uniform vector regardless of fragment count.
Device Async Copy (SM80)
nvgpu.device_async_copy rewrites to SM80-era cp.async. The associated group and wait operations rewrite one-to-one.
%tok = nvgpu.device_async_copy %gmem, %smem, %size : memref<*xf32, 1>, memref<*xf32, 3>
↓
nvvm.cp.async.shared.global %smem, %gmem, %size : !llvm.ptr<3>, !llvm.ptr<1>, i32
nvgpu.device_async_create_group [%tok0, %tok1, ...] : !nvgpu.token
↓
nvvm.cp.async.commit.group
nvgpu.device_async_wait %group { numGroups = 0 : i32 }
↓
nvvm.cp.async.wait.group 0
Async tokens lower to i32 integer values; the create-group operation discards its token operands because cp.async.commit.group operates on the implicit in-flight group rather than on an explicit token list.
Sparse MMA Inline Assembly
nvgpu.mma.sp.sync has no first-class NVVM op in the current dialect snapshot, so the rewriter emits the PTX sparse-MMA instruction through llvm.inline_asm. This is the only operation in the bank that uses inline assembly; prefer NVVM intrinsics for everything else.
Descriptor and Barrier Rules
Mbarrier lowering is address-space-sensitive. Shared-memory barriers use the .shared NVVM variants; non-shared barrier values are rejected with a diagnostic rather than silently cast, because the cast would change the semantic memory space and downstream tcgen05 lowering depends on shared-memory residence. Token parity stays as a small integer value so wait operations can consume it directly without unpacking.
TMA lowering separates descriptor construction from descriptor use. nvgpu.tma.create.descriptor materialises a 128-byte tensor-map object on the function's stack and populates it with the static shape, stride, element-type, swizzle, rank, and interleave fields the CUDA-side encoder reads. Load, store, prefetch, and fence operations consume that descriptor pointer — they never reconstruct the descriptor from its fields, so descriptor canonicalisation can hoist construction freely.
For device-side descriptor rebind, this pass emits the inline-asm tensormap.replace.tile.* calls — global_address once, global_dim once per rank, global_stride once per non-leading rank — wrapped in the fence.proxy.tensormap::generic acquire/release pair so the generic-proxy write becomes visible to the tensormap proxy that cp.async.bulk.tensor.* reads from. The mutator templates, descriptor field layout, and fence-scope selection (.cta vs .gpu vs .sys) are documented in TMA Descriptor Mutators. The rewrite contract here is that the rewriter emits exactly that fixed sequence — any deviation (writing strides before dims, omitting the acquire fence, scoping to .cta across a cluster) leaves the descriptor partially coherent and the consumer reads stale lanes.
WGMMA descriptor packing is a pure integer operation over five inputs: the shared-memory base pointer, leading-byte offset, matrix stride, swizzle base, and swizzle mode. The 64-bit layout is fixed by the Hopper GMMA ISA — bit positions and field widths are documented in MMA Atoms sm70-120 — SM90 WGMMA. The packer is deterministic and side-effect-free, so schedulers and common-subexpression elimination can hoist redundant descriptor construction across loop iterations.
%desc = nvgpu.warpgroup.generate.descriptor %smem_base
{ leading_byte_offset = 16 : i64, matrix_stride = 64 : i64,
swizzle_base = 128 : i64, swizzle_mode = #nvgpu<swizzle 128B> }
↓
%bits = arith.constant 0x... : i64 // pre-folded bit pattern from attribute fields
%base_i = llvm.ptrtoint %smem_base : !llvm.ptr<3> to i64
%desc = llvm.or %bits, %base_i : i64
The runtime base pointer is the only operand that varies per instance; everything else folds at compile time from the GMMA-descriptor attribute, so the generated LLVM is typically two instructions (ptrtoint plus or) per descriptor.
Conversion Invariants
- The pass must leave no executable gpu.* or nvgpu.* operation behind. gpu.module may survive only as the module container consumed by GPU-to-binary serialization.
- Vector math is scalarized before libdevice calls are introduced.
- CUDA assertion lowering must preserve the original predicate and source metadata.
- Mbarrier variants must agree with the operand address space.
- TMA descriptor construction must be kept separate from TMA copy and prefetch operations.
- Sparse MMA uses inline assembly only for the missing dialect intrinsic; other operations should prefer first-class NVVM ops.
- WGMMA lowering must emit the fence, MMA, commit, and wait sequence in the order expected by the hardware pipeline.
Cross-References
Conversion / Lowering Overview places this pass at the companion-lowering stage that runs alongside CuTe lowering. CuTe and CuTe-NVGPU to LLVM — Architecture-Specialized Atoms covers the CuTe atom rewrites whose outputs this pass consumes through cute_nvgpu.atom. Shared LLVM Type Converter describes the shared LLVM type converter every pattern in this bank threads through. MMA Atoms sm70-120 — SM90 WGMMA is the canonical reference for the WGMMA descriptor bit layout the packer above emits.
Lowering: Target and Debug Info
Abstract
Two module-level adapters prepare the lowered MLIR module for NVVM serialization. AttachNVVMTarget turns Tileiras target metadata into the standard #nvvm.target attribute that the GPU-to-binary serializer reads off gpu.module. TranslateDebugInfo rewrites Tileiras debug-value operations into LLVM debug intrinsics, inserting an NVIDIA-specific llvm.nvvm.move value pin so the PTX backend can keep the debugged value visible across optimisation passes that would otherwise fold it away.
Both passes translate between internal TileIR metadata and the public LLVM/NVVM surface. A reimplementation does not need their original pass layout, but it must preserve the target-attribute fields, the libNVVM option dictionary, the debug intrinsic arguments, and the value-pin step.
Target Attribute Conversion
The target pass walks gpu.module operations. For each one it reads the compute capability from the module's attribute dictionary, normalises it to an sm_XX chip name, builds the libNVVM flag dictionary, and writes the resulting #nvvm.target attribute as a single-element array onto the module.
Attribute Sources
Three module-level attributes feed the target adapter, read in the order below.
| Attribute name | Type | Role |
|---|---|---|
| nv_tileaa.compute_capability | IntegerAttr (major*10+minor) | Primary source; emitted by ConvertCudaTileToTileAA from the --compute-capability option. |
| nv_tileaa.target_spec | StringAttr ("sm_XX" form) | Fallback when compute_capability is absent. |
| nv_tileaa.libnvvm_use_nvgpucomp | BoolAttr | Optional; selects the NVGpuComp/libNVVM serialisation path. |
When neither compute_capability nor target_spec resolves, the pass surfaces the verbatim "failed to get compute capability." diagnostic (with the trailing period) and fails the module; the closely related "invalid or missing --compute-capability option" is emitted by the option parser earlier in the pipeline when the CLI argument itself is absent.
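The resolution order can be sketched as a pure function. Two modelling assumptions apply: a non-positive integer stands for an absent compute_capability attribute, and an empty return value stands for the failure case that triggers the diagnostic. The name resolveChip is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical resolver: returns the sm_XX chip name, or "" on failure.
inline std::string resolveChip(int computeCapability,
                               const std::string &targetSpec) {
  // Primary source: the integer already encodes major*10+minor.
  if (computeCapability > 0) return "sm_" + std::to_string(computeCapability);
  // Fallback: accept a target_spec that already has the sm_XX form.
  if (targetSpec.size() > 3 && targetSpec.compare(0, 3, "sm_") == 0)
    return targetSpec;
  // Neither source resolved; the caller would fail the module here.
  return "";
}
```

For example, major 10 and minor 0 encode as 100 and resolve to sm_100, while a module carrying only target_spec = "sm_120" resolves through the fallback.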
Generated Target Fields
The #nvvm.target attribute is a small record consumed by the upstream GPU-to-binary serializer. Field semantics:
| Field | Value | Source |
|---|---|---|
| target triple | nvptx64-nvidia-cuda | fixed |
| chip | normalised sm_XX chip name | nv_tileaa.compute_capability or nv_tileaa.target_spec |
| optimization level | 0..3 | pass option, defaulting to the optimised path |
| feature string | empty | reserved for later target hooks |
| link mode | false | non-linking module target |
| flag dictionary | libNVVM options below | composed per-module |
The flag dictionary is small but consequential. Each entry communicates one decision to the libNVVM backend.
| Flag | When emitted | Purpose |
|---|---|---|
| -g | only when debug info is enabled for the module | asks the backend to preserve debug emission |
| -Xopt | always | opens the libNVVM option channel |
| -pragma-unroll-threshold=9900000 | always | discourages backend re-rolling after Tileiras scheduling |
| -fma=0 | always | prevents backend FMA contraction from changing explicit numeric choices |
| libNVVMUseNVGpuComp=true | only when the option is enabled | selects the NVGpuComp/libNVVM path downstream |
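The emission rules in the table can be sketched as a flag builder. The real pass stores these entries in an attribute dictionary; modelling them as an ordered list of strings is a simplifying assumption, and buildLibnvvmFlags is a hypothetical name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical builder for the libNVVM flag set described in the table.
inline std::vector<std::string> buildLibnvvmFlags(bool debugInfo,
                                                  bool useNVGpuComp) {
  std::vector<std::string> flags;
  if (debugInfo) flags.push_back("-g");                // conditional
  flags.push_back("-Xopt");                            // always
  flags.push_back("-pragma-unroll-threshold=9900000");  // always
  flags.push_back("-fma=0");                           // always
  if (useNVGpuComp) flags.push_back("libNVVMUseNVGpuComp=true");  // conditional
  return flags;
}
```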
Consumer Passes
Once the target attribute attaches, three downstream consumers read it:
- The GPU-to-binary serializer reads triple, chip, optimisation level, and feature string to build the libNVVM/NVPTX command line.
- The PTX assembler stage reads chip to pick the SASS target.
- The cluster_dim / reqntid validators read chip to gate cluster-launch metadata on SM90 and above.
A module with #nvvm.target missing reaches the serializer with no target chip and fails serialisation with a "no target attribute" diagnostic before any binary is emitted.
Conversion Algorithm
The pass body is small: walk gpu.modules, resolve compute capability, build flags, attach the attribute.
LogicalResult attach_nvvm_target(ModuleOp module, TargetOptions options) {
    for (GpuModuleOp gpu_module : module.gpu_modules()) {
        // First source: the compute capability read from the gpu.module.
        ComputeCapability cc = read_compute_capability(gpu_module);
        if (!cc.valid()) {
            // Second source: the target specification.
            cc = read_target_spec_compute_capability(gpu_module);
        }
        if (!cc.valid()) {
            return gpu_module.emit_error("failed to get compute capability.");
        }
        DictionaryAttr flags = build_libnvvm_flags(gpu_module, options);
        NVVMTargetAttr target = NVVMTargetAttr::get(
            module.context(),
            options.opt_level,
            "nvptx64-nvidia-cuda",
            cc.to_sm_name(),
            /*features=*/"",
            flags,
            /*link=*/false);
        // Setting the attribute replaces any existing target instead of appending.
        gpu_module.set_attr("nvvm.target", ArrayAttr::get({target}));
    }
    return success();
}
Idempotency matters: re-running the pass on a module that already carries #nvvm.target overwrites the attribute rather than appending a second target. Two targets on the same gpu.module produce undefined behaviour in the serializer.
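The overwrite-not-append rule can be illustrated with a toy model. Everything here is hypothetical scaffolding: `ToyGpuModule` stands in for a gpu.module, and `to_sm_name` assumes the chip name is simply `sm_` followed by `major * 10 + minor`, which is consistent with the accepted driver targets but not confirmed against the binary.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a gpu.module: attribute name -> list of target chips.
struct ToyGpuModule {
    std::map<std::string, std::vector<std::string>> attrs;
};

// Assumed normalisation: compute capability 10.3 -> "sm_103".
std::string to_sm_name(int major, int minor) {
    return "sm_" + std::to_string(major * 10 + minor);
}

// Assignment (not push_back) is what makes a second run replace the
// existing target instead of appending another one.
void attach_target(ToyGpuModule &m, int major, int minor) {
    m.attrs["nvvm.target"] = {to_sm_name(major, minor)};
}
```

Calling `attach_target` twice leaves exactly one entry under `nvvm.target`, which is the property the serializer depends on.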
Debug-Info Conversion
Tileiras carries source-variable metadata in an internal debuginfo.* dialect. Before LLVM translation, those operations must become LLVM-dialect debug intrinsic calls (llvm.intr.dbg.value, llvm.intr.dbg.declare, llvm.intr.dbg.addr) whose operands the NVPTX backend can serialise into DWARF.
MLIR Loc to LLVM !dbg
Every operation in Tileiras carries an MLIR Location. When debug info is enabled, the LLVM translation phase reads those locations and emits LLVM !dbg metadata that attaches to each lowered LLVM instruction. The mapping is direct:
| MLIR location | LLVM !dbg form |
|---|---|
| FileLineColLoc(file, line, col) | DILocation(line, col, scope) referencing the file's DIFile |
| FusedLoc(child_locs, metadata) | The metadata's DILocation, with child_locs becoming an inlined-at chain |
| CallSiteLoc(callee_loc, caller_loc) | DILocation for callee with inlinedAt pointing at caller's DILocation |
| NameLoc(name, child) | Passes through to child's location; name becomes a DILocalVariable only at debug-value sites |
| UnknownLoc | No !dbg emitted; the LLVM instruction is untracked |
Debug Scope Nesting for gpu.func
Each gpu.func participates in a DISubprogram scope. The translation builds the scope hierarchy bottom-up:
DICompileUnit (per module, attached to llvm.module)
└── DIFile (per source file referenced)
    └── DISubprogram (per gpu.func, attached to the llvm.func)
        └── DILexicalBlock (per scf.if / scf.for / nested region)
            └── DILocalVariable (per debuginfo.value)
Nested SCF regions get a fresh DILexicalBlock so debuggers can step into them without losing local-variable visibility from the parent. The lexical-block scope is parented to the surrounding subprogram, not to other lexical blocks — debuggers walk the inlining chain via inlinedAt rather than nested scopes.
Lineinfo vs Device-Debug
The level of debug information depends on which compile option is active.
| Option | !dbg on instructions | DILocalVariable | DISubprogram | dbg.value intrinsics |
|---|---|---|---|---|
| --lineinfo off, --device-debug off | dropped | dropped | dropped | dropped |
| --lineinfo on | emitted | dropped | minimal (name + line only) | dropped |
| --device-debug on | emitted | emitted | full (with variables) | emitted with llvm.nvvm.move pins |
--lineinfo produces enough metadata for profilers to map SASS instructions back to source lines without paying the optimisation cost of tracking local variables. --device-debug adds local-variable tracking and is the only mode that keeps dbg.value intrinsics alive through the optimisation pipeline.
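The option table reads naturally as a policy lookup. The sketch below is a minimal model with hypothetical names (`DebugPolicy`, `select_debug_policy`); it also assumes that `--device-debug` takes precedence when both options are set, which the table does not state.

```cpp
#include <cassert>

enum class Subprogram { Dropped, Minimal, Full };

struct DebugPolicy {
    bool dbg_locations;     // !dbg on instructions
    bool local_variables;   // DILocalVariable
    Subprogram subprogram;  // DISubprogram detail level
    bool dbg_value;         // dbg.value intrinsics (pinned via llvm.nvvm.move)
};

// Assumption: --device-debug wins if both options are enabled.
DebugPolicy select_debug_policy(bool lineinfo, bool device_debug) {
    if (device_debug) return {true, true, Subprogram::Full, true};
    if (lineinfo) return {true, false, Subprogram::Minimal, false};
    return {false, false, Subprogram::Dropped, false};
}
```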
debuginfo.value Rewrite Shape
The per-op rewrite turns each debuginfo.value into a debug intrinsic call. The NVIDIA-specific step is llvm.nvvm.move: an SSA pass-through value that constant-folding and dead-code elimination treat as opaque, so the debugged value stays visible to the backend even when the surrounding code is folded away.
debuginfo.value %v, #var, #expr : !debuginfo.value<f32>
↓
%pinned = llvm.nvvm.move %v : f32
llvm.intr.dbg.value %pinned, !DILocalVariable(#var), !DIExpression(#expr)
For aggregate values, the rewriter walks vector and struct fields, extracts each leaf, pins it through llvm.nvvm.move, and emits a separate debug intrinsic per leaf. Aggregate fragments are described via DIExpression(DW_OP_LLVM_fragment, offset, size) so the debugger can reconstruct the original aggregate at display time.
LogicalResult lower_debug_value(DebugValueOp op, Rewriter *rewriter) {
    Value source = materialize_debug_source(op.value(), op.fragment(), rewriter);
    Value pinned = rewriter->create("llvm.nvvm.move", source).result(0);
    DebugIntrinsic intrinsic = select_debug_intrinsic(op.kind());
    rewriter->create("llvm.intr." + intrinsic.name(), {
        pinned,
        op.local_variable_attr(),
        op.expression_attr()
    });
    rewriter->erase_op(op);
    return success();
}
If a referenced symbol or metadata node cannot be resolved yet, the rewriter emits a placeholder operand that the LLVM-translation phase diagnoses with the surrounding operation context. Failing here rather than at translation time gives a useful Tile-level location for the diagnostic.
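The per-leaf fragments used for aggregate values can be sketched as a running bit offset. `DW_OP_LLVM_fragment` takes an offset and a size in bits; the sketch assumes leaves are packed back to back with no padding, whereas a real lowering would take offsets from the data layout. `Fragment` and `leaf_fragments` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Fragment {
    uint64_t offset_bits;  // first operand of DW_OP_LLVM_fragment
    uint64_t size_bits;    // second operand of DW_OP_LLVM_fragment
};

// Assumes densely packed leaves; padding and alignment are ignored here.
std::vector<Fragment> leaf_fragments(const std::vector<uint64_t> &leaf_bits) {
    std::vector<Fragment> out;
    uint64_t offset = 0;
    for (uint64_t bits : leaf_bits) {
        out.push_back({offset, bits});
        offset += bits;  // next leaf starts where this one ends
    }
    return out;
}
```

For a struct with two 32-bit leaves and one 16-bit leaf, the third fragment starts at bit 64 and covers 16 bits.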
Type Conversion for Debug Operands
The debug pass uses its own small type converter rather than the full TileAS LLVM converter. Its job is to make debug operands legal without touching the executable ABI.
| Source debug type | LLVM debug operand form |
|---|---|
| integer scalar | same-width LLVM integer, restricted to backend-supported widths |
| half, bfloat16, tf32-like numeric extensions | LLVM numeric surrogate used by the value-lowering path |
| vector | per-lane extraction followed by scalar debug emission |
| struct or tuple | recursive field extraction and debug emission |
| unresolved aggregate member | placeholder plus diagnostic context |
The debug converter never invents executable computation. The only SSA values it introduces are llvm.nvvm.move pins and the extractvalue/extractelement operations needed to reach a debug leaf; everything else is metadata.
Error Handling
Both passes fail the module with diagnostics that name the missing semantic input rather than the internal mechanism:
- missing compute capability or target specification for #nvvm.target;
- unknown or unloaded LLVM/NVVM operation while building debug IR;
- unsupported debug value type;
- unresolved debug metadata that cannot be represented as an LLVM debug operand.
Conversion Invariants
- Every serializable gpu.module must have a resolved #nvvm.target attribute before serialisation.
- The target triple is the 64-bit CUDA NVPTX triple.
- The compute capability is normalised to the chip name consumed by NVVM.
- Debug emission is gated by the same module-level debug option used to add -g.
- llvm.nvvm.move must sit between the debugged SSA value and the LLVM debug intrinsic.
- Debug conversion must not alter executable dataflow except for the value pin used by debug metadata.
Cross-References
Conversion / Lowering Overview places the target-attachment and debug-translation passes in their position at the tail of the pipeline. TileAS to LLVM — Function Boundary Conversion emits the gpu.module and nvvm.kernel attributes those passes consume. NVPTX Subtarget and Feature Matrix — The 40 CPU Rows lists the chip names the compute-capability normaliser produces. Debugging and Introspection is the user-facing guide that frames --lineinfo and --device-debug against the other four debugging surfaces and documents when to pick each one.
Pattern Sets and Type Conversion
Abstract
Tileiras lowering rides on ordinary MLIR dialect conversion: a conversion target declares what is legal, a type converter defines ABI shape, and a rewrite pattern set rewrites illegal operations until the target accepts the module. The public contract is short — every lowering stage must build a complete pattern set, and one coherent LLVM type converter must span TileAA, TileAS, CuTe, NVGPU, and kernel function boundaries. Two passes that disagree about descriptor shape or address-space numbering produce a module that verifies but generates wrong PTX.
This page is the deep reference for that contract: the kinds of pattern objects each stage installs, the type-conversion rules every stage shares, the runtime descriptor shapes later passes assume, and the two anchor patterns — the 43-instantiation arith bank and the kernel-attribute-aware shared LLVM type converter — that the per-stage pages refer back to instead of re-documenting in place.
Pattern Categories
Each lowering stage installs patterns from one of four categories. The choice depends on whether the source op rewrites to a single destination op, a small fixed sequence, or a region-rewriting transform.
| Category | When to use |
|---|---|
| Generic one-to-one (GenericOpPattern<SourceOp>) | Source op and target op are semantically equivalent; converter handles operand types and the rewrite is a same-name, different-dialect copy. The arith bank below is the canonical example. |
| Dedicated OpConversionPattern<SourceOp> | Region surgery, witness-attribute reads (CopyAtom, ReduceAtom), compute-capability gates, descriptor packing, inline assembly, or any rewrite that emits more than one target op. |
| One-to-N pattern | Async-pipeline operations that produce multiple SSA values with distinct LLVM types and need applyPartialOneToNConversion rather than the standard partial-conversion engine. |
| Cleanup pattern | Runs after the dialect boundary has moved. Removes builtin.unrealized_conversion_cast, folds residual arith and math ops the dialect-conversion target has now made legal. |
The dialect-conversion engine treats all four categories identically — they differ only in what they do inside matchAndRewrite.
Shared LLVM Type Converter
The Tileiras LLVM type converter is an ABI object. Its job is to map every type the lowering stages produce — TileAA memref, TileAS tiled view, async/memory/pipeline tokens, CuTe layout and atom types, function signatures, and the upstream LLVM types — to a single canonical LLVM dialect representation. The dispatcher walks an ordered list of addConversion callbacks and returns the first match.
| Source concept | Converted representation |
|---|---|
| integer, index, float | LLVM scalar with target width and element semantics preserved |
| vector | LLVM vector of converted element type |
| function type | function type with converted arguments and results |
| ranked memref | LLVM memref descriptor unless a bare-pointer ABI rule applies |
| unranked memref | {rank, erased_descriptor_pointer} |
| TileAA or TileAS memref | same descriptor family as ranked memref, with Tileiras address space |
| CuTe memref | descriptor compatible with CuTe layout lowering |
| TileAA and TileAS tiled view | small struct containing base pointer and packed layout metadata |
| async, memory, producer, and consumer tokens | i32 |
| CuTe layout, shape, stride, swizzle, and atom types | LLVM structs or integers consumed by CuTe lowering |
| tuple and none | LLVM struct or empty marker as required by the operation |
| LLVM pointer or LLVM struct | identity conversion |
Function-signature conversion is the one place the converter must do more than per-type translation: the bare-pointer kernel ABI demands that ranked memref arguments lower to a single aligned pointer plus separately carried launch metadata, not the full descriptor. When convertFunctionSignature fails, the converter emits "failed to convert function signature type for: " (the trailing space is preserved verbatim) followed by the printed form of the offending type. Downstream regression suites grep for this string; the wording is fixed.
Descriptor Layouts
The ranked memref descriptor follows the standard LLVM dialect shape:
struct RankedMemRefDescriptor<T, int Rank, int AddressSpace> {
    T addrspace(AddressSpace) *allocated;
    T addrspace(AddressSpace) *aligned;
    int64_t offset;
    int64_t sizes[Rank];
    int64_t strides[Rank];
};
The tiled-view descriptor is compact because tiled load/store patterns and descriptor builders both consume it.
struct TiledViewDescriptor<T, int AddressSpace> {
    T addrspace(AddressSpace) *base;
    uint32_t swizzle_encoding;
    uint32_t tile_dim0;
    uint32_t tile_dims1_to3[3];
};
Tokens are deliberately narrow. A producer/consumer token is not a pointer to runtime storage — it is an integer phase value. The low bit carries the parity consumed by wait operations; higher bits may carry a pipeline slot index.
uint32_t make_pipeline_token(uint32_t slot, bool phase) {
    return (slot << 1) | (phase ? 1u : 0u);
}
uint32_t token_slot(uint32_t token) {
    return token >> 1;
}
bool token_phase(uint32_t token) {
    return (token & 1u) != 0;
}
Address Spaces
Tileiras keeps memory spaces distinct all the way to LLVM pointers.
| Tileiras memory space | LLVM address space | PTX meaning |
|---|---|---|
| register memory | 0 | virtual register values |
| global memory | 1 | .global |
| internal memory | 2 | compiler-internal storage |
| shared memory | 3 | .shared |
| constant memory | 4 | .const |
| local memory | 5 | .local |
| tensor memory | 6 | Blackwell tensor memory |
| generic pointer | 101 | NVVM generic pointer |
Address-space casts must be explicit. The converter rejects implicit transitions that would hide a semantic memory-space change — especially around TMA descriptors, shared-memory barriers, and tensor-memory operations.
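A minimal sketch of the explicit-cast rule, using the numbering from the table above. The enum and the `pointer_use_is_legal` helper are hypothetical; the real converter's checks are more specific to TMA descriptors, barriers, and tensor-memory operations than this single predicate suggests.

```cpp
#include <cassert>
#include <cstdint>

// Address-space numbers copied from the table above.
enum class AddrSpace : uint32_t {
    Register = 0,
    Global = 1,
    Internal = 2,
    Shared = 3,
    Constant = 4,
    Local = 5,
    Tensor = 6,
    Generic = 101,
};

// Hypothetical legality check: a pointer may be consumed in a different
// address space only when an explicit cast sits in between.
bool pointer_use_is_legal(AddrSpace value, AddrSpace expected,
                          bool has_explicit_cast) {
    return value == expected || has_explicit_cast;
}
```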
Materialization Hooks
Partial conversion sometimes needs a bridge value while only part of the IR has been lowered. The converter installs source and target materialization hooks that both produce builtin.unrealized_conversion_cast. The cast-reconciliation phase (phase 8 of ConvertTileASToLLVM body conversion) erases them once every participating operation has converted. Bridges that survive the final cleanup pass must fail the module rather than reach LLVM translation.
Pattern Discipline
Generic one-to-one patterns must not inspect target hardware — they exist to be benefit-1 fallbacks the engine prefers only when no specialist matches. Dedicated patterns own every target-feature check they introduce, so the legality of an SM100-only rewrite is visible at the point where the rewrite happens. Region-rewriting patterns convert block-argument types and terminators in the same step, or the resulting parent op fails verification with a signature that no longer matches its region. One-to-N async-pipeline patterns run only after scheduling and layout assignment have made the pipeline structure explicit; running them earlier would split tokens before the scheduler can reason about them. Cleanup patterns leave memory-ordering operations alone unless the op is outside the memory-consistency interface.
The 43-Instantiation Arith Bank
The TileAA-to-TileAS arith populator installs 43 instantiations of a single class template, one per arith op that has a same-named TileAS counterpart. Every instantiation derives from OpConversionPattern<SourceOp> and overrides matchAndRewrite to do the generic same-name remap.
template <typename SourceOp>
class GenericOpPattern : public mlir::OpConversionPattern<SourceOp> {
public:
    using mlir::OpConversionPattern<SourceOp>::OpConversionPattern;
    using OpAdaptor = typename SourceOp::Adaptor;
    LogicalResult matchAndRewrite(SourceOp op, OpAdaptor adaptor,
                                  ConversionPatternRewriter &rw) const override;
};
Each rewrite has the shape:
%r = arith.addf %a, %b : tensor<8x64xf32>
↓
%r = nv_tileas.addf %a, %b : tensor<8x64xf32>
The pattern reads operand types through the shared type converter, builds the destination op with the converted operands, copies semantic attributes (rounding mode, fast-math flags, predicate kind, overflow flags), and replaces the source. Any failure in operand or result-type conversion bubbles up as a pattern failure rather than producing a partial replacement.
Op-Mnemonic Roster
The 43 mnemonics, in registration order. The order matches the source-level patterns.add<...>(typeConverter, context) argument list and is reproducible by walking the populator's body in linker-emission order.
| # | Mnemonic | Op class | # | Mnemonic | Op class |
|---|---|---|---|---|---|
| 1 | arith.cmpf | CmpFOp | 23 | arith.minnumf | MinNumFOp |
| 2 | arith.cmpi | CmpIOp | 24 | arith.minsi | MinSIOp |
| 3 | arith.addf | AddFOp | 25 | arith.minui | MinUIOp |
| 4 | arith.addi | AddIOp | 26 | arith.mulf | MulFOp |
| 5 | arith.andi | AndIOp | 27 | arith.muli | MulIOp |
| 6 | arith.bitcast | BitcastOp | 28 | arith.negf | NegFOp |
| 7 | arith.ceildivsi | CeilDivSIOp | 29 | arith.ori | OrIOp |
| 8 | arith.ceildivui | CeilDivUIOp | 30 | arith.remf | RemFOp |
| 9 | arith.divf | DivFOp | 31 | arith.remsi | RemSIOp |
| 10 | arith.divsi | DivSIOp | 32 | arith.remui | RemUIOp |
| 11 | arith.divui | DivUIOp | 33 | arith.select | SelectOp |
| 12 | arith.extf | ExtFOp | 34 | arith.shli | ShLIOp |
| 13 | arith.extsi | ExtSIOp | 35 | arith.shrsi | ShRSIOp |
| 14 | arith.extui | ExtUIOp | 36 | arith.shrui | ShRUIOp |
| 15 | arith.floordivsi | FloorDivSIOp | 37 | arith.sitofp | SIToFPOp |
| 16 | arith.fptosi | FPToSIOp | 38 | arith.subf | SubFOp |
| 17 | arith.fptoui | FPToUIOp | 39 | arith.subi | SubIOp |
| 18 | arith.maximumf | MaximumFOp | 40 | arith.truncf | TruncFOp |
| 19 | arith.maxnumf | MaxNumFOp | 41 | arith.trunci | TruncIOp |
| 20 | arith.maxsi | MaxSIOp | 42 | arith.uitofp | UIToFPOp |
| 21 | arith.maxui | MaxUIOp | 43 | arith.xori | XOrIOp |
| 22 | arith.minimumf | MinimumFOp | | | |
The Benefit-20 Specialist for arith.constant
arith.constant is absent from the bank because it does not route through the generic same-name remap. A hand-written ConstantTensorOpConversion registers separately with PatternBenefit(20), which pre-empts any default-benefit pattern that might otherwise match a constant op. The specialist inspects the constant's Attribute payload and synthesises one of three destinations:
arith.constant dense<1.0> : tensor<8x64xf32> // SplatElementsAttr branch
↓
%r = nv_tileaa.splat 1.0 : tensor<8x64xf32>
arith.constant dense<[1.0, 2.0, ...]> : tensor<32xf32> // DenseElementsAttr branch
↓
%r = nv_tileaa.constant_tensor dense<[1.0, 2.0, ...]> : tensor<32xf32>
Scalar IntegerAttr and FloatAttr payloads keep the same arith.constant form, since arith constants are dynamically legal in the target dialect at scalar types. The decision is structural, not numeric — the specialist branches on attribute kind, not on attribute value.
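The structural branch can be modelled with a variant over attribute kinds. The types below are toy stand-ins (`SplatAttr`, `DenseAttr`, `ScalarAttr` are hypothetical and hold only doubles); the point of the sketch is that the destination depends on which alternative is held, never on the numbers inside it.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Toy attribute kinds; real MLIR attributes carry types and shapes too.
struct SplatAttr  { double value; };
struct DenseAttr  { std::vector<double> values; };
struct ScalarAttr { double value; };
using ConstantPayload = std::variant<SplatAttr, DenseAttr, ScalarAttr>;

// Branches on the kind of payload only, never on the stored values.
std::string select_constant_destination(const ConstantPayload &p) {
    if (std::holds_alternative<SplatAttr>(p)) return "nv_tileaa.splat";
    if (std::holds_alternative<DenseAttr>(p)) return "nv_tileaa.constant_tensor";
    return "arith.constant";  // scalar payloads keep their original form
}
```

Two splat payloads with different values select the same destination op, which is what "structural, not numeric" means in practice.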
Parent Driver
The TileAA-to-TileAS pattern bank composes four sub-populators plus six hand-written specialists inlined into the parent body. Sub-populator order is fixed.
| Sub-populator | Patterns |
|---|---|
| nv_tileaa structure | nv_tileaa.block_tile, nv_tileaa.make_memref, nv_tileaa.get_dim_size |
| nv_tileaa bitcast/ptr | nv_tileaa.bitcast, nv_tileaa.ptr_to_int, nv_tileaa.int_to_ptr |
| func dialect | func.func, func.call, func.return |
| arith generic bank | the 43 GenericOpPattern<arith::*> instantiations above |
| inline specialists | MakeTiledTMADescOpHostConversion, ConstantTensorOpConversion, AddPtrOpConversion, SplatOpHostConversion, AssumeOpConversion, ExtractOpHostConversion |
The parent driver builds the ConversionTarget marking llvm, cute, cute_nvgpu, builtin, and vector as fully legal; registers arith with a dynamic legality predicate that returns true once an op already has TileAS-form operands; and marks nv_tileaa and nv_tileas as legal. A failed partial conversion emits the diagnostic "expect lower MakeTiledTMADescOp".
PDL Fallback
Every Convert*ToLLVM pass runs a PDL-to-PDLInterp fallback immediately before applyPartialConversion. The fallback walks the PDL pattern modules registered with the active RewritePatternSet, compiles each module down to PDL Interpreter operations, and hands the resulting interpreter bodies to the conversion driver alongside the C++ patterns. The compiled PDL Interpreter bodies live as fixed bytecode in the binary's read-only data; the fallback's job is to materialize them as runnable patterns at pass time, so the on-disk PDL pattern is essentially an interpreter program ready to be wired into the match-and-rewrite loop.
When PDL compilation fails — typically because a registered pattern references an op or attribute the interpreter cannot resolve in the current dialect registry — the fallback emits "failed to lower PDL pattern module to the PDL Interpreter" and returns failure. The parent driver treats this as a hard pass failure rather than a recoverable miss: surviving without the PDL-side patterns would silently change which ops the conversion target deems illegal, producing a module that compiles but mishandles the pattern's intended rewrite.
Shared LLVMTypeConverter Contract
The shared LLVMTypeConverter extends the upstream MLIR TypeConverter with the five tile-extension hooks Tileiras needs and five overridden methods, including the signature conversions that enforce the kernel-ABI rules above.
| Method | Status | Role |
|---|---|---|
| convertType | overridden | Dispatches to the tile-extension callbacks first, then falls through to the upstream LLVMTypeConverter base for ordinary LLVM types. |
| convertCallSignature | overridden | Enforces the bare-pointer ABI for kernel call sites: ranked memref arguments become aligned pointers plus separately-carried launch metadata. |
| convertFunctionSignature | overridden | Lifts kernel-spec fields onto the converted function and emits "failed to convert function signature type for: " (trailing space preserved) when a type is unrecognised. |
| materializeSourceConversion | overridden | Emits builtin.unrealized_conversion_cast as the source-side bridge during partial conversion. |
| materializeTargetConversion | overridden | Emits the inverse builtin.unrealized_conversion_cast for the target side. |
| convertTileType | tile-extension | Maps TileType to a llvm.struct payload. |
| convertTokenType | tile-extension | Maps TokenType to i32 for memory/async/pipeline tokens. |
| convertPipelineIteratorType | tile-extension | Maps PipelineIteratorType to a small llvm.struct carrying stage and phase. |
| convertTensorViewType | tile-extension | Maps TensorViewType to the tiled-view descriptor (base pointer plus packed metadata). |
| convertPartitionViewType | tile-extension | Maps PartitionViewType to a partition descriptor. |
convertType walks an ordered list of addConversion callbacks; the first callback that recognises an incoming type wins. Tile-extension callbacks register before the base LLVM callbacks so the partial-conversion driver sees a single uniform converter rather than fanning out across two different machines for tile-only versus upstream-LLVM types.
Type LLVMTypeConverter::convertType(Type t) {
    if (auto tile = dyn_cast<TileType>(t)) return convertTileType(tile);
    if (auto tok = dyn_cast<TokenType>(t)) return convertTokenType(tok);
    if (auto pipeIt = dyn_cast<PipelineIteratorType>(t)) return convertPipelineIteratorType(pipeIt);
    if (auto view = dyn_cast<TensorViewType>(t)) return convertTensorViewType(view);
    if (auto part = dyn_cast<PartitionViewType>(t)) return convertPartitionViewType(part);
    /* ... callbacks for cute / cute_nvgpu / nv_tileaa / arith / scf types ... */
    return baseLLVMConvertType(t);
}
A Convert*ToLLVM pass must construct exactly one LLVMTypeConverter and thread it through every pattern, every ConversionTarget legality predicate, and every PDL pattern module the fallback later compiles. Constructing two converters in the same pass — for example, one for body patterns and one for cleanup — is the most common way to silently break the bare-pointer ABI on function boundaries, because the two converters disagree on whether a ranked memref kernel argument is a pointer or a descriptor.
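The first-match dispatch and the single-converter rule can be sketched with a toy converter. `ToyTypeConverter` is hypothetical and represents types as plain strings; it walks callbacks in registration order, as this page describes, and is meant to show why one shared instance gives one consistent answer per type.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy converter: types are plain strings instead of MLIR Type objects.
class ToyTypeConverter {
public:
    using Callback =
        std::function<std::optional<std::string>(const std::string &)>;

    void add_conversion(Callback cb) { callbacks_.push_back(std::move(cb)); }

    // Walks callbacks in registration order; the first non-empty result wins.
    std::optional<std::string> convert(const std::string &type) const {
        for (const Callback &cb : callbacks_) {
            if (auto result = cb(type)) return result;
        }
        return std::nullopt;  // no callback recognised the type
    }

private:
    std::vector<Callback> callbacks_;
};
```

Registering a token-specific callback before a catch-all callback means tokens never reach the catch-all, which mirrors the tile-extension-first ordering.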
Cross-References
Conversion / Lowering Overview shows where each pass that uses this infrastructure sits in the cascade. cuda_tile to TileAA, TileAA to TileAS, TileAS to LLVM, CuTe and CuTe-NVGPU to LLVM, and nvgpu / gpu to NVVM are the five passes that install patterns through the bank and thread the shared type converter described here.
Codegen Overview
Abstract
The backend half of tileiras starts where the MLIR pipeline ends: an
NVVM-ready gpu.module with a resolved #nvvm.target. The program is no
longer TileIR. It is an LLVM/NVVM module that must be linked against device
libraries, optimized, lowered through NVPTX target rules, selected into
machine instructions, and printed as PTX text for ptxas. This page states
the contracts and invariants each stage must preserve. Child pages document
the dispatchers, opcode tables, and modifier vocabularies that implement
those contracts.
The useful model is:
MLIR llvm/nvvm dialect
-> llvm::Module
-> linked device-library module
-> optimized LLVM module
-> SelectionDAG and machine functions
-> MCInst stream
-> PTX assembly
Child pages document the detailed reverse-engineered subsystems. This overview lays out the backend contracts that matter for users and reimplementers.
Backend Contract
| Stage | Responsibility | Public invariant |
|---|---|---|
| LLVM module handoff | Translate MLIR LLVM dialect to an llvm::Module and attach target triple, chip, features, and data layout. | The module is already ABI-ready; no high-level TileIR operations remain. |
| Device library linkage | Link embedded or external device bitcode used by math and NVVM helper calls. | Undefined device helper calls must be resolved before final codegen. |
| LLVM optimization | Run the LLVM optimization pipeline selected by the requested optimization level. | Optimizations preserve NVVM address spaces, kernel attributes, and libdevice semantics. |
| NVPTX target lowering | Lower calls, formal arguments, returns, intrinsics, address spaces, and custom target nodes. | Param-space values and kernel arguments are handled through NVPTX ABI rules, not generic pointer rules. |
| Instruction selection | Select custom NVPTX nodes first, then fall back to generated SelectionDAG matcher tables. | Feature-gated intrinsics are rejected or expanded before an illegal PTX instruction can be emitted. |
| Machine-function passes | Run target passes for argument lowering, image handles, scheduling, register allocation, and MIR cleanup. | Machine IR still carries enough target information for correct PTX emission. |
| PTX emission | Print PTX mnemonics, operands, modifiers, sections, directives, and target attributes. | Emitted PTX matches the resolved target feature set and is suitable for ptxas. |
Target Initialization
The backend registers both 32-bit and 64-bit NVPTX targets, constructs
subtarget information from the target triple, CPU string, and feature string,
then builds or reuses a target machine for the compilation. The normal CUDA
device path is 64-bit and uses the nvptx64-nvidia-cuda triple.
Target initialization provides:
- target registry entries for nvptx and nvptx64;
- MC layer objects for registers, instruction descriptions, subtarget features, and asm output;
- an NVPTX target machine keyed by triple, chip, and feature set;
- a feature bitset used by target lowering and instruction selection.
The target feature set is the guardrail for newer instructions. Tensor memory, TMA, WGMMA, tcgen05, block-scaled MMA, cluster operations, and related PTX modifiers reach selection only when the subtarget says they are legal.
MLIR-To-LLVM Handoff
gpu.module operations carrying the NVVM target attribute leave MLIR through a translator that maps each nvvm.* op to the matching llvm.nvvm.* intrinsic, then walks llvm dialect operations into the corresponding LLVM IR opcodes. The translator is a one-to-one mapping table: nvvm.barrier0 becomes @llvm.nvvm.barrier0, nvvm.mma.sync becomes @llvm.nvvm.mma.*, nvvm.wgmma.mma_async becomes @llvm.nvvm.wgmma.*, and so on. There is no novel rewriting in this step. What matters is that NVPTX-specific information already encoded in the MLIR dialect — kernel attributes, address spaces, target metadata — must survive the translation unchanged.
The output is an llvm::Module with the nvptx64-nvidia-cuda triple set, the target chip and feature string attached to every kernel function, and the NVPTX data layout active. From here on the module is an ordinary LLVM IR module and the backend reads it the same way clang does.
LLVM Optimization
After translation and device-library linkage, the module goes through the LLVM optimization pipeline selected by O0, O1, O2, O3, Os, or Oz. The pipeline is the standard PassBuilder shape — function simplification, CGSCC inlining, loop optimization, vectorization — followed by NVIDIA-private peephole and lowering passes the binary's PassRegistry table lists by name. The NVIDIA-private set covers NVPTX-specific patterns LLVM upstream does not optimize: lowering of llvm.nvvm.barrier* intrinsics, address-space inference and propagation, kernel-attribute preservation, libdevice math-helper specialization, and a final NVVM-aware GVN/DCE sweep.
NVVM-specific properties must survive ordinary LLVM optimization. Kernel functions retain nvvm.kernel metadata, NVVM intrinsics never get rewritten into target-illegal forms, NVPTX address spaces stay distinct, and libdevice calls keep the ABI the NVPTX backend expects. Any optimization pass that strips this metadata makes downstream selection fall back to a generic path that does not understand NVPTX param, shared, or tmem semantics.
NVPTX ABI Lowering
NVPTX has a stricter ABI than ordinary LLVM IR suggests. Kernel parameters live in address-space 101 (param), device-function parameters use the by-value or by-pointer convention NVPTX defines, return values flow through the param space too, and byval aggregates need explicit unpacking into scalar or vector register-passing lowerings. Grid constants live in their own constant address space. None of this is the generic pointer lowering LLVM's IR-level legalizer would produce.
The NVPTX target lowering hook runs before SelectionDAG building and rewrites each formal argument, call, return, and address-space cast into the form the selector and the AsmPrinter both expect. Param-space values become NVPTXISD::LoadParam / StoreParam chains; kernel arguments become explicit param-space loads keyed by formal-arg index; by-value aggregates become a sequence of scalar param loads spelled out per field. Once this pass completes, no inttoptr or addrspacecast between mismatched NVPTX address spaces remains in the function. See Lowering Formal Arguments and Lowering Calls for the formal-arg shape lattice and the call-prototype layout.
Instruction Selection
Selection runs in three layers. The intrinsic-with-chain selector handles NVVM intrinsics that carry memory or control-flow chains and routes most cases to per-family emitters or to a secondary intrinsic-ID dispatcher. The vector load/store selector handles the NVPTX-private vector memory opcodes (the v2/v4/v8 forms over global/shared/param/tmem) plus tensor-memory routing for Blackwell. Both fast selectors fall through to the generated MatcherTable on unrecognized cases, and the MatcherTable runs a saturating-int64 cost scorer over candidate TableGen patterns. The scorer reads a per-opcode predicate-matrix row to decide whether the pattern is legal on the active subtarget before any cost accumulates.
Feature-gated intrinsics — TMA, tensor-memory, WGMMA, tcgen05, mma.block_scale, cluster operations, special registers, async barriers — pass through validators that consult the subtarget feature bitmap and emit a diagnostic on failure rather than letting an illegal PTX instruction reach the printer. See ISelDAG and MatcherTable — Selector Layers for the dispatcher shape, MatcherTable and Cost Scoring for the 119-case scorer, and the operand-class vocabulary the predicate helpers consume.
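Saturating accumulation, as used by the cost scorer, clamps at the int64 limits instead of wrapping. The helper below is a generic sketch of that arithmetic; the name `saturating_add` is illustrative and the scorer's actual accumulation order is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Adds two signed 64-bit costs, clamping to the representable range
// instead of overflowing (signed overflow is undefined behaviour in C++).
int64_t saturating_add(int64_t a, int64_t b) {
    constexpr int64_t kMax = std::numeric_limits<int64_t>::max();
    constexpr int64_t kMin = std::numeric_limits<int64_t>::min();
    if (b > 0 && a > kMax - b) return kMax;  // would overflow upward
    if (b < 0 && a < kMin - b) return kMin;  // would overflow downward
    return a + b;
}
```

A pattern whose cost has already saturated stays at the maximum no matter how many further positive terms are added, so it can never wrap around and look cheap.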
PTX Emission
The AsmPrinter is a single LTO-folded function with a 6,388-case dispatcher over MC opcodes. Each case selects one of 297 shared print-shape bodies; each body interleaves literal text, operand slots, and modifier-helper calls in the order ptxas requires. Mnemonic lookup goes through a parallel pair of .rodata offset tables keyed by MC opcode, returning a byte offset into an obfuscated mnemonic pool that is decrypted in place on first use via an xor (3 * i) mod 256 walking cipher. Physical-register names use the same scheme on a smaller 586-byte pool.
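The walking cipher described above can be sketched in a few lines. The sketch assumes that `i` is the zero-based byte index into the pool and that the key byte is exactly `(3 * i) mod 256`; whether the index restarts per string or runs across the whole pool is not settled here, and `xor3_walk_in_place` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// XORs byte i with (3 * i) mod 256. Because XOR is its own inverse,
// the same routine both obfuscates and recovers the pool.
void xor3_walk_in_place(std::vector<uint8_t> &pool) {
    for (std::size_t i = 0; i < pool.size(); ++i) {
        pool[i] ^= static_cast<uint8_t>((3 * i) % 256);
    }
}
```

Applying the routine twice restores the original bytes, so a decoder only needs to guarantee that it runs once per pool.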
Module-level emission produces the .version / .target / .address_size header, kernel directives (.entry, .reqntid, .maxntid, .minnctapersm, .maxnreg, cluster directives), global and managed-variable declarations, then per-function bodies. Each function emits its frame setup, the virtual-register declarations grouped by class, and the basic-block sequence of MC instructions. The printer performs no subtarget legality checks: by the time an opcode reaches this layer, the selector and the machine verifier have already proved it is legal for the chosen target.
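The three-line module header has a fixed shape in PTX. The helper below is a sketch of that shape only: `emit_ptx_header` is a hypothetical name, and the PTX version string is left as a parameter because this page does not state which version tileiras writes.

```cpp
#include <cassert>
#include <string>

// Builds the .version / .target / .address_size header lines.
// The version string is supplied by the caller; no value is assumed here.
std::string emit_ptx_header(const std::string &ptx_version,
                            const std::string &chip, int address_size) {
    assert(address_size == 32 || address_size == 64);
    return ".version " + ptx_version + "\n" +
           ".target " + chip + "\n" +
           ".address_size " + std::to_string(address_size) + "\n";
}
```

For the default 64-bit CUDA path the chip argument would be the resolved target such as sm_100 and the address size would be 64.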
See AsmPrinter — MC Switch Shape Population Table for the dispatcher partition and AsmWriter String Pools and the XOR-3 Walking Cipher for the mnemonic-pool layout, and Per-SM Emission Templates for the actual PTX template strings emitted per SM tier.
End-To-End Algorithm
The whole codegen path can be read as a sequence of structurally distinct stages, each with a published contract from the table above. From gpu.module to PTX text:
- Translate the MLIR module to LLVM IR, mapping nvvm.* ops to llvm.nvvm.* intrinsics and preserving NVPTX address spaces and kernel attributes.
- Link device libraries so libdevice math helpers and NVVM intrinsic implementations are resolved.
- Resolve the NVPTX target (triple, chip, feature set) and reuse or construct the target machine keyed by that tuple.
- Run the requested LLVM optimization pipeline (PassBuilder shape plus NVIDIA-private peepholes).
- Per function: run NVPTX target lowering for arguments, calls, returns, address-space casts, and intrinsic legalization.
- Per function: build the SelectionDAG, run the three-layer selector, build the MachineFunction, and run NVPTX-specific machine passes for argument lowering, scheduling, register allocation, and MIR cleanup.
- Run the AsmPrinter to produce the final PTX text.
A reimplementation that keeps these seven stages and their published contracts can vary internal data structures freely without breaking any consumer downstream of the printer.
Codegen Invariants
- The module has exactly one resolved NVPTX target before backend emission.
- Kernel functions retain nvvm.kernel and launch metadata through LLVM optimization.
- Address spaces remain semantic: global, shared, constant, local, parameter, and tensor memory are not interchangeable.
- Param-space values are lowered through NVPTX ABI code, not generic pointer lowering.
- Custom intrinsic selection validates subtarget support before emission.
- Generated matcher-table selection remains the default path for ordinary DAG nodes.
- Vector memory selection preserves lane grouping and address-space classification.
- TMA, WGMMA, tcgen05, tensor memory, cluster, and block-scaled MMA operations are subtarget-gated.
- PTX emission prints the instruction selected for the target, not a generic approximation.
Cross-Links
- NVPTX Bring-up and Target Init covers target registration and target-machine construction.
- NVPTX Subtarget and Feature Matrix covers per-SM feature bitsets and subtarget construction.
- NVPTX Target Lowering — Calls and Arguments covers parameter, call, and custom-node lowering.
- ISelDAG and MatcherTable covers instruction selection.
- AsmPrinter Monster and Windows covers the MC dispatch shape and AsmWriter string pools.
- Per-SM Emission Templates covers SM-specific opcode families.
- Atomic, Warp, Sreg, and Fence Emission covers atomic, warp-synchronous, special-register, and fence opcodes.
- TMA + Tensormap + cp.async.bulk Emission covers TMA and tensor-map emission.
- tcgen05 / WGMMA / mbarrier / Cluster Emission covers tensor memory, WGMMA, barriers, and cluster features.
- ldmatrix / stmatrix and Register-Class Vtables covers matrix-load/store opcodes and the register-class vtable banks.
- libdevice Overview covers device-library linkage and libdevice behavior.
NVPTX Bring-up and Target Init
Abstract
NVPTX bring-up is the handoff point between the Tileiras dialect-lowering
pipeline and the stock-shaped LLVM TargetMachine configured for PTX
emission. By the time this layer runs, the MLIR pipeline has already
produced LLVM/NVVM IR.
The layer owns target registration, MC-layer object construction, the
NVPTXAsmPrinter section model, the embedded-device-library linker, target
machine caching, and the LLVM optimization pipeline driver. The
reimplementation contract is a sequence, not a static constructor layout:
register both NVPTX triples, build consistent MC services, resolve the
target machine from the requested chip/features, link device bitcode, run
the LLVM pipeline, then emit PTX through the NVPTX asm printer.
Two choices distinguish Tileiras from a plain LLVM build. First,
nvptx and nvptx64 share one constructor table; the triple controls
pointer size and ABI details downstream. Second, libdevice never travels
LLVM's ordinary filesystem search path. It arrives as an MLIR BlobAttr
on the gpu.module and is parsed into an LLVM module before
optimization.
Target Registration Chain
Bring-up follows the same shape as upstream LLVM NVPTX with one structural
twist: the factory chain that constructs the NVPTXTargetMachine is folded
under NVIDIA's private peephole-pass selection. LLVMInitializeNVPTXTargetInfo
registers the target names through TargetRegistry::RegisterTarget. That call
runs from a __attribute__((constructor))-style global initializer, so by the
time main enters the compiler the two target records (nvptx, nvptx64) are
already in the registry. LLVMInitializeNVPTXTarget fills the constructor slots
for the target services used later by MC emission and target-machine creation.
The factory function the registry stores under each target record is not the
upstream createNVPTXTargetMachine. It is an NVIDIA-private variant that,
after building the base target machine, walks the global peephole-pass table
and installs the subset legal on the requested target chip. The selection is
data-driven: each entry in the peephole table carries a chip-feature predicate
that the factory evaluates against the parsed feature string. Peepholes whose
predicates fail are skipped; the survivors become part of the per-target-machine
pass pipeline returned to the caller. Caching the target machine therefore also
caches the peephole-pass selection — rebuilding with a different chip/feature
combination forces both the target machine and the peephole list to be
reconstructed.
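The data-driven selection described above can be sketched as a filter over a table. This is a minimal illustration, assuming a hypothetical `PeepholeEntry` record whose predicate is a plain "all required bits set" mask; the real predicate form and the pass names are not documented here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical table entry: a pass name plus the feature bits it requires.
struct PeepholeEntry {
    std::string name;
    uint64_t required_features;  // predicate: all of these bits must be set
};

// Keep only the peepholes whose predicate holds for the parsed feature mask.
std::vector<std::string> select_peepholes(const std::vector<PeepholeEntry> &table,
                                          uint64_t chip_features) {
    std::vector<std::string> selected;
    for (const PeepholeEntry &entry : table) {
        bool legal =
            (entry.required_features & chip_features) == entry.required_features;
        if (legal)
            selected.push_back(entry.name);  // table order is preserved
    }
    return selected;
}
```

Because the result depends only on the table and the feature mask, storing it alongside the cached target machine is safe as long as the cache key covers the feature string.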
| Service | Role |
|---|---|
| LLVMInitializeNVPTXTargetInfo | Registers nvptx and nvptx64 target records. |
| LLVMInitializeNVPTXTarget | Installs all target constructor callbacks. |
| NVPTXMCAsmInfo | Defines PTX comments, directives, pointer size, and asm syntax. |
| MCInstrInfo | Supplies instruction descriptors for the NVPTX opcode set. |
| NVPTXRegisterInfo | Supplies physical registers and register-class descriptors. |
| MCSubtargetInfo | Supplies CPU and feature tables used by legality checks. |
| MCInstrAnalysis | Supplies branch and instruction-analysis helpers. |
| MCAsmBackend | Supplies MC assembly backend services. |
| MCCodeEmitter | Supplies MC instruction encoding hooks where LLVM expects them. |
| NVPTXAsmPrinter | Emits module headers, directives, sections, and PTX instruction text. The constructor slot points at the LTO-folded printer described below, not a generic LLVM AsmPrinter. |
Both 32-bit and 64-bit targets receive the same service table. The triple decides
whether the compilation is nvptx or nvptx64, and the MC asm-info constructor
turns that into the pointer-size and stack-slot-size choices needed by the ABI.
User Target vs gpulibs Subtarget Triple
The 64-bit NVPTX target record handles two distinct triples that travel through
the same TargetMachine factory but exit with different feature gates: the
user-facing nvptx64-nvidia-cuda triple compiled by the host LLVM-21 backend at
run time, and the embedded-only nvptx64-nvidia-gpulibs subtarget triple
carried as producer metadata on prebuilt bitcode resources baked into the
binary at link time. The host backend never emits gpulibs IR; it only
consumes it through the bitcode reader during the blobLinkedLib link step.
What makes this surprising is that the same compiler binary ships IR produced by two different clang generations, both of which predate the host LLVM-21 link target by several major versions:
| Producer string | Subtarget triple | Carried symbol family |
|---|---|---|
| clang version 16.0.0 (NVIDIA internal) | nvptx64-nvidia-gpulibs | __nv_fp128 softfloat path: fp128 arithmetic and transcendentals |
| clang version 7.1.0 git-630d6c22278 | nvptx64-nvidia-gpulibs | __nv_*128 integer family: 128-bit integer divide, modulo, conversion |
The dual-clang split exists because the integer-128 helper library was
finalized against clang 7.1.0 long before the fp128 softfloat work began, and
NVIDIA never recompiled the older IR against newer clang releases. Recompiling
the legacy IR would force re-verification of the entire __nv_*128 integer
helper set against every supported SM, and the helpers are pure bitwise
arithmetic that LLVM 21's optimizer consumes identically to LLVM 7's output.
The fp128 work, by contrast, was a fresh integration that needed clang-16
features (newer __attribute__((target)) handling, fp128 ABI fixes) and was
checked in at the version that built cleanly. Both blobs were frozen at their
respective producer generations and embedded side by side rather than
maintained on a moving baseline.
What the gpulibs IR ships, structurally:
- Berkeley SoftFloat: f128M_add, f128M_mul, f128M_div, f128M_sqrt, and the softfloat_* rounding and rawFloat helpers. Provides the arithmetic backbone of the fp128 softfloat path. The library is statically linked into the gpulibs bitcode rather than shipped as a separate .bc resource; on-disk it is invisible.
- Sleef: Sleef_* transcendental functions, Sleef_rempitabqp (the Payne–Hanek argument-reduction table for quad precision), and the qp_cuda_sleefq CUDA bridge. Provides sinq, cosq, tanq, expq, logq, and the rest of the fp128 transcendental surface.
- NVIDIA __nv_*128 helpers: __nv_udivti3, __nv_umodti3, __nv_divti3, __nv_modti3, and the wider 128-bit integer conversion set. These come from the clang-7.1 blob, not the clang-16 one.
Integration into the host pipeline goes through the same blobLinkedLib
attribute described below: the gpulibs bitcode is parsed by the LLVM-21
bitcode reader, linked with LinkOnlyNeeded so only the helpers the kernel
actually references survive, then dropped into the optimization pipeline as
ordinary internal functions. The optimizer sees no producer-version
distinction — the IR is read as plain LLVM 21 IR once the bitcode reader has
upgraded any forward-compatible constructs.
⚡ QUIRK: two compiler generations, one binary
A stripped tileiras binary carries producer strings for clang version 16.0.0 and clang version 7.1.0 git-630d6c22278 simultaneously, alongside the primary host link target identifying as LLVM21.0.0git. The producer strings are the fingerprint to grep for when locating the embedded bitcode resources in a stripped binary; they survive both LTO and strip because they live inside the bitcode payload, not in the host symbol table.
⚡ QUIRK: nvptx64-nvidia-gpulibs is a producer-only triple
The host backend never builds or registers a TargetMachine for the gpulibs triple. The triple appears only in the module metadata of embedded bitcode resources and tells the bitcode reader to apply gpulibs-specific attribute defaults during deserialization. A reimplementation that registers gpulibs as a callable target will be calling code paths the original binary never exercises at run time.
⚡ QUIRK: SoftFloat and Sleef are not separate .bc files
Both third-party libraries are statically linked into the gpulibs blob before the producer-string serialization happens. The blob exposes f128M_*, softfloat_*, Sleef_*, and __nv_fp128_* as if they were a single translation unit, which is why the producer string is clang version 16.0.0 for the entire fp128 surface even though the upstream SoftFloat and Sleef sources were never built with clang-16 in isolation.
NVPTXMCAsmInfo Constructor
NVPTXMCAsmInfo starts from ordinary LLVM MC defaults and then replaces the
host-assembly pieces that make no sense for PTX. PTX has no ELF-style
.text, .bss, .data, .globl, or .weak directives, so those fields become
comments or PTX-specific byte directives. Inline assembly gets wrapped in
comments so ptxas receives the inline body without host-assembler markers.
| Field | NVPTX value |
|---|---|
| PointerSize | 4 for nvptx, 8 for nvptx64 |
| CalleeSaveStackSlotSize | matches pointer size |
| CommentString | // |
| PrivateGlobalPrefix | $L__ |
| CommentColumn | 4 |
| InlineAsmStart / InlineAsmEnd | commented begin/end markers |
| AsciiDirective | .b8 |
| Data8bitsDirective | .b8 |
| Data32bitsDirective | .b32 |
| Data64bitsDirective | .b64 |
| GlobalDirective | commented .globl surrogate |
| WeakRefDirective | commented .weak surrogate |
| UseIntegratedAssembler | disabled |
| SupportsDebugInformation | enabled |
PTX assembly must never depend on host object-file section semantics. The asm-info layer turns LLVM's generic MC vocabulary into PTX comments and PTX byte directives before the printer writes a module.
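The table can be read as a constructor that takes the architecture name and fills a plain record. The following is a minimal sketch, assuming a hypothetical `NvptxAsmInfoSketch` struct that mirrors a subset of the fields above; it is not the layout of LLVM's MCAsmInfo.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of a subset of the asm-info fields listed in the table.
struct NvptxAsmInfoSketch {
    unsigned pointer_size;
    unsigned callee_save_stack_slot_size;
    unsigned comment_column;
    std::string comment_string;
    std::string private_global_prefix;
    std::string data8, data32, data64;
    bool use_integrated_assembler;
    bool supports_debug_information;
};

// Build the record for "nvptx" (32-bit) or "nvptx64" (64-bit).
NvptxAsmInfoSketch make_asm_info(const std::string &arch) {
    assert(arch == "nvptx" || arch == "nvptx64");
    NvptxAsmInfoSketch info;
    info.pointer_size = (arch == "nvptx64") ? 8 : 4;
    info.callee_save_stack_slot_size = info.pointer_size;  // tracks pointer size
    info.comment_column = 4;
    info.comment_string = "//";
    info.private_global_prefix = "$L__";
    info.data8 = ".b8";
    info.data32 = ".b32";
    info.data64 = ".b64";
    info.use_integrated_assembler = false;  // ptxas assembles the text later
    info.supports_debug_information = true;
    return info;
}
```

Only the pointer size and the stack-slot size vary with the architecture, which matches the statement that both targets share one service table.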
Section Changes
NVPTXAsmPrinter::changeSection implements the brace-bound function-body model
used by PTX. Instead of switching among ELF sections, the printer emits a
commented section header and opens or closes a brace-delimited body.
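void change_nvptx_section(AsmPrinter *printer, MCSection *next, raw_ostream *os) {
    // Re-entering the section that is already open closes its brace-delimited body.
    if (printer->current_section == next) {
        os_write(os, "\t}\n");
        printer->current_section = NULL;
        return;
    }
    // Otherwise emit the commented section header and open a new body.
    print_commented_section_header(next, os);
    os_write(os, "\t{\n");
    printer->current_section = next;
}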
Emitted PTX kernels therefore appear inside { and } rather than between
.text and .size markers. The section line is documentation for readers and
debug tooling; ptxas treats it as a comment.
The LTO-Folded AsmPrinter Class
The class the target registry stores under NVPTXAsmPrinter is a single, very
large function produced by NVIDIA's whole-program LTO build. It is not the
upstream NVPTXAsmPrinter class hierarchy — at link time the LTO pipeline
collapses the upstream class, its TableGen-generated AsmWriter subclass, the
operand-print helpers, the modifier-print helpers, and most of the per-opcode
print-shape helpers into a single dispatch function with a giant switch ladder
over MC opcodes. The dispatcher inlines the operand-print and modifier-print
work directly into each case rather than calling through small helpers.
What survives as separate, non-inlined methods is the part of the printer the
target machine and the pass manager call from outside: the constructor, the
runOnMachineFunction entry point, the section-change hook described above,
the module-level header emitter, and the global-variable emitter. These methods
are the ones whose addresses the target registry stores and whose vtable slots
the rest of the backend dispatches through.
The MLIR side of the same class handles selected nvvm.* ops — the dialect's
custom printers for ops that do not lower to a single PTX instruction, and for
which the TableGen assemblyFormat is not enough. The MC-instruction side
handles selected MachineInst opcodes — the per-MC-opcode dispatcher. Both
sides share the operand-print and modifier-print code, which is why they end up
folded into the same LTO function: the inliner sees the shared callees and
collapses both call graphs around them.
A reimplementation does not need to reproduce the LTO fold. The contract is
that one printer object serves both MLIR-side nvvm.* printing and MC-side PTX
emission, and that the two sides share modifier-print and operand-print
infrastructure. How the implementation factors that contract is a build-time
decision; the LTO fold is the choice NVIDIA's release build makes.
Mnemonic Pool Decode
PTX mnemonics live in a .data-resident pool the AsmPrinter reads from. The
pool is not stored in cleartext. The bytes in .data are obfuscated under a
walking-XOR cipher; the printer decodes the pool in place on first use, then
all subsequent mnemonic lookups read decoded bytes directly.
The cipher is a stride-3 walking XOR. Byte at offset i in the pool is
decoded by XORing with key byte (3 * i) mod 256. The decode is in-place and
single-pass: the printer walks the pool from offset zero to the end exactly
once, XORing each byte against its computed key byte, and then sets a flag
that future readers consult before doing any work. The decode runs under a
pthread_once guard so concurrent compilations cannot trigger overlapping
in-place writes; the once-init body holds a process-global lock, walks the
pool, and releases the lock with the decoded state visible to all threads.
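#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// POOL_SIZE and the two offset tables are defined elsewhere in the printer.
static pthread_once_t mnemonic_pool_once = PTHREAD_ONCE_INIT;
static char mnemonic_pool[POOL_SIZE];  // .data-resident, obfuscated at link time

// Runs exactly once: XOR byte i with key byte (3 * i) mod 256, in place.
static void decode_mnemonic_pool(void) {
    for (size_t i = 0; i < POOL_SIZE; ++i) {
        mnemonic_pool[i] ^= (char)((3u * (unsigned)i) & 0xff);
    }
}

// Returns a pointer into the decoded pool for the given MC opcode.
const char *mnemonic_for_opcode(unsigned mc_opcode) {
    pthread_once(&mnemonic_pool_once, decode_mnemonic_pool);
    uint32_t lo_offset = mnemonic_offset_table_lo[mc_opcode];  // low 16 bits
    uint32_t hi_offset = mnemonic_offset_table_hi[mc_opcode];  // high 16 bits
    return &mnemonic_pool[lo_offset | (hi_offset << 16)];
}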
Two offset tables index into the decoded pool — a low-16-bit table and a
high-16-bit table, both keyed by MC opcode. Joining them produces a 32-bit
byte offset into the pool, which lets the printer address mnemonics up to a
4 GiB pool size even though the pool itself fits in a few tens of kilobytes.
The two-table split survives the LTO fold; the printer's MC-opcode dispatcher
emits the same (lo | (hi << 16)) reconstruction inline in every case.
⚡ QUIRK: 32-bit offset reconstruction for a sub-megabyte mnemonic pool
The printer addresses each mnemonic with a 32-bit offset reconstructed from two 16-bit tables, even though the entire decoded pool fits in tens of kilobytes, so the high 16 bits are always zero in practice. The split is a TableGen artifact: the offset type is sized for the worst-case LLVM target's combined mnemonic pool. NVPTX inherits the layout because LTO refuses to fold the second table away (the dispatcher reads it inline in every case), so the (lo | (hi << 16)) idiom is the fingerprint to grep for when locating the printer in a stripped binary.
A smaller separate pool with the same XOR-3 decode scheme holds physical register names. Both pools are decoded by the same once-init body, so the mnemonic table and the register-name table become available simultaneously on the first call to the printer.
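Because XOR is its own inverse, the same walk that decodes a pool also produces the obfuscated form from cleartext, which is what a compatible build step would need. The sketch below is a hedged illustration with made-up helper names; it shows the round trip and the two-table offset split, not the binary's actual build tooling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// XOR byte i with (3 * i) mod 256. The transform is its own inverse, so one
// routine both obfuscates a cleartext pool and decodes an obfuscated one.
void walking_xor3(std::vector<uint8_t> &pool) {
    for (std::size_t i = 0; i < pool.size(); ++i)
        pool[i] ^= static_cast<uint8_t>((3u * i) & 0xffu);
}

// Split a pool offset into the low and high 16-bit table entries.
void split_offset(uint32_t offset, uint16_t &lo, uint16_t &hi) {
    lo = static_cast<uint16_t>(offset & 0xffffu);
    hi = static_cast<uint16_t>(offset >> 16);
}

// Rebuild the 32-bit offset the way the dispatcher does: lo | (hi << 16).
uint32_t join_offset(uint16_t lo, uint16_t hi) {
    return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}
```

The key byte at offset 0 is zero, so the first byte of either pool is stored unchanged; that is a handy sanity check when hunting for the pool in a hex dump.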
NVVM Intrinsic Mapping Table
The translation from nvvm.* MLIR ops to LLVM IR happens through a
one-to-one mapping table the target-init layer installs into the MLIR-to-LLVM
translator. Each nvvm.* op carries one of two lowering keys:
- llvm_intrinsic_id: the op lowers to a single call of an llvm.nvvm.* intrinsic. The translator looks up the intrinsic by ID and emits an LLVM IntrinsicInst with the op's operands as arguments.
- inline_asm_template: the op lowers to an llvm.inline_asm call whose template is a PTX fragment with ${0}, ${1}, etc. placeholders for the operands. The translator substitutes the operand SSA values, emits the InlineAsm IR, and lets the later NVPTX backend pass copy the inline-asm body verbatim into the PTX output.
The choice between the two paths is per-op, baked into the table. Ops that
correspond to a single PTX instruction with a stable encoding (most mma.*,
tma.*, and mbarrier.* ops) go through the intrinsic path. Ops that
correspond to compound PTX sequences or that depend on assembly-level
modifiers the LLVM intrinsic surface does not expose go through the
inline-asm path. The inline-asm template typically embeds the modifier
directly into the template string rather than passing it as an operand,
because the LLVM asm constraint vocabulary cannot express PTX modifier
combinatorics in general.
typedef enum { LOWER_INTRINSIC, LOWER_INLINE_ASM } NvvmLoweringKind;

typedef struct NvvmOpLowering {
    StringRef op_name;                  // e.g. "nvvm.barrier0"
    NvvmLoweringKind kind;
    union {
        uint32_t intrinsic_id;          // valid when kind == LOWER_INTRINSIC
        struct {
            StringRef template_str;     // PTX template with ${i} placeholders
            StringRef constraints;      // LLVM inline-asm constraint string
            bool has_side_effects;
        } asm_data;                     // valid when kind == LOWER_INLINE_ASM
    };
} NvvmOpLowering;
The table is populated by the dialect's addPattern calls during target-init.
On every nvvm.* op encountered by the translator, the dispatcher consults
the table, builds either the intrinsic call or the inline-asm call, and
attaches the same memory-effect and side-effect attributes the op carried in
MLIR. Attaching the effects matters: if the op was marked memory-writing in
the dialect, the resulting LLVM call must be marked the same way, or the
LLVM optimizer will see it as a pure call and start hoisting or eliminating
it on the device IR.
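The placeholder convention on the inline-asm path can be illustrated with a small expander. This is only a sketch of how `${i}` slots resolve to operand text: the function name and the operand strings are assumptions, and the real backend performs the substitution on registers during emission rather than on strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replace each ${N} in `tmpl` with operands[N]; copy everything else through.
// Anything that is not a well-formed numeric placeholder is left untouched.
std::string expand_asm_template(const std::string &tmpl,
                                const std::vector<std::string> &operands) {
    std::string out;
    std::size_t i = 0;
    while (i < tmpl.size()) {
        std::size_t close = std::string::npos;
        if (tmpl.compare(i, 2, "${") == 0)
            close = tmpl.find('}', i + 2);
        std::string digits;
        if (close != std::string::npos)
            digits = tmpl.substr(i + 2, close - (i + 2));
        bool is_placeholder =
            !digits.empty() &&
            digits.find_first_not_of("0123456789") == std::string::npos;
        if (!is_placeholder) {
            out += tmpl[i++];  // ordinary character
            continue;
        }
        std::size_t index = std::stoul(digits);
        assert(index < operands.size());  // template must not outrun operands
        out += operands[index];
        i = close + 1;
    }
    return out;
}
```

Modifiers embedded directly in the template string, as described above, pass through this expansion unchanged because they are ordinary characters.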
blobLinkedLib Attribute on gpu.module
The blobLinkedLib attribute on a gpu.module carries the precompiled
bitcode payload that gets linked into the LLVM module during NVPTX bring-up.
The attribute is the only point at which libdevice (or another bitcode
helper library) enters the compiler: there is no command-line -mlink-bc
flag and no search-path lookup. The driver attaches the attribute to the module
during front-end processing, and the bring-up layer consumes it during the
LLVM-link step described below.
The attribute value is an MLIR BlobAttr whose payload can be one of two
shapes:
- An inline byte array: the bitcode payload is embedded directly in the IR. Used when the driver wants to pin a specific libdevice version into a reproducible build.
- A filesystem path: the bitcode lives on disk and the bring-up layer reads it at link time. Used by the normal CUDA toolchain build where libdevice is shipped as a separate .bc file alongside the compiler.
LLVMModule *load_blob_linked_lib(GPUModuleOp module) {
    // No attribute means no library link; this is not an error.
    Attribute attr = module.attributes()["blobLinkedLib"];
    if (attr == NULL) {
        return NULL;
    }
    // The payload is either inline bytes or a path to a bitcode file.
    BlobPayload payload = resolve_blob_payload(attr);
    if (payload.kind == BLOB_FILE && !is_regular_file(payload.path)) {
        diagnose("blobLinkedLib: bitcode path does not exist or is not a file");
        return NULL;
    }
    ParseResult parsed = parse_llvm_bitcode(payload);
    if (!parsed.ok) {
        diagnose("blobLinkedLib: failed to parse embedded bitcode");
        return NULL;
    }
    return parsed.module;
}
The loader runs at the start of the NVPTX bring-up. The parsed module is
linked into the main LLVM module via llvm::Linker with the
LinkOnlyNeeded flag so unused helpers do not bloat the final PTX. From
this point on the helpers are ordinary LLVM IR — the optimizer sees them as
internal functions and can inline, specialize, and DCE them like any other
device function.
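The effect of linking with LinkOnlyNeeded can be modelled as a reachability walk from the kernel's unresolved symbols. The sketch below is a simplified model, not the llvm::Linker implementation: the `Library` summary type is hypothetical, and the helper name `__internal_trig_reduction` in the usage is invented for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical library summary: each defined helper maps to the symbols it calls.
typedef std::map<std::string, std::vector<std::string> > Library;

// Return the library definitions reachable from the kernel's undefined symbols.
std::set<std::string> needed_definitions(const Library &lib,
                                         const std::vector<std::string> &undefined) {
    std::set<std::string> needed;
    std::vector<std::string> worklist(undefined.begin(), undefined.end());
    while (!worklist.empty()) {
        std::string sym = worklist.back();
        worklist.pop_back();
        Library::const_iterator def = lib.find(sym);
        if (def == lib.end() || !needed.insert(sym).second)
            continue;  // not provided by this library, or already pulled in
        for (const std::string &callee : def->second)
            worklist.push_back(callee);  // helpers may need other helpers
    }
    return needed;
}
```

Symbols the library does not define stay unresolved, which is consistent with the rule that a missing helper surfaces as a diagnostic rather than a silent fallback.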
The contract for a reimplementation: a gpu.module without blobLinkedLib
proceeds through the NVPTX pipeline with no implicit libdevice link. Math
intrinsics that need libdevice helpers (transcendentals, denormal handling,
some integer conversions) emit linker-error diagnostics rather than silently
falling back to a default library. The driver layer is responsible for
attaching the attribute when the kernel actually needs libdevice.
Target-Machine Cache
Target-machine creation resolves the target triple, looks up the registered
target, builds TargetOptions, selects the requested mcpu, and calls the
target's TargetMachine constructor. The resulting object is cached so repeated
compilations with the same target settings do not rebuild the LLVM backend state.
TargetMachine *get_or_create_nvptx_target_machine(TargetCache *cache,
                                                  TargetRequest request) {
    if (cache->machine != NULL && target_request_equal(cache->request, request)) {
        return cache->machine;
    }
    const Target *target = lookup_target(request.triple);
    if (target == NULL) {
        diagnose("failed to look up NVPTX target for requested triple");
        return NULL;
    }
    TargetOptions options = default_nvptx_target_options();
    TargetMachine *machine = target->create_target_machine(
        request.triple, request.mcpu, request.features, options);
    cache->request = request;
    cache->machine = machine;
    return machine;
}
The cache key must include the triple, chip, and feature string. A target machine reused across incompatible feature sets makes later legality checks observe the wrong subtarget.
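The key requirement can be shown with a small cache keyed on the full tuple. This sketch deliberately differs from the single-slot cache in the pseudocode above: it uses a map so that the effect of each key component is easy to observe, and `FakeTargetMachine` is a stand-in type invented for the example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <tuple>

// Stand-in for a constructed target machine; records what it was built for.
struct FakeTargetMachine {
    std::string triple, chip, features;
};

class TargetMachineCache {
public:
    // Look up by the full (triple, chip, features) tuple; build on a miss.
    FakeTargetMachine *get_or_create(const std::string &triple,
                                     const std::string &chip,
                                     const std::string &features) {
        Key key = std::make_tuple(triple, chip, features);
        std::unique_ptr<FakeTargetMachine> &slot = machines_[key];
        if (!slot) {
            slot.reset(new FakeTargetMachine{triple, chip, features});
            ++builds_;
        }
        return slot.get();
    }
    int builds() const { return builds_; }

private:
    typedef std::tuple<std::string, std::string, std::string> Key;
    std::map<Key, std::unique_ptr<FakeTargetMachine> > machines_;
    int builds_ = 0;
};
```

Dropping the feature string from the key would make the third lookup in a sequence such as ("sm_100a", "+ptx88"), ("sm_100a", "+ptx88"), ("sm_100a", "+ptx88,+tmem") return a machine built without tmem.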
LLVM Pass Pipeline
The optimization driver accepts the requested optimization level, ensures a
target machine exists, and asks LLVM PassBuilder for the per-module default
pipeline. Invalid optimization levels become diagnostics before any pass manager
is built.
bool run_llvm_pipeline(LLVMModule *module, TargetMachine *tm, OptLevel level) {
    if (!is_valid_opt_level(level)) {
        diagnose("invalid LLVM optimization level");
        return false;
    }
    if (tm == NULL) {
        diagnose("target machine unavailable; cannot optimize with LLVM");
        return false;
    }
    PassBuilder builder(tm);
    ModulePassManager mpm = builder.build_per_module_default_pipeline(level);
    mpm.run(*module);
    return true;
}
The pipeline shape is the stock LLVM decomposition: early simplification, module
simplification, function simplification, inlining, vectorization, module
optimization, and post-pass cleanup. Tileiras-specific behavior happens before
and around the pipeline: target-machine selection, the blobLinkedLib bitcode
linkage step, and the NVIDIA-private peephole-pass selection the factory
installed when the target machine was built.
Cross-References
- Codegen Overview — End-To-End Algorithm — the seven-stage backend contract these primitives feed into.
- AsmPrinter — MC Switch Shape Population Table — the 6,388-case dispatcher that the LTO-folded printer described here implements, and the per-SM print-shape windows it dispatches into.
- NVPTX Subtarget — The 81 Feature Indices — the chip/feature predicates the peephole-pass selection consults.
- libdevice Overview — the bitcode library most often delivered through the blobLinkedLib attribute.
- Math Pass Pipeline and Crosswalk — the consumer side of the gpulibs IR: where f128M_*, softfloat_*, Sleef_*, and the __nv_*128 integer helpers get wired into kernel-side math.
- Versions and Fingerprints — the producer-string and subtarget-triple table this section refers to.
- LLVM Fingerprint Table — the host LLVM-21 link-target identification that distinguishes the run-time backend from the embedded clang-16 / clang-7.1 producers.
- Lowering — Target Attribute Conversion — the point at which gpu.module acquires the #nvvm.target and blobLinkedLib attributes the bring-up reads.
NVPTX Subtarget and Feature Matrix
Abstract
Two stock LLVM subtarget tables identify an SM target: one lists every
accepted CPU string, one lists every individual feature bit. Each CPU row
carries a feature mask describing what that CPU implies. The runtime
NVPTXSubtarget copies the selected CPU mask, ORs in explicit -mattr
features, and answers hasFeature(idx) queries from the final bitset.
The reimplementation contract is four-fold: keep the 40 CPU strings sorted
lexicographically so std::lower_bound works; keep the 81 feature indices
stable so bit positions do not drift; use one generic scheduling model for
every CPU; and expose the tmem feature as the gate for tensor-memory and
tcgen05 paths.
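The bitset behaviour described in this abstract can be sketched in a few lines. This is a minimal model, assuming a hypothetical `SubtargetSketch` class and handling only features that -mattr enables; disabling features and the tuning mask are left out.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kNumFeatures = 81;  // feature indices 0..80
const std::size_t kFeatureTmem = 80;  // index of the tmem feature
typedef std::bitset<kNumFeatures> FeatureBits;

class SubtargetSketch {
public:
    // Start from the CPU row's implied mask, then OR in explicit -mattr bits.
    SubtargetSketch(const FeatureBits &cpu_implies,
                    const std::vector<std::size_t> &mattr_enabled)
        : bits_(cpu_implies) {
        for (std::size_t idx : mattr_enabled) {
            assert(idx < kNumFeatures);
            bits_.set(idx);
        }
    }
    // Out-of-range indices are reported as absent rather than trapping.
    bool hasFeature(std::size_t idx) const {
        return idx < kNumFeatures && bits_.test(idx);
    }
    // Gate for tensor-memory and tcgen05 paths.
    bool hasTensorMemory() const { return hasFeature(kFeatureTmem); }

private:
    FeatureBits bits_;
};
```

For example, a mask with bits 62 (sm_100a) and 80 (tmem) set, plus an explicit index 32 (ptx88), answers true for all three queries, while a mask holding only bit 74 (sm_120a) reports no tensor memory.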
The Two TableGen Tables
NVPTXSubtarget is built from three arrays: the CPU table, the feature table, and
a parallel CPU-name StringRef array used for early validation of -mcpu.
struct SubtargetSubTypeKV {
    const char *cpu_key;
    uint64_t implies[6];
    uint64_t tune_implies[6];
    const MCSchedModel *sched_model;
};

struct SubtargetFeatureKV {
    const char *key;
    const char *description;
    uint64_t value;
    uint64_t implies[6];
};
Both tables are sorted: CPU rows by ASCII-lexicographic compare of the CPUKey string, feature rows by Key. That makes std::lower_bound against either array the canonical lookup path at runtime. Lexicographic CPU order produces one initially confusing artifact: "sm_100" < "sm_100a" < "sm_100f" < "sm_101" < ... < "sm_121f" < "sm_20" < "sm_21" < ... < "sm_90" < "sm_90a". The Blackwell sm_1NN family appears before the legacy sm_2N/sm_3N/.../sm_9N family for the simple reason that '1' < '2' in ASCII; rows are sorted by string, not by silicon generation.
The third array mirrors the CPU table as (pointer, length) pairs. Its only
job is early -mcpu validation before a full subtarget object exists.
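The lookup path and the ordering artifact can both be seen in a short sketch. It uses only a subset of the CPU keys, so the returned index is a position in that subset rather than the real row number, and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <iterator>

// A few CPU keys in ASCII-lexicographic order (subset, for illustration only).
const char *const kCpuKeys[] = {"sm_100", "sm_100a", "sm_100f", "sm_121f",
                                "sm_20",  "sm_21",   "sm_90",   "sm_90a"};

// Strict weak ordering on C strings, matching the table's sort order.
static bool key_less(const char *a, const char *b) {
    return std::strcmp(a, b) < 0;
}

// Binary-search the sorted key array; return the index or -1 when absent.
int find_cpu_row(const char *cpu) {
    const char *const *first = std::begin(kCpuKeys);
    const char *const *last = std::end(kCpuKeys);
    const char *const *it = std::lower_bound(first, last, cpu, key_less);
    if (it == last || std::strcmp(*it, cpu) != 0)
        return -1;  // lower_bound found only an insertion point
    return static_cast<int>(it - first);
}
```

Note that `std::lower_bound` returns an insertion point, so the equality check afterwards is what turns it into an exact-match lookup.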
The 81 Feature Indices
The full feature index table follows. Indices are stable across builds: a TableGen renumber would change PTX bit positions and break every cubin produced against this drop. Each feature row's own Implies bitset is zero, so a CPU acquires feature bits only through the Implies mask of its SubtargetSubTypeKV row, plus whatever explicit -mattr features are ORed in afterwards.
idx Feature Description
0 fma-level=0 FP fused-multiply-add fusion disabled
1 fma-level=1 FMA fusion for FP32 only
2 fma-level=2 FMA fusion everywhere (cicc default)
3 ptx32 Use PTX version 32
4 ptx40 Use PTX version 40
5 ptx41 Use PTX version 41
6 ptx42 Use PTX version 42
7 ptx43 Use PTX version 43
8 ptx50 Use PTX version 50
9 ptx60 Use PTX version 60
10 ptx61 Use PTX version 61
11 ptx62 Use PTX version 62
12 ptx63 Use PTX version 63
13 ptx64 Use PTX version 64
14 ptx65 Use PTX version 65
15 ptx70 Use PTX version 70
16 ptx71 Use PTX version 71
17 ptx72 Use PTX version 72
18 ptx73 Use PTX version 73
19 ptx74 Use PTX version 74
20 ptx75 Use PTX version 75
21 ptx76 Use PTX version 76
22 ptx77 Use PTX version 77
23 ptx78 Use PTX version 78
24 ptx80 Use PTX version 80
25 ptx81 Use PTX version 81
26 ptx82 Use PTX version 82
27 ptx83 Use PTX version 83
28 ptx84 Use PTX version 84
29 ptx85 Use PTX version 85
30 ptx86 Use PTX version 86
31 ptx87 Use PTX version 87
32 ptx88 Use PTX version 88
33 prec-divf32=0 See definition in NVPTXISelLowering.cpp
34 prec-divf32=1 See definition in NVPTXISelLowering.cpp
35 prec-divf32=2 See definition in NVPTXISelLowering.cpp
36 prec-divf32=3 See definition in NVPTXISelLowering.cpp
37 prec-sqrtf32=0 See definition in NVPTXISelLowering.cpp
38 prec-sqrtf32=1 See definition in NVPTXISelLowering.cpp
39 sm_20 Target SM 20
40 sm_21 Target SM 21
41 sm_30 Target SM 30
42 sm_32 Target SM 32
43 sm_35 Target SM 35
44 sm_37 Target SM 37
45 sm_50 Target SM 50
46 sm_52 Target SM 52
47 sm_53 Target SM 53
48 sm_60 Target SM 60
49 sm_61 Target SM 61
50 sm_62 Target SM 62
51 sm_70 Target SM 70
52 sm_72 Target SM 72
53 sm_73 Target SM 73
54 sm_75 Target SM 75
55 sm_80 Target SM 80
56 sm_82 Target SM 82
57 sm_86 Target SM 86
58 sm_89 Target SM 89
59 sm_90 Target SM 90
60 sm_90a Accelerated Target SM 90
61 sm_100 Target SM 100
62 sm_100a Accelerated Target SM 100
63 sm_100f Family Conditional Target SM 100
64 sm_101 Target SM 101
65 sm_101a Accelerated Target SM 101
66 sm_101f Family Conditional Target SM 101
67 sm_103 Target SM 103
68 sm_103a Accelerated Target SM 103
69 sm_103f Family Conditional Target SM 103
70 sm_110 Target SM 110
71 sm_110a Accelerated Target SM 110
72 sm_110f Family Conditional Target SM 110
73 sm_120 Target SM 120
74 sm_120a Accelerated Target SM 120
75 sm_120f Family Conditional Target SM 120
76 sm_121 Target SM 121
77 sm_121a Accelerated Target SM 121
78 sm_121f Family Conditional Target SM 121
79 sharedmem32bitptr Use 32 bit ptrs for Shared Memory
80 tmem Has support for Tensor Memory
Indices 0..38 cluster the orthogonal compiler-knob features: three FMA-fusion levels, thirty PTX-version selectors, four FP32-division precision settings, and two FP32-sqrt precision settings. The driver layer (cicc / nvcc) sets these through -mattr=+ptxNN and -mattr=+fma-level=N flags alongside -mcpu=sm_NN; tileiras itself never propagates a PTX-version bit from any CPU row. Indices 39..78 cover the 40 SM-target feature bits, one per CPU row, in ascending SM-number order (sm_20 first, sm_121f last) rather than the lexicographic order of the CPU table. Index 79 is the Fermi-legacy sharedmem32bitptr toggle. Index 80 is the only NVIDIA-proprietary feature in the entire build: tmem, "Has support for Tensor Memory", absent from upstream LLVM 18.1.4 / 19 NVPTX, and the cross-feature implication that distinguishes datacenter Blackwell from consumer Blackwell.
The PTX-version selector range stops at ptx88 — three versions past upstream LLVM 19 (capped at ptx86) and six past LLVM 18.1.4 (capped at ptx82). ptx88 aligns with the CUDA 13.1 toolchain vintage that produced this binary.
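The index ranges above map directly onto a small classifier. The enum and function names below are illustrative only; the boundaries are taken from the feature table in this section.

```cpp
#include <cassert>

// Feature groups, following the index ranges in the table above.
enum FeatureGroup {
    kFmaLevel,           // 0..2
    kPtxVersion,         // 3..32
    kPrecDivF32,         // 33..36
    kPrecSqrtF32,        // 37..38
    kSmTarget,           // 39..78
    kSharedMem32BitPtr,  // 79
    kTmem,               // 80
    kInvalidFeature      // anything past 80
};

// Classify a feature index by range; the table is dense, so no lookup is needed.
FeatureGroup classify_feature_index(unsigned idx) {
    if (idx <= 2) return kFmaLevel;
    if (idx <= 32) return kPtxVersion;
    if (idx <= 36) return kPrecDivF32;
    if (idx <= 38) return kPrecSqrtF32;
    if (idx <= 78) return kSmTarget;
    if (idx == 79) return kSharedMem32BitPtr;
    if (idx == 80) return kTmem;
    return kInvalidFeature;
}
```

A classifier like this is mainly useful in diagnostics, for example to explain which kind of feature a rejected -mattr entry refers to.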
The 40 CPU Rows
The 40 CPU rows appear below grouped by silicon generation for readability.
The Row column gives each CPU's index in the lexicographically sorted table,
which is why the listing starts at row 18. Each entry lists the row index, the
feature bit for the CPU itself, the known ELF target byte when the cubin
reader recognizes one, and whether the CPU implies tmem.
Row CPU FeatKV ELF byte TMem Variant Family
--- -------- ------ ---------- ---- -------- -------------------------------------------
18 sm_20 39 0x14 no base Fermi GF1xx
19 sm_21 40 0x15 no base Fermi GF11x
20 sm_30 41 0x1E no base Kepler GK10x
21 sm_32 42 0x20 no base Kepler (Tegra K1 / Logan)
22 sm_35 43 0x23 no base Kepler GK110 / GK11x
23 sm_37 44 0x25 no base Kepler GK210
24 sm_50 45 0x32 no base Maxwell GM10x
25 sm_52 46 0x34 no base Maxwell GM20x -- DEFAULT CPU
26 sm_53 47 0x35 no base Maxwell (Tegra X1 / Erista)
27 sm_60 48 0x3C no base Pascal GP100 (datacenter)
28 sm_61 49 0x3D no base Pascal GP10x (consumer)
29 sm_62 50 0x3E no base Pascal Tegra X2 / Parker / Drive-PX2
30 sm_70 51 0x46 no base Volta GV100
31 sm_72 52 0x48 no base Volta (Xavier)
32 sm_73 53 (gap) no base placeholder; no known HW product
33 sm_75 54 0x4B no base Turing TU10x
34 sm_80 55 0x50 no base Ampere GA100 (datacenter)
35 sm_82 56 (gap) no base placeholder; no known HW product
36 sm_86 57 0x56 no base Ampere GA10x (consumer)
37 sm_89 58 0x59 no base Ada Lovelace AD10x
38 sm_90 59 0x5A no base Hopper GH100
39 sm_90a 60 0x5A+0x800 no a Hopper GH100 + WGMMA/TMA arch-cond
0 sm_100 61 0x64 no base Blackwell datacenter GB100/GB200/B100/B200
1 sm_100a 62 (gap) YES a Blackwell datacenter + tcgen05 arch-cond
2 sm_100f 63 (gap) YES f Blackwell datacenter + tcgen05 family-cond
3 sm_101 64 (gap) no base Blackwell datacenter (reserved variant)
4 sm_101a 65 (gap) YES a Blackwell datacenter + tcgen05 arch-cond
5 sm_101f 66 (gap) YES f Blackwell datacenter + tcgen05 family-cond
6 sm_103 67 (gap) no base Blackwell Ultra GB300 (datacenter)
7 sm_103a 68 (gap) YES a Blackwell Ultra GB300 + tcgen05 arch-cond
8 sm_103f 69 (gap) YES f Blackwell Ultra GB300 + tcgen05 family-cond
9 sm_110 70 (gap) no base Jetson Thor (embedded Blackwell-class)
10 sm_110a 71 (gap) YES a Jetson Thor + tcgen05 arch-cond
11 sm_110f 72 (gap) YES f Jetson Thor + tcgen05 family-cond
12 sm_120 73 0x78 no base Blackwell consumer RTX 50** / Pro enterprise
13 sm_120a 74 (gap) NO a Consumer RTX 50** arch-cond (no tcgen05)
14 sm_120f 75 (gap) NO f Consumer RTX 50** family-cond (no tcgen05)
15 sm_121 76 (gap) no base DGX Spark / B40-class
16 sm_121a 77 (gap) NO a DGX Spark arch-cond (no tcgen05)
17 sm_121f 78 (gap) NO f DGX Spark family-cond (no tcgen05)
Two architecturally important findings live in this table.
The first is that exactly eight CPUs imply tmem, and they are exactly the datacenter Blackwell and Jetson Thor arch-conditional and family-conditional variants: sm_100a, sm_100f, sm_101a, sm_101f, sm_103a, sm_103f, sm_110a, sm_110f. Their Implies bitsets each carry two bits — the self-bit plus bit 80 — while every other CPU row has only its single self-bit. Tensor Memory and the tcgen05.mma instruction family it gates are physically datacenter-only in NVIDIA's silicon planning. The base CPUs sm_100 / sm_101 / sm_103 / sm_110 deliberately omit the bit so that plain .target sm_100 PTX cannot reach tcgen05; the programmer must opt into .target sm_100a or .target sm_100f.
The second is that consumer Blackwell (sm_120 and variants) and DGX Spark (sm_121 and variants) never imply tmem, even in their arch-conditional or family-conditional forms. This is not build drift — sm_121a is alphabetically reachable through std::lower_bound, so a missing bit is a deliberate choice. Physical silicon for consumer Blackwell and Spark lacks Tensor Memory; consumer Blackwell gets mma.sync.aligned.*.block_scale as a weaker substitute, dispatched through a separate two-opcode MachineInstr path (5468 dense, 5469 sparse) rather than through TMEM-resident tcgen05 atoms.
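The implication rule behind both findings can be sketched as a mask builder. The names below are hypothetical; the sketch assumes a six-word 64-bit bitset and takes the self-bit and the tmem flag straight from the FeatKV and TMem columns of the table.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using FeatureMask = std::array<std::uint64_t, 6>;  // assumed six-word shape
constexpr unsigned kTmemBit = 80;

// Set one feature bit in a word-packed mask.
void set_mask_bit(FeatureMask &mask, unsigned idx) {
    assert(idx < mask.size() * 64);
    mask[idx >> 6] |= std::uint64_t{1} << (idx & 63);
}

// A CPU row implies its own feature bit, plus bit 80 only when the
// table marks the row as tmem-capable.
FeatureMask make_implies_mask(unsigned self_bit, bool implies_tmem) {
    FeatureMask mask{};
    set_mask_bit(mask, self_bit);
    if (implies_tmem) set_mask_bit(mask, kTmemBit);
    return mask;
}
```

With these inputs, sm_100a (self-bit 62, tmem yes) yields two set bits, while sm_120a (self-bit 74, tmem no) yields exactly one.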
Hopper's sm_90a is the only pre-Blackwell arch-conditional CPU, and like the
consumer variants sm_120a and sm_121a it does not imply tmem. Tensor Memory
was introduced with Blackwell; Hopper's arch-conditional surface covers
WGMMA, TMA, and setmaxnreg instead. The plain sm_100 row also lacks tmem,
so programmers must opt into sm_100a or sm_100f to reach tensor memory.
Two CPU rows, sm_73 and sm_82, behave like compatibility placeholders
with no known physical silicon. Conversely, the cubin reader recognizes
sm_87, but this subtarget table has no sm_87 CPU row. A correct
reimplementation wires CPU selection and cubin classification symmetrically
so any recognized target is also selectable by -mcpu.
Runtime Feature State
The runtime subtarget stores the target triple, selected CPU, feature string, references to the CPU and feature tables, a generic scheduling model, the populated feature bitset, and parsed numeric SM/PTX versions. Only this compact state is needed for codegen legality checks.
struct NvptxFeatureState {
string triple;
string cpu;
string tune_cpu;
string feature_string;
ArrayRef<SubtargetSubTypeKV> cpu_rows;
ArrayRef<SubtargetFeatureKV> feature_rows;
ArrayRef<StringRef> cpu_names;
const MCSchedModel *sched_model;
uint64_t feature_bits[6]; /* runtime bitset queried by hasFeature */
uint32_t sm_version_times_ten;
uint32_t ptx_version_times_ten;
uint32_t sm_minor;
bool has_tmem_cache; /* cached hasFeature(80); written by parse_nvptx_subtarget */
};
feature_bits is the runtime bitset that hasFeature queries. It starts
empty; the selected CPU row's implies mask gets ORed in, then any
explicit -mattr=+feature flags. Masks and runtime bitset share the same
six-word shape, so population is a word-wise OR, not a per-bit loop.
static bool nvptx_has_feature(const NVPTXSubtarget *st, unsigned idx) {
return (st->FeatureBits[idx >> 6] >> (idx & 63)) & 1;
}
/* Concrete probes for the four interesting bits: */
/* HasSM90 = hasFeature(59) = (FeatureBits[0] >> 59) & 1 */
/* HasSM100 = hasFeature(61) = (FeatureBits[0] >> 61) & 1 */
/* HasSM100a = hasFeature(62) = (FeatureBits[0] >> 62) & 1 */
/* HasTMem = hasFeature(80) = (FeatureBits[1] >> 16) & 1 -- the lone NVIDIA-only bit */
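The population step can be sketched in the same style. This is a minimal sketch, assuming the six-word shape described above; `FeatureWords` and `or_feature_bits` are illustrative names, not symbols recovered from the binary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using FeatureWords = std::array<std::uint64_t, 6>;

// Word-wise OR of a CPU row's implies mask into the runtime bitset.
void or_feature_bits(FeatureWords &dst, const FeatureWords &src) {
    for (std::size_t w = 0; w < dst.size(); ++w)
        dst[w] |= src[w];
}

// Same probe shape as the hasFeature helper above.
bool has_feature(const FeatureWords &bits, unsigned idx) {
    assert(idx < bits.size() * 64);
    return (bits[idx >> 6] >> (idx & 63)) & 1;
}
```

Starting from an empty bitset and ORing in a mask with bits 62 and 80 set makes the probes for sm_100a and tmem succeed while the probe for plain sm_100 (bit 61) still fails.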
Every TuneImplies slot in every SubTypeKV row is zero. Upstream LLVM uses this field to separate tuning advice (latency model, scheduling hints) from architectural feature implication; the NVPTX fork in this build leaves it empty. A faithful reimplementation leaves the TuneFeatures = [...] clause off the TableGen records. Every Implies slot in every FeatureKV row is zero too — features never transitively pull in other features in this binary. CPU rows carry the full implied set directly.
Cached Tensor-Memory Predicate
Hot instruction-selection paths use a cached boolean equivalent to
hasFeature(80). Semantically this is has_tmem: the target supports Tensor
Memory and can select tcgen05 instructions. The cache is an optimization.
Correctness still comes from the feature bitset.
static bool nvptx_has_feature(const NvptxFeatureState *state, unsigned idx) {
return (state->feature_bits[idx >> 6] >> (idx & 63)) & 1;
}
static bool nvptx_has_tmem(const NvptxFeatureState *state) {
return nvptx_has_feature(state, 80);
}
Reimplementations may cache has_tmem after CPU/feature parsing, but the cached
value must be derived from the same feature bitset that services normal
hasFeature queries.
Lookup at Runtime
The full -mcpu resolution path takes the user-supplied CPU string, runs
std::lower_bound against the alphabetically sorted CPU table, and on an
exact hit ORs that CPU row's implies mask into feature_bits. On a miss
Tileiras falls back to sm_52. Any -mattr=+feature flags parsed in the
same call set their respective bits directly, bypassing the CPU table.
After CPU parsing, SMVersionTimesTen derives from the numeric part of the
CPU name. sm_90a records 90, not 901, because the suffix is a variant
marker rather than a new major version. PTXVersionTimesTen only populates
when a +ptxNN feature is supplied.
void parse_nvptx_subtarget(NvptxFeatureState *state, string cpu, FeatureList attrs) {
const SubtargetSubTypeKV *row = lower_bound_cpu(state->cpu_rows, cpu);
if (row == NULL) {
row = lower_bound_cpu(state->cpu_rows, "sm_52");
}
or_feature_bits(state->feature_bits, row->implies);
for (FeatureAttr attr : attrs) {
set_feature_by_name(state, attr.name, attr.enabled);
}
state->sm_version_times_ten = parse_sm_major_times_ten(row->cpu_key);
state->ptx_version_times_ten = parse_ptx_version(attrs);
state->has_tmem_cache = nvptx_has_tmem(state);
}
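A runnable sketch of the lookup and the version parse, assuming the CPU table is a lexicographically sorted list of names. `resolve_cpu` and `parse_sm_number` are hypothetical helpers; the exact-hit check after std::lower_bound and the sm_52 fallback follow the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Binary-search the sorted CPU table; anything other than an exact hit
// falls back to sm_52.
std::string resolve_cpu(const std::vector<std::string> &sorted_cpus,
                        const std::string &requested) {
    assert(std::is_sorted(sorted_cpus.begin(), sorted_cpus.end()));
    auto it = std::lower_bound(sorted_cpus.begin(), sorted_cpus.end(), requested);
    if (it != sorted_cpus.end() && *it == requested)
        return *it;
    return "sm_52";
}

// Numeric part of an "sm_NN[a|f]" name; the variant suffix is ignored,
// so "sm_90a" parses as 90.
unsigned parse_sm_number(const std::string &cpu) {
    assert(cpu.rfind("sm_", 0) == 0 && "expected an sm_ prefix");
    unsigned value = 0;
    for (std::size_t i = 3;
         i < cpu.size() && std::isdigit(static_cast<unsigned char>(cpu[i])); ++i)
        value = value * 10 + static_cast<unsigned>(cpu[i] - '0');
    return value;
}
```

Note that std::lower_bound alone only finds the first entry not less than the key, so the equality check is what turns it into an exact-match lookup.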
Cross-References
Per-SM Emission Templates — Capability Matrix walks the PTX templates
each CPU's implied feature set unlocks, including the consumer-Blackwell
mma.sync.aligned.*.block_scale substitute used when sm_120 and sm_121 lack
tmem. PTX Version and Target Selection
explains how a CPU row plus a +ptxNN feature bit drive the
.version / .target header projection, and which arch-conditional
instruction families each a / f suffix unlocks.
NVPTX Bring-up — Target Registration Chain covers
the surrounding target setup. tcgen05 / WGMMA / mbarrier / Cluster Emission
covers the codegen paths guarded by the tensor-memory predicate.
NVPTX Target Lowering - Calls and Arguments
Abstract
The NVPTX SelectionDAG target-lowering layer is the bridge between
ordinary LLVM function semantics and the PTX .param ABI. It converts LLVM
IR calls, formal arguments, return values, custom loads, and atomic
operations into NVPTX-specific DAG nodes.
The contract is param-space discipline. Call arguments and returns never
travel as ordinary memory traffic. Each one gets a generated param symbol,
breaks into ABI-legal value parts, threads through explicit NVPTX call
envelope nodes (DeclareRetParam, ParamCallStart, ParamCallEnd), and
reassembles after the call. Kernel by-value grid-constant arguments take a
fast path that preserves their param-address-space identity through
legalization. Custom DAG opcodes for vector memory, vector atomics, and
scalar floating remainders share one dispatcher so unhandled cases fall
back cleanly to LLVM's generic legalizer.
Responsibilities
This lowering family does four jobs:
| Area | Responsibility |
|---|---|
| Formal arguments | Convert incoming LLVM function parameters into LOAD_PARAM, MoveParam, or proxy nodes. |
| Calls | Build the DeclareRetParam, ParamCallStart, argument materialization, callee target, result extraction, and ParamCallEnd sequence. |
| Custom operations | Handle target-marked custom opcodes such as vector loads, vector atomics, and NVPTX-specific splats. |
| Atomics | Lower scalar and vector atomic-RMW families into NVPTX atomic DAG nodes with explicit chain bundling. |
Everything downstream assumes this layer has already made ABI details explicit. Botch param naming, byval handling, or chain construction and the emitted PTX still prints — but it will not match tileiras behavior.
NVPTXTargetLowering Vtable Bank
The NVPTXTargetLowering instance carries a 21-slot LLVM TargetLowering vtable in .data.rel.ro. Most slots inherit from the abstract base class. The NVPTX backend overrides eight, four of which carry the codegen-shaping methods this page documents. The vtable bank sits at a fixed .data.rel.ro address that this report references as &vt_NVPTXTargetLowering; the exact offset shifts across builds, but slot order is stable because LLVM publishes a versioned TargetLowering ABI.
| Slot | Method | Identity in this build | Role |
|---|---|---|---|
| 0 | typeinfo helper | RTTI pointer | Standard Itanium-ABI _ZTI... slot. |
| 1 | dtor (delete) | inherited | Virtual destructor, deletes through the base pointer. |
| 2 | dtor (no delete) | inherited | Virtual destructor variant that leaves storage alone. |
| 3 | LowerOperation | sub_1A7C310 via shim sub_1A7FB60 | 79-case DAG dispatch for BUILD_VECTOR remap, vector LOAD, scalar floating-remainder fallback, and atomic families. |
| 4 | LowerFormalArguments | sub_1A77460 | Walks the IR argument list, builds _param_<N> symbols, and emits LOAD_PARAM, MoveParam, or ProxyReg per part. |
| 5 | LowerCall | sub_1A72EF0 | Builds the DeclareRetParam, ParamCallStart, argument materialization, callee target, result extraction, and ParamCallEnd envelope. |
| 6 | LowerReturn | inherited hook stub | Lowers ret into RET_FLAG-class nodes; this build leaves the slot pointing at the LLVM default because return-value marshaling already happened upstream through StoreRetval custom nodes. See the Lowering Returns section below. |
| 7 | ReplaceNodeResults | sub_1A7C310 (shared body) | Post-legalisation hook for v8 / v16 splits; reuses the LowerOperation body with a different return path. |
| 8 | getTargetNodeName | NVPTX numeric-opcode table | Translates NVPTXISD:: opcodes (such as 0x1FD ParamCallStart, 0x1FE ParamCallEnd, 0x13D DeclareRetParam) into display names for -debug dumps. |
| 9 | useSoftFloat | constant return false | NVPTX always lowers floating point through hardware DAG nodes; no soft-float runtime. |
| 10-20 | inherited from TargetLowering base | base class methods | Type-promotion hooks, register class hooks, shift-amount type, atomic legality, and other defaults the NVPTX backend does not override. |
The four overrides this page details are slots 3, 4, 5, and 7. Slot 5 dominates the bank's complexity at ~16.6 KB of code; slot 3 is ~14.0 KB; slot 4 is ~6.4 KB. Remaining slots are thin policy returns or short dispatch wrappers. The dispatch shim sub_1A7FB60 deserves a separate note: it is a public-facing LowerOperation trampoline that the LegalizeOp driver enters and tail-calls into sub_1A7C310, keeping the vtable slot pointer stable across rebuilds even when the body shifts.
The numeric-opcode table backing slot 8 carries the names this page references repeatedly: ParamCallStart at 0x1FD, ParamCallEnd at 0x1FE, ParamCallStartScalar at 0x1FF, PrintCallVector at 0x200, CallNonRegPrototype at 0x201, CallNonReg at 0x202, CallSeqBegin at 0x203, CallArg at 0x204. Declare nodes DeclareRetParam (0x13D) and DeclareScalarParam (0x13E) sit below the call range. Ship the same opcode numbers and -debug parity comes for free; downstream diagnostic tooling reads the names through this slot.
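A name lookup over the opcodes listed above might look like the following sketch. The function name and the empty-string convention for unknown opcodes are assumptions; only the opcode numbers and display names come from the text.

```cpp
#include <cstdint>
#include <string>

// Maps the NVPTXISD opcode numbers quoted above to their display names.
// Returns an empty string for opcodes outside this excerpt.
std::string nvptx_node_name(std::uint32_t opcode) {
    switch (opcode) {
    case 0x13D: return "DeclareRetParam";
    case 0x13E: return "DeclareScalarParam";
    case 0x1FD: return "ParamCallStart";
    case 0x1FE: return "ParamCallEnd";
    case 0x1FF: return "ParamCallStartScalar";
    case 0x200: return "PrintCallVector";
    case 0x201: return "CallNonRegPrototype";
    case 0x202: return "CallNonReg";
    case 0x203: return "CallSeqBegin";
    case 0x204: return "CallArg";
    default:    return std::string();
    }
}
```

A reimplementation would extend the switch to the full roster; the ten cases here cover only the values this section names.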
Param Symbol Naming
Tileiras names generated param symbols deterministically:
StringRef make_param_symbol(unsigned arg_index, bool is_vararg) {
if (is_vararg)
return "_vararg";
return format("_param_%u", arg_index);
}
The same namer is used by formal-argument lowering and call lowering, so caller and callee agree on declarations such as:
.param .align A .b8 <function>_param_<N>[SIZE]
This is a behavioral contract, not just a printing convention. Later param loads, stores, and call-sequence nodes refer to these names.
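As a concrete illustration, the namer and the declaration line can be rendered with plain string formatting. This is a sketch: `make_param_decl` is a hypothetical helper, and it assumes the function name is simply prefixed to the `_param_<N>` symbol as in the declaration shape shown above.

```cpp
#include <cassert>
#include <string>

// Deterministic param symbol, mirroring the namer above.
std::string make_param_symbol(unsigned arg_index, bool is_vararg) {
    if (is_vararg)
        return "_vararg";
    return "_param_" + std::to_string(arg_index);
}

// Renders ".param .align A .b8 <function>_param_<N>[SIZE]".
std::string make_param_decl(const std::string &function, unsigned arg_index,
                            unsigned align, unsigned size) {
    assert(align != 0 && (align & (align - 1)) == 0 && "alignment must be a power of two");
    return ".param .align " + std::to_string(align) + " .b8 " + function +
           make_param_symbol(arg_index, false) + "[" + std::to_string(size) + "]";
}
```

Because both the caller and the callee derive the symbol from the same index, the two sides agree on the name without any shared table.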
Lowering Formal Arguments
Formal-argument lowering walks the LLVM function's argument list, computes the legal register parts for each argument type, creates the param symbol, and emits one of two shapes:
- Non-kernel by-value arguments are represented with MoveParam or proxy nodes because their value must be copied through ordinary param space.
- Kernel grid-constant by-value arguments can load directly from the param address space, preserving the special kernel argument representation.
void lower_formal_arguments(Function *fn, MachineFunctionInfo *mfi, DAG *dag) {
for (unsigned i = 0; i < fn->arg_count; ++i) {
Argument *arg = fn->args[i];
ValueTypeParts parts = compute_value_parts(arg->type, fn->calling_conv);
StringRef name = intern_param_name(mfi, make_param_symbol(i, false));
if (arg->has_byval && !is_kernel(fn)) {
SDValue moved = emit_move_param(dag, name, parts, arg->type);
bind_argument_value(arg, moved);
continue;
}
if (is_kernel_grid_constant_byval(arg, fn)) {
SDValue loaded = emit_load_param_addrspace_101(dag, name, parts);
mark_preserve_param_address_space(loaded);
bind_argument_value(arg, loaded);
continue;
}
SDValue loaded = emit_load_param(dag, name, parts);
bind_argument_value(arg, loaded);
}
}
The address-space preservation flag is essential for grid-constant byval arguments. Drop it and later DAG combines promote the value back to a generic pointer, erasing the fact that the source was kernel param memory.
Lowering Calls
Call lowering builds an explicit NVPTX call envelope. The path starts by reserving a per-call unique ID, then folds that ID into generated param symbols so multiple calls in the same function cannot collide.
The call lowering path has six logical stages:
- Create the entry chain and return-param declaration.
- Classify outgoing arguments.
- Resolve the callee target.
- Materialize and store each outgoing argument.
- Extract return values from call-result nodes.
- Close the call sequence.
SDValue lower_call(CallLoweringInfo *cli, DAG *dag, MachineFunctionInfo *mfi) {
unsigned call_id = ++mfi->next_call_id;
Chain chain = dag_entry_token(dag);
ReturnParam ret = declare_return_param(dag, cli->result_types, call_id);
chain = emit_param_call_start(dag, chain, call_id);
CalleeTarget target = resolve_callee(cli);
for (unsigned i = 0; i < cli->out_count; ++i) {
OutArg *arg = &cli->outs[i];
ParamSymbol sym = make_call_param_symbol(cli, call_id, i);
if (is_kernel_grid_constant_byval_outarg(cli, arg)) {
chain = emit_direct_grid_constant_param(dag, chain, arg, sym);
} else if (arg->is_byval) {
chain = emit_byval_param_copy(dag, chain, arg, sym);
} else {
chain = emit_scalar_or_vector_param_store(dag, chain, arg, sym);
}
}
chain = emit_callee_target(dag, chain, target);
SDValue result = collect_call_results(dag, chain, ret, cli->result_types);
emit_param_call_end(dag, chain, call_id);
return result;
}
Byval and Grid-Constant State Machine
The byval path is governed by four facts:
| Fact | Meaning |
|---|---|
| K | The caller/callee context is a kernel entry. |
| B | The argument has byval semantics. |
| G | The argument carries the grid-constant annotation. |
| D | The call resolves to a direct Function. |
The decision table is:
| Condition | Lowering action |
|---|---|
| K && B && G | Load directly from param address space. This is the fast path for kernel grid constants. |
| K && B && !G | Materialize through the ordinary lowered-args path. |
| !K && B | Use a proxy or move-param sequence for device-function byval. |
| !B && D | Emit direct callee prototype and direct call target nodes. |
| !B && !D | Build a synthetic callable wrapper and mark it as an NVPTX libcall callee. |
The synthetic wrapper case adds the function attribute:
nvptx-libcall-callee = "true"
The marker is NVIDIA-specific and lets later passes recognize indirect-call wrappers without re-deriving their origin from the DAG.
CalleeTarget resolve_callee(CallLoweringInfo *cli) {
if (cli->called_function != NULL)
return direct_callee(cli->called_function);
if (is_global_address(cli->callee_value))
return external_symbol_callee(cli->callee_value);
Function *wrapper = build_libcall_wrapper(cli->callee_value);
add_function_attribute(wrapper, "nvptx-libcall-callee", "true");
return callable_wrapper(wrapper);
}
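The decision table over K, B, G, and D can be written as a pure function, which makes the five outcomes easy to test in isolation. The enum and function names below are hypothetical; the branch structure follows the table row by row.

```cpp
// Outcomes of the byval / grid-constant decision table.
enum class LoweringAction {
    DirectParamLoad,      // K && B && G
    OrdinaryLoweredArgs,  // K && B && !G
    ProxyOrMoveParam,     // !K && B
    DirectCall,           // !B && D
    LibcallWrapper        // !B && !D
};

// K: kernel entry, B: byval, G: grid-constant, D: direct Function callee.
LoweringAction classify_lowering(bool K, bool B, bool G, bool D) {
    if (B) {
        if (!K)
            return LoweringAction::ProxyOrMoveParam;
        return G ? LoweringAction::DirectParamLoad
                 : LoweringAction::OrdinaryLoweredArgs;
    }
    return D ? LoweringAction::DirectCall : LoweringAction::LibcallWrapper;
}
```

The byval rows ignore D and the non-byval rows ignore K and G, so the sixteen input combinations collapse onto exactly the five table rows.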
Value Part Scheduling
Aggregate and vector arguments are broken into legal machine value types before they are stored into param space. The helper logic is equivalent to:
void store_outgoing_argument(DAG *dag,
Chain *chain,
OutArg *arg,
ParamSymbol sym) {
ValueTypeParts parts = compute_value_parts(arg->type, arg->calling_conv);
PartSchedule sched = schedule_value_parts(parts);
for (unsigned i = 0; i < sched.count; ++i) {
SDValue piece = extract_argument_piece(dag, arg->value, sched.part[i]);
*chain = emit_store_param(dag, *chain, sym, sched.part[i], piece);
}
}
Part scheduling is why the lowering path must know ABI size and alignment. By the time PTX sees a source-level aggregate, it is no longer a single call operand.
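To show why size and alignment matter, the sketch below assigns a param-space byte offset to each part. It assumes natural power-of-two alignment per part, which is an illustrative rule and not the recovered tileiras schedule; `Part` and `layout_part_offsets` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct Part {
    unsigned size;   // bytes
    unsigned align;  // bytes, power of two
};

// Walks the parts in order, rounding the running offset up to each
// part's alignment before placing it.
std::vector<unsigned> layout_part_offsets(const std::vector<Part> &parts) {
    std::vector<unsigned> offsets;
    unsigned offset = 0;
    for (const Part &p : parts) {
        assert(p.align != 0 && (p.align & (p.align - 1)) == 0);
        offset = (offset + p.align - 1) & ~(p.align - 1);
        offsets.push_back(offset);
        offset += p.size;
    }
    return offsets;
}
```

A 4-byte part followed by a 1-byte part and an 8-byte part lands at offsets 0, 4, and 8 under this rule; the three bytes after the 1-byte part are padding.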
Lowering Returns
Vtable slot 6 (LowerReturn) points at the inherited base-class stub in this build because the
NVPTX return ABI does not need a custom DAG shape: every return value has already been routed
through StoreRetval-class custom nodes (numeric 0xDA and friends, dispatched from the load/store
vector selector). The base hook merely emits a RET_FLAG chain node that closes the function.
SDValue lower_return(ReturnLoweringInfo *rli, DAG *dag) {
Chain chain = rli->chain;
for (unsigned i = 0; i < rli->ret_count; ++i) {
RetVal *ret = &rli->rets[i];
ValueTypeParts parts = compute_value_parts(ret->type, rli->calling_conv);
for (unsigned j = 0; j < parts.count; ++j) {
SDValue piece = extract_return_piece(dag, ret->value, parts.part[j]);
chain = emit_store_retval(dag, chain, parts.part[j], piece);
}
}
return emit_ret_flag(dag, chain);
}
Reimplementations that override this slot must keep StoreRetval materialization upstream of the chain close. Pushing return-value materialization into LowerReturn itself collapses the chain into a single node and breaks the value-part scheduling the rest of the lowering layer relies on.
Custom Operation Lowering
The custom-operation dispatcher takes target-specific cases and lets LLVM's generic legalizer take everything else. The relevant classes:
| Operation class | Lowering behavior |
|---|---|
| Vector load/store | Rewrite into NVPTX vector load/store or splat DAG nodes when the target supports the shape. |
| INSERT_VECTOR_ELT, EXTRACT_VECTOR_ELT, BUILD_VECTOR, SCALAR_TO_VECTOR | Rebuild as NVPTX splat or lane-extract nodes. |
| Scalar floating remainder fallback | Materialize through param-load and element rebuild nodes. |
| Scalar atomics | Lower into NVPTX atomic nodes and chain bundles. |
| Vector atomics | Require a sufficiently new SM target; otherwise emit a fatal unsupported-architecture diagnostic. |
The dispatcher returns "not handled" for gaps on purpose: that preserves LLVM's ordinary legalization behavior for non-NVIDIA cases.
bool lower_operation(SDNode *node, DAG *dag, SmallVector<SDValue> *results) {
switch (node->opcode) {
case ISD_LOAD:
return lower_vector_or_param_load(node, dag, results);
case ISD_INSERT_VECTOR_ELT:
case ISD_EXTRACT_VECTOR_ELT:
case ISD_BUILD_VECTOR:
case ISD_SCALAR_TO_VECTOR:
return lower_vector_lane_op(node, dag, results);
case ISD_ATOMIC_LOAD_ADD:
case ISD_ATOMIC_LOAD_AND:
case ISD_ATOMIC_LOAD_OR:
case ISD_ATOMIC_LOAD_XOR:
case ISD_ATOMIC_CMP_SWAP:
return lower_atomic(node, dag, results);
default:
return false;
}
}
Atomic-RMW Lowering
Atomic lowering is split by operation family. CAS-like and load-only operations emit an atomic compare/swap skeleton. Arithmetic RMW operations emit one per-part arithmetic atomic and bundle the result chain.
bool lower_atomic(SDNode *node, DAG *dag, SmallVector<SDValue> *results) {
AtomicKind kind = classify_atomic(node->opcode);
if (kind.is_vector && !subtarget_supports_vector_atomics(dag->subtarget))
fatal("vector atomics not supported on this architecture!");
ValueTypeParts parts = compute_value_parts(node->value_type, node->calling_conv);
Chain chain = node->chain;
for (unsigned i = 0; i < parts.count; ++i) {
AtomicPart part = extract_atomic_part(node, parts.part[i]);
if (kind.uses_compare_exchange) {
chain = emit_atomic_compare_and_swap(dag, chain, part, kind.signedness);
} else {
chain = emit_atomic_rmw(dag, chain, part, kind.opcode, kind.signedness);
}
}
results->push_back(bundle_atomic_chain(dag, chain));
return true;
}
Signedness does not change the overall DAG shape. It threads into final instruction selection so the backend picks signed or unsigned PTX mnemonics — atom.global.min.s32 versus atom.global.min.u32.
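The signedness choice reduces to a type-suffix selection at the end of selection. A minimal sketch for the min case, with a hypothetical helper name and the global state space fixed for brevity:

```cpp
#include <cassert>
#include <string>

// Builds the PTX mnemonic for an atomic min on global memory.
// Only the type suffix depends on signedness.
std::string atomic_min_mnemonic(bool is_signed, unsigned bits) {
    assert((bits == 32 || bits == 64) && "sketch covers 32- and 64-bit integers only");
    return std::string("atom.global.min.") + (is_signed ? "s" : "u") +
           std::to_string(bits);
}
```

Calling it with `(true, 32)` and `(false, 32)` reproduces the two mnemonics quoted above.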
ISelDAG and MatcherTable
Abstract
NVIDIA's NVPTX SelectionDAG instruction selector turns legalized DAG nodes into NVPTX machine nodes and eventually into PTX instructions. The shape is mostly the familiar LLVM TableGen selector, with CUDA 13.1 additions for Blackwell tensor memory, tcgen05 MMA, block-scaled MMA, TMA, WGMMA, vector atomics, packed narrow-float conversion, and NVIDIA-specific validation.
For reimplementation, the C++ selector skeleton is not the main contract. The contract is the combination of:
- Intrinsic dispatch tables.
- Target feature gates.
- MatcherTable predicates and costs.
- NVPTX-specific DAG opcodes.
- Validator diagnostics for unsupported SM/PTX combinations.
- AsmWriter mnemonic and register-name tables.
Selector Layers
The selector has three major layers:
| Layer | Role |
|---|---|
| Intrinsic-with-chain selector | Handles NVVM intrinsics that carry memory or control-flow chain effects. |
| Vector load/store selector | Handles vectorized memory operations, tensor-memory loads/stores, and packed lane patterns. |
| TableGen MatcherTable | Handles ordinary generated patterns, complex predicates, and recursive pattern scoring. |
The fast selectors try highly structured NVIDIA-specific cases first. When a case returns "not selected", the ordinary MatcherTable path gets a shot at the node. The asymmetry matters: unsupported or unrecognized cases fall back rather than hard-fail, unless the target explicitly diagnoses the operation. Order is fixed — intrinsic-with-chain, then vector load/store, then the generic MatcherTable scorer — and a reimplementation that reorders these layers gets different per-target opcodes for the same DAG node.
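The fixed layer order with fall-through can be sketched as an ordered list of selectors, each of which either produces a machine opcode or declines. The types and names are illustrative; real selectors operate on SDNodes, not bare integers.

```cpp
#include <functional>
#include <optional>
#include <vector>

// A layer returns a machine opcode when it selects the node,
// or std::nullopt for "not selected".
using Selector = std::function<std::optional<unsigned>(unsigned node_opcode)>;

// Tries each layer in order; the first layer that selects wins.
std::optional<unsigned> select_node(const std::vector<Selector> &layers,
                                    unsigned node_opcode) {
    for (const Selector &layer : layers) {
        if (std::optional<unsigned> selected = layer(node_opcode))
            return selected;
    }
    return std::nullopt;  // no layer matched; the caller decides how to diagnose
}
```

Because the first match wins, swapping two layers changes the result for any node that both of them accept, which is the reordering hazard described above.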
The intrinsic-with-chain selector keys off the NVPTXISD opcode, not the LLVM intrinsic ID. Each non-default arm either calls a per-family emitter, delegates to a secondary intrinsic-ID dispatcher, or assembles a MachineSDNode inline. Custom families with their own behaviour are cvt_packfloat (FP8/FP6/FP4/UE8M0 format validation), tcgen05.mma (datacenter-Blackwell tensor-memory MMA), nvvm.red (address-space, type, FTZ, and cache-hint legality), cp.async and TMA bulk-tensor descriptor construction, WGMMA and mma.sync, the consumer-Blackwell mma.block_scale path, and the per-call unsafe-fp-math FTZ override on FMA. The remaining NVPTXISD opcodes fall through and let the MatcherTable produce the machine opcode.
NVPTXISD Pseudo-Opcodes
LLVM's SelectionDAG uses a fixed vocabulary of generic SDNodes (ISD::ADD, ISD::LOAD, ISD::CALL, and so on), and a TargetLowering callback chain turns those generic nodes into target-specific machine instructions. NVPTX does not always have the right shape on the generic side. Kernel parameters do not arrive in stack frames the way they do on most ISAs; PTX uses a special address space and explicit .param declarations. Call argument marshaling needs a paired bracket that says "everything between these two nodes is one call's setup". Register-class proxies need a chainable node that the legalizer can carry through type-coerced copies. NVPTX therefore introduces a private NVPTXISD::* opcode pool: target-specific SDNodes the selector can recognize but the generic LLVM codegen pipeline cannot. The selector emits these pseudo-opcodes during DAG legalization, then consumes them during instruction selection. A handful survive into the post-ISel MIR for a peephole pass to fold; the rest are gone by the time the AsmWriter prints.
The nine pseudo-opcodes the rest of this page references repeatedly are summarized below. The "introduced by" column names the lowering call that creates the node; the "carries" column lists the target-specific operand the generic SDNode could not.
| Pseudo-opcode | Introduced by | Carries | Consumed by |
|---|---|---|---|
| NVPTXISD::LoadParam | LowerFormalArguments for each ptr addrspace(101) arg | byte offset into the param-space record, type-class index | case 300 in SelectLoadStoreVector, then ld.param.* emission |
| NVPTXISD::StoreParam | LowerCall for each outgoing argument | argument index, alignment, ABI class (byval / direct / sret) | case 192 in SelectLoadStoreVector, then st.param.* emission |
| NVPTXISD::ParamCallStart | LowerCall immediately before the first StoreParam | call site ID, total parameter byte count | call-prototype emit (case 301), opens the .param block |
| NVPTXISD::ParamCallEnd | LowerCall immediately after the call's Glue | the matching call site ID | closes the .param block; pairs with ParamCallStart |
| NVPTXISD::ProxyReg | LowerCopyToReg when source and destination register classes differ | the underlying register class of the source value | NVPTXProxyRegErasure peephole (post-ISel) |
| NVPTXISD::DeclareScalarParam | LowerFormalArguments once per scalar parameter | parameter index, element size in bytes | header-emission pass that prints .param .{u32,u64,f32,...} _Z<arg>; |
| NVPTXISD::DeclareRetParam | LowerFormalArguments when the function has a struct return | return-record element size and alignment | same header-emission pass |
| NVPTXISD::PrintCall | LowerCall for void @vprintf(i8*, i8*) after argument marshaling | the printf format-string symbol | direct lowering to the vprintf call ABI |
| NVPTXISD::PrintCallUni | same as PrintCall when the call is provably uniform across the warp | same as PrintCall plus a uniform-call marker | uniform-call ABI emit; skips the per-lane mask |
LoadParam and StoreParam are the cleanest illustration of why the generic ISD::LOAD / ISD::STORE would not work. The PTX .param address space is not a memory in the usual sense — it cannot be aliased, cannot be reinterpreted across types, and the legal access pattern is one ld.param.<type> per parameter slot. The generic load would let the legalizer split a v4f32 parameter into four scalar loads at unaligned offsets, which would emit four scalar ld.param.f32 instructions but reference parameter slots that do not exist. The LoadParam opcode pins the access shape: the selector either matches it as a single ld.param.v4.f32 or it bails. The case-300 handler in SelectLoadStoreVector (sub_1A65F50) reads the byte offset, picks the right element type from the type-class index, and emits one wide ld.param per node.
ParamCallStart and ParamCallEnd exist for a structural reason. PTX wraps every call in a .param block:
{
.param .u32 _Zarg0;
.param .u32 _Zarg1;
st.param.u32 [_Zarg0], %r1;
st.param.u32 [_Zarg1], %r2;
call.uni _Z3foov, (_Zarg0, _Zarg1);
}
The block has a single entry, a single exit, and a fixed sequence: declarations first, stores second, the call instruction third. The generic ISD::CALL carries no notion of the surrounding block. Tileiras therefore inserts ParamCallStart before the first StoreParam and ParamCallEnd after the call's Glue node. Both pseudo-opcodes carry a 32-bit call site ID that pairs them; the case-301 handler in SelectLoadStoreVector walks from ParamCallStart to the matching ParamCallEnd and emits the entire block as a single unit. Without the bracket, an aggressive code motion pass could float a StoreParam for call B above the StoreParam for call A, and the PTX block structure would break.
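The fixed declarations / stores / call sequence can be illustrated with a small text emitter that reproduces the block shown above. This is a sketch restricted to .u32 arguments; `emit_call_block` and the `_Zarg<N>` naming are taken from the example only and are not a claim about the real AsmWriter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Emits one PTX call block: all .param declarations first, then all
// st.param stores, then the call itself, wrapped in braces.
std::string emit_call_block(const std::string &callee,
                            const std::vector<std::string> &arg_regs) {
    std::string out = "{\n";
    for (std::size_t i = 0; i < arg_regs.size(); ++i)
        out += ".param .u32 _Zarg" + std::to_string(i) + ";\n";
    for (std::size_t i = 0; i < arg_regs.size(); ++i)
        out += "st.param.u32 [_Zarg" + std::to_string(i) + "], " + arg_regs[i] + ";\n";
    out += "call.uni " + callee + ", (";
    for (std::size_t i = 0; i < arg_regs.size(); ++i) {
        if (i != 0) out += ", ";
        out += "_Zarg" + std::to_string(i);
    }
    out += ");\n}\n";
    return out;
}
```

Emitting the whole block from one function mirrors the single-unit treatment: nothing from another call can be interleaved between the opening and closing brace.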
ProxyReg is the most subtle of the set. NVPTX has a typed register hierarchy — %rd0 is a 64-bit register, %r0 is 32-bit, %h0 is 16-bit — and copies between classes need the right move instruction. The generic ISD::CopyToReg has no type-class information, so when the legalizer needs to copy a 32-bit value into a slot that gets later re-typed as 16-bit, it cannot tell which move to emit. LowerCopyToReg inserts a ProxyReg node that pins the source register class. The post-ISel NVPTXProxyRegErasure peephole then walks the MIR, identifies each ProxyReg, and replaces it with the right mov.* based on the recorded class. By the time the AsmWriter sees the MIR, no ProxyReg remains.
DeclareScalarParam and DeclareRetParam are pure marker nodes — they emit no machine instruction. Their entire purpose is to thread parameter metadata through the SDNode graph so a later pass that prints the function header can recover the parameter sizes and alignments. They sit in the chain only to prevent the DAG combiner from reordering them past the function entry point. A reimplementation that strips them out emits a kernel whose header lacks .param declarations and fails the PTX assembler.
PrintCall and PrintCallUni are the special-case vprintf lowering. The CUDA runtime exposes printf through a special ABI: the call passes the format string as a .param and a pointer to an argument buffer as a second .param. The selector chooses between the per-lane PrintCall and the warp-uniform PrintCallUni based on whether divergence analysis proved the call uniform; the uniform form skips the per-lane mask and emits a single call.uni vprintf rather than a predicate-guarded loop. Both are introduced by LowerCall and lowered without ever reaching the MatcherTable.
NVPTXISD Node Roster
The summary above sketches the handful of pseudo-opcodes the rest of this page returns to. The full set the binary still spells by name is wider. NVPTXTargetLowering::getTargetNodeName is a giant switch that maps each NVPTXISD::* enumerator to its string for -debug dumps; the cases that ship a string literal survive into the .rodata segment as NVPTXISD::<Name> C strings. Mining tileiras_strings.json for that prefix yields exactly 60 distinct names. These are the tileiras-side surface — the opcodes the NVPTX backend introduces during legalisation and consumes at instruction selection. The cicc-side catalog is larger (its NVPTXISD enum has roughly 460 entries spanning every intrinsic family the parser knows about), but most of those never reach the back-end selector because they collapse during early lowering into generic ISD::* opcodes or into target-specific machine opcodes inlined as numeric constants.
The roster below groups the 60 named nodes by structural family. The family is what the case body does, not where the opcode sits in the dispatch range; the same family can straddle several numeric brackets because NVIDIA appended new opcodes at the end of the enum across LLVM 17, 18, 19, and the LLVM-21-prerelease the binary fingerprints to.
Param-space ABI
These are the nodes the Lowering Formal Arguments and Lowering Calls passes inject around every kernel and device-function boundary. They translate the PTX .param address space — which has no natural representation in generic LLVM IR — into a chainable SDNode sequence.
| Node | Role | Vector widths |
|---|---|---|
| LoadParam | Scalar .param load on the callee side of an argument | scalar |
| LoadParamV2 | Aligned-pair .param load (legalised v2 aggregates) | v2 |
| LoadParamV4 | Aligned-quad .param load (legalised v4 aggregates) | v4 |
| StoreParam | Scalar .param store on the caller side of an argument | scalar |
| StoreParamV2 | Aligned-pair .param store | v2 |
| StoreParamV4 | Aligned-quad .param store | v4 |
| MoveParam | Marker that copies a .param-space pointer into a register class without emitting any PTX | scalar |
| DeclareParam | Header marker for an aggregate parameter (emits .param .align N .b8 _Zarg[<size>]) | - |
| DeclareScalarParam | Header marker for a scalar parameter (emits .param .{u32,u64,f32,...} _Zarg) | - |
| DeclareRet | Header marker for a scalar return value | - |
| DeclareRetParam | Header marker for a struct return (emits .param .align N .b8 _Zretval[<size>]) | - |
Call / control-flow brackets
Nodes that bracket calls and structured indirect branches. PTX requires a single-entry / single-exit .param block around every call, and brx.idx requires a label-bracketed jump table.
| Node | Role |
|---|---|
| CALL | The chainable call SDNode itself; carries the callee symbol and the operand sequence between matching ParamCallStart / ParamCallEnd brackets |
| CallPrototype | Emits the inline .callprototype directive for indirect calls whose target signature is not known at link time |
| RET_GLUE | The ret; opcode plus the chain glue that pins it after every StoreRetval |
| ProxyReg | Pins a source register class across a class-changing copy so NVPTXProxyRegErasure can pick the right mov.* post-ISel |
| BrxStart | Opens a structured brx.idx jump table |
| BrxItem | One label entry inside a BrxStart / BrxEnd bracket |
| BrxEnd | Closes the structured brx.idx jump table |
Vector load / store
Wide loads and stores. The base LoadV* / StoreV* family covers ordinary multi-lane access. The LoadExt* / StoreExt* family carries an additional extension-type operand (sign-extend, zero-extend, or any-extend) that the generic ISD::LOAD / ISD::STORE would have to encode in a separate field; pinning it on the opcode keeps the MatcherTable patterns one-to-one with PTX ld.{s,u}{8,16}.<dest> and st mnemonics. The Ver2 suffix is NVIDIA's post-LLVM-19 alternate encoding that swaps the chain-and-offset operand order so the generic vectoriser can synthesise wider transactions without backtracking through the existing match table.
| Node | Role |
|---|---|
| LoadV2, LoadV4, LoadV8 | Aligned vector loads of the indicated lane count |
| StoreV2, StoreV4, StoreV8 | Aligned vector stores of the indicated lane count |
| LoadExt | Scalar load with explicit extension type |
| LoadExtV2, LoadExtV4, LoadExtV8 | Vector load with explicit extension type per lane |
| StoreExt | Scalar store with explicit truncation type |
| StoreExtV2, StoreExtV4, StoreExtV8 | Vector store with explicit truncation type per lane |
| LoadExtVer2, LoadExtVer2V2, LoadExtVer2V4, LoadExtVer2V8 | Alternate-encoded extension-load variants (post-LLVM-19 surface) |
| StoreExtVer2, StoreExtVer2V2, StoreExtVer2V4, StoreExtVer2V8 | Alternate-encoded extension-store variants |
| LDUV2, LDUV4 | Uniform / cached vector loads from ldg.* paths that NVIDIA promoted into a typed pair of pseudo-opcodes |
Vector synthesis
| Node | Role |
|---|---|
| BUILD_VECTOR | Assembles a vector from scalar operands; distinct from upstream ISD::BUILD_VECTOR so PTX-specific lane-packing patterns can match without colliding with generic vector legalisation |
| UNPACK_VECTOR | Inverse of BUILD_VECTOR; explicitly splits a packed PTX vector into per-lane scalars when a later use needs scalar operands |
Predicate-set
| Node | Role |
|---|---|
| SETP_F16X2 | Packed f16x2 predicate-set; emits setp.<cmp>.f16x2 and produces a 2-bit predicate pair |
| SETP_BF16X2 | Packed bf16x2 predicate-set; emits setp.<cmp>.bf16x2 and produces a 2-bit predicate pair |
Arithmetic and bit manipulation
| Node | Role |
|---|---|
| BFI | Bit-field insert; lowers to PTX bfi.{b32,b64} |
| PRMT | Byte permute; lowers to PTX prmt.b32 (and the SM 8.0+ packed-FP prmt.f16x2 variants) |
| FCOPYSIGN | Copy-sign that the selector keeps as a target opcode because PTX has no single-instruction generic copysign for every type and the lane-by-lane lowering depends on the source MVT |
| FSHL_CLAMP | Funnel-shift left with the shift amount clamped to the operand width; folds the upstream ISD::FSHL + clamp idiom into one opcode |
| FSHR_CLAMP | Funnel-shift right with the shift amount clamped |
| MUL_WIDE_SIGNED | 32×32 → 64 signed widening multiply; lowers to mul.wide.s32 |
| MUL_WIDE_UNSIGNED | 32×32 → 64 unsigned widening multiply; lowers to mul.wide.u32 |
Stack / dynamic allocation
| Node | Role |
|---|---|
| DYNAMIC_STACKALLOC | The chainable alloca lowering that turns into alloca.u64 (or alloca.u32 on 32-bit) and threads the result through the local-memory bump pointer |
| STACKSAVE | Snapshots the current local-stack pointer for a later STACKRESTORE |
| STACKRESTORE | Restores the local-stack pointer to a saved value |
Cluster launch control
These four opcodes lower the Hopper / Blackwell clusterlaunchcontrol.query_cancel.* intrinsic family. They are the only string-table opcodes whose name spells out the PTX mnemonic in full, and they exist because the result selector reads only one field of the returned canceled-query record at a time.
| Node | Role |
|---|---|
| CLUSTERLAUNCHCONTROL_QUERY_CANCEL_IS_CANCELED | Returns the is_canceled predicate from a queried cancel record |
| CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_X | Reads ctaid.x of the first CTA whose launch was canceled |
| CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_Y | Reads ctaid.y of that CTA |
| CLUSTERLAUNCHCONTROL_QUERY_CANCEL_GET_FIRST_CTAID_Z | Reads ctaid.z of that CTA |
⚡ QUIRK — the 60 named opcodes are the survivors, not the whole NVPTXISD enum
The cicc-side NVPTXISD enum has roughly 460 entries because the parser front-end carries one enumerator per intrinsic family it knows about. Most of those never reach getTargetNodeName: they either collapse during early lowering (the TMA descriptor builders, the WGMMA operand marshallers, the tcgen05.mma family) into target-specific machine opcodes inlined directly into SelectLoadStoreVector and SelectIntrinsic_W_Chain, or they live behind numeric constants the matcher table consumes without a debug name. The 60 strings the binary still ships are the subset whose enum case in getTargetNodeName had a non-empty case NVPTXISD::Foo: return "NVPTXISD::Foo"; arm at compile time. A reimplementation that drives off the cicc enum will see node kinds the tileiras selector has no handler for; a reimplementation that drives off this 60-name list will miss every TMA / WGMMA / tcgen05 opcode the selector emits as a numeric constant.
⚡ QUIRK — LoadExt*Ver2 is not a version-2 of LoadExt*; it is the post-LLVM-19 alternate operand order
The Ver2 suffix on the eight LoadExtVer2* / StoreExtVer2* opcodes is a misleading name: it does not mark a newer revision of the same node. It marks NVIDIA's alternate-encoding surface, introduced when the upstream LLVM vectoriser started synthesising wider transactions that bypassed the existing match table. The two variants coexist: both encodings are valid SDNodes the selector still has to handle, and a reimplementation that treats Ver2 as superseding the unsuffixed form will fail to match nodes the legacy paths still emit. The selector dispatches on the opcode value, not on a versioning predicate; both forms route through the same case bodies in SelectLoadStoreVector with different operand offsets.
⚡ QUIRK — BUILD_VECTOR and UNPACK_VECTOR shadow upstream ISD::BUILD_VECTOR
Generic LLVM already has ISD::BUILD_VECTOR. NVPTX could in principle let the upstream patterns handle vector assembly, but PTX's lane-packing rules (two 16-bit lanes packed into one 32-bit register for v2f16 / v2bf16 / v2i16, four 8-bit lanes for v4i8, etc.) do not match the generic legaliser's split-and-recombine sequence. The private NVPTXISD::BUILD_VECTOR lets the selector match a single-node pattern that emits the right PTX mov.b32 packing in one shot; the inverse NVPTXISD::UNPACK_VECTOR does the symmetric job on extraction. The two pseudo-opcodes have the same semantic intent as their upstream cousins; the difference is purely about whose pattern table owns the match. Imports of upstream NVPTX tablegen that drop the private opcodes will reintroduce the split-and-recombine sequence and produce PTX with redundant movs the assembler cannot fold.
INTRINSIC_W_CHAIN Top-Level Dispatcher
In tileiras, select_intrinsic_with_chain materializes as sub_1A854E0 (NVPTXDAGToDAGISel::SelectIntrinsic_W_Chain) — 6 213 B, 509 basic blocks, with a single jump table at instruction 0x1A8551B driving the body. The dispatch key is not the LLVM intrinsic ID itself. It is the 32-bit overlay at SDNode + 24, packing the NVPTXISD opcode into the low 16 bits and the SDNode flag word into the high 16 bits. Intrinsic IDs enter only inside delegate handlers, which read SDNode + 72.
The switch declares 345 case slots across the dense range [0x17, 0x16F]. Of those, 58 carry distinct per-class bodies; the other 287 share a single fallthrough target at 0x4135C4, which returns zero so the surrounding trySelectNode (sub_1AAD9D0) can hand the node to the MatcherTable. The fallthrough is not an error path. It is the deliberate join that says "this NVPTXISD opcode is either handled by the generic TableGen patterns or reserved for an upstream LLVM opcode NVIDIA never customized in this selector." A reimplementation that treats fallthrough as a bug would over-diagnose 287 perfectly legal nodes.
Two cases short-circuit the search entirely. 0x17 and 0x18 correspond to the upstream ISD::INTRINSIC_VOID and ISD::INTRINSIC_WO_CHAIN opcodes, which a well-formed DAG should never route through the _W_CHAIN selector. Both join into the same body at 0x1A85817, which simply returns zero. Returning zero here differs from the fallthrough conceptually: it signals a routing error rather than a deferred decision. The calling convention forces the same outward behavior either way, and the MatcherTable gets a second chance to recognize the node before any diagnostic fires.
```c
unsigned __int64 select_intrinsic_w_chain(SelectorState *st, SDNode *node,
                                          ChainWrap *cw, SelectionDAG **dag,
                                          MachineFunction *mf,
                                          SDValue chain_in, SDValue glue) {
    /* 32-bit overlay at SDNode + 24: opcode in the low half, flags in the high half */
    uint32_t key = *(const uint32_t *)((const uint8_t *)node + 24);
    uint16_t isd_opcode = (uint16_t)key;
    switch (isd_opcode) {
    case 0x17: case 0x18:
        return 0; /* routing error */
    case 0x2F: /* cvt_packfloat */
        return select_cvt_packfloat(st, node, cw, dag, mf, chain_in);
    case 0x30: /* tcgen05 + nvvm.red */
        return select_intrinsic_wo_chain_dispatch(st, node, cw, dag, mf, chain_in);
    case 0x31: /* tcgen05.mma -> opcode 0x211 */
        return select_tcgen05_mma_fastpath(st, node, cw, dag);
    case 0x32: /* tcgen05.mma.ws -> opcode 0x212 */
        return select_tcgen05_mma_ws_fastpath(st, node, cw, dag);
    case 0x66: /* FMA with FTZ probe */
        return select_fma_with_unsafe_fp_math_probe(st, node, cw, dag, mf, chain_in);
    /* ... 53 more cases ... */
    default:
        return 0; /* fall through to MatcherTable */
    }
}
```
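The key split and the caller-side contract described above can be sketched in isolation. This is a minimal illustration under stated assumptions: the `key_*` names are hypothetical, the node is modelled as a raw byte buffer, and only the offset (24) and the low/high 16-bit split come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offset taken from the text: the 32-bit overlay sits at SDNode + 24. */
enum { KEY_OFFSET = 24 };

typedef struct {
    uint16_t opcode;   /* low 16 bits: NVPTXISD opcode */
    uint16_t flags;    /* high 16 bits: SDNode flag word */
} key_fields;

/* Reads the overlay through memcpy so the load is alignment-safe. */
key_fields key_read(const uint8_t *node) {
    uint32_t raw;
    key_fields k;
    assert(node != NULL);
    memcpy(&raw, node + KEY_OFFSET, sizeof raw);
    k.opcode = (uint16_t)(raw & 0xFFFFu);
    k.flags  = (uint16_t)(raw >> 16);
    return k;
}

/* Caller-side view: a zero result from the custom selector means the
 * node is handed to the MatcherTable, whatever the reason was. */
typedef enum { KEY_ROUTE_CUSTOM, KEY_ROUTE_MATCHER_TABLE } key_route;

key_route key_after_custom_select(uint64_t custom_result) {
    return custom_result != 0 ? KEY_ROUTE_CUSTOM : KEY_ROUTE_MATCHER_TABLE;
}
```

Because the routing-error cases and the default arm both return zero, the caller cannot tell them apart; any distinction has to live inside the selector itself.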
The body classes group into six dispatch families. Thirty-two cases delegate into a per-class NV emitter from the sub_1A6x pool — the cp.async, mbarrier, TMA, WGMMA, WMMA, and tcgen05 fast paths. Ten assemble SDNodes inline through the shared builder set sub_2005A50 / sub_2009D80 / sub_2009DB0 / sub_2004920 / sub_200ABE0 / sub_200B040. Eight re-delegate into a smaller secondary dispatcher: SelectIntrinsic_WO_Chain (sub_1A833C0, a 21-case inner switch), the cvt_packfloat fan-out (sub_1A85120, six intrinsic IDs), the tcgen05.mma fan-out (sub_1A80E40, fourteen IDs), or the nvvm.red emitters (sub_1A79EA0 and sub_1A79DE0). Two cases return zero unconditionally; two are pass-through identities.
Per-Case Dispatch Table
The 58 non-default bodies cluster into a small set of structurally related families. The table below lists each named case in dispatch order. First column: the NVPTXISD opcode in hex. Second: the family the case belongs to. Third: the delegate that owns the actual emission, with sub_ADDR notation when the binary contains a dedicated function and "inline" when the body emits SDNodes directly. Fourth: the NVIDIA-specific delta against the upstream LLVM SelectionDAGISel template.
| Case | Family | Delegate | NVIDIA delta vs upstream |
|---|---|---|---|
| 0x17 | unsupported | inline return 0 | Upstream ISD::INTRINSIC_VOID; routing-error guard. |
| 0x18 | unsupported | inline return 0 | Upstream ISD::INTRINSIC_WO_CHAIN; routing-error guard. |
| 0x2F | cvt.packfloat fan-out | sub_1A85120 | Six-way intrinsic-ID dispatch over FP8/FP6/FP4/UE8M0 conversion; not in upstream LLVM 18. |
| 0x30 | tcgen05 + nvvm.red | sub_1A833C0 | Re-enters the 21-case SelectIntrinsic_WO_Chain dispatcher; carries Blackwell tensor-memory IDs. |
| 0x31 | tcgen05.mma dense | sub_1A79180 | Bypasses ID dispatch and emits MI opcode 0x211 directly from a custom NVPTXISD opcode created during DAG legalization. |
| 0x32 | tcgen05.mma.ws | sub_1A70690 | Warp-specialized variant, emits MI opcode 0x212. |
| 0x37 | nvvm.red dense | sub_1A79EA0 | Atomic reduction with NV-only noftz/scope/cache-hint validators. |
| 0x38 | nvvm.red noftz | sub_1A79DE0 | f32/f64 reduction path with FTZ bit cleared; carries the "noftz not support for other types for nvvm.red" diagnostic. |
| 0x3A-0x3C | vector legalisation | inline at 0x1A85520 | Joined body for ld.vector.v{2,4} chain variants; emits MI opcode 0x9E wrapping per-lane 0xA0 extracts. |
| 0x3F-0x40 | vector legalisation | inline at 0x1A85520 | Same body, used for st.vector.v{2,4} chain variants. |
| 0x62-0x63 | f16/bf16 FMA | sub_1A6DEE0 | f16x2 fused-multiply-add; emits MI opcode 0x65. |
| 0x64 | f16/bf16 FMA | sub_1A6DEE0 | bf16x2 fused-multiply-add, sm_80+. |
| 0x66 | FMAD with FTZ probe | inline | Reads "unsafe-fp-math" Function attribute via sub_3FC6800 and picks between FTZ opcode 0x65 and non-FTZ wrapper opcode 0xF7; per-call override not present in upstream. |
| 0x9A | AS-marked load | sub_1A6D350 | ld.global with address-space tag wrap. |
| 0x9E | AS-marked load | sub_1A6A910 | ld.param dispatcher (addrspace 101). |
| 0x9F | AS-marked load | sub_1A6C600 | ld.const dispatcher (addrspace 4). |
| 0xA0 | AS-marked load | sub_1A6BF90 | ld.global.nc non-coherent load. |
| 0xA1 | cp.async commit/wait | sub_1A6A3C0 | cp.async.commit_group / cp.async.wait_group wrap. |
| 0xA3 | pass-through | inline return a2 | Already in canonical form; selector returns the incoming SDNode unchanged. |
| 0xA7 | AS-marked load cached | sub_1A6C7F0 | ld.param with cache-modifier suffix. |
| 0xB6-0xB9 | vector legalisation | inline at 0x1A85520 | BUILD_VECTOR v2/v4 f16/bf16/i16/i8 variants. |
| 0xBF-0xC0 | vector legalisation | inline at 0x1A85520 | BUILD_VECTOR v8 chain variants. |
| 0xC3-0xC4 | wmma load dense | sub_1A5F730 | Emits MI opcode 197 / 198 (WMMA_LOAD_DENSE, dense transposed). |
| 0xC5-0xC6 | wmma load sparse | sub_1A5F730 | Same emitter, sparse-descriptor variant; sm_80+. |
| 0xC9-0xCA | wmma store | inline | Wraps inner opcode 0xC9/0xCA with MI opcode 0xD8 (STORE_VECTOR_V2_MemRef, 16-byte alignment). |
| 0xCF | mma.sync / mma.sp | inline | Two paths: register-form (0xBC/0xBD load/store + 207/208 multiply-add) or ADDRESSOF-wrapped form (0xDA wrapper). |
| 0xD4 | cp.async.mbarrier | sub_1A6CEF0 | cp.async.mbarrier.arrive{.noinc}.shared.b64. |
| 0xD5-0xD6 | cp.async shared.global | sub_1A6CA70 | cp.async.{ca,cg}.shared.global; sm_80+. |
| 0xDE-0xDF | TMA 1-5D | sub_1A6E110 | cp.async.bulk.tensor.{1..5}d.global.shared::cta; intrinsic IDs 8941/8946. |
| 0xE4-0xE5 | TMA reduce | sub_1A6E200 | 8-arm reduce family (intrinsic IDs 8974-9011). |
| 0xE8 | TMA shared::cluster | sub_1A6E2F0 | cp.async.bulk.tensor.Nd.shared::cluster.global.mbarrier (ID 8951). |
| 0xEB | TMA shared::cta | sub_1A6E6C0 | cp.async.bulk.tensor.Nd.shared::cta.global (ID 8956). |
| 0xEC | mbarrier family | sub_1A6A6E0 | mbarrier.{init,inval,arrive,arrive.noComplete} inner dispatch. |
| 0xED | st.bulk | sub_1A6ED90 | st.bulk.weak.shared::cta (IDs 0x23A4-0x23AA); sm_100. |
| 0x112 | TMA descriptor load | inline 2-way | Branches on MVT: MVT::i32 (12) -> sub_1A6D560; MVT::i64 (13) -> sub_1A6DA40; anything else falls through to BUG. |
| 0x12C | wgmma dense | sub_1A6FB40 | wgmma.mma_async.sync.aligned.mNnNkN.type.layout (ID 0x226A); sm_90a. |
| 0x12D | wgmma block variant | sub_1A705A0 | 10-operand block variant (ID 0x245C). |
| 0x12E | wgmma control | sub_1A69A70 | wgmma.fence / commit_group / wait_group (IDs 0x225D-0x225F). |
| 0x131 | cluster control | sub_1A6EB10 | clusterlaunchcontrol.* and griddepcontrol.* family. |
| 0x13B | mbarrier try_wait | sub_1A6A0F0 | mbarrier.try_wait{.parity,.timelimit}.shared.b64. |
| 0x13C | cluster sync | sub_1A69EE0 | barrier.cluster.{arrive,arrive.relaxed,wait}. |
| 0x13F | prefetch.tensormap | sub_1A6EF30 | TMA descriptor prefetch (prefetch.tensormap). |
| 0x142 | mma.block_scale | sub_1A78E20 | mma.block_scale.sync.aligned.mNnNkN (ID 0x24B6); sm_100a, consumer-Blackwell substitute for tensor-memory MMA. |
| 0x16F | BUILD_VECTOR remap | inline 195-line body | Remaps source MVT to output MI opcode: v4f32/f16x2/bf16x2 -> opcode 561; v8f32 -> opcode 544; any other lane class falls through to BUG. |
The remaining 287 slots are the holes between named bodies. Their dense ranges collapse into a small set of intervals: 0x19-0x2E, 0x33-0x36, 0x39, 0x3D-0x3E, 0x41-0x61, 0x65, 0x67-0x99, 0x9B-0x9D, 0xA2, 0xA4-0xA6, 0xA8-0xB5, 0xBA-0xBE, 0xC1-0xC2, 0xC7-0xC8, 0xCB-0xCE, 0xD0-0xD3, 0xD7-0xDD, 0xE0-0xE3, 0xE6-0xE7, 0xE9-0xEA, 0xEE-0xEF, 0xF0-0x111, 0x113-0x12B, 0x12F-0x130, 0x132-0x13A, 0x13D-0x13E, 0x140-0x141, and 0x143-0x16E. The longest contiguous run is 0x67-0x99 (51 slots). The 0xF0-0x111 run (34 slots) covers the shfl, vote, match, redux, st, and atom families the MatcherTable handles via TableGen patterns, and the 0x143-0x16E run (44 slots) is the post-mma.block_scale Blackwell opcode window.
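The slot arithmetic can be cross-checked mechanically. The sketch below merges the named cases from the per-case table into inclusive ranges and derives the holes from the gaps between them; the `slot_*` names are hypothetical. Note that a gap computation treats 0xEE-0xEF and 0xF0-0x111 as one adjacent run, so it reports 27 gaps where the prose lists 28 intervals.

```c
#include <stddef.h>

/* Inclusive opcode ranges of the named cases, merged from the per-case table. */
typedef struct { unsigned lo, hi; } slot_range;

static const slot_range slot_named[] = {
    {0x17, 0x18},   {0x2F, 0x32},   {0x37, 0x38},   {0x3A, 0x3C},
    {0x3F, 0x40},   {0x62, 0x64},   {0x66, 0x66},   {0x9A, 0x9A},
    {0x9E, 0xA1},   {0xA3, 0xA3},   {0xA7, 0xA7},   {0xB6, 0xB9},
    {0xBF, 0xC0},   {0xC3, 0xC6},   {0xC9, 0xCA},   {0xCF, 0xCF},
    {0xD4, 0xD6},   {0xDE, 0xDF},   {0xE4, 0xE5},   {0xE8, 0xE8},
    {0xEB, 0xED},   {0x112, 0x112}, {0x12C, 0x12E}, {0x131, 0x131},
    {0x13B, 0x13C}, {0x13F, 0x13F}, {0x142, 0x142}, {0x16F, 0x16F},
};

enum { SLOT_NAMED_COUNT = sizeof slot_named / sizeof slot_named[0] };

/* Number of case values covered by the named ranges. */
unsigned slot_named_total(void) {
    unsigned total = 0;
    for (size_t i = 0; i < SLOT_NAMED_COUNT; ++i)
        total += slot_named[i].hi - slot_named[i].lo + 1;
    return total;
}

/* Writes each gap between consecutive named ranges to out[] and returns
 * the gap count. out needs room for SLOT_NAMED_COUNT - 1 entries. */
size_t slot_collect_holes(slot_range *out) {
    size_t n = 0;
    for (size_t i = 0; i + 1 < SLOT_NAMED_COUNT; ++i) {
        unsigned lo = slot_named[i].hi + 1;
        unsigned hi = slot_named[i + 1].lo - 1;
        if (lo <= hi) {
            out[n].lo = lo;
            out[n].hi = hi;
            ++n;
        }
    }
    return n;
}

/* Sums the hole widths and reports the widest hole through *widest. */
unsigned slot_hole_total(unsigned *widest) {
    slot_range holes[SLOT_NAMED_COUNT];
    size_t n = slot_collect_holes(holes);
    unsigned total = 0;
    *widest = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned w = holes[i].hi - holes[i].lo + 1;
        total += w;
        if (w > *widest)
            *widest = w;
    }
    return total;
}
```

With the table as transcribed, the named ranges cover 58 values and the gaps cover 287, which together span 0x17 through 0x16F.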
Delegate Map and Secondary Dispatchers
Three named cases fan out further before any opcode is emitted, and each secondary dispatcher carries its own dispatch key.
The first is the cvt.packfloat fan-out at sub_1A85120. It reads the LLVM intrinsic ID from SDNode + 72 (a separate field from the NVPTXISD opcode at SDNode + 24 that keys the outer switch) and routes among six destinations. Intrinsics 8294 and 9123 are identity passes that return *(_QWORD *)(SDNode + 40) unchanged. IDs 8437-8440 form a four-arm block that emits MI opcode 0xEC through sub_200ABE0 followed by a 0xA0-style multiply-add wrap. ID 8627 delegates into sub_1A84900, the cvt_packfloat validator that emits "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78". IDs 9531-9537 emit MI opcode 0x20C (cvt.rn.satfinite.*x2.f32) through sub_2005A50, keyed by a seven-entry per-ID opcode table at dword_4D0DE60.
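The routing of that fan-out reduces to a handful of range checks. The sketch below is illustrative only: the `cvt_*` names are hypothetical, the route labels stand in for the real emitters, and only the ID values come from the paragraph above.

```c
#include <stdint.h>

typedef enum {
    CVT_IDENTITY,    /* 8294, 9123: return the incoming operand unchanged */
    CVT_EMIT_EC,     /* 8437-8440: four-arm block emitting MI opcode 0xEC */
    CVT_ARCH_GATE,   /* 8627: delegates to the validator at sub_1A84900 */
    CVT_EMIT_20C,    /* 9531-9537: MI opcode 0x20C via the per-ID table */
    CVT_UNHANDLED    /* anything else is not routed by this fan-out */
} cvt_route;

/* Classifies an intrinsic ID into one of the fan-out destinations. */
cvt_route cvt_classify(uint32_t id) {
    if (id == 8294u || id == 9123u)  return CVT_IDENTITY;
    if (id >= 8437u && id <= 8440u)  return CVT_EMIT_EC;
    if (id == 8627u)                 return CVT_ARCH_GATE;
    if (id >= 9531u && id <= 9537u)  return CVT_EMIT_20C;
    return CVT_UNHANDLED;
}

/* Index into the seven-entry per-ID opcode table for the 9531-9537
 * block, or -1 when the ID lies outside that block. */
int cvt_table_index(uint32_t id) {
    if (id < 9531u || id > 9537u)
        return -1;
    return (int)(id - 9531u);
}
```

A reimplementation would replace each route label with the matching emitter call; the diagnostic text behind `CVT_ARCH_GATE` has to be reproduced verbatim, as noted further down.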
The second is SelectIntrinsic_WO_Chain at sub_1A833C0, a 5 435 B 21-case dispatcher reached from case 0x30. The IDs it routes include 8376 (tcgen05.alloc), 8381 (tcgen05.dealloc), 9132 (128-bit atomic via sub_1A80A40), 9136 (tcgen05.cp), 9149 (tcgen05.ld), 9150 (tcgen05.st), 9399 (tcgen05.wait), the 9669-9671 commit triple, 9779 / 9811 (sparse texture), 9848 / 9853 (nvvm.red), 9856 (tcgen05.mma emitting MI opcode 0x211 through sub_2015B50(..., 530, ...)), 9857 (tcgen05.mma.ws), and the 10521-10530 tcgen05.mma block. The latter group is itself fan-routed by sub_1A80E40, the 230-basic-block SelectTcgen05Mma super-dispatcher that handles fourteen distinct intrinsic IDs.
The third is the cvt_packfloat-and-tcgen05 architecture gate at sub_1A84900, shared between the SM and PTX-version probes for case 0x2F's sm_90+ requirement, FP6/FP4 arch-conditional checks, and the UE8M0 path. Its diagnostic strings are part of the binary-test contract: paraphrase them in a reimplementation and NVIDIA's regression suite breaks.
The architecturally important consequence of this layering: the outer 345-case switch carries only 58 distinct bodies, secondary dispatchers add roughly 50 more intrinsic-ID-keyed arms, and the MatcherTable contributes the remaining ~200 NVVM IDs as TableGen patterns. The intrinsic-ID space dispatched by sub_1A854E0 and its delegates spans [8294, 10995] and contains contiguous runs that mirror NVIDIA's PTX feature blocks: tcgen05 commits at 9669-9671, the tcgen05.mma block at 10521-10530, Hopper WGMMA singletons at 0x225D-0x225F, 0x226A, and 0x245C, the cp.async.bulk.tensor block at 8919-8956 with the 8974-9011 reduce extension, and the FP8/FP6/FP4 conversion block at 8305-8308.
Intrinsic-ID Range Map
Below the outer opcode switch, every NVPTX selector path (the SelectIntrinsic_W_Chain delegates, SelectIntrinsic_WO_Chain, the cvt_packfloat fan-out, the tcgen05.mma fan-out, and the MatcherTable cost scorer) dispatches on one 32-bit intrinsic ID stored at *(uint32_t *)(SDNode + 72). The map below consolidates which ID range belongs to which family, which sub_ADDR delegate handles it, which PTX op family it emits, and which SM-and-PTX target gate guards the emission. It folds together the per-case dispatch table above and the secondary-dispatcher fan-outs so a reimplementation can look up a single intrinsic in one place rather than walking three nested switch statements.
| ID range | Family | Selector | PTX family | SM gate |
|---|---|---|---|---|
| 8294 | cvt_packfloat (FP6) | sub_1A85120 | cvt.rn.satfinite.fp6x2.f32 | sm_90 + PTX>=78 |
| 8305-8308 | FP8/FP6/FP4 conversion entry | inline | cvt.{e4m3,e5m2,...}.fp32 | sm_89 |
| 8376 | tcgen05.alloc.shared | sub_1A80E40 arm 0 | tcgen05.alloc.shared::cta.b32 | sm_100 + tmem |
| 8381 | tcgen05.dealloc.shared | sub_1A80E40 arm 1 | tcgen05.dealloc.shared::cta | sm_100 + tmem |
| 8422 | imma.stc | inline | mma.sp.sync.aligned.m8n8k16 | sm_80 |
| 8437-8440 | cvt_packfloat (UE8M0x2) | sub_1A85120 | cvt.ue8m0x2.{fp8,fp16} | sm_100a |
| 8481-8503 | cp.async.bulk.tensor G2S | inline | cp.async.bulk.tensor.{shared, global}::cluster | sm_90 |
| 8519-8582 | wmma | inline | wmma.{load, store, mma.sync.aligned} | sm_70+ |
| 8592-8596 | TMA store | inline | cp.async.bulk.tensor.global.shared::cta | sm_90 |
| 8627 | cvt_packfloat (FP4x2) | sub_1A85120 | cvt.fp4x2.{fp16,fp8} | sm_100a |
| 8919-8956 | cp.async.bulk.tensor block | inline | cp.async.bulk.tensor.* | sm_90 |
| 8974-9011 | cp.async.bulk.tensor reduce | inline | cp.async.bulk.tensor.reduce.* | sm_90 |
| 9045 / 9059 / 9069 | mma type-A | inline | mma.sync.aligned.{f16,f32} | sm_80 |
| 9098 / 9106 / 9114 / 9122 | imma.stc | inline | mma.sp.sync.aligned.{i8,i4} | sm_80 |
| 9123 | cvt_packfloat (E8M0) | sub_1A85120 | cvt.{e8m0,bf16} | sm_100a |
| 9132 | tcgen05 128-bit atom | sub_1A80E40 arm 2 | tcgen05.atom.b128 | sm_100 + tmem |
| 9136 | tcgen05.cp.shared | sub_1A80E40 arm 3 | tcgen05.cp.shared::cta | sm_100 + tmem |
| 9149 / 9150 | tcgen05.ld / tcgen05.st | sub_1A80E40 arms 4/5 | tcgen05.{ld,st} | sm_100 + tmem |
| 9153-9170 | ldmatrix | inline | ldmatrix.sync.aligned.* | sm_75 |
| 9271 / 9272 | shape-class | inline | (selector marker) | - |
| 9308 / 9398 | SM120 block-scaled | inline | mma.block_scale.sync.aligned | sm_120 |
| 9399 | tcgen05.wait | sub_1A80E40 arm 6 | tcgen05.wait::cta | sm_100 + tmem |
| 9531-9537 | cvt_packfloat (E4M3) | sub_1A85120 | cvt.fp16x2.e4m3x2.* | sm_100 |
| 9669-9671 | tcgen05.commit / arrive | sub_1A80E40 arms 7-9 | tcgen05.{commit,arrive} | sm_100 + tmem |
| 9779 / 9811 | tcgen05 sparse texture | sub_1A80E40 (bit-test _bittest64(0x100000401, ID-9779)) | tcgen05.sp.* | sm_100 + tmem |
| 9848 / 9853 | nvvm.red f32/f64 + integer | inline | red.add.* | sm_70+ |
| 9856 / 9857 | tcgen05.mma / .ws | sub_1A80E40 arms 12/13 | tcgen05.mma{,.ws}.sync | sm_100 + tmem |
| 9858-9866 | stmatrix | inline | stmatrix.sync.aligned.* | sm_90 |
| 10379-10382 | alt stmatrix | inline | stmatrix.{16, 32}.aligned.* | sm_90 |
| 10521-10530 | tcgen05.mma family | sub_1A80E40 arms 14+ | tcgen05.mma.*{sp, block_scale, sp.block_scale} | sm_100a + PTX>=7.7 |
| 10535-10571 | mbarrier/fence/barrier expand | inline | mbarrier.*, fence.* | sm_70+ |
The full intrinsic-ID space dispatched out of sub_1A854E0 and its delegates is [8294, 10995]. Gaps between named rows above are reserved holes NVIDIA leaves for future PTX feature blocks. The MatcherTable absorbs them as TableGen patterns or routes them to the upstream SelectCode path; a reimplementation should leave the same holes rather than densifying the dispatch.
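One way to consume the range map is a sorted table with a binary search that returns nothing for a hole. The sketch below copies six rows from the map for illustration; the `idmap_*` names are hypothetical and the row subset is deliberately incomplete.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    lo, hi;    /* inclusive intrinsic-ID range */
    const char *family;
    const char *sm_gate;
} idmap_row;

/* A few rows copied from the range map, sorted by lo and non-overlapping. */
static const idmap_row idmap_rows[] = {
    {  8294,  8294, "cvt_packfloat (FP6)",        "sm_90 + PTX>=78"    },
    {  8437,  8440, "cvt_packfloat (UE8M0x2)",    "sm_100a"            },
    {  8919,  8956, "cp.async.bulk.tensor block", "sm_90"              },
    {  9153,  9170, "ldmatrix",                   "sm_75"              },
    {  9669,  9671, "tcgen05.commit / arrive",    "sm_100 + tmem"      },
    { 10521, 10530, "tcgen05.mma family",         "sm_100a + PTX>=7.7" },
};

/* Binary search over the rows. NULL means the ID falls in a hole, which
 * the selector leaves to the MatcherTable or the upstream path. */
const idmap_row *idmap_lookup(uint32_t id) {
    size_t lo = 0;
    size_t hi = sizeof idmap_rows / sizeof idmap_rows[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (id < idmap_rows[mid].lo)
            hi = mid;
        else if (id > idmap_rows[mid].hi)
            lo = mid + 1;
        else
            return &idmap_rows[mid];
    }
    return NULL;
}
```

Keeping holes as explicit misses, rather than padding the table, mirrors the advice above to leave the reserved gaps undensified.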
The bit-test on the sparse-texture row deserves a separate note. IDs 9779 and 9811 are 32 apart, so both offsets from 9779 fit inside the 64-bit mask 0x100000401, which sets bit 0 (ID 9779), bit 10 (ID 9789, the unused mid-slot), and bit 32 (ID 9811). The sub_1A80E40 arm reads the literal mask, subtracts 9779 from the incoming ID, and uses _bittest64 to select between two emission paths in a single instruction. A switch-based reimplementation must match the same two-arm fan-out even though the mask appears to admit a third bit.
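The difference between the mask test and a two-arm switch can be shown side by side. This is a portable sketch rather than the binary's code: `_bittest64` is replaced by a shift and mask, the `sparse_*` names are hypothetical, and the arm numbers 0 and 1 are arbitrary labels.

```c
#include <stdbool.h>
#include <stdint.h>

enum { SPARSE_BASE = 9779 };
static const uint64_t SPARSE_MASK = 0x100000401ull;

/* Portable stand-in for the mask test. The range guard keeps IDs below
 * the base (which wrap to a large delta) or 64+ above it from shifting
 * out of range. */
bool sparse_mask_hit(uint32_t id) {
    uint32_t delta = id - (uint32_t)SPARSE_BASE;
    return delta < 64u && ((SPARSE_MASK >> delta) & 1u) != 0;
}

/* Two-arm switch form: only the two documented IDs select a path, and
 * the mid-slot 9789 is rejected even though its mask bit is set. */
int sparse_arm(uint32_t id) {
    switch (id) {
    case 9779: return 0;
    case 9811: return 1;
    default:   return -1;
    }
}
```

The mask accepts 9789 while the switch does not, which is exactly the discrepancy a reimplementation has to be aware of when it swaps one form for the other.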
Cross-references: tcgen05 commit/arrive layout and the WGMMA-side mbarrier wiring are documented in tcgen05 Machine Validation and WGMMA Emission; the per-register-class vtables that back ldmatrix and stmatrix sit in NVPTX RegisterClass vtables within ldmatrix/stmatrix Emission + Register Class Vtables; the TMA descriptor and cp.async.bulk.tensor IDs map to the descriptor encoders documented in cp.async.bulk Template Catalog.
Dispatch Dimensions by Intrinsic Family
The intrinsic-ID range map records which range maps to which family, but it does not show the shape of the lookup the dispatcher performs to choose between machine opcodes within a family. The atomic, warp-collective, MMA, mbarrier, TMA, and ldmatrix/stmatrix families each carry an opcode table indexed by a small product of orthogonal axes; the dispatcher reads the operand types and modifier bits to compute an index into that table. Tens to hundreds of machine opcodes hang off each family, so reproducing them as one switch case per opcode is unworkable. Reproducing the dispatch dimensions and the opcode table layout is what matters.
Atomic intrinsics
The atomic family covers nvvm.atomic.* and the lowered form of LLVM's atomicrmw and cmpxchg instructions. The dispatcher reads four independent axes and indexes a four-dimensional opcode table.
| Axis | Values | Source |
|---|---|---|
| Atomic kind | add, min, max, inc, dec, and, or, xor, exch, cas | low byte of intrinsic ID minus family base |
| Address space | global (1), shared::cta (3), shared::cluster (7), generic (0) | memop's address space field |
| Element type | i32, i64, f32, f64, f16x2, bf16x2, v2i64 | result MVT slot |
| Memory ordering | relaxed, acquire, release, acq_rel, sys scope | flag bits in the AtomicSDNode ordering field |
The resulting opcode is one of the ATOM_* machine opcodes. With the axis values listed above, the table has 10 kinds × 4 spaces × 7 types × 5 orderings = 1400 slots, but only ~280 are reachable because not every combination is legal in PTX. Float atomics exist only for add and exch; the bf16x2 variants only exist for add on sm_90+; the cas form requires two value operands and is dispatched through a separate sub-handler. The dispatcher computes a packed index (kind << 12) | (space << 8) | (type << 4) | ordering, looks it up in a perfect-hash table of valid combinations, and emits the matching opcode. An illegal combination is not a fallthrough: the dispatcher emits a diagnostic of the form "atom.<kind>.<type> not supported in address space <space>" and bails.
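The packed index can be sketched directly from the formula. Several things here are assumptions: the `atom_*` names, the numeric encodings of kind, type, and ordering, and the two opcode numbers are all made up for illustration, and a linear scan stands in for the perfect-hash table. Only the shift layout and the address-space numbers come from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Axis encodings. Address-space numbers follow the axis table; the
 * numbering of kinds, types, and orderings is an assumption. */
typedef enum { ATOM_ADD, ATOM_MIN, ATOM_MAX, ATOM_INC, ATOM_DEC,
               ATOM_AND, ATOM_OR, ATOM_XOR, ATOM_EXCH, ATOM_CAS } atom_kind;
typedef enum { ATOM_GENERIC = 0, ATOM_GLOBAL = 1,
               ATOM_SHARED_CTA = 3, ATOM_SHARED_CLUSTER = 7 } atom_space;
typedef enum { ATOM_I32, ATOM_I64, ATOM_F32, ATOM_F64,
               ATOM_F16X2, ATOM_BF16X2, ATOM_V2I64 } atom_type;
typedef enum { ATOM_RELAXED, ATOM_ACQUIRE, ATOM_RELEASE,
               ATOM_ACQ_REL, ATOM_SYS } atom_order;

/* Packed index from the text: four bits per axis. */
#define ATOM_KEY(k, s, t, o) \
    (((uint32_t)(k) << 12) | ((uint32_t)(s) << 8) | \
     ((uint32_t)(t) << 4)  |  (uint32_t)(o))

typedef struct { uint32_t key; uint16_t opcode; } atom_entry;

/* Two illustrative legal combinations; the opcode numbers are invented. */
static const atom_entry atom_table[] = {
    { ATOM_KEY(ATOM_ADD,  ATOM_GLOBAL,     ATOM_F32, ATOM_RELAXED), 0x1001 },
    { ATOM_KEY(ATOM_EXCH, ATOM_SHARED_CTA, ATOM_I32, ATOM_ACQUIRE), 0x1002 },
};

/* Linear scan standing in for the perfect-hash lookup. Returns 0 when
 * the combination is absent, which is where the diagnostic would fire. */
uint16_t atom_lookup(uint32_t key) {
    for (size_t i = 0; i < sizeof atom_table / sizeof atom_table[0]; ++i)
        if (atom_table[i].key == key)
            return atom_table[i].opcode;
    return 0;
}
```

Four bits per axis is enough for every axis listed, since the largest value in play is address space 7 and the longest axis has ten entries.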
Warp-collective intrinsics
Warp-collective intrinsics (shfl.sync, vote.sync, match.sync, redux.sync, barrier.sync) all carry a 32-bit thread mask as their first operand. The dispatcher reads four axes:
| Axis | Values | Source |
|---|---|---|
| Collective kind | shfl.bfly, shfl.up, shfl.down, shfl.idx, vote.all, vote.any, vote.uni, vote.ballot, match.all, match.any, redux.add, redux.min, redux.max, redux.and, redux.or, redux.xor | intrinsic ID minus family base |
| Operand element type | i32, i64, f32, b1 (for vote.ballot), b32 (for match.any) | result MVT slot |
| Lane-mask form | literal immediate (the 0xFFFFFFFF "all lanes" case) or runtime SDValue | constness of the first operand |
| Sync mode | the .sync suffix is mandatory on sm_70+, optional and deprecated on older targets | subtarget feature gate |
The resulting opcode is one of SHFL_SYNC_*, VOTE_SYNC_*, MATCH_SYNC_*, REDUX_SYNC_*. The literal-mask path is privileged: when the dispatcher detects the all-lanes constant 0xFFFFFFFF at codegen time, it emits the *_FULL variant of the opcode (e.g. SHFL_SYNC_BFLY_I32_FULL), which the AsmWriter prints without the mask operand. The variant exists because PTX accepts the bare shfl.sync.bfly.b32 %r0, %r1, %r2, %r3 without the leading 0xFFFFFFFF argument, and the bytes saved per instruction add up across a warp-collective-heavy kernel. The runtime-mask path emits the general opcode with the mask as an additional source operand.
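The literal-mask decision is a two-condition test. The sketch below models only that choice for one opcode pair; the `warp_*` names and the operand-count helper are hypothetical, while the opcode names and the 0xFFFFFFFF constant are taken from the paragraph above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     is_constant;   /* true when the mask operand is a literal */
    uint32_t value;         /* meaningful only when is_constant is true */
} warp_mask;

typedef enum {
    WARP_SHFL_SYNC_BFLY_I32,        /* general form, mask is an operand */
    WARP_SHFL_SYNC_BFLY_I32_FULL    /* all-lanes form, mask is implied */
} warp_opcode;

/* Picks the _FULL variant only for a compile-time all-lanes mask. */
warp_opcode warp_pick_shfl_bfly_i32(warp_mask m) {
    bool all_lanes = m.is_constant && m.value == 0xFFFFFFFFu;
    return all_lanes ? WARP_SHFL_SYNC_BFLY_I32_FULL
                     : WARP_SHFL_SYNC_BFLY_I32;
}

/* Source-operand count for the chosen opcode, given the number of
 * non-mask operands: the general form carries the mask as one extra. */
unsigned warp_source_operands(warp_opcode op, unsigned non_mask_operands) {
    return op == WARP_SHFL_SYNC_BFLY_I32_FULL ? non_mask_operands
                                              : non_mask_operands + 1;
}
```

A runtime mask that happens to hold 0xFFFFFFFF still takes the general path, because the test is on constness at codegen time rather than on the value at run time.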
MMA / tcgen05 / WGMMA intrinsics
Matrix-multiply intrinsics span the largest dispatch surface in the entire NVPTX selector. The dispatcher reads five orthogonal axes per family.
| Axis | Values | Source |
|---|---|---|
| Engine | mma.sync (sm_70-sm_80), wgmma (sm_90), tcgen05.mma (sm_100 + tmem), mma.block_scale (sm_100a / sm_120) | family base of intrinsic ID |
| Shape | m8n8k4, m16n8k8, m16n8k16, m16n8k32, m64n128k16, m64n256k32, m128n256k16 (60+ shapes) | shape operand encoded in the intrinsic ID's low nibble |
| A / B / C element type | f16, bf16, tf32, f32, f64, s8, u8, s4, u4, b1, fp8.e4m3, fp8.e5m2, fp6.e2m3, fp4.e2m1 | per-operand MVT slots |
| Layout | row.row, row.col, col.row, col.col for A and B | bits 12-13 of the intrinsic ID |
| Sparsity / scaling | dense, structured-sparse (.sp), block-scaled (.block_scale) | family base of intrinsic ID |
The dispatcher packs all five axes into a 32-bit lookup key and either reaches a perfect-hash table or fans out through a multi-level switch. For the tcgen05.mma family the fan-out lives at sub_1A80E40 and has 230 basic blocks; for WGMMA it lives at sub_1A6FB40; for the older mma.sync family it lives in the inline body of case 0xCF. Each fan-out emits one of MMA_*, WGMMA_*, TCGEN05_MMA_*, or MMA_BLOCK_SCALE_* machine opcodes. The total opcode count across all four engines exceeds 800 because every legal shape × type × layout combination gets its own opcode; the AsmWriter prints them with mnemonic suffixes assembled from the axis values.
mbarrier intrinsics
The mbarrier family is structurally simpler. The dispatcher reads three axes.
| Axis | Values | Source |
|---|---|---|
| Operation | init, inval, arrive, arrive.noComplete, arrive.expect_tx, expect_tx, try_wait, try_wait.parity, complete_tx | low byte of intrinsic ID minus family base |
| Address space | shared::cta (3), shared::cluster (7) | memop's address space field |
| Timeout variant | base form, .timelimit variant (adds 64-bit timeout operand) | bit 4 of the intrinsic ID |
The resulting opcode is one of MBARRIER_INIT_*, MBARRIER_INVAL_*, MBARRIER_ARRIVE_*, MBARRIER_TRY_WAIT_*, etc. The table has 9 ops × 2 spaces × 2 timeout variants = 36 slots, of which 24 are legal in PTX. The try_wait.parity form is its own dispatch arm because it returns a predicate value the rest of the dispatcher must wire through a CopyFromReg pseudo; the other arms emit a single MachineSDNode.
TMA bulk-tensor intrinsics
The TMA family (cp.async.bulk.tensor.*) has the second-largest dispatch surface after MMA. The dispatcher reads six axes.
| Axis | Values | Source |
|---|---|---|
| Rank | 1, 2, 3, 4, 5 | bits 0-2 of the intrinsic ID's low nibble |
| Mode | tile (no row-major remap), im2col (row-major remap for convolution feeds) | bit 3 of the intrinsic ID |
| Direction | global -> shared::cluster, shared::cta -> global (store), global -> shared::cta (load) | family base of intrinsic ID |
| Multicast | none, multicast::cluster (broadcasts to multiple CTAs in a cluster) | bit 4 of the intrinsic ID |
| Cache hint | none, l2::cache_hint (carries a 64-bit cache-policy descriptor as extra operand) | bit 5 of the intrinsic ID |
| Reduce kind | none, add, min, max, inc, dec, and, or, xor | sub-family base in the reduce range |
The resulting opcode is one of the 40+ CP_ASYNC_BULK_TENSOR_* machine opcodes. Combinations are not free: multicast requires the global-to-shared::cluster direction; reduce requires the shared-to-global direction; im2col is only legal for rank ≥ 3. The dispatcher checks each constraint before computing the opcode and emits a diagnostic on an illegal combination. The mbarrier operand that tracks the bulk-tensor completion is wired through a separate operand slot the dispatcher reads from SDNode + 80 (the memop list head).
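The three constraints named above can be written as a pre-check that runs before any opcode is computed. The sketch below is an assumption-laden illustration: the enum names and the function name are hypothetical, and it encodes only the rules stated in the text (multicast needs the global-to-shared::cluster direction, reduce needs the shared-to-global direction, im2col needs rank ≥ 3).

```c
#include <stdbool.h>

/* Hypothetical axis encodings; the real dispatcher reads these from
 * intrinsic-ID bits and family bases. */
typedef enum { TMA_TILE, TMA_IM2COL } TmaMode;
typedef enum { TMA_G2S_CLUSTER, TMA_S2G, TMA_G2S_CTA } TmaDir;

static bool tma_combo_is_legal(unsigned rank, TmaMode mode, TmaDir dir,
                               bool multicast, bool reduce) {
    if (rank < 1 || rank > 5) return false;              /* rank axis is 1..5 */
    if (multicast && dir != TMA_G2S_CLUSTER) return false;
    if (reduce && dir != TMA_S2G) return false;
    if (mode == TMA_IM2COL && rank < 3) return false;    /* im2col needs rank >= 3 */
    return true;
}
```

In a reimplementation, a false result here would map to the diagnostic path, never to a fallback lowering.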
ldmatrix / stmatrix intrinsics
The ldmatrix and stmatrix family is the smallest of the structured dispatches. The dispatcher reads five axes.
| Axis | Values | Source |
|---|---|---|
| Direction | ldmatrix (shared -> register), stmatrix (register -> shared) | family base |
| Matrix shape | m8n8, m16n8, m8n16 | bits 0-1 of the intrinsic ID's low nibble |
| Element type | b16 (default), b8 (sm_100+), b8x16.b6x16_p32, b8x16.b4x16_p64 | bits 2-3 of the intrinsic ID |
| Transpose | direct, transpose (.trans) | bit 4 of the intrinsic ID |
| Lane count | x1, x2, x4 (how many matrices loaded in one instruction) | bits 5-6 of the intrinsic ID |
The resulting opcode is one of LDMATRIX_* / STMATRIX_*. The total table has 2 directions × 3 shapes × 4 types × 2 transpose × 3 lane counts = 144 slots, of which roughly 60 are legal. The transpose bit only applies to m8n8.b16; the b8 variants only exist on m16n8 and require sm_100+; x4 is illegal for stmatrix because of register-file pressure constraints. The dispatcher reads each axis bit-by-bit and indexes a flat array of opcode constants rather than walking a switch tree; the table fits in a single cache line and the bit-shift-mask-load sequence is faster than a five-deep nested switch.
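The bit-shift-mask sequence can be sketched as follows. The bit positions for shape, element type, transpose, and lane count are the ones listed in the table; the struct, the function names, and the choice to place the direction in bit 7 of the flat index are assumptions made for illustration (the binary takes the direction from the family base).

```c
#include <stdint.h>

typedef struct {
    unsigned shape;      /* bits 0-1 */
    unsigned elem_type;  /* bits 2-3 */
    unsigned transpose;  /* bit 4    */
    unsigned lanes;      /* bits 5-6 */
} LdStAxes;

/* Hypothetical decoder for the modifier bits of an ldmatrix/stmatrix ID. */
static LdStAxes decode_ldst_axes(uint32_t modifier_bits) {
    LdStAxes a;
    a.shape     = modifier_bits & 0x3u;
    a.elem_type = (modifier_bits >> 2) & 0x3u;
    a.transpose = (modifier_bits >> 4) & 0x1u;
    a.lanes     = (modifier_bits >> 5) & 0x3u;
    return a;
}

/* Assumed flat-table index: direction in bit 7, axes in their source bits. */
static unsigned ldst_table_index(unsigned direction, LdStAxes a) {
    return ((direction & 1u) << 7) | (a.lanes << 5) | (a.transpose << 4) |
           (a.elem_type << 2) | a.shape;
}
```

A sparse power-of-two index like this wastes some slots compared with the dense 144-entry count, but it avoids any multiply on the lookup path.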
Common shape
All six families share a dispatch shape: read the intrinsic ID's family base, read the orthogonal axes from operand types and modifier bits, pack them into a small index, look up the machine opcode in a flat table, and bail with a diagnostic if the combination is illegal. None of the dispatchers attempts a fallback to a sequence of smaller instructions — an unsupported MMA shape is a hard error, not a software-emulated fallback. The PTX programmer expects the intrinsic to compile or to fail; silent emulation would mask hardware-feature mismatches. A reimplementation must preserve the bailout: replacing a diagnostic with a generic-lowering fallback breaks NVIDIA's regression suite, which asserts on exact error strings.
The unsafe-fp-math FTZ Probe in Case 0x66
Case 0x66 is the architecturally important inline body in sub_1A854E0. It is the clearest demonstration of how NVIDIA's selector differs from upstream TargetOption-layer FTZ control. Upstream LLVM picks FTZ-flavored FMA opcodes at module level: the denormal-fp-math and nvptx-f32ftz codegen options get read once when the TargetMachine is constructed, and every FMA in the module inherits the same FTZ semantics. The case-0x66 body in tileiras probes the per-function attribute on each individual FMA selection and emits one of two different MI opcode sequences depending on the result.
The probe itself is a string-key lookup against the llvm::Function attribute table. The selector takes the Function * from the surrounding MachineFunction (a5 in the function ABI), passes it to sub_3FC6800, and asks for the value of the attribute named "unsafe-fp-math". The length argument of 14 is strlen("unsafe-fp-math"), consumed by the attribute lookup helper, which compares the key in length-prefixed form.
SDValue select_fma_with_unsafe_fp_math_probe(SelectorState *st, SDNode *node,
                                             ChainWrap *cw, SelectionDAG **dag,
                                             MachineFunction *mf,
                                             SDValue chain_in) {
    /* Probe the per-node flag word first; the function attribute is only
       consulted when bit 0x40 is clear. */
    uint16_t flags = sdnode_flags(node);
    bool use_ftz_path = (flags & 0x40) != 0;
    if (!use_ftz_path) {
        LLVMFunction *func = machine_function_get_llvm_function(mf);
        use_ftz_path = attribute_table_has(func, "unsafe-fp-math", 14);
    }
    if (use_ftz_path) {
        SDValue mul   = emit_node(dag, 0x65,  chain_in, /* FMA */ ...);
        SDValue wrap  = emit_node(dag, 0x10F, mul,      /* FTZ_WRAP */ ...);
        SDValue inner = emit_node(dag, 0x64,  wrap,     /* FMAD, flags=512 */ ...);
        return emit_node(dag, 0x63, inner,              /* FMAD, flags=512 */ ...);
    } else {
        SDValue alt  = emit_node(dag, 0xF7, chain_in, /* FMA_NON_FTZ wrapper */ ...);
        SDValue addr = emit_node(dag, 0xD2, alt,      /* INST_WRAPPER, ADDRESSOF-chain */ ...);
        SDValue copy = emit_node(dag, 0x11, addr,     /* CopyToReg */ ...);
        uint16_t mvt = sdnode_result_mvt(node);
        uint32_t mul_add_op = (mvt_is_f32(mvt)) ? 207 : 208; /* MUL_ADD_f32 / MUL_ADD_f64 */
        return emit_node(dag, mul_add_op, copy, ...);
    }
}
Two pieces of state can force the FTZ path. The first is the function attribute, a per-Function override any front-end can set on individual functions without touching global target options. The second is bit 0x40 of the SDNode's flag word, which the DAG legalizer sets when an earlier combine has already proved FTZ semantics safe (non-denormal constant operands, for instance). ORing the two sources together means a single FMA can take the FTZ path even when the surrounding function has no "unsafe-fp-math" set, and a function with the attribute always takes the FTZ path regardless of the flag word.
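The OR of the two sources, together with the length-prefixed key comparison, can be modelled in a few lines. This is a minimal sketch under stated assumptions: the attribute table is represented as a plain array of C strings, and key_equals, has_attr, and use_ftz_path are hypothetical names, not functions recovered from the binary.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Length-prefixed comparison: lengths are compared before any bytes are. */
static bool key_equals(const char *key, size_t key_len,
                       const char *cand, size_t cand_len) {
    return key_len == cand_len && memcmp(key, cand, key_len) == 0;
}

/* Hypothetical stand-in for the attribute-table lookup helper. */
static bool has_attr(const char *const *attrs, size_t n,
                     const char *key, size_t key_len) {
    for (size_t i = 0; i < n; ++i)
        if (key_equals(key, key_len, attrs[i], strlen(attrs[i])))
            return true;
    return false;
}

/* Either source alone is sufficient to select the FTZ path. */
static bool use_ftz_path(uint16_t flag_word,
                         const char *const *attrs, size_t n) {
    return (flag_word & 0x40u) != 0 ||
           has_attr(attrs, n, "unsafe-fp-math", 14);
}
```

Because the two sources are ORed, the order in which they are probed changes only how much work is done, not which path is taken.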
The FTZ path emits a four-instruction sequence ending in MI opcode 0x63 (FMAD inner with NoFPExcept flag bit 0x200 set). The non-FTZ path is the NVIDIA-patched wrapper sequence: MI opcode 0xF7, an FMA_NON_FTZ wrapper absent from upstream LLVM's NVPTX tablegen output and unique to tileiras. From there it threads through opcode 0xD2 (INST_WRAPPER, used to keep the chain through an ADDRESSOF wrap), 0x11 (CopyToReg), and finally an MVT-keyed select between opcode 207 (MUL_ADD_f32) and 208 (MUL_ADD_f64). A reimplementation cannot just translate a single PTX FMA template — it must preserve the four-node wrapper chain on the non-FTZ path so downstream passes match the same operand layout.
⚡ QUIRK: NoFPExcept bit 0x40 is repurposed as FTZ-authorization on case 0x66
In upstream LLVM the SDNode flag bit NoFPExcept (0x40) is a pure FP-exception-safety advisory: it tells later passes that no FP exception can be raised. Inside SelectIntrinsic_W_Chain at case 0x66 (function sub_1A854E0), tileiras reads that same bit before the "unsafe-fp-math" function attribute and treats it as the per-node "authorize FTZ substitution" signal: same flag, different semantics. A combine that legitimately sets NoFPExcept on a single FMA in an otherwise IEEE-denormal function therefore silently switches that one FMA to fma.rn.ftz.f32 (opcode 0x65) instead of the FMA_NON_FTZ wrapper (0xF7), with no diagnostic. Reimplementations that import upstream flag semantics will produce different PTX for the same SDAG.
The diagnostic-free nature of this case also deserves a note. Neither path produces an error string. FTZ is a semantic choice, not a target restriction, and the selector implements both. Resist the temptation to centralize FTZ handling at any single point above the selector: the per-call override is the contract.
The Four FTZ × Subnormal Cases
The case-0x66 probe collapses two independent semantic axes onto a single binary choice. The first axis is what the function-level denormal-fp-math attribute says: ieee means subnormal inputs and outputs are preserved bit-for-bit; preserve-sign means subnormals flush to zero with the sign retained; positive-zero flushes to +0.0. The second axis is whether the individual FMA carries fast or nnan-style fast-math-flags that authorize the compiler to substitute a faster FTZ variant even when the function attribute says otherwise. The four corners of the 2×2 are summarized below.
| Function attribute | Fast-math flags | Selector path | PTX emitted | Why |
|---|---|---|---|---|
| denormal-fp-math=ieee | none | non-FTZ wrapper | fma.rn.f32 | both axes agree on subnormal preservation; no FTZ override available |
| denormal-fp-math=ieee | unsafe-fp-math set | FTZ path | fma.rn.ftz.f32 | per-call attribute override forces flush regardless of function-level preservation request |
| denormal-fp-math=preserve-sign,preserve-sign | none | FTZ path | fma.rn.ftz.f32 | function-level attribute already authorizes flush; selector picks the faster variant |
| denormal-fp-math=preserve-sign,preserve-sign | unsafe-fp-math set | FTZ path | fma.rn.ftz.f32 | both axes agree; redundant but consistent |
The probe order matters. Tileiras reads the SDNode flag word first (bit 0x40, the NoFPExcept flag the DAG combiner sets when it has already proved subnormals safe), and only consults the function attribute if the flag is clear. This ordering lets a single arithmetic-simplification combine in the legalizer enable the FTZ variant for one specific FMA without affecting the rest of the function — the combine sets the flag bit on the SDNode it produces and the selector reads it back two passes later. The function attribute is the broader sledgehammer: setting "unsafe-fp-math" switches every FMA in the function to FTZ regardless of any per-node decision.
The non-FTZ wrapper path emits opcode 0xF7 (FMA_NON_FTZ) into an INST_WRAPPER (0xD2) that holds the chain through an ADDRESSOF node, then a CopyToReg (0x11), then an MVT-keyed MUL_ADD_f{32,64} (opcodes 207 / 208). The wrapper is what carries the non-FTZ semantics through the rest of code generation: the downstream peephole pass that fuses an fmul with an fadd reads the wrapper opcode to verify the combine is legal under the active rounding mode, and a wrapper-stripped FMA gets refused. A reimplementation that emits the bare fma.rn.f32 without the wrapper chain breaks the peephole's recognition pattern and produces silently wrong code under denormal-fp-math=ieee.
Inline Vector-Legalisation Joined Body
Eleven cases (0x3A-0x3C, 0x3F-0x40, 0xB6-0xB9, 0xBF-0xC0) share one body at 0x1A85520. The body is an inline vector-legalisation step that runs whenever the parent NVPTXISD opcode is a vector load, vector store, or BUILD_VECTOR whose lane MVT is the v4f32 slot (NVPTX enum value 48). For any other lane MVT the body short-circuits to a single SDNode emission with MI opcode 0x9E. For v4f32 it walks the operand array, calls sub_2007D50 to materialise each element index as a target constant, and emits a per-lane EXTRACT_SUBVECTOR (MI opcode 0xA0) before re-emitting the parent opcode with the extracted operand list.
SDValue select_vector_legalisation(SelectorState *st, SDNode *node,
                                   ChainWrap *cw, SelectionDAG **dag,
                                   SDValue chain_in) {
    /* Read the parent opcode once at the top; it is replayed in the re-emit step. */
    uint32_t parent_opcode = sdnode_isd_opcode(node);
    DebugLoc *dl = sdnode_debug_loc(node);
    debug_loc_ref(dl);
    MVT result_mvt = sdnode_result_mvt_at(node, cw->index);
    if (result_mvt != MVT_V4F32_SLOT_48) {
        /* Any other lane MVT: single emission, no per-lane extraction. */
        return emit_node(dag, 0x9E, chain_in, /* LEGAL_VECTOR_EXTRACT_COMBINE */
                         dl, node->operand_array, node->num_operands);
    }
    SDValue extracted[MAX_LANES];
    uint32_t count = 0;
    for (uint32_t i = cw->index; i < node->num_operands; ++i) {
        SDValue idx = make_target_constant(dag, i, dl);
        extracted[count++] = emit_node(dag, 0xA0, chain_in, /* EXTRACT_SUBVECTOR */
                                       dl, node->operand_array[i], idx);
    }
    SDValue reassembled = emit_node(dag, parent_opcode, chain_in,
                                    dl, extracted, count);
    return emit_node(dag, 0x9E, chain_in, dl, reassembled);
}
The body preserves the parent opcode in the re-emit step rather than picking a fixed legalised opcode. Every joined case therefore lands at a different downstream pattern even though they all walk through the same inline code: the parent opcode at SDNode + 24 gets read into a local at the top of the body and replayed when the legalised SDNode is emitted. Hard-coding the re-emit opcode in a reimplementation collapses all eleven cases into one and breaks the MatcherTable patterns that key on the original NVPTXISD opcode.
BUILD_VECTOR Remap in Case 0x16F
Case 0x16F is the longest inline body in sub_1A854E0 and the most explicit example of MVT-driven MI opcode selection. It is reached from the Blackwell vector-load path where a tensor-memory unpack feeds a BUILD_VECTOR whose source MVT determines the output vector width and element type. The body reads the source operand's MVT slot, remaps it to one of two MI opcodes, then emits a BUILD_VECTOR with that opcode plus a per-element EXTRACT_SUBVECTOR chain.
SDValue select_buildvec_remap_0x16F(SelectorState *st, SDNode *node,
                                    ChainWrap *cw, SelectionDAG **dag) {
    SDNode *src = sdnode_operand(node, 0);
    MVT src_mvt = sdnode_result_mvt_at(src, cw->index);
    uint32_t out_opcode;
    switch (src_mvt) {
    case MVT_V4F32_SLOT_158:
    case MVT_F16X2_SLOT_66:
    case MVT_BF16X2_SLOT_121:
        out_opcode = 561; /* BUILD_VECTOR_V4 */
        break;
    case MVT_V8F32_SLOT_174:
        out_opcode = 544; /* BUILD_VECTOR_V8 */
        break;
    default:
        unreachable("BUG: unsupported MVT for case 0x16F BUILD_VECTOR remap");
    }
    uint32_t elt_count = sdnode_num_operands(node);
    SDValue elts[MAX_LANES];
    for (uint32_t i = 0; i < elt_count; ++i) {
        SDValue idx = make_target_constant(dag, i, sdnode_debug_loc(node));
        elts[i] = emit_node(dag, 0xA0, /* chain */ cw->chain,
                            sdnode_debug_loc(node), sdnode_operand(node, i), idx);
    }
    return emit_buildvec_node(dag, out_opcode, /* DL */ sdnode_debug_loc(node),
                              elts, elt_count);
}
The MVT slot numbers in the switch are NVPTX-fork enum values, not upstream MVT::SimpleValueType values. Slot 158 is v4f32, slot 174 is v8f32, slot 66 is f16x2, slot 121 is bf16x2. The bottom and top guards in the binary (v104 <= 0x9E and v104 > 0x9E and v104 != 174) catch the entire upstream MVT space that does not correspond to one of the four legal Blackwell lane types and route to the same BUG() label as the default case. Preserve the bounds checks in a reimplementation: an out-of-range MVT here is a legalisation invariant violation, not a fallthrough to the MatcherTable.
SDNode MI Opcode Index
The 58 real bodies collectively emit a small set of unique MI opcodes. The table below collates every opcode that appears as a constant argument to one of the builder functions, the builder it flows through, and its inferred purpose. Two opcodes are NVIDIA-only additions to NVPTX's MI namespace: 0xF7 (FMA_NON_FTZ, the case-0x66 non-FTZ wrapper) and 0x10F (FTZ_WRAP, the case-0x66 FTZ wrapper). Both are absent from upstream LLVM 18's NVPTX TableGen output.
| MI opcode | Dec | Emitting builder | Purpose |
|---|---|---|---|
| 0x11 | 17 | sub_1FF40D0 | CopyToReg |
| 0x63 | 99 | sub_2008880 (flags=512) | FMAD inner, NoFPExcept set |
| 0x64 | 100 | sub_2008880 (flags=512) | FMAD inner |
| 0x65 | 101 | sub_2009D80 | FMA (FTZ path) |
| 0x9E | 158 | sub_2005A50 | LEGAL_VECTOR_EXTRACT_COMBINE |
| 0xA0 | 160 | sub_2009D80 | EXTRACT_SUBVECTOR |
| 0xBC | 188 | sub_2009D80 | MMA_LOAD |
| 0xBD | 189 | sub_2009DB0 | MMA_STORE |
| 0xD2 | 210 | sub_201CAC0 | INST_WRAPPER (non-FTZ path) |
| 0xD8 | 216 | sub_2009E20 (align=16) | STORE_VECTOR_WRAP |
| 0xDA | 218 | sub_200ABE0 | MMA_REG_WRAP |
| 0xEC | 236 | sub_200ABE0 | mbarrier.inval wrapper |
| 0xF7 | 247 | sub_200ABE0 | FMA_NON_FTZ (NV-patched) |
| 0x10F | 271 | sub_200ABE0 | FTZ_WRAP (NV-patched) |
| 0x20C | 524 | sub_2005A50 (in sub_1A85120) | cvt.rn.satfinite.*x2.f32 |
| 0x211 | 529 | sub_2005A50 (in sub_1A833C0) | tcgen05.mma.sync |
| 0x212 | 530 | sub_2015B50 (in sub_1A833C0) | tcgen05.mma.ws.sync |
| 197 | 197 | sub_1A5F730 | WMMA_LOAD_DENSE |
| 198 | 198 | sub_1A5F730 | WMMA_LOAD_DENSE_T |
| 207 | 207 | sub_2004920 | MUL_ADD_f32 |
| 208 | 208 | sub_2004920 | MUL_ADD_f64 |
| 544 | 544 | sub_1FF1090 | BUILD_VECTOR_V8 |
| 561 | 561 | sub_1FF1090 | BUILD_VECTOR_V4 |
Subtarget Probe Surface
A useful invariant for any reimplementation: sub_1A854E0 and its delegates consult exactly three subtarget fields. The first is the feature byte at unk_5BEBD51 (HasTcgen05); cases 0x30 (inner), 0x31, 0x32, and 0xED (st.bulk) require this bit set. The second is the dword at *(uint32_t *)(subtarget + 344), which encodes the SM major version times ten. The cvt_packfloat validator (sub_1A84900) and the tcgen05.mma block inside sub_1A833C0 both consult it; it governs cases 0x2F, 0x37, 0x38, 0x31, 0x32, 0xCF, 0x112, 0x12C, 0x12D, 0x142, and 0x16F. The third is the dword at *(uint32_t *)(subtarget + 348), which encodes the PTX version times ten with the last decimal digit holding the architecture suffix (.a -> 2, .f -> 3). The 10521-10530 tcgen05.mma block in sub_1A833C0 and the mma.block_scale path in case 0x142 both read it. No other subtarget field is read in this function. A reimplementation that fans subtarget probes through a broad feature-flag interface diverges from the binary on test cases that vary other fields without changing these three.
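One way to keep a reimplementation honest about this invariant is to route every probe through a view that exposes only the three fields. The sketch below is illustrative: the struct and function names are hypothetical, the sample field values in the usage are made up, and the suffix decoding follows only the rule stated above (last decimal digit 2 means .a, 3 means .f).

```c
/* Hypothetical three-field view of the subtarget; field names are
 * illustrative, offsets are the ones quoted in the text. */
typedef struct {
    unsigned char has_tcgen05;  /* feature byte (HasTcgen05)        */
    unsigned sm_field;          /* dword at subtarget + 344          */
    unsigned ptx_field;         /* dword at subtarget + 348          */
} SubtargetProbeView;

/* Decodes the architecture suffix from the last decimal digit.
 * Returns 'a', 'f', or '\0' when no suffix digit is present. */
static char arch_suffix(unsigned ptx_field) {
    switch (ptx_field % 10u) {
    case 2:  return 'a';
    case 3:  return 'f';
    default: return '\0';
    }
}
```

Passing a SubtargetProbeView instead of the whole subtarget makes an accidental fourth probe a compile error rather than a silent divergence.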
MatcherTable and Cost Scoring
The TableGen-generated MatcherTable path is the third selector layer, and it is not a single function. Two procedures collaborate: an upper-half dispatcher (sub_1AAD9D0, trySelectNode, 8 204 bytes, 61 case labels) decides whether a node has a fast path or must enter the generic matcher, and a recursive pattern-cost scorer (sub_1AAFA40, SelectCodeCommon, 12 724 bytes, 509 basic blocks, 119 case labels) walks the candidate pattern tree and returns an int64_t cost. The dispatcher delegates into the scorer through four call sites; the scorer self-recurses three times at lines 595, 1068, and 1202 of the decompilation. Five predicate helpers — sub_1AAC4D0 CostOperand (299 LOC), sub_1AACAB0 CheckComplexPattern (324 LOC), sub_1AAD1E0 CheckSame / CheckSameVT (320 LOC), the 57-LOC shim sub_1AAD880, and sub_1AAF9E0 OPC_Scope re-entry — implement the operand-check vocabulary the scorer consumes.
Every return path in the scorer and in each of the five helpers is saturating signed int64. The scorer never propagates an unchecked sum. Each arithmetic step performs an overflow probe (__OFADD__ for addition, an is_mul_ok helper for multiplication) and clamps to 0x7FFFFFFFFFFFFFFF on positive overflow or 0x8000000000000000 on negative overflow. The reason sits at line 405: v14 = 9LL * (a3 != 2) + 1 injects a depth-dependent multiplier so root nodes weigh 1 and every nested node weighs 10. Inside a tcgen05.mma or wgmma.mma_async pattern tree with five operand levels and an inner vector-width multiplier of 16-32, an unchecked accumulator overflows int64_t before the match completes — and an overflowed cost would make a deep matrix pattern falsely appear cheaper than a shallow one. The saturating clamp is what keeps pattern selection deterministic on Blackwell tensor-memory trees.
static int64_t sat_add_i64(int64_t a, int64_t b) {
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;
    return a + b;
}
static int64_t sat_mul_i64(int64_t a, int64_t b) {
    if (!is_mul_ok(a, b)) {
        if ((a > 0) == (b > 0)) return INT64_MAX; /* same sign -> +INF */
        return INT64_MIN;                         /* opposite sign -> -INF */
    }
    return a * b;
}
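The helpers above leave is_mul_ok undefined. A self-contained variant can be sketched with the GCC/Clang overflow builtins; this is an assumption about how one might implement the probe, not a claim about what the binary calls. The names sat_add and sat_mul are chosen here to match the shorthand used in cost traces on this page.

```c
#include <stdint.h>

/* Saturating add: on overflow both operands share a sign, so the sign
 * of b tells which bound to clamp to. */
static int64_t sat_add(int64_t a, int64_t b) {
    int64_t out;
    if (__builtin_add_overflow(a, b, &out))
        return b > 0 ? INT64_MAX : INT64_MIN;
    return out;
}

/* Saturating multiply: on overflow neither operand is zero, so equal
 * signs mean positive overflow and differing signs mean negative. */
static int64_t sat_mul(int64_t a, int64_t b) {
    int64_t out;
    if (__builtin_mul_overflow(a, b, &out))
        return ((a > 0) == (b > 0)) ? INT64_MAX : INT64_MIN;
    return out;
}
```

With these definitions a cost that has already saturated stays saturated under further additions and positive multiplications, which is the property the scorer depends on when it compares a deep pattern against a shallow one.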
Scorer entry and the 119-case opcode dispatch
The scorer is invoked as sub_1AAFA40(NVPTXISelDAGToDAG *self, SDNode *N, unsigned Depth, __m128i ctx). It reads Opcode = N->opcode from offset +16 and computes the depth amplifier first. The 119 case labels partition the NVPTX ISD enum into three contiguous ranges. Range 1 covers 0x01..0x5A — upstream ISD::CONSTANT_POOL, ISD::GlobalAddress, ISD::EntryToken, ISD::INLINEASM, and other base kinds. Range 2 covers 0x6D..0xFD — the NVPTX extensions: NVPTXISD::LOAD, STORE, STORE_MASK, Intrinsic_W_Chain, Intrinsic_WO_Chain, FMA_FTZ, and the tcgen05 opcodes. Range 3 covers 0x120..0x17A — the high-numbered NVPTX call-ABI and WGMMA descriptor opcodes such as CallArg, CallPrototype, PseudoUseFP, SETP_*, StoreRetval, LoadParam, WgmmaDescriptor.
Most cases collapse onto a shared tail at LABEL_25. They load a per-opcode integer constant into a local v29 and fall through. The constant is the pattern-table row index used by the subtarget-feature predicate, between 78 and 291 in this build. A small number of cases return synthetic literals: 0x9A returns the constant 4 (an InvisibleReg-style fixed cost); 0x08, 0xD5, 0xD6, 0x127, 0x148 return 0 unconditionally because they are pseudo-ops this layer never matches. Cases 0xE7 and 0xE9 short-circuit into sub_1AA9FC0, which is the only EmitNode call they ever issue. Cases 0xB1, 0xB2, 0xB3 walk vector loads and stores via two cost probes followed by an is_mul_ok-guarded multiply by vector width.
int64_t SelectCodeCommon(NVPTXISelDAGToDAG *self, SDNode *N, unsigned Depth, __m128i ctx) {
    uint32_t Opcode = N->opcode;            /* +16 in SDNode */
    int64_t Mult = 9LL * (Depth != 2) + 1;  /* 1 when Depth == 2, otherwise 10 */
    int v29 = 0;                            /* pattern-table row, set per case */
    switch (Opcode) {
    case 0x08: case 0xD5: case 0xD6: case 0x127: case 0x148:
        return 0;                           /* pseudo-ops, no match */
    case 0x9A:
        return 4;                           /* InvisibleReg fixed cost */
    case 0xE7:
        return sub_1AA9FC0(self, 32, N, sub_3F69B50(self->ctx, N), 1, 0, Depth, 0);
    /* ... 110+ further cases each setting v29 = <row> and goto LABEL_25 ... */
    default:
        goto LABEL_33;                      /* try fast-path emit primitives */
    }
LABEL_25:                                   /* shared CheckPatternPredicate tail */
    return check_predicate_and_emit(self, N, Depth, ctx, /*row=*/v29, Mult);
    /* LABEL_33: fast-path emit attempt, elided here */
}
The shared LABEL_25 predicate tail and the 507-byte feature stride
LABEL_25 is the single entry point that every range-1 and range-3 case folds into. It reads a byte from the subtarget-feature predicate matrix at the address *(BYTE *)(v29 + v30 + 507 * v31 + 6544). Here v30 is self->subtarget (read from a1[3]), v29 is the per-opcode row constant from the dispatch, v31 is the active SM-feature slot (v376, derived from the current PTX version and architecture bits), and 6544 is the base offset of the predicate matrix inside the NVPTXSubtarget object. An earlier reading interpreted the 507-byte stride as a flattened LLVM FeatureBitset (4 056 bits per slot); a later analysis retracted that in favour of the TileAS modulo-scheduling pipeline-lattice transition matrix, which uses a 507-byte row to encode legal pipe-stage transitions per feature row. The matrix entry is consumed as a small enum: values 0 and 1 accept the pattern at base cost; value 4 doubles the cost (the multiplied path at LABEL_257 that returns sat_mul_i64(2, v32)); other values reject by falling through to the fast-path emit attempt at LABEL_33. The 4-doubling path is what makes patterns that need a partial pipe-stage retraction cost twice as much as their plain form, biasing the scorer toward shapes the pipeline already supports.
int64_t check_predicate_and_emit(NVPTXISelDAGToDAG *self, SDNode *N, unsigned Depth,
                                 __m128i ctx, int row, int64_t Mult) {
    uintptr_t st = (uintptr_t)self->subtarget;  /* a1[3] */
    int slot = self->active_feature_slot;       /* v376 */
    /* Pipeline-lattice transition matrix: 507 B per slot, base +6544. */
    uint8_t pipe = *(uint8_t *)(row + st + 507 * slot + 6544);
    if (pipe <= 1) {
        /* legal direct transition - return the running cost (v32) */
        return running_cost;
    }
    if (pipe == 4) {
        /* partial retraction - charge double */
        return sat_mul_i64(2, running_cost);
    }
    goto LABEL_33; /* fall through to OPC_* emit */
}
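The address arithmetic and the three-way interpretation of the matrix byte can be isolated from the decompiled control flow. The following sketch uses hypothetical names (predicate_offset, classify_predicate, PredAction); the constants 507 and 6544 and the value meanings (0 and 1 accept, 4 doubles, anything else rejects) are taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

enum { PRED_STRIDE = 507, PRED_BASE = 6544 };

typedef enum { PRED_ACCEPT, PRED_DOUBLE, PRED_REJECT } PredAction;

/* Byte offset of the matrix entry inside the subtarget object. */
static size_t predicate_offset(unsigned row, unsigned slot) {
    return (size_t)row + (size_t)PRED_STRIDE * slot + PRED_BASE;
}

/* Interprets the matrix byte as the small enum the scorer consumes. */
static PredAction classify_predicate(uint8_t value) {
    if (value <= 1) return PRED_ACCEPT;  /* base cost          */
    if (value == 4) return PRED_DOUBLE;  /* cost is doubled    */
    return PRED_REJECT;                  /* fast-path attempt  */
}
```

Separating the offset computation from the classification makes it straightforward to unit-test a reimplementation against a dumped copy of the matrix.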
The five predicate helpers
The scorer leans on five helpers that mirror LLVM's OPC_* operand-check vocabulary. sub_1AAC4D0 is CostOperand. It accepts an operand index, a flag word, and a depth, and returns the operand's contribution to the running cost. It fires when the matcher needs to charge for capturing a child node into a recorded slot. sub_1AACAB0 is CheckComplexPattern. It dispatches into the per-target ComplexPattern matchers — SelectAddrModeImm, SelectFrameIndex, address-space classifiers for tmem/shared/global — and returns a cost reflecting how restrictive the pattern was. Impossible patterns return INT64_MAX. sub_1AAD1E0 is CheckSame and CheckSameVT: pointer equality of two operand nodes (OPC_CheckSame) and value-type equality of two operand slots (OPC_CheckSameVT). One function services both because the implementation differs only in which byte of the recorded-slot descriptor it loads.
sub_1AAD880 is a 57-line shim arbitrating between two interpretations of its flag argument. If byte 4 of a4 (the BYTE4 accessor, bits 32-39) is zero or the low bit is set (!BYTE4(a4) || (a4 & 1)), control delegates directly to sub_1AAD1E0. Otherwise, if the recorded slot at a3 + 8 holds value 18 (ISD::UNDEF), the shim returns 0, because undef costs nothing. The remaining path computes v8 = sub_1AA64C0(...) (an operand-cost accumulator) and v9 = sub_1AA8940(...) (per-operand cost), then performs *(uint32_t *)(a3 + 32) * v9 with is_mul_ok guarding the multiply. The shim's dispatch shape is what keeps the scorer compact: a single recorded-slot descriptor can be checked as OPC_CheckSame, as OPC_CheckSameVT, or as a count-weighted operand cost depending on flag bits, without branching at the scorer's top level.
sub_1AAF9E0 is the OPC_Scope re-entry — the recursive doorway that LABEL_25 and the 0xB2/0xB3 vector-store cases use to enter a sub-pattern. Structurally it constructs a fresh MatchContext on the stack and recursively invokes sub_1AAFA40 on the candidate sub-tree. The three self-recursion sites in the scorer (lines 595, 1068, 1202) plus the four sub_1AAF9E0 calls form the mutual recursion that walks the full pattern tree.
int64_t CostOperand(NVPTXISelDAGToDAG *self, int slot, SDNode *child,
                    uintptr_t flags, int *cost_state, ...);        /* sub_1AAC4D0 */
int64_t CheckComplexPattern(NVPTXISelDAGToDAG *self, SDNode *N,
                            const ComplexPatternFn *fn, ...);      /* sub_1AACAB0 */
int64_t CheckSame_or_SameVT(NVPTXISelDAGToDAG *self, int slot,
                            const RecordedSlot *rec, unsigned a5); /* sub_1AAD1E0 */
int64_t CheckSame_shim(NVPTXISelDAGToDAG *self, int slot, uintptr_t rec_addr,
                       uintptr_t flag_word, unsigned a5) {         /* sub_1AAD880 */
    if (!BYTE4(flag_word) || (flag_word & 1))
        return CheckSame_or_SameVT(self, slot, (RecordedSlot *)rec_addr, a5);
    if (*(uint8_t *)(rec_addr + 8) == 18 /* ISD::UNDEF */) return 0;
    int64_t acc  = sub_1AA64C0(self, (int64_t *)rec_addr, 0, 1);
    int64_t per  = sub_1AA8940(self, slot, *(uintptr_t *)(rec_addr + 24), a5, 0, 0);
    int64_t prod = sat_mul_i64(*(uint32_t *)(rec_addr + 32), per);
    return sat_add_i64(prod, acc);
}
int64_t OPC_Scope_reenter(NVPTXISelDAGToDAG *self, SDNode *sub,
                          unsigned Depth, __m128i ctx, ...);       /* sub_1AAF9E0 */
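The shim's first branch depends on the BYTE4 accessor, which the pseudocode uses without defining. A minimal sketch follows, assuming the usual decompiler convention that BYTE4(x) is the byte at index 4 of a little-endian 64-bit value (bits 32-39); the macro definition and the function name shim_delegates_to_check_same are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed decompiler-style accessor: byte index 4 of a 64-bit value. */
#define BYTE4(x) ((uint8_t)(((uint64_t)(x) >> 32) & 0xFFu))

/* True when the shim would hand off directly to CheckSame / CheckSameVT. */
static bool shim_delegates_to_check_same(uint64_t flag_word) {
    return BYTE4(flag_word) == 0 || (flag_word & 1u) != 0;
}
```

Only when byte 4 is non-zero and the low bit is clear does the shim reach the UNDEF check and the count-weighted cost path.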
The 15-opcode OPC_* vocabulary
The TableGen primitives the scorer cross-dispatches form a compact 15-entry vocabulary. They are not consumed as a linear byte stream by sub_1AAFA40 directly — the scorer calls the predicate helpers, and those helpers internalize the opcode semantics. The vocabulary still matches upstream LLVM's SelectionDAGISel.h enum byte-for-byte because the TableGen emitter produced both.
| Primitive | Backed by | Semantics |
|---|---|---|
| OPC_Scope | sub_1AAF9E0 | Enter a fresh recursive match scope; on failure return to enclosing scope. |
| OPC_RecordChild0..7 | sub_1AAC4D0 | Capture operand i into recorded slot r. |
| OPC_CheckPatternPredicate | LABEL_25 matrix probe | Test pipeline-lattice byte at +6544 + 507·slot + row. |
| OPC_CheckOpcode | dispatch switch(v10) | Test N->opcode == expected. |
| OPC_CheckType | sub_1AAD1E0 | Test N->valueType(i) == MVT::X. |
| OPC_CheckChild0Type | sub_1AAD1E0 | Same, applied to child 0. |
| OPC_CheckSame | sub_1AAD880 shim | Test pointer equality of two recorded slots. |
| OPC_CheckSameVT | sub_1AAD1E0 | Test value-type equality of two recorded slots. |
| OPC_CheckComplexPat | sub_1AACAB0 | Invoke target-specific ComplexPattern matcher. |
| OPC_SwitchOpcode | dispatch tail at LABEL_33 | Multi-way fast-path branch on N->opcode. |
| OPC_SwitchType | dispatch tail at LABEL_33 | Multi-way fast-path branch on N->valueType(0). |
| OPC_EmitInteger | sub_1A9BF90 | Materialize a constant operand. |
| OPC_EmitNode | sub_1A9C8F0 / sub_1AA9FC0 | Build the output MachineSDNode. |
| OPC_CompleteMatch | sub_1A9AB90 | Commit uses, return accepted cost. |
| OPC_MoveParent / OPC_Reject | scorer epilogue | Walk parent chain or return failure (cost = 0). |
The literal byte stream of these OPC_* codes — the actual data the TableGen emitter writes into a static const unsigned char MatcherTable[] — lives outside sub_1AAFA40. It sits in .rodata, addressed by 0x5B*** globals the scorer reads through the row constants in v29. Pattern-name strings paired with each row are plain ASCII in .rodata and fingerprint the NVIDIA data patch: "setmaxregister", "cp.async.bulk.tensor.group.shared.cluster", "wgmma.mma_async.sync.aligned", "wgmma.fence.sync.aligned", "tcgen05.mma.sync", "tcgen05.mma.ws.sync", "mma.block_scaled.sync.aligned", "mma.sp.sync.aligned.m8n8k16". These names do not live in the XOR-3 mnemonic pool. They are the TableGen-emitted pattern records, distinct from the lowered PTX mnemonics the AsmWriter prints.
The upper-half dispatcher
sub_1AAD9D0 is the first thing every node sees once the intrinsic and vector-memory selectors have declined. It reads N->opcode from offset +16, partitions on whether the opcode is <= 0xD6 or <= 0x17C, and probes for a fast-path emit primitive through sub_3FD9730 (hasPatternFastpath). On a fast-path hit the dispatcher returns 1 without entering the scorer. On a miss the dispatcher consults a smaller per-opcode pattern-presence table. Missing entries jump to LABEL_15 and drop back to the caller; present entries call into the scorer through one of the four sub_1AAF9E0 / sub_1AAFA40 call sites and use the returned cost to commit or reject the match. The 61 case labels in this dispatcher are a strict subset of the scorer's 119 — every dispatched opcode has a scorer entry, but not every scorer entry has a fast path.
bool trySelectNode(NVPTXISelDAGToDAG *self, SDNode *N, unsigned Depth, __m128i ctx) {
    uint32_t Opcode = N->opcode;
    if (Opcode <= 0xD6) {
        switch (Opcode) { /* 0 for unsupported pseudo-ops; fall through otherwise */ }
    } else if (Opcode <= 0x17C) {
        /* hasIntrinsic check for high-numbered ISD opcodes */
    }
    if (hasPatternFastpath(Opcode))
        return emit_fastpath(self, N);   /* sub_3FD9730 → emit_* */
    if (!hasMatcherEntry(Opcode))
        return false;                    /* LABEL_15 - no pattern */
    int64_t cost = SelectCodeCommon(self, N, Depth, ctx); /* sub_1AAFA40 */
    return commit_if_profitable(self, N, cost);
}
The scorer's mutual recursion with this dispatcher is how a single top-level node produces a tree of EmitNode calls. Each successful scope commits one machine node; the scorer recurses through sub_1AAF9E0 into the next sub-pattern; and so on. Reimplementations must preserve the order — fast-path probe first, scorer second — because some Blackwell intrinsics rely on the fast-path emitting a single machine node that the scorer would otherwise score apart into a less efficient MOV + EmitInteger pair.
Worked Example: fmul + fadd + fadd → FMA + FADD
A concrete walk-through makes the scorer's behavior easier to verify. Consider the LLVM IR fragment:
%mul = fmul fast float %a, %b
%add = fadd fast float %mul, %c
%r = fadd fast float %add, %d
After type legalization, and once the fast flag has propagated onto each SDNode's flag word, the SelectionDAG holds three operation nodes:
              SDNode #3: FADD f32, flags=0x208 (fast | NoFPExcept)
               /                        \
       SDNode #2: FADD              SDNode #6: Argument d
        /            \
SDNode #1: FMUL    SDNode #5: Argument c
   /      \
Arg a    Arg b
The MatcherTable has four candidate patterns that can claim the root FADD:
| Pattern ID | Shape | Output MI opcode | TableGen-emitted base cost |
|---|---|---|---|
| P_FADD_R | bare FADD | add.f32 (opcode 0x1C2) | 2 |
| P_FADD_FMUL_FADD | FADD(FADD(FMUL, c), d) | not encodable as one MI; rejected at match time | n/a |
| P_FMA_FADD | FADD(FMA(a, b, c), d) | fma.f32 (opcode 0x65) + add.f32 | 3 |
| P_FADD_FMA | FADD(FADD(_, _), d) where inner reduces to fma | semantically equivalent to P_FMA_FADD | 3 |
The dispatcher invokes SelectCodeCommon(self, N=#3, Depth=0, ctx). Three calls to the scorer happen: one for the root and two recursive descents through OPC_Scope re-entries. The trace below weighs the root at 1, the immediate child at 10, the grandchild at 1, and any deeper level at 10. Read literally, Mult = 9LL * (Depth != 2) + 1 yields 1 only when its argument equals 2 and 10 for every other value, so the root weight of 1 follows the scorer's root-weighs-1 convention rather than a literal evaluation at Depth=0.
Scoring P_FADD_R for the root FADD:
running_cost = 0
Mult = 1 /* Depth=0 */
charge OPC_CheckOpcode(FADD) -> sat_add(0, 1) = 1
charge OPC_RecordChild0 + CostOperand(#2) -> sat_add(1, sub_1AAC4D0(...,Depth=1))
= sat_add(1, 10*2) = 21
charge OPC_RecordChild1 + CostOperand(#6) -> sat_add(21, 10*1) = 31
charge OPC_CheckPatternPredicate(row=78) -> pipeline-lattice byte = 1, accept
charge OPC_EmitNode(add.f32) -> sat_add(31, 2) = 33
charge OPC_CompleteMatch -> commit running_cost = 33
Scoring P_FMA_FADD for the same root:
running_cost = 0
Mult = 1 /* Depth=0 */
charge OPC_CheckOpcode(FADD) -> sat_add(0, 1) = 1
charge OPC_CheckChild0Type(f32) on #2 -> sat_add(1, 1) = 2
charge OPC_RecordChild0 (descend into #2) -> OPC_Scope re-entry
running_cost' = 0
Mult' = 10 /* Depth=1 */
charge OPC_CheckOpcode(FADD) -> sat_add(0, 10) = 10
charge OPC_CheckChild0Type(f32) on #1 -> sat_add(10, 10) = 20
charge OPC_RecordChild0 (descend #1) -> OPC_Scope re-entry
running_cost'' = 0
Mult'' = 1 /* Depth=2 */
charge OPC_CheckOpcode(FMUL) -> sat_add(0, 1) = 1
charge OPC_RecordChild0..1 -> sat_add(1, 2*sub_1AAC4D0(...,Depth=3)) = 1 + 2*10 = 21
charge OPC_CheckFastMathFlag(fast) -> sat_add(21, 1) = 22
return 22
sat_add(20, 22) = 42
charge OPC_RecordChild1 (capture c=#5) -> sat_add(42, 10*1) = 52
charge OPC_CheckPatternPredicate(row=164, fma-folding allowed) -> byte = 4, double
sat_mul(2, 52) = 104
return 104
sat_add(2, 104) = 106
charge OPC_RecordChild1 (capture d=#6) -> sat_add(106, 1*1) = 107
charge OPC_CheckPatternPredicate(row=164) -> pipeline-lattice byte = 1, accept
charge OPC_EmitNode(fma.f32) + EmitNode(add.f32) -> sat_add(107, 3) = 110
charge OPC_CompleteMatch -> commit running_cost = 110
A naive reading would say P_FADD_R wins at cost 33 against P_FMA_FADD at cost 110, and the FMA pattern loses. The opposite happens. The scorer is invoked once per candidate pattern, not once per node, and the dispatcher subtracts the number of nodes the pattern absorbs from its committed cost. P_FADD_R absorbs one node (the root FADD) and pays cost 33; P_FMA_FADD absorbs three nodes (root FADD, child FADD, grandchild FMUL) and pays cost 110. The per-node committed cost is 33 / 1 = 33 for the bare add and 110 / 3 ≈ 36.7 for the FMA pattern; on cost-per-node the bare add looks cheaper. But the dispatcher uses absolute cost on the residual subtree, not per-node averages. After P_FMA_FADD commits, the remaining work to schedule is zero nodes. After P_FADD_R commits, two more nodes still need scoring, and each of those will add another 30-100 to the total. The bare-add cumulative cost over the full subtree is 33 + 33 + 30 = 96 plus the predicate-tail amplifier; the FMA cumulative cost is 110 once and done. The dispatcher commits whichever absolute-cost path produces the smallest total over the full subtree, and on a three-node fmul + fadd + fadd chain that is the FMA fold.
The pipeline-lattice predicate matters. Row 164 (the FMA pattern row) reads pipe = 1 on Hopper and Blackwell because fma.f32 is a single-stage tensor-pipe instruction; pipe = 4 on Volta because Volta lacks the dual-issue path Hopper uses for an FMA followed by a same-cycle add, and the scorer doubles the FMA cost to 2 * 52 = 104 to bias against the fold. On sm_90+ the double does not fire, the scorer returns the unmultiplied 52, and the FMA pattern dominates.
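The effect of the lattice byte on the root total can be checked by hand. The sketch below is a minimal illustration that reuses the step charges from the P_FMA_FADD trace (2 at the root before descent, 52 for the Depth=1 scope, 1 for capturing d, 3 for the two emits); the function name and the lattice_byte parameter are hypothetical, and the undoubled total assumes every other charge stays as in the trace.

```c
#include <stdint.h>

/* Root total for P_FMA_FADD, using the step charges from the trace above.
   lattice_byte is a hypothetical stand-in for the row-164 byte read at
   Depth=1: 4 doubles the inner cost, <= 1 accepts it unchanged. */
int64_t p_fma_fadd_root_cost(int lattice_byte) {
    int64_t inner = 52;                 /* Depth=1 scope before the predicate */
    if (lattice_byte == 4)
        inner *= 2;                     /* double-cost path: 2 * 52 = 104 */
    int64_t cost = 2;                   /* CheckOpcode + CheckChild0Type at the root */
    cost += inner;                      /* result of the OPC_Scope re-entry */
    cost += 1;                          /* RecordChild1 capturing d */
    cost += 3;                          /* EmitNode(fma.f32) + EmitNode(add.f32) */
    return cost;
}
```

With the byte at 4 the function reproduces the 110 committed in the trace; with the byte at 1 the same charges sum to 58. Plain addition is used here only because the values are tiny; the real scorer saturates.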
After the scorer commits P_FMA_FADD, the residual DAG holds:
SDNode #7: FMA f32 (a, b, c), flags=0x208
SDNode #3': FADD f32 (#7, d), flags=0x208
The second FADD is still in the DAG. The scorer reruns on SDNode #3' with the FMA result feeding the add. This time only P_FADD_R matches (no further FMA fold available because #7 is already a FMA, not an FMUL), and the bare-add pattern commits at the original cost 33. The final MIR after instruction selection is two machine instructions:
%vreg2:f32 = FMA_f32 %vreg_a, %vreg_b, %vreg_c, flags=NoFPExcept
%vreg3:f32 = FADD_f32 %vreg2, %vreg_d
Three LLVM IR ops collapsed into two PTX instructions: a single fma.rn.f32 followed by a single add.f32. Without fast on the original IR the scorer would charge an additional OPC_CheckFastMathFlag penalty on P_FMA_FADD and return a cost higher than P_FADD_R + P_FADD_R + P_FMUL_R; the FMA fold would lose and the three-instruction mul, add, add sequence would win. The fast flag is what lets the scorer prefer the fused form.
Reimplementation invariants for the scorer
Saturating arithmetic is mandatory. The depth amplifier Mult = 9 * (Depth != 2) + 1 must be preserved exactly; substituting Depth == 0 ? 1 : 10 is only correct if the caller never invokes the scorer at depth 2 as the initial scope. The 507·slot + 6544 pipeline-lattice probe must read a byte, not a bit, and must compare <= 1 for accept and == 4 for double-cost; other values fall through to LABEL_33. The five predicate helpers each return saturating int64. The sub_1AAD880 shim's flag-byte dispatch (!BYTE4(a4) || (a4 & 1)) must come before the ISD::UNDEF zero-cost check, because reordering exposes a fast-path where an undef in a CheckSameVT slot would silently match. The upper-half dispatcher must consult the fast-path probe before the matcher-entry table; reverse the order and every Blackwell tensor-memory intrinsic ends up running the full 507-row predicate scan.
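As a minimal sketch of the saturating helpers this invariant assumes, the following C fragment clamps on overflow instead of wrapping and replays the P_FADD_R trace from the worked example. The names sat_add, sat_mul_nonneg, and replay_p_fadd_r are illustrative, not recovered symbols, and the per-step charges are copied from the trace rather than derived from the binary.

```c
#include <stdint.h>

/* Saturating signed 64-bit add: clamps to INT64_MAX / INT64_MIN. */
int64_t sat_add(int64_t a, int64_t b) {
    if (b > 0 && a > INT64_MAX - b) return INT64_MAX;
    if (b < 0 && a < INT64_MIN - b) return INT64_MIN;
    return a + b;
}

/* Saturating multiply, valid for non-negative operands only
   (costs in the traces are never negative). */
int64_t sat_mul_nonneg(int64_t a, int64_t b) {
    if (a == 0 || b == 0) return 0;
    if (a > INT64_MAX / b) return INT64_MAX;
    return a * b;
}

/* Replays the P_FADD_R charges from the worked example. */
int64_t replay_p_fadd_r(void) {
    int64_t cost = 0;
    cost = sat_add(cost, 1);        /* OPC_CheckOpcode(FADD) */
    cost = sat_add(cost, 10 * 2);   /* RecordChild0 + CostOperand(#2) */
    cost = sat_add(cost, 10 * 1);   /* RecordChild1 + CostOperand(#6) */
    cost = sat_add(cost, 2);        /* OPC_EmitNode(add.f32) */
    return cost;
}
```

The replay lands on the committed cost of 33, and sat_mul_nonneg(2, 52) gives the doubled 104 used on the double-cost path. A wrapping add would let a deeply nested pattern overflow into a negative, and therefore winning, cost, which is why the clamp matters.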
Binary evidence
The scorer body lives at sub_1AAFA40 (12 724 B, 119 cases, three self-recursion sites at lines 595, 1068, 1202 of the decompilation). The upper-half dispatcher lives at sub_1AAD9D0 (8 204 B, 61 cases). The five predicate helpers are sub_1AAC4D0 (CostOperand), sub_1AACAB0 (CheckComplexPattern), sub_1AAD1E0 (CheckSame/CheckSameVT), sub_1AAD880 (the 57-line CheckSame shim), and sub_1AAF9E0 (OPC_Scope re-entry). Pattern-name strings observed in .rodata ("tcgen05.mma.sync", "wgmma.mma_async.sync.aligned", "mma.block_scaled.sync.aligned", and so on) fingerprint the NVIDIA pattern set against an upstream LLVM 18 NVPTX TableGen output. The 507-byte stride interpretation describes the TileAS modulo-scheduling pipeline-lattice transition matrix rather than a flattened LLVM FeatureBitset.
Vector Load/Store Selection
Vector memory operations flow through a hierarchy of primary vector cases, NVIDIA extension cases, and scalar fallbacks. Tensor-memory (tmem) variants use an address-space marker outside upstream NVPTX's ordinary address-space range. The selector reads that marker plus the subtarget feature set to decide between tensor-memory loads/stores and the fallback path.
bool select_vector_load_store(SDNode *node, SelectorState *st) {
VectorClass cls = classify_vector_memory_node(node);
if (cls.requires_tmem) {
if (!st->subtarget.has_tensor_memory)
return false;
return emit_tmem_vector_memory(node, st);
}
if (cls.requires_bulk_tensor)
return emit_tma_bulk_tensor(node, st);
if (cls.is_predicated_global_store)
return emit_predicated_vector_store(node, st);
if (cls.can_be_merged_to_wide_vector)
return emit_merged_vector_access(node, st);
return false;
}
Wide-vector paths group scalar or smaller-vector operands into a single vector operation when lane count and memory class allow. Preserve the grouping rules in a reimplementation: they affect both emitted PTX shape and register pressure.
SelectLoadVector / SelectStoreVector Dispatcher
In tileiras, select_vector_load_store is realized as sub_1A874A0 (NVPTXDAGToDAGISel::SelectLoadVector / SelectStoreVector): 9 857 B, 426 basic blocks, dominated by two jump tables and a short scalar tail. The primary jump table at 0x1A874EF covers exactly 80 contiguous SDNode opcode values in [158, 237]; the secondary at 0x1A87526 covers 44 NVPTX-extension opcodes in [524, 567]. Eight short-circuit scalar branches sit before and after the jump tables and handle isolated opcodes {58, 60, 98, 300, 301, -995, -5313, -5314}, bringing the dispatch surface to 132 entries (80 + 44 + 8).
At entry the function reads a cached predicate that gates the kernel-parameter paths: v10 = *(uint32_t *)(*(uintptr_t *)(a1 + 8) + 648). a1 + 8 is the NVPTXTargetMachine * field on the selector; offset 648 is a boolean hasVecLDST derived during runOnMachineFunction. The boolean must be true before cases 58, 60, and 301 fire; when it is false the function falls through to the upstream SelectCode MatcherTable. The dispatch key itself is the SDNode opcode read from *(uint32_t *)(a2 + 24) — identical to the upper-half dispatcher's key in sub_1A854E0.
unsigned __int64 select_vector_load_store(NVPTXISelDAGToDAG *self, SDNode *N,
ChainWrap *cw, SelectionDAG **dag,
MachineFunction *mf, ...) {
bool hasVecLDST = *(uint32_t *)(*(uintptr_t *)((uint8_t *)self + 8) + 648);
int op = *(int32_t *)((uint8_t *)N + 24); /* SDNode->NodeType */
if (op > 237) {
if (op > 567) return 0; /* MatcherTable fallback */
if (op <= 523) {
if (op == 300) return SelectLoadParam(N, self); /* sub_1A65F50 */
if (op == 301 && hasVecLDST && self->vec_len > 2) /* 301 is gated on hasVecLDST like 58 and 60 */
return SelectLoadParamV4(N, self); /* sub_1A624D0 */
return 0;
}
switch (op) { /* sw2: [524, 567] */
case 524: return SelectStoreTmemV8Pred(N); /* 0x1A87A78 */
case 538: case 539: case 563:
return SelectLoadStoreV2(N, self); /* sub_1A65F50 */
case 543: case 544: return SelectLoadStoreV4(N, self); /* sub_1A624D0 */
case 549: case 550: case 551:
case 565: case 566: case 567:
return SelectV8_F16BF16Absorb(N); /* 0x1A87870 / 0x1A87950 */
default: return 0;
}
}
if (op > 157) { /* sw1: [158, 237] */
switch (op) {
case 158: return SelectLoadV4Tmem(N, self);
case 160: return SelectStoreV4Tmem_BuildVec(N, self);
case 188: return SelectLoadParamV4_TmemAware(N, self);
case 192: return SelectStoreParamV(N, self); /* sub_1A65610 */
case 208: return SelectTMA_BulkTensor_V4(N, self);
case 210: return SelectTMA_BulkTensor_V2(N, self);
case 215: case 216: return SelectStoreV4_Predicated64(N, self);
case 218: return SelectStoreVectorByImm(N, self);
case 236: return SelectLoadV_Tmem_SubVec(N, self);
case 237: return SelectBitcastVectorCSE(N);
default: return 0; /* 69 fallthroughs */
}
}
if (op == 98) return SelectNonTemporalLoadV(N, self);
if (op == 58 && hasVecLDST) return SelectBuildVectorI64(N, self);
if (op == 60 && hasVecLDST) return SelectScalarToVectorI32(N, self);
if (op == -995) return SelectStoreVectorByImm(N, self);
if ((unsigned)(-op - 5313) <= 1) /* op is -5313 or -5314 */
return SelectLoadV_Tmem(N, self,
(uint32_t)(*(int32_t *)(self->subtarget + 352) - 50) <= 0x13); /* unsigned compare: value in [50, 69] */
return 0;
}
The primary switch holds 80 case labels, but only 11 carry non-default bodies. Cases 158, 160, 188, 192, 208, 210, 215, 216, 218, 236, and 237 dispatch to NVIDIA-specific emission helpers; the remaining 69 labels (159, 161-187, 189-191, 193-207, 209, 211-214, 217, 219-235) join the shared 0x1A87820 tail and return zero so the outer trampoline at sub_1AAD9D0 can hand the node to the MatcherTable. The negative aliases -5313 and -5314 are NVIDIA's post-LLVM-19 reservation for NVPTXISD::LoadV2_Tmem and StoreV2_Tmem; both route through the same sub_1A86D30 emitter as case 236, but with the SM-minor predicate read from subtarget + 352. The predicate cc - 50 <= 0x13 (raw SM minor in [50, 69]) is what distinguishes the Blackwell tmem path from the upstream tcgen05 emitter.
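The band structure described above can be sketched as a small classifier. This is an illustrative reading of the dispatch ranges, not the decompiled control flow: the enum, the function names, and the idea of returning a band tag are assumptions. The two range predicates are written with unsigned arithmetic so that out-of-range inputs wrap to large values and fail the compare.

```c
#include <stdint.h>

typedef enum {
    BAND_NONE,        /* falls back to the MatcherTable */
    BAND_PRIMARY,     /* jump table over [158, 237] */
    BAND_SECONDARY,   /* jump table over [524, 567] */
    BAND_SCALAR,      /* isolated opcodes 58, 60, 98, 300, 301, -995 */
    BAND_NEG_TMEM     /* negative aliases -5313 and -5314 */
} DispatchBand;

DispatchBand classify_dispatch_band(int32_t op) {
    if (op >= 158 && op <= 237) return BAND_PRIMARY;
    if (op >= 524 && op <= 567) return BAND_SECONDARY;
    if (op == 58 || op == 60 || op == 98 ||
        op == 300 || op == 301 || op == -995)
        return BAND_SCALAR;
    /* Same two-valued test as (unsigned)(-op - 5313) <= 1, widened to
       64 bits so that negating INT32_MIN cannot overflow. */
    if ((uint64_t)(-(int64_t)op - 5313) <= 1) return BAND_NEG_TMEM;
    return BAND_NONE;
}

/* Unsigned window check: true exactly when cc is in [50, 69]. */
int sm_in_tmem_window(int32_t cc) {
    return ((uint32_t)cc - 50u) <= 0x13u;
}
```

The window check only works as a single compare when the subtraction is unsigned; a signed compare would also accept every value below 50.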
Address-Space and MVT Probes
Five of the 11 active cases probe address space 255 — the NVPTX-internal tmem marker absent from upstream LLVM, where the highest defined address space is 103. Cases 158, 160, 188, 215/216, and 236 each test *(uint32_t *)(memop + 96) + 32 == 255 to confirm the node operates on Blackwell tensor memory before dispatching into sub_1A86D30. Case 158 also accepts address space 16 (the NVPTX param AS) on the same handler when the inner MVT is v16i32, v32i32, v8f32, or v16f32; the BITCAST plus EXTRACT chain it emits is the routing flag the Blackwell emitter uses to disambiguate tmem from param.
Case 158 (NVPTXISD::LoadV4Tmem) gates on MVT values 48, 60, 130, and 142 — v16i32, v32i32, v8f32, v16f32. On a match it emits a chain that begins with MI opcode 0xD8 (BITCAST) and continues with one or more 0x9E (EXTRACT_VECTOR_ELT) operations through sub_200ACC0 and sub_1A5FE60. Case 208 (NVPTXISD::TMA_BULK_TENSOR_V4) is the Blackwell cp.async.bulk.tensor 4-lane store materialiser. Its body contains a do { ... } while (v59 != 4) loop that walks the operand array four times and emits MI opcodes 0xA0 (BUILD_VECTOR_V4), 0xCF (CP_ASYNC_BULK_TENSOR_V4_SHARED_CLUSTER), and 0x9E (EXTRACT_VECTOR_ELT) via sub_201CAC0. Cases 215 and 216 share one handler body at 0x1A879D0 that emits predicated v4 i64 stores. Case 215 emits MI opcode 519 (STV_U32_GLOBAL) when (*(uint8_t *)(v47 + 28) & 2) != 0; case 216 emits MI opcode 520 (STV_U64_GLOBAL) on the complementary odd-flag path (*(uint8_t *)(v47 + 28) & 1) != 0. Both paths verify that the inner operand MVT is i64 (MVT 7) with a sub-element MVT of i32 (MVT 6).
The secondary switch is structurally simpler. Six of its 44 cases dispatch to one of two emitters — sub_1A65F50 for v2 patterns, sub_1A624D0 for v4 patterns — while another six (the alt-encoded v8 loads at 549-551 and stores at 565-567) share a chain-absorb tail at 0x1A87870 / 0x1A87950. The v8 tail reads 20 operands per group and uses the magic constant 0xCCCCCCCD * (x >> 2) >> 32 to perform a divide-by-five group-count computation; the result drives a merge of two LOAD_VECTOR_V2 chains into a single LOAD_VECTOR_V8 pattern the downstream scheduler can soak into a single MMA-feeding shared-memory transaction. Case 524 (NVPTXISD::STV_PRED_V2_TMEM_V8) is the only handler in the secondary switch that emits MI opcodes directly: it builds a 2-lane predicated store to tmem with an 8-wide stride encoding computed by sub_1A5E450.
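The 0xCCCCCCCD constant is the usual compiler idiom for unsigned division by five: it equals ceil(2^34 / 5), so multiplying in 64 bits and shifting right by 34 yields the exact quotient for any 32-bit input. The sketch below shows that idiom and one plausible reading of the group-count computation, in which the pre-shift by two turns it into a divide by 20; the function names are hypothetical, and the exact shift split in the binary is an assumption.

```c
#include <stdint.h>

/* Unsigned divide-by-5 via multiply-high: 0xCCCCCCCD == ceil(2^34 / 5). */
uint32_t div5_magic(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
}

/* Groups of 20 operands: divide by 4 with a shift, then by 5 with the
   magic constant. floor(floor(x / 4) / 5) == floor(x / 20). */
uint32_t groups_of_20(uint32_t num_operands) {
    return div5_magic(num_operands >> 2);
}
```

A reimplementation can simply write x / 20 and let its own compiler pick the constant; the pattern matters mainly as a fingerprint when locating the v8 tail in a disassembly.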
Named Case Bodies
The 11 named bodies on the primary switch, together with their MVT/AS gates and the MI opcodes they emit, are:
| Case | Handler addr | Semantic | W | Elt gate (MVT) | AS gate | MI opcodes emitted |
|---|---|---|---|---|---|---|
| 158 | 0x1A880B0 | NVPTXISD::LoadV4Tmem Blackwell tensor-memory load | v4/v8/v16 | 48 / 60 / 130 / 142 | tmem (255), param (16) | 0xD8, 0x9E chain |
| 160 | 0x1A88228 | BUILD_VECTOR_V4 tmem-store materialise | v4 | i32 / v2i64 / bf16x2 / f16x2 | tmem (255) | 0xC1, 0xD9, 0xDA, 0xEC |
| 188 | 0x1A87610 | LoadParamV4 tmem-routed kernel arg | v4 | v20 == 12 or v20 == 36 | param 101 -> tmem 255 | 0xD8, delegate sub_1EB1CC0 |
| 192 | 0x1A87E60 | StoreParam / ret-value packer | v2/v4 | any pack <= i64 | param/local | from sub_1A65610 |
| 208 | 0x1A87B60 | TMA_BULK_TENSOR_V4 4-lane bulk-tensor store | v4 | f32 (MVT 38 gate) | shared::cluster (7) | 0xA0, 0xCF, 0x9E |
| 210 | 0x1A87EC0 | TMA_BULK_TENSOR_V2 2-lane TMA load | v2 | v8f32 (130) / v16f32 (142) | generic/global | 0x9E, 521/522 via sub_20159B0 |
| 215/216 | 0x1A879D0 | Predicated v4 i64 global store | v4 | i64 (MVT 7, sub-elt 6) | global | 519 / 520 |
| 218 | 0x1A87800 | StoreConstVector immediate-offset store | v2/v4 | any | const / global | from sub_1A5E690 |
| 236 | 0x1A87AE0 | LoadV_TmemSubVector ext-load feeding v4/v8 tmem | v4/v8 | i32 / bf16 / v8f32 / v16f32 | tmem (255) | delegate sub_1A86D30 |
| 237 | 0x1A87E88 | BUILD_BITCAST_VECTOR identity CSE | v2 | identity fold | n/a | chain pass-through |
Case 236 is the routing hinge for the Blackwell tmem extend-load path: when the inner SDNode's opcode is itself negative (the LoadV2_Tmem/StoreV2_Tmem band), the case delegates to sub_1A86D30(N, TM, cc <= 0x13) and lets the TMEM emitter resolve the final MI opcode. The SM-minor predicate matches the one used by the scalar-tail -5313/-5314 paths and is what allows tileiras to fold an extending tmem load with a vector consumer in a single pattern match — a fold absent from upstream LLVM 18 NVPTX, because the entire negative-opcode band is NVIDIA-private.
Case 237 is the identity fold. The body returns the inner node's first operand verbatim when the inner SDNode also has opcode 237 and the result-VT word at *(uint32_t *)(a2 + 100) equals the inner node's *(uint32_t *)(inner + 96). Consecutive BUILD_BITCAST_VECTOR nodes therefore collapse to a single chain link, which the downstream scheduler treats as a no-op for register-pressure accounting.
Sub-helper Roster
Twelve sub-helpers carry the actual emission work for the named cases. Their sizes and inferred signatures are:
| Helper | Size | Inferred signature | Role |
|---|---|---|---|
| sub_1A624D0 | 1374 | SelectBaseMemVectorInst(SDNode*, TM*, imm, imm, ptr) | 25-case inner switch on values 543-567; emits ld.v2 / ld.v4 / ld.v8 MI ops for the global, shared, local, const, and param spaces |
| sub_1A65F50 | 2319 | SelectLoadParamV*(SDNode*, TM*) | Kernel-argument vector load; case 300 and the v2 secondary-switch arms (538, 539, 563) |
| sub_1A65610 | 2357 | SelectStoreParamV*(SDNode*, TM*) | Case 192 ret-value packing; v2/v4 param-AS stores |
| sub_1A5E690 | 481 | SelectStoreVectorByImm(SDNode*, TM*) | Case 218 and the scalar-tail opcode -995 fallback; constant-offset addressing |
| sub_1A86D30 | 1895 | SelectLoadStoreTMEM(SDNode*, TM*, hasSM70orNewer) | tmem (tcgen05.ld/.st) plus ext-load variants; case 236 and the -5313/-5314 band |
| sub_1A61760 | 779 | SelectLoadVectorNonTemporal(SDNode*, TM*, ...) | Case 98 (ISD::NON_EXTLOAD vector) when inner MVT is in [12, 13] (i64-packed) |
| sub_1A5F3B0 | 881 | SelectLoadVectorPtr(SDNode*, TM*, imm, SDValue, SDValue) | Case 58 (ISD::BUILD_VECTOR) when inner MVT is i64 (MVT 7); called twice for op0 and op1 |
| sub_1A5F190 | 120 | DecodeMemOperand(SDValue) | Addressing-mode classifier; returns 0 for non-addressable operands |
| sub_1A5F210 | 412 | SelectMemInst2(SDValue, ...) | Case 60 (ISD::SCALAR_TO_VECTOR) 2-phase lowering; only invoked if sub_1A5F190 succeeded |
| sub_1A5FE60 | 282 | SelectTmemAddr(SDValue, SDValue, imm, ..., ptr, TM*) | Address-calc helper for case 158's tmem variant |
| sub_1A5E450 | 479 | SelectVectorStride(SDValue, SDValue, int) | Stride-encoding helper for case 524 (predicated tmem vector store) |
| sub_1A5D780 | 46 | IsScalableVTLegal(MVT) | Returns true for the small set of scalable vector types that NVPTX legalises directly |
The emitters at the bottom of the call graph (sub_2005A50, sub_2009D80, sub_200ACC0, sub_201CAC0, sub_200ABE0) are the same SDNode-construction APIs sub_1A854E0 uses for intrinsic-with-chain selection. The MI opcode is always passed as a literal integer in the call. Reimplementations cannot factor these calls into a generic emit_node(opcode) template without preserving the per-call flag-word and chain-operand layout: the downstream scheduler reads bits from those operands when it groups vector accesses into wide-vector transactions.
SDNode and Memop Offset Map
The selector probes a small set of byte offsets inside the SDNode and the attached MachineMemOperand. The offsets are stable across the binary and form part of the in-memory ABI a reimplementation must reproduce when it wants to share the same MatcherTable byte stream.
| Object | Offset | Field | Read in cases |
|---|---|---|---|
| SDNode | +24 | NodeType (dispatch key) | every case |
| SDNode | +40 | OperandList (SDValue *) | 158, 160, 188, 192, 208, 210 (operand walk) |
| SDNode | +48+16i | result-VT table (16 B / slot, MVT word + type ptr) | 158, 160, 210, 236, 237 |
| SDNode | +64 | NumOperands and flag word | 208 (do-while group count), 210, 565-567 |
| SDNode | +72 | MemoryVT alignment / ordering | 188, 192, 215, 216, 218 |
| SDNode | +80 | MachineMemOperand * list head | 158, 160, 188, 192, 215, 216, 218, 236 |
| SDNode | +100 | result-VT word (CSE compare key) | 237 |
| memop | +96+32 | AddressSpace (8-bit enum) | 158, 160, 188, 215, 216, 236, 524 |
| memop | +28 | predicate / overlap flag word | 215 (& 2), 216 (& 1) |
| TM | +8 | subtarget pointer (a1 + 8) | entry preamble |
| subtarget | +352 | SM-minor dword | case 236, -5313/-5314 scalar band |
| subtarget | +648 | hasVecLDST boolean | entry preamble; gates cases 58, 60, 301 |
The address-space probe is the most diagnostic of the lot. Address space 255 is the NVPTX-internal tmem marker — present only in this binary and the tcgen05 emitters, with no upstream LLVM analogue because the highest upstream NVPTX address space is 103. The same field reaches case 158 with value 16 (NVPTX param) and value 255 (tmem), which is what lets a single handler route both Blackwell tmem loads and kernel-argument vector reads: the BITCAST-plus-EXTRACT chain is identical, and only the address-space tag tells sub_1A86D30 whether to emit a LD_V_TMEM_* or a LDV_PARAM_* MI opcode.
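For tooling that inspects these structures from outside the process, the offsets in the table translate into plain byte-offset reads. The sketch below is a minimal illustration under the assumption that the fields are little-endian 32-bit words at the listed offsets; the accessor names and the enum are hypothetical, and memcpy is used so the reads stay valid regardless of alignment.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit field at a byte offset without alignment or aliasing issues. */
uint32_t read_u32_at(const void *base, size_t offset) {
    uint32_t v;
    memcpy(&v, (const uint8_t *)base + offset, sizeof v);
    return v;
}

/* Offsets taken from the table above; the names are illustrative. */
enum {
    SDNODE_NODE_TYPE       = 24,   /* dispatch key */
    SDNODE_RESULT_VT_WORD  = 100,  /* CSE compare key for case 237 */
    SUBTARGET_SM_DWORD     = 352,  /* SM-minor dword */
    SUBTARGET_HAS_VEC_LDST = 648   /* gates cases 58, 60, 301 */
};

uint32_t sdnode_opcode(const void *sdnode) {
    return read_u32_at(sdnode, SDNODE_NODE_TYPE);
}

int subtarget_has_vec_ldst(const void *subtarget) {
    return read_u32_at(subtarget, SUBTARGET_HAS_VEC_LDST) != 0;
}
```

Keeping the offsets in one enum makes it cheap to re-verify them against a new build, since a single changed constant is easier to spot than a scattered pointer cast.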
Delta Summary vs Upstream LLVM 18 NVPTX
All 11 non-default bodies are NVIDIA-added behaviours relative to a clean LLVM 18.1.4 NVPTX tree. The most visible deltas are the tmem fold (case 158 BITCAST-chain pre-pattern), the v4 tmem store materialiser (case 160), the predicated v4 i64 stores (cases 215/216), the immediate-embedded vector store (case 218), the cp.async.bulk.tensor 2- and 4-lane materialisers (cases 208 and 210), and the chain-pass-through CSE at case 237. The 69 default-slot cases stay unchanged from upstream; they reach SelectCodeCommon (sub_1AAFA40) via the outer trampoline at sub_1AAD9D0. Port the upstream NVPTX selector and add only the 11 NVIDIA bodies and the reimplementation matches tileiras on every test that does not depend on tmem-specific addressing — the Blackwell-specific surface is the only place the two need to converge byte-for-byte.
The cleanest invariant to preserve is the order of the two jump tables and the negative-opcode band. Primary switch first, secondary second: the secondary switch's v2 and v4 emitters fall through into the same sub_1A65F50/sub_1A624D0 helpers the primary switch invokes through cases 188 and 192. Reverse the order and an alt-encoded LoadV2 (opcode 538) reaches the v4 emitter through case 188's LoadParamV4 path, emitting a spurious ld.v4 for what should be a ld.v2. The negative-opcode band must be tested after both jump tables: the predicate (unsigned)(-op - 5313) <= 1 is two-valued and any earlier test would have to special-case the wrap-around. The hasVecLDST boolean must be checked before cases 58, 60, and 301, because all three bodies emit MI opcodes the downstream scheduler cannot soak when the target lacks native vector LD/ST.
Subtarget Feature Model
Tileiras recognizes a wide historical NVPTX CPU table, but the driver itself accepts only a narrow Blackwell target set. The backend feature table distinguishes ordinary Blackwell from arch-conditional and family-conditional variants. Tensor memory is present on datacenter Blackwell variants and absent on consumer Blackwell targets.
| Target family | Tensor memory | Notes |
|---|---|---|
| Hopper base and older | No | WGMMA and TMA support depends on Hopper feature bits, not tensor memory. |
| Datacenter Blackwell arch/family variants | Yes | Required for tcgen05 tensor-memory MMA paths. |
| Consumer Blackwell | No | Uses block-scaled MMA where tensor memory is unavailable. |
Target validation should be explicit and early:
void validate_tcgen05_target(const SDNode *node, const Subtarget *st) {
if (!st->has_tensor_memory)
fatal("Not supported on this architecture");
if (!st->is_blackwell_datacenter_variant)
fatal("tcgen05.mma supported only on arch-conditional or family-conditional variants from SM100 onwards.");
if (!ptx_version_supports_tcgen05(st->ptx_version))
fatal("tcgen05.mma requires a newer PTX version");
}
Packed Narrow-Float Conversion
The cvt_packfloat family validates both source and destination narrow-float formats. Source and destination get packed into small integer fields, then checked against SM and PTX feature gates.
void validate_packfloat(const SDNode *node, const Subtarget *st) {
PackedFloatMode mode = decode_packfloat_mode(node->immediate);
if (!st->supports_sm90_or_newer || !st->supports_ptx_78_or_newer)
fatal("cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78");
if (mode.uses_fp6_or_fp4 && !st->is_blackwell_arch_conditional)
fatal("FP6/FP4 packed conversion requires Blackwell arch-conditional support");
if (mode.uses_ue8m0x2 && !st->is_blackwell_arch_conditional)
fatal("UE8M0x2 packed conversion requires Blackwell arch-conditional support");
}
The exact diagnostic spelling is part of compatibility for test suites that assert error text.
AsmWriter String Tables
The NVPTX AsmWriter stows its opcode mnemonic pool and physical-register-name pool in an obfuscated data segment, then decodes them once before first use. The cipher is intentionally simple: byte i is XORed with (3 * i) mod 256.
void xor3_decode(uint8_t *begin, uint8_t *end) {
uint8_t key = 0;
for (uint8_t *p = begin; p != end; ++p) {
*p ^= key;
key = (uint8_t)(key + 3);
}
}
It is not a security boundary. The cipher prevents naive string extraction from surfacing every PTX mnemonic, but it is fully reversible and deterministic. A compatible open implementation can store mnemonic tables plainly unless binary-for-binary compatibility with NVIDIA's object layout is a goal.
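Because XOR is its own inverse and the keystream depends only on the byte index, the same routine both encodes and decodes. The sketch below restates the decoder in length-based form to make the round trip easy to check; xor3_apply is a hypothetical name, not a recovered symbol.

```c
#include <stddef.h>
#include <stdint.h>

/* Same keystream as the decoder above: byte i is XORed with (3 * i) mod 256.
   Applying it twice restores the original buffer. */
void xor3_apply(uint8_t *buf, size_t len) {
    uint8_t key = 0;
    for (size_t i = 0; i < len; ++i) {
        buf[i] ^= key;
        key = (uint8_t)(key + 3);   /* wraps modulo 256 */
    }
}
```

Byte 0 is always left unchanged because the first key is zero, so the first character of every obfuscated table still appears in plaintext. That is one more reason to treat the scheme as an extraction deterrent only.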
SDNode Layout Fingerprint
The selector repeatedly reads a 27-bit operand-count field from every SDNode. That is the standard LLVM SDNode::getNumOperands() layout: low 27 bits for the operand count, upper bits for status flags. Operands live in a contiguous SDUse array immediately before the node header.
uint32_t sdnode_num_operands(const SDNode *node) {
return node->operand_count_and_flags & ((1u << 27) - 1u);
}
SDUse *sdnode_operands(const SDNode *node) {
uint32_t n = sdnode_num_operands(node);
return (SDUse *)((uint8_t *)node - n * sizeof(SDUse));
}
For reverse engineering, this layout identifies SelectionDAG walkers in a stripped binary. For reimplementation, it matters only when reproducing the in-memory ABI of NVIDIA's LLVM fork — otherwise use the public LLVM APIs.
NVIDIA-Specific ISel Patches
The tileiras NVPTX selector carries three byte-identifiable patches over upstream LLVM 21 SelectionDAG with no counterpart in any open-source NVPTX target. Each patch lives in a dedicated arm of one of the dispatchers documented earlier on this page, and each is fingerprintable from a stripped binary because the validator function addresses, intrinsic ID ranges, and diagnostic strings stay stable across builds.
The first patch lives at sub_1A84900 (2 066 bytes) and is the cvt_packfloat 4-gate validator. It is reached from the intrinsic-ID range map for IDs 8294 / 8437-8440 / 8627 / 9123 / 9531-9537, covering the FP6, FP4, and UE8M0x2 packed-conversion ops. The validator splits the encoded mode argument into two nibbles — v10 = arg & 0xF for the source narrow-float type and v9 = arg >> 4 for the destination — then runs four cascaded gates. Gate one requires SM major at least 0x384 (sm_90) and PTX version at least 0x4D (PTX 7.7); failure emits "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78" (typo preserved).
⚡ QUIRK —
atleast typo in gate-one diagnostic, plus mismatched PTX number
The gate-one diagnostic at sub_1A84900 is the verbatim binary string "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78": the missing space in atleast is preserved byte-for-byte, and the message advertises PTX >= 78 even though the actual compare is cc.ptx >= 0x4D (PTX 7.7). A reimplementer who "fixes" either the spelling or the number desyncs test-suite log scrapers that key on the verbatim string.
Gate two fires when the destination nibble selects UE8M0x2 and requires SM major at least 0xA0 (sm_100a); failure emits "ue8m0x2 type in cvt_packfloat intrinsic supported only in arch-conditional or family-conditional variants from SM100 onwards.". Gate three fires when the destination nibble selects fp6x2 or fp4x2 and applies the same SM major check; failure emits "{fp6/fp4}x2 types in cvt_packfloat intrinsic supported only in arch-conditional variants from SM100 onwards.". Gate four fires when the destination nibble selects the family-conditional path and additionally requires cc.minor == 0xF, the sm_100f marker. Any failing gate makes the validator return a poison SDNode marked IsErr, which the dispatcher drops without falling through to the MatcherTable.
The second patch sits at case 0x66 of SelectIntrinsic_W_Chain and is a per-call FTZ override for fused multiply-add. Upstream LLVM handles FMA FTZ semantics only at the TargetOption layer — the nvptx-f32ftz codegen option is read once when the TargetMachine is constructed and every FMA in the module inherits the same FTZ flavor. Tileiras probes two per-instruction sources instead. It first tests the SDNode flag bit 0x40 (NoFPExcept); if set, FTZ opcode 0x65 is forced regardless of any function attribute. If the flag is clear it then calls sub_3FC6800(F, "unsafe-fp-math", 0xE) against the surrounding LLVMFunction *. Attribute set selects FTZ opcode 0x65 (FMA_FTZ); attribute unset selects non-FTZ opcode 0xF7 (FMA, wrapped). The NoFPExcept bit is the standard LLVM flag, but its NVIDIA-specific consequence — forcing FTZ rather than merely allowing it — is not upstream. Both paths are non-upstream.
The third patch lives at sub_1A80A40 and gates the tcgen05 128-bit atom (intrinsic ID 9132). The function tests cc.major >= 0xA0 && hasFeature(80), where feature byte 80 is the tmem subtarget feature, checked against the byte at unk_5BEBD51. If either side fails the function emits "128b atomics not supported on this architecture!" (verbatim, exclamation point included) and returns a poison SDNode. The fingerprint is unusually clean: a single CMP+JL against 0xA0 followed by a CMP+JZ against the tmem feature byte, with no fallthrough to the MatcherTable. The patch is therefore trivially locatable in a stripped binary, and the diagnostic string is unique enough that grep over a binary dump lands directly on the function epilogue.
| Patch | Address / Site | Intrinsic ID(s) | Gate condition |
|---|---|---|---|
| cvt_packfloat 4-gate validator | sub_1A84900 (2 066 B) | 8294, 8437-8440, 8627, 9123, 9531-9537 | SM major and PTX version floor, plus per-format arch-conditional sm_100a checks, plus family-conditional sm_100f check |
| FMAD FTZ split | case 0x66 of SelectIntrinsic_W_Chain | FMA path | SDNode flag bit 0x40 OR "unsafe-fp-math" function attribute selects FTZ opcode 0x65; otherwise non-FTZ opcode 0xF7 |
| 128-bit atomic guard | sub_1A80A40 | 9132 | cc.major >= 0xA0 && hasFeature(80) (tmem feature at unk_5BEBD51) |
SDNode *handlePacketCvt(SDNode *n) { /* patch 1 */
uint8_t srcNib = n->immediate & 0xF, dstNib = (n->immediate >> 4) & 0xF; /* nibbles of the encoded mode argument, not the intrinsic ID */
if (subtarget->major < 0x384 || subtarget->ptx < 0x4D) return error("...");
if (isUE8M0x2(dstNib) && subtarget->major < 0xA0) return error("...");
if (isFp6x2OrFp4x2(dstNib) && subtarget->major < 0xA0) return error("...");
if (isFamilyCond(dstNib) && (subtarget->major < 0xA0 || subtarget->minor != 0xF))
return error("...");
return emitCvtPackFloat(n);
}
unsigned pickFmaOpcode(const Function *f, const SDNode *n) { /* patch 2 */
if (n->flags & 0x40) return 0x65;
if (sub_3FC6800(f, "unsafe-fp-math", 0xE)) return 0x65;
return 0xF7;
}
SDNode *handle128bAtomic(SDNode *n) { /* patch 3 */
if (subtarget->major < 0xA0 || !subtarget->hasFeature(80))
return error("128b atomics not supported on this architecture!");
return emitTcgen05Atom128(n);
}
The three patches share a structural property worth calling out. Each sits at a single, well-defined dispatcher arm rather than scattering across the selector, and each returns a poison SDNode marked IsErr on failure rather than falling through to the MatcherTable. A reimplementation can drop these arms in or out independently without disturbing the rest of the selector, and a test suite can assert the exact diagnostic strings without worrying about ordering against unrelated cases. The cvt_packfloat validator reuses the same nibble-decode shape the case-0x66 FMA selector uses for its flag-bit test, suggesting both patches were introduced through the same internal mechanism even though they live in different dispatcher layers. See NVPTX Subtarget — Runtime Feature State and The 81 Feature Indices for the subtarget byte layout backing cc.major, cc.minor, and the tmem feature byte at unk_5BEBD51.
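The nibble split and the gate-one thresholds are simple enough to unit-test in isolation. The sketch below is illustrative only: the struct and function names are hypothetical, the mode argument is assumed to fit in one byte, and the thresholds are the raw compare values quoted above (0x384 for the SM field, 0x4D for the PTX field).

```c
#include <stdint.h>

typedef struct {
    uint8_t src;   /* low nibble: source narrow-float type */
    uint8_t dst;   /* high nibble: destination narrow-float type */
} PackFloatNibbles;

/* Splits the encoded mode argument into its two nibbles. */
PackFloatNibbles split_packfloat_mode(uint32_t arg) {
    PackFloatNibbles n = { (uint8_t)(arg & 0xF), (uint8_t)((arg >> 4) & 0xF) };
    return n;
}

/* Gate one as described: SM value >= 0x384 (900) and PTX value >= 0x4D (77).
   Note the compare admits PTX 7.7 although the diagnostic text says 78. */
int packfloat_gate_one(uint32_t sm, uint32_t ptx) {
    return sm >= 0x384u && ptx >= 0x4Du;
}
```

A compatibility test that feeds PTX value 77 through this gate documents the mismatch between the compare and the message, which is the behaviour a faithful reimplementation has to keep.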
Connection to NVPTXProxyRegErasure Peephole
ISel does not run alone. The selector emits MIR that downstream peephole passes consume, and the cleanest illustration of the ISel/peephole contract is the relationship between NVPTXISD::ProxyReg (introduced during lowering) and the NVPTXProxyRegErasure pass that runs immediately after instruction selection finishes.
ProxyReg exists because NVPTX has a typed register hierarchy and the generic ISD::CopyToReg carries no type-class information. When LowerCopyToReg needs to materialize a copy whose source register class differs from the destination — for example, a value typed as i32 flowing into a register slot the next instruction reads as i16 — it wraps the copy in a ProxyReg SDNode that pins the source class. The MatcherTable matches the wrapped form against one of four contiguous machine opcodes:
| MI opcode | Type class | Register class | TableGen name |
|---|---|---|---|
| 3156 | i16 | Int16Regs | ProxyRegI16 |
| 3157 | i32 | Int32Regs | ProxyRegI32 |
| 3158 | i64 | Int64Regs | ProxyRegI64 |
| 3159 | f32 / f64 | Float32Regs / Float64Regs | ProxyRegF |
The contiguous opcode range [3156, 3159] is not an accident. The TableGen-side consolidation that landed in LLVM 21 (the typed-ProxyReg patch) replaced the older ProxyRegInst<*> template, which generated one opcode per source type, with a four-way emit that produces these four opcodes from a single multiclass. The TableGen emitter assigns contiguous indices to records produced by the same multiclass, so the four ProxyReg* records end up adjacent in the generated MachineInstrInfo table. The peephole pass exploits the adjacency: it tests MI.getOpcode() >= 3156 && MI.getOpcode() <= 3159 rather than carrying a switch over four cases. A non-contiguous range would force the peephole to either enumerate every opcode or carry a target-info bit per machine instruction, both of which add bytes to the hot path.
The peephole itself is small. It walks every basic block of the MachineFunction in layout order, finds each ProxyReg* MI, and replaces it with a COPY from the source virtual register to the destination. The COPY carries the destination's register class on its operand, which the register allocator reads later to pick a physical register from the right bank. The ProxyReg* opcode is erased before the AsmWriter runs.
bool NVPTXProxyRegErasure::runOnMachineFunction(MachineFunction &MF) {
  bool changed = false;
  for (auto &MBB : MF) {
    for (auto it = MBB.begin(); it != MBB.end(); ) {
      MachineInstr &MI = *it++;                 // advance first: MI may be erased below
      unsigned op = MI.getOpcode();
      if (op < 3156 || op > 3159) continue;     /* contiguous range test */
      Register dst = MI.getOperand(0).getReg();
      Register src = MI.getOperand(1).getReg();
      BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(TargetOpcode::COPY), dst)
          .addReg(src);
      MI.eraseFromParent();
      changed = true;
    }
  }
  return changed;
}
The pass is the cleanest example of how ISel and post-ISel peepholes split responsibilities. ISel decides what pseudo-opcode the chain needs; the peephole decides what physical sequence prints. A reimplementation that emits the underlying COPY directly in the selector — skipping the ProxyReg indirection — saves one pass but loses two pieces of information. The first is the source register class, which a bare COPY does not carry on its source operand. The second is the chainability: the ProxyReg SDNode is a chain node, so the DAG combiner respects its ordering during legalization. A bare COPY introduced at lowering time is not chainable and can be reordered past instructions that depend on the copy's effect.
Three other peephole passes consume ISel-introduced pseudo-opcodes through the same shape. NVPTXImageOptimizer rewrites texture and surface intrinsics whose immediates the selector left as placeholders; NVPTXLowerArgs collapses LoadParam byte-offset chains into single ld.param.<wide> instructions when the access pattern allows; NVPTXLowerAggrCopies expands memcpy/memmove pseudo-opcodes into explicit load-store loops. Each pass keys on a contiguous opcode range emitted by the selector, and each pass assumes the selector left the chain intact. Reordering or splitting the selector's emission breaks the peephole's recognition pattern and the optimization silently drops on the floor — no diagnostic, just slower PTX.
Appendix: NVPTXISD Opcode Map
Every SDNode carries a 16-bit SDNode::NodeType field whose numeric value selects between upstream LLVM ISD:: opcodes and the NVPTX-private NVPTXISD:: extensions. The three selectors in Tileiras — the INTRINSIC_W_CHAIN dispatcher, the load/store vector dispatcher, and the MatcherTable cost scorer — each consume a disjoint slice of this numeric space. Together they cover every opcode value the NVPTX backend can emit. The full upstream LLVM ISD::* enum names live in llvm/include/llvm/CodeGen/ISDOpcodes.h; the NVPTX-private additions live in llvm/lib/Target/NVPTX/NVPTXISD.h. Tileiras carries a fork of both headers, fingerprinted by the LLVM21.0.0git producer string.
Dispatcher ranges
The three dispatchers split the opcode space cleanly. SelectIntrinsic_W_Chain at sub_1A854E0 switches across [0x17, 0x172], a 345-case window with 58 non-default bodies. SelectLoadStoreVector at sub_1A874A0 uses two jump tables: a primary table at offset 0x1A874EF covering [158, 237] with 80 cases (11 non-default), and a secondary at offset 0x1A87526 covering [524, 567] with 44 cases. The MatcherTable cost scorer at sub_1AAFA40 consumes the remaining union [0x01, 0x5A] ∪ [0x6D, 0xFD] ∪ [0x120, 0x17A] — a 119-case dispatch that combines upstream ISD:: opcodes with NVPTX-private opcodes inlined directly into the scorer.
| Dispatcher | Address | Numeric range | Cases |
|---|---|---|---|
| SelectIntrinsic_W_Chain | sub_1A854E0 | [0x17, 0x172] | 345 (58 non-default) |
| SelectLoadStoreVector primary | sub_1A874A0:0x1A874EF | [158, 237] | 80 (11 non-default) |
| SelectLoadStoreVector secondary | sub_1A874A0:0x1A87526 | [524, 567] | 44 |
| MatcherTable cost scorer | sub_1AAFA40 | [0x01, 0x5A] ∪ [0x6D, 0xFD] ∪ [0x120, 0x17A] | 119 |
Scalar branches outside the jump tables
Eight opcode values inside SelectLoadStoreVector are handled by isolated branches rather than by either jump table: {58, 60, 98, 300, 301, -995, -5313, -5314}. The first three are scalar parameter load/store opcodes the vector selector still has to recognise so it can route them to the scalar selector once hasVecLDST has decided against vectorisation. Opcodes 300 and 301 are the call-argument marshal and call-prototype emit opcodes. The three negative values are not arithmetic underflow: they are NVIDIA's post-LLVM-19 reservation slots for LoadV2_Tmem and StoreV2_Tmem, encoded as signed offsets from a private base so they cannot collide with upstream allocations.
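The routing described above can be sketched as a three-way classifier. The enum and function names are illustrative assumptions, not symbols recovered from the binary; only the two numeric ranges and the eight isolated values come from the text.

```cpp
#include <cstdint>

// Hypothetical labels for the ways SelectLoadStoreVector reaches a handler.
enum class LdStRoute { PrimaryTable, SecondaryTable, IsolatedBranch, NotHandled };

// Classify an opcode using the ranges quoted in the text. The parameter is
// signed because three of the isolated values are negative.
inline LdStRoute classifyLdStOpcode(std::int32_t opcode) {
  if (opcode >= 158 && opcode <= 237) return LdStRoute::PrimaryTable;    // 80 cases
  if (opcode >= 524 && opcode <= 567) return LdStRoute::SecondaryTable;  // 44 cases
  switch (opcode) {
  case 58: case 60: case 98:          // scalar parameter load/store
  case 300: case 301:                 // call-argument marshal, call-prototype emit
  case -995: case -5313: case -5314:  // reserved tensor-memory slots
    return LdStRoute::IsolatedBranch;
  default:
    return LdStRoute::NotHandled;
  }
}
```

Checking the two ranges before the isolated values keeps the common path to two comparisons; the isolated values never fall inside either range, so the order does not change the result.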
Named NVPTXISD opcodes
The following table samples opcodes that have explicit handlers in the three dispatchers. The notes column records what each handler keys on beyond the numeric opcode.
| Numeric | Name | Notes |
|---|---|---|
| 0x65 | FMA_FTZ | FTZ form; picked at case 0x66 of INTRINSIC_W_CHAIN when flag bit 0x40 or the "unsafe-fp-math" Function attribute is set |
| 0xF7 | FMA | non-FTZ form; the fallback when neither the flag bit nor "unsafe-fp-math" is set |
| 58 | LoadParam | scalar param load; hasVecLDST gates whether the vector selector rejects |
| 60 | StoreParam | scalar param store; same gate |
| 98 | StoreParamV2 | aligned-pair param store |
| 158 | LoadV4Tmem | NVPTX tensor-memory v4 load (address space 255) |
| 160 | LoadV4Tmem (alt MVT) | TMEM v4 with alternate MVT operand |
| 188 | StoreV4Tmem | TMEM v4 store |
| 192 | LoadV4Const | constant-AS v4 load |
| 208 | TMA_BULK_TENSOR_V4 | TMA bulk tensor v4 marshal |
| 210 | TMA_BULK_TENSOR_V8 | TMA bulk tensor v8 marshal |
| 215 | STV_U32_GLOBAL_V4 | global v4 u32 store |
| 216 | STV_U64_GLOBAL_V4 | global v4 u64 store |
| 218 | StoreRetval | function return-value marshal |
| 236 | LoadV4_Cluster | cluster-shared v4 load |
| 300 | CallArg | call argument marshal (scalar branch) |
| 301 | CallPrototype | call prototype emit (scalar branch) |
| 524 | STV_PRED_V2_TMEM_V8 | predicated v2 TMEM v8 store |
| 538 / 539 | LoadV2 (alt) | alternate-encoded v2 load pair |
| 543 / 544 | LoadV4 (alt) | alternate-encoded v4 load pair |
| 549–551 | LoadV8 | v8 load family |
| 563 | StoreV2 | v2 store |
| 565–567 | StoreV8 | v8 store family |
MatcherTable range opcodes
The MatcherTable cost scorer at sub_1AAFA40 mixes upstream LLVM opcodes with NVPTX-private opcodes in the same numeric dispatch. Upstream values use their canonical ISD::* numbering and reach the scorer through the generic instruction-selection machinery. NVPTX-private values were inlined into the scorer so pattern cost calculations can fold target-specific knowledge without dispatching back into the generic layer.
| Numeric | Name | Notes |
|---|---|---|
| 0x01 | ISD::LOAD | upstream load matched by upstream patterns |
| 0x4A | ISD::STORE | upstream store matched by upstream patterns |
| 0x65 | NVPTXISD::FMA_FTZ | inlined into cost scorer, same numeric as W_Chain case |
| 0x12C | NVPTXISD::WgmmaDescriptor | wgmma operand marshal |
| 0x140 | NVPTXISD::SETP_* | predicate-set pattern family |
| 0x150 | NVPTXISD::CallArg | call-arg pattern family |
| 0x170 | NVPTXISD::CallPrototype | call-prototype pattern family |
AsmPrinter and Per-SM Windows
Abstract
The final PTX emission layer turns selected operations, operands, attributes,
and module metadata into PTX text that ptxas accepts. By the time execution
arrives here, Tileiras has already lowered MLIR operations to NVVM/LLVM IR,
selected NVPTX machine instructions, and verified subtarget legality. The
printer's job is precise and narrow.
The implementation combines two generated printer roles. The MLIR-facing role
prints nvvm.* operation assembly from TableGen assembly-format descriptions.
The LLVM-MC role prints selected MCInst opcodes from the NVPTX asm-writer
table. They share operand printers, modifier printers, register-name
printing, and module-level PTX emission helpers.
Printer Roles
| Role | Input | Output | Primary responsibility |
|---|---|---|---|
| MLIR operation printer | nvvm.* op, operands, attributes | NVVM dialect assembly | Print op syntax and attributes. |
| MC instruction printer | MCInst opcode and operands | PTX instruction text | Print opcode mnemonic, operands, and suffixes. |
| NVPTX asm printer | LLVM module and machine functions | PTX module text | Print headers, directives, globals, and function bodies. |
The two instruction printers differ in when they run. The MLIR printer describes operations before final machine selection; the MC printer describes the exact PTX instruction after selection. Keep those phases separate in a reimplementation, even when the same helper functions print common modifiers.
The MLIR printer is generated by the dialect's TableGen assemblyFormat and is unrelated to PTX itself. The MC printer is the one PTX consumers care about: it walks an MCInst, looks up the opcode's print shape, and renders mnemonic, modifiers, and operands into the output stream. The next sections document the table layout, the shared-body partition, and the obfuscated mnemonic pool the MC printer reads from.
MC Print Shapes
The MC printer is generated in the style of LLVM AsmWriterEmitter: each
opcode maps to a print shape interleaving literal text, operand slots, and
modifier helpers. Most ordinary ALU, conversion, load/store, branch, and
call instructions share a small set of repeated shapes.
| Shape family | Example PTX family | Printed structure |
|---|---|---|
| One-source move | mov, cvt, simple special ops | mnemonic, destination, source |
| Two-source arithmetic | add, mul, and, or, xor | mnemonic, dst, lhs, rhs |
| Ternary arithmetic | mad, fma, selp | mnemonic, dst, a, b, c |
| Predicate compare | setp, predicate logic | predicate dst, operands, compare suffix |
| Load/store | ld, st, atom, vector memory | address-space suffix plus memory operand |
| Control flow | bra, call, ret, exit | target or call prototype operands |
| Matrix / tensor | mma, wgmma, tcgen05, TMA | shape/type/scope modifiers plus operand groups |
void print_mc_shape(const McPrintShape *shape, MCInst inst, raw_ostream *os) {
  for (int i = 0; i < shape->item_count; ++i) {
    PrintItem item = shape->items[i];
    if (item.kind == PRINT_LITERAL) {
      os_write(os, item.literal);
    } else if (item.kind == PRINT_OPERAND) {
      print_operand(inst, item.operand_index, os);
    } else if (item.kind == PRINT_MODIFIER) {
      print_modifier(inst, item.modifier_kind, item.operand_index, os);
    }
  }
}
The printer performs no subtarget legality checks. By the time an opcode reaches this layer, the selector and machine verifier have already decided it is legal for the chosen target. The printer only renders the selected opcode.
MCOperand Wire Format
The 6,388-case AsmPrinter dispatcher consumes MCInst records that are themselves arrays of 16-byte MCOperand slots. Every operand seen by printOperand, printMemOperand, and the modifier helpers shares one layout, which keeps the shared-body dispatcher's mov rax, [rdi + 16*rcx] idiom uniform across operand classes.
typedef struct MCOperand {
  /*+0x00*/ uint8_t  kind_flags;   // bit 0 = immediate-vs-register
  /*+0x01*/ uint8_t  type_tag;     // see type-tag table below
  /*+0x02*/ uint8_t  pad[6];
  /*+0x08*/ uint64_t value;        // imm value or register number
} MCOperand;
The kind_flags byte carries the discriminator the printer's switch (MO.getKind()) ladder reads first: bit 0 selects the immediate-versus-register branch, and the high bits carry the smaller Expr / FPImm cases the selector promotes when an operand needs a symbolic relocation. The type_tag byte is the operand's element type. Modifier helpers consult it independently of kind_flags because PTX type suffixes are orthogonal to the register-vs-immediate question.
The eight-byte value field holds either an immediate (zero-extended to 64 bits) or a register number drawn from the virtual or physical register banks. The six-byte pad between the discriminator pair and the value keeps the value field naturally 8-byte aligned without growing the struct to 24 bytes — a size that would slow the dispatcher's stride arithmetic.
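The layout claims above can be checked at compile time. The following is a minimal sketch: the struct name `MCOperandSlot` and the helper names are hypothetical, and the text does not say which polarity of bit 0 means "immediate", so the helper only reports the raw bit.

```cpp
#include <cstddef>
#include <cstdint>

// Mirror of the 16-byte slot described above; field names follow the text.
struct MCOperandSlot {
  std::uint8_t  kind_flags;  // +0x00, bit 0 = immediate-vs-register discriminator
  std::uint8_t  type_tag;    // +0x01, element type tag
  std::uint8_t  pad[6];      // +0x02, keeps value 8-byte aligned
  std::uint64_t value;       // +0x08, immediate or register number
};

static_assert(sizeof(MCOperandSlot) == 16, "slot must stay 16 bytes");
static_assert(offsetof(MCOperandSlot, type_tag) == 1, "type_tag at +0x01");
static_assert(offsetof(MCOperandSlot, value) == 8, "value at +0x08");

// Raw discriminator bit; its polarity is an assumption left to the caller.
inline bool lowKindBit(const MCOperandSlot &op) { return (op.kind_flags & 1u) != 0; }

// Byte offset of operand i inside an operand array, i.e. the 16 * i stride.
inline std::size_t operandByteOffset(std::size_t i) { return i * sizeof(MCOperandSlot); }
```

If a port ever grows the struct, the `static_assert` lines fail at build time instead of producing silently shifted operand reads.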
Type Tag Enum
The type_tag byte indexes a small enum the Blackwell block-scale dispatch leans on. The values below are what the SM120 mma.block_scale family and the NVFP4 variants read when picking a .kind::* suffix; type tags below 12 cover the integer and predicate families and inherit from the LLVM MVT numbering.
| Tag | Type | Notes |
|---|---|---|
| 12 | f16 | Half-precision; selected by .kind::f16 and packed .f16x2 paths. |
| 15 | E4M3 (Float8E4M3FN) | OCP FP8 with 4-bit exponent, finite-only mantissa. |
| 16 | E5M2 (Float8E5M2) | OCP FP8 with 5-bit exponent, finite-and-Inf mantissa. |
| 17 | E2M1 (Float4E2M1FN) | OCP MXFP4 / NVFP4 leaf; the BYTE1 == 17 && BYTE2 == 17 predicate inside the block-scale expander gates .scale_vec::2X and .scale_vec::4X. |
| 19 | tf32 | Selected by .kind::tf32; consumed by the legacy mma.sync family on SM80 and later. |
| 20 | mxf8f6f4 | Block-scaled mixed FP8/FP6/FP4 kind tag; selected by .kind::mxf8f6f4. |
| 21 | mxf4 | Block-scaled FP4 kind tag; selected by .kind::mxf4 and .kind::mxf4nvf4. |
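A reimplementation might capture the table as a small enum. This is a sketch under the assumption that only the seven documented tags matter; the identifier names are invented for illustration, while the numeric values and the E2M1 pair predicate come from the table.

```cpp
#include <cstdint>

// Documented type tags; values taken from the table above.
enum TypeTag : std::uint8_t {
  kTagF16      = 12,
  kTagE4M3     = 15,
  kTagE5M2     = 16,
  kTagE2M1     = 17,
  kTagTF32     = 19,
  kTagMxf8f6f4 = 20,
  kTagMxf4     = 21,
};

// Printable name for a documented tag, or nullptr for anything else.
inline const char *typeTagName(std::uint8_t tag) {
  switch (tag) {
  case kTagF16:      return "f16";
  case kTagE4M3:     return "e4m3";
  case kTagE5M2:     return "e5m2";
  case kTagE2M1:     return "e2m1";
  case kTagTF32:     return "tf32";
  case kTagMxf8f6f4: return "mxf8f6f4";
  case kTagMxf4:     return "mxf4";
  default:           return nullptr;
  }
}

// The BYTE1 == 17 && BYTE2 == 17 predicate from the E2M1 row: both operand
// tags must be E2M1 before .scale_vec::2X or .scale_vec::4X is considered.
inline bool allowsWideScaleVec(std::uint8_t aTag, std::uint8_t bTag) {
  return aTag == kTagE2M1 && bTag == kTagE2M1;
}
```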
SM120 Block-Scale Control Word
The SM120 block-scale MMA expander reads a packed control word from MCInst + 280. That offset is the seventh MCOperand slot for the dense form (MI 5468) and the eighth slot for the sparse form (MI 5469). Slot layout: A-fragment, B-fragment, C-accumulator, D-output, SFA handle, SFB handle, control word, optional sparse metadata. The control word's low bytes carry the type tags for A and B, the kind tag, the scale-vec format, and a sync-aligned bit the expander explicitly rejects: only the non-sync-aligned form survives into PTX, and a mismatch produces the nvvm.mma.blockscale currently supports non-sync aligned variants only! diagnostic.
⚡ QUIRK —
mma.block_scale.sync.aligned actually emits non-sync-aligned PTX
The mnemonic family name is mma.block_scale.sync.aligned, but the SM120 expander reads the sync-aligned bit and rejects it: only the non-sync-aligned variant ever survives into PTX. A frontend that sets the sync-aligned bit hoping to match the mnemonic gets the diagnostic nvvm.mma.blockscale currently supports non-sync aligned variants only! rather than a working kernel. The bit and the family name disagree by design.
struct NvvmMmaBlockScaleCtrl {      // 32-bit packed, MCInst + 280
  uint32_t a_type_tag      : 5;    // BYTE1: 15 = E4M3, 16 = E5M2, 17 = E2M1
  uint32_t b_type_tag      : 5;    // BYTE2: same coding as a_type_tag
  uint32_t kind_tag        : 5;    // BYTE4: 20 = mxf8f6f4, 21 = mxf4 (values need 5 bits)
  uint32_t block_scale_fmt : 2;    // BYTE6 bits [4:5]: scale_vec::{1X, 2X, 4X}
  uint32_t sync_aligned    : 1;    // BYTE6 bit 3: must be 0 for block-scale
  uint32_t reserved        : 14;   // pads the word to 32 bits
};
The AsmPrinter consults this control word when emitting mma.block_scale.sync.aligned and the NVFP4 variants. The pre-flight filters before the shared body fires read BYTE6 & 0x38 to pick the scale-vec lane, then check BYTE1 / BYTE2 against the legal type pairs for that lane. Scale-vec 1X accepts the mixed-FP4/FP6/FP8 leaf set; scale-vec 2X requires both type tags to equal 17 (E2M1) and the block-scale format byte to equal 20; scale-vec 4X keeps the same E2M1 pair but binds the kind tag to 21 (mxf4nvf4) and emits NVFP4-only. Each filter carries a verbatim diagnostic string in the binary, which the printer never sees because the MC expander rejects the malformed MCInst before it reaches a shared body.
Operand Slot Stride
The dispatcher walks operand slots at the 16-byte stride the wire format dictates, but the SM120 block-scale expander reaches its later slots at a 40-byte MachineInstr-class stride. MCInst + 280 therefore corresponds to operand index 7 measured at the inflated MI stride, not at the MCOperand stride. A reimplementation that mirrors the AsmPrinter must keep the two strides separate: modifier helpers read MCOperand records at offsets 16 * opIdx; the MC expander reads MI-class operand metadata at offsets 40 * opIdx. Confusing the two produces operand-aliasing bugs the verifier does not catch, because both layouts agree on slot zero.
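One way to keep the two strides from being confused is to give each index its own type. The sketch below assumes nothing beyond the 16-byte and 40-byte strides stated above; the wrapper types `McIndex` and `MiIndex` are hypothetical.

```cpp
#include <cstddef>

// Distinct index types so an MCOperand index cannot be passed where an
// MI-class index is expected (and vice versa).
struct McIndex { std::size_t value; };  // 16-byte MCOperand stride
struct MiIndex { std::size_t value; };  // 40-byte MachineInstr-class stride

constexpr std::size_t byteOffset(McIndex i) { return 16 * i.value; }
constexpr std::size_t byteOffset(MiIndex i) { return 40 * i.value; }

// The control word at +280 is index 7 at the MI stride, not the MCOperand stride.
static_assert(byteOffset(MiIndex{7}) == 280, "control word offset");
static_assert(byteOffset(McIndex{7}) == 112, "same index, different stride");
// Both layouts agree only on slot zero, which is why the bug hides.
static_assert(byteOffset(McIndex{0}) == byteOffset(MiIndex{0}), "slot zero coincides");
```

Because the overloads differ only in parameter type, a call site must state which stride it means, and a mix-up becomes a compile error rather than an aliasing bug.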
Per-SM Reachability
Per-SM availability is enforced before printing. One opcode always prints one PTX spelling; an "SM window" describes which target tiers can reach that opcode from instruction selection.
| Target window | Families that become reachable |
|---|---|
| SM70 / SM75 | Baseline ALU, memory, control flow, and NVVM-intrinsic MMA paths. |
| SM80 / SM86 / SM87 | mma.sync, mma.sp.sync, ldmatrix, cp.async, async barriers. |
| SM89 | SM80 surface plus FP8 mma.sync type combinations. |
| SM90 / SM90a | WGMMA, mbarrier, cluster operations, and TMA tensor-copy forms. |
| SM100 / SM103 | tcgen05, tensor-memory forms, Blackwell cluster/TMA extensions. |
| SM120 / SM121 | Block-scaled warp MMA without tensor-memory tcgen05. |
The separation keeps code generation robust: feature predicates decide which instruction is selected, and the printer stays deterministic.
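The window table can be read as a lookup from SM number to tier. This is a sketch with invented names; it covers only the SM numbers listed in the table, so any other value (including targets the driver accepts but the table does not list) maps to `Unknown` rather than to a guess.

```cpp
// Hypothetical tier labels, one per table row.
enum class SmWindow { Sm70, Sm80, Sm89, Sm90, Sm100, Sm120, Unknown };

// Map an SM number to its reachability window as listed in the table.
inline SmWindow windowFor(unsigned sm) {
  switch (sm) {
  case 70: case 75:           return SmWindow::Sm70;
  case 80: case 86: case 87:  return SmWindow::Sm80;
  case 89:                    return SmWindow::Sm89;
  case 90:                    return SmWindow::Sm90;
  case 100: case 103:         return SmWindow::Sm100;
  case 120: case 121:         return SmWindow::Sm120;
  default:                    return SmWindow::Unknown;
  }
}

// Per the table, tcgen05 and tensor-memory forms are reachable only in the
// SM100 / SM103 window; SM120 / SM121 get block-scaled warp MMA without them.
inline bool reachesTcgen05(unsigned sm) { return windowFor(sm) == SmWindow::Sm100; }
```

A selector would consult a predicate like this before choosing an opcode, so the printer never needs to ask the question.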
Modifier Helpers
Most complexity in PTX printing comes from suffix construction. Modifier helpers map small encoded operands or attributes into PTX tokens.
| Modifier family | Examples |
|---|---|
| Rounding and saturation | .rn, .rz, .sat, .satfinite |
| Memory space | .global, .shared, .shared::cta, .shared::cluster, .local |
| Memory ordering | .relaxed, .acquire, .release, .acq_rel, .sc |
| Scope | .cta, .cluster, .gpu, .sys, .cta::cluster |
| Cache policy | .ca, .cg, .L2::cache_hint |
| CTA grouping | .cta_group::1, .cta_group::2 |
| Matrix shape | .m16n8k32, .m64nNkK, .128x256b |
| Matrix type | .f16, .bf16, .tf32, .e4m3, .e5m2, .s8, .u8 |
| Tensor-copy suffixes | .im2col, .multicast::cluster, .mbarrier::complete_tx::bytes |
void print_ldst_code(LdStCode code, raw_ostream *os) {
  print_memory_space(code.space, os);
  print_cache_policy(code.cache_policy, os);
  print_memory_order(code.order, os);
  print_scope(code.scope, os);
  print_type_suffix(code.type, os);
}
Load/store printing is modifier-driven by design. Address-space tokens are not ordinary free-text operands; they decode from the selected load/store code so invalid order/scope/address-space combinations get rejected before reaching this point.
Modifier Emission Order
PTX is whitespace-tolerant but suffix-order-strict. ptxas parses each
instruction by stripping a dotted suffix sequence off the mnemonic in a
fixed order; reordering the suffixes — even when each individual token is
legal — yields a parse error. The print shapes are built around this
grammar, so a reimplementation must emit modifiers in the same canonical
order the parser expects rather than in the order the operand list happens
to enumerate them.
Atomic Operations
atom[.semantics][.scope].<op>.<type>[.addrspace]
| Slot | Token set |
|---|---|
| semantics | .relaxed, .acquire, .release, .acq_rel (default: .relaxed) |
| scope | .cta, .cluster, .gpu, .sys (default: device) |
| op | .add, .min, .max, .and, .or, .xor, .exch, .cas, .inc, .dec |
| type | .b32, .b64, .u32, .u64, .s32, .f16, .f32, .f64, .f16x2, .bf16, .bf16x2 |
| addrspace | .global, .shared, .shared::cta, .shared::cluster |
Examples:
atom.relaxed.cta.add.u32.shared %r0, [%rd0], %r1;
atom.acq_rel.gpu.cas.b64.global %rd0, [%rd1], %rd2, %rd3;
atom.release.cluster.add.f32 %f0, [%rd0], %f1;
Warp-Synchronous MMA
mma.sync.aligned.<shape>.<alayout>.<blayout>[.satfinite].<dtype>.<atype>.<btype>.<ctype>
The fixed prefix mma.sync.aligned is invariant for the dense form. <shape>
encodes the M/N/K tile size (m8n8k4, m16n8k8, m16n8k16, m16n8k32, and so
on). <alayout> and <blayout> are .row or .col; most shapes fix them to
.row.col, but the tokens still print. The optional .satfinite token, used
by the integer variants, sits between the layouts and the types. The four
type tokens always appear in the order D, A, B, C, which is also the order
in which the operand list enumerates the fragments.
Examples:
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 {%f0,%f1,%f2,%f3}, {%r0,%r1,%r2,%r3}, {%r4,%r5}, {%f4,%f5,%f6,%f7};
mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 {%r0,%r1,%r2,%r3}, {%r4,%r5,%r6,%r7}, {%r8,%r9}, {%r10,%r11,%r12,%r13};
Sparse MMA
mma.sp::ordered_metadata.sync.aligned.<shape>.<alayout>.<blayout>[.satfinite].<dtype>.<atype>.<btype>.<ctype>
The sparsity selector .sp::ordered_metadata sits in a fixed slot between
the mnemonic stem and the .sync.aligned infix. The metadata operand and
the selector byte are extra operands at the end of the print shape; their
suffix tokens do not move.
Warpgroup MMA
wgmma.mma_async.sync.aligned.<shape>.<dtype>.<atype>.<btype>[.scaleD][.scaleAB][.transA][.transB]
<shape> for WGMMA is the m64nNkK family. The destination type slot
precedes the A and B type slots because the warpgroup form treats D as
the architectural state and A/B as streamed inputs. The optional
scale-and-transpose suffixes follow in the order scaleD, scaleAB,
transA, transB; the printer omits each suffix when its operand carries
the default value.
Example:
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f0,...,%f63}, %rd_descA, %rd_descB, 1, 1, 0, 0;
Tensor-Memory MMA (tcgen05)
tcgen05.mma[.cta_group::N][.scale_input_acc][.block_scale][.sp::ordered_metadata].<kind>.<shape>.<dtype>.<atype>.<btype>.<ctype>[.satfinite]
<kind> is one of .kind::f16, .kind::tf32, .kind::f8f6f4,
.kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4. The block-scale and
scale-input-acc flags are positional booleans whose presence depends on
operand bits documented in the SM120 control-word section above. The
suffix grammar is the strictest in the ISA: every optional token has a
fixed slot, and the parser rejects any reordering.
TMA Bulk-Tensor Copies
cp.async.bulk.tensor.<rank>d.<dst_space>.<src_space>[.<mode>].mbarrier::complete_tx::bytes[.multicast::cluster][.L2::cache_hint]
| Slot | Token set |
|---|---|
| rank | 1d, 2d, 3d, 4d, 5d |
| dst/src space | shared::cluster.global, global.shared::cta, shared::cta.shared::cluster |
| mode | .tile, .im2col, .im2col::w, .im2col::w::128 |
| barrier | .mbarrier::complete_tx::bytes (required for the load form) |
| multicast | .multicast::cluster (optional, only on load) |
| L2 hint | .L2::cache_hint (optional) |
Example:
cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint
[%rd_dst], [%rd_desc, {%r_x, %r_y}], [%rd_bar], %h_mask, %rd_hint;
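The load-form grammar above can be sketched as a fixed-order string builder. The struct and function names are hypothetical, and the sketch covers only the load form, where the text says the mbarrier completion token is required.

```cpp
#include <string>

// Hypothetical description of one bulk-tensor load.
struct BulkTensorCopy {
  int rank;              // 1 through 5
  std::string dstSrc;    // e.g. "shared::cluster.global"
  std::string mode;      // "" to omit, else "tile", "im2col", ...
  bool multicast;        // .multicast::cluster (load only)
  bool l2CacheHint;      // .L2::cache_hint
};

// Emit the suffixes in grammar order: rank, spaces, mode, barrier,
// multicast, L2 hint. Each optional token keeps its fixed slot.
inline std::string bulkTensorMnemonic(const BulkTensorCopy &c) {
  std::string s = "cp.async.bulk.tensor." + std::to_string(c.rank) + "d." + c.dstSrc;
  if (!c.mode.empty()) s += "." + c.mode;
  s += ".mbarrier::complete_tx::bytes";
  if (c.multicast) s += ".multicast::cluster";
  if (c.l2CacheHint) s += ".L2::cache_hint";
  return s;
}
```

Called with rank 2, `shared::cluster.global`, mode `tile`, and both optional flags set, the builder reproduces the mnemonic in the example above.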
Async Copies
cp.async.<dst_space>.<src_space>.<size>[.<cache_hint>][.L2::cache_hint][.commit_group]
<size> is the bytes-per-thread token (.4, .8, .16). The
.commit_group suffix is emitted by a separate print shape on the
companion cp.async.commit_group instruction; it does not stack onto the
copy itself. The printer enforces the grammar by laying out the modifier
helpers in the exact order above and never letting one helper print into
another's slot.
Suffix Slot Invariant
The print shapes share a slot invariant: every modifier helper consumes a
specific operand of the MCInst and renders into its assigned grammar
slot regardless of operand-vector order. A reimplementation that prints
suffixes by walking operands in order will produce strings that look
plausible but get rejected by ptxas. Always drive the suffix emission
from the print shape's slot table, not from the operand vector's index.
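The invariant can be illustrated with a toy slot table. Everything here is an assumption made for illustration: operands are pre-rendered into tokens, and the slot table is just a list of operand indices in grammar order.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// slots[i] is the operand index whose rendered token fills grammar slot i.
// An empty token means the operand carries its default and prints nothing.
inline std::string emitFromSlotTable(const std::string &stem,
                                     const std::vector<std::size_t> &slots,
                                     const std::vector<std::string> &operandTokens) {
  std::string out = stem;
  for (std::size_t idx : slots) out += operandTokens.at(idx);
  return out;
}

// The tempting but wrong alternative: print tokens in operand-vector order.
inline std::string emitInOperandOrder(const std::string &stem,
                                      const std::vector<std::string> &operandTokens) {
  std::string out = stem;
  for (const std::string &tok : operandTokens) out += tok;
  return out;
}
```

With operand tokens `{".u32", ".add", ".relaxed"}` and slot table `{2, 1, 0}`, the first function yields `atom.relaxed.add.u32`, while the second yields `atom.u32.add.relaxed`, which looks plausible but puts every token in the wrong slot.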
Module Emission
The outer NVPTXAsmPrinter emits PTX module structure around individual
instructions.
| Module element | Printed PTX |
|---|---|
| Header | .version, .target, .address_size |
| Kernel directives | .entry, .reqntid, .maxntid, .minnctapersm, .maxnreg |
| Cluster directives | .explicitcluster, .maxclusterrank, .blocksareclusters |
| Visibility | .visible, .extern, .weak |
| Globals | .global, .const, .texref, .surfref, .samplerref |
| Managed/unified metadata | .attribute(.managed), .attribute(.unified(...)) |
| Function frame | local depot, %SP, %SPL, virtual register declarations |
| Function body | brace-delimited PTX instructions |
void emit_ptx_module(Module module, NvptxTarget target, raw_ostream *os) {
  emit_ptx_header(target, os);
  emit_module_globals(module, os);
  for (Function fn : module.functions()) {
    emit_function_directives(fn, target, os);
    emit_function_body_start(fn, os);
    emit_machine_instructions(fn, os);
    emit_function_body_end(fn, os);
  }
}
The .blocksareclusters directive requires both thread-block dimensions and
a cluster dimension. Emitters must diagnose a kernel that requests it
without both early: a header-only correction later cannot repair a
malformed kernel launch contract.
Module Header Directives
Every PTX module begins with three mandatory directives; the optional
debug flag rides on the .target line. The header emission runs once per
Module and draws every value from the active NvptxSubtarget plus the
TargetMachine debug flag.
//
// Generated by NVIDIA NVPTX Compiler
//
// Compiler Build ID: <build id>
// Cuda compilation tools, release 13.1
// Based on NVVM 7.0.1
//
.version 8.4
.target sm_90a, debug
.address_size 64
| Directive | Source | Notes |
|---|---|---|
| .version | PTX ISA version selected by the subtarget. | 8.4 for the CUDA 13.1 baseline; bumped per ISA-feature requirement. |
| .target | Lowered SM name plus optional ,debug. | sm_90a adds the architecture-specific a suffix when SM90 architecture-specific intrinsics are used. |
| .address_size | Pointer width of the host-device interface. | Always 64 in this build; the 32-bit path is removed. |
The .target line carries up to three comma-separated tokens: the SM name
(with an optional a suffix attached to it), the debug flag, and the
optional map_f64_to_f32 legacy flag. The printer picks the SM name from
Subtarget.getSmVersion(), appends a when the function or any global
references an architecture-specific feature (SM90a warpgroup MMA, SM100
tensor memory, SM120 block-scale MMA), appends debug when the
TargetMachine debug level is non-zero, and appends map_f64_to_f32 only
for the legacy fp64 emulation path that the modern compiler never
selects.
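The assembly of the .target line can be sketched as follows. The function name and its boolean parameters are illustrative assumptions; the token spellings and their order follow the description and the sample header above.

```cpp
#include <string>

// Build the .target directive: SM name, optional "a" suffix on the name,
// then the comma-separated debug and map_f64_to_f32 tokens.
inline std::string targetLine(unsigned smVersion, bool archSpecific,
                              bool debug, bool mapF64ToF32) {
  std::string s = ".target sm_" + std::to_string(smVersion);
  if (archSpecific) s += "a";            // suffix attaches to the SM name itself
  if (debug) s += ", debug";
  if (mapF64ToF32) s += ", map_f64_to_f32";
  return s;
}
```

For SM version 90 with the architecture-specific and debug flags set, the sketch produces the same `.target sm_90a, debug` line shown in the sample header.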
The header banner above the directives is a fixed-format comment block
the AsmPrinter emits before the first directive. The build-ID line lets
post-link tools correlate a .ptx artefact with the exact tileiras
binary that produced it; the NVVM-version line documents the bytecode
schema feeding the printer.
Kernel Directive Emission
When the AsmPrinter encounters a kernel function (ptx_kernel calling
convention on the LLVM function, equivalent to an nvvm.kernel attribute
on the MLIR side), it emits a fixed-order directive cluster before the
function body.
.visible .entry KernelName(
.param .b64 KernelName_param_0,
.param .b32 KernelName_param_1,
.param .align 8 .b8 KernelName_param_2[16]
)
.reqntid 128, 1, 1
.maxntid 256, 1, 1
.minnctapersm 2
.maxnreg 64
.maxclusterrank 8
.explicitcluster
{
// function body
}
| Slot | Directive | MIR/LLVM source attribute |
|---|---|---|
| 1 | .visible / .weak / .extern linkage marker | LLVM linkage (external, weak_odr, internal). |
| 2 | .entry plus name and parameter list | ptx_kernel calling convention. |
| 3 | .reqntid X, Y, Z | nvvm.reqntid attribute / !reqntid{x,y,z} metadata. |
| 4 | .maxntid X, Y, Z | nvvm.maxntid attribute. |
| 5 | .minnctapersm N | nvvm.minctasm / minnctapersm metadata. |
| 6 | .maxnreg N | nvvm.maxnreg metadata. |
| 7 | .maxclusterrank N | nvvm.cluster_max_blocks attribute. |
| 8 | .reqnctapercluster X, Y, Z | nvvm.cluster_dim attribute. |
| 9 | .explicitcluster | nvvm.explicit_cluster attribute. |
| 10 | .blocksareclusters | nvvm.blocks_are_clusters (SM90+). |
| 11 | { | opens the function body. |
The order is fixed: the printer never reorders these directives based on
attribute traversal order, and a reimplementation must emit slots that
exist in the same canonical order. Slots whose attribute is absent are
simply skipped — there is no placeholder. The parameter list inside
.entry(...) is a separate sub-emission that walks the function's formal
parameters in declaration order, picks .param storage modifiers from
each parameter's byval/align/type attributes, and emits the
.b8 paramN[size] form for aggregate parameters that arrived through
ABI-mandated indirection.
For non-kernel device functions the .entry token is replaced by
.func, the visibility marker may be .visible/.weak/.extern, the
parameter list takes a different syntactic shape, and slots 3 through 10
are omitted entirely. The shared sub-emitter is the same; only the
slot-table varies.
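The fixed slot order for the kernel-only directives (slots 3 through 10) can be sketched with optional fields that are emitted in canonical order and skipped when absent. The struct and function names are hypothetical; the directive spellings and their order come from the slot table.

```cpp
#include <optional>
#include <string>

struct Dim3 { unsigned x, y, z; };

// Hypothetical attribute bag; an empty optional means "attribute absent".
struct KernelAttrs {
  std::optional<Dim3> reqntid;
  std::optional<Dim3> maxntid;
  std::optional<unsigned> minnctapersm;
  std::optional<unsigned> maxnreg;
  std::optional<unsigned> maxclusterrank;
  std::optional<Dim3> reqnctapercluster;
  bool explicitcluster = false;
  bool blocksareclusters = false;
};

inline std::string dims(const Dim3 &d) {
  return std::to_string(d.x) + ", " + std::to_string(d.y) + ", " + std::to_string(d.z);
}

// Emit slots 3..10 in table order; absent slots leave no placeholder.
inline std::string emitKernelDirectives(const KernelAttrs &a) {
  std::string out;
  if (a.reqntid)           out += ".reqntid " + dims(*a.reqntid) + "\n";
  if (a.maxntid)           out += ".maxntid " + dims(*a.maxntid) + "\n";
  if (a.minnctapersm)      out += ".minnctapersm " + std::to_string(*a.minnctapersm) + "\n";
  if (a.maxnreg)           out += ".maxnreg " + std::to_string(*a.maxnreg) + "\n";
  if (a.maxclusterrank)    out += ".maxclusterrank " + std::to_string(*a.maxclusterrank) + "\n";
  if (a.reqnctapercluster) out += ".reqnctapercluster " + dims(*a.reqnctapercluster) + "\n";
  if (a.explicitcluster)   out += ".explicitcluster\n";
  if (a.blocksareclusters) out += ".blocksareclusters\n";
  return out;
}
```

Because the emission order is written into the function body, the order in which a caller happens to set the attributes has no effect on the output.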
Mnemonics and Register Names
Mnemonic lookup is table-driven: the MC opcode indexes a generated table and returns the PTX mnemonic stem. Register printing decodes the logical NVPTX register class, then prints a PTX register prefix and the register number.
| Register class | PTX prefix | Width | Use |
|---|---|---|---|
| Predicate | %p | 1 bit | Predicates and condition flags |
| 16-bit GPR | %rs | 16 bits | Half-width integer storage |
| 32-bit GPR | %r | 32 bits | Integer and bit-pattern values |
| 64-bit GPR | %rd | 64 bits | Pointers and 64-bit integers |
| 32-bit float view | %f | 32 bits | Float spelling of the 32-bit bank |
| 64-bit float view | %fd | 64 bits | Float spelling of the 64-bit bank |
| 128-bit GPR | %rq | 128 bits | Wide descriptors and grouped operands |
| Special registers | named PTX registers | varies | %tid.x, %laneid, %clock64, etc. |
void print_register_name(NvptxRegister reg, raw_ostream *os) {
  switch (reg.class_id) {
  case REG_PRED:
    os_printf(os, "%%p%u", reg.number);
    return;
  case REG_I16:
    os_printf(os, "%%rs%u", reg.number);
    return;
  case REG_I32:
    os_printf(os, "%%r%u", reg.number);
    return;
  case REG_I64:
    os_printf(os, "%%rd%u", reg.number);
    return;
  case REG_F32:
    os_printf(os, "%%f%u", reg.number);
    return;
  case REG_F64:
    os_printf(os, "%%fd%u", reg.number);
    return;
  case REG_I128:
    os_printf(os, "%%rq%u", reg.number);
    return;
  case REG_SPECIAL:
    os_write(os, special_register_name(reg));
    return;
  }
  fail("bad NVPTX register class");
}
The 32-bit integer and f32 views share one physical register bank. The instruction's type suffix decides whether the value is interpreted as bits, integer, or floating point.
Register Classes
Tileiras exposes the practical NVPTX register classes a reimplementation needs for instruction selection and printing.
| Class | PTX type string | Prefix | Notes |
|---|---|---|---|
| Predicate | .pred | %p | Boolean predicates. |
| 16-bit | .b16 | %rs | Half-width integer or packed data. |
| 32-bit | .b32 | %r | Main scalar bank. |
| 32-bit float view | printed as type suffix | %f | Alias view of the 32-bit bank. |
| Special | special names | named | PTX special-register reads. |
| 64-bit | .b64 | %rd | Pointers, descriptors, and 64-bit scalars. |
| 128-bit | .b128 | %rq | Wide grouped operands and descriptors. |
Read the f32 class as a typed view over the 32-bit register bank. COPY
lowering can use ordinary 32-bit moves; instruction printing selects %f
spelling only when the operand is used as a floating-point register.
Operand Constraint Class Glossary
The PTX inline-asm constraint letters that user code passes to asm("..." :: "r"(x), "l"(p), "f"(v)) correspond one-to-one with NVPTX register
classes. The printer reads the constraint class off each MachineOperand,
selects the matching register prefix, and renders the operand's number
through print_register_name. The constraint letters are also the
canonical naming convention for register banks in PTX documentation, so a
reimplementation needs both directions: constraint-class to printed
prefix on output, and printed prefix back to constraint-class for inline
assembly parsing.
| Class | Width | Constraint letter | Register prefix | Type strings | Typical uses |
|---|---|---|---|---|---|
| b | 1 bit | b | %p | .pred | branch guards, predicate logic, setp destinations |
| h | 16 bits | h | %rs | .b16, .u16, .s16 | multicast masks, im2col offsets, FP16 raw bits |
| r | 32 bits | r | %r | .b32, .u32, .s32 | most arithmetic operands, 32-bit pointers in shared address space |
| l | 64 bits | l | %rd | .b64, .u64, .s64 | generic pointers, TMA descriptors, L2 cache hints |
| f | 32 bits | f | %f | .f32 | FP32 arithmetic |
| d | 64 bits | d | %fd | .f64 | FP64 arithmetic |
| q | 128 bits | q | %rq | .b128 | wide descriptors, 128-bit vector loads, FP128 storage |
The class-to-prefix mapping is deterministic. The printer never picks %r
or %f based on the surrounding instruction's type suffix; it picks the
prefix from the operand's register class, and the type suffix is a
separate modifier the print shape emits independently. The 32-bit integer
and f32 banks share physical registers but carry distinct logical
classes, which is how the printer knows whether to spell a 32-bit live
range as %r3 or %f3.
const char *constraint_class_to_prefix(ConstraintClass cls) {
switch (cls) {
case CLASS_PRED: return "%p";
case CLASS_I16: return "%rs";
case CLASS_I32: return "%r";
case CLASS_I64: return "%rd";
case CLASS_F32: return "%f";
case CLASS_F64: return "%fd";
case CLASS_I128: return "%rq";
}
fail("bad constraint class");
}
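The reverse direction, printed prefix back to constraint class, needs longest-prefix matching because %r is also the start of %rs, %rd, and %rq, and %f is the start of %fd. The following is a minimal sketch rather than the recovered routine; parse_virtual_register and the kPrefixes table are hypothetical names, and the letter assignments are taken from the glossary table.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table: prefixes and constraint letters from the glossary.
 * Longer prefixes come first so "%rd7" is never matched as "%r". */
static const struct { const char *prefix; char letter; } kPrefixes[] = {
    {"%rs", 'h'}, {"%rd", 'l'}, {"%rq", 'q'}, {"%fd", 'd'},
    {"%p",  'b'}, {"%r",  'r'}, {"%f",  'f'},
};

/* Parse a printed virtual register such as "%rd12".
 * Returns the constraint letter and stores the sequence number in *number,
 * or returns 0 when the text is not a known prefix followed only by digits. */
char parse_virtual_register(const char *text, unsigned *number) {
    assert(text != NULL && number != NULL);
    for (size_t i = 0; i < sizeof kPrefixes / sizeof kPrefixes[0]; ++i) {
        size_t len = strlen(kPrefixes[i].prefix);
        if (strncmp(text, kPrefixes[i].prefix, len) != 0)
            continue;
        const char *digits = text + len;
        if (!isdigit((unsigned char)*digits))
            continue;               /* prefix must be followed by a digit */
        char *end;
        unsigned long value = strtoul(digits, &end, 10);
        if (*end != '\0')
            return 0;               /* trailing junk after the number */
        *number = (unsigned)value;
        return kPrefixes[i].letter;
    }
    return 0;
}
```

Special-register names such as %pm0 or %tid.x fall through every row and return 0, so a caller can route them to the special-register path instead.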
Three printing rules cover the corner cases. First, when an operand is a
vector that the load/store needs to spell as a brace-grouped tuple, the
printer emits {%r0, %r1, %r2, %r3} and increments the sequence number
once per element; the constraint class still selects the prefix. Second,
parameter-passing operands use the %da, %fa, %ia, %la, %h, %hh
prefix family rather than the generic prefixes; the printer routes these
through a parallel switch that consults the operand's parameter-class
flag. Third, special registers (%tid.x, %ntid.y, %laneid, %warpid,
%clock, %clock64, %globaltimer, %pm0, %envreg{0..31}) bypass the
prefix table entirely and print their canonical PTX name from the
physical-register pool documented in the
printRegName section below.
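The brace-grouped tuple rule can be sketched as follows. format_register_tuple is a hypothetical helper, not a recovered symbol, and it assumes the tuple's registers carry consecutive sequence numbers, which holds for the example in the text but is not guaranteed in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Write "{<prefix>N, <prefix>N+1, ...}" for `count` consecutive registers
 * into buf. Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small. */
int format_register_tuple(char *buf, size_t size, const char *prefix,
                          unsigned first, unsigned count) {
    assert(buf != NULL && prefix != NULL && count > 0);
    size_t used = 0;
    for (unsigned i = 0; i < count; ++i) {
        /* The first element opens the brace; later ones add a separator. */
        int n = snprintf(buf + used, size - used, "%s%s%u",
                         i == 0 ? "{" : ", ", prefix, first + i);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    if (size - used < 2)            /* room for '}' and the NUL */
        return -1;
    buf[used++] = '}';
    buf[used] = '\0';
    return (int)used;
}
```

The prefix argument is whatever the class-to-prefix mapping returned, so the same helper serves %r, %f, and %rd tuples.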
AsmWriter String Pools and the XOR-3 Walking Cipher
The MC printer's two string pools live not in .rodata like a stock LLVM
build, but XOR-encrypted in .data, decrypted in place during pre-main
initialization. The mnemonic pool occupies 0x5A4C080..0x5A656F0 —
exactly 103,536 bytes (~105 KB) — and stores every PTX opcode stem plus
three AsmWriter tail fragments. The physical-register-name pool occupies
0x5A4BE20..0x5A4C06A — exactly 586 bytes — and stores the 90 named
registers printRegName returns for class 0. Both pools share one cipher
and one initialization shape; the two init routines sub_1BD1810 and
sub_1BD1830 are 20-line bodies differing only in begin and end pointers.
The cipher is a walking byte XOR with a fully deterministic key schedule
k[i] = (3 * i) mod 256. Because gcd(3, 256) = 1, the orbit
0, 3, 6, ..., 255, 2, 5, ... enumerates every residue 0..255 exactly once
per 256-byte window before repeating. The cipher is linear, byte-granular,
and trivially invertible by replaying the same pass over the ciphertext. A
strings tileiras scan surfaces zero PTX mnemonics; the design target is
defeating naive static analysis, not real security.
void xor3_decode(uint8_t *p, uint8_t *end) {
uint8_t k = 0;
while (p != end) { *p++ ^= k; k += 3; }
}
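Because XOR is self-inverse, one pass both encodes and decodes. The sketch below is an illustration with hypothetical names (xor3_pass, xor3_key_is_permutation); it uses a length instead of an end pointer and checks the permutation claim for one 256-byte window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One walking-XOR pass with key k[i] = (3 * i) mod 256.
 * Applying it twice restores the original bytes. */
void xor3_pass(uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    uint8_t k = 0;
    for (size_t i = 0; i < len; ++i) {
        buf[i] ^= k;
        k = (uint8_t)(k + 3);       /* wraps modulo 256 */
    }
}

/* Returns 1 if the 256 key bytes of one window are all distinct and the
 * schedule is back at 0 afterwards, 0 otherwise. */
int xor3_key_is_permutation(void) {
    uint8_t seen[256] = {0};
    uint8_t k = 0;
    for (int i = 0; i < 256; ++i) {
        if (seen[k])
            return 0;
        seen[k] = 1;
        k = (uint8_t)(k + 3);
    }
    return k == 0;
}
```

Note that the key for byte 0 is zero, so the first byte of each 256-byte window is stored unmodified; only the bytes around it are scrambled.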
After decryption the mnemonic pool decodes to 3,067 NUL-delimited chunks.
The first three are AsmWriter tail fragments emitted after the final
operand of an instruction template: "},\n\t\t", "},\n\t", ";\n\t". The
remaining 3,064 entries are PTX mnemonic stems plus the per-template
prefix tokens AsmWriterEmitter packs in front of long-form opcodes. The
register pool decodes to 90 names covering seven virtual-register-class
prefixes (%p, %rs, %r, %rd, %f, %fd, %rq), the
parameter-passing prefixes (%da, %fa, %ia, %la, %h, %hh, plus
32 %envreg{0..31} slots), and the three frame registers %Depot, %SP,
%SPL.
Each decrypter is gated by a pthread_once flag in .bss. The mnemonic
pool uses dword_5B4F4D8; the register pool uses dword_5B4F4C0. Once the
walking-XOR pass returns, getMnemonic runs the Itanium-ABI "safely
initialize local static" dance to publish the decoded pool's base address
into a shared cache: __cxa_guard_acquire (sub_44A8A10) on the
byte_5B4F4C8 lock byte, the cache write to qword_5B4F4D0, then
__cxa_guard_release (sub_44A8AC0). Subsequent calls observe the
already-initialized guard and skip directly to the table lookup.
getMnemonic and the Offset Tables
MC opcode lookup is a pair of parallel .rodata tables indexed by the
32-bit MC opcode. dword_4D4D360 carries the packed mnemonic descriptor:
low 17 bits hold the byte offset into the decoded mnemonic pool, high 15
bits hold the per-opcode tail-state bits the print shape consults to pick
a trailing separator. The companion table dword_4D468C0 carries the
operand-width flags, modifier class index, and fragment indices that drive
the modifier helpers. Both tables hold 6,824 entries of uint32 each.
The first 293 slots are zero, matching LLVM's generic TargetOpcode
prelude (G_ADD, G_PHI, G_IMPLICIT_DEF, and the rest); real NVPTX
opcodes begin at index 293.
const char *getMnemonic(const MCInst *MI) {
pthread_once(&once_mnemonic, init_mnemonic_pool);
if (!guard_once && __cxa_guard_acquire(&guard_once)) {
base_ptr_cache = (uintptr_t)&mnemonic_pool[0];
__cxa_guard_release(&guard_once);
}
uint32_t opc = MI->Opcode;
uint32_t offset_tb = mnemonic_offsets[opc]; // dword_4D4D360
uint32_t companion = mnemonic_companion[opc]; // dword_4D468C0
if (offset_tb | ((uint64_t)companion << 32)) {
uint32_t off = offset_tb & 0x1FFFF;
return (const char *)(base_ptr_cache + off - 1);
}
return NULL;
}
The - 1 bias is LLVM's standard AsmWriterEmitter convention. Offset 0
encodes the "no mnemonic" sentinel; the first real mnemonic sits at pool
byte 0 and is reached through stored offset 1. The combined zero check
offset_tb | (companion << 32) lets one 64-bit test reject opcodes that
have neither a mnemonic nor a companion descriptor, so the lookup needs
one branch instead of two. The 17-bit offset field can address 131,072
bytes; the 103,536-byte payload leaves 27,536 bytes spare (26.6 % of the
payload size), consistent with the SM110/SM121 forward-projection
allowance baked into this build's MC opcode table. The empirical maximum
lo17 observed is 103,806, which sits inside the trailing NUL slack the
post-link encoder reserves at the end of the pool.
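The field split and the - 1 bias can be sketched in isolation. The names below are hypothetical, the pool in the usage is a toy string, and the sketch ignores the companion word that the real zero check also consults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MNEMONIC_OFFSET_BITS 17u
#define MNEMONIC_OFFSET_MASK ((1u << MNEMONIC_OFFSET_BITS) - 1u) /* 0x1FFFF */

/* The two documented fields of one packed descriptor word. */
typedef struct {
    uint32_t stored_offset;  /* low 17 bits; 0 means "no mnemonic" */
    uint32_t tail_state;     /* high 15 bits */
} MnemonicDescriptor;

MnemonicDescriptor unpack_mnemonic_descriptor(uint32_t word) {
    MnemonicDescriptor d;
    d.stored_offset = word & MNEMONIC_OFFSET_MASK;
    d.tail_state = word >> MNEMONIC_OFFSET_BITS;
    return d;
}

/* Apply the -1 bias: stored offset 1 names pool byte 0.
 * Returns NULL for the sentinel or an offset past the pool. */
const char *mnemonic_at(const char *pool, size_t pool_size, uint32_t word) {
    assert(pool != NULL);
    MnemonicDescriptor d = unpack_mnemonic_descriptor(word);
    if (d.stored_offset == 0 || d.stored_offset > pool_size)
        return NULL;
    return pool + d.stored_offset - 1;
}
```

With a toy pool of "add\0mov", stored offset 1 yields "add" and stored offset 5 yields "mov", whatever the high 15 bits contain.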
The companion-word dword_4D468C0 decomposition is inferred from the
415-value cardinality plus the canonical OpIdx << 8 | ModCls shape that
AsmWriterEmitter emits: the low byte carries operand-width flags, the next byte
indexes the tail-fragment list, the third byte indexes the prefix-fragment
list, and the top byte selects an AsmWriter modifier class. Mark this MED
confidence; the byte boundaries are stable but the per-byte semantic naming
has not been cross-checked against a TableGen build.
| Table | Address | Stride | Count | Purpose |
|---|---|---|---|---|
| word_4D46800 | 0x4D46800 | u16 | 96 | Register-name offsets into the 586-byte pool. |
| dword_4D4D360 | 0x4D4D360 | u32 | 6,824 | Mnemonic offset (low 17 bits) plus tail state (high 15 bits). |
| dword_4D468C0 | 0x4D468C0 | u32 | 6,824 | AsmWriter companion: operand-width flags, modifier class, fragment indices. |
The .bss state cluster lives at four contiguous addresses with an 8-byte
alignment pad between the guard byte and the cache pointer:
| Slot | Address | Width | Role |
|---|---|---|---|
| dword_5B4F4C0 | 0x5B4F4C0 | pthread_once_t | Register-name pool once-flag. |
| byte_5B4F4C8 | 0x5B4F4C8 | uint8_t | Itanium-ABI __cxa_guard_* lock byte for the cache write. |
| qword_5B4F4D0 | 0x5B4F4D0 | uintptr_t | Cached base pointer of the decoded mnemonic pool. |
| dword_5B4F4D8 | 0x5B4F4D8 | pthread_once_t | Mnemonic pool once-flag. |
printRegName and the Register Pool
printRegName (sub_1BD1EB0) is the printer's 8-way class switch. The
top four bits of the MCRegister value select the class; the low 28 bits
carry the sequence number for virtual registers or the MCReg enum value
for physical registers. Class 0 is the physical path: it triggers
pthread_once(&dword_5B4F4C0, init_reg_name_pool), indexes the
register-name pool through word_4D46800[r - 1], then writes the resulting
NUL-terminated string to the output stream. The - 1 bias mirrors the
mnemonic-pool convention; MCRegister 0 is the "no register" sentinel.
Classes 1 through 7 print the seven virtual-register prefixes catalogued
in the Mnemonics and Register Names
section above, concatenated with the low 28 bits as a decimal sequence
number. The decoded 586-byte pool therefore carries the strings the
class-0 path returns directly — physical envregs, parameter-passing
prefixes, frame registers — plus the per-class "first virtual register of
class N" exemplars the MC layer emits for register-allocation dumps.
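The class split is plain bit arithmetic. A minimal sketch, with hypothetical type and function names, of how a reimplementation might separate the class from the payload and compute the index used on the class-0 path:

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned class_id;   /* top 4 bits: 0 = physical, 1..7 = virtual classes */
    uint32_t payload;    /* low 28 bits: sequence number or MCReg enum value */
} DecodedRegister;

DecodedRegister decode_mc_register(uint32_t reg) {
    DecodedRegister d;
    d.class_id = reg >> 28;
    d.payload = reg & 0x0FFFFFFFu;
    return d;
}

/* Index into the register-name offset table for a physical register,
 * applying the -1 bias. Returns -1 for the "no register" sentinel or
 * for any value that is not on the class-0 path. */
int physical_name_index(uint32_t reg) {
    DecodedRegister d = decode_mc_register(reg);
    if (d.class_id != 0 || d.payload == 0)
        return -1;
    return (int)(d.payload - 1);
}
```

A caller would still need to bounds-check the returned index against the offset table's entry count before using it.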
MC Switch Shape Population Table
The MC printer's dispatcher is a single switch over 6,388 MC opcodes
covering every selectable NVPTX instruction in this build. case arms do
not each carry a distinct printer body; most fall through to one of 297
shared body labels that emit textually-identical PTX. Compression is
steep: the fifteen most-populated shared bodies absorb roughly 80 % of
all dispatch traffic, and the top-eight bodies alone shape the bulk of the
printer's output behaviour.
| Shared body | Opcode count | Skeleton |
|---|---|---|
| LABEL_18639 | 120 | slot 3 plus 1-byte terminator (e.g. cvt.{type}.{type}.{rnd} {reg}, {reg};) |
| 0x1C40201 | 577 | mnemonic, operands, terminator (the dominant ALU shape) |
| 0x1C4097D | 267 | 4-operand form (e.g. mma.sync.aligned {rd, rs1, rs2, rs3};) |
| 0x1C40B59 | 108 | [addr], reg form (e.g. st {addr}, {reg};) |
| 0x1C409DF | 96 | 2-operand reg-reg (e.g. mov.{type} {rd}, {rs};) |
| 0x1C40AAF | 84 | 3-operand plus modifier (e.g. set.{cmp}.{type} {rd}, {rs1}, {rs2};) |
| LABEL_18984 | (slot 8) | predicated form |
| LABEL_18729 | (slot 5) | conditional form |
Shared body 0x1C40201 is the centre of mass of the table. It prints the
canonical "mnemonic, comma-separated operands, semicolon" shape every
ordinary ALU instruction takes, so a reimplementation that gets exactly
this one body right covers more than 9 % of MC opcodes on its own
(577 of 6,388). The five 0x1C40... bodies together (the 201, 97D, B59,
9DF, AAF cluster) form the ALU and memory backbone; LABEL_18984 and
LABEL_18729 cover the predicated and conditional forms the selector
synthesises around guarded instructions.
18-Family Non-MMA Partition
Beyond the top-eight shared bodies, the AsmPrinter groups the non-MMA
opcodes into eighteen families F1 through F18 keyed by operand shape.
The partition is operand-driven rather than mnemonic-driven: opcodes whose
PTX text differs in mnemonic but agrees in operand layout share a family,
and the per-opcode flag word in the jump table picks the right mnemonic
stem out of the XOR-3 mnemonic pool. The largest family is F1, the
load/store mega-group, which carries its own inner sub-dispatcher because
LD and ST must discriminate address space, predication, sparsity, and
tensor-memory variants before the shared body can pick the correct PTX
spelling.
Inner LD/ST 13-Label Table
F1 dispatches through a 13-label sub-table that splits the load/store
opcodes by address space, predication, and tensor-memory variant:
| Sub-label | Opcode family |
|---|---|
| 1 | Generic load. |
| 2 | Generic store. |
| 3 | Constant-AS load. |
| 4 | Param-AS load. |
| 5 | Shared-AS load/store. |
| 6 | Global-AS load/store. |
| 7 | Local-AS load. |
| 8 | TMEM load/store. |
| 9 | Bulk-tensor load. |
| 10 | Bulk-tensor store. |
| 11 | Sparse load. |
| 12 | Predicated load. |
| 13 | Predicated store. |
Sub-labels 8 through 11 are the tensor-memory and bulk-tensor variants reachable only from the SM100 window onward; sub-labels 12 and 13 carry the predicated forms the selector emits when a guard predicate survives into the MC layer.
MC Opcode to Label Cascade
Each MC opcode in the dispatcher's jump table holds a u32 index packed
as {shared_label_id << 16 | per-op flag bits}. The high 16 bits select
the shared body; the low 16 bits encode the operand-flavour tweaks (FTZ,
satfinite, modifier kind, address-space hint, and so on) the shared body
consults via currentOp & 0xFFFF. This is the canonical AsmWriterEmitter
compression pattern, fingerprinted in the binary by the
mov rax, [rdi + 4*rcx] plus mov ecx, eax; shr rax, 16 idiom at the
entry of every shared body.
void emit_mc_opcode(MCInst inst, raw_ostream *os) {
uint32_t entry = jump_table[inst.opcode];
uint32_t label = entry >> 16;
uint32_t flags = entry & 0xFFFF;
print_shared_body(label, inst, flags, os);
}
The split is strict: a shared body never reads opcode identity directly.
It reads only the flag word handed to it by the dispatcher, plus the
operand indices its skeleton dictates. That is what lets 297 bodies
absorb 6,388 opcodes without per-opcode branching inside the bodies
themselves.
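A reimplementation that builds its own jump table needs the inverse of the split in emit_mc_opcode. A minimal sketch with hypothetical helper names; the individual flag bits are left opaque because the text does not enumerate their positions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one jump-table entry as {shared_label_id << 16 | flags}. */
uint32_t pack_jump_entry(uint16_t label, uint16_t flags) {
    return ((uint32_t)label << 16) | (uint32_t)flags;
}

/* High 16 bits: which shared body prints the opcode. */
uint16_t jump_entry_label(uint32_t entry) {
    return (uint16_t)(entry >> 16);
}

/* Low 16 bits: operand-flavour flags handed to the shared body. */
uint16_t jump_entry_flags(uint32_t entry) {
    return (uint16_t)(entry & 0xFFFFu);
}
```

With 297 shared bodies the label field uses only 9 of its 16 bits, so the packing has ample room for additional bodies.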
Worked Example: MMA M16N8K16
The cleanest way to see every layer of the printer cooperate is to trace
a single MachineInstr from selection output through to the emitted PTX
line. The example below uses MMA_F32_F16_F16_F32_M16N8K16, the
canonical FP16-input, FP32-accumulate warp MMA that GEMM kernels lean on
most heavily.
MachineInst Shape
After instruction selection, the MI carries four operand groups: a D
destination, an A-fragment, a B-fragment, and a C-accumulator, plus an
optional satfinite flag. The shape is fixed by the TableGen instruction
definition:
%f0:f32, %f1:f32, %f2:f32, %f3:f32 =
MMA_F32_F16_F16_F32_M16N8K16
%r0:i32, %r1:i32, %r2:i32, %r3:i32, // A fragment (4 x packed.f16x2)
%r4:i32, %r5:i32, // B fragment (2 x packed.f16x2)
%f4:f32, %f5:f32, %f6:f32, %f7:f32 // C accumulator
The destination is the first four operands; the A, B, C fragments follow
in source order. The register classes are mixed: A and B use the 32-bit
integer bank (%r) because PTX packs two FP16 lanes into one 32-bit
register, while C and D use the 32-bit float view (%f) because the
accumulator is held in FP32.
Print Shape Lookup
The MC opcode index for MMA_F32_F16_F16_F32_M16N8K16 resolves to a
jump-table entry whose high 16 bits select shared body 0x1C4097D
(the 4-operand-group form documented in the shape population
table) and whose low 16 bits encode
the type-suffix tuple (D=f32, A=f16, B=f16, C=f32) plus the
satfinite=0 bit.
The mnemonic offset retrieved from dword_4D4D360 resolves through the
XOR-3-decoded pool to the stem mma.sync.aligned.m16n8k16. The trailing
type tokens .f32.f16.f16.f32 are appended by the modifier helper
print_mma_type_tuple, which reads the four type-tag fields off the flag
word and looks each one up in the type-string table.
Modifier Emission
The suffix slot table for the dense MMA family is:
| Slot | Token | Source |
|---|---|---|
| 1 | .sync | fixed for mma.sync |
| 2 | .aligned | fixed for the warp-synchronous form |
| 3 | .m16n8k16 | shape from print-shape entry |
| 4 | .f32 | D type tag from flag word |
| 5 | .f16 | A type tag from flag word |
| 6 | .f16 | B type tag from flag word |
| 7 | .f32 | C type tag from flag word |
| 8 | .satfinite | optional, gated by flag bit |
The printer walks this slot table in order. No operand-vector traversal participates — the slot table is the ground truth for suffix order.
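The slot walk can be sketched as a single formatted write. MmaSuffixSlots and format_mma_mnemonic are hypothetical names, and the sketch takes the type tokens as strings rather than decoding them from the flag word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical inputs for the dense MMA suffix slots. */
typedef struct {
    const char *shape;           /* slot 3, e.g. "m16n8k16" */
    const char *d, *a, *b, *c;   /* slots 4..7, e.g. "f32" */
    int satfinite;               /* slot 8, emitted only when nonzero */
} MmaSuffixSlots;

/* Build the full mnemonic by emitting the slots in table order.
 * Slots 1 and 2 (.sync, .aligned) are fixed text.
 * Returns 0 on success, -1 if buf is too small. */
int format_mma_mnemonic(char *buf, size_t size, const MmaSuffixSlots *s) {
    assert(buf != NULL && s != NULL);
    int n = snprintf(buf, size, "mma.sync.aligned.%s.%s.%s.%s.%s%s",
                     s->shape, s->d, s->a, s->b, s->c,
                     s->satfinite ? ".satfinite" : "");
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

An unset satfinite flag contributes an empty string, which matches the rule that the printer never emits a placeholder for an empty slot.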
Operand Group Printing
The shared body 0x1C4097D reads four operand-group descriptors out of
its skeleton:
| Group | Operand range | Register class | Width | Printed form |
|---|---|---|---|---|
| D | dest[0..3] | f32 (%f) | 4 | {%f0, %f1, %f2, %f3} |
| A | src[0..3] | i32 (%r) | 4 | {%r0, %r1, %r2, %r3} |
| B | src[4..5] | i32 (%r) | 2 | {%r4, %r5} |
| C | src[6..9] | f32 (%f) | 4 | {%f4, %f5, %f6, %f7} |
Each group emits an open brace, comma-separated operand prints, a closing brace, and a separator. The group sizes (4, 4, 2, 4) come from the MMA shape's fragment dimensions, not from the operand vector — a 16x8 D output produces 4 FP32 lanes per thread, a 16x16 A input produces 4 packed FP16x2 lanes per thread, and an 8x16 B input produces 2 packed FP16x2 lanes per thread.
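The group sizes follow from simple arithmetic if the fragment is assumed to be spread evenly over the 32 threads of a warp and packed into 32-bit registers. fragment_registers_per_thread is a hypothetical helper used only to check the numbers quoted above.

```c
#include <assert.h>

/* 32-bit registers each thread holds for a rows x cols fragment whose
 * elements are elem_bits wide, assuming an even split over 32 threads. */
unsigned fragment_registers_per_thread(unsigned rows, unsigned cols,
                                       unsigned elem_bits) {
    unsigned total_bits = rows * cols * elem_bits;
    /* 32 threads x 32 bits per register = 1024 bits per register "row". */
    assert(total_bits % (32u * 32u) == 0);
    return total_bits / (32u * 32u);
}
```

For m16n8k16 this gives 4 registers for the 16x8 FP32 D and C fragments, 4 for the 16x16 FP16 A fragment, and 2 for the FP16 B fragment.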
Final Emitted Line
After all four groups have printed, the skeleton emits the closing semicolon and a newline:
mma.sync.aligned.m16n8k16.f32.f16.f16.f32 {%f0, %f1, %f2, %f3}, {%r0, %r1, %r2, %r3}, {%r4, %r5}, {%f4, %f5, %f6, %f7};
Three observations carry across to other MMA variants. First, the suffix
order and operand-group order are independent: the suffix list runs D,
A, B, C in type-token form while the operand list runs D, A, B, C in
brace-group form, and they happen to agree only because the print shape
arranged it that way. Second, the register-class mismatch between A/B
(integer bank) and C/D (float bank) is deliberate — PTX treats FP16
inputs as bit patterns and FP32 accumulators as floats, and the printer
faithfully picks the bank from each operand's class. Third, the
.satfinite slot is gone from the printed line because its flag bit was
zero; the printer never emits empty-slot placeholders.
The same skeleton drives every entry in the mma.sync.aligned.* family.
Switching to mma.sync.aligned.m16n8k32.f32.bf16.bf16.f32 changes only
the shape token in slot 3 and the type tokens in slots 4 through 7; the
operand-group sizes adjust to match the new fragment dimensions; and the
final line keeps the exact same syntactic shape. That regularity is what
lets one shared body cover several hundred MMA opcodes.
Reimplementation Budget
Recreating the dispatcher in a clean reimplementation takes four artefacts. The first three are bulk data; the fourth carries the per-operand encoding rules the modifier helpers consume.
| Artefact | Shape | Source |
|---|---|---|
| Shared-body table | 297 labels with operand-skeleton scripts. | Disassembled from the asm-printer body cluster. |
| MC opcode jump table | 6,388 entries of {label, flags} packed as u32. | Reconstructed from the per-case jump targets. |
| Mnemonic pool | XOR-3-encrypted, 3,067 NUL-delimited chunks. | See the XOR-3 walking cipher section. |
| Offset tables | dword_4D4D360, dword_4D468C0, word_4D46800. | See getMnemonic and the Offset Tables. |
Ship these four artefacts as static data plus the 297 body scripts as a
small interpreter loop and the entire MC printer surface comes back
without per-opcode code generation. The
ISelDAG and MatcherTable — Selector Layers
page documents the upstream stage feeding these MC opcodes into the printer, and the
XOR-3 walking cipher
section above covers the mnemonic-pool side of the budget.
Per-SM Emission Templates
Abstract
Tileiras emits tensor-core matrix instructions through a different path on each SM generation. The useful public model is not "which helper printed a string" but "which instruction surface is available, which operands it expects, and whether emission goes through inline assembly or an NVPTX machine instruction."
Older Volta and Turing MMA operations take the NVVM intrinsic path. Ampere
and Ada take llvm.inline_asm templates for dense and sparse mma.sync.
Hopper adds WGMMA templates. Datacenter Blackwell moves tensor-core matmul
into tensor-memory tcgen05 machine instructions. Consumer Blackwell has
no tensor memory and falls back to warp-level block-scaled mma.sync
machine instructions.
Capability Matrix
| SM tier | Public surface | Emission path | Main instruction family |
|---|---|---|---|
| SM70 / SM75 | nv_tileas.mma.sm70, nv_tileas.mma.sm75 | NVVM intrinsic | mma.sync.m8n8k* |
| SM80 | dense/sparse MMA atoms | inline asm | mma.sync.aligned, mma.sp.sync.aligned |
| SM89 | FP8 MMA atoms | inline asm | mma.sync.aligned.m16n8k32 |
| SM90 | warp-group MMA | inline asm / NVVM ops | wgmma.mma_async.sync.aligned |
| SM100 / SM103 | tensor-memory MMA | MachineInstr | tcgen05.mma |
| SM120 / SM121 | block-scaled warp MMA | MachineInstr | mma.sync.aligned.*.block_scale |
The selection rule is a tier-keyed lookup: the SM major version names a single emission path. SM70 and SM75 emit the NVVM intrinsic and let the NVPTX backend pick the final PTX spelling. SM80 and SM89 build a mma.sync.aligned inline-asm template at IR time. SM90 builds a four-part WGMMA inline-asm protocol. SM100 and SM103 emit tcgen05.mma as a MachineInstr directly. SM120 and SM121 emit warp-synchronous mma.sync.aligned.*.block_scale as a MachineInstr.
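A minimal sketch of the tier-keyed lookup, assuming the SM number is available as a plain integer such as 100 or 121. The enum and function names are hypothetical. SM110 is an accepted driver target but has no row in the capability matrix, so the sketch reports it as unsupported rather than guessing.

```c
#include <assert.h>

typedef enum {
    EMIT_UNSUPPORTED = 0,
    EMIT_NVVM_INTRINSIC,    /* SM70, SM75 */
    EMIT_INLINE_ASM_MMA,    /* SM80, SM89 */
    EMIT_INLINE_ASM_WGMMA,  /* SM90 */
    EMIT_MI_TCGEN05,        /* SM100, SM103 */
    EMIT_MI_BLOCK_SCALE     /* SM120, SM121 */
} EmissionPath;

/* Map an SM number to its emission path by major version (sm / 10). */
EmissionPath emission_path_for_sm(unsigned sm) {
    switch (sm / 10) {
    case 7:  return EMIT_NVVM_INTRINSIC;
    case 8:  return EMIT_INLINE_ASM_MMA;
    case 9:  return EMIT_INLINE_ASM_WGMMA;
    case 10: return EMIT_MI_TCGEN05;
    case 12: return EMIT_MI_BLOCK_SCALE;
    default: return EMIT_UNSUPPORTED;
    }
}
```

Keying on the major version keeps SM80 with SM89 and SM100 with SM103 on one path each; differences within a tier are better expressed as subtarget feature checks inside that path.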
SM70 / SM75
Volta and Turing need no Tileiras-owned inline-assembly templates for their baseline MMA surface. The dialect registers the SM70 and SM75 atoms, then lowers them to the corresponding llvm.nvvm.mma.* intrinsics. The downstream NVPTX backend owns final PTX spelling.
| Tier | Shape families | Lowering rule |
|---|---|---|
| SM70 | m8n8k4 | Use NVVM MMA intrinsic. |
| SM75 | m8n8k16, m8n8k32, m8n8k128, BF16 additions | Use NVVM MMA intrinsic. |
The PTX spelling produced by the NVPTX backend matches the SM tier:
mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32
{%f0, %f1, %f2, %f3},
{%r0, %r1, %r2, %r3},
{%r4, %r5},
{%f4, %f5, %f6, %f7};
The exact register count per operand fragment depends on the shape and element type; the table-driven NVPTX printer reads it from the per-opcode operand-class enumeration.
SM80
Ampere is the first tier where Tileiras builds the PTX template directly inside llvm.inline_asm. Dense MMA emits mma.sync.aligned. Sparse MMA emits mma.sp.sync.aligned with a metadata register and a sparsity-selector immediate. The INT8 m16n8k32 sparse form has a .sp::ordered_metadata fast path that pins the selector to zero.
| Family | Shape examples | Accumulator | Extra operands |
|---|---|---|---|
| Dense f16/bf16/tf32 | m16n8k8, m16n8k16 | f16 or f32 | none |
| Dense integer | m16n8k32, m16n8k64 | s32 | optional .satfinite |
| Sparse f16/bf16/tf32 | m16n8k8, m16n8k16 | f16 or f32 | metadata + selector |
| Sparse integer | m16n8k32, m16n8k64 | s32 | metadata + selector |
| Ordered metadata | m16n8k32 INT8 sparse | s32 | metadata, selector fixed to zero |
Dense m16n8k16.f32.f16.f16.f32 emits:
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
{%f0, %f1, %f2, %f3},
{%r0, %r1, %r2, %r3},
{%r4, %r5},
{%f4, %f5, %f6, %f7};
The dense INT8 m16n8k32.s32.s8.s8.s32 form emits the same shape with s32/s8 type suffixes and an optional .satfinite modifier on the destination side.
Sparse m16n8k16.f32.f16.f16.f32 emits the same operand list plus a metadata register and a selector immediate:
mma.sp.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
{%f0, %f1, %f2, %f3},
{%r0, %r1},
{%r2, %r3},
{%f4, %f5, %f6, %f7},
%r4,
0x0;
The metadata operand is logically two i16 values packed into one i32 register. The selector is a one-bit immediate.
The INT8 ordered-metadata fast path swaps .sp for .sp::ordered_metadata and elides the explicit selector:
mma.sp::ordered_metadata.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32
{%r0, %r1, %r2, %r3},
{%r4, %r5, %r6, %r7},
{%r8, %r9},
{%r10, %r11, %r12, %r13},
%r14;
Dense integer forms can request .satfinite; floating forms have no such modifier at the MMA level.
SM89
Ada extends the SM80 dynamic builders with FP8 types. The shape is m16n8k32, the accumulator is f32, and the input type product is one of e4m3 x e4m3, e4m3 x e5m2, e5m2 x e4m3, e5m2 x e5m2. The emitted PTX form is:
mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32
{%f0, %f1, %f2, %f3},
{%r0, %r1, %r2, %r3},
{%r4, %r5},
{%f4, %f5, %f6, %f7};
Register arity follows the SM80 INT8 k32 layout: four D registers, four A registers, two B registers, four C registers. Sparse FP8 adds one metadata register and reuses the .sp modifier shape from SM80. No FP16 accumulator path exists for this tier's FP8 mma.sync — that belongs to the later WGMMA surface.
SM90
Hopper introduces WGMMA. Tileiras emits wgmma.mma_async.sync.aligned inside a four-part inline-assembly protocol: fence, one or more async MMA instructions, commit group, wait group. The accumulator-update bit is carried by a predicate register (%p) computed from the scale_d operand. Shared-memory operands ride as 64-bit descriptors built by the per-atom descriptor constructor.
| Input family | D type | K | Notes |
|---|---|---|---|
| f16 x f16 | f16 or f32 | 16 | Optional scale and transpose operands. |
| bf16 x bf16 | f32 | 16 | Same operand structure as f16/f32. |
| tf32 x tf32 | f32 | 8 | TF32-specific K width. |
| e4m3/e5m2 FP8 pairs | f32 | 32 | Four FP8 type combinations. |
| s8/u8 integer pairs | s32 | 32 | Forced .satfinite, no scale-a/b. |
| b1 x b1 | s32 | 256 | Uses .xor.popc or .and.popc. |
The four-part protocol for one tile of m64n128k16.f32.f16.f16 is:
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
{%f0, %f1, %f2, %f3, %f4, %f5, %f6, %f7,
%f8, %f9, %f10, %f11, %f12, %f13, %f14, %f15,
%f16, %f17, %f18, %f19, %f20, %f21, %f22, %f23,
%f24, %f25, %f26, %f27, %f28, %f29, %f30, %f31},
%rd0, // descriptor A
%rd1, // descriptor B
%p0, // scale-D (accumulator-update predicate)
1, 1, // scale-A, scale-B (FP families only)
0, 0; // transpose-A, transpose-B (FP families only)
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
The float families append scale-A, scale-B, transpose-A, transpose-B immediates after the scale-D predicate; the integer families omit those and force .satfinite. The b1 family substitutes .xor.popc or .and.popc for the type suffix.
The A operand can be a register fragment instead of an SMEM descriptor — in that case it appears as a register-tuple { %r0, %r1, ... } and the constraint list switches l to r. The B operand is always an SMEM descriptor. Descriptor offsets are expressed in 16-byte units, so the constructor shifts byte offsets right by four before packing. See SM70-120 MMA Atoms — SMEM-Descriptor Construction for the 64-bit descriptor bit layout, and the WGMMA Descriptor Round-Trip section below for a worked hex example.
SM100 / SM103
Datacenter Blackwell uses tensor memory and emits tcgen05.mma through the MachineInstr layer rather than llvm.inline_asm. The instruction is warp-group-uniform and operates on TMEM operands. The packed control word carries instruction family, CTA group, sparsity, block scale, scale-vector size, input family, collector mode, and optional scale-input-accumulator state. See tcgen05 Control-Word Bit Layout for the bit-layout of the control word and Verifier Rules for the verifier rules.
A dense tcgen05.mma for one tile emits:
tcgen05.mma.cta_group::1.kind::f16.f32.f16.f16
[%r0], // TMEM destination (D)
[%r1], // TMEM source (A)
%rd2, // SMEM descriptor (B)
%r3; // packed control word
The control-word operand encodes scale-vector size, MMA kind, scale-input-accumulator, and block-scale bits. A sparse variant adds a metadata operand:
tcgen05.mma.sp.cta_group::1.kind::f16.f32.f16.f16
[%r0], [%r1], %rd2, [%r3], %r4;
A block-scaled variant adds two TMEM scale operands and a scale-vec modifier:
tcgen05.mma.cta_group::1.kind::mxf8f6f4.scale_vec::1X.f32.e4m3.e4m3
[%r0], [%r1], %rd2,
[%r3], // SFA scale (TMEM)
[%r4], // SFB scale (TMEM)
%r5;
The weight-stationary variant adds a .ws qualifier to the mnemonic and rejects two-CTA grouping. The two-CTA variant is encoded as cta_group::2 in the modifier. The arch-conditional variants (sm_100a, sm_100f) accept different subsets of the kind tag and scale-vec width than the base variant.
SM103 follows the same structural path with a different accepted target tuple. Drive the algorithm with subtarget feature predicates, not a separate forked emitter.
SM120 / SM121
Consumer Blackwell removes tensor memory and therefore drops tcgen05.mma entirely. Its block-scaled matmul surface is warp-synchronous mma.sync.aligned.*.block_scale. The public operation has nine attributes: a_type, b_type, byte_id_a, byte_id_b, sf_type, shape_MNK, thread_id_a, thread_id_b, vec_size.
The verifier accepts exactly three shape/vector families:
| K | vec_size | Kind | A/B types | Scale-factor type |
|---|---|---|---|---|
| 32 | 32 | MXFP8 | e4m3, e5m2, e3m2, e2m3, e2m1 | E8M0 |
| 64 | 16 | MXFP4 | e2m1 | E8M0 or E4M3 |
| 64 | 32 | NVFP4 | e2m1 | E8M0 |
Dense and sparse forms share one set of operand families: A fragment, B fragment, C accumulator, D output, SFA scale fragment, SFB scale fragment. Sparse forms add ordered metadata. SFA and SFB are warp-register fragments, unlike SM100 where the scale operands live in tensor memory.
A dense m16n8k32 MXFP8 block-scale tile emits:
mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.scale_vec::1X.block_scale.f32.e4m3.e4m3.f32
{%f0, %f1, %f2, %f3},
{%r0, %r1, %r2, %r3},
{%r4, %r5},
{%f4, %f5, %f6, %f7},
%r6, // SFA scale fragment (register)
%r7; // SFB scale fragment (register)
The NVFP4 m16n8k64.scale_vec::4X form emits the same operand layout with e2m1 type suffixes and a kind::mxf4nvf4 tag. The MXFP4 m16n8k64.scale_vec::2X form pairs kind::mxf4 with E8M0 or E4M3 scale-factor type.
Sparse variants prepend .sp::ordered_metadata and add a metadata register slot:
mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.scale_vec::4X.block_scale.f32.e2m1.e2m1.f32
{%f0, ..., %f3},
{%r0, %r1},
{%r2},
{%f4, ..., %f7},
%r3, // metadata
%r4, %r5; // SFA, SFB
The MMA verifier rejects shape/vector/type combinations outside the three accepted families. Compression from the SM100 tcgen05 lattice to the SM120 surface is intentional: no CTA group, no collector mode, no A-shift, no weight-stationary mode, no scale-input accumulator, no tensor-memory destination, no write-disable modifier. Only shape, element family, scale-factor family, scale-vector width, and the sparse/dense choice remain.
WGMMA Descriptor Round-Trip
The SM90 inline-asm template above threads an SMEM descriptor through the l constraint slot for operand B (and for operand A when the atom is fully SMEM-resident). The descriptor is a 64-bit packed word built by the per-atom constructor from the abstract tile shape and swizzle mode. A worked example shows the round-trip from logical fields to the hex value that lands in the inline-asm input.
Consider a representative atom: m64n128k16.f32.f16.f16 with swizzle=128B, lbo=2048, sbo=0, and a starting SMEM byte offset chosen so (smem_off >> 4) = 0x1000. The constructor packs the fields according to the WGMMA descriptor layout:
| Field | Bits | Width | Logical value | Encoded |
|---|---|---|---|---|
| start_addr | 0-13 | 14 | smem_off >> 4 = 0x1000 (low 14 bits) | 0x1000 |
| lbo | 14-29 | 16 | 2048 (0x800) | 0x800 |
| sbo | 30-45 | 16 | 0 | 0x0 |
| base_offset | 46-48 | 3 | 0 | 0 |
| reserved | 49-51 | 3 | 0 (mandatory) | 0 |
| swizzle_mode | 52-53 | 2 | 128B | 1 |
| pad | 54-63 | 10 | unused | 0 |
The composition is straightforward:
uint64_t raw = 0;
raw |= ((uint64_t)0x1000) << 0; // start_addr at bits 0-13
raw |= ((uint64_t)0x0800) << 14; // lbo at bits 14-29
raw |= ((uint64_t)0x0000) << 30; // sbo at bits 30-45
raw |= ((uint64_t)0x0000) << 46; // base_offset at bits 46-48
raw |= ((uint64_t)0x0001) << 52; // swizzle 128B at bits 52-53
The resulting WgmmaDescriptor.raw value is 0x0010_0000_0200_1000. Decomposed back: bits 0-13 hold 0x1000; bits 14-29 hold 0x800, which shows up as 0x0200_0000 in the raw word because the field starts at bit 14 (0x800 << 14); bits 52-53 hold the 128B swizzle code, contributing 0x0010_0000_0000_0000; and every reserved bit is clear. A round-trip through decode_descriptor(0x0010_0000_0200_1000) produces the exact original logical-field set.
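The round-trip described above can be sketched as a pair of shift-and-mask helpers. This is a minimal illustration that assumes the field positions in the table; the WgmmaFields record and the encode_descriptor body are hypothetical, and decode_descriptor here is only a stand-in for the routine the text names, not its recovered implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field record; names mirror the table above. */
typedef struct {
    uint32_t start_addr;   /* smem_off >> 4, 14 bits, bits 0-13 */
    uint32_t lbo;          /* 16 bits, bits 14-29 */
    uint32_t sbo;          /* 16 bits, bits 30-45 */
    uint32_t base_offset;  /* 3 bits, bits 46-48 */
    uint32_t swizzle_mode; /* 2 bits, bits 52-53 */
} WgmmaFields;

static uint64_t encode_descriptor(WgmmaFields f) {
    /* Reject values that would spill into a neighbouring field. */
    assert(f.start_addr < (1u << 14));
    assert(f.lbo < (1u << 16) && f.sbo < (1u << 16));
    assert(f.base_offset < (1u << 3) && f.swizzle_mode < (1u << 2));
    return ((uint64_t)f.start_addr << 0) |
           ((uint64_t)f.lbo << 14) |
           ((uint64_t)f.sbo << 30) |
           ((uint64_t)f.base_offset << 46) |
           ((uint64_t)f.swizzle_mode << 52); /* reserved and pad bits stay zero */
}

static WgmmaFields decode_descriptor(uint64_t raw) {
    WgmmaFields f;
    f.start_addr   = (uint32_t)((raw >> 0)  & 0x3FFF);
    f.lbo          = (uint32_t)((raw >> 14) & 0xFFFF);
    f.sbo          = (uint32_t)((raw >> 30) & 0xFFFF);
    f.base_offset  = (uint32_t)((raw >> 46) & 0x7);
    f.swizzle_mode = (uint32_t)((raw >> 52) & 0x3);
    return f;
}
```

Feeding the worked example (start_addr 0x1000, lbo 0x800, swizzle code 1) through encode_descriptor and back through decode_descriptor should reproduce both the raw word and the original fields.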
The constructor passes this hex value into the inline-asm fragment as an i64 input bound to the l constraint slot. The PTX template that consumes it is the WGMMA form documented in SM90; the register allocator assigns the value a 64-bit register (%rd1 in the example), and the WGMMA hardware decodes it back into the canonical Hopper SMEM descriptor on each wgmma.mma_async.sync.aligned issue. A mismatch between the constructor's swizzle mode and the verifier's swizzle mode produces silently wrong results, which is why the constructor and verifier must read the same swizzle table; see the cross-reference paragraph at the end of SMEM-Descriptor Construction.
Cross-References
PTX Version and Target Selection — Architecture-Conditional Instructions
documents the upstream subtarget gating that decides which of these
templates is reachable for a given .target line. Plain sm_NN rows
admit only the SM70-SM80-SM89 surfaces; sm_NNa and sm_NNf rows
unlock WGMMA, tcgen05, and block_scale MMA per the suffix grid in
that page. AsmPrinter — MC Switch Shape Population Table documents the dispatcher and AsmWriter String Pools and the XOR-3 Walking Cipher covers the mnemonic pool that finally prints the template strings shown above. tcgen05 Control-Word Bit Layout covers the SM100/SM103 control word and tcgen05 mbarrier Emission plus Cluster Sync Emission cover the mbarrier/cluster wiring around tcgen05.mma. ISelDAG and MatcherTable — Selector Layers shows where the selector chooses between inline-asm and MachineInstr paths for each SM tier. The MMA atom registry in SM70-120 MMA Atoms is the dialect-level entry point that feeds shape and operand types into these templates.
Atomic, Warp, Sreg, Fence Emission
Abstract
Four PTX synchronization and communication families share the NVPTX backend's final printer: atomic read-modify-write and reductions, warp-level collectives, special-register readers, and the fence/mbarrier/proxy-fence family. They enter code generation through different IR layers and selector dispatch arms, then converge on the same emitter.
The contract is modifier construction in a fixed order. Atomics and fences carry memory ordering and scope. Warp collectives carry a small kind enum that picks a PTX template. Special-register readers map a typed NVVM op to a registered PTX special register and route through a fast path for thread and CTA coordinates. Mbarriers and proxy fences carry operation-specific operands but reuse the same scope vocabulary, so one ordering/scope packing function services every family.
Atomic and Reduction Family
Atomic RMW lowering builds a modifier record. The printer emits modifiers in a fixed order: ordering, scope (as the composite cluster-tail spelling when requested), address space, operation, cache hint, type. Reductions reuse the same scope and ordering vocabulary but support a smaller ordering set at the PTX level.
| Family / op | Orderings | Scope set | Type set |
|---|---|---|---|
red.cta / red.gpu / red.sys / red.cluster | relaxed default, release | cta/gpu/sys/cluster | b32/b64/u32/u64/s32/s64/f32/f64/f16/f16x2/bf16/bf16x2 |
atom.cas{.b16,.b32,.b64,.b128} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster/cta::cluster | b16, b32, b64, b128 |
atom.exch.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster/cta::cluster | b32, b64 |
atom.add | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster/cta::cluster | u32, u64, f32, f64, f16/bf16 packed forms |
atom.and.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | b32, b64 |
atom.or.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | b32, b64 |
atom.xor.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | b32, b64 |
atom.min.{s32,s64,u32,u64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | s32, s64, u32, u64 |
atom.max.{s32,s64,u32,u64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | s32, s64, u32, u64 |
Some red.gpu.global.add.* forms emit as inline PTX templates rather than
through the generic modifier printer. Invalid combinations get specific
diagnostics: unsupported ordering for nvvm.atomic.rmw,
Invalid memory model ordering for nvvm.red,
Invalid reduction op for nvvm.red, Invalid reduction type for nvvm.red.
The printer concatenates tokens in a fixed order so a reimplementation can read tokens off a modifier word without re-sorting. The order, from left to right, is: opcode stem, memory ordering, scope, address-space suffix, operation suffix, optional cache hint, type suffix, then the operand list. This is the order the PTX grammar requires: atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type. Each token comes from a small enum table; an absent enum value (default order or implicit scope) emits nothing rather than a placeholder dash.
For an atomic add on shared memory with relaxed memory order and CTA scope, the printer reads op = ADD, order = RELAXED, scope = CTA, addrspace = SHARED, type = U32 from the operand record and emits:
atom.relaxed.cta.shared.add.u32 %r0, [%r1], %r2;
The token order is atom (stem) → .relaxed (order) → .cta (scope) → .shared (address space) → .add (operation) → .u32 (type) → operands. A few token slots accept compound forms: scope can be .cta::cluster when the cluster-tail bit is set, the cache-hint slot expands to .L2::cache_hint and adds a cache-policy operand, and the type slot can take packed widths like .f16x2 or .bf16x2.
Reductions reuse the same order without a return register:
red.gpu.global.add.f32 [%rd0], %f1;
Atomic compare-and-swap adds a compare operand ahead of the swap value but keeps the same token order:
atom.acquire.gpu.global.cas.b64 %rd0, [%rd1], %rd2, %rd3;
Lowering rejects illegal order/scope pairs before the printer fires, so the token-emission step never has to recover from an invalid modifier word. The invariant a reimplementation must preserve: every order/scope/space combination that lowering accepts is also accepted by ptxas on the current target. The verifier in ISelDAG and MatcherTable — Subtarget Feature Model shares this contract through the same subtarget feature bitmap.
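The token concatenation can be illustrated with a small string builder. This is a sketch only: it follows the PTX ISA grammar atom{.sem}{.scope}{.space}.op{.level::cache_hint}.type, and the AtomModifiers record, append_token, and build_atom_mnemonic are hypothetical names, not symbols recovered from the binary. A NULL slot models an absent enum value and emits nothing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical modifier record; a NULL slot means "emit nothing". */
typedef struct {
    const char *order;      /* "relaxed", "acquire", ... or NULL */
    const char *scope;      /* "cta", "gpu", "cta::cluster", ... or NULL */
    const char *space;      /* "global", "shared", ... or NULL */
    const char *op;         /* "add", "cas", ... (required) */
    int         cache_hint; /* nonzero adds .L2::cache_hint */
    const char *type;       /* "u32", "b64", "f16x2", ... (required) */
} AtomModifiers;

/* Appends ".<tok>" to buf; an absent token leaves buf untouched. */
static void append_token(char *buf, size_t cap, const char *tok) {
    if (tok == NULL) return;
    size_t used = strlen(buf);
    assert(used + strlen(tok) + 2 <= cap); /* '.' + token + NUL */
    snprintf(buf + used, cap - used, ".%s", tok);
}

/* Builds the mnemonic (without operands) in PTX grammar order. */
static void build_atom_mnemonic(char *buf, size_t cap, const char *stem,
                                const AtomModifiers *m) {
    assert(m->op != NULL && m->type != NULL);
    snprintf(buf, cap, "%s", stem);
    append_token(buf, cap, m->order);
    append_token(buf, cap, m->scope);
    append_token(buf, cap, m->space);
    append_token(buf, cap, m->op);
    if (m->cache_hint) append_token(buf, cap, "L2::cache_hint");
    append_token(buf, cap, m->type);
}
```

Because every slot is appended in one fixed sequence, a stem of "red" with no ordering slot falls out of the same routine as the full atom form.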
Warp-Level Collectives
Four MLIR NVVM ops model warp-level collectives: nvvm.redux.sync,
nvvm.shfl.sync, nvvm.vote.sync, nvvm.match.sync. Each carries a
compact kind enum that selects the PTX template.
| NVVM op | Kind enum | PTX template family | Verifier / constraint |
|---|---|---|---|
nvvm.redux.sync | add, umin, umax, min, max, and, or, xor, fmin, fmax, fminabsnan, fmaxabsnan | redux.sync.* on 32-bit values | must run uniformly over the entire subgroup |
nvvm.shfl.sync | bfly, up, down, idx | shfl.sync.{bfly,up,down,idx}.b32 | optional validity predicate result |
nvvm.vote.sync | any, all, uni, ballot | vote.sync.{any,all,uni}.pred or vote.sync.ballot.b32 | ballot returns i32; others return i1 |
nvvm.match.sync | any, all | match.{any,all}.sync.{b32,b64} | any returns i32; all returns {i32, i1} |
redux.sync is feature-gated. Integer reductions require the redux-capable
path; floating redux appears only on newer targets. bar.warp.sync belongs
to the same warp-level family and emits bar.warp.sync mask.
The selector dispatches each warp collective by intrinsic-ID plus operand types. The intrinsic ID picks the family (redux, shfl, vote, match); the kind enum on the SDNode picks the operation within the family; and the operand element type picks the PTX type suffix. Four representative emissions:
redux.sync.add.s32 %r0, %r1, 0xFFFFFFFF; // signed-int reduction over the full warp
shfl.sync.bfly.b32 %r0|%p0, %r1, 0x10, 0x1F, 0xFFFFFFFF;
vote.sync.ballot.b32 %r0, %p1, 0xFFFFFFFF; // ballot returns i32
match.any.sync.b32 %r0, %r1, 0xFFFFFFFF; // match.any returns i32
The vote.sync.{any, all, uni} variants return a pred rather than b32; match.all.sync.b32 returns the pair {i32, i1}, and the printer emits the i1 destination as the second destination slot (PTX spells this d|p, as in the shfl.sync example). Note that PTX places the mode before .sync for match (match.any.sync, match.all.sync), unlike vote, shfl, and redux. The last 32-bit operand on each form is the membership mask the issuing thread passes in. redux.sync is feature-gated and requires the subtarget bitmap's has_redux bit; floating redux adds has_redux_float. The verifier rejects non-uniform redux.sync callers before the selector fires, so the emitter can treat the subgroup as uniform without re-checking.
The bar.warp.sync mask; instruction belongs to the same family and emits a warp-level barrier. Selection routes it through the same dispatcher arm as vote.sync, with bar.warp.sync as the stem and the mask as the single operand.
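As an illustration of how a kind enum picks a template within one family, the sketch below formats the vote.sync forms. The VoteKind enum, vote_mnemonic, and format_vote are hypothetical helpers written for this page; only the mnemonic strings and the pred-versus-b32 result rule come from the table above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { VOTE_ANY, VOTE_ALL, VOTE_UNI, VOTE_BALLOT } VoteKind;

/* Kind enum -> PTX mnemonic; ballot is the only b32-typed form. */
static const char *vote_mnemonic(VoteKind kind) {
    switch (kind) {
    case VOTE_ANY:    return "vote.sync.any.pred";
    case VOTE_ALL:    return "vote.sync.all.pred";
    case VOTE_UNI:    return "vote.sync.uni.pred";
    case VOTE_BALLOT: return "vote.sync.ballot.b32";
    }
    return NULL;
}

/* Formats one vote instruction; returns the snprintf length. */
static int format_vote(char *buf, size_t cap, VoteKind kind,
                       unsigned dst, unsigned src_pred, uint32_t mask) {
    const char *mnemonic = vote_mnemonic(kind);
    assert(mnemonic != NULL);
    /* ballot writes a 32-bit register; the other kinds write a predicate. */
    char dst_class = (kind == VOTE_BALLOT) ? 'r' : 'p';
    return snprintf(buf, cap, "%s %%%c%u, %%p%u, 0x%08X;",
                    mnemonic, dst_class, dst, src_pred, (unsigned)mask);
}
```

The same shape would extend to bar.warp.sync by treating it as a stem with a single mask operand, as the paragraph above describes.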
Special-Register Readers
Tileiras registers the nvvm.read.ptx.sreg.* family for PTX
special-register reads. A compact fast path prints the base thread and CTA
coordinate registers; the rest go through the ordinary instruction printer.
| Sreg family | PTX name(s) | Intrinsic | Width |
|---|---|---|---|
| Thread index | %tid.x / %tid.y / %tid.z | nvvm.read.ptx.sreg.tid.{x,y,z} | u32 |
| Thread-block dim | %ntid.x / %ntid.y / %ntid.z | nvvm.read.ptx.sreg.ntid.{x,y,z} | u32 |
| CTA index | %ctaid.x / %ctaid.y / %ctaid.z | nvvm.read.ptx.sreg.ctaid.{x,y,z} | u32 |
| Grid dim | %nctaid.x / %nctaid.y / %nctaid.z | nvvm.read.ptx.sreg.nctaid.{x,y,z} | u32 |
| Cluster geometry | %clusterid.*, %nclusterid.*, %cluster_ctaid.*, %cluster_nctaid.* | nvvm.read.ptx.sreg.cluster* | u32 |
| Cluster rank | %cluster_ctarank, %cluster_nctarank | nvvm.read.ptx.sreg.cluster.ctarank and sibling | u32 |
| SM / warp identity | %smid, %nsmid, %warpid, %nwarpid, %laneid, %warpsize, %gridid | matching nvvm.read.ptx.sreg.* | u32 |
| Lane-mask predicates | %lanemask_{eq,ge,gt,le,lt} | nvvm.read.ptx.sreg.lanemask.{eq,ge,gt,le,lt} | u32 |
| Clocks / timer | %clock, %clock64, %globaltimer | matching nvvm.read.ptx.sreg.* | u32 / u64 / u64 |
| Environment regs | %envreg0 .. %envreg31 | nvvm.read.ptx.sreg.envreg{0..31} | u32 |
%dynamic_smem_size reads through inline assembly. Tileiras exposes only
the combined u64 %globaltimer form, not separate high/low 32-bit halves.
Performance counters %pm0 through %pm7 go unregistered.
nvvm.breakpoint uses %globaltimer for a short busy-wait before
trapping.
The reader path is a one-line mov template keyed by the registered sreg
name. The fast path for thread and CTA coordinates skips the generic
instruction printer entirely:
void print_sreg_read(Printer *p, SregReadInst *inst) {
    if (is_thread_or_cta_sreg(inst->sreg)) {
        /* Fast path: compact "mov.u32 %rN, %sreg;" emission. */
        write(p, "mov.u32 ");
        print_dest_reg(p, inst->dest);
        write(p, ", ");
        write_sreg_token(p, inst->sreg);
        write(p, ";");
        return;
    }
    /* Slow path: ordinary instruction printer handles cluster geometry, lane masks,
       clocks, environment regs, and any sreg that needs u64 typing. */
    print_inst_generic(p, inst);
}
Fence and Mbarrier Family
Fence scope encodes two ways. acq_rel and sc fences use
scope-suffixed operation names; acquire and release fences carry scope
as an attribute. Mbarriers model initialization, arrival, expected
transactions, invalidation, and waits. Proxy fences model synchronization
between generic, async, cluster, and tensormap proxies.
| Op family | Operands / attrs | PTX lowering |
|---|---|---|
mbarrier.init / .shared | smemPtr, count | mbarrier.init[.shared].b64 [$p], $n; |
mbarrier.arrive / .shared | smemPtr | mbarrier.arrive[.shared].b64 $r, [$p]; |
mbarrier.arrive.nocomplete | smemPtr, count | mbarrier.arrive.noComplete[.shared].b64 $r, [$p], $cnt; |
mbarrier.arrive.expect_tx | smemPtr, txCount | mbarrier.arrive.expect_tx[.shared].b64 $r, [$p], $tx; |
mbarrier.txn | smemPtr, txCount, relaxed, noComplete, shared-space kind, scope, peer rank | `mbarrier.expect_tx{.relaxed}.{cta |
mbarrier.test.wait | smemPtr, token | mbarrier.test_wait[.shared].b64 $r, [$p], $token; |
mbarrier.try_wait | smemPtr, parity, suspendNs or timelimit | mbarrier.try_wait*.b64 ...; |
mbarrier.wait | smemPtr, optional parity | mbarrier.wait[.parity].b64 $r, [$p][, $par]; |
fence.acq_rel.{cta,cluster,gpu,sys} | none | fence.acq_rel.{cta,cluster,gpu,sys}; |
fence.sc.{cta,cluster,gpu,sys} | none | fence.sc.{cta,cluster,gpu,sys}; |
fence.acquire / fence.release | scope, space | fence.{acquire,release}.<scope>; |
fence.mbarrier.init | useIntrinsic | fence.mbarrier_init.release.cluster; |
fence.proxy | kind, space, useIntrinsic | fence.proxy.<kind>; |
fence.proxy.acquire / fence.proxy.release | fromProxy, scope, toProxy | fence.proxy.<from>::<to>.{acquire,release}.<scope>.sync.aligned [addr], sz; |
tensormap.cp_fenceproxy | srcTmapPtr, dstTmapPtr, sizeBytes, scope | tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.<scope>.sync.aligned [dst], [src], sz; |
Legacy membar.{cta,gpu,sys} remains as a fallback. Cluster-scope fences
have no pre-Hopper fallback and diagnose unsupported ordering/scope
combinations on older targets.
Fence emission resolves the scope-as-name-vs-scope-as-attribute split at print time:
void print_fence(Printer *p, FenceInst *inst) {
    if (inst->order == ORDER_ACQ_REL || inst->order == ORDER_SEQ_CST) {
        /* Scope is folded into the operation name: fence.acq_rel.cta */
        write(p, "fence.");
        write_order_token(p, inst->order);
        write(p, ".");
        write_scope_token(p, inst->scope);
        write(p, ";");
        return;
    }
    /* Acquire and release fences carry scope as a separate token. */
    require(inst->order == ORDER_ACQUIRE || inst->order == ORDER_RELEASE,
            "fence ordering must be acquire, release, acq_rel, or seq_cst");
    write(p, "fence.");
    write_order_token(p, inst->order);
    write(p, ".");
    write_scope_token(p, inst->scope);
    write(p, ";");
}
Proxy and tensormap fences extend this with <from>::<to> proxy tokens
and an aligned-sync address payload; the underlying decision tree is the
same.
SyncScope Mapping
LLVM SyncScope names get normalized before atomic and fence printing. Tileiras keeps a small map from LLVM scope names to the NVPTX scope vocabulary, then packs ordering and scope into the modifier word the final printer consumes.
| LLVM SyncScope name | Backend scope | PTX token |
|---|---|---|
singlethread | thread | no token |
| empty default scope | system | .sys |
block | CTA | .cta |
cluster | cluster | .cluster |
device | device / GPU | no explicit token |
The default device scope deliberately prints no .gpu token for ordinary
atomics because PTX treats GPU scope as the default spelling. CTA scope
becomes the composite .cta::cluster spelling when the lowering path asks
for cluster-tail semantics. Cache-hint atomics accept only CTA and system
scope; cluster cache hints get rejected outright rather than silently
downgraded.
typedef enum {
    NVPTX_SCOPE_THREAD,
    NVPTX_SCOPE_CTA,
    NVPTX_SCOPE_CLUSTER,
    NVPTX_SCOPE_DEVICE,
    NVPTX_SCOPE_SYSTEM,
} NvptxScope;
typedef enum {
    ORDER_RELAXED = 1,
    ORDER_ACQUIRE = 2,
    ORDER_RELEASE = 3,
    ORDER_ACQ_REL = 4,
    ORDER_SEQ_CST = 5,
} AtomicOrder;
NvptxScope map_sync_scope(const char *scope_name) {
    if (scope_name == NULL || scope_name[0] == '\0') {
        return NVPTX_SCOPE_SYSTEM;
    }
    if (strcmp(scope_name, "singlethread") == 0) {
        return NVPTX_SCOPE_THREAD;
    }
    if (strcmp(scope_name, "block") == 0) {
        return NVPTX_SCOPE_CTA;
    }
    if (strcmp(scope_name, "cluster") == 0) {
        return NVPTX_SCOPE_CLUSTER;
    }
    if (strcmp(scope_name, "device") == 0) {
        return NVPTX_SCOPE_DEVICE;
    }
    fail("unsupported NVPTX synchronization scope");
}
unsigned pack_atomic_modifier(AtomicOrder order, NvptxScope scope, bool cta_cluster_tail) {
    unsigned modifier = (unsigned)order;
    if (scope == NVPTX_SCOPE_CTA) {
        modifier |= 1u << 4;
    } else if (scope == NVPTX_SCOPE_SYSTEM) {
        modifier |= 2u << 4;
    } else if (scope == NVPTX_SCOPE_CLUSTER) {
        modifier |= 3u << 4;
    }
    if (cta_cluster_tail) {
        modifier |= 1u << 9;
    }
    return modifier;
}
The reimplementation rule is simple: preserve the high-level LLVM ordering first, map the scope name to the closest PTX scope second, then pick the printed suffix. Diagnose unsupported order/scope pairs at lowering time so the printer never recovers from an invalid modifier word.
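The printer side of the same contract can be sketched as the inverse of pack_atomic_modifier. The field widths below are assumptions read off the shifts in the packing routine (ordering in the low four bits, scope code in bits 4-5, cluster-tail at bit 9); UnpackedModifier, unpack_atomic_modifier, and scope_token are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned order;            /* 1..5, same numbering as AtomicOrder */
    unsigned scope_code;       /* 0 = no token, 1 = CTA, 2 = system, 3 = cluster */
    bool     cta_cluster_tail; /* bit 9 of the modifier word */
} UnpackedModifier;

static UnpackedModifier unpack_atomic_modifier(unsigned modifier) {
    UnpackedModifier m;
    m.order = modifier & 0xFu;             /* assumed width: bits 0-3 */
    m.scope_code = (modifier >> 4) & 0x3u; /* assumed width: bits 4-5 */
    m.cta_cluster_tail = ((modifier >> 9) & 1u) != 0;
    assert(m.order >= 1 && m.order <= 5);  /* lowering never packs 0 */
    return m;
}

/* Returns the PTX scope suffix, or NULL when no token is printed. */
static const char *scope_token(UnpackedModifier m) {
    switch (m.scope_code) {
    case 1:  return m.cta_cluster_tail ? ".cta::cluster" : ".cta";
    case 2:  return ".sys";
    case 3:  return ".cluster";
    default: return NULL; /* thread or device scope: nothing emitted */
    }
}
```

Thread and device scope both pack to scope code 0, which is why the unpacked form cannot distinguish them; per the table above, neither prints a token.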
Cross-References
AsmPrinter — MC Switch Shape Population Table documents the dispatcher that selects the print shape for these atomic, warp-collective, sreg, and fence opcodes. ISelDAG and MatcherTable covers the selector that consumes the same subtarget feature bitmap before any of these instructions reach the printer. tcgen05 mbarrier Emission and the mbarrier State Machine cover the mbarrier and proxy-fence families that share scope and ordering vocabulary with this page.
TMA + Tensormap + cp.async.bulk Emission
Abstract
The Tensor Memory Accelerator (TMA) path used by Hopper and Blackwell
targets has three surfaces that must stay consistent: descriptor
construction, descriptor mutation, and instruction emission. They share
one 64-byte payload (the CUtensorMap / DESC_TMA512 descriptor), one
set of operand classes (l for 64-bit pointers, r for i32 coordinates,
h for i16 im2col offsets), and one modifier order: .im2col, then
.multicast::cluster, then .L2::cache_hint.
Device-side in-place mutators can rebind a tiled descriptor without
calling cuTensorMapEncodeTiled again, but a cross-proxy handshake
threads through this layer. Device-side writes to a descriptor pass
through the generic PTX proxy; cp.async.bulk.tensor.* reads of the
same descriptor enter the tensormap proxy. Without an explicit
fence.proxy.tensormap::generic acquire/release pair or the fused
nvvm.tensormap.cp_fenceproxy operation, the two accesses are unordered.
Emitter, mutator, and fence intrinsic are therefore one feature.
TMA Descriptor Shape
The descriptor payload is a 64-byte record, represented as eight 64-bit
slots. The device-mutator path writes through a 128-byte .b1024
operation, so every device-visible descriptor pointer must be 128-byte
aligned. The live 64-byte payload occupies the lower half of that aligned
slot; the upper half is reserved padding.
| Field | Offset | Meaning |
|---|---|---|
tensor_base_ptr | 0 | Base address of the logical tensor |
fmt_dim_stride_packed | 8 | Format plus packed dimension lanes |
box_size_packed | 16 | Tile box extents and paired-CTA layout bits |
elem_stride_packed | 24 | Packed global-stride lanes |
load_mode_packed | 32 | Tiled / im2col / multicast mode fields |
interleave_fill | 40 | Interleave and out-of-bounds fill behavior |
l2_sector_promo | 48 | L2 sector-promotion policy |
reserved_future | 56 | Reserved payload slot |
Rank lives nowhere as an independent mutable field in the device rebind path. The operation consuming or mutating the descriptor carries it, selecting the lane to update inside the packed fields.
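The slot table and the alignment rule can be captured as a C layout with compile-time checks. This is a sketch of the logical view only: TmaPayload, TmaDescriptorSlot, and is_descriptor_aligned are hypothetical names, and the struct says nothing about the bit packing inside each 64-bit slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Eight 64-bit slots, in the offset order of the table above. */
typedef struct {
    uint64_t tensor_base_ptr;       /* offset 0  */
    uint64_t fmt_dim_stride_packed; /* offset 8  */
    uint64_t box_size_packed;       /* offset 16 */
    uint64_t elem_stride_packed;    /* offset 24 */
    uint64_t load_mode_packed;      /* offset 32 */
    uint64_t interleave_fill;       /* offset 40 */
    uint64_t l2_sector_promo;       /* offset 48 */
    uint64_t reserved_future;       /* offset 56 */
} TmaPayload;

/* 128-byte aligned slot: live payload in the lower half, padding above. */
typedef struct {
    _Alignas(128) TmaPayload live;
    uint8_t reserved_upper[64];
} TmaDescriptorSlot;

static_assert(sizeof(TmaPayload) == 64, "payload is 64 bytes");
static_assert(offsetof(TmaPayload, elem_stride_packed) == 24, "slot 3 offset");
static_assert(offsetof(TmaPayload, reserved_future) == 56, "slot 7 offset");
static_assert(sizeof(TmaDescriptorSlot) == 128, "slot spans one b1024 region");
static_assert(_Alignof(TmaDescriptorSlot) == 128, "slot is 128-byte aligned");

/* Mirrors the alignment precondition on device-visible descriptor pointers. */
static bool is_descriptor_aligned(const void *p) {
    return ((uintptr_t)p % 128u) == 0;
}
```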
Inner Bit Packing — Limits of Binary Visibility
Three slots in the eight-slot table multiplex multiple logical fields per 64-bit word. The binary observes only the lane-index argument the mutator templates substitute into PTX text — not the bit-level placement the hardware ultimately writes. Specifically:
| Slot | Mutator template | Lane width | Lane count | Bit packing |
|---|---|---|---|---|
tensor_base_ptr (slot 0) | tensormap.replace.tile.global_address.b1024.b64 [$0], $1 | b64 (full slot) | 1 | direct address — no inner packing |
fmt_dim_stride_packed (slot 1) | tensormap.replace.tile.global_dim.b1024.b32 [$0], {N}, $1 | b32 | rank (0..4) | format bits coexist with rank 32-bit dim lanes; per-lane bit layout is PTX-ISA-defined and not observable in the emitter |
box_size_packed (slot 2) | none in emitted set — see PTX tensormap.replace.tile.box_size | n/a | n/a (host-born only) | hardware-internal |
elem_stride_packed (slot 3) | tensormap.replace.tile.global_stride.b1024.b64 [$0], {N-1}, $1 | b64 | rank-1 (0..3) | strides occupy 64-bit lanes; dim-0 stride is implicit element size, never device-written |
load_mode_packed (slot 4) | none | n/a | n/a | mode enum bits, multicast cardinality — set host-side |
interleave_fill (slot 5) | none | n/a | n/a | interleave + OOB fill — set host-side |
l2_sector_promo (slot 6) | none | n/a | n/a | promotion policy — set host-side |
reserved_future (slot 7) | none | n/a | n/a | observed all-zero in seed templates the binary copies |
The three device-side mutators emitted by tileiras (global_address,
global_dim, global_stride) touch slots 0, 1, and 3. Slots 2, 4, 5, 6,
and 7 are immutable on the device path. Anything that would require
writing them — box-shape changes, swizzle, fill-mode, element-type,
interleave, paired-CTA layout — has to round-trip through host-side
cuTensorMapEncode* driver entries, which is why im2col descriptors and
SM100 paired-CTA descriptors are host-born only.
⚡ QUIRK — eight-slot logical view, not eight-slot byte layout
The "eight 64-bit slots" framing is the device-mutator-visible logical view. The PTX b1024 operand class declares the operand is a 1024-bit aligned region; only the lower 64 bytes are live in current tensormap formats. Lane indices in the mutators are logical (dim index, stride index), not raw byte offsets: the hardware translates each lane index to the corresponding bit window inside the relevant packed slot. The exact bit-window mapping is not derivable from the binary; the emitter just hands {N} to PTX and the assembler/hardware handles placement.
Confidence: HIGH on slot names and per-slot mutator coverage (direct
evidence: emitted PTX strings at 0x4ce3b40, 0x4ce3b80, 0x4ce3bc0 in
the rodata string table; debug-dump format "DESC_TMA512: 0x%016lx %016lx %016lx %016lx" at 0x4603ba8 corroborates the 4-of-8 active
slots). MED on the named "logical roles" for slots 4-7 — derived from
host-side cuTensorMapEncodeTiled parameter ordering and the
SeparateHostTMA pass's host-encoder call sites, not from device-side
mutators. LOW on inner bit-position claims — the binary does not contain
the bit-packing logic; consult the PTX ISA tensormap.replace.* section
for the authoritative byte-level layout.
Tensormap Init / Update Algorithm
Descriptor birth follows one of two paths.
The host-born path is emitted by the SeparateHostTMA pass. It
materialises a 64 B stack-aligned blob, calls cuTensorMapEncodeTiled
(or cuTensorMapEncodeIm2col for im2col-mode operations), then hands the
blob to the kernel-argument attachment step, which appends a kernel
parameter tagged cute_nvgpu.grid_constant. The descriptor passes
by-value into the kernel as a __grid_constant__ CUtensorMap and never
gets written by the device. This is the only legal path for im2col
descriptors and for SM100 TWO_CTA paired-CTA descriptors, because
neither the box-size field nor the im2col-offsets field has a device-side
tensormap.replace.* mutator template.
The device-born path is the rebind sequence. A zeroed 128 B aligned
slot is allocated in global or shared::cta, optionally seeded from a
host descriptor via the fused tensormap.cp_fenceproxy op, then patched
in a fixed order — address → dim[0..rank−1] → stride[1..rank−1] — for a
total of 1 + rank + (rank-1) = 2*rank inline-asm ops per rebind. The
ordering invariant is structural: dim-extent writes re-pack the
fmt_dim_stride_packed slot through a hardware-internal bit-interleave
that relies on the format field already being valid (set by
cuTensorMapEncodeTiled at birth). Write strides before all dims and a
short window opens where the slot is coherent but the stride lanes are
stale. Per-kernel counters nv_tileas.num-device-tmas and
nv_tileas.num-host-tmas tally the two populations separately; device-TMA
slots sit before host-TMA slots in the appended block so the kernel can
locate its working buffer at a fixed parameter-list offset.
void rebind_tiled_tma_descriptor(TmaDescriptor *desc, const TmaRebind *rebind) {
    require_aligned(desc, 128);
    require(1 <= rebind->rank && rebind->rank <= 5);
    fence_proxy_tensormap_from_generic_acquire(rebind->scope);
    tensormap_replace_global_address(desc, rebind->address);
    for (int dim = 0; dim < rebind->rank; ++dim) {
        tensormap_replace_global_dim(desc, dim, rebind->extent[dim]);
    }
    for (int dim = 1; dim < rebind->rank; ++dim) {
        tensormap_replace_global_stride(desc, dim - 1, rebind->stride[dim]);
    }
    fence_proxy_tensormap_from_generic_release(rebind->scope);
}
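To make the 2*rank count concrete, the sketch below expands the three mutator templates for one rebind and counts the ops it writes. It is an illustration under stated assumptions: emit_rebind_sequence is a hypothetical helper, the template strings follow the mutator forms quoted on this page with the address-space token substituted, and the surrounding proxy fences are deliberately left out of the count.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the tensormap.replace lines for one rebind; returns the op count. */
static int emit_rebind_sequence(FILE *out, const char *space, int rank) {
    assert(1 <= rank && rank <= 5);
    int ops = 0;

    /* 1. Base address, written exactly once. */
    fprintf(out, "tensormap.replace.tile.global_address.%s.b1024.b64 [$0], $1;\n",
            space);
    ops++;

    /* 2. One extent per dimension, lanes 0..rank-1. */
    for (int dim = 0; dim < rank; ++dim) {
        fprintf(out, "tensormap.replace.tile.global_dim.%s.b1024.b32 [$0], %d, $1;\n",
                space, dim);
        ops++;
    }

    /* 3. Strides for dims 1..rank-1 land in lanes 0..rank-2;
          the dim-0 stride is implicit and never written. */
    for (int dim = 1; dim < rank; ++dim) {
        fprintf(out, "tensormap.replace.tile.global_stride.%s.b1024.b64 [$0], %d, $1;\n",
                space, dim - 1);
        ops++;
    }
    return ops;
}
```

For rank 3 this writes one address op, three dim ops, and two stride ops, which is the 1 + rank + (rank-1) total stated above.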
cp.async.bulk Template Catalog
The complete cp.async.bulk template inventory follows. Three emission
strategies coexist: fixed inline-assembly templates for the 2D
gather4/scatter4 forms and tensormap-replace mutators, runtime-assembled
strings for rank-1 through rank-5 tensor loads and stores, and
TableGen-registered NVVM ops for the generic intrinsic surface. The
descriptor pointer always threads as an i64 GPR with LLVM inline-asm
constraint class "l" regardless of address space; multicast masks use
"h" (i16), L2 cache hints use "l" (i64), coordinate operands use "r"
(i32), im2col offsets use "h" (i16). The descriptor slot sits at PTX
operand position %1 on G2S loads and %0 on S2G stores.
| Variant | Emitter path | Mode | Dim | Multicast / mask | L2 hint | Im2col offsets | Descriptor operand |
|---|---|---|---|---|---|---|---|
cp.async.bulk.tensor.{1..5}d.shared::cluster.global.mbarrier::complete_tx::bytes | runtime builder | tile | 1–5 | opt, i16 "h" | opt, i64 "l" | n/a | %1 "l" |
cp.async.bulk.tensor.{3..5}d.shared::cluster.global.mbarrier::complete_tx::bytes.im2col | runtime builder | im2col | 3–5 | opt, i16 "h" | opt, i64 "l" | K offsets, "h" | %1 "l" |
cp.async.bulk.tensor.2d.tile::gather4.shared::cta.global.mbarrier::complete_tx::bytes | fixed asm | gather4 | 2 | n/a | n/a | n/a | $1 "l" |
cp.async.bulk.tensor.{1..5}d.global.shared::cta.bulk_group | runtime builder | tile | 1–5 | n/a | n/a | n/a | %0 "l" |
cp.async.bulk.tensor.2d.tile::scatter4.global.shared::cta.bulk_group | fixed asm | scatter4 | 2 | n/a | n/a | n/a | $0 "l" |
cp.async.bulk.tensor.s2g.im2col.{3..5}d | LLVM intrinsic | im2col | 3–5 | n/a | per-intrinsic | per-intrinsic | LLVM-managed |
cp.async.bulk.tensor.s2g.tile.{1..5}d | LLVM intrinsic | tile | 1–5 | n/a | per-intrinsic | n/a | LLVM-managed |
cp.async.bulk.tensor.reduce (mode + redKind) | LLVM intrinsic | tile/im2col + 8-way redKind | 1–5 | n/a | n/a | per-mode | LLVM-managed |
cp.async.bulk.tensor.prefetch | LLVM intrinsic | tile/im2col | 1–5 | n/a | per-intrinsic | per-mode | LLVM-managed |
cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes | LLVM intrinsic | n/a | scalar | n/a | n/a | n/a | n/a (byte count) |
cp.async.bulk.global.shared::cta.bulk_group (non-tensor) | LLVM intrinsic | n/a | scalar | n/a | n/a | n/a | n/a (raw byte-count) |
cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes | LLVM intrinsic | n/a | scalar | n/a | n/a | n/a | n/a |
Modifier cascade order is fixed by both the emitter and PTX ISA 8.4:
.im2col → .multicast::cluster → .L2::cache_hint. Trailing operands
emit with the ordinary comma-space separator NVPTX assembly expects. The
multicast mask is an OptionalAttr<I16Attr> at adaptor slot 7; bit i
set in the mask selects CTA i (cluster maximum is 16, hence the 16-bit
width). The L2 cache hint is an OptionalAttr<I64Attr> at adaptor slot 8
and threads through as an opaque cookie the PTX assembler decodes
(eviction policy plus priority). Trailing operand indices for the optional
tails compute as mcast_opnum = rank + 3 + im2col_count and
ch_opnum = mcast_opnum + (mcast_present ? 1 : 0). The TMA-load mode
enum (NO_MULTICAST=0, TWO_CTA=1, W_MULTICAST=2, W128_MULTICAST=3) gates
the .multicast::cluster modifier; the TMA-store mode enum (TILED=0,
IM2COL=1, IM2COL_W=2, IM2COL_W128=3) selects between the two
runtime-assembled emitters; the reduce redKind enum is the 8-valued
{ADD, MIN, MAX, INC, DEC, AND, OR, XOR} family but carries no PTX-text
emitter inside tileiras — every reduce variant lowers through
int_nvvm_cp_async_bulk_tensor_reduce_* NVPTX intrinsics. The
gather4 / scatter4 forms are 2D-only and Blackwell-specific (SM100+).
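The trailing-operand arithmetic above is small enough to state directly. The helper below is a sketch: TailOperands and tail_operand_indices are hypothetical names, and returning -1 for an absent tail is a convention chosen here, not something observed in the binary.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int mcast_opnum; /* operand index of the multicast mask, or -1 if absent */
    int ch_opnum;    /* operand index of the L2 cache hint, or -1 if absent */
} TailOperands;

static TailOperands tail_operand_indices(int rank, int im2col_count,
                                         bool mcast_present, bool hint_present) {
    assert(1 <= rank && rank <= 5);
    assert(im2col_count >= 0);

    /* mcast_opnum = rank + 3 + im2col_count */
    int mcast_opnum = rank + 3 + im2col_count;
    /* ch_opnum = mcast_opnum + (mcast_present ? 1 : 0) */
    int ch_opnum = mcast_opnum + (mcast_present ? 1 : 0);

    TailOperands t;
    t.mcast_opnum = mcast_present ? mcast_opnum : -1;
    t.ch_opnum = hint_present ? ch_opnum : -1;
    return t;
}
```

When the multicast mask is absent, the cache hint slides into the slot the mask would have occupied, which is why ch_opnum depends on mcast_present.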
TMA Descriptor Mutators
Of the ten tensormap.replace.tile.* field-mutator templates that
PTX ISA 8.3 defines, Tileiras emits only three device-callable mutators:
| Field | Template (verbatim, {0} = address-space token, {1} = decimal index) | Width | Constraint | Emitted | Writes to |
|---|---|---|---|---|---|
global_address | tensormap.replace.tile.global_address.{0}.b1024.b64 [$0], $1; | b1024.b64 | "l,l" (global) / "r,l" (shared::cta) | once | DESC_TMA512+0x00 (tensor_base_ptr) |
global_dim | tensormap.replace.tile.global_dim.{0}.b1024.b32 [$0], {1}, $1; | b1024.b32 | "l,r" | rank times, i ∈ [0, rank) | i-th 32-bit lane inside fmt_dim_stride_packed |
global_stride | tensormap.replace.tile.global_stride.{0}.b1024.b64 [$0], {1}, $1; | b1024.b64 | "l,l" | rank-1 times, i ∈ [1, rank), {1} = i-1 | (i-1)-th 64-bit lane inside elem_stride_packed |
The other PTX ISA 8.3 mutators — box_size, element_stride,
swizzle, fill_mode, elemtype, interleave, rank, and the entire
tensormap.replace.im2col.* family — never appear in this path. The
structural consequence is sharp: any rebind that would change format,
box-shape, swizzle, fill-mode, element-type, or interleave layout must
round-trip through the host-side cuTensorMapEncode* driver entry; the
in-place device-side mutator handles only the
tiled-rebind-with-new-base-and-extents subset. Equivalently, im2col
descriptors are always host-born via cuTensorMapEncodeIm2col, and SM100
paired-CTA TWO_CTA descriptors are always host-born because the CTA
V-map folds into box_size_packed — which has no mutator. The dim-0
stride is implicit 1 (= element size) and never written by the device
path; the host encoder bakes it in at birth time.
The full device-side rebind sequence per descriptor is therefore:
optional tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.cta.sync.aligned
seed-copy from a host-prepared template, one
fence.proxy.tensormap::generic.acquire.{cta|gpu|sys}, one
global_address write, rank global_dim writes, rank-1
global_stride writes, one
fence.proxy.tensormap::generic.release.{cta|gpu|sys}, then the consumer
thread's matching acquire fence before the cp.async.bulk.tensor.* read.
Within a single CTA the .cta fence scope suffices; across CTAs in a
cluster the .gpu scope is mandatory — or the descriptor must be
re-staged into each CTA's own shared::cta slot via cp_fenceproxy.
Every replace mutator has side effects and must remain ordered with
respect to the surrounding proxy fences.
tcgen05 / WGMMA / mbarrier / Cluster Emission
Abstract
Blackwell tcgen05 matrix multiply, Hopper WGMMA, transactional mbarriers,
and cluster-scope synchronization all enter through MLIR nvvm.* or
nvgpu.* operations. None of them become ordinary PTX strings immediately.
They pass through feature checks, operand packing, target-specific
MachineInstr construction, and finally PTX printing.
The central reimplementation idea is two-stage validation. The MLIR verifier checks the operation shape visible at the dialect level. The backend validates the final selected machine form again, because arch-conditional tcgen05 variants, TMA modes, cluster scope, and mbarrier transactions depend on subtarget details that are fully known only after target selection.
For the structural model behind each family see tcgen05 Tensor Memory — Tensor Memory and the tcgen05 Variant Taxonomy, WGMMA Emission Protocol — The Four-Op Sequence, mbarrier State Machine, and Cluster Sync and DSMEM Handshake. This page covers the backend-side validation and PTX-emission detail those topic pages defer here.
tcgen05 Machine Validation
The tcgen05 backend family handles ten matrix-multiply variants plus their sparse, weight-stationary, block-scale, and scale-input-accumulator forms. Selection packs the requested shape into a compact control word. The machine verifier later unpacks the same word and rejects forms the selected PTX version or SM target cannot execute.
Control-Word Bit Layout
Two packed 32-bit words travel through tcgen05 lowering: the primary control word that records shape, kind, and CTA grouping, and a smaller collector word that records collector mode and ashift. Both words are read by the selector to pick a machine opcode and by the verifier to reject illegal combinations. The bit ranges are stable across the dense, sparse, and weight-stationary families.
| Bits | Field | Width | Encoding |
|---|---|---|---|
| 0-1 | cta_group | 2 | 0 = reserved, 1 = 1-CTA, 2 = 2-CTA, 3 = 4-CTA (matches the Mode Pattern Verifiers kind-word table) |
| 2-3 | scale_vec_size | 2 | 0 = 1X (16-elem scale vector), 1 = 2X (32), 2 = 4X (64), 3 = reserved |
| 4 | scale_input_acc | 1 | enables scale-input-accumulator path |
| 5 | block_scale | 1 | selects the block-scale variants |
| 6-8 | mma_kind | 3 | one of seven kind values for the dense-MMA family |
| 9-31 | reserved | 23 | must be zero; the verifier rejects any non-zero bit |
The seven mma_kind values cover f16 / tf32 / i8 / f8f6f4 / mxf8f6f4 / mxf4 / mxf4nvf4. The selector reads the abstract MMA kind from the SDNode operand and maps to this enum before packing.
A separate collector word carries the operand-A modifiers:
| Bits | Field | Width | Encoding |
|---|---|---|---|
| 0 | collector_a_valid | 1 | distinguishes the explicit collector path from the default |
| 1-2 | collector_a | 2 | 0 = fill, 1 = use, 2 = fill+use, 3 = reserved |
| 2 | ashift | 1 | enables the A-shift modifier (overlaps with the high bit of collector_a) |
The overlap on bit 2 is deliberate: the encoder treats ashift and the collector_a "fill+use" mode as mutually exclusive, so a single bit position carries both, and the verifier rejects any combination that would set them at once. The remaining bits stay reserved and must be zero on entry to the verifier.
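As a minimal sketch of the overlap, the collector word can be decoded from the raw value with plain shifts, assuming only the bit positions in the collector-word table. The CollectorFields struct and decode_collector helper are hypothetical names used for illustration, not identifiers recovered from the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative decoded view; field positions come from the collector-word table. */
typedef struct CollectorFields {
    unsigned valid;        /* bit 0 */
    unsigned collector_a;  /* bits 1-2 */
    unsigned ashift;       /* bit 2, shared with the high bit of collector_a */
} CollectorFields;

static CollectorFields decode_collector(uint32_t raw) {
    assert((raw & ~0x7u) == 0);  /* bits 3-31 are reserved and must be zero */
    CollectorFields f = { raw & 1u, (raw >> 1) & 3u, (raw >> 2) & 1u };
    return f;
}
```

For raw = 0x5 the decode yields collector_a = 2 and ashift = 1 from the same bit, which is exactly the combination the encoder has to treat as mutually exclusive.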
Each Tcgen05MmaInst carries one control word and one collector word side by side in its operand list. The packing layout matches the bit ranges:
typedef union Tcgen05Ctrl {
uint32_t raw;
struct {
uint32_t cta_group : 2; /* bits 0-1 */
uint32_t scale_vec_size : 2; /* bits 2-3 */
uint32_t scale_input_acc : 1; /* bit 4 */
uint32_t block_scale : 1; /* bit 5 */
uint32_t mma_kind : 3; /* bits 6-8 */
uint32_t reserved : 23; /* bits 9-31 */
};
} Tcgen05Ctrl;
typedef union Tcgen05Collector {
uint32_t raw;
struct {
uint32_t valid : 1; /* bit 0 */
uint32_t collector_a : 2; /* bits 1-2 */
/* ashift overlays bit 2; encoder rejects the conflicting combination */
uint32_t reserved : 29; /* bits 3-31 */
};
} Tcgen05Collector;
A reimplementation that mirrors the binary layout must mask the reserved fields explicitly. Selection sometimes leaves uninitialized scratch bits in the upper half of the SDNode operand, and the verifier reads the full 32-bit word.
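One way to make the masking explicit is to pack and sanitize the control word with shifts instead of relying on bit-field layout, which C leaves implementation-defined. The sketch below assumes only the bit ranges from the control-word table; pack_tcgen05_ctrl, sanitize_tcgen05_ctrl, and the other helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0-8 carry fields; bits 9-31 are reserved (per the control-word table). */
#define TCGEN05_CTRL_LIVE_MASK 0x1FFu

static uint32_t pack_tcgen05_ctrl(unsigned cta_group, unsigned scale_vec_size,
                                  unsigned scale_input_acc, unsigned block_scale,
                                  unsigned mma_kind) {
    /* Reject values that would spill into a neighbouring field. */
    assert(cta_group <= 3u && scale_vec_size <= 3u);
    assert(scale_input_acc <= 1u && block_scale <= 1u && mma_kind <= 7u);
    return cta_group
         | (scale_vec_size  << 2)
         | (scale_input_acc << 4)
         | (block_scale     << 5)
         | (mma_kind        << 6);
}

/* Drop scratch bits left in the reserved range before verification. */
static uint32_t sanitize_tcgen05_ctrl(uint32_t raw) {
    return raw & TCGEN05_CTRL_LIVE_MASK;
}

/* True when the reserved range is already zero. */
static int tcgen05_ctrl_reserved_clear(uint32_t raw) {
    return (raw & ~TCGEN05_CTRL_LIVE_MASK) == 0;
}

static unsigned tcgen05_ctrl_mma_kind(uint32_t raw) {
    return (raw >> 6) & 0x7u;
}
```

A caller would sanitize the operand once after selection, then hand the cleaned word to both the verifier and the operand builder so the two never disagree about reserved bits.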
Subtarget Feature Probe
The verifier validates against the subtarget feature bitmap, not against an opaque target descriptor. Each tcgen05 capability the verifier needs corresponds to a single bit in the bitmap: has_tmem (datacenter Blackwell has the tensor-memory storage), has_wgmma (Hopper warp-group MMA is reachable), has_arch_conditional (sm_*a-suffixed variants are allowed), has_family_conditional (sm_*f-suffixed variants are allowed), and has_scale_input_accumulator (the SIA variant is implemented in hardware). The selector reads subtarget.features once per intrinsic; the verifier reads the same bitmap immediately after the packed control word lands in the machine operand list.
The bitmap is the same one consulted by the MatcherTable predicate row (see MatcherTable and Cost Scoring for the 507-byte stride matrix probe). The contract is: every feature gate the verifier rejects is also reachable as a MatcherTable predicate, so a fall-through from custom selection to the MatcherTable cannot accidentally produce an opcode that fails the verifier later.
Verifier Rules
The verifier is deliberately stricter than the MLIR verifier. It validates the actual subtarget tuple, the selected family, and the packed modifier word. The rules below operate on the decoded control and collector words.
void verify_tcgen05_mma(const Tcgen05MmaInst *inst, const NvptxSubtarget *target) {
Tcgen05Ctrl ctrl = decode_tcgen05_ctrl(inst->ctrl_word);
Tcgen05Collector coll = decode_tcgen05_collector(inst->collector_word);
SubtargetFeatures sf = target->features;
/* INT8 inputs require arch-conditional tcgen05. Diagnostic strings here
* are verbatim from the binary — see Mode Pattern Verifiers for the
* canonical 13-rule table including the preserved "colletor" typo. */
if (ctrl.mma_kind == TCGEN05_KIND_I8 && !sf.has_arch_conditional)
diag("INT8 type is supported only on arch-conditional variants.");
/* MXF4 sparse variants require arch-conditional tcgen05. */
if (inst->sparse && (ctrl.mma_kind == TCGEN05_KIND_MXF4NVF4
|| ctrl.mma_kind == TCGEN05_KIND_MXF4)
&& !sf.has_arch_conditional)
diag("MXF4 and MXF4NVF4 types with Sparsity are supported only on arch-conditional variants.");
/* Explicit scale-vector size requires arch-conditional tcgen05. */
if (ctrl.scale_vec_size != SCALE_VEC_IMPLICIT && !sf.has_arch_conditional)
diag("Explicit scale vector size is supported only on arch-conditional variants.");
/* Scale-input-accumulator requires a hardware feature and f16/tf32 inputs. */
if (ctrl.scale_input_acc) {
if (!sf.has_scale_input_accumulator)
diag("Scale input accumulator is not supported on this architecture.");
if (ctrl.mma_kind != TCGEN05_KIND_F16 && ctrl.mma_kind != TCGEN05_KIND_TF32)
diag("Scale input accumulator can only be used with f16 and tf32 types");
}
/* Block-scale-only restrictions. */
if (ctrl.block_scale) {
if (!block_scale_allows_kind(ctrl.mma_kind))
diag("Block scale is not supported for f16, tf32, f8f6f4, and i8 types");
if (coll.valid && coll.collector_a /* ashift overlay */ == COLLECTOR_ASHIFT)
diag("ashift is not supported with tcgen05.mma.block_scale variants");
}
/* Cross-field invariants. */
if (inst->weight_stationary && ctrl.cta_group == CTA_GROUP_2)
diag("cta_group::2 is not supported with weight stationary");
if (inst->weight_stationary && is_fp4_kind(ctrl.mma_kind))
diag("Cannot use weight stationary with mxf8f6f4 and fp4 types");
if (coll.valid && coll.collector_a == COLLECTOR_FILL_USE
&& coll.collector_a == COLLECTOR_ASHIFT)
diag("Cannot use collector::a::use or colletor::a::fill with ashift");
/* "colletor" typo preserved verbatim — required for diagnostic-string
* matching test suites. */
if (!scale_vec_allowed(ctrl.mma_kind, ctrl.scale_vec_size))
diag("scale vector size is not legal for this input family");
}
After validation, tcgen05 lowering assembles the final machine operands from the selected family. Dense variants carry the normal A/B layouts, control word, shape, collector state, and accumulator operands. Sparse and block-scaled variants append metadata and scale planes. The non-negotiable invariant: selection and MC expansion agree on one packed control-word schema.
TMA and Im2Col Validation
The TMA verifier covers global-to-shared tensor loads, shared-to-global tensor stores, and im2col modes. It decodes rank, mode, multicast, cache hint, byte class, and two-CTA mode, then selects the concrete machine form only after the architecture gates pass.
The verifier consults the same subtarget feature bitmap the tcgen05 verifier uses. Three bits matter here: has_wide_im2col (the target supports the W and W128 wide im2col variants, which PTX introduces with Blackwell rather than Hopper), has_two_cta_tma (the 2-CTA TMA instruction surface), and has_cluster_multicast (the multicast::cluster modifier on TMA copies). The verifier reads the bitmap once and rejects the instruction at the first mismatched gate.
void verify_tma_tensor_op(const TmaTensorInst *inst, const NvptxSubtarget *target) {
SubtargetFeatures sf = target->features;
if (inst->rank < 1 || inst->rank > 5)
diag("TMA rank must be in the range 1..5");
if (inst->mode == TMA_IM2COL
|| inst->mode == TMA_IM2COL_W
|| inst->mode == TMA_IM2COL_W128) {
if (inst->rank < 3)
diag("im2col tensor copies require at least three dimensions");
}
if ((inst->mode == TMA_IM2COL_W || inst->mode == TMA_IM2COL_W128)
&& !sf.has_wide_im2col)
diag("wide im2col tensor copies are not supported on this architecture");
if (inst->two_cta && !sf.has_two_cta_tma)
diag("two-CTA TMA tensor copies are not supported on this architecture");
if (inst->multicast && !sf.has_cluster_multicast)
diag("cluster multicast TMA requires a compatible SM target");
}
The second verifier is what stops stale target-machine state or an illegal feature string from producing unsupported Blackwell or Hopper instructions.
WGMMA Emission
Hopper WGMMA lowering turns nvgpu.warpgroup.mma into the standard
four-part protocol: fence, one or more async MMA instructions, commit,
wait. Descriptor offsets are expressed in 16-byte units, so every tile
step divides the byte offset by 16 before updating the shared-memory
descriptors.
void lower_wgmma(WgmmaOp op, Rewriter *rewriter) {
emit_nvvm_wgmma_fence_aligned(rewriter);
for (int m_tile = 0; m_tile < op.m / op.inst_m; ++m_tile) {
for (int k_tile = 0; k_tile < op.k / op.inst_k; ++k_tile) {
uint64_t a_desc = advance_smem_desc(op.a_desc, m_tile, k_tile, op.a_layout);
uint64_t b_desc = advance_smem_desc(op.b_desc, m_tile, k_tile, op.b_layout);
emit_nvvm_wgmma_mma_async(rewriter, op, a_desc, b_desc);
}
}
emit_nvvm_wgmma_commit_group_sync_aligned(rewriter);
emit_nvvm_wgmma_wait_group_sync_aligned(rewriter, 0);
}
uint64_t advance_smem_desc(uint64_t desc, int m_tile, int k_tile, WgmmaLayout layout) {
uint64_t byte_offset = layout_byte_offset(layout, m_tile, k_tile);
return desc + (byte_offset >> 4);
}
Operand-B type inference feeds the PTX descriptor form. Bit-level operands take the smallest selector class; i4/i8/u8 take the byte-class path; f16/bf16/tf32/f8 take the half/float class; sparse selectors take the extended selector form.
mbarrier Emission
The mbarrier phase protocol coordinates TMA-load completion, WGMMA commit, and tcgen05 producer/consumer handoff. The finalizer computes the expected transaction count, emits an initialization fence on SM90 and newer targets, invalidates the barrier when the enclosing scope requires it, then pairs that invalidation with a cluster-release fence.
| mbarrier field | Purpose |
|---|---|
| smem_base | Shared-memory address of the barrier object. |
| kind | Distinguishes ordinary barriers from TMA transaction barriers. |
| phase | Tracks parity / phase for wait operations. |
| expected_txn | Number of expected transaction completions. |
| arrive_count | Arrival count used by the producer side. |
| tag | Pipeline bookkeeping tag. |
void finalize_mbarrier_phase(MBarrierHandle *barrier, PhaseContext ctx) {
if (ctx.sm >= 90) {
emit_nvvm_fence_mbarrier_init();
}
barrier->expected_txn = barrier->kind == MBARRIER_TMA ? 32 * ctx.size_minor : 1;
if (ctx.requires_shared_invalidation) {
emit_nvvm_mbarrier_inval_shared(barrier->smem_base);
}
emit_fence_mbarrier_init_release_cluster();
}
Cluster Sync Emission
Cluster synchronization passes through three gates: target must be SM90 or
newer, launch must actually use more than one CTA per cluster, and the
Tileiras barrier scope must request cluster behavior. Single-CTA clusters
fall back to ordinary nvvm.barrier; multi-CTA clusters take the
arrive/wait pair.
void emit_cluster_sync(ClusterSyncRequest req, Rewriter *rewriter) {
if (req.sm < 90 || req.cluster_size == 1 || req.scope == BARRIER_SCOPE_CTA) {
emit_nvvm_barrier(rewriter);
return;
}
emit_nvvm_fence_mbarrier_init(rewriter);
emit_nvvm_cluster_arrive_relaxed(rewriter, req.aligned);
emit_nvvm_cluster_wait(rewriter, req.aligned);
}
Two-CTA Blackwell tensor-memory paths also read the cluster rank special
register. For paired CTAs, cluster.ctarank ^ 1 selects the peer CTA.
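The peer computation is small enough to sketch directly. The example assumes CTAs pair up as (0,1), (2,3), and so on, which is what XOR with 1 implies; the cluster_size parameter and the function name are hypothetical and exist only to bounds-check the rank.

```c
#include <assert.h>
#include <stdint.h>

/* Peer rank inside a CTA pair: flipping bit 0 maps 0<->1, 2<->3, ... */
static uint32_t peer_cta_rank(uint32_t cta_rank, uint32_t cluster_size) {
    assert(cluster_size % 2 == 0);    /* pairing needs an even CTA count */
    assert(cta_rank < cluster_size);  /* rank must lie inside the cluster */
    return cta_rank ^ 1u;
}
```

Applying the function twice returns the original rank, so either CTA of a pair can locate the other with the same expression.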
End-To-End Lowering
The tcgen05 path is a closed pipeline. The selector chooses a candidate machine family from the intrinsic and subtarget. The machine verifier rechecks the packed control word. The builder then materializes the MachineInstr the asm printer will later render as PTX.
MachineInstr *lower_tcgen05_mma(IntrinsicInst *intrin, const NvptxSubtarget *target) {
Tcgen05MmaInst inst = select_tcgen05_candidate(intrin, target);
verify_tcgen05_mma(&inst, target);
MachineOperand operands[MAX_TCGEN05_OPERANDS];
int num_operands = build_tcgen05_operands(&inst, operands);
return build_machine_instr(inst.machine_opcode, operands, num_operands);
}
Selector and verifier intentionally report different classes of errors. The selector rejects targets that cannot support tcgen05 at all; the verifier rejects instruction-family combinations that become illegal only after all modifiers, scale modes, sparsity bits, and collector modes have been packed.
Cross-References
Per-SM Emission Templates — SM100 / SM103 and WGMMA Descriptor Round-Trip document the actual PTX text the printer emits for tcgen05.mma and WGMMA, including the four-part WGMMA protocol and the worked WGMMA descriptor hex round-trip. ISelDAG and MatcherTable — Selector Layers documents the selector dispatcher that lands on these emitters and MatcherTable and Cost Scoring covers the predicate-row probe that shares the subtarget feature bitmap with the verifier. TMA Descriptor Shape and the cp.async.bulk Template Catalog cover the descriptor encoder for cp.async.bulk.tensor that the TMA verifier sits in front of.
ldmatrix/stmatrix Emission + Register Class Vtables
Abstract
Two table-driven parts of NVPTX code generation regularly coexist in the
same matrix instruction sequence. The first is the ldmatrix/stmatrix
selector, which maps MLIR/NVVM matrix-copy properties to LLVM intrinsic
IDs and reconstructed PTX mnemonics. The second is the NVPTX
register-class model that instruction selection and the asm printer
consult when emitting register declarations such as .reg .b32 %r<N>;
and the per-instruction operand prefixes (%r, %rd, %rs, %rq,
%p, %f).
Both subsystems are deliberately static. Matrix-copy lowering is a small
dispatcher over shape, matrix count, layout, transpose, and packed
element-width fields, with no runtime feedback. Register-class emission
is a fixed mapping from LLVM register classes to PTX declaration width
and printer prefix, with one subtlety: the %f (32-bit float view) class
shares physical register IDs with %r, and the %fd (64-bit float view)
prefix layers on top of %rd storage at print time only.
Matrix-Copy Templates
The warp-wide matrix-copy path passes through three dialect layers before it reaches llvm.call_intrinsic:
cute_nvgpu.arch.copy.{ldsm,stsm}
-> nvgpu.ldmatrix / (no `nvgpu.stmatrix` mnemonic in this binary;
stsm path lowers straight to `nvvm.stmatrix`)
-> nvvm.ldmatrix / nvvm.stmatrix / nvvm.movmatrix
-> llvm.call_intrinsic
The NVVM-to-LLVM tier receives a properties blob, validates legal
combinations, then selects an intrinsic ID. ldmatrix properties:
{num, shape, sz, trans, layout}. stmatrix properties: {num, shape, trans}.
| Family | Shape (enum) | Num | Layout / trans | sz-code (width) | Properties {num,shape,sz,trans,layout} | Intrinsic id | Reconstructed PTX mnemonic |
|---|---|---|---|---|---|---|---|
| ldmatrix | m8n8 (0) | 1 | no-trans | b16 | {1, 0, 0, 0, 0} | 9165 | ldmatrix.sync.aligned.m8n8.x1.b16 |
| ldmatrix | m8n8 (0) | 1 | .trans | b16 | {1, 0, 0, 1, 0} | 9166 | ldmatrix.sync.aligned.m8n8.x1.trans.b16 |
| ldmatrix | m8n8 (0) | 2 | no-trans | b16 | {2, 0, 0, 0, 0} | 9167 | ldmatrix.sync.aligned.m8n8.x2.b16 |
| ldmatrix | m8n8 (0) | 2 | .trans | b16 | {2, 0, 0, 1, 0} | 9168 | ldmatrix.sync.aligned.m8n8.x2.trans.b16 |
| ldmatrix | m8n8 (0) | 4 | no-trans | b16 | {4, 0, 0, 0, 0} | 9169 | ldmatrix.sync.aligned.m8n8.x4.b16 |
| ldmatrix | m8n8 (0) | 4 | .trans | b16 | {4, 0, 0, 1, 0} | 9170 | ldmatrix.sync.aligned.m8n8.x4.trans.b16 |
| ldmatrix | m8n16 (1) | 1 | row (mandatory; trans illegal) | b8 | {1, 1, 0, 0, 0} | 9160 | ldmatrix.sync.aligned.m8n16.x1.b8 |
| ldmatrix | m8n16 (1) | 1 | row | b8x16.b6x16_p32 | {1, 1, 1, 0, 0} | 9159 | ldmatrix.sync.aligned.m8n16.x1.b8x16.b6x16_p32 |
| ldmatrix | m8n16 (1) | 2 | row | b8 | {2, 1, 0, 0, 0} | 9162 | ldmatrix.sync.aligned.m8n16.x2.b8 |
| ldmatrix | m8n16 (1) | 2 | row | b8x16.b6x16_p32 | {2, 1, 1, 0, 0} | 9161 | ldmatrix.sync.aligned.m8n16.x2.b8x16.b6x16_p32 |
| ldmatrix | m8n16 (1) | 4 | row | b8 | {4, 1, 0, 0, 0} | 9164 | ldmatrix.sync.aligned.m8n16.x4.b8 |
| ldmatrix | m8n16 (1) | 4 | row | b8x16.b6x16_p32 | {4, 1, 1, 0, 0} | 9163 | ldmatrix.sync.aligned.m8n16.x4.b8x16.b6x16_p32 |
| ldmatrix | m16n16 (3) | 2 | col (mandatory) | b8 | {2, 3, 0, 0, 1} | 9155 | ldmatrix.sync.aligned.m16n16.x2.b8 |
| ldmatrix | m16n16 (3) | 2 | col | b8x16.b6x16_p32 | {2, 3, 1, 0, 1} | 9154 | ldmatrix.sync.aligned.m16n16.x2.b8x16.b6x16_p32 |
| ldmatrix | m16n16 (3) | 2 | col | b8x16.b6x16_p64 | {2, 3, 2, 0, 1} | 9153 | ldmatrix.sync.aligned.m16n16.x2.b8x16.b6x16_p64 |
| ldmatrix | m16n16 (3) | 4 | col | b8 | {4, 3, 0, 0, 1} | 9158 | ldmatrix.sync.aligned.m16n16.x4.b8 |
| ldmatrix | m16n16 (3) | 4 | col | b8x16.b6x16_p32 | {4, 3, 1, 0, 1} | 9157 | ldmatrix.sync.aligned.m16n16.x4.b8x16.b6x16_p32 |
| ldmatrix | m16n16 (3) | 4 | col | b8x16.b6x16_p64 | {4, 3, 2, 0, 1} | 9156 | ldmatrix.sync.aligned.m16n16.x4.b8x16.b6x16_p64 |
| stmatrix | m8n8 (0) | 1 | no-trans | b16 | {1, 0, 0, –, –} | 9862 | stmatrix.sync.aligned.m8n8.x1.b16 |
| stmatrix | m8n8 (0) | 1 | .trans | b16 | {1, 0, 1, –, –} | 9861 | stmatrix.sync.aligned.m8n8.x1.trans.b16 |
| stmatrix | m8n8 (0) | 2 | no-trans | b16 | {2, 0, 0, –, –} | 9864 | stmatrix.sync.aligned.m8n8.x2.b16 |
| stmatrix | m8n8 (0) | 2 | .trans | b16 | {2, 0, 1, –, –} | 9863 | stmatrix.sync.aligned.m8n8.x2.trans.b16 |
| stmatrix | m8n8 (0) | 4 | no-trans | b16 | {4, 0, 0, –, –} | 9866 | stmatrix.sync.aligned.m8n8.x4.b16 |
| stmatrix | m8n8 (0) | 4 | .trans | b16 | {4, 0, 1, –, –} | 9865 | stmatrix.sync.aligned.m8n8.x4.trans.b16 |
| stmatrix | m8n16 (2) | 1 | .trans (mandatory in observed arm) | b8 | {1, 2, 1, –, –} | 9858 | stmatrix.sync.aligned.m8n16.x1.trans.b8 |
| stmatrix | m8n16 (2) | 2 | .trans | b8 | {2, 2, 1, –, –} | 9859 | stmatrix.sync.aligned.m8n16.x2.trans.b8 |
| stmatrix | m8n16 (2) | 4 | .trans | b8 | {4, 2, 1, –, –} | 9860 | stmatrix.sync.aligned.m8n16.x4.trans.b8 |
| stmatrix alt import path | m8n8 (0) | – | attr 0 / attr 1 | – | – | 10379 / 10380 | stmatrix.sync.aligned.m8n8.{x?}.{trans?}.b16 sibling |
| stmatrix alt import path | m16n16 (3) | – | attr 0 / attr 1 | – | – | 10381 / 10382 | stmatrix.sync.aligned.m16n16.{...} sibling |
| ldmatrix sibling | – | – | – | – | – | 8366 | WGMMA / m8n16.x1 single-id sibling |
| movmatrix | m8n8 | 1 | .trans (mandatory) | b16 | (no arm; folded) | (none) | movmatrix.sync.aligned.m8n8.trans.b16 |
Shape enum value 2 is reserved for ldmatrix. m16n16 rejects num=1;
m8n16 rejects .trans and reports
Transposed layout is not supported for m8n16 shape for nvvm.ldmatrix.
The m8n8 arm is b16-only. movmatrix carries no separate selected
intrinsic in this path because its layout swap folds into
shufflevector and bitcast operations before instruction selection.
The selector is a thin validator over the properties blob, followed by an ID-table lookup:
unsigned select_matrix_copy(MatrixCopyNode *node, Subtarget *st) {
MatrixCopyProps p = decode_properties(node->properties);
/* Family-specific legality checks happen before any ID lookup so the
caller never has to recover from a bogus intrinsic id. */
if (node->family == LDMATRIX) {
if (p.shape == LDSM_M8N16 && p.trans) {
fatal("Transposed layout is not supported for m8n16 shape for nvvm.ldmatrix");
}
if (p.shape == LDSM_M16N16 && p.num == 1) {
fatal("m16n16 ldmatrix requires num=2 or num=4");
}
return select_ldmatrix_id(p);
}
require(node->family == STMATRIX, "unknown matrix-copy family");
return select_stmatrix_id(p);
}
The ID-selection bodies are compact enough to reimplement directly:
int select_ldmatrix_id(LdMatrixProps p) {
if (p.shape == LDSM_M16N16) {
return (p.num == 2 ? 9153 : 9156) + (2 - p.sz);
}
if (p.shape == LDSM_M8N16) {
static const int ids[3][2] = {
{9160, 9159},
{9162, 9161},
{9164, 9163},
};
return ids[num_to_index(p.num)][p.sz];
}
return 9165 + 2 * (p.num / 2) + (p.trans ? 1 : 0);
}
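int select_stmatrix_id(StMatrixProps p) {
if (p.shape == STSM_M8N16) {
/* x1, x2, x4 map to the consecutive ids 9858, 9859, 9860. */
return 9858 + p.num / 2;
}
/* m8n8: each count owns an id pair; the .trans id is one below no-trans. */
return 9862 - (p.trans ? 1 : 0) + 2 * (p.num / 2);
}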
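The selector bodies call num_to_index without defining it. A minimal sketch follows, assuming it maps the matrix count 1, 2, 4 to the row index 0, 1, 2; the per-arm helper functions are hypothetical and are written so that their results can be compared row by row against the intrinsic-id table.

```c
#include <assert.h>

/* Map the matrix count (1, 2, 4) to a dense row index (0, 1, 2). */
static int num_to_index(int num) {
    assert(num == 1 || num == 2 || num == 4);
    return num / 2;  /* 1 -> 0, 2 -> 1, 4 -> 2 */
}

/* ldmatrix, m8n8 arm: one (no-trans, trans) id pair per count. */
static int ldmatrix_m8n8_id(int num, int trans) {
    return 9165 + 2 * num_to_index(num) + (trans ? 1 : 0);
}

/* stmatrix, m8n8 arm: the table lists the .trans id one below no-trans. */
static int stmatrix_m8n8_id(int num, int trans) {
    return 9862 + 2 * num_to_index(num) - (trans ? 1 : 0);
}

/* stmatrix, m8n16 arm: consecutive ids for x1, x2, x4. */
static int stmatrix_m8n16_id(int num) {
    return 9858 + num_to_index(num);
}
```

Checking a few corners against the table (for example x4 .trans on each arm) is a cheap regression test for any reimplementation of the closed forms.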
NVPTX RegisterClass vtables
The NVPTX register classes used by the selector and asm printer are:
| Class | ClassID | Width | Declaration type | Printer prefix | Notes |
|---|---|---|---|---|---|
| %p | 0 | 1 bit | .pred | %p | predicate registers |
| %rs | 1 | 16 bit | .b16 | %rs | 16-bit integer registers |
| Special | 2 | 32 bit | internal | none | PTX special registers such as %tid and %laneid |
| %r | 3 | 32 bit | .b32 | %r | ordinary 32-bit integer registers |
| %f | 4 | 32 bit | printer-only | %f | float view over selected 32-bit register IDs |
| %rd | 5 | 64 bit | .b64 | %rd | 64-bit integer and f64 physical storage |
| %rq | 6 | 128 bit | .b128 | %rq | 128-bit registers |
The %f class is the easiest one to miss. The asm printer never declares
.reg .b32 %f<N>; because float registers print as a view of the same
underlying 32-bit register IDs %r uses. The class still exists so that
TargetRegisterInfo::getRegClass(MVT::f32) succeeds during DAG
legalization and copy lowering. No separate %fd class exists; f64
values physically live in %rd and print with the float-double prefix
only at instruction-print time.
Subclassing closes through %f: both %r and Special include %f in
their subclass masks, and %f lists %r and Special as superclasses.
Preserve that relationship in a reimplementation — it affects COPY
lowering and register-class queries even though %f is mostly invisible
in declarations.
The declaration printer is a pair of maps:
StringRef reg_class_type(const RegisterClass *rc) {
switch (rc->id) {
case RC_RQ:
return ".b128";
case RC_RD:
return ".b64";
case RC_R:
return ".b32";
case RC_RS:
return ".b16";
case RC_P:
return ".pred";
default:
return "INTERNAL";
}
}
StringRef reg_class_prefix(const RegisterClass *rc) {
switch (rc->id) {
case RC_RQ:
return "%rq";
case RC_RD:
return "%rd";
case RC_R:
return "%r";
case RC_RS:
return "%rs";
case RC_P:
return "%p";
default:
return "INTERNAL";
}
}
The declaration printer emits one .reg directive per non-empty class.
%f is skipped because its registers share IDs with %r and have
already been declared under that prefix; Special is skipped because
PTX special registers are predefined and need no declaration:
void print_reg_decls(Printer *p, const RegisterAllocation *ra) {
for (RegisterClass *rc : ra->classes) {
if (rc->id == RC_F || rc->id == RC_SPECIAL) {
continue; /* %f shares storage with %r; Special is intrinsic */
}
unsigned count = ra->count_for(rc);
if (count == 0) {
continue;
}
fprintf(p, "\t.reg %s %s<%u>;\n",
reg_class_type(rc), reg_class_prefix(rc), count);
}
}
Inside an instruction operand, the printer chooses %f over %r for
32-bit float MVTs and %fd over %rd for 64-bit float MVTs, printing the
same numeric register ID either way.
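The operand-side choice can be sketched as a small lookup. The enum values mirror the ClassID column of the register-class table; the function names and the is_float flag (standing in for the MVT query) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Class ids mirror the ClassID column of the register-class table. */
enum RegClassId { RC_P = 0, RC_RS = 1, RC_SPECIAL = 2, RC_R = 3,
                  RC_F = 4, RC_RD = 5, RC_RQ = 6 };

/* Float-typed values print through the float view of the same register id. */
static const char *operand_prefix(enum RegClassId rc, int is_float) {
    switch (rc) {
    case RC_P:  return "%p";
    case RC_RS: return "%rs";
    case RC_R:
    case RC_F:  return is_float ? "%f" : "%r";
    case RC_RD: return is_float ? "%fd" : "%rd";
    case RC_RQ: return "%rq";
    default:    return NULL;  /* special registers print by name, not prefix */
    }
}

/* Render "<prefix><id>"; returns 0 on success, -1 if no prefix applies
 * or the buffer is too small. */
static int format_operand(char *buf, size_t n, enum RegClassId rc,
                          int is_float, unsigned id) {
    assert(buf != NULL && n > 0);
    const char *prefix = operand_prefix(rc, is_float);
    if (prefix == NULL) return -1;
    int written = snprintf(buf, n, "%s%u", prefix, id);
    return (written >= 0 && (size_t)written < n) ? 0 : -1;
}
```

Under these assumptions register 7 prints as %r7 in an integer context and %f7 in a float context, while the declaration section only ever mentions %r.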
NVPTX Backend Passes Overview
Abstract
Once TileIR has been lowered to LLVM IR, the NVPTX backend normalizes that IR and the post-selection MachineIR so PTX emission sees legal kernel parameters, concrete address spaces, expanded aggregate copies, valid device launches, resolved image handles, and subtarget-compatible machine instructions. This page covers what is shared across the cluster: where each pass sits in the pipeline, what state it hands the next pass, and which globals it has to agree on. Per-pass mechanics live in the dedicated pages.
The cluster spans two IR levels. The LLVM-IR passes consume Function, Argument, Instruction, Metadata, address spaces, and intrinsics. The MachineIR passes consume MachineFunction, MachineInstr, machine operands, frame indices, and subtarget feature bits. SelectionDAG sits between the two. Passes that need semantic SSA-level information run on the IR side; passes that need concrete target opcodes run on the MachineIR side.
Pipeline Position
LLVM IR with NVVM intrinsics
|
| Pretreat (canonicalize frontend forms)
| KernelAttrPass (stamp nvvm.kernel)
| InlineMustPass (force AlwaysInline)
| CDPLaunchExpander (rewrite cudaLaunchDevice -> __cudaCDP*V2)
| LowerStructArgs (byval -> parameter-space pointer + scalar LDPARAM)
| MemorySpaceOpt (concrete AS inference, cvta folding)
| ProcessRestrict (noalias / alias-scope materialization)
| PrintfLowering (vprintf packing buffer)
| DeadSyncElim (barrier removal)
| CommonBaseElim (SCEV-keyed GEP CSE)
| NVVMIRVerifier (kernel-ABI invariants, parameter-space ceiling)
|
v
SelectionDAG instruction selection
|
| BASR (post-ISel address-arithmetic peephole)
| Image-handle rewrite (parametric -> slot opcode)
| Prolog/Epilog, proxy-reg erase, invariant-load tagging
|
v
PTX assembly
The order above is the ordering the rest of this cluster's pages assume. Two pages call out specific ordering constraints explicitly: ProcessRestrict must follow MemorySpaceOpt so it sees concrete address-space tags on derived pointers, and BASR must follow instruction selection so it sees the final MachineInstr opcodes rather than IR-level GEPs.
Cross-Pass Invariants
The pages in this cluster share three pieces of state that have to agree across pass boundaries. Getting any of them wrong produces either silent miscompiles or a downstream verifier abort.
Kernel identity
KernelAttrPass, KernelAttrTransplanter, InlineMustPass, CDPLaunchExpander, KernelArgEliminator, NVVMIRVerifier, and the parameter-space ceiling check all consult a single isKernelFunction predicate. The predicate is a four-way disjunction over CallingConv::PTX_Kernel (0x47), the nvvm.kernel attribute, the nvvm.annotations_transplanted attribute, and the legacy "kernel" string attribute. Forking this check across passes is how older NVPTX backends produced inconsistent answers between argument elimination and the inliner. See Kernel Identity for the canonical definition.
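A minimal sketch of the shared predicate, assuming the four markers have already been extracted into a flat record. The FunctionInfo struct and its field names are hypothetical; only the calling-convention value 0x47 and the attribute names come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CC_PTX_KERNEL 0x47u  /* CallingConv::PTX_Kernel */

/* Hypothetical flattened view of the properties the predicate inspects. */
typedef struct FunctionInfo {
    unsigned calling_conv;
    bool has_nvvm_kernel_attr;               /* nvvm.kernel */
    bool has_annotations_transplanted_attr;  /* nvvm.annotations_transplanted */
    bool has_legacy_kernel_attr;             /* legacy "kernel" string attribute */
} FunctionInfo;

/* Single definition every kernel-aware pass should call. */
static bool is_kernel_function(const FunctionInfo *f) {
    assert(f != NULL);
    return f->calling_conv == CC_PTX_KERNEL
        || f->has_nvvm_kernel_attr
        || f->has_annotations_transplanted_attr
        || f->has_legacy_kernel_attr;
}
```

Keeping one definition and calling it from every pass is the point of the design: two passes that each test a subset of the four markers can disagree about the same function.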
Shared parameter-space enable flag
LowerStructArgs and MemorySpaceOpt both read the same boolean enable flag at startup. When the flag is set, LowerStructArgs rewrites each by-value struct argument to a parameter-space pointer plus per-field LDPARAM (MI opcode 101) loads, and MemorySpaceOpt then seeds its lattice on those parameter-space pointers and folds the resulting CVT_PARAM_TO_GENERIC / CVT_PARAM_TO_GLOBAL casts (MI opcodes 49 / 50). A mismatch — one pass enabled, the other disabled — produces by-value pointers MemorySpaceOpt cannot classify, and NVVMIRVerifier then rejects the function with a "pointer-to-local-or-generic launch argument" diagnostic. Reimplementations have to gate both passes on the same flag.
Pass-to-pass attribute hand-off
| Producer | Attribute or metadata | Consumer |
|---|---|---|
| KernelAttrPass, KernelAttrTransplanter | nvvm.kernel, nvvm.annotations_transplanted | Every later kernel-aware pass |
| LowerStructArgs | parameter-space LDPARAM SSA chain on byval args | MemorySpaceOpt |
| MemorySpaceOpt | concrete address-space tag on every pointer SSA value | ProcessRestrict, NVPTX alias analysis |
| ProcessRestrict | nvvm.restrict_scope per pointer, nvvm.restrict_processed per function | NVPTX alias analysis |
| PrintfLowering | %vprintfBuffer.local alloca, call @vprintf(...) | None (terminal) |
| CDPLaunchExpander | call @__cudaCDP{1,2}LaunchDeviceV2 | NVVMIRVerifier (re-checks the callee is a kernel) |
| KernelAttrPass + LowerStructArgs | byval-aware parameter list | NVVMIRVerifier (parameter-space ceiling) |
The verifier consumes several of these hand-offs: parameter-space sizes from the byval-aware list, address spaces for launch arguments, and the kernel attribute for the launch-target sanity check. Running the verifier before every producer it depends on has fired leads to a false-positive abort.
Routing
| Page | Covers |
|---|---|
| Kernel, CDP, Force-Inline, and Pretreat | Pretreat, kernel attribute stamping, InlineMustPass, CDP launch and parameter-buffer expansion, isKernelFunction. |
| LowerStructArgs | Bare-pointer ABI translation for by-value struct parameters, including the cast-only fast path and nested-aggregate recursion. |
| Memory-Space Optimization and Restrict | Inter-procedural callee specialization, the function-local AS lattice, the cast folder, and __restrict__ propagation. |
| Printf Lowering and the vprintf ABI | Tag-driven rewrite of printf into vprintf, the per-thread packing buffer, and the constant-AS format-string check. |
| Dead Sync Elimination and Common Base | Cross-product test for redundant barriers, and SCEV-keyed GEP merging with alloca cloning. |
| NVVM IR Verifier | Launch-argument address-space check and the parameter-space ceiling per SM family. |
| Peephole, MIR Cleanup, and Image Handles | BASR post-ISel address-arithmetic peephole, the parametric-to-slot rewrite for tex / sust / suld / suq, and final MachineIR cleanup. |
For the shared backend relationship with cicc, see cicc comparison.
NVPTX Peephole, MIR Cleanup, and Image Handles
Abstract
This is the cleanup window around instruction selection. A post-ISel MachineIR peephole pass walks an 801-row pattern table and fuses address-arithmetic chains into their consumers; the same dispatcher hosts BASR (the central base-address-slice-replace fold) and the image-handle replacement that rewrites texture and surface operands from parameter handles into slot operands. The final MachineIR cleanup passes strip target pseudos, fix frame-index address forms, and tag invariant loads. Together they hand PTX printing concrete, target-legal instructions.
Peephole MIR
The MachineIR peephole pass is the central post-ISel rewriter for NVPTX. It runs after instruction selection on the MachineFunction form and applies an 801-row pattern table that matches MIR sequences and rewrites them in place. The two canonical rewrites are BASR (Base-Address-Slice-Replace), which fuses a GEP-style base computation into its consuming load or store, and the image-handle rewrite, which is documented in its own section below.
Input and Output MIR Shape
input (MIR, before peephole):
%1:i64 = MUL64ri %iv, 4
%2:p1 = ADD64rr %base, %1
%3:i32 = LD32 %2, 0
output (MIR, after BASR):
%3:i32 = LD32_BASE_SLICE_OFFSET %base, %iv, 4
The fused LD32_BASE_SLICE_OFFSET carries the base pointer, the slice (index) register, and the constant stride directly; the intermediate MUL64 and ADD64 MIs are dead and removed in the same pass. The same shape applies to stores and to a handful of address-arithmetic chains that ISel leaves around tensor-memory addresses.
Pattern Table Structure
The pattern table is 801 rows, one per recognized rewrite. Each row carries:
- an opcode_mask field giving the set of MI opcodes that can trigger this row;
- a forward or backward direction marker selecting which chain walker to use;
- a per-row matcher that inspects the candidate MI's operands and predecessors (or successors);
- an emit function that constructs the replacement MI and erases the matched sequence.
Dispatch over opcode_mask is constant-time: the pass maintains an opcode → row[] index built once at module entry, so each MI scan does a single hash lookup rather than walking all 801 rows. The mask is a bitset over the 14 active opcode classes — GEP, LOAD, STORE, ADD, SUB, MUL, AND, OR, SHL, SHR, BITCAST, EXTRACT, INSERT, PHI. PHI is included so GEP bases threaded through loop headers can still be canonicalized; the BITCAST / EXTRACT / INSERT classes handle the pointer-typing pseudos NVPTX selection leaves around tensor-memory addresses.
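The build-once index can be sketched with per-class buckets. This is an illustration under stated assumptions: the text describes a hash lookup, whereas the sketch uses a plain array indexed by opcode class, and PatternRow, RowIndex, and build_row_index are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_OPCODE_CLASSES = 14, MAX_ROWS = 801 };

/* Hypothetical row: only the trigger mask matters for dispatch. */
typedef struct PatternRow {
    uint32_t opcode_mask;  /* bit c set => opcode class c can trigger this row */
} PatternRow;

/* Per-class buckets of row indices, built once and read-only afterwards. */
typedef struct RowIndex {
    uint16_t rows[NUM_OPCODE_CLASSES][MAX_ROWS];
    size_t   count[NUM_OPCODE_CLASSES];
} RowIndex;

static void build_row_index(RowIndex *idx, const PatternRow *table, size_t n) {
    assert(n <= MAX_ROWS);
    for (int c = 0; c < NUM_OPCODE_CLASSES; ++c)
        idx->count[c] = 0;
    for (size_t r = 0; r < n; ++r)
        for (int c = 0; c < NUM_OPCODE_CLASSES; ++c)
            if (table[r].opcode_mask & (1u << c))
                idx->rows[c][idx->count[c]++] = (uint16_t)r;
}
```

After the build, scanning an MI of class c touches only idx->rows[c][0 .. count[c]), so the per-instruction cost depends on how many rows that class can trigger rather than on the full table size.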
BasrState
Each MachineFunction visit allocates a BasrState scratch record that tracks per-basic-block state across pattern attempts:
typedef struct BasrState {
MachineFunction *mf;
DenseMap<unsigned, MachineInstr*> intern; // canonical-base interning
SmallVector<MachineInstr*, 16> work_list; // pending MIs to retry
DenseMap<MachineInstr*, GepInfo> gep_cache; // memoized base decomposition
uint64_t opcode_mask;
bool debug_enabled;
} BasrState;
The intern map collapses syntactically distinct but semantically equal base computations to the same canonical MI, so a second occurrence of base + i*4 reuses the first occurrence's BASR output instead of allocating new operands. The work list is drained in dominator order, which guarantees that uses of a folded base are rewritten before the base itself is erased.
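The interning idea can be illustrated with a small structural key. This sketch assumes a base computation has already been decomposed into (base register, index register, stride); it uses a linear table as a stand-in for the DenseMap, and BaseKey, InternTable, and intern_base are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INTERNED = 64 };

/* Decomposed form of base + index * stride. */
typedef struct BaseKey {
    unsigned base_reg;
    unsigned index_reg;
    uint64_t stride;
} BaseKey;

typedef struct InternTable {
    BaseKey keys[MAX_INTERNED];
    int     count;
} InternTable;

static void intern_init(InternTable *t) { t->count = 0; }

/* Return the canonical slot for k, inserting it on first sight. */
static int intern_base(InternTable *t, BaseKey k) {
    for (int i = 0; i < t->count; ++i) {
        const BaseKey *e = &t->keys[i];
        if (e->base_reg == k.base_reg && e->index_reg == k.index_reg
            && e->stride == k.stride)
            return i;  /* equal decomposition: reuse the first occurrence */
    }
    assert(t->count < MAX_INTERNED);
    t->keys[t->count] = k;
    return t->count++;
}
```

Two MIs that both decompose to the same key receive the same slot, which is what lets the second occurrence of base + i*4 reuse the first occurrence's rewritten operands.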
Forward and Backward Chain Walkers
The 801 rows split into forward and backward families:
- Forward chain walker. Matches sequences of N MIs starting at a given root, walking forward through users. A BASR row that fuses MUL + ADD + LOAD into LD_BASE_SLICE_OFFSET is forward-rooted at the MUL, then descends to the ADD, then to the LOAD. The walker stops at the first non-matching user or at a use that escapes the basic block under the row's locality requirement.
- Backward chain walker. Matches sequences ending at a given sink, walking backward through defs. Dead-code-style rewrites, where the consumer is the trigger and the producers are folded into it, use the backward walker. An LD32 row that absorbs a preceding ADD64 into its addressing mode is backward-rooted at the load and traces defs back through the ADD to the MUL.
The split exists because some patterns are cheaper to match top-down (a single root with many possible tails) and others bottom-up (a single sink with many possible heads). The dispatcher picks the right walker per row from the direction marker; the row itself does not see the choice.
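The two walk directions can be sketched on a toy def-use model. Everything here is hypothetical: Inst, Op, match_forward, and match_backward are illustration names, and requiring a single user per forward step is a simplification of the stop conditions listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Mul, Add, Load, Other };

struct Inst {
    Op opcode;
    int def;                 // index of the instruction defining the chained operand, -1 if none
    std::vector<int> users;  // indices of instructions that use this result
};

// Forward: the root must match chain[0]; each later link must be the sole user.
inline bool match_forward(const std::vector<Inst>& f, int root,
                          const std::vector<Op>& chain) {
    int cur = root;
    for (std::size_t i = 0; i < chain.size(); ++i) {
        if (cur < 0 || f[cur].opcode != chain[i]) return false;
        if (i + 1 == chain.size()) return true;
        cur = f[cur].users.size() == 1 ? f[cur].users[0] : -1;
    }
    return false;
}

// Backward: the sink must match the last chain entry; earlier links follow defs.
inline bool match_backward(const std::vector<Inst>& f, int sink,
                           const std::vector<Op>& chain) {
    int cur = sink;
    for (std::size_t i = chain.size(); i-- > 0;) {
        if (cur < 0 || f[cur].opcode != chain[i]) return false;
        if (i == 0) return true;
        cur = f[cur].def;
    }
    return false;
}
```

Both functions accept the same chain description, so a dispatcher can choose the direction per row without the row's matcher knowing which walker ran.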
Invariant-Load Whitelist
Certain loads are never rewritten away even when the pattern table's matcher claims a fold. The whitelist is the set of loads whose result is observably stable across all reachable program points, so any rewrite that erases or reorders them changes the program:
- loads of program-counter-relative globals (CUDA kernel constants emitted into .text);
- loads from the constant address space (addrspace(4));
- loads with !invariant.load metadata;
- loads from grid-constant parameters;
- loads from the special-register file (thread/block/grid IDs, clock64, globaltimer).
The whitelist is enforced as the first check in every row's emit function: if the candidate matches the load shape but is on the whitelist, the row's emit short-circuits and the pass continues. Removing a row's whitelist check produces a kernel that loses its broadcasted constants, which manifests as nondeterministic kernel outputs depending on warp scheduling.
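A minimal sketch of the whitelist-first ordering follows, assuming each candidate load has been summarized into a small descriptor. LoadDesc, is_whitelisted_load, and try_emit_fold are hypothetical names. Only the five whitelist categories and the check-first ordering are taken from the text.

```cpp
#include <cassert>

// Hypothetical summary of a candidate load.
struct LoadDesc {
    unsigned addr_space = 0;
    bool pc_relative_global = false;      // kernel constant emitted into .text
    bool has_invariant_metadata = false;  // carries !invariant.load
    bool grid_constant_param = false;
    bool reads_special_register = false;  // thread/block/grid IDs, clock64, globaltimer
};

constexpr unsigned kConstantAddrSpace = 4;

inline bool is_whitelisted_load(const LoadDesc& l) {
    return l.pc_relative_global || l.addr_space == kConstantAddrSpace ||
           l.has_invariant_metadata || l.grid_constant_param ||
           l.reads_special_register;
}

// Stand-in for a row's emit function: the whitelist test runs before anything else.
inline bool try_emit_fold(const LoadDesc& candidate) {
    if (is_whitelisted_load(candidate)) return false;  // short-circuit, keep the load
    // The actual rewrite would happen here.
    return true;
}
```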
BASR-Specific Algorithm
void run_basr(MachineFunction *mf, BasrState *s) {
seed_work_list(s, mf);
while (!s->work_list.empty()) {
MachineInstr *mi = s->work_list.pop_back_val(); // pop_back() returns void on SmallVector
for (Pattern *row : pattern_table_for_opcode(mi->opcode)) {
if (row->direction == FORWARD) {
if (match_forward_chain(s, mi, row)) {
emit_replacement(s, mi, row); // erases matched chain
requeue_users(s, mi);
break;
}
} else {
if (match_backward_chain(s, mi, row)) {
emit_replacement(s, mi, row);
requeue_defs(s, mi);
break;
}
}
}
}
}
A cl::opt<bool> named -print-basr turns on the BASR debug print. When set, BASR emits "phi maxLoopInd = " followed by the current loop induction-variable count for every MachineFunction it visits, so a -print-basr run shows the loop-nest depth the rewriter sees at each entry.
Failure Modes
- Pattern miss leaves redundant arithmetic. A row that fails to match because its operand shape diverges from the canonical form (a different operand order, a non-constant stride) leaves the original MUL + ADD + LOAD chain. Correct but suboptimal.
- Whitelist erosion changes program semantics. A reimplementation that loses the invariant-load whitelist will fold an LD32 of a kernel constant into an addressing-mode field and erase the original constant load; downstream consumers see uninitialized data.
- Dominator-order violation requeues forever. The work list is drained in dominator order on purpose. A naive FIFO can requeue an instruction whose dependency has not been rewritten yet, leading to oscillation. The intern map breaks the cycle, but losing it causes the pass to fail to terminate on adversarial inputs.
Image Handle Replacement
Input and Output MIR Shape
The image-handle pass is a MachineFunction pass operating on selected NVPTX MachineIR. It rewrites parametric-form texture and surface MIs into slot-form MIs immediately before PTX printing.
input (MIR, parametric form):
%h:p4 = LD_PARAM_p4 %image_arg_offset ; load handle from .param space
%v:v4f32 = TEX_2D_F32_F32_param %h, %x, %y
output (MIR, slot form):
%v:v4f32 = TEX_2D_F32_F32_slot 3, %x, %y ; slot 3 of the texture-unit register file
The slot is the runtime register-file index that the CUDA driver binds to the texture or surface object at launch. The parametric opcode is one of 801 cases across four families (tex, sust, suld, suq); each one has a sibling slot opcode in a parallel table that the rewriter looks up directly.
Matching Predicate
A MachineInstr matches iff its opcode is a *_param form in one of the four families and its handle operand resolves to a kernel image-argument handle. The remaining operands (coordinates, stored values) play no part in the match. The handle resolution is the analytical core: the actual TEX_*_param MI may not see the handle as a direct operand because MIR has rerouted it through COPY and PHI instructions, so the pass uses two chain walkers to trace the virtual-register definition back to a kernel image-argument table entry.
| Helper | Direction |
|---|---|
| Forward chain walker | follows uses through COPY / PHI toward the consumer |
| Backward chain walker | follows defs through COPY / PHI back to the handle argument |
The forward walker is what the consumer uses to discover that a particular handle definition reaches it; the backward walker is what the consumer uses to find the slot.
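The backward trace through COPY and PHI can be sketched on a toy definition table. Def, Kind, trace_handle_backward, and resolve_image_arg are hypothetical names. Treating a PHI whose incoming handles disagree as unresolved is an assumption of this sketch, not a documented rule.

```cpp
#include <cassert>
#include <set>
#include <vector>

enum class Kind { ImageArg, Copy, Phi, Other };

struct Def {
    Kind kind;
    int arg_index;             // image-argument table entry for ImageArg, else unused
    std::vector<int> sources;  // defining vregs for Copy (one) and Phi (several)
};

constexpr int kUnresolved = -1;
constexpr int kCycle = -2;  // internal marker: the path looped back through a PHI

inline int trace_handle_backward(const std::vector<Def>& defs, int vreg,
                                 std::set<int>& on_path) {
    const Def& d = defs[vreg];
    if (d.kind == Kind::ImageArg) return d.arg_index;
    if (d.kind == Kind::Other) return kUnresolved;
    if (!on_path.insert(vreg).second) return kCycle;  // already on this path
    int result = kCycle;
    for (int src : d.sources) {
        int r = trace_handle_backward(defs, src, on_path);
        if (r == kCycle) continue;  // a loop back-edge adds no information
        if (r == kUnresolved || (result != kCycle && result != r)) {
            result = kUnresolved;   // opaque source or disagreeing handles
            break;
        }
        result = r;
    }
    on_path.erase(vreg);
    return result;
}

inline int resolve_image_arg(const std::vector<Def>& defs, int vreg) {
    std::set<int> on_path;
    int r = trace_handle_backward(defs, vreg, on_path);
    return r == kCycle ? kUnresolved : r;
}
```

A loop-carried PHI that only ever sees one handle still resolves, because the back-edge is skipped rather than counted as a conflict.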
Rewrite Tables
Each family carries its own opcode rewrite table. The transformation is a single integer indexed lookup: each _param opcode value has a sibling _slot opcode value at a fixed offset, so the lookup is a direct index from the parametric opcode value to the slot opcode value.
| Family | Cases | PTX op family |
|---|---|---|
| tex | 165 | tex.* |
| sust | 210 | sust.* |
| suld | 258 | suld.* |
| suq | 168 | suq.* |
The four tables together cover all 801 image-handle opcodes.
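The fixed-offset lookup and the family totals can be checked with a short sketch. FamilyRange and slot_opcode_for are hypothetical names, and the opcode numbers used with them are made up for illustration. Only the four case counts and the fixed-offset idea come from the tables above.

```cpp
#include <cassert>
#include <optional>

constexpr unsigned kTexCases = 165;
constexpr unsigned kSustCases = 210;
constexpr unsigned kSuldCases = 258;
constexpr unsigned kSuqCases = 168;
static_assert(kTexCases + kSustCases + kSuldCases + kSuqCases == 801,
              "family sizes must cover all image-handle opcodes");

// Hypothetical description of one family's contiguous opcode range.
struct FamilyRange {
    unsigned first_param;  // first *_param opcode of the family (assumed contiguous)
    unsigned count;        // number of cases in the family
    unsigned slot_offset;  // distance from a *_param opcode to its *_slot sibling
};

// Direct index: no search, just a range check and an addition.
inline std::optional<unsigned> slot_opcode_for(const FamilyRange& fam,
                                               unsigned opcode) {
    if (opcode < fam.first_param || opcode >= fam.first_param + fam.count) {
        return std::nullopt;  // not a *_param opcode of this family
    }
    return opcode + fam.slot_offset;
}
```

Returning an empty optional for an out-of-range opcode gives the caller a place to report a lookup miss instead of silently producing a wrong sibling.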
Algorithm
void rewriteImageHandles(MachineFunction *mf) {
ImageArgTable images = collect_kernel_image_arguments(mf);
for (MachineBasicBlock &mbb : *mf) {
for (MachineInstr &mi : mbb) {
if (!is_image_param_opcode(mi.getOpcode())) continue;
ImageHandle handle = trace_image_handle_backward(&mi);
ImageSlot slot = images.lookup(handle);
if (!slot.valid) {
emit_error(mi, "invalid image handle");
continue;
}
unsigned slot_op = slot_opcode_for(mi.getOpcode()); // table lookup
mi.setOpcode(slot_op);
replace_handle_operand_with_slot(&mi, slot);
}
}
}
Failure Modes
- Handle does not resolve to a kernel argument. A handle threaded through an opaque pointer or a non-image global makes the backward walker terminate at a non-image-arg definition; the pass emits a diagnostic and leaves the MI as a *_param form, which the PTX printer cannot handle. This is a hard error.
- Slot table lookup miss. A *_param opcode without a sibling *_slot entry in the family table indicates a pattern the rewriter does not cover; the pass leaves the MI in place and the printer fails downstream.
- Family confusion. A reimplementation that picks the wrong family table for an opcode produces a slot MI of the wrong family (for example, a tex opcode mapped through the sust table), and the GPU traps at runtime when the texture unit decodes the wrong descriptor format.
MachineIR Peepholes
The post-ISel peephole pass strips target pseudos that were useful during selection but illegal for printing. The central canonical cleanup is frame-index address folding: a temporary local address move followed by a local-address conversion can often be replaced by the frame index itself.
void run_machine_peepholes(MachineFunction mf) {
for (MachineBasicBlock mbb : mf.blocks) {
for (MachineInstr mi : mbb.instructions) {
if (is_local_cvta_of_frame_address(mi)) {
replace_uses_with_frame_index(mi);
erase_dead_address_pseudos(mi);
}
if (matches_target_specific_copy_chain(mi)) {
fold_copy_chain(mi);
}
}
}
}
Gate target-specific copy-chain folding behind a command-line or build-time option — it is more sensitive to TableGen opcode layout than the canonical frame-address fold.
Prolog/Epilog, Proxy Registers, and Invariant Loads
The remaining MachineIR cleanup is conventional NVPTX target work:
| Pass | Contract |
|---|---|
| Prolog/Epilog | Lay out frame objects, replace frame indices, and emit target prolog/epilog code. |
| Proxy register erasure | Replace proxy-register pseudos with the real source register and erase the pseudos. |
| Invariant-load tagging | Mark loads as invariant only when all bounded uses preserve the invariant contract. |
Invariant-load tagging should be conservative. Parameter, constant, and global loads usually qualify when their use graph is simple. Tensor-memory loads stay off the whitelist unless their selected opcodes and memory operands already carry the needed semantics.
bool load_can_be_invariant(MachineInstr load) {
if (!address_space_allows_invariant_load(load.mem_operand.space)) {
return false;
}
for (MachineInstr user : bounded_use_graph(load, MAX_INVARIANT_DEPTH)) {
if (!is_allowed_invariant_use(user)) {
return false;
}
}
return true;
}
Cross-References
ISel DAG and MatcherTable is what feeds BASR with the post-selection MachineInstr opcodes it folds. Common Base Elimination is the IR-level sibling that performs the analogous GEP-CSE before SelectionDAG runs. AsmPrinter is the PTX-printing consumer that requires the slot-form image opcodes this pass produces. NVPTX Backend Passes Overview places these MIR cleanup passes after instruction selection.
LowerStructArgs: Bare-Pointer ABI Translation
Abstract
LowerStructArgs rewrites by-value struct parameters into parameter-space pointers. Every aggregate load of the original SSA argument becomes a scalar LDPARAM at the field's computed offset, and the use graph gets rewired so downstream instructions consume the loaded scalars instead of the original struct value. The pass runs while the LLVM-level struct shape is still visible and before instruction selection, so selection sees only pointer-and-scalar traffic.
NVPTX cannot pass an aggregate object directly through register classes the way the IR-level ABI pretends it can. Every by-value struct parameter has to be materialized as a pointer into parameter space, loaded piecewise, and address-space-cast wherever the original value flowed into a generic-pointer or global-pointer consumer.
Rewrite Shape
The pass operates at the LLVM-IR / SelectionDAG MachineIR boundary. For an arbitrary by-value struct parameter %s, the shape it consumes and the shape it produces are:
input : define ptx_kernel void @k(%S byval(%S) %s) {
%x = getelementptr %S, ptr %s, i32 0, i32 1
%v = load i32, ptr %x
...
}
output : define ptx_kernel void @k(ptr addrspace(101) %s.param) {
%v = call i32 @llvm.nvvm.ldparam.i32(ptr addrspace(101) %s.param, i64 4)
%v.gen = call i32 @llvm.nvvm.cvt.generic.to.as(i32 %v, i32 ...)
...
}
The byval aggregate parameter becomes a parameter-space (addrspace(101)) pointer; every load that read a struct field is replaced by LDPARAM (MI opcode 101) reading from the parameter pointer at the field's offset, followed by CVT_GENERIC_TO_AS (opcode 80) when the loaded scalar still flowed into a typed pointer consumer.
Algorithm
The pass body is a function-local rewriter. It seeds a work list from every by-value struct argument of the current function, drains the work list depth-first, and emits replacement MIs at each use site. Each work item carries the original SSA value, its computed replacement, and the specific use edge that needs rewiring — not just the user instruction. GEP chains feeding several downstream loads share a user but not a use, and each use needs an independent rewrite to keep SSA def-use chains consistent for the passes downstream.
typedef struct WorkItem {
Value *defining; // original SSA value being rewritten
Value *replacement; // new value: loaded scalar, parameter-space pointer, or cast
Use *use_edge; // the specific use site to rewire
} WorkItem;
LogicalResult lower_struct_args(Function *fn) {
if (!opt_byval_enabled) return success(); // shared flag, see below
WorkList<WorkItem> wl = seed_from_byval_args(fn);
while (!wl.empty()) {
WorkItem item = wl.pop();
Instruction *user = cast<Instruction>(item.use_edge->getUser());
switch (user->getOpcode()) {
case GEP: rewrite_gep(user, item); push_uses(wl, user); break;
case Load: rewrite_load(user, item); break;
case Store: rewrite_store(user, item); break;
case Call: rewrite_call_arg(user, item); break;
default: emit_diagnostic(user); return failure();
}
}
return success();
}
GEPs are the only opcode that re-seeds the work list: a GEP of the by-value struct produces a new pointer whose own uses must be rewritten, so the walker descends into them. Loads, stores, and calls terminate the rewrite — the materializer emits the LDPARAM + CVT_GENERIC_TO_AS pair (or, for calls and stores, the appropriate address-cast variant), and the original instruction is either replaced or has its operand swapped to the loaded scalar.
Unknown opcodes bail with a diagnostic rather than silently leaving half-rewritten def-use chains for later passes to trip over.
Materializer
The materializer is the single entry point for emitting replacement MIs. Given a work item, it computes the offset of the requested scalar inside the original struct (using the LLVM DataLayout for the active target), emits an LDPARAM reading from the rewritten parameter pointer at that offset, then emits a CVT_GENERIC_TO_AS to coerce the loaded value back to the original SSA type:
Value *materialize_field_load(IRBuilder *b, Value *param_ptr,
StructType *struct_ty, unsigned field_idx) {
uint64_t off = layout.struct_field_offset(struct_ty, field_idx);
Type *field_ty = struct_ty->getElementType(field_idx);
Value *ld = b->createCall(intrinsic_ldparam(field_ty),
{param_ptr, b->getInt64(off)});
Value *cv = b->createCall(intrinsic_cvt_generic_to_as(field_ty),
{ld, b->getInt32(/*target_as=*/0)});
return cv;
}
Order matters: the cast consumes the load's result, and the load consumes the parameter pointer rather than the original aggregate pointer, so the rewrite naturally severs the use graph from the original by-value argument.
MI Opcodes
Four machine-instruction opcodes participate in the rewrite. The materializer picks among them based on the original use's address space and what the consumer expects.
| MI opcode | Mnemonic | When emitted |
|---|---|---|
| 49 | CVT_PARAM_TO_GENERIC | Cast a .param pointer to a generic pointer for a downstream generic-space use. |
| 50 | CVT_PARAM_TO_GLOBAL | Cast a .param pointer directly to a global-space pointer. |
| 80 | CVT_GENERIC_TO_AS | Coerce a loaded scalar back to the original SSA pointer type. |
| 101 | LDPARAM | Load a scalar from .param space at a computed offset from the parameter pointer. |
Opcode 101 always precedes opcode 80 in the materialized sequence: read the scalar out of parameter space first, then cast it to whatever pointer flavor the original SSA value carried. Opcodes 49 and 50 fire only on the address-cast path, where the original by-value struct's address itself flowed into a generic or global consumer rather than being loaded through. The cast-only path is documented separately below.
Worked Example: Field-Level Rewrite
Take the struct
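%S = type { double, i8, [4 x i32] }
On the standard NVPTX target the DataLayout places the double at offset 0, the i8 at offset 8, three padding bytes at offsets 9–11, and [4 x i32] at offset 12. The field data ends at byte 28; because the struct's alignment is 8, the allocated size rounds up to 32 bytes, with four bytes of tail padding.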
Input function:
define ptx_kernel void @k(%S byval(%S) align 8 %s) {
entry:
%p_f = getelementptr %S, ptr %s, i32 0, i32 0
%f = load double, ptr %p_f
%p_b = getelementptr %S, ptr %s, i32 0, i32 1
%b = load i8, ptr %p_b
%p_a = getelementptr %S, ptr %s, i32 0, i32 2, i32 3
%a3 = load i32, ptr %p_a
...
}
The rewriter seeds a work list with the byval argument %s and walks each user GEP. For each GEP it computes the field offset from the DataLayout, then for each downstream load of that pointer it emits an LDPARAM at that offset. None of the three loaded values is a pointer, so no CVT_GENERIC_TO_AS follows them in the output.
Output function:
define ptx_kernel void @k(ptr addrspace(101) align 8 %s.param) {
entry:
%f = call double @llvm.nvvm.ldparam.f64(ptr addrspace(101) %s.param, i64 0)
%b = call i8 @llvm.nvvm.ldparam.i8 (ptr addrspace(101) %s.param, i64 8)
%a3 = call i32 @llvm.nvvm.ldparam.i32(ptr addrspace(101) %s.param, i64 24)
...
}
The i8 keeps its offset of 8; the three padding bytes that preserved [4 x i32] alignment are never named because nothing in the original IR referenced them. Field 2, element 3 of [4 x i32] lands at offset 12 + 3·4 = 24. The struct's natural alignment (8) survives onto the .param pointer so the loads can use the wide LDPARAM variants without per-field alignment fix-ups.
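The offsets in this example can be reproduced with a few lines of layout arithmetic. This is a sketch of the usual C-style struct layout rule (align each field, then pad the total to the struct alignment), not the tool's own code. Field, Layout, align_to, and layout_struct are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Field {
    std::uint64_t size;   // bytes
    std::uint64_t align;  // bytes, power of two
};

struct Layout {
    std::vector<std::uint64_t> offsets;
    std::uint64_t size;   // padded allocation size
    std::uint64_t align;  // maximum field alignment
};

inline std::uint64_t align_to(std::uint64_t value, std::uint64_t align) {
    return (value + align - 1) / align * align;
}

inline Layout layout_struct(const std::vector<Field>& fields) {
    Layout out{{}, 0, 1};
    for (const Field& f : fields) {
        out.size = align_to(out.size, f.align);  // insert padding before the field
        out.offsets.push_back(out.size);
        out.size += f.size;
        if (f.align > out.align) out.align = f.align;
    }
    out.size = align_to(out.size, out.align);    // tail padding
    return out;
}
```

For { double, i8, [4 x i32] } the fields are (8, 8), (1, 1), and (16, 4). The sketch yields offsets 0, 8, and 12, places element 3 of the array at 12 + 3·4 = 24, and rounds the 28 bytes of field data up to a padded size of 32.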
Worked Example: Cast-Only Fast Path
When no field of the byval struct is ever loaded — only the struct's address flows out, typically into a callee that expects a generic or global pointer — the materializer skips field-level rewriting entirely. A single addrspacecast from parameter space to the consumer's expected space replaces the byval indirection.
Input function: the byval address flows directly into a generic-pointer callee.
declare void @consume(ptr %p)
define ptx_kernel void @k(%S byval(%S) align 8 %s) {
entry:
call void @consume(ptr %s)
ret void
}
The walker visits the single call-site use of %s and notes that the consumer takes a generic (addrspace(0)) pointer. Rather than materializing a scalar load chain, the materializer emits a CVT_PARAM_TO_GENERIC (opcode 49) at the call site and rewires the operand:
define ptx_kernel void @k(ptr addrspace(101) align 8 %s.param) {
entry:
%s.gen = call ptr @llvm.nvvm.cvt.param.to.generic(ptr addrspace(101) %s.param)
call void @consume(ptr %s.gen)
ret void
}
If @consume had taken a ptr addrspace(1) argument instead, the materializer would emit CVT_PARAM_TO_GLOBAL (opcode 50), the parameter-to-global cast, and feed that to the call. Either way the entire body of lower_struct_args collapses to a single address-space cast: no GEP rewriting, no per-field loads, no padding arithmetic. This is the cheapest shape the pass can produce and the one the rewriter actively prefers when the use graph permits.
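The choice between the two cast opcodes reduces to a switch on the consumer's address space. The sketch below uses the opcode numbers from the MI opcode table. The function name cast_only_opcode and the empty result for any other address space are assumptions.

```cpp
#include <cassert>
#include <optional>

constexpr unsigned kGenericAS = 0;
constexpr unsigned kGlobalAS = 1;
constexpr unsigned CVT_PARAM_TO_GENERIC = 49;
constexpr unsigned CVT_PARAM_TO_GLOBAL = 50;

// Picks the cast opcode for the cast-only path from the consumer's address space.
inline std::optional<unsigned> cast_only_opcode(unsigned consumer_as) {
    switch (consumer_as) {
        case kGenericAS: return CVT_PARAM_TO_GENERIC;
        case kGlobalAS:  return CVT_PARAM_TO_GLOBAL;
        default:         return std::nullopt;  // assumed: no cast-only form exists
    }
}
```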
Nested Aggregates
Nested aggregates use the same materializer with one extra step. A GEP of the form getelementptr %Outer, ptr %s, i32 0, i32 i, i32 j, ... is folded to a single byte offset by composing per-level offset queries from outermost to innermost: StructLayout::getElementOffset (reached through DataLayout::getStructLayout) for struct levels, and index times element size for array levels:
uint64_t composite_offset(StructType *outer, ArrayRef<unsigned> path) {
uint64_t off = 0;
Type *t = outer;
for (unsigned idx : path) {
if (auto *st = dyn_cast<StructType>(t)) {
off += layout.struct_field_offset(st, idx);
t = st->getElementType(idx);
} else if (auto *at = dyn_cast<ArrayType>(t)) {
off += idx * layout.size_of(at->getElementType());
t = at->getElementType();
}
}
return off;
}
The walk is driven purely by the type, never by runtime SSA values: every level of nesting collapses to a single integer offset added to the base of the parameter-space pointer. Per-field alignment is whatever the DataLayout says for the leaf type, since the original byval struct's alignment is at least the maximum field alignment by construction.
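A self-contained model of the composite-offset walk follows, assuming a toy type tree in place of LLVM types. Type, scalar, array_of, and struct_of are hypothetical helpers. The path excludes the leading i32 0 of the GEP, which only selects the pointed-to object.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Type {
    enum Kind { Scalar, Struct, Array };
    Kind kind = Scalar;
    std::uint64_t size = 0;              // allocation size in bytes
    std::uint64_t align = 1;
    std::vector<Type> children;          // struct fields, or the single array element type
    std::vector<std::uint64_t> offsets;  // struct field offsets
};

inline std::uint64_t align_to(std::uint64_t v, std::uint64_t a) {
    return (v + a - 1) / a * a;
}

inline Type scalar(std::uint64_t bytes) {
    Type t;
    t.size = bytes;
    t.align = bytes;
    return t;
}

inline Type array_of(const Type& elem, std::uint64_t count) {
    Type t;
    t.kind = Type::Array;
    t.size = elem.size * count;
    t.align = elem.align;
    t.children.push_back(elem);
    return t;
}

inline Type struct_of(const std::vector<Type>& fields) {
    Type t;
    t.kind = Type::Struct;
    for (const Type& f : fields) {
        t.size = align_to(t.size, f.align);
        t.offsets.push_back(t.size);
        t.size += f.size;
        if (f.align > t.align) t.align = f.align;
    }
    t.size = align_to(t.size, t.align);  // tail padding
    t.children = fields;
    return t;
}

// Folds a GEP index path into one byte offset, outermost level first.
inline std::uint64_t composite_offset(const Type& outer,
                                      const std::vector<unsigned>& path) {
    std::uint64_t off = 0;
    const Type* t = &outer;
    for (unsigned idx : path) {
        if (t->kind == Type::Struct) {
            off += t->offsets.at(idx);
            t = &t->children.at(idx);
        } else if (t->kind == Type::Array) {
            off += idx * t->children.at(0).size;
            t = &t->children.at(0);
        } else {
            break;  // scalar leaf: nothing left to index
        }
    }
    return off;
}
```

Wrapping the earlier example struct as the second field of an outer struct whose first field is a 4-byte scalar places it at offset 8, so the path {1, 2, 3} folds to 8 + 12 + 12 = 32.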
Shared Enable Flag
The pass is gated by a single boolean (opt-byval in cl::opt terms) that MemorySpaceOpt consults at the same offset in the same .bss slot. Both passes have to see the same value, and the reason is concrete:
- When the flag is 1, this pass rewrites byval struct arguments to parameter-space pointers plus scalar LDPARAM loads. MemorySpaceOpt then seeds its address-space lattice on those parameter-space pointers, folds the resulting CVT_PARAM_TO_GENERIC / CVT_PARAM_TO_GLOBAL casts, and lets the verifier see a clean parameter-space-aware ABI.
- When the flag is 0, this pass returns immediately and the byval calling convention is preserved verbatim. MemorySpaceOpt then has to treat byval arguments as generic and refrain from folding the casts.
A mismatched configuration — this pass disabled but MemorySpaceOpt still seeding AS_PARAM — produces parameter-space pointers MemorySpaceOpt cannot classify because the rewrite never ran. The NVVMIRVerifier rejects the function later with a "pointer-to-local-or-generic launch argument" diagnostic, and the failure surface is far from the actual misconfiguration. Both passes therefore read the same byte and a reimplementation must keep them in lockstep.
⚡ QUIRK: opt-byval is a shared flag, and the failure surface is remote
LowerStructArgs and MemorySpaceOpt read the same .bss byte. Toggling the flag in one pass without the other still type-checks, still passes the early verifier, and even runs successfully on small kernels. The mismatch surfaces only at NVVMIRVerifier time as a "pointer-to-local-or-generic launch argument" diagnostic that points at the kernel signature, not at the configuration that produced the inconsistent IR. Reimplementations must wire the flag through both passes from the same source or accept a debugging trail with no obvious connection to the root cause.
Cross-References
MemorySpaceOpt consumes the parameter-space pointers and CVT_PARAM_TO_* casts this pass emits. Parameter-Space Sizer accumulates parameter-space byte counts against the per-SM ceiling using the byval-aware parameter list this pass leaves behind. Modulo Scheduler and Rau-Style Placement is the eventual consumer of the LDPARAM MIs in TileAS loops.
Memory Space Optimization and Restrict Processing
Abstract
This cluster prepares pointer provenance for NVPTX codegen. It specializes generic-pointer callees whose callers consistently pass concrete address spaces, rewrites provable generic pointers inside each function, and translates __restrict__ into alias metadata. The payoff is both correctness and quality — the backend gets to emit direct global, shared, constant, local, tensor-memory, or distributed-shared operations instead of dragging generic conversions through the pipeline.
For an overview of GPU memory spaces and how they appear at each compilation stage, see Memory Hierarchy and Data Flow.
Ordering is deliberate. Inter-procedural specialization runs first, function-local memory-space optimization second, and restrict processing last, once pointer forms have become more concrete.
Address-Space Lattice
Every pointer-typed SSA value carries an inferred address space. The lattice has three levels and finite height: BOTTOM at the unknown root, GENERIC at the conflicting top, and one concrete element per NVPTX address space sitting between them.
                       GENERIC (top)
                             ↑
   ┌────────┬─────────┬──────┴──┬────────┬──────────────┐
   │        │         │         │        │              │
 GLOBAL   SHARED  CONSTANT    LOCAL    TENSOR  DISTRIBUTED_SHARED
 (AS=1)   (AS=3)   (AS=4)     (AS=5)   (AS=6)
   │        │         │         │        │              │
   └────────┴─────────┴──────┬──┴────────┴──────────────┘
                             ↑
                    BOTTOM (⊥, unknown)
The ordering captures BOTTOM ⊑ GLOBAL ⊑ GENERIC, BOTTOM ⊑ SHARED ⊑ GENERIC, BOTTOM ⊑ LOCAL ⊑ GENERIC, BOTTOM ⊑ CONSTANT ⊑ GENERIC, BOTTOM ⊑ TENSOR ⊑ GENERIC, and BOTTOM ⊑ DISTRIBUTED_SHARED ⊑ GENERIC. The combining operator, which the pass calls meet, is the least upper bound under this ordering (a join in textbook lattice terms): combining two distinct concrete spaces yields GENERIC, and combining any element with BOTTOM yields the other element.
AddressSpace meet(AddressSpace current, AddressSpace observed) {
if (observed == AS_BOTTOM) return current;
if (current == AS_BOTTOM) return observed;
if (current == observed) return current;
return AS_GENERIC;
}
Termination
The lattice has finite height: every chain BOTTOM ⊏ AS ⊏ GENERIC has length two, so any SSA value can be refined at most twice (BOTTOM → AS → GENERIC). The worklist iterates until no value's tag changes; with n pointer-typed values, the iteration count is bounded by 2n regardless of CFG shape. The propagation is therefore a standard finite-height-lattice fixpoint and terminates without an explicit budget.
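The two-refinement bound can be observed directly with a compilable version of the combining function. The enum AS and the helper count_tag_changes are hypothetical names. The meet body follows the pseudocode shown earlier on this page.

```cpp
#include <cassert>
#include <vector>

enum class AS { Bottom, Global, Shared, Constant, Local, Tensor,
                DistributedShared, Generic };

inline AS meet(AS current, AS observed) {
    if (observed == AS::Bottom) return current;
    if (current == AS::Bottom) return observed;
    if (current == observed) return current;
    return AS::Generic;  // two distinct concrete spaces conflict
}

// Folds a stream of observations into one tag and counts how often the tag moves.
inline int count_tag_changes(const std::vector<AS>& observations) {
    AS tag = AS::Bottom;
    int changes = 0;
    for (AS observed : observations) {
        AS next = meet(tag, observed);
        if (next != tag) {
            tag = next;
            ++changes;
        }
    }
    return changes;
}
```

However long the observation stream is, the count never exceeds two: once for BOTTOM to a concrete space and once for the concrete space to GENERIC.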
| Element | Meaning |
|---|---|
| BOTTOM | No useful evidence yet. |
| GLOBAL | Device global memory (addrspace(1)). |
| SHARED | CTA-local shared memory (addrspace(3)). |
| CONSTANT | Constant memory or grid-constant parameters (addrspace(4)). |
| LOCAL | Per-thread local memory (addrspace(5)). |
| TENSOR | Blackwell tensor memory. |
| DISTRIBUTED_SHARED | Cluster-wide shared memory. |
| GENERIC | Conflicting or unknown-at-boundary provenance (addrspace(0)). |
Tensor-memory and distributed-shared spaces are first-class elements of the lattice. Folding them into ordinary generic memory would keep unnecessary cvta conversions in precisely the code that needs the most accurate state-space lowering.
Callee Specialization
The inter-procedural part of MemorySpaceOpt hunts for helper functions with generic pointer parameters. When every call site to a helper passes one parameter in the same concrete address space, the pass clones the helper with a specialized signature and retargets matching calls.
void specialize_generic_pointer_callees(Module module, int clone_budget) {
WorkList work = collect_candidate_helpers(module);
while (!work.empty()) {
Function fn = work.pop();
VoteVector votes = collect_argument_votes(fn);
if (!has_specializable_vote(votes)) {
continue;
}
if (clone_budget_exceeded(fn, clone_budget)) {
continue;
}
Function clone = clone_function(fn);
rewrite_pointer_argument_spaces(clone, votes);
mark_internal_alwaysinline(clone);
for (CallBase call : users_of(fn)) {
if (call_arguments_match_votes(call, votes)) {
retarget_call(call, clone);
work.push(caller_function(call));
}
}
}
}
Termination follows from the lattice argument earlier in this page. Each argument vote can be refined at most twice (BOTTOM → concrete → GENERIC), the iteration count is bounded by 2 × (helpers × pointer parameters), and the clone budget bounds the worst case for recursive helper families without weakening the lattice.
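Vote collection can be sketched as a per-parameter fold over call sites. The representation of a call site as a vector of address spaces is an assumption. collect_votes is a simplified stand-in for collect_argument_votes in the pseudocode above, and the reduced AS enum is for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class AS { Bottom, Global, Shared, Local, Generic };

inline AS meet(AS current, AS observed) {
    if (observed == AS::Bottom) return current;
    if (current == AS::Bottom) return observed;
    if (current == observed) return current;
    return AS::Generic;
}

using VoteVector = std::vector<AS>;

// call_sites[i][p] is the address space of pointer argument p at call site i.
inline VoteVector collect_votes(const std::vector<std::vector<AS>>& call_sites,
                                std::size_t num_params) {
    VoteVector votes(num_params, AS::Bottom);
    for (const std::vector<AS>& site : call_sites) {
        for (std::size_t p = 0; p < num_params && p < site.size(); ++p) {
            votes[p] = meet(votes[p], site[p]);
        }
    }
    return votes;
}

// A helper is worth cloning when at least one parameter settled on a concrete space.
inline bool has_specializable_vote(const VoteVector& votes) {
    for (AS v : votes) {
        if (v != AS::Bottom && v != AS::Generic) return true;
    }
    return false;
}
```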
MemorySpaceOpt
Input and Output IR Shape
MemorySpaceOpt is an inter-procedural address-space inference pass. It consumes LLVM IR where pointer-typed values may carry the generic address space addrspace(0) and produces the same IR with as many of those values as possible retagged to their concrete address space. The pass runs after LowerStructArgs (which promotes byval struct parameters into explicit pointer arguments) and before ProcessRestrict (which attaches alias scopes).
; before: generic-pointer chains
define void @child(ptr %p, i32 %i) {
%a = getelementptr i32, ptr %p, i32 %i
%v = load i32, ptr %a, align 4
store i32 %v, ptr %a, align 4
ret void
}
define void @kernel(ptr addrspace(1) %g, i32 %i) {
%cast = addrspacecast ptr addrspace(1) %g to ptr
call void @child(ptr %cast, i32 %i)
ret void
}
; after: callee specialized, generic cast and inner GEP retagged
define internal void @child.global(ptr addrspace(1) %p, i32 %i) alwaysinline {
%a = getelementptr i32, ptr addrspace(1) %p, i32 %i
%v = load i32, ptr addrspace(1) %a, align 4
store i32 %v, ptr addrspace(1) %a, align 4
ret void
}
define void @kernel(ptr addrspace(1) %g, i32 %i) {
call void @child.global(ptr addrspace(1) %g, i32 %i)
ret void
}
Matching Predicate
The pass propagates AS tags through the SSA def-use graph and through calls. A pointer value v reaches the concrete tag AS iff every reaching definition of v (via GEP, bitcast, PHI, select, addrspacecast, parameter binding) refines to AS under the lattice meet. The pass is intra-procedural for stores, GEPs, and PHIs, but inter-procedural across calls and returns: when every call site to a function passes pointers in the same concrete AS, the callee gets cloned with refined parameter types and matching calls are redirected to the clone.
Lattice Propagation Walker
The walker seeds the lattice from kernel-argument attributes and propagates through every pointer-defining opcode until fixpoint.
void memspace_walker(Function fn, Lattice *lat) {
for (Argument arg : fn.arguments) {
if (has_attr(arg, KERNEL_POINTER)) {
lat_seed(lat, arg, AS_GLOBAL);
} else if (has_attr(arg, GRID_CONSTANT)) {
lat_seed(lat, arg, AS_CONSTANT);
} else if (has_attr(arg, NVVM_BYVAL)) {
lat_seed(lat, arg, AS_GENERIC); // needs cast at every deref
}
}
while (lat_has_changes(lat)) {
for (Instruction inst : fn.pointer_instructions) {
switch (inst.opcode) {
case GEP:
case BITCAST:
lat_propagate(lat, inst, lat_get(lat, inst.operand[0]));
break;
case SELECT:
case PHI:
lat_propagate(lat, inst, lat_meet_all_incoming(lat, inst));
break;
case ADDR_SPACE_CAST:
lat_propagate(lat, inst, inst.target_as); // forces dst AS
break;
case CALL:
lat_propagate_call_args(lat, inst); // backward to caller
lat_propagate_call_ret(lat, inst); // forward from callee
break;
case LOAD:
case STORE:
case ATOMIC:
lat_consume(lat, inst.pointer_operand);
break;
}
}
}
}
The meet function defined earlier is the combining operator at PHI and Select fan-in. AddrSpaceCast is the only opcode that does not inherit from its operand: it forces the destination AS regardless of the source value's tag. Kernel-argument-derived pointers reach load/store/atomic sites already concrete; only values that cross a true generic boundary (a byval pointer, an opaque external return, an unhandled intrinsic) fail to reach a concrete tag.
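The fixpoint loop can be exercised on a toy value graph. Node, Op, and propagate are hypothetical names. The sketch covers only seeds, GEP, PHI, and casts, leaving out calls and memory operations, and it uses a reduced AS enum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class AS { Bottom, Global, Shared, Generic };

inline AS meet(AS current, AS observed) {
    if (observed == AS::Bottom) return current;
    if (current == AS::Bottom) return observed;
    if (current == observed) return current;
    return AS::Generic;
}

enum class Op { Seed, Gep, Phi, Cast };

struct Node {
    Op op;
    std::vector<int> operands;  // indices of operand nodes
    AS fixed;                   // seed tag for Seed, target space for Cast
};

// Iterates until no tag changes; tags only ever move up the lattice.
inline std::vector<AS> propagate(const std::vector<Node>& g) {
    std::vector<AS> tag(g.size(), AS::Bottom);
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < g.size(); ++i) {
            AS next = tag[i];
            switch (g[i].op) {
                case Op::Seed:
                case Op::Cast:
                    next = g[i].fixed;  // forced, independent of any operand
                    break;
                case Op::Gep:
                    next = meet(next, tag[g[i].operands[0]]);
                    break;
                case Op::Phi:
                    for (int o : g[i].operands) next = meet(next, tag[o]);
                    break;
            }
            if (next != tag[i]) {
                tag[i] = next;
                changed = true;
            }
        }
    }
    return tag;
}
```

A PHI that merges a global chain with a shared chain ends at GENERIC, while a cast placed after it pins its own result to the cast's target space.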
Diagnostic Rewriter and Catalog
Once the walker reaches a fixed point, the rewriter visits every pointer-typed instruction, attaches the inferred AS, and emits a diagnostic for every pointer that did not converge. Three diagnostic strings reach the user:
| Site | Diagnostic | Severity |
|---|---|---|
| atomic op on a pointer inferred as LOCAL | Cannot do atomic on local memory | error |
| dereference of a still-unconcrete pointer | assuming global memory space | warning |
| value remains at lattice BOTTOM | Cannot tell what pointer points to | warning |
The first fires before instruction selection and stops the backend from emitting a local-memory atomic the architecture doesn't support. The second is the fallback when no kernel-argument seed reached the dereference: the rewriter assumes global and continues. The third is the soft failure: the pointer never acquires a concrete tag, so it is emitted as a generic pointer and a later cvta survives into PTX, with a corresponding cost at runtime.
WMMA Forces Global
A specific family of intrinsics — the WMMA load/store and the related async-bulk family — is defined only against global memory. Their operand pointers cannot reach codegen as generic without a stall, so the pass treats them as a backward constraint: at a WMMA call site the operand pointer's lattice tag is meet-refined with GLOBAL, and that refinement propagates back through the def-use graph until it either hits a kernel argument (which is already global) or contradicts another concrete tag (which becomes GENERIC and surfaces as the warning above).
void apply_wmma_constraint(Lattice *lat, Instruction inst) {
if (is_wmma_pointer_intrinsic(inst)) {
Value *p = inst.pointer_operand;
lat_refine_backward(lat, p, AS_GLOBAL); // propagate AS_GLOBAL up the chain
}
}
The refinement is one-directional: WMMA forces GLOBAL on the operand chain but never demotes a value that already proved itself in a different concrete space. A contradicting tag terminates with the lattice top (GENERIC) and the user sees the WMMA failure as a verifier diagnostic.
Addrspacecast Folder
NVPTX frontends emit a number of addrspacecast instructions that become no-ops once the walker has refined the source pointer. The folder applies three rewriting rules at fixpoint:
Value fold_addrspace_cast(Lattice *lat, Instruction cast) {
Value src = cast.operand[0];
AddressSpace dst = cast.target_as;
// (1) cast to the AS the operand already has -> drop the cast.
if (lat_get(lat, src) == dst) {
return src;
}
// (2) cast of a cast -> collapse to a single cast with the outer target.
if (src.opcode == ADDR_SPACE_CAST && src.target_as != dst) {
return make_addrspace_cast(src.operand[0], dst);
}
// (3) kernel pointer arg already global -> drop the cast to global.
if (has_attr(src, KERNEL_POINTER) && dst == AS_GLOBAL) {
return src;
}
return cast;
}
Rule (1) is the common case once the walker has tagged cvta.to.global results; rule (2) collapses the back-to-back casts the frontend emits when a generic pointer is briefly routed through a cast and immediately cast again; rule (3) handles the canonical cast(KERNEL_PTR, GLOBAL) shape produced by source code that re-asserts the argument's known space.
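The three rules can be tried on a toy value table. Value, Folded, and fold_cast are hypothetical names, and the result is reported as a small descriptor instead of a rewritten instruction.

```cpp
#include <cassert>
#include <vector>

enum class AS { Generic, Global, Shared };

struct Value {
    bool is_cast = false;
    int src = -1;                 // operand index when is_cast is true
    AS target = AS::Generic;      // destination space when is_cast is true
    bool kernel_pointer = false;  // kernel pointer argument
};

struct Folded {
    enum Kind { UseSource, NewCast, Keep };
    Kind kind;
    int source;  // value to use directly, or operand of the replacement cast
    AS target;
};

inline Folded fold_cast(const std::vector<Value>& vals,
                        const std::vector<AS>& tags, int cast) {
    const Value& c = vals[cast];
    const Value& s = vals[c.src];
    // (1) the operand already has the destination space: drop the cast.
    if (tags[c.src] == c.target) return {Folded::UseSource, c.src, c.target};
    // (2) cast of a cast: one cast from the inner operand to the outer target.
    if (s.is_cast && s.target != c.target) return {Folded::NewCast, s.src, c.target};
    // (3) kernel pointer argument cast to global: drop the cast.
    if (s.kernel_pointer && c.target == AS::Global) {
        return {Folded::UseSource, c.src, c.target};
    }
    return {Folded::Keep, cast, c.target};
}
```

Running the folder again on the cast produced by rule (2) lets rule (1) remove it when the inner operand already carries the target space.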
Tunables
Four cl::opt knobs configure the pass:
| Knob | Default | Meaning |
|---|---|---|
| memspace-opt-enable | 1 | Master enable. |
| memspace-opt-verbose | 0 | Emit verbose lattice trace. |
| memspace-opt-conservative | 0 | Force conservative inference: treat unknowns as GENERIC on first contact. |
| memspace-opt-alias-merge | enum | Alias-set merging policy (none / per-AS / unified). |
The conservative-inference flag short-circuits the lattice the first time a value fails to acquire a concrete AS, which makes diff-style comparisons against an older toolchain easier to read during regression triage.
Failure Modes
- Generic survives to atomic-on-LOCAL. A backward propagation that reaches addrspace(5) on an atomic pointer is a hard error; the pass refuses to lower it.
- WMMA constraint contradiction. A WMMA operand reached through a shared-memory chain is unrepresentable; the verifier rejects the kernel.
- Cast folder loop. Without rule (2)'s outer-target preservation, two back-to-back casts could ping-pong forever. The folder applies rules in order and only fires once per cast per round.
Restrict Processing
Input and Output IR Shape
ProcessRestrict turns frontend __restrict__ intent into LLVM noalias attributes on pointer arguments and into nvvm.restrict_* metadata on every load and store reached from a restricted root. It runs after MemorySpaceOpt because the propagation worker consults the inferred address-space tag when deciding whether a derived pointer is global; the reverse order would over-restrict shared and local pointers and degrade downstream alias analysis. The output feeds the NVPTX alias-analysis pipeline and ultimately reaches the backend scheduler as a noalias guarantee.
; before
define void @k(ptr addrspace(1) %a, ptr addrspace(1) %b)
"user_specified_restrict_scope"="1"
"user_specified_restrict_keyword"="__restrict__" {
%va = load i32, ptr addrspace(1) %a, align 4
%vb = load i32, ptr addrspace(1) %b, align 4
store i32 %va, ptr addrspace(1) %b, align 4
ret void
}
; after
define void @k(ptr addrspace(1) noalias %a, ptr addrspace(1) noalias %b)
"nvvm.restrict_processed"
"nvvm.restrict_scope"="1"
"nvvm.restrict_keyword"="__restrict__" {
%va = load i32, ptr addrspace(1) %a, align 4, !alias.scope !0, !noalias !1
%vb = load i32, ptr addrspace(1) %b, align 4, !alias.scope !1, !noalias !0
store i32 %va, ptr addrspace(1) %b, align 4, !alias.scope !1, !noalias !0
ret void
}
!0 = !{!2}
!1 = !{!3}
!2 = !{!"restrict.scope.1.a", !4}
!3 = !{!"restrict.scope.1.b", !4}
!4 = !{!"restrict.domain.k"}
The transformation has two effects. The argument attributes change from the front-end's user_specified_* form to the canonical nvvm.restrict_* form and gain a real noalias attribute that LLVM's alias analysis honors directly. Every load and store derived from a restricted argument acquires !alias.scope and !noalias metadata pointing at the per-argument scope, which is what teaches the backend scheduler that the two pointer families never overlap.
Matching Predicate
A pointer argument is in scope iff it carries a user_specified_restrict_scope annotation or one of the legacy spellings the front-end emits. The propagation predicate, applied at every def-use edge, is: an SSA value is derived from a restricted root iff every reaching definition arrives through GEP, bitcast, or address-space cast from a restricted root (or from another value already proved derived). PHI and select join multiple roots and produce a derived-from-multiple-scopes value, which receives the meet of the contributing scopes in its alias-scope metadata.
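The propagation predicate can be sketched as a small recursive walk. This is an illustration under assumptions, not the pass's real data structures: `Node`, `Kind`, and `scopesOf` are hypothetical names, the graph is assumed acyclic (a loop-carried PHI would need a visited set), and the join at a PHI or select is modelled as the union of the contributing scope IDs.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical value kinds; only the ones the predicate distinguishes.
enum class Kind { RestrictArg, PlainArg, Gep, Bitcast, AddrSpaceCast, Phi, Select, Other };

struct Node {
    Kind kind;
    unsigned scopeId = 0;               // meaningful only for RestrictArg
    std::vector<const Node*> operands;  // pointer operands only
};

// Returns the restrict scopes a value is derived from; empty means "not derived".
std::set<unsigned> scopesOf(const Node& n) {
    switch (n.kind) {
    case Kind::RestrictArg:
        return {n.scopeId};
    case Kind::Gep:
    case Kind::Bitcast:
    case Kind::AddrSpaceCast:
        assert(n.operands.size() == 1 && "one pointer operand expected");
        return scopesOf(*n.operands[0]);
    case Kind::Phi:
    case Kind::Select: {
        std::set<unsigned> joined;
        for (const Node* op : n.operands) {
            std::set<unsigned> s = scopesOf(*op);
            if (s.empty()) return {};   // every reaching definition must be derived
            joined.insert(s.begin(), s.end());
        }
        return joined;
    }
    default:
        return {};                      // loads, calls, plain arguments, ...
    }
}
```

A GEP off a restricted argument inherits that argument's single scope, while a PHI that mixes a restricted and an unrestricted input comes back empty and is therefore left unstamped.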
Attribute Keys: Frontend vs Canonical
The pass operates over six attribute keys split between the front-end's annotation form and the canonical form alias analysis actually reads.
| Front-end key | Canonical key | Purpose |
|---|---|---|
| user_specified_restrict_scope | nvvm.restrict_scope | Per-pointer scope ID. |
| user_specified_restrict_keyword | nvvm.restrict_keyword | Keyword form (restrict vs __restrict__) preserved for diagnostics. |
| user_specified_restrict_processed | nvvm.restrict_processed | Function-level idempotency marker; prevents re-entry. |
The user_specified_* variants are the frontend's deposit; the nvvm.restrict_* variants are this pass's canonical form. The frontend keys stay on the IR after canonicalization because later diagnostic passes need the original keyword spelling for source-line error messages, while alias analysis reads only the canonical form. A reimplementation that drops the frontend keys after canonicalization compiles correctly but loses the original spelling in diagnostics.
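The keep-both-forms behaviour can be sketched over a plain string map. The `AttrMap` alias and `canonicalizeKeys` function are hypothetical stand-ins for the real attribute storage; the only behaviour taken from the text is that each frontend key gains a canonical twin and the frontend key itself is not removed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

using AttrMap = std::map<std::string, std::string>;  // hypothetical attribute store

// Copy each user_specified_restrict_* entry to its nvvm.restrict_* twin,
// leaving the frontend entry in place for later diagnostics.
void canonicalizeKeys(AttrMap& attrs) {
    const std::string frontendPrefix = "user_specified_restrict_";
    const std::string canonicalPrefix = "nvvm.restrict_";
    const std::size_t before = attrs.size();

    AttrMap added;
    for (const auto& [key, value] : attrs) {
        if (key.compare(0, frontendPrefix.size(), frontendPrefix) == 0) {
            added[canonicalPrefix + key.substr(frontendPrefix.size())] = value;
        }
    }
    // insert() keeps any canonical key that already exists.
    attrs.insert(added.begin(), added.end());
    assert(attrs.size() >= before && "frontend keys are never dropped");
}
```

Collecting the new entries in a side map avoids inserting into the container while iterating over it.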
The diagnostic "function contains restrict keyword in struct" fires when a restrict-qualified pointer is found inside a struct field; the default policy rejects that shape because the pass does not propagate restrict through aggregate loads.
Tunables
Four cl::opt knobs control the pass:
| Knob | Default | Meaning |
|---|---|---|
| process-restrict | 1 | Master enable. |
| allow-restrict-in-struct | 0 | Permit struct-field restrict; otherwise emit the diagnostic above. |
| apply-multi-level-restrict | 0 | Walk through two or more levels of pointer indirection. |
| dump-process-restrict | 0 | Print before/after IR for debugging. |
The default policy is conservative: only direct-argument restrict propagates, struct-field restrict gets rejected with a diagnostic, and multi-level indirection is left alone. The two opt-in knobs exist for code bases that rely on more aggressive aliasing assumptions; the dump knob is strictly a debugging aid.
Algorithm
LogicalResult processRestrict(Function *F) {
if (F->hasAttr("nvvm.restrict_processed")) {
return success();
}
for (Argument &arg : F->args()) {
if (!hasRestrictAnnotation(arg)) {
continue;
}
uint32_t scopeId = nextScopeId();
canonicalize_keys(arg); // user_specified_* -> nvvm.restrict_*
attachNoaliasAttr(arg);
propagateRestrict(arg, scopeId);
}
F->setAttr("nvvm.restrict_processed", UnitAttr::get(ctx));
return success();
}
void propagateRestrict(Value *root, uint32_t scopeId) {
WorkList wl({root});
while (!wl.empty()) {
Value *v = wl.pop();
attachAttr(v, "nvvm.restrict_scope", IntAttr(scopeId));
for (User *u : v->users()) {
if (isPointerArithmetic(u)) {
wl.push(u);
}
if (isLoadOrStore(u)) {
attachAliasScopeMD(u, scopeId);
}
if (isPointerLoad(u) && apply_multi_level_restrict) {
wl.push(u); // recurse through T**
}
}
}
}
The worker's entry path leads with the per-function idempotency check. The nvvm.restrict_processed attribute prevents accidental re-entry when a later pass clones or specializes a function and runs the cluster again, and it gives the rest of the pipeline a cheap way to ask whether restrict metadata is already canonicalized. The worklist is a flat traversal of the def-use graph: pointer-arithmetic users stay in the frontier and load/store users terminate it with a metadata stamp. apply-multi-level-restrict gates the only place where the walker is allowed to recurse through a pointer-of-pointer load.
Failure Modes
- Struct-field restrict. With allow-restrict-in-struct off, the pass refuses to propagate from a restrict-qualified pointer reached through a struct member load and emits the diagnostic. Flipping the knob enables the propagation but the user accepts responsibility for the relaxed aliasing.
- PHI of two restricted roots. The two scopes do not meet to a single canonical scope; the PHI inherits both as a noalias set, which is correct but weakens the alias analysis at the merge.
- Re-entry without canonical key. A pass that clones a function and copies user_specified_* but not nvvm.restrict_processed will re-enter and emit a fresh scope, breaking the alias analysis cache. The idempotency attribute exists precisely to make the re-entry detectable.
Restrict metadata is not a proof of address space. It is a noalias relation among pointer families. MemorySpaceOpt and Restrict Processing cooperate but do not replace each other — the former tells the backend which state space to use, the latter tells alias analysis which pointer pairs cannot overlap.
Operational Knobs
These passes expose useful controls in debugging and testing builds:
| Knob family | Purpose |
|---|---|
| Clone budget | Bounds inter-procedural specialization on recursive or template-heavy helper graphs. |
| Dump memory-space propagation | Prints specialization decisions and affected callees. |
| Process restrict enable | Allows disabling restrict metadata generation for differential testing. |
| Propagate-only restrict mode | Reapplies already-stamped scopes after another pass creates new derived values. |
| Multi-level restrict mode | Follows T** and deeper pointer chains when frontend metadata requested it. |
Cross-References
LowerStructArgs Rewrite Shape produces the parameter-space pointers and CVT_PARAM_TO_* casts this pass consumes. Address-space vote lattice covers the inter-procedural worker that drives callee specialization. NVPTX Backend Passes Overview — Shared parameter-space enable flag documents the byval-AS flag both this pass and LowerStructArgs read. Parameter-Space Sizer is the downstream consumer that turns inferred address-space tags into the launch-argument check.
Printf Lowering and the vprintf ABI
Abstract
VprintfLowering rewrites every CUDA-side printf(...) call into the device-runtime intrinsic vprintf(fmt, buffer). The format string stays a constant-address-space pointer, the variadic tail packs into a contiguous per-thread local buffer, and the high-level call becomes a direct call to the runtime symbol. The pass is a flat scan: visit each call printf(...), dispatch on a single op-tag byte attached to the call's argument-packing block, and emit the lowered form for that tag. No inter-procedural analysis, no varargs reasoning beyond what the tag already encodes.
Input and Output Shape
The pass consumes one IR opcode and emits one runtime call plus optional packing ops. The shape of the rewrite, for the varargs tag, is:
input : %r = call @printf(%fmt, %a, %b, %c, ...) ; fmt in addrspace(4)
output : %buf = alloca %vprintfBuffer.local : [N x i8]
store %a, %buf+off_a
store %b, %buf+off_b
store %c, %buf+off_c
%r = call @vprintf(%fmt, %buf) ; i32 result
For the bare-format tag the alloca and stores are absent and the buffer argument is nullptr. For the pre-packed tag the caller has already produced %buf and the pass forwards it verbatim.
Rewriter Dispatch
The rewriter walks every call printf(...) in the current function. For each call it reads the op-tag byte at offset 0 of the call's argument-packing block and dispatches on the value. Three tag bytes pass; anything else triggers a hard diagnostic.
| Tag | Form | Meaning |
|---|---|---|
| 40 | varargs | Standard printf(fmt, a, b, c, ...). Pack the args into a local buffer. |
| 34 | bare format | printf(fmt) with no variadic args. Skip packing; pass nullptr as buffer. |
| 85 | pre-packed buffer | Caller already packed args into a buffer; forward it. Used by CUB / Thrust internals. |
Any other tag emits "unsupported printf form (op-tag = N)", with the decimal tag value substituted for N. The string is emitted verbatim with no localization.
Buffer Allocation
Tag 40 emits a single alloca in the function's entry block sized to the sum of the per-operand slot sizes. The allocation is named %vprintfBuffer.local, and that name is the canonical fingerprint for vprintf-lowered functions across every CUDA version: stable, deterministic, and untouched by later NVPTX passes. Tag 34 skips the allocation entirely and feeds nullptr as the buffer argument. Tag 85 forwards the caller-supplied pointer and allocates nothing.
LogicalResult lower_printf(CallInst *call) {
uint8_t tag = call->arg_packing_block()[0];
switch (tag) {
case 40: {
Value *buf = alloca_packing_buffer(call); // %vprintfBuffer.local = alloca [N x i8]
pack_args_into(buf, call->args_from(1));
emit_vprintf(call->getArg(0), buf);
return success();
}
case 34:
emit_vprintf(call->getArg(0), /*buf=*/nullptr);
return success();
case 85:
emit_vprintf(call->getArg(0), call->getArg(1) /*pre-packed*/);
return success();
default:
return emit_diagnostic("unsupported printf form (op-tag = "
+ std::to_string(tag) + ")");
}
}
Buffer size N is the sum of the slot sizes for every variadic operand, in order, once each operand has been legalized to its ABI-stored type.
Runtime Symbol
The runtime intrinsic is vprintf(fmt: i8*, buf: i8*) -> i32. The original call printf(...) becomes a direct call @vprintf(fmt, buf), and every use of the printf result is replaced with the vprintf result. The declaration is materialized lazily the first time the rewriter needs it within a translation unit.
Format String Address Space
The fmt argument must be a constant-AS pointer. The rewriter probes getPointerAddressSpace(fmt) == 4 and rejects any other address space with the diagnostic "printf format string must be a constant address space pointer". This rules out format strings synthesized into generic, global, shared, or local memory and forces the front-end to materialize the literal in constant memory before lowering reaches it.
Slot Layout
Each operand in the argument-packing block contributes one 32-byte slot. The rewriter advances by exactly 32 bytes when iterating, regardless of the underlying operand size; oversized operands (anything that does not fit into a single slot's payload, including structs passed by pointer) occupy a single stride entry whose payload word holds a pointer to the larger value. Each slot header carries two fields the rewriter reads:
- The indirect-operand flag is bit 7 of the slot's tag byte. When set, the slot's payload is a pointer to the actual value rather than the value itself, and the rewriter materializes a load before packing. When clear, the slot's payload is used directly.
- The size of the actual operand drives how many bytes inside the slot are populated. The remainder is unspecified padding that vprintf ignores at the receiving end because the format string already encodes which operand sizes to expect.
Packing walks the args in source order, legalizes each one (float is promoted to double per the C variadic ABI, smaller integer types are widened to int), and writes it into %vprintfBuffer.local at the next slot offset. The final buffer size N is the offset after the last write.
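The packing walk can be sketched on the host as follows. This is a simplified model under assumptions: the `Arg` variant and `packArgs` are hypothetical names, only `int`, `float`, and `double` operands are covered, the indirect-operand flag is not modelled, and the padding is zero-filled for determinism even though the text calls it unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <variant>
#include <vector>

constexpr std::size_t kSlotStride = 32;  // one slot per operand, per the layout above

// Hypothetical operand model covering three scalar types only.
using Arg = std::variant<std::int32_t, float, double>;

// Pack operands in source order; float is promoted to double before the store.
std::vector<std::uint8_t> packArgs(const std::vector<Arg>& args) {
    std::vector<std::uint8_t> buf(args.size() * kSlotStride, 0);
    std::size_t offset = 0;
    for (const Arg& a : args) {
        if (const auto* i = std::get_if<std::int32_t>(&a)) {
            std::memcpy(buf.data() + offset, i, sizeof *i);
        } else {
            const double d = std::holds_alternative<float>(a)
                                 ? static_cast<double>(std::get<float>(a))
                                 : std::get<double>(a);
            std::memcpy(buf.data() + offset, &d, sizeof d);
        }
        offset += kSlotStride;           // fixed stride regardless of operand width
    }
    assert(offset == buf.size());        // N is the offset after the last write
    return buf;
}
```

Under this model a call with one `int` and one `float` operand yields a 64-byte buffer, with the integer at byte 0 and the promoted double at byte 32.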
Worked Example: printf("x=%d y=%f", i, f)
Take the canonical mixed-type case:
int i = 7;
float f = 3.5f;
printf("x=%d y=%f", i, f);
The front-end emits this as a call printf(...) with two variadic operands. The frontend has also placed the literal "x=%d y=%f" into a constant-AS-4 string global and attached a tag-40 packing block to the call.
After lowering, the buffer carries two slots, 32 bytes each, for a total N = 64:
| Slot | Offset | Tag byte | Payload | Notes |
|---|---|---|---|---|
| 0 | 0 | 0x00 | i32 7 written at byte 0, zero-padded to 32 bytes | %d consumes one slot. Bit 7 of tag clear: payload is the literal value. |
| 1 | 32 | 0x00 | f64 3.5 written at byte 32, zero-padded to 32 bytes | %f promotes the float to double per the C variadic ABI; the 8-byte payload sits at the slot base. |
If the call had passed a struct Pt { int x, y, z, w, u; } through %p instead of f, the slot tag's bit 7 would be set and the 8-byte payload would be a pointer to the struct rather than the struct contents. The 32-byte stride absorbs the size mismatch: every operand consumes exactly one slot regardless of width, and the indirect-pointer escape hatch handles anything that does not fit.
Output IR:
@.str = private addrspace(4) constant [10 x i8] c"x=%d y=%f\00"
define ptx_kernel void @k(i32 %i, float %f) {
entry:
%vprintfBuffer.local = alloca [64 x i8]
%slot0 = getelementptr [64 x i8], ptr %vprintfBuffer.local, i64 0, i64 0
store i32 %i, ptr %slot0
%slot1 = getelementptr [64 x i8], ptr %vprintfBuffer.local, i64 0, i64 32
%f.d = fpext float %f to double
store double %f.d, ptr %slot1
%fmt = getelementptr [10 x i8], ptr addrspace(4) @.str, i64 0, i64 0
%r = call i32 @vprintf(ptr addrspace(4) %fmt, ptr %vprintfBuffer.local)
ret void
}
The buffer name %vprintfBuffer.local is preserved verbatim across CUDA versions, and downstream tooling that parses --print-after dumps anchors on that name to find the lowered call.
Cross-References
MemorySpaceOpt classifies the %vprintfBuffer.local alloca as local space and propagates that tag onto every slot pointer. The Parameter-Space Sizer does not size this buffer against the per-SM ceiling — vprintf does not take its arguments through .param.
Dead Sync Elimination and Common Base Elimination
Abstract
Two NVPTX middle-end passes attack different forms of redundancy. Dead Sync Elimination deletes barriers that don't separate visible memory traffic, using a four-map cross-product over the cross-warp dependence graph as the correctness predicate. Common Base Elimination collapses repeated address arithmetic by hoisting a shared base pointer and rewriting related GEP chains as deltas off that base, using ScalarEvolution to recognize algebraically equal bases written through different operand sequences. Both passes are sound by construction: deletion happens only when the dependence graph or the SCEV equivalence proves the rewrite is an identity transformation on observable behavior.
Dead Sync Elimination
Input and Output IR Shape
DeadSyncElim consumes LLVM IR carrying NVPTX sync intrinsics and emits the same IR with provably redundant calls removed. The intrinsic family it deletes covers nvvm.bar.sync.aligned, nvvm.barrier0 and its named variants, and the nvvm.fence.* ordering intrinsics. A representative pair of fragments:
; before
define void @kernel(ptr addrspace(3) %s, i32 %i) {
entry:
%p = getelementptr i32, ptr addrspace(3) %s, i32 %i
%v = load i32, ptr addrspace(3) %p, align 4
call void @llvm.nvvm.barrier0()
%w = add i32 %v, 1
store i32 %w, ptr addrspace(3) %p, align 4
ret void
}
; after
define void @kernel(ptr addrspace(3) %s, i32 %i) {
entry:
%p = getelementptr i32, ptr addrspace(3) %s, i32 %i
%v = load i32, ptr addrspace(3) %p, align 4
%w = add i32 %v, 1
store i32 %w, ptr addrspace(3) %p, align 4
ret void
}
The nvvm.barrier0 here separates a per-thread load from a per-thread store on the same address; no other warp can observe a value through that barrier that it could not observe in its absence, so the pass deletes it. The deletion is local to a basic block — the analyzer never claims a barrier dead when it spans CFG edges that could carry cross-warp dependences from outside the block.
Matching Predicate
The pass operates per basic block. For each barrier site it maintains four maps from shared-memory address (or, when the address is non-constant, an address-class tag) to a one-byte access summary:
| Map | Meaning |
|---|---|
| read_above | SMEM reads before the barrier in this block |
| write_above | SMEM writes before the barrier in this block |
| read_below | SMEM reads after the barrier in this block |
| write_below | SMEM writes after the barrier in this block |
A barrier is dead exactly when
(write_above × read_below) ∪ (write_below × read_above) = ∅
— no producer-consumer pair on shared memory crosses it. Address-class collisions count as may-alias and keep the barrier alive.
The four maps are a compact encoding of the cross-warp dependence graph at the barrier. Each entry (t, A, ↑) representing "thread t accesses address A above the barrier" induces a potential cross-warp edge to every entry (t', A, ↓) with the opposite read/write polarity. The barrier is observably necessary iff at least one such edge exists; the four-map cross-product is precisely that emptiness check, computed once per address class instead of per (thread, thread) pair.
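A literal reading of the emptiness check can be sketched with four sets of address keys. The `BarrierMaps` struct and both functions are hypothetical; keys are compared by plain string equality, so an address-class tag shared by two accesses counts as a collision, and the one-byte access summary is left out.

```cpp
#include <cassert>
#include <set>
#include <string>

using KeySet = std::set<std::string>;  // normalized address keys or address-class tags

struct BarrierMaps {
    KeySet read_above, write_above, read_below, write_below;
};

// True when some key appears on both sides, i.e. a potential cross-warp edge.
bool intersects(const KeySet& a, const KeySet& b) {
    for (const std::string& k : a) {
        if (b.count(k) != 0) return true;
    }
    return false;
}

// Dead iff no write above meets a read below and no write below meets a read above.
bool barrierIsDead(const BarrierMaps& m) {
    return !intersects(m.write_above, m.read_below) &&
           !intersects(m.write_below, m.read_above);
}

// Usage sketch: a write above and a read below on the same key keep the barrier.
void exampleProducerConsumer() {
    BarrierMaps m;
    m.write_above = {"smem+4*i"};
    m.read_below = {"smem+4*i"};
    assert(!barrierIsDead(m));
}
```

Restricting the cross-product to pairs with equal keys reduces it to a set intersection, which is why the check costs one pass per map rather than one per thread pair.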
Algebraic Correctness Condition
The elimination is safe iff the cross-warp dependence graph is unchanged after the deletion. With the barrier present, every pair ((t, A, write_above), (t', A, read_below)) carries a happens-before edge that the memory model would otherwise leave undefined; deleting the barrier removes that edge. The maps' emptiness guarantees no such edge exists to begin with, which makes deletion an identity transformation on the dependence graph and therefore on the program's observable behavior. The condition is necessary and sufficient: an empty cross-product is what happens-before actually demands, not a sufficient over-approximation.
Exempt Intrinsics
Five sync intrinsics are permanently exempted from deletion even when the four-map predicate declares them dead. Their semantics reach past shared memory in ways the lightweight scan cannot represent:
| Intrinsic | Reason for exemption |
|---|---|
| llvm.nvvm.exit | Terminates the thread; any preceding ordering must survive. |
| llvm.nvvm.trap | Aborts the device; same argument. |
| llvm.nvvm.bar.warp.sync | Intra-warp lane convergence; mask-only effects, not modeled by the SMEM maps. |
| llvm.nvvm.cp.async.bulk.wait_group | TMA bulk-copy completion wait; ordering is against the async DMA engine, not SMEM. |
| llvm.nvvm.cluster.arrive.relaxed | Cluster-wide CTA handshake; orders against the cluster fabric. |
The exit and trap cases are the conservative additions: any preceding sync may be the only thing guaranteeing a store visibility before the abort, and the analyzer has no way to prove otherwise without a full inter-block walk.
SCEV Opcode Set for the Address Builder
All four maps key on a shared-memory address. The analyzer normalizes each address through a ScalarEvolution walk so that algebraically equal addresses written through different operand sequences collapse to the same map key. Anything outside the recognized opcode set becomes an opaque leaf, which is conservative: it widens to an address-class tag and keeps every barrier with a may-alias hit alive.
The recognized opcodes are exactly those that admit safe SCEV reordering across a sync without changing the address each thread computes:
- getelementptr inbounds: base case; produces the SCEV pointer leaf.
- add nsw, add nuw: folded into a single SCEV add with no-wrap flags preserved.
- mul nsw, mul nuw: folded into a SCEV mul, same flag handling.
- shl nuw, shl nsw: rewritten as mul (1 << shamt) and joined with the SCEV-mul chain.
- sext, zext, trunc: recursed past with the matching SCEV extension applied.
- phi: handled through the SCEV merge rule so loop-variant addresses stay symbolic.
Any other opcode (a load, an opaque call, a vector shuffle, a bitcast across address spaces) terminates the SCEV walk at that operand and forces the conservative address-class fallback.
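The effect of the normalization can be illustrated with a toy affine form in which `mul` by a constant and `shl` by a constant land on the same key. This is not LLVM's ScalarEvolution: `Affine` and its helpers are hypothetical, no-wrap flags and extensions are ignored, and only constant multipliers are handled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical affine key: sum(coeff[sym] * sym) + constant.
struct Affine {
    std::map<std::string, std::int64_t> coeff;
    std::int64_t constant = 0;
    bool operator==(const Affine& o) const {
        return coeff == o.coeff && constant == o.constant;
    }
};

Affine sym(const std::string& name) { Affine a; a.coeff[name] = 1; return a; }
Affine cst(std::int64_t c) { Affine a; a.constant = c; return a; }

Affine add(Affine a, const Affine& b) {
    for (const auto& [s, c] : b.coeff) {
        if ((a.coeff[s] += c) == 0) a.coeff.erase(s);  // drop cancelled terms
    }
    a.constant += b.constant;
    return a;
}

Affine mulConst(Affine a, std::int64_t k) {
    if (k == 0) return cst(0);
    for (auto& [s, c] : a.coeff) c *= k;
    a.constant *= k;
    return a;
}

// shl is rewritten as mul by (1 << shamt), as in the opcode list above.
Affine shlConst(const Affine& a, unsigned shamt) {
    assert(shamt < 63 && "shift amount must fit in int64_t");
    return mulConst(a, std::int64_t{1} << shamt);
}
```

With these helpers, `mulConst(sym("i"), 4)` and `shlConst(sym("i"), 2)` compare equal, so two accesses indexed either way would share one map key.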
Algorithm
LogicalResult deadSyncElim(Function *F) {
for (BasicBlock &bb : *F) {
SMEMState s = build_smem_state(&bb); // empty maps initially
for (Instruction &inst : llvm::make_early_inc_range(bb)) { // erase-safe: the iterator advances before the body runs
if (is_sync_intrinsic(&inst)) {
if (is_exempt_intrinsic(&inst)) {
s = split_maps_at(&inst, s); // refresh above/below split
continue;
}
if (cross_product_empty(s.write_above, s.read_below) &&
cross_product_empty(s.write_below, s.read_above)) {
emit_remark(&inst, "Removed dead synch: ", format_state(s));
inst.eraseFromParent();
continue;
}
s = split_maps_at(&inst, s);
continue;
}
if (is_shared_load(&inst)) s.read_above.insert(addr_key_scev(inst));
if (is_shared_store(&inst)) s.write_above.insert(addr_key_scev(inst));
}
}
return success();
}
Failure Modes
Three observable failure modes deserve mention:
- Address-class collision keeps a barrier alive. When ScalarEvolution cannot resolve the address (a non-constant index through an opaque pointer), both halves collapse to a single tag and any read/write pair on opposite sides keeps the barrier. The pass logs the conservative miss only under the verbose dump flag.
- Cross-block dependences are invisible. The maps are per-block. A barrier that separates a producer in one block from a consumer in another is never deletable here; that is a global dataflow problem that this pass intentionally avoids.
- Wrong sync family on the exempt list. A reimplementation that loses the exempt entry for nvvm.exit or nvvm.trap deletes the last fence before a process abort and silently changes observable store visibility. Diagnostics never fire for this case; the bug surfaces as memory inconsistency on the host side.
Every deletion emits the diagnostic "Removed dead synch: " followed by a four-line "Read/Write above/below" summary of the four maps. The -print-dead-sync-elim flag gates the dump.
Common Base Elimination
Input and Output IR Shape
Common Base Elimination is GEP-CSE with teeth. The syntactic CSE that stock LLVM passes such as EarlyCSE and GVN perform matches GEPs whose operand chains are literally identical; this pass uses LLVM ScalarEvolution to merge GEPs that share a common base pointer at the same SCEV-expression level. Two GEPs whose bases hash to the same SCEV key are mergeable even when their operand chains differ. That shape is frequent after loop unrolling and affine-to-LLVM lowering, where one address is reached through algebraically equal but textually distinct sequences of add, mul, shl, and integer extensions.
; before: two GEPs with textually distinct but SCEV-equal bases
define void @k(ptr %p, i64 %i) {
entry:
%t0 = mul nsw i64 %i, 4
%a0 = getelementptr i8, ptr %p, i64 %t0
%v0 = load i32, ptr %a0, align 4
%t1 = shl nsw i64 %i, 2 ; SCEV-equal to %t0
%a1 = getelementptr i8, ptr %p, i64 %t1
store i32 %v0, ptr %a1, align 4
ret void
}
; after: one canonical base, second use becomes a delta-zero load/store
define void @k(ptr %p, i64 %i) {
entry:
%t0 = mul nsw i64 %i, 4
%scevcgp_0 = getelementptr i8, ptr %p, i64 %t0
%v0 = load i32, ptr %scevcgp_0, align 4
store i32 %v0, ptr %scevcgp_0, align 4
ret void
}
Matching Predicate
A pair of GEPs is mergeable iff their base expressions normalize to the same SCEV value under the visitor below. Once a group is identified, the pass picks a canonical representative, hoists it to a position that dominates every consumer, and rewrites the remaining members as the difference between their SCEV and the representative's SCEV.
Dominance and the Alloca + PHI Argument
The merge is only correct when the canonical representative dominates every original use. Two alloca-based GEPs with the same allocation size and address space can share a single base safely iff their lifetimes don't overlap and at least one of them is reachable through an entry-block-dominated alloca. The argument: the function entry block dominates every basic block by construction, so any value materialized there dominates every use in the function. Cloning an alloca into the entry block lifts its dominance to that of the entry; cloning is the prerequisite for any merge where the original alloca lived in a block that didn't dominate every group member.
When the cloned base must flow through a CFG merge — a loop header, the join of an if/else, the post-dominator of a switch — a PHI node at the merge point assembles one incoming value per predecessor. The PHI is what makes the canonical base usable across loops and around branching regions where the entry-block alloca would not by itself reach every deduplicated GEP without an explicit dataflow merge.
SCEV Visitor
The SCEV computation walks each GEP's IR operand graph through a small fixed opcode set. Anything outside the set becomes an opaque leaf and stops the recursion, which keeps the implementation robust against unfamiliar IR shapes:
- getelementptr inbounds is the base case and contributes the pointer-typed leaf of the SCEV.
- add nsw and add nuw are folded into a single SCEV add with the no-wrap flag preserved.
- mul nsw and mul nuw are folded into a single SCEV mul, again preserving no-wrap flags.
- shl nuw and shl nsw are converted to mul (1 << shamt) so they participate in the SCEV-mul chain.
- sext, zext, and trunc are recursed past, with the SCEV extension or truncation applied to the result.
- phi is recursed via the SCEV merge rule so loop-variant addresses stay symbolic rather than blocking the match.
Driver and Body
The pass splits into an outer driver and an inner body. The driver walks every function and every basic block, emitting "Processing X / Block Y" diagnostics where X and Y are sequential counters for visited functions and blocks; the diagnostics double as a progress indicator on very large modules. The body runs per basic block and performs the actual rewrite: for each GEP it consults the SCEV cache, looks up an existing canonical representative in a hash from SCEV key to representative, and either records the GEP as the new representative or rewrites it as a delta off the existing one.
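The body's lookup-or-record step can be sketched as a planning pass over a list of GEP descriptors. Everything here is a hypothetical simplification: `Gep`, `Rewrite`, and `planRewrites` are illustrative names, the SCEV key is a pre-computed string, and dominance, delta construction, and PHI insertion are left out.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

struct Gep {
    std::string name;
    std::string baseKey;  // stand-in for the normalized SCEV key of the base
    unsigned numUses;
};

struct Rewrite {
    std::string gep;             // GEP to be rewritten
    std::string representative;  // canonical GEP it is expressed against
};

// The first GEP seen per key becomes the representative; later ones are rewritten.
std::vector<Rewrite> planRewrites(const std::vector<Gep>& geps, unsigned minUses) {
    std::unordered_map<std::string, std::string> repForKey;
    std::vector<Rewrite> plan;
    for (const Gep& g : geps) {
        if (g.numUses < minUses) continue;  // the minimum-use gate
        auto [it, inserted] = repForKey.try_emplace(g.baseKey, g.name);
        if (!inserted) plan.push_back({g.name, it->second});
    }
    return plan;
}

// Usage sketch: two GEPs with the same key, the second maps onto the first.
void examplePlan() {
    std::vector<Rewrite> plan =
        planRewrites({{"a0", "p+4*i", 2}, {"a1", "p+4*i", 2}}, 2);
    assert(plan.size() == 1 && plan[0].representative == "a0");
}
```

`try_emplace` performs the lookup and the record in one hash probe, which mirrors the "either records the GEP as the new representative or rewrites it" split described above.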
IRBuilder Temporary Prefixes
Stable name prefixes mark every rewrite-produced IR value, so they jump out in dumps and --print-after traces. Four prefixes, each tied to a distinct role:
| Prefix | Meaning |
|---|---|
| scevcgp_ | SCEV-canonicalised GEP, the merged representative produced by the CSE |
| scevcgptmp_ | Temporary value holding a partial SCEV computation during materialisation |
| baseValue | Cloned alloca base pointer emitted into the function entry block |
| bitCastEnd | Optional bitcast applied when the merged GEP's pointer element type differs from a user's expected type |
The bitCastEnd cast lands only when the canonical representative's pointer element type does not match a specific user. Skipping the cast otherwise keeps the rewritten IR free of no-op casts that would otherwise survive into instruction selection.
Tunables
Five cl::opt knobs configure the pass. Each takes effect at the next function the driver visits.
| Knob | Default | Meaning |
|---|---|---|
| cbe-enable | 1 | Master enable for the whole pass. |
| cbe-max-depth | 8 | Maximum SCEV-tree depth to consider when matching bases. |
| cbe-max-iter | 16 | Maximum number of CSE iterations per function before giving up. |
| cbe-clone-allocas | 1 | Enable the alloca-cloning step. |
| cbe-min-uses | 2 | Minimum number of uses before CSE fires on a candidate base. |
cbe-max-depth caps SCEV traversal cost on pathological index expressions. cbe-max-iter caps the outer fixed point: each iteration can expose new mergeable bases by replacing one GEP with a delta off another, and the bound prevents runaway behaviour on adversarial inputs. cbe-min-uses blocks rewrites on single-user GEPs, where the rewrite would add a PHI or a cast without saving any address arithmetic. With cbe-clone-allocas disabled, the alloca-cloning branch is skipped and any group whose base would have required cloning falls out of the merge — correct, but at the cost of some missed CSE on alloca-rooted addresses.
Algorithm
LogicalResult commonBaseElim(Function *F) {
SCEVCache cache = computeSCEVAll(F);
DenseMap<SCEVKey, GEP*> groups;
for (BasicBlock &bb : *F) {
for (Instruction &inst : bb) {
if (auto *gep = dyn_cast<GetElementPtrInst>(&inst)) {
if (gep->getNumUses() < cbe_min_uses) continue;
SCEVKey key = scevOfBase(cache, gep);
auto &rep = groups[key];
if (!rep) { rep = gep; continue; }
if (!dominates(rep, gep)) {
if (isa<AllocaInst>(rep_base(rep)) && cbe_clone_allocas)
cloneAllocaToEntry(F, rep);
insertMergePHIs(F, rep, gep);
}
Value *delta = buildDelta(cache, rep, gep);
replaceWithCanonical(gep, rep, delta, /*bitcastIfNeeded=*/true);
}
}
}
return success();
}
Failure Modes
- Dominance lift fails. When cbe-clone-allocas is off and the original alloca does not dominate every consumer, the group is discarded silently. The IR is unchanged; the only signal is a missing rewrite.
- SCEV depth cap hit. A deep index expression yields an opaque SCEV leaf at the cap, so distinct deep expressions never collide. The pass stays correct but misses the merge.
- Iteration cap hit. Each round may expose new mergeable bases; hitting cbe-max-iter leaves residual GEPs whose mergeability would have been visible to a later round. Increasing the cap is safe but linear in compile time.
- Type mismatch without bitCastEnd. A reimplementation that skips the bitcast on type mismatch produces ill-typed IR; the verifier rejects the function. The cast is mandatory whenever the representative and the user disagree on pointer element type.
The final materialisation rule: single-predecessor regions reuse the incoming base directly without a PHI; multi-predecessor regions need one incoming value per predecessor and a final bitCastEnd when the original pointer type differs from the canonical representative.
Cross-References
NVPTX Backend Passes Overview places this pass at the tail of the LLVM-IR middle end, after MemorySpaceOpt and before NVVM IR Verifier. BASR: Base-Address-Slice-Replace is the post-ISel MIR-level peephole that performs the analogous address-arithmetic fusion on selected machine instructions — Common Base Elimination is its IR-level counterpart.
NVVM IR Verifier
Abstract
NVVMIRVerifier enforces NVVM-IR-level invariants the upstream LLVM Verifier knows nothing about. It runs after every NVPTX-side pass in the Tileiras pipeline and fires diagnostics on violations such as a kernel launched from a non-kernel function or a parameter buffer that overflows the SM's parameter-space limit. Failure aborts compilation through signalPassFailure(). The pass is a regular LLVM FunctionPass, not an MLIR OperationPass, so it never touches the failure-flag handshake TileAS passes use; the LLVM pass manager picks up its failure through the standard Pass::run return path and aborts before the next NVPTX pass starts.
Two principal procedures do the work. The launch-argument address-space checker walks every gpu.launch_func instruction and verifies that arguments live in an address space the child grid can dereference — typically global or constant. The parameter-space sizer walks each kernel's formal parameter list, sums byte sizes per the NVVM ABI, and compares the total to the per-SM parameter-space ceiling.
Launch-Argument Address-Space Check
The launch checker iterates the operands of each gpu.launch_func site and resolves the address space of every pointer-typed argument. Global and constant pointers pass unconditionally. A pointer the child grid cannot legally dereference triggers one of two diagnostics.
The first diagnostic fires when the launch target itself is not a kernel:
a function that is not __global__ cannot be launched
The second fires when an argument is a generic-AS or local-AS pointer. The child grid runs in a different address-space frame, and dereferencing a parent-thread local pointer or an addrspace(0) pointer through it is undefined:
A pointer to local memory or memory in 'addrspace(0)' has been used as a launch argument. Dereferencing this within the launch is undefined
Both strings surface through the standard MLIR diagnostic engine; downstream tooling matches on them.
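The classification above can be sketched as a small standalone check. This is a minimal sketch, not the pass itself: the `LaunchArg` type and the `check_launch_args` name are hypothetical, the address-space numbers are assumed to follow the usual NVPTX numbering (0 generic, 1 global, 3 shared, 4 constant, 5 local), and shared pointers are left unrejected only because the text does not say how they are treated.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Address-space numbers, assumed to match the usual NVPTX numbering.
enum AddrSpace : unsigned { AS_GENERIC = 0, AS_GLOBAL = 1, AS_SHARED = 3,
                            AS_CONSTANT = 4, AS_LOCAL = 5 };

// Hypothetical stand-in for one operand at a launch site.
struct LaunchArg {
    bool is_pointer;
    unsigned addr_space;  // meaningful only when is_pointer is true
};

static const char *const kNotKernelMsg =
    "a function that is not __global__ cannot be launched";
static const char *const kBadPointerMsg =
    "A pointer to local memory or memory in 'addrspace(0)' has been used "
    "as a launch argument. Dereferencing this within the launch is undefined";

// Returns the diagnostics for one launch site; an empty result means it passed.
std::vector<std::string> check_launch_args(bool target_is_kernel,
                                           const std::vector<LaunchArg> &args) {
    std::vector<std::string> diags;
    if (!target_is_kernel) diags.push_back(kNotKernelMsg);
    for (const LaunchArg &arg : args) {
        if (!arg.is_pointer) continue;  // scalars are never rejected here
        if (arg.addr_space == AS_GENERIC || arg.addr_space == AS_LOCAL)
            diags.push_back(kBadPointerMsg);
    }
    return diags;
}
```

A launch of a real kernel with only global and constant pointers returns no diagnostics; a launch of a non-kernel with a generic pointer returns both strings.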
Parameter-Space Sizer
A 21-case switch on the NVVM type tag stored in the parameter descriptor dominates the sizer. Each case returns the parameter's byte footprint; the caller accumulates the running total with natural alignment between fields.
| Tag | Type | Size formula |
|---|---|---|
| 0 | i1 | 1 byte (padded) |
| 1 | i8 | 1 |
| 2 | i16 | 2 |
| 3 | i32 | 4 |
| 4 | i64 | 8 |
| 5 | f16 | 2 |
| 6 | bf16 | 2 |
| 7 | f32 | 4 |
| 8 | f64 | 8 |
| 9 | tf32 | 4 |
| 10 | f8e4m3 | 1 |
| 11 | f8e5m2 | 1 |
| 12 | f4e2m1 | 0.5 (packed pair) |
| 13 | ptr_global | 8 |
| 14 | ptr_constant | 8 |
| 15 | ptr_shared | 4 (sm32 ABI) or 8 |
| 16 | ptr_generic | 8 |
| 17 | array<elem, N> | size(elem) × N |
| 18 | struct{fields…} | aligned sum |
| 19 | vector<elem, N> | size(elem) × N (no padding) |
| 20 | opaque | error |
Tag 12 (f4e2m1) is the only sub-byte case — two values share a byte, so the sizer treats it as half a byte and only commits a whole byte when the parameter count rounds up. Tag 15 (ptr_shared) is the only case where the result depends on the ABI flavor: the legacy sm32 shared-memory pointer is 32 bits, every modern SM uses 64. Tag 20 (opaque) is unreachable in valid NVVM-IR; if it appears, the verifier emits a hard error pointing at an upstream type-lowering bug rather than user code.
Aggregate tags recurse. A struct{i32, f64, i8} aligns the f64 to 8 and pads the trailing i8 so the next parameter starts aligned. A vector<f32, 4> consumes 16 bytes flat with no inter-element padding — that's what distinguishes it from array<f32, 4> at the ABI boundary.
uint64_t size_of_param(ParamDesc *p, TargetInfo *target) {
    switch (p->tag) {
    case TAG_I1:  return 1;
    case TAG_I8:  return 1;
    case TAG_I16: return 2;
    case TAG_I32: return 4;
    case TAG_I64: return 8;
    case TAG_F16: return 2;
    case TAG_F32: return 4;
    case TAG_F64: return 8;
    case TAG_PTR_SHARED: return target->sm == 32 ? 4 : 8;
    case TAG_PTR_GLOBAL:
    case TAG_PTR_CONSTANT:
    case TAG_PTR_GENERIC: return 8;
    case TAG_ARRAY:  return p->elem_count * size_of_param(p->elem, target);
    case TAG_VECTOR: return p->elem_count * size_of_param(p->elem, target);
    case TAG_STRUCT: return size_struct(p, target);
    case TAG_OPAQUE: fatal("opaque parameter type"); return 0;
    ...
    }
}
uint64_t size_struct(ParamDesc *p, TargetInfo *target) {
    uint64_t off = 0;
    for (size_t i = 0; i < p->field_count; ++i) {
        ParamDesc *f = p->fields[i];
        off = round_up(off, align_of(f, target));
        off += size_of_param(f, target);
    }
    return round_up(off, align_of(p, target));
}
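The half-byte rule for tag 12 is easier to see when the count is kept in whole values rather than fractional bytes. The helper below is a sketch under one assumption: consecutive f4e2m1 values pack two per byte, and an odd trailing value still commits a whole byte. The name `packed_f4_bytes` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes committed for `count` consecutive f4e2m1 values, assuming two values
// share one byte and an odd trailing value still occupies a whole byte.
constexpr std::uint64_t packed_f4_bytes(std::uint64_t count) {
    return (count + 1) / 2;
}
```

Under that assumption one value and two values both cost one byte, and three values cost two.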
ParamSpaceLimit by SM Family
The accumulated total is checked against a per-SM ceiling. The limit is a step function of the SM major version:
| SM family | Limit (bytes) |
|---|---|
| sm_20…sm_35 | 440 |
| sm_50…sm_75 | 1 024 |
| sm_80…sm_90 | 32 764 |
| sm_100…sm_121 | 32 768 |
The sm_80–sm_90 ceiling falls 4 bytes short of 32 KiB because the runtime reserves a small trailer for the implicit grid-constant descriptor; sm_100 and later move that descriptor elsewhere and reclaim the full 32 KiB. When the running total exceeds the SM's limit, the sizer emits:
Formal parameter space overflowed (X bytes required, max Y bytes allowed) in function Z
X is the running sum, Y is the parameter-space ceiling for the active SM, and Z is the demangled kernel name.
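The step function and the diagnostic format can be written down directly from the table. This is an illustrative sketch: `param_space_limit` and `check_param_space` are hypothetical names, and returning 0 for an architecture the table does not list is an assumption of the sketch, not documented behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Per-SM parameter-space ceiling in bytes, following the table above.
// `sm` is the numeric architecture, e.g. 75 for sm_75. Unlisted
// architectures return 0 (an assumption of this sketch).
std::uint64_t param_space_limit(unsigned sm) {
    if (sm >= 20 && sm <= 35) return 440;
    if (sm >= 50 && sm <= 75) return 1024;
    if (sm >= 80 && sm <= 90) return 32764;
    if (sm >= 100 && sm <= 121) return 32768;
    return 0;
}

// Returns the overflow diagnostic, or an empty string when the total fits.
std::string check_param_space(std::uint64_t total, unsigned sm,
                              const std::string &kernel) {
    const std::uint64_t limit = param_space_limit(sm);
    if (total <= limit) return "";  // only a total above the ceiling overflows
    return "Formal parameter space overflowed (" + std::to_string(total) +
           " bytes required, max " + std::to_string(limit) +
           " bytes allowed) in function " + kernel;
}
```

A total exactly equal to the ceiling passes, because the diagnostic fires only when the running total exceeds the limit.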
Worked Example: Parameter-Space Overflow on sm_75
Take the kernel
struct Heavy {
    double scale;      // 8 B
    char tag;          // 1 B (+ 3 B padding to align the next field)
    int data[10000];   // 40000 B
};
__global__ void big_kernel(struct Heavy h) { /* ... */ }
LowerStructArgs has already promoted the by-value h into a parameter-space pointer that the verifier walks through. The sizer descends into the struct in declaration order:
| Field | Tag | Offset (B) | Size (B) | Running total (B) |
|---|---|---|---|---|
| scale | f64 | 0 | 8 | 8 |
| tag | i8 | 8 | 1 | 9 |
| (padding to 4-byte alignment for int) | — | 9 | 3 | 12 |
| data[10000] | array<i32, 10000> | 12 | 40000 | 40012 |
| (trailing pad to 8-byte struct alignment) | — | 40012 | 4 | 40016 |
The struct sizes to 40016 bytes. The active SM is sm_75, so the ceiling is 1024 bytes. The running total exceeds the ceiling at the very first call to size_struct, and the sizer emits:
Formal parameter space overflowed (40016 bytes required, max 1024 bytes allowed) in function big_kernel
signalPassFailure() fires, the LLVM pass manager picks the failure up on the Pass::run return path, and the pipeline aborts before instruction selection runs. The same kernel also fails on sm_80 (40016 > 32764) and on sm_100, where the ceiling rises only to 32768. Shrinking the array to data[8000] brings the struct to 32016 bytes, which fits under both of those ceilings. The verifier is the canonical place where the kernel ABI's parameter-space ceiling becomes a hard error rather than a silent truncation.
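The layout arithmetic in the table can be reproduced with a few lines of natural-alignment bookkeeping. The sketch below is a simplified stand-in for the sizer, with hypothetical names (`Field`, `struct_size`, `heavy_size`) and each field reduced to a size and an alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One field reduced to its byte size and natural alignment.
struct Field { std::uint64_t size; std::uint64_t align; };

constexpr std::uint64_t round_up(std::uint64_t v, std::uint64_t a) {
    return (v + a - 1) / a * a;
}

// Natural-alignment layout: each field starts at the next multiple of its
// alignment, and the total is padded to the largest field alignment.
std::uint64_t struct_size(const std::vector<Field> &fields) {
    std::uint64_t off = 0, max_align = 1;
    for (const Field &f : fields) {
        off = round_up(off, f.align) + f.size;
        if (f.align > max_align) max_align = f.align;
    }
    return round_up(off, max_align);
}

// Heavy with an int array of `n` elements: {double, char, int[n]}.
std::uint64_t heavy_size(std::uint64_t n) {
    return struct_size({{8, 8}, {1, 1}, {4 * n, 4}});
}
```

`heavy_size(10000)` evaluates to 40016, matching the table, and `heavy_size(8000)` evaluates to 32016.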
What This Catches That Upstream LLVM Doesn't
Upstream LLVM ships a generic Verifier that validates LLVM IR independent of any target. It checks instruction operand counts, type compatibility across def-use chains, terminator shapes, intrinsic signatures against their declarations, and metadata-node well-formedness. None of those checks is aware of the NVPTX ABI, the per-SM intrinsic introduction matrix, or the address-space rules that govern device-side launches. Four NVPTX-specific bug classes pass upstream LLVM's verifier unconditionally and surface only here.
Parameter-Space Overflow
%struct.Heavy = type { double, i8, [10000 x i32] }
define void @big_kernel(%struct.Heavy %h) !nvvm.kernel !0 {
...
}
Upstream LLVM accepts the function: the by-value struct argument is a well-formed LLVM type, the def-use chain is consistent, and the kernel-marker metadata is well-formed. The function would lower through the NVPTX backend and emit a .entry big_kernel directive whose .param declarations name a struct that exceeds the SM's parameter-space ceiling. On sm_75 the ceiling is 1 024 bytes; the struct sizes to 40 016 bytes. The hardware-side consequence of a kernel whose parameter buffer exceeds the ABI limit is undefined: the runtime either truncates the parameter copy or rejects the launch with an opaque cuLaunchKernel error far from the source. The NVVM verifier reads the per-SM ceiling from the resolved #nvvm.target attribute and emits the diagnostic shown in the worked example above. Upstream LLVM has no concept of parameter space, so it cannot reach the check.
SM-Versioned Intrinsic Used Below Its Introducing SM
call void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.2d(...)
nvvm.cp.async.bulk.tensor.* is the Hopper TMA tile-load intrinsic family, introduced at sm_90. Upstream LLVM's verifier checks the intrinsic signature against the declaration in Intrinsics.td and accepts the call as well-formed. It does not check the active target. When the compile-target is sm_80 (Ampere, no TMA hardware), the NVVM verifier consults the intrinsic-to-introducing-SM table, compares against the resolved target's chip field, and emits a diagnostic naming the intrinsic and the minimum SM it requires. Without this check the call would lower to a PTX cp.async.bulk.tensor instruction that ptxas would reject with an architecture-mismatch error far from the source; the NVVM verifier surfaces the bug at the IR site that introduced the call.
The same check fires for the Blackwell-only nvvm.tcgen05.mma intrinsic family when the target is below sm_100, and for the SM_103-only block-scaled MMA intrinsics when the target is below sm_103.
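A table-driven version of this gate might look like the following. It is a sketch only: the `IntrinsicGate` type and function names are hypothetical, the table lists just the two families whose introducing SM is stated above, and a plain numeric comparison is an assumption that would not capture architecture-specific features such as the SM_103-only intrinsics.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical table row: intrinsic-name prefix and the first SM that has it.
struct IntrinsicGate { const char *prefix; unsigned min_sm; };

// Only the two families with an SM stated in the text are listed.
static const std::vector<IntrinsicGate> kGates = {
    {"llvm.nvvm.cp.async.bulk.tensor.", 90},
    {"llvm.nvvm.tcgen05.mma", 100},
};

// Returns the minimum SM for `name`, or 0 when the intrinsic is ungated.
unsigned min_sm_for(const std::string &name) {
    for (const IntrinsicGate &g : kGates)
        if (name.rfind(g.prefix, 0) == 0) return g.min_sm;  // prefix match
    return 0;
}

// True when the call is legal on the resolved target.
bool intrinsic_allowed(const std::string &name, unsigned target_sm) {
    return target_sm >= min_sm_for(name);
}
```

Keeping the gate as data means a new intrinsic family is one more table row rather than another branch in the checker.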
Launch-Argument Address-Space Mismatch
%local = alloca i32, addrspace(0)
call void @llvm.nvvm.launch(ptr @child_kernel, ptr %local)
The gpu.launch_func operand is a pointer to an alloca in the generic address space — the parent thread's local storage. Upstream LLVM's verifier accepts the call: the argument is a well-formed pointer, the launch intrinsic signature accepts a generic-AS pointer, and the def-use chain is consistent. The child grid runs in a different address-space frame, however, and a generic or local-AS pointer the parent passed is undefined to dereference from the child. The NVVM verifier walks each launch call's operand list, resolves the address space of every pointer-typed argument, and emits:
A pointer to local memory or memory in 'addrspace(0)' has been used as a launch argument. Dereferencing this within the launch is undefined
The closely related diagnostic for the launch target itself fires when the launched function is not marked as a kernel:
a function that is not __global__ cannot be launched
Upstream LLVM has no notion of NVPTX address-space rules or of the nvvm.kernel predicate. Both diagnostics surface only here.
Kernel-Required Metadata Missing on Launchable Function
define void @child(...) { ; missing nvvm.kernel marker
...
}
%launched = call ... @llvm.nvvm.launch(ptr @child, ...)
Upstream LLVM accepts both definitions and the call. The NVVM verifier walks every gpu.launch_func site, follows the called-symbol operand to its definition, and consults the isKernelFunction predicate documented in Kernel Identity. A function the launch reaches that does not satisfy the predicate fails the launch-target check above. The diagnostic carries the call-site location so the user can locate the missing __global__ declaration in the source.
Driver
The driver is a thin loop over the module. It selects kernels using the canonical isKernelFunction predicate (see Kernel Identity) and dispatches to the two checkers:
void run_nvvm_ir_verifier(Module *module, TargetInfo *target) {
    for (Function &fn : *module) {
        if (!isKernelFunction(&fn)) continue;  // shared predicate, see Kernel Identity
        check_parameter_space(fn, target);     // sizer + ceiling
        check_launch_arguments(fn);            // address-space check
    }
}
Any failed check calls signalPassFailure() directly.
Cross-References
LowerStructArgs Rewrite Shape is what leaves the parameter list this pass sizes. Kernel Identity defines the isKernelFunction predicate the driver consults. The NVPTX Backend Passes Overview shows where the verifier sits in the cluster pipeline.
Kernel, CDP, Force-Inline, and Pretreat Passes
Abstract
Four cooperating NVPTX-side passes share a single notion of kernel identity and run before the heavier NVPTX middle end. The kernel-attribute pass tags entry points with nvvm.kernel; the CDP expander rewrites device-side cudaLaunchDevice calls into runtime stubs; the force-inline pass collapses helpers the PTX ABI can't carry across a call boundary; and the pretreat pass normalizes frontend IR so address-space inference and argument lowering see a uniform form. They register together because they all consult the same isKernelFunction predicate and the same kernel-name registration table, and because their ordering is coupled: pretreat runs first, kernel attributes get stamped before CDP expansion goes looking for launchable targets, and force-inline runs last so it sees the final set of kernel and helper annotations.
Pass Registration Table
A single shared registration entry wires ten short names into the NVPTX pass registry. Each entry calls RegisterPass<T>(short_name, long_name) with the static class metadata, the short string consumed by --passes= and opt -passes=, and the long human-readable description. Other passes look these names up when scheduling a dependency or querying whether a pass already ran.
| Short name | C++ class | Purpose |
|---|---|---|
| KernelAttrPass | mlir::nvvm::KernelAttrPass | annotate kernels with nvvm.kernel |
| KernelInfoPrinter | mlir::nvvm::KernelInfoPrinter | emit "kernel-info: …" remarks |
| InlineMustPass | mlir::nvvm::InlineMustPass | force AlwaysInline on hot kernels |
| Pretreat | mlir::nvvm::PretreatPass | early IR cleanup before NVPTX |
| CDPLaunchExpander | mlir::nvvm::CDPLaunchExpander | expand cudaLaunchDevice to __cudaCDP*LaunchDeviceV2 |
| CDPParameterBuffer | mlir::nvvm::CDPParameterBuffer | wire up __cudaCDP*GetParameterBufferV2 |
| KernelArgEliminator | mlir::nvvm::KernelArgEliminator | drop unused kernel args |
| KernelAttrTransplanter | mlir::nvvm::KernelAttrTransplanter | move kernel attrs to nvvm.* form |
| RemoveDeadFunctions | mlir::nvvm::RemoveDeadFunctions | dead-fn DCE |
| LegalizeFunctions | mlir::nvvm::LegalizeFunctions | post-link function-level cleanup |
Treat the short names as stable public surface. They appear in remark output, in command-line pass pipelines, and in the names emitted by -debug-pass-manager.
Kernel Identity
Kernel detection is the primary cross-cutting decision in this cluster. KernelAttrPass, InlineMustPass, CDPLaunchExpander, KernelArgEliminator, and several later NVPTX passes all consult one shared isKernelFunction predicate. The predicate is a four-criteria disjunction: a function is a kernel iff at least one of the following holds.
| # | Criterion | Source |
|---|---|---|
| 1 | Function::getCallingConv() == CallingConv::PTX_Kernel | the LLVM calling convention enumerator (value 0x47) emitted by the front-end on every kernel entry point |
| 2 | function carries the nvvm.kernel LLVM attribute | new-style NVVM attribute set by KernelAttrPass after CUDA 12 |
| 3 | function carries the nvvm.annotations_transplanted attribute | set by KernelAttrTransplanter when it migrates old !nvvm.annotations metadata |
| 4 | function carries the legacy string attribute "kernel" | CUDA 11 and earlier frontend output |
Criterion 1 is what every modern CUDA front-end emits directly. Criterion 2 is the canonical form KernelAttrPass produces and the form every later analysis prefers to read. Criterion 3 is the bookkeeping marker that lets the rest of the pipeline distinguish a kernel whose modern attribute was synthesized from old metadata from one that originally carried only the calling convention or the modern attribute. Criterion 4 is the long-tail fallback for IR consumed from older toolchains.
The third criterion is the subtle one. KernelAttrTransplanter walks the legacy !nvvm.annotations metadata list, copies each kernel mark to the modern attribute form, then stamps the source function with nvvm.annotations_transplanted so subsequent passes can distinguish a transplanted-and-already-modernized kernel from one that still owns its legacy metadata. The four-criteria predicate is the canonical "is this a kernel?" check across the NVPTX backend; every other pass reaches it through a single shared callee.
bool isKernelFunction(Function *fn) {
    if (fn->getCallingConv() == CallingConv::PTX_Kernel) return true;
    if (fn->hasFnAttribute("nvvm.kernel")) return true;
    if (fn->hasFnAttribute("nvvm.annotations_transplanted")) return true;
    if (fn->hasFnAttribute("kernel")) return true;
    return false;
}
Keep this predicate centralized in a single header. Forking the check across passes is how older NVPTX backends produced inconsistent "is this a kernel?" answers between KernelArgEliminator and InlineMustPass, with the predictable result that argument elimination dropped parameters of a function the inliner then refused to inline.
CDP Launch Expansion
Input and Output IR Shape
CUDA Dynamic Parallelism lets device code launch another kernel. CDPLaunchExpander rewrites each high-level cudaLaunchDevice call site into a CDP-specific intrinsic-call sequence that targets one of two runtime launch stubs; CDPParameterBuffer rewrites each cudaGetParameterBuffer call into a call to one of two runtime buffer-allocation stubs. The four stubs partition by CDP variant: CDP-1 is the single-grid form, CDP-2 is the two-grid form the runtime introduced for grid-of-grids workloads.
; before: high-level CUDA-runtime call
%pbuf = call ptr @cudaGetParameterBuffer(i64 64, i64 16)
store ptr %arg0, ptr %pbuf, align 8
%pbuf.1 = getelementptr i8, ptr %pbuf, i64 8
store i32 %arg1, ptr %pbuf.1, align 4
%r = call i32 @cudaLaunchDevice(ptr @child_kernel, ptr %pbuf,
%struct.dim3 %grid, %struct.dim3 %block,
i32 %smem, ptr %stream)
; after: CDP-1 intrinsic sequence
%pbuf = call ptr @__cudaCDP1GetParameterBufferV2(ptr @child_kernel,
%struct.dim3 %grid,
%struct.dim3 %block,
i32 %smem)
store ptr %arg0, ptr %pbuf, align 8
%pbuf.1 = getelementptr i8, ptr %pbuf, i64 8
store i32 %arg1, ptr %pbuf.1, align 4
%r = call i32 @__cudaCDP1LaunchDeviceV2(ptr @child_kernel, ptr %pbuf,
%struct.dim3 %grid, %struct.dim3 %block,
i32 %smem, ptr %stream)
The parameter-buffer rewrite is not merely a name swap. The V2 buffer-allocation stub takes the child-kernel pointer and launch geometry as arguments so the runtime can allocate a buffer sized exactly for the child's parameter layout; the high-level call only carried the size and alignment. The expander reconstructs the geometry by walking the matching cudaLaunchDevice and threading its dim3 arguments back to the buffer allocation, which is why both passes register together and the launch expander has to run after the parameter-buffer rewrite (or visit them as a pair).
CDP Variant Selection
| Stub | Variant |
|---|---|
| __cudaCDP1LaunchDeviceV2 | CDP-1 (single grid) |
| __cudaCDP2LaunchDeviceV2 | CDP-2 (two grids) |
| __cudaCDP1GetParameterBufferV2 | CDP-1 parameter buffer alloc |
| __cudaCDP2GetParameterBufferV2 | CDP-2 parameter buffer alloc |
CDP variant selection (CDP1 vs CDP2) comes from the call site's variant flag, not from the kernel signature. The stub names are held in two const char* lookup arrays — one for launch stubs, one for parameter-buffer stubs — indexed by the variant. A future CDP-3 variant slots in by adding the new entries to those arrays without touching the rewriter logic. Keep that indirection in a reimplementation: it turns the CDP runtime ABI into a data table rather than a control-flow tree.
Matching Predicate
A call site is rewritable iff:
- the callee resolves to one of the four high-level entry points (cudaLaunchDevice, cudaLaunchDeviceV2, cudaGetParameterBuffer, cudaGetParameterBufferV2);
- the launched child resolves through isKernelFunction to a real PTX kernel;
- the call site carries a CDP-variant flag (1 or 2);
- the call site's parent function is itself a device function that the backend will lower to PTX.
A cudaLaunchDevice whose target resolves to an ordinary device function is a hard error: there is no PTX kernel entry to call, and the V2 launch stubs assume the callee is a real kernel. The expander emits a diagnostic and leaves the IR unchanged.
Algorithm
void expand_cdp_launches(Function *F) {
    for (Instruction &inst : instructions(F)) {
        if (auto *call = dyn_cast<CallInst>(&inst)) {
            Function *callee = call->getCalledFunction();
            if (!callee) continue;
            CdpKind k = classify_cdp_entry(callee);
            if (k == CDP_NONE) continue;
            Function *child = resolve_child_kernel(call);
            if (child && !isKernelFunction(child)) {
                emit_error(call, "CDP target is not a kernel");
                continue;
            }
            int variant = read_variant_flag(call);  // 1 or 2
            const char *stub = (k == CDP_LAUNCH)
                ? launch_stub_table[variant]
                : pbuf_stub_table[variant];
            rewrite_call_to_stub(call, stub);
        }
    }
}
Failure Modes
- Non-kernel target. The diagnostic fires before the launch stub is wired up; the IR retains the original cudaLaunchDevice call and a later verifier flags it.
- Variant flag missing. A call site with no readable variant tag is rewritten to CDP-1 by default; this is correct on every existing CUDA toolchain but a reimplementation that omits the default produces an unrewritten call.
- Parameter-buffer / launch mismatch. When the rewriter sees a buffer alloc whose corresponding launch is unreachable in the same function, it falls back to the legacy cudaGetParameterBuffer ABI and emits a diagnostic; mixing legacy and V2 ABIs is supported but the user loses the V2 size-checking guarantees.
Force-Inline Policy
Input and Output IR Shape
InlineMustPass walks every call site and force-inlines callees marked with the always_inline attribute. The pass exists because parts of the NVPTX ABI can't lower certain helper signatures faithfully: image and sampler arguments must arrive at the kernel boundary as opaque handles, large aggregate arguments can't survive a call boundary, and some helpers exist solely so the frontend has somewhere to attach attributes that must be visible at the use site.
; before
define internal float @sqrt_approx(float %x) "nvvm.always_inline" {
%r = call float @llvm.nvvm.rsqrt.approx.f(float %x)
%s = fmul float %r, %x
ret float %s
}
define void @kernel(ptr addrspace(1) %p, float %x) {
%v = call float @sqrt_approx(float %x)
store float %v, ptr addrspace(1) %p, align 4
ret void
}
; after: callee body inlined, internal callee dead-stripped
define void @kernel(ptr addrspace(1) %p, float %x) {
%r = call float @llvm.nvvm.rsqrt.approx.f(float %x)
%s = fmul float %r, %x
store float %s, ptr addrspace(1) %p, align 4
ret void
}
Force-Inline Marker Propagation
Certain callees are unconditionally inlined regardless of whether the front-end marked them: math-library wrappers (the __nv_* family that wraps NVPTX intrinsics), the intrinsic-wrappers the frontend emits to attach convergent or noreturn to a callsite, and any helper whose body contains an NVPTX intrinsic that cannot survive an ABI boundary. The pass detects these by walking the callee's body for a small set of forced-inline-triggering opcodes; on a match it stamps the callee with always_inline itself before the inlining walk.
The propagation step is intentionally idempotent: a second run of the pass over already-marked IR is a no-op for the marker pass and either a no-op or a redundant inline for the inliner. This matters because some pass pipelines run InlineMustPass twice — once before CDP expansion and once after — and the marker must survive the first run untouched.
Matching Predicate
A call site is forced-inline iff:
- its callee carries always_inline (either from the front-end or from the marker-propagation step);
- the callee has a body in this module (not an external declaration);
- the call is not part of a recursive cycle the inliner cannot break;
- the callee is not interposable.
The marker propagation step itself stamps always_inline on any internal callee whose body contains a forced-inline trigger and whose signature obeys the ABI constraints.
Algorithm
void inline_must_pass(Module *M) {
    // Phase 1: propagate the always-inline marker.
    for (Function &F : *M) {
        if (!F.isDeclaration() && contains_forced_inline_trigger(&F)) {
            F.addFnAttr("nvvm.always_inline");
        }
    }
    // Phase 2: actually inline.
    for (Function &caller : *M) {
        for (CallInst *call : calls_in(&caller)) {
            Function *callee = call->getCalledFunction();
            if (!callee || !callee->hasFnAttribute("nvvm.always_inline")) continue;
            if (!try_inline_at_call_site(call)) {
                emit_remark(&caller, "not AlwaysInline into ", caller.getName());
            }
        }
    }
}
When the inliner hits a callee it cannot inline — a recursive cycle, an exception handler frame, an interposable definition, or a callee whose body is unavailable — it emits a Remark of the form "not AlwaysInline into " followed by the caller's function name. The pass never silently downgrades the requirement: either the callee is inlined or the user receives the diagnostic and can fix the offending annotation.
Failure Modes
- Recursive always-inline. Two functions both marked always_inline that call each other would produce an infinite inline chain; the inliner breaks the cycle, emits the Remark, and leaves the cycle in place for later DCE.
- Marker on a declaration. An always-inline declaration without a body cannot be inlined: there is nothing to inline. The inliner emits the Remark and leaves the call.
- Marker propagation false positive. A reimplementation that lists too many opcodes as forced-inline triggers will stamp ordinary library helpers and inflate code size; the trigger set should be exactly the opcodes whose ABI requires inlining, not a heuristic.
Kernel Info Printer
KernelInfoPrinter is a read-only diagnostic pass. It walks every function that satisfies isKernelFunction and emits one Remark per metric in a fixed "kernel-info: <Metric> in function '<fn>' = <value>" format. The metric set is exactly nineteen entries, in order: regs, smem, cmem, tex, params, local, stack, barriers, loads, stores, branches, fp_ops, int_ops, divergence, predicated, vector_ops, mma_ops, tcgen05_ops, tma_ops.
The last three are Blackwell-era additions. mma_ops counts WGMMA-family tensor-core instructions, tcgen05_ops counts the tensor-memory ops introduced for sm_100 and later, and tma_ops counts asynchronous bulk-copy instructions. Keep the metric list ordered in any reimplementation — downstream tooling parses the remark stream positionally and breaks the moment the order shifts.
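Because consumers parse the remark stream positionally, it helps to keep the metric names in one ordered array and format every remark from it. The sketch below assumes the metric name is printed exactly as listed above; `kMetrics` and `kernel_info_remark` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// The nineteen metric names in their documented order.
static const std::array<const char *, 19> kMetrics = {
    "regs", "smem", "cmem", "tex", "params", "local", "stack",
    "barriers", "loads", "stores", "branches", "fp_ops", "int_ops",
    "divergence", "predicated", "vector_ops", "mma_ops", "tcgen05_ops",
    "tma_ops"};

// Formats one remark; `metric_index` is the position in kMetrics.
// std::array::at throws std::out_of_range for an invalid index.
std::string kernel_info_remark(std::size_t metric_index,
                               const std::string &fn, std::uint64_t value) {
    return std::string("kernel-info: ") + kMetrics.at(metric_index) +
           " in function '" + fn + "' = " + std::to_string(value);
}
```

Emitting the remarks by iterating the array from index 0 to 18 keeps the order fixed by construction.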
Pretreat
Input and Output IR Shape
PretreatPass is the first cleanup stage after libNVVM accepts frontend IR. Its job is to strip or normalize frontend-specific forms before verification, address-space inference, and argument lowering start relying on them. The pass is deliberately narrow: it canonicalizes pointer casts, normalizes lifetime and memory intrinsics, strips metadata that earlier frontend stages already consumed, and rewrites placeholder intrinsics into the forms later NVVM passes expect. It performs no optimization that depends on the analysis results it precedes — the contract is "make the IR uniform without changing observable behavior".
; before: typical frontend output
define void @k(ptr %p, i32 %n) {
entry:
  %cast1 = addrspacecast ptr %p to ptr addrspace(1)
  %cast2 = addrspacecast ptr addrspace(1) %cast1 to ptr
  call void @llvm.lifetime.start.p0(i64 -1, ptr %p) ; -1 means "whole alloca"
  call void @llvm.memcpy.p0.p0.i32(ptr %p, ptr %p, i32 0, i1 false) ; zero-length
  call void @llvm.nvvm.kernel.placeholder() ; consumed by libNVVM
  call void @llvm.lifetime.end.p0(i64 -1, ptr %p)
  ret void
}
; after: canonical form
define void @k(ptr %p, i32 %n) {
entry:
  call void @llvm.lifetime.start.p0(i64 -1, ptr %p)
  ; zero-length memcpy deleted
  ; placeholder intrinsic deleted
  call void @llvm.lifetime.end.p0(i64 -1, ptr %p)
  ret void
}
Matching Predicate
The pass is a sequence of independent rewrite rules. Each rule matches a fixed IR shape — an intrinsic call, a pointer cast, a metadata kind — and either deletes it, rewrites it into canonical form, or stamps a marker that later passes consume. No rule depends on the result of another rule running in the same pass invocation; the ordering is fixed for determinism but not for correctness.
The 19 Ordered Cleanups
The cleanups run in a fixed order. Each is independent and idempotent, and the order is the one downstream passes assume. Reordering produces correct IR but breaks the verification-friendly invariants that the NVVM verifier checks for.
| # | Cleanup | Effect |
|---|---|---|
| 1 | Strip already-consumed !nvvm.annotations entries | Remove metadata entries whose attribute was already migrated. |
| 2 | Canonicalize trivial bitcast chains | Collapse bitcast(bitcast(x)) to a single cast. |
| 3 | Drop no-op addrspacecast pairs | Remove cvta-to-self casts produced by the front-end. |
| 4 | Normalize llvm.lifetime.{start,end} sizes | Replace explicit alloca sizes with -1 (whole-alloca) where the size matches. |
| 5 | Delete zero-length llvm.memcpy / llvm.memmove / llvm.memset | Remove explicit no-op moves. |
| 6 | Replace constant-fold-eligible casts | Fold bitcast of a constant into the constant itself. |
| 7 | Collapse getelementptr chains with zero indices | Drop GEPs that produce the same pointer they consume. |
| 8 | Canonicalize integer extensions | Choose zext over sext for known-non-negative sources where the front-end emitted the wrong one. |
| 9 | Strip convergent from non-convergent intrinsics | Remove a front-end-conservative convergent from intrinsics whose semantics do not require it. |
| 10 | Rewrite llvm.nvvm.read.ptx.sreg.* placeholder calls | Replace placeholder special-register reads with the canonical form. |
| 11 | Normalize llvm.dbg.declare to llvm.dbg.value | Convert variable-address debug info to value debug info where applicable. |
| 12 | Canonicalize select of constants | Reorder operands so the constant-true branch comes first. |
| 13 | Strip dead llvm.assume calls | Delete assume(true) and assume(constant) calls. |
| 14 | Replace undef operands in memcpy byte-count | Rewrite undef lengths to zero so cleanup 5 can delete them. |
| 15 | Canonicalize NaN/Inf floating-point literals | Convert non-IEEE-canonical NaN bit patterns to the canonical quiet NaN. |
| 16 | Strip discarded loop metadata | Remove !llvm.loop entries the front-end attached but the back-end ignores. |
| 17 | Lift nvvm.kernel metadata to function attribute | When KernelAttrTransplanter has not yet run, do the equivalent stamping. |
| 18 | Remove unreachable basic blocks | Delete BBs with no predecessor and no entry-block status. |
| 19 | Drop empty llvm.global_ctors / llvm.global_dtors entries | Clean up the trailing nulls some front-ends emit. |
The numbering is the canonical order. Cleanups 1, 16, and 19 strip metadata or globals; 2–8 simplify pointer and integer arithmetic; 9–14 normalize intrinsics and debug info; 15 fixes floating-point bit patterns; 17 is the legacy-attribute migration; 18 is the unreachable-block sweep that gives later passes a non-pessimistic dominator tree.
Algorithm
void pretreat_module(Module *M) {
    for (Function &F : *M) {
        if (F.isDeclaration()) continue;
        strip_consumed_annotations(&F);         // 1
        canonicalize_bitcast_chains(&F);        // 2
        drop_noop_addrspace_casts(&F);          // 3
        normalize_lifetime_sizes(&F);           // 4
        delete_zero_length_mem_intrinsics(&F);  // 5
        constant_fold_casts(&F);                // 6
        collapse_zero_index_geps(&F);           // 7
        canonicalize_int_extensions(&F);        // 8
        strip_spurious_convergent(&F);          // 9
        rewrite_sreg_placeholders(&F);          // 10
        normalize_dbg_declare(&F);              // 11
        canonicalize_select_constants(&F);      // 12
        strip_dead_assume(&F);                  // 13
        normalize_undef_memcpy_lengths(&F);     // 14
        canonicalize_fp_specials(&F);           // 15
        strip_discarded_loop_metadata(&F);      // 16
        lift_kernel_metadata_to_attr(&F);       // 17
        remove_unreachable_blocks(&F);          // 18
    }
    drop_empty_ctor_dtor_entries(M);            // 19
}
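The shape of the pass, a fixed list of idempotent rules applied in order, can be illustrated on a toy IR. Everything in this sketch is hypothetical: the `Inst` type stands in for real instructions, and only two of the nineteen rules are modeled (zero-length memory intrinsics and placeholder deletion).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Toy instruction: an opcode and, for memory intrinsics, a byte count.
struct Inst { std::string op; std::int64_t len; };
using Body = std::vector<Inst>;
using Rule = std::function<void(Body &)>;

// Cleanup 5 on the toy IR: delete zero-length memcpy/memmove/memset.
void delete_zero_length_mem(Body &b) {
    b.erase(std::remove_if(b.begin(), b.end(), [](const Inst &i) {
                return (i.op == "memcpy" || i.op == "memmove" ||
                        i.op == "memset") && i.len == 0;
            }), b.end());
}

// Placeholder deletion, as in the before/after example.
void delete_placeholders(Body &b) {
    b.erase(std::remove_if(b.begin(), b.end(), [](const Inst &i) {
                return i.op == "placeholder";
            }), b.end());
}

// Rules run in a fixed order; each is safe to apply more than once.
void pretreat(Body &b) {
    static const std::vector<Rule> rules = {delete_zero_length_mem,
                                            delete_placeholders};
    for (const Rule &r : rules) r(b);
}
```

Running `pretreat` a second time on its own output leaves the body unchanged, which is the idempotence property the real cleanups are described as having.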
Failure Modes
- Out-of-order cleanups. Running cleanup 5 before cleanup 14 leaves memcpy(_, _, undef) in the IR; the verifier accepts it but the back-end emits a runtime call to a memcpy stub.
- Skipping cleanup 17. A kernel that retained only !nvvm.annotations and was never visited by KernelAttrTransplanter will not be recognized as a kernel by isKernelFunction's criterion 2. Criterion 1 still catches it when the front-end set the PTX_Kernel calling convention, and criterion 3 applies only once the transplanter has run, so cleanup 17 is what gives criterion 2 a chance.
- Aggressive optimization in pretreat. A reimplementation that adds an arithmetic-simplification rule to pretreat will change observable IR before the verifier runs, breaking the contract that pretreat is purely canonicalization. The rule belongs in the optimization passes downstream.
Cross-References
NVPTX Backend Passes Overview shows where this cluster sits in the full NVPTX schedule. NVVM IR Verifier is the downstream consumer that re-checks nvvm.kernel on every CDP launch target. cicc comparison documents the shared NVPTX backend lineage these passes inherited.
libdevice Overview
Abstract
libdevice is the NVIDIA device math library: a precompiled LLVM bitcode module shipped with CUDA that supplies device-side bodies for hundreds of math functions — __nv_sin, __nv_cos, __nv_exp, __nv_log, __nv_pow, __nv_sqrt, __nv_div_*, the full transcendental and special-function set, and their f/d variants. Each body is written in LLVM IR with NVPTX-aware patterns, parameterized on per-module configuration (flush-to-zero mode, IEEE divide/sqrt precision, fast integer division) through __nvvm_reflect("KEY") queries. TileIR lowering emits direct calls to these declarations whenever a GPU math operation is better represented as a library call than as a single intrinsic. Before NVPTX code generation, every __nv_* declaration must resolve to a concrete bitcode body — unresolved external declarations are a backend error.
The integration is a four-pass correctness sequence: link the library bitcode into the user module so the __nv_* declarations gain definitions, fold every __nvvm_reflect("KEY") call site into a configuration-derived integer constant, run an always-inliner pass that fires on every libdevice function, and simplify the now-resolved configuration branches plus garbage-collect the unused library helpers. The sequence runs at every optimization level — even -O0 — because resolution is required for correctness rather than for speed.
Pipeline
LLVM module with calls to __nv_* declarations
|
| link embedded or supplied libdevice bitcode
v
LLVM module with __nv_* definitions
|
| fold __nvvm_reflect("KEY") queries
v
configuration-specialized libdevice bodies
|
| always-inline libdevice calls into kernels
v
kernel bodies containing selected math implementations
|
| simplify branches, fold constants, remove unused library functions
v
LLVM module ready for NVPTX code generation
The effective order matters. Libdevice bodies contain reflection queries, so reflection folding must see the linked bodies. Inlining should run after reflection so dead configuration arms are already easy to remove. Constant folding and global dead-code elimination then remove unused paths and unused library definitions.
Reflection
__nvvm_reflect is a compile-time query mechanism. Libdevice bodies call it with string keys such as "__CUDA_FTZ", "__CUDA_PREC_DIV", "__CUDA_PREC_SQRT", "__CUDA_FAST_INT_DIV", "__CUDA_ARCH", and their variants. The reflect pass replaces those calls with integer constants drawn from a three-source resolver: module-level metadata (!nvvm.reflection and module flags), command-line overrides (-mllvm -nvvm-reflect-add=KEY=VAL), and target defaults. The result of folding is dead-branch material: each query lives inside an if (__nvvm_reflect("KEY")) { … } guard inside the library body, and once the call is replaced by an i32 constant such as 0 or 1, normal IR simplification eliminates the unreachable arm.
PreservedAnalyses NVVMReflectPass::run(Module &m, ModuleAnalysisManager &) {
  DenseMap<StringRef, int> resolved;
  /* seed lowest priority first; each later source overwrites earlier entries */
  seed_from_target_defaults(m, resolved);   /* SM-derived defaults */
  seed_from_module_metadata(m, resolved);   /* !nvvm.reflection / module flags */
  seed_from_command_line(resolved);         /* -nvvm-reflect-add=KEY=VAL */
  for (StringRef name : {"__nvvm_reflect", "__nvvm_reflect_ocl",
                         "_Z20__nvvm_reflectPKc", /* …5 mangled variants… */}) {
    Function *f = m.getFunction(name);
    if (!f) continue;
    for (User *u : llvm::make_early_inc_range(f->users())) {
      auto *call = cast<CallInst>(u);
      StringRef key = require_constant_cstring(call->getArgOperand(0));
      int v = resolved.try_emplace(key, 0).first->second;  /* unknown → 0, recorded once */
      call->replaceAllUsesWith(ConstantInt::get(call->getType(), v, /*IsSigned=*/false));
      call->eraseFromParent();
    }
    if (f->use_empty()) f->eraseFromParent();
  }
  return PreservedAnalyses::none();
}
Unknown keys resolve to zero, and the resolver records the zero so the same unknown key folds consistently at every call site. A reimplementer wanting bug-for-bug compatibility must seed the resolver from the same three sources in the same priority order — module metadata wins over target defaults, command-line overrides win over both. Diverging from the zero-default for unknown keys is observable: libdevice bodies pick different approximation paths based on whether a flag resolves to 0 or 1.
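The priority rule and the zero default can be illustrated without LLVM. The following is a minimal sketch using std::map; ReflectMap, merge_reflect_sources, and lookup_or_insert_zero are hypothetical names, and the overwrite-in-order merge is an assumption about how the three sources combine, not the pass's actual code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the pass's reflection map.
using ReflectMap = std::map<std::string, int>;

// Merge the sources from lowest to highest priority. A later source
// overwrites an earlier value for the same key, so command-line overrides
// beat module metadata, which beats target defaults.
inline ReflectMap merge_reflect_sources(const ReflectMap &target_defaults,
                                        const ReflectMap &module_metadata,
                                        const ReflectMap &command_line) {
  ReflectMap merged = target_defaults;
  for (const auto &kv : module_metadata) merged[kv.first] = kv.second;
  for (const auto &kv : command_line) merged[kv.first] = kv.second;
  return merged;
}

// Unknown keys fold to zero, and the zero is stored so that every later
// call site naming the same key sees the same value.
inline int lookup_or_insert_zero(ReflectMap &values, const std::string &key) {
  return values.emplace(key, 0).first->second;
}
```

With a target default of `__CUDA_FTZ=0`, metadata of `__CUDA_FTZ=1`, and a command-line entry of `__CUDA_FTZ=0`, the merged value is 0 because the command line is applied last.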
Link, inline, simplify
Libdevice linking is a module-construction step rather than an optimization pass. The driver parses the embedded bitcode blob into an llvm::Module, then runs the LLVM linker in only-needed mode (Linker::Flags::LinkOnlyNeeded) so only functions reachable from the user module are pulled in. Once linked, the user module gains concrete bodies for every previously external __nv_* declaration. Each libdevice body carries the alwaysinline attribute, so a dedicated always-inliner pass, separate from the optimization-level inliner, fires on every call site regardless of -O0/-O1. After inlining, the configuration constants left behind by NVVMReflectPass propagate into the inlined bodies; the subsequent SimplifyCFG, SCCP, InstCombine, and global-DCE sequence collapses the now-dead approximation arms and removes the library functions that no longer have callers.
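bool prepare_libdevice(Module &user, MemoryBufferRef libdevice_bc, ReflectConfig cfg) {
  /* 1. parse and link; LinkOnlyNeeded keeps the user module small */
  Expected<std::unique_ptr<Module>> lib = parseBitcodeFile(libdevice_bc, user.getContext());
  if (!lib) {
    consumeError(lib.takeError());               /* unreadable bitcode: give up */
    return false;
  }
  if (Linker::linkModules(user, std::move(*lib), Linker::Flags::LinkOnlyNeeded))
    return false;                                /* linkModules returns true on error */
  /* 2. resolve every __nvvm_reflect call into a configuration-derived constant */
  ModulePassManager mpm;
  mpm.addPass(NVVMReflectPass(cfg));
  /* 3. always-inline libdevice bodies into their call sites */
  mpm.addPass(AlwaysInlinerPass(/*InsertLifetime=*/false));
  /* 4. simplify the configuration-folded branches and garbage-collect leftovers */
  FunctionPassManager fpm;
  fpm.addPass(SimplifyCFGPass());
  fpm.addPass(SCCPPass());
  fpm.addPass(InstCombinePass());
  mpm.addPass(createModuleToFunctionPassAdaptor(std::move(fpm)));
  mpm.addPass(GlobalDCEPass());
  /* the new pass manager needs registered, cross-wired analysis managers */
  LoopAnalysisManager lam;
  FunctionAnalysisManager fam;
  CGSCCAnalysisManager cgam;
  ModuleAnalysisManager mam;
  PassBuilder pb;
  pb.registerModuleAnalyses(mam);
  pb.registerCGSCCAnalyses(cgam);
  pb.registerFunctionAnalyses(fam);
  pb.registerLoopAnalyses(lam);
  pb.crossRegisterProxies(lam, fam, cgam, mam);
  mpm.run(user, mam);
  return verify_no_unresolved_libdevice_declarations(user);
}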
At higher optimization levels the standard inliner, instruction combiner, SCCP, GVN, and global DCE refine the result further. At -O0 the four-pass sequence still runs because resolution is a correctness requirement: an unresolved __nv_sin declaration reaches the NVPTX backend as an external symbol the backend cannot lower into PTX.
Constant folding
After linking and reflection, many libdevice call paths reduce to compile-time constants or short arithmetic chains. Constant folding evaluates calls with constant operands (__nv_sin(0.0) collapses to 0.0), prunes the dead if (FTZ) { … } else { … } arms that reflection just selected, and global-DCE removes the library helpers whose only callers have been inlined away. This matters most for math functions with multiple approximation paths behind target-mode checks: without folding, the IR retains the unselected approximation as dead code the backend then has to schedule around.
The goal is not to prove every math call at compile time. The goal is to specialize the library to the selected target and remove impossible branches before the backend sees them.
Cross-references
The reflection key set, the three-source resolver, and the post-reflect constant-conditional cleanup pass are documented in NVVMReflect Mechanism — Three var-map sources and NVVMReflect Mechanism — Post-reflect cleanup. The mapping from libdevice function names to LLVM intrinsic IDs — used by the constant folder to recognize math calls without reading their bodies — is documented in Intrinsic ID Switch and Name Table — libdevice suffix name table. The surrounding LLVM math-optimization flow, including the crosswalk between libdevice calls and target-specific intrinsics, is covered in Math Pass Pipeline and Crosswalk — Full math-op crosswalk. The NVPTX bring-up that links the embedded libdevice bitcode at module construction time is documented in NVPTX Bring-up and Target Init.
NVVMReflect Mechanism
Abstract
The tileiras binary hosts a complete copy of the LLVM NVVMReflectPass machinery — the device-side mechanism that resolves the __nvvm_reflect("KEY") intrinsic into a compile-time integer so that the libdevice bitcode bodies can specialise themselves on per-module decisions (FTZ mode, target SM, IEEE divide/sqrt precision, etc.) without runtime branches. The pass takes an LLVM module immediately after the libdevice bitcode has been linked in, walks every call to __nvvm_reflect / __nvvm_reflect_ocl plus five mangled variants, looks the key string up in a DenseMap<StringRef,int> populated from three orthogonal sources, replaces the call with a ConstantInt::get(callType, value, /*IsSigned=*/false), and erases the now-useless declarations. Missing keys default to 0 and are silently inserted into the map so a single key is folded consistently across every call site that names it.
This page covers the end-to-end registration, CLI surface, var-map population, replacement loop, and post-reflect constant-conditional cleanup pass. The legacy pass-manager entry is fail-fast, so the reachable path is the new pass manager.
New-PM registration
NVVMReflectPass is registered as an NVPTX-specific new-PM pass under the CLI key nvvm-reflect. It lives alongside the target pass family: generic-to-nvvm, nvptx-lower-ctor-dtor, nvptx-set-global-array-alignment, register-pressure-analysis, nvptx-aa, nvvm-intr-range, lower-struct-args, nvptx-lower-args, and related NVPTX preparation passes. The split between the NVPTX-specific registry and the generic LLVM pass registry mirrors upstream LLVM's source-tree split.
CLI surface
The pass exposes two options and one alias:
| CLI argument | Type | Help | Default |
|---|---|---|---|
| nvvm-reflect-enable | cl::opt<bool> | "NVVM reflection, enabled by default" | true |
| nvvm-reflect-add | cl::list<std::string> | "A key=value pair. Replace __nvvm_reflect(name) with value." | empty list |
| R | cl::alias to nvvm-reflect-add | inherited | — |
nvvm-reflect-add accepts entries of the form name=<int>. The alias follows normal LLVM cl::alias validation rules.
Runtime pipeline
NVVMReflect::runOnModule fires once per module. It first builds the reflection map, then rewrites calls to __nvvm_reflect, __nvvm_reflect_ocl, and five ABI-mangled variants used by libdevice and C++ frontend paths. The pass reports that it changed the module if any call site was replaced.
Three var-map sources
populateVarMap merges three sources into one reflection map. The important rule is ordering:
named metadata is read first, the FTZ module flag is read second, and command-line overrides are
read last. A later source overwrites an earlier value for the same key, so -nvvm-reflect-add is
the user-visible escape hatch for testing or forcing a libdevice configuration.
| Order | Source | Key extraction | Value extraction |
|---|---|---|---|
| A | !nvvm.reflection named metadata | operand 0 as an MDString | operand 1 as a signed integer constant |
| B | module flag nvvm-reflect-ftz | remapped to __CUDA_FTZ | same signed integer normalization |
| C | -nvvm-reflect-add name=value / -R name=value | substring before = | decimal integer parsed after = |
Metadata and module-flag values are treated as signed integers. Narrow integer constants are
sign-extended before insertion; ordinary 32-bit reflect values therefore behave like plain C
int values. CLI values are decimal only. Malformed CLI entries are reported as option errors,
not compiler crashes:
- empty key before =
- missing value after =
- non-integer value after =
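The three malformed cases can be checked with a small standalone parser. This is a minimal sketch, not the tool's option handling (which goes through LLVM's cl::list machinery); the parse_reflect_add shown here is a hypothetical helper, and treating an entry with no = at all as malformed is an assumption.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>

// Parse one "name=<int>" entry. Returns std::nullopt for the malformed
// cases: empty key, missing value, or a value that is not a decimal integer.
inline std::optional<std::pair<std::string, int>>
parse_reflect_add(std::string_view entry) {
  const std::size_t eq = entry.find('=');
  if (eq == std::string_view::npos || eq == 0) return std::nullopt;  // no '=' or empty key
  const std::string_view name = entry.substr(0, eq);
  const std::string_view text = entry.substr(eq + 1);
  if (text.empty()) return std::nullopt;                             // missing value
  int value = 0;
  const char *first = text.data();
  const char *last = first + text.size();
  const auto result = std::from_chars(first, last, value, 10);       // decimal only
  if (result.ec != std::errc() || result.ptr != last) return std::nullopt;  // non-integer
  return std::make_pair(std::string(name), value);
}
```

Requiring `result.ptr == last` rejects trailing junk such as `1x`, which a prefix-only parse would silently accept as 1.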
static int64_t normalize_reflect_int(const ConstantInt *value) {
  unsigned bits = constant_int_width(value);
  uint64_t raw = constant_int_zext(value);
  if (bits < 64) {
    unsigned shift = 64 - bits;
    return ((int64_t)(raw << shift)) >> shift;
  }
  return (int64_t)raw;
}
static void populate_reflect_map(Module *module, const ReflectOptions *options, ReflectMap *values) {
  for (MetadataEntry entry : nvvm_reflection_metadata(module)) {
    reflect_map_set(values, metadata_key(entry), normalize_reflect_int(metadata_value(entry)));
  }
  if (ConstantInt *ftz = module_flag_int(module, "nvvm-reflect-ftz")) {
    reflect_map_set(values, "__CUDA_FTZ", normalize_reflect_int(ftz));
  }
  for (StringRef option : options->reflect_add) {
    ParsedReflectOption parsed = parse_reflect_add(option);
    reflect_map_set(values, parsed.name, parsed.value);
  }
}
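Sign extension of narrow constants has one consequence worth seeing concretely: under the rule as described, a 1-bit value of 1 becomes -1, not 1. The sketch below is a standalone variant of the normalization that uses the XOR-and-subtract idiom instead of the shift pair, which avoids right-shifting a negative value; sign_extend is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of `raw` to 64 bits.
// Precondition: 1 <= bits <= 64.
inline std::int64_t sign_extend(std::uint64_t raw, unsigned bits) {
  if (bits >= 64) return static_cast<std::int64_t>(raw);
  const std::uint64_t mask = (std::uint64_t{1} << bits) - 1;   // keep the low bits
  const std::uint64_t sign = std::uint64_t{1} << (bits - 1);   // position of the sign bit
  const std::uint64_t low = raw & mask;
  // Flipping the sign bit and subtracting it back maps the top half of the
  // range onto negative values (two's complement).
  return static_cast<std::int64_t>((low ^ sign) - sign);
}
```

For ordinary i32 reflect values the function is the identity on small non-negative numbers, which is why the rule only becomes visible for narrow types or values with the top bit set.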
FTZ module flag
__CUDA_FTZ is the only reflect key with a dedicated module-flag path. The compiler reads the
module flag named nvvm-reflect-ftz, normalizes its integer value, and stores it under the
libdevice key __CUDA_FTZ. A later -nvvm-reflect-add __CUDA_FTZ=<int> still overrides it.
The precision keys, such as __CUDA_PREC_DIV and __CUDA_PREC_SQRT, do not have equivalent
module-flag shortcuts. They enter the map through !nvvm.reflection or through explicit CLI
overrides.
Missing keys are deliberately benign. If a call names a key that is absent from every source, the lookup path creates a zero-valued entry and folds the call to integer zero. That makes unsupported or future libdevice probes deterministic instead of fatal.
Replacement loop
The rewriter runs once for each accepted reflect function spelling: __nvvm_reflect,
__nvvm_reflect_ocl, and five ABI-mangled forms observed in C++ libdevice paths. For each
function, it walks every use and requires a direct call with exactly one argument. The argument
must reduce to a constant, null-terminated string after stripping pointer casts and the simple
constant-expression forms produced for global string literals.
Malformed calls are fatal IR errors. The public diagnostics are intentionally specific:
- __nvvm_reflect can only be used in a call instruction
- __nvvm_reflect requires exactly one argument
- __nvvm_reflect argument must be a constant string
- __nvvm_reflect argument must be a string constant
- __nvvm_reflect argument must be a null-terminated string
- __nvvm_reflect argument cannot be empty
For a valid call, the pass reads the key, looks up the integer value with default zero, creates a constant of the call's result type, replaces every use of the call with that constant, erases the call, and removes the now-unused declaration.
static StringRef read_reflect_key(Value *arg) {
  Value *base = strip_pointer_casts(arg);
  if (ConstantExpr *expr = dyn_cast_constant_expr(base)) {
    base = peel_global_string_gep(expr);
  }
  ConstantDataSequential *data = dyn_cast_constant_data(base);
  if (data == NULL) {
    fatal("__nvvm_reflect argument must be a string constant");
  }
  if (!constant_data_is_c_string(data)) {
    fatal("__nvvm_reflect argument must be a null-terminated string");
  }
  StringRef key = constant_data_as_c_string(data);
  if (key.empty()) {
    fatal("__nvvm_reflect argument cannot be empty");
  }
  return key;
}
static bool replace_reflect_calls(Module *module, StringRef name, ReflectMap *values) {
  Function *function = module_get_function(module, name);
  if (function == NULL) {
    return false;
  }
  bool changed = false;
  SmallVector<CallInst *, 16> calls = collect_reflect_calls(function);
  for (CallInst *call : calls) {
    StringRef key = read_reflect_key(call_arg(call, 0));
    int64_t value = reflect_map_lookup_or_insert_zero(values, key);
    Constant *replacement = constant_int(call_type(call), value, false);
    replace_all_uses_with(call, replacement);
    erase_instruction(call);
    changed = true;
  }
  if (function_has_no_uses(function)) {
    erase_function(function);
  }
  return changed;
}
Post-reflect cleanup — nvvm-reflect-pp
The reflect rewrite usually exposes branches whose conditions are now constants. A libdevice body
often has the shape "if __nvvm_reflect(KEY) equals N, use this implementation; otherwise use the
fallback." Once the call is replaced, those branches are no longer semantic choices; they are dead
IR structure.
nvvm-reflect-pp runs immediately after reflection as a small function pass. It folds constant
conditional branches, drops unreachable successors, and invalidates the affected control-flow
analyses. Scheduling the cleanup next to reflection keeps the rest of the optimization pipeline
from repeatedly rediscovering the same trivial facts, and it gives the NVPTX backend a smaller,
more predictable CFG even in low-optimization pipelines.
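The effect of the cleanup can be shown on a toy control-flow graph. The sketch below is an illustration of constant-conditional folding followed by a reachability sweep, not the nvvm-reflect-pp implementation; Block and fold_and_mark_reachable are hypothetical names, and block 0 is assumed to be the entry.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy basic block: at most two successors, identified by index (-1 = none).
struct Block {
  std::optional<bool> const_cond;  // set when the branch condition is a known constant
  int if_true = -1;
  int if_false = -1;
};

// Fold every constant conditional branch into an unconditional one, then
// report which blocks are still reachable from the entry block (index 0).
inline std::vector<bool> fold_and_mark_reachable(std::vector<Block> &cfg) {
  for (Block &b : cfg) {
    if (!b.const_cond) continue;
    const int taken = *b.const_cond ? b.if_true : b.if_false;
    b.if_true = taken;     // keep only the taken edge
    b.if_false = -1;       // the other edge is dropped
    b.const_cond.reset();
  }
  std::vector<bool> reachable(cfg.size(), false);
  std::vector<int> worklist;
  if (!cfg.empty()) {
    reachable[0] = true;
    worklist.push_back(0);
  }
  while (!worklist.empty()) {
    const Block &b = cfg[static_cast<std::size_t>(worklist.back())];
    worklist.pop_back();
    for (int succ : {b.if_true, b.if_false}) {
      if (succ >= 0 && !reachable[static_cast<std::size_t>(succ)]) {
        reachable[static_cast<std::size_t>(succ)] = true;
        worklist.push_back(succ);
      }
    }
  }
  return reachable;
}
```

If block 0 branches on a reflect result that folded to 0, its true successor loses its only incoming edge and shows up as unreachable, which is the point at which a real pass would delete it.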
Reimplementation Notes
A compatible implementation has to preserve three behavioral contracts:
- merge metadata, FTZ module flag, and CLI overrides in that exact order
- fold missing keys to zero, not to an error
- run constant-conditional cleanup directly after reflection
bool run_nvvm_reflect(Module *module, const ReflectOptions *options) {
  if (!options->enable_reflect) {
    return false;
  }
  ReflectMap values = reflect_map_create();
  populate_reflect_map(module, options, &values);
  bool changed = false;
  for (StringRef name : reflect_function_names()) {
    changed |= replace_reflect_calls(module, name, &values);
  }
  if (changed) {
    simplify_constant_conditionals(module);
  }
  reflect_map_destroy(&values);
  return changed;
}
Cross-references
The end-to-end libdevice integration that drives NVVMReflectPass is documented in libdevice Overview — Pipeline and libdevice Overview — Link, inline, simplify. The constant-folding consumer that sees reflect-stripped libdevice bodies is Intrinsic ID Switch and Name Table. The downstream math lowering whose __CUDA_PREC_* / __CUDA_FTZ arms collapse after reflection is documented in Math Pass Pipeline and Crosswalk — Cases that skip libdevice entirely. The user-facing precision model that composes reflect-driven libdevice gating with per-op fast-math flags, FTZ, and FP8 cast semantics is documented in Fast-Math and Numerical Precision.
libdevice __nv_* Symbol Catalog
Abstract
The libdevice bitcode shipped with CUDA exposes roughly 350 device-side math entry points behind the __nv_ prefix. They are the implementation surface that MLIR math.* / arith.* lowering and CUDA-C front ends target by name; every call site appears in the LLVM module as declare <type> @__nv_<name>(<args>) until Linker::linkModules pulls in the bitcode body and the always-inliner folds it into the caller. This page catalogues those symbols by family, names the reflection keys their bodies query, identifies the NVPTX hardware intrinsic each body decays into after NVVMReflectPass folds the configuration constants, and pins down the rounding-mode and FTZ matrix the symbols collectively cover.
The Intrinsic ID Switch and Name Table page documents how the LLVM constant folder classifies surviving call sites by name; the Math Pass Pipeline and Crosswalk page documents the MLIR-side rewrite from math.<op> to __nv_<name>. This page is the inventory in between — the names themselves, the bodies they unwrap to, and the reasons the body chooses one PTX form over another.
Naming convention
Every libdevice symbol decomposes into prefix, base name, type suffix, and optional rounding-mode suffix:
__nv_ <base> [<type-suffix>] [<rounding-mode>]
| Component | Form | Examples | Notes |
|---|---|---|---|
| Prefix | __nv_ | every entry | identifies device math; trips libdevice linker pattern |
| Base name | C99 / IEEE-754 root | sin, cos, exp, log, sqrt, fma, pow, rint | shared with libm; semantics match unless reflection keys override |
| Rounding mode | _rn, _rz, _ru, _rd | __nv_dadd_rn, __nv_fdiv_ru | optional; absent forms imply round-to-nearest-even |
| Type suffix | f, d (or none) | __nv_sinf, __nv_sin, __nv_fabs (default f64) | f = float, d or bare = double, h/bf16 absent |
The full grammar admits four orthogonal axes: input domain (f32/f64/i32/i64/u32/u64), rounding mode, FTZ behaviour, and approximation policy. A name like __nv_dadd_rn reads as "double add, round-to-nearest-even, full precision"; __nv_fast_powf reads as "float pow, fast path approximation, may flush denormals". Half-precision (f16, bf16) is intentionally absent — MLIR OpToFuncCallLowering promotes to f32 before the libdevice call and demotes via arith.truncf after, so libdevice never sees the narrow type.
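Only part of this decomposition can be done mechanically. The sketch below strips the prefix and an optional rounding-mode suffix, which are unambiguous, and leaves the rest as a stem; NvSymbol and split_nv_symbol are hypothetical names. It deliberately does not try to peel a trailing f, because that is ambiguous without a table of known base names: __nv_erf is the f64 error function, not a float variant of a base called "er".

```cpp
#include <cassert>
#include <string>

// Result of splitting a libdevice symbol name.
struct NvSymbol {
  std::string stem;      // base name, possibly still carrying a type marker
  std::string rounding;  // "rn", "rz", "ru", "rd", or empty
  bool valid;            // false when the __nv_ prefix is missing
};

inline NvSymbol split_nv_symbol(const std::string &symbol) {
  const std::string prefix = "__nv_";
  if (symbol.compare(0, prefix.size(), prefix) != 0) return {"", "", false};
  std::string stem = symbol.substr(prefix.size());
  std::string rounding;
  for (const char *mode : {"_rn", "_rz", "_ru", "_rd"}) {
    const std::string suffix = mode;
    if (stem.size() > suffix.size() &&
        stem.compare(stem.size() - suffix.size(), suffix.size(), suffix) == 0) {
      rounding = suffix.substr(1);                   // drop the leading underscore
      stem.erase(stem.size() - suffix.size());
      break;
    }
  }
  return {stem, rounding, true};
}
```

For `__nv_dadd_rn` this yields the stem `dadd` and rounding mode `rn`; for `__nv_sinf` it yields the stem `sinf` with no rounding mode.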
Family inventory
The catalogue groups symbols by the IEEE-754 / C99 root family they belong to. Counts are the entries pulled in by a Linker::Flags::LinkOnlyNeeded link against a kernel that touches every published math intrinsic; bitcode versions with optional families may not ship a body for every entry in the table.
Trigonometric — circular
| Symbol family | f32 | f64 | Fast path | Reflection key | Decay (when applicable) |
|---|---|---|---|---|---|
| Sine | __nv_sinf | __nv_sin | __nv_fast_sinf | __CUDA_FTZ, __CUDA_ARCH | sin.approx.f32 (FTZ); Payne–Hanek otherwise |
| Cosine | __nv_cosf | __nv_cos | __nv_fast_cosf | __CUDA_FTZ | cos.approx.f32 (FTZ); Payne–Hanek otherwise |
| Tangent | __nv_tanf | __nv_tan | __nv_fast_tanf | __CUDA_FTZ | sin.approx/cos.approx quotient on FTZ paths |
| Sine + cosine | __nv_sincosf | __nv_sincos | __nv_fast_sincosf | __CUDA_FTZ | fuses both PTX approximations; returns by pointer outs |
| Sine of π·x | __nv_sinpif | __nv_sinpi | — | — | scaled Payne–Hanek; argument is in half-cycles |
| Cosine of π·x | __nv_cospif | __nv_cospi | — | — | scaled Payne–Hanek; argument is in half-cycles |
| Arc sine | __nv_asinf | __nv_asin | — | — | libdevice-only; polynomial in 1 - x*x |
| Arc cosine | __nv_acosf | __nv_acos | — | — | uses asin then subtracts from π/2 |
| Arc tangent | __nv_atanf | __nv_atan | — | — | range-reduced rational approximation |
| Two-arg arc tan | __nv_atan2f | __nv_atan2 | — | — | quadrant fixup on top of atan; matches C atan2 |
The __nv_fast_* aliases bind directly to the PTX approximate intrinsic (sin.approx.f32, cos.approx.f32) and skip Payne–Hanek range reduction; they are reachable through the fast-math path or by name, never through MLIR math.* lowering on default settings.
Trigonometric — hyperbolic and inverse hyperbolic
| Symbol family | f32 | f64 | Reflection key | Decay |
|---|---|---|---|---|
| Hyperbolic sine | __nv_sinhf | __nv_sinh | — | (__nv_exp(x) - __nv_exp(-x)) * 0.5 with overflow guard |
| Hyperbolic cosine | __nv_coshf | __nv_cosh | — | (__nv_exp(x) + __nv_exp(-x)) * 0.5 with overflow guard |
| Hyperbolic tangent | __nv_tanhf | __nv_tanh | — | rational approximation; sm_75+ uses tanh.approx.f32 when present |
| Inverse hyperbolic sine | __nv_asinhf | __nv_asinh | — | log(x + sqrt(x*x + 1)) with cancellation fix-up |
| Inverse hyperbolic cosine | __nv_acoshf | __nv_acosh | — | log(x + sqrt(x*x - 1)) |
| Inverse hyperbolic tangent | __nv_atanhf | __nv_atanh | — | 0.5 * log1p(2x/(1-x)) |
Exponential family
| Symbol family | f32 | f64 | Fast path | Reflection key | Decay |
|---|---|---|---|---|---|
| Base-e exp | __nv_expf | __nv_exp | __nv_fast_expf | __CUDA_FTZ | ex2.approx.f32 (exp(x) = ex2(x * 1.4426950408)) |
| Base-2 exp | __nv_exp2f | __nv_exp2 | — | __CUDA_FTZ | ex2.approx.f32 directly |
| Base-10 exp | __nv_exp10f | __nv_exp10 | __nv_fast_exp10f | __CUDA_FTZ | ex2.approx after * log2(10) |
| exp(x) - 1 | __nv_expm1f | __nv_expm1 | — | — | libdevice-only; Estrin-form polynomial near 0 |
| Natural log | __nv_logf | __nv_log | __nv_fast_logf | __CUDA_FTZ | lg2.approx.f32 then * 0.6931471806 |
| Base-2 log | __nv_log2f | __nv_log2 | — | __CUDA_FTZ, nvptx-approx-log2f32 | lg2.approx.f32 directly |
| Base-10 log | __nv_log10f | __nv_log10 | __nv_fast_log10f | __CUDA_FTZ | lg2.approx then * 0.30102999566 |
| log(1 + x) | __nv_log1pf | __nv_log1p | — | — | libdevice-only; minimax polynomial |
| Power | __nv_powf | __nv_pow | __nv_fast_powf | __CUDA_FTZ | lg2.approx + ex2.approx composition |
| Integer power | __nv_powif | __nv_powi | — | — | repeated-squaring; integer exponent |
| pow(x, n) for int n | __nv_fast_powf (alias) | — | — | — | uses lg2/ex2 regardless of integer-ness |
The fast-path aliases are the entry points the fast-math pragma routes math ops through; they short-circuit the precision-checking guard arms and emit the bare ex2.approx.f32 / lg2.approx.f32 pair without finite-input cleanup.
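The base-change identities in the Decay column can be checked on the host. The following is a minimal sketch that uses std::exp2 and std::log2 as stand-ins for ex2.approx.f32 and lg2.approx.f32; the function names are hypothetical, and the host functions are more accurate than the hardware approximations, so this shows the algebra only, not the device error bounds.

```cpp
#include <cassert>
#include <cmath>

// exp(x) = 2^(x * log2(e))
inline float exp_via_ex2(float x) { return std::exp2(x * 1.4426950408f); }

// ln(x) = log2(x) * ln(2)
inline float log_via_lg2(float x) { return std::log2(x) * 0.6931471806f; }

// x^y = 2^(y * log2(x)), valid for x > 0; no special-case handling
inline float pow_via_lg2_ex2(float x, float y) { return std::exp2(y * std::log2(x)); }

// Relative-error comparison helper for the checks.
inline bool close_rel(float got, float want, float tol) {
  return std::fabs(got - want) <= tol * std::fabs(want);
}
```

The pow composition only makes sense for positive bases, since log2 of a non-positive value is not a finite real number.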
Power-of-2 and integer-shift helpers
| Symbol family | f32 | f64 | Notes |
|---|---|---|---|
| ldexp(x, n) | __nv_ldexpf | __nv_ldexp | integer scale n is i32; result is x * 2^n |
| frexp(x, *n) | __nv_frexpf | __nv_frexp | mantissa returned, exponent written through pointer |
| scalbn(x, n) | __nv_scalbnf | __nv_scalbn | identical to ldexp on IEEE-754 binary radix |
| scalbln(x, l) | __nv_scalblnf | __nv_scalbln | long exponent; libdevice clamps before scaling |
| logb(x) | __nv_logbf | __nv_logb | floor(log2(\|x\|)) |
| ilogb(x) | __nv_ilogbf | __nv_ilogb | int exponent; raises domain error inline |
| nextafter(x, y) | __nv_nextafterf | __nv_nextafter | bitwise next representable; respects denormal direction |
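The relationships among these helpers are easiest to see on the host C++ `<cmath>` functions of the same names, used here as stand-ins for the libdevice bodies. A minimal sketch; split_and_rebuild is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Split x into mantissa and exponent with frexp, then rebuild it with ldexp.
// frexp returns a mantissa with magnitude in [0.5, 1) and writes the
// exponent through the pointer, so x == mantissa * 2^exponent.
inline double split_and_rebuild(double x, double *mantissa, int *exponent) {
  *mantissa = std::frexp(x, exponent);
  return std::ldexp(*mantissa, *exponent);
}
```

Note that the frexp exponent is one larger than the logb exponent for the same input, because frexp normalizes the mantissa to [0.5, 1) while logb assumes [1, 2).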
Rounding and sign manipulation
| Symbol | Type | Decay |
|---|---|---|
| __nv_floorf / __nv_floor | round toward -∞ | cvt.rmi.f32.f32 (f32); libdevice body (f64) |
| __nv_ceilf / __nv_ceil | round toward +∞ | cvt.rpi.f32.f32 (f32); libdevice body (f64) |
| __nv_truncf / __nv_trunc | round toward 0 | cvt.rzi.f32.f32 (f32); libdevice body (f64) |
| __nv_roundf / __nv_round | round half-away-from-zero | libdevice-only — PTX has no matching mode |
| __nv_rintf / __nv_rint | round to nearest (current rounding mode) | cvt.rni.f32.f32 (default IEEE) |
| __nv_nearbyintf / __nv_nearbyint | rint without inexact flag | same as rint; libdevice flag handling differs |
| __nv_lroundf / __nv_lround | round to long | cvt.rni.s32.f32 after range check |
| __nv_llroundf / __nv_llround | round to long long | cvt.rni.s64.f64 after range check |
| __nv_lrintf / __nv_lrint | rint to long | cvt.rni.s32.f32 |
| __nv_llrintf / __nv_llrint | rint to long long | cvt.rni.s64.f64 |
| __nv_copysignf / __nv_copysign | sign transfer | bit op; folds to llvm.copysign.* |
| __nv_fabsf / __nv_fabs | absolute value | bit-AND mask; folds to llvm.fabs.* or abs.f32 |
| __nv_signbitf / __nv_signbitd | sign-bit test | shift-right of bit pattern |
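The difference between round and rint only shows at exact halfway cases. The sketch below uses the host `<cmath>` functions as stand-ins and assumes the default round-to-nearest floating-point environment; the wrapper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Ties go away from zero, as for __nv_round.
inline double round_half_away(double x) { return std::round(x); }

// Ties go to the even neighbour under the default rounding mode, as for
// __nv_rint / __nv_nearbyint.
inline double round_half_even(double x) { return std::nearbyint(x); }
```

Because a round-to-nearest-even conversion sends 2.5 to 2 while round must return 3, a half-away-from-zero result cannot be produced by that conversion mode alone, which is consistent with the table listing round as libdevice-only.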
Min/max and classification
| Symbol | Semantics | Decay |
|---|---|---|
| __nv_fminf / __nv_fmin | IEEE-754 minNum | min.f32/min.f64 on sm_80+; libdevice body otherwise |
| __nv_fmaxf / __nv_fmax | IEEE-754 maxNum | max.f32/max.f64 on sm_80+; libdevice body otherwise |
| __nv_fminimumf / __nv_fminimum | IEEE-754-2019 minimum (NaN-propagating) | bit ops + NaN check |
| __nv_fmaximumf / __nv_fmaximum | IEEE-754-2019 maximum (NaN-propagating) | bit ops + NaN check |
| __nv_isfinitef / __nv_isfinited | finite predicate | bit arithmetic on exponent field |
| __nv_isinff / __nv_isinfd | infinite predicate | bit arithmetic on exponent + mantissa |
| __nv_isnanf / __nv_isnand | NaN predicate | bit arithmetic; matches IEEE-754 quiet/signaling-NaN definition |
| __nv_finitef / __nv_finite | legacy isfinite alias | aliased to __nv_isfinitef/__nv_isfinited |
The min/max divergence is the most observable one. fmin/fmax follow IEEE-754-2008's "minNum" rule that returns the non-NaN operand when exactly one operand is NaN; fminimum/fmaximum follow IEEE-754-2019's "minimum" rule that returns NaN whenever any operand is NaN. The MLIR arith.minnumf and arith.maxnumf ops route to fmin/fmax; there are no MLIR ops covering fminimum/fmaximum, only direct front-end calls.
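The two NaN rules are short enough to write out. This is a minimal host-side sketch of the semantics described above, not the libdevice bodies; min_num and minimum_2019 are hypothetical names, and the signed-zero ordering in minimum_2019 follows the IEEE-754-2019 definition.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// minNum rule: when exactly one operand is NaN, return the other one.
inline float min_num(float a, float b) {
  if (std::isnan(a)) return b;
  if (std::isnan(b)) return a;
  return a < b ? a : b;
}

// 2019 "minimum" rule: any NaN operand makes the result NaN.
inline float minimum_2019(float a, float b) {
  if (std::isnan(a) || std::isnan(b)) return std::numeric_limits<float>::quiet_NaN();
  if (a == 0.0f && b == 0.0f) return std::signbit(a) ? a : b;  // -0 orders below +0
  return a < b ? a : b;
}
```

The two functions agree on every pair of non-NaN inputs except possibly the sign of a zero result, so the divergence is only observable when a NaN or a signed zero reaches the operation.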
Roots, reciprocals, divides — the precision-keyed family
| Symbol | f32 | f64 | Reflection key | Decay at key=0 | Decay at key=1 |
|---|---|---|---|---|---|
| Square root | __nv_sqrtf | __nv_sqrt | __CUDA_PREC_SQRT | sqrt.approx.f32 | sqrt.rn.f32 |
| Reciprocal sqrt | __nv_rsqrtf | __nv_rsqrt | — | rsqrt.approx.f32 | (same — no precise form) |
| Division | __nv_fdividef | __nv_fdivide | __CUDA_PREC_DIV | div.approx.f32 | div.rn.f32 |
| Reciprocal | __nv_frcp_rn etc. | __nv_drcp_rn etc. | — | rcp.approx.f32 | rcp.rn.f32 |
| Cube root | __nv_cbrtf | __nv_cbrt | — | libdevice-only — polynomial + Newton refinement | (same) |
| Reciprocal cbrt | __nv_rcbrtf | __nv_rcbrt | — | libdevice-only — 1 / cbrt(x) with sign fix | (same) |
| Hypot | __nv_hypotf | __nv_hypot | — | sqrt(x*x + y*y) with overflow guard | (same) |
| Reciprocal hypot | __nv_rhypotf | __nv_rhypot | — | 1 / hypot(x, y) | (same) |
| 3-argument hypot | __nv_norm3df | __nv_norm3d | — | sqrt(x*x + y*y + z*z) | (same) |
| 4-argument hypot | __nv_norm4df | __nv_norm4d | — | same with one more term | (same) |
| n-argument hypot | __nv_normf | __nv_norm | — | loop; pointer + length args | (same) |
__CUDA_PREC_SQRT and __CUDA_PREC_DIV are the two reflection keys with the most observable impact on libdevice output. Their 0 settings trip the approximate hardware path that the SASS engine schedules in a single cycle; their 1 settings replace the call with a software Newton-Raphson refinement on top of the approximate result, costing roughly five additional FMAs per call. The MLIR lowering path picks the key value from module-level !nvvm.reflection metadata seeded by the driver CLI options — tileiras defaults to __CUDA_PREC_DIV=1, __CUDA_PREC_SQRT=1 matching nvcc's default of full IEEE precision.
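The general shape of a Newton-Raphson refinement on top of an approximate reciprocal can be sketched on the host. This is an illustration of the technique only, not the libdevice body or its exact instruction sequence: approx_rcp is a hypothetical stand-in that fakes a low-precision hardware reciprocal by clearing mantissa bits, and the sketch assumes positive, normal, finite inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for an approximate reciprocal: the exact 1/b with the low 12
// mantissa bits cleared, leaving roughly 11 good bits.
inline float approx_rcp(float b) {
  float r = 1.0f / b;
  std::uint32_t bits;
  std::memcpy(&bits, &r, sizeof bits);
  bits &= 0xFFFFF000u;
  std::memcpy(&r, &bits, sizeof r);
  return r;
}

// Divide a by b using one Newton-Raphson step on the reciprocal followed by
// a residual correction on the quotient.
inline float refined_div(float a, float b) {
  float r = approx_rcp(b);
  const float e = std::fma(-b, r, 1.0f);   // reciprocal error: 1 - b*r
  r = std::fma(r, e, r);                   // Newton step: r * (1 + e)
  const float q = a * r;                   // first quotient estimate
  const float rem = std::fma(-b, q, a);    // residual: a - b*q
  return std::fma(rem, r, q);              // corrected quotient
}
```

Each Newton step roughly squares the relative error of the reciprocal, so a seed with about 11 good bits reaches about 22 after one step; the final residual correction recovers the remaining bits.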
Integer arithmetic helpers
| Symbol family | Width | Decay |
|---|---|---|
| __nv_abs | i32 → i32 | (x ^ (x >> 31)) - (x >> 31) — fully inlined |
| __nv_llabs | i64 → i64 | same idiom on 64-bit shift |
| __nv_min / __nv_max | i32 | min.s32 / max.s32 |
| __nv_umin / __nv_umax | u32 | min.u32 / max.u32 |
| __nv_llmin / __nv_llmax | i64 | min.s64 / max.s64 |
| __nv_ullmin / __nv_ullmax | u64 | min.u64 / max.u64 |
| __nv_mul24 | i32 × i32 → i32 | mul24.s32 (24-bit truncated multiply) |
| __nv_umul24 | u32 × u32 → u32 | mul24.u32 |
| __nv_mul64hi | i64 × i64 → i64 (hi half) | mul.hi.s64 |
| __nv_umul64hi | u64 × u64 → u64 (hi half) | mul.hi.u64 |
| __nv_mulhi | i32 × i32 → i32 (hi half) | mul.hi.s32 |
| __nv_umulhi | u32 × u32 → u32 (hi half) | mul.hi.u32 |
| __nv_popc | u32 → i32 | popc.b32 |
| __nv_popcll | u64 → i32 | popc.b64 |
| __nv_clz / __nv_clzll | leading zeros | clz.b32 / clz.b64 |
| __nv_ffs / __nv_ffsll | bit position of LSB | bfind family |
| __nv_brev / __nv_brevll | bit reverse | brev.b32 / brev.b64 |
| __nv_sad / __nv_usad | sum of absolute differences | sad.s32 / sad.u32 |
| __nv_byte_perm | byte permutation | prmt.b32 |
| __nv_funnelshift_l/_lc/_r/_rc | 64-bit funnel shifts | shf.l/r.wrap/clamp.b32 |
The mul24 family is the most architecture-dependent: pre-Volta hardware ran mul24.s32 as a single-issue instruction; sm_70+ runs the full 32-bit mul.lo.s32 at the same throughput, and the libdevice body simply forwards the call. Old CUDA-C code that explicitly calls __mul24 therefore retains the API surface but loses the historical performance benefit.
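What "24-bit truncated multiply" means for the result is worth making concrete, because the upper 8 bits of each operand are ignored rather than checked. The sketch below is a host-side model under the assumption that the signed form sign-extends each operand from bit 23 and returns the low 32 bits of the product; mul24_s32 is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a signed 24-bit multiply returning the low 32 bits.
inline std::int32_t mul24_s32(std::int32_t a, std::int32_t b) {
  auto low24 = [](std::int32_t v) -> std::int64_t {
    const std::uint32_t u = static_cast<std::uint32_t>(v) & 0x00FFFFFFu;  // drop the top 8 bits
    return static_cast<std::int64_t>(u ^ 0x00800000u) - 0x00800000;       // sign-extend from bit 23
  };
  const std::int64_t product = low24(a) * low24(b);                       // fits in 48 bits
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(product));  // keep the low 32 bits
}
```

The model agrees with an ordinary 32-bit multiply whenever both operands fit in 24 signed bits, and diverges as soon as either operand has significant bits above bit 23, so swapping one for the other is only safe for small operands.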
Mixed-mode conversions and float decoders
| Symbol family | Direction | Decay |
|---|---|---|
| __nv_int2float_{rn,rz,ru,rd} | i32 → f32 | cvt.<rnd>.f32.s32 |
| __nv_uint2float_{rn,rz,ru,rd} | u32 → f32 | cvt.<rnd>.f32.u32 |
| __nv_ll2float_{rn,rz,ru,rd} | i64 → f32 | cvt.<rnd>.f32.s64 |
| __nv_ull2float_{rn,rz,ru,rd} | u64 → f32 | cvt.<rnd>.f32.u64 |
| __nv_int2double_rn | i32 → f64 | cvt.f64.s32 (only rn is exact) |
| __nv_double2int_{rn,rz,ru,rd} | f64 → i32 | cvt.<rnd>.s32.f64 |
| __nv_float2int_{rn,rz,ru,rd} | f32 → i32 | cvt.<rnd>.s32.f32 |
| __nv_double2float_{rn,rz,ru,rd} | f64 → f32 | cvt.<rnd>.f32.f64 |
| __nv_float2half_{rn,rz} | f32 → f16 | cvt.<rnd>.f16.f32 |
| __nv_half2float | f16 → f32 | cvt.f32.f16 |
| __nv_float_as_int | bit reinterpret | mov.b32 (lossless) |
| __nv_int_as_float | bit reinterpret | mov.b32 (lossless) |
| __nv_longlong_as_double | bit reinterpret | mov.b64 |
| __nv_double_as_longlong | bit reinterpret | mov.b64 |
| __nv_double2hiint / _loint | f64 → upper/lower 32 bits | cvt.u32.u64 after mov.b64 |
| __nv_hiloint2double | reassemble f64 from two i32 | mov.b64 of packed result |
The *_as_* family is intentionally a no-op at the LLVM level; libdevice ships a body anyway so that the symbol exists and the bitcode linker has something to resolve. The body is a single bitcast followed by ret, which the always-inliner reduces to a register rename in the caller.
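The reinterpret and split helpers have direct host equivalents, which makes their bit-level contract easy to check. The following is a minimal sketch using std::memcpy for the bit copy; the function names mirror the libdevice ones without the __nv_ prefix but are otherwise hypothetical host code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit integer and back.
inline std::int32_t float_as_int(float f) {
  std::int32_t i;
  std::memcpy(&i, &f, sizeof i);
  return i;
}
inline float int_as_float(std::int32_t i) {
  float f;
  std::memcpy(&f, &i, sizeof f);
  return f;
}

// Upper and lower 32 bits of a double's bit pattern.
inline std::int32_t double2hiint(double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits >> 32));
}
inline std::int32_t double2loint(double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
}

// Reassemble a double from its two 32-bit halves.
inline double hiloint2double(std::int32_t hi, std::int32_t lo) {
  const std::uint64_t bits =
      (static_cast<std::uint64_t>(static_cast<std::uint32_t>(hi)) << 32) |
      static_cast<std::uint64_t>(static_cast<std::uint32_t>(lo));
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}
```

Splitting a double with double2hiint and double2loint and passing the halves to hiloint2double returns the original value bit for bit, including for negative inputs.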
Error and gamma functions
| Symbol | f32 | f64 | Notes |
|---|---|---|---|
| Error function | __nv_erff | __nv_erf | libdevice-only; rational approximation, double-double internals |
| Complementary erf | __nv_erfcf | __nv_erfc | libdevice-only; scaled exp(-x*x) path for large \|x\| |
| Inverse erf | __nv_erfinvf | __nv_erfinv | libdevice-only; iterative |
| Inverse erfc | __nv_erfcinvf | __nv_erfcinv | libdevice-only; iterative |
| Scaled erfc | __nv_erfcxf | __nv_erfcx | exp(x*x) * erfc(x); large-x stable form |
| Gamma | __nv_tgammaf | __nv_tgamma | Stirling for large x, reflection for small x |
| Log-gamma | __nv_lgammaf | __nv_lgamma | log of \|Γ(x)\| |
| Norm CDF | __nv_normcdff | __nv_normcdf | 0.5 * erfc(-x/sqrt(2)) |
| Inverse norm CDF | __nv_normcdfinvf | __nv_normcdfinv | iterative on erfinv |
| Bessel J0 / J1 | __nv_j0f / __nv_j1f | __nv_j0 / __nv_j1 | libdevice-only; minimax for small x, asymptotic for large |
| Bessel Y0 / Y1 | __nv_y0f / __nv_y1f | __nv_y0 / __nv_y1 | libdevice-only; same shape |
| Bessel Jn / Yn | __nv_jnf / __nv_ynf | __nv_jn / __nv_yn | recurrence on the J0/J1, Y0/Y1 pair |
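Two of the identities in the Notes column are easy to reproduce on the host. This sketch uses Python's math.erfc as a stand-in for the libdevice bodies; it illustrates the formulas only and says nothing about the approximations libdevice actually uses.

```python
import math

def normcdf(x: float) -> float:
    """Standard normal CDF via the identity 0.5 * erfc(-x / sqrt(2))."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def erfcx_naive(x: float) -> float:
    """Scaled erfc as the literal product exp(x*x) * erfc(x).

    Fine for moderate x. For large x the exp factor overflows while erfc
    underflows, which is why a dedicated large-x stable form is needed.
    """
    return math.exp(x * x) * math.erfc(x)
```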
Rounding-mode-qualified arithmetic
These are the "primitive" forms the MLIR lowering does not use directly, but which front-end code can call to force a specific rounding mode on a single op:
| Op | f32 family | f64 family | Decay |
|---|---|---|---|
| Add | __nv_fadd_rn / _rz / _ru / _rd | __nv_dadd_rn / _rz / _ru / _rd | add.<rnd>.f32 / add.<rnd>.f64 |
| Subtract | __nv_fsub_rn etc. | __nv_dsub_rn etc. | sub.<rnd>.f32 / sub.<rnd>.f64 |
| Multiply | __nv_fmul_rn etc. | __nv_dmul_rn etc. | mul.<rnd>.f32 / mul.<rnd>.f64 |
| Divide | __nv_fdiv_rn etc. | __nv_ddiv_rn etc. | div.<rnd>.f32 / div.<rnd>.f64; _rn is the only IEEE-correct form |
| FMA | __nv_fmaf_rn etc. | __nv_fma_rn etc. | fma.<rnd>.f32 / fma.<rnd>.f64 |
| Reciprocal | __nv_frcp_rn etc. | __nv_drcp_rn etc. | rcp.<rnd>.f32 / rcp.<rnd>.f64 |
| Square root | __nv_fsqrt_rn etc. | __nv_dsqrt_rn etc. | sqrt.<rnd>.f32 / sqrt.<rnd>.f64 |
The MLIR pipeline never emits these names directly; they are reachable only through CUDA-C intrinsic shims (__fadd_rn etc. without the __nv_ prefix) and pass through the libdevice linker unchanged.
Reflection-key cross-reference
The reflection keys consumed by libdevice bodies fall into five orthogonal axes:
| Key | Type | Values | Effect on bodies that read it |
|---|---|---|---|
| __CUDA_FTZ | bool | 0 (preserve), 1 (flush) | Selects FTZ vs non-FTZ approximate-intrinsic variant in sin, cos, tan, exp, log, pow, etc. Bodies typically have if (__nvvm_reflect("__CUDA_FTZ")) arms wrapping the sin.approx.ftz.f32 / sin.approx.f32 selection. |
| __CUDA_PREC_DIV | bool | 0 (approx), 1 (IEEE) | __nv_fdividef and __nv_fdivide choose div.approx.f32 vs div.rn.f32 + Newton refinement. nvcc default is 1; --use_fast_math flips to 0. |
| __CUDA_PREC_SQRT | bool | 0 (approx), 1 (IEEE) | __nv_sqrtf and __nv_sqrt choose sqrt.approx.f32 vs sqrt.rn.f32. Default and flip behaviour mirror __CUDA_PREC_DIV. |
| __CUDA_FAST_INT_DIV | bool | 0, 1 | Integer division and modulo libdevice helpers (__nv_idiv, __nv_imod, etc., if present in the bitcode) choose between the reference 32-bit algorithm and the truncated approximation. |
| __CUDA_ARCH | int | 700, 750, 800, 860, 890, 900, 1000, 1030, 1200, … | Selects per-SM intrinsic availability inside bodies that fall back to legacy paths on older hardware. |
Bodies that do not query any reflection key are non-configurable; they emit the same NVPTX intrinsic regardless of target options. The libdevice overview pipeline folds the reflection keys before the always-inliner runs, so reflection-driven branches are dead by the time the inliner copies the body into the caller.
SM-floor inventory
A handful of __nv_* symbols decay into instructions whose lowest PTX support level is later than the rest of libdevice. Calls to these symbols from a kernel compiled for an older SM produce libdevice fall-back bodies rather than the named instruction.
| Symbol family | Decay floor | Older-SM fallback |
|---|---|---|
| __nv_fminf / __nv_fmaxf | sm_80 min.f32/max.f32 | branch-and-select bit logic |
| __nv_fmin / __nv_fmax | sm_80 min.f64/max.f64 | branch-and-select |
| __nv_tanhf | sm_75 tanh.approx.f32 | rational approximation in software |
| Block-scaled __nv_cvt_* (FP8 / FP4) | sm_89 / sm_100a cvt.packfloat.* | not provided; undefined behaviour on older SMs |
| __nv_fma_relu_* | sm_75 (f16) / sm_90a (f8) | not provided; softmax-style ReLU+FMA fused intrinsic is sm-gated |
| Tensor-memory casts | sm_100a tcgen05 path | not in libdevice — these live in nvvm |
The libdevice "fall-back" body is the same body the reflection-folded reference path uses; the only difference is that the always-inliner cannot collapse the body into a single PTX instruction because the PTX form does not exist yet.
Linker behaviour and dead-call elimination
Libdevice bitcode is linked with Linker::Flags::OnlyNeeded. The linker walks the user module's declaration set, copies in the matching definitions, and recursively pulls in any further __nv_* declarations the freshly-imported bodies reference. The __CUDA_FTZ / __CUDA_PREC_* reflection arms typically reference both the FTZ and the non-FTZ helper symbols, so a library body that ultimately resolves to a single arm still drags the unused arm's helpers into the user module. The post-inline GlobalDCEPass cleans them up:
1. Linker pulls in __nv_sinf body, which references __nv_sin_kernel_ftz, __nv_sin_kernel_nonftz.
2. NVVMReflectPass folds the FTZ arm to the chosen path.
3. AlwaysInlinerPass inlines __nv_sinf into the caller.
4. SimplifyCFG + SCCP eliminate the dead arm and its helper call.
5. GlobalDCEPass removes the orphaned __nv_sin_kernel_<other> from the module.
Steps 4 and 5 are why the libdevice bitcode appears tiny in the final PTX even though the bitcode blob is several megabytes. The pre-DCE module size can be 5–10× the final size; the dead-arm elimination is the single largest IR shrink in the libdevice integration path.
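The five steps can be modelled as two reachability passes over a symbol graph. The sketch below is a toy model, not the LLVM linker: the LIBDEVICE dictionary and both function names are hypothetical, and the model keeps __nv_sinf itself live even though the real pipeline inlines it away.

```python
# Hypothetical toy library: each body names the reflect key it queries and,
# per arm value, the helper symbols that arm calls.
LIBDEVICE = {
    "__nv_sinf": {
        "key": "__CUDA_FTZ",
        "arms": {1: ["__nv_sin_kernel_ftz"], 0: ["__nv_sin_kernel_nonftz"]},
    },
    "__nv_sin_kernel_ftz": {"key": None, "arms": {}},
    "__nv_sin_kernel_nonftz": {"key": None, "arms": {}},
}

def link_only_needed(declared):
    """Step 1: transitive closure over declarations; both arms are pulled in."""
    linked, work = set(), list(declared)
    while work:
        name = work.pop()
        if name in linked or name not in LIBDEVICE:
            continue
        linked.add(name)
        for callees in LIBDEVICE[name]["arms"].values():
            work.extend(callees)
    return linked

def live_after_fold(roots, linked, config):
    """Steps 2 to 5: follow only the arm the configuration selects.

    Anything linked but not reached here is what GlobalDCE would remove.
    """
    live, work = set(), list(roots)
    while work:
        name = work.pop()
        if name in live or name not in linked:
            continue
        live.add(name)
        body = LIBDEVICE[name]
        if body["key"] is not None:
            chosen = config.get(body["key"], 0)  # missing keys fold to 0
            work.extend(body["arms"].get(chosen, []))
    return live
```

The difference between the two result sets is the dead-arm helper population that accounts for the pre-DCE size inflation.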
Verification invariants
Three invariants hold across libdevice integration. Violations are caught by NVVMIRVerifier before the NVPTX backend runs.
- Every __nv_* declaration is resolved before code generation. A surviving declaration is a backend error.
- Every __nvvm_reflect("KEY") call is folded into a ConstantInt before always-inlining. A surviving reflect call is a configuration bug.
- No __nv_* body retains a __nvvm_reflect call after the four-pass integration; the post-link nvvm-reflect-pp cleanup folds the constant branches and removes any dangling intrinsic call sites.
QUIRK: Unknown reflection keys silently fold to zero
NVVMReflectPass::populateVarMap defaults missing keys to 0 and records the zero in the resolved map so that every later call site folds to the same value. A typo in __nvvm_reflect("__CUDA_FFZ") (with double-F) is therefore not a diagnostic — it is a silent reset to the FTZ-off behaviour, applied consistently. The only way to notice is to inspect the post-reflect IR and check that the key the body queries is the key the configuration set. Reimplementations that diverge from this — for example by warning on unknown keys, or by returning -1 to indicate "unknown" — break libdevice bodies that rely on the recorded-zero behaviour for legacy options that the bitcode references but the current configuration system does not know about.
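A compatible var-map is small. The sketch below assumes a plain dictionary; the class name and the defaulted_keys audit helper are hypothetical and exist only to show the recorded-zero behaviour and how a typo can be spotted after the fact.

```python
class ReflectVarMap:
    """Toy model of the resolved reflection map described above."""

    def __init__(self, configured):
        self.configured = frozenset(configured)
        self.resolved = dict(configured)

    def fold(self, key: str) -> int:
        # A missing key resolves to 0 and the zero is recorded, so every
        # later call site for the same key folds to the same constant.
        return self.resolved.setdefault(key, 0)

    def defaulted_keys(self):
        """Keys that bodies queried but the configuration never set."""
        return sorted(k for k in self.resolved if k not in self.configured)
```

Inspecting defaulted_keys after folding is the programmatic equivalent of reading the post-reflect IR: a misspelled key shows up there rather than as a diagnostic.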
QUIRK: _rn is the only IEEE-correct division and square root
__nv_fdiv_rn and __nv_fsqrt_rn decay to div.rn.f32 and sqrt.rn.f32 — the only PTX divide and square-root variants that the IEEE-754 standard certifies as correctly rounded. The _rz, _ru, and _rd variants are valid hardware instructions but do not satisfy IEEE-754 single-step correctness for division and square root: they round the approximate result rather than the mathematically exact one. Libdevice does not paper over this — code that calls __nv_fdiv_ru(a, b) gets the directed-rounded approximation, not a Newton-refined directed-rounded result. The MLIR arith dialect has no rounding-mode parameter on arith.divf, so this asymmetry is only reachable through CUDA-C intrinsics; MLIR-fronted code always sees the round-to-nearest path.
QUIRK: __nv_fast_* are libdevice symbols, not preprocessor macros
__nv_fast_sinf, __nv_fast_cosf, __nv_fast_powf, etc. exist as separate bitcode symbols, not as #define-style rewrites of __nv_sinf and friends. They have distinct bodies — typically a single sin.approx.ftz.f32 call — and their existence is what allows --use_fast_math to substitute the symbol name during MLIR OpToFuncCallLowering selection without recompiling the libdevice bitcode. A reimplementation that treats __nv_fast_sinf as a macro alias of __nv_sinf will lose the FTZ behaviour the fast-path body enforces unconditionally; the slow-path body is FTZ-conditional on __CUDA_FTZ, and a fast-math build with __CUDA_FTZ=0 (the IEEE-clean default) would then silently preserve denormals where CUDA's bitcode would flush them.
Cross-references
The four-pass integration sequence that turns these declarations into concrete bodies is documented in libdevice Overview — Pipeline. The reflection keys that gate body selection are documented in NVVMReflect Mechanism — Three var-map sources. The MLIR-side rewriter that emits the __nv_* call sites these symbols define is documented in Math Pass Pipeline and Crosswalk — Full math-op crosswalk. The LLVM constant folder that classifies any surviving by-name call sites is documented in Intrinsic ID Switch and Name Table — libdevice suffix name table. The fast-math pragma that selects the __nv_fast_* family over the precision-keyed family is discussed in Fast Math and Numerical Precision.
Intrinsic ID Switch + Name Table
Abstract
tileiras carries the LLVM constant-folder predicate that decides whether a CallBase can be evaluated at compile time. It is the upstream llvm::canConstantFoldCallTo(const CallBase*, const Function*) shape with NVIDIA extensions for NVPTX intrinsics and libdevice naming conventions. A positive result permits the APFloat/APInt folding body to replace the call with a constant.
The dispatcher decomposes into a primary 412-case switch on Function::IntrinsicID, a secondary 161-case switch for the Intrinsic::nvvm_* block, a sparse high-ID range tree, and a name-walking tail for non-intrinsic libdevice and finite-math aliases.
412-case Intrinsic::ID switch
The primary switch is indexed by IntrinsicID ∈ [0, 411]. Five successor buckets are reached:
| Target | Bucket | Cases | Semantic |
|---|---|---|---|
| T_FALSE | A | 311 | return false; intrinsic carries side effects or is not foldable. |
| T_ATTR | B | 29 | return !NoFold && !StrictFP; floating-point arithmetic gated by attributes. |
| T_TRUE | C | 71 | return true; pure integer/bit-domain APInt-foldable. |
| T_LIB | D | 1 | Intrinsic::not_intrinsic; dispatch on Function::getName(). |
| T_DEF | n/a | n/a | default arm; range tree for IDs above the primary table. |
Bucket A (T_FALSE, 311 cases) collects the IDs that have observable side effects on memory, the debug-info family, EH/GC/sanitizer support, frame/return-address probes, the entire VP-intrinsic block, and the low-numbered NVPTX intrinsics whose lowering happens during NVPTX ISel pattern matching rather than at constant-fold time. The verbatim union of cases is 2..11, 13, 16..19, 22, 23, 27..62, 68..87, 91..96, 98..101, 110..113, 116..127, 129, 130, 134..139, 141..172, 174, 180, 181, 185..187, 189..208, 213..220, 224..230, 232..237, 241..248, 252, 254..287, 290..311, 318..328, 331, 338, 340, 341, 344..349, 351..358, 360..362, 365, 367, 368, 371, 372, 374, 377..380, 382..387, 391..396, 399..404.
Bucket B (T_ATTR, 29 cases) is the floating-point arithmetic family: llvm.{sin,cos,exp,exp2,exp10,log,log2,log10,pow,sqrt,fma,minnum,maxnum,copysign,fabs,floor,ceil,trunc,round,roundeven,nearbyint,rint} and their f16/bf16/f32/f64/fp128/x86_fp80 type-overloaded variants. The folder can evaluate them via the APFloat-emulating tail, but only when the surrounding Function carries neither NoFold nor StrictFP. Cases: 12, 24, 25, 63, 64, 88..90, 176..179, 182, 212, 221..223, 238..240, 249..251, 288, 289, 329, 330, 332, 339.
Bucket C (T_TRUE, 71 cases) is the bit-precise integer arithmetic surface: llvm.abs, umax/umin/smax/smin, the vector_reduce_* family (102..109), the saturating-arith block (209..211), the bswap/ctlz/cttz/ctpop/bitreverse/fshl/fshr bitfield block (312..317), and the matrix / masked-{load,store,gather,scatter} family at the upper end (405..411). Cases: 1, 14, 15, 20, 21, 26, 65..67, 97, 102..109, 114, 115, 128, 131..133, 140, 173, 175, 183, 184, 188, 209..211, 231, 253, 312..317, 333..337, 342, 343, 350, 359, 363, 364, 366, 369, 370, 373, 375, 376, 381, 388..390, 397, 398, 405..411.
Bucket D is the single case 0 (Intrinsic::not_intrinsic) path. Before reaching the name-walking sub-tree it checks that the function only reads memory, re-runs the NoFold and StrictFP gates, loads Function::getName(), and dispatches on the first character. The sum 311 + 29 + 71 + 1 = 412 exhausts every label in the primary table.
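The bucket arithmetic can be reproduced from the case lists. This sketch expands the Bucket B and Bucket C lists quoted above and derives Bucket A as the complement; the expand helper and the string return values are illustrative choices, not names from the binary.

```python
def expand(spec: str) -> set:
    """Expand a list such as '1, 14, 65..67' into a set of integers."""
    out = set()
    for part in spec.replace(" ", "").split(","):
        lo, _, hi = part.partition("..")
        out.update(range(int(lo), int(hi or lo) + 1))
    return out

BUCKET_B = expand(
    "12, 24, 25, 63, 64, 88..90, 176..179, 182, 212, 221..223, 238..240, "
    "249..251, 288, 289, 329, 330, 332, 339"
)
BUCKET_C = expand(
    "1, 14, 15, 20, 21, 26, 65..67, 97, 102..109, 114, 115, 128, 131..133, "
    "140, 173, 175, 183, 184, 188, 209..211, 231, 253, 312..317, 333..337, "
    "342, 343, 350, 359, 363, 364, 366, 369, 370, 373, 375, 376, 381, "
    "388..390, 397, 398, 405..411"
)
BUCKET_D = {0}
# Assumption: Bucket A is everything else in [0, 411].
BUCKET_A = set(range(412)) - BUCKET_B - BUCKET_C - BUCKET_D

def classify_primary(intrinsic_id: int) -> str:
    """Return the successor label the primary switch would reach."""
    if not 0 <= intrinsic_id < 412:
        return "T_DEF"
    if intrinsic_id in BUCKET_D:
        return "T_LIB"
    if intrinsic_id in BUCKET_C:
        return "T_TRUE"
    if intrinsic_id in BUCKET_B:
        return "T_ATTR"
    return "T_FALSE"
```

The derived complement has 311 members, matching the Bucket A count, which is a useful sanity check when transcribing the table.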
161-case secondary switch — 8851..9011 (NVPTX block)
When the default arm sees an ID in the NVPTX intrinsic range, it falls into a 161-case secondary switch. This block covers per-shape variants of cp.async.bulk.tensor.{1..5}d, tcgen05.* alloc/dealloc/commit, wgmma.fence, fence.proxy.*, mbarrier.*, cluster.*, ldmatrix.*, stmatrix.*, and block-scaled MMA dispatcher entries. Each of the 161 IDs is explicitly classified as either T_FALSE or T_ATTR; no NVPTX hardware-effect intrinsic is always foldable.
| ID | Bucket | Class | Notes |
|---|---|---|---|
| 8851 | T_ATTR | TMA-tensor metadata | First case in block; per-shape "no-op" variant |
| 8852 | T_ATTR | TMA prefetch | Foldable to no-op if not StrictFP-marked |
| 8853 | T_FALSE | TMA store | Side-effecting on shared/global |
| 8854 | T_ATTR | commit-group head | First of 5-stride boundary family |
| 8855..8916 | T_FALSE | cp.async.bulk.tensor.* body | 62-case contiguous block — all SM90+ TMA primitives |
| 8917 | T_ATTR | TMA fence variant | +5 stride from 8852 |
| 8923 | T_ATTR | tcgen05.alloc head | 5th in the 5-step pattern |
| 8931, 8936, 8941, 8946, 8951 | T_ATTR | tcgen05.commit / tcgen05.fence | One per dimension |
| 8956, 8972, 8978 | T_ATTR | wgmma.fence.{sync,async,wait} | Hopper warpgroup-MMA fences |
| 8957..8971 | T_FALSE | wgmma.mma_async.* | Side-effecting matrix multiply |
| 8997..9010 except 9001, 9006 | T_FALSE | mbarrier.arrive.* / cluster.* | Side-effecting sync primitives |
| 9011 | T_ATTR | last case | Final ID in block |
The 23 T_ATTR IDs {8851, 8852, 8854, 8917, 8919, 8923, 8926, 8931, 8936, 8941, 8946, 8951, 8956, 8972, 8974, 8978, 8981, 8986, 8991, 8996, 9001, 9006, 9011} cluster suspiciously on +5 strides — they correspond to the metadata-only / prefetch / commit-group variants of each TMA-tensor dimension. The remaining 138 IDs go to T_FALSE.
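The secondary switch reduces to a set-membership test. The sketch below encodes the 23 T_ATTR IDs listed above; the function name and the ValueError on out-of-range input are illustrative assumptions.

```python
NVPTX_LO, NVPTX_HI = 8851, 9011

T_ATTR_IDS = frozenset({
    8851, 8852, 8854, 8917, 8919, 8923, 8926, 8931, 8936, 8941, 8946, 8951,
    8956, 8972, 8974, 8978, 8981, 8986, 8991, 8996, 9001, 9006, 9011,
})

def classify_nvptx(intrinsic_id: int) -> str:
    """Classify an ID inside the 161-case block as T_ATTR or T_FALSE."""
    if not NVPTX_LO <= intrinsic_id <= NVPTX_HI:
        raise ValueError("ID is outside the 8851..9011 block")
    return "T_ATTR" if intrinsic_id in T_ATTR_IDS else "T_FALSE"

def stride5_gaps() -> int:
    """Count adjacent T_ATTR IDs that sit exactly 5 apart."""
    ids = sorted(T_ATTR_IDS)
    return sum(1 for a, b in zip(ids, ids[1:]) if b - a == 5)
```

Twelve of the 22 adjacent gaps are exactly 5, in the runs 8926..8956 and 8981..9011, which is the stride clustering the text points out.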
Default-case binary tree for high IDs
When the ID lies outside both the primary table and the 8851..9011 block, the default arm executes a hand-coded binary search over the sparse high-ID space [3184, 15923]. Membership for tight ranges is tested with 64-bit bitmasks rather than nested compares, a classic clang sparse-switch pattern. The decision tree splits at 0x2628 (9768), 0x3AA3 (15011), 0x2628, 0x255F (9567), 0x254B (9547), and 0x21FF (8703); each leaf is a goto T_TRUE/T_ATTR/T_FALSE. The bit-mask leaves are:
| Range base | Selected IDs | Target |
|---|---|---|
| 8740 | 8740..8755, 8770..8786 | T_ATTR |
| 9548 | 9548, 9553..9567 | T_ATTR |
| 9695 | 9695, 9696, 9697, 9699, 9704, 9708 | T_ATTR |
| 9723 | 9723..9726, 9762, 9764, 9766 | T_ATTR |
| 9830 | 9830, 9832, 9833, 9839..9842 | T_ATTR |
| 15889 | 15889, 15890, 15921, 15922, 15923 | T_ATTR |
Isolated T_TRUE IDs from the same tree: 1352, 3184, 3260, 3278, 3299, 3422..3424, 3600..3604, 8294 (cvt.packfloat head), 9211, and 14542..14543. Isolated T_ATTR IDs: 2191, 2192..2196, 2315, 2318..2319, 3312, 8625, 8638..8653, 8698..8699, 8703, 9178, 15006..15011, and 15486..15493. Every other ID outside the enumerated leaves falls through to T_FALSE.
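A bit-mask leaf replaces a chain of compares with one subtract, one range check, and one shift. The sketch below builds the mask for the 9695 leaf from the table above; make_leaf and in_leaf are hypothetical helper names.

```python
def make_leaf(base: int, ids) -> tuple:
    """Pack the selected IDs into a 64-bit mask relative to base."""
    mask = 0
    for intrinsic_id in ids:
        offset = intrinsic_id - base
        if not 0 <= offset < 64:
            raise ValueError("ID does not fit in a 64-bit window")
        mask |= 1 << offset
    return base, mask

def in_leaf(leaf: tuple, intrinsic_id: int) -> bool:
    """Membership test: range check, then a single bit probe."""
    base, mask = leaf
    offset = intrinsic_id - base
    return 0 <= offset < 64 and (mask >> offset) & 1 == 1

# Offsets 0, 1, 2, 4, 9, 13 relative to 9695.
LEAF_9695 = make_leaf(9695, [9695, 9696, 9697, 9699, 9704, 9708])
```

Every leaf in the table spans fewer than 64 consecutive IDs, so each one fits in a single 64-bit immediate.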
LLVM 17/18 fingerprint analysis
Three independent fingerprints converge on the LLVM 17/18 family. The generic Intrinsic::ID space contains exactly 412 entries, which sits between upstream LLVM 17 and 18 counts. The Function::IntrinsicID field position rules out older layouts, and the attribute gate uses the slot occupied by NoFold and StrictFP in the LLVM 17 family. The combined evidence favors an LLVM 17-era generic table with NVIDIA NVPTX additions, though LLVM 18 with selected legacy removals remains close enough that the public documentation should treat this as a 17/18-family implementation detail.
libdevice suffix name table
The case 0 tail walks Function::getName() byte-by-byte and dispatches into nested switches for generic libm names, Itanium-mangled names, and CUDA-C suffix overloads such as *d, *ff, and *dd.
| String | Class |
|---|---|
| remainderf | libdevice helper |
| powff, powdd | CUDA-C type-suffix helpers |
| acosd, asind, atand, ceild, coshd, exp2d, fabsd | double-precision suffix helpers |
| sinhd, sqrtd, tanhd, floord, log10d | double-precision suffix helpers |
| __acos_finite, __acosf_finite, __asin_finite, __asinf_finite | finite-math aliases |
| __atan2_finite, __atan2f_finite, __cosh_finite, __coshf_finite | finite-math aliases |
| __sinh_finite, __sinhf_finite | finite-math aliases |
The suffix names are CUDA-C overload helpers that disambiguate float and double arguments where C++ ABI mangling is unavailable: f means float scalar, d means double scalar, ff means (float, float), and dd means (double, double). These symbols are recognition keys; libdevice itself exposes canonical __nv_* names. When the walker matches a suffix helper, lowering rewrites the call to the canonical symbol pair, for example acosd to __nv_acos and powff to __nv_powf. The __<name>_finite entries are GCC/Clang finite-math call targets and fold identically to their non-finite siblings for constant operands.
A separate mini-table holds the Itanium-mangled binary-argument helpers consumed by the constant-fold rewriter:
| String | Demangled |
|---|---|
| _Z4fmodff | fmod(float, float) |
| _Z4fmoddd | fmod(double, double) |
| _Z5atan2ff | atan2(float, float) |
| _Z5atan2dd | atan2(double, double) |
Together the suffix table, mangled helper table, and finite-math aliases form the NVIDIA extension to LLVM's TargetLibraryInfo recognition set.
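The two recognition tables can be sketched as a suffix rewrite and a restricted demangler. Only acosd to __nv_acos and powff to __nv_powf are confirmed by the text; applying the same suffix rule to the other listed names is an assumption, and both function names are hypothetical.

```python
DOUBLE_SUFFIX = frozenset({
    "acosd", "asind", "atand", "ceild", "coshd", "exp2d", "fabsd",
    "sinhd", "sqrtd", "tanhd", "floord", "log10d",
})
PAIR_SUFFIX = frozenset({"powff", "powdd"})
TYPE_CODES = {"f": "float", "d": "double"}

def canonical_symbol(name: str):
    """Map a suffix helper to its __nv_* name, or None if unrecognized."""
    if name in PAIR_SUFFIX:
        base, kind = name[:-2], name[-2:]
        return "__nv_" + base + ("f" if kind == "ff" else "")
    if name in DOUBLE_SUFFIX:
        return "__nv_" + name[:-1]
    return None

def demangle_simple(symbol: str):
    """Demangle _Z<len><name><f|d>... only; anything else returns None."""
    if not symbol.startswith("_Z"):
        return None
    i = 2
    while i < len(symbol) and symbol[i].isdigit():
        i += 1
    if i == 2:
        return None
    length = int(symbol[2:i])
    name, params = symbol[i:i + length], symbol[i + length:]
    if len(name) != length or not params:
        return None
    if any(code not in TYPE_CODES for code in params):
        return None
    return name + "(" + ", ".join(TYPE_CODES[c] for c in params) + ")"
```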
Reimplementation Notes
can_constant_fold(call):
    if call.callee.is_intrinsic:
        return classify_intrinsic(call.callee.intrinsic_id, call.function_attrs)
    if not call.callee.only_reads_memory:
        return false
    if call.function_attrs.has("NoFold") or call.function_attrs.has("StrictFP"):
        return false
    return classify_libdevice_name(call.callee.name)
Keep the side-effecting NVPTX intrinsics out of the always-foldable bucket. Metadata-only and prefetch-like intrinsics may be attribute-gated, but barriers, async copies, tensor-memory operations, and cluster synchronization must remain non-foldable.
Cross-references
The libdevice linking and reflect-folding sequence that produces the call sites this table classifies is documented in libdevice Overview — Pipeline. The reflection mechanism behind __CUDA_PREC_* / __CUDA_FTZ is documented in NVVMReflect Mechanism. The lowering side — which MLIR math.* / arith.* ops feed this table through __nv_* calls — is documented in Math Pass Pipeline and Crosswalk — Full math-op crosswalk. The NVPTX intrinsic IDs in the 8851..9011 range correspond to the cluster/TMA/tcgen05/WGMMA families documented in tcgen05, WGMMA, mbarrier, and Cluster Sync, TMA, Tensormap, and cp.async.bulk Emission, and the NVVM dialect overviews (nvvm cluster ops, nvvm mbarrier ops, nvvm tma ops, nvvm tcgen05 ops, nvvm wgmma ops).
Math Pass Pipeline + Crosswalk
Abstract
This page anchors the end-to-end translation of MLIR math.* and arith.* floating-point operations through three name-binding layers: MLIR OpToFuncCallLowering patterns emit llvm.call @__nv_<op>[f], the embedded libdevice bitcode supplies the __nv_* bodies, and the post-libdevice LLVM constant folder matches surviving call sites by Intrinsic::ID or by callee name.
It also corrects an earlier misidentification: the 8-phase LLVM pass near the libdevice consumers is not the math-to-libdevice rewriter. Math lowering happens in MLIR before libdevice is linked. The 8-phase pass is a later per-function cleanup over LLVM IR.
Post-libdevice cleanup, not math-to-libdevice
The 8-phase pass was originally hypothesized as an in-tileiras "math to libdevice" pass. That is not its role. It allocates no RewritePatternSet, configures no ConversionTarget, references no math.* mnemonic strings, references no __nv_* symbols, and walks LLVM Function ranges rather than MLIR operation graphs. Its filters match memory-reading and memory-writing LLVM instruction classes: loads, stores, calls, atomics, fences, and memset-like operations. The pass is therefore an LLVM new-PM FunctionPass running after MLIR lowering and libdevice linking, well after the __nv_* calls have already been materialized and inlined.
What the pass shares with NVVMReflect is the underlying LLVM ADT vocabulary, not the algorithm. NVVMReflect is module-global and string-keyed; this pass is per-function and pointer-keyed. NVVMReflect folds __nvvm_reflect(...) calls, then nvvm-reflect-pp removes the now-constant branches in libdevice bodies. The cleanup pass runs downstream on the simplified IR and never introduces a new __nv_* call.
Full math-op crosswalk
For every math.* / arith.* op lowered here, the important public artifacts are the libdevice entry point and, where applicable, the post-link intrinsic or name recognized by the constant folder after libdevice inlining and nvvm-reflect-pp cleanup.
| math.<op> / arith.<op> | f32 symbol | f64 symbol | Constant-folder name(s) | by-name fold |
|---|---|---|---|---|
| arith.remf | __nv_fmodf | __nv_fmod | _Z4fmodff / _Z4fmoddd | yes |
| arith.minnumf | __nv_fminf | __nv_fmin | LLVM MinNum node | no |
| arith.maxnumf | __nv_fmaxf | __nv_fmax | LLVM MaxNum node | no |
| math.absi | __nv_abs | n/a | n/a | no |
| math.absf | __nv_fabsf | __nv_fabs | fabsd / llvm.nvvm.fabs.f | no |
| math.acosh | __nv_acoshf | __nv_acosh | libdevice-only | yes |
| math.asin | __nv_asinf | __nv_asin | asind | yes |
| math.atan | __nv_atanf | __nv_atan | atand | yes |
| math.acos | __nv_acosf | __nv_acos | acosd | yes |
| math.atan2 | __nv_atan2f | __nv_atan2 | _Z5atan2ff / _Z5atan2dd | yes |
| math.asinh | __nv_asinhf | __nv_asinh | libdevice-only | yes |
| math.atanh | __nv_atanhf | __nv_atanh | libdevice-only | yes |
| math.cbrt | __nv_cbrtf | __nv_cbrt | libdevice-only | yes |
| math.ceil | __nv_ceilf | __nv_ceil | ceild | no |
| math.copysign | __nv_copysignf | __nv_copysign | llvm.copysign.* | no |
| math.cos | __nv_cosf | __nv_cos | cosd / cosf | yes |
| math.cosh | __nv_coshf | __nv_cosh | coshd | yes |
| math.erf | __nv_erff | __nv_erf | libdevice-only | yes |
| math.erfc | __nv_erfcf | __nv_erfc | libdevice-only | yes |
| math.exp2 | __nv_exp2f | __nv_exp2 | exp2d | yes |
| math.exp | __nv_expf | __nv_exp | expd / expf | yes |
| math.expm1 | __nv_expm1f | __nv_expm1 | libdevice-only | yes |
| math.floor | __nv_floorf | __nv_floor | floord | yes |
| math.fma | __nv_fmaf | __nv_fma | llvm.nvvm.fma.rn.{f,d} | no |
| math.fpowi | __nv_powif | __nv_powi | libdevice-only | no |
| math.isfinite | __nv_finitef | __nv_isfinited | bit arithmetic | no |
| math.isinf | __nv_isinff | __nv_isinfd | bit arithmetic | no |
| math.isnan | __nv_isnanf | __nv_isnand | bit arithmetic | no |
| math.log10 | __nv_log10f | __nv_log10 | log10d | yes |
| math.log1p | __nv_log1pf | __nv_log1p | libdevice-only | yes |
| math.log2 | __nv_log2f | __nv_log2 | log2f | yes |
| math.log | __nv_logf | __nv_log | logd / logf | yes |
| math.powf | __nv_powf | __nv_pow | powff / powdd | yes |
| math.roundeven | __nv_rintf | __nv_rint | llvm.rint.f64 | no |
| math.round | __nv_roundf | __nv_round | libdevice-only | no |
| math.rsqrt | __nv_rsqrtf | __nv_rsqrt | nvvm.rsqrt.approx.{f,d} | no |
| math.sinh | __nv_sinhf | __nv_sinh | sinhd | yes |
| math.sin | __nv_sinf | __nv_sin | sind / sinf | yes |
| math.sqrt | __nv_sqrtf | __nv_sqrt | sqrtd | yes |
| math.tanh | __nv_tanhf | __nv_tanh | tanhd | yes |
| math.tan | __nv_tanf | __nv_tan | libdevice-only | yes |
Entries marked "libdevice-only" have no dedicated NVPTX backend intrinsic. After libdevice inline plus NVVMReflect cleanup, the body decays into a sequence of more primitive __nvvm_* intrinsics whose IDs the constant folder may recognize. The by-name folder runs against compile-time-constant inputs only: it reads Function::getName(), matches the recognized libdevice or finite-math spelling, evaluates the operation with host math routines, and constructs a ConstantFP result. The libdevice body is not invoked for IR-time constant folding.
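The by-name fold can be pictured as a lookup from a recognized spelling to a host routine. The sketch below covers a few double-precision names from the tables and uses Python's math module as the host library; the HOST_EVAL table and the decision to decline on domain errors are assumptions of this sketch, not observed folder behaviour.

```python
import math

# Recognized spelling -> host routine (double-precision names only, because
# a Python float is an f64; f32 names would need an extra rounding step).
HOST_EVAL = {
    "acosd": math.acos,
    "asind": math.asin,
    "atand": math.atan,
    "coshd": math.cosh,
    "sinhd": math.sinh,
    "tanhd": math.tanh,
    "sqrtd": math.sqrt,
    "log10d": math.log10,
    "powdd": math.pow,
    "_Z4fmoddd": math.fmod,
    "_Z5atan2dd": math.atan2,
}

def try_fold_by_name(callee: str, args):
    """Return the folded constant, or None to leave the call in place."""
    routine = HOST_EVAL.get(callee)
    if routine is None:
        return None  # unrecognized name: not foldable by name
    if not all(isinstance(a, float) for a in args):
        return None  # at least one operand is not a compile-time constant
    try:
        return routine(*args)
    except (ValueError, OverflowError):
        return None  # this sketch declines to fold domain or range errors
```

Note that the libdevice body never runs here: the fold depends only on the callee name and the constant operands.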
FP32 vs FP64 — four axes of divergence
| Axis | f32 | f64 |
|---|---|---|
| Symbol pair | __nv_Xf (47 entries) | __nv_X / __nv_Xd (47 entries) |
| Libdevice body | Separate __nv_sinf body (Payne–Hanek f32 reduction, single-precision polynomial coefficients) | Separate __nv_sin body (Payne–Hanek f64 reduction, double-precision polynomial coefficients) |
| Backend intrinsic | TableGen suffix f — sinf, cosf, expf, logf, sqrtf, powff (pow.f.f), _Z5atan2ff (atan2(float,float)), _Z4fmodff | TableGen suffix d — sind, cosd, expd, logd, sqrtd, powdd (pow.d.d), _Z5atan2dd, _Z4fmoddd |
| HW asymmetry | nvptx-prec-divf32, nvptx-prec-sqrtf32, nvptx-approx-log2f32, nvptx-rsqrt-approx-opt — all PTX-ISA-level f32 selectors with no f64 counterpart | f64 div is always div.rn.f64; f64 sqrt is always sqrt.rn.f64 or a libdevice fallback when HW lacks it on the target SM |
The f16 and bf16 slots of these lowerings are empty: no __nv_* half-precision libdevice symbol is used. The MLIR pipeline promotes f16/bf16 to f32 via arith.extf before the libdevice call and demotes via arith.truncf after. The fp128 family is independent and softfloat-emulated; it is not driven by these OpToFuncCallLowering patterns.
Cases that skip libdevice entirely
A subset of math.* ops have libdevice bodies whose control flow is mostly __nvvm_reflect("__CUDA_PREC_*") or __nvvm_reflect("__CUDA_FTZ") tests guarding Intrinsic::nvvm_* arms. After NVVMReflect folds the reflect calls and nvvm-reflect-pp removes constant branches, the body can reduce to a single hardware intrinsic and the __nv_* call symbol disappears.
Examples:
- math.sqrt %x : f32 with __CUDA_PREC_SQRT=0 reduces to nvvm.sqrt.approx.f; with __CUDA_PREC_SQRT=1 it reduces to nvvm.sqrt.rn.f.
- math.rsqrt %x : f32 reduces to nvvm.rsqrt.approx.f.
- math.sin / math.cos on f32 reduce to FTZ or non-FTZ approximate intrinsics depending on __CUDA_FTZ.
- math.exp %x : f32 rewrites to exp2.approx.f composed with a multiply.
- math.log2 %x : f32 rewrites to nvvm.lg2.approx.f when the approximate-log2 option is enabled.
- math.absi inlines as (x ^ (x >> 31)) - (x >> 31).
- math.{isnan,isinf,isfinite} reduce to bit arithmetic on the raw FP encoding.
Conversely, acosh, asinh, atanh, cbrt, erf, erfc, expm1, log1p, sinh, cosh, tanh, atan, atan2, asin, acos, tan, generic pow, remainder, fmod, and powi retain the libdevice body unless the input is a compile-time constant.
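The two bit-level reductions in the examples can be checked on the host. The absi formula is taken from the text; the exact comparisons used for the f32 classification are an assumption based on the IEEE-754 binary32 encoding, and the function names are illustrative.

```python
import struct

def to_signed32(value: int) -> int:
    """Wrap an integer to the signed 32-bit range, as an i32 register would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def absi32(x: int) -> int:
    """Branch-free abs: (x ^ (x >> 31)) - (x >> 31) with i32 wraparound."""
    sign = x >> 31  # arithmetic shift: -1 for negative x, 0 otherwise
    return to_signed32((x ^ sign) - sign)

def f32_magnitude_bits(x: float) -> int:
    """Raw f32 encoding with the sign bit cleared."""
    return struct.unpack("<I", struct.pack("<f", x))[0] & 0x7FFFFFFF

def isnan_f32(x: float) -> bool:
    return f32_magnitude_bits(x) > 0x7F800000   # exponent all ones, mantissa != 0

def isinf_f32(x: float) -> bool:
    return f32_magnitude_bits(x) == 0x7F800000  # exponent all ones, mantissa == 0

def isfinite_f32(x: float) -> bool:
    return f32_magnitude_bits(x) < 0x7F800000
```

As with any two's-complement abs, the most negative i32 maps to itself because its magnitude is not representable.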
Reimplementation Notes
lower_math_op(op):
    if op.type is f16 or bf16:
        x = extf(op.input, f32)
        y = call_libdevice(f32_symbol(op.name), x)
        return truncf(y, op.type)
    if op.type is f32:
        return call_libdevice(f32_symbol(op.name), op.operands)
    if op.type is f64:
        return call_libdevice(f64_symbol(op.name), op.operands)
Constant folding is a separate LLVM-tier concern. Do not execute libdevice IR to fold constants; classify the call, evaluate the recognized math operation directly, and replace the call with a constant.
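A runnable rendering of the lowering rule, using nested tuples as a stand-in IR. The CROSSWALK entries are copied from the crosswalk table; the tuple encoding and the function signature are hypothetical and chosen only to keep the sketch self-contained.

```python
# (f32 symbol, f64 symbol) pairs taken from the crosswalk table.
CROSSWALK = {
    "math.sin": ("__nv_sinf", "__nv_sin"),
    "math.absf": ("__nv_fabsf", "__nv_fabs"),
    "math.powf": ("__nv_powf", "__nv_pow"),
    "arith.remf": ("__nv_fmodf", "__nv_fmod"),
}

def lower_math_op(op_name: str, elem_type: str, operands):
    """Lower one op to a libdevice call in a toy tuple-based IR."""
    f32_symbol, f64_symbol = CROSSWALK[op_name]
    if elem_type in ("f16", "bf16"):
        # Half types have no libdevice symbol: widen, call the f32 body, narrow.
        widened = [("arith.extf", value, "f32") for value in operands]
        call = ("llvm.call", f32_symbol, widened)
        return ("arith.truncf", call, elem_type)
    if elem_type == "f32":
        return ("llvm.call", f32_symbol, list(operands))
    if elem_type == "f64":
        return ("llvm.call", f64_symbol, list(operands))
    raise ValueError("unsupported element type: " + elem_type)
```

Looking symbols up in a table rather than appending an f suffix matters because several rows are irregular, for example math.absf maps to __nv_fabsf and arith.remf to __nv_fmodf.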
Cross-references
The four-pass integration sequence that materializes the __nv_* bodies this page lowers into is documented in libdevice Overview — Pipeline and libdevice Overview — Link, inline, simplify. The __nvvm_reflect("__CUDA_FTZ") / __CUDA_PREC_* mechanism whose folding collapses the per-arch arms is documented in NVVMReflect Mechanism — Three var-map sources. The constant-folder classifier that recognizes the post-libdevice call sites by Intrinsic::ID or by name is documented in Intrinsic ID Switch and Name Table — libdevice suffix name table. The NVPTX bring-up path that pulls libdevice into the LLVM module is documented in NVPTX Bring-up and Target Init.
MLIR Infrastructure Overview
Abstract
TileIR rides on top of a standard MLIR substrate that the whole compiler shares: 0x48-byte Operation headers, a two-level StorageUniquer that interns every Type and Attribute, an InterfaceMap keyed on TypeID sentinel addresses, four rewrite-pattern shapes A/B/C/D at 0x60 / 0x68 / 0x70 / 0x78 bytes, an 808-byte AsyncValueImpl that backs every Pipe_ and Mutex_ scheduling value, and a 208-byte Diagnostic body with a 4-slot inline argument buffer. There is one walker driver, one pattern application loop, one uniquer gateway, and one diagnostic engine for the entire toolchain.
The substrate is statically linked once and shared by cuda_tile, nv_tileas, nv_tileaa, cute, cute_nvgpu, cutlass, nvvm, llvm, and the standard builtin / func / arith / scf / vector / memref / cf / math / pdl dialects. This page is the router into the deep pages — each section names one topic and points at the page that covers it field-by-field.
Reading Path
The pages below assume each other in roughly the following order. Read the operation layout first to understand what an MLIR node looks like in memory, then the storage uniquer to see how types and attributes interleave with operations, then the container fingerprints page to recognise every map and set in the binary.
| Topic | Owner page |
|---|---|
| Operation header, region traversal, walker driver | Operation Layout |
| Two-level uniquer, EMPTY/TOMBSTONE sentinels, fmix64 hash, context-impl rwlock | Storage Uniquer and Context Impl |
| Pattern shapes A/B/C/D, FrozenRewritePatternSet, fingerprint hashmap | Pattern Vtables and Shapes |
| Interface vtables, concept tables, InterfaceMap probing | Interface Vtables |
| TypeID idioms, .bss sentinel bands, Meyers-cached idiom | TypeID Sentinels and Anchors |
| TypeID construction idioms (static sentinel and Meyers cache) | TypeID Construction Idioms |
| DenseMap, SwissTable, SmallVector fingerprints and resize policies | Container Fingerprints |
| AsyncValueImpl 808-byte body backing every Pipe_ and Mutex_ value | AsyncValue and BLAKE3 Interning |
| Diagnostic ABI, argument buffer, source-location formatting | Diagnostic ABI and Helpers |
| Pass-failure handshake between pass manager and verifiers | Pass-Failure Handshake |
Substrate Invariants
Three invariants tie the deep pages together. They are the assumptions every dialect and every pass relies on; violating any one of them is a substrate-level bug that surfaces far from the violation site.
Uniqued payloads are immutable. Types, attributes, locations, affine maps, and most dialect-specific values pass through the storage uniquer once and are referenced by pointer identity for the rest of the compiler's life. Mutating a uniqued payload after construction breaks every equality test, every keyed map, and every cache that depends on it — and the storage uniquer makes no copy.
TypeID is by pointer-identity. Every concrete type, attribute, interface, and pattern owns a TypeID whose address is its identity. Dispatch walkers compare against the address, not against the byte at the address, and the address must be stable for the lifetime of the MLIRContext. Two materialisations of the same TypeID — one through the Idiom-1 static sentinel and one through the Idiom-2 Meyers cache — live in different .bss bands, but a single TypeID never moves between them.
Operation header offsets are part of the binary contract. Reverse-engineering notes name byte offsets to identify behaviour because every walker, verifier, and canonicaliser in the binary reads the same offsets. A reimplementation that changes the header layout without updating every consumer breaks dispatch silently — the kindPtr at *(qword*)(op + 48) + 16 is the canonical example.
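The first two invariants can be illustrated with a toy interning pool. This sketch is not the MLIR StorageUniquer; the class names and the "tile" kind are hypothetical, and Python object identity stands in for pointer identity.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True, eq=False)  # eq=False: equality is identity, as with pointers
class UniquedType:
    kind: str
    params: tuple

class StorageUniquer:
    """Toy pool: one object per distinct (kind, params) key."""

    def __init__(self):
        self._pool = {}

    def get(self, kind: str, *params) -> UniquedType:
        key = (kind, params)
        found = self._pool.get(key)
        if found is None:
            found = UniquedType(kind, params)
            self._pool[key] = found
        return found
```

Because the pool is keyed on the payload at construction time, mutating a payload afterwards would leave it filed under a stale key; freezing the dataclass makes that mistake fail loudly instead of corrupting lookups.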
Cross-Cutting Threads
Three substrate threads run through several deep pages and earn their own pointers here.
The 808-byte AsyncValueImpl body is what the scheduler's Pipe_ and Mutex_ constructors allocate; the scheduler-side companion is Pipe_ and Mutex_ Value-Header Layout. The body itself lives in AsyncValue and BLAKE3 Interning.
The SwissTable family — distinct from DenseMap by its fmix64 mixer and 16-byte control-byte groups — is exclusive to the scheduler in this binary. Container Fingerprints covers the layout; Modulo Driver and OR-Chain is the highest-traffic consumer.
The TypeID idioms back every dispatch in the binary. TypeID Sentinels and Anchors covers both idioms; the AnalysisManager slot that caches the scheduler's ScheduleAnalysis is one example of the Meyers idiom in action.
Cross-links
- Operation Layout
- Storage Uniquer and Context Impl
- Pattern Vtables and Shapes
- Interface Vtables
- TypeID Construction Idioms
- TypeID Sentinels and Anchors
- Container Fingerprints
- AsyncValue and BLAKE3 Interning
- Diagnostic ABI and Helpers
- Pass-Failure Handshake
MLIR Operation Layout
Abstract
Every MLIR Operation* in tileiras is a fixed 0x48-byte header followed by a contiguous
TrailingObjects run — inline result storage, the operand slab, regions, successors, and the
attribute-dictionary slot. The header is constant across every dialect linked into the binary
(cuda_tile, nv_tileas, nv_tileaa, cute, cute_nvgpu, cutlass, nvvm, llvm); per-op
variation lives entirely in the trailing area. Kind dispatch is a single pointer-identity compare
against the interned OperationName slot at +0x40, the operand count is masked with 0x7FFFFFF
on every load, and the canonical TrailingObjects decoder at sub_4492630 is seven lines of
arithmetic with one branch.
Fixed Header
typedef struct Operation {
/*+0x00*/ Block *block; // parent block, ilist owner
/*+0x08*/ Region *regions_inline; // first inline region (single-region ops)
/*+0x10*/ Operation *parent; // parent operation, IRObjectWithUseList base
/*+0x18*/ OpOperand *first_use; // head of this op's result use-list
/*+0x20*/ OperandStorage *operand_storage; // pointer to OpOperand[] / resizable slab
/*+0x28*/ uint32_t num_operands; // walkers AND with 0x7FFFFFF before use
/*+0x2C*/ uint32_t num_results : 23; // low 23 bits of the packed dword = numResults (mask 0x7FFFFF)
// The upper 9 bits are the "flags" half of the packed word, split as follows:
/*+0x2E*/ uint32_t trailing_result_flag : 1; // dword bit 23 = bit 7 of byte +0x2E, 16-byte stride gate
/*+0x2F*/ uint32_t num_inline_results : 8; // byte +0x2F: small-count slot, scaled by 8 in decoder
/*+0x38*/ Location loc; // source location pointer
/*+0x40*/ OperationName name; // interned pointer — identity dispatch key
} Operation; // sizeof == 0x48
Kind dispatch lives in one field: +0x40. The OperationName there is an interned record
pointer whose address identifies the op kind — not its mnemonic, not a hash. Every walker,
canonicalizer, and verifier in the binary compares op->name to entries in the &unk_5B44... /
&unk_5B45... / &unk_5BE6... slot banks with a plain MOV+CMP. No string compare, no hash
lookup, no indirect call on the hot path. Slot interning is documented in
Storage Uniquer and Context Impl; the sentinel records
themselves are catalogued in TypeID Sentinels and Anchors.
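Every load of num_operands at +0x28 is followed by a mask against 0x7FFFFFF. The mask keeps the
low 27 bits as the operand count and strips the upper five (bits 27 through 31), which carry
per-op flags. The notes name 0x4000000 as HasDebugValue and 0x40 as NoFPExcept, but both values
fall inside the 27-bit count field, so those two assignments should be treated as unconfirmed; the
remaining flag bits are dialect-specific. The 27-bit mask appears verbatim in 28 distinct functions
and is the single sharpest fingerprint for MLIR-walker sites in stripped tileiras code: load a DWORD
at +0x28, AND it with 0x7FFFFFF, and you are operating on a mlir::Operation.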
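The load-and-mask idiom can be sketched in a few lines of C. This is a minimal illustration, assuming only the +0x28 offset and the 0x7FFFFFF mask stated above; the helper names (op_raw_operand_word, op_operand_count, op_operand_flag_bits) are hypothetical and do not correspond to symbols in the binary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offset and mask as documented above; every name here is illustrative. */
enum { OP_NUM_OPERANDS_OFFSET = 0x28 };
#define OP_OPERAND_COUNT_MASK 0x7FFFFFFu

/* Read the raw DWORD at +0x28 without assuming anything about alignment. */
static uint32_t op_raw_operand_word(const void *op) {
    uint32_t word;
    assert(op != NULL);
    memcpy(&word, (const unsigned char *)op + OP_NUM_OPERANDS_OFFSET, sizeof word);
    return word;
}

/* Low 27 bits: the operand count that walkers use. */
static uint32_t op_operand_count(const void *op) {
    return op_raw_operand_word(op) & OP_OPERAND_COUNT_MASK;
}

/* Bits 27..31: whatever the mask strips off. */
static uint32_t op_operand_flag_bits(const void *op) {
    return op_raw_operand_word(op) >> 27;
}
```

A reimplementation would normally read the field through a typed struct; the byte-offset form is used here only because it mirrors what a decompiler shows at these sites.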
TrailingObjects Decoder
The canonical decoder lives at sub_4492630. Seven lines of source, one branch, one alignment
round-up:
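uintptr_t getTrailingStorage(const Operation *op) {
// The one branch: a result-less op returns zero outright (low 23 bits of the +0x2C word).
if ((*(const uint32_t *)((const char *)op + 0x2C) & 0x7FFFFF) == 0)
return 0;
uintptr_t hdr = (uintptr_t)op;
uint8_t trailing = *((const uint8_t *)op + 0x2E) >> 7; // 0 or 1
uint8_t n_inline = *((const uint8_t *)op + 0x2F); // small-count
// Skip the result prologue, then round up to the next 8-byte boundary.
uintptr_t base = (hdr + 16 * trailing + 8 * n_inline + 64 + 7) & ~(uintptr_t)7;
uint32_t n_ops = *(const uint32_t *)((const char *)op + 0x28) & 0x7FFFFFF;
return base + 32 * n_ops; // first byte after the OpOperand slab
}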
The +64 term is sizeof(Operation header) - 8 rounded into the alignment math; combined with the
& ~7 mask it lands on the first 8-byte boundary after the inline-result prologue. The 32-byte
stride applied to n_ops is sizeof(OpOperand) — forward link, backward link, owning operation
pointer, and Value, eight bytes each. The decoder returns the address of the region slab; callers
that want operands step back by 32 * n_ops, while callers that want successors or the attribute
dictionary step forward past the region slots.
Trailing storage follows canonical upstream order. Inline results lead when trailing_result_flag == 0; otherwise an outline-result prologue (one 8-byte cell per num_inline_results) precedes the
operand slab. The slab follows, aligned to 8 bytes. After 32 * num_operands bytes of OpOperand
storage come 24 * num_regions bytes of Region slots (the stride is confirmed by the for ( i = v4 + 24 * v5; i != v4; v4 += 24 ) walk in sub_7C6150), then 24 * num_successors bytes of
BlockOperand, then the trailing attribute-dictionary pointer. sub_4492630 itself never decodes
the split between num_regions, num_successors, and the upper flag bits of +0x2C; that lives in
the per-op accessors the ODS generator inlines into each builder.
struct OpOperand { // sizeof == 32
/*+0x00*/ Value value; // SSA operand, tagged pointer
/*+0x08*/ OpOperand *next; // ilist next in def's use-list
/*+0x10*/ OpOperand *prev; // ilist prev
/*+0x18*/ Operation *owner; // back-pointer to enclosing Operation
};
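The trailing-storage order described above can be turned into a small offset calculator. The sketch below assumes exactly the strides given in the text (16 and 8 bytes for the result prologue, 32 bytes per OpOperand, 24 bytes per region slot and per BlockOperand); the TrailingLayout struct and trailing_layout function are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses of each trailing slab, derived from the documented strides. */
typedef struct TrailingLayout {
    uintptr_t operands;   /* start of the OpOperand slab */
    uintptr_t regions;    /* the value the documented decoder returns */
    uintptr_t successors; /* start of the BlockOperand slots */
    uintptr_t attr_dict;  /* trailing attribute-dictionary pointer */
} TrailingLayout;

static TrailingLayout trailing_layout(uintptr_t hdr, unsigned trailing_flag,
                                      unsigned n_inline, uint32_t n_operands,
                                      uint32_t n_regions, uint32_t n_successors) {
    TrailingLayout out;
    assert(trailing_flag <= 1);
    /* Result prologue, the +64 header term, then round up to 8 bytes. */
    out.operands = (hdr + 16u * trailing_flag + 8u * n_inline + 64 + 7)
                   & ~(uintptr_t)7;
    /* 32 bytes per OpOperand; the count is masked to 27 bits. */
    out.regions = out.operands + 32u * (uintptr_t)(n_operands & 0x7FFFFFFu);
    /* 24 bytes per region slot, then 24 bytes per BlockOperand. */
    out.successors = out.regions + 24u * (uintptr_t)n_regions;
    out.attr_dict = out.successors + 24u * (uintptr_t)n_successors;
    return out;
}
```

For a header at 0x1000 with one inline result, two operands, and one region, the operand slab starts at 0x1048 and the region slab at 0x1088.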
The (num_results | flags) packed word at +0x2C doubles as the "has any result"
gate: sub_4492630 returns zero outright when (*(uint32_t *)(op + 44) & 0x7FFFFF) == 0, because a
result-less operation cannot have inline-result storage and the trailing area starts at the
aligned-up boundary anyway. The walker at sub_44924B0 uses the same 0x7FFFFF probe to decide
whether to descend through trailing objects. Both functions encode an identical contract: zero
results means the trailing area is degenerate and callers must compute their own bases.
Walker Contract
sub_447FBB0 is mlir::detail::walk_impl — 1242 LOC, lock-free, the iteration backbone behind
every walker vtable in the binary. No pthread_mutex_lock or pthread_rwlock_* call appears in
its body; it reads the 0x48-byte Operation header directly. The MLIR rule is single-threaded
walks, with concurrent passes using separate MLIRContext instances — which is why a walker
descending through 100k+ ops compiles down to such tight code.
The walker maintains a worklist of 40-byte frames on a stack-allocated SmallVector. Each frame
holds the current op, the user-supplied visitor vtable, the next region and block cursors, a phase
discriminator that distinguishes pre-order entry from child descent and post-order exit, and a
skip/interrupt flag word:
typedef struct WalkFrame {
/*+0x00*/ Operation *op; // current op
/*+0x08*/ const WalkVisitor *visitor; // user-supplied vtable
/*+0x10*/ Region *region_cursor; // next region to walk
/*+0x18*/ Block *block_cursor;
/*+0x20*/ uint32_t phase; // 0=pre, 1=children, 2=post
/*+0x24*/ uint32_t flags; // skip/interrupt
} WalkFrame; // 40 B
Dispatch is two levels of indirection through the visitor at frame +0x08. The walker reads the
visitor's vtable, then loads the per-op callback at slot +64, matches the loaded function pointer
against the op's interned name at header +0x40, and calls through. Pre-order entry and post-order
exit go through slots +16 and +24 of the same vtable. The pattern
*((void**)(*(void**)(frame[1])) + 64) is the fingerprint that identifies a walker callback site in
stripped tileiras code:
typedef uint32_t (*WalkCallback)(Operation *op);
uint32_t dispatch_per_op(const WalkFrame *frame) {
const WalkVisitor *v = frame->visitor; // frame[+0x08]
void *const *vt = *(void *const *const *)v; // visitor->vtable
WalkCallback cb = (WalkCallback)vt[8]; // vtable + 64 == slot 8
return cb(frame->op); // resolves via op->name (+0x40)
}
The binary ships three walker-vtable instantiations: sub_4481140 is the bare driver (the
canonical 7-LOC tail), sub_4481150 is the kind-filtered driver, and sub_4481220 is the
post-order driver. Each is a 3-slot vtable shaped {enter, leave, perOp}; the bare driver is the
smallest body and the cleanest reference for reimplementation.
The visitor callback returns one of four control words. 0 is continue — descend into regions and
blocks, then run post-order. 1 is skip-children — run post-order on the current frame but push no
child frames. 2 is interrupt — pop every frame and return to the caller. 3 is re-visit, used by
fixed-point rewrites to re-run the per-op callback after children complete. The verifier wires its
first-error path to the interrupt return, so a single failed invariant unwinds the entire walk in
one pop sweep.
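The control-word contract can be illustrated with a toy explicit-stack walker. This is a simplified sketch, not the layout of the 40-byte WalkFrame: it covers continue, skip-children, and interrupt only, leaves out post-order callbacks and the re-visit word, and uses a hypothetical ToyOp tree in place of regions and blocks. All names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

enum WalkResult { WALK_CONTINUE = 0, WALK_SKIP_CHILDREN = 1, WALK_INTERRUPT = 2 };

typedef struct ToyOp {
    int id;
    struct ToyOp **children;
    size_t num_children;
} ToyOp;

typedef enum WalkResult (*ToyCallback)(ToyOp *op, void *user);
typedef struct ToyFrame { ToyOp *op; size_t next_child; } ToyFrame;
enum { TOY_MAX_DEPTH = 64 };

/* Pre-order walk over an explicit frame stack; no recursion, no locks. */
static enum WalkResult toy_walk(ToyOp *root, ToyCallback cb, void *user) {
    ToyFrame stack[TOY_MAX_DEPTH];
    size_t depth = 0;
    enum WalkResult r = cb(root, user);
    if (r == WALK_INTERRUPT) return WALK_INTERRUPT;
    if (r == WALK_SKIP_CHILDREN) return WALK_CONTINUE;
    stack[depth++] = (ToyFrame){ root, 0 };
    while (depth > 0) {
        ToyFrame *top = &stack[depth - 1];
        if (top->next_child == top->op->num_children) { --depth; continue; }
        ToyOp *child = top->op->children[top->next_child++];
        r = cb(child, user);
        if (r == WALK_INTERRUPT) return WALK_INTERRUPT; /* drop every frame */
        if (r == WALK_CONTINUE) {                       /* skip pushes nothing */
            assert(depth < TOY_MAX_DEPTH);
            stack[depth++] = (ToyFrame){ child, 0 };
        }
    }
    return WALK_CONTINUE;
}

/* Sample visitor: records visit order, skips or stops at configured ids. */
typedef struct ToyLog { int ids[16]; size_t count; int skip_id; int stop_id; } ToyLog;

static enum WalkResult toy_record(ToyOp *op, void *user) {
    ToyLog *log = user;
    if (log->count < 16) log->ids[log->count++] = op->id;
    if (op->id == log->stop_id) return WALK_INTERRUPT;
    if (op->id == log->skip_id) return WALK_SKIP_CHILDREN;
    return WALK_CONTINUE;
}
```

Returning early on interrupt is the toy equivalent of the single pop sweep: nothing above the interrupted frame is visited again.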
The walker performs exactly four header reads on each iteration, touching three fields: name at
+0x40 for kind dispatch, the 27-bit operand count via *(uint32_t *)(op + 0x28) & 0x7FFFFFF, the
first inline region pointer at +0x08, and name at +0x40 a second time for the interned-slot
identity used by the kind-filtered driver, compared pointer-equal against the &unk_5B... slot bank.
All four reads are plain MOV+CMP; no string compare, no hash, and no indirect call fires until the
per-op callback at vtable +64 is invoked.
Pointer-Identity Dispatch
sub_447FBB0 is the canonical example of MLIR kind dispatch in this binary. It loads op->name
from +0x40 and compares the pointer against entries in the slot banks at &unk_5B44...,
&unk_5B45..., and &unk_5BE6.... Each comparison is one MOV plus one CMP; the dispatch tree
is a chain of conditional branches over interned addresses, with no string compare and no vtable
lookup on the fast path. Pattern matching over MLIR ops in stripped tileiras code is cheap enough to
inline into hot canonicalizers because the kind check fits in two instructions and the
trailing-object decoder fits in seven.
Pattern-matching helpers wrap the same idiom. The isa<OpT> shape from Pattern Vtables and Shapes
loads +0x40, compares against the interned slot for OpT::getOperationName(), and falls through
to a no-op or into the matched-pattern body. Diagnostic helpers such as sub_446EC50
(Operation::emitOpError) reach the same field to spell the op's mnemonic in the error prefix
before returning a builder for the caller to append to.
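A minimal sketch of pointer-identity dispatch follows. It assumes nothing about the real slot banks: ToyOpName, ToyOperation, and the three static records are hypothetical stand-ins for interned OperationName records, and the only point is that the kind check compares addresses while the mnemonic is read solely for diagnostics.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for an interned name record; its address is the op kind. */
typedef struct ToyOpName { const char *mnemonic; } ToyOpName;

static const ToyOpName TOY_ADD_NAME      = { "toy.add" };
static const ToyOpName TOY_ADD_NAME_COPY = { "toy.add" }; /* same text, other address */
static const ToyOpName TOY_MUL_NAME      = { "toy.mul" };

typedef struct ToyOperation { const ToyOpName *name; } ToyOperation;

/* isa-style check: one load and one pointer compare, no string compare. */
static bool toy_isa(const ToyOperation *op, const ToyOpName *kind) {
    assert(op != NULL && kind != NULL);
    return op->name == kind;
}

/* Diagnostics are the only path that needs the spelled mnemonic. */
static const char *toy_mnemonic(const ToyOperation *op) {
    return op->name->mnemonic;
}
```

Two records with identical text are still different kinds, which is why interning has to hand out exactly one record per operation name.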
Accessor Map
The accessor surface in the binary maps cleanly onto upstream MLIR methods, with the canonical offset loads shown below.
| Binary thunk | Upstream equivalent | Offsets read |
|---|---|---|
| sub_446E0D0 | Operation::getOperation(), identity thunk | returns a1 |
| sub_446E0E0 | Operation::getOperation() const overload | returns a1 |
| sub_4492630 | TrailingObjectsImpl::getTrailingObjects<Region>() | +0x28, +0x2C, +0x2E, +0x2F |
| sub_44924B0 | Operation::walk() body, descends through trailing objects | +0x2C masked with 0x7FFFFF |
| sub_446EC50 | Operation::emitOpError(), diagnostic builder | +0x40 (OperationName) |
| sub_44499A0 | MLIRContext DenseMap probe (operation-name / TypeID lookup) | context table, not header |
| sub_447FBB0 | walker / pattern driver, lock-free | +0x40 against sentinel slot banks |
The two getOperation identity thunks exist so that templated ODS code can call
op.getOperation() uniformly whether op is a concrete OpT wrapper or a raw Operation*. Both
return their argument verbatim.
Invariants
The 0x48-byte size is fixed by the upstream MLIR contract and is not negotiable for any dialect that
participates in the shared infrastructure. The five fields that any reimplementation must place at
the documented offsets are num_operands at +0x28, the packed (num_results | flags) word at
+0x2C (with the trailing-result bit at +0x2E, bit 7), the inline-result count at +0x2F, and
the interned OperationName pointer at +0x40. The 27-bit operand mask must be 0x7FFFFFF; the
32-byte OpOperand stride and the 64-byte alignment base in sub_4492630 follow from the upstream
IROperand<OpOperand, Value> layout and the TrailingObjects alignment policy.
How to Recognize in a Binary
Three independent fingerprints identify the Operation header path:
- The 27-bit mask 0x7FFFFFF AND-ed with a 32-bit load from +0x28 of any object is the most distinctive single signature. The mask appears verbatim in 28 distinct functions; any function that performs this load-and-mask is operating on a mlir::Operation.
- The seven-line getTrailingStorage shape at sub_4492630, (hdr + 16 * trailing + 8 * n_inline + 64 + 7) & ~7 followed by 32 * n_ops, identifies the canonical trailing-object decoder. The 32-byte stride is sizeof(OpOperand) and the & ~7 mask is the alignment policy.
- The visitor-vtable callback site *((void**)(*(void**)(frame[1])) + 64) from the walker body is the third fingerprint. Two indirections followed by a +64 offset (slot 8 of a 3-slot vtable) is always a per-op walker callback.
Consumers
Every walker, verifier, canonicaliser, and pattern matcher in the binary reads this header. The
walker driver sub_447FBB0 is the iteration backbone documented above; the storage uniquer in
Storage Uniquer and Context Impl is the source of the
OperationName pointer at +0x40; the pattern application drivers in
Pattern Vtables and Shapes read the +0x40 slot to dispatch
matching patterns; and the diagnostic constructor at sub_446EC50
(Diagnostic ABI and Helpers) reads +0x40 to spell the op
mnemonic in error prefixes.
Cross-references: Storage Uniquer and Context Impl for
how OperationName slots are interned; ISel DAG and Matcher Table
for the same 0x7FFFFFF mask reused on backend SDNode;
TypeID Sentinels and Anchors for the slot bank that backs
+0x40; and Common Compiler Patterns and Idioms
for the TrailingObjects shape catalogued alongside the other recurring tileiras idioms.
Storage Uniquer and Context Implementation
Abstract
Every uniqued value in TileIR — every Type, Attribute, Location, Identifier, AffineExpr, AffineMap, and IntegerSet — is interned through a single 9 630-byte gateway. The function lives at sub_4497E40, 534 basic blocks of mostly duplicated insert paths, and it is reached from more than 700 call sites, approximately one per registered uniqued class. Calling it twice with the same (MLIRContextImpl*, TypeID, hash, equality) tuple returns the same BaseStorage*. Calling it with a fresh key allocates a 32-byte ThreadSafeRefCountedBase-shaped storage object, publishes it into the right hash table, and returns the new canonical pointer.
What follows is the algorithm at reimplementation grade: the two-level hash table behind uniquing, the per-class allocator that owns Level-2, the compare-and-swap that publishes Level-2 into Level-1, the refcount transitions on storage objects, the thread-local cache that skips every lock on the common case, and the lock order that keeps the slow path safe under MLIR's full thread-safe context.
Two-Level Intern Table
Two hash tables stack: Level-1 keys on a TypeID singleton — the address of a per-class sentinel in .data.rel.ro such as &unk_5B37828 for cuda_tile::TileType or &unk_5B377F0 for a representative attribute class — and stores a pointer to that class's StorageAllocator, an 88-byte structure that owns Level-2. Level-2 keys on the caller-supplied 32-bit hash plus a caller-supplied equality predicate; its values are the BaseStorage* objects returned to user code.
| Level | Key | Hash input | Value | Container |
|---|---|---|---|---|
| 1 | TypeID sentinel pointer (a2) | sentinel address | StorageAllocator* | per-Context bucket array |
| 2 | structural key blob (a5) | caller hash (a3) | BaseStorage* | per-class bucket array |
Both levels run the same machinery. One hash family: the canonical LLVM DenseMapInfo<void*>::getHashValue seed ((uintptr_t)key >> 9) ^ ((uintptr_t)key >> 4). One collision strategy: open addressing with an incrementing probe step (the step starts at 1 and grows by one after each miss, as in upstream DenseMap), with the bucket count kept a power of two so the index is (N - 1) & h. One sentinel pair at 16-byte slot pitch: 0xFFFFFFFFFFFFF000 ((void*)-4096) marks EMPTY, 0xFFFFFFFFFFFFE000 ((void*)-8192) marks TOMBSTONE.
The probe seed and the sentinel pair together are a hard fingerprint for upstream LLVM DenseMap and MLIR's StorageUniquer. Sharding does the rest of the work: the 16-byte slot pitch combined with per-TypeID Level-2 tables — cuda_tile, nv_tileas, nv_tileaa, cute, cute_nvgpu, cutlass, nvvm, llvm, builtin, func, arith, scf, vector, memref, cf, math, pdl, pdl_interp, plus roughly 30 attributes per dense dialect — keeps probe chains short even on enormous IRs.
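The hash seed and the sentinel pair are easiest to see in a stripped-down pointer table. The sketch below is an assumption-laden toy: it uses a fixed 64-slot array of bare pointers instead of the 16-byte slots, follows the probe stepping shown in the gateway pseudocode (the step grows by one after each miss), and relies on the table always keeping at least one EMPTY slot. The names pointer_hash, table_init, and table_find_slot are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY     ((void *)(intptr_t)-4096)
#define SLOT_TOMBSTONE ((void *)(intptr_t)-8192)
enum { TABLE_SIZE = 64 }; /* power of two, so (N - 1) works as the index mask */

/* DenseMapInfo<void*>-style seed: two shifted copies of the pointer, XOR-ed. */
static uint32_t pointer_hash(const void *key) {
    uintptr_t v = (uintptr_t)key;
    return (uint32_t)((v >> 9) ^ (v >> 4));
}

static void table_init(void *slots[TABLE_SIZE]) {
    for (size_t i = 0; i < TABLE_SIZE; ++i) slots[i] = SLOT_EMPTY;
}

/* Returns the slot holding `key`, or the slot where it should be inserted.
 * The first tombstone seen is preferred over the terminating EMPTY slot. */
static void **table_find_slot(void *slots[TABLE_SIZE], void *key) {
    uint32_t mask = TABLE_SIZE - 1;
    uint32_t idx = pointer_hash(key) & mask;
    uint32_t step = 1;
    void **tomb = NULL;
    for (;;) {
        void **s = &slots[idx];
        if (*s == key) return s;
        if (*s == SLOT_EMPTY) return tomb ? tomb : s;
        if (*s == SLOT_TOMBSTONE && !tomb) tomb = s;
        idx = (idx + step++) & mask;
    }
}
```

Because the sentinels are ordinary pointer values, a probe needs no separate occupancy bitmap: one compare per slot distinguishes hit, empty, and tombstone.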
MLIRContextImpl Layout
The Level-1 array lives inside MLIRContextImpl, a 576-byte (0x240) object allocated by sub_445EDD0. Its first qword is the vtable pointer &off_5A2CA80, which is the central anchor for "this is an MLIRContext" in the binary. The fields read by the uniquer are:
struct MLIRContextImpl { /* 0x240 bytes */
/*+0x000*/ void **vtable; /* off_5A2CA80 */
/*+0x010*/ atomic_uint64_t context_id;
/*+0x040*/ DialectTable *registered_dialects;
/*+0x080*/ OpRegistry *registered_op_table;
/*+0x110*/ AttributeRegistry *registered_attr_table;
/*+0x180*/ TypeRegistry *registered_type_table;
/*+0x1B0*/ Level1Slot *type_uniquer_buckets;
/*+0x1C0*/ uint32_t type_uniquer_size; /* power of two, >= 64 */
/*+0x260*/ Level1Slot *attr_uniquer_buckets;
/*+0x270*/ uint32_t attr_uniquer_size;
/*+0x278*/ pthread_mutex_t allocator_mutex; /* 40 B, *ctx + 632 */
/*+0x2B0*/ AffineUniquerState *affine_uniquer_state; /* *ctx + 688 */
/* ... diagnostic handler chain, interface tables, dialect hooks ... */
};
struct Level1Slot { /* 16 B */
/*+0x00*/ TypeID *type_id; /* the sentinel address, or EMPTY/TOMBSTONE */
/*+0x08*/ StorageAllocator *impl; /* Level-2 handle, CAS target */
};
Five helper routines reach into MLIRContextImpl — callers inline them or share them. sub_445B3C0 is the insertOrLookup<ImmortalStorage<ArrayRef<Storage*>>> shim that other passes use to dedup-then-intern small pointer arrays. sub_447FBB0 is the 1242-line walk_impl that drives the operation walker. sub_4458150 is the 4-way unrolled tail of an LLVM DenseSet::find used on hot dispatch paths. sub_445F520 and sub_4461BA0 are getRegisteredType and getRegisteredAttribute; both consult the context allocator mutex when they need to publish a new descriptor.
Do not confuse the allocator mutex at *ctx + 632 with the diagnostic-handler mutex earlier in the structure. The allocator mutex guards Level-1 mutation alone: live-count and tombstone bookkeeping, the resize-in-place dance, and the narrow window where a thread has decided a TypeID has no Level-2 slot and is about to publish one.
StorageAllocator Layout
One StorageAllocator per registered class: 88 bytes holding the Level-2 bucket array, its free list, its load-factor counters, and the rwlock that synchronises Level-2 readers and writers.
struct StorageAllocator { /* 0x58 bytes */
/*+0x00*/ StorageEntry *buckets; /* open-addressed table, 16 B slots */
/*+0x08*/ void *freelist; /* zeroed on resize */
/*+0x10*/ uint32_t live_count;
/*+0x14*/ uint32_t tombstone_count;
/*+0x18*/ uint32_t bucket_count; /* power of two, >= 64 */
/*+0x1C*/ uint32_t resize_threshold;
/*+0x20*/ pthread_rwlock_t lock; /* 56 B, Level-2 readers/writer */
};
struct StorageEntry { /* 16 B */
/*+0x00*/ uint32_t hash_key; /* caller-supplied a3 */
/*+0x04*/ uint32_t pad0;
/*+0x08*/ uint64_t ptr_or_sentinel; /* BaseStorage*, -4096 EMPTY, -8192 TOMBSTONE */
};
sub_44A8C20(0x58) allocates the allocator and zeroes it before publish. sub_45603F0(16 * N, 8) allocates its buckets, then a tight loop walks the buffer at 16-byte stride writing hash_key = 0 and ptr_or_sentinel = -4096. Insert-path specialisation duplicates the same loop body more than ten times across sub_4497E40; the literal -4096 appears 47 times in the body, -8192 40 times.
BaseStorage Layout
A uniqued storage object is 32 bytes of ThreadSafeRefCountedBase with the MLIR storage payload tacked onto the same block. Its vtable is the global off_59A4108.
struct BaseStorage { /* 0x20 bytes */
/*+0x00*/ void **vtable; /* off_59A4108 */
/*+0x08*/ int32_t strong_count; /* init 1 — owned-by-uniquer */
/*+0x0C*/ int32_t weak_count; /* init 1 — handed to caller */
/*+0x10*/ void *payload; /* class-specific, zero at init */
/*+0x18*/ uint8_t flags; /* bit 0 = "owned by uniquer" */
/*+0x19*/ uint8_t pad[7];
};
Vtable slot 2 is the deleter, slot 3 the full destructor. The deleter fires when strong_count drops to zero through _InterlockedExchangeAdd(&strong, -1), dispatched as (*(**vtable + 16))(obj); the destructor fires when weak_count drops to zero, dispatched as (*(**vtable + 24))(obj). Both counts initialise to 1 because the uniquer holds the strong reference (it caches the object) and hands the caller a weak reference through the bucket entry. The flags byte at +0x18 is set to 1 right after a successful insert — the "installed in cache" marker that prevents the deleter from running while the publish is still in flight.
The getOrCreate Gateway
The full signature of the gateway is:
__int64 sub_4497E40(
MLIRContextImpl **uniquer_pp, /* a1 */
TypeID *type_id, /* a2 — sentinel pointer */
uint32_t hash, /* a3 — precomputed 32-bit key hash */
bool (*equals)(uintptr_t, uintptr_t, uintptr_t),
/* a4 — equality predicate */
void *key_ctx, /* a5 — KeyTy pointer / equality context */
void *alloc_ctx, /* a6 — opaque, forwarded to ctor */
__m128i pack); /* a7 — pack.lo = ctor*, pack.hi = KeyTy blob */
The __m128i at a7 is loaded into a single SSE register on entry and split at the two construction sites: the low qword is a pointer to the storage constructor and the high qword is the key blob handed to that constructor. This packing matches upstream mlir::detail::StorageUniquer::getOrCreate<Storage>(KeyTy), where the StorageAllocator and KeyTy are forwarded to Storage::construct.
The full algorithm, with the duplicated insert bodies collapsed into a single representative path:
BaseStorage *get_or_create(MLIRContextImpl **uniquer_pp,
TypeID *tid,
uint32_t hash,
equals_fn equals,
void *key_ctx,
void *alloc_ctx,
__m128i pack)
{
MLIRContextImpl *U = *uniquer_pp;
BaseStorage *result = NULL; /* filled in by the TLS cache probe below */
uint32_t N1 = U->type_uniquer_size;
/* ---------- Level-1 probe: TypeID -> StorageAllocator ---------- */
if (N1 == 0) {
grow_level1(U, /*new_count=*/64); /* min bucket count is 64 */
N1 = U->type_uniquer_size;
}
uint32_t h1 = ((uintptr_t)tid >> 9) ^ ((uintptr_t)tid >> 4);
uint32_t mask = N1 - 1;
Level1Slot *buckets = U->type_uniquer_buckets;
Level1Slot *tomb = NULL;
uint32_t step = 1;
uint32_t idx = mask & h1;
for (;;) {
Level1Slot *s = &buckets[idx];
if (s->type_id == tid) break; /* hit */
if ((uintptr_t)s->type_id == -4096) goto l1_insert; /* EMPTY */
if ((uintptr_t)s->type_id == -8192 && !tomb) tomb = s; /* first tombstone wins */
idx = mask & (idx + step);
++step;
}
StorageAllocator *impl = buckets[idx].impl;
goto l2_entry;
l1_insert:
/* Load-factor 3/4 trigger and 1/8 tombstone-density trigger.
* On grow, next-pow2(2N - 1) via the inline 5-round bit-fill,
* clamped to a minimum of 64. On rehash-in-place, same size. */
pthread_mutex_lock(&U->allocator_mutex);
uint32_t live = ++U->live_count;
if (4 * live >= 3 * N1) {
uint32_t new_N = next_pow2(2 * N1 - 1);
if (new_N < 64) new_N = 64;
rehash_level1(U, new_N);
} else if (N1 - U->tombstone_count - live <= N1 / 8) {
rehash_level1(U, N1); /* same size, drops tombs */
}
Level1Slot *seat = tomb ? tomb : &buckets[idx];
seat->type_id = tid;
StorageAllocator *fresh = sub_44A8C20(0x58); /* 88-byte calloc */
memset(fresh, 0, 0x58);
/* CAS-publish the StorageAllocator. If another thread won the race,
* free the loser and use the winner. The CAS happens with the allocator
* mutex held; the mutex protects bookkeeping, the CAS protects publish. */
StorageAllocator *winner = (StorageAllocator *)
_InterlockedCompareExchange64(&seat->impl, (int64_t)fresh, 0);
impl = winner ? winner : fresh;
if (winner) {
sub_4560420(fresh->buckets, 16 * fresh->bucket_count, 8);
free(fresh);
}
pthread_mutex_unlock(&U->allocator_mutex);
l2_entry:
/* ---------- TLS cache fast path ---------- */
if (tls_cache_hit(key_ctx, impl, hash, &result)) {
return result; /* no locks, no atomics */
}
/* ---------- Level-2 probe under per-class rwlock ---------- */
pthread_rwlock_rdlock(&impl->lock);
BaseStorage *hit = level2_probe_read(impl, hash, equals, key_ctx);
if (hit) {
pthread_rwlock_unlock(&impl->lock);
tls_cache_install(key_ctx, impl, hash, hit);
return hit;
}
pthread_rwlock_unlock(&impl->lock);
/* Upgrade to write and re-probe; another thread may have inserted. */
pthread_rwlock_wrlock(&impl->lock);
hit = level2_probe_read(impl, hash, equals, key_ctx);
if (hit) {
pthread_rwlock_unlock(&impl->lock);
tls_cache_install(key_ctx, impl, hash, hit);
return hit;
}
/* Resize Level-2 with the same load-factor and tombstone-density
* triggers as Level-1. Reuse the inline next-pow2 bit-fill. */
if (4 * (impl->live_count + 1) >= 3 * impl->bucket_count) {
rehash_level2(impl, next_pow2(2 * impl->bucket_count - 1));
} else if (impl->bucket_count - impl->tombstone_count - impl->live_count
<= impl->bucket_count / 8) {
rehash_level2(impl, impl->bucket_count);
}
/* Construct the storage object via the caller's ctor callback.
* In thread-safe contexts the allocator argument is the per-thread
* sub-allocator returned by sub_4496E20; in single-threaded mode it
* is the context itself. */
void *allocator_arg = thread_safe(U)
? sub_4496E20(uniquer_pp, alloc_ctx)
: (void *)U;
typedef BaseStorage *(*ctor_fn)(void *key, void *alloc, void *ctx);
const uint64_t *pack_q = (const uint64_t *)&pack; /* [0] = low qword, [1] = high qword */
ctor_fn ctor = (ctor_fn)pack_q[0];
void *key = (void *)pack_q[1];
BaseStorage *storage = ctor(key, allocator_arg, alloc_ctx);
/* Initialise the refcount header. The uniquer holds the strong ref
* (it caches the object); the caller is handed the weak ref. */
storage->vtable = &off_59A4108;
storage->strong_count = 1;
storage->weak_count = 1;
storage->payload = NULL;
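StorageEntry *seat2 = level2_seat_for(impl, hash);
seat2->hash_key = hash;
seat2->ptr_or_sentinel = (uint64_t)storage;
impl->live_count++; /* keep the load-factor bookkeeping in step */
storage->flags = 1; /* installed in cache: set only after the bucket write */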
pthread_rwlock_unlock(&impl->lock);
tls_cache_install(key_ctx, impl, hash, storage);
return storage;
}
The body is enormous because the inner insert is duplicated across combinations of {Level-1 resize / no-resize} × {Level-2 resize / no-resize} × {mutex / rwlock / single-threaded}. The pseudocode collapses those into one normal form; of the twelve possible combinations, the binary carries nine specialisations of the same insert block, each tuned for one combination of locks held and resize state.
Sentinels and the Inline next-pow2
The EMPTY and TOMBSTONE sentinels are the same constants at both levels and across every duplicated probe body:
#define DENSE_EMPTY ((void *)-4096) /* 0xFFFFFFFFFFFFF000 */
#define DENSE_TOMBSTONE ((void *)-8192) /* 0xFFFFFFFFFFFFE000 */
-4096 and -8192 are deliberate choices: both are page-aligned, both stand out against any heap pointer, and both compare cheaply against sign-extended 32-bit immediates. The same pair shows up in sub_117BB70, an unrelated DenseMap rehash body with 80-byte slots; the slot pitch differs because sub_117BB70 inlines its full keys while sub_4497E40 stores only pointers and a 32-bit hash.
The next-power-of-two routine is expanded inline at every grow site:
uint32_t next_pow2(uint32_t x) {
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return x + 1;
}
Any result smaller than 0x40 is clamped to 64. The minimum bucket count after any allocation is therefore always 64.
Resize Policy
Two independent triggers govern resize, applied identically at both levels:
| Trigger | Condition | Action |
|---|---|---|
| Load-factor | 4 * (live + 1) >= 3 * N | grow to next_pow2(2*N - 1), min 64 |
| Tombstone density | N - tombstones - (live + 1) <= N / 8 | rehash in place at the same size |
Load-factor resize keeps the probe chain expected-constant. Tombstone-density resize stops a delete-heavy workload from accumulating an unbounded chain of dead slots that every probe must scan through. A reimplementation that follows upstream StorageUniquer semantics (never delete, only allocate) exercises the load-factor trigger almost exclusively, because storage objects are immutable and only freed when the whole context dies.
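The two triggers and the inline bit-fill combine into one small decision function. The sketch below assumes the conditions exactly as tabulated, checks them in the order the gateway pseudocode does, and assumes live + tombstones stays below N as it would in a consistent table; ResizeAction and resize_action are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Five-round bit-fill: smallest power of two strictly greater than x. */
static uint32_t next_pow2(uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return x + 1;
}

typedef enum { RESIZE_NONE, RESIZE_GROW, RESIZE_REHASH_IN_PLACE } ResizeAction;

/* Decide what must happen before inserting one more entry into a table with
 * `n` buckets, `live` entries, and `tombstones` dead slots. */
static ResizeAction resize_action(uint32_t live, uint32_t tombstones,
                                  uint32_t n, uint32_t *new_n) {
    assert(new_n != NULL);
    *new_n = n;
    if (n == 0) {                        /* first use: minimum table */
        *new_n = 64;
        return RESIZE_GROW;
    }
    if (4u * (live + 1) >= 3u * n) {     /* load factor reaches 3/4 */
        uint32_t grown = next_pow2(2u * n - 1);
        *new_n = grown < 64 ? 64 : grown;
        return RESIZE_GROW;
    }
    if (n - tombstones - (live + 1) <= n / 8)  /* too few EMPTY slots left */
        return RESIZE_REHASH_IN_PLACE;
    return RESIZE_NONE;
}
```

With 64 buckets, the 48th insert is the one that trips the load-factor trigger and doubles the table to 128.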
Compare-And-Swap on Level-1 Publish
A single _InterlockedCompareExchange64(&seat->impl, fresh, 0) installs the new StorageAllocator into Level-1. The CAS races every other thread allocating the same TypeID for the first time: both see EMPTY at Level-1, both call sub_44A8C20(0x58), and both arrive at the CAS with a private allocator in hand. The winner installs its allocator and proceeds to Level-2; the loser sees the winner's allocator in the CAS return value, frees its own through sub_4560420 and free, and proceeds against the winner.
This pattern is correct because Level-1 entries are write-once. Once a StorageAllocator is published into a Level-1 slot, the entry never changes — the TypeID is permanent and the allocator outlives the context. Level-2 is mutated forever, and that is why Level-2 is guarded by the per-class rwlock instead of CAS.
The CAS is wrapped in a broader region protected by the allocator mutex at *ctx + 632. The mutex is held while bookkeeping live_count and tombstone_count, while resizing Level-1, and across the CAS itself. The CAS is the synchronisation primitive that publishes the allocator; the mutex is the synchronisation primitive that keeps Level-1's metadata consistent. They are complementary, not redundant.
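The publish-or-adopt race can be sketched with C11 atomics. This is an illustration under assumptions, not the binary's code: it uses atomic_compare_exchange_strong, which reports success as a boolean and writes the observed value back into expected, whereas _InterlockedCompareExchange64 returns the previous value directly. The surrounding mutex is omitted, and ToyAllocator, ToySlot, and publish_or_adopt are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct ToyAllocator { int owner_tag; } ToyAllocator;
typedef struct ToySlot { _Atomic(ToyAllocator *) impl; } ToySlot;

/* Build a private allocator, try to publish it into the write-once slot,
 * and return whichever allocator ends up installed. */
static ToyAllocator *publish_or_adopt(ToySlot *slot, int owner_tag) {
    ToyAllocator *fresh = calloc(1, sizeof *fresh);
    assert(fresh != NULL);
    fresh->owner_tag = owner_tag;

    ToyAllocator *expected = NULL;   /* publish only if the slot is still empty */
    if (atomic_compare_exchange_strong(&slot->impl, &expected, fresh))
        return fresh;                /* won the race */

    free(fresh);                     /* lost: discard the private copy */
    return expected;                 /* now holds the winner's pointer */
}
```

The loser's cleanup is safe only because nothing else has seen its private allocator yet; once a pointer is published it must never be freed or replaced.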
Single-Threaded Collapse
Single-threaded builds dissolve the entire locking apparatus into plain loads and stores. The trick is a weak-symbol probe of &_pthread_key_create: glibc resolves it to a non-zero address when libpthread is loaded and to NULL otherwise.
That same probe gates every atomic op in sub_4497E40. pthread_mutex_lock / _unlock and pthread_rwlock_rdlock / wrlock all resolve to no-ops; _InterlockedCompareExchange64 collapses to a plain pointer store followed by a pointer load. The binary carries both expansions side-by-side, switched by a load of the weak symbol. The same gate guards every _InterlockedExchangeAdd on the strong and weak refcounts.
This is why a single-threaded tileiras invocation pays zero synchronisation cost for uniquing. The fast path really is a hash lookup and a pointer load, nothing else.
Thread-Local Cache
Both locks vanish from the fast path through a thread-local cache whose body is rooted at %fs:-584, with its initialised flag immediately below at %fs:-592. The cache holds the four most recently looked-up (KeyTy, StorageAllocator*) pairs; a hit returns the interned pointer with no atomic ops and no locks at all.
struct TlsCache { /* at %fs:-592 */
/*-592*/ bool initialised;
/*-584*/ uint32_t header; /* bucket_count_cache << 1 | tombstone_bit */
/*-580*/ uint32_t tombstone_count;
/*-576*/ void *cache_storage; /* inline 4-slot array */
/*-568*/ uint32_t live_count;
/*-560*/ CacheRow rows[4]; /* 40 B each */
};
On first use the cache registers a thread-exit destructor with sub_44A7D30(sub_44933E0, &tls[-584]), the moral equivalent of pthread_key_create(&key, sub_44933E0). The destructor walks the four cache rows on thread exit and decrements the weak refcount of each cached storage object so that the uniquer's strong references are correctly accounted.
The cache keys on (KeyTy, StorageAllocator*) rather than (KeyTy, TypeID) because the Level-1 CAS publish happens once per class and the resulting allocator pointer is stable for the life of the context. Caching on the allocator skips Level-1 entirely on every subsequent hit.
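A four-row thread-local cache keyed on the allocator pointer might look like the sketch below. Several things here are assumptions made for illustration: the row contents, the round-robin replacement policy, and comparing keys by pointer instead of through the caller's equality predicate. None of the names (ToyCacheRow, toy_cache_lookup, toy_cache_install) come from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyCacheRow {
    const void *allocator; /* per-class allocator: the stable cache key */
    uint32_t hash;         /* caller-supplied key hash */
    const void *key;       /* key identity (pointer compare in this toy) */
    void *storage;         /* interned result; NULL marks an unused row */
} ToyCacheRow;

enum { TOY_CACHE_ROWS = 4 };

/* One private copy per thread, so no lock or atomic is needed. */
static _Thread_local ToyCacheRow toy_cache[TOY_CACHE_ROWS];
static _Thread_local unsigned toy_cache_next;

static void *toy_cache_lookup(const void *allocator, uint32_t hash, const void *key) {
    for (size_t i = 0; i < TOY_CACHE_ROWS; ++i) {
        const ToyCacheRow *row = &toy_cache[i];
        if (row->storage && row->allocator == allocator &&
            row->hash == hash && row->key == key)
            return row->storage;
    }
    return NULL; /* miss: fall through to the locked Level-2 probe */
}

static void toy_cache_install(const void *allocator, uint32_t hash,
                              const void *key, void *storage) {
    toy_cache[toy_cache_next] = (ToyCacheRow){ allocator, hash, key, storage };
    toy_cache_next = (toy_cache_next + 1) % TOY_CACHE_ROWS; /* assumed policy */
}
```

Keying on the allocator pointer means a hit never has to consult Level-1, which matches the reasoning given above for preferring it over the TypeID.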
Refcount Transitions
Refcount transitions on BaseStorage go through _InterlockedExchangeAdd, treated as fetch-and-add since it returns the pre-update value. Both counters share the qword at +0x08 but are accessed as 32-bit subfields, so an atomic on either counter leaves the other undisturbed.
| Transition | Atomic | Trigger |
|---|---|---|
| Strong increment | _InterlockedExchangeAdd(&strong, +1) | hand-off to caller after insert |
| Strong decrement | _InterlockedExchangeAdd(&strong, -1) | uniquer evicts cached entry |
| Weak increment | _InterlockedExchangeAdd(&weak, +1) | caller stores a weak handle |
| Weak decrement | _InterlockedExchangeAdd(&weak, -1) | weak handle drops |
When a strong decrement returns 1 (pre-decrement), the deleter at vtable[2] fires via (*(**vtable + 16))(obj). When a weak decrement returns 1, the destructor at vtable[3] fires via (*(**vtable + 24))(obj). The flags byte at +0x18 is the "owned by uniquer" marker that prevents the deleter from running while the storage is mid-publish — the byte is set to 1 only after the Level-2 bucket has been written with the storage pointer, so an in-flight insert is never reachable from another thread before its refcount transitions become valid.
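The fetch-and-add convention can be shown with C11 atomics, where atomic_fetch_sub likewise returns the value held before the update. In this sketch the deleter and destructor are replaced by counters so the transitions are observable; ToyStorage and its helper functions are hypothetical, and the flags byte is left out.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct ToyStorage {
    atomic_int strong_count;
    atomic_int weak_count;
    int deleter_runs;     /* stands in for the vtable slot 2 call */
    int destructor_runs;  /* stands in for the vtable slot 3 call */
} ToyStorage;

static void toy_storage_init(ToyStorage *s) {
    atomic_init(&s->strong_count, 1); /* both counts start at 1 */
    atomic_init(&s->weak_count, 1);
    s->deleter_runs = 0;
    s->destructor_runs = 0;
}

static void toy_retain_strong(ToyStorage *s) {
    atomic_fetch_add(&s->strong_count, 1);
}

/* A pre-decrement value of 1 means this call dropped the last reference. */
static void toy_release_strong(ToyStorage *s) {
    if (atomic_fetch_sub(&s->strong_count, 1) == 1)
        s->deleter_runs++;
}

static void toy_release_weak(ToyStorage *s) {
    if (atomic_fetch_sub(&s->weak_count, 1) == 1)
        s->destructor_runs++;
}
```

Testing the returned pre-decrement value, not re-reading the counter afterwards, is what guarantees that exactly one thread observes the transition to zero.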
Lock Order and Concurrency Model
A complete get_or_create passes through three synchronisation domains, and the gateway always enters them in the same order:
| Order | Lock | Scope | Protects |
|---|---|---|---|
| 1 | TLS cache | per-thread | local 4-slot cache, no synchronisation needed |
| 2 | allocator_mutex | per-context | Level-1 bookkeeping and CAS publish window |
| 3 | StorageAllocator::lock | per-class | Level-2 buckets, refcount transitions |
The allocator mutex is held only on the slow path. The fast path — TLS hit or warm Level-1 plus warm Level-2 — never acquires it. Concurrent uniquers of different TypeIDs share no state once their Level-1 entries are published; they race only at Level-2 within their own class. Concurrent uniquers of the same TypeID synchronise through the per-class rwlock: readers probe under the rdlock, and a miss upgrades to wrlock with a mandatory re-probe to catch a competing insert.
The rwlock upgrade is not atomic — the gateway explicitly drops the read lock before requesting the write lock, and the re-probe under wrlock is what makes the design correct. A simple loop that holds the read lock and asks for the write lock would deadlock against another thread doing the same thing.
Caller Shape
Each of the 700+ callers is a tiny shim of roughly 1 KB. The shim's only job is to compute the 32-bit key hash, pack the constructor pointer and key blob into the __m128i, and tail-call sub_4497E40 with the right TypeID sentinel. A representative pattern, derived from five canonical shims (sub_6156C0, sub_6180E0, sub_618360, sub_6185E0, sub_61E800):
BaseStorage *get_or_create_TileType(MLIRContextImpl *ctx, Shape shape, ElementType elt) {
KeyTy key = pack_key(shape, elt);
uint32_t hash = ((uintptr_t)&key >> 9) ^ ((uintptr_t)&key >> 4);
/* hash is then mixed with the structural bytes of the key */
hash = mix_key_bytes(hash, &key, sizeof key);
__m128i pack;
pack.lo = (uint64_t)&TileTypeStorage_construct;
pack.hi = (uint64_t)&key;
return sub_4497E40(&ctx, &unk_5B37828 /* TileType TypeID */,
hash, &TileTypeStorage_equals, &key, ctx, pack);
}
The TypeID sentinel address is hard-coded per shim because the address is the identity. The constructor is a small helper that allocates 32 bytes via the StorageAllocator's bump-pointer allocator (separate from StorageAllocator::buckets — that is the hash table, not the storage region), copies the key's structural bytes into the payload, and returns the pointer to be installed in Level-2.
Interaction with the Rest of MLIR
sub_4497E40 is the shared backbone for every uniqued value in TileIR. The Type system uniques IntegerType, FloatType, MemRefType, cuda_tile::TileType, nv_tileaa::TokenType, cute::LayoutType, and so on. The Attribute system uniques StringAttr, ArrayAttr, DictionaryAttr, plus per-dialect dense attribute classes. The Location system uniques FileLineColLoc, NameLoc, CallSiteLoc, and FusedLoc. Identifier is a small wrapper around StringAttr that short-circuits to the same uniquer. AffineExpr, AffineMap, and IntegerSet each have their own TypeIDs and their own Level-2 tables but share the gateway. The internal DAG uniquers for the cuda_tile block and region trees also reach the gateway, transitively, through sub_445B3C0.
A reimplementation can choose a different table representation, but the contract is fixed: identity for uniqued objects is pointer equality, storage objects are immutable after publication, and the allocator that owns Level-2 outlives every storage object it allocates. Anything that breaks one of those invariants breaks every map, set, and pattern matcher that keys on Type or Attribute identity.
How to Recognize in a Binary
The gateway sub_4497E40 is identifiable from any of the following independent fingerprints:
- The combination of the EMPTY sentinel 0xFFFFFFFFFFFFF000 (47 occurrences) and the TOMBSTONE sentinel 0xFFFFFFFFFFFFE000 (40 occurrences) at 16-byte slot pitch is the strongest signal. The pair is unambiguous because both values are at the top of the unmapped address range and never collide with heap pointers.
- The inline pointer hash ((uintptr_t)k >> 9) ^ ((uintptr_t)k >> 4) appears at every Level-1 probe entry and at every Level-2 caller-supplied-hash mixer site. A function that materialises this two-shift XOR over a pointer-shaped operand is part of the uniquer family.
- The __m128i calling convention with pack.lo = ctor* and pack.hi = key_blob* distinguishes sub_4497E40 from any other variadic interner. Callers visibly pack two pointers into an SSE register before the call; the gateway splits them at the two construction sites.
- The 88-byte sub_44A8C20(0x58) allocation immediately followed by a per-class rwlock initialiser is the StorageAllocator constructor, the Level-2 owner allocated by the gateway's L1-insert path.
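The first two fingerprints can be tied together in a short sketch: a 16-byte slot, the two sentinel constants, and the two-shift pointer hash used as the probe seed. The Slot struct, the function names, and the linear probe step are assumptions made for brevity (the text does not specify the probe sequence), and the constants assume a 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY     ((uintptr_t)0xFFFFFFFFFFFFF000ull)
#define SLOT_TOMBSTONE ((uintptr_t)0xFFFFFFFFFFFFE000ull)

typedef struct Slot { /* 16-byte pitch on a 64-bit target */
    uintptr_t key;
    void *value;
} Slot;

/* Two-shift XOR over a pointer-shaped operand, truncated to 32 bits. */
uint32_t pointer_hash(uintptr_t k) {
    return (uint32_t)((k >> 9) ^ (k >> 4));
}

/* Lookup in a power-of-two table. Linear probing is an assumption. */
void *table_find(const Slot *slots, size_t cap, uintptr_t key) {
    assert(cap != 0 && (cap & (cap - 1)) == 0);
    size_t mask = cap - 1;
    size_t i = pointer_hash(key) & mask;
    for (size_t n = 0; n < cap; ++n, i = (i + 1) & mask) {
        if (slots[i].key == key) return slots[i].value;
        if (slots[i].key == SLOT_EMPTY) return NULL; /* end of probe chain */
        /* SLOT_TOMBSTONE marks a deleted entry: keep probing past it. */
    }
    return NULL;
}
```

The distinction between the two sentinels matters on lookup: an EMPTY slot terminates the search, while a TOMBSTONE must be stepped over so that keys inserted after a deletion stay reachable.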
The single qword at *ctx + 632 that is held under pthread_mutex_lock during Level-1 mutation
distinguishes the allocator mutex from the diagnostic-handler mutex (earlier in the structure) and
from the per-class rwlock (later, inside each StorageAllocator). Verifiers that audit lock order
key on the offset rather than on the lock value.
Consumers
Every uniqued value in TileIR is produced by a caller of this gateway. The 700+ shims sit one per
registered class — each cuda_tile, nv_tileas, nv_tileaa, cute, cute_nvgpu, cutlass,
nvvm, llvm, builtin, func, arith, scf, vector, memref Type and Attribute class owns
one. The walker in Operation Layout — Pointer-Identity Dispatch reads the resulting OperationName
sentinels at +0x40 for kind dispatch; the pattern application drivers in
Pattern Vtables and Shapes — Pattern Application Drivers read them through the frozen
fingerprint map. The TypeID sentinel bands documented in
TypeID Sentinels and Anchors are the Level-1 keys this gateway
hashes on.
Cross-References
Type Identity Anchors documents how TypeID sentinel addresses are assigned to dialects, operations, types, attributes, and interfaces. MLIR Infrastructure Overview is the entry point for the rest of the substrate. Operation Layout describes how uniqued types and attributes are referenced from operations. Container Fingerprints catalogues the other DenseMap- and DenseSet-shaped tables in the binary that share the same probe seed and sentinel constants.
Pattern Vtables and Shapes
Abstract
Tileiras instantiates several thousand MLIR rewrite-pattern objects when a pass manager is constructed.
The compiler reuses a small number of physical layouts for every one of these objects: four pattern
shapes, distinguished by total size and by a single trailing field at +0x60, and two virtual tables,
distinguished by slot count. Recognising the shape and vtable of a pattern object is enough to
identify whether it is a plain RewritePattern or an OpConversionPattern, whether it carries a type
converter, and whether it captures a predicate closure. This page documents the layout so that a
catalogue scan can classify a pattern without entering its constructor.
Fixed Prefix
Every pattern object shares the same 0x60-byte prefix, regardless of shape: the vtable pointer, the
rooted operation name, the benefit and kind tags, the owning context, the typeinfo string built from
__PRETTY_FUNCTION__, and the inline SmallVector of generated operation names.
typedef struct PatternShape {
/*+0x00*/ void **vtable; // 8-slot OpConversionPattern or 5-slot RewritePattern
/*+0x08*/ StringRef op_name; // rooted MLIR operation name
/*+0x18*/ uint16_t benefit; // PatternBenefit tie-breaker
/*+0x1A*/ uint16_t kind; // RootKind / MatchAnyOpTypeTag discriminant
/*+0x20*/ MLIRContext *ctx; // owning context pointer
/*+0x28*/ const char *typeinfo_str; // captured from __PRETTY_FUNCTION__
/*+0x30*/ size_t typeinfo_len; // length of typeinfo_str, sans NUL
/*+0x38*/ SmallVector<OperationName, 4> generatedOps; // inline marker 0x400000000
/*+0x60*/ /* shape-dependent trailing slot */
} PatternShape;
typeinfo_str always closes with ]. The constructor slices its class name out of __PRETTY_FUNCTION__,
and the surrounding macro expansion ends with that bracket. A literal like
mlir::nv_tile_ir::as::{anonymous}::FuncOpConversion] is the normal form, and the trailing ]
is a reliable fingerprint when scanning for pattern objects in a stripped binary.
The inline SmallVector at +0x38 lists the operations the pattern intends to generate. Empty
vectors carry the marker 0x400000000 in the size word; non-empty vectors point at a heap buffer of
OperationName slots. Both layouts are valid; the marker is the discriminator.
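As a sketch of how a scanner might test the marker, the qword can be split into two 32-bit halves. The reading below (size in the low half, capacity in the high half, little-endian x86-64) is an assumption consistent with the marker meaning size 0 and capacity 4; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define INLINE_EMPTY_MARKER 0x400000000ull

typedef struct SizeWord {
    uint32_t size;     /* low 32 bits of the qword */
    uint32_t capacity; /* high 32 bits of the qword */
} SizeWord;

/* Split the 64-bit size word into its two 32-bit fields. */
SizeWord decode_size_word(uint64_t qword) {
    SizeWord w;
    w.size = (uint32_t)(qword & 0xFFFFFFFFull);
    w.capacity = (uint32_t)(qword >> 32);
    assert(w.size <= w.capacity || w.capacity == 0);
    return w;
}

/* True when the word is exactly the empty inline marker. */
int is_empty_inline(uint64_t qword) {
    return qword == INLINE_EMPTY_MARKER;
}
```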
Four Shapes
Four sizes exist, and the only thing that varies between them is the trailing slot at +0x60. The
prefix is identical across all of them.
| Shape | Size | +0x60 slot | Used by |
|---|---|---|---|
| A | 0x60 (96 B) | (none) | minimal RewritePattern with no extra state |
| B | 0x68 (104 B) | MLIRContext * | standard OpConversionPattern |
| C | 0x70 (112 B) | TypeConverter * | OpConversionPattern carrying a type converter |
| D | 0x78 (120 B) | closure tuple | rare closure-capturing patterns |
Shape A is the bare layout — no extra fields beyond the prefix, matches one operation name, rewrites in place. Canonicalisers and small folds live here.
Shape B carries a second pointer at +0x60: a re-stored MLIRContext *. The conversion driver reads
the context from this slot rather than from +0x20 when it populates the type-converter-free path of
OpConversionPattern. The duplication is intentional; a scan that misses it misclassifies Shape B as
Shape A.
Shape C carries a TypeConverter * at +0x60. This is the workhorse conversion pattern — the
pattern calls into the converter when materialising operand and result types. One converter is shared
across every pattern in a population set, so a single converter object accounts for many Shape C
entries.
Shape D embeds a closure tuple at +0x60, opaquely pointing at a heap-allocated tuple whose payload
includes a std::function<bool(int)> predicate. The closure lets a pattern condition its match on
captured tile sizes, lane counts, or feature toggles without specialising the class itself. The main
TileAA/TileAS phase in ConvertTileASToLLVM registers a small number of these.
Eight-Slot Vtable
Shapes B, C, and D all point at the same eight-slot OpConversionPattern vtable.
| Slot | Function | Notes |
|---|---|---|
| 0 | typeinfo helper | rtti accessor for the pattern class |
| 1 | dtor (delete) | calls j_j_free on the pattern object |
| 2 | dtor (no delete) | invariant body sub_36C8EC0 |
| 3 | empty trait | nullsub_11937 at 0x447F250, returns immediately |
| 4 | getDebugName | returns typeinfo_str from +0x28 |
| 5 | match | sometimes inlined into slot 6 |
| 6 | matchAndRewrite | the pattern body |
| 7 | getDependentOperationNames | returns generatedOps from +0x38 |
Slots 2 and 3 are invariant across every concrete pattern: sub_36C8EC0 for the non-deleting
destructor body, nullsub_11937 (at 0x447F250) for the empty trait callback. That pair is the
reliable fingerprint for the eight-slot vtable. A vtable whose slot 2 is not sub_36C8EC0 or whose
slot 3 is not nullsub_11937 is not an OpConversionPattern.
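The slot-2 and slot-3 test translates directly into a predicate over a dumped vtable. This is a sketch for a catalogue scanner, not code from the binary: the function name is hypothetical, and the address of sub_36C8EC0 is assumed to be 0x36C8EC0 following the usual sub_<address> naming convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_NONDELETING_DTOR 0x36C8EC0ull /* sub_36C8EC0, address assumed from its name */
#define ADDR_EMPTY_TRAIT      0x447F250ull /* nullsub_11937 */

/* slots: function addresses read from a candidate vtable, in slot order. */
int is_op_conversion_vtable(const uint64_t *slots, size_t n_slots) {
    assert(slots != NULL);
    return n_slots == 8
        && slots[2] == ADDR_NONDELETING_DTOR
        && slots[3] == ADDR_EMPTY_TRAIT;
}
```

Because both invariant slots must match, a single coincidental address elsewhere in the image does not produce a false positive.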
Slot 5 is sometimes a real match body, sometimes a stub deferring to slot 6. The combined form is
the default — most patterns share legality check and rewrite. A separate slot 5 is rare and signals
a pattern with an expensive feasibility check that should not be paid again in the rewrite phase.
Slot 7 returns the inline SmallVector at +0x38. The conversion driver reads this list to seed the
worklist when the pattern itself generates new operations. An empty list with the inline marker
0x400000000 is a valid return; the driver treats it as "no further dependence."
Five-Slot RewritePattern Vtable
Shape A points at a smaller, five-slot vtable.
| Slot | Function |
|---|---|
| 0 | typeinfo helper |
| 1 | dtor (delete) |
| 2 | dtor (no delete) |
| 3 | matchAndRewrite |
| 4 | getDebugName |
No empty-trait slot, no dependent-operation accessor. A pattern prefix at +0x00..+0x60 paired with
a five-entry vtable is a plain RewritePattern.
712 entries in the catalogued binary match the eight-slot fingerprint and 235 match the five-slot fingerprint — together accounting for every pattern object the constructors register.
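The size table and the two vtable sections combine into a simple consistency check: Shape A pairs with the five-slot vtable, and Shapes B, C, and D pair with the eight-slot one. The enum and function names below are hypothetical; the sizes and slot counts come from the tables above.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ShapeKind {
    SHAPE_UNKNOWN = 0, SHAPE_A, SHAPE_B, SHAPE_C, SHAPE_D
} ShapeKind;

/* Map the total object size onto the four documented shapes. */
ShapeKind classify_shape(size_t object_size) {
    switch (object_size) {
    case 0x60: return SHAPE_A; /* no trailing slot */
    case 0x68: return SHAPE_B; /* MLIRContext * */
    case 0x70: return SHAPE_C; /* TypeConverter * */
    case 0x78: return SHAPE_D; /* closure tuple */
    default:   return SHAPE_UNKNOWN;
    }
}

/* Slot count the text pairs with each shape. */
unsigned expected_vtable_slots(ShapeKind shape) {
    assert(shape != SHAPE_UNKNOWN);
    return shape == SHAPE_A ? 5u : 8u;
}

/* True when size and vtable slot count agree with the documented pairing. */
int shape_matches_vtable(size_t object_size, unsigned vtable_slots) {
    ShapeKind shape = classify_shape(object_size);
    return shape != SHAPE_UNKNOWN
        && expected_vtable_slots(shape) == vtable_slots;
}
```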
Closure Patterns
Shape D is the only shape that owns a non-trivial heap object beyond the prefix. The closure tuple
at +0x60 wraps an std::function<bool(int)> predicate along with the original lambda's captures.
The tuple's destructor runs from slot 1 of the pattern's own vtable, which calls j_j_free on the
closure pointer before freeing the pattern object itself.
match queries the predicate, which is why slot 5 of a Shape D pattern is almost always a real
function rather than a stub — the closure makes the match path heavier than the rewrite path.
ConvertTileASToLLVM picks this shape for patterns whose legality depends on tile-shape attributes
that cannot be encoded as a simple operation-name root.
Concrete Cluster
The largest contiguous run of pattern objects sits in the constructor that populates
GenericOpPattern<arith::*Op>. The cluster spans 0x59B5480..0x59B61A0, contains 43 pattern objects
of Shape C, and uses a stride of 0x50 because the constructor packs successive entries with no
padding between adjacent 0x70-byte objects (the cluster is built from a static array). The vtable
pointer is identical for every entry; the op_name and typeinfo_str fields vary across the cluster
because each entry roots a different arith operation.
Locating that cluster and following its op_name strings is the cheapest way to enumerate the arith
lowering set without entering the constructor itself. The same trick works for any pattern set built
from a static array: scan for a run of equally spaced objects with a shared vtable pointer.
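The quoted bounds, stride, and count can be cross-checked with a few lines of arithmetic. The sketch assumes the upper bound 0x59B61A0 is the address of the last entry rather than one past the end; the helper names are hypothetical. Note that the check uses only the 0x50 pitch, which is a separate quantity from the 0x70 Shape C object size.

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in an inclusive [first, last] run at a fixed stride. */
uint64_t entry_count(uint64_t first, uint64_t last, uint64_t stride) {
    assert(stride != 0 && last >= first);
    assert((last - first) % stride == 0);
    return (last - first) / stride + 1;
}

/* Address of the index-th entry (0-based). */
uint64_t entry_address(uint64_t first, uint64_t stride, uint64_t index) {
    return first + index * stride;
}
```

With first = 0x59B5480, last = 0x59B61A0, and stride = 0x50, the span is 0xD20 bytes, which is 42 strides, giving the 43 entries stated in the text.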
Pattern Application Drivers
Once a pattern object has the shape and vtable documented above, it still has to be applied. Tileiras runs the standard MLIR application pipeline plus a PDL-to-PDLInterp compile stage, and the flow splits cleanly into four ordered stages, each owned by its own sub-routine, followed by a tear-down: a mutable vector is populated, frozen into an immutable lookup table, walked by the greedy driver, and optionally wrapped by a partial- or full-conversion driver.
void apply_pattern_set(Operation *root, MLIRContext *ctx) {
RewritePatternSet set(ctx); // std::vector<std::unique_ptr<Pattern>>, 8 B stride
populate_arith_generic_op_patterns(set); // sub_873F30; 43 Shape C pushes, one arena slab each
/* ... further populate* helpers ... */
FrozenRewritePatternSet frozen(set); // sub_36F9730; sort by (benefit desc, kind),
// build OperationName* fingerprint hashmap,
// compile any PDL bytecode to PDL Interpreter ops
ConversionTarget target(*ctx);
/* populate legal/illegal/dynamic ops on target ... */
if (failed(applyPartialConversion(root, // sub_1308320; 56 KB body, gpu->nvvm driver
target,
frozen))) {
rollback_partial_changes(root);
}
/* sub_36F8A00 runs on frozen's destruction; frees fingerprint map + patterns */
}
The greedy driver itself is the inner loop of applyPartialConversion. It walks the IR via the
walker at sub_447FBB0, looks each operation up in the frozen fingerprint hashmap, dispatches the
highest-benefit matching pattern's vtable slot 6 (matchAndRewrite, see the eight-slot table
above), and reruns until fixed-point or until the iteration cap is hit. The cap is typically 10;
exceeding it emits a "fixed-point not reached" remark rather than aborting the pass.
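The capped fixed-point loop has a simple control shape, sketched here on a toy problem. Everything in the block is illustrative: the "IR" is an array of integers, the single "pattern" halves non-zero even values, and MAX_ITERATIONS takes the value 10 that the text calls typical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITERATIONS 10u /* "typically 10" in the text */

/* One sweep over the toy IR; returns how many rewrites fired. */
static unsigned sweep(int *ops, size_t n) {
    unsigned changed = 0;
    for (size_t i = 0; i < n; ++i) {
        if (ops[i] != 0 && ops[i] % 2 == 0) {
            ops[i] /= 2;
            ++changed;
        }
    }
    return changed;
}

/* Returns 1 if a sweep with no rewrites happened within the cap, else 0.
 * A real driver would emit a remark on 0 instead of aborting the pass. */
int run_to_fixed_point(int *ops, size_t n, unsigned *iterations_out) {
    assert(ops != NULL && iterations_out != NULL);
    for (unsigned it = 1; it <= MAX_ITERATIONS; ++it) {
        if (sweep(ops, n) == 0) {
            *iterations_out = it;
            return 1;
        }
    }
    *iterations_out = MAX_ITERATIONS;
    return 0;
}
```

The final quiet sweep counts as an iteration, so an input that needs three rewrites converges on the fourth pass; an input that still changes on the tenth pass is reported as not converged with its partially rewritten state left in place.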
| Stage | Role | Sub-routine | Size or note |
|---|---|---|---|
| 1 | RewritePatternSet construction; one populate* helper | sub_873F30 (example) | varies |
| 1a | per-pattern arena allocation | sub_44A8C20(0x68) | per push |
| 2 | FrozenRewritePatternSet ctor + PDL bytecode compile | sub_36F9730 | 15 119 B |
| 3 | greedy match-and-rewrite loop | sub_36D01B0 | 4 653 B |
| 3a | IR walker used by the greedy driver | sub_447FBB0 | walker |
| 4 | applyPartialConversion / applyFullConversion driver | sub_1308320 | 56 KB |
| 5 | FrozenRewritePatternSet dtor (tear-down) | sub_36F8A00 | 124 B |
Stage 1 is std::vector<std::unique_ptr<Pattern>> with an 8-byte stride. Each populate* helper —
sub_873F30 is the canonical example, registering the 43 arith GenericOpPattern entries that form
the Shape C cluster at 0x59B5480..0x59B61A0 — performs one sub_44A8C20(0x68) arena allocation
per push followed by one indirect vtable construction call. Patterns expressed in PDL bytecode
rather than C++ classes are pushed as PDL pattern modules; their compilation to PDL Interpreter ops
is deferred until stage 2.
Stage 2 walks the vector, sorts entries by benefit descending and then by pattern kind, and builds a
fingerprint hashmap keyed by OperationName * for O(1) per-op lookup. The PDL pattern-module
fallback also lives here: bytecode patterns are compiled to PDL Interpreter ops by this constructor.
The result is an immutable handle the application drivers can share without further synchronization.
Stage 4 is the large body. sub_1308320 is the gpu-to-nvvm conversion driver and the canonical
instantiation; its 56 KB size reflects the conversion template being fully inlined against the
per-instantiation operand and result types, as seen in the HexRays output. It builds a ConversionTarget set of legal, illegal,
and dynamic ops, drives the greedy loop against the frozen set with a type converter, and on
failure walks the IR rolling back partial changes. The same template is instantiated for every
conversion pass; each instantiation produces its own large sub-routine.
Tear-down at sub_36F8A00 is the frozen-set destructor. It frees the fingerprint hashmap and then
walks the pattern vector calling each pattern's vtable slot 1 (the deleting destructor), which in
turn calls j_j_free on the pattern object — the same j_j_free cited in the eight-slot table.
Benefit tie-break in stage 3 is lexicographic on (benefit descending, registration order ascending). Two patterns with equal benefit that match the same operation are a programmer error;
the greedy driver picks one deterministically (the first in vector order) but emits no warning. The
benefit value lives at +0x18 of every pattern object, so the sort key is read directly from the
prefix without needing a virtual dispatch.
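The tie-break rule can be expressed as a comparator. Since the C library's qsort is not guaranteed to be stable, this sketch carries the registration index explicitly as the second key. PatternRef and both function names are hypothetical; only the ordering rule comes from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct PatternRef {
    uint16_t benefit;   /* value read from +0x18 of the pattern prefix */
    uint32_t reg_index; /* position in the registration vector */
} PatternRef;

/* Order by benefit descending, then registration order ascending. */
static int by_benefit_then_order(const void *lhs, const void *rhs) {
    const PatternRef *a = lhs;
    const PatternRef *b = rhs;
    if (a->benefit != b->benefit)
        return a->benefit > b->benefit ? -1 : 1;
    if (a->reg_index != b->reg_index)
        return a->reg_index < b->reg_index ? -1 : 1;
    return 0;
}

void sort_candidates(PatternRef *refs, size_t n) {
    assert(refs != NULL || n == 0);
    qsort(refs, n, sizeof *refs, by_benefit_then_order);
}
```

After sorting, the first candidate is the one a driver following this rule would try first; equal-benefit entries keep their registration order.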
How to Recognize in a Binary
The eight-slot vtable is the strongest fingerprint: any vtable whose slot 2 is sub_36C8EC0
(non-deleting destructor) and whose slot 3 is nullsub_11937 at 0x447F250 is an
OpConversionPattern. The five-slot vtable is identified by absence — five slots, no empty-trait
nullsub — and by slot 3 being the matchAndRewrite body rather than a stub.
Object-level fingerprints:
- The 0x60-byte prefix terminating with a SmallVector size word that is either heap-pointing or carries the inline marker 0x400000000 (size=0, cap=4) at +0x40 of the prefix is a pattern prefix.
- The typeinfo string at +0x28 always ends with a literal ] because it is sliced from __PRETTY_FUNCTION__. Scanning for ] near a MLIRContext * slot at +0x20 locates pattern objects in stripped code.
- The Shape C cluster 0x59B5480..0x59B61A0 of 43 equally spaced 0x70-byte entries with a shared vtable is the easiest entry point for enumerating the arith lowering set.
Consumers
Patterns are consumed by three driver families. The greedy match-and-rewrite loop at sub_36D01B0
drives canonicalisation and applyPatternsAndFoldGreedily-style fixed-point passes. The partial-
and full-conversion drivers built on sub_1308320 drive dialect-to-dialect lowering — gpu-to-nvvm,
TileAS-to-LLVM, and similar template instantiations. The instruction-selection DAG in
ISel DAG and Matcher Table reuses the same 0x7FFFFFFF
operand-count mask documented in Operation Layout — Fixed Header when matching backend
SDNode operations, but does not consume MLIR pattern objects directly — it has its own table-driven
matcher.
Cross-References
TileAS to LLVM Lowering documents the pass that registers the
arith GenericOpPattern cluster. Pattern Set and Type Converter
documents how the population functions wire Shape C patterns to a shared TypeConverter.
Operation Layout documents the operation header that pattern matchers read
through +0x40. ISel DAG and Matcher Table covers the
later backend matcher that reuses the operand-count mask but runs on SDNode, not Operation.
Common Compiler Patterns and Idioms places the
two pattern-vtable shapes in the catalogue of recurring structural moves tileiras uses across every
subsystem.
Interface Vtables and Dispatch
Abstract
MLIR interfaces let generic code ask semantic questions about a dialect object without knowing its
concrete C++ class. In tileiras every concrete op, type, attribute, and dialect carries a sorted
array of (TypeID, concept*) pairs — its InterfaceMap — and lookup is a 16-byte-pitch binary
search keyed on the TypeID address documented in
TypeID Sentinels and Anchors. The concept block is a small vtable
whose first slot is always the concept's own destructor and whose remaining slots are the methods
the interface defines; layout-type interfaces expose shape and coordinate-mapping callbacks, view
types expose element type and memory space, copy-atom types expose value shape and copied bits, and
MMA-atom types expose A/B/C types plus the verifier callback. This page documents the entry
layouts, the binary-search lookup, and the registration shim that installs implementations during
dialect load.
InterfaceMap Layout
Each object that participates in interface dispatch (op, type, attribute, dialect) carries one
InterfaceMap: a flat sorted array of 16-byte entries — the same pitch the OperationName slot
banks use — sorted ascending on the TypeID address.
typedef struct InterfaceEntry { /* 16 B */
/*+0x00*/ TypeID *id; /* interned TypeID sentinel — Idiom 1 or Idiom 2 */
/*+0x08*/ InterfaceConcept *concept; /* small vtable; first slot is the concept's dtor */
} InterfaceEntry;
typedef struct InterfaceMap {
/*+0x00*/ InterfaceEntry *entries;
/*+0x08*/ uint32_t size; /* live count; entries are dense, no tombstones */
/*+0x0C*/ uint32_t cap;
} InterfaceMap;
No tombstones: interfaces are write-once after dialect load. A registration phase runs during
addInterfaces<> and the resulting map is never mutated again for the life of the context. The sort
key is the raw TypeID address, and binary search is well-defined because Idiom-1 sentinels are
link-time constants and Idiom-2 qwords are Meyers-cached on first use — every address compared
against is stable by the time lookup runs.
Concept Vtable Shape
An InterfaceConcept is a small vtable. Slot 0 is the concept's destructor (so the map can free
the concept on context teardown); the remaining slots are the per-interface methods in declaration
order. Slot counts vary by interface — copy-atom carries 4 methods, MMA-atom 5, layout-type 6 — but
the prefix is fixed.
typedef struct InterfaceConcept {
/*+0x00*/ void (*dtor)(InterfaceConcept *self); /* concept teardown, always populated */
/*+0x08*/ void (*method0)(/* iface-specific signature */);
/*+0x10*/ void (*method1)(/* iface-specific signature */);
/* ... */
} InterfaceConcept;
The method-pointer entries are per-implementer thunks generated at dialect-init time. They adapt
the generic (void *concrete, ...) calling convention used by the dispatch helper to the concrete
C++ member function on the implementing class. Dialect conversion's
populateConvertToLLVMConversionPatterns and the per-type printAssembly callbacks both follow
this shape.
Lookup Algorithm
Lookup is a stride-16 binary search over the entries array. On a miss it returns NULL, which every
caller treats as "capability not supported":
InterfaceConcept *interface_map_lookup(const InterfaceMap *map, TypeID *id) {
uint32_t lo = 0, hi = map->size;
while (lo < hi) {
uint32_t mid = lo + ((hi - lo) >> 1);
TypeID *k = map->entries[mid].id;
if (k == id) return map->entries[mid].concept; /* hit — pointer-identity */
if ((uintptr_t)k < (uintptr_t)id) lo = mid + 1;
else hi = mid;
}
return NULL; /* miss — capability absent */
}
The comparator is raw pointer order, not a structural property of the interface — correct because
insert sorts by the same address comparison. Insert holds the invariant with a one-shot
lower_bound plus an in-place memmove:
void interface_map_register(InterfaceMap *map, TypeID *id, InterfaceConcept *concept) {
uint32_t pos = lower_bound_typeid(map, id);
if (pos < map->size && map->entries[pos].id == id) {
/* replace — dialects may rebind an interface */
map->entries[pos].concept = concept;
return;
}
if (map->size == map->cap) grow_entries(map); /* arena-backed; doubles cap */
memmove(&map->entries[pos + 1], &map->entries[pos],
(map->size - pos) * sizeof(InterfaceEntry));
map->entries[pos].id = id;
map->entries[pos].concept = concept;
++map->size;
}
Replace-in-place is rare but legal — a downstream dialect can override an interface installed by an
upstream dialect. A target dialect, for example, may rebind ConvertToLLVMInterface for a base
op-class that the standard dialect already registered.
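The registration routine relies on lower_bound_typeid, which is not shown. A plausible sketch, assuming it is an ordinary lower bound over the same raw-address ordering that lookup uses, is given below. The type definitions are repeated in reduced form so the block stands alone; the one-byte TypeID body is a placeholder, since only the address matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TypeID { char unused; } TypeID;     /* placeholder body */
typedef struct InterfaceConcept InterfaceConcept;  /* opaque here */

typedef struct InterfaceEntry {
    TypeID *id;
    InterfaceConcept *concept;
} InterfaceEntry;

typedef struct InterfaceMap {
    InterfaceEntry *entries;
    uint32_t size;
    uint32_t cap;
} InterfaceMap;

/* Index of the first entry whose id address is >= the probe address.
 * Returns map->size when every entry sorts below the probe. */
uint32_t lower_bound_typeid(const InterfaceMap *map, const TypeID *id) {
    assert(map != NULL);
    uint32_t lo = 0, hi = map->size;
    while (lo < hi) {
        uint32_t mid = lo + ((hi - lo) >> 1);
        if ((uintptr_t)map->entries[mid].id < (uintptr_t)id) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}
```

The returned index serves both branches of registration: if the entry at that index already carries the same id it is a replace, otherwise it is the insertion point that keeps the array sorted.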
Dispatch Helper
Every generic caller funnels through the same one-line dispatch helper to ask "does this object implement this interface, and if so, run method N":
InterfaceConcept *get_interface(void *concrete, TypeID *id) {
InterfaceMap *map = object_interface_map(concrete); /* per-class accessor */
return interface_map_lookup(map, id);
}
/* Typical call site — generic code asks for a capability without knowing the class. */
unsigned get_num_warps(Operation *op) {
InterfaceConcept *c = get_interface(op, &qword_AgentLikeOpInterface_TypeID);
if (c == NULL) return 0;
typedef unsigned (*GetNumWarpsFn)(Operation *);
GetNumWarpsFn fn = (GetNumWarpsFn)((void **)c)[1]; /* slot 1 — first real method */
return fn(op);
}
((void**)concept)[0] is always the concept destructor; ((void**)concept)[1] is the first declared
method. A caller that knows the interface knows the slot index of the method it needs — there is no
runtime name-to-slot resolution.
Dialect Interfaces
Dialect interfaces describe behavior owned by an entire dialect rather than a single operation or type.
| Interface | Contract |
|---|---|
| Assembly interface | Provides aliases and preferred SSA names for dialect types and attributes. |
| Inliner interface | Decides whether operations, regions, and blocks may be inlined. |
| Convert-to-LLVM interface | Populates conversion patterns for dialect lowering to LLVM. |
void populate_llvm_patterns(Context *ctx, RewritePatternSet *patterns) {
for (Dialect *dialect : loaded_dialects(ctx)) {
ConvertToLLVMInterface *iface = dialect_interface(dialect, CONVERT_TO_LLVM);
if (iface != NULL) {
iface->populate_patterns(dialect, patterns);
}
}
}
Type Interfaces
Tileiras uses type interfaces to make layout and target-specific atom types composable.
| Interface | Typical implementers | Questions answered |
|---|---|---|
| Layout type | cute.layout, composed layouts | Shape, body, static-ness, coordinate mapping. |
| View type | cute.ptr, cute.memref | Element type, memory space, rank, effective layout. |
| Pointer type | pointer-like CuTe types | Memory space and alignment. |
| Iterator type | descriptor and iterator types | Element type, layout refinement, projection. |
| Copy atom type | copy atom descriptors | Value shape, layouts, copied bits, value type. |
| MMA atom type | SM-specific MMA atoms | A/B/C types, shape, operand partitioning, verifier rules. |
| Printable type | public textual types | Stable type printer implementation. |
LogicalResult verify_view_type(Type type) {
ViewTypeInterface *view = dyn_cast_view_interface(type);
if (view == NULL) {
return failure("expected a view type");
}
if (view->rank(type) == 0) {
return failure("view type must have at least one dimension");
}
return success();
}
Interfaces let verifiers stay declarative. A pass asks for "view rank" or "MMA atom shape" without knowing every concrete type class that implements the concept.
Operation Interfaces
Operation interfaces express control-flow, scheduling, and producer-consumer behaviour. Each is a concept vtable installed against per-op TypeIDs at op-class registration time.
| Interface | Contract |
|---|---|
| Producer op | Names the region that produces async pipeline data. |
| Region branch terminator | Describes where a region terminator can transfer values. |
| Agent-like op | Exposes agent bodies, group size, and warp allocation. |
| Constant-like trait | Marks an op as a constant for folding and canonicalization. |
Marker traits are zero-method interfaces — a single-slot vtable carrying only the destructor, with
"present" determined by interface_map_lookup returning non-NULL.
How to Recognize in a Binary
The dispatch helper is the cleanest fingerprint: a function that loads an InterfaceMap* from a
per-class slot, runs a stride-16 binary search keyed on a sentinel address from one of the bands
catalogued in TypeID Sentinels and Anchors, and returns either
NULL or a small vtable pointer is an interface_map_lookup call.
Concrete fingerprints:
- The qword_5B47028 (PrintableTypeInterface) and qword_5B44600 (LayoutTypeInterface) Meyers slots from the TypeID page are the most frequent keys passed to lookup; any function that loads one of those qwords and then performs a stride-16 search is dispatching against the cute / cute_nvgpu interface set.
- The replace-in-place behaviour distinguishes interface maps from operation-name slot banks. Operation-name banks are write-once on dialect registration; interface maps occasionally see a pos < size && entries[pos].id == id replace. A function that does the equality probe and then conditionally rewrites entries[pos].concept is registering an override.
- The concept dtor at slot 0 is the cleanest concept-vs-arbitrary-vtable distinguisher. A 5-slot or 6-slot vtable whose slot 0 is a small function ending in a free or arena-discard call, with the rest being type-dispatched method thunks, is an InterfaceConcept.
Verifier Use
A verifier checking an interface-bearing operand should name the missing capability, never the implementation class.
LogicalResult verify_copy_atom(Type atom) {
CopyAtomTypeInterface *iface = dyn_cast_copy_atom(atom);
if (iface == NULL) {
return failure("expected a copy atom type");
}
if (iface->copy_bits(atom) == 0) {
return failure("copy atom must move at least one bit");
}
return success();
}
This wording stays stable even if a later version adds new atom classes.
Consumers
Every generic pass in the binary that asks "does this object support capability X" routes through
the dispatch helper above. Pattern application drivers in
Pattern Vtables and Shapes consult ConvertToLLVMInterface per
dialect during conversion-pattern population. Verifiers consult type-side interfaces such as
LayoutTypeInterface and ViewTypeInterface per operand. The scheduler consults
AgentLikeOpInterface and LoopLikeOpInterface to identify region-bearing producer/consumer
boundaries. The diagnostic engine documented in
Diagnostic ABI and Helpers does not depend on the interface map
directly, but verifiers that emit diagnostics universally key their messages on the missing
interface name rather than on a concrete class.
OpInterface Inventory
The binary exposes sixty-five distinct OpInterface typeinfo strings — every one of them paired
with at least one ::Trait shim that registers a concrete implementer into the per-op
InterfaceMap. The inventory below groups them by what the dispatcher uses them for; the right
column points to the consumer that issues the lookup. None of these counts include the closely
related TypeInterface and AttrInterface families, which use the same dispatch primitive but
key on type and attribute headers respectively.
| Family | Interfaces | Primary Consumer |
|---|---|---|
| Control flow | BranchOpInterface, RegionBranchOpInterface, RegionBranchTerminatorOpInterface, WeightedBranchOpInterface, LoopLikeOpInterface, SelectLikeOpInterface | scheduler region traversal and the dominance/CFG analyses |
| Symbol and call | CallOpInterface, CallableOpInterface, SymbolOpInterface, SymbolUserOpInterface, FunctionOpInterface, AnyFunctionOpInterface | symbol-table cache and the call-graph builder |
| Async pipeline | AsyncOpInterface, AgentLikeOpInterface, ConsumerOpInterface, ProducerOpInterface | producer/consumer pipeline analyses in passes/tileas/async-pipeline-family.md |
| Memory effects | MemoryEffectOpInterface, MemoryConsistencyOpInterface, AllocationOpInterface, BufferizableOpInterface, BufferDeallocationOpInterface, CopyOpInterface, AliasAnalysisOpInterface, AccessGroupOpInterface, DereferenceableOpInterface | alias analysis, bufferization, and the verifier's effect collector |
| Tile and view shaping | XformLayoutOpInterface, RelayoutOpInterface, ViewLikeOpInterface, ShapedDimOpInterface, OffsetSizeAndStrideOpInterface, IndexingMapOpInterface, ReifyRankedShapedTypeOpInterface, BlockStripedOpInterface, TilerOpInterface | layout materialization and the tile-conversion driver |
| Subset and destination style | SubsetOpInterface, SubsetExtractionOpInterface, SubsetInsertionOpInterface, DestinationStyleOpInterface, DestructurableAccessorOpInterface, DestructurableAllocationOpInterface | the bufferization pipeline and SROA-style promotion passes |
| Cast and type inference | CastOpInterface, InferTypeOpInterface, RefineTypeOpInterface, FindPayloadReplacementOpInterface | type refinement during conversion |
| Promotion (mem2reg style) | PromotableOpInterface, PromotableAllocationOpInterface, PromotableMemOpInterface, SafeMemorySlotAccessOpInterface | the mem2reg-equivalent promotion pass |
| Vectorization | VectorTransferOpInterface, VectorUnrollOpInterface, MaskableOpInterface, MaskingOpInterface, ParallelCombiningOpInterface | vector dialect lowering |
| Affine memory | AffineReadOpInterface, AffineWriteOpInterface | affine analyses retained from upstream MLIR |
| Floating-point modes | RoundingModeOpInterface, FPExceptionBehaviorOpInterface | the FP-mode threader during lowering |
| TMA descriptor | MakeTmaDescOpInterface | the TMA descriptor materialization pass; the only nv_tile-specific OpInterface in this row |
| Conversion and printing | ConvertToLLVMOpInterface, BytecodeOpInterface, OpAsmOpInterface, OneToOneIntrinsicOpInterface | LLVM lowering driver and the bytecode/textual printers |
| Bounds and verification | ValueBoundsOpInterface, RuntimeVerifiableOpInterface | integer-bounds analysis and the runtime-check insertion pass |
Two NVVM-side families deserve a separate mention because they straddle the boundary between op
and dialect interfaces: BasicPtxBuilderInterface and PtxBuilderOpInterface together govern how
NVVM ops emit inline PTX during lowering. Their ::Trait shims live in the NVVM dialect's
interface map and are looked up by the lowering driver per op; the dispatcher table at the head of
this page applies unchanged.
A reimplementation can ignore the upstream-MLIR breakdown of public versus internal interfaces:
the binary collapses both into a single dispatch primitive, and the only invariant that matters is
that every interface that appears in a registration call has exactly one concept-block layout
known to every implementer. Adding an interface is a four-step change — declare the concept,
register a TypeID sentinel, stamp the ::Trait shim onto every implementer, and document a
consumer that runs the lookup — and the consumer must be added because an interface with no
consumer wastes 16 bytes of InterfaceMap per op.
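The lookup half of that contract can be sketched in a few lines of C. This is a minimal illustration, assuming the per-op map is a sorted array of 16-byte {TypeID, concept pointer} entries searched by address; the names InterfaceEntry, InterfaceMap, and interface_map_lookup are hypothetical and do not correspond to symbols recovered from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte entry (on LP64): interface TypeID plus concept block. */
typedef struct {
    const void *type_id;        /* sentinel or interned-string address */
    void       *concept_block;  /* table of interface method pointers */
} InterfaceEntry;

typedef struct {
    const InterfaceEntry *entries;  /* sorted ascending by type_id address */
    size_t                count;
} InterfaceMap;

/* Binary search by pointer value. Returns the concept block, or NULL when
 * the op does not implement the requested interface. */
void *interface_map_lookup(const InterfaceMap *map, const void *type_id) {
    assert(map != NULL);
    uintptr_t key = (uintptr_t)type_id;
    size_t lo = 0, hi = map->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uintptr_t cur = (uintptr_t)map->entries[mid].type_id;
        if (cur == key) return map->entries[mid].concept_block;
        if (cur < key) lo = mid + 1;
        else           hi = mid;
    }
    return NULL;
}
```

A consumer calls the lookup with the interface's TypeID and treats a NULL result as "capability not supported", which is the point at which a verifier would emit its missing-interface diagnostic.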
Cross-References
TypeID Sentinels and Anchors catalogues the sentinel addresses
this map keys on. Operation Layout describes the operation header that owns
the per-op InterfaceMap. Storage Uniquer and Context Impl
documents the dialect-registration machinery that installs interface implementations on context
load. The trait side of nv_tileas verification (a closed family of twenty-three OpTrait::nv_tile
mixins that run alongside these interfaces) is catalogued in
nv_tileas Verifiers — OpTrait::nv_tile Inventory.
TypeID Construction Idioms
Abstract
mlir::TypeID is MLIR's runtime identity tag for compiler-internal RTTI. Conceptually it is a
non-null const void * that is unique per C++ class and stable across the life of the
MLIRContext. The natural C++ implementation — &typeid(T) from the Itanium ABI — is not
viable for an MLIR-style framework, and tileiras's binary makes the reason visible in its
layout. The Itanium typeinfo blocks in .data.rel.ro (0x4FA5242..0x5A2C360) hold libstdc++
types only — exceptions, locale facets, stream buffers — and no MLIR class appears there. Every
Dialect, Op, Type, Attribute, and Interface that the binary dispatches on builds its
TypeID through one of two idioms that sidestep typeid entirely.
This page is the canonical description of those two idioms in isolation. The companion page
TypeID Sentinels and Anchors covers where in .bss the two
idioms land in the tileiras image and how the dispatcher consumes them; the address-band reference
table is TypeID Sentinel Address Table. Wave 22B's
finding — that MLIR uses two distinct idioms because &typeid(T) is unusable across DSO
boundaries under hidden visibility — is the architectural justification this page expands on.
Why &typeid(T) Cannot Be the Identity
The Itanium C++ ABI specifies one type_info object per type per program. Cross-DSO uniqueness
relies on weak symbols emitted into .data.rel.ro with STB_WEAK binding and STV_DEFAULT
visibility, so the dynamic linker can merge duplicates from different shared objects into a
single instance at load time. MLIR's static-linking and packaging discipline breaks this
guarantee on three independent axes.
- Hidden visibility. Tileiras's dialect libraries are compiled with -fvisibility=hidden. Hidden type_info symbols cannot participate in cross-DSO merging: each shared object that instantiates the same template gets its own private copy, and &typeid(T) differs between callers. The extern template discipline upstream LLVM uses for llvm::cl::opt doesn't help here because the underlying type_info symbol still ends up hidden.
- Anonymous namespaces. Several MLIR base classes (OpInterface<...> template instantiations, Trait<...> mixins, generated Storage classes) appear inside anonymous namespaces in generated TableGen code. The Itanium ABI gives anonymous-namespace types internal linkage, so &typeid(T) is per-translation-unit by definition; there is no merging step the linker could perform even if visibility were default.
- Static linkage. When dialects are statically linked into a host (which tileiras does for the bundled CUDA toolchain) the entire .data.rel.ro typeinfo block is duplicated per linked archive, and only the linker's --gc-sections heuristics decide which copy survives. A TypeID derived from &typeid(T) would silently differ depending on whether a dialect is loaded through a plugin or compiled in.
The result is that an MLIR-style framework needs its own discriminator. The two idioms below are how the framework — and tileiras's MLIR vendor branch — synthesise one.
Idiom 1 — Per-Class Static Sentinel
The first idiom builds the TypeID out of the address of a per-class static storage object. The object's value is never read. Only its address matters, and the address is the TypeID.
/* Per-class declaration (one of these exists for every concrete dialect / op /
* type / attribute that the binary dispatches on). */
typedef struct {
char id; /* one byte in .bss, value never read */
} ClassNameTypeIDStorage;
static ClassNameTypeIDStorage kClassNameTypeIDStorage; /* .bss, 1 byte */
/* The TypeID is just the address of the storage byte. */
typedef struct { const void *opaque; } TypeID;
static inline TypeID class_name_typeid(void) {
return (TypeID){ &kClassNameTypeIDStorage };
}
Upstream MLIR spells the same shape as TypeID::get<T>() returning
&detail::TypeIDResolver<T>::id, with the static id field defined inside the resolver
specialisation. The inline static discipline plus the C++17 inline-variable rule guarantees one
storage instance per program. Inside a single executable that is enough.
The properties this idiom relies on:
- The address of a static-storage object is a link-time constant within one binary or DSO.
- The C++ standard guarantees one storage instance per static variable defined in an inline/constexpr context, which the linker enforces by COMDAT-merging duplicate definitions.
- Hot dispatch becomes MOV plus CMP: load op->kindPtr, compare against a sentinel address baked into the dispatcher arm. No string compare, no hash, no atomic load.
The cost is that two independently-loaded DSOs each get their own sentinel for the same C++ class
if visibility is hidden — exactly the failure mode that motivates Idiom 2. tileiras avoids this by
statically linking the dialects that use Idiom 1 into one image, so every Idiom-1 sentinel is a
link-time constant in the same .bss slab. The address bands listed in
TypeID Sentinels and Anchors — Idiom 1
show the result: each owning dialect or category gets one contiguous slab of one-byte sentinels at
an 8-byte pitch.
Idiom 2 — __PRETTY_FUNCTION__ String Interning
The second idiom builds the TypeID out of the interned address of a C++ type-name string.
The string is produced by the compiler's __PRETTY_FUNCTION__ macro inside a template, captured
verbatim, and looked up in a process-wide string pool the first time the accessor runs. The pool's
returned pointer is the TypeID, and a Meyers-style cache stores it for subsequent calls.
/* The intern pool: one per process, owned by MLIRContext / a ManagedStatic.
 * In tileiras's binary this is sub_44A6CA0; the closest upstream MLIR
 * counterpart is detail::FallbackTypeIDResolver::registerImplicitTypeID. */
extern const void *intern_typeid_string(const char *rtti_name, size_t len);
/* Per-class lazy accessor. The compiler bakes __PRETTY_FUNCTION__ at the call
* site, which expands to something like:
* "const void *typeid_string<mlir::FunctionOpInterface>() [T = ...]"
* The slice between the angle brackets is what the interner keys on. In the
* tileiras image the captured slice is the suffix ending in ']'. */
const void *typeid_meyers_cached_FunctionOpInterface(void) {
static uint8_t guard = 0; /* one byte, Itanium ABI guard */
static uint64_t cached = 0; /* qword that ends up holding TypeID */
if (__builtin_expect(guard == 0, 0)) {
if (__cxa_guard_acquire(&guard) != 0) {
cached = (uint64_t)intern_typeid_string(
"mlir::FunctionOpInterface]", 26);
__cxa_guard_release(&guard);
}
}
return (const void *)cached;
}
The __PRETTY_FUNCTION__ trick is the cross-compiler-stable way to get a textual name for a C++
type without RTTI. GCC and Clang both emit the unmangled, human-readable form, including template
arguments. MLIR's TypeID::get<T>() machinery captures the slice between two
fixed markers ([T = and the closing ]) and passes the result to the interner.
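The capture step can be illustrated with a small C helper. This is a sketch under stated assumptions: it takes the Clang-style "[T = ...]" spelling quoted above as input, keeps the closing bracket in the slice as this page reports for the tileiras image, and uses a hypothetical function name, capture_type_slice, that does not exist in MLIR or in the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Locate the type-name slice inside a __PRETTY_FUNCTION__-style string.
 * The slice starts right after "[T = " and runs through the final ']'
 * inclusive. Returns NULL if either marker is missing. */
const char *capture_type_slice(const char *pretty, size_t *len) {
    static const char marker[] = "[T = ";
    assert(pretty != NULL && len != NULL);
    const char *start = strstr(pretty, marker);
    if (start == NULL) return NULL;
    start += sizeof(marker) - 1;          /* skip the opening marker */
    const char *end = strrchr(start, ']');
    if (end == NULL) return NULL;
    *len = (size_t)(end - start) + 1;     /* +1 keeps the bracket */
    return start;
}
```

Applied to a string ending in "[T = mlir::FunctionOpInterface]", the helper yields a 26-byte slice, which matches the length passed to the interner in the accessor above.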
The properties this idiom relies on:
- Two DSOs that instantiate TypeID::get<Foo>() produce byte-identical __PRETTY_FUNCTION__ strings, because the compiler generates them from the same C++ type expression. Hidden visibility doesn't affect string contents.
- The interner is one process-wide table. Both DSOs find or insert the same row, and the row's address is the TypeID. Cross-DSO identity holds even without STB_WEAK.
- The interned string survives for the life of the context, so the cached qword remains valid forever.
The cost is one check of the guard byte (an atomic load and a branch), one load of the cached qword, and a one-time string lookup. Idiom 1 is strictly faster (one less indirection, no atomic), but Idiom 2 is the only choice when the same C++ type must yield the same TypeID across separately loaded DSOs or across arbitrary template instantiations whose storage cannot be named at link time.
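To make the find-or-insert behaviour concrete, here is a toy stand-in for the interner. It is a sketch only: the fixed-capacity array, the linear scan, and the absence of locking are simplifications chosen for brevity, and nothing here reflects how sub_44A6CA0 is actually implemented. The function reuses the intern_typeid_string name from the declaration above so the accessor would link against it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define POOL_CAP 64  /* illustrative fixed capacity */

typedef struct { char *bytes; size_t len; } InternRow;

static InternRow g_pool[POOL_CAP];
static size_t    g_pool_size;

/* Find-or-insert keyed on string contents. The address of the row is the
 * identity; rows are never freed, so the pointer stays valid for the life
 * of the process. Not thread-safe: a real interner would take a lock. */
const void *intern_typeid_string(const char *name, size_t len) {
    assert(name != NULL);
    for (size_t i = 0; i < g_pool_size; ++i) {
        if (g_pool[i].len == len && memcmp(g_pool[i].bytes, name, len) == 0)
            return &g_pool[i];            /* same contents, same identity */
    }
    if (g_pool_size == POOL_CAP) return NULL;
    char *copy = malloc(len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, name, len);
    copy[len] = '\0';
    g_pool[g_pool_size].bytes = copy;
    g_pool[g_pool_size].len   = len;
    return &g_pool[g_pool_size++];
}
```

Two callers that pass equal bytes from different buffers receive the same row address, which is the property that lets string equality stand in for symbol merging.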
Why Each Idiom Is Used
| Decision axis | Idiom 1 — Static sentinel | Idiom 2 — Interned string |
|---|---|---|
| Cross-DSO identity | Breaks under hidden visibility | Stable; relies on string equality |
| Anonymous-namespace types | Per-TU storage, per-TU identity | Stable; string contents are well-defined |
| Cost on hot path | One load, one compare | One load (cached qword), one compare |
| Cost on first call | Zero — address is link-time constant | Guard acquire, string intern, qword store |
| Storage shape in .bss | 1-byte slot at 8-byte alignment | {u8 guard, u64 qword} pair, 9 bytes |
| Created | Before main (link-time addresses) | First call after dialect load |
| Tileiras tenants | Dialects, concrete Types/Attributes, per-op opInfo, per-op kindPtr | Op/Type/Attr interfaces, registered analyses, pattern RTTI tags |
Idiom 1 is reserved for objects that exist before main. Their identities are link-time constants
and the linker packs them into dense bands one slab per owning dialect — the
&unk_5B38B[B0..C8], &unk_5B48D[88..F8], &unk_5B49A[98..B18], and the larger NVVM op slab at
0x5B8D610..0x5B8DCB8 are the visible result.
Idiom 2 is reserved for objects whose existence depends on a runtime registration step. Op and
Type interfaces are attached via addInterfaces<> calls made well after dialect construction;
analyses are keyed by C++ type and instantiated on first request through the AnalysisManager;
pattern RTTI tags exist only for patterns that explicitly opt into RTTI. None of these have a
link-time storage owner — the runtime has to derive the identity from the C++ type alone.
A TypeID Never Moves Between Idioms
Once a class is assigned to one idiom by its declaration site, every install site and every
dispatcher uses the same idiom. There is no fallback path from Idiom 1 to Idiom 2 or vice versa.
The two pools never collide because their address bands never overlap — Idiom 1 sentinels are
one-byte storage objects at 8-byte alignment within .bss slabs the linker emits per dialect,
while Idiom 2 qwords are part of {guard, qword} pairs that the C++ compiler scatters at the
declaration sites of the Meyers accessors. Both are stable for the lifetime of the
MLIRContext, which is what the binary-searched InterfaceMap and the pointer-equality dispatch
both depend on.
QUIRK — The captured slice ends in ]
The string interned by Idiom 2 in tileiras's binary always ends in ], even though the
human-eye-friendly version of the type name would end with the type itself. This is the closing
bracket of __PRETTY_FUNCTION__'s [T = ...] slot, captured along with the type name. A binary
triage that searches for the string "mlir::FunctionOpInterface" (without the trailing bracket)
will not find the literal in .rodata. Search for "mlir::FunctionOpInterface]" instead, with
the bracket; that is the byte sequence the interner sees and the string the comparator hashes.
The 9-class table in
TypeID Sentinels and Anchors — Idiom 2
preserves the bracket on every row for exactly this reason.
QUIRK — Idiom 1 sentinels are 1-byte storage, but the slab pitch is 8 bytes
Each Idiom-1 sentinel is conceptually a char — one byte that nobody ever reads. The linker
nonetheless allocates an 8-byte slot per sentinel so that the next sentinel's address remains
8-byte-aligned. This is a side effect of how MLIR declares the storage (inline static char id
inside a class that has 8-byte members elsewhere) plus the linker's default alignment. A
disassembler scanning the slab will see only zero bytes: each 8-byte slot is one sentinel byte
followed by seven bytes that are never used. Those trailing bytes are not padding in a meaningful
sense, and they are not addressable as sentinels either. The address of the first byte of each
8-byte slot is the TypeID.
QUIRK — The Meyers guard byte must come before the qword
The Itanium C++ ABI's __cxa_guard_acquire machinery expects a 64-bit guard variable, but the
compiler is free to allocate the guard byte separately from the cached value. tileiras's binary
consistently places the guard byte at qword_addr - 8, immediately before the 8-byte cached
slot, with the qword 8-byte-aligned. A reimplementation that places the guard after the qword
(or interleaves multiple guards in front of one qword) breaks the steady-state load pattern that
every interface-using dispatcher in the binary assumes. The fast-path body is cmp byte ptr [guard], 0 followed by mov rax, qword ptr [qword] — if the guard byte sits anywhere other than
[qword - 8] the loader generates a different sequence and the cache-line locality argument
breaks.
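The layout constraint can be written down as a struct with compile-time checks. This is an illustrative sketch: MeyersSlot and guard_addr_for_qword are hypothetical names, and the explicit seven-byte gap is an assumption that reproduces the observed guard-at-qword-minus-8 placement rather than anything the compiler is required to emit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one Idiom-2 slot: guard byte, gap, cached qword. */
typedef struct {
    uint8_t  guard;      /* 0 until the first call completes */
    uint8_t  unused[7];  /* keeps the qword naturally aligned */
    uint64_t cached;     /* holds the interned TypeID after init */
} MeyersSlot;

_Static_assert(offsetof(MeyersSlot, cached) == 8, "guard must sit at qword - 8");
_Static_assert(sizeof(MeyersSlot) == 16, "one slot spans 16 bytes");

/* Given the address of a cached qword, recover the guard byte's address. */
static inline uint64_t guard_addr_for_qword(uint64_t qword_addr) {
    assert(qword_addr % 8 == 0);  /* qwords are 8-byte aligned */
    return qword_addr - 8;
}
```

For the FunctionOpInterface accessor the helper maps qword_5B37670 to byte_5B37668, the pair listed in TypeID Sentinels and Anchors.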
Cross-References
- TypeID Sentinels and Anchors: where in the tileiras image the two idioms physically land, the null-opinfo guard, and the dispatch-by-pointer-identity pattern in sub_7ACC40 and the other load-store classifiers.
- TypeID Sentinel Address Table: the address-sorted enumeration of every sentinel referenced anywhere in the binary, with idiom-form (1-byte pointer-identity vs 8-byte Meyers qword vs 9-byte guard+qword pair) attached to each row.
- Interface Vtables: the InterfaceMap that performs binary search against the TypeID addresses produced by both idioms, including the 16-byte entry pitch and the binary-search invariants.
- Storage Uniquer and Context Impl: the registration machinery that installs both idioms during dialect load, plus the relationship between Idiom-1 sentinels and the UniquedStorage slab.
- Operation Layout: the op header that holds the kindPtr at *(qword*)(op+48)+16, which is the read every dispatcher performs before comparing against an Idiom-1 or Idiom-2 sentinel.
TypeID Sentinels and Anchors
Abstract
Tileiras materialises an MLIR TypeID in exactly two ways. Idiom 1 is a 1-byte sentinel in .bss whose
address is the identity — the byte's value is never read, only its pointer is. Idiom 2 is a Meyers
singleton that lazily interns a __PRETTY_FUNCTION__-derived RTTI string the first time the accessor
runs, then caches the resulting TypeID* in a qword next to a one-shot guard byte. The two never mix:
every TypeID in the binary is either a static sentinel pointer or a Meyers-cached qword.
Neither idiom touches the Itanium C++ ABI's typeinfo/vtable machinery. The binary's .data.rel.ro
typeinfo block (0x4FA5242..0x5A2C360) holds only libstdc++ classes — exceptions, streams, locale
facets — and no MLIR class appears there. This is the architectural reason both idioms exist: MLIR
needs cross-DSO identity for types that the C++ standard's &typeid(T) cannot give it (anonymous
namespaces, hidden visibility, statically linked dialects), so it builds its own discriminators on
top of address-taking and string-interning instead. A reimplementation that swaps in std::type_info*
will be unable to keep pointer-equality stable across the registered-dialect set.
The distinction is significant. Idiom 1 carries the ABI-frozen identities — dialects, the registered
Type and Attribute subclasses, the upstream MLIR built-ins — whose addresses are link-time constants and
whose registration happens before main. Idiom 2 carries identities that come into existence at runtime
as part of an addInterfaces<> call, an analysis registration, or a pattern RTTI tag.
Idiom 1 — Static Pointer-Identity Sentinel
Each dialect, each concrete Type, each concrete Attribute that ships with the binary owns a 1-byte
sentinel in .bss. The byte's value is irrelevant; the linker assigns it an address and that
address is the TypeID. Hot dispatch compares op->kindPtr (or a Type's vtable slot) against a
sentinel by pointer-identity — one MOV+CMP, no string compare, no hash lookup.
typedef uint8_t TypeIDSentinel;
extern TypeIDSentinel kCuteLayoutTypeID; /* &unk_5B49AE0 */
extern TypeIDSentinel kCuteNvgpuSm90MmaTypeID; /* &unk_5B48E28 */
bool is_cute_layout(Type *t) {
return t->kind_ptr == &kCuteLayoutTypeID;
}
Sentinels do not scatter across the binary — they cluster into a small number of address bands, one band per owning dialect or category. Three bands carry the weight of Tileiras dispatch.
| Band | Owner | Examples |
|---|---|---|
| &unk_5B38B[B0..C8] | cuda_tile dialect Type TypeIDs | cuda_tile.tile, cuda_tile.ptr, cuda_tile.tensor_view |
| &unk_5B48D[88..F8] / 5B48E[00..58] | cute_nvgpu concrete Type TypeIDs (27 slots, 8-byte pitch) | cute_nvgpu.sm90.mma, cute_nvgpu.smem_desc, cute_nvgpu.atom.tma_load |
| &unk_5B49A[98..B18] | cute dialect concrete Type TypeIDs (17 slots, 8-byte pitch) | cute.layout, cute.swizzle, cute.tile |
| &unk_5B44E[B8..F8] / 5B44F[08..FD8] | nv_tileas per-op opInfo sentinels (21 ops, 8-byte pitch) | nv_tileas.tiled_load @ 5B44ED0, nv_tileas.gather_load @ 5B44FA8, nv_tileas.convert_layout @ 5B44FD8. Paired kindPtr forms live in 0x5BE3F* / 0x5BE4* / 0x5BE5*; see Sentinel Sharing And Aliasing. |
| &unk_5B46[D28..F68] | nv_tileaa per-op FoldRecord sentinels (33 ops) | nv_tileaa.make_memref, nv_tileaa.block_tile |
| &unk_5BE5xxx / &unk_5BE6xxx | Upstream MLIR Type and Attribute TypeIDs (built-in dialect) | f32 at &unk_5BE6030, f8E4M3FN at &unk_5BE60A0 |
| &unk_5BAADxx | Opaque / erased-storage TypeIDs | the i32-blocked-layout-id-1 variant at &unk_5BAADB8 |
Dialect TypeIDs get their own one-byte slots too: &unk_5B496B8 for the cute dialect,
&unk_5B482C8 for cute_nvgpu, &unk_5BA8F60 for LLVM, &unk_5BE5908 for arith. The
nv_tile_ir::as::schedule_utils::ScheduleAnalysis analysis registration at qword_5B38E78 is the
canonical Idiom-2 example for analyses; the dialect-level Idiom-1 sentinels and the analysis-level
Idiom-2 sentinels coexist in the same MLIRContext without colliding because their bands never
overlap.
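The slot counts quoted in the band table follow directly from the address ranges and the 8-byte pitch. The helpers below are a small sketch for checking that arithmetic during triage; band_slot_count and band_slot_index are hypothetical names, and both assume the range endpoints are inclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sentinel slots in the inclusive range [first, last]. */
static inline uint64_t band_slot_count(uint64_t first, uint64_t last,
                                       uint64_t pitch) {
    assert(pitch != 0 && last >= first && (last - first) % pitch == 0);
    return (last - first) / pitch + 1;
}

/* Zero-based slot index of a sentinel address inside its band. */
static inline uint64_t band_slot_index(uint64_t first, uint64_t addr,
                                       uint64_t pitch) {
    assert(pitch != 0 && addr >= first && (addr - first) % pitch == 0);
    return (addr - first) / pitch;
}
```

With these, the two cute_nvgpu sub-ranges give 15 + 12 = 27 slots, the cute band 0x5B49A98..0x5B49B18 gives 17, and the cute.layout sentinel at 0x5B49AE0 lands at index 9 of its band.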
Dispatch By Pointer-Identity
Every walker, canonicalizer, and verifier in the binary distinguishes ops the same way: load
op->kindPtr from *(qword*)(op + 48) + 16 and compare it against a list of sentinel addresses.
The classifier sub_7ACC40 — the mode-classifier for the TileAS layout-assignment pass — is a
representative example.
int classify_load_store_op(Operation *op, ModuleSpec *spec, LayoutCandVec *cands) {
void *kind = *(void **)(*(qword *)(op + 48) + 16);
if (kind == &unk_5BE6138) { /* null-opinfo (mid-rewrite) */
if (kind_ptr_is_tiled_atomic_rmw(op)) /* &unk_5B44ED8 via leaf */
return tail_call_primary_resolver(op, spec, cands);
if (kind_ptr_is_scatter_store(op)) /* &unk_5B44EF0 via leaf */
return tail_call_fallback_resolver(op, spec, cands);
return FAILURE;
}
if (kind == &unk_5B44ED0 /* tiled_load */ ||
kind == &unk_5B44EC8 /* tiled_store */)
return tail_call_primary_resolver(op, spec, cands);
if (kind == &unk_5B44F90 /* nv_tileas.load */) return classify_load_inline(op, spec, cands);
if (kind == &unk_5B44EE0 /* nv_tileas.store */) return CANONICAL_MODE;
if (kind == &unk_5B44FA8 /* gather_load */) return tail_call_gather_resolver(op, spec, cands);
return FAILURE;
}
The entire switch is a sequence of pointer comparisons. No string lives in this function — six
CMP instructions on a hot path that runs once per op. Reimplementations must preserve this
property; the address-as-identity model is what makes generic walkers cheap.
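For readers who want something compilable, the same dispatch shape can be reproduced outside the binary. The sketch below is a simplified stand-in: the sentinels, the flattened Op header, and the OpClass result are all hypothetical, and the double indirection through op + 48 is collapsed into a single kind_ptr field.

```c
#include <assert.h>
#include <stddef.h>

/* One-byte sentinels: the values are never read, only the addresses. */
static char kTiledLoadID;
static char kTiledStoreID;
static char kGatherLoadID;

typedef enum { KIND_UNKNOWN, KIND_TILED_ACCESS, KIND_GATHER } OpClass;

/* Hypothetical flattened op header holding just the identity pointer. */
typedef struct { const void *kind_ptr; } Op;

/* Classification is a chain of pointer comparisons; no strings, no hashing. */
OpClass classify(const Op *op) {
    assert(op != NULL);
    const void *kind = op->kind_ptr;
    if (kind == &kTiledLoadID || kind == &kTiledStoreID)
        return KIND_TILED_ACCESS;
    if (kind == &kGatherLoadID)
        return KIND_GATHER;
    return KIND_UNKNOWN;
}
```

Because distinct static objects always have distinct addresses, adding a new op kind costs one more byte of storage and one more comparison in the arms that care about it.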
The &unk_5BE6138 Null-Opinfo Guard
One sentinel earns its own paragraph. &unk_5BE6138 is the "no properties" guard that ops without
an inline Properties payload carry as their kindPtr discriminator during construction and
mid-rewrite. Dispatchers test it first to short-circuit the properties-decode path. It is also the
address an in-flight RewritePattern leaves in op->kindPtr after wiping the original singleton —
which is why every resolver in the load-store cluster (sub_7ACC40, sub_788BE0, sub_7E3440)
tests for it before falling through to the leaf-predicate helpers sub_7A9D30 (tiled_atomic_rmw)
and sub_79DA80 (scatter_store), which read one indirection deeper to recover the original
identity.
Treat the null-opinfo sentinel as a transient state. A walker that observes it on a fully constructed op outside a rewrite frame should report failure rather than guess the kind.
Idiom 2 — Meyers-Cached TypeID
When a TypeID is not a link-time constant — primarily Op and Type interfaces attached via
addInterfaces<>, analysis types registered through mlir::AnalysisManager, and pattern RTTI tags —
the binary falls back to a {guard:u8, qword:u64} pair plus a one-shot init function. The factory
sub_44A6CA0 takes a string ending in ] (the closing bracket of __PRETTY_FUNCTION__ captured by
MLIR's TypeID::get<T>() trick) and returns the uniqued TypeID* for that string. These strings
sit in ordinary .rodata literal pools, not in the Itanium typeinfo block, so a binary triage that
scans for typeinfo for'mlir::... will find nothing — every MLIR identity string in this binary is
addressable only through the corresponding install-site call.
TypeID get_function_op_interface_typeid(void) {
static uint8_t guard = 0; /* byte_5B37668 */
static uint64_t cached = 0; /* qword_5B37670 */
if (__builtin_expect(guard == 0, 0)) {
if (__cxa_guard_acquire(&guard) != 0) {
cached = (uint64_t)sub_44A6CA0("mlir::FunctionOpInterface]", 26);
__cxa_guard_release(&guard);
}
}
return (TypeID)cached;
}
void install_interface(InterfaceMap *map, TypeID id, void *concept) {
sub_4492D60(map, id, concept);
}
The guard byte sits immediately before the qword in .bss, with 8-byte alignment so the qword
stays naturally aligned. The Itanium ABI's __cxa_guard_acquire / __cxa_guard_release pair makes
initialisation thread-safe; __builtin_expect(guard == 0, 0) keeps the steady-state load on the
fast path. After the first call the qword is the TypeID, and the slot behaves exactly like an
Idiom-1 sentinel — except its address is the address of a 64-bit pointer, not a 1-byte tag.
Concrete examples observed in the binary, all matching this exact template:
| Qword slot | RTTI string (verbatim) | Used by |
|---|---|---|
| qword_5B37670 | mlir::FunctionOpInterface] | ConvertTileFuncToLLVM |
| qword_5B37798 | mlir::SymbolTable] | symbol-table analysis lookup |
| qword_5B38E18 | mlir::LoopLikeOpInterface] | loop pipeliner and licm passes |
| qword_5B38E78 | mlir::nv_tile_ir::as::schedule_utils::ScheduleAnalysis] | TileAS scheduler analysis manager |
| qword_5B44600 | mlir::cutlass_ir::cute::LayoutTypeInterface] | every CuTe layout-bearing type |
| qword_5B44618 | mlir::cutlass_ir::cute::ViewTypeInterface] | every CuTe view-bearing type |
| qword_5B46FF8 | mlir::cutlass_ir::cute::MmaAtomTypeInterface] | 9 SM-specific MMA atom installs (SM70..SM120) |
| qword_5B47028 | mlir::cutlass_ir::cute::PrintableTypeInterface] | 16+ concrete cute / cute_nvgpu type installs |
| qword_5B47088 | mlir::cutlass_ir::cute::DescriptorIteratorTypeInterface] | TMA / shared-memory descriptor iteration |
The pairs cluster tightly in .bss by design: seven cute-interface slots in the band
0x5B47000..0x5B470D0, three more in 0x5B44600..0x5B44890. Each band holds the interface-id
table for a single owning dialect, registered in one initialiser.
Choosing Between the Two
Idiom 1 covers objects that exist before main — dialect TypeIDs, registered Type and Attribute
subclasses, the per-op kindPtr singletons in &unk_5B44Exx / 5B44Fxx. Their addresses are
link-time constants, and the linker packs them into dense 8-byte-pitched bands. Idiom 2 covers
objects whose existence depends on a runtime registration step — interfaces attached after a dialect
loads, analyses keyed by C++ type, pattern RTTI tags. Their identity has to derive from the C++ type
alone, even across translation units, so the binary spells the type name out and uniques the string.
A TypeID never moves between idioms. Once an interface owns a Meyers slot, every install site uses
that slot; once a concrete Type owns a .bss sentinel, no caller ever asks the factory for its name.
Sentinel Sharing And Aliasing
Two cross-dialect aliases deserve a flag. &unk_5B49B18 serves as the TypeID for both
cute.ConstrainedInt and cute_nvgpu.AtomIType — the two share an identical inline
i<N>(<divby M>)? printer surface, and the binary treats them as one identity. qword_5B47028
(PrintableTypeInterface) attaches to every concrete cute Type and most cute_nvgpu Types — 27+
installs of the same interface against different concrete types. Both patterns are legitimate; both
rely on Idiom-1 and Idiom-2 sentinels being stable for the lifetime of the MLIRContext.
The "OperationName ↔ AbstractOperation" split is the other place dispatch sentinels alias. A single
op mnemonic owns two singletons: &unk_5B44FD8 is the OperationName.opInfo (the descriptor passed
to sub_4461CA0 during dialect registration) for nv_tileas.convert_layout, while &unk_5BE4008
is the AbstractOperation kindPtr that ends up at *(qword*)(op+48)+16 after the op is uniqued.
Same op identity, two sentinels at different indirection levels — a verifier that wants to recognise
the op needs to pick the right level.
How to Recognize in a Binary
Idiom 1 is identified by the address band rather than the content. Sentinels cluster densely in
8-byte-pitched runs inside the .bss ranges listed above; the only operation performed on a sentinel
is to take its address. A 1-byte object at an 8-byte-aligned offset, never written, whose address
appears as the right-hand side of a CMP against an op-header field at +0x40 or against
*(qword*)(op + 48) + 16 is an Idiom-1 sentinel.
Idiom 2 is identified by the {guard:u8, qword:u64} pair plus a guarded one-shot init body. The
characteristic sequence is __cxa_guard_acquire(guard) → factory(rtti_string, length) → store result into qword → __cxa_guard_release(guard). The factory sub_44A6CA0 takes a string ending in
] (the closing bracket from __PRETTY_FUNCTION__); any call to this factory with a string literal
argument is an Idiom-2 install site.
The null-opinfo sentinel &unk_5BE6138 is the single most useful cross-cutting fingerprint for
in-flight rewrites. Any walker that loads *(qword*)(op + 48) + 16 and immediately compares it
against &unk_5BE6138 is auditing for the mid-rewrite state documented above.
Consumers
The kindPtr at *(qword*)(op + 48) + 16 is read by every walker, verifier, canonicaliser, and
pattern matcher in the binary. The walker driver sub_447FBB0 from
Operation Layout — Walker Contract dispatches against these sentinels; the pattern fingerprint
map built by FrozenRewritePatternSet in
Pattern Vtables and Shapes keys on OperationName.opInfo
addresses; the InterfaceMap in Interface Vtables — InterfaceMap Layout keys on the same TypeID
addresses (the Meyers Idiom-2 ones for interfaces, the Idiom-1 ones for concrete classes).
Cross-References
The companion page TypeID Construction Idioms covers the two idioms in the
abstract — why &typeid(T) is unusable under hidden visibility, how Idiom 1 builds identity from a
per-class static sentinel address, how Idiom 2 builds it from a __PRETTY_FUNCTION__-derived
string interned through the process-wide pool — without the address-band specifics that occupy
this page. Read that page first for the architectural justification; read this page for the
tileiras layout.
Operation Layout describes the op header where the kindPtr lives. Interface
Vtables covers the concept tables that the Meyers-cached interface TypeIDs key
into. Storage Uniquer and Context Impl documents the registration
machinery that installs both idioms during dialect load. The companion address-sorted reference
TypeID Sentinel Address Table enumerates every individual
sentinel referenced anywhere in the binary, including the full 213-slot NVVM op slab at
0x5B8D610..0x5B8DCB8 and the 33-slot nv_tileaa FoldRecord band at 0x5B46D28..0x5B46F68,
neither of which is unpacked here.
Container Fingerprints
Abstract
Three associative-container families dominate the tileiras binary and each leaves a distinct
constant fingerprint that survives stripping: LLVM DenseMap / DenseSet (pointer-keyed, sentinel
pointers 0xFFFFFFFFFFFFF000 and 0xFFFFFFFFFFFFE000 in the slot key word, inline pointer hash
(p>>9)^(p>>4)), an Abseil SwissTable variant in the scheduler (full fmix64 finalizer with
multipliers 0xFF51AFD7ED558CCD and 0xC4CEB9FE1A85EC53, size-class constant
0x9DDFEA08EB382D69, 16-byte control-byte groups with sentinels 0x80, 0xFE, 0xFF), and
SmallVector value blocks with packed inline-capacity markers (0x300000000, 0x400000000,
0x600000000 as single 64-bit stores). This page lists the verbatim constants, the inline hash and
probe expressions, the resize predicates, and the identification procedure needed to recognise each
family from a single fingerprint and reimplement it without symbols.
Fingerprint Summary
| Family | Primary fingerprint | Slot pitch | Probe shape |
|---|---|---|---|
| LLVM DenseMap / DenseSet | sentinel pointers 0xFFFFFFFFFFFFF000 (empty) and 0xFFFFFFFFFFFFE000 (tombstone) in slot key word | 16 B {KeyTy*, ValueTy*} | inline pointer hash (p>>9)^(p>>4), stride-1 linear probe over key slots |
| Abseil SwissTable (scheduler) | size-class multiplier 0x9DDFEA08EB382D69, HighMul64 intermediate 0xAE502812AA7333, full fmix64 finalizer (three xor-shifts, multipliers 0xFF51AFD7ED558CCD and 0xC4CEB9FE1A85EC53) | 16 B control-byte group plus 16 entry slots | H1 picks group (high 57 bits of hash), H2 tag (low 7 bits) SIMD-matched inside 16-byte control group |
| SmallVector inline-cap marker | 0x300000000, 0x400000000, 0x600000000 written as a single 64-bit store at value-block offset +0 | 8 B header u64 | encodes cap in high 32 bits, size = 0 in low 32 bits |
Across the binary there are 47 distinct occurrences of the literal 0xFFFFFFFFFFFFF000 and 40 of 0xFFFFFFFFFFFFE000 that fit the DenseMap empty/tombstone slot pattern.
LLVM DenseMap and DenseSet
Two sentinel pointer values mark slot state in every pointer-keyed DenseMap and DenseSet in Tileiras. Empty slots hold 0xFFFFFFFFFFFFF000 (the signed value -4096); tombstones hold 0xFFFFFFFFFFFFE000 (-8192). The empty/tombstone test reads only the first 8 bytes of the 16-byte slot; the companion value pointer is irrelevant.
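The slot-state test described above can be sketched as a three-way classifier over the first 8 bytes of a slot. The enum and function names here are illustrative, not symbols recovered from the binary.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative names; only the two constants come from the text above. */
typedef enum { DENSE_LIVE, DENSE_EMPTY, DENSE_TOMBSTONE } DenseSlotState;

#define DENSE_EMPTY_KEY     0xFFFFFFFFFFFFF000ULL  /* (uint64_t)-4096 */
#define DENSE_TOMBSTONE_KEY 0xFFFFFFFFFFFFE000ULL  /* (uint64_t)-8192 */

/* Classify a 16-byte slot by reading only its key word at offset 0. */
static DenseSlotState dense_slot_state(const void *slot) {
    uint64_t key_word;
    memcpy(&key_word, slot, sizeof key_word);  /* the value word is never read */
    if (key_word == DENSE_EMPTY_KEY)     return DENSE_EMPTY;
    if (key_word == DENSE_TOMBSTONE_KEY) return DENSE_TOMBSTONE;
    return DENSE_LIVE;
}
```

Because both sentinels have their low 12 bits clear and sit at the very top of the address space, they should never collide with a pointer to a real heap or static object.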
Lookup runs LLVM's classical inline pointer hash followed by a stride-1 linear probe. The hash is open-coded at every call site rather than dispatched through a virtual table, so the same two shifts and one XOR appear over and over:
size_t dense_map_index(const void *key, size_t cap_mask) {
uintptr_t p = (uintptr_t)key;
size_t h = ((size_t)(p >> 9)) ^ ((size_t)(p >> 4));
return h & cap_mask;
}
void *dense_map_find(DenseSlot *slots, size_t cap, const void *key) {
size_t mask = cap - 1;
size_t idx = dense_map_index(key, mask);
for (;;) {
DenseSlot *s = &slots[idx];
uintptr_t k = (uintptr_t)s->key;
if (k == 0xFFFFFFFFFFFFF000ULL) return NULL; // empty: terminate
if (k != 0xFFFFFFFFFFFFE000ULL && s->key == key) return s;
idx = (idx + 1) & mask; // stride-1 probe
}
}
Two thresholds drive resize. Growth fires at 3/4 occupancy; in-place rehash to clear tombstones fires when free non-tombstone slots fall to 1/8 of capacity. Growth picks the next power of two of 2N-1 with a 64-slot floor, via the same five-round bit-fill shift sequence that appears verbatim in the binary:
bool should_grow(size_t live, size_t cap) { return 4 * (live + 1) >= 3 * cap; }
bool should_rehash_in_place(size_t live, size_t tomb, size_t cap) {
return cap - tomb - (live + 1) <= cap / 8;
}
size_t next_size(size_t cap) {
size_t t = 2 * cap - 1;
t |= t >> 1;
t |= t >> 2;
t |= t >> 4;
t |= t >> 8;
t |= t >> 16;
++t;
return t < 64 ? 64 : t;
}
That five-round shift cascade with the trailing ++t is itself a fingerprint. Any call site that materialises 2*cap - 1 and then runs the cascade is either DenseMap growth or its SwissTable cousin.
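As a worked example of the growth predicate and the cascade together, the sketch below replays a run of insertions with no erasures and reports the resulting capacity. The starting capacity of 64 and the helper name capacity_after_inserts are assumptions made for illustration; the two predicates are restated only so the block compiles on its own.

```c
#include <stddef.h>

/* Restated from the text so the example is self-contained. */
static int should_grow(size_t live, size_t cap) { return 4 * (live + 1) >= 3 * cap; }

static size_t next_size(size_t cap) {
    size_t t = 2 * cap - 1;
    t |= t >> 1;  t |= t >> 2;  t |= t >> 4;  t |= t >> 8;  t |= t >> 16;
    ++t;
    return t < 64 ? 64 : t;
}

/* Hypothetical driver: capacity after inserting n distinct keys into an
   initially empty 64-slot table, with no erasures (so no tombstones). */
static size_t capacity_after_inserts(size_t n) {
    size_t cap = 64, live = 0;
    for (size_t i = 0; i < n; ++i) {
        if (should_grow(live, cap)) cap = next_size(cap);
        ++live;
    }
    return cap;
}
```

With a 64-slot table the 48th insertion is the first to satisfy 4 * (live + 1) >= 3 * cap, so the table doubles to 128 at that point; the 96th insertion doubles it again to 256.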
Abseil SwissTable in the Scheduler
A second container family lives inside the scheduler. It uses the standard Abseil SwissTable shape (a 16-byte control-byte group followed by 16 entry slots) and is identifiable by the size-class multiplier 0x9DDFEA08EB382D69, the strongest single signature in the binary because no other call site uses that exact immediate.
fmix64 Finalizer
The hash a SwissTable indexes with is the output of the fmix64 finalizer, derived from the input pointer or key by three xor-shift steps interleaved with two multiplies:
uint64_t fmix64(uint64_t h) {
h ^= h >> 33;
h *= 0xFF51AFD7ED558CCDULL; // K1
h ^= h >> 33;
h *= 0xC4CEB9FE1A85EC53ULL; // K2
h ^= h >> 33;
return h;
}
The 0x9DDFEA08EB382D69 constant is the size-class multiplier used in the HighMul64 step that picks a group index from the finalized hash; the intermediate 0xAE502812AA7333 surfaces inside that HighMul64 computation as the high half of the 128-bit product. The three-xor-shift, two-multiply cascade is the diagnostic: any function that runs the fmix64 finalizer and follows it with a HighMul64-style group selection is a SwissTable probe.
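A HighMul64 step is simply the upper 64 bits of a 64x64-bit product. The sketch below shows that primitive using the unsigned __int128 extension available in GCC and Clang on x86-64. How the binary feeds the finalized hash into this step is not spelled out here, so size_class_high is a hypothetical wrapper that only illustrates the shape of the computation.

```c
#include <stdint.h>

#define SIZE_CLASS_MUL 0x9DDFEA08EB382D69ULL

/* High 64 bits of the full 128-bit product a * b.
   unsigned __int128 is a GCC/Clang extension, not ISO C. */
static uint64_t high_mul64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Hypothetical wrapper: multiply a finalized hash by the size-class
   constant and keep the high half. The exact operand order and any
   follow-up reduction in the binary are assumptions. */
static uint64_t size_class_high(uint64_t finalized_hash) {
    return high_mul64(finalized_hash, SIZE_CLASS_MUL);
}
```

On x86-64 this compiles to a single mul with the result taken from rdx, which is why the high-half intermediate shows up as a recognisable value in decompiler output.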
H1 / H2 Split
The finalized hash splits into two halves that play distinct roles. H1 (the high 57 bits) selects a group; H2 (the low 7 bits) tags the entry inside the group. The control-byte group is 16 bytes wide and holds one control byte per entry slot. A probe loads the group, broadcasts H2 across a 16-byte SIMD register, compares for byte-equality against the control bytes in one instruction, and walks only the slots whose control byte matches H2.
size_t swiss_h1(uint64_t hash) { return hash >> 7; } // group selector
uint8_t swiss_h2(uint64_t hash) { return (uint8_t)(hash & 0x7F); } // 7-bit per-slot tag
Entry *swiss_find(const SwissTable *t, const void *key) {
uint64_t hash = fmix64((uint64_t)key);
size_t group = swiss_h1(hash) % t->n_groups;
uint8_t tag = swiss_h2(hash);
while (true) {
__m128i ctrl = _mm_loadu_si128((const __m128i *)t->ctrl[group]);
__m128i match = _mm_cmpeq_epi8(ctrl, _mm_set1_epi8(tag));
uint32_t mask = (uint32_t)_mm_movemask_epi8(match);
while (mask) {
int i = __builtin_ctz(mask); // first matching slot
Entry *e = &t->slots[group * 16 + i];
if (e->key == key) return e;
mask &= mask - 1;
}
// No live match in this group. If any slot is kEmpty, the key is absent.
__m128i empties = _mm_cmpeq_epi8(ctrl, _mm_set1_epi8((char)0x80));
if (_mm_movemask_epi8(empties)) return NULL;
group = (group + 1) % t->n_groups; // probe next group
}
}
The H2 tag turns membership into a single-instruction byte compare; only the slots whose tag matches are read further. That is why SwissTable finds a present key in expected O(1) regardless of group occupancy — the SIMD scan handles all 16 slots in one step.
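For readers without SSE2 at hand, the group match can be modelled with a scalar loop that yields the same bitmask as the compare-and-movemask pair. This is a portable stand-in for illustration, not code recovered from the binary; group_match_scalar is a made-up name.

```c
#include <stdint.h>

/* Scalar model of _mm_movemask_epi8(_mm_cmpeq_epi8(ctrl, broadcast(tag))):
   bit i of the result is set exactly when ctrl[i] == tag. */
static uint32_t group_match_scalar(const uint8_t ctrl[16], uint8_t tag) {
    uint32_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        if (ctrl[i] == tag) {
            mask |= 1u << i;
        }
    }
    return mask;
}
```

Calling it with the H2 tag gives the candidate slots; calling it with 0x80 gives the empty-slot mask that ends an unsuccessful probe.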
Control-Byte Sentinels
Four control-byte values encode slot state. The top bit distinguishes occupied slots (top bit clear, byte holds H2) from special slots (top bit set, byte encodes a state).
| Sentinel | Value | Meaning |
|---|---|---|
| kEmpty | 0x80 | slot has never been used; probe terminates here |
| kDeleted | 0xFE | slot was occupied and erased; probe continues past it |
| kSentinel | 0xFF | end-of-table guard; never matched as an entry |
| occupied | 0x00..0x7F | H2 tag of the entry currently in this slot |
The probe terminates on kEmpty because no later insertion could have written past an empty cell; on kDeleted it continues so it can still find live entries inserted before the deletion. The kSentinel byte sits one past the last valid slot and lets the SIMD scan run without a separate bounds check.
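The table above maps directly onto a small classifier keyed on the top bit. The enum and function names are illustrative; CTRL_UNKNOWN covers byte values with the top bit set that the table does not list.

```c
#include <stdint.h>

typedef enum {
    CTRL_OCCUPIED,  /* 0x00..0x7F: byte is the H2 tag */
    CTRL_EMPTY,     /* 0x80 */
    CTRL_DELETED,   /* 0xFE */
    CTRL_SENTINEL,  /* 0xFF */
    CTRL_UNKNOWN    /* any other value with the top bit set */
} CtrlState;

static CtrlState ctrl_state(uint8_t c) {
    if ((c & 0x80) == 0) return CTRL_OCCUPIED;  /* top bit clear: live entry */
    switch (c) {
    case 0x80: return CTRL_EMPTY;
    case 0xFE: return CTRL_DELETED;
    case 0xFF: return CTRL_SENTINEL;
    default:   return CTRL_UNKNOWN;
    }
}

/* A probe may stop only at an empty slot; deleted slots must be skipped. */
static int ctrl_terminates_probe(uint8_t c) {
    return ctrl_state(c) == CTRL_EMPTY;
}
```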
Growth and Rehash
SwissTable's growth policy is structurally similar to DenseMap's but expressed against the number of empty control slots rather than against tombstone count. Growth fires when the number of remaining empty slots drops below (7 * capacity) / 16; rehash-in-place to clear kDeleted slots fires when the tombstone fraction exceeds an analogous threshold. The same power-of-two next_size cascade from the DenseMap section applies.
SmallVector Inline-Capacity Markers
A SmallVector value block begins with an 8-byte header: high 32 bits encode the inline capacity, low 32 bits hold the current size. Empty construction with a small inline capacity writes the whole header as a single 64-bit store of a recognisable constant.
| Constant | Decoded meaning |
|---|---|
| 0x300000000 | inline capacity = 3, size = 0 |
| 0x400000000 | inline capacity = 4, size = 0 |
| 0x600000000 | inline capacity = 6, size = 0 |
These constants appear near allocator entry points such as sub_44A8C20 whenever a pass-local SmallVector is initialized. They are unambiguous because no DenseMap or SwissTable slot encoding produces a value in the 0x100000000-0xFFF00000000 range with all-zero low 32 bits.
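Decoding a candidate constant is a shift and a mask. The sketch below splits a 64-bit word into the two fields and applies the range test from the paragraph above; the struct and function names are illustrative.

```c
#include <stdint.h>

typedef struct {
    uint32_t size;      /* low 32 bits */
    uint32_t capacity;  /* high 32 bits */
} SizeCap;

static SizeCap decode_size_cap(uint64_t word) {
    SizeCap sc;
    sc.size     = (uint32_t)(word & 0xFFFFFFFFu);
    sc.capacity = (uint32_t)(word >> 32);
    return sc;
}

/* True when the word has the shape of an empty-vector header:
   size == 0 and a small non-zero capacity (0x1..0xFFF, matching the
   0x100000000-0xFFF00000000 range quoted in the text). */
static int looks_like_inline_cap_marker(uint64_t word) {
    SizeCap sc = decode_size_cap(word);
    return sc.size == 0 && sc.capacity >= 1 && sc.capacity <= 0xFFF;
}
```

The DenseMap sentinels and the SwissTable multiplier both fail this test because their low 32 bits are non-zero.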
Identification Procedure
A short procedure classifies any constant or call site:
- `(p >> 9) ^ (p >> 4)` near 16-byte slot accesses, paired with `0xFFFFFFFFFFFFF000` (empty) and `0xFFFFFFFFFFFFE000` (tombstone) pointer reads, points to LLVM DenseMap or DenseSet.
- The immediate `0x9DDFEA08EB382D69` followed by an xor-shift-multiply cascade, paired with a SIMD compare against an immediate that broadcasts a single byte (`pshufb`/`pcmpeqb` against H2), points to Abseil SwissTable. Control bytes `0x80`, `0xFE`, `0xFF` appearing in the same probe loop are the sentinel signature.
- A 64-bit constant of the form `0xN00000000` for small `N` near a `SmallVector` allocator is an inline-capacity marker, not a hash-table sentinel.
The two hash-table families are distinguishable by sentinel domain: DenseMap stores sentinels as pointer values in the key slot, SwissTable stores them as bytes in a separate control-byte group. The SwissTable's size-class multiplier 0x9DDFEA08EB382D69 is the strongest single-call-site signature.
Consumers
The DenseMap family backs every uniquing table in the binary — the Level-1 / Level-2 buckets in
Storage Uniquer and Context Impl — Two-Level Intern Table, the dual-width DenseMaps
embedded in AsyncValueImpl (see AsyncValue and BLAKE3 Interning — AsyncValueImpl Header),
the OperationName * fingerprint hashmap built by FrozenRewritePatternSet
(Pattern Vtables and Shapes — Pattern Application Drivers), and the interface entry arrays in
Interface Vtables — InterfaceMap Layout. The SwissTable family is exclusive to the scheduler and
the IR intern tables consumed by the BLAKE3 driver.
Cross-References
Storage Uniquer and Context Impl describes the type and attribute interning tables that sit on top of these container families. Modulo Driver — Per-Attempt SwissTable Scratches documents the scheduler control flow that consumes the SwissTable intern tables described above. AsyncValue and BLAKE3 Interning describes the BLAKE3 digest path that feeds the SwissTable family.
Diagnostic ABI and Helpers
Abstract
Every user-visible error, warning, note, and remark produced by Tileiras flows through a single 208-byte
Diagnostic body. Verifiers, parsers, conversion patterns, pass drivers, and dialect-init routines all
seed that body through one of three constructors, stream fragments into a 4-slot inline argument
buffer, and rely on an InFlightDiagnostic RAII wrapper to flush through a context-registered handler.
The sections below cover the exact body layout, the 24-byte DiagnosticArg 3-tuple, the bit-packed severity
word at +0x10, and the constructor / streamer / destructor triad that builds and tears it down.
This page is the body-layout reference. For the end-to-end story of how those bodies flow through the three error-handling layers — engine, TileAS pass-failure handshake, and driver exit codes — see Error Handling and Diagnostics.
Diagnostic Body
sub_44A8C20(0xD0) allocates the heap body, zero-fills it, and hands it off to one of the seeds for
population. The first 200 bytes are state; the remainder is a 64-byte inline sink buffer the default
handler uses when no external sink is registered. Offsets below come verbatim from sub_446EC50
(the emitOpError seed) and sub_4448AC0 (the destructor).
typedef struct Diagnostic {
/*+0x00*/ Location loc; // interned LocationAttr*, 0 once flushed
/*+0x10*/ uint16_t packed_severity_flags; // class | (op_prefix<<8) | (trace<<9)
/*+0x18*/ DiagnosticArg *args_begin; // == &inline_args[0] until spill
/*+0x20*/ uint32_t args_size; // low dword of the 0x400000000 init
/*+0x24*/ uint32_t args_cap; // high dword; starts at 4
/*+0x28*/ DiagnosticArg inline_args[4]; // 24 B per slot = 96 B
/*+0x88*/ SmallVector<std::string, 0> owned_strings;
/*+0xA0*/ SmallVector<Diagnostic *, 0> notes; // pointers to child diagnostics, 0xC0-byte bodies
/*+0xB8*/ raw_ostream *inline_sink; // initialised to self+0xC8 by ctor
/*+0xC8*/ uint8_t alive; // 1 after ctor, 0 once emitted
} Diagnostic; // sizeof = 208 (0xD0)
args_begin at +0x18 points into the inline buffer at +0x28 until the argument count crosses
four; the streamer then promotes to heap storage and rewrites the pointer. owned_strings at +0x88
holds any payload the diagnostic had to copy — typically Twine outputs and any const char * whose
lifetime is shorter than the body. notes at +0xA0 is a vector of pointers to child Diagnostic
bodies; children are slightly smaller (0xC0) because they reuse the parent's sink rather than
carrying their own.
DiagnosticArg 3-Tuple
Every streamed argument is a 24-byte 3-tuple. The streamer dispatches on kind and interprets
value and aux according to the table below. The constructor sets kind to 1 (placeholder) on
every inline slot, so an unstreamed diagnostic prints no body text.
typedef struct DiagnosticArg {
/*+0x00*/ uint8_t kind; // 1..6, see table
/*+0x01*/ uint8_t pad[7];
/*+0x08*/ uint64_t value; // scalar or primary pointer
/*+0x10*/ uint64_t aux; // length, twine kind, or unused
} DiagnosticArg; // sizeof = 24
| Kind | Meaning |
|---|---|
| 1 | placeholder (unset; ctor default) |
| 2 | int64 — value lives in value |
| 3 | const char* — pointer in value, length recomputed by the printer |
| 4 | heap-string — body owns a std::string in owned_strings |
| 5 | StringRef — {value=ptr, aux=len} |
| 6 | Twine — {value=twine_ptr, aux=twine_kind}; large path renders into kind 4 |
The streamer at sub_44488C0 walks args_begin at a 24-byte stride. New arguments fill
inline_args[0..3] first; the fifth and any later argument spills to a heap-grown vector reached
through the same args_begin pointer, so consumers never special-case the small-buffer state.
Severity Word
The 16-bit field at +0x10 packs the severity class into the low byte and two boolean flags into the
next byte. The upper bits are reserved and zero in every observed value.
| Bit range | Field | Encoding |
|---|---|---|
| 0..7 | severity class | 1=Note, 2=Warning, 3=Error, 4=Remark |
| 8 | op-prefix flag | printer prepends ' op "<name>" boilerplate |
| 9 | trace flag | a trace/child note is attached |
| 10..15 | reserved | always zero |
Five concrete words appear across the binary: 0x101, 0x103, 0x104, 0x302, 0x503. 0x101 is
the constructor default — class 1 (Note) with the op-prefix bit set, used by attachNote paths.
0x103 is the canonical verifier-failure word: Error class with op-prefix. 0x104 is the Remark
flavour used by the "diagnostic emitted with trace:\n" child diagnostic. 0x302 and 0x503 are
inliner-set words whose bit 9 says the diagnostic carries a structured trace note. The shape of
these bytes mirrors the flags |= 4 and pass[5] |= 4 failure-bit patterns used by the Schedule
and pass-pipeline state machines elsewhere in the binary, which keeps the printer from having to
reach back through those records when it decides whether to emit the error: prefix.
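The bit layout in the table can be unpacked with two shifts and a mask. The sketch below follows the table literally (class in bits 0..7, op-prefix in bit 8, trace in bit 9); the struct and function names are illustrative.

```c
#include <stdint.h>

typedef struct {
    uint8_t severity_class;  /* bits 0..7: 1=Note, 2=Warning, 3=Error, 4=Remark */
    int     op_prefix;       /* bit 8 */
    int     trace;           /* bit 9 */
} SeverityFields;

static SeverityFields decode_severity(uint16_t word) {
    SeverityFields f;
    f.severity_class = (uint8_t)(word & 0xFF);
    f.op_prefix      = (word >> 8) & 1;
    f.trace          = (word >> 9) & 1;
    return f;
}

static const char *severity_name(uint8_t cls) {
    switch (cls) {
    case 1:  return "Note";
    case 2:  return "Warning";
    case 3:  return "Error";
    case 4:  return "Remark";
    default: return "unknown";
    }
}
```

For instance, 0x103 decodes to class 3 (Error) with the op-prefix bit set and the trace bit clear, and 0x302 decodes to class 2 (Warning) with both flag bits set.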
Construction
Three seeds populate the body. sub_446EC50 is the emitOpError constructor: given a Location
and an Operation*, it allocates 208 bytes via sub_44A8C20(0xD0), zero-fills, writes the location
to +0x00, writes packed severity 0x103 to +0x10, sets args_begin = self+0x28, packs
args_size=0 and args_cap=4 into the 64-bit word at +0x20 through the immediate 0x400000000,
points inline_sink at self+0xC8, sets alive=1, and finally streams in the op-name fragment
through sub_44487A0 (which emits an ' op "<name>" kind-3 argument).
sub_4470160 is a 12-byte thin wrapper forwarding to sub_446EC50 without the op-name prefix —
the free-standing emitError entry point for parser and driver code that has no Operation * to
hand. sub_444B3A0 is the generic constructor: it takes a Location and an explicit severity class,
and switches on the stack-trace path when the global --mlir-print-stacktrace-on-diagnostic toggle
is set. With that toggle live, it creates a child diagnostic through sub_444B160, streams the
literal "diagnostic emitted with trace:\n", sets the child's severity word to 0x104, and pushes
the rendered backtrace as a kind-4 argument.
Streaming Arguments
sub_44488C0 is Diagnostic::operator<<(DiagnosticArg&&) and its char-pointer / StringRef overloads.
Each call writes a 24-byte record into args_begin[args_size] and bumps args_size. When
args_size reaches args_cap, the streamer reallocates onto the heap and rewrites args_begin to
the new buffer; the four inline slots at +0x28 stay in place but zeroed, so the destructor can
still scan them safely.
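The inline-to-heap promotion can be modelled with a four-slot small buffer. The sketch below is a minimal stand-alone model, assuming a doubling growth policy (the real growth factor is not stated here) and using made-up names (Arg, ArgBuf, argbuf_push); it mirrors the described behaviour of zeroing the inline slots after the spill.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t  kind;
    uint8_t  pad[7];
    uint64_t value;
    uint64_t aux;
} Arg;  /* 24 bytes, same shape as the 3-tuple described earlier */

typedef struct {
    Arg     *begin;            /* == inline_slots until the first spill */
    uint32_t size;
    uint32_t cap;
    Arg      inline_slots[4];
} ArgBuf;

static void argbuf_init(ArgBuf *b) {
    memset(b, 0, sizeof *b);
    b->begin = b->inline_slots;
    b->cap   = 4;
}

/* Append one argument; returns 0 on success, -1 on allocation failure. */
static int argbuf_push(ArgBuf *b, Arg a) {
    if (b->size == b->cap) {
        uint32_t new_cap = b->cap * 2;  /* assumed growth policy */
        Arg *heap;
        if (b->begin == b->inline_slots) {
            heap = malloc((size_t)new_cap * sizeof *heap);
            if (!heap) return -1;
            memcpy(heap, b->inline_slots, (size_t)b->size * sizeof *heap);
            memset(b->inline_slots, 0, sizeof b->inline_slots);  /* left zeroed */
        } else {
            heap = realloc(b->begin, (size_t)new_cap * sizeof *heap);
            if (!heap) return -1;
        }
        b->begin = heap;
        b->cap   = new_cap;
    }
    b->begin[b->size++] = a;
    return 0;
}

static void argbuf_free(ArgBuf *b) {
    if (b->begin != b->inline_slots) free(b->begin);
}
```

Readers only ever walk begin[0..size), so they behave identically before and after the spill.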
The streamer also owns the kind-4 promotion path. When a streamed Twine does not fit the small
representation, sub_4581720 renders it into a std::string, pushes the string into
owned_strings at +0x88, and rewrites the argument's kind to 4 with value pointing into the
owning vector. The same mechanism rescues a const char * whose lifetime is known to be shorter
than the diagnostic — the helper copies the bytes into owned_strings and upgrades the kind from 3
to 4 so the body is self-contained at flush time.
Notes and Trace Chains
attachNote(Location) is implemented by sub_444B160. It allocates a 192-byte child body (no inline
sink), copies the parent's location into the child's +0x00 if the caller did not pass a fresh one,
appends the child pointer to the parent's notes vector at +0xA0, and returns a reference to the
child so the caller can stream additional fragments into it. The child's inline_sink is left null;
the parent's destructor will render the child against the parent's sink at flush time.
Diagnostic *attach_note(Diagnostic *parent, Location loc) {
Diagnostic *child = (Diagnostic *)sub_44A8C20(0xC0);
memset(child, 0, 0xC0);
child->loc = loc ? loc : parent->loc;
child->packed_severity_flags = 0x101;
child->args_begin = &child->inline_args[0];
child->args_size = 0;
child->args_cap = 4;
child->alive = 1;
vector_push(&parent->notes, child);
return child;
}
The trace path uses the same primitive, but the parent constructor sets bit 9 of its own severity word so the printer knows to walk into the child without re-checking the global toggle.
Sink Chain
inline_sink at +0xB8 is a raw_ostream*. The constructor points it at the inline buffer at
+0xC8, which gives the default handler a 64-byte small-string sink for assembling the formatted
message. op->emitError(...) and op->emitWarning(...) overwrite this pointer with llvm::errs()
when the caller wants direct stderr output; capture tools (such as the diagnostic-handler interface
that backs in-process IR tests) replace it with their own ostream.
The destructor flushes through whatever sink is currently installed, so the choice of sink is the
only place a caller can intercept the formatted output before it reaches the registered context
handler. Replacing the sink does not bypass the handler chain — the handler still runs against the
structured Diagnostic after the sink flushes the inline rendering.
Destruction and Flush
sub_4448AC0 is the destructor — the only function permitted to flip alive from 1 to 0. The
flush path is short:
void diagnostic_destroy(Diagnostic *d) {
if (d->loc != 0) {
sub_44488C0(d->loc->context->engine_mutex, d); // engine handler chain
d->loc = 0;
}
if (d->alive == 1) {
d->alive = 0;
for (Diagnostic *n : d->notes) { diagnostic_destroy(n); free(n); }
for (std::string &s : d->owned_strings) { s.~basic_string(); }
if (d->args_begin != &d->inline_args[0]) {
free(d->args_begin);
}
}
}
alive at +0xC8 is the double-emit guard. A moved-out diagnostic — the common case when an
InFlightDiagnostic is returned as a LogicalResult — clears its location pointer; the destructor
of the moved-from shell then sees loc == 0 and skips the handler call, then sees alive == 0 and
skips the cleanup. The same byte is why attachNote cannot be called after a diagnostic has been
flushed: the engine clears loc first, so a follow-on append would have nothing to attach to and the
printer would lose the parent context.
Engine Entry Point
sub_44488C0 is the function the destructor calls when loc != 0. It takes the engine's pthread
mutex and the diagnostic body, locks the mutex, walks the registered diagnostic handler chain — an
intrusive linked list rooted in the engine — and offers the diagnostic to each handler in turn. The
first handler that returns true consumes the diagnostic. If no handler consumes it, the default
handler emits an "error: " prefix when the severity class is exactly 3 (Error), walks the argument
vector through sub_4448570 to render each kind, emits a trailing newline, and flushes the sink.
The engine mutex is released only after the chain walk finishes, whichever handler accepted the diagnostic, so a handler that throws is a hard contract violation. The codebase relies on every handler being noexcept and on the engine never being re-entered from inside a handler callback.
How to Recognize in a Binary
The 208-byte (0xD0) allocation immediately followed by a zero-fill, a write of one of the five
canonical severity words (0x101, 0x103, 0x104, 0x302, 0x503) to the +0x10 offset, and the
single 64-bit 0x400000000 store at +0x20 ({size=0, cap=4}) is the unambiguous fingerprint. Any
function that allocates 0xD0 bytes through sub_44A8C20 and then stores 0x400000000 at offset
+0x20 is a Diagnostic constructor, regardless of which severity it ultimately writes.
The complementary destructor fingerprint at sub_4448AC0 is the loc != 0 ? engine_call : skip
guard followed by the alive == 1 cleanup gate. Any function that branches on a qword at +0x00 of
a 208-byte object, then on a byte at +0xC8 of the same object, is Diagnostic::~Diagnostic. The
double-flush guard is the same byte at +0xC8.
Consumers
Operation::emitOpError (sub_446EC50) is the primary entry — every verifier in the binary calls it
when an invariant fails. The pattern application drivers documented in
Pattern Vtables and Shapes seed 0x103-class diagnostics when a
matchAndRewrite returns failure with an explanation. The scheduler reuses the same severity-bit
pattern in its Schedule.flags |= 4 failure-bit encoding; see
Modulo Scheduler and Rau-Style Placement.
Cross-References
Error Handling and Diagnostics
is the canonical end-to-end page tying this body layout together with the
TileAS pass-failure handshake and the driver-level exit codes. Operation
Layout covers the Operation header that emitOpError
reads its mnemonic from. Storage Uniquer and Context Impl
documents the context that owns the diagnostic engine and its handler chain.
Pass-Failure Handshake covers the
*(self+40) |= 4 soft-failure convention that pairs with most Error-class
diagnostics in the TileAS pass family. Modulo Scheduler and Rau-Style
Placement reuses the same severity-bit pattern in its
Schedule.flags |= 4 failure-bit encoding.
AsyncValueImpl and BLAKE3 IR Interning
Abstract
Two pieces of nv_tileas infrastructure sit immediately under the warp-specialisation scheduler. The first is AsyncValueImpl, an 808-byte (0x328) heap record that anchors every Pipe_ and Mutex_ SSA value the scheduler manipulates; the second is a BLAKE3-based content hasher used as the keying function for several intern tables that deduplicate IR-object tuples. The two mechanisms are unrelated in purpose — one is a fat scheduler-side header, the other is a 64-bit content key — but they share callers in the same address range and they share the same allocator family, so they are documented together.
The BLAKE3 driver lives at sub_45BF670. It is not ChaCha20 even though both algorithms use the 7/8/12/16 rotation set: the binary loads the canonical SHA-256/BLAKE2/BLAKE3 IV (0x6A09E667 0xBB67AE85 0x3C6EF372 0xA54FF53A 0x510E527F 0x9B05688C 0x1F83D9AB 0x5BE0CD19) verbatim from xmmword_503C080 / xmmword_503C090 as two _mm_load_si128 operands, contains exactly 56 sixteen-bit left-rotations per block (seven rounds of eight quarter-rounds), threads a chunk-counter and a ROOT/CHUNK_END flag bit through the compression block, and never contains the "expand 32-byte k" / "expand 16-byte k" sigma strings that any ChaCha20 implementation would carry. Five independent corroborations of this identification exist in the binary; the only feature ChaCha20 and BLAKE3 share at this depth is the rotation amounts.
AsyncValueImpl Header
Behind every Pipe_ or Mutex_ SSA value the scheduler creates sits an 808-byte AsyncValueImpl. Three constructors allocate one: sub_8E0070 for the Mutex_ flavour (3240 bytes of code), sub_8E9450 for the scalar Pipe_ flavour (3157 bytes of code), and sub_8EA0B0 for the tensor Pipe_ flavour (3264 bytes of code). All three call sub_44A8C20(0x328), a BumpPtrAllocator-style wrapper that hands out arena-stable pointers, then run the same 14-line initialiser prologue before specialising. Arena stability is non-negotiable: the DenseMap<Operation*, T> instances embedded in the header hash with (op>>9) ^ (op>>4), so moving an AsyncValueImpl after construction would break every probe that follows.
The initialiser sets up three inline SmallString heads at capacity 3, four inline SmallVector<u64,6> heads at capacity 6, sets hasValue at byte 64, copies "Mutex_" (at 0x4607054) or "Pipe_" (at 0x4607077) into the std::string SSO buffer at offset 0 through sub_44E1740, then stitches the header into the owning builder's growable SmallVector<AsyncValueImpl*> at (builder+168, +176, +180). Capacity encodings are 0x300000000 (low dword size=0, high dword cap=3) for the SSO strings and 0x600000000 for the inline-6 SmallVectors.
struct AsyncValueImpl {
/*+0x000*/ char name_dataplus[8]; // std::string GCC SSO: data ptr
/*+0x008*/ uint64_t name_length; // string size
/*+0x010*/ char name_inline[16]; // SSO inline buffer ("Mutex_\0" or "Pipe_\0")
/*+0x020*/ void* producerType; // Operation* (Pipe_/Pipe-T) or 0 (Mutex_)
/*+0x028*/ void* consumerType; // Operation*
/*+0x030*/ void* producerPayload; // first qword of *a7[2]
/*+0x038*/ void* consumerPayload; // first qword of *a7[3]
/*+0x040*/ uint8_t hasValue; // 1 once emitPayload runs
/*+0x041*/ uint8_t _pad0[7];
/*+0x048*/ uint32_t regionStageKind; // 1 for Mutex_; looked up for Pipe_
/*+0x04c*/ uint16_t okFlag; // Optional<u8>: {hasValue, value}
/*+0x04e*/ uint16_t payloadFlag; // Optional<u8>; Pipe_ writes 0x0101
/*+0x050*/ SmallString_48 tag1; // inline cap=3 (data@80, size/cap@88, inline@96)
/*+0x090*/ SmallString_48 tag2; // inline cap=3 (data@144, size/cap@152, inline@160)
/*+0x0d0*/ uint8_t kind; // 0=scalar Pipe_, 1=Mutex_ or Pipe-T
/*+0x0d1*/ uint8_t _pad1[7];
/*+0x0d8*/ DenseMap_48 chainMapA; // <Op*, SmallVector<u64,0>>; 48-byte bucket
/*+0x0f0*/ DenseMap_48 chainMapB; // symmetric consumer-side
/*+0x108*/ SmallVector_u64_6 stageVecA; // 64 B, cap=6 inline (data@264)
/*+0x148*/ SmallVector_u64_6 stageVecB; // 64 B, cap=6 inline (data@328)
/*+0x188*/ DenseMap_16 indexMap0; // <Op*, i32>; 16-byte bucket (Mutex_ primary)
/*+0x1a0*/ DenseMap_16 indexMap1; // symmetric
/*+0x1b8*/ uint32_t statusBits0; // OR-accumulated by emitPayload
/*+0x1bc*/ uint32_t statusBits1; // consumer-side analogue
/*+0x1c0*/ void* scheduleMirror; // cached Schedule::opToStage data ptr
/*+0x1c8*/ uint64_t _reserved1;
/*+0x1d0*/ uint32_t scheduleCapacity;
/*+0x1d4*/ uint32_t _pad2;
/*+0x1d8*/ SmallVector_Op_6 producerOps; // 64 B inline-6 (data@472, size@480, cap@484)
/*+0x218*/ SmallVector_Op_6 consumerOps; // 64 B inline-6 (data@536)
/*+0x258*/ SmallVector_u64_0 producerOrders; // 24 B zero-inline (data@600)
/*+0x270*/ SmallVector_u64_0 producerStages; // 24 B (data@624)
/*+0x288*/ SmallVector_u64_0 consumerOrders; // 24 B (data@648)
/*+0x2a0*/ SmallVector_u64_0 consumerStages; // 24 B (data@672)
/*+0x2b8*/ SmallVector_u64_0 producerPairsA; // 24 B; packed (stage<<32 | order)
/*+0x2d0*/ SmallVector_u64_0 consumerPairsA; // 24 B
/*+0x2e8*/ SmallVector_u64_0 producerPairsB; // 24 B
/*+0x300*/ SmallVector_u64_0 consumerPairsB; // 24 B
/*+0x318*/ __m128i statusWord; // Optional<RegisterSlot>, init from xmmword_4607080
};
static_assert(sizeof(struct AsyncValueImpl) == 0x328, "808 bytes");
The dual DenseMap widths are intentional and every reader depends on them. 48-byte-stride maps at +0x0d8 / +0x0f0 carry SmallVector<u64,0> values (the order set observed for each chained operation); 16-byte-stride maps at +0x188 / +0x1a0 carry raw i32 indices. Both share the same (op>>9)^(op>>4) hash, the same empty (Operation*)-4096 and tombstone (Operation*)-8192 sentinels, and the same 4*(size+1) >= 3*capacity growth threshold. Mixing the bucket strides corrupts every later read, because every consumer indexes by absolute byte offset.
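One way to picture the shared probe logic is a lookup parameterised only by bucket stride. The sketch below is a simplified model under stated assumptions: the key occupies the first 8 bytes of each bucket, the capacity is a power of two, and tombstones are not modelled. The helper names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EMPTY_KEY 0xFFFFFFFFFFFFF000ULL

/* Mark every bucket empty by writing the empty sentinel into its key word. */
static void strided_init(void *buckets, size_t cap, size_t stride) {
    uint64_t e = EMPTY_KEY;
    for (size_t i = 0; i < cap; ++i) {
        memcpy((unsigned char *)buckets + i * stride, &e, sizeof e);
    }
}

/* Return the bucket holding key, or the first empty bucket on its probe
   path (where an insert would land), or NULL if the table is full. */
static void *strided_probe(void *buckets, size_t cap, size_t stride, uint64_t key) {
    size_t mask = cap - 1;
    size_t idx  = (size_t)((key >> 9) ^ (key >> 4)) & mask;
    for (size_t probes = 0; probes < cap; ++probes) {
        unsigned char *b = (unsigned char *)buckets + idx * stride;
        uint64_t k;
        memcpy(&k, b, sizeof k);
        if (k == key || k == EMPTY_KEY) return b;
        idx = (idx + 1) & mask;
    }
    return NULL;
}
```

The same key lands at the same bucket index in a 16-byte-stride and a 48-byte-stride table, but at different byte offsets, which is why reading one map with the other's stride returns unrelated bytes.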
Construction Prologue
All three constructors run the same 14-line initialiser before specialising. Every embedded inline buffer becomes self-pointer-valid before any subsequent write touches it — deferring this step breaks the SmallString / SmallVector inline-vs-heap discriminator that downstream code relies on.
void asyncvalue_init(struct AsyncValueImpl *v) {
    uint8_t  *base = (uint8_t *)v;   /* byte view: v + N would step N whole records */
    uint64_t *q    = (uint64_t *)v;  /* qword-indexed view of the same record */
    memset(v, 0, 0x328);
    /* name (std::string SSO) and two SmallString<48> heads: data ptr -> own inline buffer */
    q[0]  = (uint64_t)(base + 16);   /* name.data -> name_inline */
    q[10] = (uint64_t)(base + 96);   /* tag1.data -> tag1.inline */
    q[18] = (uint64_t)(base + 160);  /* tag2.data -> tag2.inline */
    q[11] = 0x300000000ULL;          /* tag1: size=0, cap=3 */
    q[19] = 0x300000000ULL;          /* tag2: size=0, cap=3 */
    /* four inline-6 SmallVector heads: data ptr -> own inline storage */
    q[33] = (uint64_t)(base + 280);  /* stageVecA */
    q[41] = (uint64_t)(base + 344);  /* stageVecB */
    q[59] = (uint64_t)(base + 488);  /* producerOps */
    q[67] = (uint64_t)(base + 552);  /* consumerOps */
    q[34] = 0x600000000ULL;          /* stageVecA: size=0, cap=6 */
    q[42] = 0x600000000ULL;          /* stageVecB */
    q[60] = 0x600000000ULL;          /* producerOps */
    q[68] = 0x600000000ULL;          /* consumerOps */
}
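The inline-vs-heap discriminator mentioned above is just a pointer comparison against the vector's own inline storage. The sketch below models one inline-6 head with an assumed field order (data pointer, size, capacity, then six inline qwords) and shows why a byte-wise copy of the record to a new address breaks the test; all names are illustrative.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t *data;               /* +0:  points at inline_storage while small */
    uint32_t  size;               /* +8  */
    uint32_t  cap;                /* +12 */
    uint64_t  inline_storage[6];  /* +16: 48 bytes, total 64 */
} SmallVecU64x6;

static void sv_init(SmallVecU64x6 *v) {
    memset(v, 0, sizeof *v);
    v->data = v->inline_storage;  /* self-pointer must be set before any write */
    v->cap  = 6;
}

/* The discriminator: the vector is "small" iff data points at its own buffer. */
static int sv_is_inline(const SmallVecU64x6 *v) {
    return v->data == v->inline_storage;
}
```

A plain struct copy carries the old data pointer to the new address, so the copy's sv_is_inline test fails even though nothing was ever spilled to the heap. That is the failure mode arena-stable allocation avoids.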
Each constructor then writes its flavour-discriminating fields. sub_8E0070 writes kind = 1 at +208, regionStageKind = 1 at +72, copies "Mutex_" into the SSO, threads producer/consumer index ranges through sub_8D9750 into producerOps / consumerOps, runs the two parallel hash-table fill loops over indexMap0 / indexMap1, and ends by calling sub_8F7900 to compute the okFlag Optional at +76. sub_8E9450 leaves kind = 0, looks up regionStageKind from the region-tree map at (builder+104) through sub_8DA7D0, writes "Pipe_" into the SSO, and conditionally writes payloadFlag = 0x0101 at +78 when no producer-side range is supplied. sub_8EA0B0 is the tensor variant — same shape as the scalar Pipe_ plus an extra arm threading two additional SmallVector arguments through chainMapA / chainMapB.
The shared tail sub_8E7A70 (Pipe::emitPayload) drives the transition from CONSTRUCTED to PAYLOADED. It copies *(a7+16) and *(a7+24) into producerPayload / consumerPayload, sets hasValue = 1, caches Schedule::opToStage's data pointer into scheduleMirror, populates the four SmallVector<u64,0> quadruples by joining each producer/consumer op against the scheduler's stage and order maps, then loads the 16-byte Optional<RegisterSlot> constant at xmmword_4607080 into statusWord at +792. The header carries no atomic refcount: lifetime is arena-based, and teardown only runs on the failure-to-append path in the constructor through sub_8DB490 followed by free.
parseFromAttrs
sub_8FB180 is the Schedule-side companion that lets the verifier and a few lowering passes rebuild an in-memory Schedule from MLIR attributes attached to the schedule-owning operation. It reads two DenseI64ArrayAttr attributes — "nv_tile.aws.stage" and "nv_tile.aws.order" — through the discardable-attribute fast path (sub_446DC50) when the op's discardable bit is set, or the inherent-attribute dictionary walker (sub_440E370) otherwise, validates both type tags against &unk_5BE5F40, then walks the block's operation list in lock-step with the two arrays.
Each (op, stage, order) triple drops into a pair of 16-byte-stride DenseMap<Operation*, int32_t> instances laid out exactly like the indexMap0 / indexMap1 maps inside AsyncValueImpl: same hash, same sentinels, same load factor. The accumulator is a 64-byte Schedule struct: owning op at +0, stage map at +8/+16/+24, order map at +32/+40/+48, valid flag at +56. A type-tag mismatch on either side clears the valid flag and returns immediately; the downstream verifier treats !valid as "schedule not parsed".
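The offsets quoted above can be pinned down with a struct and compile-time checks. The sketch below assumes the upstream LLVM DenseMap field order (bucket pointer, entry count, tombstone count, bucket count) for the 24-byte map heads, which is not confirmed for this binary; the type and function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    void    *buckets;         /* +0  */
    uint32_t num_entries;     /* +8  */
    uint32_t num_tombstones;  /* +12 */
    uint32_t num_buckets;     /* +16 */
    uint32_t pad;             /* +20 */
} DenseMapHead;               /* 24 bytes; field order is an assumption */

typedef struct {
    void        *owner_op;    /* +0  */
    DenseMapHead stage_map;   /* +8  */
    DenseMapHead order_map;   /* +32 */
    uint8_t      valid;       /* +56 */
} ParsedSchedule;

_Static_assert(offsetof(ParsedSchedule, stage_map) == 8,  "stage map at +8");
_Static_assert(offsetof(ParsedSchedule, order_map) == 32, "order map at +32");
_Static_assert(offsetof(ParsedSchedule, valid)     == 56, "valid flag at +56");

static void schedule_begin(ParsedSchedule *s, void *op) {
    memset(s, 0, sizeof *s);
    s->owner_op = op;
    s->valid    = 1;
}

/* Models the early-out on a type-tag mismatch. */
static void schedule_fail(ParsedSchedule *s) {
    s->valid = 0;
}
```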
BLAKE3 Driver
Four entry points reach tileiras' IR-interning callers: blake3_init at sub_45BEC80, blake3_update at sub_45BECE0, blake3_finalize at sub_45BF540, and a CPU-feature dispatcher pair at sub_45BF620 / sub_45BF670 that tail-call the actual compression routines at sub_45BF840 (in-place state update) and sub_45C03D0 (output-emitting variant). The dispatchers consult a lazy-initialised feature mask at dword_5B3761C built from CPUID(0/1/7) and xgetbv results; in this binary the mask only ever gates between "uninitialised" and "baseline scalar", so the AVX2 / AVX-512 specialisations the upstream BLAKE3 reference would carry have been stripped.
The initialiser loads the canonical IV (eight 32-bit words, 32 bytes in total, the constant shared by SHA-256, BLAKE2s and BLAKE3) into the first 64 bytes of the hasher state, then zeroes the counter, the buffer, and the flags. Filling 64 bytes from a 32-byte IV is consistent with the reference BLAKE3 layout, which stores the IV twice: once as the key words and once as the initial chaining value of the chunk state. The state is 1976 bytes, observed as _BYTE v34[1976] in sub_3CC6B10, and _BYTE v8[1920] plus an 8-byte counter slack in sub_3C92D50. Both call sites write 1 into the byte at offset 1912: BLAKE3's default flags (0) combined with the chunk-state block_len field set to 1. The binary never sets a non-zero key; blake3_hasher_init always loads the full IV, so this is plain hash mode, never keyed-hash and never derive-key.
The compression block runs seven rounds of eight quarter-rounds — count the __ROL4__(..., 16) invocations in sub_45BF840 and you get exactly 56 — and adds the IV words 0x6A09E667 / 0xBB67AE85 / 0x3C6EF372 / 0xA54FF53A back into the rotated state at the end. The finalizer in sub_45BF540 runs an outer loop with a chunk counter (v44 in the decompile) and ORs the ROOT flag (8) into the final block's domain-separation byte; for short last blocks the CHUNK_END | ROOT combination falls out of the (buffer_len == 0) guard. Stream ciphers carry no ROOT / CHUNK_START / CHUNK_END flags, no chaining-value tree, and would never load the SHA-256 IV. This driver is BLAKE3.
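The count of 56 follows directly from the round structure. The sketch below writes out the BLAKE3 quarter-round (the G function) and one round of eight G calls, then counts the rotate-by-16 steps over seven rounds. It is a counting aid only: the message permutation between rounds and the state initialisation are omitted, so it does not compute a real compression, and `count_rot16_in_compress` is a hypothetical helper name. A 32-bit rotate left by 16 and a rotate right by 16 are the same operation, which is why the decompiler's `__ROL4__(..., 16)` matches the reference `rotr32(..., 16)`.

```c
#include <stdint.h>

static unsigned rot16_count = 0;  /* incremented once per rotate-by-16 */

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));  /* n is always 7, 8, 12 or 16 here */
}

/* One BLAKE3 quarter-round over a 16-word state. */
static void g(uint32_t s[16], int a, int b, int c, int d,
              uint32_t mx, uint32_t my) {
    s[a] = s[a] + s[b] + mx;
    s[d] = rotr32(s[d] ^ s[a], 16);
    rot16_count++;
    s[c] = s[c] + s[d];
    s[b] = rotr32(s[b] ^ s[c], 12);
    s[a] = s[a] + s[b] + my;
    s[d] = rotr32(s[d] ^ s[a], 8);
    s[c] = s[c] + s[d];
    s[b] = rotr32(s[b] ^ s[c], 7);
}

/* One round: four column mixes followed by four diagonal mixes. */
static void round_fn(uint32_t s[16], const uint32_t m[16]) {
    g(s, 0, 4, 8, 12, m[0], m[1]);
    g(s, 1, 5, 9, 13, m[2], m[3]);
    g(s, 2, 6, 10, 14, m[4], m[5]);
    g(s, 3, 7, 11, 15, m[6], m[7]);
    g(s, 0, 5, 10, 15, m[8], m[9]);
    g(s, 1, 6, 11, 12, m[10], m[11]);
    g(s, 2, 7, 8, 13, m[12], m[13]);
    g(s, 3, 4, 9, 14, m[14], m[15]);
}

/* Runs seven rounds on dummy data and reports how many rotate-by-16
 * steps executed: 7 rounds * 8 quarter-rounds * 1 rotate each. */
unsigned count_rot16_in_compress(void) {
    uint32_t s[16] = {0};
    uint32_t m[16] = {0};
    rot16_count = 0;
    for (int r = 0; r < 7; ++r)
        round_fn(s, m);
    return rot16_count;
}
```

The same counting argument applies to the other three rotation amounts, so a fully unrolled compression routine should show 56 instances of each.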
    /* The canonical 4-call sequence used by every interning caller. */
    uint64_t blake3_intern_key(void *parent_ptr, int32_t i, int32_t j) {
        uint8_t hasher[1976];                /* sizeof(blake3_hasher); 1976 in this build */
        uint64_t digest;                     /* 8-byte truncated output, used as table key */
        sub_45BEC80(hasher);                 /* blake3_hasher_init: loads IV from xmmword_503C080/090 */
        sub_45BECE0(hasher, &parent_ptr, 8); /* update: parent pointer */
        sub_45BECE0(hasher, &i, 4);          /* update: first int32 */
        sub_45BECE0(hasher, &j, 4);          /* update: second int32 */
        sub_45BF540(hasher, &digest, 8);     /* finalize: 8-byte XOF emit */
        return digest;
    }
Every consumer treats the 8-byte truncated digest as a 64-bit hash. Nothing in the binary uses the full 256-bit BLAKE3 output, and nothing uses BLAKE3 in keyed-hash or derive-key mode. Replacing upstream MLIR's llvm::hash_combine (a CityHash-derived mixer) with BLAKE3 is unusual but harmless: the same key shape and bucket policy as stock MLIR StorageUniquer flows through it. The most plausible motivation is determinism across a polyglot toolchain where one component is a Rust crate that bundles BLAKE3 directly.
Five Interning Callers
Five caller families consume the 8-byte digest, each driving a different intern-table shape. The shared input is always a small tuple — (pointer, int32, int32) or a trivial extension of it.
sub_2CC9780 is the RB-tree caller. It hashes (parent_op, i, j), walks an std::map-style red-black tree anchored at on-stack sentinel &v343 / &v348 with left, right, and parent pointers at node offsets +16, +24, and +0, and tests the 8-byte digest against a key field at +32. Insertion uses the standard top-down (unsigned __int64)ptr <= v42[4] comparison; lookup short-circuits on equality. This is the deduplication path for the IR-construction routine's most complex sub-tree.
sub_3CC1560 is the primary open-addressing intern. Capacity at +56, table base at +40, occupancy at +48, power-of-two capacity rounded via popcount, 4*(occ+1) >= 3*cap rehash threshold. The sentinels are not the usual -4096 / -8192: they are tomb = qword_5BDD9D8 and empty = unk_5BDD9E0, two address-space-stable constants the binary keeps in .bss. The successful-insert return value is table_index + 4096, where 4096 is a "no inline value" sentinel that distinguishes "key was newly created and has no associated payload yet" from a real index.
sub_3CC1E30 and sub_3CC2680 are near-identical siblings of sub_3CC1560 operating on different IR-object kinds: same capacity-mask probing, same qword_5BDD9D8 / unk_5BDD9E0 sentinels, same 4096 "no inline value" return convention. The three live as separate functions instead of a single templated body because the inline equality test against the original key tuple differs slightly per IR-object kind.
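The rehash threshold and the biased return value are easy to misread, so here is a small sketch of both. It assumes linear probing (the source only says "capacity-mask probing"), uses placeholder sentinel values instead of the two `.bss` constants, and stores the digest itself as the slot content. `intern_digest`, `needs_rehash`, `SLOT_EMPTY`, `SLOT_TOMB`, and `INTERN_FULL` are all hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY UINT64_MAX        /* placeholder for unk_5BDD9E0   */
#define SLOT_TOMB (UINT64_MAX - 1)   /* placeholder for qword_5BDD9D8 */
#define NO_INLINE_VALUE 4096u        /* bias added on a fresh insert  */
#define INTERN_FULL SIZE_MAX         /* no free slot: caller must grow */

/* True when one more insert would reach the 3/4 load-factor limit. */
bool needs_rehash(size_t occupancy, size_t capacity) {
    return 4 * (occupancy + 1) >= 3 * capacity;
}

/* capacity must be a power of two. Returns the slot index for an existing
 * key, or index + NO_INLINE_VALUE for a newly inserted one. Digests equal
 * to a sentinel value are not supported by this sketch. */
size_t intern_digest(uint64_t *slots, size_t capacity, uint64_t digest) {
    size_t mask = capacity - 1;
    size_t reuse = capacity;             /* first reusable slot, if any */
    size_t i = (size_t)digest & mask;
    for (size_t n = 0; n < capacity; ++n, i = (i + 1) & mask) {
        if (slots[i] == digest)
            return i;                    /* lookup hit */
        if (slots[i] == SLOT_TOMB && reuse == capacity)
            reuse = i;                   /* remember the first tombstone */
        if (slots[i] == SLOT_EMPTY) {
            if (reuse == capacity)
                reuse = i;
            break;                       /* key is definitely absent */
        }
    }
    if (reuse == capacity)
        return INTERN_FULL;
    slots[reuse] = digest;
    return reuse + NO_INLINE_VALUE;
}
```

With capacity 8, `needs_rehash` first fires at occupancy 5, because 4 * 6 = 24 reaches 3 * 8 = 24.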
sub_3C92D50 is the vector-of-tuples hasher. It takes an __int64 *a1[] of length a2 whose elements are 16-byte (u64, u32, u32) records, runs the standard init → update(parent_ptr, 8) → update(i32, 4) → update(j32, 4) sequence in a loop, and finalizes once at the end. The 8-byte digest is the intern key for a flat vector table; the stack frame declares _BYTE v8[1920] for the hasher state and writes v8[1912] = 1 at the chunk-state block_len slot.
sub_3CC6B10 is the heaviest consumer: a buffer-plus-sidecar hasher with a 1976-byte hasher state on its stack (_BYTE v34[1976]) and a v32 = 0x400000000LL initialiser. Read as (flags<<32) | block_len, that constant is (4 << 32) | 0, a flag value of 4 with a zero block length. In the reference BLAKE3 flag encoding (CHUNK_START = 1, CHUNK_END = 2, PARENT = 4, ROOT = 8) the value 4 is PARENT rather than CHUNK_END, so the exact role of this initialiser should be treated as unconfirmed. The caller is the buffer-plus-sparse-table content hasher that 1560 / 1E30 / 2680 fan out into when their inline key tuples reference a content blob rather than a pointer.
State Machine
An AsyncValueImpl moves through four observable states across the five constructors and the shared tail. sub_44A8C20(0x328) followed by memset(0) produces ZEROED. The initialiser prologue produces SKELETON: every SmallString and SmallVector head points at its own inline storage, every DenseMap is empty. Writing the name through sub_44E1740 plus the kind / regionStageKind fields produces CONSTRUCTED. Running sub_8E7A70 produces PAYLOADED: hasValue = 1, the four DenseMaps populated, the eight SmallVector<u64,0> quadruples filled, the statusWord at +792 loaded from xmmword_4607080. No observable transition leads back from PAYLOADED: teardown is arena discard, not per-object destruction.
The six fields that encode this state machine — hasValue at byte 64, kind at byte 208, regionStageKind at byte 72, okFlag at byte 76, payloadFlag at byte 78, and the two OR-accumulated statusBits dwords at bytes 440 and 444 — are read by absolute offset throughout the scheduler. Reordering any of them breaks ListSchedule::verify (sub_8F5410), LoopSchedule::verify (sub_8F80E0), and the dispatch hub Schedule::verifyStageOrder (sub_8F87A0).
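The forward-only progression can be captured as a tiny transition check. This is a sketch under the assumption that states advance strictly one step at a time, which the text implies but does not state outright; `AsyncState` and `can_advance` are hypothetical names.

```c
#include <stdbool.h>

/* The four observable states, in the order they are reached. */
typedef enum {
    ASYNC_ZEROED = 0,
    ASYNC_SKELETON = 1,
    ASYNC_CONSTRUCTED = 2,
    ASYNC_PAYLOADED = 3
} AsyncState;

/* Only single forward steps are allowed; PAYLOADED is terminal
 * because teardown is arena discard rather than a state change. */
bool can_advance(AsyncState from, AsyncState to) {
    if (from == ASYNC_PAYLOADED)
        return false;
    return (int)to == (int)from + 1;
}
```

A reimplementation could assert `can_advance` at each of the four construction steps to catch a tail call that runs before the name and kind fields are written.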
How to Recognize in a Binary
Three independent fingerprints identify the AsyncValueImpl path with no ambiguity:
- The constructor signature is a sub_44A8C20(0x328) allocation immediately followed by a memset(0, 0x328) and then a sequence of self-pointer initialisers writing the inline-buffer addresses (v+16, v+96, v+160, v+280, v+344, v+488, v+552) into their owning header slots. Any function with this exact prologue is Mutex_ (sub_8E0070), scalar Pipe_ (sub_8E9450), or tensor Pipe_ (sub_8EA0B0).
- The capacity-encoding immediates 0x300000000 (size=0, cap=3) and 0x600000000 (size=0, cap=6) appearing in pairs identify the SmallString and SmallVector head initialisers, respectively. These immediates are unambiguous in 0xN00000000 form because no DenseMap sentinel produces values in this range.
- The literal strings "Mutex_" at 0x4607054 and "Pipe_" at 0x4607077, both interned through sub_44E1740 into the std::string SSO at byte 0 of the header, locate the flavour switch.
The BLAKE3 driver is identified by the SHA-256/BLAKE2/BLAKE3 IV pair at xmmword_503C080 and
xmmword_503C090 (0x6A09E667 0xBB67AE85 ...), the 1976-byte hasher-state stack frame in callers,
and the chunk_state.block_len = 1 write at byte 1912 of that frame. The absence of "expand 32-byte k" or "expand 16-byte k" strings anywhere in the surrounding code, plus the presence of
the ROOT/CHUNK_START/CHUNK_END flag bits, rules out ChaCha20 — the only feature ChaCha20 and BLAKE3
share at this depth is the quarter-round rotation amounts.
Consumers
The AsyncValueImpl headers are produced during MaterializeSchedule and consumed thereafter by
every pass that walks Pipe_ or Mutex_ SSA values: the verifier (ListSchedule::verify at
sub_8F5410, LoopSchedule::verify at sub_8F80E0, Schedule::verifyStageOrder at sub_8F87A0),
the warp-specialisation legaliser, and the nv_tileas-to-nvvm lowering patterns that translate
each handle into an mbarrier or a token-passing sequence. The Pipe_ / Mutex_ IR-level view of
these values is documented in Pipe and Mutex Value Layout;
the scheduler that materialises them is in Modulo Scheduler and Rau-Style
Placement.
The BLAKE3 intern tables are consumed by five families of IR-object dedup paths, six functions in all (sub_2CC9780,
sub_3CC1560, sub_3CC1E30, sub_3CC2680, sub_3C92D50, sub_3CC6B10), that all key on the same
8-byte truncated digest and reuse the same probe machinery documented in
Container Fingerprints.
Cross-References
Storage Uniquer and Context Impl — Two-Level Intern Table describes the
canonical two-level uniquing model that the BLAKE3 intern tables fit into.
Pipe and Mutex Value Layout is the IR-level view of the
SSA values that AsyncValueImpl backs.
Modulo Scheduler and Rau-Style Placement documents the
scheduler that owns the AsyncValueImpl instances and drives Pipe::emitPayload.
Operation Layout describes the mlir::Operation pointer that the DenseMaps
inside AsyncValueImpl key on.
Container Fingerprints catalogues the open-addressing probes that the
BLAKE3 digest feeds into.
TileAS Pass-Failure Handshake
Abstract
TileAS passes communicate failure through a shared status byte at offset +40 in the per-pass PassObject. Setting bit 2 of that byte (0x04) signals a soft failure: the pass completes its walk, the driver inspects the bit once the walk terminates, and dependent downstream passes either short-circuit or skip work that requires output from a failed predecessor. Failure does not throw, does not unwind, and does not abandon the IR.
The handshake appears across the entire D08-D13 TileAS pass family — async materialization, convert-layout materialization, schedule materialization, the unspecialized pipeline pass, the pipeline-region optimizer, and the convert-tileas-to-LLVM rewriter all set or read the same bit. It is the central piece of inter-pass plumbing in TileAS.
This page documents the handshake convention specifically. For the broader three-layer error-handling architecture — MLIR diagnostic engine, pass-failure handshake, and driver-level exit codes — see Error Handling and Diagnostics.
Convention
Every TileAS pass instance carries a status word in its PassObject. The byte at offset +40 is the failure-handshake byte; bit 2 (0x04) is the failure signal. Other bits of the same word may carry pass-specific flags (the upper bits are not reserved), but bit 2 is the cross-pass contract.
    typedef struct PassObject {
        /* ... pass-specific fields at +0 .. +39 ... */
        /*+0x28*/ uint32_t status_word;   /* bit 2 (0x04) = soft failure */
        /* ... pass-specific options and state ... */
    } PassObject;

    static inline void pass_mark_soft_failure(PassObject *self) {
        self->status_word |= 4;
    }

    static inline bool pass_soft_failed(const PassObject *self) {
        return (self->status_word & 4) != 0;
    }
The pass-side use is uniform: when a pass body decides that its work cannot complete, it emits a diagnostic and ORs 4 into self+40, then keeps walking or returns success(). The driver inspects the bit after the walk and lifts it to a top-level pass-manager failure if the pass result is required, or leaves it as a recoverable miss if downstream passes know how to handle it.
Why Not signalPassFailure()
The upstream MLIR PassManager exposes signalPassFailure() for hard pass failures. TileAS deliberately avoids that path in most places, for two reasons.
First, granularity. signalPassFailure() is whole-function: once a pass calls it, the pass-manager treats the whole function as failed and may stop running subsequent passes on it. TileAS often wants to fail one op or one loop without poisoning the rest of the function — for example, "this one loop could not be software-pipelined, leave it synchronous and continue". The handshake bit lets a pass record the partial-failure outcome while still producing valid IR the next pass can consume.
Second, downstream readability. If a TileAS pass communicated failure through signalPassFailure(), the next pass would have no way to discover the reason: the failure is opaque, and the next pass would have to redo whatever analysis the failed pass performed to decide what to skip. With the handshake bit, the failed pass leaves a clear and inspectable signal, and the dependent pass simply reads the status word and acts accordingly.
The bit is not a replacement for signalPassFailure(). Fatal contract violations — malformed IR, missing analyses that should always exist, sentinel pointer dereferences — still trap or call report_fatal_error. The handshake is for recoverable cases where one pass produces IR the next pass can either use or sidestep.
Soft handshake vs hard fatal error
The TileAS pipeline carries three failure paths at three different
severities. The soft handshake is the lightest; signalPassFailure() is
the middle weight; report_fatal_error is the heavy one. The three are
visually similar inside a pass body — each is a call paired with a
diagnostic — but their downstream consequences diverge sharply.
| Path | Mechanism | What stays running | User outcome |
|---|---|---|---|
| Soft handshake | `*(self+40) \|= 4` after `op->emitRemark` or `op->emitError`; pass returns `success()` | The pass-manager keeps running; downstream passes peek at the bit and skip dependent work | A remark or error diagnostic appears; a usable artifact is still produced and the driver returns 0 if no other pass escalated |
| signalPassFailure() | MLIR pass-manager-side failure flag after op->emitError | The current pass completes its walk, then the pass-manager returns failure() | An Error-class diagnostic appears; driver exit code 5 |
| llvm::report_fatal_error | LLVM-tier fatal-error handler | Nothing: the process aborts through the fatal-error handler | A bare diagnostic on stderr; process abort, no clean exit code |
The handshake is the only path on which the user can still get a usable
artifact: a function whose pipelining failed under D11 still compiles
correctly, just synchronously, and the driver returns 0 if no other pass
escalated. signalPassFailure() always aborts the compile; the difference
between it and the handshake is whether the next pass gets a chance to run
at all. report_fatal_error skips even the pass-manager's failure
propagation — the LLVM-side handler runs immediately, the process exits
through abort(), and the driver cannot translate the result into an exit
code because main never returns.
The choice between the three is structural, not stylistic. A reimplementer
should pick the soft handshake when downstream passes can plausibly run on
the un-rewritten IR, signalPassFailure() when they cannot but the
pipeline state is still consistent, and report_fatal_error only when the
IR or an internal invariant has been corrupted beyond what subsequent
passes can describe. The TileAS family uses all three; the canonical
async-pipeline path uses the soft handshake for unpipelinable loops and
report_fatal_error for the trap that fires when a sentinel pointer
escapes its expected scope.
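The selection rule in the paragraph above reduces to two questions, which the following sketch encodes. The enum and function names are hypothetical, and the two boolean inputs are a simplification of what a real pass would have to determine from its own analysis.

```c
#include <stdbool.h>

typedef enum {
    FAIL_SOFT_HANDSHAKE,       /* OR 4 into the status word, keep going */
    FAIL_SIGNAL_PASS_FAILURE,  /* stop the pipeline cleanly             */
    FAIL_REPORT_FATAL_ERROR    /* abort the process                     */
} FailurePath;

/* state_consistent: the IR and internal invariants are still intact.
 * downstream_can_run: later passes can work on the un-rewritten IR. */
FailurePath choose_failure_path(bool state_consistent, bool downstream_can_run) {
    if (!state_consistent)
        return FAIL_REPORT_FATAL_ERROR;
    if (downstream_can_run)
        return FAIL_SOFT_HANDSHAKE;
    return FAIL_SIGNAL_PASS_FAILURE;
}
```

Consistency is checked first on purpose: once an invariant is corrupted, whether downstream passes could run is no longer a meaningful question.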
Propagation
Downstream passes that depend on the success of a predecessor read the predecessor's status word through the PassManager's pass-result lookup. The dependent pass either short-circuits (if it has nothing to do when the predecessor failed) or runs a fallback (if it can still produce useful output).
The canonical example is TileASOptimizePipelineRegion (D13), which shrinks produce_one and consume_one regions after TileASUnspecializedPipeline (D11) has expanded the schedule. When D11 leaves a loop synchronous (its Failed to pipeline loop remark), it sets bit 2 of its own status word; D13 reads that bit and skips the shrinker on functions whose loops D11 refused to pipeline. The shrinker has no work to do on a synchronous loop — its regions were never materialised — so skipping is the correct behaviour, and the contract is one-bit-wide.
    void run_optimize_pipeline_region(FuncOp func, PassObject *self, PassObject *d11) {
        if (pass_soft_failed(d11)) {
            /* D11 left this function synchronous; no pipeline regions to shrink. */
            return;
        }
        /* ... walk and shrink ... */
    }
A pass that ignores a predecessor's soft failure is not buggy by itself (the IR is still valid), but it may waste cycles walking regions where it has nothing useful to do. The convention is to read the bit whenever doing so gives a pass a cheap reason to skip work.
The Diagnostic-Emit Pattern
A pass that sets the handshake bit always pairs it with a diagnostic. The two are written in a fixed order: emit the diagnostic, then set the bit.
    LogicalResult run_one_pass(PassObject *self, Operation *op) {
        if (failed(do_work(op))) {
            /* Order matters: publish the diagnostic first, then set the bit. */
            op->emitError() << "verbatim diagnostic explaining the structural reason";
            pass_mark_soft_failure(self);
            return success();   /* soft failure: the IR is still valid */
        }
        return success();
    }
The diagnostic gives the user the structural reason for the failure — what shape the pass expected, what it found, what the user could change to make the pass succeed. The bit gives the pass manager a machine-readable signal that downstream passes can read without parsing the diagnostic stream.
Diagnostics typically come through sub_446CE00 (the standard Tileiras diagnostic emitter) at severity 259 (0x103, "Error"); a recoverable miss like TileASUnspecializedPipeline's Failed to pipeline loop uses severity 3 (Remark) instead. Both severity levels set the same bit — the user-facing message is what changes, not the inter-pass signal.
Where the Handshake Appears
The convention is used across the entire TileAS pipeline. The list below covers the principal callers:
| Pass | Trigger | Verbatim diagnostic |
|---|---|---|
| TileASMaterializeAsync (D08) | conflicting producer-like ops on one pipeline | "there are two produce-one-like operations using different instructions to generate data into the same pipeline. It's a bug of MaterializeAsync Pass." |
| TileASMaterializeConvertLayout (D09) | layout-conversion decomposition failure | "failed to decompose the convert_layout" |
| TileASMaterializeSchedule (D10) | missing ScheduleAnalysis or alias contract violation | "Alias is not expected here." |
| TileASUnspecializedPipeline (D11) | non-pipelinable loop shape | "Failed to pipeline loop" |
| TileASOptimizePipelineRegion (D13) | reads D11's bit; never sets its own | (skips work, no diagnostic) |
| ConvertTileASToLLVM | various lowering failures | varies by op family |
Most TileAS passes both read predecessors' bits and set their own. The convention is recursive: a pass's status word is part of its public contract with every subsequent pass.
Stickiness: the OR-only word
The status word at +40 is monotonic for the lifetime of one pass run. Every
write is an OR — *(self+40) |= 4 — and there is no corresponding clear
inside the pass body. A pass that detects ten unpipelinable loops sets the
bit ten times; the second through tenth writes are no-ops at the
bit-pattern level but cost nothing and keep the call sites uniform. The
driver, not the pass, owns the lifecycle: it zeroes the word before the
pass walk begins and reads it once after the walk completes.
    /* Driver-side wrapper around one pass invocation. */
    LogicalResult driver_run_pass(PassObject *pass, FuncOp func) {
        pass->status_word = 0;   /* clear sticky bits before walk */
        LogicalResult walk_result = pass->run(pass, func);
        if (pass->status_word & 4) {
            /* The pass set the soft-failure bit at least once during the walk.
             * Record it in the per-function pass-result map so downstream
             * passes can inspect it via pass_soft_failed(predecessor). */
            record_soft_failure(func, pass);
        }
        return walk_result;
    }
Stickiness matters because TileAS passes walk the IR with op-level
granularity. A single function may contain a dozen loops; pipelining might
fail on three and succeed on the rest. The pass returns success() at the
function level (the IR is still valid), but the bit records that at least
one loop missed. The downstream reader does not need to know which loop
failed — only that the function is not fully pipelined and therefore that
the region shrinker in D13 skips the function. A multi-bit failure
count would carry no extra information given the binary nature of the
downstream skip decision.
The same pattern appears at wider granularity in the
ConvertTileASToLLVM rewriter:
when a single op fails to lower, the rewriter ORs 4 into its own status
word and continues with the next op rather than abandoning the function.
The driver lifts the bit to a hard failure only if the post-walk
verifier rejects the IR — typical for a partial lowering — but the
diagnostic stream still preserves every per-op error message.
Worked Examples
D11 unpipelinable loop
The most-walked failure path. TileASUnspecializedPipeline (D11) tries
each loop in a function and records a per-loop result.
    LogicalResult d11_run(PassObject *self, FuncOp func) {
        /* status_word was already zeroed by the driver; the pass only ORs into it. */
        func.walk([&](scf::ForOp loop) {
            if (failed(check_pipelinable_shape(loop))) {
                loop->emitRemark()
                    << "Failed to pipeline loop";   /* severity 3, remark */
                self->status_word |= 4;             /* sticky soft fail */
                return WalkResult::skip();          /* leave loop intact */
            }
            rewrite_to_pipelined_form(loop);
            return WalkResult::advance();
        });
        return success();                           /* IR still valid */
    }
D11 returns success() even when every loop in the function fails — the
function compiles, the loops simply stay synchronous. D13 reads the bit
afterwards and skips its region shrinker on this function.
D08 conflicting producer ops
The hard-but-recoverable case. TileASMaterializeAsync (D08) emits
severity-259 (error) diagnostics rather than remarks but still uses the
handshake bit instead of signalPassFailure(), so that the rest of the
compilation can produce a best-effort artifact for inspection.
    LogicalResult d08_check_pipeline(PassObject *self, PipelineOp pipe) {
        Operation *first = nullptr;
        for (Operation *producer : pipe.producers()) {
            if (!first) { first = producer; continue; }
            if (!same_instruction_kind(first, producer)) {
                producer->emitError()
                    << "there are two `produce-one-like` operations using "
                    << "different instructions to generate data into the "
                    << "same pipeline. It's a bug of MaterializeAsync Pass.";
                self->status_word |= 4;
                return failure();   /* skip this pipe */
            }
        }
        return success();
    }
The diagnostic text is verbatim from the binary, including the
self-attributing "It's a bug of MaterializeAsync Pass" — TileAS treats
this as an internal inconsistency the user is unlikely to be able to fix,
but still recoverable enough to keep the pass-manager running.
D13 downstream skip
The reader side. TileASOptimizePipelineRegion (D13) consults D11's bit
before walking — there is nothing to shrink on a function whose loops
stayed synchronous.
    LogicalResult d13_run(PassObject *self, FuncOp func) {
        PassObject *d11 = pass_manager_lookup(self->pm, "TileASUnspecializedPipeline");
        if (d11 && (d11->status_word & 4)) {
            /* D11 left at least one loop synchronous in this function. The
             * shrinker would walk produce_one/consume_one regions that were
             * never materialised; nothing to do. */
            return success();
        }
        func.walk([&](PipelineRegionOp region) {
            shrink_region(region);
        });
        return success();
    }
Note that D13 does not set its own bit in the skip path: the skip is not a failure, it is the absence of work. A downstream pass reading D13's bit gets a clean signal that D13 had nothing to escalate.
Implementation Constraints
A reimplementation must preserve four invariants.
First, the bit must be at the same offset and meaning across every pass. A pass whose PassObject lays out its status word at a different offset cannot participate in the handshake — the downstream-read pattern hard-codes +40.
Second, the diagnostic must precede the bit-set. If the bit is set before the diagnostic, a pass-manager that early-exits on bit-set may never publish the diagnostic to the user, and the failure becomes invisible.
Third, the bit is cumulative within one pass run. Multiple op-level failures inside one pass keep ORing 4 into the same word; the word never gets cleared mid-run. The driver clears the word before the pass starts and inspects it once the pass returns.
Fourth, the bit is per-pass-instance, not per-function. The driver owns the clear-before-run; a pass that re-runs on a second function under the same pass-manager instance gets a fresh zero. A reimplementation that caches PassObjects across runs must clear the word at run entry, not at constructor time.
QUIRKs
QUIRK: the bit lives at +40 even when the PassObject is shorter
Several TileAS passes have PassObjects whose pass-specific tail ends well
before offset 40 — the field is padded out specifically to host the
handshake word at the conventional offset. A reimplementation that
size-optimises the PassObject layout and moves the status word inward
breaks the cross-pass read pattern: D13 and the rest of the family
hard-code *((uint32_t *)((char *)pred + 40)) & 4 and will read garbage
from the displaced field. The offset is part of the binary contract.
QUIRK: pass[5] |= 4 reads as a u32 store at +20, not +40
A handful of disassembled call sites express the bit-set as
pass[5] |= 4 where pass is a uint32_t * — that is, a 32-bit store at
offset 20, not 40. Both forms appear in the binary. They refer to the
same status word: the PassObject base pointer used in the [5] form
is offset 20 bytes into the structure compared with the base used in the
+40 form (the inner pointer skips the pass-manager prelude and lands at
the body). A reimplementer reading the disassembly must check which base
pointer each call site is working from before deciding whether the bit
write targets the handshake word or some unrelated u32 — they look
identical at the instruction level. The handshake word is always the same
physical location regardless of which base the call site indexes.
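The arithmetic behind the two forms can be checked directly. The sketch below models the PassObject as an array of 32-bit words so that both index expressions are well-defined C; the function names are hypothetical and the 20-byte inner-base displacement is taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    STATUS_BYTE_OFFSET = 40,  /* handshake word, from the outer base   */
    INNER_BASE_OFFSET = 20    /* inner pointer skips a 20-byte prelude */
};

/* Outer-base form: *(uint32_t *)((char *)obj + 40). */
uint32_t *status_via_outer(uint32_t *obj) {
    return obj + STATUS_BYTE_OFFSET / sizeof(uint32_t);
}

/* Inner-base form: pass[5], where pass points 20 bytes into the object.
 * 20 bytes of displacement plus 5 * 4 bytes of index is again +40. */
uint32_t *status_via_inner(uint32_t *obj) {
    uint32_t *pass = obj + INNER_BASE_OFFSET / sizeof(uint32_t);
    return &pass[5];
}
```

When auditing a call site, resolving the base pointer back to the start of the object and adding the scaled index is usually enough to decide whether a `|= 4` targets the handshake word.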
QUIRK: severity 259 still sets the same bit as severity 3
The handshake bit does not distinguish between an error-class diagnostic
(severity 259 / 0x103) and a remark (severity 3). D08's "It's a bug of
MaterializeAsync Pass" and D11's "Failed to pipeline loop" both set the
same bit through the same |= 4 write. The downstream reader cannot
recover the severity from the bit alone — it must consult the diagnostic
engine's recorded messages, or simply accept that "the predecessor had a
non-success outcome" is the only information the handshake carries. This
deliberate flattening keeps the inter-pass protocol one-bit-wide; severity
is a user-facing concept, not an inter-pass concept.
Cross-References
Error Handling and Diagnostics
is the canonical end-to-end page tying the handshake together with the
MLIR diagnostic engine and the driver-level exit codes.
TileAS Async and Pipeline Family is the canonical example, with the handshake appearing in five of its passes.
Pass Manager Internals covers the PassObject layout and the driver-side pass-result lookup the handshake rides on.
Diagnostic Helpers documents the diagnostic emitter that all these passes call before setting the bit.
Diagnostic ABI and Helpers is the
body-layout reference for the diagnostics that pair with each bit-set.
Invariants and Verifiers covers the cross-pass invariants the handshake protects.
Common Compiler Patterns and Idioms places the
*(self+40) |= 4 convention in the catalogue of recurring structural moves alongside PIMPL,
vtable banks, and dispatcher tables.
.data XOR-3 Obfuscation
Abstract
The tileiras AsmWriter stores its two largest plaintext-string assets — the physical-register-name pool and the PTX opcode-mnemonic pool — as XOR-encoded byte arrays in a writable load segment. The encoding is a simple walking XOR stream (0, 3, 6, 9, ...) applied after linking and undone in place at runtime by pthread_once-guarded initializers. Once decoded, both pools feed the normal LLVM AsmWriterEmitter lookup paths.
The cipher has no cryptographic value. Its only purpose is to keep the strings out of trivial strings output. For reimplementation, the clean design is to store the pools as ordinary read-only string tables and delete the runtime decoder.
XOR-3 obfuscation scheme
Only two pools are encoded: the opcode-mnemonic string pool and the physical-register-name pool. They live in writable memory, so the decoder mutates the bytes directly without needing mprotect. No third encoded pool is referenced by the AsmWriter path.
Mnemonic and register-name pools
The opcode-mnemonic pool decodes to a packed NUL-delimited C-string table with roughly three thousand chunks. The first chunks are AsmWriter separators such as "},\n\t\t", "},\n\t", and ";\n\t"; later chunks carry PTX mnemonic fragments such as "match.all.sync.b32 \t" and "suld.b.1d_buffer.v2.b8".
The shorter register-name pool carries physical-register names. Decoded prefixes include %Depot, %SP, %SPL, %envreg0..31, and the PTX register families %p, %rs, %r, %rd, %f, %fd, and %rq. Only physical-register names use the pool; virtual PTX register classes are formatted directly from prefix plus register number.
Decoded once at initialization
Both pools are decoded exactly once per process by pthread_once. After the mnemonic pool is decoded, getMnemonic performs a second one-shot to cache the pool base pointer behind the Itanium ABI static-local guard protocol.
static pthread_once_t once_reg_name = PTHREAD_ONCE_INIT;
static uint8_t guard_once = 0;
static const char *base_ptr_cache = NULL;
static pthread_once_t once_mnemonic = PTHREAD_ONCE_INIT;
No teardown handler is registered. Once decoded, the pools live in writable memory for the lifetime of the process.
Byte-level transform decoding
Both init helpers implement the same byte-granular walking-XOR cipher:
    void xor3_decode(uint8_t *begin, uint8_t *end) {
        uint8_t k = 0;
        while (begin != end) {
            *begin++ ^= k;
            k += 3;   // wraps mod 256
        }
    }
The key schedule is k[i] = (3 * i) mod 256. Because gcd(3, 256) = 1, the schedule visits every byte value once per 256-byte window before repeating. XOR is self-inverse, so running the same pass twice re-encodes the pool.
The transform is in-place, byte-granular, single-pass — no block chaining, no IV, no key derivation, no integrity tag. The encoder is the same function as the decoder.
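A round trip makes the self-inverse property concrete. The sketch below restates the transform with an explicit length (the name `xor3_apply` is hypothetical) so it can be applied twice to the same buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Applies the walking-XOR stream k[i] = (3 * i) mod 256 in place.
 * Calling it once encodes; calling it again on the result decodes. */
void xor3_apply(uint8_t *buf, size_t len) {
    uint8_t k = 0;
    for (size_t i = 0; i < len; ++i) {
        buf[i] ^= k;
        k = (uint8_t)(k + 3);   /* explicit wrap to 8 bits */
    }
}
```

Because k[0] is 0, the first byte of each pool is stored unmodified, and every 256th byte after it is too. That is a handy anchor when locating an encoded pool in a hex dump.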
AsmWriter consumer
NVPTXInstPrinter::getMnemonic(const MCInst*) is the canonical LLVM AsmWriterEmitter lookup with NVIDIA's decode/cache steps welded onto the prologue:
    const char *get_mnemonic(const MCInst *mi) {
        pthread_once(&once_mnemonic, decode_mnemonic_pool);
        if (!guard_once && __cxa_guard_acquire(&guard_once)) {
            base_ptr_cache = mnemonic_pool;
            __cxa_guard_release(&guard_once);
        }
        uint32_t op = mi->opcode;
        uint32_t lo = mnemonic_offsets[op];
        uint32_t hi = mnemonic_flags[op];
        if (lo | ((uint64_t)hi << 32))
            return base_ptr_cache + (lo & MNEMONIC_OFFSET_MASK) - 1;
        return NULL;
    }
The per-opcode offset table contains one packed uint32_t per MC opcode. Bits 0..16 are the byte offset into the mnemonic pool; bits 17..31 carry AsmWriter tail state. A parallel companion table carries operand flags and modifier-class words. The (lo | hi << 32) == 0 test distinguishes a real mnemonic from an LLVM generic pseudo without a mnemonic. The -1 bias is upstream AsmWriterEmitter convention: offset 0 is the no-mnemonic sentinel.
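The bit split and the -1 bias can be sketched as follows. The mask value is derived from the "bits 0..16" statement above rather than read from the binary, the helper names are hypothetical, and the extra check for a zero offset field is a defensive addition that the original lookup does not have.

```c
#include <stddef.h>
#include <stdint.h>

#define MNEMONIC_OFFSET_BITS 17u
#define MNEMONIC_OFFSET_MASK ((1u << MNEMONIC_OFFSET_BITS) - 1u)  /* bits 0..16 */

/* Biased pool offset: 0 means "no mnemonic", n means pool offset n - 1. */
uint32_t mnemonic_biased_offset(uint32_t lo) {
    return lo & MNEMONIC_OFFSET_MASK;
}

/* AsmWriter tail state carried in bits 17..31. */
uint32_t mnemonic_tail_state(uint32_t lo) {
    return lo >> MNEMONIC_OFFSET_BITS;
}

const char *mnemonic_lookup(const char *pool, uint32_t lo, uint32_t hi) {
    if ((lo | ((uint64_t)hi << 32)) == 0)
        return NULL;                      /* generic pseudo, no mnemonic */
    uint32_t biased = mnemonic_biased_offset(lo);
    if (biased == 0)
        return NULL;                      /* defensive: avoid pool - 1 */
    return pool + (biased - 1);
}
```

For a pool laid out as `"add\0sub"`, a packed word with offset field 5 resolves to `"sub"` at byte 4, and offset field 1 resolves to `"add"` at byte 0.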
The parallel printRegName path decodes the register-name pool once and uses a 16-bit offset table for physical registers. Other register classes format directly from prefix plus register number without consulting the pool.
Reimplementation Notes
A faithful but cleaner implementation can make both string pools read-only:
    const char *get_mnemonic_clean(const MCInst *mi) {
        uint32_t op = mi->opcode;
        uint32_t lo = mnemonic_offsets[op];
        uint32_t hi = mnemonic_flags[op];
        if ((lo | ((uint64_t)hi << 32)) == 0)
            return NULL;
        return mnemonic_pool + (lo & MNEMONIC_OFFSET_MASK) - 1;
    }
That removes the writable string tables, the two pthread_once decoders, and the base-pointer guard while preserving the AsmWriterEmitter lookup contract.
Binary Vtable Banks + Static Ctors
Abstract
The tileiras ELF uses conventional C++ static registration heavily. MLIR pass models, rewrite patterns, dialect descriptors, op interfaces, NVPTX register classes, and target-lowering hooks all publish through vtables or static constructors before normal compilation starts. This page documents the runtime shape of that registration layer without treating raw table addresses as part of the public API.
PassConcept vtable shape
Every LLVM new-PM PassModel<PassT> instantiation uses the same PassConcept<PassT> shape: Itanium ABI prefix, destructor pair, run wrapper, name printer, isRequired hook, and tail destructor. The uniform shape is what lets the pass manager store arbitrary pass models behind one concept pointer while still dispatching to the typed pass implementation.
| Slot | Role |
|---|---|
| 0 | offset-to-top word, zero here |
| 1 | typeinfo pointer, null in this no-RTTI build |
| 2 | deleting destructor |
| 3 | non-virtual destructor body |
| 4 | run(IRUnitT&, AM&) wrapper |
| 5 | name() printer trampoline |
| 6 | isRequired() |
| 7 | tail ~PassModel() |
The run wrapper adjusts from the erased PassConcept base to the typed PassT object, calls the real pass body, and returns the model pointer. For a reimplementation only the ownership-and-dispatch contract matters: a pass instance must retain enough typed state for its run, name, and isRequired hooks to agree.
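The ownership-and-dispatch contract can be sketched as a plain concept/model pair. This is a minimal illustration, not the layout of the binary: `Module`, `CountOpsPass`, and `run_all` are hypothetical names, and the hooks are simplified to take a single IR unit.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Module { int op_count = 0; };   // hypothetical IR unit

// Erased interface: one shape for every pass model.
struct PassConcept {
    virtual ~PassConcept() = default;
    virtual void run(Module &m) = 0;
    virtual std::string name() const = 0;
    virtual bool isRequired() const = 0;
};

// Typed model: owns a PassT and forwards each hook to it, so run, name,
// and isRequired always describe the same pass object.
template <typename PassT>
struct PassModel final : PassConcept {
    explicit PassModel(PassT p) : pass(std::move(p)) {}
    void run(Module &m) override { pass.run(m); }
    std::string name() const override { return PassT::name(); }
    bool isRequired() const override { return PassT::isRequired(); }
    PassT pass;
};

struct CountOpsPass {                  // hypothetical pass
    int delta;
    void run(Module &m) { m.op_count += delta; }
    static std::string name() { return "count-ops"; }
    static bool isRequired() { return true; }
};

// Run every stored model through the erased pointer.
inline void run_all(std::vector<std::unique_ptr<PassConcept>> &passes, Module &m) {
    for (auto &p : passes) p->run(m);
}
```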
RewritePattern tables
Rewrite patterns split into two shapes: conversion patterns and plain rewrite patterns. Conversion patterns share a generic rewrite driver that delegates to the typed matchAndRewrite; plain patterns have a smaller table and often use the default no-op match hook. Pattern identity is carried by the op name, benefit, context pointer, and typed rewrite hook — not by a binary table address.
Dialect vtables
Every MLIR Dialect subclass uses the same dialect ABI: destructor pair, canonicalization-pattern hook, constant materializer, attribute parser/printer, type parser/printer, and region/op attribute verifiers.
| Dialect | Distinctive behavior |
|---|---|
| nv_tileaa | Installs the inliner interface used by the alias-analysis layer. |
| cutlass | Disables textual attribute/type parsing for unsupported forms. |
| cute_nvgpu | Provides the non-trivial textual type printer for GPU atom types. |
| cute | Relies heavily on generic ODS assembly behavior. |
| cutlass.seq_bar family | Dense op-model family for sequence-barrier operations. |
Dialect construction is registration-heavy: the constructor installs namespace, TypeID, attribute/type parsers, op interfaces, and dialect interfaces as one coherent unit. A reimplementation should reproduce the observable dialect behavior and parser/printer hooks, not the original table layout.
NVPTX register-class descriptors
The NVPTX backend ships declared-pool TargetRegisterClass descriptors for PTX register pools: %p, %rs, special registers, %r, %f, %rd, and %rq. These descriptors are pure data: MC register class pointer, subclass masks, allocatable bit, superclass list, and related metadata. The asm printer reads them to choose declarations such as .reg .b<width> %<prefix><N>;.
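As a rough sketch of how a printer could turn such a descriptor into a declaration, the following formats one pool line. `RegClassDesc` and `declare_pool` are hypothetical, and the type spelling passed in (for example `.b32`) is an assumption, not a value read from the binary.

```cpp
#include <cassert>
#include <string>

// Hypothetical descriptor: only the fields the declaration printer needs.
struct RegClassDesc {
    const char *prefix;   // register prefix, e.g. "%r"
    const char *type;     // PTX type spelling, e.g. ".b32" (assumed)
    bool allocatable;
};

// Format one PTX pool declaration such as ".reg .b32 %r<12>;".
inline std::string declare_pool(const RegClassDesc &rc, unsigned count) {
    return std::string(".reg ") + rc.type + " " + rc.prefix + "<" +
           std::to_string(count) + ">;";
}
```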
NVPTXTargetLowering
tileiras carries the normal LLVM SelectionDAG target-lowering surface for NVPTX. Key hooks: LowerFormalArguments, LowerCall, LowerOperation, and the load/vector helper path. Slot-by-slot behavior is covered in NVPTX Target Lowering, Call and Args; this page only records that the target-lowering surface is a conventional LLVM virtual interface.
Static Constructors
The binary has hundreds of static constructors. Each body falls into one of three useful categories:
- cl::opt<> registrars that publish command-line options, help text, defaults, and flags.
- TypeID::get<T>() static-local initializers guarded by the Itanium ABI guard protocol.
- Dispatch-table initializers that install vtables, op interfaces, or dialect interface records.
For reimplementation, constructor order matters only where later code expects a registry to be populated before first use. The durable behavior is the registry side effect, not the original constructor body address.
Per-Dialect Ctor Chain
A .ctors table at 0x591CE78..0x591E2F0 lists all 653 ctor bodies: a void(*)()[] array of function pointers walked from _start before main runs. Most of these ctor bodies in turn register a matching destructor through __cxa_atexit. The order is fixed at link time and follows the order of declarations across the source units, so the dependency-ordered listing below matches the actual link order.
Six of those ctors initialise the six in-binary dialects, and the order over those six is not incidental. Each later dialect ctor reads types, attributes, or interface records that a previous ctor registered, so swapping the order would observe partially-populated registries during construction. The chain is strict, single-threaded, and runs to completion before any user code touches the dialect registry.
| Order | Dialect | Ctor | Notes |
|---|---|---|---|
| 1 | nv_tileaa | sub_1545E80 | Lowest dialect; registers Type / Attribute / OperationName slabs first |
| 2 | nv_tileas | sub_147EC90 | Depends on nv_tileaa for token types |
| 3 | (intermediate) | sub_153EC20 | Registers shared interfaces (FunctionOpInterface, SymbolTable, LoopLikeOpInterface) |
| 4 | CutlassDialect | sub_17640C0 | Depends on the shared interfaces |
| 5 | CuteNvgpuDialect | sub_17D1190 | Depends on CutlassDialect for pipeline ops |
| 6 | CuteDialect | sub_1928370 | Highest dialect; depends on all the above |
The cuda_tile dialect registers separately — it is the public input dialect and has its own ctor chain via the dialect-target registry, registered through RegisteredDialect at sub_6B3ED0. It does not appear in the six-step chain above because the chain only covers in-binary dialects whose ctors emit registration calls into the global MLIRContext table; cuda_tile is published into the target registry instead.
__cxa_atexit and the XOR-3 Pool Exception
Most of the 653 ctors register a corresponding __cxa_atexit dtor for ordered teardown. The XOR-3-encrypted .data pools (the mnemonic and register-name pools) are the exception and register no dtor; see Data Section Decryption for the cipher and decoders, and AsmPrinter Monster and Windows for the AsmWriter consumer. The pools are zeroed at static-init, decoded at first use via pthread_once, and never re-encoded. The omission is deliberate: the decoded pools are only read after first use, so re-encrypting them on shutdown would be pointless, and a no-op dtor would only add dead entries to the exit chain.
The init-order over the six dialects also lines up with which dialects use pthread_once guards versus eager static-init. Only the shared-interfaces step at order 3 is gated by a one-shot guard — the interfaces it publishes are queried lazily on first use, so the ctor stages a pthread_once_t slot rather than running registration immediately. The other five dialects run their entire registration at static-init and need no one-shot guard.
| Ctor | sub_ADDR | Dialect | pthread_once slot |
|---|---|---|---|
| ctor_001 | sub_1545E80 | nv_tileaa | dword_5B6A640 (not used; nv_tileaa has eager init) |
| ctor_002 | sub_147EC90 | nv_tileas | (eager) |
| ctor_003 | sub_153EC20 | shared interfaces | dword_5B37670 (FunctionOpInterface guard) |
| ctor_004 | sub_17640C0 | CutlassDialect | (eager) |
| ctor_005 | sub_17D1190 | CuteNvgpuDialect | (eager) |
| ctor_006 | sub_1928370 | CuteDialect | (eager) |
Referenced Ctor Bodies at 0x46xxxxx
Of the 653 total ctor pointers, only 49 are referenced from elsewhere in the binary — the rest are template instantiations of upstream LLVM/MLIR static-init that fire once and never get named again. The 49 referenced ctors split into a small handful of roles: the six dialect ctors above, twelve cl::opt registrations, eight pass registrations, eleven TypeID-singleton initialisers, four raw_ostream sinks, three fingerprint-singleton initialisers, and five miscellaneous singletons. A reimplementation only needs to reproduce those 49 effects in a clean way; the unreferenced 604 are link-order noise from upstream with no observable post-init behavior in tileiras.
Reimplementation Notes
startup():
    register_command_line_options()
    register_type_ids()
    register_dialects()
    register_passes()
    register_rewrite_patterns()
    register_nvptx_target_lowering()
The registration graph should be explicit in a clean implementation. Avoid making address-contiguous table placement part of the design; it is an artifact of the original linker layout, not a semantic requirement.
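One way to make the registration graph explicit is a registry that refuses to register a dialect before its dependencies. The sketch below mirrors the six-step chain from the Per-Dialect Ctor Chain table; `DialectRegistry` and the short dialect names are hypothetical, and the dependency edges are read off that table's Notes column.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

class DialectRegistry {
public:
    // Register `name`, insisting that every dependency is already present.
    void add(const std::string &name, const std::vector<std::string> &deps) {
        for (const auto &d : deps)
            if (!registered_.count(d))
                throw std::logic_error(name + " registered before " + d);
        registered_.insert(name);
        order_.push_back(name);
    }
    const std::vector<std::string> &order() const { return order_; }

private:
    std::set<std::string> registered_;
    std::vector<std::string> order_;
};

// Explicit version of the six-step chain: the order is stated in code
// instead of being implied by link order.
inline void register_dialects(DialectRegistry &r) {
    r.add("nv_tileaa", {});
    r.add("nv_tileas", {"nv_tileaa"});
    r.add("shared_interfaces", {});
    r.add("cutlass", {"shared_interfaces"});
    r.add("cute_nvgpu", {"cutlass"});
    r.add("cute", {"nv_tileaa", "nv_tileas", "shared_interfaces",
                   "cutlass", "cute_nvgpu"});
}
```

A mis-ordered call fails loudly at startup instead of observing a partially populated registry later.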
Threading and Synchronization
Abstract
The tileiras binary links against libpthread and uses the standard POSIX threading primitives — pthread_once_t, pthread_mutex_t, pthread_rwlock_t, plus pthread_key_create for thread-local storage — together with compare-exchange and atomic add/sub operations on reference-count fields. Concurrency surfaces in four distinct layers: process-wide one-shot initialization of decoded data tables and cached TypeID singletons; thread-local caches that front every type and attribute uniquer probe; per-MLIRContext rwlock-protected hash tables that the TLS caches publish into; and the lock-free CAS fast path that publishes a freshly allocated StorageAllocator into the per-TypeID bucket array. Single-threaded builds collapse the same paths to plain loads and stores through LLVM's weak-threading gates.
This page is the canonical concurrency story for the whole MLIRContext storage stack. The allocator and refcount machinery it interacts with live elsewhere — see Allocator BumpPtr and Slab Sizes for the arena and StorageAllocator shape that the per-TypeID pthread_rwlock_t below protects, and Data Section Decryption for the one pthread_once use that decodes a binary-time-encrypted pool rather than building runtime state.
Primitive Families
Five POSIX primitives carry the concurrency contract; the rest of the binary's threading reduces to atomics on integer fields. Every primitive is paired with a weak-symbol fallback: when the linked libc lacks one of the _pthread_* family, the wrapper degenerates to a no-op and the surrounding code runs single-threaded.
| Primitive | Role inside tileiras |
|---|---|
| pthread_mutex_t | Affine-cluster Level-1 insertion mutex; data-section decode mutex; diagnostic-printer mutex on MLIRContext. |
| pthread_rwlock_t | Per-TypeID StorageAllocator lock — read mode protects the probe, write mode protects insert and resize. |
| pthread_once_t | One-shot guards for data-table decode, cached TypeID construction, and dialect / pass registration. |
| pthread_key_create / pthread_getspecific | TLS keys for the per-thread uniquer cache and the diagnostic-context pointer. |
| Atomic CAS, atomic add/sub | Refcount increments/decrements, StorageAllocator publish-or-free, lock-free probe of the Level-1 bucket array. |
The atomic operations compile down to lock cmpxchg and lock xadd on x86-64 in threaded builds. The weak-threading gate replaces them with plain mov plus inc / dec when the linker resolves _pthread_key_create to a null weak symbol — the binary checks that pointer once at startup, caches the result, and dispatches every later concurrency primitive against the cached flag.
Weak-Symbol Single-Threaded Collapse
The same translation unit that compiles to a thread-safe MLIRContext also serves single-threaded builds. The collapse mechanism is a startup check that reads the weak symbol _pthread_key_create: if it resolves to nullptr, the binary is being run against a libc that does not link the threading library, and every later concurrency primitive in the storage stack short-circuits.
static bool tileiras_threading_enabled(void) {
    // _pthread_key_create is a weak symbol. Its resolved address is the
    // canonical "do we have a real pthreads underneath us?" probe.
    return &_pthread_key_create != nullptr;
}
Once that probe returns false, the consequences cascade: the per-TypeID pthread_rwlock_t reads and writes become uncontended local loads and stores; the atomic CAS in the StorageAllocator publish path becomes a plain pointer write because no other thread can race; the refcount add/sub becomes ordinary integer arithmetic; the TLS cache becomes a single per-process cache. The MLIRContext runs end-to-end with no synchronisation overhead and the dialect-conversion driver finishes a kernel-build in roughly the same wall-clock time it would take on a single-threaded LLVM.
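The collapse can be sketched with a cached probe and a counter that takes either the atomic or the plain path. This is an illustration under assumptions: it probes the public `pthread_key_create` name through the GCC/Clang `#pragma weak` extension rather than the internal `_pthread_key_create`, and `GatedCounter` is a hypothetical stand-in for a refcount field.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>

// Mark the symbol weak so its address may be null when no threading
// library is linked (GCC/Clang extension).
#pragma weak pthread_key_create

// Probe once and cache the answer; later callers read the cached flag only.
inline bool threading_enabled() {
    static const bool enabled = (&pthread_key_create != nullptr);
    return enabled;
}

// Hypothetical counter that collapses to plain arithmetic when the gate is off.
struct GatedCounter {
    std::atomic<int> value{0};

    int add(int n) {
        if (threading_enabled())
            return value.fetch_add(n, std::memory_order_acq_rel) + n;
        // Single-threaded path: no other thread can race, so a plain
        // load/store pair is enough.
        int next = value.load(std::memory_order_relaxed) + n;
        value.store(next, std::memory_order_relaxed);
        return next;
    }
};
```

Both paths produce the same counts; only the instructions emitted for the update differ.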
The opposite direction is also true: threaded builds never branch on the flag in the hot path. The wrappers compile their atomic and lock instructions unconditionally; the cost of a single uncontended lock cmpxchg is too small to justify the branch and the speculation cost. The flag is only read once at startup and at the few places where a destructor needs to know whether to call pthread_*_destroy or skip it.
pthread_once one-shot gates
pthread_once serves as a process-wide "run exactly once, all other callers wait" guard in three structural roles. First, data-table decoding: PTX mnemonic and register-name pools decode lazily the first time the NVPTX printer asks for them. Second, cached TypeID construction: per-type StorageUniquer shims build their TypeID once, then future callers skip construction and go straight to lookup. Third, dialect and pass registration: dialect initialization and pass registration are once-gated so concurrent module creation sees a fully populated registry.
The Itanium ABI guard pair __cxa_guard_acquire / __cxa_guard_release handles smaller static-local byte guards. Its practical contract matches pthread_once: initialize once, publish only after construction completes, make later calls read-only.
The two guard families differ in observable behaviour under contention. pthread_once blocks every waiter on a futex until the running call completes; the waiters are then released together and each sees the published state. __cxa_guard_acquire busy-spins on the guard byte's low bit before falling back to a futex — the spin is short enough that uncontended cases (a single thread initialising a function-local static) never enter the kernel, but contended cases still degenerate to the same futex wait. The choice between them is driven by the size of the initialisation work: pool decoding and dialect registration go through pthread_once because the work is large enough that no waiter benefits from spinning; function-local statics go through the Itanium ABI guards because the work is short and the kernel call would dominate.
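The two guard families can be placed side by side in a short sketch. The decoder, its single-byte XOR key, and `build_type_id` are placeholders for illustration and are not the real cipher or TypeID machinery.

```cpp
#include <cassert>
#include <pthread.h>
#include <string>

static pthread_once_t g_pool_once = PTHREAD_ONCE_INIT;
static std::string g_pool;        // decoded pool, written exactly once
static int g_decode_runs = 0;     // how many times the decoder body ran

// Hypothetical decoder: the XOR key is a placeholder only.
static void decode_pool() {
    const unsigned char key = 3;
    const unsigned char encoded[] = {'a' ^ 3, 'd' ^ 3, 'd' ^ 3};
    for (unsigned char c : encoded)
        g_pool.push_back(static_cast<char>(c ^ key));
    ++g_decode_runs;
}

// Large one-shot work: every caller funnels through pthread_once.
const std::string &mnemonic_pool() {
    pthread_once(&g_pool_once, decode_pool);
    return g_pool;
}

static int g_id_builds = 0;
static int build_type_id() { ++g_id_builds; return 42; }

// Small one-shot work: a function-local static, which the compiler guards
// with __cxa_guard_acquire / __cxa_guard_release on Itanium-ABI targets.
int cached_type_id() {
    static const int id = build_type_id();
    return id;
}
```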
Atomic Memory Ordering
The CAS in the StorageAllocator publish path uses sequentially-consistent memory ordering. A weaker ordering (release on the store, acquire on the load) would suffice for the storage uniquer's correctness model, but the binary picks seq_cst because the stronger ordering costs almost nothing here: on x86-64 a lock cmpxchg is already a full barrier, and each bucket is published only once. Getting a weaker ordering wrong, by contrast, would produce a use-before-construct race that only manifests under specific weak-memory-model behaviours. The atomic add/sub on the refcount fields use relaxed ordering for the increment (which only matters for the count itself, not for any data it protects) and acq_rel ordering for the decrement, so that the decrement that reaches zero synchronises with every earlier release of the same object before the destructor reads the storage payload.
The lock-free probe of the Level-1 bucket array uses acquire ordering on the load: the probing thread has to see every store the publishing thread issued before its publishing CAS. The publishing thread's CAS is seq_cst (acq_rel plus participation in the single total order of seq_cst operations), so the load's acquire is sufficient: the bucket pointer's visibility implies the visibility of every store the publishing thread issued to the allocator's fields before the CAS.
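The acquire-load and seq_cst-CAS pairing can be sketched with std::atomic. `StorageAllocator` here is a hypothetical one-field stand-in, and `g_discarded` exists only so the losing path is observable.

```cpp
#include <atomic>
#include <cassert>

struct StorageAllocator {            // hypothetical stand-in for the real record
    int type_id;
    explicit StorageAllocator(int id) : type_id(id) {}
};

static std::atomic<int> g_discarded{0};   // candidates that lost the race

// Try to publish a fully constructed candidate; free it if another
// thread published first, and return whichever allocator won.
StorageAllocator *publish_or_free(std::atomic<StorageAllocator *> &slot,
                                  StorageAllocator *candidate) {
    StorageAllocator *expected = nullptr;
    if (slot.compare_exchange_strong(expected, candidate,
                                     std::memory_order_seq_cst,
                                     std::memory_order_acquire))
        return candidate;            // we won: candidate is visible to all
    delete candidate;                // we lost: drop ours
    g_discarded.fetch_add(1, std::memory_order_relaxed);
    return expected;                 // compare_exchange stored the winner here
}

// Probe the bucket; construct and publish only when it is still empty.
StorageAllocator *get_allocator(std::atomic<StorageAllocator *> &slot, int type_id) {
    // Acquire pairs with the publishing CAS: seeing the pointer implies
    // seeing the constructed object behind it.
    if (StorageAllocator *cur = slot.load(std::memory_order_acquire))
        return cur;
    return publish_or_free(slot, new StorageAllocator(type_id));
}
```

Constructing the candidate before the CAS is what guarantees no thread ever observes a half-initialised allocator.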
MLIRContextImpl Concurrency Layout
MLIRContextImpl owns the synchronization objects for type, attribute, and affine-expression interning. The interning machinery is a two-level hash table: Level 1 is a bucket array keyed by a TypeID, where each bucket holds a pointer to a StorageAllocator; Level 2 is the per-TypeID hash table that allocator owns, keyed by the parameterised storage key of each interned type or attribute. Type and attribute uniquers use 16-byte bucket slots plus size words; the affine map/expression path has its own mutex and state pointer because affine-cluster interning was bolted on after the original storage uniquer was generic.
| Field | Primitive | Protects |
|---|---|---|
| affine_uniquer_mutex | pthread_mutex_t | AffineMap, AffineExpr, and IntegerSet Level-1 insert path. |
| affine_uniquer_state | state pointer | Pointer to the affine-cluster StorageUniquerImpl. |
| type_uniquer_buckets / size | bucket pointer plus size | Per-context Type Level-1 bucket array. |
| attr_uniquer_buckets / size | bucket pointer plus size | Per-context Attribute Level-1 bucket array. |
| diagnostic_printer_mutex | pthread_mutex_t | Serialises diagnostic output across worker threads. |
Each per-TypeID StorageAllocator published into the Level-1 bucket array owns a pthread_rwlock_t for its Level-2 table. The lock has three operational modes: read mode for probes (any number of probes proceed concurrently), write mode for inserts and table resizes, and no-lock mode for the per-thread TLS cache that fronts the probe. The read-then-upgrade pattern is the core StorageUniquer concurrency contract: a probe takes the read lock and looks up the key, and if the key is absent the probe releases the read lock, takes the write lock, re-probes (because another writer may have inserted the same key between the two locks), and only then allocates.
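The read-then-upgrade pattern can be sketched directly on pthread_rwlock_t. `Level2Table` is a hypothetical table that interns strings; the real Level-2 table keys on parameterised storage keys, but the locking sequence is the same.

```cpp
#include <cassert>
#include <memory>
#include <pthread.h>
#include <string>
#include <unordered_map>

// Hypothetical Level-2 table: equal keys always map to one shared object.
class Level2Table {
public:
    Level2Table() { pthread_rwlock_init(&lock_, nullptr); }
    ~Level2Table() { pthread_rwlock_destroy(&lock_); }
    Level2Table(const Level2Table &) = delete;
    Level2Table &operator=(const Level2Table &) = delete;

    const std::string *get_or_create(const std::string &key) {
        pthread_rwlock_rdlock(&lock_);          // many probes may hold this
        const std::string *hit = find(key);
        pthread_rwlock_unlock(&lock_);          // another writer may slip in here
        if (hit) return hit;

        pthread_rwlock_wrlock(&lock_);
        hit = find(key);                        // mandatory re-probe
        if (!hit) {
            auto slot = table_.emplace(key, std::make_unique<std::string>(key));
            hit = slot.first->second.get();
            ++allocations_;
        }
        pthread_rwlock_unlock(&lock_);
        return hit;
    }

    // Read without the lock; intended for single-threaded inspection only.
    int allocations() const { return allocations_; }

private:
    const std::string *find(const std::string &key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : it->second.get();
    }
    pthread_rwlock_t lock_;
    std::unordered_map<std::string, std::unique_ptr<std::string>> table_;
    int allocations_ = 0;
};
```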
TLS Uniquer Cache
In front of every Level-2 probe sits a per-thread cache: a small hash table keyed by the parameterised storage key and valued by the BaseStorage* returned by the last successful probe. The cache is thread-local, accessed through a pthread_key_create-registered TLS key, and the read path needs no synchronisation because no other thread can see it. On x86-64 the access compiles to a single %fs:-segmented load — the fs segment register holds the TLS base for the current thread, and the compiler emits mov %fs:offset, %reg for the cache pointer probe.
The cache lives in two layers. The outer layer is a pthread_getspecific(tls_key) call that returns the per-thread cache header pointer; the inner layer is the cache header itself, which holds the hash-table buckets and the LRU eviction queue. The outer layer goes through libpthread's TLS mechanism the first time a thread touches the uniquer; subsequent accesses on the same thread short-circuit to the cached header pointer the compiler hoists into a register on entry to the hot probe function. The total cost of a cache hit on the steady-state path is one mov from the segment register, one compare, and one branch — comparable to a successful inline hash lookup against any unsynchronised in-process map.
The cache is filled from the global probe: a hit on the global table writes the (key, BaseStorage*) pair into the local cache, evicting the least-recently-used entry if the cache is full. A subsequent probe with the same key short-circuits to a single cache-hit lookup with no lock acquisition at all — which is the common case for the dialect-internal types and attributes a single compilation pass touches repeatedly.
The cache has one structural invariant: it caches only positive probes. A miss does not insert a tombstone, because the absence of an entry from the global table is racy — between the miss and a later probe, another thread can publish a matching entry, and a negative cache entry would shadow the global publication. The cache's eviction policy is approximate-LRU rather than exact-LRU: a hit promotes the entry to the front of the eviction list with one swap, and a full cache evicts the tail entry on each insert. Approximate-LRU is sufficient because the per-thread access pattern is heavily skewed — a single pass touches a small working set of types, and the working set fits in the cache's fixed-size table without eviction in practice.
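A positive-only per-thread cache can be sketched with a thread_local list. `TlsCache` and its capacity of two are hypothetical, and for brevity the sketch uses exact move-to-front ordering where the text describes an approximate-LRU swap.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>

// Hypothetical per-thread cache: a short list ordered most-recent-first.
class TlsCache {
public:
    explicit TlsCache(std::size_t capacity) : capacity_(capacity) {}

    // Return the cached storage pointer, or nullptr on a miss.
    const void *lookup(const std::string &key) {
        for (auto it = entries_.begin(); it != entries_.end(); ++it) {
            if (it->first == key) {
                entries_.splice(entries_.begin(), entries_, it);  // promote on hit
                return it->second;
            }
        }
        return nullptr;   // misses leave no trace in the cache
    }

    // Record a successful global probe. Callers insert only after a lookup
    // miss, so the key is assumed not to be present already.
    void insert(const std::string &key, const void *storage) {
        if (!storage || capacity_ == 0) return;                  // positive entries only
        if (entries_.size() == capacity_) entries_.pop_back();   // evict the tail
        entries_.emplace_front(key, storage);
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::size_t capacity_;
    std::list<std::pair<std::string, const void *>> entries_;
};

// One cache per thread; no lock is needed because no other thread sees it.
inline TlsCache &thread_cache() {
    thread_local TlsCache cache(2);
    return cache;
}
```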
Affine Cluster Mutex
The affine cluster (AffineMap, AffineExpr, IntegerSet) uses a single pthread_mutex_t rather than the per-TypeID rwlock the type and attribute clusters use. The split has a structural reason: affine expressions cross-reference each other through structural sharing (an AffineExpr may be the sub-expression of many AffineMap instances), so the per-TypeID locking model would force a probe of an AffineMap to acquire the lock for every AffineExpr it referenced. The single mutex collapses that fan-out into one acquisition per probe.
The cost is that affine-cluster interning is the only place in the storage stack where one probe blocks another probe from a different thread, because a plain mutex has no shared read mode. In practice the cost is negligible because affine interning is dominated by the dialect-conversion driver's affine-expression rewrites, which run on the function thread and rarely overlap.
StorageUniquer Two-Level Concurrency
The full StorageUniquer::getOrCreate(type_id, key) flow threads the four layers (TLS cache, atomic CAS on the Level-1 bucket, rwlock-guarded Level-2 probe, rwlock-guarded Level-2 insert) in that order. The TLS cache hit is the common-case fast path; the CAS path keeps the first touch of a new TypeID lock-free; the rwlock read path is what scales probes for hot types across worker threads; the rwlock write path is the single serialisation point where new storage is allocated.
The CAS-publish-or-free idiom protects the Level-1 bucket array. When a probe finds a null entry for an unseen TypeID, the probing thread allocates a fresh StorageAllocator, then issues a compare-exchange against the bucket. The winner of the CAS has its allocator visible to every other thread; the loser frees its allocation and continues with the winner's allocator. No two threads ever share a half-initialised StorageAllocator — the allocator is fully constructed before the CAS that publishes it.
get_or_create(type_id, key):
    # Layer 1: TLS cache (no lock, no atomic)
    cached = tls_cache.lookup(type_id, key)
    if cached != nullptr:
        return retain(cached)
    # Layer 2: atomic CAS on Level-1 bucket (lock-free publish)
    allocator = atomic_load(level1[type_id])
    if allocator == nullptr:
        candidate = allocate_storage_allocator(type_id)
        prev = compare_exchange(level1[type_id], nullptr, candidate)
        if prev == nullptr:
            allocator = candidate              # we published
        else:
            free(candidate)                    # we lost; adopt the winner
            allocator = prev
    # Layer 3: rwlock-guarded read probe
    with allocator.rwlock.read:
        existing = allocator.lookup(key)
    if existing:
        tls_cache.insert(type_id, key, existing)   # after the lock is released
        return retain(existing)
    # Layer 4: rwlock-guarded write insert with re-probe
    with allocator.rwlock.write:
        existing = allocator.lookup(key)       # re-probe, mandatory
        fresh = nullptr if existing else allocator.insert_new_storage(key)
    if existing:
        tls_cache.insert(type_id, key, existing)
        return retain(existing)
    tls_cache.insert(type_id, key, fresh)
    return fresh
The re-probe under the write lock is not optional. Without it, two racing inserts can pass through the read-lock probe simultaneously, both find the key absent, both upgrade to write mode in sequence, and both allocate a fresh storage object — leaving the table with two distinct BaseStorage* for the same key. Every later equality test on that type would compare unique-but-equal storage objects and return false, breaking the structural-equality contract every dialect relies on.
The TLS cache insert happens after the global probe in both the read-path hit case and the write-path insert case. Inserting the entry before the lock is released would still be correct under TLS isolation, but the post-lock insert keeps the hot path's lock window minimal — the cache write is a pair of stores against thread-local memory and adds no contention to the rwlock.
ThreadSafeRefCountedBase
Every interned storage object follows the canonical llvm::ThreadSafeRefCountedBase shape: a strong count, a weak count, and an installed-in-cache marker, all int32. Strong increments fire when a storage object is handed to a caller through the getOrCreate return path; strong decrements run the payload deleter when the count reaches zero. Weak decrements run the final destructor when the weak count reaches zero. Threaded builds use atomic add/sub on each field; single-threaded builds collapse the same code to ordinary integer updates through LLVM's weak-threading gate.
The interaction with the storage-uniquer layers above is one-way. The refcount lives inside the BaseStorage the uniquer hands out, but the uniquer itself never reads or writes the refcount — that work is done by the higher-level Type / Attribute value wrappers as they enter and leave caller scope. The uniquer is responsible only for ensuring that two callers asking for the same key see the same BaseStorage*; the refcount machinery then ensures that the storage outlives the last caller that holds a reference.
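The relaxed-increment, acq_rel-decrement scheme described under Atomic Memory Ordering can be sketched on a single strong count. `RefCounted` is a hypothetical header, and the `destroyed` flag stands in for running the payload deleter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct RefCounted {                        // hypothetical storage header
    std::atomic<int32_t> strong{1};        // the creator holds the first reference
    bool destroyed = false;                // stands in for the payload deleter

    // The increment only has to keep the count itself consistent.
    void retain() { strong.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when this call dropped the last reference. acq_rel makes
    // the final decrement observe every write made before earlier releases.
    bool release() {
        if (strong.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            destroyed = true;
            return true;
        }
        return false;
    }
};
```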
Cross-Pass Synchronisation
Most passes in the compiler are single-threaded over a given function, but the dialect-conversion driver can run multiple passes in parallel over different functions of the same module. Each function's pass execution touches the shared MLIRContext, which means every concurrent function-pass shares the same Level-1 bucket array, the same per-TypeID Level-2 tables, and the same TLS caches per thread. The four-layer storage-uniquer concurrency model is what makes that parallelism safe: TLS isolation hides the common-case probe from the rwlocks; the rwlock read mode lets many functions probe the same type concurrently; the rwlock write mode serialises the rare insert; the CAS-publish-or-free lock-free path handles the even rarer case of the first thread to touch a new TypeID.
The dialect-conversion driver also depends on the diagnostic-printer mutex on MLIRContext. When a parallel pass calls emitError or emitWarning, the diagnostic engine acquires that mutex before writing to the active diagnostic stream; without it, two threads emitting diagnostics concurrently could interleave their output character by character. The mutex is held for the duration of one diagnostic emission only; it does not protect the diagnostic engine's internal state between emissions, which is single-threaded by construction.
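The one-emission lock window can be sketched as follows. `DiagnosticSink` is a hypothetical printer that writes to an in-memory stream; the message is formatted before the lock is taken so the mutex covers only the write.

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>

class DiagnosticSink {                       // hypothetical printer
public:
    // Format outside the lock, then hold the mutex for one whole emission
    // so concurrent diagnostics cannot interleave mid-line.
    void emit(const std::string &severity, const std::string &message) {
        std::string line = severity + ": " + message + "\n";
        std::lock_guard<std::mutex> guard(mutex_);
        stream_ << line;
    }

    std::string text() {
        std::lock_guard<std::mutex> guard(mutex_);
        return stream_.str();
    }

private:
    std::mutex mutex_;
    std::ostringstream stream_;
};
```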
Reimplementation Notes
A reimplementation has to preserve four invariants together; any one of them in isolation is insufficient.
- The TLS cache holds only positive probes. A negative cache entry shadows global publication and breaks the structural-equality contract.
- The Level-1 CAS publishes a fully constructed allocator. The allocator must be allocated, zero-filled, and have its rwlock initialised before the CAS issues; any thread that observes the bucket through atomic_load must see a usable allocator.
- The Level-2 write path re-probes after the read-to-write upgrade. Without the re-probe, racing inserts create distinct storage objects for the same key and the dialect's equality contract breaks downstream.
- The weak-threading gate is checked once. A per-call check would be cheap in absolute terms, but the branch-predictor pollution and the lost speculation slot on the hot path make a startup-time check the better trade.
A reimplementation that skips invariant (1) can corrupt user-visible compilation by giving distinct Type values for what should be the same type. A reimplementation that skips invariant (2) hits a use-before-construct race on the first multi-threaded probe of any new TypeID. A reimplementation that skips invariant (3) hits the same equality-corruption mode as (1), but only under load. A reimplementation that skips invariant (4) wastes cycles on the threaded hot path; the correctness model still holds.
Allocator + BumpPtr + Slab Sizes
Abstract
The tileiras binary leans on three intertwined allocator layers — a generic malloc-retry shim, a bump-pointer arena following LLVM's BumpPtrAllocator::Allocate contract, and a per-MLIRContextImpl lattice of fixed-size slab requests that fan out into the StorageUniquer hash-cons machinery. Together they explain why the dominant fixed-size allocations land on a small number of well-known C++ class sizes: each is the byte image of a published LLVM 18 / MLIR storage record, and every one reconciles against an upstream sizeof() in LLVM 18 (which the producer string a2100git independently dates the binary to, modulo the LLVM 21 development tag the bitcode loader reports).
The layers are documented in the order a getOrCreate call visits them: the SDNode-shaped BumpPtrAllocator::Allocate wrapper, the four pattern-object slab sizes, the 24/96-byte Region / Block strides, the custom MLIR allocation wrappers, and the per-MLIRContextImpl arena that owns all of the above. Cross-links: Data Section Decryption and the StorageUniquer and Context Impl page for the 88-byte StorageAllocator slot allocated atop this stack.
BumpPtrAllocator vs Allocator
Two allocator types sit in the storage stack and they are not interchangeable. BumpPtrAllocator is LLVM's raw forward-only arena — a singly-linked list of slabs, a current-slab pointer, and a high-water mark within the current slab. Every Allocate(size, count) call carves the next aligned region out of the current slab, advances the high-water mark, and returns the carved pointer; when the slab fills, the allocator allocates a fresh slab from the next size class. There is no per-object free path: deallocation happens once at allocator destruction, when every slab is released in one pass.
Allocator (the per-MLIRContext allocator the StorageUniquer owns) wraps a BumpPtrAllocator and adds three responsibilities. First, it picks a slab-size class for the request based on the requested allocation size and the alignment requirement of the result type. Second, it routes oversized allocations — anything larger than the largest fixed slab class — through a separate large-allocation path that allocates a dedicated slab for the single object. Third, it tracks per-TypeID allocation counts so the StorageAllocator can decide when to grow its Level-2 hash table.
The two-layer split matters for reimplementation. The lower-level BumpPtrAllocator is the cold-path workhorse: callers that don't need slab-class selection (the bytecode reader, the dialect-registration walker, the diagnostic-message formatter) call into it directly. The upper-level Allocator is the hot-path entry point for the storage uniquer, because every interned type, attribute, and affine expression has to pick a slab class matched to its sizeof and natural alignment.
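The two layers can be sketched as a forward-only arena plus a slab-class picker. `BumpArena` and `pick_slab_class` are hypothetical names, and treating the four pattern-object sizes as the full class list is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical forward-only arena: carve aligned regions, free all at once.
class BumpArena {
public:
    explicit BumpArena(std::size_t slab_bytes) : slab_bytes_(slab_bytes) {}
    ~BumpArena() { for (void *s : slabs_) std::free(s); }
    BumpArena(const BumpArena &) = delete;
    BumpArena &operator=(const BumpArena &) = delete;

    // `align` must be a power of two no larger than malloc's alignment.
    void *allocate(std::size_t size, std::size_t align) {
        std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        std::uintptr_t p = (cur_ + mask) & ~mask;
        if (slabs_.empty() || p + size > end_) {
            // Start a fresh slab; an oversized request gets a slab of its own.
            std::size_t bytes = size > slab_bytes_ ? size : slab_bytes_;
            void *slab = std::malloc(bytes);
            if (!slab) std::abort();
            slabs_.push_back(slab);
            p = reinterpret_cast<std::uintptr_t>(slab);
            end_ = p + bytes;
        }
        cur_ = p + size;               // advance the high-water mark
        return reinterpret_cast<void *>(p);
    }

    std::size_t slab_count() const { return slabs_.size(); }

private:
    std::size_t slab_bytes_;
    std::vector<void *> slabs_;
    std::uintptr_t cur_ = 0, end_ = 0;
};

// Smallest assumed slab class that fits, or 0 to signal the large path.
inline std::size_t pick_slab_class(std::size_t size) {
    const std::size_t classes[] = {0x60, 0x68, 0x70, 0x78};
    for (std::size_t c : classes)
        if (size <= c) return c;
    return 0;
}
```

There is deliberately no per-object free: memory returns to the system only when the arena is destroyed.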
BumpPtrAllocator Allocate Wrapper
The bump-pointer wrapper follows LLVM's BumpPtrAllocator::Allocate(size, count) contract. It computes a header area proportional to the element count, delegates the actual allocation to the retrying allocator, then initializes fixed 32-byte slot metadata before returning the usable pointer. Two allocation shapes identify the LLVM layout family: 72-byte SDNode records and 88-byte GlobalVariable records. Both sizes align with LLVM 18 object layouts, even though the linked toolchain otherwise carries the later LLVM development tag used by CUDA 13.1.
Pattern object slab sizes
Four pattern-object footprints dominate fixed-size requests. Each is the sizeof() of an OpRewritePattern<T> / ConversionPattern<T> subclass after RewritePattern base inflation (vtable ptr + PatternBenefit + op-name SmallVector + MLIRContext* + any custom members).
| Slab size | Site count | Upstream identity | When you see it |
|---|---|---|---|
| 0x60 | 286 | RewritePattern base (vtable + 80 B inherited + 16 B benefit/tag tail) | cuda_tile / nv_tileaa canonicalisers, dialect-internal folds |
| 0x68 | 201 | OpRewritePattern<T> + one inline Value/Type member | typed canonicalisers that retain one dispatch handle |
| 0x70 | 902 | ConversionPattern<T> (adds TypeConverter* to the 0x60 base) | every *-to-LLVM / *-to-NVVM lowering, dominant slab |
| 0x78 | 66 | conversion pattern + one inline storage member (e.g. layout, IntegerSet) | conversion patterns that carry a layout, set, or address-space tag |
The 0x70 slab dominates because every dialect-to-NVVM lowering instantiates ConvertOpToLLVMPattern<T> or OpConversionPattern<T>, both 112 B after inheritance flattening (vtable + 80 B RewritePattern + 24 B ConversionPattern extension carrying the TypeConverter* plus padding). Second place 0x60 is plain RewritePattern (no type converter), used by the cuda_tile canonicalisers. The 0x68 and 0x78 slabs differ from their 0x60 / 0x70 neighbours by exactly one trailing 8-byte member — typically an Attribute handle or a TypeID retained for dispatch — matching upstream's practice of stashing dispatch keys inline rather than chasing them through the op-name SmallVector.
MLIR Region / Block strides
Walkers assume two fixed strides for the IR backbone. mlir::Region is 24 bytes: one Operation* parent, one Block ilist sentinel, and an 8-byte tail flag word. mlir::Block is 96 bytes in the LLVM 18 layout: IList header, BlockArgument vector, operation-list sentinel, region back-pointer, and successor/predecessor vectors. Every region/block walker either steps a contiguous Region array by 24 bytes or follows a Block::next ilist link with no stride assumption.
Custom MLIR alloc wrappers
Two helpers wrap raw allocation for MLIR-specific contracts:
- The malloc-retry trampoline implements LLVM's allocation contract: zero-byte requests are rounded up to one byte, failed allocations invoke the active new_handler, and fixed-size slabs bottom out in the same path.
- InterfaceMap::insert keeps (TypeID, void*) pairs sorted by TypeID. It binary-searches the 16-byte-strided vector, appends or shifts elements right, frees duplicate implementation pointers, and delegates growth to LLVM's aligned buffer allocator.
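The sorted-insert behaviour described for InterfaceMap::insert can be sketched in a few lines of Python. This is an illustration only: the class and method names are hypothetical, a pair of Python lists stands in for the 16-byte-strided vector, and the choice to discard the incoming implementation on a duplicate key is an assumption, since the text only says that duplicate pointers are freed.

```python
import bisect


class InterfaceMap:
    """Sorted (type_id, impl) table; a stand-in for the 16-byte-strided vector."""

    def __init__(self):
        self._ids = []    # TypeID keys, always kept sorted
        self._impls = []  # implementation handle paired with each key

    def insert(self, type_id, impl):
        """Insert impl under type_id; return False if the key already exists."""
        pos = bisect.bisect_left(self._ids, type_id)  # binary search
        if pos < len(self._ids) and self._ids[pos] == type_id:
            # Duplicate key: the incoming implementation is dropped here
            # (assumption: this mirrors the "free the duplicate" step).
            return False
        # list.insert either appends at the end or shifts later entries right.
        self._ids.insert(pos, type_id)
        self._impls.insert(pos, impl)
        return True

    def lookup(self, type_id):
        """Binary-search lookup; returns None when the key is absent."""
        pos = bisect.bisect_left(self._ids, type_id)
        if pos < len(self._ids) and self._ids[pos] == type_id:
            return self._impls[pos]
        return None
```

Keeping the keys sorted is what lets both insert and lookup share the same binary search.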
Hot-Path Subsystem Mapping
The dominant slab classes (0x60 / 0x68 / 0x70 / 0x78) account for nearly every fixed-size allocation in steady state, but each class is fed by different subsystems. The mapping below tracks the four hot consumers against the slab class they overwhelmingly land on.
| Subsystem | Primary slab | What lands there |
|---|---|---|
| Dialect-to-NVVM conversion driver | 0x70 (112 B) | every ConvertOpToLLVMPattern<T> and OpConversionPattern<T> instantiation — one per source op the pass set rewrites |
| cuda_tile / nv_tileaa canonicaliser registry | 0x60 (96 B) | every RewritePattern the canonicaliser pass registers — folds, identity removals, swap-operand patterns |
| Scheduler schedule-node allocator | 0x60 / 0x68 | Schedule::Node records and the inline SmallVector-backed dependence-edge lists each node carries |
| Bytecode reader IR-node array allocator | 0x70 / 0x78 | parsed-IR-node arrays, where the trailing 8-byte slot on 0x78 carries the IR-attribute index back to the dialect |
The slab-class fan-out is what gives the hot path its locality. A bytecode reader that parses a function reads contiguous arrays of 0x70-shaped records into a single slab; at 112 bytes per record, ten records span roughly eighteen 64-byte cache lines laid out back to back, and the prefetcher walks the slab forward naturally. The conversion driver sees the same effect: the OpConversionPattern<T> objects it registers land on the same slab in the order they are requested, so the dialect-conversion walker's locality is determined entirely by registration order.
Alignment Guarantees
Every allocation is aligned to the natural alignment of its slab class. The four hot classes have natural alignments tied to their size: 0x60 and 0x70 are 8-byte aligned (the slab base is page-aligned and each carve is a multiple of 8); 0x68 and 0x78 retain the same 8-byte natural alignment because their trailing 8-byte member is a regular qword. Oversized allocations that escape to the large-allocation path are page-aligned by default, but a caller can request a stricter alignment (16, 32, or 64 bytes for vector types) by passing an explicit alignment to the Allocate call.
The slab-class selection algorithm is structurally simple: take the requested allocation size, round up to the next multiple of 8, look up the smallest slab class whose size is at least the rounded-up request and whose natural alignment satisfies the caller's required alignment, and carve from that class. The lookup is a four-entry compare-and-branch chain; the binary does not use a fancier data structure because the lookup must cost less than the allocation it serves.
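A minimal Python sketch of that selection rule follows. It models only the four hot classes named on this page; the function name and the use of None to mean "falls through to the large-allocation path" are assumptions made for illustration.

```python
SLAB_CLASSES = (0x60, 0x68, 0x70, 0x78)  # the four hot classes, in bytes
NATURAL_ALIGNMENT = 8                     # shared by every hot class


def select_slab_class(size, alignment=8):
    """Return the slab size to carve from, or None for the large-allocation path."""
    rounded = (size + 7) & ~7  # round up to the next multiple of 8
    if alignment > NATURAL_ALIGNMENT:
        # An 8-byte-aligned carve cannot satisfy a stricter request.
        return None
    for slab in SLAB_CLASSES:  # four-entry compare chain, smallest first
        if slab >= rounded:
            return slab
    return None
```

For example, a 97-byte request rounds up to 104 bytes and lands on the 0x68 class, while a 121-byte request rounds to 128 and misses all four classes.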
There is one subtle alignment-related constraint. The StorageUniquer's Level-2 hash table hashes on BaseStorage*, and the lower three bits of every BaseStorage* are used as a tag for the storage-kind discriminator. If the slab-class natural alignment were less than 8 bytes, the discriminator bits would overlap with the address bits and the hash would produce false collisions. The 8-byte natural alignment of every slab class is therefore not an optimisation — it is a correctness requirement for the storage tagging.
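The tagging constraint can be illustrated with plain integers standing in for addresses. The helper names below are hypothetical, and the sketch assumes only what the paragraph states: three low bits carry a discriminator, so the address must have those bits clear.

```python
TAG_BITS = 3
TAG_MASK = (1 << TAG_BITS) - 1  # 0b111


def tag_pointer(address, kind):
    """Pack a storage-kind discriminator into the low bits of an address."""
    if address & TAG_MASK:
        # Fewer than 8 bytes of alignment would make tag and address bits overlap.
        raise ValueError("address is not 8-byte aligned")
    if not 0 <= kind <= TAG_MASK:
        raise ValueError("kind does not fit in three bits")
    return address | kind


def untag_pointer(word):
    """Recover (address, kind) from a tagged word."""
    return word & ~TAG_MASK, word & TAG_MASK
```

A 4-byte-aligned address such as 0x1004 is rejected because bit 2 would be indistinguishable from a tag bit.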
Lifetime
BumpPtrAllocator has exactly one lifetime: it is freed all-at-once at context destruction. There is no per-object free path, no garbage collection cycle, no reference-counted release. The slabs are allocated forward over the lifetime of the MLIRContext and released in one pass when the context's destructor runs.
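The single-lifetime rule is easy to see in a toy arena. The sketch below is not the LLVM implementation; the class name, the 4096-byte default slab size, and the (slab index, offset) return value are all illustrative assumptions. What it preserves is the contract: forward-only carving, no per-object free, one bulk release.

```python
class BumpArena:
    """Toy arena: forward-only carving, one bulk release, no per-object free."""

    def __init__(self, slab_size=4096):
        self.slab_size = slab_size
        self.slabs = []   # each slab is a bytearray
        self.offset = 0   # bump pointer inside the newest slab

    def allocate(self, size, alignment=8):
        """Return (slab_index, offset) for a fresh block of `size` bytes."""
        if size > self.slab_size:
            raise ValueError("oversized request; a real arena uses a separate path")
        start = (self.offset + alignment - 1) & ~(alignment - 1)
        if not self.slabs or start + size > self.slab_size:
            self.slabs.append(bytearray(self.slab_size))  # open a new slab
            start = 0
        self.offset = start + size
        return len(self.slabs) - 1, start

    def release_all(self):
        """Context-destruction analogue: drop every slab in one pass."""
        self.slabs.clear()
        self.offset = 0
```

There is deliberately no free method; the only way to reclaim memory is release_all.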
The consequence for the storage uniquer is that an interned Type, Attribute, Location, or AffineExpr is reachable as long as the MLIRContext is alive. The ThreadSafeRefCountedBase refcount on each storage object still tracks live references, but the refcount reaching zero does not free the storage — it only signals that the per-TypeID destructor table can run the payload deleter to release any heap-allocated state the storage owns (a std::string body, a non-inline SmallVector heap, etc.). The 808-byte / 96-byte / sizeof() portion of the storage object itself stays in its slab until the context dies.
The bytecode-reader IR-node arrays and the scheduler's schedule-node structs follow the same lifetime rule — they live inside the same arena and are freed in the same one-pass slab release. The dialect-conversion driver's ConvertOpToLLVMPattern<T> records also live in the arena because their construction happens during pass registration and they survive for the lifetime of the pass manager that owns them. A reimplementation that tried to free pattern records mid-pass would not just leak the slab — it would break every PatternBenefit lookup and every dispatch table that points back into the slab.
MLIRContextImpl Arena Ownership
Everything above sits inside the MLIRContextImpl arena, which owns the StorageUniquer Level-1 bucket table. Each bucket can publish an 88-byte StorageAllocator containing a per-TypeID pthread_rwlock_t, live/tombstone counters, a bucket count, and a pointer to the Level-2 storage table. MLIRContextImpl retains every StorageAllocator, and each allocator retains every BaseStorage it hands out, so the arena lifetime is tied to a single MLIRContext. When the context dies, every interned Type, Attribute, Location, Identifier, AffineExpr, and pattern object allocated through this stack is reclaimed through the per-TypeID destructor table. The MLIR BumpPtrAllocator proper lives inside the same context as a separate slab list for operation records and trailing objects.
The concurrency story for this arena lives in Threading and Synchronization: every per-TypeID StorageAllocator carries the pthread_rwlock_t that protects its Level-2 probe and insert, and the MLIRContextImpl's diagnostic-printer mutex serialises the failure-path messages a slab-exhaustion event would produce.
Reimplementation Notes
allocate_storage(kind, payload_size):
    size = max(payload_size, 1)
    ptr = malloc_retry(size)
    initialize_storage_header(ptr, kind)
    return ptr

get_or_create_storage(context, type_id, key):
    allocator = context.storage_uniquer.lookup_or_publish_allocator(type_id)
    with allocator.write_lock_if_insert_needed(key):
        return allocator.lookup_or_insert(key)
Arena ownership is fundamental: individual MLIR storage records are never freed piecemeal during normal compilation. They are reclaimed when the owning MLIRContext is destroyed.
Twine, StringRef, and format
Abstract
The tileiras binary inherits the LLVM-style trio of cheap string types — StringRef, Twine, and format_object — but only StringRef survives in canonical form. Every diagnostic message, verifier complaint, and register / opcode mnemonic the NVPTX printer sinks into raw_svector_ostream ultimately bottoms out in a 16-byte (ptr, len) pair: StringRef. The Twine concatenation type survives only as a fall-back rendering path inside the Diagnostic::operator<< switch (kinds 5 and 6, plus the catch-all that materialises a Twine into a SmallString and re-emits it as a kind-3 DiagArg). The formatv("sm_{}", N)-style format engine collapses to a small raw_ostream-backed sequence of write(ptr, len) calls. No format_object<...> virtual dispatch survives — every observed "sm_{N}" callsite is open-coded as << operators against a raw_svector_ostream rather than a templated formatv.
This page catalogues the three string-passing ABIs the binary actually uses: the 16-byte StringRef calling convention shared by attribute parsers, printers, and verifiers; the Twine append family that drives diagnostic concatenation; and the raw_svector_ostream chain that NVVM verifiers use to build candidate-tuple strings before sinking them into a single owned-string DiagArg.
StringRef 16-byte (ptr, len) ABI
StringRef is passed by value as two machine words, returned as two machine words, and stored as a 16-byte structure inside larger records such as diagnostic arguments, attribute storage, and SmallString heads.
| Offset | Size | Field | Notes |
|---|---|---|---|
| +0x00 | 8 | const char *ptr | static-string pointer or heap-owned buffer |
| +0x08 | 8 | size_t len | byte length; never includes NUL |
Every consumer enforces the 16-byte stride. Diagnostic::operator<<(StringRef) copies (ptr, len) into a 24-byte diagnostic-argument slot and stamps the appropriate argument kind. The const char* overloads first compute strlen, then use the same 24-byte push shape with kind 3.
Twine append family
Twine is the LLVM lazy-concatenation tree — leaves are StringRefs, integers, or raw const char* pointers, interior nodes are tagged concat operators. tileiras retains this design as a fall-back diagnostic rendering path. A diagnostic argument arriving as a Twine triggers the renderer to walk the tree into a temporary SmallString, then re-push the rendered bytes as an ordinary string argument. In practice, diagnostics that need formatting usually flow through raw_svector_ostream instead.
raw_svector_ostream
The canonical diagnostic-formatting pipeline in tileiras is a raw_svector_ostream layered over a stack SmallString. Verifiers stream literal fragments, separators, and typed values into that buffer, then promote the final string into a single owned diagnostic argument at flush time. Diagnostics such as "unimplemented variant for MMA shape <...>" use the same shape.
| Operation | Role | Notes |
|---|---|---|
| stream C string | raw_ostream::operator<<(StringRef) | Computes length and forwards to explicit write. |
| write bytes | raw_ostream::write(const char*, size_t) | Emits literal separators like ", " and ">". |
| stream typed value | typed-value stream operator | Emits sm_{N} and similar formatted scalars from stored type information. |
The chain is invoked from HH02 line-by-line as:
out << "unimplemented variant for MMA shape <";
out << multiplicand_a;
out.write(", ", 2);
out << multiplicand_b;
out.write(", ", 2);
out << multiplicand_c;
out << ">";
Here out is a raw_svector_ostream aimed at a stack SmallString. The verifier later promotes the SmallString into a single owned diagnostic argument before returning control to the diagnostic engine.
formatv("sm_{}") style format strings
The binary contains no surviving formatv template instantiation. Common SM-version strings such as "sm_50", "sm_52", "sm_60", and "sm_120" are selected by lookup table when possible. The generic sm_{N} case writes "sm_" and then streams the decimal compute capability. PTX register prefixes such as "%r{N}", "%f{N}", "%p{N}", and "%rd{N}" follow the same pattern.
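The table-then-stream shape can be sketched as follows. The table contents are limited to the four strings quoted above, and the function names are hypothetical; the list of pieces stands in for successive writes into a raw_svector_ostream.

```python
KNOWN_SM_NAMES = {50: "sm_50", 52: "sm_52", 60: "sm_60", 120: "sm_120"}


def sm_name(capability):
    """Table hit when possible, otherwise "sm_" followed by the decimal value."""
    hit = KNOWN_SM_NAMES.get(capability)
    if hit is not None:
        return hit
    pieces = ["sm_"]                # first write: the literal prefix
    pieces.append(str(capability))  # second write: the decimal digits
    return "".join(pieces)


def register_name(prefix, index):
    """Same two-write shape for PTX registers such as %r3 or %rd7."""
    return prefix + str(index)
```

Neither path needs a format-string parser, which is consistent with no formatv instantiation surviving in the binary.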
Owned-string DiagArg kind=4
Kind 4 is the only diagnostic-argument flavor that owns its payload. It stores a pointer to a {data, length} pair rather than the direct (ptr, len) pair used by borrowed strings. When the diagnostic receives kind 4 it heap-copies the bytes and appends the allocation to the diagnostic's owned-string vector. Verifiers use kind 4 for stack SmallString buffers because the source storage would otherwise dangle by the time the diagnostic engine emits. The diagnostic destructor frees those owned buffers.
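The difference between a borrowed and an owned argument shows up as soon as the source buffer is reused. In this sketch the Diagnostic class and its method names are hypothetical; a bytearray plays the role of the stack SmallString and a memoryview plays the role of a borrowed (ptr, len) pair.

```python
KIND_BORROWED = 3  # borrowed C string / StringRef
KIND_OWNED = 4     # the diagnostic keeps its own copy


class Diagnostic:
    def __init__(self):
        self.args = []           # (kind, payload) pairs
        self.owned_strings = []  # copies released together with the diagnostic

    def append_borrowed(self, view):
        # No copy: the caller must keep the underlying storage alive.
        self.args.append((KIND_BORROWED, view))

    def append_owned(self, buffer):
        copy = bytes(buffer)  # heap copy of the caller's bytes
        self.owned_strings.append(copy)
        self.args.append((KIND_OWNED, copy))

    def render(self):
        return b"".join(bytes(payload) for _, payload in self.args).decode()


def overwrite(buffer, fill):
    """Simulate the stack frame being reused after the verifier returns."""
    for i in range(len(buffer)):
        buffer[i] = fill
```

After overwrite runs on the scratch buffer, a diagnostic built with append_owned still renders the original text, while one built with append_borrowed renders whatever now occupies the buffer.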
Reimplementation Notes
emit_formatted_error(parts):
    buffer = SmallString()
    out = raw_svector_ostream(buffer)
    for part in parts:
        if part.is_literal:
            out.write(part.ptr, part.len)
        else:
            out << part.value
    diag << DiagnosticArg.owned_string(buffer)
The ownership rule is the important part: borrowed StringRef arguments may point at static strings, but any stack-built SmallString must be promoted to an owned diagnostic argument before the stack frame unwinds.
Diagnostic Helpers
Abstract
Most attribute and operation verifiers in tileiras share a small set of diagnostic helper shapes. They split into three roles: clones of Diagnostic::operator<<(const char* literal) that push borrowed C strings; a generic 24-byte SmallVector::push_back used for MMA-shape candidate records; and a SmallVector<std::string, N> family used when verifiers build richer candidate tuples. These helpers are pre-emission infrastructure — actual formatted text still flows through raw_svector_ostream into a single owned diagnostic string when needed.
A trap to avoid: the 24-byte stride appears in two unrelated roles in this binary and must not be conflated. The literal-append helpers push a DiagnosticArgument (kind 3, borrowed C-string) whose stride happens to be 24 bytes. The MMA allowlist verifier pushes 24-byte candidate records that are not diagnostic arguments at all and carry no kind tag. Both look like 24-byte SmallVector::push_back shapes in the binary, but only the first interacts with the diagnostic engine. The companion page for the StringRef / Twine / raw_svector_ostream chain that human-facing diagnostics actually flow through is Twine, StringRef, format.
Six byte-identical Diagnostic literal-append clones
The literal-append helpers all implement the same operation. Each accepts (Diagnostic*, const char*), builds a stack DiagnosticArgument with kind 3, computes the literal length with strlen, grows the embedded argument vector when needed, copies one 24-byte argument slot, and increments the argument count.
The embedded Diagnostic body layout these clones index is:
| Offset | Width | Field |
|---|---|---|
| +0x00 | 8 B | Location* |
| +0x08 | 8 B | ctx ptr / packed severity dword |
| +0x10 | 8 B | args_begin (24-B-stride DiagArg array head) |
| +0x18 | 4 B | args_size |
| +0x1C | 4 B | args_cap (initialised to 4) |
| +0x20 | 96 B | inline_args[4] (four 24-B DiagArg slots) |
The packed initial state stores zero size and inline capacity four. The diagnostic body may be embedded in a larger stack frame, so helpers are written against the argument-vector head rather than a whole diagnostic object with one fixed surrounding layout.
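The offsets in the layout table can be cross-checked with a ctypes model. Field names are hypothetical, the 24-byte argument slot is treated as opaque bytes, and pointing args_begin at the inline slots in the initial state is an assumption borrowed from the usual SmallVector convention rather than something the table states.

```python
import ctypes


class DiagArg(ctypes.Structure):
    # Opaque 24-byte argument slot; its internal split is not modelled here.
    _fields_ = [("raw", ctypes.c_uint8 * 24)]


class DiagnosticBody(ctypes.Structure):
    _fields_ = [
        ("location", ctypes.c_uint64),         # +0x00 Location*
        ("ctx_or_severity", ctypes.c_uint64),  # +0x08 ctx ptr / packed severity
        ("args_begin", ctypes.c_uint64),       # +0x10 head of the DiagArg array
        ("args_size", ctypes.c_uint32),        # +0x18
        ("args_cap", ctypes.c_uint32),         # +0x1C
        ("inline_args", DiagArg * 4),          # +0x20 four 24-byte slots
    ]


def fresh_diagnostic_body():
    """Build the initial state: size 0, capacity 4, storage at the inline slots."""
    body = DiagnosticBody()
    body.args_size = 0
    body.args_cap = 4
    body.args_begin = ctypes.addressof(body) + DiagnosticBody.inline_args.offset
    return body
```

With these fields the structure comes out at 0x80 bytes, which is 0x20 of header plus the 96 bytes of inline slots.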
| Helper family | Signature | Stride | Notes |
|---|---|---|---|
| literal append clone | (Diagnostic*, const char*) | 24 B | pushes borrowed C-string argument, kind 3 |
| hot verifier clone | (Diagnostic*, const char*) | 24 B | same body emitted into verifier-heavy translation units |
| support-verifier clone | (Diagnostic*, const char*) | 24 B | same body used by tcgen05 and TMA support verifiers |
Each clone is the same operation as the named Diagnostic::operator<<(const char*): wrap as kind 3, compute length, push a 24-byte argument. Every observed call site passes a static string literal.
Generic 24-byte SmallVector push_back
The generic 24-byte push helper looks like a seventh literal clone at first glance, but it is a plain copy into a standalone SmallVector. Its input is a pre-built 24-byte record, not a C string; it stamps no diagnostic kind; it never calls strlen. The vector head layout is {begin, size, cap, inline}.
The MMA multiplicand allowlist verifier uses it to build shape-multiplicand candidate records. The 24-byte stride coincidence with DiagArg is incidental — the elements are MMA-shape records, not diagnostics.
SmallVector<std::string, N> lifetime managers
The SmallVector<std::string, N> helper family manages verifier-side string vectors. Each std::string uses libc++ SSO with a 48-byte element stride. The outer vector head is the usual {begin, size, cap, inline} layout.
| Role | Stride | Notes |
|---|---|---|
| move into fresh storage | 48 B | Moves SSO or heap-backed strings into grown storage. |
| emplace_back(StringRef) | 48 B | Grows if needed, then constructs a string from borrowed bytes. |
| push_back(const std::string&) | 48 B | Handles self-aliasing across growth. |
| copy assignment | 48 B | Reuses, grows, or destroys elements based on source and destination size. |
| reserve/grow | 48 B | Allocates a larger buffer and moves old strings. |
| move assignment | 48 B | Uses inline or pointer-swap branches. |
| ~SmallVector<std::string, N>() | 48 B | Frees non-SSO payloads and non-inline vector buffers. |
| ~SmallVector<TupleRecord, N>() | 112 B | Walks nested string vectors before freeing the outer buffer. |
The grow helper walks existing elements, initializes SSO headers in the new slots, copies non-empty string bytes, then frees old non-SSO payloads in reverse order. The emplace helpers build strings from StringRef; copy assignment branches on source size versus destination size and capacity; move assignment uses inline and pointer-swap branches.
The tuple-record destructor handles a nested vector at the end of each outer element. It frees the inner string payloads first, frees the inner buffer if it is not inline, then continues the outer reverse walk.
The MMA allowlist verifier uses three inner per-multiplicand vectors, deep-copies them into outer tuple slots, then move-assigns the tuple into the candidate list. On the miss path, the human-facing diagnostic takes the raw_ostream formatted SmallString route rather than formatting through these vector helpers.
Reimplementation Notes
append_literal(diag, literal):
    arg = DiagnosticArgument(kind="cstring", ptr=literal, len=strlen(literal))
    diag.args.push(arg)

build_candidate_strings(candidates):
    records = SmallVector()
    for candidate in candidates:
        records.push(make_candidate_record(candidate))
    return records
Do not confuse the 24-byte MMA candidate records with diagnostic arguments. They share a stride, but only diagnostic arguments carry a diagnostic kind tag.
GlobalValue Flag Bits
Abstract
The tileiras binary inherits LLVM's 16-bit GlobalValue flag word. The word stores linkage, visibility, DLL storage class, thread-local mode, unnamed-address policy, and subclass data. NVIDIA reuses bit 14 as a fast marker for functions whose nvvm.annotations metadata has been transplanted into function attributes. This page covers the bit-level contract and the annotation kinds mirrored by the attribute infrastructure.
GlobalValue 16-bit flag word — LLVM standard low 14 bits
The word lives in LLVM's GlobalValue flag field. Bits 0..13 follow upstream LLVM; bit 14 is NVIDIA-repurposed; bit 15 is reserved.
| Bits | Field | Notes |
|---|---|---|
| 0..3 | LinkageTypes | values 0..15; InternalLinkage = 7. Cleared on kernels, set to internal on non-kernels. |
| 4..5 | VisibilityTypes | Default=0, Hidden=1, Protected=2. Tested via 0x30 to gate the marker write. |
| 6..7 | DLLStorageClass | Default / Import / Export. Preserved through 0xFCC0. |
| 8..9 | ThreadLocalMode | one of four TLS models. Cleared by the 0xFCC0 non-kernel rewrite (bits 8 and 9 fall outside the mask); untouched on kernels. |
| 10..11 | UnnamedAddr | None / Local / Global. Preserved. |
| 12 | HasLLVMReservedName | LLVM sentinel. Preserved. |
| 13 | subclass-data | GlobalValue subclass slot. Preserved. |
| 14 | NVIDIA-repurposed | See next section. |
| 15 | reserved | Always zero in observed paths. |
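A small decoder makes the bit assignments in the table concrete. The function and dictionary key names are hypothetical; the shifts and widths are taken directly from the table rows.

```python
def decode_global_value_flags(word):
    """Split the 16-bit flag word into the fields listed in the table."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("flag word must fit in 16 bits")
    return {
        "linkage": word & 0xF,                 # bits 0..3
        "visibility": (word >> 4) & 0x3,       # bits 4..5
        "dll_storage": (word >> 6) & 0x3,      # bits 6..7
        "tls_mode": (word >> 8) & 0x3,         # bits 8..9
        "unnamed_addr": (word >> 10) & 0x3,    # bits 10..11
        "reserved_name": bool((word >> 12) & 1),             # bit 12
        "subclass_data": (word >> 13) & 1,                   # bit 13
        "annotations_transplanted": bool((word >> 14) & 1),  # bit 14
        "reserved": (word >> 15) & 1,                        # bit 15
    }
```

For instance, 0x0017 decodes to InternalLinkage (7) with Hidden visibility (1), and 0x4000 decodes to a word whose only set field is the bit 14 marker.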
Bit 14 — NVIDIA-repurposed "nvvm.annotations_transplanted" marker
Bit 14 is the NVIDIA-private "nvvm.annotations_transplanted" marker. It is set on defined kernel functions with default visibility. The same fact is also stored as a string-keyed function attribute, giving the backend a dual encoding — a fast bit check plus the structured attribute. isKernelFunction can short-circuit when the marker is set, skipping the fallback to legacy !nvvm.annotations !"kernel" metadata.
Consumers read the marker through a single bit test before falling back to attribute or metadata lookup:
static bool is_kernel_fast(const Function *fn) {
    /* Bit 14 of the 16-bit GlobalValue flag word doubles as the
     * "nvvm.annotations_transplanted" cache. If it is set, the
     * attribute is guaranteed present and the !nvvm.annotations
     * fallback can be skipped. */
    return (fn->global_value_flags & (1u << 14)) != 0
        || function_has_attribute(fn, "nvvm.kernel");
}
The bit is only a cache; the source of truth remains the string attribute, which is what IR-text consumers see. Dropping the bit but keeping the attribute is correct (just slower); setting the bit but omitting the attribute breaks anything that round-trips the module through textual IR, because the bit has no textual form and the marker is lost on re-parse.
Bit-mask decoded for IPMSP clone-stamping
The linkage pass rewrites the flag word with four hard-coded masks. Each one names exactly one logical operation on the field.
| Mask | Width | Preserved bits | Cleared bits | Operation | Resulting state |
|---|---|---|---|---|---|
| 0xFCC0 | 16 | 6, 7, 10..15 | 0..5, 8, 9 | flags = (flags & 0xFCC0) \| 7 on non-kernels | LinkageTypes = InternalLinkage; DLLStorage, unnamed-addr, marker survive. |
| 0xF0 | 8 lo | 4..7 | 0..3 | flags_lo &= 0xF0 on kernels | Clears the 4-bit linkage nibble; visibility + DLLStorage low half preserved. |
| 0x30 | 8 lo | - | - (test) | (flags_lo & 0x30) != 0 on kernels | Tests visibility != Default. If non-zero, marker write is skipped. |
| 0x40 | 8 hi | - | - (OR) | flags_hi \|= 0x40 on kernels with Default visibility | Sets bit 14, the "nvvm.annotations_transplanted" marker. |
The asymmetry between non-kernels (0xFCC0 mask plus InternalLinkage) and kernels (0xF0 low-byte mask plus visibility test plus marker bit) is deliberate. Non-kernels exit after one rewrite; kernels preserve externally visible linkage state but add the NVIDIA marker when visibility allows it.
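The two rewrite paths can be exercised directly on integers. This sketch applies the four masks exactly as the table lists them; the function name and the boolean is_kernel parameter are illustrative, and the split into low and high bytes mirrors the byte-wide operations on the kernel path.

```python
INTERNAL_LINKAGE = 7


def stamp_flags(flags, is_kernel):
    """Apply the linkage-pass rewrite to a 16-bit GlobalValue flag word."""
    if not is_kernel:
        # One rewrite: keep bits 6, 7, 10..15 and force InternalLinkage.
        return (flags & 0xFCC0) | INTERNAL_LINKAGE
    lo = flags & 0xFF
    hi = (flags >> 8) & 0xFF
    lo &= 0xF0                # clear the 4-bit linkage nibble
    if (lo & 0x30) == 0:      # visibility == Default
        hi |= 0x40            # set bit 14, the transplanted marker
    return (hi << 8) | lo
```

A kernel with Hidden visibility (0x0017) comes out as 0x0010, with the linkage nibble cleared and no marker, while a default-visibility kernel (0x0007) comes out as 0x4000.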
nvvm.annotations 10-kind catalog
Ten distinct kinds are encoded by the legacy !nvvm.annotations named metadata node and the parallel "nvvm.<kind>" function-attribute form. Bit 14 gates short-circuiting between the two via isKernelFunction.
| # | Legacy MDString | Attribute form | Format |
|---|---|---|---|
| 1 | kernel | nvvm.kernel | i32 1 becomes an empty attribute. |
| 2 | maxntid{x,y,z} | nvvm.maxntid | Dim3 tuple serialized as "X,Y,Z". |
| 3 | reqntid{x,y,z} | nvvm.reqntid | Dim3 tuple serialized as "X,Y,Z". |
| 4 | cluster_dim_{x,y,z} | nvvm.cluster_dim | Dim3 tuple serialized as "X,Y,Z". |
| 5 | minctasm | nvvm.minctasm | i32 serialized as a decimal string. |
| 6 | maxnreg | nvvm.maxnreg | i32 serialized as a decimal string. |
| 7 | cluster_max_blocks / maxclusterrank | nvvm.maxclusterrank | i32 stored as an integer-valued attribute. |
| 8 | nvvm.blocksareclusters | nvvm.blocksareclusters | i32 1 becomes an empty attribute. |
| 9 | grid_constant | nvvm.grid_constant | Variadic 1-based indices. |
| 10 | — | nvvm.annotations_transplanted | Empty attribute plus bit 14 marker. |
Kinds 2..4 are dim3 triples serialized as comma-joined strings. Kinds 5..6 are string-valued scalars; kind 7 is the lone integer-valued attribute, matching the CUDA cluster-attribute shape. Kind 8's legacy MDString already carries the "nvvm." prefix and passes through unchanged. Kind 9 (grid_constant) is the only kind the transplanter does not rewrite; it remains in the legacy node. Kind 10 has no legacy MDString and exists only as the dual-encoded marker.
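The serialization rules in the catalog can be sketched as a dictionary-to-dictionary transform. Everything about the data model here is an assumption made for illustration: legacy annotations are a flat dict keyed by MDString, empty (unit) attributes are represented as None, a missing dim3 component defaults to 1, and the function returns the untouched legacy entries alongside the new attributes.

```python
DIM3_KINDS = {
    "maxntid": ("maxntidx", "maxntidy", "maxntidz"),
    "reqntid": ("reqntidx", "reqntidy", "reqntidz"),
    "cluster_dim": ("cluster_dim_x", "cluster_dim_y", "cluster_dim_z"),
}
STRING_SCALARS = ("minctasm", "maxnreg")               # kinds 5..6
INT_SCALARS = ("cluster_max_blocks", "maxclusterrank")  # kind 7
UNIT_KINDS = {"kernel": "nvvm.kernel",                 # kinds 1 and 8
              "nvvm.blocksareclusters": "nvvm.blocksareclusters"}


def transplant(legacy):
    """Return (attributes, remaining_legacy) for one function's annotations."""
    attrs, consumed = {}, set()
    for name, keys in DIM3_KINDS.items():
        if any(k in legacy for k in keys):
            dims = [legacy.get(k, 1) for k in keys]  # assumption: default 1
            attrs["nvvm." + name] = ",".join(str(d) for d in dims)
            consumed.update(keys)
    for key in STRING_SCALARS:
        if key in legacy:
            attrs["nvvm." + key] = str(legacy[key])  # decimal string
            consumed.add(key)
    for key in INT_SCALARS:
        if key in legacy:
            attrs["nvvm.maxclusterrank"] = int(legacy[key])  # integer-valued
            consumed.add(key)
    for key, attr in UNIT_KINDS.items():
        if legacy.get(key) == 1:
            attrs[attr] = None  # empty attribute
            consumed.add(key)
    attrs["nvvm.annotations_transplanted"] = None  # kind 10 marker
    remaining = {k: v for k, v in legacy.items() if k not in consumed}
    return attrs, remaining
```

A grid_constant entry passes through in the remaining dictionary, matching the note that kind 9 stays in the legacy node.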
Reimplementation Notes
for function in module.functions:
    if is_kernel(function):
        function.flags.linkage = external_kernel_linkage(function)
        if function.visibility == default:
            function.flags.nvvm_annotations_transplanted = true
            function.attrs["nvvm.annotations_transplanted"] = unit
    else:
        function.flags.linkage = internal
Keep the bit and the string attribute synchronized. Consumers may use the bit as a fast path, but tools that inspect IR text should still see the explicit nvvm.annotations_transplanted attribute.
GPU Execution Model
Abstract
NVIDIA GPUs execute kernels through a five-tier hierarchy: thread, warp, CTA, cluster, grid. Each tier has its own sync primitive, its own resource limit, and its own compiler-controlled launch attribute. Tileiras emits PTX directives at the kernel boundary that fix the launch contract, and intrinsics inside the body that target each tier's sync primitive. The choice of directive constrains every downstream decision — register allocation, occupancy, cluster-aware copy partitioning, and warp-group instruction legality all depend on the thread shape written at the .entry header.
The fence/arrive/wait protocols documented elsewhere assume this hierarchy is already established. This page is the canonical reference for the hierarchy itself: where each tier comes from, what synchronisation it offers, and how tileiras chooses the directive that pins the kernel to its shape.
The Five-Tier Hierarchy
A kernel launch is a grid of clusters of CTAs of warps of threads. Each tier has a defined cardinality cap, a sync primitive, and a resource binding that lives at that tier.
| Tier | Max size | Sync primitive | Sync scope | Resource binding |
|---|---|---|---|---|
| Thread | 1 | sequential program order | — | per-thread register file slice |
| Warp | 32 threads | shfl.sync, vote.sync, match.sync, redux.sync | intra-warp | per-warp register file partition |
| CTA / thread-block | 1024 threads (32 warps) | bar.sync (16 NamedBarrier slots), mbarrier.arrive, mbarrier.try_wait | intra-CTA | per-CTA shared memory |
| Cluster (SM90+) | 8 CTAs | cluster.arrive.relaxed + cluster.wait, DSMEM read/write through mapa | intra-cluster | per-cluster DSMEM windows |
| Grid | unbounded | cooperative-groups grid sync (host-coordinated launch only) | grid-wide | global memory |
Three structural facts follow from the table. First, every sync primitive is intra-tier — there is no hardware warp-to-warp sync inside a CTA other than going through a CTA-level barrier, and no hardware CTA-to-CTA sync inside a cluster other than going through cluster arrive/wait. Second, every tier above the warp has a cardinality cap that the compiler must verify against the requested launch shape. Third, the resource binding follows the tier: registers belong to the warp, shared memory belongs to the CTA, DSMEM belongs to the cluster, and global memory is the only resource visible grid-wide.
The CTA cap of 1024 threads is fixed by hardware: every SM90+ device exposes the same 32-warp upper bound. The cluster cap of 8 CTAs is the SM90 portable maximum; a kernel that must stay portable can pin that cap with .maxclusterrank 8, while larger cluster sizes are non-portable and require an explicit opt-in at launch. Below SM90 the cluster tier does not exist and the hierarchy collapses to four levels.
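The two cardinality caps can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the tileiras verifier: the function name, the tuple representation of shapes, and the message strings are hypothetical.

```python
MAX_THREADS_PER_CTA = 1024
PORTABLE_MAX_CLUSTER_CTAS = 8


def validate_launch_shape(block_dim, cluster_dim=None, sm=100):
    """Return a list of problems; an empty list means the shape passes."""
    problems = []
    threads = block_dim[0] * block_dim[1] * block_dim[2]
    if not 1 <= threads <= MAX_THREADS_PER_CTA:
        problems.append(
            f"CTA has {threads} threads; cap is {MAX_THREADS_PER_CTA}")
    if cluster_dim is not None:
        ctas = cluster_dim[0] * cluster_dim[1] * cluster_dim[2]
        if sm < 90:
            problems.append("cluster tier does not exist below SM90")
        elif ctas > PORTABLE_MAX_CLUSTER_CTAS:
            problems.append(
                f"cluster has {ctas} CTAs; portable cap is "
                f"{PORTABLE_MAX_CLUSTER_CTAS}")
    return problems
```

A (4, 4, 1) cluster, for example, multiplies out to 16 CTAs and is reported as exceeding the portable cap.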
Launch ABI
The path from host code to a running CTA crosses three layers: the CUDA driver, the GPU scheduler, and the SM front-end.
The host calls cuLaunchKernel (driver API) or cuLaunchKernelEx (extended API with cluster shape), or uses the runtime-API triple-chevron kernel<<<grid, block, smem, stream>>>(args...). The driver packs kernel parameters into the GPU's per-launch parameter memory (the .param address space, AS=101 in NVVM IR) and dispatches the launch descriptor to the GPU's grid scheduler. The grid scheduler enumerates CTAs (or clusters of CTAs on SM90+) and dispatches each onto an SM that has enough free SMEM and warp slots to host it. Inside the SM the warp scheduler picks warps from the resident CTAs and feeds them into the issue pipeline; the per-thread register file is statically partitioned among the resident warps at CTA dispatch time.
   host                      driver                         GPU
┌────────┐    cuLaunch    ┌──────────┐     submit      ┌────────────┐
│ kernel │───────────────▶│ pack     │────────────────▶│ grid       │
│ <<<>>> │                │ params   │                 │ scheduler  │
└────────┘                │ AS=101   │                 └─────┬──────┘
                          └──────────┘                       │ dispatch CTA/cluster
                                                             ▼
                                                       ┌────────────┐
                                                       │ SM         │
                                                       │ - warp pool│
                                                       │ - SMEM     │
                                                       │ - regs     │
                                                       └─────┬──────┘
                                                             │ issue warps
                                                             ▼
                                                       ┌────────────┐
                                                       │ execution  │
                                                       │ pipeline   │
                                                       └────────────┘
The driver never sees the warp tier. Warp scheduling is internal to the SM and follows the launch's thread-shape declaration. If the kernel's .reqntid says 128 threads per CTA, the SM dispatches four 32-thread warps for every CTA it admits. If the kernel's .maxnreg says 168 registers per thread, the SM partitions the register file so that at most register_file_size / (168 × 32) warps from this kernel can be resident on the SM at one time.
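The two formulas in that paragraph work out as follows. The 65,536-entry register file used in the example is an assumed value chosen for illustration, not a figure taken from this page, and the sketch ignores any allocation granularity the hardware may impose.

```python
WARP_SIZE = 32


def warps_per_cta(threads_per_cta):
    """Number of 32-thread warps the SM dispatches for one CTA."""
    return -(-threads_per_cta // WARP_SIZE)  # ceiling division


def max_resident_warps(register_file_size, maxnreg):
    """Upper bound on resident warps: register_file_size / (maxnreg * 32)."""
    return register_file_size // (maxnreg * WARP_SIZE)


def max_resident_ctas(register_file_size, maxnreg, threads_per_cta):
    """Whole CTAs that fit under the register bound alone."""
    return (max_resident_warps(register_file_size, maxnreg)
            // warps_per_cta(threads_per_cta))
```

With .reqntid 128 and .maxnreg 168, each warp needs 168 × 32 = 5,376 registers, so an assumed 65,536-register file holds 12 such warps, or three 4-warp CTAs.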
Tileiras emits no host-side code. The host is responsible for the launch call; tileiras only writes the kernel's static contract into the cubin via .entry directives, and the runtime/driver reads those directives back when packing the launch descriptor.
Cluster Execution Mechanics (SM90+)
Hopper introduces the cluster — a group of up to 8 CTAs scheduled together on adjacent SMs so they can read each other's shared memory and synchronise without going through global memory. The cluster is the only tier above the CTA that has hardware sync support; everything grid-wide must round-trip through cooperative-groups grid sync, which is host-coordinated and substantially more expensive.
Inside a cluster every CTA can:
- Read its position via %cluster_ctarank (0..7).
- Read the cluster size via %cluster_dim (one to eight).
- Map a local SMEM pointer into a peer's DSMEM window with nvvm.mapa, producing a pointer the issuing CTA can read/write but that physically lives in the peer's SMEM.
- Issue nvvm.cluster.arrive.relaxed to mark itself as ready, then nvvm.cluster.wait to block until every peer has arrived.
The DSMEM window has the same address space (addrspace(3), SMEM) as a CTA's own shared memory; mapa returns a pointer to the peer's bank in that address space. Reads and writes use ordinary ld.shared / st.shared instructions — the hardware routes them across the cluster network when the address falls inside a peer window. The rendezvous protocol that consumes this — mbarrier.expect_tx paired with cluster.arrive.relaxed and cluster.wait — is documented in Cluster Sync and DSMEM Handshake.
Cluster shape is declared at the .entry header through .reqnctapercluster X, Y, Z, the PTX lowering of the nvvm.cluster_dim attribute (plus .explicitcluster and optionally .maxclusterrank and .blocksareclusters). The driver reads the declared cluster shape and packs CTAs into clusters of that size before dispatch; CTAs in the same cluster are guaranteed to land on SMs that share a cluster network, which is what makes DSMEM physically routable.
Warp-Group Execution (SM90+)
Some SM90+ instructions are warp-group instructions: they require a 128-thread cooperating group (four contiguous warps) and read or write the warp group's register file as a unit. WGMMA on Hopper and tcgen05.mma on Blackwell are the canonical examples — they consume a 4-warp register block as the accumulator and one or two SMEM descriptors as inputs.
A warp-group instruction has three structural requirements:
1. The CTA's thread count must be a multiple of 128, since warps must align onto 4-warp groups without partials.
2. The warp group must be coherent: all four warps must reach the instruction together, or the ISA contract is broken.
3. Per-thread register usage must leave room for the warp group's accumulator fragment, since the accumulator lives in the register file.
Tileiras enforces (1) by emitting .reqntid (or .maxntid) with an X dimension that is a multiple of 128. The downstream lowering pass that emits WGMMA refuses to emit a wgmma.mma_async instruction when the kernel-spec's thread count is not a 128-multiple: the four-op protocol covered in WGMMA Emission Protocol needs four warps per group, and the scheduler's resource model assumes warp groups are atomic. Requirement (2) is the source of the wgmma.fence.aligned / wgmma.commit_group.sync.aligned / wgmma.wait_group.sync.aligned triple; each is .aligned precisely because it requires warp-group convergence. Requirement (3) is what drives the .maxnreg choice: an m64n256k16 WGMMA with an FP32 accumulator produces 64 × 256 = 16,384 accumulator elements, which spread over the 128 threads of a warp group come to 128 FP32 registers per thread for the accumulator slice alone, before counting descriptor and loop-index registers.
If the launch shape cannot satisfy these requirements, the legal options are to lower to a synchronous mma.sync form (slower, but available on every SM70+) or to refuse to compile and emit a diagnostic. Tileiras takes the second path when the kernel-spec explicitly requested a WGMMA atom but the thread shape disagrees.
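The decision described above reduces to a short branch. The sketch below is an illustration, not the tileiras pass: the function and exception names are hypothetical, and falling back to mma.sync when WGMMA was not explicitly requested is an assumption, since the text names it only as a legal option.

```python
WARP_GROUP_THREADS = 128  # four contiguous 32-thread warps


class ShapeError(Exception):
    """Raised when an explicit WGMMA request cannot be honoured."""


def choose_mma_lowering(threads_per_cta, wgmma_requested):
    """Pick an MMA form for the CTA shape, or refuse with a diagnostic."""
    if threads_per_cta % WARP_GROUP_THREADS == 0:
        return "wgmma.mma_async"  # shape aligns onto whole warp groups
    if wgmma_requested:
        # Explicit request plus a mismatched shape: diagnose, do not degrade.
        raise ShapeError(
            f"WGMMA requested but {threads_per_cta} threads per CTA "
            f"is not a multiple of {WARP_GROUP_THREADS}")
    return "mma.sync"  # assumption: silent fallback when nothing was requested
```

Refusing rather than degrading keeps an explicit kernel-spec request from silently producing a slower kernel than the author asked for.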
Kernel Attribute Decision
Every PTX directive at the .entry header has an MLIR attribute counterpart and a rule for when tileiras emits it. The verifier rules in Host Launch ABI + ptxas Knobs cover the well-formedness checks; the table below covers the decision policy — what input tileiras consults when picking each value.
| PTX directive | Source attribute | Tileiras input | Emission policy |
|---|---|---|---|
| .entry kernel_name | nvvm.kernel (UnitAttr) | LLVM function carries the marker | always emit when present; controls .entry vs .func |
| .maxntid X, Y, Z | nvvm.maxntid (1..3 i32) | upper bound from kernel-spec or user __launch_bounds__-style hint | emitted when the upper bound matters for register budgeting |
| .reqntid X, Y, Z | nvvm.reqntid (1..3 i32) | user-declared block shape, or 128-multiple forced by WGMMA/tcgen05 emission | emitted when the kernel relies on an exact shape (warp-group or specialized warps) |
| .minnctapersm N | nvvm.minctasm (i32) | occupancy hint from kernel-spec | emitted when the user requested an occupancy floor |
| .maxnreg N | nvvm.maxnreg (i32) | per-thread register cap from kernel-spec | emitted to bound register usage and let ptxas trade registers for occupancy |
| .explicitcluster | implied by nvvm.cluster_dim presence | any nvvm.cluster_dim attribute on the function | always emitted with .reqnctapercluster on SM90+ |
| .reqnctapercluster X, Y, Z | nvvm.cluster_dim (exactly 3 i32) | user-declared cluster shape | emitted on SM90+ when nvvm.cluster_dim is present |
| .maxclusterrank N | nvvm.maxclusterrank (i32) | portability cap from kernel-spec | emitted on SM90+ when the user wants a portable cluster cap |
| .blocksareclusters | nvvm.blocksareclusters (UnitAttr) | only legal when nvvm.reqntid and nvvm.cluster_dim are also present | emitted on SM90+ when the cluster shape is (1, 1, 1) and the user opts in |
The driver decides which directives apply per target: SM89 and earlier suppress every cluster directive, even when the IR carries nvvm.cluster_dim, because ptxas would reject them. Cluster-shaped kernels are not portable to pre-SM90 targets without a recompile that drops the cluster directives entirely.
The .maxntid versus .reqntid distinction is the most subtle: .maxntid is an upper bound that lets ptxas size the per-thread register fragment without committing to an exact launch shape, while .reqntid is a hard contract — a launch with a different shape is rejected by the driver. Tileiras emits .maxntid for kernels that adapt to launch shape, and .reqntid for kernels whose lowering already baked in a specific thread count (every WGMMA-using kernel, since the four-warp group is mandatory; every warp-specialized kernel, since the producer/consumer split partitions named warps).
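The policy table can be read as a small function from attributes and target to a directive list. The sketch below is one possible rendering: `KernelAttrs` and `entry_directives` are hypothetical names, the struct is not tileiras' internal representation, and the emission order is an assumption chosen for readability. Only the attribute-to-directive mapping and the pre-SM90 suppression rule come from the text above.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical mirror of the function attributes listed in the table above.
struct KernelAttrs {
    std::optional<std::array<int, 3>> maxntid;      // nvvm.maxntid
    std::optional<std::array<int, 3>> reqntid;      // nvvm.reqntid
    std::optional<int> minctasm;                    // nvvm.minctasm
    std::optional<int> maxnreg;                     // nvvm.maxnreg
    std::optional<std::array<int, 3>> cluster_dim;  // nvvm.cluster_dim
    std::optional<int> maxclusterrank;              // nvvm.maxclusterrank
};

// Formats "<name> x, y, z".
static std::string triple(const std::string &name, const std::array<int, 3> &v) {
    return name + " " + std::to_string(v[0]) + ", " + std::to_string(v[1]) +
           ", " + std::to_string(v[2]);
}

// Returns the directives to print after the parameter list.
// sm is the numeric target, for example 90 for sm_90.
std::vector<std::string> entry_directives(const KernelAttrs &a, int sm) {
    std::vector<std::string> out;
    if (a.maxntid) out.push_back(triple(".maxntid", *a.maxntid));
    if (a.reqntid) out.push_back(triple(".reqntid", *a.reqntid));
    if (a.minctasm) out.push_back(".minnctapersm " + std::to_string(*a.minctasm));
    if (a.maxnreg) out.push_back(".maxnreg " + std::to_string(*a.maxnreg));
    if (sm >= 90) {  // pre-SM90 targets drop every cluster directive
        if (a.cluster_dim) {
            out.push_back(".explicitcluster");
            out.push_back(triple(".reqnctapercluster", *a.cluster_dim));
        }
        if (a.maxclusterrank)
            out.push_back(".maxclusterrank " + std::to_string(*a.maxclusterrank));
    }
    return out;
}
```

Calling `entry_directives` with the same attributes for target 89 and target 90 shows the portability point directly: the SM89 result simply lacks the cluster lines.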
Worked Example: WGMMA Kernel Launch
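Consider a Hopper GEMM kernel launched from the host through cudaLaunchKernelEx, since the plain triple-chevron syntax has no slot for a cluster dimension:
cudaLaunchConfig_t cfg = {};
cfg.gridDim = dim3(4, 1, 1);              // grid: 4 CTAs = 2 clusters along X
cfg.blockDim = dim3(128, 1, 1);           // block: 128 threads = 4 warps = 1 warp-group
cfg.dynamicSmemBytes = 48 * 1024;         // dynamic SMEM: 48 KiB per CTA
cfg.stream = stream;
cudaLaunchAttribute attr = {};
attr.id = cudaLaunchAttributeClusterDimension;
attr.val.clusterDim.x = 2;                // cluster: 2 CTAs per cluster
attr.val.clusterDim.y = 1;
attr.val.clusterDim.z = 1;
cfg.attrs = &attr; cfg.numAttrs = 1;
cudaLaunchKernelEx(&cfg, gemm_kernel, A, B, C, D, M, N, K);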
The launch shape decomposes to:
- One grid of 2 clusters along X.
- Each cluster has 2 CTAs (the X dimension of the cluster shape).
- Each CTA has 128 threads (one warp group).
- Total: 4 CTAs × 128 threads = 512 threads in 2 clusters.
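The decomposition above is plain multiplication and division, sketched here for checking other launch shapes. The sketch assumes the CUDA convention that the grid dimension counts CTAs (so two 2-CTA clusters correspond to a grid of 4 CTAs); `Dim3`, `LaunchShape`, and `decompose` are illustrative names, not part of any CUDA or tileiras API.

```cpp
#include <cassert>
#include <cstdint>

struct Dim3 { std::uint32_t x, y, z; };  // local stand-in for CUDA's dim3

struct LaunchShape {
    std::uint64_t ctas;      // CTAs in the whole grid
    std::uint64_t clusters;  // clusters in the whole grid
    std::uint64_t threads;   // threads in the whole grid
};

// grid counts CTAs; cluster counts CTAs per cluster. Every grid dimension
// is assumed to be a multiple of the matching cluster dimension.
LaunchShape decompose(Dim3 grid, Dim3 block, Dim3 cluster) {
    assert(grid.x % cluster.x == 0 && grid.y % cluster.y == 0 &&
           grid.z % cluster.z == 0);
    const std::uint64_t ctas = std::uint64_t(grid.x) * grid.y * grid.z;
    const std::uint64_t per_cluster = std::uint64_t(cluster.x) * cluster.y * cluster.z;
    const std::uint64_t per_cta = std::uint64_t(block.x) * block.y * block.z;
    return {ctas, ctas / per_cluster, ctas * per_cta};
}
```

With a grid of (4, 1, 1), a block of (128, 1, 1), and a cluster of (2, 1, 1), the function reproduces the figures in the list: 4 CTAs, 2 clusters, 512 threads.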
The driver packs the seven kernel parameters (four pointers and three 32-bit integers) into PMEM, encodes the cluster shape (2, 1, 1) into the launch descriptor, and dispatches both clusters to the GPU scheduler. The scheduler picks two SMs that share a cluster network, say SM 0 and SM 1, and places cluster 0's CTAs (rank 0 on SM 0, rank 1 on SM 1) on them. Cluster 1 lands on a different SM pair (say SM 2 and SM 3). Each SM partitions its register file to leave 168 registers per thread, allowing the 32-FP32 WGMMA accumulator slice to fit alongside the descriptor and loop-index registers.
Inside cluster 0, CTA 0 (%cluster_ctarank = 0) and CTA 1 (%cluster_ctarank = 1) cooperate on the multicast TMA load that feeds the WGMMA. The producer warp on CTA 0 issues a multicast cp.async.bulk.tensor whose destination addresses span CTA 0's SMEM and CTA 1's DSMEM window; the rendezvous goes through the transaction-mbarrier handshake covered in Cluster Sync and DSMEM Handshake. Once the handshake clears, each CTA's four-warp warp group runs the four-op WGMMA protocol on its own SMEM tile.
Tileiras emits the kernel header for this kernel as:
.version 8.4
.target sm_90a
.address_size 64
.entry gemm_kernel(
.param .u64 gemm_param_0, // A
.param .u64 gemm_param_1, // B
.param .u64 gemm_param_2, // C
.param .u64 gemm_param_3, // D
.param .u32 gemm_param_4, // M
.param .u32 gemm_param_5, // N
.param .u32 gemm_param_6 // K
)
.reqntid 128, 1, 1 // four-warp warp group, mandatory
.maxnreg 168 // accumulator fits, leaves occupancy room
.explicitcluster
.reqnctapercluster 2, 1, 1 // pair CTAs into 2-CTA clusters
{
// ... TMA descriptor encode, mbarrier init ...
// ... cluster.arrive + WGMMA four-op protocol per K tile ...
// ... epilogue: load C, add, TMA store of D ...
}
Five directives form one coherent launch contract. .entry declares the symbol as a kernel. .reqntid 128, 1, 1 commits the launch to a 128-thread CTA, which fixes the warp count at four and lets WGMMA emission succeed. .maxnreg 168 reserves enough registers for the accumulator fragment plus working registers. .explicitcluster and .reqnctapercluster 2, 1, 1 tell the driver to dispatch two-CTA clusters so that DSMEM addresses across %cluster_ctarank XOR 1 resolve.
If any directive disagrees with the launch — say the host calls with block = (96, 1, 1) — the driver rejects the launch at submission time because .reqntid is a hard contract. If the kernel were compiled with .maxntid 128, 1, 1 instead, the launch would succeed for block = (96, 1, 1) but the WGMMA would silently consume an incomplete warp group, racing or hanging on the wgmma.commit_group.sync.aligned. The choice between .maxntid and .reqntid is therefore not a stylistic preference: WGMMA-using kernels must commit to .reqntid to make the contract enforceable.
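The difference between the two directives can be modelled in a few lines. This is a simplified sketch, not the driver's actual check: `launch_accepted` and `warp_groups_complete` are hypothetical names, and the .maxntid bound is modelled on the total thread count, which is an assumption made to keep the example short.

```cpp
#include <array>
#include <cassert>

using Shape = std::array<unsigned, 3>;  // block shape as (x, y, z)

enum class NtidKind { Max, Req };  // .maxntid versus .reqntid

constexpr unsigned thread_count(const Shape &s) { return s[0] * s[1] * s[2]; }

// .reqntid demands an exact shape match; .maxntid only bounds the thread
// count from above (simplified model of the launch-time check).
bool launch_accepted(NtidKind kind, const Shape &declared, const Shape &block) {
    if (kind == NtidKind::Req) return block == declared;
    return thread_count(block) <= thread_count(declared);
}

// A WGMMA kernel is only well-formed when every warp group is complete.
bool warp_groups_complete(const Shape &block) {
    return thread_count(block) != 0 && thread_count(block) % 128 == 0;
}
```

A block of (96, 1, 1) against a declared (128, 1, 1) is rejected under `NtidKind::Req` but accepted under `NtidKind::Max`, while `warp_groups_complete` is false in both cases. That combination, accepted yet incomplete, is the silent failure the paragraph above describes.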
Cross-References
Host Launch ABI + ptxas Knobs documents the verifier rules and PTX directive emission order for every kernel attribute the policy table above references.
Cluster Sync and DSMEM Handshake covers the cluster-tier rendezvous protocol — the cluster.arrive.relaxed / cluster.wait pair and its DSMEM transaction-byte extension — that this page treats as a black box.
Blackwell 2-CTA and 4-CTA MMA shows the cluster-side copy fan-out that consumes the cluster shape declared at the .entry header.
WGMMA Emission Protocol documents the four-op fence/MMA/commit/wait sequence that the warp-group tier requires.
mbarrier State Machine defines the 64-bit shared-memory object that the cluster handshake reads and writes through nvvm.mbarrier.* operations.
tcgen05 Tensor Memory Model covers the Blackwell successor to WGMMA, where the warp-group accumulator moves out of the register file and into tensor memory.
DSL to PTX End-to-End walks a representative kernel through every stage of the cascade, including the .entry header that this page focuses on.
Memory Hierarchy and Data Flow
Abstract
NVIDIA GPUs expose seven distinct memory spaces with order-of-magnitude differences in bandwidth, latency, capacity, and visibility scope. A Blackwell-class kernel routinely touches five of them in one mainloop: kernel parameters arrive in parameter memory, operands stage from global memory through shared memory into tensor memory, accumulators live in tensor memory or registers, and spills land in thread-local memory backed by global memory. Tileiras tracks every pointer's address space through the lowering cascade, validates address-space-aware operations at each layer, and emits PTX that names each space explicitly so ptxas can issue the right state-space-qualified memory instructions.
The compiler enforces a single invariant: every pointer-typed SSA value has a known address space by the time it reaches the backend, or it has provably failed to converge and emits an assuming global memory space diagnostic. The rest of the pipeline — operand staging, async-copy lowering, alias analysis, register allocation — assumes that invariant and breaks if a generic pointer slips past memory-space optimization without a concrete tag.
This page is the canonical reference for the spaces themselves, their representation at every compilation stage, and the data-flow patterns that move operands between them. AddrSpace Vote Lattice covers the inference algorithm in depth; MemorySpaceOpt and process-restrict covers the propagation pass that runs the algorithm; this page provides the orientation that ties the two together.
The Seven Memory Spaces
Every NVPTX address space corresponds to a distinct hardware structure with its own physical realisation. The PTX state-space modifier, the LLVM address-space number, and the MLIR address-space attribute must all agree on which structure a pointer names — the three encodings are isomorphic and disagreement is a verifier error.
| Space | PTX state space | LLVM AS | MLIR encoding | Capacity (H100/B100) | Bandwidth | Visibility |
|---|---|---|---|---|---|---|
| Global | .global | 1 | #gpu.address_space<global> | tens of GB | ~3 TB/s | device-wide |
| Shared | .shared | 3 | #gpu.address_space<workgroup> | 228 KB / SM | ~20 TB/s | CTA |
| Distributed Shared | .shared::cluster | 7 | #nvvm.shared_space<cluster> | cluster_size × 228 KB | DSMEM network | cluster |
| Constant | .const | 4 | #gpu.address_space<uniform_constant> | 64 KB / module | broadcast | device |
| Local | .local | 5 | (LLVM stack) | bounded by spill budget | GMEM-backed | thread |
| Tensor (SM100+) | .tmem (proxy via tcgen05) | 6 | TMEM dialect handle | 512 cols × 128 rows / SM | per-SM | SM |
| Parameter | .param | 101 | (kernel arg, byval) | <= 4 KB / launch | broadcast | per-launch |
| Generic | (unqualified) | 0 | (unqualified pointer) | union of above | via cvta | (resolved at runtime) |
| Register | (named regs) | n/a | (LLVM virtual reg) | 64K 32-bit registers (256 KB) / SM | warp-private | warp |
Generic is the lattice's top element — a pointer that has not been refined to one concrete space. The hardware supports it through cvta.to.X instructions that decode the high-order address bits to recover the concrete space, but every generic-pointer dereference costs an extra cycle and prevents the backend from issuing a state-space-qualified load. Memory-space optimization exists to eliminate generic pointers wherever a concrete provenance can be proved.
Tensor memory is the SM100 newcomer. It is per-SM, addressed in a 128-row dense grid, and reachable only through the tcgen05 instruction family — no ldg, no cp.async, no register-to-TMEM move outside that family. See tcgen05 Tensor Memory Model for the allocator contract and operand-residency rules.
Distributed shared memory is shared memory addressed across cluster CTAs. The pointer is still addrspace(3) at the LLVM level, but a separate addrspace(7) exists for pointers that have already been translated through nvvm.mapa and name a peer CTA's shared region. The translation itself is a hardware instruction; the peer CTA's allocator must have placed the destination at the same SMEM offset for the translated pointer to be meaningful.
Register and parameter spaces sit at the extremes. Registers are warp-private, do not carry a pointer type at the LLVM level (they appear as virtual registers in MIR), and have no addrspace encoding because they cannot be the target of a load or store. Parameter memory is the kernel-argument buffer the driver fills before launch: in PTX it appears as ld.param, in LLVM IR as addrspace(101) pointers that the LowerArgs pass converts to direct loads.
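To make the numbering concrete, the sketch below maps the LLVM address-space numbers from the table to the state-space spellings shown next to them. The enum and function names are illustrative only; the `.tmem` string is the table's label for tensor memory rather than a claim about PTX syntax, and registers are omitted because they have no address-space number.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// LLVM address-space numbers as listed in the table above.
enum class AddrSpace : unsigned {
    Generic = 0, Global = 1, Shared = 3, Const = 4,
    Local = 5, Tensor = 6, SharedCluster = 7, Param = 101,
};

// State-space spelling from the table; empty for generic pointers.
constexpr std::string_view ptx_state_space(AddrSpace as) {
    switch (as) {
    case AddrSpace::Generic:       return "";
    case AddrSpace::Global:        return ".global";
    case AddrSpace::Shared:        return ".shared";
    case AddrSpace::Const:         return ".const";
    case AddrSpace::Local:         return ".local";
    case AddrSpace::Tensor:        return ".tmem";
    case AddrSpace::SharedCluster: return ".shared::cluster";
    case AddrSpace::Param:         return ".param";
    }
    return "";
}

// Converts a raw addrspace(N) number, rejecting numbers the table omits.
constexpr std::optional<AddrSpace> from_llvm_addrspace(unsigned n) {
    switch (n) {
    case 0: case 1: case 3: case 4: case 5: case 6: case 7: case 101:
        return static_cast<AddrSpace>(n);
    default:
        return std::nullopt;
    }
}
```

A consumer that receives an unknown number, such as 2, gets an empty optional instead of a silently wrong state space.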
Address Space at Each Compiler Stage
The same memory space appears under different encodings at every stage of the lowering. Tileiras stage transitions preserve the address-space tag — a pointer that was global at the cuda_tile level is still global at the LLVM level — but the encoding changes from a dialect-specific attribute to a generic LLVM addrspace(N) numeric token to a PTX state-space modifier.
| Stage | How address spaces appear | Where the tag lives |
|---|---|---|
| cuda_tile | !cuda_tile.ptr<f32> and !cuda_tile.partition_view<...> | Type attribute, optional addrspace annotation |
| nv_tileaa | !llvm.ptr with addrspace attribute on memrefs | Operation-side memory-space attribute |
| nv_tileas | TMA descriptor types, async-copy ops with explicit space args | TMA descriptor + per-op mem_space enum |
| cute_nvgpu | Copy/MMA atoms tagged with operand residency | Atom metadata in the dialect attribute |
| LLVM IR | ptr addrspace(N) typed pointers | Pointer type, propagated through SSA |
| NVPTX MIR | Per-instruction state-space encoding | MachineMemOperand::getAddrSpace() |
| PTX | State-space modifier on every memory instruction | Lexical .global / .shared / .tmem etc. |
The boundary that matters most is between tileas and LLVM. Up to tileas, address spaces live on operation attributes — a cp.async op carries its source and destination spaces as enum operands, and the verifier rejects mismatched pairings. Below LLVM, address spaces live on the pointer type — a ptr addrspace(1) is structurally distinct from ptr addrspace(3), and the LLVM verifier rejects assignments between them without an explicit addrspacecast. The cute-to-LLVM lowering is the pass that translates the operation-attribute encoding into the type encoding; see cute and cute_nvgpu to LLVM.
Two stages have no native address-space encoding and rely on context. The cuda_tile dialect carries address spaces only as optional annotations on memrefs — the public surface is shape-typed, not space-typed, and the inference cascade fills in the missing tags. The PTX layer has no encoding at all: state spaces are part of the instruction mnemonic, and ptxas sees them as lexical tokens that select the right hardware opcode at assembly time.
The Address-Space Inference Algorithm
MemorySpaceOpt runs a finite-height-lattice forward data-flow analysis over the SSA graph. The lattice has one bottom element (BOTTOM, unknown), one top element (GENERIC, conflict), and six concrete address-space elements that form an antichain in between. The meet of two distinct concrete elements is GENERIC; the meet of BOTTOM with any element is the other element. Convergence is bounded by 2 × |pointer values| because each pointer can be refined at most twice (BOTTOM → concrete → GENERIC) before reaching a fixed point.
AddressSpace meet(AddressSpace a, AddressSpace b) {
if (a == AS_BOTTOM) return b;
if (b == AS_BOTTOM) return a;
if (a == b) return a;
return AS_GENERIC;
}
void propagate(Lattice *lat, Function *fn) {
seed_from_kernel_arguments(lat, fn); /* global / constant / byval seeds */
while (lat_changed(lat)) {
for (Instruction *inst : fn->pointer_instructions) {
switch (inst->opcode) {
case GEP:
case BITCAST:
lat_set(lat, inst, lat_get(lat, inst->operand[0]));
break;
case PHI:
case SELECT:
lat_set(lat, inst, lat_meet_all_incoming(lat, inst));
break;
case ADDR_SPACE_CAST:
lat_set(lat, inst, inst->target_as); /* force, do not inherit */
break;
case CALL:
propagate_call_args(lat, inst); /* backward into caller */
propagate_call_ret(lat, inst); /* forward from callee */
break;
case WMMA:
case CP_ASYNC_BULK:
lat_refine_backward(lat, inst->pointer_operand, AS_GLOBAL);
break;
}
}
}
}
The seeds come from kernel-argument attributes: pointers tagged with the kernel-pointer attribute start at GLOBAL, grid-constant arguments start at CONSTANT, and byval struct arguments start at GENERIC because they cross a true generic boundary at the launch site. Backward refinement at WMMA and async-bulk sites adds the only non-monotone edge — WMMA forces GLOBAL on the operand chain because the hardware does not implement WMMA against any other space.
The full data-flow algorithm, the four red-black trees that hold per-block lattice state, the "nvvm.as" attribute that publishes results across passes, and the clone-budget that bounds inter-procedural specialization all live in AddrSpace Vote Lattice.
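A reduced, runnable version of the forward part of the analysis may help make the fixed point concrete. It is a toy model under stated assumptions: values are numbered nodes, each node is either a seed or a copy-like or merge-like operation over earlier nodes, and calls, casts, and backward refinement are left out entirely. The `Value` struct and this `propagate` signature are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Space { BOTTOM, GLOBAL, SHARED, CONST, LOCAL, TMEM, DIST_SH, GENERIC };

// Same combination rule as the pseudocode above.
Space meet(Space a, Space b) {
    if (a == BOTTOM) return b;
    if (b == BOTTOM) return a;
    return a == b ? a : GENERIC;
}

// Toy pointer value: a seed (kernel argument) or an operation whose result
// space is derived from its operands (GEP and bitcast have one operand,
// PHI and select have several).
struct Value {
    Space seed = BOTTOM;        // non-BOTTOM only for seeds
    std::vector<int> operands;  // indices of the defining values
};

// Sweeps until nothing changes; returns the number of sweeps performed.
int propagate(const std::vector<Value> &vals, std::vector<Space> &state) {
    state.assign(vals.size(), BOTTOM);
    int sweeps = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        ++sweeps;
        for (std::size_t i = 0; i < vals.size(); ++i) {
            Space next = vals[i].seed;
            for (int op : vals[i].operands) next = meet(next, state[op]);
            next = meet(state[i], next);  // a value never moves back down
            if (next != state[i]) {
                state[i] = next;
                changed = true;
            }
        }
    }
    return sweeps;
}
```

Because each value can only move from BOTTOM to a concrete space and then to GENERIC, every node changes at most twice, which is the bound quoted above. A PHI that merges a GLOBAL chain with a SHARED seed ends at GENERIC, while a PHI over two GLOBAL chains stays GLOBAL.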
Data Flow Examples
The three examples below trace one pointer per pattern from the kernel-launch boundary to its operand-residency destination. Each example names the address-space transitions and identifies which cvta conversions survive into PTX versus which the optimizer eliminates.
Example 1: Kernel Parameter to SMEM Stage
A typical Hopper TMA mainloop loads a GMEM tile into SMEM through cp.async.bulk.tensor. The kernel parameter starts in parameter memory and reaches SMEM through one address-space transition and one async copy:
.param .u64 A_ptr; // PMEM (LLVM addrspace(101))
.shared.align 1024 .b8 A_smem[16384]; // SMEM (LLVM addrspace(3))
ld.param.u64 %rd1, [A_ptr]; // PMEM -> register (no AS change)
cvta.to.global.u64 %rd2, %rd1; // register -> GMEM-tagged pointer
mov.u64 %rd3, A_smem; // SMEM base address as a register
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
[%rd3], [%tma_desc, {%coord_m, %coord_n}], [%mbar];
Three transitions appear in the source IR but only one survives into PTX. The ld.param instruction is not an address-space cast — it is a load from .param into a register, and the register has no address space. The cvta.to.global is the real conversion: it tells the hardware that the address now refers to global memory, and ptxas emits it as a real instruction unless MemorySpaceOpt proved that the pointer was always global (in which case the cast folds to a no-op). The cp.async.bulk.tensor instruction names both source and destination spaces in its mnemonic; no further cast is needed.
The mbarrier object that completes the copy lives in SMEM and the TMA descriptor lives in GMEM. Both pointers are name-only operands to the cp.async.bulk.tensor instruction; the hardware tracks their spaces from the mnemonic, not from any pointer type carried into the instruction.
Example 2: WGMMA Operand Staging
A Hopper WGMMA mainloop reads operand A from SMEM (or optionally from registers) and operand B from SMEM, and accumulates into a register-resident fragment. The end-to-end staging pattern is GMEM → SMEM → WGMMA → register → SMEM → GMEM:
+---------+ cp.async.bulk.tensor +---------+
A in GMEM (AS=1) ---| |---------------------------->| |
| TMA | | A SMEM |
B in GMEM (AS=1) ---| desc |---------------------------->| B SMEM |
+---------+ | (AS=3) |
+----+----+
|
SMEM descriptor (64-bit packed)
|
v
+---------+--------+
| wgmma.mma_async |
| (consumes B as |
| SMEM descrip- |
| tor, A as desc |
| or RF) |
+---------+--------+
|
accumulator in RF
|
v
+---------+--------+
| wgmma.wait_group |
+---------+--------+
|
v
stmatrix to SMEM
|
v
st.global to GMEM
The accumulator lives in registers for the whole mainloop. SMEM operands are referenced through a 64-bit descriptor — a packed address-plus-layout immediate, not a pointer — that the operand-builder constructs once per tile and threads through the mma_async as an l-constraint i64. See WGMMA Emission Protocol for the descriptor bit layout and the fence/commit/wait sequence that orders the async MMA against subsequent reads.
The address-space transitions in this pattern are entirely between GMEM and SMEM. The descriptor construction is a pure arithmetic operation on a SMEM base offset — no cvta is involved. The stmatrix and st.global instructions at the epilogue name their spaces lexically, so the only generic pointer that could survive is the GMEM output base, which the kernel-pointer attribute pins to GLOBAL from the seed.
Example 3: tcgen05 With TMEM Allocation
The Blackwell pattern adds tensor memory between SMEM and the MMA. The accumulator moves from registers to TMEM; operand A optionally moves from registers to TMEM for weight-stationary chains; operand B stays in SMEM (the WGMMA descriptor format is preserved). The full staging pattern is GMEM → SMEM → TMEM → tcgen05.mma → TMEM → SMEM → GMEM:
+---------+ cp.async.bulk.tensor +---------+
A in GMEM (AS=1) ---| TMA |---------------------------->| SMEM |
B in GMEM (AS=1) ---| desc |---------------------------->| (AS=3) |
+---------+ +----+----+
|
tcgen05.cp (SMEM -> TMEM)
v
+-----------------------+ +----------+
| TMEM accumulator | | TMEM A |
| (allocated via | | (AS=6, |
| tcgen05.alloc.shared)| | weight- |
| (AS=6, 128-row grid) | | stationary)|
+-----+-----------------+ +-----+----+
| |
| B SMEM descriptor |
| |
+----- tcgen05.mma ----------+
|
v
TMEM accumulator
|
tcgen05.ld to RF
|
v
stmatrix to SMEM
|
v
st.global to GMEM
The TMEM allocator op nvvm.tcgen05.alloc returns an addrspace(6) handle; every subsequent tcgen05.mma op consumes the handle as a 32-bit base address plus row/column descriptor. The handle lifetime is scoped to the enclosing dialect operation — there is no way to pass a TMEM handle out of the function it was allocated in, and the allocator op must dominate every MMA op that uses the handle. See tcgen05 Tensor Memory Model for the allocator contract and the variant taxonomy.
The cooperative 2-CTA MMA variant shares TMEM across two CTAs in a cluster: CTA 0 holds rows [0..M/2) and CTA 1 holds rows [M/2..M). The two halves never exchange data through TMEM directly — the only inter-CTA path on the data side is through DSMEM (distributed shared memory, addrspace(7)) via the nvvm.mapa address translation. See Cluster Sync and DSMEM Handshake for the rendezvous protocol.
Address-Space Transition Table
The hardware implements a fixed set of address-space conversions through the cvta instruction family. The legal conversions form an asymmetric matrix: every concrete space can be converted to generic, but the reverse direction is conditional on the runtime address actually naming the target space. A cvta.to.global on a pointer that points into shared memory is undefined behaviour at runtime; the compiler issues it only when the lattice has proved the pointer is global, or when the user has explicitly requested it through an intrinsic.
| From | To | Instruction | Always legal? | Notes |
|---|---|---|---|---|
| GMEM | Generic | cvta.global | yes | Identity in PTX; the high bits already name global |
| SMEM | Generic | cvta.shared | yes | Sets the SMEM marker in the high bits of the address |
| CMEM | Generic | cvta.const | yes | Same as SMEM, with the constant marker |
| LMEM | Generic | cvta.local | yes | Per-thread stack window |
| PMEM | Generic | implicit via ld.param | yes | The PMEM-to-generic conversion happens inside ld.param |
| TMEM | Generic | (no such cast) | no | TMEM has no generic representation; only tcgen05 reads it |
| DSMEM | Generic | cvta.shared after nvvm.mapa | yes | The mapa translation produces an SMEM-tagged pointer first |
| Generic | GMEM | cvta.to.global | conditional | UB if the pointer does not name global memory |
| Generic | SMEM | cvta.to.shared | conditional | UB if not shared |
| Generic | CMEM | cvta.to.const | conditional | UB if not constant |
| Generic | LMEM | cvta.to.local | conditional | UB if not local |
| Generic | TMEM | (no such cast) | no | TMEM is unreachable from generic |
| Generic | PMEM | (no such cast) | no | PMEM is only readable through ld.param |
| GMEM | SMEM | (none) | no | Must go through generic; usually a bug if it appears |
| SMEM | GMEM | (none) | no | Same as above |
| SMEM | DSMEM | nvvm.mapa | conditional | Requires a multi-CTA cluster and a valid peer rank |
The conditional conversions are the source of every assuming global memory space warning the compiler emits. When the lattice fails to prove a generic pointer's concrete space, the rewriter assumes GLOBAL and emits a warning; the backend then issues a cvta.to.global that is correct only if the pointer was in fact global at runtime. The warning is the user's signal to either add a kernel-pointer attribute, replace a pointer-of-pointer indirection with a direct argument, or restructure the kernel to avoid the generic boundary entirely.
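The transition table can be encoded as a lookup, which is convenient for a verifier-style check. The sketch below follows the rows of the table; pairs that the table does not list are treated as illegal, and identity conversions are treated as always legal, both of which are assumptions of this sketch. The enum and function names are illustrative.

```cpp
#include <cassert>

enum class Mem { Generic, GMEM, SMEM, CMEM, LMEM, PMEM, TMEM, DSMEM };
enum class Legality { Always, Conditional, Never };

// Classifies a conversion according to the transition table above.
Legality cast_legality(Mem from, Mem to) {
    if (from == to) return Legality::Always;  // no-op, assumed legal
    if (to == Mem::Generic)                   // concrete -> generic
        return from == Mem::TMEM ? Legality::Never : Legality::Always;
    if (from == Mem::Generic) {               // generic -> concrete
        switch (to) {
        case Mem::GMEM:
        case Mem::SMEM:
        case Mem::CMEM:
        case Mem::LMEM:
            return Legality::Conditional;     // UB if the runtime space differs
        default:
            return Legality::Never;           // TMEM, PMEM, unlisted pairs
        }
    }
    if (from == Mem::SMEM && to == Mem::DSMEM)
        return Legality::Conditional;         // nvvm.mapa, needs a valid peer
    return Legality::Never;                   // e.g. GMEM <-> SMEM directly
}
```

A pass could use the `Conditional` result as the trigger for the warning path: the conversion is emitted, but only a proof from the lattice makes it safe.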
What the Compiler Enforces
Address-space rules are enforced at three levels: the dialect verifier, the LLVM verifier, and the PTX assembler.
Pointer arithmetic stays within one address space. A GEP, bitcast, or PHI cannot change a pointer's address space; only addrspacecast (LLVM) or cvta (PTX) can. The lattice propagator relies on this: every non-cast pointer operation inherits its address-space tag from its operand, so a single seed propagates to the whole def-use tree. A kernel that mixes a GEP with a different-space operand fails the LLVM verifier with a type mismatch.
TMA descriptor pointers must be GMEM. The descriptor itself is a packed 128-byte structure that names the multi-dimensional tile layout, and the hardware reads it through a global-memory load before issuing the async copy. The verifier rejects any cp.async.bulk.tensor op whose descriptor operand is not GMEM-tagged.
SMEM allocations are fixed for the lifetime of a launch. The kernel declares its static shared-memory footprint at compile time through .shared directives in PTX, and any dynamic portion is sized once by the byte count the launch site passes as a kernel-launch parameter. There is no in-kernel allocation, no malloc in SMEM, no growth at runtime. A kernel that needs a larger SMEM footprint must be relaunched with a larger size, and the launch is constrained by the SM's total SMEM capacity of 228 KB.
TMEM allocations are scope-bound to a specific dialect operation. The allocator returns a handle that is an SSA value; the allocator must dominate every use of the handle, and the matching deallocator must post-dominate every use so that no access can follow the release. The dialect does not allow TMEM regions to outlive their enclosing scope, and the kernel cannot pass a TMEM handle through a function call. The constraint is structural: TMEM does not survive the SM reset that occurs between CTAs scheduled on the same SM, so a function-call-crossing handle would dangle.
Local-memory atomics are rejected. The PTX architecture does not support atomic operations on .local, and the lattice walker turns a backward-inferred LOCAL tag at an atomic site into a hard error rather than letting the backend emit an unsupported instruction. The diagnostic is Cannot do atomic on local memory; the fix is almost always to move the atomic target to .shared or .global.
Param-memory pointers do not escape. The compiler turns every addrspace(101) load into a direct ld.param instruction at the function entry, and the resulting register has no address space. A pointer that survives as addrspace(101) past the LowerArgs pass is a bug — the parameter memory is only readable inside the kernel, and any function call that takes a parameter pointer must already have been inlined.
Cross-References
AddrSpace Vote Lattice is the canonical reference for the inter-procedural inference algorithm: the lattice, the four red-black trees, the clone-budget, and the "nvvm.as" attribute that publishes results across passes.
MemorySpaceOpt and process-restrict is the LLVM-IR pass that runs the inference and rewrites generic pointers. It covers the walker, the addrspacecast folder, the WMMA backward constraint, and the diagnostic catalog.
tcgen05 Tensor Memory Model is the canonical reference for tensor memory: the allocator, the operand-residency table, the variant taxonomy, and the collector cache.
Cluster Sync and DSMEM Handshake covers distributed shared memory: the nvvm.mapa translation, the cluster-barrier rendezvous, and the transaction-byte handshake that pairs an inter-CTA copy with its consumer.
WGMMA Emission Protocol covers the Hopper async MMA: the fence/commit/wait protocol, the 64-bit SMEM descriptor, and the accumulator-lifetime contract.
Lower-Args, Aggr, Struct covers parameter-memory lowering: how byval struct arguments become addrspace(101) pointers, how the LowerArgs pass converts them to direct loads, and the launch-argument check that validates the parameter footprint.
NVPTX Backend Passes Overview places the memory-space passes in the wider backend cluster.
Address-Space Vote Lattice
Abstract
Interprocedural memory-space propagation classifies each generic pointer argument with a small flat lattice. Every call-site observation for one callee argument folds into one of three useful states: no decision yet, all observations agree on one concrete NVPTX address space, or conflicting observations make specialization unsafe.
The lattice is deliberately tiny — one bottom element, six concrete address spaces, one absorbing poison element. That suffices for the cloner to decide whether a callee can be cloned with a narrower pointer type, and the same partition is reused by the type converter and NVVM alias analysis.
NVIDIA GPUs expose seven memory spaces — global, shared, distributed shared, constant, local, tensor (Blackwell), and parameter — plus a generic encoding that names the union when provenance is unknown. Memory Hierarchy and Data Flow is the orientation page for those spaces: it tabulates the PTX state-space names, the LLVM address-space numbers, the MLIR encodings, and the data-flow patterns that move operands between them. This page focuses on the lattice machinery that infers which space a pointer names; the orientation page provides the why.
Force-Inline and Specialize Callees is the data-flow companion: it covers the worklist driver, the kernel/image/size thresholds that decide between inlining and specialization, and the call-site rewrite mechanics. This page covers only the lattice itself, the four red-black trees that hold its working state, and the "nvvm.as" attribute that publishes results across passes. NVVMAA uses the same six-way partition for MayAlias/NoAlias, so the lattice page doubles as the canonical reference for any consumer reasoning about per-pointer NVPTX address spaces.
Worker Model
For each candidate function, the pass allocates one vote slot per argument. Pointer arguments start as UNDET. The classifier then walks the argument's uses at all relevant call sites and returns either a concrete NVPTX address space or a failure. Failure poisons the slot. Agreement keeps or adopts the concrete address space. Disagreement poisons the slot.
The outer algorithm is a fixed-point worklist. A successful clone or signature rewrite can make callers newly classifiable, so affected callers re-enqueue until the worklist drains or the clone budget is exhausted.
Lattice
| Element | Value | Role | Lattice position |
|---|---|---|---|
| UNDET | 1000 | no useful observation yet | bottom |
| GLOBAL | 1 | addrspace(1), global memory | concrete address-space element |
| SHARED | 3 | addrspace(3), shared memory | concrete address-space element |
| CONST | 4 | addrspace(4), constant memory | concrete address-space element |
| LOCAL | 5 | addrspace(5), local memory | concrete address-space element |
| TMEM | 6 | addrspace(6), tensor memory | concrete address-space element |
| DIST_SH | 7 | addrspace(7), distributed shared memory | concrete address-space element |
| POISON | 2000 | conflict or unclassifiable use | top, absorbing |
| UNSET | 256 | scratch state used inside one classifier invocation | outside vote slots |
The concrete address spaces form an antichain. None is more specific than another. The meet of two distinct concrete address spaces is POISON.
                TOP = POISON (2000)
                     |
+--------+--------+-------+-------+--------+
|        |        |       |       |        |
GLOBAL   SHARED   CONST   LOCAL   TMEM     DIST_SH   (1/3/4/5/6/7)
|        |        |       |       |        |
+--------+--------+-------+-------+--------+
                     |
                UNDET (1000)
AddressVote meet_address_vote(AddressVote old_vote, ClassifierResult observed) {
if (!observed.valid) {
return AS_POISON;
}
AddressVote new_vote = address_space_to_vote(observed.addrspace);
if (old_vote == AS_UNDET) {
return new_vote;
}
if (old_vote == new_vote) {
return old_vote;
}
return AS_POISON;
}
UNSET must stay separate from the vote lattice. It is a scratch sentinel used while one use is being classified; it should never be stored as the final vote for an argument.
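The following self-contained sketch shows how the per-observation rule folds a list of call-site observations into one vote slot. It reuses the numeric values from the lattice table (1000 for UNDET, 2000 for POISON, and the raw address-space number for concrete votes); modelling an observation as `std::optional<int>` and the helper name `fold_votes` are assumptions made for the example.

```cpp
#include <cassert>
#include <optional>
#include <vector>

constexpr int AS_UNDET = 1000;   // bottom: no useful observation yet
constexpr int AS_POISON = 2000;  // top: conflict or unclassifiable use

// One call-site observation: a concrete address-space number, or nullopt
// when the classifier failed on that use.
using Observation = std::optional<int>;

int meet_address_vote(int old_vote, Observation observed) {
    if (!observed) return AS_POISON;              // failure poisons the slot
    if (old_vote == AS_UNDET) return *observed;   // first observation wins
    return old_vote == *observed ? old_vote : AS_POISON;
}

// Folds every observation for one argument into its final vote.
int fold_votes(const std::vector<Observation> &sites) {
    int vote = AS_UNDET;
    for (const Observation &o : sites) vote = meet_address_vote(vote, o);
    return vote;
}
```

Three call sites that all pass a global pointer leave the slot at 1, so the callee can be specialized to addrspace(1). One disagreeing site, or one unclassifiable use, leaves the slot at 2000 for good, because POISON equals no concrete value and therefore never recovers.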
Clone-budget DoCloneForIpMsp
The clone budget is exposed as DoCloneForIpMsp, described as the maximal number of callees
specialized for a call base. It has three useful modes:
| Budget | Meaning |
|---|---|
| -1 | Unlimited cloning; this is the default. |
| 0 | No clone attempts; useful for inspecting candidates and diagnostics. |
| Positive N | Permit at most N clone attempts before suppressing further clones. |
The counter is attempt-based, not success-based. Charging attempts prevents recursive or ambiguous call graphs from repeatedly trying the same specialization shape.
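The three budget modes can be captured in a few lines. This is a minimal sketch, not the pass's actual bookkeeping: the CloneBudget type and try_charge name are invented for illustration, and only the -1 / 0 / N semantics and the attempt-based charging come from the description above.

```cpp
// Hypothetical budget tracker mirroring the DoCloneForIpMsp modes.
struct CloneBudget {
    int limit;     // -1 unlimited, 0 no attempts, N at most N attempts
    int attempts;  // charged per attempt, whether or not the clone succeeds

    explicit CloneBudget(int l) : limit(l), attempts(0) {}

    // Returns true when a clone may be attempted and charges that attempt.
    // A false return corresponds to a budget-suppressed candidate.
    bool try_charge() {
        if (limit >= 0 && attempts >= limit) return false;
        ++attempts;
        return true;
    }
};
```

Because the charge happens before the clone is tried, a specialization shape that keeps failing still drains the budget, which is the property the text relies on for recursive or ambiguous call graphs.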
Verbatim --dump-ip-msp transcript strings
When DumpIpMsp is enabled, the pass emits a compact transcript. The strings are intentionally
simple because they describe the fixed-point worklist:
Initial work list size :
arguments)
is cloned
avoid cloning of
callees are affected
: changed in argument memory space (
: return memory space is resolved :
The first line reports seed size. The argument-memory-space line reports resolved parameter slots.
avoid cloning of marks a budget-suppressed candidate. is cloned marks a successful clone.
callees are affected reports callers re-enqueued after a rewrite. The return-space line reports
the same lattice decision applied to return values.
Pass Entry sub_281D480
sub_281D480 is IPMSPPass::run, the function-level entry point of the Inter-Procedural Memory-Space Propagation pass. It reads DoCloneForIpMsp, DumpIpMsp, and a small handful of threshold options that tune the lattice, allocates the four per-function red-black trees described below, then dispatches the seeded worklist into the driver core sub_2858070. On termination it hands the converged lattice state to the I08 type converter sub_2882E00, where the inferred address spaces become a real "nvvm.as" attribute on the function.
The entry function is the only place that knows the pass-option layout. Everything below it works on raw lattice values and tree handles, which keeps the driver core small enough to inline its inner update step.
Driver Core sub_2858070
sub_2858070 is the driver body of the pass: 3 784 bytes of code, 234 basic blocks. It walks each function's basic blocks, lifts every pointer value to its address-space lattice element, and propagates joins across SSA def-use edges. Case analysis on MLIR operation kinds plus the four-tree update fast paths dominate the 234-block fan-out; the actual control flow is a single worklist loop guarded by a per-block budget cap.
The driver reads pointer address spaces directly out of PointerType rather than through the public MLIR accessor. Pointer types in tileiras carry their AS in PointerType::getSubclassData(), reached via a mask-less >> 8 of the kindAndSubclass u32 at Type+0x10. The call form inlines poorly, so the binary keeps a hand-rolled read in the inner loop, preventing the address-space read from showing up as a non-trivial call in the propagation hot path.
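The hand-rolled read amounts to a single shift. The sketch below assumes the packing implied by the text: a 32-bit word whose low 8 bits hold the type kind and whose upper 24 bits hold the subclass data. The function names and the pack helper are hypothetical and exist only to make the example checkable.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of the u32 at Type+0x10: bits 0..7 = kind, bits 8..31 = subclass data.
// For pointer types the subclass data is the address space.
inline std::uint32_t pointer_address_space(std::uint32_t kind_and_subclass) {
    return kind_and_subclass >> 8;  // mask-less: the shift alone discards the kind byte
}

// Hypothetical inverse, used only to build test inputs.
inline std::uint32_t pack_kind_and_subclass(std::uint32_t kind, std::uint32_t subclass) {
    assert(subclass < (1u << 24));  // subclass data must fit in 24 bits
    return (subclass << 8) | (kind & 0xFFu);
}
```

No mask is needed after the shift because nothing sits above the subclass field in a 32-bit word.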
The Four RB-Trees
The driver keeps four red-black trees of per-block lattice state. Each tree is a libstdc++
std::_Rb_tree with the canonical _Rb_tree_increment / _Rb_tree_decrement callees; together they
are the entire mutable working set of the pass.
| Tree | Keyed by | Value |
|---|---|---|
| argmask | BlockArgument * | AS bitmask propagated through PHI/select |
| clone-count | Block * | how many times this block has been re-visited (for budget cap) |
| vote | Value * | lattice vote (UNDET, one of the six concrete address spaces, or POISON) |
| return-space | Function * | inferred AS for the function's return-value pointer (if any) |
The argmask tree carries a bitmask, not a single lattice element, because PHI and select merges must remember every concrete AS that reached the block argument before the lattice meet collapses them. The clone-count tree governs convergence: every visit increments the count for the value's parent block, and the driver bails out of the worklist as soon as any block exceeds kBudgetCap. The vote tree holds the lattice value that the cross-link section consumes. The return-space tree is populated only for functions whose return type is a pointer — the analogue of the per-argument vote slot for the result.
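One way to picture the argmask entry is as a set of address-space bits that is only collapsed into a lattice element on demand. The encoding below (bit N set when addrspace(N) reached the block argument) is an assumption made for illustration; the source states only that the tree stores a bitmask.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kUndet = 1000;   // lattice bottom, from the table above
constexpr int kPoison = 2000;  // lattice top, from the table above

// Record that a value in `addrspace` reached the block argument.
inline std::uint32_t record_space(std::uint32_t mask, int addrspace) {
    assert(addrspace >= 0 && addrspace < 32);
    return mask | (1u << addrspace);
}

// Collapse the mask: no bits -> UNDET, exactly one bit -> that address space,
// several bits -> POISON (distinct concrete spaces disagree).
inline int collapse_mask(std::uint32_t mask) {
    if (mask == 0) return kUndet;
    if ((mask & (mask - 1)) != 0) return kPoison;  // more than one bit set
    int space = 0;
    while ((mask >> space) != 1u) ++space;         // index of the single set bit
    return space;
}
```

Keeping the mask rather than the collapsed value means a PHI that saw both global and shared inputs still remembers which spaces were involved, even though its vote is POISON.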
Driver Pseudocode
LogicalResult IPMSPPass::run(Function *F) {
    RBTrees state = allocRBTrees();
    WorkList wl = seedFromArguments(F);
    while (!wl.empty()) {
        Value *v = wl.pop();
        ASLattice newAS = visit(v, state);
        if (newAS != getOldAS(state, v)) {
            updateLattice(state, v, newAS);
            wl.pushUses(v);
        }
        ++state.cloneCount[parentBlock(v)];
        if (state.cloneCount[parentBlock(v)] > kBudgetCap)
            return failure();  // give up
    }
    typeConverter.install(F, state);  // sub_2882E00
    return success();
}
kBudgetCap defaults to 1024 visits per block. Exceeding it means the lattice failed to converge in a reasonable time, and the pass leaves the function unchanged rather than guess. The fall-back is deliberate: a poisoned slot is correct (the cross-link section already defines POISON as absorbing), but a non-converged function should not be specialized at all.
I08 Type Converter sub_2882E00
sub_2882E00 is the I08 type converter that runs after the lattice has converged. It is the only
component that turns the lattice's in-memory state into observable IR. For each function it:
- Reads the inferred AS for each pointer argument from the vote tree.
- Installs a "nvvm.as" function attribute on the function, with the inferred AS values serialized as a DenseI32ArrayAttr.
- Leaves the actual pointer types alone; rewriting argument types is the cloner's job, not the converter's.
The downstream NVPTX backend reads "nvvm.as" via the standard
mlir::nvvm::isPointerInAddressSpace() API and emits the corresponding PTX AS modifier
(global, shared, local, const, tmem, dist_shared) on every load/store that touches the
argument. The attribute is therefore the public contract between the lattice and the rest of the
toolchain: everything upstream of sub_2882E00 is internal lattice state, and everything
downstream sees only the array attribute.
Cross-link
The same six-way partition is consumed by other NVVM components. The type converter writes the
resolved address space into specialized function signatures, so the backend sees
ptr addrspace(N) instead of a generic ptr. TMEM and DIST_SH are first-class participants;
excluding them would downgrade Blackwell tensor-memory and distributed-shared kernels back to
generic address conversions.
NVVMAA uses the same partition for aliasing:
AliasResult nvvm_alias(AddressSpace a, AddressSpace b) {
    if (a == ADDRSPACE_GENERIC || b == ADDRSPACE_GENERIC) {
        return MayAlias;
    }
    if (a != b) {
        return NoAlias;
    }
    return MayAlias;
}
That mirrors the lattice. A poisoned vote means specialization cannot prove one concrete space, so alias analysis falls back to MayAlias. Distinct concrete address spaces are disjoint.
Concurrency and Synchronization Semantics
Abstract
NVIDIA GPUs expose a layered memory model: SIMT lockstep within a warp, explicit synchronization between warps in a CTA, cluster-scope synchronization between CTAs on SM90 and newer, and device- or system-scope ordering for atomics that cross those boundaries. Every memory operation tileiras emits carries an explicit (semantic, scope) pair, and that pair is what fixes the operation's position in the layered model. The pair survives every lowering stage from cuda_tile and nv_tileaa TileIR down to PTX: the mem_semantic and mem_scope attributes on the IR op map directly onto the .sem and .scope modifiers in the printed PTX, and the verifier rejects any combination that the printer cannot legally emit.
This page is the canonical reference for that pair: how the four scopes nest in the execution hierarchy, what the five semantics promise about ordering, which (semantic, scope) combinations each operation family accepts, and how the worked release-acquire pair fits into a producer/consumer pipeline that crosses the cluster boundary.
The Four Scopes
A scope answers the question "which set of threads is required to observe the ordering this operation establishes?" The four scopes nest strictly: CTA is contained in cluster, cluster is contained in GPU, GPU is contained in system. A wider scope subsumes the visibility guarantee of every narrower scope but costs more cycles to enforce, because the hardware has to push or pull traffic across more of the on-chip network.
| Scope | PTX modifier | Visibility | Cost (~cycles) |
|---|---|---|---|
| CTA | .cta | threads in one CTA | 1-10 |
| Cluster | .cluster | CTAs in one cluster (SM90+) | 10-50 |
| GPU / device | .gpu | every SM on one device | 100-1000 |
| System | .sys | NVLink-coherent multi-GPU and host memory | 1000+ |
A few operations also accept the compound .cta::cluster scope. That form means the same as .cluster for visibility but lets the hardware take a shorter path when the operand address turns out to be local to the current CTA — the cluster tail is conditional on the address. Tileiras emits the compound only when the operand's address-space inference proves the access may target a peer CTA's distributed shared memory.
Below SM90 the cluster tier does not exist; the hierarchy collapses to CTA / GPU / system, and any IR op carrying mem_scope = cluster must be rewritten or rejected before printing.
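The nesting rule and the sub-SM90 collapse can be sketched as two small predicates. The enum values and the choice to widen a cluster scope to GPU on older targets are assumptions for illustration; the text says only that such ops must be rewritten or rejected, and widening is one rewrite that never weakens visibility.

```cpp
// Ordered from narrowest to widest so that nesting is a simple comparison.
enum class MemScope { CTA = 0, Cluster = 1, GPU = 2, System = 3 };

// True when `wide` covers at least every thread that `narrow` covers.
inline bool scope_covers(MemScope wide, MemScope narrow) {
    return static_cast<int>(wide) >= static_cast<int>(narrow);
}

// Hypothetical legalization: below SM90 there is no cluster tier, so a cluster
// scope is widened to GPU. Rejecting the op would be the other legal choice.
inline MemScope legalize_scope(MemScope scope, int sm) {
    if (scope == MemScope::Cluster && sm < 90) return MemScope::GPU;
    return scope;
}
```

Widening is always safe for correctness because a wider scope subsumes every narrower one; the price is the higher enforcement cost shown in the table.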
The Five Semantics
A semantic answers the question "what other memory operations are ordered relative to this one?" The five values form a lattice: relaxed is the weakest, sc is the strongest, and the three middle values (acquire, release, acq_rel) are partially ordered with respect to each other.
| Semantic | PTX modifier | Meaning | Typical use |
|---|---|---|---|
| relaxed | .relaxed | atomicity only; no inter-thread ordering | counters, statistics, lock-free queues with hand-rolled fences |
| acquire | .acquire | subsequent loads/stores see writes that happened before the matching release | consumer side of a producer/consumer handoff |
| release | .release | prior loads/stores are visible after this op to any thread that performs a matching acquire | producer side of a producer/consumer handoff |
| acq_rel | .acq_rel | both acquire and release; only legal on RMW operations | atomic counters that gate both publication and consumption |
| sc / seq_cst | .sc (on fences) | total order across every sc-ordered op of equal-or-wider scope | rarely emitted; tileiras prefers explicit acq_rel pairs |
A pure ld cannot carry .release and a pure st cannot carry .acquire — the verifier rejects either combination before printing. The acq_rel semantic is reserved for RMW (atom.*) instructions and the corresponding fence forms. Sequential consistency is supported by the PTX fence.sc.{scope} instruction; tileiras emits it only when an upstream TileIR op carries mem_semantic = sc and the optimizer cannot prove a weaker form suffices.
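The partial order among the five semantics can be modelled as a set of guarantees, where one semantic is at least as strong as another when it provides every guarantee the other provides. The bit assignment below is an illustrative assumption, not an encoding taken from tileiras.

```cpp
enum class MemSemantic { Relaxed, Acquire, Release, AcqRel, SeqCst };

// Bit 0: acquire-side ordering, bit 1: release-side ordering,
// bit 2: participation in the single total order of sc operations.
inline unsigned guarantees(MemSemantic s) {
    switch (s) {
    case MemSemantic::Relaxed: return 0u;
    case MemSemantic::Acquire: return 1u;
    case MemSemantic::Release: return 2u;
    case MemSemantic::AcqRel:  return 3u;
    case MemSemantic::SeqCst:  return 7u;
    }
    return 0u;
}

// a is at least as strong as b when a's guarantees include all of b's.
inline bool at_least_as_strong(MemSemantic a, MemSemantic b) {
    return (guarantees(a) & guarantees(b)) == guarantees(b);
}
```

Under this model acquire and release are incomparable, acq_rel sits above both, and sc sits above everything, which matches the ordering described above.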
Scope-Semantic Matrix per Operation
The TileIR opset partitions memory effects into six families. Each family accepts a different subset of (semantic, scope) pairs, and the verifier knows which subset each family accepts.
| Op family | Takes semantic | Takes scope | Notes |
|---|---|---|---|
| atom.* (RMW, CAS) | yes: all five | yes: cta / cluster / gpu / sys | scope required when semantic > relaxed |
| ld (load) | yes: relaxed / acquire | yes: cta / cluster / gpu / sys | scope is required when semantic is acquire |
| st (store) | yes: relaxed / release | yes: cta / cluster / gpu / sys | scope is required when semantic is release |
| fence.* | yes: acquire / release / acq_rel / sc | yes: cta / cluster / gpu / sys | scope is always required |
| mbarrier.* | implicit | implicit | scope dictated by the mbarrier's address space; cluster forms use mbarrier.expect_tx.cluster |
| cp.async.bulk.* | implicit | implicit | ordering flows through the completion mbarrier paired with the copy |
The TileAA verifier hard-codes one structural rule that applies across families: a non-weak semantic requires a scope, and a weak semantic must not carry a scope. This rule is what the verify_memory_op_common predicate in nv_tileaa Operation Roster enforces.
LogicalResult verify_memory_ordering(MemoryOp op) {
    MemSemantic sem = op.mem_semantic();
    Optional<MemScope> scope = op.mem_scope();
    if (sem == WEAK) {
        require(!scope.has_value(),
                "weak memory ordering must not carry a scope");
        return success();
    }
    require(scope.has_value(),
            "non-weak memory ordering requires explicit scope");
    if (op.is_load_only()) {
        require(sem == RELAXED || sem == ACQUIRE,
                "loads accept only relaxed or acquire");
    } else if (op.is_store_only()) {
        require(sem == RELAXED || sem == RELEASE,
                "stores accept only relaxed or release");
    }
    return success();
}
The matrix is asymmetric on purpose: only RMW operations support acq_rel, because only a single atomic instruction can plausibly establish a happens-before edge in both directions at once.
How Tileiras Chooses
The frontend produces tile-IR ops with memSemantic and memScope attributes attached at construction. The defaults are conservative — sc and sys — and the contract is that lowering may strengthen but never weaken the pair. In practice the frontend overrides the defaults at every site where the user-supplied source carries a weaker promise: a tl.atomic_add with sem="relaxed", scope="gpu" lowers to a cuda_tile.atomic_rmw with the same pair, which lowers to nv_tileaa.atomic_rmw with the same pair, which lowers to nv_tileas.atomic_rmw with the same pair, which finally prints as the matching PTX modifier.
tl.atomic_add(p, v, sem="relaxed", scope="gpu")
│
▼ tile-IR construction
cuda_tile.atomic_rmw add, ... { mem_semantic = relaxed, mem_scope = gpu }
│
▼ cuda_tile to tileaa lowering
nv_tileaa.atomic_rmw add, ... { mem_semantic = relaxed, mem_scope = gpu }
│
▼ tileaa to tileas lowering
nv_tileas.atomic_rmw add, ... { mem_semantic = relaxed, mem_scope = gpu }
│
▼ tileas to NVVM/LLVM lowering, PTX emission
atom.relaxed.gpu.global.add.u32 %r2, [%rd0], %r1;
Each stage's converter copies the attribute pair through addStoreAttribute / addLoadAttribute helpers without re-deriving them. The only stage that strengthens the pair is the safety pass that detects an mbarrier.expect_tx without a matching upstream release, which inserts an explicit fence.release.cluster instead of demoting the cluster transaction to a weaker form.
CTA-Scope Sync Primitives
Within a CTA, threads synchronize through three mechanisms. The choice depends on whether the rendezvous is warp-cooperative, count-only, or transactional.
- bar.sync 0: the implicit barrier; every thread in the CTA arrives, every thread waits. This is the cheapest CTA-wide rendezvous and the only one safe to emit when the warp count is unknown.
- bar.sync N for N = 1..15: a named barrier slot with an explicit participant count. The 16-slot pool is allocated at compile time by the buffer-assignment pass and bound to specific producer/consumer pairs. See Buffer Assignment and Named-Barrier Binding for the slot pool's allocation discipline.
- mbarrier.*: a 64-bit transactional barrier object in shared memory. Unlike bar.sync, an mbarrier carries explicit state (arrival count, expected transaction byte count, phase parity) and is polled instead of blocked on. The full state machine is documented in mbarrier State Machine.
A NamedBarrier and an mbarrier are structurally distinct primitives that often coexist in the same kernel — a cutlass.pipeline producer typically arrives on an mbarrier to publish a TMA-completed tile and on a NamedBarrier to synchronize its warp group. The two never substitute for each other.
Cluster-Scope Sync Primitives
Above the CTA, the only hardware-supported rendezvous is the cluster arrive/wait pair. The producer side issues cluster.arrive.relaxed, the consumer side issues cluster.wait, and the rendezvous completes when every participating CTA has arrived. The DSMEM transaction variant additionally publishes a transaction byte count on a peer CTA's mbarrier through mbarrier.expect_tx.cluster before arriving, so the rendezvous completes only when both the arrival count and the transaction byte count clear.
The full cluster-side protocol — peer-CTA address translation through nvvm.mapa, the fence.release.cluster ordering prelude, master-lane phase-bit handoff, and the arrive/wait tail — is documented in Cluster Sync and DSMEM Handshake. The scope-semantic view of that protocol is straightforward: the producer-side mbarrier.expect_tx.cluster carries an implicit release scoped to cluster, and the consumer-side mbarrier.try_wait.parity (after the cluster wait completes) carries an implicit acquire of equal scope.
Race Patterns and the Verifier
Race-freedom is not decidable from IR alone — the verifier cannot enumerate every interleaving of threads and ranks. What it can do is reject structural patterns that are racy by construction, where the IR provably lacks the synchronization edge the operation needs. Three such patterns are checked at TileAA verification time.
The first is a scope/address-space mismatch: an atomic with mem_scope = cta on an addrspace(1) (global) pointer is suspicious, because CTA-scope atomicity cannot enforce visibility across the device-wide L2 path that a global access takes. The verifier emits a diagnostic; the optimizer either widens the scope to gpu or rejects the program.
The second is an mbarrier with expect_tx but no matching upstream arrive.expect_tx: the consumer is waiting for a transaction byte count that no producer will ever publish, and the rendezvous deadlocks. The verifier walks the local dataflow graph backwards from the wait site to confirm that an arrive-with-tx producer dominates it.
The third is a WGMMA without the preceding wgmma.fence.sync.aligned: the warp group reads SMEM through the descriptor before the producer's SMEM writes are guaranteed visible, which races. The WGMMA Emission Protocol documents the four-op sequence that the verifier enforces.
None of these checks proves the program race-free; they only reject the structural patterns where racing is the only possible outcome. Programs that race through more subtle channels — false sharing, ABA on a reused mbarrier slot, an atomic counter with wrong scope across a launch-cooperative boundary — pass the verifier and fail at runtime.
Worked Example: Producer-Consumer Pipeline Ordering
Consider a three-stage software-pipelined GEMM loop: a TMA load fetches an A-tile and a B-tile from global memory into shared memory, a barrier publishes the SMEM stage, and a WGMMA consumer reads through the SMEM descriptor and accumulates into TMEM. The pipeline crosses three sync tiers (per-stage TMA completion, per-CTA SMEM publication, per-warp-group WGMMA fence) and forms one valid release-acquire pair at the SMEM boundary.
The four steps in one iteration of the steady-state loop:
- The producer warp issues cp.async.bulk.tensor.shared::cluster.global.mbarrier::complete_tx::bytes. The asynchronous bulk copy reads global memory and writes the destination SMEM stage. Ordering flows through the mbarrier passed in mbarrier::complete_tx::bytes: the copy will increment the mbarrier's transaction byte count when its writes are visible to every thread that polls the mbarrier on the consumer side.
- The producer warp issues mbarrier.arrive.expect_tx.shared.b64 %tok, [%mbar], %tx. This publishes the expected transaction byte count and arrives at the same time. The semantic-scope view: this op carries an implicit release semantic at cta scope (or cluster if %mbar lives in a peer CTA's DSMEM, in which case the printer emits mbarrier.expect_tx.cluster.b64). Any prior writes from this warp (including the TMA payload itself, asynchronously) are guaranteed visible to a consumer that subsequently acquires the same mbarrier.
- The consumer warp issues mbarrier.try_wait.parity.shared.b64 %p, [%mbar], %ph, %ns. The op carries an implicit acquire semantic at cta (or cluster) scope. The wait succeeds only when both the arrival count and the transaction byte count have cleared, at which point every write that the producer ordered through this mbarrier is visible to the consumer.
- The consumer warp issues wgmma.fence.sync.aligned followed by the WGMMA instruction(s) and wgmma.commit_group.sync.aligned. The fence is required because the WGMMA reads SMEM through the descriptor outside the normal load/store path, so the producer-to-consumer release/acquire edge through the mbarrier needs an extra fence to be visible to the WGMMA pipeline specifically. See The Four-Op Sequence in WGMMA Emission Protocol for the full sequence.
The release-acquire pair at the SMEM boundary is mbarrier.arrive.expect_tx (release) paired with mbarrier.try_wait.parity (acquire), both implicitly scoped to the mbarrier's address space. The pair satisfies the layered memory model: the consumer's reads see the producer's writes, including the asynchronous TMA payload, and the optimizer is allowed to reorder the producer's tile-compute and the consumer's tile-compute across the boundary as long as it never sinks operations past their release or hoists them past their acquire.
producer warp                                   consumer warp
     │                                               │
     │  cp.async.bulk.tensor → SMEM stage            │
     │  mbarrier.arrive.expect_tx (release, cta)     │
     │                                               │
     │                                               ▼
     │                     mbarrier.try_wait.parity (acquire, cta)
     │                     wgmma.fence.sync.aligned
     │                     wgmma.mma_async ← reads SMEM via descriptor
     │                     wgmma.commit_group.sync.aligned
If %mbar is mapped into a peer CTA's DSMEM through nvvm.mapa, the same diagram describes a cluster-scope handshake: the release widens to cluster, the acquire widens to cluster, and the rendezvous spans every participating CTA in the cluster. The mechanism is unchanged; only the scope modifier on the printed PTX changes.
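The completion rule in the worked example (the wait succeeds only when both the arrival count and the transaction byte count have cleared) can be exercised with a small single-threaded model. This is a host-side sketch under stated assumptions, not the hardware algorithm: MBarrierModel and its methods are hypothetical, completion is modelled as "no pending arrivals and delivered bytes equal expected bytes", and try_wait_parity is modelled as reporting true once the phase bit differs from the parity being waited on.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical logical model of one mbarrier; not the packed hardware word.
struct MBarrierModel {
    std::uint32_t arrive_count;  // arrivals required per phase
    std::uint32_t pending;       // arrivals still outstanding in this phase
    std::uint32_t expected_txn;  // bytes the producer promised
    std::uint32_t txn_count;     // bytes the asynchronous copy has delivered
    bool phase;                  // parity bit, flips on completion

    explicit MBarrierModel(std::uint32_t count)
        : arrive_count(count), pending(count), expected_txn(0), txn_count(0), phase(false) {}

    // Producer: publish the expected byte count and arrive in one step (release side).
    void arrive_expect_tx(std::uint32_t bytes) {
        assert(pending > 0);
        expected_txn += bytes;
        --pending;
        maybe_complete();
    }

    // Asynchronous copy: report bytes whose writes are now visible.
    void complete_tx(std::uint32_t bytes) {
        txn_count += bytes;
        maybe_complete();
    }

    // Consumer: true once the phase with the given parity has completed (acquire side).
    bool try_wait_parity(bool parity) const { return phase != parity; }

private:
    void maybe_complete() {
        if (pending == 0 && txn_count == expected_txn) {
            phase = !phase;          // completion flips the parity bit
            pending = arrive_count;  // and re-arms the barrier for the next round
            expected_txn = 0;
            txn_count = 0;
        }
    }
};
```

With one producer arrival and a 128-byte tile, the consumer's wait keeps failing after the arrival alone and after a partial 64-byte completion, and succeeds only once the second 64 bytes land.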
Cross-References
GPU Execution Model establishes the five-tier hierarchy (thread / warp / CTA / cluster / grid) whose tiers this page's scopes nest into.
mbarrier State Machine covers the barrier object whose arrive.expect_tx and try_wait.parity ops carry the implicit release/acquire pair documented above.
Cluster Sync and DSMEM Handshake is the cluster-scope counterpart: peer-CTA address translation, the fence.release.cluster prelude, and the cluster arrive/wait tail.
WGMMA Emission Protocol is the consumer side of the worked example: the warp-group fence, the MMA op, the commit, and the matching wait-group.
nv_tileaa Operation Roster — Memory Effects catalogues the mem_semantic and mem_scope attribute slots on every memory-effect op.
Atomic, Warp, Sreg, Fence Emission prints the final PTX form: the modifier ordering on atom.*, the fence.* family, and the cluster-scope mbarrier variants.
Buffer Assignment and Named-Barrier Binding allocates the 16-slot CTA-scope NamedBarrier pool that complements the mbarrier-based rendezvous discussed here.
Cluster Sync and DSMEM Handshake
Abstract
The cluster tier in the GPU execution model — covered end-to-end in GPU Execution Model — is the only level above the CTA with hardware sync support. The cluster-side rendezvous protocols tileiras emits to use that hardware are the subject of this page.
A Hopper or Blackwell cluster is a group of cooperating CTAs that share work through a single cluster-level rendezvous. Tileiras lowers cluster-aware barrier operations through two related paths. The plain cluster barrier is a control-flow rendezvous: every participating CTA arrives, then waits, and execution resumes once every participant has reached the same point. The DSMEM transaction handshake is a data-flow rendezvous: a peer CTA publishes its expected transaction byte count on a remote mbarrier before the cluster arrive/wait pair, and the rendezvous completes only when both the arrival count and the transaction-byte count clear.
Both paths share one mechanism. Plain cluster sync is the general primitive every multi-CTA cluster needs; the DSMEM transaction handshake is a specific case where the rendezvous carries a transaction-byte payload because peers are exchanging distributed shared memory through an asynchronous copy. The split matches the CUTLASS distinction between ClusterBarrier::wait() and ClusterTransactionBarrier::arrive_and_expect_tx().
The transaction-byte field is the contract between the producer-side copy and the consumer-side wait. Blackwell 2-CTA and 4-CTA MMA is the producer of the multicast S2T copy whose tcgen05.cp payload is exactly the byte count published by nvvm.mbarrier.txn below: producer and consumer must agree on a single byte count or the cluster rendezvous deadlocks. The mbarrier object that carries the count is documented separately in mbarrier State Machine; this page covers the cluster-side rendezvous that consumes it.
Plain Cluster Barrier
The plain cluster-barrier lowering consumes a barrier scope and the target compute capability. The compute-capability gate controls only the nvvm.fence.mbarrier.init prelude: Hopper and newer hardware get the prelude, older hardware skips it. The scope decides whether a CTA-local barrier is emitted before the cluster arrive/wait pair.
| Scope | sm <= 89 | sm >= 90 |
|---|---|---|
| CTA (0) | nvvm.barrier | fence.mbarrier.init + nvvm.barrier |
| Cluster (1) | cluster.arrive.relaxed + cluster.wait | fence.mbarrier.init + arrive + wait |
| ClusterAligned (2) | nvvm.barrier + cluster.arrive.relaxed + cluster.wait | fence.mbarrier.init + barrier + arrive + wait |
The CTA-only branch returns after nvvm.barrier. The cluster branches fall through into nvvm.cluster.arrive.relaxed and nvvm.cluster.wait. Plain barriers always use the relaxed arrive form: release ordering comes from the mbarrier-init prelude on newer hardware and from the CTA-local barrier where that scope requires it.
void lower_plain_barrier(Rewriter *rewriter, BarrierOp op, int sm) {
    if (sm >= 90) {
        emit_nvvm_fence_mbarrier_init(rewriter, op);
    }
    if (op.scope == BARRIER_SCOPE_CTA || op.scope == BARRIER_SCOPE_CLUSTER_ALIGNED) {
        emit_nvvm_barrier(rewriter);
        if (op.scope == BARRIER_SCOPE_CTA) {
            return;
        }
    }
    emit_cluster_arrive_relaxed(rewriter);
    emit_cluster_wait(rewriter);
}
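To check the lowering against specific (scope, sm) combinations, the same control flow can be written as a function that returns the emitted op names instead of calling a rewriter. The function name plain_barrier_ops is hypothetical and the op-name strings are taken from the prose above; this is a sketch for experimentation, not the converter itself.

```cpp
#include <string>
#include <vector>

enum BarrierScope {
    BARRIER_SCOPE_CTA = 0,
    BARRIER_SCOPE_CLUSTER = 1,
    BARRIER_SCOPE_CLUSTER_ALIGNED = 2
};

// Mirrors lower_plain_barrier, but records op names in emission order.
inline std::vector<std::string> plain_barrier_ops(BarrierScope scope, int sm) {
    std::vector<std::string> ops;
    if (sm >= 90) {
        ops.push_back("nvvm.fence.mbarrier.init");  // prelude on Hopper and newer
    }
    if (scope == BARRIER_SCOPE_CTA || scope == BARRIER_SCOPE_CLUSTER_ALIGNED) {
        ops.push_back("nvvm.barrier");              // CTA-local barrier first
        if (scope == BARRIER_SCOPE_CTA) {
            return ops;                             // CTA-only branch stops here
        }
    }
    ops.push_back("nvvm.cluster.arrive.relaxed");   // plain barriers use the relaxed arrive
    ops.push_back("nvvm.cluster.wait");
    return ops;
}
```

Calling it with BARRIER_SCOPE_CLUSTER_ALIGNED yields three ops for sm 89 and four for sm 90, the only difference being the mbarrier-init fence at the front.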
DSMEM Transaction Handshake
The DSMEM transaction handshake is the cluster-sync variant that carries a transaction-byte payload. It extends the plain arrive/wait pair with a peer-CTA address translation, an mbarrier expect_tx publication, and a master-lane phase flip — all before the cluster arrive.
For a single-CTA layout the transaction path collapses to the phase-bit update used by ordinary pipeline barriers: load the current phase, compute the next phase with phase ^ 1, and store the flipped value. No DSMEM mapping or cluster fence is needed when there are no peer CTAs.
For a multi-CTA layout the lowering emits one handshake sequence per peer participant:
| Operation | Purpose |
|---|---|
| nvvm.mapa | Translate a shared-memory pointer into the peer CTA's DSMEM address. |
| llvm.addrspacecast | Convert the DSMEM pointer to the generic pointer type expected by the mbarrier op. |
| llvm.inline_asm | Emit fence.release.cluster; when the caller requested an explicit release fence. |
| nvvm.mbarrier.txn | Advertise the expected transaction byte count to the shared mbarrier. |
| arith.cmpi / scf.if | Restrict phase-bit mutation to the master lane. |
| llvm.load / arith.xori / llvm.store | Toggle the phase bit. |
| nvvm.cluster.arrive.* | Arrive at the cluster rendezvous. |
| nvvm.cluster.wait | Wait until every participating CTA reaches the same point. |
%dsmem_ptr = nvvm.mapa %smem_ptr, %peer_ctarank : !llvm.ptr<3>
%gen_ptr = llvm.addrspacecast %dsmem_ptr : !llvm.ptr<3> to !llvm.ptr
llvm.inline_asm "fence.release.cluster;"  // when the upstream release flag is set
nvvm.mbarrier.txn %gen_ptr, %tx_bytes : !llvm.ptr, i32
%master = arith.cmpi eq, %laneid, %zero : i1
scf.if %master {
    %phase = llvm.load %phase_ptr : i1
    %flip = arith.xori %phase, %one : i1
    llvm.store %flip, %phase_ptr : i1
}
nvvm.cluster.arrive.relaxed { aligned }
nvvm.cluster.wait { aligned }
Without a multi-CTA parent the DSMEM operations are skipped and the lowering emits only the arrive/wait tail. The release mode controls the arrive opcode: when an upstream fence.release.cluster; is already in place the lowering uses nvvm.cluster.arrive.relaxed; otherwise it can use the aligned arrive form directly.
void lower_dsmem_transaction_barrier(Rewriter *rewriter, TransactionBarrierOp op) {
    if (op.cluster_size == 1) {
        emit_phase_flip(rewriter, op.phase_ptr);
        return;
    }
    for (PeerCta peer : op.peers) {
        Value *dsmem = emit_nvvm_mapa(rewriter, op.smem_ptr, peer.rank);
        Value *generic = emit_addrspacecast_to_generic(rewriter, dsmem);
        if (op.requires_explicit_release) {
            emit_side_effect_inline_asm(rewriter, "fence.release.cluster;");
        }
        emit_nvvm_mbarrier_txn(rewriter, generic, op.transaction_bytes);
        emit_master_lane_phase_flip(rewriter, op.phase_ptr);
    }
    emit_cluster_arrive_for_release_mode(rewriter, op.requires_explicit_release);
    emit_cluster_wait(rewriter);
}
The ordering invariant is: publish the DSMEM transaction expectation before cluster arrive, toggle the phase only on the master lane, and pair every cluster arrive with a cluster wait for multi-CTA rendezvous. Reversing the order — arriving before publishing the transaction count — races peer CTAs that read the mbarrier as part of the arrive completion.
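The ordering invariant lends itself to a simple linear check over an emitted op sequence. The checker below is a hypothetical illustration for the multi-CTA transaction handshake only: it treats ops as plain name strings, requires nvvm.mbarrier.txn before any cluster arrive, and requires each arrive to be closed by a later nvvm.cluster.wait. It does not model the master-lane restriction, which lives inside the scf.if region.

```cpp
#include <string>
#include <vector>

// Returns true when the sequence publishes the transaction count before
// arriving and pairs every cluster arrive with a cluster wait.
inline bool handshake_order_ok(const std::vector<std::string> &ops) {
    bool txn_published = false;
    bool arrived = false;
    for (const std::string &op : ops) {
        if (op == "nvvm.mbarrier.txn") {
            if (arrived) return false;           // publishing after arrive races peers
            txn_published = true;
        } else if (op.rfind("nvvm.cluster.arrive", 0) == 0) {  // prefix match
            if (!txn_published || arrived) return false;
            arrived = true;
        } else if (op == "nvvm.cluster.wait") {
            if (!arrived) return false;          // wait without a matching arrive
            arrived = false;
            txn_published = false;               // the next round must publish again
        }
    }
    return !arrived;                             // a trailing arrive has no wait
}
```

Several nvvm.mbarrier.txn ops before one arrive are accepted, which corresponds to the per-peer loop in the lowering publishing to each peer before the single arrive/wait tail.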
Cross-References
GPU Execution Model places the cluster tier in the five-tier hierarchy (thread / warp / CTA / cluster / grid) and documents the .cluster_dim / .explicitcluster / .maxclusterrank directives that establish the cluster shape this page's rendezvous operates over.
mbarrier State Machine covers the barrier object itself: arrival semantics, phase parity, and the transaction-byte field this page consumes.
Blackwell 2-CTA and 4-CTA MMA is the producer of the multicast S2T copy whose transaction-byte count drives the consumer-side handshake here.
tcgen05 Tensor Memory Model describes the TMEM allocator and instructions whose 2-CTA cooperative MMA variant rides on top of this rendezvous.
Concurrency and Sync Semantics places the cluster-scope release/acquire pair carried by mbarrier.expect_tx.cluster and mbarrier.try_wait.parity inside the four-scope, five-semantic matrix that every tileiras memory op participates in.
Atomic, Warp, Sreg, Fence Emission documents the PTX printer for cluster.arrive, cluster.wait, and fence.mbarrier_init.release.cluster.
mbarrier State Machine
Abstract
An mbarrier is a 64-bit transactional barrier object that lives in shared memory and synchronises a fixed set of arrival participants with an arbitrary count of in-flight transactions. It is the SM80-and-later primitive that decouples asynchronous data movement from compute: TMA loads, Hopper WGMMA, Blackwell tcgen05, and the entire cutlass.pipeline producer/consumer scaffold all observe completion through one of these objects. This page is the canonical reference for the mbarrier state machine — initialization, arrival, transaction tracking, phase parity, and invalidation — and for the 21-op NVVM family that touches it.
This page supersedes the scattered mbarrier paragraphs in Atomic, Warp, Sreg, Fence Emission (the 21-op printer table), tcgen05 / WGMMA / mbarrier / Cluster Emission (the finalize-phase fragment), Cluster Sync and DSMEM Handshake (the transactional handshake), Pipeline and Tile Scheduler (the producer/consumer step function), and Seq-Bar and Block-Striped (the ring-of-slots view). Those pages now defer here for the mechanism itself.
NamedBarrier Is a Different Thing
Tileiras code paths and CUTLASS-style kernels routinely reach for two synchronization primitives whose surface vocabulary overlaps. They are structurally distinct and must not be conflated.
A NamedBarrier is one of the 16 hardware bar.sync slots per CTA. Allocation is static: Buffer Assignment and Named-Barrier Binding reserves a 32-bit slot vector in Phase 2 and hands each producer/consumer pair one slot. The synchronization model is warp-cooperative — all participating warps bar.sync N, count against the same slot id, and the barrier releases when count arrivals have accumulated. There is no transaction tracking, no shared-memory storage, no phase bit. Slots can be reused across disjoint lifetimes but not at one program point. The cutlass.bar op and the nvvm.bar.cta.sync family print into NamedBarrier slots.
An mbarrier is a shared-memory object — a 64-bit word at an aligned SMEM address — that carries explicit state: an arrive_count, an expected_txn byte count, a phase bit, and a current count. Synchronization is by polling: a consumer issues mbarrier.try_wait.parity against an expected phase, and the hardware reports completion when both arrivals and transactions have reached their targets. There is no shared hardware slot, no warp-cooperative constraint, and no static allocation table — every kernel instantiates as many mbarriers as it wants, subject only to SMEM capacity. The nvvm.mbarrier.* family operates on these objects.
The two primitives often appear in the same kernel: a cutlass.pipeline producer typically arrives on an mbarrier (to publish a TMA-completed tile) and on a NamedBarrier (to synchronise its warp group), and the buffer-assignment pass binds both kinds of slot for the same pipeline value. They remain distinct mechanisms.
⚡ QUIRK — NamedBarrier and mbarrier share vocabulary but no mechanism
"Barrier" appears on both sides: both objects gate producer/consumer regions, and both end up bound by the same pass for the same pipeline value. They are otherwise unrelated: a NamedBarrier is one of 16 statically allocated CTA-wide bar.sync slots with a warp-cooperative count gate, while an mbarrier is a 64-bit transactional object in shared memory with arrive/expect-tx/parity polling. Reusing one's idioms on the other (a polling wait on a NamedBarrier, a bar.sync arrival on an mbarrier) does not type-check and produces nothing resembling synchronisation if it slips past the front end.
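The static NamedBarrier allocation can be pictured as a bitmask allocator. The following is a minimal sketch, assuming the 32-bit slot vector uses one bit per hardware slot; NamedBarrierPool, named_barrier_acquire, and named_barrier_release are hypothetical names for illustration, not tileiras symbols.

```c
#include <assert.h>
#include <stdint.h>

#define NAMED_BARRIER_SLOTS 16u  /* hardware bar.sync slots per CTA */

/* Hypothetical pool: bit i set means slot i is live at this program point. */
typedef struct NamedBarrierPool {
    uint32_t live;
} NamedBarrierPool;

/* Returns the lowest free slot id, or -1 when all 16 slots are live. */
int named_barrier_acquire(NamedBarrierPool *pool) {
    for (uint32_t slot = 0; slot < NAMED_BARRIER_SLOTS; ++slot) {
        if ((pool->live & (1u << slot)) == 0) {
            pool->live |= 1u << slot;
            return (int)slot;
        }
    }
    return -1;
}

/* Ends a slot's lifetime so a later, disjoint lifetime can reuse it. */
void named_barrier_release(NamedBarrierPool *pool, int slot) {
    assert(slot >= 0 && (uint32_t)slot < NAMED_BARRIER_SLOTS);
    pool->live &= ~(1u << slot);
}
```

Nothing comparable exists for mbarriers: they are ordinary shared-memory words, so their only budget is SMEM capacity.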
State Machine
An mbarrier occupies one 64-bit word in shared memory, and the hardware packing of that word is opaque. The struct below is a logical model of the state the word carries, not its bit layout (the fields as written are wider than 64 bits):
typedef struct MBarrier {
    uint32_t arrive_count;   /* arrivals required per phase, fixed at init */
    uint32_t expected_txn;   /* expected transaction-byte count, 0 for ordinary barriers */
    uint32_t txn_count;      /* current transaction-byte total */
    uint32_t phase   : 1;    /* parity bit, flips on completion */
    uint32_t pending : 31;   /* arrivals still outstanding in the current phase */
} MBarrier;
Hardware-visible state advances through five operations: init, arrive, arrive-with-expect-tx, try-wait-parity, and inval. The producer side decrements pending (and optionally publishes a transaction byte count); the consumer side polls until completion; the phase bit flips on each completion and re-arms the barrier for the next round.
void mbarrier_init(MBarrier *b, uint32_t count) {
    b->arrive_count = count;
    b->pending = count;
    b->expected_txn = 0;
    b->txn_count = 0;
    b->phase = 0;
}
void mbarrier_arrive(MBarrier *b) {
    /* Models one atomic read-modify-write; the hardware does this lock-free. */
    if (--b->pending == 0) {
        b->pending = b->arrive_count;   /* re-arm for the next phase */
        b->phase ^= 1;                  /* publish completion */
    }
}
void mbarrier_arrive_expect_tx(MBarrier *b, uint32_t tx_bytes) {
    b->expected_txn = tx_bytes;         /* announce the bytes the copy will deliver */
    mbarrier_arrive(b);
}
bool mbarrier_try_wait_parity(MBarrier *b, uint32_t want_phase) {
    /* Non-blocking poll: true only when arrivals and transaction bytes have both caught up. */
    return b->phase == want_phase
        && b->pending == b->arrive_count
        && b->txn_count >= b->expected_txn;
}
The hardware implementation is atomic and lock-free, but a reimplementation only needs to preserve three invariants: pending decrements toward zero, the phase bit flips on the decrement that reaches zero (re-arming pending), and try_wait.parity releases only when both the arrival side and the transaction side have caught up.
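The three invariants can be exercised in a single-threaded model. This is a minimal sketch, assuming a hypothetical ModelBarrier type and a hypothetical model_complete_tx helper that stands in for the asynchronous copy reporting delivered bytes. It mirrors the logical fields described above and does not model atomicity or the hardware packing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Logical fields only; not the hardware layout. */
typedef struct ModelBarrier {
    uint32_t arrive_count, pending, expected_txn, txn_count, phase;
} ModelBarrier;

void model_init(ModelBarrier *b, uint32_t count) {
    assert(count > 0);
    b->arrive_count = count;
    b->pending = count;
    b->expected_txn = 0;
    b->txn_count = 0;
    b->phase = 0;
}

void model_arrive(ModelBarrier *b) {
    if (--b->pending == 0) {          /* last arrival of the phase */
        b->pending = b->arrive_count; /* re-arm */
        b->phase ^= 1u;               /* flip parity */
    }
}

void model_arrive_expect_tx(ModelBarrier *b, uint32_t tx_bytes) {
    b->expected_txn = tx_bytes;
    model_arrive(b);
}

/* Hypothetical: the copy engine reporting bytes that have landed. */
void model_complete_tx(ModelBarrier *b, uint32_t bytes) {
    b->txn_count += bytes;
}

bool model_try_wait_parity(const ModelBarrier *b, uint32_t want_phase) {
    return b->phase == want_phase
        && b->pending == b->arrive_count
        && b->txn_count >= b->expected_txn;
}

/* Returns true when the wait stayed closed until the bytes landed. */
bool model_transaction_round(void) {
    ModelBarrier b;
    model_init(&b, 1);
    model_arrive_expect_tx(&b, 4096);           /* phase flips to 1 */
    bool early = model_try_wait_parity(&b, 1);  /* bytes still missing */
    model_complete_tx(&b, 4096);
    bool late = model_try_wait_parity(&b, 1);   /* both sides caught up */
    return !early && late;
}
```

The point of model_transaction_round is that the phase flip alone is not enough: the wait keeps failing until the transaction side reaches its target.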
Phase Parity
A consumer that waits on the same mbarrier across loop iterations cannot ask "is the barrier complete?" — the answer is yes between every two iterations. It asks "has the barrier flipped to phase p?". The producer flips phase on arrival, the consumer reads the phase it expects to see, and the wait succeeds exactly when the producer's flip and the consumer's expectation agree.
In practice each pipeline agent carries a one-bit phase counter that toggles on every wraparound of its stage index. For a depth-D pipeline the phase of stage s at iteration i is (i / D) & 1; the producer pre-arms phase ((i / D) ^ 1) & 1 (the slot's "next-empty" parity) and the consumer waits for phase (i / D) & 1 (the slot's "now-full" parity). This is why cutlass.pipeline.state carries both index and phase — the phase is what makes a single barrier slot reusable across iterations without ABA hazards.
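The index/phase arithmetic can be sketched directly. The PipelineState struct and the helper names below are hypothetical stand-ins for what cutlass.pipeline.state carries; the formulas are the ones stated above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct PipelineState {
    uint32_t index;  /* stage slot, in [0, depth) */
    uint32_t phase;  /* one-bit parity, toggles on each wraparound */
} PipelineState;

/* Closed form: stage i % D, phase (i / D) & 1. */
PipelineState pipeline_state_at(uint32_t iteration, uint32_t depth) {
    assert(depth > 0);
    PipelineState s = { iteration % depth, (iteration / depth) & 1u };
    return s;
}

/* Incremental form: what an agent does once per loop iteration. */
void pipeline_state_advance(PipelineState *s, uint32_t depth) {
    assert(depth > 0);
    if (++s->index == depth) {
        s->index = 0;
        s->phase ^= 1u;
    }
}

/* The consumer waits for the slot's "now-full" parity. */
uint32_t consumer_wait_phase(PipelineState s) { return s.phase; }

/* The producer pre-arms the opposite, "next-empty" parity. */
uint32_t producer_prearm_phase(PipelineState s) { return s.phase ^ 1u; }
```

Both forms agree at every iteration, so an agent can carry the two-field state and never recompute the division.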
Kinds: Ordinary, Transaction, Cluster
Three kinds of mbarrier appear in TileIR:
- Ordinary. expected_txn is implicitly 1 (or zero, with the count-only path), and only arrive_count participants need to arrive. The cutlass.bar and seq_bar paths use this kind. Lowering emits nvvm.mbarrier.arrive or nvvm.mbarrier.arrive.nocomplete.
- TMA transaction. expected_txn is the byte count the TMA copy will deliver: 32 * size_minor for a tiled TMA load of size_minor elements per minor dimension. The producer announces the expectation with nvvm.mbarrier.arrive.expect_tx before issuing the TMA instruction; the TMA copy then updates txn_count asynchronously, and the consumer's try_wait.parity releases only once both the arrival side and the transaction-byte side complete. This is the kind that ties cp.async.bulk.tensor to the consumer's WGMMA or tcgen05 instruction.
- Cluster transaction. The cross-CTA variant. The producer maps the barrier into a peer CTA's distributed shared memory through nvvm.mapa, publishes expected_txn via nvvm.mbarrier.txn (the cluster-scope expect-tx op), then participates in cluster.arrive / cluster.wait. The DSMEM handshake on Cluster Sync and DSMEM Handshake documents the rendezvous; the mbarrier state-machine view is just that the transaction byte count is published on a peer-CTA mbarrier rather than a local one.
The 21-Op NVVM Family
The nvvm dialect exposes 21 ops that touch mbarrier state. They cover initialization and invalidation, the init fence, arrive variants split by completion and transaction kind, wait variants split by blocking semantics, and the shared-memory specialisations of each. Lowering picks the .shared form when the barrier address space is 3 and the generic form otherwise.
| State-machine role | NVVM op | PTX |
|---|---|---|
| init / inval | nvvm.mbarrier.init / .shared | mbarrier.init[.shared].b64 [%p], %n; |
| init / inval | nvvm.mbarrier.inval / .shared | mbarrier.inval[.shared].b64 [%p]; |
| init / inval | nvvm.fence.mbarrier.init | fence.mbarrier_init.release.cluster; |
| arrive (count only) | nvvm.mbarrier.arrive / .shared | mbarrier.arrive[.shared].b64 %r, [%p]; |
| arrive (count, no-complete) | nvvm.mbarrier.arrive.nocomplete / .shared | mbarrier.arrive.noComplete[.shared].b64 %r, [%p], %cnt; |
| arrive (expect-tx, local) | nvvm.mbarrier.arrive.expect_tx / .shared | mbarrier.arrive.expect_tx[.shared].b64 %r, [%p], %tx; |
| arrive (expect-tx, cluster) | nvvm.mbarrier.txn | mbarrier.expect_tx{.relaxed}.{cta… |
| wait (parity, blocking) | nvvm.mbarrier.wait | mbarrier.wait[.parity].b64 %r, [%p][, %par]; |
| wait (parity, polling) | nvvm.mbarrier.try_wait.parity.shared | mbarrier.try_wait.parity.shared.b64 %p, [%mbar], %ph, %ns; |
| wait (test) | nvvm.mbarrier.test.wait / .shared | mbarrier.test_wait[.shared].b64 %r, [%p], %token; |
The same 21 ops cover the address-space split: each ordinary form has a .shared variant chosen by the rewriter when the barrier lives in address space 3. The nvvm.mapa op is not in this family — it translates a shared pointer into a peer CTA's DSMEM address — but always appears upstream of a cluster mbarrier.txn because no other instruction reaches a remote mbarrier object.
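The address-space rule is small enough to write out. This is an illustrative sketch only: mbarrier_op_name is a hypothetical helper, and it assumes the op in question has both a generic and a .shared form (the table shows that try_wait.parity exists only as .shared).

```c
#include <assert.h>
#include <stdio.h>

#define ADDR_SPACE_SHARED 3u

/* Writes "<base>" or "<base>.shared" into out; returns the name length. */
int mbarrier_op_name(char *out, size_t cap, const char *base, unsigned addr_space) {
    assert(out != NULL && base != NULL && cap > 0);
    const char *suffix = (addr_space == ADDR_SPACE_SHARED) ? ".shared" : "";
    return snprintf(out, cap, "%s%s", base, suffix);
}
```

Called with "nvvm.mbarrier.init" and address space 3 it yields the .shared spelling; any other address space leaves the base name unchanged.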
Cluster Init Fence
When the barrier object crosses a CTA boundary, the producer must publish the initialisation before any peer can observe it. Hopper and Blackwell expose fence.mbarrier_init.release.cluster for exactly this purpose. The prelude pattern is:
%bar = memref.get_global @__shared_mbarrier : memref<...>
nvvm.mbarrier.init.shared %bar, %count : i32
nvvm.fence.mbarrier.init // cluster-visible publish
nvvm.cluster.arrive.relaxed { aligned }
nvvm.cluster.wait { aligned }
Pre-Hopper targets (the sm_80 family) skip the fence: they predate thread-block clusters, so there is no cross-CTA visibility to guarantee. The fence is also unnecessary for purely intra-CTA mbarriers; only the cross-CTA path needs it.
Diagnostic Strings
The mbarrier verifier and lowerings emit these messages, reproduced verbatim from the binary:
- " must be mbarrier barrier type, but got ": the typed-operand trait reports a non-mbarrier SSA type; the printed type name follows the trailing space.
- "Only acquire/relaxed ordering supported for MBarrierWaitOp." (and the parallel MBarrierWaitParityOp. / MBarrierTryWaitTimeLimitOp. / MBarrierTryWaitParityTimeLimitOp. variants): the memory-ordering attribute is outside the acquire / relaxed set.
- "using transaction mbarrier is not supported": a transaction mbarrier was used on a code path that has not been wired up to the txn family.
- "mbarrier has wait-like users, cannot share pipeline buffer.": the alias pass refuses to fold a buffer shared with a wait-side user.
- "Invalid TxnKind in MBarrierTransactionCTASpaceOp.": the transaction kind enum carries an unsupported value.
- Lowering-time failures: "failed to find smem buffer address for mbarrier", "failed to find address of omitted mbarrier" (note: the binary also carries the misspelled twin "failed to find address of ommited mbarrier"), "failed to init mbarrier" / "Failed to init mbarrier", "failed to setup mbarrier", "failed to get MBarrier object".
The lowering rejects mismatched-address-space combinations before the printer fires, so the final PTX template never has to recover from a malformed modifier word.
Cross-References
Buffer Assignment and Named-Barrier Binding documents the NamedBarrier slot pool (the 16 hardware slots tracked through a 32-bit slot vector) that this page disambiguates from mbarrier; both kinds end up in the same value record but are different mechanisms.
Pipeline and Tile Scheduler builds its producer/consumer step function on top of the state machine above; its try_wait.parity calls and arrive.expect_tx calls land verbatim in the NVVM ops listed here.
Seq-Bar and Block-Striped wraps the same primitives into an ordered ring with a phase cursor and uses arrives plus parity waits exactly as documented above.
WGMMA Emission Protocol — The Four-Op Sequence is the consumer side of the TMA-transaction kind: the WGMMA wait-group sequence runs after try_wait.parity succeeds on the producer's mbarrier.
tcgen05 Tensor Memory Model — Tensor Memory uses cluster-transaction mbarriers to coordinate the 2-CTA and 4-CTA TMEM staging copies that precede each MMA.
Cluster Sync and DSMEM Handshake extends the transaction kind across CTA boundaries with peer-CTA address translation and the cluster arrive/wait rendezvous.
Concurrency and Sync Semantics frames mbarrier.arrive.expect_tx and mbarrier.try_wait.parity as the implicit release/acquire pair at the heart of the producer/consumer pipeline ordering story.
Atomic, Warp, Sreg, Fence Emission lists the PTX-print form of every mbarrier op alongside the fence and warp-collective families.
tcgen05 / WGMMA / mbarrier / Cluster Emission covers the backend-side validation that runs once the NVVM op has been selected.
DSL to PTX End-to-End shows the transaction-kind mbarrier in flight — Stage 3 carries the producer/consumer rendezvous as async.pipeline.producer_commit / consumer_wait tokens, Stage 4 lowers them to nvvm.mbarrier.init.shared plus a parity-encoded try_wait.parity.shared, Stage 5 surfaces MBARRIER_TRY_WAIT_PARITY_SHARED in MIR, and Stage 6 emits the mbarrier.try_wait.parity.shared.b64 retry loop with explicit @!%p bra fallback.
WGMMA Emission Protocol
Abstract
WGMMA is Hopper's asynchronous warp-group matrix multiply. Four warps cooperate on one accumulator tile; the multiply itself is asynchronous against the issuing warp group and only becomes visible to subsequent reads through a wait-group barrier. The legal usage contract is a four-op emission protocol — fence, one or more async MMA instructions, commit-group, wait-group — and an accumulator-lifetime contract that says: an accumulator written by a still-in-flight WGMMA cannot be read until its group has been drained. Violations are silent data races, not verifier errors.
This page is the canonical reference for the protocol. It supersedes the duplicated lower-WGMMA snippets in tcgen05 / WGMMA / mbarrier / Cluster Emission, Lowering: nvgpu / gpu to NVVM, nvgpu Dialect Overview, and MMA Atoms SM70-SM120. Those pages now defer here for the emission sequence and the lifetime contract; they keep their own descriptor-construction, dialect-pattern, and verifier content.
WGMMA exists only on sm_90a. Blackwell removes it: SM100 onwards uses tcgen05.mma over tensor memory instead.
The Four-Op Sequence
A WGMMA region emits exactly one fence, one tile loop of MMA instructions, one commit, and one wait. The fence orders prior shared-memory writes against the first async MMA; the commit closes the current async group; the wait drains the group's accumulator results back into the warp group's visible state.
nvvm.wgmma.fence.aligned // 1. fence
%acc1 = nvvm.wgmma.mma_async %a0, %b0, %acc0 // 2. async MMA, tile 0
%acc2 = nvvm.wgmma.mma_async %a1, %b1, %acc1 // async MMA, tile 1
...
%accN = nvvm.wgmma.mma_async %ak, %bk, %accN-1 // async MMA, tile K-1
nvvm.wgmma.commit.group.sync.aligned // 3. commit
nvvm.wgmma.wait.group.sync.aligned %waitN // 4. wait
void emit_wgmma_region(WgmmaOp op, Rewriter *rw, int wait_n) {
    rw->create("nvvm.wgmma.fence.aligned");
    Value acc = op.accumulator();
    for (int m = 0; m < op.m / op.inst_m; ++m) {
        for (int k = 0; k < op.k / op.inst_k; ++k) {
            uint64_t da = advance_descriptor(op.a_desc, m, k, op.a_layout);
            uint64_t db = advance_descriptor(op.b_desc, m, k, op.b_layout);
            acc = rw->create("nvvm.wgmma.mma_async", {da, db, acc}, acc.getType());
        }
    }
    rw->create("nvvm.wgmma.commit.group.sync.aligned");
    rw->create("nvvm.wgmma.wait.group.sync.aligned", {rw->i32(wait_n)});
    rw->replace_op(op, acc);
}
The fence/commit/wait triple is non-negotiable. Skipping the fence races SMEM stores against the first async MMA. Skipping the commit means the wait drains the wrong group (a different in-flight group, or none at all). Skipping the wait reads stale or partial accumulator state.
Accumulator Lifetime
The accumulator returned by each mma_async is symbolic: the SSA value is defined, but its register contents are not yet visible to the warp group. Reads of that SSA value before its group has been drained by wait_group are silent data-race UB — the hardware does not trap, the MLIR verifier does not flag, and the result depends on the timing of the warp scheduler.
Two rules cover this:
- Any read of an accumulator written by an mma_async must follow a wait_group that drains that MMA's group.
- A wait_group N drains every group whose commit predates the wait by more than N commits.
The second rule is the source of the most common subtle bug. The operand N is the number of groups allowed to remain in flight after the wait returns, not the number of groups to wait for. wait_group 0 is the drain-everything case, and it is what most pipelined kernels emit at the tail of the WGMMA region.
A useful mental model: commit_group closes the current group and increments an in-flight counter. wait_group N blocks until the in-flight counter is at most N, then returns. Counter monotonicity means the wait drains every group older than the current cohort of N.
⚡ QUIRK — wait_group N is a "leave at most N in flight" gate
The natural reading of wait_group N is "wait for N groups to finish," and that reading is wrong. The operand is the maximum number of groups still allowed to be in flight after the wait returns. wait_group 0 drains every committed group; wait_group 1 leaves the most recent one running. Reimplementations that translate the parameter as a count-to-drain underflow the in-flight counter on the first call and either spin forever or release the accumulator while its MMA is still resident in the math pipe.
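The in-flight counter model translates into a few lines. This is a sketch under the mental model above, not a description of the hardware: WgmmaGroups and the model_* functions are hypothetical, and groups are identified by commit order starting at 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct WgmmaGroups {
    uint32_t committed;  /* groups closed by commit_group so far */
    uint32_t drained;    /* groups whose accumulators are now visible */
} WgmmaGroups;

void model_commit_group(WgmmaGroups *g) {
    g->committed++;
}

/* Blocks (conceptually) until at most n groups remain in flight.
   Returns how many groups this wait drained. */
uint32_t model_wait_group(WgmmaGroups *g, uint32_t n) {
    assert(g->drained <= g->committed);
    uint32_t in_flight = g->committed - g->drained;
    uint32_t to_drain = (in_flight > n) ? in_flight - n : 0;
    g->drained += to_drain;  /* oldest groups drain first */
    return to_drain;
}

/* An accumulator may be read only once its group has been drained. */
bool model_group_is_drained(const WgmmaGroups *g, uint32_t group_id) {
    return group_id < g->drained;
}
```

With three committed groups, model_wait_group(&g, 1) drains the two oldest and leaves the newest running; a second call with the same operand drains nothing, which is exactly the behaviour the count-to-drain misreading gets wrong.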
SMEM Descriptor Bit Layout
Operand B is always an SMEM descriptor — a packed 64-bit immediate-style word built once per operand before the tile loop, then threaded through the inline-asm fragment as an l-constraint i64 input. The same bit layout serves every Hopper WGMMA shape; the constructor is one routine fed by per-atom shape and swizzle metadata, not a family of per-shape variants. The canonical 64-bit packing layout is:
| Bits | Field | Width | Meaning |
|---|---|---|---|
| 0-13 | start_addr | 14 | Low 14 bits of SMEM byte offset right-shifted by 4 (16-byte alignment) |
| 14-29 | lbo | 16 | Leading byte offset between rows of a warp tile |
| 30-45 | sbo | 16 | Stride byte offset between consecutive warp tiles along K |
| 46-48 | base_offset | 3 | Per-CTA SMEM offset, scaled by 8 |
| 49-51 | reserved | 3 | Must be zero; constructor masks explicitly |
| 52-53 | swizzle_mode | 2 | 0 = none, 1 = 128B, 2 = 64B, 3 = 32B |
| 54-63 | pad | 10 | Unused |
The bit ranges come from the constructor in cute_nvgpu and are mirrored by the operand-layout verifier — see SMEM-Descriptor Construction for the same table from the dialect side.
typedef union WgmmaDescriptor {
    uint64_t raw;
    /* Bit-field order is implementation-defined in C; this assumes LSB-first allocation. */
    struct {
        uint64_t start_addr   : 14;  /* bits 0-13 */
        uint64_t lbo          : 16;  /* bits 14-29 */
        uint64_t sbo          : 16;  /* bits 30-45 */
        uint64_t base_offset  : 3;   /* bits 46-48 */
        uint64_t reserved     : 3;   /* bits 49-51 */
        uint64_t swizzle_mode : 2;   /* bits 52-53 */
        uint64_t pad          : 10;  /* bits 54-63 */
    };
} WgmmaDescriptor;
uint64_t make_smem_desc(uint32_t smem_byte_off,
                        uint16_t lbo, uint16_t sbo,
                        uint8_t base_offset, uint8_t swizzle_mode) {
    WgmmaDescriptor d = {0};                         /* clears raw, so reserved and pad start at zero */
    d.start_addr   = (smem_byte_off >> 4) & 0x3FFF;  /* keep low 14 bits */
    d.lbo          = lbo;
    d.sbo          = sbo;
    d.base_offset  = base_offset & 0x7;
    d.reserved     = 0;                              /* must be zero; stated explicitly */
    d.swizzle_mode = swizzle_mode & 0x3;             /* 0/1/2/3 = none/128B/64B/32B */
    return d.raw;
}
The constructor must mask the reserved field. Selection sometimes leaves uninitialised scratch bits in the upper half of the SDNode operand, and the WGMMA hardware does not ignore them: a non-zero reserved field is silently UB.
⚡ QUIRK — reserved bits in the SMEM descriptor must be zeroed
Bits 49–51 of the WGMMA SMEM descriptor are reserved, and Hopper does not treat them as don't-care. A non-zero value silently corrupts the operand fetch with no fault, no verifier message, and no PTX warning. The constructor masks the field explicitly because selection routinely leaves scratch bits live in the upper word of the SDNode. A descriptor that round-trips through naive union packing without an explicit mask boots and runs but produces garbage tiles intermittently.
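A mask-based guard avoids depending on bit-field layout. This is a minimal sketch that assumes the bit positions from the layout table; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_RESERVED_SHIFT 49
#define DESC_RESERVED_MASK  (UINT64_C(0x7) << DESC_RESERVED_SHIFT)  /* bits 49-51 */

/* True when the reserved field is zero, as the hardware requires. */
bool desc_reserved_is_clear(uint64_t raw) {
    return (raw & DESC_RESERVED_MASK) == 0;
}

/* Forces the reserved field to zero and leaves every other bit alone. */
uint64_t desc_clear_reserved(uint64_t raw) {
    uint64_t cleaned = raw & ~DESC_RESERVED_MASK;
    assert(desc_reserved_is_clear(cleaned));
    return cleaned;
}
```

Running desc_clear_reserved as the last step of descriptor construction makes the zeroing independent of how the other fields were packed.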
Worked Decode
Take the canonical Hopper choice: m64n128k16.f32.f16.f16 with swizzle = 128B, lbo = 2048, sbo = 0, base_offset = 0, and an SMEM byte offset whose (>> 4) value lands at 0x1000. The packed bit fields are:
| Field | Logical | Hex | Encoded position |
|---|---|---|---|
| start_addr | smem_off >> 4 = 0x1000 | 0x1000 | bits 0-13 |
| lbo | 2048 = 0x800 | 0x800 | bits 14-29 |
| sbo | 0 | 0x0 | bits 30-45 |
| base_offset | 0 | 0x0 | bits 46-48 |
| swizzle_mode | 128B | 1 | bits 52-53 |
Composing them:
uint64_t raw = 0;
raw |= ((uint64_t)0x1000) << 0; /* start_addr */
raw |= ((uint64_t)0x0800) << 14; /* lbo */
raw |= ((uint64_t)0x0000) << 30; /* sbo */
raw |= ((uint64_t)0x0000) << 46; /* base_offset */
raw |= ((uint64_t)0x0001) << 52; /* swizzle 128B */
/* raw == 0x0010_0000_0200_1000 */
The decode is the inverse: bits 0-13 hold 0x1000, bits 14-29 hold 0x800 (which shows up as 0x0200_0000 in the raw word because the field starts at bit 14), bits 52-53 hold 1, and every reserved bit is clear. A reimplementation that round-trips through decode_descriptor(0x0010_0000_0200_1000) produces the exact original logical-field set.
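The round trip can be written with plain shifts and masks. This is a sketch assuming the bit ranges from the layout table; DescriptorFields, desc_bits, and encode_descriptor are hypothetical names, and decode_descriptor is one possible shape for the routine named above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct DescriptorFields {
    uint32_t start_addr;    /* bits 0-13 */
    uint32_t lbo;           /* bits 14-29 */
    uint32_t sbo;           /* bits 30-45 */
    uint32_t base_offset;   /* bits 46-48 */
    uint32_t reserved;      /* bits 49-51 */
    uint32_t swizzle_mode;  /* bits 52-53 */
} DescriptorFields;

/* Extracts `width` bits starting at bit `lo`. */
static uint32_t desc_bits(uint64_t raw, unsigned lo, unsigned width) {
    return (uint32_t)((raw >> lo) & ((UINT64_C(1) << width) - 1));
}

DescriptorFields decode_descriptor(uint64_t raw) {
    DescriptorFields f;
    f.start_addr   = desc_bits(raw, 0, 14);
    f.lbo          = desc_bits(raw, 14, 16);
    f.sbo          = desc_bits(raw, 30, 16);
    f.base_offset  = desc_bits(raw, 46, 3);
    f.reserved     = desc_bits(raw, 49, 3);
    f.swizzle_mode = desc_bits(raw, 52, 2);
    return f;
}

uint64_t encode_descriptor(DescriptorFields f) {
    assert(f.reserved == 0);  /* reserved bits must stay zero */
    return ((uint64_t)(f.start_addr   & 0x3FFF) << 0)
         | ((uint64_t)(f.lbo          & 0xFFFF) << 14)
         | ((uint64_t)(f.sbo          & 0xFFFF) << 30)
         | ((uint64_t)(f.base_offset  & 0x7)    << 46)
         | ((uint64_t)(f.swizzle_mode & 0x3)    << 52);
}
```

Decoding the worked value and re-encoding the result reproduces the same 64-bit word, which is a cheap self-check for any reimplementation of the constructor.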
The swizzle table the constructor consults:
| swizzle_mode | Row width | Typical use |
|---|---|---|
| 0 | none | Plain row-major SMEM tile |
| 1 | 128 B | Canonical Hopper choice for full-width A and B tiles |
| 2 | 64 B | Smaller tensor-core operand (sub-canonical tile) |
| 3 | 32 B | Sub-tile WGMMA |
The 128 B mode is the canonical choice for m64n{128, 192, 256}k{8, 16, 32} tiles. The 64 B and 32 B modes kick in when the operand element width or warp-tile footprint is smaller than a canonical 128 B row.
Descriptor Advancement
When the WGMMA region iterates over output tiles, descriptors advance by the per-tile byte stride converted to 16-byte units:
uint64_t advance_descriptor(uint64_t desc, int m_tile, int k_tile, Layout layout) {
    uint64_t byte_offset = layout_byte_offset(layout, m_tile, k_tile);
    return desc + (byte_offset >> 4);   /* start_addr counts 16-byte units */
}
The advancement adds to start_addr and may carry through into the lbo field if the M or K extent crosses a 14-bit boundary — the field aliasing is intentional, since start_addr and lbo together carry the SMEM offset for the next warp tile. A reimplementation that forgets the >> 4 advances the descriptor 16x too far on the first tile and silently aliases distant SMEM regions on subsequent tiles. The verifier does not catch it because the descriptor field is opaque from the dialect's point of view.
⚡ QUIRK — descriptor advancement is in 16-byte units, not bytes
The SMEM address inside the descriptor is pre-shifted right by 4, so start_addr counts 16-byte chunks rather than bytes. Per-tile advancement must apply the same >> 4 to the byte stride before adding it to the descriptor word. The MLIR layer treats the descriptor as opaque i64, so dropping the shift compiles cleanly, passes the verifier, and silently walks 16x past the intended tile boundary on the very first iteration.
Operand A may be either a register fragment or an SMEM descriptor, controlled by a per-atom a_in_rf predicate. When A rides registers, the descriptor advancement applies only to B; when A rides SMEM, both operands advance using their own layouts.
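The unit conversion and the carry into lbo can both be seen with concrete numbers. This is a sketch that takes the byte offset as a direct argument instead of deriving it from a layout; advance_by_bytes and desc_start_addr are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define START_ADDR_MASK UINT64_C(0x3FFF)  /* bits 0-13 */

/* Advances a descriptor by a byte offset, converting to 16-byte units. */
uint64_t advance_by_bytes(uint64_t desc, uint64_t byte_offset) {
    assert((byte_offset & 0xF) == 0);  /* offsets are 16-byte aligned */
    return desc + (byte_offset >> 4);
}

/* Reads back the start_addr field, still in 16-byte units. */
uint32_t desc_start_addr(uint64_t desc) {
    return (uint32_t)(desc & START_ADDR_MASK);
}
```

Advancing the worked descriptor by 512 bytes moves start_addr from 0x1000 to 0x1020; adding the raw 512 instead would move it to 0x1200, sixteen times too far. Advancing a descriptor whose start_addr is 0x3FFF by 16 bytes wraps the field to 0 and increments bit 14, which is the carry into lbo described above.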
Inline-Asm Template and Constraint String
For SM90 WGMMA atoms that bypass the NVVM op and emit PTX directly, the inline-asm template carries the constraint string =f,=r,l,r,n in argument order:
| Constraint | Operand | Role |
|---|---|---|
| =f | output | each FP register in the accumulator fragment |
| =r | output | the i32 register that captures the scale-D return |
| l | input | the i64 descriptor input (operand B, or A if SMEM-resident) |
| r | input | the i32 scale input that toggles accumulator update |
| n | input | the compile-time-known predicate that conditions the MMA |
The =f block expands to as many lanes as the accumulator fragment carries: M * N / 128 values per thread for FP32 accumulators (the 64 × N tile spread over the 128 threads of the warp group), varying by atom. The l slot carries the WGMMA descriptor word the SMEM-descriptor constructor produced; when A is also SMEM-resident, a second l input precedes it.
wgmma.mma_async.sync.aligned.m64nXkY.<acc>.<a>.<b>
{ %f0, %f1, ... }, // accumulator fragment (out)
%ra, // A operand (descriptor or RF)
%rb, // B descriptor
%scale, // scale-D selector
1, 1, // transpose flags (compile-time)
%la, %lb // SMEM descriptors when A in SMEM
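The width of the accumulator output block follows from simple arithmetic. The sketch below assumes the 64 × N tile is split evenly over the 128 threads of a warp group and that 16-bit accumulators are packed two per 32-bit register; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_GROUP_THREADS 128u  /* 4 warps x 32 threads */
#define WGMMA_M 64u

/* Accumulator values each thread owns for an m64nN tile. */
uint32_t wgmma_acc_values_per_thread(uint32_t n) {
    assert(n >= 8 && n <= 256 && n % 8 == 0);
    return WGMMA_M * n / WARP_GROUP_THREADS;
}

/* 32-bit registers per thread; assumes 16-bit accumulators are packed in pairs. */
uint32_t wgmma_acc_regs_per_thread(uint32_t n, uint32_t acc_bits) {
    assert(acc_bits == 16 || acc_bits == 32);
    return wgmma_acc_values_per_thread(n) * acc_bits / 32u;
}
```

For the m64n128 FP32 case this gives 64 registers per thread, which is the number of output operands the accumulator block has to list.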
Scale-D
The scale-D operand is a single boolean: 0 means "zero the accumulator before adding the MMA result", 1 means "add to the existing accumulator". The dialect-side WgmmaOp exposes it through a scale_d attribute; the lowering routes it into the r input of the inline-asm template.
The mainloop pattern is to issue the first WGMMA with scale_d = 0 (zeroing the tile) and every subsequent K iteration with scale_d = 1 (accumulating). Leaving scale_d = 1 on the leading WGMMA skips the zeroing; the kernel then accumulates onto whatever values the destination registers happened to hold at warp-group start, which is usually garbage.
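A scalar model makes the scale-D pattern concrete. This is a sketch with one float standing in for the whole accumulator fragment; wgmma_scale_d_model and mainloop_model are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* scale_d == 0: D = A*B (old accumulator ignored); scale_d == 1: D = D + A*B. */
float wgmma_scale_d_model(float acc, float product, int scale_d) {
    assert(scale_d == 0 || scale_d == 1);
    return scale_d ? acc + product : product;
}

/* Runs the K loop: first tile overwrites, later tiles accumulate. */
float mainloop_model(float stale_register, const float *products, size_t k_tiles) {
    float acc = stale_register;  /* whatever the register held beforehand */
    for (size_t k = 0; k < k_tiles; ++k) {
        int scale_d = (k == 0) ? 0 : 1;
        acc = wgmma_scale_d_model(acc, products[k], scale_d);
    }
    return acc;
}
```

Because the first iteration passes scale_d = 0, the stale register value never reaches the result, whatever it was.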
Operand Residency
Operand B is always an SMEM descriptor. There is no register-resident-B WGMMA variant. The descriptor encodes both the SMEM base address (low 14 bits, in 16-byte units) and the leading/stride byte offsets that pin the 2D tile shape into SMEM.
Operand A is one of two residencies:
- A register fragment, when the producing pipeline has staged A into the warp group's registers (typical for warp-specialized mainloops where A is small and stays close to the MMA).
- An SMEM descriptor, with the same construction rules as operand B (used when A is large enough to want SMEM staging or when the producer is a TMA load).
The accumulator stays in registers in every WGMMA variant. The destination is the warp group's register file; that is also why each mma_async returns a typed accumulator SSA value the rest of the IR can thread through subsequent MMAs in the same group.
Per-Shape Lattice
WGMMA fixes M at 64, the warp-group dimension (4 warps × 16 rows each = 64 rows of output per instruction). N steps in multiples of 8 up to 256, and K is fixed per input element type at 256 / elem_bits. The per-input-family availability is:
| Input family | Accumulator | Legal (M, N, K) shapes | K |
|---|---|---|---|
| f16 × f16 | f16 or f32 | {64} × {8, 16, 24, ..., 256} × {16} | 16 |
| bf16 × bf16 | f32 | {64} × {8, 16, 24, ..., 256} × {16} | 16 |
| tf32 × tf32 | f32 | {64} × {8, 16, 24, ..., 256} × {8} | 8 |
| e4m3 × e4m3 (FP8) | f32 | {64} × {8, 16, 24, ..., 256} × {32} | 32 |
| e5m2 × e5m2 (FP8) | f32 | {64} × {8, 16, 24, ..., 256} × {32} | 32 |
| Mixed e4m3 × e5m2 | f32 | {64} × {8, 16, 24, ..., 256} × {32} | 32 |
| s8 × s8 / u8 × u8 | s32 | {64} × {8, 16, 24, ..., 256} × {32} | 32 |
| s4 × s4 / u4 × u4 | s32 | {64} × {8, 16, 24, ..., 256} × {64} | 64 |
| b1 × b1 (popcount) | s32 | {64} × {8, 16, 24, ..., 256} × {256} | 256 |
The K column follows the 256 / elem_bits rule for every family, b1 included (256 / 1 = 256). What sets b1 apart is the operation: it rides a .xor.popc or .and.popc reduction over 256 bits of K, and it is the only WGMMA variant that does not multiply-accumulate in the conventional sense.
The N step of 8 is the WGMMA hardware constraint on the output tile size — there is no N=12 or N=20 variant. Lowering rejects any N that is not a multiple of 8 with "WGMMA N must be a multiple of 8". The K column entry is a hard match — the lowering does not synthesise a K=24 f16 WGMMA by issuing one K=16 and one K=8 instruction; the K=8 form is tf32-only, and the K extent for f16 must be exactly 16 per instruction.
The largest single-instruction tile is m64n256k16.f16 for FP16 inputs (64 × 256 = 16384 output elements per warp-group instruction) and m64n256k32.e4m3 for FP8 (the same 16384 outputs over twice the K extent). Lowering tiles a logical matmul into per-instruction tiles by stepping along N in chunks bounded by the largest legal N and along K in chunks of the per-family K column; the M axis stays at 64 for the entire warp group's lifetime and the loop nest threads tiles into the four-op sequence one at a time.
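The lattice rules reduce to a short predicate. This is a sketch of the constraints as stated in this section (M fixed at 64, N a multiple of 8 up to 256, K equal to 256 / elem_bits); the function names are hypothetical and the check ignores per-family accumulator types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-instruction K extent for an input element width in bits. */
uint32_t wgmma_k_for_elem_bits(uint32_t elem_bits) {
    assert(elem_bits > 0 && 256 % elem_bits == 0);
    return 256 / elem_bits;
}

/* True when (m, n, k) is a legal single-instruction shape for this element width. */
bool wgmma_shape_is_legal(uint32_t m, uint32_t n, uint32_t k, uint32_t elem_bits) {
    return m == 64
        && n >= 8 && n <= 256 && n % 8 == 0
        && k == wgmma_k_for_elem_bits(elem_bits);
}

/* Output elements produced by one m64nN instruction. */
uint32_t wgmma_outputs_per_instruction(uint32_t n) {
    return 64 * n;
}
```

A K=8 request with 16-bit inputs fails the predicate, matching the rule that lowering does not synthesise off-lattice K extents from smaller instructions.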
For comparison against earlier and later tiers, see Matmul Progression by SM for the cross-architecture shape lattice that places WGMMA between Ampere's m16n8k* register MMA and Blackwell's tcgen05.mma.
SM Gating
WGMMA is sm_90a only. The architecture-conditional suffix matters: plain sm_90 rejects WGMMA at NVVM verification. The dialect exposes WGMMA atoms through cute_nvgpu.sm90.mma and lowering rejects them on every other target.
Blackwell removes WGMMA. SM100 and SM103 use tcgen05.mma over tensor memory; SM120 and SM121 (consumer Blackwell) use a synchronous mma.sync.aligned with explicit per-operand scale factors. SM110 (Jetson Thor) is enumerated as a target tier but the dialect registers no SM110-specific MMA atom — kernels targeting sm_110 fall through to the universal-FMA atom rather than to any WGMMA or tcgen05 path. All three post-Hopper replacements have different operand-residency models — see Matmul Progression by SM for the cross-architecture story.
Cross-References
Matmul Progression by SM places WGMMA in the broader SM70-to-SM121 lineage and explains what replaced it on each generation.
tcgen05 Tensor Memory Model is the Blackwell successor; the 4-op protocol changes because the accumulator now lives in TMEM.
mbarrier State Machine defines the transaction-barrier kind that producers use to publish WGMMA completion when a downstream pipeline stage needs to observe it.
MMA Atoms SM70-SM120 documents the WGMMA SMEM descriptor bit layout and the per-element-type GMMA-K table that drives advance_descriptor.
nvgpu Dialect Overview shows how nvgpu.warpgroup.mma lowers into this protocol.
Lowering: nvgpu / gpu to NVVM is the dialect-conversion path that materialises the four-op sequence.
tcgen05 / WGMMA / mbarrier / Cluster Emission covers the backend-side validation of the selected WGMMA machine form.
DSL to PTX End-to-End walks the four-op WGMMA sequence in context — Stage 3 shows the nv_tileas.dot carrying the sm90_wgmma_m64n128k16_f32_f16_f16 atom, Stage 4 expands it to the fence / mma_async / commit_group / wait_group NVVM quartet, Stage 5 renders the MIR WGMMA_* opcodes, and Stage 6 emits the matching wgmma.* PTX directives for one steady-state K-iteration.
Blackwell 2-CTA + 4-CTA MMA
Abstract
The cluster tier — covered end-to-end in GPU Execution Model — is the prerequisite that makes 2-CTA and 4-CTA cooperative MMA legal. This page documents the copy-side fan-out that distributes operand tiles across cluster CTAs before the MMA fires; the cluster shape it depends on is established by the .cluster_dim directive at the kernel header.
Blackwell tensor-core lowering separates the cooperative copy from the matrix instruction that consumes the copied tile. The SMEM-to-TMEM staging copy can be single-CTA, 2-CTA, or 4-CTA. The matching tcgen05.mma instruction carries only the MMA-side group encodings it understands; the 4-CTA fan-out lives on the copy side, where the A operand is distributed across a CTA cluster before the MMA consumes each CTA's local slice.
Tileiras lowers the cute_nvgpu.atom.make_s2t_copy atom through one shared MLIR rewrite path. That path builds a cute.tiled.copy, optionally guards it with an scf.if, and later lowers the copy to the tcgen05.cp family. The sibling IMMA and WGMMA atom paths do not read the cluster CTA-rank special register; rank-aware partitioning is specific to S2T copy lowering.
The cluster fan-out lives on the copy side, not the MMA side. PTX gives tcgen05.mma only cta_group::1 and cta_group::2; there is no cta_group::4 MMA encoding. The 4-CTA shape must therefore be a copy-time partition that produces four already-sliced TMEM destinations, and the MMA that follows is a plain single-CTA matrix instruction over the per-CTA slice. A reimplementation that puts the fan-out on the MMA side will fail to encode anything in PTX. The DSMEM handshake described in Cluster Sync and DSMEM Handshake is the synchronisation companion of this copy lowering: the multicast S2T copy advertises its transaction bytes to peer CTAs through exactly that handshake.
⚡ QUIRK — no cta_group::4 MMA encoding
PTX exposes cta_group::1 and cta_group::2 on tcgen05.mma and nothing else: the 2-bit cta_group field has no 4 slot. The 4-CTA shape is purely a copy-side partition that produces four pre-sliced TMEM destinations, and the matrix instruction over each slice is single-CTA. Lowerings that try to encode the cluster fan-out into the MMA op fail to emit any legal PTX. The 4-CTA story is the copy lowering plus the rank-parity gate, not an MMA flag.
Copy-Side Ownership
The S2T copy rewrite performs four jobs:
- Resolve the source and destination tile layouts.
- Initialize or find the mbarrier that protects the asynchronous copy.
- Partition the source and TMEM destination according to the CTA-group shape.
- Emit the cute.tiled.copy and return the async token expected by the surrounding pipeline.
The AtomS2tCopyShape properties carry the group width through two fields: a numeric cta_group value from {1, 2, 4} and a one-based enum selector used by the shape-dispatch table. They co-vary in observed inputs, but the lowering reads them independently. The numeric field controls mbarrier and predicate shortcuts; the enum controls the multicast width selected by the layout-composition branch.
Rank Predicate
The multi-CTA gate reads nvvm.read.ptx.sreg.cluster.ctarank, computes the rank modulo the multicast width, masks the low bit, converts the result into a warp-uniform predicate, and uses that predicate to guard the copy body. In the 4-CTA case, ranks with odd low bits issue the multicast copy while peer CTAs receive their partition through the cluster copy semantics.
The 2-CTA case differs: it uses a direct uniform-true predicate and relies on the downstream tcgen05.cp 2-CTA handshake to handle the pair. The single-CTA case shares some lowering scaffolding with the 2-CTA case, but it is not a cluster partition — only one CTA participates.
static Value *build_s2t_copy_predicate(Rewriter *rewriter, CtaGroup group) {
    /* Single-CTA and 2-CTA copies run unconditionally. */
    if (group == CTA_GROUP_1 || group == CTA_GROUP_2) {
        return constant_true_i1(rewriter);
    }
    /* 4-CTA: gate on the low bit of (rank mod multicast width). */
    int32_t rank = nvvm_read_cluster_ctarank(rewriter);
    int32_t rem = arith_remsi(rank, (int32_t)group);
    int32_t low_bit = arith_andi(rem, 1);
    return make_warp_uniform_i1(rewriter, low_bit != 0);
}
The make_warp_uniform wrap is structural, not cosmetic. The cluster.ctarank SReg is per-CTA — every thread in a CTA reads the same value — but the rewrite emits the predicate at warp scope. Without the warp-uniform wrapper the verifier rejects the predicate as a control-flow operand that could diverge between lanes; with it, every lane in the producing warp agrees on the predicate value, and the downstream tcgen05.cp instruction (which requires warp-uniform predicates by ISA contract) accepts the operand. The wrapper is a no-op at runtime — it tells the verifier and downstream codegen that the SSA value is provably warp-uniform.
⚡ QUIRK —
make_warp_uniform is a verifier-only no-op. The wrapper emits no machine code at runtime; it exists purely to label the SSA value as warp-uniform so the verifier and downstream tcgen05.cp lowering accept the predicate. Removing it produces no behavioural difference at execution time but breaks the IR contract: the verifier rejects the copy and codegen never reaches PTX. Treat it as a structural type tag, not as an optimisation hint that can be dropped.
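The gate described above can be modelled on the host to see which ranks end up issuing the copy. This is a minimal sketch, not tileiras code: `s2t_copy_issues` and the `CtaGroup` enum values are hypothetical names chosen for illustration, and the warp-uniform wrapper is omitted because it has no runtime effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of the issuing predicate (illustration only). */
typedef enum { CTA_GROUP_1 = 1, CTA_GROUP_2 = 2, CTA_GROUP_4 = 4 } CtaGroup;

static bool s2t_copy_issues(int32_t cluster_ctarank, CtaGroup group) {
    assert(cluster_ctarank >= 0);
    if (group == CTA_GROUP_1 || group == CTA_GROUP_2) {
        return true; /* uniform-true predicate, no rank test */
    }
    /* 4-CTA shape: rank modulo the multicast width, then the low bit. */
    int32_t rem = cluster_ctarank % (int32_t)group;
    return (rem & 1) != 0; /* ranks with an odd low bit issue the copy */
}
```

Under this model a 4-CTA cluster has ranks 1 and 3 issuing the multicast copy, while ranks 0 and 2 only receive their partition.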
Cluster Sibling Pairing
The 2-CTA cluster MMA pairs each CTA with its sibling through the cluster.ctarank XOR 1 peer-selection idiom. The XOR maps rank 0 to peer 1, rank 1 to peer 0, rank 2 to peer 3, rank 3 to peer 2 — every even-ranked CTA pairs with the odd-ranked CTA one slot above it.
int32_t peer_rank = nvvm_read_cluster_ctarank(rewriter) ^ 1;
The peer rank then feeds into the multicast destination address for the cooperative tcgen05.cp copy. The DSMEM handshake covered in Cluster Sync and DSMEM Handshake is what makes the cross-CTA address dereference legal — the multicast copy advertises its transaction bytes to the peer CTA's mbarrier through the cluster transaction protocol before the destination address becomes readable on the peer side.
The 4-CTA group-mapping partitions the cluster into 2-CTA groups by rank parity:
int32_t group_id = nvvm_read_cluster_ctarank(rewriter) % 2;
Group 0 holds CTAs at even ranks (0, 2, ...), group 1 holds the odd-ranked CTAs. Inside each group the same XOR 1 sibling rule applies. The two groups never share TMEM destinations — the partition_D step splits the destination into four quarter slices and gives each group two adjacent quarters to fill cooperatively.
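The two rank expressions quoted in this section can be evaluated directly on the host. The sketch below only restates them as plain functions; `sibling_rank` and `parity_group` are hypothetical helper names, not symbols from the binary.

```c
#include <assert.h>
#include <stdint.h>

/* cluster.ctarank XOR 1: the peer-selection idiom quoted above. */
static int32_t sibling_rank(int32_t cluster_ctarank) {
    assert(cluster_ctarank >= 0);
    return cluster_ctarank ^ 1;
}

/* cluster.ctarank % 2: the rank-parity group id quoted above. */
static int32_t parity_group(int32_t cluster_ctarank) {
    assert(cluster_ctarank >= 0);
    return cluster_ctarank % 2;
}
```

Applying `sibling_rank` twice returns the original rank, which is the property that makes the pairing symmetric.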
CTA Group Control Word
The cta_group field is a 2-bit bitfield inside the tcgen05.mma instruction's control word: encoding 01 selects single-CTA MMA, 10 selects the 2-CTA cooperative MMA. The encoding has no 4-CTA value — the hardware would have to interpret the remaining slot 11 as either reserved or as something it does not implement, and the PTX ISA assigns it neither. The structural consequence is what makes the 4-CTA shape a copy-side partition rather than an MMA-side encoding: the producer's cta_group bits select 1 or 2, the matrix instruction runs over its already-partitioned per-CTA slice, and the cluster fan-out lives entirely on the tcgen05.cp side that fed the slices.
The cta_group bits sit alongside the rest of the Tcgen05MmaKind enum in the instruction's kind-and-modifier control word; the corrected bitfield layout (after the 2-bit cta_group field was disambiguated from the surrounding kind bits) is the same control word the modifier-cascade canonicaliser threads through every tcgen05.mma emission.
CTA-Group Mapping
Combining the enum selector and the numeric group gives the runtime mapping:
| Shape enum | Numeric cta_group | Copy lowering | MMA-side meaning |
|---|---|---|---|
| 1 | 1 | Single-CTA S2T copy; no real cluster partition. | Ordinary single-CTA MMA input slice. |
| 2 | 2 | 2-CTA cooperative S2T copy with uniform predicate. | Two CTAs co-own opposite halves. |
| 3 | 4 | 4-CTA S2T copy with rank-based issuing predicate. | MMA consumes already-partitioned slices. |
Destination partitioning is part of the copy layout. In the 4-CTA case, partition_D splits the TMEM destination into per-CTA quarter slices before the copy is emitted. The downstream MMA therefore needs no cta_group::4 control word: by the time it runs, each participating CTA already sees the slice it owns.
tcgen05 and the Tensor Memory Model
Abstract
Blackwell introduces tensor memory (TMEM) as a third on-chip memory class alongside registers and shared memory. TMEM is per-SM, addressed in a 128-row dense grid, and reachable only from a small family of asynchronous instructions. The tcgen05 instruction family is that small family: matrix multiply, sparse multiply, weight-stationary multiply, and the block-scaled microscale variants all consume TMEM operands and write TMEM accumulators. This page documents the tensor memory model and the tcgen05.mma instruction family that consumes it. The model applies to SM100 and SM103 only: SM110 (Jetson Thor) is enumerated as a Blackwell-era target but registers no tcgen05 atom, and SM120 / SM121 (consumer Blackwell) drop TMEM entirely in favour of register-resident block-scaled MMA.
This page is the canonical reference for the model and the variant taxonomy. It supersedes the scattered tcgen05 paragraphs in tcgen05 / WGMMA / mbarrier / Cluster Emission (the validation snippet plus control-word table) and Mode Pattern Verifiers (the 13-diagnostic kind-word verifier). Those pages keep their backend-validation and verifier-diagnostic content; the structural model lives here.
Tensor Memory
TMEM is per-SM, not per-CTA. A kernel that wants TMEM allocates from the SM's TMEM region through nvvm.tcgen05.alloc, which returns a handle that subsequent tcgen05 instructions consume as a 32-bit base address plus row/col descriptor. The allocator is shared across all warps on the SM — every warp in every resident CTA sees the same TMEM address space, but the allocation contract pins each region to one logical owner.
The grain is one 128-bit lane, organised into a 128-row grid where rows index along the M dimension of an MMA tile and columns index along K (or N for the accumulator). A WGMMA-style MMA tile of m64n128k16.fp16 occupies a contiguous TMEM region spanning 64 rows and the K-derived column count; the allocator hands back the base row index, and the MMA operand encoding adds the column offset.
Only tcgen05 instructions can read or write TMEM. There is no ldg to TMEM, no cp.async to TMEM directly, no register-to-TMEM move outside the tcgen05 family. Staging into TMEM happens through tcgen05.cp, the copy variant that moves data from SMEM to TMEM, and through tcgen05.st, which stores register data into TMEM. Staging out of TMEM into registers happens through tcgen05.ld. The model is "TMEM is the accumulator and operand reservoir, and only the tcgen05 family talks to it."
The instruction family also gates the 2-CTA cooperative MMA path. When two CTAs in a cluster cooperate on one MMA tile, they share TMEM rows: CTA 0 holds rows [0..M/2) and CTA 1 holds rows [M/2..M). The cooperating MMA emits a cta_group::2 opcode that pairs the two halves at execute time. The 4-CTA copy variant exists only on the copy side — the MMA encoding has no cta_group::4 form, and Blackwell's 4-CTA semantics is a copy-time partition into already-sliced TMEM destinations that ordinary single-CTA MMAs then consume.
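The row ownership in the 2-CTA case can be written out as a small helper. This is a sketch under the assumptions stated in the paragraph above (the split is along M, and M is even); `RowRange` and `cta_pair_rows` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open row interval [begin, end) owned by one CTA of the pair. */
typedef struct RowRange {
    uint32_t begin;
    uint32_t end;
} RowRange;

/* cta_in_pair is 0 or 1; m is the M extent of the shared MMA tile. */
static RowRange cta_pair_rows(uint32_t m, uint32_t cta_in_pair) {
    assert(cta_in_pair < 2);
    assert(m % 2 == 0);
    RowRange r;
    r.begin = cta_in_pair * (m / 2); /* CTA 0 starts at 0, CTA 1 at M/2 */
    r.end = r.begin + m / 2;
    return r;
}
```

For M = 128 this gives rows [0, 64) to CTA 0 and rows [64, 128) to CTA 1.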
Allocation Grain and Lifetime
The TMEM allocator works in 128-row columns. The minimum allocation unit is one column of 128 rows × 16 bytes = 2 KiB; columns extend along the N axis of the accumulator (or along K for an operand region). A single SM has 256 columns of TMEM, so 256 columns × 2 KiB = 512 KiB in total. The allocator hands back a (base_column, num_columns) pair as a 32-bit handle:
typedef struct TmemHandle {
uint16_t base_column; /* 0 .. 255, granularity 1 column = 128 rows × 16 B */
uint16_t num_columns; /* 1 .. 256 - base_column */
} TmemHandle;
For a typical m64n128k16.f32.f16.f16 MMA, the accumulator region needs 64 rows × 128 columns of f32 = 64 × 128 × 4 = 32 KiB of TMEM, which lands at 16 columns of the 128-row grid (each column is 2 KiB, so 32 KiB / 2 KiB = 16 columns). A weight-stationary A region for the same tile needs another 16 columns of A residency (also 64 × 16 × 2 = 2 KiB per K-step × 16 K-steps), and a per-block scale region for the block-scaled variants needs another 1-4 columns depending on vecSize.
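The column arithmetic above can be reproduced with a few lines of C. This is a sketch of the sizing model as this page states it (128 rows × 16 bytes per column, 256 columns per SM); the constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sizing constants taken from the text above. */
enum {
    TMEM_ROWS = 128,
    TMEM_CELL_BYTES = 16,
    TMEM_COLUMN_BYTES = TMEM_ROWS * TMEM_CELL_BYTES, /* 2 KiB */
    TMEM_COLUMNS = 256
};

/* Bytes occupied by a rows x cols tile of elem_bytes-wide elements. */
static uint32_t tile_bytes(uint32_t rows, uint32_t cols, uint32_t elem_bytes) {
    return rows * cols * elem_bytes;
}

/* Number of 2 KiB columns needed to hold the given byte count, rounded up. */
static uint32_t tmem_columns_for_bytes(uint32_t bytes) {
    assert(bytes > 0);
    return (bytes + TMEM_COLUMN_BYTES - 1u) / TMEM_COLUMN_BYTES;
}
```

With these helpers, the 64 × 128 f32 accumulator and the sixteen 64 × 16 f16 K-steps of A each come out at 16 columns, matching the figures in the paragraph.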
Allocation is statically scoped to the enclosing dialect operation. The nvvm.tcgen05.alloc op returns the handle as an SSA value; every tcgen05.mma op that consumes the handle pins the allocator's region for its issue lifetime; the matching nvvm.tcgen05.dealloc op (emitted at the end of the enclosing tile-scheduler scope) returns the columns to the free pool. The dialect does not allow TMEM regions to outlive their enclosing scope — there is no global TMEM heap, and the kernel cannot pass a TMEM handle out of the function it was allocated in. This is by construction: TMEM does not survive the SM reset that occurs between CTAs scheduled on the same SM, so any global handle would dangle on every CTA boundary.
The lifetime contract has one practical consequence: a kernel that wants to chain MMAs across iterations of an outer loop must keep the TMEM allocation alive across the loop body, which means the allocator op must dominate every MMA op in the loop. Lowering does this by hoisting tcgen05.alloc out of the outer loop to the function entry and matching tcgen05.dealloc to the function exit — see the consumer-side lifetime annotations in the tcgen05 / WGMMA / mbarrier / Cluster Emission end-to-end lowering.
The tcgen05 Variant Taxonomy
The tcgen05.mma family covers ten machine variants. Each combines an MMA kind (dense, sparse, block-scaled, sparse block-scaled) with optional weight-stationary mode and CTA-group selector. The lowering packs the choice into a 9-bit kind word; the backend verifier rejects illegal combinations before machine selection.
| Variant | CTA group | Sparsity | Block scale | Weight-stationary |
|---|---|---|---|---|
| dense MMA | 1 or 2 | no | no | no |
| sparse MMA | 1 or 2 | yes | no | no |
| weight-stationary dense | 1 | no | no | yes |
| weight-stationary sparse | 1 | yes | no | yes |
| block-scaled dense | 1 or 2 | no | yes | no |
| block-scaled sparse | 1 or 2 | yes | yes | no |
| warp-specialized dense | 1 | no | no | yes (alias) |
| warp-specialized sparse | 1 | yes | no | yes (alias) |
| warp-specialized block-scaled | 1 | no | yes | yes (alias) |
| warp-specialized sparse block-scaled | 1 | yes | yes | yes (alias) |
Weight-stationary mode reuses bit 0 of the kind word as a 1-bit predicate; the warp-specialized variants are weight-stationary at cta_group::1. The verifier rejects cta_group::2 whenever the weight-stationary bit is set, and rejects weight-stationary mode for the wider mxf8f6f4 and FP4 input families.
Per-Variant Operand Contracts
Every tcgen05.mma variant lowers to a five-operand machine form: D destination, A operand, B operand, control word, and optional metadata or scale-factor operands. The residency of each operand is fixed per variant and the verifier rejects any mismatch. The contract is:
| Variant | A operand | B operand | C / D operand | Metadata | Scale-factor operand |
|---|---|---|---|---|---|
| dense MMA (kind::f16, kind::tf32, kind::i8, kind::f8f6f4) | SMEM desc or TMEM | SMEM desc | TMEM | — | — |
| sparse MMA (.sp) | TMEM (halved value region) | SMEM desc | TMEM | TMEM (u32 selector stream) | — |
| weight-stationary dense (.ws) | TMEM (pinned across K) | SMEM desc | TMEM | — | — |
| weight-stationary sparse (.ws.sp) | TMEM (pinned, halved) | SMEM desc | TMEM | TMEM | — |
| block-scaled dense (kind::mxf8f6f4, kind::mxf4, kind::mxf4nvf4) | SMEM desc or TMEM | SMEM desc | TMEM | — | SFA, SFB in TMEM (E8M0 or E4M3FN) |
| block-scaled sparse (.sp + block-scale) | TMEM (halved) | SMEM desc | TMEM | TMEM | SFA, SFB in TMEM |
| warp-specialized dense (.ws, alias) | TMEM (pinned) | SMEM desc | TMEM | — | — |
| warp-specialized sparse (.ws.sp, alias) | TMEM (pinned, halved) | SMEM desc | TMEM | TMEM | — |
| warp-specialized block-scaled (.ws + block-scale) | TMEM (pinned) | SMEM desc | TMEM | — | SFA, SFB in TMEM |
| warp-specialized sparse block-scaled (.ws.sp + block-scale) | TMEM (pinned, halved) | SMEM desc | TMEM | TMEM | SFA, SFB in TMEM |
Two patterns repeat across the variant table:
- B is always an SMEM descriptor. There is no TMEM-resident B variant. The descriptor format is identical to the WGMMA Hopper descriptor — same 64-bit packing, same swizzle codes, same alignment rules. See WGMMA SMEM Descriptor Bit Layout.
- C and D are the same TMEM region. The MMA reads C and writes D into the same TMEM region in-place; the dialect-level distinction is bookkeeping. The accumulator-zero predicate (the analogue of WGMMA scale_d) lives in the control word's scale_input_acc bit.
The variant choice is driven by the source-language idiom:
| Source-language pattern | Selected variant |
|---|---|
| Plain matmul mainloop (no operand reuse) | dense MMA |
| Structurally-sparse weight matrix (50%/2:4 sparsity) | sparse MMA |
| Inner loop reuses the same A operand across many invocations | weight-stationary dense |
| FP8 / FP6 / FP4 microscale matmul | block-scaled dense |
| MoE / multi-LoRA where the A operand is shared across experts | warp-specialized dense |
| Microscale matmul with structurally-sparse activations | block-scaled sparse |
The .ws and warp-specialized aliases differ in scheduling intent but compile to the same machine opcode at cta_group::1. Tileiras picks .ws when the inner loop is a plain K-loop reusing A, and picks the warp-specialized form when the producer warp pipeline that fills A runs in a separate warp specialisation from the consumer.
Control Word Layout
The 9-bit kind word packs five orthogonal fields:
typedef union Tcgen05MmaKind {
uint32_t raw : 9;
struct {
uint32_t cta_group : 2; // bits 0-1: 1 = 1-CTA, 3 = 2-CTA
uint32_t scale_vector_size : 2; // bits 2-3: 0 = 1X (16), 1 = 2X (32), 2 = 4X (64)
uint32_t scale_input_acc : 1; // bit 4: scale applied to accumulator
uint32_t block_scale : 1; // bit 5: block-scaled (FP4/FP8 microscale)
uint32_t mma_kind : 3; // bits 6-8: one of seven enum values
};
} Tcgen05MmaKind;
The mma_kind field picks the element-type family and the variant of block scaling:
| Value | mma_kind | Operands |
|---|---|---|
| 0 | mxf4nvf4 | NVFP4 inputs with E8M0 block scales |
| 1 | i8 | signed 8-bit integer inputs (arch-conditional) |
| 2 | mxf8f6f4 | OCP MX-FP8 / FP6 / FP4 inputs with E8M0 scales |
| 3 | f16 | half-precision inputs |
| 4 | tf32 | TensorFloat-32 inputs |
| 5 | f8f6f4 | non-block-scaled FP8/FP6/FP4 (alias of mxf8f6f4 for backward compat) |
| 7 | mxf4 | OCP MX-FP4 inputs with E4M3FN scales |
The cross-field consistency rules — for example, "scale-input-accumulator only applies to f16 and tf32", "block-scale rejects f16/tf32/i8" — are enforced by the verifier and listed in detail on the Mode Pattern Verifiers page.
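The two example rules quoted above can be sketched as a small checker. This is an illustration, not the verifier's real logic: the enum values come from the mma_kind table, while `KindCheck`, `check_kind_fields`, and the order in which the two rules are tested are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Enum values copied from the mma_kind table above (6 is unassigned). */
typedef enum {
    MMA_KIND_MXF4NVF4 = 0,
    MMA_KIND_I8 = 1,
    MMA_KIND_MXF8F6F4 = 2,
    MMA_KIND_F16 = 3,
    MMA_KIND_TF32 = 4,
    MMA_KIND_F8F6F4 = 5,
    MMA_KIND_MXF4 = 7
} MmaKind;

/* Hypothetical result codes; the real verifier emits text diagnostics. */
typedef enum { KIND_OK, KIND_ERR_SCALE_INPUT_ACC, KIND_ERR_BLOCK_SCALE } KindCheck;

static KindCheck check_kind_fields(MmaKind kind, bool scale_input_acc, bool block_scale) {
    assert((int)kind >= 0 && (int)kind <= 7 && (int)kind != 6);
    bool is_f16_or_tf32 = (kind == MMA_KIND_F16 || kind == MMA_KIND_TF32);
    /* "scale-input-accumulator only applies to f16 and tf32" */
    if (scale_input_acc && !is_f16_or_tf32) {
        return KIND_ERR_SCALE_INPUT_ACC;
    }
    /* "block-scale rejects f16/tf32/i8" */
    if (block_scale && (is_f16_or_tf32 || kind == MMA_KIND_I8)) {
        return KIND_ERR_BLOCK_SCALE;
    }
    return KIND_OK;
}
```

The full rule set lives on the Mode Pattern Verifiers page; only the two rules named in this paragraph are modelled here.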
Beside the kind word, a separate collector word controls how operand A is staged into the MMA. The collector is a per-warp-group register cache that buffers the most recently staged A operand; subsequent MMA instructions can either consume that cached A directly, refill it from TMEM, or discard it. The three modes are:
| collector::a mode | Reads A from | Updates collector | Pairs with |
|---|---|---|---|
| discard | TMEM (fresh load) | cleared (no reuse downstream) | standalone MMA, no chaining |
| fill | TMEM (fresh load) | new A retained for next MMA | the use mode in the next MMA of the chain |
| use | collector cache (no TMEM load) | unchanged (carries forward) | an earlier fill that staged the A operand |
The motivation is bandwidth. A TMEM-resident A operand costs 1 TMEM read per MMA when re-read on every iteration; the collector lets a chain of fill → use → use → ... MMAs amortise that read across multiple invocations. The collector capacity is one A operand per warp-group — there is no multi-slot cache — so the chain is linear, not branching.
Worked Sequence
A streamed inner-product mainloop computes D += A_0 × B_k for k = 0, 1, 2, reusing the same A operand for all three iterations (a weight-stationary inner loop where A is the weight matrix and B steps through activation slices). The general collector schedule is discard → fill → use → use; for a 3-iteration chain whose A operand fits in the collector cache from the start, the leading discard is unnecessary and the schedule is fill → use → use, plus a final discard if no further chain follows:
collector state
---------------
iter 0: tcgen05.mma.collector::a::fill A_0, B_0, D // load A_0 from TMEM, cache it
A_0 in collector
iter 1: tcgen05.mma.collector::a::use -- , B_1, D // A_0 reused from collector; no TMEM load
A_0 still in collector
iter 2: tcgen05.mma.collector::a::use -- , B_2, D // A_0 reused from collector; no TMEM load
A_0 still in collector
(end of chain)
tcgen05.mma.collector::a::discard ... // optional: clears collector if next region wants
// a fresh A slot
The first MMA fills the collector and pays one TMEM read. The next two MMAs reuse the cached A and pay zero TMEM reads for the A operand. The net A-side bandwidth is 1 / 3 of the naive cost. The B operand reads from SMEM every iteration; collector caching applies only to A.
When the A operand changes (a different weight tile in iteration 3), the next MMA must re-fill:
iter 3: tcgen05.mma.collector::a::fill A_1, B_3, D // load A_1 from TMEM, replaces A_0 in cache
A_1 in collector
The verifier rejects use against a stale collector — if the previous MMA in the warp group's program order discarded the collector or never filled it, the verifier emits "collector::a::use without preceding fill". This is a control-flow check: the verifier walks the warp group's program order from each use backward to the most recent fill or discard and rejects any path where the collector is not filled.
Collector mode interacts with the ashift modifier — collector::a::use or collector::a::fill cannot combine with ashift, because both want exclusive control of the A operand's staging slot. The verifier emits "Cannot use collector::a::use or colletor::a::fill with ashift" (preserving the verbatim typo in colletor) for that combination.
Block-scaled variants also reject the collector use/fill modes: the SFA scale operand changes per iteration and the cached A would mismatch the scales after the first chained call. Lowering forces collector::a::discard for every block-scaled MMA.
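The stale-collector rule can be sketched for the straight-line case. The real check is described above as a walk over the warp group's program order including control flow; the version below handles only a linear sequence of MMAs, and `CollectorMode` and `first_stale_use` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { COLLECTOR_DISCARD, COLLECTOR_FILL, COLLECTOR_USE } CollectorMode;

/* Returns the index of the first `use` that has no live preceding `fill`,
 * or -1 when every `use` in the sequence is covered. */
static int first_stale_use(const CollectorMode *seq, size_t n) {
    assert(seq != NULL || n == 0);
    bool filled = false; /* the collector starts empty */
    for (size_t i = 0; i < n; ++i) {
        switch (seq[i]) {
        case COLLECTOR_FILL:
            filled = true; /* A is staged and retained */
            break;
        case COLLECTOR_DISCARD:
            filled = false; /* collector cleared */
            break;
        case COLLECTOR_USE:
            if (!filled) {
                return (int)i; /* "collector::a::use without preceding fill" */
            }
            break;
        }
    }
    return -1;
}
```

The worked sequence fill → use → use passes, while a `use` that follows a `discard` is reported at its own index.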
Sparsity Metadata
Sparse tcgen05.mma variants halve the structurally-sparse operand and add a metadata operand that encodes which lanes are non-zero. The metadata is a 2-bit-per-element selector packed into a u32 stream: each four-element group of the structured-sparse operand carries one byte of metadata that names the two non-zero positions within the group.
The metadata operand rides a separate TMEM region from the value operand. Allocation pairs the two: the dense-value region holds the halved operand at one base row, and the metadata region holds the selector stream at a fixed offset from that base. The pairing is part of the atom contract — the lowering allocates both regions atomically, and the verifier rejects operands where the metadata layout does not match the value layout at the corresponding stride.
For block-scaled sparse variants, the metadata operand applies to the structurally-sparse input (typically operand A), and the scale-factor operands apply independently. The kind word's block-scale bit and sparsity bit are independent — the verifier's ladder checks them as orthogonal modifiers and rejects only specific illegal combinations (MXF4 and MXF4NVF4 with sparsity require arch-conditional targets).
Block-Scale Operands
Block-scale microscale MMA is the Blackwell answer to MXFP4, MXFP6, MXFP8, and NVFP4. Inputs ride narrow-precision element types (4-bit, 6-bit, or 8-bit); a separate scale-factor vector multiplies each contiguous block of vecSize elements by a per-block scale factor. The accumulator stays FP32.
The legal (atom_K, vecSize) combinations are exactly three:
| (atom_K, vecSize) | A × B types | Scale type | Variant |
|---|---|---|---|
| (32, 32) | FP8 × FP8 | E8M0 | kind::mxf8f6f4 |
| (64, 16) | FP4 × FP4 | E4M3FN | kind::mxf4 (OCP MX-FP4) |
| (64, 32) | FP4 × FP4 | E8M0 | kind::mxf4nvf4 (NVFP4 block-64) |
Other combinations fail verification with the per-combo expectation diagnostics — for (atom_K=64, vecSize=16) the binary emits "expects A and B element types are valid 4bit types, such asFloat4E2M1FNType or FloatNV4E0M3FType , when (atom_K=64 && vecSize=16)" and "expects sfa/sfb element types to be Float8E8M0FNUType or Float8E4M3FNType when (atom_K=64 && vecSize=16)"; for (atom_K=64, vecSize=32) it emits "expects A/B element types to be Float4E2M1FNType and sfa/sfb element types to be Float8E8M0FNUType when (atom_K=64 && vecSize=32)". atom_K is the K extent per MMA instruction; vecSize is the number of contiguous K-axis elements that share one scale factor.
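The shape gate in the table can be expressed as a single predicate. This sketch checks only the (atom_K, vecSize) pair and leaves the element-type and scale-type checks to the diagnostics quoted above; `is_legal_block_scale_shape` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for the three (atom_K, vecSize) pairs listed in the table above. */
static bool is_legal_block_scale_shape(uint32_t atom_k, uint32_t vec_size) {
    assert(vec_size != 0); /* a scale block must cover at least one element */
    if (atom_k == 32) {
        return vec_size == 32;
    }
    if (atom_k == 64) {
        return vec_size == 16 || vec_size == 32;
    }
    return false;
}
```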
NVFP4 and OCP MX-FP4 share a 4-bit element type encoding but differ in their scale-factor format: NVFP4 uses E8M0 (8-bit exponent-only) and OCP MX-FP4 uses E4M3FN (4-bit exponent, 3-bit mantissa, finite-only). The dispatcher distinguishes them by inspecting sf_a / sf_b element types — if both scale-factor operands are E8M0 the layout is NVFP4 and the opcode is kind::mxf4nvf4; if both are E4M3FN the layout is OCP MX-FP4 and the opcode is kind::mxf4. A mismatch between sf_a and sf_b rejects with the verbatim diagnostic "expects sfa/sfb element types to be the same".
The scale-factor operands ride dedicated TMEM regions that the atom builder allocates alongside the value operands. The scale-factor layout holds one E8M0 (or E4M3FN) value per block of vecSize contiguous K-axis elements: far fewer entries than the value operands, but addressed in parallel with the tile.
Weight-Stationary Mode
Weight-stationary mode pins operand A to its TMEM region across the K loop, letting subsequent MMA tiles reuse the staged operand without re-loading. The op encoding sets bit 0 of the kind word; the variant is cta_group::1 only (the verifier rejects cta_group::2 with weight-stationary), and the operand-A element type is restricted — mxf8f6f4, f8f6f4, and mxf4 are all rejected because their wider operand layouts cannot stay stationary across the K loop.
The practical effect is throughput: weight-stationary mainloops amortise A-side TMEM bandwidth across many K iterations. The cost is operand flexibility — the A operand stays in its TMEM region for the whole loop, so the kernel cannot use that region for any other purpose between MMAs.
Cross-References
mbarrier State Machine is the consumer-side synchronisation that pipelines staging copies into TMEM against the MMA that reads them.
WGMMA Emission Protocol is the Hopper predecessor; comparing the two shows why the accumulator moved from registers to TMEM.
Matmul Progression by SM places tcgen05 in the broader SM70-to-SM121 lineage and explains the operand-residency change at SM100.
MMA Atoms SM70-SM120 carries the (atom_K, vecSize) block-scaled triple table and the SM100 UMMA layout grammar.
Mode Pattern Verifiers documents the 13-diagnostic ladder that enforces the inter-field consistency rules summarised above.
Blackwell 2-CTA and 4-CTA MMA covers the cluster-side copy patterns that stage operands into the cooperating CTAs' TMEM regions.
tcgen05 / WGMMA / mbarrier / Cluster Emission covers the backend-side machine-form validation.
Fast-Math and Numerical Precision documents the FP8, MX-FP, and NV-FP4 element-type semantics that the block-scaled MMA dispatcher consumes, including the scale-type rules that distinguish OCP MX-FP4 from NV-FP4.
tcgen05.mma Walkthrough
Abstract
A single Blackwell tcgen05.mma — the asynchronous warp-group matrix multiply that consumes a TMEM accumulator and writes its result back into the same TMEM region — touches every layer of the tileiras cascade. It begins as a tile-shaped cuda_tile.mmaf, picks up an sm100.umma copy/MMA atom witness in nv_tileaa, materializes a TMEM-handle SSA value plus an SMEM-to-TMEM staging copy in nv_tileas, expands into the nvvm.tcgen05.mma.cta_group::1 intrinsic in LLVM IR, becomes a TCGEN05_MMA_* machine instruction with packed control word and collector word in NVPTX MIR, and surfaces as tcgen05.mma.cta_group::1.kind::f16 in PTX text. The 9-bit kind word — cta_group, scale_vec_size, scale_input_acc, block_scale, and mma_kind — flows through each layer under a different name, and the TMEM-handle SSA value persists from allocation through dealloc with verifier-enforced dominance.
This page traces one MMA end-to-end on sm_100a (Blackwell B200 / GB200 datacenter). The kernel-wide walkthrough in DSL to PTX End-to-End shows the same kind of trace at every stage for an sm_90a WGMMA GEMM; this page is its sm_100a companion. The TMA-load focused walkthrough in TMA Load Walkthrough traces the producer side of the same pipeline; this page traces the consumer side that reads what TMA staged. Cross-reference targets remain the per-stage canonical pages: cuda_tile to nv_tileaa, nv_tileaa to nv_tileas, nv_tileas to LLVM, tcgen05 Tensor Memory Model, Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier, and tcgen05 / WGMMA / mbarrier / Cluster Emission.
Confidence: HIGH for IR shapes, mnemonic spellings, kind-word bit layout, and the verifier rule order; MED for the exact SSA naming used in the worked example (the binary-derived examples in the source pages use slightly different temp names).
The Operation
The walkthrough operation is one warp-group dense tcgen05.mma of a 64 × 128 × 16 BF16 tile with FP32 accumulator, on sm_100a, cta_group::1 (single-CTA dispatch), no block-scale, no sparsity, no weight-stationary mode, collector::a::fill for an initial accumulation. The kernel consumes one TMEM accumulator region for D (64 rows × 128 columns of FP32 = 32 KiB = 16 TMEM columns out of the 256 the SM owns), one SMEM-resident A operand staged through TMA, and one SMEM-resident B operand likewise staged. The accumulator stays in TMEM across the K loop — no register-fragment accumulator, no mma.sync style register fan-out.
The frontend constructed:
a_tile = load(a_view, (block_m, block_k)) # tile<64x16xbf16>
b_tile = load(b_view, (block_n, block_k)) # tile<16x128xbf16>
acc = mmaf(a_tile, b_tile, acc) # tile<64x128xf32>
The MMA itself does not specify TMEM, the kind word, the collector mode, or the CTA group selector. Those decisions are downstream — the same mmaf op on sm_90a becomes a WGMMA with a register-resident accumulator, and on sm_80 becomes a series of mma.sync instructions. The capability cross-check in Matmul Progression by SM — SM100 / SM103 covers the divergent lowering paths.
The 9-bit kind word that this MMA encodes:
mma_kind = f16 (value 3) bits 6..8
block_scale = 0 bit 5
scale_input_acc = 0 bit 4
scale_vec_size = 0 (1X, implicit) bits 2..3
cta_group = 1 (1-CTA) bits 0..1
ws bit overlay = 0 (no weight-stationary)
─────────
raw = 0b011000001 = 0xC1
That single integer — the encoded kind word — is what the LLVM intrinsic carries as its first immediate operand, what the MIR opcode encodes in its packed control word, and what ptxas decodes from the printed .cta_group::1.kind::f16 modifier set.
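The encoding above can be checked with shifts and masks, which avoids relying on C bitfield layout (implementation-defined). This is a sketch using the bit positions listed in the breakdown; `pack_kind_word` and `kind_word_mma_kind` are hypothetical helpers, not names from the binary.

```c
#include <assert.h>
#include <stdint.h>

enum { MMA_KIND_F16 = 3 }; /* value from the mma_kind table */

/* Pack the five fields at the bit positions listed above:
 * cta_group 0..1, scale_vec_size 2..3, scale_input_acc 4,
 * block_scale 5, mma_kind 6..8. */
static uint32_t pack_kind_word(uint32_t cta_group, uint32_t scale_vec_size,
                               uint32_t scale_input_acc, uint32_t block_scale,
                               uint32_t mma_kind) {
    assert(cta_group < 4 && scale_vec_size < 4);
    assert(scale_input_acc < 2 && block_scale < 2 && mma_kind < 8);
    return cta_group | (scale_vec_size << 2) | (scale_input_acc << 4) |
           (block_scale << 5) | (mma_kind << 6);
}

/* Extract the 3-bit mma_kind field from a packed kind word. */
static uint32_t kind_word_mma_kind(uint32_t raw) {
    return (raw >> 6) & 0x7u;
}
```

Packing the walkthrough's values (cta_group = 1, all modifiers 0, mma_kind = 3) yields 0xC1, the raw value shown in the breakdown.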
Stage 1: cuda_tile IR
The first IR the compiler sees comes out of the frontend's bytecode. The MMA is a cuda_tile.mmaf — token-free MMA over three tile-typed SSA values — and the verifier contract on the operation is the standard cuda_tile contract: power-of-two tile dimensions, a 16-million-element ceiling, conforming M × K / K × N / M × N shapes between A / B / C, and an optional fastmath attribute that records the precision-relaxation budget the lowering may exploit.
%a_tile : !cuda_tile.tile<64x16xbf16>
%b_tile : !cuda_tile.tile<16x128xbf16>
%acc_in : !cuda_tile.tile<64x128xf32>
%acc_out = cuda_tile.mmaf %a_tile, %b_tile, %acc_in
{ fastmath = "contract" }
: !cuda_tile.tile<64x16xbf16>,
!cuda_tile.tile<16x128xbf16>,
!cuda_tile.tile<64x128xf32>
There is no TMEM, no kind word, no CTA-group selector, no collector mode, and no tcgen05 mention. cuda_tile is the public surface and deliberately stays target-agnostic: three tile-typed SSA values plus an optional fast-math hint is all the frontend has to publish. The atom selection — dense vs sparse, block-scaled vs plain, single-CTA vs two-CTA, weight-stationary vs streamed — is downstream of the layout-assignment pre-pass that runs between Stage 1 and Stage 2.
⚡ QUIRK —
cuda_tile.mmaf carries no TMEM, no kind word, no CTA group, no collector. The public dialect has no syntax for a TMEM handle, no syntax for the 9-bit kind word, no cta_group::* selector, and no collector::a::* modifier. Every tcgen05-specific noun first appears in nv_tileaa (the sm100.umma atom witness) or nv_tileas (the TMEM-handle SSA, the staging copy, the kind word as a packed attribute). A reimplementer who tries to express any of those on the public surface has misread the contract: cuda_tile.mmaf is a tile-algebra op, not a tensor-memory-shaped op. The promotion to tcgen05.mma is a downstream decision driven by the copy/MMA atom registry, not a frontend gesture.
Stage 2: nv_tileaa IR
ConvertCudaTileToTileAA rewrites the MMA through the three-populator structure documented in cuda_tile to nv_tileaa. Part C of that structure owns MMA and reductions, so the rewrite for this op lives in the Part-C nv_tileaa.dot pattern. The tile types become MLIR tensor<...>, the result name flips from mmaf to dot, and — the key change for this walkthrough — the op picks up an MMA atom witness. The witness is an attribute that names the hardware MMA primitive selected by the layout-assignment pre-pass; for an sm_100a BF16 input with FP32 accumulator the witness family is cute_nvgpu.arch.mma.SM100.umma ("Unified MMA," the dialect-side name for the tcgen05.mma family).
%acc_out = nv_tileaa.dot %a_tile, %b_tile, %acc_in
{ atom = #cute.mma_atom<sm100_umma_m64n128k16_f32_bf16_bf16>,
input_precision = "bf16",
fastmath = "contract" }
: tensor<64x16xbf16>, tensor<16x128xbf16>, tensor<64x128xf32>
-> tensor<64x128xf32>
The MMA atom witness sm100_umma_m64n128k16_f32_bf16_bf16 names the (M, N, K) = (64, 128, 16) tcgen05.mma shape for BF16 inputs with FP32 accumulator. A different witness in the same slot — sm100_umma_m64n128k16_f32_bf16_bf16_sp for the 2:4-sparse variant, sm100_umma_m64n128k16_f32_mxf8f6f4_mxf8f6f4_bs for the block-scaled FP8 variant, sm90_wgmma_m64n128k16_f32_bf16_bf16 for the Hopper fallback — would steer the next stage's rewrite into a different lowering path. Layout assignment runs before this pass and is what consults the MMA Atoms SM70-SM120 — SM100 UMMA Layout Grammar registry; after this pass the witness travels verbatim down to the LLVM lowering.
Three things are not yet visible at this stage. The accumulator residency is still implicit in the operand types (a plain tensor<64x128xf32> makes no commitment to register, SMEM, or TMEM placement). The CTA-group selector is implicit in the atom name (no _2cta suffix means single-CTA dispatch). And the kind-word bits are still derived: mma_kind = f16, block_scale = 0, scale_vec_size = 0, cta_group = 1 all flow from the atom's element types and the absence of any per-op modifier attribute. The packing into a single 9-bit word happens at Stage 3.
Stage 3: nv_tileas IR
ConvertTileAAToTileAS keeps the same operand shape but renames the op, updates the dialect namespace, and — for the SM100 path — splits the single dot op into a four-instruction sequence: TMEM allocation, SMEM-to-TMEM staging copy for the A operand (if A is TMEM-resident in the selected atom), MMA proper, and TMEM read-back at use sites. The TileAS layout and buffer family and TileAS scheduling glue drive the split; the tcgen05 Tensor Memory Model — Allocation Grain and Lifetime page documents the TMEM allocator contract this stage materialises.
// ---- TMEM allocation, hoisted to function entry
%tmem_d = nv_tileas.alloc_tmem { num_columns = 16 : i32 }
: !nv_tileas.tmem<64x128xf32>
// ---- SMEM-resident B descriptor (built once per K iteration, see TMA Load Walkthrough)
%b_desc = nv_tileas.make_umma_smem_desc %smem_b,
layout = #cute_nvgpu.umma_k_layout<base_offset=0, lbo=128, sbo=2048,
swizzle=128B>
: !nv_tileas.umma_smem_desc<16x128xbf16>
// ---- SMEM-resident A descriptor (A operand for this walkthrough; the kernel
// could equally stage A into TMEM via tcgen05.cp.smem.tmem and use the
// TMEM-resident A path)
%a_desc = nv_tileas.make_umma_smem_desc %smem_a,
layout = #cute_nvgpu.umma_mn_layout<base_offset=0, lbo=16, sbo=512,
swizzle=128B>
: !nv_tileas.umma_smem_desc<64x16xbf16>
// ---- tcgen05.mma with packed kind word; D is the TMEM accumulator
%tok_mma = nv_tileas.umma %tmem_d, %a_desc, %b_desc
{ kind = #nvvm.tcgen05_mma_kind<f16>,
cta_group = #nvvm.tcgen05_group<cta1>,
scale_vec_size = #nvvm.tcgen05_mma_scale_vec<1X>,
scale_input_acc = false,
block_scale = false,
collector_a = #nvvm.tcgen05_mma_collectorop<fill>,
ashift = false,
atom = #cute.mma_atom<sm100_umma_m64n128k16_f32_bf16_bf16> }
: !nv_tileas.tmem<64x128xf32>,
!nv_tileas.umma_smem_desc<64x16xbf16>,
!nv_tileas.umma_smem_desc<16x128xbf16>
-> !nv_tileas.async_token
// ---- TMEM read-back when the accumulator is needed by the epilogue
%acc_reg = nv_tileas.tmem_load %tmem_d
{ shape = #nvvm.tcgen05_ldst_shape<m64n8x32b> }
: !nv_tileas.tmem<64x128xf32> -> tensor<64x128xf32>
// ---- TMEM deallocation, sunk to function exit
nv_tileas.dealloc_tmem %tmem_d : !nv_tileas.tmem<64x128xf32>
Five new entities appear at this stage. First, the TMEM handle %tmem_d is a first-class SSA value of opaque dialect type !nv_tileas.tmem<64x128xf32>; its 32-bit handle encodes base_column and num_columns (the layout the tcgen05 Tensor Memory Model — Allocation Grain and Lifetime page documents). Second, the UMMA SMEM descriptors %a_desc and %b_desc use the same 64-bit packing as the WGMMA descriptor on sm_90a (documented in WGMMA Emission Protocol — SMEM Descriptor Bit Layout); tcgen05.mma reuses the bit format verbatim. Third, the kind word is now explicit: a group of five orthogonal attributes (kind, cta_group, scale_vec_size, scale_input_acc, block_scale) that the verifier inspects. Fourth, the collector mode is exposed as a separate attribute (the tcgen05 Tensor Memory Model — Control Word Layout collector word). Fifth, TMEM read-back is a separate op (cute_nvgpu.atom.tmem_load) that the epilogue or the next MMA must emit: the MMA itself does not turn the accumulator into a register-resident SSA value, and the data stays in TMEM until explicitly read.
⚡ QUIRK — TMEM-handle SSA propagation requires dominance over every consumer
The TMEM handle %tmem_d produced by alloc_tmem is an SSA value, but it does not behave like a value-typed accumulator. It is a handle to a per-SM TMEM region whose contents the MMA mutates in-place. The verifier on cute_nvgpu.arch.sm100.dealloc_tmem requires that every umma, tmem_load, and tmem_store op that names %tmem_d be dominated by the matching alloc_tmem and dominate the matching dealloc_tmem. A reimplementation that hoists the MMA op above the alloc, sinks it below the dealloc, or (most subtly) places the alloc inside a conditional branch the MMA escapes, builds IR that passes the dialect verifier but produces a kernel where the MMA reads garbage from TMEM rows the allocator has already returned to the free pool. The dominance contract is the only protection: TMEM regions do not survive the SM context reset between CTAs, so any out-of-lifetime read sees whatever the next CTA-on-this-SM wrote.
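The lifetime rule above can be illustrated on straight-line code, where dominance reduces to program order. The sketch below is a hypothetical checker, not tileiras's verifier: the `Op` tuple and the `check_tmem_lifetime` name are invented for illustration, and a real dominance analysis must also handle branches and loops.

```python
from typing import List, NamedTuple


class Op(NamedTuple):
    name: str    # e.g. "alloc_tmem", "umma", "tmem_load", "dealloc_tmem"
    handle: str  # SSA name of the TMEM handle the op touches


CONSUMERS = {"umma", "tmem_load", "tmem_store"}


def check_tmem_lifetime(ops: List[Op]) -> List[str]:
    """Return one message per op that touches a handle outside its lifetime.

    Straight-line simplification (an assumption of this sketch): an op is
    dominated by the alloc when it appears later in the list, and dominates
    the dealloc when it appears earlier.
    """
    live = set()
    errors = []
    for index, op in enumerate(ops):
        if op.name == "alloc_tmem":
            live.add(op.handle)
        elif op.name == "dealloc_tmem":
            if op.handle not in live:
                errors.append(f"op {index}: dealloc of non-live {op.handle}")
            live.discard(op.handle)
        elif op.name in CONSUMERS and op.handle not in live:
            errors.append(
                f"op {index}: {op.name} uses {op.handle} outside its lifetime"
            )
    # Handles still live at the end were never deallocated.
    errors.extend(f"{handle}: missing dealloc_tmem" for handle in sorted(live))
    return errors


good = [Op("alloc_tmem", "%tmem_d"), Op("umma", "%tmem_d"),
        Op("tmem_load", "%tmem_d"), Op("dealloc_tmem", "%tmem_d")]
bad = [Op("alloc_tmem", "%tmem_d"), Op("dealloc_tmem", "%tmem_d"),
       Op("umma", "%tmem_d")]
print(check_tmem_lifetime(good))
print(check_tmem_lifetime(bad))
```

The `bad` sequence models an MMA sunk below the dealloc, one of the three failure shapes the callout lists.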
⚡ QUIRK — cta_group::1 and cta_group::2 encode in the same kind-word bits but reject different operand shapes
The CTA-group selector lives in bits 0..1 of the kind word: cta_group = 1 (binary 01) selects single-CTA dispatch, cta_group = 2 (binary 10, not 3, which is the historical 4-CTA reservation) selects two-CTA cooperative dispatch. Verifier rule 8 in Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier rejects cta_group::2 whenever the weight-stationary bit is set, with the diagnostic "cta_group::2 is not supported with weight stationary". The 4-CTA value 3 exists in the encoding range but has no MMA-side variant: the 4-CTA semantics is a copy-time partition on tcgen05.cp only, and the consuming MMA over each partition is a plain single-CTA instruction. A reimplementation that emits cta_group::3 for a 4-CTA dispatch builds an opcode the verifier rejects with a different rule from the documented ladder.
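A minimal sketch of the selector encoding described in this callout. The function name is hypothetical and the message for value 3 is a placeholder, since the text does not quote that diagnostic; only the rule 8 string is taken from the page.

```python
def encode_cta_group(cta_group: int, weight_stationary: bool = False) -> int:
    """Return the 2-bit selector that occupies kind-word bits 0..1."""
    if cta_group not in (1, 2):
        # Value 3 is inside the 2-bit range but has no MMA-side variant.
        # Placeholder message: the real diagnostic is not quoted in the text.
        raise ValueError(f"cta_group::{cta_group} has no tcgen05.mma variant")
    if cta_group == 2 and weight_stationary:
        # Verifier rule 8, diagnostic quoted from the callout above.
        raise ValueError("cta_group::2 is not supported with weight stationary")
    return cta_group  # 1 -> 0b01, 2 -> 0b10


print(bin(encode_cta_group(1)), bin(encode_cta_group(2)))
```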
⚡ QUIRK — scale_vec_size bit-packing is per-mma_kind, with verifier rules 11/12/13 fencing each off
The scale_vec_size field at bits 2..3 of the kind word is a 2-bit selector (0 = 1X (16-element), 1 = 2X (32-element), 2 = 4X (64-element), 3 = reserved). The verifier ladder rules 11, 12, and 13 (see Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier) each pin a single mma_kind to a specific subset of legal scale_vec_size values: mxf8f6f4 only accepts 1X (rule 11 rejects 2X and 4X), mxf4nvf4 only accepts 2X or 4X (rule 12 rejects 1X), and mxf4 only accepts 2X (rule 13 rejects 1X and 4X). Outside the arch-conditional surface, rule 3 globally forbids any non-zero scale_vec_size. For this walkthrough's kind::f16 MMA, scale_vec_size = 0 is the only legal value. Block-scale is off, so the field is unused and the verifier doesn't fire on it, but a frontend that sets a non-zero value on a non-block-scale kind passes the dialect verifier and is silently miscompiled at PTX emission time.
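The per-kind fencing in rules 11, 12, and 13 amounts to a small lookup table. The sketch below is illustrative only: the names and error messages are hypothetical, and it is deliberately stricter than the verifier described above because it also rejects a non-zero value on a non-block-scale kind instead of letting it through.

```python
# 2-bit encodings from the callout above; code 3 is reserved.
SCALE_VEC_CODE = {"1X": 0, "2X": 1, "4X": 2}

# Legal scale_vec_size values per block-scaled kind (rules 11, 12, 13).
LEGAL_SCALE_VEC = {
    "mxf8f6f4": {"1X"},
    "mxf4nvf4": {"2X", "4X"},
    "mxf4": {"2X"},
}


def scale_vec_bits(mma_kind: str, scale_vec: str) -> int:
    """Return the scale_vec_size field already shifted into bits 2..3."""
    legal = LEGAL_SCALE_VEC.get(mma_kind)
    if legal is None:
        # Non-block-scale kinds: only the zero encoding is meaningful.
        if scale_vec != "1X":
            raise ValueError(f"{mma_kind} does not take a scale_vec_size")
    elif scale_vec not in legal:
        raise ValueError(f"{mma_kind} rejects scale_vec_size {scale_vec}")
    return SCALE_VEC_CODE[scale_vec] << 2


print(scale_vec_bits("f16", "1X"), scale_vec_bits("mxf4nvf4", "4X"))
```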
TMEM Allocation Lifecycle
The cute_nvgpu.arch.sm100.alloc_tmem op carves a 32 KiB region (16 columns of the SM's 256 TMEM columns) out of the per-SM TMEM allocator's free pool. The Buffer Assignment and Named-Barrier Binding pass is what decides the base_column value the handle encodes; for this walkthrough's accumulator the allocator picks column 0 (no other TMEM users) and the handle becomes {base_column = 0, num_columns = 16}. The allocator is per-SM, not per-CTA: every warp in every resident CTA sees the same TMEM address space, but each region is pinned to one logical owner for its issue lifetime.
The lifecycle spans four groups of named operations:
| Op | Role | Verifier requirement |
|---|---|---|
| cute_nvgpu.arch.sm100.alloc_tmem | Reserves num_columns of TMEM, returns a handle | Must dominate every consumer that names the handle |
| cute_nvgpu.arch.mma.SM100.umma | Mutates the TMEM region in place; reads C, writes D | Handle operand must come from a dominating alloc_tmem |
| cute_nvgpu.atom.tmem_load / cute_nvgpu.atom.tmem_store | Reads TMEM into register tensors / writes register tensors into TMEM | Same dominance requirement |
| cute_nvgpu.arch.sm100.dealloc_tmem | Returns the columns to the free pool | Must post-dominate every consumer |
The allocator does not allow re-allocating a region across function boundaries — there is no global TMEM heap and no out-of-function handle propagation. A kernel that wants to chain MMAs across iterations keeps the TMEM allocation alive across the loop body, with alloc_tmem hoisted to function entry and dealloc_tmem sunk to function exit. The tcgen05 Tensor Memory Model — Allocation Grain and Lifetime page covers the full grain and lifetime model.
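The sizing arithmetic in this section (a 32 KiB accumulator occupying 16 of 256 columns) can be checked with a short sketch. The first-fit policy and the `TmemAllocator` class are assumptions made for illustration; the text does not say how the real allocator picks base_column beyond the single-accumulator case. The error message borrows the prefix of the "allocated tmem out of resource" diagnostic.

```python
TMEM_COLUMNS = 256               # columns per SM, per the text above
COLUMN_BYTES = 32 * 1024 // 16   # 32 KiB over 16 columns = 2 KiB per column


def columns_for(rows: int, cols: int, elem_bytes: int) -> int:
    """Columns needed to hold a rows x cols tile, rounded up."""
    total = rows * cols * elem_bytes
    return -(-total // COLUMN_BYTES)  # ceiling division


class TmemAllocator:
    """Hypothetical first-fit allocator over the per-SM column pool."""

    def __init__(self) -> None:
        self.regions = []  # sorted list of (base_column, num_columns)

    def alloc(self, num_columns: int) -> dict:
        base = 0
        for start, length in self.regions:
            if start - base >= num_columns:
                break  # the gap before this region is large enough
            base = start + length
        if base + num_columns > TMEM_COLUMNS:
            raise MemoryError("allocated tmem out of resource")
        self.regions.append((base, num_columns))
        self.regions.sort()
        return {"base_column": base, "num_columns": num_columns}

    def dealloc(self, handle: dict) -> None:
        self.regions.remove((handle["base_column"], handle["num_columns"]))


allocator = TmemAllocator()
acc = allocator.alloc(columns_for(64, 128, 4))  # the walkthrough's f32 tile
print(acc)
```

With no other TMEM users the accumulator lands at column 0, matching the handle the walkthrough describes.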
SMEM-to-TMEM Staging Copy (Optional A Path)
The walkthrough above uses A as an SMEM descriptor — the simpler residency. The TMEM-resident A path uses a tcgen05.cp.smem.tmem staging copy to move the A operand into a TMEM region before the MMA reads it. The staging form looks like:
%tmem_a = nv_tileas.alloc_tmem { num_columns = 8 : i32 }
: !nv_tileas.tmem<64x16xbf16>
nv_tileas.umma_smem_to_tmem_cp %smem_a, %tmem_a
{ shape = #nvvm.tcgen05_cp_shape<m64n128b>,
multicast = #nvvm.tcgen05_cp_multicast<warpx2_01_23>,
src_fmt = #nvvm.tcgen05_cp_src_fmt<b32x2> }
: !nv_tileas.smem<64x16xbf16>, !nv_tileas.tmem<64x16xbf16>
The tcgen05.cp family supports shape codes m64n128b, m64n256b, m32x128b, and m64x128b, each pairing with a different multicast mask. The 4-CTA copy variant uses multicast = warpx4 against shape m32x128b; the 2-CTA variants use warpx2_01_23 or warpx2_02_13 against shape m64x128b. The verifier strings "Shape 64x128b requires multicast warpx2_01_23 or warpx2_02_13 for tcgen05.cp Op" and "Shape 32x128b requires multicast warpx4 for tcgen05.cp Op" enforce the pairing. This walkthrough sticks with the SMEM-descriptor A path to keep the trace focused on the MMA itself; the Blackwell 2-CTA and 4-CTA MMA page covers the cluster-side copy patterns in detail.
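The two verifier strings quoted above imply a shape-to-multicast table. A minimal sketch of that pairing check follows; the function name is hypothetical, the shape keys use the spelling from the diagnostic strings, and shapes whose pairing the text does not state are passed through unchecked.

```python
# Pairings taken from the two verifier strings quoted in the text.
REQUIRED_MULTICAST = {
    "64x128b": {"warpx2_01_23", "warpx2_02_13"},
    "32x128b": {"warpx4"},
}


def check_cp_pairing(shape: str, multicast: str) -> None:
    """Raise if the multicast mask does not match the tcgen05.cp shape."""
    allowed = REQUIRED_MULTICAST.get(shape)
    if allowed is None:
        return  # pairing for other shapes is not stated in the text
    if multicast not in allowed:
        choices = " or ".join(sorted(allowed))
        raise ValueError(
            f"Shape {shape} requires multicast {choices} for tcgen05.cp Op"
        )


check_cp_pairing("64x128b", "warpx2_01_23")  # accepted
check_cp_pairing("32x128b", "warpx4")        # accepted
```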
⚡ QUIRK — ashift is rejected with block-scale and with collector::a::use/fill
The ashift modifier on the collector word advances the A operand's column index by one before the MMA reads it, a single-instruction prefetch-like optimisation for inner loops that walk A's columns in lockstep with K. Verifier rule 7 in Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier rejects ashift whenever block_scale = 1 (the diagnostic is "ashift is not supported with tcgen05.mma.block_scale variants"). Verifier rule 10 rejects ashift whenever collector::a::use or collector::a::fill is set (the diagnostic is "Cannot use collector::a::use or colletor::a::fill with ashift", with the verbatim colletor typo preserved). The rule 10 conflict comes down to a single bit position in the encoding: bit 2 of the collector word overlays both the ashift flag and the high bit of the collector_a field, so the encoder treats them as mutually exclusive at the byte level. A reimplementation that emits both flags simultaneously builds a collector word with ambiguous semantics, and the verifier rejects it at the first failure encountered.
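Rules 7 and 10 can be written as two guarded checks. This is a sketch under the assumption that rule 7 is tested before rule 10 (ladder order); the function name is hypothetical, and the diagnostic strings are the ones quoted in the callout, typo included.

```python
def check_ashift(ashift: bool, block_scale: bool, collector_a: str) -> None:
    """Raise with the quoted diagnostic if ashift is combined illegally."""
    if not ashift:
        return
    if block_scale:
        # Rule 7.
        raise ValueError(
            "ashift is not supported with tcgen05.mma.block_scale variants"
        )
    if collector_a in ("use", "fill"):
        # Rule 10; the "colletor" spelling is verbatim from the diagnostic.
        raise ValueError(
            "Cannot use collector::a::use or colletor::a::fill with ashift"
        )


# The walkthrough's op: collector fill, no ashift, no block scale.
check_ashift(ashift=False, block_scale=False, collector_a="fill")
```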
Stage 4: NVVM Intrinsic in LLVM IR
ConvertTileASToLLVM is the terminal MLIR-side lowering, and its nine-phase body conversion documented in tileas to LLVM carries the MMA to LLVM. The TMEM allocation lowers to nvvm.tcgen05.alloc, the SMEM-to-TMEM staging copy (if present) lowers to nvvm.tcgen05.cp, the MMA proper lowers to nvvm.tcgen05.mma.cta_group::1, and the read-back lowers to nvvm.tcgen05.ld. The kind word collapses into a packed i32 immediate carrying the same five fields the dialect attribute exposed.
; ---- TMEM allocation (one column-base handle per accumulator region)
%tmem_d_handle = call i32 @llvm.nvvm.tcgen05.alloc.shared(
i32 16) ; num_columns
; ---- (optional) SMEM-to-TMEM staging copy for the A operand
; skipped in this walkthrough — A rides the SMEM-descriptor path
; ---- UMMA SMEM descriptor encode (64-bit packed, same bit layout as WGMMA)
%a_desc = call i64 @llvm.nvvm.tcgen05.mma_smem_desc.encode(
ptr addrspace(3) %smem_a, i32 512, i32 16, i32 0, i32 1)
%b_desc = call i64 @llvm.nvvm.tcgen05.mma_smem_desc.encode(
ptr addrspace(3) %smem_b, i32 2048, i32 128, i32 0, i32 1)
; ---- tcgen05.mma with packed kind word
; i32 kind: 0xC1 = mma_kind::f16 + cta_group::1
; i32 collector: 0x01 = collector::a::fill, ashift=0
call void @llvm.nvvm.tcgen05.mma.cta_group__1(
i32 %tmem_d_handle, ; D (TMEM handle, also reads C in place)
i64 %a_desc, ; A operand (SMEM descriptor)
i64 %b_desc, ; B operand (SMEM descriptor)
i32 193, ; kind word: 0xC1
i32 1, ; collector word: collector::a::fill
i1 true) ; enable_input_d (analogue of WGMMA scale_d)
; ---- TMEM read-back when the epilogue needs the accumulator in registers
%acc_reg = call <128 x float> @llvm.nvvm.tcgen05.ld.m64n8x32b(
i32 %tmem_d_handle)
; ---- TMEM deallocation (at function exit)
call void @llvm.nvvm.tcgen05.dealloc(i32 %tmem_d_handle, i32 16)
call void @llvm.nvvm.tcgen05.relinquish_alloc_permit()
Five things change at the LLVM boundary. The MMA atom witness is consumed — the intrinsic name llvm.nvvm.tcgen05.mma.cta_group__1 encodes the CTA-group selector and the variant family (cta_group__2 for the two-CTA form, the sp / block_scale / ws family suffixes for the sparse / block-scaled / weight-stationary variants), so no attribute is needed at the LLVM op level. The kind word becomes a single i32 immediate (193 = 0xC1 for our walkthrough, with mma_kind::f16 + cta_group::1). The collector word is a separate i32 operand carrying the collector-A mode and the ashift bit. The TMEM handle is an i32 SSA value threaded through every consumer. And the SMEM descriptors are i64 SSA values produced by llvm.nvvm.tcgen05.mma_smem_desc.encode, packing the same 64-bit bit field that WGMMA uses on Hopper — the encoding is genuinely shared between the two MMA families.
The MMA does not automatically wait for itself. The producer-side instruction is asynchronous; the consumer-side nvvm.tcgen05.wait (lowering of tcgen05.wait.cta_group::1) is what drains the MMA before the accumulator is read. The two are independent instructions. The wait takes no handle operand, so only program order ties it to the MMA whose TMEM handle the later load reads:
; ---- After the K loop body completes, drain the asynchronous MMA queue
call void @llvm.nvvm.tcgen05.wait.cta_group__1()
; ---- Now safe to read the TMEM accumulator
%acc_reg = call <128 x float> @llvm.nvvm.tcgen05.ld.m64n8x32b(
i32 %tmem_d_handle)
See tcgen05 / WGMMA / mbarrier / Cluster Emission — End-To-End Lowering for the full asynchronous-MMA wait protocol and the mbarrier-completion variant that pairs the MMA with an mbarrier.
Stage 5: NVPTX MIR
The NVPTX backend's instruction selector (ISelDAG and MatcherTable) consumes the LLVM intrinsics and produces machine instructions in the MachineFunction. The tcgen05.mma family of opcodes is a set of TCGEN05_MMA_* machine instructions, one per (cta_group, sparsity, block_scale, weight_stationary) tuple: the closed-range 10521..10530 opcode set that the verifier in Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier selects. For the single-CTA dense non-block-scale form, the opcode is TCGEN05_MMA_CTA_GROUP1_DENSE.
bb.entry:
; --- TMEM allocation: handle in a 32-bit virtual register
%tmem_d_handle:b32 = TCGEN05_ALLOC_SHARED imm:16 ; 16 columns
bb.loop:
; --- UMMA SMEM descriptor encode (same opcode as WGMMA on sm_90a)
%a_desc:b64 = TCGEN05_MMA_SMEM_DESC_ENCODE
%smem_a:b64, imm:512, imm:16, imm:0, imm:1
%b_desc:b64 = TCGEN05_MMA_SMEM_DESC_ENCODE
%smem_b:b64, imm:2048, imm:128, imm:0, imm:1
; --- tcgen05.mma with packed control word + collector word
TCGEN05_MMA_CTA_GROUP1_DENSE
d: %tmem_d_handle ; TMEM handle (in-place accumulate)
a: %a_desc ; A descriptor (SMEM)
b: %b_desc ; B descriptor (SMEM)
ctrl: imm:193 ; 0xC1 = kind::f16 + cta_group::1
collector:imm:1 ; collector::a::fill
scale_d: imm:1 ; enable_input_d = true
; opcode index: 10522 (dense, non-block-scale, cta_group::1, ws=0)
; --- Loop body continues with next K tile, same TMEM handle...
bb.epi:
; --- Drain the asynchronous MMA queue before reading TMEM
TCGEN05_WAIT_CTA_GROUP1
; --- Read the TMEM accumulator into a register vector
%acc:v128_f32 = TCGEN05_LD_M64N8X32B %tmem_d_handle
; --- Deallocate TMEM at function exit
TCGEN05_DEALLOC %tmem_d_handle, imm:16
TCGEN05_RELINQUISH_ALLOC_PERMIT
Four observations matter at MIR level. First, the opcode encodes the CTA-group selector, sparsity, block-scale, and weight-stationary bits in its name: TCGEN05_MMA_CTA_GROUP1_DENSE is one opcode, and TCGEN05_MMA_CTA_GROUP2_DENSE, TCGEN05_MMA_CTA_GROUP1_SPARSE, TCGEN05_MMA_CTA_GROUP1_BLOCK_SCALED_DENSE, and six other variants are each a separate opcode in the NVPTX .td files, ten in total across the 10521..10530 range. The verifier in verify_tcgen05_mma (documented in tcgen05 / WGMMA / mbarrier / Cluster Emission — Verifier Rules) reads the packed control word out of the immediate operand and re-checks every constraint the dialect verifier already checked, because arch-conditional flags (is_arch_cond) and subtarget features (has_scale_input_accumulator, has_arch_conditional) only become fully visible after target selection. Second, the kind word 0xC1 = 193 is a literal immediate, with bits decoded as mma_kind = 011 (f16), block_scale = 0, scale_input_acc = 0, scale_vec_size = 00 (1X), cta_group = 01 (1-CTA), the encoding documented in tcgen05 / WGMMA / mbarrier / Cluster Emission — Control-Word Bit Layout. Third, the TMEM handle is a single 32-bit virtual register threaded through every consumer (alloc → MMA → ld → dealloc), and the MIR register allocator pins it to a single physical register for the entire lifetime; there is no spill path for TMEM handles because the TMEM region cannot move. Fourth, the TCGEN05_WAIT_CTA_GROUP1 opcode has no operand: it drains the per-CTA asynchronous MMA queue globally, not per-handle.
The kind word has now flowed through five levels of representation: implicit shape-and-element-type in cuda_tile, implicit atom-name in nv_tileaa, explicit five-attribute group in nv_tileas, explicit packed i32 193 immediate to llvm.nvvm.tcgen05.mma.cta_group__1 in LLVM IR, and explicit immediate operand imm:193 to TCGEN05_MMA_CTA_GROUP1_DENSE in MIR.
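The bit decode quoted for 0xC1 can be reproduced mechanically. A minimal sketch, assuming the field positions given on this page (cta_group in bits 0..1, scale_vec_size in bits 2..3, scale_input_acc in bit 4, block_scale in bit 5, mma_kind in bits 6..8); the function name is hypothetical.

```python
def decode_kind_word(word: int) -> dict:
    """Split a 9-bit kind word into its five fields."""
    if not 0 <= word < (1 << 9):
        raise ValueError("kind word must fit in 9 bits")
    return {
        "cta_group": word & 0b11,
        "scale_vec_size": (word >> 2) & 0b11,
        "scale_input_acc": (word >> 4) & 0b1,
        "block_scale": (word >> 5) & 0b1,
        "mma_kind": (word >> 6) & 0b111,
    }


fields = decode_kind_word(0xC1)  # the walkthrough's immediate, 193
print(fields)
```

For 193 the decode gives mma_kind = 3 (binary 011, the f16 code), cta_group = 1, and zero in the other three fields.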
Stage 6: PTX Text
The AsmPrinter (AsmPrinter and Per-SM Windows) walks the MachineFunction and renders each instruction. The single-CTA dense tcgen05.mma with FP16 kind prints as tcgen05.mma.cta_group::1.kind::f16, with the collector and ashift modifiers in their own qualifier slots.
//
// Generated by tileiras 13.1, target sm_100a
//
.version 8.6
.target sm_100a
.address_size 64
.extern .shared .align 16 .b8 global_smem[];
.entry gemm_blackwell(
.param .u64 gemm_param_0,
.param .u64 gemm_param_1,
// ...
)
.reqntid 128, 1, 1
{
.reg .pred %p<8>;
.reg .b32 %r<48>;
.reg .b64 %rd<24>;
.reg .f32 %f<128>;
// ---- TMEM allocation in shared-prefix scratch
tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%rd_tmem_scratch], 16;
ld.shared.b32 %r_tmem_d, [%rd_tmem_scratch];
// ---- UMMA SMEM descriptors for A and B (encoded once before the K loop)
// (descriptor build elided; same bit layout as wgmma.descriptor.encode.smem)
mov.u32 %r_k, 0;
LBB_loop:
// ---- (TMA loads for A and B into smem stages, see TMA Load Walkthrough)
// ---- (mbarrier wait on producer barriers, see mbarrier State Machine)
// ---- tcgen05.mma proper
// modifier set:
// .cta_group::1 selector (single-CTA dispatch)
// .kind::f16 element-type family for A/B/D
// .collector::a::fill load A from TMEM/SMEM, cache for the next call
//
// operands (in PTX operand order):
// [%r_tmem_d] TMEM accumulator handle
// %rd_a_desc A operand (SMEM descriptor)
// %rd_b_desc B operand (SMEM descriptor)
// idesc packed instruction descriptor (kind + cta_group + flags)
// enable-input-d predicate
tcgen05.mma.cta_group::1.kind::f16.collector::a::fill
[%r_tmem_d], // D = C += A * B, TMEM in-place
%rd_a_desc, // A: SMEM descriptor
%rd_b_desc, // B: SMEM descriptor
%r_idesc, // instruction descriptor (kind word)
1; // enable_input_d (scale_d analogue)
add.u32 %r_k, %r_k, 16;
setp.lt.u32 %p_done, %r_k, %r_k_end;
@%p_done bra LBB_loop;
// ---- After the K loop: drain the asynchronous MMA queue
tcgen05.wait.cta_group::1.sync.aligned;
// ---- TMEM read-back into the warp's register file (epilogue uses the
// register-resident accumulator for downstream addf / TMA store)
tcgen05.ld.sync.aligned.16x64b.x32.b32
{%f0, %f1, %f2, %f3, ..., %f31},
[%r_tmem_d];
// ---- TMEM dealloc + relinquish at function exit
tcgen05.dealloc.cta_group::1.sync.aligned.b32 [%r_tmem_d], 16;
tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;
ret;
}
The mnemonic encodes seven independent decisions. tcgen05.mma is the family. .cta_group::1 is the CTA-group selector (versus .cta_group::2 for the two-CTA form). .kind::f16 is the element-type family (versus .kind::tf32, .kind::i8, .kind::f8f6f4, .kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4). .collector::a::fill is the collector mode (versus .collector::a::use, .collector::a::lastuse, or absence-of-collector for the discard path). Optional modifiers — .sp for sparsity, .block_scale for the microscale variants, .ws for weight-stationary, .ashift for the A-shift modifier — are concatenated in a fixed order. Each modifier maps back to a specific bit in the kind word or collector word that travelled from cute_nvgpu.arch.mma.SM100.umma through the LLVM intrinsic name into the MIR opcode suffix and finally into the printed mnemonic.
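The mandatory part of the mnemonic can be assembled from three of those decisions. The sketch below is illustrative: the function name is hypothetical, and the optional .sp / .block_scale / .ws / .ashift modifiers are left out because the text says they appear in a fixed order without stating what that order is.

```python
from typing import Optional

# Qualifier values listed in the paragraph above.
KINDS = {"f16", "tf32", "i8", "f8f6f4", "mxf8f6f4", "mxf4", "mxf4nvf4"}
COLLECTORS = {"fill", "use", "lastuse"}


def render_mma_mnemonic(cta_group: int, kind: str,
                        collector_a: Optional[str] = None) -> str:
    """Join the family, CTA-group, kind, and collector qualifiers."""
    if cta_group not in (1, 2):
        raise ValueError(f"unsupported cta_group {cta_group}")
    if kind not in KINDS:
        raise ValueError(f"unknown kind {kind}")
    parts = ["tcgen05.mma", f"cta_group::{cta_group}", f"kind::{kind}"]
    if collector_a is not None:  # None models the discard path
        if collector_a not in COLLECTORS:
            raise ValueError(f"unknown collector mode {collector_a}")
        parts.append(f"collector::a::{collector_a}")
    return ".".join(parts)


print(render_mma_mnemonic(1, "f16", "fill"))
```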
The instruction descriptor operand (%r_idesc) is the packed kind word re-materialised as a runtime register value when the kernel needs to vary the kind across iterations; for constant-kind MMA the compiler folds the descriptor into the mnemonic modifiers and the operand becomes a constant immediate in the PTX. The dual-form encoding (modifiers on the mnemonic versus an immediate operand) is the same kind of trade-off WGMMA makes for scale_d on sm_90a.
The TMEM accumulator operand [%r_tmem_d] is bracketed because it is not a register operand — it is a TMEM handle whose syntactic form mirrors a memory address. The PTX assembler reads the brackets as a hint to use the TMEM-addressed form of the instruction; the unbracketed form %r_tmem_d would dispatch a different opcode entirely.
⚡ QUIRK — tcgen05.wait is per-CTA, not per-handle
The tcgen05.wait.cta_group::1.sync.aligned instruction has no operand. It drains every outstanding asynchronous tcgen05.mma issued by the warp group; there is no "wait for this specific MMA" form. A kernel that issues multiple MMAs against different TMEM handles and wants to drain only one must serialize the issue order, because the wait is global. The verifier emits "tcgen05.wait supported only on arch-conditional or family-conditional variants from SM100 onwards." on non-arch-conditional targets. A reimplementation that tries to emit a per-handle wait builds a kernel that compiles cleanly but races against the asynchronous MMA in production. The companion mbarrier-completion variant of tcgen05.mma, the tcgen05.commit family, provides per-MMA completion semantics through a paired mbarrier; see tcgen05 / WGMMA / mbarrier / Cluster Emission — mbarrier Emission for the protocol.
Kind Word: Cross-Stage Flow
The packed 9-bit kind word 0xC1 is the canonical thread tying the MMA's variant choice through every layer. It is computed exactly once — five orthogonal field decisions packed at Stage 3 — but lives under different names and at different levels of abstraction at every stage. Its journey:
| Stage | Form | Carrier | Source |
|---|---|---|---|
| 1 — cuda_tile | implicit | tile element types + fastmath attr | derived at lower time |
| 2 — nv_tileaa | implicit | MMA atom name sm100_umma_m64n128k16_f32_bf16_bf16 | derived from layout assignment |
| 3 — nv_tileas | explicit 5-attribute group | kind, cta_group, scale_vec_size, scale_input_acc, block_scale on umma op | computed by atom desugar |
| 4 — LLVM IR | explicit i32 immediate | i32 193 argument to llvm.nvvm.tcgen05.mma.cta_group__1 | packed by ConvertTileASToLLVM |
| 5 — NVPTX MIR | explicit immediate | imm:193 operand to TCGEN05_MMA_CTA_GROUP1_DENSE | selected through ISelDAG |
| 6 — PTX text | explicit modifier set | .cta_group::1.kind::f16.collector::a::fill qualifiers | rendered by AsmPrinter |
The transition from implicit (stages 1–2) to explicit (stages 3–6) happens in the layout-assignment-to-atom-desugar pipeline, the same point where the MMA atom witness is committed. Until that point runs, the kind-word bits exist only as derivable consequences of the tile element types and the absence of per-op modifier attributes; after that point, they are first-class attributes that travel verbatim through every subsequent lowering. The verifier in Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier walks the 13-rule ladder at each of stages 3 and 5; the LLVM stage 4 inherits Stage 3's verifier output through the intrinsic name selection (the family is encoded in the intrinsic name, so a malformed kind word that survived stage 3 lands on a syntactically wrong intrinsic and fails LLVM IR verification).
Stage 7: SASS
Past the PTX text, the path leaves tileiras and enters ptxas's territory through the boundary documented in ptxas Handoff Protocol. The assembler renders the tcgen05.mma.cta_group::1.kind::f16.collector::a::fill mnemonic into the SASS instruction stream — instruction encodings, register allocation across the TMEM-handle lifetime, and the interleaving of asynchronous MMA issue against the producer-side TMA loads are entirely ptxas's decision. The TMEM handle becomes a specific 32-bit register, the kind word becomes an immediate field in the SASS encoding, and the collector mode contributes to the operand-select fields.
That layer is out of scope for tileiras's documentation. The wiki covers the path up to PTX text; everything below the handoff is ptxas territory, including the SASS opcode encoding for the UTMA / UMMA family and the SM scheduling decisions that interleave the asynchronous MMA issue against the warp group's TMA-driven operand staging.
Capability Cross-Check
The walkthrough above targets sm_100a. The same cuda_tile.mmaf would produce a different cascade on every other supported architecture; the table below summarises the divergence so a reimplementer can predict what to expect under a different --compute-capability value.
| Compute capability | MMA atom witness | Accumulator residency | Stage-6 PTX mnemonic |
|---|---|---|---|
| sm_80 (Ampere) | sm80_mma_m16n8k16_f32_bf16_bf16 | register fragments | mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32 (tiled to 64 instructions) |
| sm_89 (Ada) | sm80_mma_m16n8k16_f32_bf16_bf16 | register fragments | same Ampere mnemonic |
| sm_90a (Hopper) | sm90_wgmma_m64n128k16_f32_bf16_bf16 | warp-group registers (<32 x float> × 4 warps) | wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16 |
| sm_100a (Blackwell datacenter) | sm100_umma_m64n128k16_f32_bf16_bf16 | TMEM (16 columns × 128 rows) | tcgen05.mma.cta_group::1.kind::f16.collector::a::fill |
| sm_103a (Blackwell Ultra GB300) | sm100_umma_m64n128k16_f32_bf16_bf16 (inherited) | TMEM | same Blackwell mnemonic |
| sm_110 (Jetson Thor) | no MMA atom registered | register fallback via universal-FMA | (no tcgen05.mma; falls back to mma.sync family) |
| sm_120 (consumer Blackwell) | sm120_mma_m16n8k32_f32_bf16_bf16 (no TMEM) | register fragments | mma.sync.aligned.m16n8k32.row.col.f32.bf16.bf16.f32 |
The transition between sm_90a and sm_100a is where the accumulator moves out of the register file and into TMEM, the kind word enters the encoding, the cta_group selector becomes a first-class modifier, and the collector cache enters the instruction operand set. Up to sm_90a the MMA writes registers; on sm_100a and sm_103a it writes TMEM. Per the table, sm_110 and sm_120 return to register-resident accumulators. See Matmul Progression by SM — SM100 / SM103 for the parallel progression and tcgen05 Tensor Memory Model for the structural model of the TMEM-resident accumulator.
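The "tiled to 64 instructions" figure in the sm_80 row follows from dividing the walkthrough's 64x128x16 tile by the m16n8k16 atom shape. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (M, N, K)


def mma_instruction_count(tile: Shape, atom: Shape) -> int:
    """Number of atom-sized MMA instructions needed to cover the tile."""
    (m, n, k), (atom_m, atom_n, atom_k) = tile, atom
    if m % atom_m or n % atom_n or k % atom_k:
        raise ValueError("tile is not a whole multiple of the atom shape")
    return (m // atom_m) * (n // atom_n) * (k // atom_k)


# sm_80: (64/16) * (128/8) * (16/16) instructions of mma.sync m16n8k16.
print(mma_instruction_count((64, 128, 16), (16, 8, 16)))
# sm_100a: the m64n128k16 UMMA atom covers the tile in one instruction.
print(mma_instruction_count((64, 128, 16), (64, 128, 16)))
```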
Verifier Surface at Each Stage
Each stage's verifier catches a different class of malformed MMA. The same operation must satisfy every verifier on its path; an MMA that survives Stage 2 because Stage 1's verifier didn't notice an issue still fails at Stage 3 once the kind-word ladder runs its 13 rules. The catalog by stage:
| Stage | Verifier | Sample diagnostics |
|---|---|---|
| 1 — cuda_tile | cuda_tile.mmaf verifier | "tile dimensions must conform: …", "tile would exceed the maximum of …", "fastmath attribute is not one of {reassoc, contract, …}" |
| 2 — nv_tileaa | nv_tileaa.dot verifier (inherited tile invariants) | "atom output type does not match accumulator: …", "atom input precision incompatible with operand type: …" |
| 3a — cute_nvgpu.arch.mma.SM100.umma | inherited from TileAS verify_umma_canonical_layout | "Not a canonical UMMA_MN Layout: Expected stride failure.", "Not a canonical UMMA_K Layout: Expected MN-size multiple of …" |
| 3b — cute_nvgpu.arch.mma.SM100.umma kind-word | the 13-rule ladder in Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier | "INT8 type is supported only on arch-conditional variants.", "Scale input accumulator can only be used with f16 and tf32 types", "Block scale is not supported for f16, tf32, f8f6f4, and i8 types", "cta_group::2 is not supported with weight stationary" |
| 3c — cute_nvgpu.arch.sm100.alloc_tmem | TMEM allocator verifier | "allocated tmem out of resource: …", "failed to find scratch smem to allocate tmem", "failed to init tmem" |
| 4 — LLVM IR | shared TypeConverter + intrinsic-arity check | (catch-all for arity mismatches against llvm.nvvm.tcgen05.mma.* declarations) |
| 5 — NVPTX MIR | verify_tcgen05_mma (see tcgen05 / WGMMA / mbarrier / Cluster Emission — Verifier Rules) | "tcgen05.mma supported only on arch-conditional or family-conditional variants from SM100 onwards.", "ashift is not supported with tcgen05.mma.block_scale variants", "Cannot use collector::a::use or colletor::a::fill with ashift" (verbatim colletor typo) |
| 6 — PTX text | ptxas directive verifier | (out of scope; documented under ptxas Handoff Protocol) |
The 13-rule kind-word ladder at Stage 3b is the most important to flag. A kind word that fails any of the 13 rules is rejected with a diagnostic that names the rule but not the surrounding context. In rule order, the failures are: INT8 outside arch-conditional, MXF4 sparse outside arch-conditional, explicit scale_vec_size outside arch-conditional, scale-input-accumulator on non-SM100A, scale-input-accumulator with non-f16/tf32, block-scale with f16/tf32/f8f6f4/i8, ashift with block-scale, cta_group::2 with weight-stationary, weight-stationary with mxf8f6f4/f8f6f4/mxf4, collector use/fill with ashift (preserving the verbatim colletor typo), mxf8f6f4 with a scale_vec_size other than 1X, mxf4nvf4 with 1X, and mxf4 with 1X or 4X. A reimplementation that emits a tcgen05.mma op without first walking the same 13-rule ladder builds opcodes that the backend verifier rejects at lowering time with diagnostics that are deliberately hard to map back to the originating MLIR op.
Reimplementation Checklist
Anyone reproducing a one-shot tcgen05.mma from a higher-level IR should walk the same six gates this page traces, in order. The checklist mirrors the cascade:
- Pick an MMA atom whose interface tag (SM100UmmaAtomTypeInterface) marks it as a tcgen05.mma candidate. The atom name encodes the variant (dense / sparse / block-scaled, weight-stationary or not). Anything else stays on a different MMA path.
- Verify the UMMA canonical layout invariants: K-size must be a multiple of 256 / sizeof_bits(elem) for dense or 512 / sizeof_bits(elem) for sparse, MN-size must be a multiple of the atom-imposed stride, and the descriptor's swizzle mode must match the atom's expected residency. The verify_umma_canonical_layout ladder catches every violation.
- Allocate TMEM through alloc_tmem at function entry, sized in units of 128-row columns (16 bytes per row, 2 KiB per column). Reserve exactly num_columns columns for the accumulator; the SM has 256 columns total.
- Pack the 9-bit kind word with cta_group in bits 0..1, scale_vec_size in bits 2..3, scale_input_acc in bit 4, block_scale in bit 5, mma_kind in bits 6..8. Walk the 13-rule verifier ladder before emitting the op; a kind word that passes any subset of the rules without passing all is silently miscompiled.
- Pack the collector word with the collector_a mode (fill, use, lastuse, or absence-as-discard) and the ashift bit, ensuring that ashift and collector::a::use/fill are mutually exclusive (rule 10) and that block-scale opcodes reject ashift (rule 7).
- Pair every issue with a downstream tcgen05.wait.cta_group::N (matching the issue's cta_group) before any tcgen05.ld reads the accumulator. The wait is not a property of the MMA; it is a separate operation, and it drains every outstanding MMA in the warp group rather than the specific issue.
Skipping any of these six steps yields a kernel that fails a verifier mid-pipeline, fails ptxas at SASS time, races against the asynchronous MMA queue at runtime, or, worst of all, reads stale data from a TMEM region the allocator has already returned to the free pool. The QUIRK callouts above flag the most error-prone of the six.
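Two of the checklist steps are plain arithmetic and can be sketched directly: the K-size granule from the layout step and the bit packing from the kind-word step. The function names are hypothetical; the field positions and the 256 / 512 constants are the ones stated in the checklist, and the numeric mma_kind code 3 for f16 comes from the 0xC1 decode earlier on this page.

```python
def k_multiple(elem_bits: int, sparse: bool = False) -> int:
    """K-size granule: 256 / sizeof_bits(elem) dense, 512 / ... sparse."""
    return (512 if sparse else 256) // elem_bits


def pack_kind_word(cta_group: int, scale_vec_size: int, scale_input_acc: int,
                   block_scale: int, mma_kind: int) -> int:
    """Pack the five fields into the 9-bit kind word."""
    fields = [
        ("cta_group", cta_group, 0, 2),
        ("scale_vec_size", scale_vec_size, 2, 2),
        ("scale_input_acc", scale_input_acc, 4, 1),
        ("block_scale", block_scale, 5, 1),
        ("mma_kind", mma_kind, 6, 3),
    ]
    word = 0
    for name, value, shift, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


# BF16 operands: K must be a multiple of 16 (dense) or 32 (sparse).
print(k_multiple(16), k_multiple(16, sparse=True))
# The walkthrough's MMA: cta_group::1, no scaling, f16 kind code 3.
print(hex(pack_kind_word(1, 0, 0, 0, 3)))
```

Packing only lays out the bits; it does not replace walking the 13-rule ladder, which still has to run before the op is emitted.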
Two further constraints are worth flagging because they are easy to miss when working backward from a PTX dump. First, the cute_nvgpu.arch.sm100.alloc_tmem op must dominate every umma, tmem_load, and tmem_store op that names its handle, and the matching cute_nvgpu.arch.sm100.dealloc_tmem must post-dominate them; placing the alloc inside a conditional that the consumer escapes produces IR that passes the dialect verifier but reads garbage from TMEM at runtime. Second, the tcgen05.relinquish_alloc_permit op must be issued before the kernel exits even if no MMA was actually emitted — the allocator-permit token is per-CTA, not per-region, and a CTA that exits with an outstanding permit prevents the next CTA-on-this-SM from allocating.
Cross-References
DSL to PTX End-to-End is the kernel-wide walkthrough this page mirrors; it traces the same kind of cascade for an sm_90a WGMMA GEMM and stays a useful reference for the producer/consumer pipeline structure that wraps the MMA.
TMA Load Walkthrough is the producer-side companion: the TMA bulk-tensor load that stages the A and B operands into SMEM before the tcgen05.mma reads them, with the mbarrier transaction-byte handshake the consumer-side wait depends on.
tcgen05 Tensor Memory Model is the canonical reference for the TMEM model, the ten-variant taxonomy, the per-variant operand contracts (which operand rides SMEM descriptor versus TMEM), the control word bit layout, the collector cache model, the block-scale operand layout, and the weight-stationary mode contract.
Mode Pattern Verifiers — tcgen05.mma Kind-Word Verifier is the 13-rule kind-word verifier this walkthrough exercises at Stage 3b, with the verbatim diagnostic strings (including the preserved colletor typo) and the worked-example tables.
tcgen05 / WGMMA / mbarrier / Cluster Emission covers the backend-side machine-form validation, the packed control-word/collector-word format, the subtarget feature probe, and the mbarrier-completion variant for per-MMA completion semantics.
Blackwell 2-CTA and 4-CTA MMA covers the cluster-side copy patterns (tcgen05.cp with warpx2_* and warpx4 multicast masks) that stage operands into the cooperating CTAs' TMEM regions, including the rank predicate and the cluster-sibling pairing protocol.
WGMMA Emission Protocol is the Hopper predecessor; comparing the four-op WGMMA protocol to the alloc / MMA / wait / dealloc tcgen05 protocol shows why the accumulator moved from registers to TMEM at the SM90→SM100 boundary, and the WGMMA SMEM Descriptor Bit Layout section documents the descriptor format that tcgen05 reuses verbatim.
Matmul Progression by SM places this walkthrough in the broader SM70-to-SM121 lineage and shows how the same cuda_tile.mmaf lowers to a different cascade on every supported architecture.
mbarrier State Machine is the synchronisation reference for the mbarrier-completion variant of tcgen05.mma (the tcgen05.commit family) that provides per-MMA completion semantics through a paired mbarrier — the alternative to the global tcgen05.wait this walkthrough uses.
MMA Atoms SM70-SM120 — SM100 UMMA Layout Grammar catalogues the atom witness shapes for every supported tcgen05.mma variant and the layout-assignment pre-pass that picks among them.
Buffer Assignment and Named-Barrier Binding covers the TMEM-column allocation strategy at Stage 3, including the column-base assignment and the lifetime-aware reuse across pipelined K iterations.
Matmul Progression by SM
Abstract
NVIDIA's matrix-multiply abstraction has evolved across seven SM generations. Each generation adds capacity along one of three axes — concurrency model (warp-cooperative → warp-group → cluster-cooperative), operand storage class (register fragments → SMEM descriptors → tensor memory), or numerical range (FP16 → FP8 → MXFP4 with block scales). Some generations also remove resource classes that earlier ones introduced: Blackwell datacenter parts drop the register-resident accumulator that WGMMA used, and Blackwell consumer parts drop tensor memory entirely while keeping the block-scale operand encoding.
This page is the canonical cross-architecture overview. It supersedes the scattered per-tier discussions in MMA Atoms SM70-SM120 (the per-arch shape lattice), the WGMMA and tcgen05 topic pages (which focus on one generation each), and tcgen05 / WGMMA / mbarrier / Cluster Emission. Those pages keep their per-tier content; this page covers the cross-architecture story.
SM70 / SM75: Warp-Cooperative mma.sync
SM70 (Volta) and SM75 (Turing) introduced the first generation of tensor cores. The MMA instruction is mma.sync: warp-cooperative (32 threads cooperate on one tile), synchronous (the result is visible to the warp immediately after the instruction returns), and entirely register-resident (both operands and the accumulator live in the warp's register file).
The tile shapes are fixed and small. SM70 supports 8 x 8 x 4 with FP16 inputs and FP16 or FP32 accumulators. SM75 adds 16 x 8 x 8 with FP16 and the integer low-bit forms; BF16 inputs do not arrive until SM80. The operand layouts are pinned by the architecture: each lane carries a specific subset of the matrix tile, and the layout grammar in cute_nvgpu exists in large part to record these per-lane subsets without losing them across pipeline transformations.
emit: mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 { %d0, %d1 }, { %a0, %a1 }, { %b0 }, { %c0, %c1 };
(warp-cooperative, synchronous, all operands and accumulator in registers)
SM80 / SM86 / SM87 / SM89: Dense and Sparse mma.sync
SM80 (Ampere A100) keeps the same warp-cooperative synchronous model but expands the shape lattice substantially: 16 x 8 x 16 with FP16 / BF16 / TF32 and a sparse mma.sp.sync variant that halves the structurally-sparse operand and adds a metadata operand. The lower SM80 derivatives (SM86, SM87) keep the same operations with smaller tensor-core arrays.
SM89 (Ada L40) adds FP8 E4M3 and E5M2 inputs to the same warp-cooperative synchronous register-MMA model. FP8 inputs always accumulate into FP32; the FP8 shape is 16 x 8 x 32 and the K extent doubles compared to FP16 because each element takes half the bits.
emit: mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 ... (SM80)
      mma.sp.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 ... (SM80 sparse)
      mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 ... (SM89 FP8)
None of the SM80-tier MMAs touch shared memory directly — they read operands from registers. The kernel is responsible for staging tiles into registers, typically via ldmatrix from shared memory and cp.async into shared memory upstream.
SM90 / SM90a: Warp-Group Async WGMMA
SM90 (Hopper H100) introduces the first asynchronous MMA: wgmma.mma_async. Four warps now cooperate on one accumulator tile (warp-group cooperative, hence WGMMA). The instruction is asynchronous against the issuing warps — it returns immediately, and the accumulator is not visible until a wait-group instruction drains the in-flight cohort.
The operand storage class changes too. Operand B is always an SMEM descriptor — a packed 64-bit word encoding base address, leading byte offset, stride byte offset, base offset, and swizzle mode. Operand A may be a register fragment or an SMEM descriptor depending on the atom variant. The accumulator stays in the warp group's register file, but is invisible until drained.
The four-op emission protocol — fence → tile loop of mma_async → commit → wait — is the contract a correct lowering must preserve. See wgmma-emission-protocol for details.
Shapes range over 64 x N x K where M is fixed at 64 per instruction, N steps in multiples of 8 up to 256, and K is the canonical 256 / elem_bits per element type. The architecture-qualified sm_90a variant is mandatory — plain sm_90 rejects WGMMA at NVVM verification.
emit: wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {...}, %a, %b_desc, %scale, ...;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
(warp-group cooperative, asynchronous, B in SMEM descriptor, accumulator in RF)
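The shape rule stated above (M fixed at 64, N a multiple of 8 up to 256, K equal to 256 / elem_bits) is easy to encode as a predicate. The following is a minimal Python sketch under exactly those three constraints; the function names are hypothetical, and any further per-type restrictions that the real verifier may apply are not modelled.

```python
def wgmma_k_extent(elem_bits: int) -> int:
    """Canonical K per instruction: 256 bits of K per operand row."""
    if elem_bits <= 0 or 256 % elem_bits != 0:
        raise ValueError(f"unsupported element width: {elem_bits} bits")
    return 256 // elem_bits


def is_legal_wgmma_shape(m: int, n: int, k: int, elem_bits: int) -> bool:
    """Check a 64 x N x K shape against the three rules in the text."""
    return (
        m == 64                              # M is fixed per instruction
        and n % 8 == 0 and 8 <= n <= 256     # N steps in multiples of 8 up to 256
        and k == wgmma_k_extent(elem_bits)   # K follows from the element width
    )
```

For the m64n128k16 example in the listing, 16-bit inputs give K = 256 / 16 = 16, so the shape passes; an 8-bit input type would need K = 32 instead.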
SM100 / SM103: Tensor Memory and tcgen05.mma
SM100 (Blackwell B200) and SM103 (Blackwell Ultra GB300) remove WGMMA and replace it with tcgen05.mma. The concurrency model stays warp-group cooperative; the accumulator moves out of the register file and into tensor memory (TMEM), a new on-chip memory class. Operand A becomes either an SMEM descriptor or a TMEM pointer; operand B stays as an SMEM descriptor.
TMEM is per-SM, dense (128 rows per region), and reachable only from the tcgen05 instruction family. The accumulator residency change is the single biggest architectural shift between WGMMA and tcgen05: a kernel that reads the accumulator must use tcgen05.ld to copy TMEM back into registers, not just observe the SSA value as on Hopper.
SM100 also adds two new variant axes:
- Block-scaled MMA for microscale formats (FP4, FP6, FP8) with per-block E8M0 or E4M3FN scale factors stored in dedicated TMEM regions.
- Weight-stationary mode that pins operand A to its TMEM region across the K loop, amortising A-side bandwidth.
The cluster-cooperative variant cta_group::2 lets two CTAs in a cluster share an MMA tile; CTA 0 holds half of TMEM rows, CTA 1 holds the other half. A 4-CTA copy variant exists on the staging-copy side but not on the MMA side — Blackwell's 4-CTA semantics is a copy-time fan-out, and the MMA that follows is a plain single-CTA instruction over its slice. See tcgen05-tensor-memory-model.
emit: tcgen05.alloc.shared %h, 256; // allocate TMEM region
tcgen05.cp.smem.tmem ...; // stage operand into TMEM
tcgen05.mma.cta_group::1 %h_d, %a_desc, %b_desc, %h_scale, 1;
(warp-group cooperative, asynchronous, A in SMEM/TMEM, B in SMEM, D in TMEM)
SM110: Jetson Thor — No Dedicated MMA Surface
SM110 (Jetson Thor) sits between datacenter Blackwell (SM100/SM103) and consumer Blackwell (SM120/SM121) in the architecture roster, and the compiler enumerates sm_110, sm_110a, and sm_110f as legal target strings. The cute_nvgpu dialect does not register any sm110.* MMA atom mnemonic — no WGMMA-style warp-group MMA, no tcgen05.mma over tensor memory, and no consumer-style block-scaled register MMA is dialect-side dispatched for SM110. Kernels compiled against sm_110 use the universal-FMA fallback or an earlier-tier MMA atom that the architecture-conditional gate accepts. See SM Tier Roster and Copy Atom Registry — SM110 (Jetson Thor) for the dialect-side evidence. Confidence: HIGH.
SM120 / SM121: Consumer Blackwell Block-Scaled MMA
SM120 (consumer RTX 50-series and enterprise Pro) and SM121 (DGX Spark) are a different lineage from datacenter Blackwell. They keep the block-scaled operand encoding but remove tensor memory. The MMA is once again warp-cooperative (32 threads, like SM70-SM89), synchronous (no wait-group), and entirely register-resident.
The instruction is a synchronous mma.sync.aligned with two additional scale operands, one per input matrix: scale_a and scale_b, both E8M0 register fragments. Each operand carries one scale factor per vecSize elements along K; the legal (K, vecSize) combinations are (32, 32) for the FP4/FP6/FP8 family and (64, 16) or (64, 32) for FP4-only inputs.
The accumulator stays in registers. The MMA is synchronous, so there is no wait-group barrier. The operand-encoding is closer to SM89 than to SM100 — block-scale is a numerical-range expansion of the register-MMA model, not a concurrency-model change.
emit: mma.sync.aligned.m16n8k32.row.col.f4.f4.f32.block_scale
{ %d0, %d1, %d2, %d3 },
{ %a0, %a1 }, // FP4 operand A
{ %b0 }, // FP4 operand B
{ %c0, %c1, %c2, %c3 },
{ %sa }, // E8M0 scale factor for A
{ %sb }; // E8M0 scale factor for B
(warp-cooperative, synchronous, all operands and accumulator in registers,
block-scale operands in dedicated register fragments)
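The (K, vecSize) rule above determines how many scale factors each operand row carries per instruction: K / vecSize. A minimal Python sketch of that arithmetic follows. The table keys "fp4_fp6_fp8" and "fp4_only" and the function name are hypothetical labels chosen for illustration; only the three pairs themselves come from the text.

```python
# Legal (K, vecSize) pairs as listed in the text; the family labels are made up.
LEGAL_BLOCK_SCALE_PAIRS = {
    "fp4_fp6_fp8": {(32, 32)},
    "fp4_only": {(64, 16), (64, 32)},
}


def scale_factors_per_row(family: str, k: int, vec_size: int) -> int:
    """Number of scale factors one operand row needs along K for one instruction."""
    if (k, vec_size) not in LEGAL_BLOCK_SCALE_PAIRS[family]:
        raise ValueError(f"({k}, {vec_size}) is not a legal pair for {family}")
    return k // vec_size  # one scale factor per vec_size elements along K
```

With (32, 32) a row needs a single scale factor, while the FP4-only (64, 16) form needs four, which is the practical difference between the two vector sizes.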
Worked Example: m64n128k16 bf16 × bf16 → f32
The clearest way to see the per-generation lowering differences is to pick a single logical matmul shape and trace what each tier emits. The shape below is large enough to require warp-cooperation on every tier but small enough to fit in one warp-group instruction on SM90 and SM100:
Computation: D = A × B + C
A: 64 × 16 tile, bf16
B: 16 × 128 tile, bf16
C, D: 64 × 128 tile, f32
SM70 / SM75: warp-cooperative mma.sync, register-resident
Volta and Turing have no instruction that produces a 64 × 128 tile in one issue. The compiler tiles the 64 × 128 output into a 4 × 16 grid of m16n8k8 sub-tiles and dispatches them across four warps (one per M = 16 sub-tile row), with each warp walking 16 N sub-tiles and 2 K sub-tiles per N sub-tile. Neither tier has a native bf16 mma.sync form (BF16 arrives with SM80), so the mnemonic below illustrates the tiling structure rather than an instruction these parts accept. Operand fragments load from SMEM via ldmatrix into the warp's register file before each mma.sync:
for warp_m in 0..4:                # 64 / 16 = 4 warps cover the M extent
  for n_tile in 0..16:             # 128 / 8 = 16 N sub-tiles per warp
    for k_tile in 0..2:            # 16 / 8 = 2 K sub-tiles per N sub-tile
      ldmatrix A[warp_m, k_tile]   # 2 i32 registers (16 x 8 bf16 over 32 lanes)
      ldmatrix B[k_tile, n_tile]   # 1 i32 register (8 x 8 bf16 over 32 lanes)
      mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32
        { D[warp_m, n_tile] regs },  # 4 f32 registers
        { A regs }, { B regs },
        { C[warp_m, n_tile] regs }   # 4 f32 registers
Total: 4 warps × 16 N-tiles × 2 K-tiles = 128 individual MMAs. Every operand and accumulator lives in the register file; SMEM is staging only. The MMAs are synchronous — the result is in registers when the instruction returns.
SM80 / SM86 / SM89: warp-cooperative mma.sync with wider K
Ampere expands the legal shapes to m16n8k16 for bf16, doubling the K extent per instruction. The same 64 × 128 output now tiles into a 4 × 16 grid of m16n8k16 sub-tiles — one K sub-tile per N sub-tile per warp:
for warp_m in 0..4:                # 64 / 16 = 4 warps
  for n_tile in 0..16:             # 128 / 8 = 16 N sub-tiles per warp
    ldmatrix A[warp_m, 0]          # K = 16 in one load
    ldmatrix B[0, n_tile]          # K = 16 in one load
    mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
      { D regs }, { A regs }, { B regs }, { C regs };
Total: 4 warps × 16 N-tiles × 1 K-tile = 64 MMAs — half the SM70/SM75 count. Operand and accumulator residency is identical to Volta; the change is the K extent per instruction. SM80 also gains the mma.sp.sync.aligned sparse variant for 2:4-structured operands; SM89 adds FP8 inputs (mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32) with the K extent doubled again to 32.
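The instruction totals quoted for the register-MMA tiers follow from dividing the logical tile by the atom shape along each axis. A minimal Python sketch of that count is below; mma_count is a hypothetical helper, and it assumes the tile divides evenly by the atom, which holds for every shape in this example.

```python
def mma_count(tile, atom):
    """Number of atom-sized MMAs needed to cover a logical (M, N, K) tile."""
    counts = []
    for extent, step, axis in zip(tile, atom, "MNK"):
        if extent % step:
            raise ValueError(f"{axis} extent {extent} is not a multiple of {step}")
        counts.append(extent // step)
    m_tiles, n_tiles, k_tiles = counts
    return m_tiles * n_tiles * k_tiles
```

Calling mma_count((64, 128, 16), (16, 8, 8)) reproduces the 128 figure for m16n8k8, and swapping the atom for (16, 8, 16) reproduces the 64 figure for m16n8k16.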
SM90a: warp-group async WGMMA, B in SMEM descriptor
Hopper collapses the entire 64 × 128 output into a single warp-group instruction. Four warps cooperate on the same accumulator tile (M = 64 is the warp-group dimension); operand B rides an SMEM descriptor; operand A may be a register fragment or an SMEM descriptor. The accumulator stays in the warp group's register file but is invisible until the wait drains the group:
# Build the SMEM descriptor for B once before the loop
%b_desc = make_smem_desc(smem_off=&B, lbo=16*2, sbo=0, base_offset=0, swizzle=128B)
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16
{ %fd0, %fd1, ..., %fd63 }, # 64 f32 accumulator registers per thread (64 x 128 values over 128 threads)
{ %ra0, %ra1, ..., %ra3 }, # 4 32-bit A-fragment registers, two bf16 each (or an A-side SMEM descriptor if SMEM-resident)
%b_desc, # 64-bit SMEM descriptor for B
%scale_d, # 1 if accumulating, 0 if zeroing
1, 1, # scale-A, scale-B (FP families)
0, 0; # transpose-A, transpose-B
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
One MMA replaces 64 from SM80. The four-op protocol — fence, async MMA, commit, wait — is mandatory (see WGMMA Emission Protocol). The accumulator is async-visible only: reads of %fd* before wait_group are silent UB.
SM100 / SM103: warp-group tcgen05.mma, accumulator in TMEM
Blackwell moves the accumulator out of the register file entirely. The 64 × 128 f32 output now lives in TMEM, occupying 16 columns of the SM's 128-row × 256-column TMEM grid. Operand A lands in TMEM (staged from SMEM via tcgen05.cp); operand B stays as an SMEM descriptor with the same 64-bit format as Hopper. Single-CTA variant:
# Allocate TMEM for the accumulator (16 columns × 128 rows × 16 B = 32 KiB)
%d_tmem = tcgen05.alloc.shared 16
# Stage A from SMEM into TMEM (one column × 128 rows for bf16 A tile)
tcgen05.cp.smem.tmem %a_tmem, smem_off=&A, layout=...
# Build the SMEM descriptor for B
%b_desc = make_smem_desc(smem_off=&B, lbo=..., sbo=..., swizzle=128B)
# Pack the control word: kind::f16 (covers bf16), cta_group::1, no block-scale
%ctrl = ((MMA_KIND_F16 << 6) | (CTA_GROUP_1 << 0))
tcgen05.mma.cta_group::1.kind::f16.f32.bf16.bf16
[%d_tmem], # TMEM destination (C and D in-place)
[%a_tmem], # TMEM source for A
%b_desc, # SMEM descriptor for B
%ctrl; # packed control word
# Drain via mbarrier or tcgen05.commit + tcgen05.wait
mbarrier.arrive.expect_tx [%mbar], 1
mbarrier.wait %mbar
# Copy D out of TMEM back to registers if a consumer needs it
tcgen05.ld.shared %dst_regs, [%d_tmem]
The MMA is async like WGMMA, but the completion signal is an mbarrier transaction rather than a wait-group counter. Reading the accumulator requires an explicit tcgen05.ld to copy TMEM into registers — there is no SSA visibility shortcut like Hopper's. See tcgen05 Tensor Memory Model.
The 2-CTA cooperative variant halves the M extent per CTA: CTA 0 owns the top 32 rows of D (M = 0..32), CTA 1 owns the bottom 32 rows (M = 32..64). The MMA opcode becomes tcgen05.mma.cta_group::2.kind::f16.f32.bf16.bf16 and pairs the two CTAs at execute time.
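The TMEM footprint and the 2-CTA row split described above are plain arithmetic, sketched here in Python. The helper names are hypothetical; the 128-row region height and the 16 B per-cell figure are copied from the allocation comment in the listing rather than derived independently.

```python
def accumulator_bytes(m: int, n: int, elem_bytes: int = 4) -> int:
    """Bytes needed for an m x n accumulator (f32 by default)."""
    return m * n * elem_bytes


def tmem_region_bytes(columns: int, rows: int = 128, cell_bytes: int = 16) -> int:
    """Bytes covered by a TMEM allocation, using the listing's rows and cell size."""
    return columns * rows * cell_bytes


def cta_row_ranges(m: int, cta_group: int):
    """Half-open row ranges of D owned by each CTA when M is split evenly."""
    if m % cta_group:
        raise ValueError("M must divide evenly across the CTA group")
    rows = m // cta_group
    return [(i * rows, (i + 1) * rows) for i in range(cta_group)]
```

Under these figures the 16-column allocation is exactly the size of the 64 × 128 f32 accumulator (32 KiB), and cta_row_ranges(64, 2) yields the 0..32 and 32..64 ownership described for cta_group::2.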
SM120 / SM121: warp-cooperative block-scale mma.sync, register-resident
Consumer Blackwell drops TMEM but keeps the block-scale operand encoding. Without TMEM, the warp-group cooperation model collapses back to per-warp synchronous MMA, so the 64 × 128 output once again tiles across four warps as on Ampere — but with a per-operand scale factor:
for warp_m in 0..4:                  # 64 / 16 = 4 warps
  for n_tile in 0..16:               # 128 / 8 = 16 N-tiles per warp
    for k_tile in 0..ceil(K/32):     # K = 32 per block-scale instruction; one iteration for this example
      ldmatrix A[warp_m, k_tile]
      ldmatrix B[k_tile, n_tile]
      mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.scale_vec::1X.block_scale.f32.e4m3.e4m3.f32
        { D regs },
        { A regs },                  # FP8 operand A (e4m3 here)
        { B regs },                  # FP8 operand B
        { C regs },
        %sfa,                        # E8M0 scale factor for A (register)
        %sfb;                        # E8M0 scale factor for B (register)
No async, no TMEM, no warp-group cooperation. The scale factors %sfa and %sfb are per-warp register fragments: one E8M0 byte per vecSize = 32 elements along K. Compared to SM100, the block-scale operand encoding is identical (same E8M0 / E4M3FN formats, same (K, vecSize) pairs) but the residency is registers, not TMEM.
Side-By-Side Summary
| Tier | Instructions per 64×128×16 | Operand A | Operand B | Accumulator | Sync model | Operand-A bandwidth |
|---|---|---|---|---|---|---|
| SM70/75 | 128 (m16n8k8) | RF | RF | RF | sync | re-loaded per inner tile |
| SM80/89 | 64 (m16n8k16) | RF | RF | RF | sync | re-loaded per N-tile |
| SM90a | 1 (m64n128k16) | RF or SMEM desc | SMEM desc | RF (async) | async (4-op) | one load per instruction |
| SM100/103 | 1 (m64n128k16) | SMEM desc or TMEM | SMEM desc | TMEM | async (mbarrier) | amortised by collector |
| SM110 (Jetson Thor) | falls through to universal-FMA / earlier-tier atoms | — | — | — | — | no SM110-specific MMA dispatch |
| SM120/121 | 64 (m16n8k32 block-scale) | RF | RF | RF | sync | re-loaded per N-tile |
Reading the table: the instruction-count progression collapses the per-warp tile loop into the hardware between SM89 and SM90, then keeps it collapsed through SM100. SM120 reverts to per-warp tiling because consumer Blackwell removes the warp-group cooperation model, but the block-scale operand encoding stays, so SM120 is "SM89-shaped MMA with SM100's numerical range". The accumulator-residency progression is the most consequential: at SM90 the accumulator stays in the register file but becomes readable only after the wait, at SM100 it leaves the register file for TMEM, and at SM120 it returns to the RF with ordinary synchronous visibility. A kernel author who reuses an SM100 codepath on SM120 has to re-introduce explicit ldmatrix staging because TMEM is no longer there.
What Each Generation Adds and Removes
| Tier | Concurrency | Operand A | Operand B | Accumulator | Sync | New |
|---|---|---|---|---|---|---|
| SM70/75 | warp (32 lanes) | RF | RF | RF | sync | dense mma.sync, FP16 |
| SM80 | warp (32 lanes) | RF | RF | RF | sync | sparse mma.sp.sync, BF16, TF32 |
| SM89 | warp (32 lanes) | RF | RF | RF | sync | FP8 E4M3 / E5M2 inputs |
| SM90a | warp-group (4 warps) | RF or SMEM desc | SMEM desc | RF (async-visible) | async | warp-group MMA, SMEM operand descriptors |
| SM100/103 | warp-group, optional 2-CTA cluster | SMEM desc or TMEM | SMEM desc | TMEM | async | tensor memory, block-scale, weight-stationary, sparse block-scale |
| SM110 (Jetson Thor) | — | — | — | — | — | target tier registered, no dedicated MMA atom; lowering falls through to universal-FMA |
| SM120/121 | warp (32 lanes) | RF | RF | RF | sync | block-scale on consumer parts, no TMEM, no async |
The progression is not monotonic. SM90a keeps the accumulator in the RF but makes it async-visible only. SM100 moves it out of the RF entirely, into TMEM. SM120 moves it back into registers, but keeps the block-scale operand encoding that SM100 added. The right way to read the table is one column at a time: concurrency grows up to SM100 and then resets for consumer Blackwell; operand storage class climbs steadily through SM100 and then resets; numerical range grows monotonically.
Cross-References
MMA Atoms SM70-SM120 carries the per-arch shape lattice and the dialect-side atom contracts. WGMMA Emission Protocol covers the SM90a four-op protocol. tcgen05 Tensor Memory Model covers the SM100/103 model and the 10-variant taxonomy. Mode Pattern Verifiers carries the kind-word verifier ladder that gates SM100 and SM120 block-scaled variants. Blackwell 2-CTA and 4-CTA MMA documents the cluster-cooperative copy patterns that stage TMEM operands for SM100. mbarrier State Machine is the synchronisation primitive every async generation builds its producer/consumer protocol on top of.
TMA Load Walkthrough
Abstract
A single TMA load — the asynchronous bulk-tensor copy that moves a tile from global memory into shared memory on sm_90a and later — touches every layer of the tileiras cascade. It begins life as a tile-shaped cuda_tile.load_view_tko, picks up an alias-aware token in nv_tileaa, acquires a TMA descriptor handle and an mbarrier slot in nv_tileas, expands into an nvvm.cp.async.bulk.tensor.shared.cluster.global.2d intrinsic in LLVM, becomes a CP_ASYNC_BULK_TENSOR_2D_* machine instruction in NVPTX MIR, and surfaces as cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes in PTX text. The transaction-byte count — the number the consumer's mbarrier.try_wait.parity checks against — flows through each layer under a different name and at a different level of abstraction.
This page traces one load end-to-end. The kernel-wide walkthrough in DSL to PTX End-to-End shows the same kernel at every stage with all operations in place; this page narrows the focus to a single operation so the descriptor lifecycle, the mbarrier handshake, and the transaction-byte accounting are visible without GEMM scaffolding. Cross-reference targets remain the per-stage canonical pages: cuda_tile to nv_tileaa, nv_tileaa to nv_tileas, nv_tileas to LLVM, TileAS TMA and Memops Family, mbarrier State Machine, WGMMA Emission Protocol, and TMA, Tensormap, and cp.async.bulk.
Confidence: HIGH for IR shapes, mnemonic spellings, and the transaction-byte arithmetic; MED for the SSA value naming used in the worked example (the binary-derived examples in the source pages use slightly different temp names).
The Operation
The walkthrough operation is one TMA bulk-tensor load of a 128-row × 128-column BF16 tile from a global tensor into shared memory, on sm_90a, with one mbarrier slot acting as the completion barrier. The element type is bf16 (2 bytes), so the tile carries 128 × 128 × 2 = 32 768 bytes — that integer is the transaction-byte count every layer eventually publishes against the barrier. The CTA hosts a single warp group (128 threads) doing the load on behalf of a consumer that will read the shared-memory tile.
The frontend constructed:
a_view = make_partition_view(A, [M, K], tile=(128, 128), dim_map=[0, 1])
a_tile = load(a_view, (block_m, block_k)) # tile<128x128xbf16>
A is a global-memory bf16 tensor of shape [M, K] with row-major strides [K, 1]. block_m and block_k are CTA-supplied tile coordinates. The load completes asynchronously — the consumer side waits on an mbarrier before reading a_tile, but that wait is a separate operation. This page traces the load itself.
The transaction-byte arithmetic that runs through every stage:
tile_bytes = tile_rows * tile_cols * sizeof(bf16)
           = 128 * 128 * 2
           = 32768 bytes
That single integer is the value nv_tileas.async.tiled_tma_load carries as its tx_count attribute, nvvm.mbarrier.arrive.expect_tx publishes against the barrier, and the consumer side sees as expected_txn in the mbarrier state machine documented in mbarrier State Machine — State Machine.
Stage 1: cuda_tile IR
The first IR the compiler sees comes out of the frontend's bytecode. The load is a cuda_tile.load_view_tko — token-ordered tile load from a partition view — and the verifier contract on the operation is the standard cuda_tile contract: power-of-two tile dimensions, a 16-million-element ceiling per tile, a token operand for ordering, and an explicit tile-typed SSA result.
%a_view = cuda_tile.tensor_view %A, shape = [%M, %K], stride = [%K, 1]
: !cuda_tile.tensor_view<128x128xbf16>
%a_part = cuda_tile.partition_view %a_view, tile = [128, 128], dim_map = [0, 1]
: !cuda_tile.partition_view<128x128xbf16>
%tok0 = cuda_tile.make_token : !cuda_tile.token
%a_tile, %tok_a = cuda_tile.load_view_tko %a_part, [%bm, %bk], %tok0
: !cuda_tile.tile<128x128xbf16>, !cuda_tile.token
There is no descriptor, no mbarrier, no transaction-byte count, and no TMA mention. cuda_tile is the public surface and deliberately stays target-agnostic: a partition view plus tile coordinates plus an ordering token is all the frontend has to publish. The _tko suffix denotes the token-ordered shape — the load consumes an input token and produces an output token, so subsequent loads and stores can be scheduled against the ordering edge rather than against explicit fences. The optional allow_tma attribute (defaulting to true on sm_90a and later) is what the next pass reads to decide whether the load becomes a TMA copy or falls back to plain ldg. See cuda_tile to nv_tileaa for the conversion target this op is illegal against.
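The shape half of the cuda_tile verifier contract mentioned above (power-of-two tile dimensions and a per-tile element ceiling) can be sketched as a small check. This is a Python illustration only: verify_tile_shape and its diagnostic strings are hypothetical, and the ceiling is assumed to be 2**24 elements, which is one reading of "16-million"; the exact constant and whether the bound is inclusive are not stated in the text.

```python
MAX_TILE_ELEMENTS = 1 << 24  # assumption: "16-million-element ceiling" read as 2**24


def verify_tile_shape(shape, max_elements: int = MAX_TILE_ELEMENTS):
    """Return a list of diagnostics; an empty list means the shape passes."""
    errors = []
    total = 1
    for index, extent in enumerate(shape):
        # A positive power of two has exactly one bit set.
        if extent <= 0 or extent & (extent - 1):
            errors.append(f"dimension {index} has extent {extent}, expected a power of two")
        total *= max(extent, 1)
    if total > max_elements:
        errors.append(f"tile has {total} elements, ceiling is {max_elements}")
    return errors
```

The walkthrough's 128 × 128 tile passes both checks with 16 384 elements; a 96 × 128 tile would be rejected on the first dimension.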
⚡ QUIRK — cuda_tile carries no descriptor, no mbarrier, no tx-count
The public dialect has no syntax for a TMA descriptor, no syntax for an mbarrier slot, and no transaction-byte attribute. Every TMA-specific noun first appears in nv_tileaa (the copy-atom witness) or nv_tileas (the descriptor handle, the mbarrier slot, the tx_count). A reimplementer who tries to express any of those on the public surface has misread the contract: cuda_tile is a tile-algebra dialect, not a TMA-shaped dialect. The promotion to TMA is a downstream decision driven by the copy-atom registry, not a frontend gesture.
Stage 2: nv_tileaa IR
ConvertCudaTileToTileAA rewrites the load through the three-populator structure documented in cuda_tile to nv_tileaa. Part B of that structure owns the memory and view families, so the rewrite for this load lives in the Part-B tiled_load pattern. The partition view dissolves into an nv_tileaa.make_memref plus an nv_tileaa.addptr, the tile type becomes an MLIR tensor<128x128xbf16>, and — the key change for this walkthrough — the load picks up a CopyAtom witness. The witness is an attribute that names the hardware copy primitive selected by the layout-assignment pre-pass; for an sm_90a load that meets the TMA eligibility rules (the box-dim invariants documented in TileAS TMA and Memops Family — Descriptor Builders and Verifiers) the witness is sm90_tma_load_2d_bf16.
%a_ref = nv_tileaa.make_memref %A, shape = [%M, %K], stride = [%K, 1],
space = #nv_tileaa.global
: !nv_tileaa.memref<?x?xbf16>
%off_a = nv_tileaa.addptr %a_ref, [%bm, %bk]
: !nv_tileaa.memref<?x?xbf16>
%a_tile, %tok_a = nv_tileaa.tiled_load %off_a, %tok0
{ atom = #cute.copy_atom<sm90_tma_load_2d_bf16>,
in_bounds = array<i1: true, true>,
mem_semantic = #nv_tileaa<mem_semantic relaxed>,
mem_scope = #nv_tileaa<mem_scope cluster> }
: !nv_tileaa.memref<?x?xbf16> -> tensor<128x128xbf16>,
!nv_tileaa.mem_token
The CopyAtom witness is the single most consequential change. nv_tileaa.tiled_load does not commit to a particular hardware primitive — that decision rides on the attribute. A different atom in the same slot (sm80_cp_async_4_bf16, sm70_ldg_128_bf16, sm90_tma_load_2d_bf16_mcast) would steer the next stage's rewrite into a different lowering path. Layout assignment runs before this pass and is what consults the copy-atom registry documented in SM-Tier Roster and Copy Atom Registry; after this pass the witness travels verbatim down to the LLVM lowering.
Token ordering also takes its alias-aware form. The output token %tok_a is now an !nv_tileaa.mem_token, threaded into the next memory op so the scheduler can reason about read-after-write and write-after-read edges without inserting fences. The mem_semantic = relaxed and mem_scope = cluster attributes together declare that the load is observably ordered against other cluster-scope traffic but does not impose an acquire fence.
The transaction-byte count is still implicit. The element type (bf16) and the tile shape (128 × 128) determine it, but no attribute yet records the integer.
CopyAtom Witness Selection
The witness attached to the tiled_load is not free-form text. The dialect interface CopyAtomAttrInterface constrains every valid witness to one of the entries in the SM-tier copy atom registry, and the layout-assignment pre-pass picks among them by reading three pieces of information from the operand context: the compute capability (the --compute-capability driver option, parsed once and threaded onto the function as nv_tileaa.compute_capability), the source-memref's address space (global versus shared), and the destination tile's shape and element type.
For this walkthrough's load, layout assignment sees sm_90a, address-space-1 (global) source, and a 128 × 128 × bf16 destination. The matching atom in the SM-tier registry is sm90_tma_load_2d_bf16 with swizzle_128B — the swizzle mode is selected so that consecutive K-axis vectors in the destination shared-memory tile fall in different SMEM banks, avoiding bank conflicts when the consumer's WGMMA reads through 128-bit ldmatrix.sync.aligned-style fragments. On Ada Lovelace (sm_89) the same source cuda_tile.load_view_tko would resolve to sm80_cp_async_4_bf16 because TMA hardware is not available; on Ampere (sm_80) the same atom; on Volta or earlier, a plain sm70_ldg_128_bf16.
The witness is also the gate LowerTMALoadStoreToAsync reads in phase 2 of its eight-phase walk. Atoms that do not implement TmaAtomTypeInterface (the plain ldg, stg, ldgsts family) are skipped in phase 2 without rewrite, so the TMA expansion documented in Stage 3 of this page never runs for them. The atom-to-interface dispatch is the central decision that pins the rest of the lowering — picking a non-TMA atom in the witness slot keeps the load as a synchronous tiled_load all the way down to the LLVM stage.
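The per-tier outcomes described in this section can be summarised as a lookup. The Python sketch below covers only the case the walkthrough names (a global-memory source and a 2D bf16 tile) and returns the witness names quoted in the text. select_copy_atom is a hypothetical function, the integer tier encoding is an assumption, and the real pre-pass also applies the TMA eligibility rules, which are not modelled here. Tiers the text does not mention are rejected instead of guessed.

```python
def select_copy_atom(sm: int, src_space: str, rank: int, dtype: str) -> str:
    """Pick the copy-atom witness for the walkthrough's load on a given SM tier."""
    if (src_space, rank, dtype) != ("global", 2, "bf16"):
        raise NotImplementedError("only the walkthrough's global 2D bf16 case is modelled")
    if sm == 90:
        return "sm90_tma_load_2d_bf16"   # sm_90a: TMA hardware available
    if sm in (80, 89):
        return "sm80_cp_async_4_bf16"    # Ampere and Ada: no TMA, cp.async instead
    if sm <= 70:
        return "sm70_ldg_128_bf16"       # Volta or earlier: plain ldg
    raise NotImplementedError(f"sm_{sm} is not covered by the text")
```

Keeping the unknown tiers as explicit errors mirrors the point of the section: the witness is a registry entry, not free-form text, so a tier without a documented entry has no valid answer.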
Stage 3: nv_tileas IR
ConvertTileAAToTileAS keeps the same operand shape but renames the op and updates dialect namespaces. The TileAS rewrite documented in TileAA to TileAS — tiled_load Witness Hand-Off preserves the CopyAtom witness verbatim, swaps the mnemonic prefix from nv_tileaa to nv_tileas, and leaves the rest of the operand vector unchanged.
%a_tile, %tok_a = nv_tileas.tiled_load %off_a, %tok0
{ atom = #cute.copy_atom<sm90_tma_load_2d_bf16>,
in_bounds = array<i1: true, true>,
mem_semantic = #nv_tileas<mem_semantic relaxed>,
mem_scope = #nv_tileas<mem_scope cluster> }
: !nv_tileaa.memref<?x?xbf16> -> tensor<128x128xbf16>,
!nv_tileaa.mem_token
Then the TileAS TMA and Memops Family pipeline runs, with LowerTMALoadStoreToAsync doing the heavy work. The eight-phase walk (KernelSpec gate, TMA-eligibility scan, tmaIdx assignment, descriptor bind, async op materialization, mbarrier emission, wait sinking, diagnostic finalization) is documented in TileAS TMA and Memops Family — LowerTMALoadStoreToAsync. The output is the four-op sequence the downstream lowering expects: descriptor build, async TMA op, mbarrier expect-tx, mbarrier wait.
// ---- descriptor materialized by phase 4
%desc_a = nv_tileas.make_tiled_tma_desc %a_ref, box = [128, 128],
atom = #cute_nvgpu.atom_copy_field_tmaload<load_2d_bf16, swizzle_128B>,
tmaIdx = 0 : i32
: !nv_tileas.tma_desc<128x128xbf16>
// ---- mbarrier reserved by phase 6 (one slot, count=1, expecting one arrival)
%mbar_a = nv_tileas.alloc_mbarrier { count = 1 : i32 }
: !nv_tileas.mbarrier
// ---- async TMA load: phase 5 emission
%tok_a = nv_tileas.async.tiled_tma_load
%desc_a, %a_smem[%bm, %bk], %mbar_a
{ atom = #cute_nvgpu.atom_copy_field_tmaload<load_2d_bf16, swizzle_128B>,
tx_count = 32768 : i32,
tmaIdx = 0 : i32 }
: !nv_tileas.tma_desc<128x128xbf16>,
!nv_tileas.smem<128x128xbf16>,
index, index,
!nv_tileas.mbarrier
-> !nv_tileas.async
Three new entities appear at this stage. First, the TMA descriptor is a first-class SSA value — %desc_a is the result of make_tiled_tma_desc, which captures tensor shape, stride, padding mode, descriptor mode (tiled), element type, and the cute_nvgpu.tma_atom witness with its swizzle. The descriptor's tmaIdx attribute is the per-function counter assigned in phase 3 of LowerTMALoadStoreToAsync; downstream the AttachTMADescriptorArgs pass (documented in TileAS TMA and Memops Family — Descriptor ABI) reads it to wire host descriptor preparation back to device descriptor consumption. Second, the mbarrier slot is an explicit !nv_tileas.mbarrier SSA value with an arrive_count of 1 — one producer agent will publish completion. Third, the transaction-byte count is a concrete tx_count = 32768 : i32 attribute on the async load. That integer is the byte count the consumer's mbarrier.try_wait.parity will check against.
⚡ QUIRK — transaction-byte count is per-atom, not per-mbarrier
A single mbarrier slot can receive transaction-byte updates from multiple TMA loads (one per operand in a multi-input WGMMA, for instance). The tx_count attribute stamped on each nv_tileas.async.tiled_tma_load is the byte count that load will contribute; the consumer's expected_txn field on the mbarrier is the sum across all loads that arrive on the same barrier. A reimplementation that publishes a per-mbarrier total at descriptor-build time and ignores per-atom byte counts produces a barrier whose expected_txn never matches the actual transaction total, and try_wait.parity either fires early (if the published total is too small) or hangs forever (if too large). The per-atom accounting is what makes the multi-operand WGMMA producer/consumer handshake work.
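The per-atom accounting can be sketched as a reduction over loads that share a barrier. This is an illustrative Python model only: the function name and the `(barrier_id, tx_count)` pair representation are assumptions, not structures taken from tileiras.

```python
from collections import defaultdict


def expected_txn_per_barrier(loads):
    """Sum per-load tx_count values into one expected_txn total per barrier.

    loads: iterable of (barrier_id, tx_count) pairs, one per async TMA load.
    """
    totals = defaultdict(int)
    for barrier_id, tx_count in loads:
        totals[barrier_id] += tx_count
    return dict(totals)
```

For example, two 128x64xf16 operand loads (16384 bytes each) arriving on the same barrier would give that barrier an expected total of 32768 bytes.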
⚡ QUIRK — CopyAtom witness vs concrete PTX form
The CopyAtom witness sm90_tma_load_2d_bf16 does not name the PTX mnemonic the load eventually becomes. It names a family: the basic 2D tile load is cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes, but the multicast variant of the same atom prints …multicast::cluster, the L2-cache-hint variant prints …L2::cache_hint, and the im2col atom variant prints …im2col. The witness names the legal shape; per-load options the rewriter discovers (an attached multicast mask, an attached cache hint, an im2col mode flag) select among the printable variants. A naive name-to-mnemonic mapping in a reimplementation skips the variant gates the codegen page documents in TMA, Tensormap, and cp.async.bulk.
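One way to picture the variant selection is a small mnemonic builder driven by per-load options. The Python sketch below is hypothetical: `tma_load_mnemonic` and its parameters are illustrative names, and the modifier placement (load mode before the completion mechanism, multicast before the cache hint) is an assumption based on the PTX syntax as commonly documented, not something read out of tileiras.

```python
def tma_load_mnemonic(rank, mode="tile", multicast=False, l2_cache_hint=False):
    """Assemble a cp.async.bulk.tensor load mnemonic from per-load options."""
    parts = [
        "cp.async.bulk.tensor",
        f"{rank}d",
        "shared::cluster",
        "global",
        mode,  # "tile" or "im2col"
        "mbarrier::complete_tx::bytes",
    ]
    if multicast:
        parts.append("multicast::cluster")  # assumed to follow the completion mechanism
    if l2_cache_hint:
        parts.append("L2::cache_hint")  # assumed to come last
    return ".".join(parts)
```

The point of the sketch is that the witness name alone is not an input: the same rank-2 atom yields different printed forms depending on the options attached to the load.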
Descriptor Lifecycle and the Host/Device Split
The nv_tileas.make_tiled_tma_desc op materialized above is one of two possible descriptor origins. AttachTMADescriptorArgs and SeparateHostTMA (documented in TileAS TMA and Memops Family — Descriptor ABI) split the descriptor population between host and device. When the descriptor depends only on values the runtime can supply through the launch ABI (the kernel's global pointer arguments, the tensor shape and stride parameters, the box-dim attribute that came from the partition_view), the pass hoists descriptor construction into a host-side companion module that builds a 64-byte CUDA tensormap once per launch. When the descriptor depends on values only the device knows (a runtime-computed shape, a divergent stride), it stays on the device and the kernel emits a cp.async.bulk.tensor.encode sequence inline before the load.
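The host/device decision reduces to a provenance test over the descriptor's inputs. The Python sketch below is a simplification under stated assumptions: the provenance tags (`kernel_arg`, `constant_attr`) and the function name are hypothetical, and the real passes work on SSA def-use chains rather than string tags.

```python
# Hypothetical provenance tags for values the launch ABI can supply.
HOST_KNOWN = {"kernel_arg", "constant_attr"}


def descriptor_origin(operand_kinds):
    """Return "host" if every descriptor input is launch-known, else "device"."""
    if all(kind in HOST_KNOWN for kind in operand_kinds):
        return "host"
    return "device"
```

A descriptor built only from kernel arguments and a constant box-dim attribute takes the host path; a single device-computed input keeps it on the device.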
For this walkthrough, the descriptor depends only on the kernel's %A, %M, and %K arguments, so the host-side path wins. The kernel ABI grows a .param slot holding a pointer to the descriptor, the slot carries the cute_nvgpu.grid_constant argument attribute (later lifted to nvvm.grid_constant), and the device-side make_tiled_tma_desc op survives only as a marker the next stage's pattern looks up by tmaIdx = 0. The runtime's per-launch descriptor-emit callback runs once per launch and writes 64 bytes per descriptor into a scratch buffer the kernel reads through its .param slot.
The descriptor's 64-byte payload encodes the eleven fields the TileAS TMA and Memops Family — Tensormap Mutators page documents: global base address, per-axis dimension sizes, per-axis strides, element type, rank, format (tiled / im2col / im2col_at / tiled_at), box shape, swizzle mode, fill mode, OOB fill value, and interleave layout. Eight of the eleven are immutable on device (set at construction, never replaced); only the global base pointer, per-axis sizes, and non-leading strides can be replaced via tensormap.replace.{global_address,dim_size,stride_size} if the kernel needs to vary them across iterations.
Stage 4: NVVM Intrinsic in LLVM IR
ConvertTileASToLLVM is the terminal MLIR-side lowering, and its nine-phase body conversion documented in tileas to LLVM carries the load to LLVM. The TMA-load rewrite specifically follows the five-step pattern in tileas-to-llvm — async.tiled_tma_load: the descriptor becomes an llvm.ptr<1> to its global-memory home, the destination view becomes an llvm.ptr<3> to the shared-memory base address, the per-axis coordinates flow through unchanged as i32 values, and the mbarrier slot becomes an llvm.ptr<3> to the completion barrier.
; ---- descriptor pointer (taken from the kernel's grid-constant .param slot)
%desc_a_ptr = call ptr addrspace(1) @llvm.nvvm.tma.get.descriptor.address(i32 0)
; ---- mbarrier publishes expected transaction-byte count
call void @llvm.nvvm.mbarrier.arrive.expect_tx.shared(
ptr addrspace(3) %mbar_a, i32 32768)
; ---- TMA bulk-tensor load: shared <- global, through the descriptor,
; coordinate order reversed to inner-axis-first (column-major)
call void @llvm.nvvm.cp.async.bulk.tensor.shared.cluster.global.2d(
ptr addrspace(3) %a_smem_dst, ; destination: shared base address
ptr addrspace(1) %desc_a_ptr, ; source: TMA descriptor in global
i32 %bk_coord, ; coord[0]: inner axis %bk (this slot held %bm in TileAS order)
i32 %bm_coord, ; coord[1]: outer axis %bm (this slot held %bk in TileAS order)
ptr addrspace(3) %mbar_a) ; completion barrier
Four things change at the LLVM boundary. The CopyAtom witness is consumed — the intrinsic name llvm.nvvm.cp.async.bulk.tensor.shared.cluster.global.2d encodes everything the witness named (2D tile, shared destination, global source, mbarrier completion), so no attribute is needed. The transaction-byte count is published in its own instruction: llvm.nvvm.mbarrier.arrive.expect_tx.shared writes 32768 into the barrier's expected_txn field. Coordinate order reverses — TileAS lists coordinates outer-axis-first ([%bm, %bk]) to match layout-assignment's row-major convention, but the LLVM intrinsic expects inner-axis-first ([%bk, %bm]) to match the PTX instruction. And the AsyncToken SSA result becomes an i32 zero constant: the token does not carry hardware state, only an IR-level data-dependence edge, so the lowering replaces it with a placeholder whose only purpose is keeping the SSA dataflow connected.
The mbarrier's role is decoupled here. The mbarrier.arrive.expect_tx.shared call is what publishes the byte count; the cp.async.bulk.tensor call is what issues the asynchronous transfer that will update the barrier's txn_count field asynchronously as bytes arrive in shared memory; a downstream mbarrier.try_wait.parity.shared (emitted by the consumer-side pattern, not by the load itself) is what gates the consumer on both arrival and transaction-byte completion. The three are independent instructions tied together only by the shared barrier pointer. See mbarrier State Machine — Kinds: Ordinary, Transaction, Cluster for the TMA-transaction kind's full state-machine view.
⚡ QUIRK — failed transactions are silent UB
If a TMA load fails mid-flight (out-of-bounds coordinate, malformed descriptor, multicast mask referencing a CTA outside the cluster), the hardware does not raise an exception, does not signal the barrier, and does not abort the kernel. The transaction-byte count simply never reaches expected_txn, and the consumer's mbarrier.try_wait.parity spins forever (or, with a non-zero ns timeout, returns failure). No diagnostic surfaces from MLIR, LLVM, or ptxas; the failure mode is a hang. Reimplementations that assume any kind of error reporting from the load itself will misdiagnose this class of bug as a barrier mis-init. The only defense is the descriptor verifier; see TileAS TMA and Memops Family — Descriptor Builders and Verifiers for the catalog.
Stage 5: NVPTX MIR
The NVPTX backend's instruction selector (ISelDAG and MatcherTable) consumes the LLVM intrinsic and produces a MachineFunction instruction. The TMA family of opcodes is a set of CP_ASYNC_BULK_TENSOR_*_* machine instructions, one per (rank, mode, destination, options) tuple. For the 2D tile load with mbarrier completion, the opcode is CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER.
bb.loop:
; --- mbarrier arrives with expected transaction byte count
MBARRIER_ARRIVE_EXPECT_TX_SHARED %mbar_a:b64, 32768
; --- TMA load: shared destination, global source via descriptor, mbarrier completion
CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER
%a_smem_dst:b64, ; destination shared address (b64 SMEM ptr)
%desc_a:b64, ; descriptor address (b64 global ptr)
%bk:b32, %bm:b32, ; coordinates, inner-axis first
%mbar_a ; completion barrier (b64 SMEM ptr)
Three observations matter at MIR level. First, the opcode encodes the address spaces in its name — _SHARED_CLUSTER_GLOBAL_ selects the variant whose destination is SMEM, whose source is global via descriptor, and whose completion handshake is the cluster-scope mbarrier transaction. A 1D variant would have _1D_ in the slot; an im2col variant would have _IM2COL_; a multicast variant would have _MULTICAST_. Each is a distinct opcode in the NVPTX .td files, picked up by the AsmPrinter table to render the corresponding PTX modifier set. Second, the MBARRIER_ARRIVE_EXPECT_TX_SHARED instruction is the in-flight expect_tx publish: it writes 32768 into the barrier's expected_txn slot. The number is a literal immediate at this level, not a register; the rewriter has folded it from the tx_count = 32768 attribute that originated in nv_tileas.async.tiled_tma_load. Third, the AsyncToken is gone — the LLVM-side i32 zero placeholder dies during instruction selection because no MIR opcode consumes it.
The transaction-byte count has now flowed through five levels of representation: implicit shape-and-element-type in cuda_tile, still implicit in nv_tileaa, explicit as tx_count = 32768 : i32 attribute in nv_tileas, explicit as i32 32768 operand to llvm.nvvm.mbarrier.arrive.expect_tx.shared in LLVM IR, and explicit as immediate operand 32768 to MBARRIER_ARRIVE_EXPECT_TX_SHARED in MIR.
Stage 6: PTX Text
The AsmPrinter (AsmPrinter and Per-SM Windows) walks the MachineFunction and renders each instruction. The 2D TMA load with mbarrier completion prints as cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes, the canonical Hopper TMA tile-load mnemonic.
// ---- mbarrier publishes the 32768-byte expectation
mbarrier.arrive.expect_tx.shared.b64 _, [%rd_mbar], 32768;
// ---- TMA bulk-tensor 2D load
cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes
[%rd_smem_dst], // destination: SMEM address of tile
[%rd_desc_a, {%r_bk, %r_bm}], // source: descriptor + coords (inner-axis first)
[%rd_mbar]; // completion barrier address
The mnemonic encodes five independent decisions. cp.async.bulk.tensor is the family. 2d is the descriptor rank. shared::cluster.global is the address-space pair: destination in CTA-scope shared memory visible to the cluster, source in global memory via descriptor. tile is the descriptor mode (versus im2col, gather4, etc.). mbarrier::complete_tx::bytes is the completion mechanism, the bytes-based transaction barrier that pairs with mbarrier.arrive.expect_tx and mbarrier.try_wait.parity. Each modifier maps back to a specific attribute that traveled from nv_tileas.async.tiled_tma_load through the LLVM intrinsic name into the MIR opcode suffix and finally into the printed mnemonic.
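The modifier-by-modifier reading above can be checked mechanically when walking a PTX dump. The Python sketch below is a hypothetical helper (`decode_tma_mnemonic` is not a tileiras function); it assumes the modifiers appear in the order shown in this walkthrough and relies on `::` never containing a dot.

```python
def decode_tma_mnemonic(mnemonic):
    """Split a bulk-tensor load mnemonic into the decisions it encodes."""
    fields = mnemonic.split(".")
    family = ".".join(fields[:4])  # cp.async.bulk.tensor
    rank, dst, src, mode, completion = fields[4:9]
    return {
        "family": family,
        "rank": rank,
        "address_spaces": f"{dst}.{src}",
        "mode": mode,
        "completion": completion,
    }
```

Any trailing option modifiers after the completion mechanism are ignored by this sketch; a stricter checker would reject or classify them.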
The coordinate operands {%r_bk, %r_bm} are inner-axis-first; the original nv_tileas form was outer-axis-first [%bm, %bk]. The reversal happened in the LLVM lowering, propagated through MIR, and surfaces here at print time.
The transaction-byte count 32768 is the literal immediate to mbarrier.arrive.expect_tx. The load itself does not carry a byte-count operand — the load's job is to issue the transfer and have the hardware update the barrier's txn_count field; the byte expectation was set out-of-band by the expect_tx instruction.
⚡ QUIRK — mbarrier::complete_tx::bytes is a load modifier, not an mbarrier modifier
The mbarrier::complete_tx::bytes qualifier appears in the cp.async.bulk.tensor.* mnemonic, not in the mbarrier.* mnemonic. It selects the transaction-completion behaviour of the bulk-tensor instruction (the load updates an mbarrier's txn_count field as bytes arrive) and does not describe how the barrier itself is built or armed. The barrier's expected_txn is established by a separate mbarrier.arrive.expect_tx instruction issued before the load. Reimplementations that emit only the bulk-tensor instruction and rely on the mnemonic to "publish" the byte count produce a barrier whose expected_txn stays at zero, so try_wait.parity returns success immediately (the txn_count >= expected_txn check is vacuous at zero), and the consumer reads garbage from a still-uncopied destination.
Transaction-Byte Count: Cross-Stage Flow
The single integer 32768 is the canonical thread tying the load to the consumer. It is computed exactly once — 128 rows × 128 cols × 2 bytes/element = 32 768 bytes — but lives under different names and at different levels of abstraction at every stage. Its journey:
| Stage | Form | Carrier | Source |
|---|---|---|---|
| 1 — cuda_tile | implicit | tile shape <128x128xbf16> | derived at lower time |
| 2 — nv_tileaa | implicit | tensor shape <128x128xbf16> plus CopyAtom witness | derived at lower time |
| 3 — nv_tileas | explicit attr | tx_count = 32768 : i32 on async.tiled_tma_load | computed by LowerTMALoadStoreToAsync phase 5 |
| 4 — LLVM IR | explicit i32 operand | argument to llvm.nvvm.mbarrier.arrive.expect_tx.shared | folded from tx_count attribute |
| 5 — NVPTX MIR | explicit immediate | operand to MBARRIER_ARRIVE_EXPECT_TX_SHARED | selected through ISelDAG |
| 6 — PTX text | explicit literal | last operand to mbarrier.arrive.expect_tx.shared.b64 | rendered by AsmPrinter |
The transition from implicit (stages 1–2) to explicit (stages 3–6) happens in phase 5 of LowerTMALoadStoreToAsync, the same phase that materializes the async TMA op itself. Until that phase runs, the byte count exists only as a derivable consequence of the tile shape and the element type; after that phase runs, it is a first-class attribute that travels verbatim through every subsequent lowering. The consumer side reads it through the mbarrier's expected_txn field, exactly as the mbarrier State Machine try_wait.parity predicate documents.
⚡ QUIRK — tx_count matches the naive byte count under swizzle, but not under im2col or multicast
The byte count published to expect_tx is the number of bytes the TMA hardware will actually deposit into shared memory, which equals the tile size in bytes when the descriptor's swizzle mode is none. When the swizzle mode is 128B, 64B, or 32B, the hardware still deposits the unswizzled byte count: the data after swizzle reordering occupies the same number of bytes, even though its layout in SMEM is permuted. A reimplementer who computes tx_count = tile_rows * tile_cols * sizeof(elem) is correct for tiled mode regardless of swizzle, but the same formula does not generalize to im2col mode (where padded rows expand the byte count beyond rows × cols × sizeof) or to multicast (where the byte count is per receiving CTA, not aggregated). The phase-5 computation in LowerTMALoadStoreToAsync reads the atom's descriptor metadata to get the right answer for each mode.
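For tiled mode only, the byte count is the product of the box dimensions and the element width. The Python sketch below covers just that case; `tiled_tx_count` is a hypothetical helper, and it deliberately does not model im2col padding or multicast, which the text says need atom metadata.

```python
def tiled_tx_count(box_shape, elem_bits):
    """Bytes deposited by one tiled-mode TMA load (swizzle does not change this)."""
    total_bits = elem_bits
    for dim in box_shape:
        total_bits *= dim
    if total_bits % 8:
        raise ValueError("tile size is not a whole number of bytes")
    return total_bits // 8
```

With the walkthrough's 128x128 bf16 tile this gives 128 × 128 × 2 = 32768 bytes, the value carried by the tx_count attribute.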
Stage 7: SASS
Past the PTX text, the path leaves tileiras and enters ptxas's territory through the boundary documented in ptxas Handoff Protocol. The assembler renders the cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes mnemonic into the SASS instruction stream — instruction encodings, register allocation, and scheduling are entirely ptxas's decision. The byte count 32768 becomes part of the encoded immediate operand of the SASS MBARRIER instruction; the load itself decomposes into a sequence of UTMALDG and related SASS instructions that issue the transfer and arm the L1 cache fill paths.
That layer is out of scope for tileiras's documentation. The wiki covers the path up to PTX text; everything below the handoff is ptxas territory, including the SASS opcode encoding for the bulk-tensor family and the SM scheduling decisions that interleave the asynchronous TMA issue against the warp group's compute.
Coordinate Reversal in Detail
The coordinate operands flip between outer-axis-first and inner-axis-first ordering exactly once in the cascade — between Stage 3 (nv_tileas) and Stage 4 (LLVM IR). Tracking which axis ordering each stage uses is necessary for any reimplementation that wants to walk an IR dump and validate the coordinate operand order without re-reading the source patterns.
The convention at each stage:
| Stage | Coordinate order | Why |
|---|---|---|
| 1 — cuda_tile.load_view_tko | outer-axis-first [%bm, %bk] | matches Python-style tile indexing in the frontend bytecode |
| 2 — nv_tileaa.tiled_load | outer-axis-first [%bm, %bk] | preserved verbatim through Part-B rewrite for SSA-edge stability |
| 3 — nv_tileas.async.tiled_tma_load | outer-axis-first [%bm, %bk] | layout-assignment writes coordinates in row-major order |
| 4 — nvvm.cp.async.bulk.tensor.shared.cluster.global.2d | inner-axis-first [%bk, %bm] | matches the PTX instruction operand order |
| 5 — CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER | inner-axis-first [%bk, %bm] | survives ISel verbatim |
| 6 — cp.async.bulk.tensor.2d.shared::cluster.global.tile… | inner-axis-first [%bk, %bm] | PTX consumes inner-axis-first |
The reversal happens inside the async.tiled_tma_load → nvvm.cp.async.bulk.tensor.* pattern in tileas-to-llvm — async.tiled_tma_load. The pattern walks the operand list and emits the operands in reverse coordinate order as part of the intrinsic call construction. Im2col variants get the same reversal applied to their coordinate vector, with the K-offset prefix preserved at the head of the operand list.
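For the plain tiled case the reversal is a straight list reversal. The Python sketch below is illustrative only: `to_intrinsic_coord_order` is a hypothetical name, coordinates are modelled as strings, and the im2col K-offset handling mentioned above is not modelled.

```python
def to_intrinsic_coord_order(tileas_coords):
    """Map outer-axis-first TileAS coordinates to inner-axis-first intrinsic order."""
    return list(reversed(tileas_coords))
```

Applying the function twice restores the original order, which is why a second, accidental reversal elsewhere in a pipeline silently undoes the first one.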
Verifier Surface at Each Stage
Each stage's verifier catches a different class of malformed load. The same operation must satisfy every verifier on its path: a load that passes the Stage 1 and Stage 2 verifiers can still fail at Stage 3b, where the descriptor builder runs its 128-byte alignment check. The catalog by stage:
| Stage | Verifier | Sample diagnostics |
|---|---|---|
| 1 — cuda_tile | cuda_tile.load_view_tko verifier | "all dimensions must be positive constants, got …", "all dimensions must be powers of two, got …", "tile would exceed the maximum of …" |
| 2 — nv_tileaa | nv_tileaa.tiled_load verifier (inherited tile-dim invariants) | "expects N coordinates, but got M", "expects CoordType is same as memref index type", "view elementType not equal with tensor element type: …" |
| 3a — nv_tileas.tiled_load | inherited from TileAS shared verify_tiled_memop | "unsupported mem_semantic: acquire" (loads forbid acquire), "incorrect number of in_bounds elements: expected …" |
| 3b — nv_tileas.make_tiled_tma_desc | descriptor builder verifier | "expected tma descriptor pointer to have alignment at least 128", "tma boxDims[0] * elemTypeBitWidth is not a multiple of 16 bytes", "smem layout is not TMA compatible", "TmaLoad only support zero padding now" |
| 3c — nv_tileas.async.tiled_tma_load | post-rewrite atom verifier | "expect a tma_load atom type", "tmaBoxDim and atomBoxDim length mismatch", "mcast is not supported for TMA load with less than 128bytes per atom" |
| 4 — LLVM IR | shared TypeConverter + intrinsic-arity check | (catch-all for arity mismatches against llvm.nvvm.cp.async.bulk.tensor.* declarations) |
| 5 — NVPTX MIR | LLT-typed operand check at ISel | (rejects malformed b64/b32 operand bundles before they reach the AsmPrinter) |
| 6 — PTX text | ptxas directive verifier | (out of scope; documented under ptxas Handoff Protocol) |
The 128-byte descriptor alignment check at Stage 3b is the one most worth flagging: TMA descriptors must be 128-byte aligned, which is why the kernel ABI marks descriptor-pointer arguments as grid_constant (placed in .param, naturally 128-byte aligned) rather than .global (only 16-byte aligned in the general case). A reimplementation that drops the grid_constant attribute keeps the IR well-formed through every verifier, but ptxas rejects the resulting cp.async.bulk.tensor load with an alignment diagnostic at SASS-generation time.
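Two of the Stage 3b checks are simple arithmetic and can be sketched directly. The Python below is a hedged model: `check_tma_descriptor` and its parameters are hypothetical, only the diagnostic strings come from the table above, and the exact conditions the real verifier evaluates are assumed from the wording of those strings.

```python
def check_tma_descriptor(desc_ptr_alignment, box_dim0, elem_bits):
    """Return the Stage-3b style diagnostics a descriptor would trigger."""
    errors = []
    if desc_ptr_alignment < 128:
        errors.append("expected tma descriptor pointer to have alignment at least 128")
    if (box_dim0 * elem_bits) % (16 * 8) != 0:  # 16 bytes expressed in bits
        errors.append("tma boxDims[0] * elemTypeBitWidth is not a multiple of 16 bytes")
    return errors
```

A 128-element bf16 box edge with a 128-byte-aligned descriptor passes both checks; a 4-element bf16 edge (8 bytes) with a 16-byte-aligned pointer fails both.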
Address-Space Trail
Every operand of the load lives in a specific GPU address space, and the address-space attribution flows through the cascade in a different shape at each level. The trail for this walkthrough's operands:
| Operand | Address space at each stage |
|---|---|
| Source tensor %A | space-1 (global) at every stage |
| TMA descriptor %desc_a | space-1 (global), with grid_constant mark routing it to .param |
| Destination tile (SMEM) | space-3 (shared) starting at Stage 3, allocated from global_smem |
| Mbarrier slot %mbar_a | space-3 (shared), 64-bit aligned, lives in global_smem |
| Loop iterator / coordinates | space-0 (generic register) |
The descriptor's address-space story is the subtlest of the five. make_tiled_tma_desc produces a value typed !nv_tileas.tma_desc<…>, an opaque type that the Shared LLVM Type Converter lowers to !llvm.ptr<1> — an LLVM global-space pointer. The descriptor itself lives in global memory at launch time, but the pointer to it is passed through the kernel ABI in .param space. The cute_nvgpu.grid_constant argument attribute is what tells the codegen "this argument is a .param slot containing a pointer that, when dereferenced, lands in global memory." The downstream cute-to-llvm lowering at sub_1698C20 lifts that attribute to nvvm.grid_constant, which is the form ptxas reads for the .param placement decision.
The mbarrier slot's address-space story is the simplest: it is always .shared. Every member of the 21-op NVVM family that this walkthrough exercises (nvvm.mbarrier.init.shared, nvvm.mbarrier.arrive.expect_tx.shared, nvvm.mbarrier.try_wait.parity.shared) takes the .shared variant, because mbarrier hardware lives in CTA-scope SMEM. The non-.shared variants exist for cluster-scope mbarriers reached through nvvm.mapa (peer-CTA address translation), but those are out of scope for a single-CTA tile load.
Mbarrier Slot Allocation and Reuse
The nv_tileas.alloc_mbarrier op produced in Stage 3 carves a 64-bit barrier out of the kernel's SMEM arena. The buffer-assignment pass documented in Buffer Assignment and Named-Barrier Binding is what decides the offset; for this walkthrough the slot lives at a fixed offset past the tile-storage region of global_smem. The barrier is initialized once at kernel prologue with nvvm.mbarrier.init.shared — the arrive_count matches the number of producer arrivals the load contributes (one, for a single-issue load), and the expected_txn field starts at zero and is set by the first mbarrier.arrive.expect_tx that publishes against it.
In a pipelined kernel — the steady-state shape documented in DSL to PTX End-to-End — Stage 3: nv_tileas IR — the same mbarrier slot is reused across multiple iterations under different phase parities. The phase bit on the barrier flips on every completion, so iteration i and iteration i+1 see opposite parities on the same slot; the consumer's try_wait.parity reads the iteration-derived parity from the loop's pipeline iterator state. For a depth-D pipeline the slot at stage index s carries phase (i / D) & 1 on iteration i, and the producer's expect_tx flips the phase implicitly through the arrive-with-expect-tx machinery the mbarrier State Machine — State Machine page documents.
This is why the load alone does not need to specify a phase: the load increments the in-flight transaction byte count, and mbarrier_arrive flips the phase when pending reaches zero — exactly one arrival is enough for the single-producer load. The consumer reads the iteration phase from its pipeline iterator and asks try_wait.parity for that phase. The phase invariant is what makes a single barrier slot reusable across iterations without ABA hazards.
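The parity rule can be written down in two lines. In the Python sketch below, the phase formula `(i / D) & 1` is the one stated above; the slot formula `i % D` is an assumption (the usual ring-buffer mapping), and `slot_and_parity` is a hypothetical helper name.

```python
def slot_and_parity(iteration, depth):
    """Return (stage index, expected phase parity) for a depth-D mbarrier ring."""
    if depth <= 0:
        raise ValueError("pipeline depth must be positive")
    slot = iteration % depth            # assumed ring mapping of iteration to slot
    parity = (iteration // depth) & 1   # phase formula from the text
    return slot, parity
```

With depth 1, consecutive iterations reuse slot 0 under alternating parities, which matches the single-slot case described above.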
Capability Cross-Check
The walkthrough above targets sm_90a. The same cuda_tile.load_view_tko would produce a different cascade on every other supported architecture; the table below summarises the divergence so a reimplementer can predict what to expect under a different --compute-capability value.
| Compute capability | CopyAtom witness | Stage-3 op | Stage-6 PTX mnemonic |
|---|---|---|---|
| sm_70 (Volta) | sm70_ldg_128_bf16 | nv_tileas.tiled_load (no rewrite) | ld.global.nc.v4.b32 (per-thread vectorised) |
| sm_75 (Turing) | sm70_ldg_128_bf16 | same | ld.global.nc.v4.b32 |
| sm_80 (Ampere) | sm80_cp_async_4_bf16 | nv_tileas.async.cp_async | cp.async.ca.shared.global.4 |
| sm_89 (Ada) | sm80_cp_async_4_bf16 | same | same |
| sm_90a (Hopper) | sm90_tma_load_2d_bf16 | nv_tileas.async.tiled_tma_load | cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes |
| sm_100 (Blackwell) | sm90_tma_load_2d_bf16 (inherited) | nv_tileas.async.tiled_tma_load | same TMA mnemonic; consumer is tcgen05.mma not WGMMA |
| sm_120 (consumer Blackwell) | sm90_tma_load_2d_bf16 | same | same |
The transition between sm_89 and sm_90a is where the TMA descriptor, the mbarrier transaction-byte machinery, and the asynchronous cp.async.bulk.tensor instruction first enter the cascade. Below that boundary the load is synchronous (ldg) or coroutine-style async with per-thread cp.async; above that boundary it is the bulk-tensor instruction the rest of this page traces. See Matmul Progression by SM for the parallel progression on the WGMMA / tcgen05 consumer side.
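The table above can be transcribed into a lookup for quick checks against an IR dump. The Python below is only a restatement of the table rows (with "same" expanded); the dictionary and `uses_tma` are hypothetical names and carry no information beyond what the table states.

```python
# (CopyAtom witness, Stage-3 op) per target, transcribed from the table above.
COPY_PATH_BY_SM = {
    "sm_70": ("sm70_ldg_128_bf16", "nv_tileas.tiled_load"),
    "sm_75": ("sm70_ldg_128_bf16", "nv_tileas.tiled_load"),
    "sm_80": ("sm80_cp_async_4_bf16", "nv_tileas.async.cp_async"),
    "sm_89": ("sm80_cp_async_4_bf16", "nv_tileas.async.cp_async"),
    "sm_90a": ("sm90_tma_load_2d_bf16", "nv_tileas.async.tiled_tma_load"),
    "sm_100": ("sm90_tma_load_2d_bf16", "nv_tileas.async.tiled_tma_load"),
    "sm_120": ("sm90_tma_load_2d_bf16", "nv_tileas.async.tiled_tma_load"),
}


def uses_tma(sm):
    """True when the table routes this target through the bulk-tensor path."""
    return COPY_PATH_BY_SM[sm][1] == "nv_tileas.async.tiled_tma_load"
```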
Consumer-Side Pairing
The load alone is half the story. Once the producer issues cp.async.bulk.tensor and mbarrier.arrive.expect_tx, the consumer needs to wait until the destination tile is ready. That wait is a separate operation — nv_tileas.async.pipeline.consumer_wait at the TileAS level, lowering to nvvm.mbarrier.try_wait.parity.shared at the LLVM level, and surfacing as mbarrier.try_wait.parity.shared.b64 in PTX. The consumer wait is documented in detail in DSL to PTX End-to-End — Stage 4: LLVM IR with NVVM intrinsics and in mbarrier State Machine — Phase Parity.
The wait's release predicate is exactly what the state-machine table in mbarrier State Machine — State Machine shows: phase == want_phase && pending == arrive_count && txn_count >= expected_txn. The third clause is what the transaction-byte machinery in this walkthrough sets up — the load updates txn_count asynchronously, and the wait releases only once that field crosses the expected_txn = 32768 threshold the expect_tx instruction published.
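The release predicate is small enough to model directly. The Python sketch below encodes the three clauses exactly as quoted above; `MbarrierState` and `wait_releases` are hypothetical names, and the model says nothing about how the hardware updates the fields.

```python
from dataclasses import dataclass


@dataclass
class MbarrierState:
    phase: int
    pending: int
    arrive_count: int
    txn_count: int
    expected_txn: int


def wait_releases(state, want_phase):
    """Evaluate the try_wait.parity release predicate quoted in the text."""
    return (
        state.phase == want_phase
        and state.pending == state.arrive_count
        and state.txn_count >= state.expected_txn
    )
```

Under this model a barrier whose expected_txn was never published (left at zero) satisfies the third clause trivially, which is the early-release hazard the walkthrough warns about.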
The WGMMA consumer that typically reads the loaded tile is documented in WGMMA Emission Protocol. The end-to-end producer/consumer pipeline that wires the TMA load to the WGMMA consumer through a multi-stage mbarrier ring is documented in DSL to PTX End-to-End — Stage 3: nv_tileas IR.
Reimplementation Checklist
Anyone reproducing a one-shot TMA load from a higher-level IR should walk the same six gates this page traces, in order. The checklist mirrors the cascade:
- Pick a CopyAtom whose interface tag (TmaAtomTypeInterface) marks it as a TMA candidate. Anything else stays synchronous.
- Verify the box-dim invariants: leading dim's bit-width must be a 16-byte multiple, descriptor pointer 128-byte aligned, smem layout TMA-compatible. The descriptor builder verifier catches every violation.
- Compute tx_count from atom metadata, not from the naive rows × cols × sizeof(elem) shortcut. Im2col and multicast change the formula.
- Reserve exactly one mbarrier slot per pipeline stage, not per producer or per atom. Multiple atoms publishing to the same slot is normal; the slot's expected_txn is the sum.
- Reverse coordinate order at the TileAS-to-LLVM boundary and only there. Earlier reversals corrupt scheduling decisions; later reversals corrupt the PTX operand order.
- Pair every load with a downstream consumer wait on the same mbarrier and the same parity. The wait is not a property of the load; it is a separate operation, emitted by a separate pattern, against a separately-tracked phase.
Skipping any of these six steps yields a kernel that fails a verifier mid-pipeline, fails ptxas at SASS time, or hangs forever at runtime. The QUIRK callouts above flag the most error-prone of the six.
Two further constraints are worth flagging because they are easy to miss when working backward from a PTX dump. First, the nv_tileaa.kernel_spec attribute must be present on the function before LowerTMALoadStoreToAsync runs — its absence fires "LowerTMALoadStoreToAsync: missing or invalid KernelSpecAttr on function" and skips every TMA rewrite, leaving the IR with tiled_load ops that the downstream verifier rejects. Second, the function-level nv_tileas.num-host-tmas and nv_tileas.num-device-tmas counters must agree with the actual tmaIdx range; "tmaIdx exceed tmaHostNum." and "tmaIdx exceed tmaDeviceNum." are emitted by the ABI verifier when they don't.
Cross-References
DSL to PTX End-to-End is the kernel-wide walkthrough this page narrows; it shows the same kernel at every stage with all operations in place and traces the full producer/consumer pipeline through scheduling.
mbarrier State Machine is the canonical reference for the barrier object's state machine, the 21-op NVVM family that touches it, and the three barrier kinds (ordinary, TMA-transaction, cluster-transaction); this walkthrough exercises the TMA-transaction kind end-to-end.
WGMMA Emission Protocol is the consumer-side companion: the four-op wgmma.fence / mma_async / commit_group / wait_group sequence runs once the try_wait.parity on the load's mbarrier releases.
TileAS TMA and Memops Family covers the eight-phase LowerTMALoadStoreToAsync pass, descriptor builders, ABI separation between host and device descriptors, and the full diagnostic catalog from the verifier.
cuda_tile to nv_tileaa documents the first lowering stage (Stage 1 → Stage 2 in this walkthrough), the three-populator structure, and the CopyAtom-attaching layout-assignment pre-pass.
nv_tileaa to nv_tileas covers the alias-aware-to-assembler-near rewrite (Stage 2 → Stage 3) and the witness hand-off shape that preserves the CopyAtom attribute verbatim.
nv_tileas to LLVM is the terminal MLIR-side lowering (Stage 3 → Stage 4), with the five-step TMA-load rewrite, the coordinate reversal, and the nine-phase body conversion.
TMA, Tensormap, and cp.async.bulk is the codegen-side reference for the full cp.async.bulk.tensor.* mnemonic family, descriptor mutator ABI, and proxy-fence rules.
TMA Atoms catalogues the cute_nvgpu.tma_atom witness shapes for every supported (rank, mode, element-type, swizzle, multicast) tuple, and the make_exec_tma binding step that pairs each atom with its mbarrier.
SM-Tier Roster and Copy Atom Registry is the registry the layout-assignment pre-pass consults to pick sm90_tma_load_2d_bf16 over the alternatives.
Matmul Progression by SM and Capability Matrix explain why the lowering chose the sm_90a TMA path; on Ampere or earlier the same cuda_tile.load_view_tko would lower to cp.async or plain ldg with no TMA descriptor, no mbarrier transaction kind, and no tx_count attribute at all.
DSL to PTX End-to-End
Abstract
The tileiras wiki documents each stage of the MLIR-to-PTX cascade on its own page. A reader following one kernel from a Triton-style frontend down to emitted PTX would otherwise have to traverse the per-stage pages and reconstruct the IR shape at every transition. This page is the walkthrough: a representative GEMM kernel rendered at every level of the pipeline, with each transition annotated by the pass that produced it. The per-stage canonical pages remain authoritative for pass internals, fold rules, and verifier contracts; this page focuses on the IR shape continuity that ties them together.
The kernel is a fused D = A * B^T + C operation targeting sm_90a. Inputs A and B are tile<128x64xf16> blocks staged through TMA into shared memory; C and D are tile<128x128xf32> blocks at the same (block_m, block_n) output coordinate. The walkthrough follows one steady-state iteration of the K loop; descriptor construction, prologue, and epilogue are elided in favor of the producer/consumer body the scheduler operates on.
The Kernel
A Triton-style DSL surface mirrors the public cuda_tile contract: structured control flow, tile-shaped SSA values, partition-view memory access, and tile-granular MMA. The illustrative source below is what a frontend constructs before lowering — the syntax is not real Triton, but the abstraction level is the same.
@kernel
def gemm(A: ptr<f16>, B: ptr<f16>, C: ptr<f32>, D: ptr<f32>,
         M: i32, N: i32, K: i32):
    block_m = program_id(0)
    block_n = program_id(1)
    a_view = make_partition_view(A, [M, K], tile=(128, 64), dim_map=[0, 1])
    b_view = make_partition_view(B, [N, K], tile=(128, 64), dim_map=[0, 1])
    c_view = make_partition_view(C, [M, N], tile=(128, 128), dim_map=[0, 1])
    d_view = make_partition_view(D, [M, N], tile=(128, 128), dim_map=[0, 1])
    acc = zeros(tile<128x128xf32>)
    for k in range(0, K, 64):
        a_tile = load(a_view, (block_m, k // 64))    # tile<128x64xf16>
        b_tile = load(b_view, (block_n, k // 64))    # tile<128x64xf16>
        acc = mmaf(a_tile, b_tile, acc)              # tile<128x128xf32>
    c_tile = load(c_view, (block_m, block_n))
    d_tile = addf(acc, c_tile)
    store(d_view, (block_m, block_n), d_tile)
The frontend serialises this as cuda_tile bytecode and hands it to tileiras. Everything below this point is internal IR.
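For orientation, the arithmetic the kernel performs can be written as a plain-Python reference model. This is a minimal sketch of the semantics only (D = A * B^T + C accumulated over K in fixed-width steps); the function name gemm_reference and the small test matrices are illustrative assumptions, and nothing here reflects how tileiras tiles, schedules, or lowers the work.

```python
# Reference model of the walkthrough kernel's arithmetic.
# Hypothetical helper for illustration; not part of any tileiras API.

def gemm_reference(A, B, C, k_tile=64):
    """A is M x K, B is N x K (the product uses B transposed), C is M x N."""
    M, K = len(A), len(A[0])
    N = len(B)
    acc = [[0.0] * N for _ in range(M)]            # mirrors acc = zeros(...)
    for k0 in range(0, K, k_tile):                 # mirrors the K loop
        k1 = min(k0 + k_tile, K)
        for i in range(M):
            for j in range(N):
                acc[i][j] += sum(A[i][k] * B[j][k] for k in range(k0, k1))
    # Epilogue: add C to the accumulator to form D.
    return [[acc[i][j] + C[i][j] for j in range(N)] for i in range(M)]


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2 x 3
B = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]     # 2 x 3, consumed as B^T
C = [[10.0, 20.0], [30.0, 40.0]]           # 2 x 2
D = gemm_reference(A, B, C, k_tile=2)
print(D)
```

Splitting K into chunks does not change the result; it only changes the order in which partial products reach the accumulator, which is the property the pipelined stages below rely on.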
Stage 1: cuda_tile IR
The first IR the compiler sees is cuda_tile itself — the only public dialect in the cascade and the input contract documented in cuda_tile Overview. Tile values are shaped SSA primitives, memory access rides on partition_view operands with explicit token ordering, and mmaf describes intent without committing to an MMA atom.
cuda_tile.module {
cuda_tile.entry @gemm(%A: !cuda_tile.ptr<f16>, %B: !cuda_tile.ptr<f16>,
%C: !cuda_tile.ptr<f32>, %D: !cuda_tile.ptr<f32>,
%M: i32, %N: i32, %K: i32) {
%tok0 = cuda_tile.make_token : !cuda_tile.token
%bm = cuda_tile.get_tile_block_id { axis = 0 : i32 } : i32
%bn = cuda_tile.get_tile_block_id { axis = 1 : i32 } : i32
%a_view = cuda_tile.tensor_view %A, shape = [%M, %K], stride = [%K, 1]
: !cuda_tile.tensor_view<128x64xf16>
%b_view = cuda_tile.tensor_view %B, shape = [%N, %K], stride = [%K, 1]
: !cuda_tile.tensor_view<128x64xf16>
%c_view = cuda_tile.tensor_view %C, shape = [%M, %N], stride = [%N, 1]
: !cuda_tile.tensor_view<128x128xf32>
%d_view = cuda_tile.tensor_view %D, shape = [%M, %N], stride = [%N, 1]
: !cuda_tile.tensor_view<128x128xf32>
%a_part = cuda_tile.partition_view %a_view, tile = [128, 64], dim_map = [0, 1]
: !cuda_tile.partition_view<128x64xf16>
%b_part = cuda_tile.partition_view %b_view, tile = [128, 64], dim_map = [0, 1]
: !cuda_tile.partition_view<128x64xf16>
%zero = cuda_tile.constant dense<0.0> : !cuda_tile.tile<128x128xf32>
%c0 = arith.constant 0 : i32
%c64 = arith.constant 64 : i32
%acc_out = cuda_tile.for %k = %c0 to %K step %c64 iter_args(%acc = %zero)
-> !cuda_tile.tile<128x128xf32> {
%kt = arith.divsi %k, %c64 : i32
%a, %tok_a = cuda_tile.load_view_tko %a_part, [%bm, %kt], %tok0
: !cuda_tile.tile<128x64xf16>, !cuda_tile.token
%b, %tok_b = cuda_tile.load_view_tko %b_part, [%bn, %kt], %tok_a
: !cuda_tile.tile<128x64xf16>, !cuda_tile.token
%acc_n = cuda_tile.mmaf %a, %b, %acc { fastmath = "contract" }
: !cuda_tile.tile<128x64xf16>, !cuda_tile.tile<128x64xf16>,
!cuda_tile.tile<128x128xf32>
cuda_tile.yield %acc_n : !cuda_tile.tile<128x128xf32>
}
%c_part = cuda_tile.partition_view %c_view, tile = [128, 128], dim_map = [0, 1]
: !cuda_tile.partition_view<128x128xf32>
%d_part = cuda_tile.partition_view %d_view, tile = [128, 128], dim_map = [0, 1]
: !cuda_tile.partition_view<128x128xf32>
%c_tile, %tok_c = cuda_tile.load_view_tko %c_part, [%bm, %bn], %tok0
: !cuda_tile.tile<128x128xf32>, !cuda_tile.token
%d_tile = cuda_tile.addf %acc_out, %c_tile : !cuda_tile.tile<128x128xf32>
%tok_s = cuda_tile.store_view_tko %d_part, [%bm, %bn], %d_tile, %tok_c
: !cuda_tile.token
cuda_tile.return
}
}
Distinctive markers at this tier: tile types are !cuda_tile.tile<...>, memory ops carry the _tko token-ordered suffix, the K loop uses cuda_tile.for with explicit iter_args, and mmaf carries a fastmath attribute rather than an atom selection. The verifier contract enforces power-of-two tile dimensions and a 16-million-element ceiling per tile, both of which the 128x128xf32 accumulator satisfies.
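The two shape rules named above are easy to pre-check on the producer side. The sketch below is an assumption-laden illustration: check_tile_shape is a hypothetical helper, and the "16-million-element" ceiling is read here as 16 * 1024 * 1024; the authoritative limit and diagnostics live in the verifier documentation.

```python
# Producer-side pre-check for the tile-shape rules described in the text.
# MAX_TILE_ELEMENTS is an assumed reading of "16 million" (2**24).
MAX_TILE_ELEMENTS = 16 * 1024 * 1024


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def check_tile_shape(shape, max_elements=MAX_TILE_ELEMENTS):
    """Return a list of human-readable problems; empty means the shape passes."""
    problems = []
    total = 1
    for axis, dim in enumerate(shape):
        if not is_power_of_two(dim):
            problems.append(f"dim {axis} = {dim} is not a power of two")
        total *= dim
    if total > max_elements:
        problems.append(f"{total} elements exceeds ceiling {max_elements}")
    return problems


print(check_tile_shape((128, 128)))   # the accumulator tile from this page
print(check_tile_shape((96, 64)))     # fails the power-of-two rule
```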
Stage 2: nv_tileaa IR
ConvertCudaTileToTileAA rewrites every public operation into the alias-aware internal dialect. The three-populator structure documented in cuda_tile to tileaa drives the rewrite: Part A handles arithmetic and control flow, Part B handles memory and views, Part C specialises mmaf and the reductions. Tile types collapse to plain tensor<...>, token types become !nv_tileaa.mem_token, and pointer arithmetic becomes explicit through addptr and make_memref.
nv_tileaa.func @gemm(%A: !llvm.ptr<1>, %B: !llvm.ptr<1>,
%C: !llvm.ptr<1>, %D: !llvm.ptr<1>,
%M: i32, %N: i32, %K: i32) {
%tok0 = nv_tileaa.create_mem_token : !nv_tileaa.mem_token
%bm = nv_tileaa.get_program_id { axis = 0 : i32 } : i32
%bn = nv_tileaa.get_program_id { axis = 1 : i32 } : i32
%a_ref = nv_tileaa.make_memref %A, shape = [%M, %K], stride = [%K, 1],
space = #nv_tileaa.global
: !nv_tileaa.memref<?x?xf16>
%b_ref = nv_tileaa.make_memref %B, shape = [%N, %K], stride = [%K, 1],
space = #nv_tileaa.global
: !nv_tileaa.memref<?x?xf16>
%c_ref = nv_tileaa.make_memref %C, shape = [%M, %N], stride = [%N, 1],
space = #nv_tileaa.global
: !nv_tileaa.memref<?x?xf32>
%d_ref = nv_tileaa.make_memref %D, shape = [%M, %N], stride = [%N, 1],
space = #nv_tileaa.global
: !nv_tileaa.memref<?x?xf32>
%zero = nv_tileaa.constant_tensor dense<0.0> : tensor<128x128xf32>
%acc_out = scf.for %k = %c0 to %K step %c64
iter_args(%acc = %zero) -> tensor<128x128xf32> {
%off_a = nv_tileaa.addptr %a_ref, [%bm, %k]
: !nv_tileaa.memref<?x?xf16>
%a, %tok_a = nv_tileaa.tiled_load %off_a, %tok0
{ copy_atom = #cute.copy_atom<sm90_tma_load_2d_f16> }
: !nv_tileaa.memref<?x?xf16> -> tensor<128x64xf16>,
!nv_tileaa.mem_token
%off_b = nv_tileaa.addptr %b_ref, [%bn, %k]
: !nv_tileaa.memref<?x?xf16>
%b, %tok_b = nv_tileaa.tiled_load %off_b, %tok_a
{ copy_atom = #cute.copy_atom<sm90_tma_load_2d_f16> }
: !nv_tileaa.memref<?x?xf16> -> tensor<128x64xf16>,
!nv_tileaa.mem_token
%acc_n = nv_tileaa.dot %a, %b, %acc
{ input_precision = "tf32", fastmath = "contract" }
: tensor<128x64xf16>, tensor<128x64xf16>, tensor<128x128xf32>
-> tensor<128x128xf32>
scf.yield %acc_n : tensor<128x128xf32>
}
%off_c = nv_tileaa.addptr %c_ref, [%bm, %bn] : !nv_tileaa.memref<?x?xf32>
%c_tile, %tok_c = nv_tileaa.tiled_load %off_c, %tok0
{ copy_atom = #cute.copy_atom<sm90_tma_load_2d_f32> }
: !nv_tileaa.memref<?x?xf32> -> tensor<128x128xf32>,
!nv_tileaa.mem_token
%d_tile = arith.addf %acc_out, %c_tile : tensor<128x128xf32>
%off_d = nv_tileaa.addptr %d_ref, [%bm, %bn] : !nv_tileaa.memref<?x?xf32>
%tok_s = nv_tileaa.tiled_store %off_d, %d_tile, %tok_c
{ copy_atom = #cute.copy_atom<sm90_tma_store_2d_f32> }
: tensor<128x128xf32>, !nv_tileaa.memref<?x?xf32>, !nv_tileaa.mem_token
nv_tileaa.return
}
Three changes carry the most weight downstream. Tile types are now plain MLIR tensor<...>, which lets ordinary tensor passes and the shared LLVM TypeConverter see through them. Every memory operation produces or consumes a !nv_tileaa.mem_token, giving the scheduler an SSA representation of memory ordering. And every tiled_load/tiled_store carries a copy_atom witness attribute, picked from the SM-Tier Roster and Copy Atom Registry; that witness is what the next stage uses to select a concrete hardware copy primitive.
Stage 3: nv_tileas IR (after scheduling)
ConvertTileAAToTileAS keeps the alias-aware shape but rewrites memory and compute into operational forms the scheduler can reason about. The scheduling passes — modulo scheduler (Modulo Scheduler and Rau), buffer assignment, async-pipeline materialization — then turn the linear K loop into an explicit producer/consumer pipeline. After the TileAS pass family runs, the loop body is wrapped in an async.pipeline region with TMA-based producer loads, an mbarrier-coordinated handshake, and a WGMMA-based consumer body.
nv_tileas.func @gemm(...) attributes { nv_tileaa.kernel_spec = #ks } {
%desc_a = nv_tileas.make_tiled_tma_desc %a_ref, box = [128, 64],
atom = #cute_nvgpu.atom_copy_field_tmaload<load_2d_f16, swizzle_128B>
: !nv_tileas.tma_desc<128x64xf16>
%desc_b = nv_tileas.make_tiled_tma_desc %b_ref, box = [128, 64],
atom = #cute_nvgpu.atom_copy_field_tmaload<load_2d_f16, swizzle_128B>
: !nv_tileas.tma_desc<128x64xf16>
%smem_a = nv_tileas.alloc_tensor { stages = 3 : i32 }
: !nv_tileas.smem<3x128x64xf16>
%smem_b = nv_tileas.alloc_tensor { stages = 3 : i32 }
: !nv_tileas.smem<3x128x64xf16>
%pipe = nv_tileas.async.pipeline.create_pipeline
stages = 3, producer = #ag_p, consumer = #ag_c
: !nv_tileas.pipeline_token
%iter0 = nv_tileas.async.pipeline.create_iterator %pipe
: !nv_tileas.pipeline_iter<i32>
%acc_out, %iter_end = scf.for %k = %c0 to %K step %c64
iter_args(%acc = %zero, %iter = %iter0)
-> (tensor<128x128xf32>, !nv_tileas.pipeline_iter<i32>) {
// ---- producer agent: TMA bulk loads into stage-local SMEM
nv_tileas.async.pipeline.produce_one %pipe, %iter {
%ptok = nv_tileas.async.pipeline.producer_acquire %pipe, %iter
: !nv_tileas.producer_token
%ptok2 = nv_tileas.async.pipeline.producer_write %ptok, %iter {
nv_tileas.async.tiled_tma_load %desc_a, [%bm, %k], %smem_a, %iter
nv_tileas.async.tiled_tma_load %desc_b, [%bn, %k], %smem_b, %iter
nv_tileas.async.pipeline.yield
}
nv_tileas.async.pipeline.producer_commit %ptok2
nv_tileas.async.pipeline.yield
}
// ---- consumer agent: WGMMA reads the same stage
%acc_n = nv_tileas.async.pipeline.consume_one %pipe, %iter
consumer_idx = 0 : i32 {
%ctok = nv_tileas.async.pipeline.consumer_wait %pipe, %iter, 0
: !nv_tileas.consumer_token
%ctok2, %acc_loop = nv_tileas.async.pipeline.consumer_read %ctok, %iter {
%a_stage = nv_tileas.view %smem_a, %iter
: !nv_tileas.smem<128x64xf16>
%b_stage = nv_tileas.view %smem_b, %iter
: !nv_tileas.smem<128x64xf16>
%acc_w = nv_tileas.dot %a_stage, %b_stage, %acc
{ atom = #cute.mma_atom<sm90_wgmma_m64n128k16_f32_f16_f16> }
: !nv_tileas.smem<128x64xf16>, !nv_tileas.smem<128x64xf16>,
tensor<128x128xf32> -> tensor<128x128xf32>
nv_tileas.async.pipeline.yield %acc_w : tensor<128x128xf32>
}
nv_tileas.async.pipeline.consumer_release %ctok2
nv_tileas.async.pipeline.yield %acc_loop : tensor<128x128xf32>
}
%iter_n = nv_tileas.async.pipeline.inc_iter %iter
: !nv_tileas.pipeline_iter<i32>
scf.yield %acc_n, %iter_n
: tensor<128x128xf32>, !nv_tileas.pipeline_iter<i32>
}
// epilogue: load C, add, store D (TMA store)
%c_tile = nv_tileas.tiled_load %c_ref, [%bm, %bn]
{ atom = #cute.copy_atom<sm90_ldg_128_f32> }
: tensor<128x128xf32>
%d_tile = arith.addf %acc_out, %c_tile : tensor<128x128xf32>
%desc_d = nv_tileas.make_tiled_tma_desc %d_ref, box = [128, 128],
atom = #cute_nvgpu.atom_copy_field_tmastore<store_2d_f32, swizzle_128B>
: !nv_tileas.tma_desc<128x128xf32>
nv_tileas.async.tiled_tma_store %desc_d, [%bm, %bn], %d_tile
nv_tileas.return
}
The K loop is no longer a flat sequence of loads and an MMA. It is an async pipeline with three rotating stages, one producer agent owning the TMA loads, and one consumer agent owning the WGMMA. The pipeline iterator threads through scf.for via the type-propagation rule documented in nv_tileas Overview. Each MMA invocation carries a concrete sm_90a WGMMA atom; descriptor construction is a first-class operation with its own SSA result, not a hidden side effect of the load. The kernel-spec attribute on the function records numWarps, clusterDim, and per-stage SMEM size for the downstream LLVM lowering to lift onto nvvm.* discardable attributes.
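The stage rotation can be modelled sequentially to see why the producer can run at most three iterations ahead of the consumer. This is a simplified single-threaded sketch: the real agents run concurrently, and simulate_pipeline with its greedy "produce whenever a slot is free" policy is an illustrative assumption rather than the scheduler's actual behaviour.

```python
from collections import deque

# Sequential toy model of a multi-stage producer/consumer ring.
# Hypothetical helper; stage index = iteration % stages, as in the rotation above.


def simulate_pipeline(iterations, stages=3):
    full = deque()            # stages holding data that has not been consumed
    produced = consumed = 0
    peak_in_flight = 0
    log = []
    while consumed < iterations:
        if produced < iterations and len(full) < stages:
            stage = produced % stages
            full.append(stage)            # acquire + write + commit
            log.append(("produce", produced, stage))
            produced += 1
        else:
            stage = full.popleft()        # wait + read + release
            log.append(("consume", consumed, stage))
            consumed += 1
        peak_in_flight = max(peak_in_flight, len(full))
    return log, peak_in_flight


log, peak = simulate_pipeline(5, stages=3)
for event in log:
    print(event)
print("peak stages in flight:", peak)
```

In this model the consumer always sees stage `i % stages` for iteration `i`, which is the invariant the pipeline iterator carries through the loop.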
Stage 4: LLVM IR with NVVM intrinsics
ConvertTileASToLLVM (the nine-phase body conversion documented in tileas to LLVM) is the terminal MLIR-side lowering. Pipeline structure flattens into integer phase tokens and nvvm.mbarrier.* operations; the WGMMA region expands into the four-op fence/MMA/commit/wait protocol from WGMMA Emission Protocol; TMA loads expand into cp.async.bulk.tensor intrinsics. The kernel function picks up nvvm.reqntid, nvvm.cluster_dim, and nvvm.maxnreg attributes from the kernel-spec.
define void @gemm(ptr addrspace(1) %A, ptr addrspace(1) %B,
ptr addrspace(1) %C, ptr addrspace(1) %D,
i32 %M, i32 %N, i32 %K)
#0 !nvvm.kernel !1 {
entry:
; ---- TMA descriptor construction (one per operand, hoisted to entry)
%desc_a = alloca [128 x i8], align 64, addrspace(5)
call void @llvm.nvvm.cp.async.bulk.tensor.encode.2d(
ptr addrspace(5) %desc_a, ptr addrspace(1) %A,
i32 128, i32 64, i32 %K, i32 1, i32 1) ; box, stride, swizzle=128B
%desc_b = alloca [128 x i8], align 64, addrspace(5)
call void @llvm.nvvm.cp.async.bulk.tensor.encode.2d(
ptr addrspace(5) %desc_b, ptr addrspace(1) %B,
i32 128, i32 64, i32 %K, i32 1, i32 1)
; ---- shared-memory backing for the 3-stage pipeline
%smem_a = getelementptr inbounds i8,
ptr addrspace(3) @global_smem, i32 0
%smem_b = getelementptr inbounds i8,
ptr addrspace(3) @global_smem, i32 49152
; ---- mbarriers (one per stage, init by warp 0)
%mbar_full = getelementptr i8, ptr addrspace(3) @global_smem, i32 98304
call void @llvm.nvvm.mbarrier.init.shared(
ptr addrspace(3) %mbar_full, i32 1) ; thread-count arrival
br label %loop
loop:
%k = phi i32 [ 0, %entry ], [ %k_next, %loop ]
%stg = phi i32 [ 0, %entry ], [ %stg_next, %loop ]
%ph = phi i32 [ 0, %entry ], [ %ph_next, %loop ]
%acc0 = phi <128 x float> [ zeroinitializer, %entry ], [ %acc4, %loop ]
; (real lowering carries the accumulator as 16 lanes of <8 x float>,
; one per WGMMA atom slice; we elide the fragment split here.)
; ---- producer: TMA bulk load into smem_a[stg], smem_b[stg]
%smem_a_stg = getelementptr i8, ptr addrspace(3) %smem_a,
i32 %stg_off_a
%smem_b_stg = getelementptr i8, ptr addrspace(3) %smem_b,
i32 %stg_off_b
call void @llvm.nvvm.cp.async.bulk.tensor.shared.cluster.global.2d(
ptr addrspace(3) %smem_a_stg, ptr addrspace(5) %desc_a,
i32 %bm, i32 %k, ptr addrspace(3) %mbar_full)
call void @llvm.nvvm.cp.async.bulk.tensor.shared.cluster.global.2d(
ptr addrspace(3) %smem_b_stg, ptr addrspace(5) %desc_b,
i32 %bn, i32 %k, ptr addrspace(3) %mbar_full)
; ---- consumer wait: parity-encoded transaction barrier
%parity = and i32 %ph, 1
%arrived = call i1 @llvm.nvvm.mbarrier.try_wait.parity.shared(
ptr addrspace(3) %mbar_full, i32 %parity)
; ---- WGMMA region: fence, async MMAs across the K tile, commit, wait
call void @llvm.nvvm.wgmma.fence.sync.aligned()
%da = call i64 @llvm.nvvm.wgmma.descriptor.encode.smem(
ptr addrspace(3) %smem_a_stg, i32 2048, i32 0, i32 0, i32 1)
%db = call i64 @llvm.nvvm.wgmma.descriptor.encode.smem(
ptr addrspace(3) %smem_b_stg, i32 2048, i32 0, i32 0, i32 1)
%acc1 = call <32 x float>
@llvm.nvvm.wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16(
i64 %da, i64 %db, <32 x float> %acc0, i32 1, i32 1, i32 1)
; ... three more atom slices along N to cover the 128 output columns ...
call void @llvm.nvvm.wgmma.commit_group.sync.aligned()
call void @llvm.nvvm.wgmma.wait_group.sync.aligned(i32 0)
; ---- end-of-stage bookkeeping
%k_next = add i32 %k, 64
%stg_inc = add i32 %stg, 1
%stg_next = urem i32 %stg_inc, 3
%ph_next = xor i32 %ph, 1
%done = icmp uge i32 %k_next, %K
br i1 %done, label %epi, label %loop
epi:
; ---- C load, add, TMA store of D
...
ret void
}
attributes #0 = {
"nvvm.reqntid"="128,1,1"
"nvvm.cluster_dim"="2,1,1"
"nvvm.maxnreg"="168"
"nvvm.kernel"
}
What looked like a queue in nv_tileas is now a flat loop carrying a stg index, a parity bit, and a vector accumulator phi node. WGMMA descriptors are SSA i64 values produced by llvm.nvvm.wgmma.descriptor.encode.smem, packing the bit fields documented in WGMMA Emission Protocol — SMEM Descriptor Bit Layout. The kernel attributes — nvvm.reqntid=128, nvvm.cluster_dim=2, nvvm.maxnreg=168 — are the tileas-to-llvm Phase 3 translations of the nv_tileaa.kernel_spec block.
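The byte offsets in the listing (0, 49152, 98304) follow from the three-stage buffer shapes. The sketch below recomputes them; smem_layout is a hypothetical helper, and the per-stage offset formula (stage index times stage size) is an assumption, since the listing elides how %stg_off_a and %stg_off_b are derived.

```python
# Recompute the shared-memory offsets used in the Stage 4 listing.
F16_BYTES = 2


def smem_layout(stages=3, rows=128, cols=64, elem_bytes=F16_BYTES):
    per_stage = rows * cols * elem_bytes        # one 128x64xf16 tile
    smem_a = 0
    smem_b = smem_a + stages * per_stage        # B buffers follow A buffers
    mbar = smem_b + stages * per_stage          # barriers follow B buffers
    return {"per_stage": per_stage, "smem_a": smem_a,
            "smem_b": smem_b, "mbar": mbar}


def stage_offset(stage, per_stage):
    # Assumed layout: stages are packed back to back inside each buffer.
    return stage * per_stage


layout = smem_layout()
print(layout)
print([stage_offset(s, layout["per_stage"]) for s in range(3)])
```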
Stage 5: NVPTX MIR
The NVPTX backend selector (ISelDAG and MatcherTable) consumes the LLVM IR and produces a MachineFunction whose instructions are NVPTX target opcodes. Parameter loads become NVPTXISD::LoadParam SDNodes resolved into LD_PARAM_v*. TMA tensor copies become CP_ASYNC_BULK_TENSOR_* machine instructions. WGMMA becomes a WGMMA_MMA_ASYNC_* machine instruction that the AsmPrinter renders as the wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 mnemonic.
bb.0.entry:
liveins: $r0, $r1, $r2, $r3, $r4, $r5, $r6, $r7, $r8
; .param block for the kernel entry — emitted by the call-prototype
; printer, not by individual MIR instructions in the body
%rd0:b64 = LD_PARAM_64 0, gemm_param_0 ; A
%rd1:b64 = LD_PARAM_64 0, gemm_param_1 ; B
%rd2:b64 = LD_PARAM_64 0, gemm_param_2 ; C
%rd3:b64 = LD_PARAM_64 0, gemm_param_3 ; D
%r0:b32 = LD_PARAM_32 0, gemm_param_4 ; M
%r1:b32 = LD_PARAM_32 0, gemm_param_5 ; N
%r2:b32 = LD_PARAM_32 0, gemm_param_6 ; K
; --- TMA descriptor encode (writes 128B of shared)
CP_ASYNC_BULK_TENSOR_2D_ENCODE_SHARED_GLOBAL
%smem_desc_a:b64, %rd0, 128, 64, %r2, 1, 1
CP_ASYNC_BULK_TENSOR_2D_ENCODE_SHARED_GLOBAL
%smem_desc_b:b64, %rd1, 128, 64, %r2, 1, 1
bb.1.loop:
successors: %bb.1, %bb.2
%k:b32 = PHI 0, %bb.0, %k_next:b32, %bb.1
%stg:b32 = PHI 0, %bb.0, %stg_next:b32, %bb.1
%ph:b32 = PHI 0, %bb.0, %ph_next:b32, %bb.1
; --- TMA load: shared <- global through SMEM descriptor
CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER
%smem_a_stg:b64, %smem_desc_a, %bm:b32, %k, %mbar_full:b64
CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER
%smem_b_stg:b64, %smem_desc_b, %bn:b32, %k, %mbar_full
; --- transaction barrier wait, parity-encoded
%parity:b32 = AND_b32 %ph, 1
%p0:pred = MBARRIER_TRY_WAIT_PARITY_SHARED %mbar_full, %parity
; --- WGMMA four-op sequence
WGMMA_FENCE_SYNC_ALIGNED
%da:b64 = WGMMA_DESCRIPTOR_ENCODE_SMEM %smem_a_stg, 2048, 0, 0, 1
%db:b64 = WGMMA_DESCRIPTOR_ENCODE_SMEM %smem_b_stg, 2048, 0, 0, 1
WGMMA_MMA_ASYNC_SYNC_ALIGNED_M64N128K16_F32_F16_F16
dst: %f0:f32, %f1:f32, ..., %f31:f32
src_a: %da
src_b: %db
src_c: %f0, %f1, ..., %f31 ; in-place accumulate
scale_d: 1, trans_a: 1, trans_b: 1
WGMMA_COMMIT_GROUP_SYNC_ALIGNED
WGMMA_WAIT_GROUP_SYNC_ALIGNED 0
%k_next:b32 = ADD_b32 %k, 64
%stg_inc:b32 = ADD_b32 %stg, 1
%stg_next:b32 = REM_b32 %stg_inc, 3
%ph_next:b32 = XOR_b32 %ph, 1
%done:pred = ICMP_UGE %k_next, %r2
BRCOND %done, %bb.2
BR %bb.1
Three things are worth noting at the MIR level. First, the LD_PARAM_* opcodes are NVPTX-specific pseudo-ops that the AsmPrinter renders as ld.param.*; they cannot be expressed as generic ISD::LOAD because the PTX .param space disallows aliasing and arbitrary access patterns. Second, the WGMMA accumulator is materialised as per-thread FP32 registers that are all live across the MMA instruction. The listing shows 32 of them (%f0 through %f31); a full 64x128xf32 atom output spread over 4 warps of 32 lanes works out to 8192 / 128 = 64 elements per thread, so the 32-register operand list should be read as abbreviated. This register pressure is what drives the nvvm.maxnreg=168 budget the kernel-spec sets. Third, the MBARRIER_TRY_WAIT_PARITY_SHARED form encodes the producer/consumer handshake as a single predicate-producing instruction; the i1 result drives the conditional branch that retries the wait.
Stage 6: PTX text
The AsmPrinter (AsmPrinter and Per-SM Windows) walks the MachineFunction and renders each instruction through its print shape. The result is the PTX text that ptxas consumes.
//
// Generated by tileiras 13.1, target sm_90a
//
.version 8.4
.target sm_90a
.address_size 64
.extern .shared .align 16 .b8 global_smem[];
.entry gemm(
.param .u64 gemm_param_0,
.param .u64 gemm_param_1,
.param .u64 gemm_param_2,
.param .u64 gemm_param_3,
.param .u32 gemm_param_4,
.param .u32 gemm_param_5,
.param .u32 gemm_param_6
)
.reqntid 128, 1, 1
.maxnreg 168
.reqnctapercluster 2, 1, 1
{
.reg .pred %p<8>;
.reg .b32 %r<48>;
.reg .b64 %rd<24>;
.reg .f32 %f<128>;
ld.param.u64 %rd0, [gemm_param_0]; // A
ld.param.u64 %rd1, [gemm_param_1]; // B
ld.param.u64 %rd2, [gemm_param_2]; // C
ld.param.u64 %rd3, [gemm_param_3]; // D
ld.param.u32 %r0, [gemm_param_4]; // M
ld.param.u32 %r1, [gemm_param_5]; // N
ld.param.u32 %r2, [gemm_param_6]; // K
mov.u32 %r3, %ctaid.x; // bm
mov.u32 %r4, %ctaid.y; // bn
// ---- TMA descriptor construction (one .b1024 tensormap per operand)
cp.async.bulk.tensor.encode.2d.global
[%rd10], [%rd0], {128, 64}, {%r2, 1}, 1, 1;
cp.async.bulk.tensor.encode.2d.global
[%rd11], [%rd1], {128, 64}, {%r2, 1}, 1, 1;
// ---- mbarrier init by warp 0
@%p0 mbarrier.init.shared.b64 [%rd12], 1;
mov.u32 %r5, 0; // k
mov.u32 %r6, 0; // stg
mov.u32 %r7, 0; // ph
LBB_loop:
// ---- TMA load A and B into stage-local SMEM
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
[%rd20], [%rd10, {%r3, %r5}], [%rd12];
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
[%rd21], [%rd11, {%r4, %r5}], [%rd12];
// ---- consumer wait: try-wait drives a retry loop
and.b32 %r8, %r7, 1;
LBB_wait:
mbarrier.try_wait.parity.shared.b64 %p1, [%rd12], %r8;
@!%p1 bra LBB_wait;
// ---- WGMMA four-op protocol
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
{%f0, %f1, %f2, %f3, %f4, %f5, %f6, %f7,
%f8, %f9, %f10, %f11, %f12, %f13, %f14, %f15,
%f16, %f17, %f18, %f19, %f20, %f21, %f22, %f23,
%f24, %f25, %f26, %f27, %f28, %f29, %f30, %f31},
%rd22, // A descriptor
%rd23, // B descriptor
1, 1, 1; // scale_d, trans_a, trans_b
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;
add.u32 %r5, %r5, 64; // k += 64
add.u32 %r9, %r6, 1;
rem.u32 %r6, %r9, 3; // stg = (stg+1) % 3
xor.b32 %r7, %r7, 1; // ph ^= 1
setp.lt.u32 %p2, %r5, %r2;
@%p2 bra LBB_loop;
// ---- epilogue: C load, add, TMA store of D
// ...
ret;
}
This is the PTX text that tileiras hands to ptxas, with the epilogue elided here. The .reqntid, .maxnreg, and cluster-dimension directives are the lifted kernel-spec attributes; the WGMMA fence/MMA/commit/wait sequence is the four-op contract documented in WGMMA Emission Protocol — The Four-Op Sequence; the mbarrier.try_wait.parity form is the parity-encoded handshake whose state machine is documented in mbarrier State Machine.
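The loop-control instructions at the bottom of the listing (add, rem, xor, setp, bra) can be traced in a few lines. This is a sketch of the bookkeeping only, with loop_trace as a hypothetical helper; it mirrors the listing's bottom-tested loop, so the body runs once even when K is 0.

```python
# Trace the (k, stage, parity) triple across iterations of the PTX loop.


def loop_trace(K, k_step=64, stages=3):
    k, stg, ph = 0, 0, 0
    trace = []
    while True:
        trace.append((k, stg, ph & 1))   # values seen by this iteration's body
        k += k_step                      # add.u32  k += 64
        stg = (stg + 1) % stages         # add + rem.u32
        ph ^= 1                          # xor.b32
        if k >= K:                       # setp.lt.u32 / bra: loop while k < K
            break
    return trace


for row in loop_trace(256):
    print(row)
```

For K = 256 the body runs four times, and the stage index wraps back to 0 on the fourth iteration.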
Stage 7: SASS (ptxas output)
The PTX text in Stage 6 is the last representation tileiras generates itself. ptxas, running as a separate subprocess over the boundary documented in ptxas Handoff Protocol, assembles the PTX into the SASS (Streaming Assembler) instruction stream, the hardware-level encoding the SM actually executes. SASS includes register allocation across the full live range of the WGMMA accumulator, instruction scheduling that interleaves the producer warps' TMA-issue with the consumer warps' WGMMA, and the exact 128-bit instruction encodings the GPU front-end decodes.
That layer is out of scope for tileiras's documentation. The wiki covers the path up to PTX text; everything below the handoff is ptxas territory. The argv shape, knob-file structure, and stdout-cubin convention are documented at the boundary page.
Cross-References
The per-stage canonical pages remain authoritative for everything this walkthrough abbreviates. cuda_tile Overview, nv_tileaa Overview, and nv_tileas Overview cover the three tile dialects' operation rosters, type contracts, and verifier rules. cuda_tile to tileaa, tileaa to tileas, and tileas to LLVM cover the three partial-conversion passes that move IR between those dialects. Modulo Scheduler and Rau and Buffer Assignment and Named-Barrier Binding cover the scheduling work that turns the linear K loop into a three-stage pipeline. WGMMA Emission Protocol, TMA, Tensormap and cp.async.bulk, and mbarrier State Machine cover the three hardware contracts the consumer body relies on. Per-SM Emission Templates and AsmPrinter and Per-SM Windows cover the NVPTX backend's PTX emission. Matmul Progression by SM and Capability Matrix explain why the lowering chose sm_90a WGMMA and what the same kernel produces on Ampere, Ada, and Blackwell. ptxas Handoff Protocol closes the loop by describing the argv-over-subprocess interface where the PTX text leaves tileiras.
Frontend Contract and Tile IR Emission
Abstract
Tileiras consumes Tile IR bytecode; producing that bytecode is a frontend's
responsibility. A conformant frontend follows three conventions: emit operations
in the cuda_tile dialect with the documented operand and attribute structure,
attach module-level kernel-launch metadata that survives every subsequent
lowering, and serialize using MLIR bytecode under the tileiras-flavored
attribute-tag wire format. This page is the producer-facing contract: kernel
signature rules, op-construction conventions, attribute requirements,
bytecode-format constraints, and the common emission mistakes that produce
modules tileiras rejects.
The contract is documented from the consumer's perspective. Every rule below
corresponds to a check that fires somewhere in the bytecode reader, the
cuda_tile verifier, the ConvertCudaTileToTileAA conversion target, or the
downstream kernel-spec lookup. A frontend that satisfies all four boundaries
produces bytecode that flows through the entire 53-pass pipeline without
producer-visible diagnostics.
Who Produces Tile IR
Tile IR is an open input format. Any compiler that wants to target the tile pipeline can build a frontend; tileiras only inspects the bytecode buffer it receives, not the producer that wrote it.
| Producer | Surface | Status |
|---|---|---|
| NVIDIA's Triton-style frontend | High-level Python DSL with tt.* kernel attributes | Primary producer; sets the de facto attribute conventions |
| CUTLASS DSL frontend | Python DSL that emits cuda_tile directly through MLIR Python bindings | Targets the same bytecode container with the same attribute names |
| mlir-translate with a tileiras-aware bytecode writer | Textual cuda_tile IR plus the tileiras AttrTag numbering | Practical for hermetic tests and small reproducers; requires the tileiras-flavored writer rather than stock upstream |
| Hand-rolled bytecode emitters | Direct LEB128 record construction against the wire format documented in MLIR Bytecode Format | Used for differential testing and bug reduction; only viable when the producer freezes the tileiras tag table |
The producer set is open in both directions: tileiras has no allowlist of signing frontends, and the public bytecode contract has no producer-identity field. The only invariants tileiras checks are the bytecode envelope, the dialect list, and the per-op verifier rules.
The Kernel Signature Contract
A frontend's first job is producing a cuda_tile.entry operation whose
signature, attributes, and body region match the public dialect contract. The
verifier checks the structure; the kernel-attribute attachment step in
ConvertTileAAToTileAS reads the attributes; the function-boundary lowering in
ConvertTileFuncToLLVM projects the attributes onto nvvm.* directives.
Required Module Shell
Every conformant module looks like this at the top level:
module attributes {
nv_tileaa.compute_capability = 90 : i32, // sm_90
tt.num_warps = 4 : i32, // four warps per CTA
tt.num_ctas = 1 : i32 // single-CTA cluster
} {
cuda_tile.module {
cuda_tile.entry @gemm(%A: !cuda_tile.ptr<f16>, ...) {
...
cuda_tile.return
}
}
}
The outer builtin.module carries the kernel-launch attributes; the inner
cuda_tile.module holds one or more cuda_tile.entry operations whose bodies
contain the kernel logic.
Kernel-Function Requirements
An entry function must satisfy four structural rules. The verifier in cuda_tile Verifiers reports violations of all four before the operation reaches the conversion target.
| Rule | Verifier check | Producer responsibility |
|---|---|---|
| Operation is cuda_tile.entry (not func.func) | The dialect declares entry as the kernel constructor; arbitrary func.func ops are not recognized as kernels by the downstream lowering | Frontend must construct cuda_tile.entry, not lift to func.func |
| Body terminates with cuda_tile.return | Region terminator must be the matching return op for the entry | No raw func.return allowed in the entry body |
| Arguments use cuda_tile.ptr, cuda_tile.tensor_view, cuda_tile.partition_view, or scalar types | The type converter only knows how to lift these | A frontend that passes a raw !llvm.ptr argument fails at type conversion |
| No view-typed return values | The verifier rejects view results across structured-control-flow boundaries | Views are produced from arguments inside the kernel, not returned out |
The cuda_tile.entry op is distinct from func.func by design. It carries the
region in which the kernel-private structured control flow lives, and the
downstream lowering can identify the entry without a separate annotation walk.
A frontend that tries to lift the body into func.func and tag it with a
custom unit attribute will not produce a kernel; the resulting function emits
.func rather than .entry and is invisible to the CUDA driver at launch
time.
Compute-Capability Attribute
nv_tileaa.compute_capability is the single attribute the frontend must attach
to choose a target. Its absence is fatal at ConvertTileFuncToLLVM: the pass
emits "Failed to get ComputeCapability" through severity 259/0x103 and aborts
the module. The same encoding rule applies everywhere: the integer is
major * 10 + minor, which gives two digits up to sm_90 and three digits from
sm_100 onward.
| Compute capability | Encoded integer | SM string |
|---|---|---|
| sm_70 | 70 | "sm_70" |
| sm_80 | 80 | "sm_80" |
| sm_89 | 89 | "sm_89" |
| sm_90 (Hopper) | 90 | "sm_90" |
| sm_100 (Blackwell datacenter) | 100 | "sm_100" |
| sm_103 (Blackwell Ultra) | 103 | "sm_103" |
| sm_120 (consumer Blackwell) | 120 | "sm_120" |
The driver passes --gpu-name=sm_<NN> on the command line; the conversion pass
reads the --compute-capability option (major * 10 + minor integer) and
writes the attribute onto the module. A frontend that emits bytecode without
this attribute must rely on the driver to inject it from --gpu-name; a
frontend that emits the attribute itself short-circuits the option lookup.
The fallback nv_tileaa.target_spec (a StringAttr of the form "sm_XX") is
read when the integer attribute is absent. The two spellings convert to one
logical concept; new IR should prefer the canonical underscore form
compute_capability.
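The encoding rule and the "sm_XX" fallback spelling can be expressed in a few lines. This is a minimal sketch; the helper names are hypothetical, and the string parser deliberately accepts only the plain "sm_" plus digits form described above (how suffixed names such as sm_90a are handled is not covered here).

```python
# Encode and decode the compute-capability integer (major * 10 + minor).


def encode_compute_capability(major, minor):
    return major * 10 + minor


def split_compute_capability(value):
    """Inverse of the encoding: returns (major, minor)."""
    return divmod(value, 10)


def parse_sm_string(sm):
    """Turn a target_spec-style string such as 'sm_100' into the integer form."""
    prefix = "sm_"
    if not sm.startswith(prefix) or not sm[len(prefix):].isdigit():
        raise ValueError(f"unrecognised SM string: {sm!r}")
    return int(sm[len(prefix):])


for name in ("sm_89", "sm_90", "sm_100", "sm_103", "sm_120"):
    value = parse_sm_string(name)
    print(name, value, split_compute_capability(value))
```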
Kernel-Launch Attributes
The tt.* namespace is the de facto convention for kernel-launch attributes
attached at the module level. They flow through ConvertCudaTileToTileAA
verbatim (no rename) and are folded into nv_tileaa.kernel_spec during
attachKernelSpecAttributes. The kernel-spec record is then read by the
scheduler, the agent-switch builder, and the function-boundary lowering, which
projects it onto nvvm.* function attributes the AsmPrinter emits as PTX
directives.
⚡ QUIRK: Triton's tt.* prefix is the project-neutral compiler's de-facto schema
Tileiras is a project-neutral CUDA tile compiler, but its kernel-launch contract reads attributes under the tt.* namespace, the prefix Triton uses for its own frontend. tt.num_warps, tt.num_ctas, tt.cluster_dim, and tt.num_stages flow through ConvertCudaTileToTileAA with no rename and land in nv_tileaa.kernel_spec unchanged. A frontend that uses a clean per-project namespace (say myfrontend.num_warps) gets the silent defaults instead (4 warps, 1 CTA, cluster [1,1,1]), and the scheduler emits a kernel sized for a single warp group with no warning that the producer's intent was ignored.
| Attribute | Default if absent | Projected PTX directive |
|---|---|---|
| tt.num_warps = N : i32 | 4 | .reqntid (32*N), 1, 1 |
| tt.num_ctas = N : i32 | 1 | (drives cluster directive emission) |
| tt.cluster_dim (3-element i32 array) | [1, 1, 1] | .reqnctapercluster X, Y, Z on SM90+ |
| tt.num_stages = N : i32 | scheduler default | (consumed by modulo scheduler, no direct PTX) |
A frontend may attach additional implementation-specific attributes under its own namespace; they survive every lowering stage that does not actively rewrite them and are dropped if no consumer reads them. The recommended practice is to keep producer-internal attributes prefix-namespaced so they cannot collide with the consumer-visible ones.
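The default-resolution behaviour in the table, including the silent fallback when a producer uses its own namespace, can be sketched as a dictionary lookup. This is an illustration under stated assumptions: resolve_launch_attrs and reqntid are hypothetical helpers operating on a plain dict, not the real attribute API, and tt.num_stages is omitted because its default is scheduler-defined.

```python
# Resolve tt.* launch attributes against the documented defaults.
LAUNCH_DEFAULTS = {
    "tt.num_warps": 4,
    "tt.num_ctas": 1,
    "tt.cluster_dim": [1, 1, 1],
}


def resolve_launch_attrs(module_attrs):
    """Only keys in the tt.* schema are read; anything else is ignored."""
    resolved = dict(LAUNCH_DEFAULTS)
    for key in LAUNCH_DEFAULTS:
        if key in module_attrs:
            resolved[key] = module_attrs[key]
    return resolved


def reqntid(num_warps):
    # .reqntid (32*N), 1, 1 from the table above
    return (32 * num_warps, 1, 1)


good = resolve_launch_attrs({"tt.num_warps": 8})
ignored = resolve_launch_attrs({"myfrontend.num_warps": 8})
print(reqntid(good["tt.num_warps"]))
print(reqntid(ignored["tt.num_warps"]))
```

The second call shows the failure mode the quirk describes: the request for eight warps is never read, so the kernel is sized from the default of four.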
Optional NVVM Directive Hints
A frontend can ask for a tighter register cap or a per-SM occupancy floor by
attaching nvvm.* attributes directly to the kernel function. These bypass the
kernel-spec mirror and reach the AsmPrinter unchanged.
| Attribute | Type | PTX directive | Use case |
|---|---|---|---|
| nvvm.maxnreg = N : i32 | IntegerAttr<i32> | .maxnreg N | Bound per-thread register usage so ptxas can trade registers for occupancy |
| nvvm.minctasm = N : i32 | IntegerAttr<i32> | .minnctapersm N | Request a minimum occupancy floor |
| nvvm.maxclusterrank = N : i32 | IntegerAttr<i32> | .maxclusterrank N | Portability cap on cluster size |
| nvvm.blocksareclusters | UnitAttr | .blocksareclusters | Treat every CTA as its own cluster (legal only with cluster_dim = (1, 1, 1)) |
These attributes are optional. The kernel-spec path is the primary mechanism;
direct nvvm.* attributes are for cases where the frontend already knows the
exact directive value and wants to skip the mirror step. Mixing the two is
legal — the function-boundary lowering writes nvvm.* attributes derived from
the kernel-spec only when they are not already present.
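The "only when not already present" rule amounts to a non-overwriting merge. A minimal Python sketch follows, assuming function attributes can be modeled as a dict; merge_nvvm_attrs is a hypothetical helper name, and the register value 255 is an arbitrary example.

```python
def merge_nvvm_attrs(func_attrs, derived):
    """Add kernel-spec-derived nvvm.* attributes without overwriting
    any nvvm.* attribute the frontend attached directly."""
    merged = dict(func_attrs)
    for key, value in derived.items():
        # setdefault keeps the frontend's value when the key already exists.
        merged.setdefault(key, value)
    return merged


frontend_attrs = {"nvvm.maxnreg": 128}
from_kernel_spec = {"nvvm.maxnreg": 255, "nvvm.reqntid": 128}
merged = merge_nvvm_attrs(frontend_attrs, from_kernel_spec)
print(merged)
```

Under this model the frontend's nvvm.maxnreg wins, while nvvm.reqntid is filled in from the kernel-spec because nothing claimed it first.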
Op-Construction Conventions
The 92-op cuda_tile roster (see Operation Roster)
divides into families with consistent operand-order and attribute conventions.
A frontend that follows the family conventions produces IR that satisfies the
verifier on first construction; a frontend that improvises operand orders
triggers verbatim diagnostics keyed off the operandSegmentSizes arrays the
verifier consults.
Token-Ordered Memory Operations
Every memory-side-effect op carries a token chain. The convention for constructing a load is:
%value, %tok_out = cuda_tile.load_view_tko %view[%i, %j], %mask, %fallback, %tok_in
{ mem_semantic = #cuda_tile<mem_semantic relaxed>,
mem_scope = #cuda_tile<mem_scope gpu>,
operandSegmentSizes = array<i32: 1, 2, 1, 1, 1> }
: !cuda_tile.partition_view<128x64xf32>, index, index,
tile<128x64xi1>, tile<128x64xf32>, !cuda_tile.token
-> tile<128x64xf32>, !cuda_tile.token
The operand order is fixed by the family: (view, indices..., mask, fallback, token_in) for views; (ptr, indices..., mask, fallback, token_in) for raw
pointers. The operandSegmentSizes array partitions the operand list into the
five slots {view_or_ptr, indices, mask, fallback, token} and is the
verifier's primary structural check.
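The structural check can be modeled as slicing the flat operand list by the segment sizes. The Python sketch below is an assumption about how such a check behaves, not the verifier's code; partition_operands and the error strings are hypothetical.

```python
SLOTS = ("view_or_ptr", "indices", "mask", "fallback", "token")


def partition_operands(operands, segment_sizes):
    """Split a flat operand list into the five named slots."""
    if len(segment_sizes) != len(SLOTS):
        raise ValueError("operandSegmentSizes must have five entries")
    if any(size < 0 for size in segment_sizes):
        raise ValueError("segment sizes must be non-negative")
    if sum(segment_sizes) != len(operands):
        raise ValueError("segment sizes do not cover the operand list")
    parts = {}
    offset = 0
    for slot, size in zip(SLOTS, segment_sizes):
        parts[slot] = operands[offset:offset + size]
        offset += size
    return parts


# Operands of the load_view_tko example, in family order.
parts = partition_operands(
    ["%view", "%i", "%j", "%mask", "%fallback", "%tok_in"],
    [1, 2, 1, 1, 1],
)
print(parts["indices"], parts["token"])
```

Because the slots are positional, a token placed anywhere but last lands in the wrong slot even when the sizes sum correctly, which is why the operand order and the array must agree.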
Three structural rules govern token construction:
- Every memory op consumes one input token and produces one output token. A frontend that leaves the token unthreaded breaks the dataflow representation of memory ordering. Stores produce a token but no data; loads and atomics produce both.
- A make_token op at function entry seeds the chain. Use it once per independent ordering chain; multiple independent loads that can reorder use separate chains, multiple ordered loads thread the same chain.
- join_tokens merges two chains. When two independent chains both need to feed into a later store, join them and pass the result through.
The _tko suffix marks the family, but it is also the verifier's keyword:
omitting the suffix produces an unknown-op diagnostic during bytecode read.
MMA Operations
Matrix multiply-accumulate operations follow the (A, B, C) -> D convention
where C is the accumulator-in and D is the accumulator-out. The same SSA value
is permitted to flow through both — that is the common pattern when an MMA
runs inside a K-loop. The shape contract is enforced at construction:
| Op | A shape | B shape | C/D shape | Required attributes |
|---|---|---|---|---|
| mmaf (floating) | tile<[B ×] M × K × elem_a> | tile<[B ×] K × N × elem_b> | tile<[B ×] M × N × elem_c> | optional rounding |
| mmai (integer) | same | same | same | required signedness_a, signedness_b |
The K dimension is contracted (must agree between A and B); the M and N dimensions are the output extents (must agree between A/B and C/D). The batched form takes rank-3 tile types with a shared leading batch dimension; the verifier rejects any rank disagreement, K-dimension mismatch, or accumulator/result type divergence.
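The shape contract can be written down as a small checker. This Python sketch covers shapes only (element types are ignored); check_mma_shapes and its error strings are hypothetical and do not reproduce the verifier's diagnostics.

```python
def check_mma_shapes(a, b, c):
    """Validate (A, B, C) tile shapes and return the D shape (same as C)."""
    if not (len(a) == len(b) == len(c)) or len(a) not in (2, 3):
        raise ValueError("rank disagreement")
    if len(a) == 3 and not (a[0] == b[0] == c[0]):
        raise ValueError("batch dimension mismatch")
    m, k_a = a[-2], a[-1]
    k_b, n = b[-2], b[-1]
    if k_a != k_b:
        raise ValueError("K-dimension mismatch")
    if (c[-2], c[-1]) != (m, n):
        raise ValueError("accumulator shape mismatch")
    return tuple(c)


# Unbatched: A is MxK, B is KxN, C/D is MxN.
print(check_mma_shapes((128, 64), (64, 32), (128, 32)))
# Batched: a shared leading batch dimension on all three operands.
print(check_mma_shapes((4, 128, 64), (4, 64, 32), (4, 128, 32)))
```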
The MMA atom (WGMMA / tcgen05.mma / mma.sync) is not selected by the
frontend. It is the lowering pipeline's job to pick the right atom for the
target. A frontend that tries to pick a specific atom must do so through an
optimization hint (op-level attribute under optimization_hints), not by
constructing a different op.
Reductions and Scans
The reduce and scan ops carry a region with a pure combiner body. The
convention is (input, identity) -> result with the combiner taking two
block arguments of the input element type.
%sum = cuda_tile.reduce %a, %identity { axis = 1 : i32 } : tile<8x16xf32>
-> tile<8xf32> {
^bb0(%lhs: f32, %rhs: f32):
%r = cuda_tile.addf %lhs, %rhs : f32
cuda_tile.yield %r : f32
}
The body must be a pure region — no side-effecting ops, no token-ordered memory ops, no view operations. Element-type identity in the combiner must match the input element type; rank-zero block arguments are mandatory. The verifier rejects each violation with a verbatim diagnostic that names the rule that fired.
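As a reference for what the reduce example computes, the following Python sketch folds a pure two-argument combiner over axis 1 of a nested-list tile, starting from the identity. It models the result only; the left-to-right fold order is an assumption made for illustration, and a real lowering is free to combine elements in a different order, which is one reason the combiner has to be pure.

```python
from functools import reduce


def reduce_axis1(tile, identity, combiner):
    """Fold each row of a 2-D tile with a pure binary combiner."""
    return [reduce(combiner, row, identity) for row in tile]


def addf(lhs, rhs):
    """Combiner body: two scalar block arguments in, one scalar out."""
    return lhs + rhs


# An 8x16 tile of ones reduces to 8 row sums.
tile = [[1.0] * 16 for _ in range(8)]
row_sums = reduce_axis1(tile, 0.0, addf)
print(len(row_sums), row_sums[0])
```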
Async Pipeline Ops
Async-pipeline ops are emitted by the scheduler in nv_tileas, not by the
frontend. A frontend that wants explicit pipeline staging communicates the
intent through the module-level tt.num_stages hint and lets the modulo
scheduler in Modulo Scheduler and Rau
turn it into producer/consumer agents during lowering. A frontend that emits
nv_tileas.async.pipeline.* ops directly bypasses the verifier — nv_tileas
is not legal at the bytecode boundary.
Required vs Optional Attributes
A single table summarises every attribute the frontend must, may, or must not
attach to a conformant cuda_tile module. "Required" means the lowering fails
without it; "optional" means a sensible default applies; "advisory" means the
attribute is read if present but has no failure path when absent.
| Attribute | Carrier | Status | Default if absent |
|---|---|---|---|
| nv_tileaa.compute_capability | Module | Required | "Failed to get ComputeCapability" (O3); driver-supplied from --gpu-name |
| nv_tileaa.target_spec | Module | Fallback | Used when compute_capability is absent |
| tt.num_warps | Module | Optional | 4 (1 warp group) |
| tt.num_ctas | Module | Optional | 1 (single-CTA cluster) |
| tt.cluster_dim | Module | Optional | [1, 1, 1] |
| tt.num_stages | Module | Advisory | Scheduler default |
| nvvm.maxnreg | Function | Optional | No cap; ptxas chooses |
| nvvm.minctasm | Function | Optional | No occupancy floor |
| nvvm.maxclusterrank | Function | Optional | No portability cap |
| nvvm.blocksareclusters | Function | Optional | Off |
| nvvm.kernel | Function | Synthesized | Attached by downstream rewrite; frontend should not emit |
| nv_tileaa.kernel_spec | Function | Synthesized | Attached by attachKernelSpecAttributes from tt.* hints |
| nv_tileaa.occupancy | Function | Optional | No nvvm.maxnreg synthesized |
| Per-op mem_semantic, mem_scope | Op | Conditional | weak/CTA when absent; required for non-weak orderings |
| Per-op fastmath | Op | Optional | No fast-math flags |
| Per-op optimization_hints | Op | Advisory | No hint applied |
| Per-op operandSegmentSizes | Op | Required | Verifier emits structural error when absent on multi-operand-family ops |
The cross-cutting policy for every attribute family — which stages drop, preserve, synthesize, or read each one — is documented in Attribute System and Lowering.
Bytecode-Format Constraints
The wire format is not stock MLIR bytecode. A frontend that constructs a valid
cuda_tile module in memory still has to clear the bytecode envelope and the
attribute-tag numbering before tileiras can read the buffer.
Magic and Version
Every conformant container opens with the eight-byte magic and a three-VarInt
Tile-version triple. The magic and version constants are documented at
MLIR Bytecode Format — Header Parser.
The accepted version range is 13.1.x only; the parser emits an
"unsupported Tile version ..." diagnostic for everything else.
7f 54 69 6c 65 49 52 00 // "\x7fTileIR\0"
0d 01 00 // VarInt 13, VarInt 1, VarInt 0 (Tile 13.1.0)
Upstream MLIR fills the eighth magic byte with the start of "\nMLIR". A
producer that uses an unmodified upstream writer emits 0x0A in that slot, and
the tileiras reader rejects the buffer with "invalid magic number at position 7". The driver also surfaces this case with the diagnostic "input does not correspond to Tile IR bytecode (it looks like MLIR bytecode instead)".
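A producer can sanity-check its own header before handing the buffer to tileiras. The Python sketch below assumes the version VarInts are plain unsigned LEB128 (the format is only described here as an LEB128 variant) and uses hypothetical function names; the magic bytes, the 13.1.x rule, and the position-7 diagnostic text are taken from the description above.

```python
MAGIC = b"\x7fTileIR\x00"  # eight bytes


def read_varint(buf, pos):
    """Decode one unsigned LEB128 integer; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated VarInt")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_header(buf):
    """Check the magic and the three-VarInt version triple."""
    for i, expected in enumerate(MAGIC):
        if i >= len(buf) or buf[i] != expected:
            raise ValueError(f"invalid magic number at position {i}")
    pos = len(MAGIC)
    version = []
    for _ in range(3):
        value, pos = read_varint(buf, pos)
        version.append(value)
    if version[:2] != [13, 1]:
        raise ValueError("unsupported Tile version")
    return tuple(version), pos


header = bytes.fromhex("7f 54 69 6c 65 49 52 00 0d 01 00")
print(read_header(header))
```

Replacing the eighth byte with 0x0A reproduces the position-7 rejection, which is a quick way to confirm that a writer is not emitting the tileiras magic.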
Dialect List
The envelope's dialect list must include cuda_tile. A builtin entry is
synthesized automatically by the MLIR infrastructure. Other dialects are legal
only if they appear in the registered set and the frontend actually uses them:
arith for constants whose representation is not a cuda_tile.constant,
func for symbol references, and the optional debug-info dialects.
A module that lists nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass,
nvgpu, nvvm, or llvm in its dialect list is rejected by the
conversion-target legality check: those are internal lowering dialects, not
public input. The diagnostic spelling is "unregistered dialect: <name>" from
the dialect-list walker.
AttrTag Wire Format
The 13-entry attribute-tag table inside the bytecode reader is the single-largest wire-format-breaking divergence from upstream. The shipped numbering is documented in Self-Contained Attribute Dispatch; the key differences are:
| Tag | tileiras meaning | Upstream MLIR meaning |
|---|---|---|
| 1 | StringAttr | IntegerAttr |
| 4 | DenseElementsAttr | TypeAttr |
| 5 | DenseElementsAttr<string> | StringAttr |
| 13 | AssumePredicateAttr | (undefined) |
A producer that writes attributes through stock mlir-translate --serialize-bytecode emits the upstream numbering and the tileiras reader
decodes every attribute incorrectly — usually surfacing as garbled type
mismatches mid-IR rather than envelope errors. The practical implication is
that a frontend cannot use upstream mlir-translate directly; it must
either link the tileiras-aware writer or fork the upstream AttrTag table.
Canonical VarInt Encoding
Every multi-byte integer in the container uses the LEB128 variant documented in
VarInt Encoding. Producers
must emit the canonical (shortest) encoding for every integer; an overlong
encoding decodes to the same value but is rejected with "non-canonical VarInt" and the section fails. The writer-side rule is straightforward: count
leading zero bytes in the integer's two's-complement form, pick the shortest
encoding that fits, never zero-pad for alignment.
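The shortest-encoding rule is easy to state in code. This Python sketch assumes standard unsigned LEB128 (seven payload bits per byte, high bit as the continuation flag); the actual tileiras variant is documented on the VarInt Encoding page and may differ in detail, and the function names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as canonical unsigned LEB128."""
    if value < 0:
        raise ValueError("VarInt values must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, no continuation bit
            return bytes(out)


def is_canonical(encoded):
    """True when a well-formed encoding is the shortest for its value.

    An overlong encoding pads the value with zero payload, so its final
    byte is 0x00 even though earlier bytes exist.
    """
    return len(encoded) == 1 or encoded[-1] != 0x00


print(encode_varint(13).hex(), encode_varint(300).hex())
print(is_canonical(b"\x0d"), is_canonical(b"\x8d\x00"))
```

Both b"\x0d" and b"\x8d\x00" decode to 13 under this scheme, but only the first is the shortest form, which matches the rejection rule described above.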
Section Ordering
Sections must be present in dependency order. The reader's walker assumes that
later sections can index into earlier ones, so a producer that writes the
sections out-of-order fails the cross-section index validation, not the
section-walker. The required order is documented in Section Walker
Algorithm. The minimum
set for a cuda_tile module is string, type, attribute/constant, IR (func and
global), and the end marker. Resource and debug sections are optional.
Common Pitfalls
Most frontend bugs are well-formed bytecode that tileiras refuses for one of a small set of repeatable reasons. The diagnostics are verbatim from the reader and the verifier; the root causes are producer-side.
Missing Kernel Marker
Symptom. The kernel compiles, ptxas accepts the PTX, but the resulting cubin exposes no entry symbol for the CUDA driver to launch.
Cause. The frontend wrote a func.func instead of cuda_tile.entry, or
the downstream cute.kernel-to-nvvm.kernel rewrite did not fire because the
function never picked up the cute.kernel marker. The directive emitter wrote
.func rather than .entry because no kernel-spec attached and no
nvvm.kernel was present.
Fix. Emit cuda_tile.entry for every kernel. Do not lift to func.func
before the bytecode boundary; the dialect's structured-control-flow surface
covers everything a kernel body needs.
Wrong AttrTag Numbering
Symptom. The bytecode parser emits "unknown attribute tag <N>" mid-IR,
or — more confusingly — successfully decodes the file but produces a module
whose attribute types are systematically off by one slot.
Cause. The writer used stock upstream AttrTag numbering. Tag 1 wrote an
IntegerAttr (upstream) where the tileiras reader expected a StringAttr;
tag 5 wrote a StringAttr where the reader expected a
DenseElementsAttr<string>.
Fix. Use a tileiras-aware writer. The producer-side AttrTag table is frozen to the values the reader uses; encoding through any other table produces an unreadable buffer regardless of in-memory correctness.
Missing Compute Capability
Symptom. "Failed to get ComputeCapability" (O3) or
"failed to get compute capability." (O2) at lowering time, depending on which pass first observes the missing attribute.
Cause. Neither nv_tileaa.compute_capability nor nv_tileaa.target_spec
attached to the module. The driver's --gpu-name=sm_<NN> option is the
intended injection point; the frontend may skip the attribute and rely on the
driver, but a module produced without the attribute is not portable across
drivers that do not inject one.
Fix. Attach nv_tileaa.compute_capability = N : i32 at the module level
when the frontend knows the target, or document the requirement that the
caller pass --gpu-name.
Wrong Operand Order on Token-Ordered Ops
Symptom. The verifier emits "expected token operand" or a structural
error keyed on operandSegmentSizes.
Cause. The frontend placed the token operand at a non-canonical position,
or omitted operandSegmentSizes. The verifier reconstructs the operand
partition from the array; without it, the multi-operand families fail
structural validation.
Fix. Follow the operand-order convention in Operation Roster:
view/ptr first, indices, optional mask, optional fallback, token last. Always
emit operandSegmentSizes as a five-element array<i32> for the
load/store/atomic families.
tile<...> vs tensor<...> Type Confusion
Symptom. The first-stage type converter fails with an unexpected type diagnostic, or a downstream pattern fails to match.
Cause. A frontend that ported from a tensor-typed IR may have lifted shape
operations to tensor<> rather than cuda_tile.tile<>. The two types have
different verifier contracts: cuda_tile.tile enforces power-of-two dimensions
and a 16-million-element ceiling, while tensor<> has neither check.
Fix. Construct cuda_tile.tile<...> for every shaped value in the kernel
body. Tensor types appear in the IR only after ConvertCudaTileToTileAA lifts
tiles to tensors during the alias-aware stage.
Returning a View
Symptom. Verifier emits "view-typed result rejected" from cuda_tile.if
or cuda_tile.for.
Cause. View types are not first-class results of structured-control-flow operations. The intended pattern is to construct the view inside the region and consume it directly, not to return it across the region boundary.
Fix. Construct views close to where they are consumed; if conditional view
construction is necessary, branch around the consuming load rather than
yielding a view from cuda_tile.if.
Power-of-Two Tile Dimensions
Symptom. "tile dimensions must be powers of two" at type construction.
Cause. A tile shape includes a non-power-of-two dimension. The shape
verifier walks each tile type and rejects any dimension that is not 2^k for
some non-negative k. The element-count ceiling fires later: products above
16 million elements are rejected with "tile would exceed the maximum element count".
Fix. Round tile shapes up to the next power of two and use masking for the ragged region. Frontends that target non-power-of-two problem sizes typically tile around an oversized power-of-two block and predicate the tail.
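The round-up step can be sketched as follows. The helper names are hypothetical, and the ceiling of 2^24 elements is an assumption standing in for the "16 million" figure quoted above; substitute the exact limit if it differs.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("tile dimensions must be positive")
    return 1 << (n - 1).bit_length()


def padded_tile_shape(shape, max_elements=1 << 24):
    """Round every dimension up and enforce the element-count ceiling."""
    padded = tuple(next_power_of_two(dim) for dim in shape)
    count = 1
    for dim in padded:
        count *= dim
    if count > max_elements:
        raise ValueError("tile would exceed the maximum element count")
    return padded


# A ragged 100x48 problem block becomes a 128x64 tile; the extra
# rows and columns are the region the mask has to predicate off.
print(padded_tile_shape((100, 48)))
```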
Unregistered Dialect
Symptom. "unregistered dialect: <name>" from the dialect-list walker.
Cause. The frontend declared an internal lowering dialect (nv_tileaa,
nv_tileas, cute, etc.) in its bytecode envelope. These dialects are not
public input — they are produced inside tileiras and are illegal at the
bytecode boundary.
Fix. Restrict the dialect list to cuda_tile, builtin, arith, func,
and the debug-info dialects. Construct everything else through cuda_tile's
own operation surface.
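A producer-side pre-flight check for the dialect list might look like the Python sketch below. The public and internal sets are taken from this page; the debug-info dialect names are not listed here, so they are passed in as a parameter, and check_dialect_list plus the missing-cuda_tile message are hypothetical.

```python
PUBLIC = {"cuda_tile", "builtin", "arith", "func"}
INTERNAL = {"nv_tileaa", "nv_tileas", "cute", "cute_nvgpu",
            "cutlass", "nvgpu", "nvvm", "llvm"}


def check_dialect_list(dialects, debug_dialects=frozenset()):
    """Return a list of problems; an empty list means the list is acceptable."""
    problems = []
    if "cuda_tile" not in dialects:
        problems.append("dialect list must include cuda_tile")
    allowed = PUBLIC | set(debug_dialects)
    for name in dialects:
        if name in INTERNAL or name not in allowed:
            problems.append(f"unregistered dialect: {name}")
    return problems


print(check_dialect_list(["cuda_tile", "arith"]))
print(check_dialect_list(["cuda_tile", "nv_tileas"]))
```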
Minimal Hand-Rolled Kernel
For testing, hermetic builds, or differential reduction, a frontend can hand-construct a tiny kernel as textual MLIR and run it through a tileiras-aware writer. The minimum is one entry function with one return:
module attributes {
nv_tileaa.compute_capability = 90 : i32,
tt.num_warps = 4 : i32
} {
cuda_tile.module {
cuda_tile.entry @noop() {
cuda_tile.return
}
}
}
Serialized through a tileiras-aware writer, this produces a 256-byte buffer
that flows through the entire pipeline and emits a .entry noop PTX function
with the four expected directive lines (.entry, .reqntid 128, 1, 1,
.maxnreg if set, and the parameter block). It is the smallest input that
exercises every stage of the cascade and is the canonical reduction target for
producer-side bugs.
Triton-Frontend Extensions
NVIDIA's Triton-style frontend extends the contract with a handful of
domain-specific module attributes. They follow the same convention as the
documented tt.* attributes — module-level, integer-or-array values, read
once by attachKernelSpecAttributes and folded into the nv_tileaa.kernel_spec
record on each entry function.
| Triton attribute | Effect | Lowering site |
|---|---|---|
| tt.num_stages = N : i32 | Hint to the modulo scheduler about pipeline depth | Async/Pipeline Family |
| tt.cluster_size = [X, Y, Z] | Shorthand for tt.cluster_dim plus tt.num_ctas | CTA Cluster Family |
| tt.is_persistent | Mark the kernel as persistent for the StaticPersistent scheduler | Pipeline and Tile Scheduler |
| tt.dump_intermediate | Producer-side debugging hint (informational only) | (no consumer) |
These are not part of the canonical contract — a non-Triton frontend can ignore them entirely — but they are stable enough that downstream consumers can rely on them when they are present. See cuda_tile Overview for the public dialect surface they map onto and Attribute System and Lowering for the lifecycle that each one follows from the module dictionary to the PTX directive emitter.
Cross-References
cuda_tile Overview documents the public
dialect surface this contract targets. Operation Roster
catalogues every legal mnemonic and operand-order convention. Types and
Attributes covers the type-storage
parameters and attribute parse contract. Verifiers
documents the verifier diagnostics this page references by spelling.
MLIR Bytecode Format is the wire-format
reference; Dialect Bytecode Reader/Writer Status
explains why only cuda_tile has a linked reader. Position in nvcc 13.1
Toolchain shows where the frontend's
bytecode artifact lands in the larger build. Attribute System and
Lowering is the cross-stage policy reference
for every attribute discussed above; GPU Execution Model
walks the same kernel-attribute story from the perspective of PTX directive
emission. DSL to PTX End-to-End traces a single
kernel through every stage of the pipeline from this contract down to PTX
text.
Attribute System and Lowering
Abstract
Compiler attributes carry the semantic context that IR-graph structure alone cannot express: kernel launch shape, fast-math flags, pipeline staging, scheduling hints, memory ordering, layout descriptors. Every lowering stage in tileiras has a deliberate policy for each attribute family — preserve it under a renamed key for the new dialect, consume it and drop the carrier after the analysis that needed it finished, or synthesize a fresh attribute from inferred facts. A reimplementation that drops the wrong attribute at the wrong stage emits PTX that compiles and runs but produces the wrong answer; the bytes survive ptxas because nothing it sees is malformed.
This page is the canonical reference for the attribute system as a whole. The per-stage lowering pages document where individual attributes flow; the dialect type-and-attribute pages document the per-attribute parse contract and verifier. This page documents the cross-cutting policy that ties them together: what each lowering stage does to each attribute family, which transitions are intentionally lossy, and which silent drops are wrong-output bugs waiting to be introduced.
Attribute carriers
MLIR exposes five places an attribute can live. Tileiras uses all five, and the decision of which carrier to use is part of the attribute's contract — moving an attribute from one carrier to another changes who reads it, when it is read, and what happens when the carrier disappears under a rewrite.
| Carrier | Storage | Lifetime | Primary readers |
|---|---|---|---|
| Op attribute dictionary | DictionaryAttr on the op header, or inherent Properties storage for ops that declared an inherent attribute slot | Bound to the op; survives clones unless the op is rewritten away | Verifiers, fold rules, conversion patterns, the AsmPrinter |
| Type-storage parameters | Fields inside the type's TypeStorage derivative, uniqued through the context StorageUniquer | Bound to the type identity; outlives every op that uses the type | Type-equality checks, type converters, every walker that keys on type |
| Function-level named-attribute dictionary | func.func (or llvm.func) operation header | Bound to the function symbol; survives function-level clones | Function-boundary lowering, LLVM function-attribute emission, the PTX directive emitter |
| Module-level dictionary | builtin.module operation header | Bound to the module; survives across passes that do not rewrite the module shell | Pipeline driver, target-attachment pass, options-mapping pass |
| NVVM properties blob | Per-op compact slot table at Operation*+64, slots stride 8 bytes | Bound to the op like an inherent attribute, but the slots are positional, not keyed | The NVVM-to-LLVM dispatcher arms documented in Properties Blob and Attr Parsers |
The carrier decision matters because the rules for who can read a carrier differ. An op-attribute dictionary entry is keyed by string; a pass that consumes the op can fetch it through getAttr("name"). A type-storage parameter is positional; only code that knows the type's storage class can read it. A function-level attribute is read by a different set of passes than an op-level attribute carrying the same name. Moving an attribute from op-level to function-level (for example, when a kernel-spec entry on the function summarises a per-op annotation) changes the answer to "which pass owns this attribute now?".
Lifecycle of a kernel attribute
A concrete attribute makes the policy concrete. The frontend hint tt.num_warps = 4 — a Triton-style annotation requesting four warps per CTA — flows through every lowering stage in tileiras, changing carrier and key as it travels. The end result is the PTX directive .reqntid 128, 1, 1 in the kernel's .entry header.
Stage 0 (frontend input). A Triton-style producer emits cuda_tile bytecode with the kernel-spec hint attached to the module:
module attributes { tt.num_warps = 4 : i32 } {
cuda_tile.entry @gemm(...) { ... }
}
Stage 1 (ConvertCudaTileToTileAA). The pass walks cuda_tile.module operations and lowers their bodies, but the module-level dictionary entry passes through verbatim. The nv_tileaa dialect declares the same string key as a legal attribute on its enclosing module, so the conversion target does not reject it. The lifecycle here is "preserve, do not rename".
Stage 2 (ConvertTileAAToTileAS, kernel-spec attach). The attachKernelSpecAttributes step folds the frontend hint into the function-level nv_tileaa.kernel_spec attribute. The bytes num_warps = 4 become one field of a structured kernel-spec record on the function. The lifecycle is "consume and synthesize" — the module-level tt.num_warps is read once and a function-level nv_tileaa.kernel_spec is written.
Stage 3 (TileAS scheduling and layout). The scheduler reads kernel_spec.num_warps = 4 to size the warp partitioning and resource pools. The agent-switch pass reads nv_tileas.num_warps = 4 (a per-agent mirror written by OptimizeExecutionUnitMapping) to round each agent's starting warp to its group size. The lifecycle is "read to act, do not rewrite".
Stage 4 (ConvertTileFuncToLLVM). The function-boundary lowering reads nv_tileaa.kernel_spec and writes nvvm.reqntid = 128 : i32 onto the rewritten func.func, derived from 32 * num_warps = 32 * 4 = 128. The lifecycle is "consume and synthesize"; the kernel-spec attribute remains for downstream readers, but the nvvm.reqntid carrier is what the PTX emitter consumes next.
Stage 5 (PTX directive emission). The kernel-directive emitter walks the LLVM function's nvvm.* attribute set in the fixed order documented in Host Launch and ptxas Knobs and emits .reqntid 128, 1, 1 into the .entry header. The lifecycle is "read and project to PTX".
The full trace is six carriers linked by five transitions: module dictionary → function-level kernel-spec → scheduler-internal pool sizing → function-level nvvm.* → PTX directive → cubin metadata consumed by the CUDA driver at launch time. Each transition has a different rule, and each transition is owned by exactly one pass.
Attribute categories
The attribute system breaks into nine functional families. Each family has its own carrier policy, its own set of readers, and its own per-stage rewrite rules.
| Family | Representative attributes | Carrier | Primary readers |
|---|---|---|---|
| Launch shape | nvvm.reqntid, nvvm.maxntid, nvvm.cluster_dim, nvvm.maxclusterrank, nvvm.minctasm, nvvm.maxnreg, nvvm.blocksareclusters, nvvm.explicitcluster, nvvm.grid_constant | Function-level dictionary | Kernel-directive emitter, NVVM IR verifier |
| Compute capability | nv_tileaa.compute_capability, nv_tileaa.target_spec, nv_tileas.compute_capability, nvvm.target | Module-level dictionary; nvvm.target is a type-storage parameter on a type-encoded target attribute | Target-attachment pass, SM-gated rewriter guards |
| Kernel spec | nv_tileaa.kernel_spec, nv_tileas.num_warps, nv_tileas.workspace_global_offset | Function-level dictionary | Scheduler, agent-switch builder, function-boundary lowering |
| Fast-math | fastmath = "contract", fastmath = "nnan", fastmath = "ninf", fastmath = "nsz", fastmath = "arcp", fastmath = "afn", fastmath = "reassoc" | Op-level inherent attribute on arithmetic and MMA ops | Arith folder, instruction selector, intrinsic-rewrite pattern |
| Memory ordering | mem_semantic (relaxed / acquire / release / acq_rel / sc), mem_scope (cta / cluster / gpu / sys), mbar_scope, mbar_space | Op-level inherent attribute on memory ops; later an NVVM properties slot | Memory-op verifier, NVVM dispatcher arm A, LLVM atomic emitter |
| Cache policy | cache_modifier (.ca/.cg/.cs/.cv), eviction_policy, l2_prefetch, cache_eviction_priority | Op-level inherent attribute | Memory-op selectors, ptxas directive emitter |
| Layout / shape | cuTe layout descriptors, DenseI32ArrayAttr tile shape on partition_view, mma_layout, wgmma_layout | Type-storage parameter (layout-on-view), op-level attribute (layout-on-MMA) | Layout-assignment pass, atom builders, MMA intrinsic selector |
| Pipeline staging | pipeline_stage, num_stages, nv_tileas.persistent, tileas.schedule.constraint.* | Op-level discardable attribute on async-pipeline ops; some live in inherent properties when the op definition reserved a slot | Modulo scheduler, MaterializeAsync, schedule-constraint parser |
| Assumption / debug | div_by, bounded, same_elements (assumption predicates on cuda_tile.assume); di_loc, di_compile_unit, di_file, di_lexical_block, di_subprogram (debug info) | Op-level attribute | Optimizer, debug-info translator |
The kernel-spec family is the central pivot. Frontend hints land as kernel-spec fields, the scheduler reads kernel-spec fields, function-boundary lowering reads kernel-spec fields and writes nvvm.* attributes from them, the PTX emitter walks the nvvm.* set in a fixed order — every interesting per-kernel decision passes through the kernel-spec at least once.
Per-stage attribute rules
Each lowering stage has a published rule for each attribute family. The table below is the policy matrix every conversion pattern must respect: which attributes the stage is allowed to drop, which it must preserve (renamed under the new dialect's prefix), which it must synthesize from inferred or read facts, and which it reads to drive its own rewriting decisions without modifying.
| Stage | Drops | Preserves and renames | Synthesizes | Reads to act |
|---|---|---|---|---|
| Frontend → cuda_tile (bytecode input) | (none) | (none) | (none) | (none; this is the input contract) |
| ConvertCudaTileToTileAA | (none) | All op-attribute dictionaries flow through the TypeConverter and emerge on the rewritten nv_tileaa ops; fastmath carries verbatim on arithmetic ops | (none) | Compute capability from the pass option to specialise type conversion |
| ConvertTileAAToTileAS | (none at this stage; downstream passes drop intermediate analysis attrs) | Per-op CopyAtom and ReduceAtom witnesses ride verbatim onto the new nv_tileas ops; layout attributes carry through | nv_tileaa.kernel_spec on the function from frontend hints; SM-gated rewrites consult it through the attached attribute | nv_tileaa.compute_capability for SM100 block-scaled MMA admission |
| TileAS scheduling and layout (D07-D22) | Scheduler-internal intermediate attrs after MaterializeSchedule consumes them | pipeline_stage, nv_tileas.num_warps, schedule-constraint attrs survive into materialization | pipeline_stage integer on each producer/consumer region, nv_tileas.num_warps mirror on agent-switch ops, agent_strides array on agent_switch | kernel_spec.num_warps, kernel_spec.num_ctas, schedule-constraint attrs |
| ConvertTileFuncToLLVM | nv_tileaa.kernel_spec field-by-field (the function-level dictionary entry stays; its readers move) | nv_tileaa.compute_capability, nv_tileaa.target_spec; nv_tileaa.grid_constant argument attributes are renamed and migrated onto the LLVM-typed arguments by the downstream CuteKernelToNvvmRewrite pass | nvvm.reqntid from 32 * numWarps; nvvm.cluster_dim when targetSM > 89 && clusterProduct > 1; nvvm.blocksareclusters under the same predicate; nvvm.minctasm = 1; nvvm.maxnreg from per-SM occupancy table when nv_tileaa.occupancy is set; cute.kernel unit marker (renamed to nvvm.kernel only in the downstream pass) | nv_tileaa.kernel_spec field accessors |
| ConvertTileASToLLVM body conversion | Async-token operand types collapse to i32 carriers; some carrier-only attrs disappear with their ops | nvvm.* properties attributes on lowered ops survive into the NVVM dispatcher slots described in Properties Blob and Attr Parsers | NVVM properties slots from the lowered op's MLIR attribute dictionary; mem_semantic becomes a Pattern-A enum slot at +64, mem_scope becomes a Pattern-A enum slot at +72 | cute.kernel marker, CopyAtom and ReduceAtom witnesses |
| Companion cute*-to-LLVM lowering | CuTe-internal layout-algebra attributes after descriptor materialization | Tile-shape attributes survive into the emitted descriptor constants | TMA descriptor constants from cuTe layout attributes | Layout attributes, compute_capability for atom selection |
| ConvertNVGPUAndGPUToNVVM | gpu.kernel after rewriting to nvvm.kernel | nvvm.* family unchanged | (none beyond what the rewrite emits) | gpu.kernel, gpu.module target attribute |
| AttachNVVMTarget | (none) | Compute-capability and target-spec data folded into #nvvm.target | #nvvm.target attribute on the gpu.module with chip, features, link-files, flags | nv_tileaa.compute_capability, nv_tileaa.target_spec |
| MLIR-to-LLVM translation | The nvvm.* markers that have no LLVM-IR counterpart (e.g. nvvm.kernel is emitted as a calling-convention attribute, not as a metadata node) | All function-level nvvm.* directive carriers become LLVM function attributes named nvvm-reqntid, nvvm-cluster-dim, etc., or NVVM annotation tuples on the legacy path | LLVM function attributes; debug-info intrinsics | All carrier-only nvvm.* attributes |
| NVPTX MIR | Most function-level attributes outside the directive-bearing ones | nvvm-reqntid, nvvm-cluster-dim, nvvm-maxnreg, nvvm-minctasm, nvvm-grid-constant, nvvm-maxclusterrank, nvvm-blocksareclusters carry through as function attributes the AsmPrinter reads | NVPTXISD pseudo-opcodes for grid-constant arguments and TMA descriptor materialization | nvvm.kernel (entry vs func split), per-arg nvvm.grid_constant |
| AsmPrinter (MIR → PTX) | (none at emission time) | (none — this is the projection step) | PTX directives: .entry, .maxntid, .reqntid, .minnctapersm, .maxnreg, .explicitcluster, .reqnctapercluster, .maxclusterrank, .blocksareclusters | Every directive-bearing function attribute |
The two stages that synthesize the most are ConvertTileFuncToLLVM and AttachNVVMTarget. Function-boundary conversion is where frontend hints, scheduler analysis, and kernel-spec fields collapse into the small set of nvvm.* attributes the AsmPrinter will eventually project to PTX. Target attachment is where the per-module compute_capability and target_spec strings become the single resolved #nvvm.target attribute that drives every SM-gated decision downstream.
Intentional drops and silent miscompiles
Not every attribute drop is a bug. The pipeline deliberately drops attributes once their consumer has read them, and the carrier serves no purpose after that point. Distinguishing intentional drops from accidental drops is the central correctness concern for any reimplementation.
Intentional drops:
- Scheduler-internal intermediate attributes are dropped after MaterializeSchedule consumes them. They exist only to communicate analysis state from one scheduler subpass to the next, and they would clutter the IR if left behind. The drop is correct because no downstream pass reads them.
- fastmath attributes on an op's output value disappear when the op is rewritten as an intrinsic that re-encodes the same flags. The intrinsic's argument list carries the flags forward (typically as a fastmathflags LLVM operand bundle), so the original attribute carrier is redundant.
- cute.kernel is renamed to nvvm.kernel by the downstream CuteKernelToNvvmRewrite pass; the original marker disappears once the rename runs. The two-step rename exists because the rewriter also lifts cute_nvgpu.grid_constant argument attributes to nvvm.grid_constant, and that lift needs the LLVM-typed function arguments the function-boundary pass has just produced.
- Per-op mem_semantic and mem_scope op-attribute entries fold into NVVM properties slots during ConvertTileASToLLVM. The op-attribute carrier vanishes, but the value survives at a positional slot the NVVM dispatcher reads.
Silent-miscompile drops to avoid:
- Dropping mem_semantic on a memory op during lowering produces a load or store with weaker ordering than the source requested. The NVVM dispatcher picks the relaxed-ordering arm by default, and the resulting PTX validates cleanly under ptxas; there is no diagnostic to surface the missing fact.
- Dropping mem_scope on a cluster-scope atomic produces a CTA-scope atomic on Hopper hardware. The two opcodes both exist and both pass the NVVM IR verifier; the cluster invariant is not checked.
- Dropping nv_tileaa.compute_capability before AttachNVVMTarget runs produces a #nvvm.target attribute with the default chip, not the requested one. The NVVM IR verifier accepts the target because the chip string is legal; the cubin compiles for the wrong SM and runs in degraded mode (or crashes on unsupported instructions).
- Dropping nv_tileaa.kernel_spec before function-boundary conversion produces a kernel without launch-bound directives. The function compiles as a .func instead of a .entry, and the resulting cubin exposes no kernel for the driver to launch.
- Dropping nvvm.grid_constant on a TMA descriptor argument produces a kernel that copies the descriptor through parameter memory on every launch instead of materializing it once. ptxas accepts the result; the kernel runs but at degraded performance.
- Dropping fastmath on an mmaf op that the frontend marked contract produces an MMA emission that refuses fused-multiply-add formation. The PTX is correct under IEEE-754, slower than the user requested, and the diagnostic surface is empty.
A reimplementation should treat any attribute drop that is not on the intentional list as a candidate bug. The pass-level verifier catches structural mismatches but does not see semantic drops; the NVVM IR verifier sees structural target violations but does not see semantic miscompiles. The attribute-drop policy is the producer-side discipline that fills that gap.
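The policy above can be checked mechanically. The following is a minimal sketch, assuming attributes are compared by name only; `AttrSet` and `unexpected_drops` are hypothetical names for illustration, not symbols from tileiras.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: an op's or function's attributes, reduced to their names.
using AttrSet = std::set<std::string>;

// Returns every attribute name that was present before a stage, is missing
// after it, and is not on that stage's intentional-drop list.
std::vector<std::string> unexpected_drops(const AttrSet &before,
                                          const AttrSet &after,
                                          const AttrSet &intentional) {
    std::vector<std::string> suspects;
    for (const std::string &name : before) {
        const bool dropped = after.count(name) == 0;
        const bool allowed = intentional.count(name) != 0;
        if (dropped && !allowed) {
            suspects.push_back(name);  // candidate silent-miscompile drop
        }
    }
    return suspects;  // in sorted order, because std::set iterates in order
}
```

Running such a check between stages in a debug build turns a silent drop into a reviewable list. It cannot tell whether a surviving value is still correct, only that its carrier disappeared.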
NVVM properties blob
The NVVM properties blob is the dialect-specific compact carrier that sits below the standard MLIR attribute-dictionary surface. Every nvvm.* op that carries inline-data attributes gets a uniform Properties record bump-allocated next to its Operation* header, with attribute slots starting at byte +64 and striding 8 bytes apart. The five access patterns (A through E) cover every per-op attribute family in the dialect — enum payloads, optional enums, unit attributes, integer attributes, and array attributes.
The blob is positional, not keyed. A slot's meaning is fixed by the dispatcher arm for that op mnemonic; a reimplementation that gets the slot ordering wrong reads the wrong attribute even when the data is present. The full slot tables for each op family, plus the 67-element enum-attr registrar chain that backs the parsers, are documented in Properties Blob and Attr Parsers.
Three properties of the blob matter for the cross-stage attribute system. First, the blob is the terminal carrier for memory-ordering and cache-modifier attributes; once an op reaches the NVVM dispatcher, its op-attribute dictionary has been collapsed into the slot table. Second, the blob is inherent storage, not discardable, so cloning an op preserves the slot values verbatim. Third, the slot ordering is the canonical reference for how the lowering pass maps op-attribute keys to NVVM Properties positions — getting the ordering right is exactly the constraint that a hand-written pattern set must satisfy.
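The positional addressing can be sketched in a few lines. This is an illustration only: the +64 base and 8-byte stride come from the description above, while `slot_offset`, `read_slot`, and the raw byte-buffer model of the record are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kFirstSlotOffset = 64;  // first attribute slot, per the text
constexpr std::size_t kSlotStride = 8;        // slots are 8 bytes apart

// Byte offset of the slot with the given zero-based index.
constexpr std::size_t slot_offset(std::size_t slot_index) {
    return kFirstSlotOffset + kSlotStride * slot_index;
}

// Hypothetical reader: copies one 8-byte slot out of a raw record.
// The caller must guarantee the record is large enough for the slot.
std::uint64_t read_slot(const unsigned char *record, std::size_t slot_index) {
    std::uint64_t value = 0;
    std::memcpy(&value, record + slot_offset(slot_index), sizeof(value));
    return value;
}
```

With this layout, slot 0 sits at +64 and slot 1 at +72. Nothing in the record names a slot, so reading index 1 when the dispatcher arm expects index 0 returns a well-formed but wrong value.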
Cross-references
Per-stage attribute movement is documented in Lowering Overview, cuda_tile to nv_tileaa, nv_tileaa to nv_tileas, and nv_tileas to LLVM. Host Launch and ptxas Knobs documents the launch-shape directive emitter and the per-directive policy. Properties Blob and Attr Parsers documents the NVVM properties carrier in detail. cuda_tile Types and Attrs, nv_tileaa Types, Attrs, Verifiers, and nv_tileas Types document the per-dialect parse contract and verifier for each attribute. Schedule Constraint Attributes covers the nine scheduler-constraint attribute strings the modulo scheduler reads. GPU Execution Model is the canonical reference for the launch-shape directives at runtime. DSL to PTX End-to-End walks a representative kernel through every stage and shows the attribute movement in context.
PTX Version and Target Selection
Abstract
Every PTX module tileiras emits begins with the same three-directive header:
a .version, a .target, and an .address_size. The three values are not
independent. They are the projection of one decision — pick a subtarget —
made by stitching together the user's --gpu-name flag, the
nv_tileaa.compute_capability module attribute, the NVPTX subtarget feature
bitset, and the TargetMachine debug toggle. Picking a target also picks
an instruction surface: wgmma, tcgen05, and the block-scaled MMA family
are gated by the a / f suffix on the .target directive, and a kernel
that requires any of them cannot run on a vanilla sm_NN variant.
This page is the cross-cutting story. It explains which knobs choose the
PTX version, which choose the .target line, what the a / f suffixes
mean architecturally, and where the resulting subtarget object is consumed
during codegen.
The Three-Directive Header
The AsmPrinter emits the header exactly once per PTX module, drawing
every field from the active NvptxSubtarget plus the TargetMachine
debug flag. A representative sm_90a build with debug info enabled
produces:
.version 8.4
.target sm_90a, debug
.address_size 64
| Directive | Source | Choice |
|---|---|---|
| .version | Subtarget +ptxNN feature bit | Highest PTX ISA the chosen subtarget supports for the requested features. |
| .target | Subtarget CPU plus optional debug flag | sm_NN[a\|f][, debug]. |
| .address_size | Subtarget pointer width | Always 64 in this build. |
The header is one of the few PTX surfaces where the AsmPrinter does
zero independent thinking. Every value already exists on the
NvptxSubtarget by the time the printer runs; the header step is a
projection, not a decision. See Module Header Directives
for the exact printing routine.
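Because the header is a pure projection, it can be modeled as a function of two subtarget fields and one flag. The sketch below assumes a hypothetical `SubtargetView` struct standing in for the fields the printer reads; it is not the actual printing routine.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the subtarget fields the header depends on.
struct SubtargetView {
    std::string cpu;                 // e.g. "sm_90a"
    unsigned ptx_version_times_ten;  // e.g. 84 for PTX 8.4
};

// Builds the three-directive header; no field is computed here, only printed.
std::string emit_module_header(const SubtargetView &st, bool debug) {
    std::string out;
    out += ".version " + std::to_string(st.ptx_version_times_ten / 10) + "." +
           std::to_string(st.ptx_version_times_ten % 10) + "\n";
    out += ".target " + st.cpu + (debug ? ", debug" : "") + "\n";
    out += ".address_size 64\n";  // fixed in this build, per the table above
    return out;
}
```

Called with `{"sm_90a", 84}` and `debug = true`, this reproduces the representative header shown earlier.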
The .version Directive
The PTX ISA version is the version of the PTX grammar the emitted
module conforms to. Version acceptance is one-directional: a ptxas
shipped with CUDA 13.1 ingests any earlier PTX version, but it rejects
any version newer than the maximum its build understands.
Tileiras picks the PTX version through a subtarget feature bit, not
through a free-form integer. The thirty bits ptx32..ptx88 in the
NVPTX feature index table
each act as a discrete version selector. The driver layer (cicc or the
hosting tool) sets exactly one of them through -mattr=+ptxNN; the
NVPTX subtarget parses the numeric tail of the feature name into
PTXVersionTimesTen and the AsmPrinter divides by ten to print
.version major.minor.
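The parse-then-divide step is small enough to show in full. This is a sketch under the assumption that the feature token has the literal form `+ptxNN`; `parse_ptx_feature` and `format_ptx_version` are illustrative names, not subtarget methods.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the numeric tail of a "+ptxNN" token (84 for "+ptx84"),
// or 0 when the token is not a PTX-version selector.
unsigned parse_ptx_feature(const std::string &token) {
    const std::string prefix = "+ptx";
    if (token.size() <= prefix.size() ||
        token.compare(0, prefix.size(), prefix) != 0) {
        return 0;
    }
    unsigned value = 0;
    for (std::size_t i = prefix.size(); i < token.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(token[i]);
        if (!std::isdigit(c)) {
            return 0;  // trailing non-digits: not a version selector
        }
        value = value * 10 + static_cast<unsigned>(c - '0');
    }
    return value;
}

// Formats the times-ten value as the "major.minor" text of the directive.
std::string format_ptx_version(unsigned times_ten) {
    return std::to_string(times_ten / 10) + "." + std::to_string(times_ten % 10);
}
```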
| PTX Version | Minimum for |
|---|---|
| 6.0 | sm_70 (Volta WMMA, basic mbarrier) |
| 7.0 | sm_80 (Ampere baseline, cp.async) |
| 7.5 | sm_86 / sm_87 |
| 7.8 | sm_89 (FP8 mma.sync on Ada) |
| 8.0 | sm_90 baseline (Hopper) |
| 8.2 | wgmma.mma_async, TMA bulk copies on sm_90a |
| 8.4 | Extended cp.async.bulk, mbarrier additions |
| 8.6 | tcgen05.* family on sm_100a / sm_103a |
| 8.7 | Consumer-Blackwell mma.sync.aligned.*.block_scale on sm_120a |
| 8.8 | The build cap for this drop |
The table is what the toolchain enforces, not what the language
mandates. NVIDIA's PTX manual states the minimum version per
instruction, and ptxas refuses any module that uses an instruction
without declaring at least the matching .version. Tileiras's job is
to declare a version high enough for every instruction it emits,
without picking a version higher than the downstream ptxas supports.
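The two constraints, high enough for every emitted instruction and no higher than ptxas accepts, define a range. The sketch below computes only the lower bound of that range and checks it against the cap. It assumes the per-family minimums are supplied by the caller (for example 82 for wgmma or 86 for tcgen05, taken from the table above); the function name and the 0-means-unsatisfiable convention are illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr unsigned kBuildCapTimesTen = 88;  // build cap from the table above

// Returns the lowest PTX version (times ten) that covers every required
// minimum, or 0 when the requirement exceeds what this build can declare.
unsigned lowest_sufficient_ptx(const std::vector<unsigned> &minimums,
                               unsigned baseline) {
    unsigned needed = baseline;
    for (unsigned m : minimums) {
        needed = std::max(needed, m);
    }
    return needed <= kBuildCapTimesTen ? needed : 0;
}
```

Any value between the returned bound and the cap is legal to declare. Which one the driver actually passes through `-mattr` is a policy choice outside this sketch.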
The rows of the 40 CPU rows table
carry no implied PTX bit; the PTX-version selector is orthogonal to
the CPU selection. A reimplementation that bundles +ptx84 into the
implication mask of sm_90a breaks the orthogonality and forces
downgrades. Pick the highest version compatible with the chosen
feature set, set the corresponding +ptxNN flag, and let the CPU row
contribute only its self-bit.
The .target Directive
The .target directive identifies the streaming-multiprocessor
generation the module is being compiled for. It is the single most
consequential field in the entire PTX file — it selects the
instruction lattice, the warp model, the shared-memory and
register-file sizes, and the set of architecture-conditional
operations available.
The grammar accepted by ptxas is:
.target sm_<digits>[<suffix>][, debug][, map_f64_to_f32]
The suffix is one of three states:
- No suffix (sm_90, sm_100, sm_120). The vanilla architecture. Only baseline ISA instructions are available, but the module is forward-compatible: a binary built for sm_90 runs on every sm_>=90 device, including future ones.
- a suffix (sm_90a, sm_100a, sm_120a). Architecture-specific. The module unlocks the full instruction set of that exact architecture, including any architecture-conditional families documented per generation. It is not forward-compatible: a binary built for sm_90a runs only on Hopper, never on Blackwell.
- f suffix (sm_100f, sm_103f). Family-conditional. The module unlocks architecture-conditional instructions but promises forward compatibility within the family of variants that share the same major SM number. Builds for sm_100f run on every Blackwell datacenter variant (sm_100, sm_101, sm_103 cores) but not on consumer Blackwell or future generations.
The complete grid of who-implies-what lives in the 40 CPU rows table.
Each a or f variant is a separate CPU row in the subtarget table,
with its own feature bit and its own implication mask. The tmem
feature (index 80) is the prime example: it is implied by every
datacenter a / f Blackwell row and by no base or consumer row.
Architecture-Conditional Instructions
Several instruction families are reachable only through a target suffix. The compiler's lowering is built around a feature predicate; plain SM rows leave the predicate false, suffix rows toggle it true.
| Family | Required suffix | Predicate gate |
|---|---|---|
| wgmma.mma_async.sync.aligned (Hopper warp-group MMA) | sm_90a | HasSM90a |
| wgmma.fence, wgmma.commit_group, wgmma.wait_group | sm_90a | HasSM90a |
| TMA cp.async.bulk.tensor im2col modes | sm_90a | HasSM90a |
| setmaxnreg.inc, setmaxnreg.dec | sm_90a | HasSM90a |
| tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish | sm_100a / sm_100f / sm_103a / sm_103f | HasTMem (index 80) |
| tcgen05.mma, tcgen05.mma.sp, tcgen05.mma.ws | sm_100a / sm_100f / sm_103a / sm_103f | HasTMem |
| tcgen05.ld, tcgen05.st, tcgen05.cp | sm_100a / sm_100f / sm_103a / sm_103f | HasTMem |
| mma.sync.aligned.*.block_scale (MXFP8, MXFP4, NVFP4) | sm_120a / sm_121a | HasSM120a / HasSM121a |
| 2-CTA and 4-CTA tcgen05.mma.cta_group::N modes | sm_100a / sm_103a | HasTMem plus shape verifier |
When the user picks --gpu-name=sm_90 (without the a), tileiras
cannot emit wgmma. There are two well-defined outcomes:
- The frontend has already specialized its lowering to avoid producing tt.dot ops that would lower to wgmma. The pipeline completes and the emitted PTX uses mma.sync fallbacks.
- The frontend has emitted a tensor-core op that requires wgmma. The selector finds no legal MachineInstr and fails with an "unsupported operation for target" diagnostic. The compile stops.
There is no third path. Tileiras does not silently degrade a wgmma
kernel into a mma.sync loop nest; that decision belongs upstream,
at the dialect-lowering or tile-scheduler level. The same rule
applies one tier up: a kernel that requires tcgen05.mma cannot run
on sm_100 (base), only on sm_100a or sm_100f. Consumer
Blackwell (sm_120/sm_121) substitutes block-scaled mma.sync
instead and is described in SM120 / SM121 emission.
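The gate can be pictured as a lookup from instruction family to the set of suffixed targets that enable it. The sketch below encodes three rows of the table above by CPU string; the real compiler tests feature predicates such as HasSM90a and HasTMem rather than comparing strings, and `Family`, `family_available`, and `check_family` are hypothetical names. The diagnostic wording beyond the quoted phrase is also an assumption.

```cpp
#include <cassert>
#include <string>

// Three of the architecture-conditional families from the table above.
enum class Family { Wgmma, Tcgen05, BlockScaleMma };

// True when `cpu` is one of the suffixed targets listed for `family`.
bool family_available(Family family, const std::string &cpu) {
    switch (family) {
    case Family::Wgmma:
        return cpu == "sm_90a";
    case Family::Tcgen05:
        return cpu == "sm_100a" || cpu == "sm_100f" ||
               cpu == "sm_103a" || cpu == "sm_103f";
    case Family::BlockScaleMma:
        return cpu == "sm_120a" || cpu == "sm_121a";
    }
    return false;
}

// Returns an empty string when the family is legal, otherwise a diagnostic.
// There is deliberately no fallback branch: unsupported means stop.
std::string check_family(Family family, const std::string &cpu) {
    if (family_available(family, cpu)) {
        return "";
    }
    return "unsupported operation for target " + cpu;
}
```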
The Compute-Capability Attribute
Inside the compiler, the source of truth for the target choice is the
nv_tileaa.compute_capability module attribute. Each lowering and
codegen pass consults this attribute through the
attribute-attached lifecycle:
the driver writes it from --gpu-name, the
ConvertTileFuncToLLVM
stage propagates it, and the AttachNVVMTarget stage folds it into a
single #nvvm.target attribute that the NVPTX backend reads when
constructing the NvptxTargetMachine.
The attribute carries the numeric SM major-times-ten value (90 for
sm_90, 90 for sm_90a — the variant suffix is recorded in a sibling
target_spec field, not in the integer). Downstream rewrites that
need to distinguish sm_90 from sm_90a consult both fields, never
just the integer.
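The split between the integer and the suffix can be sketched as a small parser over the `--gpu-name` string. This assumes the integer is simply the decimal digits after `sm_` and that the suffix is stored as a one-letter string; `TargetChoice` and `split_gpu_name` are hypothetical names, and the real driver validates against an enum table rather than parsing free text.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical pair mirroring compute_capability and target_spec.
struct TargetChoice {
    unsigned compute_capability = 0;  // 90 for both sm_90 and sm_90a
    std::string target_spec;          // "", "a", or "f"
    bool valid = false;
};

TargetChoice split_gpu_name(const std::string &name) {
    TargetChoice out;
    const std::string prefix = "sm_";
    if (name.compare(0, prefix.size(), prefix) != 0) {
        return out;  // not an sm_NN form
    }
    std::size_t i = prefix.size();
    const std::size_t digits_begin = i;
    while (i < name.size() && std::isdigit(static_cast<unsigned char>(name[i]))) {
        out.compute_capability =
            out.compute_capability * 10 + static_cast<unsigned>(name[i] - '0');
        ++i;
    }
    if (i == digits_begin) {
        return out;  // no digits after the prefix
    }
    out.target_spec = name.substr(i);
    out.valid = out.target_spec.empty() || out.target_spec == "a" ||
                out.target_spec == "f";
    return out;
}
```

Because `sm_90` and `sm_90a` yield the same integer, any consumer that compares only `compute_capability` cannot tell them apart, which is the reason both fields must be consulted.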
Dropping the attribute before AttachNVVMTarget runs is a known
source of silent miscompiles: the #nvvm.target attribute falls back
to a default chip, the NVVM IR verifier accepts it because the chip
string is well-formed, and the cubin compiles for the wrong SM. The
intentional drops list
documents this failure mode explicitly.
Target Machine Construction
The NVPTX backend wraps the choices above into an
NvptxTargetMachine constructed from a triple, a CPU string, and a
feature string:
NvptxTargetMachine *tm = NVPTXTarget::createTargetMachine(
    /*triple=*/     "nvptx64-nvidia-cuda",
    /*cpu=*/        "sm_90a",
    /*features=*/   "+ptx84",
    /*options=*/    TargetOptions{...},
    /*reloc=*/      Reloc::Default,
    /*code-model=*/ CodeModel::Small,
    /*opt-level=*/  CodeGenOpt::Aggressive);
The triple is fixed: nvptx64-nvidia-cuda for every supported target.
The 32-bit variant nvptx-nvidia-cuda is not produced by this build;
the .address_size directive is always 64.
The CPU string is the literal sm_NN[a|f] form, taken verbatim from
the compute_capability + target_spec pair. std::lower_bound
against the sorted CPU table resolves it to a row, and the row's
implication mask is ORed into the runtime feature bitset.
The feature string is a comma-separated list of +feature_name
tokens. The PTX-version bit (+ptx84 in the example) is the most
common entry; other tokens like +fma-level=2, +prec-divf32=3,
+prec-sqrtf32=1 appear when the driver propagates the corresponding
numerical-precision flags. The string is additive over the CPU
row's mask — the row contributes its self-bit and any implied bits,
the string adds whatever else the driver wants.
A worked example for the canonical CUDA 13.1 Hopper build:
--gpu-name=sm_90a
→ compute_capability = 90, target_spec = "a"
→ CPU = "sm_90a"
→ CPU row 39 implication mask: {bit 60 = sm_90a}
→ driver propagates --ptx-version=8.4
→ feature string = "+ptx84"
→ runtime feature_bits[0] |= (1ULL << 60)
→ runtime feature_bits[0] |= (1ULL << 28) // ptx84 = index 28
→ SMVersionTimesTen = 90
→ PTXVersionTimesTen = 84
The AsmPrinter divides PTXVersionTimesTen by ten and prints
.version 8.4. It reads the CPU string out of the subtarget and
prints .target sm_90a. The whole chain is two field reads and a
print.
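The row lookup and the two OR operations from the worked example can be sketched as follows. The table here is a three-row illustration, not the real 40-row table: only the sm_90a bit (60) and the ptx84 index (28) come from the example above, the zero masks on the other rows are placeholders, and `CpuRow`, `find_cpu_row`, and `build_feature_word0` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iterator>

struct CpuRow {
    const char *name;
    std::uint64_t implied_bits;  // word 0 of the implication mask
};

// Illustrative table, sorted by strcmp order. Masks for the base rows are
// placeholders; only the sm_90a entry reflects the worked example.
const CpuRow kCpuTable[] = {
    {"sm_100", 0},
    {"sm_90", 0},
    {"sm_90a", 1ULL << 60},
};

// Binary search over the sorted table; returns nullptr for an unknown CPU.
const CpuRow *find_cpu_row(const char *cpu) {
    const CpuRow *end = std::end(kCpuTable);
    const CpuRow *it = std::lower_bound(
        std::begin(kCpuTable), end, cpu,
        [](const CpuRow &row, const char *key) {
            return std::strcmp(row.name, key) < 0;
        });
    return (it != end && std::strcmp(it->name, cpu) == 0) ? it : nullptr;
}

// ORs the row's implication mask with one feature-string bit.
std::uint64_t build_feature_word0(const char *cpu, unsigned ptx_feature_index) {
    const CpuRow *row = find_cpu_row(cpu);
    assert(row && "unknown CPU string");
    assert(ptx_feature_index < 64);
    return row->implied_bits | (1ULL << ptx_feature_index);
}
```

The exact-match check after `std::lower_bound` matters: the search alone returns the first row not less than the key, so an unknown name would otherwise resolve to a neighbouring row.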
Address Size
.address_size is always 64 in this build. The full set of CPU rows
listed in the subtarget table starts at sm_20 (Fermi), and Fermi-era
cards were the last NVIDIA generation to ever use 32-bit
addressing. Even the legacy CPU rows in this build emit
.address_size 64; the 32-bit code path was removed when the build
was cut, and no flag re-enables it.
This is one of the rare PTX header fields with no decision logic at
all: the printer emits the literal .address_size 64 after
.target, full stop.
The Debug Suffix
A second comma-separated token on the .target line declares the
presence of DWARF debug information:
.target sm_90a, debug
The token is added when the TargetMachine debug level is non-zero.
The driver sets that whenever the user passes --device-debug (or
its -g alias); the
option validator rule
requires -O0 in that case, because full device debug disables
several code-motion and block-merge transforms that an optimized
build relies on.
A non-debug build with only --lineinfo set does not add the
debug token. The line-info path emits source-location records as
PTX .loc directives inside function bodies; the .target header
remains the un-suffixed form. The two paths are independent axes,
not a single switch.
The third optional token on the .target line, map_f64_to_f32, exists in
the ptxas grammar but is never emitted by this compiler. It belongs
to a legacy fp64 emulation path the modern stack does not select.
Cross-Architecture Builds
Tileiras compiles for one target at a time. A single invocation
produces one PTX file for one (--gpu-name, --ptx-version) pair,
with no -arch=... list, no compute_NN / sm_NN pairing, and no
fatbin section table.
Multi-architecture builds are managed entirely at the nvcc level.
nvcc invokes tileiras once per target architecture in the user's
-gencode list, collects the resulting PTX or cubin files, and
hands them to fatbinary and nvlink for packaging. Each tileiras
invocation is independent of the others. See
the nvcc-tileiras handoff diagram
for how the driver-level orchestration assembles a fatbin from
multiple single-target tileiras runs.
The implication for an integrator: there is no API on the tileiras side to ask "give me a JIT-able PTX for any device this fatbin covers". The granularity is one-target-per-invocation, and the fatbin-aware logic lives strictly above the tileiras boundary.
Choosing Between sm_NN, sm_NNa, and sm_NNf
The choice is driven by three orthogonal questions, and the answers combine into the suffix decision.
- Does the kernel need arch-conditional instructions on this generation? If the lowered IR contains wgmma, tcgen05.mma, mma.sync.aligned.*.block_scale, TMA im2col, or setmaxnreg, the answer is yes; an a or f suffix is mandatory.
- Will the binary be deployed across multiple variants in the same SM family? If yes, prefer the f suffix on the generations where it exists. sm_100f runs on every Blackwell datacenter chip; sm_100a runs only on the specific GB100 / B100 die. Consumer Blackwell (sm_120, sm_121) has no f variant because the family does not contain tcgen05; architectural specialization is purely the a form.
- Is forward compatibility with future generations required? Only the bare sm_NN form is forward-compatible across major generations. Choose it when the kernel can do without arch-cond instructions and must run on hardware released after the build.
Practical guidance: choose the least specific target that still admits
every instruction the kernel emits. If the kernel uses wgmma,
pick sm_90a. If it uses tcgen05.mma and is deployed on a
mixed Blackwell datacenter fleet (GB100, GB200, B100), pick
sm_100f. If neither is needed, pick the bare sm_NN for maximum
forward compatibility. Fatbin construction at the nvcc level is the
correct mechanism for multi-architecture deployment, not a single
broad-target tileiras invocation.
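The three questions reduce to a short decision function. The sketch below is one reading of the guidance above, not a tileiras API; `choose_suffix` and its parameter names are hypothetical, and the forward-compatibility question is folded into the first branch because only the bare form can satisfy it.

```cpp
#include <cassert>
#include <string>

// Returns "", "a", or "f" for the .target suffix.
//   needs_arch_conditional: the lowered IR uses a suffix-gated family
//   spans_family_variants:  the binary is deployed across variants of one family
//   generation_has_f:       an f row exists for this generation
std::string choose_suffix(bool needs_arch_conditional,
                          bool spans_family_variants,
                          bool generation_has_f) {
    if (!needs_arch_conditional) {
        return "";  // bare sm_NN: the only cross-generation-compatible form
    }
    if (spans_family_variants && generation_has_f) {
        return "f";  // family-conditional
    }
    return "a";  // architecture-specific
}
```

For example, a tcgen05 kernel deployed on a mixed datacenter fleet gives `"sm_100" + choose_suffix(true, true, true)`, which is `sm_100f`. A block-scaled kernel on consumer Blackwell gets the a form because no f row exists there.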
The compute-capability attribute selection in the frontend is what
decides which branch tileiras takes. There is no fallback: a
kernel emitted by a frontend that expects sm_90a semantics will
not compile against sm_90, and vice versa.
Cross-References
NVPTX Subtarget and Feature Matrix — The 40 CPU Rows
catalogs every CPU row, including which a/f variants imply
tmem. Per-SM Emission Templates — Capability Matrix
walks the actual instruction surfaces unlocked at each SM tier.
Attribute System and Lowering — Lifecycle of a Kernel Attribute
explains how compute_capability propagates from the driver through
to AttachNVVMTarget. Driver CLI Options
documents the --gpu-name enum table. Position in nvcc 13.1
covers the fatbin assembly that wraps multi-architecture builds.
AsmPrinter — Module Header Directives
shows the exact printing path for the three-directive header.
cuda_tile Simplifier Walker
Abstract
The cuda_tile simplifier walker is a recursive expression-tree rewriter sitting between the public MLIR cuda_tile.* operation layer and the storage/uniquer layer. It lifts a fully elaborated CUDA tile op graph into a private expression IR, runs constant folding, identity rewrites, and structural deduplication over that IR, then materializes canonical MLIR values back through the normal builders.
Separation is the key design. The simplifier never mutates arbitrary MLIR operations while reasoning — it works on compact expression records, memoizes simplified results in two caches, and re-emits public operations only after the recursive fold has selected a canonical shape.
Tileiras runs the simplifier underneath the canonicaliser layer rather than next to it. The public canonicaliser (covered by Canonicalizers and Folds) sees only the result of this private walk. Reimplementers who try to merge the simplifier and the canonicaliser into one pass produce quadratic IR churn: the private expression IR is what lets the simplifier dedupe a shared subgraph in one place instead of repeatedly rewriting the public op graph. The page is organised as the boundary contract first (private expression kinds, the binary-arith dispatch table that maps internal kinds to public arith.* opcodes), then the materialiser last.
Private Expression IR
Every private expression node has the same logical header:
- a 16-bit expression kind
- a small flag field
- a pointer to operands or trailing operand storage
- a packed operand count
The expression record is not an Operation * — it is a side representation built for folding. MLIR values enter and leave through mapping tables, which lets the walker share subexpressions and avoid rewriting the public op graph until the chosen replacement is ready.
Operand storage mirrors LLVM and MLIR trailing objects: small operand lists sit inline, while larger lists use a separately allocated trailing array. Tombstone nodes use the sentinel expression kind, removed by compaction before the packed count is recomputed.
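Tombstone compaction can be sketched with the standard erase-remove idiom. The record layout below is a simplification: `ExprHeader` keeps only the kind and flag fields, the operand list is a `std::vector` rather than inline or trailing storage, and using kind 0 as the tombstone value is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed tombstone marker; the page does not say which sentinel is used.
constexpr std::uint16_t kSentinelKind = 0;

// Simplified header: 16-bit kind plus a small flag field.
struct ExprHeader {
    std::uint16_t kind;
    std::uint16_t flags;
};

// Drops tombstoned entries in place, preserving the order of live entries,
// and returns the recomputed operand count.
std::size_t compact_operands(std::vector<ExprHeader> &operands) {
    auto live_end = std::remove_if(
        operands.begin(), operands.end(),
        [](const ExprHeader &e) { return e.kind == kSentinelKind; });
    operands.erase(live_end, operands.end());
    return operands.size();
}
```

Recomputing the count only after compaction keeps the packed count and the stored operands in agreement, which is the ordering the paragraph above describes.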
Recursive Simplifier
The main simplifier dispatches on 17 expression kinds. Unary nodes recurse into their only operand, variadic boolean nodes fold every child, binary arithmetic nodes share one builder, and leaf nodes classify bit-vector payloads as constants, undef values, or symbols. The caller chooses which cache to use, and a tunable recursion limit protects the pass from pathological expression graphs.
Value *simplify_expr(SimplifyContext *ctx, Expr *expr, CacheKind cache, unsigned depth) {
    if (depth > ctx->recursion_limit) {
        return materialize_without_deepening(ctx, expr);
    }
    if (Value *cached = cache_lookup(ctx, cache, expr)) {
        return cached;
    }
    Value *result = NULL;
    switch (expr->kind) {
    case EK_CONSTANT:
        result = materialize_constant(ctx, expr);
        break;
    case EK_NOT:
    case EK_NEG:
    case EK_ABS:
        result = simplify_unary(ctx, expr, cache, depth + 1);
        break;
    case EK_AND:
    case EK_OR:
        result = simplify_variadic_boolean(ctx, expr, cache, depth + 1);
        break;
    case EK_ADD:
    case EK_SUB:
    case EK_MUL:
    case EK_UDIV:
    case EK_SDIV:
        result = simplify_binary_arithmetic(ctx, expr, cache, depth + 1);
        break;
    case EK_SELECT:
        result = simplify_select(ctx, expr, cache, depth + 1);
        break;
    case EK_CAST:
        result = simplify_cast(ctx, expr, cache, depth + 1);
        break;
    case EK_BITVEC_CONST:
        result = materialize_bitvec_leaf(ctx, expr);
        break;
    default:
        bug_invalid_expr_kind(expr->kind);
    }
    cache_insert(ctx, cache, expr, result);
    return result;
}
SimplifierExprKind enum
| Value | Name | Meaning |
|---|---|---|
| 0 | EK_Sentinel0 | invalid or uninitialized record |
| 1 | EK_Constant | folded constant |
| 2 | EK_NotI | unary bitwise not |
| 3 | EK_NegI | unary arithmetic negate |
| 4 | EK_AbsI | unary absolute value |
| 5 | EK_AndN | variadic bitwise and |
| 6 | EK_OrN | variadic bitwise or |
| 7 | EK_EqBin | binary equality predicate |
| 8 | EK_Select | select(condition, true_value, false_value) |
| 9 | EK_Add | binary addition |
| 10 | EK_Sub | binary subtraction |
| 11 | EK_Mul | binary multiplication |
| 12 | EK_DivU | unsigned division |
| 13 | EK_DivS | signed division |
| 14 | EK_Cast | passthrough or narrowing/widening cast |
| 15 | EK_BitVecConst | bit-vector leaf payload |
| 16 | EK_Sentinel16 | invalid terminator |
Type Dispatcher
The second walker dispatches on a compact type-kind byte. Its covered range aligns with the CUDA-tile bytecode opcode band from FToFOp through PtrToIntOp, reserved slots included. That alignment does not make it the public bytecode enum — it is an internal type-kind enum that deliberately uses the same dense range so the dispatcher can share one contiguous jump table.
Most cases share one traversal body. A few are conditional on caller mode, and one uses a separate handler. The reimplementation detail to preserve is operand discovery: read the type-kind record, choose inline or heap trailing operands according to the storage flag, then recursively walk child types before calling the uniquing layer.
Type *walk_cuda_tile_type(TypeWalkContext *ctx, TypeRecord *record, bool conditional_mode) {
    TypeKind kind = record->kind;
    TypeRange operands = type_record_operands(record);
    if (type_kind_is_conditional(kind) && !conditional_mode) {
        return record->original_type;
    }
    SmallVector<Type *, 8> rewritten;
    for (Type *operand : operands) {
        rewritten.push_back(walk_cuda_tile_type(ctx, type_record(operand), conditional_mode));
    }
    return unique_cuda_tile_type(ctx, kind, rewritten);
}
Binary-Arith Dispatch Table
The inner switch that handles binary integer arithmetic does not branch op-by-op. Cases 9 through 13
share one body, and that body uses the index v23 - 9 to read a 5-entry table at dword_4F99CE0.
The table maps the simplifier's local kind index into the public arith.* opcode that the
materialiser will emit.
| v23 - 9 | dword_4F99CE0[i] | Arith opcode | Notes |
|---|---|---|---|
| 0 | 0x52 | arith.addi | case 9, integer add |
| 1 | 0x54 | arith.subi | case 10, integer subtract |
| 2 | 0x56 | arith.muli | case 11, integer multiply |
| 3 | 0x58 | arith.divui | case 12, integer unsigned divide (EK_DivU) |
| 4 | 0x5A | arith.divsi | case 13, integer signed divide (EK_DivS) |
The 5-row partition reflects the simplifier's scope. Only integer arith ops are touched at the IR level. Floating-point arithmetic flows downstream: FMA shapes form in the dedicated fusion pass, and remaining floating-point rewrites fall to the LLVM optimiser. This keeps the simplifier's fold rules small and lets the table double as a sanity check on the inner switch.
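The shared-body lookup amounts to one bounds check and one indexed read. In the sketch below the five opcode values are copied from the table above, while `kArithOpcodeTable` and `arith_opcode_for_kind` are illustrative names and the 0 return for out-of-range kinds is an assumption (the binary reaches this code only for cases 9 through 13).

```cpp
#include <cassert>
#include <cstdint>

// Opcode values as listed for dword_4F99CE0 in the table above.
constexpr std::uint32_t kArithOpcodeTable[5] = {0x52, 0x54, 0x56, 0x58, 0x5A};
constexpr unsigned kFirstBinaryKind = 9;  // first binary-arithmetic kind

// Maps a simplifier kind in [9, 13] to its public opcode; 0 otherwise.
constexpr std::uint32_t arith_opcode_for_kind(unsigned kind) {
    return (kind >= kFirstBinaryKind && kind < kFirstBinaryKind + 5)
               ? kArithOpcodeTable[kind - kFirstBinaryKind]
               : 0;
}
```

Keeping the five kinds contiguous is what makes the subtraction-based index valid. A reimplementation that renumbers the private kinds must rebuild the table in the same order.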
Materializer
sub_3BBED30 is the materialiser. It consumes the per-op rewrite result produced by simplify_binary_arithmetic and its siblings, writing the chosen replacement back into the MLIR op graph. The rewrite result is a compact record:
typedef struct {
    Operation *original;            // op being replaced
    SmallVector_Value new_operands; // canonical operand list
    OpKind new_kind;                // local simplifier kind, mapped via dword_4F99CE0
    ArrayAttr new_attrs;            // attribute set for the new op
} SimplifierResult;
The materialiser runs four steps in fixed order. It looks up the canonical OperationName for
new_kind, asks the uniquer for an interned op carrying that tuple, falls back to the builder if
the tuple has not been seen before, then rewires uses and erases the original.
Step 1 uses the public arith.* opcode from dword_4F99CE0, not the local simplifier kind. The
table lookup happens at the boundary between the private expression IR and public MLIR, so the
materialiser never sees the simplifier's internal numbering.
Steps 2 and 3 go through a paired uniquer and builder. sub_3F1A460 is the uniquer: given a
(name, operands, attrs) tuple it returns the canonical, deduplicated Operation *. The
simplifier walks many subgraphs and tends to produce structurally identical ops; the uniquer
guarantees one canonical instance per tuple, which keeps later CSE and pattern matching cheap.
sub_3F1A5E0 is the builder: given the same tuple, it materialises a fresh Operation * through
the MLIR OpBuilder API. The builder runs only when the uniquer has no entry yet.
The pair is split because some rewrites produce new constants. Those constants must go through the StorageUniquer first (see StorageUniquer and Context Impl) before they can appear as operands. The uniquer side is the lookup; the builder side is the allocator that runs on a uniquer miss.
Operation *materialize(const SimplifierResult *r) {
    OperationName name = lookupOpName(r->new_kind);
    Operation *unique = sub_3F1A460(name, r->new_operands, r->new_attrs); // uniquer
    if (!unique) {
        unique = sub_3F1A5E0(name, r->new_operands, r->new_attrs);        // builder fallback
    }
    r->original->replaceAllUsesWith(unique);
    r->original->erase();
    return unique;
}
replaceAllUsesWith runs before erase. The order matters: erasing first would drop the SSA def while live uses still reference it, which the MLIR verifier rejects. The materialiser is the only place in the simplifier that mutates the public op graph; every step before it operates on the private expression IR and on uniquer queries.
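The uniquer-then-builder split is a get-or-create lookup keyed on the (name, operands, attrs) tuple. The sketch below models that contract with standard containers; `Op`, `OpInterner`, and the integer operand ids are stand-ins invented for the example and say nothing about how sub_3F1A460 or sub_3F1A5E0 store their data.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Stand-in for an operation: a name, operand ids, and attribute strings.
struct Op {
    std::string name;
    std::vector<int> operands;
    std::vector<std::string> attrs;
};

class OpInterner {
    using Key = std::tuple<std::string, std::vector<int>, std::vector<std::string>>;

public:
    // Returns the canonical op for the tuple, building one only on a miss.
    const Op *get_or_build(const std::string &name,
                           const std::vector<int> &operands,
                           const std::vector<std::string> &attrs) {
        Key key(name, operands, attrs);
        auto it = table_.find(key);
        if (it != table_.end()) {
            return it->second.get();  // uniquer hit: reuse the canonical op
        }
        std::unique_ptr<Op> fresh(new Op{name, operands, attrs});  // builder path
        const Op *result = fresh.get();
        table_.emplace(std::move(key), std::move(fresh));
        ++builds_;
        return result;
    }

    unsigned builds() const { return builds_; }

private:
    std::map<Key, std::unique_ptr<Op>> table_;
    unsigned builds_ = 0;
};
```

Two requests with the same tuple return the same pointer and trigger one build, which is the property that keeps later CSE cheap. Operand order is part of the key, so `(1, 2)` and `(2, 1)` remain distinct unless an earlier fold has already canonicalized commutative operands.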
DenseElements Debug Printer
The DenseElementsAttr printer sits off the hot simplification path. An inbound-file debug replay path reaches it; from there it reads an element buffer and formats it as comma-separated values with no surrounding brackets and no shape prefix. Invalid element type codes abort through the same internal bug path used by invalid expression kinds.
DenseElementsTypeCode
| Value | Name | Element type |
|---|---|---|
| 0 | DETC_Reserved0 | BUG (invalid) |
| 1 | DETC_F32 | 32-bit IEEE float |
| 2 | DETC_F64 | 64-bit IEEE float |
| 3 | DETC_I8 | signed 8-bit integer |
| 4 | DETC_I16 | signed 16-bit integer |
| 5 | DETC_I32 | signed 32-bit integer |
| 6 | DETC_I64 | signed 64-bit integer |
| 7 | DETC_U8 | unsigned 8-bit integer |
| 8 | DETC_U16 | unsigned 16-bit integer |
| 9 | DETC_U32 | unsigned 32-bit integer |
| 10 | DETC_U64 | unsigned 64-bit integer |
| 11 | DETC_Reserved11 | BUG (invalid) |
void append_dense_elements_debug_string(StringBuilder *out,
const void *data,
DenseElementsMetadata meta) {
for (uint64_t i = 0; i < meta.element_count; ++i) {
if (i != 0) {
string_builder_append(out, ",");
}
switch (meta.type_code) {
case DETC_F32:
append_f32(out, load_f32(data, i));
break;
case DETC_F64:
append_f64(out, load_f64(data, i));
break;
case DETC_I8:
case DETC_I16:
case DETC_I32:
case DETC_I64:
append_signed_decimal(out, load_signed_integer(data, meta, i));
break;
case DETC_U8:
case DETC_U16:
case DETC_U32:
case DETC_U64:
append_unsigned_decimal(out, load_unsigned_integer(data, meta, i));
break;
default:
bug_invalid_dense_element_type(meta.type_code);
}
}
}
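The width-per-code dispatch that load_signed_integer relies on can be sketched as follows. This is a minimal illustration, not the binary's implementation: detc_width and format_signed_elements are hypothetical names, the output goes to a fixed char buffer instead of a StringBuilder, and the element buffer is assumed to be in host byte order (the source does not state the byte order).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the valid DenseElementsTypeCode values (1..10). */
enum {
    DETC_F32 = 1, DETC_F64, DETC_I8, DETC_I16, DETC_I32,
    DETC_I64, DETC_U8, DETC_U16, DETC_U32, DETC_U64
};

/* Element width in bytes implied by a type code; 0 for the reserved codes. */
static size_t detc_width(int code) {
    switch (code) {
    case DETC_I8:  case DETC_U8:                 return 1;
    case DETC_I16: case DETC_U16:                return 2;
    case DETC_F32: case DETC_I32: case DETC_U32: return 4;
    case DETC_F64: case DETC_I64: case DETC_U64: return 8;
    default:                                     return 0;
    }
}

/* Formats `count` signed elements as "a,b,c" with no brackets or shape prefix.
 * Returns the number of characters written, or -1 on a non-signed code or a
 * full buffer. Assumption: elements are stored in host byte order. */
static int format_signed_elements(char *out, size_t cap, const void *data,
                                  size_t count, int code) {
    assert(out != NULL);
    if (cap == 0 || code < DETC_I8 || code > DETC_I64)
        return -1;
    size_t width = detc_width(code);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        const unsigned char *p = (const unsigned char *)data + i * width;
        int64_t v = 0;
        switch (width) {
        case 1:  { int8_t t;  memcpy(&t, p, sizeof t); v = t; break; }
        case 2:  { int16_t t; memcpy(&t, p, sizeof t); v = t; break; }
        case 4:  { int32_t t; memcpy(&t, p, sizeof t); v = t; break; }
        default: { int64_t t; memcpy(&t, p, sizeof t); v = t; break; }
        }
        int n = snprintf(out + used, cap - used, "%s%lld",
                         i != 0 ? "," : "", (long long)v);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Called on three DETC_I16 elements 1, -2, 300, the sketch fills the buffer with 1,-2,300. The separator is written before every element except the first, which matches the i != 0 check in the printer above.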
Force-Inline and Specialize Callees
Abstract
Two module-level NVVM passes in tileiras share one purpose: remove call boundaries that are expensive or impossible to preserve in the NVPTX .param ABI. One marks functions as mandatory inline when their signature cannot be lowered cheaply. The other specializes callees after proving more precise address spaces for generic-pointer arguments.
Together they turn difficult interprocedural cases into simpler local code before NVPTX lowering:
- Kernel and image-handle helpers are forced through the normal LLVM inliner.
- Large argument lists and large aggregate returns become inline candidates before param-space lowering.
- Generic-pointer callees can be cloned into address-space-specialized variants.
- Rewritten or cloned callees are marked so later passes can assume the call boundary is temporary.
The semantic preference is clear: tileiras deletes an unsafe call boundary rather than teaching the downstream ABI path to carry a shape it cannot represent reliably.
Operational Model
The force-inline pass is a pure function-attribute pass — no cloning, no call-site rewriting. For each defined function it decides whether inlining is mandatory, then writes both the normal LLVM function attribute and NVIDIA's compact cached attribute field.
The callee-specialization pass is interprocedural. It builds a worklist of functions with generic pointer parameters, infers the concrete address spaces passed at call sites, and either rewrites the original callee or creates a private clone with a narrower signature, then retargets matching call sites to the specialized body.
The two passes are complementary:
force-inline pass:
signature is hard for NVPTX ABI -> mark original function alwaysinline
callee-specialization pass:
generic address-space argument has a stable concrete space -> rewrite or clone
The specialization pass does not replace the inliner. It prepares better callees for it: clones are internal, address-space-resolved, and marked as inline-friendly.
The address-space lattice used by the specialization pass has its own page; this page summarises only the call-graph rewrite. For the full lattice contract (the UNDET/POISON partition, the meet operator, the kBudgetCap per-block bound, and the I08 type-converter handoff that publishes "nvvm.as"), see AddrSpace Vote Lattice. The two pages are complementary by design: this page owns the inliner-vs-specializer choice, the size thresholds, and the call-site retargeting machinery; the lattice page owns the data-flow rules that decide whether specialization is even legal.
Force-Inline Decision
The force-inline pass evaluates functions in priority order. Earlier reasons override later cost-model reasons.
| Reason | Condition | Effect |
|---|---|---|
| Kernel | Function is an NVVM/PTX kernel entry. | Force inline even if the source requested noinline. |
| Image handle | Any argument carries an image/sampler typedef such as wroimage, rdoimage, or sampler. | Force inline because image handles do not survive the param ABI cleanly. |
| Large parameters | Aligned parameter payload exceeds 384 bytes. | Force inline unless the user explicitly requested noinline. |
| Large return | Return payload exceeds 144 bytes. | Force inline unless the user explicitly requested noinline. |
The parameter-size rule uses the ABI-allocated size, not merely the IR type bit width. Each parameter contributes at least 4 bytes, and pointer-like values are rounded according to their ABI alignment.
static size_t param_slot_size(Type *ty, DataLayout dl) {
size_t bytes = dl.alloc_size(ty);
size_t align = dl.pointer_abi_align_if_pointer_like(ty);
if (align != 0)
bytes = align_up(bytes, align);
return max(bytes, 4);
}
static bool has_large_param_payload(Function *fn, DataLayout dl) {
size_t total = 0;
for (Argument *arg = fn->first_arg; arg != NULL; arg = arg->next) {
total += param_slot_size(arg->type, dl);
if (total > 384)
return true;
}
return false;
}
static bool has_large_return_payload(Function *fn, DataLayout dl) {
Type *ret = fn->return_type;
if (ret->is_void)
return false;
return dl.alloc_size(ret) > 144;
}
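A worked instance of the 384-byte rule makes the 4-byte floor concrete. The sketch below is a simplified stand-in under stated assumptions: ParamDesc is a hypothetical descriptor that replaces the Type and DataLayout queries, and only the arithmetic from the text (minimum 4 bytes per slot, optional ABI rounding, strict greater-than 384) is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { PARAM_PAYLOAD_LIMIT = 384, MIN_SLOT_BYTES = 4 };

/* Hypothetical descriptor: allocation size plus ABI alignment (0 = no rounding). */
typedef struct {
    size_t alloc_size;
    size_t abi_align;
} ParamDesc;

static size_t align_up(size_t n, size_t align) {
    assert(align != 0);
    return (n + align - 1) / align * align;
}

/* One parameter's contribution: rounded to its alignment, never below 4 bytes. */
static size_t slot_size(ParamDesc p) {
    size_t bytes = p.alloc_size;
    if (p.abi_align != 0)
        bytes = align_up(bytes, p.abi_align);
    return bytes < MIN_SLOT_BYTES ? MIN_SLOT_BYTES : bytes;
}

static size_t payload_bytes(const ParamDesc *params, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += slot_size(params[i]);
    return total;
}

/* The threshold is strict: exactly 384 bytes does not trigger forced inlining. */
static bool exceeds_param_limit(const ParamDesc *params, size_t count) {
    return payload_bytes(params, count) > PARAM_PAYLOAD_LIMIT;
}
```

With these rules, 96 one-byte parameters occupy 96 x 4 = 384 bytes and stay under the threshold, while a 97th pushes the payload to 388 bytes and makes the function a forced-inline candidate. The IR bit width alone (97 bytes) would never have crossed the limit.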
The pass is intentionally idempotent. A function already carrying alwaysinline is skipped, so repeated pipeline construction does not accumulate redundant mutations.
bool should_force_inline(Function *fn, DataLayout dl, ForceInlineReason *reason) {
if (fn->is_declaration || fn->has_alwaysinline)
return false;
if (is_kernel(fn)) {
*reason = FORCE_INLINE_KERNEL;
return true;
}
if (has_image_or_sampler_argument(fn)) {
*reason = FORCE_INLINE_IMAGE_HANDLE;
return true;
}
if (fn->has_noinline)
return false;
if (has_large_param_payload(fn, dl)) {
*reason = FORCE_INLINE_LARGE_PARAMS;
return true;
}
if (has_large_return_payload(fn, dl)) {
*reason = FORCE_INLINE_LARGE_RETURN;
return true;
}
return false;
}
When the answer is yes, tileiras sets the normal LLVM alwaysinline attribute and updates its compact cached flags so downstream proprietary passes see the same decision without re-querying the attribute set. A compatible reimplementation should treat the LLVM attribute as the source of truth and mirror any cached representation only if it reproduces NVIDIA's in-memory ABI.
Address-Space Specialization
Specialization targets functions that still take generic pointers after ordinary lowering. Generic pointers are legal in LLVM IR, but they hide address-space facts that matter to NVPTX: global, shared, constant, local, tensor memory, and distributed shared memory have different instruction-selection and aliasing consequences.
The pass maintains a lattice per pointer argument:
UNDETERMINED
-> one of: global | shared | constant | local | tensor_memory | distributed_shared
-> POISON
UNDETERMINED means no useful evidence has been seen. A concrete address space means every inspected use agrees. POISON means conflicting evidence was found and the argument must remain generic.
typedef enum AddressVote {
AS_UNDETERMINED,
AS_GLOBAL,
AS_SHARED,
AS_CONSTANT,
AS_LOCAL,
AS_TENSOR_MEMORY,
AS_DISTRIBUTED_SHARED,
AS_POISON,
} AddressVote;
static AddressVote meet_address_votes(AddressVote old_vote, AddressVote new_vote) {
if (old_vote == AS_UNDETERMINED)
return new_vote;
if (new_vote == AS_UNDETERMINED)
return old_vote;
if (old_vote == new_vote)
return old_vote;
return AS_POISON;
}
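Folding the meet over every call site shows how a single disagreeing caller poisons an argument. The sketch below restates the enum and meet operator from the text so that it compiles on its own; fold_votes is a hypothetical helper name, not a function recovered from the binary.

```c
#include <assert.h>
#include <stddef.h>

typedef enum AddressVote {
    AS_UNDETERMINED,
    AS_GLOBAL,
    AS_SHARED,
    AS_CONSTANT,
    AS_LOCAL,
    AS_TENSOR_MEMORY,
    AS_DISTRIBUTED_SHARED,
    AS_POISON,
} AddressVote;

/* Same meet as in the text: UNDETERMINED is the identity, disagreement poisons. */
static AddressVote meet_address_votes(AddressVote old_vote, AddressVote new_vote) {
    if (old_vote == AS_UNDETERMINED)
        return new_vote;
    if (new_vote == AS_UNDETERMINED)
        return old_vote;
    if (old_vote == new_vote)
        return old_vote;
    return AS_POISON;
}

/* Hypothetical helper: combine the votes observed at every call site. */
static AddressVote fold_votes(const AddressVote *votes, size_t count) {
    assert(votes != NULL || count == 0);
    AddressVote acc = AS_UNDETERMINED;
    for (size_t i = 0; i < count; ++i)
        acc = meet_address_votes(acc, votes[i]);
    return acc;
}
```

Votes of global, undetermined, global fold to global, so the argument can be specialized. Votes of shared, global, shared fold to POISON, and once an argument is poisoned no later vote can recover it.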
Only functions with bodies, at least one generic pointer parameter, and no hard opt-out attributes are seeded into the worklist. Kernels are excluded; they are handled by the force-inline and kernel-argument paths.
static bool specialization_candidate(Function *fn) {
return !fn->is_declaration
&& !fn->is_kernel
&& !fn->has_optnone
&& !fn->has_noinline
&& !fn->has_naked
&& !fn->already_specialized
&& has_generic_pointer_parameter(fn);
}
Specialization Algorithm
The pass is a fixed-point worklist. Each successful specialization can make callers newly profitable, so affected callers are re-enqueued.
bool specialize_callees(Module *m, int clone_budget) {
Worklist wl = {};
bool changed = false;
for (Function *fn = m->first_function; fn != NULL; fn = fn->next) {
if (specialization_candidate(fn))
worklist_push(&wl, fn);
}
while (!worklist_empty(&wl)) {
Function *fn = worklist_pop(&wl);
AddressVote votes[MAX_ARGS];
init_votes(votes, fn->arg_count);
for (Use *use = fn->first_use; use != NULL; use = use->next) {
CallSite call = classify_callsite(use);
if (!call.valid)
continue;
for (unsigned i = 0; i < fn->arg_count; ++i) {
if (!is_generic_pointer(fn->arg[i].type))
continue;
AddressVote vote = infer_argument_address_space(call.arg[i]);
votes[i] = meet_address_votes(votes[i], vote);
}
}
if (!has_resolved_specialization(votes, fn->arg_count))
continue;
Function *target = fn;
if (!can_rewrite_in_place(fn, votes)) {
if (!clone_allowed(&clone_budget))
continue;
target = clone_for_address_spaces(fn, votes);
mark_internal_inline_candidate(target);
changed = true;
}
changed |= rewrite_matching_calls(fn, target, votes);
changed |= resolve_return_address_space(target);
for (Function *caller = first_affected_caller(fn);
caller != NULL;
caller = next_affected_caller(fn, caller)) {
if (specialization_candidate(caller))
worklist_push(&wl, caller);
}
}
return changed;
}
The clone budget has three modes:
| Budget | Meaning |
|---|---|
| -1 | Unlimited cloning. |
| 0 | Disable cloning; only in-place rewrites can happen. |
| Positive N | Permit at most N clone attempts before suppressing further clones. |
The counter is attempt-based rather than success-based. This prevents recursive or ambiguous call graphs from retrying indefinitely.
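One way to realize the three budget modes is a counter that is charged on every attempt. The sketch below is an assumption about how clone_allowed could behave, not a transcription of the binary; in particular it treats every negative value like -1, which the source does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when a clone attempt may proceed.
 * A positive budget is decremented on every attempt, successful or not. */
static bool clone_allowed(int *budget) {
    assert(budget != NULL);
    if (*budget < 0)        /* -1: unlimited (assumption: any negative value) */
        return true;
    if (*budget == 0)       /* 0: cloning disabled, or budget already spent */
        return false;
    --*budget;              /* charge the attempt itself */
    return true;
}
```

Because an exhausted positive budget decays to 0, the pass falls back to the same in-place-only behaviour as an explicit 0 setting. A call graph that keeps re-enqueuing the same callee therefore stops cloning after N attempts, regardless of how many of them succeeded.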
Call-Site Retargeting
Retargeting is not a textual rename. The pass builds a replacement call with operands converted to the specialized address spaces, inserts it before the original call, rewires every use of the original result, then erases the old call.
bool rewrite_matching_calls(Function *old_fn,
Function *new_fn,
const AddressVote *votes) {
bool changed = false;
for (CallSite call = first_callsite(old_fn);
call.valid;
call = next_callsite(old_fn, call)) {
if (!call_matches_votes(call, votes))
continue;
Value *new_args[MAX_ARGS];
for (unsigned i = 0; i < call.arg_count; ++i)
new_args[i] = convert_arg_for_vote(call.arg[i], votes[i]);
CallInst *replacement = build_call_before(call.inst, new_fn, new_args);
replace_all_uses_with(call.inst, replacement);
erase_instruction(call.inst);
changed = true;
}
return changed;
}
Return values are resolved through the same lattice. If every return instruction produces a pointer in the same concrete address space, the result type can be treated as address-space-resolved by later passes.
Diagnostics and Knobs
The implementation has debug output for the force-inline reason and for interprocedural memory-space specialization. The useful user-facing controls are the IPMSP dump switch and clone-budget switch. A reimplementation should provide equivalent observability: initial worklist size, clone suppression, affected caller count, and successful return-address-space resolution are the events needed to debug this pass family.
LoopIdiomVectorize + Divergent-Target Gate
Abstract
Tileiras carries LLVM's LoopIdiomVectorize pass alongside a nearby NVIDIA legality check in LoopVectorize. They solve different problems. LoopIdiomVectorize recognizes scalar byte-search loops and rewrites them into masked or VP-predicated vector form — naturally SIMT-friendly because it predicates individual lanes instead of cloning scalar and vector loop versions behind a runtime branch.
The divergent-target gate belongs to LoopVectorize, not LoopIdiomVectorize. It prevents the ordinary runtime-pointer-check path from versioning a loop when the target may execute branch paths divergently. Responsibility splits cleanly: LoopVectorize refuses a versioning strategy that would require a uniform runtime check, while LoopIdiomVectorize remains available for idioms expressible with per-lane masks.
Both pieces live on one page because together they answer one user-facing question — "why did my loop vectorize here but not there on this SIMT target?" Without seeing both the predicated LIV path (which works on a SIMT target) and the runtime-versioning veto (which does not), an upstream-LLVM reader cannot reconcile the contradictory reports. The page therefore reads top-down: what LIV is, what CantVersionLoopWithDivergentTarget is, and why these two strategies sit on opposite sides of the divergence question.
LoopIdiomVectorize Role
LIV walks loops looking for a small set of scalar idioms whose control flow can be represented as vector compares, masks, and reductions. Its target interaction is limited to the normal cost model: it asks the target which vector width and predication style are profitable, then emits generic vector IR. It never asks whether the target has branch divergence, because it never introduces a uniform runtime-versioning branch.
The pass lives in the ordinary LLVM optimization pipeline under the canonical pass name loop-idiom-vectorize. It is not a CUDA-tile-only pass; treat it as an inherited LLVM mid-end transformation with NVIDIA target-cost participation.
Three idiom expanders
| Idiom | Expansion shape | Distinguishing IR names |
|---|---|---|
| byte.compare | Builds a masked byte comparison and feeds the result into the shared mismatch machinery. | byte.compare |
| find_first_vec | Creates a vector header, match check, and lane calculation for the first matching byte. | scalar_preheader, find_first_vec_header, match_check_vec, calculate_match, needle_check_vec |
| mismatch_vec_loop | Builds the vector mismatch loop, found predicate, index calculation, and final LCSSA values. | mismatch_vec_loop_pred, mismatch_vec_index, mismatch_vec_found_pred, mismatch_vec_found_index |
The user-facing controls match LLVM's pass-level knobs:
- disable-loop-idiom-vectorize-all
- disable-loop-idiom-vectorize-bytecmp
- loop-idiom-vectorize-bytecmp-vf
- disable-loop-idiom-vectorize-find-first-byte
- loop-idiom-vectorize-style=none|masked|predicated
- loop-idiom-vectorize-verify
No separate NVIDIA-only option exists for the LIV idiom recognizer. NVIDIA's behavior comes from the target cost model and from the adjacent LoopVectorize legality gate.
Divergent-Target Gate
LoopVectorize evaluates the gate before accepting a plan that needs runtime pointer checks. Such checks usually create a version branch: one side runs a vectorized loop under a no-alias assumption, the other a fallback loop. On a SIMT target the branch condition has to be uniform across the warp. If the target can diverge and the loop still needs runtime pointer checks, NVIDIA's legality hook rejects the plan and emits an optimization remark.
The observable remark uses three stable pieces of text:
- remark name: CantVersionLoopWithDivergentTarget
- pass name: Not inserting runtime ptr check for divergent target
- message: runtime pointer checks needed. Not enabled for divergent target
static bool can_version_loop_on_target(const Loop *loop,
const TargetTransformInfo *tti,
const RuntimePointerChecks *checks) {
if (tti_has_branch_divergence(tti) && runtime_pointer_checks_needed(checks, loop)) {
emit_loop_vectorize_remark(loop,
"CantVersionLoopWithDivergentTarget",
"Not inserting runtime ptr check for divergent target",
"runtime pointer checks needed. Not enabled for divergent target");
return false;
}
return true;
}
This is why the two pieces coexist without conflict. LoopVectorize refuses only the specific runtime-versioning strategy that is unsafe for a divergent target. LIV remains free to transform recognized idioms because its output uses masks and predicates, not a branch whose condition must be uniform.
LowerMatrix + mfadd
Abstract
Tileiras includes LLVM's target-independent LowerMatrixIntrinsics pass, which handles @llvm.matrix.* intrinsics: it verifies matrix shapes when requested, performs the transpose peephole (transpose A) + (transpose B) -> transpose(A + B), gathers pass statistics, and lowers remaining matrix intrinsics into ordinary scalar and vector IR.
The correction worth flagging: mfadd and mfadd_t are not NVIDIA instructions, NVPTX opcodes, or private intrinsics. They are SSA value-name prefixes created by the upstream LLVM matrix pass while rewriting transposed additions. CUDA tensor-core paths such as WMMA, WGMMA, and tcgen05.mma use the NVVM intrinsic family and NVPTX instruction selection — they never go through this generic matrix-lowering pass.
This page exists because the mfadd string surfaces in the binary's rodata between strings that do belong to NVIDIA-private tensor-core paths, and an unwary cross-reference would wire it to the wrong subsystem. The appearance of mfadd and mfadd_t in the binary is proof that the upstream LowerMatrixIntrinsics pass is linked in, not proof of a custom WMMA path. For the actual NVIDIA tensor-core lowering see the WMMA / WGMMA / tcgen05.mma sections under codegen — nothing on this page applies.
Attribution Correction
An earlier working note treated "mfadd", "mfadd_t", and the diagnostic "Matrix shape verification failed, compilation aborted!" as NVIDIA-internal additions. That attribution is wrong. Those names and diagnostics belong to upstream LLVM's LowerMatrixIntrinsics.cpp.
The mistake matters for documentation and reimplementation. A tileiras-compatible frontend needs no NVIDIA-specific "mfadd" operation — only the normal upstream matrix pass behavior when @llvm.matrix.* intrinsics are present, with CUDA tensor-core lowering routed through the NVVM/NVPTX intrinsic path.
Provenance
The pass body in tileiras is pinned to three public llvm/llvm-project commits authored by Florian Hahn (fhahn@apple.com, Apple). Each commit landed before the upstream snapshot that tileiras statically links. The table records the commit hashes, dates, subjects, and the strings each commit introduced.
| Commit | Date | Subject | String(s) introduced |
|---|---|---|---|
| da09b35334ab | 2022-11-28 | [Matrix] Optimize matrix transposes around additions | "mfadd", "mfadd_t" (Aᵀ+Bᵀ → (A+B)ᵀ rewrite) |
| f10153fe9150 | 2023-04-21 | [Matrix] Handle integer types when distributing transposes across adds | none new; extends the same rewrite from FAdd to integer Add |
| 0e8717f71198 | 2023-05-13 | [Matrix] Add shape verification | verify-matrix-shapes cl::opt, "Conflicting shapes (", ") for ", "Matrix shape verification failed, compilation aborted!" |
The first commit added the peephole that rewrites a pair of transposed operands feeding an FAdd into one FAdd named "mfadd" and one outer transpose named "mfadd_t". The second commit broadened the same rewrite to handle integer Add as well as FAdd, so the same "mfadd" / "mfadd_t" value names are reused on the integer path. The third commit added the VerifyShapeInfo debug pass and its report_fatal_error diagnostic, gated by the hidden cl::opt named verify-matrix-shapes that defaults off in upstream. The tileiras snapshot post-dates all three commits, so the strings reach the binary by direct inclusion of the upstream pass, not by patch.
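The peephole is sound because transposition distributes over element-wise addition. The sketch below checks that identity numerically on a small matrix. It is only an illustration: the helper names are hypothetical, the storage is row-major for readability (the pass itself lowers to column-major form, but the identity does not depend on layout), and MAX_ELEMS is an arbitrary scratch-buffer bound.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_ELEMS = 64 };  /* illustrative scratch-buffer bound */

/* out (cols x rows) receives the transpose of in (rows x cols), row-major. */
static void transpose(const float *in, float *out, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
}

static void add(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Returns 1 when transpose(A) + transpose(B) equals transpose(A + B). */
static int transposed_add_matches(const float *a, const float *b,
                                  size_t rows, size_t cols) {
    size_t n = rows * cols;
    assert(n <= MAX_ELEMS);
    float at[MAX_ELEMS], bt[MAX_ELEMS], lhs[MAX_ELEMS];
    float sum[MAX_ELEMS], rhs[MAX_ELEMS];

    transpose(a, at, rows, cols);      /* original form: two transposes ... */
    transpose(b, bt, rows, cols);
    add(at, bt, lhs, n);               /* ... feeding one add */

    add(a, b, sum, n);                 /* rewritten form: the "mfadd" add ... */
    transpose(sum, rhs, rows, cols);   /* ... then one "mfadd_t" transpose */

    for (size_t i = 0; i < n; ++i)
        if (lhs[i] != rhs[i])
            return 0;
    return 1;
}
```

Both forms add the same pair of source elements for every output position, so the results are bit-identical and the rewrite saves one transpose without changing any value.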
mfadd identity
"mfadd" is not an NVPTX instruction mnemonic, not an LLVM intrinsic ID, and not a target opcode of any kind. It is a literal llvm::Twine Name argument passed to IRBuilder::CreateFAdd (and on the integer path to IRBuilder::CreateAdd) inside the pass's OptimizeTransposes sweep. LLVM uses the Twine to build the SSA value name of the new Instruction, so the rewritten IR carries the prefix %mfadd on the fused FAdd and %mfadd_t on the outer @llvm.matrix.transpose call. Both are ordinary IR values; the pass then re-runs its own lowering on the freshly minted transpose, converting it into the column-by-column scalar/vector form (col.load / vec.start / vec.gep). The transform is a pure target-independent IR peephole: (transpose A) + (transpose B) becomes transpose(A + B), halving the number of materialized transposes when both operands of an FAdd are transposed views of equally shaped matrices.
Pass Pipeline
Read the implementation as five semantic phases, not as a collection of binary entry points:
| Phase | Role | User-visible artifacts |
|---|---|---|
| Shape verification | Walk the shape map and reject incompatible matrix dimensions when enabled. | verify-matrix-shapes, Conflicting shapes, fatal shape-verification diagnostic |
| Transpose optimization | Rewrite (A^T + B^T) into (A + B)^T for compatible shapes. | mfadd, mfadd_t, NumExposedTransposes |
| Pass setup and accounting | Register and update matrix-lowering statistics. | matrix-lowered, NumStores, NumLoads, NumComputeOps, NumFPOps |
| Top-level driver | Sequence verification, transpose optimization, optional dumps, and final lowering. | matrix-print-after-transpose-opt |
| Column-major lowering | Replace matrix intrinsics with loads, GEPs, shuffles, and arithmetic. | col.load, vec.start, vec.gep, result.vec. |
No matrix-shaped intrinsic remains by the time control returns to the pass manager. NVPTX instruction selection sees ordinary IR: loads, address arithmetic, fmul/fadd or integer arithmetic, vector shuffles, and stores.
Performance and Cost Model
Abstract
Tileiras turns "make this kernel fast" into three coordinated decisions. The modulo scheduler chooses a steady-state initiation interval and seats every op into a Resource Reservation Table. The layout selector picks SMEM swizzles, register fragments, and TMEM atoms for each pipelined value. The per-architecture atom catalog binds the dialect-level operation roster to concrete hardware costs — TMA bytes per transfer, WGMMA shape and accumulator stride, tcgen05 column count, transport-row occupancy. The three layers feed a layered cost model that ranks placements lexicographically and rejects illegal ones outright.
The cost model is not a black box. Every term in it traces back to a concrete hardware constraint: a row bit in the Blackwell 15-slot vocabulary, a capacity pool cap, an SMEM bank-conflict count, a TMEM cycles-per-row charge. A reader who understands those four sources understands why one schedule wins over another at the same --opt-level.
Cost Vocabulary
The compiler tracks seven categories of cost. Each enters the pipeline at a different stage and serves a different downstream consumer; code size is the exception, since it is observable in the output but not modeled.
| Category | Unit | Where it enters | Consumed by |
|---|---|---|---|
| latency | cycles | per-op latency table seeded by the resource constraint builder | modulo scheduler MII bound, structural distance closure |
| resource pressure | row bit + pool count | per-op footprint in the Resource Reservation Table | placement admission probe |
| register pressure | live virtual registers | layout selector cost layer 3 | layout choice and spill avoidance |
| SMEM bytes | bytes per CTA | buffer assignment and TMA descriptor builder | capacity pool index 5 (singleton SMEM lock), 227 KiB ceiling |
| TMEM columns | columns per allocation | tcgen05 atom verifier | capacity pool index 1 (TMEM bank pressure, cap = 4) |
| bank conflicts | 32-byte transactions | layout selector layer 3 SMEM term | tie-break in three-term layout score |
| code size | PTX bytes | NVPTX AsmPrinter | not modeled — observable only in the emitted text |
Latency is the most pervasive term because it feeds both the modulo scheduler's lower-bound calculation and the structural distance matrix that the cost-based fallback walks. The 23-entry per-op latency table at Blackwell Pipeline 15-Slot Model — Per-Op Latency Table is the primary source; dependence-edge latency adds to it through the Floyd-Warshall closure.
Resource pressure and register pressure are independent axes. A schedule can be resource-feasible (every modulo cycle clears its row probe) while still spilling registers to local memory because the assigned layouts demand more fragments than the SM offers. The two axes do not trade against each other inside the modulo scheduler — register pressure is decided one layer earlier by the layout selector and presented to the scheduler as a fixed input.
Lexicographic Cost Vector
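Tileiras decides hard constraints before soft ones at both the scheduler and the layout selector. In the scheduler this is a strict lexicographic comparison, not a weighted sum: a candidate that improves a softer term at the price of a harder one always loses. The layout selector applies its hard filters first and uses a weighted sum only across candidates that are already legal.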
Scheduler: Four-Component Vector
The modulo scheduler's cost-based generator ranks candidates by a four-component vector. Component 1 is a hard gate — a candidate that fails it scores (∞, *, *, *) and is rejected before any later component matters.
| Position | Component | Source | Role |
|---|---|---|---|
| 1 | resource legality | RRT row-OR test plus capacity-pool caps | hard admission gate |
| 2 | pipe-slot pressure | structural distance matrix produced by the SSE2-unrolled Floyd-Warshall | hard gate on dependence reach |
| 3 | bank pressure | dual-RRT probe of in-iteration vs cross-iteration occupancy | preference signal |
| 4 | structural distance | Kendall-tau inversion count vs original program order | tie-breaker |
The two evaluators that produce components 2 and 3 share an exponential-then-binary search driver — the driver doubles its cost threshold until the candidate first becomes feasible, then binary-searches for the smallest feasible threshold. The threshold is the candidate's score on that component; the cost reducer picks the candidate with the minimum across all candidates at the current cycle. Schedule Solve and Cost Evaluators — Worked Scoring Example walks the cost vector through a four-op loop body.
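The exponential-then-binary driver is a standard search for the smallest value at which a monotone predicate becomes true. The sketch below assumes feasibility is monotone in the threshold (infeasible below some value, feasible at and above it) and that thresholds start at 1. The callback type, the limit parameter, and the at_least example predicate are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*FeasibleFn)(uint64_t threshold, void *ctx);

/* Returns the smallest threshold in [1, limit] accepted by `feasible`,
 * or 0 when nothing up to `limit` is feasible. Requires a monotone predicate. */
static uint64_t smallest_feasible_threshold(FeasibleFn feasible, void *ctx,
                                            uint64_t limit) {
    assert(feasible != NULL && limit >= 1);
    uint64_t lo = 0;   /* largest threshold known infeasible (0 = none probed) */
    uint64_t hi = 1;

    /* Exponential phase: double until the first feasible probe. */
    while (!feasible(hi, ctx)) {
        if (hi >= limit)
            return 0;
        lo = hi;
        hi = (hi > limit / 2) ? limit : hi * 2;
    }

    /* Binary phase: lo is infeasible, hi is feasible; close the gap. */
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (feasible(mid, ctx))
            hi = mid;
        else
            lo = mid;
    }
    return hi;
}

/* Example predicate: feasible once the threshold reaches *ctx. */
static bool at_least(uint64_t threshold, void *ctx) {
    return threshold >= *(const uint64_t *)ctx;
}
```

With a true minimum of 37, the doubling phase probes 1, 2, 4, 8, 16, 32, 64 and the binary phase then narrows the interval (32, 64] down to 37. The total probe count is logarithmic in the answer, which is the reason to double first when no upper bound is known in advance.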
Layout Selector: Three-Term Additive
The layout selector — D14 in the TileAS pipeline — uses a different cost shape because layouts are evaluated independently per pipeline alias group rather than against a shared modulo cycle. The structural filter at layer 2 enforces hard constraints (operand shape, memory space, alignment), so layer 3 sees only legal candidates and can sum cost terms with a fixed weight vector:
score = w_smem · smem_transactions
+ w_tmem · tmem_cycles
+ w_reg · live_registers
The default weight vector is (1, 4, 0.25) on (SMEM, TMEM, registers). Ties break on register pressure first, then on SMEM bank-conflict count — see TileAS Layout and Buffer Family — Three-Layer Cost Model.
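The additive score and its tie-breaks can be sketched as follows. This assumes the default weights (1, 4, 0.25) quoted above, reads "register pressure first" as "fewer live registers wins", and uses a hypothetical LayoutCandidate struct in place of the real candidate record.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate record; all fields are counts, lower is better. */
typedef struct {
    unsigned smem_transactions;
    unsigned tmem_cycles;
    unsigned live_registers;
    unsigned bank_conflicts;
} LayoutCandidate;

/* Default weights from the text: (SMEM, TMEM, registers) = (1, 4, 0.25). */
static double layout_score(const LayoutCandidate *c) {
    return 1.0 * c->smem_transactions
         + 4.0 * c->tmem_cycles
         + 0.25 * c->live_registers;
}

/* True when a should be preferred over b. */
static int layout_better(const LayoutCandidate *a, const LayoutCandidate *b) {
    double sa = layout_score(a), sb = layout_score(b);
    if (sa != sb)
        return sa < sb;
    if (a->live_registers != b->live_registers)      /* first tie-break */
        return a->live_registers < b->live_registers;
    return a->bank_conflicts < b->bank_conflicts;    /* second tie-break */
}

/* Index of the preferred candidate; requires count >= 1. */
static size_t pick_layout(const LayoutCandidate *cands, size_t count) {
    assert(cands != NULL && count >= 1);
    size_t best = 0;
    for (size_t i = 1; i < count; ++i)
        if (layout_better(&cands[i], &cands[best]))
            best = i;
    return best;
}
```

Candidates (16, 0, 32), (8, 2, 32), and (4, 1, 64) on (SMEM, TMEM, registers) all score 24. The third loses the first tie-break on its 64 live registers, and the remaining two are separated only by their bank-conflict counts.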
The two cost shapes coexist because they answer different questions. The scheduler's lexicographic vector ranks placements against a fixed II where any single resource overcommit kills the candidate. The layout selector's additive score ranks layouts whose legality has already been proven, so summing costs across orthogonal axes is sound.
Roofline Reasoning
The number of pipeline stages a software-pipelined loop needs is fundamentally a roofline calculation. The compiler computes it as:
stage_count = ceil(memory_cycles / compute_cycles)
where memory_cycles is the wall-clock cost of the loop's heaviest async load and compute_cycles is the cycle cost of the compute that consumes it. The two numbers bound the pipeline from opposite sides: the modulo scheduler's II is the floor (how often a new iteration starts) and the stage count is the ceiling (how many in-flight iterations the steady state holds).
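With integer cycle counts the ceiling reduces to a round-up division. The helper below is a minimal sketch of that arithmetic only; the function name is hypothetical and it does not model any synchronisation overhead added on top of the ratio.

```c
#include <assert.h>

/* stage_count = ceil(memory_cycles / compute_cycles), in integer arithmetic. */
static unsigned stage_count(unsigned memory_cycles, unsigned compute_cycles) {
    assert(compute_cycles != 0);
    return (memory_cycles + compute_cycles - 1) / compute_cycles;
}
```

A load that costs 8 cycles against 8 cycles of compute gives 1 stage, while 15 cycles of staged transfer against the same 8 cycles of compute gives 2. Any ratio above a whole number rounds up, because a partially hidden load still needs one more iteration in flight.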
Worked Example 1: TMA Load + WGMMA on Hopper
A typical SM90 mainloop loads a (128, 64) tile through TMA and consumes it through wgmma.mma_async.m64n128k16:
| Op | Slot | Cycles | Role |
|---|---|---|---|
| cp.async.bulk.tensor (TMA) | tma + tp_smem_wr | 8 | descriptor + SMEM write transport |
| wgmma.mma_async | tc_and_mma + tp_mma | 8 | TC issue + MMA transport |
Both ops occupy 8 cycles on Hopper. The roofline ratio is 8 / 8 = 1, so stage_count = ceil(1) = 1. A single in-flight iteration is enough to keep the WGMMA pipe fed — the compiler does not need to overlap multiple loads with multiple computes, because the load finishes in exactly the time the compute takes.
In practice the compiler still schedules two stages, because the TMA descriptor commit (cp.async.bulk.commit_group) and the WGMMA fence (wgmma.fence) introduce a half-cycle of synchronisation overhead. The modulo scheduler models this by promoting the producer-consumer pair to stage (0, 1) rather than collapsing them into stage (0, 0).
Worked Example 2: TMA Load + tcgen05 on Blackwell
The same mainloop on Blackwell uses tcgen05 instead of WGMMA, and the cost picture changes. The accumulator now lives in tensor memory, not in registers:
| Op | Slot | Cycles | Role |
|---|---|---|---|
| cp.async.bulk.tensor (TMA) | tma + tp_smem_wr | 8 | descriptor + SMEM write transport |
| cp.async.tcgen05 (SMEM→TMEM) | tp_tmem_wr | 7 | tensor-memory write transport |
| tcgen05.mma | tc_and_mma + tp_mma | 8 | TC issue + MMA transport |
| tcgen05.ld (TMEM→reg) | tp_tmem_rd | 7 | tensor-memory read transport (epilogue only) |
The load-to-compute path is now a chain of three transports — SMEM write, TMEM write, MMA issue — each consuming a different singleton transport row. The roofline ratio is (8 + 7) / 8 = 1.875, so stage_count = ceil(1.875) = 2. The compiler must keep at least two iterations in flight to hide the SMEM→TMEM staging.
The capacity pool at index 1 caps in-iteration TMEM bank pressure at 4 (see Blackwell Pipeline 15-Slot Model — Pool Capacity Vector). Two in-flight iterations consume two banks each on average, leaving headroom; a kernel that tried for four stages would saturate the TMEM bank cap and the cost-based generator would reject the placement at component 1.
The 5000-cycle HBM3e ceiling at Blackwell Pipeline 15-Slot Model — Cycle Anchor Table is the absolute round-trip budget the scheduler attributes to a worst-case far-memory dependence. If the loop body's accumulated latency exceeds 5000 cycles before the dependence closes, the candidate is rejected before any RRT probe runs — the Big-M term acts as a hard ceiling on pipeline depth.
Opt Level Table
Each --opt-level=N selects a different pass pipeline; see Pass List by Optimization Level for the full per-level pass roster. The performance-relevant compile-time and runtime trade-offs:
| Level | Compile-time cost | Runtime quality | Used for |
|---|---|---|---|
| O0 | minimal (verify-only) | not runnable; IR stays in cuda_tile | bytecode round-trip validation |
| O1 | low (one dialect hop) | not runnable; IR stays in TileAA | front-end debugging |
| O2 | moderate (full scheduler) | production-grade with default placement | default production builds |
| O3 | high (full conversion stack) | production-grade with full kernel-ABI legalisation | non-debug production builds, the only level that exercises every NVVM target attachment |
The modulo scheduler is the dominant pass at both O2 and O3 — the difference between the two is not scheduling quality but the breadth of dialects converted to LLVM. The cost-based generator runs at both levels; its cost vector is identical. A kernel that scheduled well at O2 will schedule identically at O3 because the schedule analysis is computed once and reused.
Warp-specialised scheduling is a layered adder, not a level. When pipeline-strategy=warp-specialize is set, the adder replaces the modulo-schedule stage with a warp-specialisation pipeline that partitions the loop body across agents. The light variant (when rrt-size-threshold=0) inserts boundaries and barriers without scheduling; the heavy variant runs the full modulo scheduler against agent-partitioned RRTs. The choice is independent of opt level above O1.
Performance-Critical Tunables
A handful of environment variables and cl::opt flags directly shift the cost model's behaviour. The complete inventory is in Environment Variable and Runtime Gate Catalog; the performance-relevant subset is:
| Tunable | Default | Effect on cost model |
|---|---|---|
| TILE_AS_DEBUG_UNLIMITED_SMEM | unset | raises the 227 KiB SMEM ceiling at capacity pool index 5 to INT_MAX; used to isolate whether a placement failed on SMEM pressure or on a different resource |
| TILEIR_PREFER_TMA_FOR_LOAD_STORE | "false" | when "true", biases the layout selector toward cp.async.bulk.tensor atoms whose register-pressure cost is zero |
| TILEIR_ALWAYS_SWIZZLE | unset | forces the swizzled layout regardless of the layer-3 cost score; diagnostic only |
| TILEIR_DELAY_TMA_STORE_WAIT | unset | defers the cp.async.bulk.wait_group barrier after a TMA store, raising effective bandwidth at the cost of correctness margin |
| --max-chain-length | 64 | caps the IDPA chain length; longer chains let the LLVM-tier optimizer fuse more FMA-style operations but raise compile-time cost |
| --do-base-address-strength-reduce | 4 | BASR master, 0..4; higher levels enable more aggressive base-address strength reduction at the LLVM tier |
| --scev-cgp-inst-limit | 500 | caps the SCEV-CGP instruction budget; a higher limit lets SCEV-driven code-generation preparation run more aggressively but extends compile time linearly |
| --rrt-size-threshold (warp-spec) | varies | switches between light and heavy warp-specialisation variants based on RRT size |
TILE_AS_DEBUG_UNLIMITED_SMEM is the most surgical diagnostic switch — it isolates whether a placement failed because of SMEM byte pressure (pool 5) or because of a different resource. Setting it temporarily and rerunning the same kernel will produce identical schedules if SMEM was not the binding constraint, and a different schedule if it was.
Performance Gotchas
Five anti-patterns appear repeatedly in kernels that schedule worse than expected.
Register pressure spilling to LMEM. The layout selector's layer-3 register-pressure term scores live registers across the atom's window, but it does not see the whole-kernel register budget. A kernel that picks low-cost ldmatrix.sync atoms throughout (each pays 32–64 register fragments) can accumulate live ranges that exceed the SM's 256 KiB register file (64K 32-bit registers), forcing spills to local memory. The spills emit st.local and ld.local instructions that the NVPTX backend cannot eliminate. The fix is to bias the layout selector with TILEIR_PREFER_TMA_FOR_LOAD_STORE=true so memory-resident atoms (zero register cost) win the cost score for the largest tiles.
SMEM bank conflicts. The layer-3 SMEM cost term counts conflict-free transactions for the chosen swizzle, but the scorer cannot see across pipeline alias groups. Two independent groups can each pick a low-conflict swizzle internally and still collide at the cycle level if their footprints overlap on the same banks. The modulo scheduler's component-3 bank-pressure term catches the collision after the fact, but the only fix is to rerun the layout selector with a different swizzle hint or to raise the alignment on the offending operand.
Sync overhead. Named-barrier slots are capped at 3 across iterations (pool index 6, cross-iteration carry). A kernel that uses bar.sync.named for every producer-consumer handoff exhausts the cap at two-stage pipelines and the cost-based generator drops back to the trivial fallback. The fix is to coalesce barriers — multiple producer-consumer pairs can share one named barrier when their handoff cycles align.
Pipeline staging miscount. The roofline formula stage_count = ceil(memory_cycles / compute_cycles) assumes both numerator and denominator are dominated by the heaviest op. A loop body with a long-latency cvta.to.global or an unfolded address computation can extend the compute denominator without contributing usefully to throughput, lowering the apparent ratio and the chosen stage count. The fix is to inspect the per-op latency table charges at Blackwell Pipeline 15-Slot Model — Per-Op Latency Table and check whether non-load ops are inflating the compute side.
TMEM column over-allocation. The tcgen05 allocator partitions 512 columns × 128 lanes of tensor memory per SM. A kernel that allocates wide accumulators (e.g., tensor<256x256xf32> split across two CTAs) can consume more columns than the per-SM budget when the CTAs co-reside, and the modulo scheduler rejects the placement at capacity pool index 1 (TMEM bank pressure, cap = 4). The fix is either to lower the accumulator precision (FP16 instead of FP32) or to split the kernel into separate CTAs that do not co-reside.
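The pipeline staging miscount is easy to reproduce numerically. The sketch below evaluates the roofline formula stage_count = ceil(memory_cycles / compute_cycles) with integer arithmetic; the function name and the cycle counts are illustrative assumptions, not values taken from the per-op latency table.

```cpp
#include <cassert>
#include <cstdint>

// Roofline stage count as stated in the text:
//   stage_count = ceil(memory_cycles / compute_cycles)
// Integer ceiling division avoids floating-point rounding at exact multiples.
// The name and the inputs are hypothetical; tileiras's internal routine is not shown here.
inline std::uint32_t stage_count(std::uint32_t memory_cycles,
                                 std::uint32_t compute_cycles) {
    assert(compute_cycles > 0 && "compute side must be non-zero");
    return (memory_cycles + compute_cycles - 1) / compute_cycles;
}
```

With 1200 memory cycles against 400 compute cycles the formula picks three stages. If an unfolded address computation adds 250 cycles to the compute side, stage_count(1200, 650) drops to two even though the added cycles do no useful work, which is the miscount described above.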
Performance-Analysis Workflow
Performance debugging in Tileiras follows a four-step trace from the highest-level scheduler decisions down to the assembled PTX.
Step 1: dump the schedule. Setting --schedule-trace-file=<path> writes the per-op stage and order assignments, the chosen II, and the placement-arm sequence that produced each placement. The trace also records the cost vector for the cost-based fallback when it runs, which surfaces the dominant cost component. A kernel that fails to schedule emits a diagnostic naming the binding constraint — the slot, the pool, or the cycle ceiling.
Step 2: inspect the snapshot IR. Both O2 and O3 provide an optional snapshot printer between the heaviest lowering and CSE. The snapshot is the natural inspection window for layout decisions — every tiled_load and tiled_store carries its assigned nv_tileas.layout attribute, and a layout-induced bank conflict is visible as a swizzle that does not match the surrounding pattern. The snapshot also shows which pipe values the scheduler emitted; a missing Pipe_ between an expected producer-consumer pair indicates that Schedule::solve fell back to the trivial zero-producer path.
Step 3: read the PTX. The NVPTX backend writes state-space-qualified memory instructions (ld.global, ld.shared, ld.tmem via tcgen05.ld). A kernel that schedules well at the MLIR level can still emit suboptimal PTX if the AsmPrinter chose a wide instruction where a narrow one would suffice, or if address-space promotion left a cvta.to.global that better refinement would have eliminated. The PTX text is also where the launch-bound directives surface — .maxntid and .reqntid collisions, parameter-space size, register count.
Step 4: profile with Nsight Compute (nvprof does not support GPUs of compute capability 8.0 and above, so it cannot profile Tileiras's targets). Once the PTX is in cubin form, runtime profiling closes the loop. The metrics that map directly back to Tileiras's cost terms are smsp__inst_executed_pipe_tensor_op_hmma.sum (TC issue throughput, component 2 in the scheduler's vector), l1tex__data_bank_conflicts_pipe_lsu.sum (SMEM bank conflicts, component 3), smsp__inst_executed_pipe_lsu.sum divided by smsp__cycles_active.avg (transport pressure, component 1's RRT row-OR test), and sm__warps_active.avg.pct_of_peak_sustained_active (occupancy, the ratio that drives the stage-count ceiling). A mismatch between predicted and measured throughput points to the cost-model term that was wrong, usually a latency the per-op table charged but the hardware did not, or a bank conflict the layer-3 scorer missed.
Cross-References
Pass List by Optimization Level documents which passes each level runs and the IR shape at every stage boundary. Schedule Solve and Cost Evaluators walks the four-component cost vector through a concrete scoring example. Blackwell Pipeline 15-Slot Model documents the slot vocabulary, the latency families, and the capacity pools the cost model reads. TileAS Layout and Buffer Family documents the three-term layout selector cost. Environment Variable and Runtime Gate Catalog lists every tunable that shifts the model.
Fast-Math and Numerical Precision
Abstract
Tileiras lets the user trade floating-point correctness for performance along five orthogonal axes: per-op fast-math flags on arithmetic operations, flush-to-zero control on division and transcendentals, approximate-versus-exact intrinsic selection in libdevice, narrow-precision FP8/FP4 arithmetic with explicit cast semantics, and block-scaled formats that share one exponent across a value group. Each axis is controlled by an attribute or reflect key that travels through the lowering pipeline; the final PTX modifier or hardware intrinsic is chosen at NVVM-to-PTX emission time.
The axes compose, but they do not commute. A function-level FTZ promise interacts with per-op arcp; an approximate intrinsic selected by afn is still subject to FTZ at the instruction modifier; FP8 casts must round under a rounding mode that may differ from the surrounding fast-math context. This page documents the legal compositions and the data flow that produces them.
The Fast-Math Flags
Tileiras carries the standard LLVM fast-math flag set as an MLIR attribute on each arithmetic op. The flag bits map one-to-one onto LLVM's FastMathFlags:
| Flag | Assumption | Optimization unlocked |
|---|---|---|
| nnan | result is not NaN | folds such as fcmp ord x, x → true |
| ninf | result is not infinity | infinity-arm removal in select chains |
| nsz | sign of zero is irrelevant | x - x → 0, 0 - x → -x without sign care |
| arcp | a/b may become a * (1/b) | reciprocal substitution and CSE on the reciprocal |
| contract | FMA fusion is permitted | a*b + c → fma(a,b,c) across a basic block |
| afn | approximate intrinsics are allowed | __nv_sqrtf may resolve to sqrt.approx.f32 |
| reassoc | algebraic reassociation is permitted | reduction-tree rebalancing, Horner reordering |
The aggregate flag fast is the bitwise OR of all seven. Tileiras frontends emit individual flags rather than the aggregate, which lets later passes turn one bit off without losing the others.
The flag bits are not advisory. Each downstream consumer reads the exact bit that authorises its rewrite: the FMA former reads contract, the reciprocal pass reads arcp, the libdevice resolver reads afn. A flag missing from the op blocks the rewrite even when the surrounding context is fast.
bool can_fuse_to_fma(Operation *mul, Operation *add) {
    if (!single_use_chain(mul, add)) return false;
    return has_fastmath_flag(mul, FMF_CONTRACT)
        && has_fastmath_flag(add, FMF_CONTRACT);
}

bool can_use_approx_sqrt(Operation *call) {
    return has_fastmath_flag(call, FMF_AFN);
}
The lattice across the pipeline is monotone: passes may drop bits when they cannot prove the assumption is preserved (typically across a control-flow merge that joins fast and slow operands), but they do not add bits. The frontend is the sole producer of fast-math flags.
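The drop-only rule can be modelled with a plain bitmask. The sketch below is a minimal illustration: the bit positions and helper names are assumptions made for the example, and only the shape (seven flags, fast as their union, merges that keep the intersection) comes from the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignment for the seven flags; the real encoding may differ.
enum FastMathFlag : std::uint8_t {
    FMF_NNAN     = 1u << 0,
    FMF_NINF     = 1u << 1,
    FMF_NSZ      = 1u << 2,
    FMF_ARCP     = 1u << 3,
    FMF_CONTRACT = 1u << 4,
    FMF_AFN      = 1u << 5,
    FMF_REASSOC  = 1u << 6,
};

// "fast" is the bitwise OR of all seven flags.
constexpr std::uint8_t FMF_FAST = FMF_NNAN | FMF_NINF | FMF_NSZ | FMF_ARCP |
                                  FMF_CONTRACT | FMF_AFN | FMF_REASSOC;

constexpr bool has_flag(std::uint8_t flags, FastMathFlag f) {
    return (flags & f) != 0;
}

// At a control-flow merge only the bits present on both inputs survive.
constexpr std::uint8_t merge_flags(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a & b);
}

// A pass is monotone if its output carries no bit its input lacked.
constexpr bool only_drops_bits(std::uint8_t before, std::uint8_t after) {
    return (after & ~before) == 0;
}
```

Merging a fully fast operand with one that carries only contract and afn leaves exactly those two bits, so the FMA former and the libdevice resolver may still fire while the reciprocal pass may not.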
FTZ — Flush to Zero
FTZ treats subnormal inputs and results as signed zero. On every NVIDIA GPU since Maxwell, FTZ is a per-instruction modifier rather than a global mode; the PTX form is .ftz appended to the mnemonic of mul, div, sqrt, rsqrt, and the transcendental family. The hardware path for FTZ-enabled instructions is one cycle faster on the f32 path on most architectures because it skips the subnormal handler.
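For readers who want to reason about FTZ on the host, the following value-level sketch mimics the sign-preserving flush on inputs and result. It is a model for building intuition, not the hardware path, and the helper names are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Replace a subnormal with a zero of the same sign; leave everything else alone.
inline float flush_to_zero(float x) {
    if (std::fpclassify(x) == FP_SUBNORMAL) return std::copysign(0.0f, x);
    return x;
}

// Host-side model of an FTZ multiply: flush both inputs, multiply, flush the result.
inline float mul_ftz(float a, float b) {
    float r = flush_to_zero(flush_to_zero(a) * flush_to_zero(b));
    assert(std::fpclassify(r) != FP_SUBNORMAL);  // postcondition of the model
    return r;
}
```

Under this model mul_ftz(1e-20f, 1e-20f) yields zero, because the exact product of about 1e-40 lies below the smallest normal f32, whereas an IEEE multiply would return a subnormal.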
FTZ has two control surfaces in tileiras:
| Surface | Scope | Source |
|---|---|---|
| function attribute denormal-fp-math | every op in the function | LLVM IR attribute populated from CLI --use-fast-math / -ftz=true |
| reflect key __CUDA_FTZ | libdevice bodies before linking | NVVMReflect var-map from CLI plus module metadata |
The function attribute has three legal values: ieee (no FTZ), preserve-sign (subnormals become signed zero), positive-zero (subnormals become +0 regardless of sign). NVPTX only emits .ftz PTX modifiers when the function-level attribute is preserve-sign and the op family has an FTZ form.
The reflect key drives a separate decision earlier in the pipeline. When NVVMReflect runs on a libdevice body whose control flow is gated on __nvvm_reflect("__CUDA_FTZ"), the fold collapses the FTZ arm or the non-FTZ arm before the body inlines into the caller. The two surfaces must agree at compile time: __CUDA_FTZ=1 with denormal-fp-math=ieee is rejected by the backend because the libdevice body selected the FTZ-aware intrinsic but the NVPTX emitter refuses to add .ftz to it.
PtxModifier select_ftz_modifier(Operation *op, FunctionAttrs attrs) {
    if (!op_family_supports_ftz(op->kind)) return MODIFIER_NONE;
    if (attrs.denormal_fp_math != DENORMAL_PRESERVE_SIGN) return MODIFIER_NONE;
    return MODIFIER_FTZ;
}
The 2x2 case matrix below shows the four legal combinations for f32 sqrt. The two mismatched combinations (an ieee function with __CUDA_FTZ=1 libdevice resolution, or a preserve-sign function with __CUDA_FTZ=0 libdevice resolution) are rejected at NVVMIRVerifier with a "function FTZ disagrees with libdevice FTZ" diagnostic.
| function | reflect | resolved op |
|---|---|---|
| ieee | __CUDA_FTZ=0 | nvvm.sqrt.rn.f (no .ftz) |
| ieee | (FTZ not consulted; __CUDA_PREC_SQRT=0) | nvvm.sqrt.approx.f (no .ftz) |
| preserve-sign | __CUDA_FTZ=1, __CUDA_PREC_SQRT=1 | nvvm.sqrt.rn.ftz.f |
| preserve-sign | __CUDA_FTZ=1, __CUDA_PREC_SQRT=0 | nvvm.sqrt.approx.ftz.f |
f64 has no FTZ form on any current architecture: the .ftz modifier is rejected by the assembler on f64 mnemonics. Tileiras silently drops the modifier on the f64 path even when the function attribute is preserve-sign.
Approximate Transcendentals
The NVIDIA SFU has dedicated hardware for five transcendentals: sin.approx, cos.approx, rsqrt.approx, lg2.approx, ex2.approx. Each is approximately 22 bits of accuracy on f32 input (vs ~24 for fully IEEE single) and runs at one result per cycle per SFU lane. The IEEE path through libdevice is roughly an order of magnitude slower.
Tileiras selects the .approx variant on three independent triggers:
- The op carries the afn fast-math bit. This is the frontend-driven path: a math function called with nvfuser::fastmath or compiled under --ffast-math arrives at the math pass with afn set, and the libdevice resolver rewrites math.sin %x : f32 into nvvm.sin.approx.f directly without going through __nv_sinf.
- The libdevice symbol that was called is itself one of the explicit __nv_fast_* aliases (__nv_fast_sinf, __nv_fast_cosf, __nv_fast_logf, etc.). These bodies are a single approximate intrinsic with no reflect-guarded fallback. The resolver folds the symbol regardless of the surrounding fast-math context.
- The __nv_* body is reflect-gated and __CUDA_PREC_* selected the approximate arm. __nv_sqrtf for example has __nvvm_reflect("__CUDA_PREC_SQRT") ? sqrt.rn : sqrt.approx; with __CUDA_PREC_SQRT=0 the reflect fold leaves only the sqrt.approx.f arm.
The five SFU operations have FTZ and non-FTZ forms; the four-way matrix (approx × FTZ) is enumerated in the NVVM-Reflect crosswalk for each op. Operations outside the SFU family — tanh, erf, atan — have no .approx PTX form; the only fast path is the libdevice __nv_fast_* alias, which inlines a polynomial approximation in software.
Intrinsic resolve_transcendental(MathOp op, Target t, ReflectMap r) {
    if (op_has_fastmath_flag(op, FMF_AFN)) return approx_intrinsic_for(op, t);
    if (call_target_is_fast_alias(op)) return approx_intrinsic_for(op, t);
    if (reflect_says_approx(op.kind, r)) return approx_intrinsic_for(op, t);
    return exact_intrinsic_for(op, t);
}
The decision happens in the math pass, before the NVPTX backend sees the call. By the time the LLVM IR reaches the backend the call has been replaced by an nvvm.*.approx.* intrinsic, and the only remaining choice is FTZ.
Libdevice Gating: Bit-Exact vs Fast
Libdevice ships two callable variants for almost every transcendental, distinguished by symbol name:
| Symbol | Behaviour | Reflect-gated |
|---|---|---|
| __nv_sqrt | IEEE-correct, FTZ-agnostic | yes, on __CUDA_PREC_SQRT and __CUDA_FTZ |
| __nv_fast_sqrt | approximate, FTZ-aware | no, single arm |
| __nv_sinf | reflect-gated approx-or-exact | yes, on __CUDA_FTZ |
| __nv_fast_sinf | approximate, FTZ-aware | no, single arm |
| __nv_sin (f64) | IEEE-correct | no f64 fast variant |
Frontend code that wants the fast variant must call the __nv_fast_* symbol explicitly or compile under a fast-math context that the math pass can use to rewrite the call. There is no CLI option that globally swaps __nv_sqrt for __nv_fast_sqrt; the dispatch is symbol-name-based, not flag-based.
The two reflect-key families interact: __CUDA_PREC_DIV, __CUDA_PREC_SQRT, __CUDA_PREC_RSQRT, and __CUDA_PREC_LOG each select the bit-exact arm of one transcendental family, while __CUDA_FTZ selects the FTZ variant of whichever arm survived. Setting __CUDA_PREC_SQRT=0 with __CUDA_FTZ=1 resolves __nv_sqrtf to sqrt.approx.ftz.f; setting both to zero resolves to sqrt.approx.f; setting __CUDA_PREC_SQRT=1 with __CUDA_FTZ=0 resolves to sqrt.rn.f.
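The interaction can be written down as a two-key fold. The sketch below covers only __nv_sqrtf and returns the intrinsic suffix as a string; the ReflectMap struct and the function name are hypothetical stand-ins for whatever the resolver actually uses.

```cpp
#include <cassert>
#include <string>

// Hypothetical view of the two reflect keys that matter for sqrt.
struct ReflectMap {
    int cuda_prec_sqrt;  // __CUDA_PREC_SQRT: 1 keeps the bit-exact arm
    int cuda_ftz;        // __CUDA_FTZ: 1 selects the FTZ variant
};

// Fold the reflect-gated body of __nv_sqrtf down to one intrinsic name.
// The precision key picks the arm first; the FTZ key then decorates that arm.
inline std::string resolve_nv_sqrtf(const ReflectMap &r) {
    assert((r.cuda_prec_sqrt == 0 || r.cuda_prec_sqrt == 1) &&
           (r.cuda_ftz == 0 || r.cuda_ftz == 1));
    std::string name = r.cuda_prec_sqrt ? "sqrt.rn" : "sqrt.approx";
    if (r.cuda_ftz) name += ".ftz";
    return name + ".f";
}
```

The four possible inputs reproduce the three resolutions listed above plus sqrt.rn.ftz.f, the remaining cell of the FTZ case matrix.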
NVVMReflect Mechanism documents the var-map source order and the constant-conditional cleanup that follows the substitution. Math Pass Pipeline and Crosswalk carries the per-op crosswalk between math.* ops, __nv_* libdevice symbols, and the final nvvm.* intrinsic for each fast-math configuration.
FP8 — E4M3 and E5M2
Two FP8 formats are first-class types in tileiras:
| Type | Layout | Range | Special values | Typical use |
|---|---|---|---|---|
| f8E4M3FN | 1 sign + 4 exp + 3 mantissa | ±448 | no inf, one NaN encoding | forward activations and weights |
| f8E4M3FNUZ | same layout, unsigned-zero variant | ±240 | no inf, no negative zero | some training recipes |
| f8E5M2 | 1 sign + 5 exp + 2 mantissa | ±57344 | inf and NaN encoded | backward gradients |
| f8E5M2FNUZ | same layout, unsigned-zero variant | ±57344 | no inf, no negative zero | some training recipes |
The FN suffix means "finite": no infinity encoding, only NaN; the UZ suffix means "unsigned zero": no negative zero. The four element types are MLIR built-ins and round-trip through the bytecode.
Cast semantics matter because FP8's narrow range makes overflow common. Tileiras supports four rounding modes on f16→f8 and f32→f8 casts: round-to-nearest-even (default), round-to-nearest-tied-away-from-zero, round-toward-zero, and round-toward-positive-infinity. Saturation is independent: a satf modifier clamps overflowing values to ±max-finite instead of producing NaN or infinity (the latter is impossible on FN types).
nv_tileaa.cast %x : tile<128x128 x f32> to tile<128x128 x f8E4M3FN>
{rounding = #rne, satf = true}
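The effect of rounding and saturation on an f32 to f8E4M3FN cast can be checked on the host with a value-level model. The sketch below returns the float that the 8-bit result would decode to rather than the bit pattern, handles only round-to-nearest-even, and treats infinite inputs as overflow; the function name and these simplifications are assumptions of the example, not a description of the cvt instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cfenv>
#include <cmath>
#include <limits>

// Value-level model of f32 -> f8E4M3FN with round-to-nearest-even.
// satf = true clamps overflow to +/-448; satf = false yields NaN,
// because the FN format has no infinity encoding.
inline float quantize_e4m3fn_rne(float x, bool satf) {
    constexpr float kMaxFinite = 448.0f;  // largest finite E4M3FN magnitude
    constexpr int kMinNormalExp = -6;     // smallest normal exponent (bias 7)
    constexpr int kMantissaBits = 3;
    // nearbyint follows the current rounding mode; the model needs the default.
    assert(std::fegetround() == FE_TONEAREST);

    if (std::isnan(x)) return x;
    float a = std::fabs(x);
    if (a == 0.0f) return x;              // keeps the sign of zero

    float q = std::numeric_limits<float>::infinity();
    if (!std::isinf(a)) {
        // Spacing of representable values near a is 2^(e - 3); subnormals
        // share the spacing of the smallest normal binade.
        int e = std::max(std::ilogb(a), kMinNormalExp);
        float step = std::ldexp(1.0f, e - kMantissaBits);
        q = std::nearbyint(a / step) * step;
    }
    if (q > kMaxFinite) {
        if (!satf) return std::numeric_limits<float>::quiet_NaN();
        q = kMaxFinite;
    }
    return std::copysign(q, x);
}
```

Under this model 0.3 rounds to 0.3125, the tie at 17 rounds to the even neighbour 16, and 500 becomes 448 with satf or NaN without it.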
FP8 MMA is available on SM89 (Ada) for the small-tile WMMA family and on SM90 (Hopper) for WGMMA. On SM89 the FP8 inputs are accumulated into f32; on SM90 the WGMMA accumulator may be f32 or f16. The MMA atom builder rejects f8 × f8 → f8 and f8 × f8 → bf16 shapes because the hardware refuses to issue them.
A worked dot product mixing precisions:
%a : tile<128x64 x f8E4M3FN>
%b : tile<64x128 x f8E4M3FN>
%c : tile<128x128 x f32>
%d = nv_tileaa.dot %a, %b, %c
: tile<128x64 x f8E4M3FN>, tile<64x128 x f8E4M3FN>, tile<128x128 x f32>
-> tile<128x128 x f32>
%out = nv_tileaa.cast %d : tile<128x128 x f32> to tile<128x128 x f8E4M3FN>
{rounding = #rne, satf = true}
The dot lowers to one or more WGMMA atoms on SM90 or one WGMMA-equivalent UMMA group on SM100. The cast lowers to a cvt.rn.satfinite.e4m3x2.f32 pair-packed conversion on hardware that supports the packed form.
Block-Scaled FP — MX-FP and NV-FP4
Block-scaled formats pack N narrow values together with a single shared scale factor. The effective dynamic range of the block is the value range multiplied by the scale range; the per-value storage cost is the narrow-value width plus 1/N of the scale width.
On Blackwell SM100+ tileiras supports four block-scaled formats:
| Format | Value type | Block size | Scale type | OpCode group |
|---|---|---|---|---|
| MX-FP8 | f8E4M3FN or f8E5M2 | 32 | e8m0 | kind::f8f6f4 |
| MX-FP6 | f6E2M3FN or f6E3M2FN | 32 | e8m0 | kind::f8f6f4 |
| MX-FP4 | f4E2M1FN | 32 | e8m0 | kind::mxf4 |
| NV-FP4 | f4E2M1FN | 16 or 32 | e4m3 | kind::mxf4nvf4 |
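The per-value cost of each row in the table follows from the narrow-value width plus 1/N of the scale width. The helper below is an illustrative sketch with a hypothetical name; both e8m0 and e4m3 scales occupy 8 bits.

```cpp
#include <cassert>

// Storage cost per value in bits: value width plus the scale amortised over the block.
inline double bits_per_value(int value_bits, int scale_bits, int block_size) {
    assert(block_size > 0 && "block must hold at least one value");
    return value_bits + static_cast<double>(scale_bits) / block_size;
}
```

MX-FP4 with a block of 32 costs 4.25 bits per value, while NV-FP4 with a block of 16 costs 4.5 bits: the smaller block buys finer-grained scaling at a quarter bit of extra overhead.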
The scale factor lives in its own MLIR operand (sf_a, sf_b on the MMA op) and rides a dedicated TMEM region allocated alongside the value operands. The MMA hardware multiplies the value-product by the scale-product per block before adding into the accumulator.
The two kind::mxf4 variants differ only in scale type: OCP-standard MX-FP4 uses e8m0 scales (8-bit exponent, no mantissa), and NVIDIA-defined NV-FP4 uses e4m3 scales (4-bit exponent, 3-bit mantissa, finite-only). The dispatcher reads the scale element type to pick the opcode group; mismatched scale types across sf_a and sf_b are a verifier error. The MMA atom registry in tcgen05 Tensor Memory Model enumerates the legal (atom_K, vecSize) pairs per variant.
%d = nv_tileaa.dot %a, %b, %c
sfa(%sa) sfb(%sb)
: tile<M x K x f4E2M1FN>, tile<K x N x f4E2M1FN>, tile<M x N x f32>
sfa: tile<M x (K/32) x e4m3>, sfb: tile<(K/32) x N x e4m3>
-> tile<M x N x f32>
The scale factor operands consume their own TMEM region and have their own staging pipeline. The mainloop must keep the scale and value operands aligned across the K loop or the MMA produces silently wrong results — the hardware does not check operand correspondence.
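A scalar reference model makes the alignment requirement concrete. The sketch below computes one output element of a block-scaled dot product in plain float arithmetic, applying the product of the two scale factors once per block. The function name and the float-typed operands are assumptions; real operands are narrow types held in TMEM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model of one block-scaled dot-product element:
//   acc += (sfa[blk] * sfb[blk]) * sum_i(a[i] * b[i])   for each block along K
inline float block_scaled_dot(const std::vector<float> &a,
                              const std::vector<float> &b,
                              const std::vector<float> &sfa,
                              const std::vector<float> &sfb,
                              std::size_t block, float acc) {
    assert(block > 0 && a.size() == b.size() && a.size() % block == 0);
    assert(sfa.size() == a.size() / block && sfb.size() == sfa.size());
    for (std::size_t blk = 0; blk < sfa.size(); ++blk) {
        float partial = 0.0f;
        for (std::size_t i = 0; i < block; ++i)
            partial += a[blk * block + i] * b[blk * block + i];
        acc += partial * (sfa[blk] * sfb[blk]);
    }
    return acc;
}
```

Feeding the same values with the scale vector shifted by one block changes the result without raising any error, which is the silent failure mode described above.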
Recommended Precision Recipes
The four common configurations across the production matrix:
| Scenario | Input | Accumulator | Output | FTZ | Fast-math |
|---|---|---|---|---|---|
| inference / serving | f8E4M3FN | f32 | f8E4M3FN or bf16 | on | afn, contract |
| training forward | bf16 | f32 | bf16 | off | contract only |
| training backward | f8E5M2 | f32 | bf16 | off | contract only |
| bit-exact reference | f32 | f32 | f32 | off | none |
The asymmetry between forward and backward in training reflects the different dynamic-range requirements: forward activations are tightly bounded around the activation function output, while gradients span many orders of magnitude across layers. f8E4M3FN has more mantissa precision and a narrower range; f8E5M2 has wider range and less precision. The forward path tolerates the narrow range because activations are bounded; the backward path requires the wider range because gradient magnitudes are not.
Block-scaled formats (MX-FP4, NV-FP4) are usable on the forward path on SM100+ but require quantisation-aware training to converge. They do not currently compose with bit-exact reference recipes.
Cross-References
NVVMReflect Mechanism documents the var-map source order, the merge rules between metadata and CLI overrides, and the constant-conditional cleanup that follows reflect substitution.
Math Pass Pipeline and Crosswalk carries the per-op crosswalk from math.* through __nv_* to nvvm.* for every fast-math configuration on f32 and f64.
Intrinsic ID Switch and Name Table documents how the constant folder recognises post-libdevice call sites and which nvvm.* intrinsics it folds.
tcgen05 Tensor Memory Model carries the block-scaled MMA opcode table and the scale-factor TMEM allocation rules.
nv_tileaa Op Roster documents the dot operand shape and the scale-factor verifier diagnostics.
Matmul Progression by SM places FP8 on SM89, FP8 WGMMA on SM90, and block-scaled MX-FP / NV-FP4 on SM100 within the broader hardware lineage.
Correctness Layers
Abstract
Tileiras wraps four concentric verification layers around its pass pipeline, and a fifth external layer — ptxas — closes the loop after the PTX text leaves the process. Each layer catches a distinct class of bug. The per-op verifier catches structural malformations the moment an op is built. The pass-level verifier catches mismatches a single pass was supposed to repair but didn't. The module-level verifier catches whole-module invariants the pipeline as a whole is supposed to establish. The NVVM IR verifier catches NVPTX-specific errors that upstream LLVM's generic verifier knows nothing about. Anything that slips past all four lands on ptxas, which re-parses the PTX text and rejects programs the producer's invariants missed.
A reimplementer who omits any one layer creates a different class of wrong-output bug. Omitting the per-op verifier produces malformed IR that propagates silently into later passes and triggers obscure crashes far from the source. Omitting the NVVM verifier produces PTX that ptxas rejects with a generic syntax error, robbing users of the actionable diagnostic the higher layer would have emitted. The layers are not redundant; they are positioned to surface each bug class at the point where the producer still has the structural context needed to explain it.
The Five Layers
Layer 1: Per-Op Verifier
Every MLIR op carries a verify() method anchored in its OperationName slot at header offset +0x40; the per-op verifier hook is documented in detail in Operation Layout — Pointer-Identity Dispatch. The hook fires every time an op is built through OpBuilder::create<OpT> and again after every pass when verify-each is on. The body checks per-op structural invariants — operand count, operand types, result count, result types, region structure, terminator kind, attribute presence, trait predicates such as IsolatedFromAbove.
The per-op verifier cannot reach across operations. It sees only the op handed to it and the values dangling off its operand list; it cannot inspect the parent block, the enclosing function, or another op that produced one of its operands. Cross-op invariants require a higher layer.
A concrete example. The TileAS TMA load op encodes the source coordinate list as operands 2..R+1, where R is the descriptor's box rank stored on operand 0. When a builder constructs the op with the wrong number of coordinate operands, the per-op verifier walks the operand list and emits:
'nv_tileas.async.tiled_tma_load' op expects 3 coordinates, but got 2
The error surfaces inside the pass that built the op, before the pass's transformation pattern even returns. The pass-failure handshake propagates the diagnostic up to the driver and aborts the pipeline at the pass that introduced the malformed op rather than at a later consumer that would have seen incomprehensible IR.
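The coordinate-count check reduces to comparing the number of coordinate operands against the descriptor's box rank. The sketch below is a standalone model of that comparison: the function name and signature are hypothetical, an empty string stands for success, and the message text follows the diagnostic quoted above.

```cpp
#include <cassert>
#include <string>

// Model of the per-op check on the TMA load: coordinates occupy operands
// 2..R+1, so the op must carry exactly R (box_rank) coordinate operands.
// Returns an empty string when the op is well formed.
inline std::string verify_tma_load_coords(unsigned box_rank, unsigned num_coords) {
    assert(box_rank > 0 && "a descriptor box has at least one dimension");
    if (num_coords == box_rank) return "";
    return "'nv_tileas.async.tiled_tma_load' op expects " +
           std::to_string(box_rank) + " coordinates, but got " +
           std::to_string(num_coords);
}
```

Because the check needs only the op's own operand list and the rank carried by operand 0, it fits within what a per-op verifier is able to see.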
Layer 2: Pass-Level Verifier
The pass-manager between-pass verifier fires after each pass when verify-each is on (the default for non-Release builds), running the full verify() on the anchor operation, which in turn walks every op in the anchor's region and re-runs each per-op verifier. It catches invariants the pass was supposed to establish but didn't — partial rewrites, terminators left behind by an aborted pattern application, region asymmetries from a rewrite that fired once on a producer but failed on its matching consumer.
A concrete example. MaterializeAsync rewrites every pipeline op into a pair of producer and consumer regions. If the pass aborts a rewrite halfway through, leaving an nv_tileas.async.pipeline.consume_one op without its paired produce_one, the between-pass verifier walks the body and the pipeline-region verifier fires:
'nv_tileas.async.pipeline.consume_one' op expects region arguement types to match with producer types [...], but got: [...]
(The typo "arguement" is verbatim in the binary; see nv_tileas Verifiers — Region-Op Verifier Template for the full set of region-op invariants.) The pass-level verifier surfaces the broken state at the boundary of the pass that produced it, not at the boundary of the next pass that consumes it.
Layer 3: Module-Level Verifier
The module-level verifier fires at named verifier passes (TileIR operation analysis, TileAA agent verifier, NVVM IR verifier) and at the end of the pipeline. It checks whole-module invariants: every kernel-marked function carries nvvm.kernel metadata, every symbol referenced by a func.call resolves to a function in the module, every type carried in an attribute dictionary belongs to the resolved type table, and every metadata-attached kernel entry has consistent launch-bound directives.
A concrete example. A late LLVM-tier pass strips function attributes during cleanup. If the strip pass runs after KernelAttrPass but before the NVVM verifier, a kernel function loses its nvvm.kernel attribute. The module-level verifier walks the function list, sees a function with kernel-shaped signature but no kernel metadata, and emits a diagnostic. The pipeline aborts before the NVPTX backend reaches instruction selection, which would otherwise have generated correctly typed but non-kernel-emittable code.
Layer 4: NVVM IR Verifier
The NVVM IR verifier is the LLVM-tier sibling of the module verifier. It runs after MLIR-to-LLVM conversion and catches NVPTX-specific invariants the upstream LLVM Verifier knows nothing about: formal parameter-space overflow per the active SM target, launch-argument address-space mismatches that would be undefined at runtime, intrinsics used below their introducing SM, kernel metadata required on functions the launch operator references. The check anatomy is documented in NVVM IR Verifier.
A concrete example. A kernel argument list whose accumulated size exceeds the SM's parameter-space limit:
Formal parameter space overflowed (40016 bytes required, max 1024 bytes allowed) in function big_kernel
Upstream LLVM's verifier accepts this function unconditionally because parameter space is an NVPTX concept; the NVVM verifier is the only place this becomes a hard error rather than a silent truncation in the backend.
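A reimplementer can approximate the overflow check by laying parameters out in order with natural alignment and comparing the total against the target's limit. The sketch below is a model under stated assumptions: the Param struct, the layout rule, and the function names are illustrative, and the limit is passed in by the caller rather than derived from an SM table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one formal parameter.
struct Param {
    std::size_t size;   // bytes
    std::size_t align;  // power-of-two alignment in bytes
};

// Lay parameters out in order, padding each to its alignment.
inline std::size_t param_space_bytes(const std::vector<Param> &params) {
    std::size_t offset = 0;
    for (const Param &p : params) {
        assert(p.align != 0 && (p.align & (p.align - 1)) == 0);
        offset = (offset + p.align - 1) & ~(p.align - 1);  // round up
        offset += p.size;
    }
    return offset;
}

// Returns an empty string on success, else a diagnostic in the quoted format.
inline std::string check_param_space(const std::string &fn,
                                     const std::vector<Param> &params,
                                     std::size_t max_bytes) {
    std::size_t need = param_space_bytes(params);
    if (need <= max_bytes) return "";
    return "Formal parameter space overflowed (" + std::to_string(need) +
           " bytes required, max " + std::to_string(max_bytes) +
           " bytes allowed) in function " + fn;
}
```

A 40000-byte aggregate followed by a 4-byte and an 8-byte scalar pads out to 40016 bytes, which reproduces the byte count in the example diagnostic when the limit is set to the value quoted there.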
Layer 5: ptxas
The final correctness net is external. Tileiras emits PTX as ASCII text and hands it to ptxas through the subprocess harness described in Handoff Protocol — Subprocess argv. ptxas re-parses the PTX, applies its own validators, and assembles to cubin. Anything tileiras's four internal layers missed gets caught here: PTX-syntax violations from a buggy AsmPrinter template, instruction-availability mismatches against the -arch flag, virtual-to-physical register allocation conflicts, launch-bound directive collisions (.maxntid and .reqntid together, for instance, per Handoff Protocol — Producer-side bug flagged).
ptxas diagnostics surface through the subprocess's stderr pipe and reach the user through the harness's diagnostic callback. The bug class ptxas catches uniquely is PTX-text well-formedness: tileiras's structural verifiers know what they emitted, ptxas knows what was actually serialized. A bug in the serializer that produces well-formed IR but malformed PTX is invisible to every internal layer.
What Each Layer Catches
| Bug class | Layer 1 | Layer 2 | Layer 3 | Layer 4 | ptxas |
|---|---|---|---|---|---|
| Wrong operand count | YES | — | — | — | — |
| Wrong operand type | YES | — | — | — | — |
| Wrong attribute type | YES | — | — | — | — |
| Region missing terminator | YES | — | — | — | — |
| Region-asymmetric after pass | — | YES | — | — | — |
| Partial-rewrite leftover op | — | YES | — | — | — |
| Missing kernel metadata after pipeline | — | — | YES | — | — |
| Unresolved symbol reference | — | — | YES | — | — |
| Parameter-space overflow | — | — | — | YES | YES (different msg) |
| Use of sm_100 intrinsic with sm_90 target | — | — | — | YES | YES |
| Launch argument in wrong addrspace | — | — | — | YES | — |
| Launch target not a kernel | — | — | — | YES | — |
| PTX syntax error from AsmPrinter | — | — | — | — | YES |
| Invalid PTX register allocation | — | — | — | — | YES |
| .maxntid + .reqntid together | — | — | — | — | YES |
| Wrong-output (silent data race) | — | — | — | — | — |
| Wrong-output (numerical drift) | — | — | — | — | — |
| Wrong-output (scheduler picks bad II) | — | — | — | — | — |
Two rows are caught in two columns. Parameter-space overflow and SM-versioned intrinsics fire at the NVVM layer with a tileiras-shaped diagnostic and again at ptxas with a PTX-shaped diagnostic. The duplication is intentional: the higher-layer diagnostic names the MLIR-level construct that caused the overflow, which the PTX-level diagnostic cannot, so the higher-layer message is the actionable one.
The Unverifiable
Some bug classes pass through every layer without being caught.
Data races in user-written warp-cooperative algorithms. The verifier infrastructure proves IR well-formedness; it does not prove execution behavior. A cute_nvgpu algorithm that reads a shared-memory buffer before the producer warp has signalled completion is structurally valid IR and produces structurally valid PTX. Race detection requires output testing under thread-perturbing instrumentation, not invariant checking.
Numerical-precision mismatches. A pass that picks the wrong rounding mode, or that fuses a multiply-add where the fused form differs from the source-level non-fused form, produces output that compiles cleanly and runs correctly in the sense that no instruction faults. The wrong number reaching the wrong destination is invisible to every structural verifier.
Performance regressions. The scheduler may pick an initiation interval one cycle larger than optimal; the layout-conversion pass may insert a stride-1 reshape where a stride-2 path was reachable; the register allocator may spill where a feasible coloring existed. None of these is a correctness violation, so no verifier fires. Performance regressions surface through end-to-end timing, not invariants.
Bugs in tileiras's own optimization decisions. The scheduler's cost model may rank a worse placement above a better one; the canonicalizer may rewrite an expression into a form the next pass cannot fold; the fusion pass may merge two ops whose fused form has worse register pressure than either alone. These are decision bugs, not correctness bugs, and they require human review or differential output testing to catch.
The four-plus-one layer model is exhaustive only for the bug class it was designed to catch: structural violations of typed IR. Everything else is the responsibility of output testing.
Verifier-Ladder Algorithm
The pass manager invokes the four internal layers in a fixed sequence around each pass. The pseudocode below collapses the upstream OpPassManager machinery into the single contract a reimplementer must preserve.
LogicalResult run_pipeline_with_verifiers(ModuleOp module, PassManager *pm) {
    for (Pass *pass : pm->passes) {
        // Layer 1 fires implicitly inside pass->run() every time the pass
        // constructs or mutates an op through OpBuilder. No explicit call
        // is needed at the driver level: the per-op verifier is part of
        // op construction itself and short-circuits pass->run on failure.
        if (failed(pass->run(module))) {
            return failure();
        }
        // Layer 2 fires after every pass when verify-each is on. The full
        // verify() walks every op in the anchor's region and re-runs each
        // per-op verifier, but it also catches cross-op invariants by
        // running region-bearing verifiers on parents.
        if (pm->verify_each) {
            if (failed(verify(module, /*verifyRecursively=*/true))) {
                return failure();
            }
        }
        // Layer 3 is scheduled as a named pass and runs only when the
        // pipeline reaches it; this loop invokes it like any other
        // pass through pass->run().
    }
    // Layer 4 is the NVVM IR verifier, scheduled as an LLVM FunctionPass
    // after MLIR-to-LLVM conversion. It runs through the LLVM pass
    // manager's normal path and reports failure through the pass result.
    if (failed(run_nvvm_ir_verifier(module->target))) {
        return failure();
    }
    // Layer 5 is external. The PTX serializer produces ASCII PTX and the
    // subprocess harness hands it to ptxas. ptxas failures surface via
    // the harness's stderr capture and the driver's diagnostic callback.
    if (failed(serialize_and_invoke_ptxas(module))) {
        return failure();
    }
    return success();
}
Four ordering rules tie the layers together. Layer 1 always runs before pass->run returns; if it reports a failure, the pass-failure handshake propagates and layer 2 is never reached. Layer 2 always runs before the next pass begins; if it reports a failure, the pipeline aborts before any later layer sees the broken state. Layer 4 runs only once, at the end of the LLVM-tier pipeline, after every MLIR-tier pass has had a chance to repair its own output. Layer 5 runs only when layers 1–4 have all signalled success.
Failure Recovery
What happens after each layer fires.
Layer 1. OperationName::create or OpBuilder::create<OpT> returns null. The pass body sees a null result and either signals pass failure explicitly or returns without producing the expected new op, at which point the rewrite driver detects the missing replacement and signals pass failure on its own. The diagnostic is attached to the op being constructed, so it points at the source location of the input op the pass was trying to rewrite. See Pass-Failure Handshake for the propagation mechanism.
Layer 2. The between-pass verifier emits a diagnostic identifying the pass that produced the broken state and the op that violates the invariant. The pass-failure handshake propagates the failure up to the driver and the pipeline aborts.
Layer 3. The named verifier pass emits a diagnostic identifying the module-level invariant that failed and the symbol or attribute involved. The driver returns a non-zero exit code; no later pass runs.
Layer 4. The NVVM IR verifier calls signalPassFailure() on the LLVM FunctionPass. The LLVM pass manager picks the failure up on the Pass::run return path and aborts before the NVPTX backend reaches instruction selection.
Layer 5. ptxas exits non-zero. The subprocess harness reads ptxas's stderr through the captured pipe and forwards the text to the driver's diagnostic callback. The driver itself returns the harness's failure to its own caller; no cubin reaches the user.
In every case the diagnostic carries the highest-context layer's view of the bug: the per-op verifier names the op kind, the pass-level verifier names the pass, the module verifier names the module-wide invariant, the NVVM verifier names the SM target and the offending size, and ptxas names the PTX line. The architecture pairs early detection (most bugs surface at layer 1 or 2) with actionable context (the diagnostic that fires names the construct the user can change).
Cross-References
Pipeline Invariants and Verifiers walks the three internal verifier layers as the pass manager invokes them and is the primary reference for the in-pipeline contract. NVVM IR Verifier covers layer 4 in depth, including the parameter-space sizer and the per-SM limit table. Operation Layout — Pointer-Identity Dispatch documents the per-op verifier hook anchored in the OperationName slot. Pass-Failure Handshake covers the propagation mechanism every layer uses to signal failure. Handoff Protocol — Subprocess argv documents the layer-5 boundary, and Handoff Protocol — Producer-side bug flagged covers the canonical bug class that escapes layers 1–4 and surfaces only at ptxas. nv_tileas Verifiers — Region-Op Verifier Template is the canonical layer-1 example covered above.
Troubleshooting and Known Issues
maps each layer's canonical verbatim diagnostic to a symptom-driven
index — useful when a user has a stderr line and needs to identify
which verifier layer produced it before consulting the ladder here.
Testing and Observability covers the
test patterns that pin diagnostics from each verifier layer as golden
strings, and identifies which of the unverifiable bug classes from the
section above remain invisible to observable-behavior testing.
Error Handling and Diagnostics
Abstract
Three error-handling layers cooperate across tileiras's compilation pipeline. MLIR's diagnostic engine carries structured messages from verifier and pattern sites through a context-anchored handler chain. The TileAS pass family layers a soft-failure handshake on top of that engine so a broken pass can stop the pipeline without throwing. The driver consumes accumulated diagnostic severity into a small set of integer exit codes that the caller acts on. The result is a system where a failed compile produces both a precise verbatim message for the user and a machine-readable signal for downstream passes and embedding hosts — without ever leaving the IR in a partially mutated state.
The three layers
MLIR diagnostic engine
Every user-visible error, warning, note, and remark produced by tileiras lives
inside a 208-byte Diagnostic body. Verifiers, parsers, conversion patterns,
pass drivers, and dialect-init routines all seed that body through one of
three constructors — the operation-aware emitOpError form, the
location-only emitError form, and the generic location-plus-severity form
— stream fragments into a 4-slot inline argument buffer, and rely on an
InFlightDiagnostic RAII wrapper to flush the body through a
context-registered handler at scope exit.
The handler chain is owned by the MLIRContext. A diagnostic produced inside
the pipeline locks the engine's pthread mutex, walks the intrusive handler
list, and offers the diagnostic body to each handler in turn. The first
handler that returns true consumes the diagnostic. If no handler consumes
it, the default handler prefixes error: , warning: , note: , or
remark: on the formatted output (selecting from the severity class in the
packed flag word at offset +0x10 of the body), renders each argument
through the argument-printer dispatch, and flushes to whichever
raw_ostream the body's sink points at — by default llvm::errs().
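The dispatch rule above (lock, walk the handlers, stop at the first one that returns true, otherwise fall back to the prefixing default handler) can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: `Diag`, `DiagEngine`, and `prefix_for` are hypothetical names, the walk order (registration order) is an assumption, and the class numbering 1 Note, 2 Warning, 3 Error, 4 Remark follows the severity table later on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the diagnostic body: only the fields the
// dispatch rule needs are modeled.
struct Diag {
    std::uint16_t severity;  // packed word; the low byte is the class
    std::string text;        // message, assumed already formatted
};

class DiagEngine {
public:
    using Handler = std::function<bool(const Diag &)>;

    explicit DiagEngine(std::ostream &sink) : sink_(sink) {}

    void add_handler(Handler h) {
        std::lock_guard<std::mutex> lock(mu_);
        handlers_.push_back(std::move(h));
    }

    // Offers the diagnostic to each handler in turn; the first handler that
    // returns true consumes it. Otherwise the default path prints it.
    // Handlers must not call emit() re-entrantly in this sketch.
    void emit(const Diag &d) {
        std::lock_guard<std::mutex> lock(mu_);
        for (const Handler &h : handlers_) {
            if (h(d)) return;
        }
        sink_ << prefix_for(d.severity & 0xFFu) << d.text << '\n';
        sink_.flush();
    }

    static const char *prefix_for(unsigned cls) {
        switch (cls) {
        case 1: return "note: ";
        case 2: return "warning: ";
        case 3: return "error: ";
        case 4: return "remark: ";
        default: return "";
        }
    }

private:
    std::mutex mu_;
    std::vector<Handler> handlers_;
    std::ostream &sink_;
};
```

An embedding host that wants to capture warnings would register a handler that returns true for class 2 and false for everything else; errors would then still reach the default sink.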
The five canonical severity words that appear in the binary are 0x101,
0x103, 0x104, 0x302, and 0x503. The low byte names the class; bit 8
sets the op-name prefix; bit 9 marks a child trace note. A verifier failure
emitted through emitOpError writes 0x103 — Error with op prefix; a remark
that carries a stack-trace child writes 0x104; the inliner emits 0x302
and 0x503 when it walks call-context traces. The bit layout is documented
in detail on Diagnostic ABI and Helpers.
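As a rough illustration of the encoding just described, the sketch below splits a packed severity word into the three fields the text names: the class in the low byte, the op-name prefix flag at bit 8, and the child trace flag at bit 9. The struct and function names are hypothetical, and bits above bit 9 are deliberately left undecoded because this page does not describe them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded view of the packed severity word.
struct SeverityFields {
    unsigned cls;     // low byte: the severity class
    bool op_prefix;   // bit 8: op-name prefix requested
    bool trace_note;  // bit 9: child trace note
};

// Decodes only the fields documented on this page; higher bits are ignored.
constexpr SeverityFields decode_severity(std::uint16_t word) {
    return SeverityFields{
        static_cast<unsigned>(word & 0xFFu),
        (word & (1u << 8)) != 0,
        (word & (1u << 9)) != 0,
    };
}
```

Under this reading, `decode_severity(0x103)` yields class 3 with the op prefix set and no trace flag, which matches the "Error with op prefix" description of verifier failures.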
TileAS pass-failure handshake
MLIR's pass-manager exposes signalPassFailure() for hard pass failures, but
the TileAS pass family wants a softer signal. Hopper and Blackwell pipelines
routinely contain loops that one pass cannot transform — a loop whose
producer/consumer graph is not pipelinable, for instance, or whose layout
does not match the target spec — and the next pass still has useful work to
do on the rest of the function. The fix is a one-byte handshake at offset
+40 of each pass's PassObject: bit 2 (0x04) is the soft-failure flag.
A pass that decides it cannot complete its rewrite emits an MLIR diagnostic
first, then ORs 4 into its status word, then keeps walking or returns
success(). The pass manager treats success() as a normal return — the
next pass still runs — but a downstream pass that depends on this one's
output peeks at the status word and skips the dependent work. The bit is
cumulative within one pass run; the driver clears it before the pass starts
and inspects it once the pass returns. The full contract, including the
ordering rule that the diagnostic must always precede the bit-set, is
documented on Pass-Failure Handshake.
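The handshake can be sketched as follows. Only the status byte at offset +40 and the meaning of bit 2 come from the text; the `PassObject` layout before that byte is modeled as opaque padding, and every function name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Only the status byte is modeled; the 40 bytes before it are opaque here.
struct PassObject {
    unsigned char opaque[40];
    std::uint8_t status;  // offset +40
};
static_assert(offsetof(PassObject, status) == 40, "status byte must sit at +40");

constexpr std::uint8_t kSoftFailure = 0x04;  // bit 2

// Driver side: clear the bit once before the pass starts.
inline void begin_pass(PassObject &p) {
    p.status = static_cast<std::uint8_t>(p.status & ~kSoftFailure);
}

// Pass side: diagnostic first, then the bit, then a normal success return.
// `emit` is a caller-supplied callback standing in for the MLIR engine.
template <typename EmitFn>
bool run_pass_with_soft_miss(PassObject &p, bool can_transform, EmitFn emit) {
    if (!can_transform) {
        emit("Failed to pipeline loop");  // must precede the bit-set
        p.status |= kSoftFailure;
    }
    return true;  // success(): the pass manager keeps running later passes
}

// Downstream side: peek at the bit before doing dependent work.
inline bool soft_failed(const PassObject &p) {
    return (p.status & kSoftFailure) != 0;
}
```

A dependent pass would call `soft_failed` on its producer and skip only the work that relies on the missing rewrite, leaving the rest of the function untouched.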
Driver-level exit codes
The driver's public C API exposes five non-zero exit codes. Each is a fixed
integer that the embedding host can switch on, paired with one of a small
catalog of verbatim diagnostic strings routed through the standard MLIR
engine. The numbering is stable across tileirasProgramCreate,
tileirasProgramCompile, tileirasProgramGetOutput, and
tileirasProgramRelease.
| Code | Class | Trigger |
|---|---|---|
| 0 | success | (no error) |
| 1 | allocation failure | program-handle allocation returned NULL |
| 2 | configuration rejection | null pointer, out-of-range option, unsupported GPU |
| 3 | bytecode parse failure | magic or version mismatch on the input buffer |
| 4 | handle-state rejection | null or uncompiled handle passed to a getter |
| 5 | compile failure | pass manager returned failure() |
Code 2 covers every front-end configuration gate and uses severity 0x503
(class 3 with the trace bit set). Codes 1 and 5 use severity 0x103 —
Error with op prefix — because the failure carries a structural message
about the IR or the allocator. Code 3 uses severity 0x104 (class 4, the
Remark flavour) with an MLIR-bytecode tail heuristic appending
(it looks like MLIR bytecode instead) when the input looks like an MLIR
container instead of a TileIR one. The full code catalog with the verbatim
strings is on Driver Program Handle.
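An embedding host would typically mirror the table above as an enumeration and switch on it. The sketch below is one way to do that; the enumerator names and the `describe` helper are hypothetical, and only the integer values and class labels are taken from the table.

```cpp
#include <cassert>
#include <string>

// Integer values follow the exit-code table; the names are illustrative.
enum class ExitCode : int {
    Success = 0,
    AllocationFailure = 1,
    ConfigurationRejection = 2,
    BytecodeParseFailure = 3,
    HandleStateRejection = 4,
    CompileFailure = 5,
};

// Maps a raw exit code back to its class label from the table.
inline const char *describe(int code) {
    switch (static_cast<ExitCode>(code)) {
    case ExitCode::Success: return "success";
    case ExitCode::AllocationFailure: return "allocation failure";
    case ExitCode::ConfigurationRejection: return "configuration rejection";
    case ExitCode::BytecodeParseFailure: return "bytecode parse failure";
    case ExitCode::HandleStateRejection: return "handle-state rejection";
    case ExitCode::CompileFailure: return "compile failure";
    }
    return "unknown";
}
```

Because the numbering is fixed, the host can branch on the integer alone and treat the accompanying diagnostic text as user-facing detail.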
Severity to behavior mapping
The packed severity word at offset +0x10 of the diagnostic body drives every
downstream decision. The table below collapses the per-layer behavior into a
single view.
| Class | Engine behavior | Pass-failure layer | Driver behavior |
|---|---|---|---|
| 1 — Note | Attached to a parent; never printed alone | Never sets the bit | Never directly affects exit code |
| 2 — Warning | Printed with warning: prefix | Never sets the bit | Returns 0; the user sees the text on stderr |
| 3 — Error | Printed with error: prefix | Pass typically sets bit 2 | Pipeline run returns failure(); exit code 5 |
| 4 — Remark | Printed with remark: prefix when enabled | May set the bit (soft miss) | Returns 0 unless paired with a separate Error |
Warning and Remark diagnostics never cause a non-zero exit on their own. A
pass that emits a Remark, sets bit 2 of its status word, and returns success()
produces nothing the driver treats as fatal; the bit is for downstream
passes only, and the exit code is 0. Once the pipeline is running, the driver
returns a non-zero exit code only when the pass manager itself returns
failure(), which happens when at least one Error-class diagnostic has flushed
through the engine. Codes 1 through 4 are raised by the driver outside the
pass pipeline.
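The rule that the exit code follows the severity class and not the soft-failure bit can be written down as a small predicate. This is a simplified model under stated assumptions: `EmittedDiag` and `pipeline_exit_code` are hypothetical names, and the model covers only the pipeline stage (codes 0 and 5).

```cpp
#include <cassert>
#include <vector>

// Hypothetical record of one diagnostic emitted during a pipeline run.
struct EmittedDiag {
    unsigned cls;        // 1 Note, 2 Warning, 3 Error, 4 Remark
    bool sets_soft_bit;  // whether the emitting pass also set bit 2
};

// Exit code for the pipeline stage only. The soft bit is never consulted:
// only an Error-class diagnostic turns the run into a failure.
inline int pipeline_exit_code(const std::vector<EmittedDiag> &diags) {
    for (const EmittedDiag &d : diags) {
        if (d.cls == 3) return 5;
    }
    return 0;
}
```

A Remark with the bit set and a Remark without it both yield 0 here, which is the behavior a reimplementation needs to keep.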
The verifier ladder
Three concentric verifier layers wrap each pass invocation. The innermost fires while the pass is still mutating IR; the middle fires immediately after; the outermost fires only when a named verifier pass reaches its slot.
The operation-name verifier is the layer 1 check. Op construction inside the
pass body — every builder.create<...>(...) call — implicitly runs the
verifier registered for that op name. Operand counts, result counts, region
counts, required attribute presence, and trait-driven type constraints all
fire here, before the constructed op has been linked into its parent. A
break at this layer typically propagates as an InFlightDiagnostic returned
from the rewrite, which the pattern driver flushes to the engine and turns
into a failure() return.
The between-pass verifier is layer 2. When verify-each is on (the default
for non-Release builds), the pass manager runs verify(anchor, /*recursive=*/true)
after every pass that returned success(). The catch is broader than layer
1: cross-op invariants — a use that escapes its defining region, a
terminator whose successor list does not line up with its target — show up
here even when the individual op constructions all passed.
The named-verifier-pass layer is layer 3. Three explicit verifier passes
appear in the pipeline at fixed slots: the TileIR operation analysis (before
LLVM conversion), the TileAA agent verifier (warp-specialized path), and
the NVVM IR verifier (after target conversion). These passes enforce
whole-module or target-context invariants that the lower layers cannot see —
the NVVM verifier's parameter-space ceiling is the canonical example,
because the ceiling depends on the resolved #nvvm.target attribute and the
verifier needs the post-conversion address-space metadata to walk the
parameter list. The full ladder is documented on
Pipeline Invariants and Verifiers.
Verbatim diagnostic catalogs
The verbatim strings that flow through the engine are spread across the per-dialect verifier pages. The catalog below points at each verifier's canonical home; the strings themselves stay where they live so that the verifier code and its diagnostics remain colocated.
| Layer / source | Examples |
|---|---|
| cuda_tile verifiers | expect non-empty block, expect 0-rank tile type at index: N |
| nv_tileaa types/attrs/verifiers | Tile-attr alignment, layout-shape invariants |
| nv_tileas verifiers | Memory-op ordering and schedule-region structure |
| cute verifiers | Layout-algebra rank and stride consistency |
| cute_nvgpu mode-pattern verifiers | Copy-atom mode patterns, atom-vs-SM compatibility |
| cute_nvgpu TMA atoms | Descriptor-shape and address-space rules |
| passes/tileas/tma-and-memops-family | LowerTMALoadStoreToAsync: missing or invalid KernelSpecAttr on function |
| passes/tileas/async-pipeline-family | Failed to pipeline loop, Alias is not expected here. |
| nvptx-passes/nvvm-ir-verifier | Formal parameter space overflowed (X bytes required, max Y bytes allowed) in function Z, a function that is not __global__ cannot be launched |
The wording is part of the public contract. Frontends and tests key off the exact string to distinguish "I emitted illegal IR" from "I hit a compiler bug." A reimplementer must reproduce the strings verbatim or break downstream tooling.
Worked example: malformed kernel parameter buffer
Consider a kernel whose by-value parameter struct overflows the SM's parameter-space ceiling. The trace below follows a single diagnostic from emission through to exit code.
The user compiles a TileIR module containing the equivalent of
struct Heavy {
double scale; // 8 B
char tag; // 1 B (+ 3 B padding before data)
int data[10000]; // 40000 B (+ 4 B tail padding)
};
__global__ void big_kernel(struct Heavy h) { /* ... */ }
against an sm_75 target. The early lowering passes promote h into a
parameter-space pointer; the front end accepts it because nothing earlier
in the pipeline knows the target's parameter-space ceiling. The NVVM IR
verifier eventually picks it up.
// Inside NVVMIRVerifier (layer-3 verifier, runs after MLIR-to-LLVM lowering).
LogicalResult check_parameter_space(Function &fn, TargetInfo *target) {
uint64_t total = 0;
for (Argument &arg : fn.args()) {
uint64_t sz = size_of_param(describe(arg), target);
total = align_to(total, align_of(describe(arg), target));
total += sz;
}
uint64_t limit = param_space_limit_for(target->sm);
if (total > limit) {
return fn.emitOpError()
<< "Formal parameter space overflowed ("
<< total << " bytes required, max "
<< limit << " bytes allowed) in function "
<< fn.getName();
}
return success();
}
The 21-tag NVVM sizer descends through Heavy and walks out at 40016 bytes.
For sm_75 the ceiling is 1024 bytes. The check fails, and emitOpError
runs.
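The accumulate-and-compare step can be reproduced on the host as a quick sanity check. The sketch below is not the 21-tag sizer; it assumes each parameter is already reduced to a size and an alignment, and `ParamDesc`, `param_space_bytes`, and `overflows` are hypothetical names. On common 64-bit ABIs `sizeof(Heavy)` is 40016 and `alignof(Heavy)` is 8, which matches the figure quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Heavy {
    double scale;
    char tag;
    int data[10000];
};

// Hypothetical flattened description of one kernel parameter.
struct ParamDesc {
    std::uint64_t size;
    std::uint64_t align;
};

// Host-ABI view of Heavy; the device-side sizer may differ in principle.
inline ParamDesc heavy_desc() {
    return ParamDesc{sizeof(Heavy), alignof(Heavy)};
}

constexpr std::uint64_t align_to(std::uint64_t value, std::uint64_t align) {
    return (value + align - 1) / align * align;
}

// Aligns the running total for each parameter, then adds its size.
inline std::uint64_t param_space_bytes(const std::vector<ParamDesc> &params) {
    std::uint64_t total = 0;
    for (const ParamDesc &p : params) {
        total = align_to(total, p.align);
        total += p.size;
    }
    return total;
}

inline bool overflows(const std::vector<ParamDesc> &params, std::uint64_t limit) {
    return param_space_bytes(params) > limit;
}
```

With a single 40016-byte parameter and the 1024-byte ceiling from the example, `overflows` returns true, so the verifier path that emits the diagnostic is taken.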
emitOpError allocates a 208-byte Diagnostic body, zero-fills it, writes
the function's location to +0x00, writes packed severity 0x103 to
+0x10 (Error with op-prefix), initializes the argument buffer at +0x28,
and streams in six fragments: the literal Formal parameter space overflowed (,
the integer 40016, the literal bytes required, max, the integer
1024, the literal bytes allowed) in function, and finally the function
name big_kernel. The first four arguments fit in the inline slots; the
fifth pushes the count past four, so the streamer promotes the buffer to
the heap and rewrites args_begin before appending the rest.
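The inline-then-spill behavior of the argument buffer resembles a small-vector. A minimal sketch, assuming a 24-byte argument record and four inline slots as described on this page; the field names, the `ArgBuffer` class, and the use of `std::vector` for the heap storage are illustrative choices, not the recovered layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 24-byte argument record; the field meanings are assumed.
struct DiagArg {
    std::uint64_t kind;
    std::uint64_t a;
    std::uint64_t b;
};
static_assert(sizeof(DiagArg) == 24, "consumers walk arguments at a 24-byte stride");

class ArgBuffer {
public:
    static constexpr std::size_t kInlineSlots = 4;

    ArgBuffer() : begin_(inline_), size_(0) {}
    ArgBuffer(const ArgBuffer &) = delete;
    ArgBuffer &operator=(const ArgBuffer &) = delete;

    void push(const DiagArg &arg) {
        if (!spilled() && size_ < kInlineSlots) {
            inline_[size_++] = arg;  // still fits in the inline slots
            return;
        }
        if (!spilled()) {
            heap_.assign(inline_, inline_ + size_);  // copy inline args out
        }
        heap_.push_back(arg);
        begin_ = heap_.data();  // rewrite args_begin, also after reallocation
        ++size_;
    }

    bool spilled() const { return begin_ != inline_; }
    std::size_t size() const { return size_; }
    const DiagArg *begin() const { return begin_; }

private:
    DiagArg inline_[kInlineSlots];
    std::vector<DiagArg> heap_;
    DiagArg *begin_;
    std::size_t size_;
};
```

Consumers that only ever read through `begin()` and `size()` do not need to know whether the spill has happened, which is the property the stride-based walk relies on.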
When the InFlightDiagnostic wrapper goes out of scope, its destructor
calls the engine entry. The engine takes its mutex, walks the registered
handler chain, and offers the body to each handler. The default handler
prints
error: Formal parameter space overflowed (40016 bytes required, max 1024 bytes allowed) in function big_kernel
to llvm::errs(), then flushes the sink. The verifier itself follows the
emitOpError with signalPassFailure(), which marks the pass-manager
result as failure().
Control returns up the stack. The NVVM verifier pass returns failure() to
the pass manager. The pass manager propagates failure() to
tileirasProgramCompile, which sees a non-success result, emits its own
generic failed to compile Tile IR program diagnostic at severity 0x103,
and returns exit code 5. main propagates 5 to its caller. The user sees
both diagnostics on stderr, in emission order, and the calling tool can
distinguish "compile failed" from "input rejected" by inspecting the exit
code alone.
The same kernel compiled against sm_80 takes the same emission path through
the verifier, but the sizer compares 40016 against 32760 (the sm_80 limit),
still fails, and prints the limit appropriate to the target. The verifier
text is parametric on target->sm; the rest of the trace is identical.
Failure modes
The table below catalogs the principal failure classes a user can hit, ordered roughly by the pipeline depth at which they fire.
| Failure | Layer | What the user sees | Exit code |
|---|---|---|---|
| Bytecode magic mismatch | Driver, pre-pipeline | input does not correspond to Tile IR bytecode (with MLIR-tail suffix if matched) | 3 |
| Unsupported GPU / opt / host config | Driver, pre-pipeline | unsupported GPU target, invalid optimization level, unsupported host operating system | 2 |
| Dialect not registered for op in bytecode | Parser | MLIR's unresolved-dialect diagnostic on the offending op | 5 |
| Op verification failure (per-dialect) | Layer 1 / 2 | The verbatim string from the relevant verifier page | 5 |
| Pass-failure soft handshake | TileAS layer | A per-pass diagnostic (Failed to pipeline loop, etc.); downstream passes skip dependent work | 0 or 5 |
| Kernel parameter-space overflow | Layer 3 (NVVM) | Formal parameter space overflowed (X bytes required, max Y bytes allowed) in function Z | 5 |
| Non-kernel device launch target | Layer 3 (NVVM) | a function that is not __global__ cannot be launched | 5 |
| Codegen catastrophe (unsupported ISel) | LLVM backend | LLVM's report_fatal_error text; abort signal | abort |
| Subprocess failure (nvdisasm, ptxas) | Driver post-pipeline | Wrapper's exit-status diagnostic | 5 |
Two failure modes do not produce a clean exit. A report_fatal_error from
the LLVM backend ends the process through abort(), not through main's
return path; the driver cannot translate it into a friendly exit code
because the fatal-error handler runs ahead of any cleanup. The other is the
parser fall-through on a malformed dialect symbol, which produces a stderr
diagnostic but reaches the driver as a generic compile failure (code 5)
because the parser's failure surface is opaque to the driver wrapper.
The soft-handshake case is the only entry in the table whose exit code depends on what other passes do. A pass that sets bit 2 and emits a Remark produces exit code 0; the same pass emitting an Error produces exit code 5 even though the bit-set behavior is identical. The bit is for the pipeline; the severity is for the user and the driver.
Reimplementer notes
A reimplementation must preserve four structural invariants for the error-handling architecture to round-trip with the recovered binary.
The Diagnostic body is exactly 208 bytes and packs severity into the
16-bit word at offset +0x10 using the same class-in-low-byte plus
op-prefix-at-bit-8 plus trace-at-bit-9 encoding. The four-slot inline
argument buffer at offset +0x28 must precede any heap spill; downstream
consumers walk args_begin at a 24-byte stride and rely on the small
buffer staying live until the spill threshold is crossed. The
InFlightDiagnostic RAII wrapper must flush through the engine on scope
exit unless the body's location pointer has been cleared by a move; the
double-flush guard is a single byte at the body's end and a corrupted byte
either drops the diagnostic or emits it twice.
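A reimplementer can pin these offsets with compile-time checks. The struct below models only the fields this page names; every `unknown_*` member is opaque padding for regions not described here, and all member names are hypothetical. It assumes a 64-bit target where the location handle is eight bytes wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 24-byte argument record; the contents are not modeled in this sketch.
struct DiagArg {
    std::uint64_t words[3];
};

// Layout sketch: only offsets stated on this page are meaningful.
struct DiagnosticBody {
    std::uint64_t location;               // +0x00
    std::uint8_t unknown_08[8];           // +0x08, not described here
    std::uint16_t severity;               // +0x10, packed 16-bit word
    std::uint8_t unknown_12[0x28 - 0x12]; // gap up to the argument buffer
    DiagArg inline_args[4];               // +0x28, four inline slots
    std::uint8_t unknown_88[208 - 0x88 - 1];
    std::uint8_t flush_guard;             // last byte: double-flush guard
};

static_assert(sizeof(DiagArg) == 24, "24-byte argument stride");
static_assert(sizeof(DiagnosticBody) == 208, "body must be exactly 208 bytes");
static_assert(offsetof(DiagnosticBody, severity) == 0x10, "severity at +0x10");
static_assert(offsetof(DiagnosticBody, inline_args) == 0x28, "inline buffer at +0x28");
static_assert(offsetof(DiagnosticBody, flush_guard) == 207, "guard is the final byte");
```

If any of the assertions fails on a given toolchain, the layout has drifted and consumers that index by raw offset would read the wrong field.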
The pass-failure handshake bit sits at offset +40 of every TileAS
PassObject and means soft failure only when set as bit 2. The bit is
cumulative within one pass run, never cleared mid-run, and the driver
clears it once before pass entry. A pass-object layout that places the
status word at a different offset cannot participate in the handshake.
The driver-level exit codes form a flat namespace from 0 to 5. Codes are assigned per call site, not per error category, so a single code can cover multiple verbatim strings (code 4 covers every null-handle and not-yet-compiled rejection) and a single error category can produce different codes from different entry points (a null input pointer is code 2 during create and code 4 during get-output). The numbering is stable across entry points; the accompanying string varies by call site.
The interaction between the layers is one-directional: a layer-1 verifier
failure surfaces as a layer-2 failure() return, which becomes a layer-3
failure() on the pass-manager handle, which becomes exit code 5 from
main. No backward propagation. The bit-set handshake is the only path
that does not propagate failure upward — a pass that sets the bit and
returns success() is, from the layer above, indistinguishable from a pass
that did clean work, and a reimplementer that conflates the two will
silently turn a soft miss into a compile abort.
Cross-references
Diagnostic ABI and Helpers is
the canonical reference for the 208-byte body layout, the 24-byte
DiagnosticArg 3-tuple, the severity-word bit encoding, and the
constructor / streamer / destructor triad. Pass-Failure
Handshake covers the soft-failure
convention used across the TileAS pass family and the reasoning behind
choosing it over signalPassFailure(). Pipeline Invariants and
Verifiers documents the
three-layer verifier ladder and the explicit verifier passes that occupy
layer 3. Driver Program Handle catalogs the
five driver-level exit codes and the verbatim strings each one carries.
NVVM IR Verifier is the worked-example
target: it is where the parameter-space overflow diagnostic comes from, and
its per-SM ceiling table is the canonical reference for the limit values.
Pass Manager Internals covers the
pass-manager dispatch model that ties the diagnostic engine, the handshake
bit, and the driver exit codes together.
Troubleshooting and Known Issues
turns this page's architecture inside-out for the user: it indexes by the
verbatim diagnostic the user sees and points back at the layer that
emitted it, the exit code it carries, and the change that resolves it.
Debugging and Introspection
Abstract
Tileiras exposes five debugging surfaces. PTX line info ties emitted instructions back to the source .cu lines. Full device debug widens that link into stepping, breakpoints, and local-variable inspection at the cost of forcing -O0. The MLIR IR-snapshot surface dumps the pipeline's intermediate state between any pair of passes. Diagnostic stack traces attach a backtrace to each emitted diagnostic so the source of an error can be pinpointed when no pass name appears in the message. Finally, the scheduler decision trace records every candidate placement the modulo scheduler considered, with the cost vector and rejection reason for each one.
Each surface answers a different question. PTX line info answers "what source line is this PTX instruction?". Device debug answers "let me step through". The IR-snapshot surface answers "what's the IR after pass N, and how does it differ from after pass N-1?". Stack-traced diagnostics answer "which pass emitted this warning?". The scheduler trace answers "why did the scheduler pick this layout?".
Each surface has its own cost. None should be on by default. This page describes the surfaces, their costs, and how to combine them on a real debugging session.
Surface 1: PTX line info
The driver flag --lineinfo and the pipeline option emit-line-info are the two halves of the same mechanism. The driver flag toggles the option to the FromInput snapshot stage; the option can be set independently by integrators to None, Frontend, or TileasBoundary. The selected snapshot becomes the source IR whose locations populate the .loc directives in the emitted PTX.
The PTX-side cost is small. Every emitted instruction grows by one .loc file line col directive. SASS size is unchanged because the assembler attaches the line table out-of-band. The compile-time cost is one extra IR walk per emission point. Use this surface when the question is post-mortem: nvprof, ncu, cuobjdump, or any tool that needs to map back from SASS to source.
The choice of snapshot stage matters. FromInput uses the locations attached to the bytecode the driver received and answers the question most users care about — "which input-program line is this?". Frontend and TileasBoundary use the locations live in the IR after the named stage. The latter two are useful when the question is about an internal pass — for example "which TileAS-generated tile is this?" — but they will reference IR locations that are not in the original .cu source.
The --lineinfo to emit-line-info mapping is described in Driver CLI Options — Pipeline Options. The IR-to-DWARF lowering is described in Lowering: Target and Debug Info — Lineinfo vs Device-Debug.
Surface 2: Full device debug
Full device debug is enabled by --device-debug or its alias -g. The driver validator rejects the combination of --device-debug and any non-zero --opt-level with the verbatim diagnostic:
optimized debugging is not supported, change optimization level to 0 or disable full debug info
The validator's source is Driver CLI Options — Validation Algorithm. The rule is not cosmetic. Full device debug injects libNVVM options that disable several code-motion, value-fold, and block-merge transforms. The driver refuses to silently downgrade an optimised build rather than emit code whose optimisation level is unclear.
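The rejection rule is simple enough to state as a predicate. In the sketch below, the diagnostic string is the verbatim one quoted above; `DriverOptions` and `validate_debug_opt` are hypothetical names and do not reflect the driver's actual option structure.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical subset of the driver options relevant to this check.
struct DriverOptions {
    bool device_debug;  // --device-debug or -g
    int opt_level;      // --opt-level
};

// Returns the diagnostic text when the combination is rejected,
// or an empty optional when it is accepted.
inline std::optional<std::string> validate_debug_opt(const DriverOptions &o) {
    if (o.device_debug && o.opt_level != 0) {
        return std::string(
            "optimized debugging is not supported, change optimization "
            "level to 0 or disable full debug info");
    }
    return std::nullopt;
}
```

The check rejects rather than silently lowering the optimization level, mirroring the driver's stated policy.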
The PTX-side cost is substantial. --lineinfo adds .loc directives. --device-debug adds DWARF sections, full name preservation, dbg.value intrinsics, and the llvm.nvvm.move value pins that keep debugged values visible across passes. Expect PTX size on the order of ten times larger, compile time several times slower, and SASS that mirrors the unoptimised IR closely enough that cuda-gdb can step through it.
Use this surface when the question is interactive: setting breakpoints in cuda-gdb, watching local variables, stepping through control flow. For post-mortem mapping back to source, --lineinfo is enough and an order of magnitude cheaper.
Surface 3: MLIR IR snapshots
MLIR's standard print-IR flags expose the pipeline's intermediate state. The flags reach Tileiras through the MLIR pass-manager surface; they apply to any MLIR-based compiler, but the scopes named in the output are the Tileiras-specific scopes enumerated in Instrumentation and Action Handler — Scope tree the binary emits.
| Flag | What it prints |
|---|---|
| --mlir-print-ir-after-all | IR after every pass |
| --mlir-print-ir-before-all | IR before every pass |
| --mlir-print-ir-after-failure | IR only when the pipeline fails |
| --mlir-print-ir-module-scope | Print the whole module, not just the changed function |
| --mlir-print-ir-after-change | Print after a pass only when the IR actually changed |
The combination most useful during bring-up is --mlir-print-ir-after-all --mlir-print-ir-module-scope. The default per-pass scope prints only the immediate operation that changed, which truncates context when the pass operates on a single gpu.func inside a multi-function module.
The compile-time cost is dominated by AsmPrinter throughput. The cost scales with module size and pass count; on a multi-kernel module with O3 it is normal for IR printing to dominate compile time. The cost is purely diagnostic — IR printing does not affect emitted PTX.
--mlir-print-ir-after-failure is the cheapest of these flags. It prints only when the pipeline reports failure, which makes it the right default for batch runs where the question is "what did the pipeline look like when it broke?".
Surface 4: Diagnostic stack traces
MLIR diagnostics carry an MLIR Location but no compiler-side call stack by default. The flag --mlir-print-stacktrace-on-diagnostic toggles the engine into attaching a child diagnostic with the literal text "diagnostic emitted with trace:\n" followed by a backtrace of the C++ call stack at the point the diagnostic was emitted. The mechanism is documented in Diagnostic ABI and Helpers.
The cost is per-diagnostic: each emitted diagnostic walks the C++ stack and resolves symbols. For a clean compile the cost is zero. For a compile that emits many warnings, every warning pays the trace cost.
Use this surface when the question is "which pass emitted this diagnostic". The MLIR Location answers "where in the IR" but not "where in the compiler". A scheduler diagnostic that reports a resource violation might come from any of the four placement arms (see Modulo Scheduler and Rau — Placement Arms); the stack trace pins the source frame.
Surface 5: Scheduler decision trace
The pipeline option schedule-trace-file=PATH writes a Chrome-timeline-style JSON file recording every decision the cost-based scheduler made. The writer is the DumpTraceImpl instrumentation enumerated in Instrumentation and Action Handler — Scope tree the binary emits. The option is read once when the pass manager installs instrumentation; setting it after the pipeline starts has no effect.
The trace records the four placement arms — permute, fuse, retry, cost-based — and the per-candidate decisions inside each one: which (op, cycle) pair was tried, which cost vector it produced, which gate rejected it (G1, G2, G3, or G4), and which seat finally committed. The arms are described in Serial vs Cost-Based Generators; the gate ladder is described in the same page's "Pre-commit Gates" section.
The cost is one per-decision JSON record plus a tail write at trace close. On a heavily pipelined kernel the trace is in the tens of megabytes. The format is loadable in Chrome's chrome://tracing UI, but a jq-style filter pass is usually faster than scrolling the timeline view.
Use this surface when the question is "why did the scheduler pick this II?", "why did this op end up at that stage?", or "which gate rejected this candidate seat?".
MlirAction-based instrumentation
The MLIR pass manager exposes two more flags for users who already understand the pass timing model:
| Flag | Effect |
|---|---|
| --mlir-pass-timing | Emits a per-pass wall-clock and CPU breakdown at compile end |
| --mlir-pass-statistics | Emits the pass-internal statistics counters |
Pass timing exposes the pass-instrumentation scope tree directly. Each scope name in the output is one of the scopes enumerated in Instrumentation and Action Handler. The same scope tree is exposed through the C++ instrumentation API for integrators who want callbacks rather than printed reports.
The action surface — the MlirAction mechanism described in Instrumentation and Action Handler — MLIR Actions — is the lower-level handle. Each rewrite, pattern application, and greedy-driver iteration emits an action. A context-level action handler can observe every one of them; without a handler the action surface is a no-op. The mechanism is the right one for tools that need to instrument pattern application without modifying the pass list.
Worked debugging session: wrong WGMMA shape
A user reports that their kernel emits a wgmma.mma_async.sync.aligned.m64n128k16 where they expected m64n256k16. The mismatch shows up in the emitted PTX. The question is which pass is responsible.
Step 1 — bisect the snapshot range. Run with --mlir-print-ir-after-all --mlir-print-ir-module-scope. Search the output for the first wgmma operation. Note which pass produced it. If the wgmma shape is wrong at first appearance, the responsible pass is upstream of the emission point — typically the WGMMA atom-selection logic. If the shape was correct at emission and is later rewritten, the responsible pass is downstream — typically a canonicalisation or layout-refinement pass.
Step 2 — pin the placement. If the wrong shape appears at WGMMA emission, the source is the atom registry in cute_nvgpu — MMA Atoms SM70-120. Re-run with --schedule-trace-file=/tmp/trace.json to see the scheduler's decision. If the scheduler picked a candidate that the atom registry should have rejected, the issue is registry-side. If the scheduler never saw the candidate the user expected, the issue is upstream of the atom registry.
Step 3 — attribute a stray diagnostic. If the IR snapshot output contains an unexpected warning that does not name the emitting pass, re-run with --mlir-print-stacktrace-on-diagnostic. The attached backtrace pins the C++ frame that emitted it.
Step 4 — bisect by opt-level. Re-run with --opt-level=0, --opt-level=1, --opt-level=2, --opt-level=3 in turn. The first level at which the wrong shape appears identifies the pass band — the segments added at that opt-level boundary are listed in Driver and Opt Levels and Pipeline Options Mapping. Combining the bisect result with the snapshot output of Step 1 typically isolates the responsible pass within one or two candidates.
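The two bisections in Steps 1 and 4 can be sketched as small helpers. This is a sketch, not the tool's own logic. It assumes the snapshot output keeps the upstream MLIR dump header (`// -----// IR Dump After <Pass> ...`), and it takes the compile step as a caller-supplied function (the hypothetical `compile_at`) so that no particular command line is implied.

```python
import re

# ASSUMPTION: tileiras keeps the upstream MLIR snapshot header format.
_DUMP_HEADER = re.compile(r"^// -----// IR Dump After (\S+)")


def first_pass_mentioning(dump_text, needle="wgmma"):
    """Return the name of the first pass whose snapshot contains needle."""
    current = None
    for line in dump_text.splitlines():
        match = _DUMP_HEADER.match(line)
        if match:
            current = match.group(1)
        elif current is not None and needle in line:
            return current
    return None


def first_bad_opt_level(compile_at, is_bad, levels=(0, 1, 2, 3)):
    """Return the lowest opt-level whose output satisfies is_bad.

    compile_at(level) is a hypothetical callable that runs the compile
    at one opt-level and returns the captured output as text.
    """
    for level in levels:
        if is_bad(compile_at(level)):
            return level
    return None
```

For the session above, `is_bad` would be a check such as `lambda text: "m64n128k16" in text`.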
Tunables decision matrix
| Symptom | Surface to enable | Cost |
|---|---|---|
| Wrong PTX line in profiler output | --lineinfo | Small (.loc directives) |
| Need to step through with cuda-gdb | --device-debug (forces -O0) | Large (~10x PTX, ~Nx compile) |
| IR is wrong mid-pipeline | --mlir-print-ir-after-all --mlir-print-ir-module-scope | Compile time dominated by AsmPrinter |
| IR may be wrong only on failure | --mlir-print-ir-after-failure | Zero on success |
| Diagnostic source unclear | --mlir-print-stacktrace-on-diagnostic | Per-diagnostic backtrace resolution |
| Schedule looks wrong | --schedule-trace-file=PATH | Tens of MB JSON per kernel |
| Compile is slow | --mlir-pass-timing | Negligible |
| Want only kernel-side output | --dump-host=PATH | Negligible; writes host code separately |
| Want pipeline statistics | --mlir-pass-statistics | Negligible |
The matrix's first principle is to pick the cheapest surface that answers the question. --lineinfo beats --device-debug for post-mortem mapping. --mlir-print-ir-after-failure beats --mlir-print-ir-after-all for batch runs that succeed most of the time. The scheduler trace beats general IR printing when the question is specifically about placement.
Caveats
--device-debug with any --opt-level other than 0 is a hard error. The driver validator emits the verbatim diagnostic shown above and refuses to start the compile. There is no override; an integrator who wants optimised builds with debug-style symbols must use --lineinfo instead, which preserves source locations without disabling optimisation.
--lineinfo and --device-debug set different libNVVM flag dictionaries. The -g flag is added to the libNVVM option channel only under --device-debug; --lineinfo alone does not set it. The flag dictionary is documented in Lowering: Target and Debug Info — Generated Target Fields.
schedule-trace-file is read once at instrumentation install time. Setting it via --pass-pipeline="tileir{schedule-trace-file=...}" after the pipeline has been constructed has no effect. The option must be passed at the same layer as --opt-level.
emit-line-info and --lineinfo are not aliases. The driver flag toggles the pipeline option to one specific enum value (FromInput); the pipeline option has three enum values. An integrator who wants snapshot-tagged line info from a different IR stage must set emit-line-info directly.
--mlir-print-ir-after-all without --mlir-print-ir-module-scope prints only the operation that changed. On a multi-kernel module this can suppress the surrounding context the user actually wants to see. Module-scope is almost always the better default for debugging.
--mlir-pass-timing reports walltime that includes the time spent printing IR if any of the print-IR flags are also enabled. To get a clean pass-timing report, disable the print flags.
Cross-references
Driver CLI Options enumerates the driver-level debugging flags and their pipeline-option counterparts. Instrumentation and Action Handler documents the scope tree the --mlir-pass-timing flag exposes and the action surface that MlirAction-aware tooling consumes. Lowering: Target and Debug Info documents how --lineinfo and --device-debug become LLVM debug-info constructs and which libNVVM options each sets. Diagnostic ABI and Helpers documents the --mlir-print-stacktrace-on-diagnostic engine path. Serial vs Cost-Based Generators and Modulo Scheduler and Rau describe the scheduler whose decisions --schedule-trace-file records. Testing and Observability is the companion page that takes the surfaces enumerated here and applies them to differential, regression, and golden-output test patterns an integrator can build outside the compiler.
Troubleshooting and Known Issues
Abstract
A symptom-to-root-cause catalog. Each entry pairs the user-visible text — the literal stderr line, the exit code, or the silent behavior — with the layer that produced it and the change that resolves it. The catalog also covers diagnostic typos that are part of the public contract (and therefore preserved), wire-format incompatibilities between tileiras's bytecode and upstream MLIR tooling, and a set of behaviors that are not bugs but frequently confuse first-time users.
The page is organized by where in the pipeline the error originates rather
than by what the user typed. Driver-level rejections fire before any pass
runs and produce exit codes 2 or 3 from
Driver Program Handle.
Verifier failures fire inside the pass manager and produce exit code 5.
Codegen failures surface as either an LLVM report_fatal_error (abort) or
an exit code 5 carrying a ptxas-shaped tail. The verifier-ladder positioning
of each layer is documented in
Correctness Layers, and the exit-code contract is
documented in Error Handling and Diagnostics.
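A harness that buckets failures by originating layer might start from a table like the one below. It is a sketch built only from the codes named on this page (2, 3, and 5). Treating 0 as success is the usual convention and is labeled as an assumption; codes 1 and 4 are left unclassified because this page does not describe them. The negative-value branch relies on Python's `subprocess` convention of reporting a signal-killed child as a negative return code, which is how an abort from report_fatal_error would appear on POSIX hosts.

```python
EXIT_CLASSES = {
    0: "success",          # ASSUMPTION: conventional meaning of 0
    2: "configuration",    # driver validator rejection
    3: "bytecode-shape",   # bytecode magic / envelope rejection
    5: "compile-failure",  # verifier, codegen, or ptxas failure
}


def classify_exit(returncode):
    """Map a subprocess return code to a coarse failure class."""
    if returncode < 0:
        # Python reports "killed by signal N" as -N on POSIX.
        return "signal"
    return EXIT_CLASSES.get(returncode, "unclassified")
```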
Symptom-driven index
| If you see | Read |
|---|---|
| failed to parse IR bytecode (it looks like MLIR bytecode instead) | Bytecode parse failures |
| unknown attribute tag from the bytecode reader | Bytecode parse failures |
| unsupported GPU target, invalid optimization level | Driver-level rejections |
| optimized debugging is not supported | Driver-level rejections |
| could not find libdevice | Driver-level rejections |
| op expects ... arguement types to match ... (note the typo) | Verifier failures |
| Formal parameter space overflowed (X bytes required, max Y bytes allowed) | Verifier failures |
| Not a canonical UMMA_MN Layout: No flat offset mode | Verifier failures |
| colletor::a (note the typo) in tcgen05 diagnostics | Verifier failures, Known typos |
| Cannot find WGMMA in selection, sm_90 target | Codegen and backend failures |
| Function uses too much shared data, ptxas stderr | Runtime and ptxas failures |
| Cluster launch silently fails or aborts at runtime | Gotchas |
| TMA descriptor produces garbage output | Gotchas |
Bytecode parse failures
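The bytecode reader is the first stage every invocation passes through. Three failure modes surface here: the first two are returned as driver exit code 3, and the third, which fires after the magic check has succeeded, is returned as exit code 5.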
Symptom. input does not correspond to Tile IR bytecode. Exit code 3.
Cause. The magic word at the head of the file is not the Tile IR magic.
The driver checked looks_like_mlir_bytecode first and that probe also
failed; the file is neither Tile IR bytecode nor upstream MLIR bytecode.
Fix. Verify the producer. Tile IR bytecode comes from the frontend's
serializer; a file with a .bc extension is almost certainly LLVM bitcode,
which belongs to a different stage of the toolchain (the LLVM/NVVM layer), not tileiras.
Symptom. failed to parse IR bytecode (it looks like MLIR bytecode instead).
Exit code 3.
Cause. The magic-word probe matched upstream MLIR bytecode. This is
almost always a wire-format incompatibility (see
Wire-format incompatibilities below): the
caller used mlir-translate --serialize-bytecode or
mlir-opt --emit-bytecode on a module that happens to import Tile IR
dialects, and the resulting file uses upstream AttrTag numbering rather
than tileiras's.
Fix. Re-serialize through the tileiras-aware bytecode writer. The
frontend's emission path is described in
Frontend Contract and Tile IR Emission;
the bytecode envelope itself is in
MLIR Bytecode Format.
Symptom. unknown attribute tag N from the bytecode reader, with N an
integer. Exit code 5 (parser failures surface after the magic check
succeeded, so the driver classes them as compile failures rather than
configuration failures).
Cause. The bytecode envelope passed the magic check but a per-dialect
reader hit an AttrTag value it does not recognize. Tileiras's
Dialect Reader/Writer Status
table shows which dialect-specific readers know which tag ranges; a tag
outside the recorded ranges typically means the producer is from a
different tileiras snapshot.
Fix. Pin the producer and the consumer to the same tileiras revision.
The AttrTag numbering is not stable across snapshots.
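A pre-flight check in the producer's build can catch the two most common wrong-file cases before tileiras is invoked. The sketch below recognizes only formats whose magic bytes are publicly known: raw LLVM bitcode (`BC` followed by 0xC0 0xDE) and upstream MLIR bytecode (`ML` 0xEF `R`). The Tile IR magic is deliberately not reproduced here, so the function answers "is this one of the known foreign formats?" and nothing more.

```python
LLVM_BITCODE_MAGIC = b"BC\xc0\xde"   # raw LLVM bitcode header
MLIR_BYTECODE_MAGIC = b"ML\xefR"     # upstream MLIR bytecode header


def classify_header(head: bytes) -> str:
    """Classify the first bytes of a candidate input file.

    Returns "llvm-bitcode", "upstream-mlir-bytecode", or "other".
    "other" is not a verdict that the file is valid Tile IR bytecode;
    it only means neither foreign magic matched.
    """
    if head.startswith(LLVM_BITCODE_MAGIC):
        return "llvm-bitcode"
    if head.startswith(MLIR_BYTECODE_MAGIC):
        return "upstream-mlir-bytecode"
    return "other"
```

A caller would read the first few bytes of the file (for example `open(path, "rb").read(8)`) and refuse to launch the compile when the result is "llvm-bitcode".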
Driver-level rejections
The driver validator runs before any pass begins and rejects ill-formed inputs with exit code 2 (configuration) or 3 (bytecode shape). The verbatim text appears in Driver CLI Options — Validation Algorithm.
Symptom. unsupported GPU target. Exit code 2.
Cause. --gpu-name was set to a string the driver's accept table does
not list. The accept table covers sm_100, sm_103, sm_110, sm_120,
sm_121. Note in particular that sm_90 is not in the driver's accept
table — Hopper is reachable only through a frontend that writes the
nv_tileaa.compute_capability module attribute directly.
Fix. Pick a supported spelling. The full list is in
Driver CLI Options — Enum-valued Options;
the subtarget mechanism that turns the spelling into .target sm_NNa
is documented in
PTX Version and Target Selection.
Symptom. invalid optimization level. Exit code 2.
Cause. --opt-level was outside 0..3. The driver checks (uint32_t)opt > 3
so negative values appear as huge unsigned values and also fail.
Fix. Use 0, 1, 2, or 3. The driver and pipeline use different
defaults (3 vs 2); both are valid on the CLI.
Symptom. optimized debugging is not supported, change optimization level to 0 or disable full debug info.
Exit code 2.
Cause. --device-debug (or -g) was combined with --opt-level != 0.
Full device debug injects NVVM debug options that disable several
code-motion and block-merge transforms; the driver rejects the
combination rather than silently degrading the build.
Fix. Either drop -g or set -O0. For lighter source-line context
without disabling optimization, use --lineinfo instead.
Symptom. could not find libdevice.bc, or a missing-symbol error from
the math pipeline after the NVVM-Reflect pass.
Cause. Tileiras resolves the libdevice path from the environment
variables CUDA_HOME, CUDA_PATH, CUDA_ROOT, and the install-relative
fallback; when none of them point at a directory containing
libdevice.10.bc, the math pipeline cannot link the device math
intrinsics and ends up with unresolved symbols at the NVVM verifier
layer.
Fix. Export one of CUDA_HOME, CUDA_PATH, or CUDA_ROOT to the
CUDA install root that ships nvvm/libdevice/libdevice.10.bc. The full
env-var contract is in
Env Vars and Runtime Gates;
the link order is covered in Libdevice Overview.
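The environment lookup can be reproduced outside the compiler to check a build machine before a compile is attempted. This is a sketch under assumptions: the precedence order CUDA_HOME, CUDA_PATH, CUDA_ROOT is assumed from the order in which the variables are listed above, and the install-relative fallback is not modeled. The `is_file` parameter is a hypothetical hook that exists only to make the function testable without a CUDA install.

```python
import os
from pathlib import Path

ENV_VARS = ("CUDA_HOME", "CUDA_PATH", "CUDA_ROOT")  # ASSUMED precedence
LIBDEVICE_REL = Path("nvvm") / "libdevice" / "libdevice.10.bc"


def find_libdevice(env=None, is_file=Path.is_file):
    """Return the first existing libdevice path, or None.

    env defaults to the process environment; is_file is injectable
    so the lookup can be exercised without touching the filesystem.
    """
    env = os.environ if env is None else env
    for name in ENV_VARS:
        root = env.get(name)
        if not root:
            continue
        candidate = Path(root) / LIBDEVICE_REL
        if is_file(candidate):
            return candidate
    return None
```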
Symptom. unsupported host operating system, unsupported host architecture. Exit code 2.
Cause. --host-os or --host-arch was a string outside the accept
table (linux, windows and x86_64, aarch64, arm64ec respectively).
Fix. Pick a supported spelling. There is no autodetection on the CLI
surface; the defaults are platform-derived but the CLI parser rejects
arbitrary strings.
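The rejections in this section can be mirrored in a wrapper script so that bad configurations fail before the binary is launched. The sketch below uses only the accept tables and message texts listed above. The order of the checks is an assumption, the trailing punctuation of each message should be confirmed against real stderr, and the case where --device-debug is given without an explicit --opt-level is not modeled.

```python
GPU_TARGETS = {"sm_100", "sm_103", "sm_110", "sm_120", "sm_121"}
HOST_OS = {"linux", "windows"}
HOST_ARCH = {"x86_64", "aarch64", "arm64ec"}


def validate(gpu_name, opt_level, device_debug=False,
             host_os="linux", host_arch="x86_64"):
    """Return (exit_code, message); (0, None) means accepted."""
    if gpu_name not in GPU_TARGETS:
        return 2, "unsupported GPU target"
    # Mirror the (uint32_t)opt > 3 check: negative values wrap to
    # large unsigned values and are rejected as well.
    if (opt_level & 0xFFFFFFFF) > 3:
        return 2, "invalid optimization level"
    if device_debug and opt_level != 0:
        return 2, ("optimized debugging is not supported, change "
                   "optimization level to 0 or disable full debug info")
    if host_os not in HOST_OS:
        return 2, "unsupported host operating system"
    if host_arch not in HOST_ARCH:
        return 2, "unsupported host architecture"
    return 0, None
```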
Verifier failures
Verifier diagnostics fire inside the pass manager and propagate to the driver as exit code 5. The text is part of the public contract — frontends and tests key off the exact spelling. The verifier ladder (per-op, between-pass, named, NVVM IR) is described in Correctness Layers.
Symptom. op expects ... arguement types to match with producer types ....
Note the verbatim typo arguement.
Cause. A region-bearing op (typically nv_tileas.async.pipeline.consume_one)
was left with a region-argument list that does not match its paired producer's
result list. This is almost always a partial-rewrite leftover from a pass that
aborted between the producer and consumer rewrites.
Fix. Identify the pass that touched the op last; the between-pass
verifier names it. The verbatim diagnostic is preserved with the typo;
see Known typos and
nv_tileas Verifiers — Region-Op Verifier Template.
Symptom. Formal parameter space overflowed (X bytes required, max Y bytes allowed) in function Z.
Cause. The kernel's by-value parameter struct, summed with target
alignment rules, exceeds the SM's parameter-space ceiling. For sm_75 the
ceiling is 1024 bytes; for sm_80 it is 32760 bytes; the per-SM table is
in NVVM IR Verifier.
Fix. Restructure the kernel signature. Move the bulk of the data
behind a global-memory pointer, split into two kernels, or pack with
explicit alignment. The diagnostic is the actionable form; the ptxas
analogue is far less specific.
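A frontend can estimate the parameter footprint before emitting the kernel. The sketch below assumes the simplest layout rule, in which each parameter is placed at the next multiple of its own alignment; that rule is an assumption for illustration, not something recovered from the verifier. The ceiling is supplied by the caller from the per-SM table rather than hard-coded.

```python
def param_space_bytes(params):
    """Total bytes for a list of (size, align) parameter pairs.

    ASSUMPTION: each parameter starts at the next multiple of its
    alignment, with no trailing padding added at the end.
    """
    offset = 0
    for size, align in params:
        offset = (offset + align - 1) // align * align  # round up
        offset += size
    return offset


def overflows(params, ceiling):
    """True when the estimated footprint exceeds the given ceiling."""
    return param_space_bytes(params) > ceiling
```

For example, a 1-byte flag followed by an 8-byte pointer occupies 16 bytes under this rule, because the pointer is pushed to offset 8.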
Symptom. Not a canonical UMMA_MN Layout: No flat offset mode.
Cause. A cute_nvgpu layout passed to a UMMA verifier does not match
the verifier's expected canonical shape. The verifier walks the layout's
mode tree and requires a flat offset mode at a specific position.
Fix. Have the frontend emit the canonical layout shape, or run a
layout-canonicalization pass before the consumer. The mode-pattern
contract is in
cute_nvgpu Mode-Pattern Verifiers.
Symptom. expects #C element type to be f32, but got <type>.
Cause. A WGMMA or tcgen05.mma op was built with a non-f32
accumulator type. The verifier rejects accumulator types its emission
template does not have a code path for.
Fix. Use f32. The accumulator-type matrix is in
WGMMA Emission Protocol and
tcgen05 Tensor Memory Model.
Symptom. expected TileType for block arguments but got types: ....
Cause. The frontend built a block whose entry argument types do not
match the surrounding tile op's contract. This typically means the
emitter passed an LLVM-tier type into a Tile-tier region.
Fix. Run the frontend's tile-type fixup before the tileiras entry
point; the tile-type discipline is described in
Frontend Contract and Tile IR Emission.
Symptom. 'tcgen05.alloc' op expects colletor::a layout (note the
typo colletor::a).
Cause. A tcgen05 allocation was passed a layout that does not match
the expected collector shape.
Fix. Search source by the literal typo, not by the corrected
spelling. The full diagnostic catalog for the dialect is in
tcgen05 Ops.
Codegen and backend failures
These fire during MLIR-to-LLVM conversion or NVPTX instruction selection.
Symptom. WGMMA requires sm_90a, or an unhelpful "Cannot select"
message from the NVPTX selector.
Cause. WGMMA is arch-conditional. Plain sm_90 does not enable the
WGMMA instruction set; the a suffix on the target string is required.
Tileiras's driver does not accept sm_90a as a --gpu-name value — the
suffix is selected by the frontend through the
nv_tileaa.compute_capability module attribute.
Fix. Have the frontend write the arch-conditional attribute. The
mechanism is in PTX Version and Target Selection.
Symptom. Cannot select tcgen05.* intrinsic, or a similar selector
error.
Cause. tcgen05 instructions are introduced at sm_100. A target below
sm_100 cannot legalize them.
Fix. Use --gpu-name=sm_100 (and ensure the frontend selects
sm_100a if the program uses arch-conditional variants). The per-SM
intrinsic matrix is in
NVPTX Subtarget and Feature Matrix.
Symptom. unsupported tma load mode '<mode>', where the mode name
contains im2col.
Cause. The frontend emitted an im2col TMA descriptor against a target
that does not implement that descriptor variant. Im2col was added later
than the basic tiled mode and is not available on every SM that has TMA.
Fix. Pick the basic tiled mode or pick a target that supports im2col.
The atom registry is in TMA Atoms.
Symptom. Compilation succeeds but dsmem operations produce zero or
garbage at runtime on sm_80.
Cause. Distributed shared memory (dsmem) requires sm_90 or higher.
Lowering does not reject the op on sm_80 at the dialect level; the lowered
PTX exists but the hardware path is absent.
Fix. Target sm_90 or higher. The DSMEM handshake is documented in
Cluster Sync and DSMEM Handshake.
Runtime and ptxas failures
Tileiras invokes ptxas as a subprocess. ptxas diagnostics surface through the harness's stderr capture; the driver returns exit code 5 carrying the ptxas text.
Symptom. ptxas: error: Function '<name>' uses too much shared data.
Cause. The kernel's static + dynamic shared-memory footprint exceeds
the per-SM shared-memory ceiling. Tileiras's smem-accounting does not
re-check the ceiling after pipeline buffer assignment, so the limit
surfaces only at ptxas.
Fix. Reduce SMEM pressure — fewer pipeline stages, smaller tile
sizes, or split into two kernels. See
Buffer Assignment and Named-Barrier Binding.
Symptom. ptxas: error: Multiple kernel definitions for the same
function name.
Cause. The frontend emitted the same nvvm.kernel function twice,
typically because two upstream modules with the same kernel name were
merged before tileiras saw them.
Fix. Deduplicate at the frontend. Tileiras does not rename to break
collisions.
Symptom. ptxas: error: Address out of range, or any internal ptxas
assertion.
Cause. Almost always a tileiras codegen bug. ptxas does not normally
diagnose the producer; an Address-out-of-range from ptxas usually means
the AsmPrinter emitted an out-of-range immediate.
Fix. Report. Capture the full invocation (see
Reporting a bug).
Symptom. ptxas exits non-zero with no stderr text.
Cause. ptxas crashed (signal exit) rather than failing cleanly. The
subprocess harness reports the non-zero status but cannot reconstruct a
diagnostic from a signal-killed child.
Fix. Re-run ptxas directly on the PTX text tileiras emitted; the harness
logs the argv on debug builds. The handoff protocol is documented in
ptxas Handoff Protocol.
Known typos in diagnostic strings
The following typos are present in the binary's diagnostic strings and are preserved across snapshots. Downstream tooling (log scrapers, test-failure classifiers, frontend translation tables) keys on the verbatim text, so a corrected spelling would be a wire-format-style break. When grepping the binary, the build directory, or production logs, use the typo'd form; the corrected form will not match.
| Verbatim string in binary | Corrected English | Where it fires |
|---|---|---|
| colletor::a | collector::a | tcgen05 layout verifier in tcgen05 Ops |
| arguement (e.g. region arguement types to match) | argument | region-op verifiers in nv_tileas Verifiers |
| types to be match | types to match | several pattern-side rewrite diagnostics |
| paramater (occasional) | parameter | a small number of parser callouts |
| succeded | succeeded | a single info-class message from the pass instrumentation |
A reimplementer who silently corrects the spelling produces a binary whose diagnostics no longer line up with the recovered binary's downstream consumers. The corrections must be a coordinated change at the consumer side first.
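A log scraper that keys on the verbatim forms might look like the sketch below. The marker strings are copied from the table above; the origin labels are shortened descriptions of its third column, and the function name is illustrative only.

```python
# Verbatim typo'd substrings mapped to the layer that emits them.
VERBATIM_MARKERS = {
    "colletor::a": "tcgen05 layout verifier",
    "arguement": "region-op verifier",
    "types to be match": "pattern-side rewrite diagnostic",
    "paramater": "parser callout",
    "succeded": "pass instrumentation info message",
}


def classify_line(line):
    """Return the origins whose verbatim marker appears in line.

    Matching is by exact substring on purpose: the corrected
    spellings must not match.
    """
    return [origin for needle, origin in VERBATIM_MARKERS.items()
            if needle in line]
```

A line containing the corrected spelling `collector::a` returns an empty list, which is the behavior the preserved-typo contract requires.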
Wire-format incompatibilities
Tileiras's bytecode envelope reuses upstream MLIR's container format but the per-dialect AttrTag numbering is tileiras-specific. Mixing upstream MLIR tooling with tileiras bytecode produces files that pass the magic check but fail at the per-dialect reader.
Pitfall. mlir-translate --serialize-bytecode on a module that
imports cuda_tile, nv_tileaa, nv_tileas, cute, or cute_nvgpu
emits the upstream AttrTag numbering. When tileiras reads it, the
per-dialect reader sees a tag value outside its accept range and emits
unknown attribute tag N. Use the tileiras-aware serializer that
ships with the frontend.
Pitfall. mlir-opt --emit-bytecode produces the same shape and
fails the same way. The --emit-bytecode flag is in upstream
mlir-opt; the tileiras driver does not expose an --emit-bytecode
mode because it consumes bytecode rather than producing it.
Pitfall. Bytecode produced by an older tileiras snapshot may fail
the version word check at the envelope level (unsupported version),
even if every dialect tag would otherwise be readable. The envelope
version is independent of the dialect AttrTag versioning, and both must
match.
Pitfall. Mixing two tileiras snapshots' bytecode in a single multi-module compile (for example, by linking a prebuilt library bytecode with a newly emitted module) is not supported. The dialect tag numbering can shift between snapshots without an envelope-level version bump.
The full bytecode envelope is documented in MLIR Bytecode Format; per-dialect tag range support is in Dialect Reader/Writer Status.
Gotchas
These are not bugs. They are mechanism details that routinely trip up first-time users.
Cluster dim must be a power of 2. The cluster verifier accepts only
1, 2, 4, or 8 for each cluster-dim component, with the product
bounded by 16. A frontend that emits cluster_dim = [3, 1, 1] fails
the verifier. The hardware does not implement non-power-of-2 cluster
shapes; the verifier is enforcing a hardware constraint, not a
tileiras convention. See
Cluster Sync and DSMEM Handshake.
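A frontend can apply the same rule before emission. The sketch below encodes exactly the two conditions stated above (each component in {1, 2, 4, 8}, product at most 16); the function name is illustrative.

```python
from math import prod

ALLOWED_COMPONENTS = (1, 2, 4, 8)
MAX_CLUSTER_PRODUCT = 16


def cluster_dim_ok(dims):
    """Check a cluster_dim triple against the rule described above."""
    return (all(d in ALLOWED_COMPONENTS for d in dims)
            and prod(dims) <= MAX_CLUSTER_PRODUCT)
```

Under this rule `[8, 2, 1]` passes, while `[3, 1, 1]` fails the per-component test and `[8, 4, 1]` fails the product bound.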
TMA descriptor alignment is 128 bytes. The TMA descriptor passed to
cp.async.bulk.tensor must be aligned to 128 bytes in host memory.
Tileiras's host-side descriptor mutator assumes the alignment; a
misaligned descriptor produces incorrect copies at runtime rather than
a clean error. The frontend's descriptor builder is responsible for the
alignment; if the descriptor lives in a host allocator that does not
honor the alignment, the descriptor builder must over-allocate and
align manually.
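The over-allocate-and-align step can be illustrated with `ctypes`. This is a sketch of the general technique, not the frontend's descriptor builder: it reserves `align - 1` spare bytes, finds the first aligned address inside the buffer, and returns a view that starts there. The raw buffer is returned alongside the view so that the caller keeps the allocation alive.

```python
import ctypes

TMA_DESC_ALIGN = 128  # alignment stated in the paragraph above


def alloc_aligned(size, align=TMA_DESC_ALIGN):
    """Return (raw, view): view is a size-byte window whose address
    is a multiple of align, carved out of the over-allocated raw."""
    raw = ctypes.create_string_buffer(size + align - 1)
    base = ctypes.addressof(raw)
    offset = (-base) % align          # bytes to skip to reach alignment
    view = (ctypes.c_ubyte * size).from_buffer(raw, offset)
    return raw, view
```

The same arithmetic applies in any host language: skip `(-base) % align` bytes from the start of the allocation.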
WGMMA accumulator reads inside the body are silent UB. The WGMMA
accumulator can only be read after wgmma.wait_group N completes for
the relevant group. Reading the accumulator earlier — inside the
warp-group body before the matching wait — produces no diagnostic and
no compile-time rejection; the read silently returns stale data. The
read-after-wait discipline is documented in
WGMMA Emission Protocol.
⚡ QUIRK — accumulator-before-fence_async is silent UB, not a verifier error
The natural assumption is that any verifier that knows about WGMMA also knows that the accumulator is asynchronously written and would diagnose a too-early read. It doesn't. The mbarrier/fence_async ordering is enforced only at runtime by the hardware's async-proxy ordering rules; the MLIR verifier accepts a use of the accumulator SSA value at any point after the WGMMA op, including between wgmma.commit_group and wgmma.wait_group N. The read compiles, runs, and returns whatever bits the accumulator register held before the WGMMA retired: usually the previous iteration's result, occasionally garbage from a sibling warp-group's register reuse. There is no --Werror flag that catches it; the discipline is a frontend obligation.
--use-fast-math enables FTZ even when no op carries fast-math
flags. The driver's fast-math flag is a global on/off; it does not
key off per-op MLIR fast-math metadata. A program that writes
fast-math-free MLIR but compiles with --use-fast-math (or the
pipeline option ftz=true) still emits FTZ-mode arithmetic. To get
non-FTZ arithmetic, leave the global flag off.
⚡ QUIRK — --use-fast-math is a module-wide FTZ master switch with no per-function escape hatch
Upstream LLVM exposes FTZ as a function attribute (denormal-fp-math) plus per-op FastMathFlags, so a single hot kernel can opt in while the rest of the module stays IEEE. Tileiras's driver flag short-circuits that: enabling --use-fast-math writes the unsafe-fp-math Function attribute onto every function it lowers, and the case-0x66 FMA selector reads that attribute and picks the FTZ opcode unconditionally. There is no __attribute__((noflush)) or per-function override that reverses the global flag; a function that needs IEEE-denormal arithmetic must be compiled in a separate invocation with the flag off, then linked in. The same applies to the pipeline option ftz=true.
Libdevice link order matters. NVVMReflect must run before
always-inline. If a pipeline rearrangement moves always-inline above
NVVMReflect, libdevice's __nvvm_reflect calls get inlined with
unresolved arguments and the math intrinsics emit fallback bodies
rather than fast paths. The pass order is in
NVVMReflect Mechanism.
--gpu-name does not accept the a or f suffix. The driver's
parser only matches the bare sm_NN form. Architecture-conditional
selection (sm_90a, sm_100a, etc.) is decided by the frontend's
module attribute and combined with --gpu-name to pick the final
.target line. A user trying to "force" sm_100a from the CLI cannot
do so; the frontend must write the attribute.
⚡ QUIRK — --gpu-name=sm_90a is silently rejected, not diagnosed
The driver's --gpu-name accept table contains only the bare numeric forms (sm_100, sm_103, sm_110, sm_120, sm_121); the family- and arch-conditional suffixes (a, f) that downstream tools like ptxas accept are not in this table. A user who writes --gpu-name=sm_90a does not get an "unknown target" diagnostic: the parser either ignores the unrecognised string and falls back to the default or rejects it with a generic "unrecognised --gpu-name" message that does not mention the suffix. The architecture-conditional .target sm_90a line is written by the frontend's module attribute, not the CLI, so the only way to compile for sm_90a is to set that attribute and pass --gpu-name=sm_90. However, sm_90 is also missing from the accept table on the current driver (Blackwell-first), so Hopper builds bypass --gpu-name entirely.
The driver --opt-level default differs from the pipeline
opt-level default. Driver default is 3; pipeline default is 2.
A user invoking the pipeline directly through --pass-pipeline
without specifying opt-level gets 2; a user invoking the driver
without --opt-level gets 3. The mismatch is documented but is still a
common source of "I changed nothing and the output changed" reports.
⚡ QUIRK — driver and pipeline disagree on the default opt-level
Two entry points to the same compiler read different defaults from different option tables: --opt-level on the driver defaults to 3, while opt-level= inside --pass-pipeline defaults to 2. Identical IR therefore produces different PTX depending on whether the user invoked the driver or hand-rolled the pipeline, with no warning either way. This is the canonical "I changed nothing and the output changed" failure mode; pin opt-level= explicitly in both invocations to make builds reproducible across entry points.
sm_90 is not in the driver's accept table. The driver targets
Blackwell as its primary deployment surface. Hopper builds go through
a frontend that writes compute_capability directly and bypass the
--gpu-name table.
Reporting a bug
A useful bug report contains five artifacts. Capture them at the time of the failure; reconstructing them after the fact is significantly harder.
- The exact tileiras invocation. The full argv vector and the environment variables CUDA_HOME, CUDA_PATH, CUDA_ROOT, LD_LIBRARY_PATH, and any TILE* variable. The env-var catalog is in Env Var and Runtime Gate Catalog.
- The input Tile IR. Either the original bytecode buffer or the textual form produced by --mlir-print-ir-before-all (which dumps the IR before every pass; the last entry before the crash is the one that mattered). See Debugging and Introspection.
- The full stderr output. Tileiras's diagnostics arrive in emission order; the first diagnostic is usually the actionable one even when later diagnostics look more dramatic.
- The scheduler trace, if the failure is scheduling-related. Pass schedule-trace-file=<path> through --pass-pipeline to emit the Chrome-timeline JSON. The format is described in Modulo Driver and 4-Arm OR-Chain.
- The SM target, CUDA version, and host platform. SM target is in the argv; CUDA version comes from ${CUDA_HOME}/version.txt or nvcc --version; host platform is uname -srm on Linux, cmd /c ver on Windows.
A report that omits any of these forces the maintainer to ask for it before any diagnosis can begin; one with all five usually permits an immediate root-cause hypothesis.
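The invocation and host-platform artifacts are easy to capture automatically from a wrapper. The sketch below records the argv, the environment variables named in the first bullet plus anything starting with TILE, and the host platform as reported by Python's `platform` module. The function name and the dictionary layout are illustrative; the input IR, stderr text, and scheduler trace still have to be saved separately.

```python
import os
import platform

NAMED_VARS = ("CUDA_HOME", "CUDA_PATH", "CUDA_ROOT", "LD_LIBRARY_PATH")


def capture_report_context(argv, env=None):
    """Collect argv, relevant env vars, and host platform in one dict."""
    env = os.environ if env is None else env
    kept = {name: env[name] for name in NAMED_VARS if name in env}
    # Any TILE* variable is included, per the checklist above.
    kept.update({k: v for k, v in env.items() if k.startswith("TILE")})
    return {
        "argv": list(argv),
        "env": kept,
        "host": {
            "system": platform.system(),
            "release": platform.release(),
            "machine": platform.machine(),
        },
    }
```

Serializing the result with `json.dumps` next to the captured stderr keeps the report self-describing.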
Cross-references
Error Handling and Diagnostics is
the canonical reference for the diagnostic engine, the pass-failure
handshake bit, and the five driver-level exit codes; the symptoms in
this page are organized by the layer that produces them in that page's
architecture.
Correctness Layers places each verifier-level
failure mode in its layer (per-op, between-pass, named, NVVM IR,
ptxas), explaining why the same overflow can appear with two different
diagnostic texts at two different layers.
Driver CLI Options is the source for every
driver-level rejection string; the validation algorithm there matches
the Driver-level rejections section
line-for-line.
Debugging and Introspection is the
primary reference for the introspection flags
(--mlir-print-ir-before-all, --mlir-print-ir-after-failure,
schedule-trace-file, dump-host) used to assemble a bug report.
MLIR Bytecode Format and
Dialect Reader/Writer Status
document the envelope and AttrTag contracts that the
Wire-format incompatibilities section
describes from the user's side.
Frontend Contract and Tile IR Emission
documents the producer side of the same wire-format contract and the
tile-type discipline a frontend must observe to avoid the verifier
failures in this page's catalog.
Env Vars and Runtime Gates
covers the CUDA_HOME / CUDA_PATH / CUDA_ROOT resolution that the
libdevice gotcha turns on.
Testing and Observability takes the
verbatim diagnostic strings catalogued here — including the preserved
typos in Known typos — and shows
how to pin them as golden-test assertions so a downstream regression
suite detects diagnostic-catalog drift between snapshots.
Testing and Observability
Abstract
Tileiras ships no public test harness. The binary is stripped, its source is closed, and there are no exposed unit-test entry points. What the compiler does expose is an observation surface wide enough that an integrator can build a complete validation suite around it from outside: stderr is the diagnostic catalog, exit codes are the pass/fail oracle, --mlir-print-ir-after-all is a per-pass IR snapshot, --schedule-trace-file is a per-decision scheduler log, --mlir-pass-timing is a per-pass walltime breakdown, and the emitted PTX is the final golden artifact. Every layer that produces a diagnostic emits a verbatim string the user can pin tests against; every pass that mutates IR emits a snapshot the user can diff.
This page enumerates each observation surface, the test pattern it supports, and the failure modes a regression suite built on those surfaces will and will not catch. The principle running through all of it is that tileiras is a black box with six honest windows, and a tester who knows where each window faces can build robust validation without source.
Observable surfaces
The six surfaces below are the only mechanisms by which tileiras's behavior reaches a test harness. Everything else (the compiler's internal cost evaluations, the verifier's branch-by-branch decisions, the random tie-break order) is inaccessible from outside the binary.
| Surface | Output | Test pattern it enables | Cost when enabled |
|---|---|---|---|
| Stderr diagnostics | Verbatim error and warning strings | Diagnostic golden tests | None on success, per-message on failure |
| Exit code | Integer 0..5 from Driver Program Handle | Pass/fail classification, error-class bucketing | None |
| Pipeline IR snapshots | Post-pass MLIR text from --mlir-print-ir-after-all | Per-pass invariant validation, regression bisection | AsmPrinter throughput per snapshot point |
| Pass timing | Per-pass wall-clock from --mlir-pass-timing | Performance regression detection | Negligible |
| Scheduler trace | Per-decision JSON log from --schedule-trace-file=PATH | Scheduling-determinism validation, gate-rejection bucketing | Tens of MB per heavily pipelined kernel |
| Generated PTX | Final stdout text | Output-level golden tests | Always emitted |
| Symbol dumps | None (binary is stripped) | — | — |
The stripped-binary line is not cosmetic. nm tileiras returns the dynamic-symbol table only; objdump --syms reports the same minimal set. A tester cannot identify which internal function emitted a diagnostic by symbol — only by the diagnostic's verbatim text and, when --mlir-print-stacktrace-on-diagnostic is enabled, by a backtrace whose frames resolve to address-only lines. The diagnostic text is the only stable identifier.
Differential testing patterns
Each pattern below uses one or more of the surfaces above. The patterns compose — a regression suite typically chains several of them — but each has a single dominant failure mode it is designed to catch.
Pattern 1: Output golden tests
Compile a small fixed input with a fixed --gpu-name, --opt-level, and pipeline option set. Capture stdout. Diff against a previously recorded golden PTX file. Any pipeline change that affects emission reveals itself as a non-empty diff.
The diff itself is not the verdict. PTX text has elements that legitimately vary across builds: comment headers carrying timestamps, virtual-to-physical register naming inside .reg declarations, label numbering inside basic-block tails. A robust harness normalizes those before comparison. The structural diff — the instruction sequence, the launch-bound directives, the parameter declarations — is what matters.
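The normalization step can be sketched as a small text filter. This is a minimal sketch, not a tileiras-defined contract: the three regular expressions are assumptions about which parts of the PTX text are cosmetic (full-line comments, numbered register names, numbered basic-block labels), and `normalize_ptx` is a hypothetical helper name.

```python
import re

# Assumed cosmetic elements; adjust the patterns to the PTX actually observed.
_COMMENT = re.compile(r"^\s*//.*$")          # full-line comments (headers, timestamps)
_REGISTER = re.compile(r"%[A-Za-z_]+\d+")    # %r12, %rd3, %p0 (not %tid.x, not %r<42>)
_LABEL = re.compile(r"\$?L?_*BB\d+_\d+")     # $L__BB0_42 or BB0_42


def _renamer(table, prefix):
    """Build a substitution callback that hands out tokens in first-seen order."""
    def replace(match):
        name = match.group(0)
        return table.setdefault(name, f"{prefix}{len(table)}")
    return replace


def normalize_ptx(text):
    """Drop comments and blank lines; replace register and label names with opaque tokens."""
    registers, labels = {}, {}
    out = []
    for line in text.splitlines():
        if not line.strip() or _COMMENT.match(line):
            continue
        line = _REGISTER.sub(_renamer(registers, "%R"), line)
        line = _LABEL.sub(_renamer(labels, "LBL"), line)
        out.append(line.strip())
    return "\n".join(out)
```

Because tokens are assigned in first-seen order, two outputs that differ only in register or label numbering normalize to the same text, while a changed opcode or operand order still produces a diff.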
Golden updates are reviewable code-review artifacts. A diff that changes a single .maxntid directive is a one-line review; a diff that rewrites every WGMMA shape is a structural-change review. Treating goldens as VCS-tracked source rather than as autogenerated noise keeps the review loop honest.
Pattern 2: Diagnostic golden tests
For each diagnostic in the catalog from Troubleshooting and Known Issues — Symptom-driven index, construct a small input designed to trigger it. Compile. Assert that the captured stderr contains the verbatim diagnostic string and that the exit code is the documented one.
The pattern verifies that the diagnostic catalog stays stable across snapshots. Tileiras's diagnostic strings are part of the public contract — downstream log scrapers, frontend translation tables, and CI failure classifiers key off the verbatim text, including the preserved typos (arguement, colletor::a, succeded) enumerated in Troubleshooting — Known typos. A test that silently passes after a diagnostic text changes is a false negative; matching the literal string with the typo is correct.
A useful refinement: pin the exit code along with the text. optimized debugging is not supported always exits 2; unknown attribute tag N, a wire-format failure, always exits 3. A test that asserts both catches the case where the text stays but the classification drifts.
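A diagnostic golden can be represented as a pair of verbatim text and exit code, checked together. The sketch below is one possible shape, assuming the harness has already captured stderr and the exit code; `DiagnosticGolden` and `check_golden` are hypothetical names, not part of any tileiras tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticGolden:
    text: str        # verbatim diagnostic string, typos included
    exit_code: int   # documented exit code for this diagnostic


def check_golden(golden, stderr, exit_code):
    """Return a list of mismatches; an empty list means the test passes."""
    problems = []
    if golden.text not in stderr:
        problems.append(f"missing diagnostic text: {golden.text!r}")
    if exit_code != golden.exit_code:
        problems.append(f"exit code {exit_code}, expected {golden.exit_code}")
    return problems
```

Matching by substring rather than by whole line keeps the test independent of any location or severity prefix the driver may print around the diagnostic.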
Pattern 3: Cross-version diff testing
Run two snapshots of tileiras (for example, the CUDA 13.1 binary and the binary from a later toolkit release) on the same fixed input with the same flags. Diff stdout. Any non-empty diff is a behavioral change between the two releases.
The pattern is the canonical way to learn what changed in a snapshot when no changelog is published. A non-empty diff narrows the question to "which pass produced this output difference"; that question is then answerable with pattern 5 below. The two patterns together reduce "the new binary is different" to "this pass at this stage produces different IR".
Cross-version testing is the only way to identify silent behavior changes. A snapshot that adds a new optimization without a new diagnostic produces no exit-code change, no stderr change, and no timing red flag — only an output diff at the PTX level reveals it.
Pattern 4: Cross-target diff testing
Same input compiled for sm_90a, sm_100a, sm_103a, sm_120a. Diff PTX outputs pairwise. The diff reveals the per-architecture dispatch — which atoms differ, which intrinsics differ, which scheduler decisions differ. The cross-target matrix is documented in PTX Version and Target Selection and Matmul Progression by SM.
The pattern catches arch-conditional bugs. An emission template that should produce different instructions for sm_100a and sm_120a but produces identical output for both is a regression; the diff is empty when it should not be. Conversely, a template that should produce identical instructions for both but produces different output is also a regression; the diff is non-empty when it should be empty. The expected-shape matrix has to be encoded into the harness — bare diffing only tells the tester that something changed, not whether the change was intentional.
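One way to encode the expected-shape matrix is a table keyed by target pair that records whether the two outputs should match. This is a sketch under the assumption that the PTX has already been normalized; `check_target_matrix` and its argument shapes are illustrative, not an existing API.

```python
from itertools import combinations


def check_target_matrix(ptx_by_target, expect_same):
    """Compare every target pair against the declared expectation.

    ptx_by_target: target name -> normalized PTX text.
    expect_same:   frozenset({a, b}) -> True if the pair must produce identical PTX.
    """
    failures = []
    for a, b in combinations(sorted(ptx_by_target), 2):
        pair = frozenset((a, b))
        if pair not in expect_same:
            # An unrecorded pair is a harness gap, not a compiler verdict.
            failures.append(f"{a} vs {b}: no expectation recorded")
            continue
        same = ptx_by_target[a] == ptx_by_target[b]
        if same != expect_same[pair]:
            wanted = "identical" if expect_same[pair] else "different"
            failures.append(f"{a} vs {b}: expected {wanted} output")
    return failures
```

Reporting missing expectations as failures forces the matrix to stay complete when a new target is added to the sweep.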
Pattern 5: Snapshot diff testing
Enable --mlir-print-ir-after-all --mlir-print-ir-module-scope. Capture the per-pass IR stream. When an output regression appears, diff the snapshot stream pass-by-pass between the working and broken configurations. The first pass at which the snapshots diverge is the pass that introduced the regression.
The pattern depends on the snapshot stream being deterministic between identical runs. Tileiras's pipeline is deterministic at fixed opt-level and fixed flags — the modulo scheduler uses stable sort, the cost-based arm uses lexicographic ranking, and the random-tie-break path documented in Modulo Scheduler and Rau is seeded from input hash rather than from wall-clock. Two identical invocations produce byte-identical snapshot streams.
The IR stream is large. An O3 compile on a multi-kernel module can produce hundreds of megabytes of snapshot text. A harness that retains snapshots for every passing test will run out of disk; the right design is to retain snapshots only when an output diff appears, which means the snapshot capture has to be deferred to the second run, not the first.
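Finding the first diverging pass reduces to splitting both streams at their pass headers and walking them in lockstep. The sketch below assumes the header line has the shape upstream MLIR's IR printer uses (`// -----// IR Dump After <Pass> (<arg>) //----- //`); whether tileiras prints exactly this form is an assumption to verify against a captured stream.

```python
import re

# Assumed header shape, borrowed from upstream MLIR's IR printer.
_HEADER = re.compile(r"^// -----// IR Dump After (.+?) //----- //\s*$")


def split_snapshots(stream):
    """Split a snapshot stream into (pass name, IR text) pairs in emission order."""
    snapshots = []
    for line in stream.splitlines():
        match = _HEADER.match(line)
        if match:
            snapshots.append((match.group(1), []))
        elif snapshots:
            snapshots[-1][1].append(line)
    return [(name, "\n".join(body)) for name, body in snapshots]


def first_divergence(good_stream, bad_stream):
    """Return (index, pass name) of the first differing snapshot, or None."""
    good = split_snapshots(good_stream)
    bad = split_snapshots(bad_stream)
    for index, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return index, g[0]
    if len(good) != len(bad):
        return min(len(good), len(bad)), "<pass list length differs>"
    return None
```

Both streams are held in memory here for brevity; for streams in the hundreds of megabytes a harness would compare them incrementally instead.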
Performance regression patterns
Performance regressions are a separate test class because they fire no diagnostic. The compiler does not know its scheduler picked a worse II; the cost model judged the chosen placement as best. Only an external test can rank "best from the cost model's view" against "best from the user's view".
The two performance-class observation surfaces are pass timing and the scheduler trace.
--mlir-pass-timing produces a per-pass wall-clock breakdown at compile end. A regression that doubles a single pass's runtime is visible as a 2x entry in the timing report. The expected baseline has to be recorded; a percentage threshold of ±5–10% is a sensible default to absorb microsecond-level noise without missing structural regressions. The flag is documented in Debugging and Introspection — MlirAction-based instrumentation.
--schedule-trace-file=PATH produces per-decision JSON for the modulo scheduler. The trace records every (op, cycle) placement attempt, the cost vector it produced, and which gate (G1–G4) accepted or rejected it. A regression that changes the chosen II from 4 to 5 is visible as a different commit decision in the trace; a regression that increases the number of gate-G3 rejections is visible as a different rejection count. The trace format is documented in Debugging and Introspection — Surface 5.
Neither surface catches the case where the compiler picks a structurally similar but slightly worse placement that produces equal-shape PTX with worse runtime SASS performance. That case is invisible to source-side observation and requires runtime profiling against compiled cubin.
Observability gotchas
Several mechanisms produce output that looks deterministic but is not, or output that looks comparable across runs but is not.
PTX register names are dependent on register-allocator state. Two compiles of the same input typically produce identical register naming, but a compile where a pass produced more or fewer virtual registers above the allocator can shift every physical register downstream of that point. A diff harness that compares %r0..%r127 literally fails on any pipeline change that touches register count; comparing instruction structure with register names normalized to opaque tokens is the robust form.
Label numbering inside basic-block tails is also allocator-dependent. A diff that detects only a BB0_42 becoming BB0_43 is almost always cosmetic. Normalizing label numbers before structural comparison eliminates the noise.
Diagnostic order is not always deterministic when multiple passes can emit independently. The pipeline's verify-each path emits in pass order, which is stable, but a single verify pass walking a multi-function module can visit functions in hash-table order. A test that asserts diagnostic A appears before diagnostic B may pass on one platform and fail on another; assert the set of diagnostics, not their order.
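An order-insensitive assertion can be sketched with plain sets. The helper names are hypothetical, and the sketch assumes one diagnostic per stderr line; multi-line diagnostics with attached notes would need grouping first.

```python
def diagnostic_set(stderr):
    """Collect non-empty stderr lines as a set, discarding emission order."""
    return {line.strip() for line in stderr.splitlines() if line.strip()}


def new_diagnostics(stderr, baseline):
    """Return diagnostics present in stderr but absent from the recorded baseline."""
    return diagnostic_set(stderr) - set(baseline)
```

The same set difference serves as the "no new warnings" check: an empty result means every emitted line is already in the baseline.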
Pass timing has microsecond-level jitter. The MLIR pass-timing harness reports wall-clock, which includes kernel-side OS scheduling jitter, page-fault costs from cold caches, and TLB-fill overhead on the first invocation. Single-run timing comparisons are unreliable; aggregate over a minimum of three runs and compare medians. The --mlir-pass-timing walltime also includes time spent printing IR if any of the print-IR flags are enabled simultaneously; clean timing reports require the print flags to be off.
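The median-of-runs rule can be sketched as follows. The input is assumed to be already parsed into one `{pass name: seconds}` dictionary per run; parsing the timing report itself is left out because its exact text format is not described here, and both function names are illustrative.

```python
from statistics import median


def median_timings(runs):
    """Reduce several {pass name: seconds} reports to per-pass medians."""
    if len(runs) < 3:
        raise ValueError("need at least three runs to take a useful median")
    names = set(runs[0])
    if any(set(run) != names for run in runs):
        raise ValueError("runs report different pass lists")
    return {name: median(run[name] for run in runs) for name in names}


def timing_ratios(baseline, current):
    """Return current/baseline per pass, for passes present in both reports."""
    return {
        name: current[name] / baseline[name]
        for name in baseline
        if name in current and baseline[name] > 0
    }
```

A single cold-cache outlier among three runs does not move the median, which is the property the aggregation is there to provide.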
The diagnostic-stack-trace mechanism resolves symbols from a stripped binary. Frames within the tileiras binary resolve to address-only entries (tileiras+0x12345); frames within host shared libraries such as libc or libstdc++ may resolve to names if those libraries are not stripped. A diagnostic-source identification routine that assumes all frames resolve to names will silently skip the tileiras-internal frames and report only the host library frames.
Continuous integration patterns
For a project shipping tileiras-generated PTX as part of its deliverable, the natural CI shape is a four-stage pipeline.
Stage one regenerates every PTX golden from its source on every commit. Failures here mean tileiras refuses the input; capture the exit code and the stderr, and bucket the failure by exit code class — 2 is configuration, 3 is wire-format, 5 is verifier or codegen. The exit-code contract is in Driver Program Handle — Public error codes.
Stage two diffs the regenerated PTX against the VCS-tracked golden. Failures here mean the output changed without a corresponding golden update. The diff is the actionable artifact; treat it as a review-required change rather than as a hard break. A snapshot-stream capture (pattern 5 above) on the failing input localizes which pass introduced the change.
Stage three runs the diagnostic golden tests (pattern 2 above). Each test pins a verbatim diagnostic and an exit code. Failures here mean the diagnostic catalog drifted; the new text becomes a new golden, or the old behavior is restored.
Stage four runs pass-timing measurements against a recorded baseline. Failures here are advisory rather than blocking — a 5% per-pass slowdown is worth investigating but not worth refusing a commit. The blocking threshold should be higher (2x or more) to absorb measurement jitter without false alarms.
The four stages compose into a regression net that catches most observable-behavior changes. Stage one catches outright build breaks; stage two catches output-shape changes; stage three catches diagnostic-catalog changes; stage four catches gross performance regressions. What slips through all four is the runtime-correctness class — wrong-output bugs that compile, link, and run without a diagnostic. Those require a separate runtime test layer that loads the cubin and validates against a numerical reference, which lives outside the compile-time observation surface entirely.
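The stage-one bucketing and the stage-four advisory/blocking split can be sketched as two small classifiers. Only the three exit codes named above are mapped; any other non-zero code is reported as unclassified rather than guessed. The thresholds are parameters with defaults taken from the figures in the text, and the function names are hypothetical.

```python
# Exit-code classes named in the text; other codes are deliberately not guessed.
EXIT_BUCKETS = {2: "configuration", 3: "wire-format", 5: "verifier or codegen"}


def bucket_exit_code(code):
    """Map a tileiras exit code to the failure class used for CI bucketing."""
    if code == 0:
        return "success"
    return EXIT_BUCKETS.get(code, "unclassified")


def timing_verdict(ratio, advisory=1.05, blocking=2.0):
    """Classify a current/baseline per-pass timing ratio for stage four."""
    if ratio >= blocking:
        return "blocking"
    if ratio > advisory:
        return "advisory"
    return "ok"
```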
Regression scenarios
The typical regression-suite shape covers six scenarios. Each one is testable from the observation surfaces above; each one corresponds to a different failure mode in tileiras.
All kernels compile. The exit code from every entry in the fixture set is 0. Any non-zero exit is a regression. The fixture has to be small enough that a full sweep runs in seconds; a per-SM matrix of one kernel per major mechanism (matmul, convolution, attention, reduction, transpose) is a reasonable starting shape.
No new warnings. Stderr from every entry is matched against a known baseline. Any new diagnostic, whether warning or error, surfaces as a stderr diff. The baseline grows over time; new diagnostics are added to the allowlist once they are confirmed as intentionally expected.
All kernels produce identical PTX to their goldens. The diff is empty after register and label normalization. Diff-non-empty entries become review-required.
Pass timing within ±10% of baseline. The --mlir-pass-timing report is captured and compared against a recorded baseline. The 10% threshold absorbs measurement noise; larger deviations are flagged.
Generated PTX register count within budget. The .reg declarations at the head of each kernel are parsed and summed. A kernel that grew from 64 registers to 96 may now spill, which has runtime cost the harness cannot directly observe but which is predictable from register-pressure budgets per SM documented in NVPTX Subtarget and Feature Matrix.
SMEM usage within budget. The kernel's static + dynamic SMEM footprint is parsed from the PTX (.shared declarations) and the launch directives. A kernel that exceeds the per-SM SMEM ceiling fails at ptxas with Function uses too much shared data, but the harness can preempt that failure by checking before invocation. The mechanism is documented in Troubleshooting — Runtime and ptxas failures.
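The register and static shared-memory checks can be sketched as two parsers over the PTX text. The regular expressions assume the common declaration forms `.reg .b32 %r<40>;` and `.shared .align 16 .b8 name[4096];`; dynamic shared memory and any launch directives are not covered, and summing `.reg` counts across types is a coarse proxy rather than a physical register count.

```python
import re

_REG_DECL = re.compile(r"\.reg\s+\.\w+\s+%\w+<(\d+)>\s*;")
_SHARED_DECL = re.compile(
    r"\.shared\s+(?:\.align\s+\d+\s+)?\.(\w+)\s+\w+\[(\d+)\]\s*;"
)
# Element sizes for the scalar types this sketch understands.
_TYPE_BYTES = {
    "b8": 1, "u8": 1, "s8": 1,
    "b16": 2, "u16": 2, "s16": 2, "f16": 2,
    "b32": 4, "u32": 4, "s32": 4, "f32": 4,
    "b64": 8, "u64": 8, "s64": 8, "f64": 8,
}


def declared_registers(ptx):
    """Sum the counts of all parameterized .reg declarations."""
    return sum(int(count) for count in _REG_DECL.findall(ptx))


def static_shared_bytes(ptx):
    """Sum the byte sizes of statically sized .shared array declarations."""
    total = 0
    for element_type, count in _SHARED_DECL.findall(ptx):
        if element_type not in _TYPE_BYTES:
            raise ValueError(f"unhandled element type .{element_type}")
        total += _TYPE_BYTES[element_type] * int(count)
    return total
```

A harness would compare both numbers against budgets it records per target; the budgets themselves are not encoded here.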
What you cannot test from outside
Some compiler decisions never reach the observation surface. The scheduler's internal cost evaluations are exposed through the trace, but only the chosen decisions; the rejected ones are aggregated into a count rather than itemized. The verifier's branch-by-branch logic produces a single pass/fail bit at the diagnostic; the chain of predicates that led to the verdict is not exposed. The random-tie-break order is seeded from input hash, which makes it reproducible from the test side, but the seed itself is not exposed and re-seeding for fault injection is not possible.
The unverifiable bug classes from Correctness Layers — The Unverifiable — data races in user-written warp-cooperative algorithms, numerical-precision mismatches, performance regressions below the noise threshold, decision bugs in the compiler's own cost model — all of these compile cleanly, produce no diagnostic, and pass every observation-based regression test. Catching them requires runtime testing against a numerical reference, not source-side observation.
A reasonable position for an integrator: observable-behavior regression tests cover the compile-time contract; runtime-correctness tests cover the execution-time contract; performance benchmarks cover the runtime-performance contract. The three layers are not substitutes for each other, and a CI pipeline that runs only the first will pass every test on a build that silently produces wrong-output numerics.
Cross-references
Debugging and Introspection is the primary reference for each observation flag enumerated above; this page assumes its surface catalog and applies the surfaces to test design rather than re-explaining what each flag prints. Troubleshooting and Known Issues provides the symptom-driven index of diagnostic strings that diagnostic-golden tests (pattern 2) pin against, including the verbatim typos that must be preserved in test assertions. Correctness Layers documents the verifier ladder whose layer-by-layer diagnostics the regression suite is keying off, and identifies the bug classes that observable-behavior testing cannot catch. Error Handling and Diagnostics covers the diagnostic engine and the five exit-code classes that exit-code-based test bucketing relies on. Driver CLI Options enumerates the flags that activate each observation surface, and Driver Program Handle — Public error codes is the canonical exit-code reference. Pipeline Instrumentation and Action Handler documents the scope tree that --mlir-pass-timing exposes and the MlirAction mechanism that lets external tooling instrument any pass without modifying the pass list. OSS Comparison Overview is the cross-validation reference: the open-source cuda-tile preview is the only point where parts of tileiras's IR surface are visible in original source form, and a tester who wants to validate tileiras's behavior against a reference implementation has only that subset to compare against.
Architecture Evolution and Design Decisions
Abstract
Tileiras's shape — an MLIR substrate, a four-stage dialect cascade, a Rau-style modulo scheduler at the core, a CUTLASS-derived dialect family for tile primitives, and a wire-format-breaking bytecode boundary — is not arbitrary. Each layer is a deliberate response to a constraint that the older cicc pipeline could not solve cleanly. This page documents the choices and the alternatives they passed over, so a reimplementer can recover the intent behind the structure rather than only the structure itself.
The page is a retrospective, not a tutorial. It assumes the reader already has the mechanical picture from the dialect, scheduler, and lowering chapters and is asking the more fundamental question: why this shape, and not another?
The MLIR Choice
cicc, tileiras's sibling in the CUDA 13.1 toolkit, accepts CUDA C++ source and reaches PTX through the NVVM bridge and the upstream NVPTX backend. That pipeline is mature, well understood, and entirely adequate for traditional CUDA C++. It is not adequate for tile-shaped computation, and the reason is structural rather than performance-driven.
Three classes of information disappear when a tile program is expressed in LLVM IR directly.
The first is tile typing. A statement like "this value is a 128x64 fragment with swizzle mode XOR-3 living in shared memory" has no native LLVM type. The closest LLVM construct is an opaque pointer with metadata; the swizzle, the layout algebra, and the per-thread fragment shape all become side data that every analysis pass must reconstruct. Once reconstructed, the analysis carries its own copy of the structure, and the structure drifts between passes.
The second is pipeline structure. "This loop is the producer side of a software-pipelined async copy; the consumer is in the same loop body" is something the modulo scheduler needs to see directly. In LLVM IR the pattern is a memory-token chain and a hand-marked instruction sequence; recovering the producer/consumer pairing requires re-running the analysis that originally placed the pattern.
The third is descriptor-vs-pointer typing. A WGMMA op takes operand B from shared memory through a 64-bit descriptor, while operand A can come from the register file. In LLVM IR both look like ordinary loads. In nv_tileas and nvvm they have distinct operand types, and the scheduler, the register allocator, and the asm printer can each reason about them without consulting a side analysis.
MLIR's dialect mechanism keeps each level of abstraction explicit until the level below it is ready to consume it. Tileiras adds dialects exactly where structural information matters; it lowers down to LLVM only when the structural information has been fully exploited. The alternative — riding LLVM IR end-to-end like cicc does — would force a tile-aware emitter to encode every layout, every async-copy chain, and every pipeline boundary in metadata, then re-derive it at every pass that cares.
The Dialect Cascade
The cascade has four stages and not one, and the answer to "why not collapse them?" is that each stage establishes invariants the next stage relies on.
cuda_tile      (input form, frontend-emitted)
    |
    |  ConvertCudaTileToTileAA
    v
nv_tileaa      (analysis form, alias-aware, typed pointers + tokens)
    |
    |  ConvertTileAAToTileAS
    v
nv_tileas      (scheduled form, pipeline regions, barrier slots)
    |
    |  ConvertTileASToLLVM
    |  ConvertNVGPUToNVVM
    v
nvvm + llvm    (codegen-ready, NVPTX-backend input)
| Stage | What it adds | What downstream relies on |
|---|---|---|
| cuda_tile | tile-typed values, abstract memops, frontend op surface | nothing earlier; this is the input shape |
| nv_tileaa | typed pointer/view types, memory tokens, alias-analysis attributes | every subsequent pass assumes tokens carry the alias relation |
| nv_tileas | scheduling annotations, pipeline-region markers, barrier slot bindings | LLVM lowering assumes each scheduled op knows its stage and slot |
| nvvm + llvm | NVVM intrinsics, LLVM IR shape | NVPTX backend assumes verifier-clean NVVM IR |
The alternative is one giant rewrite that takes cuda_tile and emits LLVM directly. That alternative would have to encode all the scheduling state, all the alias state, and all the layout state inside a single pass — the kind of monolithic transform that resists testing and inversion. The cascade trades total pass count against per-pass simplicity: each conversion only needs to understand two adjacent levels, never the full distance.
A second reason for the split is materialization order. Pipe_ and Mutex_ IR — the synchronized-handshake form that drives the runtime — is materialized inside the tileaa to tileas transition. If the cascade were collapsed, that materialization would have to be interleaved with the layout selection that precedes it and the lowering that follows it, making both harder to debug.
The Rau-Style Modulo Scheduler
Modulo Scheduler and Rau covers the mechanics; the question here is why this specific scheduler.
Rau modulo scheduling is the canonical software-pipelining algorithm from the VLIW era. GPUs are not VLIW machines, but tile-based kernels share the relevant structure: a small loop body that the compiler wants to overlap across iterations, a small number of architectural resources with explicit capacity, and a clear separation between producer-side and consumer-side operations. Three properties make Rau a natural fit.
Modulo placement gives each operation a (stage, cycle mod II) coordinate, which directly encodes overlap. An async-copy producer in stage 0 and a WGMMA consumer in stage 2 share the same cycle mod II slot but occupy different stages; the schedule writes them down in a single coordinate system without auxiliary bookkeeping.
The Resource Reservation Table extends naturally to GPU-specific resources. Stock Rau tracks issue slots; tileiras tracks TMA channels, WGMMA pipeline lanes, shared-memory bank groups, and the named-barrier pool through the same RRT shape. The probe-then-commit discipline holds across all of them.
The II search starts at the maximum of resource MII, recurrence MII, fine-density MII, and dependency MII, then increments until placement succeeds. This is the standard Rau outer loop, and it has the property a production compiler needs: it terminates in bounded time with a deterministic schedule.
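The outer loop and the probe-then-commit reservation table can be sketched in a few lines. This is a deliberately simplified, non-backtracking illustration, not the scheduler's actual algorithm: Rau's iterative modulo scheduling can evict and re-place operations, which this sketch does not do. The resource names, the per-op `duration` field, and the `other_bounds` parameter (standing in for precomputed recurrence and other MII bounds) are all assumptions made for the example.

```python
from math import ceil


def resource_mii(ops, capacity):
    """Lower bound on II from total per-resource occupancy."""
    demand = {}
    for _, resource, duration, _ in ops:
        demand[resource] = demand.get(resource, 0) + duration
    return max(ceil(n / capacity[r]) for r, n in demand.items())


def try_place(ops, capacity, ii):
    """Place ops in order at one fixed II; return {name: cycle} or None."""
    table = {}   # (resource, cycle mod II) -> number of reservations
    time = {}    # op name -> absolute issue cycle
    for name, resource, duration, preds in ops:
        if duration > ii:
            return None
        earliest = max((time[p] + lat for p, lat in preds), default=0)
        # Probing II consecutive cycles visits every modulo slot exactly once.
        for cycle in range(earliest, earliest + ii):
            slots = [(resource, (cycle + k) % ii) for k in range(duration)]
            if all(table.get(s, 0) < capacity[resource] for s in slots):  # probe
                for s in slots:                                           # commit
                    table[s] = table.get(s, 0) + 1
                time[name] = cycle
                break
        else:
            return None
    return time


def modulo_schedule(ops, capacity, other_bounds=(), max_ii=64):
    """ops: (name, resource, duration, [(pred, latency), ...]) in topological order.

    Returns (II, {name: (stage, cycle mod II)}) or None if no II <= max_ii works.
    """
    start = max([resource_mii(ops, capacity), *other_bounds])
    for ii in range(start, max_ii + 1):
        time = try_place(ops, capacity, ii)
        if time is not None:
            return ii, {n: (t // ii, t % ii) for n, t in time.items()}
    return None
```

The `max_ii` cap is what makes termination bounded in this sketch, and the fixed op order makes the result deterministic. A producer and a consumer separated by a latency larger than II land in the same modulo slot index on their respective resources but in different stages, which is the overlap the coordinate system is meant to express.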
Two alternatives were available and rejected. An ILP-based scheduler, which formulates placement as an integer program and hands it to a solver, would find better schedules but at unacceptable compile-time cost. A list-scheduling pass with manual pipeline-region annotation, the style used by some Triton-derived backends, would be simpler but would require the frontend to commit to a pipeline shape before the scheduler runs. Rau lets the scheduler discover the shape from the loop body itself.
The CUTLASS-in-MLIR Family
The cute, cute_nvgpu, and cutlass dialects look on first inspection like a redundant layer: the tile primitives they expose already exist in CUTLASS upstream. The redundancy is intentional.
CUTLASS upstream is a C++ template library. Its layout algebra, its copy atoms, and its MMA atoms are abstractions that work well inside a C++ kernel but cannot be inspected by an IR-level pass. A pass that wants to ask "does this copy atom use a TMA descriptor or a generic load?" must run C++ template instantiation; a pass that wants to fuse two MMA atoms must understand C++ template specialization.
Porting those abstractions into MLIR dialects has three immediate consequences. Layout algebra becomes first-class IR operations — cute.local_tile, cute.partition, cute.divide, cute.size, cute.cosize are inspectable, verifiable, and foldable. Copy atoms become MLIR ops with explicit operand contracts — a TMA atom and a generic-load atom are different ops with different verifiers, not different template instantiations. MMA atoms become ops the scheduler can reason about — the operand sources, the latency, and the resource footprint are op-level facts.
The same dialect family drives both CUTLASS-style kernels and Triton-style kernels. A CUTLASS frontend lowers C++ kernels into cute and cutlass ops, then through the rest of the cascade. A Triton-style frontend lowers tile-shaped kernels into cuda_tile, which lowers into nv_tileaa, which interacts with the same cute and cutlass primitives at the scheduling layer. One MLIR substrate, two frontends.
The alternative — leaving CUTLASS as a C++ library and lowering kernels through it at the source level — is exactly what cicc does, and it is the reason cicc cannot reason about the tile-shape structure that the modulo scheduler needs.
The Wire-Format-Breaking Bytecode
Tileiras's MLIR bytecode reader dispatches AttrTag and TypeTag values through a table whose ordering and case set do not match upstream MLIR. A bytecode file produced by upstream mlir-translate --serialize-bytecode will not parse cleanly in tileiras, and the reverse is also true. The question is whether this is policy or accident.
Two readings are consistent with the evidence.
The first is an intentional ABI fork. NVIDIA's binary is a hermetic distribution: users go through frontends that produce conformant bytecode, and the bytecode itself is internal to NVIDIA's pipeline. Reserving the right to add private attribute kinds, reorder the dispatch table for code-density reasons, or freeze a particular tag layout is a reasonable internal-format decision. The frontend is the contract; the binary format is implementation.
The second is snapshot drift. Tileiras was forked from a pre-release MLIR snapshot, and upstream's AttrTag table evolved differently after the fork. The wire-format incompatibility is then incidental rather than designed, and it persists because nothing in the toolkit needs the formats to match.
Both readings produce the same consequence: a tileiras-compatible reimplementation cannot use upstream mlir-translate as a substitute for the tileiras bytecode reader. It must either implement a tileiras-aware writer or use text MLIR and the tileiras parser. The MLIR Bytecode Format page enumerates the specific tag-table deltas.
Decisions Visible in the Binary
Several smaller decisions show up in the binary itself and have entries elsewhere in the wiki; collecting them here lets a reader see the design as a whole.
LLVM 21 base. Tileiras embeds a stock LLVM 21 snapshot, statically linked. The ten-fingerprint argument is in the LLVM Fingerprint Table. The decision is to track upstream LLVM closely rather than maintain a heavily forked private LLVM; private behavior is concentrated in the NVPTX backend's peephole passes and a small set of TableGen additions, not in the core IR.
XOR-3 mnemonic-pool obfuscation. The NVPTX asm printer's instruction mnemonic table is XOR-encoded at rest and decoded once at program start through a pthread_once-guarded init. The encoding is weak; it raises the cost of trivial strings extraction without claiming any cryptographic guarantee. The wiki nonetheless documents the decoder so a reader can recover the full mnemonic pool from the binary.
Static linkage of LLVM and MLIR. Tileiras carries no shared-library dependency on libLLVM.so or libMLIR.so. The binary is hermetic. The trade-off is binary size for distribution simplicity: a CUDA toolkit shipped to a customer machine cannot rely on a system LLVM being present, in compatible shape, or even installed.
No GPU dependency for tileiras itself. Tileiras runs on CPU and emits PTX. The ptxas subprocess it spawns also needs no GPU to produce SASS. Both compilers can build for an SM target that is not physically present on the host. This is a design property, not an accident; it makes cross-compilation, CI builds, and offline kernel libraries straightforward.
Stripped binary. The shipped binary has its .symtab removed; only dynamic symbols remain. This is standard distribution practice for a production toolchain and not a security claim. The wiki's job — and the String Evidence and Confidence Policy's job — is to recover from the strip with confidence-tagged claims rather than treat the binary as opaque.
Trade-Offs and Remaining Questions
An honest framing of the design needs to admit where the choices cost.
Compile time. A four-stage dialect cascade plus a modulo scheduler is slower per-kernel than a direct LLVM IR-to-PTX descent. For AI workloads where a kernel is compiled once and runs many times, the cost amortizes and the optimized schedule pays back many times over. For one-off compilations — small test programs, exploratory kernels — the cost is real.
Per-SM atom catalogs. The cute_nvgpu dialect carries SM70-through-SM120 atom rosters, each with a TMA family, an MMA family, a WGMMA family (where applicable), and a tcgen05 family (where applicable). New SM architectures require updates to the atom catalogs, the scheduler resource models, and the verifier patterns. The cost of adding an SM is non-trivial.
Wire-format incompatibility. Whatever its cause, the bytecode delta makes interop with non-NVIDIA MLIR tooling effortful. A user who wants to debug a tileiras bytecode file with upstream tools must first round-trip through text MLIR; a user who wants to feed an upstream-tool-produced bytecode file into tileiras must first round-trip the other way.
OSS preview is a subset. The public cuda-tile repository covers part of the cuda_tile dialect surface — types, attributes, ops, two helper passes, a standalone driver. It does not cover nv_tileaa, nv_tileas, the TileAS pass family, the cute_nvgpu SM rosters, the modulo scheduler, or the NVPTX peephole additions. A reimplementation that wants the full system must recover those pieces from the binary, with the wiki as accelerator.
None of these are fatal. They are the consequences of decisions made deliberately, and they are visible enough in the binary that a reader can weigh them against the alternatives the design rejected.
Cross-References
The boundaries with neighboring tools are documented in cicc Comparison, Position in nvcc 13.1, and Toolchain Integration. The relationship to the public cuda-tile source preview is the OSS Comparison Overview. The binary-level evidence behind the decisions on this page is Binary Anatomy and RE Methodology and the Program Layout. The canonical depth pages for the scheduler and the cascade are Modulo Scheduler and Rau, Pipeline Overview, and Lowering Overview.
Common Compiler Patterns and Idioms
Abstract
Tileiras uses a small set of recurring structural patterns drawn from upstream MLIR, LLVM, and libstdc++. Once a reader can spot them, hundreds of pages of architecture collapse to a handful of moves repeated across every subsystem: a public handle with one void * to a per-pass state struct, a hand-rolled vtable with two fixed sentinel slots, a switch on a tag byte that drives every parse and conversion boundary, a single bit in a status word that ferries pass failure across passes, an Itanium guard byte that gates every first-use initialiser, a 24-byte string with the small-string mode encoded in its last byte, a stack-buffered vector whose overflow is one pointer indirection, and an alloca-style allocation that packs a header and several trailing arrays into one block.
This page is the pattern catalogue. Each entry describes the shape, the canonical recognition fingerprint, and the wiki page that documents the pattern in production. A reader who internalises the catalogue reads any other page faster: the structural moves are already named, and the semantic story is what remains to learn.
PIMPL State Objects
A public class holds one pointer named *self (sometimes _impl, sometimes a field with no separate name). The actual state lives in a heap-allocated struct whose layout is fixed across one subsystem and known at every call site. Each TileAS pass extends the base layout with its own fields; the total size lands somewhere between 0x150 and 0x3C0 bytes depending on the pass.
struct PassObject {
/*+0x00*/ MLIRContext *context; // first slot is always the owning context
/*+0x08*/ DiagnosticEngine *engine; // shared with every other pass
/*+0x10*/ void *analysis_manager;
/*+0x18*/ void *pass_manager;
/*+0x20*/ void *options_blob;
/*+0x28*/ uint32_t status_word; // bit 2 = soft failure (see below)
/*+0x2C*/ uint32_t opt_level;
/*+0x30*/ /* pass-specific fields, sized to round the object to 0x150..0x3C0 */
};
The fixed prefix at +0x00..+0x28 is the cross-pass contract: the context pointer for every accessor, the engine for diagnostics, the analysis and pass managers for inter-pass lookup, and the status word that carries the failure handshake. Pass-specific state — options, caches, temporary maps — lives in the trailing area and never appears in cross-pass code.
Recognition is one instruction: any function whose first argument is loaded with [rdi] to read an MLIRContext * is operating on a PassObject. Pages that lean on this shape include TileAS Pass-Failure Handshake and Pass Manager Internals.
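The shared prefix can be pinned down at compile time. The following is a minimal sketch, assuming an LP64 target and reusing the descriptive field names from the listing above (they are labels, not recovered symbols). PassPrefix and read_status_by_offset are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Opaque stand-ins; the real MLIR types are not needed for a layout check.
struct MLIRContext;
struct DiagnosticEngine;

// Hypothetical mirror of the shared prefix described above (LP64 assumed).
struct PassPrefix {
    MLIRContext      *context;           // +0x00
    DiagnosticEngine *engine;            // +0x08
    void             *analysis_manager;  // +0x10
    void             *pass_manager;      // +0x18
    void             *options_blob;      // +0x20
    std::uint32_t     status_word;       // +0x28
    std::uint32_t     opt_level;         // +0x2C
};

static_assert(offsetof(PassPrefix, status_word) == 0x28, "status word sits at +0x28");
static_assert(offsetof(PassPrefix, opt_level) == 0x2C, "opt level sits at +0x2C");
static_assert(sizeof(PassPrefix) == 0x30, "pass-specific fields start at +0x30");

// Reads the status word by raw offset, the way stripped cross-pass code appears to.
inline std::uint32_t read_status_by_offset(const void *pass_object) {
    std::uint32_t word;
    std::memcpy(&word, static_cast<const unsigned char *>(pass_object) + 0x28, sizeof word);
    return word;
}
```

If a recovered pass struct fails one of the static assertions, the struct is either not a PassObject or the prefix has been misread.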
Vtable Banks
LLVM disables RTTI through -fno-rtti. Polymorphism is hand-rolled: every class declares a static array of function pointers; instances hold a pointer-to-the-array in their first slot. Subclasses provide their own array. Two array shapes dominate tileiras: an 8-slot vtable for OpConversionPattern-style classes and a 5-slot vtable for plain RewritePattern.
static const PatternVtable kArithGenericOpPatternVtable = {
/*+0x00*/ &typeinfo_arith_GenericOpPattern, // RTTI helper
/*+0x08*/ &delete_pattern_object, // deleting destructor
/*+0x10*/ &sub_36C8EC0, // non-deleting destructor (invariant body)
/*+0x18*/ &nullsub_11937, // empty trait callback (invariant slot)
/*+0x20*/ &get_debug_name, // returns typeinfo string
/*+0x28*/ &match, // may stub to slot 6
/*+0x30*/ &match_and_rewrite, // the rewrite body
/*+0x38*/ &get_dependent_operation_names, // returns generatedOps SmallVector
};
Slot 2 and slot 3 are invariant: sub_36C8EC0 for the non-deleting destructor body, nullsub_11937 at 0x447F250 for the empty trait callback. That pair is the strongest fingerprint for an 8-slot pattern vtable in stripped code. The 5-slot vtable has no empty-trait slot and no dependent-operation accessor; slot 3 is the rewrite body, not a stub. The two shapes are catalogued in Pattern Vtables and Shapes and Binary Vtable Banks and Static Constructors.
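The slot-2 / slot-3 fingerprint lends itself to a mechanical check over a dumped array of 8-byte slots. This is a sketch only: looks_like_8slot_pattern_vtable is a hypothetical helper, the address for sub_36C8EC0 is read off its IDA auto-name, and both constants apply to this specific build of the binary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Invariant slot targets quoted in the text; valid for this build only.
constexpr std::uint64_t kNonDeletingDtor = 0x36C8EC0;  // sub_36C8EC0
constexpr std::uint64_t kEmptyTraitStub  = 0x447F250;  // nullsub_11937

// Hypothetical classifier: true when slots 2 and 3 carry the invariant pair.
inline bool looks_like_8slot_pattern_vtable(const std::uint64_t *slots, std::size_t count) {
    if (slots == nullptr || count < 8) return false;
    return slots[2] == kNonDeletingDtor && slots[3] == kEmptyTraitStub;
}
```

A 5-slot array fails the length test before the slot comparison runs, which matches the description of slot 3 holding the rewrite body in that shape.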
Dispatcher Tables
A large switch on a tag value appears at every parse and conversion boundary. The compiler lowers dense ranges to a jump table; sparse cases stay as compares. The shape is the same everywhere: read one byte, switch over it, call a handler.
The principal dispatchers in tileiras:
| Dispatcher | Cases | Where |
|---|---|---|
| MLIR bytecode OpTag reader | 110 | dialect-by-dialect op-tag table |
| MLIR bytecode AttrTag reader | 13 | wire-format-breaking vs upstream's 17 |
| AsmWriter MC instruction print | ~6400 | one case per NVPTX backend opcode |
| AsmPrinter non-MMA partition | 18 | one case per non-tensor-core op family |
| cute_nvgpu mnemonic switch | 64 | one case per atom family |
Op read_op(BytecodeReader *r) {
uint8_t tag = read_byte(r);
switch (tag) { // jump table dense over [0..N]
case OP_RETURN: return parse_return(r);
case OP_BRANCH: return parse_branch(r);
case OP_CALL: return parse_call(r);
/* ... 107 more cases ... */
default: return parse_extension(r, tag);
}
}
Recognition is a function with a jump table at its head; the table itself sits in .rodata and is referenced by an indirect jmp. Pages that catalogue the per-table contents include MLIR Bytecode Format and AsmPrinter Status.
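For readers who have not seen a lowered switch before, the sketch below models what the dense case range turns into: a bounds check followed by an indexed indirect transfer. The handler names, return values, and three-entry table are hypothetical; the compiled form uses an indirect jmp rather than the indirect call modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical handler signature and handlers, for illustration only.
using Handler = int (*)(std::uint8_t tag);

inline int handle_return(std::uint8_t) { return 100; }
inline int handle_branch(std::uint8_t) { return 101; }
inline int handle_call(std::uint8_t)   { return 102; }
inline int handle_extension(std::uint8_t tag) { return -static_cast<int>(tag); }

// Bounds check first, then index the table; out-of-range tags take the default arm.
inline int dispatch(std::uint8_t tag) {
    static constexpr Handler kTable[] = {handle_return, handle_branch, handle_call};
    constexpr std::size_t kCount = sizeof(kTable) / sizeof(kTable[0]);
    if (tag >= kCount) return handle_extension(tag);
    return kTable[tag](tag);
}
```

The element count of the table is what a reader recovers from .rodata when counting dispatcher cases.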
Failure-Bit Handshake
TileAS passes communicate soft failure by ORing 4 into the status word at offset +0x28 of their PassObject. The bit signals "this pass could not complete its work, but the IR remains valid and the pipeline should continue." The pass manager reads the bit when the walk terminates; downstream passes that depend on the predecessor read the same bit and either short-circuit or run a fallback.
static inline void pass_mark_soft_failure(PassObject *self) {
self->status_word |= 4; // *(self+0x28) |= 4
}
static inline bool pass_soft_failed(const PassObject *self) {
return (self->status_word & 4) != 0;
}
The pattern is *(self+40) |= 4 in disassembly — a 32-bit OR of an immediate 4 into the dword at +0x28. The bit always pairs with a diagnostic: the pass emits its error or remark first, sets the bit second, and returns. The convention is documented in full in TileAS Pass-Failure Handshake; the broader three-layer error story is in Error Handling and Diagnostics.
Lazy-Init Guards
First-use initialisation of singletons uses one of two guard families. The Itanium ABI __cxa_guard_acquire / __cxa_guard_release pair gates function-local statics; pthread_once gates larger pool decodings and dialect registrations.
static const Pool *cached_pool;
static atomic<uint64_t> pool_guard; // Itanium guard object; the low bit of its first byte means "initialised"
const Pool *get_pool(void) {
if (__cxa_guard_acquire(&pool_guard)) { // returns nonzero on first call
cached_pool = build_pool(); // runs exactly once
__cxa_guard_release(&pool_guard); // publishes through release fence
}
return cached_pool; // every subsequent call: plain load
}
The acquire load and release store form a release-acquire pair: subsequent threads see the initialised state without an extra fence. The low bit of the guard byte encodes "initialised"; uncontended subsequent calls inline to a single byte load and a branch. The pthread_once form is the equivalent for larger init work — the threading machinery, the spin-vs-block trade-off, and the weak-symbol single-threaded collapse are catalogued in Threading and Synchronization.
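At source level the guard is invisible: a function-local static with a non-trivial constructor is enough for the compiler to emit the acquire/release pair on Itanium-ABI targets. A minimal sketch, with Pool, build_count, and get_pool as hypothetical names, that makes the run-exactly-once behaviour observable:

```cpp
#include <cassert>

// Counts how many times the initialiser body actually runs.
inline int &build_count() {
    static int count = 0;  // constant-initialised, so no guard is needed here
    return count;
}

struct Pool {
    int id;
    Pool() : id(42) { ++build_count(); }
};

// The compiler wraps this constructor call in the guard acquire/release pair.
inline const Pool &get_pool() {
    static const Pool pool;
    return pool;
}
```

Calling get_pool() repeatedly returns the same object and leaves build_count() at one.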
SSO Strings
libstdc++ std::string (C++11 ABI) is 32 bytes on x86-64 and stores up to 15 characters inline. In small mode the layout is { char *_M_dataplus, size_t _M_string_length, char _M_local_buf[16] }; the data pointer points into the inline buffer at the end of the same object. In heap mode the same struct stores { char *heap_ptr, size_t length, size_t capacity }, the data pointer points at a separate heap allocation, and the last 8 bytes of the union go unused.
struct sso_string {
/*+0x00*/ char *data; // points to &local_buf in small mode
/*+0x08*/ size_t length;
union { // anonymous union at +0x10
/*+0x10*/ char local_buf[16]; // small-string inline storage
/*+0x10*/ size_t capacity; // heap mode capacity
};
}; // sizeof == 32 (8 + 8 + 16)
The discriminator is the data pointer at +0x00: if it equals this + 0x10, the string is in small mode and the 16 trailing bytes are the inline content; otherwise the string is on the heap and +0x10 is the capacity. Recognition in a binary is a 32-byte field whose first 8 bytes either point into the same object (small) or point to a separate heap chunk (heap).
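The pointer test can be tried on a live std::string. The sketch below assumes only that short strings are stored inside the string object, which holds for the libstdc++ layout described here; is_inline_storage is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// True when the character data lives inside the string object itself.
inline bool is_inline_storage(const std::string &s) {
    const auto obj  = reinterpret_cast<std::uintptr_t>(&s);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj && data < obj + sizeof(std::string);
}
```

A four-character string reports inline storage, while a hundred-character string reports a separate allocation; this is the same distinction a reader makes by eye in a memory dump.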
SmallVector
LLVM's SmallVector<T, N> carries an inline buffer of N elements directly in the object. When the size exceeds N, the vector spills to a heap allocation and the inline buffer is unused. In the LLVM 21 lineage the header is { void *begin, uint32_t size, uint32_t capacity } followed by T inline_buf[N]; the same header describes both inline and heap modes, and the discriminator is whether begin points into the inline buffer or to a separate allocation. Element types narrower than 4 bytes widen the two counters to 64 bits each.
struct SmallVectorBase {
/*+0x00*/ void *begin; // points into inline_buf when small
/*+0x08*/ uint32_t size; // current element count
/*+0x0C*/ uint32_t capacity; // N while the vector is still inline
/*+0x10*/ /* inline_buf[N * sizeof(T)] follows */
};
The pattern fingerprint is one pointer and a packed size/capacity qword followed by a small array, with begin either pointing into the same object or to a separate heap buffer. The 0x60-byte pattern prefix described in Pattern Vtables and Shapes embeds a SmallVector<OperationName, 4> at +0x38; the value 0x400000000 in its size word decodes as size 0 and capacity 4, which marks an empty, still-inline vector for that specific instantiation.
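One way to read the 0x400000000 marker is sketched below, assuming it is a single little-endian qword that packs a 32-bit size in the low half and a 32-bit capacity in the high half (the arrangement upstream LLVM's SmallVectorBase uses for most element types). SizeCapacity and split_size_word are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct SizeCapacity {
    std::uint32_t size;
    std::uint32_t capacity;
};

// Splits a qword holding {uint32 size, uint32 capacity} as read on little-endian x86-64.
inline SizeCapacity split_size_word(std::uint64_t word) {
    return SizeCapacity{static_cast<std::uint32_t>(word & 0xFFFFFFFFu),
                        static_cast<std::uint32_t>(word >> 32)};
}
```

Under that assumption the marker splits into size 0 and capacity 4, which is what an empty four-element inline vector would hold.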
TypeID Meyers Caches
Every MLIR type, attribute, op, and dialect carries a TypeID. The implementation puts a single byte-sized static variable in an anonymous namespace per class; the address of that variable is the TypeID. The variable is never written; its address is stable across the whole process lifetime, and TypeID::get<T>() returns it.
template <typename T>
struct TypeIDResolver {
static const char id_storage; // never read, only addressed
};
template <typename T>
const char TypeIDResolver<T>::id_storage = 0;
template <typename T>
TypeID TypeID::get(void) {
return TypeID(&TypeIDResolver<T>::id_storage); // pointer identity is the type's ID
}
The byte itself is irrelevant; the & operator and the linker's per-class single-definition guarantee are what produce the unique identity. Recognition in a binary is a one-byte .rodata symbol whose only references are address-of in dispatch code. The sentinel records that back the per-op OperationName slots at +0x40 follow the same model and are catalogued in TypeID Sentinels and Anchors.
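The address-as-identity idea can be reproduced in a few lines. This is a hypothetical miniature (IdStorage, TypeKey, and type_key are illustrative names), not MLIR's actual TypeID class:

```cpp
#include <cassert>

// One storage byte per instantiation; its value is unused, only its address matters.
template <typename T>
struct IdStorage {
    static const char byte;
};
template <typename T>
const char IdStorage<T>::byte = 0;

using TypeKey = const void *;

// Returns the per-type address that serves as the identity.
template <typename T>
TypeKey type_key() {
    return &IdStorage<T>::byte;
}
```

Two calls with the same type yield the same pointer; two different types yield different pointers, because distinct objects must have distinct addresses.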
TrailingObjects
LLVM allocates "header plus variable-length tail arrays" as one block. The allocator returns sizeof(Header) + sum_of_tails bytes; the header occupies the leading bytes; each tail array follows at a computed offset. Accessors compute the offset from this and the per-field counts stored in the header.
Operation *create_operation(unsigned n_results,
unsigned n_operands,
unsigned n_regions,
unsigned n_successors) {
size_t bytes = sizeof(Operation)
+ n_results * sizeof(OpResult)
+ n_operands * sizeof(OpOperand)
+ n_regions * sizeof(Region)
+ n_successors * sizeof(BlockOperand)
+ sizeof(DictionaryAttr *);
void *block = arena_alloc(bytes);
/* placement-new Header at block, then placement-new each trailing run */
return reinterpret_cast<Operation *>(block);
}
The canonical example is the MLIR Operation header (0x48 bytes) followed by inline results, operands, regions, successors, and the attribute-dictionary slot, all in one allocation. The seven-line decoder at sub_4492630 computes the operand base via (hdr + 16*trailing + 8*n_inline + 64 + 7) & ~7. Full layout is in MLIR Operation Layout.
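The quoted operand-base expression is an align-up-to-8 applied to a sum of scaled counts. The sketch below transcribes it literally; what trailing and n_inline count is taken on trust from the text, and align_up_8 and operand_base are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Rounds an address up to the next multiple of 8.
inline std::uint64_t align_up_8(std::uint64_t addr) {
    return (addr + 7) & ~std::uint64_t{7};
}

// Literal transcription of (hdr + 16*trailing + 8*n_inline + 64 + 7) & ~7.
inline std::uint64_t operand_base(std::uint64_t hdr, std::uint64_t trailing,
                                  std::uint64_t n_inline) {
    return align_up_8(hdr + 16 * trailing + 8 * n_inline + 64);
}
```

For example, hdr = 0x1000 with trailing = 2 and n_inline = 3 gives 0x1000 + 32 + 24 + 64 = 0x1078, which is already 8-aligned and is returned unchanged.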
Recognising Patterns in Practice
A short workflow for any unfamiliar function:
- First argument loaded as MLIRContext * at [rdi]? Almost certainly a PIMPL state object; the next 0x28 bytes are the shared prefix.
- First field a pointer to .rodata followed by 5 or 8 contiguous function pointers? A vtable bank; check slot 2 against sub_36C8EC0 and slot 3 against nullsub_11937 to confirm an 8-slot pattern vtable.
- A switch with more than 50 cases or a jmp [table + tag*8] at function entry? A dispatcher table; the .rodata jump table reveals the case count.
- An OR of 4 into the dword at [rdi+0x28] preceded by a diagnostic call? A TileAS soft-failure handshake.
- __cxa_guard_acquire on a byte symbol, or a pthread_once_t global followed by a call to pthread_once? A first-use initialiser; the cached value lives in a sibling static.
- A 32-byte field whose first qword either points into the same object or into a separate allocation? A libstdc++ std::string in small or heap mode.
- A pointer and a packed size/capacity qword followed by an inline array, with the pointer optionally pointing back into the array? A SmallVector, in inline mode when the pointer targets the array.
- A one-byte .rodata symbol whose only references are address-of? A TypeID resolver storage byte.
- An allocation of sizeof(Header) + N * stride followed by pointer arithmetic from this to reach trailing arrays? A TrailingObjects block.
These nine shapes account for the structural bulk of tileiras's complexity. Anything that does not match one of them is either domain-specific algorithm code (the scheduler, the lattice solvers, the layout algebra) or a one-off helper. Recognising the shape lets a reader skip the bookkeeping and focus on what each function actually computes.
Cross-References
Pattern Vtables and Shapes is the in-depth catalogue of the two vtable shapes summarised above. MLIR Operation Layout is the canonical TrailingObjects example. TileAS Pass-Failure Handshake documents the soft-failure bit. Threading and Synchronization covers the lazy-init guard families. TypeID Sentinels and Anchors catalogues the per-class identity bytes. Binary Vtable Banks and Static Constructors shows how the vtable arrays land in the binary at link time. Error Handling and Diagnostics ties the failure handshake to the broader diagnostic story.
Binary Anatomy and Reverse-Engineering Methodology
Abstract
Tileiras ships as a single stripped 88 MB x86-64 ELF binary inside the CUDA 13.1 toolkit. The rest of this wiki describes what is inside that binary; this page describes the binary itself. It records the file's identity, the section and segment layout a disassembler will show, the tools the wiki authors used to extract information, and a recipe a reader can follow to verify any individual claim in the wiki against the bytes on disk. The page exists so that a reimplementer who does not trust the wiki can quickly close the gap by opening the binary directly.
Binary Identity
| Property | Value |
|---|---|
| File | tileiras |
| Toolkit path | cuda-13.1/bin/tileiras |
| Approximate size | ~88 MB |
| Format | ELF64, x86-64, dynamically linked executable |
| Stripped | Yes; .symtab removed, only dynamic symbols retained |
| Compiler | clang 21 (verified via the LLVM21.0.0git producer string) |
| Toolkit banner | Cuda compilation tools, release 13.1, V13.1.80 |
| Linkage | LLVM, MLIR, libstdc++ statically linked; libc and libpthread dynamic |
| Default output | Host relocatable named elf.o |
The producer string is the strongest single anchor: it appears verbatim in .rodata, it is referenced from the bitcode-writer body, and the same version string also surfaces in every emitted PTX header. The detailed argument for "this is LLVM 21" is the ten-fingerprint analysis collected in the LLVM Fingerprint Table.
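The Format row can be confirmed from the first twenty bytes of the file. The sketch below checks the ELF identification bytes and the e_machine field using offsets and constants from the ELF specification; is_elf64_x86_64 is a hypothetical helper name, and reading the file into the buffer is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Checks e_ident and e_machine for a little-endian ELF64 x86-64 image.
inline bool is_elf64_x86_64(const std::uint8_t *buf, std::size_t len) {
    if (buf == nullptr || len < 20) return false;
    const bool magic   = buf[0] == 0x7F && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'F';
    const bool class64 = buf[4] == 2;                    // EI_CLASS == ELFCLASS64
    const bool little  = buf[5] == 1;                    // EI_DATA  == ELFDATA2LSB
    const unsigned machine = buf[18] | (buf[19] << 8);   // e_machine at offset 18
    return magic && class64 && little && machine == 62;  // EM_X86_64
}
```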
Section and Segment Layout
| Section | Purpose | Approximate footprint |
|---|---|---|
| .text | The entire compiler: driver, bytecode reader, dialect logic, scheduler, codegen, asm printer, plus statically linked LLVM and MLIR | tens of MB |
| .rodata | Mnemonic pools, diagnostic strings, pattern descriptors, the XOR-3 NVPTX printer pool, bitcode tag table, cl::opt help text, typeinfo, embedded libdevice bitcode | ~10 MB |
| .data | cl::opt mutable storage, dialect and pass registration tables, the XOR-3 mnemonic walking-cipher working copy, LLVM global initialisers | a few MB |
| .data.rel.ro | Vtables and typeinfo nodes for polymorphic classes, AbstractOperation singletons, conversion-target descriptors | small |
| .bss | StorageUniquer hash tables, TypeID Meyers-cache slots, dialect singletons, operation-name registry, per-thread caches, LLVMContext state | ~1 MB |
.got, .plt | Dynamic-link tables for libc and libpthread | small |
.eh_frame, .eh_frame_hdr | C++ exception unwind information | small |
The deeper subsystem-by-subsystem breakdown of what lives inside each segment is the subject of the Program Layout page.
Tools the Wiki Was Produced With
The wiki was authored with an iterative reverse-engineering workflow on a single workstation. The dominant tools were:
- IDA Pro 9 (or compatible): primary disassembler and decompiler. Provides the sub_ADDR auto-naming the wiki uses internally as an evidence trail.
- readelf / objdump: section and segment structure, dynamic-symbol table, relocations.
- strings(1): extracting .rodata and .data strings to build a diagnostic and mnemonic catalog.
- xxd / hexdump: byte-level layout reading for vtable shapes, walking-cipher pools, and packed bitfields.
- mlir-translate from the upstream LLVM tree: cross-checking bytecode wire-format claims against an independent implementation.
- The OSS preview source tree (cuda-tile): used as a sanity check for tablegen-derived structures and dialect rosters.
The wiki was produced in multiple passes. An initial sweep extracted every printable string and clustered them by subsystem; a second pass identified function bodies by call-graph traversal from string-anchored entry points; a third pass cross-validated the recovered structures against the OSS preview where it overlapped. Each pass filtered the candidate claims further; only claims that survived all three passes made it into the wiki, with confidence tags reflecting how many independent forms of evidence agreed.
Verifying a Wiki Claim Against the Binary
The wiki is structured so that any individual claim can be checked against the binary in a small constant amount of time.
For a diagnostic string the wiki cites verbatim, run strings tileiras | grep "the cited fragment". Every backticked string in the wiki is byte-identical to an entry in the binary's string table; if the binary has it, the wiki claim is verified at the byte level. The discipline behind that rule is documented in the String Evidence and Confidence Policy.
For a sub_ADDR the wiki cites in an evidence table, open the binary in IDA, navigate to that address, and compare the body to the wiki description. The auto-named address is not a stable interface — it is an evidence trail — but it is reproducible across identical loads of the same binary in IDA.
For a vtable layout the wiki describes, find the AbstractOperation singleton or the class-instance allocation site referenced in the page, follow the vtable pointer at offset zero, and dump the function-pointer array. The 4-slot and 8-slot pattern vtable shapes documented in the wiki are observable as contiguous 0x60-byte and 0x68-byte arrays in .data.rel.ro.
For a bit-field layout the wiki gives, find a use site of the field — usually a verifier diagnostic that prints the field name — and read the immediate operand of the bit-extract instruction. As a worked example: the Tcgen05 MMA kind bitfield. Locate the verifier diagnostic that mentions cta_group; trace back to the and or bextr that extracts the field from the encoded attribute word; confirm that bits 0-1 are the cta_group selector.
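The bit-extract step in the worked example reduces to a shift and a mask. A minimal sketch follows; extract_bits and cta_group_field are hypothetical names, and the only layout fact taken from the text is that bits 0-1 hold the cta_group selector.

```cpp
#include <cassert>
#include <cstdint>

// Extracts bits [lo, lo+width) from an encoded attribute word.
inline std::uint32_t extract_bits(std::uint64_t word, unsigned lo, unsigned width) {
    assert(width > 0 && width <= 32 && lo + width <= 64);
    return static_cast<std::uint32_t>((word >> lo) & ((std::uint64_t{1} << width) - 1));
}

// Bits 0-1, per the worked example in the text.
inline std::uint32_t cta_group_field(std::uint64_t word) {
    return extract_bits(word, 0, 2);
}
```

The mask the disassembly should show for this field is 0x3; a different immediate at the extract site would contradict the documented layout.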
Where the Wiki's Anchors Come From
Four kinds of binary-content evidence dominate the wiki.
The string catalog is the primary anchor. Every backticked string is byte-identical to a .rodata entry. Diagnostic strings, op mnemonics, pass names, and the producer string itself are all directly quotable. This is the kind of evidence with the highest signal-to-noise ratio and the lowest risk of misidentification.
Vtable shapes are the second anchor. Polymorphic classes — patterns, passes, dialects, the conversion target, the diagnostic engine — show up as contiguous arrays of function pointers in .data.rel.ro. The slot count and ordering of those arrays is a stable structural fingerprint even when the function bodies themselves are inlined or shared.
Mnemonic pools are the third anchor. The XOR-3 walking cipher used by the NVPTX asm printer is observable as a pthread_once-guarded decode function in .text plus a contiguous block of XOR-3-encoded bytes in .data. The encrypted form keeps the readable PTX vocabulary out of strings output; the decode site reveals the full pool to anyone who reads the binary statically.
Bytecode tag tables are the fourth anchor. The 110-case OpTag dispatcher in the bytecode reader compiles to a contiguous jump table whose row count and case-label values are visible in the disassembly. That table fixes the wire-format claims independently of any string.
Binary Distinction from Upstream LLVM and MLIR
The binary is mostly LLVM and MLIR plus NVIDIA-private additions on top. Specifically:
- Stock LLVM 21, verified by the ten independent fingerprints in the LLVM Fingerprint Table.
- Stock MLIR with the post-2024 / LLVM 21 layout (Operation header is 0x48 bytes, AsyncValueImpl is 808 bytes).
- The NVPTX backend with private peephole passes, an enlarged MatcherTable, and a contiguous typed-ProxyReg whitelist that lands in LLVM 21 itself.
- The TileAS pass family, which is NVIDIA-private and has no upstream counterpart.
- The cute, cute_nvgpu, and cutlass MLIR dialects, which are mostly ports of NVIDIA's open-source CUTLASS to MLIR.
- The cuda_tile dialect, which is NVIDIA-private; a partial OSS preview is available under the cuda-tile tree and is discussed on the OSS Comparison Overview page.
The combination is roughly 60% upstream LLVM/MLIR by code size and 40% NVIDIA-private; the wiki focuses on the NVIDIA-private portion because that is where reimplementation effort is concentrated.
Limits of This Wiki
The binary is stripped. Function references in the wiki's evidence tables use IDA's auto-naming convention (sub_ABCDEF), which is reproducible but is not a real symbol. Anyone reproducing the analysis with a different disassembler will see different labels for the same addresses.
Inline-only functions have no separate compiled body and cannot be located by address. Macro- and TableGen-generated code may have many addresses for the same logical entity, because each instantiation is its own compiled body. Some claims rest on structural evidence — vtable shape, basic-block count, allocation footprint — rather than on a verbatim string; those claims carry MED rather than HIGH confidence. The discipline is documented in the String Evidence and Confidence Policy.
Finally, the wiki documents the binary as-shipped in CUDA 13.1. A reader who needs to confirm a claim against a later toolkit should reverify against that release before relying on it.
Reimplementation Viability
The wiki is dense enough that a reimplementer can reproduce the great majority of tileiras's behavior from the wiki alone, with bit-level correctness for diagnostic strings, op rosters, attribute encodings, and bitfield layouts. The remaining behavior — corner cases not exercised by static analysis — would require running tileiras on test inputs and observing the output.
The wiki is not a substitute for binary access; it is an accelerator. Instead of starting from "what does this 88 MB binary do," a reader starts from "I know the pattern-applicator uses a 4-slot vtable; let me find the singleton." That shortcut is what makes a stripped binary tractable to a small reimplementation team.
Cross-References
The structural layout of each segment is described in detail on the Program Layout page. The editorial methodology that governs how evidence becomes wiki prose is documented on the Methodology page. The confidence-tag discipline applied to every claim is the String Evidence and Confidence Policy. The ten-anchor argument for the LLVM 21 base is the LLVM Fingerprint Table. The boundary between NVIDIA-private and upstream-derived code is mapped on the OSS Comparison Overview page. The deliberate decisions visible in the binary — static linkage, XOR-3 mnemonic obfuscation, the stripped-by-design distribution — are framed as design choices on the Architecture Evolution and Design Decisions page.
Frequently Asked Questions
Abstract
This page collects the questions a reader most often arrives with and points each one at the page that answers it in full. The entries are short by design: enough context to confirm that the question matches the situation, plus a link into the subsystem documentation. Detailed contracts, pseudocode, and confidence claims live on the linked pages.
The page is organized into five clusters: what tileiras is, how to use it, how to read this wiki, suggested reading paths for common goals, and meta questions about the project itself.
About tileiras
What is tileiras?
Tileiras is NVIDIA's CUDA TileIR optimizing assembler, shipped in CUDA 13.1
as a separate compiler binary that sits alongside cicc, ptxas, and
cudafe++. It consumes MLIR bytecode describing a tile-level GPU program,
runs that program through a cascade of nine dialect layers, and emits a host
ELF relocatable that carries compiled SASS for one Blackwell-family target.
The narrow framing is the useful one. Tileiras does not parse CUDA C++, does
not handle host-side template expansion, and does not generate launch stubs.
Those responsibilities belong elsewhere in the toolchain. Tileiras begins
after a frontend has already produced Tile IR and ends after ptxas has
finished. The high-level shape is documented in
Tileiras Internals and in
Position in nvcc 13.1.
How is tileiras different from cicc?
The two binaries are sibling device compilers that target different input
languages. cicc is the legacy LLVM-based path: CUDA C++ enters through
cudafe++, lowers through a device LLVM IR cascade, and emits PTX.
Tileiras is the MLIR-based path: a tile frontend emits MLIR bytecode,
tileiras runs the dialect cascade, and emits PTX. They share ptxas as a
downstream and they share several IR concepts in the lower layers, but they
do not share a frontend, a pass pipeline, or a scheduler.
The contrast is unpacked in cicc Comparison.
Is tileiras open source?
Mostly no. A small portion of the cuda_tile dialect appears in NVIDIA's
public cuda-tile repository and that portion is what the
OSS Comparison Overview maps. The rest of the program,
including every other dialect, the scheduler, the NVPTX customisations, and
the driver, is closed and ships only as a binary in the CUDA toolkit. The
.td Files Delta page enumerates which TableGen
files have public counterparts.
What architectures does tileiras support?
The driver accepts the SM levels listed in the
CLI Options page as valid --gpu-name values.
The supported set includes sm_100, sm_103, sm_110, sm_120, sm_121,
their a (architecture-specific) variants, and a backward-compatibility
range that covers earlier Hopper and Ampere targets used by older CUTLASS
atoms. The exact mapping from SM level to feature set is in
PTX Version and Target Selection.
Using tileiras
How do I invoke tileiras?
Normally, you do not invoke it directly. nvcc invokes tileiras when the
input is Tile IR bytecode produced by a tile frontend. Direct invocation is
also supported and matches the form
tileiras --gpu-name=<target> --opt-level=<n> -o <output> <input>. The
full option matrix is in CLI Options and the
runtime gates that change behavior without appearing on the command line are
in Env Vars and Runtime Gates.
How do I produce Tile IR bytecode?
A tile frontend produces it. NVIDIA's Triton-style frontend and the CUTLASS
DSL frontend are the two known producers. Either one emits MLIR bytecode
that uses the dialect tags tileiras expects. Hand-writing Tile IR is
possible through mlir-translate with a tileiras-aware bytecode writer,
but it is not the supported path. The frontend contract, including the
dialect schema that a producer must obey, is in
Frontend Contract and Tile IR Emission.
Why doesn't mlir-translate --serialize-bytecode produce tileiras-readable files?
Tileiras's bytecode uses a wire format that diverges from upstream MLIR
bytecode in the attribute and type tag tables. Upstream mlir-translate
writes upstream tags; the tileiras reader expects its own tag space and
rejects files that probe as upstream bytecode. The reader probes for the
upstream magic explicitly so that this case produces a specific diagnostic,
"failed to parse IR bytecode (it looks like MLIR bytecode instead)". The
wire-format contract is in
MLIR Bytecode Format and the dialect-by-dialect
status is in Dialect Reader/Writer Status.
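The probe for upstream bytecode can be approximated from the outside. Upstream MLIR bytecode files begin with the four bytes 'M' 'L' 0xEF 'R'; the sketch below tests for that prefix. The helper name is hypothetical, and the magic that tileiras itself expects is not stated here and is not assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the buffer starts with the upstream MLIR bytecode magic.
inline bool looks_like_upstream_mlir_bytecode(const std::uint8_t *buf, std::size_t len) {
    return buf != nullptr && len >= 4 &&
           buf[0] == 'M' && buf[1] == 'L' && buf[2] == 0xEF && buf[3] == 'R';
}
```

Running this check before handing a file to tileiras tells an integrator whether the specific diagnostic above is the likely outcome.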
My compile fails with --device-debug --opt-level=3. Why?
The combination is rejected at driver level with "optimized debugging is not supported". --device-debug implies --opt-level=0; raising the
optimization level past that is an error rather than a warning. Use
--lineinfo for source mapping at higher optimization levels. The full
debugging story is in
Debugging and Introspection.
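A wrapper that builds tileiras command lines can apply the same rule before spawning the process. The sketch below is an assumption-laden model of the rule as described: DriverOptions and validate_debug_options are hypothetical, and the placeholder default optimization level is not taken from the driver.

```cpp
#include <cassert>
#include <string>

// Hypothetical option record; field names are illustrative, not the driver's.
struct DriverOptions {
    bool device_debug = false;
    bool opt_level_given = false;
    int  opt_level = 3;  // placeholder default, assumed for the sketch
};

// Returns an empty string on success, or the diagnostic text on rejection.
inline std::string validate_debug_options(DriverOptions &opts) {
    if (opts.device_debug) {
        if (opts.opt_level_given && opts.opt_level > 0)
            return "optimized debugging is not supported";
        opts.opt_level = 0;  // --device-debug implies --opt-level=0
    }
    return std::string();
}
```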
What does --gpu-name=sm_90a mean versus --gpu-name=sm_90?
The a suffix marks an architecture-specific target. sm_90a unlocks
Hopper-only instructions, most importantly WGMMA, that are not part of the
forward-compatible sm_90 baseline. Code compiled for sm_90a does not
run on later architectures without recompilation; code compiled for sm_90
does. The same pattern repeats for sm_100a, sm_103a, sm_120a, and
sm_121a. The selection logic is in
PTX Version and Target Selection.
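Tooling that wraps the driver often needs to split a target name into its SM number and the architecture-specific flag. A minimal sketch, assuming names of the form sm_<digits> with an optional trailing a; GpuTarget and parse_gpu_name are hypothetical, and the parser does not check whether the driver actually accepts the target.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct GpuTarget {
    int  sm = 0;
    bool arch_specific = false;
    bool valid = false;
};

// Parses sm_<1..4 digits>[a]; anything else is reported as invalid.
inline GpuTarget parse_gpu_name(const std::string &name) {
    GpuTarget t;
    if (name.size() < 4 || name.compare(0, 3, "sm_") != 0) return t;
    std::size_t i = 3;
    while (i < name.size() && i < 7 &&
           std::isdigit(static_cast<unsigned char>(name[i]))) {
        t.sm = t.sm * 10 + (name[i] - '0');
        ++i;
    }
    if (i == 3) return t;  // no digits after the prefix
    if (i < name.size() && name[i] == 'a') {
        t.arch_specific = true;
        ++i;
    }
    t.valid = (i == name.size());
    return t;
}
```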
Understanding the wiki
How accurate is this wiki?
For verbatim artifacts, very accurate. Diagnostic strings, opcode mnemonics,
attribute schemas, and bit-field layouts are extracted from the binary
byte-by-byte and carry HIGH confidence. For named functions like
sub_ABCDEF, the addresses are exact but the names are auto-generated by
IDA Pro because the binary is stripped; the algorithm descriptions on those
pages are derived from disassembly rather than from a source-level symbol.
The confidence taxonomy that every page uses is in
String Evidence and Confidence Policy.
Why are there sub_XXX references throughout the wiki?
Tileiras ships as a stripped binary, so the original function names are not
recoverable. IDA Pro names unknown functions sub_<hex_address> and the
wiki keeps that convention as the canonical reference for a function whose
real name is unknown. The address is stable across analyses and useful for
cross-referencing the binary; the prose around it describes what the
function does. The reverse-engineering methodology is in
Binary Anatomy and RE Methodology.
What is the difference between cute, cute_nvgpu, and cutlass?
Three layers of the CUTLASS programming model, each a separate dialect.
cute is the layout algebra and tile-decomposition primitive set. The
contract is in cute Overview. cute_nvgpu
is the NVIDIA-specific atom layer that binds layouts to actual hardware
copy and MMA instructions; its roster is in
cute_nvgpu Overview. cutlass is the
high-level pipeline and tile-scheduler dialect that orchestrates kernels
built from the lower two; its overview is in
cutlass Overview.
What does "warp-specialized" mean?
A scheduling pattern, also called producer-consumer specialization, where
one warp-group performs asynchronous loads and another warp-group performs
the matrix-multiply. The division is explicit in the IR: a producer
warp-group issues TMA copies and signals an mbarrier, the consumer
warp-group waits on that mbarrier and consumes the data. The op roster is
in nv_tileas Op Roster and Builders
and the synchronization protocol is in
mbarrier State Machine.
What is mbarrier?
A transactional barrier living in shared memory; its transaction-counting form arrived with Hopper as the synchronisation primitive for asynchronous bulk copies. An mbarrier tracks a pending-arrival count and a pending transaction-byte count; the asynchronous copy decrements the transaction count as its bytes land, and consumers wait until both counts have drained to zero, which completes the current phase. The state machine is documented in mbarrier State Machine.
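The two-counter behaviour can be modelled on the host to build intuition. The following is a toy, single-threaded sketch under the assumption that a phase completes once both the pending arrivals and the pending transaction bytes reach zero; MbarrierModel and its method names are hypothetical and do not correspond to any tileiras or CUDA API.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only; real mbarriers are hardware-managed objects in shared memory.
struct MbarrierModel {
    std::uint32_t expected_arrivals;
    std::uint32_t pending_arrivals;
    std::int64_t  pending_tx_bytes = 0;
    std::uint32_t phase = 0;

    explicit MbarrierModel(std::uint32_t arrivals)
        : expected_arrivals(arrivals), pending_arrivals(arrivals) {}

    void expect_tx(std::uint32_t bytes)   { pending_tx_bytes += bytes; }
    void complete_tx(std::uint32_t bytes) { pending_tx_bytes -= bytes; try_complete(); }
    void arrive() {
        assert(pending_arrivals > 0);
        --pending_arrivals;
        try_complete();
    }

    // A waiter that captured seen_phase may proceed once the phase has advanced.
    bool phase_done(std::uint32_t seen_phase) const { return phase != seen_phase; }

private:
    void try_complete() {
        if (pending_arrivals == 0 && pending_tx_bytes == 0) {
            ++phase;
            pending_arrivals = expected_arrivals;  // re-arm for the next phase
        }
    }
};
```

In the model, a producer that announces 128 bytes and arrives does not release the consumer; the release happens only when the copy reports those 128 bytes complete.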
What is TMA?
The Tensor Memory Accelerator, a Hopper-and-later hardware engine for
asynchronous bulk tensor copies between global and shared memory. The TMA
descriptor (CUtensorMap) encodes the multi-dimensional shape and
swizzling; the copy itself is initiated by cp.async.bulk.tensor and its
completion is ordered through an mbarrier. The codegen contract is in
TMA, Tensormap and cp.async.bulk.
What is WGMMA?
Warp-Group Matrix Multiply-Accumulate, the Hopper sm_90a instruction in
which four warps cooperate to issue one matrix-multiply against shared-memory
operands described by a 64-bit descriptor. The descriptor layout, the
synchronisation fence sequence, and the way the scheduler treats WGMMA
issue groups are in WGMMA Emission Protocol.
What is tcgen05?
The Blackwell sm_100a matrix-multiply family that replaces WGMMA. Unlike
WGMMA, tcgen05 keeps operands and accumulators in a dedicated tensor memory
(TMEM) bank rather than in registers, and supports 2-CTA and 4-CTA modes
where the multiply spans multiple thread-blocks within a cluster. The
tensor-memory programming model is in
tcgen05 Tensor Memory Model and the
multi-CTA variants are in
Blackwell 2-CTA/4-CTA MMA.
Reading paths
I want to reimplement tileiras
Read Position in nvcc 13.1 first to fix the binary's role, then Program Layout for the executable shape, then Pipeline Overview for the top-to-bottom cascade. Drill into whichever subsystem you are implementing next. Verify any single claim against the binary using the recipes in Binary Anatomy and RE Methodology.
I want to write a Tile IR frontend
Read Frontend Contract and Tile IR Emission for the dialect schema your emitter must satisfy, then cuda_tile Overview for the public-input dialect, then DSL to PTX End-to-End to follow a worked example from frontend output to PTX.
I want to understand WGMMA emission
Read WGMMA Emission Protocol for the issue contract, then cute_nvgpu MMA Atoms SM70-120 for the per-SM atom registry, then Per-SM Emission Templates for the PTX templates that the backend prints.
I want to debug a slow kernel
Read Performance and Cost Model for the scheduling cost function, then Debugging and Introspection for the diagnostic surfaces tileiras exposes, then Modulo Scheduler and Rau when the bottleneck reaches the scheduler itself.
I want to verify a claim made on this wiki
Read Binary Anatomy and RE Methodology for the verification recipes, then the String Evidence and Confidence Policy for how each page tags its claims.
Meta
Who wrote this wiki?
Reverse engineering and writing by Grigory Evko. The project is not endorsed by NVIDIA. Every claim derives from static analysis of the publicly-distributed CUDA 13.1 tileiras binary.
How can I contribute?
The wiki source lives at github.com/GrigoryEvko/nvopen-tools under
tileiras/wiki/. Issues and pull requests are welcome. Corrections that
challenge a specific claim are most useful when they cite either a
reproducible behavior of the binary or a binary offset.
Can I trust this wiki for production decisions?
For documentation and reimplementation reference, yes, within the confidence labels each page declares. For correctness in production, treat the wiki as a derived description and confirm any safety-critical behavior against the actual binary. The wiki is a reverse-engineered model; authoritative behavior lives only in the tileiras binary itself.
Cross-references
The questions above point into the rest of the wiki; the converse direction
is Reading Map, which organizes reading sequences by
subsystem rather than by question. The Glossary defines
the terms used here without unpacking them. The
Subsystem Map is the cross-reference for any
sub_<hex> name encountered while answering a follow-up question.
OSS Comparison Overview
Abstract
NVIDIA ships a small open-source preview of the cuda-tile dialect: one MLIR dialect declaration, three TableGen files, three transform passes, a standalone optimizer driver, and a thin interface-glue stub. Tileiras is a much larger compiler — twelve dialects, a four-stage lowering cascade, a modulo scheduler, and an NVPTX backend with private peephole passes — but the OSS preview is the only point where parts of the internal IR surface are visible in original source form.
The four OSS pages compare that preview against tileiras. The comparison is not symmetric. The OSS tree is a strict subset of one front-end dialect; tileiras carries the same surface plus six private dialects (nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass, NVVM) and the lowering pipelines between them. The useful question is: for each artifact in the public tree, what shape does the corresponding behavior take in tileiras?
The comparison methodology, the divergence taxonomy, and the per-page table conventions appear below. The other three OSS pages apply the methodology to TableGen declarations (.td Files Delta), interface and optimizer driver source (cuda_tile Tree Mapping), and transform passes (Transforms / FuseFMA / SynthDbg).
What the OSS Preview Contains
The public cuda-tile repository ships five categories of source:
| Category | Files | What it declares |
|---|---|---|
| Dialect TableGen | Types.td, AttrDefs.td, Ops.td | The cuda_tile type, attribute, and operation surface. |
| Interface glue | Interfaces.cpp, Interfaces.td | The attribute-interface and type-interface declarations consumed by op verifiers. |
| Transform passes | FuseFMA.cpp, LoopSplit.cpp, SynthesizeDebugInfoScopes.cpp | Three optimization passes operating on cuda_tile IR. |
| Optimizer driver | CudaTileOptimizer.cpp | A standalone tool that loads TileIR, runs an optimizer pipeline, and emits TileIR bytecode or LLVM bitcode. |
| Build glue | CMake fragments, pass registration helpers | The supporting infrastructure to compile the preview as a standalone library. |
Everything else in tileiras — the private dialect chain, the NVPTX backend, libdevice integration, the modulo scheduler, the bytecode I/O — has no OSS counterpart. The OSS pages do not attempt to invent one.
What "Comparison" Means Here
For each public artifact, the comparison answers four questions:
- Does tileiras carry the same behavior in a recognizable form?
- If it does, is the implementation structured the same way, or split, merged, or relocated to another layer?
- If it does not, was the artifact replaced by something else, deleted entirely, or scheduled for a later release?
- What can a reader infer about the public design from the tileiras shape, and vice versa?
The comparison runs from OSS to tileiras, not the other way around. Asking "what's missing from OSS that tileiras has" is a much larger question and would dominate the page with material that has no public counterpart. The OSS-to-tileiras direction stays bounded by the public surface.
Divergence Taxonomy
Comparing two implementations of a dialect surface produces seven recurring outcomes. The OSS pages use them as a controlled vocabulary:
| Status | Meaning |
|---|---|
| PRESENT | The public artifact exists in tileiras with the same role and a recognizable implementation shape. |
| REWRITTEN | The role is preserved, but the implementation is split across multiple sites or restructured around a different anchor. |
| ABSORBED | A public helper is folded into a larger tileiras driver: the function disappears as a named unit but its work happens inline at the caller. |
| SUPERSEDED | A different compiler layer (TileAS, NVPTX backend, libdevice) provides the same semantic effect. |
| INLINED | The artifact exists at use sites rather than as an out-of-line helper; this is common for generated ODS predicates and small verifier templates. |
| PARTIAL | Some public behavior matches in tileiras while another part is changed, missing, or relocated. |
| ABSENT | The public artifact has no observable counterpart in tileiras: it is either deleted, replaced by a different mechanism entirely, or scheduled for a later compiler release. |
The seven statuses are not orthogonal — a SUPERSEDED pass is also, by definition, ABSENT at its original layer — but the distinction matters because a reimplementer needs to know whether to look elsewhere in tileiras for the behavior or whether to leave the gap unfilled.
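For tooling that consumes the per-page tables, the controlled vocabulary can be written down as an enumeration. The sketch below is a reading aid under stated assumptions: the `Status` enum mirrors the seven labels above, but the `reimplementer_action` helper and its advice strings are hypothetical and only summarize the distinction drawn in the preceding paragraph.

```python
from enum import Enum


class Status(Enum):
    PRESENT = "PRESENT"
    REWRITTEN = "REWRITTEN"
    ABSORBED = "ABSORBED"
    SUPERSEDED = "SUPERSEDED"
    INLINED = "INLINED"
    PARTIAL = "PARTIAL"
    ABSENT = "ABSENT"


# Statuses where the behavior still exists, but not where the OSS tree puts it.
LOOK_ELSEWHERE = {
    Status.REWRITTEN,
    Status.ABSORBED,
    Status.SUPERSEDED,
    Status.INLINED,
}


def reimplementer_action(status: Status) -> str:
    """Hypothetical one-line advice per status (not taken from the wiki)."""
    if status is Status.PRESENT:
        return "follow the OSS shape"
    if status in LOOK_ELSEWHERE:
        return "find the behavior at its tileiras location"
    if status is Status.PARTIAL:
        return "check each part separately"
    return "leave the gap unfilled"
```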
Divergence Kinds
Cutting the same surface a different way: every concrete delta is one of six kinds.
| Kind | What changes | Example |
|---|---|---|
| Structural | Behavior is preserved, call graph is not. | OSS Interfaces.cpp includes generated code in one file; tileiras spreads the same code across parser, verifier, and printer call sites. |
| Semantic | Behavior changes. | cuda_tile.print accepts a cuda_tile.string type that the OSS dialect does not declare. |
| Granularity | A public unit is folded into a larger driver or split into smaller ones. | The OSS optimizer driver is absorbed into the full compile-to-GPU pipeline rather than exposed as a standalone tool. |
| Anchor-op | A pass is nested under a different MLIR operation. | OSS FuseFMA is rooted at cuda_tile::EntryOp; the closest tileiras pass is rooted at nv_tileas and runs on a different IR. |
| ABI | Parameter or storage layout differs. | OSS PipelineState is a C++ template member tuple; tileiras !cutlass.pipeline_state is a typed MLIR value with explicit phase/index/count fields. |
| Layering | A public pass is replaced by a lower or higher compiler layer. | OSS SynthesizeDebugInfoScopes is replaced by upstream MLIR's DIScopeForLLVMFuncOp plus the tileiras ConvertDebugInfoToLLVM path. |
Each per-page table identifies which kind applies. Readers implementing a tileiras-compatible compiler can decide on a per-kind basis whether to follow the OSS shape, the tileiras shape, or something else that preserves the same external contract.
How to Read the Pages
Each of the three detail pages targets one slice of the public tree:
cuda_tile Tree Mapping covers the two C++ source files in the public preview: Interfaces.cpp (mostly ODS-generated glue) and CudaTileOptimizer.cpp (the standalone driver). Each file gets a per-component table identifying which artifacts are PRESENT, REWRITTEN, ABSORBED, or ABSENT in tileiras, plus prose explaining the structural choices.
.td Files Delta covers the three TableGen files. Categories where every declaration is identical get one-line summaries. Categories where tileiras carries a delta get focused tables showing the public declaration shape next to the tileiras-recovered declaration. The known deltas are small in count but each one matters for parser compatibility: one renamed op, one absent op, one added type.
Transforms / FuseFMA / SynthDbg covers the three transform passes. None of the three survives in cuda_tile-dialect form: one is SUPERSEDED by lower layers, one is ABSENT without replacement, and one is replaced by upstream MLIR's debug-scope pass. The page documents each migration target.
Reimplementation Stance
A tileiras-compatible reimplementation should treat the OSS tree as authoritative for what it covers and the rest of this wiki as authoritative for everything outside the public surface. Specifically:
- Use OSS Types.td, AttrDefs.td, and Ops.td for the cuda_tile declaration surface, with the deltas listed in .td Files Delta applied.
- Use OSS Interfaces.cpp and Interfaces.td for the ODS interface shape, but expect the consumers to be spread across the verifier, parser, and printer rather than concentrated in one stub.
- Do not copy the OSS Transforms/ directory into the lowering pipeline. The three passes have different replacement strategies in tileiras, and copying them produces double-firing or anchor-op mismatches.
- Do not expose CudaTileOptimizer as a standalone tool unless deliberately adding functionality. Tileiras has no equivalent standalone entry point; the full compile pipeline subsumes the optimizer role.
Documentation Stance
The OSS pages describe behavior and contracts in prose. They do not depend on raw reverse-engineering notes being visible to readers, and they do not treat internal symbol names as the comparison surface. When the public tree is the relevant reference, the page names the public file or artifact directly; when tileiras-only behavior is described, the page describes the behavior rather than reproducing the implementation.
Cross-References
- cuda_tile Tree Mapping: the per-file comparison for Interfaces.cpp and CudaTileOptimizer.cpp.
- .td Files Delta: the TableGen-level deltas across types, attributes, and ops.
- Transforms / FuseFMA / SynthDbg: the three OSS transform passes and their tileiras counterparts.
- cuda_tile Overview: the dialect as seen from inside tileiras.
- Architecture Evolution and Design Decisions: why the OSS preview is a subset of one front-end dialect rather than a cross-section of the whole compiler.
cuda_tile Tree Mapping
Abstract
The public cuda-tile repository contains two C++ source files that describe the dialect as actual code: Interfaces.cpp, which is mostly an ODS-generated stub that hosts the interface implementation, and CudaTileOptimizer.cpp, the standalone tool that takes TileIR in, runs an optimizer pipeline, and emits TileIR bytecode or LLVM bitcode out. Together they cover the dialect contract (what verifiers must check) and the dialect's user-facing entry point (how a developer drives the optimizer).
This page maps both files to their tileiras counterparts. The mapping is not symmetric: Interfaces.cpp corresponds to a distributed pattern in tileiras (ODS-generated interface code spread across parser/verifier/printer), and CudaTileOptimizer.cpp has no standalone counterpart at all — its role is absorbed into the full compile-to-GPU pipeline.
Interfaces.cpp and Interfaces.td
The OSS Interfaces.cpp is a one-screen stub:
// Interfaces.cpp (OSS, abbreviated)
#include "CudaTile/IR/Interfaces.h"
#include "CudaTile/IR/Interfaces.cpp.inc" // ODS-generated TypeInterface bodies
#include "CudaTile/IR/AttrInterfaces.cpp.inc" // ODS-generated AttrInterface bodies
All real interface code lives in the ODS-generated .cpp.inc files. The declarations in Interfaces.td are what matter for the comparison.
AssumePredicateAttrInterface
Upstream declaration:
// Interfaces.td (OSS)
def AssumePredicateAttrInterface : AttrInterface<"AssumePredicateAttrInterface"> {
  let cppNamespace = "::mlir::cuda_tile";
  let methods = [
    InterfaceMethod<
      "Verify that the predicate is well-formed for a given assume op.",
      "::mlir::LogicalResult", "verifyWithAssumeOp",
      (ins "::mlir::Operation *":$assumeOp)
    >,
  ];
}
Tileiras carries the same interface with the same method signature. The implementation pattern in tileiras: each concrete predicate attribute (DivByAttr, SameElementsAttr, BoundedAttr) declares AssumePredicateAttrInterface in its interfaces ODS field; the ODS expansion produces a per-attribute verifyWithAssumeOp body that runs the attribute-specific check; the cuda_tile.assume op verifier resolves the predicate attribute through the interface and dispatches to the concrete implementation.
| Aspect | OSS | Tileiras | Status |
|---|---|---|---|
| Interface declaration | Interfaces.td | matching ODS declaration | PRESENT |
| ODS-generated dispatch glue | Interfaces.cpp.inc | inlined into each concrete attribute's verifier slab | INLINED |
| Per-attribute verifier body | one implementation per predicate attribute | one implementation per predicate attribute | PRESENT |
| Interface TypeID | one interned TypeID shared by all implementors | same — single interned TypeID per interface | PRESENT |
The divergence is the location of the dispatch glue. OSS keeps it in one .cpp.inc that Interfaces.cpp includes; tileiras inlines the same dispatch into each concrete attribute's slab during ODS expansion. Both call the same per-attribute verifyWithAssumeOp body. The semantic contract is identical.
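The dispatch pattern described above can be modeled in a few lines of Python. This is a sketch of the shape only: `AssumePredicate` stands in for AssumePredicateAttrInterface, `verify_assume` stands in for the cuda_tile.assume verifier, and the concrete checks (positive divisor, ordered bounds) are invented for illustration; the wiki does not state what each attribute actually verifies.

```python
from dataclasses import dataclass


class AssumePredicate:
    """Stand-in for AssumePredicateAttrInterface (hypothetical Python model)."""

    def verify_with_assume_op(self, assume_op) -> bool:
        raise NotImplementedError


@dataclass
class DivBy(AssumePredicate):
    divisor: int

    def verify_with_assume_op(self, assume_op) -> bool:
        # Invented check: a divisibility assumption needs a positive divisor.
        return self.divisor > 0


@dataclass
class Bounded(AssumePredicate):
    lower: int
    upper: int

    def verify_with_assume_op(self, assume_op) -> bool:
        # Invented check: the bounds must describe a non-empty range.
        return self.lower <= self.upper


@dataclass
class AssumeOp:
    predicate: object


def verify_assume(op: AssumeOp) -> bool:
    """Resolve the predicate through the interface, then dispatch."""
    pred = op.predicate
    if not isinstance(pred, AssumePredicate):
        return False  # attribute does not implement the interface
    return pred.verify_with_assume_op(op)
```

Whether the dispatch glue lives in one shared include or is inlined next to each attribute does not change what `verify_assume` observes, which is the sense in which the semantic contract is identical.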
TileView TypeInterface
Upstream declaration:
// Interfaces.td (OSS)
def TileView : TypeInterface<"TileView"> {
  let cppNamespace = "::mlir::cuda_tile";
  let methods = [
    InterfaceMethod<
      "Returns the rank of the view's index space.",
      "int64_t", "getViewIndexRank"
    >,
    InterfaceMethod<
      "Returns the tile type produced when the view is fully indexed.",
      "::mlir::Type", "getViewTileType"
    >,
  ];
}
Tileiras carries the same two methods on the same interface. The view types implementing it are cuda_tile.tensor_view and cuda_tile.partition_view.
| Aspect | OSS | Tileiras | Status |
|---|---|---|---|
| Interface declaration | Interfaces.td | matching ODS declaration | PRESENT |
| getViewIndexRank() | per-view-type implementation | per-view-type implementation | PRESENT |
| getViewTileType() | per-view-type implementation | per-view-type implementation | PRESENT |
| Consumers | view-consuming op verifiers call interface methods | same set of view-consuming ops use the same interface methods | PRESENT |
Same status as AssumePredicateAttrInterface: declaration identical, dispatch glue location differs, semantics preserved.
AllElementTypeMatch Predicate
Upstream declaration:
// Interfaces.td (OSS, predicate trait)
class AllElementTypeMatch<list<int> indices> : PredOpTrait<...> { ... }
This is not a runtime-dispatched interface — it is a generated ODS predicate that emits a static check into each consuming op's verifier. OSS centralizes the predicate template in Interfaces.td and lets the TableGen expander inline it per use.
Tileiras follows the same model. The predicate is INLINED at every consuming op verifier: the ODS expander emits the same element-type-match check into each verifier body. No central helper exists at runtime in either tree; both spell out the check at every use site.
| Aspect | OSS | Tileiras | Status |
|---|---|---|---|
| Predicate template | Interfaces.td | identical template | PRESENT |
| Runtime helper function | none — generated inline | none — generated inline | INLINED |
| Per-op verifier code | one inlined predicate per consuming op | one inlined predicate per consuming op | PRESENT |
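The check that the predicate expands to can be illustrated in Python. Unlike the generated code, which has no central helper, this sketch factors the comparison into one function for brevity and then builds a per-op verifier around it. The function names, the string element types, and the diagnostic wording are all hypothetical.

```python
def all_element_types_match(element_types, indices):
    """True when the element types at the given positions are all equal."""
    selected = [element_types[i] for i in indices]
    return all(t == selected[0] for t in selected[1:])


def make_verifier(op_name, indices):
    """Build a per-op verifier, mimicking one inlined predicate per op."""

    def verify(element_types):
        if not all_element_types_match(element_types, indices):
            return f"'{op_name}' op element types at {list(indices)} must match"
        return None  # None means the op verified cleanly

    return verify
```

A consuming op would be checked with something like `make_verifier("cuda_tile.example", [0, 1])(["f16", "f16", "f32"])`, where the op name is a placeholder rather than a real mnemonic.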
CudaTileOptimizer.cpp
The OSS driver is a standalone tool. Its main function follows a textbook MLIR-tool shape:
// CudaTileOptimizer.cpp (OSS, abbreviated)
int main(int argc, char **argv) {
  mlir::registerAllPasses();
  cuda_tile::registerOptimizerPasses();
  mlir::DialectRegistry registry;
  cuda_tile::registerDialects(registry);
  MLIRContext ctx(registry);
  ctx.loadAllAvailableDialects();
  // Parse input: accepts TileIR bytecode or textual MLIR.
  OwningOpRef<Operation *> module = parseSourceFile(input_file, &ctx);
  if (!module) return 1;
  // Build pass manager rooted at cuda_tile::EntryOp.
  PassManager pm(&ctx);
  pm.addNestedPass<cuda_tile::EntryOp>(createFuseFMAPass());
  pm.addPass(createCanonicalizerPass());
  pm.addPass(createCSEPass());
  pm.addPass(createLoopInvariantCodeMotionPass());
  pm.addNestedPass<cuda_tile::EntryOp>(createLoopSplitPass());
  // Accept optional pre/post textual pipeline fragments.
  applyTextualPipelineFragments(pm, pre_fragment, post_fragment);
  if (failed(pm.run(*module))) return 1;
  // Emit TileIR bytecode, memory bytecode, MLIR file, or MLIR stdout.
  return emitOutput(*module, output_kind, output_file);
}
Tileiras has no standalone optimizer entry point. The same passes — FMA fusion, canonicalization, CSE, LICM, loop splitting — exist or have replacements in the full compile pipeline, but they are reached as part of tileiras_compile(), not as a cuda_tile-opt-style tool. The compile pipeline does not stop at cuda_tile; it lowers through nv_tileaa, nv_tileas, cute_nvgpu, cutlass, and nvvm, then runs the NVPTX backend.
| Driver component | OSS behavior | Tileiras behavior | Divergence kind |
|---|---|---|---|
| Input format | TileIR bytecode or textual MLIR | TileIR bytecode only | Semantic (textual MLIR rejected) |
| Optimizer anchor | cuda_tile::EntryOp-nested pass manager | full pipeline; per-pass anchors vary | Anchor-op |
| FMA fusion | FuseFMA at cuda_tile layer | tileas-legalize-fma-dot plus NVPTX -nvptx-fma-level | Layering (SUPERSEDED) |
| Canonicalization | createCanonicalizerPass() at cuda_tile layer | canonicalizer runs after every lowering stage | Granularity (split) |
| CSE | createCSEPass() at cuda_tile layer | CSE runs at multiple lowering layers | Granularity (split) |
| LICM | createLoopInvariantCodeMotionPass() at cuda_tile layer | LICM runs at the nv_tileas and LLVM layers | Layering (REWRITTEN) |
| Loop splitting | LoopSplit at cuda_tile layer | no equivalent at any layer | ABSENT |
| Textual pipeline fragments | applyTextualPipelineFragments() | no equivalent; pipeline is fixed by opt-level | ABSENT |
| Output: TileIR bytecode | emit bytecode | not exposed as terminal output | ABSORBED |
| Output: TileIR memory bytecode | emit memory bytecode | not exposed as terminal output | ABSORBED |
| Output: MLIR file/stdout | emit textual MLIR | not exposed as terminal output | ABSORBED |
| Output: LLVM bitcode | emit LLVM bitcode | not exposed as terminal output | ABSORBED |
| Terminal output | one of the four above | PTX text or CUBIN binary | Layering |
| Pass registration | one helper that adds the optimizer passes | distributed across dialect and extension installers | Structural |
The driver's anchor — cuda_tile::EntryOp — is the structural reason the OSS optimizer cannot be lifted directly into tileiras. Once the pipeline lowers past cuda_tile, no EntryOp exists to anchor a nested pass manager against. The OSS-style pass scheduling assumes the IR stays in cuda_tile for the entire optimizer run; tileiras's pipeline schedules each pass against whichever dialect is current at that point.
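The anchoring problem can be shown with a toy model in which a module is a list of op records and a nested pass only visits ops whose name equals its anchor. Everything here is illustrative: the mnemonic `cuda_tile.entry` is an assumed spelling for the op behind cuda_tile::EntryOp, `lowered.func` is a made-up name for whatever the entry op becomes after lowering, and `mark_fused` is a placeholder pass.

```python
def run_nested(module, anchor, pass_fn):
    """Apply pass_fn to each top-level op named `anchor`; return match count."""
    matched = 0
    for op in module:
        if op["name"] == anchor:
            pass_fn(op)
            matched += 1
    return matched


def mark_fused(op):
    # Placeholder for a real transformation such as FMA fusion.
    op["fused"] = True


def lower_entry(module):
    """Hypothetical lowering that replaces the entry op with a lower-level op."""
    return [
        {**op, "name": "lowered.func"} if op["name"] == "cuda_tile.entry" else op
        for op in module
    ]
```

Before lowering, `run_nested(module, "cuda_tile.entry", mark_fused)` visits the entry op; after `lower_entry`, the same call matches nothing, which is why a pass scheduled the OSS way silently stops applying once the IR has left cuda_tile.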
What Survives
The pass concepts survive. FMA fusion is a real concern in tileiras — it just happens at the TileAS and NVPTX backend layers rather than at cuda_tile. Canonicalization, CSE, and LICM are real concerns in tileiras — they run between every lowering stage rather than in one batch. Loop splitting is the one OSS pass with no tileiras counterpart at any layer; a reimplementer adding it would be extending tileiras's capabilities rather than reproducing them.
What Does Not Survive
The standalone cuda-tile-opt-style tool does not survive. The four output kinds do not survive at the tool level. The textual pipeline fragments do not survive — tileiras's pipeline is built per opt-level by a fixed builder rather than assembled from caller-supplied textual fragments.
A tileiras-compatible compiler should not expose a CudaTileOptimizer-shaped tool unless the goal is to add a tile-level optimizer that does not exist in the released binary. The full compile pipeline is the supported entry point.
Generated Code Layout
Both files include ODS-generated .cpp.inc content. The mapping for the generated pieces:
| Generated artifact | OSS | Tileiras | Status |
|---|---|---|---|
| Dialect.cpp.inc (dialect registration) | included in dialect translation unit | inlined into the dialect ctor slab | INLINED |
| Ops.cpp.inc (op classes) | included in ops translation unit | inlined into per-op slabs | INLINED |
| Types.cpp.inc (type classes) | included in types translation unit | inlined into per-type slabs | INLINED |
| AttrDefs.cpp.inc (attribute classes) | included in attrs translation unit | inlined into per-attribute slabs | INLINED |
| Interfaces.cpp.inc | included in Interfaces.cpp | inlined into each concrete implementor | INLINED |
| AttrInterfaces.cpp.inc | included in Interfaces.cpp | inlined into each concrete implementor | INLINED |
| Passes.cpp.inc (pass registration helpers) | included in pass-registration TU | spread across dialect and extension installers | REWRITTEN |
The cross-cutting pattern: tileiras's LTO build inlines ODS-generated dispatch into each concrete consumer rather than concentrating it in central includes. The behavior at the source-language level is identical; the build-time factoring differs.
Reimplementation Guidance
For a tileiras-compatible reimplementation:
- Use OSS Interfaces.td as the authoritative declaration of the dialect's interfaces. The two interfaces (AssumePredicateAttrInterface, TileView) and the AllElementTypeMatch predicate are unchanged.
- Implement AssumePredicateAttrInterface on the three predicate attributes (DivByAttr, SameElementsAttr, BoundedAttr). Implement TileView on the two view types (cuda_tile.tensor_view, cuda_tile.partition_view).
- Do not expose a standalone cuda-tile-opt-shaped tool. The driver layer in tileiras is tileiras_create_program + tileiras_compile_program + tileiras_get_output; reimplement those, not the OSS optimizer.
- Do not accept textual MLIR input; tileiras consumes TileIR bytecode only.
- Do not register a LoopSplit pass for compatibility. It has no tileiras counterpart.
- FMA fusion, canonicalization, CSE, and LICM should run at the tileiras-equivalent layers (TileAS and the NVPTX backend for FMA, between every lowering stage for the others), not at the cuda_tile layer with the OSS scheduling.
Cross-References
- OSS Comparison Overview: the divergence taxonomy used in the tables above.
- .td Files Delta: the TableGen-declared surface that Interfaces.cpp consumes.
- Transforms / FuseFMA / SynthDbg: the OSS optimizer pass set in detail.
- cuda_tile Verifiers: how the interfaces declared here are consumed at verify time inside tileiras.
- Driver Overview: the supported entry point that replaces the OSS standalone tool.
.td Files Delta
Abstract
The public cuda_tile TableGen surface is declared by three files: Types.td, AttrDefs.td, and Ops.td. Tileiras matches almost all of that surface. Three concrete deltas distinguish the tileiras dialect from the upstream declarations:
- The mnemonic print_tko in upstream Ops.td is renamed to print in tileiras (semantic change at the parser/printer level).
- The operation cuda_tile.atan2 declared in upstream Ops.td is absent from tileiras; it ships in the 13.2 dialect surface but not in the 13.1 release this wiki covers.
- The type mnemonic cuda_tile.string is added in tileiras with no upstream counterpart, used by the renamed cuda_tile.print op as the format-string operand type.
The rest of the surface is declaration-for-declaration identical between the public TableGen sources and the recovered tileiras declarations: the 91 operations beyond the two ops above (93 upstream ops minus the renamed op and the absent op), the type system apart from the added cuda_tile.string, all attributes, all interfaces, and all predicate helpers. For a reimplementer, the 13.1 dialect is the upstream cuda_tile surface with one mnemonic rename, one operation suppression, and one added type.
Types.td
Upstream Types.td declares five concrete types and thirteen scalar aliases. Tileiras carries all five concrete types unchanged and all thirteen aliases unchanged. The one delta is the addition of cuda_tile.string.
Concrete Types
| Definition | Mnemonic | Upstream | Tileiras |
|---|---|---|---|
| CudaTile_PointerType | cuda_tile.ptr | declared | declared (identical) |
| CudaTile_TileType | cuda_tile.tile | declared | declared (identical) |
| CudaTile_TensorViewType | cuda_tile.tensor_view | declared | declared (identical) |
| CudaTile_PartitionViewType | cuda_tile.partition_view | declared | declared (identical) |
| CudaTile_TokenType | cuda_tile.token | declared | declared (identical) |
| (tileiras-only) | cuda_tile.string | absent | added |
Added Type: cuda_tile.string
Tileiras adds one type with no upstream equivalent. The declaration shape:
// Tileiras-only (no upstream counterpart in Types.td)
def CudaTile_StringType : CudaTile_Type<"String", "string"> {
  let summary = "A static-length string value used by cuda_tile.print";
  let description = [{
    Carries a UTF-8 byte sequence with a static length. The compiler does not
    expect arbitrary string-valued computations at the cuda_tile layer; this
    type exists so the renamed print op can take a typed format string as an
    operand rather than as an attribute.
  }];
  let parameters = (ins "int64_t":$length);
  let assemblyFormat = "`<` $length `>`";
}
The type is parsed and printed as !cuda_tile.string<N> where N is the static byte length. The op that consumes it is cuda_tile.print (described below in the Ops.td section). No other tileiras op accepts cuda_tile.string operands.
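A round-trip for the textual form can be sketched in Python. This is a simplified stand-in for the real MLIR type parser, not a description of it: the helper names are hypothetical, the regular expression rejects whitespace inside the angle brackets (which the real parser may tolerate), and `string_type_for` assumes the static length is the UTF-8 byte count of the format string, as the description above suggests.

```python
import re

# Strict textual form: !cuda_tile.string<N> with a non-negative decimal N.
_STRING_TYPE = re.compile(r"!cuda_tile\.string<(\d+)>")


def parse_string_type(text: str) -> int:
    """Return the static byte length encoded in the type's textual form."""
    match = _STRING_TYPE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a cuda_tile.string type: {text!r}")
    return int(match.group(1))


def print_string_type(length: int) -> str:
    """Print the type for a given static byte length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return f"!cuda_tile.string<{length}>"


def string_type_for(fmt: str) -> str:
    """Type of a format string, measured in UTF-8 bytes (assumed convention)."""
    return print_string_type(len(fmt.encode("utf-8")))
```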
Scalar Aliases
Both upstream and tileiras declare the same thirteen scalar aliases:
i1, i8, i16, i32, i64
f16, bf16, f32, tf32, f64
f8E4M3FN, f8E5M2, f8E8M0FNU
These are predicate-helper aliases used by op verifiers; they are not standalone types. f8E8M0FNU is carried through ODS predicate expansion in tileiras but no observed op consumer accepts it as an element type at runtime. Practical element-type validation in both trees ends at f8E5M2.
Predicate Helpers
The predicate helpers (CudaTile_IntegerType, CudaTile_FloatType, CudaTile_NumberType, CudaTile_TileElementType, CudaTile_TileOf<...>, CudaTile_RankedTileOf<...>, CudaTile_ScalarTileOf<...>, CudaTile_IntegerTile, CudaTile_BaseFloatTile, CudaTile_FloatTile, CudaTile_NumberTile, CudaTile_PointerTile) are declared identically in both trees. They are TableGen predicate templates expanded inline at each consuming op's verifier. No runtime helpers exist for them in either tree.
AttrDefs.td
The attribute surface is identical between upstream and tileiras. All six attribute groups are present declaration-for-declaration. No deltas exist in this file.
Attribute Groups (Both Trees, Identical)
| Group | Attributes | Upstream / Tileiras |
|---|---|---|
| Arithmetic enums | signedness, overflow, rounding, comparison_ordering, comparison_predicate | identical declarations |
| Atomics and memory | AtomicRMWModeAttr, MemoryScopeAttr, MemoryOrderingSemanticsAttr | identical declarations |
| Assumption predicates | DivByAttr, SameElementsAttr, BoundedAttr | identical declarations |
| Layout and padding | OptimizationHintsAttr, PaddingValueAttr | identical declarations |
| Debug-info nodes | DILocAttr, DICompileUnitAttr, DIFileAttr, DILexicalBlockAttr, DISubprogramAttr | identical declarations |
| Debug-info bases | DIAttr, DINodeAttr, DIScopeAttr, DILocalScopeAttr | identical declarations |
AtomicRMWModeAttr has ten cases in both trees: AND, OR, XOR, ADD, ADDF, MAX, MIN, UMAX, UMIN, XCHG. The three assumption-predicate attributes (DivByAttr, SameElementsAttr, BoundedAttr) all implement AssumePredicateAttrInterface, so cuda_tile.assume verifies them through the same interface dispatch in both trees. DivByAttr uses the same custom assembly format (div_by<...>) in both trees.
OptimizationHintsAttr accepts the same SM-key vocabulary in both trees: sm_80, sm_86, sm_87, sm_88, sm_89, sm_90, sm_100, sm_103, sm_110, sm_120, sm_121. The useful keys (kNumCTAInCGA, kAllowTMA, kLatency, kOccupancy) are declared identically.
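A frontend can pre-check its hints against this vocabulary before emitting the attribute. The sketch below is an assumption-laden model: the nested-dictionary shape (SM key mapping to hint key/value pairs) is hypothetical and not the attribute's real storage layout, and treating the four hint keys named above as exhaustive is also an assumption, since the text only calls them the useful ones.

```python
SM_KEYS = frozenset({
    "sm_80", "sm_86", "sm_87", "sm_88", "sm_89", "sm_90",
    "sm_100", "sm_103", "sm_110", "sm_120", "sm_121",
})

# Assumed exhaustive for this sketch; the real attribute may accept more.
HINT_KEYS = frozenset({"kNumCTAInCGA", "kAllowTMA", "kLatency", "kOccupancy"})


def check_hints(hints):
    """Return a list of problems for a {sm_key: {hint_key: value}} mapping."""
    problems = []
    for sm, entries in hints.items():
        if sm not in SM_KEYS:
            problems.append(f"unknown SM key: {sm}")
            continue
        for key in entries:
            if key not in HINT_KEYS:
                problems.append(f"unknown hint key for {sm}: {key}")
    return problems
```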
Ops.td
Upstream Ops.td declares 94 operation records — 93 ops plus the CudaTile_FmaTile type-constraint pseudo-record. Tileiras carries 92 of those records unchanged, renames one mnemonic, and omits one.
Operation Census
| Source | Op count | Notes |
|---|---|---|
| Upstream Ops.td | 93 ops + 1 type constraint | full 13.2-preview surface |
| Tileiras | 92 ops + 1 type constraint | 13.1 surface |
| Renamed | 1 | print_tko → print |
| Removed | 1 | atan2 (13.2-only) |
| Added | 0 | no tileiras-only ops in Ops.td |
Of the 92 tileiras ops, 91 are carried without change and are identical declarations; the remaining one is the renamed print op. Listing the 91 would duplicate the OSS source verbatim; instead, the sections below show the two deltas with their exact declaration shapes.
Delta 1: print_tko → print Rename
Upstream declaration:
// Ops.td (OSS)
def CudaTile_PrintTkoOp : CudaTile_Op<"print_tko", [
    DeclareOpInterfaceMethods<MemoryEffectsOpInterface>
  ]> {
  let summary = "Token-ordered runtime print operation";
  let arguments = (ins
    CudaTile_TokenType:$inToken,
    StrAttr:$format,
    Variadic<AnyType>:$args
  );
  let results = (outs CudaTile_TokenType:$outToken);
  let assemblyFormat = [{
    $inToken `,` $format ( `,` $args^ )? attr-dict `:` type($args)
  }];
}
Tileiras-recovered declaration:
// Tileiras Ops.td (recovered)
def CudaTile_PrintTkoOp : CudaTile_Op<"print", [
    DeclareOpInterfaceMethods<MemoryEffectsOpInterface>
  ]> {
  let summary = "Token-ordered runtime print operation";
  let arguments = (ins
    CudaTile_TokenType:$inToken,
    CudaTile_StringType:$format,
    Variadic<AnyType>:$args
  );
  let results = (outs CudaTile_TokenType:$outToken);
  let assemblyFormat = [{
    $inToken `,` $format ( `,` $args^ )? attr-dict `:` type($format) ( `,` type($args)^ )?
  }];
}
Two changes. First, the mnemonic in the op definition is print rather than print_tko, so the textual and bytecode forms use cuda_tile.print everywhere. The _tko suffix that the upstream dialect uses to flag token-ordered ops is dropped from this specific op's mnemonic, though the C++ class name (PrintTkoOp) and the token-ordered semantics are preserved.
Second, the format operand is typed as CudaTile_StringType rather than as a StrAttr. This converts the format string from an attribute (compile-time constant) to an operand (SSA value). The motivation is downstream lowering: a typed string operand can be lowered through cuda_tile.string materialization to a global symbol holding the format bytes, which is what the NVPTX vprintf ABI expects. A StrAttr-typed format would force every print site to emit the string bytes inline at the use site.
The renamed op is the only consumer of cuda_tile.string. Upstream has no string-typed operand anywhere, because its print_tko op carries the format as a StrAttr, which explains why upstream Types.td does not need a string type at all.
Delta 2: atan2 Absent
Upstream declaration:
// Ops.td (OSS, 13.2-preview)
def CudaTile_Atan2Op : CudaTile_Op<"atan2", [
Pure, ElementwiseMappable, SameOperandsAndResultElementType
]> {
let summary = "Elementwise two-argument arctangent";
let arguments = (ins
CudaTile_FloatTile:$y,
CudaTile_FloatTile:$x
);
let results = (outs CudaTile_FloatTile:$result);
let assemblyFormat = "$y `,` $x attr-dict `:` type($result)";
}
Tileiras-recovered declaration:
// Tileiras Ops.td (recovered): no CudaTile_Atan2Op record declared.
The operation is absent. A strict tileiras-compatible 13.1 parser rejects cuda_tile.atan2 because no op record matches the mnemonic. Frontends emitting cuda_tile IR that target both the 13.1 and 13.2 dialect surfaces should gate atan2 emission behind explicit version logic and fall back to a mul/div/atan/select sequence for 13.1 targets.
The absence is not an accident of the recovery: it reflects that atan2 was added to the dialect after the tileiras 13.1 release. The carried-through f8E8M0FNU alias mentioned in the Types.td section is the inverse case — a declaration that survives in tileiras as a TableGen artifact but has no observed runtime consumer.
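The fallback mentioned above can be modelled on scalars. The following is a minimal Python sketch of one possible div/atan/select shape, not the sequence any particular frontend emits; the function name is hypothetical, and signed zeros, infinities, and NaN are deliberately not handled.

```python
import math


def atan2_fallback(y: float, x: float) -> float:
    """Two-argument arctangent built from a divide, atan, and selects.

    Hypothetical helper for illustration; edge cases such as signed
    zeros, infinities, and NaN are not modelled.
    """
    if x == 0.0:
        # Vertical axis: the quotient is undefined, so select the limit.
        if y > 0.0:
            return math.pi / 2
        if y < 0.0:
            return -math.pi / 2
        return 0.0
    base = math.atan(y / x)  # principal value in (-pi/2, pi/2)
    if x > 0.0:
        return base  # right half-plane needs no correction
    # Left half-plane: shift by +pi or -pi depending on the sign of y.
    return base + math.pi if y >= 0.0 else base - math.pi
```

On a tile, each branch would become an elementwise select over the whole operand rather than scalar control flow.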
Delta 3: cuda_tile.string Added (Cross-Reference)
The added cuda_tile.string type belongs structurally to Types.td (see above), but its only consumer is the renamed cuda_tile.print op. The two deltas — the string type and the print rename — are coupled: removing one without the other would leave either a typeless format operand or an unused type declaration.
Renamed-Or-Removed Op Summary
| Upstream definition | Upstream mnemonic | Tileiras mnemonic | Status |
|---|---|---|---|
| CudaTile_PrintTkoOp | print_tko | print | RENAMED (also: format operand retyped) |
| CudaTile_Atan2Op | atan2 | (absent) | ABSENT (13.2-only) |
The other 92 ops — every arithmetic op, every memory op, every control-flow op, every shape op, every conversion op, every MMA op, every reduction/scan op, every constant/select/diagnostic op — are declaration-for-declaration identical between upstream Ops.td and the tileiras-recovered declarations. The producer-side surface for those 92 ops can be lifted directly from upstream without modification.
Reimplementation Guidance
- Use upstream Types.td as the authoritative declaration for all five concrete types and all thirteen aliases. Add one extra CudaTile_StringType declaration with the shape shown above.
- Use upstream AttrDefs.td verbatim. No deltas.
- Use upstream Ops.td for 92 of the 93 ops verbatim. Rename CudaTile_PrintTkoOp's mnemonic from print_tko to print, and retype its format operand from StrAttr to CudaTile_StringType. Delete the CudaTile_Atan2Op record entirely.
- A strict tileiras-compatible 13.1 parser must reject cuda_tile.atan2 and must accept cuda_tile.print while rejecting cuda_tile.print_tko. Older bytecode files emitted against the 13.0 dialect surface would have used print_tko; the tileiras bytecode reader does not accept that mnemonic, so a re-emission against the 13.1 dialect is required.
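The accept/reject rule in the last bullet can be sketched as a small mnemonic gate. This is an illustrative Python model only: the function name, the table names, and the diagnostic wording are hypothetical, and the gate checks just the two deltas rather than validating the full op roster.

```python
# Ops present upstream but with no record in the 13.1 dialect surface.
ABSENT_IN_13_1 = {"cuda_tile.atan2"}

# Upstream mnemonic -> mnemonic a 13.1-compatible reader expects instead.
RENAMED_IN_13_1 = {"cuda_tile.print_tko": "cuda_tile.print"}


def check_mnemonic(name: str) -> str:
    """Return the name unchanged if a strict 13.1 parser would accept it."""
    if name in ABSENT_IN_13_1:
        raise ValueError(f"{name}: no op record in the 13.1 dialect surface")
    if name in RENAMED_IN_13_1:
        raise ValueError(
            f"{name}: not accepted; re-emit as {RENAMED_IN_13_1[name]}"
        )
    return name
```

Note that re-emission is more than a textual rename, because the format operand also changes from an attribute to a cuda_tile.string value.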
Cross-References
- OSS Comparison Overview: the divergence taxonomy classifying the three deltas.
- cuda_tile Tree Mapping: how the interface declarations in Interfaces.td (a fourth public TableGen file outside the three covered here) map between trees.
- cuda_tile Op Roster: the operation surface as exposed inside tileiras, with the deltas applied.
- cuda_tile Types and Attrs: the type and attribute surface inside tileiras, including the added cuda_tile.string.
Transforms / FuseFMA / SynthDebugInfo
Abstract
None of the three OSS cuda-tile transform previews ships in tileiras in its original
cuda-tile-dialect form. FuseFMA.cpp is superseded by lower compiler layers. LoopSplit.cpp is
absent without a TileIR-equivalent replacement. SynthesizeDebugInfoScopes.cpp is replaced by
upstream MLIR's LLVM debug-scope pass, with only shared location helper behavior surviving.
That means a reimplementation should not blindly copy the public Transforms/ directory into the
tileiras pipeline. The public files are useful for understanding the historical cuda_tile tool,
but the released compiler routes FMA, loop shaping, and debug-scope synthesis elsewhere.
FuseFMA
The OSS pass rewrites mulf/addf and mulf/subf patterns into cuda_tile.fma when rounding
modes and modifiers agree. Tileiras keeps the cuda_tile.fma operation, but not the pass that
searches for these patterns at the cuda_tile layer.
FMA formation is delegated to lower layers:
- tileas-legalize-fma-dot handles TileAS-level MMA accumulator contraction.
- -nvptx-fma-level controls scalar FMA formation after lowering to LLVM/NVPTX IR.
- -enable-fma-to-ffma2 covers the backend's F2 fused variant.
This is a semantic decision. Fusing (a * b) + c changes double-rounding into a single-rounded FMA,
so tileiras places the scalar decision under the same backend policy that nvcc --fmad controls.
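The rounding difference is easy to reproduce in a double-precision model. The sketch below is an illustration only: it emulates the fused result with exact rational arithmetic from Python's fractions module instead of calling a hardware FMA, and the operand values are chosen purely to land on a rounding tie.

```python
from fractions import Fraction


def mul_add_unfused(a: float, b: float, c: float) -> float:
    # Two roundings: the product is rounded, then the sum is rounded.
    return a * b + c


def mul_add_fused(a: float, b: float, c: float) -> float:
    # One rounding: compute a*b + c exactly, then round once to double.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative operands: the exact product is 1 - 2**-54, which is a tie
# between two doubles and rounds to 1.0 in the unfused form.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0
```

With these inputs the unfused form returns 0.0 while the fused form returns -2**-54, which is why fusion is treated as a policy decision rather than a free rewrite.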
LoopSplit
The OSS LoopSplit.cpp pass walks cuda_tile.for loops and splits a loop when an inner
cuda_tile.if predicate flips at a loop-invariant boundary. Tileiras does not ship that pass and
does not provide an equivalent TileIR or TileAS pass.
The nearest named relative is loop unrolling, not loop splitting. Schedule materialization can
decompose some guarded loop structure earlier in the pipeline, but it is not the same
predicate-based loop-split transform. A compatible clone should not add OSS LoopSplit unless it
is intentionally adding functionality beyond tileiras.
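For readers unfamiliar with the transform, here is a minimal Python model of predicate-based loop splitting. The function names and loop bodies are hypothetical; the sketch shows the general technique only, not the matching rules of the OSS pass.

```python
def guarded_loop(n: int, k: int, data: list) -> list:
    """Original shape: one loop with a predicate that flips once at k."""
    out = []
    for i in range(n):
        if i < k:  # k is loop-invariant, so the flip point is known up front
            out.append(data[i] * 2)
        else:
            out.append(data[i] + 1)
    return out


def split_loop(n: int, k: int, data: list) -> list:
    """Split shape: two predicate-free loops on either side of the flip."""
    out = []
    split = max(0, min(k, n))  # clamp the flip point into [0, n]
    for i in range(0, split):
        out.append(data[i] * 2)
    for i in range(split, n):
        out.append(data[i] + 1)
    return out
```

Both functions compute the same result; the split form simply removes the per-iteration branch, which is the functionality tileiras does not provide.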
Debug Scope Synthesis
The OSS SynthesizeDebugInfoScopes.cpp pass is replaced by upstream MLIR's LLVM function-scope
debug pass. The replacement pass is anchored on builtin.module, requires the LLVM dialect, emits
compile units with producer "MLIR", and supports the standard emission-kind enum:
| Value | Emission kind |
|---|---|
| 0 | None |
| 1 | Full |
| 2 | LineTablesOnly |
| 3 | DebugDirectivesOnly |
The important behavioral difference is where locations are attached. The OSS pass rewrites per-op
locations to DILocAttr. Tileiras leaves that work for the later ConvertDebugInfoToLLVM path,
which consumes debuginfo.value operations after LLVM-dialect lowering. The scope pass itself
walks LLVM functions and attaches function-level DISubprogramAttr information.
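A reimplementation that exposes the emission kind as an integer option could model the table above as an enum. The Python names below are illustrative and are not the pass's own identifiers.

```python
from enum import IntEnum


class EmissionKind(IntEnum):
    """Integer values follow the emission-kind table."""
    NONE = 0
    FULL = 1
    LINE_TABLES_ONLY = 2
    DEBUG_DIRECTIVES_ONLY = 3


def decode_emission_kind(value: int) -> EmissionKind:
    """Map an integer option value to an emission kind, rejecting others."""
    try:
        return EmissionKind(value)
    except ValueError:
        raise ValueError(f"unknown emission kind {value}") from None
```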
Delta Summary
| OSS transform | Tileiras behavior | Compatibility decision |
|---|---|---|
| FuseFMA.cpp | Not present as a cuda-tile pass; superseded by TileAS and NVPTX backend policy. | Do not register OSS fuse-fma in the tileiras-compatible pipeline. |
| LoopSplit.cpp | Not present; no equivalent TileIR/TileAS split pass. | Do not add a loop-split substitute for compatibility. |
| SynthesizeDebugInfoScopes.cpp | Replaced by upstream LLVM function debug-scope pass. | Use DIScopeForLLVMFuncOp and leave per-op location lowering downstream. |
Reimplementation Notes
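void configure_tileiras_transform_pipeline(Pipeline *pipeline, OptLevel opt_level) {
    /* FMA formation is delegated to TileAS and the NVPTX backend. */
    add_tileas_legalize_fma_dot(pipeline);
    set_nvptx_fma_level(pipeline, 2);
    /* Function-level debug scopes come from the upstream LLVM scope pass. */
    if (opt_level == OPT_O3) {
        add_di_scope_for_llvm_func_op(pipeline, DEBUG_DIRECTIVES_ONLY);
    } else {
        add_di_scope_for_llvm_func_op(pipeline, LINE_TABLES_ONLY);
    }
    /* Per-op location lowering happens downstream of the scope pass. */
    add_convert_debug_info_to_llvm(pipeline);
}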
The key omission is intentional: do not add cuda-tile FuseFMA, cuda-tile LoopSplit, or the
cuda-tile-specific debug-scope pass when targeting tileiras behavior.
Boundaries: tileiras vs cicc
Abstract
The tileiras and cicc binaries shipped inside CUDA Toolkit 13.x are siblings. They live in the same bin/ directory, are both invoked by nvcc, and both emit PTX that is handed to the same ptxas. What differs is the front edge of the pipeline: cicc accepts CUDA C++ source and rides an EDG-driven NVVM bridge into the NVPTX backend; tileiras accepts MLIR bytecode and rides a 53-pass MLIR pipeline driver into the same NVPTX backend. This page assumes the reader already knows cicc and documents what is shared, what is reinvented, and what cicc carries that tileiras jettisoned.
Premise
Tileiras and cicc are sibling tools in CUDA 13.1's device-compilation toolchain. They link the same NVIDIA-internal LLVM 21.0.0git fork, expose the same MC subsystem identity, and carry the same NVVM/NVPTX pass family names. Cicc 13.0 carries the same family one minor revision earlier; cicc 13.1 tracks tileiras's LLVM snapshot.
cicc is a CUDA-C++-to-PTX compiler. Its three major subsystems are an EDG 6.6 frontend, an NVVM bridge, and an LLVM NVPTX backend. Together they implement a full source-to-PTX flow with standalone and libNVVM-shaped dispatch. The compiler parses C++, lowers through EDG IL, emits the .int.c/.device.c/.stub.c split artifacts, optimizes through NVIDIA's NVVM pass family, and runs the NVPTX backend.
tileiras is an optimizing assembler in the literal MLIR sense: it consumes a serialized representation of an already-lowered tile program, finishes lowering to a hardware-near IR, and emits a deployable artifact. Input is MLIR bytecode — the on-disk encoding of a builtin.module containing a cuda_tile payload — not source. There is no C++ parser, no EDG frontend, no .int.c emission, no constexpr evaluator. Tileiras is also explicitly not a cudafe++ replacement: cudafe++ does C++ source-to-source rewrite (kernel-launch lowering, host/device split), while tileiras only consumes bytecode and emits a host ELF (elf.o by default).
Pass-by-pass overlap matrix
The clean way to read the shared surface is to split it into two layers.
Layer A — MLIR / IR-frontend. No equivalent in cicc. Cicc has no MLIR; its frontend is EDG 6.6 emitting C, then a hand-written EDG-IL-to-LLVM-IR translator inside the NVVM bridge.
Layer B — NVVM-IR / NVPTX-backend. Shared. Pass names, command-line keys, diagnostic strings, and pass-info constructor shapes match byte-for-byte across the two binaries.
| Layer | Subsystem | tileiras | cicc | Status |
|---|---|---|---|---|
| A | C++ parser | absent | EDG 6.6 frontend | cicc-only |
| A | constexpr evaluator | absent | EDG tree-walker | cicc-only |
| A | .int.c/.device.c/.stub.c triple | absent | EDG backend output | cicc-only |
| A | EDG IL → LLVM IR | absent | source-language IR generation | cicc-only |
| A | MLIR bytecode reader | present | absent | tileiras-only |
| A | 9-dialect cascade + dialect registration | present | absent | tileiras-only |
| A | TileAS pass family | present | absent | tileiras-only |
| A | MLIR PassManager constructor | 53-pass pipeline | absent | tileiras-only |
| A | MODSBuilder | cost-based modulo scheduler | absent | tileiras-only |
| A | TileIR pipeline driver | register, configure, run MLIR lowering | absent | tileiras-only |
| A | Pipeline option registrar | compact typed option table | broad cl::opt surface | different shape |
| A | OptiX IR generation | absent | --emit-optix-ir path | cicc-only |
| A | Wizard mode / fast-compile tier | absent | present | cicc-only |
| B | NVVMReflect family | present | present | shared |
| B | NVVM Peephole Optimizer | present | present | shared |
| B | BaseAddressStrengthReduce | present | present | shared |
| B | MemorySpaceOpt | present | present | shared |
| B | DeadSyncElim | present | present | shared |
| B | CommonBaseElim | present | present | shared |
| B | NVVMIRVerifier | present | present | shared |
| B | IPMSPPass | present | present | shared |
| B | NVPTXSetFunctionLinkagesPass | present | present | shared |
| B | SelectKernelsPass | present | present | shared |
| B | KernelInfoPrinter | present | present | shared |
| B | NVVMAA | present | present | shared |
| B | nvvm-reflect-pp | present | present | shared |
| B | NVPTX SelectionDAG | present | present | shared |
| B | NVPTX instruction printer | present | present | shared |
| B | PassBuilder::registerAllPasses | present | present | shared |
| B | libdevice bitcode | embedded once | embedded twice in the two cicc paths | shared content |
| B | ptxas subprocess | launched by tileiras | launched by the nvcc/cicc path | both shell out |
The pattern is simple: above NVVM-IR everything is rewritten; below NVVM-IR almost everything is shared.
Shared NVPTX backend evidence
When the cuda_tile MLIR module finishes its descent through the 9-dialect cascade and reaches the llvm/nvvm dialect, tileiras hands the resulting LLVM module to a NVPTX backend from the same NVIDIA-internal fork that cicc links. The pass roster, command-line keys, diagnostics, and analysis names line up across the two tools.
| Pass | Public key or surface | Role |
|---|---|---|
| NVVM Peephole Optimizer | nvvm-peephole-optimizer | Performs NVVM-specific instruction and intrinsic cleanups before codegen. |
| BaseAddressStrengthReduce | internal debug type | Rewrites address arithmetic into forms that are cheaper for NVPTX selection. |
| MemorySpaceOpt | -mllvm knob family | Normalizes memory-space casts and address-space information. |
| DeadSyncElim | -nvvm-dead-sync-elim | Removes synchronization operations proven unnecessary. |
| CommonBaseElim | SCEV-driven transform | Deduplicates related GEP/base-address computations. |
| NVVMIRVerifier | verifier diagnostics | Rejects invalid NVVM IR shapes before NVPTX lowering. |
| IPMSPPass | ipmsp | Interprocedural module-specialization support. |
| NVPTXSetFunctionLinkagesPass | check-kernel-functions | Sets and validates kernel linkage state. |
| SelectKernelsPass | select-kernels | Restricts compilation to selected kernel sets or ranges. |
| KernelInfoPrinter | kernel-info | Emits kernel metadata for downstream consumers. |
| NVVMAA | nvvm-aa | NVIDIA alias analysis for NVVM/NVPTX transforms. |
| NVVMReflect | nvvm-reflect, nvvm-reflect-pp | Resolves __nvvm_reflect queries from reflection metadata. |
Two CLI knob families confirm the shared backend contract at the user-visible layer. The nvvm-reflect- option family installs the same enable and key/value override behavior in both tools, and the kernel-selection family accepts the same kernel-list, kernel-range, IPMSP dump, and clone-control options.
Crucial scoping note: these passes are not invoked by tileiras's own MLIR PassManager. They run one level down, after tileiras's LLVM-dialect output is materialized as an llvm::Module and handed to the embedded NVPTX backend. The MLIR layer produces valid-shape NVVM-dialect IR; the LLVM layer applies the shared NVPTX pass family unchanged.
Tileiras-only inventions
Above the NVVM-IR boundary, tileiras introduces an MLIR-shaped front-end with no analogue in cicc. None of the following symbols, dialects, or pass mnemonics appear in the cicc binary.
| Subsystem | Description |
|---|---|
| MLIR bytecode reader | Project-private MLIR bytecode I/O with Tile versioning, frozen op/type/attribute tags, and cuda_tile schema support. |
| TileIR top-level driver | Compile-and-serialize path that registers dialects, registers pipeline options, and runs lowering. |
| 9-dialect cascade | cuda_tile → nv_tileaa → nv_tileas (+ cute, cute_nvgpu, cutlass) → nvgpu → nvvm → llvm. |
| MLIR-pipeline driver | Builds the mlir::PassManager for O0/O1/O2/O3; the tier is decoded from bytecode attributes such as "nvopt<O2>". |
| TileAS family | Removes dead args, resolves agent boundaries, schedules async work, materializes layouts, plans CTA mapping, and inserts OCG knobs. |
| MODSBuilder | Cost-based modulo scheduler used at O2 and O3 (inherited from O2) after schedule generation and after GPU-op conversion. |
| cute dialect | CuTe layout algebra: local tiling, partitioning, shape arithmetic, size/cosize, and divide helpers. |
| cute_nvgpu dialect | SM70-SM120 atoms for TMA, tensor memory, GMMA/UMMA descriptors, warp-uniform values, and WGMMA. |
| cutlass dialect | Pipeline acquire/commit/wait, tile-scheduler records, block-striped operations, and sequence barriers. |
| cuda_tile dialect | Public control, entry, tensor-view, atomic, selection, constant, and optimization-hint surface. |
| nv_tileaa / nv_tileas | Alias-aware typed-pointer/token/view layer plus assembler-near schedules, layouts, execution units, tiled loads/stores, and dot operations. |
| Pipeline option registrar | Compact typed table for integer, unsigned, boolean, enum, and string options. |
| nvdisasm -c shell-out | Optional SASS disassembly pass that appends a disassembly section to the emitted host object. |
Three pieces deserve a closer look. First, dialect registration has no analogue in cicc, which builds its IR directly in LLVM-IR shape. Second, the MLIR PassManager uses nested operation pass managers, function adapters, and the canonicalizer/CSE/SymbolDCE cleanup trio; cicc's pass manager is a conventional LLVM function/module pipeline. Third, the optimization tier comes from an attribute embedded in the TileIR bytecode, while cicc uses the conventional -O0/-O1/-O2/-O3 driver flag family.
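The third point can be illustrated with a small decoder for the printed form of the tier attribute. This Python sketch assumes the attribute text looks exactly like the "nvopt<O2>" example quoted above; the function name is hypothetical and the on-disk bytecode encoding of the attribute is not modelled.

```python
import re

# Assumed printed form of the tier attribute, e.g. "nvopt<O2>".
_NVOPT_RE = re.compile(r"nvopt<O([0-3])>")


def parse_opt_tier(attr_text: str) -> int:
    """Return the optimization tier (0..3) encoded in the attribute text."""
    match = _NVOPT_RE.fullmatch(attr_text.strip())
    if match is None:
        raise ValueError(f"not a recognised optimization tier: {attr_text!r}")
    return int(match.group(1))
```

A driver built this way needs no -O flag of its own, since the tier travels with the module.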
cicc-only baggage tileiras dropped
Cicc's bulk comes from features tileiras explicitly does not need. The following are visible in the cicc binary and entirely absent from tileiras.
| Dropped subsystem | cicc responsibility | Why tileiras drops it |
|---|---|---|
| EDG 6.6 frontend | C++ parsing, type checking, templates, constexpr, and CUDA source diagnostics. | input is MLIR bytecode, not C++ |
| .int.c / .device.c / .stub.c emission | EDG backend source splitting and host/device artifact generation. | emits host ELF directly |
| OptiX IR generation | Optional OptiX IR output stage. | no OptiX path |
| Wizard mode | cicc-internal experimental mode. | absent |
| Fast-compile tiers | Multiple compile-tier knobs. | only the TileIR optimization tier applies |
| NVVMPassOptions struct | Large shared knob block for the cicc NVVM pipeline. | consolidated into a compact typed option table |
| Dual Path A / Path B dispatch | Two frontend/IR-generation paths for standalone and libNVVM-shaped usage. | one bytecode-to-object path |
| Broad cl::opt registry | Large standalone compiler option surface. | small driver surface plus TileIR pipeline options |
| NVVM builtin resolution table | Source-level builtin name and overload resolution. | resolution happens upstream |
| constexpr evaluator | EDG tree-walking interpreter. | C++ template/constexpr evaluation happens upstream |
| C++ template cleanup | Synthesized source-language runtime cleanup. | no synthesized C++ runtime |
| -nvvm-version=nvvm-latest/nvvm70 switch | Path selector for older cicc modes. | absent |
| LibNVVM API entry points | Library-facing API surface. | not a libNVVM client |
Tileiras stays at 88 MB despite carrying a full MLIR runtime, a 9-dialect cascade, the CuTe/CUTLASS pipeline op surface, a cost-based modulo scheduler, and the TileAS pass family, because it leaves the 3.2 MB EDG frontend, the dual-path duplication, the 1,689-option registry, the 4 KB NVVMPassOptions struct, and the OptiX path behind. Cicc 13.0's 60 MB skews toward EDG and dual-path overhead; tileiras's 88 MB skews toward the MLIR/dialect surface and the TileAS family.
Architectural sketch (side-by-side)
cicc tileiras
──── ────────
CUDA C++ source (.cu / .ci / .i) MLIR bytecode (.ctir / .ctb)
│ │
▼ ▼
┌─────────────────────┐ ┌──────────────────────┐
│ EDG 6.6 frontend │ │ MLIR bytecode │
│ parser, constexpr │ │ reader │
│ parser, constexpr │ └──────────┬───────────┘
│ evaluator │ │
└──────────┬──────────┘ ▼
│ .int.c / .device.c / .stub.c ┌────────────────────────┐
▼ │ cuda_tile dialect │
┌─────────────────────┐ └──────────┬─────────────┘
│ IRGEN: EDG IL → │ ▼
│ LLVM IR translator │ ┌────────────────────────┐
│ standalone/libNVVM │ │ nv_tileaa dialect │
│ shaped paths │ └──────────┬─────────────┘
└──────────┬──────────┘ ▼
│ ┌────────────────────────┐
▼ │ nv_tileas dialect │
┌─────────────────────┐ │ + cute │
│ LNK + libdevice │ │ + cute_nvgpu │
│ (456 KB embedded) │ │ + cutlass │
└──────────┬──────────┘ │ TileAS 16 passes │
│ │ MODSBuilder │
▼ │ 53-pass MLIR pipeline │
┌─────────────────────┐ └──────────┬─────────────┘
│ OPT: NVVM passes │ ▼
│ 35 NVIDIA-custom + │ ┌────────────────────────┐
│ standard LLVM │ │ mlir::nvgpu │
│ NVVM pipeline │ └──────────┬─────────────┘
└──────────┬──────────┘ ▼
│ ┌────────────────────────┐
│ (no MLIR layer) │ nvvm dialect │
│ └──────────┬─────────────┘
│ ▼
│ ┌────────────────────────┐
│ │ llvm dialect │
│ └──────────┬─────────────┘
│ │
└───────────────────┬───────────────────────────┘
│
▼ (CONVERGENCE — same NVPTX backend)
┌────────────────────────────────────────────────────────┐
│ NVPTX backend (LLVM 21.0.0git internal fork) │
│ ─ nvvm-peephole-optimizer / BaseAddressStrengthReduce│
│ ─ MemorySpaceOpt / DeadSyncElim / CommonBaseElim │
│ ─ NVVMIRVerifier / IPMSP / NVVMAA │
│ ─ NVPTXSetFunctionLinkagesPass / SelectKernelsPass │
│ ─ KernelInfoPrinter / NVVMReflect / nvvm-reflect-pp │
│ ─ NVPTX SelectionDAG ISel / NVPTXInstPrinter │
└────────────────────────────┬───────────────────────────┘
│
▼
PTX text
│
▼
┌──────────────────────────┐
│ ptxas (subprocess) │
│ PTX → SASS │
└────────────┬─────────────┘
│
▼
cicc: .ptx tileiras: elf.o
(with optional
nvdisasm -c
SASS section)
The two pipelines converge at the moment the LLVM module is materialized for the NVPTX backend, and from that point forward they share the same code — passes, ISel, register allocation, scheduling, asm-printer.
Decision matrix: which compiler does nvcc run?
The two compilers see disjoint inputs, so the routing decision is structural rather than policy-driven. nvcc classifies each input artifact and dispatches once; neither compiler probes the input format the other expects.
| Input artifact | Debug mode | SM target | Compiler chosen | Why |
|---|---|---|---|---|
| .cu CUDA C++ source | release | any supported | cudafe++ → cicc | only cicc has a C++ frontend |
| .cu CUDA C++ source | -G device debug | any supported | cudafe++ → cicc at -O0 | only cicc accepts source-language debug info |
| Preprocessed .cpp1.ii / .cudafe1.cpp | any | any supported | cicc | EDG IL re-entry is a cicc-only path |
| .tileir / .ctir / .ctb bytecode | release | sm_100, sm_103, sm_110, sm_120, sm_121 | tileiras | only tileiras parses TileIR bytecode |
| .tileir bytecode | --device-debug requested | any supported | tileiras at -O0 | tileiras rejects -G above -O0 |
| .tileir bytecode | release | sm_70 .. sm_90a | (no valid path) | tileiras's GPU whitelist excludes pre-Blackwell SMs |
| .ptx precompiled | n/a | any | neither (ptxas only) | neither device compiler runs on PTX input |
| .cubin precompiled | n/a | any | neither (nvlink/fatbinary only) | both device compilers are upstream of cubin |
Three rows deserve commentary. The pre-Blackwell row is the hard constraint: tileiras's --gpu-name enum accepts only sm_100, sm_103, sm_110, sm_120, and sm_121, so a CUDA build targeting sm_80 or sm_90 cannot use the tileiras path even if the upstream MLIR emitter exists. The cicc path remains the only compile route for those targets. The debug row is a softer constraint: both compilers reject the combination of optimization above -O0 with full device debug, but the wording of the diagnostic and the downstream NVVM options differ. The bytecode rows depend on the upstream emitter — without a CUTLASS-on-MLIR, CuTe-DSL, or Triton-for-CUDA frontend in the build, no .tileir ever appears and the tileiras path stays unused.
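The routing table can be condensed into a small classifier. The Python sketch below models the table only, not nvcc's actual implementation; the function name and return strings are hypothetical, and the debug-mode column is left out for brevity.

```python
# Targets accepted by tileiras's --gpu-name enum, per the table.
TILEIRAS_TARGETS = {"sm_100", "sm_103", "sm_110", "sm_120", "sm_121"}
TILE_BYTECODE_EXTS = (".tileir", ".ctir", ".ctb")


def choose_device_compiler(path: str, sm: str) -> str:
    """Pick the tool chain for one input artifact by extension and target."""
    lower = path.lower()
    if lower.endswith(TILE_BYTECODE_EXTS):
        if sm not in TILEIRAS_TARGETS:
            # Pre-Blackwell targets have no valid bytecode route.
            raise ValueError(f"no tileiras path for {sm}")
        return "tileiras"
    if lower.endswith((".cpp1.ii", ".cudafe1.cpp")):
        return "cicc"
    if lower.endswith(".cu"):
        return "cudafe++ -> cicc"
    if lower.endswith(".ptx"):
        return "ptxas only"
    if lower.endswith(".cubin"):
        return "nvlink/fatbinary only"
    raise ValueError(f"unrecognised input artifact: {path}")
```

The only policy-like check is the target whitelist; everything else follows from the input format.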
Capability split
The clean rule is that tileiras and cicc consume disjoint inputs. CUDA C++ source, with all of its template-instantiation, constexpr-evaluation, lambda-capture, and host/device-split machinery, is cicc's territory; TileIR bytecode, with its already-resolved tile-program structure expressed in the cuda_tile dialect family, is tileiras's territory. Neither tool has a backdoor that consumes the other's input.
What they share is the NVPTX backend below the LLVM-dialect/NVVM-IR handoff. Both compilers materialise an llvm::Module and hand it to the same NVPTX backend from the same LLVM 21 fork. Below that handoff, the two compilers are byte-for-byte equivalent: same SelectionDAG, same NVVM custom passes, same instruction printer, same libdevice payload. Above the handoff they share almost nothing.
The capability split has a practical consequence for emitters and integrators. Upstream tooling that wants the convenience of CUDA C++ source — including templates, constexpr, lambdas, and the standard CUDA runtime API — must target cicc through cudafe++. Upstream tooling that wants the precision of a tile-shaped program, hand-managed pipelines, explicit CTA mapping, and the cuda_tile/cute/cutlass op surfaces must target tileiras through TileIR bytecode. There is no overlap; the question of "which compiler should this kernel use" reduces to "which input format is the emitter willing to produce".
Migration trajectory
cicc is the longer-standing compiler and the only path that accepts CUDA C++ source. tileiras is the newer compiler, introduced in CUDA 13.1, that accepts bytecode produced by MLIR-rooted frontends. The two are sibling tools in the same toolkit, not staged replacements.
Three signals shape the trajectory. First, the shared NVPTX backend means new SM targets, new MMA shapes, and new fence semantics arrive in both compilers simultaneously through the LLVM fork. Neither compiler is locked to a particular hardware generation. Second, the tileiras-specific dialect cascade (cuda_tile, nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass) carries operations that have no analogue in cicc's LLVM-IR-only input; those operations encode tile-program structure that source-level CUDA cannot express directly. Third, cicc still ships in CUDA 13.1 and, as noted in the Premise, tracks the same LLVM fork snapshot that tileiras links; both tools follow upstream NVPTX changes through the same vendor backport pipeline.
A reimplementation does not have to choose between the two tools. The honest model is "two device-code compilers, one shared backend": dispatch by input format, share the backend by linking the same NVPTX library, and treat the dialect cascade and the EDG frontend as independent front-ends that meet at the LLVM-module level.
Cross-link recommendations
Everything tileiras inherits unchanged from the LLVM 21 fork is documented in the cicc wiki, and those pages are reusable verbatim for the tileiras NVPTX backend.
- NVPTX backend internals: see cicc pipeline/codegen.md and pipeline/emission.md. Same SelectionDAG, same NVPTXTargetLowering, same 19 MMA shapes x 11 data types, and same instruction-printer surface.
- NVVMReflect mechanism: see cicc reflect docs. Same __nvvm_reflect / __nvvm_reflect_ocl rewrite, same nvvm.reflection module-flag table, same nvvm-reflect-add parser.
- libdevice: same ~456 KB bitcode payload. Tileiras embeds it once (no Path A / Path B duplication).
- NVVM Peephole / BaseAddressStrengthReduce: same pre-codegen cleanup and address-strength-reduction roles.
- MemorySpaceOpt: same address-space normalization and memory-space cleanup behavior.
- DeadSyncElim: same synchronization-elimination pass.
- NVVMIRVerifier: same verifier role before backend lowering.
- IPMSP / SelectKernels / KernelInfo / NVPTXSetFunctionLinkages / NVVMAA / nvvm-reflect-pp: same backend registration family.
For everything above the NVVM-IR boundary, the cicc wiki has nothing to offer; refer to the tileiras-internal pages: cuda_tile Overview, cute Overview, cute_nvgpu Overview, cutlass Overview, nv_tileaa Overview, nv_tileas Overview, the TileAS Pass Families series, Full Pass List by Opt Level, Modulo Scheduler and Rau, CLI Options, and MLIR Bytecode Format. The intent behind the cicc-vs-tileiras split — why an MLIR substrate at all, why a four-stage cascade, why a Rau scheduler — is documented in Architecture Evolution and Design Decisions.
Reimplementation Notes
Model the two tools as two different producers for the same downstream backend shape:
cicc:
input: CUDA C++ source or preprocessed CUDA source
frontend: EDG and NVVM bridge
handoff: LLVM/NVVM module
backend: shared NVPTX backend
output: PTX for ptxas
tileiras:
input: TileIR MLIR bytecode
frontend: MLIR dialect cascade and TileAS passes
handoff: LLVM/NVVM module
backend: shared NVPTX backend
output: host object that carries ptxas output
This split is the key design constraint. Above the LLVM/NVVM handoff, reuse between the two tools is mostly conceptual. Below that handoff, the pass names, reflection behavior, libdevice payload, and PTX emission semantics should be treated as one shared backend contract.
Handoff Protocol: tileiras → ptxas
Abstract
Tileiras finishes its MLIR-to-PTX lowering inside its own address space and then shells out to a separate ptxas binary to obtain a cubin. The boundary is text-only: PTX leaves tileiras as an ASCII string passed inline on the child's command line, ptxas writes the assembled cubin bytes to stdout, and tileiras reads them back through the parent end of the pipe set up by its subprocess harness. No shared memory, no temporary file for the PTX, no IPC beyond argv plus stdout. A separate knob file (path supplied through the environment) carries scheduling and codegen hints that tileiras itself never inspects. This page reconstructs that boundary from the binary.
Subprocess argv
The argv vector is assembled by the PTX serialization path and handed to a subprocess wrapper that uses the platform process-launch primitives. The launcher itself is architecture-agnostic; the GPU target appears only in the argv strings assembled at the call site.
The final argv shape, in order, is:
ptxas
[ <module-attribute "ptxas-options" tokens> ]
-arch sm_<NN>
--opt-level <N>
--input-as-string <PTX text>
[ ...basePTXOptions tokens... ]
    = " --knobs-file=<PTX_KNOBS_PATH>"
    = " --nv-host=\"<host-code-temp-path>\""
    = " <basePTXOptions string-attr value>"
| Flag | Origin | Role |
|---|---|---|
| ptxas | fixed argv program name | Tool name; resolved through $PATH by the spawn helper. |
| -arch sm_<NN> | module GPU compute-capability attribute | Target architecture string. NN is decimal, for example sm_100, sm_103, sm_120, or sm_121. |
| --opt-level <N> | module optimization-level attribute | ptxas optimization level, accepted as a small decimal value. |
| --input-as-string <PTX> | PTX serializer output | Inlines the PTX program as a single argv token rather than reading a file. |
| --knobs-file=<path> | $PTX_KNOBS_PATH when $MLIR_ENABLE_EVO is set | Hands ptxas a path to the scheduling-knob file. Tileiras performs no path validation. |
| --nv-host="<path>" | host-code serialization path | Points ptxas at a temporary host-code blob. Quotes and backslashes in the path are escaped before the token is wrapped in double quotes. |
The --input-as-string choice ties the PTX size to the Linux kernel's per-argument limit, MAX_ARG_STRLEN (32 pages, or 131 072 bytes with 4 KiB pages). A GPU kernel whose PTX exceeds that limit would need a fallback to --input-file=<temp.ptx>; the current binary does not implement one.
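The argv layout and the size constraint can be sketched together. This Python model is an approximation under stated assumptions: the function names are hypothetical, each flag and its value are assumed to be separate argv tokens, the host path is treated as optional, and the escaping order (backslashes first, then quotes) is a guess at what "escaped before wrapping" means.

```python
import os

# Linux per-argument limit with 4 KiB pages (32 pages).
MAX_ARG_STRLEN = 32 * 4096


def quote_host_path(path: str) -> str:
    """Escape backslashes and quotes, then wrap the path in double quotes."""
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_ptxas_argv(sm, opt_level, ptx, module_options=(), env=None,
                     host_path=None):
    """Assemble the ptxas argv in the order shown above (hypothetical)."""
    env = os.environ if env is None else env
    # The kernel counts the terminating NUL against the per-token limit.
    if len(ptx.encode("ascii")) + 1 > MAX_ARG_STRLEN:
        raise ValueError("PTX text does not fit in a single argv token")
    argv = ["ptxas", *module_options,
            "-arch", f"sm_{sm}",
            "--opt-level", str(opt_level),
            "--input-as-string", ptx]
    # The knobs file is forwarded only when both variables are present.
    if "MLIR_ENABLE_EVO" in env and "PTX_KNOBS_PATH" in env:
        argv.append(f"--knobs-file={env['PTX_KNOBS_PATH']}")
    if host_path is not None:
        argv.append(f"--nv-host={quote_host_path(host_path)}")
    return argv
```

Checking the size before spawning turns an opaque E2BIG failure from the launcher into a diagnosable error at the call site.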
PTX text protocol
Tileiras emits PTX as ASCII text, not LLVM bitcode and not NVVM IR. LLVM bitcode, NVVM IR text, and PTX-only output modes all stop before ptxas; only the cubin-producing mode reaches the subprocess launcher. By the time argv is built, the PTX has already passed through the full NVPTX backend pipeline inside tileiras's process. ptxas sees a finished PTX program, not an intermediate.
Subprocess construction
The argv vector flows into the generic POSIX launcher documented in Subprocess Harness. Three decisions are tileiras-specific:
- Program path resolution. The first argv token is the literal string "ptxas". The launcher resolves it through the inherited PATH; there is no in-binary table of fallback paths and no hard-coded toolkit prefix. A reimplementation must keep the CUDA bin/ directory on PATH or supply an absolute path through a wrapper.
- Spawn primitive. ptxas is invoked through the posix_spawn fast path because neither setsid nor process resource limits are requested. The harness only falls back to fork+exec for callers that need those facilities, which the ptxas adapter does not.
- Stdio plumbing. stdin is closed; stdout is piped into a parent-side accumulator that captures the cubin bytes; stderr targets the same accumulator object so the launcher applies the dup2(stdout, stderr) merge optimisation described in the subprocess-harness page. The result is one in-memory buffer that carries both the assembled cubin and any ptxas diagnostic text.
The stderr merge is a deliberate consequence of how tileiras consumes ptxas output. ptxas writes the cubin as a binary blob to stdout and writes any diagnostic text to stderr; when the compile succeeds, stderr is empty (or limited to informational notes such as register-spill summaries) and the captured buffer holds only the cubin. When the compile fails, ptxas writes a textual diagnostic to stderr and stdout stays empty; the merged buffer is then pure ASCII text, which tileiras surfaces through its diagnostic callback verbatim.
There is no in-binary --quiet-ptxas or similar suppression switch. Stderr forwarding is unconditional, and the only way to filter ptxas chatter is at the harness boundary on the parent side. Reimplementations that want a quiet mode should attach a custom diagnostic callback that inspects the captured buffer before forwarding.
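The stdio arrangement described above can be approximated with Python's standard subprocess module. This is a sketch under stated assumptions, not the harness itself: `subprocess.DEVNULL` stands in for a closed stdin, `stderr=subprocess.STDOUT` plays the role of the dup2 merge, and the child is the Python interpreter so the example runs without ptxas installed. The helper name is hypothetical.

```python
import subprocess
import sys


def run_with_merged_output(argv, timeout_s=60.0):
    """Run argv with no usable stdin and stderr merged into the stdout pipe."""
    proc = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,   # stand-in for a closed stdin
        stdout=subprocess.PIPE,     # parent-side accumulator
        stderr=subprocess.STDOUT,   # same pipe as stdout, like dup2(stdout, stderr)
        timeout=timeout_s,          # wall-clock timeout; raises TimeoutExpired
        check=False,                # the caller interprets the exit code
    )
    return proc.returncode, proc.stdout


# Stand-in child that writes to both streams.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write('out'); sys.stdout.flush(); sys.stderr.write('err')",
]
code, merged = run_with_merged_output(child)
print(code, merged)
```

Because both streams land in one buffer, any quiet mode has to be implemented by inspecting that buffer on the parent side, as the paragraph above notes.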
Cubin returned via stdout
There is no -o <out.cubin> flag in the argv. Instead, the subprocess harness plumbs ptxas's stdout into a parent-side buffer and stores the captured bytes as the cubin payload. No temporary cubin file is named on the parent side for ptxas's output. Stderr is merged into the same buffer through the harness's dup2 optimisation, so a successful compile yields a clean cubin and a failed compile yields a diagnostic string distinguishable by inspecting the leading bytes for the ELF magic.
The harness enforces a wall-clock timeout. On expiry the child is killed, and the diagnostic "Child timed out" or "Child timed out but wouldn't die" is surfaced through the same stderr pipe. Abnormal exits decode into either "Program could not be executed" or a signal-name string with an optional " (core dumped)" suffix.
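Distinguishing a cubin from a diagnostic by its leading bytes can be sketched as follows. The only fact relied on is the standard ELF magic (`0x7F 'E' 'L' 'F'`); the function name and the returned tuple shape are illustrative assumptions.

```python
ELF_MAGIC = b"\x7fELF"  # first four bytes of any ELF file, including a cubin


def classify_ptxas_buffer(buf: bytes):
    """Split the merged stdout/stderr buffer into a (kind, payload) pair."""
    if buf.startswith(ELF_MAGIC):
        # Binary payload: keep the bytes untouched.
        return ("cubin", buf)
    # Anything else is treated as diagnostic text and decoded leniently.
    return ("diagnostic", buf.decode("ascii", errors="replace"))


print(classify_ptxas_buffer(b"\x7fELF\x02\x01\x01")[0])
print(classify_ptxas_buffer(b"example diagnostic text")[0])
```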
Exit-code interpretation
The harness decodes the wait4 status word through the POSIX rules documented in Subprocess Harness. tileiras interprets the resulting exit code as follows:
| ptxas exit | Decoded by harness as | tileiras driver response |
|---|---|---|
| 0 | normal success | use captured stdout as the cubin payload, append it to the host ELF |
| 1..125 | ptxas internal failure (PTX rejected, knob-file error, codegen abort) | bubble the captured stderr through the diagnostic callback; the outer compile returns code 5 |
| 126 | program found but could not be executed (permission denied, ENOEXEC) | surface "Program could not be executed" and return code 5 |
| 127 | program not found on PATH | same diagnostic shape as 126; the more usual root cause is a missing toolkit bin/ on PATH |
| any signal | signal-name string emitted; optional " (core dumped)" suffix | return code 5; tileiras does not retry |
| timeout | "Child timed out" or "Child timed out but wouldn't die" | return code 5; the harness has already sent SIGKILL and reaped the child |
tileiras does no automatic retry on a non-zero ptxas exit and treats the captured stderr as opaque text. Knob-file diagnostics, register-spill rejections, mismatched-architecture errors, and PTX-parse failures all collapse into the same return path: code 5 from tileirasProgramCompile, the verbatim ptxas stderr forwarded through the diagnostic callback, no partial output on disk.
A reimplementation should preserve two invariants. First, never strip the ptxas stderr before surfacing it; users rely on the verbatim text to diagnose PTX-level issues. Second, never collapse 126/127 into a "ptxas crashed" message — the shell-style codes are diagnostic on their own and point to deployment issues (missing binary, wrong PATH) rather than compiler bugs.
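The table and the two invariants can be condensed into a small classifier. This is a sketch assuming the Python subprocess convention in which death by signal is reported as a negative return code; the function name and the returned strings are hypothetical summaries of the table rows, not text emitted by tileiras.

```python
import signal


def describe_ptxas_exit(returncode: int) -> str:
    """Map a subprocess-style return code onto the response classes above."""
    if returncode == 0:
        return "success: use captured stdout as the cubin payload"
    if returncode < 0:
        # subprocess reports termination by signal N as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}: return code 5, no retry"
    if returncode in (126, 127):
        # Kept separate from ordinary failures: these point at deployment issues.
        return "Program could not be executed: return code 5"
    if 1 <= returncode <= 125:
        return "ptxas failure: forward stderr verbatim, return code 5"
    return "unclassified exit code"


for rc in (0, 1, 126, 127, -signal.SIGKILL):
    print(rc, "->", describe_ptxas_exit(rc))
```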
Knob-file format
Tileiras only writes the path; ptxas does the parsing. The receiver-side file format is:
<arbitrary preamble bytes>
[knobs]
<command-stream>
The literal [knobs] is mandatory and case-sensitive; everything before it is preamble and silently discarded. After the header, commands separate on whitespace, the ~ byte, or the ;; sequence. A command is either an INJECTSTRING <body> ;;, a WHEN=<clause> directive, or a regular key=value or bare-key assignment. Identifiers are case-insensitive.
| Knob (representative) | Value type | Effect |
|---|---|---|
| DUMPIR=AllocateRegisters | string/identifier | Dumps the IR after the named pass (debug aid). |
| EmitLDCU | bool/int | SM90+ only; controls whether ptxas may emit ldcu instructions. Requires -forcetext plus -sso out.sass. |
| IgnorePotentialMixedSizeProblems | bool | Suppresses one class of mixed-width verifier errors. |
| WHEN=SH=<clause> | when-list (type 9) | Conditional predicate gate that scopes the next assignment. |
| INJECTSTRING <text> ;; | raw bytes | Splices a SASS template into the output stream. |
| any int knob (...=N) | INT32 / UINT32 | Decimal only; 0x prefixes silently parse as 0. |
| any range knob (...=N..M) | INT32_RANGE | Either side may be omitted (sentinels INT_MIN/INT_MAX). |
| any list knob (...=N1,N2,N3) | INT32_LIST | Comma-separated decimals; trailing commas reject with "End of integer range value is not ',' or null character". |
| any float knob (...=1.5e-3) | FLOAT32 / FLOAT64 | Whatever libc `sscanf("%f"
Malformed knob files terminate the compile with a fatal diagnostic — "Knobs header not found in %s", "Invalid knob identifier", "Invalid knob specified (%s)", "Invalid knob type" — emitted to stderr and surfaced via the harness.
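For tooling that wants to lint a knob file before handing its path to ptxas, the framing rules above (mandatory header, discarded preamble, three separators, INJECTSTRING terminated by `;;`) can be sketched as a tokenizer. This is an approximation built only from the description on this page; it does not validate knob names or value types, and the function name is hypothetical.

```python
from typing import List

HEADER = "[knobs]"  # mandatory, case-sensitive


def split_knob_commands(text: str, path: str = "<knobs>") -> List[str]:
    """Return the commands that follow the [knobs] header."""
    pos = text.find(HEADER)
    if pos < 0:
        raise ValueError("Knobs header not found in %s" % path)
    stream = text[pos + len(HEADER):]  # everything before the header is preamble
    commands = []
    i, n = 0, len(stream)

    def is_separator(k: int) -> bool:
        return stream[k].isspace() or stream[k] == "~" or stream.startswith(";;", k)

    while i < n:
        if is_separator(i):
            i += 2 if stream.startswith(";;", i) else 1
            continue
        j = i
        while j < n and not is_separator(j):
            j += 1
        token = stream[i:j]
        if token.upper() == "INJECTSTRING":
            # The body runs up to the next ';;' and may contain whitespace.
            end = stream.find(";;", j)
            if end < 0:
                end = n
            commands.append(token + " " + stream[j:end].strip())
            i = end + 2
        else:
            commands.append(token)
            i = j
    return commands


sample = "junk\n[knobs]\nDUMPIR=AllocateRegisters~EmitLDCU ;; INJECTSTRING some text here ;; WHEN=SH=x"
print(split_knob_commands(sample))
```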
Scheduling boundary invariant
Tileiras schedules MLIR ops on its own internal Blackwell pipeline model with fifteen reservation slots; ptxas independently schedules SASS using its own latency tables and dual-issue rules. The two scheduling layers do not share any explicit constraint vocabulary across the boundary. The PTX text carries instruction order plus a small set of declarative directives (.maxntid, .reqntid, .minnctapersm, .pragma "nounroll", ...); none of it expresses tileiras's slot map. ptxas is free to reorder within the bounds PTX semantics permit, but only ever adds stalls relative to the order tileiras committed to — it never reorders past PTX-level dependences, and tileiras has already committed to whatever in-instruction parallelism it chose. The practical consequence is that any scheduling intent tileiras wants enforced has to survive PTX-text serialization either as instruction order or as a knob-file / directive hint; anything else is lost at the boundary.
Producer-side bug flagged
Tileiras can in principle emit both .maxntid and .reqntid directives on a single entry function because its directive emission paths are independent. ptxas rejects that combination during final entry-function validation. The relevant rule is ".maxntid and .reqntid are mutually exclusive", alongside the related constraints that .maxnctapersm/.minnctapersm require launch-bounds metadata, .reqntid plus .reqnctapercluster requires .blocksareclusters, and .reqnctapercluster conflicts with .maxclusterrank.
For reimplementation, the safest rule is to normalize launch-bound metadata before PTX printing. Pick either .maxntid or .reqntid, emit the dependent cluster directives only when their prerequisite is present, and surface ptxas stderr verbatim when the receiver rejects the final PTX.
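The normalization step might look like the sketch below. It covers only three of the rules quoted above, and it makes policy choices the source does not prescribe: it keeps .reqntid when both thread-count directives are present, treats "launch-bounds metadata" as meaning .maxntid or .reqntid, and drops .maxclusterrank when it conflicts with .reqnctapercluster. The .blocksareclusters rule is not modelled. The function name and the dict representation are hypothetical.

```python
def normalize_launch_bounds(directives: dict) -> dict:
    """Return a copy of `directives` with mutually exclusive entries resolved."""
    out = dict(directives)

    # .maxntid and .reqntid are mutually exclusive.
    # Assumption: prefer the stricter .reqntid.
    if ".maxntid" in out and ".reqntid" in out:
        del out[".maxntid"]

    # Assumption: "launch-bounds metadata" means one of the ntid directives.
    has_bounds = ".maxntid" in out or ".reqntid" in out
    if not has_bounds:
        out.pop(".maxnctapersm", None)
        out.pop(".minnctapersm", None)

    # .reqnctapercluster conflicts with .maxclusterrank.
    # Assumption: keep the explicit cluster shape.
    if ".reqnctapercluster" in out and ".maxclusterrank" in out:
        del out[".maxclusterrank"]

    return out


print(normalize_launch_bounds({".maxntid": 256, ".reqntid": 128, ".minnctapersm": 2}))
```

Running the pass just before PTX printing keeps the independent directive emitters unchanged while guaranteeing ptxas never sees the rejected combinations.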
tileiras vs cudafe++ (Non-Relationship)
Abstract
A common misconception about CUDA Toolkit 13.1 is that the new tileiras binary is a successor or replacement for cudafe++. It is not. The two tools share a parent driver (nvcc), a vendor (NVIDIA), and a problem domain (CUDA), but their inputs, outputs, internal architectures, and roles in the build graph have zero overlap. This page documents that non-relationship explicitly.
What tileiras does NOT have
The cleanest way to state the boundary is to enumerate, point by point, each cudafe++ subsystem that is absent from tileiras. The absence is architectural, not just cosmetic: tileiras starts from serialized MLIR bytecode, so every source-language responsibility that belongs to cudafe++ has already happened upstream or does not apply.
- No EDG frontend. cudafe++ is built around the Edison Design Group C++ Front End v6.6: lexer, parser, type system, template instantiation engine, overload resolver, and constexpr interpreter. tileiras has none of that machinery. Its bulk comes from the MLIR runtime, TileIR dialect libraries, and the LLVM 21 NVPTX backend.
- No C++ parser. tileiras has no recursive-descent C++ parser, token kind table, operator-precedence engine, or Itanium ABI name mangler. Its inbound surface is the MLIR bytecode reader, which decodes a serialized builtin.module whose ops, types, and attributes have already been resolved upstream. tileiras enters at bytecode, not source text.
- No .int.c emission. cudafe++ is a C++ source-to-source translator; one of its jobs is writing the transformed host-side .int.c output. tileiras emits no C source. Its terminal output is a host ELF object, with PTX as the intermediate textual artifact handed to ptxas.
- No host stubs. cudafe++ generates __wrapper__device_stub_<kernel>() host-side forwarding functions, the .nvHRKI/.nvHRDE/.nvHRCE ELF host-reference arrays, the __cudaRegisterFatBinary/__cudaRegisterFunction registration table, and the CRC32-derived module ID. tileiras is device-only. No kernel-launch lowering, no host-side stub synthesis, no fat-binary registration boilerplate.
- No lambda machinery. cudafe++ injects template wrappers (__nv_dl_wrapper_t, __nv_hdl_wrapper_t, __nv_hdl_create_wrapper_t) to carry extended __device__ and __host__ __device__ lambdas across the host/device boundary, driven by 1024-bit capture-count bitmasks. tileiras has no concept of a lambda or a capture. Whatever upstream tool produces the bytecode has already lowered any C++ lambda away by the time tileiras sees it.
- No template instantiation. cudafe++ runs a full C++ template instantiation worklist with deduction, partial specialization, SFINAE, and constexpr evaluation. tileiras has no template engine: no instantiation queue, no template parameter binding table, no constexpr tree-walker. Template specialization is a source-language concept that does not exist in MLIR bytecode.
Why people might confuse them
The confusion is structural rather than semantic. Both binaries live in the same bin/ directory of a CUDA Toolkit installation. Both are stripped, statically linked NVIDIA-internal ELF binaries. Both are invoked transparently by nvcc. Both bear the word "CUDA" in their public framing. Both deal with device-side work. None of those surface similarities reflect any internal overlap. The two tools operate at completely different levels of the pipeline — cudafe++ in the source-translation layer, tileiras at the device-IR-to-PTX layer — and never see each other's outputs.
What cudafe++ actually does
cudafe++ is the CUDA C++ source-to-source translator. It accepts a .cu translation unit, runs the EDG 6.6 C++ frontend over it, separates device code from host code via execution-space attributes (__device__, __host__, __global__), and produces two outputs: an EDG IL stream consumed by cicc, and a transformed .int.c file consumed by the system C++ compiler (gcc, clang, or MSVC). cudafe++ is not a compiler in the conventional sense — it never emits PTX, never emits cubin, and never emits machine code. It is a frontend that splits a CUDA translation unit and hands the two tracks to different downstream tools.
Redirect
This wiki documents tileiras only. For cudafe++ documentation — its EDG frontend internals, the 5-pass IL finalization, the 85-entry-kind IL graph, the .int.c emission format, the CUDA execution-space bitfield, lambda wrapper template injection, the 276-flag CLI surface, and the 3,795-entry diagnostic table — see the separate cudafe++ wiki at nvopen-tools/cudafe++/wiki/.
Boundary table
The four NVIDIA device-toolchain binaries, their inputs, outputs, and roles:
| tool | input | output | role |
|---|---|---|---|
| cudafe++ | .cu source (CUDA C++) | .int.c (transformed C/C++ host source) + EDG IL stream | C++ source-to-source translator; host/device split |
| cicc | .cu / .i / EDG IL | PTX text | CUDA-to-PTX compiler (EDG 6.6 + NVVM bridge + LLVM NVPTX backend) |
| tileiras | MLIR bytecode (cuda_tile dialect) | host ELF (elf.o) wrapping the ptxas-produced cubin (and an optional SASS disassembly section) | MLIR-to-PTX optimizing assembler (53-pass MLIR pipeline + shared NVPTX backend) |
| ptxas | PTX text | SASS / cubin | PTX-to-SASS assembler |
cudafe++ is the gate at the source boundary; cicc is the conventional source-language compile path; tileiras is the optimizing-assembler path for tile-shaped kernels expressed in MLIR; ptxas is the final SASS encoder. tileiras and cudafe++ sit at opposite ends of this chain and never interact.
Reimplementation Notes
Do not model tileiras as a cudafe++ mode. A clean driver should keep the responsibilities separate:
cudafe++:
  input: CUDA C++ source
  work: split host and device code, lower launches, emit host-side transformed source
  output: host-side source plus device-side compiler input
tileiras:
  input: TileIR MLIR bytecode
  work: verify bytecode schema, run MLIR/NVVM/NVPTX lowering, invoke ptxas
  output: host ELF object carrying the generated device code
The only shared orchestration point is nvcc, which chooses which downstream compiler to run. The tools themselves should remain independent in any faithful reconstruction.
Position in nvcc 13.1 Toolchain
Abstract
CUDA 13.1 is the first toolkit release in which nvcc ships with two parallel device-code compilers in bin/. The legacy compiler cicc handles CUDA C++ source via the EDG 6.6 frontend and the NVVM bridge. A second compiler, tileiras (88 MB, build tag release 13.1, V13.1.80, Build local.local.36836380_), handles a new MLIR-bytecode input format that did not exist in any prior CUDA release. Both compilers link the same NVIDIA-internal LLVM 21.0.0git fork, share the same NVPTX backend, and emit PTX consumed by the same ptxas. What distinguishes them is the front edge of the pipeline: source language, IR shape, and dialect surface. This page locates tileiras inside the nvcc 13.1 toolchain, contrasts the two device-code paths end to end, and identifies which upstream MLIR DSLs can plausibly emit the bytecode tileiras consumes.
Path A: cicc legacy (CUDA C++ source)
The classical CUDA device-compilation pipeline is unchanged from prior toolkits:
.cu source
|
v
cudafe++ (EDG frontend, host/device split, kernel-launch lowering)
|
v
.int.c / .device.c / .stub.c (transformed C with CUDA extensions stripped)
|
v
cicc (C/EDG-IL -> NVVM IR -> NVPTX backend -> PTX text)
|
v
PTX text
|
v
ptxas (PTX -> SASS)
|
v
cubin (or fatbin section, embedded by fatbinary/nvlink/nvcc)
Inside cicc, EDG parses CUDA C++, evaluates constexpr expressions, and produces the split artifacts that the rest of the classic CUDA pipeline expects. The NVVM bridge translates the device side into LLVM IR, runs the NVIDIA NVVM pass family, and hands the module to the NVPTX backend. The observable compiler product at this stage is PTX text.
Path B: tileiras new (MLIR bytecode)
The MLIR-rooted pipeline is structurally distinct above the LLVM IR layer:
MLIR DSL frontend (CUTLASS-on-MLIR, custom DSL, etc.)
|
v
.mlir-bc (MLIR bytecode containing a builtin.module with a cuda_tile payload)
|
v
tileiras (MLIR -> 9-dialect cascade -> NVVM dialect -> llvm dialect -> NVPTX backend -> PTX text -> elf.o)
|
v
PTX text (materialized internally; ptxas is invoked as a subprocess)
|
v
ptxas (PTX -> SASS, embedded in elf.o)
|
v
elf.o (host ELF relocatable carrying the SASS payload)
Inside tileiras, the MLIR bytecode reader parses the input into a builtin.module. The driver registers the cuda_tile target, loads the nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass, nvgpu, nvvm, and llvm dialect families, and builds a 53-pass MLIR pipeline that lowers the module to the LLVM dialect. Below the NVVM-IR boundary the same NVPTX backend used by cicc produces PTX. The driver then invokes ptxas, embeds the resulting SASS into a host ELF object, and writes the result to --output-file (default elf.o).
Driver invocation: how nvcc chooses which compiler
Selection visible in the tileiras driver is input-format-driven. The command line accepts one positional argument named "<tile bytecode file>", and the public creation path expects one byte buffer containing valid TileIR bytecode. A null buffer returns error code 2 with the diagnostic "null inputBuffer provided, expected valid bytecode buffer". A malformed buffer returns error code 3 with "failed to parse IR bytecode" or "input does not correspond to Tile IR bytecode". If the byte stream appears to be ordinary upstream MLIR bytecode rather than TileIR bytecode, the diagnostic appends " (it looks like MLIR bytecode instead)".
There is no C++ parsing path in tileiras: no EDG frontend, no .int.c emission, no CUDA C frontend, and no source-level kernel-launch lowering. The driver contract starts after source-language analysis has already happened.
The nvcc driver therefore routes work between the two compilers based on the input artifact rather than a runtime flag inside either tool. .cu translation units flow through cudafe++ and into cicc; serialized TileIR bytecode flows directly into tileiras. No flag inside tileiras toggles between the two paths. A reimplementation of the nvcc driver layer should classify the input artifact before dispatch and should reject ambiguous bytecode early with the same diagnostics users see from tileiras.
Invocation triggers
nvcc does not branch on a user-facing --use-tile-ir switch. The driver's choice is observable on the receiving end: tileiras requires one positional argument that begins with the TileIR bytecode magic, so the only way nvcc legitimately reaches the tileiras binary is to have a bytecode buffer in hand. Three concrete triggers explain where that buffer comes from:
- An MLIR-emitting frontend has run before nvcc receives the file. A CUTLASS-on-MLIR pipeline, a CuTe-DSL JIT, or a Triton-for-CUDA backend writes a .tileir/.ctir/.ctb file with the 7f 54 69 6c 65 49 52 00 magic header and a cuda_tile payload. nvcc recognises the extension or the magic and routes the file to tileiras without invoking cudafe++ or cicc.
- An ahead-of-time tooling step has produced the bytecode. A library-level build (CUTLASS profiler, custom tile-program library) emits the bytecode artifact at install time; nvcc consumes it during the final assembly phase the same way it would consume a precompiled .ptx or .cubin.
- An integrator drives tileiras directly. The tool accepts the bytecode path as its sole positional argument and writes the host ELF relocatable to --output-file. No nvcc wrapper is involved; the integrator owns process spawning, environment setup, and result handling. The tileirasProgram* C API is the in-process analogue of this path.
Nothing inside the tileiras binary changes between the three cases. The bytecode magic check, the GPU whitelist, the optimization-level validator, and the dialect cascade are identical regardless of caller.
Argv shape that reaches tileiras
The driver-facing argv schema is fixed and small. A representative invocation that an nvcc dispatcher (or a reimplementation) constructs for a Blackwell datacenter target at -O2 with line info looks like:
tileiras \
--gpu-name=sm_100 \
--opt-level=2 \
--lineinfo \
--host-arch=x86_64 \
--host-os=linux \
--output-file=/tmp/build/kernel.tileir.o \
/tmp/build/kernel.tileir
Every token maps to one of the validated driver options catalogued in Driver CLI Options. The positional argument is the single input file; there is no support for multiple bytecode buffers in one invocation, no @response-file expansion, and no environment-fed argv extension. A reimplementation of the nvcc dispatcher should construct argv per call rather than building a long-lived tileiras subprocess.
| Argv token | Origin in nvcc | Mandatory? |
|---|---|---|
| --gpu-name=sm_<NN> | nvcc's -arch=sm_<NN> parameter, validated against the SM whitelist before dispatch | yes |
| --opt-level=<0..3> | nvcc's -O<0..3> parameter, defaulting to -O3 if unset | yes (defaulted) |
| --lineinfo | nvcc's -lineinfo switch | conditional |
| --device-debug | nvcc's -G switch; rejected unless -O0 | conditional |
| --sanitize=memcheck | nvcc's -Xcompiler-sanitize=memcheck analogue | conditional |
| --host-arch=<…> | nvcc's host-architecture detection or -target shadow | yes (defaulted) |
| --host-os=<…> | nvcc's host-OS detection | yes (defaulted) |
| --output-file=<path> | nvcc-derived temporary path that nvcc later links into the fatbin | yes |
| <bytecode path> | The TileIR bytecode that triggered the dispatch | yes |
The driver rejects unrecognised tokens during command-line parsing, so an nvcc reimplementation must not splat its full argv into the tileiras call. Only the schema above is accepted.
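A dispatcher can build this argv from a fixed schema instead of forwarding its own options. The sketch below uses only flags listed in the table and the accepted driver targets from the At a Glance table; the function name, parameter names, and the pre-dispatch checks are illustrative assumptions about how a reimplementation might structure it.

```python
ACCEPTED_GPUS = ("sm_100", "sm_103", "sm_110", "sm_120", "sm_121")


def build_tileiras_argv(bytecode_path, output_path, gpu="sm_100", opt_level=3,
                        lineinfo=False, device_debug=False,
                        host_arch="x86_64", host_os="linux"):
    """Assemble one tileiras command line from the documented schema."""
    if gpu not in ACCEPTED_GPUS:
        raise ValueError(f"unsupported GPU target: {gpu}")
    if opt_level not in (0, 1, 2, 3):
        raise ValueError(f"unsupported optimization level: {opt_level}")
    if device_debug and opt_level != 0:
        # The table states --device-debug is rejected unless -O0.
        raise ValueError("--device-debug requires --opt-level=0")

    argv = ["tileiras", f"--gpu-name={gpu}", f"--opt-level={opt_level}"]
    if lineinfo:
        argv.append("--lineinfo")
    if device_debug:
        argv.append("--device-debug")
    argv += [
        f"--host-arch={host_arch}",
        f"--host-os={host_os}",
        f"--output-file={output_path}",
        bytecode_path,  # the single positional argument goes last
    ]
    return argv


print(build_tileiras_argv("/tmp/build/kernel.tileir",
                          "/tmp/build/kernel.tileir.o",
                          opt_level=2, lineinfo=True))
```

Constructing the list per call matches the recommendation above to avoid a long-lived tileiras subprocess.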
Environment inheritance
tileiras inherits the full process environment of its parent because the subprocess harness sets no envp override at spawn time. Three families of variables matter for an nvcc-orchestrated build:
- Toolkit discovery. CUDA_ROOT, CUDA_HOME, CUDA_PATH resolve the install root used to locate libdevice and nvdisasm. The driver-side resolver falls back to a /proc/self/exe walk; the NVVM-side resolver does not. The hazard is described in Driver Env Vars and Runtime Gates; an nvcc wrapper should export CUDA_ROOT explicitly rather than relying on the executable-path fallback.
- PATH for downstream subprocesses. tileiras spawns ptxas and nvdisasm by basename; both resolve through the inherited PATH. An nvcc orchestrator must keep the CUDA bin/ directory on PATH for the inherited environment, otherwise ptxas will fail with exit code 127 and the diagnostic "Program could not be executed".
- Tileiras-specific gates. MLIR_ENABLE_EVO, PTX_KNOBS_PATH, TILEIR_*, TILE_AS_DEBUG_*. The full table lives in Env Var and Runtime Gate Catalog. nvcc forwards them verbatim because it never strips environment variables before spawning a tool subprocess.
What nvcc does not forward is anything specific to its own option surface. The driver does not understand -Xcicc, -Xptxas, -Xcompiler analogues; their bodies are not threaded into the tileiras argv. An nvcc reimplementation that wants to pass per-tool tuning to tileiras must translate the option into one of the validated flags listed above or into a PTX_KNOBS_PATH file consumed by ptxas downstream.
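An orchestrator can prepare the inherited environment explicitly before spawning tileiras. The sketch below sets CUDA_ROOT and puts the toolkit bin/ directory on PATH while leaving every other variable untouched, so tileiras-specific gates pass through. The helper name is hypothetical and `/opt/cuda` is a placeholder install root, not a documented default.

```python
import os


def tileiras_env(cuda_root, base_env=None):
    """Return a copy of the environment suitable for spawning tileiras."""
    env = dict(os.environ if base_env is None else base_env)

    # Export the install root explicitly instead of relying on fallbacks.
    env["CUDA_ROOT"] = cuda_root

    # Keep the toolkit bin/ directory on PATH so ptxas and nvdisasm
    # resolve by basename.
    bin_dir = os.path.join(cuda_root, "bin")
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in parts:
        parts.insert(0, bin_dir)
    env["PATH"] = os.pathsep.join(parts)
    return env


env = tileiras_env("/opt/cuda", {"PATH": "/usr/bin", "MLIR_ENABLE_EVO": "1"})
print(env["PATH"], env["CUDA_ROOT"], env["MLIR_ENABLE_EVO"])
```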
Fallback behaviour on failure
There is no automatic fallback from tileiras to cicc inside the tileiras process. A failed compile returns one of the five public error codes catalogued in Driver Program Handle with a verbatim diagnostic on stderr; partial output never lands on disk. Exit-code semantics from the nvcc perspective:
| tileiras exit | Meaning | nvcc-side handling that makes sense |
|---|---|---|
| 0 | success; --output-file exists | proceed to nvlink/fatbinary |
| 1 | allocation failure | fatal; nvcc should report and abort the build |
| 2 | configuration rejected (bad GPU, opt-level, debug/opt combo) | fatal; the upstream emitter chose an unsupported target tuple |
| 3 | bytecode parse failure (including MLIR fall-through hint) | fatal; the upstream emitter produced incompatible bytecode |
| 4 | null handle / not-compiled (only reachable through the C API) | not visible from the CLI |
| 5 | compile failure inside the MLIR pipeline | fatal; surface tileiras stderr verbatim |
The cicc path is not a contingency for tileiras failures. cicc accepts CUDA C++ source, not TileIR bytecode; the two compilers see disjoint inputs and cannot substitute for each other. An nvcc driver that wanted source-level retry would have to re-run the upstream emitter, which is outside the toolkit. The conservative orchestration is therefore: dispatch once, propagate the exit code, leave retry policy to the user.
The --output-file invariant is worth restating because nvcc relies on it: tileiras either writes the full host relocatable object atomically or writes nothing at all. nvcc can safely treat the presence of the output path as proof of success without secondary checks.
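The dispatch-once policy can be sketched as a lookup over the exit codes in the table. The mapping text paraphrases the table rows; the function name, the use of RuntimeError, and the message format are assumptions made for illustration.

```python
TILEIRAS_EXIT_MEANING = {
    0: "success",
    1: "allocation failure",
    2: "configuration rejected",
    3: "bytecode parse failure",
    4: "null handle / not compiled (C API only)",
    5: "compile failure inside the MLIR pipeline",
}


def handle_tileiras_exit(code: int, stderr_text: str = "") -> str:
    """Return on success; raise with the verbatim stderr on any failure."""
    meaning = TILEIRAS_EXIT_MEANING.get(code, "unknown exit code")
    if code == 0:
        return meaning
    # No retry and no fallback to cicc: propagate the failure as-is.
    raise RuntimeError(f"tileiras failed ({code}: {meaning})\n{stderr_text}")


print(handle_tileiras_exit(0))
```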
Shared downstream: ptxas
Both pipelines converge at ptxas. The PTX text from cicc and from tileiras is produced by the same NVPTX backend, the same SelectionDAG instruction selector, and the same NVIDIA NVVM pass roster: NVVMReflect, NVVMPeepholeOptimizer, BaseAddressStrengthReduce, MemorySpaceOpt, DeadSyncElim, CommonBaseElim, NVVMIRVerifier, IPMSPPass, NVPTXSetFunctionLinkagesPass, SelectKernelsPass, KernelInfoPrinter, and NVVMAA. From ptxas's perspective, the upstream identity of the PTX is invisible. PTX-to-SASS-to-cubin assembly is the same regardless of which compiler emitted the PTX.
Host code path is unrelated
Neither cicc nor tileiras handles host code. The host translation unit is preprocessed by nvcc, split by cudafe++, and handed to the system C++ compiler. tileiras accepts --host-arch (x86_64, aarch64, arm64ec) and --host-os (linux, windows) only because its output is a host ELF relocatable object: those flags select the host triple of the wrapper ELF, not a host compiler. Host-side C++ compilation is orchestrated by nvcc and is independent of which device-code compiler is in use; both paths emit artifacts the host linker later combines with the host object file.
MLIR DSL frontends that emit tileiras-bound .mlir-bc
Tileiras's input is a serialized MLIR module whose top-level dialect is cuda_tile. Its dialect cascade covers cuda_tile, nv_tileaa, nv_tileas, cute, cute_nvgpu, and cutlass. This dialect surface tells the story of which upstream producers are intended to feed tileiras:
- CUTLASS-on-MLIR is the most direct match. The cutlass dialect carries pipeline.{acquire, tail, commit, wait}, tile_scheduler.work_tile_info, block_striped.{reduce, load, store}, and seq_bar: the exact pipeline-orchestration vocabulary CUTLASS uses for collective mainloops, persistent kernels, and stream-K schedulers.
- CuTe-DSL frontends. The cute dialect (~50 ops: cute.local_tile, cute.local_partition, cute.tile_to_shape, cute.add_offset, cute.size, cute.cosize, divide family) implements the CuTe layout algebra at MLIR-IR level. Any DSL that produces tile-by-tile descriptions of GPU work in CuTe terms can target this dialect.
- Triton-for-CUDA-on-MLIR. A Triton backend that targets the cuda_tile dialect (instead of, or in addition to, the existing triton-gpu lowering) would produce input tileiras accepts. The cuda_tile.{if, select, xori, constant, atomic_cas_tko, entry, for, make_tensor_view, optimization_hints} surface is general enough to host SPMD-tile programs.
- Custom DSLs and JIT pipelines. The bytecode contract is open: any caller that constructs a builtin.module with a cuda_tile payload, a valid "nvopt<O0>"/"<O1>"/"<O2>"/"<O3>" tier attribute, and dialect references confined to the registered cascade can serialize and feed tileiras. Schema versions 13.1/13.2 are recognized.
These producers are upstream of tileiras and outside the nvcc toolkit's bin/ directory. The integration point is the bytecode file: the producer writes it; nvcc dispatches to tileiras; the rest of the build proceeds identically to a cicc-emitted artifact.
The producer-side contract — kernel-signature rules, the tt.* attribute namespace, operand-order conventions per op family, the AttrTag wire-format divergence from upstream MLIR, and the common emission mistakes a frontend must avoid — is documented in Frontend Contract and Tile IR Emission.
Side-by-side architectural diagram
Path A: cicc legacy                             Path B: tileiras new
-------------------                             --------------------
.cu source                                      MLIR DSL frontend (CUTLASS-on-MLIR /
    |                                             CuTe DSL / Triton / custom)
    v                                               |
cudafe++ (EDG frontend, host/device split,          v
          kernel-launch lowering)               .mlir-bc (cuda_tile bytecode)
    |                                               |
    v                                               v
.int.c / .device.c / .stub.c                    tileiras
    |                                               |
    v                                           MLIR bytecode reader
cicc                                                |
  - EDG IL -> LLVM IR translator                    v
  - NVVM bridge (~4 MB)                         cuda_tile dialect
  - 35 NVIDIA-custom NVVM passes                    |
                                                    v
                                                nv_tileaa / nv_tileas / cute /
                                                cute_nvgpu / cutlass dialects
                                                  + 16-pass TileAS family
                                                  + MODSBuilder modulo scheduler
                                                  + 53-pass mlir::PassManager
                                                    (53-pass pipeline)
                                                    |
                                                    v
                                                nvgpu dialect
                                                    |
                                                    v
                                                nvvm dialect
                                                    |
                                                    v
                                                llvm dialect
    |                                               |
    +---------------------+-------------------------+
                          |  CONVERGENCE: same NVPTX backend (LLVM 21.0.0git fork)
                          v
          ----------------------------------
                    NVPTX backend
            - NVVMReflect / nvvm-reflect-pp
            - NVVMPeepholeOptimizer
            - BaseAddressStrengthReduce
            - MemorySpaceOpt / DeadSyncElim / CommonBaseElim
            - NVVMIRVerifier / IPMSPPass / NVVMAA
            - NVPTXSetFunctionLinkagesPass / SelectKernelsPass
            - KernelInfoPrinter
            - NVPTX SelectionDAG ISel
            - NVPTX instruction printer
          ----------------------------------
                          |
                          v
                       PTX text
                          |
                          v
                 ptxas (PTX -> SASS)
                          |
                          v
       ----------------------------------------
       cicc path:           tileiras path:
       cubin / .ptx         elf.o (host ELF
                            wrapping SASS, with
                            optional nvdisasm -c
                            disassembly section)
The diagram mirrors the architectural reality: the two pipelines diverge above the LLVM IR layer and converge at the NVPTX backend.
Reimplementation Notes
For a driver reimplementation, treat tileiras as a separate device-code compiler selected by artifact type:
if input.kind == "cuda-cpp-source":
    run cudafe++ to split host and device work
    run cicc on the device-side artifact
elif input.kind == "tileir-bytecode":
    run tileiras on the bytecode buffer
else:
    reject the input before invoking either compiler
The important invariant is that the choice happens before either compiler starts. Once PTX has been produced, the downstream assembly path no longer needs to know whether the source was CUDA C++ or TileIR bytecode.
Toolchain Integration
Abstract
The other pages in this section describe individual handoffs: tileiras versus cicc, tileiras versus cudafe++, the ptxas subprocess protocol, the place tileiras occupies in nvcc 13.1. This page joins those handoffs into a single end-to-end story for build engineers and integrators. It catalogues file formats at every stage, documents how the subprocess control flow nests, traces environment-variable inheritance from nvcc down to ptxas, and reconstructs three worked invocations so the reader can map their own build against the toolchain.
The goal is operational. A reimplementation of the nvcc dispatcher, an MLIR-emitting frontend that needs to feed tileiras directly, or a build system that wants to invoke tileiras as part of a custom packaging pipeline should be able to read this page and produce a correct invocation with no further reverse engineering.
Position in the CUDA toolchain
        ┌─────────────────────────────────┐
        │ Source-level inputs (any form)  │
        └──────────────┬──────────────────┘
                       │
        ┌──────────────┴───────────────┐
        │                              │
.cu CUDA C++ source         MLIR-emitting frontend
        │                 (CUTLASS-on-MLIR, CuTe-DSL,
        │                  Triton-for-CUDA, custom)
        │                              │
        ▼                              ▼
  ┌───────────┐             ┌──────────────────────┐
  │ cudafe++  │             │ TileIR bytecode      │
  │ (host /   │             │ (.tileir / .ctir /   │
  │ device    │             │ .ctb;                │
  │ split)    │             │ magic 7F 54 69 6C    │
  └─────┬─────┘             │ 65 49 52 00)         │
        │                   └──────────┬───────────┘
        │ host code                    │
        ▼                              ▼
system C++ compiler         ┌──────────────────────┐
 (gcc/clang/MSVC)           │ tileiras             │
        │                   │ (53-pass MLIR        │
        │                   │ pipeline, NVPTX      │
        │ device code       │ backend, ptxas       │
        ▼                   │ subprocess)          │
  ┌───────────┐             └──────────┬───────────┘
  │ cicc      │                        │
  │ (EDG +    │ PTX text               │ PTX text emitted in-process
  │ NVVM)     │ (.ptx)                 │
  └─────┬─────┘                        │
        │ PTX text                     │
        ▼                              ▼
┌─────────────────────────────────────────────────┐
│ ptxas (PTX → SASS, embedded in cubin)           │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
              ┌──────────────────┐
              │ cubin / SASS     │
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │ nvlink           │
              │ (multi-cubin     │
              │ resolution)      │
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │ fatbinary +      │
              │ host linker      │
              └────────┬─────────┘
                       │
                       ▼
                 final binary
cicc and tileiras are sibling device-code compilers. They share the NVPTX backend below the LLVM-dialect handoff but accept disjoint inputs and never see each other's outputs. The convergence point is ptxas: both compilers hand PTX text to the same ptxas binary, and the rest of the build (cubin assembly, nvlink resolution, fatbinary embedding, host linking) is indistinguishable.
File formats at each handoff
| Stage | Input | Output | Format reference |
|---|---|---|---|
| frontend → tileiras | TileIR MLIR bytecode | (none; tileiras receives) | MLIR Bytecode Format |
| cudafe++ → cicc | CUDA C++ source + EDG IL | (none; cicc receives) | EDG IL — see cudafe++ wiki |
| cicc → ptxas | (none; cicc produces) | PTX text (.ptx) | PTX ISA reference manual |
| tileiras → ptxas | (none; tileiras produces) | PTX text (passed via --input-as-string) | PTX ISA reference manual; ptxas Handoff Protocol |
| ptxas → cubin | PTX text | ELF cubin with .text.<kernel> SASS sections | CUDA Binary Utilities documentation |
| tileiras → nvlink/host linker | (none; tileiras produces) | Host ELF relocatable wrapping the cubin payload | Driver main() Entry; ELF specification |
| nvlink → fatbinary | Multiple cubins per arch | Multi-arch fatbin section | CUDA documentation |
Three format details matter operationally. First, TileIR bytecode begins with the 8-byte magic 7F 54 69 6C 65 49 52 00 ("\x7fTileIR\0"), distinguishing it from upstream MLIR bytecode whose magic is ML\xefR. The tileiras driver's parse failure on a non-TileIR input appends the hint " (it looks like MLIR bytecode instead)" to the error message, documented in Driver Program Handle. Second, tileiras passes PTX to ptxas inline via --input-as-string, not through a temporary file; this bounds the maximum kernel PTX size to the platform MAX_ARG_STRLEN. Third, tileiras's terminal output is a host ELF relocatable object, not a raw cubin — the cubin produced by ptxas is embedded in the ELF along with an optional .nvdisasm SASS-text section.
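The magic-byte distinction above can be sketched as a small classifier. This is an illustration only: the function name and return labels are hypothetical, and the two magic values are the ones quoted in the paragraph above.

```python
# Magic prefixes as quoted in the text above.
TILEIR_MAGIC = bytes([0x7F, 0x54, 0x69, 0x6C, 0x65, 0x49, 0x52, 0x00])  # "\x7fTileIR\0"
UPSTREAM_MLIR_MAGIC = b"ML\xefR"


def classify_bytecode(prefix: bytes) -> str:
    """Classify the leading bytes of a file (hypothetical helper).

    Returns "tileir", "mlir" (upstream MLIR bytecode), or "unknown".
    """
    if prefix.startswith(TILEIR_MAGIC):
        return "tileir"
    if prefix.startswith(UPSTREAM_MLIR_MAGIC):
        return "mlir"
    return "unknown"
```

A dispatcher would read the first eight bytes of the input and route only the "tileir" case to tileiras; the "mlir" case corresponds to the input that triggers the "looks like MLIR bytecode" hint.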
Subprocess control flow
The harness is two-level. Top level: nvcc (or an integrator) spawns tileiras. Bottom level: tileiras spawns ptxas and optionally nvdisasm.
nvcc (parent)
│
│ posix_spawn(tileiras, argv, envp, file_actions)
│ wait4() with optional alarm-based timeout
│
└── tileiras (child of nvcc, parent of ptxas)
    │
    │ posix_spawn(ptxas, argv_with_PTX_inline, envp_inherited, file_actions)
    │ wait4() with timeout enforced via SIGALRM
    │ stdout + stderr merged via dup2 into one accumulator
    │
    └── ptxas (child of tileiras)
        │
        │ writes assembled cubin to stdout
        │ writes diagnostics to stderr (merged at parent)
        │ exits with shell-style status code
        │
        └── (no further children for the PTX-to-SASS stage)
Both levels use posix_spawn as the fast path and fall back to fork+exec only when the caller requests setsid or process resource limits — see Subprocess Harness for the launcher contract. Timeouts ride on SIGALRM; the parent installs a temporary handler, arms alarm(seconds), calls wait4, and on EINTR sends SIGKILL to the child before reaping.
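The alarm-then-wait4 sequence can be approximated in Python as follows. This is a sketch under assumptions, not the harness itself: the function and exception names are hypothetical, the handler raises an exception because Python otherwise retries wait4 transparently after a signal, and the code must run in the main thread on a POSIX system.

```python
import os
import signal
import sys


class WaitTimeout(Exception):
    """Raised by the SIGALRM handler to break out of wait4()."""


def _on_alarm(signum, frame):
    raise WaitTimeout()


def spawn_and_wait(argv, timeout_s):
    """Spawn argv[0] and wait at most timeout_s whole seconds.

    Returns (wait_status, timed_out). On timeout the child is sent
    SIGKILL and then reaped, mirroring the sequence described above.
    """
    pid = os.posix_spawn(argv[0], argv, dict(os.environ))
    previous = signal.signal(signal.SIGALRM, _on_alarm)  # temporary handler
    timed_out = False
    try:
        signal.alarm(timeout_s)
        try:
            _pid, status, _rusage = os.wait4(pid, 0)
        except WaitTimeout:
            timed_out = True
            os.kill(pid, signal.SIGKILL)
            _pid, status, _rusage = os.wait4(pid, 0)  # reap the killed child
    finally:
        signal.alarm(0)  # disarm any pending alarm
        if previous is not None:
            signal.signal(signal.SIGALRM, previous)
    return status, timed_out


if __name__ == "__main__":
    status, timed_out = spawn_and_wait([sys.executable, "-c", "pass"], 10)
    print(timed_out, os.WEXITSTATUS(status))
```

The sketch leaves a narrow window in which the alarm can fire after wait4 has already returned; a production launcher would need to close it.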
The control-flow model has one important property: tileiras's process lifetime brackets ptxas's. If nvcc kills tileiras, the active ptxas child is orphaned and reparented to PID 1 with no further cleanup. An nvcc orchestrator that wants reliable cancellation should kill the entire process group rather than the tileiras leader alone; the easiest path is to spawn tileiras with setsid so the harness can killpg the resulting session.
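A minimal sketch of the process-group approach, written from the orchestrator's side. The helper names are hypothetical, and the sketch assumes that the grandchild (ptxas) stays in the process group it inherits, which would not hold if it were itself spawned with setsid.

```python
import os
import signal
import subprocess
import sys


def spawn_in_new_session(argv):
    """Start a child as the leader of a fresh session and process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so its process-group ID equals its PID.
    return subprocess.Popen(argv, start_new_session=True)


def cancel_group(proc):
    """SIGKILL every process in the child's group, then reap the leader."""
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group has already exited
    return proc.wait()


if __name__ == "__main__":
    child = spawn_in_new_session([sys.executable, "-c", "import time; time.sleep(30)"])
    print(cancel_group(child))
```

Killing the group rather than the leader is what keeps a running grandchild from being orphaned.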
Environment-variable inheritance
The subprocess harness sets no explicit envp override at spawn time, so tileiras inherits the full nvcc environment, and ptxas inherits the full tileiras environment in turn. The chain is therefore:
nvcc environment
└── tileiras environment (inherited)
    └── ptxas environment (inherited)
Variables that tileiras itself consumes are catalogued in Env Var and Runtime Gate Catalog. The high-impact subset for toolchain integration is:
- Toolkit discovery: CUDA_ROOT, CUDA_HOME, CUDA_PATH. Two resolvers inside tileiras walk this chain; one falls back to /proc/self/exe, the other does not. The hazard is documented in Driver Env Vars and Runtime Gates; production builds should export CUDA_ROOT explicitly.
- Subprocess discovery: PATH. tileiras spawns ptxas and nvdisasm by basename; both need the CUDA bin/ directory on PATH.
- ptxas knob forwarding: MLIR_ENABLE_EVO and PTX_KNOBS_PATH. AND-gated; setting only one is silently ignored. When both are set, tileiras appends --knobs-file=<path> to the ptxas argv. The knob-file grammar belongs to ptxas; see ptxas Handoff Protocol.
- TMA and swizzle policy: TILEIR_DELAY_TMA_STORE_WAIT, TILEIR_PREFER_TMA_FOR_LOAD_STORE, TILEIR_ALWAYS_SWIZZLE. Pass-internal gates that affect codegen choices.
- Debug: TILEIR_DEBUG_DUMP_BC, TILEIR_DEBUG_DUMP_LLVM, TILE_AS_DEBUG_UNLIMITED_SMEM, TILE_AS_DEBUG_VERBOSE. Diagnostic switches; presence-only or string-equality against "1" depending on the variable.
ptxas reads its own environment variables (notably PTXAS_KNOBS_DEFAULTS) that tileiras does not interpret. An nvcc orchestrator must keep ptxas-specific variables in the parent environment for inheritance to work; tileiras does not synthesise them.
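The AND-gate on the two knob-forwarding variables can be illustrated with a short function. The function name is hypothetical, and the sketch assumes that a variable counts as set when it is present and non-empty; the exact test tileiras applies to MLIR_ENABLE_EVO is not stated here.

```python
def ptxas_knob_args(env):
    """Return the extra ptxas argv implied by the two-variable gate.

    env is a mapping such as os.environ. Assumption: "set" means
    present and non-empty.
    """
    enabled = env.get("MLIR_ENABLE_EVO")
    knobs_path = env.get("PTX_KNOBS_PATH")
    if enabled and knobs_path:
        return [f"--knobs-file={knobs_path}"]
    # Only one variable set (or neither): silently contribute nothing.
    return []
```

An integrator can run the same check before launching tileiras to detect the half-configured case that would otherwise be ignored without a diagnostic.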
Error propagation
Errors travel from the innermost child back to the outermost parent. Each level transforms the failure differently:
- ptxas → tileiras. ptxas exits with a non-zero status and writes a diagnostic to stderr. tileiras's harness captures both the exit code and the merged stdout/stderr buffer. The diagnostic is forwarded verbatim through the driver's diagnostic callback; the tileiras driver returns exit code 5 (compile failure) from tileirasProgramCompile. tileiras does not retry, does not rewrite the diagnostic, and does not produce a partial output file. The exit-code table is in ptxas Handoff Protocol.
- tileiras → nvcc. tileiras exits with one of the five public error codes from Driver Program Handle. nvcc observes the exit code and the on-disk presence (or absence) of the --output-file path. A successful tileiras invocation leaves a complete relocatable object on disk; a failed invocation leaves nothing. nvcc cannot retry by falling back to cicc, because the two compilers consume disjoint inputs.
- nvcc → user. nvcc translates the tileiras exit code into one of its own driver-level messages and exits the build. The verbatim tileiras stderr (which itself may contain verbatim ptxas stderr) is preserved through the chain so the user can diagnose PTX-level issues.
The conservative rule for a reimplementation is to surface the deepest diagnostic without rewriting it. ptxas knows the most about why PTX was rejected; rewriting its message into "tileiras compile failed" or "nvcc subprocess failure" loses information that the user needs. The harness's merge-stderr-into-stdout optimisation makes verbatim forwarding cheap because the diagnostic arrives as one contiguous buffer.
Worked invocations
Release build of a CUTLASS-on-MLIR kernel
The upstream frontend has emitted kernel.tileir containing a cuda_tile payload targeted at sm_100. The user runs:
nvcc --gpu-architecture=sm_100 kernel.tileir -O2 -o app
The nvcc dispatcher classifies the input as TileIR bytecode (magic check) and constructs the tileiras invocation:
tileiras \
--gpu-name=sm_100 \
--opt-level=2 \
--host-arch=x86_64 \
--host-os=linux \
--output-file=/tmp/nvcc-12345/kernel.tileir.o \
/tmp/nvcc-12345/kernel.tileir
tileiras parses the bytecode, runs the 53-pass MLIR pipeline, emits PTX text into a heap buffer, and spawns ptxas as:
ptxas \
-arch sm_100 \
--opt-level 2 \
--input-as-string '<PTX text inline>'
ptxas writes the assembled cubin to stdout. tileiras captures the bytes, embeds them in a host ELF relocatable along with an optional .nvdisasm-produced SASS section, and writes /tmp/nvcc-12345/kernel.tileir.o. nvcc picks up the file, links it with the host translation units through the host linker, and produces app.
Debug build with line info
The user runs:
nvcc -G -lineinfo --gpu-architecture=sm_100 kernel.tileir -o app
nvcc translates -G to --device-debug and adds --lineinfo. The validator in tileiras rejects --device-debug unless --opt-level=0, so nvcc must dispatch:
tileiras \
--gpu-name=sm_100 \
--opt-level=0 \
--device-debug \
--lineinfo \
--host-arch=x86_64 \
--host-os=linux \
--output-file=/tmp/nvcc-67890/kernel.tileir.o \
/tmp/nvcc-67890/kernel.tileir
The downstream ptxas call is constructed at the same --opt-level 0, which suppresses most code transformations in ptxas. Debug-info preservation in the cubin is handled by the lowering pipeline; tileiras emits the appropriate nvvm.* debug attributes and PTX .dwarf directives during pipeline execution, not at the ptxas argv layer.
A user who omits the -O0 part of the combination (for example by mixing -G with a project-wide -O3 default) triggers tileiras's validator with the diagnostic "optimized debugging is not supported, change optimization level to 0 or disable full debug info" and exit code 2. nvcc sees the exit code, surfaces the message, and aborts.
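The validator rule can be mirrored on the caller's side so that a bad combination is caught before tileiras is spawned. This is a sketch: the function name is hypothetical, and only the diagnostic text and exit code 2 come from the description above.

```python
OPTIMIZED_DEBUG_MSG = (
    "optimized debugging is not supported, change optimization "
    "level to 0 or disable full debug info"
)


def validate_debug_options(opt_level, device_debug):
    """Return (exit_code, message); (0, None) means the combination is accepted.

    Mirrors the rule that --device-debug requires --opt-level=0.
    """
    if device_debug and opt_level != 0:
        return 2, OPTIMIZED_DEBUG_MSG
    return 0, None
```

Because --opt-level defaults to 3, passing --device-debug without an explicit --opt-level=0 falls into the rejected case.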
Direct integrator invocation
An integrator building a custom packaging pipeline bypasses nvcc and drives tileiras directly:
MLIR_ENABLE_EVO=1 \
PTX_KNOBS_PATH=/etc/myproject/ptxas-knobs.cfg \
CUDA_ROOT=/opt/cuda-13.1 \
tileiras \
--gpu-name=sm_103 \
--opt-level=3 \
--output-file=build/kernel.o \
src/kernel.tileir
The two environment variables are AND-gated; both are required to forward --knobs-file=/etc/myproject/ptxas-knobs.cfg to ptxas. CUDA_ROOT is exported explicitly because the integrator does not want to rely on the /proc/self/exe fallback for libdevice resolution. The integrator owns process lifecycle, exit-code interpretation, and downstream linking; tileiras is treated as a one-shot transform with one input file and one output file.
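A build script can assemble the same invocation programmatically. The sketch below only constructs the argv list and environment; it does not run anything. The function name and its parameters are hypothetical, while the option spellings and variable names are those of the invocation above.

```python
import os


def build_tileiras_command(input_path, output_path, gpu_name="sm_100",
                           opt_level=3, knobs_file=None, cuda_root=None,
                           base_env=None):
    """Return (argv, env) for a direct tileiras invocation (illustrative)."""
    argv = [
        "tileiras",
        f"--gpu-name={gpu_name}",
        f"--opt-level={opt_level}",
        f"--output-file={output_path}",
        input_path,
    ]
    env = dict(os.environ if base_env is None else base_env)
    if knobs_file is not None:
        # Both variables are needed for the knob file to be forwarded.
        env["MLIR_ENABLE_EVO"] = "1"
        env["PTX_KNOBS_PATH"] = knobs_file
    if cuda_root is not None:
        env["CUDA_ROOT"] = cuda_root
    return argv, env
```

The result can be handed to subprocess.run(argv, env=env); the caller then owns exit-code interpretation and must check that the output file exists.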
Cross-references
- Position in nvcc 13.1 Toolchain covers the dispatch decision and argv contract from nvcc's side.
- ptxas Handoff Protocol covers the tileiras-to-ptxas subprocess in depth, including the knob-file format.
- tileiras vs cicc catalogues the capability split between the two device-code compilers and the migration trajectory.
- tileiras vs cudafe++ (Non-Relationship) clarifies that tileiras is not a cudafe++ replacement.
- Driver main() Entry describes how the tileiras process orchestrates the compile internally.
- Driver CLI Options enumerates every option in the tileiras argv schema.
- Subprocess Harness documents the POSIX launcher shared by every external tool invocation.
- Env Var and Runtime Gate Catalog lists every environment variable the binary reads.
- Driver Program Handle defines the public error codes returned through the tileiras exit status.
- MLIR Bytecode Format specifies the TileIR bytecode shape that triggers the dispatch.
cl::opt Full Catalog
Every command-line option the tileiras binary registers — through LLVM cl::opt / cl::list / cl::alias, the NVIDIA-private dialect-bag, and MLIR PassOption registrars that share its textual surface. An option counts if a static-storage symbol calls llvm::cl::Option::setArgStr(name, len) (sub_4534CC0, 1174 B) at static-construction time, if a tileiras dialect-bag helper (sub_5FE350 / sub_5FE910 / sub_5FED40) runs from a per-invocation builder, or if mlir::detail::PassOptions::Option<T> is constructed from sub_6D3460. The binary contains 689 distinct caller addresses for sub_4534CC0; this catalog covers the 77 user-visible options across registrars 1–7. The 478-row PassBuilder name registry is summarized in PassBuilder Mega-Registry.
Reading guide
Option families surface to the user through different CLI prefixes. The first stop depends on which prefix appears on the command line:
| If you see... | Find it in... | Examples |
|---|---|---|
| bare --opt-level, -O, -g | Layer 1 (Driver Globals) | --opt-level, --gpu-name, --sanitize |
| bare --compute-capability, --debuginfo-level | Layer 2 (Dialect Options Bag) | targets pre-driver loop |
| --pass-pipeline="tileir{key=value ...}" | Layer 3 (TileIR PassOptions) | compute-capability=sm_90, num-warps=8, opt-level=2 |
| -mllvm -nvptx-... | Layer 4 (NVPTX Backend) | upstream LLVM NVPTX target options |
| bare -Om, -Osize, -w, -Werror | Layer 4b (NVPTX-CL Options) | host-driver-mode compatibility flags |
| -mllvm -nvvm-reflect-* | Layer 5 (NVVM Reflect) | nvvm-reflect-add KEY=VAL, -R KEY=VAL |
| --mlir-... | Layer 6 (MLIR Framework) | --mlir-print-ir-after-all, --mlir-timing |
| --passes="..." pass-name strings | PassBuilder mega-registry | not user-visible cl::opts; see PassBuilder Mega-Registry |
When the same name appears in two layers (the documented --compute-capability collision between Layer 2 and Layer 3 is the canonical example), the driver propagates the Layer-2 value down into Layer 3; the Layer-3 default fires only when the MLIR pass library is loaded outside the tileiras driver.
Registrar Tiers
The static-init-time CLI surface partitions into five disjoint registrars plus two LLVM-inherited bulk registrars. Each tier owns its builder body, its storage scheme, and its help-text rodata cluster.
| Tier | Count | Registrar function | Storage scheme | End-user prefix |
|---|---|---|---|---|
| Driver-tier (Layer 1) | 5 | sub_579270 (3834 B), called by thunk sub_57A170 from main | Heap-allocated 864-byte TileirasDriverCLOpts aggregate | bare (--opt-level, -O, --lineinfo, --device-debug, -g) |
| Dialect-bag (Layer 2) | 3 | sub_602440 (1599 B), called from sub_602A80 per invocation | Heap-allocated ~1488-byte dialect options bag | bare (parsed alongside Layer 1) |
| TileIR PassOptions (Layer 3) | 20 | sub_6D3460 (13726 B); helpers sub_6D3140 (int), sub_6D2E20 (uint/size_t), sub_5FED40 (bool), sub_4534CC0 (string/enum), sub_44E10F0 (string default setter) | Caller-owned TileIRPipelineOptions struct, ≥5616 B | --pass-pipeline="tileir{...}" |
| NVPTX backend (Layer 4) | 26 | LLVM static ctors, per TU within NVPTXTargetMachine | Per-TU BSS globals, 160–200 B per option | -mllvm -nvptx-* |
| NVVM reflect (Layer 5) | 4 | ctor_238 (0x463A70); cl::list backing at 0x5B4F380 | Static cl::opt<bool> at 0x5B4F400, cl::list<string> at 0x5B4F300 | -mllvm -nvvm-reflect-* |
| MLIR framework (Layer 6) | 18 | MLIR support-library static ctors (AsmPrinter / Diagnostics / MLIRContext / PassTiming) | Per-TU BSS globals | --mlir-* |
| NVPTX-CL options (Layer 4b) | 13 | sub_45BA4C0 (8524 B), ManagedStatic-guarded; __cxa_atexit per opt | Static globals, 13 __cxa_atexit registrations | bare (no -mllvm prefix) |
| Misc LLVM (Layer 7) | 1 | LLVM static ctor in ValueTracking.cpp | Per-TU BSS global | -mllvm |
| PassBuilder mega-registry | 478 | sub_1CCB7D0 (35948 B) | StringMap*(this+8) | name-registry only; not a CLI option |
| Driver-tier total (1+2) | 8 | | | |
| Pass-options surface (3) | 47 | (counts include the helper-distinguished int/uint/bool/string/enum sub-types) | | |
| Total user-visible | 77 | | | |
| Total setArgStr xrefs | 689 | (LLVM/MLIR/Target full link graph) | | |
Static-init order follows ELF .init_array; each ctor invokes setArgStr(opt, name, len) then done() (sub_4534420, 199 B), inserting into cl::GlobalParser (sub_4530050). Categories attach via sub_452D690 (OptionCategory::addOption, 455 B). The atomic counter NumOccurrences (sub_452D580, 88 B) uses InterlockedExchangeAdd64. Standard cl::alias diagnostics live at 0x45E7218/0x45E7258/0x45E7288/0x45E72C0 (verbatim upstream) and route through sub_452D9F0 → sub_459CCA0; the duplicate-registration error " registered more than once!" is emitted from setArgStr itself.
Master Table
Columns: option | type | default | help text (verbatim) | storage / agg offset | defining pass / TU | wiki page. Sorted alphabetically within each registrar. Where the per-option BSS storage was not individually extracted, the rodata address of the name string is given as name@<addr>.
Layer 1 — Tileiras Driver Globals (registrar sub_579270)
Aggregate: heap-allocated 864 B TileirasDriverCLOpts, owner pointer returned to main via sub_57A170. cl::opt slots are 192 B; the two cl::alias slots are 136 B. Applicator trampolines are sub_578C40 (int) and sub_578C50 (bool), paired with nullsub_10 / nullsub_11.
| option | type | default | help text (verbatim) | storage | defining pass | wiki |
|---|---|---|---|---|---|---|
| --device-debug | cl::opt<bool> | false | Generate debug information (if present in the input bytecode) | name@(len 12), agg+0x218..0x2D7 | sub_579270 | driver/cli-options |
| -g | cl::alias → device-debug | — | Alias for --device-debug | name@0x45E74F9 (len 1), agg+0x2D8..0x35F | sub_579270 | driver/cli-options |
| --lineinfo | cl::opt<bool> | false | Generate line-number information (if present in the input bytecode) | name@0x45E74F0 (len 8), agg+0x158..0x217 | sub_579270 | driver/cli-options |
| -O | cl::alias → opt-level | — | Alias for --opt-level | name (len 1), agg+0x0D0..0x157 | sub_579270 | driver/cli-options |
| --opt-level | cl::opt<int> | 3 | Specify optimization level. Default Value: 3. | name@0x45E74xx (len 9), agg+0x000..0x0C7 | sub_579270 | driver/cli-options |
--opt-level ValueStr metavar = "N" (renders as --opt-level=<N>). Aggregate heap-allocated by sub_44A8C20(0x360).
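The offsets in the Layer 1 table can be cross-checked with a little arithmetic. The sketch below derives each slot's size from its first and last byte; the helper names are hypothetical and the ranges are copied from the table. Under these numbers the bool options come out at 192 B and the aliases at 136 B as stated, while the --opt-level range as printed spans 200 B and is followed by an 8-byte gap. Whether that reflects a larger int slot, padding, or a transcription detail cannot be decided from the table alone.

```python
# (option, kind, first_byte, last_byte), copied from the Layer 1 table.
SLOTS = [
    ("--opt-level",    "opt",   0x000, 0x0C7),
    ("-O",             "alias", 0x0D0, 0x157),
    ("--lineinfo",     "opt",   0x158, 0x217),
    ("--device-debug", "opt",   0x218, 0x2D7),
    ("-g",             "alias", 0x2D8, 0x35F),
]


def slot_sizes(slots):
    """Size in bytes of each inclusive [first, last] range."""
    return {name: last - first + 1 for name, _kind, first, last in slots}


def slot_gaps(slots):
    """Unclaimed bytes between consecutive slots, keyed by (prev, next)."""
    ordered = sorted(slots, key=lambda s: s[2])
    return {(a[0], b[0]): b[2] - a[3] - 1 for a, b in zip(ordered, ordered[1:])}


def aggregate_size(slots):
    """One past the highest byte claimed by any slot."""
    return max(last for _n, _k, _f, last in slots) + 1
```

The computed aggregate size equals 0x360 (864), matching both the stated allocation size and the argument passed to sub_44A8C20.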
Layer 1 — cl::ValuesClass int32 enum options
Four additional Layer-1 options are wired by byte-equivalent template instantiations of cl::opt<cl::ValuesClass>::opt, each differing only in its string-pair table, parser vtable, and the int32 target slot it writes into the aggregate. The parsed result is always a single int32; downstream code consults the integer, never the string.
| option | builder | parser vtable | default | int32 codes |
|---|---|---|---|---|
| --gpu-name | sub_577620 (5-pair table) | &unk_59A7378 | 100 | "sm_100"=100, "sm_103"=103, "sm_110"=110, "sm_120"=120, "sm_121"=121 |
| --host-arch | sub_577950 (3-pair table) | &unk_59A7468 | 0 | "x86_64"=0, "aarch64"=1, "arm64ec"=2 |
| --host-os | sub_577C80 (2-pair table) | &unk_59A7558 | 0 | "linux"=0, "windows"=1 |
| --sanitize | sub_577FB0 (1-pair table) | &unk_59A7648 | 0 | (unset)=0, "memcheck"=1 |
GPU-name codes correspond to compute-capability families: 100 is Datacenter Blackwell (default), 103 the Blackwell variant, 110 Jetson Thor, 120 Consumer RTX 50** / Pro, 121 DGX Spark. --sanitize is the toggle that activates the -sanitize=memcheck -g-tmem-access-check nvdisasm tail.
Host-triple resolution reads these ints downstream. sub_40FD330 keys off the host-arch int with stride 39 for x86_64 (code 0), stride 36 for aarch64 (code 1), and stride 36 for arm64ec (code 2 — which uses a sub-entry of the aarch64 record); this is the only place arm64ec diverges from aarch64. sub_40FD7E0 keys off the host-os int with OS-index 7 for linux (code 0) and OS-index 15 for windows (code 1).
The four parser vtables &unk_59A7378 / &unk_59A7468 / &unk_59A7558 / &unk_59A7648 share an 8-slot layout: vtable+0 typeinfo helper, +8 destructor, +16 parse (string → int32 map probe), +24 print (int32 → string lookup), +32 valuesDefault (initialise from a cl::values(...) builder), +40 reserved, +48 reserved, +56 reserved. The parse slot is the only operation invoked at command-line-parse time; the print slot fires only when --help is requested.
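The parse and print slots amount to a forward and a reverse lookup over the string-pair tables. A minimal sketch follows, using the pairs and defaults from the table above; the function and dictionary names are hypothetical, and the error raised for an unknown string is a placeholder rather than the real diagnostic.

```python
# String → int32 pairs and defaults, copied from the ValuesClass table.
VALUE_TABLES = {
    "gpu-name": {"sm_100": 100, "sm_103": 103, "sm_110": 110,
                 "sm_120": 120, "sm_121": 121},
    "host-arch": {"x86_64": 0, "aarch64": 1, "arm64ec": 2},
    "host-os": {"linux": 0, "windows": 1},
    "sanitize": {"memcheck": 1},
}
VALUE_DEFAULTS = {"gpu-name": 100, "host-arch": 0, "host-os": 0, "sanitize": 0}


def parse_value(option, text):
    """Parse slot: map the option's string to its int32 code."""
    table = VALUE_TABLES[option]
    if text not in table:
        raise ValueError(f"invalid value {text!r} for --{option}")
    return table[text]


def print_value(option, code):
    """Print slot: reverse lookup, or None when no string maps to the code."""
    for text, value in VALUE_TABLES[option].items():
        if value == code:
            return text
    return None
```

For --sanitize the default code 0 has no string form, which is why the reverse lookup has to tolerate a miss.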
Layer 2 — Tileiras Dialect Options Bag (registrar sub_602440)
Aggregate: ~1488 B heap-allocated dialect bag attached to a per-invocation mlir::DialectRegistry. Not visible through the global LLVM parser; consumed by tileiras's pre-driver loop and forwarded to Layer 3. Helpers: sub_5FE350 (enum), sub_5FE910 (string), sub_5FED40 (bool).
| option | type | default | help text (verbatim) | storage | defining pass | wiki |
|---|---|---|---|---|---|---|
--compute-capability | cl::opt<string> | sm_100 | (metavar compute capability, no description body) | name@0x4Exxxxx (len 18), bag+0x4D8..0x5C7; default literal "sm_100"@0x45E7185 | sub_602440 | driver/cli-options |
--debuginfo-level | cl::opt<enum> (4-value) | none (=0) | The level of debug info to emit. | name@0x45EF0xx (len 15), bag+0x070..0x1EF | sub_602440 | lowering/target-and-debuginfo |
--is-optimized | cl::opt<bool> | false | Encode in the debug info whether the program is optimized or not. | name@0x45EF0xx (len 12), bag+0x400..0x4D7 | sub_602440 | lowering/target-and-debuginfo |
Enum table for --debuginfo-level (constructed inline by sub_602440):
| value | string | help (verbatim) |
|---|---|---|
| 0 | none | None. |
| 1 | full | Full. |
| 2 | line-tables | Line Tables Only. |
| 3 | debug-directives | Debug Directives Only. |
Note: --compute-capability collides with the Layer-3 PassOption of the same name (Layer 2 defaults to sm_100, Layer 3 to sm_80). The driver propagates Layer 2 into Layer 3; the Layer-3 default fires only when the MLIR passes load without the tileiras driver.
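The resolution rule for the colliding name can be written down as a tiny function. This is a sketch with hypothetical names. It assumes that, under the driver, the Layer-2 value always wins; how an explicit Layer-3 value inside --pass-pipeline interacts with the propagated one is not stated here, so the sketch ignores that case.

```python
LAYER2_DEFAULT = "sm_100"  # dialect options bag default
LAYER3_DEFAULT = "sm_80"   # TileIR PassOption default


def effective_compute_capability(driver_loaded, layer2_value=None,
                                 layer3_value=None):
    """Pick the compute capability the passes would observe (illustrative)."""
    if driver_loaded:
        # The driver propagates Layer 2 into Layer 3.
        return layer2_value if layer2_value is not None else LAYER2_DEFAULT
    # Pass library loaded without the tileiras driver: Layer 3 stands alone.
    return layer3_value if layer3_value is not None else LAYER3_DEFAULT
```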
Layer 3 — TileIR MLIR PassOptions (registrar sub_6D3460)
20 mlir::Pass::Option<T> registrations on a single TileIRPipelineOptions (≥5616 B). Not in the global LLVM cl::opt registry; parsed from --pass-pipeline="tileir{key=value ...}". reg+N is the registration slot (208 B stride); val+N is the resolved-value offset in the a2 value-struct read by sub_6D6A00. For per-option consumer mapping see Options Mapping.
| option | type | default | help text (verbatim) | storage / agg+slot | consuming pass(es) | wiki |
|---|---|---|---|---|---|---|
| approx | bool | 0 | Approximate calculation. | reg+2104, val~+? | NVVMReflect path (sub_14FE980) | libdevice/nvvm-reflect-mechanism |
| compute-capability | string | sm_80 | compute capability | reg+600, val+728/736 | sub_738810 (Frontend→TileAA), sub_6D0E90 | driver/cli-options |
| dump-host | string | "" | Print the generated host code to the provided path. | reg+4912, val+5040/5048, presence+5072 | sub_879B50 (EmitHostWrapper) | driver/tileir-callbacks-abi |
| dynamic-persistent | bool | 0 | Enable dynamic persistent transformation | reg+2936, val+3064 | TileASDynamicPersistent (driver gate) | passes/tileas/cta-cluster-family |
| emit-line-info | enum (5-value) | none | Emit debug line info from existing or snapshot IR (snapshot saved to ./snapshot.mlir). | reg+3408, val+3536 | driver (snapshot insertion); SynthesizeDebugInfoScopes | lowering/target-and-debuginfo |
| enable-debug-logging | bool | 0 | Enable debug logging in TileIR host callbacks. | reg+4024, val+4152 | sub_879B50 EmitHostWrapper | driver/tileir-callbacks-abi |
| enable-random-delay | bool | 0 | enable random delay | reg+2520 | TileAS scheduler family (LOW conf) | scheduler/overview |
| ftz | bool | 0 | Flush denormal to zero. | reg+2312, val+2232 | NVVMReflect path (nvvm-reflect-ftz ModuleFlag) | libdevice/nvvm-reflect-mechanism |
| host-triple | string | native | Specify the target triple for TileIR host callbacks. | reg+4232, val+4360 | sub_879B50 EmitHostWrapper | driver/tileir-callbacks-abi |
| index-bitwidth | int | 32 | Bitwidth of the index type, 0 to use size of machine word. | reg+1688 | ConvertTileASToLLVM, TileAS→NVGPU, ConvertToLLVM, ConvertMemRefToLLVM, ConvertControlFlowToLLVM, UnspecializedPipeline | lowering/tileas-to-llvm |
| max-constraint-iterations | uint | 10 | Maximum number of iterations for resource constraint generation. Higher values allow more optimization attempts but increase compilation time. Lower values may result in fallback to serial execution when resource constraints are tight. | reg+4704, val+4832 | sub_8A25E0 TileASPrepareForScheduling | passes/tileas/cta-cluster-family |
| num-ctas | int | 1 | number of ctas in a cga | reg+392, val+520 | sub_738810 (Frontend→TileAA) | lowering/cuda-tile-to-tileaa |
| num-warps | int | 4 | number of warps | reg+184, val+312 | sub_738810 (Frontend→TileAA) | lowering/cuda-tile-to-tileaa |
| opt-level | int | 2 | Optimization level for NVVM compilation. Please notice that the default value is 2 and can be set from 0 to 3. | reg+864, val+992 | driver sub_6D6A00 shape switch; ConvertTargetToNVVM (sub_14FE980) | pipeline/driver-and-opt-levels |
| pipeline-strategy | enum (3-value) | none | Select the strategy of pipelining optimization. | reg+1072, val+1200 | driver sub_6D6A00 / sub_6D0E90 / sub_6D18D0 (warp-specialize selector); SpecializeAgents | passes/tileas/async-pipeline-family |
| rrt-size-threshold | uint | 4096 | RRT size threshold for quantization (in time slots). Applies quantization when RRT exceeds this size to reduce compilation time at the cost of schedule accuracy. Smaller thresholds enable more compression for faster compilation but with reduced scheduling precision. If threshold is 0, then no quantization will be applied. | reg+4496, val+4624 | sub_8A25E0 TileASPrepareForScheduling | passes/tileas/cta-cluster-family |
| schedule-trace-file | string | "" | Generate a chrome timeline trace if not empty for the visualizationof the scheduling result for TileASv2 | reg+3144, val+3272 | sub_825050 TileASScheduleRewriteEnable | scheduler/overview |
| unspecialized-pipeline-num-stages | int | 4 | numStages for unspecialized pipeline pass. | reg+1896, val+1816 | UnspecializedPipeline (sub_1A24770), ConvertTileASToLLVM, TileAS→NVGPU | passes/tileas/async-pipeline-family |
| use-nvgpucomp-libnvvm | bool | 0 | Use NVGpuComp to compile NVVM IR. If false, use default libnvvm path. | reg+5488, val+5616 | ConvertTargetToNVVM (sub_14FE980) | lowering/nvgpu-and-gpu-to-nvvm |
| v2-opt-level | int | 0 | Optimization level for tile_ir V2 pass pipeline. | reg+2728, val+2856 | driver sub_6D6A00 second-axis shape gate | pipeline/driver-and-opt-levels |
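In most rows of the table above, the val offset equals the reg offset plus 128. That relationship is an observation from the listed numbers, not a documented contract, but it gives a quick way to spot rows that deserve a second look. The sketch below uses hypothetical names, includes only rows that list both offsets, and takes the first value where two are given.

```python
# option -> (reg offset, val offset), copied from the Layer 3 table.
REG_VAL = {
    "compute-capability": (600, 728),
    "dump-host": (4912, 5040),
    "dynamic-persistent": (2936, 3064),
    "emit-line-info": (3408, 3536),
    "enable-debug-logging": (4024, 4152),
    "ftz": (2312, 2232),
    "host-triple": (4232, 4360),
    "max-constraint-iterations": (4704, 4832),
    "num-ctas": (392, 520),
    "num-warps": (184, 312),
    "opt-level": (864, 992),
    "pipeline-strategy": (1072, 1200),
    "rrt-size-threshold": (4496, 4624),
    "schedule-trace-file": (3144, 3272),
    "unspecialized-pipeline-num-stages": (1896, 1816),
    "use-nvgpucomp-libnvvm": (5488, 5616),
    "v2-opt-level": (2728, 2856),
}


def offset_outliers(table, expected_delta=128):
    """Names whose val offset is not reg + expected_delta, sorted."""
    return sorted(name for name, (reg, val) in table.items()
                  if val - reg != expected_delta)
```

With the numbers as listed, the check flags ftz and unspecialized-pipeline-num-stages, whose val offsets sit below their reg offsets. These may be genuine layout exceptions or transcription slips; the table alone does not say which.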
Enum tables for Layer-3 enums:
| option | value | string | help (verbatim) |
|---|---|---|---|
| pipeline-strategy | 0 | none | no pipelining optimization |
| pipeline-strategy | 1 | unspecialize | do pipelining for unspecialized flow |
| pipeline-strategy | 2 | warp-specialize | do pipelining for warp specialized flow |
| emit-line-info | 0 | none | Do not emit line info. |
| emit-line-info | 1 | inputIR | Emit line info from the existing input IR. |
| emit-line-info | 2 | tileaa | Emit line info from TileAA IR snapshot before lowering to TileAS. |
| emit-line-info | 3 | tileas | Emit line info from TileAS IR snapshot before lowering to LLVM. |
| emit-line-info | 4 | post-tileas | Emit line info from Cute and LLVM IR snapshot. |
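To show how the Layer 3 surface reads from the command line, the sketch below parses a tileir{key=value ...} string into a dictionary seeded with a subset of the defaults from the table above. It is a simplification under assumptions: the function name is hypothetical, keys are taken to be separated by whitespace with no quoting or nesting, booleans are assumed to accept "true" or "1", and the real parsing is done by the MLIR pass-option machinery.

```python
# A subset of Layer 3 defaults, copied from the PassOptions table.
LAYER3_DEFAULTS = {
    "compute-capability": "sm_80",
    "num-warps": 4,
    "num-ctas": 1,
    "opt-level": 2,
    "index-bitwidth": 32,
    "ftz": False,
    "pipeline-strategy": "none",
}


def parse_tileir_options(pipeline):
    """Parse 'tileir{key=value ...}' into a dict of resolved options."""
    pipeline = pipeline.strip()
    prefix = "tileir{"
    if not (pipeline.startswith(prefix) and pipeline.endswith("}")):
        raise ValueError("expected tileir{...}")
    options = dict(LAYER3_DEFAULTS)
    for token in pipeline[len(prefix):-1].split():
        key, sep, value = token.partition("=")
        if not sep or key not in LAYER3_DEFAULTS:
            raise ValueError(f"unknown or malformed option: {token!r}")
        default = LAYER3_DEFAULTS[key]
        if isinstance(default, bool):      # check bool before int
            options[key] = value in ("1", "true")
        elif isinstance(default, int):
            options[key] = int(value)
        else:
            options[key] = value
    return options
```

Keys that are not mentioned keep their table defaults, which is how opt-level stays at 2 unless the pipeline string overrides it.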
Layer 4 — NVPTX Backend cl::opt (LLVM static ctors)
Per-TU global ctors invoking sub_4534CC0(&optObj, name, len) at static-init. Seven rows below are INITIALIZE_PASS markers (pass-name registrations, not cl::opt). The set is verbatim upstream LLVM NVPTXTargetMachine; the default for nvptx-force-min-byval-param-align is patched to false (upstream = true).
| option | type | default | help text (verbatim) | storage (string addr) | defining pass | wiki |
|---|---|---|---|---|---|---|
| alloca-hoisting | INITIALIZE_PASS | — | NVPTX specific alloca hoisting | name@0x4D12164 | NVPTXAllocaHoisting | nvptx-passes/overview |
| disable-nvptx-load-store-vectorizer | cl::opt<bool> | false | Disable load/store vectorizer | name@0x4D0EF60 | LSV gate | nvptx-passes/overview |
| disable-nvptx-require-structured-cfg | cl::opt<bool> | false | Transitional flag to turn off NVPTX's requirement on preserving structured CFG. The requirement should be disabled only when unexpected regressions happen. | name@0x4D0EF88 | NVPTXTargetMachine | codegen/nvptx-bring-up-and-target-init |
| nvptx-aa-wrapper | INITIALIZE_PASS | — | NVPTX Address space based Alias Analysis Wrapper | name@0x4D11FB1 | NVPTXAliasAnalysisWrapper | nvptx-passes/memory-space-opt-and-process-restrict |
| nvptx-approx-log2f32 | cl::opt<bool> | false | NVPTX Specific: whether to use lg2.approx for log2 | name@0x4D0DA2D | NVPTXISelLowering | libdevice/math-pass-pipeline-and-crosswalk |
| nvptx-asm-printer | INITIALIZE_PASS | — | NVPTX Assembly Printer | name@0x4D07A97 | NVPTXAsmPrinter | codegen/asm-printer-monster-and-windows |
| nvptx-assign-valid-global-names | INITIALIZE_PASS | — | Assign valid PTX names to globals | name@0x4D121A0 | NVPTXAssignValidGlobalNames | nvptx-passes/overview |
| nvptx-atomic-lower | INITIALIZE_PASS | — | NVPTX lower atomics of local memory | name@0x4D1221C | NVPTXAtomicLower | codegen/atomic-warp-sreg-fence |
| nvptx-early-byval-copy | cl::opt<bool> | false | Create a copy of byval function arguments early. | name@0x4D0F141 | NVPTXLowerArgs | nvptx-passes/lower-args-and-aggr-and-struct |
| nvptx-emit-init-fini-kernel | cl::opt<bool> | false | Emit kernels to call ctor/dtor globals. | name@0x4D1262C | NVPTXCtorDtorLowering | nvptx-passes/kernel-cdp-inline-pretreat |
| nvptx-exit-on-unreachable | cl::opt<bool> | false | Lower 'unreachable' as 'exit' instruction. | name@0x4D0F127 | NVPTXISelLowering | codegen/nvptx-target-lowering-call-and-args |
| nvptx-fma-level | cl::opt<uint> | (default per LLVM) | NVPTX Specific: FMA contraction (0: don't do it 1: do it 2: do it aggressively | name@0x4D0D9DC | NVPTXTargetMachine | libdevice/math-pass-pipeline-and-crosswalk |
| nvptx-force-min-byval-param-align | cl::opt<bool> | false (NVIDIA-patched; upstream default = true) | NVPTX Specific: force 4-byte minimal alignment for byval params of device functions. | name@0x4D0DBF0 | NVPTXLowerArgs | nvptx-passes/lower-args-and-aggr-and-struct |
| nvptx-forward-params | INITIALIZE_PASS | — | NVPTX Forward Params | name@0x4D12715 | NVPTXForwardParams | nvptx-passes/overview |
| nvptx-isel | INITIALIZE_PASS | — | NVPTX DAG->DAG Pattern Instruction Selection | name@0x4D1293D | NVPTXISelDAGToDAG | codegen/iseldag-and-matchertable |
| nvptx-libcall-callee | cl::opt<bool> | (default per LLVM) | (controls direct libcall lowering; help colocated near 0x4D08070) | name@0x4D0805A | NVPTXTargetLowering | codegen/nvptx-target-lowering-call-and-args |
| nvptx-lower-global-ctor-dtor | cl::opt<bool> | false | Lower GPU ctor / dtors to globals on the device. | name@0x4D083A1 | NVPTXCtorDtorLowering | nvptx-passes/kernel-cdp-inline-pretreat |
| nvptx-lower-global-ctor-dtor-id | cl::opt<string> | "" | Override unique ID of ctor/dtor globals. | name@0x4D12678 | NVPTXCtorDtorLowering | nvptx-passes/kernel-cdp-inline-pretreat |
| nvptx-no-f16-math | cl::opt<bool> | false | NVPTX Specific: Disable generation of f16 math ops. | name@0x4D0E6C4 | NVPTXISelLowering | libdevice/math-pass-pipeline-and-crosswalk |
| nvptx-prec-divf32 | cl::opt<uint> | (default per LLVM) | NVPTX Specific: Override the precision of the lowering for f32 fdiv | name@0x4D0DA08 | NVPTXISelLowering (also reads __CUDA_PREC_DIV reflect key) | libdevice/math-pass-pipeline-and-crosswalk |
| nvptx-prec-sqrtf32 | cl::opt<bool> | false | NVPTX Specific: 0 use sqrt.approx, 1 use sqrt.rn. | name@0x4D0DA1A | NVPTXISelLowering (also reads __CUDA_PREC_SQRT reflect key) | libdevice/math-pass-pipeline-and-crosswalk |
| nvptx-rsqrt-approx-opt | cl::opt<bool> | false | Enable reciprocal sqrt optimization | name@0x4D15BF5 | NVPTXTargetLowering | libdevice/math-pass-pipeline-and-crosswalk |
| nvptx-sched4reg | cl::opt<bool> | false | NVPTX Specific: schedule for register pressue | name@0x4D0D9CC | NVPTXSubtarget scheduler choice | codegen/nvptx-subtarget-and-feature-matrix |
| nvptx-short-ptr | cl::opt<bool> | false | Use 32-bit pointers for accessing const/local/shared address spaces. | name@0x4D0F117 | NVPTXTargetMachine | codegen/nvptx-subtarget-and-feature-matrix |
| nvptx-traverse-address-aliasing-limit | cl::opt<uint> | (default per LLVM) | Depth limit for finding address space through traversal | name@0x4D12070 | NVPTXAA | nvptx-passes/memory-space-opt-and-process-restrict |
| nvptx-use-max-local-array-alignment | cl::opt<bool> | false | Use maximum alignment for local memory | name@0x4D11F00 | NVPTXLowerArgs | nvptx-passes/lower-args-and-aggr-and-struct |
Layer 4 also carries nvptx-prec-divf32 enum value-strings: Use div.approx, Use div.full, Use IEEE Compliant F32 div.rnd if available (default), Use IEEE Compliant F32 div.rnd if available, no FTZ.
Layer 4b — NVPTX-CL Options Registrar (sub_45BA4C0)
A ManagedStatic-guarded static initializer at sub_45BA4C0 (8524 B) registers exactly 13 llvm::cl::opt instances against the global registry. Each branch ends with __cxa_atexit(dtor, &opt, &__dso_handle). These appear bare on the CLI (no -mllvm prefix).
| option | type | default | help text (verbatim) | storage / line | defining pass | wiki |
|---|---|---|---|---|---|---|
| debug-compile | flag | false | Compile for debugging | sub_45BA4C0:708 | tileiras CLI | driver/cli-options |
| generate-line-info | flag | false | Emit line info even without -G | sub_45BA4C0:774 | tileiras CLI | driver/cli-options |
| ignore-bad-fp | flag | false | Workaround Gdb problem in dumping floating-point constants | sub_45BA4C0:390 | tileiras CLI | driver/cli-options |
| line-info-inlined-at | flag | false | Emit line with inlined-at enhancement | sub_45BA4C0:840 | tileiras CLI | driver/cli-options |
| maxreg | int (with cl::value_desc) | (none) | max regcount | sub_45BA4C0:583 | tileiras CLI | driver/cli-options |
| nvptx-f32ftz | flag | false | (no description) | sub_45BA4C0:198 | tileiras CLI | driver/cli-options |
| nvptx-nan | flag | false | (no description) | sub_45BA4C0:134 | tileiras CLI | driver/cli-options |
| Om | flag | false | Perform maximum optimization | sub_45BA4C0:518 | tileiras CLI | driver/cli-options |
| Osize | flag | false | Optimize for code size | sub_45BA4C0:454 | tileiras CLI | driver/cli-options |
| register-usage-level | int | (none) | (no description) | sub_45BA4C0:902 | tileiras CLI | driver/cli-options |
| value-tracking-max-depth | int | (none) | (no description) | sub_45BA4C0:646 | tileiras CLI | driver/cli-options |
| w | cl::alias | — | disable warnings | sub_45BA4C0:262 | tileiras CLI | driver/cli-options |
| Werror | flag | false | Treat all warnings as errors | sub_45BA4C0:326 | tileiras CLI | driver/cli-options |
Layer 5 — NVVM Reflect cl::opt (ctor_238 at 0x463A70)
Three registrations bundled into one TU ctor: a cl::opt<bool>, a cl::list<std::string>, and a cl::alias. List backing std::vector<std::string> at 0x5B4F380 (begin/end) + 0x5B4F390 (capacity); 32 B/entry. Atexit dtors: sub_9C31D0 (opt), sub_9C3B10 (list), sub_9C3120 (alias).
| option | type | default | help text (verbatim) | storage | defining pass | wiki |
|---|---|---|---|---|---|---|
nvvm-reflect-add | cl::list<string> | (empty) | A key=value pair. Replace __nvvm_reflect(name) with value. | obj@0x5B4F300, name@0x4D3C77A; metavar name=<int>@0x4D3C78B; list backing@0x5B4F380/0x5B4F390 | NVVMReflectPass (sub_1BD0910 / sub_1BD0C50 / sub_1BD1280) | libdevice/nvvm-reflect-mechanism |
nvvm-reflect-enable | cl::opt<bool> | true | NVVM reflection, enabled by default | obj@0x5B4F400, name@0x4D3C766, help@0x5B4F428 (len 35) | NVVMReflectPass | libdevice/nvvm-reflect-mechanism |
R | cl::alias → nvvm-reflect-add | — | (alias) | obj@0x5B4F260, name@(len 1); standard 4-check cl::alias validation | NVVMReflectPass | libdevice/nvvm-reflect-mechanism |
Two pass-name strings co-locate in Layer 5 but register via INITIALIZE_PASS, not cl::opt:
| pass-arg | help (verbatim) | storage |
|---|---|---|
nvvm-intr-range | Add !range metadata to NVVM intrinsics. | name@0x4D0ED8E, help@0x4D3C3B0 |
nvvm-reflect | Replace occurrences of __nvvm_reflect() calls with 0/1 | name@0x4D0ED5D, help@0x4D3C518 |
The legacy-PM create-fn for nvvm-reflect (sub_1BD0880, 21 B) is a report_fatal_error("target-specific codegen-only pass") stub; the real pass body reaches through the new-PM path in sub_1A92780. Three soft-CLI parser errors live at 0x4D3C6B8 / 0x4D3C6E0 / 0x4D3C710 (Empty name, Missing value, integer value expected in nvvm-reflect-add option ').
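The three parser errors suggest a simple name=<int> splitter. The following is a minimal sketch of that classification, not the recovered routine: the function name, the enum, the order of the checks, and the `__CUDA_FTZ` key used in the usage note are assumptions for illustration; only the three message strings come from the binary.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes; each maps to one message string quoted above. */
enum reflect_parse_status {
    REFLECT_OK = 0,
    REFLECT_EMPTY_NAME,     /* "Empty name" */
    REFLECT_MISSING_VALUE,  /* "Missing value" */
    REFLECT_INT_EXPECTED    /* "integer value expected in nvvm-reflect-add option '" */
};

/* Splits "name=<int>" into a name (copied into name_out) and an integer.
 * The check order is an assumption; outputs are written only on success. */
static enum reflect_parse_status
parse_reflect_add(const char *arg, char *name_out, size_t name_cap,
                  long *value_out)
{
    assert(name_cap > 0);
    const char *eq = strchr(arg, '=');
    size_t name_len = eq ? (size_t)(eq - arg) : strlen(arg);

    if (name_len == 0)
        return REFLECT_EMPTY_NAME;
    if (eq == NULL || eq[1] == '\0')
        return REFLECT_MISSING_VALUE;

    errno = 0;
    char *end = NULL;
    long value = strtol(eq + 1, &end, 10);
    if (errno != 0 || end == eq + 1 || *end != '\0')
        return REFLECT_INT_EXPECTED;

    if (name_len >= name_cap)
        name_len = name_cap - 1;  /* truncate; acceptable for a sketch */
    memcpy(name_out, arg, name_len);
    name_out[name_len] = '\0';
    *value_out = value;
    return REFLECT_OK;
}
```

Under these assumptions, `parse_reflect_add("__CUDA_FTZ=1", ...)` succeeds with name `__CUDA_FTZ` and value 1, while `"=1"`, `"k"`, and `"k=abc"` each hit one of the three error paths.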
The 13 B string nvvm-reflect- at 0x4D3C6A0 has ftz\0 at offset +13 (0x4D3C6AD), so a 16 B StringRef into Module::getModuleFlag materializes the key "nvvm-reflect-ftz" from concatenated rodata. The same ftz substring is shared with the Layer-3 ftz PassOption.
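The overlapping-view trick can be illustrated with a small sketch. The `string_ref` struct and helper names are hypothetical stand-ins for llvm::StringRef; the only facts taken from the text are the 13-byte prefix, the `ftz` suffix at offset +13, and the 16-byte combined key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for llvm::StringRef: pointer plus length, no NUL. */
struct string_ref {
    const char *data;
    size_t len;
};

/* One contiguous rodata run: the 13-byte prefix followed directly by "ftz\0". */
static const char reflect_rodata[] = "nvvm-reflect-ftz";

/* Three views into the same bytes; none of them copies anything. */
static struct string_ref reflect_prefix(void)
{
    return (struct string_ref){ reflect_rodata, 13 };
}

static struct string_ref reflect_ftz_key(void)
{
    return (struct string_ref){ reflect_rodata, 16 };
}

static struct string_ref ftz_suffix(void)
{
    return (struct string_ref){ reflect_rodata + 13, 3 };
}

/* Length-aware comparison against a NUL-terminated literal. */
static int string_ref_equals(struct string_ref ref, const char *literal)
{
    assert(literal != NULL);
    size_t n = strlen(literal);
    return ref.len == n && memcmp(ref.data, literal, n) == 0;
}
```

Because a length-carrying view never looks for a terminator, the same 17 bytes (16 characters plus the NUL) can serve as the prefix, the full module-flag key, and the bare `ftz` string.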
Layer 6 — MLIR Framework cl::opt (AsmPrinter / Diagnostics / PassManager / Timing)
18 static-ctor cl::opts in MLIR support libs (lib/IR/AsmPrinter.cpp, lib/IR/Diagnostics.cpp, lib/IR/MLIRContext.cpp, lib/Pass/PassTiming.cpp). Exposed unmodified through the global LLVM cl::opt parser.
| option | type | default | help text (verbatim) | storage | defining pass | wiki |
|---|---|---|---|---|---|---|
mlir-disable-threading | cl::opt<bool> | false | Disable multi-threading within MLIR, overrides any further call to MLIRContext::enableMultiThreading() | name@0x502E5E2, help@0x502E618 | MLIRContext | infra/threading-and-synchronization |
mlir-elide-elementsattrs-if-larger | cl::opt<uint> | (per MLIR) | Elide ElementsAttrs with "..." that have more elements than the given upper limit | name@0x502CB20 | AsmPrinter | bytecode/asm-printer-status |
mlir-elide-resource-strings-if-larger | cl::opt<uint> | (per MLIR) | Elide printing value of resources if string is too long in chars. | name@0x502CBA0 | AsmPrinter | bytecode/asm-printer-status |
mlir-output-format | cl::opt<enum> | text | Display method for timing data | name@0x502F5BB, help@0x502F5D0 | PassTiming | mlir-infra/overview |
mlir-pretty-debuginfo | cl::opt<bool> | false | Prints out debug info using the pretty forms ignoring raw loc forms | name@0x502CDAE | AsmPrinter | bytecode/asm-printer-status |
mlir-print-assume-verified | cl::opt<bool> | false | Skip op verification when using custom printers | name@0x502CDDA, help@0x502CC10 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-debuginfo | cl::opt<bool> | false | Print debug info in pretty form | name@0x502CD99 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-elementsattrs-with-hex-if-larger | cl::opt<int> | -1 (disabled) | Print DenseElementsAttrs with a hex string that have more elements than the given upper limit (use -1 to disable) | name@0x502CA78 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-local-scope | cl::opt<bool> | false | Print with local scope and inline information (eliding aliases for attributes, types, and locations) | name@0x502CDF5, help@0x502CC40 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-op-generic | cl::opt<bool> | false | Print all operations using the generic assembly form | name@0x502CDC4 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-op-on-diagnostic | cl::opt<bool> | true | When a diagnostic is emitted on an operation, also print the operation as an attached note | name@0x502E5F9, help@0x502E680 | Diagnostics | mlir-infra/diagnostic-abi-and-helpers |
mlir-print-skip-regions | cl::opt<bool> | false | Skip regions when printing ops. | name@0x502CE0C, help@0x502CCA8 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-stacktrace-on-diagnostic | cl::opt<bool> | false | When a diagnostic is emitted, also print the stack trace as an attached note | name@0x502E6E0, help@0x502E708 | Diagnostics | mlir-infra/diagnostic-abi-and-helpers |
mlir-print-unique-ssa-ids | cl::opt<bool> | false | Print unique SSA ID numbers for values, block arguments and naming conflicts across all regions | name@0x502CE3B, help@0x502CD10 | AsmPrinter | bytecode/asm-printer-status |
mlir-print-value-users | cl::opt<bool> | false | Print users of operation results and block arguments as a comment | name@0x502CE24, help@0x502CCC8 | AsmPrinter | bytecode/asm-printer-status |
mlir-timing | cl::opt<bool> | false | Display execution times | name@0x502F565, help@0x502F571 | PassTiming | mlir-infra/overview |
mlir-timing-display | cl::opt<enum> | list | Output format for timing data | name@0x502F589, help@0x502F59D | PassTiming | mlir-infra/overview |
mlir-use-nameloc-as-prefix | cl::opt<bool> | false | Print SSA IDs using NameLocs as prefixes | name@0x502CE55, help@0x502CD70 | AsmPrinter | bytecode/asm-printer-status |
Enum values — mlir-timing-display: list = display the results in a list sorted by total time (0x502F5F0); tree = display the results ina with a nested tree view (0x502F628, verbatim typo ina preserved from upstream). mlir-output-format: text = display the results in text format (0x502F658); json = display the results in JSON format (0x502F680).
Layer 7 — Misc LLVM cl::opt
| option | type | default | help text (verbatim) | storage | defining pass | wiki |
|---|---|---|---|---|---|---|
disable-i2p-p2i-opt | cl::opt<bool> | false | Disables inttoptr/ptrtoint roundtrip optimization | name@0x4FF26BA, help@0x4FF2688 | llvm/Analysis/ValueTracking.cpp | upstream LLVM ValueTracking |
PassBuilder Mega-Registry Note
- The 478 pretty-name-keyed ("llvm::TPass]") entries + 66 naked-class entries + 7 special-form entries (551 total) inserted into the PassBuilder StringMap<PassInfo> by sub_1CCB7D0 (35948 B) are downstream LLVM. Twenty NVIDIA-private interleavings (check-gep-index, check-kernel-functions, cnp-launch-check, ipmsp, nv-early-inliner, nv-inline-must, nvvm-pretreat, nvvm-verify, printf-lowering, select-kernels, nvvm-aa, kernel-info, nvvm-reflect-pp, nvvm-peephole-optimizer, propagate-alignment, reuse-local-memory, memory-space-opt, lower-aggr-copies, lower-struct-args, process-restrict) are pipeline-text-parser keys for --passes=..., not freestanding cl::opts; factory functors live in parseModulePass / parseCGSCCPass / parseFunctionPass / parseLoopPass / parseMachinePass. Full table: pipeline/passbuilder-mega-registry.
Environment Variable and Runtime Gate Catalog
tileiras consumes external configuration through two distinct mechanisms.
The first is the libc getenv(3) family, reachable via the PLT stub at
0x004055B0 and through the wrapper sub_45AE9A0 (getenv-into-
std::string). The second is a band of process-wide scalar globals in
the 0x5B6xxxx region of .bss/.data -- the "runtime gates" -- whose
default values are written by C++ static constructors during
dynamic-linker init, bound by name to LLVM cl::opt storage, then read
directly by optimizer passes. Both mechanisms wire up during program
startup: env vars are pulled lazily on first consumer use (or at ctor
time for a couple of LLVM-Support strings), while gates populate
unconditionally before main runs and then optionally get overwritten by
--<flag> command-line arguments parsed through the LLVM CommandLine
library. Together they form the entire externally-tunable surface of
tileiras -- no config file, no JSON, no INI, just env vars plus
cl::opt flags backed by these scalar globals.
Table 1: Environment Variables
Columns: env var name | consumer sub_ADDR | behavior | default when unset.
| Env var | Consumer (sub_ADDR) | Behavior | Default |
|---|---|---|---|
MLIR_ENABLE_EVO | sub_2D381B0 (serializeAndDumpSass) | Master gate for the ptxas-knob-file path. Tested for non-null only -- any value (including "0", "false", and the empty string) enables. When set together with PTX_KNOBS_PATH, --knobs-file=<path> is appended to basePTXOptions before ptxas is spawned. | Disabled -- knob-file path is skipped via early goto even when PTX_KNOBS_PATH is set. |
PTX_KNOBS_PATH | sub_2D381B0 | Path to a ptxas internal-knob text file. Forwarded verbatim as --knobs-file=<path>; tileiras itself does not parse the contents. Append uses libstdc++'s string max-size guard 0x3FFFFFFFFFFFFFFFLL - 14. | Disabled. AND-gated on MLIR_ENABLE_EVO; unsetting either skips the append. |
TILE_AS_DEBUG_UNLIMITED_SMEM | sub_12C8DF0 (TileAS memory planner) | String-equality test against literal "1" via sub_44E1F60. When equal, the per-CTA dynamic shared-memory ceiling used by the memory planner is raised from 232448 B (227 KiB, the Blackwell SM100/SM103/SM120 limit) to 0x7FFFFFFF, effectively no ceiling. Used to bypass smem-overcommit checks for diagnostic compiles. | Ceiling = 232448 B (0x38C00). Stored as ptr[16] = max-smem-per-CTA in the per-kernel DenseMap built at sub_12BB050. |
TILEIR_PREFER_TMA_FOR_LOAD_STORE | sub_7B6970 (TMA-vs-cp.async chooser) | Value is SSO-copied and then string-compared against "true" / "false" downstream. "true" selects the TMA cp.async.bulk path for load/store legalization on SM100+; absence defaults the comparison RHS to literal "false", leaving TMA non-preferred. Non-boolean values fall through with implementation-defined effect. | "false" (5 bytes) -- TMA path is not preferred; legacy cp.async or vector load/store wins the heuristic. |
TILEIR_DELAY_TMA_STORE_WAIT | sub_8D9DD0 via sub_45AE9A0 | Active only when *a1 == 3 (TMA-store pipeline tier). Read into a std::string, then parsed with std::stoi semantics (strtol(base=10) after errno clear). Empty / non-numeric input throws std::invalid_argument("stoi"); out-of-range input throws std::out_of_range("stoi"). Final return is parsed != 0. Effect: defers the cp.async.bulk.wait_group barrier after a TMA store. | Disabled -- function returns (*a1 == 4) as the default (pipeline-tier gate only); env-var absence leaves delay-wait off. |
TILEIR_ALWAYS_SWIZZLE | sub_7A9D60 (swizzle selector) | Returns 1 (true) immediately if the env-var is non-null. Any value -- including "0", "false", "no" -- short-circuits the swizzle-selection chain (sub_7A9520, sub_7A9D30, sub_79DA60, sub_7A9750) and forces the swizzled layout. Diagnostic switch only. | Disabled -- swizzle heuristic runs normally. |
CUDA_ROOT | sub_5773C0 (driver) and sub_1A41D30 (NVVM::getCUDAToolkitPath()) | First probe in both chains. sub_5773C0 SSO-copies into an std::string; sub_1A41D30 returns the raw const char * from getenv memory. | Falls through to CUDA_HOME. |
CUDA_HOME | sub_5773C0 and sub_1A41D30 | Second probe. Same copy-semantics per resolver as CUDA_ROOT. | Falls through to CUDA_PATH. |
CUDA_PATH | sub_5773C0 and sub_1A41D30 | Third probe. | sub_5773C0: falls back to sub_45AA3C0(scratch, argv[0]) -- a /proc/self/exe walk that strips two trailing path components (bin/). sub_1A41D30: returns byte_4FA453E (rodata empty/null sentinel), which produces the user-visible "Please specify the toolkit path" error from sub_1A41DB0. |
LLVM_OVERRIDE_PRODUCER | ctor_611 @ 0x00538D90 | Read once during C++ static-ctor execution. Stored into global qword_5BDF538 -- the producer string used by the disable-bitcode-version-upgrade cl::opt at LLVM bitcode load time. | Built-in a2100git rodata symbol (LLVM version tag). |
LLVM_DISABLE_SYMBOLIZATION | sub_45B5AC0 (LLVMSymbolizer probe) | Presence (any non-null) disables the in-process llvm-symbolizer invocation used by PrettyStackTrace / signal-handler backtraces. | Symbolization enabled (subject to finding the symbolizer binary). |
LLVM_SYMBOLIZER_PATH | sub_45B5AC0 | Absolute or relative path to a custom llvm-symbolizer. When set, strlen is passed to sub_45B0940 (program-path resolver) and PATH search is bypassed. | PATH-walk for basename "llvm-symbolizer" (15 bytes). |
LLVM_ENABLE_SYMBOLIZER_MARKUP | sub_45B6090 | Empty / unset early-returns 0. Non-empty engages the {{{bt:...}}} symbolizer-markup pipeline for stack traces. | Markup path skipped. |
HOME | sub_45AC290 (llvm::sys::path::home_directory) | Copied to out-SmallString if non-null. Else falls back to getpwuid_r(getuid(), …, sysconf(_SC_GETPW_R_SIZE_MAX)=70) and uses pw_dir. | passwd-database pw_dir; if also empty, returns failure. |
PWD | sub_45AA940 (llvm::sys::fs::current_path) | getenv("PWD"), then validates dev+ino equality against "." via sub_45A9AD0/sub_45A6C90. On match, uses $PWD (preserves symlink spelling). | getcwd(3) with adaptive doubling buffer from 4096 up. |
PATH | sub_45AA3C0 (GetMainExecutable) and sub_45B0940 (ExecuteAndWait) | sub_45AA3C0: only when argv[0] lacks /; tokenized via strtok_r(":", …), each entry tried with realpath + __xstat. sub_45B0940: only when caller's pre-resolved-path arg is null. | /proc/self/exe resolution skips PATH when argv[0] contains /. |
TMPDIR / TMP / TEMP / TEMPDIR | sub_45ACEA0 (llvm::sys::path::system_temp_directory) | Probed in this order; first non-null wins, copied into out-SmallString. | Fallback writes 4-byte literal 0x706D742F (= "/tmp") when ErasedOnReboot path taken. |
TERM | sub_45AE730 (color-term detection) | strlen($TERM)-switched comparison against hard-coded ASCII packs for ansi, cygwin, linux, xterm, vt100, screen, rxvt. Generic tail-check accepts any value containing "color". | Returns 0 (colors disabled). |
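Table 1 mixes three test styles: presence-only, exact string match, and "unset behaves like a default literal". A hedged sketch of those styles follows. The helper names are invented for illustration, each function takes the raw getenv result so it stays pure, and `prefer_tma` simplifies the non-boolean case (which the table calls implementation-defined) to false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Presence test, as described for MLIR_ENABLE_EVO, TILEIR_ALWAYS_SWIZZLE and
 * LLVM_DISABLE_SYMBOLIZATION: only null vs non-null matters. */
static bool env_is_present(const char *value)
{
    return value != NULL;
}

/* Exact-match test, as described for TILE_AS_DEBUG_UNLIMITED_SMEM == "1". */
static bool env_equals(const char *value, const char *literal)
{
    assert(literal != NULL);
    return value != NULL && strcmp(value, literal) == 0;
}

/* TILEIR_PREFER_TMA_FOR_LOAD_STORE: an unset variable is treated as "false".
 * Assumption: any value other than "true" is treated as not preferred. */
static bool prefer_tma(const char *value)
{
    return env_equals(value != NULL ? value : "false", "true");
}

enum {
    SMEM_CEILING_DEFAULT = 232448,       /* 227 KiB */
    SMEM_CEILING_UNLIMITED = 0x7FFFFFFF  /* debug override */
};

/* Per-CTA shared-memory ceiling seen by the memory planner. */
static int32_t smem_ceiling_bytes(const char *unlimited_smem_value)
{
    return env_equals(unlimited_smem_value, "1") ? SMEM_CEILING_UNLIMITED
                                                 : SMEM_CEILING_DEFAULT;
}

/* The knob-file append needs both variables to be present. */
static bool knobs_file_enabled(const char *enable_evo, const char *knobs_path)
{
    return env_is_present(enable_evo) && env_is_present(knobs_path);
}
```

A caller would pass `getenv("TILE_AS_DEBUG_UNLIMITED_SMEM")` and friends directly; keeping getenv out of the helpers makes the semantics easy to check in isolation.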
Two other getenv-touching helpers exist but stay dormant: one for
command-line-option scanning (sub_4535E90 -- llvm::cl::ParseCommandLineOptions's EnvVar
arg, dormant because main passes null) and one for response-file expansion
(sub_45AEBB0). They are listed for completeness; in production runs of
tileiras they read no environment variables.
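The probe chains in Table 1 (CUDA_ROOT, CUDA_HOME, CUDA_PATH, and the temp-directory list) all follow a first-non-null-wins pattern. The sketch below shows that pattern under stated assumptions: the function names are hypothetical, the arguments are the raw getenv results, and the caller supplies the toolkit fallback because the two resolvers differ there (executable-relative path versus the empty sentinel).

```c
#include <assert.h>
#include <stddef.h>

/* Returns the first non-null candidate, or fallback if all are null. */
static const char *first_non_null(const char *const candidates[], size_t count,
                                  const char *fallback)
{
    assert(candidates != NULL);
    for (size_t i = 0; i < count; ++i)
        if (candidates[i] != NULL)
            return candidates[i];
    return fallback;
}

/* Probe order from Table 1: CUDA_ROOT, then CUDA_HOME, then CUDA_PATH. */
static const char *resolve_toolkit_root(const char *cuda_root,
                                        const char *cuda_home,
                                        const char *cuda_path,
                                        const char *fallback)
{
    const char *const probes[] = { cuda_root, cuda_home, cuda_path };
    return first_non_null(probes, 3, fallback);
}

/* Probe order from Table 1: TMPDIR, TMP, TEMP, TEMPDIR, then "/tmp". */
static const char *resolve_temp_dir(const char *tmpdir, const char *tmp,
                                    const char *temp, const char *tempdir)
{
    const char *const probes[] = { tmpdir, tmp, temp, tempdir };
    return first_non_null(probes, 4, "/tmp");
}
```

In use, each argument would be the matching `getenv(...)` result. Whether an empty-but-set CUDA variable counts as a hit is not stated in the table; this sketch treats any non-null value as a hit.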
Table 2: Runtime Gates
Columns: address (byte_* / qword_*) | populating ctor | default value |
consumer pass / routine. Width is 1 byte for byte_*, 4 bytes for the
low-DWORD of a qword_* integer slot. Numeric cl::opt<int> slots are
laid out as [qword_X] (Initial) + [qword_X+0x10] (Default) plus a
BYTE4(qword_X+0x10)=1 "has-default" flag; consumers always read the
Initial via LODWORD.
The consumer-side read of any cl::opt<int> gate is a literal mov from
the Initial slot, equivalent to:
#include <stdbool.h>
#include <stdint.h>
/* every pass that consults a numeric gate compiles to this shape; the
 * address is baked in by static-init linkage, so the call site is one
 * load and no indirection. */
static inline int32_t read_gate_int(const void *initial_slot) {
    return *(const int32_t *)initial_slot;
}
/* boolean gates use a 1-byte slot; nonzero means enabled. */
static inline bool read_gate_bool(const uint8_t *byte_slot) {
    return *byte_slot != 0;
}
Honouring the same --flag surface in a reimplementation only requires
wiring each per-option storage cell to a cl::opt<T> so
ParseCommandLineOptions can overwrite it; everything downstream is a
direct read.
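The numeric slot layout described above (Initial at +0x00, Default at +0x10, has-default flag at BYTE4 of the second qword) can be mirrored with a small struct. This is a sketch only: the struct and function names are hypothetical, and the bytes at +0x04..+0x0F are not described in this section, so they appear here as opaque padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one numeric gate slot. */
struct int_gate_slot {
    int32_t value;          /* +0x00: "Initial", the live value passes read */
    uint8_t reserved[12];   /* +0x04: not described in the text; opaque here */
    int32_t default_value;  /* +0x10: "Default" */
    uint8_t has_default;    /* +0x14: BYTE4 of the qword at +0x10 */
};

_Static_assert(offsetof(struct int_gate_slot, default_value) == 0x10,
               "Default must sit 0x10 bytes after Initial");
_Static_assert(offsetof(struct int_gate_slot, has_default) == 0x14,
               "has-default flag must be BYTE4 of the second qword");

/* What a static constructor does: install the default in both places. */
static void gate_install_default(struct int_gate_slot *slot, int32_t value)
{
    assert(slot != NULL);
    slot->value = value;
    slot->default_value = value;
    slot->has_default = 1;
}

/* What command-line parsing does: overwrite only the live value. */
static void gate_override(struct int_gate_slot *slot, int32_t value)
{
    slot->value = value;
}

/* What a consumer pass does: one direct load of the live value. */
static int32_t gate_read(const struct int_gate_slot *slot)
{
    return slot->value;
}
```

Keeping the default in a separate field means an override never destroys the constructor-installed value, which is what lets a help printer still report the default after parsing.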
| Address | Populator | Default | Consumer / cl::opt name |
|---|---|---|---|
byte_5B6A640 | ctor_372 @ 0x491070 | 0 | -basic-dbe -- basic dead-barrier-elim (sub_27DD410) |
unk_5B6A5A0 (location) | ctor_371 @ 0x490FF0 | 1 | -opt-unsafe-algebra -- cl::location external (sub_27D7EE0 UFSimp) |
qword_5B6AEC0 (lo32) | ctor_374_0 @ 0x492550 | 8 | -scev-cgp-cross-block-limit |
qword_5B6B040 (lo32) | ctor_374_0 | 3 | -scev-cgp-idom-level-limit |
qword_5B6B100 (lo32) | ctor_374_0 | 500 | -scev-cgp-inst-limit |
qword_5B6B700 (lo32) | ctor_374_0 | 4096 | -scev-cgp-tid-max-value |
qword_5B6B880 (lo32) | ctor_374_0 | -1 | -scev-cgp-control (transformation budget) |
qword_5B6BA00 (lo32) | ctor_374_0 | 2 | -do-function-scev-cgp |
byte_5B6BAC0 | ctor_374_0 | 1 | -do-scev-cgp-aggresively (sic) |
qword_5B6BC40 (lo32) | ctor_374_0 | 0 | -dump-base-address-strength-reduce |
qword_5B6BDC0 (lo32) | ctor_374_0 | 4 | -do-base-address-strength-reduce (BASR master, 0..4) |
qword_5B6BE80 (lo32) | ctor_374_0 | 2 | -do-scev-cgp (module-level enable) |
byte_5B6BF60 | ctor_375 @ 0x4934D0 | 0 | -enable-fma-to-ffma2 |
byte_5B6C020 | ctor_375 | 1 | -enable-dot (DOT lowering master) |
byte_5B6C0E0 | ctor_375 | 1 | -aggressive-no-sink |
qword_5B6C1A0 (lo32) | ctor_375 | 64 | -max-chain-length (idpa cap) |
qword_5B6C260 (lo32) | ctor_375 | 2 | -max-chain-width |
byte_5B6C320 | ctor_375 | 1 | -balance-dot-chain |
qword_5B6C3E0 (lo32) | ctor_376 @ 0x494040 | -1 | -do-clone-for-ip-msp |
qword_5B6C4B0 (BYTE4) | ctor_376 | 0 | -dump-ip-msp |
byte_5B6CAC0 | ctor_378 @ 0x494DB0 | 1 | -lsa-opt -- copy-struct-args-to-local |
byte_5B6CC40 | ctor_379_0 @ 0x495350 | 1 | -track-indir-load |
byte_5B6CD00 | ctor_379_0 | 0 | -dump-ir-after-memory-space-opt |
byte_5B6CDC0 | ctor_379_0 | 0 | -dump-ir-before-memory-space-opt |
unk_5B6CF80 (location) | ctor_379_0 | 1 | -param-always-point-to-global (via qword_5B6CE80 location ptr) |
byte_5B6D4C0 | ctor_381 @ 0x495F80 | 1 | -nvvm-lower-printf |
byte_5B6D580 | ctor_382 @ 0x496190 | 0 | -dump-process-restrict |
qword_5B6D640 (lo32) | ctor_382 | 1 | -process-restrict (master enable) |
byte_5B6D700 | ctor_382 | 0 | -apply-multi-level-restrict |
byte_5B6D7C0 | ctor_382 | 0 | -allow-restrict-in-struct |
qword_5B6E480 (header) | ctor_385 @ 0x4975D0 | empty | -select-kernel-range (cl::list) |
qword_5B6E580 (header) | ctor_385 | empty | -select-kernel-list (cl::list) |
qword_5B6E848 (header) | ctor_387 @ 0x497BF0 | empty | NVPTXSetFunctionLinkages range list (cl::list) |
qword_5B6E908 (header) | ctor_387 | empty | NVPTXSetFunctionLinkages name list (cl::list) |
A second class of gates is .bss-resident with no static writer: they
default to zero and flip the first time their runtime consumer touches
them, behaving as a one-shot latch. They never appear in --help
because they are never bound to cl::opt storage -- pure runtime state.
Examples: byte_5B6AF80 (scev-cgp-check-latency cache, written by
sub_27F7D20), byte_5B6B4C0 (BASR pre-filter predicate, written by
sub_2800C10), dword_5B6B7C0 and dword_5B6B940 (SCEV-CGP runtime
counters), and byte_5B6D260 (MSPO cfg-selector switch set on first call
to sub_2862FD0).
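The one-shot latch behaviour can be sketched as follows. The function name is hypothetical, and the sketch is single-threaded; nothing in this section says whether the real consumers synchronise the flip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flips a zero-initialised byte slot to 1 and reports whether this call
 * was the one that flipped it. Later calls see the latch already set. */
static bool latch_flip_once(uint8_t *slot)
{
    assert(slot != NULL);
    if (*slot != 0)
        return false;  /* already latched by an earlier call */
    *slot = 1;
    return true;       /* this call performed the one-time transition */
}
```

A consumer would run its first-use work only when `latch_flip_once` returns true; because no static constructor or cl::opt ever writes the slot, the zero start state comes for free from .bss.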
How env vars get propagated to passes
The lifecycle is split across three phases.
- Phase 1 (ctor time): at dynamic-linker init the C++ static
  constructors ctor_3xx run and write default values into the
  0x5B6xxxx band, simultaneously calling
  sub_4534CC0(..., "<name>", <len>) to register each gate's textual
  flag name with LLVM's CommandLine global registry. A few env vars are
  pulled this early too: LLVM_OVERRIDE_PRODUCER in ctor_611 lands in
  qword_5BDF538 before main ever runs, so changing it post-launch has
  no effect.
- Phase 2 (main startup): the driver invokes
  llvm::cl::ParseCommandLineOptions, which walks argv and overwrites
  any gate whose --<name> flag appears, leaving all others at their
  ctor-installed defaults. The driver then calls sub_5773C0 to populate
  its toolkit-root std::string from CUDA_ROOT / CUDA_HOME / CUDA_PATH
  (or /proc/self/exe), and sub_1A41D30 does the same for NVVM later
  when libnvvm is asked to locate libdevice.
- Phase 3 (per-pass / per-kernel): consumer passes read the gates
  directly through the global addresses cached at compile time. There
  is no getOption() indirection at the call site; the compiler emitted
  a literal mov from 0x5B6xxxx.
The TILEIR- and TILE_AS-prefixed env vars are an exception to this
static ladder: each is fetched on first use inside its consumer
(sub_7B6970, sub_7A9D60, sub_8D9DD0, sub_12C8DF0), bypassing the gate
band entirely because they were never registered with CommandLine. The
result is two parallel surfaces -- the cl::opt-backed gates that respond
to both --flag and (for a handful) env vars, and the standalone TILEIR /
TILE_AS env vars that have no --flag equivalent and are reachable only
by setting the variable.
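The defaults-then-override-then-direct-read ordering can be sketched in plain C. This is an illustration under assumptions, not the LLVM CommandLine implementation: the `gate` struct and function names are hypothetical, a statically initialised array stands in for the ctor-time writes, and only the `--<name>=<int>` spelling is handled. The three gate names and defaults are taken from Table 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct gate {
    const char *name;  /* textual flag name registered at ctor time */
    int32_t value;     /* live value read by passes */
};

/* Phase 1 stand-in: defaults are in place before any argument is parsed. */
static struct gate gates[] = {
    { "scev-cgp-inst-limit", 500 },
    { "max-chain-length", 64 },
    { "process-restrict", 1 },
};

/* Phase 2: overwrite every gate named by a --<name>=<int> argument and
 * leave all others at their defaults. Returns the number of overrides. */
static int apply_command_line(int argc, const char *const *argv)
{
    assert(argv != NULL);
    int overridden = 0;
    for (int i = 1; i < argc; ++i) {
        const char *arg = argv[i];
        if (strncmp(arg, "--", 2) != 0)
            continue;
        arg += 2;
        const char *eq = strchr(arg, '=');
        if (eq == NULL)
            continue;
        size_t name_len = (size_t)(eq - arg);
        for (size_t g = 0; g < sizeof gates / sizeof gates[0]; ++g) {
            if (strlen(gates[g].name) == name_len &&
                strncmp(gates[g].name, arg, name_len) == 0) {
                gates[g].value = (int32_t)strtol(eq + 1, NULL, 10);
                ++overridden;
            }
        }
    }
    return overridden;
}

/* Phase 3: consumers read the slot directly, with no lookup by name. */
static int32_t read_gate(size_t index)
{
    return gates[index].value;
}
```

The standalone TILEIR / TILE_AS variables would not appear in such a table at all; their consumers call getenv on first use instead.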
Cross-References
Performance and Cost Model lists the subset of these tunables that shift cost-model behaviour at compile and runtime, including the SMEM ceiling, the TMA-preferred layout bias, and the warp-specialisation threshold.
TypeID Sentinel Address Table
An MLIR TypeID is the runtime identity tag MLIR uses for fast structural
RTTI: every concrete C++ class the framework compares at runtime (a
Dialect, an Op, a Type, an Attribute, an Interface, even a
Trait) has exactly one mlir::TypeID value associated with it through
mlir::TypeID::get<T>(). The implementation produces that value by
reading the address of a per-class static storage object — the address
itself is the identity. A typical compiled MLIR binary therefore
contains hundreds of one-byte (or eight-byte Meyers-cached) anonymous
globals in .bss / .rodata whose sole role is to be compared against
each other by pointer equality. In tileiras (88 MB Blackwell-era CUDA
13.1 MLIR-based optimizing assembler) those sentinels cluster densely in
the 0x5B37B90 .. 0x5BE6138 band of the static-data segment, and one
sentinel address suffices to back-trace from a stripped function to the
exact Op/Type/Attr class it dispatches on. This page is the canonical
reverse-direction lookup table: address in, dialect-and-class out.
The binary uses two sentinel idioms in parallel. First, static
pointer-identity sentinels: one-byte .bss slots whose address is the
TypeID. No code ever writes the byte; the pointer is the value. These
dominate the cute / cute_nvgpu / nv_tileas / NVVM op slabs. Second,
Meyers-cached sentinels: an {8-bit guard, 64-bit qword} pair where
the qword fills in on first use by interning a C++-mangled
mlir::TypeID::getFullName() string through a process-wide pool
(sub_44A6CA0 in this binary; upstream MLIR ships the same
RTTI-string-to-pointer interner under llvm::ManagedStatic). After init,
the qword holds the TypeID. These dominate the cute interface anchors
and a few standalone singletons. Both forms reach the per-op
dispatcher exactly the same way: through a load of *(qword*)(*(qword*)(op + 48) + 16) (the OperationName::TypeID slot) and a pointer-equality test
against a sentinel address baked into the dispatcher arm.
A third special case is the "shared no-properties guard"
&unk_5BE6138 — the global OperationName::TypeID reserved for the
sentinel class mlir::detail::UnregisteredOpProperties. Every
NVVM-to-LLVM and TileAS layout-classifier dispatcher tests against it
first to short-circuit the no-properties path or detect an op being
mid-rewritten. Every arm references it, making it the single most-cited
sentinel in the binary.
How sentinels are consumed at runtime
Pointer-identity and Meyers-cached sentinels reach the dispatcher through
the same OperationName::TypeID slot; only the lazy-init step differs.
The minimum-cost lookup that a reimplementation must reproduce is:
#include <stdbool.h>
#include <stdint.h>
/* External symbols assumed by this sketch: the Itanium C++ ABI guard
 * entry points (the guard object is 64 bits wide and its first byte is
 * the "initialized" flag) and the process-wide interner. */
extern int __cxa_guard_acquire(int64_t *guard_object);
extern void __cxa_guard_release(int64_t *guard_object);
extern const void *intern_typeid_string(const char *type_full_name);
/* Pointer-identity sentinel: the address is the TypeID. */
const void *type_id_pointer_identity(const void *sentinel_byte_slot) {
    return sentinel_byte_slot; /* no load; pointer is the value */
}
/* Meyers-cached sentinel: first call interns the C++ mangled
 * mlir::TypeID::getFullName() string through the process-wide pool
 * (sub_44A6CA0 in this binary), races resolved by the Itanium ABI
 * guard byte. After init, the qword holds the TypeID. */
const void *type_id_meyers_cached(uint8_t *guard, const void **qword,
                                  const char *type_full_name) {
    if (__atomic_load_n(guard, __ATOMIC_ACQUIRE) == 0) {
        if (__cxa_guard_acquire((int64_t *)guard)) {
            *qword = intern_typeid_string(type_full_name);
            __cxa_guard_release((int64_t *)guard);
        }
    }
    return *qword;
}
/* Dispatch is pointer-equality on the resolved TypeID, applied against
 * the OperationName::TypeID slot reached through Operation+0x30 ->
 * OperationName::Impl+0x10. */
static inline bool op_is_sentinel(const void *op, const void *sentinel) {
    const void *opname_impl = *(const void *const *)((const uint8_t *)op + 0x30);
    const void *type_id = *(const void *const *)((const uint8_t *)opname_impl + 0x10);
    return type_id == sentinel;
}
Allocating a fresh TypeID storage object on every call, instead of going through one static slot, produces a new identity per call, which makes pointer-equality dispatch impossible. The address-band discipline below (every sentinel of a kind lives in one contiguous slab emitted by one translation unit) is what guarantees one address per kind.
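The difference between a static slot and per-call storage is easy to demonstrate. In this sketch the sentinel and function names are hypothetical, and malloc stands in for any scheme that creates new storage on each call.

```c
#include <assert.h>
#include <stdlib.h>

/* One static byte per kind; nothing ever writes these. */
static char view_sentinel;
static char store_sentinel;

/* Correct: every call for a given kind returns the same address. */
static const void *type_id_view(void)  { return &view_sentinel; }
static const void *type_id_store(void) { return &store_sentinel; }

/* Broken: each call yields different storage, so two "TypeIDs" for the
 * same kind never compare equal. The caller owns the returned block. */
static void *type_id_view_broken(void)
{
    void *fresh = malloc(1);
    assert(fresh != NULL);
    return fresh;
}
```

With the static form, `type_id_view() == type_id_view()` always holds and distinct kinds always differ; with the broken form, two live results for the same kind are guaranteed to differ, so no dispatcher arm could ever match.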
Address-band index
The table partitions the sentinel space into the contiguous bands the linker emitted for each dialect / category. Numbers under "Count" are the distinct sentinels inside that band referenced elsewhere in the binary; the rest is padding.
| Band | Count | Owner | Form |
|---|---|---|---|
0x5B37B90 .. 0x5B37C28 | 5 | Upstream MLIR Op/DialectInterface anchors | Meyers (8-byte qword) |
0x5B37BE8 .. 0x5B37BF0 | 2 | Dialect one-shot init guards | Guard byte |
0x5B37F20 .. 0x5B38170 | 4 | cuda_tile AbstractOperation singletons (.data.rel.ro) | Pointer-identity |
0x5B38080, 0x5B381A8 | 2 | cuda_tile misc AttributeConcept / OperationState | Pointer-identity |
0x5B38BB0 .. 0x5B38BC8 | 4 | cuda_tile dialect Type TypeIDs | Pointer-identity |
0x5B38C40 .. 0x5B38C68 | 2 | nv_tile_ir::as Op-interface anchors | Meyers |
0x5B38F80 | 1 | TmaDescriptorTypeInterface anchor | Meyers |
0x5B445F8 .. 0x5B44890 | 3 | cutlass_ir::cute Layout / View / CopyAtom interfaces | Meyers |
0x5B44EB8 .. 0x5B44FD8 | 21 | nv_tileas op-info kindPtr singletons | Pointer-identity |
0x5B44F08 | 1 | nv_tileas op-ctor descriptor block tag | Pointer-identity |
0x5B452B0 .. 0x5B45970 | 6 | nv_tileas per-op attribute-vector sentinels | Pointer-identity |
0x5B45370 | 1 | nv_tileas pragma ocg* attr-vector | Pointer-identity |
0x5B46980 .. 0x5B469A0 | 2 | nv_tileaa NamedAttr-vector slots | Pointer-identity |
0x5B46D28 .. 0x5B46F68 | 33 | nv_tileaa per-op FoldRecord descriptors | Pointer-identity |
0x5B46E08, 0x5B46E80, 0x5B46E88, 0x5B46F30, 0x5B46FA0, 0x5B46FA8 | 6 | nv_tileaa producer-side / element-type sentinels | Pointer-identity |
0x5B46FF0 .. 0x5B470D0 | 8 | cutlass_ir::cute core type-interface anchors | Meyers |
0x5B47490 .. 0x5B476A0 | ~20 | cutlass dialect per-op OpInfoBlock | Pointer-identity |
0x5B47FF8 .. 0x5B481A8 | 49 | cute_nvgpu Op TypeIDs (slab) | Pointer-identity |
0x5B482C8 | 1 | cute_nvgpu dialect TypeID | Pointer-identity |
0x5B48580 .. 0x5B48B20 | 12 | cute_nvgpu per-op attribute-table sentinels | Pointer-identity |
0x5B48D88 .. 0x5B48E58 | 27 | cute_nvgpu concrete Type TypeIDs | Pointer-identity |
0x5B496B8 | 1 | cute dialect TypeID | Pointer-identity |
0x5B49A98 .. 0x5B49B18 | 17 | cute dialect concrete Type TypeIDs | Pointer-identity |
0x5B8D610 .. 0x5B8DCB8 | 213 (197 referenced) | NVVM Op TypeID slab | Pointer-identity (8-byte slot stride) |
0x5BA8F60 | 1 | LLVM dialect TypeID | Pointer-identity |
0x5BAADB8 | 1 | IntegerType variant (i32 / blocked layout id 1) | Pointer-identity |
0x5BE3FF8 | 1 | scf.if AbstractOperation kindPtr | Pointer-identity |
0x5BE4008 | 1 | nv_tileas.convert_layout AbstractOperation kindPtr | Pointer-identity |
0x5BE5858 | 1 | arith.constant AbstractOperation kindPtr | Pointer-identity |
0x5BE5908 | 1 | arith dialect TypeID | Pointer-identity |
0x5BE5C40 | 1 | nv_tileas.async.pipeline.consume_one (paired form) | Pointer-identity |
0x5BE5FC0 .. 0x5BE6138 | ~10 | MLIR builtin FloatType / FloatVariant table | Pointer-identity |
0x5BE6138 | 1 | Shared no-properties / null-OperationName guard | Pointer-identity |
The runtime invariant this layout captures: a sentinel address in
0x5B44E* / 0x5B44F* is an OperationName::opInfo slot (the
descriptor passed at registration time), whereas one in 0x5BE3F* /
0x5BE4* / 0x5BE5* is the paired kindPtr slot
(AbstractOperation::TypeID) that ends up in op->getName().getTypeID()
after uniquing. The two ranges contain duplicates of each op identity at
two different indirection levels; resolvers and rewriters generally
compare against the kindPtr form, op-builders and registrars against
the opInfo form.
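The prefix rule in the paragraph above can be written as a coarse classifier. This is a sketch with hypothetical names; it implements only the stated address prefixes, so it will also label unrelated sentinels that happen to share a prefix (for example, the FloatType table entries at 0x5BE5FC0 and up) and should be used as a first hint, not a proof.

```c
#include <stdint.h>

enum sentinel_level {
    LEVEL_UNKNOWN = 0,
    LEVEL_OP_INFO,   /* OperationName::opInfo slot (registration descriptor) */
    LEVEL_KIND_PTR   /* AbstractOperation::TypeID slot (after uniquing) */
};

/* Classifies by address prefix: 0x5B44E* and 0x5B44F* are opInfo;
 * 0x5BE3F*, 0x5BE4* and 0x5BE5* are kindPtr. */
static enum sentinel_level classify_sentinel(uint64_t address)
{
    uint64_t top5 = address >> 8;   /* drops the low two hex digits */
    uint64_t top4 = address >> 12;  /* drops the low three hex digits */

    if (top5 == 0x5B44E || top5 == 0x5B44F)
        return LEVEL_OP_INFO;
    if (top5 == 0x5BE3F || top4 == 0x5BE4 || top4 == 0x5BE5)
        return LEVEL_KIND_PTR;
    return LEVEL_UNKNOWN;
}
```

For example, 0x5B44EB8 (nv_tileas.view) classifies as opInfo, while 0x5BE3FF8 (scf.if) and 0x5BE4008 (nv_tileas.convert_layout) classify as kindPtr.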
Master sentinel table
Sorted by sentinel address, ascending. For each row: dialect, the C++ class or op/type/attr name, byte length of the sentinel's storage (1 for pointer-identity, 8 for the qword half of a Meyers pair, 9 for the guard+qword combined), and the wiki page that documents the matching op / type / interface in detail.
| Sentinel | Dialect | Class / op / attr name | Bytes | First-cited page |
|---|---|---|---|---|
0x5B37B90 | upstream MLIR | RegionBranchTerminatorOpInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B37B98 | upstream MLIR | RegionBranchTerminatorOpInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B37BE8 | upstream MLIR | RegionBranchOpInterface (cache slot) | 8 | dialects/cute/interfaces.md |
0x5B37BF0 | nv_tileaa | dialect one-shot init guard | 1 | dialects/nv_tileaa/index.md |
0x5B37C20 | upstream MLIR | OpAsmDialectInterface (guard) | 1 | dialects/index.md |
0x5B37C28 | upstream MLIR | OpAsmDialectInterface (TypeID qword) | 8 | dialects/index.md |
0x5B37F20 | cuda_tile | cuda_tile.return AbstractOperation (primary) | 1 | dialects/cuda_tile/return.md |
0x5B37FA8 | cuda_tile | cuda_tile.return AbstractOperation (secondary interface) | 1 | dialects/cuda_tile/return.md |
0x5B38080 | cuda_tile | ArrayAttr element AttributeConcept | 1 | dialects/cuda_tile/attrs.md |
0x5B380C0 | cuda_tile | cuda_tile.if AbstractOperation | 1 | dialects/cuda_tile/if.md |
0x5B38170 | cuda_tile | cuda_tile.continue AbstractOperation | 1 | dialects/cuda_tile/continue.md |
0x5B381A8 | cuda_tile | OperationState concept (sub_669F80) | 1 | dialects/cuda_tile/index.md |
0x5B38BB0 | cuda_tile | cuda_tile.partition_view (TypeID) | 1 | dialects/cuda_tile/types.md |
0x5B38BB8 | cuda_tile | cuda_tile.tensor_view (TypeID) | 1 | dialects/cuda_tile/types.md |
0x5B38BC0 | cuda_tile | cuda_tile.tile (TileType TypeID) | 1 | dialects/cuda_tile/types.md |
0x5B38BC8 | cuda_tile | cuda_tile.ptr (PointerType TypeID) | 1 | dialects/cuda_tile/types.md |
0x5B38C40 | nv_tile_ir::as | ProducerOpInterface (guard) | 1 | dialects/nv_tileas/interfaces.md |
0x5B38C48 | nv_tile_ir::as | ProducerOpInterface (TypeID qword) | 8 | dialects/nv_tileas/interfaces.md |
0x5B38C60 | nv_tile_ir::as | AgentLikeOpInterface (guard) | 1 | dialects/nv_tileas/interfaces.md |
0x5B38C68 | nv_tile_ir::as | AgentLikeOpInterface (TypeID qword) | 8 | dialects/nv_tileas/interfaces.md |
0x5B38F80 | cutlass_ir::cute | TmaDescriptorTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B445F8 | cutlass_ir::cute | LayoutTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B44600 | cutlass_ir::cute | LayoutTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B44610 | cutlass_ir::cute | ViewTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B44618 | cutlass_ir::cute | ViewTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B44888 | cutlass_ir::cute | CopyAtomTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B44890 | cutlass_ir::cute | CopyAtomTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B44EB8 | nv_tileas | nv_tileas.view (opInfo) | 1 | dialects/nv_tileas/view.md |
0x5B44EC8 | nv_tileas | nv_tileas.tiled_store (opInfo) | 1 | dialects/nv_tileas/tiled-store.md |
0x5B44ED0 | nv_tileas | nv_tileas.tiled_load (opInfo) | 1 | dialects/nv_tileas/tiled-load.md |
0x5B44ED8 | nv_tileas | nv_tileas.tiled_atomic_rmw (opInfo) | 1 | dialects/nv_tileas/tiled-atomic-rmw.md |
0x5B44EE0 | nv_tileas | nv_tileas.store (opInfo) | 1 | dialects/nv_tileas/store.md |
0x5B44EF0 | nv_tileas | nv_tileas.scatter_store (opInfo) | 1 | dialects/nv_tileas/scatter-store.md |
0x5B44EF8 | nv_tileas | nv_tileas.async.pipeline.consumer_release (opInfo) | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44F08 | nv_tileas | op-ctor descriptor block tag | 1 | dialects/nv_tileas/index.md |
0x5B44F10 | nv_tileas | nv_tileas.pragma (paired opInfo) | 1 | dialects/nv_tileas/pragma.md |
0x5B44F18 | nv_tileas | nv_tileas.async.pipeline.consumer_yield | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44F20 | nv_tileas | nv_tileas.producer_write | 1 | dialects/nv_tileas/producer-write.md |
0x5B44F38 | nv_tileas | nv_tileas.async.pipeline.produce_one | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44F58 | nv_tileas | nv_tileas.produce_one_async | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44F68 | nv_tileas | nv_tileas.consumer_read | 1 | dialects/nv_tileas/consumer-read.md |
0x5B44F70 | nv_tileas | nv_tileas.async.pipeline.consume_one | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44F78 | nv_tileas | nv_tileas.consume_one_async | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44F90 | nv_tileas | nv_tileas.load (opInfo) | 1 | dialects/nv_tileas/load.md |
0x5B44FA8 | nv_tileas | nv_tileas.gather_load (opInfo) | 1 | dialects/nv_tileas/gather-load.md |
0x5B44FB8 | nv_tileas | nv_tileas.async.pipeline.consumer_release-family (paired) | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B44FD8 | nv_tileas | nv_tileas.convert_layout (opInfo) | 1 | dialects/nv_tileas/convert-layout.md |
0x5B44FF0 | nv_tileas | nv_tileas.async.pipeline.acquire (positional) | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B45070 | nv_tileas | nv_tileas.alloc_tensor | 1 | dialects/nv_tileas/alloc-tensor.md |
0x5B452B0 | nv_tileas | nv_tileas.scatter_store attr-vec ("atom") | 1 | dialects/nv_tileas/scatter-store.md |
0x5B45370 | nv_tileas | nv_tileas.pragma attr-vec (ocgEnter/LeaveDirectives) | 1 | dialects/nv_tileas/pragma.md |
0x5B453E0 | nv_tileas | nv_tileas.async.pipeline.consumer_wait attr-vec | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B45600 | nv_tileas | nv_tileas.gather_load attr-vec | 1 | dialects/nv_tileas/gather-load.md |
0x5B458C0 | nv_tileas | nv_tileas.async.pipeline.create_iterator attr-vec | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B45970 | nv_tileas | nv_tileas.async.gather_tma_load attr-vec | 1 | dialects/nv_tileas/async-pipeline.md |
0x5B46980 | nv_tileaa | NamedAttr-vector slot (2-slot pattern) | 8 | dialects/nv_tileaa/index.md |
0x5B469A0 | nv_tileaa | NamedAttr-vector slot (head) | 8 | dialects/nv_tileaa/index.md |
0x5B46D28 | nv_tileaa | nv_tileaa.yield FoldRecord | 1 | dialects/nv_tileaa/yield.md |
0x5B46D30 | nv_tileaa | nv_tileaa.view FoldRecord | 1 | dialects/nv_tileaa/view.md |
0x5B46D68 | nv_tileaa | nv_tileaa.splat FoldRecord | 1 | dialects/nv_tileaa/splat.md |
0x5B46D70 | nv_tileaa | nv_tileaa.scatter FoldRecord | 1 | dialects/nv_tileaa/scatter.md |
0x5B46D88 | nv_tileaa | nv_tileaa.return FoldRecord | 1 | dialects/nv_tileaa/return.md |
0x5B46D98 | nv_tileaa | nv_tileaa.queue.yield FoldRecord | 1 | dialects/nv_tileaa/queue.md |
0x5B46DA0 | nv_tileaa | nv_tileaa.queue.put FoldRecord | 1 | dialects/nv_tileaa/queue.md |
0x5B46DA8 | nv_tileaa | nv_tileaa.queue.get FoldRecord | 1 | dialects/nv_tileaa/queue.md |
0x5B46DB0 | nv_tileaa | nv_tileaa.ptr_to_int FoldRecord | 1 | dialects/nv_tileaa/ptr-to-int.md |
0x5B46DC0 | nv_tileaa | nv_tileaa.pragma FoldRecord | 1 | dialects/nv_tileaa/pragma.md |
0x5B46DD8 | nv_tileaa | nv_tileaa.opt_barrier FoldRecord | 1 | dialects/nv_tileaa/opt-barrier.md |
0x5B46DE0 | nv_tileaa | nv_tileaa.mulhiui FoldRecord | 1 | dialects/nv_tileaa/mulhiui.md |
0x5B46DF0 | nv_tileaa | nv_tileaa.message FoldRecord | 1 | dialects/nv_tileaa/message.md |
0x5B46DF8 | nv_tileaa | nv_tileaa.mark_for_reuse FoldRecord | 1 | dialects/nv_tileaa/mark-for-reuse.md |
0x5B46E08 | nv_tileaa | nv_tileaa.make_memref (opInfo) | 1 | dialects/nv_tileaa/make-memref.md |
0x5B46E18 | nv_tileaa | nv_tileaa.launch_func FoldRecord | 1 | dialects/nv_tileaa/launch-func.md |
0x5B46E20 | nv_tileaa | nv_tileaa.join_mem_token FoldRecord | 1 | dialects/nv_tileaa/queue.md |
0x5B46E28 | nv_tileaa | nv_tileaa.is_valid_program_id FoldRecord | 1 | dialects/nv_tileaa/program-id.md |
0x5B46E30 | nv_tileaa | nv_tileaa.int_to_ptr FoldRecord | 1 | dialects/nv_tileaa/ptr-to-int.md |
0x5B46E38 | nv_tileaa | nv_tileaa.inject_ir FoldRecord | 1 | dialects/nv_tileaa/inject-ir.md |
0x5B46E40 | nv_tileaa | nv_tileaa.histogram FoldRecord | 1 | dialects/nv_tileaa/histogram.md |
0x5B46E70 | nv_tileaa | nv_tileaa.generate FoldRecord | 1 | dialects/nv_tileaa/generate.md |
0x5B46E78 | nv_tileaa | nv_tileaa.gather_load FoldRecord | 1 | dialects/nv_tileaa/gather-load.md |
0x5B46E80 | nv_tileaa | nv_tileaa.func (opInfo) | 1 | dialects/nv_tileaa/func.md |
0x5B46E88 | nv_tileaa | nv_tileaa.fp_to_fp (opInfo) | 1 | dialects/nv_tileaa/fp-to-fp.md |
0x5B46E98 | nv_tileaa | nv_tileaa.extract_slice FoldRecord | 1 | dialects/nv_tileaa/extract-slice.md |
0x5B46EA8 | nv_tileaa | nv_tileaa.extern_ew FoldRecord | 1 | dialects/nv_tileaa/extern-ew.md |
0x5B46EC8 | nv_tileaa | nv_tileaa.ew_inline_asm FoldRecord | 1 | dialects/nv_tileaa/ew-inline-asm.md |
0x5B46EE0 | nv_tileaa | nv_tileaa.create_queue FoldRecord | 1 | dialects/nv_tileaa/queue.md |
0x5B46EE8 | nv_tileaa | nv_tileaa.create_mem_token FoldRecord | 1 | dialects/nv_tileaa/queue.md |
0x5B46F10 | nv_tileaa | nv_tileaa.cancel_next_program_id FoldRecord | 1 | dialects/nv_tileaa/program-id.md |
0x5B46F28 | nv_tileaa | nv_tileaa.broadcast FoldRecord | 1 | dialects/nv_tileaa/broadcast.md |
0x5B46F30 | nv_tileaa | nv_tileaa.block_tile (opInfo) | 1 | dialects/nv_tileaa/block-tile.md |
0x5B46F38 | nv_tileaa | nv_tileaa.bitcast FoldRecord | 1 | dialects/nv_tileaa/bitcast.md |
0x5B46F58 | nv_tileaa | nv_tileaa.assert FoldRecord | 1 | dialects/nv_tileaa/assert.md |
0x5B46F60 | nv_tileaa | nv_tileaa.addptr FoldRecord | 1 | dialects/nv_tileaa/addptr.md |
0x5B46F68 | nv_tileaa | nv_tileaa.addf FoldRecord | 1 | dialects/nv_tileaa/addf.md |
0x5B46FA0 | upstream MLIR | IntegerType variant (dot-operand layout id 2) | 1 | dialects/index.md |
0x5B46FA8 | upstream MLIR | IntegerType TypeID model (i1 / shared variant) | 1 | dialects/index.md |
0x5B46FF0 | cutlass_ir::cute | MmaAtomTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B46FF8 | cutlass_ir::cute | MmaAtomTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47000 | cutlass_ir::cute | PrefetchAtomTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B47008 | cutlass_ir::cute | PrefetchAtomTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47020 | cutlass_ir::cute | PrintableTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B47028 | cutlass_ir::cute | PrintableTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47030 | cutlass_ir::cute | IteratorTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B47038 | cutlass_ir::cute | IteratorTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47058 | cutlass_ir::cute | PointerTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B47060 | cutlass_ir::cute | PointerTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47068 | cutlass_ir::cute | AtomTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B47070 | cutlass_ir::cute | AtomTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47080 | cutlass_ir::cute | DescriptorIteratorTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B47088 | cutlass_ir::cute | DescriptorIteratorTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B470C8 | cutlass_ir::cute | MaybeStaticTypeInterface (guard) | 1 | dialects/cute/interfaces.md |
0x5B470D0 | cutlass_ir::cute | MaybeStaticTypeInterface (TypeID qword) | 8 | dialects/cute/interfaces.md |
0x5B47490 .. 0x5B476A0 | cutlass | Per-op OpInfoBlock band (~20 slots) | varies | dialects/cutlass/index.md |
0x5B47FF8 .. 0x5B481A8 | cute_nvgpu | Op TypeID slab (49 slots, 8-byte stride) | 1 each | dialects/cute_nvgpu/index.md |
0x5B482C8 | cute_nvgpu | dialect TypeID | 1 | dialects/cute_nvgpu/index.md |
0x5B48580 | cute_nvgpu | relinquish_tmem_alloc_permit attr-table | 8 | dialects/cute_nvgpu/relinquish-tmem-alloc-permit.md |
0x5B485A0 | cute_nvgpu | arch.sm100.dealloc_tmem attr-table | 8 | dialects/cute_nvgpu/dealloc-tmem.md |
0x5B485C0 | cute_nvgpu | arch.sm100.alloc_tmem attr-table | 8 | dialects/cute_nvgpu/alloc-tmem.md |
0x5B486A0 | cute_nvgpu | sm89.mma attr-table | 8 | dialects/cute_nvgpu/sm89-mma.md |
0x5B48700 | cute_nvgpu | sm90.mma attr-table | 8 | dialects/cute_nvgpu/sm90-mma.md |
0x5B48780 | cute_nvgpu | sm100.mma attr-table | 8 | dialects/cute_nvgpu/sm100-mma.md |
0x5B48800 | cute_nvgpu | SM120.block_scaled attr-table (17 entries) | 8 | dialects/cute_nvgpu/sm120-block-scaled.md |
0x5B488E0 | cute_nvgpu | sm100.umma attr-table | 8 | dialects/cute_nvgpu/sm100-umma.md |
0x5B489E0 | cute_nvgpu | stsm attr-table | 8 | dialects/cute_nvgpu/stsm.md |
0x5B48A20 | cute_nvgpu | sm80.cp_async attr-table | 8 | dialects/cute_nvgpu/sm80-cp-async.md |
0x5B48AF0 | cute_nvgpu | SM100.tma_store attr-table | 8 | dialects/cute_nvgpu/tma-store.md |
0x5B48B20 | cute_nvgpu | SM100.tma_reduce attr-table | 8 | dialects/cute_nvgpu/tma-reduce.md |
0x5B48D88 | cute_nvgpu | atom.non_exec_tiled_tma_reduce / SmemDescType | 1 | dialects/cute_nvgpu/types.md |
0x5B48D90 | cute_nvgpu | atom.non_exec_tiled_tma_store / TmaDescriptorTiledType | 1 | dialects/cute_nvgpu/types.md |
0x5B48D98 | cute_nvgpu | atom.non_exec_tiled_tma_load / TmaDescriptorIm2colType | 1 | dialects/cute_nvgpu/types.md |
0x5B48DA0 | cute_nvgpu | atom.stsm | 1 | dialects/cute_nvgpu/types.md |
0x5B48DA8 | cute_nvgpu | atom.ldsm | 1 | dialects/cute_nvgpu/types.md |
0x5B48DB0 | cute_nvgpu | atom.simt_async_copy | 1 | dialects/cute_nvgpu/types.md |
0x5B48DB8 | cute_nvgpu | atom.universal_copy | 1 | dialects/cute_nvgpu/types.md |
0x5B48DC0 | cute_nvgpu | atom.tma_reduce | 1 | dialects/cute_nvgpu/types.md |
0x5B48DC8 | cute_nvgpu | atom.tma_store | 1 | dialects/cute_nvgpu/types.md |
0x5B48DD0 | cute_nvgpu | atom.tma_load | 1 | dialects/cute_nvgpu/types.md |
0x5B48DD8 | cute_nvgpu | tma_descriptor_im2col | 1 | dialects/cute_nvgpu/types.md |
0x5B48DE0 | cute_nvgpu | tma_descriptor_tiled | 1 | dialects/cute_nvgpu/types.md |
0x5B48DE8 | cute_nvgpu | atom.s2t_copy | 1 | dialects/cute_nvgpu/types.md |
0x5B48DF0 | cute_nvgpu | atom.tmem_store | 1 | dialects/cute_nvgpu/types.md |
0x5B48DF8 | cute_nvgpu | atom.tmem_load | 1 | dialects/cute_nvgpu/types.md |
0x5B48E00 | cute_nvgpu | SM120.mma_bs (block-scaled) | 1 | dialects/cute_nvgpu/sm120-block-scaled.md |
0x5B48E08 | cute_nvgpu | sm100.mma_bs_sp | 1 | dialects/cute_nvgpu/sm100-mma.md |
0x5B48E10 | cute_nvgpu | sm100.mma_bs | 1 | dialects/cute_nvgpu/sm100-mma.md |
0x5B48E18 | cute_nvgpu | sm100.mma_sp | 1 | dialects/cute_nvgpu/sm100-mma.md |
0x5B48E20 | cute_nvgpu | sm100.mma | 1 | dialects/cute_nvgpu/sm100-mma.md |
0x5B48E28 | cute_nvgpu | sm90.mma (WGMMA) | 1 | dialects/cute_nvgpu/sm90-mma.md |
0x5B48E30 | cute_nvgpu | smem_desc_view | 1 | dialects/cute_nvgpu/types.md |
0x5B48E38 | cute_nvgpu | smem_desc | 1 | dialects/cute_nvgpu/types.md |
0x5B48E40 | cute_nvgpu | sm89.mma (FP8 e4m3/e5m2) | 1 | dialects/cute_nvgpu/sm89-mma.md |
0x5B48E48 | cute_nvgpu | sm80.sparse_mma | 1 | dialects/cute_nvgpu/sm80-mma.md |
0x5B48E50 | cute_nvgpu | sm80.mma | 1 | dialects/cute_nvgpu/sm80-mma.md |
0x5B48E58 | cute_nvgpu | atom.universal_fma (SM70 path) | 1 | dialects/cute_nvgpu/types.md |
0x5B496B8 | cute | dialect TypeID | 1 | dialects/cute/index.md |
0x5B49A98 | cute | cute.tuple | 1 | dialects/cute/types.md |
0x5B49AA0 | cute | cute.fast_divmod_divisor | 1 | dialects/cute/types.md |
0x5B49AA8 | cute | cute.tiled_mma | 1 | dialects/cute/types.md |
0x5B49AB0 | cute | cute.tiled_copy | 1 | dialects/cute/types.md |
0x5B49AB8 | cute | cute.coord_tensor | 1 | dialects/cute/types.md |
0x5B49AC0 | cute | cute.memref (CuteMemRefType) | 1 | dialects/cute/types.md |
0x5B49AC8 | cute | cute.ptr (CutePtrType) | 1 | dialects/cute/types.md |
0x5B49AD0 | cute | cute.sparse_elem | 1 | dialects/cute/types.md |
0x5B49AD8 | cute | cute.composed_layout (ComposedLayoutType) | 1 | dialects/cute/types.md |
0x5B49AE0 | cute | cute.layout (LayoutType) | 1 | dialects/cute/types.md |
0x5B49AE8 | cute | cute.swizzle (SwizzleType) | 1 | dialects/cute/types.md |
0x5B49AF0 | cute | cute.tile (CuteTileType) | 1 | dialects/cute/types.md |
0x5B49AF8 | cute | cute.shape (CuteShapeType) | 1 | dialects/cute/types.md |
0x5B49B00 | cute | cute.stride | 1 | dialects/cute/types.md |
0x5B49B08 | cute | cute.coord (CuteCoordType) | 1 | dialects/cute/types.md |
0x5B49B10 | cute | cute.int_tuple (IntTupleType) | 1 | dialects/cute/types.md |
0x5B49B18 | cute / cute_nvgpu | ConstrainedInt + AtomIType (shared) | 1 | dialects/cute/types.md |
0x5B8D610 .. 0x5B8DCB8 | NVVM | Op TypeID slab — 213 slots, 197 referenced (see slab close-up) | 8 each | dialects/nvvm/index.md |
0x5BA8F60 | LLVM | dialect TypeID | 1 | dialects/index.md |
0x5BAADB8 | upstream MLIR | IntegerType variant (i32 / blocked layout id 1) | 1 | dialects/index.md |
0x5BE3FF8 | scf | scf.if AbstractOperation kindPtr | 1 | dialects/index.md |
0x5BE4008 | nv_tileas | nv_tileas.convert_layout AbstractOperation kindPtr | 1 | dialects/nv_tileas/convert-layout.md |
0x5BE5858 | arith | arith.constant AbstractOperation kindPtr | 1 | dialects/index.md |
0x5BE5908 | arith | dialect TypeID | 1 | dialects/index.md |
0x5BE5C40 | nv_tileas | nv_tileas.async.pipeline.consume_one (paired) | 1 | dialects/nv_tileas/async-pipeline.md |
0x5BE5FC0 | upstream MLIR | FloatType singleton (F16 entry, MED) | 1 | dialects/index.md |
0x5BE5FE0 | upstream MLIR | MemRefType TypeID model | 1 | dialects/index.md |
0x5BE6000 | upstream MLIR | FloatType singleton (F32 entry, MED) | 1 | dialects/index.md |
0x5BE6028 | upstream MLIR | FloatType singleton (F64 entry, MED) | 1 | dialects/index.md |
0x5BE6030 | upstream MLIR | FloatType singleton (slot between F64 and TF32, MED) | 1 | dialects/index.md |
0x5BE6038 | nv_tile_ir | tf32 (nv_tf32) storage sentinel | 1 | dialects/index.md |
0x5BE6040 | upstream MLIR | FloatType singleton (MED) | 1 | dialects/index.md |
0x5BE6048 | upstream MLIR | bf16 storage sentinel | 1 | dialects/index.md |
0x5BE6090 | upstream MLIR | f8E5M2 storage sentinel | 1 | dialects/index.md |
0x5BE60A0 | upstream MLIR | f8E4M3FN storage sentinel | 1 | dialects/index.md |
0x5BE6138 | MLIR detail | UnregisteredOpProperties / no-properties guard (shared) | 1 | dialects/index.md |
NVVM op TypeID slab close-up: 0x5B8D610 .. 0x5B8DCB8
The largest sentinel cluster in the binary is the contiguous NVVM-op
slab at 0x5B8D610 .. 0x5B8DCB8. It is 1704 bytes long (0x6A8),
holds 213 8-byte slots at uniform 8-byte stride, and the NVVMToLLVM
lowering dispatcher (sub_2D67A80, 92 KB) tests 197 of those slots as
per-op TypeID sentinels in a folded dyn_cast cascade walking the slab
from top-of-range (0x5B8DCB8) down. The remaining 16 slots correspond
to NVVM op classes handled exclusively by the SelectionDAG MatcherTable
path (sub_1A833C0) and never appear as explicit dispatcher arms.
Why it is contiguous: the linker emits one
mlir::TypeID::Storage-array initialization per dialect, where every
op-class registered through the TableGen-generated
registerNVVMDialect() entry point produces one 8-byte slot containing
the address of the class's static thread_local TypeID::UniqueIdHolder.
All 213 slots come from one translation unit's static data, so they
land in a single .rodata section with no padding between slots —
exactly the pattern observed.
How to read offset → op name: index i = (slab_address - 0x5B8D610) / 8.
The dispatcher walks arms in slab-descending order, so the first arm
reached at line ~2067 of sub_2D67A80 matches 0x5B8DCB8
(NVVM::CpAsyncCommitGroupOp). Each subsequent arm decrements the slot by
8. Slot 0x5B8D610 + 8*i for i ∈ [0, 212] therefore corresponds to
the (212 - i)-th arm in walk order.
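The offset arithmetic above can be sketched in C. This is a minimal illustration, not code recovered from the binary: the helper names (`nvvm_slab_contains`, `nvvm_slot_index`, `nvvm_steps_from_top`) are hypothetical, and the sketch assumes that both quoted range endpoints are valid slot addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Slab bounds as quoted in the text. Treating both endpoints as valid
 * slot addresses is an assumption of this sketch. */
#define NVVM_SLAB_BASE 0x5B8D610u
#define NVVM_SLAB_TOP  0x5B8DCB8u
#define NVVM_SLOT_SIZE 8u

/* True when addr lies inside the slab on an 8-byte slot boundary. */
static bool nvvm_slab_contains(uint64_t addr) {
    return addr >= NVVM_SLAB_BASE && addr <= NVVM_SLAB_TOP &&
           (addr - NVVM_SLAB_BASE) % NVVM_SLOT_SIZE == 0;
}

/* index = (slab_address - base) / 8; -1 for addresses outside the slab. */
static int64_t nvvm_slot_index(uint64_t addr) {
    if (!nvvm_slab_contains(addr)) return -1;
    return (int64_t)((addr - NVVM_SLAB_BASE) / NVVM_SLOT_SIZE);
}

/* Number of 8-byte decrements between the top-of-range slot and addr,
 * i.e. how far a descending walk has travelled when it reaches addr. */
static int64_t nvvm_steps_from_top(uint64_t addr) {
    if (!nvvm_slab_contains(addr)) return -1;
    return (int64_t)((NVVM_SLAB_TOP - addr) / NVVM_SLOT_SIZE);
}

/* Usage example: spot checks against addresses quoted on this page. */
static void nvvm_slab_selfcheck(void) {
    assert(nvvm_slot_index(NVVM_SLAB_BASE) == 0);
    assert(nvvm_steps_from_top(NVVM_SLAB_TOP) == 0);
    assert(nvvm_steps_from_top(0x5B8DCA8) == 2);  /* CpAsyncWaitGroupOp row */
    assert(nvvm_slot_index(0x5B8D614) == -1);     /* not on a slot boundary */
}
```

Because the dispatcher tests only a subset of the slots, the step count measures distance in slots from the top of the range, not the ordinal of an explicit dispatcher arm.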
Selected anchor sentinels from inside the slab, with their op classes:
| Sentinel | NVVM Op class | Intrinsic-ID family |
|---|---|---|
0x5B8DCB8 | NVVM::CpAsyncCommitGroupOp | (top of dispatcher) |
0x5B8DCA8 | NVVM::CpAsyncWaitGroupOp | 8397 |
0x5B8DC90 | NVVM::Tcgen05DeallocOp | 8381, 0x20CD |
0x5B8DB58 | NVVM::AtomicRMWOp | (variant via sub_4261FA) |
0x5B8DB50 | NVVM::ReduceOp (variant 1) | (via sub_2E657E0) |
0x5B8DB48 | NVVM::ReduceOp (variant 2) | (via sub_2E657C0) |
0x5B8DB40 | NVVM::ReduceOp (variant 3, vec) | (via sub_2E65720) |
0x5B8DB38 | NVVM::AtomicCAS / nvvm.red.b128 | (via sub_2E65750) |
0x5B8DAF8 | NVVM::CpAsyncBulkTensorReduceOp | 8974-9011 |
0x5B8DAF0 | NVVM::CpAsyncBulkTensorPrefetchOp | 9150 |
0x5B8DAE8 | NVVM::CpAsyncBulkTensorSharedCTAToGlobalOp | 8956 |
0x5B8DAE0 | NVVM::CpAsyncBulkTensorSharedCTAToGlobalExtOp | 8956 |
0x5B8DAD8 | NVVM::CpAsyncBulkTensorSharedClusterToGlobalOp | 8951 |
0x5B8DAB8 | NVVM::Tcgen05FenceOp (fence pair v0) | 8609 |
0x5B8DAB0 | NVVM::Tcgen05FenceOp (fence pair v1) | 8610 |
0x5B8DAA8 | NVVM::CvtPackfloatF32Op | 0x21B3 = 8627 |
0x5B8DAA0 | NVVM::ElectSyncOp | 0x21A5 = 8613 |
0x5B8DA98 | NVVM::PrefetchOp | 0x21F7 = 8695 |
0x5B8DA90 | NVVM::CpAsyncShared.*.GlobalOp | 0x210F |
0x5B8D928 | NVVM::CvtFloatToFp8 / CvtPackedOp | 8305-8308 |
0x5B8D920 | NVVM::WgmmaCommitGroupSyncAlignedOp | 0x226A = 8810 |
0x5B8D918 | NVVM::WgmmaCommitGroup / WaitGroup | 8797-8799 |
0x5B8D910 | NVVM::WgmmaMmaAsync (block-variant 0x245C) | 0x245C = 9308 |
0x5B8D8F8 | NVVM::MmaBlockScaleOp | 9398 = 0x24B6 |
0x5B8D8F0 | NVVM::MmaSync sibling | 9035 |
0x5B8D8E8 | NVVM::MmaSync sibling | 9036 |
0x5B8D8D8 | NVVM::WgmmaMmaAsyncOp (full) | 0x226A = 8810 |
0x5B8D8D0 | NVVM::WgmmaMmaAsync sibling (operand-walked) | -- |
0x5B8D898 | NVVM::LdmatrixOp | 9153-9170 |
0x5B8D7F8 | NVVM::CpAsyncShared.*.GlobalOp variant | 9259 / 9263 |
0x5B8D7F0 | NVVM::CpAsyncBulkSharedClusterToSharedCTAOp | 9217 |
0x5B8D7E8 | NVVM::CpAsyncCommitGroupOp / CpAsyncShared | 9220 / 9222 |
0x5B8D7E0 | NVVM::CpAsyncBulkTensorBaseOp | 8919-8966 |
0x5B8D7D0 | NVVM::MmaOp (mma.sync) | (MatcherTable) |
0x5B8D7C8 | NVVM::WmmaOp (load/store/mma) | (MatcherTable) |
0x5B8D768 | NVVM::StmatrixOp | 9858-9866 |
0x5B8D700 | NVVM::Tcgen05MMAOp (full) | 10521-10525 |
0x5B8D6F8 | NVVM::Tcgen05MMABlockScaleOp | 10524-30 |
0x5B8D6F0 | NVVM::Tcgen05MMASparseOp | 10522-23 |
0x5B8D6E8 | NVVM::Tcgen05MMAWsOp | 10522-23 |
0x5B8D6E0 | NVVM::Tcgen05MMAWsSpOp | 10534 (gated) |
0x5B8D6D8 | NVVM::Tcgen05MMASpBlockScaleOp | 10522-30 |
0x5B8D6D0 | NVVM::Tcgen05ShiftOp | 10540 |
0x5B8D6C8 | NVVM::Tcgen05CommitOp | 9669-70, 10447 |
0x5B8D6C0 | NVVM::Tcgen05CommitArriveOp | 9671 = 0x25C7 |
0x5B8D6B8 | NVVM::Tcgen05CpOp | 9136 |
0x5B8D6B0 | NVVM::Tcgen05AllocOp | 8376, 0x20B7 |
0x5B8D6A8 | NVVM::Tcgen05DeallocOp | 8381, 0x20CD |
0x5B8D6A0 | NVVM::Tcgen05RelinquishAllocPermitOp | 8390-91 |
0x5B8D698 | NVVM::Tcgen05WaitOp | 9399 |
0x5B8D690 | NVVM::Tcgen05FenceOp | 8609 sibling |
0x5B8D688 | NVVM::Tcgen05LdmatrixOp | 9674-83 |
0x5B8D680 | NVVM::Tcgen05StmatrixOp | 9684-89 |
0x5B8D610 .. 0x5B8D670 | NVVM::Mbar / barrier / cluster / setmaxnreg / fence band (~25) | varies |
Block-anchor band assignments (within the slab, from the dispatcher walk order):
| Slab band | Op-class family |
|---|---|
0x5B8DCB8 .. 0x5B8DC90 | cp.async commit/wait + tensormap descriptor builder + Tcgen05Dealloc |
0x5B8DC88 .. 0x5B8DC28 | 16 cp.async.bulk commit/wait fence-band siblings |
0x5B8DC20 .. 0x5B8DC00 | 3 cp.async.bulk commit/wait variants |
0x5B8DBF8 .. 0x5B8DB70 | 17 cp.async.bulk.tensor TMA store/load fan-out (1D-5D × im2col × multicast × L2hint) |
0x5B8DB68 .. 0x5B8DB58 | 3 atomic / red sibs |
0x5B8DB50 .. 0x5B8DB38 | 3 nvvm.red ops (variants by red_op × scope × type) |
0x5B8DB28 .. 0x5B8DB00 | 4 cp.async.commit / wait band |
0x5B8DAF8 .. 0x5B8DAE0 | 3 cp.async.bulk.tensor.reduce variants (S2G / G2S / prefetch) |
0x5B8DAD8 .. 0x5B8DAC0 | 3 ldmatrix-cluster siblings |
0x5B8DAB8 .. 0x5B8DAB0 | 2 nvvm.tcgen05.fence variants |
0x5B8DAA8 .. 0x5B8DA90 | cvt.packfloat / elect.sync / prefetch / cp.async.shared.global |
0x5B8DA88 .. 0x5B8DA78 | 3 cp.async-cluster-bulk siblings |
0x5B8DA70 .. 0x5B8DA18 | 6 mbarrier-init/inval/arrive variants |
0x5B8D9C0 .. 0x5B8D9D0 | 9 fence.{proxy,sc,acq_rel} cluster fan-out (0x2200 family) |
0x5B8D9B8 .. 0x5B8D978 | 9 mbarrier.test_wait/parity/timelimit fan-out |
0x5B8D928 .. 0x5B8D8F8 | cvt.float.to.fp8 / wgmma fence/commit/wait / mma.block_scale |
0x5B8D8F0 .. 0x5B8D8E8 | 2 mma.sync siblings (9035, 9036) |
0x5B8D8D8 .. 0x5B8D8D0 | wgmma.mma_async (full + sibling) |
0x5B8D8C8 .. 0x5B8D898 | ldmatrix-shape fan-out (m8n8 / m8n16 / m16n16) |
0x5B8D8A8 .. 0x5B8D898 | 3 stmatrix × num × trans variants (9637-38, 9858+) |
0x5B8D880 .. 0x5B8D7F8 | 4 cp.async.bulk.tensor.shared::cluster.global variants |
0x5B8D7F0 .. 0x5B8D7E8 | 2 nvvm.cp.async.shared (8463 / 9220) |
0x5B8D7E0 | nvvm.cp.async.bulk.tensor rank fan-out (8919-8966) |
0x5B8D7D8 .. 0x5B8D7C8 | 4 mma.sync / wmma siblings (9434-9505 dword table) |
0x5B8D7C0 .. 0x5B8D6F8 | 16 tcgen05.mma {full, sp, ws, ws.sp, block_scale, ...} |
0x5B8D6F0 .. 0x5B8D680 | 16 tcgen05 misc (ld/st/cp/commit/alloc/dealloc/wait) |
0x5B8D670 .. 0x5B8D610 | ~25 generic ops / cluster / setmaxnreg / lazy-tail siblings |
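When tracing a comparison against an address inside the slab, a band lookup turns the address into an op-class family. The sketch below is a hypothetical helper (`band_family` and `struct slab_band` are not names from the binary) and copies only four rows from the table above. A full tool would carry every row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row of the band table: inclusive high and low slot addresses. */
struct slab_band {
    uint64_t hi;
    uint64_t lo;
    const char *family;
};

/* Four rows copied from the table above, kept in descending order. */
static const struct slab_band k_bands[] = {
    {0x5B8DAB8, 0x5B8DAB0, "nvvm.tcgen05.fence variants"},
    {0x5B8D8F0, 0x5B8D8E8, "mma.sync siblings"},
    {0x5B8D8D8, 0x5B8D8D0, "wgmma.mma_async"},
    {0x5B8D7F0, 0x5B8D7E8, "nvvm.cp.async.shared"},
};

/* Returns the family name for addr, or NULL when no listed band covers it. */
static const char *band_family(uint64_t addr) {
    for (size_t i = 0; i < sizeof k_bands / sizeof k_bands[0]; ++i) {
        if (addr <= k_bands[i].hi && addr >= k_bands[i].lo)
            return k_bands[i].family;
    }
    return NULL;
}

/* Usage example: an address inside a band and one between two bands. */
static void band_selfcheck(void) {
    assert(strcmp(band_family(0x5B8D8D0), "wgmma.mma_async") == 0);
    assert(band_family(0x5B8D8E0) == NULL);
}
```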
Slot stride and storage rationale: each slot is exactly 8 bytes because
the slab stores raw void* pointers, and the System V x86-64 psABI fixes
both sizeof(void*) and _Alignof(void*) at 8. The address of slot
i is 0x5B8D610 + 8*i, with no per-slot padding. The dispatcher reads each
sentinel address as an immediate operand baked into the per-arm cmp
instruction, so any reimplementation must keep the slab contiguous and
8-byte aligned for the folded cascade to remain a single cmp/je
chain.
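The shape of such a cascade can be sketched with a four-slot stand-in for the slab. Everything here is illustrative: `demo_slab` and `demo_dispatch` are hypothetical names, and the real dispatcher has far more arms. The sketch shows only the descending pointer-equality chain and the 8-byte stride it depends on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical four-slot slab standing in for the NVVM slab. The slots are
 * never read; only their addresses serve as identities. */
static uint64_t demo_slab[4];

/* Descending cascade: test the top slot first, then step down one slot per
 * arm. Returns the arm number in walk order, or -1 when nothing matches. */
static int demo_dispatch(const void *type_id) {
    if (type_id == &demo_slab[3]) return 0;
    if (type_id == &demo_slab[2]) return 1;
    if (type_id == &demo_slab[1]) return 2;
    if (type_id == &demo_slab[0]) return 3;
    return -1;
}

/* Usage example: the stride is 8 bytes and each slot selects one arm. */
static void demo_dispatch_selfcheck(void) {
    assert((const char *)&demo_slab[1] - (const char *)&demo_slab[0] == 8);
    assert(demo_dispatch(&demo_slab[3]) == 0);
    assert(demo_dispatch(&demo_slab[0]) == 3);
    assert(demo_dispatch(NULL) == -1);
}
```

Since every arm compares against a link-time constant, a compiler is free to emit each test as a compare-and-branch on an immediate, which is the shape the text describes.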
The shared &unk_5BE6138 no-properties guard sits about 0x58B28 bytes
(roughly 355 KB) above the slab base, in a different translation unit.
Upstream MLIR intends this:
UnregisteredOpProperties::TypeID lives in mlir/IR/OperationSupport.cpp,
separate from the dialect's generated registerNVVMDialect()
translation unit. Placing the no-properties sentinel outside the slab
guards against a pointer-equality false-positive when an arm tests
op.getName().getTypeID() == &slab[i] against an op whose properties
record was never built.
Cross-references
The companion table Op Mnemonic Master Table indexes the same sentinels by op-name rather than by address, with verbatim mnemonics, length bytes, and one-clause semantics for every registered op.
The Cross-references column in the master table points to the canonical wiki page for each sentinel's op or type. Conventions:
- `dialects/<dialect>/<op-mnemonic>.md` for op-info / op-class sentinels
- `dialects/<dialect>/types.md` for concrete Type TypeIDs
- `dialects/<dialect>/interfaces.md` for type-interface anchors (Meyers pairs)
- `dialects/<dialect>/index.md` for dialect-level TypeIDs and ranges whose per-op decomposition is documented separately
- `dialects/index.md` for upstream MLIR / cross-dialect anchors
Two cross-dialect sharing patterns are worth highlighting:
- `0x5B49B18` is reused by both `cute.ConstrainedInt` and `cute_nvgpu.AtomIType`. The two share pointer identity because the inline printer emits the same `i<N>(<divby M>)?` surface syntax for both, and the underlying AbstractType class is parameterised on the same set of attributes, so TableGen emits a single TypeID.
- The PrintableTypeInterface qword `0x5B47028` is attached to every `cute` and almost every `cute_nvgpu` concrete type (27+ installs). When you trace a sentinel comparison against `0x5B47028`, you are inside the PrintableTypeInterface dispatch, not a per-type check.
Pairing convention: nv_tileas.convert_layout exemplifies the two-form
encoding. Its OperationName opInfo slot (the descriptor passed to
sub_4461CA0 at op registration) is 0x5B44FD8, while its
AbstractOperation::TypeID slot (the kindPtr reachable via
*(qword*)(op+48)+16 after uniquing) is 0x5BE4008. Resolvers compare
against the kindPtr; op-builders against the opInfo. Treat them as the
same op identity at two different indirection levels.
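One way to model the pairing in tooling is a record that carries both addresses and is searchable by either key. The sketch below is an assumption about how a reimplementer might organise the data, not a structure recovered from the binary. `struct op_identity` and the two lookup helpers are hypothetical, and the single row uses the two addresses quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One op identity seen at two indirection levels. */
struct op_identity {
    const char *mnemonic;
    uint64_t op_info;   /* descriptor passed at registration (op-builders) */
    uint64_t kind_ptr;  /* TypeID slot reached after uniquing (resolvers) */
};

static const struct op_identity k_ops[] = {
    {"nv_tileas.convert_layout", 0x5B44FD8, 0x5BE4008},
};

#define K_OPS_COUNT (sizeof k_ops / sizeof k_ops[0])

/* Resolver-side lookup: match on the kindPtr only. */
static const struct op_identity *find_by_kind_ptr(uint64_t addr) {
    for (size_t i = 0; i < K_OPS_COUNT; ++i)
        if (k_ops[i].kind_ptr == addr) return &k_ops[i];
    return NULL;
}

/* Builder-side lookup: match on the opInfo only. */
static const struct op_identity *find_by_op_info(uint64_t addr) {
    for (size_t i = 0; i < K_OPS_COUNT; ++i)
        if (k_ops[i].op_info == addr) return &k_ops[i];
    return NULL;
}

/* Usage example: both keys reach the same record, but never cross over. */
static void op_identity_selfcheck(void) {
    assert(find_by_kind_ptr(0x5BE4008) == find_by_op_info(0x5B44FD8));
    assert(find_by_kind_ptr(0x5B44FD8) == NULL);
}
```

Keeping the two keys in separate lookups mirrors the convention: a comparison made at one indirection level should never be answered with an address from the other.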
Op Mnemonic Master Table
Every MLIR operation mnemonic registered by — or observed in lowerings
driven by — the tileiras ELF (CUDA Toolkit 13.1, SHA256
f0eb415767f403c96cbabf0817c3bcf70a50f88dfc8845fe36ebe21635fa6707).
Nine dialect namespaces, ~640 first-class ops, alphabetical within each
namespace. Columns: verbatim mnemonic in backticks, mnemonic length in
bytes (- where the registrar uses a non-flat path that does not pass
the literal length to RegisteredOperationName::insert), TypeID singleton
sentinel (per-op &unk_NNNNNNN where the registrar exposes a per-op slot,
range reference where the dialect uses a contiguous slab without per-op
isolation), one-clause semantic, primary wiki page. Sentinel addresses
use IDA-style &unk_NNNNNNN form, preserving the verbatim hexadecimal
address from .bss/.data. The mnemonic length column matches the
second argument passed to sub_4461CA0 (the
RegisteredOperationName::insert callee). Where the glossary lists a
range without a per-op slot, the entry cites the full range.
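The length column can be cross-checked mechanically. The sketch below assumes the column is the byte length of the mnemonic without a terminating NUL, which is consistent with the rows in §1. `struct mnemonic_row` and `first_length_mismatch` are hypothetical names, and the sample rows are copied from the cuda_tile table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One table row reduced to the two columns being checked. */
struct mnemonic_row {
    const char *mnemonic;
    size_t length;
};

/* Sample rows copied from the cuda_tile section. */
static const struct mnemonic_row k_rows[] = {
    {"cuda_tile.absf", 14},
    {"cuda_tile.atomic_cas_tko", 24},
    {"cuda_tile.get_index_space_shape", 31},
    {"cuda_tile.if", 12},
};

/* Returns the index of the first row whose recorded length disagrees with
 * strlen(mnemonic), or -1 when every row is consistent. */
static int first_length_mismatch(const struct mnemonic_row *rows, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (strlen(rows[i].mnemonic) != rows[i].length) return (int)i;
    return -1;
}

/* Usage example: the sample rows are internally consistent. */
static void mnemonic_selfcheck(void) {
    assert(first_length_mismatch(k_rows, sizeof k_rows / sizeof k_rows[0]) == -1);
}
```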
How the TypeID column is consumed
Every dispatcher in the binary reads OperationName::TypeID through two dependent loads starting from the Operation pointer:
#include <stdbool.h>
#include <stdint.h>

/* The OperationName slot sits at fixed offset +0x30 on an mlir::Operation,
 * and the TypeID pointer sits at +0x10 of OperationName::Impl. Both offsets
 * are stable across the binary; every dyn_cast / OpInterface lookup in
 * tileiras decompiles to this same shape. */
static inline const void *operation_typeid(const void *op) {
const void *opname_impl = *(const void *const *)((const uint8_t *)op + 0x30);
return *(const void *const *)((const uint8_t *)opname_impl + 0x10);
}
/* Dispatching on an op is therefore one pointer-equality test per arm
* against a sentinel address from the table below. A reimplementer who
* wants the same dispatch performance must publish exactly one stable
* address per op kind for pointer-equality identity. */
static inline bool op_is(const void *op, const void *sentinel) {
return operation_typeid(op) == sentinel;
}
Pointer-identity sentinels (the dominant form in the slab columns below) are
plain .bss slots: their address is the TypeID, and the byte stored there is
never loaded. Meyers-cached sentinels (the cute interface anchors) hold the
TypeID in a 64-bit qword that is filled in on first use through the
mlir::TypeID::getFullName() interner. For the full sentinel-form
breakdown and the address-band index, see
TypeID Sentinel Table.
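The two sentinel forms can be contrasted in a short sketch. All names here (`demo_op_sentinel`, `struct cached_typeid`, `demo_resolve`) are hypothetical, the resolved value is a placeholder, and the lazy path is deliberately single-threaded. A real Meyers-style guard would need the compiler's thread-safe initialization protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Form 1: pointer-identity sentinel. The byte is never read; only its
 * address is compared. */
static char demo_op_sentinel;

static bool is_demo_op(const void *type_id) {
    return type_id == &demo_op_sentinel;
}

/* Form 2: lazily cached sentinel, a one-byte guard paired with a qword. */
struct cached_typeid {
    uint8_t guard;   /* 0 until the first lookup has run */
    uint64_t value;  /* filled in on first use */
};

/* Placeholder resolver standing in for the interner; counts its calls. */
static int demo_resolve_calls;
static uint64_t demo_resolve(void) {
    ++demo_resolve_calls;
    return 0x1234; /* hypothetical value, not an address from the binary */
}

/* Returns the cached value, invoking resolve only while the guard is 0.
 * Not thread-safe: this sketch omits the locking a real guard needs. */
static uint64_t cached_typeid_get(struct cached_typeid *slot,
                                  uint64_t (*resolve)(void)) {
    if (!slot->guard) {
        slot->value = resolve();
        slot->guard = 1;
    }
    return slot->value;
}

/* Usage example: the resolver runs once no matter how often we ask. */
static void sentinel_forms_selfcheck(void) {
    struct cached_typeid slot = {0, 0};
    int before = demo_resolve_calls;
    assert(cached_typeid_get(&slot, demo_resolve) == 0x1234);
    assert(cached_typeid_get(&slot, demo_resolve) == 0x1234);
    assert(demo_resolve_calls == before + 1);
    assert(is_demo_op(&demo_op_sentinel));
}
```

On typical 64-bit ABIs the qword lands 8 bytes after the guard, which matches the guard / TypeID qword address pairs listed in the sentinel table.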
§1 cuda_tile.* (92 ops)
TypeID slab range 0x5785D0..0x57A8E0. Per-op TypeID slots lie in this
range, but the registration thunk does not expose isolated per-op &unk_*
addresses to the surface decompilation, so entries cite the full
range.
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
cuda_tile.absf | 14 | range 0x5785D0..0x57A8E0 | element-wise abs on float tile | dialects/cuda_tile.md |
cuda_tile.absi | 14 | range 0x5785D0..0x57A8E0 | element-wise abs on integer tile | dialects/cuda_tile.md |
cuda_tile.addf | 14 | range 0x5785D0..0x57A8E0 | element-wise float add | dialects/cuda_tile.md |
cuda_tile.addi | 14 | range 0x5785D0..0x57A8E0 | element-wise integer add | dialects/cuda_tile.md |
cuda_tile.andi | 14 | range 0x5785D0..0x57A8E0 | bitwise AND | dialects/cuda_tile.md |
cuda_tile.assert | 16 | range 0x5785D0..0x57A8E0 | runtime assertion in compiled tile code | dialects/cuda_tile.md |
cuda_tile.assume | 16 | range 0x5785D0..0x57A8E0 | optimizer hint (LLVM assume) | dialects/cuda_tile.md |
cuda_tile.atomic_cas_tko | 24 | range 0x5785D0..0x57A8E0 | atomic compare-and-swap, token-ordered | dialects/cuda_tile.md |
cuda_tile.atomic_rmw_tko | 24 | range 0x5785D0..0x57A8E0 | atomic read-modify-write, token-ordered | dialects/cuda_tile.md |
cuda_tile.bitcast | 17 | range 0x5785D0..0x57A8E0 | bit-pattern-preserving type pun | dialects/cuda_tile.md |
cuda_tile.break | 15 | range 0x5785D0..0x57A8E0 | structured-loop break | dialects/cuda_tile.md |
cuda_tile.broadcast | 19 | range 0x5785D0..0x57A8E0 | scalar / lower-rank to tile | dialects/cuda_tile.md |
cuda_tile.cat | 13 | range 0x5785D0..0x57A8E0 | tile concatenation | dialects/cuda_tile.md |
cuda_tile.ceil | 14 | range 0x5785D0..0x57A8E0 | ceil rounding | dialects/cuda_tile.md |
cuda_tile.cmpf | 14 | range 0x5785D0..0x57A8E0 | float comparison | dialects/cuda_tile.md |
cuda_tile.cmpi | 14 | range 0x5785D0..0x57A8E0 | integer comparison | dialects/cuda_tile.md |
cuda_tile.constant | 18 | range 0x5785D0..0x57A8E0 | dense / splat constant | dialects/cuda_tile.md |
cuda_tile.continue | 18 | range 0x5785D0..0x57A8E0 | structured-loop continue | dialects/cuda_tile.md |
cuda_tile.cos | 13 | range 0x5785D0..0x57A8E0 | elementary cosine | dialects/cuda_tile.md |
cuda_tile.cosh | 14 | range 0x5785D0..0x57A8E0 | hyperbolic cosine | dialects/cuda_tile.md |
cuda_tile.divf | 14 | range 0x5785D0..0x57A8E0 | float division | dialects/cuda_tile.md |
cuda_tile.divi | 14 | range 0x5785D0..0x57A8E0 | integer division | dialects/cuda_tile.md |
cuda_tile.entry | 15 | range 0x5785D0..0x57A8E0 | kernel entry op (1 region) | dialects/cuda_tile.md |
cuda_tile.exp | 13 | range 0x5785D0..0x57A8E0 | natural exponential | dialects/cuda_tile.md |
cuda_tile.exp2 | 14 | range 0x5785D0..0x57A8E0 | base-2 exponential | dialects/cuda_tile.md |
cuda_tile.exti | 14 | range 0x5785D0..0x57A8E0 | integer extension | dialects/cuda_tile.md |
cuda_tile.extract | 17 | range 0x5785D0..0x57A8E0 | tile element extract | dialects/cuda_tile.md |
cuda_tile.floor | 15 | range 0x5785D0..0x57A8E0 | floor rounding | dialects/cuda_tile.md |
cuda_tile.fma | 13 | range 0x5785D0..0x57A8E0 | fused multiply-add | dialects/cuda_tile.md |
cuda_tile.for | 13 | range 0x5785D0..0x57A8E0 | structured for loop (1 region) | dialects/cuda_tile.md |
cuda_tile.ftof | 14 | range 0x5785D0..0x57A8E0 | float-to-float cast | dialects/cuda_tile.md |
cuda_tile.ftoi | 14 | range 0x5785D0..0x57A8E0 | float-to-int cast | dialects/cuda_tile.md |
cuda_tile.get_global | 20 | range 0x5785D0..0x57A8E0 | reference module-level global | dialects/cuda_tile.md |
cuda_tile.get_index_space_shape | 31 | range 0x5785D0..0x57A8E0 | shape of the launch index space | dialects/cuda_tile.md |
cuda_tile.get_num_tile_blocks | 29 | range 0x5785D0..0x57A8E0 | tile-block count | dialects/cuda_tile.md |
cuda_tile.get_tensor_shape | 26 | range 0x5785D0..0x57A8E0 | shape of a tensor view | dialects/cuda_tile.md |
cuda_tile.get_tile_block_id | 27 | range 0x5785D0..0x57A8E0 | per-block id | dialects/cuda_tile.md |
cuda_tile.global | 16 | range 0x5785D0..0x57A8E0 | module-level global declaration | dialects/cuda_tile.md |
cuda_tile.if | 12 | range 0x5785D0..0x57A8E0 | structured conditional (2 regions) | dialects/cuda_tile.md |
cuda_tile.int_to_ptr | 20 | range 0x5785D0..0x57A8E0 | integer-to-pointer cast | dialects/cuda_tile.md |
cuda_tile.iota | 14 | range 0x5785D0..0x57A8E0 | sequential-int constant tile | dialects/cuda_tile.md |
cuda_tile.itof | 14 | range 0x5785D0..0x57A8E0 | int-to-float cast | dialects/cuda_tile.md |
cuda_tile.join_tokens | 21 | range 0x5785D0..0x57A8E0 | merge multiple tokens | dialects/cuda_tile.md |
cuda_tile.load_ptr_tko | 22 | range 0x5785D0..0x57A8E0 | pointer load, token-ordered | dialects/cuda_tile.md |
cuda_tile.load_view_tko | 23 | range 0x5785D0..0x57A8E0 | view load, token-ordered | dialects/cuda_tile.md |
cuda_tile.log | 13 | range 0x5785D0..0x57A8E0 | natural log | dialects/cuda_tile.md |
cuda_tile.log2 | 14 | range 0x5785D0..0x57A8E0 | base-2 log | dialects/cuda_tile.md |
cuda_tile.loop | 14 | range 0x5785D0..0x57A8E0 | generic structured loop (1 region) | dialects/cuda_tile.md |
cuda_tile.make_partition_view | 29 | range 0x5785D0..0x57A8E0 | construct a partition_view | dialects/cuda_tile.md |
cuda_tile.make_tensor_view | 26 | range 0x5785D0..0x57A8E0 | construct a tensor_view | dialects/cuda_tile.md |
cuda_tile.make_token | 20 | range 0x5785D0..0x57A8E0 | mint a synchronisation token | dialects/cuda_tile.md |
cuda_tile.maxf | 14 | range 0x5785D0..0x57A8E0 | float max | dialects/cuda_tile.md |
cuda_tile.maxi | 14 | range 0x5785D0..0x57A8E0 | integer max | dialects/cuda_tile.md |
cuda_tile.minf | 14 | range 0x5785D0..0x57A8E0 | float min | dialects/cuda_tile.md |
cuda_tile.mini | 14 | range 0x5785D0..0x57A8E0 | integer min | dialects/cuda_tile.md |
cuda_tile.mmaf | 14 | range 0x5785D0..0x57A8E0 | float tile MMA | dialects/cuda_tile.md |
cuda_tile.mmai | 14 | range 0x5785D0..0x57A8E0 | integer tile MMA | dialects/cuda_tile.md |
cuda_tile.module | 16 | range 0x5785D0..0x57A8E0 | top-level container (1 region) | dialects/cuda_tile.md |
cuda_tile.mulf | 14 | range 0x5785D0..0x57A8E0 | float multiply | dialects/cuda_tile.md |
cuda_tile.mulhii | 16 | range 0x5785D0..0x57A8E0 | high-half integer multiply | dialects/cuda_tile.md |
cuda_tile.muli | 14 | range 0x5785D0..0x57A8E0 | integer multiply | dialects/cuda_tile.md |
cuda_tile.negf | 14 | range 0x5785D0..0x57A8E0 | float negation | dialects/cuda_tile.md |
cuda_tile.negi | 14 | range 0x5785D0..0x57A8E0 | integer negation | dialects/cuda_tile.md |
cuda_tile.offset | 16 | range 0x5785D0..0x57A8E0 | view offset arithmetic | dialects/cuda_tile.md |
cuda_tile.ori | 13 | range 0x5785D0..0x57A8E0 | bitwise OR | dialects/cuda_tile.md |
cuda_tile.permute | 17 | range 0x5785D0..0x57A8E0 | tile permutation | dialects/cuda_tile.md |
cuda_tile.pow | 13 | range 0x5785D0..0x57A8E0 | power | dialects/cuda_tile.md |
cuda_tile.print | 15 | range 0x5785D0..0x57A8E0 | tile-aware diagnostic print (renamed from OSS print_tko) | dialects/cuda_tile.md |
cuda_tile.ptr_to_int | 20 | range 0x5785D0..0x57A8E0 | pointer-to-integer cast | dialects/cuda_tile.md |
cuda_tile.ptr_to_ptr | 20 | range 0x5785D0..0x57A8E0 | pointer recast | dialects/cuda_tile.md |
cuda_tile.reduce | 16 | range 0x5785D0..0x57A8E0 | reduction (1 region) | dialects/cuda_tile.md |
cuda_tile.remf | 14 | range 0x5785D0..0x57A8E0 | float remainder | dialects/cuda_tile.md |
cuda_tile.remi | 14 | range 0x5785D0..0x57A8E0 | integer remainder | dialects/cuda_tile.md |
cuda_tile.reshape | 17 | range 0x5785D0..0x57A8E0 | view reshape | dialects/cuda_tile.md |
cuda_tile.return | 16 | range 0x5785D0..0x57A8E0 | terminator | dialects/cuda_tile.md |
cuda_tile.rsqrt | 15 | range 0x5785D0..0x57A8E0 | reciprocal sqrt | dialects/cuda_tile.md |
cuda_tile.scan | 14 | range 0x5785D0..0x57A8E0 | prefix-sum (1 region) | dialects/cuda_tile.md |
cuda_tile.select | 16 | range 0x5785D0..0x57A8E0 | predicated select | dialects/cuda_tile.md |
cuda_tile.shli | 14 | range 0x5785D0..0x57A8E0 | left shift | dialects/cuda_tile.md |
cuda_tile.shri | 14 | range 0x5785D0..0x57A8E0 | right shift | dialects/cuda_tile.md |
cuda_tile.sin | 13 | range 0x5785D0..0x57A8E0 | elementary sine | dialects/cuda_tile.md |
cuda_tile.sinh | 14 | range 0x5785D0..0x57A8E0 | hyperbolic sine | dialects/cuda_tile.md |
cuda_tile.sqrt | 14 | range 0x5785D0..0x57A8E0 | square root | dialects/cuda_tile.md |
cuda_tile.store_ptr_tko | 23 | range 0x5785D0..0x57A8E0 | pointer store, token-ordered | dialects/cuda_tile.md |
cuda_tile.store_view_tko | 24 | range 0x5785D0..0x57A8E0 | view store, token-ordered | dialects/cuda_tile.md |
cuda_tile.subf | 14 | range 0x5785D0..0x57A8E0 | float subtract | dialects/cuda_tile.md |
cuda_tile.subi | 14 | range 0x5785D0..0x57A8E0 | integer subtract | dialects/cuda_tile.md |
cuda_tile.tan | 13 | range 0x5785D0..0x57A8E0 | elementary tangent | dialects/cuda_tile.md |
cuda_tile.tanh | 14 | range 0x5785D0..0x57A8E0 | hyperbolic tangent | dialects/cuda_tile.md |
cuda_tile.trunci | 16 | range 0x5785D0..0x57A8E0 | integer truncation | dialects/cuda_tile.md |
cuda_tile.xori | 14 | range 0x5785D0..0x57A8E0 | bitwise XOR | dialects/cuda_tile.md |
cuda_tile.yield | 15 | range 0x5785D0..0x57A8E0 | terminator for region-bearing ops | dialects/cuda_tile.md |
§2 nv_tileaa.* (73 ops)
Per-op TypeID slots in dense range 0x5B46D28..0x5B46F68 (8-byte stride).
The slab anchors below the nv_tileas slab.
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nv_tileaa.addf | 14 | range 0x5B46D28..0x5B46F68 | float add | dialects/nv_tileaa.md |
nv_tileaa.addptr | 16 | range 0x5B46D28..0x5B46F68 | pointer + integer offset | dialects/nv_tileaa.md |
nv_tileaa.assert | 16 | range 0x5B46D28..0x5B46F68 | runtime assertion | dialects/nv_tileaa.md |
nv_tileaa.assume | 16 | range 0x5B46D28..0x5B46F68 | optimizer assumption | dialects/nv_tileaa.md |
nv_tileaa.atomic_cas | 20 | range 0x5B46D28..0x5B46F68 | scalar atomic CAS | dialects/nv_tileaa.md |
nv_tileaa.atomic_rmw | 20 | range 0x5B46D28..0x5B46F68 | scalar atomic RMW | dialects/nv_tileaa.md |
nv_tileaa.bitcast | 17 | range 0x5B46D28..0x5B46F68 | bit-preserving type cast | dialects/nv_tileaa.md |
nv_tileaa.block_tile | 20 | range 0x5B46D28..0x5B46F68 | per-CTA tile selection | dialects/nv_tileaa.md |
nv_tileaa.broadcast | 19 | range 0x5B46D28..0x5B46F68 | scalar→tile / rank lift | dialects/nv_tileaa.md |
nv_tileaa.call | 14 | range 0x5B46D28..0x5B46F68 | call into emitted device function | dialects/nv_tileaa.md |
nv_tileaa.call_elementwise_intrinsic | 36 | range 0x5B46D28..0x5B46F68 | call libdevice math intrinsic | dialects/nv_tileaa.md |
nv_tileaa.cancel_next_program_id | 32 | range 0x5B46D28..0x5B46F68 | cluster-launch-control cancel | dialects/nv_tileaa.md |
nv_tileaa.cat | 13 | range 0x5B46D28..0x5B46F68 | tile concat | dialects/nv_tileaa.md |
nv_tileaa.clampf | 16 | range 0x5B46D28..0x5B46F68 | float clamp | dialects/nv_tileaa.md |
nv_tileaa.conv_dot | 18 | range 0x5B46D28..0x5B46F68 | convolution dot helper | dialects/nv_tileaa.md |
nv_tileaa.conv_tile | 19 | range 0x5B46D28..0x5B46F68 | convolution tile helper | dialects/nv_tileaa.md |
nv_tileaa.create_mem_token | 26 | range 0x5B46D28..0x5B46F68 | mint memory-lifetime token | dialects/nv_tileaa.md |
nv_tileaa.create_queue | 22 | range 0x5B46D28..0x5B46F68 | construct typed queue | dialects/nv_tileaa.md |
nv_tileaa.divf | 14 | range 0x5B46D28..0x5B46F68 | float divide | dialects/nv_tileaa.md |
nv_tileaa.dot | 13 | range 0x5B46D28..0x5B46F68 | matrix dot | dialects/nv_tileaa.md |
nv_tileaa.elementwise_inline_asm | 32 | range 0x5B46D28..0x5B46F68 | inline-PTX elementwise emitter | dialects/nv_tileaa.md |
nv_tileaa.execute | 17 | range 0x5B46D28..0x5B46F68 | launch-time execute marker | dialects/nv_tileaa.md |
nv_tileaa.exp2 | 14 | range 0x5B46D28..0x5B46F68 | base-2 exponent | dialects/nv_tileaa.md |
nv_tileaa.expand_dims | 21 | range 0x5B46D28..0x5B46F68 | rank lift | dialects/nv_tileaa.md |
nv_tileaa.extern_elementwise | 28 | range 0x5B46D28..0x5B46F68 | external (libdevice) elementwise | dialects/nv_tileaa.md |
nv_tileaa.extract | 17 | range 0x5B46D28..0x5B46F68 | scalar extract | dialects/nv_tileaa.md |
nv_tileaa.extract_slice | 23 | range 0x5B46D28..0x5B46F68 | sub-slice extract | dialects/nv_tileaa.md |
nv_tileaa.fma | 13 | range 0x5B46D28..0x5B46F68 | fused multiply-add | dialects/nv_tileaa.md |
nv_tileaa.fp_to_fp | 18 | range 0x5B46D28..0x5B46F68 | float-to-float cast | dialects/nv_tileaa.md |
nv_tileaa.func | 14 | range 0x5B46D28..0x5B46F68 | function op | dialects/nv_tileaa.md |
nv_tileaa.gather_load | 21 | range 0x5B46D28..0x5B46F68 | indexed gather (global) | dialects/nv_tileaa.md |
nv_tileaa.generate | 18 | range 0x5B46D28..0x5B46F68 | functional generate (region) | dialects/nv_tileaa.md |
nv_tileaa.get_dim_size | 22 | range 0x5B46D28..0x5B46F68 | extract dimension size | dialects/nv_tileaa.md |
nv_tileaa.get_global | 20 | range 0x5B46D28..0x5B46F68 | global lookup | dialects/nv_tileaa.md |
nv_tileaa.get_num_programs | 26 | range 0x5B46D28..0x5B46F68 | grid intrinsic: program count | dialects/nv_tileaa.md |
nv_tileaa.get_program_id | 24 | range 0x5B46D28..0x5B46F68 | grid intrinsic: program id | dialects/nv_tileaa.md |
nv_tileaa.global | 16 | range 0x5B46D28..0x5B46F68 | module-level global | dialects/nv_tileaa.md |
nv_tileaa.histogram | 19 | range 0x5B46D28..0x5B46F68 | parallel histogram primitive | dialects/nv_tileaa.md |
nv_tileaa.inject_ir | 19 | range 0x5B46D28..0x5B46F68 | embed lowered IR fragment | dialects/nv_tileaa.md |
nv_tileaa.int_to_ptr | 20 | range 0x5B46D28..0x5B46F68 | integer-to-pointer cast | dialects/nv_tileaa.md |
nv_tileaa.is_valid_program_id | 29 | range 0x5B46D28..0x5B46F68 | grid intrinsic predicate | dialects/nv_tileaa.md |
nv_tileaa.join_mem_token | 24 | range 0x5B46D28..0x5B46F68 | merge memory tokens | dialects/nv_tileaa.md |
nv_tileaa.launch_func | 21 | range 0x5B46D28..0x5B46F68 | host-side launch op | dialects/nv_tileaa.md |
nv_tileaa.load | 14 | range 0x5B46D28..0x5B46F68 | scalar memory load | dialects/nv_tileaa.md |
nv_tileaa.make_memref | 21 | range 0x5B46D28..0x5B46F68 | construct memref | dialects/nv_tileaa.md |
nv_tileaa.make_range | 20 | range 0x5B46D28..0x5B46F68 | iota-style range | dialects/nv_tileaa.md |
nv_tileaa.mark_for_reuse | 24 | range 0x5B46D28..0x5B46F68 | lifetime-extension marker | dialects/nv_tileaa.md |
nv_tileaa.message | 17 | range 0x5B46D28..0x5B46F68 | host-printable diagnostic | dialects/nv_tileaa.md |
nv_tileaa.mulf | 14 | range 0x5B46D28..0x5B46F68 | float multiply | dialects/nv_tileaa.md |
nv_tileaa.mulhiui | 17 | range 0x5B46D28..0x5B46F68 | unsigned high-half multiply | dialects/nv_tileaa.md |
nv_tileaa.optimization_barrier | 30 | range 0x5B46D28..0x5B46F68 | optimizer barrier | dialects/nv_tileaa.md |
nv_tileaa.permute | 17 | range 0x5B46D28..0x5B46F68 | tile permutation | dialects/nv_tileaa.md |
nv_tileaa.plugin | 16 | range 0x5B46D28..0x5B46F68 | plugin-injection op | dialects/nv_tileaa.md |
nv_tileaa.pragma | 16 | range 0x5B46D28..0x5B46F68 | pragma carrier | dialects/nv_tileaa.md |
nv_tileaa.print | 15 | range 0x5B46D28..0x5B46F68 | tile-aware print | dialects/nv_tileaa.md |
nv_tileaa.ptr_to_int | 20 | range 0x5B46D28..0x5B46F68 | pointer-to-integer cast | dialects/nv_tileaa.md |
nv_tileaa.queue.get | 19 | range 0x5B46D28..0x5B46F68 | typed-queue dequeue | dialects/nv_tileaa.md |
nv_tileaa.queue.put | 19 | range 0x5B46D28..0x5B46F68 | typed-queue enqueue | dialects/nv_tileaa.md |
nv_tileaa.queue.yield | 21 | range 0x5B46D28..0x5B46F68 | typed-queue dataflow yield | dialects/nv_tileaa.md |
nv_tileaa.reduce | 16 | range 0x5B46D28..0x5B46F68 | reduction | dialects/nv_tileaa.md |
nv_tileaa.return | 16 | range 0x5B46D28..0x5B46F68 | function-return terminator | dialects/nv_tileaa.md |
nv_tileaa.rsqrt | 15 | range 0x5B46D28..0x5B46F68 | reciprocal sqrt | dialects/nv_tileaa.md |
nv_tileaa.scan | 14 | range 0x5B46D28..0x5B46F68 | prefix-sum | dialects/nv_tileaa.md |
nv_tileaa.scatter_store | 23 | range 0x5B46D28..0x5B46F68 | indexed scatter (global) | dialects/nv_tileaa.md |
nv_tileaa.splat | 15 | range 0x5B46D28..0x5B46F68 | scalar broadcast | dialects/nv_tileaa.md |
nv_tileaa.sqrt | 14 | range 0x5B46D28..0x5B46F68 | square root | dialects/nv_tileaa.md |
nv_tileaa.store | 15 | range 0x5B46D28..0x5B46F68 | scalar memory store | dialects/nv_tileaa.md |
nv_tileaa.subf | 14 | range 0x5B46D28..0x5B46F68 | float subtract | dialects/nv_tileaa.md |
nv_tileaa.tiled_atomic_rmw | 26 | range 0x5B46D28..0x5B46F68 | tile-wide RMW | dialects/nv_tileaa.md |
nv_tileaa.tiled_load | 20 | range 0x5B46D28..0x5B46F68 | tile load | dialects/nv_tileaa.md |
nv_tileaa.tiled_store | 21 | range 0x5B46D28..0x5B46F68 | tile store | dialects/nv_tileaa.md |
nv_tileaa.view | 14 | range 0x5B46D28..0x5B46F68 | layout-aware view construction | dialects/nv_tileaa.md |
nv_tileaa.yield | 15 | range 0x5B46D28..0x5B46F68 | region terminator | dialects/nv_tileaa.md |
Note: the enumeration follows the registrar walk in p2-C01:441-513 and
yields 73 mnemonics, matching the table above and including the queue.*
and make_* decompositions. The "61 canonical ops" count cited in the
dialect summary is lower because it collapses make_memref / make_range /
view into their corresponding make_* family count. All entries above are
first-class.
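The `length` column in these tables matches the character count of the full mnemonic, dialect prefix included. The following is a minimal sketch for checking rows against that rule, assuming rows are copied verbatim from the tables. `parse_row` and `check_length` are hypothetical helper names used for illustration; they are not part of tileiras.

```python
def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into named fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    mnemonic, length, typeid, brief, page = cells
    return {
        "mnemonic": mnemonic,
        "length": int(length),
        "typeid": typeid,
        "brief": brief,
        "page": page,
    }


def check_length(row: dict) -> bool:
    """True when the length column equals the mnemonic's character count."""
    return len(row["mnemonic"]) == row["length"]


# Three rows copied from the nv_tileaa table above.
rows = [
    "nv_tileaa.addf | 14 | range 0x5B46D28..0x5B46F68 | float add | dialects/nv_tileaa.md |",
    "nv_tileaa.call_elementwise_intrinsic | 36 | range 0x5B46D28..0x5B46F68 | call libdevice math intrinsic | dialects/nv_tileaa.md |",
    "nv_tileaa.queue.yield | 21 | range 0x5B46D28..0x5B46F68 | typed-queue dataflow yield | dialects/nv_tileaa.md |",
]

parsed = [parse_row(r) for r in rows]
mismatches = [p["mnemonic"] for p in parsed if not check_length(p)]
print("mismatches:", mismatches)
```

The same check applies unchanged to the other dialect tables, since every row shares the five-column layout.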
§3 nv_tileas.* (58 ops)
Anchor &unk_5B44F08. RTTI nv_tile_ir::as. The async.pipeline.* cluster
is the largest single family, 16 of the 58 ops; async.* as a whole
accounts for 33.
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nv_tileas.alloc_tensor | 22 | anchor &unk_5B44F08 | tensor buffer allocation | dialects/nv_tileas.md |
nv_tileas.async.cancel_next_program_id | 38 | anchor &unk_5B44F08 | async cluster cancel | dialects/nv_tileas.md |
nv_tileas.async.copy | 20 | anchor &unk_5B44F08 | DMA-async copy | dialects/nv_tileas.md |
nv_tileas.async.dot | 19 | anchor &unk_5B44F08 | async MMA | dialects/nv_tileas.md |
nv_tileas.async.extract_slice | 29 | anchor &unk_5B44F08 | async sub-slice extract | dialects/nv_tileas.md |
nv_tileas.async.future_wait | 27 | anchor &unk_5B44F08 | wait on async future | dialects/nv_tileas.md |
nv_tileas.async.gather_tma_load | 31 | anchor &unk_5B44F08 | TMA gather load | dialects/nv_tileas.md |
nv_tileas.async.insert_slice | 28 | anchor &unk_5B44F08 | async slice insert | dialects/nv_tileas.md |
nv_tileas.async.load | 20 | anchor &unk_5B44F08 | async load | dialects/nv_tileas.md |
nv_tileas.async.pipeline.agent_switch | 37 | anchor &unk_5B44F08 | warp-specialized agent boundary | dialects/nv_tileas.md |
nv_tileas.async.pipeline.consume_one | 36 | anchor &unk_5B44F08 | one-stage consume | dialects/nv_tileas.md |
nv_tileas.async.pipeline.consume_one_async | 42 | anchor &unk_5B44F08 | one-stage async consume | dialects/nv_tileas.md |
nv_tileas.async.pipeline.consumer_read | 38 | anchor &unk_5B44F08 | consumer protocol read | dialects/nv_tileas.md |
nv_tileas.async.pipeline.consumer_release | 41 | anchor &unk_5B44F08 | consumer protocol release | dialects/nv_tileas.md |
nv_tileas.async.pipeline.consumer_wait | 38 | anchor &unk_5B44F08 | consumer protocol wait | dialects/nv_tileas.md |
nv_tileas.async.pipeline.create_iterator | 40 | anchor &unk_5B44F08 | pipeline iterator construction | dialects/nv_tileas.md |
nv_tileas.async.pipeline.create_null_token | 42 | anchor &unk_5B44F08 | null-token constructor | dialects/nv_tileas.md |
nv_tileas.async.pipeline.create_pipeline | 40 | anchor &unk_5B44F08 | pipeline constructor | dialects/nv_tileas.md |
nv_tileas.async.pipeline.inc_iter | 33 | anchor &unk_5B44F08 | iterator advance | dialects/nv_tileas.md |
nv_tileas.async.pipeline.produce_one | 36 | anchor &unk_5B44F08 | one-stage produce | dialects/nv_tileas.md |
nv_tileas.async.pipeline.produce_one_async | 42 | anchor &unk_5B44F08 | one-stage async produce | dialects/nv_tileas.md |
nv_tileas.async.pipeline.producer_acquire | 41 | anchor &unk_5B44F08 | producer protocol acquire | dialects/nv_tileas.md |
nv_tileas.async.pipeline.producer_commit | 40 | anchor &unk_5B44F08 | producer protocol commit | dialects/nv_tileas.md |
nv_tileas.async.pipeline.producer_write | 39 | anchor &unk_5B44F08 | producer protocol write | dialects/nv_tileas.md |
nv_tileas.async.pipeline.yield | 30 | anchor &unk_5B44F08 | pipeline-region terminator | dialects/nv_tileas.md |
nv_tileas.async.scatter_tma_store | 33 | anchor &unk_5B44F08 | TMA scatter store | dialects/nv_tileas.md |
nv_tileas.async.store | 21 | anchor &unk_5B44F08 | async store | dialects/nv_tileas.md |
nv_tileas.async.tiled_atomic_rmw | 32 | anchor &unk_5B44F08 | tile RMW (async) | dialects/nv_tileas.md |
nv_tileas.async.tiled_load | 26 | anchor &unk_5B44F08 | async tiled load | dialects/nv_tileas.md |
nv_tileas.async.tiled_tma_load | 30 | anchor &unk_5B44F08 | TMA tile load | dialects/nv_tileas.md |
nv_tileas.async.tiled_tma_store | 31 | anchor &unk_5B44F08 | TMA tile store | dialects/nv_tileas.md |
nv_tileas.async.to_async | 24 | anchor &unk_5B44F08 | future conversion | dialects/nv_tileas.md |
nv_tileas.async.token_to_async | 30 | anchor &unk_5B44F08 | token-to-future conversion | dialects/nv_tileas.md |
nv_tileas.async.wait | 20 | anchor &unk_5B44F08 | async wait barrier | dialects/nv_tileas.md |
nv_tileas.cancel_next_program_id | 32 | anchor &unk_5B44F08 | cluster cancel | dialects/nv_tileas.md |
nv_tileas.convert_layout | 24 | anchor &unk_5B44F08 | layout conversion (smem ↔ rmem ↔ tmem) | dialects/nv_tileas.md |
nv_tileas.copy | 14 | anchor &unk_5B44F08 | sync copy | dialects/nv_tileas.md |
nv_tileas.create_none | 21 | anchor &unk_5B44F08 | null SSA value | dialects/nv_tileas.md |
nv_tileas.dot | 13 | anchor &unk_5B44F08 | sync matrix dot | dialects/nv_tileas.md |
nv_tileas.expand_dims | 21 | anchor &unk_5B44F08 | rank lift | dialects/nv_tileas.md |
nv_tileas.extract_slice | 23 | anchor &unk_5B44F08 | sub-slice extract | dialects/nv_tileas.md |
nv_tileas.gather_load | 21 | anchor &unk_5B44F08 | indexed gather | dialects/nv_tileas.md |
nv_tileas.generate | 18 | anchor &unk_5B44F08 | functional generate (region) | dialects/nv_tileas.md |
nv_tileas.insert_slice | 22 | anchor &unk_5B44F08 | slice insert | dialects/nv_tileas.md |
nv_tileas.load | 14 | anchor &unk_5B44F08 | scalar load | dialects/nv_tileas.md |
nv_tileas.make_tiled_tma_desc | 29 | anchor &unk_5B44F08 | TMA descriptor builder | dialects/nv_tileas.md |
nv_tileas.pragma | 16 | anchor &unk_5B44F08 | pragma carrier | dialects/nv_tileas.md |
nv_tileas.reduce | 16 | anchor &unk_5B44F08 | reduction | dialects/nv_tileas.md |
nv_tileas.reinterpret | 21 | anchor &unk_5B44F08 | reinterpret cast | dialects/nv_tileas.md |
nv_tileas.scan | 14 | anchor &unk_5B44F08 | prefix-sum | dialects/nv_tileas.md |
nv_tileas.scatter_store | 23 | anchor &unk_5B44F08 | indexed scatter | dialects/nv_tileas.md |
nv_tileas.shuffle | 17 | anchor &unk_5B44F08 | warp shuffle | dialects/nv_tileas.md |
nv_tileas.store | 15 | anchor &unk_5B44F08 | scalar store | dialects/nv_tileas.md |
nv_tileas.tiled_atomic_rmw | 26 | anchor &unk_5B44F08 | tile-wide RMW | dialects/nv_tileas.md |
nv_tileas.tiled_load | 20 | anchor &unk_5B44F08 | tile load | dialects/nv_tileas.md |
nv_tileas.tiled_store | 21 | anchor &unk_5B44F08 | tile store | dialects/nv_tileas.md |
nv_tileas.view | 14 | anchor &unk_5B44F08 | view op | dialects/nv_tileas.md |
nv_tileas.yield | 15 | anchor &unk_5B44F08 | region terminator | dialects/nv_tileas.md |
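To see how the nv_tileas surface splits across namespaces, the mnemonics can be grouped by their dotted prefix. This is a small illustrative sketch: the op names are copied from the table above with the `nv_tileas.` prefix dropped, and `namespace` is a hypothetical helper, not a tileiras API.

```python
from collections import Counter

# Mnemonics from the nv_tileas table, dialect prefix omitted.
NV_TILEAS_OPS = """
alloc_tensor
async.cancel_next_program_id async.copy async.dot async.extract_slice
async.future_wait async.gather_tma_load async.insert_slice async.load
async.pipeline.agent_switch async.pipeline.consume_one
async.pipeline.consume_one_async async.pipeline.consumer_read
async.pipeline.consumer_release async.pipeline.consumer_wait
async.pipeline.create_iterator async.pipeline.create_null_token
async.pipeline.create_pipeline async.pipeline.inc_iter
async.pipeline.produce_one async.pipeline.produce_one_async
async.pipeline.producer_acquire async.pipeline.producer_commit
async.pipeline.producer_write async.pipeline.yield
async.scatter_tma_store async.store async.tiled_atomic_rmw
async.tiled_load async.tiled_tma_load async.tiled_tma_store
async.to_async async.token_to_async async.wait
cancel_next_program_id convert_layout copy create_none dot expand_dims
extract_slice gather_load generate insert_slice load
make_tiled_tma_desc pragma reduce reinterpret scan scatter_store
shuffle store tiled_atomic_rmw tiled_load tiled_store view yield
""".split()


def namespace(op: str) -> str:
    """Return the dotted namespace of an op name, or '(top-level)' if none."""
    head, _, _ = op.rpartition(".")
    return head or "(top-level)"


counts = Counter(namespace(op) for op in NV_TILEAS_OPS)
for ns, n in counts.most_common():
    print(f"{ns:16} {n:3}")
```

Against this table the grouping gives 25 top-level ops, 17 directly under async, and 16 under async.pipeline, for 58 in total.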
§4 cute.* (59 ops)
Anchor &unk_5B496B8. Hardware-independent CuTe layout algebra.
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
cute.add_offset | 15 | anchor &unk_5B496B8 | offset addition into a layout/iter | dialects/cute.md |
cute.complement | 15 | anchor &unk_5B496B8 | layout complement | dialects/cute.md |
cute.copy | 9 | anchor &unk_5B496B8 | high-level copy | dialects/cute.md |
cute.copy_atom_call | 19 | anchor &unk_5B496B8 | apply copy atom | dialects/cute.md |
cute.cosize | 11 | anchor &unk_5B496B8 | layout cosize | dialects/cute.md |
cute.deref_desc_iter | 20 | anchor &unk_5B496B8 | dereference descriptor iter | dialects/cute.md |
cute.derefine | 13 | anchor &unk_5B496B8 | layout refinement | dialects/cute.md |
cute.fast_divmod.create_divisor | 31 | anchor &unk_5B496B8 | fast-divmod divisor ctor | dialects/cute.md |
cute.fast_divmod.divide | 23 | anchor &unk_5B496B8 | fast-divmod divide | dialects/cute.md |
cute.fast_divmod.get_divisor | 28 | anchor &unk_5B496B8 | fast-divmod accessor | dialects/cute.md |
cute.fast_divmod.make_divisor | 29 | anchor &unk_5B496B8 | fast-divmod factory | dialects/cute.md |
cute.filter_zeros | 17 | anchor &unk_5B496B8 | strip zero modes | dialects/cute.md |
cute.flat_divide | 16 | anchor &unk_5B496B8 | flat divide | dialects/cute.md |
cute.gemm | 9 | anchor &unk_5B496B8 | GEMM scheduling op | dialects/cute.md |
cute.get_iter | 13 | anchor &unk_5B496B8 | accessor: iter | dialects/cute.md |
cute.get_layout | 15 | anchor &unk_5B496B8 | accessor: layout | dialects/cute.md |
cute.get_layouts_from_tile | 26 | anchor &unk_5B496B8 | accessor: layouts from tile | dialects/cute.md |
cute.get_shape | 14 | anchor &unk_5B496B8 | accessor: shape | dialects/cute.md |
cute.get_stride | 15 | anchor &unk_5B496B8 | accessor: stride | dialects/cute.md |
cute.group_modes | 16 | anchor &unk_5B496B8 | group layout modes | dialects/cute.md |
cute.inttoptr | 13 | anchor &unk_5B496B8 | int-to-pointer | dialects/cute.md |
cute.local_partition | 20 | anchor &unk_5B496B8 | partition view | dialects/cute.md |
cute.local_tile | 15 | anchor &unk_5B496B8 | tile view | dialects/cute.md |
cute.logical_divide | 19 | anchor &unk_5B496B8 | logical divide | dialects/cute.md |
cute.make_atom | 14 | anchor &unk_5B496B8 | atom constructor | dialects/cute.md |
cute.make_desc_iter | 19 | anchor &unk_5B496B8 | descriptor-iter ctor | dialects/cute.md |
cute.make_fragment_like | 23 | anchor &unk_5B496B8 | fragment construction | dialects/cute.md |
cute.make_tiled_copy | 20 | anchor &unk_5B496B8 | tiled-copy constructor | dialects/cute.md |
cute.make_tiled_mma | 19 | anchor &unk_5B496B8 | tiled-MMA constructor | dialects/cute.md |
cute.make_tuple | 15 | anchor &unk_5B496B8 | tuple constructor | dialects/cute.md |
cute.make_view | 14 | anchor &unk_5B496B8 | view constructor | dialects/cute.md |
cute.memref.alloc_smem | 22 | anchor &unk_5B496B8 | smem allocation | dialects/cute.md |
cute.memref.alloca | 18 | anchor &unk_5B496B8 | stack alloca | dialects/cute.md |
cute.memref.load | 16 | anchor &unk_5B496B8 | memref load | dialects/cute.md |
cute.memref.store | 17 | anchor &unk_5B496B8 | memref store | dialects/cute.md |
cute.memref.store_vec | 21 | anchor &unk_5B496B8 | vector memref store | dialects/cute.md |
cute.mma_atom_call | 18 | anchor &unk_5B496B8 | apply MMA atom | dialects/cute.md |
cute.prefetch | 13 | anchor &unk_5B496B8 | prefetch | dialects/cute.md |
cute.prefetch_atom_call | 23 | anchor &unk_5B496B8 | apply prefetch atom | dialects/cute.md |
cute.print | 10 | anchor &unk_5B496B8 | diagnostic print | dialects/cute.md |
cute.print_tma_desc_im2col | 26 | anchor &unk_5B496B8 | print TMA im2col desc | dialects/cute.md |
cute.print_tma_desc_tiled | 25 | anchor &unk_5B496B8 | print TMA tiled desc | dialects/cute.md |
cute.ptr.store | 14 | anchor &unk_5B496B8 | typed pointer store | dialects/cute.md |
cute.ptrtoint | 13 | anchor &unk_5B496B8 | pointer-to-int | dialects/cute.md |
cute.recast_iter | 16 | anchor &unk_5B496B8 | recast iterator | dialects/cute.md |
cute.recast_layout | 18 | anchor &unk_5B496B8 | recast layout | dialects/cute.md |
cute.right_inverse | 18 | anchor &unk_5B496B8 | layout inverse | dialects/cute.md |
cute.select | 11 | anchor &unk_5B496B8 | layout selector | dialects/cute.md |
cute.size | 9 | anchor &unk_5B496B8 | layout size | dialects/cute.md |
cute.static | 11 | anchor &unk_5B496B8 | static-shape attr op | dialects/cute.md |
cute.stencil_divide | 19 | anchor &unk_5B496B8 | stencil divide | dialects/cute.md |
cute.tile_to_shape | 18 | anchor &unk_5B496B8 | tile materialisation | dialects/cute.md |
cute.tiled_divide | 17 | anchor &unk_5B496B8 | tiled divide | dialects/cute.md |
cute.tiled.copy.partition_D | 27 | anchor &unk_5B496B8 | tiled-copy D-partition | dialects/cute.md |
cute.tiled.copy.partition_S | 27 | anchor &unk_5B496B8 | tiled-copy S-partition | dialects/cute.md |
cute.tiled.copy.retile | 22 | anchor &unk_5B496B8 | tiled-copy retile | dialects/cute.md |
cute.tiled.mma.partition | 24 | anchor &unk_5B496B8 | tiled-MMA partition | dialects/cute.md |
cute.tiled.mma.partition_shape | 30 | anchor &unk_5B496B8 | tiled-MMA partition shape | dialects/cute.md |
cute.unpack_tuple | 17 | anchor &unk_5B496B8 | tuple unpacker | dialects/cute.md |
§5 cute_nvgpu.* (73 ops)
TypeID slab range 0x5B47FF8..0x5B481A8 (54 slots, 8-byte stride);
the remaining ops fall into per-op accessor singletons in the same arena.
Anchor &unk_5B482C8.
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
cute_nvgpu.arch.alloc_rmem | 26 | range 0x5B47FF8..0x5B481A8 | rmem allocation | dialects/cute_nvgpu.md |
cute_nvgpu.arch.alloc_smem | 26 | range 0x5B47FF8..0x5B481A8 | smem allocation | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM100.copy_s2t | 35 | range 0x5B47FF8..0x5B481A8 | smem→tmem copy (Blackwell) | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM100.tma_load | 35 | range 0x5B47FF8..0x5B481A8 | TMA load (Blackwell) | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM100.tma_reduce | 37 | range 0x5B47FF8..0x5B481A8 | TMA reduce (Blackwell) | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM100.tma_store | 36 | range 0x5B47FF8..0x5B481A8 | TMA store (Blackwell) | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM100.tmem_load | 36 | range 0x5B47FF8..0x5B481A8 | TMEM load | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM100.tmem_store | 37 | range 0x5B47FF8..0x5B481A8 | TMEM store | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.SM80.cp_async | 34 | range 0x5B47FF8..0x5B481A8 | Ampere cp.async | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.ldsm | 25 | range 0x5B47FF8..0x5B481A8 | ldmatrix family | dialects/cute_nvgpu.md |
cute_nvgpu.arch.copy.stsm | 25 | range 0x5B47FF8..0x5B481A8 | stmatrix family | dialects/cute_nvgpu.md |
cute_nvgpu.arch.get_dyn_smem | 28 | range 0x5B47FF8..0x5B481A8 | dynamic-smem accessor | dialects/cute_nvgpu.md |
cute_nvgpu.arch.get_dyn_smem_size | 33 | range 0x5B47FF8..0x5B481A8 | dynamic-smem size query | dialects/cute_nvgpu.md |
cute_nvgpu.arch.make_warp_uniform | 33 | range 0x5B47FF8..0x5B481A8 | warp-uniform marker | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM100.umma | 30 | range 0x5B47FF8..0x5B481A8 | Blackwell UMMA | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM100.umma_block_scaled | 43 | range 0x5B47FF8..0x5B481A8 | Blackwell UMMA block-scaled | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM100.umma_block_scaled_sparse | 50 | range 0x5B47FF8..0x5B481A8 | Blackwell UMMA block-scaled sparse | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM100.umma_sparse | 37 | range 0x5B47FF8..0x5B481A8 | Blackwell UMMA sparse | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM120.block_scaled | 38 | range 0x5B47FF8..0x5B481A8 | sm_120 block-scaled MMA | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM80 | 24 | range 0x5B47FF8..0x5B481A8 | Ampere MMA | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM80.sparse | 31 | range 0x5B47FF8..0x5B481A8 | Ampere MMA sparse | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM89 | 24 | range 0x5B47FF8..0x5B481A8 | Ada MMA | dialects/cute_nvgpu.md |
cute_nvgpu.arch.mma.SM90 | 24 | range 0x5B47FF8..0x5B481A8 | Hopper WGMMA | dialects/cute_nvgpu.md |
cute_nvgpu.arch.prefetch_tma_desc | 33 | range 0x5B47FF8..0x5B481A8 | TMA desc prefetch | dialects/cute_nvgpu.md |
cute_nvgpu.arch.sm100.alloc_tmem | 32 | range 0x5B47FF8..0x5B481A8 | TMEM alloc | dialects/cute_nvgpu.md |
cute_nvgpu.arch.sm100.dealloc_tmem | 34 | range 0x5B47FF8..0x5B481A8 | TMEM dealloc | dialects/cute_nvgpu.md |
cute_nvgpu.arch.sm100.relinquish_tmem_alloc_permit | 50 | range 0x5B47FF8..0x5B481A8 | TMEM permit release | dialects/cute_nvgpu.md |
cute_nvgpu.arch.sm100.retrieve_tmem_ptr | 39 | range 0x5B47FF8..0x5B481A8 | TMEM pointer retrieval | dialects/cute_nvgpu.md |
cute_nvgpu.atom.get_copy_s2t_smem_desc_view | 43 | range 0x5B47FF8..0x5B481A8 | atom accessor: s2t smem-desc | dialects/cute_nvgpu.md |
cute_nvgpu.atom.get_value | 25 | range 0x5B47FF8..0x5B481A8 | atom value accessor | dialects/cute_nvgpu.md |
cute_nvgpu.atom.ldsm | 20 | range 0x5B47FF8..0x5B481A8 | ldmatrix atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_exec_tma | 29 | range 0x5B47FF8..0x5B481A8 | executable TMA atom builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_non_exec_tiled_tma_load | 44 | range 0x5B47FF8..0x5B481A8 | non-exec tiled TMA load builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_non_exec_tiled_tma_reduce | 46 | range 0x5B47FF8..0x5B481A8 | non-exec tiled TMA reduce builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_s2t_copy | 29 | range 0x5B47FF8..0x5B481A8 | s2t copy atom builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_tma_load | 29 | range 0x5B47FF8..0x5B481A8 | TMA load atom builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_tma_reduce | 31 | range 0x5B47FF8..0x5B481A8 | TMA reduce atom builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_tma_store | 30 | range 0x5B47FF8..0x5B481A8 | TMA store atom builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.make_tmem_copy | 30 | range 0x5B47FF8..0x5B481A8 | TMEM copy atom builder | dialects/cute_nvgpu.md |
cute_nvgpu.atom.non_exec_tiled_tma_load | 39 | range 0x5B47FF8..0x5B481A8 | non-exec tiled TMA load atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.non_exec_tiled_tma_reduce | 41 | range 0x5B47FF8..0x5B481A8 | non-exec tiled TMA reduce atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.non_exec_tiled_tma_store | 40 | range 0x5B47FF8..0x5B481A8 | non-exec tiled TMA store atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.s2t_copy | 24 | range 0x5B47FF8..0x5B481A8 | s2t copy atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.simt_async_copy | 31 | range 0x5B47FF8..0x5B481A8 | SIMT async copy atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.stsm | 20 | range 0x5B47FF8..0x5B481A8 | stmatrix atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.tma_load | 24 | range 0x5B47FF8..0x5B481A8 | TMA load atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.tma_reduce | 26 | range 0x5B47FF8..0x5B481A8 | TMA reduce atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.tma_store | 25 | range 0x5B47FF8..0x5B481A8 | TMA store atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.tmem_load | 25 | range 0x5B47FF8..0x5B481A8 | TMEM load atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.tmem_store | 26 | range 0x5B47FF8..0x5B481A8 | TMEM store atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.universal_copy | 30 | range 0x5B47FF8..0x5B481A8 | universal copy atom | dialects/cute_nvgpu.md |
cute_nvgpu.atom.universal_fma | 29 | range 0x5B47FF8..0x5B481A8 | universal FMA atom | dialects/cute_nvgpu.md |
cute_nvgpu.cast_tma_desc_to_integer | 35 | range 0x5B47FF8..0x5B481A8 | TMA desc-to-int reinterpret | dialects/cute_nvgpu.md |
cute_nvgpu.copy_tma_desc | 24 | range 0x5B47FF8..0x5B481A8 | TMA desc copy | dialects/cute_nvgpu.md |
cute_nvgpu.get_grid_constant_pointer | 36 | range 0x5B47FF8..0x5B481A8 | nvvm.grid_constant accessor | dialects/cute_nvgpu.md |
cute_nvgpu.get_tma_desc_addr | 28 | range 0x5B47FF8..0x5B481A8 | TMA desc-address probe | dialects/cute_nvgpu.md |
cute_nvgpu.make_sm120_mma_bs | 28 | range 0x5B47FF8..0x5B481A8 | sm_120 block-scaled MMA constructor | dialects/cute_nvgpu.md |
cute_nvgpu.make_tma_desc_im2col | 31 | range 0x5B47FF8..0x5B481A8 | TMA im2col desc builder | dialects/cute_nvgpu.md |
cute_nvgpu.make_tma_desc_im2col_at | 34 | range 0x5B47FF8..0x5B481A8 | TMA im2col desc builder (at) | dialects/cute_nvgpu.md |
cute_nvgpu.make_tma_desc_tiled | 30 | range 0x5B47FF8..0x5B481A8 | TMA tiled desc builder | dialects/cute_nvgpu.md |
cute_nvgpu.make_tma_desc_tiled_at | 33 | range 0x5B47FF8..0x5B481A8 | TMA tiled desc builder (at) | dialects/cute_nvgpu.md |
cute_nvgpu.prefetch_tma_desc | 28 | range 0x5B47FF8..0x5B481A8 | TMA desc prefetch | dialects/cute_nvgpu.md |
cute_nvgpu.sm100.mma | 20 | range 0x5B47FF8..0x5B481A8 | Blackwell MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm100.mma_bs | 23 | range 0x5B47FF8..0x5B481A8 | Blackwell block-scaled MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm100.mma_bs_sp | 26 | range 0x5B47FF8..0x5B481A8 | Blackwell block-scaled sparse MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm100.mma_sp | 23 | range 0x5B47FF8..0x5B481A8 | Blackwell sparse MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm120.mma_bs | 23 | range 0x5B47FF8..0x5B481A8 | sm_120 block-scaled MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm80.mma | 19 | range 0x5B47FF8..0x5B481A8 | Ampere MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm80.sparse_mma | 26 | range 0x5B47FF8..0x5B481A8 | Ampere sparse MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm89.mma | 19 | range 0x5B47FF8..0x5B481A8 | Ada MMA | dialects/cute_nvgpu.md |
cute_nvgpu.sm90.mma | 19 | range 0x5B47FF8..0x5B481A8 | Hopper WGMMA | dialects/cute_nvgpu.md |
cute_nvgpu.smem_desc_view | 25 | range 0x5B47FF8..0x5B481A8 | smem descriptor view | dialects/cute_nvgpu.md |
cute_nvgpu.update_tma_desc | 26 | range 0x5B47FF8..0x5B481A8 | TMA desc mutate | dialects/cute_nvgpu.md |
§6 cutlass.* (72 ops, 38 unique families)
Fold-record range 0x5B47490..0x5B476A0 covers the op-info blocks.
Includes block_striped collectives, generic and named barriers, the
pipeline state machine, the seq_bar protocol, and the tile_scheduler
family (DP, static-persistent, StreamK, MODS-trace).
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
cutlass.async.exec | 18 | range 0x5B47490..0x5B476A0 | async-execute wrapper | dialects/cutlass.md |
cutlass.barrier_id | 18 | range 0x5B47490..0x5B476A0 | barrier-id allocator | dialects/cutlass.md |
cutlass.block_striped.load | 26 | range 0x5B47490..0x5B476A0 | block-striped load | dialects/cutlass.md |
cutlass.block_striped.load_add | 30 | range 0x5B47490..0x5B476A0 | block-striped load+add | dialects/cutlass.md |
cutlass.block_striped.reduce | 28 | range 0x5B47490..0x5B476A0 | block-striped reduce | dialects/cutlass.md |
cutlass.block_striped.store | 27 | range 0x5B47490..0x5B476A0 | block-striped store | dialects/cutlass.md |
cutlass.generic_barrier.arrive_increment | 40 | range 0x5B47490..0x5B476A0 | generic-barrier arrive-increment | dialects/cutlass.md |
cutlass.generic_barrier.sync | 28 | range 0x5B47490..0x5B476A0 | generic-barrier sync | dialects/cutlass.md |
cutlass.generic_barrier.wait_eq | 31 | range 0x5B47490..0x5B476A0 | generic-barrier wait-eq | dialects/cutlass.md |
cutlass.generic_barrier.wait_less_than | 38 | range 0x5B47490..0x5B476A0 | generic-barrier wait-less-than | dialects/cutlass.md |
cutlass.named_barrier.arrive | 28 | range 0x5B47490..0x5B476A0 | named-barrier arrive | dialects/cutlass.md |
cutlass.named_barrier.arrive_and_wait | 37 | range 0x5B47490..0x5B476A0 | named-barrier arrive+wait | dialects/cutlass.md |
cutlass.pipeline.consume | 24 | range 0x5B47490..0x5B476A0 | pipeline consume | dialects/cutlass.md |
cutlass.pipeline.consumer_release | 33 | range 0x5B47490..0x5B476A0 | consumer release | dialects/cutlass.md |
cutlass.pipeline.consumer_try_wait | 34 | range 0x5B47490..0x5B476A0 | consumer try-wait | dialects/cutlass.md |
cutlass.pipeline.consumer_wait | 30 | range 0x5B47490..0x5B476A0 | consumer wait | dialects/cutlass.md |
cutlass.pipeline.create | 23 | range 0x5B47490..0x5B476A0 | pipeline ctor | dialects/cutlass.md |
cutlass.pipeline.get_producer_barrier | 37 | range 0x5B47490..0x5B476A0 | producer-barrier query | dialects/cutlass.md |
cutlass.pipeline.get_producer_mask | 34 | range 0x5B47490..0x5B476A0 | producer-mask query | dialects/cutlass.md |
cutlass.pipeline.init | 21 | range 0x5B47490..0x5B476A0 | pipeline init | dialects/cutlass.md |
cutlass.pipeline.make_participants | 34 | range 0x5B47490..0x5B476A0 | participant set construction | dialects/cutlass.md |
cutlass.pipeline.produce | 24 | range 0x5B47490..0x5B476A0 | pipeline produce | dialects/cutlass.md |
cutlass.pipeline.producer_acquire | 33 | range 0x5B47490..0x5B476A0 | producer acquire | dialects/cutlass.md |
cutlass.pipeline.producer_commit | 32 | range 0x5B47490..0x5B476A0 | producer commit | dialects/cutlass.md |
cutlass.pipeline.producer_tail | 30 | range 0x5B47490..0x5B476A0 | producer tail | dialects/cutlass.md |
cutlass.pipeline.producer_try_acquire | 37 | range 0x5B47490..0x5B476A0 | producer try-acquire | dialects/cutlass.md |
cutlass.pipeline.state.create | 29 | range 0x5B47490..0x5B476A0 | state ctor | dialects/cutlass.md |
cutlass.pipeline.state.get_count | 32 | range 0x5B47490..0x5B476A0 | state count accessor | dialects/cutlass.md |
cutlass.pipeline.state.get_index | 32 | range 0x5B47490..0x5B476A0 | state index accessor | dialects/cutlass.md |
cutlass.pipeline.state.get_phase | 32 | range 0x5B47490..0x5B476A0 | state phase accessor | dialects/cutlass.md |
cutlass.pipeline.state.increment | 32 | range 0x5B47490..0x5B476A0 | state increment | dialects/cutlass.md |
cutlass.pipeline.switch_by_executor | 35 | range 0x5B47490..0x5B476A0 | executor-keyed dispatch | dialects/cutlass.md |
cutlass.seq_bar.arrive | 22 | range 0x5B47490..0x5B476A0 | seq-bar arrive | dialects/cutlass.md |
cutlass.seq_bar.create | 22 | range 0x5B47490..0x5B476A0 | seq-bar ctor | dialects/cutlass.md |
cutlass.seq_bar.init | 20 | range 0x5B47490..0x5B476A0 | seq-bar init | dialects/cutlass.md |
cutlass.seq_bar.state.create | 28 | range 0x5B47490..0x5B476A0 | seq-bar state ctor | dialects/cutlass.md |
cutlass.seq_bar.wait | 20 | range 0x5B47490..0x5B476A0 | seq-bar wait | dialects/cutlass.md |
cutlass.tile_scheduler.advance_to_next_work | 43 | range 0x5B47490..0x5B476A0 | scheduler advance | dialects/cutlass.md |
cutlass.tile_scheduler.compute_epilogue | 39 | range 0x5B47490..0x5B476A0 | epilogue trigger | dialects/cutlass.md |
cutlass.tile_scheduler.create_dp_params | 39 | range 0x5B47490..0x5B476A0 | DP scheduler params ctor | dialects/cutlass.md |
cutlass.tile_scheduler.create_dp_work_tile_info | 47 | range 0x5B47490..0x5B476A0 | DP work-tile-info ctor | dialects/cutlass.md |
cutlass.tile_scheduler.create_SM100_scheduler | 45 | range 0x5B47490..0x5B476A0 | sm_100 scheduler factory | dialects/cutlass.md |
cutlass.tile_scheduler.create_static_persistent_params | 54 | range 0x5B47490..0x5B476A0 | static-persistent params ctor | dialects/cutlass.md |
cutlass.tile_scheduler.create_static_persistent_work_tile_info | 62 | range 0x5B47490..0x5B476A0 | static-persistent work-tile-info ctor | dialects/cutlass.md |
cutlass.tile_scheduler.create_streamk_params | 44 | range 0x5B47490..0x5B476A0 | StreamK params ctor | dialects/cutlass.md |
cutlass.tile_scheduler.create_streamk_work_tile_info | 52 | range 0x5B47490..0x5B476A0 | StreamK work-tile-info ctor | dialects/cutlass.md |
cutlass.tile_scheduler.fetch_next_work | 38 | range 0x5B47490..0x5B476A0 | fetch next work | dialects/cutlass.md |
cutlass.tile_scheduler.fixup | 28 | range 0x5B47490..0x5B476A0 | partial-tile fixup | dialects/cutlass.md |
cutlass.tile_scheduler.fixup_increment | 38 | range 0x5B47490..0x5B476A0 | fixup increment | dialects/cutlass.md |
cutlass.tile_scheduler.fixup_wait | 33 | range 0x5B47490..0x5B476A0 | fixup wait | dialects/cutlass.md |
cutlass.tile_scheduler.get_current_work | 39 | range 0x5B47490..0x5B476A0 | current work accessor | dialects/cutlass.md |
cutlass.tile_scheduler.get_grid_shape | 37 | range 0x5B47490..0x5B476A0 | grid-shape accessor | dialects/cutlass.md |
cutlass.tile_scheduler.get_workid_response_ptr | 46 | range 0x5B47490..0x5B476A0 | workid response ptr | dialects/cutlass.md |
cutlass.tile_scheduler.get_work_k_tile_count | 44 | range 0x5B47490..0x5B476A0 | work k-tile count | dialects/cutlass.md |
cutlass.tile_scheduler.get_work_k_tile_start | 44 | range 0x5B47490..0x5B476A0 | work k-tile start | dialects/cutlass.md |
cutlass.tile_scheduler.get_workspace_sizes | 42 | range 0x5B47490..0x5B476A0 | workspace sizes | dialects/cutlass.md |
cutlass.tile_scheduler.initial_work_tile_info | 45 | range 0x5B47490..0x5B476A0 | initial work-tile info | dialects/cutlass.md |
cutlass.tile_scheduler.initialize_workspace | 43 | range 0x5B47490..0x5B476A0 | initialize workspace | dialects/cutlass.md |
cutlass.tile_scheduler.make_dp_params | 37 | range 0x5B47490..0x5B476A0 | DP params builder | dialects/cutlass.md |
cutlass.tile_scheduler.make_static_persistent_params | 52 | range 0x5B47490..0x5B476A0 | static-persistent params builder | dialects/cutlass.md |
cutlass.tile_scheduler.make_streamk_params | 42 | range 0x5B47490..0x5B476A0 | StreamK params builder | dialects/cutlass.md |
cutlass.tile_scheduler.mods_report_mainloop_end | 47 | range 0x5B47490..0x5B476A0 | MODS-trace mainloop end | dialects/cutlass.md |
cutlass.tile_scheduler.mods_report_mainloop_start | 49 | range 0x5B47490..0x5B476A0 | MODS-trace mainloop start | dialects/cutlass.md |
cutlass.tile_scheduler.mods_report_smid | 39 | range 0x5B47490..0x5B476A0 | MODS-trace smid report | dialects/cutlass.md |
cutlass.tile_scheduler.mods_throttle | 36 | range 0x5B47490..0x5B476A0 | MODS-trace throttle | dialects/cutlass.md |
cutlass.tile_scheduler.params_get_value | 39 | range 0x5B47490..0x5B476A0 | params accessor | dialects/cutlass.md |
cutlass.tile_scheduler.query_next_work | 38 | range 0x5B47490..0x5B476A0 | query next work | dialects/cutlass.md |
cutlass.tile_scheduler.static_fetch_next_work | 45 | range 0x5B47490..0x5B476A0 | static fetch next work | dialects/cutlass.md |
cutlass.tile_scheduler.work_tile_info_get_value | 47 | range 0x5B47490..0x5B476A0 | work-tile-info accessor | dialects/cutlass.md |
cutlass.tile_scheduler.work_tile_info_set_value | 47 | range 0x5B47490..0x5B476A0 | work-tile-info mutator | dialects/cutlass.md |
cutlass.tile_scheduler.work_tile_info_to_coord_mnkl | 51 | range 0x5B47490..0x5B476A0 | work-tile-info MNKL coords | dialects/cutlass.md |
cutlass.tile_scheduler.work_tile_info_to_cta_coord | 50 | range 0x5B47490..0x5B476A0 | work-tile-info CTA coords | dialects/cutlass.md |
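
The cutlass.pipeline.state.* rows above (create, get_count, get_index, get_phase, increment) suggest a small state record. The sketch below is an assumption, not something recovered from tileiras: it models the state the way the open-source CUTLASS `PipelineState` behaves, where the index wraps at the stage count and the phase bit flips on each wrap. The class name, field names, and `stages` parameter are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    """Hypothetical model of a pipeline state with index, phase, and count."""

    stages: int      # number of pipeline stages (assumed to be fixed at creation)
    index: int = 0   # current stage slot, always in [0, stages)
    phase: int = 0   # parity bit, flipped every time index wraps
    count: int = 0   # total number of increments since creation

    def __post_init__(self) -> None:
        if self.stages <= 0:
            raise ValueError("stages must be positive")

    def increment(self) -> None:
        """Advance one stage; wrap the index and flip the phase at the end."""
        self.index += 1
        self.count += 1
        if self.index == self.stages:
            self.index = 0
            self.phase ^= 1


state = PipelineState(stages=3)
for _ in range(4):
    state.increment()
# After four increments on three stages the index has wrapped once.
print(state.index, state.phase, state.count)
```

Whether the tileiras ops carry the stage count in the state value or in the op's type is not established here; the sketch keeps it as a field purely for convenience.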
§7 mlir::nvgpu.* (upstream, observed in lowerings)
Upstream MLIR nvgpu dialect, statically linked into tileiras. The
upstream registration provides the dialect TypeID anchor; tileiras's own
registrar does not expose per-op TypeIDs. The list below enumerates
every upstream nvgpu.* mnemonic observed in tileiras-driven lowerings
(produced by convert-nvgpu-to-nvvm consumers and equivalent upstream
dialects).
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvgpu.device_async_copy | 23 | upstream | device-async copy | dialects/upstream-nvgpu.md |
nvgpu.device_async_create_group | 31 | upstream | device-async group ctor | dialects/upstream-nvgpu.md |
nvgpu.device_async_wait | 23 | upstream | device-async wait | dialects/upstream-nvgpu.md |
nvgpu.ldmatrix | 14 | upstream | ldmatrix wrapper | dialects/upstream-nvgpu.md |
nvgpu.mma.sp.sync | 17 | upstream | sparse MMA sync | dialects/upstream-nvgpu.md |
nvgpu.mma.sync | 14 | upstream | dense MMA sync | dialects/upstream-nvgpu.md |
nvgpu.tma.async.load | 20 | upstream | TMA async load | dialects/upstream-nvgpu.md |
nvgpu.tma.async.store | 21 | upstream | TMA async store | dialects/upstream-nvgpu.md |
nvgpu.tma.create.descriptor | 27 | upstream | TMA descriptor ctor | dialects/upstream-nvgpu.md |
nvgpu.warpgroup.generate.descriptor | 35 | upstream | warpgroup descriptor ctor | dialects/upstream-nvgpu.md |
nvgpu.warpgroup.mma | 19 | upstream | warpgroup MMA | dialects/upstream-nvgpu.md |
nvgpu.warpgroup.mma.init.accumulator | 36 | upstream | warpgroup MMA acc init | dialects/upstream-nvgpu.md |
§8 NVVM.* (213 ops)
TypeID slab 0x5B8D610..0x5B8DCB8 (1704 bytes / 8 = 213 entries, 8-byte
stride, dense). Dialect TypeID &unk_5B8DCC0 sits 8 bytes above the
highest op slot. The slab is walked via RegisteredOperationName::insert
at sub_4461CA0, invoked from the registrar driver sub_2EFC390. Rows
below follow the categorical roster from p5-HH01: alphabetical within
each category where the registrar permits it, registrar walk order
otherwise.
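
The slab arithmetic quoted above can be checked mechanically. The following is a minimal sketch that uses only the addresses stated in this section; the helper names (`slab_entry_count`, `parse_unk`, `slot_offset`) are hypothetical and do not correspond to anything in the binary.

```python
SLAB_BASE = 0x5B8D610       # lower slab bound as quoted in this section
SLAB_END = 0x5B8DCB8        # upper slab bound as quoted in this section
STRIDE = 8                  # bytes per TypeID slot
DIALECT_TYPEID = 0x5B8DCC0  # dialect-level TypeID anchor


def slab_entry_count(base: int = SLAB_BASE, end: int = SLAB_END,
                     stride: int = STRIDE) -> int:
    """Number of stride-sized entries spanned by the quoted bounds."""
    span = end - base
    if span < 0 or span % stride:
        raise ValueError("bounds are not a whole number of slots apart")
    return span // stride


def parse_unk(symbol: str) -> int:
    """Turn a roster symbol such as '&unk_5B8DC80' into an address."""
    prefix = "&unk_"
    if not symbol.startswith(prefix):
        raise ValueError(f"not an &unk_ symbol: {symbol!r}")
    return int(symbol[len(prefix):], 16)


def slot_offset(symbol: str, base: int = SLAB_BASE,
                stride: int = STRIDE) -> int:
    """Distance of a TypeID symbol from the slab base, in slots."""
    offset = parse_unk(symbol) - base
    if offset < 0 or offset % stride:
        raise ValueError(f"{symbol} is not slot-aligned inside the slab")
    return offset // stride


print(slab_entry_count())                              # 1704 / 8
print(parse_unk("&unk_5B8DCB8") + STRIDE == DIALECT_TYPEID)
```

A check like this is mainly useful as a sanity pass over the roster: every `&unk_*` value in the §8 tables should be 8-byte aligned relative to the base and should sit below the dialect anchor.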
§8.1 Barriers (10)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.barrier | 0xC | &unk_5B8DC80 | block-level barrier | dialects/nvvm.md |
nvvm.barrier0 | 0xD | &unk_5B8DCA8 | legacy bar.sync 0 | dialects/nvvm.md |
nvvm.barrier.arrive | 0x13 | &unk_5B8DCA0 | barrier arrive | dialects/nvvm.md |
nvvm.barrier.cta.arrive | 0x17 | &unk_5B8DC98 | CTA barrier arrive | dialects/nvvm.md |
nvvm.barrier.cta.red | 0x14 | &unk_5B8DC90 | CTA barrier reduction | dialects/nvvm.md |
nvvm.barrier.cta.sync | 0x15 | &unk_5B8DC88 | CTA barrier sync | dialects/nvvm.md |
nvvm.bar.warp.sync | 0x12 | &unk_5B8D758 | bar.warp.sync | dialects/nvvm.md |
nvvm.cluster.arrive | 0x13 | &unk_5B8DC10 | cluster arrive | dialects/nvvm.md |
nvvm.cluster.arrive.relaxed | 0x1B | &unk_5B8DC08 | cluster arrive relaxed | dialects/nvvm.md |
nvvm.cluster.wait | 0x11 | &unk_5B8DB70 | cluster wait | dialects/nvvm.md |
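
In these rosters the length column appears to be the byte length of the mnemonic string, written in decimal in the earlier sections and mostly in hexadecimal in §8. A minimal sketch of a row parser that accepts both spellings and cross-checks the length is shown below; `parse_row` and `length_matches` are hypothetical helper names introduced for illustration.

```python
def parse_row(line: str) -> dict:
    """Split one roster row into its five cells.

    Rows may or may not carry a leading pipe, so outer pipes are stripped
    before splitting.
    """
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 5:
        raise ValueError(f"expected 5 cells, got {len(cells)}")
    mnemonic_cell, length_text, typeid, semantic, page = cells
    return {
        # Keep only the first token so annotated cells still yield the name.
        "mnemonic": mnemonic_cell.split()[0],
        # Base 0 lets int() accept both "25" and "0x19".
        "length": int(length_text, 0),
        "typeid": typeid,
        "semantic": semantic,
        "page": page,
    }


def length_matches(row: dict) -> bool:
    """True when the recorded length equals the mnemonic's byte length."""
    return row["length"] == len(row["mnemonic"].encode("ascii"))


hex_row = parse_row(
    "nvvm.barrier | 0xC | &unk_5B8DC80 | block-level barrier | dialects/nvvm.md |"
)
dec_row = parse_row(
    "cutlass.seq_bar.wait | 20 | range 0x5B47490..0x5B476A0 | seq-bar wait | dialects/cutlass.md |"
)
print(length_matches(hex_row), length_matches(dec_row))
```

Running the check over every row is a cheap way to catch a mistyped mnemonic, since a changed character count no longer agrees with the recorded length.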
§8.2 mbarrier (20)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.mbarrier.arrive | 0x14 | &unk_5B8D870 | mbarrier arrive | dialects/nvvm.md |
nvvm.mbarrier.arrive.expect_tx | 0x1E | &unk_5B8D890 | arrive with tx-count expectation | dialects/nvvm.md |
nvvm.mbarrier.arrive.expect_tx.shared | 0x25 | &unk_5B8D888 | arrive expect_tx (shared) | dialects/nvvm.md |
nvvm.mbarrier.arrive.nocomplete | 0x1F | &unk_5B8D880 | arrive nocomplete | dialects/nvvm.md |
nvvm.mbarrier.arrive.nocomplete.shared | 0x26 | &unk_5B8D878 | arrive nocomplete (shared) | dialects/nvvm.md |
nvvm.mbarrier.arrive.shared | 0x1B | &unk_5B8D868 | arrive (shared) | dialects/nvvm.md |
nvvm.mbarrier.init | 0x12 | &unk_5B8D860 | mbarrier init | dialects/nvvm.md |
nvvm.mbarrier.init.shared | 0x19 | &unk_5B8D858 | mbarrier init (shared) | dialects/nvvm.md |
nvvm.mbarrier.inval | 0x13 | &unk_5B8D850 | mbarrier invalidate | dialects/nvvm.md |
nvvm.mbarrier.inval.shared | 0x1A | &unk_5B8D848 | mbarrier invalidate (shared) | dialects/nvvm.md |
nvvm.mbarrier.test.wait | 0x17 | &unk_5B8D840 | mbarrier test-wait | dialects/nvvm.md |
nvvm.mbarrier.test.wait.shared | 0x1E | &unk_5B8D838 | mbarrier test-wait (shared) | dialects/nvvm.md |
nvvm.mbarrier.try_wait.parity | 0x1D | &unk_5B8D820 | try-wait parity | dialects/nvvm.md |
nvvm.mbarrier.try_wait.parity.shared | 0x24 | &unk_5B8D818 | try-wait parity (shared) | dialects/nvvm.md |
nvvm.mbarrier.try_wait.parity.timelimit | 0x27 | &unk_5B8D810 | try-wait parity timelimit | dialects/nvvm.md |
nvvm.mbarrier.try_wait.timelimit | 0x20 | &unk_5B8D808 | try-wait timelimit | dialects/nvvm.md |
nvvm.mbarrier.txn | 0x11 | &unk_5B8D828 | mbarrier transaction count | dialects/nvvm.md |
nvvm.mbarrier.txn.cta | 0x15 | &unk_5B8D830 | mbarrier transaction (CTA) | dialects/nvvm.md |
nvvm.mbarrier.wait | 0x12 | &unk_5B8D800 | mbarrier wait | dialects/nvvm.md |
nvvm.mbarrier.wait.parity | 0x19 | &unk_5B8D7F8 | mbarrier wait parity | dialects/nvvm.md |
§8.3 TMA / cp.async.bulk (12)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.cp.async.bulk.commit.group | 0x1F | &unk_5B8DB20 | bulk commit group | dialects/nvvm.md |
nvvm.cp.async.bulk.global.shared.cta | 0x24 | &unk_5B8DB08 | bulk global←shared.cta | dialects/nvvm.md |
nvvm.cp.async.bulk.prefetch | 0x1B | &unk_5B8DB10 | bulk prefetch | dialects/nvvm.md |
nvvm.cp.async.bulk.shared.cluster.global | 0x28 | &unk_5B8DB18 | bulk shared.cluster←global | dialects/nvvm.md |
nvvm.cp.async.bulk.shared.cluster.shared.cta | 0x2C | &unk_5B8DB00 | bulk shared.cluster←shared.cta | dialects/nvvm.md |
nvvm.cp.async.bulk.tensor.global.shared.cta | 0x2B | &unk_5B8DAD0 | TMA tensor global←shared.cta | dialects/nvvm.md |
nvvm.cp.async.bulk.tensor.global.shared.cta.ext | 0x2F | &unk_5B8DAD8 | TMA tensor global←shared.cta ext | dialects/nvvm.md |
nvvm.cp.async.bulk.tensor.prefetch | 0x22 | &unk_5B8DAE8 | TMA tensor prefetch | dialects/nvvm.md |
nvvm.cp.async.bulk.tensor.reduce | 0x20 | &unk_5B8DAE0 | TMA tensor reduce | dialects/nvvm.md |
nvvm.cp.async.bulk.tensor.shared.cluster.global | 0x2F | &unk_5B8DAF0 | TMA tensor shared.cluster←global | dialects/nvvm.md |
nvvm.cp.async.bulk.tensor.shared.cta.global | 0x2B | &unk_5B8DAF8 | TMA tensor shared.cta←global | dialects/nvvm.md |
nvvm.cp.async.bulk.wait_group | 0x1D | &unk_5B8DAC8 | bulk wait group | dialects/nvvm.md |
§8.4 cp.async (Ampere) (5)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.cp.async.commit.group | 0x1A | &unk_5B8DAC0 | cp.async commit group | dialects/nvvm.md |
nvvm.cp.async.mbarrier.arrive | 0x1D | &unk_5B8DAB8 | cp.async mbarrier arrive | dialects/nvvm.md |
nvvm.cp.async.mbarrier.arrive.shared | 0x24 | &unk_5B8DAB0 | cp.async mbarrier arrive (shared) | dialects/nvvm.md |
nvvm.cp.async.shared.global | 0x1B | &unk_5B8DAA8 | cp.async shared←global | dialects/nvvm.md |
nvvm.cp.async.wait.group | 0x18 | &unk_5B8DAA0 | cp.async wait group | dialects/nvvm.md |
§8.5 tcgen05 (Blackwell) (18)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.tcgen05.alloc | 0x12 | &unk_5B8D750 | tcgen05 alloc | dialects/nvvm.md |
nvvm.tcgen05.commit | 0x13 | &unk_5B8D740 | tcgen05 commit | dialects/nvvm.md |
nvvm.tcgen05.commit.arrive | 0x1A | &unk_5B8D748 | tcgen05 commit-arrive | dialects/nvvm.md |
nvvm.tcgen05.cp | 0xF | &unk_5B8D738 | tcgen05 copy | dialects/nvvm.md |
nvvm.tcgen05.dealloc | 0x14 | &unk_5B8D730 | tcgen05 dealloc | dialects/nvvm.md |
nvvm.tcgen05.fence | 0x12 | &unk_5B8D728 | tcgen05 fence | dialects/nvvm.md |
nvvm.tcgen05.ld | 0xF | &unk_5B8D720 | tcgen05 load | dialects/nvvm.md |
nvvm.tcgen05.mma | 0x10 | &unk_5B8D710 | tcgen05 MMA | dialects/nvvm.md |
nvvm.tcgen05.mma.block_scale | 0x1C | &unk_5B8D718 | tcgen05 MMA block-scale | dialects/nvvm.md |
nvvm.tcgen05.mma_smem_desc | 0x1A | &unk_5B8D6E8 | tcgen05 mma smem desc | dialects/nvvm.md |
nvvm.tcgen05.mma.sp | 0x13 | &unk_5B8D700 | tcgen05 MMA sparse | dialects/nvvm.md |
nvvm.tcgen05.mma.sp.block_scale | 0x1F | &unk_5B8D708 | tcgen05 MMA sparse block-scale | dialects/nvvm.md |
nvvm.tcgen05.mma.ws | 0x13 | &unk_5B8D6F8 | tcgen05 MMA weight-stationary | dialects/nvvm.md |
nvvm.tcgen05.mma.ws.sp | 0x16 | &unk_5B8D6F0 | tcgen05 MMA weight-stationary sparse | dialects/nvvm.md |
nvvm.tcgen05.relinquish_alloc_permit | 0x24 | &unk_5B8D6E0 | tcgen05 relinquish permit | dialects/nvvm.md |
nvvm.tcgen05.shift | 0x12 | &unk_5B8D6D8 | tcgen05 shift | dialects/nvvm.md |
nvvm.tcgen05.st | 0xF | &unk_5B8D6D0 | tcgen05 store | dialects/nvvm.md |
nvvm.tcgen05.wait | 0x11 | &unk_5B8D6C8 | tcgen05 wait | dialects/nvvm.md |
§8.6 wgmma / wmma / mma / ldmatrix-stmatrix (12)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.wgmma.commit.group.sync.aligned | 0x24 | &unk_5B8D620 | wgmma commit group sync | dialects/nvvm.md |
nvvm.wgmma.fence.aligned | 0x18 | &unk_5B8D628 | wgmma fence aligned | dialects/nvvm.md |
nvvm.wgmma.mma_async | 0x14 | &unk_5B8D618 | wgmma async MMA | dialects/nvvm.md |
nvvm.wmma.load | 0xE | &unk_5B8D658 | wmma load | dialects/nvvm.md |
nvvm.wmma.mma | 0xD | &unk_5B8D650 | wmma MMA | dialects/nvvm.md |
nvvm.wmma.store | 0xF | &unk_5B8D648 | wmma store | dialects/nvvm.md |
nvvm.mma.block_scale | 0x14 | &unk_5B8D8D8 | MMA block-scale | dialects/nvvm.md |
nvvm.mma_smem_desc | 0x12 | &unk_5B8D7C8 | MMA smem desc | dialects/nvvm.md |
nvvm.mma.sparse.block_scale | 0x1B | &unk_5B8D8D0 | MMA sparse block-scale | dialects/nvvm.md |
nvvm.mma.sync | 0xD | &unk_5B8D7D0 | MMA sync | dialects/nvvm.md |
nvvm.ldmatrix | 0xD | &unk_5B8D898 | ldmatrix | dialects/nvvm.md |
nvvm.stmatrix | 0xD | &unk_5B8D768 | stmatrix | dialects/nvvm.md |
§8.7 shfl / vote / redux / match / elect (5)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.elect.sync | 0xF | &unk_5B8DA78 | elect leader | dialects/nvvm.md |
nvvm.match.sync | 0xF | &unk_5B8D7E8 | match.sync | dialects/nvvm.md |
nvvm.redux.sync | 0xF | &unk_5B8D790 | redux.sync | dialects/nvvm.md |
nvvm.shfl.sync | 0xE | &unk_5B8D780 | shfl.sync | dialects/nvvm.md |
nvvm.vote.sync | 0xE | &unk_5B8D660 | vote.sync | dialects/nvvm.md |
§8.8 Convert / cvt.packfloat (11)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.convert.bf16x2.to.f4x2 | 0x1B | &unk_5B8DB68 | bf16x2→f4x2 | dialects/nvvm.md |
nvvm.convert.bf16x2.to.f8x2 | 0x1B | &unk_5B8DB60 | bf16x2→f8x2 | dialects/nvvm.md |
nvvm.convert.f16x2.to.f4x2 | 0x1A | &unk_5B8DB58 | f16x2→f4x2 | dialects/nvvm.md |
nvvm.convert.f16x2.to.f8x2 | 0x1A | &unk_5B8DB50 | f16x2→f8x2 | dialects/nvvm.md |
nvvm.convert.f32x2.to.f4x2 | 0x1A | &unk_5B8DB48 | f32x2→f4x2 | dialects/nvvm.md |
nvvm.convert.f32x2.to.f6x2 | 0x1A | &unk_5B8DB40 | f32x2→f6x2 | dialects/nvvm.md |
nvvm.convert.f32x2.to.f8x2 | 0x1A | &unk_5B8DB38 | f32x2→f8x2 | dialects/nvvm.md |
nvvm.convert.f4x2.to.f16x2 | 0x1A | &unk_5B8DB30 | f4x2→f16x2 | dialects/nvvm.md |
nvvm.convert.float.to.tf32 | 0x1A | &unk_5B8DB28 | float→tf32 | dialects/nvvm.md |
nvvm.cvt.packfloat | 0x12 | &unk_5B8DA90 | cvt.packfloat | dialects/nvvm.md |
nvvm.cvt.packfloat.f32 | 0x16 | &unk_5B8DA98 | cvt.packfloat.f32 | dialects/nvvm.md |
§8.9 read.ptx.sreg.* (73)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.read.ptx.sreg.clock | 0x18 | &unk_5B8DC18 | sreg clock | dialects/nvvm.md |
nvvm.read.ptx.sreg.clock64 | 0x1A | &unk_5B8DC20 | sreg clock64 | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.ctaid.x | 0x22 | &unk_5B8DC48 | cluster.ctaid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.ctaid.y | 0x22 | &unk_5B8DC40 | cluster.ctaid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.ctaid.z | 0x22 | &unk_5B8DC38 | cluster.ctaid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.ctarank | 0x22 | &unk_5B8DBC8 | cluster.ctarank | dialects/nvvm.md |
nvvm.read.ptx.sreg.clusterid.x | 0x1E | &unk_5B8DBC0 | clusterid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.clusterid.y | 0x1E | &unk_5B8DBB8 | clusterid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.clusterid.z | 0x1E | &unk_5B8DBB0 | clusterid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.nctaid.x | 0x23 | &unk_5B8DBF8 | cluster.nctaid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.nctaid.y | 0x23 | &unk_5B8DBF0 | cluster.nctaid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.nctaid.z | 0x23 | &unk_5B8DBE8 | cluster.nctaid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.cluster.nctarank | 0x23 | &unk_5B8DC00 | cluster.nctarank | dialects/nvvm.md |
nvvm.read.ptx.sreg.ctaid.x | 0x1A | &unk_5B8DC60 | ctaid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.ctaid.y | 0x1A | &unk_5B8DC58 | ctaid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.ctaid.z | 0x1A | &unk_5B8DC50 | ctaid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg0 | 0x1A | &unk_5B8DA70 | envreg0 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg1 | 0x1A | &unk_5B8DA18 | envreg1 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg10 | 0x1B | &unk_5B8DA68 | envreg10 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg11 | 0x1B | &unk_5B8DA60 | envreg11 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg12 | 0x1B | &unk_5B8DA58 | envreg12 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg13 | 0x1B | &unk_5B8DA50 | envreg13 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg14 | 0x1B | &unk_5B8DA48 | envreg14 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg15 | 0x1B | &unk_5B8DA40 | envreg15 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg16 | 0x1B | &unk_5B8DA38 | envreg16 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg17 | 0x1B | &unk_5B8DA30 | envreg17 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg18 | 0x1B | &unk_5B8DA28 | envreg18 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg19 | 0x1B | &unk_5B8DA20 | envreg19 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg2 | 0x1A | &unk_5B8D9C0 | envreg2 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg20 | 0x1B | &unk_5B8DA10 | envreg20 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg21 | 0x1B | &unk_5B8DA08 | envreg21 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg22 | 0x1B | &unk_5B8DA00 | envreg22 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg23 | 0x1B | &unk_5B8D9F8 | envreg23 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg24 | 0x1B | &unk_5B8D9F0 | envreg24 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg25 | 0x1B | &unk_5B8D9E8 | envreg25 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg26 | 0x1B | &unk_5B8D9E0 | envreg26 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg27 | 0x1B | &unk_5B8D9D8 | envreg27 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg28 | 0x1B | &unk_5B8D9D0 | envreg28 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg29 | 0x1B | &unk_5B8D9C8 | envreg29 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg3 | 0x1A | &unk_5B8D9A8 | envreg3 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg30 | 0x1B | &unk_5B8D9B8 | envreg30 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg31 | 0x1B | &unk_5B8D9B0 | envreg31 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg4 | 0x1A | &unk_5B8D9A0 | envreg4 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg5 | 0x1A | &unk_5B8D998 | envreg5 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg6 | 0x1A | &unk_5B8D990 | envreg6 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg7 | 0x1A | &unk_5B8D988 | envreg7 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg8 | 0x1A | &unk_5B8D980 | envreg8 | dialects/nvvm.md |
nvvm.read.ptx.sreg.envreg9 | 0x1A | &unk_5B8D978 | envreg9 | dialects/nvvm.md |
nvvm.read.ptx.sreg.globaltimer | 0x1E | &unk_5B8D918 | globaltimer | dialects/nvvm.md |
nvvm.read.ptx.sreg.gridid | 0x19 | &unk_5B8D8F8 | gridid | dialects/nvvm.md |
nvvm.read.ptx.sreg.laneid | 0x19 | &unk_5B8D8C8 | laneid | dialects/nvvm.md |
nvvm.read.ptx.sreg.lanemask.eq | 0x1E | &unk_5B8D8C0 | lanemask.eq | dialects/nvvm.md |
nvvm.read.ptx.sreg.lanemask.ge | 0x1E | &unk_5B8D8B8 | lanemask.ge | dialects/nvvm.md |
nvvm.read.ptx.sreg.lanemask.gt | 0x1E | &unk_5B8D8B0 | lanemask.gt | dialects/nvvm.md |
nvvm.read.ptx.sreg.lanemask.le | 0x1E | &unk_5B8D8A8 | lanemask.le | dialects/nvvm.md |
nvvm.read.ptx.sreg.lanemask.lt | 0x1E | &unk_5B8D8A0 | lanemask.lt | dialects/nvvm.md |
nvvm.read.ptx.sreg.nclusterid.x | 0x1F | &unk_5B8DBE0 | nclusterid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.nclusterid.y | 0x1F | &unk_5B8DBD8 | nclusterid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.nclusterid.z | 0x1F | &unk_5B8DBD0 | nclusterid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.nctaid.x | 0x1B | &unk_5B8D910 | nctaid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.nctaid.y | 0x1B | &unk_5B8D908 | nctaid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.nctaid.z | 0x1B | &unk_5B8D900 | nctaid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.nsmid | 0x18 | &unk_5B8D778 | nsmid | dialects/nvvm.md |
nvvm.read.ptx.sreg.ntid.x | 0x19 | &unk_5B8DC78 | ntid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.ntid.y | 0x19 | &unk_5B8DC70 | ntid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.ntid.z | 0x19 | &unk_5B8DC68 | ntid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.nwarpid | 0x1A | &unk_5B8D640 | nwarpid | dialects/nvvm.md |
nvvm.read.ptx.sreg.smid | 0x17 | &unk_5B8D770 | smid | dialects/nvvm.md |
nvvm.read.ptx.sreg.tid.x | 0x18 | &unk_5B8D678 | tid.x | dialects/nvvm.md |
nvvm.read.ptx.sreg.tid.y | 0x18 | &unk_5B8D670 | tid.y | dialects/nvvm.md |
nvvm.read.ptx.sreg.tid.z | 0x18 | &unk_5B8D668 | tid.z | dialects/nvvm.md |
nvvm.read.ptx.sreg.warpid | 0x19 | &unk_5B8D638 | warpid | dialects/nvvm.md |
nvvm.read.ptx.sreg.warpsize | 0x1B | &unk_5B8D630 | warpsize | dialects/nvvm.md |
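
The 73 special-register mnemonics above follow a regular pattern: a handful of scalar registers, eight families with x/y/z components, 32 envreg slots, and five lanemask comparisons. The sketch below regenerates the name set from that pattern so the count can be cross-checked; the function name `sreg_mnemonics` is hypothetical, and the output order is generation order, not the table's order.

```python
PREFIX = "nvvm.read.ptx.sreg."

# Registers listed above that have no component suffix.
SCALARS = (
    "clock", "clock64", "cluster.ctarank", "cluster.nctarank",
    "globaltimer", "gridid", "laneid", "nsmid", "nwarpid",
    "smid", "warpid", "warpsize",
)

# Families listed above with .x/.y/.z components.
XYZ_FAMILIES = (
    "cluster.ctaid", "clusterid", "cluster.nctaid", "ctaid",
    "nclusterid", "nctaid", "ntid", "tid",
)

LANEMASKS = ("eq", "ge", "gt", "le", "lt")


def sreg_mnemonics() -> list[str]:
    """Rebuild the special-register mnemonic set from its families."""
    names = list(SCALARS)
    for family in XYZ_FAMILIES:
        names.extend(f"{family}.{axis}" for axis in "xyz")
    names.extend(f"envreg{i}" for i in range(32))
    names.extend(f"lanemask.{cmp}" for cmp in LANEMASKS)
    return [PREFIX + name for name in names]


names = sreg_mnemonics()
print(len(names), len(set(names)))
```

The breakdown is 12 scalars, 24 component registers, 32 envreg slots, and 5 lanemasks, which matches the count in the section heading.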
§8.10 cluster_launch_ctrl (7)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid | 0x36 | &unk_5B8DBA8 | query first ctaid | dialects/nvvm.md |
nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.x | 0x38 | &unk_5B8DBA0 | query first ctaid.x | dialects/nvvm.md |
nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.y | 0x38 | &unk_5B8DB98 | query first ctaid.y | dialects/nvvm.md |
nvvm.clusterlaunchcontrol.query_cancel.get_first_ctaid.z | 0x38 | &unk_5B8DB90 | query first ctaid.z | dialects/nvvm.md |
nvvm.clusterlaunchcontrol.query_cancel.is_canceled | 0x32 | &unk_5B8DB88 | query is-canceled | dialects/nvvm.md |
nvvm.clusterlaunchcontrol.try_cancel | 0x24 | &unk_5B8DB78 | try cancel | dialects/nvvm.md |
nvvm.clusterlaunchcontrol.try_cancel.multicast | 0x2E | &unk_5B8DB80 | try cancel multicast | dialects/nvvm.md |
§8.11 Fences (14)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.fence.acq_rel.cluster | 0x1A | &unk_5B8D6B8 | acq_rel cluster | dialects/nvvm.md |
nvvm.fence.acq_rel.cta | 0x16 | &unk_5B8D6B0 | acq_rel CTA | dialects/nvvm.md |
nvvm.fence.acq_rel.gpu | 0x16 | &unk_5B8D6A8 | acq_rel GPU | dialects/nvvm.md |
nvvm.fence.acq_rel.sys | 0x16 | &unk_5B8D6A0 | acq_rel sys | dialects/nvvm.md |
nvvm.fence.acquire | 0x12 | &unk_5B8D948 | acquire fence | dialects/nvvm.md |
nvvm.fence.mbarrier.init | 0x18 | &unk_5B8D940 | mbarrier-init fence | dialects/nvvm.md |
nvvm.fence.proxy | 0x10 | &unk_5B8D930 | proxy fence | dialects/nvvm.md |
nvvm.fence.proxy.acquire | 0x18 | &unk_5B8D938 | proxy acquire | dialects/nvvm.md |
nvvm.fence.proxy.release | 0x18 | &unk_5B8D928 | proxy release | dialects/nvvm.md |
nvvm.fence.release | 0x12 | &unk_5B8D920 | release fence | dialects/nvvm.md |
nvvm.fence.sc | 0xD | &unk_5B8D680 | sc fence | dialects/nvvm.md |
nvvm.fence.sc.cluster | 0x15 | &unk_5B8D698 | sc cluster | dialects/nvvm.md |
nvvm.fence.sc.cta | 0x11 | &unk_5B8D690 | sc CTA | dialects/nvvm.md |
nvvm.fence.sc.gpu | 0x11 | &unk_5B8D688 | sc GPU | dialects/nvvm.md |
§8.12 dot_accum (2)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.dot.accumulate.2way | 0x18 | &unk_5B8DA88 | dot accumulate 2-way | dialects/nvvm.md |
nvvm.dot.accumulate.4way | 0x18 | &unk_5B8DA80 | dot accumulate 4-way | dialects/nvvm.md |
§8.13 griddep / proxy / tensormap (5)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
nvvm.griddepcontrol.launch.dependents | 0x25 | &unk_5B8D8F0 | griddepcontrol launch dependents | dialects/nvvm.md |
nvvm.griddepcontrol.wait | 0x18 | &unk_5B8D8E8 | griddepcontrol wait | dialects/nvvm.md |
nvvm.prefetch | 0xD | &unk_5B8D7B0 | prefetch | dialects/nvvm.md |
nvvm.prefetch.tensormap | 0x17 | &unk_5B8D7A8 | prefetch tensormap | dialects/nvvm.md |
nvvm.tensormap.cp_fenceproxy | 0x1C | &unk_5B8D6C0 | tensormap cp_fenceproxy | dialects/nvvm.md |
§8.14 Misc (19)
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
| nvvm.add.packed.f32x2 | 0x15 | &unk_5B8DCB8 | packed f32x2 add | dialects/nvvm.md |
| nvvm.atomicrmw | 0xE | &unk_5B8DCB0 | LLVM atomicrmw wrapper | dialects/nvvm.md |
| nvvm.breakpoint | 0xF | &unk_5B8DC30 | breakpoint | dialects/nvvm.md |
| nvvm.exit | 9 | &unk_5B8D970 | thread exit | dialects/nvvm.md |
| nvvm.fabs | 9 | &unk_5B8D958 | float abs | dialects/nvvm.md |
| nvvm.fma.packed.f32x2 | 0x15 | &unk_5B8D950 | packed f32x2 FMA | dialects/nvvm.md |
| nvvm.fmax | 9 | &unk_5B8D7E0 | float max | dialects/nvvm.md |
| nvvm.fmin | 9 | &unk_5B8D7D8 | float min | dialects/nvvm.md |
| nvvm.inline_ptx | 0xF | &unk_5B8D8E0 | inline PTX | dialects/nvvm.md |
| nvvm.load.ext | 0xD | &unk_5B8D968 | extended load | dialects/nvvm.md |
| nvvm.mapa | 9 | &unk_5B8D7F0 | mapa | dialects/nvvm.md |
| nvvm.mul | 8 | &unk_5B8D7C0 | multiply | dialects/nvvm.md |
| nvvm.mul.packed.f32x2 | 0x15 | &unk_5B8D7B8 | packed f32x2 multiply | dialects/nvvm.md |
| nvvm.rcp.approx.ftz.f | 0x15 | &unk_5B8D7A0 | reciprocal approx ftz | dialects/nvvm.md |
| nvvm.red (family — TypeID-only; no literal mnemonic string) | 8 | &unk_5B8D798 | atomic reduction family; concrete forms surfaced in the string table are nvvm.redux.sync and nvvm.barrier.cta.red; the variant-3 red_op/red_type parser slots are described in dialects/nvvm/properties-blob-and-attr-parsers.md | dialects/nvvm.md |
| nvvm.setmaxregister | 0x13 | &unk_5B8D788 | set-max-register | dialects/nvvm.md |
| nvvm.st.bulk | 0xC | &unk_5B8DC28 | bulk store | dialects/nvvm.md |
| nvvm.store.ext | 0xE | &unk_5B8D960 | extended store | dialects/nvvm.md |
| nvvm.sub.packed.f32x2 | 0x15 | &unk_5B8D760 | packed f32x2 subtract | dialects/nvvm.md |
§9 llvm-extras (upstream llvm.* ops observed in tileiras lowerings)
The MLIR llvm dialect is statically linked from upstream and registered
via addOperation<> chains; tileiras does not surface a per-op
&unk_* slot for these. The list below enumerates the llvm.*
mnemonics emitted by tileiras-driven lowerings. Dialect TypeID anchor is
&unk_5BA8F60.
| mnemonic | length | TypeID singleton | brief semantic | primary wiki page |
|---|---|---|---|---|
| llvm.alloca | 11 | upstream | stack alloca | dialects/upstream-llvm.md |
| llvm.atomicrmw | 14 | upstream | atomic RMW (the binary has no llvm.atomic_cmpxchg string; compare-and-swap is the separate llvm.cmpxchg op below) | dialects/upstream-llvm.md |
| llvm.bitcast | 12 | upstream | bit-pattern type pun | dialects/upstream-llvm.md |
| llvm.call | 9 | upstream | LLVM call | dialects/upstream-llvm.md |
| llvm.cmpxchg | 12 | upstream | atomic compare-and-swap | dialects/upstream-llvm.md |
| llvm.dbg.cu | 11 | upstream | DI compile-unit | dialects/upstream-llvm.md |
| llvm.extractelement | 19 | upstream | vector element extract | dialects/upstream-llvm.md |
| llvm.fence | 10 | upstream | LLVM fence | dialects/upstream-llvm.md |
| llvm.func | 9 | upstream | LLVM function | dialects/upstream-llvm.md |
| llvm.getelementptr | 18 | upstream | get-element-ptr (the binary has no abbreviated llvm.gep string; only the spelled-out form is present) | dialects/upstream-llvm.md |
| llvm.global_ctors | 17 | upstream | LLVM global constructors array | dialects/upstream-llvm.md |
| llvm.global_dtors | 17 | upstream | LLVM global destructors array | dialects/upstream-llvm.md |
| llvm.global.annotations | 23 | upstream | LLVM global annotations array | dialects/upstream-llvm.md |
| llvm.insertelement | 18 | upstream | vector element insert | dialects/upstream-llvm.md |
| llvm.intr.coro.align | 20 | upstream | coroutine intrinsic — frame alignment query | dialects/upstream-llvm.md |
| llvm.intr.coro.begin | 20 | upstream | coroutine intrinsic — frame begin | dialects/upstream-llvm.md |
| llvm.intr.coro.end | 18 | upstream | coroutine intrinsic — frame end | dialects/upstream-llvm.md |
| llvm.intr.coro.free | 19 | upstream | coroutine intrinsic — free frame storage | dialects/upstream-llvm.md |
| llvm.intr.coro.id | 17 | upstream | coroutine intrinsic — identity token | dialects/upstream-llvm.md |
| llvm.intr.coro.promise | 22 | upstream | coroutine intrinsic — promise/frame conversion | dialects/upstream-llvm.md |
| llvm.intr.coro.resume | 21 | upstream | coroutine intrinsic — resume suspended frame | dialects/upstream-llvm.md |
| llvm.intr.coro.save | 19 | upstream | coroutine intrinsic — save suspend index | dialects/upstream-llvm.md |
| llvm.intr.coro.size | 19 | upstream | coroutine intrinsic — frame size query | dialects/upstream-llvm.md |
| llvm.intr.coro.suspend | 22 | upstream | coroutine intrinsic — suspend point | dialects/upstream-llvm.md |
| llvm.intr.dbg.declare | 21 | upstream | debug-info declare | dialects/upstream-llvm.md |
| llvm.intr.dbg.label | 19 | upstream | debug-info label | dialects/upstream-llvm.md |
| llvm.intr.dbg.value | 19 | upstream | debug-info value | dialects/upstream-llvm.md |
| llvm.inttoptr | 13 | upstream | int-to-pointer | dialects/upstream-llvm.md |
| llvm.mlir.constant | 18 | upstream | MLIR constant for LLVM type | dialects/upstream-llvm.md |
| llvm.ptrtoint | 13 | upstream | pointer-to-int | dialects/upstream-llvm.md |
| llvm.return | 11 | upstream | return | dialects/upstream-llvm.md |
| llvm.select | 11 | upstream | select | dialects/upstream-llvm.md |
| llvm.shufflevector | 18 | upstream | vector shuffle | dialects/upstream-llvm.md |
LLVM Fingerprint Table
Ten independent fingerprints recovered from the stripped tileiras ELF prove the binary was built from an LLVM 21.0.0git snapshot of the upstream monorepo (with NVIDIA-internal NVPTX patches) plus the co-tracking MLIR layer. Each fingerprint is an isolated piece of byte-level evidence — a rodata string, a hardcoded enum value, a switch-table size, a structure-allocation footprint — and each derives from a different code path inside the binary, so they cannot be the result of a single string substitution. Companion page: VERSIONS.md. For LLVM-version cross-cuts see the table at the bottom.
How to use these fingerprints
Each fingerprint below is independent and addresses a different fragility class:
| If you can... | Use fingerprint(s) | Why |
|---|---|---|
| only grep rodata strings | 2, 3 | the two unique LLVM-21 anchors are literal strings |
| only count switch arms | 5, 7 | row counts disambiguate generic intrinsic and PassBuilder coverage |
| only inspect type layouts | 4, 9 | GlobalVariable=88B, AsyncValueImpl=808B, InstructionVal=28 are stable shape facts |
| only diff hex against a clean LLVM tree | 1, 6, 8 | data-layout, NVPTX MatcherTable, ProxyReg whitelist are byte-exact |
| only run the binary and capture output | 3 | the AsmPrinter Based on LLVM 21.0.0git line surfaces in every emitted PTX |
Detecting the source LLVM version of an unknown NVPTX-derived binary takes only a short scan of rodata, linear in the section size:
#define _GNU_SOURCE          /* memmem() is a GNU/BSD extension */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
typedef enum { LLVM_UNKNOWN, LLVM_18, LLVM_19, LLVM_20, LLVM_21 } LlvmVersion;
/* Producer-string scan; the first matching prefix pins the LLVM major
 * version with HIGH confidence per the cross-LLVM table at the bottom. */
LlvmVersion detect_llvm_version(const uint8_t *rodata, size_t rodata_len) {
    if (memmem(rodata, rodata_len, "LLVM21.0.0git", 13)) return LLVM_21;
    if (memmem(rodata, rodata_len, "LLVM20.", 7)) return LLVM_20;
    if (memmem(rodata, rodata_len, "LLVM19.", 7)) return LLVM_19;
    if (memmem(rodata, rodata_len, "LLVM18.", 7)) return LLVM_18;
    /* Fall back to structural fingerprints 4/7/8 if the producer string
     * has been stripped or rewritten. */
    return LLVM_UNKNOWN;
}
The producer-string anchor (fingerprint 2) makes this scan trivial; the structural fingerprints exist for the case where someone has stripped it.
Fingerprints
1. NVPTX data-layout stamp
- Claim: stock LLVM 21 NVPTX64 data-layout string is hardcoded and stamped onto every emitted module.
- Binary evidence: rodata at 0x4D079D0, length 0x9A = 154 bytes, single xref from sub_1A4E5C0 at 0x1A4E5D1. Verbatim: e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64.
- Confidence: HIGH.
2. LLVM21.0.0git bitcode producer string
- Claim: the LLVM bitcode writer emits its IDENTIFICATION_CODE_STRING producer record from a literal LLVM21.0.0git blob.
- Binary evidence: rodata at 0x4F882C4, length 13 bytes, single xref from sub_3935490 at 0x39359A1 (the EnterSubblock(IDENTIFICATION, 5) site of the bitcode-writer body). Verbatim string: LLVM21.0.0git.
- Confidence: HIGH.
3. NVPTX AsmPrinter Based on LLVM 21.0.0git header line
- Claim: every PTX file emitted by tileiras carries a hardcoded comment line identifying the LLVM base version.
- Binary evidence: sub_1A56540 (the NVPTX emitHeader AsmPrinter virtual) writes a four-line comment block whose third line is the .rodata string Based on LLVM 21.0.0git (not runtime-formatted; verbatim literal). Downstream of this header sits the 6,388-case AsmWriter MC instruction switch shaped like LLVM 21's NVPTXGenAsmWriter.inc.
- Confidence: HIGH.
4. Verifier subset: 88-byte GlobalVariable and InstructionVal = 28
- Claim: the LLVM IR layout used by the inline Verifier subset matches the LLVM 18-21 stable shape and excludes earlier and later trees.
- Binary evidence: in sub_2D45620 at 0x2D46960 / 0x2D514DB / 0x2D5975A the gate if (*(_BYTE *)v116 > 0x1Cu) checks Value::SubclassID > 28 (i.e. InstructionVal = 28) before dispatching the 46-case Instruction-opcode switch. Adjacent allocator call at line 10717-10721 is BumpPtrAllocator::Allocate(88, 1) followed by GlobalVariable::GlobalVariable(...) (sub_3FCECA0) — the 88-byte GlobalVariable footprint. The 12 Verifier::visitIntrinsicCall diagnostic strings sit at 0x4F2FCB8...0x4F2FDC3.
- Confidence: HIGH (combined with fingerprints 2/3/7 which fix the version to 21, this layout becomes a corroborating LLVM 21 anchor; on its own, MED — narrows to LLVM 18-21).
5. ~412-case canConstantFoldCallTo Intrinsic::ID switch
- Claim: the ConstantFolding predicate switches on a 412-entry generic Intrinsic::ID jump table, anchoring the upper bound on the generic intrinsic enum.
- Binary evidence: sub_39ADED0, 5842 bytes, 344 basic blocks. The 412-case primary switch lives at 0x39ADFCB, capped by cmp ecx, 0x19B at 0x39ADFB5 (0x19B = 411). A secondary 161-case switch at 0x39AE288 covers NVPTX-private intrinsic IDs in the 8851..9011 range (tcgen05 / cvt_packfloat / cp.async.bulk / wgmma extensions).
- Confidence: MED. The 412 cap was originally cited as evidence for LLVM 17/18 base, but the producer string (fingerprint 2) and AsmPrinter header (fingerprint 3) override that — the constant folder in tileiras covers the 0..411 generic-ID subset of LLVM 21's intrinsic enum; the NVIDIA fork pruned the new-in-21 generic intrinsics from this fold table while keeping the LLVM21.0.0git build stamp.
6. NVPTX MatcherTable XOR-3 obfuscated mnemonic pools
- Claim: tileiras embeds the NVPTX TableGen-generated mnemonic pools in .data under a walking XOR-3 cipher, decoded once at startup. Pool shape (offsets, opcodes covered) matches the LLVM 21 NVPTXGenAsmWriter output.
- Binary evidence: register-name pool at 0x5A4BE20 - 0x5A4C06A (586 B), opcode-mnemonic pool at 0x5A4C080 - 0x5A656F0 (~105 KB). Decoders at 0x1BD1810 (mnemonic) and 0x1BD1830 (register name); cached post-decode pointer at qword_5B4F4D0. Both pools are byte-XOR-3 from disk.
- Confidence: HIGH for "XOR-3 obfuscation present", HIGH for "shape matches LLVM 21 NVPTXGenAsmWriter".
7. PassBuilder mega-registry: 478 pretty-name keys + 73 specials
- Claim: the new-PM PassBuilder pass-name registrar registers exactly 551 entries split as 478 templated getTypeName<T>() keys, 66 naked-class string keys, 5 pipeline aliases, 2 specials. This row count fixes LLVM 21.
- Binary evidence: sub_1CCB7D0 makes 551 calls to sub_4063070 (StringMap<PassInfo>::insert). The L843 line registers memprof-context-disambiguation, a pass class that landed post-LLVM-18. NVIDIA-private classes (NVVMIRVerifier, IPMSP, Pretreat, nv-early-inliner, SelectKernels, NVVMAA, KernelInfoPrinter, LowerAggrCopies, LowerStructArgs, etc., 18 in total) are interleaved.
- Confidence: HIGH.
8. NVPTXProxyRegErasure 4-opcode contiguous whitelist
- Claim: the post-ISel ProxyReg erasure pass uses a contiguous opcode range 3156..3159 checked by sub eax, 0xC54 ; cmp eax, 3 at 0x1AE5086 — a TableGen-side typed-ProxyReg consolidation that landed in LLVM trunk just before the 21.0 cut. Stock LLVM 18 used a 5-6-element named whitelist.
- Binary evidence: sub_1AE4FD0 body at 0x1AE4FD0 - 0x1AE599C; the opcode test sub eax, 0xC54 ; cmp eax, 3 at 0x1AE5086.
- Confidence: HIGH.
9. MLIR Operation header (0x48) and AsyncValueImpl (808 bytes)
- Claim: MLIR runtime structures observed in the binary match the post-2024 / LLVM 21 monorepo MLIR ABI, not earlier MLIR layouts.
- Binary evidence: MLIR Operation fixed-header is 0x48 bytes (the mlir::Operation shape with trailing-objects layout for operands / regions / successors). RewritePattern allocations cluster at 0x60 / 0x68 / 0x70 / 0x78 (96 / 104 / 112 / 120 bytes) — observed at 286 sites for the 0x60 variant alone in p3-U02. AsyncValueImpl is exactly 808 bytes (0x328), allocated by sub_44A8C20(0x328, ...) from three sites in the Pipe / Mutex / Schedule infrastructure.
- Confidence: MED individually (MLIR snapshots are not version-stamped); HIGH in conjunction with the LLVM 21 anchors above.
10. NVVM-Reflect cl::opt registrations
- Claim: the NVPTX backend's NVVM-Reflect pass registers two cl::opt flags via the standard LLVM 21 cl::opt machinery, with rodata strings sitting alongside the LLVM cl::opt singleton.
- Binary evidence: ctor_238 at 0x463A70 registers (i) cl::opt<bool> nvvm-reflect-enable (storage at 0x5B4F400, rodata name at 0x4D3C766, 19 B) and (ii) cl::list<string> nvvm-reflect-add (rodata name at 0x4D3C77A, 16 B). Both go through sub_4534CC0 (cl::Option::setArgStr) and sub_4534420 (cl::Option::done) and end up in the cl::GlobalParser singleton at 0x4530050. The pass is registered into the PassBuilder registry at 0x1CCB7D0 as nvvm-reflect-pp.
- Confidence: HIGH.
Cross-LLVM-version disambiguation
| Fingerprint | LLVM 18 | LLVM 19 | LLVM 20 | LLVM 21 (tileiras) |
|---|---|---|---|---|
| 1. NVPTX data-layout stamp | identical | identical | identical | match |
| 2. Producer string | LLVM18.x.y | LLVM19.x.y | LLVM20.x.y | LLVM21.0.0git |
| 3. AsmPrinter Based on LLVM … | LLVM 18.x | LLVM 19.x | LLVM 20.x | LLVM 21.0.0git |
| 4. InstructionVal = 28 | matches | matches | matches | matches |
| 4. 88-byte GlobalVariable | matches | matches | matches | matches |
| 5. Generic Intrinsic::ID cap | ~421 | ~430 | ~440 | NVIDIA fork: 412 (subset of 21) |
| 6. NVPTX MatcherTable shape | smaller (no tcgen05) | smaller | + tcgen05 partial | matches (full tcgen05/cp.async.bulk/wgmma) |
| 7. PassBuilder row count | ~380 | ~440 | ~500 | 551 = 478+66+5+2 |
| 7. MemProfContextDisambiguation | absent | present | present | present |
| 8. ProxyReg whitelist | named 5-6 entries | named 5-6 entries | named 5-6 entries | contiguous 3156..3159 (LLVM 21 typed-ProxyReg) |
| 9. AsyncValueImpl size | smaller | smaller | similar | matches 808 B |
| 10. cl::opt machinery | matches | matches | matches | matches |
Two unique LLVM 21 anchors carry the weight: fingerprints 2 (producer string) and 3 (AsmPrinter header). Fingerprints 7 and 8 disambiguate LLVM 21 from LLVM 20. Fingerprints 1, 4, 6, 9, 10 are stable-layout corroborators that exclude earlier/later trees only when read together with the unique anchors. Fingerprint 5 is the weakest of the ten: its 412-entry cap reflects the pruned NVIDIA fold table rather than an upstream enum size, so it carries MED confidence and is consistent with LLVM 21 only in light of fingerprints 2 and 3. The convergence of these ten independent fingerprints leaves no plausible alternative version; a snapshot from any other LLVM major would disagree on at least one.
The reader-side recipe for opening the binary and reproducing any one of these fingerprints is documented on Binary Anatomy and RE Methodology.
Wire-Format Constants
A reimplementation of tileiras that aims for byte-for-byte parity with a shipped binary
must reproduce a small set of magic numbers, tag namespaces, opcode tables, and obfuscation
ciphers exactly. Every constant in this page is fingerprintable from a stripped 88 MB
tileiras ELF and verified against the cross-referencing dispatchers documented in
MLIR Bytecode Format,
LLVM Fingerprint Table, and
ISelDAG and MatcherTable. The constants are
not configuration. Changing any one of them produces an artifact that fails to
interoperate with the shipped reader or fails to bind against the AsmWriter's
post-decryption string pool.
The page is organized strictly by layer of the wire format, walking from the outermost envelope down to the innermost emitter. Each section lists the constants the layer defines, the exact byte offsets and lengths where they live in the binary, and the authoritative cross-reference for the dispatch site that reads them. Where a constant is interesting on its own — a typo preserved across builds, an unused mid-slot in a bit-mask, a numbering divergence from upstream — the rationale is captured inline rather than buried in a footnote.
Layer 1 — TileIR Bytecode Envelope
The bytecode container's framing prefix is the single most reproduced constant in this binary. Stock LLVM 18-21 MLIR-bytecode files share the first three bytes; the private TileIR dialect tag occupies bytes 3-6, and the terminator byte at offset 7 separates TileIR from upstream MLIR at the magic-byte level.
| Offset | Byte | Symbolic name | Meaning |
|---|---|---|---|
| 0 | 0x06 | MAGIC_LEN_HI | MLIR-bytecode framing prefix (shared with upstream) |
| 1 | 0x03 | MAGIC_LEN_LO | MLIR-bytecode framing prefix (shared with upstream) |
| 2 | 0x80 | MAGIC_FLAGS | MLIR-bytecode framing prefix (shared with upstream) |
| 3 | 0x54 | dialect byte 1 | 'T' |
| 4 | 0x69 | dialect byte 2 | 'i' |
| 5 | 0x6C | dialect byte 3 | 'l' |
| 6 | 0x65 | dialect byte 4 | 'e' |
| 7 | 0x00 | tileiras terminator | Upstream writes '\n' (start of "\nMLIR") here |
The literal lives at rodata 0x45EBF08 and is compared byte-for-byte by sub_5838A0
against the input buffer; mismatch surfaces a three-fragment diagnostic
("invalid magic number at position " / ", got " / " expected ").
The version block follows immediately after the magic and is a sequence of three
unsigned-LEB128 VarInts: major, minor, optional patch. The accepted range
table at rodata 0x45EBF10 is verbatim:
static const TileVersion supported_versions[] = {
    /*min:*/ { .major = 13, .minor = 1, .patch = 0 },           // inclusive
    /*max:*/ { .major = 13, .minor = 1, .patch = UINT32_MAX },  // inclusive (only 13.1.x)
};
Any major or minor other than 13.1 is rejected; the patch field is read for forward
compatibility but never gated on.
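The version gate described above can be sketched as follows. This is a minimal illustration, not the shipped reader: the helper names `read_uleb32` and `version_is_supported` are hypothetical, and the sketch assumes the patch VarInt is always present, because how an absent patch is signalled is not described here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one unsigned LEB128 value from buf[*pos..len).
 * Returns 0 on success, -1 on truncation or a value wider than 32 bits. */
int read_uleb32(const uint8_t *buf, size_t len, size_t *pos, uint32_t *out) {
    uint64_t value = 0;
    for (unsigned shift = 0; shift < 35; shift += 7) {
        if (*pos >= len) return -1;               /* ran off the end */
        uint8_t byte = buf[(*pos)++];
        value |= (uint64_t)(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {                 /* final 7-bit group */
            if (value > UINT32_MAX) return -1;
            *out = (uint32_t)value;
            return 0;
        }
    }
    return -1;                                    /* more than five groups */
}

/* Returns 1 when the version block at buf[*pos] is an accepted 13.1.x.
 * Assumption: the patch field is always encoded. */
int version_is_supported(const uint8_t *buf, size_t len, size_t *pos) {
    uint32_t ver_major, ver_minor, ver_patch;
    if (read_uleb32(buf, len, pos, &ver_major)) return 0;
    if (read_uleb32(buf, len, pos, &ver_minor)) return 0;
    if (read_uleb32(buf, len, pos, &ver_patch)) return 0;  /* read, never gated */
    return ver_major == 13 && ver_minor == 1;
}
```

The patch value is decoded so the cursor advances past it, but it takes no part in the accept/reject decision, mirroring the "read but never gated on" behaviour.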
The section ID space is dense in [0x00, 0x06] and the 0x00 slot doubles as the
end-of-bytecode marker:
| ID | Section | Required | Reference width |
|---|---|---|---|
| 0x00 | EndOfBytecode | required (last) | none |
| 0x01 | String | required | u32 offsets |
| 0x02 | Func | required | sequential |
| 0x03 | Debug | optional | u32 and u64 offsets |
| 0x04 | Constant | optional | u64 offsets |
| 0x05 | Type | optional | u32 offsets |
| 0x06 | Global | optional | sequential |
Section header padding is 0xCF. The on-disk section order is the producer's
choice, but the walker order is fixed: STRING → TYPE → CONSTANT → IR → optional
RESOURCE/DEBUG. See MLIR Bytecode Format
for the dependency-ordered dispatch.
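For reference, the section ID table can be carried into code as a small lookup. The enum and function names below are illustrative assumptions; only the numeric IDs, section names, and required/optional split come from the table.

```c
#include <stddef.h>
#include <stdint.h>

/* Section IDs as tabulated; identifiers are hypothetical. */
enum {
    SEC_END = 0x00, SEC_STRING = 0x01, SEC_FUNC = 0x02, SEC_DEBUG = 0x03,
    SEC_CONSTANT = 0x04, SEC_TYPE = 0x05, SEC_GLOBAL = 0x06
};

/* Returns the section name, or NULL for IDs outside the dense [0x00, 0x06] range. */
const char *section_name(uint8_t id) {
    static const char *const names[] = {
        "EndOfBytecode", "String", "Func", "Debug", "Constant", "Type", "Global",
    };
    return id <= SEC_GLOBAL ? names[id] : NULL;
}

/* Returns 1 for the sections the table marks as required. */
int section_is_required(uint8_t id) {
    return id == SEC_END || id == SEC_STRING || id == SEC_FUNC;
}
```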
⚡ QUIRK — terminator byte 7 is the file-format split
A bytecode container with the first seven bytes identical to upstream MLIR and byte 7 set to 0x00 is TileIR; a container with byte 7 set to '\n' (0x0A) and bytes 8-11 spelling "MLIR" is upstream MLIR. The two file formats share enough framing that a magic-number sniff that only checks bytes 0-2 will mis-classify both as "some MLIR bytecode dialect." A reimplementation that wants to refuse upstream MLIR inputs early must compare all eight bytes — anything less lets stock MLIR bytecode bind to the TileIR header parser and produce mangled tag-table errors several sections in.
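A full eight-byte comparison is short to write. The sketch below takes the byte values from the Layer 1 table; the identifier names are hypothetical, and it makes no attempt to reproduce the three-fragment diagnostic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight-byte TileIR magic as tabulated: shared framing prefix,
 * "Tile", then the 0x00 terminator at offset 7. */
static const uint8_t TILEIR_MAGIC[8] = {
    0x06, 0x03, 0x80, 'T', 'i', 'l', 'e', 0x00
};

/* Returns 1 only when all eight bytes match; shorter buffers are rejected. */
int is_tileir_magic(const uint8_t *buf, size_t len) {
    return len >= sizeof TILEIR_MAGIC &&
           memcmp(buf, TILEIR_MAGIC, sizeof TILEIR_MAGIC) == 0;
}
```

Comparing the terminator as part of the same `memcmp` means an input carrying '\n' at offset 7 is refused before any section parsing begins.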
Layer 2 — TypeTag Namespace (sub_59C710)
The Type section's per-record tag is a one-byte slot at offset 0 of the payload,
followed by a tag-specific operand list. The dense numbering 0..18 is independent
of upstream MLIR's BytecodeTypeOpcodes.td:
| Tag | Type | Operands (VarInt count) |
|---|---|---|
| 0..4 | i1, i8, i16, i32, i64 | 0 |
| 5..11 | f16, bf16, f32, tf32, f64, f8E4M3FN, f8E5M2 | 0 |
| 12 | Pointer (element type) | 1 |
| 13 | Tile (element + i64 shape) | 2 + dim_count |
| 14 | TensorView (element + shape + strides) | 3 + dim_count + stride_count |
| 15 | PartitionView (element + shape + dim-map + mode byte) | 4 + dim_count + map_count |
| 16 | Function (input list + result list) | 2 + input_count + result_count |
| 17 | Token | 0 |
| 18 | f8E8M0FNU (extension) | 0 |
The trailing f8E8M0FNU extension is an element type — like tags 5..11 it carries
no payload of its own. Tag 18 is reachable only as a leaf inside a tile-family
aggregate (TileType, TensorViewType, PartitionViewType), so the operand-zero
contract holds whether the tag is decoded standalone or through one of the
aggregate-type arms.
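The operand-count column translates directly into a lookup. In this sketch the function name and the generic `a` / `b` parameters are assumptions made for illustration; they stand for the two variable-length counts each aggregate row names.

```c
#include <stdint.h>

/* VarInt operand count for one TypeTag record, following the table.
 * a and b are the record's two variable counts (unused where the row
 * has none). Returns -1 for tags outside 0..18. */
long type_tag_operand_count(uint8_t tag, long a, long b) {
    switch (tag) {
    case 12: return 1;                    /* Pointer: element type        */
    case 13: return 2 + a;                /* Tile: a = dim_count          */
    case 14: return 3 + a + b;            /* TensorView: dims, strides    */
    case 15: return 4 + a + b;            /* PartitionView: dims, map     */
    case 16: return 2 + a + b;            /* Function: inputs, results    */
    default: return tag <= 18 ? 0 : -1;   /* scalars, Token, f8E8M0FNU    */
    }
}
```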
Layer 3 — AttrTag Numbering (sub_59F100)
The most consequential single constant table in the file. The shipped tileiras
AttrTag numbering is wire-format-breaking versus upstream MLIR's
mlir/Bytecode/BytecodeEnums.h::AttributeTag. Both tables are reproduced side by
side so the divergence is unambiguous:
| AttrTag | Upstream MLIR | Tileiras sub_59F100 |
|---|---|---|
| 0 | (reserved / sentinel) | (default-arm; emits "unsupported AttributeTag") |
| 1 | IntegerAttr | StringAttr |
| 2 | FloatAttr | FloatAttr |
| 3 | BoolAttr | TypeAttr |
| 4 | TypeAttr | DenseElementsAttr (int/float) |
| 5 | StringAttr | DenseElementsAttr (string) |
| 6 | ArrayAttr | DivByAttr |
| 7 | DenseElements | DenseI64ArrayAttr (variant A) |
| 8 | DivByAttr | DenseI64ArrayAttr (variant B) |
| 9 | SameElementsAttr | SameElementsAttr |
| 10 | Dictionary | BoundedAttr (variant 0) |
| 11 | OptimizationHints | BoundedAttr (variant 1) |
| 12 | BoundedAttr | BoundedAttr (variant 2) |
| 13 | (no upstream slot) | AssumePredicateAttr |
Tags 2 (FloatAttr) and 9 (SameElementsAttr) match upstream, and tag 12 stays within the
BoundedAttr family; every other tag in the 1..13 range disagrees: tag 1 is StringAttr here versus upstream IntegerAttr;
tag 4 lands on DenseElementsAttr instead of TypeAttr; tag 5 lands on
DenseElementsAttr<string> instead of StringAttr; tag 6 lands on DivByAttr
instead of ArrayAttr. Going the other direction, an AssumePredicateAttr emitted
by tileiras at tag 13 has no destination in upstream's table at all. Any external
tool that needs to round-trip MLIR bytecode through both implementations must
freeze the tileiras numbering above; the upstream header is reserved for future
stock cuda_tile builds.
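A tool that freezes the tileiras numbering can hold it as a fixed name table. The sketch below copies the right-hand column of the table; the function name is hypothetical, and returning NULL for tag 0 and for tags above 13 stands in for the default-arm "unsupported AttributeTag" path.

```c
#include <stddef.h>
#include <stdint.h>

/* Tileiras AttrTag numbering, index = tag. Tag 0 has no attribute kind. */
const char *tileiras_attr_tag_name(uint8_t tag) {
    static const char *const names[14] = {
        NULL,
        "StringAttr",
        "FloatAttr",
        "TypeAttr",
        "DenseElementsAttr (int/float)",
        "DenseElementsAttr (string)",
        "DivByAttr",
        "DenseI64ArrayAttr (variant A)",
        "DenseI64ArrayAttr (variant B)",
        "SameElementsAttr",
        "BoundedAttr (variant 0)",
        "BoundedAttr (variant 1)",
        "BoundedAttr (variant 2)",
        "AssumePredicateAttr",
    };
    return tag < 14 ? names[tag] : NULL;
}
```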
The parallel DebugTag namespace at sub_589B90 is private to the Debug section and
uses a dense [0..6] range. Tag 0 is the failure sentinel; tags 1-6 cover
DICompileUnit, DIFile, DILexicalBlock, DILoc, DISubprogram,
CallSite respectively. No upstream LLVM debug-info tag table participates in
this dispatcher.
Layer 4 — cuda_tile Opcode Space (sub_5B13D0)
The 110-row cuda_tile opcode table is dense in [0..109] with two reserved
holes the dispatcher leaves on the default arm:
| Range | Status |
|---|---|
| 0..24 | Assigned (absf through exp2) |
| 25..36 | Reserved hole — emits "unknown or unimplemented opcode: " |
| 37..51 | Assigned (exti through int_to_ptr) |
| 52..57 | Reserved hole — emits "unknown or unimplemented opcode: " |
| 58..109 | Assigned (iota through yield) |
Opcode 0x6E (atan2 in the public 13.2 namespace) is absent from this binary.
The dispatcher has no case for it and embeds no cuda_tile.atan2 mnemonic string;
encoding the op lands on the default arm. This places the binary at a 13.1-vintage
opcode-table snapshot.
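The assigned ranges and reserved holes reduce to three interval checks. This is a sketch with a hypothetical function name; anything it rejects would land on the dispatcher's default arm.

```c
#include <stdint.h>

/* Returns 1 when the opcode falls in one of the three assigned ranges,
 * 0 for the reserved holes (25..36, 52..57) and for anything above 109. */
int cuda_tile_opcode_assigned(uint32_t opcode) {
    if (opcode <= 24) return 1;                     /* absf .. exp2        */
    if (opcode >= 37 && opcode <= 51) return 1;     /* exti .. int_to_ptr  */
    if (opcode >= 58 && opcode <= 109) return 1;    /* iota .. yield       */
    return 0;
}
```

Opcode 0x6E is 110, one past the end of the table, so it is rejected by the same test that covers the holes.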
The full per-opcode mnemonic / handler-address table lives in MLIR Bytecode Format — Operation Opcode Dispatch.
The location-index slot is a signed LEB128 value: the single byte 0x7F decodes
to -1 by sign extension (a zig-zag mapping would encode -1 as 0x01 instead), which
the dispatcher resolves to UnknownLoc (typical of a --lineinfo-less compile).
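The byte 0x7F yields -1 only under a sign-extending signed-LEB128 decode; a zig-zag decode of the same value gives -64. The sketch below shows both so the difference can be checked. The function names are hypothetical, and which routine the shipped reader actually uses is an assumption to verify against the binary.

```c
#include <stddef.h>
#include <stdint.h>

/* Sign-extending signed LEB128 decode. Returns the number of bytes
 * consumed, or 0 when the input is truncated or overlong. */
size_t read_sleb64(const uint8_t *buf, size_t len, int64_t *out) {
    uint64_t value = 0;
    unsigned shift = 0;
    for (size_t i = 0; i < len && shift < 64; ++i) {
        uint8_t byte = buf[i];
        value |= (uint64_t)(byte & 0x7F) << shift;
        shift += 7;
        if ((byte & 0x80) == 0) {
            /* Bit 6 of the final group is the sign bit. */
            if (shift < 64 && (byte & 0x40))
                value |= ~(uint64_t)0 << shift;
            *out = (int64_t)value;
            return i + 1;
        }
    }
    return 0;
}

/* Zig-zag mapping for comparison: 0, 1, 2, 3 -> 0, -1, 1, -2. */
int64_t zigzag_decode(uint64_t n) {
    return (int64_t)(n >> 1) ^ -(int64_t)(n & 1);
}
```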
Layer 5 — NVPTX MatcherTable Pools (XOR-3 Cipher)
The NVPTX AsmWriter ships two .data mnemonic pools obfuscated by a walking
XOR cipher. Byte i is XORed with (3 * i) mod 256, decoded once at startup,
and cached behind a pointer at qword_5B4F4D0.
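#include <stdint.h>
/* Walking-XOR decode: byte i is XORed with (3 * i) mod 256. XOR is its own
 * inverse, so the same routine re-encodes a decoded pool. */
void xor3_decode(uint8_t *begin, uint8_t *end) {
    uint8_t key = 0;
    for (uint8_t *p = begin; p != end; ++p) {
        *p ^= key;
        key = (uint8_t)(key + 3);   /* wraps modulo 256 */
    }
}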
The two pools and their decoder entry points are:
| Pool | Range | Length | Decoder | Cached pointer |
|---|---|---|---|---|
| Opcode mnemonic | 0x5A4C080 .. 0x5A656F0 | ~105 KB | sub_1BD1810 | qword_5B4F4D0 |
| Physical-register-name | 0x5A4BE20 .. 0x5A4C06A | 586 B | sub_1BD1830 | (post-decode cached) |
The cipher is not a security boundary. Its only effect is to prevent a naive
strings(1) sweep from surfacing every PTX mnemonic. A reimplementation that
does not need binary-for-binary .data parity can store the same strings plainly.
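For a reimplementation that does want on-disk parity, the same walk produces the obfuscated bytes, because applying the key stream twice restores the input. The index-based variant below is a sketch with a hypothetical name; it is equivalent to the pointer walk for the key sequence 0, 3, 6, ... modulo 256.

```c
#include <stddef.h>
#include <stdint.h>

/* Applies the walking key (3 * i) mod 256 in place. Calling it once on
 * plaintext encodes; calling it again on the result decodes. */
void xor3_apply(uint8_t *data, size_t len) {
    for (size_t i = 0; i < len; ++i)
        data[i] ^= (uint8_t)(3 * i);
}
```

The first byte of a pool is always stored unchanged, since its key is zero.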
The shape of the decoded pool matches LLVM 21's NVPTXGenAsmWriter output; the
pattern-name strings paired with each OPC_* row of the MatcherTable
("setmaxregister", "cp.async.bulk.tensor.group.shared.cluster",
"wgmma.mma_async.sync.aligned", "wgmma.fence.sync.aligned",
"tcgen05.mma.sync", "tcgen05.mma.ws.sync", "mma.block_scaled.sync.aligned",
"mma.sp.sync.aligned.m8n8k16") sit unencrypted in .rodata since they are
TableGen pattern records rather than printer-side mnemonic literals.
Layer 6 — NVPTX ProxyReg Whitelist
The post-ISel NVPTXProxyRegErasure peephole uses a contiguous opcode range
rather than a named whitelist. The TableGen-side consolidation that landed
in LLVM 21 trunk just before the 21.0 cut replaced the older per-type
ProxyRegInst<*> template with a four-way emit that produces adjacent
indices:
| MI opcode | Type class | TableGen name |
|---|---|---|
| 3156 | i16 | ProxyRegI16 |
| 3157 | i32 | ProxyRegI32 |
| 3158 | i64 | ProxyRegI64 |
| 3159 | f32 / f64 | ProxyRegF |
The check at 0x1AE5086 is sub eax, 0xC54 ; cmp eax, 3 — a contiguous range
test that costs two x86 instructions. Stock LLVM 18 used a 5-6-element named
whitelist, so the contiguous numbering is itself a fingerprint for the LLVM 21
NVPTX backend. Reimplementations cannot pick arbitrary opcode numbers for the
typed ProxyReg family without breaking the peephole's hot-path test.
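In C the two-instruction range test corresponds to a single unsigned subtract-and-compare. The sketch below uses hypothetical names; the constants are the ones in the table.

```c
#include <stdint.h>

enum {
    PROXYREG_FIRST = 3156,   /* 0xC54, ProxyRegI16 */
    PROXYREG_LAST  = 3159    /* ProxyRegF */
};

/* Mirrors sub eax, 0xC54 ; cmp eax, 3: opcodes below the range wrap to a
 * large unsigned value and fail the comparison. */
int is_typed_proxyreg(uint32_t opcode) {
    return (uint32_t)(opcode - PROXYREG_FIRST) <=
           (uint32_t)(PROXYREG_LAST - PROXYREG_FIRST);
}
```

The trick only works while the four opcodes stay adjacent, which is why the numbering is part of the contract.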
Layer 7 — FTZ-Path Constants in SelectIntrinsic_W_Chain Case 0x66
The per-call FTZ override in case 0x66 of sub_1A854E0 carries two MI opcode
literals and one SDNode flag bit that must reproduce exactly:
| Constant | Value | Meaning |
|---|---|---|
| FTZ-path FMA opcode | 0x65 | FMA_FTZ; emitted when probe selects FTZ |
| Non-FTZ-path wrapper opcode | 0xF7 | FMA_NON_FTZ; emitted when probe selects IEEE |
| FTZ-authorization flag bit | 0x40 | NoFPExcept reinterpreted as per-node FTZ-authorize signal |
| Inner FMAD opcode | 0x63 | Set with NoFPExcept (0x200) on the FTZ four-instruction chain |
| INST_WRAPPER opcode (non-FTZ) | 0xD2 | Holds chain through ADDRESSOF wrap |
| CopyToReg opcode | 0x11 | Standard LLVM SDNode opcode |
| MUL_ADD_f32 / MUL_ADD_f64 | 207 / 208 | MVT-keyed select after the wrapper chain |
⚡ QUIRK — NoFPExcept flag bit 0x40 repurposed as FTZ-authorization
Upstream LLVM treats SDNode flag bit 0x40 (NoFPExcept) as a pure FP-exception-safety advisory: it tells later passes that no FP exception can be raised. In case 0x66 of sub_1A854E0, tileiras reads the same bit before the "unsafe-fp-math" function attribute and treats it as a per-node "authorize FTZ substitution" signal. A combine that legitimately sets NoFPExcept on a single FMA in an otherwise IEEE-denormal function therefore silently switches that one FMA to fma.rn.ftz.f32 (opcode 0x65) instead of the FMA_NON_FTZ wrapper (0xF7). A reimplementation that imports upstream flag semantics will produce different PTX for the same SDAG.
Layer 8 — cvt_packfloat Validator Constants (sub_1A84900)
The four-gate cvt_packfloat validator carries five subtarget-level constants:
| Constant | Value | Gate |
|---|---|---|
| SM major floor | 0x384 (sm_90) | Gate 1 |
| PTX version floor | 0x4D (PTX 7.7) | Gate 1 |
| sm_100a SM major | 0xA0 | Gates 2 and 3 (UE8M0x2, fp6x2/fp4x2) |
| sm_100f SM minor | 0xF | Gate 4 (family-conditional) |
| tmem feature byte | offset 80 in subtarget feature array at unk_5BEBD51 | tcgen05 128-bit atomic guard at sub_1A80A40 |
⚡ QUIRK — atleast typo and mismatched PTX number in gate-one diagnostic
Gate one's diagnostic string is "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78": the missing space in atleast is preserved byte-for-byte, and the message advertises PTX >= 78 even though the actual compare is against 0x4D (PTX 7.7, not 7.8). The discrepancy stems from an internal NVIDIA test-suite log scraper that keys on the verbatim string. A reimplementer who "fixes" either the spelling or the number desyncs that scraper without changing behavior.
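Gate one can be sketched as a pair of floor checks that return the verbatim diagnostic on failure. The function name is hypothetical, and the sketch assumes both floors are tested with >= and joined with AND, with the SM value encoded as in the table (0x384 for sm_90); the other three gates are not modelled.

```c
#include <stddef.h>

/* Returns NULL when gate one passes, otherwise the diagnostic string,
 * reproduced byte-for-byte including "atleast" and "PTX >= 78". */
const char *cvt_packfloat_gate1(unsigned sm_version, unsigned ptx_version) {
    if (sm_version >= 0x384 && ptx_version >= 0x4D)   /* sm_90, PTX 7.7 */
        return NULL;
    return "cvt_packfloat intrinsic needs atleast SM90 and PTX >= 78";
}
```

A PTX 7.7 target passes the check even though the message text says 78, which is the mismatch the quirk describes.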
Layer 9 — LLVM 21 NVPTX Data-Layout Stamp
Every NVPTX module emitted by tileiras carries one verbatim data-layout string, unconditionally stamped before bitcode serialization:
e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-
i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-
v128:128:128-n16:32:64
Length: 154 bytes (0x9A). Rodata location: 0x4D079D0. Sole xref:
sub_1A4E5C0 at 0x1A4E5D1. Address space 3 (p3:32:32:32) marks
NVPTX shared memory as 32-bit-pointer. The string is byte-identical to
stock LLVM 21 NVPTX64, and is one of the ten independent fingerprints
that pin the LLVM base version in LLVM Fingerprint Table.
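The 154-byte length can be pinned at compile time, which catches an accidental edit to the string. The constant and function names below are hypothetical; the literal is the verbatim stamp joined back into one string.

```c
#include <string.h>

/* Verbatim data-layout stamp; adjacent literals concatenate to one string. */
static const char NVPTX64_DATA_LAYOUT[] =
    "e-p:64:64:64-p3:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-"
    "i128:128:128-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-"
    "v128:128:128-n16:32:64";

/* sizeof counts the trailing NUL, so subtract one (requires C11). */
_Static_assert(sizeof NVPTX64_DATA_LAYOUT - 1 == 0x9A,
               "data-layout stamp must be 154 bytes");

size_t data_layout_len(void) {
    return strlen(NVPTX64_DATA_LAYOUT);
}
```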
Layer 10 — LLVM Bitcode Producer Strings
Two .rodata strings stamp the LLVM base version into every emitted module; a third literal names the module when the libNVVM path is taken:
| Slot | Rodata address | Length | Verbatim string |
|---|---|---|---|
| IDENTIFICATION_CODE_STRING | 0x4F882C4 | 13 B | LLVM21.0.0git |
| NVPTX AsmPrinter emitHeader line 3 | (inside sub_1A56540) | varies | Based on LLVM 21.0.0git |
| libNVVM module name (when libNVVM path is taken) | (compile-time literal) | 10 B | mlir-input |
The producer string is emitted as the bitcode-writer's IDENTIFICATION subblock
record at sub_3935490 (the EnterSubblock(IDENTIFICATION, 5) site). The
AsmPrinter header comment block is written at every PTX-emit invocation; the
third line of four is the verbatim Based on LLVM 21.0.0git literal, not a
runtime-formatted template.
Cross-Layer Constant Index
For a reimplementation walking the wire format top-down, the constants converge on a small handful of source-of-truth dispatchers. The index below maps each constant back to the page that documents its dispatch site at reimplementation depth.
| Layer | Constant class | Authority page |
|---|---|---|
| 1 | Magic bytes, version range, section IDs | MLIR Bytecode Format |
| 2 | TypeTag 0..18 | MLIR Bytecode Format — Type Tag Dispatch |
| 3 | AttrTag 0..13, DebugTag 0..6 | MLIR Bytecode Format — Self-Contained Attribute Dispatch |
| 4 | cuda_tile opcodes 0..109, reserved holes | MLIR Bytecode Format — Operation Opcode Dispatch |
| 5 | XOR-3 cipher, pool ranges | ISelDAG and MatcherTable — AsmWriter String Tables |
| 6 | ProxyReg whitelist [3156, 3159] | LLVM Fingerprint Table — Fingerprint 8 |
| 7 | FMA opcodes 0x65 / 0xF7, flag bit 0x40 | ISelDAG and MatcherTable — NVIDIA-Specific ISel Patches |
| 8 | cvt_packfloat SM/PTX floors, tmem feature byte | ISelDAG and MatcherTable — NVIDIA-Specific ISel Patches |
| 9 | NVPTX64 data-layout string | LLVM Fingerprint Table — Fingerprint 1 |
| 10 | LLVM21.0.0git, Based on LLVM 21.0.0git, mlir-input | LLVM Fingerprint Table — Fingerprints 2, 3 |
Reimplementation Contract
Three rules summarize the constraint these constants impose on a clean-room reimplementation:
- Magic, AttrTag numbering, and cuda_tile opcode table are wire-format invariants. A reimplementation that picks any other byte for offset 7, any other tag-to-attribute-kind mapping in sub_59F100's switch, or any other opcode-to-mnemonic assignment in sub_5B13D0's switch produces bytecode that the shipped reader either rejects or silently mis-decodes.
- NVPTX MatcherTable pool ranges, ProxyReg numbering, and FMA opcode numbers are emitter invariants. A reimplementation that ships different bytes here still produces valid PTX, but the binary-for-binary .data and MIR cross-checks NVIDIA's internal regression suite runs against tileiras output will fail.
- All diagnostic strings — including the atleast typo, the PTX >= 78 off-by-one, the FileLineColLoc debug-attr naming inheritance — are contract surface. Test-suite log scrapers key on verbatim spelling. "Fixing" any of them is a behavioral change as far as downstream tools are concerned, even though the fix is locally correct.
The shared property across all three rules is that no constant in this page is configuration. Each is either a header-stamped invariant frozen at build time, a table TableGen emitted into the binary at LLVM 21 cut-time, or a literal NVIDIA chose for hand-rolled validator code. A reimplementation that wants compatibility must freeze every one of them.
String-Evidence and Confidence Policy
Abstract
Every claim in this wiki carries one of three confidence tags - HIGH, MED, or LOW - and every backticked string is a byte-for-byte literal lifted from the binary. This page defines the three tiers, the verbatim-string rule, and the operational checks that page authors are expected to apply. It is the short, working version of the policy; the longer methodology discussion lives at Methodology.
HIGH
A claim is HIGH when it has a byte-level anchor and at least one independent corroboration. The anchor is a verbatim string in .rodata, a vtable or typeinfo match against a known base class, or a structural fingerprint that has no plausible alternative interpretation. Corroboration means that two or more independent indicators (anchor plus call-site, anchor plus vtable slot, anchor plus distinctive control flow) all agree on the identification. HIGH is the default tag once a verbatim anchor exists.
MED
A claim is MED when its evidence is structural rather than verbatim, or when a single strong piece of evidence stands without independent corroboration. Vtable shape, neighbour-function arrangement, field-offset arithmetic recovered from callers, and sibling-cloning from a HIGH-anchored template all qualify as structural. MED is the standing tag for call-graph-position identifications: a function whose role is fixed by where it sits relative to a HIGH neighbour but whose body has not been read line-by-line.
LOW
A claim is LOW when it rests on inference from neighbouring code without a direct anchor, when evidence conflicts and the conflict is not yet resolved, or when no corroboration is available. LOW is appropriate for tiny helpers with no strings, no identified callers, no distinctive control flow, and no vtable-slot constraint. LOW evidence must not drive core prose; when LOW is the only available tag, the claim is rendered with explicit hedging or omitted entirely.
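The three tiers can be read as a small decision procedure. The sketch below is one possible reading, not a tool the wiki ships: the Evidence record, its field names, and the confidence_tag function are hypothetical, and the choice to rate an uncorroborated anchor as MED follows the "single strong piece of evidence" wording above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical summary of what supports one claim (field names are illustrative)."""
    has_byte_anchor: bool          # verbatim .rodata string, vtable/typeinfo match, or unambiguous fingerprint
    corroborations: int = 0        # independent indicators besides the anchor (call-site, vtable slot, control flow)
    structural_signals: int = 0    # vtable shape, neighbour arrangement, field-offset arithmetic, sibling cloning
    unresolved_conflict: bool = False


def confidence_tag(ev: Evidence) -> str:
    """Map evidence to HIGH, MED, or LOW under the assumptions stated above."""
    if ev.unresolved_conflict:
        return "LOW"               # conflicting evidence stays LOW until resolved
    if ev.has_byte_anchor and ev.corroborations >= 1:
        return "HIGH"              # anchor plus at least one independent corroboration
    if ev.has_byte_anchor or ev.structural_signals >= 1:
        return "MED"               # single strong item, or structural evidence only
    return "LOW"                   # inference from neighbouring code, no anchor


if __name__ == "__main__":
    print(confidence_tag(Evidence(has_byte_anchor=True, corroborations=1)))
    print(confidence_tag(Evidence(has_byte_anchor=False, structural_signals=2)))
    print(confidence_tag(Evidence(has_byte_anchor=False)))
```

Checking the unresolved-conflict case first means that a strong anchor cannot mask a contradiction that has not yet been settled.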
Verbatim String Rule
Every backticked string in this wiki is byte-identical to an entry in the binary's string table, or to a substring of one. Two narrow exceptions are accepted as long as they are flagged: a printf prefix that is concatenated at runtime with a substituted suffix, and a templated instantiation where only the templated tail differs. Paraphrases and reconstructions are not backticked. Representational differences (the escape form \n versus the rodata byte 0x0A) are accepted without special marking.
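The rule can be checked mechanically. The following is a minimal sketch, assuming the string table is available as raw bytes with NUL-terminated entries; the function names, the three result labels, and the sample bytes are made up for illustration and are not taken from the tileiras binary. Only the \n and \t escape forms are normalized.

```python
def normalize_escapes(literal: str) -> bytes:
    """Turn the wiki's escape spelling into rodata bytes (only \\n and \\t are handled here)."""
    return literal.replace("\\n", "\n").replace("\\t", "\t").encode("utf-8")


def check_literal(literal: str, string_table: bytes) -> str:
    """Classify a cited string as 'exact', 'substring', or 'absent'.

    'exact' means byte-identical to a whole NUL-terminated entry.
    'substring' means it occurs inside an entry and should be flagged as such.
    """
    needle = normalize_escapes(literal)
    entries = [entry for entry in string_table.split(b"\x00") if entry]
    if needle in entries:
        return "exact"
    if any(needle in entry for entry in entries):
        return "substring"
    return "absent"


if __name__ == "__main__":
    # Fabricated sample bytes standing in for a real string table.
    sample = b"demo: need atleast %d items\n\x00other entry\x00"
    print(check_literal("demo: need atleast %d items\\n", sample))  # whole entry
    print(check_literal("atleast", sample))                         # part of an entry
    print(check_literal("at least", sample))                        # corrected spelling is not present
```

A result of "substring" corresponds to the flagged exceptions above, and "absent" means the text is a paraphrase and should not be backticked.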
Operational Checks
Page authors are expected to run these checks before merging a claim:
- When you cite a string, verify it byte-for-byte against the binary's string table. If the string is a substring or templated tail, flag it.
- When you cite a function, structure, or option by role, attach the strongest available tag (HIGH, MED, or LOW). If you cannot reach MED, omit the claim or render it with explicit hedging.
- When you cite an address-bound fact, prefer the role-level wording the rest of the wiki uses; the binary-layout and function-map pages describe the structures at the conceptual level so individual pages do not need to.
- When a contradiction surfaces between two analyses of the same construct, prefer the one with the stronger structural anchor regardless of recency, and update the errata log so dependent pages can be restamped.
Errata are tracked in the wiki's errata log alongside the verification passes; page authors propagate fixes by re-reading every page that cites a superseded identification and restamping it with the corrected tag.
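The contradiction rule and the restamping step can be sketched as two small helpers. This is an illustration under stated assumptions: the Analysis record, the use of the confidence tag as a proxy for anchor strength, the tie-breaking choice, and the page and identification names are all hypothetical.

```python
from dataclasses import dataclass

ANCHOR_RANK = {"LOW": 0, "MED": 1, "HIGH": 2}


@dataclass(frozen=True)
class Analysis:
    ident: str     # hypothetical identification label for a construct
    tag: str       # "HIGH", "MED", or "LOW", used here as a proxy for anchor strength
    pass_no: int   # verification pass number; larger means more recent


def resolve(a: Analysis, b: Analysis):
    """Return (kept, superseded). The stronger anchor wins and recency is ignored.

    Ties keep the first argument; the policy text does not define tie-breaking.
    """
    return (a, b) if ANCHOR_RANK[a.tag] >= ANCHOR_RANK[b.tag] else (b, a)


def pages_to_restamp(citations, superseded_ident):
    """List pages citing a superseded identification, given a page -> set of identifications map."""
    return sorted(page for page, idents in citations.items() if superseded_ident in idents)


if __name__ == "__main__":
    older = Analysis("reader-dispatch", "HIGH", pass_no=1)
    newer = Analysis("writer-dispatch", "MED", pass_no=4)
    kept, dropped = resolve(older, newer)
    citations = {
        "page-a": {"writer-dispatch"},
        "page-b": {"reader-dispatch"},
        "page-c": {"writer-dispatch", "other"},
    }
    print(kept.ident, pages_to_restamp(citations, dropped.ident))
```

In this example the older analysis is kept because its anchor is stronger, and the two pages that cite the superseded identification are queued for re-reading.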
The reader-side recipe for verifying a backticked string or a structural anchor against the binary directly is documented on Binary Anatomy and RE Methodology.