Frequently Asked Questions
Abstract
This page collects the questions a reader most often arrives with and points each one at the page that answers it in full. The entries are short by design: enough context to confirm that the question matches the situation, plus a link into the subsystem documentation. Detailed contracts, pseudocode, and confidence claims live on the linked pages.
The page is organized into five clusters: what tileiras is, how to use it, how to read this wiki, suggested reading paths for common goals, and meta questions about the project itself.
About tileiras
What is tileiras?
Tileiras is NVIDIA's CUDA TileIR optimizing assembler, shipped in CUDA 13.1
as a separate compiler binary that sits alongside cicc, ptxas, and
cudafe++. It consumes MLIR bytecode describing a tile-level GPU program,
runs that program through a cascade of nine dialect layers, and emits a host
ELF relocatable that carries compiled SASS for one Blackwell-family target.
The narrow framing is the useful one. Tileiras does not parse CUDA C++, does
not handle host-side template expansion, and does not generate launch stubs.
Those responsibilities belong elsewhere in the toolchain. Tileiras begins
after a frontend has already produced Tile IR and ends after ptxas has
finished. The high-level shape is documented in
Tileiras Internals and in
Position in nvcc 13.1.
How is tileiras different from cicc?
The two binaries are sibling device compilers that target different input
languages. cicc is the legacy LLVM-based path: CUDA C++ enters through
cudafe++, lowers through a device LLVM IR cascade, and emits PTX.
Tileiras is the MLIR-based path: a tile frontend emits MLIR bytecode,
tileiras runs the dialect cascade, and emits PTX. They share ptxas as a
downstream and they share several IR concepts in the lower layers, but they
do not share a frontend, a pass pipeline, or a scheduler.
The contrast is unpacked in cicc Comparison.
Is tileiras open source?
Mostly no. A small portion of the cuda_tile dialect appears in NVIDIA's
public CUTLASS repository and that portion is what the
OSS Comparison Overview maps. The rest of the program,
including every other dialect, the scheduler, the NVPTX customisations, and
the driver, is closed and ships only as a binary in the CUDA toolkit. The
.td Files Delta page enumerates which TableGen
files have public counterparts.
What architectures does tileiras support?
The driver accepts the SM levels listed in the
CLI Options page as valid --gpu-name values.
The supported set includes sm_100, sm_103, sm_110, sm_120, sm_121,
their a (architecture-specific) variants, and a backward-compatibility
range that covers earlier Hopper and Ampere targets used by older CUTLASS
atoms. The exact mapping from SM level to feature set is in
PTX Version and Target Selection.
Using tileiras
How do I invoke tileiras?
Normally, you do not invoke it directly. nvcc invokes tileiras when the
input is Tile IR bytecode produced by a tile frontend. Direct invocation is
also supported and matches the form
tileiras --gpu-name=<target> --opt-level=<n> -o <output> <input>. The
full option matrix is in CLI Options and the
runtime gates that change behavior without appearing on the command line are
in Env Vars and Runtime Gates.
How do I produce Tile IR bytecode?
A tile frontend produces it. NVIDIA's Triton-style frontend and the CUTLASS
DSL frontend are the two known producers. Either one emits MLIR bytecode
that uses the dialect tags tileiras expects. Hand-writing Tile IR is
possible through mlir-translate with a tileiras-aware bytecode writer,
but it is not the supported path. The frontend contract, including the
dialect schema that a producer must obey, is in
Frontend Contract and Tile IR Emission.
Why doesn't mlir-translate --serialize-bytecode produce tileiras-readable files?
Tileiras's bytecode uses a wire format that diverges from upstream MLIR
bytecode in the attribute and type tag tables. Upstream mlir-translate
writes upstream tags; the tileiras reader expects its own tag space and
rejects files that probe as upstream bytecode. The reader probes for the
upstream magic explicitly so that this case produces a specific diagnostic,
failed to parse IR bytecode (it looks like MLIR bytecode instead). The
wire-format contract is in
MLIR Bytecode Format and the dialect-by-dialect
status is in Dialect Reader/Writer Status.
My compile fails with --device-debug --opt-level=3. Why?
The combination is rejected at driver level with optimized debugging is not supported. --device-debug implies --opt-level=0; raising the
optimization level past that is an error rather than a warning. Use
--lineinfo for source mapping at higher optimization levels. The full
debugging story is in
Debugging and Introspection.
What does --gpu-name=sm_90a mean versus --gpu-name=sm_90?
The a suffix marks an architecture-specific target. sm_90a unlocks
Hopper-only instructions, most importantly WGMMA, that are not part of the
forward-compatible sm_90 baseline. Code compiled for sm_90a does not
run on later architectures without recompilation; code compiled for sm_90
does. The same pattern repeats for sm_100a, sm_103a, sm_120a, and
sm_121a. The selection logic is in
PTX Version and Target Selection.
Understanding the wiki
How accurate is this wiki?
For verbatim artifacts, very accurate. Diagnostic strings, opcode mnemonics,
attribute schemas, and bit-field layouts are extracted from the binary
byte-by-byte and carry HIGH confidence. For named functions like
sub_ABCDEF, the addresses are exact but the names are auto-generated by
IDA Pro because the binary is stripped; the algorithm descriptions on those
pages are derived from disassembly rather than from a source-level symbol.
The confidence taxonomy that every page uses is in
String Evidence and Confidence Policy.
Why are there sub_XXX references throughout the wiki?
Tileiras ships as a stripped binary, so the original function names are not
recoverable. IDA Pro names unknown functions sub_<hex_address> and the
wiki keeps that convention as the canonical reference for a function whose
real name is unknown. The address is stable across analyses and useful for
cross-referencing the binary; the prose around it describes what the
function does. The reverse-engineering methodology is in
Binary Anatomy and RE Methodology.
What is the difference between cute, cute_nvgpu, and cutlass?
Three layers of the CUTLASS programming model, each a separate dialect.
cute is the layout algebra and tile-decomposition primitive set. The
contract is in cute Overview. cute_nvgpu
is the NVIDIA-specific atom layer that binds layouts to actual hardware
copy and MMA instructions; its roster is in
cute_nvgpu Overview. cutlass is the
high-level pipeline and tile-scheduler dialect that orchestrates kernels
built from the lower two; its overview is in
cutlass Overview.
What does "wave-specialized" mean?
A scheduling pattern, also called producer-consumer specialization, where
one warp-group performs asynchronous loads and another warp-group performs
the matrix-multiply. The division is explicit in the IR: a producer
warp-group issues TMA copies and signals an mbarrier, the consumer
warp-group waits on that mbarrier and consumes the data. The op roster is
in nv_tileas Op Roster and Builders
and the synchronization protocol is in
mbarrier State Machine.
What is mbarrier?
A transactional barrier living in shared memory, introduced on Hopper as the synchronisation primitive for asynchronous copies. An mbarrier carries an arrival count and a transaction-byte count; producers update the transaction count when their copy commits, consumers wait until both counters reach a threshold. The state machine is documented in mbarrier State Machine.
What is TMA?
The Tensor Memory Accelerator, a Hopper-and-later hardware engine for
asynchronous bulk tensor copies between global and shared memory. The TMA
descriptor (CUtensorMap) encodes the multi-dimensional shape and
swizzling; the copy itself is initiated by cp.async.bulk.tensor and its
completion is ordered through an mbarrier. The codegen contract is in
TMA, Tensormap and cp.async.bulk.
What is WGMMA?
Warp-Group Matrix Multiply-Accumulate, the Hopper sm_90a instruction in
which four warps cooperate to issue one matrix-multiply against shared-memory
operands described by a 64-bit descriptor. The descriptor layout, the
synchronisation fence sequence, and the way the scheduler treats WGMMA
issue groups are in WGMMA Emission Protocol.
What is tcgen05?
The Blackwell sm_100a matrix-multiply family that replaces WGMMA. Unlike
WGMMA, tcgen05 keeps operands and accumulators in a dedicated tensor memory
(TMEM) bank rather than in registers, and supports 2-CTA and 4-CTA modes
where the multiply spans multiple thread-blocks within a cluster. The
tensor-memory programming model is in
tcgen05 Tensor Memory Model and the
multi-CTA variants are in
Blackwell 2-CTA/4-CTA MMA.
Reading paths
I want to reimplement tileiras
Read Position in nvcc 13.1 first to fix the binary's role, then Program Layout for the executable shape, then Pipeline Overview for the top-to-bottom cascade. Drill into whichever subsystem you are implementing next. Verify any single claim against the binary using the recipes in Binary Anatomy and RE Methodology.
I want to write a Tile IR frontend
Read Frontend Contract and Tile IR Emission for the dialect schema your emitter must satisfy, then cuda_tile Overview for the public-input dialect, then DSL to PTX End-to-End to follow a worked example from frontend output to PTX.
I want to understand WGMMA emission
Read WGMMA Emission Protocol for the issue contract, then cute_nvgpu MMA Atoms SM70-120 for the per-SM atom registry, then Per-SM Emission Templates for the PTX templates that the backend prints.
I want to debug a slow kernel
Read Performance and Cost Model for the scheduling cost function, then Debugging and Introspection for the diagnostic surfaces tileiras exposes, then Modulo Scheduler and Rau when the bottleneck reaches the scheduler itself.
I want to verify a claim made on this wiki
Read Binary Anatomy and RE Methodology for the verification recipes, then the String Evidence and Confidence Policy for how each page tags its claims.
Meta
Who wrote this wiki?
Reverse engineering and writing by Grigory Evko. The project is not endorsed by NVIDIA. Every claim derives from static analysis of the publicly-distributed CUDA 13.1 tileiras binary.
How can I contribute?
The wiki source lives at github.com/GrigoryEvko/nvopen-tools under
tileiras/wiki/. Issues and pull requests are welcome. Corrections that
challenge a specific claim are most useful when they cite either a
reproducible behavior of the binary or a binary offset.
Can I trust this wiki for production decisions?
For documentation and reimplementation reference, yes, within the confidence labels each page declares. For correctness in production, treat the wiki as a derived description and confirm any safety-critical behavior against the actual binary. The wiki is a reverse-engineered model; authoritative behavior lives only in the tileiras binary itself.
Cross-references
The questions above point into the rest of the wiki; the converse direction
is Reading Map, which organizes reading sequences by
subsystem rather than by question. The Glossary defines
the terms used here without unpacking them. The
Subsystem Map is the cross-reference for any
sub_<hex> name encountered while answering a follow-up question.