cute_nvgpu Dialect Overview
Provenance vs Upstream MLIR
cute_nvgpu is NVIDIA-introduced and has no upstream MLIR counterpart. Upstream MLIR exposes NVIDIA hardware operations only through nvgpu (a thin bridge dialect) and nvvm (typed intrinsics). Neither models the SM-tier-qualified atom catalogue — WGMMA, UMMA, TMA, TMEM lifecycle, ldmatrix/stmatrix, block-scaled MMA forms — that tileiras needs to keep around between cute layout algebra and nvgpu lowering. Without this dialect the layout-to-intrinsic step would have to collapse atom selection, SM-tier verification, and intrinsic emission into one rewrite; the dialect splits those concerns so the SM gate can run before NVVM conversion. See nvgpu for the upstream-linked bridge below this layer.
Abstract
cute_nvgpu is the NVIDIA architectural atom dialect sitting on top of cute. It hosts MMA atoms from SM70 through SM120 (WGMMA and UMMA included), TMA descriptor and transfer atoms, TMEM lifecycle operations, LDSM/STSM matrix-load atoms, and SMEM descriptor views. Every operation passes through an explicit SM-tier verifier, so an invalid (shape, element type, target) triple is rejected before NVVM emission. This page is the dialect-level map; per-family detail lives in the linked sub-pages.
Where cute describes target-neutral layout algebra, cute_nvgpu binds that algebra to real NVIDIA operations — MMA, WGMMA, TMA, TMEM allocation, ldmatrix, stmatrix, async bulk copies, SM-specific copy atoms. It is the seam where a layout stops being merely algebraic and starts requesting a specific GPU instruction family.
The dialect is organised by architecture tier. Older tiers describe classic tensor-core MMA and copy atoms. Hopper-era tiers add WGMMA and TMA descriptor movement. Blackwell-era tiers add tensor-memory lifecycle and block-scaled MMA forms. Tier names live in the operation spellings so verifiers and lowerings can reject invalid shape, element-type, or target combinations before NVVM conversion.
Position in the Cascade
cute
|
| select target-specific copy, MMA, TMA, and tensor-memory atoms
v
cute_nvgpu
|
| normalize architecture atoms
v
nvgpu
|
| emit NVVM intrinsics
v
PTX
cute_nvgpu preserves the high-level atom boundary. It sits below pure layout algebra and above raw NVVM intrinsics — the natural place to enforce SM-tier constraints and descriptor compatibility while keeping the final intrinsic selection simple.
Architecture Tiers
| Tier | Main operations | Meaning |
|---|---|---|
| SM70 | universal FMA and copy fallbacks | Baseline tensor-core-era atom vocabulary. |
| SM80 | sm80.mma, sparse MMA | Ampere MMA and structured sparsity forms. |
| SM89 | FP8-oriented MMA variants | Ada-generation element-type extensions. |
| SM90 | WGMMA, TMA descriptor views | Hopper warpgroup MMA and tensor-memory async movement. |
| SM100 | UMMA, TMEM lifecycle, block-scaled MMA | Blackwell datacenter tensor-memory and tcgen-style operations. |
| SM120 | consumer Blackwell block-scaled forms | Consumer Blackwell microscaling and per-lane scale metadata. |
Tier spelling is part of the IR contract. Lowering must not silently reinterpret an operation as a different tier just because another instruction shape looks similar. If a target does not support the tier named by the operation, verification fails before NVVM lowering.
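Because the tier is carried in the spelling, a gate can be written as a pure function of the op name and the target. The sketch below is illustrative only: `parse_sm_tier` and `tier_gate_passes` are hypothetical helpers, not tileiras functions, and the set of supported tiers is assumed to be supplied by the caller rather than derived from any real capability table.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <set>
#include <string_view>

// Hypothetical helper: extract the SM tier from an op spelling such as
// "sm80.mma" or "SM120.mma_bs". Tier-generic spellings such as
// "atom.universal_fma" yield std::nullopt.
std::optional<int> parse_sm_tier(std::string_view spelling) {
    // Tolerate an optional dialect prefix (assumed form: "cute_nvgpu.").
    constexpr std::string_view prefix = "cute_nvgpu.";
    if (spelling.substr(0, prefix.size()) == prefix)
        spelling.remove_prefix(prefix.size());

    // Accept exactly "sm" or "SM"; mixed case is treated as not a tier.
    std::string_view head = spelling.substr(0, 2);
    if (head != "sm" && head != "SM") return std::nullopt;

    int tier = 0;
    std::size_t i = 2;
    while (i < spelling.size() &&
           std::isdigit(static_cast<unsigned char>(spelling[i]))) {
        tier = tier * 10 + (spelling[i] - '0');
        ++i;
    }
    // Require at least one digit, followed by the '.' that starts the op name.
    if (i == 2 || i >= spelling.size() || spelling[i] != '.') return std::nullopt;
    return tier;
}

// Hypothetical gate: an op passes only if the tier it names is in the
// caller-supplied set. No tier is ever substituted for another.
bool tier_gate_passes(std::string_view spelling,
                      const std::set<int>& supported_tiers) {
    std::optional<int> tier = parse_sm_tier(spelling);
    if (!tier) return true;  // tier-generic atoms are not gated here
    return supported_tiers.count(*tier) != 0;
}
```

The gate uses set membership rather than a "greater or equal" comparison so that a newer target never implicitly accepts an older tier's spelling; whether any given target does accept it is a property of the real capability table, which this sketch does not model.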
Atom Families
The major atom families are:
- MMA atoms, including dense, sparse, FP8, WGMMA, UMMA, and block-scaled forms.
- TMA (Tensor Memory Accelerator) atoms for bulk tensor load, store, gather, scatter, prefetch, and descriptor use.
- Copy atoms for register, shared-memory, global-memory, and tiled partition movement.
- TMEM lifecycle operations for allocation, deallocation, permit transfer, and pointer retrieval.
- Descriptor view operations that connect cute layouts to hardware descriptor operands.
- Kernel-marker lowering that turns a cute kernel marker into the entry-point marker expected by NVVM.
Each family consumes cute layout values and emits lower-level operations whose shapes, element types, and memory spaces are visible to the target.
Kernel Lowering
The kernel boundary stays deliberately simple. A function marked as a cute kernel becomes an NVVM kernel entry, and every architecture atom in the body lowers or normalises toward nvgpu and nvvm.
void lower_cute_kernel_to_nvvm(Function func, Target target) {
    // Swap the cute kernel marker for the NVVM entry-point marker.
    if (has_attr(func, "cute.kernel")) {
        remove_attr(func, "cute.kernel");
        set_attr(func, "nvvm.kernel");
    }
    // Gate each architecture atom on the target, then lower it.
    for (Operation *op : func.walk()) {
        if (is_cute_nvgpu_mma(op)) {
            require(target_supports_mma_tier(target, op));
            lower_mma_atom(op, target);
        } else if (is_cute_nvgpu_tma(op)) {
            require(target_supports_tma(target, op));
            lower_tma_atom(op, target);
        } else if (is_cute_nvgpu_tmem(op)) {
            require(target_supports_tmem(target, op));
            lower_tmem_lifecycle_op(op, target);
        } else if (is_cute_layout_carrier(op)) {
            // Layout carriers have no tier gate; they become descriptors or views.
            rewrite_descriptor_or_view(op, target);
        }
    }
}
The rewrite preserves the semantic shape of the atom. A WGMMA atom lowers through a warpgroup MMA op, not a scalarized loop that happens to compute the same value. A TMA atom lowers through descriptor construction and async tensor-memory ops, not through ordinary elementwise loads — unless an explicit fallback path exists.
Verifier Invariants
A correct verifier rejects invalid target combinations early. It should check that:
- the selected target supports the SM tier named by the operation,
- MMA tile shapes are supported by that tier,
- operand element types match the tier and the chosen MMA mode,
- sparse MMA forms include valid metadata and selector attributes,
- block-scaled MMA forms include valid scale-vector layout and per-lane scale ids,
- TMA descriptor operands agree with the source or destination layout,
- tcgen05.mma kind words clear the 13-rule mutual-exclusion ladder before opcode selection,
- TMA partition ops clear the 11-step ladder (type, layout-kind, integer-stride, swizzle, static, shape-equiv, G-basis, layout, tensor-type, multicast),
- tensor-memory operations respect allocation, deallocation, and permit-transfer order,
- descriptor views preserve address space, element type, shape, and swizzle requirements,
- kernel entry markers are rewritten before NVVM emission.
These invariants are easiest to enforce while the atom name is still present. Once the op has become an NVVM intrinsic the diagnostic context shrinks, and the original layout intent may already be gone.
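As one illustration of an ordering invariant, the tensor-memory lifecycle rule can be sketched as a linear scan over lifecycle events. Everything here is hypothetical: `TmemEvent`, `TmemOp`, and `check_tmem_order` are invented names, and the rule set (no allocation after a permit transfer, no use or deallocation of memory that is not live, no leaked allocations) is an assumption about what "order" means, not the dialect's actual verifier.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical lifecycle events; names are illustrative, not dialect ops.
enum class TmemEvent { Alloc, Use, Dealloc, PermitTransfer };

struct TmemOp {
    TmemEvent kind;
    int handle;  // identifies one allocation; ignored for PermitTransfer
};

// Returns an empty string when the sequence is well ordered, otherwise a
// short diagnostic naming the first violated rule (assumed rule set).
std::string check_tmem_order(const std::vector<TmemOp>& ops) {
    std::set<int> live;            // handles currently allocated
    bool permit_transferred = false;
    for (const TmemOp& op : ops) {
        switch (op.kind) {
        case TmemEvent::Alloc:
            if (permit_transferred) return "allocation after permit transfer";
            if (!live.insert(op.handle).second) return "handle allocated twice";
            break;
        case TmemEvent::Use:
            if (live.count(op.handle) == 0)
                return "use of tensor memory that is not allocated";
            break;
        case TmemEvent::Dealloc:
            if (live.erase(op.handle) == 0)
                return "deallocation without matching allocation";
            break;
        case TmemEvent::PermitTransfer:
            permit_transferred = true;
            break;
        }
    }
    if (!live.empty()) return "allocation never deallocated";
    return "";
}
```

Running a check of this kind at the atom level means the diagnostic can still name the lifecycle op and its handle, which is the point the paragraph above makes about shrinking context after NVVM conversion.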
If You Know CUTLASS (open source) — cross-walk
For readers fluent in cutlass/arch/*.hpp and the per-SM atom traits in open-source CUTLASS:
| CUTLASS C++ | tileiras IR (cute_nvgpu) |
|---|---|
| cutlass::arch::Mma<...> SM70/SM80/SM89 specialisations | atom.universal_fma, sm80.mma, sm89.mma (plus sm80.sparse_mma) |
| cutlass::arch::Wmma<...> traits | accessed through atom.universal_fma and tier-generic paths |
| Hopper GMMA::ss/rs/sr descriptor builders | cute_nvgpu.smem_desc_view + the descriptor packer at sub_17DD6A0 |
| Hopper WGMMA atom + make_smem_desc | cute_nvgpu.sm90.mma op consuming a !smem_desc_view typed operand |
| Hopper TMA cp.async.bulk.tensor family | atom.tma_load, atom.tma_store, atom.tma_reduce plus the non-exec variants |
| Hopper cuTensorMapEncodeTiled | tma_descriptor_tiled type + the TMA descriptor builder |
| Blackwell TCGEN / UMMA atoms | sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp |
| Blackwell TMEM allocation / lifecycle | atom.tmem_load, atom.tmem_store, atom.s2t_copy, the TMEM lifecycle ops |
| cutlass::arch::Sm120BlockScaledMma<...> | SM120.mma_bs (uppercase SM is required) |
| Shared-memory matrix loads (ldmatrix) | atom.ldsm, atom.stsm with the mode/size pattern matrix in Mode Pattern Verifiers — LDSM and STSM Matrix |
Two departures from the open-source surface matter. First, SM120.mma_bs is the only SM120 entry — no SM120.mma, no sparse variant — matching the consumer-Blackwell FP4 surface where sparse MMA is not exposed. Second, the SMEM descriptor is a first-class IR type (!smem_desc_view) rather than an i64 immediate, so the verifier can re-check the descriptor's swizzle and tile-stride encoding against the same layout that produced it.
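The second point can be made concrete with a small sketch. The types and the bit packing below are hypothetical (`Swizzle`, `SmemLayout`, `SmemDescView`, and `pack` are invented for illustration and do not reproduce the hardware descriptor encoding); the sketch only shows why a typed descriptor that remembers its source layout can be re-verified, whereas a bare 64-bit immediate has nothing to be checked against.

```cpp
#include <cstdint>

// Hypothetical swizzle modes; the enumerator names are illustrative only.
enum class Swizzle : std::uint8_t { None, B32, B64, B128 };

// Hypothetical layout summary: just the two properties named in the text.
struct SmemLayout {
    Swizzle swizzle;
    std::uint32_t tile_stride_bytes;
};

// Hypothetical typed descriptor: the packed word plus the layout it came from.
struct SmemDescView {
    SmemLayout layout;
    std::uint64_t packed;
};

// Illustrative packing only; NOT the real descriptor bit layout.
constexpr std::uint64_t pack(const SmemLayout& l) {
    return (static_cast<std::uint64_t>(l.swizzle) << 32) | l.tile_stride_bytes;
}

constexpr SmemDescView make_view(const SmemLayout& l) {
    return {l, pack(l)};
}

// Re-derive the packed word from the carried layout and compare. With a bare
// i64 operand this check has no second operand to compare against.
constexpr bool verify_view(const SmemDescView& v) {
    return pack(v.layout) == v.packed;
}

static_assert(verify_view(make_view({Swizzle::B64, 512})),
              "a freshly built view must verify");
```

If a later rewrite changes the packed word without updating the layout (or the reverse), `verify_view` returns false, which is the kind of drift a typed operand lets the verifier catch.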
Cross-links
- SM Tier Roster and Copy Atom Registry — Atom Surface by Tier lists the atom families by target tier.
- MMA Atoms SM70-120 — Per-Arch MMA Shape Lattice covers matrix-multiply atom semantics.
- TMA Atoms — Atom Family covers tensor-memory descriptor and transfer atoms.
- Mode Pattern Verifiers — UMMA Canonical Layout Verifier covers shape, element-type, and mode checks.
- Asm Printer and Mnemonic Hash — Mnemonic Perfect-Hash Dispatch covers textual spelling and parser dispatch.