NVVM Dialect Overview

Provenance vs Upstream MLIR

nvvm is the upstream MLIR dialect, linked unchanged from the LLVM/MLIR snapshot tileiras tracks (the same dialect described in mlir/Dialect/LLVMIR/NVVMOps.td). Tileiras adds no nvvm.* op of its own — every op listed below comes from upstream. What the binary does override is usage: the inline-PTX templates, sparse-MMA path, and a few tcgen05 lowerings emit forms not yet exposed as upstream NVVM intrinsics in this snapshot. Those gaps are called out per family below.

Abstract

Every nvvm.* op exists to print one PTX instruction (or one inline-asm template). nvvm is the bottom MLIR dialect in TileIR's lowering stack — a typed intrinsic layer, not a programming model. Earlier dialects decide tiling, scheduling, pipeline stages, layouts, and target atoms; nvvm preserves those decisions in a form LLVM and the NVPTX backend understand.

Three lowering paths cover the whole dialect. Most ops become a call @llvm.nvvm.X intrinsic that the NVPTX backend prints as the matching PTX instruction. A smaller set lowers to llvm.inline_asm with a fixed PTX template — sparse MMA, a handful of TMA replace variants, a few cluster ops. The third path expands into ordinary llvm dialect ops (alloca, GEP, store, call). No nvvm.* op survives NVVM-to-LLVM conversion.

Position in the Cascade

nvgpu
    |
    | convert GPU operations to NVVM operations and LLVM helper IR
    v
nvvm
    |
    | convert NVVM operations to LLVM intrinsics or inline assembly
    v
llvm
    |
    | optimize, verify, select instructions, print PTX
    v
PTX

nvgpu is the last MLIR layer that still looks like a GPU dialect. nvvm looks more like LLVM IR: pointer types, vector types, memory-order attributes, target attributes, and intrinsic operand shapes have to be explicit by the time IR arrives. Most verifier failures here are best read as "the previous lowering didn't finish specifying the target operation." See Lowering: nvgpu / gpu to NVVM for the per-op rewrite contract.

Per-Family Pages

Rather than a single op count, the dialect's footprint is best stated as a per-category breakdown:

128 ops (one per mlir::NVVM::*Op class, as measured by distinct mlir::NVVM::detail::*OpGenericAdaptorBase template instantiations in the string pool — each TableGen-generated Op class instantiates exactly one OpGenericAdaptorBase to project its operand and attribute layout, so this count is the authoritative Op-class count);
64 attrs (one per mlir::NVVM::*Attr class — AtomicOpKindAttr, CacheEvictionPriorityAttr, ConvertFP{4,6,8}TypeAttr, CpAsyncBulkTensorLoadModeAttr, FPRoundingModeAttr, LoadCacheModifierKindAttr, MBarrierScopeKindAttr, MMAB1OpAttr, MMALayoutAttr, MMAShapeAttr, MMATypesAttr, MemScopeKindAttr, Tcgen05*Attr × 9, WGMMA*Attr × 3, and the rest — these supply the enum-attribute mnemonics most ops carry);
0 user-visible Type classes (the seven mlir::NVVM::*Type strings in the binary — ConvertFP{4,6,8}Type, DotAccumulateType, MMAType, ReductionType, WGMMAType — name enum-attribute kinds, not MLIR Type classes; nvvm does not introduce new MLIR Type subclasses);
~70 enum-mnemonic strings (the enum-attribute kind names the parser accepts and the printer emits, e.g. f16, bf16, e4m3, e5m2, row, col, cta, cluster, acquire, release — these are the values inside the Attr classes above, not separate ops);
19 module- and function-metadata keys (the single-segment "nvvm.X" strings the NVPTX backend reads as LLVM module/function metadata after MLIR-to-LLVM translation: nvvm.kernel, nvvm.target, nvvm.annotations, nvvm.annotations_transplanted, nvvm.reqntid, nvvm.maxntid, nvvm.minctasm, nvvm.maxnreg, nvvm.maxclusterrank, nvvm.cluster_dim, nvvm.cluster_max_blocks, nvvm.blocksareclusters, nvvm.grid_constant, nvvm.hidden, nvvm.reflection, nvvm.restrict_keyword, nvvm.restrict_processed, nvvm.restrict_scope, nvvm.exit).

The PTX form count is several times larger than the Op-class count because most ops carry attribute-driven cross-products: shape × layout × element type for WMMA, rank × mode for TMA, kind × cta_group × collector for tcgen05, generic vs .shared splits for mbarrier, and so on. They split cleanly into eight large families plus a long tail of small ones. The bulk of each family is documented on its own page; this overview lists the families, their roster sizes, the SM floor, and one example op so the cross-link table doubles as an index. (WMMA is three ops in MLIR; the PTX shape × layout × element-type cross-product is reached through attributes on nvvm.wmma.{load,store,mma} rather than per-combination ops.)

Diagnosing the prior oscillation. Earlier waves had quoted three different NVVM op counts: 213/218, 124, and 86. None were wrong about what they measured; each measured a different thing. The 213-vs-218 pair counts every op-name-shaped string in the binary, including the LLVM intrinsic names (llvm.nvvm.barrier0, llvm.nvvm.cp.async.bulk.tensor.*, the full llvm.nvvm.tcgen05.* family) that NVVM-to-LLVM lowering emits — these aren't nvvm.* ops, they're the lowered IR that follows. The 1704-byte / 8-stride TypeID slab 0x5B8D610..0x5B8DCB8 documented in Op Mnemonic Master Table — §8 NVVM.* reaches 213 because the slab includes one entry per RegisteredOperationName::insert call, and a handful of mnemonics (the nvvm.read.ptx.sreg.envregN series for N=0..31, the per-axis splits of cluster registers) register separate slots even though they share an Op class. The 124 count was Op-classes-minus-an-internal-subset; the 86 count was a broken single-xref heuristic that walked one cross-reference table and missed the families reached through TypeID dispatch. The 128 above is the only number that survives all three cross-checks: mlir::NVVM::*OpGenericAdaptorBase strings (template-side), distinct mlir::NVVM::*Op class names (RTTI-side), and unique Op classes registered through sub_4461CA0 from the NVVM dialect constructor.

Family	Count	SM floor	Example op	Page
WMMA — warp-synchronous register MMA	3	sm_70	`nvvm.wmma.mma`	WMMA Ops
WGMMA — warp-group async MMA (Hopper)	4	sm_90a	`nvvm.wgmma.mma_async`	WGMMA Ops
TMA — bulk tensor copy, prefetch, reduce	9 dialect ops (rank 1..5 and mode in attributes)	sm_90	`nvvm.cp.async.bulk.tensor.shared.cluster.global`	TMA Ops
tcgen05 — Blackwell tensor memory + MMA	15 dialect ops (kind / cta_group / collector / layout / sparsity / block-scale in attributes; the cross-product reaches several thousand PTX forms)	sm_100a	`nvvm.tcgen05.mma.block_scale`	tcgen05 Ops
mbarrier — shared-memory barrier state machine	12 dialect ops (generic vs `.shared` address-space split adds the second variant on most ops)	sm_80	`nvvm.mbarrier.arrive.expect_tx.shared`	mbarrier Ops
Cluster — thread-block cluster sync	9	sm_90	`nvvm.cluster.wait`, `nvvm.mapa`	Cluster Ops
Synchronisation — `barrier0`, `barrier.cta.sync`, `bar.warp.sync`, `barrier.{arrive,sync}` helpers	8	sm_70	`nvvm.barrier.cta.sync`	(this page)
`cp.async` (Ampere SM80 async-copy queue)	6 dialect ops (vector width `{4,8,16}` and `.ca`/`.cg` cache modifier are attributes on `nvvm.cp.async.shared.global`)	sm_80	`nvvm.cp.async.shared.global`	(this page)
Special registers — `tid`, `ctaid`, `ntid`, etc.	7	sm_70	`nvvm.read.ptx.sreg.tid.x`	(this page)
shfl / vote / elect.sync	5	sm_70	`nvvm.shfl.sync`	(this page)
Other (`mapa`, fences, ldmatrix/stmatrix, redux, prefetch)	8	varies	`nvvm.ldmatrix`	(this page)

The family page is the normative spec: it pins each op to its operand list, LLVM intrinsic, PTX template, constraint string for inline-asm variants, and SM floor. The roster table below covers the smaller families that don't justify their own page.

Roster — Small Families

Synchronisation

Op	LLVM intrinsic	PTX printed
`nvvm.barrier0`	`llvm.nvvm.barrier0`	`bar.sync 0;`
`nvvm.bar.warp.sync`	`llvm.nvvm.bar.warp.sync`	`bar.warp.sync %m;`
`nvvm.barrier`	`llvm.nvvm.barrier`	`barrier.cta.sync.aligned %b, %n;`
`nvvm.barrier.cta.sync`	`llvm.nvvm.barrier.cta.sync`	`barrier.cta.sync %b, %n;`
`nvvm.barrier.cta.arrive`	`llvm.nvvm.barrier.cta.arrive`	`barrier.cta.arrive %b, %n;`
`nvvm.barrier.cta.red`	`llvm.nvvm.barrier.cta.red`	`barrier.cta.red.{op} %p, %b, %n, %src;`
`nvvm.barrier.arrive`	`llvm.nvvm.barrier.arrive`	`bar.arrive %b, %n;`
`nvvm.elect.sync`	`llvm.nvvm.elect.sync`	`elect.sync %p

Special-register reads

Op	LLVM intrinsic	PTX printed
`nvvm.read.ptx.sreg.tid.x` (`.y`, `.z`)	`llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`	`mov.u32 %r, %tid.{x,y,z};`
`nvvm.read.ptx.sreg.ntid.x` (`.y`, `.z`)	`llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`	`mov.u32 %r, %ntid.{x,y,z};`
`nvvm.read.ptx.sreg.ctaid.x` (`.y`, `.z`)	`llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`	`mov.u32 %r, %ctaid.{x,y,z};`
`nvvm.read.ptx.sreg.nctaid.x` (`.y`, `.z`)	`llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`	`mov.u32 %r, %nctaid.{x,y,z};`
`nvvm.read.ptx.sreg.warpid`	`llvm.nvvm.read.ptx.sreg.warpid`	`mov.u32 %r, %warpid;`
`nvvm.read.ptx.sreg.laneid`	`llvm.nvvm.read.ptx.sreg.laneid`	`mov.u32 %r, %laneid;`
`nvvm.read.ptx.sreg.smid`	`llvm.nvvm.read.ptx.sreg.smid`	`mov.u32 %r, %smid;`

`cp.async` (Ampere)

Op	LLVM intrinsic	PTX printed
`nvvm.cp.async.shared.global`	`llvm.nvvm.cp.async.{ca,cg}.shared.global.{4,8,16}`	`cp.async.{ca,cg}.shared.global [%dst], [%src], N;`
`nvvm.cp.async.commit.group`	`llvm.nvvm.cp.async.commit.group`	`cp.async.commit_group;`
`nvvm.cp.async.wait.group`	`llvm.nvvm.cp.async.wait.group`	`cp.async.wait_group N;`
`nvvm.cp.async.bulk.wait_group`	`llvm.nvvm.cp.async.bulk.wait_group`	`cp.async.bulk.wait_group N;`
`nvvm.cp.async.mbarrier.arrive[.shared]`	`llvm.nvvm.cp.async.mbarrier.arrive[.shared]`	`cp.async.mbarrier.arrive[.shared].b64 [%mbar];`
`nvvm.cp.async.mbarrier.arrive.noinc[.shared]`	`llvm.nvvm.cp.async.mbarrier.arrive.noinc[.shared]`	`cp.async.mbarrier.arrive.noinc[.shared].b64 [%mbar];`

shfl / vote

Op	LLVM intrinsic	PTX printed
`nvvm.shfl.sync`	`llvm.nvvm.shfl.sync.{idx,up,down,bfly}.{i32,f32}`	`shfl.sync.{idx,up,down,bfly}.b32 %r, %v, %lane, %m, %mask;`
`nvvm.vote.sync` (kind `ballot`)	`llvm.nvvm.vote.ballot.sync`	`vote.sync.ballot.b32 %r, %p, %mask;`
`nvvm.vote.sync` (kinds `all`/`any`/`uni`, selected by `nvvm.vote_sync_kind`)	`llvm.nvvm.vote.{all,any,uni}.sync`	`vote.sync.{all,any,uni}.pred %p, %src, %mask;`
`nvvm.match.sync`	`llvm.nvvm.match.{any,all}.sync.{i32,i64}`	`match.{any,all}.sync.b{32,64} %r, %v, %mask;`
`nvvm.redux.sync`	`llvm.nvvm.redux.sync.{op}.{type}`	`redux.sync.{op}.{type} %r, %v, %mask;`

ldmatrix / stmatrix and miscellaneous

Op	LLVM intrinsic	PTX printed
`nvvm.ldmatrix`	`llvm.nvvm.ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.{b16,b8x16,...}`	`ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.{b16,b8x16,...} {...}, [%addr];`
`nvvm.stmatrix`	`llvm.nvvm.stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.{b16,b8x16}`	`stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.{b16,b8x16} [%addr], {...};`
`nvvm.prefetch.tensormap`	`llvm.nvvm.prefetch.tensormap`	`prefetch.tensormap [%tmap];`
`nvvm.fence.proxy.acquire`	`llvm.nvvm.fence.proxy.acquire`	`fence.proxy.async.shared::cluster;`
`nvvm.fence.mbarrier.init`	`llvm.nvvm.fence.mbarrier.init`	`fence.mbarrier_init.release.cluster;`
`nvvm.cvt.packfloat.f32`	`llvm.nvvm.cvt.{rn,rz,rm,rp}.{f16x2,bf16x2,e4m3x2,e5m2x2}.f32`	`cvt.{rnd}.{f16,bf16,e4m3,e5m2}x2.f32 %r, %fhi, %flo;`
`nvvm.mma.sync` (Ampere/Ada dense)	`llvm.nvvm.mma.m{8,16}n{8,16}k{...}.row.col.{...}`	`mma.sync.aligned.m16n8kK.{row,col}.{row,col}.{...} {...}, %a, %b, %c;`
`nvvm.mma.block_scale`	`llvm.nvvm.mma.block_scale.m16n8k.{kind}`	`mma.sync.aligned.m16n8k.{kind}.scale::vec::{16,32} {...}, %a, %b, %c, %sa, %sb;`

Inline-PTX Templates and Constraint Strings

A handful of ops bypass call @llvm.nvvm.X and lower to llvm.inline_asm with a fixed PTX template plus a verbatim constraint string. The backend rejects the asm node unless template and constraint match the operand list exactly; reimplementers must reproduce both byte-for-byte.

The constraint codes used in this dialect:

Code	Meaning
`r`	32-bit integer register (`i32` / `f32` / `i16` / `i8`)
`l`	64-bit integer register (`i64`, including pointer-typed operands)
`f`	32-bit floating-point register (`f32`)
`h`	16-bit integer register (`i16` / `f16` / `bf16`)
`n`	compile-time integer immediate
`=r` / `=l` / `=f` / `=h`	output-only register of the matching width

Sparse MMA

template:    "mma.sp.sync.aligned.m{M}n{N}k{K}.row.col.{aType}.{bType}.{cType}.{dType}
                 { %0, %1, %2, %3 },          // D (output)
                 { %4, %5, %6, %7 },          // A (sparse halved)
                 { %8, %9, %10, %11, %12, %13, %14, %15 },  // B
                 { %16, %17, %18, %19 },      // C
                 %20, 0x{selector};"          // sparse metadata, selector immediate
constraint:  "=r,=r,=r,=r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r"

The first four =r slots are the output D fragment; the trailing r slots are the input fragments and the metadata word. The selector immediate is baked into the template literal at lowering time rather than passed as an operand; the same op emits 0x0 or 0x1 depending on the sparsitySelector attribute.

For shape m16n8k16.row.col.f16.f16.f16.f16 the constraint string above expands to four =r outputs (D fragment) and seventeen r inputs (A=4, B=8, C=4, plus the sparse-metadata word), matching the template's %0..%20 slot range. For m16n8k32.row.col.s32.s8.s8.s32 the printed PTX is {$0..$3}, {$4,$5}, {$6,$7}, {$8..$11}, $12 — four s32 outputs and nine inputs (A=2, B=2, C=4, metadata=1). The verifier rejects any combination not listed in the PTX ISA.

im2col TMA store with L2 cache hint

template:    "cp.async.bulk.tensor.{N}d.global.shared::cta.im2col.bulk_group.L2::cache_hint
                 [%0, { %1, %2, ..., %{N} }],
                 [%{N+1}],
                 %{N+2};"
constraint:  "l,r,r,r,r,r,l,l"      // N=5 example

Operand 0 is the i64 descriptor pointer; the next N operands (one per rank) are 32-bit coordinates; the SMEM source pointer is l; the cache hint is l. Rank-3 and rank-4 forms drop coordinate operands and shrink the constraint string accordingly.

tcgen05.cp

template:    "tcgen05.cp.{shape}.{multicast}.{src_fmt} [%0], [%1];"
constraint:  "r,r"

The two r operands are the destination and source TMEM column indices. The shape, multicast, and src_fmt tokens are baked into the template literal at pattern-build time.

stmatrix fallback (pre-sm_90)

When nvvm.stmatrix.sync.aligned is targeted at a pre-sm_90 SM that exposes ldmatrix but not stmatrix directly, the op lowers through llvm.inline_asm:

template:    "stmatrix.sync.aligned.m8n8.x{num}{.trans,}.shared::cta.b16
                 [%0], { %1, %2, ..., %{num} };"
constraint:  "l,r,r,...,r"          // one l for addr, num× r for fragment regs

l is the ptr addrspace(3) destination; the trailing r slots are the fragment registers.

WGMMA scale-D selector (when the immediate form is rejected)

Most wgmma.mma_async.sync.aligned variants reach PTX through the LLVM intrinsic, which carries scale_d as a compile-time argument. The few ops that drop to inline-asm use:

template:    "wgmma.mma_async.sync.aligned.m64n{N}k{K}.{accT}.{aT}.{bT}
                 { %0, %1, ..., %{accW-1} },
                 %da, %db, %p,
                 1, 1, %la, %lb;"
constraint:  "=f,=f,...,=f,l,l,n,n,n"

Each output accumulator register is =f (for f32 accumulator types) or =h (f16). The two descriptor inputs are l. The %p predicate and the two trailing n slots are compile-time immediates. The =r slot used in some upstream snapshots for the scale-D return value does not appear on this constraint string because the immediate form is the only one tileiras emits.

Per-Arch Availability

Registration is uniform across targets; the gate lives in the verifier and the backend. The table is the practical "what runs where" view. ptx_min is the lowest PTX ISA version the final printed instruction requires.

Family	SM floor	SM ceiling (observed)	`ptx_min`	Notes
Synchronisation	sm_70	unbounded	6.0 / 7.0	`aligned` forms require 7.0
Special registers	sm_70	unbounded	6.0	always legal
shfl / vote	sm_70	unbounded	6.0	only the `.sync` forms are emitted
cp.async (Ampere)	sm_80	unbounded	7.0	Ampere async-copy queue
mbarrier	sm_80 (base), sm_90 (`.expect_tx`)	unbounded	7.0 / 7.8	shared-memory variant on Ampere; cluster-aware extensions on Hopper
WMMA	sm_70	sm_89 (Hopper redirects through WGMMA)	6.0	the only MMA path on Turing/Ampere
WGMMA	sm_90a	sm_90a (no Blackwell WGMMA)	8.0	architecture-qualified; plain `sm_90` is rejected
TMA	sm_90	unbounded	8.0 / 8.3	descriptor lives in global memory
Cluster	sm_90	unbounded	8.0	requires `barrier.cluster.*` PTX
ldmatrix / stmatrix	sm_75 (`ldmatrix`), sm_90 (`stmatrix`)	unbounded	6.5 / 8.0	width-4 `.trans` form requires 7.8
tcgen05	sm_100a	sm_100a (+ sm_100f for `f`-suffixed copy variants)	8.6	Blackwell tensor-memory family
Block-scaled MMA (`mma.block_scale`)	sm_100a	sm_100a	8.6	the only sm_100 form in the legacy `nvvm.mma` namespace
redux / barrier-id helpers	sm_80 (`redux.sync`) / sm_70 (`bar.{arrive,sync}`)	unbounded	7.0 / 6.0	`redux.sync` requires Ampere

Lowering Contract

NVVM-to-LLVM conversion is deliberately mechanical. Each nvvm.X op has a single registered lowering: a direct call @llvm.nvvm.X intrinsic when LLVM exposes a matching intrinsic, or an llvm.inline_asm with a hard-coded PTX template otherwise. A third path expands into ordinary llvm dialect ops for the few cases that aren't a single instruction (e.g. nvvm.shfl.sync synthesised broadcast loops).

The choice is fixed per op at registration time. The conversion driver walks each nvvm.* op, looks up the op's OperationName in the dispatch map, and invokes the matching rewrite:

LogicalResult lower_nvvm_op(Operation *op) {
    const NvvmLowering *entry = lookup_by_operation_name(op->getName());
    require(entry != NULL);

    switch (entry->kind) {
        case NVVM_DIRECT_INTRINSIC:
            return replace_with_llvm_intrinsic_call(op, entry->intrinsic_id);
        case NVVM_INLINE_ASM:
            return replace_with_inline_asm(op, entry->ptx_template, entry->constraints);
        case NVVM_LLVM_EXPANSION:
            return entry->custom_expand(op);
    }
}

The dispatch map is built once at dialect-load time from the TableGen records: each record declares dialect="nvvm", an op mnemonic, an LLVM intrinsic ID (or an inline-asm template + constraint string), and the kind. Lowering reads each field straight out of the entry. No layout, scheduling, or pipeline policy is reinferred here — earlier dialects must already have committed to the target operation.

After the sweep, no nvvm.* op survives. The verifier check that follows the sweep treats any remaining nvvm.* op as a missing pattern, not as a default-illegal op.

Verifier Invariants

The verifier rejects anything that cannot be legally translated to the selected target:

intrinsic operand counts and result counts match the selected intrinsic;
pointer address spaces are explicit and legal for the operation;
memory scopes and memory-ordering attributes are compatible;
MMA and WGMMA shapes are supported by the target;
sparse and block-scaled MMA forms carry the required metadata operands;
TMA and async-copy operands have valid descriptor, barrier, and memory-space types;
special-register reads are valid for the target and execution model;
inline-PTX operations have complete constraint strings and result types;
operations requiring a newer SM generation are not emitted for an older target.

This is the last MLIR-level diagnostic point before LLVM IR and the machine backend. A good error here names the semantic mismatch, not just the intrinsic.

Target Attributes

nvvm carries the target attributes that make the LLVM handoff meaningful: architecture (nvvm.target = "sm_90a"), PTX version, feature flags (+ptx80, +tmem, ...), kernel markers, launch bounds, cluster dimensions, and assorted function- and module-level properties. Earlier passes set these through gpu.module and conversion interfaces; by the time the LLVM module materialises, the NVPTX backend has to recover a concrete subtarget from them.

The attributes are plain string / integer / array attributes attached to the gpu.module or func.func parents — the NVPTX backend reads them from the LLVM module's metadata after MLIR-to-LLVM translation. Missing or contradictory attributes here are silent disasters: the backend still receives syntactically valid LLVM IR, but generates code for the wrong target contract. The verifier rejects the obvious cases (no sm, no ptx), and the NVPTX subtarget feature matrix lists which features each SM accepts.

Keyboard shortcuts

Tileiras Internals