Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM Dialect Overview

Provenance vs Upstream MLIR

nvvm is the upstream MLIR dialect, linked unchanged from the LLVM/MLIR snapshot tileiras tracks (the same dialect described in mlir/Dialect/LLVMIR/NVVMOps.td). Tileiras adds no nvvm.* op of its own — every op listed below comes from upstream. What the binary does override is usage: the inline-PTX templates, sparse-MMA path, and a few tcgen05 lowerings emit forms not yet exposed as upstream NVVM intrinsics in this snapshot. Those gaps are called out per family below.

Abstract

Every nvvm.* op exists to print one PTX instruction (or one inline-asm template). nvvm is the bottom MLIR dialect in TileIR's lowering stack — a typed intrinsic layer, not a programming model. Earlier dialects decide tiling, scheduling, pipeline stages, layouts, and target atoms; nvvm preserves those decisions in a form LLVM and the NVPTX backend understand.

Three lowering paths cover the whole dialect. Most ops become a call @llvm.nvvm.X intrinsic that the NVPTX backend prints as the matching PTX instruction. A smaller set lowers to llvm.inline_asm with a fixed PTX template — sparse MMA, a handful of TMA replace variants, a few cluster ops. The third path expands into ordinary llvm dialect ops (alloca, GEP, store, call). No nvvm.* op survives NVVM-to-LLVM conversion.

Position in the Cascade

nvgpu
    |
    | convert GPU operations to NVVM operations and LLVM helper IR
    v
nvvm
    |
    | convert NVVM operations to LLVM intrinsics or inline assembly
    v
llvm
    |
    | optimize, verify, select instructions, print PTX
    v
PTX

nvgpu is the last MLIR layer that still looks like a GPU dialect. nvvm looks more like LLVM IR: pointer types, vector types, memory-order attributes, target attributes, and intrinsic operand shapes have to be explicit by the time IR arrives. Most verifier failures here are best read as "the previous lowering didn't finish specifying the target operation." See Lowering: nvgpu / gpu to NVVM for the per-op rewrite contract.

Per-Family Pages

Rather than a single op count, the dialect's footprint is best stated as a per-category breakdown:

  • 128 ops (one per mlir::NVVM::*Op class, as measured by distinct mlir::NVVM::detail::*OpGenericAdaptorBase template instantiations in the string pool — each TableGen-generated Op class instantiates exactly one OpGenericAdaptorBase to project its operand and attribute layout, so this count is the authoritative Op-class count);
  • 64 attrs (one per mlir::NVVM::*Attr class — AtomicOpKindAttr, CacheEvictionPriorityAttr, ConvertFP{4,6,8}TypeAttr, CpAsyncBulkTensorLoadModeAttr, FPRoundingModeAttr, LoadCacheModifierKindAttr, MBarrierScopeKindAttr, MMAB1OpAttr, MMALayoutAttr, MMAShapeAttr, MMATypesAttr, MemScopeKindAttr, Tcgen05*Attr × 9, WGMMA*Attr × 3, and the rest — these supply the enum-attribute mnemonics most ops carry);
  • 0 user-visible Type classes (the seven mlir::NVVM::*Type strings in the binary — ConvertFP{4,6,8}Type, DotAccumulateType, MMAType, ReductionType, WGMMAType — name enum-attribute kinds, not MLIR Type classes; nvvm does not introduce new MLIR Type subclasses);
  • ~70 enum-mnemonic strings (the enum-attribute kind names the parser accepts and the printer emits, e.g. f16, bf16, e4m3, e5m2, row, col, cta, cluster, acquire, release — these are the values inside the Attr classes above, not separate ops);
  • 19 module- and function-metadata keys (the single-segment "nvvm.X" strings the NVPTX backend reads as LLVM module/function metadata after MLIR-to-LLVM translation: nvvm.kernel, nvvm.target, nvvm.annotations, nvvm.annotations_transplanted, nvvm.reqntid, nvvm.maxntid, nvvm.minctasm, nvvm.maxnreg, nvvm.maxclusterrank, nvvm.cluster_dim, nvvm.cluster_max_blocks, nvvm.blocksareclusters, nvvm.grid_constant, nvvm.hidden, nvvm.reflection, nvvm.restrict_keyword, nvvm.restrict_processed, nvvm.restrict_scope, nvvm.exit).

The PTX form count is several times larger than the Op-class count because most ops carry attribute-driven cross-products: shape × layout × element type for WMMA, rank × mode for TMA, kind × cta_group × collector for tcgen05, generic vs .shared splits for mbarrier, and so on. They split cleanly into eight large families plus a long tail of small ones. The bulk of each family is documented on its own page; this overview lists the families, their roster sizes, the SM floor, and one example op so the cross-link table doubles as an index. (WMMA is three ops in MLIR; the PTX shape × layout × element-type cross-product is reached through attributes on nvvm.wmma.{load,store,mma} rather than per-combination ops.)

Diagnosing the prior oscillation. Earlier waves had quoted three different NVVM op counts: 213/218, 124, and 86. None were wrong about what they measured; each measured a different thing. The 213-vs-218 pair counts every op-name-shaped string in the binary, including the LLVM intrinsic names (llvm.nvvm.barrier0, llvm.nvvm.cp.async.bulk.tensor.*, the full llvm.nvvm.tcgen05.* family) that NVVM-to-LLVM lowering emits — these aren't nvvm.* ops, they're the lowered IR that follows. The 1704-byte / 8-stride TypeID slab 0x5B8D610..0x5B8DCB8 documented in Op Mnemonic Master Table — §8 NVVM.* reaches 213 because the slab includes one entry per RegisteredOperationName::insert call, and a handful of mnemonics (the nvvm.read.ptx.sreg.envregN series for N=0..31, the per-axis splits of cluster registers) register separate slots even though they share an Op class. The 124 count was Op-classes-minus-an-internal-subset; the 86 count was a broken single-xref heuristic that walked one cross-reference table and missed the families reached through TypeID dispatch. The 128 above is the only number that survives all three cross-checks: mlir::NVVM::*OpGenericAdaptorBase strings (template-side), distinct mlir::NVVM::*Op class names (RTTI-side), and unique Op classes registered through sub_4461CA0 from the NVVM dialect constructor.

FamilyCountSM floorExample opPage
WMMA — warp-synchronous register MMA3sm_70nvvm.wmma.mmaWMMA Ops
WGMMA — warp-group async MMA (Hopper)4sm_90anvvm.wgmma.mma_asyncWGMMA Ops
TMA — bulk tensor copy, prefetch, reduce9 dialect ops (rank 1..5 and mode in attributes)sm_90nvvm.cp.async.bulk.tensor.shared.cluster.globalTMA Ops
tcgen05 — Blackwell tensor memory + MMA15 dialect ops (kind / cta_group / collector / layout / sparsity / block-scale in attributes; the cross-product reaches several thousand PTX forms)sm_100anvvm.tcgen05.mma.block_scaletcgen05 Ops
mbarrier — shared-memory barrier state machine12 dialect ops (generic vs .shared address-space split adds the second variant on most ops)sm_80nvvm.mbarrier.arrive.expect_tx.sharedmbarrier Ops
Cluster — thread-block cluster sync9sm_90nvvm.cluster.wait, nvvm.mapaCluster Ops
Synchronisation — barrier0, barrier.cta.sync, bar.warp.sync, barrier.{arrive,sync} helpers8sm_70nvvm.barrier.cta.sync(this page)
cp.async (Ampere SM80 async-copy queue)6 dialect ops (vector width {4,8,16} and .ca/.cg cache modifier are attributes on nvvm.cp.async.shared.global)sm_80nvvm.cp.async.shared.global(this page)
Special registers — tid, ctaid, ntid, etc.7sm_70nvvm.read.ptx.sreg.tid.x(this page)
shfl / vote / elect.sync5sm_70nvvm.shfl.sync(this page)
Other (mapa, fences, ldmatrix/stmatrix, redux, prefetch)8variesnvvm.ldmatrix(this page)

The family page is the normative spec: it pins each op to its operand list, LLVM intrinsic, PTX template, constraint string for inline-asm variants, and SM floor. The roster table below covers the smaller families that don't justify their own page.

Roster — Small Families

Synchronisation

OpLLVM intrinsicPTX printed
nvvm.barrier0llvm.nvvm.barrier0bar.sync 0;
nvvm.bar.warp.syncllvm.nvvm.bar.warp.syncbar.warp.sync %m;
nvvm.barrierllvm.nvvm.barrierbarrier.cta.sync.aligned %b, %n;
nvvm.barrier.cta.syncllvm.nvvm.barrier.cta.syncbarrier.cta.sync %b, %n;
nvvm.barrier.cta.arrivellvm.nvvm.barrier.cta.arrivebarrier.cta.arrive %b, %n;
nvvm.barrier.cta.redllvm.nvvm.barrier.cta.redbarrier.cta.red.{op} %p, %b, %n, %src;
nvvm.barrier.arrivellvm.nvvm.barrier.arrivebar.arrive %b, %n;
nvvm.elect.syncllvm.nvvm.elect.sync`elect.sync %p

Special-register reads

OpLLVM intrinsicPTX printed
nvvm.read.ptx.sreg.tid.x (.y, .z)llvm.nvvm.read.ptx.sreg.tid.{x,y,z}mov.u32 %r, %tid.{x,y,z};
nvvm.read.ptx.sreg.ntid.x (.y, .z)llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}mov.u32 %r, %ntid.{x,y,z};
nvvm.read.ptx.sreg.ctaid.x (.y, .z)llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}mov.u32 %r, %ctaid.{x,y,z};
nvvm.read.ptx.sreg.nctaid.x (.y, .z)llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}mov.u32 %r, %nctaid.{x,y,z};
nvvm.read.ptx.sreg.warpidllvm.nvvm.read.ptx.sreg.warpidmov.u32 %r, %warpid;
nvvm.read.ptx.sreg.laneidllvm.nvvm.read.ptx.sreg.laneidmov.u32 %r, %laneid;
nvvm.read.ptx.sreg.smidllvm.nvvm.read.ptx.sreg.smidmov.u32 %r, %smid;

cp.async (Ampere)

OpLLVM intrinsicPTX printed
nvvm.cp.async.shared.globalllvm.nvvm.cp.async.{ca,cg}.shared.global.{4,8,16}cp.async.{ca,cg}.shared.global [%dst], [%src], N;
nvvm.cp.async.commit.groupllvm.nvvm.cp.async.commit.groupcp.async.commit_group;
nvvm.cp.async.wait.groupllvm.nvvm.cp.async.wait.groupcp.async.wait_group N;
nvvm.cp.async.bulk.wait_groupllvm.nvvm.cp.async.bulk.wait_groupcp.async.bulk.wait_group N;
nvvm.cp.async.mbarrier.arrive[.shared]llvm.nvvm.cp.async.mbarrier.arrive[.shared]cp.async.mbarrier.arrive[.shared].b64 [%mbar];
nvvm.cp.async.mbarrier.arrive.noinc[.shared]llvm.nvvm.cp.async.mbarrier.arrive.noinc[.shared]cp.async.mbarrier.arrive.noinc[.shared].b64 [%mbar];

shfl / vote

OpLLVM intrinsicPTX printed
nvvm.shfl.syncllvm.nvvm.shfl.sync.{idx,up,down,bfly}.{i32,f32}shfl.sync.{idx,up,down,bfly}.b32 %r, %v, %lane, %m, %mask;
nvvm.vote.sync (kind ballot)llvm.nvvm.vote.ballot.syncvote.sync.ballot.b32 %r, %p, %mask;
nvvm.vote.sync (kinds all/any/uni, selected by nvvm.vote_sync_kind)llvm.nvvm.vote.{all,any,uni}.syncvote.sync.{all,any,uni}.pred %p, %src, %mask;
nvvm.match.syncllvm.nvvm.match.{any,all}.sync.{i32,i64}match.{any,all}.sync.b{32,64} %r, %v, %mask;
nvvm.redux.syncllvm.nvvm.redux.sync.{op}.{type}redux.sync.{op}.{type} %r, %v, %mask;

ldmatrix / stmatrix and miscellaneous

OpLLVM intrinsicPTX printed
nvvm.ldmatrixllvm.nvvm.ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.{b16,b8x16,...}ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.{b16,b8x16,...} {...}, [%addr];
nvvm.stmatrixllvm.nvvm.stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.{b16,b8x16}stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.{b16,b8x16} [%addr], {...};
nvvm.prefetch.tensormapllvm.nvvm.prefetch.tensormapprefetch.tensormap [%tmap];
nvvm.fence.proxy.acquirellvm.nvvm.fence.proxy.acquirefence.proxy.async.shared::cluster;
nvvm.fence.mbarrier.initllvm.nvvm.fence.mbarrier.initfence.mbarrier_init.release.cluster;
nvvm.cvt.packfloat.f32llvm.nvvm.cvt.{rn,rz,rm,rp}.{f16x2,bf16x2,e4m3x2,e5m2x2}.f32cvt.{rnd}.{f16,bf16,e4m3,e5m2}x2.f32 %r, %fhi, %flo;
nvvm.mma.sync (Ampere/Ada dense)llvm.nvvm.mma.m{8,16}n{8,16}k{...}.row.col.{...}mma.sync.aligned.m16n8kK.{row,col}.{row,col}.{...} {...}, %a, %b, %c;
nvvm.mma.block_scalellvm.nvvm.mma.block_scale.m16n8k.{kind}mma.sync.aligned.m16n8k.{kind}.scale::vec::{16,32} {...}, %a, %b, %c, %sa, %sb;

Inline-PTX Templates and Constraint Strings

A handful of ops bypass call @llvm.nvvm.X and lower to llvm.inline_asm with a fixed PTX template plus a verbatim constraint string. The backend rejects the asm node unless template and constraint match the operand list exactly; reimplementers must reproduce both byte-for-byte.

The constraint codes used in this dialect:

CodeMeaning
r32-bit integer register (i32 / f32 / i16 / i8)
l64-bit integer register (i64, including pointer-typed operands)
f32-bit floating-point register (f32)
h16-bit integer register (i16 / f16 / bf16)
ncompile-time integer immediate
=r / =l / =f / =houtput-only register of the matching width

Sparse MMA

template:    "mma.sp.sync.aligned.m{M}n{N}k{K}.row.col.{aType}.{bType}.{cType}.{dType}
                 { %0, %1, %2, %3 },          // D (output)
                 { %4, %5, %6, %7 },          // A (sparse halved)
                 { %8, %9, %10, %11, %12, %13, %14, %15 },  // B
                 { %16, %17, %18, %19 },      // C
                 %20, 0x{selector};"          // sparse metadata, selector immediate
constraint:  "=r,=r,=r,=r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r,r"

The first four =r slots are the output D fragment; the trailing r slots are the input fragments and the metadata word. The selector immediate is baked into the template literal at lowering time rather than passed as an operand; the same op emits 0x0 or 0x1 depending on the sparsitySelector attribute.

For shape m16n8k16.row.col.f16.f16.f16.f16 the constraint string above expands to four =r outputs (D fragment) and seventeen r inputs (A=4, B=8, C=4, plus the sparse-metadata word), matching the template's %0..%20 slot range. For m16n8k32.row.col.s32.s8.s8.s32 the printed PTX is {$0..$3}, {$4,$5}, {$6,$7}, {$8..$11}, $12 — four s32 outputs and nine inputs (A=2, B=2, C=4, metadata=1). The verifier rejects any combination not listed in the PTX ISA.

im2col TMA store with L2 cache hint

template:    "cp.async.bulk.tensor.{N}d.global.shared::cta.im2col.bulk_group.L2::cache_hint
                 [%0, { %1, %2, ..., %{N} }],
                 [%{N+1}],
                 %{N+2};"
constraint:  "l,r,r,r,r,r,l,l"      // N=5 example

Operand 0 is the i64 descriptor pointer; the next N operands (one per rank) are 32-bit coordinates; the SMEM source pointer is l; the cache hint is l. Rank-3 and rank-4 forms drop coordinate operands and shrink the constraint string accordingly.

tcgen05.cp

template:    "tcgen05.cp.{shape}.{multicast}.{src_fmt} [%0], [%1];"
constraint:  "r,r"

The two r operands are the destination and source TMEM column indices. The shape, multicast, and src_fmt tokens are baked into the template literal at pattern-build time.

stmatrix fallback (pre-sm_90)

When nvvm.stmatrix.sync.aligned is targeted at a pre-sm_90 SM that exposes ldmatrix but not stmatrix directly, the op lowers through llvm.inline_asm:

template:    "stmatrix.sync.aligned.m8n8.x{num}{.trans,}.shared::cta.b16
                 [%0], { %1, %2, ..., %{num} };"
constraint:  "l,r,r,...,r"          // one l for addr, num× r for fragment regs

l is the ptr addrspace(3) destination; the trailing r slots are the fragment registers.

WGMMA scale-D selector (when the immediate form is rejected)

Most wgmma.mma_async.sync.aligned variants reach PTX through the LLVM intrinsic, which carries scale_d as a compile-time argument. The few ops that drop to inline-asm use:

template:    "wgmma.mma_async.sync.aligned.m64n{N}k{K}.{accT}.{aT}.{bT}
                 { %0, %1, ..., %{accW-1} },
                 %da, %db, %p,
                 1, 1, %la, %lb;"
constraint:  "=f,=f,...,=f,l,l,n,n,n"

Each output accumulator register is =f (for f32 accumulator types) or =h (f16). The two descriptor inputs are l. The %p predicate and the two trailing n slots are compile-time immediates. The =r slot used in some upstream snapshots for the scale-D return value does not appear on this constraint string because the immediate form is the only one tileiras emits.

Per-Arch Availability

Registration is uniform across targets; the gate lives in the verifier and the backend. The table is the practical "what runs where" view. ptx_min is the lowest PTX ISA version the final printed instruction requires.

FamilySM floorSM ceiling (observed)ptx_minNotes
Synchronisationsm_70unbounded6.0 / 7.0aligned forms require 7.0
Special registerssm_70unbounded6.0always legal
shfl / votesm_70unbounded6.0only the .sync forms are emitted
cp.async (Ampere)sm_80unbounded7.0Ampere async-copy queue
mbarriersm_80 (base), sm_90 (.expect_tx)unbounded7.0 / 7.8shared-memory variant on Ampere; cluster-aware extensions on Hopper
WMMAsm_70sm_89 (Hopper redirects through WGMMA)6.0the only MMA path on Turing/Ampere
WGMMAsm_90asm_90a (no Blackwell WGMMA)8.0architecture-qualified; plain sm_90 is rejected
TMAsm_90unbounded8.0 / 8.3descriptor lives in global memory
Clustersm_90unbounded8.0requires barrier.cluster.* PTX
ldmatrix / stmatrixsm_75 (ldmatrix), sm_90 (stmatrix)unbounded6.5 / 8.0width-4 .trans form requires 7.8
tcgen05sm_100asm_100a (+ sm_100f for f-suffixed copy variants)8.6Blackwell tensor-memory family
Block-scaled MMA (mma.block_scale)sm_100asm_100a8.6the only sm_100 form in the legacy nvvm.mma namespace
redux / barrier-id helperssm_80 (redux.sync) / sm_70 (bar.{arrive,sync})unbounded7.0 / 6.0redux.sync requires Ampere

Lowering Contract

NVVM-to-LLVM conversion is deliberately mechanical. Each nvvm.X op has a single registered lowering: a direct call @llvm.nvvm.X intrinsic when LLVM exposes a matching intrinsic, or an llvm.inline_asm with a hard-coded PTX template otherwise. A third path expands into ordinary llvm dialect ops for the few cases that aren't a single instruction (e.g. nvvm.shfl.sync synthesised broadcast loops).

The choice is fixed per op at registration time. The conversion driver walks each nvvm.* op, looks up the op's OperationName in the dispatch map, and invokes the matching rewrite:

LogicalResult lower_nvvm_op(Operation *op) {
    const NvvmLowering *entry = lookup_by_operation_name(op->getName());
    require(entry != NULL);

    switch (entry->kind) {
        case NVVM_DIRECT_INTRINSIC:
            return replace_with_llvm_intrinsic_call(op, entry->intrinsic_id);
        case NVVM_INLINE_ASM:
            return replace_with_inline_asm(op, entry->ptx_template, entry->constraints);
        case NVVM_LLVM_EXPANSION:
            return entry->custom_expand(op);
    }
}

The dispatch map is built once at dialect-load time from the TableGen records: each record declares dialect="nvvm", an op mnemonic, an LLVM intrinsic ID (or an inline-asm template + constraint string), and the kind. Lowering reads each field straight out of the entry. No layout, scheduling, or pipeline policy is reinferred here — earlier dialects must already have committed to the target operation.

After the sweep, no nvvm.* op survives. The verifier check that follows the sweep treats any remaining nvvm.* op as a missing pattern, not as a default-illegal op.

Verifier Invariants

The verifier rejects anything that cannot be legally translated to the selected target:

  • intrinsic operand counts and result counts match the selected intrinsic;
  • pointer address spaces are explicit and legal for the operation;
  • memory scopes and memory-ordering attributes are compatible;
  • MMA and WGMMA shapes are supported by the target;
  • sparse and block-scaled MMA forms carry the required metadata operands;
  • TMA and async-copy operands have valid descriptor, barrier, and memory-space types;
  • special-register reads are valid for the target and execution model;
  • inline-PTX operations have complete constraint strings and result types;
  • operations requiring a newer SM generation are not emitted for an older target.

This is the last MLIR-level diagnostic point before LLVM IR and the machine backend. A good error here names the semantic mismatch, not just the intrinsic.

Target Attributes

nvvm carries the target attributes that make the LLVM handoff meaningful: architecture (nvvm.target = "sm_90a"), PTX version, feature flags (+ptx80, +tmem, ...), kernel markers, launch bounds, cluster dimensions, and assorted function- and module-level properties. Earlier passes set these through gpu.module and conversion interfaces; by the time the LLVM module materialises, the NVPTX backend has to recover a concrete subtarget from them.

The attributes are plain string / integer / array attributes attached to the gpu.module or func.func parents — the NVPTX backend reads them from the LLVM module's metadata after MLIR-to-LLVM translation. Missing or contradictory attributes here are silent disasters: the backend still receives syntactically valid LLVM IR, but generates code for the wrong target contract. The verifier rejects the obvious cases (no sm, no ptx), and the NVPTX subtarget feature matrix lists which features each SM accepts.