NVVM WMMA Ops

Abstract

The nvvm.wmma.* family is the warp-synchronous matrix-multiply-accumulate path used on every NVIDIA target from sm_70 through sm_89. The dialect carries three MLIR ops — nvvm.wmma.load, nvvm.wmma.store, and nvvm.wmma.mma — each parameterised by attributes (shape, fragment role, layout, element types). The full PTX shape × layout × element-type cross-product is reached by attribute combinations on these three ops, not by enumerating dozens of ops.

Hopper (sm_90+) does not extend this family. Warp-group MMA on Hopper lives in nvvm.wgmma.*; Blackwell MMA lives in nvvm.tcgen05.*.

Op Layout

The dialect registers exactly three op classes — nvvm.wmma.load, nvvm.wmma.store, nvvm.wmma.mma — and that is the count visible in the binary's interned mnemonic strings. The attribute cross-product on those three ops expands to roughly 64 distinct LLVM-intrinsic / PTX-instruction targets at lowering time; the right-hand column counts intrinsic targets reachable through the op, not separate dialect ops.

| Op (dialect op) | Role | Attribute axes | Reachable LLVM intrinsics |
| --- | --- | --- | --- |
| nvvm.wmma.load | A / B / C fragment load | fragment ∈ {a,b,c} × shape × layout ∈ {row,col} × element type | ~36 |
| nvvm.wmma.store | D fragment store | shape × layout × element type ∈ {f16,f32,s32} | ~12 |
| nvvm.wmma.mma | tile MMA | shape × A-layout × B-layout × (aT,bT,cT,dT) × .popc/.and.popc for b1 | ~16 |

Tile shapes legal in PTX, by element type: m16n16k16, m8n32k16, and m32n8k16 for f16/bf16; m16n16k8 for tf32; m16n16k16, m8n32k16, and m32n8k16 for s8/u8; m8n8k128 for b1; m8n8k32 for s4/u4. The verifier rejects any attribute tuple not in this list.
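The shape check can be pictured as a lookup keyed by element type. The following is a minimal sketch, not the verifier's actual code: the names LEGAL_SHAPES, shape_name, and is_legal are hypothetical, and the tf32 entry assumes the PTX ISA's single m16n16k8 WMMA shape.

```python
# Hypothetical legality table: element type -> set of legal (m, n, k) tiles.
# Assumption: tf32 is limited to m16n16k8, following the PTX ISA shape table.
_K16_SHAPES = {(16, 16, 16), (8, 32, 16), (32, 8, 16)}

LEGAL_SHAPES = {
    "f16": _K16_SHAPES,
    "bf16": _K16_SHAPES,
    "tf32": {(16, 16, 8)},
    "s8": _K16_SHAPES,
    "u8": _K16_SHAPES,
    "s4": {(8, 8, 32)},
    "u4": {(8, 8, 32)},
    "b1": {(8, 8, 128)},
}


def shape_name(m, n, k):
    """Render a tile as the PTX-style token, e.g. m16n16k16."""
    return f"m{m}n{n}k{k}"


def is_legal(elem_type, m, n, k):
    """Return True if the (element type, shape) pair is in the table."""
    return (m, n, k) in LEGAL_SHAPES.get(elem_type, set())
```

An unknown element type such as f64 simply misses the table and is reported as illegal, which matches the "reject anything not listed" behaviour described above.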

Operand Tables

nvvm.wmma.load.{a,b,c}.sync.aligned.mXnYkZ.{row,col}.{T}

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | addr | ptr addrspace(3) | shared-memory tile origin |
| operand 1 | stride | i32 | row or column stride in elements |
| result 0 | frag | !llvm.struct<(T, T, ..., T)> | per-thread register fragment; cardinality fixed by shape and element type |

Each shape pins the fragment length: an m16n16k16.f16 A fragment is struct<(vector<2xf16>, vector<2xf16>, ..., vector<2xf16>)> of length 8; an m16n16k8.tf32 A fragment is struct<(i32, i32, i32, i32)>; an m16n16k16.s8 A fragment is struct<(i32, i32)>. The verifier rejects any other arity for the chosen shape/type pair.
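As an illustration of how the fragment arity follows from the shape/type pair, here is a small sketch covering only the three A-fragment cases quoted above. FRAGMENT_SPECS, fragment_type, and check_arity are hypothetical names, and the tf32 entry assumes the k8 shape.

```python
# Hypothetical table: (shape, element type, fragment role) -> (register type, count).
# Only the three A-fragment cases named in the text are listed.
FRAGMENT_SPECS = {
    ("m16n16k16", "f16", "a"): ("vector<2xf16>", 8),
    ("m16n16k8", "tf32", "a"): ("i32", 4),  # assumption: tf32 uses the k8 shape
    ("m16n16k16", "s8", "a"): ("i32", 2),
}


def fragment_type(shape, elem_type, frag):
    """Spell out the LLVM struct type of the per-thread fragment."""
    reg_type, count = FRAGMENT_SPECS[(shape, elem_type, frag)]
    return "!llvm.struct<(" + ", ".join([reg_type] * count) + ")>"


def check_arity(shape, elem_type, frag, registers):
    """Raise ValueError if the register count does not match the table."""
    _, count = FRAGMENT_SPECS[(shape, elem_type, frag)]
    if len(registers) != count:
        raise ValueError(
            f"{shape}.{elem_type} {frag} fragment needs {count} registers, "
            f"got {len(registers)}"
        )
```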

nvvm.wmma.store.d.sync.aligned.mXnYkZ.{row,col}.{T}

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | addr | ptr addrspace(3) | shared-memory destination |
| operand 1..N | frag | per shape/type | D fragment elements, expanded into one operand per register |
| operand N+1 | stride | i32 | row or column stride in elements |

store.d flattens the fragment into separate operands rather than re-packing into a struct, which mirrors LLVM's intrinsic signature.
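The flattened operand order can be shown in a few lines. This is only a sketch of the ordering in the table above; store_operands is a hypothetical helper, and the register names are placeholders.

```python
def store_operands(addr, fragment, stride):
    """Build the flat operand list: address, one operand per register, stride."""
    return [addr, *fragment, stride]


# Example with an 8-register D fragment (placeholder names).
d_fragment = [f"%d{i}" for i in range(8)]
operands = store_operands("%addr", d_fragment, "%stride")
```

With an 8-register fragment the list has ten entries, with the stride last at position N+1 = 9.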

nvvm.wmma.mma.sync.aligned.mXnYkZ.{layoutA}.{layoutB}.{aT}.{bT}.{cT}.{dT}

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0..p | A fragment | per shape and aT | flattened |
| operand p+1..q | B fragment | per shape and bT | flattened |
| operand q+1..r | C fragment | per shape and cT | accumulator input |
| result 0 | D fragment | !llvm.struct<(dT, ..., dT)> | accumulator output |

For m16n16k16.row.col.f16.f16.f16.f16 the operand bag is A=8 f16x2, B=8 f16x2, C=4 f16x2, and the result is struct<(vector<2xf16>) x 4>. The verifier cross-checks the operand counts against the shape and types and rejects any mismatch.
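The operand-count cross-check amounts to slicing the flat operand list at fixed offsets. The sketch below uses the register counts quoted on this page for four combinations; MMA_COUNTS and split_mma_operands are hypothetical names, not part of the dialect.

```python
# Hypothetical table: (shape, A/B element type, accumulator type) -> register
# counts for (A, B, C, D), taken from the combinations quoted on this page.
MMA_COUNTS = {
    ("m16n16k16", "f16", "f16"): (8, 8, 4, 4),
    ("m16n16k16", "f16", "f32"): (8, 8, 8, 8),
    ("m16n16k16", "s8", "s32"): (2, 2, 8, 8),
    ("m8n8k128", "b1", "s32"): (1, 1, 2, 2),
}


def split_mma_operands(shape, ab_type, acc_type, operands):
    """Split a flat operand list into (A, B, C) or raise on a count mismatch."""
    a, b, c, _d = MMA_COUNTS[(shape, ab_type, acc_type)]
    if len(operands) != a + b + c:
        raise ValueError(f"expected {a + b + c} operands, got {len(operands)}")
    return operands[:a], operands[a:a + b], operands[a + b:]
```

For the f16.f16.f16.f16 case this expects 8 + 8 + 4 = 20 operands and rejects any other count.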

LLVM Intrinsic Mapping

Every nvvm.wmma.* op lowers to exactly one call to an @llvm.nvvm.wmma.mXnYkZ.{op}.{layout}.{...} intrinsic. Because the dialect has only three op classes, the intrinsic cannot be hardwired per op: the lowering pattern reads the op's shape, fragment, layout, and element-type attributes, which the verifier has already checked, and selects the intrinsic those tokens name.

| Op | LLVM intrinsic |
| --- | --- |
| nvvm.wmma.load (frag = A, m16n16k16, row, f16) | llvm.nvvm.wmma.m16n16k16.load.a.row.stride.f16 |
| nvvm.wmma.load (frag = B, m16n16k16, col, f16) | llvm.nvvm.wmma.m16n16k16.load.b.col.stride.f16 |
| nvvm.wmma.load (frag = C, m16n16k16, row, f32) | llvm.nvvm.wmma.m16n16k16.load.c.row.stride.f32 |
| nvvm.wmma.store (frag = D, m16n16k16, row, f32) | llvm.nvvm.wmma.m16n16k16.store.d.row.stride.f32 |
| nvvm.wmma.mma (m16n16k16, row, col, f16→f16) | llvm.nvvm.wmma.m16n16k16.mma.row.col.f16.f16 |
| nvvm.wmma.mma (m16n16k16, row, col, f16→f32) | llvm.nvvm.wmma.m16n16k16.mma.row.col.f32.f32 |
| nvvm.wmma.mma (m8n8k128, row, col, b1→s32) | llvm.nvvm.wmma.m8n8k128.mma.row.col.b1 |
| nvvm.wmma.mma (m8n8k32, row, col, s4→s32) | llvm.nvvm.wmma.m8n8k32.mma.row.col.s4 |

Shape, fragment, layout pair, and element types all live as attributes on the three canonical dialect ops (nvvm.wmma.load, nvvm.wmma.store, nvvm.wmma.mma); the matching intrinsic name is reconstructed at NVVM-to-LLVM time by concatenating the attribute tokens.
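The token concatenation can be sketched as plain string formatting. This reproduces the names exactly as they appear in the table above and nothing more; load_store_intrinsic and mma_intrinsic are hypothetical helpers, and the rule that f16 inputs use a dT.cT suffix while other input types use a single type token is inferred from the table rows, not from the lowering code.

```python
def load_store_intrinsic(shape, op, frag, layout, elem_type):
    """Name for a load/store, e.g. op='load', frag='a', layout='row'."""
    return f"llvm.nvvm.wmma.{shape}.{op}.{frag}.{layout}.stride.{elem_type}"


def mma_intrinsic(shape, layout_a, layout_b, ab_type, c_type, d_type):
    """Name for an mma; suffix rule inferred from the table rows above."""
    if ab_type == "f16":
        suffix = f"{d_type}.{c_type}"  # accumulator types distinguish f16 forms
    else:
        suffix = ab_type  # integer and b1 forms carry only the input type
    return f"llvm.nvvm.wmma.{shape}.mma.{layout_a}.{layout_b}.{suffix}"
```

For example, mma_intrinsic("m16n16k16", "row", "col", "f16", "f32", "f32") yields the f16→f32 row of the table.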

PTX Templates

Once the LLVM intrinsic is selected, the NVPTX backend emits one PTX instruction. The templates below cover the canonical shape/type combinations; other combinations substitute the shape and type tokens without changing the skeleton.

| Op | PTX printed |
| --- | --- |
| wmma.load.a.sync (f16, row) | wmma.load.a.sync.aligned.row.m16n16k16.shared::cta.f16 {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, [%addr], %stride; |
| wmma.load.b.sync (f16, col) | wmma.load.b.sync.aligned.col.m16n16k16.shared::cta.f16 {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, [%addr], %stride; |
| wmma.load.c.sync (f32, row) | wmma.load.c.sync.aligned.row.m16n16k16.shared::cta.f32 {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, [%addr], %stride; |
| wmma.store.d.sync (f32, row) | wmma.store.d.sync.aligned.row.m16n16k16.shared::cta.f32 [%addr], {%r0, %r1, %r2, %r3, %r4, %r5, %r6, %r7}, %stride; |
| wmma.mma.sync (row.col.f16.f16.f16.f16) | wmma.mma.sync.aligned.row.col.m16n16k16.f16.f16 {%d0..%d3}, {%a0..%a7}, {%b0..%b7}, {%c0..%c3}; |
| wmma.mma.sync (row.col.f16.f16.f32.f32) | wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32 {%d0..%d7}, {%a0..%a7}, {%b0..%b7}, {%c0..%c7}; |
| wmma.mma.sync (row.col.s8.s8.s32.s32) | wmma.mma.sync.aligned.row.col.m16n16k16.s32.s8.s8.s32 {%d0..%d7}, {%a0..%a1}, {%b0..%b1}, {%c0..%c7}; |
| wmma.mma.sync (row.col.b1.b1.s32.s32, popc) | wmma.mma.xor.popc.sync.aligned.row.col.m8n8k128.s32.b1.b1.s32 {%d0..%d1}, {%a0}, {%b0}, {%c0..%c1}; |

In PTX the layout qualifier precedes the shape, and the integer and b1 forms spell out all four types (s32.aT.bT.s32).

The .popc and .and.popc modifiers on the b1 form (PTX spells the former .xor.popc) are encoded as a boolean attribute on nvvm.wmma.mma and selected through the op's element-type discriminator. The verifier rejects any combination not listed in the PTX ISA.

Per-Arch Availability

| Sub-family | SM floor | Minimum PTX version | Notes |
| --- | --- | --- | --- |
| f16/f32 accumulators | sm_70 | 6.0 | universal across Volta and later |
| bf16 | sm_80 | 7.0 | Ampere extension |
| tf32 | sm_80 | 7.0 | m16n16k8 shape only |
| s8 / u8 | sm_72 | 6.3 | Xavier (mobile, sm_72), then Turing and later |
| s4 / u4 | sm_75 | 6.3 | m8n8k32 shape only |
| b1 (popc / and.popc) | sm_75 | 6.3 | m8n8k128 shape only |
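The availability table reads naturally as a pair of floors per element type. The sketch below models only those two floors (SM and PTX version) and none of the Hopper or Blackwell preferences; AVAILABILITY and is_available are hypothetical names.

```python
# Hypothetical floors per input element type: (minimum SM, minimum PTX version).
# PTX versions are (major, minor) tuples so they compare correctly.
AVAILABILITY = {
    "f16": (70, (6, 0)),
    "bf16": (80, (7, 0)),
    "tf32": (80, (7, 0)),
    "s8": (72, (6, 3)),
    "u8": (72, (6, 3)),
    "s4": (75, (6, 3)),
    "u4": (75, (6, 3)),
    "b1": (75, (6, 3)),
}


def is_available(elem_type, sm, ptx_version):
    """Return True if the target meets both floors for this element type."""
    floors = AVAILABILITY.get(elem_type)
    if floors is None:
        return False  # unknown element type: not a WMMA sub-family
    min_sm, min_ptx = floors
    return sm >= min_sm and ptx_version >= min_ptx
```

For instance, is_available("bf16", 75, (7, 0)) is False because Turing sits below the sm_80 floor, while is_available("s8", 72, (6, 3)) is True.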

Hopper (sm_90+) backends accept nvvm.wmma.* for backward compatibility but Tileiras prefers nvvm.wgmma.mma_async once the target hits sm_90a. Blackwell (sm_100+) keeps WMMA legal for short-K tiles only — long-K paths go through nvvm.tcgen05.mma. See Per-SM Emission Templates — SM70 / SM75 for the Volta/Turing PTX templates.

Verification Invariants

  • Tile shape and element-type tuple must match a row of the PTX ISA's WMMA shape table.
  • A and B fragment cardinalities are derived from the shape; the verifier rejects mismatched operand counts.
  • C and D layouts (row / col) must agree.
  • .popc / .and.popc are legal only on the b1 form.
  • f64 WMMA does not exist in this dialect; the FP64 MMA path uses nvvm.mma.sync with the m8n8k4.f64 shape/type attribute combination.