NVVM WGMMA Ops
Abstract
nvvm.wgmma.* is the warp-group asynchronous MMA family used on Hopper (sm_90a). A warp group is four contiguous warps cooperating on one m64nNkK accumulator tile, with B always resident in shared memory through a 64-bit SMEM descriptor and A either in registers or in SMEM through a second descriptor. The four ops in this family pair into a four-stage pipeline: fence, mma_async, commit, wait. See WGMMA Emission Protocol — The Four-Op Sequence for the pipeline timing and WGMMA Emission for the codegen side.
Blackwell (sm_100+) does not extend this family. The Hopper WGMMA path is the only wgmma.* PTX surface; Blackwell MMA lives in nvvm.tcgen05.*.
Op Roster
The "Properties slots used" column tracks where each op stores its attribute payload in the inline Properties record; see Properties Blob — Per-op-family slot maps for the exact byte offsets.
| Op | Role | Properties slots used |
|---|---|---|
| nvvm.wgmma.fence.aligned | producer-side fence before mma_async | none |
| nvvm.wgmma.mma_async | the MMA itself | typeA, b1Op, typeB, shape, typeC, scaleIn, scaleOut, layoutA, layoutB |
| nvvm.wgmma.commit.group.sync.aligned | close the current MMA group | wgmma_type, wgmma_layout |
| nvvm.wgmma.wait.group.sync.aligned | wait for the group with depth N | wgmma_type, wgmma_layout, shape-N |
commit.group and wait.group carry type+layout attributes even though no register operand survives — the suffix selects which earlier mma_async group the wait drains.
Operand Tables
nvvm.wgmma.fence.aligned
No operands and no result. Lowers to a single PTX wgmma.fence.sync.aligned; instruction.
nvvm.wgmma.mma_async
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | descA | i64 | WGMMA SMEM descriptor for A — or, for the A-in-registers form, an !llvm.struct register fragment |
| operand 1 | descB | i64 | WGMMA SMEM descriptor for B (always SMEM-resident) |
| operand 2 | accumIn | !llvm.struct<(T, ..., T)> of accumulator regs | accumulator input tile |
| attribute | typeA | enum wgmma_type | f16 / bf16 / tf32 / e4m3 / e5m2 / s8 / u8 / s4 / u4 / b1 |
| attribute | typeB | enum wgmma_type | mirror of typeA |
| attribute | typeC | enum wgmma_type | usually f32; f16 allowed for f16xf16 |
| attribute | shape | enum shape | m64nNkK selector — N ∈ {8, 16, 24, ..., 256} step 8 |
| attribute | scaleIn | enum wgmma_scale_in | +1 / -1 for A and B |
| attribute | scaleOut | enum wgmma_scale_out | 0 (init) or 1 (accumulate) |
| attribute | layoutA | enum mma_layout | row / col |
| attribute | layoutB | enum mma_layout | row / col |
| attribute | b1Op | enum mma_b1op | xor_popc / and_popc / none |
| result 0 | accumOut | same struct type as accumIn | accumulator after the MMA |
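The accumulator struct width depends on N and typeC. The 64 x N tile is spread across the 128 threads of the warp group, so each thread holds N / 2 accumulator elements. For m64n128k16.f32.f16.f16 the accumulator is 64 f32 registers laid out as struct<(f32) x 64>; for m64n64k16.f32.f16.f16 it is 32 f32 registers. The verifier rejects any struct whose width, counted in 32-bit registers, does not match N * typeC_bits / 64.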
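The width rule can be sketched as a small helper. This is an illustrative sketch, not the verifier's actual code: the function names are hypothetical, and the formula is derived from the two worked examples above (64 registers for N = 128 and 32 for N = 64 with an f32 accumulator), assuming that f16 accumulators pack two elements per 32-bit register.

```c
#include <assert.h>
#include <stdbool.h>

/* 32-bit accumulator registers per thread for an m64nNkK tile.
 * The 64 x N tile is split across 128 threads (N / 2 elements each),
 * and each element fills typec_bits / 32 of a register.
 * Returns 0 when N or the accumulator width is unsupported. */
unsigned wgmma_acc_regs(unsigned n, unsigned typec_bits)
{
    bool n_ok = n >= 8 && n <= 256 && n % 8 == 0;
    bool c_ok = typec_bits == 16 || typec_bits == 32;
    if (!n_ok || !c_ok)
        return 0;
    assert((n * typec_bits) % 64 == 0); /* the division below is exact */
    return n * typec_bits / 64;
}

/* Hypothetical verifier-style check: does a struct of `width`
 * registers match the shape and accumulator type? */
bool wgmma_acc_width_matches(unsigned width, unsigned n, unsigned typec_bits)
{
    unsigned expected = wgmma_acc_regs(n, typec_bits);
    return expected != 0 && width == expected;
}
```

Under these assumptions, `wgmma_acc_regs(128, 32)` yields 64 and `wgmma_acc_regs(64, 32)` yields 32, matching the two shapes quoted above.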
nvvm.wgmma.commit.group.sync.aligned
| Position | Name | Type | Notes |
|---|---|---|---|
| attribute | wgmma_type | enum | echoes the mma_async typeA/typeB selector |
| attribute | wgmma_layout | enum | echoes the layout pair |
No operands; closes the current outstanding-MMA group.
nvvm.wgmma.wait.group.sync.aligned
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | groupDepth | i32 | maximum number of most recently committed groups that may still be pending when the wait returns |
| attribute | wgmma_type / wgmma_layout / shape-N | enums | propagated through to the PTX suffix |
A depth-zero wait drains every outstanding group. A non-zero depth N returns once at most N of the most recently committed groups are still in flight, which guarantees that every older group has completed.
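The depth semantics are easier to see in a toy model. The sketch below follows the PTX definition of wait_group, in which the groups left pending are the most recently committed ones and older groups retire first. The struct and function names are hypothetical and model only the bookkeeping, not the hardware.

```c
#include <assert.h>

/* Toy model of one thread's view of outstanding wgmma groups.
 * Illustrative only: these names are not part of any real API. */
typedef struct {
    unsigned committed; /* groups closed by commit_group so far */
    unsigned completed; /* oldest groups known to have finished  */
} WgmmaGroupModel;

/* commit_group: close the current group and count it as outstanding. */
void model_commit(WgmmaGroupModel *m) { m->committed++; }

/* Number of groups still in flight. */
unsigned model_pending(const WgmmaGroupModel *m)
{
    assert(m->completed <= m->committed);
    return m->committed - m->completed;
}

/* wait_group N: return once at most n of the newest groups are
 * pending. Older groups are retired first. Returns how many groups
 * this wait had to drain. */
unsigned model_wait(WgmmaGroupModel *m, unsigned n)
{
    unsigned pending = model_pending(m);
    if (pending <= n)
        return 0;
    unsigned drained = pending - n;
    m->completed += drained;
    return drained;
}
```

With three committed groups, `model_wait(&m, 1)` drains the two oldest and leaves the newest in flight; a following `model_wait(&m, 0)` drains the last one.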
WGMMA SMEM Descriptor
The 64-bit value passed as descA (when A is SMEM-resident) and descB packs the SMEM tile origin and stride into a single word. The bit layout is shared with cute_nvgpu's WGMMA descriptor construction:
typedef union WgmmaDescriptor {
    uint64_t raw;
    struct {
        uint64_t start_addr   : 14; /* low 14 bits of (smem_byte_offset >> 4) */
        uint64_t lbo          : 16; /* leading byte offset (per-warp tile) */
        uint64_t sbo          : 16; /* stride byte offset (between warp tiles) */
        uint64_t base_offset  : 3;  /* per-CTA SMEM base offset (>>3) */
        uint64_t reserved     : 3;  /* always zero */
        uint64_t swizzle_mode : 2;  /* 0=none, 1=128B, 2=64B, 3=32B */
        uint64_t pad          : 10;
    };
} WgmmaDescriptor;
start_addr requires 16-byte SMEM alignment because the field stores the offset shifted right by 4. lbo and sbo together encode the two-dimensional warp-tile stride layout. The swizzle field selects the canonical Hopper 128-byte mode, with 64-byte and 32-byte modes available for sub-tile widths.
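A minimal packing sketch for this layout follows. Bit-field allocation order is implementation-defined in C, so the sketch uses explicit shifts and assumes the fields are allocated from bit 0 upward in the order listed in the union. The function names, the enum, and the choice to take lbo, sbo, and base_offset as already-encoded field values are assumptions for illustration; the text above does not state their units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field positions, assuming allocation from bit 0 upward. */
enum {
    WGMMA_START_SHIFT = 0,  /* 14 bits */
    WGMMA_LBO_SHIFT   = 14, /* 16 bits */
    WGMMA_SBO_SHIFT   = 30, /* 16 bits */
    WGMMA_BASE_SHIFT  = 46, /*  3 bits, followed by 3 reserved bits */
    WGMMA_SWZ_SHIFT   = 52  /*  2 bits */
};

enum WgmmaSwizzle { SWZ_NONE = 0, SWZ_128B = 1, SWZ_64B = 2, SWZ_32B = 3 };

/* Packs a descriptor into *out. Returns false if the SMEM offset is
 * not 16-byte aligned or base_offset does not fit in 3 bits. */
bool wgmma_desc_pack(uint32_t smem_byte_offset, uint16_t lbo, uint16_t sbo,
                     uint8_t base_offset, enum WgmmaSwizzle swz,
                     uint64_t *out)
{
    assert(out != 0);
    if (smem_byte_offset & 0xF)
        return false;
    if (base_offset > 7)
        return false;
    uint64_t d = 0;
    d |= (uint64_t)((smem_byte_offset >> 4) & 0x3FFF) << WGMMA_START_SHIFT;
    d |= (uint64_t)lbo << WGMMA_LBO_SHIFT;
    d |= (uint64_t)sbo << WGMMA_SBO_SHIFT;
    d |= (uint64_t)base_offset << WGMMA_BASE_SHIFT;
    d |= (uint64_t)((unsigned)swz & 0x3) << WGMMA_SWZ_SHIFT;
    *out = d;
    return true;
}

/* Recovers the byte offset stored in start_addr. */
uint32_t wgmma_desc_start_bytes(uint64_t d)
{
    return (uint32_t)(d & 0x3FFF) << 4;
}
```

The result is a plain 64-bit integer, which fits the i64 operand form used for descA and descB.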
The descriptor reaches nvvm.wgmma.mma_async as a plain i64 operand. The pattern that builds it sits in nvgpu.warpgroup.descriptor (see the nvgpu overview).
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
| nvvm.wgmma.fence.aligned | llvm.nvvm.wgmma.fence.sync.aligned |
| nvvm.wgmma.mma_async (m64n128k16, f32.f16.f16) | llvm.nvvm.wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 |
| nvvm.wgmma.mma_async (m64n256k32, f32.e4m3.e4m3) | llvm.nvvm.wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3 |
| nvvm.wgmma.commit.group.sync.aligned | llvm.nvvm.wgmma.commit.group.sync.aligned |
| nvvm.wgmma.wait.group.sync.aligned | llvm.nvvm.wgmma.wait.group.sync.aligned |
The intrinsic name is built by concatenating the shape, accumulator type, A type, and B type tokens. Tile shapes (m64nNkK) are enumerated individually: every N ∈ {8, 16, 24, ..., 256} exposes a separate intrinsic. The verifier rejects any N outside that lattice.
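The concatenation rule can be sketched as a string builder. The helper names and signature below are hypothetical; the sketch assumes only the naming pattern visible in the mapping table and the step-8 lattice for N.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* N must sit on the step-8 lattice from 8 to 256. */
bool wgmma_n_valid(unsigned n)
{
    return n >= 8 && n <= 256 && n % 8 == 0;
}

/* Writes the mma_async intrinsic name into buf. Returns false when N
 * is off the lattice, a type token is empty, or buf is too small. */
bool wgmma_intrinsic_name(char *buf, size_t cap, unsigned n, unsigned k,
                          const char *acc_t, const char *a_t,
                          const char *b_t)
{
    assert(buf != NULL && acc_t != NULL && a_t != NULL && b_t != NULL);
    if (!wgmma_n_valid(n))
        return false;
    if (strlen(acc_t) == 0 || strlen(a_t) == 0 || strlen(b_t) == 0)
        return false;
    int len = snprintf(buf, cap,
                       "llvm.nvvm.wgmma.mma_async.sync.aligned"
                       ".m64n%uk%u.%s.%s.%s",
                       n, k, acc_t, a_t, b_t);
    return len > 0 && (size_t)len < cap;
}
```

Called with N = 128, K = 16 and the tokens f32, f16, f16, the helper reproduces the first mma_async row of the mapping table.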
PTX Templates
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64nNkK.{accT}.{aT}.{bT}
    { %d0, %d1, ..., %d{accW-1} },
    %da, %db, %p,
    %scale_a, %scale_b,
    %trans_a, %trans_b;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned N;
%da and %db are the 64-bit SMEM descriptors. %p is the immediate scale-D predicate (compile-time 0 or 1) that selects between init (overwrite accumulator) and accumulate. %scale_a and %scale_b are the immediate +1/-1 selectors that bind to scaleIn. %trans_a and %trans_b are the immediate transpose flags bound to layoutA / layoutB. The accumulator register list %d0..%d{accW-1} expands per N and accumulator type.
For the canonical m64n128k16.f32.f16.f16 shape:
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
    { %f0, %f1, ..., %f63 },
    %da, %db, %p, 1, 1, %la, %lb;
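The elided register list in the template expands mechanically from the accumulator width. The sketch below generates that list as text; the function names are hypothetical and the %f prefix simply follows the f32 example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Appends s to buf, keeping room for the terminating '\0'. */
static bool append(char *buf, size_t cap, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (*used + len >= cap)
        return false;
    memcpy(buf + *used, s, len + 1);
    *used += len;
    return true;
}

/* Writes "{ %f0, %f1, ..., %f{acc_w-1} }" into buf.
 * Returns false if buf is too small. */
bool wgmma_acc_reg_list(char *buf, size_t cap, unsigned acc_w)
{
    assert(buf != NULL);
    size_t used = 0;
    char reg[16]; /* ", %f" plus up to 10 digits plus '\0' */
    if (!append(buf, cap, &used, "{ "))
        return false;
    for (unsigned i = 0; i < acc_w; i++) {
        snprintf(reg, sizeof reg, "%s%%f%u", i ? ", " : "", i);
        if (!append(buf, cap, &used, reg))
            return false;
    }
    return append(buf, cap, &used, " }");
}
```

For the m64n128k16.f32.f16.f16 shape the caller would pass acc_w = 64 and a buffer of a few hundred bytes.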
Per-Arch Availability
| Op | SM floor | ptx_min |
|---|---|---|
| wgmma.fence.aligned | sm_90a | 8.0 |
| wgmma.mma_async.sync.aligned | sm_90a | 8.0 |
| wgmma.commit.group.sync.aligned | sm_90a | 8.0 |
| wgmma.wait.group.sync.aligned | sm_90a | 8.0 |
Plain sm_90 is rejected; the WGMMA family requires the architecture-qualified sm_90a variant. Blackwell (sm_100+) does not extend WGMMA — the Blackwell tensor-memory MMA path is nvvm.tcgen05.mma.sync. See Per-SM Emission Templates — SM90 for the Hopper PTX templates and WGMMA Descriptor Round-Trip for the descriptor hex walk-through.
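The availability rule reduces to a three-part gate. The sketch below is one way to express it; the TargetInfo struct and the function name are hypothetical and stand in for whatever target description the compiler actually carries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical target description, for illustration only. */
typedef struct {
    unsigned sm;        /* 90 for sm_90 and sm_90a            */
    bool arch_specific; /* true when the "a" suffix is present */
    unsigned ptx_major;
    unsigned ptx_minor;
} TargetInfo;

/* WGMMA needs exactly sm_90a and PTX 8.0 or newer. Plain sm_90 and
 * Blackwell (sm_100+) targets are both rejected. */
bool wgmma_available(const TargetInfo *t)
{
    assert(t != NULL);
    bool ptx_ok = t->ptx_major * 10 + t->ptx_minor >= 80;
    return t->sm == 90 && t->arch_specific && ptx_ok;
}
```

The check on sm is an equality rather than a floor because the family is not carried forward to later architectures.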
Verifier Invariants
- shape is m64nNkK with N ∈ {8, 16, ..., 256} and K = 256 / typeA_bits, where tf32 counts as a 32-bit type (K = 8).
- descA is i64 only when layoutA matches an SMEM tile; an A-in-registers fragment must be a typed struct.
- descB is always i64.
- Accumulator struct width equals N * sizeof(typeC) / 8 32-bit registers.
- scaleOut is a compile-time i1; runtime values are rejected.
- commit.group and wait.group carry the same wgmma_type and layout as the in-flight mma_async.
- Wait depth is non-negative and fits in 6 bits.
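Two of these invariants, the K rule and the wait-depth bound, can be sketched as standalone checks. The enum and function names are hypothetical, and the enum covers only a subset of the element types. tf32 is treated as a 32-bit storage type, which gives K = 8 and matches the PTX m64nNk8 shapes for tf32.

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of the element types, for illustration only. */
typedef enum {
    WT_F16, WT_BF16, WT_TF32, WT_E4M3, WT_E5M2, WT_S8, WT_U8, WT_B1
} WgmmaElemType;

/* Storage width in bits; tf32 occupies a 32-bit container. */
unsigned wgmma_type_bits(WgmmaElemType t)
{
    switch (t) {
    case WT_F16: case WT_BF16: return 16;
    case WT_TF32: return 32;
    case WT_E4M3: case WT_E5M2: case WT_S8: case WT_U8: return 8;
    case WT_B1: return 1;
    }
    return 0;
}

/* K is fixed by the A element width: 256 bits per row of A. */
unsigned wgmma_expected_k(WgmmaElemType a)
{
    unsigned bits = wgmma_type_bits(a);
    assert(bits != 0); /* guards the division below */
    return 256 / bits;
}

/* Shape check: N on the step-8 lattice and K matching typeA. */
bool wgmma_shape_ok(unsigned n, unsigned k, WgmmaElemType a)
{
    bool n_ok = n >= 8 && n <= 256 && n % 8 == 0;
    return n_ok && k == wgmma_expected_k(a);
}

/* Wait depth must be non-negative and fit in 6 bits (0..63). */
bool wgmma_wait_depth_ok(long depth)
{
    return depth >= 0 && depth < 64;
}
```

A real verifier would also check the descriptor operand types and the accumulator struct; those are left out to keep the sketch short.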