Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM WGMMA Ops

Abstract

nvvm.wgmma.* is the warp-group asynchronous MMA family used on Hopper (sm_90a). A warp group is four contiguous warps cooperating on one m64nNkK accumulator tile, with B always resident in shared memory through a 64-bit SMEM descriptor and A either in registers or in SMEM through a second descriptor. The four ops in this family pair into a four-stage pipeline: fence, mma_async, commit, wait. See WGMMA Emission Protocol — The Four-Op Sequence for the pipeline timing and WGMMA Emission for the codegen side.

Blackwell (sm_100+) does not extend this family. The Hopper WGMMA path is the only wgmma.* PTX surface; Blackwell MMA lives in nvvm.tcgen05.*.

Op Roster

The "Properties slots used" column tracks where each op stores its attribute payload in the inline Properties record; see Properties Blob — Per-op-family slot maps for the exact byte offsets.

OpRoleProperties slots used
nvvm.wgmma.fence.alignedproducer-side fence before mma_asyncnone
nvvm.wgmma.mma_asyncthe MMA itselftypeA, b1Op, typeB, shape, typeC, scaleIn, scaleOut, layoutA, layoutB
nvvm.wgmma.commit.group.sync.alignedclose the current MMA groupwgmma_type, wgmma_layout
nvvm.wgmma.wait.group.sync.alignedwait for the group with depth Nwgmma_type, wgmma_layout, shape-N

commit.group and wait.group carry type+layout attributes even though no register operand survives — the suffix selects which earlier mma_async group the wait drains.

Operand Tables

nvvm.wgmma.fence.aligned

No operands and no result. Lowers to a single PTX wgmma.fence.sync.aligned; instruction.

nvvm.wgmma.mma_async

PositionNameTypeNotes
operand 0descAi64WGMMA SMEM descriptor for A — or, for the A-in-registers form, an !llvm.struct register fragment
operand 1descBi64WGMMA SMEM descriptor for B (always SMEM-resident)
operand 2accumIn!llvm.struct<(T, ..., T)> of accumulator regsaccumulator input tile
attributetypeAenum wgmma_typef16 / bf16 / tf32 / e4m3 / e5m2 / s8 / u8 / s4 / u4 / b1
attributetypeBenum wgmma_typemirror of typeA
attributetypeCenum wgmma_typeusually f32; f16 allowed for f16xf16
attributeshapeenum shapem64nNkK selector — N ∈ {8, 16, 24, ..., 256} step 8
attributescaleInenum wgmma_scale_in+1 / -1 for A and B
attributescaleOutenum wgmma_scale_out0 (init) or 1 (accumulate)
attributelayoutAenum mma_layoutrow / col
attributelayoutBenum mma_layoutrow / col
attributeb1Openum mma_b1opxor_popc / and_popc / none
result 0accumOutsame struct type as accumInaccumulator after the MMA

The accumulator struct width depends on N and typeC. For m64n128k16.f32.f16.f16 the accumulator is 64 f32 registers laid out as struct<(f32) x 64>; for m64n64k16.f32.f16.f16 it is 32 f32. The verifier rejects any struct width that does not match N * typeC_bits / 32.

nvvm.wgmma.commit.group.sync.aligned

PositionNameTypeNotes
attributewgmma_typeenumechoes the mma_async typeA/typeB selector
attributewgmma_layoutenumechoes the layout pair

No operands; closes the current outstanding-MMA group.

nvvm.wgmma.wait.group.sync.aligned

PositionNameTypeNotes
operand 0groupDepthi32number of older groups the wait keeps alive
attributewgmma_type / wgmma_layout / shape-Nenumspropagated through to the PTX suffix

A depth-zero wait drains every outstanding group; non-zero values keep older groups in flight while ensuring the current one is complete.

WGMMA SMEM Descriptor

The 64-bit value passed as descA (when A is SMEM-resident) and descB packs the SMEM tile origin and stride into a single word. The bit layout is shared with cute_nvgpu's WGMMA descriptor construction:

typedef union WgmmaDescriptor {
    uint64_t raw;
    struct {
        uint64_t start_addr   : 14;   /* low 14 bits of (smem_byte_offset >> 4) */
        uint64_t lbo          : 16;   /* leading byte offset (per-warp tile)    */
        uint64_t sbo          : 16;   /* stride byte offset (between warp tiles)*/
        uint64_t base_offset  : 3;    /* per-CTA SMEM base offset (>>3)         */
        uint64_t reserved     : 3;    /* always zero                            */
        uint64_t swizzle_mode : 2;    /* 0=none, 1=128B, 2=64B, 3=32B           */
        uint64_t pad          : 10;
    };
} WgmmaDescriptor;

start_addr requires 16-byte SMEM alignment because the field stores the offset shifted right by 4. lbo and sbo together encode the two-dimensional warp-tile stride layout. The swizzle field selects the canonical Hopper 128-byte mode, with 64-byte and 32-byte modes available for sub-tile widths.

The descriptor reaches nvvm.wgmma.mma_async as a plain i64 operand. The pattern that builds it sits in nvgpu.warpgroup.descriptor (see the nvgpu overview).

LLVM Intrinsic Mapping

OpLLVM intrinsic
nvvm.wgmma.fence.alignedllvm.nvvm.wgmma.fence.sync.aligned
nvvm.wgmma.mma_async (m64n128k16, f32.f16.f16)llvm.nvvm.wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
nvvm.wgmma.mma_async (m64n256k32, f32.e4m3.e4m3)llvm.nvvm.wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3
nvvm.wgmma.commit.group.sync.alignedllvm.nvvm.wgmma.commit.group.sync.aligned
nvvm.wgmma.wait.group.sync.alignedllvm.nvvm.wgmma.wait.group.sync.aligned

The intrinsic name is built by concatenating the shape, accumulator type, A type, and B type tokens. Tile counts (m64nNkK) are enumerated: every N ∈ {8, 16, 24, ..., 256} exposes a separate intrinsic. The verifier rejects any N outside that lattice.

PTX Templates

wgmma.fence.sync.aligned;

wgmma.mma_async.sync.aligned.m64nNkK.{accT}.{aT}.{bT}
    { %d0, %d1, ..., %d{accW-1} },
    %da, %db, %p,
    %scale_a, %scale_b,
    %trans_a, %trans_b;

wgmma.commit_group.sync.aligned;

wgmma.wait_group.sync.aligned N;

%da and %db are the 64-bit SMEM descriptors. %p is the immediate scale-D predicate (compile-time 0 or 1) that selects between init (overwrite accumulator) and accumulate. %scale_a and %scale_b are the immediate +1/-1 selectors that bind to scaleIn. %trans_a and %trans_b are the immediate transpose flags bound to layoutA / layoutB. The accumulator register list %d0..%d{accW-1} expands per N and accumulator type.

For the canonical m64n128k16.f32.f16.f16 shape:

wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
    { %f0, %f1, ..., %f63 },
    %da, %db, %p, 1, 1, %la, %lb;

Per-Arch Availability

OpSM floorptx_min
wgmma.fence.alignedsm_90a8.0
wgmma.mma_async.sync.alignedsm_90a8.0
wgmma.commit.group.sync.alignedsm_90a8.0
wgmma.wait.group.sync.alignedsm_90a8.0

Plain sm_90 is rejected; the WGMMA family requires the architecture-qualified sm_90a variant. Blackwell (sm_100+) does not extend WGMMA — the Blackwell tensor-memory MMA path is nvvm.tcgen05.mma.sync. See Per-SM Emission Templates — SM90 for the Hopper PTX templates and WGMMA Descriptor Round-Trip for the descriptor hex walk-through.

Verifier Invariants

  • shape is m64nNkK with N ∈ {8, 16, ..., 256} and K = 256 / typeA_bits (or 16 for tf32).
  • descA is i64 only when layoutA matches an SMEM tile; an A-in-registers fragment must be a typed struct.
  • descB is always i64.
  • Accumulator struct width equals N * sizeof(typeC) / 4 32-bit registers.
  • scaleOut is a compile-time i1; runtime values are rejected.
  • commit.group and wait.group carry the same wgmma_type and layout as the in-flight mma_async.
  • Wait depth is non-negative and fits in 6 bits.