Matmul Progression by SM

Abstract

NVIDIA's matrix-multiply abstraction has evolved across seven SM generations. Each generation adds capacity along one of three axes — concurrency model (warp-cooperative → warp-group → cluster-cooperative), operand storage class (register fragments → SMEM descriptors → tensor memory), or numerical range (FP16 → FP8 → MXFP4 with block scales). Some generations also remove resource classes that earlier ones introduced: Blackwell datacenter parts drop the register-resident accumulator that WGMMA used, and Blackwell consumer parts drop tensor memory entirely while keeping the block-scale operand encoding.

This page is the canonical cross-architecture overview. It supersedes the scattered per-tier discussions in MMA Atoms SM70-SM120 (the per-arch shape lattice), the WGMMA and tcgen05 topic pages (which focus on one generation each), and tcgen05 / WGMMA / mbarrier / Cluster Emission. Those pages keep their per-tier content; this page covers the cross-architecture story.

SM70 / SM75: Warp-Cooperative `mma.sync`

SM70 (Volta) and SM75 (Turing) introduced the first generation of tensor cores. The MMA instruction is mma.sync: warp-cooperative (32 threads cooperate on one tile), synchronous (the result is visible to the warp immediately after the instruction returns), and entirely register-resident (both operands and the accumulator live in the warp's register file).

The tile shapes are fixed and small. SM70 supports 8 x 8 x 4 with FP16 inputs and FP16 or FP32 accumulators. SM75 adds 16 x 8 x 8 with FP16, BF16, and the integer low-bit forms. The operand layouts are pinned by the architecture: each lane carries a specific subset of the matrix tile, and the layout grammar in cute_nvgpu exists in large part to record these per-lane subsets without losing them across pipeline transformations.

emit:  mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 { %d0, %d1 }, { %a0, %a1 }, { %b0 }, { %c0, %c1 };
       (warp-cooperative, synchronous, all operands and accumulator in registers)

SM80 / SM86 / SM87 / SM89: Dense and Sparse `mma.sync`

SM80 (Ampere A100) keeps the same warp-cooperative synchronous model but expands the shape lattice substantially: 16 x 8 x 16 with FP16 / BF16 / TF32 and a sparse mma.sp.sync variant that halves the structurally-sparse operand and adds a metadata operand. The lower SM80 derivatives (SM86, SM87) keep the same operations with smaller tensor-core arrays.

SM89 (Ada L40) adds FP8 E4M3 and E5M2 inputs to the same warp-cooperative synchronous register-MMA model. FP8 inputs always accumulate into FP32; the FP8 shape is 16 x 8 x 32 and the K extent doubles compared to FP16 because each element takes half the bits.

emit:  mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 ...   (SM80)
       mma.sp.sync.aligned.m16n8k32.row.col.s8.s8.s32 ...     (SM80 sparse)
       mma.sync.aligned.m16n8k32.row.col.e4m3.e4m3.f32 ...    (SM89 FP8)

None of the SM80-tier MMAs touch shared memory directly — they read operands from registers. The kernel is responsible for staging tiles into registers, typically via ldmatrix from shared memory and cp.async into shared memory upstream.

SM90 / SM90a: Warp-Group Async WGMMA

SM90 (Hopper H100) introduces the first asynchronous MMA: wgmma.mma_async. Four warps now cooperate on one accumulator tile (warp-group cooperative, hence WGMMA). The instruction is asynchronous against the issuing warps — it returns immediately, and the accumulator is not visible until a wait-group instruction drains the in-flight cohort.

The operand storage class changes too. Operand B is always an SMEM descriptor — a packed 64-bit word encoding base address, leading byte offset, stride byte offset, base offset, and swizzle mode. Operand A may be a register fragment or an SMEM descriptor depending on the atom variant. The accumulator stays in the warp group's register file, but is invisible until drained.

The four-op emission protocol — fence → tile loop of mma_async → commit → wait — is the contract a correct lowering must preserve. See wgmma-emission-protocol for details.

Shapes range over 64 x N x K where M is fixed at 64 per instruction, N steps in multiples of 8 up to 256, and K is the canonical 256 / elem_bits per element type. The architecture-qualified sm_90a variant is mandatory — plain sm_90 rejects WGMMA at NVVM verification.

emit:  wgmma.fence.sync.aligned;
       wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {...}, %a, %b_desc, %scale, ...;
       wgmma.commit_group.sync.aligned;
       wgmma.wait_group.sync.aligned 0;
       (warp-group cooperative, asynchronous, B in SMEM descriptor, accumulator in RF)

SM100 / SM103: Tensor Memory and `tcgen05.mma`

SM100 (Blackwell B200) and SM103 (Blackwell Ultra GB300) remove WGMMA and replace it with tcgen05.mma. The concurrency model stays warp-group cooperative; the accumulator moves out of the register file and into tensor memory (TMEM), a new on-chip memory class. Operand A becomes either an SMEM descriptor or a TMEM pointer; operand B stays as an SMEM descriptor.

TMEM is per-SM, dense (128 rows per region), and reachable only from the tcgen05 instruction family. The accumulator residency change is the single biggest architectural shift between WGMMA and tcgen05: a kernel that reads the accumulator must use tcgen05.ld to copy TMEM back into registers, not just observe the SSA value as on Hopper.

SM100 also adds two new variant axes:

Block-scaled MMA for microscale formats (FP4, FP6, FP8) with per-block E8M0 or E4M3FN scale factors stored in dedicated TMEM regions.
Weight-stationary mode that pins operand A to its TMEM region across the K loop, amortising A-side bandwidth.

The cluster-cooperative variant cta_group::2 lets two CTAs in a cluster share an MMA tile; CTA 0 holds half of TMEM rows, CTA 1 holds the other half. A 4-CTA copy variant exists on the staging-copy side but not on the MMA side — Blackwell's 4-CTA semantics is a copy-time fan-out, and the MMA that follows is a plain single-CTA instruction over its slice. See tcgen05-tensor-memory-model.

emit:  tcgen05.alloc.shared %h, 256;           // allocate TMEM region
       tcgen05.cp.smem.tmem ...;               // stage operand into TMEM
       tcgen05.mma.cta_group::1 %h_d, %a_desc, %b_desc, %h_scale, 1;
       (warp-group cooperative, asynchronous, A in SMEM/TMEM, B in SMEM, D in TMEM)

SM110: Jetson Thor — No Dedicated MMA Surface

SM110 (Jetson Thor) sits between datacenter Blackwell (SM100/SM103) and consumer Blackwell (SM120/SM121) in the architecture roster, and the compiler enumerates sm_110, sm_110a, and sm_110f as legal target strings. The cute_nvgpu dialect does not register any sm110.* MMA atom mnemonic — no WGMMA-style warp-group MMA, no tcgen05.mma over tensor memory, and no consumer-style block-scaled register MMA is dialect-side dispatched for SM110. Kernels compiled against sm_110 use the universal-FMA fallback or an earlier-tier MMA atom that the architecture-conditional gate accepts. See SM Tier Roster and Copy Atom Registry — SM110 (Jetson Thor) for the dialect-side evidence. Confidence: HIGH.

SM120 / SM121: Consumer Blackwell Block-Scaled MMA

SM120 (consumer RTX 50-series and enterprise Pro) and SM121 (DGX Spark) are a different lineage from datacenter Blackwell. They keep the block-scaled operand encoding but remove tensor memory. The MMA is once again warp-cooperative (32 threads, like SM70-SM89), synchronous (no wait-group), and entirely register-resident.

The instruction is a synchronous mma.sync.aligned with two new per-operand operands: scale_a and scale_b, both E8M0 register fragments. Each operand carries one scale factor per vecSize elements along K; the legal (K, vecSize) combinations are (32, 32) for the FP4/FP6/FP8 family and (64, 16) or (64, 32) for FP4-only inputs.

The accumulator stays in registers. The MMA is synchronous, so there is no wait-group barrier. The operand-encoding is closer to SM89 than to SM100 — block-scale is a numerical-range expansion of the register-MMA model, not a concurrency-model change.

emit:  mma.sync.aligned.m16n8k32.row.col.f4.f4.f32.block_scale
           { %d0, %d1, %d2, %d3 },
           { %a0, %a1 },           // FP4 operand A
           { %b0 },                // FP4 operand B
           { %c0, %c1, %c2, %c3 },
           { %sa },                // E8M0 scale factor for A
           { %sb };                // E8M0 scale factor for B
       (warp-cooperative, synchronous, all operands and accumulator in registers,
        block-scale operands in dedicated register fragments)

Worked Example: m64n128k16 bf16 × bf16 → f32

The clearest way to see the per-generation lowering differences is to pick a single logical matmul shape and trace what each tier emits. The shape below is large enough to require warp-cooperation on every tier but small enough to fit in one warp-group instruction on SM90 and SM100:

Computation: D = A × B + C
A: 64 × 16 tile, bf16
B: 16 × 128 tile, bf16
C, D: 64 × 128 tile, f32

SM70 / SM75: warp-cooperative `mma.sync`, register-resident

Volta and Turing have no instruction that produces a 64 × 128 tile in one issue. The compiler tiles the 64 × 128 output into a 4 × 16 grid of m16n8k8 sub-tiles and dispatches them across four warps (one per M = 16 sub-tile-row) with each warp running 16 K-sub-tiles inside. Operand fragments load from SMEM via ldmatrix into the warp's register file before each mma.sync:

for warp_m in 0..4:           # 64 / 16 = 4 warps cover the M extent
  for n_tile in 0..16:        # 128 / 8 = 16 N sub-tiles per warp
    for k_tile in 0..2:       # 16 / 8 = 2 K sub-tiles per N sub-tile
      ldmatrix A[warp_m, k_tile]                     # 4 i32 registers
      ldmatrix B[k_tile, n_tile]                     # 2 i32 registers
      mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32
          { D[warp_m, n_tile] regs },                # 4 f32 registers
          { A regs }, { B regs },
          { C[warp_m, n_tile] regs }                 # 4 f32 registers

Total: 4 warps × 16 N-tiles × 2 K-tiles = 128 individual MMAs. Every operand and accumulator lives in the register file; SMEM is staging only. The MMAs are synchronous — the result is in registers when the instruction returns.

SM80 / SM86 / SM89: warp-cooperative `mma.sync` with wider K

Ampere expands the legal shapes to m16n8k16 for bf16, doubling the K extent per instruction. The same 64 × 128 output now tiles into a 4 × 16 grid of m16n8k16 sub-tiles — one K sub-tile per N sub-tile per warp:

for warp_m in 0..4:           # 64 / 16 = 4 warps
  for n_tile in 0..16:        # 128 / 8 = 16 N sub-tiles per warp
    ldmatrix A[warp_m, 0]                            # K = 16 in one load
    ldmatrix B[0, n_tile]                            # K = 16 in one load
    mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32
        { D regs }, { A regs }, { B regs }, { C regs };

Total: 4 warps × 16 N-tiles × 1 K-tile = 64 MMAs — half the SM70/SM75 count. Operand and accumulator residency is identical to Volta; the change is the K extent per instruction. SM80 also gains the mma.sp.sync.aligned sparse variant for 2:4-structured operands; SM89 adds FP8 inputs (mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32) with the K extent doubled again to 32.

SM90a: warp-group async WGMMA, B in SMEM descriptor

Hopper collapses the entire 64 × 128 output into a single warp-group instruction. Four warps cooperate on the same accumulator tile (M = 64 is the warp-group dimension); operand B rides an SMEM descriptor; operand A may be a register fragment or an SMEM descriptor. The accumulator stays in the warp group's register file but is invisible until the wait drains the group:

# Build the SMEM descriptor for B once before the loop
%b_desc = make_smem_desc(smem_off=&B, lbo=16*2, sbo=0, base_offset=0, swizzle=128B)

wgmma.fence.sync.aligned;

wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16
    { %fd0, %fd1, ..., %fd31 },   # 32 f32 accumulator registers per thread
    { %ra0, %ra1, ..., %ra3 },    # 4 bf16 A-fragment registers (or %b_desc_a if SMEM-resident)
    %b_desc,                      # 64-bit SMEM descriptor for B
    %scale_d,                     # 1 if accumulating, 0 if zeroing
    1, 1,                         # scale-A, scale-B (FP families)
    0, 0;                         # transpose-A, transpose-B

wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

One MMA replaces 64 from SM80. The four-op protocol — fence, async MMA, commit, wait — is mandatory (see WGMMA Emission Protocol). The accumulator is async-visible only: reads of %fd* before wait_group are silent UB.

SM100 / SM103: warp-group `tcgen05.mma`, accumulator in TMEM

Blackwell moves the accumulator out of the register file entirely. The 64 × 128 f32 output now lives in TMEM, occupying 16 columns of the SM's 128-row × 256-column TMEM grid. Operand A lands in TMEM (staged from SMEM via tcgen05.cp); operand B stays as an SMEM descriptor with the same 64-bit format as Hopper. Single-CTA variant:

# Allocate TMEM for the accumulator (16 columns × 128 rows × 16 B = 32 KiB)
%d_tmem = tcgen05.alloc.shared 16

# Stage A from SMEM into TMEM (one column × 128 rows for bf16 A tile)
tcgen05.cp.smem.tmem %a_tmem, smem_off=&A, layout=...

# Build the SMEM descriptor for B
%b_desc = make_smem_desc(smem_off=&B, lbo=..., sbo=..., swizzle=128B)

# Pack the control word: kind::f16 (covers bf16), cta_group::1, no block-scale
%ctrl = ((MMA_KIND_F16 << 6) | (CTA_GROUP_1 << 0))

tcgen05.mma.cta_group::1.kind::f16.f32.bf16.bf16
    [%d_tmem],                  # TMEM destination (C and D in-place)
    [%a_tmem],                  # TMEM source for A
    %b_desc,                    # SMEM descriptor for B
    %ctrl;                      # packed control word

# Drain via mbarrier or tcgen05.commit + tcgen05.wait
mbarrier.arrive.expect_tx [%mbar], 1
mbarrier.wait %mbar

# Copy D out of TMEM back to registers if a consumer needs it
tcgen05.ld.shared %dst_regs, [%d_tmem]

The MMA is async like WGMMA, but the completion signal is an mbarrier transaction rather than a wait-group counter. Reading the accumulator requires an explicit tcgen05.ld to copy TMEM into registers — there is no SSA visibility shortcut like Hopper's. See tcgen05 Tensor Memory Model.

The 2-CTA cooperative variant halves the M extent per CTA: CTA 0 owns the top 32 rows of D (M = 0..32), CTA 1 owns the bottom 32 rows (M = 32..64). The MMA opcode becomes tcgen05.mma.cta_group::2.kind::f16.f32.bf16.bf16 and pairs the two CTAs at execute time.

SM120 / SM121: warp-cooperative block-scale `mma.sync`, register-resident

Consumer Blackwell drops TMEM but keeps the block-scale operand encoding. Without TMEM, the warp-group cooperation model collapses back to per-warp synchronous MMA, so the 64 × 128 output once again tiles across four warps as on Ampere — but with a per-operand scale factor:

for warp_m in 0..4:           # 64 / 16 = 4 warps
  for n_tile in 0..16:        # 128 / 8 = 16 N-tiles per warp
    for k_tile in 0..(K/32):  # K = 32 per block-scale instruction
      ldmatrix A[warp_m, k_tile]
      ldmatrix B[k_tile, n_tile]
      mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.scale_vec::1X.block_scale.f32.e4m3.e4m3.f32
          { D regs },
          { A regs },             # FP8 operand A (e4m3 here)
          { B regs },             # FP8 operand B
          { C regs },
          %sfa,                   # E8M0 scale factor for A (register)
          %sfb;                   # E8M0 scale factor for B (register)

No async, no TMEM, no warp-group cooperation. The scale factors %sfa and %sfb are per-warp register fragments — one E8M0 byte per vecSize = 32 elements along K. Compared to SM100, the block-scale operand encoding is identical (same E8M0 / E4M3FN formats, same (K, vecSize) triples) but the residency is registers, not TMEM.

Side-By-Side Summary

Tier	Instructions per 64×128×16	Operand A	Operand B	Accumulator	Sync model	Operand-A bandwidth
SM70/75	128 (`m16n8k8`)	RF	RF	RF	sync	re-loaded per inner tile
SM80/89	64 (`m16n8k16`)	RF	RF	RF	sync	re-loaded per N-tile
SM90a	1 (`m64n128k16`)	RF or SMEM desc	SMEM desc	RF (async)	async (4-op)	one load per instruction
SM100/103	1 (`m64n128k16`)	SMEM desc or TMEM	SMEM desc	TMEM	async (mbarrier)	amortised by collector
SM110 (Jetson Thor)	falls through to universal-FMA / earlier-tier atoms	—	—	—	—	no SM110-specific MMA dispatch
SM120/121	64 (`m16n8k32` block-scale)	RF	RF	RF	sync	re-loaded per N-tile

Reading the table: the instruction-count progression collapses the per-warp tile loop into the hardware between SM89 and SM90, then keeps it collapsed through SM100. SM120 reverts to per-warp tiling because consumer Blackwell removes the warp-group cooperation model, but the block-scale operand encoding stays — so SM120 is "SM89-shaped MMA with SM100's numerical range". The accumulator-residency progression is the most consequential: it moves out of the register file at SM90 (still in RF but async-visible only), out the rest of the way at SM100 (into TMEM), and back into RF at SM120. A kernel author who reuses an SM100 codepath on SM120 has to re-introduce explicit ldmatrix staging because TMEM is no longer there.

What Each Generation Adds and Removes

Tier	Concurrency	Operand A	Operand B	Accumulator	Sync	New
SM70/75	warp (32 lanes)	RF	RF	RF	sync	dense `mma.sync`, FP16
SM80	warp (32 lanes)	RF	RF	RF	sync	sparse `mma.sp.sync`, BF16, TF32
SM89	warp (32 lanes)	RF	RF	RF	sync	FP8 E4M3 / E5M2 inputs
SM90a	warp-group (4 warps)	RF or SMEM desc	SMEM desc	RF (async-visible)	async	warp-group MMA, SMEM operand descriptors
SM100/103	warp-group, optional 2-CTA cluster	SMEM desc or TMEM	SMEM desc	TMEM	async	tensor memory, block-scale, weight-stationary, sparse block-scale
SM110 (Jetson Thor)	—	—	—	—	—	target tier registered, no dedicated MMA atom; lowering falls through to universal-FMA
SM120/121	warp (32 lanes)	RF	RF	RF	sync	block-scale on consumer parts, no TMEM, no async

The progression is not monotonic. SM90a moves the accumulator out of registers (sort of: still in the RF, but async-visible only). SM100 moves it the rest of the way out, into TMEM. SM120 moves it back into registers, but keeps the block-scale operand encoding that SM100 added. The right way to read the table is one column at a time: concurrency grows up to SM100 and then resets for consumer Blackwell; operand storage class climbs steadily through SM100 and then resets; numerical range grows monotonically.

Cross-References

MMA Atoms SM70-SM120 carries the per-arch shape lattice and the dialect-side atom contracts. WGMMA Emission Protocol covers the SM90a four-op protocol. tcgen05 Tensor Memory Model covers the SM100/103 model and the 10-variant taxonomy. Mode Pattern Verifiers carries the kind-word verifier ladder that gates SM100 and SM120 block-scaled variants. Blackwell 2-CTA and 4-CTA MMA documents the cluster-cooperative copy patterns that stage TMEM operands for SM100. mbarrier State Machine is the synchronisation primitive every async generation builds its producer/consumer protocol on top of.

Keyboard shortcuts

Tileiras Internals