Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Per-SM Emission Templates

Abstract

Tileiras emits tensor-core matrix instructions through a different path on each SM generation. The useful public model is not "which helper printed a string" but "which instruction surface is available, which operands it expects, and whether emission goes through inline assembly or an NVPTX machine instruction."

Older Volta and Turing MMA operations take the NVVM intrinsic path. Ampere and Ada take llvm.inline_asm templates for dense and sparse mma.sync. Hopper adds WGMMA templates. Datacenter Blackwell moves tensor-core matmul into tensor-memory tcgen05 machine instructions. Consumer Blackwell has no tensor memory and falls back to warp-level block-scaled mma.sync machine instructions.

Capability Matrix

SM tierPublic surfaceEmission pathMain instruction family
SM70 / SM75nv_tileas.mma.sm70, nv_tileas.mma.sm75NVVM intrinsicmma.sync.m8n8k*
SM80dense/sparse MMA atomsinline asmmma.sync.aligned, mma.sp.sync.aligned
SM89FP8 MMA atomsinline asmmma.sync.aligned.m16n8k32
SM90warp-group MMAinline asm / NVVM opswgmma.mma_async.sync.aligned
SM100 / SM103tensor-memory MMAMachineInstrtcgen05.mma
SM120 / SM121block-scaled warp MMAMachineInstrmma.sync.aligned.*.block_scale

The selection rule is a tier-keyed lookup: the SM major version names a single emission path. SM70 and SM75 emit the NVVM intrinsic and let the NVPTX backend pick the final PTX spelling. SM80 and SM89 build a mma.sync.aligned inline-asm template at IR time. SM90 builds a four-part WGMMA inline-asm protocol. SM100 and SM103 emit tcgen05.mma as a MachineInstr directly. SM120 and SM121 emit warp-synchronous mma.sync.aligned.*.block_scale as a MachineInstr.

SM70 / SM75

Volta and Turing need no Tileiras-owned inline-assembly templates for their baseline MMA surface. The dialect registers the SM70 and SM75 atoms, then lowers them to the corresponding llvm.nvvm.mma.* intrinsics. The downstream NVPTX backend owns final PTX spelling.

TierShape familiesLowering rule
SM70m8n8k4Use NVVM MMA intrinsic.
SM75m8n8k16, m8n8k32, m8n8k128, BF16 additionsUse NVVM MMA intrinsic.

The PTX spelling produced by the NVPTX backend matches the SM tier:

mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32
    {%fd0, %fd1, %fd2, %fd3},
    {%r0, %r1, %r2, %r3},
    {%r4, %r5},
    {%fd4, %fd5, %fd6, %fd7};

The exact register count per operand fragment depends on the shape and element type; the table-driven NVPTX printer reads it from the per-opcode operand-class enumeration.

SM80

Ampere is the first tier where Tileiras builds the PTX template directly inside llvm.inline_asm. Dense MMA emits mma.sync.aligned. Sparse MMA emits mma.sp.sync.aligned with a metadata register and a sparsity-selector immediate. The INT8 m16n8k32 sparse form has a .sp::ordered_metadata fast path that pins the selector to zero.

FamilyShape examplesAccumulatorExtra operands
Dense f16/bf16/tf32m16n8k8, m16n8k16f16 or f32none
Dense integerm16n8k32, m16n8k64s32optional .satfinite
Sparse f16/bf16/tf32m16n8k8, m16n8k16f16 or f32metadata + selector
Sparse integerm16n8k32, m16n8k64s32metadata + selector
Ordered metadatam16n8k32 INT8 sparses32metadata, selector fixed to zero

Dense m16n8k16.f32.f16.f16.f32 emits:

mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
    {%fd0, %fd1, %fd2, %fd3},
    {%r0, %r1, %r2, %r3},
    {%r4, %r5},
    {%fd4, %fd5, %fd6, %fd7};

The dense INT8 m16n8k32.s32.s8.s8.s32 form emits the same shape with s32/s8 type suffixes and an optional .satfinite modifier on the destination side.

Sparse m16n8k16.f32.f16.f16.f32 emits the same operand list plus a metadata register and a selector immediate:

mma.sp.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
    {%fd0, %fd1, %fd2, %fd3},
    {%r0, %r1},
    {%r2, %r3},
    {%fd4, %fd5, %fd6, %fd7},
    %r4,
    0x0;

The metadata operand is logically two i16 values packed into one i32 register. The selector is a one-bit immediate.

The INT8 ordered-metadata fast path swaps .sp for .sp::ordered_metadata and elides the explicit selector:

mma.sp::ordered_metadata.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32
    {%r0, %r1, %r2, %r3},
    {%r4, %r5, %r6, %r7},
    {%r8, %r9},
    {%r10, %r11, %r12, %r13},
    %r14;

Dense integer forms can request .satfinite; floating forms have no such modifier at the MMA level.

SM89

Ada extends the SM80 dynamic builders with FP8 types. The shape is m16n8k32, the accumulator is f32, and the input type product is one of e4m3 x e4m3, e4m3 x e5m2, e5m2 x e4m3, e5m2 x e5m2. The emitted PTX form is:

mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32
    {%fd0, %fd1, %fd2, %fd3},
    {%r0, %r1, %r2, %r3},
    {%r4, %r5},
    {%fd4, %fd5, %fd6, %fd7};

Register arity follows the SM80 INT8 k32 layout: four D registers, four A registers, two B registers, four C registers. Sparse FP8 adds one metadata register and reuses the .sp modifier shape from SM80. No FP16 accumulator path exists for this tier's FP8 mma.sync — that belongs to the later WGMMA surface.

SM90

Hopper introduces WGMMA. Tileiras emits wgmma.mma_async.sync.aligned inside a four-part inline-assembly protocol: fence, one or more async MMA instructions, commit group, wait group. The accumulator-update bit is carried by a predicate register (%p) computed from the scale_d operand. Shared-memory operands ride as 64-bit descriptors built by the per-atom descriptor constructor.

Input familyD typeKNotes
f16 x f16f16 or f3216Optional scale and transpose operands.
bf16 x bf16f3216Same operand structure as f16/f32.
tf32 x tf32f328TF32-specific K width.
e4m3/e5m2 FP8 pairsf3232Four FP8 type combinations.
s8/u8 integer pairss3232Forced .satfinite, no scale-a/b.
b1 x b1s32256Uses .xor.popc or .and.popc.

The four-part protocol for one tile of m64n128k16.f32.f16.f16 is:

wgmma.fence.sync.aligned;

wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
    {%fd0, %fd1, %fd2, %fd3, %fd4, %fd5, %fd6, %fd7,
     %fd8, %fd9, %fd10, %fd11, %fd12, %fd13, %fd14, %fd15,
     %fd16, %fd17, %fd18, %fd19, %fd20, %fd21, %fd22, %fd23,
     %fd24, %fd25, %fd26, %fd27, %fd28, %fd29, %fd30, %fd31},
    %rd0,                       // descriptor A
    %rd1,                       // descriptor B
    %p0,                        // scale-D (accumulator-update predicate)
    1, 1,                       // scale-A, scale-B (FP families only)
    0, 0;                       // transpose-A, transpose-B (FP families only)

wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;

The float families append scale-A, scale-B, transpose-A, transpose-B immediates after the scale-D predicate; the integer families omit those and force .satfinite. The b1 family substitutes .xor.popc or .and.popc for the type suffix.

The A operand can be a register fragment instead of an SMEM descriptor — in that case it appears as a register-tuple { %r0, %r1, ... } and the constraint list switches l to r. The B operand is always an SMEM descriptor. Descriptor offsets are expressed in 16-byte units, so the constructor shifts byte offsets right by four before packing. See SM70-120 MMA Atoms — SMEM-Descriptor Construction for the 64-bit descriptor bit layout, and the WGMMA Descriptor Round-Trip section below for a worked hex example.

SM100 / SM103

Datacenter Blackwell uses tensor memory and emits tcgen05.mma through the MachineInstr layer rather than llvm.inline_asm. The instruction is warp-group-uniform and operates on TMEM operands. The packed control word carries instruction family, CTA group, sparsity, block scale, scale-vector size, input family, collector mode, and optional scale-input-accumulator state. See tcgen05 Control-Word Bit Layout for the bit-layout of the control word and Verifier Rules for the verifier rules.

A dense tcgen05.mma for one tile emits:

tcgen05.mma.cta_group::1.kind::f16.f32.f16.f16
    [%r0],                      // TMEM destination (D)
    [%r1],                      // TMEM source (A)
    %rd2,                       // SMEM descriptor (B)
    %r3;                        // packed control word

The control-word operand encodes scale-vector size, MMA kind, scale-input-accumulator, and block-scale bits. A sparse variant adds a metadata operand:

tcgen05.mma.sp.cta_group::1.kind::f16.f32.f16.f16
    [%r0], [%r1], %rd2, [%r3], %r4;

A block-scaled variant adds two TMEM scale operands and a scale-vec modifier:

tcgen05.mma.cta_group::1.kind::mxf8f6f4.scale_vec::1X.f32.e4m3.e4m3
    [%r0], [%r1], %rd2,
    [%r3],                      // SFA scale (TMEM)
    [%r4],                      // SFB scale (TMEM)
    %r5;

The weight-stationary variant prefixes the mnemonic with .ws and rejects two-CTA grouping. The two-CTA variant is encoded as cta_group::2 in the modifier. The arch-conditional variants (sm_100a, sm_100f) accept different subsets of the kind tag and scale-vec width than the base variant.

SM103 follows the same structural path with a different accepted target tuple. Drive the algorithm with subtarget feature predicates, not a separate forked emitter.

SM120 / SM121

Consumer Blackwell removes tensor memory and therefore drops tcgen05.mma entirely. Its block-scaled matmul surface is warp-synchronous mma.sync.aligned.*.block_scale. The public operation has nine attributes: a_type, b_type, byte_id_a, byte_id_b, sf_type, shape_MNK, thread_id_a, thread_id_b, vec_size.

The verifier accepts exactly three shape/vector families:

Kvec_sizeKindA/B typesScale-factor type
3232MXFP8e4m3, e5m2, e3m2, e2m3, e2m1E8M0
6416MXFP4e2m1E8M0 or E4M3
6432NVFP4e2m1E8M0

Dense and sparse forms share one set of operand families: A fragment, B fragment, C accumulator, D output, SFA scale fragment, SFB scale fragment. Sparse forms add ordered metadata. SFA and SFB are warp-register fragments, unlike SM100 where the scale operands live in tensor memory.

A dense m16n8k32 MXFP8 block-scale tile emits:

mma.sync.aligned.m16n8k32.row.col.kind::mxf8f6f4.scale_vec::1X.block_scale.f32.e4m3.e4m3.f32
    {%fd0, %fd1, %fd2, %fd3},
    {%r0, %r1, %r2, %r3},
    {%r4, %r5},
    {%fd4, %fd5, %fd6, %fd7},
    %r6,                        // SFA scale fragment (register)
    %r7;                        // SFB scale fragment (register)

The NVFP4 m16n8k64.scale_vec::4X form emits the same operand layout with e2m1 type suffixes and a kind::mxf4nvf4 tag. The MXFP4 m16n8k64.scale_vec::2X form pairs kind::mxf4 with E8M0 or E4M3 scale-factor type.

Sparse variants prepend .sp::ordered_metadata and add a metadata register slot:

mma.sp::ordered_metadata.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.scale_vec::4X.block_scale.f32.e2m1.e2m1.f32
    {%fd0, ..., %fd3},
    {%r0, %r1},
    {%r2},
    {%fd4, ..., %fd7},
    %r3,                        // metadata
    %r4, %r5;                   // SFA, SFB

The MMA verifier rejects shape/vector/type combinations outside the three accepted families. Compression from the SM100 tcgen05 lattice to the SM120 surface is intentional: no CTA group, no collector mode, no A-shift, no weight-stationary mode, no scale-input accumulator, no tensor-memory destination, no write-disable modifier. Only shape, element family, scale-factor family, scale-vector width, and the sparse/dense choice remain.

WGMMA Descriptor Round-Trip

The SM90 inline-asm template above threads an SMEM descriptor through the l constraint slot for operand B (and for operand A when the atom is fully SMEM-resident). The descriptor is a 64-bit packed word built by the per-atom constructor from the abstract tile shape and swizzle mode. A worked example shows the round-trip from logical fields to the hex value that lands in the inline-asm input.

Consider a representative atom: m64n128k16.f32.f16.f16 with swizzle=128B, lbo=2048, sbo=0, and a starting SMEM byte offset chosen so (smem_off >> 4) = 0x1000. The constructor packs the fields according to the WGMMA descriptor layout:

FieldBitsWidthLogical valueEncoded
start_addr0-1314smem_off >> 4 = 0x1000 (low 14 bits)0x1000
lbo14-29162048 (0x800)0x800
sbo30-451600x0
base_offset46-48300
reserved49-5130 (mandatory)0
swizzle_mode52-532128B1
pad54-6310unused0

The composition is straightforward:

uint64_t raw = 0;
raw |= ((uint64_t)0x1000) <<  0;   // start_addr at bits  0-13
raw |= ((uint64_t)0x0800) << 14;   // lbo         at bits 14-29
raw |= ((uint64_t)0x0000) << 30;   // sbo         at bits 30-45
raw |= ((uint64_t)0x0000) << 46;   // base_offset at bits 46-48
raw |= ((uint64_t)0x0001) << 52;   // swizzle 128B at bits 52-53

The resulting WgmmaDescriptor.raw value is 0x0010_0000_0200_1000. Decomposed back: bits 0-13 hold 0x1000, bits 14-29 hold 0x800 (the four bytes 0x02000 overlap into the lbo window because the field starts at bit 14), bits 52-53 hold the 128B swizzle code, and every reserved bit is clear. A round-trip through decode_descriptor(0x00100000_02001000) produces the exact original logical-field set.

The constructor passes this hex value into the inline-asm fragment as an i64 input bound to the l constraint slot. The PTX template that consumes it is the WGMMA form documented in SM90; the runtime register-allocator sees the constant as a 64-bit GPR (%rd1 in the example) and the WGMMA hardware decodes it back into the canonical Hopper SMEM descriptor on each wgmma.mma_async.sync.aligned issue. A mismatch between the constructor's swizzle mode and the verifier's swizzle mode produces silently wrong results, which is why the constructor and verifier must read the same swizzle table — see the cross-reference paragraph at the end of SMEM-Descriptor Construction.

Cross-References

PTX Version and Target Selection — Architecture-Conditional Instructions documents the upstream subtarget gating that decides which of these templates is reachable for a given .target line. Plain sm_NN rows admit only the SM70-SM80-SM89 surfaces; sm_NNa and sm_NNf rows unlock WGMMA, tcgen05, and block_scale MMA per the suffix grid in that page. AsmPrinter — MC Switch Shape Population Table documents the dispatcher and AsmWriter String Pools and the XOR-3 Walking Cipher covers the mnemonic pool that finally prints the template strings shown above. tcgen05 Control-Word Bit Layout covers the SM100/SM103 control word and tcgen05 mbarrier Emission plus Cluster Sync Emission cover the mbarrier/cluster wiring around tcgen05.mma. ISelDAG and MatcherTable — Selector Layers shows where the selector chooses between inline-asm and MachineInstr paths for each SM tier. The MMA atom registry in SM70-120 MMA Atoms is the dialect-level entry point that feeds shape and operand types into these templates.