Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM tcgen05 Ops

Abstract

nvvm.tcgen05.* covers the Blackwell (sm_100+) tensor-memory family. Tensor memory (TMEM) is a per-SM scratchpad allocated and freed through the dialect's alloc / dealloc ops, accessed through ld / st and the long-K MMA path, and torn down before the kernel exits. The roster below is the only path to TMEM from MLIR; Hopper's WGMMA family (nvvm.wgmma.*) does not reach Blackwell tensor cores. See tcgen05 Tensor Memory Model for the TMEM allocation discipline and the variant taxonomy, and tcgen05 Machine Validation for the codegen-side verifier rules.

tcgen05.mma carries a control-word modifier table that selects element-type interpretation, sparsity, block-scaling, and collector behaviour. Block-scaled UMMA exposes scale-vector size and scale-format enums; the cross-product produces several thousand legal PTX forms from a single dialect op.

Op Roster

The "Properties slots used" column tracks where each op stores its attribute payload in the inline Properties record; see Properties Blob — Per-op-family slot maps for the exact byte offsets.

OpRoleProperties slots used
nvvm.tcgen05.alloc / .sharedrequest a TMEM rangecta_group
nvvm.tcgen05.deallocrelease a TMEM rangecta_group
nvvm.tcgen05.relinquish_alloc_permitdrop the alloc-permit tokencta_group
nvvm.tcgen05.ldload from TMEM to registers(none — operand-typed)
nvvm.tcgen05.ststore from registers to TMEM(none — operand-typed)
nvvm.tcgen05.cpcopy TMEM tile across CTAsmulticast, shape, src_fmt
nvvm.tcgen05.mmaMMA into TMEM accumulatortypeA/cType, collectorA, scale_d, layout-bits
nvvm.tcgen05.mma.spsparse-input variant of abovesame + sparse metadata operand
nvvm.tcgen05.mma.block_scaleblock-scaled variantcType, collectorA, scale_d, layout, kindA, kindB
nvvm.tcgen05.mma.sp.block_scalesparse + block-scaledmerge of sparse and block-scaled fields
nvvm.tcgen05.mma.wsweight-stationary variantoperand-only
nvvm.tcgen05.commit / .commit.arriveclose a group; optionally signal a barriercta_group
nvvm.tcgen05.waitwait on load or store groupwait_kind
nvvm.tcgen05.shiftshift register fragment across TMEMoperand-only
nvvm.tcgen05.fenceproducer / consumer fencetcgen05_fence (before / after)

Operand Tables

nvvm.tcgen05.alloc[.shared]

PositionNameTypeNotes
operand 0dstptr addrspace(3) (or generic)output slot for the allocated TMEM base
operand 1ni32column count to allocate (must be a multiple of 32)
attributecta_groupenum tcgen05_groupcta_1 or cta_2 for 1-CTA or 2-CTA cooperative allocation

nvvm.tcgen05.dealloc / .relinquish_alloc_permit

PositionNameTypeNotes
operand 0tmem_basei32TMEM column index returned by alloc
operand 1 (dealloc)ni32column count being released
attributecta_groupenum tcgen05_groupmatches the alloc's cta_group

nvvm.tcgen05.ld / nvvm.tcgen05.st

PositionNameTypeNotes
operand 0tmem_addri32TMEM column address
operand 1 (st)frag!llvm.struct<(i32, ...)>register fragment to store
result 0 (ld)frag!llvm.struct<(i32, ...)>register fragment loaded
attribute (encoded into mnemonic)shapem32n8 / m32n16 / m32n32 / ...tile shape that fixes the fragment width
attribute (encoded into mnemonic)numx1 / x2 / x4 / ...replication factor
attribute (encoded into mnemonic)packpack / unpackper-thread packing mode

nvvm.tcgen05.cp

PositionNameTypeNotes
operand 0tmem_dsti32destination TMEM column
operand 1tmem_srci32source TMEM column
attributeshapeenum tcgen05_cp_shapetile shape selector
attributemulticastenum tcgen05_cp_multicastnone / warp_x2 / warp_x4
attributesrc_fmtenum tcgen05_cp_src_fmtsource element format

nvvm.tcgen05.mma (dense)

PositionNameTypeNotes
operand 0tmem_di32TMEM accumulator column
operand 1desc_ai64SMEM descriptor for A, or TMEM column for a_in_tmem form
operand 2desc_bi64SMEM descriptor for B
operand 3scalei32accumulator-update scale (compile-time 0 or 1)
attributekindenum tcgen05_mma_kindf8f6f4 / mxf4 / mxf4nvf4 / f16 / tf32 / i8
attributecta_groupenumcta_1 / cta_2
attributecollectorAenum tcgen05_mma_collectoropdiscard / fill / use / last_use
attributescale_denumcontrols how scale selects between init and accumulate
attributelayoutenum TmemLayoutTMEM tile layout

nvvm.tcgen05.mma.sp (sparse)

Adds one operand:

PositionNameTypeNotes
operand 4sparse_metadatai32TMEM column holding the sparse selectors

nvvm.tcgen05.mma.block_scale

PositionNameTypeNotes
operand 0..3same as mmasame accumulator + descriptors + scale
operand 4scale_a_veci32TMEM column for A scale vector
operand 5scale_b_veci32TMEM column for B scale vector
attributekindA / kindBenum block_scale_formatE8M0 / E4M3FN
attributescale_vec_sizeenum scale_vec_size16 or 32

The (atom_K, vecSize) triples accepted by the verifier are documented on the cute_nvgpu MMA atoms page (SM100 UMMA block-scaled).

nvvm.tcgen05.commit / .commit.arrive

PositionNameTypeNotes
operand 0 (commit.arrive)barrierptr addrspace(3)mbarrier slot to signal
attributecta_groupenummatches the in-flight MMA's cta_group

nvvm.tcgen05.wait

PositionNameTypeNotes
attributewait_kindenum tcgen05_waitload (drain TMEM loads) or store (drain TMEM stores)

nvvm.tcgen05.fence

PositionNameTypeNotes
attributetcgen05_fenceenumbefore (producer) or after (consumer)

Control-Word Modifier Table

The PTX form tcgen05.mma.sync.aligned.{kind}.cta_group::{1,2}.{layout}.{collector} packs several modifiers into the mnemonic. See tcgen05 Tensor Memory Model — Control Word Layout for the bit-level encoding and tcgen05 Machine Validation — Control-Word Bit Layout for the codegen-side checks. The table below pairs each modifier with its NVVM attribute and the legal value range.

PTX modifierNVVM attributeValues
{kind}kindf8f6f4 / mxf4 / mxf4nvf4 / f16 / tf32 / i8
cta_group::{1,2}cta_groupcta_1 (single-CTA) / cta_2 (cluster-coop 2-CTA)
{layout}layoutmn (row-major) / kn (canonical K-major)
{collector}collectorAdiscard / fill / use / last_use
.sp(op mnemonic carries .sp)sparse (.sp) vs dense
.block_scale(op mnemonic carries .block_scale)block-scaled vs unscaled
.scale::vec::{16,32}scale_vec_size16 / 32
.{sfA}.{sfB}kindA / kindBscale-factor element format

The collector modifier controls how the MMA pipeline reuses register-file data across iterations: discard evicts on commit, fill accumulates without evicting, use consumes a previously-filled buffer, last_use consumes and then evicts.

LLVM Intrinsic Mapping

OpLLVM intrinsic
nvvm.tcgen05.alloc (addrspace=3, shared SMEM dest)llvm.nvvm.tcgen05.alloc.cta_group.{1,2}.shared
nvvm.tcgen05.alloc (addrspace=0/1, generic/global dest)llvm.nvvm.tcgen05.alloc.cta_group.{1,2}
nvvm.tcgen05.deallocllvm.nvvm.tcgen05.dealloc.cta_group.{1,2}
nvvm.tcgen05.ldllvm.nvvm.tcgen05.ld.{shape}.{num}
nvvm.tcgen05.stllvm.nvvm.tcgen05.st.{shape}.{num}
nvvm.tcgen05.mmallvm.nvvm.tcgen05.mma.{kind}.cta_group.{1,2}.{collector}
nvvm.tcgen05.mma.spllvm.nvvm.tcgen05.mma.sp.{kind}.cta_group.{1,2}.{collector}
nvvm.tcgen05.mma.block_scalellvm.nvvm.tcgen05.mma.block_scale.{kind}.{scale_vec}.cta_group.{1,2}.{collector}
nvvm.tcgen05.cpllvm.nvvm.tcgen05.cp.{shape}.{multicast}.{src_fmt}
nvvm.tcgen05.commitllvm.nvvm.tcgen05.commit.cta_group.{1,2}
nvvm.tcgen05.commit.arrivellvm.nvvm.tcgen05.commit.arrive.cta_group.{1,2}
nvvm.tcgen05.waitllvm.nvvm.tcgen05.wait.{load,store}
nvvm.tcgen05.fencellvm.nvvm.tcgen05.fence.{before,after}.thread

PTX Templates

tcgen05.alloc.cta_group::{1,2}.shared::cta.b32 [%tmem], %n;
tcgen05.dealloc.cta_group::{1,2}.b32 [%tmem], %n;
tcgen05.relinquish_alloc_permit.cta_group::{1,2};

tcgen05.ld.sync.aligned.{shape}.{num}.b32 {%r0, %r1, ...}, [%tmem];
tcgen05.st.sync.aligned.{shape}.{num}.b32 [%tmem], {%r0, %r1, ...};

tcgen05.mma.sync.aligned.{kind}.cta_group::{1,2}.{layout}.{collector}
    [%tmem_d], %desc_a, %desc_b, %scale;

tcgen05.mma.sp.sync.aligned.{kind}.cta_group::{1,2}.{layout}.{collector}
    [%tmem_d], %desc_a, %desc_b, [%sparse_meta], %scale;

tcgen05.mma.block_scale.sync.aligned.{kind}.scale::vec::{16,32}.cta_group::{1,2}.{layout}.{collector}.{sfA}.{sfB}
    [%tmem_d], %desc_a, %desc_b, [%sf_a], [%sf_b], %scale;

tcgen05.cp.{shape}.{multicast}.{src_fmt} [%tmem_dst], [%tmem_src];
tcgen05.commit.cta_group::{1,2};
tcgen05.commit.arrive.cta_group::{1,2}.b64 [%mbar];
tcgen05.wait::{load,store}.sync.aligned;
tcgen05.fence::{before,after}.thread;

The descriptor operands %desc_a and %desc_b are 64-bit SMEM descriptors when the operand is SMEM-resident, or TMEM column indices when the operand is TMEM-resident.

Inline-PTX Variants

nvvm.tcgen05.cp reaches PTX through llvm.inline_asm when the multicast / src_fmt combination has no matching LLVM intrinsic at the snapshot revision Tileiras tracks:

asm template: "tcgen05.cp.{shape}.{multicast}.{src_fmt} [%dst], [%src];"
constraints : "r,r"

The two r slots are the destination and source TMEM column indices. The shape, multicast, and src_fmt tokens are baked into the template literal at lowering time; the constraint string never changes.

Per-Arch Availability

Op familySM floorptx_min
alloc / dealloc / relinquish_alloc_permitsm_100a8.6
ld / stsm_100a8.6
cpsm_100a (+ sm_100f for the f-suffixed variants)8.6
mma / mma.spsm_100a8.6
mma.block_scale / mma.sp.block_scalesm_100a8.6
commit / commit.arrive / wait / fencesm_100a8.6

sm_100a is the architecture-qualified Blackwell target; the family is also legal on sm_100f for the few f-suffixed copy variants. Datacenter Blackwell (sm_100) is the only sub-arch the dialect exposes; Blackwell Ultra (sm_103) and Jetson Thor (sm_110) reuse the same op surface. See Per-SM Emission Templates — SM100 / SM103 for the codegen-side templates and NVPTX Subtarget Feature Matrix for the feature gating.

Verifier Invariants

  • TMEM column counts are multiples of 32.
  • cta_group agrees between matched alloc / dealloc and between the in-flight MMA and its commit / wait.
  • scale is a compile-time immediate.
  • Block-scaled (atom_K, vecSize) matches one of (32, 32), (64, 16), (64, 32); other combinations are rejected by the per-combo expectation diagnostics listed under nv_tileas Verifiers — Block-Scaled MMA Verification (e.g. "expects A/B element types to be Float4E2M1FNType and sfa/sfb element types to be Float8E8M0FNUType when (atom_K=64 && vecSize=32)").
  • Sparse metadata column must be valid TMEM and non-zero stride.
  • Accumulator element type is f32 for every block-scaled variant.
  • kindA and kindB agree (no mixed scale-factor formats).