Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SM Tier Roster and Copy Atom Registry

Abstract

cute_nvgpu registers MMA, copy, prefetch, TMA, tensor-memory, and descriptor atom types per SM tier, then exposes them through common atom interfaces. The rest of the compiler asks uniform questions through those interfaces: what shape does this atom operate on, what element types are legal, where do operands live, what resources does the target need? The registry below is the product contract — every atom mnemonic, the interfaces it implements, the SM tier that registers it, and the residencies its operands accept.

Registry Model

The dialect uses interface-driven atom records:

InterfaceImplemented byPurpose
MMA atomUniversal FMA, SM80, SM89, SM90, SM100/SM103, SM120/SM121 MMA familiesReports MMA shape, operand element types, accumulator type, atom class, and verifier hooks.
Copy atomTMEM load/store, S2T copy, universal copy, async copy, LDSM/STSM, TMA atomsReports copy shape, value type, memory spaces, vector width, and legality.
Prefetch atomTMA load, store, reduce, and non-executing tiled TMA atomsReports descriptor, prefetch tile, and cache-hint behavior.
Descriptor typeSMEM descriptor views and TMA descriptorsCarries hardware descriptor state as typed IR.

The design point that matters: generic cute code dispatches through interfaces, not through string comparisons on target names.

LogicalResult verify_atom_instance(Atom atom, Target target, Shape use_shape) {
    if (MmaAtomInterface mma = dyn_cast_mma_atom(atom.type)) {
        return mma.verify_instance(atom, target, use_shape);
    }

    if (CopyAtomInterface copy = dyn_cast_copy_atom(atom.type)) {
        return copy.verify_instance(atom, target, use_shape);
    }

    if (PrefetchAtomInterface prefetch = dyn_cast_prefetch_atom(atom.type)) {
        return prefetch.verify_prefetch(atom, target, use_shape);
    }

    return failure("atom type does not implement a known cute_nvgpu interface");
}

Atom Surface by Tier

TierMMA atomsCopy and descriptor atomsNotes
All tiersatom.universal_fmaatom.universal_copyGeneric fallback atom vocabulary.
SM75+No dedicated MMA mnemonicatom.ldsmTuring introduces ldmatrix-style shared-memory matrix loads.
SM80sm80.mma, sm80.sparse_mmaatom.simt_async_copy, atom.ldsmAmpere dense and sparse mma.sync, plus cp.async-style copy atoms.
SM89sm89.mmaSM80 copy atomsAda extends the dense register-MMA surface with FP8 inputs.
SM90sm90.mma, smem_desc_viewatom.tma_load, atom.tma_store, atom.tma_reduce, atom.stsm, non-exec TMA atomsHopper WGMMA, SMEM descriptors, and TMA descriptor traffic.
SM100/SM103sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_spatom.tmem_load, atom.tmem_store, atom.s2t_copy, TMA atomsDatacenter Blackwell UMMA, sparse UMMA, block-scaled MMA, sparse block-scaled MMA, and tensor memory.
SM110 (Jetson Thor)inherited universal atomsSM110 is registered as a target tier (sm_110 / sm_110a / sm_110f) but has no dedicated MMA mnemonic; see note below.
SM120/SM121SM120.mma_bsRegister-based copy and scale-factor pathsConsumer Blackwell block-scaled MMA with uppercase SM120 spelling.

The uppercase spelling in SM120.mma_bs is part of the textual contract. A parser that lowercases it cannot round-trip IR for this dialect.

QUIRK — SM110 (Jetson Thor) registers a target tier but exposes no dedicated MMA surface (HIGH) The compiler's SM-target roster enumerates sm_110, sm_110a, and sm_110f alongside the other Blackwell tiers, and lowering will accept those targets as legal architecture flags. The dialect does NOT register a sm110.mma atom mnemonic — every MMA mnemonic in the registry is sm80.mma, sm89.mma, sm90.mma, sm100.mma{,_sp,_bs,_bs_sp}, or SM120.mma_bs. Kernels targeting SM110 fall through to the universal-FMA atom or to whichever earlier-tier MMA atom the architecture-conditional gate accepts. No WGMMA, no tcgen05.mma, no block-scaled register MMA is dialect-side dispatched for SM110 in this compiler. A reimplementation that expects SM110 to carry its own MMA / TMEM / WGMMA atom family will find none here, and a kernel that wants tensor-core throughput on Thor must either route through a non-MMA path or accept that the dispatcher will not synthesise a Thor-specific machine form.

Atom TypeID Registry

The dialect registers one MLIR TypeID per atom kind. Generic cute code never sees the per-atom C++ class; it sees a typed value whose TypeID resolves to an interface vtable, and that vtable carries the verifier, the asm printer, the bytecode round-trip, and the per-atom legality predicates. The registry below lists the contract per atom: which interfaces it implements, which residencies its operands accept, and which SM tier first registers it.

Atom mnemonicMin tierInterfaces implementedResidency contract
atom.universal_fmaallMmaAtomA, B, D in rmem
atom.universal_copyallCopyAtomany source-destination pair the target supports
atom.ldsmSM75CopyAtomsrc=smem, dst=rmem, shape ∈ LDSM matrix
atom.stsmSM90CopyAtomsrc=rmem, dst=smem, shape ∈ STSM matrix
atom.simt_async_copySM80CopyAtom, AsyncCopyAtomcp.async gmem → smem
sm80.mmaSM80MmaAtomA, B, D in rmem; one element type per operand
sm80.sparse_mmaSM80MmaAtom, SparseMetadataAtomA, B, D in rmem; metadata operand alongside A
sm89.mmaSM89MmaAtomA, B, D in rmem; FP8 element types added
sm90.mma (WGMMA)SM90MmaAtom, SmemDescriptorAtomA in rmem or smem_desc_view; B in smem_desc_view; D in rmem
smem_desc_viewSM90DescriptorTypetyped view over an SMEM descriptor
atom.tma_loadSM90CopyAtom, TmaAtom, PrefetchAtom, AsyncCopyAtomdescriptor-driven gmem → smem
atom.tma_storeSM90CopyAtom, TmaAtom, AsyncCopyAtomdescriptor-driven smem → gmem
atom.tma_reduceSM90CopyAtom, TmaAtom, AsyncCopyAtomdescriptor-driven reduce into gmem
atom.non_exec_tiled_tma_*SM90TmaAtom (non-exec)partition-verified TMA atom waiting on mbarrier and cache binding
sm100.mma (UMMA)SM100MmaAtom, TmemAtomA in memref/smem_desc_view; B in smem_desc_view; D in memref (tmem)
sm100.mma_spSM100MmaAtom, TmemAtom, SparseMetadataAtomUMMA contract + structurally-sparse A and metadata operand
sm100.mma_bsSM100MmaAtom, TmemAtom, BlockScaleAtomUMMA contract + scale-factor operand
sm100.mma_bs_spSM100MmaAtom, TmemAtom, BlockScaleAtom, SparseMetadataAtomUMMA block-scale + sparsity
atom.tmem_loadSM100CopyAtomsrc=tmem, dst=rmem
atom.tmem_storeSM100CopyAtomsrc=rmem, dst=tmem
atom.s2t_copySM100CopyAtom, AsyncCopyAtomsrc=smem, dst=tmem
SM120.mma_bsSM120MmaAtom, BlockScaleAtomA, B, D in rmem; two scale-factor operands (one per A, B)

The Interfaces implemented column is the dispatch contract. A pass that walks every atom and asks "do you support prefetch?" calls dyn_cast<PrefetchAtomInterface> on each atom value; the SM90+ TMA load atom is the only positive hit, and the call collapses to a TypeID compare. The Residency contract column lists the legality bounds the per-atom verifier enforces; it is the same checklist a CUTLASS C++ user reads from Copy_Traits<> and MMA_Traits<> headers.

Copy Atom Operand-Layout Contracts

Every copy atom carries an operand-layout contract that the verifier checks before lowering. The contract pins source and destination residency, the per-thread fragment shape, the natural shape one atom invocation transfers, and the PTX (or NVVM intrinsic) instruction the lowering emits. The table below is the per-tier catalog; each row is one atom mnemonic.

SM70 and SM75 register copy atoms

AtomSourceDestinationNatural shapeElement widthPer-thread fragmentLowering target
atom.universal_copyanyany (target-supported)one elementanyone valuescalar ld/st of matching width
atom.ldsm<m8n8> (SM75)smemrmem8 x 8 matrix tile16 bits2 elements per laneldmatrix.sync.aligned.m8n8.x1.shared.b16
atom.ldsm<m8n8.x2> (SM75)smemrmem8 x 16 matrix tile16 bits4 elements per laneldmatrix.sync.aligned.m8n8.x2.shared.b16
atom.ldsm<m8n8.x4> (SM75)smemrmem8 x 32 matrix tile16 bits8 elements per laneldmatrix.sync.aligned.m8n8.x4.shared.b16
atom.ldsm<m8n8.x4.trans> (SM75)smemrmem8 x 32 transposed tile16 bits8 elements per laneldmatrix.sync.aligned.m8n8.x4.trans.shared.b16

The x1/x2/x4 suffix is the number of 8x8 sub-tiles the atom fetches in one instruction. The transposed variants swap the per-lane fragment layout so that two register-resident MMA operands meet at the same memory cell after the matrix multiply; the verifier checks the transpose flag against the consuming MMA atom's expected operand layout.

SM80 and SM86 async-copy and matrix-load atoms

AtomSourceDestinationNatural shapeElement widthPer-thread fragmentLowering target
atom.simt_async_copy<4>gmemsmem4-byte element32 bits1 i32 per lanecp.async.ca.shared.global (4 bytes)
atom.simt_async_copy<8>gmemsmem8-byte element64 bits1 i64 per lanecp.async.ca.shared.global (8 bytes)
atom.simt_async_copy<16>gmemsmem16-byte element128 bits1 i128-equivalent per lanecp.async.cg.shared.global (16 bytes; bypass L1)
atom.ldsm<m8n8.*>smemrmeminherited from SM7516 bitsinheritedldmatrix.sync.aligned.*

The 4/8/16 vector widths are the only legal cp.async granularities; any other width is rejected before lowering — the binary stores no dedicated diagnostic for the width-out-of-range case, so the failure surfaces through the standard '{0}' cannot vectorize copy to {1} elements (static strides must be 1) / '{0}' cannot vectorize copy to {1} elements (static strides must match) template that the copy-vectorization helper emits for any vectorisation failure. The 16-byte variant uses the cg cache-policy (bypass L1) because the L1 cache cannot satisfy a 128-bit single-instruction store; the 4- and 8-byte variants use ca (cache-all). Lowering chooses the cache policy from the atom's width alone — there is no per-op cache hint.

SM90 TMA atom family

TMA atoms are descriptor-driven; the per-lane fragment layout is implicit in the descriptor word rather than in the atom's MLIR operand types.

AtomSourceDestinationDescriptor kindNatural shapeLowering target
atom.tma_loadgmem (descriptor)smemTMA tile descriptorrank-1..rank-5 boxcp.async.bulk.tensor.NDIM.shared::cluster.global
atom.tma_load_multicastgmem (descriptor)smem (multi-CTA)TMA tile descriptor + CTA maskrank-1..rank-5 boxcp.async.bulk.tensor.NDIM.shared::cluster.global.multicast::cluster
atom.tma_load_im2colgmem (descriptor)smemTMA im2col descriptorrank-3..rank-5 spatialcp.async.bulk.tensor.NDIM.shared::cluster.global.im2col
atom.tma_storesmemgmem (descriptor)TMA tile descriptorrank-1..rank-5 boxcp.async.bulk.tensor.NDIM.global.shared
atom.tma_reduce<op>smemgmem (descriptor)TMA tile + reduce kindrank-1..rank-5 boxcp.reduce.async.bulk.tensor.NDIM.global.shared.OP
atom.stsm<m8n8.*>rmemsmemnone (register copy)8 x 8 matrix tile per sub-tilestmatrix.sync.aligned.m8n8.x[1,2,4].shared.b16

The TMA atoms accept rank-1 through rank-5 boxes; the descriptor word encodes the per-dimension extents, strides, and box edges (see the dedicated TMA atom page). The multicast variant adds a 16-bit CTA mask that names which CTAs in the cluster receive the loaded data, enabling one-to-many fanout from a single GMEM read. The im2col variant rewrites the descriptor's box coordinates through a convolution-style spatial reshape so a single load presents the data already in NCHW-to-window form for convolution kernels.

atom.stsm mirrors atom.ldsm from SM75 in reverse — rmem -> smem rather than smem -> rmem — and shares the same sub-tile multiplicity convention.

SM100 and SM103 TMEM copy atoms

Datacenter Blackwell adds tensor memory as a fourth memory class alongside register, shared, and global. The copy atom family covers every legal direction between TMEM and the other three classes.

AtomSourceDestinationNatural shapeElement widthLowering target
atom.tmem_loadtmemrmemone TMEM column tile per atom32/16/8 bitstcgen05.ld.sync.aligned.shape.b32
atom.tmem_storermemtmemone TMEM column tile per atom32/16/8 bitstcgen05.st.sync.aligned.shape.b32
atom.s2t_copysmemtmemTMA-box-shaped SMEM slice8/16/32/64 bitstcgen05.cp.shared::cta.async
atom.tmem_to_smem_copytmemsmemTMEM column tile32/16/8 bitstcgen05.cp.async.shared::cta (reverse direction)
atom.tcgen05.cptmemtmem (cross-CTA)column tile inside one cluster32 bitstcgen05.cp.async (cluster-scope)

TMEM is column-organised: an atom transfers one or more TMEM columns at a time. The verifier checks that the operand layout addresses TMEM columns in a contiguous range matching the natural shape, and that the column count matches the destination tile. SM100 splits the tcgen05 family into tcgen05.ld / tcgen05.st (register-mediated, synchronous-looking) and tcgen05.cp (cluster-scope async); the atom mnemonics make that distinction explicit.

A TMEM-resident MMA accumulator does not move out of TMEM until a atom.tmem_load retires its column range into registers. Lowering must keep that retire op alive across any consumer that reads the accumulator from registers — eliding it produces undefined values.

MMA Records

MMA records carry:

  • architecture tier;
  • operand A, B, and accumulator element types;
  • tile shape, usually expressed as (M, N, K);
  • operand residency, such as register memory, shared-memory descriptor, or tensor memory;
  • sparse or block-scaled metadata, when present;
  • verifier and lowering hooks.
typedef struct {
    SmTier min_tier;
    Shape mnk;
    ElementType a_type;
    ElementType b_type;
    ElementType c_type;
    Residency a_residency;
    Residency b_residency;
    Residency d_residency;
    bool supports_sparse;
    bool supports_block_scale;
} MmaAtomContract;

Copy Records

Copy atoms carry copy width, source and destination residency, optional async behaviour, and any descriptor or prefetch behaviour. TMA atoms add a descriptor flavour and a prefetch interface on top.

typedef struct {
    SmTier min_tier;
    Residency source;
    Residency destination;
    int value_bits;
    bool is_async;
    bool uses_tma_descriptor;
    bool supports_prefetch;
} CopyAtomContract;

Per-Tier Semantics

SM70 and SM75

Volta and Turing mostly use generic atoms. SM75 introduces the shared-memory matrix-load family, where ldsm becomes tier-gated. Older MMA forms route through universal or backend intrinsic paths — there is no dedicated cute_nvgpu.sm70.mma spelling.

SM80

Ampere is the first full register-MMA tier. sm80.mma covers dense mma.sync forms; sm80.sparse_mma covers the structured-sparse forms with metadata operands. simt_async_copy models Ampere asynchronous copies. The verifier's anchors here are register-resident MMA operands, supported integer and floating input types, valid sparse metadata, and legal copy vector widths.

SM89

Ada keeps the SM80 register-resident model but adds FP8 input combinations. Sparse FP8 is not part of this tier's atom surface.

SM90

Hopper introduces WGMMA and TMA. sm90.mma accepts shared-memory descriptor operands; B is always descriptor-backed, A is either register- or descriptor-backed depending on mode. TMA load/store/reduce atoms are descriptor-driven and often start as non-executing tiled atoms, then bind to mbarrier and cache state to form executable atoms.

SM100 and SM103

Datacenter Blackwell introduces UMMA and tensor memory. sm100.mma is the plain tensor-memory MMA family; sm100.mma_bs and sm100.mma_bs_sp carry block-scale and sparse block-scale metadata. TMEM load/store and shared-to-tmem copy atoms move values between register, shared, global, and tensor-memory domains. SM103 reuses the same dialect surface — the distinction is a target flag, not a new atom family.

SM120 and SM121

Consumer Blackwell block-scaled MMA has no TMEM dependency. It carries two scale-factor operands — one for A, one for B — and keeps the accumulator in register memory. SM121 shares the same surface.

MMA Atom Verifier Diagnostics

Every MMA atom registers one verifier through the dialect. The verifier emits verbatim diagnostics so test suites can match by string. The strings below are the user-visible contract.

Layout-shape verifier (all UMMA / SM90 / SM100 variants)

  • "expects Mma atom layout of " (binary string, with the canonical reference layout streamed in after the trailing space) — strict equality between the op's declared per-operand layout and the canonical layout the atom's traits table reconstructs.
  • "expects static and no scaled basis layout for" (printed layout follows) — the stride basis must be static integer; scaled-basis layouts are rejected because the descriptor packer cannot encode them.

Element-type ladder (UMMA family)

  • "expects operand a with element type {0}, but get {1}."
  • "expects operand b with element type {0}, but get {1}."
  • "expects operand c with element type {0}, but get {1}." (the verifier uses c for both C and D operand slots)

The element-type check happens before the residency check. The {0} slot prints the expected element type from the atom's traits; {1} prints the actual operand element type.

Residency ladder (UMMA family)

  • "expects memref/smem_desc_view for operand A, but gets A:{0}."
  • "expects smem_desc_view for operand B, but gets B: {0}."
  • "expects memref for operand D, but gets D: {0}."

UMMA B is always SMEM-descriptor; A is either an RMEM memref or an SMEM-descriptor (note: per verify_sm100_umma an RMEM A is rejected — A must be either an SMEM descriptor or a TMEM memref — see MMA Atoms SM70-120 — SM100 and SM103 UMMA); D is a memref in the tensor-memory address space (result.residency == TENSOR_MEMORY) — the verifier diagnostic spells the SSA type "memref", but the residency contract pins the accumulator to TMEM, not register memory. The verifier emits the first mismatched operand and stops.

Layout composability (UMMA family)

  • "invalid layout of A/B/D. A: {0}, B: {1}, D:{2}" — emitted when one of the three per-operand canonical-layout checks fails before the composability step runs.
  • "layoutA, layoutB and layoutD fail to form a gemm. A: {0}, B: {1}, D: {2}" — emitted when the three layouts pass individually but their composition does not encode a valid (M, N, K) triple.

Non-UMMA shared verifier (SM70 / 75 / 80 / 89)

  • "expects all mma operands to have element type"
  • "expects rmem for input operands, but got A: " (binary string; printed A/B/D operand types follow)
  • "expects operand a with element type " (followed by expected and actual types)

The non-UMMA path enforces the simpler rule that A, B, and D all share one element type and all live in register memory. This is the only path SM70-89 use.

Composed-layout rejection (tiled-copy / tiled-mma builders)

  • "doesn't support composed layout for "
  • "A/B/D, but got: A: {0}, B: {1}, D:{2}"
  • "expects A, B to have the same rank, but got A: "
  • "expects C, D to have the same rank, but got C: "
  • "expects C to have rank 1 or 2, but got C: "
  • "expects C to have rank 2 or 3, but got C: "
  • "expects C to have rank 3, but got C: "

Registry Invariants

  • Atom names encode the minimum architecture tier or intentionally remain tier-generic.
  • Generic tiling code dispatches through interfaces, not mnemonic switches.
  • Sparse and block-scaled atoms expose their metadata through typed operands or attributes.
  • TMA atoms that prefetch descriptors implement the prefetch interface.
  • Descriptor view types remain explicit until the backend has emitted the corresponding WGMMA, TMA, or TCGEN instruction sequence.
  • Target verification rejects atoms whose tier exceeds the selected target.

Cross-References

Mode Pattern Verifiers — LDSM and STSM Matrix documents the LDSM/STSM, UMMA Canonical Layout Verifier, tcgen05.mma Kind-Word Verifier, and SM120 Block-Scaled Lattice verifiers each atom registers. TMA Atoms — Atom Family covers the descriptor-driven TMA family in depth. MMA Atoms SM70-120 — Per-Arch MMA Shape Lattice covers the per-tier MMA shape lattice. MMA Atoms SM70-SM120 — Operand Contract by Tier cross-references the consumer side of every copy atom in the table above. Layout Algebra and Descriptor Grammar — Swizzle Operator covers the bit-manipulation formula the SMEM-resident atoms (atom.ldsm, atom.stsm, TMA descriptors, atom.s2t_copy) rely on for bank-conflict-free placement.