Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

cute_nvgpu Dialect Overview

Provenance vs Upstream MLIR

cute_nvgpu is NVIDIA-introduced and has no upstream MLIR counterpart. Upstream MLIR exposes NVIDIA hardware operations only through nvgpu (a thin bridge dialect) and nvvm (typed intrinsics). Neither models the SM-tier-qualified atom catalogue — WGMMA, UMMA, TMA, TMEM lifecycle, ldmatrix/stmatrix, block-scaled MMA forms — that tileiras needs to keep around between cute layout algebra and nvgpu lowering. Without this dialect the layout-to-intrinsic step would have to collapse atom selection, SM-tier verification, and intrinsic emission into one rewrite; the dialect splits those concerns so the SM gate can run before NVVM conversion. See nvgpu for the upstream-linked bridge below this layer.

Abstract

cute_nvgpu is the NVIDIA architectural atom dialect sitting on top of cute. It hosts MMA atoms from SM70 through SM120 (WGMMA and UMMA included), TMA descriptor and transfer atoms, TMEM lifecycle operations, LDSM/STSM matrix-load atoms, and SMEM descriptor views. Every operation passes through an explicit SM-tier verifier, so an invalid (shape, element type, target) triple is rejected before NVVM emission. This page is the dialect-level map; per-family detail lives in the linked sub-pages.

Where cute describes target-neutral layout algebra, cute_nvgpu binds that algebra to real NVIDIA operations — MMA, WGMMA, TMA, TMEM allocation, ldmatrix, stmatrix, async bulk copies, SM-specific copy atoms. It is the seam where a layout stops being merely algebraic and starts requesting a specific GPU instruction family.

The dialect is organised by architecture tier. Older tiers describe classic tensor-core MMA and copy atoms. Hopper-era tiers add WGMMA and TMA descriptor movement. Blackwell-era tiers add tensor-memory lifecycle and block-scaled MMA forms. Tier names live in the operation spellings so verifiers and lowerings can reject invalid shape, element-type, or target combinations before NVVM conversion.

Position in the Cascade

cute
    |
    | select target-specific copy, MMA, TMA, and tensor-memory atoms
    v
cute_nvgpu
    |
    | normalize architecture atoms
    v
nvgpu
    |
    | emit NVVM intrinsics
    v
PTX

cute_nvgpu preserves the high-level atom boundary. It sits below pure layout algebra and above raw NVVM intrinsics — the natural place to enforce SM-tier constraints and descriptor compatibility while keeping the final intrinsic selection simple.

Architecture Tiers

TierMain operationsMeaning
SM70universal FMA and copy fallbacksBaseline tensor-core-era atom vocabulary.
SM80sm80.mma, sparse MMAAmpere MMA and structured sparsity forms.
SM89FP8-oriented MMA variantsAda-generation element-type extensions.
SM90WGMMA, TMA descriptor viewsHopper warpgroup MMA and tensor-memory async movement.
SM100UMMA, TMEM lifecycle, block-scaled MMABlackwell datacenter tensor-memory and tcgen-style operations.
SM120consumer Blackwell block-scaled formsConsumer Blackwell microscaling and per-lane scale metadata.

Tier spelling is part of the IR contract. Lowering must not silently reinterpret an operation as a different tier just because another instruction shape looks similar. If a target does not support the tier named by the operation, verification fails before NVVM lowering.

Atom Families

The major atom families are:

  • MMA atoms, including dense, sparse, FP8, WGMMA, UMMA, and block-scaled forms.
  • TMA atoms for tensor-memory load, store, gather, scatter, prefetch, and descriptor use.
  • Copy atoms for register, shared-memory, global-memory, and tiled partition movement.
  • TMEM lifecycle operations for allocation, deallocation, permit transfer, and pointer retrieval.
  • Descriptor view operations that connect cute layouts to hardware descriptor operands.
  • Kernel-marker lowering that turns a cute kernel marker into the entry-point marker expected by NVVM.

Each family consumes cute layout values and emits lower-level operations whose shapes, element types, and memory spaces are visible to the target.

Kernel Lowering

The kernel boundary stays deliberately simple. A function marked as a cute kernel becomes an NVVM kernel entry, and every architecture atom in the body lowers or normalises toward nvgpu and nvvm.

void lower_cute_kernel_to_nvvm(Function func, Target target) {
    if (has_attr(func, "cute.kernel")) {
        remove_attr(func, "cute.kernel");
        set_attr(func, "nvvm.kernel");
    }

    for (Operation *op : func.walk()) {
        if (is_cute_nvgpu_mma(op)) {
            require(target_supports_mma_tier(target, op));
            lower_mma_atom(op, target);
        } else if (is_cute_nvgpu_tma(op)) {
            require(target_supports_tma(target, op));
            lower_tma_atom(op, target);
        } else if (is_cute_nvgpu_tmem(op)) {
            require(target_supports_tmem(target, op));
            lower_tmem_lifecycle_op(op, target);
        } else if (is_cute_layout_carrier(op)) {
            rewrite_descriptor_or_view(op, target);
        }
    }
}

The rewrite preserves the semantic shape of the atom. A WGMMA atom lowers through a warpgroup MMA op, not a scalarized loop that happens to compute the same value. A TMA atom lowers through descriptor construction and async tensor-memory ops, not through ordinary elementwise loads — unless an explicit fallback path exists.

Verifier Invariants

A correct verifier should reject invalid target combinations early:

  • the selected target supports the SM tier named by the operation,
  • MMA tile shapes are supported by that tier,
  • operand element types match the tier and the chosen MMA mode,
  • sparse MMA forms include valid metadata and selector attributes,
  • block-scaled MMA forms include valid scale-vector layout and per-lane scale ids,
  • TMA descriptor operands agree with the source or destination layout,
  • tcgen05.mma kind words clear the 13-rule mutual-exclusion ladder before opcode selection,
  • TMA partition ops clear the 11-step ladder (type, layout-kind, integer-stride, swizzle, static, shape-equiv, G-basis, layout, tensor-type, multicast),
  • tensor-memory operations respect allocation, deallocation, and permit-transfer order,
  • descriptor views preserve address space, element type, shape, and swizzle requirements,
  • kernel entry markers are rewritten before NVVM emission.

These invariants are easiest to enforce while the atom name is still present. Once the op has become an NVVM intrinsic the diagnostic context shrinks, and the original layout intent may already be gone.

If You Know CUTLASS (open source) — cross-walk

For readers fluent in cutlass/arch/*.hpp and the per-SM atom traits in open-source CUTLASS:

CUTLASS C++tileiras IR (cute_nvgpu)
cutlass::arch::Mma<...> SM70/SM80/SM89 specialisationsatom.universal_fma, sm80.mma, sm89.mma (plus sm80.sparse_mma)
cutlass::arch::Wmma<...> traitsaccessed through atom.universal_fma and tier-generic paths
Hopper GMMA::ss/rs/sr descriptor builderscute_nvgpu.smem_desc_view + the descriptor packer at sub_17DD6A0
Hopper WGMMA atom + make_smem_desccute_nvgpu.sm90.mma op consuming a !smem_desc_view typed operand
Hopper TMA cp.async.bulk.tensor familyatom.tma_load, atom.tma_store, atom.tma_reduce plus the non-exec variants
Hopper cuTensorMapEncodeTiledtma_descriptor_tiled type + the TMA descriptor builder
Blackwell TCGEN / UMMA atomssm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp
Blackwell TMEM allocation / lifecycleatom.tmem_load, atom.tmem_store, atom.s2t_copy, the TMEM lifecycle ops
cutlass::arch::Sm120BlockScaledMma<...>SM120.mma_bs (uppercase SM is required)
Shared-memory matrix loads (ldmatrix)atom.ldsm, atom.stsm with the mode/size pattern matrix in Mode Pattern Verifiers — LDSM and STSM Matrix

Two departures from the open-source surface matter. First, SM120.mma_bs is the only SM120 entry — no SM120.mma, no sparse variant — matching the consumer-Blackwell FP4 surface where sparse MMA is not exposed. Second, the SMEM descriptor is a first-class IR type (!smem_desc_view) rather than an i64 immediate, so the verifier can re-check the descriptor's swizzle and tile-stride encoding against the same layout that produced it.