cute_nvgpu Assembly Printer and Type Mnemonics

Abstract

cute_nvgpu textual assembly is primarily a type surface. Atom and descriptor types print as compact mnemonics — sm90.mma, atom.tma_load, SM120.mma_bs. Parameterized types add angle-bracket payloads for descriptor views, universal copy atoms, or sub-byte integer fragments. The dialect provides no dialect-scoped attributes in text; operation attributes use ordinary builtin attribute syntax. This page covers the printer, the parser-facing mnemonics, their parameters, enum spellings, alias hints, and the 27-entry length-keyed packed-XOR perfect-hash dispatcher (sub_1826CF0, 6164 B) that resolves them all.

Type Mnemonics

| Family | Mnemonics |
| --- | --- |
| MMA atoms | atom.universal_fma, sm80.mma, sm80.sparse_mma, sm89.mma, sm90.mma, sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp, SM120.mma_bs |
| SMEM descriptors | smem_desc, smem_desc_view |
| TMEM and copy atoms | atom.tmem_load, atom.tmem_store, atom.s2t_copy, atom.universal_copy, atom.simt_async_copy, atom.ldsm, atom.stsm |
| TMA descriptors and atoms | tma_descriptor_tiled, tma_descriptor_im2col, atom.tma_load, atom.tma_store, atom.tma_reduce, atom.non_exec_tiled_tma_load, atom.non_exec_tiled_tma_store, atom.non_exec_tiled_tma_reduce |

SM120.mma_bs is case-sensitive. It prints and parses with uppercase SM.

Parameterized Types

smem_desc_view

smem_desc_view wraps a source type and a layout attribute:

!cute_nvgpu.smem_desc_view<memref<128xf16, 3>, #cute.layout<(4@0, 32@1)>>

The source type describes the shared-memory object. The layout attribute tells WGMMA or UMMA how to interpret that shared-memory tile.

atom.universal_copy

Universal copy atoms carry value type, optional bit width, optional distributed shared-memory allowance, and optional PTX-like memory order and scope:

!cute_nvgpu.atom.universal_copy<f16>
!cute_nvgpu.atom.universal_copy<f16, 128 b>
!cute_nvgpu.atom.universal_copy<f16, 128 b, allow_dsmem>
!cute_nvgpu.atom.universal_copy<f16, mem_order=acquire, mem_scope=cluster>

The b suffix means bits. Keep the space before b to match this dialect's printer exactly.
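
The optional-field rule can be sketched as a small printer. This is a minimal illustration, not the dialect's implementation: the UniversalCopyAtom struct, its field names, and print_universal_copy are hypothetical, and the relative order of fields when all of them are present is an assumption extrapolated from the examples above.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical in-memory form of the type; field names are illustrative only.
struct UniversalCopyAtom {
    std::string value_type;                // e.g. "f16"
    std::optional<unsigned> bits;          // printed as "<N> b" when present
    bool allow_dsmem = false;              // printed as a bare keyword when set
    std::optional<std::string> mem_order;  // e.g. "acquire"
    std::optional<std::string> mem_scope;  // e.g. "cluster"
};

// Emits only the fields that are present. The field order is an assumption.
std::string print_universal_copy(const UniversalCopyAtom &a) {
    assert(!a.value_type.empty());
    std::string s = "!cute_nvgpu.atom.universal_copy<" + a.value_type;
    if (a.bits)        s += ", " + std::to_string(*a.bits) + " b";  // space before b
    if (a.allow_dsmem) s += ", allow_dsmem";
    if (a.mem_order)   s += ", mem_order=" + *a.mem_order;
    if (a.mem_scope)   s += ", mem_scope=" + *a.mem_scope;
    return s + ">";
}
```

Calling print_universal_copy on a value with only value_type set yields the shortest form shown above; setting bits to 128 adds the ", 128 b" payload with the space preserved.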

Atom Integer Type

Sub-byte and microscaling fragments print as integer widths with an optional division factor:

!cute_nvgpu.i4
!cute_nvgpu.i6
!cute_nvgpu.i8
!cute_nvgpu.i4<divby 2>
!cute_nvgpu.i2<divby 4>

The division factor controls packed-lane interpretation, not integer arithmetic.

Enum Spellings

Universal copy atoms use PTX-like memory order and scope spellings.

| Enum | Spellings |
| --- | --- |
| Memory order | relaxed, acquire, release, acq_rel, sc, mmio, constant, volatile |
| Memory scope | cluster, gpu, sys |

An omitted order or scope is not the same as printing a default value. Printers elide absent fields rather than emit sentinel keywords.
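
One way to keep "absent" distinct from "default" is to model the field as an optional value. The sketch below assumes a hypothetical MemOrder enum and helper names; only the spellings come from the table above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical enum; the enumerator order mirrors the spelling list.
enum class MemOrder { Relaxed, Acquire, Release, AcqRel, Sc, Mmio, Constant, Volatile };

constexpr std::array<std::string_view, 8> kMemOrderNames = {
    "relaxed", "acquire", "release", "acq_rel", "sc", "mmio", "constant", "volatile"};

// Case-sensitive lookup; an unknown spelling yields nullopt, not a default.
std::optional<MemOrder> parse_mem_order(std::string_view s) {
    for (std::size_t i = 0; i < kMemOrderNames.size(); ++i)
        if (s == kMemOrderNames[i]) return static_cast<MemOrder>(i);
    return std::nullopt;
}

// Printer side: an absent order contributes no text at all.
std::string mem_order_field(std::optional<MemOrder> order) {
    if (!order) return "";
    auto i = static_cast<std::size_t>(*order);
    assert(i < kMemOrderNames.size());
    return ", mem_order=" + std::string(kMemOrderNames[i]);
}
```

With this shape a round trip of a type that never specified an order prints nothing for that field, so the text cannot silently acquire a sentinel keyword.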

Alias Hints

The dialect may provide human-readable SSA aliases for common atom families. Aliases are non-semantic and may be overridden by operation-level result naming.

StringRef alias_for_type(Type type) {
    if (is_memref_family(type)) {
        return format("memref_%s_%d", element_name(type), rank(type));
    }

    if (is_copy_atom(type)) {
        return format("copy_%s", copy_atom_suffix(type));
    }

    if (is_mma_atom(type)) {
        return format("mma_%s_%s_%s_%s",
                      a_element_name(type),
                      b_element_name(type),
                      c_element_name(type),
                      shape_name(type));
    }

    return "";
}

Examples:

%copy_sm90_tma_load = ...
%mma_f16_f16_f32_16x8x16 = ...
%memref_f16_3 = ...

Attribute Text

The dialect does not parse #cute_nvgpu.* attributes. Operation attributes such as a_type, b_type, shape_MNK, cache modes, scale-vector size, and thread or byte identifiers should be represented with ordinary builtin or operation-specific attribute syntax.

// Good: op-owned attributes.
%atom = cute_nvgpu.make_sm120_mma_bs
    {shape_MNK = [16, 8, 32], vec_size = 32}

// Not a supported dialect attribute surface.
// #cute_nvgpu.some_attribute<...>

Parser Strategy

Any efficient mnemonic dispatch will do, as long as it behaves as if it recognises the exact case-sensitive set above. Unknown type mnemonics produce a clear diagnostic naming both the bad token and the cute_nvgpu dialect.

Type parse_cute_nvgpu_type(Parser *parser) {
    StringRef mnemonic = parser->parse_keyword();

    if (mnemonic == "smem_desc_view") {
        return parse_smem_desc_view(parser);
    }

    if (mnemonic == "atom.universal_copy") {
        return parse_universal_copy_atom(parser);
    }

    if (starts_with_atom_integer_prefix(mnemonic)) {
        return parse_atom_integer_type(parser, mnemonic);
    }

    Type type = lookup_zero_parameter_type(mnemonic);
    require(type != NULL);
    return type;
}

Mnemonic Perfect-Hash Dispatch

The compiled mnemonic dispatcher (sub_1826CF0, 6164 bytes) is a hand-written length-keyed perfect-hash walk. It fetches a single token via parseOptionalKeyword through the AsmParser vtable, then compares the token against 27 precomputed entries in a fixed order. Each entry is a tuple (length, first_qword, second_qword, tail_bytes); the comparison uses 8-byte unaligned loads XORed against the stored qword literal — the classic packed-XOR memcmp collapse. The walk order is the order the entries appear in the table below, and preserving it matters: the printer's slot index must match the parser's branch order so type-name resolution round-trips through identical decision-chain offsets.

The walk resolves every token unambiguously because each arm compares the full mnemonic. The (length, first qword) pair is unique for every entry except sm100.mma_sp and sm100.mma_bs (entries 8 and 9), which share length 12 and the first qword "sm100.mm" and are separated only by the dword at offset 8 ("a_sp" versus "a_bs"). The compiler emits a chained linear walk rather than a switch table: each arm gates on len == LEN first, then on a fused XOR of one or two qwords, then on any remaining dword, word, and byte tail. A miss falls through to the next arm; a hit calls the per-mnemonic builder and sets HIBYTE(v44) = 1 as the success sticky bit.

| # | Length | First qword (LE) | q0 literal | Second qword / tail | Mnemonic |
| --- | --- | --- | --- | --- | --- |
| 0 | 18 | 0x696E752E6D6F7461 | "atom.uni" | q1 0x665F6C6173726576 "versal_f", w@+16 "ma" | atom.universal_fma |
| 1 | 8 | 0x616D6D2E30386D73 | "sm80.mma" | | sm80.mma |
| 2 | 15 | 0x6170732E30386D73 | "sm80.spa" | d@+8 "rse_", w@+12 "mm", b@+14 'a' | sm80.sparse_mma |
| 3 | 8 | 0x616D6D2E39386D73 | "sm89.mma" | | sm89.mma |
| 4 | 9 | 0x7365645F6D656D73 | "smem_des" | b@+8 'c' | smem_desc |
| 5 | 14 | 0x7365645F6D656D73 | "smem_des" | d@+8 "c_vi", w@+12 "ew" | smem_desc_view |
| 6 | 8 | 0x616D6D2E30396D73 | "sm90.mma" | | sm90.mma |
| 7 | 9 | 0x6D6D2E3030316D73 | "sm100.mm" | b@+8 'a' | sm100.mma |
| 8 | 12 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_sp" | sm100.mma_sp |
| 9 | 12 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_bs" | sm100.mma_bs |
| 10 | 15 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_bs", w@+12 "_s", b@+14 'p' | sm100.mma_bs_sp |
| 11 | 12 | 0x6D6D2E3032314D53 | "SM120.mm" | d@+8 "a_bs" | SM120.mma_bs |
| 12 | 14 | 0x656D742E6D6F7461 | "atom.tme" | d@+8 "m_lo", w@+12 "ad" | atom.tmem_load |
| 13 | 15 | 0x656D742E6D6F7461 | "atom.tme" | d@+8 "m_st", w@+12 "or", b@+14 'e' | atom.tmem_store |
| 14 | 13 | 0x7432732E6D6F7461 | "atom.s2t" | d@+8 "_cop", b@+12 'y' | atom.s2t_copy |
| 15 | 20 | 0x637365645F616D74 | "tma_desc" | q1 0x745F726F74706972 "riptor_t", d@+16 "iled" | tma_descriptor_tiled |
| 16 | 21 | 0x637365645F616D74 | "tma_desc" | q1 0x695F726F74706972 "riptor_i", d@+16 "m2co", b@+20 'l' | tma_descriptor_im2col |
| 17 | 13 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_loa", b@+12 'd' | atom.tma_load |
| 18 | 14 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_sto", w@+12 "re" | atom.tma_store |
| 19 | 15 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_red", w@+12 "uc", b@+14 'e' | atom.tma_reduce |
| 20 | 19 | 0x696E752E6D6F7461 | "atom.uni" | q1 0x635F6C6173726576 "versal_c", w@+16 "op", b@+18 'y' | atom.universal_copy |
| 21 | 20 | 0x6D69732E6D6F7461 | "atom.sim" | q1 0x5F636E7973615F74 "t_async_", d@+16 "copy" | atom.simt_async_copy |
| 22 | 9 | 0x73646C2E6D6F7461 | "atom.lds" | b@+8 'm' | atom.ldsm |
| 23 | 9 | 0x7374732E6D6F7461 | "atom.sts" | b@+8 'm' | atom.stsm |
| 24 | 28 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "load" | atom.non_exec_tiled_tma_load |
| 25 | 29 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "stor", b@+28 'e' | atom.non_exec_tiled_tma_store |
| 26 | 30 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "redu", w@+28 "ce" | atom.non_exec_tiled_tma_reduce |
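
The qword literals in the table can be reproduced by packing eight characters little-endian, which is also a convenient way to audit a reimplementation. The sketch below is illustrative: pack_le64 and matches_atom_universal_fma are hypothetical names, and the arm models entry 0 only (length gate, fused XOR of two qwords, then the two-byte tail).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Packs s[offset .. offset+8) into a little-endian qword, independent of
// the host byte order. Byte 0 of the text lands in the low 8 bits.
uint64_t pack_le64(std::string_view s, std::size_t offset = 0) {
    assert(offset + 8 <= s.size());
    uint64_t q = 0;
    for (std::size_t i = 0; i < 8; ++i)
        q |= static_cast<uint64_t>(static_cast<unsigned char>(s[offset + i])) << (8 * i);
    return q;
}

// One arm of the chain, modeled on entry 0 (atom.universal_fma).
bool matches_atom_universal_fma(std::string_view tok) {
    if (tok.size() != 18) return false;  // length gate first
    // Fused XOR: the OR of both differences is zero only if both qwords match.
    uint64_t diff = (pack_le64(tok, 0) ^ 0x696E752E6D6F7461ULL) |
                    (pack_le64(tok, 8) ^ 0x665F6C6173726576ULL);
    if (diff != 0) return false;
    return tok.substr(16) == "ma";       // word tail at offset 16
}
```

For example, pack_le64("sm80.mma") reproduces the entry 1 literal 0x616D6D2E30386D73, and a token such as atom.universal_cop passes the length gate but fails on the second qword.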

Entries 4, 15, and 16 (smem_desc, tma_descriptor_tiled, tma_descriptor_im2col) are zero-parameter sugared types that build straight through the MLIR type uniquer with a fixed TypeID global. Entries 5 and 20 (smem_desc_view, atom.universal_copy) carry inline sub-parsers for layout, scope, and order. Every other entry routes to a per-mnemonic builder trampoline whose address acts as a typed nonce in the dialect's registration table — the trampoline itself is a 3-byte xor eax, eax; ret, and the linker pins the function pointer as the opcode tag.

i<N> and divby Prefix Branch

Once all 27 literal arms miss, the dispatcher checks whether the token starts with 'i' (0x69). If it does, the walk switches into a four-step sub-walk that reads the bare i<decimal> form first, then optionally consumes the divby keyword followed by a second decimal. This is the dialect's generic packed sub-byte fragment encoding: i4, i6, i8 for plain widths; i4<divby 2>, i2<divby 4> for NVFP4-style microscaling fragments where the divisor controls packed-lane interpretation rather than arithmetic.

The walk is:

  1. Read N as a decimal suffix after the leading 'i'. The digit run is delimited by std::find_if_not(tail, end, isdigit) and converted with StringRef::getAsInteger(10, &value).
  2. Check that the remainder of the token is exhausted and that N fits in uint32_t.
  3. Probe parseOptionalKeyword("divby", 5, &cursor) through the parser vtable. If the keyword is absent, the result is a plain i<N> atom integer type.
  4. Read M as a second decimal and attach both integers to the OperationState via the OperationState::addAttribute("divby", i<M>) helper.

The bare divby keyword path lives entirely inside step 3 — not a separate dispatcher arm, but a sub-keyword that only ever follows a successful i<N> consumption. Reimplementers must place this prefix branch after every literal arm: anything starting with 'i' followed by an all-digit tail lands here regardless of whether the digits form a semantically valid bit-width.
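
Steps 1 and 2 of the walk can be sketched without any parser state. This is a minimal sketch under stated assumptions: parse_atom_int_width is a hypothetical name, the digit conversion is written by hand instead of calling StringRef::getAsInteger, and the divby probe of steps 3 and 4 is not modeled because it needs the parser.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Returns N for a token of the exact form i<decimal>, otherwise nullopt.
std::optional<uint32_t> parse_atom_int_width(std::string_view tok) {
    if (tok.size() < 2 || tok[0] != 'i') return std::nullopt;
    std::string_view tail = tok.substr(1);

    // Step 1: delimit the digit run that follows the leading 'i'.
    auto digits_end = std::find_if_not(tail.begin(), tail.end(),
        [](unsigned char c) { return c >= '0' && c <= '9'; });

    // Step 2a: the remainder of the token must be exhausted.
    if (digits_end != tail.end()) return std::nullopt;

    // Step 2b: the value must fit in uint32_t.
    uint64_t value = 0;
    for (char c : tail) {
        value = value * 10 + static_cast<uint64_t>(c - '0');
        if (value > UINT32_MAX) return std::nullopt;
    }
    assert(value <= UINT32_MAX);
    return static_cast<uint32_t>(value);
}
```

Note that the function accepts widths such as i0 or i7: as the text says, the branch is purely lexical and leaves semantic validation of the bit width to a later stage.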

SM120 Uppercase Quirk

Every other SM-prefixed arm in the table uses a lowercase prefix (sm80., sm89., sm90., sm100.). The SM120 arm uppercases it. The packed first-qword literal for entry 11 is 0x6D6D2E3032314D53, decoding little-endian as "SM120.mm": the bytes 53 4D ('S', 'M') sit at offsets 0 and 1 of the qword, which is why they appear as the trailing ...4D53 digits of the hex literal. This is a genuine binary quirk preserved through the CUDA 13.1 release and confirmed against the verbatim qword constant in the decompilation. Any reimplementation must key the SM120 arm case-sensitively and never case-fold the prefix.

The table has exactly one SM120 variant: SM120.mma_bs. No SM120.mma, no _sp, no _bs_sp — the SM120 code path is gated to block-scaled MMA only, matching the consumer-Blackwell FP4 surface where sparse MMA is not exposed.
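
The case sensitivity falls directly out of the exact XOR compare. The sketch below decodes the entry 11 literal byte by byte and shows that a case-folded head differs in exactly the ASCII case bit (0x20) of bytes 0 and 1. The names kSM120Head, kLowerHead, and byte_at are illustrative, not taken from the binary.

```cpp
#include <cassert>
#include <cstdint>

// First-qword literal for entry 11, "SM120.mm" packed little-endian.
constexpr uint64_t kSM120Head = 0x6D6D2E3032314D53ULL;

// Byte at a given offset of a little-endian packed qword.
constexpr char byte_at(uint64_t q, unsigned offset) {
    assert(offset < 8);
    return static_cast<char>((q >> (8 * offset)) & 0xFF);
}

static_assert(byte_at(kSM120Head, 0) == 'S', "offset 0 holds 0x53");
static_assert(byte_at(kSM120Head, 1) == 'M', "offset 1 holds 0x4D");

// Setting the ASCII case bit in bytes 0 and 1 yields the lowercase head.
constexpr uint64_t kLowerHead = kSM120Head | 0x2020ULL;  // "sm120.mm"
static_assert(byte_at(kLowerHead, 0) == 's' && byte_at(kLowerHead, 1) == 'm',
              "case-folded head");

// The XOR compare is nonzero for the folded head, so the arm misses.
static_assert((kLowerHead ^ kSM120Head) == 0x2020ULL, "differs only in case bits");
```

A parser that lowercased the token before dispatch would make this arm unreachable, since no lowercase sm120 literal exists in the table.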

Diagnostic

When all 27 arms and the 'i'-prefix branch miss, the dispatcher emits:

unknown  type `<mnem>` in dialect `<dialect>`

The literal carries two spaces between unknown and type — verbatim in the binary at .rodata offset 0x04CF76F7. A compatible reimplementation must preserve it. Bad token and dialect name are both wrapped in backticks; the diagnostic is stitched from three separate .rodata fragments concatenated through the InFlightDiagnostic::operator<<(StringRef) chain.
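
A compatible message can be assembled from three fixed fragments around the two variable parts. The sketch below is an assumption about where the fragment boundaries fall; the source states only that there are three .rodata fragments. The function name unknown_type_diagnostic is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Builds the unknown-type message. Fragment boundaries are assumed.
std::string unknown_type_diagnostic(std::string_view mnemonic,
                                    std::string_view dialect) {
    assert(!dialect.empty());
    std::string msg = "unknown  type `";  // two spaces, kept verbatim
    msg += mnemonic;
    msg += "` in dialect `";
    msg += dialect;
    msg += "`";
    return msg;
}
```

Tests that compare diagnostics byte for byte should include the doubled space, since tools that normalize whitespace would otherwise hide a mismatch.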

Perfect-Hash Compare Pseudocode

typedef struct HashEntry {
    size_t      length;   /* exact token length in bytes (always >= 8) */
    uint64_t    q0;       /* bytes 0..7, packed little-endian */
    uint64_t    q1;       /* bytes 8..15; used only when length >= 16 */
    const char *tail;     /* bytes after the last compared qword */
    void      (*build)(ParseResult *);
} HashEntry;

int parseCuteNvgpuMnemonic(StringRef m, ParseResult *out) {
    static const HashEntry table[27] = {
        {18, 0x696E752E6D6F7461ULL, 0x665F6C6173726576ULL, "ma",     build_atom_universal_fma},
        { 8, 0x616D6D2E30386D73ULL, 0ULL,                  "",       build_sm80_mma},
        {15, 0x6170732E30386D73ULL, 0ULL,                  "rse_mma",build_sm80_sparse_mma},
        { 8, 0x616D6D2E39386D73ULL, 0ULL,                  "",       build_sm89_mma},
        /* ...entries 4-10 in walk order... */
        {12, 0x6D6D2E3032314D53ULL, 0ULL,                  "a_bs",   build_sm120_mma_bs},
        /* ...entries 12-25 in walk order... */
        {30, 0x6E6F6E2E6D6F7461ULL, 0x69745F636578655FULL,
                                                          "led_tma_reduce",
                                                                    build_atom_non_exec_tiled_tma_reduce},
    };

    for (size_t i = 0; i < 27; ++i) {
        const HashEntry *e = &table[i];
        /* Length gate; it also guarantees the qword loads stay in bounds. */
        if (m.size != e->length)                                                   continue;
        uint64_t q0 = unaligned_load_u64(m.data + 0);
        if ((q0 ^ e->q0) != 0)                                                     continue;
        size_t compared = 8;
        if (e->length >= 16) {
            /* Load a second qword only when 16 bytes exist. */
            uint64_t q1 = unaligned_load_u64(m.data + 8);
            if ((q1 ^ e->q1) != 0)                                                 continue;
            compared = 16;
        }
        /* The tail starts right after the last compared qword. */
        size_t tail_len = e->length - compared;
        if (tail_len && memcmp(m.data + compared, e->tail, tail_len) != 0)         continue;
        e->build(out);
        return 1;
    }

    /* Prefix branch: reached only after every literal arm has missed. */
    if (m.size > 0 && m.data[0] == 'i' && all_digits(m.data + 1, m.size - 1)) {
        return parse_atom_integer_with_optional_divby(m, out);
    }

    emit_error("unknown  type `%.*s` in dialect `cute_nvgpu`",
               (int)m.size, m.data);
    return 0;
}

The length gate rejects most arms with a single integer compare, so a token pays for qword loads only in arms of its own length. The compiler chains the arms as if/else if rather than building a switch table. The (length, first qword) key is not quite collision-free on its own: sm100.mma_sp and sm100.mma_bs share it and are told apart by the tail compare. The walk order must stay stable to preserve the printer-to-parser slot correspondence the dialect's round-trip self-test depends on.
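
How far the (length, first qword) key discriminates can be checked mechanically from the mnemonic list. The sketch below is an audit aid, not part of the dispatcher: kMnemonics is copied from the table in walk order, and shared_key_groups is a hypothetical helper that reports every group of mnemonics sharing both length and first eight bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// The 27 literal mnemonics in walk order, copied from the table.
static const std::vector<std::string> kMnemonics = {
    "atom.universal_fma", "sm80.mma", "sm80.sparse_mma", "sm89.mma",
    "smem_desc", "smem_desc_view", "sm90.mma", "sm100.mma",
    "sm100.mma_sp", "sm100.mma_bs", "sm100.mma_bs_sp", "SM120.mma_bs",
    "atom.tmem_load", "atom.tmem_store", "atom.s2t_copy",
    "tma_descriptor_tiled", "tma_descriptor_im2col",
    "atom.tma_load", "atom.tma_store", "atom.tma_reduce",
    "atom.universal_copy", "atom.simt_async_copy", "atom.ldsm", "atom.stsm",
    "atom.non_exec_tiled_tma_load", "atom.non_exec_tiled_tma_store",
    "atom.non_exec_tiled_tma_reduce"};

// Groups mnemonics by (length, first 8 bytes) and returns every group with
// more than one member, preserving walk order inside each group.
std::vector<std::vector<std::string>> shared_key_groups() {
    assert(kMnemonics.size() == 27);
    std::map<std::pair<std::size_t, std::string>, std::vector<std::string>> groups;
    for (const auto &m : kMnemonics)
        groups[{m.size(), m.substr(0, 8)}].push_back(m);

    std::vector<std::vector<std::string>> shared;
    for (const auto &kv : groups)
        if (kv.second.size() > 1) shared.push_back(kv.second);
    return shared;
}
```

For the list above the only shared key is (12, "sm100.mm"), held by sm100.mma_sp and sm100.mma_bs, so a reimplementation needs the tail compare for correctness rather than only as a safety net.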

Invariants

  • Type mnemonics are case-sensitive.
  • SM120.mma_bs prints with uppercase SM; the packed qword literal 0x6D6D2E3032314D53 enforces this case-sensitively.
  • The 27-entry perfect-hash walk order is preserved across builds so that printer slot indices match the parser's branch order.
  • The 'i'-prefix branch runs only after every literal arm has missed.
  • Parameterized type printers and parsers are symmetric.
  • Operation attributes do not require a dialect-scoped attribute parser.
  • Alias hints are deterministic but never semantic.
  • The unknown-type diagnostic uses two literal spaces between unknown and type and wraps both the bad token and the dialect name in backticks.