cute_nvgpu Assembly Printer and Type Mnemonics
Abstract
cute_nvgpu textual assembly is primarily a type surface. Atom and descriptor types print as compact mnemonics — sm90.mma, atom.tma_load, SM120.mma_bs. Parameterized types add angle-bracket payloads for descriptor views, universal copy atoms, or sub-byte integer fragments. The dialect provides no dialect-scoped attributes in text; operation attributes use ordinary builtin attribute syntax. This page covers the printer, the parser-facing mnemonics, their parameters, enum spellings, alias hints, and the 27-entry length-keyed packed-XOR perfect-hash dispatcher (sub_1826CF0, 6164 B) that resolves them all.
Type Mnemonics
| Family | Mnemonics |
|---|---|
| MMA atoms | atom.universal_fma, sm80.mma, sm80.sparse_mma, sm89.mma, sm90.mma, sm100.mma, sm100.mma_sp, sm100.mma_bs, sm100.mma_bs_sp, SM120.mma_bs |
| SMEM descriptors | smem_desc, smem_desc_view |
| TMEM and copy atoms | atom.tmem_load, atom.tmem_store, atom.s2t_copy, atom.universal_copy, atom.simt_async_copy, atom.ldsm, atom.stsm |
| TMA descriptors and atoms | tma_descriptor_tiled, tma_descriptor_im2col, atom.tma_load, atom.tma_store, atom.tma_reduce, atom.non_exec_tiled_tma_load, atom.non_exec_tiled_tma_store, atom.non_exec_tiled_tma_reduce |
SM120.mma_bs is case-sensitive. It prints and parses with uppercase SM.
Parameterized Types
smem_desc_view
smem_desc_view wraps a source type and a layout attribute:
!cute_nvgpu.smem_desc_view<memref<128xf16, 3>, #cute.layout<(4@0, 32@1)>>
The source type describes the shared-memory object. The layout attribute tells WGMMA or UMMA how to interpret that shared-memory tile.
atom.universal_copy
Universal copy atoms carry value type, optional bit width, optional distributed shared-memory allowance, and optional PTX-like memory order and scope:
!cute_nvgpu.atom.universal_copy<f16>
!cute_nvgpu.atom.universal_copy<f16, 128 b>
!cute_nvgpu.atom.universal_copy<f16, 128 b, allow_dsmem>
!cute_nvgpu.atom.universal_copy<f16, mem_order=acquire, mem_scope=cluster>
The b suffix means bits. Keep the space before b to match this dialect's printer exactly.
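The elision behaviour shown in the examples above can be sketched as a small printer. This is a minimal illustration, not the dialect's actual code: the `UniversalCopyAtom` struct, its field names, and the relative order of the optional fields when several are present together are assumptions made for the example.

```cpp
#include <optional>
#include <string>

// Hypothetical field bundle; the real storage class is not shown in the source.
struct UniversalCopyAtom {
    std::string value_type;                // e.g. "f16"
    std::optional<unsigned> bits;          // printed as "<N> b"
    bool allow_dsmem = false;              // printed as a bare keyword
    std::optional<std::string> mem_order;  // e.g. "acquire"
    std::optional<std::string> mem_scope;  // e.g. "cluster"
};

// Prints the type, eliding every absent optional field.
// Assumed field order: bits, allow_dsmem, mem_order, mem_scope.
std::string print_universal_copy(const UniversalCopyAtom &a) {
    std::string s = "!cute_nvgpu.atom.universal_copy<" + a.value_type;
    if (a.bits) s += ", " + std::to_string(*a.bits) + " b";  // space before 'b'
    if (a.allow_dsmem) s += ", allow_dsmem";
    if (a.mem_order) s += ", mem_order=" + *a.mem_order;
    if (a.mem_scope) s += ", mem_scope=" + *a.mem_scope;
    return s + ">";
}
```

Under these assumptions, a value with only `value_type` and `bits` set prints in the `<f16, 128 b>` form, and absent fields leave no trace in the output.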
Atom Integer Type
Sub-byte and microscaling fragments print as integer widths with an optional division factor:
!cute_nvgpu.i4
!cute_nvgpu.i6
!cute_nvgpu.i8
!cute_nvgpu.i4<divby 2>
!cute_nvgpu.i2<divby 4>
The division factor controls packed-lane interpretation, not integer arithmetic.
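A printer for these forms might look like the sketch below. The function name and the convention that a divisor of 0 means "no division factor" are assumptions for illustration only.

```cpp
#include <string>

// Hypothetical helper: divby == 0 stands for "no division factor".
std::string print_atom_integer(unsigned width, unsigned divby = 0) {
    std::string s = "!cute_nvgpu.i" + std::to_string(width);
    if (divby != 0) s += "<divby " + std::to_string(divby) + ">";
    return s;
}
```

The divisor is carried as a separate field and only affects the printed payload; the width itself is printed unchanged.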
Enum Spellings
Universal copy atoms use PTX-like memory order and scope spellings.
| Enum | Spellings |
|---|---|
| Memory order | relaxed, acquire, release, acq_rel, sc, mmio, constant, volatile |
| Memory scope | cluster, gpu, sys |
An omitted order or scope is not the same as printing a default value. Printers elide absent fields rather than emit sentinel keywords.
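One way to model "absent is not a default" is to parse each spelling into an optional enum. The sketch below covers the memory-order row of the table; the enumerator names and their numeric values are assumptions, and only the spellings come from the table.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical enumerators; only the spellings below come from the table.
enum class MemOrder { Relaxed, Acquire, Release, AcqRel, Sc, Mmio, Constant, Volatile };

// Spellings in the same order as the enumerators above.
constexpr std::array<std::string_view, 8> kMemOrderSpellings = {
    "relaxed", "acquire", "release", "acq_rel", "sc", "mmio", "constant", "volatile"};

// Returns nullopt for an unknown spelling; callers also use nullopt for "omitted".
std::optional<MemOrder> parse_mem_order(std::string_view text) {
    for (std::size_t i = 0; i < kMemOrderSpellings.size(); ++i)
        if (kMemOrderSpellings[i] == text) return static_cast<MemOrder>(i);
    return std::nullopt;
}

std::string_view spell(MemOrder order) {
    return kMemOrderSpellings[static_cast<std::size_t>(order)];
}
```

A printer built on this representation emits `mem_order=` only when the optional holds a value, so no sentinel keyword is ever needed.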
Alias Hints
The dialect may provide human-readable SSA aliases for common atom families. Aliases are non-semantic and may be overridden by operation-level result naming.
// Pseudocode. Returns an owned string so the formatted buffer outlives the call.
std::string alias_for_type(Type type) {
    if (is_memref_family(type)) {
        return format("memref_%s_%d", element_name(type), rank(type));
    }
    if (is_copy_atom(type)) {
        return format("copy_%s", copy_atom_suffix(type));
    }
    if (is_mma_atom(type)) {
        // A, B, and C element names followed by the MMA shape.
        return format("mma_%s_%s_%s_%s",
                      a_element_name(type),
                      b_element_name(type),
                      c_element_name(type),
                      shape_name(type));
    }
    return "";  // no alias hint for this type
}
Examples:
%copy_sm90_tma_load = ...
%mma_f16_f16_f32_16x8x16 = ...
%memref_f16_3 = ...
Attribute Text
The dialect does not parse #cute_nvgpu.* attributes. Operation attributes
such as a_type, b_type, shape_MNK, cache modes, scale-vector size, and
thread or byte identifiers should be represented with ordinary builtin or
operation-specific attribute syntax.
// Good: op-owned attributes.
%atom = cute_nvgpu.make_sm120_mma_bs
{shape_MNK = [16, 8, 32], vec_size = 32}
// Not a supported dialect attribute surface.
// #cute_nvgpu.some_attribute<...>
Parser Strategy
Any efficient mnemonic dispatch will do, as long as it behaves as if it recognises the exact case-sensitive set above. Unknown type mnemonics produce a clear diagnostic naming both the bad token and the cute_nvgpu dialect.
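Type parse_cute_nvgpu_type(Parser *parser) {
    StringRef mnemonic = parser->parse_keyword();
    // Parameterized types carry their own sub-parsers.
    if (mnemonic == "smem_desc_view") {
        return parse_smem_desc_view(parser);
    }
    if (mnemonic == "atom.universal_copy") {
        return parse_universal_copy_atom(parser);
    }
    // Zero-parameter literals are tried before the i<N> prefix branch.
    Type type = lookup_zero_parameter_type(mnemonic);
    if (type != NULL) {
        return type;
    }
    if (starts_with_atom_integer_prefix(mnemonic)) {
        return parse_atom_integer_type(parser, mnemonic);
    }
    // Hypothetical helper: reports the bad token and the cute_nvgpu dialect.
    return emit_unknown_type_error(parser, mnemonic);
}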
Mnemonic Perfect-Hash Dispatch
The compiled mnemonic dispatcher (sub_1826CF0, 6164 bytes) is a hand-written length-keyed perfect-hash walk. It fetches a single token via parseOptionalKeyword through the AsmParser vtable, then compares the token against 27 precomputed entries in a fixed order. Each entry is a tuple (length, first_qword, second_qword, tail_bytes); the comparison uses 8-byte unaligned loads XORed against the stored qword literal — the classic packed-XOR memcmp collapse. The walk order is the order the entries appear in the table below, and preserving it matters: the printer's slot index must match the parser's branch order so type-name resolution round-trips through identical decision-chain offsets.
The walk needs no collision handling because no two mnemonics agree on both length and byte content. Most entries are already unique on (length, first qword); the one exception is sm100.mma_sp and sm100.mma_bs, which share length 12 and the first qword "sm100.mm" and are separated only by the dword at +8. The compiler emits a chained linear walk rather than a switch table: each arm gates on len == LEN first, then on a fused XOR of one or two qwords, then on any remaining dword, word, and byte tail. A miss falls through to the next arm; a hit calls the per-mnemonic builder and sets HIBYTE(v44) = 1 as the success sticky bit.
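The packed-XOR compare can be illustrated with a short sketch. The helper names `pack_le64` and `matches_sm80_mma` are hypothetical; the qword constant is the one listed for sm80.mma in the table below. The packing is done byte by byte so that the result does not depend on host endianness, whereas the binary itself uses unaligned 8-byte loads on a little-endian target.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Packs bytes [off, off + 8) of `s` into a little-endian qword.
// Precondition: s.size() >= off + 8.
constexpr std::uint64_t pack_le64(std::string_view s, std::size_t off = 0) {
    std::uint64_t q = 0;
    for (std::size_t i = 0; i < 8; ++i)
        q |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[off + i])) << (8 * i);
    return q;
}

// One arm of the walk: length gate first, then XOR against the stored literal.
// The length check also guarantees the 8-byte read stays in bounds.
constexpr bool matches_sm80_mma(std::string_view tok) {
    return tok.size() == 8 && (pack_le64(tok) ^ 0x616D6D2E30386D73ULL) == 0;
}
```

Packing the first eight bytes of atom.universal_fma with this helper yields 0x696E752E6D6F7461, and packing bytes 8 through 15 yields 0x665F6C6173726576, matching the first table row.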
| # | Length | First qword (LE) | q0 literal | Second qword / tail | Mnemonic |
|---|---|---|---|---|---|
| 0 | 18 | 0x696E752E6D6F7461 | "atom.uni" | q1 0x665F6C6173726576 "versal_f", w@+16 "ma" | atom.universal_fma |
| 1 | 8 | 0x616D6D2E30386D73 | "sm80.mma" | — | sm80.mma |
| 2 | 15 | 0x6170732E30386D73 | "sm80.spa" | d@+8 "rse_", w@+12 "mm", b@+14 'a' | sm80.sparse_mma |
| 3 | 8 | 0x616D6D2E39386D73 | "sm89.mma" | — | sm89.mma |
| 4 | 9 | 0x7365645F6D656D73 | "smem_des" | b@+8 'c' | smem_desc |
| 5 | 14 | 0x7365645F6D656D73 | "smem_des" | d@+8 "c_vi", w@+12 "ew" | smem_desc_view |
| 6 | 8 | 0x616D6D2E30396D73 | "sm90.mma" | — | sm90.mma |
| 7 | 9 | 0x6D6D2E3030316D73 | "sm100.mm" | b@+8 'a' | sm100.mma |
| 8 | 12 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_sp" | sm100.mma_sp |
| 9 | 12 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_bs" | sm100.mma_bs |
| 10 | 15 | 0x6D6D2E3030316D73 | "sm100.mm" | d@+8 "a_bs", w@+12 "_s", b@+14 'p' | sm100.mma_bs_sp |
| 11 | 12 | 0x6D6D2E3032314D53 | "SM120.mm" | d@+8 "a_bs" | SM120.mma_bs |
| 12 | 14 | 0x656D742E6D6F7461 | "atom.tme" | d@+8 "m_lo", w@+12 "ad" | atom.tmem_load |
| 13 | 15 | 0x656D742E6D6F7461 | "atom.tme" | d@+8 "m_st", w@+12 "or", b@+14 'e' | atom.tmem_store |
| 14 | 13 | 0x7432732E6D6F7461 | "atom.s2t" | d@+8 "_cop", b@+12 'y' | atom.s2t_copy |
| 15 | 20 | 0x637365645F616D74 | "tma_desc" | q1 0x745F726F74706972 "riptor_t", d@+16 "iled" | tma_descriptor_tiled |
| 16 | 21 | 0x637365645F616D74 | "tma_desc" | q1 0x695F726F74706972 "riptor_i", d@+16 "m2co", b@+20 'l' | tma_descriptor_im2col |
| 17 | 13 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_loa", b@+12 'd' | atom.tma_load |
| 18 | 14 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_sto", w@+12 "re" | atom.tma_store |
| 19 | 15 | 0x616D742E6D6F7461 | "atom.tma" | d@+8 "_red", w@+12 "uc", b@+14 'e' | atom.tma_reduce |
| 20 | 19 | 0x696E752E6D6F7461 | "atom.uni" | q1 0x635F6C6173726576 "versal_c", w@+16 "op", b@+18 'y' | atom.universal_copy |
| 21 | 20 | 0x6D69732E6D6F7461 | "atom.sim" | q1 0x5F636E7973615F74 "t_async_", d@+16 "copy" | atom.simt_async_copy |
| 22 | 9 | 0x73646C2E6D6F7461 | "atom.lds" | b@+8 'm' | atom.ldsm |
| 23 | 9 | 0x7374732E6D6F7461 | "atom.sts" | b@+8 'm' | atom.stsm |
| 24 | 28 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "load" | atom.non_exec_tiled_tma_load |
| 25 | 29 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "stor", b@+28 'e' | atom.non_exec_tiled_tma_store |
| 26 | 30 | 0x6E6F6E2E6D6F7461 | "atom.non" | q1 0x69745F636578655F "_exec_ti", q@+16 "led_tma_", d@+24 "redu", w@+28 "ce" | atom.non_exec_tiled_tma_reduce |
Entries 4, 15, and 16 (smem_desc, tma_descriptor_tiled, tma_descriptor_im2col) are zero-parameter sugared types that build straight through the MLIR type uniquer with a fixed TypeID global. Entries 5 and 20 (smem_desc_view, atom.universal_copy) carry inline sub-parsers for layout, scope, and order. Every other entry routes to a per-mnemonic builder trampoline whose address acts as a typed nonce in the dialect's registration table — the trampoline itself is a 3-byte xor eax, eax; ret, and the linker pins the function pointer as the opcode tag.
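How far the (length, first qword) pair goes toward identifying an entry can be checked mechanically against the table. The sketch below is an independent check written for this page, not code recovered from the binary; it lists the 27 mnemonics in walk order and counts pairs that agree on both length and the first eight bytes.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// The 27 mnemonics in the walk order of the table above.
constexpr std::array<std::string_view, 27> kMnemonics = {
    "atom.universal_fma", "sm80.mma", "sm80.sparse_mma", "sm89.mma",
    "smem_desc", "smem_desc_view", "sm90.mma", "sm100.mma", "sm100.mma_sp",
    "sm100.mma_bs", "sm100.mma_bs_sp", "SM120.mma_bs", "atom.tmem_load",
    "atom.tmem_store", "atom.s2t_copy", "tma_descriptor_tiled",
    "tma_descriptor_im2col", "atom.tma_load", "atom.tma_store",
    "atom.tma_reduce", "atom.universal_copy", "atom.simt_async_copy",
    "atom.ldsm", "atom.stsm", "atom.non_exec_tiled_tma_load",
    "atom.non_exec_tiled_tma_store", "atom.non_exec_tiled_tma_reduce"};

// Counts unordered pairs that agree on both length and the first 8 bytes.
std::size_t count_length_q0_collisions() {
    std::size_t n = 0;
    for (std::size_t i = 0; i < kMnemonics.size(); ++i)
        for (std::size_t j = i + 1; j < kMnemonics.size(); ++j)
            if (kMnemonics[i].size() == kMnemonics[j].size() &&
                kMnemonics[i].substr(0, 8) == kMnemonics[j].substr(0, 8))
                ++n;
    return n;
}
```

The only pair this finds is sm100.mma_sp and sm100.mma_bs (entries 8 and 9), which differ in the dword at +8; every other entry is already identified by its length and first qword.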
atom.i<N>_divby_<M> Prefix Branch
Once all 27 literal arms miss, the dispatcher checks whether the token starts with 'i' (0x69). If it does, the walk switches into a four-step sub-walk that reads the bare i<decimal> form first, then optionally consumes the divby keyword followed by a second decimal. This is the dialect's generic packed sub-byte fragment encoding: i4, i6, and i8 for plain widths; i4<divby 2> and i2<divby 4> for NVFP4-style microscaling fragments, where the divisor controls packed-lane interpretation rather than arithmetic.
The walk is:
- Read N as a decimal suffix after the leading 'i'. The digit run is delimited by std::find_if_not(tail, end, isdigit) and converted with StringRef::getAsInteger(10, &value).
- Check that the remainder of the token is exhausted and that N fits in uint32_t.
- Probe parseOptionalKeyword("divby", 5, &cursor) through the parser vtable. If the keyword is absent, the result is a plain i<N> atom integer type.
- Read M as a second decimal and attach both integers to the OperationState via the OperationState::addAttribute("divby", i<M>) helper.
The bare divby keyword path lives entirely inside step 3 — not a separate dispatcher arm, but a sub-keyword that only ever follows a successful i<N> consumption. Reimplementers must place this prefix branch after every literal arm: anything starting with 'i' followed by an all-digit tail lands here regardless of whether the digits form a semantically valid bit-width.
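The first two steps of the sub-walk can be sketched with the standard library. This is an approximation under stated assumptions: `parse_atom_integer_width` is a hypothetical name, and `std::from_chars` stands in for the `std::find_if_not` plus `StringRef::getAsInteger` pair described above. The `divby` probe is left out because it depends on parser state.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Returns N for a token of the form i<decimal>, or nullopt otherwise.
// No semantic check is made on N; any value that fits in uint32_t is accepted.
std::optional<std::uint32_t> parse_atom_integer_width(std::string_view tok) {
    if (tok.size() < 2 || tok.front() != 'i') return std::nullopt;
    const char *first = tok.data() + 1;
    const char *last = tok.data() + tok.size();
    std::uint32_t n = 0;
    auto [ptr, ec] = std::from_chars(first, last, n, 10);
    // Reject overflow, an empty digit run, and any non-digit remainder.
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return n;
}
```

A token such as i7 is accepted here even though it may not be a meaningful width, which mirrors the point that validity of the digits is decided later, not in the dispatcher.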
SM120 Uppercase Quirk
Every other SM-family arm uses a lowercase prefix (sm80., sm89., sm90., sm100.). The SM120 arm uppercases it. The packed first-qword literal for entry 11 is 0x6D6D2E3032314D53, decoding little-endian as "SM120.mm": the low-order bytes 53 4D ('S', 'M') sit at memory offsets 0 and 1, so they appear as the trailing 4D53 of the hex literal and encode the uppercase SM head. This is a genuine binary quirk preserved through the CUDA 13.1 release and confirmed against the verbatim qword constant in the decompilation. Any reimplementation must key the SM120 arm case-sensitively and never case-fold the prefix.
The table has exactly one SM120 variant: SM120.mma_bs. No SM120.mma, no _sp, no _bs_sp — the SM120 code path is gated to block-scaled MMA only, matching the consumer-Blackwell FP4 surface where sparse MMA is not exposed.
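The little-endian decoding of the entry 11 literal can be reproduced with a few lines. The helper name `unpack_le64` is hypothetical; the constants are the ones from the dispatch table.

```cpp
#include <cstdint>
#include <string>

// Unpacks a little-endian qword into its 8 bytes, lowest-order byte first.
std::string unpack_le64(std::uint64_t q) {
    std::string s(8, '\0');
    for (int i = 0; i < 8; ++i)
        s[i] = static_cast<char>((q >> (8 * i)) & 0xFF);
    return s;
}
```

Unpacking 0x6D6D2E3032314D53 gives "SM120.mm", with 'S' (0x53) at offset 0 and 'M' (0x4D) at offset 1, while the sm100 literal 0x6D6D2E3030316D73 gives the lowercase "sm100.mm".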
Diagnostic
When all 27 arms and the 'i'-prefix branch miss, the dispatcher emits:
unknown  type `<mnem>` in dialect `<dialect>`
The literal carries two spaces between unknown and type — verbatim in the binary at .rodata offset 0x04CF76F7. A compatible reimplementation must preserve it. Bad token and dialect name are both wrapped in backticks; the diagnostic is stitched from three separate .rodata fragments concatenated through the InFlightDiagnostic::operator<<(StringRef) chain.
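A compatible message can be assembled as in the sketch below. The function name is hypothetical, and the exact points at which the text is split into fragments are an assumption; only the joined output is described above.

```cpp
#include <string>
#include <string_view>

// Builds the unknown-type message from three literal pieces plus two arguments.
// The fragment boundaries are assumed; the double space is kept verbatim.
std::string unknown_type_diagnostic(std::string_view mnemonic, std::string_view dialect) {
    std::string msg = "unknown  type `";  // two spaces between the words
    msg += mnemonic;
    msg += "` in dialect `";
    msg += dialect;
    msg += "`";
    return msg;
}
```

Tests that compare diagnostics byte for byte should use the two-space form, since normalising whitespace would hide a mismatch with the binary.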
Perfect-Hash Compare Pseudocode
typedef struct HashEntry {
    size_t length;                 /* exact token length in bytes */
    uint64_t q0;                   /* bytes 0..7, packed little-endian */
    uint64_t q1;                   /* bytes 8..15, used only when length >= 16 */
    const char *tail;              /* bytes after the last stored qword */
    void (*build)(ParseResult *);  /* per-mnemonic builder */
} HashEntry;

int parseCuteNvgpuMnemonic(StringRef m, ParseResult *out) {
    static const HashEntry table[27] = {
        {18, 0x696E752E6D6F7461ULL, 0x665F6C6173726576ULL, "ma", build_atom_universal_fma},
        { 8, 0x616D6D2E30386D73ULL, 0ULL, "", build_sm80_mma},
        {15, 0x6170732E30386D73ULL, 0ULL, "rse_mma", build_sm80_sparse_mma},
        { 8, 0x616D6D2E39386D73ULL, 0ULL, "", build_sm89_mma},
        /* ...rows 4-10 in walk order... */
        {12, 0x6D6D2E3032314D53ULL, 0ULL, "a_bs", build_sm120_mma_bs},
        /* ...rows 12-25 in walk order... */
        {30, 0x6E6F6E2E6D6F7461ULL, 0x69745F636578655FULL,
         "led_tma_reduce",
         build_atom_non_exec_tiled_tma_reduce},
    };
    for (size_t i = 0; i < 27; ++i) {
        const HashEntry *e = &table[i];
        /* Length gate. Every entry is at least 8 bytes, so the q0 load is in bounds. */
        if (m.size != e->length) continue;
        uint64_t q0 = unaligned_load_u64(m.data + 0);
        if ((q0 ^ e->q0) != 0) continue;
        size_t off = 8;
        if (e->length >= 16) {
            /* A second full qword exists only for tokens of 16 bytes or more. */
            uint64_t q1 = unaligned_load_u64(m.data + 8);
            if ((q1 ^ e->q1) != 0) continue;
            off = 16;
        }
        /* Compare whatever follows the last qword that was checked. */
        size_t tail_len = e->length - off;
        if (tail_len && memcmp(m.data + off, e->tail, tail_len) != 0) continue;
        e->build(out);
        return 1;
    }
    /* The i<N> prefix branch runs only after every literal arm has missed. */
    if (m.size > 0 && m.data[0] == 'i' && all_digits(m.data + 1, m.size - 1)) {
        return parse_atom_integer_with_optional_divby(m, out);
    }
    /* Two spaces between "unknown" and "type", as in the binary. */
    emit_error("unknown  type `%.*s` in dialect `cute_nvgpu`",
               (int)m.size, m.data);
    return 0;
}
The length gate rejects most arms with a single integer compare, so the qword compares run only for the few arms whose length matches the token. The compiler chains the arms as if/else if rather than building a switch table. No two mnemonics agree on both length and byte content, so at most one arm can match, and the walk order must stay stable to preserve the printer-to-parser slot correspondence the dialect's round-trip self-test depends on.
Invariants
- Type mnemonics are case-sensitive. SM120.mma_bs prints with uppercase SM; the packed qword literal 0x6D6D2E3032314D53 enforces this case-sensitively.
- The 27-entry perfect-hash walk order is preserved across builds so that printer slot indices match the parser's branch order.
- The 'i'-prefix branch runs only after every literal arm has missed.
- Parameterized type printers and parsers are symmetric.
- Operation attributes do not require a dialect-scoped attribute parser.
- Alias hints are deterministic but never semantic.
- The unknown-type diagnostic uses two literal spaces between unknown and type and wraps both the bad token and the dialect name in backticks.