Tensor Core / MMA Builtins
Tensor core builtins implement the Warp Matrix Multiply-Accumulate (WMMA) and Warp Group MMA (WGMMA) interfaces, spanning IDs 678–770 across four SM generations. Each generation added new data types and matrix shapes, resulting in 91 registered builtins that cover half-precision, integer, binary, double-precision, TF32, BF16, and FP8 matrix operations. SM 100 (Blackwell) adds a fifth generation — tcgen05 — documented in Tensor / MMA Codegen.
Key Facts
| Property | Value |
|---|---|
| Builtin IDs | 678–770 (93 entries) |
| WGMMA handler (IDs 753–768) | ~800 lines in sub_12B3FD0 / sub_955A70 |
| LLVM intrinsic range (WGMMA) | 5304–5447 (144-entry 5-D grid) plus 10654–10779 (N-dimension table) |
| NVVM lowering | sub_955A70 (24 KB native), sub_12B3FD0 (22 KB native) |
| Backend emission | sub_21E74C0 (PTX builder), sub_36E9630 (tcgen05 ISD selection) |
| SM gates | SM 70+ HMMA, SM 72+ IMMA, SM 75+ BMMA, SM 80+ DMMA/TF32/BF16, SM 90+ WGMMA |
WMMA Architecture Evolution
| SM Generation | Feature | ID Range | Count |
|---|---|---|---|
| SM 70 (Volta) | HMMA: FP16 tensor core | 678–707 | 30 |
| SM 75 (Turing) | IMMA: INT8/INT4, BMMA: binary | 708–745 | 38 |
| SM 80 (Ampere) | DMMA: FP64, TF32, BF16 | 746–764 | 19 |
| SM 90 (Hopper) | WGMMA: warp-group MMA, FP8 | 765–768 | 4 |
| SM 100 (Blackwell) | tcgen05: MX formats, block-scale, sparsity | (intrinsic path) | — |
HMMA — Half-Precision (IDs 678–707, SM 70+)
The original tensor core builtins provide 16-bit floating-point matrix multiply for three tile shapes. Each shape has 10 operations: load A, load B, load C (f16 and f32 accumulators), store C (f16 and f32), and four MMA variants for input/output precision combinations.
| ID Range | Shape | Builtin Prefix |
|---|---|---|
| 678–687 | 16x16x16 | __hmma_m16n16k16_* |
| 688–697 | 32x8x16 | __hmma_m32n8k16_* |
| 698–707 | 8x32x16 | __hmma_m8n32k16_* |
Per-shape operations (10 each):
| Suffix | Operation | Description |
|---|---|---|
ld_a | Load A fragment | Load matrix A tile from memory |
ld_b | Load B fragment | Load matrix B tile from memory |
ld_c_f16 | Load C (f16) | Load accumulator as half-precision |
ld_c_f32 | Load C (f32) | Load accumulator as single-precision |
st_c_f16 | Store C (f16) | Store result as half-precision |
st_c_f32 | Store C (f32) | Store result as single-precision |
mma_f16f16 | MMA f16->f16 | FP16 input, FP16 accumulator |
mma_f32f16 | MMA f16->f32 | FP16 input, FP32 accumulator |
mma_f16f32 | MMA f32->f16 | FP32 accumulator, FP16 output |
mma_f32f32 | MMA f32->f32 | FP32 input and accumulator |
IMMA — Integer MMA (IDs 708–739, SM 75+)
Integer tensor core operations for INT8 and INT4 data types.
INT8 (IDs 708–731)
Three shapes (16x16x16, 32x8x16, 8x32x16), each with 8 operations:
| Suffix | Description |
|---|---|
ld_a_s8 / ld_a_u8 | Load A fragment (signed/unsigned INT8) |
ld_b_s8 / ld_b_u8 | Load B fragment (signed/unsigned INT8) |
ld_c | Load accumulator (INT32) |
st_c_i32 | Store result (INT32) |
mma_s8 / mma_u8 | INT8 MMA (signed/unsigned) |
INT4 (IDs 732–739)
Single shape (8x8x32) with the same operation set but _s4 / _u4 type suffixes.
BMMA — Binary MMA (IDs 740–745, SM 75+)
Binary (1-bit) matrix multiply with XOR-POPC and AND-POPC accumulation modes. Single shape: 8x8x128.
| ID | Builtin | Description |
|---|---|---|
| 740 | __bmma_m8n8k128_ld_a_b1 | Load A fragment (binary) |
| 741 | __bmma_m8n8k128_ld_b_b1 | Load B fragment (binary) |
| 742 | __bmma_m8n8k128_ld_c | Load accumulator |
| 743 | __bmma_m8n8k128_st_c_i32 | Store result |
| 744 | __bmma_m8n8k128_mma_xor_popc_b1 | Binary MMA (XOR + popcount) |
| 745 | __bmma_m8n8k128_mma_and_popc_b1 | Binary MMA (AND + popcount) |
Extended Tensor Core (IDs 746–764, SM 80+)
SM 80 (Ampere) added double-precision, TF32, and BF16 tensor operations.
DMMA — Double Precision (IDs 746, 751–754)
| ID | Builtin | Description |
|---|---|---|
| 746 | __dmma_m8n8k4_mma_f64 | FP64 MMA |
| 751 | __dmma_m8n8k4_st_c_f64 | Store FP64 result |
| 752–754 | __dmma_m8n8k4_{ld_a,ld_b,ld_c} | Load fragments |
TF32 (IDs 747, 755–757)
| ID | Builtin | Description |
|---|---|---|
| 747 | __mma_tf32_m16n16k8_mma_f32 | TF32 MMA producing FP32 |
| 755–757 | __mma_tf32_m16n16k8_{ld_a,ld_b,ld_c} | Load fragments |
BF16 (IDs 748–750, 758–764)
| ID | Builtin | Description |
|---|---|---|
| 748 | __mma_bf16_m16n16k16_mma_f32 | BF16 16x16x16 MMA |
| 749 | __mma_bf16_m32n8k16_mma_f32 | BF16 32x8x16 MMA |
| 750 | __mma_bf16_m8n32k16_mma_f32 | BF16 8x32x16 MMA |
| 758–764 | __mma_bf16_m*_{ld_a,ld_b} | Load fragments for each shape |
WMMA Lowering Details
Three-Table Lookup
WMMA builtins use a three-table structure for mapping builtin IDs to LLVM intrinsic IDs:
| Table | Address (NVVM) | ID Range | Description |
|---|---|---|---|
dword_3F14840 | Entries 0–29 | 678–707 | HMMA (first-generation, FP16) |
dword_3F147E0 | Entries 0–23 | 708–731 | IMMA (INT8) |
dword_3F147A0 | Entries 0–12 | 732–744 | BMMA (binary) / INT4 |
The EDG-side parallel tables live at dword_42810C0 (678–709), dword_4281060 (708–731), dword_4281020 (732–744), addressed from sub_12AC1A0.
Fragment Size Determination
The number of register-level fragments varies by operation and data type:
| Condition | Fragment Count | Example |
|---|---|---|
| First-gen WMMA, BF16, store | 4 | BF16 store_c |
| First-gen WMMA, default | 8 | FP16 mma |
| IMMA, intrinsic 8914/8280 | 2 | INT8 ld_a compact |
| BMMA | 2 | Binary operations |
| IMMA intrinsic 0x22BB/0x22BC/0x22C5/0x22C6 | 4 | INT4 load A/B |
| IMMA intrinsic 0x22BD/0x22BE/0x22C3/0x22C4/0x22CB–0x22CE | 1 | Sub-byte single-element |
| IMMA intrinsic 0x22B7/0x22BF/0x22C7 | 8 | INT8 full-width |
MMA Codegen Flow
The MMA handler (sub_94E0D0 / sub_12AC5F0) processes 5 input operands:
- dest_ptr — Pointer to output fragment storage
- A_fragment — Matrix A input (loaded
v100times) - B_fragment — Matrix B input (loaded
v95times) - C_fragment — Accumulator input (loaded
v101times) - rowcol — Layout operand (validated 0–3 for MMA)
An optional satf flag (saturation, validated 0–1) is consumed for most intrinsics except ID 8279.
The handler emits the MMA call via sub_921880 and scatters results back to the destination fragment through v103 iterations of element-wise stores.
Fragment iteration counts per family (NVVM path, sub_94E0D0):
| Family | v95 (load B) | v100 (load A) | v101 (load C) | v103 (store D) |
|---|---|---|---|---|
| BMMA (b1) | 1 | 1 | 2 | 2 |
| IMMA (0x22C0-0x22C1) | 1 | 4 | 8 | 8 |
| IMMA (0x22B8-0x22B9 = 8888-8889) | 2 | 2 | 8 | 8 |
| IMMA (0x22C8-0x22C9 = 8904-8905) | 4 | 1 | 8 | 8 |
| HMMA (default, first-gen) | 8 | 8 | varies | varies (4 or 8) |
The output fragment count is determined by bit-test: (0x300C003 >> (intrinsic_id + 127)) & 1 selects 4 vs 8 fragments.
Architecture Gating — Exact Thresholds
The architecture version is stored at *(target_info + 252) as a DWORD.
| Function | Gate Expression | Minimum SM | Notes |
|---|---|---|---|
sub_21DFBF0 hmmastc | v8 > 0x45 | SM 70 | FP16 store |
sub_21E0360 hmmaldab | v8 > 0x45 | SM 70 | FP16 load A/B |
sub_21E0870 hmmamma | v8 > 0x45 | SM 70 | FP16 MMA |
sub_21E1280 immaldab | v8 > 0x47 | SM 72 | INT load; v8==72 && variant>1 rejected |
sub_21E1D20 immamma | v8 > 0x47 | SM 72 | INT MMA; variant>1 && v8==72 rejected |
sub_21E2280 bmmamma | v8 > 0x48 | SM 73/75 | Binary MMA |
sub_36E9630 tcgen05 | arch >= 0x3E8 | SM 100 | Blackwell only |
SM 72 (Xavier) has a unique partial IMMA implementation: only variant 0/1 shapes are supported, with explicit gating that blocks higher variants. This matches hardware reality where Xavier had limited INT8 tensor cores.
WGMMA — Warp Group MMA (SM 90+ Hopper)
WGMMA operates on an entire warp group (4 warps, 128 threads) rather than a single warp. The system is split across four builtin IDs, 20 auxiliary IDs for fence/store/load operations, and two massive handler blocks totaling ~800 lines of lowering logic.
Builtin Registration
Four builtins are registered in sub_90AEE0 (NVVM) and sub_126A910 (EDG):
| ID | Builtin | Data Type | Lowering Case |
|---|---|---|---|
| 765 (0x2FD) | __wgmma_mma_async_f16 | FP16 | Full operand set (6 chained: A, B, C, scale, negate, sparsity) |
| 766 (0x2FE) | __wgmma_mma_async_bf16 | BF16 | 2-operand (no scale/negate) |
| 767 (0x2FF) | __wgmma_mma_async_tf32 | TF32 | Reduced operand set |
| 768 (0x300) | __wgmma_mma_async_f8 | FP8 (SM 90a+) | Minimal (2 scale operands only) |
WGMMA ID Space Overview
The full WGMMA ID range spans 745–770, subdivided into four functional groups:
| ID Range | Function | Handler |
|---|---|---|
| 745–750 (0x2E9–0x2EE) | Fence / commit / wait | sub_12B1C20 / sub_953BA0 |
| 751–752 (0x2EF–0x2F0) | Store | sub_12B27B0 / sub_954350 |
| 753–764 (0x2F1–0x2FC) | MMA async load (12 variants) | inline / sub_9547E0 |
| 765–768 (0x2FD–0x300) | MMA async compute (4 type builtins) | inline ~800 lines / sub_12B2E10 |
| 769–770 (0x301–0x302) | Warp-group barrier | inline IR via sub_127FC40 |
WGMMA Fence / Commit / Wait (IDs 745–750)
sub_953BA0 (NVVM) / sub_12B1C20 (EDG) builds a red-black tree on first call with 7 entries keyed by builtin ID. Each entry packs:
struct wgmma_fence_entry {
uint32_t id; // builtin ID (745–751)
uint32_t trans_a; // transpose A flag
uint32_t shape; // shape code (0 or 1)
uint32_t trans_b; // transpose B flag
uint32_t a_nregs; // register count for A fragment
uint32_t b_nregs; // register count for B fragment
uint32_t padding; // unused alignment
llvm_type *a_type; // LLVM type for A (i64, i32, i16x2, i32x4)
llvm_type *b_type; // LLVM type for B
llvm_type *c_type; // LLVM type for C (i32x2, i32x8)
};
Decoded entries from local variables v47–v106:
| ID | trans_a | shape | trans_b | a_nregs | b_nregs | A type | B type | C type |
|---|---|---|---|---|---|---|---|---|
| 745 | 0 | 1 | 5 | 1 | 1 | i64 | i64 | — |
| 746 | 1 | 0 | 1 | 9 | 9 | i32 | i32 | i32x2 |
| 747 | 0 | 0 | 25 | 8 | 8 | i16x2 | i16x2 | — |
| 748 | 0 | 0 | 23 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 749 | 0 | 0 | 24 | 7 | 7 | i32x4 | i32x4 | i32x8 |
| 750 | 0 | 0 | 6 | 7 | 7 | i64 | i32x2 | i32x8 |
Output packed encoding (*a4, 64-bit):
| Bits | Field | Source |
|---|---|---|
| [3:0] | trans_a | *(entry+40) |
| [7:4] | shape | *(entry+48) << 4 |
| [15:8] | a_nregs | *(entry+64) << 8 |
| [27:16] | b_nregs | *(entry+72) << 16 |
| [31:28] | padding | *(entry+80) << 28 |
| [63:32] | trans_b | *(entry+56) << 32 |
| [25] | rowcol bit 1 | (rowcol & 2) == 0 ? 0x2000000 : 0x1000000 |
| [27:26] | rowcol bit 0 | ((rowcol & 1) + 1) << 26 |
The fence dispatch validates the rowcol operand (must be 0–3) and emits a 4-argument call to intrinsic 9062 (llvm.nvvm.wgmma.fence.aligned) with 3 type overloads. Fragment operands are prepared via sub_94B510.
WGMMA Store (IDs 751–752)
sub_954350 / sub_12B27B0 builds a separate parameter lookup tree. Store operations validate rowcol (0 or 1) and emit a 5-argument call using intrinsic 9145 (llvm.nvvm.wgmma.store) with 2 type overloads. Operands: {constant, B_fragment, descriptor, rowcol, zero}.
WGMMA MMA Async Load (IDs 753–764)
sub_9547E0 (NVVM) / sub_12B2E10 (EDG) builds a 12-entry red-black tree at ctx+656:
| ID | Shape | nregs | Variant | Fragment Type |
|---|---|---|---|---|
| 753 | 1 | 9 | 0 | — |
| 754 | 1 | 9 | 1 | — |
| 755 | 1 | 9 | 2 | i16x2 |
| 756 | 25 | 8 | 0 | — |
| 757 | 25 | 8 | 1 | — |
| 758 | 25 | 10 | 2 | i32x8 |
| 759 | 23 | 7 | 0 | i32x4 |
| 760 | 23 | 7 | 1 | i32x4 |
| 761 | 24 | 7 | 0 | i32x4 |
| 762 | 24 | 7 | 1 | i32x4 |
| 763 | 6 | 7 | 0 | i32x2/i64 |
| 764 | 6 | 7 | 1 | i32x2/i64 |
Output packed encoding (*a4, 64-bit):
| Bits | Field |
|---|---|
| [63:32] | *(entry+40) << 32 |
| [31:4] | *(entry+48) << 4 | rowcol |
| [1] | *(entry+56) << 1 |
Emits intrinsic 9067 (llvm.nvvm.wgmma.mma.async) with 2 type overloads. Arguments: {constant, B_fragment, rowcol_value, zero_constant}. Results scattered via sub_94B940.
WGMMA MMA Async Compute — The 800-Line Handler (IDs 765–768)
This is the primary WGMMA lowering path. It lives inline in the mega-switch of sub_955A70 (NVVM, lines ~2850–3138) and sub_12B3FD0 (EDG, lines ~2270–3138). The handler implements two completely different intrinsic selection strategies depending on which builtin ID triggered entry.
Argument Extraction
The handler walks the argument chain 7 levels deep from the call expression:
v263 = M dimension (first constant argument)
v512 = accumulator fragments (pointer to fragment array)
v528 = A descriptor (64-bit matrix descriptor or register fragments)
v524 = B descriptor (64-bit matrix descriptor)
v519 = scale factors (A and D scale constants)
v264 = layout params (rowcol encoding)
v516, v265 = shape params (additional dimension info)
v540 = element type info (integer type tag from AST)
Each constant argument is validated through sub_620FD0 (EDG) / sub_620FD0 (shared), which extracts the integer value and sets an overflow flag. On overflow:
"unexpected constant overflow in __wgmma_mma_async operand"
This check is applied 5 times: once for N dimension, once for each scale factor, and once for each negate/saturation bit.
Per-Builtin Argument Layouts
| ID | Builtin | Operand Chain |
|---|---|---|
| 765 (0x2FD) | _f16 | 6 chained: A, B, C, scaleA, scaleD, negate/saturation |
| 766 (0x2FE) | _bf16 | Separate branch (LABEL_56 path), 2-operand (no scale/negate) |
| 767 (0x2FF) | _tf32 | Rearranged arguments, fewer config bits |
| 768 (0x300) | _f8 | Simplest form, 2 matrix descriptors + config |
Strategy 1: N-Dimension Dispatch (IDs 765–768, inner path)
When the element type is checked and the first argument yields an N dimension, the handler enters a 33-entry switch mapping N values to LLVM intrinsic IDs in the range 10654–10779:
| N | Integer-type Intrinsic | Float-type Intrinsic |
|---|---|---|
| 8 | 10774 | 10775 |
| 16 | 10690 | 10691 |
| 24 | 10734 | 10735 |
| 32 | 10742 | 10743 |
| 40 | 10746 | 10747 |
| 48 | 10750 | 10751 |
| 56 | 10754 | 10755 |
| 64 | 10758 | 10759 |
| 72 | 10762 | 10763 |
| 80 | 10766 | 10767 |
| 88 | 10770 | 10771 |
| 96 | 10778 | 10779 |
| 104 | 10654 | 10655 |
| 112 | 10658 | 10659 |
| 120 | 10662 | 10663 |
| 128 | 10666 | 10667 |
| 136 | 10670 | 10671 |
| 144 | 10674 | 10675 |
| 152 | 10678 | 10679 |
| 160 | 10682 | 10683 |
| 168 | 10686 | 10687 |
| 176 | 10694 | 10695 |
| 184 | 10698 | 10699 |
| 192 | 10702 | 10703 |
| 200 | 10706 | 10707 |
| 208 | 10710 | 10711 |
| 216 | 10714 | 10715 |
| 224 | 10718 | 10719 |
| 232 | 10722 | 10723 |
| 240 | 10726 | 10727 |
| 248 | 10730 | 10731 |
| 256 | 10738 | 10739 |
The even/odd intrinsic ID pairing encodes the distinction between integer-element and float-element variants. Type discrimination uses the AST element type: if the element type is integer with width 10 (i.e., a 10-bit integer signaling bf16/tf32 internal encoding), the even (integer) intrinsic is selected; otherwise the odd (float) intrinsic.
N dimension validation:
if ((N & (N - 1)) != 0)
error("N only supported for powers of two");
This is applied when the N value does not match any case in the 33-entry switch. The N values 8, 16, 32, 64, 128, 256 are powers of two; the intermediate values (24, 40, 48, ..., 248) are non-power-of-two multiples of 8 that are still valid WGMMA dimensions.
Strategy 2: 5-Dimensional Intrinsic Grid (IDs 753–764 path, shared)
For the full WGMMA async variants (handled through sub_12B2E10), the handler selects from a 144-entry intrinsic table spanning IDs 5304–5447, organized as a 5-dimensional grid:
| Dimension | Values | Description |
|---|---|---|
| 1. N | {16, 32, 64, 128} | Output column dimension |
| 2. B_shared | {false, true} | Is B operand from shared memory? (sub_12A71A0 != 0) |
| 3. is_s64 | {false, true} | Is accumulator type s64/int? (type tag 2, subtype 10) |
| 4. scale/negate | varies | A scale nonzero? D scale nonzero? |
| 5. variant | {0x2FD, 0x2FE, 0x2FF, 0x300} | Which builtin triggered entry |
Base addresses and stride:
| N | Base ID | Stride per N |
|---|---|---|
| 128 | 5304 | 24 variants |
| 64 | ~5328 | 24 |
| 32 | ~5352 | 24 |
| 16 | ~5376 | 24 |
| overflow | ~5400–5447 | remaining |
Size-based opcode selection (for f16, ID 765):
| Accumulator Size | Opcode (integer) | Opcode (float) |
|---|---|---|
| 16 | 5332 | 5333 |
| 32 | 5380 | 5381 |
| 64 | 5404 | 5405 |
| 128 | 5308 | 5309 |
| other | 5356/5428 | 5357/5429 |
The mapping formula: base + N_offset + shared_offset + type_offset + variant_offset. The accumulator size is extracted by sub_12A71A0(expr) from the expression type chain.
WGMMA Config Bit Packing
Multiple boolean arguments are packed into a single configuration word passed to the final intrinsic call:
| Bit | Field | Source | Value Semantics |
|---|---|---|---|
| 0 | Accumulate / saturation flag | Final constant operand (v433) | 1 = accumulate into D, 0 = overwrite |
| 1 | ScaleD / transpose flag | v445 constant | 1 = transpose B descriptor |
| 2 | Negate-C / layout flag | v81 / v433 constant | 1 = negate accumulator input |
| 3 | Sign bit for B | v427 constant (if present) | Reserved / sign extension |
| 4 | Negate-A / additional mode | v80 / v427 constant (if present) | 1 = negate A operand |
Combined via: v79 = bit0 | (bit1 << 1) | (bit2 << 2) | (bit4 << 4).
After intrinsic selection, the handler:
- Converts the accumulator pointer to a vector pointer (
.asvecptrtag) - Extracts bitfield from constant operands for mode flags
- Calls
sub_1285290/sub_921880with name hint"mmafrag" - Scatters results via
sub_94B940/sub_1280F50(size 4 = float elements)
WGMMA Validation Summary
All constant arguments pass through sub_620FD0, which extracts the integer value and sets an overflow flag.
| Check | Error Message | Condition |
|---|---|---|
| Constant overflow | "unexpected constant overflow in __wgmma_mma_async operand" | Any integer operand overflows extraction (5 occurrences) |
| N power-of-two | "N only supported for powers of two" | (N & (N - 1)) != 0 and N not in the 33-entry switch |
| rowcol range (fence) | "'rowcol' operand can be 0 or 1 only" | rowcol > 1 for load/store |
| rowcol range (MMA) | (implicit — validated 0–3) | rowcol > 3 for MMA operations |
WGMMA Support Functions
| Function | Address | EDG Parallel | Purpose |
|---|---|---|---|
sub_953BA0 | 0x953BA0 | sub_12B1C20 | Fence/commit/wait parameter lookup, builds packed 64-bit encoding |
sub_9547E0 | 0x9547E0 | sub_12B2E10 | MMA async load parameter lookup, 12-entry red-black tree |
sub_954350 | 0x954350 | sub_12B27B0 | Store variant parameter lookup |
sub_94B510 | 0x94B510 | — | Prepare fragment operand for WGMMA call |
sub_94B940 | 0x94B940 | sub_1280F50 | Scatter MMA results back to fragment outputs |
sub_94B2B0 | 0x94B2B0 | — | Extract fragment element at index (WMMA shared) |
sub_12A71A0 | 0x12A71A0 | — | Extract size/dimension from expression type (EDG-only) |
sub_12A6F10 | 0x12A6F10 | — | Validate constant integer in range (EDG-only) |
sub_620FD0 | 0x620FD0 | — | Extract constant integer with overflow detection (shared) |
Packed MMA Descriptor Word
The MMA PTX string builder at sub_21E74C0 (AsmPrinter) / sub_35F_range (NVPTX backend) reads a packed 64-bit descriptor for all MMA instruction emission. The descriptor is stored at:
v22 = *(QWORD *)(*(QWORD *)(a1 + 16) + 16 * a2 + 8)
| Bits | Field | Query Key | Values |
|---|---|---|---|
| [0] | Row/col layout | "rowcol" | 0=row, 1=col |
| [2:1] | Matrix ID | "mid" | 0=a, 1=b, 2=c, 3=d |
| [7:4] | Binary opcode | "opc" | 0=default, 1=.and.popc, 2=.xor.popc |
| [2:0] | Rounding mode | "rnd" | 0=none, 1=.rn, 2=.rm, 3=.rp, 4=.rz |
| [15:8] | A element type | "aty" | Type enum 1–11 |
| [23:16] | B element type | "bty" | Type enum 1–11 |
| [25:24] | A layout | "al" | 0=row, nonzero=col |
| [27:26] | B layout | "bl" | 0=row, nonzero=col |
| [28] | Saturation | "satf" | 1=.satfinite |
| [39:32] | Shape enum | "shape" | 0x01–0x19, 18 entries |
Shape Enum
| Enum | Shape | PTX String | Min SM | Notes |
|---|---|---|---|---|
| 0x01 | m8n8k4 | "m8n8k4" | SM 70 | Original Volta HMMA |
| 0x02 | m8n8k16 | "m8n8k16" | SM 72 | Integer MMA (s8/u8) |
| 0x03 | m8n8k32 | "m8n8k32" | SM 75 | Sub-byte (s4/u4) |
| 0x04 | m8n8k64 | "m8n8k64" | SM 75 | Extended sub-byte |
| 0x05 | m8n8k128 | "m8n8k128" | SM 75 | Binary MMA (b1) |
| 0x06 | m8n32k16 | "m8n32k16" | SM 70 | Volta/Turing WMMA f16/bf16; 30+ matching builtins in cicc_strings.json |
| 0x10 | m16n8k4 | "m16n8k4" | SM 75 | Turing HMMA, f64 on Ampere |
| 0x11 | m16n8k8 | "m16n8k8" | SM 75 | Turing/Ampere HMMA |
| 0x12 | m16n8k16 | "m16n8k16" | SM 80 | Ampere HMMA (bf16, tf32) |
| 0x13 | m16n8k32 | "m16n8k32" | SM 75 | Ampere integer |
| 0x14 | m16n8k64 | "m16n8k64" | SM 75 | Sub-byte integer |
| 0x15 | m16n8k128 | "m16n8k128" | SM 75 | Extended sub-byte |
| 0x16 | m16n8k256 | "m16n8k256" | SM 75 | Binary/sub-byte (largest) |
| 0x17 | m16n16k16 | "m16n16k16" | SM 90 | Square shape, Hopper+ |
| 0x18 | m32n8k16 | "m32n8k16" | SM 80 | Tall shape |
| 0x19 | m16n16k8 | "m16n16k8" | SM 70 | WMMA f16 path |
Unknown shape codes hit the default branch and abort via BUG(). String emission uses fast-path integer stores: *(QWORD *)ptr = 0x36316B386E36316DLL emits "m16n8k16" as a single 8-byte write.
Type Enum
| Enum | Type | Bits | PTX String |
|---|---|---|---|
| 1 | b1 | 1 | "b1" |
| 2 | s4 | 4 | "s4" |
| 3 | u4 | 4 | "u4" |
| 4 | s8 | 8 | "s8" |
| 5 | u8 | 8 | "u8" |
| 6 | f16 | 16 | "f16" |
| 7 | bf16 | 16 | "bf16" |
| 8 | tf32 | 19 | "tf32" |
| 9 | f64 | 64 | "f64" |
| 10 | f32 | 32 | "f32" |
| 11 | s32 | 32 | "s32" |
Any other type code produces fatal error: "Wrong MMA element type".
Shape x Type x Architecture Summary
| Shape | A/B Types | Acc Types | Min SM | Notes |
|---|---|---|---|---|
| m8n8k4 | f16 | f16, f32 | SM 70 | Original Volta |
| m16n8k4 | f64 | f64 | SM 80 | Ampere f64 |
| m16n8k8 | f16 | f16, f32 | SM 75 | Turing+ |
| m16n8k16 | f16, bf16, tf32 | f16, f32 | SM 80 | Ampere+ |
| m16n16k8 | f16 | f16, f32 | SM 70 | WMMA path |
| m16n16k16 | f16, bf16 | f16, f32 | SM 90 | Hopper+ |
| m32n8k16 | f16, bf16 | f16, f32 | SM 80 | Tall shape |
| m8n8k16 | s8, u8 | s32 | SM 72 | Integer MMA |
| m16n8k16 | s8, u8 | s32 | SM 75 | Turing+ |
| m16n8k32 | s8, u8 | s32 | SM 75 | Turing+ |
| m8n8k32 | s4, u4 | s32 | SM 75 | Sub-byte |
| m16n8k64 | s4, u4 | s32 | SM 75 | Sub-byte |
| m8n8k64 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m16n8k128 | s4, u4 | s32 | SM 75 | Extended sub-byte |
| m8n8k128 | b1 | s32 | SM 75 | Binary (.and.popc, .xor.popc) |
| m16n8k256 | b1 | s32 | SM 75 | Binary extended |
| WGMMA (N=8..256) | f16, bf16, tf32, f8 | f16, f32 | SM 90 | Warp-group, 33 N values |
| tcgen05 (10 variants) | mxf8f6f4, mxf4, mxf4nvf4, f16, bf16, tf32, i8, fp4 | varies | SM 100 | See mma-codegen |
tcgen05 Blackwell Overview (SM 100+)
Full tcgen05 documentation lives in Tensor / MMA Codegen. Key points summarized here for cross-reference:
Data type kinds (bits [8:6] of the tcgen05 operand, emitted by sub_35F3330):
| Value | Kind | Notes |
|---|---|---|
| 0 | mxf4nvf4 | MX FP4 with NV FP4 |
| 1 | f8f6f4 | FP8/FP6/FP4 standard |
| 2 | mxf8f6f4 | MX variant of f8f6f4 |
| 3 | f16 | Half precision |
| 4 | i8 | 8-bit integer (arch-conditional only) |
| 5 | tf32 | TensorFloat-32 |
| 7 | mxf4 | MX FP4 |
Modifier fields:
| Modifier | Bits | Description |
|---|---|---|
Weight stationary (.ws) | bit 0 | NOT compatible with cta_group::2, mxf8f6f4, fp4 |
| CTA group | bit 1 | cta_group::1 (clear) or cta_group::2 (set) |
| Scale vector size | [3:2] | .scale_vec::1X/2X/4X with per-type constraints |
| Scale input accumulator | bit 4 | f16/tf32 only; NOT on sm_100a/sm_103a |
| Sparsity | bit 5 | MXF4/MXF4NVF4 restricted to arch-conditional |
| Block scale alias | [10:9] | .block16 (0) or .block32 (1) |
Collector modes (emitted by sub_35F38B0):
| Value | Modifier | Constraint |
|---|---|---|
| 1 | .collector::a::lastuse | — |
| 2 | .collector::a::fill | Cannot combine with .ashift |
| 3 | .collector::a::use | Cannot combine with .ashift |
tcgen05 scaled MMA operand builder (sub_21E8CD0 / sub_35F3E90):
| Bit | Query | Clear | Set |
|---|---|---|---|
| 0 | "scaleD" | "0" | "1" |
| 1 | "negA" | "1" (no negate) | "-1" (negate) |
| 2 | "negB" | "1" | "-1" |
| 3 | "transA" | "0" | "1" |
| 4 | "transB" | "0" | "1" |
Note the asymmetry: scaleD/transA/transB emit boolean "0"/"1" strings, while negA/negB emit sign multiplier "1"/"-1" strings. This reflects the PTX encoding where negation is a multiplication factor and transpose is a boolean flag.
LLVM Intrinsic Reference
| Intrinsic ID | Name | Usage |
|---|---|---|
| 9062 | llvm.nvvm.wgmma.fence.aligned | WGMMA fence (3 type overloads) |
| 9067 | llvm.nvvm.wgmma.mma.async | WGMMA MMA async load (2 type overloads) |
| 9145 | llvm.nvvm.wgmma.store | WGMMA store (2 type overloads) |
| 10654–10779 | llvm.nvvm.wgmma.mma.async.* | Per-N-dimension variants (126 entries, even=int, odd=float) |
| 5304–5447 | (WGMMA 5-D grid) | Per-N x shared x type x scale x variant (144 entries) |
| 4905–4940 | (tcgen05 ISD opcodes) | tcgen05.mma variants (36 opcodes via 10-way shape switch) |
NVPTX Backend Duplicate Functions
All MMA emission functions exist in two structurally identical copies:
| AsmPrinter (0x21Dxxxx) | NVPTX Backend (0x36Exxxx) | Function |
|---|---|---|
sub_21DFBF0 | sub_36E91F0 | hmmastc (HMMA store C) |
sub_21E0360 | sub_36E72A0 | hmmaldab (HMMA load A/B) |
sub_21E0630 | sub_36E7580 | hmmaldc (HMMA load C) |
sub_21E0870 | sub_36E77C0 | hmmamma (HMMA MMA) |
sub_21E1280 | sub_36E7B50 | immaldab (IMMA load A/B) |
sub_21E15D0 | sub_36E7EA0 | immaldc (IMMA load C) |
sub_21E1830 | sub_36E8110 | immastc (IMMA store C) |
sub_21E1D20 | sub_36E8630 | immamma (IMMA MMA) |
sub_21E2280 | sub_36E8BD0 | bmmamma (Binary MMA) |
sub_21E8CD0 | sub_35F3E90 | tcgen05 scaled MMA operands |
The pairs differ only in error reporting (sub_16BD130 vs sub_C64ED0) and reference counting functions (sub_1623A60/sub_161E7C0 vs sub_B96E90/sub_B91220).
Cross-References
- Tensor / MMA Codegen — backend PTX emission, tcgen05 full detail
- NVPTX Opcodes — ISD opcode numbers
- SM 90 (Hopper) — WGMMA architecture context, TMA, cluster
- SM 100 (Blackwell) — tcgen05 architecture context
- Builtin System — hash table, registration, dispatch architecture