AsmPrinter and Per-SM Windows
Abstract
The final PTX emission layer turns selected operations, operands, attributes,
and module metadata into PTX text that ptxas accepts. By the time execution
arrives here, Tileiras has already lowered MLIR operations to NVVM/LLVM IR,
selected NVPTX machine instructions, and verified subtarget legality. The
printer's job is precise and narrow.
The implementation combines two generated printer roles. The MLIR-facing role
prints nvvm.* operation assembly from TableGen assembly-format descriptions.
The LLVM-MC role prints selected MCInst opcodes from the NVPTX asm-writer
table. They share operand printers, modifier printers, register-name
printing, and module-level PTX emission helpers.
Printer Roles
| Role | Input | Output | Primary responsibility |
|---|---|---|---|
| MLIR operation printer | nvvm.* op, operands, attributes | NVVM dialect assembly | Print op syntax and attributes. |
| MC instruction printer | MCInst opcode and operands | PTX instruction text | Print opcode mnemonic, operands, and suffixes. |
| NVPTX asm printer | LLVM module and machine functions | PTX module text | Print headers, directives, globals, and function bodies. |
The two instruction printers differ in when they run. The MLIR printer describes operations before final machine selection; the MC printer describes the exact PTX instruction after selection. Keep those phases separate in a reimplementation, even when the same helper functions print common modifiers.
The MLIR printer is generated by the dialect's TableGen assemblyFormat and is unrelated to PTX itself. The MC printer is the one PTX consumers care about: it walks an MCInst, looks up the opcode's print shape, and renders mnemonic, modifiers, and operands into the output stream. The next sections document the table layout, the shared-body partition, and the obfuscated mnemonic pool the MC printer reads from.
MC Print Shapes
The MC printer is generated in the style of LLVM AsmWriterEmitter: each
opcode maps to a print shape interleaving literal text, operand slots, and
modifier helpers. Most ordinary ALU, conversion, load/store, branch, and
call instructions share a small set of repeated shapes.
| Shape family | Example PTX family | Printed structure |
|---|---|---|
| One-source move | mov, cvt, simple special ops | mnemonic, destination, source |
| Two-source arithmetic | add, mul, and, or, xor | mnemonic, dst, lhs, rhs |
| Ternary arithmetic | mad, fma, selp | mnemonic, dst, a, b, c |
| Predicate compare | setp, predicate logic | predicate dst, operands, compare suffix |
| Load/store | ld, st, atom, vector memory | address-space suffix plus memory operand |
| Control flow | bra, call, ret, exit | target or call prototype operands |
| Matrix / tensor | mma, wgmma, tcgen05, TMA | shape/type/scope modifiers plus operand groups |
void print_mc_shape(const McPrintShape *shape, MCInst inst, raw_ostream *os) {
for (int i = 0; i < shape->item_count; ++i) {
PrintItem item = shape->items[i];
if (item.kind == PRINT_LITERAL) {
os_write(os, item.literal);
} else if (item.kind == PRINT_OPERAND) {
print_operand(inst, item.operand_index, os);
} else if (item.kind == PRINT_MODIFIER) {
print_modifier(inst, item.modifier_kind, item.operand_index, os);
}
}
}
The printer performs no subtarget legality checks. By the time an opcode reaches this layer, the selector and machine verifier have already decided it is legal for the chosen target. The printer only renders the selected opcode.
MCOperand Wire Format
The 6,388-case AsmPrinter dispatcher consumes MCInst records that are themselves arrays of 16-byte MCOperand slots. Every operand seen by printOperand, printMemOperand, and the modifier helpers shares one layout, which keeps the shared-body dispatcher's mov rax, [rdi + 16*rcx] idiom uniform across operand classes.
typedef struct MCOperand {
/*+0x00*/ uint8_t kind_flags; // bit 0 = immediate-vs-register
/*+0x01*/ uint8_t type_tag; // see type-tag table below
/*+0x02*/ uint8_t pad[6];
/*+0x08*/ uint64_t value; // imm value or register number
} MCOperand;
The kind_flags byte carries the discriminator the printer's switch (MO.getKind()) ladder reads first: bit 0 selects the immediate-versus-register branch, and the high bits carry the smaller Expr / FPImm cases the selector promotes when an operand needs a symbolic relocation. The type_tag byte is the operand's element type. Modifier helpers consult it independently of kind_flags because PTX type suffixes are orthogonal to the register-vs-immediate question.
The eight-byte value field holds either an immediate (zero-extended to 64 bits) or a register number drawn from the virtual or physical register banks. The six-byte pad between the discriminator pair and the value keeps the value field naturally 8-byte aligned without growing the struct to 24 bytes — a size that would slow the dispatcher's stride arithmetic.
Type Tag Enum
The type_tag byte indexes a small enum the Blackwell block-scale dispatch leans on. The values below are what the SM120 mma.block_scale family and the NVFP4 variants read when picking a .kind::* suffix; type tags below 12 cover the integer and predicate families and inherit from the LLVM MVT numbering.
| Tag | Type | Notes |
|---|---|---|
| 12 | f16 | Half-precision; selected by .kind::f16 and packed .f16x2 paths. |
| 15 | E4M3 (Float8E4M3FN) | OCP FP8 with 4-bit exponent, finite-only mantissa. |
| 16 | E5M2 (Float8E5M2) | OCP FP8 with 5-bit exponent, finite-and-Inf mantissa. |
| 17 | E2M1 (Float4E2M1FN) | OCP MXFP4 / NVFP4 leaf; the BYTE1 == 17 && BYTE2 == 17 predicate inside the block-scale expander gates .scale_vec::2X and .scale_vec::4X. |
| 19 | tf32 | Selected by .kind::tf32; consumed by the legacy mma.sync family on SM80 and later. |
| 20 | mxf8f6f4 | Block-scaled mixed FP8/FP6/FP4 kind tag; selected by .kind::mxf8f6f4. |
| 21 | mxf4 | Block-scaled FP4 kind tag; selected by .kind::mxf4 and .kind::mxf4nvf4. |
SM120 Block-Scale Control Word
The SM120 block-scale MMA expander reads a packed control word from MCInst + 280. That offset is the seventh MCOperand slot for the dense form (MI 5468) and the eighth slot for the sparse form (MI 5469). Slot layout: A-fragment, B-fragment, C-accumulator, D-output, SFA handle, SFB handle, control word, optional sparse metadata. The control word's low bytes carry the type tags for A and B, the kind tag, the scale-vec format, and a sync-aligned bit the expander explicitly rejects: only the non-sync-aligned form survives into PTX, and a mismatch produces the nvvm.mma.blockscale currently supports non-sync aligned variants only! diagnostic.
⚡ QUIRK —
mma.block_scale.sync.alignedactually emits non-sync-aligned PTX The mnemonic family name ismma.block_scale.sync.aligned, but the SM120 expander reads the sync-aligned bit and rejects it: only the non-sync-aligned variant ever survives into PTX. A frontend that sets the sync-aligned bit hoping to match the mnemonic gets the diagnosticnvvm.mma.blockscale currently supports non-sync aligned variants only!rather than a working kernel — the bit and the family name disagree by design.
struct NvvmMmaBlockScaleCtrl { // 32-bit packed, MCInst + 280
uint32_t a_type_tag : 5; // BYTE1: 15 = E4M3, 16 = E5M2, 17 = E2M1
uint32_t b_type_tag : 5; // BYTE2: same coding as a_type_tag
uint32_t kind_tag : 3; // BYTE4: 20 = mxf8f6f4, 21 = mxf4
uint32_t block_scale_fmt : 2; // BYTE6 bits [4:5]: scale_vec::{1X, 2X, 4X}
uint32_t sync_aligned : 1; // BYTE6 bit 3: must be 0 for block-scale
uint32_t reserved :16;
};
The AsmPrinter consults this enum when emitting mma.block_scale.sync.aligned and the NVFP4 variants. The pre-flight filters before the shared body fires read BYTE6 & 0x38 to pick the scale-vec lane, then check BYTE1 / BYTE2 against the legal type pairs for that lane. Scale-vec 1X accepts the mixed-FP4/FP6/FP8 leaf set; scale-vec 2X requires both type tags to equal 17 (E2M1) and the block-scale format byte to equal 20; scale-vec 4X keeps the same E2M1 pair but binds the kind tag to 21 (mxf4nvf4) and emits NVFP4-only. Each filter carries a verbatim diagnostic string in the binary, which the printer never sees because the MC expander rejects the malformed MCInst before it reaches a shared body.
Operand Slot Stride
The dispatcher walks operand slots at the 16-byte stride the wire format dictates, but the SM120 block-scale expander reaches its later slots at a 40-byte MachineInstr-class stride. MCInst + 280 therefore corresponds to operand index 7 measured at the inflated MI stride, not at the MCOperand stride. A reimplementation that mirrors the AsmPrinter must keep the two strides separate: modifier helpers read MCOperand records at offsets 16 * opIdx; the MC expander reads MI-class operand metadata at offsets 40 * opIdx. Confusing the two produces operand-aliasing bugs the verifier does not catch, because both layouts agree on slot zero.
Per-SM Reachability
Per-SM availability is enforced before printing. One opcode always prints one PTX spelling; an "SM window" describes which target tiers can reach that opcode from instruction selection.
| Target window | Families that become reachable |
|---|---|
| SM70 / SM75 | Baseline ALU, memory, control flow, and NVVM-intrinsic MMA paths. |
| SM80 / SM86 / SM87 | mma.sync, mma.sp.sync, ldmatrix, cp.async, async barriers. |
| SM89 | SM80 surface plus FP8 mma.sync type combinations. |
| SM90 / SM90a | WGMMA, mbarrier, cluster operations, and TMA tensor-copy forms. |
| SM100 / SM103 | tcgen05, tensor-memory forms, Blackwell cluster/TMA extensions. |
| SM120 / SM121 | Block-scaled warp MMA without tensor-memory tcgen05. |
The separation keeps code generation robust: feature predicates decide which instruction is selected, and the printer stays deterministic.
Modifier Helpers
Most complexity in PTX printing comes from suffix construction. Modifier helpers map small encoded operands or attributes into PTX tokens.
| Modifier family | Examples |
|---|---|
| Rounding and saturation | .rn, .rz, .sat, .satfinite |
| Memory space | .global, .shared, .shared::cta, .shared::cluster, .local |
| Memory ordering | .relaxed, .acquire, .release, .acq_rel, .sc |
| Scope | .cta, .cluster, .gpu, .sys, .cta::cluster |
| Cache policy | .ca, .cg, .L2::cache_hint |
| CTA grouping | .cta_group::1, .cta_group::2 |
| Matrix shape | .m16n8k32, .m64nNkK, .128x256b |
| Matrix type | .f16, .bf16, .tf32, .e4m3, .e5m2, .s8, .u8 |
| Tensor-copy suffixes | .im2col, .multicast::cluster, .mbarrier::complete_tx::bytes |
void print_ldst_code(LdStCode code, raw_ostream *os) {
print_memory_space(code.space, os);
print_cache_policy(code.cache_policy, os);
print_memory_order(code.order, os);
print_scope(code.scope, os);
print_type_suffix(code.type, os);
}
Load/store printing is modifier-driven by design. Address-space tokens are not ordinary free-text operands; they decode from the selected load/store code so invalid order/scope/address-space combinations get rejected before reaching this point.
Modifier Emission Order
PTX is whitespace-tolerant but suffix-order-strict. ptxas parses each
instruction by stripping a dotted suffix sequence off the mnemonic in a
fixed order; reordering the suffixes — even when each individual token is
legal — yields a parse error. The print shapes are built around this
grammar, so a reimplementation must emit modifiers in the same canonical
order the parser expects rather than in the order the operand list happens
to enumerate them.
Atomic Operations
atom[.scope][.semantics].<op>.<type>[.addrspace]
| Slot | Token set |
|---|---|
| scope | .cta, .cluster, .gpu, .sys (default: device) |
| semantics | .relaxed, .acquire, .release, .acq_rel (default: .relaxed) |
| op | .add, .min, .max, .and, .or, .xor, .exch, .cas, .inc, .dec |
| type | .b32, .b64, .u32, .u64, .s32, .f16, .f32, .f64, .f16x2, .bf16, .bf16x2 |
| addrspace | .global, .shared, .shared::cta, .shared::cluster |
Examples:
atom.relaxed.cta.add.u32.shared [%rd0], %r1;
atom.acq_rel.gpu.cas.b64.global %rd0, [%rd1], %rd2, %rd3;
atom.release.cluster.add.f32 [%rd0], %f1;
Warp-Synchronous MMA
mma.sync.aligned.<shape>.<alayout>.<blayout>.<atype>.<btype>.<ctype>.<dtype>[.satfinite]
The fixed prefix mma.sync.aligned is invariant for the dense form. <shape>
encodes the M/N/K tile size (m8n8k4, m16n8k16, m16n8k32, m16n16k16,
and so on). <alayout> and <blayout> are .row or .col and are required
for the integer and FP8 variants; they are omitted for the FP16/BF16/TF32
half-precision forms where the layout is fixed by the shape. The four
type tokens always appear in the order A, B, C, D — never in the order the
operand list enumerates the fragments.
Examples:
mma.sync.aligned.m16n8k16.f32.f16.f16.f32 {%fd0,%fd1,%fd2,%fd3}, {%r0,%r1,%r2,%r3}, {%r4,%r5}, {%fd4,%fd5,%fd6,%fd7};
mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32 {%r0,%r1,%r2,%r3}, {%r4,%r5,%r6,%r7}, {%r8,%r9}, {%r10,%r11,%r12,%r13};
Sparse MMA
mma.sp::ordered_metadata.sync.aligned.<shape>.<atype>.<btype>.<ctype>.<dtype>[.satfinite]
The sparsity selector .sp::ordered_metadata sits in a fixed slot between
the mnemonic stem and the .sync.aligned infix. The metadata operand and
the selector byte are extra operands at the end of the print shape; their
suffix tokens do not move.
Warpgroup MMA
wgmma.mma_async.sync.aligned.<shape>.<dtype>.<atype>.<btype>[.scaleD][.scaleAB][.transA][.transB]
<shape> for WGMMA is the m64nNkK family. The destination type slot
precedes the A and B type slots — the inverse of the mma.sync ordering —
because the warpgroup form treats D as the architectural state and A/B as
streamed inputs. The optional scale-and-transpose suffixes follow in the
order scaleD, scaleAB, transA, transB; the printer omits each
suffix when its operand carries the default value.
Example:
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%fd0,...,%fd63}, %rd_descA, %rd_descB, 1, 1, 0, 0;
Tensor-Memory MMA (tcgen05)
tcgen05.mma[.cta_group::N][.scale_input_acc][.block_scale][.sp::ordered_metadata].<kind>.<shape>.<dtype>.<atype>.<btype>.<ctype>[.satfinite]
<kind> is one of .kind::f16, .kind::tf32, .kind::f8f6f4,
.kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4. The block-scale and
scale-input-acc flags are positional booleans whose presence depends on
operand bits documented in the SM120 control-word section above. The
suffix grammar is the strictest in the ISA: every optional token has a
fixed slot, and the parser rejects any reordering.
TMA Bulk-Tensor Copies
cp.async.bulk.tensor.<rank>d.<dst_space>.<src_space>[.<mode>].mbarrier::complete_tx::bytes[.multicast::cluster][.L2::cache_hint]
| Slot | Token set |
|---|---|
| rank | 1d, 2d, 3d, 4d, 5d |
| dst/src space | shared::cluster.global, global.shared::cta, shared::cta.shared::cluster |
| mode | .tile, .im2col, .im2col::w, .im2col::w::128 |
| barrier | .mbarrier::complete_tx::bytes (required for the load form) |
| multicast | .multicast::cluster (optional, only on load) |
| L2 hint | .L2::cache_hint (optional) |
Example:
cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint
[%rd_dst], [%rd_desc, {%r_x, %r_y}], [%rd_bar], %h_mask, %rd_hint;
Async Copies
cp.async.<dst_space>.<src_space>.<size>[.<cache_hint>][.L2::cache_hint][.commit_group]
<size> is the bytes-per-thread token (.4, .8, .16). The
.commit_group suffix is emitted by a separate print shape on the
companion cp.async.commit_group instruction; it does not stack onto the
copy itself. The printer enforces the grammar by laying out the modifier
helpers in the exact order above and never letting one helper print into
another's slot.
Suffix Slot Invariant
The print shapes share a slot invariant: every modifier helper consumes a
specific operand of the MCInst and renders into its assigned grammar
slot regardless of operand-vector order. A reimplementation that prints
suffixes by walking operands in order will produce strings that look
plausible but get rejected by ptxas. Always drive the suffix emission
from the print shape's slot table, not from the operand vector's index.
Module Emission
The outer NVPTXAsmPrinter emits PTX module structure around individual
instructions.
| Module element | Printed PTX |
|---|---|
| Header | .version, .target, .address_size |
| Kernel directives | .entry, .reqntid, .maxntid, .minnctapersm, .maxnreg |
| Cluster directives | .explicitcluster, .maxclusterrank, .blocksareclusters |
| Visibility | .visible, .extern, .weak |
| Globals | .global, .const, .texref, .surfref, .samplerref |
| Managed/unified metadata | .attribute(.managed), .attribute(.unified(...)) |
| Function frame | local depot, %SP, %SPL, virtual register declarations |
| Function body | brace-delimited PTX instructions |
void emit_ptx_module(Module module, NvptxTarget target, raw_ostream *os) {
emit_ptx_header(target, os);
emit_module_globals(module, os);
for (Function fn : module.functions()) {
emit_function_directives(fn, target, os);
emit_function_body_start(fn, os);
emit_machine_instructions(fn, os);
emit_function_body_end(fn, os);
}
}
The .blocksareclusters directive requires both thread-block dimensions and
a cluster dimension. Emitters must diagnose that combination early: a
header-only correction later cannot repair a malformed kernel launch
contract.
Module Header Directives
Every PTX module begins with three mandatory directives followed by an
optional .debug toggle. The header emission runs once per Module and
draws every value from the active NvptxSubtarget plus the
TargetMachine debug flag.
//
// Generated by NVIDIA NVPTX Compiler
//
// Compiler Build ID: <build id>
// Cuda compilation tools, release 13.1
// Based on NVVM 7.0.1
//
.version 8.4
.target sm_90a, debug
.address_size 64
| Directive | Source | Notes |
|---|---|---|
.version | PTX ISA version selected by the subtarget. | 8.4 for the CUDA 13.1 baseline; bumped per ISA-feature requirement. |
.target | Lowered SM name plus optional ,debug. | sm_90a adds the architecture-specific a suffix when SM90 architecture-specific intrinsics are used. |
.address_size | Pointer width of the host-device interface. | Always 64 in this build; the 32-bit path is removed. |
The .target line carries up to four comma-separated tokens: the SM name,
the optional a-suffix marker, the debug flag, and the optional
map_f64_to_f32 legacy flag. The printer picks the SM name from
Subtarget.getSmVersion(), appends a when the function or any global
references an architecture-specific feature (SM90a tensor memory, SM100
distributed shared memory, SM120 block-scale MMA), appends debug when
the TargetMachine debug level is non-zero, and appends map_f64_to_f32
only for the legacy fp64 emulation path that the modern compiler never
selects.
The header banner above the directives is a fixed-format comment block
the AsmPrinter emits before the first directive. The build-ID line lets
post-link tools correlate a .ptx artefact with the exact tileiras
binary that produced it; the NVVM-version line documents the bytecode
schema feeding the printer.
Kernel Directive Emission
When the AsmPrinter encounters a kernel function (ptx_kernel calling
convention on the LLVM function, equivalent to a nvvm.kernel attribute
on the MLIR side), it emits a fixed-order directive cluster before the
function body.
.visible .entry KernelName(
.param .b64 KernelName_param_0,
.param .b32 KernelName_param_1,
.param .align 8 .b8 KernelName_param_2[16]
)
.reqntid 128, 1, 1
.maxntid 256, 1, 1
.minnctapersm 2
.maxnreg 64
.maxclusterrank 8
.explicitcluster
{
// function body
}
| Slot | Directive | MIR/LLVM source attribute |
|---|---|---|
| 1 | .visible / .weak / .extern linkage marker | LLVM linkage (external, weak_odr, internal). |
| 2 | .entry plus name and parameter list | ptx_kernel calling convention. |
| 3 | .reqntid X, Y, Z | nvvm.reqntid attribute / !reqntid{x,y,z} metadata. |
| 4 | .maxntid X, Y, Z | nvvm.maxntid attribute. |
| 5 | .minnctapersm N | nvvm.minctasm / minnctapersm metadata. |
| 6 | .maxnreg N | nvvm.maxnreg metadata. |
| 7 | .maxclusterrank N | nvvm.cluster_max_blocks attribute. |
| 8 | .reqnctapercluster X, Y, Z | nvvm.cluster_dim attribute. |
| 9 | .explicitcluster | nvvm.explicit_cluster attribute. |
| 10 | .blocksareclusters | nvvm.blocks_are_clusters (SM90+). |
| 11 | { | opens the function body. |
The order is fixed: the printer never reorders these directives based on
attribute traversal order, and a reimplementation must emit slots that
exist in the same canonical order. Slots whose attribute is absent are
simply skipped — there is no placeholder. The parameter list inside
.entry(...) is a separate sub-emission that walks the function's formal
parameters in declaration order, picks .param storage modifiers from
each parameter's byval/align/type attributes, and emits the
.b8 paramN[size] form for aggregate parameters that arrived through
ABI-mandated indirection.
For non-kernel device functions the .entry token is replaced by
.func, the visibility marker may be .visible/.weak/.extern, the
parameter list takes a different syntactic shape, and slots 3 through 10
are omitted entirely. The shared sub-emitter is the same; only the
slot-table varies.
Mnemonics and Register Names
Mnemonic lookup is table-driven: the MC opcode indexes a generated table and returns the PTX mnemonic stem. Register printing decodes the logical NVPTX register class, then prints a PTX register prefix and the register number.
| Register class | PTX prefix | Width | Use |
|---|---|---|---|
| Predicate | %p | 1 bit | Predicates and condition flags |
| 16-bit GPR | %rs | 16 bits | Half-width integer storage |
| 32-bit GPR | %r | 32 bits | Integer and bit-pattern values |
| 64-bit GPR | %rd | 64 bits | Pointers and 64-bit integers |
| 32-bit float view | %f | 32 bits | Float spelling of the 32-bit bank |
| 64-bit float view | %fd | 64 bits | Float spelling of the 64-bit bank |
| 128-bit GPR | %rq | 128 bits | Wide descriptors and grouped operands |
| Special registers | named PTX registers | varies | %tid.x, %laneid, %clock64, etc. |
void print_register_name(NvptxRegister reg, raw_ostream *os) {
switch (reg.class_id) {
case REG_PRED:
os_printf(os, "%%p%u", reg.number);
return;
case REG_I16:
os_printf(os, "%%rs%u", reg.number);
return;
case REG_I32:
os_printf(os, "%%r%u", reg.number);
return;
case REG_I64:
os_printf(os, "%%rd%u", reg.number);
return;
case REG_F32:
os_printf(os, "%%f%u", reg.number);
return;
case REG_F64:
os_printf(os, "%%fd%u", reg.number);
return;
case REG_I128:
os_printf(os, "%%rq%u", reg.number);
return;
case REG_SPECIAL:
os_write(os, special_register_name(reg));
return;
}
fail("bad NVPTX register class");
}
The 32-bit integer and f32 views share one physical register bank. The instruction's type suffix decides whether the value is interpreted as bits, integer, or floating point.
Register Classes
Tileiras exposes the practical NVPTX register classes a reimplementation needs for instruction selection and printing.
| Class | PTX type string | Prefix | Notes |
|---|---|---|---|
| Predicate | .pred | %p | Boolean predicates. |
| 16-bit | .b16 | %rs | Half-width integer or packed data. |
| 32-bit | .b32 | %r | Main scalar bank. |
| 32-bit float view | printed as type suffix | %f | Alias view of the 32-bit bank. |
| Special | special names | named | PTX special-register reads. |
| 64-bit | .b64 | %rd | Pointers, descriptors, and 64-bit scalars. |
| 128-bit | .b128 | %rq | Wide grouped operands and descriptors. |
Read the f32 class as a typed view over the 32-bit register bank. COPY
lowering can use ordinary 32-bit moves; instruction printing selects %f
spelling only when the operand is used as a floating-point register.
Operand Constraint Class Glossary
The PTX inline-asm constraint letters that user code passes to asm("..." :: "r"(x), "l"(p), "f"(v)) correspond one-to-one with NVPTX register
classes. The printer reads the constraint class off each MachineOperand,
selects the matching register prefix, and renders the operand's number
through print_register_name. The constraint letters are also the
canonical naming convention for register banks in PTX documentation, so a
reimplementation needs both directions: constraint-class to printed
prefix on output, and printed prefix back to constraint-class for inline
assembly parsing.
| Class | Width | Constraint letter | Register prefix | Type strings | Typical uses |
|---|---|---|---|---|---|
b | 1 bit | b | %p | .pred | branch guards, predicate logic, setp destinations |
h | 16 bits | h | %rs | .b16, .u16, .s16 | multicast masks, im2col offsets, FP16 raw bits |
r | 32 bits | r | %r | .b32, .u32, .s32 | most arithmetic operands, 32-bit pointers in shared address space |
l | 64 bits | l | %rd | .b64, .u64, .s64 | generic pointers, TMA descriptors, L2 cache hints |
f | 32 bits | f | %f | .f32 | FP32 arithmetic |
d | 64 bits | d | %fd | .f64 | FP64 arithmetic |
q | 128 bits | q | %rq | .b128 | wide descriptors, 128-bit vector loads, FP128 storage |
The class-to-prefix mapping is deterministic. The printer never picks %r
or %f based on the surrounding instruction's type suffix; it picks the
prefix from the operand's register class, and the type suffix is a
separate modifier the print shape emits independently. The 32-bit integer
and f32 banks share physical registers but carry distinct logical
classes, which is how the printer knows whether to spell a 32-bit live
range as %r3 or %f3.
const char *constraint_class_to_prefix(ConstraintClass cls) {
switch (cls) {
case CLASS_PRED: return "%p";
case CLASS_I16: return "%rs";
case CLASS_I32: return "%r";
case CLASS_I64: return "%rd";
case CLASS_F32: return "%f";
case CLASS_F64: return "%fd";
case CLASS_I128: return "%rq";
}
fail("bad constraint class");
}
Three printing rules cover the corner cases. First, when an operand is a
vector that the load/store needs to spell as a brace-grouped tuple, the
printer emits {%r0, %r1, %r2, %r3} and increments the sequence number
once per element; the constraint class still selects the prefix. Second,
parameter-passing operands use the %pa, %fa, %ia, %la, %h, %hh
prefix family rather than the generic prefixes; the printer routes these
through a parallel switch that consults the operand's parameter-class
flag. Third, special registers (%tid.x, %ntid.y, %laneid, %warpid,
%clock, %clock64, %globaltimer, %pm0, %envreg{0..31}) bypass the
prefix table entirely and print their canonical PTX name from the
physical-register pool documented in the
printRegName section below.
AsmWriter String Pools and the XOR-3 Walking Cipher
The MC printer's two string pools live not in .rodata like a stock LLVM
build, but XOR-encrypted in .data, decrypted in place during pre-main
initialization. The mnemonic pool occupies 0x5A4C080..0x5A656F0 —
exactly 103,536 bytes (~105 KB) — and stores every PTX opcode stem plus
three AsmWriter tail fragments. The physical-register-name pool occupies
0x5A4BE20..0x5A4C06A — exactly 586 bytes — and stores the 90 named
registers printRegName returns for class 0. Both pools share one cipher
and one initialization shape; the two init routines sub_1BD1810 and
sub_1BD1830 are 20-line bodies differing only in begin and end pointers.
The cipher is a walking byte XOR with a fully deterministic key schedule
k[i] = (3 * i) mod 256. Because gcd(3, 256) = 1, the orbit
0, 3, 6, ..., 255, 2, 5, ... enumerates every residue 0..255 exactly once
per 256-byte window before repeating. The cipher is linear, byte-granular,
and trivially invertible by replaying the same pass over the ciphertext. A
strings tileiras scan surfaces zero PTX mnemonics; the design target is
defeating naive static analysis, not real security.
void xor3_decode(uint8_t *p, uint8_t *end) {
uint8_t k = 0;
while (p != end) { *p++ ^= k; k += 3; }
}
After decryption the mnemonic pool decodes to 3,067 NUL-delimited chunks.
The first three are AsmWriter tail fragments emitted after the final
operand of an instruction template: "},\n\t\t", "},\n\t", ";\n\t". The
remaining ~5,500 entries are PTX mnemonic stems plus the per-template
prefix tokens AsmWriterEmitter packs in front of long-form opcodes. The
register pool decodes to 90 names covering seven virtual-register-class
prefixes (%p, %rs, %r, %rd, %f, %fd, %rq), the
parameter-passing prefixes (%da, %fa, %ia, %la, %h, %hh, plus
32 %envreg{0..31} slots), and the three frame registers %Depot, %SP,
%SPL.
Each decrypter is gated by a pthread_once flag in .bss. The mnemonic
pool uses dword_5B4F4D8; the register pool uses dword_5B4F4C0. Once the
walking-XOR pass returns, getMnemonic runs the Itanium-ABI "safely
initialize local static" dance to publish the decoded pool's base address
into a shared cache: __cxa_guard_acquire (sub_44A8A10) on the
byte_5B4F4C8 lock byte, the cache write to qword_5B4F4D0, then
__cxa_guard_release (sub_44A8AC0). Subsequent calls observe the
already-acquired guard and skip directly to the table lookup.
getMnemonic and the Offset Tables
MC opcode lookup is a pair of parallel .rodata tables indexed by the
32-bit MC opcode. dword_4D4D360 carries the packed mnemonic descriptor:
low 17 bits hold the byte offset into the decoded mnemonic pool, high 15
bits hold the per-opcode tail-state bits the print shape consults to pick
a trailing separator. The companion table dword_4D468C0 carries the
operand-width flags, modifier class index, and fragment indices that drive
the modifier helpers. Both tables hold 6,824 entries of uint32 each.
The first 293 slots are zero, matching LLVM's generic TargetOpcode
prelude (G_ADD, G_PHI, G_IMPLICIT_DEF, and the rest); real NVPTX
opcodes begin at index 293.
const char *getMnemonic(const MCInst *MI) {
pthread_once(&once_mnemonic, init_mnemonic_pool);
if (!guard_once && __cxa_guard_acquire(&guard_once)) {
base_ptr_cache = (uintptr_t)&mnemonic_pool[0];
__cxa_guard_release(&guard_once);
}
uint32_t opc = MI->Opcode;
uint32_t offset_tb = mnemonic_offsets[opc]; // dword_4D4D360
uint32_t companion = mnemonic_companion[opc]; // dword_4D468C0
if (offset_tb | ((uint64_t)companion << 32)) {
uint32_t off = offset_tb & 0x1FFFF;
return (const char *)(base_ptr_cache + off - 1);
}
return NULL;
}
The - 1 bias is LLVM's standard AsmWriterEmitter convention. Offset 0
encodes the "no mnemonic" sentinel; the first real mnemonic sits at pool
byte 0 and is reached through stored offset 1. The combined zero check
offset_tb | (companion << 32) lets one 64-bit test reject opcodes that
have neither a mnemonic nor a companion descriptor — no two separate
branches. The 17-bit offset field saturates at 131,072 bytes; the
103,536-byte payload leaves 26.6 % headroom, consistent with the
SM110/SM121 forward-projection allowance baked into this build's MC opcode
table. The empirical maximum lo17 observed is 103,806, which sits
inside the trailing NUL slack the post-link encoder reserves at the end of
the pool.
The companion-word dword_4D468C0 decomposition is inferred from the
415-value cardinality plus the canonical OpIdx << 8 | ModCls shape that
AsmWriterEmitter emits: the low byte carries operand-width flags, the next byte
indexes the tail-fragment list, the third byte indexes the prefix-fragment
list, and the top byte selects an AsmWriter modifier class. Mark this MED
confidence; the byte boundaries are stable but the per-byte semantic naming
has not been cross-checked against a TableGen build.
| Table | Address | Stride | Count | Purpose |
|---|---|---|---|---|
word_4D46800 | 0x4D46800 | u16 | 96 | Register-name offsets into the 586-byte pool. |
dword_4D4D360 | 0x4D4D360 | u32 | 6,824 | Mnemonic offset (low 17 bits) plus tail state (high 15 bits). |
dword_4D468C0 | 0x4D468C0 | u32 | 6,824 | AsmWriter companion: operand-width flags, modifier class, fragment indices. |
The .bss state cluster lives at four contiguous addresses with an 8-byte
alignment pad between the guard byte and the cache pointer:
| Slot | Address | Width | Role |
|---|---|---|---|
dword_5B4F4C0 | 0x5B4F4C0 | pthread_once_t | Register-name pool once-flag. |
byte_5B4F4C8 | 0x5B4F4C8 | uint8_t | Itanium-ABI __cxa_guard_* lock byte for the cache write. |
qword_5B4F4D0 | 0x5B4F4D0 | uintptr_t | Cached base pointer of the decoded mnemonic pool. |
dword_5B4F4D8 | 0x5B4F4D8 | pthread_once_t | Mnemonic pool once-flag. |
printRegName and the Register Pool
printRegName (sub_1BD1EB0) is the printer's 8-way class switch. The
top four bits of the MCRegister value select the class; the low 28 bits
carry the sequence number for virtual registers or the MCReg enum value
for physical registers. Class 0 is the physical path: it triggers
pthread_once(&dword_5B4F4C0, init_reg_name_pool), indexes the
register-name pool through word_4D46800[r - 1], then writes the resulting
NUL-terminated string to the output stream. The - 1 bias mirrors the
mnemonic-pool convention; MCRegister 0 is the "no register" sentinel.
Classes 1 through 7 print the seven virtual-register prefixes catalogued
in the Mnemonics and Register Names
section above, concatenated with the low 28 bits as a decimal sequence
number. The decoded 586-byte pool therefore carries the strings the
class-0 path returns directly — physical envregs, parameter-passing
prefixes, frame registers — plus the per-class "first virtual register of
class N" exemplars the MC layer emits for register-allocation dumps.
MC Switch Shape Population Table
The MC printer's dispatcher is a single switch over 6,388 MC opcodes
covering every selectable NVPTX instruction in this build. case arms do
not each carry a distinct printer body; most fall through to one of 297
shared body labels that emit textually-identical PTX. Compression is
steep: the fifteen most-populated shared bodies absorb roughly 80 % of
all dispatch traffic, and the top-eight bodies alone shape the bulk of the
printer's output behaviour.
| Shared body | Opcode count | Skeleton |
|---|---|---|
LABEL_18639 | 120 | slot 3 plus 1-byte terminator (e.g. cvt.{type}.{type}.{rnd} {reg}, {reg};) |
0x1C40201 | 577 | mnemonic, operands, terminator (the dominant ALU shape) |
0x1C4097D | 267 | 4-operand form (e.g. mma.sync.aligned {rd, rs1, rs2, rs3};) |
0x1C40B59 | 108 | [addr], reg form (e.g. st {addr}, {reg};) |
0x1C409DF | 96 | 2-operand reg-reg (e.g. mov.{type} {rd}, {rs};) |
0x1C40AAF | 84 | 3-operand plus modifier (e.g. set.{cmp}.{type} {rd}, {rs1}, {rs2};) |
LABEL_18984 | (slot 8) | predicated form |
LABEL_18729 | (slot 5) | conditional form |
Shared body 0x1C40201 is the centre of mass of the table. It prints the
canonical "mnemonic, comma-separated operands, semicolon" shape every
ordinary ALU instruction takes, so a reimplementation that gets exactly
this one body right covers more than 9 % of MC opcodes on its own. The
four 0x1C40... bodies together (the 201, 97D, B59, 9DF, AAF
cluster) form the ALU and memory backbone; the two LABEL_* entries cover
the predicated and conditional forms the selector synthesises around
guarded instructions.
18-Family Non-MMA Partition
Beyond the top-eight shared bodies, the AsmPrinter groups the non-MMA
opcodes into eighteen families F1 through F18 keyed by operand shape.
The partition is operand-driven rather than mnemonic-driven: opcodes whose
PTX text differs in mnemonic but agrees in operand layout share a family,
and the per-opcode flag word in the jump table picks the right mnemonic
stem out of the XOR-3 mnemonic pool. The largest family is F1, the
load/store mega-group, which carries its own inner sub-dispatcher because
LD and ST must discriminate address space, predication, sparsity, and
tensor-memory variants before the shared body can pick the correct PTX
spelling.
Inner LD/ST 13-Label Table
F1 dispatches through a 13-label sub-table that splits the load/store
opcodes by address space, predication, and tensor-memory variant:
| Sub-label | Opcode family |
|---|---|
| 1 | Generic load. |
| 2 | Generic store. |
| 3 | Constant-AS load. |
| 4 | Param-AS load. |
| 5 | Shared-AS load/store. |
| 6 | Global-AS load/store. |
| 7 | Local-AS load. |
| 8 | TMEM load/store. |
| 9 | Bulk-tensor load. |
| 10 | Bulk-tensor store. |
| 11 | Sparse load. |
| 12 | Predicated load. |
| 13 | Predicated store. |
Sub-labels 8 through 11 are the tensor-memory and bulk-tensor variants reachable only from the SM100 window onward; sub-labels 12 and 13 carry the predicated forms the selector emits when a guard predicate survives into the MC layer.
MC Opcode to Label Cascade
Each MC opcode in the dispatcher's jump table holds a u32 index packed
as {shared_label_id << 16 | per-op flag bits}. The high 16 bits select
the shared body; the low 16 bits encode the operand-flavour tweaks (FTZ,
satfinite, modifier kind, address-space hint, and so on) the shared body
consults via currentOp & 0xFFFF. This is the canonical AsmWriterEmitter
compression pattern, fingerprinted in the binary by the
mov rax, [rdi + 4*rcx] plus mov ecx, eax; shr rax, 16 idiom at the
entry of every shared body.
void emit_mc_opcode(MCInst inst, raw_ostream *os) {
uint32_t entry = jump_table[inst.opcode];
uint32_t label = entry >> 16;
uint32_t flags = entry & 0xFFFF;
print_shared_body(label, inst, flags, os);
}
The split is strict: a shared body never reads opcode identity directly.
It reads only the flag word handed to it by the dispatcher, plus the
operand indices its skeleton dictates. That is what lets 297 bodies
absorb 6,388 opcodes without per-opcode branching inside the bodies
themselves.
Worked Example: MMA M16N8K16
The cleanest way to see every layer of the printer cooperate is to trace
a single MachineInst from selection output through to the emitted PTX
line. The example below uses MMA_F32_F16_F16_F32_M16N8K16 — the
canonical FP16-input, FP32-accumulate warp MMA the FP32 GEMM kernels lean
on heaviest.
MachineInst Shape
After instruction selection, the MI carries four operand groups: an
A-fragment, a B-fragment, a C-accumulator that doubles as the D
destination, and an optional satfinite flag. The shape is fixed by the
TableGen instruction definition:
%fd0:f32, %fd1:f32, %fd2:f32, %fd3:f32 =
MMA_F32_F16_F16_F32_M16N8K16
%r0:i32, %r1:i32, %r2:i32, %r3:i32, // A fragment (4 x packed.f16x2)
%r4:i32, %r5:i32, // B fragment (2 x packed.f16x2)
%fd4:f32, %fd5:f32, %fd6:f32, %fd7:f32 // C accumulator
The destination is the first four operands; the A, B, C fragments follow
in source order. The register classes are mixed: A and B use the 32-bit
integer bank (%r) because PTX packs two FP16 lanes into one 32-bit
register, while C and D use the 32-bit float bank (%fd) because the
accumulator is held in FP32 doubles.
Print Shape Lookup
The MC opcode index for MMA_F32_F16_F16_F32_M16N8K16 resolves to a
companion-table entry whose high 16 bits select shared body 0x1C4097D
(the 4-operand-group form documented in the shape population
table) and whose low 16 bits encode
the type-suffix tuple (D=f32, A=f16, B=f16, C=f32) plus the
satfinite=0 bit.
The mnemonic offset retrieved from dword_4D4D360 resolves through the
XOR-3-decoded pool to the stem mma.sync.aligned.m16n8k16. The trailing
type tokens .f32.f16.f16.f32 are appended by the modifier helper
print_mma_type_tuple, which reads the four type-tag bytes off the flag
word and looks each one up in the type-string table.
Modifier Emission
The suffix slot table for the dense MMA family is:
| Slot | Token | Source |
|---|---|---|
| 1 | .sync | fixed for mma.sync |
| 2 | .aligned | fixed for the warp-synchronous form |
| 3 | .m16n8k16 | shape from print-shape entry |
| 4 | .f32 | D type tag from flag word |
| 5 | .f16 | A type tag from flag word |
| 6 | .f16 | B type tag from flag word |
| 7 | .f32 | C type tag from flag word |
| 8 | .satfinite | optional, gated by flag bit |
The printer walks this slot table in order. No operand-vector traversal participates — the slot table is the ground truth for suffix order.
Operand Group Printing
The shared body 0x1C4097D reads four operand-group descriptors out of
its skeleton:
| Group | Operand range | Register class | Width | Printed form |
|---|---|---|---|---|
| D | dest[0..3] | f32 (%fd) | 4 | {%fd0, %fd1, %fd2, %fd3} |
| A | src[0..3] | i32 (%r) | 4 | {%r0, %r1, %r2, %r3} |
| B | src[4..5] | i32 (%r) | 2 | {%r4, %r5} |
| C | src[6..9] | f32 (%fd) | 4 | {%fd4, %fd5, %fd6, %fd7} |
Each group emits an open brace, comma-separated operand prints, a closing brace, and a separator. The group sizes (4, 4, 2, 4) come from the MMA shape's fragment dimensions, not from the operand vector — a 16x8 D output produces 4 FP32 lanes per thread, a 16x16 A input produces 4 packed FP16x2 lanes per thread, and an 8x16 B input produces 2 packed FP16x2 lanes per thread.
Final Emitted Line
After all four groups have printed, the skeleton emits the closing semicolon and a newline:
mma.sync.aligned.m16n8k16.f32.f16.f16.f32 {%fd0, %fd1, %fd2, %fd3}, {%r0, %r1, %r2, %r3}, {%r4, %r5}, {%fd4, %fd5, %fd6, %fd7};
Three observations carry across to other MMA variants. First, the suffix
order and operand-group order are independent: the suffix list runs D,
A, B, C in type-token form while the operand list runs D, A, B, C in
brace-group form, and they happen to agree only because the print shape
arranged it that way. Second, the register-class mismatch between A/B
(integer bank) and C/D (float bank) is deliberate — PTX treats FP16
inputs as bit patterns and FP32 accumulators as floats, and the printer
faithfully picks the bank from each operand's class. Third, the
.satfinite slot is gone from the printed line because its flag bit was
zero; the printer never emits empty-slot placeholders.
The same skeleton drives every entry in the mma.sync.aligned.* family.
Switching to mma.sync.aligned.m16n8k32.f32.bf16.bf16.f32 changes only
the shape token in slot 3 and the type tokens in slots 4 through 7; the
operand-group sizes adjust to match the new fragment dimensions; and the
final line keeps the exact same syntactic shape. That regularity is what
lets one shared body cover several hundred MMA opcodes.
Reimplementation Budget
Recreating the dispatcher in a clean reimplementation takes four artefacts. The first three are bulk data; the fourth carries the per-operand encoding rules the modifier helpers consume.
| Artefact | Shape | Source |
|---|---|---|
| Shared-body table | 297 labels with operand-skeleton scripts. | Disassembled from the asm-printer body cluster. |
| MC opcode jump table | 6,388 entries of {label, flags} packed as u32. | Reconstructed from the per-case jump targets. |
| Mnemonic pool | XOR-3-encrypted, 3,067 NUL-delimited chunks. | See the XOR-3 walking cipher section. |
| Offset tables | dword_4D4D360, dword_4D468C0, word_4D46800. | See getMnemonic and the Offset Tables. |
Ship these four artefacts as static data plus the 297 body scripts as a
small interpreter loop and the entire MC printer surface comes back
without per-opcode code generation. The
ISelDAG and MatcherTable — Selector Layers
page documents the upstream stage feeding these MC opcodes into the printer, and the
XOR-3 walking cipher
section above covers the mnemonic-pool side of the budget.