AsmPrinter and Per-SM Windows

Abstract

The final PTX emission layer turns selected operations, operands, attributes, and module metadata into PTX text that ptxas accepts. By the time execution arrives here, Tileiras has already lowered MLIR operations to NVVM/LLVM IR, selected NVPTX machine instructions, and verified subtarget legality. The printer's job is precise and narrow.

The implementation combines two generated printer roles. The MLIR-facing role prints nvvm.* operation assembly from TableGen assembly-format descriptions. The LLVM-MC role prints selected MCInst opcodes from the NVPTX asm-writer table. They share operand printers, modifier printers, register-name printing, and module-level PTX emission helpers.

Printer Roles

Role	Input	Output	Primary responsibility
MLIR operation printer	`nvvm.*` op, operands, attributes	NVVM dialect assembly	Print op syntax and attributes.
MC instruction printer	`MCInst` opcode and operands	PTX instruction text	Print opcode mnemonic, operands, and suffixes.
NVPTX asm printer	LLVM module and machine functions	PTX module text	Print headers, directives, globals, and function bodies.

The two instruction printers differ in when they run. The MLIR printer describes operations before final machine selection; the MC printer describes the exact PTX instruction after selection. Keep those phases separate in a reimplementation, even when the same helper functions print common modifiers.

The MLIR printer is generated by the dialect's TableGen assemblyFormat and is unrelated to PTX itself. The MC printer is the one PTX consumers care about: it walks an MCInst, looks up the opcode's print shape, and renders mnemonic, modifiers, and operands into the output stream. The next sections document the table layout, the shared-body partition, and the obfuscated mnemonic pool the MC printer reads from.

MC Print Shapes

The MC printer is generated in the style of LLVM AsmWriterEmitter: each opcode maps to a print shape interleaving literal text, operand slots, and modifier helpers. Most ordinary ALU, conversion, load/store, branch, and call instructions share a small set of repeated shapes.

Shape family	Example PTX family	Printed structure
One-source move	`mov`, `cvt`, simple special ops	mnemonic, destination, source
Two-source arithmetic	`add`, `mul`, `and`, `or`, `xor`	mnemonic, dst, lhs, rhs
Ternary arithmetic	`mad`, `fma`, `selp`	mnemonic, dst, a, b, c
Predicate compare	`setp`, predicate logic	predicate dst, operands, compare suffix
Load/store	`ld`, `st`, `atom`, vector memory	address-space suffix plus memory operand
Control flow	`bra`, `call`, `ret`, `exit`	target or call prototype operands
Matrix / tensor	`mma`, `wgmma`, `tcgen05`, TMA	shape/type/scope modifiers plus operand groups

void print_mc_shape(const McPrintShape *shape, MCInst inst, raw_ostream *os) {
    for (int i = 0; i < shape->item_count; ++i) {
        PrintItem item = shape->items[i];

        if (item.kind == PRINT_LITERAL) {
            os_write(os, item.literal);
        } else if (item.kind == PRINT_OPERAND) {
            print_operand(inst, item.operand_index, os);
        } else if (item.kind == PRINT_MODIFIER) {
            print_modifier(inst, item.modifier_kind, item.operand_index, os);
        }
    }
}

The printer performs no subtarget legality checks. By the time an opcode reaches this layer, the selector and machine verifier have already decided it is legal for the chosen target. The printer only renders the selected opcode.

MCOperand Wire Format

The 6,388-case AsmPrinter dispatcher consumes MCInst records that are themselves arrays of 16-byte MCOperand slots. Every operand seen by printOperand, printMemOperand, and the modifier helpers shares one layout, which keeps the shared-body dispatcher's mov rax, [rdi + 16*rcx] idiom uniform across operand classes.

typedef struct MCOperand {
    /*+0x00*/ uint8_t  kind_flags;   // bit 0 = immediate-vs-register
    /*+0x01*/ uint8_t  type_tag;     // see type-tag table below
    /*+0x02*/ uint8_t  pad[6];
    /*+0x08*/ uint64_t value;        // imm value or register number
} MCOperand;

The kind_flags byte carries the discriminator the printer's switch (MO.getKind()) ladder reads first: bit 0 selects the immediate-versus-register branch, and the high bits carry the smaller Expr / FPImm cases the selector promotes when an operand needs a symbolic relocation. The type_tag byte is the operand's element type. Modifier helpers consult it independently of kind_flags because PTX type suffixes are orthogonal to the register-vs-immediate question.

The eight-byte value field holds either an immediate (zero-extended to 64 bits) or a register number drawn from the virtual or physical register banks. The six-byte pad between the discriminator pair and the value keeps the value field naturally 8-byte aligned without growing the struct to 24 bytes — a size that would slow the dispatcher's stride arithmetic.

Type Tag Enum

The type_tag byte indexes a small enum the Blackwell block-scale dispatch leans on. The values below are what the SM120 mma.block_scale family and the NVFP4 variants read when picking a .kind::* suffix; type tags below 12 cover the integer and predicate families and inherit from the LLVM MVT numbering.

Tag	Type	Notes
12	`f16`	Half-precision; selected by `.kind::f16` and packed `.f16x2` paths.
15	`E4M3` (`Float8E4M3FN`)	OCP FP8 with 4-bit exponent, finite-only mantissa.
16	`E5M2` (`Float8E5M2`)	OCP FP8 with 5-bit exponent, finite-and-Inf mantissa.
17	`E2M1` (`Float4E2M1FN`)	OCP MXFP4 / NVFP4 leaf; the `BYTE1 == 17 && BYTE2 == 17` predicate inside the block-scale expander gates `.scale_vec::2X` and `.scale_vec::4X`.
19	`tf32`	Selected by `.kind::tf32`; consumed by the legacy `mma.sync` family on SM80 and later.
20	`mxf8f6f4`	Block-scaled mixed FP8/FP6/FP4 kind tag; selected by `.kind::mxf8f6f4`.
21	`mxf4`	Block-scaled FP4 kind tag; selected by `.kind::mxf4` and `.kind::mxf4nvf4`.

SM120 Block-Scale Control Word

The SM120 block-scale MMA expander reads a packed control word from MCInst + 280. That offset is the seventh MCOperand slot for the dense form (MI 5468) and the eighth slot for the sparse form (MI 5469). Slot layout: A-fragment, B-fragment, C-accumulator, D-output, SFA handle, SFB handle, control word, optional sparse metadata. The control word's low bytes carry the type tags for A and B, the kind tag, the scale-vec format, and a sync-aligned bit the expander explicitly rejects: only the non-sync-aligned form survives into PTX, and a mismatch produces the nvvm.mma.blockscale currently supports non-sync aligned variants only! diagnostic.

⚡ QUIRK — mma.block_scale.sync.aligned actually emits non-sync-aligned PTX The mnemonic family name is mma.block_scale.sync.aligned, but the SM120 expander reads the sync-aligned bit and rejects it: only the non-sync-aligned variant ever survives into PTX. A frontend that sets the sync-aligned bit hoping to match the mnemonic gets the diagnostic nvvm.mma.blockscale currently supports non-sync aligned variants only! rather than a working kernel — the bit and the family name disagree by design.

struct NvvmMmaBlockScaleCtrl {  // 32-bit packed, MCInst + 280
    uint32_t a_type_tag        : 5;  // BYTE1: 15 = E4M3, 16 = E5M2, 17 = E2M1
    uint32_t b_type_tag        : 5;  // BYTE2: same coding as a_type_tag
    uint32_t kind_tag          : 3;  // BYTE4: 20 = mxf8f6f4, 21 = mxf4
    uint32_t block_scale_fmt   : 2;  // BYTE6 bits [4:5]: scale_vec::{1X, 2X, 4X}
    uint32_t sync_aligned      : 1;  // BYTE6 bit 3: must be 0 for block-scale
    uint32_t reserved          :16;
};

The AsmPrinter consults this enum when emitting mma.block_scale.sync.aligned and the NVFP4 variants. The pre-flight filters before the shared body fires read BYTE6 & 0x38 to pick the scale-vec lane, then check BYTE1 / BYTE2 against the legal type pairs for that lane. Scale-vec 1X accepts the mixed-FP4/FP6/FP8 leaf set; scale-vec 2X requires both type tags to equal 17 (E2M1) and the block-scale format byte to equal 20; scale-vec 4X keeps the same E2M1 pair but binds the kind tag to 21 (mxf4nvf4) and emits NVFP4-only. Each filter carries a verbatim diagnostic string in the binary, which the printer never sees because the MC expander rejects the malformed MCInst before it reaches a shared body.

Operand Slot Stride

The dispatcher walks operand slots at the 16-byte stride the wire format dictates, but the SM120 block-scale expander reaches its later slots at a 40-byte MachineInstr-class stride. MCInst + 280 therefore corresponds to operand index 7 measured at the inflated MI stride, not at the MCOperand stride. A reimplementation that mirrors the AsmPrinter must keep the two strides separate: modifier helpers read MCOperand records at offsets 16 * opIdx; the MC expander reads MI-class operand metadata at offsets 40 * opIdx. Confusing the two produces operand-aliasing bugs the verifier does not catch, because both layouts agree on slot zero.

Per-SM Reachability

Per-SM availability is enforced before printing. One opcode always prints one PTX spelling; an "SM window" describes which target tiers can reach that opcode from instruction selection.

Target window	Families that become reachable
SM70 / SM75	Baseline ALU, memory, control flow, and NVVM-intrinsic MMA paths.
SM80 / SM86 / SM87	`mma.sync`, `mma.sp.sync`, `ldmatrix`, `cp.async`, async barriers.
SM89	SM80 surface plus FP8 `mma.sync` type combinations.
SM90 / SM90a	WGMMA, mbarrier, cluster operations, and TMA tensor-copy forms.
SM100 / SM103	`tcgen05`, tensor-memory forms, Blackwell cluster/TMA extensions.
SM120 / SM121	Block-scaled warp MMA without tensor-memory `tcgen05`.

The separation keeps code generation robust: feature predicates decide which instruction is selected, and the printer stays deterministic.

Modifier Helpers

Most complexity in PTX printing comes from suffix construction. Modifier helpers map small encoded operands or attributes into PTX tokens.

Modifier family	Examples
Rounding and saturation	`.rn`, `.rz`, `.sat`, `.satfinite`
Memory space	`.global`, `.shared`, `.shared::cta`, `.shared::cluster`, `.local`
Memory ordering	`.relaxed`, `.acquire`, `.release`, `.acq_rel`, `.sc`
Scope	`.cta`, `.cluster`, `.gpu`, `.sys`, `.cta::cluster`
Cache policy	`.ca`, `.cg`, `.L2::cache_hint`
CTA grouping	`.cta_group::1`, `.cta_group::2`
Matrix shape	`.m16n8k32`, `.m64nNkK`, `.128x256b`
Matrix type	`.f16`, `.bf16`, `.tf32`, `.e4m3`, `.e5m2`, `.s8`, `.u8`
Tensor-copy suffixes	`.im2col`, `.multicast::cluster`, `.mbarrier::complete_tx::bytes`

void print_ldst_code(LdStCode code, raw_ostream *os) {
    print_memory_space(code.space, os);
    print_cache_policy(code.cache_policy, os);
    print_memory_order(code.order, os);
    print_scope(code.scope, os);
    print_type_suffix(code.type, os);
}

Load/store printing is modifier-driven by design. Address-space tokens are not ordinary free-text operands; they decode from the selected load/store code so invalid order/scope/address-space combinations get rejected before reaching this point.

Modifier Emission Order

PTX is whitespace-tolerant but suffix-order-strict. ptxas parses each instruction by stripping a dotted suffix sequence off the mnemonic in a fixed order; reordering the suffixes — even when each individual token is legal — yields a parse error. The print shapes are built around this grammar, so a reimplementation must emit modifiers in the same canonical order the parser expects rather than in the order the operand list happens to enumerate them.

Atomic Operations

atom[.scope][.semantics].<op>.<type>[.addrspace]

Slot	Token set
scope	`.cta`, `.cluster`, `.gpu`, `.sys` (default: device)
semantics	`.relaxed`, `.acquire`, `.release`, `.acq_rel` (default: `.relaxed`)
op	`.add`, `.min`, `.max`, `.and`, `.or`, `.xor`, `.exch`, `.cas`, `.inc`, `.dec`
type	`.b32`, `.b64`, `.u32`, `.u64`, `.s32`, `.f16`, `.f32`, `.f64`, `.f16x2`, `.bf16`, `.bf16x2`
addrspace	`.global`, `.shared`, `.shared::cta`, `.shared::cluster`

Examples:

atom.relaxed.cta.add.u32.shared    [%rd0], %r1;
atom.acq_rel.gpu.cas.b64.global    %rd0, [%rd1], %rd2, %rd3;
atom.release.cluster.add.f32       [%rd0], %f1;

Warp-Synchronous MMA

mma.sync.aligned.<shape>.<alayout>.<blayout>.<atype>.<btype>.<ctype>.<dtype>[.satfinite]

The fixed prefix mma.sync.aligned is invariant for the dense form. <shape> encodes the M/N/K tile size (m8n8k4, m16n8k16, m16n8k32, m16n16k16, and so on). <alayout> and <blayout> are .row or .col and are required for the integer and FP8 variants; they are omitted for the FP16/BF16/TF32 half-precision forms where the layout is fixed by the shape. The four type tokens always appear in the order A, B, C, D — never in the order the operand list enumerates the fragments.

Examples:

mma.sync.aligned.m16n8k16.f32.f16.f16.f32    {%fd0,%fd1,%fd2,%fd3}, {%r0,%r1,%r2,%r3}, {%r4,%r5}, {%fd4,%fd5,%fd6,%fd7};
mma.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32   {%r0,%r1,%r2,%r3}, {%r4,%r5,%r6,%r7}, {%r8,%r9}, {%r10,%r11,%r12,%r13};

Sparse MMA

mma.sp::ordered_metadata.sync.aligned.<shape>.<atype>.<btype>.<ctype>.<dtype>[.satfinite]

The sparsity selector .sp::ordered_metadata sits in a fixed slot between the mnemonic stem and the .sync.aligned infix. The metadata operand and the selector byte are extra operands at the end of the print shape; their suffix tokens do not move.

Warpgroup MMA

wgmma.mma_async.sync.aligned.<shape>.<dtype>.<atype>.<btype>[.scaleD][.scaleAB][.transA][.transB]

<shape> for WGMMA is the m64nNkK family. The destination type slot precedes the A and B type slots — the inverse of the mma.sync ordering — because the warpgroup form treats D as the architectural state and A/B as streamed inputs. The optional scale-and-transpose suffixes follow in the order scaleD, scaleAB, transA, transB; the printer omits each suffix when its operand carries the default value.

Example:

wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16    {%fd0,...,%fd63}, %rd_descA, %rd_descB, 1, 1, 0, 0;

Tensor-Memory MMA (tcgen05)

tcgen05.mma[.cta_group::N][.scale_input_acc][.block_scale][.sp::ordered_metadata].<kind>.<shape>.<dtype>.<atype>.<btype>.<ctype>[.satfinite]

<kind> is one of .kind::f16, .kind::tf32, .kind::f8f6f4, .kind::mxf8f6f4, .kind::mxf4, .kind::mxf4nvf4. The block-scale and scale-input-acc flags are positional booleans whose presence depends on operand bits documented in the SM120 control-word section above. The suffix grammar is the strictest in the ISA: every optional token has a fixed slot, and the parser rejects any reordering.

TMA Bulk-Tensor Copies

cp.async.bulk.tensor.<rank>d.<dst_space>.<src_space>[.<mode>].mbarrier::complete_tx::bytes[.multicast::cluster][.L2::cache_hint]

Slot	Token set
rank	`1d`, `2d`, `3d`, `4d`, `5d`
dst/src space	`shared::cluster.global`, `global.shared::cta`, `shared::cta.shared::cluster`
mode	`.tile`, `.im2col`, `.im2col::w`, `.im2col::w::128`
barrier	`.mbarrier::complete_tx::bytes` (required for the load form)
multicast	`.multicast::cluster` (optional, only on load)
L2 hint	`.L2::cache_hint` (optional)

Example:

cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint
    [%rd_dst], [%rd_desc, {%r_x, %r_y}], [%rd_bar], %h_mask, %rd_hint;

Async Copies

cp.async.<dst_space>.<src_space>.<size>[.<cache_hint>][.L2::cache_hint][.commit_group]

<size> is the bytes-per-thread token (.4, .8, .16). The .commit_group suffix is emitted by a separate print shape on the companion cp.async.commit_group instruction; it does not stack onto the copy itself. The printer enforces the grammar by laying out the modifier helpers in the exact order above and never letting one helper print into another's slot.

Suffix Slot Invariant

The print shapes share a slot invariant: every modifier helper consumes a specific operand of the MCInst and renders into its assigned grammar slot regardless of operand-vector order. A reimplementation that prints suffixes by walking operands in order will produce strings that look plausible but get rejected by ptxas. Always drive the suffix emission from the print shape's slot table, not from the operand vector's index.

Module Emission

The outer NVPTXAsmPrinter emits PTX module structure around individual instructions.

Module element	Printed PTX
Header	`.version`, `.target`, `.address_size`
Kernel directives	`.entry`, `.reqntid`, `.maxntid`, `.minnctapersm`, `.maxnreg`
Cluster directives	`.explicitcluster`, `.maxclusterrank`, `.blocksareclusters`
Visibility	`.visible`, `.extern`, `.weak`
Globals	`.global`, `.const`, `.texref`, `.surfref`, `.samplerref`
Managed/unified metadata	`.attribute(.managed)`, `.attribute(.unified(...))`
Function frame	local depot, `%SP`, `%SPL`, virtual register declarations
Function body	brace-delimited PTX instructions

void emit_ptx_module(Module module, NvptxTarget target, raw_ostream *os) {
    emit_ptx_header(target, os);
    emit_module_globals(module, os);

    for (Function fn : module.functions()) {
        emit_function_directives(fn, target, os);
        emit_function_body_start(fn, os);
        emit_machine_instructions(fn, os);
        emit_function_body_end(fn, os);
    }
}

The .blocksareclusters directive requires both thread-block dimensions and a cluster dimension. Emitters must diagnose that combination early: a header-only correction later cannot repair a malformed kernel launch contract.

Module Header Directives

Every PTX module begins with three mandatory directives followed by an optional .debug toggle. The header emission runs once per Module and draws every value from the active NvptxSubtarget plus the TargetMachine debug flag.

//
// Generated by NVIDIA NVPTX Compiler
//
// Compiler Build ID: <build id>
// Cuda compilation tools, release 13.1
// Based on NVVM 7.0.1
//

.version 8.4
.target sm_90a, debug
.address_size 64

Directive	Source	Notes
`.version`	PTX ISA version selected by the subtarget.	`8.4` for the CUDA 13.1 baseline; bumped per ISA-feature requirement.
`.target`	Lowered SM name plus optional `,debug`.	`sm_90a` adds the architecture-specific `a` suffix when SM90 architecture-specific intrinsics are used.
`.address_size`	Pointer width of the host-device interface.	Always `64` in this build; the 32-bit path is removed.

The .target line carries up to four comma-separated tokens: the SM name, the optional a-suffix marker, the debug flag, and the optional map_f64_to_f32 legacy flag. The printer picks the SM name from Subtarget.getSmVersion(), appends a when the function or any global references an architecture-specific feature (SM90a tensor memory, SM100 distributed shared memory, SM120 block-scale MMA), appends debug when the TargetMachine debug level is non-zero, and appends map_f64_to_f32 only for the legacy fp64 emulation path that the modern compiler never selects.

The header banner above the directives is a fixed-format comment block the AsmPrinter emits before the first directive. The build-ID line lets post-link tools correlate a .ptx artefact with the exact tileiras binary that produced it; the NVVM-version line documents the bytecode schema feeding the printer.

Kernel Directive Emission

When the AsmPrinter encounters a kernel function (ptx_kernel calling convention on the LLVM function, equivalent to a nvvm.kernel attribute on the MLIR side), it emits a fixed-order directive cluster before the function body.

.visible .entry KernelName(
    .param .b64 KernelName_param_0,
    .param .b32 KernelName_param_1,
    .param .align 8 .b8 KernelName_param_2[16]
)
.reqntid 128, 1, 1
.maxntid 256, 1, 1
.minnctapersm 2
.maxnreg 64
.maxclusterrank 8
.explicitcluster
{
    // function body
}

Slot	Directive	MIR/LLVM source attribute
1	`.visible` / `.weak` / `.extern` linkage marker	LLVM linkage (`external`, `weak_odr`, `internal`).
2	`.entry` plus name and parameter list	`ptx_kernel` calling convention.
3	`.reqntid X, Y, Z`	`nvvm.reqntid` attribute / `!reqntid{x,y,z}` metadata.
4	`.maxntid X, Y, Z`	`nvvm.maxntid` attribute.
5	`.minnctapersm N`	`nvvm.minctasm` / `minnctapersm` metadata.
6	`.maxnreg N`	`nvvm.maxnreg` metadata.
7	`.maxclusterrank N`	`nvvm.cluster_max_blocks` attribute.
8	`.reqnctapercluster X, Y, Z`	`nvvm.cluster_dim` attribute.
9	`.explicitcluster`	`nvvm.explicit_cluster` attribute.
10	`.blocksareclusters`	`nvvm.blocks_are_clusters` (SM90+).
11	`{`	opens the function body.

The order is fixed: the printer never reorders these directives based on attribute traversal order, and a reimplementation must emit slots that exist in the same canonical order. Slots whose attribute is absent are simply skipped — there is no placeholder. The parameter list inside .entry(...) is a separate sub-emission that walks the function's formal parameters in declaration order, picks .param storage modifiers from each parameter's byval/align/type attributes, and emits the .b8 paramN[size] form for aggregate parameters that arrived through ABI-mandated indirection.

For non-kernel device functions the .entry token is replaced by .func, the visibility marker may be .visible/.weak/.extern, the parameter list takes a different syntactic shape, and slots 3 through 10 are omitted entirely. The shared sub-emitter is the same; only the slot-table varies.

Mnemonics and Register Names

Mnemonic lookup is table-driven: the MC opcode indexes a generated table and returns the PTX mnemonic stem. Register printing decodes the logical NVPTX register class, then prints a PTX register prefix and the register number.

Register class	PTX prefix	Width	Use
Predicate	`%p`	1 bit	Predicates and condition flags
16-bit GPR	`%rs`	16 bits	Half-width integer storage
32-bit GPR	`%r`	32 bits	Integer and bit-pattern values
64-bit GPR	`%rd`	64 bits	Pointers and 64-bit integers
32-bit float view	`%f`	32 bits	Float spelling of the 32-bit bank
64-bit float view	`%fd`	64 bits	Float spelling of the 64-bit bank
128-bit GPR	`%rq`	128 bits	Wide descriptors and grouped operands
Special registers	named PTX registers	varies	`%tid.x`, `%laneid`, `%clock64`, etc.

void print_register_name(NvptxRegister reg, raw_ostream *os) {
    switch (reg.class_id) {
    case REG_PRED:
        os_printf(os, "%%p%u", reg.number);
        return;
    case REG_I16:
        os_printf(os, "%%rs%u", reg.number);
        return;
    case REG_I32:
        os_printf(os, "%%r%u", reg.number);
        return;
    case REG_I64:
        os_printf(os, "%%rd%u", reg.number);
        return;
    case REG_F32:
        os_printf(os, "%%f%u", reg.number);
        return;
    case REG_F64:
        os_printf(os, "%%fd%u", reg.number);
        return;
    case REG_I128:
        os_printf(os, "%%rq%u", reg.number);
        return;
    case REG_SPECIAL:
        os_write(os, special_register_name(reg));
        return;
    }

    fail("bad NVPTX register class");
}

The 32-bit integer and f32 views share one physical register bank. The instruction's type suffix decides whether the value is interpreted as bits, integer, or floating point.

Register Classes

Tileiras exposes the practical NVPTX register classes a reimplementation needs for instruction selection and printing.

Class	PTX type string	Prefix	Notes
Predicate	`.pred`	`%p`	Boolean predicates.
16-bit	`.b16`	`%rs`	Half-width integer or packed data.
32-bit	`.b32`	`%r`	Main scalar bank.
32-bit float view	printed as type suffix	`%f`	Alias view of the 32-bit bank.
Special	special names	named	PTX special-register reads.
64-bit	`.b64`	`%rd`	Pointers, descriptors, and 64-bit scalars.
128-bit	`.b128`	`%rq`	Wide grouped operands and descriptors.

Read the f32 class as a typed view over the 32-bit register bank. COPY lowering can use ordinary 32-bit moves; instruction printing selects %f spelling only when the operand is used as a floating-point register.

Operand Constraint Class Glossary

The PTX inline-asm constraint letters that user code passes to asm("..." :: "r"(x), "l"(p), "f"(v)) correspond one-to-one with NVPTX register classes. The printer reads the constraint class off each MachineOperand, selects the matching register prefix, and renders the operand's number through print_register_name. The constraint letters are also the canonical naming convention for register banks in PTX documentation, so a reimplementation needs both directions: constraint-class to printed prefix on output, and printed prefix back to constraint-class for inline assembly parsing.

Class	Width	Constraint letter	Register prefix	Type strings	Typical uses
`b`	1 bit	`b`	`%p`	`.pred`	branch guards, predicate logic, `setp` destinations
`h`	16 bits	`h`	`%rs`	`.b16`, `.u16`, `.s16`	multicast masks, im2col offsets, FP16 raw bits
`r`	32 bits	`r`	`%r`	`.b32`, `.u32`, `.s32`	most arithmetic operands, 32-bit pointers in shared address space
`l`	64 bits	`l`	`%rd`	`.b64`, `.u64`, `.s64`	generic pointers, TMA descriptors, L2 cache hints
`f`	32 bits	`f`	`%f`	`.f32`	FP32 arithmetic
`d`	64 bits	`d`	`%fd`	`.f64`	FP64 arithmetic
`q`	128 bits	`q`	`%rq`	`.b128`	wide descriptors, 128-bit vector loads, FP128 storage

The class-to-prefix mapping is deterministic. The printer never picks %r or %f based on the surrounding instruction's type suffix; it picks the prefix from the operand's register class, and the type suffix is a separate modifier the print shape emits independently. The 32-bit integer and f32 banks share physical registers but carry distinct logical classes, which is how the printer knows whether to spell a 32-bit live range as %r3 or %f3.

const char *constraint_class_to_prefix(ConstraintClass cls) {
    switch (cls) {
    case CLASS_PRED:   return "%p";
    case CLASS_I16:    return "%rs";
    case CLASS_I32:    return "%r";
    case CLASS_I64:    return "%rd";
    case CLASS_F32:    return "%f";
    case CLASS_F64:    return "%fd";
    case CLASS_I128:   return "%rq";
    }

    fail("bad constraint class");
}

Three printing rules cover the corner cases. First, when an operand is a vector that the load/store needs to spell as a brace-grouped tuple, the printer emits {%r0, %r1, %r2, %r3} and increments the sequence number once per element; the constraint class still selects the prefix. Second, parameter-passing operands use the %pa, %fa, %ia, %la, %h, %hh prefix family rather than the generic prefixes; the printer routes these through a parallel switch that consults the operand's parameter-class flag. Third, special registers (%tid.x, %ntid.y, %laneid, %warpid, %clock, %clock64, %globaltimer, %pm0, %envreg{0..31}) bypass the prefix table entirely and print their canonical PTX name from the physical-register pool documented in the printRegName section below.

AsmWriter String Pools and the XOR-3 Walking Cipher

The MC printer's two string pools live not in .rodata like a stock LLVM build, but XOR-encrypted in .data, decrypted in place during pre-main initialization. The mnemonic pool occupies 0x5A4C080..0x5A656F0 — exactly 103,536 bytes (~105 KB) — and stores every PTX opcode stem plus three AsmWriter tail fragments. The physical-register-name pool occupies 0x5A4BE20..0x5A4C06A — exactly 586 bytes — and stores the 90 named registers printRegName returns for class 0. Both pools share one cipher and one initialization shape; the two init routines sub_1BD1810 and sub_1BD1830 are 20-line bodies differing only in begin and end pointers.

The cipher is a walking byte XOR with a fully deterministic key schedule k[i] = (3 * i) mod 256. Because gcd(3, 256) = 1, the orbit 0, 3, 6, ..., 255, 2, 5, ... enumerates every residue 0..255 exactly once per 256-byte window before repeating. The cipher is linear, byte-granular, and trivially invertible by replaying the same pass over the ciphertext. A strings tileiras scan surfaces zero PTX mnemonics; the design target is defeating naive static analysis, not real security.

void xor3_decode(uint8_t *p, uint8_t *end) {
    uint8_t k = 0;
    while (p != end) { *p++ ^= k; k += 3; }
}

After decryption the mnemonic pool decodes to 3,067 NUL-delimited chunks. The first three are AsmWriter tail fragments emitted after the final operand of an instruction template: "},\n\t\t", "},\n\t", ";\n\t". The remaining ~5,500 entries are PTX mnemonic stems plus the per-template prefix tokens AsmWriterEmitter packs in front of long-form opcodes. The register pool decodes to 90 names covering seven virtual-register-class prefixes (%p, %rs, %r, %rd, %f, %fd, %rq), the parameter-passing prefixes (%da, %fa, %ia, %la, %h, %hh, plus 32 %envreg{0..31} slots), and the three frame registers %Depot, %SP, %SPL.

Each decrypter is gated by a pthread_once flag in .bss. The mnemonic pool uses dword_5B4F4D8; the register pool uses dword_5B4F4C0. Once the walking-XOR pass returns, getMnemonic runs the Itanium-ABI "safely initialize local static" dance to publish the decoded pool's base address into a shared cache: __cxa_guard_acquire (sub_44A8A10) on the byte_5B4F4C8 lock byte, the cache write to qword_5B4F4D0, then __cxa_guard_release (sub_44A8AC0). Subsequent calls observe the already-acquired guard and skip directly to the table lookup.

getMnemonic and the Offset Tables

MC opcode lookup is a pair of parallel .rodata tables indexed by the 32-bit MC opcode. dword_4D4D360 carries the packed mnemonic descriptor: low 17 bits hold the byte offset into the decoded mnemonic pool, high 15 bits hold the per-opcode tail-state bits the print shape consults to pick a trailing separator. The companion table dword_4D468C0 carries the operand-width flags, modifier class index, and fragment indices that drive the modifier helpers. Both tables hold 6,824 entries of uint32 each. The first 293 slots are zero, matching LLVM's generic TargetOpcode prelude (G_ADD, G_PHI, G_IMPLICIT_DEF, and the rest); real NVPTX opcodes begin at index 293.

const char *getMnemonic(const MCInst *MI) {
    pthread_once(&once_mnemonic, init_mnemonic_pool);

    if (!guard_once && __cxa_guard_acquire(&guard_once)) {
        base_ptr_cache = (uintptr_t)&mnemonic_pool[0];
        __cxa_guard_release(&guard_once);
    }

    uint32_t opc       = MI->Opcode;
    uint32_t offset_tb = mnemonic_offsets[opc];      // dword_4D4D360
    uint32_t companion = mnemonic_companion[opc];    // dword_4D468C0

    if (offset_tb | ((uint64_t)companion << 32)) {
        uint32_t off = offset_tb & 0x1FFFF;
        return (const char *)(base_ptr_cache + off - 1);
    }
    return NULL;
}

The - 1 bias is LLVM's standard AsmWriterEmitter convention. Offset 0 encodes the "no mnemonic" sentinel; the first real mnemonic sits at pool byte 0 and is reached through stored offset 1. The combined zero check offset_tb | (companion << 32) lets one 64-bit test reject opcodes that have neither a mnemonic nor a companion descriptor — no two separate branches. The 17-bit offset field saturates at 131,072 bytes; the 103,536-byte payload leaves 26.6 % headroom, consistent with the SM110/SM121 forward-projection allowance baked into this build's MC opcode table. The empirical maximum lo17 observed is 103,806, which sits inside the trailing NUL slack the post-link encoder reserves at the end of the pool.

The companion-word dword_4D468C0 decomposition is inferred from the 415-value cardinality plus the canonical OpIdx << 8 | ModCls shape that AsmWriterEmitter emits: the low byte carries operand-width flags, the next byte indexes the tail-fragment list, the third byte indexes the prefix-fragment list, and the top byte selects an AsmWriter modifier class. Mark this MED confidence; the byte boundaries are stable but the per-byte semantic naming has not been cross-checked against a TableGen build.

Table	Address	Stride	Count	Purpose
`word_4D46800`	`0x4D46800`	`u16`	`96`	Register-name offsets into the 586-byte pool.
`dword_4D4D360`	`0x4D4D360`	`u32`	`6,824`	Mnemonic offset (low 17 bits) plus tail state (high 15 bits).
`dword_4D468C0`	`0x4D468C0`	`u32`	`6,824`	AsmWriter companion: operand-width flags, modifier class, fragment indices.

The .bss state cluster lives at four contiguous addresses with an 8-byte alignment pad between the guard byte and the cache pointer:

Slot	Address	Width	Role
`dword_5B4F4C0`	`0x5B4F4C0`	`pthread_once_t`	Register-name pool once-flag.
`byte_5B4F4C8`	`0x5B4F4C8`	`uint8_t`	Itanium-ABI `__cxa_guard_*` lock byte for the cache write.
`qword_5B4F4D0`	`0x5B4F4D0`	`uintptr_t`	Cached base pointer of the decoded mnemonic pool.
`dword_5B4F4D8`	`0x5B4F4D8`	`pthread_once_t`	Mnemonic pool once-flag.

printRegName and the Register Pool

printRegName (sub_1BD1EB0) is the printer's 8-way class switch. The top four bits of the MCRegister value select the class; the low 28 bits carry the sequence number for virtual registers or the MCReg enum value for physical registers. Class 0 is the physical path: it triggers pthread_once(&dword_5B4F4C0, init_reg_name_pool), indexes the register-name pool through word_4D46800[r - 1], then writes the resulting NUL-terminated string to the output stream. The - 1 bias mirrors the mnemonic-pool convention; MCRegister 0 is the "no register" sentinel. Classes 1 through 7 print the seven virtual-register prefixes catalogued in the Mnemonics and Register Names section above, concatenated with the low 28 bits as a decimal sequence number. The decoded 586-byte pool therefore carries the strings the class-0 path returns directly — physical envregs, parameter-passing prefixes, frame registers — plus the per-class "first virtual register of class N" exemplars the MC layer emits for register-allocation dumps.

MC Switch Shape Population Table

The MC printer's dispatcher is a single switch over 6,388 MC opcodes covering every selectable NVPTX instruction in this build. case arms do not each carry a distinct printer body; most fall through to one of 297 shared body labels that emit textually-identical PTX. Compression is steep: the fifteen most-populated shared bodies absorb roughly 80 % of all dispatch traffic, and the top-eight bodies alone shape the bulk of the printer's output behaviour.

Shared body	Opcode count	Skeleton
`LABEL_18639`	`120`	slot 3 plus 1-byte terminator (e.g. `cvt.{type}.{type}.{rnd} {reg}, {reg};`)
`0x1C40201`	`577`	mnemonic, operands, terminator (the dominant ALU shape)
`0x1C4097D`	`267`	4-operand form (e.g. `mma.sync.aligned {rd, rs1, rs2, rs3};`)
`0x1C40B59`	`108`	`[addr], reg` form (e.g. `st {addr}, {reg};`)
`0x1C409DF`	`96`	2-operand reg-reg (e.g. `mov.{type} {rd}, {rs};`)
`0x1C40AAF`	`84`	3-operand plus modifier (e.g. `set.{cmp}.{type} {rd}, {rs1}, {rs2};`)
`LABEL_18984`	(slot 8)	predicated form
`LABEL_18729`	(slot 5)	conditional form

Shared body 0x1C40201 is the centre of mass of the table. It prints the canonical "mnemonic, comma-separated operands, semicolon" shape every ordinary ALU instruction takes, so a reimplementation that gets exactly this one body right covers more than 9 % of MC opcodes on its own. The four 0x1C40... bodies together (the 201, 97D, B59, 9DF, AAF cluster) form the ALU and memory backbone; the two LABEL_* entries cover the predicated and conditional forms the selector synthesises around guarded instructions.

18-Family Non-MMA Partition

Beyond the top-eight shared bodies, the AsmPrinter groups the non-MMA opcodes into eighteen families F1 through F18 keyed by operand shape. The partition is operand-driven rather than mnemonic-driven: opcodes whose PTX text differs in mnemonic but agrees in operand layout share a family, and the per-opcode flag word in the jump table picks the right mnemonic stem out of the XOR-3 mnemonic pool. The largest family is F1, the load/store mega-group, which carries its own inner sub-dispatcher because LD and ST must discriminate address space, predication, sparsity, and tensor-memory variants before the shared body can pick the correct PTX spelling.

Inner LD/ST 13-Label Table

F1 dispatches through a 13-label sub-table that splits the load/store opcodes by address space, predication, and tensor-memory variant:

Sub-label	Opcode family
1	Generic load.
2	Generic store.
3	Constant-AS load.
4	Param-AS load.
5	Shared-AS load/store.
6	Global-AS load/store.
7	Local-AS load.
8	TMEM load/store.
9	Bulk-tensor load.
10	Bulk-tensor store.
11	Sparse load.
12	Predicated load.
13	Predicated store.

Sub-labels 8 through 11 are the tensor-memory and bulk-tensor variants reachable only from the SM100 window onward; sub-labels 12 and 13 carry the predicated forms the selector emits when a guard predicate survives into the MC layer.

MC Opcode to Label Cascade

Each MC opcode in the dispatcher's jump table holds a u32 index packed as {shared_label_id << 16 | per-op flag bits}. The high 16 bits select the shared body; the low 16 bits encode the operand-flavour tweaks (FTZ, satfinite, modifier kind, address-space hint, and so on) the shared body consults via currentOp & 0xFFFF. This is the canonical AsmWriterEmitter compression pattern, fingerprinted in the binary by the mov rax, [rdi + 4*rcx] plus mov ecx, eax; shr rax, 16 idiom at the entry of every shared body.

void emit_mc_opcode(MCInst inst, raw_ostream *os) {
    uint32_t entry  = jump_table[inst.opcode];
    uint32_t label  = entry >> 16;
    uint32_t flags  = entry & 0xFFFF;

    print_shared_body(label, inst, flags, os);
}

The split is strict: a shared body never reads opcode identity directly. It reads only the flag word handed to it by the dispatcher, plus the operand indices its skeleton dictates. That is what lets 297 bodies absorb 6,388 opcodes without per-opcode branching inside the bodies themselves.

Worked Example: MMA M16N8K16

The cleanest way to see every layer of the printer cooperate is to trace a single MachineInst from selection output through to the emitted PTX line. The example below uses MMA_F32_F16_F16_F32_M16N8K16 — the canonical FP16-input, FP32-accumulate warp MMA the FP32 GEMM kernels lean on heaviest.

MachineInst Shape

After instruction selection, the MI carries four operand groups: an A-fragment, a B-fragment, a C-accumulator that doubles as the D destination, and an optional satfinite flag. The shape is fixed by the TableGen instruction definition:

%fd0:f32, %fd1:f32, %fd2:f32, %fd3:f32 =
    MMA_F32_F16_F16_F32_M16N8K16
        %r0:i32, %r1:i32, %r2:i32, %r3:i32,   // A fragment (4 x packed.f16x2)
        %r4:i32, %r5:i32,                      // B fragment (2 x packed.f16x2)
        %fd4:f32, %fd5:f32, %fd6:f32, %fd7:f32 // C accumulator

The destination is the first four operands; the A, B, C fragments follow in source order. The register classes are mixed: A and B use the 32-bit integer bank (%r) because PTX packs two FP16 lanes into one 32-bit register, while C and D use the 32-bit float bank (%fd) because the accumulator is held in FP32 doubles.

Print Shape Lookup

The MC opcode index for MMA_F32_F16_F16_F32_M16N8K16 resolves to a companion-table entry whose high 16 bits select shared body 0x1C4097D (the 4-operand-group form documented in the shape population table) and whose low 16 bits encode the type-suffix tuple (D=f32, A=f16, B=f16, C=f32) plus the satfinite=0 bit.

The mnemonic offset retrieved from dword_4D4D360 resolves through the XOR-3-decoded pool to the stem mma.sync.aligned.m16n8k16. The trailing type tokens .f32.f16.f16.f32 are appended by the modifier helper print_mma_type_tuple, which reads the four type-tag bytes off the flag word and looks each one up in the type-string table.

Modifier Emission

The suffix slot table for the dense MMA family is:

Slot	Token	Source
1	`.sync`	fixed for `mma.sync`
2	`.aligned`	fixed for the warp-synchronous form
3	`.m16n8k16`	shape from print-shape entry
4	`.f32`	D type tag from flag word
5	`.f16`	A type tag from flag word
6	`.f16`	B type tag from flag word
7	`.f32`	C type tag from flag word
8	`.satfinite`	optional, gated by flag bit

The printer walks this slot table in order. No operand-vector traversal participates — the slot table is the ground truth for suffix order.

Operand Group Printing

The shared body 0x1C4097D reads four operand-group descriptors out of its skeleton:

Group	Operand range	Register class	Width	Printed form
D	dest[0..3]	`f32` (`%fd`)	4	`{%fd0, %fd1, %fd2, %fd3}`
A	src[0..3]	`i32` (`%r`)	4	`{%r0, %r1, %r2, %r3}`
B	src[4..5]	`i32` (`%r`)	2	`{%r4, %r5}`
C	src[6..9]	`f32` (`%fd`)	4	`{%fd4, %fd5, %fd6, %fd7}`

Each group emits an open brace, comma-separated operand prints, a closing brace, and a separator. The group sizes (4, 4, 2, 4) come from the MMA shape's fragment dimensions, not from the operand vector — a 16x8 D output produces 4 FP32 lanes per thread, a 16x16 A input produces 4 packed FP16x2 lanes per thread, and an 8x16 B input produces 2 packed FP16x2 lanes per thread.

Final Emitted Line

After all four groups have printed, the skeleton emits the closing semicolon and a newline:

mma.sync.aligned.m16n8k16.f32.f16.f16.f32    {%fd0, %fd1, %fd2, %fd3}, {%r0, %r1, %r2, %r3}, {%r4, %r5}, {%fd4, %fd5, %fd6, %fd7};

Three observations carry across to other MMA variants. First, the suffix order and operand-group order are independent: the suffix list runs D, A, B, C in type-token form while the operand list runs D, A, B, C in brace-group form, and they happen to agree only because the print shape arranged it that way. Second, the register-class mismatch between A/B (integer bank) and C/D (float bank) is deliberate — PTX treats FP16 inputs as bit patterns and FP32 accumulators as floats, and the printer faithfully picks the bank from each operand's class. Third, the .satfinite slot is gone from the printed line because its flag bit was zero; the printer never emits empty-slot placeholders.

The same skeleton drives every entry in the mma.sync.aligned.* family. Switching to mma.sync.aligned.m16n8k32.f32.bf16.bf16.f32 changes only the shape token in slot 3 and the type tokens in slots 4 through 7; the operand-group sizes adjust to match the new fragment dimensions; and the final line keeps the exact same syntactic shape. That regularity is what lets one shared body cover several hundred MMA opcodes.

Reimplementation Budget

Recreating the dispatcher in a clean reimplementation takes four artefacts. The first three are bulk data; the fourth carries the per-operand encoding rules the modifier helpers consume.

Artefact	Shape	Source
Shared-body table	`297` labels with operand-skeleton scripts.	Disassembled from the asm-printer body cluster.
MC opcode jump table	`6,388` entries of `{label, flags}` packed as `u32`.	Reconstructed from the per-case jump targets.
Mnemonic pool	XOR-3-encrypted, `3,067` NUL-delimited chunks.	See the XOR-3 walking cipher section.
Offset tables	`dword_4D4D360`, `dword_4D468C0`, `word_4D46800`.	See getMnemonic and the Offset Tables.

Ship these four artefacts as static data plus the 297 body scripts as a small interpreter loop and the entire MC printer surface comes back without per-opcode code generation. The ISelDAG and MatcherTable — Selector Layers page documents the upstream stage feeding these MC opcodes into the printer, and the XOR-3 walking cipher section above covers the mnemonic-pool side of the budget.

Keyboard shortcuts

Tileiras Internals