Instruction Selection

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

Instruction selection in ptxas is a two-phase process that converts PTX virtual ISA operations into concrete SASS machine opcodes. Unlike LLVM, which uses a single SelectionDAG or GlobalISel framework, ptxas distributes instruction selection across two distinct pipeline stages separated by the entire optimization pipeline: Phase 1 converts PTX opcodes to Ori IR opcodes during initial lowering (phase 5, ConvertUnsupportedOps), and Phase 2 converts Ori IR to final SASS binary forms during code generation (phases 112--122, ISel driver + Mercury encoder). The two phases serve fundamentally different purposes: Phase 1 legalizes the IR so the optimizer can reason about it, while Phase 2 selects the optimal machine encoding for the target architecture after register allocation and scheduling are complete.

| Component | Location |
| --- | --- |
| Phase 1 location | Phase 5: ConvertUnsupportedOps (PTX opcode to Ori opcode) |
| Phase 2 location | Phases 112+: ISel driver + Mercury encoder (Ori to SASS binary) |
| MercConverter dispatch | sub_9ED2D0 (25 KB, master switch on *(instr+72) & 0xCF mask) |
| ISel driver | sub_B285D0 (9 KB, 66 callees, vtable entry) |
| ISel mega-selector | sub_C0EB10 (185 KB, 500+ locals, giant switch) |
| DAG pattern matchers | ~801 functions at 0xB28F60--0xB7D000 (~1.3 MB) |
| Arch dispatch tables | 5 thunks at sub_B128E0--sub_B12920 (13 bytes each -> shared 15 KB handler) |
| Mercury master encoder | sub_6D9690 (94 KB, instruction type switch) |
| MercExpand | sub_C3CC60 (26 KB, pseudo-instruction expansion) |
| SM120 pattern coordinator | sub_13AF3D0 (137 KB, 130-case switch, opcodes 2--352) |
| Opcode variant selectors | sub_B0BE00 (19 KB, class 194), sub_B0AA70 (5 KB, class 306) |

Architecture

PTX source text
     |
     v
[Bison parser]  sub_4CE6B0 (48KB)
     |  Reduction actions build raw Ori nodes with PTX-derived opcodes
     v
+------------------------------------------------------------------+
| RAW ORI IR (PTX opcodes: add.f32, ld.global, mad.lo.s32, ...)    |
+------------------------------------------------------------------+
     |
     |  PHASE 1: PTX-to-Ori Opcode Legalization (phase 5)
     |
     |  sub_9F3340 (orchestrator, 7KB)
     |    -> sub_9F1A90 (MercConverter main, 35KB)
     |         -> sub_9ED2D0 (opcode dispatch, 25KB)
     |              Switch on (*(instr+72)) with BYTE1 & 0xCF mask
     |              ~120 case values -> ~60 handler functions
     |              + vtable dispatch for architecture-extensible ops
     |         -> sub_934630 (instruction creation, called N times)
     |    -> sub_9EF5E0 (post-conversion lowering, 27KB)
     |
     v
+------------------------------------------------------------------+
| OPTIMIZER-READY ORI IR (SASS opcodes: FADD, IMAD, LDG, STG, ...) |
| Every instruction has a valid SASS opcode for the target SM.      |
+------------------------------------------------------------------+
     |
     |  [Phases 14-111: Full optimization pipeline]
     |  Register allocation, scheduling, peephole, etc.
     |
     v
+------------------------------------------------------------------+
| OPTIMIZED ORI IR (register-allocated, scheduled)                  |
+------------------------------------------------------------------+
     |
     |  PHASE 2: Ori-to-SASS Selection & Encoding (phases 112+)
     |
     |  sub_B285D0 (ISel driver, 9KB)
     |    -> sub_C0EB10 (mega-selector, 185KB, default backend)
     |    -> sub_13AF3D0 (pattern coordinator, 137KB, SM120 backend)
     |    -> sub_B1FA20 / sub_B20E00 (builder variants)
     |    -> sub_B28F60..sub_B74C60 (~801 DAG pattern matchers)
     |    -> sub_B128E0..sub_B12920 (5 arch dispatch thunks -> shared handler)
     |
     |  sub_6D9690 (Mercury master encoder, 94KB)
     |    -> Switch on instruction type (*(instr+8))
     |    -> sub_C00BF0 (opcode lookup)
     |    -> sub_91D160 (register encoding)
     |    -> sub_7B9B80 (bitfield insert, 18,347 callers)
     |
     |  sub_C3CC60 (MercExpand, 26KB)
     |    -> sub_C37A10 (expand instruction, 16KB)
     |    -> sub_C39B40 (expand memory, 10KB)
     |    -> sub_C3BCD0 (expand control flow, 19KB)
     |
     v
+------------------------------------------------------------------+
| SASS binary (packed machine code in 64/128/256-bit words)         |
+------------------------------------------------------------------+

Phase 1: PTX-to-Ori Opcode Conversion

Phase 1 runs as ConvertUnsupportedOps (pipeline phase 5), the most substantial bridge phase. Its job is to replace every PTX-derived opcode in the raw Ori IR with a valid SASS-level opcode for the target SM. After this phase completes, the optimizer sees only SASS-level instruction semantics.

The conversion is not a simple table lookup. Many PTX operations have no 1:1 SASS equivalent and must be expanded into multi-instruction sequences. The expansion depends on the target architecture, the operand types, and the available hardware functional units.

MercConverter Dispatch -- sub_9ED2D0 (25 KB)

This is the central dispatch function of Phase 1. Although the initial sweep labeled it PhaseRunner::executePhaseSequence, the decompiled code is a classic opcode switch: it reads the opcode word at *(instr+72), masks byte 1 with 0xCF (clearing bits 4--5 of that byte, i.e. bits 12--13 of the word, which carry modifier flags), and dispatches to per-category handler functions. The switch covers approximately 120 distinct case values (opcode indices 1--352), routing to roughly 60 handler functions plus vtable-dispatched methods for architecture-extensible operations.

// sub_9ED2D0 -- simplified dispatch logic
void MercConverter_Dispatch(context, instruction) {
    // Pre-dispatch: check predication eligibility
    bool can_predicate = sub_7E18A0(instruction, *(context+8));
    if (can_predicate)
        can_predicate = vtable[205](*(*(context+8)+1584), instruction);
    *(context+40) = can_predicate;

    // Read opcode, mask out modifier bits
    int opcode = *(DWORD*)(instruction + 72);
    BYTE1(opcode) &= 0xCF;

    // Special case: opcode 130 (HSET2 in ROT13; internal marker) with GPR operand -> clear predication
    if (opcode == 130) {
        int operand = *(DWORD*)(instruction + 84);
        if (((operand >> 28) & 7) == 1 && reg_type(operand) == 6)
            *(context+40) = 0;
    }

    // Main dispatch
    switch (opcode) {
    case 1:   sub_9DA5C0(context, instruction);  break;  // opcode class 1
    case 6:   sub_9DA100(context, instruction);  break;  // arithmetic
    case 8:   sub_9D2440(context, instruction);  break;  // specific class
    case 10: case 11: case 149: case 151: case 152: case 290: case 291:
              sub_9D80E0(context, instruction);  break;  // memory load/store
    case 16:  sub_9E8B20(context, instruction);  break;  // texture/surface
    case 61: case 63: case 80:
              sub_9E6600(context, instruction);  break;  // instruction expansion
    case 108: sub_9D76D0(context, instruction);  break;  // memory legalization
    // ... ~100 more cases ...
    default:  emit_noop(context, 0xFFFF);        break;  // unknown -> passthrough
    }

    // Post-dispatch: apply predication and operand adjustments
    vtable[107](context, instruction);
}
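
The byte-wise mask in the dispatch above (BYTE1(opcode) &= 0xCF) and the dword mask 0xFFFFCFFF that the ISel driver applies later are the same operation. The following C sketch shows the equivalence; the helper names are hypothetical and do not come from the binary.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bits 4-5 of byte 1, mirroring BYTE1(opcode) &= 0xCF. */
static uint32_t mask_opcode_bytewise(uint32_t raw) {
    uint32_t byte1 = (raw >> 8) & 0xFFu;
    byte1 &= 0xCFu;
    return (raw & ~0x0000FF00u) | (byte1 << 8);
}

/* The same operation as a single 32-bit mask (bits 12-13 cleared). */
static uint32_t mask_opcode_dword(uint32_t raw) {
    return raw & 0xFFFFCFFFu;
}
```

For example, a raw opcode word of 0x3082 reduces to 0x82 (130) under either form, so both phases see the same opcode index regardless of the modifier bits.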

MercConverter Opcode Dispatch Table

The complete switch covers opcodes 1--352. Cases route to three dispatch mechanisms: direct function calls (for common PTX categories), vtable-indirect calls (for architecture-extensible operations), and the emit_noop fallback for unrecognized opcodes. Below is the reconstructed routing table from the decompiled sub_9ED2D0.

Direct handler dispatch (73 handlers listed):

| Opcode(s) | Handler | Size | Category |
| --- | --- | --- | --- |
| 1 | sub_9DA5C0 | 2 KB | Opcode class 1 (basic ALU) |
| 6 | sub_9DA100 | 9 KB | Arithmetic operations |
| 8 | sub_9D2440 | -- | Specific class |
| 10, 11, 149, 151, 152, 290, 291 | sub_9D80E0 | 17 KB | Memory load/store |
| 15, 85 | sub_9EC340 | 23 KB | Multi-operand legalization |
| 16 | sub_9E8B20 | 17 KB | Texture/surface lowering |
| 17 | sub_9E7FB0 | -- | Surface operations |
| 22 | sub_9D6DB0 | -- | Specific lowering |
| 23 | sub_9E58F0 | -- | Specific lowering |
| 24 | sub_9D9F60 | -- | Specific lowering |
| 26 | sub_9E54C0 | -- | Specific lowering |
| 27 | sub_9E4BB0 | -- | Specific lowering |
| 28 | sub_9D9E70 | -- | Specific lowering |
| 32, 271 | sub_9E2440 | -- | Bitfield operations |
| 34 | sub_9E55E0 | -- | Specific lowering |
| 38, 59, 106, 180, 182, 192, 194, 215, 221, 242 | sub_9DA6B0 | -- | Generic ALU group |
| 41, 284 | sub_9D1DA0 | -- | Specific lowering |
| 42, 53, 55, 66 | sub_9D54B0 | -- | Grouped operations |
| 47 | sub_9E74E0 | -- | Conditional (arch flag check) |
| 51 | sub_9E2F60 | -- | Specific lowering |
| 52, 54, 72, 97 | sub_9D09C0 | -- | Group with v8=1 (deletion flag) |
| 57, 101 | sub_9D6170 | -- | Paired operations |
| 60, 62, 78, 79 | sub_9E5EE0 | -- | Comparison group |
| 61, 63, 80 | sub_9E6600 | 25 KB | Instruction expansion (64-bit split) |
| 67 | sub_9D9C30 | -- | Specific lowering |
| 70 | sub_9E3490 | -- | Specific lowering |
| 75 | sub_9E0C10 | -- | Specific lowering |
| 77 | sub_9E4DF0 | -- | Specific lowering |
| 83 | sub_9D6AB0 | -- | Specific lowering |
| 88, 89 | sub_9D5990 | -- | Paired operations |
| 90 | sub_9D2820 | -- | Specific lowering |
| 91 | sub_9E7600 | -- | Specific lowering |
| 92 | sub_9E7890 | -- | Specific lowering |
| 93, 95 | sub_9E1D40 | -- | Comparison variants |
| 94 | sub_9E1DF0 | -- | Specific lowering |
| 96 | sub_9D41C0 | -- | Specific lowering |
| 98 | sub_9D3230 | -- | Specific lowering |
| 100 | sub_9D70E0 | -- | Specific lowering |
| 102 | sub_9D9750 | -- | Specific lowering |
| 103, 104 | sub_9E31D0 | -- | Paired operations |
| 108 | sub_9D76D0 | 18 KB | Memory instruction legalization |
| 124 | sub_9E18B0 | -- | Specific lowering |
| 135 | sub_9D6560 | -- | Specific lowering |
| 139, 140, 141, 143 | sub_9D4C10 | -- | Related operations group |
| 145 | sub_9D3020 | -- | Specific lowering |
| 155, 268 | sub_9E5260 | -- | Paired operations |
| 156 | sub_9D94B0 | -- | Specific lowering |
| 158, 167 | sub_9E4A00 | -- | Paired operations |
| 161 | sub_9D21D0 | -- | Specific lowering |
| 162 | sub_9D9660 | -- | Specific lowering |
| 166 | sub_9E2100 | -- | Specific lowering |
| 170 | sub_9E2DF0 | -- | Specific lowering |
| 173, 267 | sub_9EB5C0 | -- | Paired operations |
| 174 | sub_9D9300 | -- | Specific lowering |
| 184 | sub_9D2E70 | -- | Specific lowering |
| 185 | sub_9E32F0 | -- | Specific lowering |
| 188, 190 | sub_9E2970 | -- | Paired operations |
| 195 | sub_9D2AB0 | -- | Specific lowering |
| 196 | sub_9D9080 | -- | Specific lowering |
| 198 | sub_9D66F0 | -- | Specific lowering |
| 201, 202, 204, 285 | sub_9EAC30 | -- | Async/bulk group |
| 203 | sub_9D8E90 | -- | Specific lowering |
| 205 | sub_9E1260 | -- | Specific lowering |
| 209 | sub_9E5740 | -- | Specific lowering |
| 210, 213, 214 | sub_9D8B30 | -- | Grouped operations |
| 240 | sub_9D6280 | -- | Specific lowering |
| 241 | sub_9E2CC0 | -- | Specific lowering |
| 247 | sub_9D0F70 | -- | Specific lowering |
| 248 | sub_9D0DF0 | -- | Specific lowering |
| 262 | sub_9E7440 | -- | Specific lowering |
| 264 | sub_9D73F0 | -- | Specific lowering |
| 276 | sub_9D5EC0 | -- | Specific lowering |
| 292 | sub_9D0E90 | -- | Specific lowering |

Vtable-indirect dispatch (for architecture-extensible operations):

| Opcode(s) | Vtable offset | Category (inferred) |
| --- | --- | --- |
| 2, 3, 4, 5, 7 | vtable[0] (+0) | Generic fallback |
| 14, 39, 40, 105, 125, 299, 300, 321 | vtable[7] (+56) | Group A operations |
| 18 | vtable[3] (+24) | Specific class |
| 31 | vtable[4] (+32) | Specific class |
| 35 | vtable[6] (+48) | Specific class |
| 36 | vtable[21] (+168) | Specific class |
| 43 | vtable[9] (+72) | Specific class |
| 50 | vtable[12] (+96) | Specific class |
| 65 | vtable[22] (+176) | Specific class |
| 73 | vtable[15] (+120) | Specific class |
| 74 | vtable[16] (+128) | Specific class |
| 81 | vtable[24] (+192) | Specific class |
| 110, 111, 112, 114 | vtable[25] (+200) | Warp shuffle group |
| 118 | vtable[10] (+80) | Specific class |
| 119 | vtable[28] (+224) | Specific class |
| 120, 121, 126, 127, 128, 280, 281 | vtable[27] (+216) | Barrier/sync group |
| 122, 123, 310, 311, 312 | vtable[26] (+208) | Related group |
| 130 (HSET2), 169 | vtable[29] (+232) | Move/convert group (130 is MOV-like internally; actual SASS MOV = 19) |
| 157 | vtable[84] (+672) | Specific class |
| 176, 177 | vtable[34] (+272) | Paired operations |
| 183, 288 | vtable[36] (+288) | Paired operations |
| 186 | vtable[35] (+280) | Specific class |
| 211 | vtable[39] (+312) | Specific class |
| 220 | vtable[40] (+320) | Specific class |
| 223, 238 | vtable[41] (+328) | Paired operations |
| 228 | vtable[42] (+336) | Specific class |
| 243 | vtable[43] (+344) | Specific class |
| 245--253, 257 | vtable[67--77] (+536--+624) | SM 100+ operations |
| 265, 266 | vtable[93] (+744) | Paired operations |
| 270 | vtable[77] (+616) | Specific class |
| 277 | vtable[65] or vtable[11] (+520/+88) | Operand-type dependent |
| 279--351 | various high vtable offsets | SM 100+ / Blackwell operations |

The vtable mechanism allows architecture backends to override conversion behavior without modifying the core dispatch. The vtable factory at sub_1CCEEE0 (17 KB, 244 callees) selects which overrides are active based on the SM version.
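
As a rough illustration of this override mechanism, the C sketch below fills a table of function pointers with a default handler and then patches one slot for a newer SM. Everything in it is hypothetical (the table size, the handler bodies, and which slot is patched); only the 8-byte slot stride, which matches the index/offset pairs in the table above, is taken from the page.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*convert_fn)(int opcode);

enum { SLOT_COUNT = 94 };  /* hypothetical: just large enough to hold slot 93 */

/* Stand-in handlers; the real ones rewrite IR, these only tag the opcode. */
static int default_convert(int opcode) { return opcode; }
static int sm100_convert(int opcode)   { return opcode + 1000; }

/* Byte offset of a slot, assuming 8-byte function pointers. */
static size_t slot_offset(size_t index) { return index * 8; }

/* Hypothetical factory: install defaults, then apply per-SM overrides. */
static void build_vtable(convert_fn table[SLOT_COUNT], int sm_version) {
    for (size_t i = 0; i < SLOT_COUNT; i++)
        table[i] = default_convert;
    if (sm_version >= 100)
        table[93] = sm100_convert;  /* made-up override for illustration */
}
```

The dispatch code never changes: it always calls through the slot, and the factory decides what the slot points to.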

Per-Category Handlers

The larger handlers implement non-trivial conversion logic:

| Handler | Size | Category | Key behavior |
| --- | --- | --- | --- |
| sub_9E6600 | 25 KB | Instruction expansion | Splits 64-bit ops on 32-bit ALU into hi/lo pairs with carry chains. Calls sub_9D4380 (instruction builder) ~10 times per expansion. |
| sub_9EC340 | 23 KB | Multi-operand legalization | Operand type test: ((v >> 28) & 7) == 1 means register. Register class query via sub_7BE7B0. Creates new instructions via sub_7DEAD0. |
| sub_9D76D0 | 18 KB | Memory legalization (load/store) | Register type dispatch: 6=GPR, 7=predicate, 3=address. Uses sub_9D4380 (instruction builder) and sub_9CD420 (predication). |
| sub_9D80E0 | 17 KB | Memory legalization (variant) | Handles opcodes 10, 11, 149, 151, 152, 290, 291 (sub_9D76D0 handles opcode 108); alternate code path for different operand patterns. |
| sub_9E8B20 | 17 KB | Texture/surface lowering | Register type 6 = GPR. Manipulates bitmask at register descriptor offset +48. |
| sub_9DA100 | 9 KB | Arithmetic operations | Handles opcode case 6 -- standard ALU instruction legalization. |
| sub_9DA6B0 | -- | Generic ALU group | Covers 10 opcode values (38, 59, 106, 180, 182, 192, 194, 215, 221, 242). |

1:1 vs 1:N Expansion

Most PTX operations map 1:1 to a single SASS opcode. When they do not, the handlers in sub_9E6600 and related functions create multi-instruction sequences:

PTX                                    Ori IR (after Phase 1)
-----------------------------------    -----------------------------------
add.f32  %r1, %r2, %r3          -->   FADD  R1, R2, R3                [1:1]
add.s32  %r4, %r5, %r6          -->   IADD3 R4, R5, R6, RZ           [1:1, operand added]
mul.lo.s64 %rd1, %rd2, %rd3     -->   IMAD.LO  R1, R2, R6, RZ       [1:N split]
                                       IMAD.HI  R0, R2, R6, RZ
                                       IMAD      R0, R3, R6, R0
                                       IMAD      R0, R2, R7, R0
div.f32  %r7, %r8, %r9          -->   MUFU.RCP  R10, R9              [1:N, Newton-Raphson]
                                       FMUL      R7, R8, R10
                                       (+ correction iterations)
bar.sync 0                       -->   BAR                            [1:1]

The expansion creates new instruction nodes via sub_934630 and links them into the doubly-linked instruction list. The original PTX-level instruction is replaced by the expanded sequence.
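
The mul.lo.s64 row above can be checked arithmetically. The C sketch below (hypothetical helper name) mirrors the four-instruction shape of that expansion: one low product, one high product, and two cross terms accumulated into the high word. It models only the arithmetic, not register assignment or the IR node creation.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a 64x64 multiply, built from 32-bit pieces only. */
static uint64_t mul_lo_64_split(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t full = (uint64_t)a_lo * b_lo;
    uint32_t lo = (uint32_t)full;           /* IMAD.LO: low half of a_lo*b_lo  */
    uint32_t hi = (uint32_t)(full >> 32);   /* IMAD.HI: high half of a_lo*b_lo */
    hi += a_hi * b_lo;                      /* IMAD: first cross term          */
    hi += a_lo * b_hi;                      /* IMAD: second cross term         */
    /* a_hi*b_hi only affects bits 64 and up, so it is never computed. */
    return ((uint64_t)hi << 32) | lo;
}
```

The cross terms wrap modulo 2^32, which is exactly what a truncating 64-bit multiply requires.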

Type-Dependent Opcode Selection

PTX's explicitly-typed opcodes (where the type is a qualifier like .f32, .s64) map to different SASS mnemonics based on the type:

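| PTX type | SASS prefix | Example PTX | Example SASS |
| --- | --- | --- | --- |
| .f16 / .f16x2 | H | add.f16 | HADD2 |
| .f32 | F | add.f32 | FADD |
| .f64 | D | add.f64 | DADD |
| .s32 / .u32 | I | add.s32 | IADD3 |
| .s64 / .u64 | I (split) | add.s64 | IADD3 + IADD3.X (carry chain) |
| .pred | P | setp.eq.f32 | FSETP |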

The type qualifier disappears from the instruction syntax during conversion. It becomes encoded in the SASS mnemonic itself (the F in FADD, the I in IADD3) and in the register class of the operands.
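
The .s64 / .u64 row (IADD3 + IADD3.X) is the standard carry-chain split of a 64-bit add into two 32-bit adds. A minimal C sketch of that arithmetic, using a hypothetical helper name and modelling only the values rather than the SASS carry predicate, could look like this:

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit add built from two 32-bit adds linked by a carry bit. */
static uint64_t add_64_split(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo = a_lo + b_lo;            /* IADD3: low words, wraps on overflow */
    uint32_t carry = (lo < a_lo) ? 1u : 0u;
    uint32_t hi = a_hi + b_hi + carry;    /* IADD3.X: high words plus carry-in   */
    return ((uint64_t)hi << 32) | lo;
}
```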

SM-Dependent Legalization

The MercConverter gates operations by SM version through the architecture vtable. An instruction available natively on one SM may require a multi-instruction lowering sequence on another:

  • 64-bit integer arithmetic on SM 50--75 (no native 64-bit ALU): splits into 32-bit hi/lo pairs
  • FP16 operations on pre-SM 53 targets: promoted to FP32 (handled by Phase 2 PromoteFP16)
  • bfe/bfi variants: some bit-field extract/insert modes not supported on all targets
  • Tensor core intrinsics: SM 70 has HMMA v1, SM 75 has HMMA v2, SM 80+ has HMMA v3/DMMA, SM 100 has TCGen05

The architecture vtable factory at sub_1CCEEE0 populates the vtable with SM-specific method overrides. The vtable has at least 94 method slots (the dispatch table above reaches vtable[93] at offset +744), with the highest-numbered slots (offset 624+) serving SM 100+ Blackwell operations.

Phase 2: Ori-to-SASS Selection & Encoding

Phase 2 runs during code generation (phases 112+) after the optimizer, register allocator, and scheduler have completed. It operates on fully optimized, register-allocated Ori IR and produces final SASS machine code. Phase 2 has three major components: the ISel driver with DAG pattern matching, the Mercury master encoder, and MercExpand pseudo-instruction expansion.

ISel Driver -- sub_B285D0 (9 KB)

The top-level ISel coordinator is a vtable entry point with 66 callees. It selects the appropriate instruction builder variant based on the target architecture:

// Simplified ISel driver
void ISel_LowerInstruction(context, instruction) {
    int sm = *(context + 184);          // SM version
    int opcode = instruction[18] & 0xFFFFCFFF;

    // Select architecture-variant builder
    if (sm == 14)
        Builder_VariantA(context, instruction);    // sub_B1FA20 (13 KB)
    else
        Builder_VariantB(context, instruction);    // sub_B20E00 (11 KB)

    // Apply post-ISel modifiers
    ApplyModifiers(context, instruction);           // sub_B1D670 (13 KB)
    SetProperties(context, instruction);            // sub_B241A0 (7 KB)
}

The two builder variants (sub_B1FA20 and sub_B20E00) are structurally near-identical, with 50 callees each. Both call sub_7E3EF0 (operand index helper) 6 times (3 source + 3 destination operands) and use sub_A3B930 (operand register class resolver). The key difference is the validation function: variant A uses sub_C49440, variant B uses sub_C49400, reflecting different encoding constraints for different SM families.

ISel Mega-Selector -- sub_C0EB10 (185 KB)

The single largest function in the Phase 2 ISel range: 185 KB decompiled, 6,016 lines, 719+ local variables. It performs the final Ori-IR-to-SASS opcode and operand encoding for 169 distinct instruction types (SASS opcode indices 7--221). While the ~801 DAG pattern matchers handle template-based ISel through a priority contest, the mega-selector handles complex instructions that require procedural, multi-step encoding logic -- instructions where the operand marshalling depends on runtime state (calling conventions, symbol resolution, address space aliasing).

Dual-Switch SM-Generation Dispatch

The function contains two copies of the same 169-case switch statement, separated by a vtable-based opcode translation mechanism. This dual-switch structure is the SM-generation dispatch:

// sub_C0EB10 -- simplified dispatch skeleton
void MegaSelector(context *a1, instruction *a2, isel_ctx *a3) {   // below, ctx denotes a1
    int64_t *vtable = *(a3->backend);
    int opcode = *(int *)(a2 + 8);           // SASS opcode type

    // Pre-dispatch: capability check via vtable[12]
    auto cap_check = vtable[12];              // offset +96
    if (cap_check != sub_BFEAA0)              // default stub?
        if (cap_check(a3, a2))
            ctx->flags[256] = 1;              // set encoding flag

    // Read opcode translator from vtable[2]
    auto translator = vtable[2];              // offset +16

    if (translator != sub_BFEBF0) {
        // PATH A: SM-specific translation
        int encoding_index = translator(a3, opcode);
        int isel_opcode = *(ctx + 8);         // post-translation opcode
        switch (isel_opcode) {                // PRIMARY SWITCH (169 cases)
            case 7: case 34: case 35: case 36:
                emit_simple(encoding_index, ...);
                break;
            case 8: case 38: case 46: ...
                /* already encoded */ break;
            // ... 169 cases total ...
            default: goto high_opcode_path;
        }
    } else {
        // PATH B: static table lookup (default backend)
        int encoding_index = 355;             // sentinel for extended opcodes
        if (opcode <= 0xDD)
            encoding_index = word_22B4B60[opcode];
        switch (opcode) {                     // FALLBACK SWITCH (same 169 cases)
            case 7: ...: goto handler_7;      // jumps into Path A handlers
            // ... identical case set ...
            default: return;
        }
    }

high_opcode_path:
    if (opcode > 0x199) return;
    // Try vtable[3] extension dispatch for SM 100+ / Blackwell
    auto extension = vtable[3];               // offset +24
    if (extension != sub_BFEA30)
        extension(a3, a2);                    // arch-extension handler
}

The dual-switch pattern is a code-generation artifact: the compiler emitted two copies because the vtable path and static-table path produce different values for the encoding index but need identical case routing. This doubles the binary size but avoids a conditional merge point at every case entry.

Three Vtable Dispatch Points

| Vtable slot | Offset | Default stub | Purpose |
| --- | --- | --- | --- |
| vtable[2] | +16 | sub_BFEBF0 | Opcode-to-encoding-index translator. SM-specific override remaps opcodes to different encoding slots. Fallback: word_22B4B60[] static table. |
| vtable[12] | +96 | sub_BFEAA0 | Pre-dispatch capability check. Returns boolean that sets ctx[256] encoding flag. |
| vtable[3] | +24 | sub_BFEA30 | Extension opcode handler for opcodes outside the 169-case set (barrier/sync 61--63/221, high opcodes up to 0x199, SM 100+ extensions). Opcodes above 0x199 return without dispatch. |

The word_22B4B60 static table is a uint16[222] array indexed by SASS opcode (0--0xDD = 221). Each entry is a SASS encoding slot index used to select a format descriptor from the encoder tables. Of the 222 entries: 117 carry a non-zero encoding slot, 95 are zero (opcode uses no static encoding or is handled purely by case logic), and 10 carry the sentinel value 355 (opcode requires SM-specific vtable override and has no default encoding). SM-specific vtable overrides can remap any entry, enabling per-architecture instruction variants without modifying the mega-selector logic.
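
The lookup rule described above (use the override if one is installed, otherwise index the static table, otherwise fall back to the sentinel) can be sketched in C as follows. This is an illustration under assumptions: the table here holds only three entries copied from the dump, a NULL pointer stands in for "the slot still holds the default stub", and sample_override is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TABLE_SIZE = 222, SENTINEL = 355 };

typedef int (*translator_fn)(int opcode);

/* Zero-initialized stand-in for word_22B4B60; zero means "no static slot". */
static uint16_t slot_table[TABLE_SIZE];

static void init_slot_table(void) {
    slot_table[11]  = 94;        /* FFMA */
    slot_table[123] = 180;       /* DADD */
    slot_table[40]  = SENTINEL;  /* FCHK: needs an SM-specific override */
}

/* Invented override: remaps FFMA only, returns 0 for everything else. */
static int sample_override(int opcode) { return opcode == 11 ? 95 : 0; }

static int encoding_index(translator_fn override, int opcode) {
    if (override != NULL)
        return override(opcode);           /* PATH A: SM-specific translation */
    if (opcode >= 0 && opcode <= 0xDD)
        return slot_table[opcode];         /* PATH B: static table lookup */
    return SENTINEL;                       /* extended opcode, no static slot */
}
```

A caller that receives SENTINEL knows the opcode has no default encoding and must be handled by an architecture backend.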

Complete word_22B4B60 dump (117 non-zero non-sentinel entries, sorted by opcode):

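| Opc | Mnemonic | Slot | Opc | Mnemonic | Slot | Opc | Mnemonic | Slot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | SGXT | 160 | 68 | BRX | 283 | 141 | UISETP | 120 |
| 6 | LOP3 | 241 | 69 | JMP | 164 | 142 | ULDC | 126 |
| 7 | ISETP | 148 | 70 | JMX | 168 | 143 | ULEA | 129 |
| 8 | IABS | 97 | 71 | CALL | 130 | 144 | UP2UR | 139 |
| 10 | SHF | 93 | 74 | BREAK | 12 | 145 | ULOP3 | 143 |
| 11 | FFMA | 94 | 75 | BPT | 13 | 146 | UPLOP3 | 151 |
| 13 | FMUL | 95 | 77 | EXIT | 172 | 147 | USEL | 163 |
| 20 | SEL | 29 | 78 | RTT | 30 | 149 | UFLO | 200 |
| 22 | R2P | 37 | 84 | TLD | 86 | 150 | UIADD3 | 201 |
| 24 | PRMT | 188 | 85 | TLD4 | 88 | 151 | UIMAD | 206 |
| 25 | NOP | 190 | 86 | TMML | 89 | 152 | UMOV | 207 |
| 29 | PMTRIG | 32 | 88 | TXQ | 187 | 153 | UPRMT | 208 |
| 30 | CSMTEST | 271 | 99 | STL | 1 | 154 | VOTEU | 213 |
| 31 | VABSDIFF | 159 | 100 | LD | 25 | 156 | USHF | 214 |
| 32 | VABSDIFF4 | 72 | 101 | ST | 33 | 158 | F2FP | 216 |
| 34 | IDE | 55 | 102 | ATOM | 38 | 159 | HMMA_1688 | 217 |
| 35 | I2I | 42 | 104 | RED | 44 | 160 | HMMA_16816 | 219 |
| 36 | I2IP | 53 | 105 | ATOMS | 45 | 161 | BMMA | 227 |
| 47 | I2F | 224 | 106 | QSPC | 59 | 162 | TTUCCTL | 229 |
| 48 | I2F_X | 235 | 108 | CCTL | 60 | 163 | TTUMACRO | 290 |
| 50 | FRND_X | 150 | 109 | CCTLL | 62 | 164 | R2UR | 7 |
| 51 | AL2P | 87 | 110 | CCTLT | 68 | 168 | FOOTPRINT | 36 |
| 55 | BMOV_R | 16 | 111 | MEMBAR | 71 | 172 | SM82_FIRST | 110 |
| 57 | S2R | 98 | 112 | SULD | 78 | 173 | GATHER | 115 |
| 58 | B2R | 153 | 113 | SUST | 79 | 174 | GENMETADATA | 114 |
| 59 | R2B | 183 | 114 | SUATOM | 106 | 175 | SPMETADATA | 117 |
| 60 | LEPC | 288 | 118 | ISBEWR | 147 | 176 | BMMA_88128 | 196 |
| 61 | BAR | 18 | 119 | SHFL | 149 | 177 | BMMA_168128 | 254 |
| 62 | BAR_IDX | 302 | 122 | DFMA | 179 | 178 | BMMA_168256 | 255 |
| 63 | SETCTAID | 303 | 123 | DADD | 180 | 179 | CLMAD | 256 |
| 65 | GETLMEMBASE | 248 | 124 | DMUL | 192 | 180 | DMMA | 257 |
|  |  |  | 125 | DSETP | 191 | 181 | HMMA_SP_1688 | 258 |
|  |  |  | 126 | HADD2 | 199 | 182 | HFMA2_MMA | 259 |
|  |  |  | 127 | HADD2_F32 | 215 | 183 | HMNMX2 | 260 |
|  |  |  | 129 | HMUL2 | 221 | 184 | IMMA_88 | 261 |
|  |  |  | 130 | HSET2 | 225 | 187 | IMMA_16832 | 262 |
|  |  |  | 131 | HSETP2 | 2 | 188 | IMMA_SP_16832 | 243 |
|  |  |  | 132 | HMMA_16 | 10 | 202 | QMMA_16832 | 96 |
|  |  |  | 133 | HMMA_32 | 48 | 204 | QMMA_SP_12864 | 169 |
|  |  |  |  |  |  | 205 | SM89_LAST | 178 |
|  |  |  |  |  |  | 206 | SM90_FIRST | 197 |
|  |  |  |  |  |  | 207 | ACQBLK | 239 |
|  |  |  |  |  |  | 208 | CGABAR_ARV | 154 |
|  |  |  |  |  |  | 210 | CGABAR_SET | 195 |
|  |  |  |  |  |  | 211 | CGABAR_WAIT | 175 |
|  |  |  |  |  |  | 217 | ENDCOLLECTIVE | 309 |
|  |  |  |  |  |  | 218 | FENCE_G | 170 |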

Sentinel entries (slot = 355, no default encoding -- require SM-override):

| Opc | Mnemonic | Notes |
| --- | --- | --- |
| 40 | FCHK | FP range check -- encoding varies by arch |
| 49 | FRND | FP round -- superseded by FRND_X (opc 50) in default table |
| 169 | S2UR | Special-reg to uniform-reg -- uniform pipe is SM-specific |
| 212 | CGAERRBAR | CGA error barrier -- SM 90+ only |
| 213 | CREATEPOLICY | Cache policy creation -- SM 90+ only |
| 214 | CVTA | Generic address conversion -- encoding varies by arch |
| 215 | DMMA | Double MMA (duplicate; opc 180 has default slot 257) |
| 216 | ELECT | Warp elect -- SM 90+ only |
| 219 | FENCE_S | Shared fence -- SM 90+ only |
| 220 | FMNMX | FP min/max (duplicate; separate from FMNMX at opc 14) |

Opcode Case Routing

The 169 distinct opcode cases (338 total case labels across both switches) group into approximately 70 handler blocks. The groupings reveal SASS ISA families:

| Group | Opcodes | Handler pattern | Instruction family |
| --- | --- | --- | --- |
| No-op passthrough | 8, 38, 46, 87, 89, 90, 93, 97, 98, 208 | goto LABEL_33 (already encoded) | Pre-encoded by upstream ISel |
| Simple emission | 7, 34, 35, 36 | sub_9314F0(encoding_index, 1 operand) | Basic ALU / simple 1-op |
| Branch/call | 9, 10, 11, 12, 13, 22 | sub_926370 / vtable[17] / linked-list walk | Control flow, call frames |
| Memory load/store | 15, 16, 18, 19, 20, 23, 24, 25, 26, 30 | sub_C01840 + address helpers | LDG, STG, LDS, etc. |
| Control flow | 31, 32, 33 | SSA phi nodes, branch tables | Phi, switch, call return |
| Generic ALU | 39, 41, 42, 50, 51, 52, 53 | sub_9314F0 passthrough | Standard arithmetic |
| Special register | 43, 44, 45 | sub_C06E90 symbol lookup | SR access, shared memory alias |
| Constant/predicate | 47, 54, 55, 56 | Direct operand copy / sub_BFFD60 | Constant bank, predicate ops |
| Address compute | 57 | 200-line handler, "__nv_reservedSMEM_offset_0_alias" | Complex addressing with SMEM |
| Immediate ops | 59, 60 | sub_C05CC0 / sub_C07690 | Immediate-operand variants |
| Barrier/sync | 61, 62, 63, 221 | Forward to vtable[3] extension | BAR, MEMBAR, SYNC |
| Conversion/move | 65 | Operand loop with per-element sub_9314F0 | MOV, CVT |
| Texture/surface | 67, 68, 69, 70 | Multi-operand type-qualified encoding | TEX, TLD, TXQ |
| Intrinsics | 71, 74, 75 | Loop-based operand emission | Hardware intrinsics |
| Tensor core | 84, 88, 91, 92 | Wide-operand encoding (case 92 = 354 lines) | HMMA, DMMA, IMMA, TCGen05 |
| Predication ext | 94, 95 | Predicate-dependent path selection | Extended predication |
| Memory extended | 99--130 (19 opcodes) | sub_C0B2C0 or sub_BFFD60 + encoding lookup | Extended memory ops |
| Warp intrinsics | 131--189 (50+ opcodes) | Mixed handlers, vtable[198]+632 dispatch | SHFL, VOTE, MATCH, REDUX |
| Async/bulk | 192--218 (15 opcodes) | sub_C0B2C0 / individual handlers | TMA, async copy, bulk ops |

The largest case handlers:

  • Cases 141/142: ~503 lines (warp shuffle/vote extended operations)
  • Case 92: ~354 lines (tensor core instructions -- widest operand format)
  • Cases 45, 57, 95: ~200 lines each (shared memory, address compute, predication)

Representative Per-Case Handler Pseudocode

Six handlers spanning the five marshalling patterns. All write tagged operands into buf[32] (256 bytes) and converge at LABEL_33 (clears ctx[256..257] encoding flags).

1. Simple emission -- cases 7/34/35/36 (ISETP, IDE, I2I, I2IP)

instr->output = ctx->reg_base;                        // instr[16] = ctx[35]
BuildOperandRecord(buf, ctx, encoding_index, 1, 0, NULL);  // sub_9314F0

2. Control flow -- cases 15/16 (BRA/RET with call frame)

int frame_id = ++isel_ctx->call_frame_counter;
GrowCallFrameIfNeeded(isel_ctx, frame_id);               // sub_C00C30
MarshalSrcOperands(isel_ctx, ctx, instr->src_list, buf, 32);
ResolveRegisterIndex(&tmp, ctx, instr->data_type);        // sub_91C150
EmitBranchRecord(&tmp, ctx, 0x82, data_type, tmp, buf);   // sub_92E800
record = isel_ctx->frame_base + (frame_id << 6);          // 64-byte frame record
record->encoding = tmp;  record->opcode = instr->opcode;
if (opcode == 16/*RET*/ && IsUniformPipe(ctx->backend))
    MarkReturnRecord(ctx, record, 4);
// Trailing BRA stub to callee address via encoding_index 0x5F
EmitBranchRecord(&tmp, ctx, 0xC7, data_type, 0xFFFFFF, buf);
buf[0] = callee_addr & 0xFFFFFF | TAG_CONST_BANK;  buf[1] = tmp;
buf[2] = 0x60000003;  // modifier: count=3
BuildOperandRecord(&result, ctx, 0x5F, 1, 3, buf);

3. Multi-operand ALU -- cases 102/106/114/122--124/127 (ATOM, DFMA, DADD, DMUL, ...)

int n_dst = MarshalDstOperands(isel_ctx, ctx, instr, buf);     // sub_BFFD60
int n_src = MarshalSrcOperands(..., &buf[n_dst], 32 - n_dst);
int enc = (vtable[2] != default) ? vtable[2](isel_ctx, opcode)
                                 : word_22B4B60[opcode];
if (data_type == 19 && vtable[14](isel_ctx, instr))           // shared-mem override
    enc = (enc == 180) ? 181 : (enc == 192) ? 193 : enc;
Emit2OpRecord(&result, ctx, enc, data_type, buf, &buf[n_dst]); // sub_92E720
instr->encoding = RegisterResult(isel_ctx->output, &result);
for (int j = 1; j < n_dst; j++)
    ChainExtraDest(isel_ctx->output, &result);                 // sub_A64220

4. Tensor core -- case 92 (HMMA/IMMA/DMMA wide-operand)

int n_dst = MarshalDstOperands(isel_ctx, ctx, instr, buf);
MarshalSrcOperands(..., &buf[n_dst], 32 - n_dst);
int sched_class = GetSchedClass(ctx->backend, data_type);     // vtable[198]+904
if (sched_class == 2) {                                       // HMMA path
    int frag = 0;  instr->encoding = 0;
    for (int k = 0; k < 4; k++) {
        if (!instr->fragment_mask[k]) continue;                // bytes +48..+51
        BuildFragOperand(&imm, 8, 8*k);                       // sub_91E8E0
        int reg = EncodeRegIndex(ctx, imm);                    // sub_91D160
        Emit4OpRecord(&rec, ctx, 0x14, 0xC, &buf[frag++], &buf[n_dst],
                      &reg, &TAG_ALU);                         // sub_92FF10
        int r = RegisterResult(isel_ctx->output, &rec);
        if (!instr->encoding) instr->encoding = r;
    }
} else if (sched_class == 4) { /* DMMA: flag-OR per fragment via sub_92E720 */ }

5. Surface/texture -- case 150 (3-phase operand collection + per-dest loop)

bool is_float = IsFloatType(instr->data_type);                // sub_7D6860
int n_dst   = MarshalDstSpecial(isel_ctx, ctx, instr, buf, 32, is_float);
int n_src   = MarshalSrcOperands(..., &buf[n_dst], 32 - n_dst);
int n_coord = MarshalSrcOperands(..., &buf[n_dst+n_src], ...);
if ((instr->subop - 28) & ~4) {                               // non-texture path
    int modifier = (instr->modifier & 0x3F) | TAG_CONTROL;
    for (int i = 0; i < n_dst; i++) {                          // one SASS per dest
        Emit4OpRecord(&rec, ctx, enc_idx, data_type,
                      &buf[i], &buf[n_dst+...], &buf[n_dst+n_src+...], &modifier);
        int r = RegisterResult(isel_ctx->output, &rec);
        if (i == 0) instr->encoding = r;
    }
} else { /* texture-qualified: vtable[29] dispatch */ }

6. MMA/warp extended -- cases 141/142 (shape-descriptor construction)

int shape = instr->modifier & 7;
if (!shape) goto generic_path;                                // LABEL_417
int enc_idx = (opcode == 141) ? 121 : 127;                   // UISETP / ULDC
int desc_mod = (vtable[21] != default) ? vtable[21](isel_ctx, instr) << 16 : 0;
int descriptor = desc_mod + instr->mma_m + (instr->mma_n << 8);
buf[0] = 0xF0000000;  buf[1] = descriptor;                   // TAG_SPECIAL
AllocUniformReg(&tmp, ctx, 6);                                // sub_91BF30
buf[2] = (tmp & 0xFFFFFF) | 0x90000000;  buf[3] = descriptor;
goto marshal_and_emit;                                        // -> LABEL_418
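
The descriptor arithmetic in handler 6 is simple bit packing. The C sketch below reproduces it with hypothetical helper names; it assumes mma_m fits in the low 8 bits (suggested by the << 8 applied to mma_n, but not confirmed by the decompilation) and takes the raw vtable[21] result as a parameter, with 0 meaning the default stub.

```c
#include <assert.h>
#include <stdint.h>

/* descriptor = (modifier << 16) + mma_m + (mma_n << 8), as in the handler. */
static uint32_t build_shape_descriptor(uint32_t modifier, uint32_t mma_m,
                                       uint32_t mma_n) {
    return (modifier << 16) + mma_m + (mma_n << 8);
}

/* Tag a 24-bit register index with the 0x9 type nibble used for buf[2]. */
static uint32_t tag_uniform_reg(uint32_t reg) {
    return (reg & 0x00FFFFFFu) | 0x90000000u;
}
```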

Operand Encoding Protocol

The mega-selector encodes operands into a stack-allocated 256-byte output buffer using a tagged-pointer word format. Each operand occupies 8 bytes (a DWORD pair):

| Bits | Field | Description |
| --- | --- | --- |
| [31:28] of word 0 | Type tag | 0x1=register, 0x4=constant bank, 0x5=immediate, 0x6=control/modifier, 0x9=special register |
| [23:0] of word 0 | Value | Register index, immediate value, or bank offset |
| word 1 | Flags | Modifier bits, encoding-format flags |
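
A compact way to read this layout is as a pack/unpack pair. The following C sketch uses the tag values from the table; the type and function names are hypothetical, and bits [27:24] of word 0, which the table does not describe, are simply left zero.

```c
#include <assert.h>
#include <stdint.h>

enum operand_tag {
    TAG_REGISTER    = 0x1,
    TAG_CONST_BANK  = 0x4,
    TAG_IMMEDIATE   = 0x5,
    TAG_CONTROL     = 0x6,
    TAG_SPECIAL_REG = 0x9
};

/* One operand: a DWORD pair, 8 bytes; 32 of these fill the 256-byte buffer. */
typedef struct {
    uint32_t word0;  /* [31:28] type tag, [23:0] value */
    uint32_t word1;  /* modifier and encoding-format flags */
} operand_t;

static operand_t make_operand(enum operand_tag tag, uint32_t value,
                              uint32_t flags) {
    operand_t op = { ((uint32_t)tag << 28) | (value & 0x00FFFFFFu), flags };
    return op;
}

static unsigned operand_tag_of(operand_t op)   { return op.word0 >> 28; }
static uint32_t operand_value_of(operand_t op) { return op.word0 & 0x00FFFFFFu; }
```

With these helpers, the literal 0x60000003 seen in the control-flow handler decodes as a control/modifier operand with value 3.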

The marshalling pipeline for a typical case:

1. sub_C01840(ctx, instr, operand_list, output_buf, max_count, ...)
   -> Iterates source operands, writes tagged words to output_buf
   -> Returns: number of operand words written

2. sub_C01F50(ctx, instr, dest_list, output_buf, max_count, ...)
   -> Same for destination operands

3. Encoding-index lookup:
   if (vtable[2] != default)
     index = vtable[2](ctx, opcode);
   else
     index = word_22B4B60[opcode];

4. sub_9314F0(output, ctx, encoding_index, count, n_words, buf, ...)
   -> Emits the instruction record to the output stream

| Helper | Calls | Purpose |
| --- | --- | --- |
| sub_C01840 | 52 | Marshal source operands into tagged-word buffer |
| sub_9314F0 | 31 | Emit instruction with encoding index + operand buffer |
| sub_C00EA0 | 8 | Extract single operand as tagged word |
| sub_91D160 | 8 | Encode register index to encoding bits |
| sub_934630 | 6 | Build new instruction node in IR (for multi-instruction expansion) |
| sub_91D150 | 5 | Decode register index from operand word |
| sub_926370 | 4 | Emit simple instruction (branch/jump) |
| sub_C01F50 | 3 | Marshal destination operands |
| sub_7D6860 | 3 | Encode data type qualifier (FP32/FP64/INT) |
| sub_BFEF10 | 3 | Register bank capacity check / grow |
| sub_92E1B0 | 2 | Emit instruction with constant-bank operand |

Cross-Reference: Arch Dispatch Tables

The 5 arch dispatch thunks (sub_B128E0--sub_B12920, each 13 bytes setting esi to an SM-family code: 0x5004, 0x5003, 0x5001, 0x5000, 0x4000) are not called from the mega-selector. They tail-jump to a shared handler at loc_1C38C00 and operate at the Mercury encoder level:

Mega-selector (sub_C0EB10)
  -> Produces (encoding_index, operand_buffer) pairs
  -> Calls sub_9314F0 to package into instruction nodes

Mercury encoder (sub_6D9690)
  -> Reads instruction type field from instruction node
  -> Arch dispatch tables (sub_B128E0 etc.) resolve type to encoding format
  -> Encoder emits binary SASS using format + operand data

The mega-selector and arch dispatch tables thus operate at different abstraction levels: the mega-selector decides what to encode (opcode selection, operand marshalling), while the arch tables decide how to encode it (encoding format, bit layout). The arch tables' per-SM variants handle encoding-level differences (field widths, modifier positions) that are invisible to the mega-selector's opcode-level logic.

Post-ISel Modifiers -- sub_B1D670 (13 KB)

After the main selection step, this pass applies architecture-specific instruction modifications:

  • Opcode 13: sets instruction field [79] = 3
  • Opcode 14: sets instruction field [79] = 2
  • Opcode 11: separate modifier path

The function has 51 callees including sub_AAD690 (field accessor, called multiple times), sub_AADF40, and sub_C49400 (encoding validator). It handles encoding mode bits, register class adjustments, and predicate attachment.

Instruction Properties -- sub_B241A0 (7 KB)

Sets scheduling-relevant properties on the selected instruction:

  • inst[74] = 7 -- scheduling class
  • inst[75] = (opcode == 325) -- special flag for specific opcode
  • inst[77] = sub_A3B930(...) -- operand class from register resolver
  • inst[79] -- derived from a2[19], architecture-dependent

Contains a switch on *(context+46) (target architecture selector), confirming per-SM property assignment.

DAG Pattern Matchers -- ~800 Functions at 0xB28F60--0xB7D000

Every pattern matcher follows an identical prototype and a strict check-and-report protocol. These are the ptxas equivalent of LLVM's TableGen-generated ISel patterns, but handwritten in C++. Binary analysis confirms 801 functions with the matching *a4 <= priority-comparison idiom, with the bulk (750+) residing in the 0xB30000--0xB7D000 range and a handful of smaller matchers in the 0xB28F60--0xB30000 preamble zone.

Pattern Matcher Architecture

The pattern matching system implements a priority-based best-match selection protocol. For each instruction being lowered, the ISel infrastructure invokes all applicable matchers (dispatched through vtable function pointers, not direct calls). Each matcher independently tests whether the instruction matches its pattern; if it does, it writes a (template_id, priority) pair to the output parameters. The dispatcher selects the match with the highest priority value.

Function signature (all 801 matchers):

char __fastcall match(
    int64_t  ctx,           // a1: ISel context (passed through to field reader)
    int64_t  dag_node,      // a2: pointer to the Ori IR instruction node
    int32_t *template_id,   // a3: OUT: encoding template index [1..152]
    int32_t *priority       // a4: IN/OUT: current best priority; written only if better
);

The priority parameter is read-then-conditionally-written: the matcher checks if (*a4 <= threshold) before overwriting. This means the dispatcher initializes *a4 = 0 and calls matchers in sequence; each matcher only upgrades the result if its specificity exceeds the current best. After all matchers complete, *a3 holds the template index of the winning pattern.

Matching pipeline (invariant across all 801 matchers):

 1. OPCODE PROPERTY CHECKS      sub_10AE5C0(ctx, node, field_id)
    Check 1-12 instruction properties against expected values.
    Any mismatch -> return immediately (early exit).

 2. SOURCE OPERAND COUNT         sub_B28F50(node) -> source_count
    Verify the instruction has the expected number of source operands.

 3. SOURCE OPERAND VALIDATION    sub_B28F30(node, i) -> operand_record
    For each source operand:
      a. Type predicate: isImmediate / isGPR / isPredicate / isUniformReg / ...
      b. Register class: class == 1023 (wildcard) OR class == specific_value

 4. RESULT OPERAND COUNT         sub_B28F40(node) -> result_count
    Verify the expected number of result (destination) operands.

 5. RESULT OPERAND VALIDATION    sub_B28F30(node, first_result + j)
    Same type + register-class checks as for source operands.
    First-result index = sub_B28E00(*(node + 92)).

 6. PRIORITY WRITE               if (*a4 <= N) { *a4 = N+1; *a3 = template; }
    Conditional update: only overwrite if this pattern is more specific
    than whatever was already matched.
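
The conditional-update protocol in step 6 can be illustrated with two toy matchers and a loop that calls them in sequence. The `Node` struct, the two matchers, and `select_template` are hypothetical; only the `if (*priority <= N)` update rule is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced instruction node. */
typedef struct { int opcode_class; int source_count; } Node;
typedef void (*Matcher)(const Node *n, int32_t *template_id, int32_t *priority);

/* Generic pattern: one constraint, priority 8, template 12. */
static void match_generic(const Node *n, int32_t *tid, int32_t *prio) {
    if (n->opcode_class != 21) return;           /* early exit on mismatch */
    if (*prio <= 7) { *prio = 8; *tid = 12; }
}

/* Specific pattern: two constraints, priority 17, template 22. */
static void match_specific(const Node *n, int32_t *tid, int32_t *prio) {
    if (n->opcode_class != 21) return;
    if (n->source_count != 2) return;
    if (*prio <= 16) { *prio = 17; *tid = 22; }
}

/* Call every candidate; the highest-priority match survives in tid. */
static int32_t select_template(const Node *n) {
    static const Matcher matchers[] = { match_specific, match_generic };
    int32_t tid = 0, prio = 0;                   /* 0 = nothing matched yet */
    assert(n != NULL);
    for (size_t i = 0; i < sizeof matchers / sizeof matchers[0]; i++)
        matchers[i](n, &tid, &prio);
    return tid;
}
```

Because each matcher only overwrites a strictly lower priority, the result does not depend on call order: here the specific matcher runs first and the generic one cannot downgrade it.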

Match-Score Priority System

The priority values range from 2 (least specific) to 34 (most specific), with the distribution heavily concentrated in the 8--19 range. The priority correlates directly with pattern specificity: matchers with more constraints (more sub_10AE5C0 checks, more operand type checks, tighter register class requirements) assign higher priority values.

| Priority range | Count | Interpretation |
|---|---|---|
| 2--5 | 31 | Fallback / generic patterns (few constraints) |
| 6--10 | 253 | Common patterns (3--6 constraints) |
| 11--15 | 293 | Standard patterns (5--8 constraints) |
| 16--20 | 168 | Specific patterns (6--10 constraints) |
| 21--34 | 56 | Highly specific patterns (8--12+ constraints) |

Template IDs range from 1 to 152. Multiple matchers can target the same template ID at different priority levels, forming a specificity ladder: a generic matcher might match FADD at priority 8 while a specialized matcher matches FADD.FTZ.SAT with specific register classes at priority 17. Both write the same template ID but the specialized matcher wins when its constraints are satisfied.

Dispatcher Mechanism

The matchers are not called directly from a single dispatch function. Instead, they are registered as virtual methods on per-instruction-class descriptor objects. The dispatch chain is:

sub_B285D0 (ISel driver, 9 KB)
  -> opcode switch on (instruction[18] & 0xFFFFCFFF)
     -> selects builder variant (sub_B1FA20 / sub_B20E00 / sub_B1EC10 / ...)
        -> builder invokes vtable method on instruction descriptor
           -> vtable slot contains pointer to one of the 801 pattern matchers
              -> matcher writes (template_id, priority) if pattern matches

For a given instruction, the dispatcher may invoke multiple matchers (one per applicable template variant). Each matcher independently checks its constraints and conditionally updates the priority/template pair. After all candidates have been tried, the dispatcher reads the final template_id and uses it to select the SASS encoding template.

Vtable Registration and Dispatch Table Layout

All ~801 matchers are registered through a single C++ vtable rooted at off_22AD230 (installed by the descriptor constructor sub_9CE190). The vtable contains 244 general-purpose virtual method slots followed by a contiguous dispatch table of 1884 function pointers starting at vtable byte offset +1952 (address 0x22AD9D0). Every pointer in this dispatch table targets a location inside a single 200 KB monolithic function sub_BA9CF0 (spanning 0xBA9CF0--0xBDBA60), which the compiler emitted as an enormous switch-case body. Of the 1884 entries, 1611 point to distinct case handlers (the actual matcher logic) and 273 point to a shared sentinel at 0xBA9E23 (a no-op return indicating "no matcher for this slot").

The coordinator function sub_13AF3D0 (137 KB) invokes matchers through indirect calls at specific vtable byte offsets. Each offset maps to a dispatch table index:

| Vtable offset | Dispatch index | Role |
|---|---|---|
| +2328 | 47 | Guard check (compared against sub_13A6110) |
| +2600 | 81 | Primary matcher -- operand-with-offset form |
| +2616 | 83 | Secondary matcher -- multi-operand ALU (case 102) |
| +2624 | 84 | Tertiary matcher -- extended ALU (case 108) |
| +2656 | 88 | Quaternary matcher -- opcode 78 fallthrough |
| +2896 | 118 | Predicated-store matcher (LD/ST .STRONG) |
| +3232 | 160 | Architecture gate (compared against sub_868720) |
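
The offset-to-index mapping follows from the layout described above: 244 general-purpose slots of 8 bytes each precede the dispatch table. A small sketch of that arithmetic (the helper names are hypothetical, the constants come from the text):

```c
#include <assert.h>
#include <stdint.h>

enum { SLOT_SIZE = 8, GENERAL_SLOTS = 244 };
#define VTABLE_BASE 0x22AD230u   /* off_22AD230 */

/* Byte offset of the first dispatch-table entry inside the vtable. */
static uint32_t dispatch_table_offset(void) {
    return (uint32_t)(GENERAL_SLOTS * SLOT_SIZE);
}

/* Absolute address of the first dispatch-table entry. */
static uint32_t dispatch_table_address(void) {
    return VTABLE_BASE + dispatch_table_offset();
}

/* Convert a vtable byte offset into a dispatch-table index. */
static uint32_t dispatch_index(uint32_t vtable_offset) {
    assert(vtable_offset >= dispatch_table_offset());
    assert(vtable_offset % SLOT_SIZE == 0);
    return (vtable_offset - dispatch_table_offset()) / SLOT_SIZE;
}
```

For example, offset +2328 gives (2328 - 1952) / 8 = 47 and offset +3232 gives (3232 - 1952) / 8 = 160.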

The coordinator uses a guard-then-dispatch invocation pattern:

// sub_13AF3D0 -- primary matcher invocation (vtable +2600)
vtable = *(int64_t*)descriptor;
guard_fn = *(fn_ptr*)(vtable + 2328);
if (guard_fn == sub_13A6110) {                    // default guard -> proceed
    offset_ptr = &operand_array[operand_index];
    (*(fn_ptr*)(vtable + 2600))(descriptor, node, offset_ptr, out);
    if (!(descriptor[1039] & 2) && (offset_ptr[1] & 0x18000000))
        sub_13A5ED0(ctx, node, operand_index, out, ...);  // post-match fixup
} else {
    guard_fn(descriptor, node, operand_index, out, ...);   // SM-override path
}

// vtable +2896 -- sentinel-guarded dispatch for predicated stores
store_fn = *(fn_ptr*)(vtable + 2896);
if (store_fn != nullsub_239)                      // sentinel = no-op stub
    store_fn(descriptor, node, out);

The 1884 dispatch table entries partition into 144 non-sentinel groups separated by sentinel runs. Group sizes range from 1 to 152 matchers, reflecting per-instruction-class matcher fan-out:

| Group size | Count | Example instruction classes |
|---|---|---|
| 1 | 27 | Single-encoding (NOP, EXIT, BAR) |
| 2--5 | 37 | Simple ALU (IADD, IMUL) |
| 6--10 | 44 | Common FP (FADD, FMUL, FSETP) |
| 11--25 | 16 | Complex LD/ST (LDG, STG, ATOM) |
| 26--50 | 11 | Wide-encoding (HMMA, conversion) |
| 51--152 | 9 | Max fan-out (opcode 0x135 arith) |

The matchers have no static callers -- they appear exclusively through indirect function pointer invocation via this vtable, which is why cross-reference tools report them as "no callers in function DB."

DAG Node Property Accessor -- sub_10AE5C0

The field reader is the most-called function in the matcher range (typically 2--12 calls per matcher, for an estimated 3,000--8,000 call sites across all 801 matchers):

// sub_10AE5C0 -- Read instruction property by field_id
int64_t DAGNode_ReadField(int64_t ctx, int64_t node, uint32_t field_id) {
    if (sub_10E32E0(node, field_id))        // field exists in descriptor?
        return sub_10D5E60(node, field_id); // read value from property table
    else
        return 0xFFFFFFFF;                  // sentinel: field not present
}

The field_id values form a large flat namespace (observed range: 5--595). These are not byte offsets into the instruction record; they are logical property identifiers resolved through a descriptor table. The backing store (managed by sub_10E32E0 / sub_10D5E60) implements a sparse property bag that maps field IDs to integer values.
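
One way to picture the sparse property bag is a short list of (field ID, value) pairs searched by ID. This is a sketch under assumptions: the actual storage behind sub_10E32E0 / sub_10D5E60 is not described here, and `FieldEntry`, `PropertyBag`, `find_field`, and `read_field` are hypothetical names. Only the 0xFFFFFFFF "absent" sentinel is taken from the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIELD_ABSENT 0xFFFFFFFFu   /* sentinel returned for a missing field */

/* Hypothetical storage: an unordered list of (field_id, value) pairs. */
typedef struct { uint32_t field_id; uint32_t value; } FieldEntry;
typedef struct { const FieldEntry *entries; size_t count; } PropertyBag;

/* Existence check; plays the role attributed to sub_10E32E0. */
static const FieldEntry *find_field(const PropertyBag *bag, uint32_t field_id) {
    for (size_t i = 0; i < bag->count; i++)
        if (bag->entries[i].field_id == field_id)
            return &bag->entries[i];
    return NULL;
}

/* Read a logical property by ID, or the sentinel if it is not present. */
static uint32_t read_field(const PropertyBag *bag, uint32_t field_id) {
    assert(bag != NULL);
    const FieldEntry *e = find_field(bag, field_id);
    return e ? e->value : FIELD_ABSENT;
}
```

Because field IDs are logical keys rather than byte offsets, a node only pays for the properties it actually carries, and a matcher comparing the result against an expected value rejects absent fields automatically.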

The companion write functions follow the same field-ID namespace:

// sub_10AE590 -- Write single field
void DAGNode_WriteField(int64_t ctx, int64_t node, uint32_t field_id, uint32_t value);

// sub_10AE640 -- Write two fields atomically (multi-field update)
void DAGNode_WriteFields(int64_t ctx, int64_t node, uint32_t f1, uint32_t v1, uint32_t v2);

Semantic mapping of the 43 most frequently referenced field IDs, ranked by total read+write frequency across the 801 matchers ("W" = write-count in lowering functions, "R" = read-count in guard predicates):

| Field | Hex | W | R | Semantic name | Evidence / observed values |
|---|---|---|---|---|---|
| 345 | 0x159 | -- | 417 | rounding mode selector | Most-read field. Guard value 1900 in 135 matchers; 1901--1903 select alternatives. Drives field 300 (1900->1513, 1901/02->1515, 1903->1516). |
| 300 | 0x12C | 302 | -- | rounding mode encoding | Highest write count. Values: 1513 (RN), 1515 (RZ/RM), 1516 (RP). Always paired with field 301. |
| 301 | 0x12D | 222 | -- | FTZ / rounding qualifier | Values: 1521 (default/FTZ), 1520 (non-FTZ). Co-occurs with field 300 in every rounding-sensitive matcher. |
| 480 | 0x1E0 | 163 | -- | encoding format class | Value 2481 dominant (126 writes); also 2480, 2478. Selects top-level SASS encoding format (3-source vs immediate). |
| 212 | 0x0D4 | 140 | -- | operand source layout | Single value 1184. Part of a 5-field "instruction shape descriptor" group with {456, 336, 316, 318}. |
| 456 | 0x1C8 | 140 | -- | instruction shape tag | Value 2373 (120 writes), 2375 (20). Always precedes the shape descriptor group. |
| 336 | 0x150 | 120 | -- | operand count / layout | Values 1863, 1865. Distinguishes 2-source from 3-source operand formats. |
| 318 | 0x13E | 120 | -- | source B encoding class | Values 1784, 1785, 1789. Selects reg-imm vs reg-reg for second source operand. |
| 316 | 0x13C | 120 | -- | source A encoding class | Values 1776, 1777. Paired with field 318 to specify the source operand encoding pair. |
| 150 | 0x096 | 101 | -- | comparison mode | Values 650, 651. Used in ISETP/FSETP matchers for comparison operator. |
| 64 | 0x040 | 101 | -- | destination data type | Single value 297. Sets the output data-type tag in the encoding descriptor. |
| 29 | 0x01D | 101 | -- | predicate / condition code | Single value 126. Controls whether the instruction writes a predicate register. |
| 270 | 0x10E | 101 | -- | accumulation mode | Single value 1385. Controls FMA-style accumulation (fused vs non-fused). |
| 88 | 0x058 | 90 | 2 | sub-operation modifier | Value 408 dominant. Selects instruction sub-variant (e.g., ADD vs FADD within same opcode). |
| 359 | 0x167 | 90 | -- | operand negation mask | Single value 1957. Encodes source-operand sign-flip for FP instructions. |
| 283 | 0x11B | -- | 80 | source type A | Guard values 1446--1449. Read with field 284 to encode source type for HMMA and similar. |
| 284 | 0x11C | -- | 80 | source type B | Guard values 1451, 1452. Paired with 283 (e.g., {1449,1452} -> FP16xFP16). |
| 210 | 0x0D2 | 65 | -- | memory scope | Values 1175--1177. CTA / GPU / SYS scope for ld/st/atom. |
| 234 | 0x0EA | 60 | -- | cache control modifier | Single value 1258. Encodes cache hints (L1 bypass, etc.). |
| 33 | 0x021 | 47 | -- | data width specifier | Values 147, 148. Selects 32-bit vs 64-bit operation width. |
| 22 | 0x016 | 47 | -- | integer sign qualifier | Values 99 (unsigned), 101 (signed). |
| 154 | 0x09A | 41 | -- | texture/surface sampler | Values 664--669 (6 variants). Selects among texture access modes. |
| 209 | 0x0D1 | 40 | -- | memory ordering | Single value 1172. Sets acquire/release semantics. |
| 89 | 0x059 | -- | 36 | operand class guard | Read-only pre-condition filter; matchers exit early if absent or wrong. |
| 500 | 0x1F4 | 36 | -- | extended modifier A | Values 2547--2549. Blackwell (sm_100+) matchers only. |
| 330 | 0x14A | -- | 32 | precision / type tag | Read-then-copied: propagates data-type attribute from source to output node. |
| 208 | 0x0D0 | 30 | -- | atomic operation type | Values 1163--1167 (ADD, MIN, MAX, CAS, EXCH). |
| 327 | 0x147 | -- | 30 | register format descriptor A | Read-only register class constraint. Always checked with field 328. |
| 328 | 0x148 | -- | 30 | register format descriptor B | Read-only companion to field 327. |
| 281 | 0x119 | 30 | -- | address space qualifier | Single value 1436. GLOBAL/SHARED/LOCAL address space tag. |
| 31 | 0x01F | 27 | -- | logical operation selector | Single value 133. AND/OR/XOR for LOP3-class instructions. |
| 20 | 0x014 | 27 | -- | immediate format | Single value 94. Immediate field encoding (sign extension, shift). |
| 293 | 0x125 | 27 | -- | shift amount | Single value 1489. Shift distance for SHL/SHR. |
| 418 | 0x1A2 | 24 | -- | uniform register hint | Single value 2180. Marks instruction eligible for uniform register allocation. |
| 344 | 0x158 | 24 | 12 | output saturation | Values 1896, 1897. Bidirectional: read in guards, written in lowering. [0,1] clamping. |
| 479 | 0x1DF | 22 | -- | extended modifier B | Values 2467, 2468. Blackwell-era extended instruction property. |
| 432 | 0x1B0 | 22 | -- | warp shuffle mode | Values 2269, 2270. SHFL sub-op (UP, DOWN, BFLY, IDX). |
| 452 | 0x1C4 | -- | 22 | architecture guard | Read-only. Guards SM-generation-specific matchers. |
| 12 | 0x00C | 20 | 20 | operand encoding class | Value 59 (write). Bidirectional; feeds LUT-based encoder selection. |
| 393 | 0x189 | 20 | -- | dual-issue hint | Values 2104, 2105. Marks pairs eligible for dual-issue scheduling. |
| 397 | 0x18D | 2 | -- | encoding validity stamp | Always value 2115 (0x843). Written by lowering; encoder sub_1C4F470 ORs 0x843 into dword 0. 0x843 = 0b1000_0100_0011 (bits 0, 1, 6, 11). Post-ISel seal, not a pre-condition guard. |
| 326 | 0x146 | 2 | -- | SASS major opcode | Value 1810 (0x712). Written once at node creation; identifies top-level SASS instruction class. |
| 5 | 0x005 | 2 | -- | Ori opcode ID | Value 12 (0x0C). Copies internal Ori opcode number into the encoding descriptor. |

The value namespace is disjoint from the field-ID namespace -- each field's values are drawn from a global enumeration (~2,850 entries, max value 2829). For any given field only 1--6 distinct values are ever written. The encoding bitfield lookup table at VA 0x23F2E00 (4,096 entries, 98% fill) maps value IDs to concrete bit positions and masks in the SASS instruction word.

Field 397 / value 2115 deep analysis. Contrary to the initial hypothesis that 397 acts as a pre-ISel validity guard, decompilation shows it is write-only: lowering functions stamp field[397] = 2115 onto newly created nodes, and the encoder sub_1C4F470 ORs the literal 0x843 into instruction-word dword 0. 0x843 = 0b1000_0100_0011, so bits 0, 1, 6, and 11 are set. Three of them have an assigned meaning: bit 0 = opcode valid, bit 1 = encoding template assigned, bit 6 = ISel generation stamp; bit 11 is also set, but no meaning is attributed to it here. This is a post-ISel seal confirming the instruction passed pattern matching and has a valid binary encoding template.
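
The bit decomposition of the stamp can be checked mechanically. The sketch below lists the set bits of a word and models the encoder's OR step; `set_bits` and `apply_seal` are hypothetical helper names, and only the constant 0x843 (decimal 2115) comes from the analysis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the indices of the set bits of `word` in `bits`; return how many. */
static int set_bits(uint32_t word, int bits[32]) {
    int n = 0;
    assert(bits != NULL);
    for (int i = 0; i < 32; i++)
        if (word & (UINT32_C(1) << i))
            bits[n++] = i;
    return n;
}

/* Model of the OR that the encoder is described as applying to dword 0. */
static uint32_t apply_seal(uint32_t dword0) {
    return dword0 | 0x843u;
}
```

Running `set_bits(2115, ...)` yields four indices: 0, 1, 6, and 11.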

Operand Record Layout

Each operand is a 32-byte record accessed by index via sub_B28F30:

// sub_B28F30 -- Get operand record by index
int64_t GetOperand(int64_t node, int index) {
    return *(int64_t*)(node + 32) + 32LL * index;
}

The 32-byte operand record:

| Offset | Size | Field | Description |
|---|---|---|---|
| +0 | 1 | type_tag | Operand kind (see predicate table below) |
| +4 | 4 | primary_class | Register class ID; 1023 = wildcard (any class) |
| +14 | 1 | modifier_a | Written by sub_B28F10 |
| +15 | 1 | modifier_b | Written by sub_B28F20 |
| +20 | 4 | secondary_class | Fallback register class constraint |

Source operand count is stored at node + 92 and doubles as the first-result-operand index:

uint32_t source_count = *(uint32_t*)(node + 92);   // sub_B28F50
uint32_t result_count = *(node + 40) + 1 - source_count; // sub_B28F40
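
The record layout and the count arithmetic can be written down as a C struct, with the undocumented byte ranges kept as explicit padding. The padding member names, `get_operand`, and `result_count` are hypothetical, and reading the value at node + 40 as "index of the last operand" is an interpretation of the `+ 1` in the expression above, not something the source states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-byte operand record; unknown ranges are kept as opaque padding. */
typedef struct {
    uint8_t  type_tag;         /* +0  operand kind                   */
    uint8_t  unknown1[3];      /* +1  not described in the source    */
    uint32_t primary_class;    /* +4  register class, 1023 = wildcard */
    uint8_t  unknown8[6];      /* +8  not described                  */
    uint8_t  modifier_a;       /* +14                                */
    uint8_t  modifier_b;       /* +15                                */
    uint8_t  unknown16[4];     /* +16 not described                  */
    uint32_t secondary_class;  /* +20 fallback class constraint      */
    uint8_t  unknown24[8];     /* +24 not described                  */
} OperandRecord;

/* Equivalent of base + 32 * index from the accessor above. */
static const OperandRecord *get_operand(const OperandRecord *base, int index) {
    assert(base != NULL && index >= 0);
    return base + index;
}

/* Results follow sources, so results = last_index + 1 - sources. */
static uint32_t result_count(uint32_t last_operand_index, uint32_t source_count) {
    return last_operand_index + 1u - source_count;
}
```

With 2 sources and 4 results there are 6 operands, the last index is 5, and `result_count(5, 2)` returns 4.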

Operand Type Predicates

Fifteen predicate functions classify operand type tags. Each is a single comparison returning bool:

| Address | Name | Test | Semantics |
|---|---|---|---|
| sub_B28E20 | isImmediate | tag == 1 | Constant / immediate literal |
| sub_B28E10 | isGPR | tag == 2 | General-purpose register |
| sub_B28E80 | isPredicate | tag == 3 | Predicate register |
| sub_B28E70 | isType4 | tag == 4 | (specific operand class) |
| sub_B28E60 | isType5 | tag == 5 | (specific operand class) |
| sub_B28E30 | isSpecialReg | tag == 6 | Special register |
| sub_B28ED0 | isType7 | tag == 7 | (specific operand class) |
| sub_B28EF0 | isType8 | tag == 8 | (specific operand class) |
| sub_B28E50 | isType9 | tag == 9 | (specific operand class) |
| sub_B28E40 | isValidReg | tag == 10 | Generic valid register |
| sub_B28EE0 | isType11 | tag == 11 | (specific operand class) |
| sub_B28EA0 | isType13 | tag == 13 | (specific operand class) |
| sub_B28EB0 | isType14 | tag == 14 | (specific operand class) |
| sub_B28E90 | isUniformReg | tag == 15 | Uniform register (SM 75+) |
| sub_B28EC0 | isType16 | tag == 16 | (specific operand class) |

Register class 1023 serves as a wildcard: if (class == 1023 || class == expected). This allows matchers to accept both unconstrained operands and operands already assigned to a specific register file.

Register Class Constraint Protocol

Operand records carry two register class fields: primary_class at offset +4 and secondary_class at offset +20. The matching protocol checks them with a cascading OR:

// Typical register class check (from sub_B33F00, sub_B390A0, etc.)
uint32_t primary   = *(uint32_t*)(operand + 4);
uint32_t secondary = *(uint32_t*)(operand + 20);

if (sub_B28E00(primary) == 1023) {
    // Wildcard -- operand is unconstrained, accept it
} else {
    uint32_t cls = sub_B28E00(secondary);
    if (cls != expected_class) return;  // mismatch
}

sub_B28E00 and sub_B28F00 are identity functions -- the register class is stored as a plain integer, not packed. The two-field scheme allows the matcher to accept an operand where either the allocation constraint (primary) is wildcard or the resolved register file (secondary) matches.
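
The cascading check reduces to a small predicate. The sketch below gives a single-class form and a set form (for constraints written as "secondary in {1,2}"); both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { REG_CLASS_WILDCARD = 1023 };

/* Accept if the primary class is the wildcard, else require the
   secondary class to equal the expected register class. */
static bool class_accepts(uint32_t primary, uint32_t secondary, uint32_t expected) {
    if (primary == REG_CLASS_WILDCARD)
        return true;
    return secondary == expected;
}

/* Same check against a set of allowed secondary classes. */
static bool class_accepts_any(uint32_t primary, uint32_t secondary,
                              const uint32_t *allowed, size_t n) {
    assert(allowed != NULL && n > 0);
    if (primary == REG_CLASS_WILDCARD)
        return true;
    for (size_t i = 0; i < n; i++)
        if (secondary == allowed[i])
            return true;
    return false;
}
```

Note that when the primary class is the wildcard the secondary class is never inspected, which matches the early-accept branch in the listing above.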

Observed register class values in matchers:

| Class | Frequency | Likely meaning |
|---|---|---|
| 1023 | ubiquitous | Wildcard (any register class) |
| 1 | very common | 32-bit GPR (R0..R255) |
| 2 | common | 64-bit GPR pair |
| 3 | occasional | 128-bit GPR quad |
| 4 | occasional | Predicate or special register file |
| 5 | rare | Extended register class |

Representative Matcher Walkthroughs

sub_B30160 -- simple 2-source, 4-result pattern (68 lines, priority 9, template 12):

1. field 480 == 2481                    -> opcode/subclass check
2. source_count == 2                    -> expects 2 source operands
3. operand[0].type == 1 (immediate)     -> first source is a constant
4. operand[1].type == 2 (GPR)           -> second source is a register
5. operand[1].class == 1023 OR sec == 1 -> 32-bit GPR or unconstrained
6. result_count == 4                    -> expects 4 result operands
7. result[0].type == 2 (GPR)            -> first result is GPR
   result[0].class == 1023 OR sec == 1
8. result[1].type == 3 OR 15            -> predicate or uniform register
9. result[2].type == 2 (GPR)            -> third result is GPR
   result[2].class == 1023 OR sec == 1
10. if (*a4 <= 8) -> *a4 = 9, *a3 = 12

sub_B33F00 -- medium 2-source, 5-result pattern (4,166 bytes, priority 21, template 22):

1. field 7 == 21                            -> major opcode class
2. field 163 in {705, 706}                  -> addressing mode variant
3. field 203 in {1113..1117}                -> encoding format (5 values)
4. field 105 == 477                         -> operation variant
5. field 88 == 408                          -> sub-operation modifier
6. field 345 == 1903                        -> rounding/saturation mode
7. source_count == 2                        -> 2 sources
8. operand[0].type == 1 (immediate)         -> constant source
9. operand[1].type == 2 (GPR)              -> register source
   operand[1].class: primary wildcard or secondary in {1,2}
10. result_count == 5                       -> 5 results
11. result[0].type == 2 (GPR), class != 1023, secondary == 2 (64-bit)
12. result[1].type == 3 OR 15 (pred/uniform)
13. result[2].type == 2 (GPR), class: wildcard or secondary in {1,2}
14. result[3].type == 2 (GPR), class: wildcard or secondary in {1,2}
15. if (*a4 <= 20) -> *a4 = 21, *a3 = 22

sub_B44CA0 -- complex 0-source, 7-result pattern (6,214 bytes, priority 11, template varies):

1.  field 5 == 12                           -> opcode class 12
2.  field 220 == 1206                       -> encoding property
3.  field 595 in {2937, 2938}               -> extended field (high range)
4.  field 294 == 1493                       -> constraint
5.  field 242 in {1281, 1282}               -> width qualifier
6.  field 355 == 1943                       -> extended property
7.  field 376 == 2035                       -> extended property
8.  field 377 in {2037..2041}               -> extended property (5 values)
9.  field 429 in {2252, 2253}               -> extended qualifier
10. field 126 in {547, 548}                 -> data type
11. field 397 == 2115                       -> encoding validity stamp (post-ISel seal)
12. source_count == 0                       -> no source operands
13. result_count == 7                       -> 7 result operands
14. All 7 results checked: type == 10 (valid register), various class constraints
15. if (*a4 <= 10) -> *a4 = 11, *a3 = (template)

This pattern has the most field checks (11, steps 1--11 above) of the representative examples, validating properties deep into the extended field namespace (field 595). Its zero-source, seven-result shape suggests a hardware intrinsic or complex output instruction like a tensor-core operation.

sub_B28FE0 -- minimal matcher in the preamble zone (31 lines, priority 8, template 42):

1. field 211 == 1182
2. field 201 == 1109
3. field 348 in {1912, 1915}   -> precision qualifier
4. field 397 == 2115           -> encoding validity stamp (post-ISel seal)
5. source_count == 0           -> no sources
6. if (*a4 <= 7) -> *a4 = 8, *a3 = 42

The simplest matchers skip operand validation entirely and rely solely on opcode-property checks. These are for instructions with fixed operand formats where the operand shape is fully determined by the opcode.
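
As an illustration, the six steps listed for sub_B28FE0 translate into a chain of early exits followed by the conditional priority write. This is a reconstruction sketch, not the decompiled function: the flattened `Node` struct is hypothetical and replaces the field-reader calls with plain members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened node holding only the properties this matcher reads. */
typedef struct {
    uint32_t field_211, field_201, field_348, field_397;
    uint32_t source_count;
} Node;

/* Property-only matcher: no operand validation, priority 8, template 42. */
static void match_minimal(const Node *n, int32_t *template_id, int32_t *priority) {
    assert(n != NULL && template_id != NULL && priority != NULL);
    if (n->field_211 != 1182) return;
    if (n->field_201 != 1109) return;
    if (n->field_348 != 1912 && n->field_348 != 1915) return;  /* precision qualifier */
    if (n->field_397 != 2115) return;                          /* validity stamp */
    if (n->source_count != 0) return;
    if (*priority <= 7) { *priority = 8; *template_id = 42; }
}
```

A node that satisfies all five conditions upgrades a priority of 7 or lower to 8; if a better match (priority 8 or higher) is already recorded, the outputs are left untouched.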

Helper Function Summary

| Address | Name | Signature | Purpose |
|---|---|---|---|
| sub_10AE5C0 | DAGNode_ReadField | (ctx, node, field_id) -> value | Read instruction property by ID; returns 0xFFFFFFFF if absent |
| sub_10AE590 | DAGNode_WriteField | (ctx, node, field_id, value) | Write single instruction property |
| sub_10AE640 | DAGNode_WriteFields | (ctx, node, f1, v1, v2) | Multi-field atomic update |
| sub_B28F30 | GetOperand | (node, index) -> operand_ptr | Index into operand array (32-byte records at *(node+32)) |
| sub_B28F40 | GetResultCount | (node) -> count | Number of result operands: node[40] + 1 - node[92] |
| sub_B28F50 | GetSourceCount | (node) -> count | Number of source operands: *(node+92) |
| sub_B28E00 | DecodeRegClass | (packed) -> class_id | Identity function (class stored as plain int) |
| sub_B28F00 | DecodeRegClass2 | (packed) -> class_id | Second identity accessor (same semantics) |
| sub_B28F10 | SetModifierA | (operand, value) | Write operand modifier at offset +14 |
| sub_B28F20 | SetModifierB | (operand, value) | Write operand modifier at offset +15 |
| sub_B28E10 | isGPR | (tag) -> bool | tag == 2 |
| sub_B28E20 | isImmediate | (tag) -> bool | tag == 1 |
| sub_B28E30 | isSpecialReg | (tag) -> bool | tag == 6 |
| sub_B28E40 | isValidReg | (tag) -> bool | tag == 10 |
| sub_B28E50 | isType9 | (tag) -> bool | tag == 9 |
| sub_B28E60 | isType5 | (tag) -> bool | tag == 5 |
| sub_B28E70 | isType4 | (tag) -> bool | tag == 4 |
| sub_B28E80 | isPredicate | (tag) -> bool | tag == 3 |
| sub_B28E90 | isUniformReg | (tag) -> bool | tag == 15 |
| sub_B28EA0 | isType13 | (tag) -> bool | tag == 13 |
| sub_B28EB0 | isType14 | (tag) -> bool | tag == 14 |
| sub_B28EC0 | isType16 | (tag) -> bool | tag == 16 |
| sub_B28ED0 | isType7 | (tag) -> bool | tag == 7 |
| sub_B28EE0 | isType11 | (tag) -> bool | tag == 11 |
| sub_B28EF0 | isType8 | (tag) -> bool | tag == 8 |

SM120 Pattern Coordinator -- sub_13AF3D0 (137 KB)

One of the largest ISel functions in the binary (137 KB, 4,225 lines, 570+ locals), though smaller than the 185 KB mega-selector sub_C0EB10. It is an architecture-specific operand-emission coordinator that runs in Phase 2 as a parallel backend to the mega-selector. The two do not call each other -- they are mutually exclusive implementations of the same ISel protocol, selected per-SM by the vtable in the ISel driver. The mega-selector covers opcodes 7--221 for the default backend; the coordinator covers opcodes 2--352 for the SM120 (consumer RTX 50xx / enterprise Pro) backend.

Position in the ISel Pipeline

sub_B285D0 (ISel driver, 9 KB)
  -> selects builder variant by SM version
     -> Builder variant vtable dispatch
        |
        +-- DEFAULT BACKEND: sub_C0EB10 (mega-selector, 185 KB)
        |     opcodes 7..221, dual-switch, word_22B4B60 encoding table
        |
        +-- SM120 BACKEND: sub_A29220 (instruction iterator, 435 lines)
              -> sub_13AF3D0 (pattern coordinator, 137 KB)
                   opcodes 2..352, single switch, inline operand emission

The coordinator is called once per instruction by sub_A29220, which walks the instruction list. Before entering the main switch, the coordinator performs a predication pre-test: if bit 0x1000 is set in the opcode word and the opcode is not 169, it queries vtable[3232/8] and optionally emits the last source operand via sub_13A6AE0.

Dispatch Structure

The coordinator reads the opcode from *(instr+72) with the standard BYTE1 & 0xCF mask (identical to Phase 1's MercConverter) and enters a single 130-case switch. Unlike the mega-selector's dual-switch encoding-slot translation, the coordinator emits operands inline -- each case directly calls sub_13A6280 (the operand emitter) with explicit operand indices.
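
The byte-level mask and the DWORD-level mask used by the ISel driver (`& 0xFFFFCFFF`) clear the same two bits, 12 and 13. A short sketch that makes the equivalence explicit (the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* BYTE1(word) &= 0xCF: clear bits 4 and 5 of the second-lowest byte. */
static uint32_t mask_byte1(uint32_t word) {
    uint32_t byte1 = (word >> 8) & 0xFFu;
    byte1 &= 0xCFu;
    return (word & 0xFFFF00FFu) | (byte1 << 8);
}

/* The same operation written as a single 32-bit mask. */
static uint32_t mask_dword(uint32_t word) {
    return word & 0xFFFFCFFFu;
}

/* Bit 0x1000 must be tested on the raw word, before masking removes it. */
static int has_bit_0x1000(uint32_t raw_word) {
    return (raw_word & 0x1000u) != 0;
}
```

For a raw opcode word of 0x10A9, both masks return 0xA9 (169) while `has_bit_0x1000` reports the flag, which is why the pre-dispatch test in the coordinator reads the unmasked word.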

// sub_13AF3D0 -- simplified dispatch skeleton
char PatternCoordinator(context *a1, instruction *a2, output *a3,
                        pattern_table *a4, flags a5, int a6) {
    int opcode = *(DWORD*)(a2 + 72);
    BYTE1(opcode) &= 0xCF;

    // Pre-dispatch: predication check when bit 0x1000 is set
    if ((*(a2+72) & 0x1000) && opcode != 169) {
        if (vtable[3232/8] == sub_868720 || vtable[3232/8]())
            EmitLastSource(a1[1], a2, operand_count - 2*flag, a3);
    }

    // Setup output context
    vtable[104/8](output_ctx, a1, &context_ref);

    switch (opcode) {
    case 2: case 4: case 7:     // FMA/MAD 2-source
        operand_span = 16; src_count = 2;
        goto SHARED_FMA_HANDLER;
    case 3: case 5:             // FMA/MAD 3-source
        operand_span = 24; src_count = 3;
        goto SHARED_FMA_HANDLER;
    case 6:                     // IMAD/IADD3 with 3+ sources
        EmitOperand(ctx, instr, 3, out);
        EmitOperand(ctx, instr, 4, out);
        EmitOperand(ctx, instr, 5, out);
        break;
    case 8:                     // Pure vtable dispatch (vtable+2328)
        vtable[2328/8](a1, a2, operand_count, a3, a5, 0);
        break;
    case 10: case 11: case 151: case 152: case 290: case 291:
        vtable[2768/8](a1, a2, a3, a4, a5);   // Memory load/store
        break;
    case 16:                    // Texture/surface (163-line handler)
        for (i = first_src; i < last_src; i++)
            EmitOperand(ctx, instr, i, out);   // loop up to 15 operands
        break;
    // ... remaining case labels (130 in total) ...
    case 329:                   // Variable-count loop + vtable+2328
        for (i = 0; i < src_count; i++)
            EmitOperand(ctx, instr, i, out);
        vtable[2328/8](a1, a2, remaining, a3, a5, 0, 0, 0);
        break;
    default:
        break;                  // no-op passthrough
    }
}

Opcode Case Routing

The 130 distinct case labels (spanning 82 distinct handler blocks) cover the full SASS opcode range including SM100+/SM120 extensions:

| Opcodes | Handler pattern | Instruction family |
|---|---|---|
| 2, 3, 4, 5, 7 | Shared FMA handler with operand-span parametrization | FMA/MAD variants (32/64-bit) |
| 6 | Inline 3-source emission + optional operands 6/7 | IMAD/IADD3 wide |
| 8 | Pure vtable+2328 delegation | Builder-only instructions |
| 10, 11, 151, 152, 290, 291 | vtable+2768 delegation | Memory load/store |
| 16 | 163-line operand loop (up to 15 sources) | Texture/surface |
| 20, 21 | vtable+2680/2688 with stub check | Memory/store alternates |
| 22, 77, 83, 297, 352 | vtable+2744 with nullsub_463 check | Control flow |
| 24, 34, 209, 213, 214 | Passthrough: emit src 1 + dst 2 | Simple 2-operand ALU |
| 29, 95, 96, 190 | Conditional operand-6 check | Predicate-source instructions |
| 38, 59, 106, 180, 182, 192, 194, 215, 221 | Single EmitOperand(1) at high SM | Generic ALU |
| 42, 53, 55 | EmitOperand(1) | Paired ALU |
| 60, 61, 62, 63, 64 | Comparison / inner sub-opcode switch (case 61: 5 sub-cases) | Compare / set-predicate |
| 88, 89 | Variable source count (2 or 3) with sign-dependent offsets | Extended FMA |
| 110, 111, 114, 115, 117 | Warp operand emission | Warp shuffle / vote |
| 120, 121, 126, 127 | Barrier handler with operand loop at LABEL_53 | Barrier / sync |
| 139, 140, 141, 143 | sub_13A4DA0 for commutative operand selection | Commutative ALU |
| 183 | Extended memory with register-class-6 check | Wide memory |
| 201, 202, 204 | vtable+2328 delegation | Async / bulk operations |
| 270, 279, 282, 285, 325--328 | Goto LABEL_53 (barrier/sync shared handler) | Extended memory / warp |
| 280, 281 | vtable+2896 with nullsub_239 check, then LABEL_53 | Sync instructions |
| 329 | Variable-count operand loop + vtable+2328 | Variable-width encoding |

Three Competing-Match Selection Mechanisms

The coordinator selects among competing pattern matchers through three mechanisms:

1. LABEL_750 -- vtable alternate-match dispatch. Six opcode paths (cases 6, 36, 130, 137, plus opcodes reaching LABEL_119 when sub_7D6850 confirms a double-precision operand) jump to LABEL_750:

LABEL_750:
    replacement = vtable[16/8](output_ctx, instruction);
    *output = replacement;
    return;

This is the "try architecture-specific alternate" escape hatch. The vtable slot at offset +16 on the ISel context object points to an SM-specific matcher. If it succeeds, the coordinator's inline emission is entirely bypassed and the replacement instruction is written to the output.

2. sub_13A4DA0 -- commutative operand position selector. Called 12 times for commutative instructions (FMA, IADD3, comparison) where source operands can be swapped for better encoding. The function holds up to 4 pattern entries at offsets +12/+16 through +36/+40, each a (lo_word, hi_word_mask) pair. It tests operand properties via sub_13A48E0 against each entry; the first match returns a preferred operand index. The coordinator then calls sub_13A6280 with the returned index instead of the default.

// sub_13A4DA0 -- simplified
int SelectOperandSlot(pattern_table, instruction, default_slot, alt_slot, out_match) {
    int i;
    if (!pattern_table->active) return default_slot;
    // First pass: test the operand in default_slot against each entry
    uint64_t operand_desc = GetOperandDescriptor(instruction, default_slot);
    for (i = 0; i < pattern_table->count; i++) {  // up to 4 entries
        if (operand_desc matches pattern_table->entry[i])
            { *out_match = pattern_table->entry[i].preferred; return default_slot; }
    }
    // Second pass: repeat with alt_slot if nothing matched on default_slot
    operand_desc = GetOperandDescriptor(instruction, alt_slot);
    for (i = 0; i < pattern_table->count; i++) {
        if (operand_desc matches pattern_table->entry[i])
            { *out_match = pattern_table->entry[i].preferred; return alt_slot; }
    }
    return default_slot;  // no entry matched either slot
}

3. Inline vtable override checks. Many cases test whether a vtable function pointer equals a known null-stub before calling it. The stub addresses serve as sentinel values -- when the vtable slot has been overridden by an SM-specific implementation, the coordinator calls the override:

| Vtable offset | Default stub | Purpose |
|---|---|---|
| +2680 | sub_A8CBE0 | Memory operation alternate matcher |
| +2688 | sub_A8CBF0 | Store operation alternate matcher |
| +2744 | nullsub_463 | Control flow alternate |
| +2632 | nullsub_233 | Move/convert alternate |
| +2760 | nullsub_235 | Atomic/barrier alternate |
| +2896 | nullsub_239 | Sync instruction alternate |
| +3232 | sub_868720 | Pre-dispatch predication alternate |
| +3112 | sub_A8CCA0 | MADC alternate (case 36) |

When the vtable slot holds the stub, the coordinator skips the call and proceeds with its inline emission logic.
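The stub-as-sentinel check can be sketched in C as a pointer comparison before the call. This is only an illustration under stated assumptions: `AltMatcher`, `default_stub`, `sm_override`, and `try_alternate` are hypothetical names, and the real slots live at the vtable offsets listed above rather than being passed as an argument.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not the real ptxas layouts. */
typedef struct Instr Instr;
typedef int (*AltMatcher)(void *ctx, const Instr *instr);

/* Stand-in for a default stub such as nullsub_239: does nothing. */
static int default_stub(void *ctx, const Instr *instr) {
    (void)ctx; (void)instr;
    return 0;
}

/* Stand-in for an SM-specific override installed in the same slot. */
static int sm_override(void *ctx, const Instr *instr) {
    (void)ctx; (void)instr;
    return 1;
}

/* Returns 1 if an override handled the instruction,
 * 0 if the caller should continue with its inline emission. */
static int try_alternate(AltMatcher slot, void *ctx, const Instr *instr) {
    if (slot == default_stub)   /* slot still holds the sentinel stub */
        return 0;               /* skip the call entirely */
    return slot(ctx, instr);    /* slot was overridden: call it */
}
```

Comparing against the stub address avoids an indirect call on the common path where no SM-specific matcher is installed.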

Primary Callee: sub_13A6280 (239 lines)

The operand emitter, called 83 times. It reads the operand at instruction[operand_index + 10] (each operand is 8 bytes starting at instruction + 84), checks the type tag at bits [31:28], and emits:

  • Tag 1 (register): fast-path returns if register class == 6 (UB/dead register). Otherwise reads the register descriptor from *(context+88)[reg_index], checks register class at descriptor offset +64.
  • Tags 2/3 (constant/immediate): calls sub_7DBC80 to validate constant-bank availability, then sub_A9A290 for type-5 immediate expansion. Delegates to vtable methods at *(*(context+1584) + 1504) and *(*(context+1584) + 3248).
  • Other types: pass through to the vtable dispatch chain.

The third parameter (operand index) ranges from 0 to 7 across the coordinator's call sites, with 0/1/2/3 being the most common (corresponding to the first 4 source operands in the Ori IR instruction layout).

Function Map Additions

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x operand emission) | HIGH |
| sub_A29220 | 435 lines | Instruction iterator / coordinator caller (per-instruction walk) | HIGH |
| sub_13A6280 | 239 lines | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
| sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
| sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
| sub_13A4DA0 | 180 lines | Commutative operand position selector (4-entry pattern table) | HIGH |
| sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
| sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
| sub_13A45E0 | -- | Extra operand emitter (operands 6/7 for wide instructions) | MEDIUM |
| sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
| sub_13A75D0 | -- | Register class 6 (UB) operand substitution handler | MEDIUM |
| sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |

Architecture Dispatch Tables -- 5 Thunks into Shared Handler at loc_1C38C00

Five 13-byte thunks each set an SM-family selector in esi and tail-jump to a single shared handler:

| Thunk | Address | esi value | SM family (inferred) |
|---|---|---|---|
| sub_B12920 | 0xB12920 | 0x4000 | Turing / Ampere (sm_75--86) |
| sub_B12910 | 0xB12910 | 0x5000 | Ada (sm_89) |
| sub_B12900 | 0xB12900 | 0x5001 | Hopper (sm_90) |
| sub_B128F0 | 0xB128F0 | 0x5003 | Blackwell datacenter (sm_100/103) |
| sub_B128E0 | 0xB128E0 | 0x5004 | Blackwell consumer (sm_120/121) |

Each thunk is identical in structure (mov esi, <imm32>; mov rdi, rdx; jmp loc_1C38C00). The shared handler contains a switch on *(a3+12) (instruction type field, 36+ top-level cases) with secondary switches on *(a3+14) (type sub-field) in 15 of those cases. Return value is a SASS encoding slot index; 772 is the universal "invalid / unhandled" sentinel.
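A minimal C sketch of the thunk-plus-shared-handler shape follows. It is an illustration, not the recovered code: the `InstrDesc` layout (16-bit fields at offsets 12 and 14), the function names, and the decision to ignore the family code are all assumptions. Only the single-slot cases 2, 3, 5, and 0x11 from the esi=0x4000 table below are filled in.

```c
#include <stdint.h>

enum { SLOT_INVALID = 772 };   /* universal "invalid / unhandled" sentinel */

/* Hypothetical descriptor: type at byte offset 12, sub-field at 14.
 * The 16-bit field widths are an assumption made for this sketch. */
typedef struct {
    uint8_t  pad[12];
    uint16_t type;
    uint16_t subtype;
} InstrDesc;

/* Shared handler: switch on the type field, return a slot index.
 * The real handler also refines the result by family; this sketch does not. */
static int shared_handler(const InstrDesc *d, uint32_t family) {
    (void)family;
    switch (d->type) {
    case 0x02: return 636;
    case 0x03: return 22;
    case 0x05: return 27;
    case 0x11: return 65;
    default:   return SLOT_INVALID;
    }
}

/* Each thunk only fixes the family selector and forwards the descriptor. */
static int thunk_turing_ampere(const InstrDesc *d)    { return shared_handler(d, 0x4000); }
static int thunk_blackwell_consumer(const InstrDesc *d) { return shared_handler(d, 0x5004); }
```

Keeping the per-family difference in one integer argument lets all five entry points share a single 15 KB body instead of five near-copies.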

Top-level case return values (from sub_B12920 / esi=0x4000, the only variant where the decompiler resolved concrete return values instead of JUMPOUTs):

| Case | Sub-switch? | Slot index / range | Notes |
|---|---|---|---|
| 0 | yes, *(a3+14) | 197, 526, 691, 697, 772 | 5 sub-variants by data width |
| 1 | yes | 21, 647, 772 | Narrow (<=2) vs wide (3--4) vs invalid |
| 2 | -- | 636 | Single slot |
| 3 | -- | 22 | Single slot |
| 4 | yes | 25, 26, 179, 180, 772 | Even/odd sub-field pairs |
| 5 | -- | 27 | Single slot |
| 6 | yes | 28, 646, 772 | Same narrow/wide pattern as case 1 |
| 7 | yes | 29, 30, 181, 182, 772 | 4-way by sub-field |
| 8--0xA | -- | 31, 32, 33 | Three consecutive single-slot cases |
| 0xB | yes, 15 sub-cases | 526, 527, 549, 571, 572, 697--699, 704, 764 | Texture/surface types |
| 0xC | yes, 56 sub-cases | 52, 72, 73, 82, 103, 132, 140, 557, 666, 668, 683, 684, 752, 758 | Memory load/store variants |
| 0xD | yes, 52 sub-cases | 55, 79, 80, 84, 85, 106, 107, 558, 667, 669 | Conversion operations |
| 0xE | yes, 37 sub-cases | 34, 42, 638, 639, 643, 744, 746, 748 | Predicate/comparison ops |
| 0xF | yes, 41 sub-cases | 60, 86, 90, 91, 121, 148, 149, 166, 568, 672, 685 | FP math variants |
| 0x10 | yes, 67 sub-cases | 61, 98, 124, 175, 176, 556, 676, 688, 689, 703 | Integer math variants |
| 0x11 | -- | 65 | Single slot (MOV/CVT) |
| 0x12 | yes, 249 sub-cases | 66--71, 94, 95, 104, 125--129, 662--665, 673, 678--681 | Largest: ALU permutations |
| 0x13 | -- | 685 | Single slot |
| 0x14 | -- | 75 | Single slot |
| 0x15 | yes | 87, 670, 772 | Narrow/wide pattern |
| 0x16--0x17 | yes, identical | 88, 120, 671, 753, 760 | Duplicated: shift variants |
| 0x18 | -- | 8 | Single slot |
| 0x19 | yes, 17 sub-cases | 96, 586, 587, 674, 707, 708 | Shared memory ops |
| 0x1A | yes, 16 sub-cases | 97, 123, 675, 762 | Extended shared memory |
| 0x1B | yes, 113 sub-cases | 595, 612, 613, 615, 619--621, 715, 718, 719, 721, 723 | Tensor/MMA (sparse sub-field) |
| 0x1C | yes | 130, 682, 772 | Narrow/wide pattern |
| 0x1D | yes, 14 sub-cases | 140, 683, 684 | Memory fence variants |
| 0x1E--0x1F | -- | 143, 144 | Two single-slot cases |
| 0x20 | yes, 40 sub-cases | 147--149, 685 | FP extended variants |
| 0x21 | -- | 709 | Single slot |
| 0x22 | yes, 46 sub-cases | 184--187, 190--192, 690 | Warp intrinsic variants |
| 0x23 | yes, 58 sub-cases | 5, 188, 190, 637, 640, 745, 747, 749, 750 | Async/bulk ops |

Total unique encoding slots referenced across all cases: ~120 distinct values out of a 772-entry namespace. The four 0x50xx thunks (sub_B128E0 through sub_B12910) share all JUMPOUT targets, confirming that the shared handler selects per-SM slot variants internally based on the esi family code rather than at the thunk level.

Opcode Variant Selectors

Two specialized variant selectors handle the final opcode-to-encoding mapping for specific instruction families:

sub_B0BE00 (19 KB) -- opcode class 194:

Full switch on a2 with 255 sequential cases plus default. The function unpacks ctx = *(a1+8), inst = *(a1+16), then dispatches:

default  -> sub_10AE590(ctx, inst, 194,  826)
case   1 -> sub_10AE590(ctx, inst, 194,  827)
case   2 -> sub_10AE590(ctx, inst, 194,  828)
  ...        (strictly linear: slot = 826 + case_value)
case 254 -> sub_10AE590(ctx, inst, 194, 1080)
case 255 -> sub_10AE590(ctx, inst, 194, 1081)

Closed-form mapping: slot_index = 826 + a2 for a2 in [1, 255]; slot 826 is the fallback for any value outside that range (including 0 and values > 255). The 256 encoding slots (826--1081) are contiguous within the ISel dispatch table at 0x22AD9D0, occupying pointer entries [826] through [1081] of the 1884-entry isel_dispatch_tables array. This is the largest single opcode family in the table -- opcode class 194 alone accounts for 13.6% of all ISel dispatch entries.

The perfect linearity (no gaps, no reordering, no secondary switch) distinguishes this selector from sub_B0AA70 (class 306), which uses sparse non-sequential case indices. The linear mapping case K -> slot 826+K means the switch is logically a bounds-checked offset addition that the compiler emitted as a jump table.
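The closed form can be written directly as a bounds-checked addition. The sketch below is a restatement of the mapping described above, not the recovered function; `class194_slot` and the constant names are hypothetical.

```c
#include <stdint.h>

enum {
    CLASS_194_BASE_SLOT = 826,   /* fallback slot, also the base of the range */
    CLASS_194_MAX_CASE  = 255    /* highest explicit case in the switch */
};

/* Equivalent of the 255-case switch: slot = 826 + a2 for a2 in [1, 255],
 * and slot 826 for every other value (including 0 and values above 255). */
static uint32_t class194_slot(uint32_t a2) {
    if (a2 >= 1 && a2 <= CLASS_194_MAX_CASE)
        return CLASS_194_BASE_SLOT + a2;
    return CLASS_194_BASE_SLOT;
}
```

For example, `class194_slot(254)` yields 1080 and `class194_slot(256)` falls back to 826, matching the first and last rows of the listing above.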

sub_B0AA70 (5 KB) -- opcode class 306:

  • Same pattern but with opcode class 306
  • Variants numbered 1680--1726 with non-sequential case indices (2, 3, 8, 9, 14, 15, 20, 21, 26, 27, 30, 31, 36, 37, 40, 41, ...)
  • The pairs advance at stride 6 through 26/27, then alternate between strides of 4 and 6 (30/31, 36/37, 40/41), which suggests type-width combinations (e.g., F32/pair, F64/pair, S32/pair, ...)

Instruction Modifier Dispatchers

Two modifier-application functions run after the main ISel selection to set type modifiers, rounding modes, and register width:

sub_B13E10 (5,792 B) -- basic modifier dispatcher (21 calls to sub_10AE640):

| Stage | Bitfield extract | Field ID | Switch mapping |
|---|---|---|---|
| Register width | (a7 >> 3) & 0x3 | 19 | 0/1 -> 71, 2 -> 72, 3 -> 73 (type 13: <=1 -> 71, else 72/73) |
| Modifier type | BYTE1(a7) & 0x1F | 8 | 1 -> 31, 2 -> 33, 3 -> 34, 4 -> 32, 5 -> 35; 6--13 -> skip |
| Saturation | BYTE4(a7) >> 3 | 16 / 3 | 1 -> 62 (type 18: fld 16; else fld 3 val 13), 2 -> 63 (type 18: fld 16; else fld 3 val 14), 3 -> 65, 4 -> 64 |
| Rounding (hi) | (HIDWORD(a7) >> 14) & 0x1F | 3 | 1 -> 13, 2 -> 14 |
| Saturate flag | a8 & 0x10 | 20 | set -> 76, clear -> 75 |
| Negate flag | a8 & 0x20 | 21 | set -> 79, clear -> 78 |
| Abs flag | (HIBYTE(a7) >> 1) & 0xF | 22 | 2 -> 82, else -> 81 |
| Complement flag | a8 & 0xF | 23 | 2 -> 85, else -> 84 |
| Carry enable | a8 >> 6 | 18 | 1 -> 69 |
| Widen flag | HIBYTE(a8) & 3 | 17 | 1 -> 67 |
| Integer type | (HIBYTE(a8) >> 2) & 7 | 4 | 1 -> 16, 2 -> 17 |
| Cache hint A | (WORD1(a7) >> 7) & 0xF | 15 | 1 -> 55, 2 -> 57, 3 -> 59, 4 -> 60 |
| Cache hint B | BYTE4(a7) & 7 | 15 | 1 -> 56, 2 -> 58 |
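Two of these stages can be sketched as plain C bitfield decoders to show how a row of the table reads. This is a hedged illustration: the function names are hypothetical, `a7` is treated as a 64-bit packed word, the type-13 special case is left out, and returning 0 for "no field write" (including modifier values not listed in the table) is an assumption of the sketch.

```c
#include <stdint.h>

/* BYTE1(x) in Hex-Rays notation is the second-lowest byte of x. */
static uint32_t byte1(uint64_t x) { return (uint32_t)((x >> 8) & 0xFF); }

/* Register width stage (field ID 19): 0/1 -> 71, 2 -> 72, 3 -> 73. */
static int reg_width_value(uint64_t a7) {
    uint32_t w = (uint32_t)((a7 >> 3) & 0x3);
    if (w <= 1) return 71;
    return (w == 2) ? 72 : 73;
}

/* Modifier type stage (field ID 8): 1 -> 31, 2 -> 33, 3 -> 34, 4 -> 32,
 * 5 -> 35. Anything else returns 0, meaning "skip the field write". */
static int modifier_type_value(uint64_t a7) {
    static const int map[6] = { 0, 31, 33, 34, 32, 35 };
    uint32_t m = byte1(a7) & 0x1F;
    return (m >= 1 && m <= 5) ? map[m] : 0;
}
```

Note that the modifier mapping is not monotonic (4 maps to 32, between the values for 1 and 2), which is why a lookup table fits better than arithmetic here.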

sub_B157E0 (11,815 B) -- extended modifier dispatcher (37 calls to sub_10AE640):

Texture/surface fast path (opcode type == 18, runs before anything else):

| BYTE5(a7) & 0x3F | Field ID 15 value |
|---|---|
| 0 (default) | 54 |
| 1 | 55 |
| 2 | 56 |
| 3 | 57 |
| 4 | 58 |
| 5 | 59 |
| 6 | 60 |

Extended stages for types 12/13/15 (load/store/atomic); for type 18, the texture/surface dimension stage replaces the geometry stage:

| Stage | Bitfield extract | Field ID | Switch mapping |
|---|---|---|---|
| LD/ST qualifier | BYTE1(a7) >> 5 | 26 | 1 -> 99, 2 -> 100, 3 -> 101 |
| Address mode | BYTE2(a7) & 7 | 29 | 1 -> 116, 2 -> 117 |
| Memory ordering | BYTE4(a7) >> 3 (non-tex) | 24 | 1 -> 87, 2 -> 88 |
| Reg size | (v15 + 26) & 0x1F | 25 | 0--7 -> 90--97 |
| Type size | (v51 + 26) & 0x1F | 25 | 0--7 -> 90--97 (same table) |
| Geometry (non-tex) | (v17 + 13) & 0xF | 27 | 0 -> 107, 1 -> 106, 2 -> 105, 3 -> 104, 4 -> 103 |
| Tex/surf dim (type 18) | v17 | 28 | 0 -> 108, 1 -> 109, 2 -> 110, 3 -> 111, 4 -> 112, 5 -> 113, 6 -> 114 |

After these type-specific stages, sub_B157E0 runs the same tail as sub_B13E10 (register width through cache hint B) with matching field IDs and values.

Mercury Master Encoder -- sub_6D9690 (94 KB)

The Mercury master encoder is one of the largest backend functions and the final instruction selection point before binary emission. It contains a massive switch on the instruction type field (read from instruction+8) covering all SASS instruction formats. While its primary role is encoding (documented in Mercury Encoder Pipeline and SASS Instruction Encoding), the switch itself performs the final opcode-to-encoding-format selection:

// Simplified encoding flow (operand and word stand for values built per case)
void EncodeInstruction(ctx, instruction) {
    int type = *(int*)(instruction + 8);    // instruction type field
    uint64_t base = 0x2000000000LL;         // encoding base constant

    switch (type) {
    case 61:    // FFMA with literal operand
        sub_6D9580(ctx, operand);           // encode literal
        break;
    case 455:   // complex multi-operand format
        // bit-field extraction and assembly
        break;
    // ... hundreds of cases ...
    }

    // Common tail: append operand words, commit
    sub_6D2750(ctx, word);                  // append 8-byte operand word
    sub_6D28C0(ctx);                        // commit instruction record
}

Key encoding dispatch details:

  • Operand word type prefix in bits [31:28]: 0x1 = register, 0x5 = immediate/constant, 0x6 = control/modifier, 0x7 = literal, 0x9 = special
  • sub_7D6860 handles data type encoding (FP32/FP64/INT)
  • sub_C00BF0 provides opcode lookup from the encoding tables
  • Architecture-specific bits accumulated via SM 100+ extensions controlled by knob 4176
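The operand word prefix listed above can be decoded with a shift and a mask. The following C sketch only restates that list; the enum and function names are hypothetical, and treating every other prefix value as unknown is an assumption, since the source lists only these five.

```c
#include <stdint.h>

/* Prefix values from the list above (bits [31:28] of the operand word). */
typedef enum {
    OPK_REGISTER  = 0x1,
    OPK_IMM_CONST = 0x5,   /* immediate / constant */
    OPK_CONTROL   = 0x6,   /* control / modifier */
    OPK_LITERAL   = 0x7,
    OPK_SPECIAL   = 0x9
} OperandKind;

/* Extract the 4-bit type prefix from bits [31:28]. */
static unsigned operand_prefix(uint64_t word) {
    return (unsigned)((word >> 28) & 0xF);
}

/* Returns 1 if the prefix is one of the five documented kinds. */
static int operand_prefix_is_known(uint64_t word) {
    switch (operand_prefix(word)) {
    case OPK_REGISTER:
    case OPK_IMM_CONST:
    case OPK_CONTROL:
    case OPK_LITERAL:
    case OPK_SPECIAL:
        return 1;
    default:
        return 0;
    }
}
```

For instance, a word of `0x10000005` has prefix 0x1 and classifies as a register operand, while `0x20000000` has no documented meaning in this list.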

MercExpand -- Pseudo-Instruction Expansion

sub_C3CC60 (26 KB) runs as phase 118 (MercExpandInstructions) and expands Mercury pseudo-instructions into concrete SASS sequences. This is the third and final instruction selection point -- where abstract instruction forms that survived through ISel and Mercury encoding are replaced by their concrete multi-instruction implementations.

| Handler | Size | Instruction class |
|---|---|---|
| sub_C37A10 | 16 KB | General instruction expansion (jump table, 4+ cases) |
| def_C37B2E | 13 KB | Complex expansion cases (default handler, string "EXPANDING") |
| sub_C39B40 | 10 KB | Memory operations (LDG, STG, LDS, etc.) |
| sub_C3A460 | 6 KB | Atomic operations |
| sub_C3B560 | 8 KB | Texture operations |
| sub_C3BCD0 | 19 KB | Control flow (branches, jumps, calls) |
| sub_C3E030 | 18 KB | Finalization and cleanup |

The expansion creates new instruction nodes, links them into the doubly-linked list, and deletes the original pseudo-instruction. After all expansions, sub_C3E030 performs post-expansion verification. The expansion engine also uses sub_719D00 (50 KB), which builds output for expanded instructions across different operand widths (32/64/128-bit, predicate) -- four near-identical code blocks corresponding to template instantiations over operand width types.
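The list manipulation described here (create nodes, link them in, remove the pseudo-instruction) is a standard doubly-linked splice. The C sketch below shows one way to do it; `Node` and `replace_with_sequence` are hypothetical and say nothing about the real instruction node layout.

```c
#include <stddef.h>

/* Hypothetical instruction node; the real layout is not modeled. */
typedef struct Node {
    struct Node *prev;
    struct Node *next;
    int opcode;
} Node;

/* Splice the already-linked chain first..last into the list in place of
 * pseudo. The pseudo node is unlinked; the caller is responsible for
 * freeing it. *head is updated if pseudo was the first node. */
static void replace_with_sequence(Node **head, Node *pseudo,
                                  Node *first, Node *last) {
    first->prev = pseudo->prev;
    last->next  = pseudo->next;
    if (pseudo->prev)
        pseudo->prev->next = first;
    else
        *head = first;
    if (pseudo->next)
        pseudo->next->prev = last;
    pseudo->prev = NULL;
    pseudo->next = NULL;
}
```

Because only the neighbours of the pseudo node are touched, the splice is constant-time regardless of how many instructions the expansion produces.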

OCG Encoding Template Lookup -- sub_C3F490

The OCG (Optimized Code Generation) intrinsic pipeline on SM100+ does not use the ISel mega-selector or DAG pattern matchers. Instead, the OCG router (sub_6CC690, documented in Intrinsics) assigns each instruction one of 7 internal routing values and passes it to the SASS instruction emitter sub_6CB8A0. These routing values are not Ori IR opcodes, not binary SASS opcodes, and not encoding slot indices from word_22B4B60. They are a small, closed set of keys that exist solely to select an operand gathering template inside sub_C3F490.

Routing values assigned by the OCG router

| Value | Hex | Instruction class | Assigned when |
|---|---|---|---|
| 70 | 0x46 | Memory-ordered load/store/atomic (with barrier) | Barrier register present (v108 != 0 in conditional paths) |
| 243 | 0xF3 | Default memory operation | Fallback for general memory ops without barrier or special fence |
| 245 | 0xF5 | Load variant (LD/LDG/LDS) | Load-type operations (from OCG load/store handler) |
| 246 | 0xF6 | Reduction/atomic default | Atomic operations and reductions |
| 247 | 0xF7 | Fenced memory operation (LDGSTS) | Operations requiring memory fence semantics |
| 257 | 0x101 | Async copy without memory order | Bulk copy ops when no barrier: v108 == 0 selects 257, else 70 |
| 261 | 0x105 | Atomic with pre-existing value read | Atomic exchange / compare-and-swap returning old value |

How sub_C3F490 maps routing values to encoding templates

sub_C3F490 is a pure lookup function (184 bytes) that takes a routing value plus 7 boolean modifier flags and returns a pointer to an operand gathering template in .data at 0x22B8960--0x22BB460. The function is a nested decision tree: the first level selects on the routing value, then inner branches refine the template based on the modifier flags.

sub_C3F490(routing_value, a2..a8) -> template_ptr
    a2: has pre-existing-value operand (used only by value 257)
    a3: SM generation > sm_7x (SM80+)
    a4: has predicate attachment
    a5: has scope/fence operand (SM generation > sm_8x && memory_order == 4)
    a6: (always 0 from OCG emitter, used by MercExpand callers)
    a7: (always 0 from OCG emitter, used by MercExpand callers)
    a8: (always 0 from OCG emitter, used by MercExpand callers)

The OCG emitter (sub_6CB8A0) always passes a6=a7=a8=0, which means the OCG path only reaches a subset of template leaves. The MercExpand callers (sub_C41100, sub_C40420, sub_C40B90, sub_C42330) pass all 7 flags and can reach the full template space. The returned template is a packed array: template[0] is the operand count, followed by operand slot indices that reference positions in the 39-QWORD operand buffer (v134[]). The emitter iterates over these indices, gathers the tagged operand words, builds control words from bitfields, and calls sub_9314F0 to commit the encoded instruction.
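The template-driven gather step can be sketched as a short loop. This is an assumption-heavy illustration: the source gives the count-then-indices shape and the 39-QWORD buffer, but the one-byte element width, the bounds handling, and the name `gather_operands` are choices made only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum { OPERAND_BUFFER_QWORDS = 39 };   /* size of the operand buffer */

/* tmpl[0] is the operand count; tmpl[1..count] are slot indices into buf.
 * Copies the selected operand words into out and returns how many were
 * written. Out-of-range indices are skipped in this sketch. */
static size_t gather_operands(const uint8_t *tmpl,
                              const uint64_t buf[OPERAND_BUFFER_QWORDS],
                              uint64_t *out, size_t out_cap) {
    size_t count = tmpl[0];
    size_t n = 0;
    for (size_t i = 0; i < count && n < out_cap; i++) {
        uint8_t slot = tmpl[1 + i];
        if (slot >= OPERAND_BUFFER_QWORDS)
            continue;
        out[n++] = buf[slot];
    }
    return n;
}
```

Keeping the operand order in data rather than code is what lets one emitter serve every routing value: only the template pointer changes.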

Two additional routing values (254, 262) are handled by sub_C3F490 but are never assigned by the OCG router -- they originate exclusively from the MercExpand memory instruction handlers, where the routing value is read from the instruction's opcode field (instr[18] & 0xCFFF).

| Value | Hex | Origin | Instruction class |
|---|---|---|---|
| 254 | 0xFE | MercExpand only | Extended memory format (operand gather mode 3) |
| 262 | 0x106 | MercExpand only | Wide memory format (operand gather mode 0, with scope/fence branches) |

Template address space

The 40+ distinct templates returned by sub_C3F490 occupy a contiguous .data region:

| Address range | Routing values served |
|---|---|
| 0x22B8960--0x22B8E60 | 257 (async copy variants) |
| 0x22B8E60--0x22B9360 | 70 (barrier memory variants) |
| 0x22B9360--0x22B9860 | 262 (MercExpand wide memory) |
| 0x22B9860--0x22B9E60 | 247, 245 (fenced / load variants) |
| 0x22B9E60--0x22BA960 | 243, 246, 70 (default / reduction / barrier sub-variants) |
| 0x22BA960--0x22BB460 | Leaf templates for bare operand forms (no modifiers) |

Each template is 256 bytes (0x100). For a given routing value, the modifier flags select progressively simpler templates as flags are cleared: the most complex template (all modifiers active) is reached first in the if-chain, and the simplest (no modifiers) is the final fallback.
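The region bounds and the 256-byte stride are enough to check the "40+" count and to turn any template address into an index. The C sketch below does only that arithmetic; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define TEMPLATE_BASE 0x22B8960u   /* start of the template region */
#define TEMPLATE_END  0x22BB460u   /* end of the template region (exclusive) */
#define TEMPLATE_SIZE 0x100u       /* 256 bytes per template */

/* Number of 256-byte templates that fit in the region. */
static uint32_t template_count(void) {
    return (TEMPLATE_END - TEMPLATE_BASE) / TEMPLATE_SIZE;
}

/* Zero-based index of the template starting at addr. */
static uint32_t template_index(uint32_t addr) {
    return (addr - TEMPLATE_BASE) / TEMPLATE_SIZE;
}
```

The region spans 0x2B00 bytes, so `template_count()` is 43. The leaf range starting at 0x22BA960 begins at index 32, which leaves 11 leaf templates at the end of the region.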

Addressing Mode Selection

Addressing mode selection is distributed across Phases 1 and 2. During Phase 1, the operand processing function sub_6273E0 (44 KB) classifies PTX operand forms into internal categories. During Phase 2, the ISel driver and Mercury encoder select the optimal SASS addressing mode based on the register-allocated operand forms.

PTX addressing modes and their SASS encodings:

| PTX syntax | Addressing mode | SASS instruction | Encoding |
|---|---|---|---|
| [%rd1] | Register indirect | LDG.E R0, [R2] | Register + zero offset |
| [%rd1+16] | Register + offset | LDG.E R0, [R2+0x10] | Register + immediate offset |
| c[2][0x100] | Constant bank | LDC R0, c[0x2][0x100] | Bank index + offset |
| [%rd1], %r2 | Base + index | STG.E [R2], R4 | Separate base/data registers |

Special string references in sub_6273E0 confirm complex addressing:

  • ".nv.reservedSmem.offset0" -- reserved shared memory region
  • "COARSEOFFSET" -- coarse-grained offset computation for large address spaces
  • "__$endLabel$__%s" -- label generation for structured control flow

The ISel mega-selector (sub_C0EB10) references "__nv_reservedSMEM_offset_0_alias" for shared memory alias resolution during final encoding.

Vtable Dispatcher Zone -- 0xAF0000--0xB10000

The range 0xAF0000--0xB10000 contains approximately 2,735 tiny vtable method implementations (average 160 bytes) that form the instruction encoding hierarchy. These implement polymorphic instruction property queries:

// Typical vtable method (sub_AFXXXX, ~160 bytes)
int64_t get_property(int64_t a1, unsigned int a2) {
    if (a2 <= N)
        return (unsigned int)dword_XXXXXXX[a2];  // table lookup
    return default_value;
}

Each function maps a small integer index to an encoding constant, answering questions like "what is the register class for operand N of this instruction?" The 0xAF0000--0xB00000 sub-range has 1,269 functions (all under 200 bytes), while 0xB00000--0xB10000 has 1,466 with slightly more complex logic (13 exceeding 1 KB).

Comparison with LLVM ISel

| Aspect | LLVM | ptxas |
|---|---|---|
| ISel framework | SelectionDAG or GlobalISel (single pass) | Two-phase: MercConverter (phase 5) + ISel driver (phase 112+) |
| Pattern specification | TableGen .td files, machine-generated | Handwritten C++ (~801 functions) |
| Pattern count | Target-dependent (thousands for x86) | ~801 DAG matchers + 185 KB mega-selector |
| Architecture dispatch | Subtarget feature bits | 5 architecture dispatch thunks + vtable overrides |
| Intermediate form | MachineInstr (already selected) | Ori IR (SASS opcodes after phase 5, not yet encoded) |
| Encoding | MCInst emission (separate pass) | Integrated: ISel + Mercury encode in same pipeline |
| Expansion | Pseudo-instruction expansion in AsmPrinter | MercExpand (phase 118, post-ISel) |
| Optimization post-ISel | MachineFunction passes | Phases 14--111 (full optimizer runs between Phase 1 and Phase 2) |

The key architectural difference: LLVM performs instruction selection once, then optimization happens on already-selected machine instructions. ptxas selects SASS opcodes early (phase 5) so the optimizer can reason about SASS-level semantics, then performs a second selection/encoding pass after optimization is complete. This two-phase design gives the optimizer accurate cost models (it sees real SASS opcodes, not abstract PTX operations) at the cost of architectural complexity.

Function Map

| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_C0EB10 | 185 KB | ISel mega-selector (719 locals, dual 169-case switch, SM-generation dispatch) | HIGH |
| sub_6D9690 | 94 KB | Mercury master encoder (instruction type switch) | VERY HIGH |
| sub_9F1A90 | 35 KB | MercConverter main instruction conversion pass | HIGH |
| sub_9EF5E0 | 27 KB | Post-MercConverter lowering ("CONVERTING") | HIGH |
| sub_C3CC60 | 26 KB | MercExpand::run (pseudo-instruction expansion) | HIGH |
| sub_9ED2D0 | 25 KB | MercConverter opcode dispatch (master switch, & 0xCF mask) | HIGH |
| sub_9E6600 | 25 KB | Instruction expansion (64-bit split) | HIGH |
| sub_9EC340 | 23 KB | Multi-operand instruction legalization | MEDIUM |
| sub_B0BE00 | 19 KB | Opcode variant selector (class 194, 255 cases plus default) | HIGH |
| sub_C3BCD0 | 19 KB | MercExpand::expandControlFlow | MEDIUM |
| sub_9D76D0 | 18 KB | Memory instruction legalization (load/store) | HIGH |
| sub_C3E030 | 18 KB | MercExpand::finalizeExpansion | MEDIUM |
| sub_9D80E0 | 17 KB | Memory instruction legalization (variant) | HIGH |
| sub_9E8B20 | 17 KB | Texture/surface lowering | MEDIUM |
| sub_C37A10 | 16 KB | MercExpand::expandInstruction (jump table) | HIGH |
| sub_B128E0--sub_B12920 | 13 B x5 + 15 KB shared | Architecture dispatch thunks (5 SM families) | HIGH |
| sub_B1FA20 | 13 KB | SASS 3-operand builder (variant A) | HIGH |
| sub_B1D670 | 13 KB | Post-ISel instruction modifier | HIGH |
| def_C37B2E | 13 KB | MercExpand complex cases ("EXPANDING") | HIGH |
| sub_B157E0 | 12 KB | Extended modifier dispatcher (37 calls to sub_10AE640) | HIGH |
| sub_B20E00 | 11 KB | SASS 3-operand builder (variant B) | HIGH |
| sub_C39B40 | 10 KB | MercExpand::expandMemoryOp | MEDIUM |
| sub_9DA100 | 9 KB | Arithmetic operation handler (case 6) | HIGH |
| sub_B285D0 | 9 KB | ISel lowering driver (66 callees) | HIGH |
| sub_B241A0 | 7 KB | SASS instruction property setter | HIGH |
| sub_9F3340 | 7 KB | MercConverter orchestrator ("After MercConverter") | HIGH |
| sub_C3A460 | 6 KB | MercExpand::expandAtomicOp | MEDIUM |
| sub_B13E10 | 6 KB | Basic modifier dispatcher (21 calls to sub_10AE640) | HIGH |
| sub_B0AA70 | 5 KB | Opcode variant selector (class 306) | HIGH |
| sub_9DA5C0 | 2 KB | Opcode class 1 handler | MEDIUM |
| sub_13AF3D0 | 137 KB | SM120 ISel pattern coordinator (130-case switch, 83x sub_13A6280, opcodes 2--352) | HIGH |
| sub_A29220 | ~17 KB | SM120 instruction iterator (calls sub_13AF3D0 per instruction) | HIGH |
| sub_13A6280 | ~10 KB | Operand emitter (type-tag dispatch, register class 6 fast-path) | HIGH |
| sub_13A4DA0 | ~7 KB | Commutative operand position selector (4-entry pattern table) | HIGH |
| sub_13A7410 | -- | Destination operand emitter (with register class 6 check) | MEDIUM |
| sub_13A6AE0 | -- | Pre-dispatch source emitter (predicated instruction operands) | MEDIUM |
| sub_13A6F90 | -- | Extended destination emitter (3rd variant, class 6 check) | MEDIUM |
| sub_13A6790 | -- | Fenced memory operand emitter | MEDIUM |
| sub_13A45E0 | -- | Extra operand emitter (wide instruction operands 6/7) | MEDIUM |
| sub_13A5ED0 | -- | Modifier flag emitter (operands with 0x18000000 bits) | MEDIUM |
| sub_13A48E0 | -- | Operand property extractor (for sub_13A4DA0 matching) | MEDIUM |
| sub_10AE5C0 | tiny | DAGNode_ReadField (field_id to value, delegates to sub_10D5E60) | VERY HIGH |
| sub_10AE590 | tiny | DAGNode_WriteField (single field write) | VERY HIGH |
| sub_10AE640 | tiny | DAGNode_WriteFields (multi-field update) | VERY HIGH |
| sub_B28F30 | tiny | GetOperand (index into 32-byte operand array at *(node+32)) | VERY HIGH |
| sub_B28F40 | tiny | GetResultCount (node[40] + 1 - node[92]) | VERY HIGH |
| sub_B28F50 | tiny | GetSourceCount (*(node+92)) | VERY HIGH |
| sub_B28E00 | tiny | DecodeRegClass (identity function, class is plain int) | VERY HIGH |
| sub_B28E10 | tiny | isGPR operand predicate (tag == 2) | VERY HIGH |
| sub_B28E20 | tiny | isImmediate operand predicate (tag == 1) | VERY HIGH |
| sub_B28E40 | tiny | isValidReg operand predicate (tag == 10) | VERY HIGH |
| sub_B28E80 | tiny | isPredicate operand predicate (tag == 3) | VERY HIGH |
| sub_B28E90 | tiny | isUniformReg operand predicate (tag == 15) | VERY HIGH |
| sub_B28F60--sub_B74C60 | ~1.3 MB | ~801 DAG pattern matchers (priority 2--34, template 1--152) | HIGH |
| sub_C01840 | -- | Mega-selector source operand marshaller (52 calls from mega-selector) | HIGH |
| sub_C01F50 | -- | Mega-selector destination operand marshaller | HIGH |
| sub_C00EA0 | -- | Single operand extractor (returns tagged operand word) | HIGH |
| sub_BFFD60 | -- | Operand reference resolver (register ref to encoding word) | HIGH |
| sub_C06E90 | -- | Symbol/special-register lookup for shared memory | HIGH |
| sub_C07690 | -- | Immediate-operand encoding helper | MEDIUM |
| sub_C0B2C0 | -- | Extended memory/warp operation encoder | HIGH |
| sub_C05CC0 | -- | Immediate operation encoder (flag-dependent path) | MEDIUM |
| sub_BFEBF0 | tiny | Default vtable[2] stub (opcode translator, no-op identity) | VERY HIGH |
| sub_BFEAA0 | tiny | Default vtable[12] stub (capability check, always false) | VERY HIGH |
| sub_BFEA30 | tiny | Default vtable[3] stub (extension handler, no-op) | VERY HIGH |
| sub_BFEF10 | -- | Register bank capacity check / grow | MEDIUM |
| word_22B4B60 | -- | Static opcode-to-encoding-index table (uint16[222], default backend) | VERY HIGH |
| sub_C3F490 | 184 B | OCG encoding template lookup (routing value + 7 flags -> template ptr) | VERY HIGH |
| sub_6CB8A0 | -- | OCG SASS instruction emitter (calls sub_C3F490 then sub_9314F0) | HIGH |
| sub_C41100 | -- | MercExpand memory encoder (calls sub_C3F490 with full flag set) | HIGH |
| sub_C40420 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
| sub_C40B90 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
| sub_C42330 | -- | MercExpand memory encoder variant (calls sub_C3F490) | HIGH |
| unk_22B8960--unk_22BB460 | ~11 KB | Operand gathering templates (40+ entries, 256 B each) | HIGH |

Cross-References