Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

ISel Pattern Matching & Instruction Selection

Prerequisites: Familiarity with SelectionDAG, Type Legalization, and DAG Node Layout. Understanding of the Pattern Database structure, the 460 NVPTXISD opcodes catalog, and NVPTX machine opcodes is recommended.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

The NVPTX instruction selector in cicc v13.0 translates legal SelectionDAG nodes into target MachineInstr opcodes through a three-level dispatch hierarchy totaling roughly 175 KB of code across the principal selectors. At the top sits NVPTXDAGToDAGISel::Select (sub_3090F90, 12 KB / 3,012 insns), which builds a per-function cost table, manages a priority-queue-driven topological worklist, and calls the pattern matcher (sub_308FEE0) for every node. The pattern matcher fans out to a hand-written NVPTX-specific select switch (sub_347A8D0, 50 KB / 10,416 insns -- the largest ISel function) and a TableGen-generated SelectCode function (sub_348D3E0, 26 KB / 6,163 insns). Surrounding this core are six NVPTX-specific sub-selectors covering memory operations, texture/surface fetches, complex addressing modes, vector patterns, and atomics. The hand-written switch is responsible for the 460 distinct NVPTXISD::* target nodes catalogued in NVPTXISD Opcodes (372 of which are the texture/surface family) -- anything in the standard ISD::* range falls through to SelectCode. NVIDIA's key delta from upstream LLVM is (1) a compressed per-SM-variant legality table that gates which target opcodes exist on which GPU architecture, (2) a secondary 4-bit packed bitfield for fine-grained operand-class legality, and (3) the iteration budget that prevents the selector from looping indefinitely on pathological DAGs.

ISel driversub_3090F90 (12 KB, 3,012 insns)
Pattern matcher entrysub_308FEE0
NVPTX Select switchsub_347A8D0 (50 KB, 10,416 insns -- largest ISel function)
SelectCode (TableGen)sub_348D3E0 (26 KB, 6,163 insns -- auto-generated)
Vector/SIMD patternssub_3475BB0 (19 KB, 3,966 insns)
Memory operation patternssub_306D850 (14 KB, 3,192 insns)
Complex addressing modessub_30811D0 (10 KB, 2,604 insns)
Addressing mode helpersub_30783B0 (7 KB, 1,722 insns)
Texture/surface ISelsub_306A930 (9 KB, 2,191 insns)
Atomic loweringsub_3048C30 (14 KB, 3,015 insns)
Constraint tableword_3F3E6C0 (see Pattern Database)
Compressed legality tableBase + 6414, 500-byte stride per SM variant
Secondary 4-bit bitfieldBase + 521536
Legalize action tableObject + 72760, 4-bit packed
Knob registrationctor_286 at 0x4FA0C0 (1.7 KB)
Upstream LLVM sourcelib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp

ISel Driver: sub_3090F90

The top-level driver is not the pattern matcher itself; it is the orchestration loop that feeds nodes to the matcher in the right order and maintains shared state. It breaks into three phases.

Phase 1: Function Argument Cost Table

Before selecting any instructions, the driver builds a DenseMap-style hash table at this + 408 that maps function argument indices to their byte sizes. The hash table uses LLVM's standard integer-key hash function key * 37, open addressing with linear probing, and the tombstone sentinel -2. Growth triggers at 75% load factor (4 * (count + 1) >= 3 * capacity).

// Phase 1: build argument cost table
hash_table = this->arg_cost_map;  // at this + 408
for each argument A in function->args():
    byte_size = alignTo(getSizeInBits(A.type) / 8, A.alignment)
    key = A.index
    slot = (key * 37) & (capacity - 1)
    while hash_table[slot] is occupied and != key:
        slot = (slot + 1) & (capacity - 1)
    hash_table[slot] = { key, byte_size }
    if load_factor > 0.75: rehash()

The table layout:

FieldOffset from thisDescription
data+416Pointer to hash bucket array
count+424Number of live entries
tombstone_count+428Number of tombstone slots
capacity+432Total bucket count (power of 2)

If the function has a non-void return type, the driver also inserts the return value sizes into the same table, computing aligned_size = ((size + 7) >> 3 + (1 << align) - 1) >> align << align for each return element. The return-type attribute check uses attribute kind 81 (likely sret).

Phase 2: Return Value Processing

For non-void functions, the driver iterates each return value element via:

  • sub_A74710(attribute, 81) -- checks for sret attribute
  • sub_A748A0(index) -- gets return type at given index
  • sub_AE5020(dataLayout, type) -- computes ABI alignment
  • sub_9208B0(dataLayout, type) -- computes size in bits

Each return value's aligned byte size is inserted into the argument cost table, so the pattern matcher can look up the cost of materializing any function parameter or return value during instruction selection.

Phase 3: Topological Selection Loop

The main selection loop processes DAG nodes in topological order using a min-heap priority queue where priority equals topological order (lower number = earlier in the DAG, processed first). The iteration is bounded by an explicit budget.

// Phase 3: main ISel loop
sub_308B6F0(this);  // initialize worklist from DAG
budget = 4 * numInstructions * maxBlockSize
iteration = 0

while heap is not empty:
    node = heap.extractMin()         // sub_3089BD0: heap-sift-down
    sub_308FEE0(this, node, &tmp)    // pattern matcher dispatch

    if this->selectionChanged:       // byte at this + 400
        re-scan affected nodes

    iteration++
    if iteration > budget:
        break  // anti-infinite-loop guard

sub_308AB30(this)    // cleanup
sub_264E600(this)    // deallocate worklist
sub_308B100(this)    // destroy hash table

The min-heap stores (SDNode*, priority) pairs at 16-byte stride. The heap-sift-down operation (sub_3089BD0) maintains the heap invariant after extraction. The selectionChanged flag at this + 400 is set by the pattern matcher when it replaces a node, signaling the driver to re-examine downstream users.

The iteration budget formula 4 * numInstructions * maxBlockSize is an NVIDIA addition -- upstream LLVM's SelectionDAGISel does not have this guard. It prevents pathological DAGs (for example, from heavily-inlined device functions with thousands of parameters) from causing the selector to spin indefinitely when combine/legalize/select cycles interact.

Pattern Matcher Dispatch: sub_308FEE0

The pattern matcher is called once per SDNode. It reads the node's opcode at *(node + 24) and dispatches through a multi-level decision tree:

  1. Quick-reject filter. If the node is already selected (machine opcode bit set in flags), return immediately.
  2. NVPTX-specific hand-written patterns. Calls sub_347A8D0 for NVPTX custom opcodes (NVPTXISD::*, values at or above ISD::BUILTIN_OP_END -- reconstructed as 499 from the sub_33D4EF0 cutover, see NVPTXISD Opcodes). This handles the 460 enumerated target nodes -- texture/surface fetches (372 opcodes, dispatched into sub_306A930), MMA instructions, atomic operations, .param-space loads/stores, branch-index tables (Brx*), funnel shifts, and the call-frame pseudos in MachineInstr opcode range 505--573.
  3. TableGen auto-generated matcher. Calls sub_348D3E0 (SelectCode) for standard ISD opcodes. This function is mechanically generated from the .td pattern files in the NVPTX backend and contains a massive switch table mapping DAG patterns to MachineInstr opcodes.
  4. Complex pattern matching. For load/store addressing modes, calls sub_30811D0 (77KB) and sub_30783B0 (39KB), which match base + offset, base + scaled_index, and address-space-qualified patterns.
  5. Fallback. If no pattern matches, the node is marked as "failed ISel" and the driver may retry after DAG combining.

NVPTX Select Switch: sub_347A8D0 (309KB)

This is the largest single ISel function, containing the hand-written pattern matching for all NVIDIA-specific DAG nodes. It calls sub_969240 263 times (SDNode accessor), is self-recursive 42 times, and dispatches to:

Sub-selectorSizeCoverage
sub_3447D7032KBSpecific pattern sub-dispatch
sub_3441190--Pattern helpers
sub_343FD60--Type-aware matching
sub_3475BB089KBVector/SIMD patterns (v2, v4 packed types)

The function switches on the SDNode opcode to handle:

  • Load/store with address spaces -- selects between ld.global, ld.shared, ld.local, ld.param, ld.const, and generic-space loads, each requiring different PTX instructions.
  • Texture/surface operations -- dispatches to sub_306A930 for tex, suld, sust instruction patterns.
  • MMA/WMMA/tensor ops -- selects the correct mma.sync, wmma.mma, wgmma variant based on operand types and SM architecture.
  • Atomic operations -- selects between atom.global.add, atom.shared.cas, red.global.add, etc., with scope qualifiers (.cta, .gpu, .sys).
  • Barrier/fence operations -- selects bar.sync, bar.warp.sync, membar.cta, membar.gl, membar.sys.

SelectCode (TableGen): sub_348D3E0 (256KB)

This auto-generated function implements the standard LLVM TableGen pattern matching algorithm. It is a giant switch-table compiled from the .td instruction pattern files in lib/Target/NVPTX/*.td. The function:

  • Calls sub_969240 45 times and sub_32889F0 38 times (opcode/type checkers).
  • Contains no string literals (purely mechanical code).
  • Works in tandem with sub_347A8D0: the hand-written selector handles NVPTX custom nodes first, and anything that falls through goes to SelectCode.

The auto-generated matcher encodes patterns as a sequence of opcode checks, type checks, and operand recursive matches. When a full pattern matches, it calls MorphNodeTo to convert the SDNode into a MachineSDNode with the target opcode and register operands.

Compressed Instruction Legality Table

NVIDIA's instruction selector uses a per-SM-variant legality table to determine whether a given target opcode is legal on the current GPU architecture. This table is checked during instruction selection to gate SM-specific instructions (for example, wgmma instructions are illegal on SM 70 but legal on SM 90+).

The table lives at a fixed offset from the base of the ISel object, accessed by sub_376DE90:

legality = *(uint8_t*)(base + 500 * arch_variant + opcode + 6414)
FieldEncoding
Base offset6414 bytes from object base
Row stride500 bytes per architecture variant
Index500 * arch_variant + opcode
Value 0Illegal -- this opcode does not exist on this SM
Value 1Custom -- requires custom lowering before emission
Value 2Legal -- can be emitted directly

The arch_variant value selects which row of the table to consult. Each row contains 500 entries, one per target opcode. The table is read-only after initialization and occupies approximately num_variants * 500 bytes in the .data section.

Secondary 4-bit Packed Bitfield

A second legality table at base + 521536 provides fine-grained operand-class legality using 4-bit packed nibbles:

byte_offset = (opcode_class >> 3) + 36 * arch_id - arch_id
nibble      = (*(uint8_t*)(base + 521536 + byte_offset) >> (4 * (opcode_class & 7))) & 0xF

The offset simplification 36 * arch_id - arch_id equals 35 * arch_id, giving a 35-byte stride per architecture variant. Each byte packs two 4-bit legality fields, and the low/high nibble is selected by bit 0 of opcode_class. The 4-bit values encode a richer set of actions than the primary table's 3-value encoding.

Legalize Action Table

The operation legalization subsystem (separate from the ISel legality table above) uses a 4-bit packed action table at object offset 72760 to determine how to legalize each (opcode, type) pair:

index  = type_bits + 15 * opcode + 18112
action = (*(uint32_t*)(object + 4 * index + 72760) >> (4 * (type & 7))) & 0xF
ActionValueBehavior
Legal0Node is natively supported
Promote1Widen to a larger legal type
Custom5Call NVPTXTargetLowering::LowerOperation via vtable slot 164
ExpandInteger9Split wide integers into halves
ExpandFloat13Emulate unsupported FP via libcalls
SplitVector14Decompose illegal vector into legal sub-vectors

This table is distinct from the type-legality table at TLI + 2422 (described in SelectionDAG), which uses a 259-byte stride and encodes the simpler 5-action set (Legal/Custom/Expand/LibCall/Promote). The table at +72760 is the operation-level action table used during the LegalizeOp phase, while the +2422 table is the type-level action table used during LegalizeTypes.

NVPTX-Specific Pattern Categories

Memory Operations: sub_306D850 (77KB)

Selects PTX load/store instructions with the correct address space qualifier, vector width, and volatility. The function handles the full matrix of {ld,st} x {.global,.shared,.local,.param,.const,.gen} x {.b8,.b16,.b32,.b64,.b128} x {.v1,.v2,.v4} x {.volatile,.relaxed,.acquire,.release} instruction variants. Address space is determined by querying the pointer operand's address space attribute through the DAG.

The memory pattern matching also covers:

  • Vector loads/stores -- ld.global.v2.b32, ld.global.v4.b32, and their 64-bit variants, selected based on the vector element count (1, 2, or 4).
  • Parameter loads -- ld.param.b32 and st.param.b32 for call ABI (see SelectionDAG: .param ABI).
  • Generic-space loads with addrspacecast -- when the address space is generic (AS 0), the selector checks whether the source can be proven to be in a specific space and emits a non-generic load if so.

Texture/Surface Instructions: sub_306A930 (52KB)

Selects tex, suld, and sust instructions from DAG nodes produced by the intrinsic lowering mega-switch. The selector dispatches through helper functions:

HelperPurpose
sub_2FE5F00Texture fetch type selection
sub_2FE5F30Surface read type selection
sub_2FE5F60Surface write type selection
sub_2FE69A0Texture sampler mode selection
sub_2FE6CC0Unified texture/surface dispatch

Texture instructions have complex operand requirements: sampler reference, texture reference, coordinate type (1D/2D/3D/cube), data type (f32/i32/f16), and optional LOD/gradient parameters. The selector maps each combination to a specific PTX tex.1d.v4.f32.f32 (or similar) opcode.

Complex Addressing Modes: sub_30811D0 (77KB)

Matches addressing patterns for load/store operands. NVPTX supports a limited set of addressing modes compared to x86:

  • Register + immediate offset -- [%r1 + 16], the most common PTX addressing mode.
  • Register -- [%r1], zero-offset variant.
  • Immediate -- [0x1000], absolute address (rare on GPU).
  • Register + register -- not directly supported in PTX; decomposed into add + register addressing.

The complex pattern matcher at sub_30811D0 calls seven helper functions (sub_307B990 through sub_307FEF0) to decompose DAG address expressions into base-register + offset pairs. When the offset is a constant that fits in the PTX immediate field, it folds into the instruction encoding. When the offset is too large or non-constant, it generates a separate add instruction and uses register addressing.

MMA / Tensor Core Instructions

Tensor core instruction selection is split across the intrinsic lowering stage (which generates NVPTXISD nodes from wmma.load, wmma.mma, mma.sync, wgmma intrinsics) and the ISel stage (which selects the specific PTX opcode). The ISel switch in sub_347A8D0 handles these by checking:

  1. SM architecture -- wmma requires SM 70+, mma.sync requires SM 75+, wgmma requires SM 90+, tcgen05.mma requires SM 100+ (gated by the "supported only on arch-conditional or family-conditional variants from SM100 onwards" diagnostic recovered from cicc_strings.json).
  2. Matrix dimensions -- m16n16k16, m8n8k4, m16n8k8, etc.
  3. Data types -- f16, bf16, tf32, f64, i8, i4, b1, fp8 (SM 90+), fp4 (SM 100+).
  4. Accumulator type -- f16 or f32 for half-precision MMA.

The architecture check consults the compressed legality table to determine whether a given MMA variant is legal on the target SM. The block-scale variants impose an extra structural check: the matcher rejects ashift operands with a specific "ashift is not supported with tcgen05.mma.block_scale variants" assert path.

SM90+ / SM100+ Extensions (Hopper / Blackwell)

A distinct family of ISel patterns covers the asynchronous bulk-copy and tensor-memory subsystems introduced on Hopper and Blackwell. These are selected in sub_347A8D0 and rely heavily on the compressed legality table for SM gating:

  • cp.async.bulk.tensor.g2s.* -- TMA bulk load/store with optional multicast (.multicast::cluster) and shared-memory destination (.shared::cluster). Selected from intrinsic-lowered NVPTXISD nodes.
  • tcgen05.{alloc,dealloc,commit,cp,fence,mma,wait,relinquish.alloc} -- tensor-memory allocator and compute family; binary strings confirm all eight verbs are present and gated to SM100+ arch-conditional variants.
  • Cluster barriers -- barrier.cluster.{arrive,arrive.relaxed,wait}, cluster.barrier.aligned, fence.sc.cluster, cluster.get.rank, cluster.set.rank. The atomic-scope downgrade diagnostic ("scope of cluster is supported on architecture sm_90 or above. Using device scope instead.") drives a fallback path when the target SM is too old.
  • setmaxnreg.{inc,dec}.sync.aligned -- per-warp register-budget reshaping; emitted via dedicated NVPTXISD nodes.
  • griddepcontrol.{launch_dependents,wait} and elect.sync -- producer/consumer kernel coordination.

Atomic Operations: sub_3048C30 (86KB)

Atomic instruction selection generates atom.{scope}.{op}.{type} instructions. The selector handles:

OperationPTX mnemonicMachineInstr opcode range (reconstructed)
Compare-and-swapatom.cas~462
Add (int)atom.add294--297
Min (signed)atom.min302--305
Max (signed)atom.max314--317
Exchangeatom.exch(via generic path)
AND/OR/XORatom.and / atom.or / atom.xor(via generic path)

Numbering caveat. The integer values above are reconstructed MachineInstr opcode positions (the third dispatch level, after NVPTXISD::*), not NVPTXISD enumerator values. cicc_strings.json contains exactly 460 NVPTXISD::* symbols; the NVPTXISD Opcodes catalog is the authoritative cross-reference. Atomics enter ISel through generic ISD::ATOMIC_* nodes (handled by SelectCode) rather than as named NVPTXISD::* opcodes, which is why this family has no entry in the 460-name list.

The selector checks "vector atomics not supported on this architecture!" for vector-width atomics and gates them behind an SM version check (SM 90+ per the cluster-scope downgrade diagnostic). Scope qualifiers (.cta, .gpu, .sys, and on SM90+ .cluster) are determined from the memory ordering of the LLVM atomic instruction; on pre-Hopper targets a cluster-scoped atomic is silently rewritten to device scope and the diagnostic "atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead." is emitted.

Vector / SIMD Patterns: sub_3475BB0 (89KB)

Handles vector-type instruction selection for NVPTX's limited vector support (v2 and v4 packed types). The function calls sub_969240 121 times and is self-recursive 28 times. It selects between:

  • Packed register operations -- add.v2.f32, mul.v2.f32 when the SM supports native vector operations.
  • Scalarized fallback -- decomposes vector operations into per-element scalar operations when the vector type is not natively supported.
  • mov.v2 / mov.v4 -- register-to-register vector moves for shuffles and extracts.

Knobs

The ISel subsystem registers its knobs at ctor_286 (0x4FA0C0, 5KB):

KnobTypeDescription
fast-isel-abortintAbort mode for FastISel failures (0=silent, 1=warn, 2=abort)
fast-isel-report-on-fallbackboolReport when FastISel falls back to SelectionDAG
use-mbpiboolUse Machine Branch Probability Info during ISel
dag-disable-combineboolDisable DAG combining entirely
pre-RA-schedenumPre-RA scheduler variant: "default", "list-burr", "source", "list-hybrid", "list-ilp"

Note that cicc does not use FastISel for GPU code generation. The fast-isel-* knobs exist because the upstream LLVM SelectionDAGISel framework registers them unconditionally, but the NVPTX backend always takes the full SelectionDAG path. The dag-disable-combine flag is the only ISel-phase knob that has a meaningful effect on NVPTX code generation; setting it skips the DAG combiner entirely, which produces worse code but can be useful for debugging.

Differences from Upstream LLVM

AspectUpstream LLVM 20.0NVIDIA cicc v13.0
Iteration budgetNo explicit budget; relies on DAG invariants to terminateBudget = 4 * numInstructions * maxBlockSize
Argument cost tableNot present in SelectionDAGISelHash table with key * 37 hash for argument byte sizes
Legality tableSimple isLegal() callback per targetCompressed 500-stride table + 4-bit packed secondary table
FastISelUsed for -O0 on most targetsNever used; always full SelectionDAG
ISel function sizeTypical NVPTX Select() is a few KB upstream50 KB hand-written + 26 KB TableGen = ~76 KB combined
Memory patternsStandard load/store5 address spaces, each with distinct PTX encoding
Texture/surfaceNot present in upstream NVPTX (handled by intrinsics only)52KB dedicated sub-selector for tex/suld/sust
Atomic patternsStandard expansion via AtomicExpandPass86KB custom selector with scope qualifiers and architecture gating

Function Map

FunctionAddressSizeRole
NVPTXDAGToDAGISel::Select -- ISel driversub_3090F9012 KB--
Pattern matcher entry (dispatches to Select switch and SelectCode)sub_308FEE0----
NVPTX hand-written Select switchsub_347A8D0309KB--
TableGen-generated SelectCodesub_348D3E0256KB--
Vector/SIMD pattern selectionsub_3475BB089KB--
Memory operation patterns (ld/st with address spaces)sub_306D85077KB--
Complex addressing mode matchingsub_30811D077KB--
Addressing mode helper (base + offset extraction)sub_30783B039KB--
Texture/surface instruction selectionsub_306A93052KB--
Atomic operation selectionsub_3048C3086KB--
Sub-selector for specific NVPTX patternssub_3447D7032KB--
Pattern matching helperssub_347297036KB--
Operand matchingsub_343A2E049KB--
Compressed legality table lookupsub_376DE90----
Initialize topological worklistsub_308B6F0----
Min-heap sift-down (priority queue)sub_3089BD0----
ISel cleanupsub_308AB30----
Hash table destructionsub_308B100----

Cross-References