ISel Pattern Matching & Instruction Selection

Prerequisites: Familiarity with SelectionDAG, Type Legalization, and DAG Node Layout. Understanding of the Pattern Database structure, the 460 NVPTXISD opcodes catalog, and NVPTX machine opcodes is recommended.

NVIDIA-modified pass. See Differences from Upstream for GPU-specific changes.

The NVPTX instruction selector in cicc v13.0 translates legal SelectionDAG nodes into target MachineInstr opcodes through a three-level dispatch hierarchy totaling roughly 175 KB of code across the principal selectors. At the top sits NVPTXDAGToDAGISel::Select (sub_3090F90, 12 KB / 3,012 insns), which builds a per-function cost table, manages a priority-queue-driven topological worklist, and calls the pattern matcher (sub_308FEE0) for every node. The pattern matcher fans out to a hand-written NVPTX-specific select switch (sub_347A8D0, 50 KB / 10,416 insns -- the largest ISel function) and a TableGen-generated SelectCode function (sub_348D3E0, 26 KB / 6,163 insns). Surrounding this core are six NVPTX-specific sub-selectors covering memory operations, texture/surface fetches, complex addressing modes, vector patterns, and atomics. The hand-written switch is responsible for the 460 distinct NVPTXISD::* target nodes catalogued in NVPTXISD Opcodes (372 of which are the texture/surface family) -- anything in the standard ISD::* range falls through to SelectCode. NVIDIA's key delta from upstream LLVM is (1) a compressed per-SM-variant legality table that gates which target opcodes exist on which GPU architecture, (2) a secondary 4-bit packed bitfield for fine-grained operand-class legality, and (3) the iteration budget that prevents the selector from looping indefinitely on pathological DAGs.


ISel driver	`sub_3090F90` (12 KB, 3,012 insns)
Pattern matcher entry	`sub_308FEE0`
NVPTX Select switch	`sub_347A8D0` (50 KB, 10,416 insns -- largest ISel function)
SelectCode (TableGen)	`sub_348D3E0` (26 KB, 6,163 insns -- auto-generated)
Vector/SIMD patterns	`sub_3475BB0` (19 KB, 3,966 insns)
Memory operation patterns	`sub_306D850` (14 KB, 3,192 insns)
Complex addressing modes	`sub_30811D0` (10 KB, 2,604 insns)
Addressing mode helper	`sub_30783B0` (7 KB, 1,722 insns)
Texture/surface ISel	`sub_306A930` (9 KB, 2,191 insns)
Atomic lowering	`sub_3048C30` (14 KB, 3,015 insns)
Constraint table	`word_3F3E6C0` (see Pattern Database)
Compressed legality table	Base + 6414, 500-byte stride per SM variant
Secondary 4-bit bitfield	Base + 521536
Legalize action table	Object + 72760, 4-bit packed
Knob registration	`ctor_286` at `0x4FA0C0` (1.7 KB)
Upstream LLVM source	`lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp`, `lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp`

ISel Driver: `sub_3090F90`

The top-level driver is not the pattern matcher itself; it is the orchestration loop that feeds nodes to the matcher in the right order and maintains shared state. It breaks into three phases.

Phase 1: Function Argument Cost Table

Before selecting any instructions, the driver builds a DenseMap-style hash table at this + 408 that maps function argument indices to their byte sizes. The hash table uses LLVM's standard integer-key hash function key * 37, open addressing with linear probing, and the tombstone sentinel -2. Growth triggers at 75% load factor (4 * (count + 1) >= 3 * capacity).

// Phase 1: build argument cost table
hash_table = this->arg_cost_map;  // at this + 408
for each argument A in function->args():
    byte_size = alignTo(getSizeInBits(A.type) / 8, A.alignment)
    key = A.index
    slot = (key * 37) & (capacity - 1)
    while hash_table[slot] is occupied and != key:
        slot = (slot + 1) & (capacity - 1)
    hash_table[slot] = { key, byte_size }
    if load_factor > 0.75: rehash()

The table layout:

Field	Offset from `this`	Description
`data`	+416	Pointer to hash bucket array
`count`	+424	Number of live entries
`tombstone_count`	+428	Number of tombstone slots
`capacity`	+432	Total bucket count (power of 2)

If the function has a non-void return type, the driver also inserts the return value sizes into the same table, computing aligned_size = ((size + 7) >> 3 + (1 << align) - 1) >> align << align for each return element. The return-type attribute check uses attribute kind 81 (likely sret).

Phase 2: Return Value Processing

For non-void functions, the driver iterates each return value element via:

sub_A74710(attribute, 81) -- checks for sret attribute
sub_A748A0(index) -- gets return type at given index
sub_AE5020(dataLayout, type) -- computes ABI alignment
sub_9208B0(dataLayout, type) -- computes size in bits

Each return value's aligned byte size is inserted into the argument cost table, so the pattern matcher can look up the cost of materializing any function parameter or return value during instruction selection.

Phase 3: Topological Selection Loop

The main selection loop processes DAG nodes in topological order using a min-heap priority queue where priority equals topological order (lower number = earlier in the DAG, processed first). The iteration is bounded by an explicit budget.

// Phase 3: main ISel loop
sub_308B6F0(this);  // initialize worklist from DAG
budget = 4 * numInstructions * maxBlockSize
iteration = 0

while heap is not empty:
    node = heap.extractMin()         // sub_3089BD0: heap-sift-down
    sub_308FEE0(this, node, &tmp)    // pattern matcher dispatch

    if this->selectionChanged:       // byte at this + 400
        re-scan affected nodes

    iteration++
    if iteration > budget:
        break  // anti-infinite-loop guard

sub_308AB30(this)    // cleanup
sub_264E600(this)    // deallocate worklist
sub_308B100(this)    // destroy hash table

The min-heap stores (SDNode*, priority) pairs at 16-byte stride. The heap-sift-down operation (sub_3089BD0) maintains the heap invariant after extraction. The selectionChanged flag at this + 400 is set by the pattern matcher when it replaces a node, signaling the driver to re-examine downstream users.

The iteration budget formula 4 * numInstructions * maxBlockSize is an NVIDIA addition -- upstream LLVM's SelectionDAGISel does not have this guard. It prevents pathological DAGs (for example, from heavily-inlined device functions with thousands of parameters) from causing the selector to spin indefinitely when combine/legalize/select cycles interact.

Pattern Matcher Dispatch: `sub_308FEE0`

The pattern matcher is called once per SDNode. It reads the node's opcode at *(node + 24) and dispatches through a multi-level decision tree:

Quick-reject filter. If the node is already selected (machine opcode bit set in flags), return immediately.
NVPTX-specific hand-written patterns. Calls sub_347A8D0 for NVPTX custom opcodes (NVPTXISD::*, values at or above ISD::BUILTIN_OP_END -- reconstructed as 499 from the sub_33D4EF0 cutover, see NVPTXISD Opcodes). This handles the 460 enumerated target nodes -- texture/surface fetches (372 opcodes, dispatched into sub_306A930), MMA instructions, atomic operations, .param-space loads/stores, branch-index tables (Brx*), funnel shifts, and the call-frame pseudos in MachineInstr opcode range 505--573.
TableGen auto-generated matcher. Calls sub_348D3E0 (SelectCode) for standard ISD opcodes. This function is mechanically generated from the .td pattern files in the NVPTX backend and contains a massive switch table mapping DAG patterns to MachineInstr opcodes.
Complex pattern matching. For load/store addressing modes, calls sub_30811D0 (77KB) and sub_30783B0 (39KB), which match base + offset, base + scaled_index, and address-space-qualified patterns.
Fallback. If no pattern matches, the node is marked as "failed ISel" and the driver may retry after DAG combining.

NVPTX Select Switch: `sub_347A8D0` (309KB)

This is the largest single ISel function, containing the hand-written pattern matching for all NVIDIA-specific DAG nodes. It calls sub_969240 263 times (SDNode accessor), is self-recursive 42 times, and dispatches to:

Sub-selector	Size	Coverage
`sub_3447D70`	32KB	Specific pattern sub-dispatch
`sub_3441190`	--	Pattern helpers
`sub_343FD60`	--	Type-aware matching
`sub_3475BB0`	89KB	Vector/SIMD patterns (v2, v4 packed types)

The function switches on the SDNode opcode to handle:

Load/store with address spaces -- selects between ld.global, ld.shared, ld.local, ld.param, ld.const, and generic-space loads, each requiring different PTX instructions.
Texture/surface operations -- dispatches to sub_306A930 for tex, suld, sust instruction patterns.
MMA/WMMA/tensor ops -- selects the correct mma.sync, wmma.mma, wgmma variant based on operand types and SM architecture.
Atomic operations -- selects between atom.global.add, atom.shared.cas, red.global.add, etc., with scope qualifiers (.cta, .gpu, .sys).
Barrier/fence operations -- selects bar.sync, bar.warp.sync, membar.cta, membar.gl, membar.sys.

SelectCode (TableGen): `sub_348D3E0` (256KB)

This auto-generated function implements the standard LLVM TableGen pattern matching algorithm. It is a giant switch-table compiled from the .td instruction pattern files in lib/Target/NVPTX/*.td. The function:

Calls sub_969240 45 times and sub_32889F0 38 times (opcode/type checkers).
Contains no string literals (purely mechanical code).
Works in tandem with sub_347A8D0: the hand-written selector handles NVPTX custom nodes first, and anything that falls through goes to SelectCode.

The auto-generated matcher encodes patterns as a sequence of opcode checks, type checks, and operand recursive matches. When a full pattern matches, it calls MorphNodeTo to convert the SDNode into a MachineSDNode with the target opcode and register operands.

Compressed Instruction Legality Table

NVIDIA's instruction selector uses a per-SM-variant legality table to determine whether a given target opcode is legal on the current GPU architecture. This table is checked during instruction selection to gate SM-specific instructions (for example, wgmma instructions are illegal on SM 70 but legal on SM 90+).

The table lives at a fixed offset from the base of the ISel object, accessed by sub_376DE90:

legality = *(uint8_t*)(base + 500 * arch_variant + opcode + 6414)

Field	Encoding
Base offset	6414 bytes from object base
Row stride	500 bytes per architecture variant
Index	`500 * arch_variant + opcode`
Value 0	Illegal -- this opcode does not exist on this SM
Value 1	Custom -- requires custom lowering before emission
Value 2	Legal -- can be emitted directly

The arch_variant value selects which row of the table to consult. Each row contains 500 entries, one per target opcode. The table is read-only after initialization and occupies approximately num_variants * 500 bytes in the .data section.

Secondary 4-bit Packed Bitfield

A second legality table at base + 521536 provides fine-grained operand-class legality using 4-bit packed nibbles:

byte_offset = (opcode_class >> 3) + 36 * arch_id - arch_id
nibble      = (*(uint8_t*)(base + 521536 + byte_offset) >> (4 * (opcode_class & 7))) & 0xF

The offset simplification 36 * arch_id - arch_id equals 35 * arch_id, giving a 35-byte stride per architecture variant. Each byte packs two 4-bit legality fields, and the low/high nibble is selected by bit 0 of opcode_class. The 4-bit values encode a richer set of actions than the primary table's 3-value encoding.

Legalize Action Table

The operation legalization subsystem (separate from the ISel legality table above) uses a 4-bit packed action table at object offset 72760 to determine how to legalize each (opcode, type) pair:

index  = type_bits + 15 * opcode + 18112
action = (*(uint32_t*)(object + 4 * index + 72760) >> (4 * (type & 7))) & 0xF

Action	Value	Behavior
Legal	0	Node is natively supported
Promote	1	Widen to a larger legal type
Custom	5	Call `NVPTXTargetLowering::LowerOperation` via vtable slot 164
ExpandInteger	9	Split wide integers into halves
ExpandFloat	13	Emulate unsupported FP via libcalls
SplitVector	14	Decompose illegal vector into legal sub-vectors

This table is distinct from the type-legality table at TLI + 2422 (described in SelectionDAG), which uses a 259-byte stride and encodes the simpler 5-action set (Legal/Custom/Expand/LibCall/Promote). The table at +72760 is the operation-level action table used during the LegalizeOp phase, while the +2422 table is the type-level action table used during LegalizeTypes.

NVPTX-Specific Pattern Categories

Memory Operations: `sub_306D850` (77KB)

Selects PTX load/store instructions with the correct address space qualifier, vector width, and volatility. The function handles the full matrix of {ld,st} x {.global,.shared,.local,.param,.const,.gen} x {.b8,.b16,.b32,.b64,.b128} x {.v1,.v2,.v4} x {.volatile,.relaxed,.acquire,.release} instruction variants. Address space is determined by querying the pointer operand's address space attribute through the DAG.

The memory pattern matching also covers:

Vector loads/stores -- ld.global.v2.b32, ld.global.v4.b32, and their 64-bit variants, selected based on the vector element count (1, 2, or 4).
Parameter loads -- ld.param.b32 and st.param.b32 for call ABI (see SelectionDAG: .param ABI).
Generic-space loads with addrspacecast -- when the address space is generic (AS 0), the selector checks whether the source can be proven to be in a specific space and emits a non-generic load if so.

Texture/Surface Instructions: `sub_306A930` (52KB)

Selects tex, suld, and sust instructions from DAG nodes produced by the intrinsic lowering mega-switch. The selector dispatches through helper functions:

Helper	Purpose
`sub_2FE5F00`	Texture fetch type selection
`sub_2FE5F30`	Surface read type selection
`sub_2FE5F60`	Surface write type selection
`sub_2FE69A0`	Texture sampler mode selection
`sub_2FE6CC0`	Unified texture/surface dispatch

Texture instructions have complex operand requirements: sampler reference, texture reference, coordinate type (1D/2D/3D/cube), data type (f32/i32/f16), and optional LOD/gradient parameters. The selector maps each combination to a specific PTX tex.1d.v4.f32.f32 (or similar) opcode.

Complex Addressing Modes: `sub_30811D0` (77KB)

Matches addressing patterns for load/store operands. NVPTX supports a limited set of addressing modes compared to x86:

Register + immediate offset -- [%r1 + 16], the most common PTX addressing mode.
Register -- [%r1], zero-offset variant.
Immediate -- [0x1000], absolute address (rare on GPU).
Register + register -- not directly supported in PTX; decomposed into add + register addressing.

The complex pattern matcher at sub_30811D0 calls seven helper functions (sub_307B990 through sub_307FEF0) to decompose DAG address expressions into base-register + offset pairs. When the offset is a constant that fits in the PTX immediate field, it folds into the instruction encoding. When the offset is too large or non-constant, it generates a separate add instruction and uses register addressing.

MMA / Tensor Core Instructions

Tensor core instruction selection is split across the intrinsic lowering stage (which generates NVPTXISD nodes from wmma.load, wmma.mma, mma.sync, wgmma intrinsics) and the ISel stage (which selects the specific PTX opcode). The ISel switch in sub_347A8D0 handles these by checking:

SM architecture -- wmma requires SM 70+, mma.sync requires SM 75+, wgmma requires SM 90+, tcgen05.mma requires SM 100+ (gated by the "supported only on arch-conditional or family-conditional variants from SM100 onwards" diagnostic recovered from cicc_strings.json).
Matrix dimensions -- m16n16k16, m8n8k4, m16n8k8, etc.
Data types -- f16, bf16, tf32, f64, i8, i4, b1, fp8 (SM 90+), fp4 (SM 100+).
Accumulator type -- f16 or f32 for half-precision MMA.

The architecture check consults the compressed legality table to determine whether a given MMA variant is legal on the target SM. The block-scale variants impose an extra structural check: the matcher rejects ashift operands with a specific "ashift is not supported with tcgen05.mma.block_scale variants" assert path.

SM90+ / SM100+ Extensions (Hopper / Blackwell)

A distinct family of ISel patterns covers the asynchronous bulk-copy and tensor-memory subsystems introduced on Hopper and Blackwell. These are selected in sub_347A8D0 and rely heavily on the compressed legality table for SM gating:

cp.async.bulk.tensor.g2s.* -- TMA bulk load/store with optional multicast (.multicast::cluster) and shared-memory destination (.shared::cluster). Selected from intrinsic-lowered NVPTXISD nodes.
tcgen05.{alloc,dealloc,commit,cp,fence,mma,wait,relinquish.alloc} -- tensor-memory allocator and compute family; binary strings confirm all eight verbs are present and gated to SM100+ arch-conditional variants.
Cluster barriers -- barrier.cluster.{arrive,arrive.relaxed,wait}, cluster.barrier.aligned, fence.sc.cluster, cluster.get.rank, cluster.set.rank. The atomic-scope downgrade diagnostic ("scope of cluster is supported on architecture sm_90 or above. Using device scope instead.") drives a fallback path when the target SM is too old.
setmaxnreg.{inc,dec}.sync.aligned -- per-warp register-budget reshaping; emitted via dedicated NVPTXISD nodes.
griddepcontrol.{launch_dependents,wait} and elect.sync -- producer/consumer kernel coordination.

Atomic Operations: `sub_3048C30` (86KB)

Atomic instruction selection generates atom.{scope}.{op}.{type} instructions. The selector handles:

Operation	PTX mnemonic	MachineInstr opcode range (reconstructed)
Compare-and-swap	`atom.cas`	~462
Add (int)	`atom.add`	294--297
Min (signed)	`atom.min`	302--305
Max (signed)	`atom.max`	314--317
Exchange	`atom.exch`	(via generic path)
AND/OR/XOR	`atom.and` / `atom.or` / `atom.xor`	(via generic path)

Numbering caveat. The integer values above are reconstructed MachineInstr opcode positions (the third dispatch level, after NVPTXISD::*), not NVPTXISD enumerator values. cicc_strings.json contains exactly 460 NVPTXISD::* symbols; the NVPTXISD Opcodes catalog is the authoritative cross-reference. Atomics enter ISel through generic ISD::ATOMIC_* nodes (handled by SelectCode) rather than as named NVPTXISD::* opcodes, which is why this family has no entry in the 460-name list.

The selector checks "vector atomics not supported on this architecture!" for vector-width atomics and gates them behind an SM version check (SM 90+ per the cluster-scope downgrade diagnostic). Scope qualifiers (.cta, .gpu, .sys, and on SM90+ .cluster) are determined from the memory ordering of the LLVM atomic instruction; on pre-Hopper targets a cluster-scoped atomic is silently rewritten to device scope and the diagnostic "atomic operations' scope of cluster is supported on architecture sm_90 or above. Using device scope instead." is emitted.

Vector / SIMD Patterns: `sub_3475BB0` (89KB)

Handles vector-type instruction selection for NVPTX's limited vector support (v2 and v4 packed types). The function calls sub_969240 121 times and is self-recursive 28 times. It selects between:

Packed register operations -- add.v2.f32, mul.v2.f32 when the SM supports native vector operations.
Scalarized fallback -- decomposes vector operations into per-element scalar operations when the vector type is not natively supported.
mov.v2 / mov.v4 -- register-to-register vector moves for shuffles and extracts.

Knobs

The ISel subsystem registers its knobs at ctor_286 (0x4FA0C0, 5KB):

Knob	Type	Description
`fast-isel-abort`	int	Abort mode for FastISel failures (0=silent, 1=warn, 2=abort)
`fast-isel-report-on-fallback`	bool	Report when FastISel falls back to SelectionDAG
`use-mbpi`	bool	Use Machine Branch Probability Info during ISel
`dag-disable-combine`	bool	Disable DAG combining entirely
`pre-RA-sched`	enum	Pre-RA scheduler variant: `"default"`, `"list-burr"`, `"source"`, `"list-hybrid"`, `"list-ilp"`

Note that cicc does not use FastISel for GPU code generation. The fast-isel-* knobs exist because the upstream LLVM SelectionDAGISel framework registers them unconditionally, but the NVPTX backend always takes the full SelectionDAG path. The dag-disable-combine flag is the only ISel-phase knob that has a meaningful effect on NVPTX code generation; setting it skips the DAG combiner entirely, which produces worse code but can be useful for debugging.

Differences from Upstream LLVM

Aspect	Upstream LLVM 20.0	NVIDIA cicc v13.0
Iteration budget	No explicit budget; relies on DAG invariants to terminate	Budget = `4 * numInstructions * maxBlockSize`
Argument cost table	Not present in `SelectionDAGISel`	Hash table with `key * 37` hash for argument byte sizes
Legality table	Simple `isLegal()` callback per target	Compressed 500-stride table + 4-bit packed secondary table
FastISel	Used for -O0 on most targets	Never used; always full SelectionDAG
ISel function size	Typical NVPTX `Select()` is a few KB upstream	50 KB hand-written + 26 KB TableGen = ~76 KB combined
Memory patterns	Standard load/store	5 address spaces, each with distinct PTX encoding
Texture/surface	Not present in upstream NVPTX (handled by intrinsics only)	52KB dedicated sub-selector for tex/suld/sust
Atomic patterns	Standard expansion via AtomicExpandPass	86KB custom selector with scope qualifiers and architecture gating

Function Map

Function	Address	Size	Role
`NVPTXDAGToDAGISel::Select` -- ISel driver	`sub_3090F90`	12 KB	--
Pattern matcher entry (dispatches to Select switch and SelectCode)	`sub_308FEE0`	--	--
NVPTX hand-written Select switch	`sub_347A8D0`	309KB	--
TableGen-generated SelectCode	`sub_348D3E0`	256KB	--
Vector/SIMD pattern selection	`sub_3475BB0`	89KB	--
Memory operation patterns (ld/st with address spaces)	`sub_306D850`	77KB	--
Complex addressing mode matching	`sub_30811D0`	77KB	--
Addressing mode helper (base + offset extraction)	`sub_30783B0`	39KB	--
Texture/surface instruction selection	`sub_306A930`	52KB	--
Atomic operation selection	`sub_3048C30`	86KB	--
Sub-selector for specific NVPTX patterns	`sub_3447D70`	32KB	--
Pattern matching helpers	`sub_3472970`	36KB	--
Operand matching	`sub_343A2E0`	49KB	--
Compressed legality table lookup	`sub_376DE90`	--	--
Initialize topological worklist	`sub_308B6F0`	--	--
Min-heap sift-down (priority queue)	`sub_3089BD0`	--	--
ISel cleanup	`sub_308AB30`	--	--
Hash table destruction	`sub_308B100`	--	--

Cross-References

SelectionDAG & Instruction Selection -- parent page covering the full SelectionDAG pipeline (type legalization, operation legalization, DAG combining, and the ISel overview)
NVPTXISD Opcodes -- authoritative catalog of all 460 NVPTXISD::* target nodes recovered from the binary, grouped by family (texture/surface 372, load 18, store 17, call/frame 29, math 10, funnel-shift 4, brx 3, misc 7)
Pattern Database / Constraint Table -- the per-instruction operand constraint table at word_3F3E6C0
DAG Node Layout -- SDNode structure definition
NVPTX Machine Opcode Reference -- the MachineInstr opcodes that NVPTXISD nodes lower into, including the 505--573 call-ABI pseudo range
NVPTX Target Infrastructure -- target machine, subtarget features, and register classes
Hash Infrastructure -- the key * 37 integer hash used throughout cicc
Tensor / MMA Builtins -- intrinsic lowering for MMA operations that feed into ISel
Surface & Texture Builtins -- intrinsic lowering for texture/surface operations
Atomics Builtins -- intrinsic lowering for atomic operations

Keyboard shortcuts

CICC Reverse Engineering Reference