Branch & Switch Optimization
All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.
Four phases in the ptxas pipeline transform branch and switch-statement control flow in the Ori IR. Two phases optimize switch statements (phases 14 and 30), one performs general branch simplification (phase 15), and one flattens nested conditional branches (phase 38). Together they reduce branch count, eliminate unreachable code, and prepare the CFG for downstream passes like predication (phase 63), liveness analysis (phase 16), and loop canonicalization (phase 18).
These phases operate on the Ori IR before register allocation and scheduling. At this pipeline stage, branch instructions use the Ori OEN opcode (SASS BRA), conditional execution is controlled by predicate registers (P0--P6, PT), and the CFG is a hash-map-based structure with FNV-1a-keyed successor/predecessor edges.
| Item | Value |
|---|---|
| DoSwitchOptFirst | Phase 14 -- vtable at off_22BD7F8 |
| OriBranchOpt | Phase 15 -- vtable at off_22BD820 |
| DoSwitchOptSecond | Phase 30 -- vtable at off_22BDA78 |
| OptimizeNestedCondBranches | Phase 38 -- vtable at off_22BDBB8 |
| Phase factory | sub_C60D30 cases 14, 15, 30, 38 |
| Phase object size | 16 bytes (standard {vtable_ptr, allocator_ptr}) |
| IR level | Ori -- SASS opcodes with virtual registers |
| Key opcodes | OEN (BRA), OFFL (BSSY), OFLAP (BSYNC) |
| CFG infrastructure | FNV-1a hash maps at Code Object +648 (successors), +680 (backedges) |
| Related passes | 31 OriLinearReplacement, 63 OriDoPredication, 80 ExpandJmxComputation, 133/136 MergeEquivalentConditionalFlow |
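The CFG edge maps key blocks by FNV-1a hash and mask into a power-of-two bucket count. A minimal sketch of that scheme, assuming the standard 32-bit FNV-1a parameters and little-endian key bytes (neither detail is confirmed from the binary):

```python
def fnv1a_32(data: bytes) -> int:
    """FNV-1a 32-bit: XOR each byte into the hash, then multiply by the prime."""
    h = 0x811C9DC5                                   # offset basis
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF      # FNV prime, wrap to 32 bits
    return h

def bucket_for_block(block_index: int, num_buckets: int) -> int:
    """Hash a block index and mask into a power-of-two bucket count,
    mirroring the fnv1a(bix) & (num_buckets - 1) step in the erase path."""
    key = block_index.to_bytes(4, "little")          # key width is an assumption
    return fnv1a_32(key) & (num_buckets - 1)
```

The mask-instead-of-modulo step is why the bucket count must be a power of two.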
Pipeline Placement
Phase 3 AnalyzeControlFlow ── builds CFG (predecessors, successors, RPO, dominators)
Phase 6 SetControlFlowOpLastInBB ── ensures branches are last in each block
Phase 13 GeneralOptimizeEarly ── const fold + copy prop (feeds branch info)
Phase 14 DoSwitchOptFirst ── SWITCH OPTIMIZATION (1st pass)
Phase 15 OriBranchOpt ── BRANCH SIMPLIFICATION
Phase 16 OriPerformLiveDeadFirst ── DCE cleanup of dead branches
...
Phase 29 GeneralOptimize ── const fold after loop transforms
Phase 30 DoSwitchOptSecond ── SWITCH OPTIMIZATION (2nd pass)
Phase 31 OriLinearReplacement ── branchless replacement
...
Phase 37 GeneralOptimizeMid ── const fold + copy prop (feeds nested cond info)
Phase 38 OptimizeNestedCondBranches ── NESTED CONDITIONAL FLATTENING
...
Phase 63 OriDoPredication ── if-conversion (converts short branches to predicates)
...
Phase 80 ExpandJmxComputation ── expands jump-table index computations
...
Phase 133 MergeEquivalentConditionalFlow ── tail merging
Phase 136 LateMergeEquivalentConditionalFlow
Why Two DoSwitchOpt Passes?
The first pass (phase 14) runs immediately after the initial GeneralOptimizeEarly compound pass. At this point, constant folding and copy propagation have resolved many switch selector values, enabling the optimizer to determine case density and choose a lowering strategy.
The second pass (phase 30) runs after loop unrolling (phase 22), strength reduction (phase 21), SSA phi insertion (phase 23), and software pipelining (phase 24). These transformations can expose new switch patterns -- particularly after loop unrolling duplicates switch bodies, creating opportunities for case clustering that were not visible before.
Despite their names, the two passes use different dispatch paths. Phase 14 dispatches through the SM backend's vtable at offset +136 (*(*(ctx+1584)+136)), making it a polymorphic, architecture-specific switch optimization. Phase 30 calls the generic switch optimization core (sub_77CF40 via sub_791F00). This means phase 14 runs whatever switch optimization the current SM target provides, while phase 30 always runs the generic algorithm. The two passes share pipeline position semantics (first pass vs. second pass) but not necessarily the same code.
Phase 14 Vtable Slot 17 -- Per-SM Dispatch Resolution
The vtable call at offset +136 (slot 17) resolves to three distinct handlers depending on the SM backend's construction case in sub_662920. Resolved from the binary by reading each SM backend vtable at file offset (vtable_VA + 136 - 0x400000):
| SM Backend | Vtable | Slot 17 Handler | Behavior |
|---|---|---|---|
| SM30 (Kepler) | off_2029DD0 | sub_661210 (2B) | NO-OP (rep ret) |
| SM50 (Maxwell) | off_21B4A50 | sub_661210 (2B) | NO-OP (rep ret) |
| SM60 (Pascal) | off_22B2A58 | sub_849110 (613B) | Sync-pattern restructuring + branch scan |
| SM70 (Volta/Turing) | off_21D82B0 | sub_849110 (613B) | Sync-pattern restructuring + branch scan |
| SM80 (Ampere) | off_21B2D30 | sub_729160 (546B) | Per-instruction branch-distance rewriter |
| SM89 (Ada) | off_21C0C68 | sub_729160 (546B) | Per-instruction branch-distance rewriter |
| SM90+ (Hopper/Blackwell) | off_21D6860 | sub_849110 (613B) | Sync-pattern restructuring + branch scan |
Phase 14 is a no-op on Kepler and Maxwell -- those legacy architectures rely entirely on the generic phase 30 path. On all active targets (sm_75+), one of two real implementations runs.
Handler A: sub_849110 (Pascal, Volta/Turing, Hopper/Blackwell). Two-phase operation:
- Sync-pattern restructuring. Gated by ctx+1384 bit 4 AND ctx+896 == 5. Queries knob 232 for disable. When enabled, sub_848F40 iterates over instructions matching opcode 98 (sync-stack entries), identifies sync patterns that can be restructured into simpler branch patterns, rewrites them via sub_934630 (instruction creation) and sub_9253C0 (instruction deletion), and collects modified blocks into a worklist.
- Branch-target scan. Gated by backend+1400 != 0. Calls sub_7E6090 (the same branch pattern scanner used by phase 15), then iterates over the instruction linked list. For each instruction with opcode 52 or 97, calls sub_A9D9B0, which walks operands backward, identifies branch-target operands (type 5, verified by sub_7DEB50), and dispatches through *(backend+1184) to perform target-specific branch rewriting. Knob 351 limits iteration count.
Handler B: sub_729160 (Ampere, Ada). A single-pass per-instruction classifier. Takes an instruction pointer as argument, extracts the opcode at instr+72 (masked with & 0xFFFFCFFF), and returns a branch-distance category:
| Opcode (masked) | Category | Return value |
|---|---|---|
| 22 (CALL) | Complex control flow | sub_7E40E0 result |
| 50 (BSSY) | Sync-stack push | Lookup from xmmword_21B2EC0 table |
| 51 (CAL), 110--111 (BRK/CONT), 289 | Simple control flow | 3 |
| 77 (BRA) | Branch | sub_7E36C0 (5-field distance metric) |
| 83 (RET) | Return | sub_7E3640 result |
| 112 (BRX) | Indexed branch | 4 |
| 279 | Specific control flow | 6 |
| 297 (JMXU), 352 (EXIT) | Jump/exit | sub_7E3790 / sub_7E3800 result |
| Other | Unhandled | -1 (0xFFFFFFFF) |
After classification, the handler scans forward through operands checking register class membership (class 6 or 3), then optionally invokes sub_687C00 to rewrite the branch target based on distance thresholds read from knob offsets 44280 and 44208 on the knob container. This replaces near branches with short-encoding variants when the target is within range -- an Ampere/Ada-specific optimization that exploits shorter branch encodings available on those architectures.
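The classification table can be condensed into a small dispatch function. This is a sketch of the masked-opcode lookup only: rows backed by sub_7E**** helpers return a placeholder tag here, since their real values are computed at runtime.

```python
# Helper-backed rows (CALL, BSSY, BRA, RET, JMXU, EXIT) are modeled with a
# string tag; their actual return values come from sub_7E**** at runtime.
HELPER_ROWS = {22: "call", 50: "bssy_table", 77: "bra_distance",
               83: "ret", 297: "jmxu", 352: "exit"}
# Rows with constant return values, straight from the table above.
CONSTANT_ROWS = {51: 3, 110: 3, 111: 3, 289: 3, 112: 4, 279: 6}

def classify_branch_distance(raw_opcode: int):
    op = raw_opcode & 0xFFFFCFFF          # clear bits 12-13, as in the binary
    if op in CONSTANT_ROWS:
        return CONSTANT_ROWS[op]
    if op in HELPER_ROWS:
        return HELPER_ROWS[op]
    return -1                             # unhandled: 0xFFFFFFFF as signed int
```

Note how the mask makes opcodes that differ only in bits 12--13 land on the same row.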
DoSwitchOpt -- Switch Statement Optimization (Phases 14, 30)
Overview
DoSwitchOpt transforms high-level switch statements from their initial representation as cascading conditional branches into one of three lowered forms, selected based on case density and count. The input is a chain of ISETP (integer set-predicate) + BRA (conditional branch) instruction pairs that compare the switch selector against successive case constants. The output is one of:
- Jump table -- a contiguous array of branch targets indexed by the selector value
- Binary search tree -- a balanced tree of comparisons that narrows the target in O(log N)
- Cascading if-else chain -- the original form, retained when the switch is small or sparse
Input: Switch Pattern Recognition
The pass scans each basic block for a characteristic pattern:
// Input: cascading if-else for switch(x)
BB0:
ISETP.EQ P0, x, #case_0 // compare selector against constant
@P0 BRA target_0 // conditional branch to case body
ISETP.EQ P0, x, #case_1
@P0 BRA target_1
ISETP.EQ P0, x, #case_2
@P0 BRA target_2
...
BRA default_target // fallthrough to default case
The recognizer collects:
- The selector register (the common first operand of all ISETP instructions)
- The set of case constants (immediate operands of each ISETP)
- The branch targets (one per case, plus the default target)
- The case count N
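A minimal model of the recognizer over an idealized strict ISETP/BRA cascade (the real scanner is far more tolerant, as the next section describes; the tuple encoding here is invented for illustration):

```python
def recognize_switch(block):
    """Collect selector, case constants, and targets from a strict
    ISETP.EQ + predicated-BRA cascade ending in an unconditional BRA.
    `block` is a list of tuples -- an illustrative encoding only."""
    selector, cases = None, []
    i = 0
    while i + 1 < len(block):
        a, b = block[i], block[i + 1]
        # Require ISETP.EQ followed by a *conditional* BRA (predicate set).
        if a[0] != "ISETP.EQ" or b[0] != "BRA" or b[1] is None:
            break
        _, pred, sel, const = a           # pred may vary per case
        if selector is None:
            selector = sel
        elif sel != selector:
            return None                   # mixed selectors: not a switch
        cases.append((const, b[2]))       # (case constant, case target)
        i += 2
    if i >= len(block) or block[i][0] != "BRA":
        return None                       # no trailing default branch
    return selector, cases, block[i][2]   # default target

block = [
    ("ISETP.EQ", "P0", "R4", 0), ("BRA", "P0", "case0"),
    ("ISETP.EQ", "P0", "R4", 1), ("BRA", "P0", "case1"),
    ("BRA", None, "default"),
]
```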
Scanner Tolerance (sub_77CF40, lines 234--411)
The pattern recognizer is not a rigid ISETP+BRA pair matcher. It walks the instruction linked list from the block head at code_object+280 and tolerates several forms of non-ideal input. The scanning loop in sub_77CF40 (4698 bytes) processes each instruction through a three-stage filter:
Stage 1 -- Control-flow skip. sub_7DF3A0 reads the masked opcode (*(instr+72) & 0xFFFFCFFF) and returns a per-opcode property pointer. If bit 0 of that property is set (control-flow instruction), the instruction is skipped without terminating the chain. This is how interleaved BRA instructions are consumed: they are classified as control-flow and silently stepped over while the scanner continues looking for comparison instructions.
Stage 2 -- Chain-terminating opcodes. When parameter a3 is set (second-pass mode), the scanner reads the first instruction of the successor block (*(*(block+8))+72). If that unmasked opcode is 159 (sm_73+ control), 32 (VABSDIFF4), or 271 (warpgroup arrive/wait), the chain terminates with a found_chain flag set. These mark block boundaries that cannot be part of a switch cascade -- convergence points, GMMA synchronization, or SIMT barriers.
Stage 3 -- Accepted comparison opcodes. Instructions surviving stages 1--2 are collected into the case set. The block number (field +144) is recorded, and the masked opcode determines acceptance:
| Masked opcode | SASS mnemonic | Acceptance condition |
|---|---|---|
| 96 | LDG | Unconditional -- any LDG in the chain is accepted |
| 190 | (sm_82+ extended) | sub_745D80: operand type bits 28--30 == 6, value 1--3 |
| 31 | VABSDIFF | Last source operand type bits 5--7 == 2 (immediate constant) |
| (other) | -- | Falls through to operand scan without immediate acceptance |
Operand backward scan. After collection, the inner loop walks the operand array from index operand_count - 1 down to 0 (operands at instr + 8*idx + 84). Two helpers classify each operand:
- sub_7DEB50 (14B): returns true when the register descriptor's class == 16 (predicate register). Identifies which predicate each comparison writes.
- sub_7DEAD0 (19B): returns true for a constant-zero predicate definition -- class 16, sub-type bits 10--12 == 4, value == 0. Identifies the PT (always-true) guard operand.
The scan terminates on a negative operand word (sentinel) or operand exhaustion.
Key tolerance properties:
- Interleaved non-ISETP instructions do not break the chain. Any non-control-flow instruction is collected as a potential case member; strict ISETP+BRA alternation is not required.
- Predicate register changes mid-chain are permitted. Each comparison's predicate is identified independently by the backward operand scan. The chain can use P0 for one case and P1 for another.
- Non-contiguous case constants are accepted by the scanner. Density evaluation happens downstream in the strategy selector, not in the recognizer.
- The chain terminates when: (a) the instruction list is exhausted (null next pointer), (b) a chain-stopping opcode appears in the successor block (opcodes 159/32/271), or (c) the block-end cleanup at LABEL_155 fires when *(block+164) == *(entry_block+164) (matching block group IDs).
Decision: Strategy Selection
The strategy selection is not implemented as a simple threshold comparison against hardcoded constants. Instead, sub_77CF40 uses the bitmap population in the BitsetRBTree case collection to determine the lowering strategy. The decision emerges from the structure of the collected cases:
- Cascading if-else is retained when the pattern recognizer collects few cases (the bitmap contains a single 256-bit leaf or sparse coverage). The pass checks whether the case set justifies the overhead of an indirect branch structure.
- Jump table is selected when the bitmap shows dense coverage within a contiguous value range. The density is implicit in the bitmap layout: if most bits in a range are set, the table-based lowering is preferred.
- Binary search tree is the fallback for cases that are numerous but too sparse for a jump table.
Two OCG knobs (registered in ctor_005) influence the strategy selection:
| Knob | ROT13 | VA | Type | Semantics |
|---|---|---|---|---|
| SwitchOptBudget | FjvgpuBcgOhqtrg | 0x21B7CE0 | integer (03) | Limits the optimization effort budget -- controls how many cases or how much analysis the pass performs before falling back to the default BST strategy |
| SwitchOptExtendedPatternMatch | FjvgpuBcgRkgraqrqCnggreaZngpu | 0x21B7CB0 | boolean (02) | When enabled, the pattern recognizer accepts interleaved instruction patterns beyond the simple ISETP+BRA pair, allowing it to match switches that have been partially transformed by earlier passes |
The exact default values of these knobs are set by the OCG knob initialization system (sub_79D990) and are not visible as immediate constants at the registration site. The knob names follow the standard ROT13 obfuscation convention. Neither knob is checked by the sub_7DDB50 opt-level gate -- they are consumed directly within the switch optimization algorithm in sub_77CF40 and its callees.
Jump table is preferred when case values are dense -- the selector maps directly to a table index with a bounds check and a subtraction. This produces the fastest code but consumes data memory proportional to the value range (not the case count).
Binary search tree is the default for large sparse switches. The pass sorts case constants and generates a balanced BST of ISETP + BRA pairs. Each comparison eliminates half the remaining candidates, reaching the target in O(log N) branches.
Cascading if-else is retained for small switches (typically 4 or fewer cases) where the overhead of a jump table or BST setup exceeds the cost of linear comparison.
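The three-way choice can be sketched as a density heuristic. The thresholds below (density 0.5, small-switch cutoff 4) are illustrative assumptions; as noted above, the binary derives the decision from the BitsetRBTree bitmap population rather than from hardcoded constants.

```python
def choose_strategy(case_values, density_threshold=0.5, small_cutoff=4):
    """Pick a lowering strategy from case count and value density.
    Thresholds are hypothetical; the binary infers this from bitmap shape."""
    n = len(case_values)
    if n <= small_cutoff:
        return "cascade"                  # linear compare chain stays cheapest
    value_range = max(case_values) - min(case_values) + 1
    density = n / value_range             # fraction of the range covered
    if density >= density_threshold:
        return "jump_table"               # table size ~ range, single BRX
    return "binary_search_tree"           # O(log N) compares, no table memory
```

The key asymmetry: jump-table cost scales with the value range, BST cost with the case count, which is why density rather than count alone drives the choice.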
Output: Jump Table Lowering
For jump-table-eligible switches, the pass produces:
// Output: jump table lowering
BB_switch:
IADD3 Rtmp, Rselector, -min_val, RZ // normalize to 0-based index
ISETP.GE.U32 P0, Rtmp, #range // bounds check (unsigned)
@P0 BRA default_target // out-of-range -> default
// The jump table index computation is left as a pseudo-instruction
// that phase 80 (ExpandJmxComputation) expands later into:
// LEA Raddr, Rtmp, #table_base, 2 // Raddr = table_base + index * 4
// BRX Raddr, #table_base // indexed branch
Phase 80 (ExpandJmxComputation) runs much later (after legalization) to expand the index computation pseudo-instruction into the final LEA + BRX sequence. The delayed expansion exists because the table base address is unknown until final code layout, and intervening optimization passes (GVN-CSE, strength reduction) may simplify the index computation while it remains a single pseudo-instruction.
Jump Table Data Structure
The jump table is embedded inline in the .text section, immediately after the BRX instruction that references it. BRX is a 64-bit (8-byte) format instruction (format code 0x1), so the table begins at the next 4-byte-aligned position after the instruction word. The .text layout around a jump table:
.text offset contents
──────────── ─────────────────────────────────────────────
base [ctrl] LEA Raddr, Rtmp, #table_base, 2
base + 0x08 [ctrl] BRX Raddr, #table_base
base + 0x10 table[0] i32 offset -> case min_val target
base + 0x14 table[1] i32 offset -> case min_val+1 target
...
base + 0x10 + 4*(range-1) i32 offset -> case max_val target
Entry format. Each entry is a signed 32-bit (4-byte) relative offset. The hardware computes the branch target as table_base + table[index], where table_base is the constant immediate encoded in the BRX instruction. Offsets are relative to the table base, not to the individual entry position.
Sparse fill policy. The table has exactly range = max_val - min_val + 1 entries. Case values present in the switch map to the offset of their case body. Absent values (gaps in a sparse switch) are filled with the relative offset to the default target. The bounds check (ISETP.GE.U32 + BRA) preceding the lookup handles values outside [min_val, max_val]; the default fill handles gaps inside the range. Together they guarantee every selector value reaches either its case body or the default.
Alignment. No explicit padding is inserted between BRX and the table. The table starts at the byte immediately following the 8-byte BRX instruction. SASS .text instructions are naturally aligned to their encoding width (8 or 16 bytes), so the first table entry is always at least 4-byte aligned.
Relocation. The EIATTR_JUMPTABLE_RELOCS attribute (code 0x03) in .nv.info.<func> records an array of u32 byte offsets into .text where jump table data begins. Each offset points to table[0] of a jump table. The linker (nvlink) uses these offsets to patch table entries when the final code layout changes during cross-module linking. See EIATTR Reference.
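The sparse fill policy can be sketched directly: allocate range entries, place each present case's offset, and backfill gaps with the default offset (offsets here are placeholder integers, not real .text offsets).

```python
def build_jump_table(cases, default_offset):
    """cases: dict mapping case value -> relative offset of its body.
    Returns (min_val, table) where table has max-min+1 entries and every
    gap holds the default offset, matching the sparse fill policy."""
    lo, hi = min(cases), max(cases)
    table = [default_offset] * (hi - lo + 1)   # default-fill the whole range
    for value, offset in cases.items():
        table[value - lo] = offset             # overwrite present cases
    return lo, table
```

Together with the preceding ISETP.GE.U32 bounds check, this guarantees every selector value lands on a valid entry.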
Output: Binary Search Tree Lowering
Case Sort via BitsetRBTree
Before BST emission, sub_77CF40 builds a sorted set of case block indices using a red-black tree with 256-bit bitmap leaves (BitsetRBTree, managed by sub_7436F0). Each RBT node is 64 bytes:
BitsetRBTreeNode (64 bytes):
+0 left child pointer
+8 right child pointer
+16 parent pointer
+24 key (int32) -- RPO_index >> 8 | color bit in MSB
+32 bitmap[4] -- 4 x uint64 = 256 bits, one per low-8-bit position
Insertion (sub_7436F0) decomposes a case's RPO block index (from block offset +144) into key = index >> 8 and bit = index & 0xFF, locates or creates the RBT node keyed by the high bits, then sets bitmap[bit >> 6] |= (1 << bit). If the node already exists (same key), the bit is OR'd into the existing bitmap. The RBT maintains sorted order by key, and sub_6887C0 rebalances after insertion.
The BST emission driver (LABEL_46 in sub_77CF40) iterates the sorted set in descending order. sub_758210 initializes an iterator pointing to the highest set bit of the last RBT node. Each step extracts the current bit position via _BitScanReverse64, computes a composite index (bit_position) | (rbt_key << 6), removes the bit via sub_7654E0, and processes the corresponding case block at SwitchOptContext+104 + 40 * rpo_index. This yields cases in descending RPO order without an explicit sort pass -- the RBT provides O(log N) insertion and inherently sorted traversal.
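The index decomposition and descending walk can be modeled with a plain dict standing in for the red-black tree (the dict loses the tree's inherent ordering, so this sketch sorts keys explicitly; the index reconstruction follows the insertion decomposition).

```python
def bitset_insert(tree, rpo_index):
    """Split the index into a high-bits key and a low-8-bit position,
    then set one bit in that node's 4 x 64-bit bitmap."""
    key, bit = rpo_index >> 8, rpo_index & 0xFF
    bitmap = tree.setdefault(key, [0, 0, 0, 0])
    bitmap[bit >> 6] |= 1 << (bit & 63)

def bitset_iter_descending(tree):
    """Yield indices highest-first, mimicking the _BitScanReverse64 walk."""
    for key in sorted(tree, reverse=True):
        for word in range(3, -1, -1):
            w = tree[key][word]
            while w:
                bit = w.bit_length() - 1            # highest set bit
                yield (key << 8) | (word << 6) | bit
                w &= ~(1 << bit)                    # clear it and continue
```

Sorted insertion plus bit scanning is what lets the pass avoid an explicit sort over case blocks.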
PGO Early-Exit Guard
The entry of sub_77CF40 checks three code-object fields before proceeding:
if code_obj[1404] != 0 // PGO profile data is attached
&& code_obj[1400] == 0 // instrumentation pass counter = 0 (complete)
&& (code_obj[1377] & 0x08) == 0: // scheduling-mode bit 3 is clear
return // skip switch optimization entirely
When profile-guided optimization data is present and fully processed, the pass exits without modifying any switch statements. PGO-guided code layout has already arranged case blocks by execution frequency, making static BST rebalancing counterproductive -- the profiled block ordering already minimizes expected branch depth for the observed workload.
Inside the main loop, PGO also affects the block reachability check (sub_76ABE0). When code_obj[1370] & 0x40 is set (PGO block-frequency data available), the function uses a fast bitmask lookup at block+176 instead of querying the profile object at code_obj+1168 via sub_76AAE0. This bitmask records which successor blocks were observed as reachable during profiling, allowing unreachable-case pruning before BST construction.
BST Emission Algorithm
For BST-eligible switches:
function emit_bst(cases[], lo, hi, selector, default_target):
if lo > hi:
emit: BRA default_target
return
mid = (lo + hi) / 2
if lo == hi:
emit: ISETP.EQ P0, selector, #cases[mid].value
emit: @P0 BRA cases[mid].target
emit: BRA default_target
return
emit: ISETP.LT P0, selector, #cases[mid].value
emit: @P0 BRA left_subtree_label
// Right subtree (selector >= cases[mid].value)
emit: ISETP.EQ P0, selector, #cases[mid].value
emit: @P0 BRA cases[mid].target
emit_bst(cases, mid+1, hi, selector, default_target)
// Left subtree (selector < cases[mid].value)
left_subtree_label:
emit_bst(cases, lo, mid-1, selector, default_target)
This produces a balanced tree with depth ceil(log2(N+1)). Each internal node performs at most two comparisons (less-than and equality), though the pass may optimize nodes with consecutive case values to use range checks.
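The pseudocode translates directly into a runnable sketch; the tiny interpreter below checks that every selector value routes to its case body or the default. Instruction tuples and label names are invented for illustration.

```python
def emit_bst(cases, lo, hi, out, default):
    """cases: sorted list of (value, target). Appends (op, value, dest)
    tuples plus label markers, following the pseudocode above."""
    if lo > hi:
        out.append(("bra", None, default)); return
    mid = (lo + hi) // 2
    val, tgt = cases[mid]
    if lo == hi:
        out.append(("beq", val, tgt))
        out.append(("bra", None, default)); return
    label = f"L{lo}_{mid}"
    out.append(("blt", val, label))       # selector < val -> left subtree
    out.append(("beq", val, tgt))
    emit_bst(cases, mid + 1, hi, out, default)
    out.append(("label", None, label))
    emit_bst(cases, lo, mid - 1, out, default)

def run(prog, selector):
    """Interpret the emitted sequence and return the reached target."""
    labels = {d: i for i, (op, _, d) in enumerate(prog) if op == "label"}
    pc = 0
    while True:
        op, val, dest = prog[pc]
        if op == "beq" and selector == val:
            return dest
        if op == "blt" and selector < val:
            pc = labels[dest]
        elif op == "bra":
            return dest
        else:
            pc += 1
```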
GPU-Specific: SIMT Divergence Impact
Switch optimization interacts directly with SIMT execution. On a GPU, when threads in a warp take different switch cases, the warp diverges and each case executes serially. The optimizer considers this:
- Jump tables produce a single divergence point at the BRX instruction. All threads that pick the same case reconverge naturally. The hardware BSSY/BSYNC (branch sync stack push/pop) mechanism ensures reconvergence after the switch.
- BST lowering produces O(log N) potential divergence points. Threads that agree on the BST path stay converged; threads that disagree at each BST node split into independently masked sub-warps.
- Cascading if-else produces N potential divergence points. Each comparison can split the warp.
For GPU code, jump tables are strongly preferred when density permits, because they minimize the number of divergence points to exactly one (the BRX), regardless of case count.
OriBranchOpt -- Branch Simplification (Phase 15)
Overview
OriBranchOpt performs four categories of CFG-level simplification on the Ori IR. It runs as a single pass that iterates over all basic blocks and applies the following transformations until no further changes occur:
- Unconditional branch folding -- eliminates BRA instructions that jump to the immediately following block
- Unreachable block elimination -- removes basic blocks with no predecessors (except the entry block)
- Conditional branch simplification -- simplifies conditional branches where the condition is provably constant or the true/false targets are identical
- Branch chain threading -- redirects branches that target blocks consisting of a single unconditional BRA directly to the final destination
Transformation 1: Unconditional Branch Folding
When a basic block ends with an unconditional BRA to the block that immediately follows in layout order, the branch is redundant and is deleted:
// Before: // After:
BB_A: BB_A:
... ...
BRA BB_B // fallthrough
BB_B: BB_B:
... ...
This is the most common transformation. It arises frequently after switch optimization introduces new blocks and after loop unrolling creates copies of loop bodies that end with unconditional jumps back to the next iteration.
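The fold can be sketched over a flat block layout (the (name, instructions) representation is invented for illustration):

```python
def fold_fallthrough_branches(blocks):
    """Delete a trailing unconditional BRA whose target is the next block
    in layout order -- the fallthrough makes the branch redundant."""
    for i, (name, insns) in enumerate(blocks[:-1]):
        next_name = blocks[i + 1][0]
        if insns and insns[-1] == ("BRA", next_name):
            insns.pop()                   # fallthrough replaces the branch
    return blocks
```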
Transformation 2: Unreachable Block Elimination
When other branch simplifications redirect branches away from certain blocks, those blocks lose all predecessors and become unreachable. The pass deletes them:
The deletion is performed by sub_BDE6C0 (CFGHashMap::erase, 3KB), a recursive function called from sub_BDFB10 (CFG::buildBlockMap, 24KB). The caller first resizes the RPO array (+720) and a parallel array (+744) to num_blocks + 1 entries, then loops over all block indices 0..num_blocks invoking the erase function for each.
The erase function operates on a CFG context object that contains three parallel arrays indexed by block index:
| Context Offset | Type | Field |
|---|---|---|
| +16 (a1[2]) | int* | pred_count[] -- predecessor count per block |
| +64 (a1[8]) | int* | rpo_number[] -- RPO position per block |
| *a1 + 720 | int* | rpo_to_bix[] -- maps RPO position to block index |
function erase_block(ctx, bix, counter): // sub_BDE6C0
if pred_count[bix] == 0: // already removed
return
pred_count[bix] = 0 // mark deleted first
code_obj = ctx[0]
succ_map = *(code_obj + 648) // FNV-1a hash map
if succ_map == NULL or succ_map.total_elements == 0:
goto finalize
bucket = fnv1a(bix) & (succ_map.num_buckets - 1)
node = succ_map.bucket_array[bucket].head
while node != NULL:
if node.key != bix:
node = node.next; continue
// Found this block's successor edge set.
// Walk sub-hash buckets for multi-successor blocks:
for each successor S in node's edge set:
if S == 0xFFFFFFFF: skip // sentinel
if pred_count[S] == 1: // sole predecessor is bix
erase_block(ctx, S, counter) // recursive cascade
break
finalize:
rpo_number[bix] = *counter // reassign RPO slot
rpo_to_bix[*counter] = bix // inverse mapping
(*counter)-- // consume one slot
Key design decisions confirmed from the binary:
- Predecessor count zeroed first, successor map walked but not modified. The erase function clears pred_count[bix] immediately (*result = 0 at 0xBDE6F2), then reads the successor edge map at +648 to find cascade targets. Successor hash map nodes are never unlinked -- the stale entries remain in the bucket chains. The pred_count == 0 sentinel is the authoritative "deleted" marker.
- Recursive cascade for single-predecessor successors. For each successor S, the function checks pred_count[S] == 1 (at 0xBDE7DF: cmp dword ptr [rdx+rax*4], 1). If the sole predecessor of S is the block being erased, S itself is unreachable and is recursively erased. This propagates deletion forward through chains of single-predecessor blocks without requiring a separate worklist.
- RPO arrays updated in deletion order. After the successor walk, the erased block is assigned the current counter value into both the per-block RPO number array (rpo_number[bix] = *counter) and the reverse RPO-to-block-index array (rpo_to_bix[*counter] = bix), then the counter is decremented. These writes go to Code Object +720 (via offset 0x2D0 = 720) and the context's RPO number array at offset +64.
- Rebuild phase follows. After sub_BDFB10 completes the per-block erase loop, it re-walks all blocks, looks up each surviving block's successors in the +648 hash map, and repopulates the backedge map at +680 via sub_BDF480 and successor entries via sub_BDED20. This two-phase approach (erase-then-rebuild) avoids incremental hash map maintenance during the deletion cascade.
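The erase cascade can be condensed into a runnable model, with a dict-of-sets successor map standing in for the FNV-1a hash map and the RPO bookkeeping omitted:

```python
def erase_block(pred_count, succ_map, bix):
    """Mark `bix` deleted by zeroing its predecessor count, then recursively
    erase any successor whose sole predecessor was `bix`. Successor map
    entries are left stale, as in the binary."""
    if pred_count.get(bix, 0) == 0:
        return                            # already removed
    pred_count[bix] = 0                   # deletion sentinel, set first
    for s in succ_map.get(bix, ()):
        if pred_count.get(s, 0) == 1:     # bix was the only predecessor
            erase_block(pred_count, succ_map, s)
```

A chain 1 -> 2 -> 3 of single-predecessor blocks cascades fully, while a block with another surviving predecessor stops the recursion.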
Transformation 3: Conditional Branch Simplification
Two sub-cases:
Constant condition. If copy propagation or constant folding (in the preceding GeneralOptimizeEarly, phase 13) has determined that a predicate register always holds a known value at the branch point, the conditional branch is replaced:
// Before: condition always true // After:
BB: BB:
ISETP.EQ PT, R0, R0 // (deleted -- tautology)
@PT BRA target BRA target // unconditional
BRA fallthrough // (deleted)
Equivalent targets. If both the taken and not-taken paths of a conditional branch go to the same block, the condition test is dead and the branch becomes unconditional:
// Before: both targets identical // After:
BB: BB:
@P0 BRA target BRA target // unconditional
BRA target // (deleted)
Transformation 4: Branch Chain Threading
When a branch targets a block whose only content is another unconditional branch, the pass redirects the original branch directly to the final target:
// Before: // After:
BB_A: BB_A:
@P0 BRA BB_B @P0 BRA BB_C // threaded
BB_B: // BB_B may become unreachable
BRA BB_C BB_C:
BB_C: ...
...
The pass applies threading iteratively, following chains of single-branch blocks until a non-trivial block is reached. A depth limit prevents infinite loops on pathological CFGs with cycles of empty blocks (which should not exist in well-formed IR but are guarded against defensively).
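The chain-following loop with its defensive depth limit can be sketched as:

```python
def thread_branch_target(blocks, target, max_depth=32):
    """Follow chains of blocks whose only content is an unconditional BRA,
    returning the final destination. The depth limit guards against cycles
    of empty blocks in malformed CFGs (the limit value is illustrative)."""
    for _ in range(max_depth):
        insns = blocks.get(target, [])
        if len(insns) == 1 and insns[0][0] == "BRA":
            target = insns[0][1]          # hop to the next link in the chain
        else:
            break                         # reached a non-trivial block
    return target
```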
Fixed-Point Iteration
The execute body (sub_7917F0) drives the four transformations through a two-level loop over the RPO block ordering. The algorithm from the decompiled 529-byte body:
function OriBranchOpt_execute(ctx):
// --- gate checks (knobs 214, 487; CFG validity bit at ctx+1382) ---
// Phase 1: infrastructure setup
rebuild_cfg(ctx) // sub_785E20(ctx, 0)
prepare_blocks(ctx, 1) // sub_781F80(ctx, 1)
scan_branch_patterns(ctx, 0, 0, 0, 0) // sub_7E6090
init_chain_context(ctx, ...) // sub_7E6AD0
// Phase 2: RPO traversal with per-block convergence
changed = false
block_count = *(ctx + 520) // number of blocks
rpo_indices = *(ctx + 512) // int32[] RPO permutation
block_table = *(ctx + 296) // ptr[] block array
for i = 1 to block_count: // skip entry block (RPO index 0)
block = block_table[rpo_indices[i]]
loop: // inner convergence loop
matched = try_match(work, block) // sub_753600: pattern match
if not matched:
break
if not knob_enabled(464): // convergence-loop gate
break
changed = true
apply_rewrite(work) // sub_753B50: CFG rewrite
// loop back: re-examine same block for newly exposed patterns
// Phase 3: post-pass CFG rebuild (only if any rewrite fired)
if changed:
rebuild_cfg(ctx) // sub_785E20(ctx, 0)
Key structural properties:
- Processing order is strict RPO (reverse post-order). The RPO index array at ctx+512 is a permutation of block IDs; the outer loop walks indices 1..block_count, skipping index 0 (the entry block, which by definition has no incoming branches to simplify).
- Per-block re-examination is the inner loop. After each successful match+rewrite on a block, the pass immediately re-tests the same block. This catches cascading opportunities: threading a branch may leave the block ending with a redundant unconditional jump to its layout successor, which the next iteration folds away.
- Knob 464 gates the inner loop. When disabled, each block is tested at most once (no convergence). When enabled (the default), the inner loop runs until sub_753600 returns false, meaning no further pattern matches exist for that block.
- Single RPO pass, not a global fixed point. The outer loop executes exactly once over all blocks. The pass does not restart from the beginning when a change is made. Global convergence across blocks relies on later re-invocations of the phase by the pass manager (or on the nested conditional optimizer in phase 38 cleaning up remaining patterns).
- CFG rebuild (sub_785E20) runs unconditionally before the traversal and conditionally after. The post-pass rebuild is gated on the changed flag (r12 in the binary), which latches to true on any successful match and is never reset.
OptimizeNestedCondBranches -- Nested Conditional Flattening (Phase 38)
Overview
Phase 38 targets nested conditional branches that test related predicates. The pattern commonly arises from C/C++ compound conditions (if (a && b), if (a || b)) and from switch-case fall-through after DoSwitchOpt lowering. The entry point is sub_A0F020 (1978 bytes).
The pass runs after GeneralOptimizeMid (phase 37), which provides fresh constant folding and copy propagation results. It runs before OriDoPredication (phase 63), feeding it simpler CFG patterns that are easier to convert to predicated code.
Pass Structure
sub_A0F020 is a three-stage pipeline wrapped in a global fixed-point loop. The outer while(2) re-executes the entire pipeline whenever Stage 3 makes progress.
Stage 1 -- Liveness-driven operand simplification. Iterates blocks in reverse RPO via the priority array at ctx+792. For each block, calls sub_A06A60 with callback sub_A08250. The callback scans each instruction's operand array and replaces dead-in operands (not in the live-in bitset at ctx+832) with the tombstone value 0xF0000000. A flag at ctx+1368 bit 4 enables a two-pass strategy: the first pass is aggressive (v63=1), and if it makes no changes, the second pass runs conservatively (v63=0). This stage does not combine branches -- it prunes operands to expose merging opportunities.
Stage 2 -- Block-pair merging. Iterates blocks in reverse RPO again, calling sub_A06A60 with callback sub_A07DA0. This callback examines operand liveness within each instruction: for each source operand referencing a virtual register (type tag 1, extracted as (operand >> 28) & 7 == 1), if the register index is not in the range 41--44 (the dedicated predicate registers P0--P3, checked as (operand & 0xFFFFFF) - 41 <= 3) and the operand is dead-in, it marks the live bit via sub_BDBC70. This builds the kill set used by Stage 3.
Stage 3 -- Nested branch combination. This is the core transformation. It scans basic blocks (at ctx+296) looking for a conditional branch (opcode 95 after masking bits 12--13) whose taken target leads to another conditional branch. When it finds such a chain, it walks backward through the intermediate instructions checking each one with sub_A07940 for combinability, then splices the two branches into one using sub_91E070.
Maximum Nesting Depth
Stage 3 walks the instruction chain without an explicit depth counter. Starting from the last instruction of a candidate block (the conditional branch, opcode 95), it follows the linked-list backward (next = *(ptr*)current) examining each instruction until it reaches the block head. Every instruction in the chain must pass the sub_A07940 eligibility check. The walk terminates at the block boundary, so the effective maximum nesting depth per iteration equals the number of eligible instructions in the intermediate block.
In practice this is bounded by the block structure: each nesting level requires its own basic block, and the pass processes one block-pair at a time. After combining two levels, the global while(2) loop re-runs the pipeline, letting the combined block participate in a new round. For a chain of N nested conditions (if (a && b && c && ...)), the pass requires N-1 iterations of the outer loop, each collapsing one level.
The guard sub_7DDB50(ctx) != 1 gates re-iteration: if the function has been reduced to a single block, no further nesting can exist and the outer loop exits.
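The one-level-per-round convergence behavior can be modeled with a toy sketch. This is not the pass's real data structure; it just stands in blocks and predicates with a list of condition names to show why a chain of N nested conditions needs N-1 rounds of the outer loop.

```python
# Toy model of the Stage 3 fixed-point behavior described above: each
# outer iteration collapses exactly one nesting level, so N nested
# conditions converge after N-1 rounds. Purely illustrative.

def flatten_nested(conditions):
    rounds = 0
    while len(conditions) > 1:   # analogous to the "more than one block" guard
        a = conditions.pop(0)
        b = conditions.pop(0)
        conditions.insert(0, f"({a} AND {b})")  # one LOP3 combine per round
        rounds += 1
    return conditions[0], rounds

expr, rounds = flatten_nested(["a", "b", "c", "d"])
# 4 nested conditions collapse in 3 rounds
```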
Instruction Eligibility (sub_A07940)
sub_A07940 (396 bytes) determines whether an instruction in the intermediate chain can be absorbed into the combined branch. It rejects:
- Instructions flagged at sub_7DF3A0 with bit 3 set (convergence barrier) or with a negative byte at offset +28 (predication lock)
- Instructions with a negated-predicate output (bit 5 of the last operand for opcodes 288/183, bit 20 for opcode 16, or a flag in the predicate descriptor at ctx+416 for opcode 85)
- Four explicitly excluded opcodes after masking: 186 (BMOV), 119 (LEA), 211 (PLOP3), 283 (VOTE) -- these have side effects or warp-level semantics that prevent reordering
- Opcode 250 (ISETP) with a nonzero combined-condition field in its last operand
- Opcode 226 (FSETP) with operand count minus modifier adjustment equal to 1 and a non-negative first operand
- Opcode 9 (BRX) with a non-negative second operand (resolved indirect branch)
- Opcode 315 (DSETP) with bit 2 of the last operand clear
- Instructions where sub_7E0030 (has live predicate output) AND sub_7E08E0 (has observable barrier effect) both return true
LOP3 Truth Table Composition
When the intermediate chain passes eligibility, sub_91C840 maps each comparison-type code in the predicate-setting instruction to a truth-table index for LOP3. The mapping from the comparison type field (first word of the setp descriptor) to 4-bit LOP3 sub-table entries:
| Type | Index | Semantics | Type | Index | Semantics |
|---|---|---|---|---|---|
| 1 | 16 (0x10) | GT | 12 | 4 (0x04) | XOR |
| 2 | 12 (0x0C) | LT | 13 | 0 (0x00) | FALSE |
| 4 | 1 (0x01) | EQ | 14 | 8 (0x08) | AND |
| 5 | 3 (0x03) | LE | 15 | 17 (0x11) | XNOR |
| 6 | 11 (0x0B) | GE | 16 | 7 (0x07) | NAND |
| 7,9 | 2 (0x02) | NE | 0x13 | 15 (0x0F) | TRUE |
| 0x14 | 6 (0x06) | ordered | 0x15,0x16 | 5 (0x05) | unordered |
The combined truth table for two predicates is assembled by the caller into an 8-bit LOP3 immediate. For AND-chains (if (a && b)), the per-input truth tables are ANDed: TT_combined = TT_outer & TT_inner, producing 0xC0 for two simple boolean inputs. For OR-chains (if (a || b)), they are ORed: TT_combined = TT_outer | TT_inner, producing 0xFC. Mixed chains compose naturally through repeated application:
AND: if (a && b) --> LOP3 Ptmp, P0, P1, 0xC0 // TT = 0b11000000
OR: if (a || b) --> LOP3 Ptmp, P0, P1, 0xFC // TT = 0b11111100
XOR: if (a ^ b) --> LOP3 Ptmp, P0, P1, 0x3C // TT = 0b00111100
NAND: if (!(a && b)) --> LOP3 Ptmp, P0, P1, 0x3F // TT = 0b00111111
NOR: if (!(a || b)) --> LOP3 Ptmp, P0, P1, 0x03 // TT = 0b00000011
Mixed: if (a && (b || c)) --> two rounds: first OR(b,c), then AND(a, b|c)
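The immediates in the examples above follow mechanically from the standard LOP3 lookup-table convention, in which the three inputs have canonical truth tables A = 0xF0, B = 0xCC, C = 0xAA. Composing two predicates bitwise over these canonical tables reproduces each 8-bit immediate:

```python
# Deriving the composed LOP3 immediates shown above from the canonical
# per-input truth tables (A = 0xF0, B = 0xCC for inputs a and b).

A, B = 0xF0, 0xCC

tt_and  = A & B              # 0xC0: if (a && b)
tt_or   = A | B              # 0xFC: if (a || b)
tt_xor  = A ^ B              # 0x3C: if (a ^ b)
tt_nand = ~(A & B) & 0xFF    # 0x3F: if (!(a && b))
tt_nor  = ~(A | B) & 0xFF    # 0x03: if (!(a || b))
```

Mixed chains compose by treating a previous round's result as a fresh canonical input, which is why `if (a && (b || c))` takes two rounds.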
Safety Constraints
The pass applies these transformations only when:
- No side effects between the nested branches -- the intermediate block must contain only the branch and optionally predicate-setting instructions. The four excluded opcodes (186, 119, 211, 283) block combination because they have warp-level or memory side effects.
- No live-out values from the intermediate block other than the predicate -- checked via the liveness bitset at ctx+832. Stage 1 tombstones dead operands to make this check precise.
- Both branches target the same merge point -- the not-taken path of both the outer and inner branches must reach the same basic block.
- The predicates are independent -- sub_A07940 verifies that the predicate output is not negated and not locked by a convergence barrier. If P1's definition depends on P0's branch outcome, the instructions fail the eligibility walk because the shared live range appears in the merge-point liveness bitmap (checked in sub_7F44D0 via sub_7DEEB0; see the predicate independence check in the GO-19 section above).
- No already-folded setp -- sub_7F44D0 checks whether the predicate-defining instruction is itself a LOP3/PLOP3 with 2 inputs (field at descriptor+4 equals 2). This prevents composing a 4-input expression that exceeds LOP3's 3-input limit.
Relationship to Predication
Phase 38 is a stepping stone toward phase 63 (OriDoPredication). By reducing nested branches to single-level branches, it creates more opportunities for if-conversion:
Phase 38: nested {if(a) { if(b) { ... }}} --> if(a AND b) { ... }
Phase 63: if(a AND b) { x = y; } --> @(a AND b) MOV x, y
Without phase 38, the predication pass would see a multi-level branch diamond that exceeds its nesting-depth threshold, and both branches would remain in the output.
GPU-Specific Considerations
SIMT Divergence and Reconvergence
On NVIDIA GPUs, branch optimization has a direct impact on warp execution efficiency. Every conditional branch is a potential divergence point where threads in a 32-thread warp may take different paths. Divergence serializes execution: the warp must execute both paths, masking inactive threads.
The BSSY (branch sync stack push) / BSYNC (branch sync) mechanism on modern NVIDIA architectures (sm_75+) manages reconvergence:
BSSY B0, reconvergence_point // push reconvergence point onto sync stack
@P0 BRA taken_path // diverge
... not-taken path ...
BSYNC B0 // threads arriving here wait
taken_path:
... taken path ...
BSYNC B0 // all threads reconverge here
reconvergence_point:
... // continue with full warp
Branch optimization directly reduces the number of BSSY/BSYNC pairs needed:
- Branch folding (phase 15) eliminates unconditional branches that do not cause divergence but still consume BSSY/BSYNC bookkeeping
- Nested conditional flattening (phase 38) reduces two nested BSSY/BSYNC regions to one, cutting sync-stack depth by one level
- Jump table lowering (phases 14/30) collapses N divergence points into one BRX instruction
Reconvergence Stack Depth
The hardware branch sync stack has finite depth (varies by architecture, typically 16--32 entries on sm_75+). Deeply nested branches can overflow the stack, causing hardware serialization or requiring the compiler to restructure control flow. Branch optimization reduces sync-stack pressure by flattening nesting.
Uniform Branches
When all threads in a warp evaluate a branch condition identically (uniform branch), no divergence occurs. The optimizer detects uniform branches via the AnalyzeUniformsForSpeculation pass (phase 27) and the OriPropagateVarying passes (phases 53, 70). Uniform branches are cheaper because:
- No BSSY/BSYNC is needed (the warp stays converged)
- On sm_75+, uniform branches can use the UBRA (uniform branch) encoding, which has lower latency
Branch optimization interacts with uniformity analysis: simplifications that eliminate branches also eliminate divergence-point metadata, and conversely, branches proven uniform may not need optimization because their execution cost is already minimal.
Switch Tables and Warp Divergence
A switch with K active cases in a 32-thread warp incurs at most K serialized case executions (one per unique case value across threads). Jump table lowering does not change this thread-level divergence cost, but it does change the instruction-level cost:
| Strategy | Instructions (worst case) | Divergence points | Sync-stack entries |
|---|---|---|---|
| Cascading if-else (N cases) | 2N (ISETP + BRA per case) | N | N |
| BST (N cases) | 2 * ceil(log2(N)) | ceil(log2(N)) | ceil(log2(N)) |
| Jump table | 3 (IADD3 + ISETP + BRX) | 1 | 1 |
The jump table is strongly preferred for GPU execution because it minimizes sync-stack entries to exactly 1, regardless of case count.
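The worst-case counts in the table above can be written as a small cost helper. This is a direct restatement of the table, not ptxas's actual cost model; the function names are invented for illustration.

```python
import math

# Worst-case instruction counts from the strategy table above,
# as a helper for comparing lowering strategies at case count n.

def cascade_cost(n: int) -> int:
    return 2 * n                        # ISETP + BRA per case

def bst_cost(n: int) -> int:
    return 2 * math.ceil(math.log2(n))  # one compare + branch per tree level

def jump_table_cost(n: int) -> int:
    return 3                            # IADD3 + ISETP + BRX, independent of n

# At n = 16: cascade = 32, BST = 8, jump table = 3
```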
Implementation Details
Phase Vtable Structure
All four phases follow the standard 16-byte phase object model. Each vtable has three methods: +0 execute, +8 getPhaseNumber, +16 isNoOp.
| Phase | Factory case | Vtable address | execute body | isNoOp |
|---|---|---|---|---|
| 14 DoSwitchOptFirst | case 14 | off_22BD7F8 | sub_C5F720 (42B) | returns false |
| 15 OriBranchOpt | case 15 | off_22BD820 | sub_C5F950 (34B) | returns false |
| 30 DoSwitchOptSecond | case 30 | off_22BDA78 | sub_C5FC80 (34B) | returns false |
| 38 OptimizeNestedCondBranches | case 38 | off_22BDBB8 | sub_C5FA70 (34B) | returns false |
All four isNoOp methods return false unconditionally -- gating is performed inside the execute body, not via isNoOp. Each execute body calls sub_7DDB50 (156B), which reads the optimization level from compilation_context+2104 and checks knob 499. The guard is opt_level > 1, so these phases execute at -O2 and above. At -O0 and -O1, sub_7DDB50 returns 1 and the execute body returns without action.
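The gating logic can be summarized in a short sketch. This mirrors the behavior described in the text (run only when opt_level > 1, subject to knob 499); the exact interaction between the knob and the opt-level check inside sub_7DDB50 is an assumption of this sketch.

```python
# Minimal sketch of the sub_7DDB50 gate as described above. Returns
# True when the gate blocks execution (mirrors "returns 1"). The
# knob_499 parameter and its polarity are assumptions for illustration.

def phase_gate(opt_level: int, knob_499_enabled: bool = True) -> bool:
    if not knob_499_enabled:
        return True                 # master gate off: phase is skipped
    return opt_level <= 1           # -O0 / -O1: execute body returns without action
```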
Execute Body Details
Phase 14 -- sub_C5F720 (42 bytes). After the sub_7DDB50 gate, dispatches through the SM backend object's vtable: (*(*(ctx+1584) + 136))(*(ctx+1584)). Offset +136 is vtable slot 17 on the SM backend. This is a polymorphic call -- each SM target (sm_50, sm_75, sm_89, sm_100, etc.) provides its own switch optimization implementation. The SM backend object at compilation_context+1584 is documented in data-structures.md.
Phase 15 -- sub_C5F950 (34 bytes). After the gate, calls sub_7917F0 (529B) directly -- no polymorphic dispatch. sub_7917F0 is the branch simplification core:
- Checks context+1382 bit 2 (CFG validity flag) -- returns immediately if clear
- Checks knob 214 via the knob state dispatcher -- if set, skips the pass (OriBranchOpt disable switch)
- Checks knob 487 (general optimization enablement)
- Calls sub_785E20 (266B) to rebuild the CFG
- Calls sub_781F80 (8335B) for block preparation infrastructure
- Calls sub_7E6090 (2614B) to scan branch patterns and sub_7E6AD0 (33B) for chain setup
- Iterates over basic blocks in RPO order (block list at *(ctx+296), RPO indices at *(ctx+512)). For each block, calls sub_753600 (1351B) for the transformation, with a convergence loop gated by knob 464
- After convergence, calls sub_785E20 again to finalize the CFG
Phase 30 -- sub_C5FC80 (34 bytes). After the gate, calls sub_791F00(ctx, 1). The second argument 1 indicates this is the second switch optimization pass. sub_791F00 (587B) performs lazy initialization of a 152-byte SwitchOptContext cached at code_object+1288:
SwitchOptContext (152 bytes, allocated at code_object+1288):
+0 back-pointer to code object
+8 allocator reference (from code_object+16)
+16 case collection array (capacity = block_count + 2)
+56 secondary collection array
+96 code_object reference copy
+104 initialized sentinel (0xFFFFFFFF)
+112 tertiary collection array
After setup, sub_791F00 calls sub_77CF40 (4698B, 987 instructions) -- the main switch optimization algorithm containing pattern matching, strategy selection (jump table vs. BST vs. cascading if-else), and code emission.
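The lazy-initialization pattern described for sub_791F00 can be modeled in a few lines. The class fields and cache key below are invented stand-ins for the 152-byte layout and the slot at code_object+1288; only the allocate-once-then-reuse behavior and the sentinel/capacity values are taken from the text.

```python
# Illustrative model of the SwitchOptContext lazy-init pattern:
# allocate once per code object, cache, and reuse on later calls.

class SwitchOptContext:
    SENTINEL = 0xFFFFFFFF                       # +104 initialized sentinel

    def __init__(self, code_object, block_count):
        self.code_object = code_object          # +0 back-pointer
        self.cases = [None] * (block_count + 2) # +16, capacity = block_count + 2
        self.initialized = self.SENTINEL

_cache = {}   # stands in for the slot at code_object+1288

def get_switch_ctx(code_object_id, block_count):
    if code_object_id not in _cache:
        _cache[code_object_id] = SwitchOptContext(code_object_id, block_count)
    return _cache[code_object_id]
```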
Phase 38 -- sub_C5FA70 (34 bytes). After the gate, calls sub_A0F020 (2375B, 563 instructions) directly. sub_A0F020 implements the nested conditional optimizer as a fixed-point loop. It allocates a 16-byte work context at code_object+1648 (lazy init), then iterates: scan blocks for nested branch patterns, combine predicates with LOP3, remove eliminated blocks, repeat until stable. The function also accesses code object fields +832 (block hash map) and +856 (edge data) for CFG manipulation.
Knob Gating Summary
| Knob | Index | Type | Effect | Checked by |
|---|---|---|---|---|
| ConvertUnsupportedOps | 499 | bool | Master opt-level gate (all 4 phases) | sub_7DDB50 |
| OriBranchOpt disable | 214 | bool | Disables branch simplification | sub_7917F0 (phase 15) |
| General optimization | 487 | bool | Enables/disables optimizer passes | sub_7917F0 (phase 15) |
| Convergence loop | 464 | bool | Gates the fixed-point iteration | sub_7917F0 (phase 15) |
| SwitchOptBudget | -- | int | Limits switch optimization effort budget | sub_77CF40 (phases 14, 30) |
| SwitchOptExtendedPatternMatch | -- | bool | Enables extended pattern recognition | sub_77CF40 (phases 14, 30) |
Interaction with ExpandJmxComputation (Phase 80)
Phase 80 is the delayed lowering phase for jump table index computations created by DoSwitchOpt. The separation exists because:
- Jump table index computation requires knowing the final table address, which is not available until after legalization
- Intervening optimization passes (GVN-CSE, strength reduction) may simplify the index computation before it is expanded
- Register allocation needs to see the index computation as a single pseudo-instruction for accurate pressure estimation
The pseudo-instruction left by DoSwitchOpt is expanded by phase 80 into the final LEA + BRX sequence after all high-level optimizations are complete.
Interaction with OriLinearReplacement (Phase 31)
Phase 31 runs immediately after DoSwitchOptSecond (phase 30). It targets branch-heavy patterns that survived switch optimization and attempts to replace them with branchless (linear) computation sequences using SEL (select) and predicated MOV instructions. This is a complement to predication (phase 63) -- it operates earlier in the pipeline on simpler patterns, while predication handles more complex diamond-shaped control flow later.
Interaction with MergeEquivalentConditionalFlow (Phases 133, 136)
Two late-pipeline passes perform tail merging -- identifying basic blocks with identical instruction sequences that branch to the same targets, and merging them into a single block. This catches redundancy left over after branch optimization, particularly in switch case bodies that perform similar operations on different case values.
Algorithmic Summary
| Pass | Algorithm | Complexity | CFG Changes |
|---|---|---|---|
| DoSwitchOpt (14, 30) | Pattern match + decision tree for strategy selection | O(N log N) per switch | Rewrites blocks, adds jump table pseudo-ops |
| OriBranchOpt (15) | Single RPO pass with per-block inner convergence loop | O(B + E) per iter | Deletes blocks, removes edges, threads branches |
| OptimizeNestedCondBranches (38) | Pattern match on nested branch diamonds | O(B) | Merges blocks, replaces branches with LOP3+BRA |
Where N = number of switch cases, B = number of basic blocks, E = number of CFG edges.
Function Map
All addresses from ptxas v13.0.88. Vtable entries resolved by reading the ELF .rodata section at file offset VA - 0x400000. Confidence: HIGH for vtable functions (direct binary read), HIGH for core algorithms (single-caller chains from vtable execute bodies).
Phase Vtable Functions
| Address | Size | Phase | Vtable slot | Role |
|---|---|---|---|---|
| sub_C5F720 | 42B | 14 | +0 | execute -- dispatches to SM backend vtable[17] |
| sub_C5F4A0 | 6B | 14 | +8 | getPhaseNumber -- returns 14 |
| sub_C5F4B0 | 3B | 14 | +16 | isNoOp -- returns false |
| sub_C5F950 | 34B | 15 | +0 | execute -- calls sub_7917F0 |
| sub_C5F480 | 6B | 15 | +8 | getPhaseNumber -- returns 15 |
| sub_C5F490 | 3B | 15 | +16 | isNoOp -- returns false |
| sub_C5FC80 | 34B | 30 | +0 | execute -- calls sub_791F00(ctx, 1) |
| sub_C5F2A0 | 6B | 30 | +8 | getPhaseNumber -- returns 30 |
| sub_C5F2B0 | 3B | 30 | +16 | isNoOp -- returns false |
| sub_C5FA70 | 34B | 38 | +0 | execute -- calls sub_A0F020 |
| sub_C5F1A0 | 6B | 38 | +8 | getPhaseNumber -- returns 38 |
| sub_C5F1B0 | 3B | 38 | +16 | isNoOp -- returns false |
Core Algorithm Functions
| Address | Size | Callers | Description |
|---|---|---|---|
| sub_77CF40 | 4698B | 1 | DoSwitchOpt core -- pattern match, strategy select, code emit |
| sub_7917F0 | 529B | 2 | OriBranchOpt core -- single RPO pass with per-block inner convergence loop |
| sub_A0F020 | 2375B | 11 | OptimizeNestedCondBranches core -- predicate combining |
| sub_791F00 | 587B | 3 | DoSwitchOpt setup -- SwitchOptContext init, calls sub_77CF40 |
| sub_661210 | 2B | -- | SM backend slot 17 NO-OP (rep ret) -- SM30, SM50 |
| sub_849110 | 613B | -- | SM backend slot 17 handler A -- sync restructuring + branch scan (SM60/70/90+) |
| sub_729160 | 546B | -- | SM backend slot 17 handler B -- branch-distance rewriter (SM80/89) |
| sub_848F40 | 472B | 1 | Sync-pattern restructuring helper (called by sub_849110) |
| sub_A9D9B0 | 256B | 2 | Per-instruction branch-target operand rewriter |
Infrastructure Functions
| Address | Size | Callers | Description |
|---|---|---|---|
| sub_7DDB50 | 156B | 180 | Optimization level gate (knob 499 + opt-level check) |
| sub_781F80 | 8335B | 131 | Block preparation infrastructure (major shared function) |
| sub_785E20 | 266B | 34 | CFG rebuild after transformation |
| sub_7E6090 | 2614B | 80 | Branch pattern scanner |
| sub_7E6AD0 | 33B | 10 | Branch chain setup |
| sub_753600 | 1351B | 1 | Block-level branch transform (phase 15 inner loop) |
| sub_753B50 | 598B | 1 | Block transform continuation |
Factory and Vtable Data
| Symbol | Address | Description |
|---|---|---|
| sub_C60D30 | 0xC60D30 | Phase factory -- 159-case switch, allocates 16-byte phase objects |
| off_22BD5C8 | 0x22BD5C8 | Vtable base -- 40-byte stride, index = phase number |
| off_22BD7F8 | 0x22BD7F8 | Phase 14 vtable (base + 14 * 0x28) |
| off_22BD820 | 0x22BD820 | Phase 15 vtable (base + 15 * 0x28) |
| off_22BDA78 | 0x22BDA78 | Phase 30 vtable (base + 30 * 0x28) |
| off_22BDBB8 | 0x22BDBB8 | Phase 38 vtable (base + 38 * 0x28) |
Cross-References
- Pass Inventory -- complete 159-phase table with phase numbers and categories
- Optimization Pipeline -- pipeline infrastructure, dispatch loop, phase ordering
- Ori IR -- instruction layout, opcode table (OEN = BRA, OFFL = BSSY, OFLAP = BSYNC), CFG hash maps
- GeneralOptimize Bundles -- phases 13, 29, 37 that feed constant/copy information to branch passes
- Liveness Analysis -- phase 16 (DCE cleanup after branch/switch optimization)
- Predication -- phase 63 (if-conversion, consumes simplified CFG from phases 15 and 38)
- Hot/Cold Partitioning -- phases 41, 108, 109 (block placement interacts with branch layout)
- Synchronization & Barriers -- BSSY/BSYNC reconvergence mechanism
- Data Structures -- SM backend object at +1584 (phase 14 polymorphic dispatch target)
- Optimization Levels -- sub_7DDB50 opt-level gate, knob 499 interaction
- Knobs System -- knobs 214, 464, 487, 499 gating branch/switch phases; SwitchOptBudget and SwitchOptExtendedPatternMatch controlling switch lowering strategy