Pass Inventory & Ordering

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- the table is never reordered or shortened. A phase that should do nothing for a given compilation is neutralized per-phase, through its isNoOp() virtual override and the early-return prologue in its execute() body (see O-Level Gating below for how the two interact). This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.

All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases skipped by default). Slots +24 and +32 are NULL.
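The object and vtable layout described above can be modelled roughly as follows. This is a minimal sketch, not recovered source: the type names (Allocator, Ctx, Phase, ConvertUnsupportedOpsModel, BeforeConvUnSupGateModel) are hypothetical, and the slot offsets in the comments assume a common 64-bit C++ ABI where virtual functions occupy vtable slots in declaration order.

```cpp
#include <cassert>

struct Allocator {};                 // hypothetical stand-in for the pool allocator
struct Ctx { int executed = 0; };    // hypothetical compilation context

// Declaration order fixes the vtable slot order on common C++ ABIs:
// execute() first (+0), getIndex() second (+8), isNoOp() third (+16).
// No virtual destructor is declared, since it would add slots of its own.
struct Phase {
    explicit Phase(Allocator* a) : alloc(a) {}
    virtual void execute(Ctx* ctx) = 0;
    virtual int  getIndex() const = 0;
    virtual bool isNoOp() const = 0;
    Allocator* alloc;                // object +8; the vtable pointer occupies +0
};

// Model of an active phase: isNoOp() returns false.
struct ConvertUnsupportedOpsModel final : Phase {
    using Phase::Phase;
    void execute(Ctx* ctx) override { ++ctx->executed; }
    int  getIndex() const override { return 5; }
    bool isNoOp() const override { return false; }
};

// Model of a default gate: isNoOp() returns true until a backend overrides it.
struct BeforeConvUnSupGateModel final : Phase {
    using Phase::Phase;
    void execute(Ctx*) override {}
    int  getIndex() const override { return 4; }
    bool isNoOp() const override { return true; }
};
```

Each model object holds only a vtable pointer and an allocator pointer, which is why it fits in 16 bytes on a 64-bit target; anything per-phase has to live in the vtable, not in the object.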

| Item | Value |
|---|---|
| Total phases | 159 (indices 0--158) |
| Named (static table) | 159 (all have entries in off_22BD0C0) |
| Late-pipeline phases | 20 (indices 139--158, added after the original 0--138 design) |
| Gate passes (AdvancedPhase) | 17 conditional hooks |
| Update passes | 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned) |
| Report passes | 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned) |
| GeneralOptimize instances | 6 compound optimization bundles |
| Liveness/DCE instances | 5 (including EarlyOriSimpleLiveDead) |
| LICM instances | 4 |
| Pipeline infrastructure | Phase Manager, Optimization Pipeline |

Phase Categories

Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.

| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |

Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.

Numbering Discrepancy -- Complete Wiki-to-Binary Mapping

Warning: The wiki phase numbers 0--138 use a compressed scheme that omits 23 binary indices from the contiguous range 0--139. Of these 23, seven are displaced to wiki positions 132--138, and 16 have no wiki number at all. The divergence begins at binary index 8 (UpdateAfterConvertUnsupportedOps, skipped in the wiki) and accumulates to a delta of +23 by wiki phase 116. Phases 140--158 match their binary indices. Every cross-reference on this page and 40+ linked pages uses wiki numbers, NOT binary indices. Use the table below to convert.

Complete Binary-to-Wiki Translation Table

Reading guide: W# = wiki phase number used on this page. Rows marked SKIP have no wiki number (16 phases). Rows marked DISP are displaced to wiki 132--138 (7 phases). Delta = binary index minus wiki number.

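| Bin | Phase Name | W# | D | Bin | Phase Name | W# | D |
|---|---|---|---|---|---|---|---|
| 0 | OriCheckInitialProgram | 0 | 0 | 80 | OriDoRemat | 69 | +11 |
| 1 | ApplyNvOptRecipes | 1 | 0 | 81 | OriPropagateVaryingSecond | 70 | +11 |
| 2 | PromoteFP16 | 2 | 0 | 82 | OptimizeSyncInstructions | 71 | +11 |
| 3 | AnalyzeControlFlow | 3 | 0 | 83 | AdvPhLateExpandSync | 135 | DISP |
| 4 | AdvPhBeforeConvUnSup | 4 | 0 | 84 | LateExpandSyncInstructions | 72 | +12 |
| 5 | ConvertUnsupportedOps | 5 | 0 | 85 | ConvertAllMovPhiToMov | 73 | +12 |
| 6 | SetControlFlowOpLastInBB | 6 | 0 | 86 | ConvertToUniformReg | 74 | +12 |
| 7 | AdvPhAfterConvUnSup | 7 | 0 | 87 | LateArchOptimizeFirst | 75 | +12 |
| 8 | UpdateAfterConvUnSupOps | 132 | DISP | 88 | UpdateAfterOptimize | 76 | +12 |
| 9 | OriCreateMacroInsts | 8 | +1 | 89 | AdvPhLateConvUnSup | 77 | +12 |
| 10 | ReportInitialRepresentation | 9 | +1 | 90 | LateExpUnSupportedOps | 78 | +12 |
| 11 | EarlyOriSimpleLiveDead | 10 | +1 | 91 | LateMergeEquivCondFlow | 136 | DISP |
| 12 | ReplaceUniformsWithImm | 11 | +1 | 92 | OriHoistInvariantsLate2 | 79 | +13 |
| 13 | OriSanitize | 12 | +1 | 93 | LateExpUnSupOpsMid | 137 | DISP |
| 14 | GeneralOptimizeEarly | 13 | +1 | 94 | ExpandJmxComputation | 80 | +14 |
| 15 | MergeEquivCondFlow | 133 | DISP | 95 | LateArchOptimizeSecond | 81 | +14 |
| 16 | DoSwitchOptFirst | 14 | +2 | 96 | AdvPhBackPropVReg | 82 | +14 |
| 17 | OriBranchOpt | 15 | +2 | 97 | OriBackCopyPropagate | 83 | +14 |
| 18 | OriPerformLiveDeadFirst | 16 | +2 | 98 | OriSplitHiPressLR | 138 | DISP |
| 19 | OptimizeBindlessHeaderLoads | 17 | +2 | 99 | OriPerformLiveDeadFourth | 84 | +15 |
| 20 | OriLoopSimplification | 18 | +2 | 100 | OriPropagateGmma | 85 | +15 |
| 21 | OriSplitLiveRanges | 19 | +2 | 101 | InsertPseudoUseDefConvUR | 86 | +15 |
| 22 | OriCopyProp | -- | SKIP | 102 | FixupGmmaSequence | 87 | +15 |
| 23 | PerformPGO | 20 | +3 | 103 | LateEnforceArgRestr | -- | SKIP |
| 24 | OriStrengthReduce | 21 | +3 | 104 | OriHoistInvariantsLate3 | 88 | +16 |
| 25 | OriLoopUnrolling | 22 | +3 | 105 | AdvPhSetRegAttr | 89 | +16 |
| 26 | GenerateMovPhi | 23 | +3 | 106 | OriSetRegisterAttr | 90 | +16 |
| 27 | OriPipelining | 24 | +3 | 107 | OriCalcDependantTex | 91 | +16 |
| 28 | StageAndFence | 25 | +3 | 108 | AdvPhAfterSetRegAttr | 92 | +16 |
| 29 | OriRemoveRedundantBarriers | 26 | +3 | 109 | LateExpUnSupportedOps2 | 93 | +16 |
| 30 | AnalyzeUniformsForSpec | 27 | +3 | 110 | FinalInspectionPass | 94 | +16 |
| 31 | SinkRemat | 28 | +3 | 111 | SetAfterLegalization | 95 | +16 |
| 32 | OptimizeNaNOrZero | -- | SKIP | 112 | ReportBeforeScheduling | 96 | +16 |
| 33 | GeneralOptimize | 29 | +4 | 113 | AdvPhPreSched | 97 | +16 |
| 34 | DoSwitchOptSecond | 30 | +4 | 114 | ScheduleInstructions | -- | SKIP |
| 35 | OriLinearReplacement | 31 | +4 | 115 | UpdateAfterSchedInstr | -- | SKIP |
| 36 | CompactLocalMemory | 32 | +4 | 116 | BackPropagateVEC2D | 98 | +18 |
| 37 | ConvMemToRegOrUniform | -- | SKIP | 117 | OriDoSyncronization | 99 | +18 |
| 38 | OriPerformLiveDeadSecond | 33 | +5 | 118 | UpdateAfterOriDoSync | -- | SKIP |
| 39 | ExtractShaderConstsFirst | 34 | +5 | 119 | ApplyPostSyncWars | 100 | +19 |
| 40 | OriHoistInvariantsEarly | 35 | +5 | 120 | ReportBeforeRegAlloc | -- | SKIP |
| 41 | Vectorization | -- | SKIP | 121 | AdvPhAllocReg | 101 | +20 |
| 42 | EmitPSI | 36 | +6 | 122 | AllocateRegisters | -- | SKIP |
| 43 | GeneralOptimizeMid | 37 | +6 | 123 | ReportAfterRegAlloc | 102 | +21 |
| 44 | OptimizeNestedCondBranches | 38 | +6 | 124 | UpdateAfterOriAllocReg | -- | SKIP |
| 45 | ConvertVTGReadWrite | 39 | +6 | 125 | Get64bRegComponents | 103 | +22 |
| 46 | DoVirtualCTAExpansion | 40 | +6 | 126 | AdvPhPostExpansion | 104 | +22 |
| 47 | MarkAdditionalColdBlocks | 41 | +6 | 127 | PostExpansion | -- | SKIP |
| 48 | ExpandMbarrier | 42 | +6 | 128 | ApplyPostRegAllocWars | 105 | +23 |
| 49 | ForwardProgress | 43 | +6 | 129 | AdvPhPostSched | 106 | +23 |
| 50 | OptimizeUniformAtomic | 44 | +6 | 130 | OriRemoveNopCode | 107 | +23 |
| 51 | MidExpansion | 45 | +6 | 131 | OptimizeHotColdInLoop | 108 | +23 |
| 52 | AdvPhAfterMidExpansion | 134 | DISP | 132 | OptimizeHotColdFlow | 109 | +23 |
| 53 | GeneralOptimizeMid2 | 46 | +7 | 133 | PostSchedule | 110 | +23 |
| 54 | AdvPhEarlyEnforceArgs | 47 | +7 | 134 | AdvPhPostFixUp | 111 | +23 |
| 55 | EnforceArgumentRestrictions | 48 | +7 | 135 | PlaceBlocksInSourceOrder | 112 | +23 |
| 56 | GvnCse | 49 | +7 | 136 | PostFixForMercTargets | 113 | +23 |
| 57 | OriCommoning | -- | SKIP | 137 | FixUpTexDepBarAndSync | 114 | +23 |
| 58 | OriReassociateAndCommon | 50 | +8 | 138 | AdvScoreboardsAndOpexes | 115 | +23 |
| 59 | ExtractShaderConstsFinal | 51 | +8 | 139 | ProcessO0WaitsAndSBs | 116 | +23 |
| 60 | OriReplaceEquivMultiDefMov | 52 | +8 | 140--158 | (19 late-pipeline phases) | 140--158 | 0 |
| 61 | OriPropagateVaryingFirst | 53 | +8 | | | | |
| 62 | OriDoRematEarly | 54 | +8 | | | | |
| 63 | LateExpansion | 55 | +8 | | | | |
| 64 | SpeculativeHoistComInsts | 56 | +8 | | | | |
| 65 | RemoveASTToDefaultValues | 57 | +8 | | | | |
| 66 | GeneralOptimizeLate | 58 | +8 | | | | |
| 67 | OriLoopFusion | 59 | +8 | | | | |
| 68 | DoVTGMultiViewExpansion | 60 | +8 | | | | |
| 69 | OriSimpleLiveDead | -- | SKIP | | | | |
| 70 | OriPerformLiveDeadThird | 61 | +9 | | | | |
| 71 | OriRemoveRedundantMultiDefMov | 62 | +9 | | | | |
| 72 | OriDoPredication | 63 | +9 | | | | |
| 73 | LateVectorization | -- | SKIP | | | | |
| 74 | LateOriCommoning | 64 | +10 | | | | |
| 75 | GeneralOptimizeLate2 | 65 | +10 | | | | |
| 76 | OriHoistInvariantsLate | 66 | +10 | | | | |
| 77 | SinkCodeIntoBlock | -- | SKIP | | | | |
| 78 | DoKillMovement | 67 | +11 | | | | |
| 79 | DoTexMovement | 68 | +11 | | | | |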

Phases 140--158 are identity-mapped (wiki number = binary index). The full list appears in Stage 10 below. Note that binary 139 (ProcessO0WaitsAndSBs) appears at BOTH wiki 116 (in Stage 7) and wiki 139 (in Stage 10).
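The translation rule implied by the table can be written compactly: a binary index in 0--139 maps to itself minus the number of omitted (displaced or skipped) indices below it. The sketch below assumes exactly that rule; the helper name binToWiki and the two constant arrays are hypothetical, populated from the SKIP and DISP rows listed on this page. For binary 139 it returns the Stage 7 number (116) only.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <optional>

// Binary indices displaced to wiki 132..138, in ascending binary order.
constexpr std::array<int, 7> kDisplaced = {8, 15, 52, 83, 91, 93, 98};
// Binary indices that have no wiki number at all.
constexpr std::array<int, 16> kSkipped = {22, 32, 37, 41, 57, 69, 73, 77,
                                          103, 114, 115, 118, 120, 122, 124, 127};

// Hypothetical helper: wiki number for a binary index in 0..158,
// or std::nullopt for SKIP rows and out-of-range input.
std::optional<int> binToWiki(int bin) {
    if (bin < 0 || bin > 158) return std::nullopt;
    if (bin >= 140) return bin;                       // identity-mapped tail
    auto d = std::find(kDisplaced.begin(), kDisplaced.end(), bin);
    if (d != kDisplaced.end())                        // k-th displaced -> 132 + k
        return 132 + static_cast<int>(d - kDisplaced.begin());
    if (std::find(kSkipped.begin(), kSkipped.end(), bin) != kSkipped.end())
        return std::nullopt;
    // Delta = number of omitted indices strictly below this binary index.
    auto below = [bin](int x) { return x < bin; };
    int delta = static_cast<int>(
        std::count_if(kDisplaced.begin(), kDisplaced.end(), below) +
        std::count_if(kSkipped.begin(), kSkipped.end(), below));
    return bin - delta;
}
```

For example, binToWiki(88) counts twelve omitted indices below 88 and yields 76, matching the UpdateAfterOptimize row.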

16 Phases Missing from Wiki Numbering

These binary phases have no wiki number. All are valid DUMPIR and DisablePhases targets.

| Bin | Name | Cat | Pipeline Position |
|---|---|---|---|
| 22 | OriCopyProp | Opt | |
| 32 | OptimizeNaNOrZero | Opt | |
| 37 | ConvertMemoryToRegisterOrUniform | Opt | |
| 41 | Vectorization | Opt | |
| 57 | OriCommoning | Opt | |
| 69 | OriSimpleLiveDead | Opt | |
| 73 | LateVectorization | Opt | |
| 77 | SinkCodeIntoBlock | Opt | |
| 103 | LateEnforceArgumentRestrictions | Lower | |
| 114 | ScheduleInstructions | Sched | |
| 115 | UpdateAfterScheduleInstructions | Clean | |
| 118 | UpdateAfterOriDoSyncronization | Clean | |
| 120 | ReportBeforeRegisterAllocation | Report | |
| 122 | AllocateRegisters | RegAlloc | |
| 124 | UpdateAfterOriAllocateRegisters | Clean | |
| 127 | PostExpansion | Lower | |

7 Displaced Phases (Wiki 132--138)

These phases exist in the binary at early/mid positions but were assigned wiki numbers 132--138 when discovered after the initial compressed numbering was established. Their true execution order follows their binary index, not their wiki number.

| Wiki # | True Binary Index | Name | Executes Between |
|---|---|---|---|
| 132 | 8 | UpdateAfterConvertUnsupportedOps | AdvPhAfterConvUnSup [7] and OriCreateMacroInsts [9] |
| 133 | 15 | MergeEquivalentConditionalFlow | GeneralOptimizeEarly [14] and DoSwitchOptFirst [16] |
| 134 | 52 | AdvancedPhaseAfterMidExpansion | MidExpansion [51] and GeneralOptimizeMid2 [53] |
| 135 | 83 | AdvancedPhaseLateExpandSyncInstructions | OptimizeSyncInstructions [82] and LateExpandSyncInstructions [84] |
| 136 | 91 | LateMergeEquivalentConditionalFlow | LateExpansionUnsupportedOps [90] and OriHoistInvariantsLate2 [92] |
| 137 | 93 | LateExpansionUnsupportedOpsMid | OriHoistInvariantsLate2 [92] and ExpandJmxComputation [94] |
| 138 | 98 | OriSplitHighPressureLiveRanges | OriBackCopyPropagate [97] and OriPerformLiveDeadFourth [99] |

Gate Passes (AdvancedPhase)

Seventeen phase instances (16 unique gates, plus AdvancedPhaseOriPhaseEncoding appearing at both wiki 127 and 152) are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.

Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.

The naming convention is consistent: AdvancedPhase prefix followed by the pipeline position or action name. One exception is AdvancedScoreboardsAndOpexes (phase 115), which uses Advanced without Phase.

Gate Pass Worker Correspondence

All 17 gate passes fall into three categories when activated by a backend override: (A) dispatch to a named worker phase from the static name table, (B) dispatch through an SM backend vtable slot at ctx+0x630, or (C) execute a pipeline progress counter thunk that writes ctx+1552 = N. Category A gates have a named worker visible to DUMPIR. Category B dispatches through architecture-specific code in the SM backend object. Category C gates' only effect is advancing the pipeline progress counter, which downstream passes read via *(ctx+1552) > N guards.

| Gate (Wiki #) | Bin | Cat | Execute Fn | Worker / Dispatch Target | Evidence |
|---|---|---|---|---|---|
| AdvPhBeforeConvUnSup (4) | 4 | C | sub_C5F620 (7B) | ctx+1552 = 1; marks pre-legalization | P0_03 thunk table; early pipeline boundary |
| AdvPhAfterConvUnSup (7) | 7 | C | sub_C5F5A0 (7B) | ctx+1552 = 2; marks post-ConvUnSup | P0_03 thunk table; sub_752CF0 checks <= 3 |
| AdvPhEarlyEnforceArgs (47) | 54 | A | vtable dispatch | EnforceArgumentRestrictions [48] | P5_02 correspondence table; W020 "Before EnforceArgumentRestrictions" |
| AdvPhAfterMidExpansion (134) | 52 | C | sub_C5EF80 (7B) | ctx+1552 = 3; marks mid-expansion done | P0_03 thunk table; sub_752CF0 checks <= 3 |
| AdvPhLateExpandSync (135) | 83 | B | 0xC5F110 (6B) | jmp *(*(ctx+0x630))+0x168; SM backend vtable slot 360 | W029 disasm; brackets LateExpandSyncInstructions [84] |
| AdvPhLateConvUnSup (77) | 89 | B | 0xC5EA50 (13B) | jmp *(*(ctx+0x630))+0x178; SM backend vtable slot 376 | W033 disasm lines 108--111; drives LateExpUnSupportedOps [90] |
| AdvPhBackPropVReg (82) | 96 | B | off_22BE298 | Arch-override vtable dispatch; next phase [83] writes ctx+1552 = 9 | P1_08 vtable layout; isNoOp returns 0 (runtime-overridden to 1) |
| AdvPhSetRegAttr (89) | 105 | B | vtable dispatch | ctx+0x630 SM backend vtable; precedes OriSetRegisterAttr [90] | W020 line 407 "Before OriSetRegisterAttr" |
| AdvPhAfterSetRegAttr (92) | 108 | B | 0xC607A0 (51B) | *(*(ctx+0x630))+0x110; guarded by nullsub_170@0x7D6C80 | W029 disasm line 53; returns NOP when default impl |
| AdvPhPreSched (97) | 113 | A | vtable dispatch | ScheduleInstructions [114]; sub_8D0640 (22 KB) | P5_02 table; string "ScheduleInstructions" |
| AdvPhAllocReg (101) | 121 | A | vtable dispatch | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvPhPostExpansion (104) | 126 | A | vtable dispatch | PostExpansion [127]; post-RA expansion dispatch | P5_02 table |
| AdvPhPostSched (106) | 129 | C | sub_C5E830 (7B) | ctx+1552 = 14; marks post-scheduling | P0_03 thunk table; adjacent to PostSchedule [110] |
| AdvPhPostFixUp (111) | 134 | A | vtable dispatch | PostFixUp [140]; ctx+0x630 vtable+0x148 | P2_14 line 85; target-specific post-fixup |
| AdvScoreboardsAndOpexes (115) | 138 | B | vtable dispatch | sub_A36360 (52 KB) + sub_A23CF0 (54 KB); O1+ only | Control word gen + DAG scheduler; -O0 uses phase 139 instead |
| AdvPhOriPhaseEncoding (127) | 152 | C | sub_C5E0B0 (7B) | ctx+1552 = 21; marks encoding boundary | P2_15 disasm; sub_8C0270 checks == 19 |

(total as listed in the rows above: 5 type A, 6 type B, 5 type C = 16 gates)

Type A gates (5) dispatch to a named worker phase in the static name table -- valid DUMPIR/NamedPhases/DisablePhases targets. AdvPhEarlyEnforceArgs was reclassified from C to A based on P5_02 evidence: it dispatches to EnforceArgumentRestrictions [48], with LateEnforceArgumentRestrictions [103] as its late counterpart. Type B gates (6) dispatch through an SM backend vtable slot at ctx+0x630; the worker code lives in the per-SM backend object. Specific vtable offsets: +0x168 (late sync expansion), +0x178 (late unsupported ops), +0x110 (post-reg-attr, guarded by default-impl check against nullsub_170@0x7D6C80). Type C gates (5) write ctx+1552 (pipeline_progress) to values 1--21, forming a monotonically increasing timeline that 20+ downstream guards check. AdvPhPostSched was reclassified from B to C based on P0_03 evidence: sub_C5E830 is a 7-byte thunk writing ctx+1552 = 14, identical in structure to the other progress thunks.
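The Type C mechanism is simple enough to model directly: each thunk stores a constant into the progress field, and later code compares that field against a threshold. The sketch below is illustrative only; the Ctx struct, the function names, and pastStage are hypothetical, and only the stored values (1, 2, 3, 14, 21) come from the table above.

```cpp
#include <cassert>

// Hypothetical context: only the field the document places at ctx+1552 is modelled.
struct Ctx { int pipeline_progress = 0; };

// Type C gate bodies: each one only advances the progress counter.
void advPhBeforeConvUnSup(Ctx& c)   { c.pipeline_progress = 1;  }
void advPhAfterConvUnSup(Ctx& c)    { c.pipeline_progress = 2;  }
void advPhAfterMidExpansion(Ctx& c) { c.pipeline_progress = 3;  }
void advPhPostSched(Ctx& c)         { c.pipeline_progress = 14; }
void advPhOriPhaseEncoding(Ctx& c)  { c.pipeline_progress = 21; }

// Downstream guard of the form *(ctx+1552) > N.
bool pastStage(const Ctx& c, int n) { return c.pipeline_progress > n; }
```

Because the values only ever increase along the fixed phase order, a guard such as pastStage(ctx, 3) answers "has mid-expansion already finished?" without the guard needing to know which phase is currently running.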

See Optimization Levels for per-gate activation rules.

Update Passes

Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):

| Phase | Bin | Name | Refreshes |
|---|---|---|---|
| 76 | 88 | UpdateAfterOptimize | |
| 125 | 150 | UpdateAfterPostRegAlloc | |
| 128 | 154 | UpdateAfterFormatCodeList | |
| 132 | 8 | UpdateAfterConvertUnsupportedOps | |
| 150 | 150 | UpdateAfterPostRegAlloc | |
| 154 | 154 | UpdateAfterFormatCodeList | |
| (true 115) | 115 | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| (true 118) | 118 | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| (true 124) | 124 | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |

These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three *(true N)* entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.

Report Passes

Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):

| Phase | Bin | Name | Output |
|---|---|---|---|
| 9 | 10 | ReportInitialRepresentation | |
| 96 | 112 | ReportBeforeScheduling | |
| 102 | 123 | ReportAfterRegisterAllocation | |
| (true 120) | 120 | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | 151 | ReportFinalMemoryUsage | |
| 129 | 155 | DumpNVuCodeText | |
| 130 | 156 | DumpNVuCodeHex | |
| 151 | 151 | ReportFinalMemoryUsage | |
| 155 | 155 | DumpNVuCodeText | |
| 156 | 156 | DumpNVuCodeHex | |

Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).

GeneralOptimize Bundles

The GeneralOptimize* passes are compound optimization bundles that run multiple small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until no further changes occur. They appear at 6 positions throughout the pipeline to re-clean the IR after major transformations:

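| Phase | Bin | Name | Position |
|---|---|---|---|
| 13 | 14 | GeneralOptimizeEarly | |
| 29 | 33 | GeneralOptimize | |
| 37 | 43 | GeneralOptimizeMid | |
| 46 | 53 | GeneralOptimizeMid2 | |
| 58 | 66 | GeneralOptimizeLate | |
| 65 | 75 | GeneralOptimizeLate2 | |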

See GeneralOptimize Bundles for the sub-pass decomposition.
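The fixed-point structure of a bundle can be sketched as a loop that re-runs every sub-pass until one full sweep reports no change. Everything in this example is a toy stand-in: the IR is a vector of integers, and halveEvens / dropOnes are hypothetical sub-passes chosen only because each can expose work for the other, the way copy propagation and DCE do in the real bundles.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using IR = std::vector<int>;                 // toy IR: just a list of integers
using SubPass = std::function<bool(IR&)>;    // returns true if it changed the IR

// Runs every sub-pass in order and repeats the whole bundle until one full
// sweep makes no change. Returns the number of sweeps performed.
int runBundle(IR& ir, const std::vector<SubPass>& passes) {
    int sweeps = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& pass : passes) changed = pass(ir) || changed;
        ++sweeps;
    }
    return sweeps;
}

// Toy "simplification": halve every non-zero even value once per call.
bool halveEvens(IR& ir) {
    bool changed = false;
    for (int& x : ir)
        if (x != 0 && x % 2 == 0) { x /= 2; changed = true; }
    return changed;
}

// Toy "dead code elimination": remove every element equal to 1.
bool dropOnes(IR& ir) {
    const auto before = ir.size();
    ir.erase(std::remove(ir.begin(), ir.end(), 1), ir.end());
    return ir.size() != before;
}
```

Starting from {8, 3, 1}, the bundle needs four sweeps: three that make progress and a final one that confirms nothing changed, leaving {3}. The last confirming sweep is the inherent cost of a fixed-point formulation.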


O-Level Gating

Mechanism

The 159-phase pipeline does not carry any opt-level metadata on the phase objects themselves. Three binary facts establish this:

  1. Uniform phase construction. sub_C60D30 (the 159-case phase factory at 0xC60D30, 1132 lines) allocates every phase object via the same 5-line body: request 16 bytes from the pool, store a per-case vtable pointer at offset +0, store the allocator pointer at +8, return. There is no *(char*)(obj+N) = LEVEL write anywhere in the switch; every case is byte-identical except for the vtable symbol (cases 0--158 at sub_C60D30_0xc60d30.c:172--1125, tail default at 1126--1129). The 16-byte phase object therefore has room for exactly {vtable, allocator} and no inline "minimum opt level" field.

  2. The dispatch loop does not consult opt-level. sub_C64F70 (the phase iterator at 0xC64F70, 276 lines) calls each phase's isNoOp() virtual (vtable slot +0x10) twice per iteration, and then unconditionally calls the phase's execute() virtual (vtable slot +0x00) via LABEL_4:

    // sub_C64F70:86 (first isNoOp check)
    if ( (*(unsigned __int8 (**)(__int64))(*(_QWORD *)v6 + 16LL))(v6) )
      goto LABEL_4;             // skips "Before <phase>" diagnostic print
    /* ... allocate & emit "Before <name>" diagnostic string ... */
    LABEL_4:
      (**(void (***)(__int64, __int64))v6)(v6, *a1);   // execute(ctx)
    // sub_C64F70:162 (second isNoOp check)
    if ( !(*(unsigned __int8 (**)(__int64))(*(_QWORD *)v6 + 16LL))(v6) )
      /* ... allocate & emit "After <name>" diagnostic string ... */
    

    The goto LABEL_4 branch bypasses only the diagnostic-string formatting block (lines 88--159); control still falls through to execute() at line 161. isNoOp() is therefore a diagnostic-suppression flag, not an execution gate. The pre-execute call at line 86 hides the "Before" string; the post-execute call at line 162 hides the "After" string; the execute() body itself runs every iteration regardless. See Phase Manager -- Phase Dispatch Loop for the full annotated dispatch pseudocode and the isNoOp timing discussion.

  3. The gate lives inside each execute body. Phases that honour the -O level do so via an early-return prologue in their execute() thunk. The canonical pattern (instantiated ~82 times in the 0xC5F7xx--0xC60Bxx range) is:

    // sub_C60140 (representative execute thunk, 8 bytes + prologue)
    void PhaseN::execute(ocg_ctx* ctx) {
      if ( (int)sub_7DDB50(ctx) > 1 )    // opt_level > 1 (i.e. O2+)
        sub_XXXXXX(ctx);                  // tail-call real implementation
      // else: fall through -- phase was a no-op at this O-level
    }
    

    sub_7DDB50 (the opt-level accessor at 0x7DDB50, 232 bytes) reads the cached 32-bit opt_level field from ocg_ctx + 2104 (i.e. ctx + 0x838) and returns it only when the knob-499 check passes; on the fallback path it returns 1, capping effective behaviour at O1. The knob-499 kill-switch and the iteration-budget counter at kv->state[35940] are documented in Optimization Levels -- Gate Accessor.

Important corollary. Because execute() is always invoked, every phase's timing record and pre-snapshot (written at sub_C64F70:72--85, before the first isNoOp() call) are also recorded. --ftime output therefore contains a row for all 159 phases in every compilation, including phases that immediately early-returned because the opt-level guard failed. Gated-off phases show near-zero elapsed time rather than being omitted.

Pseudocode for the full gate mechanism

// OCG context fields referenced by the gate
struct ocg_ctx {
    // ...
    void*      options_mgr;       // +0x680 (1664)  -- knob query vtable
    int32_t    opt_level_cached;  // +0x838 (2104)  -- parsed -O level, 0..5
    // ...
};

// sub_7DDB50 (0x7DDB50) -- opt-level accessor, called by each phase execute
int getOptLevel(ocg_ctx* ctx) {
    OptionsMgr* om  = ctx->options_mgr;           // [ctx+1664]
    auto        set = om->vtable->setOption;      // [vtbl+152]
    if (set == sub_67EB60) {
        auto isSet = om->vtable->isOptionSet;     // [vtbl+72]
        bool knob_499 = (isSet == sub_6614A0)
            ? (om->state[35928] != 0)             // direct bss read
            : isSet(om, 499);                     // virtual query
        if (!knob_499)
            return ctx->opt_level_cached;         // honour -O level
        int used = om->state[35940];              // iteration counter
        if (om->state[35936] > used) {            // budget not exhausted?
            om->state[35940] = used + 1;
            return ctx->opt_level_cached;         // honour -O level
        }
    } else if (set(om, 499, 1)) {
        return ctx->opt_level_cached;             // honour -O level
    }
    return 1;                                     // fallback: clamp to O1
}

// Per-phase execute prologue (replicated in ~82 wrapper thunks)
void Phase_execute(phase* self, ocg_ctx* ctx) {
    if ((int)getOptLevel(ctx) > 1) {              // the gate: O2+ only
        do_the_actual_work(ctx);                  // tail-call real pass
    }
    // else: phase is a runtime no-op for this compilation
}

// sub_C64F70 dispatch loop (pseudocode; isNoOp() only gates diagnostics)
void PhaseManager::dispatch(int* idx, int n) {
    for (int i = 0; i < n; i++) {
        phase* p = phases[idx[i]];
        append_timing_record(p);                  // unconditional
        take_pre_snapshot();                      // unconditional
        if (!p->isNoOp()) print("Before " + p->name);
        p->execute(ctx);                          // ALWAYS called
        if (!p->isNoOp()) print("After "  + p->name);
    }
}
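A runnable toy model of the same mechanism makes the corollary easy to check: at -O0 a gated phase still gets a timing record and still has execute() called, but its real body never runs. The types and field names below (Ctx, PhaseModel, timing, log, work) are hypothetical, and opt_level stands in for whatever sub_7DDB50 would return.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Ctx {
    int opt_level = 3;                   // stand-in for the sub_7DDB50 result
    std::vector<std::string> timing;     // one record per dispatched phase
    std::vector<std::string> log;        // "Before"/"After" diagnostics
    std::vector<std::string> work;       // phases whose real body actually ran
};

struct PhaseModel {
    std::string name;
    bool noop;     // isNoOp(): suppresses diagnostics only
    bool gated;    // true: wrapper checks opt_level > 1 before doing work

    void execute(Ctx& ctx) const {
        if (gated && !(ctx.opt_level > 1)) return;   // early-return prologue
        ctx.work.push_back(name);                    // "real implementation"
    }
};

void dispatch(Ctx& ctx, const std::vector<PhaseModel>& phases) {
    for (const PhaseModel& p : phases) {
        ctx.timing.push_back(p.name);                       // unconditional
        if (!p.noop) ctx.log.push_back("Before " + p.name);
        p.execute(ctx);                                     // always called
        if (!p.noop) ctx.log.push_back("After " + p.name);
    }
}
```

Running the same phase list with opt_level set to 0 and then to 3 produces identical timing vectors and different work vectors, which is the behaviour the corollary describes.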

The matrix is regular (two-bucket structure at the wrapper layer)

Scanning all phase wrappers in 0xC5F7xx--0xC60Bxx (the per-phase execute thunks) for calls to sub_7DDB50:

| Gate predicate | Wrappers | Meaning |
|---|---|---|
| (none -- wrapper unconditionally calls implementation) | ~50 | Phase runs at every -O level |
| (int)sub_7DDB50(ctx) > 1 | ~78 | Phase runs at O2, O3, O4, O5 |
| (unsigned int)sub_7DDB50(ctx) == 1 && knob_235 (or similar guarded O1 path) | 3--4 | Phase runs at O1 only when an auxiliary knob is set |
| > 1 \|\| (ctx+1424 == 199 && == 1) | 1 | Phase 58 GeneralOptimizeLate -- O2+ or O1 with option-31 extended value 199 (see General Optimize) |

Zero layer-1 wrappers use the thresholds > 0 (would mean "O1+"), > 2 ("O3+"), > 3 ("O4+"), or > 4 ("O5 only"). Fine-grained opt-level branching (e.g. opt_level <= 2 in sub_78DB70, <= 3 in sub_914B40, > 2 in sub_8FB5D0 / sub_9FC860 / sub_9F8C00, > 3 in sub_137EE50) happens inside the implementation bodies, after the wrapper has already let control through. Those internal decisions toggle sub-algorithms (e.g. forward vs. reverse scheduler pass, loop-peeling depth, remat strategy) rather than enabling or disabling the phase as a whole.

The phase-to-O-level activity matrix is therefore regular: the layer-1 wrapper either runs the phase at every level, or gates it at exactly one threshold (opt_level > 1). Per-phase irregularity exists only at layer 2 -- inside the implementations that the wrappers call. This collapses the "159 phases × 6 opt-levels" table to a two-column classification at the outer dispatch layer:

+-------------------------------+-------------------------------------+
|  Category A: always-run       |  Category B: O2+ only               |
|  (no sub_7DDB50 in wrapper)   |  (wrapper: sub_7DDB50(ctx) > 1)     |
+-------------------------------+-------------------------------------+
|  * All reporting/dump phases  |  * GVN-CSE, LICM, rematerialization |
|  * All validation phases      |  * Loop unrolling, software pipe    |
|  * All legalization phases    |  * Predication / if-conversion      |
|  * All Mercury/SASS encoding  |  * Switch/branch optimization       |
|  * All register-allocation    |  * Sync-instruction optimization    |
|  * All AdvancedPhase gates    |  * Barrier removal                  |
|  * All pseudo/expansion phases|  * Backward copy propagation        |
|  * Initial setup & cleanup    |  * Speculative hoisting, peephole   |
+-------------------------------+-------------------------------------+
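Under the two-bucket reading, one row of the phase-by-level matrix is fully determined by which bucket the wrapper falls in. The sketch below assumes only those two buckets (it deliberately ignores the handful of knob-guarded O1 wrappers); the enum and function names are hypothetical.

```cpp
#include <array>
#include <cassert>

enum class Gate { AlwaysRun, O2Plus };   // the two layer-1 buckets

// True if the wrapper lets control reach the implementation at this level.
bool wrapperRuns(Gate g, int optLevel) {
    return g == Gate::AlwaysRun || optLevel > 1;
}

// Expands one bucket into its row of the matrix, indexed by level O0..O5.
std::array<bool, 6> matrixRow(Gate g) {
    std::array<bool, 6> row{};
    for (int lvl = 0; lvl < 6; ++lvl) row[lvl] = wrapperRuns(g, lvl);
    return row;
}
```

An O2Plus row is false for O0 and O1 and true for O2 through O5, so any two levels at or above O2 are indistinguishable at this layer.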

Concrete -O0 vs -O3 phase lists

Resolving the gate with opt_level = 0 (i.e. sub_7DDB50 returns 0) against the 159-phase pipeline and the Category-B wrappers identified above:

At -O0, the following phases early-return (runtime no-ops): Phase 14 DoSwitchOptFirst (gate sub_C5F720), 15 OriBranchOpt (sub_C5F950), 22 OriLoopUnrolling, 24 OriPipelining, 26 OriRemoveRedundantBarriers, 28 SinkRemat, 30 DoSwitchOptSecond (sub_C5FC80), 38 OptimizeNestedCondBranches (sub_C5FA70), 49 GvnCse, 54 OriDoRematEarly, 58 GeneralOptimizeLate (sub_C603E0, unless option-31 override), 63 OriDoPredication, 69 OriDoRemat, 71 OptimizeSyncInstructions, 72 LateExpandSyncInstructions, 95 SetAfterLegalization, 99 OriDoSyncronization, 100 ApplyPostSyncronizationWars, 110 PostSchedule, 115 AdvancedScoreboardsAndOpexes, and ~60 other Category-B phases. At -O0 the scheduling subsystem does still run phase 116 ProcessO0WaitsAndSBs, which performs the conservative-scoreboard insertion that makes O0 code actually executable -- phase 116 is itself a Category-A wrapper that dispatches to sub_C5E2A0 only when the target architecture has sm_version > 0x3FFF.

At -O3 (the default), every Category-A wrapper runs, and every Category-B wrapper also runs because sub_7DDB50 returns 3 which satisfies > 1. The difference between -O2 and -O3 at the wrapper level is therefore zero phases -- both levels activate the same 159 wrappers. The -O2/-O3 distinction happens entirely inside the implementation bodies (e.g. scheduling direction in sub_8D0640, which branches on opt_level > 2). The same is true for -O3 vs. -O4 vs. -O5: identical layer-1 wrapper activation, different internal algorithm selection. Only the -O0 and -O1 thresholds produce layer-1 visible skips.

This two-tier design explains why the wiki's "O-Level" column in the 159-phase table below is sparse: most phases have no entry because they always run (Category A) or because the visible O-level branching is buried inside a layer-2 implementation and does not show up at the phase wrapper at all.

See Optimization Levels for the confirmed per-phase threshold list, the detailed sub_7DDB50 accessor breakdown, knob 499 kill-switch semantics, the NvOpt recipe system, and the scheduler/RA-specific opt-level interactions.


Complete 159-Phase Table

Stage 1 -- Initial Setup (Phases 0--13)

Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 0 | 0 | OriCheckInitialProgram | Validation | | Validates structural correctness of the initial Ori IR after PTX lowering | |
| 1 | 1 | ApplyNvOptRecipes | Optimization | | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager) | |
| 2 | 2 | PromoteFP16 | Lowering | | Promotes FP16 operations to FP32 where hardware lacks native support | |
| 3 | 3 | AnalyzeControlFlow | Analysis | | Builds the CFG: identifies loops, dominators, back edges | |
| 4 | 4 | AdvancedPhaseBeforeConvUnSup | Gate | | Type C: sub_C5F620 writes ctx+1552 = 1; pre-legalization boundary | |
| 5 | 5 | ConvertUnsupportedOps | Lowering | | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization |
| 6 | 6 | SetControlFlowOpLastInBB | Cleanup | | Ensures control flow instructions are the final instruction in each basic block | |
| 7 | 7 | AdvancedPhaseAfterConvUnSup | Gate | | Type C: sub_C5F5A0 writes ctx+1552 = 2; post-ConvUnSup boundary | |
| 8 | 9 | OriCreateMacroInsts | Lowering | | Expands PTX-level macro instructions into Ori instruction sequences | |
| 9 | 10 | ReportInitialRepresentation | Reporting | | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | |
| 10 | 11 | EarlyOriSimpleLiveDead | Optimization | | Quick early dead code elimination pass | Liveness |
| 11 | 12 | ReplaceUniformsWithImm | Optimization | | Replaces uniform register reads with immediate constants where value is known | Uniform Regs |
| 12 | 13 | OriSanitize | Validation | | Validates IR consistency after initial setup transformations | |
| 13 | 14 | GeneralOptimizeEarly | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |

Stage 2 -- Early Optimization (Phases 14--32)

Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 14 | 16 | DoSwitchOptFirst | Optimization | > 1 | Optimizes switch statements: jump table generation, case clustering (1st pass); wrapper sub_C5F720 | Branch & Switch |
| 15 | 17 | OriBranchOpt | Optimization | > 1 | Branch folding, unreachable block elimination, conditional branch simplification; wrapper sub_C5F950 | Branch & Switch |
| 16 | 18 | OriPerformLiveDeadFirst | Analysis | | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness |
| 17 | 19 | OptimizeBindlessHeaderLoads | Optimization | | Hoists and deduplicates bindless texture header loads | |
| 18 | 20 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | 21 | OriSplitLiveRanges | Optimization | | Splits live ranges at loop boundaries to reduce register pressure | Liveness |
| 20 | 23 | PerformPGO | Optimization | | Applies profile-guided optimization data (block weights, branch probabilities) | |
| 21 | 24 | OriStrengthReduce | Optimization | | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction |
| 22 | 25 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | 26 | GenerateMovPhi | Lowering | | Inserts SSA phi nodes as MOV.PHI pseudo-instructions | |
| 24 | 27 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | 28 | StageAndFence | Lowering | | Inserts memory fence and staging instructions for coherence | Sync & Barriers |
| 26 | 29 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | 30 | AnalyzeUniformsForSpeculation | Analysis | | Analyzes constant bank accesses for speculation safety across control flow | Uniform Regs |
| 28 | 31 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | 33 | GeneralOptimize | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize |
| 30 | 34 | DoSwitchOptSecond | Optimization | > 1 | Second switch optimization pass after loop/branch transformations; wrapper sub_C5FC80 | Branch & Switch |
| 31 | 35 | OriLinearReplacement | Optimization | | Replaces branch-heavy patterns with linear (branchless) sequences | |
| 32 | 36 | CompactLocalMemory | Optimization | | Compacts local memory allocations by eliminating dead slots and reordering | |

Stage 3 -- Mid-Level Optimization (Phases 33--52)

GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 33 | 38 | OriPerformLiveDeadSecond | Analysis | | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness |
| 34 | 39 | ExtractShaderConstsFirst | Optimization | | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass) | |
| 35 | 40 | OriHoistInvariantsEarly | Optimization | | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes |
| 36 | 42 | EmitPSI | Lowering | | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | |
| 37 | 43 | GeneralOptimizeMid | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize |
| 38 | 44 | OptimizeNestedCondBranches | Optimization | > 1 | Simplifies nested conditional branches into flatter control flow; wrapper sub_C5FA70 | Branch & Switch |
| 39 | 45 | ConvertVTGReadWrite | Lowering | | Converts vertex/tessellation/geometry shader read/write operations | |
| 40 | 46 | DoVirtualCTAExpansion | Lowering | | Expands virtual CTA operations into physical CTA primitives | |
| 41 | 47 | MarkAdditionalColdBlocks | Analysis | | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold |
| 42 | 48 | ExpandMbarrier | Lowering | | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers |
| 43 | 49 | ForwardProgress | Lowering | | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | |
| 44 | 50 | OptimizeUniformAtomic | Optimization | | Converts thread-uniform atomic operations into warp-level reductions | |
| 45 | 51 | MidExpansion | Lowering | | Target-dependent mid-level expansion of operations before register allocation | Late Legalization |
| 46 | 53 | GeneralOptimizeMid2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize |
| 47 | 54 | AdvancedPhaseEarlyEnforceArgs | Gate | | Type A: dispatches to EnforceArgumentRestrictions [48]; late counterpart LateEnforceArgumentRestrictions [103] | |
| 48 | 55 | EnforceArgumentRestrictions | Lowering | | Enforces ABI restrictions on function arguments (register classes, alignment) | |
| 49 | 56 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | 58 | OriReassociateAndCommon | Optimization | | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE |
| 51 | 59 | ExtractShaderConstsFinal | Optimization | | Final shader constant extraction pass (after GVN may expose new constants) | |
| 52 | 60 | OriReplaceEquivMultiDefMov | Optimization | | Eliminates redundant multi-definition move instructions with equivalent sources | |

Stage 4 -- Late Optimization (Phases 53--77)

Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 53 | 61 | OriPropagateVaryingFirst | Optimization | | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | |
| 54 | 62 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | 63 | LateExpansion | Lowering | | Expands operations that must be lowered after high-level optimizations | Late Legalization |
| 56 | 64 | SpeculativeHoistComInsts | Optimization | | Speculatively hoists common instructions above branches | |
| 57 | 65 | RemoveASTToDefaultValues | Cleanup | | Removes AST (address space type) annotations that have been lowered to defaults | |
| 58 | 66 | GeneralOptimizeLate | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize |
| 59 | 67 | OriLoopFusion | Optimization | | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes |
| 60 | 68 | DoVTGMultiViewExpansion | Lowering | | Expands multi-view operations for vertex/tessellation/geometry shaders | |
| 61 | 70 | OriPerformLiveDeadThird | Analysis | | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness |
| 62 | 71 | OriRemoveRedundantMultiDefMov | Optimization | | Removes dead multi-definition move instructions | |
| 63 | 72 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | 74 | LateOriCommoning | Optimization | | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE |
| 65 | 75 | GeneralOptimizeLate2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize |
| 66 | 76 | OriHoistInvariantsLate | Optimization | | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes |
| 67 | 78 | DoKillMovement | Optimization | | Moves kill annotations closer to last use to improve register pressure | |
| 68 | 79 | DoTexMovement | Optimization | | Moves texture fetch instructions to minimize latency exposure | |
| 69 | 80 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | 81 | OriPropagateVaryingSecond | Optimization | | Propagates varying annotations (2nd pass, after predication changes control flow) | |
| 71 | 82 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | 84 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | 85 | ConvertAllMovPhiToMov | Lowering | | Destroys SSA form: converts MOV.PHI instructions into plain MOV | |
| 74 | 86 | ConvertToUniformReg | Optimization | | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs |
| 75 | 87 | LateArchOptimizeFirst | Optimization | | Architecture-specific late optimizations (1st pass) | |
| 76 | 88 | UpdateAfterOptimize | Cleanup | | Rebuilds IR metadata invalidated by the late optimization group | |
| 77 | 89 | AdvancedPhaseLateConvUnSup | Gate | | Type B: 0xC5EA50 dispatches ctx+0x630 vtable+0x178 (slot 376); drives LateExpansionUnsupportedOps [78] (Bin# 90) | Late Legalization |

Stage 5 -- Legalization (Phases 78--96)

Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 78 | 90 | LateExpansionUnsupportedOps | Lowering | | Expands remaining unsupported operations after all optimizations | Late Legalization |
| 79 | 92 | OriHoistInvariantsLate2 | Optimization | | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes |
| 80 | 94 | ExpandJmxComputation | Lowering | | Expands JMX (jump with index computation) pseudo-instructions | |
| 81 | 95 | LateArchOptimizeSecond | Optimization | | Architecture-specific late optimizations (2nd pass) | |
| 82 | 96 | AdvancedPhaseBackPropVReg | Gate | | Type B: arch-override vtable dispatch (off_22BE298); next phase [83] writes ctx+1552 = 9 | Copy Prop & CSE |
| 83 | 97 | OriBackCopyPropagate | Optimization | | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE |
| 84 | 99 | OriPerformLiveDeadFourth | Analysis | | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness |
| 85 | 100 | OriPropagateGmma | Optimization | | Propagates WGMMA accumulator values through the IR | GMMA Pipeline |
| 86 | 101 | InsertPseudoUseDefForConvUR | Lowering | | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs |
| 87 | 102 | FixupGmmaSequence | Lowering | | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline |
| 88 | 104 | OriHoistInvariantsLate3 | Optimization | | LICM (late 3rd pass) after GMMA fixup | Loop Passes |
| 89 | 105 | AdvancedPhaseSetRegAttr | Gate | | Type B: ctx+0x630 SM backend vtable dispatch; precedes OriSetRegisterAttr [90] | |
| 90 | 106 | OriSetRegisterAttr | Analysis | | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling |
| 91 | 107 | OriCalcDependantTex | Analysis | | Computes texture instruction dependencies for scheduling | |
| 92 | 108 | AdvancedPhaseAfterSetRegAttr | Gate | | Type B: 0xC607A0 dispatches ctx+0x630 vtable+0x110; guarded by nullsub_170@0x7D6C80 | |
| 93 | 109 | LateExpansionUnsupportedOps2 | Lowering | | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization |
| 94 | 110 | FinalInspectionPass | Validation | | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | |
| 95 | 111 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | 112 | ReportBeforeScheduling | Reporting | | Dumps IR before scheduling (no-op unless diagnostic options enabled) | |

Stage 6 -- Scheduling & Register Allocation (Phases 97--103)

Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 97 | 113 | AdvancedPhasePreSched | Gate | | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling |
| 98 | 116 | BackPropagateVEC2D | Optimization | | Backward-propagates 2D vector register assignments | |
| 99 | 117 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | 119 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | 121 | AdvancedPhaseAllocReg | Gate | | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture |
| 102 | 123 | ReportAfterRegisterAllocation | Reporting | | Dumps IR after register allocation (no-op unless diagnostic options enabled) | |
| 103 | 125 | Get64bRegComponents | RegAlloc | | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |

Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)

Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 104 | 126 | AdvancedPhasePostExpansion | Gate | | Hook after post-RA expansion; when active, dispatches to PostExpansion (true table index 127) | |
| 105 | 128 | ApplyPostRegAllocWars | RegAlloc | | Fixes write-after-read hazards exposed by register allocation | |
| 106 | 129 | AdvancedPhasePostSched | Gate | | Type C: sub_C5E830 writes ctx+1552 = 14; post-scheduling boundary | |
| 107 | 130 | OriRemoveNopCode | Cleanup | | Removes NOP instructions and dead code inserted as placeholders | |
| 108 | 131 | OptimizeHotColdInLoop | Optimization | | Separates hot and cold paths within loops for cache locality | Hot/Cold |
| 109 | 132 | OptimizeHotColdFlow | Optimization | | Separates hot and cold paths at the function level | Hot/Cold |
| 110 | 133 | PostSchedule | Scheduling | > 0 | Post-scheduling pass: finalizes instruction ordering | Scheduling |
| 111 | 134 | AdvancedPhasePostFixUp | Gate | | Hook after post-fixup; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | |
| 112 | 135 | PlaceBlocksInSourceOrder | Cleanup | | Determines final basic block layout in the emitted binary | |
| 113 | 136 | PostFixForMercTargets | Encoding | | Fixes up instructions for Mercury encoding requirements | Mercury |
| 114 | 137 | FixUpTexDepBarAndSync | Scheduling | | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards |
| 115 | 138 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |

Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.
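
The mutual exclusion can be pictured with a small model. This is an illustrative Python sketch, not recovered code: the function name `active_scoreboard_phase` and its string return values are assumptions made for the example; only the two phase names and the -O0 versus -O1+ split come from the text above.

```python
def active_scoreboard_phase(opt_level: int) -> str:
    """Name of the scoreboard phase that does real work at -O<opt_level>."""
    if opt_level < 0:
        raise ValueError("optimization level must be non-negative")
    # -O0 takes the conservative path; every higher level takes the full analysis.
    if opt_level == 0:
        return "ProcessO0WaitsAndSBs"
    return "AdvancedScoreboardsAndOpexes"


for level in range(4):
    print(level, active_scoreboard_phase(level))
```

Whichever phase is not selected still occupies its slot in the fixed phase order; it simply does nothing at that optimization level.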

Stage 8 -- Mercury Backend (Phases 117--122)

SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 117 | 142 | MercEncodeAndDecode | Encoding | | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury |
| 118 | 143 | MercExpandInstructions | Encoding | | Expands pseudo-instructions into final SASS instruction sequences | Mercury |
| 119 | 144 | MercGenerateWARs1 | Encoding | | Generates write-after-read hazard annotations (1st pass, after expansion and before opex generation) | Mercury |
| 120 | 145 | MercGenerateOpex | Encoding | | Generates "opex" (operation extension) annotations for each instruction | Mercury |
| 121 | 146 | MercGenerateWARs2 | Encoding | | Generates WAR annotations (2nd pass, covers hazards introduced by opex generation) | Mercury |
| 122 | 147 | MercGenerateSassUCode | Encoding | | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |

"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119 and 121) that bracket opex generation (120). Both passes run after instruction expansion (118); the second re-annotates write-after-read hazards that phase 120 can introduce when it rewrites operand banks. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).

Stage 9 -- Post-Mercury (Phases 123--131)

Register map computation, diagnostics, debug output.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 123 | 148 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites | |
| 124 | 149 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture |
| 125 | 150 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing | |
| 126 | 151 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary to stderr | |
| 127 | 152 | AdvancedPhaseOriPhaseEncoding | Gate | | Type C: sub_C5E0B0 writes ctx+1552 = 21; marks encoding boundary | |
| 128 | 154 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds the code list after Mercury encoding reformats instructions | |
| 129 | 155 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly | |
| 130 | 156 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex | |
| 131 | 157 | DebuggerBreak | Cleanup | | Development hook: triggers a debugger breakpoint at this pipeline position | |

Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)

Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 132 | 8 | UpdateAfterConvertUnsupportedOps | Cleanup | | Rebuilds IR metadata after late unsupported-op conversion | |
| 133 | 15 | MergeEquivalentConditionalFlow | Optimization | | Merges basic blocks with equivalent conditional flow (tail merging) | |
| 134 | 52 | AdvancedPhaseAfterMidExpansion | Gate | | Type C: sub_C5EF80 writes ctx+1552 = 3; marks mid-expansion done | |
| 135 | 83 | AdvancedPhaseLateExpandSyncInstructions | Gate | | Type B: 0xC5F110 dispatches ctx+0x630 vtable+0x168 (slot 360) | Sync & Barriers |
| 136 | 91 | LateMergeEquivalentConditionalFlow | Optimization | | Second conditional flow merge pass (catches cases exposed by late transforms) | |
| 137 | 93 | LateExpansionUnsupportedOpsMid | Lowering | | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization |
| 138 | 98 | OriSplitHighPressureLiveRanges | RegAlloc | | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture |

Column notes for the late-pipeline table (phases 139--158). # = Bin#: the wiki number equals the binary factory index for every late-pipeline phase, because there is no renumbering gap from phase 139 onward. Execute: address of the vtable-slot-0 execute(ctx) function for the phase object allocated by sub_C60D30 (factory cases 139--158 at lines 1006--1125); worker and tail-call target addresses are listed in the Description column. Gate: the runtime predicate checked inside execute (if any) before the body runs; "unconditional" means the execute body has no skip branch. SM activation is "all" unless marked otherwise; the only phase with an SM-version check in its execute body is 139.

| # = Bin# | Phase Name | Category | Execute | Gate | Description (algorithm / nullsub status / cross-ref) |
|---|---|---|---|---|---|
| 139 | ProcessO0WaitsAndSBs | Scheduling | sub_C5E2A0 (41 B) | target[+0x174] > 0x3FFF (sm50+) | On sm50+, tail-calls target.vtable[+0x150] (ApplyConservativeScoreboards) with edx=1 (O0-mode flag) to insert maximum-wait scoreboards on every instruction; pre-sm50 targets fall through to ret. isNoOp=0. See Scoreboards and deep-dive below. |
| 140 | PostFixUp | Cleanup | sub_C5E270 (13 B) | unconditional | Tail-calls target.vtable[+0x148]; each Mercury target installs a target-specific post-fixup method (texture-barrier placement on Volta, scoreboard packing on Turing+, etc.), non-Mercury targets install a nullptr-safe stub. isNoOp=0. |
| 141 | MercConverter | Encoding | sub_C60300 (8 B thunk) -> sub_9F3760 | cu[+1398] & 0x20 (bit 5 inside body) | Second MercConverter pass re-lowering PTX-derived opcodes introduced by optimization (rematerialization, peephole, loop xforms); dispatches on target.sm_code to the per-generation path, then routes through sub_9F1A90 (35 KB) and its opcode switch sub_9ED2D0 (25 KB), the machinery reused from phase 5 ConvertUnsupportedOps. After completion every IR instruction carries a valid SASS opcode. See Mercury: MercConverter. |
| 142 | MercEncodeAndDecode | Encoding | sub_C60310 (8 B thunk) -> sub_18F21F0 | ctx[+0x571] & 0x02 (bit 1) AND ctx[+0x788] != NULL (Mercury ctx present) | Encodes every Ori IR node to a Mercury node via the master encoder sub_6D9690, then round-trip-decodes each node to verify the binary encoding is reversible. After this phase all subsequent pipeline stages operate on Mercury nodes exclusively. See Mercury: Stage 1 MercEncodeAndDecode. |
| 143 | MercExpandInstructions | Encoding | sub_C60320 (16 B) -> sub_C3DFC0 (102 B orchestrator) | ctx[+0x570] & 0x20 (bit 5) | Expands compound Mercury pseudo-instructions -- multi-word branches, multi-step LDG/STG sequences, sm_120 TCGEN05 macros -- into their primitive SASS sequences. sub_C3DFC0 walks the Mercury list via sub_C3CC60 and invokes each node's vtable[+0x40] Expand hook. See Mercury: Stage 2 MercExpandInstructions. |
| 144 | MercGenerateWARs1 | Encoding | sub_C60340 (16 B) -> sub_6FC240 (47 B) | ctx[+0x570] & 0x80 (bit 7, js opcode test) | First WAR (write-after-read) hazard annotation pass -- walks the Mercury node list and tags each consumer with the stall counts the target's hazard model requires. Runs before MercGenerateOpex (145). See Mercury: Stage 3 MercGenerateWARs. |
| 145 | MercGenerateOpex | Encoding | sub_C60380 (16 B) -> sub_7032A0 (472 B) | ctx[+0x570] & 0x40 (bit 6) | Generates Opex (operand-exchange) annotations -- the per-instruction control bits that tell the hardware which physical register bank to read each operand from, required by the sm_90+ banked-register-file microarchitecture to avoid read-port conflicts. See Mercury: Stage 4 MercGenerateOpex. |
| 146 | MercGenerateWARs2 | Encoding | sub_C60360 (16 B) -> sub_6FC240 | ctx[+0x570] & 0x80 (bit 7) | Second WAR pass, byte-for-byte identical to phase 144 except for the vtable store. Two WAR passes bracket Opex (145) because Opex can rewrite operand banks and thereby introduce new write-to-read distances that must be re-annotated. See Mercury: Stage 3 MercGenerateWARs. |
| 147 | MercGenerateSassUCode | Encoding | sub_C603A0 (16 B) -> sub_6EEE90 (1472 B) -> sub_6E4110 (24 KB) | ctx[+0x571] & 0x01 (bit 0) | Terminal Mercury stage: walks the fully-annotated Mercury node list and emits the final SASS binary microcode bytes that end up in the ELF .text section. sub_6EEE90 is a 0x110-byte stack-scratch wrapper that calls sub_6E8EB0 for per-function setup, then hands off to the 24 KB emitter sub_6E4110. See Mercury: Stage 5 MercGenerateSassUCode and SASS Printing. |
| 148 | ComputeVCallRegUse | RegAlloc | sub_C5E160 (13 B) | unconditional | Tail-calls target.vtable[+0x2B8] to compute register usage at virtual call sites (indirect calls, function pointers). The result is written into the target-side register-use tracker and later emitted as EIATTR_EXTERNS / EIATTR_INDIRECT_BRANCH_TARGETS metadata so the CUDA runtime can honour conservative register budgets for callees whose footprint is unknown at compile time. |
| 149 | CalcRegisterMap | RegAlloc | sub_C603C0 (32 B) -> sub_95A350 (6456 B) | ctx[+0x590] & 0x02 (bit 1) | Computes the final physical-to-logical register mapping that gets emitted as EIATTR_REGCOUNT / EIATTR_MIN_STACK_SIZE metadata. The execute thunk indirects through ctx.target[+0x18] (the SM-specific sub-target) before tail-calling sub_95A350 (the actual mapping builder). The map is needed by the CUDA driver to inflate saved contexts during preemption and by NVRTC for relocation. See RegAlloc Architecture. |
| 150 | UpdateAfterPostRegAlloc | Cleanup | nullsub_630 at 0xC5E110 (2 B) | -- | True no-op in release ptxas. Empty repz ret body; isNoOp() = 1 (sub_C5E130, 6 B) suppresses the "Before/After" diagnostic frame. Slot retained for ABI compatibility with debug builds where the body would be PhaseManager::RebuildAfterPostRegAlloc. |
| 151 | ReportFinalMemoryUsage | Reporting | nullsub_629 at 0xC5E0E0 (2 B) | -- | True no-op. isNoOp() = 1 (sub_C5E100). Debug builds would dump the memory-arena high-water mark to stderr here; release strips the body entirely. |
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | sub_C5E0B0 (11 B) | unconditional | Type-C gate: mov dword [rsi+0x610], 0x15; ret -- writes pipeline_progress = 21 (the final value of the monotonic ctx[+0x610] counter). Downstream consumers: sub_8C0270 checks *(ctx+0x610) == 19; scoreboard guards check values 16--19. isNoOp() = 1 (sub_C5E0D0) because this is state-tracking, not an IR transform. |
| 153 | FormatCodeList | Encoding | sub_C5E080 (13 B) | unconditional | The one late-pipeline phase that indirects through ctx[+0x648] (the code-list / ELF-section emitter) rather than ctx[+0x630] (target) or ctx[+0x788] (Mercury). Tail-calls emitter.vtable[+0x10] -- the "format" entry point that serialises the fully-encoded instructions into the final ELF text-section layout (addresses, relocations, alignment). isNoOp=0. See Mercury. |
| 154 | UpdateAfterFormatCodeList | Cleanup | nullsub_628 at 0xC5E050 (2 B) | -- | True no-op. isNoOp() = 1 (sub_C5E070). Hook point kept in case a backend needs to re-sync IR metadata after FormatCodeList reorders instructions, but no release target uses it. |
| 155 | DumpNVuCodeText | Reporting | sub_C60420 (54 B) | ctx[+0x598] > 0 && ctx[+0x740] != NULL && *ctx[+0x740] != NULL | Guarded by -dump_nvu_code_text=1 knob; the full gate cascade is retained, but the tail-call target 0x67FF60 resolves to nullsub_31 (2 B) -- the actual text dumper has been stripped from release ptxas, leaving the gate as an orphan that falls through to a stub. Effective no-op. See SASS Printing. |
| 156 | DumpNVuCodeHex | Reporting | sub_C60460 (~48 B) | ctx[+0x598] > 0 && ctx[+0x740] != NULL | Mirror image of phase 155 with a simpler gate (no extra pointer indirection) and tail-call target 0x67FF50 = nullsub_30 (2 B). Same conclusion: stripped from release, orphan gate only. See SASS Printing. |
| 157 | DebuggerBreak | Cleanup | nullsub_627 at 0xC5DFE0 (2 B) | -- | Debug-build breakpoint marker; release builds emit a bare ret. isNoOp() = 0 (sub_C5E000), so the "Before/After" diagnostic frame still fires -- useful when running ptxas under gdb with b *0xC5DFE0 because the dispatch loop will print "Before DebuggerBreak" / "After DebuggerBreak" on either side of the breakpoint. |
| 158 | NOP | Cleanup | nullsub_626 at 0xC5DFB0 (2 B) | -- | Terminal sentinel. The 159-phase dispatch loop (sub_C64F70) iterates a1[0] .. a1[158] and needs a final slot to anchor the loop end; NOP is that anchor. isNoOp() = 0 (sub_C5DFD0), so the final "Before NOP" / "After NOP" prints appear in verbose dumps as the explicit terminator for "All Phases Summary". |

Phases 139--158 are 20 late-pipeline phases whose vtable pointers range from off_22BEB80 to off_22BEE78 (40-byte stride, 20 * 0x28 bytes). All 20 have names in the static table at off_22BD0C0 (159 entries -- the earlier wiki note claiming "139 entries" was based on a compressed model that excluded these phases). Name resolution for the dispatch-loop diagnostic (sub_C64F70) goes through the static table indexed by getIndex() at vtable+8.
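
The stride arithmetic can be checked mechanically. The sketch below recomputes each late-phase vtable address from the base and stride quoted above; the constant and function names are illustrative choices, not symbols from the binary.

```python
LATE_VTABLE_BASE = 0x22BEB80   # vtable of phase 139
VTABLE_STRIDE = 0x28           # 40 bytes between consecutive vtables
FIRST_LATE_PHASE = 139
LAST_LATE_PHASE = 158


def late_phase_vtable(index: int) -> int:
    """Address of the vtable for a late-pipeline phase index (139..158)."""
    if not FIRST_LATE_PHASE <= index <= LAST_LATE_PHASE:
        raise ValueError(f"{index} is not a late-pipeline phase index")
    return LATE_VTABLE_BASE + (index - FIRST_LATE_PHASE) * VTABLE_STRIDE


for phase in (139, 147, 158):
    print(phase, hex(late_phase_vtable(phase)))
```

The last entry lands on 0x22BEE78, the upper end of the quoted range, and the intermediate values match the vtable= addresses given in the per-phase deep dive.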

Vtable layout (3 populated slots, off_22BExxx). Every late-pipeline phase has exactly three virtual methods. They occupy the first 24 bytes of each 40-byte vtable record; the remaining two slots (+24 and +32) are NULL:

| Slot | Offset | Purpose | Behaviour |
|---|---|---|---|
| 0 | vtbl+0 | execute(ctx) | Entry point called by sub_C64F70 dispatch loop (LABEL_4). See per-phase details below. |
| 1 | vtbl+8 | getIndex() | Returns the constant 139--158 (mov eax, 0x8b..0x9e; ret). Index into off_22BD0C0 for the name string. Always 6 bytes. |
| 2 | vtbl+16 | isNoOp() | Either xor eax,eax; ret (3 bytes, always false) or mov eax,1; ret (6 bytes, always true). Does not skip execute -- it suppresses the "Before <phase>" / "After <phase>" diagnostic print around the call (see sub_C64F70:86, the goto LABEL_4 branch falls into execute either way). |

The phase object itself is 16 bytes: [0]=vtable*, [8]=ctx*. No per-phase instance state -- all state lives in the shared OCG context passed to every execute() call.
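
To make the isNoOp() semantics concrete, here is a minimal Python model of the dispatch behaviour described above. It is a sketch under stated assumptions, not a transcription of sub_C64F70: the `Phase` class, `run_phases`, and the dictionary standing in for the OCG context are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Phase:
    index: int                        # value getIndex() would return
    name: str                         # entry in the static name table
    execute: Callable[[Dict], None]   # vtable slot 0
    is_noop: bool                     # result of vtable slot 2


def run_phases(phases: List[Phase], ctx: Dict) -> List[str]:
    """Run every phase in order; return the diagnostic lines that were emitted."""
    log: List[str] = []
    for phase in phases:
        if not phase.is_noop:
            log.append(f"Before {phase.name}")
        phase.execute(ctx)            # executed regardless of is_noop
        if not phase.is_noop:
            log.append(f"After {phase.name}")
    return log


def write_progress_21(ctx: Dict) -> None:
    ctx["pipeline_progress"] = 21     # what phase 152 is described as doing


def do_nothing(ctx: Dict) -> None:
    pass


ctx: Dict = {}
log = run_phases(
    [
        Phase(152, "AdvancedPhaseOriPhaseEncoding", write_progress_21, True),
        Phase(158, "NOP", do_nothing, False),
    ],
    ctx,
)
print(ctx, log)
```

In this model phase 152 still updates the context even though it leaves no trace in the log, while the empty NOP phase produces a Before/After pair.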

isNoOp statistics. 16 of 20 phases return 0 from isNoOp() (diagnostics printed). Exactly 4 phases return 1 (diagnostics suppressed): 150, 151, 152, 154. Of those, three (150, 151, 154) also have nullsub execute bodies and are truly vestigial; phase 152 has an 11-byte body that writes pipeline_progress = 21 but is hidden from dumps because it is a state-tracking marker, not an IR transform.
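
The counts follow from two sets taken from the late-pipeline table. The snippet below is only a bookkeeping aid; the variable names are illustrative, and the nullsub set lists the phases whose Execute column names a nullsub directly.

```python
LATE_PHASES = range(139, 159)                 # 20 phases, indices 139..158
IS_NOOP_TRUE = {150, 151, 152, 154}           # isNoOp() returns 1
NULLSUB_EXECUTE = {150, 151, 154, 157, 158}   # execute body is a bare ret

# Phases whose Before/After diagnostics are printed.
diagnostics_printed = [p for p in LATE_PHASES if p not in IS_NOOP_TRUE]

# Hidden from dumps and empty: truly vestigial.
vestigial = sorted(IS_NOOP_TRUE & NULLSUB_EXECUTE)

# Hidden from dumps but with a real body.
hidden_but_live = sorted(IS_NOOP_TRUE - NULLSUB_EXECUTE)

# Empty body, yet still framed by diagnostics.
empty_but_framed = sorted(NULLSUB_EXECUTE - IS_NOOP_TRUE)

print(len(diagnostics_printed), vestigial, hidden_but_live, empty_but_framed)
```

The last set (157 and 158) shows that an empty execute body and isNoOp() = 1 are independent properties.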

Phase-by-phase deep dive (139--158)

Each entry gives: execute function address, body size in bytes, vtable address, gate condition (if any), and pseudocode. All addresses are verified via (a) the factory switch in sub_C60D30 (cases 139--158 at lines 1006--1125), (b) the raw pointer-table dump at 0x22BEB80--0x22BEE78 read from .rodata, and (c) direct objdump of the .text segment.

Phase 139 ProcessO0WaitsAndSBs -- vtable=0x22BEB80 -- execute sub_C5E2A0 (41 bytes; IDA did not identify the function, so it was recovered via objdump). Runs on sm50+ only. Tail-dispatches to the target's ApplyConservativeScoreboards hook (vtable slot 0x150) with flag edx=1 (O0 mode). On sm_3x and other pre-sm50 architectures the phase returns immediately because the legacy shader-processor scoreboard model does not apply.

mov  rdi, [rsi+0x630]          ; rdi = ocg_ctx->target
cmp  dword [rdi+0x174], 0x3FFF ; if sm_version_encoded <= 16383 (pre-sm50)
jle  .return                   ;   skip
mov  rax, [rdi]                ; target.vtable
mov  edx, 1                    ; mode = O0
jmp  [rax+0x150]                ; target->ApplyConservativeScoreboards(ctx,1)
.return:
ret

isNoOp returns 0 (sub_C5E2E0, 3 bytes). No pipeline_progress write.
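
The same control flow, restated as a Python sketch. The function and parameter names are hypothetical, the hook is passed in as a plain callable, and the two encoded SM values used in the example are made-up numbers on either side of the 0x3FFF threshold, not real target encodings.

```python
SM50_GATE = 0x3FFF  # 16383; encoded versions above this are treated as sm50+


def process_o0_waits_and_sbs(sm_version_encoded, apply_conservative_scoreboards, ctx):
    """Model of the phase 139 execute body: gate on the encoded SM version."""
    if sm_version_encoded <= SM50_GATE:             # jle .return
        return None                                 # pre-sm50: nothing happens
    return apply_conservative_scoreboards(ctx, 1)   # edx = 1 selects O0 mode


calls = []


def fake_hook(ctx, mode):
    """Stand-in for the target hook; records how it was called."""
    calls.append((ctx, mode))
    return "inserted"


skipped = process_o0_waits_and_sbs(0x3000, fake_hook, "ctx")  # hypothetical pre-sm50 value
ran = process_o0_waits_and_sbs(0x4000, fake_hook, "ctx")      # hypothetical sm50+ value
print(skipped, ran, calls)
```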

Phase 140 PostFixUp -- vtable=0x22BEBA8 -- execute sub_C5E270 (13 bytes). Unconditional target-hook dispatch. Every Mercury target registers a post-fixup method at vtable slot 0x148; non-Mercury targets install a nullptr-safe stub. The method performs target-specific cleanup once scheduling and register allocation are final (examples: texture barrier placement on Volta, scoreboard packing on Turing+).

mov  rdi, [rsi+0x630]          ; target
mov  rax, [rdi]                ; target.vtable
jmp  [rax+0x148]                ; target->PostFixUp(target)

isNoOp = 0 (sub_C5E290).

Phase 141 MercConverter -- vtable=0x22BEBD0 -- execute sub_C60300 (8 bytes, thunk) -> body sub_9F3760. Second MercConverter invocation: it re-runs the 35 KB opcode-dispatch machinery from phase 5 (ConvertUnsupportedOps) on instructions introduced by optimization passes (rematerialization, peephole, loop transforms) that may still carry unlegalized PTX-derived opcodes. An internal gate inside sub_9F3760, test byte [rdi+0x570], 0x10 (bit 4), makes the body return immediately on non-Mercury targets. When the gate passes, the body dispatches on target.sm_code at [rdi+0x174], comparing against the arch constants 0x9000/0x7005/0x7001/0x6001 to pick a per-generation conversion path. After completion every IR instruction carries a valid SASS opcode ready for encoding. See Mercury (sub_9F1A90 / sub_9ED2D0) for the full opcode dispatch.

; execute thunk
mov  rdi, rsi                  ; rdi = ocg_ctx
jmp  0x9F3760                  ; MercConverter::Run

; sub_9F3760 prologue
test byte [rdi+0x570], 0x10    ; Mercury-active bit (bit 4)
jz   .return
...

Phase 142 MercEncodeAndDecode -- vtable=0x22BEBF8 -- execute sub_C60310 (8 bytes, thunk) -> body sub_18F21F0. Encodes each Ori IR node to its Mercury-node form via sub_6D9690 (the master encoder), then round-trip-decodes to verify the binary encoding is reversible. After this phase all subsequent pipeline stages operate on Mercury nodes exclusively. Two internal gates make it a no-op when Mercury is not the active backend: test byte [rdi+0x571], 0x2 (bit 1 of the second byte of the flag word at ctx+0x570) must find the bit set, and the Mercury context pointer at ctx+0x788 must be non-NULL.

mov  rdi, rsi
jmp  0x18F21F0                 ; MercEncodeAndDecode::Run

; body prologue
test byte [rdi+0x571], 0x2
jz   .return
mov  r15, [rdi+0x788]          ; Mercury context
test r15, r15
jz   .return
...

Phase 143 MercExpandInstructions -- vtable=0x22BEC20 -- execute sub_C60320 (16 bytes). Expands compound Mercury pseudo-instructions (e.g. multi-word branches, multi-step LDG/STG sequences, sm_120 TCGEN05 macros) into their SASS primitives. Gated by ctx+0x570 bit 5; the Mercury backend sets this bit during its init recipe. Tail-calls sub_C3DFC0 (102 bytes, an orchestrator that calls sub_C3CC60 to iterate the Mercury list and invokes per-instruction vtable+0x40 Expand hooks). sub_C3DFC0 also emits the "After MercExpand" diagnostic on completion.

test byte [rsi+0x570], 0x20    ; bit 5: MercExpandEnable
jnz  .active
repz ret                        ; skip -- non-Mercury target
.active:
mov  rdi, [rsi+0x788]          ; rdi = Mercury context
jmp  0xC3DFC0                  ; RunMercExpandPass

Phase 144 MercGenerateWARs1 -- vtable=0x22BEC48 -- execute sub_C60340 (16 bytes). First WAR-hazard annotation pass. Walks the Mercury node list and tags each consumer with the write-after-read stall counts needed to satisfy the target's hazard model. Runs after MercExpandInstructions (143) but before MercGenerateOpex (145). The "pass-1" naming reflects that two WAR passes are needed: Opex (145) can rewrite operand banks and introduce new write-to-read distances, which pass 146 then re-annotates. Gated by bit 7 of the byte at ctx+0x570, tested as a sign bit: cmp byte [rsi+0x570], 0 followed by js.

cmp  byte [rsi+0x570], 0        ; SF = bit 7 of the flag byte; js is taken when it is set
js   .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x6FC240                  ; RunMercWARsPass (47 bytes)
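
The cmp/js idiom is equivalent to masking with 0x80: comparing a byte against zero leaves the sign flag equal to bit 7 of that byte. The short Python check below models the flag computation for all 256 byte values; the helper names are illustrative.

```python
def js_taken_after_cmp_zero(byte_value: int) -> bool:
    """Model `cmp byte [mem], 0` followed by `js`: SF is bit 7 of (value - 0)."""
    result = (byte_value - 0) & 0xFF
    sign_flag = (result >> 7) & 1
    return sign_flag == 1


def bit7_set(byte_value: int) -> bool:
    """The same gate written as an explicit mask test."""
    return (byte_value & 0x80) != 0


mismatches = [b for b in range(256) if js_taken_after_cmp_zero(b) != bit7_set(b)]
print(mismatches)
```

An empty mismatch list confirms that the Gate column's "& 0x80 (bit 7)" and the disassembly's sign test describe the same predicate.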

Phase 145 MercGenerateOpex -- vtable=0x22BEC70 -- execute sub_C60380 (16 bytes). Generates Opex (operand-exchange) annotations per instruction -- extra control bits that tell the hardware which physical register bank to read each operand from, required by the sm_90+ banked-register file to avoid bank conflicts. Gated by ctx+0x570 bit 6. Tail-calls sub_7032A0 (472 bytes, RunMercOpexPass). See Mercury Stage 4.

test byte [rsi+0x570], 0x40    ; bit 6
jnz  .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x7032A0

Phase 146 MercGenerateWARs2 -- vtable=0x22BEC98 -- execute sub_C60360 (16 bytes). Second WAR-hazard pass. The execute body is byte-for-byte identical to phase 144 (same sub_6FC240 tail-call, same bit-7 gate); only the vtable the phase object is given differs. The two invocations bracket phase 145 (Opex), which may rewrite operand banks and thereby introduce new write-to-read distances that need re-annotation.

cmp  byte [rsi+0x570], 0        ; bit 7 gate, as in phase 144
js   .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x6FC240                  ; same entry as phase 144

Phase 147 MercGenerateSassUCode -- vtable=0x22BECC0 -- execute sub_C603A0 (16 bytes). The terminal Mercury stage: walks the fully-annotated Mercury node list and emits the final SASS binary microcode bytes that will end up in the ELF .text section. Gated by ctx+0x571 bit 0 (the lowest bit of the second flag byte). Tail-calls sub_6EEE90 (1472 bytes), which is a thin wrapper that allocates a 0x110-byte stack scratch area, invokes sub_6E8EB0 for per-function setup, then calls into sub_6E4110 (24 KB, the real emitter documented in Mercury Stage 5).

test byte [rsi+0x571], 0x1     ; bit 0 of the second flag byte
jnz  .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x6EEE90                  ; MercGenerateSassUCode::Run
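
Taken together, the gates for phases 142--147 read individual bits of the two flag bytes at ctx+0x570 and ctx+0x571. The decoder below is a sketch built only from the offsets and masks quoted in this section; the dictionary layout and function name are assumptions made for illustration.

```python
# (byte offset in ctx, bit mask) per phase, as quoted in the Gate column.
MERCURY_GATES = {
    142: (0x571, 0x02),  # MercEncodeAndDecode (also needs a Mercury context)
    143: (0x570, 0x20),  # MercExpandInstructions
    144: (0x570, 0x80),  # MercGenerateWARs1
    145: (0x570, 0x40),  # MercGenerateOpex
    146: (0x570, 0x80),  # MercGenerateWARs2
    147: (0x571, 0x01),  # MercGenerateSassUCode
}


def enabled_mercury_phases(byte_570, byte_571, mercury_ctx_present=True):
    """Return the phase indices whose gate bit is set for the given flag bytes."""
    flag_bytes = {0x570: byte_570 & 0xFF, 0x571: byte_571 & 0xFF}
    enabled = []
    for phase in sorted(MERCURY_GATES):
        offset, mask = MERCURY_GATES[phase]
        if not flag_bytes[offset] & mask:
            continue
        if phase == 142 and not mercury_ctx_present:
            continue  # second gate: ctx+0x788 must be non-NULL
        enabled.append(phase)
    return enabled


print(enabled_mercury_phases(0xE0, 0x03))
print(enabled_mercury_phases(0x80, 0x00))
```

Because phases 144 and 146 share bit 7, the two WAR passes are always enabled or disabled together.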

Phase 148 ComputeVCallRegUse -- vtable=0x22BECE8 -- execute sub_C5E160 (13 bytes). Computes register usage at virtual call sites (indirect calls, function pointers) and stores the result in the target-side register-use tracker. The data is consumed during ELF emission as EIATTR_EXTERNS/EIATTR_INDIRECT_BRANCH_TARGETS metadata so that the CUDA runtime can honour conservative register budgets for callees whose register footprint is unknown at compile time. Unconditional; all architectures route through the target vtable slot 0x2B8.

mov  rdi, [rsi+0x630]           ; target
mov  rax, [rdi]                 ; vtable
jmp  [rax+0x2B8]                ; target->ComputeVCallRegUse(target)

Phase 149 CalcRegisterMap -- vtable=0x22BED10 -- execute sub_C603C0 (32 bytes). Computes the final physical-to-logical register mapping that gets emitted as EIATTR_REGCOUNT / EIATTR_MIN_STACK_SIZE metadata. The mapping is needed by the CUDA driver to inflate saved contexts during preemption and by NVRTC for relocation. Gated by ctx+0x590 bit 1 (register-map-export knob). The execute body loads the target from ctx+0x630, takes the SM-specific sub-target pointer stored at offset +0x18 of that object, and tail-calls sub_95A350 (6456 bytes, the actual mapping builder) with it.

test byte [rsi+0x590], 0x2     ; bit 1: register-map-export knob
jnz  .active
repz ret
.active:
mov  rax, [rsi+0x630]
mov  rdi, [rax+0x18]            ; target.sub_target (sm-specific)
jmp  0x95A350                   ; CalcRegisterMap body

Phase 150 UpdateAfterPostRegAlloc -- vtable=0x22BED38 -- execute nullsub_630 at 0xC5E110 (2 bytes, repz ret). True no-op in release ptxas. isNoOp() returns 1 (sub_C5E130, 6 bytes) to suppress the diagnostic frame around the call. The phase slot is kept for ABI compatibility with debug builds where the body is PhaseManager::RebuildAfterPostRegAlloc, but the release build strips it.

Phase 151 ReportFinalMemoryUsage -- vtable=0x22BED60 -- execute nullsub_629 at 0xC5E0E0 (2 bytes, repz ret). True no-op. isNoOp() = 1 (sub_C5E100). Debug builds would dump the memory-arena high-water mark to stderr here; release strips the body entirely.

Phase 152 AdvancedPhaseOriPhaseEncoding -- vtable=0x22BED88 -- execute sub_C5E0B0 (11 bytes). The single surviving late-pipeline gate hook.

mov  dword [rsi+0x610], 0x15   ; pipeline_progress = 21
ret

Writes pipeline_progress = 21 (the final value of the monotonic ctx+0x610 counter; see Targets offset +1552). Downstream consumers: sub_8C0270 checks *(ctx+0x610) == 19; scoreboard guards check values 16--19. isNoOp() = 1 (sub_C5E0D0) because the write is state-tracking, not IR transformation.

Phase 153 FormatCodeList -- vtable=0x22BEDB0 -- execute sub_C5E080 (13 bytes). Indirects through a different context object than the other late phases: ctx+0x648 is the code-list / ELF-section emitter rather than ctx+0x630 (target) or ctx+0x788 (Mercury context). Tail-calls vtable+0x10 on that object -- the "format" entry point that serialises the fully-encoded instructions into the final ELF text-section layout (addresses, relocations, alignment).

mov  rdi, [rsi+0x648]           ; code-list emitter
mov  rax, [rdi]                 ; its vtable
jmp  [rax+0x10]                 ; emitter->FormatCodeList()

Phase 154 UpdateAfterFormatCodeList -- vtable=0x22BEDD8 -- execute nullsub_628 at 0xC5E050 (2 bytes, repz ret). True no-op. isNoOp() = 1 (sub_C5E070). Kept as a hook point in case a target backend needs to re-sync IR metadata after FormatCodeList reordered instructions, but no release target uses it.

Phase 155 DumpNVuCodeText -- vtable=0x22BEE00 -- execute sub_C60420 (54 bytes). The gate cascade ctx+0x598 > 0 && ctx+0x740 != NULL && *(ctx+0x740) != NULL is fully retained, so the code path is reachable when the hidden -dump_nvu_code_text=1 knob is set, but the tail-call target 0x67FF60 resolves to nullsub_31 (2 bytes) -- the actual text dumper has been stripped from release ptxas, leaving an orphan gate that falls through to a stub.

mov  eax, [rsi+0x598]           ; verbosity level
test eax, eax
jle  .skip
mov  rax, [rsi+0x740]           ; dump sink
test rax, rax
je   .skip
mov  rdi, [rax]
test rdi, rdi
je   .skip
xor  edx, edx
xor  esi, esi
jmp  0x67FF60                   ; nullsub_31 -- stub
.skip:
repz ret
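The three-step cascade in the listing above can be restated as a predicate. This is a sketch of the control flow only; the `Ctx` and `DumpSink` classes and their field names are hypothetical stand-ins for ctx+0x598, ctx+0x740, and the pointer stored at the sink.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DumpSink:
    stream: Optional[object]   # models *(ctx+0x740); None stands for NULL


@dataclass
class Ctx:
    verbosity: int             # models the signed dword at ctx+0x598
    sink: Optional[DumpSink]   # models the pointer at ctx+0x740


def dump_gate_passes(ctx: Ctx) -> bool:
    """True when execution would reach the tail call (a stub in release builds)."""
    if ctx.verbosity <= 0:          # test eax, eax ; jle .skip
        return False
    if ctx.sink is None:            # first  test/je .skip
        return False
    if ctx.sink.stream is None:     # second test/je .skip
        return False
    return True
```

Even when the predicate holds, the release build only reaches nullsub_31, so no text is produced.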

Phase 156 DumpNVuCodeHex -- vtable=0x22BEE28 -- execute sub_C60460 (~48 bytes). Mirror image of phase 155 with a simpler gate (no extra pointer indirection) and tail-call target 0x67FF50 = nullsub_30. Same conclusion: stripped from release, orphan gate only.

Phase 157 DebuggerBreak -- vtable=0x22BEE50 -- execute nullsub_627 at 0xC5DFE0 (2 bytes, repz ret). Debug-build breakpoint marker; release builds emit a bare ret. isNoOp() = 0 (sub_C5E000), so the diagnostic frame still fires -- useful when running ptxas under gdb with b *0xC5DFE0 because the dispatch loop will print "Before DebuggerBreak" / "After DebuggerBreak" on either side of the breakpoint.

Phase 158 NOP -- vtable=0x22BEE78 -- execute nullsub_626 at 0xC5DFB0 (2 bytes, repz ret). Terminal sentinel. The 159-phase dispatch loop (sub_C64F70) iterates a1[0] .. a1[158] and needs a final slot to anchor the loop end; NOP is that anchor. isNoOp() = 0 (sub_C5DFD0), so the final "Before NOP" / "After NOP" prints appear in verbose dumps as the explicit terminator for "All Phases Summary".
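The interaction between isNoOp() and the "Before"/"After" diagnostic frame can be sketched as a toy dispatch loop. This is not a decompilation of sub_C64F70: it assumes that execute() runs for every phase and that isNoOp() only suppresses the frame, which is one reading of the phase 150 and 152 descriptions. All Python names are illustrative.

```python
class Phase:
    """Stand-in for the 16-byte phase object: execute(), getIndex(), isNoOp()."""

    def __init__(self, index, name, is_noop, body=None):
        self.index = index
        self.name = name
        self.is_noop = is_noop
        self.body = body          # None models a bare `repz ret` stub

    def execute(self, ctx):
        if self.body is not None:
            self.body(ctx)


def set_progress_21(ctx):
    ctx["pipeline_progress"] = 0x15   # phase 152: write 21 to ctx+0x610


def run_pipeline(phases, ctx):
    """ASSUMPTION: execute() always runs; isNoOp() only suppresses the frame."""
    log = []
    for phase in phases:
        framed = not phase.is_noop
        if framed:
            log.append(f"Before {phase.name}")
        phase.execute(ctx)
        if framed:
            log.append(f"After {phase.name}")
    return log


tail = [
    Phase(150, "UpdateAfterPostRegAlloc", is_noop=True),
    Phase(152, "AdvancedPhaseOriPhaseEncoding", is_noop=True, body=set_progress_21),
    Phase(157, "DebuggerBreak", is_noop=False),
    Phase(158, "NOP", is_noop=False),
]
ctx = {"pipeline_progress": 0}
log = run_pipeline(tail, ctx)
print(log)
print(ctx["pipeline_progress"])
```

Under these assumptions only phases 157 and 158 leave frame lines in the log, while phase 152 still updates the progress counter silently.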

Summary of nullsubs (release build). Five of the 20 phases have bodies that are pure ret stubs: 150, 151, 154, 157, 158. Two more (155, 156) have non-trivial gate cascades but their tail-call targets resolve to nullsubs, making them effectively no-ops too. That leaves 13 phases (139--149, 152, 153) that actually transform IR or pipeline state in a release build. Of the 13 active phases, seven are Mercury encoder stages (141--147) gated by ctx+0x570/ctx+0x571 bits -- so on a non-Mercury backend the active count drops to six (139, 140, 148, 149, 152, 153).
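The counts in the summary above can be re-derived from the phase indices alone. The set names below are just labels for the groups described in the text.

```python
LATE = range(139, 159)                    # the 20 late-pipeline phases
RET_STUBS = {150, 151, 154, 157, 158}     # execute() is a bare ret
ORPHAN_GATES = {155, 156}                 # gate retained, tail-call target is a nullsub
MERCURY = set(range(141, 148))            # Mercury encoder stages 141--147

active = [p for p in LATE if p not in RET_STUBS | ORPHAN_GATES]
non_mercury = [p for p in active if p not in MERCURY]

print(len(LATE), len(active), len(non_mercury))   # 20 13 6
print(non_mercury)                                # [139, 140, 148, 149, 152, 153]
```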

No per-SM arch split across these phases. None of the 20 execute bodies contain an sm_version switch on ctx.target[+0x174] at the phase level; the only such check is in phase 139's gate (> 0x3FFF i.e. "sm50-and-up"). All per-generation specialisation happens one level down, inside the target vtable methods each phase tail-calls (Mercury backend for 141--147, target vtable slots 0x148/0x150/0x2B8 for 140/139/148). The pipeline itself is arch-uniform; backends differ only in the methods they plug into the vtables.

The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing non-Mercury backends to selectively disable encoding stages. WAR generation runs in two passes (144, 146) bracketing Opex (145) because Opex can rewrite operand banks and thereby introduce new write-to-read distances that need re-annotation -- phase 143 (MercExpandInstructions) also runs before the pair but has its own bit-5 gate.


Pipeline Ordering Notes

Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.

Identity ordering. The default ordering table at 0x22BEEA0 (159 x uint32) is an identity mapping for indices 0--156: exec[N] = factory[N]. The last two entries are zero: exec[157] = 0 and exec[158] = 0, mapping both slots back to factory index 0 instead of the expected 157 and 158. This is benign -- phase 157 (DebuggerBreak, empty body in release builds) and phase 158 (NOP, terminal sentinel) both have trivial execute() bodies, so the factory index they resolve through is irrelevant to pipeline behavior. For all practical purposes the factory index IS the execution order: phases execute in strict index order 0--158, and the two trailing zeros are don't-care slots. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see note below).
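An identity check of this kind is easy to reproduce against a dump of the table. The sketch below builds a synthetic 159-entry table matching the description above rather than reading the real binary; it assumes little-endian uint32 entries (x86-64), and mapping the virtual address 0x22BEEA0 to a file offset is left out.

```python
import struct

PHASE_COUNT = 159


def deviations(table_bytes: bytes) -> list:
    """Return (slot, value) pairs where the ordering table is not the identity."""
    entries = struct.unpack(f"<{PHASE_COUNT}I", table_bytes)
    return [(slot, value) for slot, value in enumerate(entries) if value != slot]


# Synthetic stand-in for the table: identity for 0--156, then two zero entries.
synthetic = struct.pack(f"<{PHASE_COUNT}I", *(list(range(157)) + [0, 0]))

print(len(synthetic))          # 636 bytes = 159 * 4
print(deviations(synthetic))   # [(157, 0), (158, 0)]
```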

Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:

| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |
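As a quick consistency check, the Instances column should equal the number of phase indices listed for each family, and the indices should be in ascending order. The sketch below transcribes only the rows with unambiguous numbering; the dictionary layout and the `inconsistent_rows` helper are illustrative.

```python
# family name -> (instances, phase indices), transcribed from the table above
REPEATED = {
    "GeneralOptimize*": (6, [13, 29, 37, 46, 58, 65]),
    "OriPerformLiveDead*": (4, [16, 33, 61, 84]),
    "OriHoistInvariants*": (4, [35, 66, 79, 88]),
    "LateExpansionUnsupportedOps*": (3, [78, 93, 137]),
    "ExtractShaderConsts*": (2, [34, 51]),
    "OriPropagateVarying*": (2, [53, 70]),
    "OriDoRemat*": (2, [54, 69]),
    "DoSwitchOpt*": (2, [14, 30]),
    "LateArchOptimize*": (2, [75, 81]),
    "MergeEquivalentConditionalFlow": (2, [133, 136]),
    "MercGenerateWARs*": (2, [144, 146]),
}


def inconsistent_rows(table):
    """Names whose instance count or index ordering does not match the phase list."""
    return [
        name
        for name, (count, phases) in table.items()
        if count != len(phases) or phases != sorted(phases)
    ]


print(inconsistent_rows(REPEATED))   # [] when every row is self-consistent
```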

Cross-References

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
| sub_7DDB50 | 232B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
| sub_A36360 | 52KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
| sub_A23CF0 | 54KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
| sub_9F1A90 | 35KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
| sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
| sub_9F3760 | -- | Phase 141 (MercConverter) execute function; initial Mercury conversion of Ori instructions | 0.85 |
| sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) execute function; encode/decode round-trip verification | 0.85 |