Pass Inventory & Ordering

All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

The ptxas compilation pipeline consists of exactly 159 phases, executed in a fixed order determined by a static index table at 0x22BEEA0. Every compilation traverses the same sequence -- the table is never reordered or pruned. A phase that has nothing to do is neutralized per-phase, through its isNoOp() virtual override and an early-return execute() body (see O-Level Gating below). This page is the definitive inventory of all 159 phases: their index, name, category, one-line description, and cross-references to detailed documentation where available.

All 159 phases have names in the static name table at off_22BD0C0 (159 entries, indexed 0--158). The factory switch at sub_C60D30 allocates each phase as a 16-byte polymorphic object with a 5-slot vtable: execute() at +0, getIndex() at +8 (returns the factory/table index), and isNoOp() at +16 (returns 0 for active phases, 1 for phases that are no-ops by default). Slots +24 and +32 are NULL.
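The object layout described above can be mirrored in ordinary C++ as a rough sketch. The names Phase, ToyPhase, OcgCtx and indexOf are illustrative and do not come from the binary. The second word is modelled as the allocator pointer mentioned under O-Level Gating, and the slot offsets assume a 64-bit Itanium-style C++ ABI in which virtual functions are laid out in declaration order. The two NULL slots at +24 and +32 are not modelled.

```cpp
#include <cassert>

struct OcgCtx;  // opaque stand-in for the compilation context (hypothetical name)

// Declaration order fixes the slot order under the assumed ABI:
// execute -> +0, getIndex -> +8, isNoOp -> +16.
struct Phase {
    virtual void execute(OcgCtx* ctx) = 0;
    virtual int  getIndex() const = 0;
    virtual bool isNoOp() const = 0;
    void* allocator = nullptr;  // second 8-byte word of the object
};

static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit target");
static_assert(sizeof(Phase) == 16, "vtable pointer + allocator pointer");

// One class per phase: the index and the default isNoOp() answer live in the
// per-class vtable, so the object itself carries no extra fields.
template <int Index, bool NoOp>
struct ToyPhase final : Phase {
    void execute(OcgCtx*) override {}
    int  getIndex() const override { return Index; }
    bool isNoOp() const override { return NoOp; }
};

static_assert(sizeof(ToyPhase<4, true>) == 16, "derived phases add no state");

// Reads the table index through the vtable and checks the documented range.
inline int indexOf(const Phase& p) {
    int i = p.getIndex();
    assert(i >= 0 && i <= 158);
    return i;
}
```

Instantiating `ToyPhase<4, true>` gives an object that reports index 4 and answers `isNoOp()` with true, which is the shape of a default gate phase in this model.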

| Item | Value |
|---|---|
| Total phases | 159 (indices 0--158) |
| Named (static table) | 159 (all have entries in off_22BD0C0) |
| Late-pipeline phases | 20 (indices 139--158, added after the original 0--138 design) |
| Gate passes (AdvancedPhase) | 17 conditional hooks |
| Update passes | 9 data-structure refresh passes (6 in main table + 3 in static name table, not yet positioned) |
| Report passes | 10 diagnostic/dump passes (9 in main table + 1 in static name table, not yet positioned) |
| GeneralOptimize instances | 6 compound optimization bundles |
| Liveness/DCE instances | 5 (including EarlyOriSimpleLiveDead) |
| LICM instances | 4 |
| Pipeline infrastructure | Phase Manager, Optimization Pipeline |

Phase Categories

Each phase is tagged with one of 10 categories. These are not present in the binary -- they are an analytical classification applied during reverse engineering.

| Tag | Meaning | Count |
|---|---|---|
| Validation | Checks IR structural correctness, catches illegal patterns | 3 |
| Lowering | Converts unsupported ops, expands macros, legalizes IR | 14 |
| Optimization | Transforms IR to improve performance (DCE, CSE, LICM, etc.) | 68 |
| Analysis | Computes information consumed by later passes (liveness, CFG) | 6 |
| Reporting | Dumps IR, statistics, or memory usage for debugging | 9 |
| Scheduling | Instruction scheduling, sync insertion, WAR fixup | 8 |
| RegAlloc | Register allocation and related fixups | 6 |
| Encoding | Mercury SASS encoding, expansion, microcode generation | 9 |
| Cleanup | Post-transformation updates, NOP removal, block layout | 13 |
| Gate | Conditional hooks (AdvancedPhase*) -- no-op by default | 17 |

Phases 139--158 are late-pipeline phases covering Mercury encoding, scoreboards, register map computation, diagnostics, and a terminal NOP. They have the same vtable infrastructure as phases 0--138 and are fully named in the static table.

Numbering Discrepancy -- Complete Wiki-to-Binary Mapping

Warning: The wiki phase numbers 0--138 use a compressed scheme that omits 23 binary indices from the contiguous range 0--139. Of these 23, seven are displaced to wiki positions 132--138, and 16 have no wiki number at all. The divergence begins at binary index 8 (UpdateAfterConvertUnsupportedOps, skipped in the wiki sequence) and grows with each omitted index until the delta reaches +23 at binary 128 (wiki 105), where it stays through binary 139 (wiki 116). Phases 140--158 match their binary indices. Every cross-reference on this page and on the 40+ linked pages uses wiki numbers, NOT binary indices. Use the table below to convert.

Complete Binary-to-Wiki Translation Table

Reading guide: W# = wiki phase number used on this page. Rows marked SKIP have no wiki number (16 phases). Rows marked DISP are displaced to wiki 132--138 (7 phases). Delta = binary index minus wiki number.

Name shortening. The Phase Name column below uses compact aliases to fit the two-column layout; the actual strings in the static name table at off_22BD0C0 are the unabbreviated forms used everywhere else on this page. Aliases used here: AdvPh* = AdvancedPhase* (15 phases), AdvScoreboardsAndOpexes = AdvancedScoreboardsAndOpexes, MergeEquivCondFlow = MergeEquivalentConditionalFlow, LateMergeEquivCondFlow = LateMergeEquivalentConditionalFlow, LateExpUnSupportedOps[2] = LateExpansionUnsupportedOps[2], LateExpUnSupOpsMid = LateExpansionUnsupportedOpsMid, LateEnforceArgRestr = LateEnforceArgumentRestrictions, UpdateAfterConvUnSupOps = UpdateAfterConvertUnsupportedOps, UpdateAfterSchedInstr = UpdateAfterScheduleInstructions, UpdateAfterOriDoSync = UpdateAfterOriDoSyncronization, UpdateAfterOriAllocReg = UpdateAfterOriAllocateRegisters, ReportBeforeRegAlloc = ReportBeforeRegisterAllocation, ReportAfterRegAlloc = ReportAfterRegisterAllocation, OriSplitHiPressLR = OriSplitHighPressureLiveRanges, AnalyzeUniformsForSpec = AnalyzeUniformsForSpeculation, InsertPseudoUseDefConvUR = InsertPseudoUseDefForConvUR, ConvMemToRegOrUniform = ConvertMemoryToRegisterOrUniform, ApplyPostSyncWars = ApplyPostSyncronizationWars.

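| Bin | DUMPIR# | Phase Name | W# | D | Bin | Phase Name | W# | D |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | OriCheckInitialProgram | 0 | 0 | 80 | OriDoRemat | 69 | +11 |
| 1 | 1 | ApplyNvOptRecipes | 1 | 0 | 81 | OriPropagateVaryingSecond | 70 | +11 |
| 2 | 2 | PromoteFP16 | 2 | 0 | 82 | OptimizeSyncInstructions | 71 | +11 |
| 3 | 3 | AnalyzeControlFlow | 3 | 0 | 83 | AdvPhLateExpandSync | 135 | DISP |
| 4 | 4 | AdvPhBeforeConvUnSup | 4 | 0 | 84 | LateExpandSyncInstructions | 72 | +12 |
| 5 | 5 | ConvertUnsupportedOps | 5 | 0 | 85 | ConvertAllMovPhiToMov | 73 | +12 |
| 6 | 6 | SetControlFlowOpLastInBB | 6 | 0 | 86 | ConvertToUniformReg | 74 | +12 |
| 7 | 7 | AdvPhAfterConvUnSup | 7 | 0 | 87 | LateArchOptimizeFirst | 75 | +12 |
| 8 | -- | UpdateAfterConvUnSupOps | 132 | DISP | 88 | UpdateAfterOptimize | 76 | +12 |
| 9 | 10 | OriCreateMacroInsts | 8 | +1 | 89 | AdvPhLateConvUnSup | 77 | +12 |
| 10 | 11 | ReportInitialRepresentation | 9 | +1 | 90 | LateExpUnSupportedOps | 78 | +12 |
| 11 | 12 | EarlyOriSimpleLiveDead | 10 | +1 | 91 | LateMergeEquivCondFlow | 136 | DISP |
| 12 | 13 | ReplaceUniformsWithImm | 11 | +1 | 92 | OriHoistInvariantsLate2 | 79 | +13 |
| 13 | 14 | OriSanitize | 12 | +1 | 93 | LateExpUnSupOpsMid | 137 | DISP |
| 14 | 16 | GeneralOptimizeEarly | 13 | +1 | 94 | ExpandJmxComputation | 80 | +14 |
| 15 | -- | MergeEquivCondFlow | 133 | DISP | 95 | LateArchOptimizeSecond | 81 | +14 |
| 16 | 18 | DoSwitchOptFirst | 14 | +2 | 96 | AdvPhBackPropVReg | 82 | +14 |
| 17 | 19 | OriBranchOpt | 15 | +2 | 97 | OriBackCopyPropagate | 83 | +14 |
| 18 | 20 | OriPerformLiveDeadFirst | 16 | +2 | 98 | OriSplitHiPressLR | 138 | DISP |
| 19 | 21 | OptimizeBindlessHeaderLoads | 17 | +2 | 99 | OriPerformLiveDeadFourth | 84 | +15 |
| 20 | 23 | OriLoopSimplification | 18 | +2 | 100 | OriPropagateGmma | 85 | +15 |
| 21 | 24 | OriSplitLiveRanges | 19 | +2 | 101 | InsertPseudoUseDefConvUR | 86 | +15 |
| 22 | -- | OriCopyProp | -- | SKIP | 102 | FixupGmmaSequence | 87 | +15 |
| 23 | 26 | PerformPGO | 20 | +3 | 103 | LateEnforceArgRestr | -- | SKIP |
| 24 | 27 | OriStrengthReduce | 21 | +3 | 104 | OriHoistInvariantsLate3 | 88 | +16 |
| 25 | 28 | OriLoopUnrolling | 22 | +3 | 105 | AdvPhSetRegAttr | 89 | +16 |
| 26 | 29 | GenerateMovPhi | 23 | +3 | 106 | OriSetRegisterAttr | 90 | +16 |
| 27 | 30 | OriPipelining | 24 | +3 | 107 | OriCalcDependantTex | 91 | +16 |
| 28 | 31 | StageAndFence | 25 | +3 | 108 | AdvPhAfterSetRegAttr | 92 | +16 |
| 29 | 33 | OriRemoveRedundantBarriers | 26 | +3 | 109 | LateExpUnSupportedOps2 | 93 | +16 |
| 30 | 34 | AnalyzeUniformsForSpec | 27 | +3 | 110 | FinalInspectionPass | 94 | +16 |
| 31 | 35 | SinkRemat | 28 | +3 | 111 | SetAfterLegalization | 95 | +16 |
| 32 | -- | OptimizeNaNOrZero | -- | SKIP | 112 | ReportBeforeScheduling | 96 | +16 |
| 33 | 38 | GeneralOptimize | 29 | +4 | 113 | AdvPhPreSched | 97 | +16 |
| 34 | 39 | DoSwitchOptSecond | 30 | +4 | 114 | ScheduleInstructions | -- | SKIP |
| 35 | 40 | OriLinearReplacement | 31 | +4 | 115 | UpdateAfterSchedInstr | -- | SKIP |
| 36 | 42 | CompactLocalMemory | 32 | +4 | 116 | BackPropagateVEC2D | 98 | +18 |
| 37 | -- | ConvMemToRegOrUniform | -- | SKIP | 117 | OriDoSyncronization | 99 | +18 |
| 38 | 44 | OriPerformLiveDeadSecond | 33 | +5 | 118 | UpdateAfterOriDoSync | -- | SKIP |
| 39 | 45 | ExtractShaderConstsFirst | 34 | +5 | 119 | ApplyPostSyncWars | 100 | +19 |
| 40 | 46 | OriHoistInvariantsEarly | 35 | +5 | 120 | ReportBeforeRegAlloc | -- | SKIP |
| 41 | -- | Vectorization | -- | SKIP | 121 | AdvPhAllocReg | 101 | +20 |
| 42 | 48 | EmitPSI | 36 | +6 | 122 | AllocateRegisters | -- | SKIP |
| 43 | 49 | GeneralOptimizeMid | 37 | +6 | 123 | ReportAfterRegAlloc | 102 | +21 |
| 44 | 50 | OptimizeNestedCondBranches | 38 | +6 | 124 | UpdateAfterOriAllocReg | -- | SKIP |
| 45 | 51 | ConvertVTGReadWrite | 39 | +6 | 125 | Get64bRegComponents | 103 | +22 |
| 46 | 53 | DoVirtualCTAExpansion | 40 | +6 | 126 | AdvPhPostExpansion | 104 | +22 |
| 47 | 54 | MarkAdditionalColdBlocks | 41 | +6 | 127 | PostExpansion | -- | SKIP |
| 48 | 55 | ExpandMbarrier | 42 | +6 | 128 | ApplyPostRegAllocWars | 105 | +23 |
| 49 | 56 | ForwardProgress | 43 | +6 | 129 | AdvPhPostSched | 106 | +23 |
| 50 | 58 | OptimizeUniformAtomic | 44 | +6 | 130 | OriRemoveNopCode | 107 | +23 |
| 51 | 59 | MidExpansion | 45 | +6 | 131 | OptimizeHotColdInLoop | 108 | +23 |
| 52 | -- | AdvPhAfterMidExpansion | 134 | DISP | 132 | OptimizeHotColdFlow | 109 | +23 |
| 53 | 61 | GeneralOptimizeMid2 | 46 | +7 | 133 | PostSchedule | 110 | +23 |
| 54 | 62 | AdvPhEarlyEnforceArgs | 47 | +7 | 134 | AdvPhPostFixUp | 111 | +23 |
| 55 | 63 | EnforceArgumentRestrictions | 48 | +7 | 135 | PlaceBlocksInSourceOrder | 112 | +23 |
| 56 | 64 | GvnCse | 49 | +7 | 136 | PostFixForMercTargets | 113 | +23 |
| 57 | -- | OriCommoning | -- | SKIP | 137 | FixUpTexDepBarAndSync | 114 | +23 |
| 58 | 66 | OriReassociateAndCommon | 50 | +8 | 138 | AdvScoreboardsAndOpexes | 115 | +23 |
| 59 | 67 | ExtractShaderConstsFinal | 51 | +8 | 139 | ProcessO0WaitsAndSBs | 116 | +23 |
| 60 | 68 | OriReplaceEquivMultiDefMov | 52 | +8 | 140--158 | (19 late-pipeline phases) | 140--158 | 0 |
| 61 | 70 | OriPropagateVaryingFirst | 53 | +8 | | | | |
| 62 | 71 | OriDoRematEarly | 54 | +8 | | | | |
| 63 | 72 | LateExpansion | 55 | +8 | | | | |
| 64 | 74 | SpeculativeHoistComInsts | 56 | +8 | | | | |
| 65 | 75 | RemoveASTToDefaultValues | 57 | +8 | | | | |
| 66 | 76 | GeneralOptimizeLate | 58 | +8 | | | | |
| 67 | 78 | OriLoopFusion | 59 | +8 | | | | |
| 68 | 79 | DoVTGMultiViewExpansion | 60 | +8 | | | | |
| 69 | -- | OriSimpleLiveDead | -- | SKIP | | | | |
| 70 | 81 | OriPerformLiveDeadThird | 61 | +9 | | | | |
| 71 | 82 | OriRemoveRedundantMultiDefMov | 62 | +9 | | | | |
| 72 | 84 | OriDoPredication | 63 | +9 | | | | |
| 73 | -- | LateVectorization | -- | SKIP | | | | |
| 74 | 86 | LateOriCommoning | 64 | +10 | | | | |
| 75 | 87 | GeneralOptimizeLate2 | 65 | +10 | | | | |
| 76 | 88 | OriHoistInvariantsLate | 66 | +10 | | | | |
| 77 | -- | SinkCodeIntoBlock | -- | SKIP | | | | |
| 78 | 90 | DoKillMovement | 67 | +11 | | | | |
| 79 | 92 | DoTexMovement | 68 | +11 | | | | |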

Phases 140--158 are identity-mapped (wiki number = binary index). The full list appears in Stage 10 below. Note that binary 139 (ProcessO0WaitsAndSBs) appears at BOTH wiki 116 (in Stage 7) and wiki 139 (in Stage 10).

16 Phases Missing from Wiki Numbering

These binary phases have no wiki number. All are valid DUMPIR and DisablePhases targets.

| Bin | Default Order | Name | Cat | Pipeline Position |
|---|---|---|---|---|
| 22 | 25 | OriCopyProp | Opt | Between OriSplitLiveRanges [21] and PerformPGO [23]; sub-pass of all 6 GeneralOptimize bundles |
| 32 | 36 | OptimizeNaNOrZero | Opt | Between SinkRemat [31] and GeneralOptimize [33]; NaN/zero constant folding |
| 37 | 43 | ConvertMemoryToRegisterOrUniform | Opt | Between CompactLocalMemory [36] and OriPerformLiveDeadSecond [38]; knob 487 gated |
| 41 | 47 | Vectorization | Opt | Between OriHoistInvariantsEarly [40] and EmitPSI [42]; load/store vectorization |
| 57 | 65 | OriCommoning | Opt | Between GvnCse [56] and OriReassociateAndCommon [58]; commoning sub-pass |
| 69 | 80 | OriSimpleLiveDead | Opt | Between DoVTGMultiViewExpansion [68] and OriPerformLiveDeadThird [70]; quick DCE |
| 73 | 85 | LateVectorization | Opt | Between OriDoPredication [72] and LateOriCommoning [74]; 2nd vectorization pass |
| 77 | 89 | SinkCodeIntoBlock | Opt | Between OriHoistInvariantsLate [76] and DoKillMovement [78]; code sinking |
| 103 | 125 | LateEnforceArgumentRestrictions | Lower | Between FixupGmmaSequence [102] and OriHoistInvariantsLate3 [104]; late ABI enforcement |
| 114 | 137 | ScheduleInstructions | Sched | Worker for AdvancedPhasePreSched [113]; sub_8D0640 (22 KB) |
| 115 | 138 | UpdateAfterScheduleInstructions | Clean | IR refresh after scheduling; between [113] and BackPropagateVEC2D [116] |
| 118 | 143 | UpdateAfterOriDoSyncronization | Clean | IR refresh after sync insertion [117]; between [117] and ApplyPostSyncronizationWars [119] |
| 120 | 145 | ReportBeforeRegisterAllocation | Report | Diagnostic dump; between ApplyPostSyncronizationWars [119] and AdvancedPhaseAllocReg [121] |
| 122 | 147 | AllocateRegisters | RegAlloc | Worker for AdvancedPhaseAllocReg [121]; canonical allocator entry |
| 124 | 149 | UpdateAfterOriAllocateRegisters | Clean | IR refresh after regalloc; between ReportAfterRegisterAllocation [123] and Get64bRegComponents [125] |
| 127 | 152 | PostExpansion | Lower | Worker for AdvancedPhasePostExpansion [126]; post-RA expansion |

7 Displaced Phases (Wiki 132--138)

These phases exist in the binary at early/mid positions but were assigned wiki numbers 132--138 when discovered after the initial compressed numbering was established. Their true execution order follows their binary index, not their wiki number.

| Wiki # | True Binary Index | Name | Executes Between |
|---|---|---|---|
| 132 | 8 | UpdateAfterConvertUnsupportedOps | AdvancedPhaseAfterConvUnSup [7] and OriCreateMacroInsts [9] |
| 133 | 15 | MergeEquivalentConditionalFlow | GeneralOptimizeEarly [14] and DoSwitchOptFirst [16] |
| 134 | 52 | AdvancedPhaseAfterMidExpansion | MidExpansion [51] and GeneralOptimizeMid2 [53] |
| 135 | 83 | AdvancedPhaseLateExpandSyncInstructions | OptimizeSyncInstructions [82] and LateExpandSyncInstructions [84] |
| 136 | 91 | LateMergeEquivalentConditionalFlow | LateExpansionUnsupportedOps [90] and OriHoistInvariantsLate2 [92] |
| 137 | 93 | LateExpansionUnsupportedOpsMid | OriHoistInvariantsLate2 [92] and ExpandJmxComputation [94] |
| 138 | 98 | OriSplitHighPressureLiveRanges | OriBackCopyPropagate [97] and OriPerformLiveDeadFourth [99] |
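The translation rules in this section reduce to a small lookup, sketched here under the assumption that the SKIP and DISP sets are exactly the rows listed in the tables above. The function name binToWiki and the -1 sentinel for SKIP rows are illustrative choices, not anything found in the binary.

```cpp
#include <cassert>

// Binary indices with no wiki number (the 16 SKIP rows).
static const int kSkipped[] = {22, 32, 37, 41, 57, 69, 73, 77,
                               103, 114, 115, 118, 120, 122, 124, 127};

// Binary indices displaced to wiki 132..138, in ascending binary order
// (the 7 DISP rows): 8 -> 132, 15 -> 133, ..., 98 -> 138.
static const int kDisplaced[] = {8, 15, 52, 83, 91, 93, 98};

// Returns the wiki number for a binary index, or -1 for a SKIP row.
// Binary 139 is reported at its compressed position (wiki 116).
int binToWiki(int bin) {
    assert(bin >= 0 && bin <= 158);
    if (bin >= 140) return bin;                    // identity-mapped tail
    for (int i = 0; i < 7; ++i)
        if (kDisplaced[i] == bin) return 132 + i;  // displaced phases
    int omitted = 0;                               // omitted indices below bin
    for (int s : kSkipped) {
        if (s == bin) return -1;                   // no wiki number
        if (s < bin) ++omitted;
    }
    for (int d : kDisplaced)
        if (d < bin) ++omitted;
    return bin - omitted;                          // delta = omitted count
}
```

For example, `binToWiki(128)` subtracts all 23 omitted indices and yields 105, matching the +23 delta in the translation table, while `binToWiki(8)` returns the displaced wiki number 132.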

Gate Passes (AdvancedPhase)

Seventeen phase instances (16 unique gates, plus AdvancedPhaseOriPhaseEncoding appearing at both wiki 127 and 152) are conditional extension points whose isNoOp() returns true in the default vtable. They exist as insertion points for architecture backends and optimization-level overrides. When a specific SM target or -O level requires additional processing at a given pipeline position, the backend overrides the phase's vtable to provide a real execute() implementation.

Gate passes bracket major pipeline transitions. For example, phases 4 and 7 bracket ConvertUnsupportedOps (phase 5), allowing a backend to inject pre- and post-legalization logic without modifying the fixed phase table. Phase 101 (AdvancedPhaseAllocReg) is the most critical gate -- the entire register allocation subsystem is driven through this hook; the base pipeline contains no hardcoded allocator.

The naming convention is consistent: AdvancedPhase prefix followed by the pipeline position or action name. One exception is AdvancedScoreboardsAndOpexes (phase 115), which uses Advanced without Phase.

Gate Pass Worker Correspondence

All 17 gate passes fall into three categories when activated by a backend override: (A) dispatch to a named worker phase from the static name table, (B) dispatch through an SM backend vtable slot at ctx+0x630, or (C) execute a pipeline progress counter thunk that writes ctx+1552 = N. Category A gates have a named worker visible to DUMPIR. Category B dispatches through architecture-specific code in the SM backend object. Category C gates' only effect is advancing the pipeline progress counter, which downstream passes read via *(ctx+1552) > N guards.
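The Category C behaviour can be modelled in a few lines. This is a minimal sketch: pipeline_progress stands in for the field at ctx+1552, the five constants are the ctx+1552 values listed for the Category C rows of the gate table in this section, and the names OcgCtx, progressThunk, progressPast and runProgressTimeline are hypothetical.

```cpp
#include <cassert>

struct OcgCtx {
    int pipeline_progress = 0;  // stands in for the field at ctx+1552
};

// A Category C gate body: its only effect is storing its constant.
template <int N>
void progressThunk(OcgCtx* ctx) { ctx->pipeline_progress = N; }

// Downstream guard of the form *(ctx+1552) > N.
bool progressPast(const OcgCtx& ctx, int n) { return ctx.pipeline_progress > n; }

// Runs the five documented Category C stores in pipeline order and returns
// how many of them were executed; each store must raise the counter.
int runProgressTimeline(OcgCtx* ctx) {
    void (*const thunks[])(OcgCtx*) = {
        progressThunk<1>, progressThunk<2>, progressThunk<3>,
        progressThunk<14>, progressThunk<21>};
    int stores = 0;
    for (auto thunk : thunks) {
        int before = ctx->pipeline_progress;
        thunk(ctx);
        assert(ctx->pipeline_progress > before);  // monotonic timeline
        ++stores;
    }
    return stores;
}
```

After `runProgressTimeline` the counter holds 21, so a guard such as `progressPast(ctx, 14)` succeeds, whereas the same guard evaluated before the PostSched store would fail.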

Name aliases used in this table. Phase names are abbreviated to fit the Gate column (AdvPh* = AdvancedPhase*, AdvScoreboardsAndOpexes = AdvancedScoreboardsAndOpexes) per the §Numbering Discrepancy reading guide. Worker names in the Worker / Dispatch Target column appear unabbreviated.

| Gate (Wiki #) | Bin | Cat | Execute Fn | Worker / Dispatch Target | Evidence |
|---|---|---|---|---|---|
| AdvPhBeforeConvUnSup (4) | 4 | C | sub_C5F620 (7B) | ctx+1552 = 1; marks pre-legalization | P0_03 thunk table; early pipeline boundary |
| AdvPhAfterConvUnSup (7) | 7 | C | sub_C5F5A0 (7B) | ctx+1552 = 2; marks post-ConvUnSup | P0_03 thunk table; sub_752CF0 checks <= 3 |
| AdvPhEarlyEnforceArgs (47) | 54 | A | vtable dispatch | EnforceArgumentRestrictions [48] | P5_02 correspondence table; W020 "Before EnforceArgumentRestrictions" |
| AdvPhAfterMidExpansion (134) | 52 | C | sub_C5EF80 (7B) | ctx+1552 = 3; marks mid-expansion done | P0_03 thunk table; sub_752CF0 checks <= 3 |
| AdvPhLateExpandSync (135) | 83 | B | 0xC5F110 (6B) | jmp *(*(ctx+0x630))+0x168; SM backend vtable slot 360 | W029 disasm; brackets LateExpandSyncInstructions [84] |
| AdvPhLateConvUnSup (77) | 89 | B | 0xC5EA50 (13B) | jmp *(*(ctx+0x630))+0x178; SM backend vtable slot 376 | W033 disasm lines 108--111; drives LateExpansionUnsupportedOps [78] |
| AdvPhBackPropVReg (82) | 96 | B | off_22BE298 | Arch-override vtable dispatch; next phase [83] writes ctx+1552 = 9 | P1_08 vtable layout; isNoOp returns 0 (runtime-overridden to 1) |
| AdvPhSetRegAttr (89) | 105 | B | vtable dispatch | ctx+0x630 SM backend vtable; precedes OriSetRegisterAttr [90] | W020 line 407 "Before OriSetRegisterAttr" |
| AdvPhAfterSetRegAttr (92) | 108 | B | 0xC607A0 (51B) | *(*(ctx+0x630))+0x110; guarded by nullsub_170@0x7D6C80 | W029 disasm line 53; returns NOP when default impl |
| AdvPhPreSched (97) | 113 | A | vtable dispatch | ScheduleInstructions [114]; sub_8D0640 (22 KB) | P5_02 table; string "ScheduleInstructions" |
| AdvPhAllocReg (101) | 121 | A | vtable dispatch | AllocateRegisters [122] | String "Please use -knob DUMPIR=AllocateRegisters" at sub_9714E0 |
| AdvPhPostExpansion (104) | 126 | A | vtable dispatch | PostExpansion [127]; post-RA expansion dispatch | P5_02 table |
| AdvPhPostSched (106) | 129 | C | sub_C5E830 (7B) | ctx+1552 = 14; marks entry into PostSchedule worker (before phase 110) | P0_03 thunk table; adjacent to PostSchedule [110] |
| AdvPhPostFixUp (111) | 134 | A | vtable dispatch | PostFixUp [140]; ctx+0x630 vtable+0x148 | P2_14 line 85; target-specific post-fixup |
| AdvScoreboardsAndOpexes (115) | 138 | B | vtable dispatch | sub_A36360 (52 KB) + sub_A23CF0 (54 KB); O1+ only | Control word gen + DAG scheduler; -O0 uses phase 139 instead |
| AdvPhOriPhaseEncoding (127) | 152 | C | sub_C5E0B0 (7B) | ctx+1552 = 21; marks encoding boundary | P2_15 disasm; sub_8C0270 checks == 19 |

(total as tabulated: 5 type A, 6 type B, 5 type C = 16 gates)

Type A gates (5) dispatch to a named worker phase in the static name table -- valid DUMPIR/NamedPhases/DisablePhases targets. AdvPhEarlyEnforceArgs was reclassified from C to A based on P5_02 evidence: it dispatches to EnforceArgumentRestrictions [48], with LateEnforceArgumentRestrictions [103] as its late counterpart. Type B gates (6) dispatch through an SM backend vtable slot at ctx+0x630; the worker code lives in the per-SM backend object. Specific vtable offsets: +0x168 (late sync expansion), +0x178 (late unsupported ops), +0x110 (post-reg-attr, guarded by default-impl check against nullsub_170@0x7D6C80). Type C gates (5) write ctx+1552 (pipeline_progress) to values 1--21, forming a monotonically increasing timeline that 20+ downstream guards check. AdvPhPostSched was reclassified from B to C based on P0_03 evidence: sub_C5E830 is a 7-byte thunk writing ctx+1552 = 14, identical in structure to the other progress thunks.
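The guarded Type B dispatch at +0x110 can be illustrated with a toy model. Everything here is a sketch under assumptions: SmBackend is a flat array of function pointers standing in for the backend vtable reached through ctx+0x630, defaultNullsub plays the role of nullsub_170, the 64-entry table size is arbitrary, and slots are taken to be 8 bytes wide as in an x86-64 binary. The names runGuardedGate, makeDefaultBackend, toyOverride and the touched field are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

struct OcgCtx;
using BackendFn = void (*)(OcgCtx*);

void defaultNullsub(OcgCtx*) {}        // plays the role of nullsub_170

struct SmBackend {
    BackendFn vtable[64];              // toy table; real slot count not modelled
};

struct OcgCtx {
    SmBackend* backend = nullptr;      // stands in for the pointer at ctx+0x630
    int touched = 0;                   // hypothetical observable side effect
};

// Converts a byte offset such as 0x110 into an index, assuming 8-byte slots.
constexpr std::size_t slotIndex(std::size_t byteOffset) { return byteOffset / 8; }

// Guarded gate: skip the call while the slot still holds the default body.
bool runGuardedGate(OcgCtx* ctx, std::size_t byteOffset) {
    assert(ctx->backend != nullptr);
    BackendFn fn = ctx->backend->vtable[slotIndex(byteOffset)];
    if (fn == &defaultNullsub) return false;   // gate stays a NOP
    fn(ctx);
    return true;
}

void makeDefaultBackend(SmBackend* b) {
    for (auto& f : b->vtable) f = &defaultNullsub;
}

void toyOverride(OcgCtx* ctx) { ++ctx->touched; }  // stand-in backend worker
```

With a default backend `runGuardedGate(&ctx, 0x110)` reports false. Once a backend installs `toyOverride` at `slotIndex(0x110)`, the same call runs the override. Under the 8-byte assumption the byte offsets 0x168 and 0x178 correspond to indices 45 and 47.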

See Optimization Levels for per-gate activation rules.

Update Passes

Nine phases refresh data structures invalidated by preceding transformations. Six are documented at specific wiki phase numbers; three additional update phases exist in the static name table but are not yet mapped to wiki phase numbers (see Numbering Discrepancy above):

| W# | Bin | Name | Refreshes |
|---|---|---|---|
| 76 | 88 | UpdateAfterOptimize | Rebuilds IR metadata after the late optimization group |
| 125 | 150 | UpdateAfterPostRegAlloc | Rebuilds IR metadata after register allocation and post-RA fixups |
| 128 | 154 | UpdateAfterFormatCodeList | Rebuilds the code list after Mercury encoding reformats instructions |
| 132 | 8 | UpdateAfterConvertUnsupportedOps | Rebuilds IR metadata after the initial unsupported-op conversion (ConvertUnsupportedOps, binary 5) |
| 150 | 150 | UpdateAfterPostRegAlloc | Late-pipeline duplicate: rebuilds IR metadata after post-RA processing (no-op by default) |
| 154 | 154 | UpdateAfterFormatCodeList | Late-pipeline duplicate: rebuilds IR data structures after FormatCodeList (no-op by default) |
| -- | (true 115) | UpdateAfterScheduleInstructions | Refreshes IR after scheduling completes (omitted from compressed numbering) |
| -- | (true 118) | UpdateAfterOriDoSyncronization | Refreshes IR after sync insertion (omitted from compressed numbering) |
| -- | (true 124) | UpdateAfterOriAllocateRegisters | Refreshes IR after register allocation (omitted from compressed numbering) |

These are lightweight passes that call into the IR's internal consistency maintenance routines. They do not transform the IR -- they only update auxiliary data structures (liveness bitmaps, instruction lists, block layout caches) so that downstream passes see a coherent view. Phases 150 and 154 are late-pipeline duplicates whose isNoOp() returns 1 by default; they only activate when a backend requires a second update cycle. The three *(true N)* entries are in the static name table at the indicated indices but are not yet assigned wiki phase numbers.

Report Passes

Ten phases produce diagnostic output. They are no-ops unless specific debug options are enabled (e.g., --stat=phase-wise, DUMPIR, --keep):

| W# | Bin | Name | Output |
|---|---|---|---|
| 9 | 10 | ReportInitialRepresentation | Dumps the Ori IR immediately after initial lowering |
| 96 | 112 | ReportBeforeScheduling | Dumps the IR as it enters the scheduling/RA stage |
| 102 | 123 | ReportAfterRegisterAllocation | Dumps the IR after register allocation completes |
| -- | (true 120) | ReportBeforeRegisterAllocation | Dumps IR before register allocation; omitted from compressed numbering (name at 0x22BD068) |
| 126 | 151 | ReportFinalMemoryUsage | Prints memory pool consumption summary |
| 129 | 155 | DumpNVuCodeText | SASS text disassembly (cuobjdump-style) |
| 130 | 156 | DumpNVuCodeHex | Raw SASS hex dump |
| 151 | 151 | ReportFinalMemoryUsage | Late-pipeline duplicate: memory pool summary (no-op by default, isNoOp=1) |
| 155 | 155 | DumpNVuCodeText | Late-pipeline duplicate: SASS text disassembly; guarded by ctx+0x598 and ctx+0x740 |
| 156 | 156 | DumpNVuCodeHex | Late-pipeline duplicate: raw SASS hex dump; same guard as phase 155 |

Phase 131 (DebuggerBreak) is a development-only hook that triggers a breakpoint -- it is not a report pass per se, but serves a similar diagnostic purpose. Phase 157 is its late-pipeline counterpart (empty body in release builds).

GeneralOptimize Bundles

The GeneralOptimize* passes are compound optimization bundles that run multiple small transformations (copy propagation, constant folding, algebraic simplification, dead code elimination) in a fixed-point iteration until no further changes occur. They appear at 6 positions throughout the pipeline to re-clean the IR after major transformations:

| W# | Bin | Name | Position |
|---|---|---|---|
| 13 | 14 | GeneralOptimizeEarly | After initial setup, before loop passes |
| 29 | 33 | GeneralOptimize | After early loop/branch optimizations |
| 37 | 43 | GeneralOptimizeMid | After mid-level transformations |
| 46 | 53 | GeneralOptimizeMid2 | After VTG/CTA/mbarrier expansion |
| 58 | 66 | GeneralOptimizeLate | After late expansion |
| 65 | 75 | GeneralOptimizeLate2 | After predication and late commoning |

See GeneralOptimize Bundles for the sub-pass decomposition.
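The fixed-point structure of these bundles can be sketched generically. This is not the ptxas implementation: the IR is reduced to a vector of integers, the two sub-passes (halveEvens, dropZeros) are toy stand-ins for simplification and dead-code removal, and the maxRounds safety cap is an assumption added for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using ToyIR   = std::vector<int>;             // toy stand-in for the IR
using SubPass = std::function<bool(ToyIR&)>;  // returns true if it changed the IR

// Runs every sub-pass in order, repeating the whole bundle until one full
// round reports no change. Returns the number of rounds executed.
int runToFixedPoint(ToyIR& ir, const std::vector<SubPass>& passes,
                    int maxRounds = 32) {
    assert(maxRounds > 0);
    int rounds = 0;
    bool changed = true;
    while (changed && rounds < maxRounds) {
        changed = false;
        for (const SubPass& pass : passes) {
            if (pass(ir)) changed = true;
        }
        ++rounds;
    }
    return rounds;
}

// Toy "simplification": halve every even, non-zero value.
bool halveEvens(ToyIR& ir) {
    bool changed = false;
    for (int& v : ir) {
        if (v != 0 && v % 2 == 0) { v /= 2; changed = true; }
    }
    return changed;
}

// Toy "dead code elimination": remove all zero entries.
bool dropZeros(ToyIR& ir) {
    auto newEnd = std::remove(ir.begin(), ir.end(), 0);
    bool changed = newEnd != ir.end();
    ir.erase(newEnd, ir.end());
    return changed;
}
```

Starting from `{8, 3, 0}` with the passes in the order halveEvens, dropZeros, the driver needs four rounds: three that change the IR and a final one that confirms nothing is left to do, ending with `{1, 3}`. The last, change-free round is the cost of detecting the fixed point.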


O-Level Gating

Mechanism

The 159-phase pipeline does not carry any opt-level metadata on the phase objects themselves. Three binary facts establish this:

  1. Uniform phase construction. sub_C60D30 (the 159-case phase factory at 0xC60D30, 1132 lines) allocates every phase object via the same 5-line body: request 16 bytes from the pool, store a per-case vtable pointer at offset +0, store the allocator pointer at +8, return. There is no *(char*)(obj+N) = LEVEL write anywhere in the switch; every case is byte-identical except for the vtable symbol (cases 0--158 at sub_C60D30_0xc60d30.c:172--1125, tail default at 1126--1129). The 16-byte phase object therefore has room for exactly {vtable, allocator} and no inline "minimum opt level" field.

  2. The dispatch loop does not consult opt-level. sub_C64F70 (the phase iterator at 0xC64F70, 276 lines) calls each phase's isNoOp() virtual (vtable slot +0x10) twice per iteration, once before and once after execute(), and calls the phase's execute() virtual (vtable slot +0x00) unconditionally at LABEL_4:

    // sub_C64F70:86 (first isNoOp check)
    if ( (*(unsigned __int8 (**)(__int64))(*(_QWORD *)v6 + 16LL))(v6) )
      goto LABEL_4;             // skips "Before <phase>" diagnostic print
    /* ... allocate & emit "Before <name>" diagnostic string ... */
    LABEL_4:
      (**(void (***)(__int64, __int64))v6)(v6, *a1);   // execute(ctx)
    // sub_C64F70:162 (second isNoOp check)
    if ( !(*(unsigned __int8 (**)(__int64))(*(_QWORD *)v6 + 16LL))(v6) )
      /* ... allocate & emit "After <name>" diagnostic string ... */
    

    The goto LABEL_4 branch bypasses only the diagnostic-string formatting block (lines 88--159); control still falls through to execute() at line 161. isNoOp() is therefore a diagnostic-suppression flag, not an execution gate. The pre-execute call at line 86 hides the "Before" string; the post-execute call at line 162 hides the "After" string; the execute() body itself runs every iteration regardless. See Phase Manager -- Phase Dispatch Loop for the full annotated dispatch pseudocode and the isNoOp timing discussion.

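  3. The gate lives inside each execute body. Phases that honour the -O level do so via an early-return prologue in their execute() thunk. Roughly 82 wrappers in the 0xC5F7xx--0xC60Bxx range call sub_7DDB50; about 78 of them use exactly the canonical pattern shown here, and the rest are the O1 special cases listed in the wrapper scan further down: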

    // sub_C60140 (representative execute thunk, 8 bytes + prologue)
    void PhaseN::execute(ocg_ctx* ctx) {
      if ( (int)sub_7DDB50(ctx) > 1 )    // opt_level > 1 (i.e. O2+)
        sub_XXXXXX(ctx);                  // tail-call real implementation
      // else: fall through -- phase was a no-op at this O-level
    }
    

    sub_7DDB50 (the opt-level accessor at 0x7DDB50, 232 bytes) returns the cached 32-bit opt_level field at ocg_ctx + 2104 (i.e. ctx + 0x838) unless its knob-499 check fails, in which case it returns 1 and caps effective behaviour at O1. The knob-499 kill-switch and the iteration-budget counter at kv->state[35940] are documented in Optimization Levels -- Gate Accessor.

Important corollary. Because execute() is always invoked, every phase's timing record and pre-snapshot (written at sub_C64F70:72--85, before the first isNoOp() call) are also recorded. --ftime output therefore contains a row for all 159 phases in every compilation, including phases that immediately early-returned because the opt-level guard failed. Gated-off phases show near-zero elapsed time rather than being omitted.
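The corollary can be checked against a small executable model of the loop. This is a sketch only: ToyPhase, DispatchLog and dispatch are hypothetical names, the timing record is reduced to the phase name, and execute() is reduced to a counter.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct ToyPhase {
    std::string name;
    bool noOp;         // what isNoOp() would return
    int  executed;     // bumped each time execute() runs
    ToyPhase(std::string n, bool skip)
        : name(std::move(n)), noOp(skip), executed(0) {}
};

struct DispatchLog {
    std::vector<std::string> timingRows;   // one entry per dispatched phase
    std::vector<std::string> diagnostics;  // "Before"/"After" lines
};

// Mirrors the described order: timing record, optional "Before" line,
// execute(), optional "After" line. isNoOp only affects the diagnostics.
DispatchLog dispatch(std::vector<ToyPhase>& phases) {
    DispatchLog log;
    for (ToyPhase& p : phases) {
        log.timingRows.push_back(p.name);  // unconditional
        if (!p.noOp) log.diagnostics.push_back("Before " + p.name);
        ++p.executed;                      // execute() always runs
        if (!p.noOp) log.diagnostics.push_back("After " + p.name);
    }
    return log;
}
```

Dispatching one active phase and one default no-op phase produces two timing rows but only two diagnostic lines (the "Before"/"After" pair of the active phase), and both phases report `executed == 1`.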

Pseudocode for the full gate mechanism

// OCG context fields referenced by the gate
struct ocg_ctx {
    // ...
    void*      options_mgr;       // +0x680 (1664)  -- knob query vtable
    int32_t    opt_level_cached;  // +0x838 (2104)  -- parsed -O level, 0..5
    // ...
};

// sub_7DDB50 (0x7DDB50) -- opt-level accessor, called by each phase execute
int getOptLevel(ocg_ctx* ctx) {
    OptionsMgr* om  = ctx->options_mgr;           // [ctx+1664]
    auto        set = om->vtable->setOption;      // [vtbl+152]
    if (set == sub_67EB60) {
        auto isSet = om->vtable->isOptionSet;     // [vtbl+72]
        bool knob_499 = (isSet == sub_6614A0)
            ? (om->state[35928] != 0)             // direct bss read
            : isSet(om, 499);                     // virtual query
        if (!knob_499)
            return ctx->opt_level_cached;         // honour -O level
        int used = om->state[35940];              // iteration counter
        if (om->state[35936] > used) {            // budget not exhausted?
            om->state[35940] = used + 1;
            return ctx->opt_level_cached;         // honour -O level
        }
    } else if (set(om, 499, 1)) {
        return ctx->opt_level_cached;             // honour -O level
    }
    return 1;                                     // fallback: clamp to O1
}

// Per-phase execute prologue (replicated in ~82 wrapper thunks)
void Phase_execute(phase* self, ocg_ctx* ctx) {
    if ((int)getOptLevel(ctx) > 1) {              // the gate: O2+ only
        do_the_actual_work(ctx);                  // tail-call real pass
    }
    // else: phase is a runtime no-op for this compilation
}

// sub_C64F70 dispatch loop (pseudocode; isNoOp() only gates diagnostics)
void PhaseManager::dispatch(int* idx, int n) {
    for (int i = 0; i < n; i++) {
        phase* p = phases[idx[i]];
        append_timing_record(p);                  // unconditional
        take_pre_snapshot();                      // unconditional
        if (!p->isNoOp()) print("Before " + p->name);
        p->execute(ctx);                          // ALWAYS called
        if (!p->isNoOp()) print("After "  + p->name);
    }
}
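For experimentation, the fast path of the accessor pseudocode above can be turned into a runnable model. This sketch follows the pseudocode literally (direct state reads, no virtual fallback) and makes no claim beyond it. Knob499State and getOptLevelModel are hypothetical names, and the three fields stand in for om->state[35928], om->state[35936] and om->state[35940].

```cpp
#include <cassert>

struct Knob499State {
    bool set    = false;  // stands in for om->state[35928] != 0
    int  budget = 0;      // stands in for om->state[35936]
    int  used   = 0;      // stands in for om->state[35940]
};

// Model of the fast path: with the knob clear the cached level is returned;
// with the knob set it is returned only while the budget lasts, and each
// such call consumes one unit. Afterwards the result is clamped to 1.
int getOptLevelModel(int cachedLevel, Knob499State& knob) {
    assert(cachedLevel >= 0 && cachedLevel <= 5);
    if (!knob.set) return cachedLevel;
    if (knob.budget > knob.used) {
        ++knob.used;
        return cachedLevel;
    }
    return 1;
}
```

With the knob set and a budget of 2, the first two calls at a cached level of 3 return 3 and every later call returns 1, so a wrapper testing `> 1` would stop letting phases through from the third query onward.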

The matrix is regular (two-bucket structure at the wrapper layer)

Scanning all phase wrappers in 0xC5F7xx--0xC60Bxx (the per-phase execute thunks) for calls to sub_7DDB50:

| Gate predicate | Wrappers | Meaning |
|---|---|---|
| (none -- wrapper unconditionally calls implementation) | ~50 | Phase runs at every -O level |
| (int)sub_7DDB50(ctx) > 1 | ~78 | Phase runs at O2, O3, O4, O5 |
| (unsigned int)sub_7DDB50(ctx) == 1 && knob_235 (or similar guarded O1 path) | 3--4 | Phase runs at O1 only when an auxiliary knob is set |
| sub_7DDB50(ctx) > 1, or *(ctx+1424) == 199 together with sub_7DDB50(ctx) == 1 | 1 | Phase 58 GeneralOptimizeLate -- O2+ or O1 with option-31 extended value 199 (see General Optimize) |

Zero layer-1 wrappers use the thresholds > 0 (would mean "O1+"), > 2 ("O3+"), > 3 ("O4+"), or > 4 ("O5 only"). Fine-grained opt-level branching (e.g. opt_level <= 2 in sub_78DB70, <= 3 in sub_914B40, > 2 in sub_8FB5D0 / sub_9FC860 / sub_9F8C00, > 3 in sub_137EE50) happens inside the implementation bodies, after the wrapper has already let control through. Those internal decisions toggle sub-algorithms (e.g. forward vs. reverse scheduler pass, loop-peeling depth, remat strategy) rather than enabling or disabling the phase as a whole.

The phase-to-O-level activity matrix is therefore regular: apart from the handful of O1 special cases in the table above, the layer-1 wrapper either runs the phase at every level, or gates it at exactly one threshold (opt_level > 1). Per-phase irregularity exists only at layer 2 -- inside the implementations that the wrappers call. This collapses the "159 phases × 6 opt-levels" table to a two-column classification at the outer dispatch layer:

+-------------------------------+-------------------------------------+
|  Category A: always-run       |  Category B: O2+ only               |
|  (no sub_7DDB50 in wrapper)   |  (wrapper: sub_7DDB50(ctx) > 1)     |
+-------------------------------+-------------------------------------+
|  * All reporting/dump phases  |  * GVN-CSE, LICM, rematerialization |
|  * All validation phases      |  * Loop unrolling, software pipe    |
|  * All legalization phases    |  * Predication / if-conversion      |
|  * All Mercury/SASS encoding  |  * Switch/branch optimization       |
|  * All register-allocation    |  * Sync-instruction optimization    |
|  * All AdvancedPhase gates    |  * Barrier removal                  |
|  * All pseudo/expansion phases|  * Backward copy propagation        |
|  * Initial setup & cleanup    |  * Speculative hoisting, peephole   |
+-------------------------------+-------------------------------------+
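The two-bucket classification, plus the single phase-58 exception, can be written as a small predicate. This is a sketch under assumptions: optLevel is the value sub_7DDB50 would return, ext1424 stands in for the value read at ctx+1424, the phase-58 form follows this page's description of the predicate, and the 3--4 knob-guarded O1 wrappers are left out. The enum and function names are hypothetical.

```cpp
#include <cassert>

enum class WrapperGate {
    AlwaysRun,         // Category A: no sub_7DDB50 call in the wrapper
    O2Plus,            // Category B: sub_7DDB50(ctx) > 1
    O2PlusOrO1Ext199   // phase 58 form: O2+, or O1 with extended value 199
};

// Returns true when the layer-1 wrapper would call the implementation.
bool wrapperLetsThrough(WrapperGate gate, int optLevel, int ext1424 = 0) {
    assert(optLevel >= 0 && optLevel <= 5);
    switch (gate) {
        case WrapperGate::AlwaysRun:
            return true;
        case WrapperGate::O2Plus:
            return optLevel > 1;
        case WrapperGate::O2PlusOrO1Ext199:
            return optLevel > 1 || (optLevel == 1 && ext1424 == 199);
    }
    return false;  // unreachable for valid enum values
}
```

Evaluating the predicate for levels 2 through 5 gives identical answers for every gate kind, which is the wrapper-level reason why those levels differ only inside the implementation bodies.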

Concrete -O0 vs -O3 phase lists

Resolving the gate with opt_level = 0 (i.e. sub_7DDB50 returns 0) against the 159-phase pipeline and the Category-B wrappers identified above:

At -O0, the following phases early-return (runtime no-ops): Phase 14 DoSwitchOptFirst (gate sub_C5F720), 15 OriBranchOpt (sub_C5F950), 22 OriLoopUnrolling, 24 OriPipelining, 26 OriRemoveRedundantBarriers, 28 SinkRemat, 30 DoSwitchOptSecond (sub_C5FC80), 38 OptimizeNestedCondBranches (sub_C5FA70), 49 GvnCse, 54 OriDoRematEarly, 58 GeneralOptimizeLate (sub_C603E0, unless option-31 override), 63 OriDoPredication, 69 OriDoRemat, 71 OptimizeSyncInstructions, 72 LateExpandSyncInstructions, 95 SetAfterLegalization, 99 OriDoSyncronization, 100 ApplyPostSyncronizationWars, 110 PostSchedule, 115 AdvancedScoreboardsAndOpexes, and ~60 other Category-B phases. At -O0 the scheduling subsystem does still run phase 116 ProcessO0WaitsAndSBs, which performs the conservative-scoreboard insertion that makes O0 code actually executable -- phase 116 is itself a Category-A wrapper that dispatches to sub_C5E2A0 only when the target architecture has sm_version > 0x3FFF.

At -O3 (the default), every Category-A wrapper runs, and every Category-B wrapper also runs because sub_7DDB50 returns 3 which satisfies > 1. The difference between -O2 and -O3 at the wrapper level is therefore zero phases -- both levels activate the same 159 wrappers. The -O2/-O3 distinction happens entirely inside the implementation bodies (e.g. scheduling direction in sub_8D0640, which branches on opt_level > 2). The same is true for -O3 vs. -O4 vs. -O5: identical layer-1 wrapper activation, different internal algorithm selection. Only the -O0 and -O1 thresholds produce layer-1 visible skips.

This two-tier design explains why the wiki's "O-Level" column in the 159-phase table below is sparse: most phases have no entry because they always run (Category A) or because the visible O-level branching is buried inside a layer-2 implementation and does not show up at the phase wrapper at all.

See Optimization Levels for the confirmed per-phase threshold list, the detailed sub_7DDB50 accessor breakdown, knob 499 kill-switch semantics, the NvOpt recipe system, and the scheduler/RA-specific opt-level interactions.


Complete 159-Phase Table

Coverage status (audit 2026-05-18). Of the 159 phases, 92 GREEN rows carry a cross-reference to a dedicated detail page (column Detail Page populated, link verified to resolve to an existing file). 27 YELLOW rows carry no dedicated page but ARE covered by an in-page anchor section: the 10 RED gate hooks documented in §Gate Passes, the 4 RED update phases in §Update Passes, the 7 RED report/debugger phases in §Report Passes, and the 6 nullsub-bodied late-pipeline phases (150, 151, 152, 154, 157, 158) in §Phase-by-phase deep dive (139--158). The remaining 40 RED rows have neither a detail page nor an in-page anchor section beyond their single-row description.

Of those 40 RED rows, the highest-leverage targets are the lowering/expansion phases (ConvertVTGReadWrite 39, DoVirtualCTAExpansion 40, ForwardProgress 43, ExpandJmxComputation 80, EnforceArgumentRestrictions 48, ExtractShaderConsts 34/51, BackPropagateVEC2D 98) and the validation/setup phases (OriCheckInitialProgram 0, PromoteFP16 2, AnalyzeControlFlow 3, OriCreateMacroInsts 8, OriSanitize 12, FinalInspectionPass 94).

New detail pages should follow the structure of Strength Reduction or Varying Propagation (front-matter table, algorithm pseudocode, data flow, function map, cross-refs).

Ranked-by-impact targets for the next hunter-resolver wave (composite score = implementation size in bytes + 100 × outgoing callees, derived by walking the factory sub_C60D30 jump table at 0x22BBEB8, reading vtable slot 0 from 0x22BCC78 .. 0x22BEE50, and cross-referencing ptxas_functions.json):

  1. OriLinearReplacement (wiki 31 / bin 35) -- DOCUMENTED, see Linear Replacement. Confirmed eleven-case opcode dispatch (cases 2, 5, 98, 110, 112, 130, 139, 141, 195, 213) over IADD/IMAD/SHL/SEL/MOV chains, collapsing affine fragments into single LEA-style IADD3/IMAD.WIDE. 29-bucket memoization cache at sub_7EC080. Gated on knob 487 (master) + knob 416 (per-iteration); SM-tier gate ((*(ctx+1368) & 2) + *(ctx+896) ∈ {4,5}) enables the Blackwell datacenter IMAD.WIDE half-fold variant only.
  2. ExtractShaderConstsFirst / ExtractShaderConstsFinal (wiki 34, 51 / bin 39, 59) — share the same implementation sub_1C72640, 4,582 bytes, 37 callees. Two pipeline positions, one algorithm parameterised by call site; the row descriptions hide the deduplication. Confidence: HIGH (caller table shows both wrappers land on 0x1C72640).
  3. DoKillMovement / DoTexMovement / two unnamed siblings (wiki 67, 68 / bin 78, 79 plus two more) — DOCUMENTED, see Instruction Movement Engine. Four wrappers (sub_C5FE00, sub_C5FE30, sub_C5FE60, sub_C5FE90) all tail-call sub_8FFDE0 (573 B, 37 BBs, 9 callees) with esi ∈ {0, 1, 2, 3}. Discriminator selects direction (kill=down, tex=up) and post-cleanup mode (sub_785E20 skipped for esi=3). All four share the single "HoistInvariants" named-phase token; engine requires opt-level > O2 despite wrappers only checking > O1. Confidence: HIGH.
  4. OriRemoveRedundantMultiDefMov (wiki 62 / bin 71) — impl sub_90A340, 1,670 bytes, 21 callees. Sibling of OriReplaceEquivMultiDefMov (wiki 52, impl sub_8FBCF0, 580 bytes). Together they form the multi-def-MOV cleanup subsystem operating on SSA-destruction artefacts. Confidence: HIGH.
  5. ApplyNvOptRecipes (wiki 1 / bin 1) — impl sub_796D60, 1,484 bytes, 22 callees, 2 callers (the phase wrapper at 0xC5F6E0 AND sub_8F4D80). The dual-caller pattern means the NvOptRecipe engine is invoked both as a pipeline phase and re-entered later in the pipeline — a quirk the existing one-line description hides entirely. Confidence: HIGH.
  6. AnalyzeControlFlow (wiki 3 / bin 3) -- DOCUMENTED, see AnalyzeControlFlow. Wrapper sub_C60870 (89 B, O1+knob-235 gated, inverted-polarity test) tail-calls impl sub_781F80 (8,335 B, 454 BBs, 51 callees, 131 callers). The 131-caller fan-in is the highest of any "RED" phase impl: this is the canonical CFG-rebuild routine reused by every pipeline phase that needs a fresh dominator/back-edge map (loop passes, predication, register allocator with 18 force=1 calls). Re-annotates +144 RPO, +152 dual-use latch-RPO/loop-exit, +280 bits 0x10/0x20/0x800000 (LOOP_HEADER / HAS_PRED / IN_LOOP). Confidence: HIGH.
  7. GenerateMovPhi (wiki 23 / bin 26) — wrapper sub_C60BD0 is 334 B (vs the typical 34 B thunk), containing inline SSA-pair construction (two call *(rax+0x18) allocator slots, six vtable jmps) before tail-calling impl sub_790A40 (2,288 B, 33 callees). The unusually large wrapper indicates per-call argument marshalling for the phi-insertion engine that the one-line table description hides. Confidence: HIGH.
  8. OriCheckInitialProgram (wiki 0 / bin 0) — wrapper sub_C5F6D0 (11 B jmp 0x7EEB10) — calls a 164 B validator (sub_7EEB10, 3 callees), not the NvOptRecipe engine. (Earlier-cycle confusion: sub_796D60 appears in disassembly windows but belongs to the next wrapper at 0xC5F6E0.) The validator is small but invokable independently — a candidate for a short "validation pass" anchor section together with OriSanitize (bin 13) and FinalInspectionPass (bin 110). Confidence: HIGH.
  9. FinalInspectionPass (wiki 94 / bin 110) -- wrapper sub_C60B30 (62 B) with 3 vtable jmps plus a static call to sub_788A30 (927 B, 6 callees, 1 recorded caller: this phase). The same impl also appears to be reached from the PlaceBlocksInSourceOrder wrapper (bin 135), which the single-caller count does not yet reflect; if that xref holds, sub_788A30 is a shared post-legalization invariant-checker rather than a layout pass -- another deduplication candidate parallel to ExtractShaderConsts and Kill/Tex movement. Confidence: MEDIUM-HIGH (needs xref into bin 135 wrapper to confirm).
  10. OriCreateMacroInsts (wiki 8 / bin 9) — wrapper sub_C5F8D0 tail-calls sub_A112C0 (870 B, 4 callees, 10 callers). Ten-caller fan-in places this in the same "shared utility" tier as AnalyzeControlFlow: macro-instruction emission is reused from ConvertUnsupportedOps, MidExpansion, LateExpansion, and several Mercury stages. Confidence: HIGH.
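The composite score defined above (implementation size in bytes + 100 × outgoing callees) can be recomputed from the sizes and callee counts quoted in the list and in the main table. This is a minimal sketch: the `composite_score` helper is illustrative, and using only the impl size (not the wrapper size) for each entry is an assumption.

```python
def composite_score(impl_bytes: int, callees: int) -> int:
    """Composite score as stated: size in bytes + 100 x outgoing callees."""
    return impl_bytes + 100 * callees


# (label, impl size in bytes, outgoing callees), figures quoted on this page.
TARGETS = [
    ("OriLinearReplacement (sub_7EC4B0)", 7084, 71),
    ("ExtractShaderConsts (sub_1C72640)", 4582, 37),
    ("Movement engine (sub_8FFDE0)", 573, 9),
    ("OriRemoveRedundantMultiDefMov (sub_90A340)", 1670, 21),
    ("ApplyNvOptRecipes (sub_796D60)", 1484, 22),
    ("AnalyzeControlFlow (sub_781F80)", 8335, 51),
    ("GenerateMovPhi (sub_790A40)", 2288, 33),
    ("OriCheckInitialProgram (sub_7EEB10)", 164, 3),
    ("FinalInspectionPass (sub_788A30)", 927, 6),
    ("OriCreateMacroInsts (sub_A112C0)", 870, 4),
]

scores = {label: composite_score(size, callees) for label, size, callees in TARGETS}
for label, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:6d}  {label}")
```

Sorting by this score alone does not reproduce the order of the list exactly (AnalyzeControlFlow, for example, scores above several entries ranked ahead of it), so the published ranking presumably folds in judgment beyond the raw formula.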

Reclassification: dispatch-only phases. Seven phases listed in the original RED set carry zero static impl — their wrappers consist entirely of arch-vtable indirect jumps (jmp *(*(ctx+0x630))+0xN) gated against null-stub sentinels at sub_7D6Cxx. The "real" body lives in per-SM target code that is not visible to function-level callgraph analysis. These should be re-categorised as "dispatch-only" (parallel to the Type-B gate hooks) rather than as undocumented lowering algorithms: ConvertVTGReadWrite (bin 45), DoVirtualCTAExpansion (bin 46), ForwardProgress (bin 49), DoVTGMultiViewExpansion (bin 68), ApplyPostRegAllocWars (bin 128), PostFixForMercTargets (bin 136), EmitPSI (bin 42). The downstream documentation effort for these belongs in the per-SM backend page, not in a phase-detail page.
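The dispatch-only pattern can be pictured as a wrapper that looks up a slot in the per-SM target vtable and does nothing when that slot still holds the no-override sentinel. The sketch below is a model only: the dictionary-based vtable, the `nullsub` stand-in for the sub_7D6Cxx sentinels, and the slot offset used in the example are all hypothetical.

```python
from typing import Callable, Dict


def nullsub(ctx: dict) -> None:
    """Stand-in for a no-override sentinel stub (does nothing)."""
    return None


def dispatch_only_wrapper(target_vtable: Dict[int, Callable[[dict], None]],
                          slot_offset: int,
                          ctx: dict) -> bool:
    """Model of a wrapper with no static body of its own.

    Returns True if a per-SM override was found and invoked, False if the
    slot still holds the sentinel and the phase is effectively skipped.
    """
    handler = target_vtable.get(slot_offset, nullsub)
    if handler is nullsub:
        return False
    handler(ctx)
    return True


# Hypothetical target that overrides one slot; the offset is illustrative.
EXAMPLE_SLOT = 0x110
ctx = {"log": []}
sm_with_override = {EXAMPLE_SLOT: lambda c: c["log"].append("per-SM body ran")}
sm_without_override = {EXAMPLE_SLOT: nullsub}

ran_a = dispatch_only_wrapper(sm_with_override, EXAMPLE_SLOT, ctx)
ran_b = dispatch_only_wrapper(sm_without_override, EXAMPLE_SLOT, ctx)
print(ran_a, ran_b, ctx["log"])
```

The point of the model is that a function-level callgraph sees only the wrapper and the sentinel comparison; the body that does the work is reachable only through the target's vtable.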

See SURGICAL FIXES below for additional impl-address breadcrumbs added to the table rows themselves. The translation table in Numbering Discrepancy had stale Bin# columns for several rows; the main 159-phase table below is authoritative (verified against the static name table at off_22BD0C0).

YELLOW-row anchor map. Some rows in the table below have an empty Detail Page cell but are nonetheless covered by an in-page anchor section elsewhere on this page. The mapping is:

  • Gate phases (Type A/B/C AdvancedPhase* hooks) → §Gate Passes: phases 4, 7, 47, 77, 82, 89, 92, 97, 101, 104, 106, 111, 127, 134, 135. The §Gate Passes section gives the full Type A/B/C classification, execute thunk address, worker / vtable-slot dispatch target, and ctx+1552 progress-counter value for each gate. Wiki phases 77, 82, 97, 101, 135 also have a Detail Page link (GREEN); the rest are covered only by the §Gate Passes anchor.
  • Update phases → §Update Passes: phases 76, 125, 128, 132, 150, 154. The §Update Passes section explains that these refresh auxiliary IR data structures (liveness bitmaps, instruction lists, block layout caches) without transforming the IR.
  • Report phases → §Report Passes: phases 9, 96, 102, 126, 129, 130, 131, 151, 155, 156. The §Report Passes section explains that these are no-ops unless specific debug knobs (DUMPIR, --stat=phase-wise, --keep) are enabled, and that 151/155/156 are nullsubs in release.
  • Late-pipeline nullsubs / state-trackers → §Phase-by-phase deep dive (139--158): phases 150, 151, 152, 154, 157, 158. The deep-dive gives the execute thunk address, body byte count, isNoOp() return value, and the reason each slot is preserved (ABI compatibility with debug builds, or terminal dispatch sentinel).
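The Type C gates listed on this page each write one constant into the progress counter at ctx+1552 (0x610). The sketch below replays those writes to show that the counter only moves forward. It is a model under two assumptions: the context is represented as a plain dictionary keyed by byte offset, and the gates are taken to execute in Bin# order.

```python
PROGRESS_OFFSET = 0x610  # byte offset of the progress counter; 0x610 == 1552

# (Bin#, gate name, value written), copied from the Type C rows on this page.
TYPE_C_GATES = [
    (152, "AdvancedPhaseOriPhaseEncoding", 21),
    (4, "AdvancedPhaseBeforeConvUnSup", 1),
    (129, "AdvancedPhasePostSched", 14),
    (7, "AdvancedPhaseAfterConvUnSup", 2),
    (52, "AdvancedPhaseAfterMidExpansion", 3),
]


def run_type_c_gates(ctx: dict) -> list:
    """Apply each gate in Bin# order and return the values the counter took."""
    history = []
    for _bin_index, _name, value in sorted(TYPE_C_GATES):
        ctx[PROGRESS_OFFSET] = value  # models a store of `value` to ctx+0x610
        history.append(value)
    return history


ctx = {}
history = run_type_c_gates(ctx)
print(history, ctx[PROGRESS_OFFSET])
```

Only the gates whose written value appears in the tables are included; other writers of the counter (for example phase 83, which the table says writes 9) are left out of this sketch.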

Stage 1 -- Initial Setup (Phases 0--13)

Program validation, recipe application, FP16 promotion, control flow analysis, unsupported-op conversion, macro creation, initial diagnostics.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 0 | 0 | OriCheckInitialProgram | Validation | | Validates structural correctness of the initial Ori IR after PTX lowering | |
| 1 | 1 | ApplyNvOptRecipes | Optimization | | Applies NvOptRecipe transformations (option 391, 440-byte sub-manager); impl sub_796D60 (1,484 B, 22 callees). ⚡ Dual-caller quirk: the implementation has two callers -- the phase-1 wrapper at 0xC5F6E0 AND sub_8F4D80 (mid-pipeline re-entry), so recipes apply both as a phase and as a callable sub-pass. Stub. | |
| 2 | 2 | PromoteFP16 | Lowering | | Promotes FP16 operations to FP32 where hardware lacks native support | |
| 3 | 3 | AnalyzeControlFlow | Analysis | | Builds the CFG: identifies loops, dominators, back edges; wrapper sub_C60870 (89 B, O1+knob-235 gated) tail-calls impl sub_781F80 (8,335 B, 51 callees, 131 callers -- shared CFG-rebuild routine reused by every pipeline phase that requires a fresh dominator/back-edge map) | AnalyzeControlFlow |
| 4 | 4 | AdvancedPhaseBeforeConvUnSup | Gate | | Type C: sub_C5F620 writes ctx+1552 = 1; pre-legalization boundary | §Gate Passes |
| 5 | 5 | ConvertUnsupportedOps | Lowering | | Replaces operations not natively supported on the target SM with equivalent sequences | Late Legalization |
| 6 | 6 | SetControlFlowOpLastInBB | Cleanup | | Ensures control flow instructions are the final instruction in each basic block | |
| 7 | 7 | AdvancedPhaseAfterConvUnSup | Gate | | Type C: sub_C5F5A0 writes ctx+1552 = 2; post-ConvUnSup boundary | §Gate Passes |
| 8 | 9 | OriCreateMacroInsts | Lowering | | Expands PTX-level macro instructions into Ori instruction sequences | |
| 9 | 10 | ReportInitialRepresentation | Reporting | | Dumps the Ori IR for debugging (no-op unless DUMPIR enabled) | |
| 10 | 11 | EarlyOriSimpleLiveDead | Optimization | | Quick early dead code elimination pass | Liveness |
| 11 | 12 | ReplaceUniformsWithImm | Optimization | | Replaces uniform register reads with immediate constants where value is known | Uniform Regs |
| 12 | 13 | OriSanitize | Validation | | Validates IR consistency after initial setup transformations | |
| 13 | 14 | GeneralOptimizeEarly | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (early) | GeneralOptimize |

Stage 2 -- Early Optimization (Phases 14--32)

Branch/switch optimization, loop canonicalization, strength reduction, software pipelining, SSA phi insertion, barrier optimization.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 14 | 16 | DoSwitchOptFirst | Optimization | > 1 | Optimizes switch statements: jump table generation, case clustering (1st pass); wrapper sub_C5F720 | Branch & Switch |
| 15 | 17 | OriBranchOpt | Optimization | > 1 | Branch folding, unreachable block elimination, conditional branch simplification; wrapper sub_C5F950 | Branch & Switch |
| 16 | 18 | OriPerformLiveDeadFirst | Analysis | | Full liveness analysis + dead code elimination (1st of 4 major instances) | Liveness |
| 17 | 19 | OptimizeBindlessHeaderLoads | Optimization | | Hoists and deduplicates bindless texture header loads | |
| 18 | 20 | OriLoopSimplification | Optimization | 4--5 | Canonicalizes loops: single entry, single back-edge, preheader insertion; aggressive loop peeling at O4+ | Loop Passes |
| 19 | 21 | OriSplitLiveRanges | Optimization | | Splits live ranges at loop boundaries to reduce register pressure | Liveness |
| 20 | 23 | PerformPGO | Optimization | | Applies profile-guided optimization data (block weights, branch probabilities) | |
| 21 | 24 | OriStrengthReduce | Optimization | | Replaces expensive operations (multiply, divide) with cheaper equivalents (shift, add) | Strength Reduction |
| 22 | 25 | OriLoopUnrolling | Optimization | > 1 | Unrolls loops based on trip count and register pressure heuristics | Loop Passes |
| 23 | 26 | GenerateMovPhi | Lowering | | Inserts SSA phi nodes as MOV.PHI pseudo-instructions; wrapper sub_C60BD0 is unusually large (334 B vs 34 B typical) with inline SSA-pair allocation before tail-call to impl sub_790A40 (2,288 B, 33 callees) | |
| 24 | 27 | OriPipelining | Optimization | > 1 | Software pipelining: overlaps loop iterations to hide latency | Loop Passes |
| 25 | 28 | StageAndFence | Lowering | | Inserts memory fence and staging instructions for coherence | Sync & Barriers |
| 26 | 29 | OriRemoveRedundantBarriers | Optimization | > 1 | Eliminates barrier instructions proven redundant by data-flow analysis | Sync & Barriers |
| 27 | 30 | AnalyzeUniformsForSpeculation | Analysis | | Analyzes constant bank accesses for speculation safety across control flow | Uniform Regs |
| 28 | 31 | SinkRemat | Optimization | > 1 / > 4 | Sinks instructions closer to uses and marks remat candidates; O2+: basic; O5: full cutlass | Rematerialization |
| 29 | 33 | GeneralOptimize | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid-early) | GeneralOptimize |
| 30 | 34 | DoSwitchOptSecond | Optimization | > 1 | Second switch optimization pass after loop/branch transformations; wrapper sub_C5FC80 | Branch & Switch |
| 31 | 35 | OriLinearReplacement | Optimization | | Multi-pattern affine-expression linearizer; eleven-case opcode dispatch collapses IADD/IMAD/SHL/SEL/MOV chains into single LEA-style IADD3/IMAD.WIDE; impl sub_7EC4B0 (7,084 B, 241 BBs, 71 callees); knobs 487 + 416; 29-bucket memoization cache | Linear Replacement |
| 32 | 36 | CompactLocalMemory | Optimization | | Compacts local memory allocations by eliminating dead slots and reordering | |

Stage 3 -- Mid-Level Optimization (Phases 33--52)

GVN-CSE, reassociation, shader constant extraction, CTA/VTG expansion, argument enforcement.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 33 | 38 | OriPerformLiveDeadSecond | Analysis | | Full liveness analysis + DCE (2nd instance, post-early-optimization cleanup) | Liveness |
| 34 | 39 | ExtractShaderConstsFirst | Optimization | | Identifies uniform values loadable from constant memory instead of per-thread computation (1st pass); impl sub_1C72640 (4,582 B, 37 callees -- shared with phase 51 ExtractShaderConstsFinal; one engine, two pipeline positions; called via sub_C5FDA0 with is_final_pos = 0, the finalize sub-step is skipped) | Shader Const Extraction |
| 35 | 40 | OriHoistInvariantsEarly | Optimization | | Loop-invariant code motion: hoists invariant computations out of loops (early) | Loop Passes |
| 36 | 42 | EmitPSI | Lowering | | Emits PSI (Pixel Shader Input) interpolation setup for graphics shaders | |
| 37 | 43 | GeneralOptimizeMid | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid) | GeneralOptimize |
| 38 | 44 | OptimizeNestedCondBranches | Optimization | > 1 | Simplifies nested conditional branches into flatter control flow; wrapper sub_C5FA70 | Branch & Switch |
| 39 | 45 | ConvertVTGReadWrite | Lowering | | Converts vertex/tessellation/geometry shader read/write operations | |
| 40 | 46 | DoVirtualCTAExpansion | Lowering | | Expands virtual CTA operations into physical CTA primitives | |
| 41 | 47 | MarkAdditionalColdBlocks | Analysis | | Marks basic blocks as cold based on heuristics and profile data | Hot/Cold |
| 42 | 48 | ExpandMbarrier | Lowering | | Expands MBARRIER pseudo-instructions into native barrier sequences | Sync & Barriers |
| 43 | 49 | ForwardProgress | Lowering | | Inserts instructions guaranteeing forward progress (prevents infinite stalls) | |
| 44 | 50 | OptimizeUniformAtomic | Optimization | | Converts thread-uniform atomic operations into warp-level reductions | |
| 45 | 51 | MidExpansion | Lowering | | Target-dependent mid-level expansion of operations before register allocation | Late Legalization |
| 46 | 53 | GeneralOptimizeMid2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (mid 2nd) | GeneralOptimize |
| 47 | 54 | AdvancedPhaseEarlyEnforceArgs | Gate | | Type A: dispatches to EnforceArgumentRestrictions [48]; late counterpart LateEnforceArgumentRestrictions [103] | §Gate Passes |
| 48 | 55 | EnforceArgumentRestrictions | Lowering | | Enforces ABI restrictions on function arguments (register classes, alignment) | |
| 49 | 56 | GvnCse | Optimization | > 1 | Global value numbering combined with common subexpression elimination | Copy Prop & CSE |
| 50 | 58 | OriReassociateAndCommon | Optimization | | Reassociates expressions for better commoning opportunities, then eliminates commons | Copy Prop & CSE |
| 51 | 59 | ExtractShaderConstsFinal | Optimization | | Final shader constant extraction pass (after GVN may expose new constants); impl shared with phase 34 = sub_1C72640 (4,582 B); called via sub_C5FDD0 with is_final_pos = 1, the finalize sub-step sub_1C68760 runs and commits bank allocations | Shader Const Extraction |
| 52 | 60 | OriReplaceEquivMultiDefMov | Optimization | | Eliminates redundant multi-definition move instructions with equivalent sources | |

Stage 4 -- Late Optimization (Phases 53--77)

Predication, rematerialization, loop fusion, varying propagation, sync optimization, phi destruction, uniform register conversion.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 53 | 61 | OriPropagateVaryingFirst | Optimization | | Propagates varying (non-uniform) annotations to identify divergent values (1st pass) | Varying Propagation |
| 54 | 62 | OriDoRematEarly | Optimization | > 1 | Early rematerialization: recomputes cheap values near uses to reduce register pressure | Rematerialization |
| 55 | 63 | LateExpansion | Lowering | | Expands operations that must be lowered after high-level optimizations | Late Legalization |
| 56 | 64 | SpeculativeHoistComInsts | Optimization | | Speculatively hoists common instructions above branches | |
| 57 | 65 | RemoveASTToDefaultValues | Cleanup | | Removes AST (address space type) annotations that have been lowered to defaults | |
| 58 | 66 | GeneralOptimizeLate | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late) | GeneralOptimize |
| 59 | 67 | OriLoopFusion | Optimization | | Fuses adjacent loops with compatible bounds and no inter-loop dependencies | Loop Passes |
| 60 | 68 | DoVTGMultiViewExpansion | Lowering | | Expands multi-view operations for vertex/tessellation/geometry shaders | |
| 61 | 70 | OriPerformLiveDeadThird | Analysis | | Full liveness analysis + DCE (3rd instance, post-late-optimization) | Liveness |
| 62 | 71 | OriRemoveRedundantMultiDefMov | Optimization | | Removes dead multi-definition move instructions | |
| 63 | 72 | OriDoPredication | Optimization | > 1 | If-conversion: converts short conditional branches into predicated instructions | Predication |
| 64 | 74 | LateOriCommoning | Optimization | | Late commoning pass: eliminates common subexpressions exposed by predication | Copy Prop & CSE |
| 65 | 75 | GeneralOptimizeLate2 | Optimization | | Compound pass: copy prop + const fold + algebraic simplify + DCE (late 2nd) | GeneralOptimize |
| 66 | 76 | OriHoistInvariantsLate | Optimization | | LICM: hoists loop-invariant code (late, after predication may expose new invariants) | Loop Passes |
| 67 | 78 | DoKillMovement | Optimization | > 2 | Sinks kill markers downward toward last use to regularize allocator spill heuristic; wrapper sub_C5FE00 (34 B) tail-calls shared engine sub_8FFDE0 (573 B) with esi=0, direction +1 | Instruction Movement |
| -- | -- | sibling A (unnamed) | Optimization | > 2 | Kill-class upward movement; wrapper sub_C5FE30 (34 B), engine esi=1, direction −1; shares "HoistInvariants" named-phase token with DoKillMovement (no separate DUMPIR target) | Instruction Movement |
| 68 | 79 | DoTexMovement | Optimization | > 2 | Hoists texture fetches upward toward latest safe hoist point to hide latency; wrapper sub_C5FE60 (34 B), engine esi=2, direction −1; post-cleanup via sub_785E20 | Instruction Movement |
| -- | -- | sibling B (unnamed) | Optimization | > 2 | TEX-class upward movement without post-cleanup (skips sub_785E20); wrapper sub_C5FE90 (34 B), engine esi=3, direction −1; shares "HoistInvariants" named-phase token with DoTexMovement | Instruction Movement |
| 69 | 80 | OriDoRemat | Optimization | > 1 | Late rematerialization: recomputes values exposed by predication and fusion | Rematerialization |
| 70 | 81 | OriPropagateVaryingSecond | Optimization | | Propagates varying annotations (2nd pass, after predication changes control flow) | Varying Propagation |
| 71 | 82 | OptimizeSyncInstructions | Optimization | > 1 | Eliminates and simplifies synchronization instructions | Sync & Barriers |
| 72 | 84 | LateExpandSyncInstructions | Lowering | > 2 | Expands sync pseudo-instructions into final hardware sequences | Sync & Barriers |
| 73 | 85 | ConvertAllMovPhiToMov | Lowering | | Destroys SSA form: converts MOV.PHI instructions into plain MOV | |
| 74 | 86 | ConvertToUniformReg | Optimization | | Converts qualifying values from general registers (R) to uniform registers (UR) | Uniform Regs |
| 75 | 87 | LateArchOptimizeFirst | Optimization | | Architecture-specific late optimizations (1st pass) | |
| 76 | 88 | UpdateAfterOptimize | Cleanup | | Rebuilds IR metadata invalidated by the late optimization group | |
| 77 | 89 | AdvancedPhaseLateConvUnSup | Gate | | Type B: 0xC5EA50 dispatches ctx+0x630 vtable+0x178 (slot 376); drives LateExpansionUnsupportedOps [78] | Late Legalization |
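The four movement wrappers in this stage differ only in the discriminator they pass in esi to the shared engine sub_8FFDE0. The lookup below summarises that parameterisation. It is a descriptive sketch, not decompiled code: the `MovementMode` type is hypothetical, and marking post-cleanup as enabled for esi 0 and 1 is an inference from the statement that sub_785E20 is skipped for esi=3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovementMode:
    phase: str
    wrapper: str
    direction: int       # +1 = sink downward, -1 = hoist upward
    post_cleanup: bool   # whether the sub_785E20 cleanup step runs


# Keyed by the esi discriminator passed to the shared engine.
MOVEMENT_MODES = {
    0: MovementMode("DoKillMovement", "sub_C5FE00", +1, True),
    1: MovementMode("sibling A (unnamed)", "sub_C5FE30", -1, True),
    2: MovementMode("DoTexMovement", "sub_C5FE60", -1, True),
    3: MovementMode("sibling B (unnamed)", "sub_C5FE90", -1, False),
}


def describe(esi: int) -> str:
    mode = MOVEMENT_MODES[esi]
    verb = "sinks" if mode.direction > 0 else "hoists"
    cleanup = "with" if mode.post_cleanup else "without"
    return f"esi={esi}: {mode.phase} {verb} via {mode.wrapper}, {cleanup} post-cleanup"


for esi in sorted(MOVEMENT_MODES):
    print(describe(esi))
```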

Stage 5 -- Legalization (Phases 78--96)

Late unsupported-op expansion, backward copy propagation, GMMA fixup, register attributes, final validation.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 78 | 90 | LateExpansionUnsupportedOps | Lowering | | Expands remaining unsupported operations after all optimizations | Late Legalization |
| 79 | 92 | OriHoistInvariantsLate2 | Optimization | | LICM (late 2nd pass) after unsupported-op expansion | Loop Passes |
| 80 | 94 | ExpandJmxComputation | Lowering | | Expands JMX (jump with index computation) pseudo-instructions | |
| 81 | 95 | LateArchOptimizeSecond | Optimization | | Architecture-specific late optimizations (2nd pass) | |
| 82 | 96 | AdvancedPhaseBackPropVReg | Gate | | Type B: arch-override vtable dispatch (off_22BE298); next phase [83] writes ctx+1552 = 9 | Copy Prop & CSE |
| 83 | 97 | OriBackCopyPropagate | Optimization | | Backward copy propagation: propagates values backward through move chains | Copy Prop & CSE |
| 84 | 99 | OriPerformLiveDeadFourth | Analysis | | Full liveness analysis + DCE (4th instance, pre-legalization cleanup) | Liveness |
| 85 | 100 | OriPropagateGmma | Optimization | | Propagates WGMMA accumulator values through the IR | GMMA Pipeline |
| 86 | 101 | InsertPseudoUseDefForConvUR | Lowering | | Inserts pseudo use/def instructions for uniform register conversion bookkeeping | Uniform Regs |
| 87 | 102 | FixupGmmaSequence | Lowering | | Fixes WGMMA instruction sequences for hardware ordering constraints | GMMA Pipeline |
| 88 | 104 | OriHoistInvariantsLate3 | Optimization | | LICM (late 3rd pass) after GMMA fixup | Loop Passes |
| 89 | 105 | AdvancedPhaseSetRegAttr | Gate | | Type B: ctx+0x630 SM backend vtable dispatch; precedes OriSetRegisterAttr [90] | |
| 90 | 106 | OriSetRegisterAttr | Analysis | | Annotates registers with scheduling attributes (latency class, bank assignment) | Scheduling |
| 91 | 107 | OriCalcDependantTex | Analysis | | Computes texture instruction dependencies for scheduling | |
| 92 | 108 | AdvancedPhaseAfterSetRegAttr | Gate | | Type B: 0xC607A0 dispatches ctx+0x630 vtable+0x110; guarded by nullsub_170@0x7D6C80 | |
| 93 | 109 | LateExpansionUnsupportedOps2 | Lowering | | Second late unsupported-op expansion (catches ops exposed by GMMA/attr passes) | Late Legalization |
| 94 | 110 | FinalInspectionPass | Validation | | Final IR validation gate: catches illegal patterns before irreversible scheduling/RA | |
| 95 | 111 | SetAfterLegalization | Cleanup | > 1 | Sets post-legalization flag on the compilation context | |
| 96 | 112 | ReportBeforeScheduling | Reporting | | Dumps IR before scheduling (no-op unless diagnostic options enabled) | |

Stage 6 -- Scheduling & Register Allocation (Phases 97--103)

Synchronization insertion, WAR fixup, register allocation, 64-bit register handling.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 97 | 113 | AdvancedPhasePreSched | Gate | | Hook before scheduling; when active, dispatches to ScheduleInstructions (sub_8D0640, true table index 114) | Scheduling |
| 98 | 116 | BackPropagateVEC2D | Optimization | | Backward-propagates 2D vector register assignments | |
| 99 | 117 | OriDoSyncronization | Scheduling | > 1 | Inserts synchronization instructions (BAR, DEPBAR, MEMBAR) per GPU memory model | Sync & Barriers |
| 100 | 119 | ApplyPostSyncronizationWars | Scheduling | > 1 | Fixes write-after-read hazards exposed by sync insertion | Sync & Barriers |
| 101 | 121 | AdvancedPhaseAllocReg | Gate | | Register allocation driver hook; when active, dispatches to AllocateRegisters (true table index 122); DUMPIR=AllocateRegisters targets this | RegAlloc Architecture |
| 102 | 123 | ReportAfterRegisterAllocation | Reporting | | Dumps IR after register allocation (no-op unless diagnostic options enabled) | |
| 103 | 125 | Get64bRegComponents | RegAlloc | | Splits 64-bit register pairs into 32-bit components for architectures that require it | RegAlloc Architecture |

Stage 7 -- Post-RA & Post-Scheduling (Phases 104--116)

Post-expansion, NOP removal, hot/cold optimization, block placement, scoreboard generation.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 104 | 126 | AdvancedPhasePostExpansion | Gate | | Hook before post-RA expansion worker; when active, dispatches to PostExpansion (true table index 127) | |
| 105 | 128 | ApplyPostRegAllocWars | RegAlloc | | Fixes write-after-read hazards exposed by register allocation | |
| 106 | 129 | AdvancedPhasePostSched | Gate | | Type C: sub_C5E830 writes ctx+1552 = 14; runs before post-scheduling worker (phase 110 PostSchedule) | |
| 107 | 130 | OriRemoveNopCode | Cleanup | | Removes NOP instructions and dead code inserted as placeholders | |
| 108 | 131 | OptimizeHotColdInLoop | Optimization | | Separates hot and cold paths within loops for cache locality | Hot/Cold |
| 109 | 132 | OptimizeHotColdFlow | Optimization | | Separates hot and cold paths at the function level | Hot/Cold |
| 110 | 133 | PostSchedule | Scheduling | > 0 | Post-RA re-scheduling: sub_C60640 (51 B) gates sub_7DDB50(ctx) > 1 (sm_80+), then tail-calls ctx[+0x630]->[+0x10]->vtable[+0x90] (sub-target post-schedule hook); guards against nullsub_45@0x680190 no-override sentinel. Modern path lands in Backend C sub_1908D90; legacy fallback in sub_A97600 (42 KB) | PostSchedule |
| 111 | 134 | AdvancedPhasePostFixUp | Gate | | Hook before post-fixup worker; when active, dispatches to PostFixUp (phase 140, target vtable+0x148) | |
| 112 | 135 | PlaceBlocksInSourceOrder | Cleanup | | Determines final basic block layout in the emitted binary | |
| 113 | 136 | PostFixForMercTargets | Encoding | | Fixes up instructions for Mercury encoding requirements | Mercury |
| 114 | 137 | FixUpTexDepBarAndSync | Scheduling | | Fixes texture dependency barriers and sync instructions post-scheduling | Scoreboards |
| 115 | 138 | AdvancedScoreboardsAndOpexes | Gate | > 0 | Full scoreboard generation: computes 23-bit control word per instruction (-O1+); no-op at -O0 | Scoreboards |
| 116 | 139 | ProcessO0WaitsAndSBs | Scheduling | == 0 | Conservative scoreboard insertion for -O0: maximum stalls, barriers at every hazard | Scoreboards |

Scoreboard generation has two mutually exclusive paths. At -O1 and above, phase 115 (AdvancedScoreboardsAndOpexes) runs the full dependency analysis using sub_A36360 (52 KB) and sub_A23CF0 (54 KB DAG list scheduler), while phase 116 is a no-op. At -O0, phase 115 is a no-op and phase 116 inserts conservative stall counts.
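The mutual exclusion between the two scoreboard paths reduces to a single test on the optimization level. The helper below is a minimal sketch of that selection; the function name is illustrative and it models only which phase does the work, not the scoreboard algorithm itself.

```python
def active_scoreboard_phase(opt_level: int) -> int:
    """Return the wiki phase number that generates scoreboards.

    Phase 116 (ProcessO0WaitsAndSBs) handles -O0 only; phase 115
    (AdvancedScoreboardsAndOpexes) handles -O1 and above. The other phase
    is a no-op in each case.
    """
    if opt_level < 0:
        raise ValueError("optimization level must be non-negative")
    return 116 if opt_level == 0 else 115


for level in range(0, 4):
    print(f"-O{level}: phase {active_scoreboard_phase(level)} runs")
```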

Stage 8 -- Mercury Backend (Phases 117--122)

SASS instruction encoding, expansion, WAR generation, opex computation, microcode emission.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 117 | 142 | MercEncodeAndDecode | Encoding | | Converts Ori instructions to Mercury encoding, then round-trip decodes for verification | Mercury |
| 118 | 143 | MercExpandInstructions | Encoding | | Expands pseudo-instructions into final SASS instruction sequences | Mercury |
| 119 | 144 | MercGenerateWARs1 | Encoding | | Generates write-after-read hazard annotations (1st pass, pre-expansion) | Mercury |
| 120 | 145 | MercGenerateOpex | Encoding | | Generates "opex" (operation extension) annotations for each instruction | Mercury |
| 121 | 146 | MercGenerateWARs2 | Encoding | | Generates WAR annotations (2nd pass, covers hazards introduced by expansion) | Mercury |
| 122 | 147 | MercGenerateSassUCode | Encoding | | Produces the final SASS microcode bytes (the actual binary encoding) | Mercury |

"Mercury" is NVIDIA's internal name for the SASS encoding framework. WAR generation runs in two passes (119, 121), one on each side of MercGenerateOpex (120) and both after instruction expansion (118), so that write-after-read hazards introduced after the first pass are re-annotated by the second. The MercConverter infrastructure (sub_9F1A90, 35 KB) drives instruction-level legalization via a visitor pattern dispatched through sub_9ED2D0 (25 KB opcode switch).

Stage 9 -- Post-Mercury (Phases 123--131)

Register map computation, diagnostics, debug output.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 123 | 148 | ComputeVCallRegUse | RegAlloc | | Computes register usage for virtual call sites | |
| 124 | 149 | CalcRegisterMap | RegAlloc | | Computes the final physical-to-logical register mapping emitted as EIATTR metadata | RegAlloc Architecture |
| 125 | 150 | UpdateAfterPostRegAlloc | Cleanup | | Rebuilds IR metadata after post-RA processing | |
| 126 | 151 | ReportFinalMemoryUsage | Reporting | | Prints memory pool consumption summary to stderr | |
| 127 | 152 | AdvancedPhaseOriPhaseEncoding | Gate | | Type C: sub_C5E0B0 writes ctx+1552 = 21; marks encoding boundary | |
| 128 | 154 | UpdateAfterFormatCodeList | Cleanup | | Rebuilds the code list after Mercury encoding reformats instructions | |
| 129 | 155 | DumpNVuCodeText | Reporting | | Dumps human-readable SASS text disassembly | |
| 130 | 156 | DumpNVuCodeHex | Reporting | | Dumps raw SASS binary as hex | |
| 131 | 157 | DebuggerBreak | Cleanup | | Development hook: triggers a debugger breakpoint at this pipeline position | |

Stage 10 -- Late Cleanup & Late Pipeline (Phases 132--158)

Late merge operations, late unsupported-op expansion, high-pressure live range splitting, Mercury encoding pipeline, register map computation, diagnostics, and debug hooks.

| # | Bin# | Phase Name | Category | O-Level | Description | Detail Page |
|---|---|---|---|---|---|---|
| 132 | 8 | UpdateAfterConvertUnsupportedOps | Cleanup | | Rebuilds IR metadata after late unsupported-op conversion | |
| 133 | 15 | MergeEquivalentConditionalFlow | Optimization | | Merges basic blocks with equivalent conditional flow (tail merging) | |
| 134 | 52 | AdvancedPhaseAfterMidExpansion | Gate | | Type C: sub_C5EF80 writes ctx+1552 = 3; marks mid-expansion done | |
| 135 | 83 | AdvancedPhaseLateExpandSyncInstructions | Gate | | Type B: 0xC5F110 dispatches ctx+0x630 vtable+0x168 (slot 360) | Sync & Barriers |
| 136 | 91 | LateMergeEquivalentConditionalFlow | Optimization | | Second conditional flow merge pass (catches cases exposed by late transforms) | |
| 137 | 93 | LateExpansionUnsupportedOpsMid | Lowering | | Mid-late unsupported-op expansion (between the two merge passes) | Late Legalization |
| 138 | 98 | OriSplitHighPressureLiveRanges | RegAlloc | | Last-resort live range splitter when register pressure exceeds hardware limits | RegAlloc Architecture |

Columns: # (wiki number) = Bin# (binary factory index) for all late-pipeline phases (no renumbering gap from phase 139 onward). Execute = address of the vtable-slot-0 execute(ctx) function allocated by sub_C60D30 (factory cases 139--158 at lines 1006--1125); worker / tail-call target addresses are listed in the Description column. Gate = the runtime predicate checked inside execute (if any) before the body runs; "unconditional" means the execute body has no skip branch. SM activation is all unless marked otherwise -- the only phase with an SM-version check in its execute body is 139.

| # = Bin# | Phase Name | Category | Execute | Gate | Description (algorithm / nullsub status / cross-ref) |
|---|---|---|---|---|---|
| 139 | ProcessO0WaitsAndSBs | Scheduling | sub_C5E2A0 (41 B) | target[+0x174] > 0x3FFF (sm50+) | On sm50+, tail-calls target.vtable[+0x150] (ApplyConservativeScoreboards) with edx=1 (O0-mode flag) to insert maximum-wait scoreboards on every instruction; pre-sm50 targets fall through to ret. isNoOp=0. See Scoreboards and deep-dive below. |
| 140 | PostFixUp | Cleanup | sub_C5E270 (13 B) | unconditional | Tail-calls target.vtable[+0x148]; each Mercury target installs a target-specific post-fixup method (texture-barrier placement on Volta, scoreboard packing on Turing+, etc.), non-Mercury targets install a nullptr-safe stub. isNoOp=0. |
| 141 | MercConverter | Encoding | sub_C60300 (8 B thunk) -> sub_9F3760 | cu[+1398] & 0x20 (bit 5 inside body) | Second MercConverter pass re-lowering PTX-derived opcodes introduced by optimization (rematerialization, peephole, loop xforms); dispatches on target.sm_code to the per-generation path, then routes through sub_9F1A90 / sub_9ED2D0 (the 35 KB opcode dispatcher reused from phase 5 ConvertUnsupportedOps). After completion every IR instruction carries a valid SASS opcode. See Mercury: MercConverter. |
| 142 | MercEncodeAndDecode | Encoding | sub_C60310 (8 B thunk) -> sub_18F21F0 | ctx[+0x571] & 0x02 (bit 1) AND ctx[+0x788] != NULL (Mercury ctx present) | Encodes every Ori IR node to a Mercury node via the master encoder sub_6D9690, then round-trip-decodes each node to verify the binary encoding is reversible. After this phase all subsequent pipeline stages operate on Mercury nodes exclusively. See Mercury: Stage 1 MercEncodeAndDecode. |
| 143 | MercExpandInstructions | Encoding | sub_C60320 (16 B) -> sub_C3DFC0 (102 B orchestrator) | ctx[+0x570] & 0x20 (bit 5) | Expands compound Mercury pseudo-instructions -- multi-word branches, multi-step LDG/STG sequences, sm_120 TCGEN05 macros -- into their primitive SASS sequences. sub_C3DFC0 walks the Mercury list via sub_C3CC60 and invokes each node's vtable[+0x40] Expand hook. See Mercury: Stage 2 MercExpandInstructions. |
| 144 | MercGenerateWARs1 | Encoding | sub_C60340 (16 B) -> sub_6FC240 (47 B) | ctx[+0x570] & 0x80 (bit 7, js opcode test) | First WAR (write-after-read) hazard annotation pass -- walks the Mercury node list and tags each consumer with the stall counts the target's hazard model requires. Runs before MercGenerateOpex (145). See Mercury: Stage 3 MercGenerateWARs. |
| 145 | MercGenerateOpex | Encoding | sub_C60380 (16 B) -> sub_7032A0 (472 B) | ctx[+0x570] & 0x40 (bit 6) | Generates Opex (operand-exchange) annotations -- the per-instruction control bits that tell the hardware which physical register bank to read each operand from, required by the sm_90+ banked-register-file microarchitecture to avoid read-port conflicts. See Mercury: Stage 4 MercGenerateOpex. |
| 146 | MercGenerateWARs2 | Encoding | sub_C60360 (16 B) -> sub_6FC240 | ctx[+0x570] & 0x80 (bit 7) | Second WAR pass, byte-for-byte identical to phase 144 except for the vtable store. Two WAR passes bracket Opex (145) because Opex can rewrite operand banks and thereby introduce new write-to-read distances that must be re-annotated. See Mercury: Stage 3 MercGenerateWARs. |
| 147 | MercGenerateSassUCode | Encoding | sub_C603A0 (16 B) -> sub_6EEE90 (1472 B) -> sub_6E4110 (24 KB) | ctx[+0x571] & 0x01 (bit 0) | Terminal Mercury stage: walks the fully-annotated Mercury node list and emits the final SASS binary microcode bytes that end up in the ELF .text section. sub_6EEE90 is a 0x110-byte stack-scratch wrapper that calls sub_6E8EB0 for per-function setup, then hands off to the 24 KB emitter sub_6E4110. See Mercury: Stage 5 MercGenerateSassUCode and SASS Printing. |
| 148 | ComputeVCallRegUse | RegAlloc | sub_C5E160 (13 B) | unconditional | Tail-calls target.vtable[+0x2B8] to compute register usage at virtual call sites (indirect calls, function pointers). The result is written into the target-side register-use tracker and later emitted as EIATTR_EXTERNS / EIATTR_INDIRECT_BRANCH_TARGETS metadata so the CUDA runtime can honour conservative register budgets for callees whose footprint is unknown at compile time. |
| 149 | CalcRegisterMap | RegAlloc | sub_C603C0 (32 B) -> sub_95A350 (6456 B) | ctx[+0x590] & 0x02 (bit 1) | Computes the final physical-to-logical register mapping that gets emitted as EIATTR_REGCOUNT / EIATTR_MIN_STACK_SIZE metadata. The execute thunk indirects through ctx.target[+0x18] (the SM-specific sub-target) before tail-calling sub_95A350 (the actual mapping builder). The map is needed by the CUDA driver to inflate saved contexts during preemption and by NVRTC for relocation. See RegAlloc Architecture. |
| 150 | UpdateAfterPostRegAlloc | Cleanup | nullsub_630 at 0xC5E110 (2 B) | -- | True no-op in release ptxas. Empty repz ret body; isNoOp() = 1 (sub_C5E130, 6 B) suppresses the "Before/After" diagnostic frame. Slot retained for ABI compatibility with debug builds where the body would be PhaseManager::RebuildAfterPostRegAlloc. |
| 151 | ReportFinalMemoryUsage | Reporting | nullsub_629 at 0xC5E0E0 (2 B) | -- | True no-op. isNoOp() = 1 (sub_C5E100). Debug builds would dump the memory-arena high-water mark to stderr here; release strips the body entirely. |
| 152 | AdvancedPhaseOriPhaseEncoding | Gate | sub_C5E0B0 (11 B) | unconditional | Type-C gate: movl dword [rsi+0x610], 0x15; ret -- writes pipeline_progress = 21 (the final value of the monotonic ctx[+0x610] counter). Downstream consumers: sub_8C0270 checks *(ctx+0x610) == 19; scoreboard guards check values 16--19. isNoOp() = 1 (sub_C5E0D0) because this is state-tracking, not an IR transform. |
| 153 | FormatCodeList | Encoding | sub_C5E080 (13 B) | unconditional | The one late-pipeline phase that indirects through ctx[+0x648] (the code-list / ELF-section emitter) rather than ctx[+0x630] (target) or ctx[+0x788] (Mercury). Tail-calls emitter.vtable[+0x10] -- the "format" entry point that serialises the fully-encoded instructions into the final ELF text-section layout (addresses, relocations, alignment). isNoOp=0. See Mercury. |
154UpdateAfterFormatCodeListCleanupnullsub_628 at 0xC5E050 (2 B)--True no-op. isNoOp() = 1 (sub_C5E070). Hook point kept in case a backend needs to re-sync IR metadata after FormatCodeList reorders instructions, but no release target uses it.
155DumpNVuCodeTextReportingsub_C60420 (54 B)ctx[+0x598] > 0 && ctx[+0x740] != NULL && *ctx[+0x740] != NULLGuarded by -dump_nvu_code_text=1 knob; the full gate cascade is retained, but the tail-call target 0x67FF60 resolves to nullsub_31 (2 B) -- the actual text dumper has been stripped from release ptxas, leaving the gate as an orphan that falls through to a stub. Effective no-op. See SASS Printing.
156DumpNVuCodeHexReportingsub_C60460 (~48 B)ctx[+0x598] > 0 && ctx[+0x740] != NULLMirror image of phase 155 with a simpler gate (no extra pointer indirection) and tail-call target 0x67FF50 = nullsub_30 (2 B). Same conclusion: stripped from release, orphan gate only. See SASS Printing.
157DebuggerBreakCleanupnullsub_627 at 0xC5DFE0 (2 B)--Debug-build breakpoint marker; release builds emit a bare ret. isNoOp() = 0 (sub_C5E000), so the "Before/After" diagnostic frame still fires -- useful when running ptxas under gdb with b *0xC5DFE0 because the dispatch loop will print "Before DebuggerBreak" / "After DebuggerBreak" on either side of the breakpoint.
158NOPCleanupnullsub_626 at 0xC5DFB0 (2 B)--Terminal sentinel. The 159-phase dispatch loop (sub_C64F70) iterates a1[0] .. a1[158] and needs a final slot to anchor the loop end; NOP is that anchor. isNoOp() = 0 (sub_C5DFD0), so the final "Before NOP" / "After NOP" prints appear in verbose dumps as the explicit terminator for "All Phases Summary".

Phases 139--158 are 20 late-pipeline phases whose vtable pointers range from off_22BEB80 to off_22BEE78 (40-byte stride, 20 * 0x28 bytes). All 20 have names in the static table at off_22BD0C0 (159 entries -- the earlier wiki note claiming "139 entries" was based on a compressed model that excluded these phases). Name resolution for the dispatch-loop diagnostic (sub_C64F70) goes through the static table indexed by getIndex() at vtable+8.

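Vtable layout (3 populated slots, 40 bytes per vtable, off_22BExxx). Every late-pipeline phase has exactly three virtual methods; the remaining two slots (+24 and +32) are NULL, which is why consecutive vtables sit 0x28 bytes apart: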

| Slot | Offset | Purpose | Behaviour |
|---|---|---|---|
| 0 | vtbl+0 | execute(ctx) | Entry point called by sub_C64F70 dispatch loop (LABEL_4). See per-phase details below. |
| 1 | vtbl+8 | getIndex() | Returns the constant 139--158 (mov eax, 0x8b..0x9e; ret). Index into off_22BD0C0 for the name string. Always 6 bytes. |
| 2 | vtbl+16 | isNoOp() | Either xor eax,eax; ret (3 bytes, always false) or mov eax,1; ret (6 bytes, always true). Does not skip execute -- it suppresses the "Before <phase>" / "After <phase>" diagnostic print around the call (see sub_C64F70:86, the goto LABEL_4 branch falls into execute either way). |

The phase object itself is 16 bytes: [0]=vtable*, [8]=ctx*. No per-phase instance state -- all state lives in the shared OCG context passed to every execute() call.
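The address arithmetic implied by the vtable range can be checked with a short sketch. The function and constant names are illustrative and do not come from the binary; only the base address, the 0x28-byte stride, and the 139--158 index range are taken from the text above.

```python
# Illustrative sketch: identifiers are hypothetical, numeric values come from the page.
LATE_VTABLE_BASE = 0x22BEB80   # vtable of phase 139
VTABLE_STRIDE = 0x28           # 40-byte stride between consecutive vtables
FIRST_LATE, LAST_LATE = 139, 158


def late_vtable_addr(index: int) -> int:
    """Return the vtable address for a late-pipeline phase index (139..158)."""
    if not FIRST_LATE <= index <= LAST_LATE:
        raise ValueError(f"not a late-pipeline phase: {index}")
    return LATE_VTABLE_BASE + (index - FIRST_LATE) * VTABLE_STRIDE


def get_index_immediate(index: int) -> str:
    """Render the immediate that getIndex() loads into eax for this phase."""
    return hex(index)


if __name__ == "__main__":
    for i in (139, 147, 158):
        print(i, hex(late_vtable_addr(i)), get_index_immediate(i))
```

Feeding in 139, 147, and 158 reproduces the vtable addresses quoted in the per-phase entries, which is a quick way to cross-check a pointer-table dump against the index range.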

isNoOp statistics. 16 of 20 phases return 0 from isNoOp() (diagnostics printed). Exactly 4 phases return 1 (diagnostics suppressed): 150, 151, 152, 154. Of those, three (150, 151, 154) also have nullsub execute bodies and are truly vestigial; phase 152 has an 11-byte body that writes pipeline_progress = 21 but is hidden from dumps because it is a state-tracking marker, not an IR transform.
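The isNoOp() semantics described here (diagnostics suppressed, execute still called) can be modelled as follows. This is a simplified sketch of the described behaviour, not a decompilation of sub_C64F70; the `Phase` class, `run_phases`, and the dictionary-based context are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Phase:
    index: int
    name: str
    execute: Callable[[Dict[str, int]], None]
    is_noop: bool  # value returned by isNoOp()


def run_phases(phases: List[Phase], ctx: Dict[str, int], log: List[str]) -> None:
    """Call execute() for every phase; is_noop only hides the Before/After frame."""
    for phase in phases:
        if not phase.is_noop:
            log.append(f"Before {phase.name}")
        phase.execute(ctx)  # runs regardless of is_noop
        if not phase.is_noop:
            log.append(f"After {phase.name}")


def write_progress(ctx: Dict[str, int]) -> None:
    ctx["pipeline_progress"] = 21  # models phase 152's store to ctx+0x610


def empty_body(ctx: Dict[str, int]) -> None:
    pass  # models a nullsub execute body


phases = [
    Phase(152, "AdvancedPhaseOriPhaseEncoding", write_progress, True),
    Phase(157, "DebuggerBreak", empty_body, False),
]
ctx: Dict[str, int] = {}
log: List[str] = []
run_phases(phases, ctx, log)
```

In this model phase 152 updates the context without leaving a trace in the log, while phase 157 does nothing yet still produces a Before/After pair, matching the two cases called out above.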

Phase-by-phase deep dive (139--158)

Each entry gives: execute function address, body size in bytes, vtable address, gate condition (if any), and pseudocode. All addresses are verified via (a) the factory switch in sub_C60D30 (cases 139--158 at lines 1006--1125), (b) the raw pointer-table dump at 0x22BEB80--0x22BEE78 read from .rodata, and (c) direct objdump of the .text segment.

Phase 139 ProcessO0WaitsAndSBs -- vtable=0x22BEB80 -- execute sub_C5E2A0 (41 bytes; IDA missed it, recovered via objdump). Runs on sm50+ only. Tail-dispatches to the target's ApplyConservativeScoreboards hook (vtable slot 0x150) with flag edx=1 (O0 mode). On pre-sm50 architectures (sm_3x and earlier) the phase returns immediately because the legacy shader-processor scoreboard model does not apply.

mov  rdi, [rsi+0x630]          ; rdi = ocg_ctx->target
cmp  dword [rdi+0x174], 0x3FFF ; if sm_version_encoded <= 16383 (pre-sm50)
jle  .return                   ;   skip
mov  rax, [rdi]                ; target.vtable
mov  edx, 1                    ; mode = O0
jmp  [rax+0x150]                ; target->ApplyConservativeScoreboards(ctx,1)
.return:
ret

isNoOp returns 0 (sub_C5E2E0, 3 bytes). No pipeline_progress write.

Phase 140 PostFixUp -- vtable=0x22BEBA8 -- execute sub_C5E270 (13 bytes). Unconditional target-hook dispatch. Every Mercury target registers a post-fixup method at vtable slot 0x148; non-Mercury targets install a nullptr-safe stub. The method performs target-specific cleanup after schedule and register allocation are final (examples: texture barrier placement on Volta, scoreboard packing on Turing+).

mov  rdi, [rsi+0x630]          ; target
mov  rax, [rdi]                ; target.vtable
jmp  [rax+0x148]                ; target->PostFixUp(target)

isNoOp = 0 (sub_C5E290).

Phase 141 MercConverter -- vtable=0x22BEBD0 -- execute sub_C60300 (8 bytes, thunk) -> body sub_9F3760. Second MercConverter invocation, re-running the 35 KB opcode-dispatch machinery from phase 5 (ConvertUnsupportedOps) on instructions introduced by optimization passes (rematerialization, peephole, loop transforms) that may carry unlegalized PTX-derived opcodes. Internal gate testb $0x10, [rdi+0x570] (bit 4) inside sub_9F3760 makes the body an immediate return on non-Mercury targets. When enabled the body dispatches on target.sm_code at [rdi+0x174] with the arch constants 0x9000/0x7005/0x7001/0x6001 to pick a per-generation conversion path. After completion every IR instruction carries a valid SASS opcode ready for encoding. See Mercury sub_9F1A90 / sub_9ED2D0 for the full opcode dispatch.

; execute thunk
mov  rdi, rsi                  ; rdi = ocg_ctx
jmp  0x9F3760                  ; MercConverter::Run

; sub_9F3760 prologue
testb [rdi+0x570], 0x10        ; Mercury-active bit
jz   .return
...

Phase 142 MercEncodeAndDecode -- vtable=0x22BEBF8 -- execute sub_C60310 (8 bytes, thunk) -> body sub_18F21F0. Encodes each Ori IR node to its Mercury-node form via sub_6D9690 (the master encoder), then round-trip-decodes it to verify that the binary encoding is reversible. After this phase all subsequent pipeline stages operate on Mercury nodes exclusively. Two internal gates make the body a no-op when Mercury is not the active backend: testb $0x2, [rdi+0x571] (bit 1 of the second flag byte, i.e. the high byte of the 16-bit word at ctx+0x570), followed by a NULL check on the Mercury context at ctx+0x788.

mov  rdi, rsi
jmp  0x18F21F0                 ; MercEncodeAndDecode::Run

; body prologue
testb [rdi+0x571], 0x2
jz   .return
mov  r15, [rdi+0x788]          ; Mercury context
test r15, r15
jz   .return
...

Phase 143 MercExpandInstructions -- vtable=0x22BEC20 -- execute sub_C60320 (16 bytes). Expands compound Mercury pseudo-instructions (e.g. multi-word branches, multi-step LDG/STG sequences, sm_120 TCGEN05 macros) into their SASS primitives. Gated by ctx+0x570 bit 5; the Mercury backend sets this bit during its init recipe. Tail-calls sub_C3DFC0 (102 bytes, an orchestrator that calls sub_C3CC60 to iterate the Mercury list and invokes per-instruction vtable+0x40 Expand hooks). sub_C3DFC0 also emits the "After MercExpand" diagnostic on completion.

testb [rsi+0x570], 0x20        ; bit 5: MercExpandEnable
jnz  .active
repz ret                        ; skip -- non-Mercury target
.active:
mov  rdi, [rsi+0x788]          ; rdi = Mercury context
jmp  0xC3DFC0                  ; RunMercExpandPass

Phase 144 MercGenerateWARs1 -- vtable=0x22BEC48 -- execute sub_C60340 (16 bytes). First WAR-hazard annotation pass. Walks the Mercury node list and tags each consumer with the write-after-read stall counts needed to satisfy the target's hazard model. Runs after MercExpandInstructions (143) but before MercGenerateOpex (145); the "pass-1" naming reflects that two WAR passes are needed because Opex (145) can rewrite operand banks and introduce new write-to-read distances that pass 146 then re-annotates. Gated by the sign bit (bit 7) of the byte at ctx+0x570, tested with cmpb $0, [rsi+0x570]; js.

cmpb [rsi+0x570], 0             ; js = "if signed" = bit 7 set
js   .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x6FC240                  ; RunMercWARsPass (47 bytes)
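
The cmpb/js idiom is equivalent to testing bit 7 because comparing against zero leaves the sign flag equal to the top bit of the byte. A minimal sketch that checks the equivalence exhaustively (the function names are illustrative):

```python
def js_taken_after_cmp_zero(byte_value: int) -> bool:
    """Model `cmp byte [mem], 0` then `js`: SF is the top bit of (byte - 0)."""
    result = (byte_value - 0) & 0xFF
    return (result >> 7) == 1


def bit7_set(byte_value: int) -> bool:
    """Model the equivalent mask test `byte & 0x80`."""
    return (byte_value & 0x80) != 0


# Exhaustive check over every possible flag byte.
EQUIVALENT = all(js_taken_after_cmp_zero(b) == bit7_set(b) for b in range(256))
```

The two predicates agree for all 256 byte values, so the gate can be written as ctx[+0x570] & 0x80 without loss of meaning.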

Phase 145 MercGenerateOpex -- vtable=0x22BEC70 -- execute sub_C60380 (16 bytes). Generates Opex (operand-exchange) annotations per instruction -- extra control bits that tell the hardware which physical register bank to read each operand from, required by the sm_90+ banked-register file to avoid bank conflicts. Gated by ctx+0x570 bit 6. Tail-calls sub_7032A0 (472 bytes, RunMercOpexPass). See Mercury Stage 4.

testb [rsi+0x570], 0x40
jnz  .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x7032A0

Phase 146 MercGenerateWARs2 -- vtable=0x22BEC98 -- execute sub_C60360 (16 bytes). Second WAR-hazard pass. The instruction body is identical to phase 144 (same sub_6FC240 tail-call, same bit-7 gate), byte-for-byte modulo the vtable store before it. The two invocations bracket phase 145 (Opex), which may rewrite operand banks and thereby introduce new write-to-read distances that need re-annotation.

cmpb [rsi+0x570], 0
js   .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x6FC240                  ; same entry as phase 144

Phase 147 MercGenerateSassUCode -- vtable=0x22BECC0 -- execute sub_C603A0 (16 bytes). The terminal Mercury stage: walks the fully-annotated Mercury node list and emits the final SASS binary microcode bytes that will end up in the ELF .text section. Gated by ctx+0x571 bit 0 (the lowest bit of the second flag byte). Tail-calls sub_6EEE90 (1472 bytes), which is a thin wrapper that allocates a 0x110-byte stack scratch area, invokes sub_6E8EB0 for per-function setup, then calls into sub_6E4110 (24 KB, the real emitter documented in Mercury Stage 5).

testb [rsi+0x571], 0x1
jnz  .active
repz ret
.active:
mov  rdi, [rsi+0x788]
jmp  0x6EEE90                  ; MercGenerateSassUCode::Run

Phase 148 ComputeVCallRegUse -- vtable=0x22BECE8 -- execute sub_C5E160 (13 bytes). Computes register usage at virtual call sites (indirect calls, function pointers) and stores the result in the target-side register-use tracker. The data is consumed during ELF emission as EIATTR_EXTERNS/EIATTR_INDIRECT_BRANCH_TARGETS metadata so that the CUDA runtime can honour conservative register budgets for callees whose register footprint is unknown at compile time. Unconditional; all architectures route through the target vtable slot 0x2B8.

mov  rdi, [rsi+0x630]           ; target
mov  rax, [rdi]                 ; vtable
jmp  [rax+0x2B8]                ; target->ComputeVCallRegUse(target)

Phase 149 CalcRegisterMap -- vtable=0x22BED10 -- execute sub_C603C0 (32 bytes). Computes the final physical-to-logical register mapping that gets emitted as EIATTR_REGCOUNT / EIATTR_MIN_STACK_SIZE metadata. The mapping is needed by the CUDA driver to inflate saved contexts during preemption and by NVRTC for relocation. Gated by ctx+0x590 bit 1 (register-map-export knob). The thunk loads the target from ctx+0x630, takes the SM-specific sub-target pointer at [rax+0x18] (labelled tex_or_fat_target), then tail-calls sub_95A350 (6456 bytes, the actual mapping builder).

testb [rsi+0x590], 0x2
jnz  .active
repz ret
.active:
mov  rax, [rsi+0x630]
mov  rdi, [rax+0x18]            ; target.sub_target (sm-specific)
jmp  0x95A350                   ; CalcRegisterMap body

Phase 150 UpdateAfterPostRegAlloc -- vtable=0x22BED38 -- execute nullsub_630 at 0xC5E110 (2 bytes, repz ret). True no-op in release ptxas. isNoOp() returns 1 (sub_C5E130, 6 bytes) to suppress the diagnostic frame around the call. The phase slot is kept for ABI compatibility with debug builds where the body is PhaseManager::RebuildAfterPostRegAlloc, but the release build strips it.

Phase 151 ReportFinalMemoryUsage -- vtable=0x22BED60 -- execute nullsub_629 at 0xC5E0E0 (2 bytes, repz ret). True no-op. isNoOp() = 1 (sub_C5E100). Debug builds would dump the memory-arena high-water mark to stderr here; release strips the body entirely.

Phase 152 AdvancedPhaseOriPhaseEncoding -- vtable=0x22BED88 -- execute sub_C5E0B0 (11 bytes). The single surviving late-pipeline gate hook.

movl dword [rsi+0x610], 0x15   ; pipeline_progress = 21
ret

Writes pipeline_progress = 21 (the final value of the monotonic ctx+0x610 counter; see Targets offset +1552). Downstream consumers: sub_8C0270 checks *(ctx+0x610) == 19; scoreboard guards check values 16--19. isNoOp() = 1 (sub_C5E0D0) because the write is state-tracking, not IR transformation.

Phase 153 FormatCodeList -- vtable=0x22BEDB0 -- execute sub_C5E080 (13 bytes). Indirects through a different context object than the other late phases: ctx+0x648 is the code-list / ELF-section emitter rather than ctx+0x630 (target) or ctx+0x788 (Mercury context). Tail-calls vtable+0x10 on that object -- the "format" entry point that serialises the fully-encoded instructions into the final ELF text-section layout (addresses, relocations, alignment).

mov  rdi, [rsi+0x648]           ; code-list emitter
mov  rax, [rdi]                 ; its vtable
jmp  [rax+0x10]                 ; emitter->FormatCodeList()

Phase 154 UpdateAfterFormatCodeList -- vtable=0x22BEDD8 -- execute nullsub_628 at 0xC5E050 (2 bytes, repz ret). True no-op. isNoOp() = 1 (sub_C5E070). Kept as a hook point in case a target backend needs to re-sync IR metadata after FormatCodeList reordered instructions, but no release target uses it.

Phase 155 DumpNVuCodeText -- vtable=0x22BEE00 -- execute sub_C60420 (54 bytes). The gate cascade ctx+0x598 > 0 && ctx+0x740 != NULL && *(ctx+0x740) != NULL is fully retained, so the code path is reachable when the hidden -dump_nvu_code_text=1 knob is set, but the tail-call target 0x67FF60 resolves to nullsub_31 (2 bytes) -- the actual text dumper has been stripped from release ptxas, leaving an orphan gate that falls through to a stub.

mov  eax, [rsi+0x598]           ; verbosity level
test eax, eax
jle  .skip
mov  rax, [rsi+0x740]           ; dump sink
test rax, rax
je   .skip
mov  rdi, [rax]
test rdi, rdi
je   .skip
xor  edx, edx
xor  esi, esi
jmp  0x67FF60                   ; nullsub_31 -- stub
.skip:
repz ret

Phase 156 DumpNVuCodeHex -- vtable=0x22BEE28 -- execute sub_C60460 (~48 bytes). Mirror image of phase 155 with a simpler gate (no extra pointer indirection) and tail-call target 0x67FF50 = nullsub_30. Same conclusion: stripped from release, orphan gate only.
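
The two dump gates can be written as plain predicates. This is a sketch under the assumption that the pointer at ctx+0x740 is modelled as either None (NULL) or a one-element list holding the pointee; the function names are hypothetical, and the gate conditions are the ones stated for phases 155 and 156.

```python
from typing import List, Optional

Sink = Optional[List[Optional[object]]]  # None = NULL, [None] = pointer to NULL


def dump_text_gate(verbosity: int, sink: Sink) -> bool:
    """Phase 155 gate: verbosity > 0, sink != NULL, and *sink != NULL."""
    if verbosity <= 0:   # jle .skip (signed compare)
        return False
    if sink is None:     # ctx+0x740 == NULL
        return False
    if sink[0] is None:  # *(ctx+0x740) == NULL
        return False
    return True


def dump_hex_gate(verbosity: int, sink: Sink) -> bool:
    """Phase 156 gate: same checks minus the extra pointer indirection."""
    return verbosity > 0 and sink is not None


def stripped_dumper(*_args: object) -> None:
    """Stand-in for the 2-byte nullsub both phases tail-call in release builds."""
    return None
```

Even when either gate passes, the call lands in a stub that does nothing, which is why both phases count as effective no-ops.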

Phase 157 DebuggerBreak -- vtable=0x22BEE50 -- execute nullsub_627 at 0xC5DFE0 (2 bytes, repz ret). Debug-build breakpoint marker; release builds emit a bare ret. isNoOp() = 0 (sub_C5E000), so the diagnostic frame still fires -- useful when running ptxas under gdb with b *0xC5DFE0 because the dispatch loop will print "Before DebuggerBreak" / "After DebuggerBreak" on either side of the breakpoint.

Phase 158 NOP -- vtable=0x22BEE78 -- execute nullsub_626 at 0xC5DFB0 (2 bytes, repz ret). Terminal sentinel. The 159-phase dispatch loop (sub_C64F70) iterates a1[0] .. a1[158] and needs a final slot to anchor the loop end; NOP is that anchor. isNoOp() = 0 (sub_C5DFD0), so the final "Before NOP" / "After NOP" prints appear in verbose dumps as the explicit terminator for "All Phases Summary".

Summary of nullsubs (release build). Five of the 20 phases have bodies that are pure ret stubs: 150, 151, 154, 157, 158. Two more (155, 156) have non-trivial gate cascades but their tail-call targets resolve to nullsubs, making them effectively no-ops too. That leaves 13 phases (139--149, 152, 153) that actually transform IR or pipeline state in a release build. Of the 13 active phases, seven are Mercury encoder stages (141--147) gated by ctx+0x570/ctx+0x571 bits -- so on a non-Mercury backend the active count drops to six (139, 140, 148, 149, 152, 153).
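
The counts in this summary follow from simple set arithmetic over the phase indices. The sketch below uses only the index sets stated on this page; the variable names are illustrative.

```python
# Index sets as stated in the text; names are illustrative.
LATE = set(range(139, 159))               # phases 139..158
RET_STUBS = {150, 151, 154, 157, 158}     # execute body is a bare ret
ORPHAN_GATES = {155, 156}                 # gate survives, target is a nullsub
MERCURY = set(range(141, 148))            # phases 141..147
DIAG_SUPPRESSED = {150, 151, 152, 154}    # isNoOp() returns 1

active = LATE - RET_STUBS - ORPHAN_GATES
active_non_mercury = active - MERCURY
diag_printed = LATE - DIAG_SUPPRESSED
vestigial = RET_STUBS & DIAG_SUPPRESSED   # stub body and hidden from dumps

if __name__ == "__main__":
    print(len(active), sorted(active_non_mercury))
    print(len(diag_printed), sorted(vestigial))
```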

No per-SM arch split across these phases. None of the 20 execute bodies contain an sm_version switch on ctx.target[+0x174] at the phase level; the only such check is in phase 139's gate (> 0x3FFF i.e. "sm50-and-up"). All per-generation specialisation happens one level down, inside the target vtable methods each phase tail-calls (Mercury backend for 141--147, target vtable slots 0x148/0x150/0x2B8 for 140/139/148). The pipeline itself is arch-uniform; backends differ only in the methods they plug into the vtables.

The Mercury phases (141--147) are gated by flag bits at ctx+0x570/ctx+0x571, allowing non-Mercury backends to selectively disable encoding stages. WAR generation runs in two passes (144, 146) bracketing Opex (145) because Opex can rewrite operand banks and thereby introduce new write-to-read distances that need re-annotation -- phase 143 (MercExpandInstructions) also runs before the pair but has its own bit-5 gate.
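
Given the two flag bytes, the set of enabled Mercury phases can be decoded with a small lookup. The sketch below is an illustration of the gate table on this page, not code from ptxas; `MERCURY_GATES` and `enabled_mercury_phases` are hypothetical names, and the masks are the ones listed for phases 141--147.

```python
from typing import Dict, List, Tuple

# phase index -> (context offset of the flag byte, bit mask)
MERCURY_GATES: Dict[int, Tuple[int, int]] = {
    141: (0x570, 0x10),  # MercConverter, bit 4
    142: (0x571, 0x02),  # MercEncodeAndDecode, bit 1 of second byte
    143: (0x570, 0x20),  # MercExpandInstructions, bit 5
    144: (0x570, 0x80),  # MercGenerateWARs1, bit 7
    145: (0x570, 0x40),  # MercGenerateOpex, bit 6
    146: (0x570, 0x80),  # MercGenerateWARs2, bit 7
    147: (0x571, 0x01),  # MercGenerateSassUCode, bit 0 of second byte
}


def enabled_mercury_phases(flag_570: int, flag_571: int,
                           mercury_ctx_present: bool = True) -> List[int]:
    """Return the Mercury phases whose gate passes for the given flag bytes."""
    flags = {0x570: flag_570 & 0xFF, 0x571: flag_571 & 0xFF}
    enabled = []
    for phase, (offset, mask) in sorted(MERCURY_GATES.items()):
        if not flags[offset] & mask:
            continue
        # Phase 142 additionally requires a non-NULL Mercury context (ctx+0x788).
        if phase == 142 and not mercury_ctx_present:
            continue
        enabled.append(phase)
    return enabled
```

For example, `enabled_mercury_phases(0x80, 0x00)` yields only the two WAR passes, since they share bit 7 and therefore cannot be enabled independently of each other.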


Pipeline Ordering Notes

Stage numbering. The 10 stages on this page (Stage 1--10) subdivide the 159-phase OCG pipeline. They are distinct from the 6 timed phases in Pipeline Overview (Parse, CompileUnitSetup, DAGgen, OCG, ELF, DebugInfo), which cover the entire program lifecycle. All 10 stages here fall within the single OCG timed phase.

Identity ordering. The default ordering table at 0x22BEEA0 (159 x uint32) is an identity mapping for indices 0--156: exec[N] = factory[N]. The last two entries are zero: exec[157] = 0 and exec[158] = 0, mapping both slots back to factory index 0 instead of the expected 157 and 158. This is benign -- phase 157 (DebuggerBreak, empty body in release builds) and phase 158 (NOP, terminal sentinel) both have trivial execute() bodies, so the factory index they resolve through is irrelevant to pipeline behavior. For all practical purposes the factory index IS the execution order: phases execute in strict index order 0--158, and the two trailing zeros are don't-care slots. The original wiki analysis that placed phases 132--138 as "out-of-order slots" was based on a compressed 139-phase model that excluded 20 phases (see note below).
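
The shape of the ordering table can be reproduced and checked in a few lines. This sketch assumes little-endian uint32 entries (the usual x86-64 layout) and rebuilds the table from the description rather than reading it from the binary; the helper names are hypothetical.

```python
import struct
from typing import List

PHASE_COUNT = 159


def build_default_order(count: int = PHASE_COUNT) -> List[int]:
    """Identity mapping for 0..156, with the two trailing entries set to zero."""
    order = list(range(count))
    order[157] = 0
    order[158] = 0
    return order


def non_identity_slots(order: List[int]) -> List[int]:
    """Return every slot whose entry differs from its own index."""
    return [slot for slot, value in enumerate(order) if value != slot]


# Round-trip through the on-disk representation: 159 little-endian uint32 values.
raw = struct.pack("<159I", *build_default_order())
decoded = list(struct.unpack("<159I", raw))
```

Running `non_identity_slots` on a table dumped from another ptxas build is a quick way to see whether that build deviates from identity ordering anywhere other than the two trailing slots.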

Repeated passes. Several transformations run at multiple pipeline positions because intervening passes expose new opportunities:

| Pass Family | Instances | Phases |
|---|---|---|
| GeneralOptimize* | 6 | 13, 29, 37, 46, 58, 65 |
| OriPerformLiveDead* | 4 | 16, 33, 61, 84 |
| OriHoistInvariants* | 4 | 35, 66, 79, 88 |
| LateExpansionUnsupportedOps* | 3 | 78, 93, 137 |
| ExtractShaderConsts* | 2 | 34, 51 |
| OriPropagateVarying* | 2 | 53, 70 |
| OriDoRemat* | 2 | 54, 69 |
| DoSwitchOpt* | 2 | 14, 30 |
| LateArchOptimize* | 2 | 75, 81 |
| MergeEquivalentConditionalFlow | 2 | 133, 136 |
| MercGenerateWARs* | 2 | 144, 146 |
| UpdateAfterPostRegAlloc | 2 | 125, 150 |
| UpdateAfterFormatCodeList | 2 | 128, 154 |
| ReportFinalMemoryUsage | 2 | 126, 151 |
| DumpNVuCodeText | 2 | 129, 155 |
| DumpNVuCodeHex | 2 | 130, 156 |
| ComputeVCallRegUse | 2 | 123, 148 |
| CalcRegisterMap | 2 | 124, 149 |
| DebuggerBreak | 2 | 131, 157 |
| Vectorization/LateVectorization | 2 | (true 41, 73) -- omitted from compressed numbering |
| EnforceArgumentRestrictions/Late... | 2 | 48 (wiki), (true 103) -- late variant omitted |

Cross-References

Key Functions

| Address | Size | Role | Confidence |
|---|---|---|---|
| sub_C60D30 | -- | Phase factory switch; allocates each of the 159 phases as a 16-byte polymorphic object with a 5-slot vtable (execute, getIndex, isNoOp, NULL, NULL) | 0.92 |
| sub_7DDB50 | 232B | Opt-level accessor; runtime gate called by 20+ pass execute functions to check opt-level threshold | 0.95 |
| sub_A36360 | 52KB | Master scoreboard control word generator; per-opcode dispatch for phase 115 (AdvancedScoreboardsAndOpexes) | 0.90 |
| sub_A23CF0 | 54KB | DAG list scheduler heuristic; barrier assignment for phase 115 scoreboard generation | 0.90 |
| sub_9F1A90 | 35KB | MercConverter infrastructure; drives instruction-level legalization for Mercury phases 117--122 via visitor pattern | 0.92 |
| sub_9ED2D0 | 25KB | Opcode switch inside MercConverter; dispatches per-opcode legalization/conversion | 0.90 |
| sub_9F3760 | -- | Phase 141 (MercConverter) body, reached via the 8-byte thunk sub_C60300; Mercury conversion of Ori instructions | 0.85 |
| sub_18F21F0 | -- | Phase 142 (MercEncodeAndDecode) body, reached via the 8-byte thunk sub_C60310; encode/decode round-trip verification | 0.85 |