Priority List Scheduling Algorithm
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The scheduling engine implements a classical priority list scheduling algorithm extended with GPU-specific heuristics for register pressure management, functional unit contention avoidance, yield hint generation, and barrier-aware instruction ordering. A single unified engine (sub_688DD0, 20 KB) serves all three scheduling phases -- ReduceReg, ILP/Latency, and DynBatch -- differentiated only by a mode byte that selects different priority weight configurations. The algorithm iterates basic blocks, builds a ready list of zero-dependency instructions, selects the highest-priority candidate via an 8-bit packed heuristic, emits it into the final schedule, updates the dependency DAG, and repeats until all instructions in the block are placed.
| Component | Details |
|---|---|
| Unified engine | sub_688DD0 (20 KB) -- mode-parameterized core loop |
| Priority function | sub_8C9320 (47 KB, ~1300 lines) -- 8-bit packed heuristic |
| Ready list builder | sub_6820B0 (1.5 KB) -- zero-predecessor scan |
| Dependency pre-scan | sub_8CF880 (28 KB) -- reverse BB iteration |
| Edge builder | sub_8D9930 (19 KB) -- RAW/WAR/WAW/memory/barrier edges |
| Instruction mover | sub_925510 (341 bytes) -- doubly-linked list relink |
| Resource tracker | sub_A09530 (365 bytes) -- per-instruction stall update |
| Stall/barrier encoder | sub_8D7760 (41 KB) -- control word generation |
| Alternative loop | sub_68B9C0 (46 KB) -- combined DAG + scheduling |
| BB size limit | 4095 instructions (split via sub_931920) |
| Large function limit | 16383 instructions (chunk-based via sub_A9DDD0) |
Core Algorithm
The unified scheduling engine executes the following sequence for each basic block. All three phases (ReduceReg, callback 0x39; ILP, callback 0x49; DynBatch, callback 0x41) follow this identical structure; only the priority weight selection differs.
function ScheduleEngine(sched, mode, arg3, rebuild):
    if rebuild:
        InitScheduleRegion(sched)               // sub_6833F0
            // Allocates 72-byte per-BB records
            // Queries knobs 595, 743, 747
            // Calls sub_7E5120 for instruction characterization
    for each bb in sched.basic_blocks:          // 72-byte stride
        InitResourceTracking(sched, bb)         // sub_A091C0
            // Zeroes 40-byte resource records (one per register class + 1)
        BuildReadyList(sched)                   // sub_6820B0
        while ready_list is not empty:
            best = SelectHighestPriority(sched) // via priority vtable
            UnlinkFromReadyList(sched, best)    // sub_682200
            MoveInstruction(sched, best, ref)   // sub_925510
            UpdateStallCycles(sched, best)      // sub_A09530
            UpdateWARTracking(sched, best)      // sub_A09D40
            for each successor of best:
                successor.dep_count -= 1
                if successor.dep_count == 0:
                    ready_list.insert(successor)
                        // Sorted insertion using priority value
        PostBBCleanup(sched, bb)                // sub_BDC200 / sub_BDCDE0
The outer loop iterates basic blocks via an array of 72-byte records (v112 = 72 * bb_index). The inner loop is a standard worklist algorithm: remove the highest-priority ready instruction, schedule it, and propagate readiness to its successors in the dependency DAG by decrementing their predecessor counts.
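As a minimal illustration of this worklist structure, the following Python sketch (hypothetical node ids and a plain integer priority standing in for the 72-byte records and the packed priority byte) schedules a small dependency DAG:

```python
import heapq

def list_schedule(nodes, succs, priority):
    """Priority list scheduling over a dependency DAG.

    nodes: iterable of node ids; succs: dict node -> list of dependent nodes;
    priority: dict node -> int (higher schedules first).
    """
    # Count unsatisfied predecessors for every node.
    dep_count = {n: 0 for n in nodes}
    for n in nodes:
        for s in succs.get(n, []):
            dep_count[s] += 1
    # Ready list: zero-dependency nodes, highest priority first
    # (heapq is a min-heap, so negate the priority).
    ready = [(-priority[n], n) for n in nodes if dep_count[n] == 0]
    heapq.heapify(ready)
    schedule = []
    while ready:
        _, best = heapq.heappop(ready)      # highest-priority candidate
        schedule.append(best)               # emit into the final schedule
        for s in succs.get(best, []):       # propagate readiness
            dep_count[s] -= 1
            if dep_count[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return schedule
```

On a diamond DAG where c outranks b, the schedule comes out a, c, b, d: a successor becomes ready only after its last producer is placed, mirroring the engine's predecessor-count decrement.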
Mode Selection
The mode byte stored at *((DWORD*)scheduler + 60) (byte offset 240 from the scheduler object base) controls which priority weight set the engine uses. The orchestrator sub_8D0640 writes this field before each call to the unified engine:
| Mode | Value | Callback | Objective |
|---|---|---|---|
| ReduceReg | 1 | 0x39 | Minimize register pressure. Prioritizes instructions that release registers (last-use operands). |
| ILP/Latency | 0 | 0x49 | Maximize instruction-level parallelism. Prioritizes critical-path and long-latency instructions. |
| DynBatch | 2 | 0x41 | Batch-aware tensor scheduling. Groups GMMA/WGMMA operations for warpgroup cooperation. |
The engine reads the mode byte at the top of each basic block's scheduling loop to select the per-BB knob gate and the vtable pipeline group (sub_688DD0 line 281):
function decode_mode(sched):          // sub_688DD0, line 281
    mode = *(DWORD*)(sched + 240)     // scheduler DWORD index 60
    switch mode:

    case 1: // ReduceReg -------------------------------------------------
        // Per-BB gate: skip this BB if knob "ScheduleInstructionsReduceReg" says so
        skip = CheckKnob(ctx, "ScheduleInstructionsReduceReg", bb)
        if skip: goto next_bb
        // Weight configuration (set by orchestrator before engine entry):
        //   sched+484 = 1 (ReduceReg flag -- enables budget reduction in sub_8C9320)
        //   sched+176 = 1 (pressure-tracking active)
        //   sched+296 = RegToHWUnits(knob 776, default 250)  // GPR budget target
        //   sched+300 = RegToHWUnits(knob 778, default 300)  // GPR budget ceiling
        //   sched+292 = 1 (budget validity flag)
        //   sched+412 = knob 761 (default 3; or 6/12 for sm_90 with GMMA present)
        //   sched+416 = knob 762 (default -1 = auto-compute)
        // Priority effect: bit 4 (pressure overflow) dominates the comparison;
        //   sub_8C9320 applies the budget reduction formula -(target/8 + 3) regs
        // Vtable: pipeline_group_B (indices 31-53 at 0x8E14xx-0x8E15xx)

    case 0: // ILP/Latency -----------------------------------------------
        // No per-BB knob gate (always executes every BB)
        // Weight configuration:
        //   sched+484 = 0 (ReduceReg flag off)
        //   sched+176 = 0 (no pressure reduction in priority function)
        //   sched+324 = budget from sub_8CEE80: min(archLimit, 0.95 * maxRegs)
        //   sched+316 = context+616 (latency-depth scaling factor)
        //   sched+404 = min(context+508, 16), or knob 805 (queue depth cap)
        //   sched+408 = knob 741 (default 3)
        // Priority effect: bits 1 (critical-path) and 2 (stall-free) dominate;
        //   latency hiding is the primary scheduling objective
        // Vtable: pipeline_group_A (indices 8-30 at 0x8E0Exx-0x8E10xx)

    case 2: // DynBatch ---------------------------------------------------
        // Per-BB gate: skip this BB if knob "ScheduleInstructionsDynBatch" says so
        // Default: enabled unless knob 742 overrides, or ctx+1419 bit 5 is set
        skip = CheckKnob(ctx, "ScheduleInstructionsDynBatch", bb)
        if skip: goto next_bb
        // Weight configuration (inherits ILP budget, adds batch context):
        //   sched+176 = 1 (pressure-tracking re-enabled)
        //   Batch context: 184-byte struct from sub_8BF890 with 8*numBBs array
        //   sched+404 = knob 806 (capped at 16), or 6 if context+507 is set
        //   sched+408 = knob 741 (default 3)
        // Priority effect: bits 5 (hot-cold / batch grouping) and 3 (same-BB)
        //   emphasize keeping GMMA instruction groups together
        // Vtable: pipeline_group_C (indices 54-76 at 0x8E22xx-0x8E24xx)
The three vtable pipeline groups contain 23 method pointers each (one per scheduling callback slot). All groups share the same 8 core methods (indices 0-7) but substitute specialized implementations for the 23 pipeline-specific hooks that control ready-list insertion order, candidate comparison, and post-scheduling resource updates. The callback index passed to the engine (0x39=57, 0x41=65, 0x49=73 decimal) selects the entry point within the core methods, indexing into the vtable at off_21DBEF8.
The engine uses vtable dispatch at *(a1+40) and *(a1+48) for polymorphic pre/post scheduling hooks. This allows each mode to inject custom behavior at scheduling boundaries without modifying the core loop.
Ready List Construction
sub_6820B0 (1.5 KB) scans the instruction linked list and collects every instruction with zero unsatisfied dependencies into a sorted ready list.
function BuildReadyList(sched):
    for instr in sched.instruction_list:            // linked list at sched[20]
        if instr.opcode == 52:                      // NOP / BB boundary marker
            continue                                // follow to real instruction
        metadata = *(QWORD*)(instr + 40)            // SchedNode pointer
        if *(DWORD*)(metadata + 8) == 0:            // depCount == 0
            *(QWORD*)(metadata + 16) = sched[5]     // link to current head
            sched[5] = instr                        // new head
            vtable_callback(sched, 104, instr)      // insertion hook
            *(DWORD*)(metadata + 28) = 0            // reset latency counter
The ready list is a singly-linked list threaded through SchedNode offset +16 (the nextReady field). Sort order is maintained at insertion time by the priority function -- each new instruction is inserted at its correct position so that the head of the list is always the highest-priority candidate. All metadata+N offsets throughout the scheduling pages refer to fields within the SchedNode block pointed to by instr+40 (sched_slot), not offsets from the instruction object itself. See the SchedNode layout for the complete field map.
Opcode 52 instructions are phantom BB boundary markers. The builder skips them but follows their linked-list successors to reach real instructions beyond the boundary.
The vtable+104 callback provides a polymorphic insertion hook. Different scheduling strategies can override this to implement custom ready-list policies (e.g., the dual-issue scheduler uses it to pair co-issuable instructions).
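The sorted-insertion policy described above can be sketched in Python (hypothetical `Node` class standing in for the SchedNode; `next_ready` plays the role of the offset +16 link):

```python
class Node:
    """Stand-in for a SchedNode: a priority plus the nextReady link."""
    def __init__(self, name, prio):
        self.name, self.prio, self.next_ready = name, prio, None

def insert_sorted(head, node):
    """Insert so the head is always the highest-priority candidate."""
    if head is None or node.prio > head.prio:
        node.next_ready = head
        return node                      # node becomes the new head
    cur = head
    # Walk past entries with priority >= the new node's priority.
    while cur.next_ready is not None and cur.next_ready.prio >= node.prio:
        cur = cur.next_ready
    node.next_ready = cur.next_ready
    cur.next_ready = node
    return head
```

Because order is maintained at insertion, selecting the best candidate is an O(1) head read; the cost is paid once per insertion rather than once per selection.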
Priority Function
sub_8C9320 (47 KB decompiled, ~1300 lines) is the heart of instruction selection. It computes a scheduling priority as an integer with 8-bit packed fields. Each bit encodes a different heuristic criterion. Because priority comparison reduces to a single integer comparison, the ready list maintains sort order without multi-key sorting overhead.
8-Bit Priority Encoding
| Bit | Name | Meaning | Notes |
|---|---|---|---|
| 7 (MSB) | yield-related | Instruction is near a yield boundary | Higher priority ensures yield hints align with scheduling boundaries |
| 6 | yield flag | Instruction triggers or participates in a yield sequence | Controls warp scheduler round-robin interaction |
| 5 | hot-cold | Memory access temperature (1 = hot, 0 = cold) | Hot = global/texture/surface loads with long latencies; cold = constant/shared |
| 4 | hot-cold / pressure | Packed byte holds hot-cold flag; pressure overflow acts as comparison override | See Priority Function Internals for the dual mechanism |
| 3 | same-BB preference | Instruction belongs to the currently-scheduled BB | Discourages cross-BB instruction motion |
| 2 | stall-free | Scheduling this instruction introduces zero stall cycles | All producer latencies have completed |
| 1 | latency-bound | Instruction is on the DAG critical path | Prioritizes latency-sensitive dependency chains |
| 0 (LSB) | tiebreaker | Additional ordering factor | Encodes instruction position, operand count, or FU preference |
Higher numeric value = higher priority. Bit 7 is the most significant criterion: yield-boundary instructions always schedule before non-yield instructions when both are ready.
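The payoff of the packed encoding is that one integer compare implements a lexicographic comparison of all eight criteria, MSB first. A small Python sketch (field names follow the table above; this is an illustration, not the decompiled packing code):

```python
def pack(yield_rel, yield_flag, hot, hot_pressure,
         same_bb, stall_free, critical, tie):
    """Pack the eight one-bit criteria into one priority byte (bit 7 = MSB)."""
    bits = [yield_rel, yield_flag, hot, hot_pressure,
            same_bb, stall_free, critical, tie]
    prio = 0
    for b in bits:
        prio = (prio << 1) | (b & 1)   # shift in each criterion, MSB first
    return prio
```

A yield-boundary candidate (bit 7 only) beats a candidate that is stall-free, critical-path, and a barrier target combined, because 0x80 exceeds any byte with bit 7 clear.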
Hot-Cold Classification
The hot-cold flag (bit 5) classifies memory operations by expected latency:
- Hot (bit 5 = 1, higher priority): global memory loads (`LDG`), texture fetches (`TEX`, `TLD`), surface operations. These have high latency (hundreds of cycles) and benefit most from early scheduling to overlap with computation. Detected by `sub_A9CDE0`.
- Cold (bit 5 = 0, lower priority): constant memory loads (`LDC`), shared memory operations (`LDS`, `STS`). These have low latency and do not need early scheduling. Detected by `sub_A9CF90`, which also suppresses the pressure-overflow and critical-path extension signals.
Classification uses sub_A9CDE0 (hot detection) and sub_A9CF90 (cold detection). Memory space type is determined by sub_693BC0, which returns space codes: 3 = shared, 16 = global, 2 = local, 11 = surface, 7 = constant, 1 = generic. Hot-cold tracking is gated by scheduler+523 and scheduler+532; when the hot-cold budget (scheduler+532) reaches zero, the feature deactivates for the remainder of the BB.
Pressure Overflow
Bit 4 in the packed byte holds the hot-cold flag (see above). The pressure overflow signal is a separate Boolean computed by checking all four register classes (GPR, predicate, address, UGP) against their respective limits. When any class exceeds its budget, the pressure overflow flag activates and acts as a comparison override: the candidate wins regardless of the packed priority byte, forcing the scheduler to select the instruction that relieves register pressure. This is the primary mechanism by which the ReduceReg phase achieves its objective: the mode sets a tight register budget via scheduler+178, causing pressure overflow to activate frequently and driving the scheduler toward pressure-reducing orderings. See the Priority Function Internals section for the exact per-class threshold checks.
Priority Evaluation Sequence
The priority function evaluates criteria in this order for each candidate instruction:
1. `sub_8C7290`: extract 4-class register deltas, same-BB flag, and per-BB resource vector (SSE-optimized)
2. Compute yield saturation: check write-barrier counters for predicate, GPR, and UGP register classes against their ceilings (7, 7, and `target_desc+624` respectively)
3. `sub_8C67A0`: compute per-instruction resource cost if BB slot not yet committed
4. `sub_8C7120`: update barrier tracking state (if `metadata+111` bit 7 set)
5. Evaluate register pressure: compute per-class overflow against budget (`scheduler+432`) and per-class limits; derive pressure-overflow Boolean
6. Evaluate stall-free: compare earliest cycle (`metadata+32`) vs current cycle (`scheduler+480`)
7. Evaluate critical path: compare barrier-target count vs depth threshold (`scheduler+464`)
8. Evaluate yield bits: opcode 39 (yield-related) and opcode 96 (yield flag from `scheduler+524`)
9. Pack 8 bits into priority byte
10. Evaluate hot/cold: `sub_A9CDE0` / `sub_A9CF90` (only when `scheduler+523` active)
11. Multi-stage comparison against running best: resource vectors, then XOR-based bit scan, then secondary tiebreakers
The function scans the full ready list in a single pass (not limited by knob 770 for the scan itself). Knob 770 (priority queue depth, default 4) controls the depth threshold mechanism for critical-path activation, not the number of candidates evaluated.
Key Internal Variables
| Variable | Source | Content |
|---|---|---|
| `budget_hw` | `sub_6818D0(scheduler, scheduler[432] - scheduler[412])` | Register budget in HW register units |
| `reduced_hw` | `sub_6818D0(scheduler, budget - budget/16)` | Tighter budget for critical-path threshold (or knob 760 override) |
| `queue_depth` | knob 770 | Depth threshold parameter (default 4); controls critical-path activation |
| `per_bb_flag` | knob 769 | Per-BB scheduling flag; when set, resets yield state between BBs |
| `scheduler+420` | state | Spill-mode countdown; when > 0, forces aggressive scheduling with bit 1 = 1 |
| `scheduler+464` | state | Depth threshold -- number of barrier targets that must be ready before critical-path activates |
| `scheduler+480` | state | Current scheduling cycle; used for stall-free evaluation |
| `scheduler+523` | state | Hot-cold tracking enable flag; gated by knob |
| `scheduler+524` | state | Current yield state; propagated to CONTROL instructions via bit 6 |
| `scheduler+532` | state | Hot-cold budget counter; decremented per cold instruction, disables tracking at zero |
| `scheduler+672` | allocation | Per-BB resource cost table (84 bytes per slot) |
Support Subroutines
| Address | Size | Purpose |
|---|---|---|
| `sub_8C67A0` | 3.7 KB | Compute per-instruction resource cost. Calls `sub_A08A00` (resource model) three times: mode 1 = instruction's own cost, mode 2 = operand release cost, mode 3 = combined BB-level impact. Uses SSE `_mm_add_epi32` for vector accumulation. |
| `sub_8C7290` | 5.1 KB | Copy 10-element int32 resource vector from per-BB table at `scheduler+672`. SSE `_mm_loadu_si128` bulk copy. Special case: opcode 97 (STG in ROT13; used as control/boundary marker) returns base scheduler state with zeroed deltas. |
| `sub_8C7720` | 20 KB | Red-black tree operations for instruction reordering within BB. Maintains a balanced BST of scheduling candidates for O(log N) insertion, removal, and priority update. |
| `sub_8C7120` | -- | Barrier tracking state update. |
| `sub_693BC0` | -- | Memory space classification and latency query. |
| `sub_6818D0` | 24 B | RegToHWUnits: conditionally doubles register count when LivenessCountRegComp (bit 3 of scheduling_mode_flags) is set (sm_70+). |
Priority Function Internals
The full logic of sub_8C9320 divides into three phases: (1) pre-scan the ready list to collect aggregate BB statistics, (2) iterate the ready list a second time evaluating each candidate and maintaining a running best, and (3) update scheduler state and return the winner. The function signature is (scheduler, &second_best) -> best_instruction.
Phase 1: Pre-Scan Statistics
Before priority evaluation begins, the function iterates the entire ready list (linked via metadata+16) and accumulates per-BB statistics that feed into the per-instruction priority decisions:
| Variable | Init | Accumulation | Meaning |
|---|---|---|---|
| `shared_mem_count` | 0 | ++ when opcode 183 and `sub_693BC0` returns space 3 | Count of shared-memory operations in ready list |
| `neg_reg_deficit` | 0 | += delta when register delta < 0 | Total register pressure reduction from ready instructions |
| `max_dep_cycle` | -1 | max(current, `metadata+92`) | Highest dependency cycle among all ready instructions |
| `max_pred_cycle` | 0 | max(current, `metadata+88`) | Highest predecessor cycle among all ready instructions |
| `barrier_count` | 0 | ++ when `metadata+108` & 1 | Count of barrier-target instructions in ready list |
| `dep_flag_count` | 0 | ++ when `metadata+108` & 2 | Count of instructions with dependency-set flag |
| `pos_pressure_sum` | 0 | += delta when register delta > 0 | Total register pressure increase from ready instructions |
| `filtered_pressure` | 0 | += delta when within depth threshold | Pressure increase from depth-eligible instructions |
| `max_barrier_slot` | -1 | max(current, `metadata+24`) for barrier targets | Latest BB slot among barrier-target instructions |
| `min_barrier_latency` | 99999 | min(current, `metadata+28`) for barrier targets | Shortest latency counter among barrier-target instructions |
| `max_nonbarrier_cycle` | -1 | max(current, `metadata+32`) for non-barrier | Latest earliest-available-cycle for non-barrier instructions |
| `any_stall_free` | 0 | \|= (`metadata+32` >= 0) | Whether any instruction can issue without stalling |
| `total_ready` | 0 | ++ for every instruction | Total instructions in ready list |
| `preferred_instr` | NULL | non-barrier instr with max `metadata+24` | The program-order-latest non-barrier instruction |
The pre-scan also maintains a depth-threshold table: an array of up to 32 barrier-target instruction pointers sorted by their latency counter (metadata+28). This table is scanned to compute scheduler+464 (depth threshold) and scheduler+380 (latency cutoff), which control when the critical-path bit activates.
Depth-threshold derivation algorithm. On the first invocation per BB (scheduler+464 == -1), the state initializes to scheduler+464 = 0, scheduler+480 = 0, scheduler+420 = 0. If the BB has active barriers (scheduler+396 > 0) and barrier-target instructions exist in the ready list, the threshold computation runs:
// Phase 1a: collect barrier-target pointers (during pre-scan)
barrier_table[32]        // local stack array, max 32 entries
barrier_table_count = 0
barrier_ready_count = 0  // subset with metadata+32 < INT_MAX (non-stalling)
for each instr in ready_list:
    if metadata+108 & 1:                              // barrier-target flag
        if barrier_table_count <= 31:
            barrier_table[barrier_table_count] = instr
            barrier_table_count++
        barrier_ready_count += (metadata+32 < 0x80000000)

// Phase 1b: derive threshold from sorted latency scan
scheduler[424] = barrier_table_count   // total barrier targets
scheduler[428] = barrier_ready_count   // non-stalling subset
old_threshold = scheduler[464]
if old_threshold < barrier_table_count:
    cap   = min(barrier_table_count, 32)
    lower = -1                         // latency lower bound
    accum = 0                          // accumulated instruction count
    do:
        // --- find next distinct latency above lower bound ---
        next_lat  = 999999
        tie_count = 0
        for i in 0 .. cap-1:
            lat = metadata+28 of barrier_table[i]
            if lat > lower and lat <= next_lat:
                tie_count++
                if lat != next_lat:
                    next_lat  = lat    // new minimum found
                    tie_count = 1
        accum += tie_count             // add this tier
        lower  = next_lat              // advance lower bound
        if barrier_table_count == 0:
            lower = 999999             // sentinel for empty list
    while old_threshold > accum
    scheduler[376] = 0                 // threshold is now valid
    scheduler[464] = accum             // depth threshold
    scheduler[380] = lower             // latency cutoff

// Fallback (at LABEL_87): if scheduler[376]==1 and barrier_count >= scheduler[464],
// set scheduler[464]=barrier_count, scheduler[380]=max_barrier_latency
The inner loop is equivalent to an iterative selection of the next-smallest distinct latency tier from an unsorted array. Each tier groups barrier targets that share the same latency counter value. The do...while old_threshold > accum condition ensures the threshold absorbs at least old_threshold instructions before stopping. On subsequent invocations within the same BB, scheduler+464 >= 0 and the computation is skipped.
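Stated more directly, the loop selects successive distinct-latency tiers from an unsorted array until the old threshold's instruction count is absorbed. A Python sketch of the same derivation (a hypothetical `latencies` list stands in for the `metadata+28` values in the barrier table):

```python
def derive_depth_threshold(latencies, old_threshold):
    """Absorb distinct-latency tiers until at least old_threshold
    instructions are covered; return (depth_threshold, latency_cutoff)."""
    lower = -1   # latency lower bound
    accum = 0    # accumulated instruction count
    while True:
        # Find the next distinct latency above the lower bound.
        above = [lat for lat in latencies if lat > lower]
        if not above:
            lower = 999999               # sentinel: no further tiers
            break
        next_lat = min(above)
        accum += above.count(next_lat)   # add this whole tier
        lower = next_lat                 # advance the lower bound
        if accum >= old_threshold:       # threshold absorbed enough
            break
    return accum, lower
```

For latencies [5, 2, 9, 2, 7] and an old threshold of 3, the first tier (latency 2) contributes two instructions and the second tier (latency 5) the third, so the result is a depth threshold of 3 with a latency cutoff of 5.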
During Phase 2 candidate evaluation, the critical-path bit activates when scheduler+464 > 0 && scheduler+464 <= barrier_count -- i.e., enough barrier targets are ready to meet the depth threshold.
Phase 2: Register Budget Prologue
Before the main loop, the function computes two register budgets from scheduler+432 (target register count):
budget_base = scheduler[432] - scheduler[412]       // target minus committed
if ReduceReg_mode (scheduler+178):                  // ReduceReg tightens budget
    if scheduler[416] < 0:
        budget_base -= (scheduler[432] / 8) + 3     // reduce by ~12.5% + 3
    else:
        budget_base -= scheduler[416]               // explicit reduction

budget_hw  = RegToHWUnits(scheduler, budget_base)   // sub_6818D0
reduced_hw = RegToHWUnits(scheduler, budget_base - budget_base/16)
                                                    // ~6.25% tighter
if knob_760_active:
    reduced_hw = RegToHWUnits(scheduler, budget_base - knob_760_value)

queue_depth = 4                                     // default
if knob_770_active:
    queue_depth = knob_770_value                    // override
The RegToHWUnits conversion (sub_6818D0) maps a virtual register count to hardware allocation units:
fn RegToHWUnits(scheduler, count):
    flags = *(*(scheduler + 8) + 1376)   // scheduling_mode_flags
    if flags & 0x08:                     // bit 3: LivenessCountRegComp
        return 2 * count                 // doubled for HW-aligned tracking
    else:
        return count                     // 1:1 passthrough
Bit 3 of scheduling_mode_flags (knob 419 LivenessCountRegComp) is set on sm_70+ targets, where the hardware allocates registers in 8-register blocks (256 registers/warp). When set, the scheduler internally tracks pressure in 4-register half-units, so RegToHWUnits doubles to produce full 8-register hardware units. On sm_30--sm_60, registers allocate in 2-register blocks (64 registers/warp) and the scheduler tracks in native units, so the function is an identity.
| SM Generation | HW Alloc Granularity | Bit 3 | Multiplier | Effect |
|---|---|---|---|---|
| sm_30 -- sm_60 | 2 regs/thread (64/warp) | 0 | 1x | count passes through unchanged |
| sm_70+ | 8 regs/thread (256/warp) | 1 | 2x | count doubled to match 8-reg blocks |
budget_hw sets the threshold for bit 4 (pressure overflow). reduced_hw provides a tighter threshold used in the critical-path assessment. queue_depth (knob 770) parameterizes the depth-threshold mechanism for critical-path activation; it does not cap the number of candidates evaluated -- the priority function always scans the full ready list.
Phase 3: Per-Bit Computation
For each instruction in the ready list, sub_8C7290 extracts its per-register-class deltas (4 classes: GPR, predicate, address, UGP) and the same-BB flag. Then each priority bit is computed:
Bit 7 -- Yield-related. Determined by opcode. Only opcode 39 (YIELD instruction variant) can set this bit. The condition checks the last operand's low 2 bits:
if opcode_masked == 39:
    operand_index = operand_count - 1 - ((opcode >> 11) & 2)
    yield_related = (instr[84 + 8*operand_index] & 3) == 0
else:
    yield_related = 0
When set, the instruction is a yield boundary marker and receives absolute highest priority regardless of all other heuristics.
Bit 6 -- Yield flag. Set only for opcode 96 (CONTROL instruction):
if opcode_masked == 96:
    yield_flag = scheduler[524]   // current yield state
else:
    yield_flag = 0

// Post-adjustment: suppress when hot/pressure bits dominate
if (bit5_set || bit4_set):
    yield_flag = 0
if metadata[32] < scheduler[480]:   // behind schedule
    yield_flag = scheduler[396] ? original_yield : 0
The yield flag propagates the scheduler's warp yield state only through CONTROL instructions, ensuring yield hints align with scheduling barriers.
Bit 5 -- Hot-cold classification. Requires hot-cold tracking to be active (scheduler+523 set, gated by scheduler+532 > 0):
if hot_cold_active:
    is_hot = sub_A9CDE0(target_desc, context, instruction)
else:
    is_hot = 0

// Cold detection suppresses priority
if sub_A9CF90(target_desc, context, instruction):   // is_cold?
    pressure_overflow  = 0   // suppress bit 4
    critical_extension = 0   // suppress lookahead
sub_A9CDE0 returns true for global memory loads (LDG), texture fetches (TEX, TLD), and surface operations -- instructions with latencies in the hundreds of cycles. sub_A9CF90 returns true for constant loads (LDC), shared memory operations (LDS/STS) -- low-latency operations. Hot instructions (bit 5 = 1) get higher priority to schedule early and overlap their long latencies with computation. Cold instructions (bit 5 = 0) are deprioritized.
Bit 4 -- Pressure overflow. This bit does NOT appear directly in the initial packing as a single variable. Instead, the pressure overflow signal (v81 in decompiled source) feeds into the candidate comparison logic as an override. The mechanism:
// For barrier-target instructions:
budget_in_units = RegToHWUnits(scheduler, scheduler[432])
headroom = RegToHWUnits(scheduler, 8)
if budget_in_units > headroom + scheduler[72]:      // plenty of headroom
    pressure_overflow = 0
elif latency_counter > min_barrier_latency + 9:     // far from ready
    pressure_overflow = 0
else:
    // Check all 4 register classes against their limits:
    overflow = false
    overflow |= (scheduler[72] + gpr_delta  > budget_hw)
    overflow |= (scheduler[68] + pred_delta > 7)
    overflow |= (scheduler[56] + addr_delta > 7)
    overflow |= (scheduler[60] + ugp_delta  >= target_desc[624])
    pressure_overflow = overflow
When pressure_overflow = 1, the candidate wins the comparison regardless of other bits -- it is the scheduler's mechanism for emergency register pressure relief. In the packed byte's bit 4 position, the hot-cold flag occupies the slot. The pressure overflow signal operates at a higher level: it can force the candidate to win even when its packed priority byte is lower.
Per-register-class overflow limits. The four register classes each have a fixed overflow threshold. Three are hardcoded constants; the fourth is dynamic:
| Class | Scheduler field | Limit | Comparison | Source |
|---|---|---|---|---|
| 0 -- GPR | scheduler[72] | budget_hw | current + delta > budget_hw | RegToHWUnits(scheduler, budget_base), dynamic per-kernel |
| 1 -- Predicate | scheduler[68] | 7 | current + delta > 7 | Hardcoded in sub_8C9320 line 1044 |
| 2 -- Address | scheduler[56] | 7 | current + delta > 7 | Hardcoded in sub_8C9320 line 1045 |
| 3 -- UGP | scheduler[60] | sm_backend[624] | current + delta >= limit | Read from target descriptor at runtime |
Note the asymmetry: classes 0--2 use strict greater-than (>), while class 3 uses greater-than-or-equal (>=). In the decompiled source, the class 3 check is v81 = (target_desc[624] < delta + current) at line 1086 -- strict less-than from the limit's perspective, which is >= from the sum's perspective. This means UGP overflow triggers one count sooner than the other classes relative to their limits.
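A small Python sketch makes the asymmetry concrete (hypothetical function; the `current`/`delta` tuples stand in for the scheduler fields and per-instruction deltas in the table above):

```python
def pressure_overflow(current, delta, gpr_budget_hw, ugp_limit):
    """Check the four register classes; note class 3 (UGP) uses >= not >.

    current/delta: 4-tuples ordered (GPR, predicate, address, UGP).
    """
    over  = current[0] + delta[0] >  gpr_budget_hw  # class 0: GPR, dynamic budget
    over |= current[1] + delta[1] >  7              # class 1: predicate, hardcoded
    over |= current[2] + delta[2] >  7              # class 2: address, hardcoded
    over |= current[3] + delta[3] >= ugp_limit      # class 3: UGP, sm_backend[624]
    return over
```

With `ugp_limit = 63`, a UGP sum of exactly 63 already overflows, while a predicate sum of exactly 7 does not -- the one-count-early trigger described above.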
sm_backend[624] values by architecture. The UGP limit is a field on the SM backend object, accessed via *(*(context+1584) + 624). Its value tracks the architectural uniform register file size:
| Architecture | SM | sm_backend[624] | UR file | Notes |
|---|---|---|---|---|
| Volta | sm_70--sm_73 | 0 | absent | No UR hardware; UGP class check is vacuous |
| Turing | sm_75 | 63 | UR0--UR62 | First architecture with dedicated UR file |
| Ampere | sm_80--sm_87 | 63 | UR0--UR62 | Same UR geometry as Turing |
| Ada | sm_89 | 63 | UR0--UR62 | Same UR geometry |
| Hopper | sm_90 | 63 | UR0--UR62 | Same UR geometry |
| Blackwell | sm_100+ | 63 | UR0--UR62 | Same UR geometry |
All sm_75+ architectures share a 63-entry UR file (UR63 = URZ, the zero register). On pre-Turing targets, sm_backend[624] is 0, making the >= 0 comparison always true for any nonzero UGP delta -- but this is moot because no UGP registers are allocated on those targets, so ugp_delta is always 0.
Bit 3 -- Same-BB preference. Output parameter from sub_8C7290:
same_bb = sub_8C7290.output_param_5 // boolean from resource copy
Set when the instruction belongs to the currently-scheduled basic block. Instructions imported from other BBs by global scheduling get same_bb = 0, reducing their priority relative to local instructions.
Bit 2 -- Stall-free. Computed from the earliest-available-cycle field:
if countdown_active (scheduler[420] != 0):
    if metadata[32] < scheduler[480] AND instr != preferred_instr:
        stall_free = 0
        if pressure_plus_reg_sum > 0:
            goto full_evaluation   // positive pressure = needs analysis
    else:
        stall_free = 1
else:
    // Normal mode: stall-free when producers have completed
    if metadata[32] >= scheduler[480]:
        stall_free = 1
    elif instr == preferred_instr:
        stall_free = 1
    else:
        stall_free = 0
metadata+32 is the instruction's earliest available cycle -- the latest completion time among all its producer instructions. scheduler+480 is the current scheduling cycle. When earliest >= current, all producers have retired and the instruction can issue with zero pipeline stalls.
Bit 1 -- Critical-path / latency-bound. Complex multi-path computation:
if countdown_active (scheduler[420] != 0):
    // Spill mode: almost always critical
    if !(barrier_bits_set_in_priority):
        if slot_limit_exceeded:
            critical = 1
        else:
            critical = !(pressure_sum <= 0 && max_reg_class == 0)
    else:
        critical = 0
else:
    // Normal mode: depth threshold comparison
    if barrier_count >= scheduler[464]:
        critical = 1   // enough barriers ready -> critical path active
    else:
        critical = 0
In spill mode (active when scheduler+420 > 0), the critical-path bit is set for nearly all instructions to maximize scheduling throughput. In normal mode, it activates when the number of barrier-target instructions in the ready list meets or exceeds the depth threshold computed during the pre-scan, indicating that the scheduler is processing a latency-critical dependency chain.
Spill mode state machine. The scheduler tracks three fields that together form a NORMAL/SPILL/NORMAL state machine: spillModeCountdown (+420), spillSequenceCounter (+396), and depthThreshold (+464). The post-processing phase of each SelectBestInstruction call updates them:
// === ENTER SPILL MODE (computed_countdown) ===
// Triggered when winner is a barrier-target AND countdown == 0.
// Pre-scan populates: barrierTargetCount (+424), readyBarrierCount (+428).
if winner_is_barrier_target AND spillModeCountdown == 0:
    N = barrierTargetCount   // from pre-scan of ready list
    T = depthThreshold
    if N <= 3 OR readyBarrierCount > 0:   // few targets or some ready
        if barrierReadyCount <= T OR allPressureInThreshold:
            countdown = barrierReadyCount   // use ready-list count
        else:
            if totalPressure + winnerGPR > budget:
                countdown = depthThreshold  // fall back to threshold
            else:
                countdown = barrierReadyCount
    else:   // many targets, none ready
        if N > T AND partialPressure != totalPressure
             AND totalPressure + winnerGPR > budget:
            countdown = depthThreshold      // pressure-driven threshold
        // else: countdown stays at N (default from barrierTargetCount)

// === TICK (every scheduling step while in spill mode) ===
spillModeCountdown -= 1
spillSequenceCounter -= 1
if winner.earliestCycle >= 0:   // valid producer cycle
    stallFreeCounter -= 1       // decrement stall-free tracker
    if stallFreeCounter == 0
         AND spillSequenceCounter > 0
         AND spillSequenceCounter < maxStallCycles
         AND spillModeCountdown > 1:
        goto exit_spill         // stall-free supply exhausted

// === EXIT SPILL MODE (two paths) ===
// Path A -- barrier-target winner while countdown active:
//   normal tick above, countdown naturally reaches 0.
if spillModeCountdown == 0:
    depthThreshold = -1         // sentinel: force re-scan
// Path B -- non-barrier winner forces immediate exit:
if !winner_is_barrier_target AND spillModeCountdown > 0:
    pack winner's saved priority bits
    if !(same_bb_bit | stall_free_bit)   // no scheduling pressure
         AND slotContention + slotBase + winnerCycle <= slotLimit:
        skip exit               // low-pressure, keep going
    else:
exit_spill:
        spillModeCountdown = 0
        depthThreshold = -1     // force full re-init next call
The sentinel value depthThreshold = -1 causes the next SelectBestInstruction invocation to zero all three counters and re-run the barrier-target pre-scan, which recomputes the depth threshold from the updated ready list. This ensures the transition back to NORMAL mode uses fresh pressure data rather than stale values from the previous spill episode.
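Abstracting away the pressure heuristics, the NORMAL/SPILL transitions reduce to a small countdown state machine. The sketch below keeps only the enter/tick/exit skeleton; the class, its method names, and the default threshold are illustrative stand-ins, and the real engine computes the countdown from barrier-target counts and pressure rather than using the threshold directly.

```python
class SpillTracker:
    """Toy NORMAL/SPILL tracker mirroring the countdown behaviour above."""
    def __init__(self, depth_threshold=4):
        self.countdown = 0                      # stands in for scheduler+420
        self.depth_threshold = depth_threshold  # stands in for scheduler+464

    def in_spill(self):
        return self.countdown > 0

    def step(self, winner_is_barrier_target, under_pressure):
        if winner_is_barrier_target:
            if self.countdown == 0:                 # ENTER spill mode
                self.countdown = self.depth_threshold
            else:                                   # TICK
                self.countdown -= 1
                if self.countdown == 0:             # natural exit (path A)
                    self.depth_threshold = -1       # sentinel: force re-scan
        elif self.in_spill() and under_pressure:    # forced exit (path B)
            self.countdown = 0
            self.depth_threshold = -1

t = SpillTracker()
t.step(True, False)    # barrier-target winner with countdown 0: enter spill mode
t.step(False, True)    # non-barrier winner under pressure: forced exit, sentinel set
```

Both exit paths leave the threshold at the -1 sentinel, which is what forces the fresh pre-scan on the next invocation.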
Bit 0 -- Tiebreaker (barrier-target). Read directly from instruction metadata:
tiebreaker = metadata[108] & 1 // barrier-target flag
Barrier-target instructions (those waiting on a hardware barrier) get bit 0 = 1. Since this is the lowest-priority bit, it only affects ordering when all higher bits are identical. Scheduling barrier targets promptly allows the barrier resource to be retired sooner, freeing scoreboard entries for other instructions.
Packed Byte Assembly
The 8 bits are packed into a single byte using shift-and-mask arithmetic:
priority = (yield_related << 7) // bit 7
| (yield_flag << 6) & 0x7F // bit 6
| (hot_cold << 5) & 0x3F // bit 5 (initially yield copy)
| (hot_flag << 4) & 0x3F // bit 4
| (same_bb << 3) & 0x0F // bit 3
| (stall_free << 2) & 0x0F // bit 2
| (critical_path << 1) & 0x03 // bit 1
| (tiebreaker << 0) & 0x03 // bit 0
The & 0xNN masks are truncation artifacts of the packing arithmetic: each mask keeps its term below the higher-order bits already placed, so every flag still lands in a distinct bit position. In the initial packing, bit 5 and bit 6 both derive from the yield variable; the hot-cold flag (sub_A9CDE0 result) overwrites bit 5 in subsequent repackings that occur during the spill-mode and comparison paths.
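As a sanity check, the MSB-first packing order can be sketched in Python; the function and argument names are descriptive stand-ins, not symbols from the binary.

```python
def pack_priority(yield_related, yield_flag, hot_cold, hot_flag,
                  same_bb, stall_free, critical_path, tiebreaker):
    """Pack the eight heuristic flags into one byte, bit 7 first."""
    priority = 0
    for flag in (yield_related, yield_flag, hot_cold, hot_flag,
                 same_bb, stall_free, critical_path, tiebreaker):
        priority = (priority << 1) | (flag & 1)
    return priority

# A candidate with only stall-free (bit 2) and tiebreaker (bit 0) set:
pack_priority(0, 0, 0, 0, 0, 1, 0, 1)   # 0b00000101 == 5
```

Because the result is a plain integer, the ready list's insertion sort can rank candidates with a single integer comparison, as the comparison section below notes.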
Candidate Comparison
The comparison between the current candidate and the running best is NOT a simple integer comparison of the packed bytes. The function performs a multi-stage refinement:
- Resource vector comparison: A 4-tuple lexicographic comparison of per-register-class pressure scores. The four classes are compared in fixed order; the first class whose scores differ determines the winner. Lower score wins (ascending), so the instruction that increases register pressure less is preferred:
// Per-class pressure scores (computed earlier for each candidate):
//
// score_GPR = max(0, gpr_delta + scheduler[64] - 1)
// +1 when gpr_delta > 0 AND cumulative_gpr > 0
// AND instr_latency < max_predecessor_cycle
// +1 for control ops (opcode 97) when gpr_sum > 0
// score_pred = pred_delta + scheduler[68] // live predicates
// zeroed when total <= 6, or == 7 with pred_delta <= 0
// score_addr = addr_delta + scheduler[56] // live address regs
// zeroed when total <= 6, or == 7 with addr_delta <= 0
// score_UGP = ugp_delta + scheduler[60] // live uniform GPRs
// zeroed when total < UGP_limit (= context[1584]+624)
//
// Best-so-far scores stored as: best_GPR, best_pred, best_addr, best_UGP
// (all initialized to -1 before the first candidate is evaluated)
// 4-tuple lexicographic compare (sub_8C9320 lines 1375-1386)
classes = [GPR, pred, addr, UGP] // fixed priority order
for class in classes:
if current.score[class] != best.score[class]:
if current.score[class] < best.score[class]:
current_wins = true // ascending: lower is better
else:
current_wins = false
goto apply_result // first difference decides
// all four equal -> fall through to priority byte XOR scan
The direction is uniformly ascending across all four classes: the candidate with the smaller pressure score wins. A score of 0 means the instruction does not push the register class above its safe occupancy threshold; larger values indicate increasing pressure overflow. By checking GPR first, the scheduler prioritizes the widest register file -- the class with the greatest spill cost impact.
- Priority byte XOR scan: When resource vectors are equal, the function XORs the current and best packed bytes and checks differing bits in this order:
- Bit 4 (0x10) -- pressure: winner has bit 4 set (higher pressure need)
- Bit 6 (0x40) -- yield: winner has bit 6 set (yield participation)
- Bit 1 (0x02) -- critical: winner has bit 1 set
- Bit 2 (0x04) -- stall-free: winner has bit 2 set
- Bit 5 (0x20) -- hot-cold: winner has bit 5 set (hot memory op)
- Secondary tiebreakers (when all checked bits match):
  - Barrier group index (v213 vs v253)
  - Latency counter comparison (v223 vs v248)
  - Bit 7 yield-related (only when shared-memory count > 0)
  - Contention score (see formula below)
  - Slot manager cycles (max(ComputeCost(instr) - cycleCursor, 0); see Slot Manager Object Layout)
  - Earliest available cycle (metadata+32)
  - Dependency cycle (metadata+92)
  - Latest deadline (metadata+40)
  - Register delta magnitude
- Positional fallback: When all heuristic comparisons are tied, the instruction with the higher BB slot (metadata+24) wins, preserving original program order.
The multi-stage comparison explains why the packed byte uses non-obvious bit ordering. Bits 4, 6, 1, 2, 5 are checked before bit 7 in the refinement path, even though bit 7 is the MSB. The packed byte enables fast ready-list insertion sort (integer comparison), while the full comparison function provides nuanced selection for the actual scheduling decision.
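The 4-tuple stage of the comparison maps directly onto native lexicographic tuple ordering (ascending, first difference decides), which a two-line Python illustration makes concrete; the scores are invented, not taken from the binary.

```python
# (GPR, pred, addr, UGP) pressure scores for two hypothetical candidates:
current = (0, 2, 0, 1)
best    = (0, 3, 0, 0)
# Ascending lexicographic order: GPR ties, pred 2 < 3 decides for current,
# and the later addr/UGP entries are never consulted.
current_wins = current < best
```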
Contention Score Formula
The contention score (v216 in decompiled source) quantifies how much scheduling a given instruction would increase register pressure beyond the budget. It has two cases:
live_after_gpr = scheduler[72] + gpr_delta // projected GPR live count
pressure_sum = pos_pressure_sum + gpr_delta // v289 + v283
if reduced_hw >= live_after_gpr:
// instruction stays within budget -- no pressure penalty
contention_score = base_contention
else:
// instruction would exceed the reduced budget
pressure_delta = live_after_gpr - reduced_hw
pressure_sign = (pressure_sum <= 0) ? 1 : 0
contention_score = base_contention
+ 2 * RegHalfUnits(scheduler, pressure_delta)
- pressure_sign
| Term | Decompiled | Meaning |
|---|---|---|
| base_contention | v72 | Slot-manager scheduling cost. For barrier-target instructions, clamped to max(v72, slotLimit - slotBase). For non-barrier non-stalling instructions, 0 or 1 from vtable dispatch; +100 on the fallthrough path when no stall-free candidates exist. |
| gpr_delta | v283 | GPR register delta from sub_8C7290 resource vector (element 5 of the 10-element vector). Positive means the instruction defines more registers than it consumes. |
| pos_pressure_sum | v289 | Accumulated positive GPR deltas from prior scheduling steps (element 9 of v284 resource vector). Tracks cumulative register pressure increase within the BB. |
| reduced_hw | v222 | Tighter register budget in HW units: RegToHWUnits(budget_base - budget_base/16), or RegToHWUnits(budget_base - knob_760) when knob 760 is active. Approximately 6% below the full budget. |
| live_after_gpr | v191 | Projected GPR live count after scheduling this instruction: scheduler[72] + gpr_delta. |
| pressure_sign | v212 | Boolean: 1 when pressure_sum <= 0 (net pressure is non-positive across ready candidates including this one), 0 otherwise. Provides a one-unit discount when pressure is already declining. |
| RegHalfUnits | sub_6818B0 | Inverse of RegToHWUnits (sub_6818D0): when LivenessCountRegComp (bit 3 of scheduling mode flags) is set, returns ceil(count / 2); otherwise identity. Converts register-count overshoot into HW-aligned penalty units. |
The guard condition reduced_hw < live_after_gpr activates the penalty term only when scheduling this instruction would push projected GPR pressure above the reduced budget. The 2 * multiplier doubles the penalty to make register overflow a strong differentiator in the tiebreaker cascade. The - pressure_sign term subtracts 1 when the overall pressure direction is downward, slightly favoring instructions that are part of a declining pressure trend even if they individually overshoot the budget.
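A minimal executable model of the two-case formula, with reg_half_units standing in for RegHalfUnits under the assumption that LivenessCountRegComp is set (ceil of half); all input values are illustrative.

```python
def reg_half_units(count):
    # RegHalfUnits stand-in with LivenessCountRegComp set: ceil(count / 2)
    return (count + 1) // 2

def contention_score(base, gpr_delta, live_gpr, pos_pressure_sum, reduced_hw):
    live_after = live_gpr + gpr_delta            # projected GPR live count
    pressure_sum = pos_pressure_sum + gpr_delta
    if reduced_hw >= live_after:
        return base                              # within budget: no penalty
    overshoot = live_after - reduced_hw          # registers past the reduced budget
    sign_discount = 1 if pressure_sum <= 0 else 0
    return base + 2 * reg_half_units(overshoot) - sign_discount

# Overshooting the reduced budget by 3 registers costs 2 doubled HW units:
contention_score(base=1, gpr_delta=2, live_gpr=40,
                 pos_pressure_sum=5, reduced_hw=39)   # 1 + 2*2 - 0 == 5
```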
Scheduler State Updates
After selecting the best candidate, the function updates scheduler state:
// Spill mode countdown
if winner is barrier-target:
scheduler[420] = computed_countdown - 1
scheduler[396] -= 1 // spill sequence counter
if metadata[32] >= 0:
scheduler[400] -= 1 // stall-free counter
if stall_free_count==0 AND remaining>0 AND countdown>1:
scheduler[420] = 0 // force exit spill mode
scheduler[464] = -1 // reset depth threshold
else:
// Non-barrier winner in countdown mode
if !(barrier_bits in priority) AND slot_cost within budget:
// do nothing, continue countdown
else:
scheduler[420] = 0 // exit spill mode
scheduler[464] = -1 // reset depth threshold
// Slot manager update (when winner has positive scheduling cost)
if best_cost > 0 AND slotManager.pendingCount > 0:
if slotManager.advanceReady:
slotManager.cycleCursor += slotManager.cycleStride // advance epoch
slotManager.pendingCount = 0 // reset batch
slotManager.anchorInstr = NULL // clear anchor
best.metadata[28] = slotManager.vtable.ComputeCost(slotManager, winner, 0)
// Hot-cold counter update
if hot_cold_active AND winner is cold (sub_A9CF90 returns true):
scheduler[532] -= 1 // decrement hot-cold budget
elif hot_flag was set for winner:
scheduler[523] = 0 // disable hot-cold tracking
The function returns the best instruction pointer and writes the second-best to *a2 for lookahead scheduling.
Slot Manager Object Layout
The slot manager is a polymorphic C++ object at scheduler+16. It tracks scheduling-cycle budgets so the priority function can detect when issuing an instruction would exceed the per-BB throughput limit. Binary evidence: sub_8C9320 lines 911--945, 1113, 1119, 1789, 1832--1848; sub_688DD0 lines 631--638.
| Offset | Size | Name | Purpose |
|---|---|---|---|
| +0 | 8 | vtable* | Per-architecture method table pointer |
| +28 | 4 | cycleCursor | Cumulative issue-cycle position. Incremented by 1 per instruction via vtable[12] (sub_680080) and by cycleStride on epoch advance |
| +44 | 4 | cycleStride | Epoch advance delta added to cycleCursor when advanceReady fires |
| +52 | 4 | cycleLimit | Upper throughput bound; budget test is cycleCursor + contention + cycleStride <= cycleLimit |
| +76 | 4 | pendingCount | Instructions scheduled since last epoch advance; guards the advance path |
| +80 | 8 | anchorInstr | Instruction that anchored the current epoch; cleared on advance |
| +140 | 1 | advanceReady | Boolean; when set the next winner-update advances cycleCursor by cycleStride |
Three vtable slots drive the slot manager:
| Vtable idx | Offset | Signature | Role |
|---|---|---|---|
| 12 | +96 | IncrementCount(slotMgr, instr) | Per-instruction: ++cycleCursor. Fast path sub_680080 inlines to a single ++DWORD[7] |
| 36 | +288 | ComputeCost(slotMgr, instr, 0) -> int | Returns the cycle cost for instr. Contention = max(cost - cycleCursor, 0) |
| 43 | +344 | CompareAnchors(slotMgr, anchor, cand, 0) -> bool | When cycleCursor >= cost and anchorInstr != NULL, tests whether cand shares the anchor's slot. Result XOR'd to 0/1 contention |
The contention value feeds into two budget checks: (1) spill-mode critical-path test cycleStride + contention + cycleCursor <= cycleLimit (line 1119), and (2) second-best spill-exit comparison at line 1789. A third guard at line 1113 checks cycleCursor + 100 >= minReadyCycle to suppress non-barrier candidates when the cursor is near the earliest ready cycle.
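The budget checks share one contention term, which is simply how far an instruction's cycle cost sits past the cursor. A hedged sketch of that arithmetic, with all values illustrative:

```python
def contention(cost, cycle_cursor):
    # Contention = max(cost - cycleCursor, 0): zero when the cursor has
    # already advanced past the instruction's cycle cost.
    return max(cost - cycle_cursor, 0)

def within_budget(cycle_cursor, cost, cycle_stride, cycle_limit):
    # The spill-mode critical-path test quoted above:
    # cycleCursor + contention + cycleStride <= cycleLimit
    return cycle_cursor + contention(cost, cycle_cursor) + cycle_stride <= cycle_limit

within_budget(cycle_cursor=10, cost=14, cycle_stride=2, cycle_limit=20)  # 10+4+2 <= 20
```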
During the scheduling loop (sub_688DD0 line 637--638), each non-BB-boundary instruction triggers sub_682490(scheduler, instr, prev, 1). That function walks register operands, matches source/destination conflicts against the live-register bitmask, and returns a register-port-conflict cost written to metadata+28. This becomes best_cost (v233) that gates the slot manager advance.
Dependency DAG Construction
The dependency graph is built in two stages before the scheduling loop begins. The DAG is a directed acyclic graph where nodes are instructions and edges represent ordering constraints with associated latency values.
Stage 1: Pre-Scan (sub_8CF880, 28 KB)
Iterates basic blocks in reverse order (bb[N-1] to bb[0]) using the BB ordering array at func+512.
For each BB:
- Check knobs 314/313 for per-BB scheduling skip flags
- Walk the instruction linked list, identifying NOP/control instructions
- Set bb->next pointers and configure BB scheduling state
- Delegate to Stage 2 (sub_8D9930) for edge construction
- Manage memory arenas with SSE-optimized copies for metadata arrays
Contains approximately 14 nested loops for edge construction. The reverse iteration order ensures that when the scheduler processes a BB, all of its successors have already been characterized.
Stage 2: Edge Construction (sub_8D9930, 19 KB)
For each pair of instructions within a BB, checks for five dependency types:
| Type | Abbreviation | Condition | Edge Latency |
|---|---|---|---|
| True | RAW | Read-after-write on same register | Producer's pipeline latency |
| Anti | WAR | Write-after-read on same register | 0 (ordering constraint only) |
| Output | WAW | Write-after-write on same register | 1 (minimum separation) |
| Memory | -- | Store before load to same memory space | Conservative; full ordering |
| Barrier | -- | Instruction depends on barrier/sync result | Barrier completion latency |
Operand analysis is performed by sub_894290 (27 KB), which processes 16-bit operand descriptors encoding:
| Bits | Field |
|---|---|
| 12--15 | Register class |
| 8--11 | Bank number |
| 0--7 | Dependency type |
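Decoding the three descriptor fields is plain shift-and-mask arithmetic; a sketch (the helper name and the example descriptor value are invented for illustration):

```python
def decode_operand(desc):
    """Split a 16-bit operand descriptor into the fields tabulated above."""
    return {
        "reg_class": (desc >> 12) & 0xF,   # bits 12-15
        "bank":      (desc >> 8) & 0xF,    # bits 8-11
        "dep_type":  desc & 0xFF,          # bits 0-7
    }

decode_operand(0x2A03)   # reg_class 2, bank 10, dep_type 3
```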
Memory dependencies are conservative: all stores are ordered before subsequent loads to the same memory space. The scheduler does not perform alias analysis -- it relies on the memory space classification from sub_693BC0 to determine whether two operations might conflict.
Pair-check pruning via last-writer tracking
Despite the "for each pair" description above, the implementation does not perform an all-pairs O(N^2) comparison. sub_68A690 (BuildDependencies) maintains two RB-trees keyed by register ID -- one for the last writer of each register, one for the last reader set -- and uses bitvector-indexed iteration to touch only the registers an instruction actually references.
function BuildDependencies(reg_tree, bb, slot_index):
// reg_tree: RB-tree { reg_id -> last_writer_instr }
// reader_tree: RB-tree { reg_id -> last_reader_instr_list }
// Both are reset at BB entry, so they track only intra-BB state.
def_bv = BitVector.alloc(slot_index) // sub_BDBEF0: defs bitvector
use_bv = BitVector.alloc(slot_index) // sub_BDBF80: uses bitvector
for instr in bb.instructions (program order):
// --- extract operand bitvectors from instr+84 descriptors ---
ProcessOperands(instr, &def_bv, &use_bv) // sub_6A78F0
// --- RAW edges: for each source register, find its last writer ---
for reg_id in use_bv.set_bits(): // O(K_use) iterations
writer = reg_tree.lookup(reg_id) // O(log R) RB-tree search
if writer != NULL:
AddEdge(writer, instr, RAW, reg_id) // sub_6848D0
// --- WAR edges: for each dest register, find its last reader ---
for reg_id in def_bv.set_bits(): // O(K_def) iterations
reader = reader_tree.lookup(reg_id) // O(log R)
if reader != NULL:
AddEdge(reader, instr, WAR, reg_id) // sub_684970
// --- WAW edges: dest-vs-previous-dest ---
prev_writer = reg_tree.lookup(reg_id) // O(log R)
if prev_writer != NULL:
AddEdge(prev_writer, instr, WAW, reg_id) // sub_684920
reg_tree.insert_or_update(reg_id, instr) // update last writer
reader_tree.remove(reg_id) // clear stale readers
// --- update reader set for sources ---
for reg_id in use_bv.set_bits():
reader_tree.insert(reg_id, instr)
def_bv.clear()
use_bv.clear()
// Complexity: O(N * K * log R) where:
// N = instructions in BB, K = avg operands per instruction (~3),
// R = distinct live registers (bounded by register file size, typ. <256).
// For a BB of 500 instructions this is ~4,500 tree ops instead of 125,000 pairs.
The SBPruning knob (DAG knobs region, ROT13-encoded) enables or disables this pruning strategy. When disabled, the fallback path in sub_8D9930 reverts to the straightforward pair-wise scan. A separate AdvancedSBPruningBudget knob caps the total number of edges that can be created per BB, providing an additional compile-time safety valve for pathological cases.
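The last-writer/last-reader scheme above can be exercised with plain dicts standing in for the RB-trees and sets standing in for the bitvectors; instructions are (name, defs, uses) triples and all data below is illustrative.

```python
def build_edges(instrs):
    """Create RAW/WAR/WAW edges by tracking last writer and readers per register."""
    last_writer, last_readers = {}, {}
    edges = []
    for name, defs, uses in instrs:          # program order
        for r in uses:                       # RAW: source vs last writer
            if r in last_writer:
                edges.append((last_writer[r], name, "RAW", r))
        for r in defs:
            for reader in last_readers.get(r, []):   # WAR: dest vs readers
                edges.append((reader, name, "WAR", r))
            if r in last_writer:                     # WAW: dest vs prev writer
                edges.append((last_writer[r], name, "WAW", r))
            last_writer[r] = name            # update last writer
            last_readers[r] = []             # clear stale readers
        for r in uses:                       # record this instr as a reader
            last_readers.setdefault(r, []).append(name)
    return edges

bb = [("I0", {"R1"}, set()),       # R1 = ...
      ("I1", {"R2"}, {"R1"}),      # R2 = f(R1)
      ("I2", {"R1"}, {"R2"})]      # R1 = g(R2)
edges = build_edges(bb)
```

Each instruction touches only the registers it references, so the work is proportional to operand count rather than to all earlier instructions in the BB, which is the point of the pruning.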
Supplementary Dependency Builders
These functions handle specific aspects of dependency construction in the 0x6A0000--0x6B0000 range:
| Address | Size | Purpose |
|---|---|---|
| sub_68A690 | 31 KB | BuildDependencies -- walks instruction lists and creates producer-consumer dependency edges from def-use chains |
| sub_6A97B0 | 26 KB | AddDependencyEdges -- register-level data dependency edges |
| sub_6A2D30 | 11 KB | ChainDependencies -- memory ordering constraints (ordering edges between memory operations even without explicit data deps) |
| sub_6A78F0 | 23 KB | ProcessOperands -- iterates operand arrays at instruction +84, extracts register file pressure and dependency distance information |
Instruction Emission
sub_925510 (341 bytes, 57 lines) is the universal instruction relocation primitive. It moves an instruction to a new position in the doubly-linked instruction list.
function MoveInstruction(block, instr, insert_before):
// 1. Unlink from current position
instr.prev.next = instr.next
instr.next.prev = instr.prev
// 2. Insert before reference instruction
instr.next = insert_before
instr.prev = insert_before.prev
insert_before.prev.next = instr
insert_before.prev = instr
// 3. Update block boundaries if needed
if instr was block.head (block+272):
block.head = instr.next
if instr was block.tail (block+280):
block.tail = instr.prev
// 4. Notify subsystems
UpdateDependencyGraph(block, instr) // sub_7EEC10
UpdateBlockTimestamp(block) // sub_7DDCA0
This function has 13 callers across the codebase. It serves as the shared instruction movement primitive for the scheduler, register allocator, and peephole optimizer.
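Minus the block-boundary updates and subsystem notifications, the unlink/relink sequence behaves like this minimal Python model; it assumes the moved node has neighbors on both sides, and the class names are stand-ins.

```python
class Node:
    def __init__(self, name):
        self.name, self.prev, self.next = name, None, None

def link(*nodes):
    """Chain nodes into a doubly-linked list and return them."""
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes

def move_before(instr, insert_before):
    # 1. Unlink from current position (assumes both neighbors exist)
    instr.prev.next = instr.next
    instr.next.prev = instr.prev
    # 2. Insert before the reference instruction
    instr.next = insert_before
    instr.prev = insert_before.prev
    insert_before.prev.next = instr
    insert_before.prev = instr

a, b, c, d = link(Node("A"), Node("B"), Node("C"), Node("D"))
move_before(c, b)                 # A, C, B, D
order, n = [], a
while n:
    order.append(n.name)
    n = n.next
```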
Resource Tracking
The scheduler maintains 10 functional unit resource counters per basic block, tracking pipeline utilization to avoid saturating any single execution unit.
Resource Vector Layout
Each per-BB resource slot occupies 84 bytes (21 DWORDs) stored at *(scheduler+672) + 84 * slot_index:
| Offset | Size | Content |
|---|---|---|
| 0--36 | 10 x int32 | Current resource usage per functional unit |
| 40--76 | 10 x int32 | Resource pressure delta (change since last step) |
| 80 | int32 | BB-entered flag and auxiliary bits |
Functional Unit Pipes
| Index | Pipe | Typical Instructions |
|---|---|---|
| 0 | Integer ALU | IADD, IMAD, ISETP, LOP, SHF |
| 1 | FP32 | FADD, FFMA, FMUL, FSETP |
| 2 | FP64 | DADD, DFMA, DMUL |
| 3 | Tensor core | HMMA, IMMA, BMMA, BGMMA |
| 4 | Load/store | LD, ST, LDG, STG, LDS, STS |
| 5 | Texture | TEX, TLD, TXQ |
| 6 | Branch/control | BRA, JMP, EXIT, RET, BAR |
| 7 | Shared memory | ATOMS, REDS (overlaps with pipe 4 for LDS/STS) |
| 8 | Special function | MUFU (RCP, RSQ, SIN, COS, EX2, LG2) |
| 9 | Uniform/predicate | UPLOP, UISETP, uniform operations |
Resource Tracking Helpers
| Address | Size | Purpose |
|---|---|---|
| sub_A091C0 | -- | Initialize per-BB resource arrays to zero |
| sub_A09530 | 365 bytes | Update stall cycle counters after scheduling an instruction. Decrements pending latency counters for all tracked resources. |
| sub_A09D40 | -- | Update WAR (anti-dependency) resource tracking for register operands |
| sub_A08A00 | -- | Resource model query (called in 3 modes by sub_8C67A0) |
The resource model sub_A08A00 is called three times per instruction by sub_8C67A0:
- Mode 1: instruction's own execution cost (FU assignment + pipeline latency)
- Mode 2: operand release costs (freed resources when an operand reaches last-use)
- Mode 3: combined instruction + BB-level impact (aggregate pressure)
SSE intrinsics (_mm_add_epi32, _mm_loadu_si128) are used throughout for vectorized resource accumulation and copying.
Register Pressure Tracking
The scheduler tracks register liveness via a bitvector at scheduler+832. Each bit represents one register; the pressure is the popcount of the live set.
function UpdateRegisterPressure(sched, instr):
for each operand in instr.operands:
if operand.is_def:
set_bit(sched.live_bv, operand.reg) // DEF: mark live
if operand.is_last_use:
clear_bit(sched.live_bv, operand.reg) // LAST-USE: mark dead
sched.current_pressure = popcount(sched.live_bv)
The bitvector is sized to (numRegs + 1) words, or (2 * numRegs + 2) when knob 420 (dual-register tracking) is active. Dual-register tracking separately tracks register pairs for instructions that consume or produce 64-bit values.
Pressure state fields:
| Offset | Content |
|---|---|
| scheduler+432 | Target register count (from budget computation) |
| scheduler+324 | Committed register target |
| scheduler+316 | Minimum register count |
| scheduler+320 | Register pressure slack (headroom) |
When current_pressure > scheduler+432, the priority function sets bit 4 (pressure overflow) in the encoding, biasing the scheduler toward instructions that release registers.
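A toy version of the bitvector liveness update; the operand encoding (register number, def flag, last-use flag) and the register numbers are illustrative.

```python
def update_pressure(live_bv, operands):
    """Apply one instruction's operands to the live bitvector."""
    for reg, is_def, is_last_use in operands:
        if is_def:
            live_bv |= 1 << reg            # DEF: mark live
        if is_last_use:
            live_bv &= ~(1 << reg)         # LAST-USE: mark dead
    return live_bv, bin(live_bv).count("1")  # pressure = popcount of live set

bv = 0
bv, p = update_pressure(bv, [(0, True, False), (1, True, False)])  # defs R0, R1
bv, p = update_pressure(bv, [(2, True, False), (0, False, True)])  # def R2, kill R0
```

After the second step the live set is {R1, R2}, so the pressure stays at 2 even though a new register was defined.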
Per-Instruction Scheduling Metadata (SchedNode)
Each instruction has a pointer at instr+40 to a heap-allocated SchedNode block. The offsets below are relative to the SchedNode base, not the 296-byte Ori instruction. See the SchedNode layout for the authoritative field map.
| Offset | Type | Content |
|---|---|---|
| +8 | int32 | dep_count -- unsatisfied predecessor count (0 = ready) |
| +16 | QWORD | next_ready -- linked-list pointer in ready list |
| +24 | int32 | bbSlot -- 1-based BB position (-1 = unscheduled) |
| +28 | int32 | latency_counter -- current stall counter |
| +32 | int32 | earliestCycle -- earliest available cycle |
| +40 | int32 | latestDeadline -- latest deadline cycle |
| +44 | int32 | Barrier group index |
| +88 | int32 | maxPredecessorCycle |
| +92 | int32 | maxDependencyCycle |
| +108 | byte | Flags: bit 0 = barrier target, bit 1 = has dependency, bit 2 = early schedulable, bit 3 = late schedulable, bit 4 = has register operand |
| +111 | byte | Flags: bit 7 = uses expensive register file |
Extended SchedNode Fields (+112 through +236)
The alternative scheduling loop sub_68B9C0 (46 KB) and the main scheduling driver sub_688DD0 access fields well beyond +111. These fields are populated during cross-block scheduling setup and consumed by the region-ordering and resource-accounting phases. The total SchedNode allocation is at least 240 bytes, not the 112 documented above.
| Offset | Size | Type | Name | Purpose |
|---|---|---|---|---|
| +112 | 16 | -- | (reserved) | Alignment padding / internal state between +111 and +128 |
| +128 | 8 | ptr | regionChainNext | Singly-linked next pointer for the cross-block region chain. sub_68B9C0 walks this chain to iterate over BB-representative nodes: for (node = head; node; node = *(QWORD*)(node + 128)). Also written during chain construction: *(QWORD*)(node + 128) = prev_head. Separate from the func+104 metadata chain at +0. |
| +136 | 8 | -- | (reserved) | Padding between chain pointer and scheduling region index |
| +144 | 4 | i32 | schedRegionIndex | Index into the 72-byte per-block scheduling record array at scheduler+184. Used as 72 * *(int*)(node + 144) to reach the block's pressure snapshot and resource state. Also serves as the hash key for the FNV-1a region dedup cache in sub_68B9C0 (hashed with constants 0x811C9DC5 / 16777619). |
| +148 | 16 | -- | (reserved) | Gap between region index and resource class |
| +164 | 4 | i32 | resourceClassIndex | Index into the resource-class record array (40 bytes per entry). sub_688DD0 computes src + 40 * *(int*)(node + 164) to look up the 10-element register-delta vector for pressure subtraction during unscheduling. |
| +168 | 68 | -- | (reserved) | Internal state for the cross-block scheduling engine. Offsets +168 through +232 are not directly accessed via named patterns in sub_68B9C0 or sub_688DD0. |
| +236 | 4 | i32 | regionOrderWeight | Region ordering weight used by the cross-block BB traversal. sub_68B9C0 reads this to decide whether to skip or schedule a region: values near INT_MIN (0x80000001) or INT_MAX (0x7FFFFFFF) are sentinels meaning "boundary" (the test (weight + 0x7FFFFFFF) > 0x7FFFFFFE catches both). The loop iterates through the region chain comparing consecutive weights to detect region transitions. |
The scheduling loop also reads Ori instruction fields directly (not via the SchedNode): instr+72 (opcode), instr+80 (operand count), instr+84 (operand descriptors).
Sentinel values: bbSlot -1 (unscheduled), latency 0x1869F (99999 = infinity), regionOrderWeight INT_MIN/INT_MAX (boundary marker).
The dep_count field at +8 is the key scheduling control: it counts unsatisfied predecessors in the dependency DAG. When a predecessor is scheduled, the engine decrements every successor's dep_count. When dep_count reaches zero, the instruction becomes ready and is inserted into the ready list.
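The dep_count mechanics drive the whole loop. A toy list scheduler that uses node names as its "priority" (standing in for the packed-byte heuristic) shows the decrement-and-release pattern:

```python
def schedule(succs, dep_count):
    """Greedy list scheduling: place ready nodes, release their successors."""
    ready = sorted(n for n, c in dep_count.items() if c == 0)
    order = []
    while ready:
        winner = ready.pop(0)              # best candidate (toy priority: name)
        order.append(winner)
        for s in succs.get(winner, []):    # satisfy outgoing DAG edges
            dep_count[s] -= 1
            if dep_count[s] == 0:          # all predecessors placed: now ready
                ready.append(s)
        ready.sort()
    return order

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
dep_count = {"A": 0, "B": 1, "C": 1, "D": 2}
order = schedule(succs, dep_count)
```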
Stall and Barrier Insertion
After the scheduling loop determines instruction order, sub_8D7760 (41 KB) converts the abstract schedule into SASS control words.
For each instruction in the scheduled order:
| Field | Computation | Range |
|---|---|---|
| Stall count | Distance in cycles to the nearest dependent consumer | 0--15 (capped by knob 805, max 16) |
| Yield hint | Warp scheduling hint -- should the HW scheduler switch to another warp? | 0 or 1 |
| Barrier assignment | Which of the 6 available barriers this instruction writes/waits on | 0--5, or none |
| Scoreboard deps | Read/write dependency tracking for the hardware scoreboard | Bitmask |
Per-Architecture Barrier Model
The barrier model parameters come from per-SM data tables indexed through the architecture backend. The scheduling context reaches the SM profile via ctx+8 -> backend[198] -> profile, carrying the scoreboard config table at profile+54216 and a secondary gate at +54224. Two polymorphic vtable variants (A/B, differing in 2 of 65 slots at 0x21DAA00/0x21DAC10) are selected by the profile's scoreboard model byte.
Per-SM parameters from per_sm_scoreboard_configs.json and per_sm_dependency_rules.json:
| SM | Gen | HW barriers | SB entries | Max triplets | Dep rules | Max stall | Max barrier lat |
|---|---|---|---|---|---|---|---|
| sm_60 | Pascal | 6 | -- | -- | 619 | 39 | 56 |
| sm_70 | Volta | 6 | -- | -- | 619 | 39 | 56 |
| sm_75 | Turing | 6 | -- | -- | 619 | 39 | 56 |
| sm_80 | Ampere | 6 | 31 | 1 | 256 | 39 | 56 |
| sm_86 | Ampere | 6 | 31 | 1 | 256 | 39 | 56 |
| sm_89 | Ada | 6 | 32 | 1 | 256 | 39 | 56 |
| sm_90 | Hopper | 6 | 31 | 1 | 256 | 39 | 56 |
| sm_90a | Hopper | 6 | 32 | 1 | 256 | 39 | 56 |
| sm_100 | Blackwell | 6 | 75 | 6 | 430 | 39 | 56 |
| sm_103 | Blackwell | 6 | 46 | 1 | 430 | 39 | 56 |
All architectures share 6 HW barriers and raw max stall of 39 (capped to 15 by the 4-bit field or knob 805). Yield threshold is arch-dependent (typically 4 on sm_80+). The key differentiator is the scoreboard config table: sm_80--90a use 31--32 single-triplet entries ({scoreboard_id, threshold, mask}), while sm_100 uses 75 entries with up to 6 triplets, reflecting Blackwell's richer scoreboard hierarchy.
Dispatch Pseudocode for sub_8D7760
function EmitScoreboardRecords(sched_ctx, instr, dep_table, out):
backend = *(sched_ctx + 8)
profile = *(backend[198] + 16) // SM profile
sb_mode = *(u8*)(profile + 54216) // scoreboard model selector
sb_gate = *(u32*)(profile + 54224) // secondary gate
vtable = **(backend + 1584) // polymorphic vtable A or B
// vtable+1816: isScoredInstruction(obj, instr) -> bool
// vtable+1432: isBarrierExempt(obj, instr) -> bool
opcode = instr[18] & 0xFFFFCFFF // mask modifier bits 12-13
if opcode in {0x55/*TEX*/, 0x6D/*TLD4*/}:
group = lookup_texture_tables(instr, backend)
elif opcode == 0x5B: group = 6 // BSSY
elif opcode == 0x5C: group = 5 // BSYNC
elif opcode == 0xB7 && has_ld_flag: group = 13
else: group = sub_7DFFC0(instr, backend) // resource class query
// 330-entry jump table at 0x21D9EF8 (guard: opcode <= 0x149)
// 206 specialized handlers, 124 default-path
emit_records_via_sub_8C25B0(group, instr, out)
if backend[1368] & 0x20: // extended scoreboard
if vtable[+1816](*(backend+1584), instr):
emit_extended_record(out, instr)
The function manages a 32-entry barrier table for tracking active assignments.
See Scoreboards & Barriers for the control word encoding format.
Alternative Scheduling Loop
sub_68B9C0 (46 KB) is a monolithic function that combines dependency graph construction with the scheduling loop. It serves as an alternative entry point for scheduling passes that need to build the DAG inline rather than using the pre-built graph from Stage 1.
Internal structure:
- Initialize scheduling state (sub_685700)
- Initialize ready-list management (sub_687080)
- Check resource conflicts (sub_687410)
- Inner loop (while(2) infinite loop with break conditions):
  - Check if ready list is empty -- break if so
  - Check opcode 97 (STG in ROT13; used as scheduling barrier/control marker) -- special handling
  - Select best instruction from ready list
  - Schedule it: assign cycle, update resources, process edges
  - For each successor: decrement dep_count, add to ready list if zero
  - Check boundary condition (v236) -- break if done
- Track first-pass initialization via v215
This function accesses the Ori instruction's opcode at instr+72, plus SchedNode fields (via the instr+40 pointer): +24 (bbSlot), +144 (schedRegionIndex), +164 (resourceClassIndex), and +236 (regionOrderWeight).
Specialized Scheduling Strategies
The region 0x89C550--0x8BE320 contains 17+ specialized scheduling strategies. These are selected based on code characteristics (loop structure, tensor operations, function size) and optimization level. Each strategy implements a variation of the core list scheduling algorithm with different heuristics or search strategies.
| Address | Size | Strategy | Description |
|---|---|---|---|
| sub_8B1190 | 16 KB | Backtracking | Undo and retry on scheduling conflicts. Rolls back the last N steps and tries alternative orderings. Bounded depth prevents exponential blowup. |
| sub_8B2D90 | 18 KB | Global optimization | Cross-BB scheduling. Moves instructions across BB boundaries when safe (no side effects, dominance preserved). |
| sub_8B4590 | 13 KB | Permutation search | Exhaustive permutation of instruction orderings for small BBs. Falls back to heuristic for larger blocks. |
| sub_8B5400 | 14 KB | Latency-optimized | Maximizes memory latency hiding by aggressive interleaving of independent operations. |
| sub_8B6D60 | 12 KB | Pressure-optimized | Minimizes live range overlap by scheduling defs as close to their uses as possible. |
| sub_8B77C0 | 15 KB | Dual-issue | Pairs co-issuable instructions for dual-issue architectures (sm_50/Maxwell). Uses sub_A9CDE0 and sub_A9CF90 for compatibility checks. |
| sub_8B8900 | 12 KB | Tensor scheduling | HMMA/BMMA/BGMMA grouping for warpgroup tensor operations. |
| sub_8B9390 | 23 KB | Software pipelining | Loop body overlapping -- interleaves iterations to fill pipeline bubbles. |
| sub_8BAAE0 | 15 KB | Loop-aware | Trip count + register awareness for loop bodies. |
| sub_8BB9C0 | 8.2 KB | Prefetch scheduling | Inserts and schedules memory prefetch instructions. |
| sub_8BC0B0 | 6.1 KB | Barrier coalescence | Merges adjacent barrier instructions to reduce overhead. |
| sub_8BC990 | 7.6 KB | Scoreboard optimization | Minimizes scoreboard entries by reusing barrier registers. |
| sub_8BCFA0 | 6.8 KB | Warp schedule optimization | Warp-level yield tuning for multi-warp scheduling. |
| sub_8BDC40 | 7.9 KB | Dual-issue pairing | Instruction pair selection for dual-issue slots. |
| sub_8BE320 | 25 KB | Complex combined pass | Multi-strategy combined pass for complex code patterns. |
| sub_8A9D80 | 21 KB | Depth-first | DFS-based instruction ordering for deep dependency chains. |
| sub_8AB750 | 9.8 KB | Critical path | DAG analysis for priority weight computation. |
Backtracking Scheduler
The backtracking strategy (sub_8B1190) is notable because it breaks from the greedy nature of standard list scheduling. When a scheduling decision leads to excessive stalls or resource conflicts, it can undo the last N steps (where N is bounded by a configurable depth), re-insert the affected instructions into the ready list, and try a different selection. This provides limited but effective lookahead without the full cost of optimal scheduling.
Dual-Issue Scheduling
For sm_50 (Maxwell), sub_8B77C0 pairs instructions that can execute simultaneously on different functional units. Eligibility is checked by sub_8CF5D0 (3.5 KB), which verifies architecture support and computes a dual-issue benefit score at scheduler+328. Pairing compatibility uses sub_A9CDE0 (is instruction dual-issuable?) and sub_A9CF90 (can this pair with the next instruction?).
Size Limits and Chunking
Two mechanisms prevent the scheduling algorithm from hitting quadratic complexity on large inputs:
BB Size Limit
sub_8CBAD0 scans all basic blocks during pre-scheduling setup. Any BB exceeding 4095 instructions is split by inserting scheduling barriers (sub_931920). This caps the per-BB scheduling problem size, ensuring the O(n^2) dependency graph construction remains tractable. The maximum BB size is tracked at scheduler+388.
Large Function Chunking
Functions exceeding 16383 instructions (*(a1+372) > 0x3FFF) trigger chunk-based scheduling via sub_A9DDD0 (11.5 KB). The function is partitioned into chunks that are scheduled independently, then the results are merged. This avoids the full O(n^2) DAG construction for very large kernels. The chunk boundary selection respects BB boundaries and dependency chains to minimize cross-chunk constraint violations.
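As a sketch of the chunking idea (not the binary's boundary-selection heuristic, which also weighs dependency chains), whole basic blocks can be greedily grouped into chunks under the instruction limit:

```python
def chunk_function(blocks, limit=16383):
    """Group whole basic blocks into chunks of at most `limit` total
    instructions. Each chunk can then be scheduled independently,
    avoiding a single O(n^2) DAG over the whole function."""
    chunks, cur, cur_size = [], [], 0
    for bb in blocks:  # bb = list of instructions
        if cur and cur_size + len(bb) > limit:
            chunks.append(cur)  # close the current chunk at a BB boundary
            cur, cur_size = [], 0
        cur.append(bb)
        cur_size += len(bb)
    if cur:
        chunks.append(cur)
    return chunks
```

A block larger than the limit still forms its own chunk here; in the real pipeline the 4095-instruction BB split runs first, so that case does not arise.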
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_6820B0 | 1.5 KB | BuildReadyList -- zero-dep instruction scan | HIGH |
| sub_682200 | -- | UnlinkFromReadyList -- remove and update deps | HIGH |
| sub_682490 | 14 KB | RegisterPressureAnalyzer -- per-class deltas | HIGH |
| sub_6833F0 | 10 KB | InitScheduleRegion -- per-BB setup, knob query | HIGH |
| sub_685700 | -- | InitSchedulingState -- loop initialization | MEDIUM |
| sub_685A10 | 11 KB | InstructionBarrierCheck -- opcode analysis | HIGH |
| sub_687080 | -- | ReadyListManagementHelper | MEDIUM |
| sub_687410 | -- | ResourceConflictCheck | MEDIUM |
| sub_687FE0 | 12 KB | ScheduleBlock -- per-BB scheduling entry | HIGH |
| sub_688DD0 | 20 KB | ScheduleEngine -- unified 3-mode core loop | HIGH |
| sub_68A690 | 31 KB | BuildDependencies -- def-use chain DAG | HIGH |
| sub_68B9C0 | 46 KB | MainSchedulingLoop -- combined DAG + scheduling | HIGH |
| sub_692200 | 18 KB | SchedulingHeuristic -- priority with FP scoring | HIGH |
| sub_695530 | 15 KB | ComputeLatencies -- instruction latency computation | HIGH |
| sub_69B7D0 | 17 KB | TopologicalSort -- valid execution ordering | HIGH |
| sub_69F170 | 12 KB | CriticalPathAnalysis -- DAG critical path | HIGH |
| sub_893100 | 17 KB | ClassifyInstruction -- opcode/operand analysis | HIGH |
| sub_894290 | 27 KB | BuildOperandDependencies -- operand-level edges | HIGH |
| sub_89C550 | 14 KB | InnerScheduleLoop -- inner scheduling iteration | HIGH |
| sub_89EFC0 | 16 KB | ReadyListManager -- BST management | HIGH |
| sub_8A9D80 | 21 KB | DepthFirstSchedule | MEDIUM |
| sub_8AB750 | 9.8 KB | CriticalPathCompute | MEDIUM |
| sub_8B1190 | 16 KB | ScheduleWithBacktrack | MEDIUM |
| sub_8B2D90 | 18 KB | GlobalScheduleOpt -- cross-BB scheduling | MEDIUM |
| sub_8B4590 | 13 KB | PermuteSchedule -- permutation search | MEDIUM |
| sub_8B5400 | 14 KB | ScheduleForLatency | MEDIUM |
| sub_8B6D60 | 12 KB | ScheduleForPressure | MEDIUM |
| sub_8B77C0 | 15 KB | DualIssueScheduler | MEDIUM |
| sub_8B8900 | 12 KB | TensorScheduler | MEDIUM |
| sub_8B9390 | 23 KB | SoftwarePipeline | MEDIUM |
| sub_8BAAE0 | 15 KB | LoopScheduler | MEDIUM |
| sub_8BB9C0 | 8.2 KB | PrefetchScheduler | MEDIUM |
| sub_8BC0B0 | 6.1 KB | BarrierCoalescence | MEDIUM |
| sub_8BC990 | 7.6 KB | ScoreboardOpt | MEDIUM |
| sub_8BCFA0 | 6.8 KB | WarpScheduleOpt | MEDIUM |
| sub_8BDC40 | 7.9 KB | DualIssuePairing | MEDIUM |
| sub_8BE320 | 25 KB | ComplexSchedulePass | MEDIUM |
| sub_8C67A0 | 3.7 KB | ComputeResourceCost -- per-instruction FU cost | HIGH |
| sub_8C7120 | -- | BarrierTrackingUpdate | MEDIUM |
| sub_8C7290 | 5.1 KB | GetResourceVector -- SSE-optimized copy | HIGH |
| sub_8C7720 | 20 KB | ReorderInstructions -- red-black tree | HIGH |
| sub_8C9320 | 47 KB | ComputePriority -- 8-bit packed heuristic | HIGH |
| sub_8CBAD0 | 2.9 KB | PreScheduleSetup -- BB scan, 4095-instr limit | HIGH |
| sub_8CCF80 | 2.3 KB | IsLongLatencyOp -- latency > 19 check | HIGH |
| sub_8CD160 | 9.3 KB | ScheduleBasicBlock -- per-BB ordering loop | HIGH |
| sub_8CF880 | 28 KB | BuildDependencyGraph -- pre-scan stage 1 | HIGH |
| sub_8D0640 | 22 KB | ScheduleInstructions -- top-level orchestrator | HIGH |
| sub_8D1730 | 19 KB | ExecuteSchedulePass | HIGH |
| sub_8D2510 | 3.6 KB | UpdateDependencies -- post-schedule dep update | HIGH |
| sub_8D3150 | 2.0 KB | CheckResourceConflict | MEDIUM |
| sub_8D32D0 | 14 KB | ScheduleInstruction -- schedule single instruction | HIGH |
| sub_8D3D60 | 1.4 KB | InsertStall | HIGH |
| sub_8D3E20 | 2.1 KB | ComputeStallCycles | HIGH |
| sub_8D4000 | 3.0 KB | InsertBarrier | HIGH |
| sub_8D5E00 | 38 KB | MainSchedulingLoop -- workhorse | HIGH |
| sub_8D7760 | 41 KB | StallAndBarrierInsertion -- control word generation | HIGH |
| sub_8D9930 | 19 KB | BuildDependencyEdges -- RAW/WAR/WAW/memory/barrier | HIGH |
| sub_925510 | 341 bytes | MoveInstruction -- doubly-linked list relink | HIGH |
| sub_A08A00 | -- | ResourceModel -- FU cost query (3 modes) | HIGH |
| sub_A091C0 | -- | InitResourceTracking | MEDIUM |
| sub_A09530 | 365 bytes | UpdateStallCycles -- per-instruction latency update | HIGH |
| sub_A09D40 | -- | UpdateWARTracking -- anti-dependency tracking | MEDIUM |
| sub_A9DDD0 | 11.5 KB | HandleLargeFunction -- chunk-based scheduling | MEDIUM |
Per-SM Scheduling Backends
Everything documented above describes the main scheduler (Backend A), which covers approximately 436 KB at 0x680000--0x8FE000. ptxas contains two additional complete scheduling implementations activated for newer SM architectures. The three backends coexist in the binary; SM-version-gated dispatch selects which combination runs.
Architecture Dispatch
The function sub_7DDB50 reads an SM architecture index from context+2104 and returns it as an integer. Four dispatch stubs in the 0xC5FE00--0xC61000 range use this value to select the scheduling backend:
| Dispatch Stub | Condition | Backend Selected | Pipeline Stage |
|---|---|---|---|
| sub_C5FEF0 | SmVersion > 1 | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
| sub_C60910 | SmVersion > 1 && (context+1392 & 1) | Backend B (SM89/90 Codec) | Codec/ISel scheduling |
| sub_C5FFC0 | SmVersion > 1 | Backend C (RBT List), mode 1 | Pre-scheduling |
| sub_C5FFF0 | SmVersion > 1 | Backend C (RBT List), mode 0 | Post-scheduling |
When SmVersion <= 1 (sm_50 through sm_75 -- Maxwell through Volta), control falls through to the main Backend A documented in the preceding sections. When SmVersion >= 2 (sm_80+ -- Ampere, Ada Lovelace, Hopper, Blackwell), Backends B and C replace Backend A entirely.
sub_C60910 has a secondary activation path: if *(options + 23544) == 1 && *(options + 23552) != 0, Backend B activates regardless of SM version, providing a knob override for testing the codec scheduler on older architectures.
Backends B and C are complementary, not competing. Backend C handles pre-scheduling and post-scheduling (the same pipeline stages as Backend A's 3-phase ReduceReg/ILP/DynBatch), while Backend B handles a separate codec/ISel scheduling step that has no equivalent in the legacy path.
Backend B -- SM89/90 Codec Scheduler (0x1225000)
Backend B is a forward-then-backward scheduling pass with continuous floating-point priority weighting. It replaces Backend A's discrete 8-bit packed heuristic with a configurable pressure/ILP tradeoff expressed as doubles.
| Entry | sub_1233D70 (1,527 B, 321 lines) -- pass phase 5 |
| Forward scheduler | sub_122AD60 (17.5 KB, 4,118 lines) -- largest function in range |
| Backward scheduler | sub_122F650 (18.2 KB, 3,917 lines) |
| Preparation | sub_123E0D0 -- instruction characterization |
| Post-fixup | sub_A112C0 -- scheduling result finalization |
| Priority structure | BST with FNV-1a hash tracking |
| Code region | 0x1225000--0x1240000 (132 functions, 111 KB) |
Float Weighting System
The entry point sub_1233D70 initializes two pairs of floating-point weights from the options object at *(context+1664) + 72:
Pair 1 -- Pressure/ILP tradeoff (options offsets 7200/7208):
| Weight | Default | Meaning |
|---|---|---|
| pressure_weight | 1.8 | Contribution of register pressure to scheduling priority. Positive = favors orderings that reduce live register count. |
| ilp_weight | -0.8 | Contribution of instruction-level parallelism. Negative = penalizes moves that reduce available parallelism. |
The two weights sum to 1.0 and form a weighted combination on a unit scale. The default 1.8/-0.8 split heavily favors register pressure reduction, accepting moderate ILP degradation -- appropriate for register-hungry Ada Lovelace and Hopper kernels.
Pair 2 -- Secondary scoring axis (options offsets 7560/7568):
| Weight | Default | Meaning |
|---|---|---|
| forward_weight | 3.2 | Forward-looking scheduling contribution |
| backward_penalty | -2.2 | Backward-looking penalty factor |
Both pairs are overridable. When the configuration byte at the respective offset equals 3, the weight is read from the adjacent double field and the complement is computed as 1.0 - weight:
if (*(BYTE*)(options + 7200) == 3):
pressure_weight = *(double*)(options + 7208)
ilp_weight = 1.0 - pressure_weight
After loading, the second weight of each pair is divided by the register range (float)(max_regs - min_regs), producing per-register slopes:
range = (float)(max_regs) - (float)(min_regs)
pressure_slope = ilp_weight / range
secondary_slope = backward_penalty / range
This normalization ensures the scheduling heuristic scales consistently regardless of the target architecture's register file size.
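A runnable restatement of the override and normalization rules described above (field offsets replaced by plain parameters; only the arithmetic is taken from the text):

```python
def load_weight_pair(cfg_byte, cfg_value, default_first):
    """When the configuration byte is 3, the first weight comes from the
    adjacent double and the second is its complement to 1.0."""
    w1 = cfg_value if cfg_byte == 3 else default_first
    return w1, 1.0 - w1

def per_register_slope(second_weight, max_regs, min_regs):
    """Divide the pair's second weight by the register range."""
    return second_weight / float(max_regs - min_regs)

# Defaults from the text: pressure_weight 1.8 -> ilp_weight -0.8.
pressure_weight, ilp_weight = load_weight_pair(0, 0.0, 1.8)
pressure_slope = per_register_slope(ilp_weight, 255, 15)
```

With a hypothetical 240-register range this yields a slope of -0.8/240, i.e. about -0.0033 priority units per live register.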
Priority Score Formula
Both the forward and backward passes evaluate candidate instructions using the same weighted quadratic score. Given a candidate whose scheduling would result in register pressure P (an integer from popcount of the live-register bitvector):
score(P) = P * (pressure_slope * (P - min_regs) + pressure_weight)
where:
- P = live register count after tentatively scheduling the candidate (int, converted to double)
- min_regs = GetMinRegs() from the target profile (vtable+720), stored as (double)min_regs
- pressure_weight = Pair 1 first weight (default 1.8)
- pressure_slope = ilp_weight / range (default -0.8 / range)
- range = (double)max_regs - (double)min_regs, where max_regs comes from vtable+768
Expanding the inner expression:
score(P) = P * ( ilp_weight * (P - min_regs) / range + pressure_weight )
With defaults (1.8 / -0.8, range = max_regs - min_regs):
score(P) = P * ( -0.8 * (P - min_regs) / range + 1.8 )
The backward scheduler uses the secondary pair identically for a second acceptance gate:
score2(P) = floor( P * (secondary_slope * (P - min_regs) + forward_weight) )
= floor( P * (backward_penalty / range * (P - min_regs) + forward_weight) )
With defaults (3.2 / -2.2):
score2(P) = floor( P * ( -2.2 * (P - min_regs) / range + 3.2 ) )
Both scores serve as acceptance thresholds during list scheduling. After computing score(P), the scheduler evaluates depth_cost = popcount_sum(live_bv_tree) via sub_1229BD0 (BST walk accumulating __popcnt across bitvector nodes). A candidate is accepted if:
Pair 1: P + depth_cost > score(P) // forward pass (sub_122AD60)
Pair 2: P + depth_cost + accum_count > score2(P) // backward pass (sub_122F650)
The guard condition P > min_regs short-circuits evaluation: when pressure is at or below the minimum, the instruction is accepted unconditionally. This makes the score a pressure penalty that only activates when the register file is under contention. Binary evidence: the score computation appears at 0x123320f (Pair 1) and 0x1233a4b (Pair 2) in sub_122F650, as the sequence cvtsi2sd / subsd [r13+off] / mulsd [r13+off] / addsd [r13+off] / mulsd, followed by call _floor for Pair 2 only.
Forward Pass (sub_122AD60)
The forward scheduler implements list scheduling with a BST priority queue, iterating basic blocks front-to-back. It uses FNV-1a hash tables (seed 0x811C9DC5, multiplier 16777619) for tracking scheduled instruction mappings. Instruction properties are queried via sub_7DF3A0. The function manages a ref-counted working set with proper cleanup at function exit. The Pair 1 score is evaluated inline via the stored pressure_weight, min_regs, and pressure_slope fields in the scheduling context at offsets +112, +104, and +120 respectively. At 4,118 decompiled lines, it is the largest function in the 0x1225000 scheduling range.
Backward Pass (sub_122F650)
The backward scheduler receives range and secondary_slope as direct double parameters (a2, a3) and processes basic blocks in reverse order. It evaluates both Pair 1 and Pair 2 scores, using the scheduling context fields at offsets +136/+144/+152 for the secondary pair. It calls into the barrier/scoreboard system (sub_BDC080, sub_BDBA60, sub_BDC0A0) and performs register liveness analysis via sub_A0EDE0. The function uses BST operations with left/right/parent pointer traversal and explicit rebalancing, then performs iterative tree cleanup at exit.
Backend C -- RBT List Scheduler (0x18CD000)
Backend C is a complete reimplementation of the list scheduling algorithm using a red-black tree priority queue, double-precision scoring, and an evaluate-then-commit model with hash-table solution caching. It replaces Backend A for all sm_80+ targets.
| Orchestrator | sub_1908D90 -- pre/post mode dispatch |
| Driver | sub_1906090 -- per-block scheduling loop |
| Core scheduler | sub_1902B70 (19 KB) -- RBT-based list scheduling |
| Solution evaluator | sub_1904B70 (26 KB) -- constraint check + commit |
| Constraint validator | sub_19043F0 (10 KB) -- feasibility testing |
| Pressure cost model | sub_18F3CB0 (16 KB) -- SIMD register pressure |
| Recursive cost propagation | sub_18FFD70 (23 KB) -- call-graph-aware scoring |
| Dependency update | sub_1902100 (15 KB) -- post-scheduling DAG update |
| RBT insert | sub_18FD370 -- balanced insertion with 3-key comparison |
| RBT extract | sub_18FCDA0 -- pop highest-priority node |
| RBT reset | sub_18F7EC0 -- tree cleanup |
| Score computation | sub_18FDAF0 -- double-precision weighted score |
| Hash table | sub_1906510 (14 KB) -- FNV-1a instruction ID lookup |
| Code region | 0x18CD000--0x190FFFF (392 functions, 275 KB) |
Red-Black Tree Priority Queue
The critical difference from Backend A is the priority queue data structure. Backend A uses a sorted singly-linked list (O(N) insertion per instruction). Backend C uses a red-black tree that maintains balance through rotation operations in sub_18FD170 (called at the end of every insertion).
Each RBT node is 40 bytes allocated from a pool, with node+24 pointing to the instruction's scheduling entry. The tree ordering uses a three-key comparison in sub_18FD370:
- Priority integer at scheduling_entry + 384 (descending -- higher-priority nodes are left children)
- Latency double at scheduling_entry + 368 (descending -- higher latency scheduled first among equal-priority instructions)
- Instruction ID at *(scheduling_entry + 16) + 12 (ascending -- deterministic tiebreaker)
This three-key comparison provides O(log N) insertion and extraction, a significant improvement for basic blocks with hundreds of instructions where Backend A's O(N) sorted insertion becomes a bottleneck.
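The three-key ordering is easy to model with a composite sort key that negates the descending components (a sketch of the comparison semantics only, not the red-black tree mechanics):

```python
def rbt_key(entry):
    """entry = (priority, latency, instr_id); priority and latency sort
    descending, instruction ID ascending as the deterministic tiebreaker."""
    priority, latency, instr_id = entry
    return (-priority, -latency, instr_id)

ready = [(5, 2.0, 30), (5, 4.0, 10), (7, 1.0, 20), (5, 4.0, 5)]
ready.sort(key=rbt_key)
# ready[0] is now the highest-priority candidate: (7, 1.0, 20)
```

Among the three equal-priority entries, the two with latency 4.0 come first, ordered by instruction ID.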
RBT Score Computation (sub_18FDAF0)
sub_18FDAF0 produces the double at scheduling_entry+368 (key 2 in the RBT comparison). It accumulates three components: per-operand pressure cost, an excess live-range penalty, and dependency-depth weighting.
function ComputeRBTScore(this): // sub_18FDAF0
ctx = *(this + 0)
latency_wt = *(double*)(ctx + 472) // per-context latency weight
// -- Phase 1: pressure cost (stored at this+376) --
pressure_cost = 0.0
for each node in linked_list(this + 32): // operand entries
operand = *(node + 16)
base = PressureCost(ctx, operand, latency_wt) // sub_18FD7A0
scale = base
last_dep = FindLastRegDep(operand) // walk operand+84, skip pred/imm
if last_dep is negative-encoded register ref:
scale = base * *(double*)(ctx + 448) // damping weight
vtable = *(*(ctx + 8) + 1784)
if (**vtable)(vtable, operand + 84, last_dep): // arch-specific eligibility
scale = LatencyScale(ctx, sched_class_entry) * scale // sub_18F5460
pressure_cost += scale
// -- Phase 2: excess live-range penalty --
mode = *(int*)(this + 200)
if mode != 2:
live_count = 0
for each node in linked_list(this + 160):
e = *(node + 16)
if *(int*)(e + 20) > 1 and (*(byte*)(e + 48) & 0x40) == 0:
live_count++
excess = live_count - *(int*)(this + 264) + 2
if excess > 0 and excess / (mode - 2) > *(double*)(ctx + 464):
pressure_cost *= *(double*)(ctx + 456) // penalty multiplier
// -- Phase 3: dependency-depth weighting --
dep_score = 0.0; max_conflict_lat = 1.0
if *(int*)(this + 288) != 0: // has dependency hash map
for each dep_entry in HashMapIter(this + 280): // sub_18F4D30
lat = LatencyScale(ctx, class_of(dep_entry))
// Binary search sorted 12-byte conflict array at *(this+8)+224
// (uses 0xAAAAAAAAAAAAAAAB multiply for div-by-3 index computation)
if BinarySearchConflict(array, *(int*)(dep_entry + 52)):
max_conflict_lat = fmax(max_conflict_lat, lat)
dep_score += lat * pressure_cost
if *(byte*)(this + 364): // long-latency flag
dep_score += max_conflict_lat * 10.0
*(double*)(this + 368) = dep_score // final RBT score (key 2)
return dep_score
PressureCost (sub_18FD7A0) counts live non-predicate register operands and returns count * *(double*)(ctx+480), plus opcode-specific bonuses: 88/89 (TEX/TLD) +5.0, 44/45 (LD/ST) +0.5/+1.0, 86 +5.0, 195 (LDG) +6.0, 288 (ATOMS) +0.4*latency_weight, 130/137 +1.0 if no cross-dep. LatencyScale (sub_18F5460) queries the vtable at ctx+8->+1784 for a normalized latency ratio of the scheduling class.
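Treating those opcode bonuses as given, the per-operand cost can be restated as a small function (constants transcribed from the description; the base and latency weights stand in for the context fields at +480 and +472):

```python
def pressure_cost(live_reg_operands, opcode, base_weight, latency_weight,
                  has_cross_dep=False):
    """Live non-predicate register operand count scaled by the context
    weight, plus the opcode-specific bonuses described above."""
    cost = live_reg_operands * base_weight
    flat_bonus = {88: 5.0, 89: 5.0,   # TEX/TLD
                  44: 0.5, 45: 1.0,   # LD/ST
                  86: 5.0, 195: 6.0}  # LDG
    cost += flat_bonus.get(opcode, 0.0)
    if opcode == 288:                 # ATOMS: latency-scaled bonus
        cost += 0.4 * latency_weight
    if opcode in (130, 137) and not has_cross_dep:
        cost += 1.0
    return cost
```

The dominant term for most instructions is the operand count; only memory and texture opcodes carry large flat bonuses.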
Core Scheduling Loop (sub_1902B70)
function RBTListSchedule(context, block, dep_info, bound, constraint):
InitRegisterPressure(context, block) // sub_18F8580
InitRBTree(tree) // sub_18F7EC0
for each instruction in block.instruction_list:
node = AllocPoolNode(40 bytes)
node.scheduling_entry = instruction
RBTreeInsert(tree, node) // sub_18FD370
ResizeScheduleArray(block, tree.count) // sub_18F9CC0
while tree is not empty:
best_node = RBTreeExtractMax(tree) // sub_18FCDA0
ReturnNodeToPool(best_node)
instruction = best_node.scheduling_entry
valid = vtable_check(context, block, instruction)
*(instruction + 365) = valid
UpdateDependencies(context, instruction, tree) // sub_1902100
if not valid:
InsertRejection(block + 112, instruction_id)
continue
// Record scheduling decision
position = ++(block + 360)
entry = *(block + 352) + 24 * position
entry[0] = instruction_id
entry[1] = instruction + 2 // scheduling state
entry[2] = instruction // back-pointer
// Compute and accumulate scores
latency = LookupLatency(context, instruction) // sub_18F5460
*(block + 96) += (priority - 2) * latency
*(block + 88) += latency * *(instruction + 376)
// Process successors, check conflicts (binary search on
// sorted 12-byte conflict array with 0xAAAAAAAAAAAAAAAB
// division-by-3 trick for index computation)
Evaluate-Then-Commit Model
Backend A uses a greedy approach: each scheduling decision is final. Backend C introduces a two-phase model where sub_1904B70 evaluates a proposed schedule against constraints before committing it:
- Build a candidate schedule (408-byte evaluation nodes with def/use/pred-deps/succ-deps lists)
- Validate via sub_19043F0 (checks scheduling mode at +64 must be 5 or 6)
- Run architecture-specific check via vtable at *context + 16
- Verify register pressure via sub_19016E0
- Compute score via sub_18FDAF0 (returns double)
- Apply threshold gate: if the score meets the threshold, cache in the solution hash table at block + 304; otherwise reject into the block's rejection set at instr + 112
This allows Backend C to explore multiple scheduling alternatives and commit only the best-scoring solution, a capability Backend A's greedy model lacks.
Threshold Gate (context+360, context+368)
The threshold mechanism controls which evaluated solutions proceed to hash table caching versus immediate rejection. Two fields on the scheduling context govern the gate:
| Offset | Type | Initial | Meaning |
|---|---|---|---|
| context+360 | double | 0.0 | best_threshold -- score ceiling for acceptance |
| context+368 | byte | 0 | threshold_active -- gate enable flag |
sub_18FEE60 (context constructor) initializes both to zero. The gate logic in sub_1904B70 is:
function EvaluateAndCache(context, block, ...): // sub_1904B70, line 820+
candidate = *(block + 168) // current evaluation node
sched_entry = candidate + 16 // scheduling entry within node
// --- 4-step validation ---
if not ConstraintCheck(context, sched_entry, bound): return // sub_19043F0
if not VtableArchCheck(context, sched_entry): return // *context+16
if not PressureCheck(context, sched_entry): return // sub_19016E0
score = ComputeRBTScore(sched_entry) // sub_18FDAF0 -> double
// --- threshold gate ---
if !context.threshold_active OR score > context.best_threshold:
// REJECT: score too high (lower is better) or gate disabled
InsertRejection(candidate.instr + 112, instr_id) // sub_122F1E0
return
// ACCEPT: score <= threshold, insert into solution hash table
SolutionHashInsert(block, candidate, sched_entry)
At initialization, threshold_active = 0, so the !threshold_active term evaluates true, the OR short-circuits, and all candidates are rejected. The threshold is activated externally (via vtable callback or iterative refinement pass) by setting context+368 = 1 and writing a ceiling score into context+360. Once active, only solutions with score <= best_threshold pass through to the hash table. Lower scores are better -- the threshold acts as a ceiling that tightens as the scheduler discovers improving solutions.
Solution Hash Table (block+304)
Each per-block evaluation record (368 bytes, iterated at stride 368 by the driver sub_1906090) contains a chained hash table for caching accepted solutions. The hash table layout occupies offsets +304 through +336:
| Offset | Type | Content |
|---|---|---|
| block+304 | ptr | Free-list head for 32-byte hash chain nodes |
| block+312 | int32 | total_entries -- total distinct keys inserted |
| block+316 | int32 | total_chain_length -- sum of per-bucket chain lengths |
| block+320 | ptr | buckets -- pointer to array of 24-byte bucket headers |
| block+328 | int64 | capacity -- number of bucket slots (power of 2) |
Each 24-byte bucket header contains: {chain_head (ptr), chain_tail (ptr), chain_count (int32)}. Chain nodes are 32 bytes: {next (ptr), key (int32), value (ptr), hash (int32)}.
The hash function is FNV-1a on the 4-byte instruction scheduling class ID (at sched_entry+8):
function FNV1a_32(key): // inline in sub_1904B70
h = 0x811C9DC5 // FNV offset basis
for byte in key[0..3]:
h = (h ^ byte) * 16777619 // 0x01000193 = FNV prime
return h
function SolutionHashInsert(block, candidate, sched_entry):
key = *(int32*)(sched_entry + 8) // scheduling class ID
hash = FNV1a_32(key)
buckets = *(block + 320)
capacity = *(block + 328)
if !buckets:
InitHashTable(block + 304, initial_size=8) // sub_18FBD20
buckets = *(block + 320)
capacity = *(block + 328)
slot = hash & (capacity - 1)
bucket = buckets + 24 * slot
node = bucket.chain_head
// --- probe chain for existing key ---
while node:
if *(int32*)(node + 8) == key:
// HIT: update existing entry's scheduling pointer
*(node + 16) = sched_entry
return
node = *node
// --- MISS: allocate and insert new 32-byte node ---
node = FreeListPop(block + 304) // or arena alloc 32 bytes
node.next = 0
*(int32*)(node + 8) = key
*(ptr*)(node + 16) = sched_entry
*(int32*)(node + 24) = hash
PrependToBucket(bucket, node)
// --- update statistics ---
bucket.chain_count++
*(block + 316) += bucket.chain_count // total_chain_length
*(block + 312)++ // total_entries
// --- rehash if load exceeds threshold ---
if total_chain_length > total_entries AND total_entries > capacity / 2:
Resize(block + 304, 4 * capacity) // sub_18FBD20
The separate function sub_1906510 (14 KB) provides the same hash-table-plus-BST lookup interface used by the cross-block scheduling path. It combines the FNV-1a hash chain with a red-black tree at *(context_base + 312) that orders solutions by a bitvector comparison (sub_19061B0), enabling efficient retrieval of the best cached solution across multiple evaluation blocks.
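The FNV-1a parameters quoted above match the standard published constants, so the hash is straightforward to reproduce (byte order of the 4-byte class ID is an assumption -- little-endian, as on x86):

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: offset basis 0x811C9DC5, prime 16777619 (0x01000193)."""
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF  # keep to 32 bits
    return h

def hash_class_id(class_id: int) -> int:
    # The evaluator hashes the 4-byte scheduling class ID at sched_entry+8.
    return fnv1a_32(class_id.to_bytes(4, "little"))
```

Because capacity is a power of two, the bucket index reduces to hash & (capacity - 1), as in the SolutionHashInsert pseudocode.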
Recursive Cost Propagation (sub_18FFD70)
Backend C uniquely supports cross-function scheduling awareness through recursive cost propagation. sub_18FFD70 walks the call graph:
- For a given schedule entry, iterate predecessor blocks (linked list at +12/+13)
- Look up each predecessor in the scheduling map via sub_18F4E70
- If the predecessor is live (byte at +365), recursively process it
- After recursion, scan instruction operand lists (offsets 80, 84), identifying register operands by the type tag (operand >> 28) & 7 == 1
- Clear register usage bits at reg_entry + 28
- Update double-precision scores at offsets 88 and 96
This propagation allows scheduling decisions in callee functions to influence caller scheduling priorities -- a form of interprocedural scheduling absent from both Backend A and Backend B.
Backend Comparison
| Feature | Backend A (Main) | Backend B (SM89/90 Codec) | Backend C (RBT List) |
|---|---|---|---|
| Address range | 0x680000--0x8FE000 | 0x1225000--0x1240000 | 0x18CD000--0x190FFFF |
| Code size | ~436 KB | ~111 KB | ~275 KB |
| SM gate | SmVersion <= 1 | SmVersion >= 2 | SmVersion >= 2 |
| Pipeline stage | Pre + post scheduling | Codec/ISel scheduling | Pre + post scheduling |
| Priority encoding | 8-bit packed integer | Float-weighted BST | RB-tree (int + double + ID) |
| Priority function size | 47 KB monolithic | Distributed across weights | 3-key comparison |
| Ready list structure | Sorted singly-linked list | Binary search tree | Red-black tree |
| Insertion complexity | O(N) per instruction | O(log N) | O(log N) |
| Scheduling passes | 3 (ReduceReg / ILP / DynBatch) | 2 (Forward / Backward) | 2 (Pre / Post) |
| Pressure tracking | Bitvector + popcount | Float slope per register | SIMD bitmap + cost model |
| Weight configuration | Knobs 769--805 (integer) | Options 7200/7560 (double) | Vtable dispatch |
| Score type | Integer (packed bits) | Double (weighted sum) | Double (accumulated) |
| Solution search | Greedy (single pass) | Forward + backward | Evaluate + commit |
| Cross-function awareness | None | None | Recursive cost propagation |
| Hash infrastructure | None | FNV-1a | FNV-1a |
| Backtracking | Optional (sub_8B1190) | None | Rejection set + retry |
Backend B + C Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
| sub_1233D70 | 1.5 KB | SM89/90 CodecScheduleEntry -- pass phase 5, float weight init | HIGH |
| sub_122AD60 | 17.5 KB | ForwardCodecScheduler -- BST list scheduling, FNV-1a hash tracking | HIGH |
| sub_122F650 | 18.2 KB | BackwardCodecScheduler -- reverse pass, barrier/scoreboard integration | HIGH |
| sub_123ADD0 | 5.8 KB | CodecDependencyGraphBuilder -- dispatched via vtable | MEDIUM |
| sub_12371D0 | 3.8 KB | CodecInstructionClassifier -- convergence-based property testing | MEDIUM |
| sub_123E0D0 | -- | CodecSchedulePreparation -- instruction characterization | MEDIUM |
| sub_A112C0 | -- | CodecSchedulePostFixup -- result finalization | MEDIUM |
| sub_1908D90 | -- | RBTScheduleOrchestrator -- pre/post mode dispatch | HIGH |
| sub_1906090 | -- | RBTScheduleDriver -- per-block loop, 368-byte block stride | HIGH |
| sub_1902B70 | 19 KB | RBTCoreListScheduler -- RB-tree priority queue loop | HIGH |
| sub_1904B70 | 26 KB | RBTSolutionEvaluator -- constraint check, score threshold, hash commit | HIGH |
| sub_19043F0 | 10 KB | RBTConstraintValidator -- mode 5/6 feasibility | HIGH |
| sub_19038E0 | 15 KB | RBTInitialEvaluation -- per-block constraint bootstrapping | MEDIUM |
| sub_18F3CB0 | 16 KB | RBTPressureCostModel -- SIMD register pressure computation | HIGH |
| sub_18FFD70 | 23 KB | RBTRecursiveCostPropagation -- call-graph-aware scoring | HIGH |
| sub_1902100 | 15 KB | RBTDependencyUpdate -- post-scheduling DAG maintenance | HIGH |
| sub_18FD370 | -- | RBTreeInsert -- 3-key balanced insertion + fix-up | HIGH |
| sub_18FCDA0 | -- | RBTreeExtractMax -- pop highest-priority node | HIGH |
| sub_18F7EC0 | -- | RBTreeReset -- tree cleanup | HIGH |
| sub_18F8580 | -- | RBTRegisterPressureInit -- pressure state initialization | MEDIUM |
| sub_18F5460 | -- | RBTLatencyLookup -- vtable-dispatched latency query | MEDIUM |
| sub_18FDAF0 | -- | RBTScoreComputation -- double-precision weighted score | HIGH |
| sub_1906510 | 14 KB | RBTHashLookup -- FNV-1a instruction ID hash table | HIGH |
| sub_18FB850 | -- | RBTHashResize -- power-of-2 growth, 0.5 load factor | HIGH |
| sub_1901200 | -- | RBTScorePropagationDriver -- calls sub_18FFD70 | MEDIUM |
| sub_19081F0 | 17 KB | RBTBlockDependencyGraphBuild -- per-block DAG construction | HIGH |
| sub_19072F0 | 14 KB | RBTInterBlockScheduling -- cross-BB register dependency | MEDIUM |
| sub_18FEE60 | -- | RBTScheduleStateCreate -- 528-byte state construction | MEDIUM |
| sub_18FE320 | -- | RBTScheduleDataPrepare -- pre-scheduling data setup | MEDIUM |
| sub_18F94C0 | -- | RBTCleanup -- state teardown | MEDIUM |
| sub_C5FFC0 | -- | DispatchPreSchedule -- SM gate -> Backend C (mode 1) | CERTAIN |
| sub_C5FFF0 | -- | DispatchPostSchedule -- SM gate -> Backend C (mode 0) | CERTAIN |
| sub_C5FEF0 | -- | DispatchCodecSchedule -- SM gate -> Backend B | CERTAIN |
| sub_C60910 | -- | DispatchConditionalCodecSchedule -- SM gate + knob override | CERTAIN |
| sub_7DDB50 | -- | GetSmVersionIndex -- reads context+2104 | HIGH |
Scheduling Guidance Output
After scheduling completes, ptxas can emit statistics comments into the SASS output and DUMPIR stream. Three emitter functions produce scheduling guidance in different contexts, all reading from a shared ~1400-byte statistics object. sub_A46CE0 controls the "SCHEDULING GUIDANCE:" header that wraps per-block scheduling output. sub_A3A7E0 emits per-function statistics as # [field=value] comment lines during DUMPIR. Eight post-regalloc clones at sub_ABBA50--sub_ABEB50 emit a variant with hardware pipe names.
Verbosity Controls
Two independent verbosity mechanisms gate the output:
Scheduling guidance level at *(DWORD*)(vtable + 992):
| Level | Behavior |
|---|---|
| 0 | No scheduling guidance output |
| 1+ | "SCHEDULING GUIDANCE:" header emitted; per-block scheduling dispatched |
| 2+ | Pre-formatting hook called via vtable+816 before header emission |
| 4+ | "LOOP STATIC METRICS : " sub-header appended |
DUMPIR detail bits at context+1416:
| Bit | Mask | Behavior |
|---|---|---|
| 3 | 0x08 | Enable detailed statistics (FP16 vectorization, functional unit breakdown, throughput estimates) |
| 4 | 0x10 | Show worst-case latency: # [worstcaseLat=%f] |
| 5 | 0x20 | Show average-case latency: # [avgcaseLat=%f] |
Bits 4 and 5 are mutually exclusive -- only one latency variant is emitted.
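The bit-4/bit-5 selection can be expressed as a tiny gate (format strings taken from the documented output; the function name and structure are illustrative):

```python
WORSTCASE_BIT = 0x10  # bit 4 of context+1416
AVGCASE_BIT = 0x20    # bit 5 of context+1416

def latency_line(detail_bits, worst, avg):
    """Emit at most one latency comment, mirroring the described
    mutual exclusion of bits 4 and 5."""
    sel = detail_bits & (WORSTCASE_BIT | AVGCASE_BIT)
    if sel == WORSTCASE_BIT:
        return "# [worstcaseLat=%f]" % worst
    if sel == AVGCASE_BIT:
        return "# [avgcaseLat=%f]" % avg
    return None  # neither bit set alone: no latency line
```

Setting both bits also yields no line, consistent with the exact == 0x10 / == 0x20 conditions shown in the format listing.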
Emitter Functions
| Address | Size | Identity | Confidence | Context |
|---|---|---|---|---|
| sub_A3A7E0 | 1,236 B | Statistics::emitFunctionStats | CERTAIN | Pre-regalloc DUMPIR statistics. 20+ format strings at 0x21EBF76--0x21EC3B0. Uses abstract FU names (fp, half, shared, controlFlow, loadStore). |
| sub_A46CE0 | 1,793 B | SchedulingGuidance::buildAndEmit | HIGH | Scheduling guidance header + BB classification. Walks BB array at context+296, dispatches schedulable blocks to vtable+336. |
| sub_A4B8F0 | 248 B | StatsEmitter::emitInstrRegStats | HIGH | Binary-embedded metadata. Writes record type 3 (string) into SASS code object at *(a1+1000) + *(a1+996). |
| sub_ABBA50--sub_ABEB50 | 8 x 1,771 B | PostSchedStats::emit (SM-variant) | CERTAIN | Post-regalloc statistics. 8 clones at 0x700 spacing. Format strings at 0x21FA008--0x21FA400. Uses hardware pipe names (adu, alu, cbu, fma, lsu). |
Pre-Regalloc Output Format (sub_A3A7E0)
Emitted during DUMPIR. All lines prefixed with # . Lines marked [COND] are gated by the stated condition.
# 142 instructions, 24 R-regs
# [inst=142] [texInst=0] [tepid=0] [rregs=24]
# [urregs=8] [COND: SM > 0x5FFF]
# [_lat2inst=0.0]
# [FP16 inst=0] [FP16 VectInst=0] [Percentage Vectorized=0.00] [COND: +1416 bit 3]
# [est latency = 87] [LSpillB=0] [LRefillB=0], [SSpillB=0], [SRefillB=0], [LowLmemSpillSize=0] [FrameLmemSpillSize=0]
# [LNonSpillB=0] [LNonRefillB=0], [NonSpillSize=0]
# [Occupancy = 0.750000], [est numDivergentBranches=2] [attributeMemUsage=0], [programSize=1024]
# [est fp=12] [est half=0], [est trancedental=0], [est ipa=0], [est shared=0], [est controlFlow=8], [est loadStore=24]
# [est tex=0] [est pairs=4]
# [issue thru=0.888889] [fp thru=0.111111] [half thru=0.000000], [trancedental thru=0.000000], [ipa thru=0.000000]
# [shared thru=0.000000] [controlFlow thru=0.062500] [texLoadStore thru=0.187500], [reg thru=0.000000], [warp thru=0.000000]
# [SharedMem Alloc thru=0.125000] [COND: value != 0.0]
# [partially unrolled loops=0] [non-unrolled loops=1]
# [CB-Bound Tex=0] [UR-Bound Tex=0] [Bindless Tex=0] [Partially Bound Tex=0]
# [UDP inst=0] [numVecToURConverts inst=0]
# [maxNumLiveValuesAtSuspend=0]
# [Precise inst=0]
# [worstcaseLat=87.000000] [COND: +1416 bits 4-5 == 0x10]
# [avgcaseLat=52.500000] [COND: +1416 bits 4-5 == 0x20]
# [instHint=142] [instPairs=4] [COND: instPairs != 0]
# <custom annotation> [COND: linked list at stats[55] != NULL]
Key format details: pre-regalloc uses commas between some bracket groups ([SSpillB=%d], [SRefillB=%d],) and abstract functional unit names (fp, half, trancedental, shared, controlFlow, loadStore, texLoadStore). The typo "trancedental" (missing "s") is present in the binary.
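For tooling that consumes this output, the bracket groups can be parsed back into a field dictionary. A minimal sketch (the regex and `parse_stats_comment` helper are our own, not part of ptxas) that tolerates the format's quirks: optional commas between groups, spaces inside field names ("est latency", "FP16 inst"), and the binary's "trancedental" spelling:

```python
import re

# Matches one "[key=value]" or "[key = value]" group; keys may
# contain spaces, values run to the closing bracket.
FIELD_RE = re.compile(r"\[([^\]=]+?)\s*=\s*([^\]]*)\]")

def parse_stats_comment(line):
    """Extract {field: value} pairs from one '# [k=v] ...' line."""
    out = {}
    for key, val in FIELD_RE.findall(line):
        key, val = key.strip(), val.strip()
        try:
            out[key] = int(val)          # most fields are integers
        except ValueError:
            try:
                out[key] = float(val)    # throughput/occupancy fields
            except ValueError:
                out[key] = val           # leave anything else as text
    return out
```

Example: feeding the occupancy line above through this parser yields `Occupancy` as a float and `est numDivergentBranches` as an int, despite the mixed comma placement.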
Post-Regalloc Output Format (sub_ABBA50 clones)
Emitted after scheduling by SM-variant clones dispatched via vtable. Same # prefix. Differs from the pre-regalloc format in three ways:
- No commas between bracket groups
- SpillSize replaces LowLmemSpillSize + FrameLmemSpillSize
- Hardware pipe names replace abstract unit names; MMA variant breakdown added

The unique lines (lines shared with pre-regalloc use the same structure minus commas):
# [est latency = %d] [LSpillB=%d] [LRefillB=%d] [SSpillB=%d] [SRefillB=%d] [SpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d] [NonSpillSize=%d]
# [Occupancy = %f] [est numDivergentBranches=%d] [attributeMemUsage=%d] [programSize=%d]
# [est adu=%d] [est alu=%d] [est cbu=%d] [est fma2x=%d] [est fma=%d] [est half=%d]
# [est trancedental=%d] [est ipa=%d] [est lsu=%d] [est redux=%d]
# [est schedDisp=%d] [est tex=%d] [est ttu=%d] [est udp=%d]
# [est imma16816=%d] [est imma16832=%d] [est immaSp8832=%d] [est immaSp16832=%d]
# [est dmma=%d] [est fma64=%d] [est hmma16816=%d] [est hmma16816f16=%d]
# [est hmma1688=%d] [est hmma1688f16=%d] [est hmmaSp1688=%d] [est hmmaSp1688f16=%d]
# [issue thru=%f] [adu thru=%f] [alu thru=%f] [cbu thru=%f] [fma2x thru=%f] [fma thru=%f]
# [trancedental thru=%f] [ipa thru=%f] [lsu thru=%f] [redux thru=%f]
# [schedDisp thru=%f] [tex thru=%f] [ttu thru=%f] [udp thru=%f]
# [imma16816 thru=%f] [imma16832 thru=%f] [immaSp8832 thru=%f] [immaSp16832 thru=%f]
# [dmma thru=%f] [fma64 thru=%f] [hmma16816 thru=%f] [hmma16816f16 thru=%f]
# [hmma1688 thru=%f] [hmma1688f16 thru=%f] [hmmaSp1688 thru=%f] [hmmaSp1688f16 thru=%f]
# [reg thru=%f] [warp thru=%f]
Hardware Pipe Name Mapping
The post-regalloc format maps abstract functional unit names to hardware execution pipe identifiers:
| Post-Regalloc Pipe | Pre-Regalloc Equivalent | Description |
|---|---|---|
| adu | -- | Address Divergence Unit (address computation) |
| alu | fp | Arithmetic Logic Unit (integer + FP32 combined) |
| cbu | controlFlow | Control/Branch Unit (branch, exit, barrier) |
| fma2x | -- | Double-precision FMA (separate pipe on sm_80+) |
| fma | fp | Fused Multiply-Add (FP32) |
| half | half | FP16 operations |
| lsu | loadStore + shared | Load/Store Unit (unified) |
| redux | -- | Reduction Unit (warp-level reductions) |
| schedDisp | -- | Scheduler Dispatch (internal overhead) |
| tex | tex | Texture Unit |
| ttu | -- | Tensor Texture Unit (Ada Lovelace+) |
| udp | -- | Uniform Data Path operations |
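When diffing pre- and post-regalloc statistics, the table above amounts to a one-to-many lookup. A sketch of that lookup as data (the dictionary names are ours; entries are taken from the table only):

```python
# Abstract (pre-regalloc) name -> hardware pipes (post-regalloc),
# per the mapping table above. Pipes with no abstract counterpart
# appear only in post-regalloc output.
ABSTRACT_TO_PIPES = {
    "fp":          ["alu", "fma"],   # FP32 splits across ALU and FMA pipes
    "controlFlow": ["cbu"],
    "half":        ["half"],
    "loadStore":   ["lsu"],          # lsu unifies loadStore and shared
    "shared":      ["lsu"],
    "tex":         ["tex"],
}

POST_ONLY = {"adu", "fma2x", "redux", "schedDisp", "ttu", "udp"}
```

The one-to-many direction matters when comparing counts: the pre-regalloc fp estimate has no single post-regalloc counterpart, and lsu aggregates two abstract categories.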
Binary-Embedded Statistics Record (sub_A4B8F0)
Separate from the DUMPIR comment output, sub_A4B8F0 writes a compact binary record into the SASS code object during emission:
Format string: "instr/R-regs: %d instructions, %d R-regs"
instructions = stats[335] - stats[341] (total minus removed)
R-regs = stats[159] + stats[102] (extra + base allocation)
Record layout in output buffer:
+0 DWORD type = 3 (string record type)
+4 DWORD string_length
+8 char[] string_content (formatted text)
The companion function sub_A4B9F0 writes record type 2 for undefined register warnings: "Referencing undefined register: %s%d".
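The record layout above can be modeled with a small packing helper. This is a reconstruction, not the binary's code; the field arithmetic and offsets come from the description above, and whether string_length counts a trailing NUL is not confirmed, so this sketch uses the bare text length.

```python
import struct

STRING_RECORD = 3  # record type for formatted-text records

def pack_string_record(total_inst, removed_inst, rreg_extra, rreg_base):
    """Model of the sub_A4B8F0 record: illustrative, little-endian assumed.

    instructions = stats[335] - stats[341]  (total minus removed)
    R-regs       = stats[159] + stats[102]  (extra + base allocation)
    """
    text = "instr/R-regs: %d instructions, %d R-regs" % (
        total_inst - removed_inst, rreg_extra + rreg_base)
    payload = text.encode("ascii")
    # +0 DWORD type, +4 DWORD string_length, +8 string content
    return struct.pack("<II", STRING_RECORD, len(payload)) + payload
```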
Scheduling Guidance Header (sub_A46CE0)
sub_A46CE0 emits the scheduling guidance wrapper, then walks the BB array to classify and dispatch blocks for scheduling. The header is emitted into the output stream via sub_7FE930 (string builder) at context + 1440.
BB classification algorithm:
For each BB in context+296 (index 0 through context+304):
- Schedulable: sub_7544D0(context, bb) returns true AND sub_754510(context, bb) returns false. Dispatched immediately to scheduling via vtable+336.
- Type-8 (deferred): *(bb+16) == 8. Added to a dynamically-grown src array for second-pass processing.
- Loop back-edge: When *(bb+148) != 0 and *(bb+128) != NULL, the function walks the predecessor linked list at bb+128. For each predecessor, it checks whether the predecessor's iteration order (bb+144) exceeds the current block's, and whether the predecessor's terminal instruction is a branch (opcode 0x5D after masking with 0xFFFFCFFD) with a matching program counter at (instr+84) & 0xFFFFFF. If a back-edge is detected, scheduling dispatch includes the back-edge source instruction as a hint parameter.
After the first pass, deferred type-8 blocks are processed in a second loop with the same back-edge detection logic.
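The two-pass structure can be sketched as below. The predicate and dispatch names are stand-ins: `is_schedulable`/`is_excluded` model sub_7544D0/sub_754510, `dispatch` models the vtable+336 call, and `find_backedge_source` abstracts the predecessor walk with its branch-opcode and PC checks.

```python
def emit_guidance(blocks, is_schedulable, is_excluded,
                  dispatch, find_backedge_source):
    """Illustrative model of sub_A46CE0's dispatch loop, not its code."""
    deferred = []
    for bb in blocks:                        # pass 1: walk context+296
        if is_schedulable(bb) and not is_excluded(bb):
            # dispatch may carry a back-edge source instruction as a hint
            dispatch(bb, hint=find_backedge_source(bb))
        elif bb["type"] == 8:                # *(bb+16) == 8
            deferred.append(bb)              # dynamically-grown 'src' array
    for bb in deferred:                      # pass 2: same back-edge logic
        dispatch(bb, hint=find_backedge_source(bb))
```

Deferring type-8 blocks preserves the property that they are scheduled only after every directly schedulable block in the function has been dispatched.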
Statistics Object Field Map
Both emitter families read from the same ~1400-byte statistics object. The object is accessed as a float* array; integer fields use the same DWORD indexing but are read as int32.
| Index | Type | Field | Description |
|---|---|---|---|
| 8 | int32 | est_latency | Estimated schedule length in cycles |
| 9 | float | FP16_vectorization_pct | Fraction of FP16 instructions vectorized |
| 10 | int32 | worstcase_latency | Worst-case latency (cast to float for output) |
| 11 | int32 | avgcase_latency | Average-case latency (cast to float for output) |
| 12 | int32 | LSpillB | Long spill byte count |
| 13 | int32 | LRefillB | Long refill byte count |
| 14 | int32 | SRefillB | Short refill byte count |
| 15 | int32 | SSpillB | Short spill byte count |
| 16 | int32 | LowLmemSpillSize | Local-memory low spill allocation |
| 17 | int32 | FrameLmemSpillSize | Frame local-memory spill allocation |
| 18 | int32 | LNonSpillB | Long non-spill byte count |
| 19 | int32 | LNonRefillB | Long non-refill byte count |
| 20 | int32 | NonSpillSize | Non-spill allocation size |
| 26 | float | Occupancy | Occupancy ratio (0.0--1.0) |
| 27 | int32 | numDivergentBranches | Estimated divergent branch count |
| 28 | int32 | attributeMemUsage | Attribute memory usage in bytes |
| 29 | int32 | programSize | Program binary size in bytes |
| 42 | int32 | preciseInst | Count of precise (non-approximate) instructions |
| 44 | int32 | UDPinst | Uniform data-path instruction count |
| 45 | int32 | vecToURConverts | Vector-to-uniform-register conversion count |
| 49 | int32 | maxLiveAtSuspend | Max live register values at suspend points |
| 50 | float | issue_thru | Overall issue throughput (fraction of peak) |
| 54 | float | fp_thru | FP32 pipe throughput |
| 57 | float | half_thru | FP16 pipe throughput |
| 58 | float | transcendental_thru | Transcendental function pipe throughput |
| 59 | float | ipa_thru | Interpolation pipe throughput |
| 61 | float | shared_thru | Shared memory pipe throughput |
| 62 | float | controlFlow_thru | Control flow pipe throughput |
| 65 | float | texLoadStore_thru | Texture and load/store pipe throughput |
| 84 | float | reg_thru | Register throughput |
| 85 | float | warp_thru | Warp throughput |
| 86 | float | sharedMemAlloc_thru | Shared memory allocation throughput |
| 87 | int32 | partiallyUnrolledLoops | Partially unrolled loop count |
| 88 | int32 | nonUnrolledLoops | Non-unrolled loop count |
| 89 | int32 | CBBoundTex | Constant-bank-bound texture count |
| 90 | int32 | PartiallyBoundTex | Partially bound texture count |
| 91 | int32 | BindlessTex | Bindless texture count |
| 92 | int32 | URBoundTex | Uniform-register-bound texture count |
| 93 | int32 | SM_architecture_enum | SM version discriminator (>0x5FFF enables UR stats) |
| 99 | int32 | uniform_reg_count | Uniform register count |
| 102 | int32 | R_reg_base | Base R-register allocation |
| 159 | int32 | R_reg_extra | Extra R-register allocation |
| 303 | int32 | est_fp | Estimated FP32 instruction count |
| 306 | int32 | est_half | Estimated FP16 instruction count |
| 307 | int32 | est_transcendental | Estimated transcendental instruction count |
| 308 | int32 | est_ipa | Estimated IPA instruction count |
| 310 | int32 | est_shared | Estimated shared memory operation count |
| 311 | int32 | est_controlFlow | Estimated control flow operation count |
| 315 | int32 | est_loadStore | Estimated load/store instruction count |
| 316 | int32 | est_tex | Estimated texture instruction count |
| 334 | int32 | est_pairs | Estimated co-issuable instruction pairs |
| 335 | int32 | total_inst | Total instruction count (before removal) |
| 336 | int32 | texInst | Texture instruction count |
| 337 | int32 | FP16inst | FP16 instruction count |
| 338 | int32 | FP16VectInst | FP16 vectorized instruction count |
| 339 | int32 | instHint | Instruction hint value |
| 340 | int32 | instPairs | Instruction pair count (also output gate) |
| 341 | int32 | removed_inst | Removed instruction count |
| 342 | int32 | tepid_inst | TEPID (texture-pending) instruction count |
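The mixed float/int32 access pattern described above can be modeled with a thin view over the raw object. The `StatsView` class is illustrative; only the indexing scheme (DWORD index times 4, reinterpreted per field type) and the example field indices come from the table.

```python
import struct

class StatsView:
    """Read the ~1400-byte statistics object the way the emitters do:
    one buffer, indexed in DWORDs, reinterpreted per field."""

    def __init__(self, blob):
        self.blob = blob                       # raw bytes of the object

    def f32(self, idx):                        # float field, e.g. Occupancy at 26
        return struct.unpack_from("<f", self.blob, idx * 4)[0]

    def i32(self, idx):                        # int32 field, same DWORD index
        return struct.unpack_from("<i", self.blob, idx * 4)[0]

# e.g. scheduled instruction count = stats[335] - stats[341]
```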
Cross-References
- Scheduler Overview -- 3-phase architecture, register budget, scheduling knobs
- Latency Model -- per-opcode latency tables, functional unit mapping, architecture profiles
- Scoreboards & Barriers -- scoreboard encoding, dependency barrier assignment, stall/yield format
- Register Allocation -- the allocator that the scheduler interacts with
- Phase Manager -- how ScheduleInstructions fits in the 159-phase pipeline
- Knobs -- the 76 scheduling knobs and the knob query infrastructure
- GMMA Pipeline -- GMMA/WGMMA operations targeted by DynBatch
- DUMPIR Configuration -- DUMPIR levels that trigger statistics output
- Spilling -- spill metrics (LSpillB, SSpillB) referenced in guidance output