
Spill Mechanism

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0); other versions will differ.

When the fat-point register allocator cannot fit all simultaneously-live virtual registers into the hardware register budget, it spills excess values to memory and reloads them on demand. The spill subsystem is the second-largest component of the register allocator by code volume, spanning roughly 25 functions and 12,000+ lines of decompiled code. It implements a cost-driven, retry-based spill strategy with two memory targets (local memory and shared memory), a per-class priority queue guidance engine, and a multi-attempt allocation loop that progressively relaxes constraints until allocation succeeds or fails fatally.

| Component | Function | Detail |
|---|---|---|
| Spill trigger | sub_94FDD0 (155 lines) | sets flag 0x40000 when assignment exceeds budget |
| Spill guidance | sub_96D940 (2983 lines) | builds 7 priority queues of spill candidates |
| Spill codegen | sub_94F150 (561 lines) | inserts spill stores and refill loads |
| LMEM setup | sub_939BD0 (65 lines) | local memory slot allocator configuration |
| SMEM allocator | sub_9539C0 (1873 lines) | shared memory spill alternative |
| Retry driver | sub_971A90 (355 lines) | NOSPILL then SPILL retry loop |
| Finalization | sub_9714E0 (290 lines) | commit spills, emit errors on failure |
| SASS codegen | sub_9850F0 (520 lines) | generate LDL/STL instruction sequences |
| Key knobs | -- | 623 (spill mode), 638/639 (retry limits), 684 (interference threshold) |

Spill Trigger

The spill trigger fires inside the per-virtual-register assignment function sub_94FDD0 (155 lines). When the fat-point allocator (sub_957160) selects a physical slot for a virtual register, it calls sub_94FDD0 to commit the assignment. If the chosen slot index equals or exceeds the per-class register budget, the function does not commit -- instead it marks the virtual register for spilling.

function assign_register(alloc, ctx, mode, vreg, regclass_info, slot, cost):
    max_regs = regclass_info.max_regs           // at regclass_info+16

    // Budget check
    if slot >= max_regs and not vreg.has_flag(0x4000):  // not already spilled
        vreg.flags |= 0x40000                    // set "needs-spill" bit
        return                                   // do NOT commit assignment

    // Spill path: flag was set on a previous call
    if vreg.has_flag(0x40000):                   // needs-spill
        setup_spill_allocator(alloc)             // sub_939BD0
        generate_spill_code(alloc, ctx, 1)       // sub_94F150
        return

    // Non-spill path: commit the assignment
    consumption = compute_consumption(vreg)       // sub_939CE0
    update_peak(alloc+1564, consumption)
    update_peak(alloc+1528, consumption)
    vreg.physical_register = slot                 // vreg+68

    // Accumulate spill cost even for successful assignments
    *(double*)(alloc+1568) += *(float*)(vreg+40)  // store weight
    *(float*)(alloc+1576)  += load_weight          // load weight

    apply_preallocated_candidate(alloc, vreg)     // sub_950100

    // Propagate through alias chain
    alias = vreg.alias_parent                     // vreg+36
    while alias != NULL:
        alias.physical_register = slot
        alias = alias.alias_parent

The two flag bits at vreg+48 encode spill state:

| Bit | Mask | Meaning |
|---|---|---|
| 14 | 0x4000 | Already spilled -- prevents the same vreg from being spilled again |
| 18 | 0x40000 | Needs spill -- triggers spill codegen on the next sub_94FDD0 call |

Register consumption (sub_939CE0, 23 lines) accounts for paired registers. For double-width registers (pair mode 3 at vreg+48 bits 20--21), it returns assignment + 1, consuming two physical slots.
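
The flag gating and the paired-register consumption rule compose as follows. This is a minimal sketch with hypothetical function names; only the masks, bit positions, and the assignment + 1 rule come from the analysis above.

```python
# Spill-state bits at vreg+48 and the consumption rule from sub_939CE0.
FLAG_ALREADY_SPILLED = 0x4000    # bit 14
FLAG_NEEDS_SPILL     = 0x40000   # bit 18
PAIR_MODE_SHIFT      = 20        # bits 20--21 of vreg+48

def needs_spill(flags: int, slot: int, max_regs: int) -> bool:
    """Budget check from sub_94FDD0: mark the vreg for spilling instead
    of committing when the chosen slot exceeds the class budget, unless
    it was already spilled once."""
    return slot >= max_regs and not (flags & FLAG_ALREADY_SPILLED)

def consumption(flags: int, assignment: int) -> int:
    """sub_939CE0: double-width registers (pair mode 3) consume two
    physical slots, so the peak tracker sees assignment + 1."""
    pair_mode = (flags >> PAIR_MODE_SHIFT) & 3
    return assignment + 1 if pair_mode == 3 else assignment
```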

Spill Retry Loop

The outer allocation driver sub_971A90 (355 lines) wraps the core allocator in a two-phase strategy: first attempt allocation without spilling, then retry with progressively more aggressive spill guidance.

function alloc_with_spill_retry(alloc, ctx, class_id):
    // Phase 1: NOSPILL
    pre_allocation_pass(alloc)                        // sub_94A020
    per_class_driver(alloc, ctx, class_id)            // sub_95DC10
    result = fatpoint_allocate(alloc, ctx, attempt=0) // sub_957160
    record_best_result(&best, class_id, 0, result)    // sub_93D070

    if alloc.target >= adjusted_result:
        goto finalize                                 // success

    // Phase 2: SPILL retry
    max_attempts = query_knob(638)     // default varies
    if knob_639_set:
        max_attempts = query_knob(639) // override

    for attempt in 0..max_attempts:
        reset_alloc_state(alloc, ctx, attempt)        // sub_93FBE0
        if attempt == 0:
            build_interference_masks(alloc, class_id) // sub_956130
        result = fatpoint_allocate(alloc, ctx, attempt)

        debug("-CLASS NOSPILL REGALLOC: attemp %d, used %d, target %d",
              attempt, result, alloc.target)

        record_best_result(&best, class_id, attempt, result)
        if alloc.target >= adjusted_result:
            break

    if all_attempts_failed and no_spill_recorded:
        result = final_fallback(&best, result)        // sub_936FD0

    // Finalize
    status = finalize_allocation(alloc, result, class_id, &best)  // sub_9714E0
    if HIBYTE(status):
        clear_all_assignments_to_minus_one()          // allocation failed
    else:
        commit_results()

The debug string "-CLASS NOSPILL REGALLOC: attemp " (note the typo -- present in the binary) is printed for every attempt.
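
The two-phase control flow and the 638/639 knob interplay can be sketched as a standalone driver. The default attempt count and the `allocate` callback shape are illustrative assumptions, not values recovered from the binary.

```python
def max_spill_attempts(knobs: dict) -> int:
    """Retry budget from sub_971A90: knob 638 supplies the attempt
    count; knob 639, when set, overrides it."""
    attempts = knobs.get(638, 4)   # default value is illustrative only
    if 639 in knobs:
        attempts = knobs[639]
    return attempts

def alloc_with_retry(allocate, target: int, knobs: dict):
    """Two-phase driver: one NOSPILL attempt, then spill retries until
    the register usage fits the target or the budget is exhausted.
    Returns (result, attempts_used); -1 means the caller must fall
    back to the best recorded result."""
    result = allocate(attempt=0, spill=False)      # Phase 1: NOSPILL
    if result <= target:
        return result, 0
    for attempt in range(max_spill_attempts(knobs)):  # Phase 2: SPILL
        result = allocate(attempt=attempt, spill=True)
        if result <= target:
            return result, attempt + 1
    return result, -1
```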

For SMEM spilling (modes 3/6 when ctx+896 == 5), the driver activates spill setup before entering the retry loop:

if (class_id == 3 or class_id == 6) and device_type == 5:
    if vreg_count > 0:
        setup_spill_allocator(alloc)      // sub_939BD0
        generate_spill_code(alloc, ctx, 1) // sub_94F150
    alloc.spill_done_flag = 1             // alloc+865

Spill Guidance Engine

sub_96D940 (2983 lines, 84 KB decompiled) computes which registers should be spilled and in what order. It is the largest spill-related function and one of the largest in the entire allocator.

Structure

The function contains 7 near-identical code blocks, one per register class (R, P, B, UR, UP, UB, tensor/accumulator). Each block is approximately 400 lines of bitvector iteration and set intersection. This repetition strongly suggests C++ template instantiation or macro expansion in the original source.

Spill Guidance Structure (11,112 bytes)

The guidance engine allocates a single 11,112-byte working structure from the arena (vtable +24). The structure is organized into five regions.

Region 0 -- Header and core pointers (bytes 0--271)

| Byte offset | QWORD idx | Type | Init | Field |
|---|---|---|---|---|
| 0 | [0] | ptr | ctx | Back-pointer to allocation context |
| 24 | [3] | ptr | alloc+16 | Pointer into allocator state object |
| 32 | [4] | QWORD | 0 | Run counter / iteration state |
| 40 | [5] | QWORD | 0 | Class processing counter |
| 48 | [6] | QWORD | 0 | Spill mode flags |
| 96 | [12] | ptr | arena | Arena allocator pointer (from ctx+16) |
| 104 | [13] | ptr | arena | Queue header base / candidate list base |
| 112 | [14] | DWORD+DWORD | -1, 0 | Max element index sentinel, entry count |
| 136 | [17] | ptr | arena | Third arena reference |
| 144 | [18] | QWORD | 0 | Queue 0 entry count |
| 152 | [19] | QWORD | -1 | Queue 0 sentinel |
| 160 | [20] | ptr | arena | Fourth arena reference |
| 168 | [21] | QWORD | 0 | Queue 0b entry count |
| 176 | [22] | QWORD | -1 | Queue 0b sentinel |
| 184 | [23] | ptr | ctx | Back-pointer to context |
| 192 | [24] | ptr | node | Candidate node list head (24-byte arena node) |
| 200 | [25] | ptr | node | Candidate node list tail |
| 208 | [26] | QWORD | 0 | Node count |
| 216 | [27] | QWORD | 0 | Node capacity |
| 240 | [30] | ptr | node | Sentinel node (same as initial node at [24]) |
| 248 | [31] | QWORD | 0 | Free list head |
| 256 | [32] | QWORD | 0 | Free list count |

Region 1 -- Bitmask arrays (bytes 272--1327)

Two 508-byte bitmask arrays (127 DWORDs each), separated by single-byte sentinels:

| Byte range | Content |
|---|---|
| 284 | Sentinel byte (set to 0x80 after zeroing) |
| 288--795 | Bitmask array 0: 127 DWORDs for live range set intersection |
| 808 | Sentinel byte (set to 0x80 after zeroing) |
| 812--1319 | Bitmask array 1: 127 DWORDs for second class pair |

Each bitmask array is zeroed via an SSE2 vectorized loop (16 bytes per iteration, 0x1F iterations). The 0x80 sentinel byte at the start of each array marks initialization completion.

Region 2 -- Priority queue table blocks (bytes 1328--2063)

Five embedded priority queue tables, each containing an entry count (QWORD) followed by an array of 6 queue entries (24 bytes each):

| QWORD idx | Byte offset | Content |
|---|---|---|
| [166] | 1328 | Queue block 1 entry count (incremented by 7 per pass) |
| [167]--[184] | 1336--1479 | Queue block 1: 6 entries x 24 bytes |
| [188] | 1504 | Queue block 2 entry count |
| [189]--[206] | 1512--1655 | Queue block 2: 6 entries x 24 bytes |
| [210] | 1680 | Queue block 3 entry count |
| [211]--[228] | 1688--1831 | Queue block 3: 6 entries x 24 bytes |
| [232] | 1856 | Queue block 4 entry count |
| [233]--[250] | 1864--2007 | Queue block 4: 6 entries x 24 bytes |
| [256] | 2048 | Queue block 5 (overflow) count |

Each 24-byte queue entry has this layout:

| Entry offset | Type | Init | Field |
|---|---|---|---|
| +0 | ptr | arena | Bitvector storage pointer |
| +8 | QWORD | 0 | Bitvector data pointer |
| +16 | DWORD | -1 | Max element index |
| +20 | DWORD | 0 | Current element count |

Queue entries are built by sub_8BE190 and sorted by sub_7553C0. Candidates are inserted via sub_9370A0 (with tie-breaking) and removed via sub_9365A0 (bit-clear in bitvector).
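
The 24-byte entry layout can be made concrete with `struct`; the sizes, offsets, and initial values come from the table above, while the pointer value is a placeholder.

```python
import struct

# 24-byte priority queue entry: ptr (8) + QWORD (8) + DWORD (4) + DWORD (4).
QUEUE_ENTRY = struct.Struct("<QQiI")

def init_entry(arena_ptr: int) -> bytes:
    """Initial state per the layout table: storage = arena pointer,
    data = 0, max_index = -1 (empty sentinel), count = 0."""
    return QUEUE_ENTRY.pack(arena_ptr, 0, -1, 0)
```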

Region 3 -- Candidate node management (bytes ~2064--10591)

The largest region (~8,528 bytes). Contains working storage for spill candidate evaluation across all 7 register classes. This region is zeroed during initialization and populated during the instruction walk phase by sub_93BF50 (candidate evaluation), sub_936610 (candidate insertion with cost), sub_9680F0 (cost propagation), and sub_93A1F0 (interference counting). The exact internal sub-layout varies by register class and virtual register count.

Region 4 -- Linked list, accumulators, and tail (bytes 10592--11111)

| Byte offset | QWORD idx | Type | Init | Field |
|---|---|---|---|---|
| 10592 | [1324] | QWORD | 0 | Linked list head (spill candidate chain) |
| 10600 | [1325] | ptr | &self[1326] | Forward pointer (circular doubly-linked) |
| 10608 | [1326] | ptr | &self[1324] | Backward pointer |
| 10616 | [1327] | QWORD | 0 | List count |
| 10624 | [1328] | ptr | &self[1324] | Secondary forward pointer |
| 10632 | [1329] | ptr | &self[1326] | Secondary backward pointer |
| 10640 | -- | DWORD | 2 | Node type tag |
| 10648 | [1331] | ptr | node | Primary candidate node (24B, type=2) |
| 10656 | [1332] | ptr | node | Secondary candidate node (24B, type=2) |
| 10696 | [1337] | ptr | node | Secondary tail pointer |
| 10704 | [1338] | ptr | 0 | Instruction walk context (knob 622 gate) |
| 10712 | [1339] | QWORD | 0 | Walk state |
| 10720 | [1340] | QWORD | 0 | Walk counter |
| 10728 | -- | BYTE | 0 | Walk active flag |
| 10736 | [1342] | QWORD | 0 | Spill cost accumulator 0 |
| 10744 | [1343] | QWORD | 0 | Spill cost accumulator 1 |
| 10752--10824 | [1344]--[1353] | QWORD | 0 | Additional cost/range counters (10 slots) |
| 10840 | [1355] | QWORD | 0 | Interference counter |
| 10872 | -- | DWORD | 0 | Class mask |
| 10888 | [1361] | QWORD | 0 | Result register count |
| 10896 | [1362] | QWORD | 0 | Result cost metric |
| 10904 | [1363] | QWORD | 0 | Result spill count |
| 10912 | [1364] | QWORD | 0 | Result class width |
| 10920 | [1365] | QWORD | 0 | Best-attempt index |
| 10960 | [1370] | QWORD | 0 | Phase indicator |
| 10968 | [1371] | QWORD | 0 | Phase state |
| 10976 | [1372] | QWORD | 0 | Output flag |
| 11008 | [1376] | QWORD | 0 | SMEM spill tracking |
| 11016 | [1377] | QWORD | 0 | SMEM spill state |
| 11048 | [1381] | QWORD | 0 | Output queue pointer |
| 11056 | [1382] | QWORD | 0 | Output queue size |
| 11072 | [1384] | ptr | 0 | Callee-save tracking (freed by sub_96CFA0) |
| 11080 | [1385] | ptr | 0 | Callee-save arena ref (freed by sub_96CFA0) |
| 11089 | -- | BYTE | 1 | Guidance enabled flag |
| 11096 | [1387] | ptr | 0 | Final candidate object (freed by sub_96CFA0) |
| 11104 | [1388] | ptr | 0 | Final candidate arena ref (freed by sub_96CFA0) |

The linked list at [1324]--[1329] is initialized as a circular doubly-linked list with self-referential pointers, following the standard intrusive list pattern. The cleanup function sub_96CFA0 (694 lines) deallocates the candidate node objects at offsets 11072, 11080, 11096, and 11104.
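
The self-referential initialization follows the standard intrusive pattern; a sketch with Python objects standing in for the raw QWORD slots (the class and method names are illustrative, not recovered from the binary):

```python
class IntrusiveNode:
    """Minimal circular doubly-linked list of the shape the guidance
    structure builds at [1324]--[1329]: an empty list is a head whose
    links point back at itself."""
    def __init__(self):
        self.prev = self
        self.next = self

    def insert_after(self, node: "IntrusiveNode") -> None:
        node.prev, node.next = self, self.next
        self.next.prev = node
        self.next = node

    def unlink(self) -> None:
        self.prev.next = self.next
        self.next.prev = self.prev
        self.prev = self.next = self   # node becomes its own empty list

    def empty(self) -> bool:
        return self.next is self
```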

Candidate node object (24 bytes)

Each candidate node is a 24-byte arena-allocated object used in the doubly-linked list and as priority queue sentinels:

| Offset | Type | Field |
|---|---|---|
| +0 | QWORD | Type tag: 2 = priority queue node, 3 = initial sentinel |
| +8 | QWORD | Count / payload |
| +16 | ptr | Arena back-reference |

Stack-local queue header array

In addition to the 11,112-byte structure, the function maintains a stack-local 7-element queue header array (v514, 168 bytes on stack). Each entry is 24 bytes (3 QWORDs) with the same layout as the embedded queue entries above. The 7 entries map to the 7 register classes:

| Index | Class |
|---|---|
| 0 | R (general-purpose registers) |
| 1 | P (predicate registers) |
| 2 | B (barrier registers) |
| 3 | UR (uniform registers) |
| 4 | UP (uniform predicates) |
| 5 | UB (uniform barriers) |
| 6 | Acc (tensor/accumulator registers) |

After bitvector iteration, each stack-local queue header is built by sub_8BE190 and sorted by sub_7553C0.

Algorithm

Three phases: (A) stack-local queue init, (B) structure alloc + instruction walk, (C) finalization.

Phase A -- Per-class bitvector copy and sort

function compute_spill_guidance(ctx, alloc, attempt):
    // Zero two 127-DWORD bitmask arrays (SSE2, 16B/iter, 0x1F iters).
    // Sentinel 0x80 at offsets +284 and +808.
    zero_sse2(bitmask[0..1], 127); bitmask[0][-1] = bitmask[1][-1] = 0x80
    queue_headers[7]  // stack-local 168B: 0=R,1=P,2=B,3=UR,4=UP,5=UB,6=Acc
    arena = ctx+16
    for class_id in 0..6:
        hdr = { storage=arena, data=0, max_index=-1, count=0 }
        if alloc.max_index[class_id] >= 0:
            n = alloc.max_index[class_id] + 1
            bitvec_resize(hdr, n)                      // sub_6E6650
            sse2_copy(hdr.data + hdr.count, alloc.data, n)
            hdr.count += n
    build_overflow_queue(queue_headers)                 // sub_8BE190
    queue_headers.run_counter += 7
    sort_queue(queue_headers)                           // sub_7553C0

    // Same copy-and-sort for 5 priority queue blocks at QWORD offsets
    // [166], [188], [210], [232], [256] (22 QWORDs each, 6 entries/block).
    for block_id in 0..4:
        qblk = &g[166 + block_id * 22]; qblk.count = 0
        for class_id in 0..6: /* identical init + bitvec copy */
        build_overflow_queue(qblk_end); qblk.count += 7; sort_queue(qblk_end)

Phase B -- Structure init and instruction walk

    g = arena_alloc(ctx+16, 11112)                     // guidance structure
    g[0] = ctx; g[3] = alloc+16; g[4..6] = 0          // back-ptrs + counters
    // Init circular doubly-linked list [1324]--[1329] (self-referential).
    // Alloc two 24B candidate nodes (type=2) at g[1331], g[1332].
    // Per-class bitvec: sub_957020 allocates ceil(vreg_count/64) QWORDs.
    sub_95BC90(g, ctx.active_class, ctx.block_list, ctx.class_limit, attempt > 5)
    if !ctx.class_nospill[attempt]: sub_93CD20(g)      // candidate pre-eval
    vtable_call(ctx, slot_39)                           // arch-specific init
    if get_knob(622)==1 and !g[1338]:                   // knob 622 gate
        g[1338] = arena_alloc(ctx+16, 128)              // walk context

    for insn in block_list(ctx+280):                    // instruction walk
      switch insn.opcode:
        case 72: update_block_counters(g)               // block boundary
        case 54: load_block_state(g, func_node)         // function entry
        case 97: update_high_water_mark(g)              // block exit
                 check_block_boundary(alloc, insn)      // sub_9363E0
        default: // ordinary instruction -- per-operand walk:
          for op_idx in 0..insn.operand_count:
            op = insn.operands[op_idx]
            if (op>>28)&7 != 1: continue               // not a VR reference
            vr_id = op & 0xFFFFFF
            if vr_id in 41..44: continue               // skip special regs
            vreg = vr_table[vr_id]; cls = vreg+64
            if op < 0:                                  // DEF (bit 31 set)
                sub_91E860(alloc, insn, op_idx)         // remat cost
                cost = sub_93BE80(ctx, insn, op_idx)    // spill cost
                insert_candidate(&qh[cls], vreg, cost)  // sub_9370A0
            else:                                       // USE
                remove_candidate(&qh[cls], vreg)        // sub_9365A0
            // Cross-class demotion (sub_936610): for subsequent operands
            // whose class <= attempt and range-end <= budget, remove them.
            if cls < attempt:
              for j in op_idx+1..insn.operand_count:
                ref = vr_table[insn.operands[j] & 0xFFFFFF]
                if ref+64 <= attempt and ref+68 <= budget[ref+64]:
                    remove_candidate(&qh[ref+64], ref)
          // Cost propagation:
          if first_use_in_block:
              sub_93BF50(ctx, insn, &qh, class_limit)  // candidate eval
          if should_propagate:
              sub_9680F0(g, &qh, insn, ...)             // per-insn assignment
        case 52: // BB entry: set bits for [class_min..class_max] per class,
          // AND-subtract liveout (sub_BDC180 first visit / sub_BDCDE0 re-visit).
          // Iterate set bits; remove VRs with cls<=attempt, end<=budget.

Phase C -- Finalization

    if ctx.class_nospill[ctx.active_class]:
        sub_96CFA0(g); arena_free(arena, g)            // cleanup only
    else:
        sub_93C0B0(ctx, g, attempt)                     // finalize scores
        if attempt == 6 and ctx.uniform_spill_count > 0:
            sub_9539C0(ctx, g)                          // uniform spill fixup
        for cls in 0..g.max_class_index:                // emit guidance
            entry = g[13] + 24*cls
            if entry.reg_count > 0:
                emit_spill_guidance(alloc.spill_table, entry)
        sub_96CFA0(g); arena_free(arena, g)
    for i in (run_counter-1) downto 0:                  // final re-sort
        sort_queue(&queue_headers[i])                   // sub_7553C0

Per-VR cost comparison (sub_7553C0)

Single-pass bitvector drain: if max_index < 0 return (empty). Otherwise reset max_index = -1 and delegate to vtable_call(storage, slot_4). Slot 4 walks the bitvector from highest set bit downward, yielding VR indices in descending cost order (cheapest-to-spill first).
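
The slot-4 walk amounts to iterating set bits from the highest index downward; a sketch over a Python integer used as the bitvector:

```python
def drain_descending(bitvec: int):
    """Yield set-bit indices from highest to lowest -- the order in
    which the slot-4 vtable walk emits VR indices."""
    while bitvec:
        hi = bitvec.bit_length() - 1   # index of highest set bit
        yield hi
        bitvec &= ~(1 << hi)           # clear it and continue downward
```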

Candidate insertion (sub_9370A0) and removal (sub_9365A0)

Insertion is gated on the VR pair/width field vreg+48 bits 20--21:

| Bits 20--21 | Width | Insertion behavior |
|---|---|---|
| 0 | single | Set bit vreg+68 in class bitvector |
| 1 | pair | No insertion (pairs not independently spillable) |
| 3 | quad | Set vreg+68; if cost tag == 2, also set vreg+68 + 1 |
The a4 tie-break parameter (values 0/4/8) selects single-bit vs double-bit paths for quad-width VRs. Removal clears bit vreg+68; for 64-bit pairs ((vreg+48 ^ 0x300000) & 0x300000 == 0), also clears vreg+68 + 1. Both operations guard against exceeding bitvector capacity ((max_index+1) << 6).
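
Under the assumption that the class bitvector is a plain bit set indexed by vreg+68, the insertion and removal gating reduces to the following (function names are illustrative; the width test and capacity guard mirror the description above):

```python
PAIR_FIELD = 3 << 20   # bits 20--21 of vreg+48

def insert_candidate(bitvec: int, slot: int, width_bits: int,
                     cost_tag: int, capacity: int) -> int:
    """sub_9370A0 gate: singles set one bit, pairs are skipped, and
    quads with cost tag 2 set two adjacent bits. Bits at or beyond
    capacity are never set."""
    width = (width_bits >> 20) & 3
    if width == 1:                      # pair: not independently spillable
        return bitvec
    if slot < capacity:
        bitvec |= 1 << slot
    if width == 3 and cost_tag == 2 and slot + 1 < capacity:
        bitvec |= 1 << (slot + 1)       # second half of the quad
    return bitvec

def remove_candidate(bitvec: int, slot: int, width_bits: int) -> int:
    """sub_9365A0: clear the bit; 64-bit pairs (both width bits set,
    i.e. (width_bits ^ 0x300000) & 0x300000 == 0) also clear slot+1."""
    bitvec &= ~(1 << slot)
    if (width_bits ^ PAIR_FIELD) & PAIR_FIELD == 0:
        bitvec &= ~(1 << (slot + 1))
    return bitvec
```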

The guidance output is consumed by the retry loop: after each failed allocation attempt, the allocator consults the guidance to decide which virtual registers to allow to spill on the next attempt.

Spill Cost Model

The allocator uses a multi-level cost model to evaluate which registers are cheapest to spill. The model combines per-instruction frequency weighting, knob-controlled thresholds, and a density-based eligibility gate into a single unified cost per virtual register.

Per-virtual-register weights

| Field | Type | Meaning |
|---|---|---|
| vreg+28 | float | Primary spill cost (accumulated from usage frequency) |
| vreg+32 | float | Secondary spill cost (use-after-last-def weighting) |
| vreg+40 | float | Normalized spill cost (primary / live range length) |
| vreg+76 | float | Alternate cost (architecture-specific adjustments) |
| vreg+80 | int | Spill flag: 0 = not spilled, 1 = spilled |

Allocator-level accumulators

| Field | Type | Meaning |
|---|---|---|
| alloc+1568 | double | Total spill-store cost (summed over all spilled vregs) |
| alloc+1576 | float | Total spill-load cost (summed over all refilled vregs) |

Unified spill cost formula

function compute_spill_cost(alloc, vreg, regclass):
    // Step 1: Select base weight from register pressure state
    if alloc.high_pressure_flag[regclass]:          // byte at alloc + 32*regclass + 893
        base_weight = 3.0                           // encourage spilling under pressure
    else:
        base_weight = 15.0                          // default: penalize spilling

    // Step 2: Accumulate cost per instruction touching this vreg
    vreg.primary_cost = 0.0                         // vreg+28
    vreg.secondary_cost = 0.0                       // vreg+32
    for each instr that defs or uses vreg:
        freq = block_frequency(instr.block)         // vtable callback at +8; loop depth
        vreg.primary_cost += base_weight * freq

        // Extra penalty for values consumed far from their definition
        if instr is_use and instr.program_point > vreg.last_def_point:
            vreg.secondary_cost += base_weight * freq

    // Step 3: Fold into allocator-wide accumulators on spill commit
    //   (triggered when vreg.flags bit 18 is set = needs-spill)
    alloc.total_store_cost += (double)vreg.primary_cost     // alloc+1568
    alloc.total_load_cost  += vreg.secondary_cost           // alloc+1576
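
A worked example under this model; the 3.0/15.0 base weights and the accumulation rules come from the pseudocode above, while the instruction stream itself is invented for illustration:

```python
def spill_costs(events, high_pressure: bool, last_def_point: int):
    """Accumulate primary/secondary cost as in the pseudocode: every
    def or use adds base_weight * freq to the primary cost; uses past
    the last def also add to the secondary cost.
    `events` is a list of (program_point, is_use, block_frequency)."""
    base = 3.0 if high_pressure else 15.0
    primary = secondary = 0.0
    for point, is_use, freq in events:
        primary += base * freq
        if is_use and point > last_def_point:
            secondary += base * freq
    return primary, secondary

# One def at point 10 (freq 1) and uses at points 12 and 30 (freq 1, 4)
# under high pressure: primary = 3*(1+1+4) = 18, secondary = 3*(1+4) = 15.
```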

Spill eligibility gate (sub_9997D0)

Before a vreg enters the spill candidate set, it must pass two knob-controlled thresholds and a density check. The knobs reside in the options block reachable through ctx->pipeline->options + offset:

function is_spill_eligible(alloc, vreg, num_candidates):
    // Threshold 1: primary cost floor (default 0.9, knob at opts+38952)
    primary_floor = 0.9
    mode = opts.override_mode[38952]
    if mode == 2:  primary_floor = opts.value[38960]    // user override
    if mode == 1:  primary_floor = 0.0                  // disabled

    if vreg.primary_cost < primary_floor:               // vreg+28
        return false

    // Threshold 2: secondary cost floor (default 1.0, knob at opts+38880)
    secondary_floor = 1.0
    mode = opts.override_mode[38880]
    if mode == 2:  secondary_floor = opts.value[38888]
    if mode == 1:  secondary_floor = 0.0

    if vreg.secondary_cost <= secondary_floor:          // vreg+32
        return false

    // Density gate: registers consumed per candidate must stay under budget
    phys_class = resolve_regclass(vreg)
    regs_per_candidate = alloc.class_reg_count[phys_class] / num_candidates
    return regs_per_candidate < alloc.density_budget    // int at alloc+484
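
The same gate restated as a standalone predicate; the default floors come from the pseudocode, and the knob plumbing is collapsed into plain arguments:

```python
def is_spill_eligible(primary_cost: float, secondary_cost: float,
                      class_reg_count: int, num_candidates: int,
                      density_budget: float,
                      primary_floor: float = 0.9,
                      secondary_floor: float = 1.0) -> bool:
    """Two cost floors plus the density gate from sub_9997D0. Note the
    asymmetry preserved from the decompilation: the primary check
    rejects on <, the secondary on <=."""
    if primary_cost < primary_floor:
        return False
    if secondary_cost <= secondary_floor:
        return False
    regs_per_candidate = class_reg_count / num_candidates
    return regs_per_candidate < density_budget
```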

Zero-cost pruning (sub_999AA0)

After initial cost computation, sub_999AA0 (162 lines) walks the interference graph and removes any vreg whose primary_cost == 0.0 from the candidate list. This avoids wasting spill slots on dead or single-use values that would have zero benefit from spilling. The function also decrements the per-node degree count and recycles freed nodes into the allocator's free list at alloc+144.

Max-pressure query (sub_99A0B0)

sub_99A0B0 (224 lines) computes the peak number of simultaneously live defs across all blocks spanned by a live range. It walks instructions in program order within each spanned block, counting def operands whose register class matches the target class. Special register IDs 41--44 (PT, P0--P3) are excluded. The peak is stored at alloc.peak_defs (DWORD at offset 402*4) and is used to gauge whether spilling a particular vreg would meaningfully relieve pressure.

Best-result comparison (sub_93D070)

Called after every allocation attempt to decide whether the current result improves on the stored best, and if so, snapshot the full VR-to-physical assignment.

Best-result structure layout (base pointer best, 88 bytes + variable backup array):

| Offset | Field | Stored from |
|---|---|---|
| best[2] | budget threshold | set externally |
| best[3] | total attempt count | set externally |
| best[10] | best reg count | v7 (possibly remapped reg_count) |
| best[11] | class width | reg_count (raw input) |
| best[13] | inverse density | 128 / reg_count |
| best[16] | best pressure | max_pressure |
| best+56 (double) | best cost | cost |
| best[12], best[17] | aux counters | from alloc->subsys+1528, +1564 |
| (QWORD)(best+32) | backup array ptr | arena-allocated on first call |
| best+24 (byte) | improved flag | 0 = no improvement, 1 = improved |

function record_best_result(best, ctx, attempt, reg_count, max_pressure, arena, cost):
    // Step 0: class-specific register count remapping via vtable slot 89
    v7 = reg_count
    remap_fn = ctx->alloc->subsys->vtable[89]          // offset 712
    if remap_fn != identity_stub:                       // sub_693720
        v7 = remap_fn(ctx->alloc->subsys, ctx, reg_count)
    inv_density = 128 / reg_count

    // Step 1: first-ever call (attempt == 0, best[3] == 1)
    if attempt == 0:
        best.improved = 1
        if best.attempt_count == 1:                     // single-attempt mode
            if reg_count <= best[2] + 1: return         // within budget, skip save
        best.backup = arena_alloc(arena, (vreg_count + 1) * 4)
        goto save_state

    // Step 2: budget gate -- require current within threshold
    threshold = min(reg_count, best.class_width)
    if best[2] + 1 < threshold:
        // Over budget: fall back to cost/pressure/width comparison
        if best.cost == cost:
            improved = (max_pressure < best.pressure)
            if max_pressure == best.pressure:
                improved = (reg_count < best.class_width)
        else:
            improved = (cost < best.cost)
        best.improved = improved
        if !improved: return
        goto check_shortcircuit

    // Step 3: within budget -- lexicographic (reg_count, inv_density) comparison
    if best.reg_count <= v7 and best.inv_density >= inv_density:
        // Existing best is at least as good on primary+secondary keys
        // Tie-break on pressure, then class_width
        if (best.reg_count != v7 or max_pressure >= best.pressure) \
           and (max_pressure != best.pressure or best.class_width <= reg_count):
            best.improved = 0                           // no improvement
            return
    best.improved = 1

check_shortcircuit:
    if attempt == 99: return                            // probe-only mode
    if attempt == best.attempt_count - 1:               // final attempt
        cap = best[2] + 1
        if reg_count <= cap: return cap                 // clamp to budget

save_state:
    // Step 4: record new best scalars
    best.class_width    = reg_count
    best.reg_count      = v7
    best.inv_density    = 128 / reg_count
    best.pressure       = max_pressure
    best.cost           = cost
    best.aux_counter_a  = ctx->alloc->subsys->child[1528]  // e.g. spill store count
    best.aux_counter_b  = ctx->alloc->subsys->child[1564]  // e.g. spill load count

    // Step 5: snapshot per-VR assignment (skip for special class width 0x989677+8)
    if reg_count != SPECIAL_CLASS_WIDTH:
        vr = vr_list_head(best->ctx)                    // linked list via vreg+120
        while vr != NULL:
            idx         = vr.id                         // vreg+12
            phys_reg    = vr.assigned_reg               // vreg+68
            is_pair_hi  = (vr.pair_width >> 23) & 1     // bit 23 of vreg+48
            best.backup[idx] = is_pair_hi + 2 * phys_reg
            vr = vr.next                                // vreg+120
        // Also snapshot subsystem auxiliary state
        best[18] = ctx->alloc->subsys[408]
        best[20] = ctx->alloc->subsys[400]

The backup array encodes each VR's physical register assignment as 2*phys_reg + pair_high_bit, allowing exact restoration when the allocator commits the best result. The is_pair_hi bit distinguishes the low vs high half of a register pair so the restore path can reconstruct pair assignments without re-running the pair allocator.
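
The encoding round-trips cleanly; a sketch of the pack/unpack pair used when snapshotting and restoring assignments (the function names are illustrative):

```python
def pack_assignment(phys_reg: int, is_pair_hi: bool) -> int:
    """Backup-array encoding from sub_93D070: 2 * phys_reg + pair bit."""
    return 2 * phys_reg + int(is_pair_hi)

def unpack_assignment(packed: int) -> tuple:
    """Inverse used on the restore path: recover the physical register
    and whether this VR held the high half of a register pair."""
    return packed >> 1, bool(packed & 1)
```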

Spill cost infrastructure

| Address | Size | Role | Detail |
|---|---|---|---|
| sub_9997D0 | 44 lines | Eligibility gate | Two-threshold + density check (see pseudocode above) |
| sub_9998A0 | 57 lines | Has-constrained check | Hash table walk at alloc+712; true if any vreg has bit 3 |
| sub_999950 | 74 lines | Constraint-aware compare | Calls sub_998450 for setup, walks block order checking interference |
| sub_999AA0 | 162 lines | Zero-cost pruning | Removes cost-0 vregs from interference graph; recycles to free list |
| sub_999D10 | 102 lines | Phi/copy aggregation | For opcode 96 (phi), aggregates cost across coalesced operand chains |
| sub_999F00 | 100 lines | Per-block cost array | Grows 12B-per-block array; copies {block_id, store_cost, load_cost} tuples |
| sub_99A0B0 | 224 lines | Max-pressure query | Peak simultaneous defs per live range span (excludes regs 41--44) |
| sub_9A8270 | ~580 lines | Live range spill driver | 14 KB; drains worklists, checks bitvector membership, commits per-range spills |

Has-constrained check (sub_9998A0)

Walks the hash table rooted at alloc+712 (populated by sub_995110). Each bucket chain is a linked list of nodes; the function scans every node checking byte +12 bit 3 (the "constrained" flag). Returns 1 immediately on the first constrained node found, 0 if the entire table is exhausted. When alloc+720 == 0 (table empty), returns 0 without walking.

Constraint-aware interference check (sub_999950)

function check_constrained_interference(alloc, block_range):
    setup_live_state(alloc, block_range)                 // sub_998450
    if not has_constrained_vreg(alloc):                  // sub_9998A0
        return PROCEED                                   // no constrained -- safe

    // Phase 1: walk successor blocks of the range
    if alloc.pass_count > 1:                             // alloc+756
        succ = block_range.successor_list                // block_range[17]
        while succ:
            if alloc.hash_count > 0:                     // alloc+720
                entry = hash_lookup(alloc+712, bb_node(succ.bb_id))
                if entry and (entry.flags & 0x4):        // constrained bit
                    goto phase2
            succ = succ.next
        return NO_INTERFERENCE

    // Phase 2: walk RPO block list for cross-block check
    phase2:
    found = false
    for i in 1..alloc.block_count:                       // alloc+520
        block = block_table[rpo_order[i]]
        if dominates(alloc, block_range, block):         // sub_76ABE0
            result = check_single_block(alloc, block)    // sub_995740
            if not result:  return PROCEED
            found = result
        elif found:
            return PROCEED                               // past dominated region
    return NO_INTERFERENCE

Phi/copy aggregation (sub_999D10)

For opcode 96 (phi/merge nodes), traces through coalesced operand chains to find the source live range that can absorb the phi cost. Validates a diamond CFG pattern: a phi with exactly two predecessors, each with a single predecessor, both merging from the same dominator. When this pattern holds and the source vreg has use_count == 1 and is not constrained (bit 1 at vreg+51), returns the source vreg pointer for cost aggregation. Returns null (byte +8 = 0) when the pattern does not match or the opcode is not 96.

Per-block cost array builder (sub_999F00)

function build_block_cost_array(alloc):
    block_info = *(alloc+8) -> +72                       // current block descriptor
    if not block_info:  return
    max_id = block_info.max_id                           // +32
    if max_id + 1 > alloc.cost_array_size + 1:           // alloc+592
        // Grow with 1.5x policy
        new_cap = old_cap + (old_cap + 1) / 2
        if new_cap < max_id + 1:  new_cap = max_id + 1
        buf = arena_alloc(arena, 12 * new_cap)           // vtable+24
        memcpy(buf, old_buf, 12 * (old_size + 1))
        arena_free(arena)                                // vtable+32
        alloc.cost_array = buf                           // alloc+584
        alloc.cost_capacity = new_cap                    // alloc+596
    // Zero-fill new slots as {0, 0, 0}
    for i in (old_size+1)..max_id:
        cost_array[i] = {0, 0, 0}                       // 12 bytes each
    alloc.cost_array_size = max_id                       // alloc+592
    // Copy per-block tuples from block descriptor
    src = block_info.entry_base + 16                     // +48 with 16B stride
    for j in 0..block_info.entry_count:                  // +36
        idx = src[j].block_id
        cost_array[idx] = {src[j].val_a, src[j].val_b, src[j].val_c}

Each entry is a 12-byte record indexed by basic block ID. The 1.5x growth policy avoids reallocation churn when processing functions with many blocks.
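
The growth computation isolated from the pseudocode above; the `old_cap + (old_cap + 1) / 2` form is taken verbatim, with integer division assumed:

```python
def grow_capacity(old_cap: int, needed: int) -> int:
    """1.5x growth policy from sub_999F00: grow by half (rounded up),
    then clamp up to the needed size if that still falls short."""
    new_cap = old_cap + (old_cap + 1) // 2
    return max(new_cap, needed)

# e.g. 8 -> 12; a jump straight to 100 blocks reallocates once to 100.
```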

Live range spill driver (sub_9A8270)

The largest of the spill cost functions (580 lines, 14 KB decompiled). It proceeds in four phases:

function drive_live_range_spills(alloc, loop_info, result_ptr):
    // Phase 1 -- drain 3 worklists (at alloc+123/+138/+156) into free pools
    //   via sub_69DD70 pop; recycle interval nodes (alloc[144..150]) via vtable+32.
    //   If has_secondary_map (DWORD[202]): drain alloc[102..103] -> alloc[100].
    //   If has_bucket_array (DWORD[304]): recycle alloc[153..154] per-bucket.
    drain_all_worklists(alloc)

    // Phase 2 -- early exit
    *result_ptr = 0
    if alloc.pass_count <= 1 and alloc.mode > 1:  return 0
    if alloc.block_count == 0:                     return 0

    // Phase 3 -- per-block RPO walk
    best_range = NULL; conflict_range = NULL; seq = 1
    for block in rpo_order[1..block_count]:
        range = block_to_range(block)
        if not in_live_set(alloc, range):  continue      // sub_995740
        for insn in range.instructions:
            // Bitvector membership: dense alloc[57][vr>>6] bit (vr&63),
            // then sparse BST at alloc+156 keyed by vr>>8.
            if has_interference_map and constrained(insn):
                if is_loop_entry(insn):                  // sub_7E0F50
                    analyze_interval(alloc, insn)        // sub_99CF80
                    if split_required(insn):  continue   // sub_9A3690
                    if insn in inner_loops:
                        mark_inner_loop(alloc, insn, 2)  // sub_9A29C0
                    else:
                        // Commit: add to candidate set (alloc+128),
                        // collect operand VR refs into worklist B (alloc+138),
                        // call try_commit (sub_9A79A0), finalize (sub_9A2D90),
                        // then mark_spill_candidate or try_evict_or_split.
                        // Arch callback: vtable slot 227 (offset 1816).
                        commit_or_evict(alloc, &best_range, range, insn, seq++)

    // Phase 4 -- reconcile: conflict_range must equal best_range (or NULL)
    if conflict_range and conflict_range != best_range:  return 0
    if best_range and *result_ptr:
        if best_range.block_id != result_ptr.block_id:  return 0
    return best_range

Spill Code Generation

sub_94F150 (561 lines) inserts actual spill-store and refill-load instructions into the IR instruction stream.

Per-register spill info

The function allocates a tracking array: 12 bytes per virtual register, initialized to {0, -1, -1}:

| Offset | Size | Field |
|--------|------|-------|
| +0 | 4 | Spill state (0 = none) |
| +4 | 4 | Spill slot index (-1 = unassigned) |
| +8 | 4 | Refill slot index (-1 = unassigned) |

Execution flow

function generate_spill_code(alloc, ctx, mode):
    // 1. Allocate per-block tracking
    tracking = arena_alloc(12 * (numBlocks + 1))

    // 2. Set up instruction iteration
    setup_instruction_walk(ctx, walk=1, dir=0)       // sub_7E6090

    // 3. Multi-block liveness (if numBlocks > 1 and mode == 1)
    compute_cross_block_liveness(alloc, ctx)          // sub_9449B0
    //   Uses bitvectors: sub_BDBA60, sub_BDC180, sub_BDCDE0

    // 4. Clear per-instruction spill flags
    for each instr:
        instr.flags &= ~0xE00

    // 5. Walk instruction list
    spill_weight = 15.0
    if high_pressure_class:
        spill_weight = 3.0

    for each instr in instruction_list:
        for each operand:
            vreg = lookup_vreg(operand)
            if vreg.was_previously_spilled:
                // Track for potential refill
                update_refill_tracking(vreg, instr)
            // Accumulate spill cost weighted by block frequency
            vreg.spill_cost += spill_weight * block_frequency(instr.block)

        // Handle call instructions (opcode 97; STG in ROT13, used as CALL marker -- actual CALL is opcode 71)
        if instr.opcode == 97:
            handle_callee_save(alloc, instr)

        // Track "use after last def" for enhanced cost
        update_use_after_def(vreg, instr)

    // 6. Uniform register special handling (flag 0x200)
    if vreg.flags & 0x200:
        apply_uniform_spill_rules(vreg)

Epoch tracking

sub_936CF0 (81 lines) checks basic block boundaries for epoch increments. It returns true if the block's successor is a CALL instruction (opcode 52) with a target that has opcode 236 (special call), or if allocator flags at alloc+1588/alloc+1589 are set. This mechanism tracks liveness across call boundaries, ensuring that spilled values are correctly reloaded after calls that may clobber registers.

LMEM Spilling

Local memory (LMEM) is the default spill destination. Each GPU thread has private local memory backed by DRAM and cached in L2.

Slot allocation

sub_939BD0 (65 lines) configures the spill slot allocator. It queries OCG knob 623 for a custom spill mode; if the knob is disabled, it selects between two default configurations based on the cost threshold at alloc+776:

| Condition | Bucket size | Alignment | Max pool |
|-----------|-------------|-----------|----------|
| Cost threshold == 0.0 | 8 bytes | 4 bytes | 1 MB |
| Cost threshold != 0.0 | 16 bytes | 16 bytes | 1 MB |

The 8-byte/4-byte configuration handles standard 32-bit register spills. The 16-byte/16-byte configuration handles double-precision or 64-bit values that require stricter alignment.

if (*(double*)(alloc + 776) == 0.0)
    return init_spill_pool(mem_alloc, 8, 4, 0x100000);    // 8B buckets, 4B aligned
else
    return init_spill_pool(mem_alloc, 16, 16, 0x100000);   // 16B buckets, 16B aligned

When knob 623 is enabled, the knob value at offset +224 supplies a custom spill limit, passed to the spill allocator init function via vtable +24.

Spill pool object

The spill pool lives at alloc+1784 and is accessed through a vtable. Its fields:

| Offset | Type | Field |
|--------|------|-------|
| +0 | ptr | Vtable pointer |
| +8 | u32 | Bucket size (bytes per slot: 8 or 16) |
| +12 | u32 | Alignment (4 or 16) |
| +16 | u32 | Pool tail offset (next free byte, starts at 0) |
| +20 | u32 | Max pool size (0x100000 = 1 MB) |
| +24 | ptr | Free list head (singly-linked freed slots) |
| +32 | u32 | Free list count |

The vtable provides two initialization methods and one allocator:

| Slot | Offset | Signature | Role |
|------|--------|-----------|------|
| 2 | +16 | init(pool, bucket_size, align, max) | Default init (from sub_939BD0) |
| 3 | +24 | init_custom(pool, limit) | Knob-623 custom limit init |
| 4 | +32 | allocate(pool) -> offset | Allocate one slot, returns byte offset |
| 5 | +40 | release(pool, offset) | Return slot to free list |

Slot allocation algorithm

function init_spill_pool(pool, bucket_size, alignment, max_size):
    pool.bucket_size = bucket_size          // 8 or 16
    pool.alignment   = alignment            // 4 or 16
    pool.tail_offset = 0                    // bump pointer starts at 0
    pool.max_size    = max_size             // 0x100000 (1 MB)
    pool.free_head   = NULL
    pool.free_count  = 0

function allocate_lmem_slot(pool) -> int:
    // Step 1: Try free-list reuse (slots from dead live ranges)
    if pool.free_head != NULL:
        slot_offset      = pool.free_head.offset
        pool.free_head   = pool.free_head.next
        pool.free_count -= 1
        return slot_offset

    // Step 2: Bump-allocate from pool tail
    offset = align_up(pool.tail_offset, pool.alignment)
    if offset + pool.bucket_size > pool.max_size:
        fatal("local memory pool exhausted")    // 1 MB limit
    pool.tail_offset = offset + pool.bucket_size
    return offset

function release_lmem_slot(pool, offset):
    // Prepend to free list for LIFO reuse
    node          = {offset, pool.free_head}
    pool.free_head = node
    pool.free_count += 1

Slot reuse occurs when sub_93FBE0 resets per-iteration state: it walks all virtual registers, sets vreg+68 = -1 for any vreg not pinned (flag 0x20 clear and class != 8), and returns their previously-assigned LMEM offsets to the free list via release_lmem_slot. On the next allocation attempt, those offsets are recycled for different vregs whose live ranges no longer overlap the original occupants.
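The pool's behavior (LIFO free-list reuse first, aligned bump allocation second) can be modeled compactly. A sketch using a plain Python list as the free list rather than the actual linked-node layout:

```python
class SpillPool:
    """Executable model of the LMEM spill pool pseudocode above."""
    def __init__(self, bucket_size, alignment, max_size):
        self.bucket_size = bucket_size   # 8 or 16
        self.alignment = alignment       # 4 or 16
        self.tail = 0                    # bump pointer
        self.max_size = max_size         # 0x100000 in ptxas
        self.free = []                   # LIFO free list of byte offsets

    def allocate(self):
        # Step 1: reuse a slot freed by a dead live range.
        if self.free:
            return self.free.pop()
        # Step 2: align the tail up, then bump-allocate.
        offset = (self.tail + self.alignment - 1) & ~(self.alignment - 1)
        if offset + self.bucket_size > self.max_size:
            raise MemoryError("local memory pool exhausted")
        self.tail = offset + self.bucket_size
        return offset

    def release(self, offset):
        self.free.append(offset)         # prepend-equivalent: LIFO reuse

pool = SpillPool(8, 4, 0x100000)
a = pool.allocate()            # 0
b = pool.allocate()            # 8
pool.release(a)
assert pool.allocate() == a    # freed slot is recycled before the bump pointer
```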

SASS instruction sequences

sub_9850F0 (520 lines) generates the actual SASS load/store instruction sequences for spill traffic. Architecture-specific registers drive the address computation:

| Field | Source | Role |
|-------|--------|------|
| target_info[400] | Architecture vtable | Spill base register |
| target_info[401] | Architecture vtable | Spill slot stride |
| target_info[402] | Architecture vtable | Spill offset register |

Spill store sequence:

IADD  addr, spill_base, offset       // compute slot address
IMAD  addr, addr, stride, 0          // scale by slot stride
ST    [addr], value                   // store to local memory

Refill load sequence:

IADD  addr, spill_base, offset       // compute slot address
IMAD  addr, addr, stride, 0          // scale by slot stride
LD    value, [addr]                   // load from local memory

The generated instructions use these SASS opcodes:

| Opcode | SASS mnemonic | Role in spill sequence |
|--------|---------------|------------------------|
| 0xC9 | IADD | Add offset to base register |
| 0x11B | IMAD | Multiply-add for address |
| 0xC3 | MOV | Move value |
| 0x82 (130) | HSET2 in ROT13; used as LD/LDL-like | Load from local memory (refill) |
| 0xB7 | ST / STL | Store to local memory (spill) |
| 0x14 | ISETP | Set predicate (conditional spill) |
| 0x8B | SHL | Shift for alignment |
| 0x110 | LOP3 | Logical operation for masking |
| 0x5F | BRA | Branch (conditional spill) |
| 0x120 | STS | Store to shared memory (SMEM path) |

At the hardware level, local memory spills become LDL/STL instructions. The SETLMEMBASE (opcode 147) and GETLMEMBASE (opcode 148) instructions manage the local memory base pointer for the thread.

SMEM Spilling

Shared memory (SMEM) spilling is an alternative to local memory spilling. Shared memory is on-chip SRAM, offering significantly lower latency than LMEM (which goes through L2/DRAM). However, shared memory is a finite resource shared across all threads in a CTA, making this optimization applicable only in specific circumstances.

Entry point

sub_9539C0 (1873 lines, 54 KB decompiled) implements the SMEM spill allocator.

Hard restriction

"Smem spilling should not be enabled when functions use abi."

This assertion (checked at two call sites within sub_9539C0) prevents SMEM spilling for ABI-conformant functions. Functions using the GPU calling convention require a stable stack frame in local memory; shared memory spill slots would conflict with the calling convention's stack layout requirements.

Activation conditions

SMEM spilling activates when all of the following hold:

  1. Register class is 3 (UR) or 6 (Tensor/Acc)
  2. Device type is 5 (ctx+896 == 5)
  3. The class has virtual registers to allocate (vreg_count > 0)
  4. The function does not use ABI calling conventions

Two execution paths

The function selects between a fast path and a full path based on three conditions:

| Condition | Test | Fast path if |
|-----------|------|--------------|
| Candidate count | vreg_count from ctx+104 descriptor | <= 4 |
| ABI mode | ctx+1368 bit 2 | clear |
| Extended flag | ctx+1380 bit 6 | clear |

If any condition fails, the full path runs. The fast path handles 1--4 uniform/accumulator registers with a single linear instruction walk (decompiled lines 1715--1862). The full path builds an explicit cost model and candidate list.

Per-CTA SMEM pool partitioning

Both paths read smem_sym = *(ctx+104)->qword[0]. The runtime SMEM pool available for spilling is partitioned per-CTA. Pool sizes come from the shared memory configuration table at 0x21FB640: 11 tiers from 0 to 328 KB on SM 8.x+, or three tiers (0/32/64 KB) on SM 7.5 (table at 0x21D9168). The allocator carves slots in 4-byte quanta (one 32-bit register width), so slot indices are always multiples of 4.
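The 4-byte quantum implies the align4 rounding that appears later in the pseudocode. A plausible reading of that helper (the exact definition is an inference from the stated quantum):

```python
def align4(nbytes):
    """Round a byte count up to the 4-byte slot quantum used by the
    SMEM spill allocator (one 32-bit register width per slot)."""
    return (nbytes + 3) & ~3

assert align4(1) == 4    # sub-word spill still occupies a full slot
assert align4(4) == 4
assert align4(6) == 8    # 48-bit value rounds up to two quanta
```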

Algorithm (full path)

function smem_spill_allocate(alloc, ctx):
    // Step 1 -- ABI guard
    if ctx.device_type == 4 and ctx.flags_1368 & 0x4:
        fatal("Smem spilling should not be enabled when functions use abi.")

    vreg_count = ctx_descriptor.dword[3]        // candidate count
    smem_sym   = ctx_descriptor.qword[0]        // SMEM allocation symbol
    if vreg_count == 0:
        alloc.smem_slot_count = 0               // alloc+1520
        return

    // Step 2 -- cost bitmap: one double per 4-register group
    bitmap_slots = (vreg_count + 3) >> 2        // ceil(vreg_count / 4)
    cost_array   = arena_alloc(8 * bitmap_slots + 8)
    cost_array[0] = bitmap_slots                // length prefix
    memset(&cost_array[1], 0, 8 * bitmap_slots) // all costs = 0.0

    // Step 3 -- callee hash table (FNV-1a indexed, sub_754DE0)
    //   Tracks which callees touch SMEM and need preservation.
    callee_ht = hash_table_new(initial_buckets=8)

    // Step 4 -- first instruction walk: score candidates
    //   Walks ctx.instruction_list (linked list at ctx+272).
    for each instr in instruction_list:
        opc = instr.opcode & 0xFFFFCFFF         // mask out predicate bits

        if opc == 97:   // spill callback
            ctx.vtable_1784[1](instr_operand, 1, 1)

        if opc == 72:   // CALL -- check if callee has SMEM flag
            callee = lookup_function(instr.operand_id)
            if callee.smem_eligible:
                callee_ht.insert(callee.id)      // FNV-1a hash of callee ID

        if opc == 130:  // operand referencing callee in callee_ht
            if callee_ht.contains(operand.callee_id):
                record_candidate(instr)

        if opc == 183 or opc == 288:             // LDS / STS
            if last_source_operand == smem_sym:
                reg_bytes = query_reg_byte_width(ctx, instr.subop)  // vtable+904
                slot_bytes = align4(reg_bytes * (operand.type_bits + 1))
                for offset = 0 to slot_bytes step 4:
                    node = alloc_candidate_node()    // 32 bytes
                    node.slot_id  = base_offset + offset
                    node.width    = (offset == 0) ? slot_bytes : 4
                    can_reuse     = alloc.vtable[36](alloc, instr, offset, callee_ht)
                    cost_idx      = node.slot_id / 4
                    if can_reuse and cost_array[cost_idx] >= 0.0:
                        if opc == 288:               // STS
                            cost_array[cost_idx] += block_freq * 0.5
                        else:                        // LDS (183)
                            cost_array[cost_idx] += block_freq
                    else:
                        cost_array[cost_idx] = -1.0  // permanently ineligible
                    candidate_list.append(node)      // doubly-linked

    // Step 5 -- propagate costs to candidate nodes, then sort
    for each node in candidate_list:
        node.score = cost_array[node.slot_id / 4]
    sort_candidates_descending(candidate_list)       // sub_9535C0

    // Step 6 -- greedy slot assignment with bitmap coalescing
    smem_offset  = alloc.smem_slot_count * 4         // alloc+1520
    assigned_ht  = hash_table_new()                  // sub_7BE7B0
    free_slot_ht = hash_table_new()

    for each node in candidate_list (descending score):
        if node.score <= 0.0:  break                 // no benefit
        if smem_offset + 4 > smem_pool_limit:  break // budget exhausted

        if assigned_ht.contains(align4(node.slot_id)):
            continue                                 // already handled

        // Probe sub-slot coverage: coverage |= 1 << (offset/4)  (bitmap v79 in the decompiled source)
        coverage = 0;  assigned_count = 0
        for sub = 0 to node.width step 4:
            if not assigned_ht.contains(node.slot_id + sub):
                coverage |= 1 << (sub >> 2)
                assigned_count++

        if assigned_count == 0:
            // Remove node from candidate list; recycle to free list
            continue

        // Coalesce adjacent sub-slots into wider STS/LDS (sub_945C40)
        if (coverage & 0x3) == 0x3:                  // bits 0,1 -> 8B at +0
            emit_coalesced_node(node.slot_id, width=8)
        elif (coverage & 0x1):                       // bit 0 only -> 4B at +0
            emit_coalesced_node(node.slot_id, width=4)
        if (coverage & 0xC) == 0xC:                  // bits 2,3 -> 8B at +8
            emit_coalesced_node(node.slot_id + 8, width=8)

        assigned_ht.insert(node); smem_offset += 4

    // Remove fully-covered or zero-benefit nodes; recycle to pool+8.

    // Step 7 -- physical SMEM offset assignment for free slots
    running_offset = 0
    for each surviving node in candidate_list:
        for sub = 0 to node.width step 4:
            global_id = node.slot_id + sub
            if not free_slot_ht.contains(global_id):
                free_slot_ht.insert(global_id, running_offset)
                running_offset += 4

    // Step 8 -- second instruction walk: emit STS/LDS replacements
    smem_base_vreg = create_vreg(ctx, class=12, flags=0x4000)  // sub_926500

    for each instr where opc in {183, 288} and last_source == smem_sym:
        reg_bytes  = query_reg_byte_width(ctx, instr.subop)
        type_width = (operand.type_bits & 7) + 1
        total_regs = type_width
        if instr.subop == 19:  total_regs = 2 * type_width     // 64-bit
        if instr.subop == 13:  total_regs = ceil(type_width+1, 2) // packed

        for i = 0 to total_regs:
            assigned = assigned_ht.lookup(base + 4*i)
            free_off = free_slot_ht.lookup(base + 4*i)

            if assigned >= 0:
                // SMEM address encoding: slot*32 + lane_offset
                //   Activated when vtable[108] returns > 0 or ctx+1421 bit 0 set
                addr = 32 * (assigned & ~3) + (assigned & 3)
                // sub_930440 emits STS (opcode 0x120) or LDS instruction
                emit_smem_spill(ctx, opc, instr.subop, src_operand,
                                smem_base_sym, addr, vector_width)
                // Adjust byte counters
                if opc == 288: func_props.SSpillB  -= bytes  // +224
                else:          func_props.SRefillB -= bytes  // +228
            else:
                // Emit from free pool using original smem_sym
                emit_smem_spill(ctx, opc, instr.subop, src_operand,
                                smem_sym, free_off, vector_width)

            // Link to original instruction's live range (sub_93CB10)
            if instr.live_range_id > 0:
                link = sub_93CB10(alloc, instr, 4 * i)
                new_instr.dword[12] = link.dword[2]

        delete_instruction(ctx, instr)
        ctx.rewrite_state = 7                        // ctx+240

    alloc.smem_slot_count = smem_offset / 4          // alloc+1520
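Step 8's address encoding is worth restating numerically; this simply evaluates the formula from the pseudocode over a couple of inputs:

```python
def smem_spill_addr(assigned):
    """Step 8's SMEM address encoding, as given in the pseudocode:
    addr = 32 * (assigned & ~3) + (assigned & 3)  (slot*32 + lane offset)."""
    return 32 * (assigned & ~3) + (assigned & 3)

assert smem_spill_addr(4) == 128        # slot-aligned offset: pure 32x scale
assert smem_spill_addr(7) == 32 * 4 + 3 # low two bits pass through as lane offset
```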

Fast path (vreg_count <= 4)

When vreg_count <= 4, ABI bit is clear, and extended flag 0x40 is clear, the function skips the cost model entirely. It walks the instruction list once, looking for STS (opcode 288) and LDS (opcode 183) that reference smem_sym. For each match it directly calls sub_92E720 to emit the replacement instruction with opcode 0x82 (class 12). If the instruction carries a predicate (bit 12 set), sub_7DA5A0 handles predicate lowering before emission. The SMEM base address is the constant 268435496 (0x10000028), encoding address-space 1 (shared) with offset 0x28. Register width is derived from vtable+904; for widths <= 3 bytes, a packed flag (bit 25 or 26) is set in the operand word.
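The base constant decodes as follows (the exact field split, top nibble for address space and low bits for byte offset, is an inference from the values stated above):

```python
# Fast-path SMEM base constant from the decompiled source.
SMEM_BASE = 268435496
assert SMEM_BASE == 0x10000028
assert SMEM_BASE >> 28 == 1              # address space 1 = shared
assert SMEM_BASE & 0x0FFFFFFF == 0x28    # byte offset 0x28 within the space
```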

Bitmap coverage tracking

The coverage bitmap (v79 in the decompiled source) tracks which 4-byte sub-slots within a candidate have been assigned. It uses a 1 << (offset >> 2) scheme:

| Bitmap bits | Meaning | Coalesced width |
|-------------|---------|-----------------|
| 0b0001 | Slot +0 assigned | 4 bytes |
| 0b0011 | Slots +0, +4 assigned | 8 bytes (merged STS.64) |
| 0b1100 | Slots +8, +12 assigned | 8 bytes at +8 |
| 0b1111 | All four sub-slots | 16 bytes (two STS.64) |

This coalescing feeds sub_945C40, which adjusts the candidate list node widths so that the final STS/LDS instructions use the widest possible memory transaction.
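The decision table reduces to a few bit tests. An executable model following the table and the Step 6 pseudocode (the function name and return shape are invented for illustration):

```python
def coalesce(assigned_offsets):
    """Given the assigned 4-byte sub-slots of a 16-byte candidate,
    return the (offset, width) transactions to emit, per the table above."""
    coverage = 0
    for off in assigned_offsets:          # offsets are multiples of 4
        coverage |= 1 << (off >> 2)
    emits = []
    if coverage & 0x3 == 0x3:             # bits 0,1 -> one 8B access at +0
        emits.append((0, 8))
    elif coverage & 0x1:                  # bit 0 alone -> 4B access at +0
        emits.append((0, 4))
    if coverage & 0xC == 0xC:             # bits 2,3 -> one 8B access at +8
        emits.append((8, 8))
    return emits

assert coalesce([0, 4]) == [(0, 8)]                  # merged STS.64
assert coalesce([0]) == [(0, 4)]
assert coalesce([0, 4, 8, 12]) == [(0, 8), (8, 8)]   # two STS.64
```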

SMEM symbols

The SMEM spill infrastructure uses two internal symbols for allocation tracking:

  • __nv_reservedSMEM_allocation_phase (address 0x1CFCE80)
  • __nv_reservedSMEM_allocation_mask (address 0x1CFCEA8)

The CLI flag --disable-smem-reservation can disable shared memory reservation entirely.

Spill Statistics

The allocator collects detailed spill metrics into a per-function statistics object. These are used for compilation reporting, performance analysis, and the --warn-on-spills diagnostic.

Statistics fields

The statistics object stores spill counters at fixed DWORD offsets:

| Offset (DWORDs) | Name | Description |
|-----------------|------|-------------|
| [12] | LSpillB | Local memory spill bytes |
| [13] | LRefillB | Local memory refill bytes |
| [14] | SRefillB | Shared memory refill bytes |
| [15] | SSpillB | Shared memory spill bytes |
| [16] | LowLmemSpillSize | Low-bound local memory spill size |
| [17] | FrameLmemSpillSize | Frame-level local memory spill size |
| [18] | LNonSpillB | Non-spill local memory bytes |
| [19] | LNonRefillB | Non-refill local memory bytes |
| [20] | NonSpillSize | Total non-spill memory size |

Format strings

The statistics printing subsystem (sub_A3A7E0) emits two lines for spill metrics:

# [est latency = %d] [LSpillB=%d] [LRefillB=%d], [SSpillB=%d],
  [SRefillB=%d], [LowLmemSpillSize=%d] [FrameLmemSpillSize=%d]
# [LNonSpillB=%d] [LNonRefillB=%d], [NonSpillSize=%d]

The function properties string (used in verbose output):

Function properties for %s
    %d bytes stack frame, %d bytes spill stores, %d bytes spill loads

Spill warning

When --warn-on-spills is active, the following warning is emitted for any function with spills:

Registers are spilled to local memory in function '%s',
    %d bytes spill stores, %d bytes spill loads

The flag is registered in sub_432A00 / sub_434320 and stored at compilation_ctx + 473.

Allocation Failure

When all spill retry attempts are exhausted and the allocator still cannot fit within the register budget, a fatal error is emitted:

Register allocation failed with register count of '%d'.
Compile the program with a higher register target

This error originates from sub_9714E0 (allocation finalization, 290 lines), which is the last function called in the retry pipeline. Two emission paths exist:

| Path | Function | Context |
|------|----------|---------|
| With source location | sub_895530 | Includes function name and PTX line number |
| Without source location | sub_7EEFA0 | Generic error when location unavailable |

The alternate allocator path sub_964130 (1794 lines) also references this error at six separate points, covering different failure/retry scenarios. A dedicated failure reporter sub_966490 (474 lines) handles error diagnostic formatting.

Spill-Refill Pattern Optimization

The Ori IR includes dedicated instruction type markers for spill/refill patterns, enabling post-allocation optimization of spill traffic:

| Type ID | Name | Description |
|---------|------|-------------|
| 8 | Spill-refill | Spill/refill pair marker |
| 10 | Bit-spill | Single-bit spill (predicate register spill to GPR) |

The SpillRefill pass attempts to match and optimize these patterns. Error strings reveal three failure modes:

  1. "Failed to find matching spill for refilling load that is involved in this operand computation" -- the refill load has no corresponding spill store
  2. "Failed to establish match for bit-spill-refill pattern involved in this operand computation" -- the bit-spill pattern does not match expected form
  3. "Some instruction(s) are destroying the base of bit-spill-refill pattern involved in this operand computation" -- instructions between spill and refill clobber the base address register

Debug strings include " spill-regill bug " and " bit-spill bug " (both with typos present in the binary).

Pattern matching algorithm

The memcheck verifier (sub_A73F30, 806 lines) drives the matching. For each post-allocation instruction whose reaching definitions changed, it classifies the instruction and dispatches to one of three pattern-specific matchers:

// sub_A73F30 -- simplified dispatch for spill/refill verification
match_spill_refill_pairs(verifier, post_insn, operand_idx):

    // --- Path A: standard spill-refill (type_id 8) ---
    if is_spill_store(ctx, post_insn):                  // sub_A56CE0: opcode == 183 (STL)
        chain = find_matching_spill(verifier, post_insn) // sub_A677C0
        if chain.insn_id == -1:
            emit_error(1)  // "Failed to find matching spill for refilling load"
            return
        // Walk chain entries; verify each candidate store's slot matches the refill
        for entry in chain.entries:
            if entry.bitmask & class_mask == 0:
                continue
            if entry.store_insn == NULL:
                emit_error(2)  // "refill reads potentially uninitialized memory"
                return
            verify_slot_match(verifier, entry)  // compare slot IDs
        return  // matched

    // --- Path B: P2R/R2P predicate spill (type_id 8, predicate variant) ---
    if is_p2r_r2p(ctx, post_insn):                      // sub_A57590
        if not clobber_check_ok(verifier, post_insn):    // sub_A56F80
            emit_error(8)  // "destroying the base of P2R-R2P pattern"
            return
        return  // matched

    // --- Path C: bit-spill (type_id 10) ---
    if insn.num_operands > 2 and (insn.flags & 0x10):   // predicate-packed bit-spill
        if is_bit_spill_pattern(ctx, post_insn):         // sub_A53DB0
            if not clobber_check_ok(verifier, post_insn): // sub_A56F80
                emit_error(9)  // "destroying the base of bit-spill-refill pattern"
                return
            return  // matched

    // ... fall through to other verifier checks (remat, extra defs, etc.)

find_matching_spill (sub_A677C0) uses an FNV-1a hash map keyed on instruction ID to locate the spill chain node corresponding to a refill load. It verifies that the store and load target the same basic block (via the remapped-instruction map), then returns a chain structure containing the matched store instruction and a bitmask of register classes involved.
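FNV-1a itself is standard; a 64-bit reference implementation for context (how ptxas derives the key bytes from the instruction ID is not visible in the decompiled code):

```python
def fnv1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: xor each byte in, then multiply by the prime."""
    h = 0xcbf29ce484222325            # FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
    return h

assert fnv1a_64(b"") == 0xcbf29ce484222325   # empty input yields the basis
assert fnv1a_64(b"a") == 0xaf63dc4c8601ec8c  # published FNV-1a test vector
```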

clobber_check_ok (sub_A56F80) iterates the def-use chain of the matched spill-store node. For each def whose operand index matches the target, it looks up the defining instruction in the remapped-instruction map. If any such instruction was remapped (indicating an intervening write to the same physical register), the check returns false -- the base register was clobbered between spill and refill.

is_spill_store / is_refill_load (sub_A56CE0 / sub_A56DE0) check the instruction opcode (183 for STL, 288 for LDL), then verify the source/dest vreg has spill flags set (bit 14 or bit 17 in vreg_desc+36), and confirm the register class == 2.

Function Map

| Address | Lines | Role | Confidence |
|---------|-------|------|------------|
| sub_939BD0 | 65 | Spill allocator setup (knob 623 dispatch) | HIGH |
| sub_93C0B0 | 582 | Spill range optimizer | HIGH |
| sub_93D070 | 155 | Best allocation result recorder | HIGH |
| sub_93E290 | 397 | Spill candidate node creator (192-byte arena nodes) | HIGH |
| sub_93F130 | 544 | Spill code inserter | HIGH |
| sub_93FBE0 | 940 | Per-iteration state reset / slot assignment | HIGH |
| sub_940EF0 | 764 | Alternate spill slot assignment path | MEDIUM |
| sub_944740 | 138 | Interference count at program point | HIGH |
| sub_9449B0 | 418 | Cross-block liveness range calculator | HIGH |
| sub_94B200 | 642 | Spill weight accumulator | HIGH |
| sub_94E620 | 617 | Spill cost accumulator / liveness annotator | HIGH |
| sub_94F150 | 561 | Spill code generation (main entry) | HIGH |
| sub_94FDD0 | 155 | Register assignment + spill trigger | HIGH |
| sub_9539C0 | 1,873 | SMEM (shared memory) spill allocator | HIGH |
| sub_9714E0 | 290 | Allocation finalization / error emission | MEDIUM |
| sub_96D940 | 2,983 | Spill guidance engine (7 class-parallel queues) | HIGH |
| sub_971A90 | 355 | NOSPILL / SPILL retry driver | HIGH |
| sub_9850F0 | 520 | SASS-level spill instruction generator | HIGH |
| sub_9997D0 | -- | Spill cost initialization | MEDIUM |
| sub_9998A0 | -- | Spill cost computation | MEDIUM |
| sub_999950 | -- | Spill cost comparison | MEDIUM |
| sub_999AA0 | -- | Spill benefit estimation | MEDIUM |
| sub_9A8270 | -- | Live range spill cost computation (14 KB) | MEDIUM |