CSSA -- Conventional SSA for GPU Divergence
Standard SSA form assumes that a PHI node selects its incoming value based solely on the control flow edge along which execution arrived. On a scalar CPU, exactly one predecessor edge is taken per dynamic execution of the PHI, so this assumption holds trivially. On an NVIDIA GPU, it does not. A warp of 32 threads executes in lockstep, and when control flow diverges -- different threads take different branches -- all paths are eventually serialized and the warp reconverges. At the reconvergence point, a standard PHI node cannot correctly select a single incoming value because the warp carries live values from multiple predecessors simultaneously. The wrong thread could see the wrong value.
CSSA (Conventional SSA) is NVIDIA's transformation that rewrites the IR so that every PHI node is safe under warp-divergent execution. It does this by inserting explicit copy instructions at points where threads reconverge, ensuring that each thread's value is materialized into its own copy before the PHI merges anything. The name "Conventional SSA" comes from the SSA literature: a program is in CSSA form when every PHI node's operands can be simultaneously live without interfering -- the PHI web has no overlapping live ranges. This property is exactly what GPU divergence demands.
| Pass location | sub_3720740 (3,521 bytes native, 165 basic blocks, 17 callees) |
| Address range | 0x3720740--0x3721501 |
| Single caller | sub_3721510 (pass driver thunk) |
| Gate knob | do-cssa (NVVMPassOptions boolean toggle) |
| Coalesce knob | cssa-coalesce (controls copy coalescing aggressiveness) |
| Debug knobs | cssa-verbosity, dump-before-cssa |
| Container knob | CSSACoalescing (NVVM container format, parsed at sub_CD9990) |
| Debug string | "IR Module before CSSA:\n" (sole string ref in pass body) |
| Helper cluster | sub_371F790 (4014B, 197bb) PCP propagator, sub_371F160 (1079B) copy materializer, sub_371EDF0 (875B) inserter, sub_371CDC0 (227B) instruction factory |
| NVVM opcode | 0x22D7 (8919) -- divergence-safe copy mnemonic |
| Pass-option slot | One of the 221 NVVMPassOptions slots (boolean do/don't pair) |
| Pipeline position | Late IR, after optimization, before SelectionDAG lowering |
| Upstream equivalent | None. LLVM has no concept of warp-divergent PHI semantics. |
GPU Divergence Background
The Warp Execution Model
NVIDIA GPUs execute threads in warps of 32 under the SIMT model: all threads share a program counter, and divergent branches serialize both paths before the warp reconverges. The full warp execution model and its implications for cicc are documented in the GPU Execution Model.
Why Standard SSA Breaks
Consider a diamond CFG:
entry
/ \
then else
\ /
join <-- PHI(%x = [then: %a], [else: %b])
On a CPU, the PHI at join works correctly: execution came from exactly one predecessor, so the PHI selects the corresponding value. On a GPU warp where threads 0-15 took then and threads 16-31 took else, both paths executed sequentially. When the warp reconverges at join, the PHI must produce %a for threads 0-15 and %b for threads 16-31 simultaneously in the same register. A naive lowering of the PHI to a simple register copy is incorrect -- whichever path executed last would overwrite the value from the first path.
The CSSA Solution
CSSA transforms the IR so that the PHI web has non-interfering live ranges. Concretely, it inserts copy instructions at the end of each predecessor block so that each thread's value is written into a dedicated copy before the warp reconverges:
entry
/ \
then else
%a_copy %b_copy <-- inserted copies (one per predecessor)
\ /
join
%x = PHI [then: %a_copy], [else: %b_copy]
Now the PHI's operands occupy distinct virtual registers. During later register allocation, the allocator can assign them the same physical register only when their live ranges truly do not overlap -- which is the correct condition for divergent execution. The copies give the allocator the freedom to keep the values separate when divergence requires it.
Algorithm
sub_3720740 is organized as five distinct phases driven off a pass-state struct (the Function-level a1 argument plus the PassState at a1[1]). The phase-1 latch byte at [PassState+0x70] lets the driver re-enter the pass without redoing block numbering: if the latch is set, only phases 2--5 run again on the updated CFG.
The five phases below correspond to the four major loop nests in the decompiler output (0x3720740--0x37208BB numbering, 0x37208C0--0x3720A28 PHI map, 0x3720A2F--0x3720A88 propagation + cleanup, with the copy-insertion block at 0x3720B40--0x3720E18 invoked from inside the PHI map walk).
Phase 1: Basic Block Ordering and Numbering -- HIGH
The function begins by walking every basic block in the LLVM Function (accessed via [r15] -> Module pointer) and stamping each block with a preorder index at offset +0x48 and a reverse-postorder index at offset +0x4C. These indices are used downstream for dominance/reconvergence queries and to keep the inserted copies in a deterministic position relative to terminators. The block worklist lives on the stack in v130[] and grows via sub_C8D5F0 (the standard SmallVector growth helper, see Hash and Container Infrastructure).
/* Phase 1: assign BB preorder + RPO numbers; idempotent via latch byte. */
void cssa_phase1_number_blocks(Function *F, PassState *PS) {
if (PS->blocks_numbered) { /* byte [PS+0x70] */
PS->next_phi_id = 0; /* dword [PS+0x74] reset */
return; /* skip straight to phase 2 */
}
BasicBlock *entry = F->entry_block; /* [F+0x60] */
if (!entry) return;
/* Iterative DFS with a SmallVector worklist (initial inline cap = 32). */
Frame stack[32]; /* v130 inline buffer */
Frame *sp = stack;
size_t cap = 32, len = 0;
sp[len++] = (Frame){ entry, entry->succ_begin };
entry->preorder = 0; /* [BB+0x48] */
uint32_t preorder_counter = 1;
uint32_t postorder_counter = 0;
while (len) {
Frame *top = &sp[len - 1];
BasicBlock *bb = top->bb;
SuccIter *it = top->it;
if (it == bb->succ_end) {
bb->rpo_index = postorder_counter++; /* [BB+0x4C] */
--len;
continue;
}
BasicBlock *succ = *it++;
top->it = it;
if (succ->preorder == UINT32_MAX) {
succ->preorder = preorder_counter++;
if (len + 1 > cap) {
cap = smallvec_grow(&sp, cap, len + 1, sizeof(Frame));
}
sp[len++] = (Frame){ succ, succ->succ_begin };
}
}
PS->next_phi_id = 0;
PS->blocks_numbered = 1; /* latch */
}
The latch (byte [PS+0x70]) makes phase 1 strictly run-once per Function; later re-entries (for example after the pass driver discards and rerunning a verifier) skip directly to phase 2 with a reset PHI counter. HIGH confidence (latch read at 0x37207C0, write at 0x3720881).
Phase 2: PHI Discovery and Hash Map Population -- HIGH
After numbering, the pass walks every block (outer loop at 0x37208C0), and within each block walks the instruction use-list at offset +0x18 (inner loop at 0x3720930). The block-list bounds come from the sentinel pointers [F+0x48] (begin) and [F+0x50] (end); each instruction header carries its LLVM opcode in the byte at [I-0x18]. PHI nodes are opcode 0x54 (decimal 84).
/* Phase 2: every PHI gets a monotonic id and an entry in the open-addressed
* hash table at PS->phi_map (PassState+0x60). Non-PHIs are skipped. */
#define LLVM_OPCODE_PHI 0x54
static inline uint32_t cssa_hash(void *p) {
/* Same canonical NVIDIA DenseMap hash used across cicc. */
uintptr_t v = (uintptr_t)p;
return (uint32_t)((v >> 4) ^ (v >> 9));
}
void cssa_phase2_discover_phis(Function *F, PassState *PS) {
for (BasicBlock *bb = F->blocks.first; bb != F->blocks.sentinel;
bb = bb->next) {
for (Instruction *I = bb->insts.first;
I != bb->insts.sentinel;
I = I->next) {
if (I->opcode != LLVM_OPCODE_PHI) { /* byte [I-0x18] */
cssa_record_use(PS, I); /* phase-2 inner table */
continue;
}
uint32_t id = PS->next_phi_id++; /* [PS+0x78] */
cssa_phi_map_insert(PS->phi_map, I, id);
/* Inline copy insertion runs immediately after recording the
* PHI; see phase 3. */
cssa_phase3_insert_copies(F, PS, bb, I);
}
}
}
/* Open-addressed quadratic-probe DenseMap with two tombstones.
* LLVM_EMPTY = -4096 (0xfffffffffffff000)
* LLVM_TOMB = -8192 (0xffffffffffffe000)
* Growth threshold: load > 3/4 of capacity. */
static void cssa_phi_map_insert(PhiMap *m, void *key, uint32_t id) {
if (m->capacity == 0) { phi_map_grow(m, 2 * m->capacity); }
uint32_t mask = m->capacity - 1;
uint32_t h = cssa_hash(key) & mask;
void **slot = &m->buckets[h].key;
void **tomb = NULL;
uint32_t probe = 1;
while (*slot != key) {
if (*slot == LLVM_EMPTY) break;
if (*slot == LLVM_TOMB && !tomb) tomb = slot;
h = (h + probe++) & mask; /* triangular probe */
slot = &m->buckets[h].key;
}
if (*slot != key) {
slot = tomb ? tomb : slot;
++m->live;
if (*slot != LLVM_EMPTY) --m->tombstones;
*slot = key;
slot[1] = (void *)(uintptr_t)id;
}
if (4 * (m->live + 1) >= 3 * m->capacity) {
phi_map_grow(m, 2 * m->capacity); /* sub_A41E30 */
}
}
The map serves two purposes: it dedupes PHIs encountered through multiple control-flow paths, and it gives sub_371F790 (phase 4) an O(1) test for "is this value a PHI node I just rewrote?". HIGH confidence (the inline body of sub_A41E30 and the -4096 / -8192 sentinels are visible verbatim in the decomp at 0x3720E15).
Phase 3: Copy Insertion at Predecessor Terminators -- HIGH
For every PHI discovered in phase 2, the pass materializes one copy per incoming edge. The body lives at 0x3720B40--0x3720E18 and is invoked synchronously from inside the phase-2 walk (the decompiler hoists it out, but the call site is LABEL_53's call to sub_371F160). For each PHI operand, three helper calls run in sequence:
sub_371F160(out, F, phi, incoming_use, 1)-- finds the predecessor block and computes the position immediately before its terminator. Theflag=1argument requests copy materialization. The four-byte return buffer (v126[]) carries the destination value pointer plus the predecessor-block handle.sub_371CDC0(0x22D7, builder, 1, args, 2, name_ptr, type)-- constructs the new IR instruction. Opcode0x22D7(8919 decimal) is the NVVM-internal "divergence-safe copy" mnemonic; the name pointer is the constant string"pcp"(PHI Copy Propagation prefix, set atv128 = "pcp"immediately before the call at0x3720D34).sub_371EDF0(F, new_inst, 1, ...)followed bysub_BD84D0(insert_point, new_inst, ...)-- splices the freshly built instruction into the predecessor block before its terminator and rewires the PHI's use-def chain.
/* Phase 3: for every (PHI, incoming) pair, drop one "pcp" copy at the end
* of the matching predecessor block, then redirect the PHI use to the copy. */
#define NVVM_OPC_DIV_SAFE_COPY 0x22D7
void cssa_phase3_insert_copies(Function *F, PassState *PS,
BasicBlock *home, Instruction *phi) {
/* Walk PHI operand pairs: each (value, predecessor) is a 32-byte use
* record at [phi+0x18] forward through the use-def chain. */
for (Use *u = phi->operands; u != phi->operands_end; u += 1) {
Value *incoming = u->value;
BasicBlock *pred = u->parent; /* [use+8] */
/* Step 1: materialize / locate the copy slot before pred's terminator. */
CopySlot slot;
sub_371F160(&slot, F, phi, /*incoming=*/u, /*flag=*/1);
/* Step 2: build the copy IR instruction (opcode 0x22D7, name "pcp"). */
Type *ty = (Type *)slot.type;
const char *name = "pcp";
Value *name_obj = sub_ACA8A0(/*scope=*/incoming); /* setName helper */
Instruction *cpy = sub_371CDC0(
NVVM_OPC_DIV_SAFE_COPY,
/*builder=*/&slot,
/*nresults=*/1,
/*args=*/(Value *[]){ slot.dst_proxy, name_obj },
/*nargs=*/2,
/*name_ptr=*/&name,
/*type=*/ty);
/* Step 3: splice into pred block before its terminator. */
sub_371EDF0(F, cpy, /*flag=*/1, ...);
sub_BD84D0(/*before=*/pred->terminator, /*what=*/cpy, ...);
/* Step 4: rewire PHI's use chain so the operand points at the copy
* instead of the original value. This is the four-pointer dance
* around offsets [+0, +8, +0x10] (prev/next/parent in the use list). */
Use *uses_head = (Use *)cpy->uses_head;
if (uses_head->prev) {
uses_head->prev->next = uses_head->next;
if (uses_head->next)
uses_head->next->prev = uses_head->prev;
}
uses_head->parent = u;
uses_head->next = u->next;
if (u->next) u->next->prev = uses_head;
u->next = uses_head;
u->value = cpy; /* PHI now reads from the copy */
}
}
⚡ QUIRK -- Copies materialize at the terminator, not in the merge block Every
"pcp"copy is inserted before the predecessor's terminator (sub_BD84D0is called withv47 - 24as the anchor, which is the terminator instruction header minus its opcode prefix). Putting the copy in the merge block instead would defeat CSSA: the warp has already reconverged at that point, and whichever divergent path wrote last would clobber the values that the earlier paths produced. The terminator anchor is non-negotiable. HIGH confidence.
⚡ QUIRK -- Opcode
0x22D7is invisible to upstream LLVM The constant8919is hard-coded as the opcode for the inserted copy. It is an NVVM-internal opcode (not in LLVM'sInstruction.def) that survives until the SelectionDAG lowering, where it is matched as a plain "move" but carries a flag bit telling the register allocator that this copy is allowed to coalesce with its source only when divergence analysis proves the live ranges are non-interfering. A naive lowerer that treated0x22D7as a generic copy and let the coalescer eliminate it unconditionally would reintroduce the original CPU-vs-GPU bug. MED confidence on the "flag bit" semantics (inferred from the surrounding coalescer knobs), HIGH on the opcode value itself.
Phase 4: PHI-Map-Driven Copy Propagation -- HIGH
Once every PHI has its predecessor-side copies, sub_3720740 walks every basic block a second time (0x3720A2F--0x3720A62) and hands each block to sub_371F790, the 4 KB "PHI Copy Propagator" (PCP) helper:
/* Phase 4: walk every block again, run PCP on each instruction stream. */
void cssa_phase4_propagate(Function *F, PassState *PS) {
for (BasicBlock *bb = F->blocks.first; bb != F->blocks.sentinel;
bb = bb->next) {
Instruction *anchor = bb->insts.first;
sub_371F790(PS, anchor, /*scratch=*/0, /*scratch=*/0, 0, 0);
}
}
sub_371F790 itself does three things per block, in order:
-
Reconvergence-point hash-map build. It scans the block looking for instruction opcodes in the range
[30, 40](the(*v9 - 30) <= 0xAtest at0x37203B6), which is the NVVM control-flow family (branch, conditional branch, switch, indirect-branch, invoke, ret, unreachable, etc.). For each such terminator-like instruction it inserts an entry into a local DenseMap (v150/v151/ capacity inv153) keyed on the instruction's operand-zero pointer. This local map is the "reconvergence index" -- it tells the propagator, for any block, what set of predecessor terminators target it. -
Per-PHI-operand copy rewrite. For every populated slot in the local map (the
*v45 != -8192 && *v45 != -4096filter at0x37200CB), the helper callssub_371CDC0(0x22D7, ...)again with the same"pcp"name and the local map's value as the type/template. This second construction creates a copy at the consumer side -- inside the block that will read the PHI's result -- and threads it through the same use-list manipulation that phase 3 did for the producer side.
/* Sketch of sub_371F790's per-block work. Type names are reconstructed. */
typedef struct { void *key; uint32_t cap; uint8_t pad[20]; } LocalMap;
void pcp_run_on_block(PassState *PS, Instruction *anchor) {
LocalMap recon = {0}; /* 32-byte buckets */
/* (1) Walk operands of the block's terminator; record reconvergence
* points keyed on the source-value pointer. */
for (Use *u = anchor->parent->terminator->uses; u; u = u->next) {
Instruction *src = u->value;
if ((uint8_t)(src->opcode - 30) > 0xA) continue; /* not a CF op */
local_map_insert(&recon, (void *)src->operand0);
}
/* (2) For every live slot in the local map, materialize a consumer-side
* "pcp" copy and link it onto the PHI's use chain. */
for (uint32_t i = 0; i < recon.cap; ++i) {
void *k = recon.buckets[i].key;
if (k == LLVM_EMPTY || k == LLVM_TOMB) continue;
Instruction *cpy = sub_371CDC0(
NVVM_OPC_DIV_SAFE_COPY,
/*builder=*/recon.buckets[i].builder,
1, /*args=*/recon.buckets[i].args, 2,
/*name=*/(const char *[]){ "pcp" },
/*type=*/recon.buckets[i].type);
pcp_rewire_use_chain(cpy, recon.buckets[i].use);
pcp_propagate_live_range(PS, cpy); /* offsets +0x40..+0x4C on PS */
}
/* (3) Sort the recorded reconvergence-point list; aborts use the BSR
* trick `_BitScanReverse64` to compute a heap-style introsort depth
* limit (visible at 0x371F951). */
pcp_sort_records(/*begin=*/recon.records, /*end=*/recon.records_end);
local_map_free(&recon);
}
- Live-range propagation. Each consumer-side copy is added to the pass-level use-graph at
[PS+0x40..+0x4C]-- the same DenseMap structure used in phase 2, just keyed on different pointers. This is how PCP guarantees that the cleanup phase (phase 5) sees every freshly inserted copy: the live-range graph is the cleanup worklist.
⚡ QUIRK -- PHI propagation uses a local hash map per block, not a global one Most LLVM-style propagators reuse a single global map across the whole function. CSSA's PCP allocates a fresh DenseMap on the stack for each block (the
v150 / v151 / v152 / v153quartet insub_371F790) and frees it withsub_C7D6A0at the end of the block. This means the same PHI pointer can hash to different slots in different blocks -- a deliberate design choice that lets the propagator distinguish "this PHI was seen via predecessor edge A" from "this PHI was seen via predecessor edge B" without having to track edges explicitly. The cost is one DenseMap allocation per block, which the heap-style introsort depth limit_BitScanReverse64(2 * (63 - log2(span)))keeps cache-friendly. HIGH confidence.
⚡ QUIRK -- PHI-with-self-edge requires the propagator's deduplication A loop header PHI whose incoming edge is its own block (
%x = phi [latch: %x.update], [preheader: %x.init]) produces an instruction that is both a consumer (it reads%x.update) and a producer (every iteration overwrites%x). Without the local-map dedup in step (2), the propagator would insert an unbounded chain of copies as it follows the use-def cycle. The hash-map check at0x37200CB(*v45 != -8192 && *v45 != -4096) is what breaks the cycle: once a value has been mapped, it is never remapped, so the propagator visits each self-edge exactly once. HIGH confidence.
Phase 5: Dead Copy Cleanup -- HIGH
The final phase walks a linked list rooted at [F+0x28] (the pass-state's cleanup worklist, populated incrementally by phases 3 and 4). For each entry, if the instruction has zero remaining uses, it is erased via sub_B43D60 (which is Instruction::eraseFromParent).
/* Phase 5: walk cleanup worklist, erase any zero-use copies. */
void cssa_phase5_cleanup(Function *F) {
for (CleanupNode *n = (CleanupNode *)F->cleanup_head; n; n = n->next) {
Instruction *cand = n->inst;
while (cand) {
if (cand->use_count == 0) { /* [I+0x10] */
Instruction *next_cand = cand->cleanup_next;
sub_B43D60(cand); /* eraseFromParent */
cand = next_cand;
} else {
break;
}
}
}
}
The two-pointer loop at 0x3720A6F (while (1) { v38 = j[1]; if (!v38[2]) break; ... sub_B43D60(v38); }) is exactly this pattern: it walks until it finds a copy that still has uses, then breaks. Cleanup is therefore a forward-only pass -- it never revisits a copy once one is found live. HIGH confidence.
⚡ QUIRK -- Cleanup must run only after propagation, never interleaved Erasing a copy mid-propagation would invalidate the use-def chains that PCP is still walking. The pass enforces ordering by populating the cleanup worklist (
[F+0x28]) during phases 3 and 4 but never reading it until phase 5. A reimplementation that tries to be clever by erasing dead copies as soon as the propagator decides they are dead will hit use-after-free in the live-range propagation step, because the use-list nodes are reused across phases via the same arena. HIGH confidence (the worklist node layout at[node+0]/[node+8]matches the use-list node layout exactly).
Copy Coalescing
The cssa-coalesce knob controls how aggressively the pass coalesces the inserted copies back together. Without coalescing, CSSA inserts one copy per PHI operand per predecessor -- potentially a large number of copies in control flow with many branches. Coalescing identifies cases where two or more copies carry the same value and can share a single register, reducing the copy overhead.
The CSSACoalescing knob in the NVVM container format (parsed by sub_CD9990 from the finalizer knobs structure) provides a separate control path for the same behavior. The container knob is categorized alongside register allocation and scheduling controls (AdvancedRemat, DisablePredication, DisableXBlockSched, ReorderCSE), confirming that CSSA coalescing is considered part of the register allocation subsystem.
⚡ QUIRK -- "Coalesce copies on split edges" is a subtarget default, not a global one The strings
"Coalesce copies on split edges (default=subtarget)"and"Coalesce copies that span blocks (default=subtarget)"appear in the binary alongside the CSSA strings. The(default=subtarget)suffix means the default value depends on the GPU architecture in use -- newer SMs may aggressively coalesce edge-split copies while older SMs leave them alone. A reimplementor that hard-codes a single default will produce different code than cicc on at least one SM. MED confidence (string evidence is direct; subtarget-table cross-reference is inferred).
deSSA Alternative
The usedessa knob (default value 2, registered at ctor_358_0 at 0x50E8D0, stored in dword_4FD26A0) selects an alternative path for PHI elimination during the transition from SSA to machine code. Despite its name suggesting "de-Static Single Assignment", analysis of the dispatch functions shows it controls the scheduling and PHI elimination pipeline:
| Mode | Pre-RA Scheduling | Post-RA Scheduling | Behavior |
|---|---|---|---|
| 1 | Skipped | Minimal (single pass) | Simple mode -- no pre-RA scheduling |
| 2 (default) | Full (&unk_4FC8A0C) | Three passes + StackSlotColoring | Full mode -- complete scheduling pipeline |
The deSSA mode and CSSA transformation are complementary. CSSA operates at the LLVM IR level, converting PHI nodes into a form safe for GPU divergence before instruction selection. The usedessa mode controls how PHI nodes are ultimately eliminated during the MachineIR lowering, after SelectionDAG has already consumed the CSSA-transformed IR. When usedessa=2 (default), the full scheduling pipeline runs, giving the register allocator maximum flexibility to handle the extra copies that CSSA introduced. When usedessa=1, the minimal scheduling mode may be appropriate for debugging or for kernels where scheduling causes regressions.
Configuration Knobs
NVVMPassOptions Knob
| Knob | Type | Description |
|---|---|---|
do-cssa | bool | Master enable/disable for the CSSA pass |
Set via -opt "-do-cssa=0" to disable the pass entirely.
cl::opt Knobs (ctor_705 at 0x5BD430)
| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
cssa-coalesce | int | (unknown) | (ctor_705 data) | Controls PHI operand coalescing aggressiveness. Higher values = more aggressive coalescing = fewer copies but higher risk of incorrect merging under divergence. |
cssa-verbosity | int | 0 | (ctor_705 data) | Verbosity level for diagnostic output during the CSSA transformation. |
dump-before-cssa | bool | false | qword_5050A28 | When non-zero, dumps the entire IR module before CSSA runs. Triggers the "IR Module before CSSA:\n" output followed by sub_A69980 (Module::print). |
Container-Format Knob
| Knob | Parsed At | Category | Description |
|---|---|---|---|
CSSACoalescing | sub_CD9990 | Register allocation / scheduling | Controls CSSA coalescing from the NVVM container format. Parsed alongside AdvancedRemat, DisablePredication, DisableXBlockSched. |
Related Knob
| Knob | Type | Default | Global | Description |
|---|---|---|---|---|
usedessa | int | 2 | dword_4FD26A0 | Selects deSSA method / scheduling pipeline mode. Mode 1 = simple (no pre-RA scheduling), mode 2 = full. |
Diagnostic Strings
"IR Module before CSSA:\n" -- Module dump header (dump-before-cssa)
"pcp" -- PHI copy propagation instruction name prefix
The "pcp" prefix is assigned to all copy instructions created by the CSSA pass. These copies can be identified in IR dumps by their %pcp naming. After register allocation, these copies may be eliminated (coalesced into the same physical register) or materialized as actual move instructions in the final PTX.
Function Map
| Function | Address | Size | Role | Confidence |
|---|---|---|---|---|
| CSSA main | sub_3720740 | 3521B / 165bb | BB ordering, PHI scanning, copy insertion, cleanup | HIGH |
| PCP propagator | sub_371F790 | 4014B / 197bb | Per-block reconvergence map, consumer-side copy emission, live-range propagation | HIGH |
| Copy slot resolver | sub_371F160 | 1079B / 60bb | Locates predecessor-block terminator anchor, returns CopySlot triple | HIGH |
| IR builder helper | sub_371EDF0 | 875B / 47bb | Splices new instruction into block's instruction list (uses _OWORD for parent-block link) | HIGH |
| Instruction factory | sub_371CDC0 | 227B / 5bb | Allocates and initializes the divergence-safe copy node (calls sub_B6E160 for type, sub_BD2C40(88, ...) for the 88-byte instruction body) | HIGH |
| PCP map grow | sub_371F5A0 | 494B / 28bb | DenseMap rehash for sub_371F790's local maps (32-byte slots, distinct from sub_A41E30's 16-byte slots) | HIGH |
| PCP map grow (small) | sub_371EC10 | 478B / 28bb | Auxiliary DenseMap rehash for PCP use-list (16-byte slots) | HIGH |
| Hash table grow | sub_A41E30 | 478B / 28bb | DenseMap resize for PHI hash table | HIGH |
| Module printer | sub_A69980 | 416B / 17bb | Module::print (for dump-before-cssa) | HIGH |
| raw_ostream::write | sub_CB6200 | 276B / 15bb | String output for debug dump | HIGH |
| Debug stream getter | sub_C5F790 | 11B / 1bb | Returns current debug output stream | HIGH |
| Instruction eraser | sub_B43D60 | 102B / 1bb | Erases dead instruction from parent block | HIGH |
| Instruction insert | sub_BD84D0 | 10B / 2bb | BasicBlock::insert (instruction splice) | HIGH |
| Name setter | sub_ACA8A0 | 728B / 37bb | Value::setName for "pcp" prefix | HIGH |
| Use chain rewrite | sub_B96E90 | 94B / 5bb | replaceAllUsesWith on operand | HIGH |
| Use helper | sub_B91220 | 66B / 6bb | Use-list manipulation | HIGH |
| Use list insert | sub_B976B0 | 72B / 3bb | Inserts use into specific position in use-list | HIGH |
| DenseMap grow helper | sub_C8D5F0 | 470B / 29bb | SmallVector/DenseMap capacity growth | HIGH |
| Knob registration | ctor_705 (0x5BD430) | 5.4KB | Registers cssa-coalesce, cssa-verbosity, dump-before-cssa | HIGH |
| Container knob parser | sub_CD9990 | 31KB | Parses CSSACoalescing from NVVM container | HIGH |
| deSSA dispatch (post-RA) | sub_21668D0 | -- | Scheduling pipeline mode selector | MED |
| deSSA dispatch (pre-RA) | sub_2165850 | -- | Pre-RA scheduling mode selector | MED |
Differences from Upstream LLVM
LLVM's standard PHI elimination pass (llvm::PHIEliminationPass, registered as "phi-node-elimination" at pipeline slot 493 in CICC's pass parser) lowers PHI nodes to machine copies during the SelectionDAG-to-MachineIR transition. It operates under the assumption that PHI semantics follow scalar control flow -- exactly one predecessor contributes a value at each dynamic execution.
NVIDIA's CSSA pass runs before instruction selection, at the LLVM IR level, and transforms the IR into a form where PHI elimination can proceed safely even when the underlying execution model is SIMT. The two passes are not alternatives -- CSSA runs first to prepare the IR, then standard PHI elimination runs later to lower the CSSA-safe PHI nodes to machine copies.
This is one of the fundamental semantic gaps between LLVM's CPU-centric IR model and GPU reality. LLVM assumes sequential scalar semantics; NVIDIA's CSSA pass bridges that gap by making the implicit thread-level parallelism explicit in the copy structure of the IR.
Common Pitfalls
These are mistakes a reimplementor is likely to make when building an equivalent CSSA transformation for GPU targets.
1. Inserting copies only at the merge block instead of at the end of each predecessor. The entire point of CSSA is that copies must be placed before the warp reconverges, not at the reconvergence point. If you insert the copy instruction at the beginning of the merge block (after the PHI), the warp has already reconverged and whichever path executed last has overwritten the register value for all threads. Copies must be at the terminator position of each predecessor block, before control leaves that block. This is the fundamental GPU-vs-CPU distinction: on a CPU, only one predecessor executes so placement does not matter; on a GPU, all predecessors may execute sequentially within the same warp.
2. Coalescing copies that have divergent live ranges. The cssa-coalesce knob controls how aggressively copies are merged back together. Over-aggressive coalescing can assign two copies to the same physical register when their live ranges overlap under divergence -- threads from different predecessor paths would see each other's values. The coalescer must verify that live ranges are truly non-interfering under the SIMT execution model, not just under the sequential CFG model. A reimplementation that reuses a standard LLVM register coalescer without divergence-aware interference checking will produce silent miscompilation on any kernel with divergent control flow.
3. Failing to insert copies for uniform PHI nodes that become divergent after later transformations. CSSA runs before instruction selection, but divergence analysis at that point may be imprecise. A PHI node classified as uniform (all threads agree on the incoming edge) may become effectively divergent after subsequent loop transformations or predication changes the control flow. The safe approach is to insert copies for all PHI nodes and let the coalescing phase remove unnecessary ones. A reimplementation that skips "uniform" PHI nodes based on divergence analysis risks correctness if that analysis is later invalidated.
4. Using a standard LLVM PHIElimination pass without the CSSA preprocessing step. LLVM's built-in PHI elimination assumes scalar control flow semantics (exactly one predecessor contributes at runtime). Running it directly on GPU IR without first converting to CSSA form will produce incorrect register assignments whenever a warp diverges at a branch leading to a PHI merge point. CSSA is not a replacement for PHI elimination -- it is a prerequisite that transforms PHI semantics into a form safe for the standard lowering.
5. Not propagating the "pcp" copy through the instruction graph after insertion. Phase 4 of the algorithm (copy propagation via sub_371F790) replaces uses of original values with uses of the inserted copies. A reimplementation that inserts copies but skips this propagation step will leave the PHI node still referencing the original value, making the copies dead. The subsequent dead-copy cleanup (Phase 5) will then erase them, and the transformation has no effect -- the original divergence-unsafe PHI remains.
6. Splitting critical edges before CSSA but not re-running block numbering. If a downstream pass between phase 1 and phase 2 (or a re-entry into CSSA via the latch byte) splits a critical edge, the new edge-split block has no preorder/RPO index. The PHI map will still work because it is keyed on instruction pointers, but the [BB+0x48] / [BB+0x4C] indices will be stale for the new blocks. The pass guards against this with the latch byte: the second time sub_3720740 runs on the same function with [PS+0x70] already set, only the PHI counter ([PS+0x74]) is reset; phase 1 is skipped. A reimplementor must either always re-run phase 1 or never split edges after CSSA.
Reimplementation Checklist
- Basic block ordering and numbering. Assign preorder and reverse-postorder indices to every basic block (stored at block offsets +0x48/+0x4C), used later for dominance and reconvergence queries. Latch the result so re-entries skip this phase.
- PHI node scanning and hash map population. Walk all instructions across all basic blocks, identify PHI nodes (opcode 0x54), assign monotonic IDs, and insert into a DenseMap using the hash
(ptr >> 4) ^ (ptr >> 9)with LLVM-layer sentinels (-4096/-8192) and 75% load-factor growth. - Copy insertion at reconvergence points. For each PHI node's incoming value, insert a
"pcp"-prefixed copy instruction at the end of the predecessor block (before the terminator) using opcode 0x22D7 (divergence-safe copy), then rewire the PHI's use chain so the operand points to the copy instead of the original value. - Copy propagation. Iterate all blocks a second time, invoking the PCP builder on each block to populate per-block reconvergence maps, emit consumer-side copies, and propagate live ranges through the pass-level use-graph.
- Dead copy cleanup. Walk the cleanup worklist, check each entry for zero remaining uses, and erase dead copy instructions via
eraseFromParent. Never interleave with phases 3 or 4. - Copy coalescing (cssa-coalesce). Implement configurable coalescing that identifies cases where multiple
"pcp"copies carry the same value and can share a single register, reducing copy overhead while preserving correctness under warp divergence. Honor the(default=subtarget)semantics for edge-split and block-spanning copies.
Cross-References
- NVIDIA Custom Passes -- CSSA listed as
sub_3720740with knobscssa-coalesce,cssa-verbosity,dump-before-cssa - Rematerialization -- runs before CSSA; rematerializable values reduce the number of PHI copies CSSA must insert
- Sinking2 -- runs adjacent to CSSA; can move definitions across PHI boundaries, increasing CSSA's workload
- Register Allocation -- greedy RA consumes CSSA-prepared IR
- Scheduling --
usedessaknob controls pre-RA/post-RA scheduling mode - Code Generation Pipeline -- CSSA's position in the overall compilation flow
- StructurizeCFG -- related pass that ensures structured control flow for PTX
- Hash and Container Infrastructure -- DenseMap and SmallVector primitives shared with phase-1 worklist and phase-2 PHI map
- Configuration Knobs -- full knob inventory