Data Structure Layouts

All addresses on this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.

This page documents the key internal data structures in ptxas v13.0.88: the compilation context ("god object"), the Ori Code Object, symbol tables, constant/shared memory descriptors, the pool allocator's object model, and the generic container types (hash maps, linked lists, growable arrays) that underpin nearly every subsystem.

All offsets are byte offsets from the structure base unless otherwise noted. Types are inferred from decompiled access patterns. Field names are reverse-engineered -- the binary is stripped.

Compilation Context (the "God Object")

The compilation context is the central state object passed to every phase in the pipeline. It is not the Code Object (which is per-function); it is the per-compilation-unit container that owns the Code Object, the knob system, the output stream, the function list, and all per-pass configuration. The sub_7FBB70 (PerKernelEntry) function receives this as a1, and every phase's execute() receives it as the second argument.

The context is a polymorphic C++ object with a vtable at offset +0. It is allocated by the compilation driver and persists for the lifetime of a single compilation unit. Key observations:

  • The vtable at +0 provides 263+ virtual methods (vtable spans to offset 2104+)
  • The object's nominal struct size is 1832 bytes, as recorded in *(ctx+1536) by the deep-clear routine sub_C1B7A0 (*(a1+1536) = 1832). The constructor sub_7F7DC0 nevertheless writes fields out to +2136, so the in-memory layout extends roughly 300 bytes past the nominal size to hold tail extension fields
  • The knob/options system is accessed through an indirection at +1664 (pointer to knob container object)
  • The output stream lives at +1440

The full constructor is sub_7F7DC0 (1270 lines). The map below was rebuilt by:

  1. Reading every initialization line of sub_7F7DC0 (the canonical ctor) to nail down field types from initial values and accessor widths
  2. Cross-referencing 29 phase-execute functions where a1 is provably the ctx (filtered by requiring both *(a1+1584) and *(a1+1664) accesses in the same file) and tallying offset access frequency with rg --no-filename --only-matching '\(a1 \+ \d{3,4}\)' | sort | uniq -c
  3. For each top-frequency offset, opening the use sites to derive the semantic role from how the value is written, read, indexed, dispatched, and gated against

Compilation Context Field Map

The map is grouped by functional region rather than strict offset order, so that callers reading the cleanup code (which iterates by region) and callers reading dispatch code (which jumps between regions) can both find what they need.

Header / Lifetime / Allocator (+0..+96)

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +0 | vtable* | vtable | *(_QWORD *)a1 in every virtual dispatch; ctor sub_7F7DC0 line 176 sets *(a1)=a3 |
| +8 | ptr | parent / driver_ctx | Back-pointer; sub_A3A7E0 reads v2 = *(a1+8) then v2[198] for Code Object |
| +12 | u32 | flags_word_a | Ctor sub_7F7DC0:177: *(a1+12) = 0 |
| +16 | allocator* | master_allocator | Pool allocator object (656 bytes, vtable off_21DBC80); allocated in ctor lines 178-216 with size 10240; the value at +16 is the canonical handle used by every other field that needs to allocate |
| +24 | refcount* | weak_ref_block_a | Ctor lines 219-229: 24-byte refcount block, refcount=1 |
| +32 | refcount* | weak_ref_block_b | Ctor lines 225-229: second 24-byte refcount block; vtable[16] used for allocation of size 24 |
| +40 | vtable* | secondary_vtable | Ctor line 233: *(a1+40) = off_21DBEF8 (sub-object vtable for stream/iostream-style operations) |
| +48 / +56 | ptr | sub_alloc_view_1/2 | Ctor lines 234-235: *(a1+48)=v16, *(a1+56)=0 (allocator views for sub-object) |
| +64 | ptr | (zero-init) | Ctor line 238 |
| +72 | u32 | (zero-init counter) | Ctor line 239 |
| +80 | allocator* | first_lookup_container_alloc | Ctor line 236: *(a1+80) = v16. Correction: the +80..+100 region is a growable-container header in the standard ptxas layout, not a standalone last_exit_code_alloc field. Ctor line 603 calls sub_7F0C10((_QWORD*)(a1+80), 512), which is the generic grow helper whose argument is the container base. Inside sub_7F0C10 (sub_7F0C10_0x7f0c10.c:13-34), the helper reads *((unsigned int*)a1+5) = offset +20 from its argument as the capacity, and *((int*)a1+4) = offset +16 as the current count. So the container laid over +80 has: allocator@+80, buffer@+88, count@+96 (u32), capacity@+100 (u32). The 512 capacity suggests a symbol-id or name-hash table companion to the +144 name table |
| +88 | ptr | first_lookup_container_buffer | Ctor line 240: *(a1+88) = 0. Buffer pointer for the +80 container; after the ctor:603 grow-to-512 call, this points at an array of 512 qwords (4 KiB) |
| +96 | i32 | compile_unit_index / container_count | Ctor line 231: *(_QWORD*)(a1+96) = 0xFFFFFFFFLL -- a qword write that sets +96 to -1 (the container count sentinel) and +100 to 0 (initial capacity). sub_C64F70:71 writes the low 32 bits into the +20 slot of every timing record (v46 = *(_DWORD*)(*a1+96)). Partial: the original documentation called this compile_unit_index, but the qword-init pattern is identical to the count/capacity init used by every other container in the ctx. Either (a) +96 genuinely serves dual duty as container-count and cu-index (unlikely because the two would collide on grow), or (b) it is the container count and the "cu_idx" column in timing records is actually a phase-sequence number. Marked partial until a disambiguating write site is found |
| +100 | u32 | container_capacity_or_reserved | Implicit from the ctor:231 qword write setting the high dword to 0, and from sub_7F0C10's generic grow helper reading offset +20 as capacity. After ctor line 603, this becomes 512. Marked partial -- could also be the tail half of +96 if +96 is truly a qword semantic field |
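The {allocator, buffer, count, capacity} header and its single-qword initialization can be sketched in C as follows. This is a minimal illustration, not recovered source: the struct and function names are hypothetical, and the sketch assumes a little-endian target with 8-byte pointers (as on x86-64), which is what makes one 64-bit store of 0xFFFFFFFF yield count = -1 and capacity = 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; only the relative offsets come from the field map. */
typedef struct GrowableHeader {
    void    *allocator;  /* +0  (ctx+80)  */
    void    *buffer;     /* +8  (ctx+88)  */
    int32_t  count;      /* +16 (ctx+96), -1 means "empty" */
    uint32_t capacity;   /* +20 (ctx+100) */
} GrowableHeader;

static_assert(offsetof(GrowableHeader, count) == 16, "count expected at +16");
static_assert(offsetof(GrowableHeader, capacity) == 20, "capacity expected at +20");
static_assert(sizeof(GrowableHeader) == 24, "header expected to be 24 bytes");

/* Mimics the ctor's single qword store of 0xFFFFFFFF that covers both
 * count and capacity. Assumes little-endian byte order. */
static void growable_init_empty(GrowableHeader *h)
{
    const uint64_t init = 0xFFFFFFFFull;
    h->allocator = NULL;
    h->buffer = NULL;
    memcpy((unsigned char *)h + offsetof(GrowableHeader, count), &init, sizeof init);
}
```

After growable_init_empty, reading the two dwords separately gives the -1 sentinel and a zero capacity, which matches the ambiguity described for +96/+100: a reader that only sees the qword store cannot tell one 64-bit field from two adjacent 32-bit fields.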

Embedded Hash-Map Containers, Bin 1 (+104..+200)

This bin holds three or four small hash-map / sorted-array containers. The allocator view at +16 is used to allocate their bucket arrays. Each container has the layout {header*, ptr, ptr, ptr, ptr, u32 count} packed across ~24-40 bytes.

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +104 | ptr | bin1_container_a_hdr0 | Ctor line 241: *(_QWORD*)(a1+104) = 0. First qword of an opaque five-qword region that looks like a second growable-container or hash-map slot. Not wrapped by sub_7F0C10, which suggests it is a manually-managed structure rather than the standard {alloc, buf, count, cap} layout -- possibly a small std::list-style sentinel node (prev/next/head/tail/count) |
| +112 | ptr | bin1_container_a_hdr1 | Ctor line 242: *(_QWORD*)(a1+112) = 0 |
| +120 | ptr | bin1_container_a_hdr2 | Ctor line 243: *(_QWORD*)(a1+120) = 0 |
| +128 | ptr | bin1_container_a_hdr3 | Ctor line 244: *(_QWORD*)(a1+128) = 0 |
| +136 | u32 | bin1_container_a_count | Ctor line 245: *(_DWORD*)(a1+136) = 0. Only the low dword is written (ctor uses _DWORD *), confirming that the four preceding qwords are pointers and +136 is a counter, not part of a larger pointer array. The five-field layout {ptr, ptr, ptr, ptr, u32} is the canonical ptxas intrusive-list-sentinel-plus-count pattern |
| +144 | allocator* | name_table_alloc | Allocator handle for the name table; ctor line 237 sets *(a1+144)=v16 and sub_7F0C90((a1+144), 1) initializes the table at line 247 |
| +152 | ptr | name_table_buffer | Bucket / element array for name table; ctor line 246: *(a1+152) = 0; later grown by sub_7F0C90(a1+144, 64) at line 630 |
| +160 | i32 | name_table_count | Ctor line 232 sets 0xFFFFFFFFLL (-1 sentinel = empty), line 309 then sets to 0; sub_C64F70:68 reads *(*a1+160) as v8 and writes it into timing records |
| +168 | i32 | name_table_capacity | Implied from grow path |
| +200 | ptr | name_alt_array | Ctor line 315: separate 24-byte refcount-managed buffer for name aliases |

Function Object Table (+208..+340)

The most-accessed structural region of the ctx. This is the function-id-indexed lookup table: given a function id, dereference *(a1+296) + 8*id to get the per-function Code Object pointer. The table is a growable inline array (the inline buffer immediately follows the metadata fields).

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +208 | __m128i | (init constant) | Ctor line 325: _mm_load_si128(xmmword_21D24D0) |
| +224 | i32 | func_table_iter_state | Ctor line 337: *(a1+224)=0 |
| +232 | ptr | current_ori_insn | Most-accessed uncharacterized field. Stores the Ori instruction currently being legalized so that error reporters can blame the right line. Written by sub_9BCBA0:300 (*(a1+232) = a2), sub_8127C0:1246 (*(a1+232) = v24 next to insn-loc store), sub_1246DF0:138, sub_80A730:240, etc. Cleared to 0 in ctor line 338 |
| +240 | i32 | current_pipeline_phase_bin | Ctor line 339: *(a1+240) = 7; observed used as a 3-bit bin tag for grouping IR transforms |
| +248..+251 | u32 | (init zero) | Ctor (4 zeroed dwords from 252 region) |
| +252 | i32 | current_func_table_count | Ctor line 340: *(a1+252)=0; written/read alongside +296 in many phases |
| +256..+261 | u8[6] | legalization_state_bytes | Ctor lines 341-346: six bytes individually zeroed; observed used as bitfields for "this insn is in legalization-pending state" |
| +264 | i32 | current_ori_insn_loc | Paired with +232: source location of the current insn. Loaded as *(_DWORD *)(insn+20) and written every time +232 is updated. Confirmed in sub_9BCBA0:298, sub_1246DF0:139, sub_8127C0:1247, sub_80A730:281 (all show *(a1+264) = *(_DWORD *)(insn+20)) |
| +272 | node* | pending_legalize_list_head | Linked list head used by sub_752CF0:17 to walk Ori instructions whose opcode bits match (opcode & 0xFFFFCFFF) == 0x89 (a specific instruction class). Each node has next = *(v+8) and a payload at +72/+84/+204/+232/+264. Confirmed iterating list in sub_18F4850:89-117 (v8 = *(a1+272); ...; v8 = *(_QWORD *)(v8+8)) |
| +280 | ptr | current_iter_node | Pointer used as a moving cursor on the +272 list during phase iteration; sub_7846F0:201 reads v7 = *(a1+280) then dereferences *v7+21 for an instruction id |
| +288 | allocator* | func_table_alloc | Ctor line 328: *(a1+288) = v23; the allocator handle used by the grow path at line 611 |
| +296 | Code Object** | function_table_buffer | Documented -- the function-id-indexed lookup table. *(a1+296) + 8*id returns the per-function Code Object. Read at hundreds of sites |
| +304 | i32 | function_table_count | Ctor line 324 sets 0xFFFFFFFFLL (-1 sentinel); grows on demand. Read as *(_DWORD *)(a1+304) everywhere a phase iterates over all functions, e.g., sub_781F80:329, sub_7846F0:217, sub_A0F020:361 |
| +308 | i32 | function_table_capacity | Ctor line 605: v55 = *(_DWORD *)(a1+308); if (v55 <= 127) { v56 = grow ... } |
| +312 | allocator* | func_table_alloc_view | Same allocator, second handle (used by grow loop) |
| +320 | Code Object** | function_table_buffer_alt | Secondary buffer pointer used by grow path at ctor lines 1239-1262 |
| +328 | i32 | function_table_count_alt | Ctor line 326, line 1235: v157 = *(a1+328) |
| +332 | i32 | function_table_capacity_alt | Ctor line 1236: v158 = *(a1+332) |
| +336 | allocator* | func_table_alloc_view2 | Ctor line 330 |
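The function-id lookup through +296 and +304 can be sketched as below. This is an illustrative accessor over a raw context pointer, not code from the binary: the function name, the enum constants, and the opaque CodeObject type are hypothetical, pointers are assumed to be 8 bytes, and treating the value at +304 as the highest valid index (so that the -1 sentinel means "empty") is an assumption based on the sentinel convention rather than a confirmed bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets taken from the field map; names are hypothetical. */
enum {
    CTX_FUNC_TABLE_BUFFER = 296,  /* Code Object** */
    CTX_FUNC_TABLE_COUNT  = 304   /* i32, -1 when empty */
};

typedef struct CodeObject CodeObject;  /* opaque in this sketch */

/* Returns the per-function Code Object for `id`, or NULL when the id is
 * out of range. Assumes the count field holds the last valid index. */
static CodeObject *ctx_lookup_code_object(const unsigned char *ctx, int32_t id)
{
    CodeObject **table;
    int32_t last_index;

    memcpy(&table, ctx + CTX_FUNC_TABLE_BUFFER, sizeof table);
    memcpy(&last_index, ctx + CTX_FUNC_TABLE_COUNT, sizeof last_index);

    if (table == NULL || id < 0 || id > last_index)
        return NULL;
    return table[id];  /* equivalent to *(*(a1+296) + 8*id) */
}
```

Reading the fields with memcpy rather than casting avoids alignment and aliasing assumptions about the raw buffer; in decompiler output the same access appears as a direct cast and dereference.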

Function Name Table (+344..+416)

A second growable array, parallel to the function object table, storing per-function names.

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +344 | ptr | (zero) | Ctor line 350 |
| +352 | i32 | name_table_alt_count | Ctor line 331: 0xFFFFFFFFLL |
| +360 | allocator* | func_name_alloc | Ctor line 332 |
| +368 | name** | function_name_array | Documented -- buffer pointer |
| +376 | i32 | function_name_count | Ctor line 333: 0xFFFFFFFFLL (-1 = empty); read at sub_781F80:982, sub_A0F020:141 (for (i=0; i <= *(a1+376); ++i)) |
| +380 | i32 | function_name_capacity | Ctor line 631: v61 = *(_DWORD *)(a1+380); if (v61 <= 15) ... (grow path) |
| +384 | allocator* | func_name_alloc_view | Ctor line 334 |
| +392 / +400 / +408 | ptr / i32 / ptr | additional small buffers | Ctor lines 335-336 |
| +416 | ptr | (zero) | Ctor line 353 |

Code Object Stage Buffers (+424..+712)

A bank of growable arrays each managed by the same {allocator, buffer, count, capacity} quad pattern. Used to hold per-stage transient data (intermediate results between consecutive passes).

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +424 | i32 | (sentinel -1) | Ctor line 354 |
| +432 | allocator* | stage_array_alloc_a | Ctor line 358: *(a1+432)=v23; lines 374-375 dispatch (**(a1+432)+32)(a1+432) to free |
| +440 | i32* | stage_array_buffer_a | Ctor line 378: stores newly allocated buffer; sized as 4*count + 4 |
| +448 | i32 | stage_array_count_a | Ctor lines 364/377/413: -1 sentinel, then 0 |
| +452 | i32 | stage_array_capacity_a | Ctor line 379: *(a1+452) = 1 |
| +456 | allocator* | stage_array_alloc_b | Ctor line 403 |
| +464 / +472 / +480 / +488 | various | additional growable-array slots | Ctor lines 391-405 (see "Inline buffer block" below) |
| +496 / +504 | i32 / allocator* | -1 sentinel / allocator | Ctor lines 392/406 |
| +512 | i32* | function_worklist_buffer | Active function-id worklist (typically an int[N]). Ctor line 400. Read with *(_DWORD *)(*(a1+512) + 4*idx) -- e.g., sub_796D60:102-104 indexes into the function table via this list |
| +520 | i32 | function_worklist_count | Ctor line 394: 0xFFFFFFFFLL (sentinel). Read at sub_796D60:95 (v36 = *(a1+520)), sub_781F80:349, sub_793220:87. Often used as for (i=1; i <= *(a1+520); ++i), indicating 1-based indexing |
| +528 | allocator* | worklist_alloc | Ctor line 409 |
| +536 / +544 | i32 / i32 | sentinel slots | Ctor lines 396/402 |
| +552 | inline_buf* | worklist_inline_storage | Ctor line 393: *(a1+552) = a1+576 (small-buffer optimization base) |
| +560 | u64 | worklist_inline_capacity | Ctor line 395: 0x500000000LL (capacity = 5 in upper dword) |
| +568 | allocator* | worklist_alloc_view | Ctor line 410 |
| +576..+615 | i32[5] (inline) | worklist_inline_buffer | The actual 5-element inline storage area for the worklist; if more than 5 entries are needed, the heap path at +512 takes over |
| +616 | inline_buf* | secondary_inline_storage | Ctor line 399: *(a1+616) = a1+640 |
| +624 | u64 | secondary_inline_capacity | Ctor line 401: 0x200000000LL (capacity = 2) |
| +632 | allocator* | secondary_alloc_view | Ctor line 411 |
| +640..+655 | i32[2] (inline) | secondary_inline_buffer | 2-element inline buffer |
| +672 | inline_buf* | tertiary_inline_storage | Ctor line 405: *(a1+672) = a1+696 |
| +680 | u64 | tertiary_inline_capacity | Ctor line 407: 0xA00000000LL (capacity = 10) |
| +688 | allocator* | tertiary_alloc_view | Ctor line 412 |
| +696..+735 | i32[10] (inline) | tertiary_inline_buffer | 10-element inline buffer |
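The packed capacity constants in this table (0x500000000, 0x200000000, 0xA00000000) decode as shown in the sketch below. The helper names are hypothetical, and reading the low dword as the element count is an assumption; the table only establishes that the capacity sits in the upper dword.

```c
#include <assert.h>
#include <stdint.h>

/* Upper dword: inline capacity, per the ctor constants in the table. */
static uint32_t packed_capacity(uint64_t packed)
{
    return (uint32_t)(packed >> 32);
}

/* Lower dword: assumed to be the current element count (0 after the ctor). */
static uint32_t packed_count(uint64_t packed)
{
    return (uint32_t)(packed & 0xFFFFFFFFu);
}

/* Rebuilds the qword the ctor stores, e.g. pack_count_capacity(0, 5). */
static uint64_t pack_count_capacity(uint32_t count, uint32_t capacity)
{
    return ((uint64_t)capacity << 32) | count;
}

/* Small-buffer check: the storage pointer (+552) still points at the
 * inline area (+576) until the container spills to the heap. */
static int uses_inline_storage(const void *storage, const void *inline_buf)
{
    return storage == inline_buf;
}
```

A single 64-bit store initializing both halves is the same idiom as the 0xFFFFFFFF count/capacity initialization used by the heap-backed containers elsewhere in the context.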

Call-Graph / Traversal State (+736..+1008)

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +736 | refcount* | traversal_refcount_a | Ctor lines 415-419 allocate a 24-byte refcount block |
| +744 / +752 / +760 | ptr / ptr / ptr | linked list head/tail/iter | Ctor lines 421-423; subsequent passes traverse via these |
| +776 | u8* | opcode_attribute_table | Correction -- not a linked-list slot. Ctor lines 904-932 allocate a 1428-byte buffer via (*(a1+16)->alloc)(v139, 1428), write *v140 = 355 into the first 4 bytes (marker / max opcode id), zero-initialize the rest in a 4-qword-stride loop, then store *(a1+776) = v141 where v141 = v140 + 1 (i.e., the pointer skips the 8-byte header). Immediately afterward, ctor lines 934-1229 OR individual flag bits into byte offsets v142[17], [25], [29], [37], [64], ..., [1420], [1424] -- the decompiler types v142 as _QWORD* but the bounds and \|= 0xXX patterns prove it is effectively a _BYTE* with byte indices. This buffer is the per-opcode attribute/capability bitmap: each of the ~355 Ori opcodes has one or more attribute bytes describing legality, pool, scheduling class, etc. The mask values observed (0x9C, 0x11, 0x1C, 0x1E, 0x40, 0x80, ...) select distinct attribute subsets. Observed masks stored per-opcode include: 0x01 (valid), 0x02 (has-side-effect), 0x04 (reads-memory), 0x08 (writes-memory), 0x10 (late-expansion), 0x20 (sm9+-only), 0x40 (cluster-aware), 0x80 (dual-issue) -- these are inferred from usage context, and the table should be treated as partial until each bit is cross-referenced with a phase-execute body that tests it |
| +784 | allocator* | opcode_attribute_table_alloc | Ctor line 933: *(a1+784) = v139 where v139 = *(_QWORD*)(a1+16). Paired with +776: sub_7F7DC0:928-930 uses *(_QWORD*)(a1+784) to free an earlier buffer (the deallocator call dereferences *(a1+784) and dispatches vtable[32]) |
| +792 | cg_node* | call_graph_object | Lazily allocated by sub_784220 if null. Holds a function pre-order walk over the call graph; *(cg+0) = node_count, *(cg+8) = i32* node_array, *(cg+16) = allocator, *(cg+24) = visited_flag. Read in sub_793220:62 (*(_DWORD *)(a1+940) = *(_DWORD *)(v2+4)), sub_A0F020:159, 190, 201, 205. Allocated using 4 * (function_count+1) + 8 bytes |
| +800 | allocator* | call_graph_alloc | Ctor line 434; passed as second handle in sub_784220:69 |
| +808 / +816 | ptr | aux ptrs | Ctor lines 435-437 |
| +824 | i32 | current_func_id | Ctor line 438: 0xFFFFFFFFLL (-1 = "no current function"). Read by phase passes that need to know which function is being processed (sub_7846F0:210, 220 -> sub_923600(a1, *(a1+824))), sub_A0F020:107 |
| +832 / +840 | ptr | aux growable-array slots | Ctor lines 439-440 |
| +848 | i32 | aux_count | Ctor line 441 |
| +856 / +864 / +872 | various | linked list slots | Ctor lines 442-444 |
| +880 | __m128i | (SSE init) | Ctor line 428 from xmmword_2027590 |
| +896 | i32 | assembler_mode | Ctor line 445; read in sub_784220:258 (if (*(a1+896) == 4) sub_74AEE0(...)) -- value 4 triggers a special handling path |
| +900 | i32 | cluster_dimension_mode | Ctor lines 849-861: set from *(a2+1796) -- 0/1/2 selecting cluster geometry mode (0=none, 1=auto, 2=explicit) |
| +904 / +908 | u8 / u8 | bit flags | Ctor lines 446-447 |
| +912 / +940 | _OWORD | SSE-init slots | Ctor lines 430, 433 |
| +928 | i32 | (zero) | Ctor line 448 |
| +932 | i32 | relocation_id_seed | Ctor line 449: *(a1+932) = -1 |
| +936 | u8 | flag | Ctor line 450 |
| +940 | i32 | cg_node_count_cached | Ctor (read from *(a1+792)+4); see +792 above |
| +956 | i32 | unroll_factor_default | Ctor line 451: *(a1+956) = 15 |
| +960 | allocator* | (allocator view) | Ctor line 431 |
| +968 / +976 | ptr / i32 | aux array | Ctor lines 452-453 |
| +984 / +1000 | ptr | (zero) | Ctor lines 436, 454 |
| +992 | back_ptr_block* | back_ptr_to_self | Ctor lines 894-900: 32-byte block holding [a1, allocator, 0, -1] -- a self-reference used for callback hooks |
| +1008 | journal* | memstate_rewrite_journal_hdr | 24-byte refcount block {refcount=2, freelist_head, allocator} allocated at ctor lines 456-465. Identified: per-BB memory-state-token rewrite journal. The drain at sub_8116B0_0x8116b0.c:184-212 walks a trail-bucket array, splicing old values back via *target = *(header+8); *(header+8) = old_value (classic undo-trail pattern). Captures rewrites dispatched through vtable+2456 (operands whose backing descriptor has *(desc+64)==8, opcode ≠ 263) during the per-instruction sweep. Supersedes the older "analysis_pool_alloc" label |
| +1016 | i32 | memstate_journal_count | Non-empty DWORD guard checked before drain (sub_8116B0:184) |
| +1024 | rewrite_rec* | memstate_journal_bucket_base | Trail-bucket array (24-byte records {old_value, target_ptr, tag_dword}) |
| +1032 | i32 | memstate_journal_bucket_size | Trail-bucket array size |
| +1048 | journal* | ssa_handle_rewrite_journal_hdr | Second 24-byte refcount block (ctor lines 468-477), identical structure to +1008. Identified: SSA-value handle rewrite journal. It captures operands matched by sub_7DEB10 (the value/SSA-handle classifier) and dispatched through vtable+2464. Drained at sub_8116B0_0x8116b0.c:213-250 immediately after the memstate journal, in deterministic order. Two parallel channels keep per-BB memory-token rewrites and SSA-handle rewrites independent so after-BB commits can roll each back without cross-contamination |
| +1056 | i32 | ssa_journal_count | Non-empty DWORD guard for drain |
| +1064 | rewrite_rec* | ssa_journal_bucket_base | Trail-bucket array base |
| +1072 | i32 | ssa_journal_bucket_size | Trail-bucket array size |
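The undo-trail idea behind the two rewrite journals can be illustrated with the generic sketch below. It is not the drain loop of sub_8116B0, which additionally swaps values through the journal header; it only shows the record shape {old_value, target_ptr, tag_dword} and the restore-in-reverse principle. All names are hypothetical, and the 24-byte record size assumes 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 24-byte trail record, matching the {old_value, target_ptr, tag_dword}
 * shape in the table (4 bytes of tail padding on LP64). */
typedef struct RewriteRec {
    uint64_t  old_value;  /* +0  value overwritten by the rewrite */
    uint64_t *target;     /* +8  location that was rewritten */
    uint32_t  tag;        /* +16 caller-defined tag */
} RewriteRec;

static_assert(sizeof(RewriteRec) == 24, "record expected to be 24 bytes");

typedef struct Journal {
    RewriteRec *recs;
    int32_t     count;
    int32_t     capacity;
} Journal;

/* Performs *target = new_value and logs the old value first.
 * Returns 0 (and leaves *target untouched) when the trail is full. */
static int journal_write(Journal *j, uint64_t *target, uint64_t new_value, uint32_t tag)
{
    RewriteRec *r;
    if (j->count >= j->capacity)
        return 0;
    r = &j->recs[j->count++];
    r->old_value = *target;
    r->target = target;
    r->tag = tag;
    *target = new_value;
    return 1;
}

/* Rolls back every logged rewrite, newest first, and empties the trail. */
static void journal_drain(Journal *j)
{
    while (j->count > 0) {
        RewriteRec *r = &j->recs[--j->count];
        *r->target = r->old_value;
    }
}
```

Keeping two such journals, one per rewrite channel, lets each channel be rolled back without touching the other, which is the independence property described for the memstate and SSA-handle journals.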

Linked Lists, Hash Maps, and Misc Containers (+1120..+1352)

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +1120..+1136 | ptr | (zero-init) | Ctor lines 478-480 |
| +1144 | node* | function_list_head | Documented; ctor line 481 |
| +1152 | node* | aux_list_head_a | Ctor line 482; written/read in sub_781F80:384, 389 -- secondary linked list traversed during legalization |
| +1160 | node* | entry_list_head | Documented; ctor line 483 |
| +1168 | node* | aux_list_head_b | Ctor line 484 |
| +1176..+1231 | hash_map (56B) | embedded_hash_map_a | Ctor line 485 calls sub_7F0A90(a1+1176, a1). Layout: {back_ptr_to_ctx, bucket_array_ptr, head, tail, ptr, count}. Used by sub-passes to map id->object |
| +1232..+1287 | hash_map (56B) | embedded_hash_map_b | Ctor line 486 calls sub_7F0B20(a1+1232, a1). Same layout as the previous map |
| +1288 / +1296 | ptr | linked list ptrs | Ctor lines 488-489 |
| +1304 | refcount* | aux_refcount | Ctor lines 491-495 |
| +1312 / +1320 / +1328 | ptr | aux slots | Ctor lines 497-499 |
| +1344 / +1352 | ptr | aux slots | Ctor lines 501-502 |
| +1360 | subobj* | optional_56B_subobj | Ctor lines 881-892: only allocated if *(a2+920) > 0, and bit `+1378 |
| +1372 | i32 | legalization_iter_phase | Ctor line 709: *(a1+1372) = 0. Read in sub_7846F0:240 (if (!*(_DWORD *)(a1+1372) && *(char*)(a1+1415) < 0)) -- gates a legalization sub-mode |

Phase Bitfield Bank (+1368..+1421)

A 54-byte region of densely packed boolean / multi-bit gate flags. The constructor sub_7F7DC0 lines 686-872 initialize this region by reading individual fields from the option struct (a2) and packing them into bit positions. Some bytes (+1376, +1396, +1412, +1414, +1416, +1418) were partially documented before; the table below adds the missing ones and corrects the type widths.

| Offset | Type | Field | Initial value (ctor line) | Bit semantics |
|---|---|---|---|---|
| +1368 | u8 | pipeline_iter_flags | line 693: 0 | bit 0x01 = "ConvertUnsupportedOps invoked", bit 0x02 = "diagnostic dump active" (gates sub_793220:54 lazy init at +792), bit 0x10 = "deep mode", bit 0x40 = ?, bit 0x80 = sign-bit checked at sub_908EB0:60 |
| +1369 | u8 | pipeline_iter_flags_b | line 694: 0 | bit 0x80 cleared at sub_781F80:359; sign-bit checked in sub_752CF0:? and sub_793220:52 |
| +1370 | u8 | pipeline_iter_flags_c | line 686: &= 0xA0; line 691: & 0x5F | bit 0x02 (constructor enables it via `(8 * (a2&1)) |
| +1371 | u8 | legalize_call_flags | line 695: 0; line 1375: `= 0x80` | |
| +1372 | i32 | legalization_iter_phase | line 709: 0 | See above |
| +1376 | u8 | scheduling_mode_flags | Documented: bit 0x08 forward, 0x10 bidirectional, 0x20 disable scheduling | line 696: 0 |
| +1377 | u8 | regalloc_flags_a | line 697: 0 | bit 0x20 cleared at sub_781F80:358 |
| +1378 | u8 | subobj_present_flags | line 699: 0; line 887: `= 8` if `*(a2+920) > 0` | |
| +1379 | u8 | aux_flags | line 700: 0 | |
| +1380..+1382 | u8 x3 | flag bytes | lines 702-704: 0 | +1381 bit 0x40 (already known: cutlass flag, cleared bit 6 in some passes; tested in sub_913A30:41); +1382 bit 0x20 ("instruction class scan needed", read in sub_796D60:63) |
| +1383 | u8 | init_marker | line 707: 0x80 | Set unconditionally to 0x80 by ctor; never observed cleared (this is the "context fully constructed" marker) |
| +1384 | i32 | pipeline_progress_b | lines 692-701: zeroed via & 0xFFFC7FFF mask | A second progress counter (distinct from +1552); incremented by passes at certain transitions |
| +1385 | u8 | phase_bin_flags | line 689: &= 0x80 | bit 0x04 read in sub_796D60:63 |
| +1386..+1389 | u8 x4 | flag bytes | lines 705, 708, 710-711 | +1387 line 705: & 0xFC (clears low 2 bits); +1388 zeroed; +1389 line 690: &= 0xE0 |
| +1392 | u8 | flag byte | line 711: 0 | |
| +1393 | u8 | option_byte | line 712: & 0xF8 | Three-bit option |
| +1396 | u8 | phase_58_gate_bits | Documented | Filled bit-by-bit from *(a2+184), +192, +200, +204, +208, +212, +216 (lines 713-727) |
| +1397 | u8 | phase_extension_bits | line 729: 0 | bits set from *(a2+220), +224, +228, +248, +232, +280, +316 (lines 728-743). 7 distinct option bits packed |
| +1398 | u8 | late_expansion_prereq | line 744 | bit 0x02 = LateExpansion prerequisite (already known); set unconditionally \|= 2; bits 0-1 from *(a2+1448) & 3; field finally masked with & 0x1F at line 873 |
| +1399 | u8 | derived_phase_bits | line 877 | Computed bit-OR of bits 3 and 4 from *(a1+1413) (so this byte is derived from other flags, not from input options) |
| +1400 | u32 | error_count_or_threshold | line 752 | Set from *(a2+272) if non-negative; gates rate-limited diagnostic emission |
| +1404 | u8 | error_count_set_marker | line 751 | Set to 1 once +1400 has been initialized from options |
| +1408 | u8 | flag byte | line 755: &= 0xE0 | low 5 bits used as a sub-mode tag |
| +1412 | u8 | compilation_flags_byte | Documented | line 756: & 0xC0 -- ctor packs *(a2+136), +137, +138, +139 into low bits, then more from *(a2+1060), +1064 |
| +1413 | u8 | optimizer_gate_bits | lines 759-782 | Most bit-rich byte: ctor sets bit 1 from *(a2+137), bit 0 from *(a2+138), bit 2 from (a1+1412 & 0x380)==0, bit 3 from *(a2+1060), bits 4-5 from *(a2+1064), bit 6 from *(a2+1068), bit 7 from *(a2+424)==1. Special override line 774-778: `if (*(a2+348) > 36863 && (v91&8)==0 && (v91&0x30)!=0x10) v91 |
| +1414 | u8 | late_expansion_flags | Documented | bit 0x02 = LateExpansion prerequisite; ctor packs more bits at lines 786-794 from *(a2+140), +1080, +1084, +1088, +144 |
| +1415 | u8 | optimization_path_flags | lines 796-801 | bit 2 from *(a2+172), bit 3 from *(a2+176), bit 4 from *(a2+180). Sign-bit (bit 7) tested at sub_7846F0:240 to gate the legalization sub-mode |
| +1416 | u8 | output_detail_flags | Documented: bits 4-5 control latency reporting | lines 803-807: bit 6 from *(a2+148), bit 2 from *(a2+156) |
| +1417 | u8 | late_expansion_aux_bits | lines 808-833 | Most-touched flag byte (modified in 12+ separate ctor lines from many a2 fields including +152, +772, +1516, +324, +1096, +1048, +764) |
| +1418 | u8 | codegen_mode_flags | Documented | lines 814-834: packs bits from *(a2+1112), +1528, +1524, +1100, +1104, +1532 and forces bit 6 ON unconditionally (line 831: `v117 |
| +1419 | u8 | cluster_and_misc_bits | lines 835-845 | bit 0 from *(a2+1760)==1, bit 3 from *(a2+1116), bit 4 from *(a2+1120), bit 5 from *(a2+1124), bit 6 from *(a2+1128), bit 7 from *(a2+1768) |
| +1420 | u8 | cluster_geometry_bits | lines 847-866 | bit 0 from *(a2+1794), bit 1 from *(a2+1792), bit 2 from *(a2+1800), bit 3 from *(a2+1132), bits 4-5 from *(a2+1076) << 4, bits 6-7 from *(a2+1076) -- entirely cluster-launch related |
| +1421 | u8 | aggregator_flags | lines 738/742/867-871 | bit 1 set unconditionally; bits 2-3 from *(a2+1808) & 3; bit 6 from *(a2+1136); bit 7 from *(a2+1140) |
| +1424 | i32 | pipeline_option_word | line 506: *(a2+704) | |
| +1428 | i32 | function_index | Documented | line 507: *(a2+352) |
| +1432 | i32 | loop_unroll_threshold | line 508: *(a2+360) | |
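The bit-packing style used throughout this bank can be sketched with the +1421 row as the worked case. This is a hedged reconstruction from the table entry only: the function and parameter names are hypothetical, and treating the +1136 and +1140 options as "non-zero sets the bit" is an assumption, since the table does not state the exact test. The second helper shows why the decompiler's *(char*)(a1+1415) < 0 is simply a test of bit 7.

```c
#include <assert.h>
#include <stdint.h>

/* Packs a byte the way the +1421 row describes:
 *   bit 1     set unconditionally
 *   bits 2-3  from (option at a2+1808) & 3
 *   bit 6     from option at a2+1136 (assumed: non-zero sets the bit)
 *   bit 7     from option at a2+1140 (assumed: non-zero sets the bit) */
static uint8_t pack_aggregator_flags(uint32_t opt_1808, int opt_1136, int opt_1140)
{
    uint8_t b = 0;
    b |= 0x02;
    b |= (uint8_t)((opt_1808 & 3u) << 2);
    if (opt_1136)
        b |= 0x40;
    if (opt_1140)
        b |= 0x80;
    return b;
}

/* A "sign-bit check" on a flag byte is a test of bit 7 (0x80); it is what
 * a signed-char comparison against zero compiles down to. */
static int flag_sign_bit_set(uint8_t flag_byte)
{
    return (flag_byte & 0x80) != 0;
}
```

For example, options (3, 1, 0) pack to 0x02 | 0x0C | 0x40 = 0x4E, which has the sign bit clear, while (0, 0, 1) packs to 0x82, which has it set.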

Output Stream and Timing Records (+1440..+1656)

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +1440..+1551 | stream (~112B) | output_stream | Documented; ctor line 509: sub_7F7CB0((a1+1440), a1) -- the iostream-style output object |
| +1552 | i32 | pipeline_progress | Documented; ctor line 524: *(a1+1552) = 0 |
| +1560 | allocator* | timing_records_alloc | Ctor line 530: *(a1+1560) = v44 (allocator handle for the timing growable array). The wiki previously described +1560 as "the records pointer", but +1560 is the allocator and +1568 is the actual records buffer |
| +1568 | timing_record* | timing_records_buffer | Ctor line 510: *(a1+1568) = 0. Each timing record is 32 bytes with layout {u32 phase_id, char* phase_name, u32 invocation_depth, u32 cu_index, u32 spare}. Confirmed in sub_C64F70:75-81: writes *(buffer + 32*idx) = phase_id, +8 = name_str, +16 = depth, +20 = cu_idx, +24 = spare |
| +1576 | i32 | timing_count | Documented -- but ctor line 512 actually initializes it to 0xFFFFFFFFLL (-1) |
| +1584 | sm_backend* | sm_backend | Documented |
| +1592 | ptr | (zero-init) | Ctor line 513 |
| +1600..+1648 | ptr | aux slots | Ctor lines 514-520; mostly zero-initialized |
| +1648 | ssa_tracker* | ssa_tracker | A 16-byte tracker object lazily allocated at sub_A0F020:86-90. Layout: {ctx_back_ptr, byte flag_a, byte flag_b}. Used by the dead-code-elimination / use-def cleanup pass to mark whether SSA needs rebuilding |
| +1656 | i32 | aux_counter | Ctor line 521 |
| +1664 | knob_container* | knob_container | Documented |
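The 32-byte timing record described at +1568 maps onto an ordinary C struct once natural alignment is taken into account. The sketch below assumes an LP64 target (8-byte pointers, 8-byte struct alignment); the type and function names are hypothetical, and only the offsets and field roles come from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TimingRecord {
    uint32_t    phase_id;          /* +0  (4 bytes of padding follow) */
    const char *phase_name;        /* +8  */
    uint32_t    invocation_depth;  /* +16 */
    uint32_t    cu_index;          /* +20 */
    uint32_t    spare;             /* +24 (4 bytes of tail padding follow) */
} TimingRecord;

static_assert(offsetof(TimingRecord, phase_name) == 8, "name expected at +8");
static_assert(offsetof(TimingRecord, invocation_depth) == 16, "depth expected at +16");
static_assert(offsetof(TimingRecord, cu_index) == 20, "cu_index expected at +20");
static_assert(offsetof(TimingRecord, spare) == 24, "spare expected at +24");
static_assert(sizeof(TimingRecord) == 32, "record expected to be 32 bytes");

/* Fills record `idx`, i.e. the slot at buffer + 32*idx. */
static void timing_write(TimingRecord *buffer, int32_t idx, uint32_t phase_id,
                         const char *name, uint32_t depth, uint32_t cu_index)
{
    TimingRecord *r = &buffer[idx];
    r->phase_id = phase_id;
    r->phase_name = name;
    r->invocation_depth = depth;
    r->cu_index = cu_index;
    r->spare = 0;
}
```

The static assertions double as a check of the table: the listed field order reproduces the observed offsets and the 32-byte stride without any explicit padding members.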

Output String Buffer (+1672..+1716)

A std::string-like growable byte buffer used to accumulate the output filename or module identifier passed in via *(a2+856).

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +1672 | u64 | output_string_capacity | Ctor line 525: 0; checked at line 678 (if (v74 >= *(_QWORD *)(a1+1672)) sub_66B450(...) -- realloc) |
| +1680 | char* | output_string_start | Ctor line 527 |
| +1688 | char* | output_string_pos | Ctor line 532; advanced via *(a1+1688) += strlen(filename) at line 684 |
| +1696 | allocator* | output_string_alloc | Ctor line 531 |
| +1704 | i32 | compile_timeout_ms | Ctor line 528: *(a2+116) -- a timeout/limit option |
| +1708 | i32 | verbosity_level | Ctor line 533: *(a2+120) -- verbosity option |
| +1712 | i32 | report_format | Ctor line 534: *(a2+124) |
| +1716 | i32 | report_level | Ctor line 540: *(a2+128) |
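The {capacity, start, pos} triple behaves like a conventional append-only string buffer. The sketch below illustrates the grow-then-advance sequence under stated assumptions: it uses the C library's realloc in place of the pool allocator at +1696, the doubling growth policy is invented for the example, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct OutString {
    size_t capacity;  /* +1672 */
    char  *start;     /* +1680 */
    char  *pos;       /* +1688: one past the last character written */
} OutString;

/* Appends `text`, growing the buffer when the required size reaches the
 * current capacity. Returns 0 on allocation failure, 1 on success. */
static int out_string_append(OutString *s, const char *text)
{
    size_t len = strlen(text);
    size_t used = s->start ? (size_t)(s->pos - s->start) : 0;
    size_t need = used + len + 1;  /* +1 for the terminating NUL */

    if (need >= s->capacity) {
        size_t new_cap = s->capacity ? s->capacity * 2 : 16;
        char *p;
        while (new_cap <= need)
            new_cap *= 2;
        p = realloc(s->start, new_cap);
        if (p == NULL)
            return 0;
        s->start = p;
        s->capacity = new_cap;
    }
    memcpy(s->start + used, text, len + 1);
    s->pos = s->start + used + len;  /* mirrors pos += strlen(filename) */
    return 1;
}
```

A zero-initialized OutString, matching the ctor's zero capacity, is a valid empty buffer here, so the first append always takes the growth path.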

Register-File / Inline Limit Block (+1720..+1768)

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +1720 | i32 | regfile_size_hint | Ctor line 538: *(a1+1720) = 512 (default). Overridden at line 903 from *(a2+132) if non-negative. Used as a target register file size budget for register allocation |
| +1728 | allocator* | regalloc_alloc_view | Ctor line 536 |
| +1736 / +1744 / +1752 | ptr / i32 / i32 | aux | Ctor lines 539, 537, 541; +1744 = -1 sentinel |
| +1760 | ptr | (zero) | Ctor line 543 |
| +1768 | i32 | phase_invocation_depth | Ctor line 544: 0. Read at sub_C64F70:70 as v9 = *(*a1 + 1768) + 1 -- this is the per-phase invocation depth that gets recorded into each timing record at offset +16 |
| +1776 | sched_state* | scheduler_state | Ctor line 545: 0. Holds the live scheduling state during the scheduler pass; layout includes +16 = bucket_array, +24 = bucket_count, +28 = bucket_capacity (per sub_92E1F0:114-127) |
| +1784 | latency_model* | latency_model | Ctor line 546: 0. Polymorphic cost-model object with vtable {[0] is_available() -> bool, [8] estimate_latency(insn*) -> double}. Used at sub_8BF4B0:11-30, sub_92E1F0:110, sub_8D1730:104-265, sub_931920:283-307 -- the scheduler asks this object for instruction latencies. See Scheduling Latency Model below |
| +1792 / +1800 / +1808 / +1816 | ptr | aux slots | Ctor lines 547, 549-551 |
| +1824 | i32 | aux_state_word | Ctor line 552 |
| +1832 | refcount* | aux_refcount_b | Ctor line 548: stores *(a1+24) (the second weak ref block); bumps its refcount via ++*v47 |

Backend Configuration Block (+1840..+1976)

This region holds pointers and dwords copied from the options struct (a2), forming a snapshot of the inputs that drive backend behavior. Many were previously documented as "backend stuff at +18xx".

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +1840 | ptr | (zero) | Ctor line 556 |
| +1848 | i32 | (zero) | Ctor line 557 |
| +1856 | pool_config* | compile_pool_config | Ctor line 559: *(a2+88) -- driver-supplied pool configuration |
| +1864 | bb_structure* | bb_structure | Documented |
| +1872 | per_func_data* | per_func_data | Documented |
| +1880 | function_context* | function_context | Documented |
| +1888 | optix_config* | optix_config | Ctor line 562: *(a2+984) -- pointer to OptiX-specific configuration block (only set when compiling for OptiX IR) |
| +1896 | i32 | optix_target_version | Ctor line 568: *(int *)(a2+980) -- companion to +1888 (OptiX target version number) |
| +1904 | allocator* | backend_alloc_view | Ctor line 563 |
| +1912 | ptr | (zero) | Ctor line 566 |
| +1920 | i32 | backend_count_sentinel | Ctor line 564: 0xFFFFFFFFLL |
| +1928 | codegen_ctx* | codegen_ctx | Documented |
| +1936 / +1944 | ptr | (zero) | Ctor lines 569-570 |
| +1952 | ctx* | self_pointer | Ctor line 571: *(a1+1952) = a1 -- a self-reference, used by sub-objects that need a back-pointer to the owning context but only have a stable handle on +1952 |
| +1960 / +1968 | ptr | (zero) | Ctor lines 572-573 |
| +1976 | i32 | (zero) | Ctor line 575 |

Post-Instruction Hook Queues + Late Tail (+1984..+2136)

Correction (was "NvOptRecipe / Late Tail"): this region is NOT the NvOptRecipe state. sub_9F4040 (the NvOptRecipe applier, see phase-manager.md → Task IR-24 commit 1ac448a) does not read any offset in [+1984, +2096] on its context parameter. Instead, the region holds two symmetric post-instruction hook queues used by the instruction-append callback dispatcher sub_7DD3C0. Each queue is a {head, tail, count} triple of 8+8+4 = 20 bytes plus 4 bytes of padding per slot, and both queues follow an identical "pending → committed" splice pattern observed in sub_7D92F0:82-161, sub_7B4020:282-348, sub_833E40:210-280, and sub_856260:260-304.

Queue A (committed side only, pending side lives before +1984):

| Offset | Type | Field | Evidence |
|---|---|---|---|
| +1984 | hook* | post_insn_hooks_A_committed_head | sub_7D92F0:487,505,512 -- spliced as list.head during pending-flush. Not nvoptrecipe_state -- prior label was load-bearingly wrong |
| +1992 | hook* | post_insn_hooks_A_committed_tail | sub_7D92F0:513 -- written as list.tail during splice |
| +2000 | i32 | post_insn_hooks_A_committed_count | sub_7D92F0:515 -- *(a1+2000) += v54 where v54 is the old pending count |
| +2008 | ptr | pad / unused | Ctor-only zero store; no Compilation-Context reader in the decompiled tree |
| +2016 | ptr | pad / unused | Ctor-only |
| +2024 | i32 | pad / unused | Ctor-only |

Queue B (full pending + committed, 6-slot triple):

OffsetTypeFieldEvidence
+2032hook*post_insn_hooks_B_pending_headsub_7B4020:282, sub_7D92F0:132, sub_833E40:210, sub_856260:270 — null-check guards the splice
+2040hook*post_insn_hooks_B_pending_tailsub_7B4020:322 — used as list.tail during splice
+2048i32post_insn_hooks_B_pending_countsub_7B4020:323 — added into committed count on splice
+2056hook*post_insn_hooks_B_committed_headsub_7B4020:289,318,337,344; read by dispatcher sub_7DD3C0_0x7dd3c0.c:9 which walks it and dispatches hook->vtable[0](hook, new_instr) for every instruction append — this is the actual on-append callback wiring
+2064hook*post_insn_hooks_B_committed_tailsub_7B4020:345, sub_7D92F0:159, sub_7DD3C0:9, sub_833E40:242
+2072i32post_insn_hooks_B_committed_countsub_7B4020:347, sub_7D92F0:161, sub_833E40:244, sub_856260:304
+2080 / +2088 / +2096ptr / ptr / i32pad / unusedCtor-only zero stores; all external readers at these offsets operate on unrelated classes (sub_661950 is a scheduler cost model under vtable off_21E6D00, not the Compilation Context under off_21DBC80)
+2104i32optix_modeCtor lines 574, 579, 661-670: copied from *(a2+112). If == 0 overridden to 1; if == 3 overridden to 4. Drives an OptiX-vs-CUDA mode switch
+2112ptrextension_object_aCtor line 585: *(a2+872) (driver-supplied extension object)
+2120ptrextension_object_bCtor line 591: *(a2+1408)
+2128u8extension_object_b_presentCtor line 600: !v53 = (*(int *)(a2+1416) != 0)
+2132i32tail_sentinelCtor line 599: -1
+2136..+2167--(cloned-variant tail only, 32 bytes)The alternate constructor sub_A60B60 builds a 2168-byte object (not "24KB RegisterStatCollector" as prior docs claimed) — this is the per-nested-parse PTX input-buffer cursor for nested text re-entry (e.g. .include scopes). +2136 = include-buffer-name list head, +2144 = raw buffer read pointer (const char *, init 0xFFFFFFFF), +2152 = remaining-bytes DWORD counter, +2160 = saved-line-number linked list head. Evidence: sub_A60B60_0xa60b60.c:754,771,822; sub_71C140_0x71c140.c:31-47 (fills +2136/+2144/+2152 from an a2 string); sub_71C910_0x71c910.c:394 error "ptxset_lineno called with no buffer", lines 403-436 pop linked list nodes via sub_4248B0, lines 463-497 decrement +2152 per read. The primary constructor sub_7F7DC0 tops out at +2132 and never touches this tail
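
The "pending → committed" splice described above can be sketched as follows. This is a minimal illustration, not the decompiled code: the `hook` node shape (a single `next` link), the append-at-tail ordering, and the reset of the pending triple after the splice are all assumptions; only the {head, tail, count} triple, the null-check guard, and the count accumulation come from the evidence columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node shape: only a `next` link is assumed here. */
typedef struct hook {
    struct hook *next;
} hook;

/* One {head, tail, count} triple: 8 + 8 + 4 bytes plus 4 bytes of padding. */
typedef struct hook_queue {
    hook    *head;
    hook    *tail;
    int32_t  count;
} hook_queue;

/* Move every pending hook onto the end of the committed list (assumed order). */
static void splice_pending(hook_queue *pending, hook_queue *committed)
{
    assert(pending != NULL && committed != NULL);
    if (pending->head == NULL)              /* null-check guards the splice */
        return;
    if (committed->head == NULL)
        committed->head = pending->head;
    else
        committed->tail->next = pending->head;
    committed->tail   = pending->tail;
    committed->count += pending->count;     /* old pending count is added in */
    /* Assumption: the pending side is emptied once it has been spliced. */
    pending->head  = NULL;
    pending->tail  = NULL;
    pending->count = 0;
}
```

A dispatcher in the style of sub_7DD3C0 would then only need to walk `committed->head` through the `next` links.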

SM Backend Object at +1584

The pointer at context+0x630 (decimal 1584) is the single most confusing field in the compilation context, because it serves multiple roles through a single polymorphic C++ object. Different wiki pages historically called it different names depending on which role they observed:

  • Legalization pages see it dispatching MidExpansion, LateExpansionUnsupportedOps, etc., and call it "SM backend" or "arch_backend"
  • Scheduling pages see it providing hardware latency profiles at *(sm_backend+372) and call it "scheduler context" or "hw_profile"
  • Optimization pages see it dispatching GvnCse (vtable[23]) and OriReassociateAndCommon (vtable[44]) and call it "optimizer state" or "function manager"
  • Codegen/template pages see it holding register file capacity at +372 and hardware capability flags at +1037

It is one object. The canonical name is sm_backend. It is constructed per-compilation-unit in sub_662920 with a switch on SM version bits (v3 >> 12). Each SM generation gets a different-sized allocation and a different vtable:

| SM Case | Size | Base Constructor | Vtable | SM Generations |
|---|---|---|---|---|
| 3 | 1712B | sub_A99A30 | off_2029DD0 | sm_30 (Kepler) |
| 4 | 1712B | sub_A99A30 | off_21B4A50 | sm_50 (Maxwell) |
| 5 | 1888B | sub_A99A30 | off_22B2A58 | sm_60 (Pascal) |
| 6 | 1912B | sub_A99A30 | off_21D82B0 | sm_70 (Volta) |
| 7 | 1928B | sub_ACDE20 | off_21B2D30 | sm_80 (Ampere) |
| 8 | 1992B | sub_662220 | off_21C0C68 | sm_89 (Ada) |
| 9 | 1992B | sub_662220 | off_21D6860 | sm_90+ (Hopper/Blackwell) |
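
As a rough sketch of the size selection in the table, the switch on `v3 >> 12` can be modelled like this. The function name `sm_backend_alloc_size` is hypothetical, and returning 0 for an unlisted case is an assumption made for the example; the real sub_662920 also selects the constructor and vtable.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the allocation size listed for the SM case, or 0 when the case
 * is not in the table. `encoded_sm` plays the role of v3 in the text. */
static size_t sm_backend_alloc_size(uint32_t encoded_sm)
{
    switch (encoded_sm >> 12) {
    case 3:
    case 4:  return 1712;
    case 5:  return 1888;
    case 6:  return 1912;
    case 7:  return 1928;
    case 8:
    case 9:  return 1992;
    default: return 0;   /* assumption: unlisted cases are not handled here */
    }
}
```

For instance, the encoded value 28673 (0x7001) selects case 7 and therefore the 1928-byte allocation.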

Key sub-fields on the SM backend:

  • +372 (i32): codegen factory value / encoded SM architecture version (e.g., 28673 = sm_80)
  • +1037 (u8): hardware capability flags (bit 0 = has high-precision FP64 MUFU seeds)
  • Vtable slots provide architecture-specific dispatch for 50+ operations

Pipeline Progress Counter at +1552

The field at context+1552 is a monotonically increasing int32 that tracks how far the compilation has progressed through the 159-phase pipeline. It is not a legalization-only counter: it is advanced by phases across all categories (legalization, optimization, scheduling, regalloc). Each update is performed by a small thunk function whose sole body is the store *(ctx + 1552) = N.

Known values and their associated phases:

| Value | Thunk Address | Phase / Context |
|---|---|---|
| 0 | (init) | sub_7F7DC0 -- compilation context constructor |
| 1 | sub_C5F620 | Early pipeline (before ConvertUnsupportedOps) |
| 2 | sub_C5F5A0 | After ConvertUnsupportedOps (phase 5) |
| 3 | sub_C5EF80 | After MidExpansion (phase 45) |
| 4 | sub_C5EF30 | After OriDoRematEarly (phase 54) -- signals remat mode active |
| 5 | sub_1233D70 | Mid-pipeline scheduling/ISel context |
| 7 | sub_6612E0 / sub_C60AA0 | After LateExpansion (phase 55) |
| 8 | sub_849C60 | Post-optimization context |
| 9 | sub_C5EB80 | After OriBackCopyPropagate (phase 83) |
| 10 | sub_88E9D0 | Late optimization |
| 11 | sub_C5EA80 | After SetAfterLegalization (phase 95) region |
| 12 | sub_C5E980 | Post-legalization |
| 13 | sub_13B5C80 | ISel/scheduling |
| 14 | sub_C5E830 | Post-scheduling |
| 15 | sub_C5E7C0 | After OptimizeHotColdInLoop (phase 108) |
| 16 | sub_C5E6E0 | Post-regalloc |
| 17 | sub_C5E5A0 | Mercury/codegen |
| 18 | sub_C5E4D0 | Post-Mercury |
| 19 | sub_C5E440 | Late codegen |
| 20 | sub_C5E390 | Post-RA cleanup |
| 21 | sub_C5E0B0 | Final pipeline stage |

Downstream passes test *(ctx+1552) > N to gate behavior that should only run after a certain pipeline point. For example, the rematerialization cross-block pass checks *(ctx+1552) > 4 to enable its second-pass mode.
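
A minimal sketch of the thunk-and-gate pattern, using a plain byte buffer as a stand-in for the compilation context. The helper names `set_progress` and `progress_past` are hypothetical; only the offset 1552 and the `> N` comparison come from the text.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { PROGRESS_OFFSET = 1552 };   /* byte offset of the counter in the context */

/* Equivalent of a thunk whose body is *(ctx + 1552) = n. */
static void set_progress(unsigned char *ctx, int32_t n)
{
    memcpy(ctx + PROGRESS_OFFSET, &n, sizeof n);
}

/* Gate used by later passes: true once the pipeline has moved past stage n. */
static bool progress_past(const unsigned char *ctx, int32_t n)
{
    int32_t cur;
    memcpy(&cur, ctx + PROGRESS_OFFSET, sizeof cur);
    return cur > n;
}
```

Because the comparison is strict, a gate written as `progress_past(ctx, 4)` stays closed while the counter equals 4 and opens only at the next stage.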

Knob Container Access Pattern

The knob container at +1664 is accessed through a two-level virtual dispatch pattern that appears at 100+ call sites:

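// Fast path: known vtable -> direct array read
uint64_t *v2 = *(uint64_t **)(ctx + 1664);                          // knob container
bool (*query)(void *, int) = *(bool (**)(void *, int))(*v2 + 72);  // vtable entry at byte +72
if (query == sub_6614A0)
    // offset = call-site-specific byte offset inside the 72-byte knob record
    result = *(uint8_t *)(v2[9] + knob_index * 72 + offset) != 0;
else
    result = query(v2, knob_index);                                 // slow path: derived container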

The fast path reads directly from the knob value array at v2[9] (offset +72 of the knob state object), where each knob value occupies 72 bytes. The slow path invokes the virtual method for derived knob containers.
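
The devirtualization check above can be reproduced in a small self-contained model. The struct layout here is deliberately simplified and does not match the real object (where the query sits at vtable byte +72 and the value array pointer at object byte +72); `base_query` stands in for sub_6614A0 and `always_on_query` is a hypothetical derived override.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { KNOB_STRIDE = 72 };   /* bytes per knob record, as described above */

struct knob_container;
typedef bool (*knob_query_fn)(struct knob_container *, int);

struct knob_vtable { knob_query_fn query; };

struct knob_container {
    const struct knob_vtable *vt;
    uint8_t *values;         /* knob_count * KNOB_STRIDE bytes */
};

/* Stand-in for the base implementation (sub_6614A0 in the text). */
static bool base_query(struct knob_container *kc, int idx)
{
    return kc->values[(size_t)idx * KNOB_STRIDE] != 0;
}

/* Hypothetical override representing a derived knob container. */
static bool always_on_query(struct knob_container *kc, int idx)
{
    (void)kc;
    (void)idx;
    return true;
}

static bool knob_enabled(struct knob_container *kc, int idx)
{
    knob_query_fn q = kc->vt->query;
    if (q == base_query)     /* fast path: skip the indirect call */
        return kc->values[(size_t)idx * KNOB_STRIDE] != 0;
    return q(kc, idx);       /* slow path: derived container decides */
}
```

The comparison against a known function address is what lets the common case avoid an indirect call while still honoring overrides.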

Function Context (at +1880)

When a function is under compilation, +1880 points to a large context object containing 17 pairs of analysis-result data structures. Each pair consists of a sorted container and a hash map, holding results such as live ranges, register maps, and scheduling data. The cleanup code in sub_7FB6C0 destroys pairs at qword offsets [102, 97, 92, 87, 82, 77, 72, 67, 62, 57, 52, 47, 42, 36, 31, 26, 21] from the context base, then handles reference-counted objects at offsets [10] and [2].

Opcode Attribute Table (at +776 / +784)

The field at context+776 holds a pointer to a per-opcode attribute bitmap, and context+784 holds the owning allocator. Before this investigation, the ptxas wiki treated both fields as opaque "linked list slots" because they sit adjacent to the call-graph region and the ctor writes to them look like list sentinels. They are neither: they are the storage for a static ~355-entry flag table that every legalization, ISel, and scheduling phase consults to decide how to handle each Ori opcode.

Allocation pattern, reconstructed from sub_7F7DC0_0x7f7dc0.c:904-934:

// Ctor line 904: get master allocator
void *alloc = *(void**)(ctx + 16);

// Ctor line 905: allocate a 1428-byte buffer through the allocator vtable entry at byte offset +24 (24 / 8 = slot 3)
uint8_t *buf = alloc_vtable->allocate(alloc, 1428);

// Ctor line 906: the first 4 bytes of the buffer hold the value 355
// This is NOT an entry count -- the buffer is always 1428 bytes
// regardless of 355 -- it appears to be an opcode-class marker or
// table-version constant checked by the deallocator.
*(uint32_t*)buf = 355;

// Ctor lines 909-927: clear all payload bytes in a 4-qword stride loop
// `v143 = v140 + 178` (= buf + 1424 bytes), so the loop walks indices
// 8, 40, 72, ... 1400 (stride 32). Each iteration touches +0, +8, +16
// (bit-masking the third qword with `&= 0xF8u`, preserving bits 3-7).
// This means each logical entry is **32 bytes** and only the first
// 24 bytes are in the "clean zero" state after init; the fourth qword
// holds pre-set flag bits.

// Ctor lines 932-933: stash pointer and allocator (pointer skips the
// 8-byte header, so `opcode_attr_table[0]` is the bitmap for opcode 0)
*(void**)(ctx + 776) = buf + 8;
*(void**)(ctx + 784) = alloc;

Per-opcode write pattern (ctor lines 934-1229): the ctor issues hundreds of table[byte_offset] |= mask statements, one for each opcode's combination of attributes. A sampling:

| Byte offset | OR mask | Inferred attribute |
|---|---|---|
| [17], [25], [29] | 0x04, 0x04, 0x02 | Low-opcode (class 0-10) canary bits |
| [64], [65] | 0x0C, 0x10 | Memory-access opcode pair (load/store) |
| [1200], [1216], [1260], [1272] | 0x9C | Cluster-group marker (bit pattern shared by cluster-launch-related opcodes) |
| [1201], [1217], [1261], [1273] | 0x11 | Pairs with 0x9C: "has synchronizing behavior" |
| [85], [209], [237], [313], ... | 0x10 | "Requires late expansion" gate |
| [84], [232], [504], [1038] | 0x1E | "Requires late expansion + emits diagnostic" composite |
| [689], [741], [828], [1084] | 0x40 | "Cluster-barrier-aware" (bit 6) |
| [828], [1084], [1225] | 0x80 | "Serializing opcode" sign-bit |
| [1004], [1016], [1152] | 0x80 | Same sign-bit (mirrored to compute-focused opcodes) |

Read pattern (observed shape in callers, to be reverse-engineered per phase):

// Gating pattern used by legalization phases:
uint8_t *attr_table = *(uint8_t**)(ctx + 776);
uint8_t flags = attr_table[ori_opcode];  // simplified: the real byte index depends on the per-opcode stride, which is not yet confirmed
if (flags & NEEDS_LATE_EXPANSION)
    schedule_late_expansion(insn);

Each opcode appears to occupy between 1 and 4 bytes in the table, with the exact stride determined by the opcode's class. Because the ctor writes span byte offsets up to 1426 (out of 1420 usable bytes after the 8-byte header) and 355 opcodes * 4 bytes = 1420 bytes, the most likely layout is 4 attribute bytes per opcode -- giving 32 bits of per-opcode flags, of which ~8-10 bits are actively used in the ctor.

Deallocator path (sub_7F7DC0:928-930): when the ctx is freed, the code recovers the original base via v145 = *(_QWORD*)(a1+776); ... free(v145 - 8) -- this is the evidence that +776 points 8 bytes past the allocation base. Without this offset correction, a naive free(*(a1+776)) would corrupt the allocator's metadata.
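
The header-skipping pointer and the matching `- 8` correction on free can be sketched with the standard allocator. This is an illustration under assumptions: `calloc`/`free` replace the arena allocator at ctx+16, the whole buffer is zeroed rather than reproducing the ctor's partial clear, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum {
    ATTR_ALLOC_BYTES  = 1428,  /* total allocation, header included */
    ATTR_HEADER_BYTES = 8,     /* skipped by the stored pointer */
    ATTR_MARKER       = 355    /* value written to the first 4 bytes */
};

/* Returns the pointer that would be stored at ctx+776 (base + 8). */
static uint8_t *attr_table_create(void)
{
    uint8_t *base = calloc(1, ATTR_ALLOC_BYTES);
    if (base == NULL)
        return NULL;
    uint32_t marker = ATTR_MARKER;
    memcpy(base, &marker, sizeof marker);
    return base + ATTR_HEADER_BYTES;
}

/* Reads the marker back from the hidden header. */
static uint32_t attr_table_marker(const uint8_t *table)
{
    uint32_t marker;
    memcpy(&marker, table - ATTR_HEADER_BYTES, sizeof marker);
    return marker;
}

/* Frees the original base; passing `table` itself would be undefined behavior. */
static void attr_table_destroy(uint8_t *table)
{
    if (table != NULL)
        free(table - ATTR_HEADER_BYTES);
}
```

Keeping the header behind the visible pointer means every reader can index the table from zero, at the cost of requiring the owner to remember the adjustment on release.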

The table should be treated as partial: each individual bit's meaning needs confirmation by cross-referencing a phase-execute body that actually tests it. The bit positions and inferred roles above come from the ctor init pattern alone.

Ori Code Object (~1136 bytes)

The Code Object is the per-function container for all IR data. One instance exists for each function under compilation. Constructor is at sub_A3B080, vtable at 0x21EE238.

Constructor Analysis

The constructor (sub_A3B080) takes two arguments: a1 (the Code Object to initialize) and a2 (the compilation context). It:

  1. Sets +8 = a2 (back-pointer to compilation context)
  2. Sets +0 = &unk_21EE238 (vtable)
  3. Zeroes approximately 250 distinct fields across the 1136-byte range
  4. Loads two SSE constants from xmmword_2027600 and xmmword_21EFAE0 into offsets +96 and +112 (likely default register file descriptors or encoding parameters)
  5. Reads a2+1412 and a2+1418 to set mode flags at +1101 and +1008
  6. Accesses the knob container at a2+1664 to query knob 367 for initial configuration
  7. Sets +1008 = 0x300000050 (default) or 0x400000080 (if a2+1418 & 4)

Code Object Field Map

| Offset | Type | Field | Evidence / Notes |
|---|---|---|---|
| +0 | vtable* | vtable | 0x21EE238, 263+ virtual methods |
| +8 | ptr | compilation_ctx | Back-pointer to owning compilation context |
| +16 | u128 | (zeroed) | SSE zero-store in constructor |
| +24 | u32 | sm_version | Encoded SM target (12288=sm30, 20481=sm50, 36865=sm90) |
| +32 | u128 | (zeroed) | SSE zero-store |
| +48 | u128 | (zeroed) | SSE zero-store |
| +64 | u32 | init_flags | Zeroed in constructor |
| +72 | ptr | code_buf | Output code buffer |
| +80 | u128 | (zeroed) | |
| +88 | ptr | reg_file | Register descriptor array: `*(ctx+88) + 8*regId` |
| +96 | u128 | reg_defaults_1 | Loaded from xmmword_2027600 |
| dword[99] | u32 | ur_count | Uniform register (UR) count. DWORD index as used by the stats emitter (byte +396), not a byte offset |
| dword[102] | u32 | r_alloc | R-register allocated count. DWORD index (byte +408), not a byte offset |
| +112 | u128 | reg_defaults_2 | Loaded from xmmword_21EFAE0 |
| +128--175 | u128[3] | (zeroed) | SSE zero-stores |
| +152 | ptr | sym_table | Symbol/constant lookup array |
| dword[159] | u32 | r_reserved | R-register reserved count. DWORD index (byte +636), not a byte offset |
| +176 | ptr | (zeroed) | |
| +184 | u32 | (zeroed) | |
| +192 | ptr | (zeroed) | |
| +200 | u128 | (zeroed) | |
| +216 | u128 | (zeroed) | |
| +232 | u32 | (zeroed) | |
| +236 | u32 | (zeroed) | |
| +240 | ptr | (zeroed) | |
| +248 | u128 | (zeroed) | |
| +264 | u128 | (zeroed) | |
| +272 | ptr | instr_head | Instruction linked-list head |
| +280 | u32 | (zeroed) | |
| +288 | ptr | (zeroed) | |
| +296 | ptr | bb_array | `BasicBlock**` -- dense array of pointers to heap BB objects (8-byte stride). Indexed `*(ctx+296) + 8*bix` in sub_781F80:339, sub_78B430:107, sub_1908D90:21. |
| +304 | u32 | bb_index | Current basic block count (iteration bound: `for (i=0; i<=ctx[+304]; i++)`) |
| +312 | ptr | options | `OptionsManager*` for knob queries |
| +320--359 | u128[3] | (zeroed) | |
| dword[335] | u32 | instr_hi | Instruction count upper bound. DWORD index (byte +1340), not a byte offset |
| dword[336] | u32 | tex_inst_count | Texture instruction count (stats emitter). DWORD index (byte +1344) |
| dword[338] | u32 | fp16_vect_inst | FP16 vectorized instruction count. DWORD index (byte +1352) |
| dword[340] | u32 | inst_pairs | Instruction pair count. DWORD index (byte +1360) |
| dword[341] | u32 | instr_lo | Instruction count lower bound. DWORD index (byte +1364) |
| dword[342] | u32 | tepid_inst | Tepid instruction count. DWORD index (byte +1368) |
| +360 | ptr | (zeroed) | |
| +368 | u32 | sub_block_flags | |
| +372 | u32 | instr_total | Total instruction count (triggers chunked scheduling at > 0x3FFF) |
| +376 | u32 | (zeroed) | |
| +384--416 | ptr[5] | (zeroed) | |
| +424 | u32 | (zeroed) | |
| +432 | ptr | (zeroed) | |
| +440 | u32 | (zeroed) | |
| +448 | ptr | (zeroed) | |
| +464 | ptr | (zeroed) | |
| +472 | u8 | (zeroed) | |
| +473 | u8 | (zeroed) | |
| +536 | u32 | (zeroed) | |
| +540 | u32 | (zeroed) | |
| +648 | ptr | succ_map | CFG successor edge hash table |
| +680 | ptr | backedge_map | CFG backedge hash table |
| +720 | ptr | rpo_array | Reverse post-order array (`int*`) |
| +728 | ptr | bitmask_array | Grow-on-demand bitmask array for scheduling |
| +740 | u32 | bitmask_capacity | Capacity of bitmask array |
| +752 | ptr | (zeroed) | |
| +760 | u32 | (zeroed) | |
| +764 | u32 | (zeroed) | |
| +768 | ptr | const_sections | Constant memory section array |
| +772 | u8 | (zeroed) | |
| +776 | ptr | smem_sections | Shared memory section array |
| +976 | ptr | block_info | Inline array of 40-byte scheduling-side block descriptors. Parallel to bb_array but distinct storage. Allocated/grown by sub_10AE800 (40 * capacity bytes via vtable slot +24). |
| +984 | i32 | num_blocks | High-water block index (iteration bound for the 40-byte array; stride = 40). |
| +988 | i32 | block_info_capacity | Capacity of the 40-byte array (grown with 3/2 policy in sub_10AE800:37). |
| +996 | u32 | annotation_offset | Current offset into annotation buffer (sub_A4B8F0) |
| +1000 | ptr | annotation_buffer | Annotation data buffer (sub_A4B8F0) |
| +1008 | u64 | encoding_params | Default 0x300000050 or 0x400000080 |
| +1016 | ptr | (zeroed) | |
| +1024 | u32 | (zeroed) | |
| +1032 | ptr | (zeroed) | |
| +1040 | ptr | (zeroed) | |
| +1064 | ptr | (zeroed) | |
| +1080 | u128 | (zeroed) | |
| +1096 | u32 | (zeroed) | |
| +1100 | u8 | (zeroed) | |
| +1101 | u8 | optimization_mode | Set from knob 367 and compilation_ctx+1412 |
| +1102 | u8 | (zeroed) | |
| +1104 | ptr | (zeroed) | |
| +1120 | u128 | (zeroed) | |

Register Count Formula

From the stats emitter at sub_A3A7E0 and the register count function at sub_A4B8F0 (which both use vtable+2104 dispatch with sub_859FC0 as the fast path):

total_R_regs      = code_obj[159] + code_obj[102]   // reserved + allocated
instruction_count = code_obj[335] - code_obj[341]   // upper - lower

Stats Emitter Field Map

The stats emitter (sub_A3A7E0) accesses a per-function stats record through the SM backend: v3 = *(compilation_ctx+8)[198] (offset +1584 from the outer compilation context points to the SM backend object; the emitter then reads per-function stats fields within it). It uses DWORD indexing (4-byte), and reveals these additional fields:

| DWORD Index | Byte Offset | Field | Stat String / Notes |
|---|---|---|---|
| 8 | +32 | est_latency | [est latency = %d] |
| 10 | +40 | worst_case_lat | [worstcaseLat=%f] |
| 11 | +44 | avg_case_lat | [avgcaseLat=%f] |
| 12 | +48 | spill_bytes | [LSpillB=%d] |
| 13 | +52 | refill_bytes | [LRefillB=%d] |
| 14 | +56 | s_refill_bytes | [SRefillB=%d] |
| 15 | +60 | s_spill_bytes | [SSpillB=%d] |
| 16 | +64 | low_lmem_spill | [LowLmemSpillSize=%d] |
| 17 | +68 | frame_lmem_spill | [FrameLmemSpillSize=%d] |
| 18 | +72 | non_spill_bytes | [LNonSpillB=%d] |
| 19 | +76 | non_refill_bytes | [LNonRefillB=%d] |
| 20 | +80 | non_spill_size | [NonSpillSize=%d] |
| 26 | +104 | occupancy (float) | [Occupancy = %f] |
| 27 | +108 | div_branches | [est numDivergentBranches=%d] |
| 28 | +112 | attr_mem_usage | [attributeMemUsage=%d] |
| 29 | +116 | program_size | [programSize=%d] |
| 42 | +168 | precise_inst | [Precise inst=%d] |
| 44 | +176 | udp_inst | [UDP inst=%d] |
| 45 | +180 | vec_to_ur | [numVecToURConverts inst=%d] |
| 49 | +196 | max_live_suspend | [maxNumLiveValuesAtSuspend=%d] |
| 87 | +348 | partial_unroll | [partially unrolled loops=%d] |
| 88 | +352 | non_unrolled | [non-unrolled loops=%d] |
| 89 | +356 | cb_bound_tex | [CB-Bound Tex=%d] |
| 90 | +360 | partial_bound_tex | [Partially Bound Tex=%d] |
| 91 | +364 | bindless_tex | [Bindless Tex=%d] |
| 92 | +368 | ur_bound_tex | [UR-Bound Tex=%d] |
| 93 | +372 | sm_version_check | > 24575 triggers UR reporting |
| 99 | +396 | ur_count_stats | [urregs=%d] |
| 102 | +408 | r_alloc | R-register allocated count |
| 159 | +636 | r_reserved | R-register reserved count |
| 303 | +1212 | est_fp | [est fp=%d] |
| 306 | +1224 | est_half | [est half=%d] |
| 307 | +1228 | est_transcendental | [est trancedental=%d] |
| 308 | +1232 | est_ipa | [est ipa=%d] |
| 310 | +1240 | est_shared | [est shared=%d] |
| 311 | +1244 | est_control_flow | [est controlFlow=%d] |
| 315 | +1260 | est_load_store | [est loadStore=%d] |
| 316 | +1264 | est_tex | [est tex=%d] |
| 334 | +1336 | inst_pairs | [instPairs=%d] |
| 335 | +1340 | instr_hi | Instruction count upper bound |
| 336 | +1344 | tex_inst_count | [texInst=%d] |
| 337 | +1348 | fp16_inst | [FP16 inst=%d] |
| 338 | +1352 | fp16_vect_inst | [FP16 VectInst=%d] |
| 339 | +1356 | inst_hint | [instHint=%d] |
| 340 | +1360 | inst_pairs_2 | checked for non-zero to print instHint line |
| 341 | +1364 | instr_lo | Instruction count lower bound |
| 342 | +1368 | tepid_inst | [tepid=%d] |

Note: The stats emitter accesses the Code Object through a float pointer (v3), so DWORD indices map to byte offsets via index * 4 for integer and float fields alike. Float fields at indices 9, 26, 50, 54, 57, 58, 59, 61, 62, 65, 84, 85, 86 hold throughput and occupancy metrics. A linked list at qword index 55 (byte +440) holds additional string annotations.
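
The DWORD-index addressing and the register count formula can be sketched together. The helper names are hypothetical, and the record is modelled as an opaque byte buffer; `memcpy` is used for the loads so that integer and float fields can be read from the same storage without aliasing problems.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* DWORD index -> byte offset (4-byte stride for both int and float fields). */
static size_t stat_byte_offset(size_t dword_index)
{
    return dword_index * 4;
}

static int32_t stat_read_i32(const void *rec, size_t dword_index)
{
    int32_t v;
    memcpy(&v, (const unsigned char *)rec + stat_byte_offset(dword_index), sizeof v);
    return v;
}

static float stat_read_f32(const void *rec, size_t dword_index)
{
    float v;
    memcpy(&v, (const unsigned char *)rec + stat_byte_offset(dword_index), sizeof v);
    return v;
}

/* total_R_regs = reserved (index 159) + allocated (index 102) */
static int32_t total_r_regs(const void *rec)
{
    return stat_read_i32(rec, 159) + stat_read_i32(rec, 102);
}

/* instruction_count = upper bound (index 335) - lower bound (index 341) */
static int32_t instruction_count(const void *rec)
{
    return stat_read_i32(rec, 335) - stat_read_i32(rec, 341);
}
```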

Basic Block Representation (two parallel structures)

ptxas uses two separate basic-block containers that coexist in the Code Object, and an earlier draft of this wiki conflated them into a single "40-byte BasicBlock" struct. The conflation is the source of apparent contradictions between this page and the per-pass documentation (which accesses offsets like bb+128, bb+144, bb+152, bb+232, bb+280, bb+292 -- all far beyond 40 bytes). The reality is:

  1. bb_array at Code Object +296 -- a dense BasicBlock** table (8-byte stride), i.e. one pointer per block to a heap-allocated full BasicBlock object (≥293 bytes). Used by every optimization pass that needs CFG structure (predecessors, successors, RPO, flags, loop attributes).
  2. block_info at Code Object +976 -- an inline contiguous array of 40-byte scheduling descriptors (40-byte stride). Each 40-byte entry is the scheduling / DOT-dumper view of a block and is not a BasicBlock -- it carries an instruction-range bracket (head / tail-sentinel), the block index, and a flag byte.

The two structures are parallel: index i in bb_array and index i in block_info describe the same logical block. Count-wise, bb_array[0..ctx[+304]] is the iteration range (inclusive upper bound), and block_info[0..ctx[+984]] is the iteration range for the 40-byte array (also inclusive). The two counts are set independently but remain in lock-step because the creation paths update both.

The 40-byte block_info entry (at +976)

This is the only structure in ptxas that is actually 40 bytes wide. It is allocated by sub_10AE800 (the block-info appender), which grows the array with new_cap = max(cap + (cap+1)/2, count + 2) and copies the existing entries, 40 * (count + 1) bytes, on reallocation. From sub_10AE800:61:

// sub_10AE800 -- block_info appender
result = (__m128i *)&base[40 * new_count];   // new entry address
*result               = xmm_a7;              // +0..+15  (16 bytes, __int128 arg a7)
result[1]             = xmm_a8;              // +16..+31 (16 bytes, __int128 arg a8)
result[2].m128i_i64[0]= scalar_a9;           // +32..+39 ( 8 bytes, __int64   arg a9)
// Grow path (same function, lines 37-55):
//   v13 = max(cap + (cap+1)/2, count + 2)
//   memcpy(new_buf, old_buf, 40 * (old_count + 1));   // emitted as 8 * (5*count + 5)

Field interpretation, cross-checked against the three primary consumers:

OffsetWidthFieldEvidence
+0ptrinsn_head -- first instruction of the block (scheduling view)sub_1C348B0:129 reads v80 = *v79 then iterates instructions until reaching v79[1]; sub_BE21D0:41 reads *(_QWORD*)v11 as a pointer and fetches a _DWORD at *v11+152 for the DOT label.
+8ptrinsn_tail_sentinel -- marks end of the scheduling instruction rangesub_1C348B0:130 loads v79[1] as the walk terminator; sub_6FC810:728 writes v37[+8] = v12 immediately after sub_10AE800 returns the new entry.
+16u64reserved_a -- written by sub_10AE800 from a8.lo but no consumer has been identified; zero in the common path.
+20i32reserved_b -- zeroed immediately after append (sub_6FC810:727: *(_DWORD*)(v37+20) = 0).
+24i32scheduling scratch -- sub_6FC810:726 writes 0; the scheduling / regalloc pipeline later stashes per-block scratch state here.
+28i32bix -- block index, the same unique ID used in all CFG hash tablessub_BE21D0:39: v12 = v11[7] (DWORD index 7 = byte +28) then printf("bix%u", v12) in the DOT dumper.
+32u8flags -- bit 1 (0x02) = "block ends in branch-with-side-effect 1506 opcode"sub_BE0690:1467: `(_BYTE)(v126+32)
+33u8[7]padding / future-use bytes up to the 40-byte stride

Size proof: the appender writes exactly 40 * n bytes, the DOT dumper advances its cursor by literally v9 += 40 per iteration (sub_BE21D0:38), the last-element helper sub_10AE8E0 computes base + 40 * num_blocks, and the grow-path memcpy copies 8 * (5*count + 5) = 40 * (count+1) bytes. Every independent site agrees on stride 40.

40-byte block_info entry layout

  +0   ptr  insn_head               // scheduling-range first instruction
  +8   ptr  insn_tail_sentinel      // scheduling-range terminator
  +16  u64  reserved_a
  +24  i32  scheduling_scratch
  +28  i32  bix                     // unique block index
  +32  u8   flags                   // bit 0x02 set by sub_BE0690 on side-effect terminators
  +33  u8[7] padding

Allocation pseudocode:

function appendBlockInfo(ctx, insn_head, insn_tail, bix, flags):
    // Grow inline array if needed (sub_10AE800)
    count = ctx[+984]
    cap   = ctx[+988]
    if count + 2 > cap:
        new_cap = max(cap + (cap + 1) / 2, count + 2)
        new_buf = arena_alloc(40 * new_cap)       // via code-object allocator vtable+24
        memcpy(new_buf, ctx[+976], 40 * (count + 1))   // entries 0..count (count is an inclusive index)
        arena_free(ctx[+976])
        ctx[+976] = new_buf
        ctx[+988] = new_cap

    // Write the new entry
    ctx[+984] = count + 1                         // new high-water index
    entry     = ctx[+976] + 40 * (count + 1)
    entry[+0] = insn_head
    entry[+8] = insn_tail
    entry[+24]= 0
    entry[+28]= bix
    entry[+32]= flags                             // usually 0 at creation
    return entry
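
A compilable sketch of the appender, for readers who want to experiment with the growth policy. It rests on several assumptions: `malloc`/`free` replace the arena allocator, the table starts empty with a high-water index of -1, pointers are 8 bytes (LP64), and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct block_info {
    void    *insn_head;           /* +0  */
    void    *insn_tail_sentinel;  /* +8  */
    uint64_t reserved_a;          /* +16 */
    int32_t  scratch;             /* +24 */
    int32_t  bix;                 /* +28 */
    uint8_t  flags;               /* +32 */
    uint8_t  pad[7];              /* +33 */
} block_info;

static_assert(sizeof(block_info) == 40, "block_info stride must be 40 bytes");

typedef struct block_table {
    block_info *base;   /* Code Object +976 */
    int32_t     last;   /* +984: inclusive high-water index (-1 = empty, assumed) */
    int32_t     cap;    /* +988 */
} block_table;

static block_info *block_info_append(block_table *t, void *head, void *tail,
                                     int32_t bix)
{
    int32_t count = t->last;
    if (count + 2 > t->cap) {
        int32_t grown   = t->cap + (t->cap + 1) / 2;
        int32_t new_cap = grown > count + 2 ? grown : count + 2;
        block_info *nb  = malloc(sizeof *nb * (size_t)new_cap);
        if (nb == NULL)
            return NULL;
        if (t->base != NULL)    /* copy entries 0..count */
            memcpy(nb, t->base, sizeof *nb * (size_t)(count + 1));
        free(t->base);
        t->base = nb;
        t->cap  = new_cap;
    }
    block_info *e = &t->base[count + 1];
    memset(e, 0, sizeof *e);    /* reserved, scratch and flags start at 0 */
    e->insn_head          = head;
    e->insn_tail_sentinel = tail;
    e->bix                = bix;
    t->last               = count + 1;
    return e;
}
```

Because the check is `count + 2 > cap`, the slot at index `count + 1` is always inside the buffer after the grow step.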

The heap BasicBlock object (at *(ctx+296)[bix])

The entries of bb_array point to a much larger heap object. The size has not been pinned down to a single allocator call (it is not created by the 136-byte scratch routine sub_62BB00, whose buffer is sub_4248B0'd back to the arena on the normal exit at line 551), but the minimum size is bounded from below by the field accesses performed by the CFG / liveness / loop passes. The highest confirmed offsets all come from sub_781F80 (BasicBlockAnalysis) through the verified bb_array[] indirection *(_QWORD*)(*(_QWORD*)(a1+296) + 8*bix):

OffsetWidthPass accessField (from CFG / liveness docs)
+8ptrsub_78B430:110 (**(_QWORD**)(v13+8)+72 -- first-instr opcode)instruction list head
+120u32sub_781F80:342 (= 0)scheduling-state scratch dword
+128u128sub_781F80:344 (*(_OWORD*)(v16+128) = 0)successor list head + aux qword
+136ptrsub_78B430:112 (*(__int64***)(v13+136) -- walk preds)predecessor list head
+144u128sub_781F80:343 (*(_OWORD*)(v16+144) = 0), sub_78B430:107 (rpo_number)RPO number + adjacent metadata (16 bytes)
+152i32sub_781F80:770 (*(v20+152) = v163 where v163 = pred->rpo_number)loop-exit RPO marker / label id (dual-purpose; BBAnalysis overwrites during the pass)
+216i32sub_781F80:731 (v102 = *(int*)(v21+216))operand-side scratch (only reached via ctx+368 not ctx+296, so this may belong to a different struct; flagged here for completeness)
+232i32sub_781F80:341 (= 0)per-BB zeroed dword
+280i32sub_781F80:340,536,540,553,560,603,906,925,1134, sub_781F80:1264, sub_78B430:*primary BB flags dword (bits: 0x10 loop header, 0x20 has predecessor, 0x800000 in-loop, 0x20000 / 0x40000 / 0x40000000 analysis bits)
+282u8sub_781F80:908 ((*(_BYTE*)(v20+282) & 8) != 0)high byte of the +280 flags dword (byte-level test)
+292u8sub_781F80:602,733,904 (bitwise OR / AND)secondary flag byte (paired with +280)

The access at offset +292 (a byte, written with |= 8) sets the lower bound on the BasicBlock size at ≥ 293 bytes, and the natural alignment of the arena allocator rounds this up to a multiple of 8 (so the next valid allocator bucket is 296 bytes). The earlier "BasicBlock = 40 bytes" claim is wrong and was the result of describing the 40-byte block_info entry as if it were the full block object.

The previous revision of this section also misattributed the scheduling-pass initializer sub_8D0640 to the 40-byte array. That was wrong: sub_8D0640 walks a separate linked list rooted at scheduling_ctx[+104] (for (i = *(v21+104); i; i = (__int64*)*i)), with the zeroing pattern i[7] = 0, i[13] = 0, *((_DWORD*)i+19) = 0, *((_DWORD*)i+21) = -1. This linked list stores per-scheduling-group records (qword fields at +56, +104; dword fields at +76, +84), not block_info entries. The 40-byte entries are never rewritten in a single pass like that -- they are populated incrementally during CFG construction via sub_10AE800 and mutated in-place by sub_BE0690 / sub_8A5240 when backedge analysis needs to mark a terminator.

Access cheat sheet

// ctx is the Code Object viewed as a char* so that every offset below is in bytes.

// Iterate every block (optimization / CFG passes)
int bb_count = *(int *)(ctx + 304);                        // inclusive upper bound
char **bb_array = *(char ***)(ctx + 296);                  // pointer table, 8-byte stride
for (int i = 0; i <= bb_count; i++) {
    char *bb  = bb_array[i];                               // heap BasicBlock, byte-addressed
    int rpo   = *(int *)(bb + 144);                        // rpo_number
    int flags = *(int *)(bb + 280);                        // primary flags dword
    ...
}

// Iterate every block (scheduling / DOT dumper)
int n = *(int *)(ctx + 984);                               // inclusive upper bound
char *base = *(char **)(ctx + 976);
for (int i = 0; i <= n; i++) {
    char *entry = base + 40 * i;                           // 40-byte stride, inline
    int bix = *(int *)(entry + 28);
    void *insn_head = *(void **)(entry + 0);
    ...
}

Instruction Layout

Instructions are polymorphic C++ objects linked into per-BB doubly-linked lists. The instruction format is detailed in Instructions; this section covers only the structural linkage.

Each instruction carries a unique integer ID at +16, an opcode at +72 (the peephole optimizer masks with & 0xCF on byte 1 to strip modifier bits), and a packed operand array starting at +84. The operand count is at +80. Operands are 8 bytes each.

Packed Operand Format

 31   30      28  27                    20  19                  0
+---+-----------+-------------------------+---------------------+
| ? |   type    | modifier bits (8 bits)  |  index (20 bits)    |
+---+-----------+-------------------------+---------------------+
  bit 31 lies outside the 3-bit type mask (& 7) used below; its role is not established here

Extraction (50+ confirmed sites):
  uint32_t operand = *(uint32_t*)(instr + 84 + 8 * i);
  int type    = (operand >> 28) & 7;     // bits 28-30
  int index   = operand & 0xFFFFF;       // bits 0-19
  int mods    = (operand >> 20) & 0xFF;  // bits 20-27

| Type Value | Meaning | Resolution |
|---|---|---|
| 1 | Register operand | Index into `*(code_obj+88)` register file |
| 5 | Symbol/constant operand | Index into `*(code_obj+152)` symbol table |
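
The field extraction can be wrapped in a small decoder for experimentation. `operand_decode` mirrors the three shifts and masks shown above; `operand_encode` is a hypothetical inverse added only to build test values and is not something observed in the binary.

```c
#include <assert.h>
#include <stdint.h>

typedef struct operand_fields {
    unsigned type;    /* bits 28-30 */
    unsigned mods;    /* bits 20-27 */
    uint32_t index;   /* bits 0-19  */
} operand_fields;

static operand_fields operand_decode(uint32_t word)
{
    operand_fields f;
    f.type  = (word >> 28) & 0x7u;
    f.mods  = (word >> 20) & 0xFFu;
    f.index = word & 0xFFFFFu;
    return f;
}

/* Hypothetical inverse, for constructing test words only. Bit 31 is left clear. */
static uint32_t operand_encode(unsigned type, unsigned mods, uint32_t index)
{
    assert(type <= 0x7u && mods <= 0xFFu && index <= 0xFFFFFu);
    return ((uint32_t)type << 28) | ((uint32_t)mods << 20) | index;
}
```

Note that the decoder ignores bit 31, so two words that differ only in that bit decode to the same three fields.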

The operand classifier functions at 0xB28E00--0xB28E90 provide predicate checks:

| Function | Predicate |
|---|---|
| sub_B28E00 | getRegClass (1023 = wildcard, 1 = GPR) |
| sub_B28E10 | isRegOperand |
| sub_B28E20 | isPredOperand |
| sub_B28E40 | isImmOperand |
| sub_B28E80 | isConstOperand |
| sub_B28E90 | isUReg |

Symbol Table

The symbol table is accessed through Code Object +152. Based on the symbol table builder at sub_621480 (21KB, references a1+30016 for the symbol table base), symbols are stored in a hash-map-backed structure where each symbol has a name and associated properties (address, type, section binding).

Internal Symbol Names

The following internal symbol names appear in decompiled code, indicating the kinds of entities tracked:

| Symbol | Purpose |
|---|---|
| `__ocg_const` | OCG-generated constant data |
| `__shared_scratch` | Shared memory scratch space |
| `__funcAddrTab_g` | Global indirect function call table |
| `__funcAddrTab_c` | Constant indirect function call table |
| `_global_ptr_%s` | Global pointer for named variable |
| `$funcID$name` | Function-local relocation symbol |
| `__cuda_dummy_entry__` | Dummy entry generated by --compile-only |
| `__cuda_sanitizer` | CUDA sanitizer instrumentation symbol |

Symbol Resolution Flow

Symbol resolution (sub_625800, 27KB) traverses the symbol table to resolve references during the PTX-to-Ori lowering and subsequent optimization phases. The format %s[%d] (from sub_6200A0) is used for array-subscripted symbol references, and __$endLabel$__%s markers delimit function boundaries.

Constant Buffer Layout

Constant memory is organized into banks (c[0], c[1], ...) corresponding to the CUDA .nv.constant0, .nv.constant2, etc. ELF sections. The constant section array at Code Object +768 tracks all constant banks for the current function.

Constant Bank Handling

The constant bank handler at sub_6BC560 (4.9KB) manages references to constant memory using the c[%d] (integer bank) and c[%s] (named bank, sw-compiler-bank) notation. It enforces:

  • A maximum constant register count (error: "Constant register limit exceeded; more than %d constant registers")
  • LDC (Load Constant) requires a constant or immediate bank number

ELF Constant Symbols

The ELF symbol emitter (sub_7FD6C0) creates symbols for constant bank metadata:

| Symbol Name | Purpose |
|---|---|
| .nv.ptx.const0.size | Size of constant bank 0 (kernel parameters) |

The constant emission function (sub_7D14C0, 5.6KB) iterates the constant section array and copies bank data into the output ELF sections.

Shared Memory Layout

Shared memory (.nv.shared) allocations are tracked through the shared memory section array at Code Object +776. Reserved shared memory regions are managed by sub_6294E0 (12.1KB) and sub_629E40 (6.1KB).

Reserved Shared Memory Symbols

The ELF emitter recognizes these special symbols for shared memory layout:

| Symbol Name | Purpose |
|---|---|
| .nv.reservedSmem.begin | Start of reserved shared memory region |
| .nv.reservedSmem.cap | Capacity of reserved shared memory |
| .nv.reservedSmem.end | End of reserved shared memory region |
| .nv.reservedSmem.offset0 | First reserved offset within shared memory |
| .nv.reservedSmem.offset1 | Second reserved offset within shared memory |

The --disable-smem-reservation CLI option disables the reservation mechanism. Shared memory intrinsic lowering (sub_6C4DA0, 15KB) validates that shared memory operations use types {b32, b64}.

Descriptor Size Symbols

Additional ELF symbols track texture/surface descriptor sizes in shared memory:

| Symbol Name | Purpose |
|---|---|
| .nv.unified.texrefDescSize | Unified texture reference descriptor size |
| .nv.independent.texrefDescSize | Independent texture reference descriptor size |
| .nv.independent.samplerrefDescSize | Independent sampler reference descriptor size |
| .nv.surfrefDescSize | Surface reference descriptor size |

Pool Allocator

The pool allocator (sub_424070, 3,809 callers) is the single most heavily used allocation function. Every dynamic data structure in ptxas is allocated through pools.

Pool Object Layout

| Offset | Type | Field | Notes |
|---|---|---|---|
| +0 | ptr | large_block_list | Singly-linked list of large (>4999 byte) blocks |
| +32 | u32 | min_slab_size | Minimum slab allocation size |
| +44 | u32 | slab_count | Number of slabs allocated |
| +48 | ptr | large_free_list | Free list for large blocks (boundary-tag managed) |
| +56 | u32 | fragmentation_count | Fragmentation counter (decremented on split) |
| +60 | u32 | max_order | Maximum power-of-2 order for large blocks |
| +64 | ... | (large block free lists) | `a1 + 32*(order+2)` = per-order free list head |
| +2112 | ptr | tracking_map | Hash map for allocation metadata tracking |
| +2128 | ptr[N] | small_free_lists | Size-binned free lists: `*(pool + 8*(size>>3) + 2128)` = head |
| +7128 | mutex* | pool_mutex | `pthread_mutex_t*` for thread safety |

Allocation Paths

Small path (size <= 4999 bytes = 0x1387):

  1. Round size up to 8-byte alignment: aligned = (size + 7) & ~7
  2. Minimum 16 bytes
  3. Compute bin: bin = pool + 8 * (aligned >> 3) + 2128
  4. If bin has a free block: pop from free list, decrement slab available bytes
  5. If bin is empty: allocate a new slab from the parent (size = aligned * ceil(min_slab_size / aligned)), carve into free-list nodes
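
The arithmetic in the small-path steps can be checked with a few helper functions. This sketch covers only the size rounding, the bin offset, and the slab sizing; the helper names are hypothetical, and free-list manipulation and slab bookkeeping are left out.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SMALL_MAX         = 4999,   /* 0x1387: largest request served by the small path */
    SMALL_BINS_OFFSET = 2128    /* byte offset of the size-binned free lists */
};

/* Steps 1-2: round up to 8 bytes, with a 16-byte minimum. */
static size_t small_aligned_size(size_t size)
{
    assert(size <= SMALL_MAX);
    size_t aligned = (size + 7) & ~(size_t)7;
    return aligned < 16 ? 16 : aligned;
}

/* Step 3: byte offset of the bin head inside the pool object. */
static size_t small_bin_offset(size_t aligned)
{
    return 8 * (aligned >> 3) + SMALL_BINS_OFFSET;
}

/* Step 5: slab size = aligned * ceil(min_slab_size / aligned). */
static size_t small_slab_bytes(size_t aligned, size_t min_slab_size)
{
    size_t blocks = (min_slab_size + aligned - 1) / aligned;
    return aligned * blocks;
}
```

Since `aligned` is always a multiple of 8, the bin offset simplifies to `aligned + 2128`, which makes the bin array directly indexable by rounded size.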

Large path (size > 4999 bytes):

  1. Add 32 bytes for boundary tags
  2. Search power-of-2 order free lists starting from log2(size+32)
  3. If found: split block if remainder > 39 bytes, return payload
  4. If not found: call sub_423B60 to grow the pool, allocate new slab from parent
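
The small-path arithmetic above can be sketched in C. This is a minimal illustration rather than the decompiled code: the helper and constant names (small_aligned_size, small_bin_offset, slab_bytes, SMALL_LIMIT, SMALL_BINS_BASE) are hypothetical, and only the formulas are taken from the steps listed above.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SMALL_LIMIT     = 4999,  /* 0x1387: largest request served by the small path */
    SMALL_BINS_BASE = 2128   /* offset of small_free_lists inside the pool object */
};

/* Steps 1-2: round up to a multiple of 8, with a 16-byte minimum. */
static size_t small_aligned_size(size_t size)
{
    assert(size <= SMALL_LIMIT);
    size_t aligned = (size + 7) & ~(size_t)7;
    return aligned < 16 ? 16 : aligned;
}

/* Step 3: byte offset of the bin's free-list head within the pool object. */
static size_t small_bin_offset(size_t size)
{
    return 8 * (small_aligned_size(size) >> 3) + SMALL_BINS_BASE;
}

/* Step 5: new slab size = aligned * ceil(min_slab_size / aligned). */
static size_t slab_bytes(size_t aligned, size_t min_slab_size)
{
    size_t blocks = (min_slab_size + aligned - 1) / aligned;
    return aligned * blocks;
}
```

For example, a 17-byte request rounds up to 24 bytes and maps to the bin head at pool + 2152. With a hypothetical min_slab_size of 4096, the slab carved for that bin would be 4104 bytes (171 blocks of 24 bytes).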

Boundary Tag Format (Large Blocks)

Large blocks use boundary tags for coalescing on free:

Block Header (32 bytes):
  +0    i64      sentinel      // -1 = allocated, else -> next free
  +8    ptr      prev_free     // previous in free list (or 0)
  +16   u64      tag_offset    // 32 (header size)
  +24   u64      payload_size  // user-requested allocation size

Block Footer (32 bytes at end):
  +0    i64      sentinel
  +8    ptr      prev_free
  +16   u64      footer_tag    // 32
  +24   u64      block_size    // total size including headers
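
As a reading aid, the 32-byte tag can be written as a C struct. This is a sketch under assumptions: the type and helper names are hypothetical, a 64-bit target is assumed, and placing the payload immediately after the header is an inference from the layout, not something confirmed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One boundary tag; the header and footer share this shape. */
typedef struct BoundaryTag {
    int64_t  sentinel;   /* +0:  -1 = allocated, otherwise link to next free block */
    void    *prev_free;  /* +8:  previous block in the free list, or NULL */
    uint64_t tag_size;   /* +16: 32 (size of the tag itself) */
    uint64_t size;       /* +24: payload_size in a header, block_size in a footer */
} BoundaryTag;

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit target");
static_assert(sizeof(BoundaryTag) == 32, "tag must be 32 bytes");
static_assert(offsetof(BoundaryTag, size) == 24, "size field sits at +24");

/* A block is in use when its sentinel holds -1. */
static bool block_is_allocated(const BoundaryTag *tag)
{
    return tag->sentinel == -1;
}

/* Assumption: the user payload starts right after the 32-byte header. */
static void *block_payload(BoundaryTag *header)
{
    return (unsigned char *)header + sizeof *header;
}
```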

Slab Descriptor (56 bytes)

Each slab is tracked by a 56-byte descriptor:

Offset  Type  Field
+0      ptr   chain_link
+8      u64   total_size
+16     u64   available_size
+24     ptr   owning_pool
+32     ptr   memory_base
+40     u8    is_small_slab
+44     u32   slab_id
+48     u32   bin_size

Hierarchical Pools

Pools are hierarchical. When sub_424070 is called with a1 = NULL, it falls back to a global allocator (sub_427A10) that uses malloc directly. Non-null a1 values are pool objects that allocate from their own slabs, which are themselves allocated from a parent pool (the TLS context at offset +24 holds the per-thread pool pointer). The top-level pool is named "Top level ptxas memory pool" and is created in the compilation driver.

Hash Map

The hash map (sub_426150 insert / sub_426D60 lookup, 2,800+ and 422+ callers respectively) is the primary associative container in ptxas.

Hash Map Object Layout (~112 bytes)

Offset  Type  Field                  Notes
+0      fptr  hash_func              Custom hash function pointer
+8      fptr  compare_func           Custom compare function pointer
+16     fptr  hash_func_2            Secondary hash (or NULL)
+24     fptr  compare_func_2         Secondary compare (or NULL)
+32     u32   has_custom_compare     Flag
+40     u64   bucket_mask            capacity - 1 for power-of-2 masking
+48     u64   entry_count            Number of stored entries
+64     u64   load_factor_threshold  Resize when entry_count exceeds this
+72     u32   first_free_slot        Tracking for bitmap-based slot allocation
+76     u32   entries_capacity       Capacity of entries array
+80     u32   bitmap_capacity        Capacity of used-bits bitmap
+84     u32   flags                  Hash mode in bits 4-7
+88     ptr   entries                Array of 16-byte {key, value} pairs
+96     ptr   used_bitmap            Bitmap tracking occupied slots
+104    ptr   buckets                Array of pointers to chained index lists

Hash Modes

The hash mode is encoded in bits 4-7 of the flags field at offset +84:

Mode  Flag Bits  Hash Function                                  Use Case
0     0x00       Custom (+0 function pointer)                   User-defined hash/compare
1     0x10       Pointer hash: (key>>11) ^ (key>>8) ^ (key>>5)  Pointer-keyed maps
2     0x20       Identity: key used directly                    Integer-keyed maps
Mode selection happens automatically in the constructor (sub_425CA0): if the hash/compare pair matches (sub_427750, sub_427760), mode 2 is set; if (sub_4277F0, sub_427810), mode 1.

Lookup Algorithm

// Mode 1 (pointer hash) example:
uint64_t hash = (key >> 11) ^ (key >> 8) ^ (key >> 5);
uint64_t bucket_idx = hash & map->bucket_mask;  // bucket_mask = capacity - 1
int32_t* chain = map->buckets[bucket_idx];
while (*++chain != -1) {                // pre-increment skips chain[0]; -1 ends the chain
    entry_t* e = &map->entries[*chain]; // entries are 16-byte {key, value} pairs
    if (key == e->key)
        return e->value;  // found
}
return 0;  // not found

Growth policy: the map doubles capacity and rehashes when entry_count > load_factor_threshold.
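
The mode-1 lookup can be exercised on a cut-down map. The sketch below is illustrative only: MiniMap keeps just the three fields the lookup touches, the names are hypothetical, slot 0 of each chain is treated as an opaque header word (the pseudocode skips it without saying what it holds), and the NULL-chain guard is an added safety check that is not taken from the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Entry { uint64_t key; uint64_t value; } Entry;  /* 16 bytes */

typedef struct MiniMap {
    uint64_t  bucket_mask;  /* capacity - 1 */
    Entry    *entries;      /* entry array indexed by chain values */
    int32_t **buckets;      /* each chain: [header, idx, idx, ..., -1] */
} MiniMap;

static_assert(sizeof(Entry) == 16, "entries are 16-byte pairs");

/* Mode 1 hash: fold three shifted copies of the pointer value together. */
static uint64_t pointer_hash(uint64_t key)
{
    return (key >> 11) ^ (key >> 8) ^ (key >> 5);
}

/* Returns the stored value, or 0 when the key is absent. */
static uint64_t mini_lookup(const MiniMap *map, uint64_t key)
{
    const int32_t *chain = map->buckets[pointer_hash(key) & map->bucket_mask];
    if (chain == NULL)
        return 0;
    while (*++chain != -1) {
        const Entry *e = &map->entries[*chain];
        if (e->key == key)
            return e->value;
    }
    return 0;
}
```

Because a miss and a stored value of 0 both return 0, a caller that needs to tell them apart would need a separate membership test; the function map lists a "contains" routine (sub_426EC0) that presumably serves that role.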

String-Keyed Maps

String-keyed maps use MurmurHash3 (sub_427630, 73 callers) as the hash function. The implementation uses the standard MurmurHash3_x86_32 constants:

Constant  Value                     Standard Name
c1        0xCC9E2D51 (-862048943)   MurmurHash3 c1
c2        0x1B873593 (461845907)    MurmurHash3 c2
fmix1     0x85EBCA6B (-2048144789)  MurmurHash3 fmix
fmix2     0xC2B2AE35 (-1028477387)  MurmurHash3 fmix
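
For reference, the public MurmurHash3_x86_32 algorithm that these constants belong to looks like the following. This is the standard algorithm written out as a sketch, not a transcription of sub_427630; the seed that ptxas passes is not documented here, so it is left as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int r)
{
    return (x << r) | (x >> (32 - r));
}

/* Standard MurmurHash3_x86_32; blocks are read little-endian. */
static uint32_t murmur3_x86_32(const void *key, size_t len, uint32_t seed)
{
    assert(key != NULL || len == 0);
    const uint8_t *p = key;
    const uint32_t c1 = 0xCC9E2D51u, c2 = 0x1B873593u;
    uint32_t h = seed;
    size_t nblocks = len / 4;

    /* Body: mix one 4-byte block at a time. */
    for (size_t i = 0; i < nblocks; i++) {
        uint32_t k = (uint32_t)p[4 * i]
                   | (uint32_t)p[4 * i + 1] << 8
                   | (uint32_t)p[4 * i + 2] << 16
                   | (uint32_t)p[4 * i + 3] << 24;
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xE6546B64u;
    }

    /* Tail: fold in the remaining 1-3 bytes. */
    const uint8_t *tail = p + 4 * nblocks;
    uint32_t k = 0;
    switch (len & 3) {
    case 3: k ^= (uint32_t)tail[2] << 16; /* fall through */
    case 2: k ^= (uint32_t)tail[1] << 8;  /* fall through */
    case 1: k ^= tail[0];
            k *= c1; k = rotl32(k, 15); k *= c2;
            h ^= k;
    }

    /* Finalization (fmix32) uses the two fmix constants from the table. */
    h ^= (uint32_t)len;
    h ^= h >> 16; h *= 0x85EBCA6Bu;
    h ^= h >> 13; h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}
```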

CFG Hash Map (FNV-1a)

The control flow graph uses a separate hash map implementation based on FNV-1a hashing, distinct from the general-purpose hash map above. Two instances exist per Code Object at offsets +648 (successor edges) and +680 (backedge info).

Parameter     Value
Initial hash  0x811C9DC5 (-2128831035)
Prime         16777619 (0x01000193)
Input         4-byte block index, hashed byte-by-byte

Bucket entry: 24 bytes {head, tail, count}. Node: 64 bytes with chain link, key, values, sub-hash data, and cached hash. See CFG for the full CFG hash map specification.
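
A minimal sketch of 32-bit FNV-1a applied to a block index, using the parameters in the table above. The function names are hypothetical, and feeding the four bytes in little-endian order is an assumption (the natural order on x86-64, but not confirmed above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET_BASIS 0x811C9DC5u
#define FNV_PRIME        16777619u   /* 0x01000193 */

/* FNV-1a: XOR each byte into the hash, then multiply by the prime. */
static uint32_t fnv1a_bytes(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t h = FNV_OFFSET_BASIS;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* Hash a 4-byte block index byte-by-byte (little-endian order assumed). */
static uint32_t fnv1a_block_index(uint32_t block_index)
{
    uint8_t bytes[4] = {
        (uint8_t)(block_index & 0xFF),
        (uint8_t)((block_index >> 8) & 0xFF),
        (uint8_t)((block_index >> 16) & 0xFF),
        (uint8_t)((block_index >> 24) & 0xFF)
    };
    return fnv1a_bytes(bytes, sizeof bytes);
}
```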

Linked List

The linked list (sub_42CA60 prepend, 298 callers; sub_42CC30 length, 48 callers) is a singly-linked list of 16-byte nodes:

ListNode (16 bytes, pool-allocated)
  +0    ptr      next        // pointer to next node (NULL = end)
  +8    ptr      data        // pointer to payload object

Prepend allocates a 16-byte node from the pool, sets node->data = payload, and links it at the list head. This is used for function lists, relocation lists, annotation chains, and many intermediate pass-local collections.
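
The prepend and length operations can be sketched as follows. The function names and signatures are hypothetical, and malloc stands in for the pool allocator so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ListNode {
    struct ListNode *next;  /* +0: next node, NULL at the end */
    void            *data;  /* +8: payload pointer */
} ListNode;

static_assert(sizeof(ListNode) == 2 * sizeof(void *), "node is two pointers");

/* Link a new node at the head; returns 0 on success, -1 on allocation failure. */
static int list_prepend(ListNode **head, void *payload)
{
    ListNode *node = malloc(sizeof *node);  /* stand-in for the pool allocator */
    if (node == NULL)
        return -1;
    node->data = payload;
    node->next = *head;
    *head = node;
    return 0;
}

/* Walk the chain and count nodes. */
static size_t list_length(const ListNode *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Since each prepend links at the head, traversal visits payloads in reverse insertion order.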

Growable Array (Pool Vector)

Growable arrays appear throughout the PhaseManager and elsewhere. The layout is a triple of {data_ptr, count, capacity}:

PoolVector (24 bytes inline, or embedded in parent struct)
  +0    ptr      data         // pointer to element array
  +8    i32      count        // current element count
  +12   i32      capacity     // allocated capacity

Growth strategy (confirmed in the PhaseManager timing records): new_capacity = max(old + old/2 + 1, requested) (1.5x growth factor). Elements are typically 8 bytes (pointers) or 16 bytes (pointer pairs). Reallocation uses sub_424C50 (pool realloc, 27 callers).

The PhaseManager uses this pattern for the phase list (16-byte {phase_ptr, pool_ptr} pairs), the name table (8-byte string pointers), and the timing records (32-byte entries).
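
The growth formula is small enough to show directly. The function name is hypothetical, and overflow handling is omitted because the behaviour of the original near INT32_MAX is not documented here.

```c
#include <assert.h>
#include <stdint.h>

/* new_capacity = max(old + old/2 + 1, requested) */
static int32_t next_capacity(int32_t old_capacity, int32_t requested)
{
    assert(old_capacity >= 0 && requested >= 0);
    int32_t grown = old_capacity + old_capacity / 2 + 1;
    return grown > requested ? grown : requested;
}
```

Starting from an empty vector and appending one element at a time, this yields capacities 1, 2, 4, 7, 11, 17, and so on. The +1 term keeps growth moving while old/2 still rounds down to zero.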

Knob Value Array

Knob values are stored in a contiguous array of 72-byte slots, accessed at knob_state[9] + 72 * knob_index (where knob_state[9] is offset +72 of the knob state object).

Knob Value Slot (72 bytes)

Offset  Type  Field
+0      u8    Type tag (0=unset, 1=bool, 2=int, ..., 12=opcode list)
+8      i64   Integer value / pointer to string / linked list head
+16     i64   Secondary value (range max, list count, etc.)
+24     i64   Tertiary value
+64     ptr   Allocator reference

Supported types:

Type             Tag   Storage
Boolean          1     Flag at +0
Integer          2     Value at +8
Integer+extra    3     Value at +8, extra at +12
Integer range    4     Min at +8, max at +16
Integer list     5     Growable array of ints
Float            6     float at +8
Double           7     double at +8
String           8/11  Pointer at +8
When-string      9     Linked list of 24-byte condition+value nodes
Value-pair list  10    Opcode:integer pairs via vtable
Opcode list      12    Opcode names resolved through vtable

Knob Descriptor (64 bytes)

Knob descriptors are stored in a table at knob_state+16, with count at knob_state+24:

Offset  Type  Field
+0      ptr   Primary name (ROT13-encoded)
+8      u64   Primary name length
+16     u32   Type tag
+24     ...   (reserved)
+40     ptr   Alias name (ROT13-encoded)
+48     u64   Alias name length
Stream Object

The output stream used for diagnostics and stats reporting (e.g., at compilation context +1440) is a C++ iostream-like object with operator overloads. Field layout (from sub_7FE5D0 and sub_7FECA0):

Offset  Type     Field
+0      vtable*  vtable (dispatch for actual I/O)
+8      u32      width
+12     u32      precision
+16     u64      char_count
+24     ptr      format_buffer
+56     u32      flags (bit 0=hex, bit 1=oct, bit 2=left-align, bit 3=uppercase, bits 7-8=sign)

ORI Record Serializer (sub_A50650)

The ORI Record Serializer (sub_A50650, 74 KB, 2,728 decompiled lines) is the central function that takes a Code Object's in-memory state and flattens it into a linear output buffer organized as a table of typed section records. It is the serialization backbone for both the DUMPIR diagnostic subsystem and the compilation output path. Despite the _ORI_ string it contains, it is not an optimization pass -- it is infrastructure.

Address      0xA50650
Size         ~74 KB
Identity     CodeObject::EmitRecords
Confidence   0.90
Called from  sub_A53840 (wrapper), sub_AACBF0 / sub_AAD2A0 (DUMPIR diagnostic path)
Calls        sub_A4BC60 (register serializer, new format), sub_A4D3F0 (legacy format), sub_A4B8F0 (register count annotation), sub_A47330 + sub_A474F0 (multi-section finalization), sub_1730890 / sub_17308C0 / sub_17309A0 (scheduling serializers), sub_1730FE0 (register file map)

Parameters

a1 is a serialization state object ("OriRecordContext") that carries the section table, compilation context back-pointer, and per-subsection index/size pairs. a2 is the output buffer write cursor, advanced as data is emitted.

Key fields on a1:

Offset  Type  Field                  Evidence
+8      ptr   compilation_ctx        Dereferenced to reach sm_backend at +1584
+24     i32   header_section_idx     v5 + 32 * (*(a1+24) + 1)
+72     ptr   section_table          Array of 32-byte section entries
+180    u32   instr_counter_1        Reset to 0 at entry
+472    u8    has_debug_info         Gates debug section emission
+916    i32   multi_section_count    > 0 triggers link-record emission and tail call to sub_A47330
+1102   u8    multi_section_enabled  Master flag for multi-section mode
+1120   ptr   scheduling_ctx         Scheduling context for barrier/scope serialization

Section Record Format

Each section occupies a 32-byte entry in the table at *(a1+72) + 32 * section_index:

Offset  Type   Field
+0      u16    type_tag           section type identifier
+4      u32    data_size          byte size of data payload
+8      ptr    data_ptr           pointer to data in output buffer
+16     u32    element_count      number of elements (or auxiliary metadata)
+20     u32    aux_field          additional per-type context
+24     u32    aux_field_2        secondary per-type context

Data payloads are 16-byte aligned: cursor += (size + 15) & ~0xF.
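
The record layout and cursor arithmetic can be written out as a C sketch. The struct mirrors the offsets listed above on a 64-bit target. The helper emit_payload is hypothetical and only illustrates how a record and the write cursor relate; it is not a reconstruction of any routine in the binary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct SectionRecord {
    uint16_t type_tag;       /* +0  section type identifier */
    uint32_t data_size;      /* +4  byte size of the payload */
    void    *data_ptr;       /* +8  payload location in the output buffer */
    uint32_t element_count;  /* +16 */
    uint32_t aux_field;      /* +20 */
    uint32_t aux_field_2;    /* +24 */
} SectionRecord;             /* trailing padding brings the size to 32 */

static_assert(sizeof(void *) == 8, "sketch assumes a 64-bit target");
static_assert(sizeof(SectionRecord) == 32, "entries are 32 bytes");
static_assert(offsetof(SectionRecord, data_size) == 4, "data_size at +4");
static_assert(offsetof(SectionRecord, data_ptr) == 8, "data_ptr at +8");
static_assert(offsetof(SectionRecord, aux_field_2) == 24, "aux_field_2 at +24");

/* Round a payload size up to the next multiple of 16. */
static size_t align16(size_t size)
{
    return (size + 15) & ~(size_t)0xF;
}

/* Hypothetical helper: fill a record, copy the payload, return the new cursor. */
static unsigned char *emit_payload(SectionRecord *rec, uint16_t tag,
                                   unsigned char *cursor,
                                   const void *data, uint32_t size)
{
    rec->type_tag = tag;
    rec->data_size = size;
    rec->data_ptr = cursor;
    memcpy(cursor, data, size);
    return cursor + align16(size);
}
```

The record stores the exact payload size while the cursor advances by the padded size, so consecutive payloads always start on 16-byte boundaries relative to the buffer base.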

Section Type Tag Catalog

The serializer emits up to 56 unique section types across three tag ranges.

Base types (0x01--0x58):

Tag  Hex   Content                                            Evidence
1    0x01  Instruction stream (register-allocated code body)  Emitted via sub_A4BC60 or sub_A4D3F0
3    0x03  Virtual-dispatch section (vtable+48 on state obj)  Conditional on *(a1+64) > 0
16   0x10  Source operand bank (v7[199] entries at v7+97)     *(entry+48) = v7[199]
17   0x11  Destination operand bank (bit-packed from v7+203)  Conditional on !v7[1414]
19   0x13  Annotation stream                                  *(a1+232) counter
34   0x22  Original-definition name table (_ORI_ prefixed)    strcpy(v50, "_ORI_") at line 1762
35   0x23  Instruction info snapshot (340 bytes from v7+4)    qmemcpy of 340 bytes
46   0x2E  Texture/surface binding table                      v7[248] entries, 16 bytes each
50   0x32  Live range interval table (spill map)              From compilation context +984
51   0x33  Register file occupancy table                      *(ctx+1424) & 4
53   0x35  Source operand type bitmap (4-bit per operand)     v7[131] operands, 20-byte stride
54   0x36  Destination operand type bitmap                    v7[134] operands, 20-byte stride
55   0x37  Scheduling barrier data                            via sub_1730890
56   0x38  Register file mapping                              via sub_1730FE0
58   0x3A  Scheduling dependency graph                        via sub_17309A0
59   0x3B  Multi-section link record                          Conditional on *(a1+1102)
64   0x40  External reference (from ctx+2120)                 Pointer stored, no data copy
68   0x44  Performance counter section                        *(a1+932) counter
70   0x46  Spill/fill metadata                                v7[408]
71   0x47  Call graph edge table                              From v7+61, linked list traversal
73   0x49  Codegen context snapshot                           From ctx+932 register allocation state
80   0x50  Hash table section                                 v7+207/208, hash bucket traversal
81   0x51  Extended call info                                 From v7+84
83   0x53  Convergence scope data                             via sub_17308C0
85   0x55  Register geometry record (banks, warps, lanes)     From ctx+1600, writes bank/warp/lane counts
88   0x58  Extended scheduling annotations                    Conditional on *(a1+1088) > 0

Extended types (0x1208--0x1221): Emitted only when *(char*)(ctx+1412) < 0 (that is, when bit 7 of that byte is set), which enables the full post-register-allocation diagnostic mode. The 20 tags listed below carry per-register-class live range and operand definition data:

Tag         Hex             Content
4616        0x1208          Extended operand class 0
4617--4623  0x1209--0x120F  Extended operand classes 1--7
4624        0x1210          Block-level operand summary
4625        0x1211          Live-in vector (12 bytes/element, count at *(a1+668))
4626        0x1212          Live-out vector (12 bytes/element)
4627        0x1213          Extended operand class 8
4628--4629  0x1214--0x1215  Extended operand classes 9--10
4630        0x1216          Memory space descriptor (SM arch > 0x4FFF)
4631        0x1217          Extended scheduling flag (SM arch > 0x4FFF)
4632        0x1218          Instruction hash (ctx+1386 bit 3)
4633        0x1219          Annotation metadata
4640        0x1220          Extended section metadata
4641        0x1221          Optimization level record (from knob system, knob 988)

The _ORI_ Name Prefix

The _ORI_ string is not a pass name. At line 1762 the serializer iterates the linked list at v7+55 (the original-definition chain maintained for rematerialization debugging) and for each entry creates a string "_ORI_<original_name>":

// Lines 1748-1770 (simplified)
for (def = v7->original_defs; def; def = def->next) {
    entry = &section_table[16 * (state->instr_offset + idx)];
    entry->type_tag = 34;      // 0x22: original-definition name
    entry->data_ptr = cursor;
    strcpy(cursor, "_ORI_");           // 5-byte prefix
    strcpy(cursor + 5, def->name);     // original name follows the prefix
    cursor += align16(strlen(def->name) + 21);
}

These names are consumed by the register allocation verifier (sub_A55D80) when it compares pre- and post-allocation reaching definitions. A mismatch triggers the "REMATERIALIZATION PROBLEM" diagnostic (string at 0xa55dd8), which lists original definitions under their _ORI_ names alongside the post-allocation state.

Wrapper: sub_A53840

sub_A53840 (48 lines) is a thin wrapper that:

  1. Emits a type-44 header record if *(ctx+1600)[1193] is set (scheduling metadata header)
  2. Calls sub_A50650 with the output buffer
  3. Optionally emits a type-62 trailer record if *(ctx+1600)[48] is set

This wrapper is the typical entry point reached through vtable dispatch.

Function Map

Address     Size     Callers   Identity
sub_A3B080  ~700 B   multiple  Code Object constructor
sub_A3A7E0  ~700 B   1         Stats emitter (per-function profile)
sub_A4B8F0  ~250 B   1         Register count / annotation writer
sub_A50650  ~74 KB   8         ORI Record Serializer (CodeObject::EmitRecords)
sub_A53840  ~400 B   1         EmitRecords wrapper (adds type-44 header)
sub_424070  2,098 B  3,809     Pool allocator (alloc)
sub_4248B0  923 B    1,215     Pool deallocator (free)
sub_424C50  488 B    27        Pool reallocator (realloc)
sub_426150  ~1.2 KB  2,800     Hash map insert
sub_426D60  345 B    422       Hash map lookup
sub_426EC0  349 B    29        Hash map contains
sub_425CA0  114 B    127       Hash map constructor
sub_425D20  121 B    63        Hash map destructor
sub_42CA60  81 B     298       Linked list prepend
sub_42CC30  34 B     48        Linked list length
sub_427630  273 B    73        MurmurHash3 string hash
sub_621480  21 KB    low       Symbol table builder
sub_625800  27 KB    low       Symbol resolution
sub_6BC560  4.9 KB   low       Constant bank handler
sub_6294E0  12.1 KB  low       Reserved shared memory management
sub_6C4DA0  15 KB    low       Shared memory intrinsic lowering
sub_7FD6C0  ~800 B   3         ELF symbol emitter
sub_7FB6C0  ~800 B   1         Pipeline orchestrator (context cleanup)
sub_7FBB70  ~100 B   1         Per-kernel entry point
sub_663C30  ~300 B   1         Compilation loop body
sub_662920  varies   1         Global initialization (calls KnobsInit)