SASS Encoding — Function-Pointer Dispatch Tables
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
The SASS encoder compiles each Ori IR instruction by walking a small ladder of switch-dispatched megafunctions and reading per-field templates out of rodata. The address-stable infrastructure for those megafunctions is a population of 409 function-pointer dispatch tables, almost all of them in .rodata, totalling 70,528 slots (564,224 bytes, since every entry is an 8-byte target pointer). Each slot is a void(*)() that the megafunction reaches through a base-plus-scaled-index sequence of the form mov rax, [tbl + 8*idx]; jmp rax, with tbl an absolute address. This is the standard x86-64 jump-table pattern that GCC emits for dense switch statements in a non-PIE build; a RIP-relative operand cannot carry an index register, so any RIP-relative variant has to materialise the base with a separate lea first.
These dispatch tables are not the format descriptors in 0x23F1xxx–0x23F2xxx (those are xmmword blobs documented in encoding.md). They are the executable counterpart — the per-case targets that the megafunction switches actually jump to. A reimplementer who wants to encode SASS by replaying ptxas's behaviour must know not just the format constants but also which target function each opcode-category lands in, because the megafunctions never bring their dispatch logic onto the C-source level — Hex-Rays renders them as goto-soup.
| Property | Value |
|---|---|
| Total tables | 409 |
| Total slots | 70,528 |
| Section | 404 in .rodata, 4 in .data.rel.ro (0x29f9a28, 0x29f9b18, 0x29f9b68, 0x29f9c68), 1 in .data (0x29fca68, the 269-slot percase table). The split reflects immutability semantics: .rodata entries are link-time-final, .data.rel.ro entries are resolved by the dynamic linker then mprotected read-only via RELRO, and the single .data table is genuinely mutable at runtime, making it the only dispatch table ptxas can rewrite mid-execution. |
| Slot width | 8 bytes (absolute VA, no relocation needed in a non-PIE build) |
| Largest single table | 0x22f1f60: 5,560 slots (the consolidated mega-dispatcher jump table) |
| Density | 329 tables target the 0x00400000–0x00FFFFFF text range; 80 tables target 0x01000000–0x01FFFFFF |
| Source | ptxas_data_tables.json (8 MB), extracted via the IDA Python sweep documented in methodology.md |
Mega-Dispatcher Jump Table — 0x22f1f60
The single most important table in the binary: 5,560 contiguous 8-byte slots in one block shared by all six of ptxas's encoding/decoding megafunctions. The six functions are pinned to non-overlapping slot ranges within it, and each function's case-target arena starts at a fixed boundary:
| Slot | Byte offset | Owning function | # slots | Role (from encoding.md) | Confidence |
|---|---|---|---|---|---|
| 0 | +0x0000 | sub_10C0B20 | 1,018 | setField: write a value into a named field | HIGH |
| 1018 | +0x1FD0 | sub_10C7690 | 744 | setOperandField: per-operand variant | HIGH |
| 1762 | +0x3710 | sub_10CAD70 | 744 | getOperandFieldOffset: per-operand bit-offset | HIGH |
| 2506 | +0x4E50 | sub_10CCD80 | 1,018 | setFieldDefault: write hardcoded default | HIGH |
| 3524 | +0x6E20 | sub_10D5E60 | 1,018 | getFieldOffset: return bit-offset of named field | HIGH |
| 4542 | +0x8DF0 | sub_10E32E0 | 1,018 | hasField: boolean field-existence query | HIGH |
Each of the four 1,018-slot arenas holds two or three concatenated jump tables: the primary 370-case switch on the opcode category at *(WORD*)(a1+12) consumes 370 slots, and the secondary sub-switches on field ID consume the rest. The two 744-slot arenas extend the primary switch to category 0x174 (373 cases) yet need fewer secondary slots, because their field-ID inner switches are smaller (range 1–30 instead of the broader field-ID enum).
A direct verification: sub_10C0B20's primary switch has 370 cases per the switch dump and 248 of them carry a real handler — the other 122 fall through to default (0x10c0b40). The table slots are still present and point at the default block, which is why every opcode category occupies a slot even when it has nothing to encode.
⚡ QUIRK: six functions, one rodata block.
All six mega-dispatchers index into a single contiguous 44,480-byte table. There is no per-function table header or terminator: the boundary between sub_10C0B20's arena and sub_10C7690's arena is implicit (slot 1018) and known only because each megafunction's lea instruction loads a different base address. A reimplementer can verify the partition by disassembling the lea r10, [rip + 0x22F1F60] constant and its five siblings. The compactness is a code-size optimization: six separate .rodata symbols would have introduced up to five linker alignment gaps between them.
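The partition can be written down as a small lookup, which is handy when mapping a slot address seen in a disassembly back to its owning megafunction. This is a minimal sketch: the slot boundaries are copied from the table above, while the struct, macro, and function names are illustrative assumptions and not symbols recovered from the binary.

```c
#include <stddef.h>
#include <stdint.h>

/* Arena layout of the shared block at 0x22F1F60, copied from the table above.
 * All identifiers here are hypothetical names chosen for illustration. */
struct arena {
    const char *owner;    /* owning megafunction */
    uint32_t first_slot;  /* first slot index of the arena */
    uint32_t slot_count;  /* number of 8-byte slots in the arena */
};

static const struct arena k_arenas[6] = {
    { "sub_10C0B20",    0, 1018 },
    { "sub_10C7690", 1018,  744 },
    { "sub_10CAD70", 1762,  744 },
    { "sub_10CCD80", 2506, 1018 },
    { "sub_10D5E60", 3524, 1018 },
    { "sub_10E32E0", 4542, 1018 },
};

#define MEGA_TABLE_VA 0x22F1F60u
#define SLOT_BYTES    8u

/* Byte offset of an arena's base relative to the start of the block. */
static uint32_t arena_byte_offset(const struct arena *a) {
    return a->first_slot * SLOT_BYTES;
}

/* Map an absolute slot address back to its owning arena. Returns NULL when
 * the address is misaligned or falls outside the 5,560-slot block. */
static const struct arena *arena_for_va(uint64_t va) {
    if (va < MEGA_TABLE_VA || (va - MEGA_TABLE_VA) % SLOT_BYTES != 0)
        return NULL;
    uint64_t slot = (va - MEGA_TABLE_VA) / SLOT_BYTES;
    for (size_t i = 0; i < 6; i++) {
        uint64_t first = k_arenas[i].first_slot;
        if (slot >= first && slot < first + k_arenas[i].slot_count)
            return &k_arenas[i];
    }
    return NULL;
}
```

Calling arena_for_va on the base constant found in each megafunction's lea should return that same function's row; a mismatch would mean the partition above needs revisiting.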
Switch entry shape
Each primary-switch slot encodes (*((void(**)(int64_t,int64_t,uint32_t))(tbl + 8 * category)))(...) semantics. Decoded C pseudocode for an arbitrary call site:
// Inside sub_10C0B20 (setField):
// a1 = instruction context (192-bit packed word at a1+48)
// a2 = field ID
// a3 = value to write
uint16_t category = *(uint16_t*)(a1 + 12); // opcode category (0x0..0x171)
if (category >= 370) goto LABEL_default;
void *target = ((void**)0x22F1F60)[category]; // 5,560-slot dispatch
goto *target; // → one of 248 real handlers or default
// Each LABEL_xxx then performs its own sub-switch on field-ID `a2`
The dispatcher never returns a value through the table itself — each target either falls through to one of the four shared write-paths (LABEL_3941, LABEL_3923, LABEL_3929, LABEL_3935) or to the default. See encoding.md § setField shared write paths for the boundary-crossing OR sequences.
Per-SM-Tier Encoder Index Tables — 0x22a5aa0 Family
Four nearly identical 455-slot tables sit close together in .rodata, all targeting the 0xB079D0–0xB18CB0 region (the SASS encoder family at the high end of the codegen .text). The slots are populated progressively: the first table has 93 nullsub holes, the last has only 2. This is the SM-tier dispatch: one table per SM tier (SM75, SM80, SM89, SM90/100), each a different per-opcode encoder selector, with newer SMs filling in opcodes that older SMs didn't support.
| Table address | Unique funcs | Nullsubs | Target VA range | SM tier (inferred) | Confidence |
|---|---|---|---|---|---|
| 0x22a5aa0 | 454 | 93 | 0xB079D0..0xB18410 | Earliest (most unimplemented) | MEDIUM |
| 0x22a6e70 | 454 | 25 | 0xB079E0..0xB18CB0 | Mid-generation | MEDIUM |
| 0x22a8248 | 454 | 25 | 0xB079E0..0xB18CB0 | Mid-generation (alt) | MEDIUM |
| 0x22a9bb0 | 454 | 2 | 0xB07F70..0xB18790 | Latest (fully populated) | MEDIUM |
| 0x22b2a58 | 455 | 40 | 0x6611B0..0x15F5800 | Distinct purpose (much wider range) | LOW |
The four 0x22a* tables share most slots — e.g. slot 5 is sub_B144E0 in all four, slot 100 is sub_B0F6C0 in all four. They differ at the slot positions where the SM-tier-specific encoder takes over. The slot-0 nullsub is different per table (nullsub_593, _594, _595, _596) — this is the IDA Pro naming convention for distinct-but-empty jump targets and tells us the C++ source declared four separate static const handler_t Encoder_SMxx[455] arrays.
⚡ QUIRK: nullsub gap progression encodes SM history.
The nullsub count never increases across the SM tiers (93 → 25 → 25 → 2). Each new SM tier "fills in" opcodes that the previous tier left unimplemented, but never removes opcodes. The two final nullsubs in 0x22a9bb0 are presumably opcodes still on NVIDIA's deprecation queue: they exist in the encoder enum but no current SM has a real handler. Forensic value: diffing adjacent tables slot by slot, from 0x22a5aa0 through 0x22a9bb0, tells you exactly which opcodes were added at each SM tier without any per-SM string evidence.
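The slot-by-slot diff described above can be sketched as follows. It reports every index that is a nullsub hole in the older table but a real handler in the newer one. The nullsub predicate is supplied by the caller, because deciding whether a target is a ret-only stub needs the disassembly; the demo tables and every name in the block are made up for illustration.

```c
#include <stddef.h>

typedef void (*slot_fn)(void);

/* Caller-supplied predicate: nonzero when fn is a ret-only nullsub.
 * A real tool would consult the disassembly; here it is an assumption. */
typedef int (*is_nullsub_fn)(slot_fn fn);

/* Collect indices that are holes in `older` but real handlers in `newer`.
 * Writes at most `cap` indices to `out` and returns the total found. */
static size_t tier_added_slots(const slot_fn *older, const slot_fn *newer,
                               size_t n, is_nullsub_fn is_nullsub,
                               size_t *out, size_t cap) {
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_nullsub(older[i]) && !is_nullsub(newer[i])) {
            if (found < cap)
                out[found] = i;
            found++;
        }
    }
    return found;
}

/* Toy data only: two ret-only stubs (one per table) and two handlers. */
static void demo_null_a(void) {}
static void demo_null_b(void) {}
static void demo_h1(void) {}
static void demo_h2(void) {}

static int demo_is_nullsub(slot_fn f) {
    return f == demo_null_a || f == demo_null_b;
}

static const slot_fn demo_old[4] = { demo_null_a, demo_h1, demo_null_a, demo_null_a };
static const slot_fn demo_new[4] = { demo_null_b, demo_h1, demo_h2,     demo_null_b };
```

In the toy data only index 2 changes from hole to handler; slot 0 stays a sentinel in both tables even though the stub itself differs, which mirrors the per-table slot-0 nullsubs.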
⚡ QUIRK: slot-0 nullsub identity is the table fingerprint.
Every other slot can be shared by name (sub_B144E0 appears at slot 5 in all four tables), but slot 0 holds a unique nullsub_N per table. IDA numbers nullsubs per distinct function address, so four names mean four separate ret-only stubs even though their bodies are identical. If a reimplementer collapses the four tables into one, slot 0 collapses with them, losing the SM-tier boundary marker. Treat the four tables as distinct symbols even when 80% of their slots are byte-identical.
Decode protocol
Each slot is reached by a per-opcode index, not by the opcode-category in (a1+12). The index is derived from the opcode_master lookup (mentioned in encoding.md § Concrete Constants for the Top-5 Encodings). A skeleton consumer:
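// SM-tier-aware encoder dispatch (reconstructed)
#include <stdint.h>
typedef int (*encode_fn)(int64_t ctx, int64_t ir_node);
// The four tables are separate .rodata objects with gaps between them, so
// they are modelled as four base pointers rather than one contiguous 2-D array:
// 0x22a5aa0, 0x22a6e70, 0x22a8248, 0x22a9bb0
extern const encode_fn *const sm_encoders[4];
int encode_for_target(int sm_tier, int opcode_idx, int64_t ctx, int64_t ir) {
    if ((unsigned)sm_tier >= 4) return -1;       // unknown tier
    if ((unsigned)opcode_idx >= 455) return -1;  // outside the 455-slot table
    encode_fn fn = sm_encoders[sm_tier][opcode_idx];
    if (!fn) return -1;  // defensive only: holes are ret-only nullsubs, not NULL
    return fn(ctx, ir);
}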
The sm_tier value flows from the global SM target descriptor (the --gpu-name sm_NN parsing path). Section 7 of methodology.md documents how the descriptor reaches the encoder.
High-Density Encoder Tables (>200 Unique Targets)
These tables are the workhorses of the encoder back-end. Most of them point to more than 200 distinct functions; 0x23b3a80 has only 150 but is included here alongside them. They cover the per-opcode handler lookups that the megafunctions stand on top of.
| Table address | Slots | Unique funcs | Nullsubs | Target VA range | Likely role | Conf |
|---|---|---|---|---|---|---|
| 0x22ad230 | 2,129 | 246 | 65 | 0xA393D0..0x19F72E0 | Composite operand-decode dispatch (multi-segment) | MED |
| 0x23b3a80 | 2,109 | 150 | 1 | 0x9DAA40..0x181D9B0 | Opcode→canonical-name printer (PTX↔SASS) | MED |
| 0x23f4430 | 633 | 633 | 0 | 0x1BB38B0..0x1BBD2C0 | One-to-one per-slot dispatch: 633 unique entries, no sharing | MED |
| 0x21f9158 | 470 | 469 | 8 | 0x7D6AE0..0xBE26C0 | Per-opcode pretty-printer or trace formatter (nullsubs at slots 4, 55, 302, 321, 331, 457, 460, 461) | MED |
| 0x21d6860 | 470 | 469 | 4 | 0x6611B0..0x15F4870 | Sibling of 0x21f9158, distinct entry-point set | MED |
| 0x21d82b0 | 466 | 466 | 33 | 0x6611B0..0x15F5800 | Per-opcode handler, broad-range | MED |
| 0x21f5b70 | 443 | 443 | 0 | (similar range) | Per-opcode dispatch, fully unique | LOW |
| 0x22b64d8 | 503 | 208 | 46 | 0xA393D0..0xC380C0 | Per-opcode validate/encode hybrid | LOW |
| 0x29fca68 | 269 | 269 | 0 | (varies) | Per-opcode dispatch, fully unique | LOW |
The 0x21d6860 / 0x21f9158 pairing — both 470 slots, both ~469 unique — is suspicious. Diffing them slot-by-slot would tell us whether they're another SM-tier pair or two parallel pipelines (e.g. one for mercury mode, one for capmerc mode; see mercury.md § Mercury vs SASS vs Capsule Mercury).
Schema (C-level reconstruction)
// .rodata layout for a dispatch table (no header, no terminator):
struct sass_encoder_table {
encode_fn slots[N]; // N is determined by the megafunction's switch range
};
// Real declaration is anonymous; the symbol exists only as a `lea` displacement
// in one or more megafunctions. There is no length field — N is hard-coded into
// the consumer.
The lack of a length field is critical for reverse engineering: knowing N requires either reading the consumer's cmp ... , N; ja default bound check or summing slots until you hit a different .rodata symbol. The IDA extractor at the source of ptxas_data_tables.json uses the next rodata symbol boundary, which produces correct slot counts in practice but should not be trusted blindly when a table sits at a section boundary.
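Both ways of recovering N reduce to one line of arithmetic each, sketched below. The function names are hypothetical. One detail worth making explicit: with an unsigned ja, the immediate in the bound check is the highest valid index, so the case count is the immediate plus one.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_BYTES 8u

/* Slot count implied by the gap between a table and the next known symbol.
 * Returns 0 when the inputs are inconsistent: the next symbol does not lie
 * after the table, or the gap is not a whole number of 8-byte slots. */
static size_t slots_from_boundary(uint64_t table_va, uint64_t next_symbol_va) {
    if (next_symbol_va <= table_va)
        return 0;
    uint64_t gap = next_symbol_va - table_va;
    if (gap % SLOT_BYTES != 0)
        return 0;
    return (size_t)(gap / SLOT_BYTES);
}

/* Case count implied by `cmp idx, imm; ja default`. The branch is taken only
 * when idx > imm (unsigned), so indices 0..imm are valid: imm + 1 cases. */
static size_t cases_from_cmp_ja(uint32_t imm) {
    return (size_t)imm + 1u;
}
```

When the two estimates disagree, the consumer's bound is the safer one: padding or an unlabelled neighbour inflates the boundary-based count, whereas the cmp immediate is what the code actually enforces. For example, an immediate of 0x171 gives the 370 cases of the setField primary switch.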
Representative entry (ASCII bit-layout)
.rodata, 8 bytes per slot:
+-----------------------------------------------------------------+
| 63                                                            0 |
| <-- 64-bit absolute virtual address of target function -------> |
+-----------------------------------------------------------------+
Example slot 0 of 0x22a5aa0 (VA 0x22a5aa0 = rodata_base 0x1ce2e00 + offset 0x5c2ca0):
0x22a5aa0: A0 80 B0 00 00 00 00 00 ; → 0x00B080A0 (nullsub_593)
^^^^^^^^^^^^^^^^^^^^^^^^^^
little-endian, no relocation in non-PIE binary
(ptxas is a non-PIE statically-linked-style ELF —
confirmed by absence of R_X86_64_RELATIVE entries)
The slot-0 nullsub at 0x00B080A0 is the sentinel convention: every table in the 0x22a5aa0 SM-tier family reserves index 0 for an empty ret-only target (one nullsub per table, hence nullsub_593/_594/_595/_596). Reading slot 0 of any such table tells you whether index 0 is a real opcode (rare) or a reserved "OP_INVALID" probe slot; for the encoder family it is the latter. See the nullsub Sentinels section under Targeting Behavior below.
The slots are not relocations — ptxas is built non-PIE so absolute addresses are baked into rodata at link time. A PIE rebuild would convert all 70,528 slots into R_X86_64_RELATIVE entries and balloon the dynamic relocation count from 146 (PLT-only) to ~70,000.
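Because each slot is a plain little-endian 64-bit VA, reading a table out of a raw dump needs no relocation processing. The sketch below decodes the example bytes shown above and computes the address of an arbitrary slot; the helper names are assumptions made for illustration.

```c
#include <stdint.h>

/* Decode one 8-byte little-endian slot into an absolute VA. The value is
 * assembled byte by byte, so the result is independent of host byte order. */
static uint64_t decode_slot(const uint8_t bytes[8]) {
    uint64_t va = 0;
    for (int i = 7; i >= 0; i--)
        va = (va << 8) | bytes[i];
    return va;
}

/* VA of slot `idx` in a table that starts at `table_va`. */
static uint64_t slot_va(uint64_t table_va, uint32_t idx) {
    return table_va + (uint64_t)idx * 8u;
}
```

Feeding it the bytes A0 80 B0 00 00 00 00 00 yields 0x00B080A0, the nullsub_593 target from the example.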
Dense Single-Target Tables — Computed-Goto Mode
Several tables hold thousands of slots that all point into a single function. These are not dispatch tables in the conventional sense — they are computed-goto label tables for one mega-switch.
| Table address | Slots | Unique funcs | Target function | Target range | What it represents |
|---|---|---|---|---|---|
| 0x23f6d00 | 2,064 | 1 | sub_B12920 | 0x1C38C35..0x1C3C646 | Per-case basic-block labels inside one giant switch |
| 0x237a1b0 | 2,455 | 2 | sub_15F5510, sub_169B190 | (within those two) | Pair of computed-goto tables for two siblings |
| 0x21d3ac8 | 1,148 | 7 | (7 funcs) | 0x6E66D0..0x855867 | Computed-goto across 7 closely-related funcs |
The 0x23f6d00 row is the most extreme: 2,064 absolute addresses, all landing inside a 14-KB region attributed to sub_B12920. Hex-Rays cannot recover the surface-level switch (...) for this one; the function is rendered as goto soup. The shape of the table suggests a threaded dispatch (every case ends with goto *jumptable[next_idx]), which is the typical interpreter-loop pattern. sub_B12920 is the encoder dispatch core for one specific SASS instruction family, possibly the tensor/MMA opcodes. One caveat: the listed target range (0x1C38C35..0x1C3C646) lies far above 0xB12920, the entry address that the name sub_B12920 implies, so either the owner name or the range should be re-checked against ptxas_data_tables.json.
// Computed-goto threaded dispatch (reconstructed for sub_B12920):
static const void *jt[2064] = { &&L0, &&L1, &&L2, ... }; // == 0x23F6D00
goto *jt[opcode_idx];
L0: /* encode handler 0 */ goto *jt[next_idx];
L1: /* encode handler 1 */ goto *jt[next_idx];
// ...
The 2,064 slots span only 14,865 bytes (0x1C3C646 - 0x1C38C35 = 0x3A11), meaning each "case" averages about 7.2 bytes, barely enough for a single instruction plus the next computed jump. This matches the GCC labels-as-values (computed goto) interpreter idiom; the source is plausibly a single mega-switch on a small uint16_t opcode-subfield with no shared epilogue.
⚡ QUIRK: Hex-Rays surrenders on computed gotos.
Four of the six mega-dispatchers (sub_10C0B20, sub_10CCD80, sub_10D5E60, sub_10E32E0) and the threaded sub_B12920 are the five functions that methodology.md § Scope and Scale lists as not decompiling to clean C under Hex-Rays. The dispatch tables in .rodata are the only readable surface for these functions' control flow. Reading the tables in isolation gives you the case→target topology even when the decompiler gives up, which is why the JSON extractor preserves them as first-class artifacts.
Cross-Reference With Format Descriptors
The encoding picture is two-layered. Format descriptors in 0x23F1xxx–0x23F2xxx carry the data (slot sizes, slot types, opcode header widths). The function-pointer tables documented here carry the code (which encoder runs for each opcode). The wiring between them:
1. The opcode-category in (a1+12) selects one of 248 live cases in the mega-switch via 0x22f1f60[category]. (table → handler)
2. The selected handler loads the format-descriptor xmmword at a1+8 from one of 38 named addresses in 0x23F1CE8..0x23F2EF8. (handler → format data)
3. The format descriptor's three trailing DWORD[10] arrays are copied into the encoding context at a1+24..a1+140. (format data → context state)
4. The handler then makes 8–12 calls into the bitfield packer sub_7B9B80, indexing operands via the context state. (context state → encoded bits)
If you want to predict what ptxas emits for a given instruction without running ptxas, you must trace this entire chain. The function-pointer tables are the entry point: without them, you cannot reach step 2 because Hex-Rays cannot tell you which mega-switch case fires for a particular opcode-category integer.
Targeting Behavior — nullsub Sentinels
64 of the 409 tables contain at least one nullsub_* slot. The IDA convention is that nullsub is a function whose entire body is ret, used as a placeholder for "this entry is reachable but does nothing." In ptxas's dispatch tables, nullsubs fall into four patterns, three of which occur in v13.0.88:
| Pattern | Interpretation | Example |
|---|---|---|
| Single nullsub at slot 0 only | Sentinel — index 0 reserved (e.g. "OP_INVALID") | 0x22a5aa0 slot 0 |
| Many nullsubs scattered through table | SM-tier unimplemented opcodes | 0x22a5aa0 (93 holes) |
| Block of consecutive nullsubs | Reserved opcode range for future SMs | 0x22b64d8 (clustered) |
| Nullsub at every slot | Table not yet emitted for this build | (none observed in v13.0.88) |
The distinction matters when reimplementing: an "SM-tier unimplemented" nullsub means silent return — the encoder produces a 64- or 128-bit instruction word of zeros. The default 0x10c0b40 block in the megafunctions, by contrast, returns an error code. So index-0 sentinels and SM-tier holes have different fault semantics even though both are nullsubs.
⚡ QUIRK: silent-zero versus error-default.
An opcode that hits a nullsub slot in 0x22a5aa0 and an opcode whose category isn't in the megafunction's case list both "do nothing", but the first writes an all-zeros instruction word and the second returns -1 to the caller. PTX programs that hit the nullsub path on an older SM will compile cleanly to a bogus zero-instruction; PTX programs that hit the megafunction default will trigger the "Instruction '%s' cannot be compiled for architecture '%s'" diagnostic (string 0x... per ptxas_strings.json). Same observable cause, completely different failure modes.
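The difference between the two failure modes is easy to lose in a reimplementation, so a toy model may help. The sketch below is not ptxas's logic: the context layout, the result enum, and the post-dispatch zero check are all assumptions. In particular, the zero check is something a clone could add to surface the silent path; the text above describes ptxas itself as emitting the zero word without complaint.

```c
#include <stdint.h>
#include <string.h>

/* Toy encoding context: one 128-bit instruction word. Layout is invented. */
struct enc_ctx {
    uint64_t word[2];
};

typedef int (*slot_fn)(struct enc_ctx *ctx);

/* Stand-in for a nullsub: returns without touching the word. */
static int demo_nullsub(struct enc_ctx *ctx) { (void)ctx; return 0; }

/* Stand-in for a real handler: writes some bits into the word. */
static int demo_handler(struct enc_ctx *ctx) { ctx->word[0] = 0x1234; return 0; }

enum enc_result { ENC_OK = 0, ENC_SILENT_ZERO = 1, ENC_ERROR = -1 };

/* Dispatch through a table and classify the outcome. */
static enum enc_result dispatch(const slot_fn *table, unsigned n,
                                unsigned idx, struct enc_ctx *ctx) {
    memset(ctx, 0, sizeof *ctx);
    if (idx >= n)
        return ENC_ERROR;          /* models the megafunction default path */
    table[idx](ctx);
    if (ctx->word[0] == 0 && ctx->word[1] == 0)
        return ENC_SILENT_ZERO;    /* nullsub hit: nothing was encoded */
    return ENC_OK;
}
```

A clone that wants bug-for-bug compatibility would treat ENC_SILENT_ZERO as success; one that prefers safety could turn it into a diagnostic.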
Table Catalog — All Tables With ≥100 Slots
A compact catalog for navigation. The "Pattern" column captures what the table looks like at a glance:
- mega = single-block of mega-dispatcher labels (jumps within megafunc)
- percase = one unique function per slot (real per-opcode dispatch)
- sparse = mostly populated but with nullsub holes (SM-tier sparse)
- gotomap = computed-goto table (1–7 unique targets, many slots)
- mixed = blend of unique funcs and shared fallback
| Address | Slots | Uniq | Nullsubs | Pattern | Owner / role |
|---|---|---|---|---|---|
| 0x22f1f60 | 5,560 | 6 | 0 | mega | The six mega-dispatchers, shared block |
| 0x237a1b0 | 2,455 | 2 | 0 | gotomap | sub_15F5510 + sub_169B190 thread |
| 0x22ad230 | 2,129 | 246 | 65 | mixed | Composite operand-decode |
| 0x23b3a80 | 2,109 | 150 | 1 | mixed | PTX↔SASS name dispatch |
| 0x23f6d00 | 2,064 | 1 | 0 | gotomap | sub_B12920 thread (tensor/MMA?) |
| 0x2358c38 | 1,971 | 1 | 0 | gotomap | sub_143C440 thread |
| 0x23d2a30 | 1,966 | 1 | 0 | gotomap | sub_198BCD0 thread |
| 0x21e7118 | 1,836 | 60 | 2 | mixed | Per-opcode multi-stage |
| 0x21f0590 | 1,677 | 10 | 0 | gotomap | Mid-density goto thread (291/283/240/240/240 distribution) |
| 0x22a1d80 | 1,649 | 57 | - | mixed | Per-opcode mid-density |
| 0x2355b80 | 1,548 | 5 | 0 | gotomap | 5-target thread in sub_13ACxxx/sub_ACBxxx |
| 0x21debc0 | 1,262 | 15 | 0 | gotomap | sub_917A60+ family thread |
| 0x21d3ac8 | 1,148 | 7 | 0 | gotomap | 7-target thread |
| 0x1d4b778 | 1,080 | 1 | 0 | gotomap | sub_5FF700 thread |
| 0x202e108 | 948 | 4 | 0 | gotomap | sub_704D30 thread (764/948 slots) + 3 minor branches |
| 0x21e35b8 | 827 | 25 | 0 | mixed | - |
| 0x229e9f8 | 734 | 8 | 0 | gotomap | sub_AED3C0/sub_AEA420 dual thread (342+329 of 734) |
| 0x1d10a68 | 703 | 1 | 0 | gotomap | sub_4CE6B0 thread |
| 0x20254b0 | 701 | 5 | 0 | gotomap | sub_657xxx/sub_65Dxxx/sub_65Axxx triple thread |
| 0x21f3a20 | 667 | 2 | 0 | gotomap | - |
| 0x23f4430 | 633 | 633 | 0 | percase | One unique func per slot |
| 0x202b840 | 604 | 3 | 0 | gotomap | Mercury master-encoder thread |
| 0x21cc6e0 | 591 | 17 | 0 | mixed | - |
| 0x203a5a8 | 552 | 1 | 0 | gotomap | sub_720F00 thread |
| 0x2021820 | 550 | 8 | 0 | gotomap | sub_61F700 lead (156) + 4 secondary @ 69 each; symmetric 4-way split |
| 0x21ec3d0 | 537 | 15 | 0 | mixed | - |
| 0x21b3418 | 535 | 3 | 0 | gotomap | Mercury triple-thread |
| 0x22b3960 | 504 | - | - | - | - |
| 0x22b64d8 | 503 | 208 | 46 | mixed | Per-opcode validate/encode |
| 0x21dd3a0 | 502 | 2 | 0 | gotomap | - |
| 0x21f9158 | 470 | 469 | 8 | percase | Per-opcode printer/dispatch; nullsubs at slots 4, 55, 302, 321, 331, 457, 460, 461 (nullsub_148, _190, _227, _231, _232, _246, _247, _248) |
| 0x21d6860 | 470 | 469 | 4 | percase | Sibling of 0x21f9158 |
| 0x229d418 | 469 | 468 | - | percase | - |
| 0x21d82b0 | 466 | 466 | 33 | percase | - |
| 0x22b2a58 | 455 | 455 | 40 | percase | 455-slot neighbour of the 0x22a* family; distinct purpose (much wider target range) |
| 0x22a9bb0 | 455 | 454 | 2 | sparse | SM-tier 4 (latest), fully populated |
| 0x22a8248 | 455 | 454 | 25 | sparse | SM-tier 3 |
| 0x22a6e70 | 455 | 454 | 25 | sparse | SM-tier 2 |
| 0x22a5aa0 | 455 | 454 | 93 | sparse | SM-tier 1 (earliest) |
| 0x21f5b70 | 443 | 443 | 0 | percase | - |
| 0x22bb738 | 399 | 241 | - | mixed | - |
| 0x21d2e88 | 388 | 3 | 0 | gotomap | - |
| 0x21e4fb0 | 363 | 2 | 0 | gotomap | - |
| 0x22b5910 | 359 | 22 | - | mixed | - |
| 0x21d7798 | 353 | 2 | 0 | gotomap | - |
| 0x21c4b58 | 353 | 1 | 0 | gotomap | sub_7D3A20 thread |
| 0x21b5218 | 350 | 3 | 0 | gotomap | Mercury triple-thread |
| 0x21d9ef8 | 343 | 1 | 0 | gotomap | sub_89FBA0 thread |
| 0x22b7ba8 | 341 | 31 | - | mixed | - |
| 0x23b2e00 | 331 | 2 | 0 | gotomap | - |
| 0x23f0768 | 329 | 24 | 3 | mixed | - |
| 0x21c2bc8 | 327 | 2 | 0 | gotomap | - |
| 0x23d6ed0 | 315 | 237 | - | mixed | - |
| 0x23f11f0 | 309 | 208 | 42 | mixed | - |
| 0x23f31a0 | 298 | 2 | 0 | gotomap | - |
| 0x23f3b00 | 292 | 237 | 79 | sparse | - |
| 0x29fca68 | 269 | 269 | 0 | percase | - |
| 0x23f5808 | 236 | 236 | 79 | sparse | - |
| 0x23d1a88 | 236 | - | - | - | sibling of 0x23f5808 |
| 0x23d21f8 | 236 | - | - | - | sibling of 0x23f5808 |
| 0x21edbc8 | 204 | 203 | 184 | sparse | - |
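Tables below 100 slots, the large majority of the 409, are mostly per-instruction-family encoders (5–80 slots each) sitting adjacent to the corresponding rodata format constants. They are catalogued in ptxas_data_tables.json for completeness but rarely warrant individual analysis. Note that the smallest entry listed above has 204 slots; mid-size tables that this catalog missed are picked up in the Catalog Addendum below.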
Catalog Addendum — Tables Audited After Initial Wave
A targeted re-audit of ptxas_data_tables.json surfaced several tables in the 158–244-slot range that the initial catalog skipped over. They sort into two structural groups.
Group A: a second four-tier SM-tier family (240/243/244 slots, 73–74 nullsubs each). Same shape as the 0x22a5aa0 family (see Per-SM-Tier Encoder Index Tables above) but a different per-opcode aspect: mostly distinct slot-1 fingerprints and a target range that overlaps the opcode-printer corpus rather than the opcode-encoder corpus. Strong evidence this is a second parallel 4-tier vtable, likely the per-opcode validate-or-cost probe parallel to the encoder vtable.
All four tables share slot 0 = nullsub_469 (the OP_INVALID sentinel, uniform across the cohort, so it cannot fingerprint individual tables). The most useful distinguishing slot is slot 1, which takes three values across the four members: 0x21f6a90 and 0x22b5150 both hold sub_C1EF80 there, so those two can only be separated by diffing later slots.
| Address | Slots | Uniq | Nullsubs | Tier reading | Notes |
|---|---|---|---|---|---|
| 0x21f6a90 | 244 | 244 | 74 | latest tier (most populated) | Target range 0xA393D0..0x1BB0FA0; slot 0 = nullsub_469, slot 1 = sub_C1EF80, slot 2 = sub_A393D0 (slot 2 is a fixed prologue handler shared with siblings) |
| 0x23d8178 | 243 | 243 | 73 | mid tier | Slot 0 = nullsub_469, slot 1 = sub_19FF740 (only sibling with this slot-1 target), slot 2 = sub_A393D0 |
| 0x22b5150 | 243 | 243 | 73 | mid tier (alt) | Slot 0 = nullsub_469, slot 1 = sub_C1EF80 (matches 0x21f6a90 at slot 1; the two tables agree on the first non-sentinel entry, so diff the remaining slots to separate them), slot 2 = sub_A393D0 |
| 0x21fa5e0 | 240 | 240 | 73 | earliest tier | Smallest population, distinct slot 1 = sub_C5DBB0; predates four opcodes added in later tiers |
The 240→243→243→244 progression matches the "opcodes added per SM generation, never removed" rule documented for the 0x22a5aa0 family. Diffing slot lists across the four would yield the exact opcode set added per SM generation for whichever aspect this vtable controls. Confidence: HIGH for structural classification, MED for "validate/cost" role attribution — the constant sub_A393D0 at slot 2 of every entry suggests a fixed prologue handler, which is more consistent with a probe/cost role than with a printer or encoder role.
Group B: single-tier per-opcode dispatch tables not cited elsewhere on this page. No 4-tier family detected; each is likely a one-off per-subsystem dispatch.
| Address | Slots | Uniq | Nullsubs | Best-guess role |
|---|---|---|---|---|
| 0x21f7df8 | 234 | 234 | 208 | Per-opcode handler for an early SM tier. 208/234 holes is the largest nullsub count in the catalog (only 0x21edbc8, at 184/204, is sparser by ratio), consistent with "feature-flag dispatch that only a handful of opcodes participate in" (e.g. tensor-core fragment ops circa SM80) |
| 0x23549c0 | 204 | 204 | 120 | Per-opcode dispatch in the 0x1398000 code range (distinct from the main encoder corpus at 0xA39xxx..0xC5Dxxx); likely a secondary lowering pipeline, possibly the cuLINK/relocation path |
| 0x21d2598 | 164 | 154 | 31 | Per-opcode dispatch in 0x6E0000..0x7FD000 (early code section, pre-mercury). 10 slots share handlers (uniq 154/164), making it the only mid-size table with non-trivial slot reuse outside the SM-tier families. Likely Mercury preprocess dispatch |
Group C: uniform-target tables (uniq ≤ 2, with one exception noted in the table). The catalog skipped these because they don't carry per-opcode handler information, but they are listed for completeness: they are computed-goto threads of the kind the catalog already covers, just never named individually.
| Address | Slots | Uniq | Pattern reading |
|---|---|---|---|
| 0x1d0e128 | 244 | 2 | Two-target goto thread, early-section pair |
| 0x1d0b880 | 238 | 1 | Single-function thread, 0x1d0xxxx early code |
| 0x22b74a0 | 211 | 1 | Single-function thread in encoder region |
| 0x2020a90 | 211 | 2 | Two-target thread |
| 0x21f73f0 | 209 | 154 | Mid-density per-opcode dispatch, not a goto thread, and the one entry here that does not meet the uniq ≤ 2 criterion; 154 unique funcs across 209 slots imply ~55 slots collapse onto shared handlers |
| 0x1d07848 | 203 | 2 | Two-target thread, early section |
0x21f73f0 is the only Group C entry worth a deep-dive — its 209/154 ratio matches the 0x21d2598 pattern (per-opcode dispatch with a small shared fallback bucket) and it sits one bucket below the SM-tier-family size. Possible fifth sibling of the 0x22a5aa0 cohort with a smaller opcode universe; worth diffing against the four siblings to test.
What the Tables Don't Cover
The 409 tables in ptxas_data_tables.json do not include:
1. Per-opcode SASS encoder functions (the ~1,086 SM100 handlers documented in encoding.md § Encoder Template). These are reached through the mega-switch case-targets, not through a separate function-pointer table: the megafunctions call them via direct call sub_XXXXX instructions emitted at each case label. (MED confidence, supported by the absence of those addresses in any of the 409 tables.)
2. PhaseManager vtable at off_22BD5C8. This is a 159-entry C++ vtable, not a switch jump table; its layout follows the Itanium C++ ABI and is documented in methodology.md § Type Recovery.
3. Knob lookup tables (~2,000 ROT13-encoded names). These are paired key/value records, not function-pointer arrays.
4. Bugspec kind-string table at 0x21F0500. Also key/value, not dispatch.
5. Format descriptor xmmwords at 0x23F1CE8..0x23F2EF8. These are 128-bit constant blobs, not function pointers; they're consumed by _mm_loadu_si128 in the encoder template.
Item 1 is the most surprising omission. Per-opcode encoders are not dispatched through rodata; they're hardcoded as direct call targets inside each mega-switch case body. This means swapping out an opcode's encoder at runtime requires patching the case body itself, not just rewriting a table slot. The implication for binary diffing is significant: opcodes added for a new SM tier in ptxas v14 will probably appear as new switch cases (and new slots in the 0x22f1f60 mega-table) rather than as slot reassignments.
Sidebar — The OCG Slab Allocator (vtable +24) Idiom
⚡ QUIRK: register allocator routes through a vtable-dispatched slab, not malloc.
ptxas does not allocate working-set buffers from malloc. Every per-compilation buffer is obtained from an arena object reachable as *(ctx + 16) (the OCG context's allocator handle). The handle is a polymorphic C++ object whose vtable entry at byte offset +24 (the fourth pointer) is the byte-sized "raw allocate" entry point. Call shape:
// Generic form seen at dozens of call sites:
_QWORD *alloc_obj = *(_QWORD **)(ctx + 16); // OCG allocator handle
void *mem = (*(__int64 (**)(_QWORD *, __int64, ...))(*alloc_obj + 24))(alloc_obj, size, ...);
The first argument is this, the second is the byte size, and Hex-Rays renders any remaining stack slots as double/__m128i carry-over from the caller's frame. Those are noise from the SysV AMD64 ABI, not real parameters; the callee uses (this, size) only.
This idiom is not confined to one subsystem. Verified call sites in ptxas_full.c (v13.0.88):
| ptxas_full.c line | Owner | Allocation size | Subsystem | Buffer role |
|---|---|---|---|---|
| 622528 | sub_704D30 callees | 24 | early IR construction | Per-record IR node (24-byte payload) |
| 748581 | helper near sub_823020 | 80 | mid-pipeline | 80-byte working record |
| 501648 | caller of sub_7AB* | 192 | mid-pipeline | 192-byte container record |
| 995796 | sub_957160 | 2056 | regalloc | Pressure histogram (512 DWORDs + 2-DWORD sentinel) — see regalloc/algorithm.md § Pressure Array Construction |
| 1478513 | caller in sub_BCEF* | 4096 | scratch/staging | One-page scratch buffer |
Five distinct sizes (24, 80, 192, 2056, 4096) appear across the binary; the slot +24 entry point services all of them. There is no separate "small/medium/large" dispatch — the allocator decides bucket internally. The implication for reimplementers: a faithful clone cannot model regalloc's pressure arrays as stack-frame locals or as new int[512]. They live in the same arena as IR nodes, the same arena as the 4096-byte scratch pages, and the same arena as the 80-byte working records — and that arena is destroyed wholesale at end-of-compilation, not per-pass. Code that holds raw pointers into this arena across compilation boundaries is undefined.
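For a clone, the idiom boils down to "one polymorphic arena, allocate through a fixed vtable entry, release everything at once". The sketch below models that shape with a simple bump allocator. It is a stand-in, not the recovered allocator: the bump strategy, the 16-byte rounding, the placeholder vtable entries, and all names are assumptions. Only the call shape (entry at byte offset 24 on a 64-bit target, called with this and size) comes from the text above.

```c
#include <stddef.h>
#include <stdlib.h>

struct arena;

/* Vtable with the raw-allocate entry as the fourth pointer, which is byte
 * offset 24 when pointers are 8 bytes. The first three entries are
 * placeholders; their real roles are not documented here. */
struct arena_vtbl {
    void *reserved[3];
    void *(*raw_alloc)(struct arena *self, size_t size);
};

/* Minimal bump arena: one backing block, released wholesale. */
struct arena {
    const struct arena_vtbl *vt;  /* vtable pointer at offset 0 */
    unsigned char *base;
    size_t used;
    size_t cap;
};

static void *bump_alloc(struct arena *a, size_t size) {
    size_t aligned = (size + 15u) & ~(size_t)15u;  /* assumed granularity */
    if (aligned < size || aligned > a->cap - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

static const struct arena_vtbl k_bump_vtbl = { { NULL, NULL, NULL }, bump_alloc };

static int arena_init(struct arena *a, size_t cap) {
    a->vt = &k_bump_vtbl;
    a->base = malloc(cap);
    a->used = 0;
    a->cap = a->base ? cap : 0;
    return a->base != NULL;
}

/* All allocations die together; there is no per-object free. */
static void arena_destroy(struct arena *a) {
    free(a->base);
    a->base = NULL;
    a->used = 0;
    a->cap = 0;
}

/* Mirrors the call shape (*(fn **)(*alloc_obj + 24))(alloc_obj, size). */
static void *ocg_alloc(struct arena *a, size_t size) {
    return a->vt->raw_alloc(a, size);
}
```

With this shape, a 24-byte IR node, the 2056-byte pressure histogram, and a 4096-byte scratch page all come from the same backing block, and arena_destroy at end of compilation invalidates every one of them together.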
Sentinel pattern at the call site. Every observed call site follows an identical pre-sequence: the caller writes a 3-word sentinel header ({handle, 0, 0xFFFFFFFF}) onto its own stack before the allocator call (visible at ptxas_full.c:995791–995795 for the regalloc case). This is the arena's free-list bookkeeping — the allocator threads the returned block onto the caller's local list so that the caller's destructor can mass-release on scope exit. The 0xFFFFFFFF is a "no successor" marker. A reimplementer who misses this sentinel will leak everything the function allocates because the arena's reclamation pass walks those headers.
Confidence: HIGH for the vtable+24 idiom and the verified call sites; MED for the "destroyed at end-of-compilation" lifetime claim (inferred from absence of explicit free calls — the allocator's destructor is in the OCG context teardown, which has not been disassembled in detail).
Open Follow-Ups
Tables that warrant individual deep-dives in future work:
- 0x23f4430 (633 unique per-slot funcs): appears one-to-one with no sharing. The target range (0x1BB38B0..0x1BBD2C0, ~38 KB of code) is too compact to be the main encoder corpus but too broad to be a single function. Likely the SASS printer dispatch (one printer per opcode) feeding sass-printing.md infrastructure. Confidence: MED. Action: cross-ref with sass-printing.md opcode mnemonic list.
- 0x23b3a80 (2,109 slots, 150 unique, only 1 nullsub): broad target range across 0x9DA000–0x181D000 (over 13 MB of code). Suggests an opcode-keyed lookup that fans out across many subsystems (printer + validator + lowering?). Confidence: LOW. Action: pick 10 random slots, identify their owner functions, look for common ancestor.
- The 159-entry family at 0x22a7cb8, 0x22a9090, 0x22a9620, 0x22aa9f8, 0x2399f58: five tables of 159 slots each (matching the PhaseManager phase count). These are not encoder tables but phase-vtable variants. They deserve a separate section in passes/phase-manager.md. Confidence: HIGH. Action: hand off to passes-wiki authors.
- 0x21cb0f8..0x21eaa88 (eight 126-slot tables): a Mercury-region family. The slot-0 fingerprint differs across all eight, suggesting eight Mercury sub-pipelines (more than the six stages documented in mercury.md). Possibly two are for capmerc-mode-only stages. Confidence: MED. Action: diff slot-0s against the Mercury phase enum.
Cross-References
- SASS Instruction Encoding: abstract framework, format descriptors, bitfield packer, encoder template
- Mercury Encoder Pipeline: the 6-stage sub-pipeline whose master encoder calls into these tables
- Capsule Mercury & Finalization: capmerc-mode variations
- Code Generation Overview: where the encoder sits in the global pipeline
- Methodology: confidence-level conventions and the IDA extraction protocol that produced ptxas_data_tables.json