Legacy Backend A — sub_A97600 Per-Target Operand-Class Query
All addresses in this page apply to ptxas v13.0.88 (CUDA 13.0). Other versions will differ.
sub_A97600 is a 7,780-byte, 425-block per-target virtual method installed at vtable slot +0x5F0 (index 190) on six of ptxas's legacy SM-target classes. Despite its size and the lure of its position adjacent to other scheduling-related virtuals, it is not a scheduler. It is the legacy implementation of the per-instruction, per-operand source-slot-count / operand-class query that the scheduling and register-pressure machinery consults whenever it needs to know "how many physical source-register slots does operand k of this instruction occupy under the current target's hardware encoding?". On modern SM tiers (sm_80+) this query is answered by a per-opcode dispatch table; on the six legacy target classes it is answered by this single 7.8-KB switch.
The Wave 18D reference in post-schedule.md § Function Map describing sub_A97600 as PostSchedulePass::runOnFunction is incorrect on multiple counts and is superseded by this page. The corrections, in summary:
- The function is not at sub-target vtable slot
+0x90— it is at primary-target vtable slot+0x5F0(vtable index 190, byte offset 1520). - The signature is
int(target *this, ptxInstr *, int operand_idx)returning a small integer — not avoid post_schedule(this)driver. - The body contains no instruction-list re-ordering, no dependency-DAG construction, no register-pressure tracker, no priority queue. It is one big
switchon(*(uint32_t*)(instr+72) >> 8) & 0xCFwith per-opcode operand-slot accounting. - It is referenced by exactly 6 vtables (all the same offset 0x5F0), each a different legacy SM-target class — and it has zero callers via direct call: every invocation flows through
(target->vtable + 1520).
The remainder of this page documents what the function actually does, why it exists in a separate 7.8-KB switch instead of a dispatch table, how the modern path replaces it, and what role the result plays in scheduling/RA.
| Symbol | sub_A97600 |
| Address range | 0xA97600 -- 0xA99464 (7,780 bytes) |
| Basic blocks | 425 |
| Callees | 18 distinct functions, 52 call edges (sub_7E36C0, sub_7E3640, sub_7E3790, sub_7E3800, sub_7E40E0, sub_7E3EF0, sub_693CA0, sub_80B620, sub_8963B0, sub_91E610, sub_A97540, ...) |
| Direct callers | 0 |
| Indirect entry | 6 vtables, all at slot offset +0x5F0 (vtable index 190 / byte 1520) |
| Helper | sub_A97540 (180 B, 3 BBs) — operand-range probe shared with the same six targets |
| Signature | int(target *this, uint32_t *instr, int operand_idx) |
| Return value | small non-negative integer (0, 1, 2, or (end - begin) operand-slot count) |
| Variadic XMM form | also called as int(target *this, uint32_t *instr, int operand_idx, double, double, double, double, double, double, double, double) from sub_7EAD70 and sub_7F5D50 (eight unused XMM slots reserved by the SysV ABI for the variadic latency variant; harmless on this side) |
Why six vtables?
The six vtables that install sub_A97600 at slot 190 are:
| Vtable | Slots | Sibling slot 189 (sub_A90D60) | Sibling slot 191 |
|---|---|---|---|
0x21D6860 | 470 | shared | sub_AAEF20 |
0x21D82B0 | 466 | shared | sub_8102C0 |
0x21F5B70 | 443 | shared | sub_8102C0 |
0x21F9158 | 470 | shared | sub_AAEF20 |
0x229D418 | 469 | shared | sub_8102C0 |
0x22B2A58 | 455 | shared | sub_8102C0 |
All six share the same slot-189 method (sub_A90D60, the predecessor-operand-eligibility query) and split into two slot-191 cohorts on a binary discriminator (sub_AAEF20 vs sub_8102C0). The slot-190 (operand-class) method, slot-189 (pred-op-eligibility) method, and slot-191 (binary cohort discriminator) form a tight three-method invariant of the legacy target hierarchy. Modern targets at sm_80+ install different functions at every one of these slots — in particular, the modern slot-190 implementations are smaller (sub_44A308, sub_7DB178, etc.) and dispatch through the per-opcode tables documented in codegen/encoding-tables.md § Per-SM-Tier Encoder Index Tables. The 7.8-KB switch is the legacy artifact of a path that was never table-driven.
The exact SM mapping is not in RTTI (the typeinfo and offset-to-top fields above each of the six vtables are zeroed in the binary), but the slot-size pattern (~466 slots), the sibling-method identity, and the lack of any Ampere-era specialisations on these six targets are consistent with the pre-sm_80 family (Maxwell sm_50/sm_52/sm_53, Pascal sm_60/sm_61/sm_62, Volta sm_70/sm_72, Turing sm_75 — nine SM variants, six target classes after deduplication of trivially identical sub_targets).
What the function computes
a1 is the target's this pointer (and through *(QWORD*)a1 its own vtable); a2 is a pointer to the 32-bit instruction-word stream of a single ptxInstr; a3 is an operand index. The function dispatches on the masked primary-opcode field:
opcode32 = *(uint32_t *)(a2 + 72) // primary opcode + flags word
masked = opcode32 with byte 1 ANDed by 0xCF
// strips out the two "extension-class"
// bits that don't change operand layout
switch (masked) {
case 0x10: // narrow integer ALU class
case 0x16: // load with computed index range
case 0x32: // tex/sampler class (xmmword opcode-mask templates)
case 0x3D: // single-source pure-predicate test
case 0x4D: // memory access with bit-encoded width/space
case 0x77:
case 0xCA:
...
case 335: // (special case for an opcode that lives at base 0x14F)
default:
}
For each case the body:
- Reads opcode flag bits — typically
(opcode32 >> 11) & 2(operand-base bias) and(opcode32 >> 4) & 7(operand-class hint). - Looks up the operand-encoding window for the requested
a3against per-opcode ranges supplied by helpers (sub_7E40E0,sub_7E3640,sub_7E3790,sub_7E3800,sub_7E36C0,sub_693CA0,sub_80B620,sub_8963B0). - Returns one of:
0— operand at indexa3is not a register source for this instruction (e.g. immediate or predicate-only),1— operand is one single register slot,2— operand consumes two register slots (a 64-bit / paired operand),end - begin— operand consumes an explicit run of consecutive slots, wherebegin/endcome from the per-opcode range helpers (the vector / tex-coord / wide-load cases).
Switch-case inventory
The switch has 33 explicit case labels grouped by operand-encoding family, plus a sprawling default: arm. The numeric value is the masked primary-opcode field (*(uint32_t *)(instr+72)) & ~0x3000:
| Case(s) | Operand-encoding family | Representative semantics |
|---|---|---|
0x10 | narrow integer ALU class | Operand index decoded from *(instr+80) minus a bias derived from (opcode32 >> 11) & 2. Returns 0 / 1 / 2*regs / 3*regs depending on operand-type bits (slot >> 4) & 0x1F. |
0x16 | computed-index window load | Walks the predecessor-instruction chain looking for opcode-0x29 chained-load entries; uses sub_7E40E0(a2, 3..4) to bracket the operand range. Returns end - begin for sources inside the range, 0 outside. |
0x32 | texture/sampler operand class | Materialises an 80-byte scratch buffer from xmmword_21B2EC0 + three immediate _QWORD literals, indexes by (slot >> 2) & 3 to pick the four-coordinate window. See QUIRK below. |
0x3D | single-source predicate test | Returns 2 if the operand encodes a register and the instruction is sign-extended (opcode32 ^ 0x70000000 != 0); 0 otherwise. |
0x4D | memory access with bit-encoded space/width | Eight-way switch on (slot >> 4) & 7. Calls sub_7E36C0(2, ...) to resolve the operand bound, with a side helper sub_A97540 testing whether the operand falls inside the base-load window. |
0x53 | wide register-set load | sub_7E3640(a2, 3..4) brackets, same chained-load walk as 0x16. |
0x55 | constant-bank reference | Loads the constant-bank descriptor from target.constantPool[insn.constID], returns the per-bank slot count. |
0x5A | predicate-only test, all-sources negative | Returns 2 when the addressed operand AND operands 0 and 4 are all sign-encoded negatives. |
0xB7, 0x120 | tex/surface load with vector destination | Returns (slot & 7) + 1, gated by the 0x4000 "vector-destination" bit, with a chained-load post-walk for opcode 288 (ATOMS). |
0xB8, 0xB9 | per-operand surface store class | |
0xCB | tex with separate sampler operand | |
0xD3 | LDG.E.128 / LDG.E.U.128 wide global load | |
0xDF, 0xEA, 0xEE | vectorised store class | Three nearly-identical case labels share a body. |
0xE4 | atomic with packed return | |
0xE9, 0xED | atomic with separate return register | |
0x129 | warp-level reduction class | sub_7E40E0(a2, 3..4) brackets, returns end - begin for sources inside the range. |
0x12B, 0x12C | warp-level shuffle class | |
0x135 | tensor-load class | Materialises four xmmword templates into a 64-byte stack scratch, indexed by (slot >> 18) & 3. Uses sub_7E39B0/sub_7E3BA0/sub_7E3C30 for the per-axis stride lookups. |
0x13A, 0x144 | tensor-store class | |
0x13D, 0x13E, 0x13F | TMA descriptor class | Three labels share a single body that returns 1 for the descriptor operand, 0 elsewhere. |
0x149, 0x14A, 0x14B | bulk-transfer / cp.async class | |
0x14F | shfl.idx.b32 with explicit lane mask | Special-cased because the lane-mask operand is at a non-canonical position. |
0x150, 0x153, 0x15D, 0x160 | per-opcode niche cases | Cover instructions that don't fit any of the four window-resolver families. |
0x129 (revisit) / 0x135 | tensor descriptor encoding | sub_7E3A70(scratch, 64, ...) builds the 80-byte window descriptor. |
default: | every opcode the explicit cases did not name | Multi-stage fallback that calls back into the same target's vtable+1024 (isFlatMemorySpace), vtable+1056 (isCachedAddrMode), vtable+1288 (getOperandBaseBias), and vtable+1480 (predOperandClassifier, optionally compared to sub_A8D330). The default arm is the largest single piece of the body — roughly 180 BBs — because it has to handle every opcode that's not in the explicit list, and the answer depends on per-target capabilities rather than a single closed-form rule. |
The return value flows directly into the scheduling latency/pressure computation. Two representative consumers:
sub_7E4150(slot-1520 callsite around line 67) multiplies it into the per-instruction latency:if (v14 > 4) v9 *= v14 >> 2;— the per-operand slot count scales the latency contribution of that operand.sub_7E8200(slot-1520 callsite at line 12) uses it as a Boolean gate (<= 0-> bail) for whether the operand even participates in the read-port-conflict check.sub_7EAD70andsub_7F5D50use the variadic-double form of the same slot during the latency accumulator in the SASS-encoder-time scheduling-statistics walk.
Without this method the scheduler cannot know, for a given ptxInstr, how many slots in the register-port window are consumed by operand k. The pre-RA scheduler needs that count to size live-range bitvectors. Post-RA scheduling (Backend C, sub_18FDAF0's PressureCost) uses the same query on a different per-target vtable to compute the per-instruction operand-pressure double in the RBT score.
Reduced excerpt — case 0x10 (narrow integer ALU)
The shortest case in the body is 0x10. It is reproduced here lightly cleaned to show the canonical shape every case follows:
// inputs: a1 = LegacyTarget* (this)
// a2 = ptxInstr* (pointer to 32-bit instruction-word stream)
// a3 = operand_idx (which operand we are asking about)
case 0x10u:
{
uint32_t opcode_word = *(uint32_t *)(a2 + 72);
int base_off = *(int *)(a2 + 80) + ~((opcode_word >> 11) & 2);
uint64_t base = a2 + 84;
uint32_t slot = *(uint32_t *)(base + 8LL * base_off);
int width = slot & 0xF; // operand-width bits
int cls = (slot >> 4) & 0x1F; // operand-class hint
int step;
int mult;
if (cls - 12u <= 1 || cls == 2) {
mult = 3 * width;
step = 2;
} else if (cls - 14u > 1) {
if ((int)width <= 1) return 0; // not a register source
mult = 2 * width;
step = 1;
} else {
mult = width;
step = 0;
}
int bit = (*(int *)(base + 8LL * width) < 0); // sign-encoded?
if (bit + mult <= a3) return 0; // beyond the window
if (width + bit <= a3) return width * step; // inside the window, scaled
if (width + bit - 1 == a3 && *(int *)(base + 8LL * width) < 0)
return 1; // sign-extended single slot
return width; // canonical: width regs
}
Every other case follows the same six-step template — read the opcode flag word, decode the operand-class field, pick a width / stride pair, bound-check a3, return the slot count. The body is 7.8 KB because there are 33 explicit cases plus a 180-BB default, each with a slightly different decoding rule for its opcode family.
Why a 7.8-KB switch and not a dispatch table
Modern SM tiers (sm_80--sm_120) answer the same query through the per-opcode tables documented in codegen/encoding-tables.md. The replacement is one lea r10, [rip + table]; jmp [r10 + 8*opcode_category] indirected through one of the 444-slot percase tables in the 0x22A5AA0 family — at most 470 cache-warm pointer chases, no per-case decision tree.
The legacy switch is a 425-block decision tree because the legacy targets predate the table-driven encoder family. The pre-sm_80 encoding-tables (0x22A5AA0 and its cohort) have 93 nullsub holes for opcodes that did not exist in those architectures. Without a way to express "this opcode does not exist on this target" cleanly in the table, the original implementer fell back on a hand-written switch with explicit per-opcode operand accounting. The switch fans out into 425 BBs because every case has its own per-operand-index logic (read the type bits, decide whether to extend a vector window, branch on whether the operand is the destination or a source, ...).
The two designs are functionally equivalent on the opcodes both support. The switch is simply the older form, kept around for the six target classes whose opcode set never changed enough to justify migrating to the table-driven format.
Position relative to PostSchedule (phase 110)
Phase 110 -- PostSchedule is a 51-byte dispatcher that indirects through (target->subtarget)->vtable[+0x90]. The body's only purpose is to call into the sub-target's postSchedule virtual. That +0x90 slot, not the +0x5F0 slot on which sub_A97600 lives, is the post-RA scheduling hook. The two slots are unrelated:
+0x90(slot 18 on the sub-target vtable): post-RA scheduling driver. Installsnullsub_45on the legacy six targets (PostSchedule no-op), orsub_1908D90(Backend C) on sm_80+.+0x5F0(slot 190 on the primary target vtable): per-operand source-slot-count query. Installssub_A97600on the legacy six targets, or per-opcode-table-driven equivalents on sm_80+.
PostSchedule running on a legacy SM (it does not — the SM-version gate at the top of sub_C60640 aborts for SmVersion <= 1) would still not call sub_A97600. The two slots address different vtable, different offset, different signature, different role.
Interaction with the scheduling pipeline
Every caller of (target_vtable + 1520) is a function whose body needs to know, for some instruction instr and some operand index k, how many register-file slots that operand consumes under the current target's encoding. The dominant clients are:
- Pre-RA dependency-graph build (
sub_8D9930/sub_894290->getOperandSrcSlotCount): when constructing RAW / WAR / WAW edges, the engine must know which physical slots in the register-port window each operand occupies. The slot count returned bysub_A97600sizes the per-operand window over which the bitvector trackersdef_bvanduse_bvmask-OR. - Latency cost computation (
sub_7E4150): the per-operand latency contribution to the priority score is scaled byslot_count / 4whenslot_count > 4. A wide tensor-load operand thus contributes proportionally more pressure than a single 32-bit source. - Read-port conflict accounting (
sub_7E8200): an operand whosegetOperandSrcSlotCountreturns zero is treated as "no register-port consumer" — immediate operands, predicate-only operands, and TMA-descriptor handles all gate out of read-port-conflict checks via this path. - Post-RA pressure cost (
sub_18FDAF0'sPressureCosthelper, on sm_80+ via a different vtable slot): the same query is performed against the per-opcode dispatch tables, returning identical semantics. The post-RA result feeds into key 2 of the RBT priority comparison.
All four clients treat the return value as load-bearing for correctness, not just performance: an operand wrongly classified as a register source would induce false RAW edges and serialise the schedule; an operand wrongly classified as non-register would let the scheduler issue back-to-back read-port-conflicting instructions, producing a wall-clock stall the encoder cannot express in the control word. The function is therefore in the small set of legacy-target methods that ptxas refuses to no-op even when other parts of the target stack are disabled.
QUIRK -- 7.8 KB of code, zero direct callers
sub_A97600 is only reachable through (target->vtable + 1520). The IDA xref database returns no incoming direct calls — every entry is the function's own intra-body branches. This is normal for a virtual method, but the size makes the function feel like it should have direct entry points. It does not. A reimplementer hunting for callers must walk every user of (target_vtable + 1520): at least 18 functions in the decompiled corpus call through this slot (sub_7E4150, sub_7E8200, sub_7EAD70, sub_7F5D50, sub_806F80, sub_887F00, sub_92EF10, sub_93A030, sub_93A0D0, sub_93BA60, sub_9511E0, sub_967000, sub_967860, sub_973550, sub_A90D60, sub_A94B80, sub_122AD60, ...). Searching for the symbol name yields nothing. Searching for the literal 0xA97600 in the rodata yields only the six vtable installs.
QUIRK -- the case-0x32 xmmword scratch buffer
Inside case 0x32u (the tex-sampler operand-class case) the body materialises three xmmword-sized constants from .rodata:xmmword_21B2EC0 plus three immediate _QWORD literals (433471489971520000LL, 0xE0A0804000B07LL, magic 336595972) into an 80-byte stack scratch buffer v229[80]. The constants encode the operand-slot windows for the four sampler-coordinate combinations indexed by (opcode32 >> 2) & 3. The scratch buffer is written three times during the case (the constants are re-materialised before each lookup). This is a Hex-Rays artefact: the underlying assembly issues a single movdqa plus three mov [rsp+...]s and reuses the buffer; the decompiler renders each access as a fresh store because the buffer is unfreed in between. A reimplementer should not infer that the constants change between accesses — they do not. The pattern is "fetch 16-byte coordinate-class table into stack, index by sampler-mode, return slot-window width".
QUIRK -- vtable+1024 / vtable+1056 / vtable+1288 / vtable+1480 self-callbacks
The default: arm of the switch (lines 1180--1364 of the decompilation) calls back into the same target's other vtable slots — vtable+1024, +1056, +1288, +1480 — to ask sub-queries like "does this target support 64-bit address mode?" and "does this target classify opcode 6 as a memory access?". The reason is that the default arm has to handle every opcode the seven explicit cases do not, and the precise answer depends on per-target capabilities. Rather than encoding all of those capabilities in a per-target operand-class table, the legacy implementation re-uses the target's other introspection hooks. The cost is that an unfamiliar reader watching a debugger step through the default arm sees the same target's this pointer bouncing through four different vtable slots before producing a final answer. This is the third-party indirection load that the modern table-driven replacement eliminated.
Function Map
| Address | Size | Identity | Confidence |
|---|---|---|---|
sub_A97600 | 7,780 B | LegacyTarget::getOperandSrcSlotCount(this, instr, operand_idx) -- vtable slot +0x5F0 on six legacy SM-target classes | HIGH |
sub_A97540 | 180 B | LegacyTarget::isOperandInBaseWindow(this, operand_idx) -- helper used only by sub_A97600's case 0x4D arm | HIGH |
sub_A90D60 | 1,120 B | LegacyTarget::isOperandEligibleAsPredecessor(this, instr, operand_idx, flag) -- sibling at vtable slot +0x5E8 (slot 189) on the same six vtables | HIGH |
sub_7E36C0 / sub_7E3640 / sub_7E3790 / sub_7E3800 / sub_7E40E0 | 76--204 B | Per-operand-class range helpers (operand-window start/end resolvers) | HIGH |
sub_91E610 | 399 B | getOperandTypeCode(instr, operand_idx) -- shared helper used by 100+ functions | HIGH |
nullsub_45 (0x680190) | 2 B | The post-RA-no-op sentinel — unrelated to sub_A97600, lives on the sub-target vtable +0x90, not the primary target vtable +0x5F0 | CERTAIN |
Cross-References
- Scheduling Algorithm -- the Backend A / B / C taxonomy.
sub_A97600is not a scheduler; the scheduler queries it. - Phase 110 -- PostSchedule -- supersedes the misclassification of
sub_A97600in that page's function map. - Latency Model -- the per-operand latency model that multiplies
sub_A97600's return value into per-source register-port costs. - SASS Encoding Function-Pointer Dispatch Tables -- the modern (sm_80+) table-driven replacement.
- Pipeline Phase Table -- shows the full 159-phase pipeline that this query is consulted from.