Embedded ptxas: Architecture Overview

Note: This page is the entry point to the nvlink-internal documentation of the ptxas backend statically linked into nvlink v13.0.88. The address map, mega-hub layout, ROT13 details, and sub_1112F30 per-module compilation driver below are recovered from nvlink's binary -- not from the standalone ptxas binary. For the standalone ptxas reverse-engineering reference (159-phase pipeline, full Ori IR architecture, target catalog), see the ptxas wiki -- in particular Pipeline Overview, Codegen Overview, and Targets.

The single most important structural fact about nvlink v13.0.88 is that approximately 95% of its 25.2 MB .text section is not linker code -- it is a complete, statically embedded copy of the ptxas assembler/compiler backend. The actual device linker (ELF merge, symbol resolution, relocation, layout, output) occupies roughly 1.2 MB in the address range 0x400000--0x530000. Everything from 0x530000 through the end of .text at 0x1D32172 (~24 MB, ~38,000 functions) is the ptxas compiler backend: IR primitives, instruction selection, register allocation, instruction scheduling, SASS binary encoding, PTX parsing, and ELF/cubin output generation.
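
The size figures above can be rechecked from the quoted addresses alone. The following is a minimal sketch, assuming .text begins at 0x400000 (the start of the linker core range) and that sizes are reported in binary megabytes; the constant and function names are illustrative and do not come from the binary.

```python
# Address boundaries quoted in the text above.
TEXT_START = 0x400000    # assumed start of .text (start of the linker core)
PTXAS_START = 0x530000   # first byte of the embedded ptxas backend
TEXT_END = 0x1D32172     # end of .text

MIB = 1024 * 1024


def region_sizes():
    """Return (linker_bytes, ptxas_bytes, total_bytes) for .text."""
    linker = PTXAS_START - TEXT_START
    ptxas = TEXT_END - PTXAS_START
    return linker, ptxas, linker + ptxas


linker_bytes, ptxas_bytes, total_bytes = region_sizes()
ptxas_share = ptxas_bytes / total_bytes

print(f"linker core:    {linker_bytes / MIB:.2f} MiB")
print(f"embedded ptxas: {ptxas_bytes / MIB:.2f} MiB")
print(f"ptxas share:    {ptxas_share:.1%}")
```

Under these assumptions the share rounds to the 95% quoted above.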

This page documents the evidence for this claim, the complete address map of the embedded ptxas subsystems, the five mega-hub instruction selector dispatch functions, and the ROT13 obfuscation applied to SASS mnemonics.

Evidence for Embedded ptxas

The embedded compiler is not a stripped-down stub -- it is a full-featured PTX-to-SASS compilation pipeline identical in capability to the standalone ptxas binary shipped in the CUDA toolkit. Key evidence:

  1. Named memory pools. The linker creates "nvlink option parser" and "nvlink memory space" arenas at startup. The embedded compiler creates its own arenas with ptxas-specific names. Memory pool diagnostics at 0x1AEE070 report pool usage statistics (total, freeable, leaked) for the compiler's internal allocations.

  2. Full option parser. sub_1103030 (29,803 bytes) registers the complete ptxas command-line option set via sub_42F130: --maxrregcount, --opt-level, --gpu-name, --device-debug, --fast-compile, --register-usage-level, --compile-only, --minnctapersm, --warn-spills, --lineinfo, --sp-bounds-check, --device-stack-protector, --sanitize, --position-independent-code, and approximately 50 more. These are forwarded from nvlink's LTO pipeline into the embedded compiler.

  3. Full compilation pipeline. sub_1112F30 (65,018 bytes) at 0x1112F30 is the top-level per-module compilation driver. It writes PTX headers (.version, .target, .entry __cuda_dummy_entry__ { ret; }), selects codegen callbacks based on mode flags (--compile-as-tools-patch, --extensible-whole-program, --compile-only), validates SM version compatibility, and dispatches to per-function codegen initialization.

  4. Multi-architecture support. sub_15C0CE0 (14,517 bytes) initializes 7 dispatch hash maps covering sm_75, sm_80, sm_86, sm_87, sm_88, sm_89, sm_90/90a, sm_100/100a/100f, sm_103/103a/103f, sm_110/110a/110f, sm_120/120a/120f, and sm_121/121a/121f. Each architecture gets 7 registered callbacks (nv.info emitter, resource usage table, instruction encoding table, compute capability array, perf-stats handler, cpf_optx handler, codegen options).

  5. Register allocation and instruction scheduling. The range 0x1850000--0x1A00000 contains the full backend compiler core: ScheduleInstructions (85 KB), ScheduleInstructionsReduceReg, DynBatch, HoistInvariants, ConvertMemoryToRegister, spilling regalloc, SMEM spilling, multi-class register allocation (R-regs, UR-regs, predicates), setmaxnreg CTA-reconfig for Blackwell+, and codegen verification passes.

  6. ISel mega-hubs. Five functions exceed 160 KB each. These are the top-level instruction selector dispatch functions, too large for Hex-Rays to decompile. Each calls hundreds of pattern matchers, selects the highest-priority match, and dispatches to the corresponding emitter.
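
The per-architecture registration described in item 4 can be pictured as seven maps keyed by SM name, one map per callback kind. This is only a sketch: the callback kind names paraphrase the list in item 4, the string placeholders stand in for function pointers, and nothing here reflects the actual hash map layout used by sub_15C0CE0.

```python
# One map per callback kind, as listed in item 4 (names are paraphrased).
CALLBACK_KINDS = (
    "nv_info_emitter",
    "resource_usage_table",
    "instruction_encoding_table",
    "compute_capability_array",
    "perf_stats_handler",
    "cpf_optx_handler",
    "codegen_options",
)


def build_dispatch_maps(archs):
    """Register one placeholder callback per (kind, arch) pair."""
    maps = {kind: {} for kind in CALLBACK_KINDS}
    for arch in archs:
        for kind in CALLBACK_KINDS:
            # Placeholder for the function pointer stored in the real map.
            maps[kind][arch] = f"{kind}:{arch}"
    return maps


maps = build_dispatch_maps(["sm_75", "sm_80", "sm_90", "sm_100"])
print(len(maps), "maps;", maps["nv_info_emitter"]["sm_90"])
```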

Relationship to Standalone ptxas

The standalone ptxas binary in the CUDA toolkit and the compiler backend embedded in nvlink share the same codebase. They differ in how they are invoked:

  • Standalone ptxas: Invoked as a separate process by nvcc. Reads .ptx files from disk, writes .cubin files.
  • Embedded ptxas in nvlink: Invoked in-process during LTO (-lto) and PTX JIT compilation. The entry point is sub_4BD760 (called from main() when a PTX input file is detected) or sub_4BC6F0 (called for LTO IR compilation after libnvvm produces PTX output). Options are forwarded programmatically rather than via argc/argv.

The embedded copy supports thread-pool parallelism for split compilation (sub_43FDB0 creates the pool, sub_4264B0 dispatches per-function work items). This is the same --split-compile-extended feature available in standalone ptxas.

Embedded ptxas Address Map

The following tables map the full address range of the embedded ptxas backend, one table per subsystem. All addresses are within the .text section of nvlink v13.0.88.

IR Primitives (0x530000 -- 0x620000, ~960 KB)

| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x530E80--0x530FD0 | <1 KB | IR node accessors | 22 | sub_530FB0 has 31,399 callers -- universal getOperand(idx) |
| 0x530FE0--0x5B1AB0 | 523 KB | ISel pattern matchers (SM50-7x) | 1,293 | 152 target opcodes, 36 priority levels |
| 0x5B1D80--0x5E4470 | 204 KB | MercExpand mega-hub | 1 | MercExpand dispatch + CFG analysis (too large for Hex-Rays) |
| 0x5E4470--0x600260 | 114 KB | MercExpand engine | ~50 | Bitvector ops, FNV-1a hash maps, register constraint propagation |
| 0x603F60--0x61FA60 | 112 KB | SM50 instruction encoders | 79 | Per-instruction binary encoding functions |

The IR node structure is accessed through 22 leaf functions that constitute the most-called code in the entire binary. sub_530FB0 (get operand by index) at 31,399 callers and sub_A49150 (get instruction attribute) at 30,768 callers form the universal accessor layer. The IR node layout:

Offset  Size   Field
  0     1B     operand type tag (1=immediate, 2=register, 6=memref, ...)
  4     4B     register class / encoding field (1023 = wildcard "any")
 14     1B     flag A
 15     1B     flag B
 20     4B     data type / secondary encoding
 28     2B     IR opcode
 32     8B     pointer to operand array (each operand = 32 bytes)
 40     4B     total operand count
 92     4B     first source operand index

With node denoting the base address of an IR node, the number of source operands is *(node+40) + 1 - *(node+92), and the number of destination operands is *(node+92).
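
The operand-count formulas can be exercised against a synthetic node. This is a sketch under stated assumptions: fields are read as little-endian unsigned integers, the 96-byte buffer size is hypothetical (just large enough to cover offset 92), and make_node is a test helper rather than anything found in the binary.

```python
import struct

# Offsets from the layout table above; byte order is assumed little-endian.
OFF_OPCODE = 28          # 2-byte IR opcode
OFF_OPERAND_COUNT = 40   # 4-byte total operand count
OFF_FIRST_SRC = 92       # 4-byte first source operand index
NODE_SIZE = 96           # hypothetical: minimum size covering offset 92 + 4


def operand_counts(node: bytes):
    """Return (num_sources, num_destinations) using the formulas in the text."""
    (total,) = struct.unpack_from("<I", node, OFF_OPERAND_COUNT)
    (first_src,) = struct.unpack_from("<I", node, OFF_FIRST_SRC)
    return total + 1 - first_src, first_src


def make_node(opcode: int, total: int, first_src: int) -> bytes:
    """Build a zero-filled synthetic node with only the fields used here."""
    node = bytearray(NODE_SIZE)
    struct.pack_into("<H", node, OFF_OPCODE, opcode)
    struct.pack_into("<I", node, OFF_OPERAND_COUNT, total)
    struct.pack_into("<I", node, OFF_FIRST_SRC, first_src)
    return bytes(node)


sources, destinations = operand_counts(make_node(opcode=0x12, total=3, first_src=1))
print(sources, destinations)
```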

ISA Encoding Tables (0x620000 -- 0xA70000, ~4.3 MB)

This is the largest contiguous subsystem -- 4.3 MB of template-instantiated functions defining the complete NVIDIA GPU instruction set encoding and metadata.

| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x620000--0x84DD70 | 2.2 MB | SM100+ SASS encoders | 1,537 | 128-bit instruction encoders for Blackwell ISA |
| 0x84DD70--0xA48290 | 1.7 MB | InstrDesc init table | 1,613 | Instruction descriptor initializers (operand types, latencies) |
| 0xA49010--0xA4AB10 | 4 KB | NVInst accessors | ~30 | IR instruction class hierarchy |
| 0xA4AB10 | 11 KB | NVInst constructor | 1 | Allocates and initializes instruction IR node |
| 0xA4B5E0--0xA4C7C0 | 5 KB | FNV-1a hash tables | 4 | Instruction lookup by hash |
| 0xA5B6B0 | 180 KB | setOperandField dispatch | 1 | Giant switch: sets operand fields by opcode class |
| 0xA62220 | 65 KB | setOperandImm dispatch | 1 | Giant switch: sets immediate operand values |
| 0xA65900 | 67 KB | getOperandField dispatch | 1 | Giant switch: reads operand fields |
| 0xA67910 | 141 KB | getDefaultOperandValue | 1 | Giant switch: returns default operand values per opcode |

The 1,537 SM100+ encoders each translate one instruction variant into a 128-bit SASS instruction word via the core primitive sub_4C28B0(buf, bit_offset, width, value). Opcode breakdown: major=1 (ALU/Scalar) 37.2%, major=2 (Vector/Memory/Control) 62.7%, major=3 (Special) 0.1%, across 118 instruction families.
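
A bitfield writer with the same argument order as sub_4C28B0(buf, bit_offset, width, value) might look like the sketch below. The bit numbering is an assumption (bit 0 is the least significant bit of byte 0), and the field positions in the usage lines are hypothetical; the real encoding layout is not documented here.

```python
def set_bitfield(buf: bytearray, bit_offset: int, width: int, value: int) -> None:
    """Write the low `width` bits of `value` into `buf` at `bit_offset`.

    Assumption: bit 0 is the least significant bit of byte 0 (little-endian).
    """
    if width <= 0 or bit_offset < 0 or bit_offset + width > len(buf) * 8:
        raise ValueError("bitfield does not fit in buffer")
    mask = (1 << width) - 1
    word = int.from_bytes(buf, "little")
    # Clear the target field, then OR in the truncated value.
    word = (word & ~(mask << bit_offset)) | ((value & mask) << bit_offset)
    buf[:] = word.to_bytes(len(buf), "little")


insn = bytearray(16)              # one 128-bit instruction word
set_bitfield(insn, 0, 12, 0x123)  # hypothetical 12-bit field at bit 0
set_bitfield(insn, 60, 8, 0xFF)   # hypothetical field straddling bit 64
print(insn.hex())
```

Treating the whole word as one integer keeps the sketch short; a field that crosses the 64-bit boundary needs no special handling.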

The 1,613 InstrDesc initializers populate per-instruction metadata: operand count, operand types/constraints, scheduling hints, latency estimates, and execution unit assignments. Combined, the encoder + descriptor tables define the complete NVIDIA GPU ISA from SM50 through SM121.
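
The table above lists FNV-1a hash tables used for instruction lookup. For reference, the standard 32-bit FNV-1a function is sketched below; whether the binary uses the 32-bit or 64-bit variant, and what it hashes, is not established here, so the bucket helper and its key are purely illustrative.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def bucket_for(key: bytes, bucket_count: int) -> int:
    """Hypothetical bucket selection for a hash table with `bucket_count` slots."""
    return fnv1a_32(key) % bucket_count


print(hex(fnv1a_32(b"a")), bucket_for(b"IMAD", 64))
```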

Instruction Codecs (0xA70000 -- 0xCA0000, ~2.2 MB)

Multi-architecture instruction encoding and decoding, organized per-SM.

| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0xA709F0 | 54 KB | Field offset query | 1 | 6,491-line switch: (opcode_class, field_id) -> bit_offset |
| 0xA7DE70 | 50 KB | Field presence query | 1 | Mirror: returns hasField boolean |
| 0xA87CE0--0xB25D50 | 630 KB | SM90/100 encoders | ~164 | Per-opcode binary instruction encoders |
| 0xACECF0--0xB77B60 | 700 KB | SM90/100 decoders | ~139 | Binary-to-IR instruction decoders |
| 0xB9FDE0--0xBC2CC0 | 142 KB | SM7x (Volta/Turing) codecs | ~60 | Encoders + decoders for SM70/SM75 |
| 0xBC3FC0--0xBFEC10 | 236 KB | SM75 extended codecs | ~80 | Turing-specific instruction variants |
| 0xC00070--0xC2FB60 | 193 KB | SM80 (Ampere) codecs | ~70 | Ampere instruction encoders |
| 0xC3D540--0xC50970 | 83 KB | SM80 decoders | ~15 | HMMA tensor core, SHF, memory decoders |
| 0xC7EC90--0xC9EE60 | 131 KB | SM86/89 (Ada) codecs | ~40 | GA10x / AD10x encoders + decoders |

Each encoder packs IR operands into a 128-bit SASS instruction word at *(a1+40). Each decoder unpacks a 128-bit word back into IR form. The sentinel value 1023 (register field) maps to RZ (zero register), and 31 (predicate field) maps to PT (true predicate). Architecture-specific encoder variants are differentiated by the helper functions they call: sub_A5A000 (SM70 Volta), sub_A5AB30 (SM75 Turing), sub_A59D80 (SM80 Ampere), etc.
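
The two sentinel values can be illustrated with a pair of formatting helpers. This is a minimal sketch: the R<n> and P<n> spellings follow common SASS disassembly conventions and are an assumption here, as are the helper names.

```python
REG_ZERO_SENTINEL = 1023  # register field value that maps to RZ
PRED_TRUE_SENTINEL = 31   # predicate field value that maps to PT


def reg_name(field: int) -> str:
    """Format a register field, mapping the sentinel to the zero register."""
    return "RZ" if field == REG_ZERO_SENTINEL else f"R{field}"


def pred_name(field: int) -> str:
    """Format a predicate field, mapping the sentinel to the true predicate."""
    return "PT" if field == PRED_TRUE_SENTINEL else f"P{field}"


print(reg_name(4), reg_name(1023), pred_name(0), pred_name(31))
```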

Per-Arch ISel Backends

Instruction selection is implemented as a linear-scan architecture: for each IR instruction, every pattern matcher is called in sequence, and the match with the highest priority wins. Each backend has its own set of pattern matchers, a mega-hub dispatch function (too large for Hex-Rays), and instruction emitters.

SM80 (Ampere) ISel Backend (0xCA0000 -- 0xDA0000, ~1 MB)

| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0xCA0000--0xCDC000 | 240 KB | Operand emission + packing | 137 |
| 0xCDD5F0--0xCDD690 | <1 KB | Operand predicates | 15 |
| 0xCE2000--0xD5FD70 | 510 KB | ISel pattern matchers | 259 |
| 0xD5FD70 | 239 KB | SM80 ISel mega-hub | 1 |
| 0xD9A400--0xDA0000 | 23 KB | Binary encoding | 17 |

Three-phase pipeline: (1) pattern match on IR attributes/operand types, (2) operand emission into instruction descriptor, (3) binary encoding into 128-bit SASS word.

SM100+ (Blackwell) SASS Codec -- Second Table (0xDA0000 -- 0xF16000, ~1.5 MB)

| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0xDA0310--0xE436D0 | 669 KB | Blackwell encoders | 438 |
| 0xE43C20 | 1 KB | Encoder dispatch | 1 |
| 0xE43DC0--0xF15A50 | 847 KB | Blackwell decoders | 648 |
| 0xEFE6C0 | 1 KB | Decoder dispatch | 1 |

Format 1 instructions: 147. Format 2 (extended with modifiers): 290. Format 3 (special wide): 1. Every encoder has a mirror decoder; the decoder count exceeds encoders because decoders also handle architecture-variant forms.

SM75 (Turing) ISel Backend (0xF16000 -- 0x100C000, ~984 KB)

| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0xF16030--0xF160F0 | <1 KB | Operand predicates | 15 |
| 0xF10080--0xF15A50 | 22 KB | Instruction emitters | 18 |
| 0xF16150--0xFBB780 | 678 KB | ISel pattern matchers | 276 |
| 0xFBB810 | 280 KB | SM75 ISel mega-hub | 1 |
| 0xFFFDF0--0x100BBF0 | 48 KB | Post-ISel emit+encode | 38 |

This is the largest single-architecture ISel backend. sub_FBB810 at 280 KB is the largest function in the binary.

SM89/90 (Ada/Hopper) Backend (0x100C000 -- 0x11EA000, ~1.9 MB)

| Range | Size | Subsystem | Functions |
|---|---|---|---|
| 0x100C000--0x10FFFFF | 1.0 MB | Shared instruction encoders | ~750 |
| 0x1100000--0x1120000 | 128 KB | Backend driver | ~30 |
| 0x1104950 | 38 KB | ptxas option parser | 1 |
| 0x1112F30 | 65 KB | Compilation driver main | 1 |
| 0x1116890 | 60 KB | ELF output + metadata gen | 1 |
| 0x1120000--0x119BF40 | 496 KB | ISel pattern matchers | ~160 |
| 0x119BF40 | 231 KB | SM89/90 ISel mega-hub | 1 |
| 0x11D4680--0x11EA000 | 90 KB | Scheduler + emission | ~16 |

PTX Frontend (0x11EA000 -- 0x15C0000, ~3.5 MB)

The PTX frontend parses PTX assembly text, validates instructions against SM version constraints, and lowers them to the internal IR consumed by the per-architecture ISel backends.

| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x11EA000--0x126C000 | 520 KB | ISel pattern-match predicates | ~160 | Shared across all SM targets |
| 0x126CA30 | 239 KB | PTX ISel mega-hub | 1 | Shared PTX-level instruction selector |
| 0x12A7000--0x12B0000 | 36 KB | PTX type system + operand builders | ~20 | Type constructors, operand IR building |
| 0x12B0000--0x12BA000 | 40 KB | Special register name table | ~20 | %ntid, %laneid, %smid, %clock64, %ctaid, ... |
| 0x12BA000--0x12D0000 | 88 KB | ISel lowering passes | ~30 | LTO-path instruction lowering |
| 0x12D0000--0x12D5000 | 20 KB | DWARF debug line info gen | ~5 | Line table emission for LTO-compiled code |
| 0x12D5000--0x1400000 | 1.2 MB | ISel pattern clones | ~500 | Parametric clones per SM (sm_5x through sm_10x) |
| 0x1400000--0x1430000 | 192 KB | LTO pipeline + ELF emit | ~20 | Top-level LTO pipeline, MMA lowering |
| 0x1430000--0x1442000 | 72 KB | PTX version/SM gates | ~30 | Version-gated instruction validators |
| 0x1442000--0x146BEC0 | 156 KB | Instruction emission handlers | ~80 | Per-instruction PTX code generators |
| 0x146BEC0 | 206 KB | ptx_load_store_validator | 1 | Memory operation validator with SM checks |
| 0x147EF50 | 288 KB | ptx_instruction_semantic_analyzer | 1 | Master validator: all SM version requirements |
| 0x1487650 | 240 KB | ptx_statement_processor | 1 | Top-level PTX statement handler |
| 0x14932E0--0x15B86A0 | 700 KB | Instruction handlers + builtins | ~250 | Code-template generators for CUDA builtins |
| 0x15B86A0 | 345 KB | cuda_builtin_prototype_generator | 1 | 608-case switch covering sm20 through sm10x builtins |

The cuda_builtin_prototype_generator is the second-largest function in the binary at 345 KB. It maps builtin index numbers to PTX prototype strings of the form .weak .func (...) __cuda_smXX_foo (...). Function families include div, rem, rcp, sqrt, dsqrt, barrier, wmma, shfl, vote, matchsync, warpsync, reduxsync, sanitizer_memcheck, tcgen05, bulk_copy, and cp_async_bulk_tensor.

Compilation Pipeline (0x15C0000 -- 0x1A00000, ~4.2 MB)

This region contains the per-function compilation pipeline from SM dispatch through code generation to backend verification.

| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x15C0CE0 | 15 KB | SM dispatch tables | 1 | 7 callback maps for sm_75 through sm_121 |
| 0x15C44D0--0x15CA450 | 348 KB | nv.info attribute emitters | ~10 | Per-SM EIATTR record generation (largest: 78 KB) |
| 0x1610000--0x163FFFF | 192 KB | PTX compilation frontend | ~40 | Operand handling, control flow, symbol management |
| 0x1640000--0x165FFFF | 128 KB | Codegen operand lowering | ~30 | Atom formatting, offset calculation |
| 0x1660000--0x169FFFF | 256 KB | ISel/scheduling + DWARF | ~40 | Instruction scheduling, peephole, debug emission |
| 0x16A0000--0x16DFFFF | 256 KB | OCG intrinsic lowering | ~80 | builtin_ocg_* handlers, tcmma/tensor operations |
| 0x16E0000--0x16E3AB0 | 12 KB | tcgen05 intrinsic codegen | ~10 | SM100 tensor memory address setup, guardrails |
| 0x16E4D60--0x16F6000 | 70 KB | PTX instruction builder | ~20 | Instruction construction, operand insert |
| 0x16F6000--0x1740000 | 296 KB | Tepid instruction scheduler | ~50 | Full instruction scheduling pipeline |
| 0x175D000--0x1768000 | 44 KB | Knobs/config infrastructure | ~15 | Runtime tuning parameters |
| 0x1769000--0x1850000 | 924 KB | SASS opcode tables | ~150 | SM70-SM120 opcode encoding/emission with ROT13 mnemonics |
| 0x1850000--0x186F000 | 124 KB | Instruction scheduling | ~15 | ScheduleInstructions (85 KB), ReduceReg, DynBatch, Cutlass-aware |
| 0x1878000--0x189C000 | 144 KB | ConvertMemoryToRegister | ~20 | Shared-memory to register promotion |
| 0x189C000--0x18FC000 | 384 KB | Register allocation | ~40 | Spilling, SMEM spilling, multi-class regalloc |
| 0x18FC000--0x1920000 | 144 KB | setmaxnreg / CTA-reconfig | ~20 | Blackwell+ register budget negotiation |
| 0x1916000--0x1960000 | 296 KB | mbarrier + ORI passes | ~30 | Copy propagation, dead-code elimination |
| 0x1960000--0x19E0000 | 512 KB | Codegen verification | ~40 | Uninitialized register detection, remat verify |
| 0x19A0000--0x1A00000 | 384 KB | Metrics + scheduling guidance | ~35 | Occupancy estimation, loop analysis, regalloc guidance |

SASS Emission (0x1A00000 -- 0x1D32172, ~3.2 MB)

The final segment of .text handles SASS instruction lowering, ABI enforcement, ELF/cubin output, name demangling, and DWARF debug info.

| Range | Size | Subsystem | Functions | Key Finding |
|---|---|---|---|---|
| 0x1A009C0--0x1A0B180 | 6 KB | Bug injection framework | ~5 | Testing hooks for intentional bug injection |
| 0x1A0B180--0x1A20000 | 84 KB | Instruction operand analysis | ~30 | Operand lowering, constant buffer encoding |
| 0x1A1A000--0x1A2A000 | 64 KB | Warp sync / mbarrier | ~15 | %%mbarrier_%s_%s instruction generation |
| 0x1A4B000--0x1A61090 | 88 KB | WGMMA pipeline analysis | ~20 | Warpgroup MMA live ranges, sync injection |
| 0x1A61090--0x1A6A480 | 38 KB | Scoreboard management | ~10 | Instruction scheduling scoreboard |
| 0x1A6A480--0x1AA2090 | 352 KB | ISel/lowering + encoding | ~80 | Instruction selection, SASS emission |
| 0x1AA2090--0x1ABF000 | 124 KB | Regalloc + ABI | ~30 | Register allocation, ABI handling |
| 0x1AEAA90--0x1AEE070 | 14 KB | Instruction vtable factory | ~10 | SASS instruction vtable construction |
| 0x1AEE070--0x1B00000 | 70 KB | Memory pool diagnostics | ~10 | Pool tracking, encoding passes |
| 0x1B00000--0x1B20000 | 128 KB | Register liveness | ~30 | Interference graph construction |
| 0x1B19750--0x1B40000 | 160 KB | Machine scheduling + CFG | ~40 | Basic block management |
| 0x1B40000--0x1B60000 | 128 KB | Dependency tracking | ~30 | Scoreboard / dependency graph |
| 0x1B60000--0x1B9FFFF | 256 KB | ISel + lowering (tail) | ~200 | PTX-to-SASS ISel, tail-call optimization |
| 0x1BA0000--0x1BFFFFF | 384 KB | ABI / calling convention | ~150 | Return address mgmt, convergent boundary, coroutine regs |
| 0x1C00000--0x1CDFFFF | 896 KB | ELF section builder | ~120 | .nv.constant, .nv.shared, cubin/fatbin container |
| 0x1CE0000--0x1CEDFFF | 56 KB | C++ name demangler | ~40 | Itanium ABI + MSVC demangler |
| 0x1CF0000--0x1D32172 | 265 KB | DWARF + LEB128 + KNOBS | ~140 | Debug info generation, SSE-accelerated LEB128, config system |
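
The last row mentions LEB128, the variable-length integer encoding used throughout DWARF. A scalar reference version of the unsigned form is sketched below; it illustrates the encoding only and says nothing about how the SSE-accelerated routine in the binary is structured.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte)."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    """Decode one unsigned LEB128 value from the start of `data`."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated ULEB128 value")


encoded = encode_uleb128(624485)
print(encoded.hex(), decode_uleb128(encoded))
```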

The Five Mega-Hub Functions

Five functions exceed 160 KB each. They are the top-level instruction selector dispatch functions for different SM architecture generations. Each contains a massive jump table that calls hundreds of ISel pattern matchers in sequence, selects the highest-priority match, then dispatches to the corresponding emitter. All five are too large for Hex-Rays to decompile.

| Address | Size | Target | Description |
|---|---|---|---|
| sub_FBB810 | 280 KB | SM75 (Turing) | Calls 276+ pattern matchers. Largest function in the binary |
| sub_126CA30 | 239 KB | SM50-7x (shared) | Covers Maxwell/Pascal/Volta backends |
| sub_D5FD70 | 239 KB | SM80 (Ampere) | Calls 259 pattern matchers for Ampere-class GPUs |
| sub_119BF40 | 231 KB | SM89/90 (Ada/Hopper) | Calls ~160 pattern matchers |
| sub_5B1D80 | 204 KB | SM50-7x (MercExpand) | MercExpand instruction expansion dispatch |

The ISel protocol is the standard ptxas linear-scan pattern matcher: every matcher is invoked with (ctx, ir_node, &pattern_id, &priority), the highest-priority match wins, and the emitter table then dispatches by pattern_id. Matchers query IR through sub_A49150 (attribute), sub_530FD0/sub_530FC0 (operand count), and sub_530FB0 (operand by index). For the algorithm in full detail see ptxas: Instruction Selection; the table above lists the nvlink-binary addresses of the five mega-hub dispatch functions that implement it.
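
The linear-scan protocol can be modeled in a few lines. In this sketch the matchers return a (pattern_id, priority) pair instead of writing through out-pointers, ties go to the first matcher, and the two matchers, their priorities, and the emitter strings are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Match = Tuple[int, int]  # (pattern_id, priority)
Matcher = Callable[[dict, dict], Optional[Match]]
Emitter = Callable[[dict, dict], str]


def select_pattern(ctx: dict, ir_node: dict, matchers: List[Matcher]) -> Optional[int]:
    """Call every matcher in sequence and keep the highest-priority hit."""
    best_id: Optional[int] = None
    best_priority = -1
    for matcher in matchers:
        hit = matcher(ctx, ir_node)
        if hit is None:
            continue
        pattern_id, priority = hit
        if priority > best_priority:  # assumption: first matcher wins ties
            best_id, best_priority = pattern_id, priority
    return best_id


def dispatch(ctx: dict, ir_node: dict, matchers: List[Matcher],
             emitters: Dict[int, Emitter]) -> str:
    """Pick the winning pattern, then call its emitter."""
    pattern_id = select_pattern(ctx, ir_node, matchers)
    if pattern_id is None:
        raise LookupError("no ISel pattern matched")
    return emitters[pattern_id](ctx, ir_node)


# Two hypothetical matchers: a generic add and a more specific add-immediate.
def match_add(ctx: dict, node: dict) -> Optional[Match]:
    return (1, 10) if node["opcode"] == "add" else None


def match_add_imm(ctx: dict, node: dict) -> Optional[Match]:
    if node["opcode"] == "add" and node["operands"][-1] == "imm":
        return (2, 20)
    return None


MATCHERS = [match_add, match_add_imm]
EMITTERS = {1: lambda c, n: "emit:add", 2: lambda c, n: "emit:add_imm"}

print(dispatch({}, {"opcode": "add", "operands": ["reg", "imm"]}, MATCHERS, EMITTERS))
```

Both matchers accept the add-immediate node, and the more specific one wins on priority, which is the property the mega-hubs rely on.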

ROT13 Obfuscation of SASS Mnemonics

NVIDIA applies ROT13 encoding to SASS instruction mnemonic strings stored in the binary. The decoder function sub_1A40AC0 uses SSE/SIMD vectorization for bulk ROT13 processing (loading 16 bytes at a time via _mm_load_si128). The SASS opcode table initializer at 0x1A85E40 stores all mnemonics in ROT13-encoded form; they are decoded at runtime before use.

Known decoded mnemonics:

| ROT13 | Decoded | Instruction |
|---|---|---|
| VZNQ | IMAD | Integer multiply-add |
| SZHY | FMUL | Float multiply |
| SNQQ | FADD | Float add |
| SRAPR | FENCE | Memory fence |
| ZREPHEL | MERCURY | Blackwell codename prefix |
| PPGY | CCTL | Cache control |
| OFLAP | BSYNC | Barrier synchronization |
| ERZBIR | REMOVE | Instruction removal tag |

The "MERCURY" prefix (ZREPHEL in ROT13) corresponds to sm_100+ (Blackwell) and appears throughout the compilation pipeline as a codename. ROT13 is also applied to some internal ELF section names: .sync_restrict::shared::read::mma::a is stored as its ROT13 equivalent, .acc::f16 as .npp::s16, and .sp::2to4 as .fc::2gb4.
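
The encoding is easy to reproduce with Python's built-in rot_13 codec, which rotates ASCII letters and leaves digits and punctuation untouched. This scalar sketch only mirrors the transformation; it does not model the 16-byte SIMD loop in sub_1A40AC0.

```python
import codecs


def rot13(text: str) -> str:
    """Apply ROT13; the transform is its own inverse."""
    return codecs.encode(text, "rot_13")


for stored in ("VZNQ", "SZHY", "OFLAP", "ZREPHEL", ".npp::s16", ".fc::2gb4"):
    print(f"{stored:10} -> {rot13(stored)}")
```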

Size Summary

| Subsystem | Address Range | Size | Functions | % of .text |
|---|---|---|---|---|
| Linker core (not ptxas) | 0x400000--0x530000 | 1.2 MB | ~600 | 5% |
| IR primitives + SM50-7x ISel | 0x530000--0x620000 | 960 KB | ~1,450 | 4% |
| ISA encoding tables | 0x620000--0xA70000 | 4.3 MB | ~3,150 encoders + ~1,613 descriptors | 17% |
| Instruction codecs (multi-arch) | 0xA70000--0xCA0000 | 2.2 MB | ~700 | 9% |
| SM80 ISel backend | 0xCA0000--0xDA0000 | 1.0 MB | ~430 | 4% |
| SM100+ codec (second table) | 0xDA0000--0xF16000 | 1.5 MB | ~1,090 | 6% |
| SM75 ISel backend | 0xF16000--0x100C000 | 984 KB | ~350 | 4% |
| SM89/90 backend | 0x100C000--0x11EA000 | 1.9 MB | ~980 | 8% |
| PTX frontend | 0x11EA000--0x15C0000 | 3.5 MB | ~1,100 | 14% |
| Compilation pipeline | 0x15C0000--0x1A00000 | 4.2 MB | ~700 | 17% |
| SASS emission + ABI + ELF | 0x1A00000--0x1D32172 | 3.2 MB | ~1,300 | 13% |
| Total embedded ptxas | 0x530000--0x1D32172 | ~24 MB | ~38,000 | ~95% |

Cross-Reference: Key Functions

| Function | Size | Identity | Role |
|---|---|---|---|
| sub_530FB0 | <1 KB | IRNode_GetOperand | Universal operand accessor (31,399 callers) |
| sub_A49150 | <1 KB | IRInstr_GetAttribute | Universal attribute accessor (30,768 callers) |
| sub_4C28B0 | <1 KB | setBitfield | Core encoding primitive for all SASS encoders |
| sub_1112F30 | 65 KB | ptxas_main_compilation_driver | Top-level per-module compilation entry |
| sub_1103030 | 30 KB | ptxas_option_definition_table_builder | Full option parser (~60 options) |
| sub_1104950 | 38 KB | ptxas_command_option_parser | Option processing and validation |
| sub_15C0CE0 | 15 KB | init_sm_dispatch_tables | SM architecture callback registration |
| sub_1A40AC0 | 1.9 KB | rot13_string_decoder | SIMD-vectorized ROT13 decoder |
| sub_4BD760 | varies | ptxas_jit_compile | Entry point for PTX JIT compilation |
| sub_4BC6F0 | varies | compile_linked_lto_ir | Entry point for LTO compilation |
| sub_15B86A0 | 345 KB | cuda_builtin_prototype_generator | 608-case builtin switch (second-largest function) |
| sub_147EF50 | 288 KB | ptx_instruction_semantic_analyzer | Master instruction validator |

Compilation Pipeline: sub_1112F30

sub_1112F30 (65,018 bytes at 0x1112F30, ~2,164 decompiled lines) is the top-level per-module compilation driver inside nvlink's embedded ptxas. It receives a module context a1 and a PTX module descriptor a2, then orchestrates the full PTX-to-SASS compilation across 26 phases before returning. Confidence: HIGH — derived directly from Hex-Rays output of this function.

This driver corresponds structurally to the entry/dispatch path in standalone ptxas — see ptxas: Pipeline Overview for the generic 159-phase pipeline narrative and ptxas: Entry Point for option-parser behavior. The table and notes below preserve the binary-specific phase order, callback choices, and helper addresses that are unique to the nvlink-embedded copy.

Phase Table

| # | Phase | Key calls / addresses | Effect |
|---|---|---|---|
| 1 | Option query & cache config | option_get_bool on def-load-cache, force-load-cache, def-store-cache, force-store-cache | Captures cache-mode booleans into stack locals |
| 2 | Cancellation check | reads a1+288; invokes cancel_callback(a1+296) | Longjmps to error handler if returns 1 |
| 3 | Timing gate | sub_45CCD0 wall-clock, sub_44EF30 high-res; flags at a1+104..107, a1+402 | Starts timers if profiling enabled |
| 4 | Callback registration | sub_1108860 instr CB, sub_1101EB0 func CB, sub_12B30E0/sub_12B31D0 PTX version tables | Installs per-IR-node callbacks |
| 5 | SM version validation | sscanf on .target; sub_12A8360 PTX/SM compat | Fatal if module SM > max supported |
| 6 | Mode flag dispatch | selects (init_fn, begin_fn) -- see Compilation Mode Matrix | Picks one of four codegen pathways |
| 7 | PTX header emission | sub_12AF550 inline / fopen + fprintf(.version/.target/.entry __cuda_dummy_entry__ { ret; }) + sub_12AF200 | Emits dummy entry when none exist |
| 8 | Tools-patch warnings | conditional on --compile-as-tools-patch, --assyscall | Warns about allocating textures/surfaces/samplers/constants |
| 9 | Compilation flags setup | PIC processing; --fast-compile/--extensible-whole-program disables for ABI-less; --legacy-bar-warp-wide-behavior (SM70 only); --g-tensor-memory-access-check (SM100+ only) | Resolves flag conflicts |
| 10 | Hash maps + codegen context | 8x sub_4489C0/sub_465020 (caps 0x100/0x400/0x40/0x20); per-func resource array via sub_12AE300 (48 B/entry at a1+336); result array (112 B/entry at a1+256) | Allocates module-wide tables |
| 11 | Register callbacks on module IR | sub_1102AC0 per-function, sub_1101E90 per-symbol, sub_1111DB0 per-func-IR, sub_1101DE0 per-global (unless --compile-only), sub_110F5E0 per-section, sub_1101F60 per-symbol post-process | Installs IR walkers |
| 12 | Address width + register budget | SM≤13 → 32-bit (maxnreg=32); SM>13 → from module metadata; SM>90 + 32-bit → fatal | Sets address mode |
| 13 | Entry point collection | resolves -e / -E names through module reader; else uses a2+88 | Builds ordered entry list |
| 14 | Transfer state into codegen context | copies maps/flags into a1+1072..1296; alias map (cap 0x100); callee usage map (cap 0x418) | Snapshots compilation state |
| 15 | init_callback(ctx, entries) | from Phase 6: sub_110CD20 / sub_110CBA0 / sub_110D0B0 / sub_110D110 | Builds per-function codegen descriptors via sub_110BC90, stores at a1+1192 |
| 16 | Load/store cache mode | per-function: respects force-load-cache (mode=2), def-load-cache (mode=1), force-store-cache, callee analysis | Assigns memory-op cache mode |
| 17 | Indirect call + MMA validation | per-function: warns on indirect mma.f64; fatal on mutual recursion markers | Frontend correctness check |
| 18 | Scheduling class assignment | class 0 / 1 / 2 propagated through call graph; class 2 = aggressive (callee analysis) | Picks scheduling aggressiveness |
| 19 | Debug info setup | sub_1672520 dwarf_init if --device-debug | Initializes DWARF context |
| 20 | Reserved register configuration | --first-reserved-rreg (min=4); total = first + count | Reserves R-regs from regalloc |
| 21 | Build per-function codegen config | packs ~50 flags (device_debug, lineinfo, fast_compile, maxrregcount, opt_level, compile_only, tools_patch, ewp, preserve_relocs, sm_version, address_width, default caches, PIC, ...) into struct; sub_16257C0 creates CodegenPipeline | Builds pipeline-config object |
| 22 | Output file setup | fopen_and_truncate on --output path | Prepares dump file |
| 23a | Sequential per-function loop | sub_110AA30 codegen_init → sub_1655A60 48-pass pipeline → sub_1102B30 codegen_compile (setjmp-wrapped) → sub_110D2A0 codegen_finalize | Compiles each function in main thread |
| 23b | Parallel per-function loop | sub_43FDB0 thread-pool create → for each func build 48-B work item → sub_43FF50 enqueue → sub_43FFE0 barrier → sub_43FE70 destroy. Worker = sub_1107420 → sub_1102B30 (setjmp + compile) + timing | Same as --split-compile-extended in standalone ptxas |
| 24 | Post-compilation cleanup | register-budget cross-check if --compile-only (caller-callee budget validation through register_budget_map) | Validates inter-function constraints |
| 25 | Pipeline config teardown | sub_1626480 pipeline_finalize | Destroys CodegenPipeline |
| 26 | Final cleanup | sub_4650A0 destroys hash maps; frees per-function arrays | Returns 0 |
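
Phase 7 is the easiest phase to picture concretely: a header is written, and a dummy entry is appended only when the module has no entry of its own. The sketch below assumes one directive per line and uses an illustrative PTX version and target; the exact text produced by the fprintf calls is not reproduced in this page.

```python
DUMMY_ENTRY = ".entry __cuda_dummy_entry__ { ret; }"


def ptx_header(version: str, target: str, has_entry: bool) -> str:
    """Build the header text; add the dummy entry when no entry exists."""
    lines = [f".version {version}", f".target {target}"]
    if not has_entry:
        lines.append(DUMMY_ENTRY)
    return "\n".join(lines) + "\n"


# The version and target values are illustrative, not taken from the binary.
print(ptx_header("8.8", "sm_90", has_entry=False), end="")
```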

Per-Function Inner Pipeline (Phase 23)

For each function in the compile list, Phase 23 (either sequential or thread-pool worker) runs the following sub-stages:

| Sub-stage | Address | Role |
|---|---|---|
| codegen_init | sub_110AA30 | Allocate 360-B per-function state; create OCG context; set producer="NVIDIA", tool="ptxocg.0.0"; configure ~30 SM-specific fields; invoke vtable->init to map symbol names |
| codegen_per_func | sub_1655A60 | Drive the 48-pass codegen pipeline -- see The 48-Pass Codegen Pipeline |
| codegen_compile | sub_1102B30 | setjmp-wrapped vtable->compile call; longjmp + record failure on error |
| Timing record | -- | timing_record writes per-function start/end time |
| codegen_finalize | sub_110D2A0 | Emit ELF section content (.text, .nv.info, .nv.constant); write EIATTR register usage records; write SASS binary; cleanup per-function OCG state |

In parallel mode each worker additionally allocates three local sorted maps (cap 8), copies a 15×16-byte snapshot of driver state into the per-function state, and allocates a 216-B per-function DWARF state via dwarf_register. After the barrier, the main thread merges per-thread maps, restores DWARF and pipeline snapshots, and runs codegen_finalize sequentially for each function so that register-budget propagation observes a deterministic order.

Key Subroutine Reference

| Address | Name (reconstructed) | Role in Pipeline |
|---|---|---|
| sub_1112F30 | ptxas_compile_module | Top-level per-module driver (this function) |
| sub_110AA30 | codegen_init | Per-function OCG context creation + field setup |
| sub_1655A60 | codegen_per_func | 48-pass codegen pipeline (ISel, regalloc, sched, encode) |
| sub_1102B30 | codegen_compile | Error-wrapped compilation (setjmp + vtable dispatch) |
| sub_110D2A0 | codegen_finalize | ELF emission, nv.info, SASS output, cleanup |
| sub_1107420 | thread_worker | Thread pool work item: compile + timing + cleanup |
| sub_110CD20 | compile_only_init | Init for --compile-only / --compile-as-tools-patch |
| sub_110D0B0 | standard_init | Init for normal LTO compilation |
| sub_110D110 | ewp_init | Init for --extensible-whole-program mode |
| sub_110CBA0 | standard_init_ewp | Init for standard mode with EWP flag |
| sub_11089E0 | compile_only_begin | Begin callback for compile-only modes |
| sub_1107F10 | ewp_begin | Begin callback for EWP mode |
| sub_1109180 | standard_begin | Begin callback for standard compilation |
| sub_110BC90 | alloc_codegen_record | Allocate per-function codegen descriptor |
| sub_16257C0 | create_codegen_pipeline | Build the codegen pipeline configuration object |
| sub_1626480 | pipeline_finalize | Tear down the codegen pipeline |
| sub_43FDB0 | create_thread_pool | Create split-compilation thread pool |
| sub_43FF50 | enqueue_work_item | Submit per-function work to thread pool |
| sub_43FFE0 | thread_pool_barrier | Wait for all enqueued work to complete |
| sub_43FE70 | thread_pool_destroy | Destroy thread pool |
| sub_12AE300 | get_function_count | Return number of functions in module |
| sub_12AF550 | ptx_emit_entry_inline | Emit PTX entry point to in-memory buffer |
| sub_12AF200 | ptx_parse_file | Parse a PTX file into module representation |
| sub_12B30E0 | ptx_version_table_init | Initialize PTX version compatibility tables |
| sub_12A8360 | ptx_version_compatible | Check PTX/SM version compatibility |
| sub_15C3DD0 | sm_name_to_ordinal | Convert SM target string to ordinal index |
| sub_1672520 | dwarf_init | Initialize DWARF debug info context |

Compilation Mode Matrix

The mode flag dispatch at Phase 6 selects one of four codegen pathways. The choice is determined by command-line flags forwarded into the embedded compiler:

| Mode | Condition | init_fn | begin_fn | Behavior |
|---|---|---|---|---|
| Compile-only | --compile-only or --compile-as-tools-patch or --assyscall | sub_110CD20 | sub_11089E0 | Compile all functions independently. No cross-function optimization. Used for tools patches (Nsight Compute, Nsight Systems) |
| EWP (no debug) | --extensible-whole-program and NOT --device-debug | sub_110D110 | sub_1107F10 | Whole-program optimization. Functions compiled with global visibility into callee register usage. Enables aggressive inlining decisions |
| EWP + debug | --extensible-whole-program AND --device-debug | sub_110CD20 | sub_11089E0 | Falls back to compile-only pathway because whole-program optimization conflicts with debug info fidelity |
| Standard | Normal LTO compilation (default) | sub_110D0B0 | sub_1109180 | Standard per-function compilation with cross-function register budget propagation. Used for typical nvlink LTO |
| Standard + EWP flag | Standard with --extensible-whole-program hint | sub_110CBA0 | sub_1109180 | Same as standard but with EWP-aware init (reserves additional register space for potential future extensibility) |
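
The matrix reads naturally as a small decision function. The sketch below is one plausible reading, not the recovered control flow: the order in which the conditions are tested is an assumption, and ewp_hint is a hypothetical field standing in for whatever distinguishes the "Standard + EWP flag" row from full EWP mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeFlags:
    compile_only: bool = False   # --compile-only
    tools_patch: bool = False    # --compile-as-tools-patch
    assyscall: bool = False      # --assyscall
    ewp: bool = False            # --extensible-whole-program
    device_debug: bool = False   # --device-debug
    ewp_hint: bool = False       # hypothetical: "Standard + EWP flag" row


COMPILE_ONLY = ("sub_110CD20", "sub_11089E0")
EWP = ("sub_110D110", "sub_1107F10")
STANDARD = ("sub_110D0B0", "sub_1109180")
STANDARD_EWP = ("sub_110CBA0", "sub_1109180")


def select_mode(flags: ModeFlags):
    """Return (init_fn, begin_fn) names; the test order is an assumption."""
    if flags.compile_only or flags.tools_patch or flags.assyscall:
        return COMPILE_ONLY
    if flags.ewp:
        # EWP with debug info falls back to the compile-only pathway.
        return COMPILE_ONLY if flags.device_debug else EWP
    if flags.ewp_hint:
        return STANDARD_EWP
    return STANDARD


print(select_mode(ModeFlags(ewp=True, device_debug=True)))
```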

The 48-Pass Codegen Pipeline (sub_1655A60)

The per-function codegen entry point sub_1655A60 runs a 48-pass pipeline (passes 0--47). Each pass is enable-gated by the SM dispatch vtable at a1[3757] (registered by sub_15C0CE0); enable flags occupy a1[160..207]. The pass numbering, vtable offsets, and binary slot allocation are the nvlink-embedded copy's own and do not match the standalone ptxas pass numbering (~159 phases) — for the corresponding standalone passes see ptxas: Passes Index, ptxas: Instruction Selection, and ptxas: Scheduling Algorithm.

| Pass(es) | Role | Gating |
|---|---|---|
| 0 | Zero placeholder | always off |
| 1 | Initial IR canonicalization | unconditional |
| 2 | Instruction count estimation | vtable+120 |
| 3--20 | SM-gated optimization passes (architecture-specific opt) | vtable+72 capability query per pass |
| 21 | Address-width-dependent setup | gated on addr_width |
| 22 | Register class initialization | unconditional for SM >= sm_50 |
| 23--38 | Core backend: ISel mega-hub dispatch, regalloc (graph coloring + spilling), ScheduleInstructions, peephole, SASS encoding | universally enabled for SM >= sm_50 |
| 39 | Initial ABI frame setup | unconditional |
| 40--42 | Final lowering passes | unconditional |
| 43 | Peephole cleanup | unconditional |
| 44--45 | Reserved | always off |
| 46 | Binary encoding query | vtable+488 |
| 47 | Final verification + pass-count teardown | unconditional |

After the pass loop, sub_1655A60 registers additional IR lowering callbacks (sub_161F1C0, sub_161F800, sub_1620460) on the function's basic block list, sets up UDT/UFT relocations for Blackwell+ (SM ordinal > 26 = sm_100+), and processes the function's call graph for register pressure analysis.
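
The enable-flag gating amounts to a loop over 48 slots. The sketch below models only that loop: the flags are a plain list rather than the a1[160..207] bytes, the passes are stand-in callables, and the example flag vector simply turns off the slots the table marks as always off.

```python
NUM_PASSES = 48


def run_pipeline(enable_flags, passes, state):
    """Run each enabled pass in index order; return the indices executed."""
    if len(enable_flags) != NUM_PASSES or len(passes) != NUM_PASSES:
        raise ValueError("expected exactly 48 flags and 48 passes")
    executed = []
    for index in range(NUM_PASSES):
        if enable_flags[index]:
            passes[index](state)
            executed.append(index)
    return executed


# Example flag vector: everything on except the slots listed as always off.
ALWAYS_OFF = {0, 44, 45}
flags = [index not in ALWAYS_OFF for index in range(NUM_PASSES)]

# Stand-in passes that record their own index in the shared state.
passes = [lambda state, i=i: state.append(i) for i in range(NUM_PASSES)]

trace = []
executed = run_pipeline(flags, passes, trace)
print(len(executed), "passes ran; skipped", sorted(ALWAYS_OFF))
```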

Sequential vs. Parallel Compilation

Selected by ctx->thread_count at a1+668. Sequential (count = 0) runs codegen_init -> codegen_compile -> codegen_finalize on the main thread with timing recorded between stages. Parallel (count > 0) creates a thread pool via sub_43FDB0, builds 48-byte work items containing the per-function state snapshot (360 B + 3 local sorted maps + 216-B DWARF state + pipeline snapshot), enqueues each via sub_43FF50, barriers on sub_43FFE0, then destroys via sub_43FE70. Workers run sub_1107420, which delegates to sub_1102B30 (setjmp-wrapped compile) and records timing + peak memory. After the barrier the main thread merges per-thread maps back, restores DWARF and pipeline snapshots, and runs codegen_finalize sequentially so register-budget propagation is deterministic. If qword_2A64430 is non-null, each worker error-checks via sub_1D1E060/sub_1D1E300 after its work item. This is the same --split-compile-extended mechanism available in standalone ptxas.
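
The compile-in-parallel, finalize-in-order structure can be sketched with Python's standard thread pool. Everything here is a stand-in: compile_function plays the role of the worker (an exception models the longjmp failure path), finalize plays the role of codegen_finalize, and the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_function(name: str) -> dict:
    """Stand-in worker: compile one function and record failure, not raise."""
    try:
        if name.startswith("bad"):
            raise RuntimeError("compile failed")  # models the longjmp path
        return {"name": name, "ok": True}
    except RuntimeError as exc:
        return {"name": name, "ok": False, "error": str(exc)}


def finalize(result: dict) -> str:
    """Stand-in for the sequential finalize step."""
    return f"{result['name']}:{'ok' if result['ok'] else 'failed'}"


def compile_module(functions, thread_count: int):
    """thread_count == 0 runs sequentially; otherwise use a thread pool."""
    if thread_count == 0:
        results = [compile_function(name) for name in functions]
    else:
        with ThreadPoolExecutor(max_workers=thread_count) as pool:
            # map() yields results in input order once each item is done.
            results = list(pool.map(compile_function, functions))
    # Finalize on the calling thread, in input order, after all work is done.
    return [finalize(result) for result in results]


names = ["kernel_a", "bad_kernel", "kernel_c"]
print(compile_module(names, thread_count=4))
```

Because finalization happens in input order on one thread, the output is the same for any thread count, which is the determinism property described above.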

Cross-References

  • IR Nodes -- IR node structure and universal accessor functions
  • ISel Hubs -- the five mega-hub instruction selector dispatch functions
  • Peephole -- peephole optimization passes (ORI, scheduling-phase, linker-level)
  • PTX Parsing -- the embedded PTX assembler frontend
  • Register Allocation -- graph-coloring register allocator with spilling
  • Scheduling -- pre-RA and tepid (post-RA) instruction schedulers
  • Architecture Dispatch -- per-SM vtable dispatch system
  • Mercury Overview -- Mercury ISA encoding pipeline
  • FNLZR -- post-link binary rewriter for Mercury targets
  • LTO Overview -- how the LTO pipeline invokes the embedded compiler
