NVIDIA Custom Passes

Canonical Count

This wiki uses 35 as the headline count of NVIDIA custom passes.

Definition. An NVIDIA custom pass is a pass class (a PassInfoMixin subclass or MachineFunctionPass subclass) registered by cicc that has no upstream LLVM equivalent -- i.e., its symbol is absent from a stock LLVM 20.0.0 build of lib/Passes/PassRegistry.def, lib/Target/NVPTX, and the public llvm::* namespace dumps. The count is over pass classes, not registration entries: a single class registered under multiple StringMap keys (e.g., a parameterized variant or a pipeline-shorthand wrapper) is counted once. Pure analyses (classes that produce results consumed by other passes but do not mutate IR or MIR themselves) are counted separately and are not part of the 35.

Category	Count	Notes
IR-level NVIDIA pass classes	22	16 module + 9 function + 1 loop in the tables below, minus 4 parameterized/shorthand re-registrations
Machine-level NVIDIA pass classes	13	All 13 entries in the machine table are distinct classes
Total NVIDIA custom passes	35	Headline number
NVIDIA custom analyses	2	`rpa`, `merge-sets` — counted separately
Registration sites	--	`sub_2342890` (New PM) + `sub_12E54A0` (pipeline assembler)
Dedicated deep-dive pages	22	One per major pass
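
The arithmetic behind the table can be checked mechanically. The sketch below only restates the row counts given on this page; the variable names are illustrative.

```python
# Row counts taken from the tables on this page.
ir_rows = {"module": 16, "function": 9, "loop": 1}
re_registrations = 4    # parameterized/shorthand entries that reuse a class
machine_classes = 13
analyses = 2            # rpa, merge-sets: tracked separately, not in the total

ir_classes = sum(ir_rows.values()) - re_registrations
total_passes = ir_classes + machine_classes

print(ir_classes, total_passes)   # prints "22 35"
```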

Why 33 also appears. llvm/pipeline.md cites 33 when counting the StringMap registration entries inserted by sub_2342890 -- the line that mentions "12 module + 20 function + 1 loop" reflects raw entries, including parameterized variants and short pipeline aliases registered under separate names. Both numbers describe the same code: 35 counts unique pass classes, while 33 counts registration-table rows.

QUIRK. The pipeline.md per-scope split ("12 module + 20 function + 1 loop") and this page's per-scope split ("16 module + 9 function + 1 loop") count different things. pipeline.md sums StringMap keys per scope; many module-level passes also register a function-scope wrapper key, which inflates the "function" bucket. The tables on this page count distinct pass classes by their primary scope -- the scope at which the pass's run() method actually mutates IR. Neither split is wrong; they answer different questions.
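
The difference between the two counting methods is easiest to see on a toy registry. The rows below are entirely hypothetical (the keys, class names, and scopes are placeholders, not entries recovered from sub_2342890); they only illustrate how one class with a parameterized key and a function-scope wrapper key shifts the per-scope totals.

```python
from collections import Counter

# Hypothetical rows: (StringMap key, scope of the key, pass class, primary scope).
rows = [
    ("pass-a",      "module",   "PassA", "module"),
    ("pass-a<opt>", "module",   "PassA", "module"),    # parameterized variant
    ("pass-a-fn",   "function", "PassA", "module"),    # function-scope wrapper key
    ("pass-b",      "function", "PassB", "function"),
    ("pass-c",      "loop",     "PassC", "loop"),
]

# pipeline.md style: one count per registration key, bucketed by the key's scope.
keys_per_scope = Counter(scope for _, scope, _, _ in rows)

# This page's style: one count per distinct class, bucketed by primary scope.
unique_classes = {(cls, primary) for _, _, cls, primary in rows}
classes_per_scope = Counter(primary for _, primary in unique_classes)

print(dict(keys_per_scope))
print(dict(classes_per_scope))
```

In this toy data the wrapper key moves one entry into the "function" bucket even though PassA mutates IR at module scope, which is the effect described above.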

QUIRK. Three IR-level pass names in the tables below (nvvm-pretreat, check-kernel-functions, check-gep-index) are verifier-shaped: they fail the build on invalid IR rather than transforming it. They are still counted in the canonical 35 because they are pass classes registered as transformations in sub_2342890, not as AnalysisInfoMixin analyses. The two true analyses (rpa, merge-sets) are excluded from the count.

IR-Level Module Passes

Pass Name	Class / Function	Size	Description
`memory-space-opt`	`sub_1C70910` / `sub_1CA2920`	cluster	Resolves generic pointers to specific address spaces (global/shared/local/const). Warns on illegal ops: atomics on constant mem, wmma on wrong space. Parameterized: `first-time`, `second-time`, `no-warnings`, `warnings`
`printf-lowering`	`sub_1CB1E60`	6.6 KB native	Lowers `printf` → `vprintf` + local buffer. Validates format string is a literal. `"vprintfBuffer.local"`, `"bufIndexed"`
`nvvm-verify`	`sub_2C80C90`	~37 KB native (three functions)	Three-layer NVVM IR verifier (module + function + intrinsic). Validates triples, address spaces, atomic restrictions, pointer cast rules, architecture-gated intrinsic availability
`nvvm-pretreat`	`PretreatPass`	—	IR pre-treatment before optimization
`check-kernel-functions`	`NVPTXSetFunctionLinkagesPass`	—	Kernel function linkage validation
`check-gep-index`	—	—	GEP index validation
`cnp-launch-check`	`CNPLaunchCheckPass`	—	Cooperative launch validation
`ipmsp`	`IPMSPPass` / `NVVMIPMemorySpacePropagationPass`	—	Inter-procedural memory space propagation. The full `getTypeName()` symbol is `llvm::NVVMIPMemorySpacePropagationPass`; see IPMSP
`nv-early-inliner`	—	—	NVIDIA early inlining pass
`nv-inline-must`	`InlineMustPass`	—	Force-inline functions marked `__forceinline__`
`select-kernels`	`SelectKernelsPass`	—	Kernel selection for compilation
`set-global-array-alignment`	—	—	Parameterized: `modify-shared-mem`, `skip-shared-mem`, `modify-global-mem`, `skip-global-mem`
`lower-aggr-copies`	—	14 KB + 12 KB native	Lower aggregate copies: struct splitting, memmove unrolling. Param: `lower-aggr-func-args`
`lower-struct-args`	—	—	Lower structure arguments. Param: `opt-byval`
`process-restrict`	—	—	Process `__restrict__` annotations. Param: `propagate-only`
`lower-ops`	`LowerOpsPass`	—	Lower special operations. Includes FP128/I128 emulation via 48 `__nv_*` library calls
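
As an illustration of the `printf-lowering` row, the sketch below models the argument buffer that a `vprintf` call consumes: the format string is passed separately and the variadic arguments are packed into one local buffer. The layout rules used here (C default argument promotion, natural alignment, little-endian) are assumptions about what `sub_1CB1E60` emits, and `pack_vprintf_args` is a hypothetical helper name.

```python
import struct

def pack_vprintf_args(args):
    """Pack promoted varargs into one buffer, each at its natural alignment."""
    buf = bytearray()
    for value in args:
        if isinstance(value, float):
            fmt = "<d"      # float is promoted to double
        elif -2**31 <= value < 2**31:
            fmt = "<i"      # plain int stays 32-bit
        else:
            fmt = "<q"      # wider integers take 64 bits
        size = struct.calcsize(fmt)
        buf.extend(b"\x00" * (-len(buf) % size))   # pad up to natural alignment
        buf.extend(struct.pack(fmt, value))
    return bytes(buf)

# printf("%d %f", 7, 1.5) would become vprintf(fmt, buffer) with this buffer:
buffer = pack_vprintf_args([7, 1.5])
print(len(buffer))   # prints "16": 4-byte int, 4 bytes padding, 8-byte double
```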

IR-Level Function Passes

Pass Name	Function	Size	Description
`branch-dist`	`sub_1C47810` cluster	—	Branch distribution optimization. Knobs: `branch-dist-block-limit`, `branch-dist-func-limit`, `branch-dist-norm`
`nvvm-reflect`	`sub_1857160`	—	Resolves `__nvvm_reflect()` calls to integer constants based on target SM and FTZ mode. Runs multiple times as inlining exposes new calls
`nvvm-reflect-pp`	—	—	NVVM reflect preprocessor
`nvvm-intrinsic-lowering`	`sub_2C63FB0`	12 KB native	Lowers `llvm.nvvm.*` intrinsics to standard LLVM IR. Two levels: 0 = basic, 1 = barrier-aware. Runs up to 10 times in mid pipeline
`nvvm-peephole-optimizer`	—	—	NVVM-specific peephole optimizations
`remat`	`sub_1CE7DD0`	13 KB native	IR-level rematerialization. Analyzes live-in/live-out register pressure per BB. Contains IV demotion sub-pass (12 KB native)
`reuse-local-memory`	—	—	Local memory reuse optimization
`set-local-array-alignment`	—	—	Set alignment for local arrays
`sinking2`	—	—	NVIDIA-specific instruction sinking (distinct from LLVM's Sink pass)
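
The `nvvm-reflect` row can be pictured as a constant substitution over call sites. The sketch below works on source text rather than IR to stay short. The query strings `__CUDA_FTZ` and `__CUDA_ARCH`, the SM-times-10 encoding, and the default of 0 for unknown queries are assumptions borrowed from publicly documented `__nvvm_reflect` usage, not facts recovered from `sub_1857160`.

```python
import re

REFLECT_CALL = re.compile(r'__nvvm_reflect\("([^"]*)"\)')

def resolve_reflect(source, sm, ftz):
    """Replace each __nvvm_reflect("...") call with an integer constant."""
    values = {"__CUDA_ARCH": sm * 10, "__CUDA_FTZ": int(ftz)}
    # Unknown query strings fold to 0 (assumed default).
    return REFLECT_CALL.sub(lambda m: str(values.get(m.group(1), 0)), source)

code = 'if (__nvvm_reflect("__CUDA_FTZ")) fast(); else precise();'
print(resolve_reflect(code, sm=80, ftz=True))
# prints "if (1) fast(); else precise();"
```

Once the call is a constant, ordinary branch folding removes the dead arm, which is why the pass is worth rerunning after inlining exposes new calls.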

IR-Level Loop Pass

Pass Name	Function	Size	Description
`loop-index-split`	`sub_2CC5900` / `sub_1C7B2C0`	11 KB native each	Split loops on index conditions. NVIDIA-preserved pass (removed from upstream LLVM)
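
Index splitting in its textbook form replaces a loop containing an index comparison with two loops that need no comparison. The sketch below shows that classic transformation and checks that both versions visit the same iterations; it is a generic illustration, not the exact rewrite performed by `sub_2CC5900`.

```python
def original(n, k):
    out = []
    for i in range(n):
        if i < k:                 # loop-invariant split point tested every iteration
            out.append(("lo", i))
        else:
            out.append(("hi", i))
    return out

def split(n, k):
    mid = max(0, min(k, n))       # clamp the split point into [0, n]
    out = []
    for i in range(0, mid):       # iterations where i < k holds
        out.append(("lo", i))
    for i in range(mid, n):       # iterations where i >= k holds
        out.append(("hi", i))
    return out

print(split(5, 2) == original(5, 2))   # prints "True"
```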

Custom Analyses

Analysis Name	Purpose
`rpa`	Register Pressure Analysis — feeds into scheduling and rematerialization decisions
`merge-sets`	Merge set computation — used by coalescing and allocation

Machine-Level Passes

Pass Name	Function	Pass ID	Size	Description
Block Remat	`sub_2186D90`	`nvptx-remat-block`	9.5 KB native	Two-phase candidate selection + iterative "pull-in" for register pressure reduction. `"Max-Live-Function("`, `"Really Final Pull-in:"`
Machine Mem2Reg	`sub_21F9920`	`nvptx-mem2reg`	—	Promotes `__local_depot` stack objects back to registers post-regalloc
MRPA	`sub_2E5A4E0`	`machine-rpa`	9 KB native	Machine Register Pressure Analysis — incremental tracking, not in upstream LLVM
LDG Transform	`sub_21F2780`	`ldgxform`	—	Transforms global loads to `ldg.*` (texture cache) for read-only data
GenericToNVVM	`sub_215DC20`	`generic-to-nvvm`	218 bytes native (dispatcher)	Moves globals from generic to global address space
Alloca Hoisting	`sub_21BC7D0`	`alloca-hoisting`	—	Ensures all allocas are in entry block (PTX requirement)
Image Optimizer	`sub_21BCF10`	—	—	Optimizes texture/surface access patterns
NVPTX Peephole	`sub_21DB090`	`nvptx-peephole`	—	NVPTX-specific peephole optimization
Prolog/Epilog	`sub_21DB5F0`	—	—	Custom frame management (PTX has no traditional prolog/epilog)
Replace Image Handles	`sub_21DBEA0`	—	—	Replaces IR-level image handles with PTX texture/surface references
Extra MI Printer	`sub_21E9E80`	`extra-machineinstr-printer`	—	Register pressure statistics reporting
Valid Global Names	`sub_21BCD80`	`nvptx-assign-valid-global-names`	—	Sanitizes global names to valid PTX identifiers
NVVMIntrRange	`sub_216F4B0`	`nvvm-intr-range`	—	Adds `!range` metadata to NVVM intrinsics (e.g., tid.x bounds)
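
For the "Valid Global Names" row, the sketch below shows one way a name can be mapped onto the PTX identifier alphabet (letters, digits, `_`, `$`). The replacement token `_$_` and the function name `clean_up_name` are assumptions for illustration; the rule actually applied by `sub_21BCD80` is not stated on this page.

```python
ALLOWED_EXTRA = "_$"

def clean_up_name(name, token="_$_"):
    """Replace every character outside [A-Za-z0-9_$] with a placeholder token."""
    return "".join(
        c if (c.isascii() and c.isalnum()) or c in ALLOWED_EXTRA else token
        for c in name
    )

print(clean_up_name("foo.bar"))   # prints "foo_$_bar"
```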

Major Proprietary Subsystems

Dead Synchronization Elimination — `sub_2C84BA0`

Field	Value
Size	13 KB native (`sub_2C84BA0`)
Purpose	Removes redundant `__syncthreads()` barriers

Runs a bidirectional fixed-point dataflow analysis across the CFG, tracking four memory access categories per basic block in eight red-black tree maps. Each barrier deletion restarts the analysis from scratch. This pass is distinct from the lightweight basic-dbe. See dedicated page for full algorithm.
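
A heavily simplified model shows why the restart after each deletion matters. The sketch below handles only straight-line code, encodes shared-memory reads and writes as `"R"` and `"W"` and barriers as `"bar"`, and treats a barrier as needed only when it separates a write from a later access or a read from a later write. The encoding and the hazard rule are assumptions made for illustration; the real pass works on the full CFG.

```python
def has_hazard(before, after):
    # Write followed by any access, or read followed by a write.
    return ("W" in before and bool(after)) or ("R" in before and "W" in after)

def window(ops):
    """Collect accesses up to (not including) the next barrier."""
    seen = set()
    for op in ops:
        if op == "bar":
            break
        seen.add(op)
    return seen

def eliminate_dead_barriers(ops):
    ops = list(ops)
    changed = True
    while changed:                      # full restart after every deletion
        changed = False
        for i, op in enumerate(ops):
            if op != "bar":
                continue
            before = window(reversed(ops[:i]))
            after = window(ops[i + 1:])
            if not has_hazard(before, after):
                del ops[i]
                changed = True
                break
    return ops

print(eliminate_dead_barriers(["W", "bar", "bar", "R"]))
# prints "['W', 'bar', 'R']"
```

Judged in isolation, each of the two adjacent barriers looks removable because the other one bounds its window. Restarting after the first deletion exposes the write-then-read hazard, so the surviving barrier is kept.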

MemorySpaceOpt — Multi-Function Cluster

Function	Size	Purpose
`sub_1C70910`	—	Pass entry point
`sub_1C6A6C0`	—	Pass variant
`sub_1CA2920`	6.5 KB native	Address space resolution — `"Cannot tell what pointer points to, assuming global memory space"`
`sub_1CA9E90`	5.9 KB native	Secondary resolver
`sub_1CA5350`	8.8 KB native	Infrastructure
`sub_2CBBE90`	11 KB native	Memory-space-specialized function cloning
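
The resolution step can be sketched as a walk from a generic pointer back through casts and GEPs to the object it derives from. The toy value table and the function name `resolve_space` are hypothetical; the address space numbers are the standard NVPTX ones, and the fallback message is the string quoted in the table above.

```python
GENERIC, GLOBAL, SHARED, CONST, LOCAL = 0, 1, 3, 4, 5   # NVPTX address space numbers

def resolve_space(value, defs, warnings):
    """Follow casts/GEPs to the defining object and return its address space."""
    seen = set()
    while value in defs and value not in seen:
        seen.add(value)
        kind, operand = defs[value]
        if kind == "object":
            return operand              # operand is the object's address space
        value = operand                 # look through the cast or GEP
    warnings.append("Cannot tell what pointer points to, assuming global memory space")
    return GLOBAL

defs = {
    "@smem": ("object", SHARED),
    "%p":    ("addrspacecast", "@smem"),
    "%q":    ("gep", "%p"),
}
warnings = []
print(resolve_space("%q", defs, warnings), resolve_space("%arg", defs, warnings))
# prints "3 1", with one warning recorded for %arg
```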

NV Rematerialization Cluster

Function	Size	Role
`sub_1CE7DD0`	13 KB native	Main driver — live-in/live-out analysis, skip decisions
`sub_1CE67D0`	5.6 KB native	Block-level executor — `"remat_"`, `"uclone_"` prefixes
`sub_1CE3AF0`	11 KB native	Pull-in cost analysis — `"Total pull-in cost = %d"`
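
The live-in/live-out analysis that the main driver depends on can be approximated by a standard backward scan over one block. This is the generic textbook computation, shown only to make "register pressure per BB" concrete; the instruction encoding is hypothetical and the pull-in cost model of `sub_1CE3AF0` is not modeled.

```python
def max_live(block, live_out):
    """Backward scan: return (live_in, peak count of simultaneously live values).

    block is a list of (dest, [operands]) in program order.
    """
    live = set(live_out)
    peak = len(live)
    for dest, uses in reversed(block):
        live.discard(dest)      # the value is defined here, so it is dead above
        live.update(uses)       # operands must be live above this instruction
        peak = max(peak, len(live))
    return live, peak

block = [("a", ["x"]), ("b", ["x"]), ("c", ["a", "b"])]
print(max_live(block, live_out={"c"}))   # prints "({'x'}, 2)"
```

A rematerialization pass would compare such a peak against a register budget and, where it is exceeded, recompute cheap values near their uses instead of keeping them live.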

NLO — Simplify Live Output

Function	Size	Strings
`sub_1CE10B0`	9.2 KB native	`"Simplify Live Output"`, `"nloNewBit"`, `"newBit"`
`sub_1CDC1F0`	7.4 KB native	`"nloNewAdd"`, `"nloNewBit"`

Creates new add/bit operations to simplify live-out values at block boundaries.

IV Demotion — `sub_1CD74B0`

Field	Value
Size	12 KB native (`sub_1CD74B0`)
Strings	`"phiNode"`, `"demoteIV"`, `"newInit"`, `"newInc"`, `"argBaseIV"`, `"newBaseIV"`, `"iv_base_clone_"`, `"substIV"`

Demotes induction variables (e.g., 64-bit to 32-bit), creates new base IVs, and clones IV chains to reduce register pressure. Runs as a sub-pass of rematerialization. See dedicated page for full algorithm.
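
The idea of demotion can be sketched with plain integers: a 64-bit IV is replaced by a hoisted 64-bit base plus a narrow offset IV, provided the offset provably fits in 32 bits. The legality check and the function names below are assumptions for illustration, not the conditions tested by `sub_1CD74B0`.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def can_demote(step, trip_count):
    """True if the offset IV (0, step, 2*step, ...) stays within signed 32 bits."""
    last = step * (trip_count - 1) if trip_count > 0 else 0
    return INT32_MIN <= last <= INT32_MAX     # offsets are monotonic, so the last bounds all

def addresses_wide(base, start, step, trip_count):
    # Original form: one 64-bit IV carried through the loop.
    return [base + (start + i * step) for i in range(trip_count)]

def addresses_demoted(base, start, step, trip_count):
    new_base = base + start                   # hoisted base, computed once
    return [new_base + i * step for i in range(trip_count)]   # narrow offset IV

print(can_demote(4, 1000), can_demote(4, 2**30))   # prints "True False"
```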

RLMCAST — `sub_2D13E90`

Field	Value
Size	13 KB native (`sub_2D13E90`)
Purpose	Register-level multicast instruction lowering

Broadcasts a value to multiple register destinations. Uses 216-byte and 160-byte node structures.

Texture Group Merge (.Tgm) — `sub_2DDE8C0`

Groups texture load operations to hide latency. Uses the .Tgm suffix in scheduling and a function-pointer table of three predicates for grouping decisions.

NVVM Intrinsic Verifier — `sub_2C7B6A0`

Field	Value
Size	22 KB native (`sub_2C7B6A0`)
Purpose	Validates ALL NVVM intrinsics against SM capabilities

Architecture-gated validation for every intrinsic call. Part of the three-layer NVVM verifier (~37 KB native total across the three verifier functions).

NVVM Intrinsic Lowering — `sub_2C63FB0`

Field	Value
Size	12 KB native (`sub_2C63FB0`)
Purpose	Lowers NVVM intrinsics to concrete operations

Pattern-matching rewrite engine for llvm.nvvm.* intrinsics. It has two levels (0 = basic, 1 = barrier-aware) and runs up to 10 times in the mid pipeline. See dedicated page for full dispatch table.

Base Address Strength Reduction — `sub_2CA4A10`

Field	Value
Size	12 KB native (`sub_2CA4A10`)
Knobs	`do-base-address-strength-reduce` (two levels: 1 = no conditions, 2 = with conditions)

Scans loop bodies for memory ops sharing a common base pointer, hoists the anchor computation, rewrites remaining addresses as (anchor + relative_offset). See dedicated page for the anchor selection algorithm.
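
The rewrite described above can be sketched on (base, constant offset) pairs. Choosing the smallest offset in each group as the anchor is an assumption made for this example; the dedicated page describes the actual anchor selection.

```python
from collections import defaultdict

def strength_reduce(accesses):
    """accesses: list of (base, constant_offset) for memory ops in one loop body.

    Returns (anchors, rewritten), where each rewritten access is
    (base, offset relative to that base's anchor).
    """
    groups = defaultdict(list)
    for base, off in accesses:
        groups[base].append(off)
    anchors = {base: min(offs) for base, offs in groups.items()}   # assumed anchor choice
    rewritten = [(base, off - anchors[base]) for base, off in accesses]
    return anchors, rewritten

accesses = [("A", 16), ("A", 24), ("B", 0), ("A", 8)]
anchors, rewritten = strength_reduce(accesses)
print(anchors)     # prints "{'A': 8, 'B': 0}"
print(rewritten)   # prints "[('A', 8), ('A', 16), ('B', 0), ('A', 0)]"
```

Each original address equals its anchor plus the relative offset, so the anchor is computed once outside the loop and the remaining offsets are small constants.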

Common Base Elimination — `sub_2CA8B00`

Field	Value
Size	7 KB native (`sub_2CA8B00`)
Purpose	Hoists shared base address expressions to dominating CFG points

Operates at the inter-block level, whereas BASR works within a single loop. The two passes are complementary: together they reduce redundant address computation in GPU code. See dedicated page.

CSSA Transformation — `sub_3720740`

Field	Value
Size	3.5 KB native (`sub_3720740`)
Purpose	Conventional-SSA for GPU divergent control flow
Knobs	`do-cssa`, `cssa-coalesce`, `cssa-verbosity`, `dump-before-cssa`
Debug	`"IR Module before CSSA"`

Rewrites PHI nodes to be safe under warp-divergent execution by inserting explicit copy instructions at reconvergence points. See dedicated page for the divergence model.
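
The basic move of a conventional-SSA conversion is to give every PHI operand its own copy, so that PHI-related names never interfere. The sketch below uses the textbook placement (a copy at the end of each predecessor block) on a toy block table; the block names, the copy naming scheme, and the placement are assumptions and may differ from what `sub_3720740` does at reconvergence points.

```python
def insert_phi_copies(blocks, phis):
    """blocks: {name: [instr, ...]}; phis: [(dest, [(value, pred_block), ...])].

    Returns new blocks with one copy per PHI operand, plus PHIs rewritten
    to read the copies instead of the original values.
    """
    new_blocks = {name: list(body) for name, body in blocks.items()}
    new_phis = []
    for dest, incoming in phis:
        renamed = []
        for value, pred in incoming:
            copy = f"{dest}.{pred}"
            new_blocks[pred].append(f"{copy} = copy {value}")   # end of predecessor
            renamed.append((copy, pred))
        new_phis.append((dest, renamed))
    return new_blocks, new_phis

blocks = {"then": ["a = load p"], "else": ["b = load q"], "join": []}
phis = [("x", [("a", "then"), ("b", "else")])]
new_blocks, new_phis = insert_phi_copies(blocks, phis)
print(new_blocks["then"])   # prints "['a = load p', 'x.then = copy a']"
```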

Compile Modes — `sub_1C21CE0`

Mode	Constant
Whole-program no-ABI	`NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI`
Whole-program ABI	`NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI`
Separate ABI	`NVVM_COMPILE_MODE_SEPARATE_ABI`
Extensible WP ABI	`NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI`

Opt Level	Constant
None	`NVVM_OPT_LEVEL_NONE`
1	`NVVM_OPT_LEVEL_1`
2	`NVVM_OPT_LEVEL_2`
3	`NVVM_OPT_LEVEL_3`

Debug Info	Constant
None	`NVVM_DEBUG_INFO_NONE`
Line info	`NVVM_DEBUG_INFO_LINE_INFO`
Full DWARF	`NVVM_DEBUG_INFO_DWARF`

CICC Reverse Engineering Reference
