NVIDIA Custom Passes

Canonical Count

35 NVIDIA custom passes is the headline number used throughout this wiki.

Definition. An NVIDIA custom pass is a pass class (a PassInfoMixin subclass or MachineFunctionPass subclass) registered by cicc that has no upstream LLVM equivalent -- i.e., its symbol is absent from a stock LLVM 20.0.0 build of lib/Passes/PassRegistry.def, lib/Target/NVPTX, and the public llvm::* namespace dumps. The count is over pass classes, not registration entries: a single class registered under multiple StringMap keys (e.g., a parameterized variant or a pipeline-shorthand wrapper) is counted once. Pure analyses (classes that produce results consumed by other passes but perform no IR/MIR mutation themselves) are counted separately and are not part of the 35.

| Category | Count | Notes |
| --- | --- | --- |
| IR-level NVIDIA pass classes | 22 | 16 module + 9 function + 1 loop in the tables below, minus 4 parameterized/shorthand re-registrations |
| Machine-level NVIDIA pass classes | 13 | All 13 entries in the machine table are distinct classes |
| Total NVIDIA custom passes | 35 | Headline number |
| NVIDIA custom analyses | 2 | rpa, merge-sets; counted separately |
| Registration sites | -- | sub_2342890 (New PM) + sub_12E54A0 (pipeline assembler) |
| Dedicated deep-dive pages | 22 | One per major pass |
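
The headline arithmetic can be checked in a few lines. This is only a sanity check of the figures quoted in the table above; the variable names are illustrative and not taken from cicc.

```python
# Per-scope row counts as listed in the IR-level tables on this page.
IR_TABLE_ROWS = {"module": 16, "function": 9, "loop": 1}
IR_REREGISTRATIONS = 4   # parameterized/shorthand re-registrations
MACHINE_CLASSES = 13     # all machine-table entries are distinct classes
ANALYSES = 2             # rpa, merge-sets: counted separately, not added below

ir_classes = sum(IR_TABLE_ROWS.values()) - IR_REREGISTRATIONS
total_passes = ir_classes + MACHINE_CLASSES

print(ir_classes)    # 22
print(total_passes)  # 35
```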

Why 33 also appears. llvm/pipeline.md cites 33 when counting the StringMap registration entries inserted by sub_2342890 -- the line that mentions "12 module + 20 function + 1 loop" reflects raw entries, including parameterized variants and short pipeline aliases registered under separate names. Both numbers describe the same code; 35 counts unique pass classes, 33 counts registration-table rows.

QUIRK. The pipeline.md per-scope split ("12 module + 20 function + 1 loop") and this page's per-scope split ("16 module + 9 function + 1 loop") count different things. pipeline.md sums StringMap keys per scope (and many module-level passes also register a function-scope wrapper key, inflating the "function" bucket). The tables on this page count distinct pass classes by their primary scope -- the scope at which the pass's run() method actually mutates IR. Neither row-count is wrong; they answer different questions.
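
A minimal sketch of why the two per-scope splits differ. The registration rows, the class names, and the `<first-time>` key syntax below are hypothetical placeholders chosen for illustration; only the counting logic matters.

```python
from collections import Counter

# Hypothetical rows: (StringMap key, scope of the key, pass class, primary scope).
ROWS = [
    ("memory-space-opt", "module", "MemorySpaceOptPass", "module"),
    ("memory-space-opt<first-time>", "function", "MemorySpaceOptPass", "module"),
    ("remat", "function", "RematPass", "function"),
    ("loop-index-split", "loop", "LoopIndexSplitPass", "loop"),
]

def count_keys_per_scope(rows):
    """pipeline.md style: one tally per registered key, bucketed by key scope."""
    return Counter(scope for _key, scope, _cls, _primary in rows)

def count_classes_per_primary_scope(rows):
    """This page's style: one tally per distinct class, bucketed by primary scope."""
    primary_of = {cls: primary for _key, _scope, cls, primary in rows}
    return Counter(primary_of.values())

print(dict(count_keys_per_scope(ROWS)))
print(dict(count_classes_per_primary_scope(ROWS)))
```

In this toy input the function-scope wrapper key inflates the "function" bucket of the key count, while the class count attributes the pass to its module scope exactly once.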

QUIRK. Three IR-level pass names in the tables below (nvvm-pretreat, check-kernel-functions, check-gep-index) are verifier-shaped: they fail the build on invalid IR rather than transforming it. They are still counted in the canonical 35 because they are pass classes registered as transformations in sub_2342890, not as AnalysisInfoMixin analyses. The two true analyses (rpa, merge-sets) sit out.

IR-Level Module Passes

| Pass Name | Class / Function | Size | Description |
| --- | --- | --- | --- |
| memory-space-opt | sub_1C70910 / sub_1CA2920 | cluster | Resolves generic pointers to specific address spaces (global/shared/local/const). Warns on illegal ops: atomics on constant mem, wmma on wrong space. Parameterized: first-time, second-time, no-warnings, warnings |
| printf-lowering | sub_1CB1E60 | 6.6 KB native | Lowers printf to vprintf + local buffer. Validates format string is a literal. "vprintfBuffer.local", "bufIndexed" |
| nvvm-verify | sub_2C80C90 | ~37 KB native (three functions) | Three-layer NVVM IR verifier (module + function + intrinsic). Validates triples, address spaces, atomic restrictions, pointer cast rules, architecture-gated intrinsic availability |
| nvvm-pretreat | PretreatPass | | IR pre-treatment before optimization |
| check-kernel-functions | NVPTXSetFunctionLinkagesPass | | Kernel function linkage validation |
| check-gep-index | | | GEP index validation |
| cnp-launch-check | CNPLaunchCheckPass | | Cooperative launch validation |
| ipmsp | IPMSPPass / NVVMIPMemorySpacePropagationPass | | Inter-procedural memory space propagation. The full getTypeName() symbol is llvm::NVVMIPMemorySpacePropagationPass; see IPMSP |
| nv-early-inliner | | | NVIDIA early inlining pass |
| nv-inline-must | InlineMustPass | | Force-inline functions marked __forceinline__ |
| select-kernels | SelectKernelsPass | | Kernel selection for compilation |
| set-global-array-alignment | | | Parameterized: modify-shared-mem, skip-shared-mem, modify-global-mem, skip-global-mem |
| lower-aggr-copies | | 14 KB + 12 KB native | Lower aggregate copies: struct splitting, memmove unrolling. Param: lower-aggr-func-args |
| lower-struct-args | | | Lower structure arguments. Param: opt-byval |
| process-restrict | | | Process __restrict__ annotations. Param: propagate-only |
| lower-ops | LowerOpsPass | | Lower special operations. Includes FP128/I128 emulation via 48 __nv_* library calls |

IR-Level Function Passes

| Pass Name | Function | Size | Description |
| --- | --- | --- | --- |
| branch-dist | sub_1C47810 cluster | | Branch distribution optimization. Knobs: branch-dist-block-limit, branch-dist-func-limit, branch-dist-norm |
| nvvm-reflect | sub_1857160 | | Resolves __nvvm_reflect() calls to integer constants based on target SM and FTZ mode. Runs multiple times as inlining exposes new calls |
| nvvm-reflect-pp | | | NVVM reflect preprocessor |
| nvvm-intrinsic-lowering | sub_2C63FB0 | 12 KB native | Lowers llvm.nvvm.* intrinsics to standard LLVM IR. Two levels: 0 = basic, 1 = barrier-aware. Runs up to 10 times in mid pipeline |
| nvvm-peephole-optimizer | | | NVVM-specific peephole optimizations |
| remat | sub_1CE7DD0 | 13 KB native | IR-level rematerialization. Analyzes live-in/live-out register pressure per BB. Contains IV demotion sub-pass (12 KB native) |
| reuse-local-memory | | | Local memory reuse optimization |
| set-local-array-alignment | | | Set alignment for local arrays |
| sinking2 | | | NVIDIA-specific instruction sinking (distinct from LLVM's Sink pass) |
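
To make the nvvm-reflect row concrete, here is a minimal sketch of folding reflect queries into constants. It follows the behavior documented for upstream LLVM's NVVMReflect pass (the `__CUDA_FTZ` and `__CUDA_ARCH` queries, with unknown queries folding to 0); whether cicc's sub_1857160 recognizes exactly the same query set is an assumption, and the text-based rewriting is a toy stand-in for real IR manipulation.

```python
import re

# Matches calls of the form __nvvm_reflect("QUERY") in a toy textual program.
REFLECT_CALL = re.compile(r'__nvvm_reflect\("([^"]+)"\)')

def resolve_reflect(text, sm, ftz):
    """Replace each reflect call with an integer constant.

    sm  : target SM version, e.g. 80 for sm_80
    ftz : whether flush-to-zero mode is enabled
    """
    values = {"__CUDA_ARCH": sm * 10, "__CUDA_FTZ": int(ftz)}

    def fold(match):
        # Unknown queries fold to 0, as in upstream LLVM.
        return str(values.get(match.group(1), 0))

    return REFLECT_CALL.sub(fold, text)

print(resolve_reflect('if (__nvvm_reflect("__CUDA_FTZ")) fast(); else slow();', 80, True))
# if (1) fast(); else slow();
```

Once the call is a constant, ordinary constant folding and dead-code elimination can remove the untaken branch, which is why rerunning the pass after inlining is worthwhile.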

IR-Level Loop Pass

| Pass Name | Function | Size | Description |
| --- | --- | --- | --- |
| loop-index-split | sub_2CC5900 / sub_1C7B2C0 | 11 KB native each | Split loops on index conditions. NVIDIA-preserved pass (removed from upstream LLVM) |

Custom Analyses

| Analysis Name | Purpose |
| --- | --- |
| rpa | Register Pressure Analysis; feeds into scheduling and rematerialization decisions |
| merge-sets | Merge set computation; used by coalescing and allocation |

Machine-Level Passes

| Pass Name | Function | Pass ID | Size | Description |
| --- | --- | --- | --- | --- |
| Block Remat | sub_2186D90 | nvptx-remat-block | 9.5 KB native | Two-phase candidate selection + iterative "pull-in" for register pressure reduction. "Max-Live-Function(", "Really Final Pull-in:" |
| Machine Mem2Reg | sub_21F9920 | nvptx-mem2reg | | Promotes __local_depot stack objects back to registers post-regalloc |
| MRPA | sub_2E5A4E0 | machine-rpa | 9 KB native | Machine Register Pressure Analysis: incremental tracking, not in upstream LLVM |
| LDG Transform | sub_21F2780 | ldgxform | | Transforms global loads to ldg.* (texture cache) for read-only data |
| GenericToNVVM | sub_215DC20 | generic-to-nvvm | 218 bytes native (dispatcher) | Moves globals from generic to global address space |
| Alloca Hoisting | sub_21BC7D0 | alloca-hoisting | | Ensures all allocas are in entry block (PTX requirement) |
| Image Optimizer | sub_21BCF10 | | | Optimizes texture/surface access patterns |
| NVPTX Peephole | sub_21DB090 | nvptx-peephole | | NVPTX-specific peephole optimization |
| Prolog/Epilog | sub_21DB5F0 | | | Custom frame management (PTX has no traditional prolog/epilog) |
| Replace Image Handles | sub_21DBEA0 | | | Replaces IR-level image handles with PTX texture/surface references |
| Extra MI Printer | sub_21E9E80 | extra-machineinstr-printer | | Register pressure statistics reporting |
| Valid Global Names | sub_21BCD80 | nvptx-assign-valid-global-names | | Sanitizes global names to valid PTX identifiers |
| NVVMIntrRange | sub_216F4B0 | nvvm-intr-range | | Adds !range metadata to NVVM intrinsics (e.g., tid.x bounds) |
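
As an illustration of the "Valid Global Names" row, the sketch below sanitizes a symbol name the way upstream LLVM's NVPTXAssignValidGlobalNames is known to handle characters such as `.` and `@`, replacing each with `_$_`. That sub_21BCD80 uses the same replacement string and character set is an assumption, and `sanitize_ptx_name` is a hypothetical helper name.

```python
# Characters assumed to be rejected in PTX identifiers for this sketch.
INVALID_CHARS = ".@"

def sanitize_ptx_name(name):
    """Replace each invalid character with the marker "_$_"."""
    return "".join("_$_" if ch in INVALID_CHARS else ch for ch in name)

print(sanitize_ptx_name("foo.bar@1"))  # foo_$_bar_$_1
```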

Major Proprietary Subsystems

Dead Synchronization Elimination -- sub_2C84BA0

| Field | Value |
| --- | --- |
| Size | 13 KB native (sub_2C84BA0) |
| Purpose | Removes redundant __syncthreads() barriers |

Bidirectional fixed-point dataflow analysis across the CFG, tracking four memory access categories per BB through eight red-black tree maps. Each deletion triggers full restart. Distinct from lightweight basic-dbe. See dedicated page for full algorithm.
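
The "restart after each deletion" control flow can be sketched on a deliberately simplified model. This is not the cicc algorithm: it uses a straight-line instruction list instead of a CFG, a single "memory access seen" flag instead of four access categories, and a forward-only scan. The opcode strings and the redundancy rule (a barrier with no memory access since the previous barrier is dead) are assumptions made for illustration.

```python
# Toy opcodes that count as memory accesses in this sketch.
MEMORY_OPS = {"ld.shared", "st.shared", "ld.global", "st.global"}

def eliminate_dead_syncs(ops):
    """Remove barriers that guard no memory access, restarting after each deletion."""
    ops = list(ops)
    changed = True
    while changed:
        changed = False
        accessed = False  # any memory access since the last kept barrier?
        for i, op in enumerate(ops):
            if op == "sync":
                if not accessed:
                    del ops[i]
                    changed = True
                    break          # full restart: rescan from the top
                accessed = False   # barrier kept; start a new region
            elif op in MEMORY_OPS:
                accessed = True
    return ops

print(eliminate_dead_syncs(["st.shared", "sync", "sync", "add", "sync", "ld.shared"]))
# ['st.shared', 'sync', 'add', 'ld.shared']
```

Restarting from scratch is simple to reason about because every decision is made against up-to-date state, at the cost of rescanning after each removal.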

MemorySpaceOpt — Multi-Function Cluster

| Function | Size | Purpose |
| --- | --- | --- |
| sub_1C70910 | | Pass entry point |
| sub_1C6A6C0 | | Pass variant |
| sub_1CA2920 | 6.5 KB native | Address space resolution: "Cannot tell what pointer points to, assuming global memory space" |
| sub_1CA9E90 | 5.9 KB native | Secondary resolver |
| sub_1CA5350 | 8.8 KB native | Infrastructure |
| sub_2CBBE90 | 11 KB native | Memory-space-specialized function cloning |
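
A minimal sketch of the resolution decision, assuming a simple three-way rule: all known sources agree, sources conflict, or nothing is known. The numeric address spaces are the standard NVPTX ones; the fallback message is the diagnostic string quoted in the table above. The rule itself, including when the fallback fires and that conflicts stay generic, is an assumption rather than recovered behavior of sub_1CA2920.

```python
# Standard NVPTX address space numbers.
GENERIC, GLOBAL, SHARED, CONST, LOCAL = 0, 1, 3, 4, 5

FALLBACK_WARNING = "Cannot tell what pointer points to, assuming global memory space"

def resolve_space(source_spaces):
    """Pick an address space for a generic pointer.

    source_spaces: the spaces of the values the pointer may derive from,
                   with None for a source whose space is unknown.
    Returns (address_space, warning_or_None).
    """
    if not source_spaces or any(s is None for s in source_spaces):
        return GLOBAL, FALLBACK_WARNING      # assumed fallback condition
    distinct = set(source_spaces)
    if len(distinct) == 1:
        return distinct.pop(), None          # all sources agree
    return GENERIC, None                     # conflicting sources: stay generic

print(resolve_space([SHARED, SHARED]))  # (3, None)
print(resolve_space([SHARED, GLOBAL]))  # (0, None)
```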

NV Rematerialization Cluster

| Function | Size | Role |
| --- | --- | --- |
| sub_1CE7DD0 | 13 KB native | Main driver: live-in/live-out analysis, skip decisions |
| sub_1CE67D0 | 5.6 KB native | Block-level executor: "remat_", "uclone_" prefixes |
| sub_1CE3AF0 | 11 KB native | Pull-in cost analysis: "Total pull-in cost = %d" |

NLO — Simplify Live Output

| Function | Size | Strings |
| --- | --- | --- |
| sub_1CE10B0 | 9.2 KB native | "Simplify Live Output", "nloNewBit", "newBit" |
| sub_1CDC1F0 | 7.4 KB native | "nloNewAdd", "nloNewBit" |

Creates new add/bit operations to simplify live-out values at block boundaries.

IV Demotion -- sub_1CD74B0

| Field | Value |
| --- | --- |
| Size | 12 KB native (sub_1CD74B0) |
| Strings | "phiNode", "demoteIV", "newInit", "newInc", "argBaseIV", "newBaseIV", "iv_base_clone_", "substIV" |

Demotes induction variables (e.g., 64-bit to 32-bit), creates new base IVs, clones IV chains for register pressure reduction. Sub-pass of rematerialization. See dedicated page for full algorithm.
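
One plausible legality check for a 64-bit to 32-bit demotion is that every value the induction variable takes fits in a signed 32-bit integer. The sketch below assumes an affine IV with a known initial value, constant step, and known trip count; the function name and the condition are illustrative and not taken from sub_1CD74B0.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def can_demote_to_i32(init, step, trip_count):
    """True if an affine IV (init + k*step, 0 <= k <= trip_count) fits in i32."""
    last = init + step * trip_count   # value after the final increment
    lo, hi = min(init, last), max(init, last)
    return INT32_MIN <= lo and hi <= INT32_MAX

print(can_demote_to_i32(0, 4, 1000))   # True
print(can_demote_to_i32(0, 1, 2**31))  # False
```

Because the IV is monotonic, checking only the first and last values is sufficient in this model.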

RLMCAST — sub_2D13E90

| Field | Value |
| --- | --- |
| Size | 13 KB native (sub_2D13E90) |
| Purpose | Register-level multicast instruction lowering |

Broadcasts a value to multiple register destinations. Uses 216-byte and 160-byte node structures.

Texture Group Merge (.Tgm) — sub_2DDE8C0

Groups texture load operations to hide latency. Uses a .Tgm suffix in scheduling and a function pointer table (3 predicates) for grouping decisions.

NVVM Intrinsic Verifier -- sub_2C7B6A0

| Field | Value |
| --- | --- |
| Size | 22 KB native (sub_2C7B6A0) |
| Purpose | Validates ALL NVVM intrinsics against SM capabilities |

Architecture-gated validation for every intrinsic call. Part of the three-layer NVVM verifier (~37 KB native total across the three verifier functions).

NVVM Intrinsic Lowering -- sub_2C63FB0

| Field | Value |
| --- | --- |
| Size | 12 KB native (sub_2C63FB0) |
| Purpose | Lowers NVVM intrinsics to concrete operations |

Pattern-matching rewrite engine for llvm.nvvm.* intrinsics. Two levels (basic + barrier-aware), runs up to 10 times. See dedicated page for full dispatch table.

Base Address Strength Reduction -- sub_2CA4A10

| Field | Value |
| --- | --- |
| Size | 12 KB native (sub_2CA4A10) |
| Knobs | do-base-address-strength-reduce (two levels: 1 = no conditions, 2 = with conditions) |

Scans loop bodies for memory ops sharing a common base pointer, hoists the anchor computation, rewrites remaining addresses as (anchor + relative_offset). See dedicated page for the anchor selection algorithm.
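
The (anchor + relative_offset) rewrite can be sketched on a toy list of memory operations. The anchor choice here (smallest constant offset per base) is an assumption made to keep the example short; the actual anchor selection algorithm is described on the dedicated page, and `strength_reduce` is a hypothetical name.

```python
from collections import defaultdict

def strength_reduce(accesses):
    """Rewrite (base, offset) accesses as (base, anchor, relative_offset).

    accesses: list of (base_name, constant_offset) pairs from one loop body.
    Returns (anchors, rewritten), where anchors maps each base to the offset
    hoisted out of the loop as the anchor computation.
    """
    offsets_by_base = defaultdict(list)
    for base, offset in accesses:
        offsets_by_base[base].append(offset)

    # Assumed policy: the smallest offset becomes the anchor for that base.
    anchors = {base: min(offs) for base, offs in offsets_by_base.items()}

    rewritten = [(base, anchors[base], offset - anchors[base])
                 for base, offset in accesses]
    return anchors, rewritten

anchors, rewritten = strength_reduce([("A", 16), ("A", 24), ("B", 8), ("A", 32)])
print(anchors)    # {'A': 16, 'B': 8}
print(rewritten)  # [('A', 16, 0), ('A', 16, 8), ('B', 8, 0), ('A', 16, 16)]
```

After the rewrite, each access in the loop needs only a small relative offset from a value computed once, rather than a full address computation.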

Common Base Elimination -- sub_2CA8B00

| Field | Value |
| --- | --- |
| Size | 7 KB native (sub_2CA8B00) |
| Purpose | Hoists shared base address expressions to dominating CFG points |

Operates at inter-block level (vs BASR intra-loop). The two passes form a complementary pair for comprehensive GPU address computation reduction. See dedicated page.

CSSA Transformation -- sub_3720740

| Field | Value |
| --- | --- |
| Size | 3.5 KB native (sub_3720740) |
| Purpose | Conventional-SSA for GPU divergent control flow |
| Knobs | do-cssa, cssa-coalesce, cssa-verbosity, dump-before-cssa |
| Debug | "IR Module before CSSA" |

Rewrites PHI nodes to be safe under warp-divergent execution by inserting explicit copy instructions at reconvergence points. See dedicated page for the divergence model.
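
The copy-insertion idea can be sketched with the textbook conversion to conventional SSA: each PHI operand is replaced by a fresh copy made at the end of the corresponding predecessor block. This is a generic illustration on a toy IR, not the divergence-aware placement used by sub_3720740; the instruction syntax and the naming scheme for copies are hypothetical.

```python
def insert_phi_copies(blocks, phis):
    """Insert one copy per PHI operand and rewrite the PHIs to use the copies.

    blocks: {block_name: [instruction strings, terminator omitted]}
    phis:   list of (dest, {pred_block: incoming_value})
    Returns (new_blocks, new_phis); the inputs are left unmodified.
    """
    new_blocks = {name: list(instrs) for name, instrs in blocks.items()}
    new_phis = []
    for dest, incoming in phis:
        renamed = {}
        for pred, value in incoming.items():
            copy_name = f"{dest}.{pred}"
            # The copy goes at the end of the predecessor, before its terminator.
            new_blocks[pred].append(f"{copy_name} = copy {value}")
            renamed[pred] = copy_name
        new_phis.append((dest, renamed))
    return new_blocks, new_phis

blocks = {"then": ["a = add x, 1"], "else": ["b = mul x, 2"]}
phis = [("r", {"then": "a", "else": "b"})]
new_blocks, new_phis = insert_phi_copies(blocks, phis)
print(new_blocks["then"])  # ['a = add x, 1', 'r.then = copy a']
print(new_phis)            # [('r', {'then': 'r.then', 'else': 'r.else'})]
```

With every PHI operand defined by a dedicated copy, the PHI's operands no longer interfere with other live values, which is the property that makes later coalescing safe.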

NVIDIA Codegen Knobs — sub_1C20170

70+ knobs parsed from the NVVM container format:

Graphics Pipeline

VSIsVREnabled, VSIsLastVTGStage, EnableZeroCoverageKill, AllowComputeDerivatives, AllowDerivatives, EnableNonUniformQuadDerivatives, UsePIXBAR, ManageAPICallDepth

Compute / Memory

DisableSAMRAM, DoMMACoalescing, DisablePartialHalfVectorWrites, AssumeConvertMemoryToRegProfitable, MSTSForceOneCTAPerSMForSmemEmu, AddDepFromGlobalMembarToCB

Register Allocation / Scheduling

AdvancedRemat, CSSACoalescing, DisablePredication, DisableXBlockSched, ReorderCSE, ScheduleKils, NumNopsAtStart, DisableERRBARAfterMEMBAR

Type Promotion

PromoteHalf, PromoteFixed, FP16Mode, IgnoreRndFtzOnF32F16Conv, DisableLegalizeIntegers

PGO

PGOProfileKind, PGOEpoch, PGOBatchSize, PGOCounterMemBaseVAIndex

Knob Forwarding

OCGKnobs, OCGKnobsFile, NVVMKnobsString, OmegaKnobs, FinalizerKnobs

Compile Modes — sub_1C21CE0

| Mode | Constant |
| --- | --- |
| Whole-program no-ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_NOABI |
| Whole-program ABI | NVVM_COMPILE_MODE_WHOLE_PROGRAM_ABI |
| Separate ABI | NVVM_COMPILE_MODE_SEPARATE_ABI |
| Extensible WP ABI | NVVM_COMPILE_MODE_EXTENSIBLE_WHOLE_PROGRAM_ABI |

| Opt Level | Constant |
| --- | --- |
| None | NVVM_OPT_LEVEL_NONE |
| 1 | NVVM_OPT_LEVEL_1 |
| 2 | NVVM_OPT_LEVEL_2 |
| 3 | NVVM_OPT_LEVEL_3 |

| Debug Info | Constant |
| --- | --- |
| None | NVVM_DEBUG_INFO_NONE |
| Line info | NVVM_DEBUG_INFO_LINE_INFO |
| Full DWARF | NVVM_DEBUG_INFO_DWARF |