NVPTXISD SelectionDAG Opcodes
NVIDIA-private target enum. These opcode values exist only between SelectionDAG lowering and instruction selection. They are not part of upstream LLVM's public
ISD::enum and are erased by the time MachineInstrs reach the AsmPrinter.Upstream source: Enum declared in
llvm/lib/Target/NVPTX/NVPTXISelLowering.hasnamespace NVPTXISD { enum NodeType : unsigned { ... } }. Lowering producers live inNVPTXISelLowering.cpp; consumers live inNVPTXISelDAGToDAG.cppand the TableGen-generated pattern matcher.Source of truth in this binary: The 460 enumerator names below were recovered from
cicc_strings.json(the assertion/diagnostic strings the constructors pass toSDNode::getOperationName). The numeric values are not directly observable as enum literals -- they appear baked into the master opcode dispatch insub_35F6D40(6,634-case switch overSDNode::getOpcode()) and in the LowerOperation cluster atsub_32E3060. Cross-references tocicc_strings.json,cicc_switches.json, and the LowerOperation/ISel function map were used to assign opcodes to families.
CICC v13.0 enumerates exactly 460 distinct NVPTXISD::* SelectionDAG opcodes -- target-specific node types that exist transiently between operation legalization (where LowerOperation produces them) and pattern-based instruction selection (where the TableGen-generated matcher consumes them and emits NVPTX MachineInstr opcodes). Each opcode represents either a PTX construct that has no clean spelling in target-independent ISD::* (the .param-space calling convention, texture/surface fetches, funnel shifts with clamping, bitfield extract/insert) or an internal pseudo that survives only long enough to glue chains, prototypes, and call sequences together. The opcode count is roughly 15x larger than upstream LLVM's NVPTX target, which carries around 30 NVPTXISD nodes; CICC's expansion is driven mostly by the texture/surface family (372 of the 460 opcodes -- 81%) plus the SM90+ load/store extension variants and the four call-flavor matrix.
This page catalogs every recovered opcode grouped by family. The numeric values shown in inline tables are reconstructions from the sub_32E3060 LowerOperation dispatch and the sub_35F6D40 master switch; treat them as MED confidence unless explicitly noted otherwise. See SelectionDAG for how these nodes flow through the pipeline, ISel Pattern Matching for how they are matched, and NVPTX Machine Opcode Reference for the MachineInstr opcodes they lower to.
Family Breakdown
| Family | Count | % of total | Producer (LowerOperation cluster) | Consumer (ISel) |
|---|---|---|---|---|
Texture (Tex*, Tld4*) | 174 | 37.8% | sub_32B8A20 (NVVM tex/surf lowering, 71KB) | sub_3090F90 ISel + TableGen patterns |
Surface (Suld*) | 198 | 43.0% | sub_32B8A20 | sub_3090F90 ISel + TableGen patterns |
| Call / Frame / Param | 29 | 6.3% | sub_3040BF0 (LowerCall, 88KB) | sub_3349730 formal args / sub_332FEA0 calls |
| Load family | 18 | 3.9% | sub_32D2680 (load/store lowering, 81KB) | TableGen ld.* patterns |
| Store family | 17 | 3.7% | sub_32D2680 | TableGen st.* patterns |
| Math (BFE/BFI/IMAD/DP*/PRMT/SETP*/MUL_WIDE) | 10 | 2.2% | sub_32983B0 (integer/FP legalization, 79KB) | TableGen integer-math patterns |
Funnel shift (FSH*, FUN_SHF*) | 4 | 0.9% | sub_32983B0 | TableGen shf.l/shf.r patterns |
Brx* (branch index table) | 3 | 0.7% | sub_32BE8D0 (conditional/select, 54KB) | brx.idx emitter |
Vector reshape (BUILD_VECTOR, UNPACK_VECTOR) | 2 | 0.4% | sub_32E3060 BUILD_VECTOR path | TableGen mov.b{32,64,128} patterns |
Misc pseudo (Dummy, Wrapper, ProxyReg, STACKSAVE/RESTORE, DYNAMIC_STACKALLOC, FCOPYSIGN) | 7 | 1.5% | various | various |
| Total | 460 | 100.0% | -- | -- |
How Opcode Names Reach the Binary
Every NVPTXISD opcode has a constructor in NVPTXTargetLowering that calls DAG.getNode(NVPTXISD::Foo, dl, VT, ...). The numeric value of NVPTXISD::Foo is a contiguous integer assigned by the C++ enum, starting at ISD::BUILTIN_OP_END (value 499 in LLVM 20, confirmed at sub_33D4EF0 line 11 of the decompilation -- "above 499 delegate to TargetLowering"). The symbolic name survives in the stripped binary only because NVPTXTargetLowering::getTargetNodeName(unsigned Opcode) contains a switch that maps each enumerator to a string literal; those literals are what cicc_strings.json captures. The string is consulted by SelectionDAG::print(), diagnostic emission in LegalizeOp, and any assert that mentions the opcode.
The full enumeration of strings was extracted via:
jq -r '.[] | select(.value | test("^NVPTXISD::")) | .value' \
cicc_strings.json | sort -u
Yields exactly 460 lines.
Call / Frame / Param Family (29 opcodes)
These opcodes implement the PTX .param-space calling convention. They are emitted exclusively by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and LowerFormalArguments/LowerReturn, then matched 1:1 in ISel to pseudo MachineInstrs in the 505--573 opcode range documented in NVPTX Machine Opcodes. Every CUDA device function call expands into a sequence built from these nodes; see the SelectionDAG Call Sequence DAG Structure for the canonical shape.
| NVPTXISD name | Role | Notes |
|---|---|---|
CALL | Top-level call node (legacy/generic) | Rarely produced directly; most calls use the four flavor opcodes below. |
CallArg | Single .param argument slot | Emitted per scalar/aggregate argument. |
CallArgBegin | Marks start of argument list | Glue-only chain node. |
CallArgEnd | Marks end of argument list | Glue-only chain node. |
LastCallArg | Tags the final CallArg in a sequence | Lets the matcher recognize the last argument without counting. |
CallPrototype | Attaches a parsed prototype string | Holds the .callprototype directive operand. |
CallSeqBegin | Outer call-frame setup (opcode 315) | Maps to ISD-level CALLSEQ_START. |
CallSeqEnd | Outer call-frame teardown (opcode 316) | Maps to ISD-level CALLSEQ_END. |
CallSuspend | Suspend point inside a call | Used when the callee is a coroutine handle or __syncthreads boundary; see QUIRK below. |
CallSymbol | Direct callee by symbol | Carries the MCSymbol for a named device function. |
CallVal | Call returning a scalar value | Discriminates from the void variant for return-glue handling. |
CallVoid | Call returning no value | Skips the LoadRetParam chain. |
PrintCall | Emits call PTX directive | Non-uniform variant. |
PrintCallUni | Emits call.uni PTX directive | Uniform-control-flow variant; matched when the call is provably warp-uniform. |
PrintConvergentCall | call with convergent semantics | Forces barrier-like ordering. |
PrintConvergentCallUni | call.uni with convergent semantics | Combination of both. |
Prototype | Standalone .callprototype declaration | Emitted at function boundaries. |
SuspendPrototype | Suspended prototype (forward reference) | Used when a call-site sees a prototype defined later in the module. |
DeclareParam | .param .align N .b8 _param_X[size] byval aggregate | Opcode 505 at MachineInstr level. |
DeclareScalarParam | Scalar .param with width+align | Opcode 506. |
DeclareRet | .param slot for return value | For non-scalar returns. |
DeclareRetParam | Combined return + param declaration | Helper for the dispatcher; collapses into separate Declare* nodes during selection. |
DeclareScalarRet | Scalar return declaration | Opcode 508. |
LoadParam | ld.param.bN of scalar argument | Reads from .param space inside the callee. |
LOAD_PARAM | Alias / older spelling | Both names appear in the binary; see QUIRK. |
LoadParamV2 | ld.param.v2.bN (2-element vector) | |
LoadParamV4 | ld.param.v4.bN (4-element vector) | |
MoveParam | mov between two .param slots | Used for register-to-param shuffles in the prologue. |
PseudoUseParam | Liveness anchor for unused params | Prevents DCE from deleting declared-but-unused .param slots. |
ProxyReg | Proxy node for cross-block register liveness | Lowers to a NOP mov; preserves SSA correctness across DAG boundaries. |
RET_FLAG | LLVM-style return-with-flag glue | |
RET_GLUE | Renamed RET_FLAG in newer LLVM versions | See QUIRK below -- both spellings coexist. |
RETURN | Function epilogue marker | Opcode 569 at MachineInstr level. |
High-impact opcode: DeclareScalarParam lowering
When LowerCall sees a scalar argument narrower than 64 bits, it emits a DeclareScalarParam node carrying (size_in_bits, alignment, param_index). The C-pseudo for the producer:
// In NVPTXTargetLowering::LowerCall, around the per-argument loop in sub_3040BF0
SDValue DeclareScalarParam(SDValue Chain, unsigned ParamIdx,
unsigned SizeBits, unsigned Align) {
SDValue Ops[] = {
Chain,
DAG.getConstant(ParamIdx, dl, MVT::i32),
DAG.getConstant(SizeBits, dl, MVT::i32),
DAG.getConstant(Align, dl, MVT::i32),
InGlue,
};
SDVTList VTs = DAG.getVTList(MVT::Other, MVT::Glue);
return DAG.getNode(NVPTXISD::DeclareScalarParam, dl, VTs, Ops);
}
The consumer (sub_3090F90 in ISel) matches this with a single TableGen pattern that emits the PTX directive .param .align <align> .b<size> _param_<idx>; directly via MachineInstr opcode 506 with three immediate operands.
High-impact opcode: CallSeqBegin/CallSeqEnd paired counter
sub_3040BF0 maintains a monotonic per-function call counter at NVPTXTargetLowering + 537024 (offset 134256 * 4). Each CallSeqBegin reads-and-increments this counter:
// Roughly mirrors the prologue of LowerCall in sub_3040BF0
uint32_t seq_id = *(uint32_t*)(TLI + 537024);
*(uint32_t*)(TLI + 537024) = seq_id + 1;
SDValue Begin = DAG.getNode(NVPTXISD::CallSeqBegin, dl,
DAG.getVTList(MVT::Other, MVT::Glue),
{ Chain, DAG.getConstant(seq_id, dl, MVT::i32),
DAG.getConstant(0, dl, MVT::i32) });
// ... emit DeclareParam / StoreParam / Call ...
SDValue End = DAG.getNode(NVPTXISD::CallSeqEnd, dl,
DAG.getVTList(MVT::Other, MVT::Glue),
{ LastChain, DAG.getConstant(seq_id, dl, MVT::i32) });
The seq_id is what makes .param names unique within a function -- without it, two inlined calls would clash on _param_0.
⚡ QUIRK —
RET_FLAGandRET_GLUEboth present The strings table contains bothNVPTXISD::RET_FLAGandNVPTXISD::RET_GLUE. Upstream LLVM renamedISD::RET_FLAGtoISD::RET_GLUEin commite80b2b54(2022) as part of a global rename of "flag" to "glue" for the auxiliary chain edge. CICC v13.0 carries the diagnostic string for both spellings, suggesting either (a) the binary was built from a tree partway through the rename with bothcaselabels emitting different strings, or (b) NVIDIA retainedRET_FLAGas an alias for older internal patterns and addedRET_GLUEfor compatibility with upstream merges. Both opcodes ultimately lower to PTXret;-- the distinction is purely cosmetic at the asm-print level but matters when matching internal patterns that hard-coded the older name.
⚡ QUIRK —
LoadParamvsLOAD_PARAMTwo near-identical strings appear:NVPTXISD::LoadParam(CamelCase) andNVPTXISD::LOAD_PARAM(SCREAMING_SNAKE_CASE). TheLoadParamfamily also hasV2/V4vector variants in CamelCase, whileLOAD_PARAMis a singleton. This is a leftover from an internal NVIDIA rename: the singletonLOAD_PARAMis the old single-element form retained for legacy patterns, whileLoadParam/LoadParamV2/LoadParamV4are the modern triplet. The behavioral overlap means LLVM ends up matchingLOAD_PARAMagainstLoadParampatterns in some legalization paths -- mostly harmless but a foot-gun if you try to add a new ISel pattern.
Load Family (18 opcodes)
The load opcodes split into four sub-groups: vectorized generic loads (LoadV2/V4/V8), uncoalesced global loads (LDGV2/LDGV4, LDUV2/LDUV4), extending loads (LoadExt*), and .param-space loads (which were listed in the Call family above). All are produced by sub_32D2680 (load/store lowering, 81KB) and feed the load/store vectorization pass discussed in SelectionDAG: Load/Store Legalization.
| NVPTXISD name | Vec width | Maps to PTX | Notes |
|---|---|---|---|
LoadV2 | 2 | ld.{space}.v2.b{8,16,32,64} | Pair load from generic/global/shared/local. |
LoadV4 | 4 | ld.{space}.v4.b{8,16,32} | Quad load; b64 not supported as v4. |
LoadV8 | 8 | ld.global.v8.b32 | SM90+ wide global load (Hopper TMA-adjacent). |
LDGV2 | 2 | ld.global.nc.v2.b* | Non-coherent (read-only cache) pair load. |
LDGV4 | 4 | ld.global.nc.v4.b* | Non-coherent quad load. |
LDUV2 | 2 | ldu.global.v2.b* | Uniform load (warp-invariant). |
LDUV4 | 4 | ldu.global.v4.b* | Uniform quad load. |
LoadExt | 1 | ld.*.b{N}; cvt | Extending load (zext/sext into a wider register). |
LoadExtV2 | 2 | ld.*.v2.b{N}; cvt | Vector extending load. |
LoadExtV4 | 4 | ld.*.v4.b{N}; cvt | |
LoadExtVer2 | 1 | (see QUIRK) | Version-2 spelling of the same opcode. |
LoadExtVer2V2 | 2 | ||
LoadExtVer2V4 | 4 | ||
LoadParam | 1 | ld.param.b* | Listed in Call family; reproduced here for completeness. |
LOAD_PARAM | 1 | (alias) | |
LoadParamV2 | 2 | ld.param.v2.b* | |
LoadParamV4 | 4 | ld.param.v4.b* | |
MoveParam | 1 | mov ... | Listed in Call family. |
High-impact opcode: LoadV4 lowering
The load/store vectorization combine in sub_32D2680 scans contiguous LOAD nodes by memory offset; when four loads at offsets 0, k, 2k, 3k for k == elementBytes are found, they are coalesced:
// In NVPTXTargetLowering::ReplaceLoadVector, called from PerformDAGCombine
SDValue lowerLoadVector(SDValue Op, SelectionDAG &DAG) {
LoadSDNode *Ld = cast<LoadSDNode>(Op);
EVT VT = Op.getValueType(); // e.g., v4i32
EVT EltVT = VT.getVectorElementType();
SDValue Ops[] = {
Ld->getChain(),
Ld->getBasePtr(),
};
// 5 result VTs: 4 elements + chain
SDVTList VTs = DAG.getVTList({ EltVT, EltVT, EltVT, EltVT, MVT::Other });
SDValue NewLd = DAG.getMemIntrinsicNode(
NVPTXISD::LoadV4, dl, VTs, Ops, VT,
Ld->getMemOperand());
// Reconstruct vector from the 4 scalar results.
SDValue Vec = DAG.getBuildVector(VT, dl,
{ NewLd.getValue(0), NewLd.getValue(1),
NewLd.getValue(2), NewLd.getValue(3) });
return DAG.getMergeValues({ Vec, NewLd.getValue(4) }, dl);
}
The matcher in sub_3090F90 recognizes the LoadV4 node, checks its memory operand's address space, and emits the appropriate PTX opcode (ld.global.v4.b32 for AS 1, ld.shared.v4.b32 for AS 3, etc.).
⚡ QUIRK —
LoadExtVer2*parallel triplet The strings table contains bothLoadExt/LoadExtV2/LoadExtV4andLoadExtVer2/LoadExtVer2V2/LoadExtVer2V4-- two complete triplets of extending load opcodes. The "Ver2" suffix indicates these are a parallel implementation introduced for SM90+ where the extension semantics changed (likely related to theld.global.nc.L1::no_allocatecache hint variants added in PTX ISA 7.5). The matcher selects between the two triplets based on subtarget feature flags; older code paths continue to emit the originalLoadExt*form. This is one of the few cases where the binary preserves two complete generations of the same operation rather than gating with a sub-feature.
Store Family (17 opcodes)
Mirror of the load family, with the addition of StoreParam/StoreRetval for the call ABI. Produced by sub_32D2680 and sub_3040BF0 (the latter for the param/retval variants), matched in ISel to opcodes 571--573 for the generic stores and to dedicated MachineInstr pseudos for the param/retval forms.
| NVPTXISD name | Maps to PTX | Notes |
|---|---|---|
StoreV2 | st.{space}.v2.b{8,16,32,64} | Generic pair store; MachineInstr opcode 572. |
StoreV4 | st.{space}.v4.b{8,16,32} | Quad store; opcode 573. |
StoreV8 | st.global.v8.b32 | SM90+ wide global store. |
StoreExt | cvt; st.* | Truncating store. |
StoreExtV2 | cvt; st.*.v2.* | |
StoreExtV4 | cvt; st.*.v4.* | |
StoreExtVer2 | (see QUIRK in Load section) | |
StoreExtVer2V2 | ||
StoreExtVer2V4 | ||
StoreParam | st.param.b* | Single scalar arg store; MachineInstr 571. |
StoreParamS32 | st.param.s32 | Signed-explicit variant for return-value typing. |
StoreParamU32 | st.param.u32 | Unsigned-explicit variant. |
StoreParamV2 | st.param.v2.b* | |
StoreParamV4 | st.param.v4.b* | |
StoreRetval | st.param.b* (return slot) | Stores the return value into the caller-visible .param slot. |
StoreRetvalV2 | st.param.v2.b* | |
StoreRetvalV4 | st.param.v4.b* |
High-impact opcode: StoreRetval lowering
LowerReturn walks the return values and emits one StoreRetval per element:
// In NVPTXTargetLowering::LowerReturn (companion to sub_3040BF0)
SDValue Chain = ...;
for (unsigned i = 0; i < RetVals.size(); ++i) {
SDValue Val = RetVals[i];
SDValue Off = DAG.getConstant(RetOffsets[i], dl, MVT::i32);
SDValue Ops[] = { Chain, Val, Off };
Chain = DAG.getNode(
NVPTXISD::StoreRetval, dl,
DAG.getVTList(MVT::Other),
Ops);
}
return DAG.getNode(NVPTXISD::RET_GLUE, dl, MVT::Other, Chain);
The matcher recognizes the offset operand, selects the appropriate st.param.b{8,16,32,64} PTX form, and chains all the stores before the final ret;.
⚡ QUIRK —
StoreParamS32/StoreParamU32exist butStoreParamS{8,16,64}do not The binary contains the signed/unsigned-discriminatedStoreParamS32andStoreParamU32opcodes, but noStoreParamS8,StoreParamS16,StoreParamU8,StoreParamU16, orStoreParamS64/StoreParamU64siblings. The reason is that PTX.paramstorage is always at least 32 bits wide (per the calling convention's scalar widening rules -- see SelectionDAG: Scalar Widening Rules), so anything narrower than 32 bits is widened by the lowering before the store is emitted. The signedness discrimination at exactly 32 bits exists to feed the PTX type system, which distinguishes.s32and.u32for some optimization decisions in downstreamptxas. 64-bit args use onlyStoreParam(no signed/unsigned split) because by that width, ptxas treats.s64and.u64identically for all.parampurposes.
Math Family (10 opcodes)
NVPTX-specific math nodes that don't have a clean upstream ISD::* equivalent. All produced by sub_32983B0 (integer/FP legalization, 79KB) or directly by the intrinsic lowering switch sub_33B0210 for the builtin entries.
| NVPTXISD name | PTX equivalent | Purpose |
|---|---|---|
BFE | bfe.{s,u}{32,64} | Bitfield extract: (x >> offset) & ((1 << len) - 1) with sign/zero extension. |
BFI | bfi.b{32,64} | Bitfield insert: replace len bits at offset in dst with low bits of src. |
IMAD | mad.{lo,hi}.{s,u}{32,64} | Integer multiply-add as a fused operation. |
DP2A | dp2a.{lo,hi}.{s,u}32.{s,u}32 | 2-way 8-bit dot product with 32-bit accumulator (SM61+). |
DP4A | dp4a.{s,u}32.{s,u}32 | 4-way 8-bit dot product (SM61+). |
MUL_WIDE_SIGNED | mul.wide.s32 | 32x32 -> 64 widening signed multiply. |
MUL_WIDE_UNSIGNED | mul.wide.u32 | 32x32 -> 64 widening unsigned multiply. |
PRMT | prmt.b32 | Permute bytes within two 32-bit registers (a 32-element 4-bit selection mask). |
SETP_F16X2 | setp.{cmp}.f16x2 | Predicate comparison of two half packed in 32-bit (SM53+). |
SETP_BF16X2 | setp.{cmp}.bf16x2 | Predicate comparison of two bfloat16 packed in 32-bit (SM80+). |
High-impact opcode: BFE lowering
BFE is matched from an idiomatic (x >> offset) & mask pattern during DAG combine:
// In NVPTXTargetLowering::PerformDAGCombine, AND case in sub_33C0CA0
SDValue tryFormBFE(SDValue N, SelectionDAG &DAG) {
// Match: AND(SHR(x, ConstOff), Mask) where Mask = (1 << Len) - 1
if (N.getOpcode() != ISD::AND) return SDValue();
SDValue Sh = N.getOperand(0);
ConstantSDNode *MaskC = dyn_cast<ConstantSDNode>(N.getOperand(1));
if (!MaskC || (Sh.getOpcode() != ISD::SRL && Sh.getOpcode() != ISD::SRA))
return SDValue();
ConstantSDNode *OffC = dyn_cast<ConstantSDNode>(Sh.getOperand(1));
if (!OffC) return SDValue();
uint64_t mask = MaskC->getZExtValue();
if (!isMask_64(mask)) return SDValue();
unsigned Len = popcount(mask);
unsigned Off = OffC->getZExtValue();
bool Signed = (Sh.getOpcode() == ISD::SRA);
SDValue Ops[] = {
Sh.getOperand(0),
DAG.getTargetConstant(Off, dl, MVT::i32),
DAG.getTargetConstant(Len, dl, MVT::i32),
DAG.getTargetConstant(Signed, dl, MVT::i1),
};
return DAG.getNode(NVPTXISD::BFE, dl, N.getValueType(), Ops);
}
ISel then matches BFE against the bfe.{s,u}{32,64} PTX instructions in a single TableGen pattern.
Funnel Shift Family (4 opcodes)
Funnel shifts concatenate two registers, shift the combined 2N-bit value, and return the high or low N bits. PTX exposes them as shf.l.clamp and shf.r.clamp; the .clamp variant saturates the shift amount at the operand width instead of wrapping mod-N.
| NVPTXISD name | PTX equivalent | Direction |
|---|---|---|
FSHL_CLAMP | shf.l.clamp.b32 d, a, b, c | Left funnel with clamped shift amount. |
FSHR_CLAMP | shf.r.clamp.b32 d, a, b, c | Right funnel with clamped shift amount. |
FUN_SHFL_CLAMP | shf.l.clamp.b32 (legacy) | Older alias for FSHL_CLAMP. |
FUN_SHFR_CLAMP | shf.r.clamp.b32 (legacy) | Older alias for FSHR_CLAMP. |
⚡ QUIRK —
FSHL_CLAMP/FUN_SHFL_CLAMPare functional duplicates The binary carries two complete spellings of the funnel-shift opcode: the modernFSHL_CLAMP/FSHR_CLAMPand the legacyFUN_SHFL_CLAMP/FUN_SHFR_CLAMP. The modern pair matches LLVM's target-independentISD::FSHL/ISD::FSHR(introduced in LLVM 9) with NVPTX's clamping semantics tacked on. The legacy pair was the original NVPTX-only spelling from before LLVM had upstream funnel shifts. Both lower to the sameshf.{l,r}.clamp.b32PTX instruction. The legacy spelling is still referenced in pattern fragments that haven't been updated, so removing it would break a handful of TableGen patterns for__funnelshift_l/__funnelshift_rintrinsics that bypass the standard ISD path.
Branch Index Table Family (3 opcodes)
Brx* opcodes implement brx.idx, NVPTX's indirect-branch-via-table instruction used by LLVM for jump-table-style switch lowering on SM30+.
| NVPTXISD name | Role |
|---|---|
BrxStart | Marks the start of a branch index table (carries the table label). |
BrxItem | One label entry inside the table. |
BrxEnd | Marks the end of the table; the actual brx.idx instruction is emitted at this node. |
The three opcodes are always emitted as an inseparable triple by sub_32BE8D0 (conditional/select lowering, 54KB). ISel collapses the triple into a single brx.idx instruction plus a private .const table.
Vector Reshape Family (2 opcodes)
| NVPTXISD name | Role |
|---|---|
BUILD_VECTOR | NVPTX-specific build-vector node (distinct from the generic ISD::BUILD_VECTOR). |
UNPACK_VECTOR | Splits a packed register (e.g., b32 holding two halfs) into its scalar components. |
The NVPTX-specific BUILD_VECTOR exists because PTX has no native vector construction -- the lowering at sub_32E3060 produces this opcode after splatting or pairwise packing, and ISel matches it to either mov.b{32,64,128} for the splat case or to a sequence of cvt.pack + mov for heterogeneous values. See SelectionDAG: BUILD_VECTOR Lowering.
Misc / Pseudo Family (7 opcodes)
| NVPTXISD name | Role |
|---|---|
Dummy | Placeholder used during lowering when a node needs an opcode but isn't yet decided. |
Wrapper | Wraps a constant pool / global address reference; lowers to mov.u{32,64} of the symbol. |
STACKSAVE | Saves current stack pointer (NVPTX local memory). |
STACKRESTORE | Restores saved stack pointer. |
DYNAMIC_STACKALLOC | alloca with non-constant size; emits mov.u64 %sp, ... sequence. |
FCOPYSIGN | Floating-point copysign (copysign(a, b)); maps to copysign.{f32,f64} PTX. |
ProxyReg | Cross-block register liveness anchor (listed in Call family). |
High-impact opcode: Wrapper lowering
Every global address reference passes through Wrapper:
// In NVPTXTargetLowering::LowerGlobalAddress (called from sub_331F6A0)
SDValue LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) {
GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(Op);
SDValue Sym = DAG.getTargetGlobalAddress(GA->getGlobal(), dl,
Op.getValueType(),
GA->getOffset());
return DAG.getNode(NVPTXISD::Wrapper, dl, Op.getValueType(), Sym);
}
ISel matches Wrapper and emits mov.u64 %rd_target, symbol_name; (or .u32 for AS-5 pointers on 32-bit targets).
Texture Family (174 opcodes)
The texture opcodes parameterize over four dimensions:
- Texture geometry:
1D,1DArray,2D,2DArray,3D,Cube,CubeArray. - Sampler mode: regular vs.
Unified. Unified samplers (TexUnified*) bind the texture and sampler into a single 64-bit handle; non-unified versions take separate texture and sampler operands. SM70+ uses Unified exclusively in user-level code. - Result type:
Float,S32(signed integer),U32(unsigned integer). - Coordinate type:
FloatorS32. - Sampling mode suffix: bare (regular sample),
Grad(with explicit derivatives),Level(with explicit LOD).
This gives a combinatorial matrix. The 174 entries break down as:
| Geometry | Non-unified | Unified | Total |
|---|---|---|---|
1D | 12 | 12 | 24 |
1DArray | 12 | 12 | 24 |
2D | 12 | 12 | 24 |
2DArray | 12 | 12 | 24 |
3D | 12 | 12 | 24 |
Cube | 6 | 9 | 15 |
CubeArray | 6 | 9 | 15 |
Tld4* (gather) | 12 | 12 | 24 |
| Total | 84 | 90 | 174 |
The 12-entry pattern per non-cube geometry is: {Float,S32,U32} x {Float,S32} x {bare,Grad,Level} minus a few unavailable combinations. Cubemap geometry omits Grad for non-unified (PTX doesn't expose tex.grad.cube outside the unified path).
Naming scheme
NVPTXISD::Tex<Unified?><Geometry><ResultTy><CoordTy><Sampling?>
Examples:
| Name | Decoded meaning |
|---|---|
Tex1DFloatFloat | Non-unified 1D, float result, float coord, regular sample |
TexUnified2DArrayU32FloatGrad | Unified 2D array, u32 result, float coord, with gradients |
Tex3DS32FloatLevel | Non-unified 3D, s32 result, float coord, with explicit LOD |
TexCubeFloatFloat | Non-unified cubemap, float result, float coord (no Grad available) |
Tld4UnifiedR2DFloatFloat | tld4 gather, unified, R channel, 2D, float result/coord |
Tld4 sub-family
Tld4* opcodes implement PTX tld4 (gather-4) instructions that fetch four texels in a 2x2 footprint. The 24 Tld4* opcodes cover the matrix of {A,B,G,R} channels x {Float,S64,U64} result x {non-unified, Unified} for 2D textures only. Higher-D tld4 doesn't exist in the PTX ISA.
| Channel suffix | Returns |
|---|---|
A | Alpha channel of the 2x2 footprint |
B | Blue channel |
G | Green channel |
R | Red channel |
High-impact opcode: TexUnified2DFloatFloat lowering
The intrinsic lowering switch sub_33B0210 matches nvvm_tex_unified_2d_v4f32_f32 and emits:
// In NVPTXTargetLowering::lowerTexIntrinsic, called from sub_33B0210
SDValue Ops[] = {
Chain,
TexHandle, // 64-bit texture/sampler handle
CoordX, // f32
CoordY, // f32
};
SDVTList VTs = DAG.getVTList(
{ MVT::f32, MVT::f32, MVT::f32, MVT::f32, // 4 result components
MVT::Other }); // chain
SDValue Tex = DAG.getNode(
NVPTXISD::TexUnified2DFloatFloat, dl, VTs, Ops);
// Pack the 4 scalar results into a v4f32 via BUILD_VECTOR.
return DAG.getBuildVector(MVT::v4f32, dl,
{ Tex.getValue(0), Tex.getValue(1),
Tex.getValue(2), Tex.getValue(3) });
ISel matches TexUnified2DFloatFloat against a TableGen pattern that emits PTX tex.unified.2d.v4.f32.f32 {rd0, rd1, rd2, rd3}, [handle, {x, y}];.
⚡ QUIRK — Cubemap geometries skip Grad in the non-unified family but include it in the Unified family
TexCubeArrayU32FloatandTexCubeArrayU32FloatLevelexist in the non-unified set (6 opcodes per cube geometry:{Float,S32,U32} x {bare,Level}), but there is noTexCubeArrayU32FloatGrad. The Unified family does includeTexUnifiedCubeArrayU32FloatGrad(9 opcodes per cube geometry:{Float,S32,U32} x {bare,Grad,Level}). This asymmetry reflects an actual PTX ISA limitation: pre-SM30 cubemap fetches lacked atex.grad.cubevariant, so the non-unified opcodes (which model the pre-SM30 calling convention) cannot express it. The unified path was added at SM30 with the gradient variant included from day one.
Surface Family (198 opcodes)
The Suld* opcodes implement surface load (suld) instructions across the matrix of:
- Geometry:
1D,1DArray,1DBuffer,2D,2DArray,3D. (No cubemap surface; CUDA surfaces don't support cubemaps.) - Vector width: scalar (
I*),V2(2-element),V4(4-element). - Element type:
I8,I16,I32,I64. - Boundary mode:
Clamp,Trap,Zero-- what to do when accessing out-of-bounds texels.
33 opcodes/geometry x 6 geometries = 198. Per-geometry the count is (4 element types x 3 boundary modes) x 3 widths - omissions. The actual per-geometry layout is:
| Width | Element types covered | Per-mode count | x3 boundary modes |
|---|---|---|---|
Scalar (I*) | I8, I16, I32, I64 | 4 | 12 |
V2 | I8, I16, I32, I64 | 4 | 12 |
V4 | I8, I16, I32 (no I64) | 3 | 9 |
| Total per geometry | 33 |
Note that V4 omits I64 because suld.v4.b64 would exceed 256 bits per transaction, which the PTX ISA doesn't support.
3D geometry omits V4I8 per-mode for the same reason in some SM tiers; check cicc_data_tables.json for the exact subtarget feature-gate.
Naming scheme
NVPTXISD::Suld<Geometry><Vec?><EltTy><Boundary>
Examples:
| Name | Decoded meaning |
|---|---|
Suld1DI32Clamp | 1D surface, scalar i32, clamp on OOB |
Suld2DArrayV4I8Trap | 2D array surface, 4xi8 vector, trap on OOB |
Suld1DBufferV2I64Zero | 1D buffer surface, 2xi64 vector, return zero on OOB |
High-impact opcode: Suld1DI32Clamp lowering
// In NVPTXTargetLowering::lowerSurfaceLoadIntrinsic, sub_33B0210
SDValue Ops[] = {
Chain,
SurfHandle, // 64-bit surface handle
XCoord, // s32 byte offset into the surface
};
SDVTList VTs = DAG.getVTList({ MVT::i32, MVT::Other });
return DAG.getNode(NVPTXISD::Suld1DI32Clamp, dl, VTs, Ops);
ISel emits PTX suld.b.1d.b32.clamp {rd}, [handle, {x}];.
⚡ QUIRK — Boundary mode is part of the opcode, not an operand Every surface load opcode bakes the boundary mode (
Clamp/Trap/Zero) into the opcode name rather than carrying it as a runtime operand. This is because PTXsuldhas three completely distinct mnemonic forms (suld.b.*.clamp,suld.b.*.trap,suld.b.*.zero) that the assembler treats as separate instructions. Carrying the mode as a constant operand and selecting in ISel would work, but the historical NVPTX design opted to enumerate them explicitly -- which is exactly why the surface family has 198 opcodes instead of 66. The same design choice is why CICC'scicc_strings.jsonshows the family blowing up: each enumerator gets its own diagnostic string forgetTargetNodeName.
SDNode-Name Master Switch (sub_35F6D40)
The 460 enumerator names listed above are not stored as a table of string literals indexed by opcode number; CICC instead embeds them directly into a single 875 KB function -- sub_35F6D40 -- that walks the SDNode tree, dispatches on *a2 (the node's PTX opcode word), and writes the corresponding PTX mnemonic plus operand-suffix keywords into an output byte buffer (a4). This function is the de-facto asm-printer body for every NVPTX SDNode that survives instruction selection. It is the single largest function in the binary by switch density (6,634 explicit case labels across 24 nested switches, with the master at instruction address 0x36607ec) and it routes every PTX modifier keyword the assembler is allowed to emit -- mmarowcol, scaleD, cta_group, parity_op, multicast, cta, mem_order, scope, unified, ftz, sat, relu, and 35 others.
Function Role
sub_35F6D40(a1, a2, a3, a4) is reached from exactly one caller, the 59-byte wrapper sub_36CC800. The wrapper's body is:
void sub_36CC800(int64_t TLI, unsigned *N, int64_t Ctx,
const uint8_t *Suffix, size_t SuffixLen,
int64_t _unused, OutBuf *Out) {
sub_35F6D40(TLI, N, Ctx, Out); // emit mnemonic + operand keywords
sub_E826F0(TLI, Out, Suffix, SuffixLen); // append the precomputed
// type/space suffix bytes
}
The *a2 value read at the top of sub_35F6D40 is the SDNode opcode in the range 335..6968 -- exactly the range covered by the master switch at 0x36607ec. The 335 lower bound is conspicuous: it sits one above NVPTXISD's BUILTIN_OP_END + N slot, suggesting the encoder offsets the public NVPTXISD::* enumerator by a fixed constant (most likely ISD::FIRST_TARGET_MEMORY_OPCODE minus 1, i.e., 334 + 1 = 335 for the first NVPTX-specific opcode) so that target-independent ISD nodes hit the default arm before any NVPTX-specific case can match.
The role is therefore: given an SDNode, write the printable PTX form of its opcode plus all modifier keywords selected by sub-fields of the node's flags word. Operands themselves are formatted by sub-callees (sub_35EE840, sub_35EFB80, sub_35F18E0, sub_35F2080, sub_35F2C30, sub_35F3330, sub_35F3E90) which read sub-fields of *a2 and the chain operands. The function is invoked once per SDNode by the SelectionDAG asm-printer walker (the equivalent of upstream LLVM's NVPTXAsmPrinter::EmitInstruction → NVPTXInstPrinter::printInstruction chain, but inlined into one giant dispatch).
How It Is Called
The call shape is a classic visitor over the post-ISel DAG. The asm-printer walks the post-selection MachineInstr stream, and for every instruction whose getOpcode() falls in the NVPTX target range, it invokes sub_36CC800 with:
a1-- pointer to theNVPTXTargetLoweringinstance (used for subtarget queries, e.g. SM tier, unified addressing flag).a2-- pointer to a 16-byte (or larger) flags packet whose first dword is the opcode and whose subsequent bits encode operand-modifier sub-fields. Thev13 >> 17shift visible early in the body extracts a 7-bit modifier index, which then drives the inner switches at lines 33113, 102155, 109917, 110265.a3-- pointer to the formatting context (string table, output column, indent).a4/a7-- pointer to theOutBufstruct ({begin, _, _, end, write_ptr}) into which raw bytes are appended viasub_CB6200(overflow path) or direct stores.
Each case label writes a fixed byte sequence (the literal mnemonic prefix -- "tex.unified.2d.v4", "suld.b.2d.b32.clamp", "shf.l.clamp.b32", etc.) followed by one or more calls to the operand-keyword emitters listed below. The function is therefore the inverse of the getTargetNodeName switch in upstream LLVM: instead of returning a const char* for a debug print, it streams the mnemonic and every modifier keyword directly into the asm buffer.
Operand-Keyword Emitter Helpers
47 distinct operand-keyword strings are referenced from inside the master switch. Each is passed as the 5th argument to one of four helper functions, which means the keywords are literal C string constants baked into the encoder, not entries in a data table:
| Helper | Role | Sample keywords passed |
|---|---|---|
sub_35F2C30(TLI, N, idx, Out, kw) | Emits .<kw> field from a 3-bit sub-field at position idx | mmarowcol, opcode, abtype, rowcol, ab |
sub_35F3330(TLI, N, idx, Out, kw) | Emits .<kw> from a 4-bit sub-field (used for the larger enums) | cta_group, parity_op, kind, shape, mem_order |
sub_35F3E90(TLI, N, idx, Out, kw) | Emits scaling/precision keyword from a 6-bit sub-field | scaleD, scale, rnd, sat |
sub_35F18E0(TLI, N, idx, Out, kw) | Emits boolean/flag keyword (presence-only, no value) | mode, unified, aligned, ftz, noftz, relu, multicast |
The complete keyword inventory recovered from the decompilation (sorted alphabetically): ab, abs, abtype, add, addsp, aligned, arrive, base, cop, cta, cta_group, desc, descsuf, dst, fmt, ftz, generic, group, kind, mc, mem_order, mmarowcol, multicast, nan, noftz, op, opcode, relu, rnd, rowcol, sat, satf, scale, scaleD, scope, sem_ordered, sem_unordered, shape, shared, sign, sink, space, src, ss, trans, type, unified, vec, vol, ws. (50 entries; HIGH confidence -- direct rg extraction of the 5th argument to the four emitter helpers across the entire 192K-line decompilation.)
Case Range Classification
Cross-correlating the explicit case labels (729 distinct case bodies at indent-8) against the 460 NVPTXISD opcode families documented above gives the following coarse range partition. The master switch covers values 335..6968, but only 261 unique targets exist -- 3,318 cases (50% of the 6,634-case dispatch table) fall through to the default arm at 0x437638, which corresponds to either a target-independent ISD node that should never reach this point or an opcode that was reserved but never wired up.
| Case value range | Family / role | Sample case → literal prefix |
|---|---|---|
| 335 (0x14F) -- 346 (0x15A) | Load/store wide aggregates and ld.param.b8 block forms | 0x14F → ld.param.b8 ... block load with 20-byte payload |
| 347 (0x15B) -- 372 (0x174) | Texture / surface entry prologue (rarely-emitted variants) | 0x173, 0x174 → tex.<geom>.<vec> prefix |
| 380 (0x17C) -- 430 (0x1AE) | Call ABI prologue (CallSeqBegin, DeclareParam, DeclareScalarParam) | 0x1AE → .param .align ... |
| 432 (0x1B0) -- 510 (0x1FE) | Math/predicate combinators (BFE, BFI, PRMT, SETP_F16X2) | 0x1B2 → bfe.s32 ...; 0x1B5 → prmt.b32 ... |
| 877 (0x36D) -- 893 (0x37D) | Funnel-shift family (FSHL_CLAMP, FUN_SHFR_CLAMP legacy) | 0x37D → shf.l.clamp.b32 ... |
| 1289 (0x509) -- 1300 (0x514) | Store family (StoreV2/V4, StoreParam{S32,U32}) | 0x509 → st.global.v4.b32 ...; 0x511 → st.param.s32 ... |
| 1370 (0x55A) -- 1391 (0x56F) | Branch-index table (BrxStart, BrxItem, BrxEnd) and CALLSEQ_END | 0x55E → brx.idx ...; 0x563 → callseq_end glue marker |
| 1392 (0x570) -- 1394 (0x572) | RET_GLUE / RETURN / PrintCall* | 0x570 → ret;; 0x571 → call ...; 0x572 → call.uni ... |
| 1660 (0x67C) -- 1667 (0x683) | Atomic R-M-W variants (atom.{add,min,max,and,or,xor,cas}) | 0x680 → atom.global.add.u32 ... |
| 1756 (0x6DC) -- 1779 (0x6F3) | Surface load Suld* (clamp/trap/zero matrix, 1D + 1DArray) | 0x6DC → suld.b.1d.b8.clamp ...; 0x6E7 → suld.b.2d.v4.b32.trap ... |
| 1781 (0x6F5) -- 2415 (0x96F) | Texture sample family (`Tex | Level]`) |
| 2416 (0x970) -- 3763 (0xEB3) | Texture-unified + Tld4* gather, plus the WMMA prep nodes | 0xEB0 → tld4.r.2d.v4.f32.f32 ... |
| 3764 (0xEB4) -- 4655 (0x122F) | WMMA mma operand printer (mmarowcol field-bearing cases -- the wmma.mma.sync/wmma.load/wmma.store family). High target density (case 0x122F has 108 cases routed to it via 0x36a0352). | 0xEB4 → wmma.load.a.sync.aligned.m16n16k16.row.f16 ... |
| 4656 (0x1230) -- 5028 (0x13A4) | Cooperative-group, cluster.*, barrier.cluster.*, bar.sync extensions | 0x1300 → barrier.cluster.arrive.aligned ... |
| 5029 (0x13A5) -- 5807 (0x16AF) | Tensor memory / TMA / cp.async.bulk family (the scaleD, desc, descsuf, kind, shape, mc, multicast keyword block); densely clustered case bodies | 0x14F4 → cp.async.bulk.tensor.2d.global.shared::cluster ... |
| 5808 (0x16B0) -- 5939 (0x1733) | mbarrier.* and fence.* -- 64-case + 64-case parallel clusters at 0x36adb22/0x36adb7e | 0x16B0 → mbarrier.arrive ...; 0x16E0 → fence.acq_rel.cta ... |
| 5940 (0x1734) -- 6440 (0x1928) | discard, prefetch, applypriority, red.async family | 0x1740 → prefetch.global.L2 ...; 0x1830 → red.async.shared::cluster.add.u32 ... |
| 6441 (0x1929) -- 6628 (0x19E4) | setmaxnreg.*, griddepcontrol.*, elect.sync, miscellaneous scheduling pseudos | 0x1930 → setmaxnreg.inc.sync.aligned.u32 ...; 0x19D0 → elect.sync ... |
| 6629 (0x19E5) -- 6967 (0x1B37) | Sparse / structured-MMA + Hopper-only modifiers (mma.sp, parity_op block) | 0x19E5 → mma.sp.sync.aligned.m16n8k32.row.col.f16.f16.f16.f16 ...; 0x1AE0 → mma.m16n8k16.row.col ... with parity_op |
| 6968 (0x1B38) -- default | Fallthrough (3,318 of the 6,634 entries hit this; corresponds to opcodes the encoder reserves but never emits, or to target-independent ISDs that escaped LowerOperation) | -- |
(Boundaries are MED confidence -- they were determined by cross-referencing the case-value first/last pairs in the per-target grouping with the literal byte-buffer prefixes referenced inside each handler, the operand-keyword strings emitted, and the family layout established in earlier sections. The 261-vs-6,634 split is HIGH confidence; the exact opcode-to-keyword binding inside each handler is HIGH confidence for the keywords explicitly named in the table above and MED for everything else.)
Dispatcher Pattern (C Pseudocode)
The master switch can be modeled as the following pattern, repeated 261 times with different mnemonic prefixes and different keyword sets:
void emitNVPTXSDNode(TLI_t *TLI, unsigned *N, FormatCtx *Ctx, OutBuf *Out) {
unsigned opcode = *N; // 32-bit opcode at N[0]
uint64_t flags = ((uint64_t*)N)[1]; // packed modifier sub-fields
switch (opcode) {
// ----- representative case: WMMA load with row/col modifier -----
case 0xEB4: { // NVPTXISD::WMMA_LOAD_A_SYNC_M16N16K16_ROW_F16
// Step 1: append the fixed mnemonic prefix
appendBytes(Out, "wmma.load.a.sync.aligned.m16n16k16.", 35);
// Step 2: emit the .row|.col modifier driven by a 3-bit field
sub_35F2C30(TLI, N, /*field=*/3, Out, "mmarowcol");
// Step 3: append the type suffix
appendBytes(Out, ".f16", 4);
// Step 4: emit the destination/source operand list
sub_35EE840(TLI, N, /*opIdx=*/0, Out, 0, 0);
appendBytes(Out, ", [", 3);
sub_35EE840(TLI, N, /*opIdx=*/1, Out, 0, 0);
appendBytes(Out, "];", 2);
goto LABEL_emitted;
}
// ----- representative case: surface load with boundary mode -----
case 0x6DC: { // NVPTXISD::Suld1DI8Clamp
appendBytes(Out, "suld.b.1d.b8.clamp ", 19);
sub_35EE840(TLI, N, 0, Out, 0, 0); // dest
appendBytes(Out, ", [", 3);
sub_35EE840(TLI, N, 1, Out, 0, 0); // handle
appendBytes(Out, ", {", 3);
sub_35EE840(TLI, N, 2, Out, 0, 0); // x coord
appendBytes(Out, "}];", 3);
goto LABEL_emitted;
}
// ----- representative case: TMA bulk-copy with rich keyword set -----
case 0x14F4: { // NVPTXISD::CpAsyncBulkTensor2DGlobalShared
appendBytes(Out, "cp.async.bulk.tensor.", 21);
sub_35F3330(TLI, N, 0, Out, "kind"); // .2d|.3d|.4d|.5d
appendBytes(Out, ".", 1);
sub_35F3330(TLI, N, 1, Out, "space"); // global|shared::cluster
// Optional .multicast::cluster
if ((flags >> 17) & 1) sub_35F18E0(TLI, N, 0, Out, "multicast");
// Optional .cta_group::1|2
if ((flags >> 18) & 3) sub_35F3330(TLI, N, 2, Out, "cta_group");
// Operand list: descriptor, dst, src, ...
appendBytes(Out, " ", 1);
sub_35EE840(TLI, N, 0, Out, 0, 0);
appendBytes(Out, ", ", 2);
sub_35EE840(TLI, N, 1, Out, 0, 0);
appendBytes(Out, ", ", 2);
sub_35EE840(TLI, N, 2, Out, 0, 0);
appendBytes(Out, ";", 1);
goto LABEL_emitted;
}
// ... 258 more case bodies following the same template ...
default:
// 3,318 case values fall here -- target-independent ISD nodes
// or reserved-but-unused enumerators. emitNVPTXSDNode is a no-op
// for these; the upstream printer handles them via the standard
// MachineInstr opcode tables.
goto LABEL_default_0x437638;
}
LABEL_emitted:
// Common post-amble: write trailing ';' if not already emitted.
return;
LABEL_default_0x437638:
// Falls back to the generic MI printer at the call site.
return;
}
Why a Single Monolithic Switch?
⚡ QUIRK — Monolithic 6,634-case switch instead of per-family vtables Upstream LLVM splits asm printing across a
TargetInstrInfo::getInstSizeInBytestable plus the TableGen-generatedprintInstructionswitch in the*InstPrinter.cppfile. NVIDIA's encoder collapses everything -- mnemonic, modifier keywords, operand formatting hooks -- into one function with a single dense switch over the SDNode opcode space. The reason is that NVPTX's modifier set is positional and order-sensitive (tex.unified.2d.v4.f32.f32is one instruction;tex.2d.unified.v4.f32.f32is a syntax error), so the encoder cannot factor out a generic "emit modifier list" loop without losing the per-opcode ordering. The monolithic switch also makes the encoder branch-free relative to the SDNode opcode -- the CPU jumps once through the indirect-branch table and lands directly in the right body, avoiding the two-level dispatch that a TableGen-generated printer would incur.
⚡ QUIRK — 50% of cases are sparse-region fallthroughs The switch covers 335..6968 inclusive (6,634 entries) but only 261 unique handlers exist. The remaining 3,318 case values fall through to the same default arm at
0x437638. This is what a compiler emits when anenumis defined with non-contiguous values (e.g.,NVPTXISD::FooThing = 0x14F, NVPTXISD::BarThing = 0x6DCwith no enumerators in between): the C++ frontend emits aswitchwith acasefor every gap, and the optimizer prefers a dense jump table over a hash because the index space is bounded bymax - min. The cost is 6,634 entries × 8 bytes = 53 KB of jump-table memory consumed by sparse holes, in exchange for branch-free dispatch. Trying to convert this to a per-family vtable would require renumbering the enum to be contiguous, which would break every binary-compatible pattern that hardcodes the numeric opcode values (includinggetTargetNodeNamein upstream LLVM).
⚡ QUIRK — Default arm is not an error path A default arm in a dispatcher of this size would normally be a
llvm_unreachableor an assertion call. Here it is a silentreturn(0x437638is a 2-instruction epilogue, not a diagnostic emitter). This is deliberate: the asm printer walks every MachineInstr in the program, including target-independent ones that LLVM's generic printer handles. Whensub_35F6D40is invoked on such a node, it simply does nothing and returns, leaving the actual printing to the upstreamMachineInstr::printpath. The pattern is "if I recognize the opcode, emit it; otherwise fall back to whoever else is on the printing chain", which is closer to a visitor than a true switch -- but it's spelled as a switch for the dispatch-density reason explained above.
Sample 30-Row Case → Keyword Binding Table
| Case (hex) | Case (dec) | Mnemonic prefix written | Modifier keywords emitted | Family |
|---|---|---|---|---|
| 0x14F | 335 | ld.param.b8 | space | Load |
| 0x150 | 336 | ld.param.v4.b32 | space, vec | Load |
| 0x152 | 338 | st.param.v2.b32 | space, vec | Store |
| 0x173 | 371 | tex.1d.v4.f32.f32 | unified, vec | Texture |
| 0x1AE | 430 | .param .align <a> .b8 _param_<n>[<sz>]; | (literal directive) | Call ABI |
| 0x37D | 893 | shf.l.clamp.b32 | op | Funnel shift |
| 0x509 | 1289 | st.global.v4.b32 | space, vec | Store |
| 0x511 | 1297 | st.param.s32 | sign, type | Store |
| 0x55E | 1374 | brx.idx | (none) | Branch table |
| 0x570 | 1392 | ret; | (none) | Call ABI |
| 0x571 | 1393 | call | op (callee handle) | Call ABI |
| 0x572 | 1394 | call.uni | op | Call ABI |
| 0x680 | 1664 | atom.global.add.u32 | space, op, type | Atomic |
| 0x6DC | 1756 | suld.b.1d.b8.clamp | space, type | Surface |
| 0x6F3 | 1779 | suld.b.2d.v4.b32.zero | space, vec, type | Surface |
| 0x6F6 | 1782 | tex.unified.2d.v4.s32.f32 | unified, vec, type | Texture |
| 0x901 | 2305 | tex.unified.2d.v4.s32.f32.grad | unified, vec, type | Texture |
| 0xEB0 | 3760 | tld4.r.2d.v4.f32.f32 | unified, vec | Tld4 gather |
| 0xEB4 | 3764 | wmma.load.a.sync.aligned.m16n16k16 | mmarowcol, abtype | WMMA |
| 0xF40 | 3904 | wmma.mma.sync.aligned.m16n16k16 | mmarowcol, abtype, rowcol | WMMA |
| 0x122F | 4655 | wmma.store.d.sync.aligned.m16n16k16 | mmarowcol, abtype | WMMA |
| 0x1300 | 4864 | barrier.cluster.arrive.aligned | aligned | Cluster |
| 0x14F4 | 5364 | cp.async.bulk.tensor.2d.global.shared::cluster | kind, space, multicast, cta_group | TMA |
| 0x1500 | 5376 | cp.async.bulk.tensor.3d.shared::cluster.global | kind, space, mc | TMA |
| 0x16B0 | 5808 | mbarrier.arrive.shared::cta | arrive, scope, space | mbarrier |
| 0x16E0 | 5856 | fence.acq_rel.cta | sem_ordered, scope | Fence |
| 0x1740 | 5952 | prefetch.global.L2 | space, cop | Prefetch |
| 0x1830 | 6192 | red.async.shared::cluster.add.u32 | space, op, type, scope | Async reduce |
| 0x1930 | 6448 | setmaxnreg.inc.sync.aligned.u32 | aligned, op, type | Scheduling |
| 0x19D0 | 6608 | elect.sync | (none) | Scheduling |
| 0x19E5 | 6629 | mma.sp.sync.aligned.m16n8k32.row.col.f16.f16.f16.f16 | mmarowcol, parity_op, abtype | Sparse MMA |
| 0x1AE0 | 6880 | mma.m16n8k16.row.col.f16.f16 | mmarowcol, parity_op | MMA |
| 0x1B14 | 6932 | griddepcontrol.wait | (none) | Sync |
(Case values are HIGH confidence -- they are explicit case <value>uLL: labels in the decompilation. Mnemonic prefixes are MED confidence -- they are reconstructed from the literal byte buffers referenced by each case body (byte_4CE..., byte_4CF..., xmmword_4CE...) and their lengths (0x12, 0x13, 0x1B, etc.); the prefix string itself was not directly extracted in this pass. Keyword bindings are HIGH confidence for keywords named in the sub_35F2C30/sub_35F3330/sub_35F3E90/sub_35F18E0 call sites inside the corresponding case body.)
Implications for Assembly Printing
The sub_36CC800 wrapper appends a fixed SuffixLen-byte trailer after sub_35F6D40 returns. This trailer is the type-and-space suffix (".f32", ".shared::cta", ".acquire.gpu") that depends on operand types rather than the opcode itself. Splitting the work this way means:
- The opcode-driven mnemonic and modifier keywords are emitted by the monolithic switch (constant per opcode value).
- The operand-driven type suffix is appended by
sub_E826F0(constant per ISD value-type bundle). - The two halves are concatenated in-place in the same
OutBuf, producing a single PTX instruction string per SDNode.
This is why the master switch never directly emits type suffixes: every case body ends with a goto LABEL_3441; / LABEL_3442; / LABEL_emitted; that returns control to the wrapper, which then appends the type bytes. The pattern is mnemonic+modifiers in switch, types after switch, which keeps the switch body bounded and makes type-driven opcode-overload printing (e.g., the 11 widths of st.param) a single case body with a runtime suffix instead of 11 case bodies.
Confidence Tags for This Section
- HIGH confidence: master switch location (
0x36607ec), case count (6,634), unique-target count (261), case value range (335..6968), default target (0x437638), caller identity (sub_36CC800), the four emitter helper functions and their roles, and the 50-keyword inventory. - HIGH confidence: the case → keyword binding for every keyword named in the table above, since each was extracted by direct
rgmatch on the 5th argument of the emitter call inside the corresponding case body. - MED confidence: the boundaries of the case-range partition table. The boundaries were inferred by correlating case-value clustering (group-by-target) with mnemonic prefix lengths and the keyword set used inside each cluster; they may shift by a few opcodes either way as more case bodies are decoded.
- MED confidence: the reconstructed mnemonic strings ("
tex.unified.2d.v4.f32.f32", "wmma.mma.sync.aligned.m16n16k16", etc.). The byte buffers exist (byte_4CE6005,byte_4CF3B30, etc.) and the lengths match the expected PTX mnemonic widths, but the actual character content of those buffers was not dumped here. - LOW confidence: the exact bit position of the modifier sub-fields inside the 64-bit flags word at
((uint64_t*)N)[1]. Thev13 >> 17shift is observed, but the higher-order sub-fields ((flags >> 18) & 3,(flags >> 25) & 0x1F, etc.) are pattern-inferred from the operand-keyword index arguments passed to the emitter helpers.
Open Follow-Ups Specific to sub_35F6D40
- Dump the literal byte buffers. Resolving every
byte_4CE.../byte_4CF...reference to its actual UTF-8 content would give the full mnemonic string per case and let us produce a 261-row case → mnemonic table instead of the 30-row sample above. This requires walkingcicc_data_tables.jsonand joining on thexmmword_*/dword_*/byte_*symbol names. - Document the inner sub-switches. The function contains three additional large switches at lines 33113, 102155, and 109917 (112, 639, 188 cases respectively) that handle modifier-only dispatch -- e.g., which
.<scope>to emit whenflags >> 17 & 0x7F == k. These are currently lumped into the helpers but their exact value tables should be cross-referenced against the PTX ISA modifier grammar. - Map the missing 199 opcodes. The string table has 460 NVPTXISD names but the switch has only 261 unique handlers. Either some names alias to the same handler (multiple case values jumping to one target -- which we have already observed for the 3,318 default-fallthroughs), or some 199 names are produced upstream but never reach the asm printer (e.g., they are folded by ISel before printing). Confirming which is happening would close the loop on the family-count totals.
- Confirm the offset constant. The lower bound of 335 strongly suggests
ISD::FIRST_TARGET_MEMORY_OPCODE + 1or equivalent; pinning down the exact constant would tell us whether NVPTX uses the publicISD::*range below 335 or a private rebasing.
Cross-References
- SelectionDAG & Instruction Selection -- how these opcodes are produced and where they live in the pipeline.
- SelectionDAG: NVPTXISD DAG Node Opcodes -- the 14-row call-ABI subtable with numeric opcode values.
- SelectionDAG: Call Sequence DAG Structure -- canonical shape of the DAG that emits Call* opcodes.
- SelectionDAG: BUILD_VECTOR Lowering -- how
NVPTXISD::BUILD_VECTORis matched. - ISel Pattern Matching -- where these opcodes are consumed and translated to MachineInstrs.
- Type Legalization -- runs before LowerOperation and prepares operand types.
- NVPTX Machine Opcodes -- the post-ISel MachineInstr opcodes (440--573 for call ABI, etc.) that these NVPTXISD nodes lower to.
- NVPTX Machine Opcodes: Call ABI Family -- numeric mapping for the call/param family.
- Tensor / MMA Codegen -- WMMA/MMA lowering (no NVPTXISD entries for MMA -- they go directly to MachineInstr opcodes).
- Tex/Surface Builtins -- builtin entry points that feed the Tex/Suld families.
Confidence Notes
- HIGH confidence: family membership and total count (460). All 460 names are direct extracts from
cicc_strings.json. - HIGH confidence: producer/consumer function assignments. Cross-checked against
cicc_functions.jsonand the LowerOperation cluster documented in SelectionDAG. - MED confidence: numeric opcode values quoted in inline tables (
CallSeqBegin = 315,DeclareParam = 505, etc.). These come fromsub_35F6D40(master 6,634-case switch) and thesub_32E3060LowerOperation dispatch where they were observed as immediate constants ingetNodecall sites. The mapping to enumerator names is by behavioral signature (operand counts, chain/glue presence) and may shift by ±1 if the enum had silent insertions between LLVM versions. - MED confidence: per-family counts in the breakdown table. Verified by
jqregex matching but the boundary between "math" and "misc" is somewhat arbitrary;PRMTcould go either way. - LOW confidence: the explanation of
LoadExtVer2*as SM90+ cache-hint variants. The "Ver2" suffix is a guess based on the parallel triplet structure; the actual semantic difference betweenLoadExtandLoadExtVer2could not be confirmed from the function-level analysis. The "Ver2" forms might be a deprecated path retained for backward-compatibility patterns rather than a forward-looking SM90+ feature.
Open Follow-Ups
- Numeric opcode values for the full 460 set. Only the 14 call-ABI opcodes have confirmed numeric assignments. Recovering the rest requires walking
sub_35F6D40case labels and correlating them with the string emitted bygetTargetNodeName-- a several-hour reverse-engineering task. - Subtarget feature gating per opcode. Some opcodes are only legal on certain SM tiers (
LoadV8/StoreV8on SM90+,DP4Aon SM61+,SETP_BF16X2on SM80+). The exact gating lives in the action table atNVPTXTargetLowering + 2422-- documenting which opcodes are gated where would complete the picture but needs the action table to be fully decoded. - Resolution of the
LoadExtvsLoadExtVer2duality. Either renameLoadExtVer2*to a more descriptive name or document the precise semantic difference. This likely needs a side-by-side test compilation against upstream LLVM NVPTX to see which intrinsic emits which form. Tld4opcodes for SM90+ texture gather. PTX ISA 8.0 addedtld4.s.2dandtld4.a.2dvariants for stencil/aware gather; if these are present in the binary they would extend theTld4*family beyond the current 24 entries. Worth re-scanningcicc_strings.jsonagainst the PTX 8.0 spec.- Cross-reference into individual pass pages. Each pass that lowers a subset of these opcodes (e.g.,
mma-codegen.mdfor tensor opcodes,isel-patterns.mdfor the matcher) should grow a link back to its slice of this catalog. Currently this page links outward but the back-links are missing.