NVPTXISD SelectionDAG Opcodes

NVIDIA-private target enum. These opcode values exist only between SelectionDAG lowering and instruction selection. They are not part of upstream LLVM's public ISD:: enum and are erased by the time MachineInstrs reach the AsmPrinter.

Upstream source: Enum declared in llvm/lib/Target/NVPTX/NVPTXISelLowering.h as namespace NVPTXISD { enum NodeType : unsigned { ... } }. Lowering producers live in NVPTXISelLowering.cpp; consumers live in NVPTXISelDAGToDAG.cpp and the TableGen-generated pattern matcher.

Source of truth in this binary: The 460 enumerator names below were recovered from cicc_strings.json (the diagnostic strings that getTargetNodeName returns to SDNode::getOperationName). The numeric values are not directly observable as enum literals -- they appear baked into the master opcode dispatch in sub_35F6D40 (a 6,634-case switch over SDNode::getOpcode()) and in the LowerOperation cluster at sub_32E3060. Cross-references to cicc_strings.json, cicc_switches.json, and the LowerOperation/ISel function map were used to assign opcodes to families.

CICC v13.0 enumerates exactly 460 distinct NVPTXISD::* SelectionDAG opcodes -- target-specific node types that exist transiently between operation legalization (where LowerOperation produces them) and pattern-based instruction selection (where the TableGen-generated matcher consumes them and emits NVPTX MachineInstr opcodes). Each opcode represents either a PTX construct that has no clean spelling in target-independent ISD::* (the .param-space calling convention, texture/surface fetches, funnel shifts with clamping, bitfield extract/insert) or an internal pseudo that survives only long enough to glue chains, prototypes, and call sequences together. The opcode count is roughly 15x that of upstream LLVM's NVPTX target, which carries around 30 NVPTXISD nodes; CICC's expansion is driven mostly by the texture/surface family (372 of the 460 opcodes -- 81%), plus the SM90+ load/store extension variants and the four-flavor call matrix.

This page catalogs every recovered opcode grouped by family. The numeric values shown in inline tables are reconstructions from the sub_32E3060 LowerOperation dispatch and the sub_35F6D40 master switch; treat them as MED confidence unless explicitly noted otherwise. See SelectionDAG for how these nodes flow through the pipeline, ISel Pattern Matching for how they are matched, and NVPTX Machine Opcode Reference for the MachineInstr opcodes they lower to.

Family Breakdown

| Family | Count | % of total | Producer (LowerOperation cluster) | Consumer (ISel) |
|---|---|---|---|---|
| Texture (Tex*, Tld4*) | 174 | 37.8% | sub_32B8A20 (NVVM tex/surf lowering, 71KB) | sub_3090F90 ISel + TableGen patterns |
| Surface (Suld*) | 198 | 43.0% | sub_32B8A20 | sub_3090F90 ISel + TableGen patterns |
| Call / Frame / Param | 29 | 6.3% | sub_3040BF0 (LowerCall, 88KB) | sub_3349730 formal args / sub_332FEA0 calls |
| Load family | 18 | 3.9% | sub_32D2680 (load/store lowering, 81KB) | TableGen ld.* patterns |
| Store family | 17 | 3.7% | sub_32D2680 | TableGen st.* patterns |
| Math (BFE/BFI/IMAD/DP*/PRMT/SETP*/MUL_WIDE) | 10 | 2.2% | sub_32983B0 (integer/FP legalization, 79KB) | TableGen integer-math patterns |
| Funnel shift (FSH*, FUN_SHF*) | 4 | 0.9% | sub_32983B0 | TableGen shf.l/shf.r patterns |
| Brx* (branch index table) | 3 | 0.7% | sub_32BE8D0 (conditional/select, 54KB) | brx.idx emitter |
| Vector reshape (BUILD_VECTOR, UNPACK_VECTOR) | 2 | 0.4% | sub_32E3060 BUILD_VECTOR path | TableGen mov.b{32,64,128} patterns |
| Misc pseudo (Dummy, Wrapper, ProxyReg, STACKSAVE/RESTORE, DYNAMIC_STACKALLOC, FCOPYSIGN) | 7 | 1.5% | various | various |
| Total | 460 | 100.0% | -- | -- |

The per-family counts sum to 462 and the percentages to 100.6%, two opcodes over the distinct total. The excess is consistent with the cross-listing in the family tables below: MoveParam is counted under both Call and Load, and ProxyReg under both Call and Misc.
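The table's arithmetic can be checked at compile time. The sketch below copies the per-family counts in table order; the names kFamilyCounts, sumCounts, and kTexSurfPerMille are illustrative and do not come from the binary.

```cpp
// Family sizes copied from the breakdown table, in table order.
constexpr unsigned kFamilyCounts[] = {174, 198, 29, 18, 17, 10, 4, 3, 2, 7};

// Sum of the per-family rows (not the "Total" row).
constexpr unsigned sumCounts() {
    unsigned total = 0;
    for (unsigned c : kFamilyCounts) total += c;
    return total;
}

// Texture + surface share of the 460 distinct opcodes, in tenths of a percent
// (integer division truncates).
constexpr unsigned kTexSurfPerMille = (174 + 198) * 1000 / 460;

static_assert(sumCounts() == 462, "rows sum to 462, two above the distinct total");
static_assert(kTexSurfPerMille == 808, "372 of 460 is about 81%");
```

The two-opcode surplus over 460 is what one would expect if two enumerators are each counted in two families.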

How Opcode Names Reach the Binary

Every NVPTXISD opcode has a constructor in NVPTXTargetLowering that calls DAG.getNode(NVPTXISD::Foo, dl, VT, ...). The numeric value of NVPTXISD::Foo is a contiguous integer assigned by the C++ enum, starting at ISD::BUILTIN_OP_END (value 499 in LLVM 20, confirmed at sub_33D4EF0 line 11 of the decompilation -- "above 499 delegate to TargetLowering"). The symbolic name survives in the stripped binary only because NVPTXTargetLowering::getTargetNodeName(unsigned Opcode) contains a switch that maps each enumerator to a string literal; those literals are what cicc_strings.json captures. The string is consulted by SelectionDAG::print(), diagnostic emission in LegalizeOp, and any assert that mentions the opcode.
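A minimal sketch of that mechanism, assuming the reported BUILTIN_OP_END value of 499. The three enumerators and their order are illustrative only; the real enumerator order in CICC is not recovered here, and the sketch namespace is hypothetical.

```cpp
namespace sketch {

// Assumption: 499 is the ISD::BUILTIN_OP_END value this page reports.
constexpr unsigned BUILTIN_OP_END = 499;

// Illustrative subset; each enumerator takes the next contiguous integer.
enum NodeType : unsigned {
    FIRST_NUMBER = BUILTIN_OP_END,
    Wrapper,
    CALL,
    RET_GLUE,
};

// Same shape as getTargetNodeName: one case label per enumerator, each
// returning a string literal that stays in the binary after stripping.
constexpr const char *getTargetNodeName(unsigned Opcode) {
    switch (Opcode) {
    case Wrapper:  return "NVPTXISD::Wrapper";
    case CALL:     return "NVPTXISD::CALL";
    case RET_GLUE: return "NVPTXISD::RET_GLUE";
    default:       return nullptr;  // not a target node in this sketch
    }
}

}  // namespace sketch

static_assert(sketch::Wrapper == 500, "first real opcode follows FIRST_NUMBER");
```

Because the literals are referenced from code, a strings dump of the binary recovers the names even though the enum itself leaves no symbol.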

The full enumeration of strings was extracted via:

jq -r '.[] | select(.value | test("^NVPTXISD::")) | .value' \
  cicc_strings.json | sort -u

This yields exactly 460 lines.


Call / Frame / Param Family (29 opcodes)

These opcodes implement the PTX .param-space calling convention. They are emitted exclusively by NVPTXTargetLowering::LowerCall (sub_3040BF0, 88KB) and LowerFormalArguments/LowerReturn, then matched 1:1 in ISel to pseudo MachineInstrs in the 505--573 opcode range documented in NVPTX Machine Opcodes. Every CUDA device function call expands into a sequence built from these nodes; see the SelectionDAG Call Sequence DAG Structure for the canonical shape.

| NVPTXISD name | Role | Notes |
|---|---|---|
| CALL | Top-level call node (legacy/generic) | Rarely produced directly; most calls use the four flavor opcodes below. |
| CallArg | Single .param argument slot | Emitted per scalar/aggregate argument. |
| CallArgBegin | Marks start of argument list | Glue-only chain node. |
| CallArgEnd | Marks end of argument list | Glue-only chain node. |
| LastCallArg | Tags the final CallArg in a sequence | Lets the matcher recognize the last argument without counting. |
| CallPrototype | Attaches a parsed prototype string | Holds the .callprototype directive operand. |
| CallSeqBegin | Outer call-frame setup (opcode 315) | Maps to ISD-level CALLSEQ_START. |
| CallSeqEnd | Outer call-frame teardown (opcode 316) | Maps to ISD-level CALLSEQ_END. |
| CallSuspend | Suspend point inside a call | Used when the callee is a coroutine handle or __syncthreads boundary; see QUIRK below. |
| CallSymbol | Direct callee by symbol | Carries the MCSymbol for a named device function. |
| CallVal | Call returning a scalar value | Discriminates from the void variant for return-glue handling. |
| CallVoid | Call returning no value | Skips the LoadRetParam chain. |
| PrintCall | Emits call PTX directive | Non-uniform variant. |
| PrintCallUni | Emits call.uni PTX directive | Uniform-control-flow variant; matched when the call is provably warp-uniform. |
| PrintConvergentCall | call with convergent semantics | Forces barrier-like ordering. |
| PrintConvergentCallUni | call.uni with convergent semantics | Combination of both. |
| Prototype | Standalone .callprototype declaration | Emitted at function boundaries. |
| SuspendPrototype | Suspended prototype (forward reference) | Used when a call-site sees a prototype defined later in the module. |
| DeclareParam | .param .align N .b8 _param_X[size] byval aggregate | Opcode 505 at MachineInstr level. |
| DeclareScalarParam | Scalar .param with width+align | Opcode 506. |
| DeclareRet | .param slot for return value | For non-scalar returns. |
| DeclareRetParam | Combined return + param declaration | Helper for the dispatcher; collapses into separate Declare* nodes during selection. |
| DeclareScalarRet | Scalar return declaration | Opcode 508. |
| LoadParam | ld.param.bN of scalar argument | Reads from .param space inside the callee. |
| LOAD_PARAM | Alias / older spelling | Both names appear in the binary; see QUIRK. |
| LoadParamV2 | ld.param.v2.bN (2-element vector) | |
| LoadParamV4 | ld.param.v4.bN (4-element vector) | |
| MoveParam | mov between two .param slots | Used for register-to-param shuffles in the prologue. |
| PseudoUseParam | Liveness anchor for unused params | Prevents DCE from deleting declared-but-unused .param slots. |
| ProxyReg | Proxy node for cross-block register liveness | Lowers to a NOP mov; preserves SSA correctness across DAG boundaries. |
| RET_FLAG | LLVM-style return-with-flag glue | |
| RET_GLUE | Renamed RET_FLAG in newer LLVM versions | See QUIRK below -- both spellings coexist. |
| RETURN | Function epilogue marker | Opcode 569 at MachineInstr level. |

High-impact opcode: DeclareScalarParam lowering

When LowerCall sees a scalar argument narrower than 64 bits, it emits a DeclareScalarParam node carrying (size_in_bits, alignment, param_index). C-like pseudocode for the producer:

// In NVPTXTargetLowering::LowerCall, around the per-argument loop in sub_3040BF0
SDValue DeclareScalarParam(SelectionDAG &DAG, const SDLoc &dl,
                           SDValue Chain, SDValue InGlue, unsigned ParamIdx,
                           unsigned SizeBits, unsigned Align) {
    // Operand order: chain, param index, size in bits, alignment, incoming glue.
    SDValue Ops[] = {
        Chain,
        DAG.getConstant(ParamIdx, dl, MVT::i32),
        DAG.getConstant(SizeBits, dl, MVT::i32),
        DAG.getConstant(Align,    dl, MVT::i32),
        InGlue,
    };
    // Results: output chain plus outgoing glue for the next node in the sequence.
    SDVTList VTs = DAG.getVTList(MVT::Other, MVT::Glue);
    return DAG.getNode(NVPTXISD::DeclareScalarParam, dl, VTs, Ops);
}

The consumer (sub_3090F90 in ISel) matches this with a single TableGen pattern that emits the PTX directive .param .align <align> .b<size> _param_<idx>; directly via MachineInstr opcode 506 with three immediate operands.

High-impact opcode: CallSeqBegin/CallSeqEnd paired counter

sub_3040BF0 maintains a monotonic per-function call counter at NVPTXTargetLowering + 537024 (offset 134256 * 4). Each CallSeqBegin reads-and-increments this counter:

// Roughly mirrors the prologue of LowerCall in sub_3040BF0
uint32_t seq_id = *(uint32_t*)(TLI + 537024);
*(uint32_t*)(TLI + 537024) = seq_id + 1;

SDValue Begin = DAG.getNode(NVPTXISD::CallSeqBegin, dl,
                            DAG.getVTList(MVT::Other, MVT::Glue),
                            { Chain, DAG.getConstant(seq_id, dl, MVT::i32),
                                     DAG.getConstant(0,      dl, MVT::i32) });
// ... emit DeclareParam / StoreParam / Call ...
SDValue End   = DAG.getNode(NVPTXISD::CallSeqEnd, dl,
                            DAG.getVTList(MVT::Other, MVT::Glue),
                            { LastChain, DAG.getConstant(seq_id, dl, MVT::i32) });

The seq_id is what makes .param names unique within a function -- without it, two inlined calls would clash on _param_0.

QUIRK: RET_FLAG and RET_GLUE both present

The strings table contains both NVPTXISD::RET_FLAG and NVPTXISD::RET_GLUE. Upstream LLVM renamed the per-target RET_FLAG nodes (NVPTXISD::RET_FLAG among them; there is no target-independent ISD::RET_FLAG) to RET_GLUE in commit e80b2b54 (2022) as part of a global rename of "flag" to "glue" for the auxiliary chain edge. CICC v13.0 carries the diagnostic string for both spellings, suggesting either (a) the binary was built from a tree partway through the rename, with both case labels emitting different strings, or (b) NVIDIA retained RET_FLAG as an alias for older internal patterns and added RET_GLUE for compatibility with upstream merges. Both opcodes ultimately lower to PTX ret; -- the distinction is purely cosmetic at the asm-print level but matters when matching internal patterns that hard-code the older name.

QUIRK: LoadParam vs LOAD_PARAM

Two near-identical strings appear: NVPTXISD::LoadParam (CamelCase) and NVPTXISD::LOAD_PARAM (SCREAMING_SNAKE_CASE). The LoadParam family also has V2/V4 vector variants in CamelCase, while LOAD_PARAM is a singleton. This is a leftover from an internal NVIDIA rename: the singleton LOAD_PARAM is the old single-element form retained for legacy patterns, while LoadParam/LoadParamV2/LoadParamV4 are the modern triplet. The behavioral overlap means LLVM ends up matching LOAD_PARAM against LoadParam patterns in some legalization paths -- mostly harmless, but a foot-gun if you try to add a new ISel pattern.


Load Family (18 opcodes)

The load opcodes split into four sub-groups: vectorized generic loads (LoadV2/V4/V8), non-coherent and uniform global loads (LDGV2/LDGV4, LDUV2/LDUV4), extending loads (LoadExt*), and .param-space loads (which were listed in the Call family above). All are produced by sub_32D2680 (load/store lowering, 81KB) and feed the load/store vectorization pass discussed in SelectionDAG: Load/Store Legalization.

| NVPTXISD name | Vec width | Maps to PTX | Notes |
|---|---|---|---|
| LoadV2 | 2 | ld.{space}.v2.b{8,16,32,64} | Pair load from generic/global/shared/local. |
| LoadV4 | 4 | ld.{space}.v4.b{8,16,32} | Quad load; b64 not supported as v4. |
| LoadV8 | 8 | ld.global.v8.b32 | SM90+ wide global load (Hopper TMA-adjacent). |
| LDGV2 | 2 | ld.global.nc.v2.b* | Non-coherent (read-only cache) pair load. |
| LDGV4 | 4 | ld.global.nc.v4.b* | Non-coherent quad load. |
| LDUV2 | 2 | ldu.global.v2.b* | Uniform load (warp-invariant). |
| LDUV4 | 4 | ldu.global.v4.b* | Uniform quad load. |
| LoadExt | 1 | ld.*.b{N}; cvt | Extending load (zext/sext into a wider register). |
| LoadExtV2 | 2 | ld.*.v2.b{N}; cvt | Vector extending load. |
| LoadExtV4 | 4 | ld.*.v4.b{N}; cvt | |
| LoadExtVer2 | 1 | (see QUIRK) | Version-2 spelling of the same opcode. |
| LoadExtVer2V2 | 2 | | |
| LoadExtVer2V4 | 4 | | |
| LoadParam | 1 | ld.param.b* | Listed in Call family; reproduced here for completeness. |
| LOAD_PARAM | 1 | (alias) | |
| LoadParamV2 | 2 | ld.param.v2.b* | |
| LoadParamV4 | 4 | ld.param.v4.b* | |
| MoveParam | 1 | mov ... | Listed in Call family. |

High-impact opcode: LoadV4 lowering

The load/store vectorization combine in sub_32D2680 scans contiguous LOAD nodes by memory offset; when four loads at offsets 0, k, 2k, 3k for k == elementBytes are found, they are coalesced:

// In NVPTXTargetLowering::ReplaceLoadVector, called from PerformDAGCombine
SDValue lowerLoadVector(SDValue Op, SelectionDAG &DAG) {
    SDLoc dl(Op);                        // debug location reused by every new node
    LoadSDNode *Ld = cast<LoadSDNode>(Op);
    EVT VT  = Op.getValueType();         // e.g., v4i32
    EVT EltVT = VT.getVectorElementType();

    SDValue Ops[] = {
        Ld->getChain(),
        Ld->getBasePtr(),
    };
    // 5 result VTs: 4 elements + chain
    SDVTList VTs = DAG.getVTList({ EltVT, EltVT, EltVT, EltVT, MVT::Other });
    SDValue NewLd = DAG.getMemIntrinsicNode(
        NVPTXISD::LoadV4, dl, VTs, Ops, VT,
        Ld->getMemOperand());
    // Reconstruct vector from the 4 scalar results.
    SDValue Vec = DAG.getBuildVector(VT, dl,
        { NewLd.getValue(0), NewLd.getValue(1),
          NewLd.getValue(2), NewLd.getValue(3) });
    return DAG.getMergeValues({ Vec, NewLd.getValue(4) }, dl);
}

The matcher in sub_3090F90 recognizes the LoadV4 node, checks its memory operand's address space, and emits the appropriate PTX opcode (ld.global.v4.b32 for AS 1, ld.shared.v4.b32 for AS 3, etc.).
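The offset condition described above (four loads at 0, k, 2k, 3k with k equal to the element size) can be sketched as a standalone predicate. This is an illustration of the stated rule only, not the recovered logic of sub_32D2680; isContiguousQuad and the sample arrays are hypothetical names, and the offsets are assumed to be already sorted.

```cpp
#include <cstdint>

// True when four byte offsets form the run base, base+k, base+2k, base+3k,
// where k is the element size in bytes. Assumes the offsets are sorted.
constexpr bool isContiguousQuad(const int64_t (&offsets)[4], int64_t elementBytes) {
    for (int i = 0; i < 4; ++i)
        if (offsets[i] - offsets[0] != i * elementBytes) return false;
    return true;
}

constexpr int64_t kPacked[4] = {16, 20, 24, 28};  // four adjacent 32-bit elements
constexpr int64_t kGapped[4] = {16, 20, 28, 32};  // hole between elements 1 and 2

static_assert(isContiguousQuad(kPacked, 4), "adjacent loads qualify for LoadV4");
static_assert(!isContiguousQuad(kGapped, 4), "a gap blocks the coalescing");
```

A real combine would also have to check alignment, address space, and chain dependencies, none of which this predicate models.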

QUIRK: LoadExtVer2* parallel triplet

The strings table contains both LoadExt/LoadExtV2/LoadExtV4 and LoadExtVer2/LoadExtVer2V2/LoadExtVer2V4 -- two complete triplets of extending-load opcodes. The "Ver2" suffix indicates these are a parallel implementation introduced for SM90+, where the extension semantics changed (likely related to the ld.global.nc.L1::no_allocate cache hint variants added in PTX ISA 7.5). The matcher selects between the two triplets based on subtarget feature flags; older code paths continue to emit the original LoadExt* form. This is one of the few cases where the binary preserves two complete generations of the same operation rather than gating with a sub-feature.


Store Family (17 opcodes)

Mirror of the load family, with the addition of StoreParam/StoreRetval for the call ABI. Produced by sub_32D2680 and sub_3040BF0 (the latter for the param/retval variants). Per the table below, ISel matches StoreV2 and StoreV4 to MachineInstr opcodes 572 and 573, StoreParam to 571, and the remaining param/retval forms to dedicated MachineInstr pseudos.

| NVPTXISD name | Maps to PTX | Notes |
|---|---|---|
| StoreV2 | st.{space}.v2.b{8,16,32,64} | Generic pair store; MachineInstr opcode 572. |
| StoreV4 | st.{space}.v4.b{8,16,32} | Quad store; opcode 573. |
| StoreV8 | st.global.v8.b32 | SM90+ wide global store. |
| StoreExt | cvt; st.* | Truncating store. |
| StoreExtV2 | cvt; st.*.v2.* | |
| StoreExtV4 | cvt; st.*.v4.* | |
| StoreExtVer2 | (see QUIRK in Load section) | |
| StoreExtVer2V2 | | |
| StoreExtVer2V4 | | |
| StoreParam | st.param.b* | Single scalar arg store; MachineInstr 571. |
| StoreParamS32 | st.param.s32 | Signed-explicit variant for return-value typing. |
| StoreParamU32 | st.param.u32 | Unsigned-explicit variant. |
| StoreParamV2 | st.param.v2.b* | |
| StoreParamV4 | st.param.v4.b* | |
| StoreRetval | st.param.b* (return slot) | Stores the return value into the caller-visible .param slot. |
| StoreRetvalV2 | st.param.v2.b* | |
| StoreRetvalV4 | st.param.v4.b* | |

High-impact opcode: StoreRetval lowering

LowerReturn walks the return values and emits one StoreRetval per element:

// In NVPTXTargetLowering::LowerReturn (companion to sub_3040BF0)
SDValue Chain = ...;
for (unsigned i = 0; i < RetVals.size(); ++i) {
    SDValue Val   = RetVals[i];
    SDValue Off   = DAG.getConstant(RetOffsets[i], dl, MVT::i32);
    SDValue Ops[] = { Chain, Val, Off };
    Chain = DAG.getNode(
        NVPTXISD::StoreRetval, dl,
        DAG.getVTList(MVT::Other),
        Ops);
}
return DAG.getNode(NVPTXISD::RET_GLUE, dl, MVT::Other, Chain);

The matcher recognizes the offset operand, selects the appropriate st.param.b{8,16,32,64} PTX form, and chains all the stores before the final ret;.

QUIRK: StoreParamS32/StoreParamU32 exist but StoreParamS{8,16,64} do not

The binary contains the signed/unsigned-discriminated StoreParamS32 and StoreParamU32 opcodes, but no StoreParamS8, StoreParamS16, StoreParamU8, StoreParamU16, or StoreParamS64/StoreParamU64 siblings. The reason is that PTX .param storage is always at least 32 bits wide (per the calling convention's scalar widening rules -- see SelectionDAG: Scalar Widening Rules), so anything narrower than 32 bits is widened by the lowering before the store is emitted. The signedness discrimination at exactly 32 bits exists to feed the PTX type system, which distinguishes .s32 and .u32 for some optimization decisions in downstream ptxas. 64-bit args use only StoreParam (no signed/unsigned split) because at that width ptxas treats .s64 and .u64 identically for all .param purposes.


Math Family (10 opcodes)

NVPTX-specific math nodes that don't have a clean upstream ISD::* equivalent. All produced by sub_32983B0 (integer/FP legalization, 79KB) or directly by the intrinsic lowering switch sub_33B0210 for the builtin entries.

| NVPTXISD name | PTX equivalent | Purpose |
|---|---|---|
| BFE | bfe.{s,u}{32,64} | Bitfield extract: (x >> offset) & ((1 << len) - 1) with sign/zero extension. |
| BFI | bfi.b{32,64} | Bitfield insert: replace len bits at offset in dst with low bits of src. |
| IMAD | mad.{lo,hi}.{s,u}{32,64} | Integer multiply-add as a fused operation. |
| DP2A | dp2a.{lo,hi}.{s,u}32.{s,u}32 | 2-way dot product of 16-bit by 8-bit values with a 32-bit accumulator; .lo/.hi picks which half of the 8-bit operand is used (SM61+). |
| DP4A | dp4a.{s,u}32.{s,u}32 | 4-way 8-bit dot product (SM61+). |
| MUL_WIDE_SIGNED | mul.wide.s32 | 32x32 -> 64 widening signed multiply. |
| MUL_WIDE_UNSIGNED | mul.wide.u32 | 32x32 -> 64 widening unsigned multiply. |
| PRMT | prmt.b32 | Permute bytes from two 32-bit registers: four 4-bit selectors (a 16-bit control value) each pick one of the eight source bytes. |
| SETP_F16X2 | setp.{cmp}.f16x2 | Predicate comparison of two half values packed in 32 bits (SM53+). |
| SETP_BF16X2 | setp.{cmp}.bf16x2 | Predicate comparison of two bfloat16 values packed in 32 bits (SM80+). |
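As a reference for what the PRMT node computes, here is a small software model of prmt.b32 in its default (generic) mode, written from the PTX ISA description rather than from the binary. The function name prmt is illustrative; the special modes (.f4e, .b4e, .rc8, and so on) are not modeled.

```cpp
#include <cstdint>

// Default-mode prmt: a supplies source bytes 0..3, b supplies bytes 4..7.
// Each result byte i is chosen by selector nibble i.
constexpr uint32_t prmt(uint32_t a, uint32_t b, uint32_t selector) {
    const uint64_t bytes = (static_cast<uint64_t>(b) << 32) | a;
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t nibble = (selector >> (4 * i)) & 0xFu;
        // Low three bits of the nibble index one of the eight source bytes.
        uint32_t byte = static_cast<uint32_t>(bytes >> (8 * (nibble & 7u))) & 0xFFu;
        // Bit 3 of the nibble replicates that byte's sign bit instead of copying it.
        if (nibble & 8u)
            byte = (byte & 0x80u) ? 0xFFu : 0x00u;
        result |= byte << (8 * i);
    }
    return result;
}

static_assert(prmt(0x33221100u, 0x77665544u, 0x0123u) == 0x00112233u,
              "selector 0x0123 reverses the bytes of a");
```

Only the low 16 bits of the selector matter in this mode, which is why a single 32-bit register suffices as the control operand.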

High-impact opcode: BFE lowering

BFE is matched from an idiomatic (x >> offset) & mask pattern during DAG combine:

// In NVPTXTargetLowering::PerformDAGCombine, AND case in sub_33C0CA0
SDValue tryFormBFE(SDValue N, SelectionDAG &DAG) {
    // Match: AND(SHR(x, ConstOff), Mask) where Mask = (1 << Len) - 1
    if (N.getOpcode() != ISD::AND) return SDValue();
    SDLoc dl(N);
    SDValue Sh   = N.getOperand(0);
    ConstantSDNode *MaskC = dyn_cast<ConstantSDNode>(N.getOperand(1));
    if (!MaskC || (Sh.getOpcode() != ISD::SRL && Sh.getOpcode() != ISD::SRA))
        return SDValue();
    ConstantSDNode *OffC = dyn_cast<ConstantSDNode>(Sh.getOperand(1));
    if (!OffC) return SDValue();
    uint64_t mask = MaskC->getZExtValue();
    if (!isMask_64(mask)) return SDValue();
    unsigned Len = popcount(mask);
    unsigned Off = OffC->getZExtValue();
    // The field must lie inside the source value; otherwise an SRA would
    // shift sign bits into it that a plain extract cannot reproduce.
    if (Off + Len > N.getScalarValueSizeInBits()) return SDValue();
    // The AND clears every bit above the field, so the result is
    // zero-extended for SRL and SRA alike: always select the unsigned form.
    bool Signed = false;
    SDValue Ops[] = {
        Sh.getOperand(0),
        DAG.getTargetConstant(Off, dl, MVT::i32),
        DAG.getTargetConstant(Len, dl, MVT::i32),
        DAG.getTargetConstant(Signed, dl, MVT::i1),
    };
    return DAG.getNode(NVPTXISD::BFE, dl, N.getValueType(), Ops);
}

ISel then matches BFE against the bfe.{s,u}{32,64} PTX instructions in a single TableGen pattern.
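For reference, the two extraction flavors can be modeled in plain C++. This is a simplified sketch of the bfe.u32 / bfe.s32 semantics that assumes 0 < len and off + len <= 32; the PTX rules for out-of-range positions and zero lengths are not modeled, and the function names are illustrative.

```cpp
#include <cstdint>

// Unsigned extract: the field is zero-extended.
// Assumes 0 < len and off + len <= 32.
constexpr uint32_t bfe_u32(uint32_t x, unsigned off, unsigned len) {
    return (x >> off) & (len >= 32 ? 0xFFFFFFFFu : ((1u << len) - 1u));
}

// Signed extract: the top bit of the field is replicated upward.
constexpr int32_t bfe_s32(uint32_t x, unsigned off, unsigned len) {
    const uint32_t field = bfe_u32(x, off, len);
    const uint32_t sign  = 1u << (len - 1);
    // (field ^ sign) - sign sign-extends a len-bit value in modular arithmetic.
    return static_cast<int32_t>((field ^ sign) - sign);
}

// The same 4-bit field 0xF reads as 15 unsigned but as -1 signed.
static_assert(bfe_u32(0x0000F000u, 12, 4) == 15u, "zero-extended field");
static_assert(bfe_s32(0x0000F000u, 12, 4) == -1, "sign-extended field");
```

The shift-and-mask idiom (x >> off) & mask is exactly bfe_u32, so the two flavors only agree when the field's top bit is clear.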


Funnel Shift Family (4 opcodes)

Funnel shifts concatenate two registers, shift the combined 2N-bit value, and return the high or low N bits. PTX exposes them as shf.l.clamp and shf.r.clamp; the .clamp variant saturates the shift amount at the operand width instead of wrapping mod-N.

| NVPTXISD name | PTX equivalent | Direction |
|---|---|---|
| FSHL_CLAMP | shf.l.clamp.b32 d, a, b, c | Left funnel with clamped shift amount. |
| FSHR_CLAMP | shf.r.clamp.b32 d, a, b, c | Right funnel with clamped shift amount. |
| FUN_SHFL_CLAMP | shf.l.clamp.b32 (legacy) | Older alias for FSHL_CLAMP. |
| FUN_SHFR_CLAMP | shf.r.clamp.b32 (legacy) | Older alias for FSHR_CLAMP. |
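The clamped funnel shifts can be modeled in a few lines, following the PTX ISA definition of shf with operand order d, a, b, c (a is the low word, b the high word, c the shift amount). The function names are illustrative; the .wrap mode is not shown.

```cpp
#include <cstdint>

// shf.l.clamp.b32: shift the 64-bit value b:a left by min(c, 32), keep the high word.
constexpr uint32_t shf_l_clamp(uint32_t a, uint32_t b, uint32_t c) {
    const uint32_t n = c < 32u ? c : 32u;  // .clamp saturates instead of wrapping
    const uint64_t wide = (static_cast<uint64_t>(b) << 32) | a;
    return static_cast<uint32_t>((wide << n) >> 32);
}

// shf.r.clamp.b32: shift b:a right by min(c, 32), keep the low word.
constexpr uint32_t shf_r_clamp(uint32_t a, uint32_t b, uint32_t c) {
    const uint32_t n = c < 32u ? c : 32u;
    const uint64_t wide = (static_cast<uint64_t>(b) << 32) | a;
    return static_cast<uint32_t>(wide >> n);
}

// With a shift of 40 the clamp caps the amount at 32, so the result is the
// other input word in full rather than a shift by 40 mod 32 = 8.
static_assert(shf_l_clamp(0xFu, 0x1u, 40u) == 0xFu, "clamped left shift yields a");
static_assert(shf_r_clamp(0xFu, 0x1u, 40u) == 0x1u, "clamped right shift yields b");
```

Doing the shift in a 64-bit intermediate avoids the undefined 32-bit shift by 32 that a direct transcription of the two-term formula would hit.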

QUIRK: FSHL_CLAMP / FUN_SHFL_CLAMP are functional duplicates

The binary carries two complete spellings of the funnel-shift opcode: the modern FSHL_CLAMP/FSHR_CLAMP and the legacy FUN_SHFL_CLAMP/FUN_SHFR_CLAMP. The modern pair matches LLVM's target-independent ISD::FSHL/ISD::FSHR (introduced in LLVM 9) with NVPTX's clamping semantics tacked on. The legacy pair was the original NVPTX-only spelling from before LLVM had upstream funnel shifts. Both lower to the same shf.{l,r}.clamp.b32 PTX instruction. The legacy spelling is still referenced in pattern fragments that haven't been updated, so removing it would break a handful of TableGen patterns for the __funnelshift_l / __funnelshift_r intrinsics that bypass the standard ISD path.


Branch Index Table Family (3 opcodes)

Brx* opcodes implement brx.idx, NVPTX's indirect-branch-via-table instruction used by LLVM for jump-table-style switch lowering on SM30+.

| NVPTXISD name | Role |
|---|---|
| BrxStart | Marks the start of a branch index table (carries the table label). |
| BrxItem | One label entry inside the table. |
| BrxEnd | Marks the end of the table; the actual brx.idx instruction is emitted at this node. |

The three opcodes are always emitted as an inseparable triple by sub_32BE8D0 (conditional/select lowering, 54KB). ISel collapses the triple into a single brx.idx instruction plus a private .const table.


Vector Reshape Family (2 opcodes)

| NVPTXISD name | Role |
|---|---|
| BUILD_VECTOR | NVPTX-specific build-vector node (distinct from the generic ISD::BUILD_VECTOR). |
| UNPACK_VECTOR | Splits a packed register (e.g., b32 holding two halfs) into its scalar components. |

The NVPTX-specific BUILD_VECTOR exists because PTX has no native vector construction -- the lowering at sub_32E3060 produces this opcode after splatting or pairwise packing, and ISel matches it to either mov.b{32,64,128} for the splat case or to a sequence of cvt.pack + mov for heterogeneous values. See SelectionDAG: BUILD_VECTOR Lowering.


Misc / Pseudo Family (7 opcodes)

| NVPTXISD name | Role |
|---|---|
| Dummy | Placeholder used during lowering when a node needs an opcode but isn't yet decided. |
| Wrapper | Wraps a constant pool / global address reference; lowers to mov.u{32,64} of the symbol. |
| STACKSAVE | Saves current stack pointer (NVPTX local memory). |
| STACKRESTORE | Restores saved stack pointer. |
| DYNAMIC_STACKALLOC | alloca with non-constant size; emits mov.u64 %sp, ... sequence. |
| FCOPYSIGN | Floating-point copysign (copysign(a, b)); maps to copysign.{f32,f64} PTX. |
| ProxyReg | Cross-block register liveness anchor (listed in Call family). |

High-impact opcode: Wrapper lowering

Every global address reference passes through Wrapper:

// In NVPTXTargetLowering::LowerGlobalAddress (called from sub_331F6A0)
SDValue LowerGlobalAddress(SDValue Op, SelectionDAG &DAG) {
    GlobalAddressSDNode *GA = cast<GlobalAddressSDNode>(Op);
    SDValue Sym = DAG.getTargetGlobalAddress(GA->getGlobal(), dl,
                                             Op.getValueType(),
                                             GA->getOffset());
    return DAG.getNode(NVPTXISD::Wrapper, dl, Op.getValueType(), Sym);
}

ISel matches Wrapper and emits mov.u64 %rd_target, symbol_name; (or .u32 for AS-5 pointers on 32-bit targets).


Texture Family (174 opcodes)

The texture opcodes parameterize over five dimensions:

  1. Texture geometry: 1D, 1DArray, 2D, 2DArray, 3D, Cube, CubeArray.
  2. Sampler mode: regular vs. Unified. Unified samplers (TexUnified*) bind the texture and sampler into a single 64-bit handle; non-unified versions take separate texture and sampler operands. SM70+ uses Unified exclusively in user-level code.
  3. Result type: Float, S32 (signed integer), U32 (unsigned integer).
  4. Coordinate type: Float or S32.
  5. Sampling mode suffix: bare (regular sample), Grad (with explicit derivatives), Level (with explicit LOD).

This gives a combinatorial matrix. The 174 entries break down as:

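| Geometry | Non-unified | Unified | Total |
|---|---|---|---|
| 1D | 12 | 12 | 24 |
| 1DArray | 12 | 12 | 24 |
| 2D | 12 | 12 | 24 |
| 2DArray | 12 | 12 | 24 |
| 3D | 12 | 12 | 24 |
| Cube | 6 | 9 | 15 |
| CubeArray | 6 | 9 | 15 |
| Tld4* (gather) | 12 | 12 | 24 |
| Total | 84 | 90 | 174 |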

The 12-entry pattern per non-cube geometry is drawn from {Float,S32,U32} x {Float,S32} x {bare,Grad,Level} (18 combinations), minus the six that are unavailable. Cubemap geometry omits Grad for non-unified (PTX doesn't expose tex.grad.cube outside the unified path).
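The geometry table's totals can be reproduced with a few compile-time constants. The cube and Tld4 factors come from elsewhere on this page; the split of the 12-entry pattern into one S32-coordinate form plus three Float-coordinate forms per result type is an assumption borrowed from upstream LLVM's enumerator names, not something recovered from the binary. All constant names are illustrative.

```cpp
constexpr unsigned kResultTypes = 3;  // Float, S32, U32

// Assumption (upstream LLVM naming): per result type there is one
// S32-coordinate form (bare only) and three Float-coordinate forms
// (bare, Grad, Level).
constexpr unsigned kPerFlatGeometry = kResultTypes * (1 + 3);

constexpr unsigned kCubeNonUnified = kResultTypes * 2;  // bare, Level
constexpr unsigned kCubeUnified    = kResultTypes * 3;  // bare, Grad, Level
constexpr unsigned kTld4PerMode    = 4 * 3;  // {R,G,B,A} x {Float,S64,U64}

// Five non-cube geometries, two cube geometries, plus the Tld4 block.
constexpr unsigned kNonUnified =
    5 * kPerFlatGeometry + 2 * kCubeNonUnified + kTld4PerMode;
constexpr unsigned kUnified =
    5 * kPerFlatGeometry + 2 * kCubeUnified + kTld4PerMode;

static_assert(kPerFlatGeometry == 12, "12 opcodes per non-cube geometry");
static_assert(kNonUnified == 84 && kUnified == 90, "column totals");
static_assert(kNonUnified + kUnified == 174, "texture family size");
```

The only asymmetry between the two columns is the cube Grad variant: 2 geometries x 3 result types accounts for the six extra unified opcodes.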

Naming scheme

NVPTXISD::Tex<Unified?><Geometry><ResultTy><CoordTy><Sampling?>

Examples:

| Name | Decoded meaning |
|---|---|
| Tex1DFloatFloat | Non-unified 1D, float result, float coord, regular sample |
| TexUnified2DArrayU32FloatGrad | Unified 2D array, u32 result, float coord, with gradients |
| Tex3DS32FloatLevel | Non-unified 3D, s32 result, float coord, with explicit LOD |
| TexCubeFloatFloat | Non-unified cubemap, float result, float coord (no Grad available) |
| Tld4UnifiedR2DFloatFloat | tld4 gather, unified, R channel, 2D, float result/coord |

Tld4 sub-family

Tld4* opcodes implement PTX tld4 (gather-4) instructions that fetch four texels in a 2x2 footprint. The 24 Tld4* opcodes cover the matrix of {A,B,G,R} channels x {Float,S64,U64} result x {non-unified, Unified} for 2D textures only. Higher-D tld4 doesn't exist in the PTX ISA.

| Channel suffix | Returns |
|---|---|
| A | Alpha channel of the 2x2 footprint |
| B | Blue channel |
| G | Green channel |
| R | Red channel |

High-impact opcode: TexUnified2DFloatFloat lowering

The intrinsic lowering switch sub_33B0210 matches nvvm_tex_unified_2d_v4f32_f32 and emits:

// In NVPTXTargetLowering::lowerTexIntrinsic, called from sub_33B0210
SDValue Ops[] = {
    Chain,
    TexHandle,    // 64-bit texture/sampler handle
    CoordX,       // f32
    CoordY,       // f32
};
SDVTList VTs = DAG.getVTList(
    { MVT::f32, MVT::f32, MVT::f32, MVT::f32,  // 4 result components
      MVT::Other });                            // chain
SDValue Tex = DAG.getNode(
    NVPTXISD::TexUnified2DFloatFloat, dl, VTs, Ops);
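// Pack the 4 scalar results into a v4f32 via BUILD_VECTOR and return it
// together with the output chain (result 4), as the LoadV4 lowering does.
SDValue Vec = DAG.getBuildVector(MVT::v4f32, dl,
    { Tex.getValue(0), Tex.getValue(1),
      Tex.getValue(2), Tex.getValue(3) });
return DAG.getMergeValues({ Vec, Tex.getValue(4) }, dl);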

ISel matches TexUnified2DFloatFloat against a TableGen pattern that emits PTX tex.unified.2d.v4.f32.f32 {rd0, rd1, rd2, rd3}, [handle, {x, y}];.

QUIRK: Cubemap geometries skip Grad in the non-unified family but include it in the Unified family

TexCubeArrayU32Float and TexCubeArrayU32FloatLevel exist in the non-unified set (6 opcodes per cube geometry: {Float,S32,U32} x {bare,Level}), but there is no TexCubeArrayU32FloatGrad. The Unified family does include TexUnifiedCubeArrayU32FloatGrad (9 opcodes per cube geometry: {Float,S32,U32} x {bare,Grad,Level}). This asymmetry reflects an actual PTX ISA limitation: pre-SM30 cubemap fetches lacked a tex.grad.cube variant, so the non-unified opcodes (which model the pre-SM30 calling convention) cannot express it. The unified path was added at SM30 with the gradient variant included from day one.


Surface Family (198 opcodes)

The Suld* opcodes implement surface load (suld) instructions across the matrix of:

  1. Geometry: 1D, 1DArray, 1DBuffer, 2D, 2DArray, 3D. (No cubemap geometry: PTX suld does not define one.)
  2. Vector width: scalar (I*), V2 (2-element), V4 (4-element).
  3. Element type: I8, I16, I32, I64.
  4. Boundary mode: Clamp, Trap, Zero -- what to do when accessing out-of-bounds texels.

33 opcodes/geometry x 6 geometries = 198. Per-geometry the count is (4 element types x 3 boundary modes) x 3 widths - omissions. The actual per-geometry layout is:

| Width | Element types covered | Per-mode count | x3 boundary modes |
|---|---|---|---|
| Scalar (I*) | I8, I16, I32, I64 | 4 | 12 |
| V2 | I8, I16, I32, I64 | 4 | 12 |
| V4 | I8, I16, I32 (no I64) | 3 | 9 |
| Total per geometry | | | 33 |

Note that V4 omits I64 because suld.v4.b64 would be a 256-bit access (4 x 64 bits), which the PTX ISA doesn't support.
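The per-geometry and family totals follow directly from the layout table; a compile-time restatement, with illustrative constant names, looks like this:

```cpp
constexpr unsigned kBoundaryModes = 3;  // Clamp, Trap, Zero
constexpr unsigned kScalarTypes   = 4;  // I8, I16, I32, I64
constexpr unsigned kV2Types       = 4;  // I8, I16, I32, I64
constexpr unsigned kV4Types       = 3;  // I8, I16, I32 (no I64)
constexpr unsigned kGeometries    = 6;  // 1D, 1DArray, 1DBuffer, 2D, 2DArray, 3D

// Every width/type pair exists once per boundary mode.
constexpr unsigned kPerGeometry =
    kBoundaryModes * (kScalarTypes + kV2Types + kV4Types);
constexpr unsigned kSurfaceTotal = kGeometries * kPerGeometry;

// Size of the family if the boundary mode were an operand instead of
// being baked into the opcode name.
constexpr unsigned kIfModeWereOperand = kSurfaceTotal / kBoundaryModes;

static_assert(kPerGeometry == 33, "33 opcodes per geometry");
static_assert(kSurfaceTotal == 198, "surface family size");
static_assert(kIfModeWereOperand == 66, "one third without per-mode opcodes");
```

The check assumes all six geometries share the same 33-entry layout, which is what the 33 x 6 = 198 product implies.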

3D geometry omits V4I8 per-mode for the same reason in some SM tiers; check cicc_data_tables.json for the exact subtarget feature-gate.

Naming scheme

NVPTXISD::Suld<Geometry><Vec?><EltTy><Boundary>

Examples:

| Name | Decoded meaning |
|---|---|
| Suld1DI32Clamp | 1D surface, scalar i32, clamp on OOB |
| Suld2DArrayV4I8Trap | 2D array surface, 4xi8 vector, trap on OOB |
| Suld1DBufferV2I64Zero | 1D buffer surface, 2xi64 vector, return zero on OOB |

High-impact opcode: Suld1DI32Clamp lowering

// In NVPTXTargetLowering::lowerSurfaceLoadIntrinsic, sub_33B0210
SDValue Ops[] = {
    Chain,
    SurfHandle,    // 64-bit surface handle
    XCoord,        // s32 byte offset into the surface
};
SDVTList VTs = DAG.getVTList({ MVT::i32, MVT::Other });
return DAG.getNode(NVPTXISD::Suld1DI32Clamp, dl, VTs, Ops);

ISel emits PTX suld.b.1d.b32.clamp {rd}, [handle, {x}];.

QUIRK: Boundary mode is part of the opcode, not an operand

Every surface load opcode bakes the boundary mode (Clamp/Trap/Zero) into the opcode name rather than carrying it as a runtime operand. This is because PTX suld has three completely distinct mnemonic forms (suld.b.*.clamp, suld.b.*.trap, suld.b.*.zero) that the assembler treats as separate instructions. Carrying the mode as a constant operand and selecting in ISel would work, but the historical NVPTX design opted to enumerate them explicitly -- which is exactly why the surface family has 198 opcodes instead of 66. The same design choice is why CICC's cicc_strings.json shows the family blowing up: each enumerator gets its own diagnostic string for getTargetNodeName.


SDNode-Name Master Switch (sub_35F6D40)

The 460 enumerator names listed above are not stored as a table of string literals indexed by opcode number; CICC instead embeds them directly into a single 875 KB function -- sub_35F6D40 -- that walks the SDNode tree, dispatches on *a2 (the node's PTX opcode word), and writes the corresponding PTX mnemonic plus operand-suffix keywords into an output byte buffer (a4). This function is the de-facto asm-printer body for every NVPTX SDNode that survives instruction selection. It is the single largest function in the binary by switch density (6,634 explicit case labels across 24 nested switches, with the master at instruction address 0x36607ec) and it routes every PTX modifier keyword the assembler is allowed to emit -- mmarowcol, scaleD, cta_group, parity_op, multicast, cta, mem_order, scope, unified, ftz, sat, relu, and several dozen others.

Function Role

sub_35F6D40(a1, a2, a3, a4) is reached from exactly one caller, the 59-byte wrapper sub_36CC800. The wrapper's body is:

void sub_36CC800(int64_t TLI, unsigned *N, int64_t Ctx,
                 const uint8_t *Suffix, size_t SuffixLen,
                 int64_t _unused, OutBuf *Out) {
    sub_35F6D40(TLI, N, Ctx, Out);   // emit mnemonic + operand keywords
    sub_E826F0(TLI, Out, Suffix, SuffixLen);  // append the precomputed
                                              // type/space suffix bytes
}

The *a2 value read at the top of sub_35F6D40 is the SDNode opcode in the range 335..6968 -- exactly the range covered by the master switch at 0x36607ec (6968 - 335 + 1 = 6,634 values, one per case label). The 335 lower bound is conspicuous: it suggests the encoder rebases the public NVPTXISD::* enumerators by a fixed constant (possibly derived from ISD::FIRST_TARGET_MEMORY_OPCODE) so that the first NVPTX-specific opcode lands at 334 + 1 = 335 and target-independent ISD nodes hit the default arm before any NVPTX-specific case can match.
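The range and the case-label count quoted on this page agree with each other, which a short compile-time check makes explicit. The constants below restate the page's numbers; inMasterSwitchRange is a hypothetical helper that mirrors the "default arm outside the range" behavior described above.

```cpp
constexpr unsigned kFirstCase = 335;   // lowest opcode handled by the master switch
constexpr unsigned kLastCase  = 6968;  // highest opcode handled
constexpr unsigned kCaseCount = kLastCase - kFirstCase + 1;

// Opcodes outside the closed range fall through to the default arm.
constexpr bool inMasterSwitchRange(unsigned opcode) {
    return opcode >= kFirstCase && opcode <= kLastCase;
}

static_assert(kCaseCount == 6634, "dense range matches the 6,634 case labels");
static_assert(!inMasterSwitchRange(334), "334 is below the first handled opcode");
```

A dense range whose size equals the label count implies there are no gaps: every value from 335 through 6968 has its own case.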

The role is therefore: given an SDNode, write the printable PTX form of its opcode plus all modifier keywords selected by sub-fields of the node's flags word. Operands themselves are formatted by sub-callees (sub_35EE840, sub_35EFB80, sub_35F18E0, sub_35F2080, sub_35F2C30, sub_35F3330, sub_35F3E90) which read sub-fields of *a2 and the chain operands. The function is invoked once per SDNode by the SelectionDAG asm-printer walker (the equivalent of upstream LLVM's NVPTXAsmPrinter::EmitInstruction -> NVPTXInstPrinter::printInstruction chain, but inlined into one giant dispatch).

How It Is Called

The call shape is a classic visitor over the post-ISel DAG. The asm-printer walks the post-selection MachineInstr stream, and for every instruction whose getOpcode() falls in the NVPTX target range, it invokes sub_36CC800 with:

  • a1 -- pointer to the NVPTXTargetLowering instance (used for subtarget queries, e.g. SM tier, unified addressing flag).
  • a2 -- pointer to a 16-byte (or larger) flags packet whose first dword is the opcode and whose subsequent bits encode operand-modifier sub-fields. The v13 >> 17 shift visible early in the body extracts a 7-bit modifier index, which then drives the inner switches at decompilation lines 33113, 102155, 109917, and 110265.
  • a3 -- pointer to the formatting context (string table, output column, indent).
  • a4 (a7 in the wrapper's signature) -- pointer to the OutBuf struct ({begin, _, _, end, write_ptr}) into which raw bytes are appended via sub_CB6200 (overflow path) or direct stores.
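The sub-field extraction described for a2 can be sketched as a shift-and-mask. The shift of 17 and the 7-bit width are taken from the text; which word of the packet the decompiler's v13 corresponds to is not established here, so the functions below take the word as a parameter. The names extract_field and modifier_index are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract `width` bits starting at bit `shift` from a 32-bit word. */
uint32_t extract_field(uint32_t word, unsigned shift, unsigned width) {
    assert(width >= 1 && width <= 31 && shift + width <= 32);
    return (word >> shift) & ((1u << width) - 1u);
}

/* The 7-bit modifier index at bit 17, per the `v13 >> 17` observation.
   Assumption: the field occupies bits 17..23 of whichever word v13 holds. */
uint32_t modifier_index(uint32_t word) {
    return extract_field(word, 17, 7);
}
```

Bits below 17 and above 23 do not affect the result, which is what lets neighbouring sub-fields share the same word.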

Each case label writes a fixed byte sequence (the literal mnemonic prefix -- "tex.unified.2d.v4", "suld.b.2d.b32.clamp", "shf.l.clamp.b32", etc.) followed by one or more calls to the operand-keyword emitters listed below. The function is therefore the inverse of the getTargetNodeName switch in upstream LLVM: instead of returning a const char* for a debug print, it streams the mnemonic and every modifier keyword directly into the asm buffer.
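The byte-append step used by every case body can be sketched as below. The field order of OutBuf follows the {begin, _, _, end, write_ptr} layout given above; the two unnamed slots are kept as opaque pointers. What sub_CB6200 does on overflow is not documented here, so this sketch simply truncates as a stand-in. The name append_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout per the text: {begin, _, _, end, write_ptr}. */
typedef struct {
    char *begin;
    void *unknown0;   /* purpose not recovered */
    void *unknown1;   /* purpose not recovered */
    char *end;
    char *write_ptr;
} OutBuf;

/* Append up to `len` bytes; returns how many were actually written.
   Fast path: direct store when the bytes fit.
   Slow path: the real encoder calls sub_CB6200; this sketch truncates. */
size_t append_bytes(OutBuf *out, const char *src, size_t len) {
    assert(out && src && out->write_ptr <= out->end);
    size_t room = (size_t)(out->end - out->write_ptr);
    if (len > room)
        len = room;               /* stand-in for the overflow path */
    memcpy(out->write_ptr, src, len);
    out->write_ptr += len;
    return len;
}
```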

Operand-Keyword Emitter Helpers

50 distinct operand-keyword strings (inventoried below) are referenced from inside the master switch. Each is passed as the 5th argument to one of four helper functions, which means the keywords are literal C string constants baked into the encoder, not entries in a data table:

| Helper | Role | Sample keywords passed |
| --- | --- | --- |
| sub_35F2C30(TLI, N, idx, Out, kw) | Emits .<kw> field from a 3-bit sub-field at position idx | mmarowcol, opcode, abtype, rowcol, ab |
| sub_35F3330(TLI, N, idx, Out, kw) | Emits .<kw> from a 4-bit sub-field (used for the larger enums) | cta_group, parity_op, kind, shape, mem_order |
| sub_35F3E90(TLI, N, idx, Out, kw) | Emits scaling/precision keyword from a 6-bit sub-field | scaleD, scale, rnd, sat |
| sub_35F18E0(TLI, N, idx, Out, kw) | Emits boolean/flag keyword (presence-only, no value) | mode, unified, aligned, ftz, noftz, relu, multicast |
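A helper of this shape can be sketched as a field decode followed by a spelling lookup. Everything below is illustrative: the bit position, the row/col spelling table, and the names emit_field_keyword and kRowColSpellings are assumptions, and the real helpers take the keyword name and resolve the spelling internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical spellings for a row/col selector: field value -> text. */
const char *const kRowColSpellings[] = { "row", "col" };

/* Decode a `width`-bit field at `shift` and write ".<spelling>" into dst
   (NUL-terminated). Returns the number of characters written, or 0 when
   the field value has no spelling or the text does not fit. */
size_t emit_field_keyword(char *dst, size_t cap, uint32_t word,
                          unsigned shift, unsigned width,
                          const char *const *spellings, size_t count) {
    assert(dst && spellings);
    assert(width >= 1 && width <= 31 && shift + width <= 32);
    uint32_t value = (word >> shift) & ((1u << width) - 1u);
    if (value >= count)
        return 0;                          /* reserved encoding */
    size_t n = strlen(spellings[value]) + 1;   /* +1 for the leading '.' */
    if (n + 1 > cap)
        return 0;                          /* no room for text plus NUL */
    dst[0] = '.';
    memcpy(dst + 1, spellings[value], n - 1);
    dst[n] = '\0';
    return n;
}
```

The width parameter is what distinguishes the 3-bit, 4-bit, and 6-bit helpers in the table; the flag-only helper corresponds to a width of 1 with a single spelling.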

The complete keyword inventory recovered from the decompilation (sorted alphabetically): ab, abs, abtype, add, addsp, aligned, arrive, base, cop, cta, cta_group, desc, descsuf, dst, fmt, ftz, generic, group, kind, mc, mem_order, mmarowcol, multicast, nan, noftz, op, opcode, relu, rnd, rowcol, sat, satf, scale, scaleD, scope, sem_ordered, sem_unordered, shape, shared, sign, sink, space, src, ss, trans, type, unified, vec, vol, ws. (50 entries; HIGH confidence -- direct rg extraction of the 5th argument to the four emitter helpers across the entire 192K-line decompilation.)

Case Range Classification

Cross-correlating the explicit case labels (729 distinct case bodies at indent-8) against the 460 NVPTXISD opcode families documented above gives the following coarse range partition. The master switch covers values 335..6968, but only 261 unique targets exist -- 3,318 cases (50% of the 6,634-case dispatch table) fall through to the default arm at 0x437638, which corresponds to either a target-independent ISD node that should never reach this point or an opcode that was reserved but never wired up.

| Case value range | Family / role | Sample case → literal prefix |
| --- | --- | --- |
| 335 (0x14F) -- 346 (0x15A) | Load/store wide aggregates and ld.param.b8 block forms | 0x14F → ld.param.b8 ... block load with 20-byte payload |
| 347 (0x15B) -- 372 (0x174) | Texture / surface entry prologue (rarely-emitted variants) | 0x173, 0x174 → tex.<geom>.<vec> prefix |
| 380 (0x17C) -- 430 (0x1AE) | Call ABI prologue (CallSeqBegin, DeclareParam, DeclareScalarParam) | 0x1AE → .param .align ... |
| 432 (0x1B0) -- 510 (0x1FE) | Math/predicate combinators (BFE, BFI, PRMT, SETP_F16X2) | 0x1B2 → bfe.s32 ...; 0x1B5 → prmt.b32 ... |
| 877 (0x36D) -- 893 (0x37D) | Funnel-shift family (FSHL_CLAMP, FUN_SHFR_CLAMP legacy) | 0x37D → shf.l.clamp.b32 ... |
| 1289 (0x509) -- 1300 (0x514) | Store family (StoreV2/V4, StoreParam{S32,U32}) | 0x509 → st.global.v4.b32 ...; 0x511 → st.param.s32 ... |
| 1370 (0x55A) -- 1391 (0x56F) | Branch-index table (BrxStart, BrxItem, BrxEnd) and CALLSEQ_END | 0x55E → brx.idx ...; 0x563 → callseq_end glue marker |
| 1392 (0x570) -- 1394 (0x572) | RET_GLUE / RETURN / PrintCall* | 0x570 → ret;; 0x571 → call ...; 0x572 → call.uni ... |
| 1660 (0x67C) -- 1667 (0x683) | Atomic R-M-W variants (atom.{add,min,max,and,or,xor,cas}) | 0x680 → atom.global.add.u32 ... |
| 1756 (0x6DC) -- 1779 (0x6F3) | Surface load Suld* (clamp/trap/zero matrix, 1D + 1DArray) | 0x6DC → suld.b.1d.b8.clamp ...; 0x6E7 → suld.b.2d.v4.b32.trap ... |
| 1781 (0x6F5) -- 2415 (0x96F) | Texture sample family (Tex*, with optional Grad / Level suffix) | 0x6F6 → tex.unified.2d.v4.s32.f32 ... (from the binding table below) |
| 2416 (0x970) -- 3763 (0xEB3) | Texture-unified + Tld4* gather, plus the WMMA prep nodes | 0xEB0 → tld4.r.2d.v4.f32.f32 ... |
| 3764 (0xEB4) -- 4655 (0x122F) | WMMA mma operand printer (mmarowcol field-bearing cases -- the wmma.mma.sync/wmma.load/wmma.store family). High target density (case 0x122F has 108 cases routed to it via 0x36a0352). | 0xEB4 → wmma.load.a.sync.aligned.m16n16k16.row.f16 ... |
| 4656 (0x1230) -- 5028 (0x13A4) | Cooperative-group, cluster.*, barrier.cluster.*, bar.sync extensions | 0x1300 → barrier.cluster.arrive.aligned ... |
| 5029 (0x13A5) -- 5807 (0x16AF) | Tensor memory / TMA / cp.async.bulk family (the scaleD, desc, descsuf, kind, shape, mc, multicast keyword block); densely clustered case bodies | 0x14F4 → cp.async.bulk.tensor.2d.global.shared::cluster ... |
| 5808 (0x16B0) -- 5939 (0x1733) | mbarrier.* and fence.* -- 64-case + 64-case parallel clusters at 0x36adb22/0x36adb7e | 0x16B0 → mbarrier.arrive ...; 0x16E0 → fence.acq_rel.cta ... |
| 5940 (0x1734) -- 6440 (0x1928) | discard, prefetch, applypriority, red.async family | 0x1740 → prefetch.global.L2 ...; 0x1830 → red.async.shared::cluster.add.u32 ... |
| 6441 (0x1929) -- 6628 (0x19E4) | setmaxnreg.*, griddepcontrol.*, elect.sync, miscellaneous scheduling pseudos | 0x1930 → setmaxnreg.inc.sync.aligned.u32 ...; 0x19D0 → elect.sync ... |
| 6629 (0x19E5) -- 6967 (0x1B37) | Sparse / structured-MMA + Hopper-only modifiers (mma.sp, parity_op block) | 0x19E5 → mma.sp.sync.aligned.m16n8k32.row.col.f16.f16.f16.f16 ...; 0x1AE0 → mma.m16n8k16.row.col ... with parity_op |
| 6968 (0x1B38) and default | Fallthrough (3,318 of the 6,634 entries hit this; corresponds to opcodes the encoder reserves but never emits, or to target-independent ISDs that escaped LowerOperation) | -- |

(Boundaries are MED confidence -- they were determined by cross-referencing the case-value first/last pairs in the per-target grouping with the literal byte-buffer prefixes referenced inside each handler, the operand-keyword strings emitted, and the family layout established in earlier sections. The 261-vs-6,634 split is HIGH confidence; the exact opcode-to-keyword binding inside each handler is HIGH confidence for the keywords explicitly named in the table above and MED for everything else.)
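For tooling that needs to tag a case value with its family, the partition above can be encoded as a sorted range table and searched by bisection. The boundaries are copied from the table and carry the same MED confidence; the short family labels and the names kRanges, classify_case, and check_ranges_sorted are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { unsigned first, last; const char *family; } CaseRange;

/* Inclusive ranges, sorted and disjoint, taken from the partition table. */
const CaseRange kRanges[] = {
    {  335,  346, "load/store aggregates" },
    {  347,  372, "texture/surface prologue" },
    {  380,  430, "call ABI prologue" },
    {  432,  510, "math/predicate" },
    {  877,  893, "funnel shift" },
    { 1289, 1300, "store" },
    { 1370, 1391, "branch-index table" },
    { 1392, 1394, "return/call print" },
    { 1660, 1667, "atomic RMW" },
    { 1756, 1779, "surface load" },
    { 1781, 2415, "texture sample" },
    { 2416, 3763, "texture-unified/tld4" },
    { 3764, 4655, "WMMA" },
    { 4656, 5028, "cluster/barrier" },
    { 5029, 5807, "TMA/bulk copy" },
    { 5808, 5939, "mbarrier/fence" },
    { 5940, 6440, "prefetch/red.async" },
    { 6441, 6628, "scheduling" },
    { 6629, 6967, "sparse MMA" },
};
const size_t kRangeCount = sizeof kRanges / sizeof kRanges[0];

/* Bisection is only valid if the table is sorted and non-overlapping. */
void check_ranges_sorted(void) {
    for (size_t i = 0; i < kRangeCount; ++i) {
        assert(kRanges[i].first <= kRanges[i].last);
        if (i > 0)
            assert(kRanges[i - 1].last < kRanges[i].first);
    }
}

/* Returns the family label, or "unassigned" for values in a gap. */
const char *classify_case(unsigned opcode) {
    size_t lo = 0, hi = kRangeCount;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (opcode < kRanges[mid].first)
            hi = mid;
        else if (opcode > kRanges[mid].last)
            lo = mid + 1;
        else
            return kRanges[mid].family;
    }
    return "unassigned";
}
```

Values such as 373..379 or 511..876 fall between rows of the table and come back as "unassigned", which mirrors the sparse regions of the switch.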

Dispatcher Pattern (C Pseudocode)

The master switch can be modeled as the following pattern, repeated 261 times with different mnemonic prefixes and different keyword sets:

void emitNVPTXSDNode(TLI_t *TLI, unsigned *N, FormatCtx *Ctx, OutBuf *Out) {
    unsigned opcode = *N;             // 32-bit opcode at N[0]
    uint64_t flags  = ((uint64_t*)N)[1];  // packed modifier sub-fields

    switch (opcode) {
    // ----- representative case: WMMA load with row/col modifier -----
    case 0xEB4: {                     // NVPTXISD::WMMA_LOAD_A_SYNC_M16N16K16_ROW_F16
        // Step 1: append the fixed mnemonic prefix
        appendBytes(Out, "wmma.load.a.sync.aligned.m16n16k16.", 35);
        // Step 2: emit the .row|.col modifier driven by a 3-bit field
        sub_35F2C30(TLI, N, /*field=*/3, Out, "mmarowcol");
        // Step 3: append the type suffix
        appendBytes(Out, ".f16 ", 5);   // trailing space separates mnemonic from operands
        // Step 4: emit the destination/source operand list
        sub_35EE840(TLI, N, /*opIdx=*/0, Out, 0, 0);
        appendBytes(Out, ", [", 3);
        sub_35EE840(TLI, N, /*opIdx=*/1, Out, 0, 0);
        appendBytes(Out, "];", 2);
        goto LABEL_emitted;
    }

    // ----- representative case: surface load with boundary mode -----
    case 0x6DC: {                     // NVPTXISD::Suld1DI8Clamp
        appendBytes(Out, "suld.b.1d.b8.clamp ", 19);
        sub_35EE840(TLI, N, 0, Out, 0, 0);   // dest
        appendBytes(Out, ", [", 3);
        sub_35EE840(TLI, N, 1, Out, 0, 0);   // handle
        appendBytes(Out, ", {", 3);
        sub_35EE840(TLI, N, 2, Out, 0, 0);   // x coord
        appendBytes(Out, "}];", 3);
        goto LABEL_emitted;
    }

    // ----- representative case: TMA bulk-copy with rich keyword set -----
    case 0x14F4: {                    // NVPTXISD::CpAsyncBulkTensor2DGlobalShared
        appendBytes(Out, "cp.async.bulk.tensor.", 21);
        sub_35F3330(TLI, N, 0, Out, "kind");     // .2d|.3d|.4d|.5d
        appendBytes(Out, ".", 1);
        sub_35F3330(TLI, N, 1, Out, "space");    // global|shared::cluster
        // Optional .multicast::cluster
        if ((flags >> 17) & 1) sub_35F18E0(TLI, N, 0, Out, "multicast");
        // Optional .cta_group::1|2
        if ((flags >> 18) & 3) sub_35F3330(TLI, N, 2, Out, "cta_group");
        // Operand list: descriptor, dst, src, ...
        appendBytes(Out, " ", 1);
        sub_35EE840(TLI, N, 0, Out, 0, 0);
        appendBytes(Out, ", ", 2);
        sub_35EE840(TLI, N, 1, Out, 0, 0);
        appendBytes(Out, ", ", 2);
        sub_35EE840(TLI, N, 2, Out, 0, 0);
        appendBytes(Out, ";", 1);
        goto LABEL_emitted;
    }

    // ... 258 more case bodies following the same template ...

    default:
        // 3,318 case values fall here -- target-independent ISD nodes
        // or reserved-but-unused enumerators. emitNVPTXSDNode is a no-op
        // for these; the upstream printer handles them via the standard
        // MachineInstr opcode tables.
        goto LABEL_default_0x437638;
    }

LABEL_emitted:
    // Common post-amble: write trailing ';' if not already emitted.
    return;

LABEL_default_0x437638:
    // Falls back to the generic MI printer at the call site.
    return;
}

Why a Single Monolithic Switch?

QUIRK -- Monolithic 6,634-case switch instead of per-family vtables

Upstream LLVM splits asm printing in two: the AsmPrinter lowers each MachineInstr to an MCInst, and the TableGen-generated printInstruction (included from the target's *InstPrinter.cpp file) prints it from per-opcode tables. NVIDIA's encoder collapses everything -- mnemonic, modifier keywords, operand formatting hooks -- into one function with a single dense switch over the SDNode opcode space. The likely reason is that PTX modifiers are positional and order-sensitive (tex.unified.2d.v4.f32.f32 is one instruction; tex.2d.unified.v4.f32.f32 is a syntax error), so a generic "emit modifier list" loop would still have to carry the ordering per opcode. The monolithic switch also reduces dispatch to a single indirect branch per SDNode opcode: the CPU jumps once through the jump table and lands directly in the right body, avoiding the two-level dispatch that a TableGen-generated printer would incur.

QUIRK -- 50% of cases are sparse-region fallthroughs

The switch covers 335..6968 inclusive (6,634 entries) but only 261 unique handlers exist. Of those entries, 3,318 jump to the same default arm at 0x437638; the other 3,316 share the remaining handlers many-to-one. This is what a compiler emits when an enum is defined with non-contiguous values (e.g., NVPTXISD::FooThing = 0x14F, NVPTXISD::BarThing = 0x6DC with no enumerators in between): the jump table gets a slot for every value in the gap, each pointing at the default target, and the optimizer prefers a dense jump table over a compare tree because the index space is bounded by max - min. The cost is 6,634 entries × 8 bytes = 53,072 bytes (about 53 KB) of jump table, half of it filled with default-arm slots, in exchange for single-jump dispatch. Converting this to a per-family vtable would require renumbering the enum to be contiguous, which would break every binary-compatible pattern that hardcodes the numeric opcode values (including getTargetNodeName in upstream LLVM).
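The table-size arithmetic can be reproduced directly. The inputs (bounds 335 and 6968, 3,318 default entries, 8-byte slots) are the figures quoted in the text; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of jump-table slots for an inclusive case range. */
size_t table_entries(unsigned min_case, unsigned max_case) {
    assert(max_case >= min_case);
    return (size_t)(max_case - min_case) + 1;
}

/* Total jump-table footprint in bytes. */
size_t table_bytes(size_t entries, size_t entry_size) {
    return entries * entry_size;
}

/* Share of slots that point at the default arm, rounded to a whole percent. */
unsigned default_percent(size_t default_slots, size_t entries) {
    assert(entries > 0 && default_slots <= entries);
    return (unsigned)((default_slots * 100 + entries / 2) / entries);
}
```

With the quoted figures this gives 6,634 slots, 53,072 bytes, and a 50% default share, leaving 3,316 slots that point at real handlers.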

QUIRK -- Default arm is not an error path

A default arm in a dispatcher of this size would normally be an llvm_unreachable or an assertion call. Here it is a silent return (0x437638 is a 2-instruction epilogue, not a diagnostic emitter). This is deliberate: the asm printer walks every MachineInstr in the program, including target-independent ones that LLVM's generic printer handles. When sub_35F6D40 is invoked on such a node, it simply does nothing and returns, leaving the actual printing to the upstream MachineInstr::print path. The pattern is "if I recognize the opcode, emit it; otherwise fall back to whoever else is on the printing chain", which is closer to a visitor than a true switch -- but it's spelled as a switch for the dispatch-density reason explained above.
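The recognize-or-fall-back pattern can be sketched as below. Note one deliberate difference from the binary: sub_35F6D40 returns nothing, whereas this sketch returns a bool so the fallback is visible in code. The two case values and mnemonics are taken from the sample binding table (MED confidence); try_emit_target, print_one, and the "<generic>" placeholder are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Writes the mnemonic for a recognized opcode and returns true.
   Unrecognized opcodes return false silently: no assert, no diagnostic. */
bool try_emit_target(unsigned opcode, char *dst, size_t cap) {
    const char *text;
    switch (opcode) {
    case 0x570:  text = "ret;";       break;
    case 0x19D0: text = "elect.sync"; break;
    default:     return false;        /* leave dst untouched */
    }
    assert(dst && strlen(text) < cap);
    strcpy(dst, text);
    return true;
}

/* Caller side: try the target encoder first, then the generic path. */
void print_one(unsigned opcode, char *dst, size_t cap) {
    if (!try_emit_target(opcode, dst, cap)) {
        assert(dst && cap > strlen("<generic>"));
        strcpy(dst, "<generic>");     /* placeholder for the generic printer */
    }
}
```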

Sample 33-Row Case → Keyword Binding Table

| Case (hex) | Case (dec) | Mnemonic prefix written | Modifier keywords emitted | Family |
| --- | --- | --- | --- | --- |
| 0x14F | 335 | ld.param.b8 | space | Load |
| 0x150 | 336 | ld.param.v4.b32 | space, vec | Load |
| 0x152 | 338 | st.param.v2.b32 | space, vec | Store |
| 0x173 | 371 | tex.1d.v4.f32.f32 | unified, vec | Texture |
| 0x1AE | 430 | .param .align <a> .b8 _param_<n>[<sz>]; | (literal directive) | Call ABI |
| 0x37D | 893 | shf.l.clamp.b32 | op | Funnel shift |
| 0x509 | 1289 | st.global.v4.b32 | space, vec | Store |
| 0x511 | 1297 | st.param.s32 | sign, type | Store |
| 0x55E | 1374 | brx.idx | (none) | Branch table |
| 0x570 | 1392 | ret; | (none) | Call ABI |
| 0x571 | 1393 | call | op (callee handle) | Call ABI |
| 0x572 | 1394 | call.uni | op | Call ABI |
| 0x680 | 1664 | atom.global.add.u32 | space, op, type | Atomic |
| 0x6DC | 1756 | suld.b.1d.b8.clamp | space, type | Surface |
| 0x6F3 | 1779 | suld.b.2d.v4.b32.zero | space, vec, type | Surface |
| 0x6F6 | 1782 | tex.unified.2d.v4.s32.f32 | unified, vec, type | Texture |
| 0x901 | 2305 | tex.unified.2d.v4.s32.f32.grad | unified, vec, type | Texture |
| 0xEB0 | 3760 | tld4.r.2d.v4.f32.f32 | unified, vec | Tld4 gather |
| 0xEB4 | 3764 | wmma.load.a.sync.aligned.m16n16k16 | mmarowcol, abtype | WMMA |
| 0xF40 | 3904 | wmma.mma.sync.aligned.m16n16k16 | mmarowcol, abtype, rowcol | WMMA |
| 0x122F | 4655 | wmma.store.d.sync.aligned.m16n16k16 | mmarowcol, abtype | WMMA |
| 0x1300 | 4864 | barrier.cluster.arrive.aligned | aligned | Cluster |
| 0x14F4 | 5364 | cp.async.bulk.tensor.2d.global.shared::cluster | kind, space, multicast, cta_group | TMA |
| 0x1500 | 5376 | cp.async.bulk.tensor.3d.shared::cluster.global | kind, space, mc | TMA |
| 0x16B0 | 5808 | mbarrier.arrive.shared::cta | arrive, scope, space | mbarrier |
| 0x16E0 | 5856 | fence.acq_rel.cta | sem_ordered, scope | Fence |
| 0x1740 | 5952 | prefetch.global.L2 | space, cop | Prefetch |
| 0x1830 | 6192 | red.async.shared::cluster.add.u32 | space, op, type, scope | Async reduce |
| 0x1930 | 6448 | setmaxnreg.inc.sync.aligned.u32 | aligned, op, type | Scheduling |
| 0x19D0 | 6608 | elect.sync | (none) | Scheduling |
| 0x19E5 | 6629 | mma.sp.sync.aligned.m16n8k32.row.col.f16.f16.f16.f16 | mmarowcol, parity_op, abtype | Sparse MMA |
| 0x1AE0 | 6880 | mma.m16n8k16.row.col.f16.f16 | mmarowcol, parity_op | MMA |
| 0x1B14 | 6932 | griddepcontrol.wait | (none) | Sync |

(Case values are HIGH confidence -- they are explicit case <value>uLL: labels in the decompilation. Mnemonic prefixes are MED confidence -- they are reconstructed from the literal byte buffers referenced by each case body (byte_4CE..., byte_4CF..., xmmword_4CE...) and their lengths (0x12, 0x13, 0x1B, etc.); the prefix string itself was not directly extracted in this pass. Keyword bindings are HIGH confidence for keywords named in the sub_35F2C30/sub_35F3330/sub_35F3E90/sub_35F18E0 call sites inside the corresponding case body.)

Implications for Assembly Printing

The sub_36CC800 wrapper appends a fixed SuffixLen-byte trailer after sub_35F6D40 returns. This trailer is the type-and-space suffix (".f32", ".shared::cta", ".acquire.gpu") that depends on operand types rather than the opcode itself. Splitting the work this way means:

  1. The opcode-driven mnemonic and modifier keywords are emitted by the monolithic switch (constant per opcode value).
  2. The operand-driven type suffix is appended by sub_E826F0 (constant per ISD value-type bundle).
  3. The two halves are concatenated in-place in the same OutBuf, producing a single PTX instruction string per SDNode.

This is why the master switch never directly emits type suffixes: every case body ends with a goto LABEL_3441; / LABEL_3442; / LABEL_emitted; that returns control to the wrapper, which then appends the type bytes. The pattern is mnemonic+modifiers in switch, types after switch, which keeps the switch body bounded and makes type-driven opcode-overload printing (e.g., the 11 widths of st.param) a single case body with a runtime suffix instead of 11 case bodies.
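The two-stage split can be sketched as a wrapper that runs the opcode-driven stage and then appends a caller-supplied suffix into the same buffer. This is an illustration only: OP_ST_PARAM is a made-up opcode value, and Line, line_append, emit_opcode_part, and emit_with_suffix are hypothetical names standing in for OutBuf, the byte-append path, sub_35F6D40, and sub_36CC800 respectively.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OP_ST_PARAM = 1 };   /* hypothetical opcode, not a recovered value */

typedef struct { char data[128]; size_t len; } Line;

/* Append n bytes and keep the buffer NUL-terminated. */
void line_append(Line *l, const char *src, size_t n) {
    assert(l && src && l->len + n < sizeof l->data);
    memcpy(l->data + l->len, src, n);
    l->len += n;
    l->data[l->len] = '\0';
}

/* Stage 1: opcode-driven part, constant per opcode value. */
void emit_opcode_part(Line *l, unsigned opcode) {
    switch (opcode) {
    case OP_ST_PARAM: line_append(l, "st.param", 8); break;
    default: break;          /* unrecognized: emit nothing */
    }
}

/* Wrapper: stage 1, then stage 2 appends the precomputed suffix bytes. */
void emit_with_suffix(Line *l, unsigned opcode,
                      const char *suffix, size_t suffix_len) {
    emit_opcode_part(l, opcode);
    line_append(l, suffix, suffix_len);
}
```

One case body serves every width: the same opcode with ".b32" or ".v2.f32" as the suffix produces two different mnemonics without a second case label.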

Confidence Tags for This Section

  • HIGH confidence: master switch location (0x36607ec), case count (6,634), unique-target count (261), case value range (335..6968), default target (0x437638), caller identity (sub_36CC800), the four emitter helper functions and their roles, and the 50-keyword inventory.
  • HIGH confidence: the case → keyword binding for every keyword named in the table above, since each was extracted by direct rg match on the 5th argument of the emitter call inside the corresponding case body.
  • MED confidence: the boundaries of the case-range partition table. The boundaries were inferred by correlating case-value clustering (group-by-target) with mnemonic prefix lengths and the keyword set used inside each cluster; they may shift by a few opcodes either way as more case bodies are decoded.
  • MED confidence: the reconstructed mnemonic strings ("tex.unified.2d.v4.f32.f32", "wmma.mma.sync.aligned.m16n16k16", etc.). The byte buffers exist (byte_4CE6005, byte_4CF3B30, etc.) and the lengths match the expected PTX mnemonic widths, but the actual character content of those buffers was not dumped here.
  • LOW confidence: the exact bit position of the modifier sub-fields inside the 64-bit flags word at ((uint64_t*)N)[1]. The v13 >> 17 shift is observed, but the higher-order sub-fields ((flags >> 18) & 3, (flags >> 25) & 0x1F, etc.) are pattern-inferred from the operand-keyword index arguments passed to the emitter helpers.

Open Follow-Ups Specific to sub_35F6D40

  • Dump the literal byte buffers. Resolving every byte_4CE.../byte_4CF... reference to its actual UTF-8 content would give the full mnemonic string per case and let us produce a 261-row case → mnemonic table instead of the 33-row sample above. This requires walking cicc_data_tables.json and joining on the xmmword_* / dword_* / byte_* symbol names.
  • Document the inner sub-switches. The function contains three additional large switches at lines 33113, 102155, and 109917 (112, 639, 188 cases respectively) that handle modifier-only dispatch -- e.g., which .<scope> to emit when ((flags >> 17) & 0x7F) == k. These are currently lumped into the helpers but their exact value tables should be cross-referenced against the PTX ISA modifier grammar.
  • Map the missing 199 opcodes. The string table has 460 NVPTXISD names but the switch has only 261 unique handlers. Either some names alias to the same handler (multiple case values jumping to one target -- which we have already observed for the 3,318 default-fallthroughs), or some 199 names are produced upstream but never reach the asm printer (e.g., they are folded by ISel before printing). Confirming which is happening would close the loop on the family-count totals.
  • Confirm the offset constant. The lower bound of 335 strongly suggests a base of 334 tied to ISD::FIRST_TARGET_MEMORY_OPCODE or an equivalent constant; pinning down the exact value would tell us whether NVPTX uses the public ISD::* range below 335 or a private rebasing.

Cross-References

Confidence Notes

  • HIGH confidence: family membership and total count (460). All 460 names are direct extracts from cicc_strings.json.
  • HIGH confidence: producer/consumer function assignments. Cross-checked against cicc_functions.json and the LowerOperation cluster documented in SelectionDAG.
  • MED confidence: numeric opcode values quoted in inline tables (CallSeqBegin = 315, DeclareParam = 505, etc.). These come from sub_35F6D40 (master 6,634-case switch) and the sub_32E3060 LowerOperation dispatch where they were observed as immediate constants in getNode call sites. The mapping to enumerator names is by behavioral signature (operand counts, chain/glue presence) and may shift by ±1 if the enum had silent insertions between LLVM versions.
  • MED confidence: per-family counts in the breakdown table. Verified by jq regex matching but the boundary between "math" and "misc" is somewhat arbitrary; PRMT could go either way.
  • LOW confidence: the explanation of LoadExtVer2* as SM90+ cache-hint variants. The "Ver2" suffix is a guess based on the parallel triplet structure; the actual semantic difference between LoadExt and LoadExtVer2 could not be confirmed from the function-level analysis. The "Ver2" forms might be a deprecated path retained for backward-compatibility patterns rather than a forward-looking SM90+ feature.

Open Follow-Ups

  • Numeric opcode values for the full 460 set. Only the 14 call-ABI opcodes have confirmed numeric assignments. Recovering the rest requires walking sub_35F6D40 case labels and correlating them with the string emitted by getTargetNodeName -- a several-hour reverse-engineering task.
  • Subtarget feature gating per opcode. Some opcodes are only legal on certain SM tiers (LoadV8/StoreV8 on SM90+, DP4A on SM61+, SETP_BF16X2 on SM80+). The exact gating lives in the action table at NVPTXTargetLowering + 2422 -- documenting which opcodes are gated where would complete the picture but needs the action table to be fully decoded.
  • Resolution of the LoadExt vs LoadExtVer2 duality. Either rename LoadExtVer2* to a more descriptive name or document the precise semantic difference. This likely needs a side-by-side test compilation against upstream LLVM NVPTX to see which intrinsic emits which form.
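  • Tld4 opcodes for SM90+ texture gather. The premise needs checking against the spec first: in the published tld4 syntax, .r/.g/.b/.a are component selectors (.a is the alpha component), so a tld4.a.2d string is not by itself evidence of a new gather mode. If PTX ISA 8.0 does add further tld4 variants and they are present in the binary, they would extend the Tld4* family beyond the current 24 entries. Worth re-scanning cicc_strings.json against the PTX 8.0 spec.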
  • Cross-reference into individual pass pages. Each pass that lowers a subset of these opcodes (e.g., mma-codegen.md for tensor opcodes, isel-patterns.md for the matcher) should grow a link back to its slice of this catalog. Currently this page links outward but the back-links are missing.