NVPTX Target Lowering - Calls and Arguments

Abstract

The NVPTX SelectionDAG target-lowering layer is the bridge between ordinary LLVM function semantics and the PTX .param ABI. It converts LLVM IR calls, formal arguments, return values, custom loads, and atomic operations into NVPTX-specific DAG nodes.

The contract is param-space discipline. Call arguments and returns never travel as ordinary memory traffic. Each one gets a generated param symbol, breaks into ABI-legal value parts, threads through explicit NVPTX call envelope nodes (DeclareRetParam, ParamCallStart, ParamCallEnd), and reassembles after the call. Kernel by-value grid-constant arguments take a fast path that preserves their param-address-space identity through legalization. Custom DAG opcodes for vector memory, vector atomics, and scalar floating remainders share one dispatcher so unhandled cases fall back cleanly to LLVM's generic legalizer.

Responsibilities

This lowering family does four jobs:

Area	Responsibility
Formal arguments	Convert incoming LLVM function parameters into `LOAD_PARAM`, `MoveParam`, or proxy nodes.
Calls	Build the `DeclareRetParam`, `ParamCallStart`, argument materialization, callee target, result extraction, and `ParamCallEnd` sequence.
Custom operations	Handle target-marked custom opcodes such as vector loads, vector atomics, and NVPTX-specific splats.
Atomics	Lower scalar and vector atomic-RMW families into NVPTX atomic DAG nodes with explicit chain bundling.

Everything downstream assumes this layer has already made ABI details explicit. Botch param naming, byval handling, or chain construction and the emitted PTX still prints — but it will not match tileiras behavior.

NVPTXTargetLowering Vtable Bank

The NVPTXTargetLowering instance carries a 21-slot LLVM TargetLowering vtable in .data.rel.ro. Most slots inherit from the abstract base class. The table below shows six slots with NVPTX-specific bodies (3, 4, 5, 7, 8, and 9), four of which carry the codegen-shaping methods this page documents. The vtable bank sits at a fixed .data.rel.ro address that this report references as &vt_NVPTXTargetLowering. The exact offset shifts across builds, but slot order is stable because LLVM publishes a versioned TargetLowering ABI.

Slot	Method	Identity in this build	Role
0	typeinfo helper	RTTI pointer	Standard Itanium-ABI `_ZTI...` slot.
1	dtor (delete)	inherited	Virtual destructor, deletes through the base pointer.
2	dtor (no delete)	inherited	Virtual destructor variant that leaves storage alone.
3	`LowerOperation`	`sub_1A7C310` via shim `sub_1A7FB60`	79-case DAG dispatch for `BUILD_VECTOR` remap, vector LOAD, scalar floating-remainder fallback, and atomic families.
4	`LowerFormalArguments`	`sub_1A77460`	Walks the IR argument list, builds `_param_<N>` symbols, and emits `LOAD_PARAM`, `MoveParam`, or `ProxyReg` per part.
5	`LowerCall`	`sub_1A72EF0`	Builds the `DeclareRetParam`, `ParamCallStart`, argument materialization, callee target, result extraction, and `ParamCallEnd` envelope.
6	`LowerReturn`	inherited hook stub	Lowers `ret` into `RET_FLAG`-class nodes; this build leaves the slot pointing at the LLVM default because return-value marshaling already happened upstream through `StoreRetval` custom nodes. See the Lowering Returns section below.
7	`ReplaceNodeResults`	`sub_1A7C310` (shared body)	Post-legalization hook for v8 / v16 splits; reuses the `LowerOperation` body with a different return path.
8	`getTargetNodeName`	NVPTX numeric-opcode table	Translates `NVPTXISD::` opcodes (such as `0x1FD` `ParamCallStart`, `0x1FE` `ParamCallEnd`, `0x13D` `DeclareRetParam`) into display names for `-debug` dumps.
9	`useSoftFloat`	constant `return false`	NVPTX always lowers floating point through hardware DAG nodes; no soft-float runtime.
10-20	inherited from `TargetLowering` base	base class methods	Type-promotion hooks, register class hooks, shift-amount type, atomic legality, and other defaults the NVPTX backend does not override.

The four overrides this page details are slots 3, 4, 5, and 7. Slot 5 dominates the bank's complexity at ~16.6 KB of code; slot 3 is ~14.0 KB; slot 4 is ~6.4 KB. Remaining slots are thin policy returns or short dispatch wrappers. The dispatch shim sub_1A7FB60 deserves a separate note: it is the public-facing LowerOperation trampoline that the LegalizeOp driver enters, and it tail-calls into sub_1A7C310. That keeps the vtable slot pointer stable across rebuilds even when the body moves.

The numeric-opcode table backing slot 8 carries the names this page references repeatedly: ParamCallStart at 0x1FD, ParamCallEnd at 0x1FE, ParamCallStartScalar at 0x1FF, PrintCallVector at 0x200, CallNonRegPrototype at 0x201, CallNonReg at 0x202, CallSeqBegin at 0x203, CallArg at 0x204. The declare nodes DeclareRetParam (0x13D) and DeclareScalarParam (0x13E) sit below the call range. Ship the same opcode numbers and -debug parity comes for free, because downstream diagnostic tooling reads the names through this slot.
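As a rough illustration of what a slot-8 style lookup does, the sketch below maps the opcode numbers listed above to their display names. The table layout and the function name `target_node_name` are hypothetical; only the opcode/name pairs come from this page.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table entry; the real table layout is not documented here.
struct OpcodeName {
    std::uint32_t opcode;
    const char *name;
};

// Opcode/name pairs as listed in the paragraph above.
static const OpcodeName kCallOpcodes[] = {
    {0x13D, "DeclareRetParam"},      {0x13E, "DeclareScalarParam"},
    {0x1FD, "ParamCallStart"},       {0x1FE, "ParamCallEnd"},
    {0x1FF, "ParamCallStartScalar"}, {0x200, "PrintCallVector"},
    {0x201, "CallNonRegPrototype"},  {0x202, "CallNonReg"},
    {0x203, "CallSeqBegin"},         {0x204, "CallArg"},
};

// Returns the display name, or an empty string for opcodes outside the table.
std::string target_node_name(std::uint32_t opcode) {
    for (const OpcodeName &entry : kCallOpcodes)
        if (entry.opcode == opcode)
            return entry.name;
    return std::string();
}
```

A linear scan is enough for a table this small; a production lookup would more likely be a switch or a sorted array.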

Param Symbol Naming

Tileiras names generated param symbols deterministically:

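#include <string>

// Returns an owning string: a StringRef into a temporary would dangle.
std::string make_param_symbol(unsigned arg_index, bool is_vararg) {
    if (is_vararg)
        return "_vararg";                          // variadic tail shares one symbol
    return "_param_" + std::to_string(arg_index);  // fixed arguments are numbered by position
}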

The same namer is used by formal-argument lowering and call lowering, so caller and callee agree on declarations such as:

.param .align A .b8 <function>_param_<N>[SIZE]

This is a behavioral contract, not just a printing convention. Later param loads, stores, and call-sequence nodes refer to these names.
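To make the declaration shape concrete, here is a minimal sketch that renders the `.param` line from its four inputs. The helper name `param_declaration` and its parameters are hypothetical; the output format follows the template shown above.

```cpp
#include <string>

// Hypothetical helper: renders the byte-array .param declaration for one
// argument. `align` and `size_bytes` correspond to A and SIZE in the template.
std::string param_declaration(const std::string &function, unsigned index,
                              unsigned align, unsigned size_bytes) {
    return ".param .align " + std::to_string(align) + " .b8 " + function +
           "_param_" + std::to_string(index) + "[" +
           std::to_string(size_bytes) + "]";
}
```

For example, `param_declaration("foo", 2, 8, 24)` yields `.param .align 8 .b8 foo_param_2[24]`. Because both lowering paths derive the symbol from the same inputs, caller and callee declarations match without any shared state.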

Lowering Formal Arguments

Formal-argument lowering walks the LLVM function's argument list, computes the legal register parts for each argument type, and creates the param symbol. By-value arguments then take one of the two shapes below; every other argument becomes a plain LOAD_PARAM:

Non-kernel by-value arguments are represented with MoveParam or proxy nodes because their value must be copied through ordinary param space.
Kernel grid-constant by-value arguments can load directly from the param address space, preserving the special kernel argument representation.

void lower_formal_arguments(Function *fn, MachineFunctionInfo *mfi, DAG *dag) {
    for (unsigned i = 0; i < fn->arg_count; ++i) {
        Argument *arg = fn->args[i];
        ValueTypeParts parts = compute_value_parts(arg->type, fn->calling_conv);
        StringRef name = intern_param_name(mfi, make_param_symbol(i, false));

        if (arg->has_byval && !is_kernel(fn)) {
            SDValue moved = emit_move_param(dag, name, parts, arg->type);
            bind_argument_value(arg, moved);
            continue;
        }

        if (is_kernel_grid_constant_byval(arg, fn)) {
            SDValue loaded = emit_load_param_addrspace_101(dag, name, parts);
            mark_preserve_param_address_space(loaded);
            bind_argument_value(arg, loaded);
            continue;
        }

        SDValue loaded = emit_load_param(dag, name, parts);
        bind_argument_value(arg, loaded);
    }
}

The address-space preservation flag is essential for grid-constant byval arguments. Drop it and later DAG combines promote the value back to a generic pointer, erasing the fact that the source was kernel param memory.

Lowering Calls

Call lowering builds an explicit NVPTX call envelope. The path starts by reserving a per-call unique ID, then folds that ID into generated param symbols so multiple calls in the same function cannot collide.

The call lowering path has six logical stages:

1. Create the entry chain and return-param declaration.
2. Classify outgoing arguments.
3. Resolve the callee target.
4. Materialize and store each outgoing argument.
5. Extract return values from call-result nodes.
6. Close the call sequence.

SDValue lower_call(CallLoweringInfo *cli, DAG *dag, MachineFunctionInfo *mfi) {
    unsigned call_id = ++mfi->next_call_id;
    Chain chain = dag_entry_token(dag);

    ReturnParam ret = declare_return_param(dag, cli->result_types, call_id);
    chain = emit_param_call_start(dag, chain, call_id);

    CalleeTarget target = resolve_callee(cli);
    for (unsigned i = 0; i < cli->out_count; ++i) {
        OutArg *arg = &cli->outs[i];
        ParamSymbol sym = make_call_param_symbol(cli, call_id, i);

        if (is_kernel_grid_constant_byval_outarg(cli, arg)) {
            chain = emit_direct_grid_constant_param(dag, chain, arg, sym);
        } else if (arg->is_byval) {
            chain = emit_byval_param_copy(dag, chain, arg, sym);
        } else {
            chain = emit_scalar_or_vector_param_store(dag, chain, arg, sym);
        }
    }

    chain = emit_callee_target(dag, chain, target);
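    SDValue result = collect_call_results(dag, chain, ret, cli->result_types);
    // Keep the closing chain instead of discarding it, so later nodes are
    // ordered after the call. The cli->chain field is assumed for illustration.
    cli->chain = emit_param_call_end(dag, chain, call_id);
    return result;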
}
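The ordering of the six stages can be modeled as a flat sequence. The sketch below is a toy model, not the lowering itself: `call_envelope` is a hypothetical name, the lowercase labels are descriptive placeholders and not opcode names, and only DeclareRetParam, ParamCallStart, and ParamCallEnd are names taken from this page. Like the pseudocode above, it declares the return param unconditionally.

```cpp
#include <string>
#include <vector>

// Toy model of the call envelope: returns the stage labels in emission order.
std::vector<std::string> call_envelope(unsigned num_args, bool has_result) {
    std::vector<std::string> seq;
    seq.push_back("DeclareRetParam");   // stage 1: return-param declaration
    seq.push_back("ParamCallStart");    // opens the envelope
    for (unsigned i = 0; i < num_args; ++i)
        seq.push_back("store-arg-" + std::to_string(i));  // stage 4
    seq.push_back("callee-target");     // stage 3 result, emitted after the stores
    if (has_result)
        seq.push_back("extract-result"); // stage 5
    seq.push_back("ParamCallEnd");      // stage 6: closes the envelope
    return seq;
}
```

A two-argument call with a result produces seven entries, beginning with DeclareRetParam and ending with ParamCallEnd. Everything between ParamCallStart and ParamCallEnd is chained, which is what keeps argument stores from drifting outside the envelope.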

Byval and Grid-Constant State Machine

The byval path is governed by four facts:

Fact	Meaning
`K`	The caller/callee context is a kernel entry.
`B`	The argument has `byval` semantics.
`G`	The argument carries the grid-constant annotation.
`D`	The call resolves to a direct `Function`.

The decision table is:

Condition	Lowering action
`K && B && G`	Load directly from param address space. This is the fast path for kernel grid constants.
`K && B && !G`	Materialize through the ordinary lowered-args path.
`!K && B`	Use a proxy or move-param sequence for device-function byval.
`!B && D`	Emit direct callee prototype and direct call target nodes.
`!B && !D`	Build a synthetic callable wrapper and mark it as an NVPTX libcall callee.
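The decision table can be read as a pure function of the four facts. The sketch below is one way to encode it; the enum and function names are hypothetical, and the row conditions are copied from the table. The rows are mutually exclusive and cover every combination: `G` only matters when `K && B` holds, and `D` only matters when `B` is false.

```cpp
// Hypothetical labels for the five lowering actions in the table.
enum class Lowering {
    DirectParamLoad,   // K && B && G
    LoweredArgsPath,   // K && B && !G
    MoveParamOrProxy,  // !K && B
    DirectCall,        // !B && D
    LibcallWrapper     // !B && !D
};

// K: kernel entry, B: byval, G: grid-constant, D: direct Function callee.
Lowering classify(bool K, bool B, bool G, bool D) {
    if (B) {
        if (!K)
            return Lowering::MoveParamOrProxy;
        return G ? Lowering::DirectParamLoad : Lowering::LoweredArgsPath;
    }
    return D ? Lowering::DirectCall : Lowering::LibcallWrapper;
}
```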

The synthetic wrapper case adds the function attribute:

nvptx-libcall-callee = "true"

The marker is NVIDIA-specific and lets later passes recognize indirect-call wrappers without re-deriving their origin from the DAG.

CalleeTarget resolve_callee(CallLoweringInfo *cli) {
    if (cli->called_function != NULL)
        return direct_callee(cli->called_function);

    if (is_global_address(cli->callee_value))
        return external_symbol_callee(cli->callee_value);

    Function *wrapper = build_libcall_wrapper(cli->callee_value);
    add_function_attribute(wrapper, "nvptx-libcall-callee", "true");
    return callable_wrapper(wrapper);
}

Value Part Scheduling

Aggregate and vector arguments are broken into legal machine value types before they are stored into param space. The helper logic is equivalent to:

void store_outgoing_argument(DAG *dag,
                             Chain *chain,
                             OutArg *arg,
                             ParamSymbol sym) {
    ValueTypeParts parts = compute_value_parts(arg->type, arg->calling_conv);
    PartSchedule sched = schedule_value_parts(parts);

    for (unsigned i = 0; i < sched.count; ++i) {
        SDValue piece = extract_argument_piece(dag, arg->value, sched.part[i]);
        *chain = emit_store_param(dag, *chain, sym, sched.part[i], piece);
    }
}

Part scheduling is why the lowering path must know ABI size and alignment. By the time PTX sees a source-level aggregate, it is no longer a single call operand.
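To show why size and alignment drive the schedule, here is a minimal sketch of one possible splitting rule: at each offset, take the widest power-of-two piece that fits the remaining bytes and stays naturally aligned. The rule, the 8-byte default cap, and the names `Part` and `schedule_parts` are assumptions for illustration; the actual scheduler may choose parts differently.

```cpp
#include <cassert>
#include <vector>

struct Part {
    unsigned offset;  // byte offset inside the param symbol
    unsigned width;   // piece width in bytes
};

// Hypothetical scheduler: greedy, widest naturally aligned piece first.
std::vector<Part> schedule_parts(unsigned size, unsigned align,
                                 unsigned max_width = 8) {
    assert(align != 0 && (align & (align - 1)) == 0);
    assert(max_width != 0 && (max_width & (max_width - 1)) == 0);
    std::vector<Part> parts;
    unsigned offset = 0;
    while (offset < size) {
        unsigned width = max_width < align ? max_width : align;
        // Shrink until the piece fits the tail and is aligned at this offset.
        while (width > size - offset || offset % width != 0)
            width /= 2;
        parts.push_back({offset, width});
        offset += width;
    }
    return parts;
}
```

Under this rule a 12-byte aggregate with 8-byte alignment becomes an 8-byte piece at offset 0 and a 4-byte piece at offset 8, while the same aggregate with 4-byte alignment becomes three 4-byte pieces.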

Lowering Returns

Vtable slot 6 (LowerReturn) points at the inherited base-class stub in this build because the NVPTX return ABI does not need a custom DAG shape: every return value has already been routed through StoreRetval-class custom nodes (numeric 0xDA and friends, dispatched from the load/store vector selector). The base hook merely emits a RET_FLAG chain node that closes the function. The sketch below models the combined effect of the upstream StoreRetval materialization and the base hook; it is a logical model, not the body of slot 6.

SDValue lower_return(ReturnLoweringInfo *rli, DAG *dag) {
    Chain chain = rli->chain;
    for (unsigned i = 0; i < rli->ret_count; ++i) {
        RetVal *ret = &rli->rets[i];
        ValueTypeParts parts = compute_value_parts(ret->type, rli->calling_conv);
        for (unsigned j = 0; j < parts.count; ++j) {
            SDValue piece = extract_return_piece(dag, ret->value, parts.part[j]);
            chain = emit_store_retval(dag, chain, parts.part[j], piece);
        }
    }
    return emit_ret_flag(dag, chain);
}

Reimplementations that override this slot must keep StoreRetval materialization upstream of the chain close. Pushing return-value materialization into LowerReturn itself collapses the chain into a single node and breaks the value-part scheduling the rest of the lowering layer relies on.

Custom Operation Lowering

The custom-operation dispatcher takes target-specific cases and lets LLVM's generic legalizer take everything else. The relevant classes:

Operation class	Lowering behavior
Vector load/store	Rewrite into NVPTX vector load/store or splat DAG nodes when the target supports the shape.
`INSERT_VECTOR_ELT`, `EXTRACT_VECTOR_ELT`, `BUILD_VECTOR`, `SCALAR_TO_VECTOR`	Rebuild as NVPTX splat or lane-extract nodes.
Scalar floating remainder fallback	Materialize through param-load and element rebuild nodes.
Scalar atomics	Lower into NVPTX atomic nodes and chain bundles.
Vector atomics	Require a sufficiently new SM target; otherwise emit a fatal unsupported-architecture diagnostic.

The dispatcher returns "not handled" for gaps on purpose: that preserves LLVM's ordinary legalization behavior for non-NVIDIA cases.

bool lower_operation(SDNode *node, DAG *dag, SmallVector<SDValue> *results) {
    switch (node->opcode) {
    case ISD_LOAD:
        return lower_vector_or_param_load(node, dag, results);

    case ISD_INSERT_VECTOR_ELT:
    case ISD_EXTRACT_VECTOR_ELT:
    case ISD_BUILD_VECTOR:
    case ISD_SCALAR_TO_VECTOR:
        return lower_vector_lane_op(node, dag, results);

    case ISD_ATOMIC_LOAD_ADD:
    case ISD_ATOMIC_LOAD_AND:
    case ISD_ATOMIC_LOAD_OR:
    case ISD_ATOMIC_LOAD_XOR:
    case ISD_ATOMIC_CMP_SWAP:
        return lower_atomic(node, dag, results);

    default:
        return false;
    }
}
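Seen from the caller's side, the "not handled" contract looks like the sketch below: the driver tries the target hook first and takes the generic path whenever the hook declines. All names and result strings are hypothetical stand-ins, and the scalar-load gap is an assumed example of a case the hook leaves alone.

```cpp
#include <string>

enum class Op { Load, BuildVector, AtomicAdd, Add };

// Stand-in for the target hook: returns true only when it produced a result.
bool lower_custom(Op op, bool is_vector, std::string &out) {
    switch (op) {
    case Op::Load:
        if (!is_vector)
            return false;        // assumed gap: scalar loads are declined
        out = "nvptx-vector-load";
        return true;
    case Op::BuildVector:
        out = "nvptx-lane-op";
        return true;
    case Op::AtomicAdd:
        out = "nvptx-atomic";
        return true;
    default:
        return false;            // opcodes the target does not claim
    }
}

// Stand-in for the legalizer driver: target hook first, generic path second.
std::string legalize(Op op, bool is_vector) {
    std::string out;
    if (lower_custom(op, is_vector, out))
        return out;
    return "generic-legalizer";
}
```

Returning false leaves the node untouched, so the generic path sees exactly what it would have seen without the hook.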

Atomic-RMW Lowering

Atomic lowering is split by operation family. CAS-like and load-only operations emit an atomic compare-and-swap skeleton. Arithmetic RMW operations emit one arithmetic atomic per value part and then bundle the resulting chain.

bool lower_atomic(SDNode *node, DAG *dag, SmallVector<SDValue> *results) {
    AtomicKind kind = classify_atomic(node->opcode);

    if (kind.is_vector && !subtarget_supports_vector_atomics(dag->subtarget))
        fatal("vector atomics not supported on this architecture!");

    ValueTypeParts parts = compute_value_parts(node->value_type, node->calling_conv);
    Chain chain = node->chain;

    for (unsigned i = 0; i < parts.count; ++i) {
        AtomicPart part = extract_atomic_part(node, parts.part[i]);

        if (kind.uses_compare_exchange) {
            chain = emit_atomic_compare_and_swap(dag, chain, part, kind.signedness);
        } else {
            chain = emit_atomic_rmw(dag, chain, part, kind.opcode, kind.signedness);
        }
    }

    results->push_back(bundle_atomic_chain(dag, chain));
    return true;
}

Signedness does not change the overall DAG shape. It threads through to final instruction selection, where the backend picks the signed or unsigned PTX mnemonic, for example atom.global.min.s32 versus atom.global.min.u32.
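As an illustration of that last step, the sketch below builds the mnemonic for integer min/max atomics from the signedness flag. It assumes the global state space and 32- or 64-bit integers only; the function and enum names are hypothetical and do not reflect how instruction selection is actually table-driven.

```cpp
#include <cassert>
#include <string>

enum class MinMax { Min, Max };

// Hypothetical helper: signedness only changes the type suffix.
std::string atom_minmax_mnemonic(MinMax op, bool is_signed, unsigned bits) {
    assert(bits == 32 || bits == 64);
    std::string name = "atom.global.";
    name += (op == MinMax::Min) ? "min" : "max";
    name += is_signed ? ".s" : ".u";
    name += std::to_string(bits);
    return name;
}
```

The same DAG node therefore maps to `atom.global.min.s32` or `atom.global.min.u32` depending on a single flag carried from lowering.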
