NVPTX Bring-up and Target Init
Abstract
NVPTX bring-up is the handoff point between the Tileiras dialect-lowering
pipeline and the stock-shaped LLVM TargetMachine configured for PTX
emission. By the time this layer runs, the MLIR pipeline has already
produced LLVM/NVVM IR.
The layer owns target registration, MC-layer object construction, the
NVPTXAsmPrinter section model, the embedded-device-library linker, target
machine caching, and the LLVM optimization pipeline driver. The
reimplementation contract is a sequence, not a static constructor layout:
register both NVPTX triples, build consistent MC services, resolve the
target machine from the requested chip/features, link device bitcode, run
the LLVM pipeline, then emit PTX through the NVPTX asm printer.
Two choices distinguish Tileiras from a plain LLVM build. First,
nvptx and nvptx64 share one constructor table; the triple controls
pointer size and ABI details downstream. Second, libdevice never travels
LLVM's ordinary filesystem search path. It arrives as an MLIR BlobAttr
on the gpu.module and is parsed into an LLVM module before
optimization.
Target Registration Chain
Bring-up follows the same shape as upstream LLVM NVPTX with one structural
twist: the factory chain that constructs the NVPTXTargetMachine is folded
under NVIDIA's private peephole-pass selection. LLVMInitializeNVPTXTargetInfo
registers the target names through TargetRegistry::RegisterTarget. That call
runs from a __attribute__((constructor))-style global initializer, so by the
time main enters the compiler the two target records (nvptx, nvptx64) are
already in the registry. LLVMInitializeNVPTXTarget fills the constructor slots
for the target services used later by MC emission and target-machine creation.
The factory function the registry stores under each target record is not the
upstream createNVPTXTargetMachine. It is an NVIDIA-private variant that,
after building the base target machine, walks the global peephole-pass table
and installs the subset legal on the requested target chip. The selection is
data-driven: each entry in the peephole table carries a chip-feature predicate
that the factory evaluates against the parsed feature string. Peepholes whose
predicates fail are skipped; the survivors become part of the per-target-machine
pass pipeline returned to the caller. Caching the target machine therefore also
caches the peephole-pass selection — rebuilding with a different chip/feature
combination forces both the target machine and the peephole list to be
reconstructed.
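The data-driven selection described above can be pictured with a small standalone model. This is a minimal sketch, not the factory itself: `PeepholeEntry`, `parse_features`, `select_peepholes`, and the entry names used in the usage note are hypothetical, and the feature string is assumed to use the usual comma-separated `+feature` / `-feature` form.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical model of one peephole-table entry: a pass name plus a
// predicate evaluated against the set of enabled chip features.
struct PeepholeEntry {
    std::string name;
    std::function<bool(const std::set<std::string>&)> is_legal;
};

// Parse a comma-separated feature string ("+sm_90,+ptx80,-foo") into the set
// of enabled features. A '-' prefix removes a previously enabled feature.
std::set<std::string> parse_features(const std::string& features) {
    std::set<std::string> enabled;
    std::stringstream stream(features);
    std::string item;
    while (std::getline(stream, item, ',')) {
        if (item.empty()) continue;
        if (item[0] == '+') enabled.insert(item.substr(1));
        else if (item[0] == '-') enabled.erase(item.substr(1));
        else enabled.insert(item);
    }
    return enabled;
}

// Walk the table in order and keep only the entries whose predicate holds.
std::vector<std::string> select_peepholes(const std::vector<PeepholeEntry>& table,
                                          const std::string& features) {
    const std::set<std::string> enabled = parse_features(features);
    std::vector<std::string> selected;
    for (const PeepholeEntry& entry : table) {
        if (entry.is_legal(enabled)) selected.push_back(entry.name);
    }
    return selected;
}
```

With a two-entry table in which one entry is always legal and the other requires `sm_90`, a request for `+sm_80` keeps only the first entry, which is why the selection has to be rebuilt whenever the chip or feature string changes.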
| Service | Role |
|---|---|
| LLVMInitializeNVPTXTargetInfo | Registers nvptx and nvptx64 target records. |
| LLVMInitializeNVPTXTarget | Installs all target constructor callbacks. |
| NVPTXMCAsmInfo | Defines PTX comments, directives, pointer size, and asm syntax. |
| MCInstrInfo | Supplies instruction descriptors for the NVPTX opcode set. |
| NVPTXRegisterInfo | Supplies physical registers and register-class descriptors. |
| MCSubtargetInfo | Supplies CPU and feature tables used by legality checks. |
| MCInstrAnalysis | Supplies branch and instruction-analysis helpers. |
| MCAsmBackend | Supplies MC assembly backend services. |
| MCCodeEmitter | Supplies MC instruction encoding hooks where LLVM expects them. |
| NVPTXAsmPrinter | Emits module headers, directives, sections, and PTX instruction text. The constructor slot points at the LTO-folded printer described below, not a generic LLVM AsmPrinter. |
Both 32-bit and 64-bit targets receive the same service table. The triple decides
whether the compilation is nvptx or nvptx64, and the MC asm-info constructor
turns that into the pointer-size and stack-slot-size choices needed by the ABI.
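As a rough illustration of how one shared service table can still yield two ABIs, the sketch below derives the pointer size and callee-save slot size from the architecture component of the triple. The `AbiSizes` struct and the `abi_sizes_for_triple` helper are hypothetical names; the 4/8 values follow the asm-info field table in this section.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical bundle of the two ABI sizes the asm-info constructor derives.
struct AbiSizes {
    unsigned pointer_size;
    unsigned callee_save_slot_size;
};

// Inspect only the architecture component (text before the first '-').
// Anything other than nvptx / nvptx64 is rejected.
std::optional<AbiSizes> abi_sizes_for_triple(const std::string& triple) {
    const std::string arch = triple.substr(0, triple.find('-'));
    if (arch == "nvptx") return AbiSizes{4, 4};
    if (arch == "nvptx64") return AbiSizes{8, 8};
    return std::nullopt;
}
```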
User Target vs gpulibs Subtarget Triple
The 64-bit NVPTX target record handles two distinct triples that travel through
the same TargetMachine factory but exit with different feature gates: the
user-facing nvptx64-nvidia-cuda triple compiled by the host LLVM-21 backend at
run time, and the embedded-only nvptx64-nvidia-gpulibs subtarget triple
carried as producer metadata on prebuilt bitcode resources baked into the
binary at link time. The host backend never emits gpulibs IR; it only
consumes it through the bitcode reader during the blobLinkedLib link step.
What makes this surprising is that the same compiler binary ships IR produced by two different clang generations, both of which predate the host LLVM-21 link target by several major versions:
| Producer string | Subtarget triple | Carried symbol family |
|---|---|---|
| clang version 16.0.0 (NVIDIA internal) | nvptx64-nvidia-gpulibs | __nv_fp128 softfloat path — fp128 arithmetic and transcendentals |
| clang version 7.1.0 git-630d6c22278 | nvptx64-nvidia-gpulibs | __nv_*128 integer family — 128-bit integer divide, modulo, conversion |
The dual-clang split exists because the integer-128 helper library was
finalized against clang 7.1.0 long before the fp128 softfloat work began, and
NVIDIA never recompiled the older IR against newer clang releases. Recompiling
the legacy IR would force re-verification of the entire __nv_*128 integer
helper set against every supported SM, and the helpers are pure bitwise
arithmetic that LLVM 21's optimizer consumes identically to LLVM 7's output.
The fp128 work, by contrast, was a fresh integration that needed clang-16
features (newer __attribute__((target)) handling, fp128 ABI fixes) and was
checked in at the version that built cleanly. Both blobs were frozen at their
respective producer generations and embedded side by side rather than
maintained on a moving baseline.
What the gpulibs IR ships, structurally:
- Berkeley SoftFloat — f128M_add, f128M_mul, f128M_div, f128M_sqrt, softfloat_* rounding and rawFloat helpers. Provides the arithmetic backbone of the fp128 softfloat path. The library is statically linked into the gpulibs bitcode rather than shipped as a separate .bc resource; on-disk it is invisible.
- Sleef — Sleef_* transcendental functions, Sleef_rempitabqp (the Payne–Hanek argument-reduction table for quad-precision), and the qp_cuda_sleefq CUDA bridge. Provides sinq, cosq, tanq, expq, logq, and the rest of the fp128 transcendental surface.
- NVIDIA __nv_*128 helpers — __nv_udivti3, __nv_umodti3, __nv_divti3, __nv_modti3, and the wider 128-bit integer conversion set. These come from the clang-7.1 blob, not the clang-16 one.
Integration into the host pipeline goes through the same blobLinkedLib
attribute described below: the gpulibs bitcode is parsed by the LLVM-21
bitcode reader, linked with LinkOnlyNeeded so only the helpers the kernel
actually references survive, then dropped into the optimization pipeline as
ordinary internal functions. The optimizer sees no producer-version
distinction: the IR is read as plain LLVM 21 IR once the bitcode reader has
auto-upgraded any constructs left over from the older producers.
⚡ QUIRK — two compiler generations, one binary
A stripped tileiras binary carries producer strings for clang version 16.0.0 and clang version 7.1.0 git-630d6c22278 simultaneously, alongside the primary host link target identifying as LLVM21.0.0git. The producer strings are the fingerprint to grep for when locating the embedded bitcode resources in a stripped binary; they survive both LTO and strip because they live inside the bitcode payload, not in the host symbol table.
⚡ QUIRK — nvptx64-nvidia-gpulibs is a producer-only triple
The host backend never builds or registers a TargetMachine for the gpulibs triple. The triple appears only in the module metadata of embedded bitcode resources and tells the bitcode reader to apply gpulibs-specific attribute defaults during deserialization. A reimplementation that registers gpulibs as a callable target will be calling code paths the original binary never exercises at run time.
⚡ QUIRK — SoftFloat and Sleef are not separate .bc files
Both third-party libraries are statically linked into the gpulibs blob before the producer-string serialization happens. The blob exposes f128M_*, softfloat_*, Sleef_*, and __nv_fp128_* as if they were a single translation unit, which is why the producer string is clang version 16.0.0 for the entire fp128 surface even though the upstream SoftFloat and Sleef sources were never built with clang-16 in isolation.
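Since the producer strings are described as a grep-able fingerprint, a scan for them can be sketched in a few lines. This is an illustrative helper, not part of the binary: `find_producer_strings` is a hypothetical name, and the sketch assumes the strings appear as contiguous printable ASCII bytes inside the image, which is what makes them grep-able in the first place.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collect every printable-ASCII run that starts with "clang version ".
// `image` holds the raw bytes of the binary (or of one embedded blob).
std::vector<std::string> find_producer_strings(const std::string& image) {
    static const std::string needle = "clang version ";
    std::vector<std::string> found;
    std::size_t pos = 0;
    while ((pos = image.find(needle, pos)) != std::string::npos) {
        std::size_t end = pos;
        // Extend the match to the end of the printable run.
        while (end < image.size() && image[end] >= 0x20 && image[end] < 0x7f) {
            ++end;
        }
        found.push_back(image.substr(pos, end - pos));
        pos = end;
    }
    return found;
}
```

Run over a buffer that contains both embedded blobs, the helper would report two distinct producer strings, one per clang generation.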
NVPTXMCAsmInfo Constructor
NVPTXMCAsmInfo starts from ordinary LLVM MC defaults and then replaces the
host-assembly pieces that make no sense for PTX. PTX has no ELF-style
.text, .bss, .data, .globl, or .weak directives, so those fields become
comments or PTX-specific byte directives. Inline assembly gets wrapped in
comments so ptxas receives the inline body without host-assembler markers.
| Field | NVPTX value |
|---|---|
| PointerSize | 4 for nvptx, 8 for nvptx64 |
| CalleeSaveStackSlotSize | matches pointer size |
| CommentString | // |
| PrivateGlobalPrefix | $L__ |
| CommentColumn | 4 |
| InlineAsmStart / InlineAsmEnd | commented begin/end markers |
| AsciiDirective | .b8 |
| Data8bitsDirective | .b8 |
| Data32bitsDirective | .b32 |
| Data64bitsDirective | .b64 |
| GlobalDirective | commented .globl surrogate |
| WeakRefDirective | commented .weak surrogate |
| UseIntegratedAssembler | disabled |
| SupportsDebugInformation | enabled |
PTX assembly must never depend on host object-file section semantics. The asm-info layer turns LLVM's generic MC vocabulary into PTX comments and PTX byte directives before the printer writes a module.
Section Changes
NVPTXAsmPrinter::changeSection implements the brace-bound function-body model
used by PTX. Instead of switching among ELF sections, the printer emits a
commented section header and opens or closes a brace-delimited body.
void change_nvptx_section(AsmPrinter *printer, MCSection *next, raw_ostream *os) {
    if (printer->current_section == next) {
        os_write(os, "\t}\n");
        printer->current_section = NULL;
        return;
    }
    print_commented_section_header(next, os);
    os_write(os, "\t{\n");
    printer->current_section = next;
}
Emitted PTX kernels therefore appear inside { and } rather than between
.text and .size markers. The section line is documentation for readers and
debug tooling; ptxas treats it as a comment.
The LTO-Folded AsmPrinter Class
The class the target registry stores under NVPTXAsmPrinter is a single, very
large function produced by NVIDIA's whole-program LTO build. It is not the
upstream NVPTXAsmPrinter class hierarchy — at link time the LTO pipeline
collapses the upstream class, its TableGen-generated AsmWriter subclass, the
operand-print helpers, the modifier-print helpers, and most of the per-opcode
print-shape helpers into a single dispatch function with a giant switch ladder
over MC opcodes. The dispatcher inlines the operand-print and modifier-print
work directly into each case rather than calling through small helpers.
What survives as separate, non-inlined methods is the part of the printer the
target machine and the pass manager call from outside: the constructor, the
runOnMachineFunction entry point, the section-change hook described above,
the module-level header emitter, and the global-variable emitter. These methods
are the ones whose addresses the target registry stores and whose vtable slots
the rest of the backend dispatches through.
The MLIR side of the same class handles selected nvvm.* ops — the dialect's
custom printers for ops that do not lower to a single PTX instruction, and for
which the TableGen assemblyFormat is not enough. The MC-instruction side
handles selected MCInst opcodes through the per-MC-opcode dispatcher. Both
sides share the operand-print and modifier-print code, which is why they end up
folded into the same LTO function: the inliner sees the shared callees and
collapses both call graphs around them.
A reimplementation does not need to reproduce the LTO fold. The contract is
that one printer object serves both MLIR-side nvvm.* printing and MC-side PTX
emission, and that the two sides share modifier-print and operand-print
infrastructure. How the implementation factors that contract is a build-time
decision; the LTO fold is the choice NVIDIA's release build makes.
Mnemonic Pool Decode
PTX mnemonics live in a .data-resident pool the AsmPrinter reads from. The
pool is not stored in cleartext. The bytes in .data are obfuscated under a
walking-XOR cipher; the printer decodes the pool in place on first use, then
all subsequent mnemonic lookups read decoded bytes directly.
The cipher is a stride-3 walking XOR. Byte at offset i in the pool is
decoded by XORing with key byte (3 * i) mod 256. The decode is in-place and
single-pass: the printer walks the pool from offset zero to the end exactly
once, XORing each byte against its computed key byte, and then sets a flag
that future readers consult before doing any work. The decode runs under a
pthread_once guard so concurrent compilations cannot trigger overlapping
in-place writes; the once-init body holds a process-global lock, walks the
pool, and releases the lock with the decoded state visible to all threads.
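#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// Per-opcode offset tables: low and high 16-bit halves of the pool offset.
extern const uint16_t mnemonic_offset_table_lo[];
extern const uint16_t mnemonic_offset_table_hi[];

static pthread_once_t mnemonic_pool_once = PTHREAD_ONCE_INIT;
static char mnemonic_pool[POOL_SIZE]; // .data-resident, obfuscated at link time

// One-time in-place decode: byte i is XORed with (3 * i) mod 256.
static void decode_mnemonic_pool(void) {
    for (size_t i = 0; i < POOL_SIZE; ++i) {
        mnemonic_pool[i] ^= (char)((3u * (unsigned)i) & 0xff);
    }
}

const char *mnemonic_for_opcode(unsigned mc_opcode) {
    pthread_once(&mnemonic_pool_once, decode_mnemonic_pool);
    uint32_t lo_offset = mnemonic_offset_table_lo[mc_opcode];
    uint32_t hi_offset = mnemonic_offset_table_hi[mc_opcode];
    return &mnemonic_pool[lo_offset | (hi_offset << 16)];
}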
Two offset tables index into the decoded pool — a low-16-bit table and a
high-16-bit table, both keyed by MC opcode. Joining them produces a 32-bit
byte offset into the pool, which lets the printer address mnemonics up to a
4 GiB pool size even though the pool itself fits in a few tens of kilobytes.
The two-table split survives the LTO fold; the printer's MC-opcode dispatcher
emits the same (lo | (hi << 16)) reconstruction inline in every case.
⚡ QUIRK — 32-bit offset reconstruction for a sub-megabyte mnemonic pool
The printer addresses each mnemonic with a 32-bit offset reconstructed from two 16-bit tables, even though the entire decoded pool fits in tens of kilobytes — the high 16 bits are always zero in practice. The split is a TableGen artifact: the offset type is sized for the worst-case LLVM target's combined mnemonic pool. NVPTX inherits the layout because LTO refuses to fold the second table away (the dispatcher reads it inline in every case), so the (lo | (hi << 16)) idiom is the fingerprint to grep for when locating the printer in a stripped binary.
A smaller separate pool with the same XOR-3 decode scheme holds physical register names. Both pools are decoded by the same once-init body, so the mnemonic table and the register-name table become available simultaneously on the first call to the printer.
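Because XOR is its own inverse, the same stride-3 walk both obfuscates and decodes a pool. The sketch below shows that round trip together with the two-table offset join. All three helper names are hypothetical, and the NUL-terminated string layout is an assumption for illustration; the text above does not state how individual mnemonics are delimited.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// XOR byte i with (3 * i) mod 256. Applying the walk twice restores the input.
void walking_xor3(std::vector<std::uint8_t>& pool) {
    for (std::size_t i = 0; i < pool.size(); ++i) {
        pool[i] ^= static_cast<std::uint8_t>((3u * i) & 0xffu);
    }
}

// Rebuild the 32-bit pool offset from its low and high 16-bit halves.
std::uint32_t join_offset(std::uint16_t lo, std::uint16_t hi) {
    return static_cast<std::uint32_t>(lo) | (static_cast<std::uint32_t>(hi) << 16);
}

// Read a string starting at `offset`, stopping at a NUL byte or the pool end
// (the NUL terminator is an assumption of this sketch).
std::string mnemonic_at(const std::vector<std::uint8_t>& pool, std::uint32_t offset) {
    std::string out;
    for (std::size_t i = offset; i < pool.size() && pool[i] != 0; ++i) {
        out.push_back(static_cast<char>(pool[i]));
    }
    return out;
}
```

A build step could call `walking_xor3` once to produce the obfuscated bytes, and the printer's once-init body would call the same routine to undo it.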
NVVM Intrinsic Mapping Table
The translation from nvvm.* MLIR ops to LLVM IR happens through a
one-to-one mapping table the target-init layer installs into the MLIR-to-LLVM
translator. Each nvvm.* op carries one of two lowering keys:
- llvm_intrinsic_id: the op lowers to a single call of an llvm.nvvm.* intrinsic. The translator looks up the intrinsic by ID and emits an LLVM IntrinsicInst with the op's operands as arguments.
- inline_asm_template: the op lowers to an llvm.inline_asm call whose template is a PTX fragment with ${0}, ${1}, etc. placeholders for the operands. The translator substitutes the operand SSA values, emits the InlineAsm IR, and lets the later NVPTX backend pass copy the inline-asm body verbatim into the PTX output.
The choice between the two paths is per-op, baked into the table. Ops that
correspond to a single PTX instruction with a stable encoding (most mma.*,
tma.*, and mbarrier.* ops) go through the intrinsic path. Ops that
correspond to compound PTX sequences or that depend on assembly-level
modifiers the LLVM intrinsic surface does not expose go through the
inline-asm path. The inline-asm template typically embeds the modifier
directly into the template string rather than passing it as an operand,
because the LLVM asm constraint vocabulary cannot express PTX modifier
combinatorics in general.
typedef enum { LOWER_INTRINSIC, LOWER_INLINE_ASM } NvvmLoweringKind;
typedef struct NvvmOpLowering {
    StringRef op_name; // e.g. "nvvm.barrier0"
    NvvmLoweringKind kind;
    union {
        uint32_t intrinsic_id; // valid when kind == LOWER_INTRINSIC
        struct {
            StringRef template_str; // PTX template with ${i} placeholders
            StringRef constraints; // LLVM inline-asm constraint string
            bool has_side_effects;
        } asm_data; // valid when kind == LOWER_INLINE_ASM
    };
} NvvmOpLowering;
The table is populated by the dialect's addPattern calls during target-init.
On every nvvm.* op encountered by the translator, the dispatcher consults
the table, builds either the intrinsic call or the inline-asm call, and
attaches the same memory-effect and side-effect attributes the op carried in
MLIR. Attaching the effects matters: if the op was marked memory-writing in
the dialect, the resulting LLVM call must be marked the same way, or the
LLVM optimizer will see it as a pure call and start hoisting or eliminating
it on the device IR.
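To make the ${i} placeholder convention of the inline-asm path concrete, here is a string-level sketch of operand substitution. It is only an illustration under stated assumptions: `expand_template` is a hypothetical helper, operands are modeled as plain register-name strings, and the real translator works on SSA values and LLVM's InlineAsm machinery rather than on text.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Replace each ${N} in `tmpl` with operands[N]. Placeholders that are
// malformed or out of range are copied through unchanged.
std::string expand_template(const std::string& tmpl,
                            const std::vector<std::string>& operands) {
    std::string out;
    std::size_t i = 0;
    while (i < tmpl.size()) {
        if (tmpl.compare(i, 2, "${") == 0) {
            const std::size_t close = tmpl.find('}', i + 2);
            if (close != std::string::npos && close > i + 2) {
                const std::string digits = tmpl.substr(i + 2, close - (i + 2));
                bool numeric = digits.size() <= 4;  // keeps std::stoul in range
                for (char c : digits) {
                    numeric = numeric && std::isdigit(static_cast<unsigned char>(c)) != 0;
                }
                if (numeric) {
                    const std::size_t index = std::stoul(digits);
                    if (index < operands.size()) {
                        out += operands[index];
                        i = close + 1;
                        continue;
                    }
                }
            }
        }
        out.push_back(tmpl[i++]);
    }
    return out;
}
```

Modifiers baked into the template string pass through untouched, which matches the observation that modifiers are embedded in the template rather than passed as operands.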
blobLinkedLib Attribute on gpu.module
The blobLinkedLib attribute on a gpu.module carries the precompiled
bitcode payload that gets linked into the LLVM module during NVPTX bring-up.
The attribute is the only point at which libdevice (or another bitcode
helper library) enters the compiler: there is no command-line -mlink-bc
flag and no implicit search-path lookup. The driver attaches the attribute to the module
during front-end processing, and the bring-up layer consumes it during the
LLVM-link step described below.
The attribute value is an MLIR BlobAttr whose payload can be one of two
shapes:
- An inline byte array: the bitcode payload is embedded directly in the IR. Used when the driver wants to pin a specific libdevice version into a reproducible build.
- A filesystem path: the bitcode lives on disk and the bring-up layer reads it at link time. Used by the normal CUDA toolchain build where libdevice is shipped as a separate .bc file alongside the compiler.
LLVMModule *load_blob_linked_lib(GPUModuleOp module) {
    Attribute attr = module.attributes()["blobLinkedLib"];
    if (attr == NULL) {
        return NULL;
    }
    BlobPayload payload = resolve_blob_payload(attr);
    if (payload.kind == BLOB_FILE && !is_regular_file(payload.path)) {
        diagnose("blobLinkedLib: bitcode path does not exist or is not a file");
        return NULL;
    }
    ParseResult parsed = parse_llvm_bitcode(payload);
    if (!parsed.ok) {
        diagnose("blobLinkedLib: failed to parse embedded bitcode");
        return NULL;
    }
    return parsed.module;
}
The loader runs at the start of the NVPTX bring-up. The parsed module is
linked into the main LLVM module via llvm::Linker with the
LinkOnlyNeeded flag so unused helpers do not bloat the final PTX. From
this point on the helpers are ordinary LLVM IR — the optimizer sees them as
internal functions and can inline, specialize, and DCE them like any other
device function.
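The effect of linking with LinkOnlyNeeded can be approximated as a reachability walk over the library's call graph, starting from the symbols the kernel references. The sketch below models only that idea; it does not call into LLVM, and `CallGraph`, `needed_helpers`, and the helper name `hypothetical_reduce_helper` in the usage note are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical library model: each defined function maps to the functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Return the library functions reachable from the kernel's references.
// Symbols the library does not define are ignored.
std::set<std::string> needed_helpers(const CallGraph& library,
                                     const std::set<std::string>& kernel_refs) {
    std::set<std::string> needed;
    std::vector<std::string> worklist;
    for (const std::string& name : kernel_refs) {
        if (library.count(name) != 0) worklist.push_back(name);
    }
    while (!worklist.empty()) {
        const std::string name = worklist.back();
        worklist.pop_back();
        if (!needed.insert(name).second) continue;  // already visited
        for (const std::string& callee : library.at(name)) {
            if (library.count(callee) != 0 && needed.count(callee) == 0) {
                worklist.push_back(callee);
            }
        }
    }
    return needed;
}
```

For a library where `__nv_sinf` calls `hypothetical_reduce_helper` and `__nv_expf` stands alone, a kernel that references only `__nv_sinf` pulls in the first two and leaves `__nv_expf` out of the linked module.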
The contract for a reimplementation: a gpu.module without blobLinkedLib
proceeds through the NVPTX pipeline with no implicit libdevice link. Math
intrinsics that need libdevice helpers (transcendentals, denormal handling,
some integer conversions) emit linker-error diagnostics rather than silently
falling back to a default library. The driver layer is responsible for
attaching the attribute when the kernel actually needs libdevice.
Target-Machine Cache
Target-machine creation resolves the target triple, looks up the registered
target, builds TargetOptions, selects the requested mcpu, and calls the
target's TargetMachine constructor. The resulting object is cached so repeated
compilations with the same target settings do not rebuild the LLVM backend state.
TargetMachine *get_or_create_nvptx_target_machine(TargetCache *cache,
                                                  TargetRequest request) {
    if (cache->machine != NULL && target_request_equal(cache->request, request)) {
        return cache->machine;
    }
    const Target *target = lookup_target(request.triple);
    if (target == NULL) {
        diagnose("failed to look up NVPTX target for requested triple");
        return NULL;
    }
    TargetOptions options = default_nvptx_target_options();
    TargetMachine *machine = target->create_target_machine(
        request.triple, request.mcpu, request.features, options);
    cache->request = request;
    cache->machine = machine;
    return machine;
}
The cache key must include the triple, chip, and feature string. A target machine reused across incompatible feature sets makes later legality checks observe the wrong subtarget.
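A minimal sketch of the key comparison, assuming the request is modeled as three strings. The field layout of `TargetRequest` here is a guess for illustration, not the binary's actual structure.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Hypothetical request record: the three fields that form the cache key.
struct TargetRequest {
    std::string triple;
    std::string mcpu;
    std::string features;
};

// Two requests may share a cached target machine only if all three fields match.
bool target_request_equal(const TargetRequest& a, const TargetRequest& b) {
    return std::tie(a.triple, a.mcpu, a.features) ==
           std::tie(b.triple, b.mcpu, b.features);
}
```

Comparing the feature string byte for byte is conservative: two strings that list the same features in a different order count as different keys, which costs a rebuild but never reuses a mismatched subtarget.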
LLVM Pass Pipeline
The optimization driver accepts the requested optimization level, ensures a
target machine exists, and asks LLVM PassBuilder for the per-module default
pipeline. Invalid optimization levels become diagnostics before any pass manager
is built.
bool run_llvm_pipeline(LLVMModule *module, TargetMachine *tm, OptLevel level) {
    if (!is_valid_opt_level(level)) {
        diagnose("invalid LLVM optimization level");
        return false;
    }
    if (tm == NULL) {
        diagnose("target machine unavailable; cannot optimize with LLVM");
        return false;
    }
    PassBuilder builder(tm);
    ModulePassManager mpm = builder.build_per_module_default_pipeline(level);
    mpm.run(*module);
    return true;
}
The pipeline shape is the stock LLVM decomposition: early simplification, module
simplification, function simplification, inlining, vectorization, module
optimization, and post-pass cleanup. Tileiras-specific behavior happens before
and around the pipeline: target-machine selection, the blobLinkedLib bitcode
linkage step, and the NVIDIA-private peephole-pass selection the factory
installed when the target machine was built.
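The validate-before-build rule for optimization levels can be sketched as a small parsing step that runs before any pass manager exists. The accepted range 0 to 3 and the names `OptLevel` and `parse_opt_level` are assumptions for this illustration; the driver's actual set of accepted levels is not stated here.

```cpp
#include <cassert>
#include <optional>

// Assumed set of levels for this sketch.
enum class OptLevel { O0, O1, O2, O3 };

// Map a requested numeric level to an OptLevel, or report failure so the
// caller can emit a diagnostic instead of building a pipeline.
std::optional<OptLevel> parse_opt_level(int requested) {
    switch (requested) {
        case 0: return OptLevel::O0;
        case 1: return OptLevel::O1;
        case 2: return OptLevel::O2;
        case 3: return OptLevel::O3;
        default: return std::nullopt;
    }
}
```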
Cross-References
- Codegen Overview — End-To-End Algorithm — the seven-stage backend contract these primitives feed into.
- AsmPrinter — MC Switch Shape Population Table — the 6,388-case dispatcher that the LTO-folded printer described here implements, and the per-SM print-shape windows it dispatches into.
- NVPTX Subtarget — The 81 Feature Indices — the chip/feature predicates the peephole-pass selection consults.
- libdevice Overview — the bitcode library most often delivered through the blobLinkedLib attribute.
- Math Pass Pipeline and Crosswalk — the consumer side of the gpulibs IR: where f128M_*, softfloat_*, Sleef_*, and the __nv_*128 integer helpers get wired into kernel-side math.
- Versions and Fingerprints — the producer-string and subtarget-triple table this section refers to.
- LLVM Fingerprint Table — the host LLVM-21 link-target identification that distinguishes the run-time backend from the embedded clang-16 / clang-7.1 producers.
- Lowering — Target Attribute Conversion — the point at which gpu.module acquires the #nvvm.target and blobLinkedLib attributes the bring-up reads.