Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

CUDA Pragma & NVVM Annotation Registry

CUDA programs steer cicc through three orthogonal channels: source-level pragma directives (#pragma unroll, #pragma nv_diag_suppress, #pragma nv_abi, #pragma nvopt), declaration-attached attributes (__launch_bounds__, __cluster_dims__, __grid_constant__, __maxnreg__), and the resulting NVVM-IR-level metadata (nvvm.annotations, !llvm.loop.*, function-attached "nvvm.maxntid" attributes). This page documents the complete pipeline by which an EDG parse hit becomes an IR-level annotation, the byte-tag table that EDG uses internally, every nvvm.annotations tag the back-end consumes, and the dispatch function (sub_A84F90) that walks the metadata at IR-import time and decomposes each tuple into per-function attribute bags. Target audience: a senior compiler engineer who needs to reimplement the EDG→NVVM bridge or write a tool that produces cicc-compatible bitcode without going through the C++ front-end. Take-away: cicc's annotation surface is far wider than the NVPTX user manual documents — there are at least 14 EDG-internal attribute slots, 9 distinct NVVM annotation keys, 4 function-attached attribute strings shared with the annotation keys, and 7 well-formed nvopt<...> loop tags. All are recoverable from the binary by tracing four string clusters.

At a Glance

ChannelLives inRead byPersists across
#pragma directiveEDG parser stateEDG attribute mapper sub_5C79F0One declaration
__attribute__ / __launch_bounds__EDG declaration recordEDG IR emitterOne function
nvvm.annotations named metadataNVVM IR modulesub_A84F90 (dispatcher) → per-tag accessors sub_CE8D40 familyWhole module, survives LTO
Function string attribute "nvvm.maxntid" etc.Function::AttributesCode generator, register allocatorOne function, set late
llvm.loop.* MDNode chainLLVM IR metadata on latch branchLoop optimizer (sub_19BB5C0 computeUnrollCount)One loop
nvopt<...> loop tagPass-manager scheduling metadataPipeline assembler sub_226C400One loop body

The three lanes are tied together by a single invariant: every user-facing pragma or attribute is materialized as either a nvvm.annotations tuple, a function Attribute string, or an llvm.loop.* metadata node before LLVM optimization begins. No optimizer pass ever re-parses pragma text. This is enforced by EDG's IR emitter, which lowers attributes to metadata during NVVM IR generation (see NVVM IR Generation).

1. EDG-Side Attribute Byte Tags

EDG stores each declaration attribute as a 1-byte tag in the symbol's flag bits. Function sub_5C79F0 is the inverse map — given a byte tag, it returns the human-readable keyword used in diagnostics. Decompilation of the full switch yields the following table (HIGH — every keyword is a string literal cross-referenced by this function only):

Tag (hex)Tag (ASCII)KeywordWhere applied
0x56V__host__Function declarations
0x57W__device__Function/variable declarations
0x58X__global__Kernel function declarations
0x59Y__tile_global__Tile-shared kernel functions (HIGH — string exists; semantics inferred)
0x5AZ__shared__Variable declarations
0x5B[__constant__Variable declarations
0x5C\__launch_bounds____global__ functions
0x5D]__maxnreg____global__ functions
0x5E^__local_maxnreg____global__ functions (function-local override)
0x5F___tile_builtin__Compiler-emitted tile functions
0x66f__managed__Variable declarations
0x6Bk__cluster_dims____global__ functions, sm_90+ only
0x6Cl__block_size____global__ functions (cluster sibling)
0x72r__nv_pure____device__ functions (NVIDIA's pure-function attribute)
/* Reconstructed body of sub_5C79F0 — EDG attribute byte → keyword map. */
const char *edg_attr_keyword(const edg_attr_t *attr) {
    /* attr+0   : qualified-name pointer        */
    /* attr+8   : tag byte (this switch)        */
    /* attr+16  : raw spelling (printable name) */
    /* attr+24  : qualifier prefix (e.g. "C::") */
    const char *prefix = attr->qual_prefix;
    const char *raw    = attr->raw_spelling;
    if (prefix) {                                        /* qualified case */
        int n = sprintf(g_attr_scratch, "%s::%s", prefix, raw);
        raw = intern(g_attr_scratch, n);                 /* canonicalise   */
    }
    switch (attr->tag) {
        case 'V': return "__host__";
        case 'W': return "__device__";
        case 'X': return "__global__";
        case 'Y': return "__tile_global__";
        case 'Z': return "__shared__";
        case '[': return "__constant__";
        case '\\':return "__launch_bounds__";
        case ']': return "__maxnreg__";
        case '^': return "__local_maxnreg__";
        case '_': return "__tile_builtin__";
        case 'f': return "__managed__";
        case 'k': return "__cluster_dims__";
        case 'l': return "__block_size__";
        case 'r': return "__nv_pure__";
        default:  return raw ? raw : "<anonymous-attr>";
    }
}

QUIRK — non-contiguous tag range The byte tags occupy 0x56..0x6C but with two visible gaps (0x60..0x65, 0x67..0x6A, 0x6D..0x71). Those slots are not stray free space — they belong to attributes that exist in EDG's internal model but have no diagnostic string (compiler-internal flags such as auto-__device__ propagation from -default-device). When implementing a bitcode producer that round-trips EDG behaviour, you must never reuse these byte values for new attributes: the EDG semantic analyser also reads them, and any tag outside the documented map degrades to "anonymous attribute" rather than producing an error.

2. The #pragma nv_abi ABI-Override Slot

#pragma nv_abi is unique in cicc: it is the only pragma that mutates the per-function ABI before code generation, distinct from __launch_bounds__ (which sets metadata) or __nv_pure__ (which sets a function attribute). All ten diagnostic strings cluster at 0x3A53C90..0x3A53EC8, which gives a complete picture of the validator. (HIGH — strings verbatim; ordering of checks inferred from typical EDG semantic-action layout.)

/* Reconstructed semantic-action body for `#pragma nv_abi <arg>`. */
edg_diag_t process_pragma_nv_abi(edg_decl_t *decl, edg_token_t *tok) {
    if (decl == NULL)                       return DIAG("must appear immediately before a function declaration, function definition, or an expression statement");
    if (!(decl->space_flags & SPACE_DEVICE))return DIAG(decl->space_flags & SPACE_HOST
                                                       ? "is not supported inside a host function"
                                                       : "must be applied to device functions");
    if (tok == NULL || tok->kind == TOK_END)return DIAG("requires an argument");
    edg_expr_t *e = parse_expr(tok);
    if (!edg_is_icexpr(e))                  return DIAG("argument must evaluate to an integral constant expression");
    int64_t v = edg_icexpr_value(e);
    if (v < 0)                              return DIAG("argument value must be a positive value");
    if (v < INT32_MIN || v > INT32_MAX)     return DIAG("argument value exceeds the range of an integer");
    if (!nv_abi_option_valid(v))            return DIAG("contains an invalid option");
    if (nv_abi_already_set(decl, v))        return DIAG("contains a duplicate argument");
    if (!nv_abi_value_valid_for(decl, v))   return DIAG("contains an invalid argument value");
    decl->abi_overrides |= (1ULL << v);     /* set bit, attach to decl */
    return DIAG_OK;
}

The bit-packed abi_overrides field is what survives into NVVM IR — typically encoded into the function's "nvvm.abi.opts" string attribute. The exact bit-meaning of each option index is gated by the host-tool driver and not visible from cicc alone (LOW).

3. EDG #pragma nvopt — Per-Loop Optimisation Override

#pragma nvopt N (where N is an integer constant) attaches a numeric optimisation strength to the immediately-following loop. It is cicc's only mechanism for per-loop O-level override and the only way for source code to opt a single loop into the Ofast-compile tier. The diagnostic key cluster nvopt_pragma_* (in cicc_strings.json at 0x3A0F6DF..0x3A0F756) and the human strings at 0x3A505A0..0x3A50720 together specify the full validator. The metadata key "nvopt" at 0x4281726 is then written into the loop's MDNode list (HIGH — both producer (sub_12C35D0, sub_225D540) and consumer (sub_226C400) are anchored on the same key).

Diagnostic keyTrigger
nvopt_pragma_no_loopPragma not directly preceding a loop
nvopt_pragma_no_argMissing argument
nvopt_pragma_bad_formatJunk trailing tokens after the integer
nvopt_pragma_not_constantArgument not an integer constant expression
nvopt_pragma_value_overflowArgument exceeds 32-bit range
nvopt_pragma_negative_valueArgument is negative
/* Reconstructed loop-attribute emission for `#pragma nvopt N`. */
void emit_nvopt_loop_mdnode(loop_t *loop, int level) {
    /* The optimisation level is packed as a 6-tag enum, not the raw int. */
    static const char *kNvoptTags[] = {
        "nvopt<O0>", "nvopt<O1>", "nvopt<O2>", "nvopt<O3>",
        "nvopt<Ofcmin>", "nvopt<Ofcmid>", "nvopt<Ofcmax>",
    };
    nvopt_level_t lv = clamp_nvopt(level);                /* user int → enum */
    md_t  *tag = md_string(kNvoptTags[lv]);
    md_t  *node = md_tuple2(md_string("nvopt"), tag);
    loop->latch_branch->md_loopid = md_loopid_append(loop->latch_branch->md_loopid, node);
}

sub_226C400 (the pipeline assembler) reads the tag back, and selects one of seven internal default<...> pass pipelines per loop. The mapping is fixed:

Source pragmaMDNode tagPipeline assembled
#pragma nvopt 0nvopt<O0>default<O0>
#pragma nvopt 1nvopt<O1>default<O1>
#pragma nvopt 2nvopt<O2>default<O2>
#pragma nvopt 3nvopt<O3>default<O3>
-Ofast-compile=min globalnvopt<Ofcmin>Minimal Ofast-compile
-Ofast-compile=mid globalnvopt<Ofcmid>Mid-tier Ofast-compile
-Ofast-compile=max globalnvopt<Ofcmax>Aggressive Ofast-compile

The trailing diagnostic at 0x4364520 (Cannot specify -O#/-Ofast-compile=<...> and --passes=/--foo-pass, use -passes='default<O#>,other-pass' or -passes='default<Ofcmax>,other-pass') makes the mutual-exclusion rule explicit: once a --passes= is given on the command line, #pragma nvopt is silently respected only if it matches the requested default; otherwise the front-end diagnostic at line 737 of sub_226C400 fires.

QUIRK — nvopt<Ofcmax> is a per-loop kill switch The naïve reading is that -Ofast-compile (a CLI flag) and #pragma nvopt (a source directive) are independent. They aren't. The same enum drives both: sub_226C400 reads the loop-attached nvopt<Ofcmax> tag and switches the pipeline used for just that loop's body to default<Ofcmax>, even when the module was compiled with -O3. This effectively means a single #pragma nvopt 6 (the Ofcmax index) inside an otherwise -O3 translation unit will demote a hot loop to the lightweight pipeline. The error path is symmetrical: setting --passes=... globally and then leaving #pragma nvopt in source produces no diagnostic — the loop tag is silently overridden by the CLI override.

4. #pragma unroll — The Loop-Unroll Metadata Bridge

Unlike nvopt, #pragma unroll is not a cicc invention — it lowers to the standard LLVM loop unrolling metadata family at llvm.loop.unroll.*. sub_19BB5C0 (computeUnrollCount) is the consumer; the lowering producer lives in EDG's IR emitter. The full set of metadata strings recovered from the binary (HIGH — 0x4281800.. cluster, all read by sub_19BB5C0):

Source formMDNode keyEffect in computeUnrollCount
#pragma unroll N (N ≥ 2)llvm.loop.unroll.count + i32 NForces unroll-by-N, 2nd-priority branch
#pragma unroll 1llvm.loop.unroll.disable (no operand)Hard kill — loop is excluded from loop-unroll
#pragma unroll (no arg)llvm.loop.unroll.full (no operand)Forces full unrolling, 3rd-priority branch
(none, runtime trip)llvm.loop.unroll.runtime.disableDisables the runtime-unroll fallback only
(compiler-injected)llvm.loop.unroll.enableWhitelist back into unrolling after a disable
(downstream chaining)llvm.loop.unroll.followup_unrolled, ..._remainder, ..._allMetadata to attach to the unrolled body / remainder

The priority order observed in sub_19BB5C0 (from decompiled if/else chain at lines 469–700):

  1. llvm.loop.unroll.count present → unroll exactly by that count, never re-attempt full unroll.
  2. llvm.loop.unroll.disable present → silently mark as done; log "Unrolling is disabled by source code \"#pragma unroll 1\"".
  3. llvm.loop.unroll.full present → try full unrolling; falls back to partial if the trip count is unknown or the estimated size exceeds the threshold.
  4. None of the above → try cost-model-driven unrolling, then partial, then runtime unrolling (5th-priority branch).

QUIRK — #pragma unroll 1 is not "unroll once" Programmers occasionally read #pragma unroll 1 as "unroll once" (i.e. unroll-by-2). It isn't — it is encoded as llvm.loop.unroll.disable (no count operand), not llvm.loop.unroll.count i32 1. The lowering happens in EDG before LLVM ever sees it, so even a custom pass that reads the metadata cannot tell the two intents apart. The remediation, if you ever need real unroll-by-1 semantics, is #pragma unroll 2 followed by #pragma unroll (full) on the residual.

5. nv_diag_* — The Diagnostic Steering Pragmas

The nv_diagnostic, nv_diag_suppress, nv_diag_warning, nv_diag_error, nv_diag_remark, nv_diag_once, and nv_diag_default keywords (strings cluster at 0x3AC71A7..0x3AC7202) gate per-diagnostic severity. None of these emit IR metadata — they mutate EDG's diagnostic-severity stack, which is consulted purely at the front-end. Hence they do not survive into the bitcode at all.

The stack is pushed and popped lexically by #pragma push_macro/#pragma pop_macro (the diagnostic store is implemented as a stack-of-stacks: outer stack pushed by #pragma diagnostic push, inner per-diag stack indexed by diagnostic ID). The error message at 0x3AC... (no '#pragma diagnostic push' was found to match this 'diagnostic pop') confirms a strict balanced-stack discipline; an unbalanced pop logs and is ignored rather than rejected.

KeywordActionArgument
nv_diag_suppressSuppress diagnostic; no renderingDiagnostic number or error_number literal
nv_diag_warningOverride severity to warningDiagnostic number
nv_diag_errorOverride severity to errorDiagnostic number
nv_diag_remarkOverride severity to remarkDiagnostic number
nv_diag_onceRender at most once per TUDiagnostic number
nv_diag_defaultReset to compile-defaultDiagnostic number
nv_diagnosticStack controlpush / pop

6. The nvvm.annotations Named-Metadata Tuple Format

After EDG produces NVVM IR, all surviving per-function attributes are encoded into a single named metadata node nvvm.annotations (string at 0x3F256xx, written by every per-attribute path under sub_A84F90). Each tuple is one MDNode of the form:

!{ <Function*> | <GlobalVariable*>,  <Tag-String>,  <Operand>... }

The full tag inventory recovered from string xrefs (HIGH — every key is a string literal anchored on sub_A84F90 directly):

Tag stringOperand typesSets
"kernel"i32 1Marks the function as a __global__ kernel entry point
"maxntidx" / "maxntidy" / "maxntidz"i32__launch_bounds__ thread-count, per axis
"reqntidx" / "reqntidy" / "reqntidz"i32__launch_bounds__(blockDim, _, _) strict variant
"minctasm"i32Min CTAs per SM (from __launch_bounds__(_, _, minctasm))
"maxnreg"i32Per-function register cap (__maxnreg__)
"cluster_dim_x" / "cluster_dim_y" / "cluster_dim_z"i32__cluster_dims__ sm_90+
"cluster_max_blocks"i32__cluster_dims__ max-blocks-per-cluster operand
"grid_constant"(no operand; appears with parameter index list)Each __grid_constant__ parameter index, on kernel function
"managed"i32 1Globals declared __managed__
"texture" / "surface"(variable-only)texture<> / surface<> typed globals

Bit-Layout of the MDNode Header

  ┌─────────────────────────────────────────────────────────────────────┐
  │ MDNode header (16-byte aligned)                                     │
  ├──────────┬──────────┬──────────┬──────────┬──────────┬──────────────┤
  │ tag-byte │ flags    │ refcount │ pad      │ num-ops  │ small-storage│
  │  (1B)    │  (1B)    │  (2B)    │  (4B)    │  (4B)    │   inline ops │
  └──────────┴──────────┴──────────┴──────────┴──────────┴──────────────┘
                                                              ↓
                                                       if num-ops > 1:
                                                       external array

The flags byte's bit 1 (& 0x02) distinguishes inline-storage from external-storage tuples — the dispatch code at line 132 of sub_A84F90 reads it as v10 & 2 to decide whether the operand pointer is at offset -32 (large MDNode) or computed in-place from -16 - 8*((v10 >> 2) & 0xF) (small MDNode). This packing is the same scheme LLVM uses for MDNode::isInline(), but the byte position is shifted by NVIDIA's modifications to support the heavily-tagged annotation traffic; a tool generating bitcode for cicc may produce either form as long as the standard LLVM bitcode reader emits the canonical layout.

7. The Central Dispatcher sub_A84F90

sub_A84F90 (3 020 bytes, 153 basic blocks) is the single entry point that decomposes the entire nvvm.annotations tuple list into per-function Attribute strings. It is called once per module immediately after IR parsing and before any optimisation pass runs (callers: sub_1068BA0, sub_1214F10, sub_225A270, sub_9FEAF0, sub_A85B60 — all bitcode loader paths).

/* Reconstructed contract of sub_A84F90, the nvvm.annotations dispatcher. */
void nvvm_annotations_to_function_attrs(Module *M) {
    NamedMDNode *N = M->getNamedMetadata("nvvm.annotations");
    if (!N) return;

    GlobalValue *seen[8];                       /* small inline dedup set     */
    unsigned    seen_n = 0;
    for (unsigned i = 0, e = N->getNumOperands(); i != e; ++i) {
        MDNode *tup = N->getOperand(i);
        /* op[0] = GV*, op[1] = tag string, op[2..] = operands */
        Value *gv_meta = tup->getOperand(0);
        if (!gv_meta) continue;                 /* dropped GV — skip         */
        GlobalValue *gv = mdNodeToGV(gv_meta);
        if (gv->getValueID() != GlobalValueValueID) continue;

        /* Tag dispatch — strings interned in BSS, compared by length+memcmp.  */
        StringRef tag = cast<MDString>(tup->getOperand(1))->getString();
        if      (tag == "maxntidx")            forward_to(sub_CE8B40, gv, tup);
        else if (tag == "maxntidy")            forward_to(sub_CE8B80, gv, tup);
        else if (tag == "maxntidz")            forward_to(sub_CE8BC0, gv, tup);
        else if (tag == "nvvm.maxntid")        forward_to(sub_CE7350, gv, tup); /* fused tuple form */
        else if (tag == "reqntidx" || ... )    forward_to(sub_CE7350, gv, tup);
        else if (tag == "nvvm.cluster_dim")    forward_to(sub_CE8EA0, gv, tup);
        else if (tag == "cluster_dim_x" /*..*/) forward_to(sub_CE8C00 /*..*/, gv, tup);
        else if (tag == "nvvm.maxclusterrank") forward_to(sub_CE9030, gv, tup);
        else if (tag == "nvvm.minctasm")       forward_to(sub_CE90E0, gv, tup);
        else if (tag == "nvvm.maxnreg")        forward_to(sub_CE9180, gv, tup);
        else if (tag == "nvvm.kernel")         set_kernel_calling_conv(gv);
        else if (tag == "grid_constant")       set_grid_constant_param(gv, tup);
        else if (tag == "managed")             set_managed_global(gv);
        else                                   /* unknown — silently drop  */;

        /* Dedup: keep a small open-addressed set, fall through to a growable */
        /* heap allocation when seen_n > 8 (see v94/v95/v100 in decompilation). */
        if (seen_n < 8 && !contains(seen, seen_n, gv)) seen[seen_n++] = gv;
    }
}

The per-tag accessors (sub_CE8D40 family) are the write side. They expand a fused MDNode like ("nvvm.maxntid", i32 X, i32 Y, i32 Z) into three separate function string attributes "nvvm.maxntid.x"="X", "nvvm.maxntid.y"="Y", "nvvm.maxntid.z"="Z" and similar for cluster_dim. When the annotation is in the split form (maxntidx + maxntidy + maxntidz as three tuples), the same accessor instead reads the three child accessors sub_CE8B40/sub_CE8B80/sub_CE8BC0 and produces the same attribute strings — both forms converge on a single canonical representation. The cluster path takes the same shape:

/* sub_CE8EA0 — exact decompilation, paraphrased. */
void apply_cluster_dim(Function *F, MDNode *md) {
    if (md_has_inline_xyz(md, "nvvm.cluster_dim")) {       /* fused form */
        F->addFnAttr("nvvm.cluster_dim",
                     int_to_str(md->getOperand(2)) + "," +
                     int_to_str(md->getOperand(3)) + "," +
                     int_to_str(md->getOperand(4)));
    } else {                                                /* split form */
        F->addFnAttr("nvvm.cluster_dim.x", int_to_str(sub_CE8C00(md)));
        F->addFnAttr("nvvm.cluster_dim.y", int_to_str(sub_CE8C40(md)));
        F->addFnAttr("nvvm.cluster_dim.z", int_to_str(sub_CE8C80(md)));
    }
}

sub_CE9030 (the maxclusterrank reader) shows the same union pattern, additionally accepting the legacy tag cluster_max_blocks as a synonym for nvvm.maxclusterrank. The two strings are not aliased anywhere — they are read by separate code paths in three callers (sub_2C771D0, sub_3022E70, sub_3074D80).

QUIRK — silent acceptance of unknown tags Unknown annotation tags do not trigger a diagnostic. sub_A84F90's fall-through path is empty (no report_fatal_error, no errs() << ...). This is by design: NVIDIA uses nvvm.annotations as a forward-compatible extension mechanism. Bitcode produced for a newer SM target can carry tags the current cicc does not understand, and the bitcode reader will accept the module rather than fail. The downside is that a typo (maxnitdx instead of maxntidx) yields no warning and silently disables the launch-bound — exactly the kind of bug that is hard to debug without explicit knowledge of this fall-through.

8. Function String Attributes (Post-Dispatch)

After sub_A84F90 runs, the same information lives in two places simultaneously: the original nvvm.annotations MDNode (kept for round-trip purposes and ThinLTO summary emission) and a set of Function::Attribute strings consumed by the back-end code generator. Strings that appear directly on functions (HIGH — distinct xrefs from sub_A84F90 family):

Attribute stringOperand encodingRead by
"nvvm.kernel"none (presence ⇒ kernel)NVPTX calling-conv selection; PTX .entry emission
"nvvm.maxntid"comma-separated x,y,z integersNVPTX .maxntid directive emitter
"nvvm.maxntid.x" / .y / .zsingle integerLegacy split form; same emitter
"nvvm.reqntid" (and .x/.y/.z)integersNVPTX .reqntid directive
"nvvm.minctasm"integerNVPTX .minnctapersm directive
"nvvm.maxnreg"integerRegister allocator hard cap; PTX .maxnreg
"nvvm.cluster_dim" (and .x/.y/.z)integerssm_90 cluster launch encoding
"nvvm.maxclusterrank"integersm_90 .maxclusterrank
"nvvm.annotations_transplanted"sentinelMarks a function whose annotations were lifted from inlined callees

The transplant sentinel "nvvm.annotations_transplanted" is the inliner's bookkeeping mechanism: when an __device__ callee with attached annotations is inlined into a __global__ caller, the inliner must decide whether to inherit annotations such as "maxnreg". The sentinel marks the caller as having absorbed annotations so the loader does not re-process them on a subsequent re-link. sub_CE9220 is the producer (HIGH — string at 0x3F25..., single xref).

9. End-to-End Dataflow

        SOURCE                       EDG FRONT-END                          NVVM IR                          LLVM PIPELINE
   ────────────────             ──────────────────────                ──────────────────────              ────────────────────
   __launch_bounds__   ──┐
   __maxnreg__         ──┤      attribute byte tags                  ┌─→  nvvm.annotations           ┌─→  Function string attrs
   __cluster_dims__    ──┼─────►    (V..r in 0x56..0x6C)   ─────────┤                                │       (set by sub_A84F90)
   __grid_constant__   ──┤      sub_5C79F0 (decoder)                 │                                │
   __managed__         ──┘                                           │                                │
                                                                     │                                │
   #pragma nv_abi      ──────► validator (sub_~1000B)  ──── abi_overrides bitmask  ──────────────────►  "nvvm.abi.opts"
                                                                                                       
   #pragma nvopt N     ──────► sub_12C35D0 / sub_225D540  ──── nvopt MDNode  ──────────────────► loop-attached !nvopt<O…>
                                                                                                  read by sub_226C400
                                                                                                  pipeline assembler

   #pragma unroll N    ──────► EDG IR emitter  ──── llvm.loop.unroll.count  ──────────────────► sub_19BB5C0 computeUnrollCount
   #pragma unroll 1                            ──── llvm.loop.unroll.disable
   #pragma unroll                              ──── llvm.loop.unroll.full

   #pragma nv_diag_*   ──────► diagnostic-severity stack (front-end-only; never leaves EDG)

Three observations from this diagram:

  • The only pragma whose effect is invisible in the bitcode is nv_diag_* — every other steering directive has at least one IR-level shadow.
  • nvopt is the only pragma whose effect is selected at the pipeline-assembly phase rather than during a specific pass — it changes which passes run, not how they behave.
  • nv_abi is the only pragma that produces a bit-packed attribute (abi_overrides), as opposed to a flat tuple. All other attributes survive as integers, strings, or sentinels.

10. Validation Checklist (for Bitcode Producers)

A toolchain that emits cicc-compatible NVVM IR without going through the C++ front-end (think: a custom DSL compiler, or a CUDA-Rust back-end) must produce annotations the dispatcher will accept. The minimum compliance set:

  1. Emit a nvvm.annotations named metadata node, even if empty (some downstream passes assume its existence; HIGH for the kernel-bearing case, MED for empty-module case).
  2. For every __global__ function, emit a ("kernel", i32 1) tuple. Without it, the function is treated as a __device__ function and is not placed in PTX .entry blocks.
  3. For __launch_bounds__, prefer the split form (("maxntidx", i32 …) + ("maxntidy", …) + ("maxntidz", …)) — the dispatcher accepts both, but the split form is what the EDG front-end emits, so it has the most test coverage.
  4. For __cluster_dims__, do not mix split and fused forms in the same module. The dispatcher tolerates the mix, but downstream verifier passes (see NVVM IR Verifier) check consistency per-function and may diagnose.
  5. For loop metadata, attach the llvm.loop.unroll.* MDNodes to the latch branch's !llvm.loop chain — attaching to the header or pre-header silently disables the unroll directive.
  6. Never duplicate annotations: the dispatcher deduplicates by global-value pointer in an 8-element open-addressed set, then spills to a growable list. Duplicates do not cause harm, but they bloat ThinLTO summaries.

Cross-References