Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Attribute System Overview

cudafe++ processes CUDA attributes through NVIDIA's customization of the EDG 6.6 attribute subsystem. EDG provides a general-purpose attribute infrastructure in attribute.c (approximately 11,500 lines of source, spanning addresses 0x409350--0x418F80 in the binary) that handles C++11 [[...]] attributes, GNU __attribute__((...)), MSVC __declspec, and alignas. NVIDIA extends this infrastructure by injecting CUDA-specific attribute kinds into EDG's attribute kind enumeration, registering CUDA-specific handler callbacks, and adding a post-declaration validation pass that enforces cross-attribute consistency rules (e.g., __launch_bounds__ requires __global__). The kind enum reserves a contiguous block of 25 values in the ASCII printable range (86..110, with gaps at 96-101, 103-106, 109); the 14 commonly-encountered kinds are tabulated below, with the remaining 11 slots documented in the Reserved Kind Slots section as internal/unallocated values.1

The attribute system operates in four phases: scanning (lexer recognizes attribute syntax and builds attribute node lists), lookup (maps attribute names to descriptors via a hash table), application (dispatches to per-attribute handler functions that modify entity nodes), and validation (post-declaration consistency checks). CUDA attributes participate in all four phases, using the same node structures and dispatch mechanisms as standard C++/GNU attributes.

CUDA Attribute Kind Enum

Every attribute node carries a kind byte at offset +8. For standard C++/GNU attributes, EDG assigns kinds from its built-in descriptor table (byte_82C0E0 in the .rodata segment). For CUDA attributes, NVIDIA reserves a block of kind values in the ASCII printable range. The function attribute_display_name (sub_40A310, from attribute.c:1307) contains the authoritative switch table that maps kind values to human-readable names:

KindHexASCIIDisplay NameCategoryHandler
860x56'V'__host__Execution spacesub_4108E0
870x57'W'__device__Execution spacesub_40EB80
880x58'X'__global__Execution spacesub_40E1F0 / sub_40E7F0
890x59'Y'__tile_global__Execution space(internal)
900x5A'Z'__shared__Memory spacesub_40E0D0 (shared path)
910x5B'['__constant__Memory spacesub_40E0D0 (constant path)
920x5C'\'__launch_bounds__Launch configsub_411C80
930x5D']'__maxnreg__Launch configsub_410F70
940x5E'^'__local_maxnreg__Launch configsub_411090
950x5F'_'__tile_builtin__Internal(internal)
1020x66'f'__managed__Memory spacesub_40E0D0 (managed path)
1070x6B'k'__cluster_dims__Launch configsub_4115F0
1080x6C'l'__block_size__Launch configsub_4109E0
1100x6E'n'__nv_pure__Optimization(internal)

The kind values are not contiguous. Kinds 86--95 form a dense block for the original CUDA attributes. Kinds 102, 107, 108, and 110 were added later (managed memory in CUDA 6.0, cluster dimensions in CUDA 11.8, block size and nv_pure more recently), occupying gaps in the ASCII range.

Reserved Kind Slots

The kind dispatch logic in sub_40A310, sub_40FDB0, and the apply-handler switches collectively reserves 25 enum slots in the range 86..110. Eleven of these have no user-visible mapping in this binary revision: they appear as fall-through targets in the dispatcher (returning the raw name pointer rather than a CUDA display string), but no descriptor entry in off_D46820 writes them into an attribute node, and no diagnostic references them by name. They are listed for completeness:

Kind rangeHexASCIIStatus
960x60'`'Reserved / unallocated (no descriptor)
970x61'a'Reserved / unallocated
980x62'b'Reserved / unallocated
990x63'c'Reserved / unallocated
1000x64'd'Reserved / unallocated
1010x65'e'Reserved / unallocated
1030x67'g'Reserved / unallocated
1040x68'h'Reserved / unallocated
1050x69'i'Reserved / unallocated
1060x6A'j'Reserved / unallocated
1090x6D'm'Reserved / unallocated

The eleven gaps interleave the 14 documented kinds in a pattern consistent with revisional history: blocks of slots were reserved en bloc when a new CUDA feature was on the roadmap, but only a subset of the slots was ever wired up. Reverse-engineering the binary at hand cannot distinguish "reserved for a future feature" from "retired after a feature was renamed or merged"; both produce identical evidence (kind value handled in dispatch but absent from descriptor table). Consumers should not rely on any particular semantics for these eleven values when re-implementing the attribute subsystem.

attribute_display_name (sub_40A310)

This function serves dual duty: it formats the display name for diagnostic messages, and its switch table is the canonical enumeration of all CUDA attribute kinds. The logic:

// sub_40A310 -- attribute_display_name (attribute.c:1307)
// a1: pointer to attribute node
const char* attribute_display_name(attr_node_t* a1) {
    const char* name = a1->name;           // +16
    const char* ns   = a1->namespace_str;  // +24

    // If scoped (namespace::name), format "namespace::name"
    if (ns) {
        size_t ns_len = strlen(ns);
        assert(ns_len + strlen(name) + 3 <= 204);  // buffer byte_E7FB80
        sprintf(byte_E7FB80, "%s::%s", ns, name);
        name = intern_string(byte_E7FB80);  // sub_5E0700
    }

    // Override with CUDA display name based on kind byte
    switch (a1->kind) {  // byte at +8
        case 'V': return "__host__";
        case 'W': return "__device__";
        case 'X': return "__global__";
        case 'Y': return "__tile_global__";
        case 'Z': return "__shared__";
        case '[': return "__constant__";
        case '\\': return "__launch_bounds__";
        case ']': return "__maxnreg__";
        case '^': return "__local_maxnreg__";
        case '_': return "__tile_builtin__";
        case 'f': return "__managed__";
        case 'k': return "__cluster_dims__";
        case 'l': return "__block_size__";
        case 'n': return "__nv_pure__";
        default:  return name ? name : "";
    }
}

The 204-byte static buffer byte_E7FB80 is shared across calls (not thread-safe, but cudafe++ is single-threaded per translation unit). The intern_string call (sub_5E0700) ensures the formatted "namespace::name" string is deduplicated into EDG's permanent string pool.

Attribute Node Structure

Every attribute is represented by a 72-byte IL node (entry kind 0x48 = attribute). The node layout:

struct attr_node_t {               // 72 bytes, IL entry kind 0x48
    attr_node_t*  next;            // +0   next attribute in list
    uint8_t       kind;            // +8   attribute kind byte (CUDA: 'V'..'n')
    uint8_t       source_mode;     // +9   1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang
    uint8_t       target_kind;     // +10  what entity type this targets
    uint8_t       flags;           // +11  bit 0=applies_to_params
                                   //      bit 1=skip_arg_check
                                   //      bit 4=scoped attribute
                                   //      bit 7=unknown/unrecognized
    uint32_t      _pad;            // +12  (alignment)
    const char*   name;            // +16  attribute name string
    const char*   namespace_str;   // +24  namespace (NULL for unscoped)
    arg_node_t*   arguments;       // +32  argument list head
    void*         source_pos;      // +40  source position info
    void*         decl_context;    // +48  declaration context / scope
    void*         src_loc_1;       // +56  source location
    void*         src_loc_2;       // +64  secondary source location
};

For CUDA attributes, the kind byte at offset +8 is the discriminator. When get_attr_descr_for_attribute (sub_40FDB0) resolves an attribute name, it writes the corresponding kind value from the descriptor table (byte_82C0E0) into this field. All subsequent dispatch operates on this byte alone.

The source_mode byte at +9 indicates the syntactic form the user wrote. CUDA attributes like __host__ are parsed as GNU-style attributes (source_mode = 2), because cudafe++ defines them via __attribute__((...)) internally.

Attribute Descriptor Table and Name Lookup

Master Descriptor Table (off_D46820)

The attribute descriptor table is a static array in .rodata at off_D46820, extending to unk_D47A60. Each entry is 32 bytes and encodes:

  • Attribute name string
  • Kind byte (written to attr_node_t.kind on match)
  • Handler function pointer (the apply_* callback)
  • Mode/version condition string (e.g., 'g' for GCC-only, 'l' for Clang-only)
  • Target applicability mask

Initialization: init_attr_name_map (sub_418F80)

At startup, init_attr_name_map iterates the descriptor table, validates each name is at most 100 characters, and inserts it into the hash table qword_E7FB60 (created via sub_7425C0). This hash table enables O(1) lookup of attribute names during parsing.

// sub_418F80 -- init_attr_name_map (attribute.c:1524)
void init_attr_name_map(void) {
    attr_name_map = create_hash_table();  // qword_E7FB60
    for (attr_descr* d = off_D46820; d < unk_D47A60; d++) {
        assert(strlen(d->name) <= 100);
        insert_into_hash_table(attr_name_map, d->name, d);
    }
    // Also initializes dword_E7F078 and processes config if dword_106BF18 set
}

A companion function init_attr_token_map (sub_419070) creates a second hash table qword_E7F038 that maps attribute tokens to their descriptors, used during lexer-level attribute recognition.

Name Normalization: sub_40A250

Before looking up an attribute name, EDG strips __ prefixes and suffixes. The function at sub_40A250 checks whether the name starts with "__" and ends with "__", strips them, and looks up the bare name in qword_E7FB60. This means __host__, __attribute__((host)), and host all resolve to the same descriptor. The stripping respects the current language standard (dword_126EFB4) and C++ version (dword_126EF68).

Central Dispatch: get_attr_descr_for_attribute (sub_40FDB0)

This 227-line function is the central attribute resolution path. Given an attribute node with a name, it:

  1. Looks up the name in the hash table
  2. Checks mode compatibility (GCC mode via dword_126EFA8, Clang mode via dword_126EFA4, MSVC mode via dword_106BF68/dword_106BF58)
  3. Checks namespace match ("gnu", "__gnu__", "clang") via cond_matches_attr_mode (sub_40C4C0)
  4. Evaluates version-conditional availability via in_attr_cond_range (sub_40D620)
  5. Writes the kind byte from the matched descriptor into attr_node_t.kind
  6. Returns the descriptor entry (which carries the handler function pointer)

The mode condition strings use a compact encoding: 'g'=GCC, 'l'=Clang, 's'=Sun, 'c'=C++, 'm'=MSVC; 'x'=extension, '+'=positive match, '!'=boundary marker.

Attribute Application Pipeline

Phase 1: Scanning

The lexer recognizes attribute syntax and calls into the scanning functions:

FunctionAddressRole
scan_std_attribute_groupsub_412650Parses [[...]] C++11 and __attribute__((...)) GNU attributes
scan_gnu_attribute_groupssub_412F20Handles __attribute__((...)) specifically
scan_attributes_listsub_4124A0Iterates token stream building attribute node lists
parse_attribute_argument_clausesub_40C8B0Parses attribute argument expressions
get_balanced_tokensub_40C6C0Handles balanced parentheses/brackets in arguments

Scanning produces a linked list of attr_node_t nodes. At this stage, the kind byte is unset; only the name and namespace_str fields are populated.

Phase 2: Lookup and Kind Assignment

When the parser reaches a declaration, get_attr_descr_for_attribute resolves each attribute name to a descriptor and writes the kind byte. For CUDA attributes, this assigns values in the 'V'--'n' range.

Phase 3: Application -- apply_one_attribute (sub_413240)

The central dispatcher is a 585-line function containing a switch on the kind byte. For each CUDA kind, it calls the corresponding handler:

// sub_413240 -- apply_one_attribute (attribute.c, main dispatch)
// 585 lines, giant switch on attribute kind
void apply_one_attribute(attr_node_t* attr, entity_t* entity, int target_kind) {
    switch (attr->kind) {
        case 'V':  apply_nv_host_attr(attr, entity, target_kind);     break;
        case 'W':  apply_nv_device_attr(attr, entity, target_kind);   break;
        case 'X':  apply_nv_global_attr(attr, entity, target_kind);   break;
        case 'Z':  apply_nv_shared_attr(attr, entity, target_kind);   break;
        case '[':  apply_nv_constant_attr(attr, entity, target_kind); break;
        case '\\': apply_nv_launch_bounds(attr, entity, target_kind); break;
        case ']':  apply_nv_maxnreg_attr(attr, entity, target_kind);  break;
        case '^':  apply_nv_local_maxnreg(attr, entity, target_kind); break;
        case 'f':  apply_nv_managed_attr(attr, entity, target_kind);  break;
        case 'k':  apply_nv_cluster_dims(attr, entity, target_kind);  break;
        case 'l':  apply_nv_block_size(attr, entity, target_kind);    break;
        // ... standard attributes handled similarly ...
    }
}

The outer iteration is apply_attributes_to_entity (sub_413ED0, 492 lines), which walks the attribute list, calls apply_one_attribute for each, and handles deferred attributes, attribute merging, and ordering constraints.

Phase 4: Post-Declaration Validation -- sub_6BC890

After all attributes on a declaration are applied, sub_6BC890 (nv_validate_cuda_attributes, from nv_transforms.c) performs cross-attribute consistency checking. This function validates that combinations of CUDA attributes are legal:

// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c)
// a1: entity (function), a2: diagnostic location
void nv_validate_cuda_attributes(entity_t* fn, source_loc_t* loc) {
    if (!fn || (fn->byte_177 & 0x10))  // skip if null or already validated
        return;

    uint8_t exec_space = fn->byte_182;  // CUDA execution space bits
    launch_config_t* lc = fn->launch_config;  // entity+256

    // Check 1: parameters with rvalue-reference in __global__ functions
    // Walks parameter list, emits error 3702 for ref-qualified params

    // Check 2: __nv_register_params__ on __host__-only or __global__
    if (fn->byte_183 & 0x08) {
        if (exec_space & 0x40)       // __global__
            emit_error(3661, "__global__");
        else if ((exec_space & 0x30) == 0x20)  // __host__ only (no __device__)
            emit_error(3661, "__host__");
    }

    // Check 3: __launch_bounds__ without __global__
    if (lc && !(exec_space & 0x40)) {
        if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
            emit_error(3534, "__launch_bounds__");
    }

    // Check 4: __cluster_dims__ / __block_size__ without __global__
    if (lc && (fn->byte_183 & 0x40 || lc->cluster_dim_x > 0)) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__" : "__cluster_dims__";
        emit_error(3534, name);
    }

    // Check 5: maxBlocksPerClusterSize exceeds cluster product
    if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
        if (lc->maxBlocksPerClusterSize <
            lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z) {
            emit_error(3707, ...);
        }
    }

    // Check 6: __maxnreg__ without __global__
    if (lc && lc->maxnreg >= 0 && !(exec_space & 0x40))
        emit_error(3715, "__maxnreg__");

    // Check 7: __launch_bounds__ + __maxnreg__ conflict
    if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
        emit_error(3719, "__launch_bounds__ and __maxnreg__");

    // Check 8: __global__ without __launch_bounds__
    if ((exec_space & 0x40) && (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
        emit_warning(3695);  // "no __launch_bounds__ specified for __global__ function"
}

Error Codes in Validation

ErrorSeverityMessage
35347 (error)"%s" attribute is not allowed on a non-__global__ function
36617 (error)__nv_register_params__ is not allowed on a %s function
36954 (warning)no __launch_bounds__ specified for __global__ function
37027 (error)Parameter with rvalue reference in __global__ function
37077 (error)total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit
37157 (error)__maxnreg__ is not allowed on a non-__global__ function
37197 (error)__launch_bounds__ and __maxnreg__ may not be used on the same declaration

Per-Attribute Handler Function Table

Each CUDA attribute has a dedicated apply_* function registered in the descriptor table. These functions modify entity node fields (execution space bits, memory space bits, launch configuration) and emit diagnostics for invalid usage.

AttributeHandlerAddressLinesEntity Fields Modified
__host__apply_nv_host_attrsub_4108E031entity+182 |= 0x15
__device__apply_nv_device_attrsub_40EB80100Functions: entity+182 |= 0x23; Variables: entity+148 |= 0x01
__global__apply_nv_global_attrsub_40E1F089entity+182 |= 0x61
__global__ (variant 2)apply_nv_global_attrsub_40E7F086Same as above (alternate entry point)
__shared__(via device attr path)----entity+148 |= 0x02
__constant__(via device attr path)----entity+148 |= 0x04
__managed__apply_nv_managed_attrsub_40E0D047entity+148 |= 0x01, entity+149 |= 0x01
__launch_bounds__apply_nv_launch_bounds_attrsub_411C8098entity+256 -> launch config +0, +8, +16
__maxnreg__apply_nv_maxnreg_attrsub_410F7067entity+256 -> launch config +32
__local_maxnreg__apply_nv_local_maxnreg_attrsub_41109067entity+256 -> launch config +36
__cluster_dims__apply_nv_cluster_dims_attrsub_4115F0145entity+256 -> launch config +20, +24, +28
__block_size__apply_nv_block_size_attrsub_4109E0265entity+256 -> launch config +40..+52
__nv_register_params__apply_nv_register_params_attrsub_40B0A038entity+183 |= 0x08

Attribute Registration (sub_6B5E50)

The function sub_6B5E50 (160 lines, in the nv_transforms.c / mem_manage.c area) registers NVIDIA-specific pseudo-attributes into EDG's keyword and macro systems at startup. It operates after EDG's standard keyword initialization but before parsing begins.

The registration creates macro-like definitions that the lexer expands before attribute processing. The function:

  1. Allocates attribute definition nodes via sub_6BA0D0 (EDG's node allocator)
  2. Looks up existing definitions via sub_734430 (hash table search) -- if a definition already exists, it chains the new handler onto it via sub_6AC190
  3. Creates new keyword entries via sub_749600 if no prior definition exists
  4. Registers __nv_register_params__ as a 40-byte attribute definition node (kind marker 8961) with chain linkage
  5. Registers __noinline__ as a 30-byte attribute definition node (kind marker 6401), including the "oinline))" suffix for __attribute__((__noinline__)) expansion
  6. Conditionally registers ARM SME attributes (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible) via sub_6ACCB0 when Clang version >= 180000 and ARM target flags are set
  7. Registers _Pragma as an operator-like keyword for _Pragma("...") processing

If any registration fails (the existing entry cannot be extended), it emits internal error 1338 with the attribute name and calls sub_6B6280 (fatal error handler).

Entity Node: CUDA Attribute Fields

CUDA attributes modify specific byte fields in entity nodes. The key fields for a reimplementation:

Execution Space (entity+182)

Bit 0 (0x01): __device__           set by apply_nv_device_attr
Bit 2 (0x04): __host__             set by apply_nv_host_attr
Bit 4 (0x10): (reserved)
Bit 5 (0x20): __host__ explicit    set by apply_nv_host_attr
Bit 6 (0x40): __global__           set by apply_nv_global_attr
Bit 7 (0x80): __host__ __device__  set when both specified

Handlers use OR-masks: __host__ sets 0x15 (bits 0+2+4), __device__ sets 0x23 (bits 0+1+5), __global__ sets 0x61 (bits 0+5+6). The overlap at bit 0 means all execution-space-annotated functions have bit 0 set, which serves as a quick "has CUDA annotation" predicate.

Memory Space (entity+148)

Bit 0 (0x01): __device__           device memory
Bit 1 (0x02): __shared__           shared memory
Bit 2 (0x04): __constant__         constant memory

Extended Memory Space (entity+149)

Bit 0 (0x01): __managed__          managed (unified) memory

Launch Configuration (entity+256)

A pointer to a separately allocated launch_config_t structure (created by sub_5E52F0):

struct launch_config_t {
    uint64_t  maxThreadsPerBlock;          // +0   from __launch_bounds__(N, ...)
    uint64_t  minBlocksPerMultiprocessor;  // +8   from __launch_bounds__(N, M, ...)
    int32_t   maxBlocksPerClusterSize;     // +16  from __launch_bounds__(N, M, K)
    int32_t   cluster_dim_x;              // +20  from __cluster_dims__(X, ...)
    int32_t   cluster_dim_y;              // +24  from __cluster_dims__(X, Y, ...)
    int32_t   cluster_dim_z;              // +28  from __cluster_dims__(X, Y, Z)
    int32_t   maxnreg;                    // +32  from __maxnreg__(N)
    int32_t   local_maxnreg;              // +36  from __local_maxnreg__(N)
    int32_t   block_size_x;              // +40  from __block_size__(X, ...)
    int32_t   block_size_y;              // +44  from __block_size__(X, Y, ...)
    int32_t   block_size_z;              // +48  from __block_size__(X, Y, Z, ...)
    uint8_t   flags;                      // +52  bit 0=cluster_dims_set
                                          //      bit 1=block_size_set
};

This structure is allocated lazily -- only created when a launch configuration attribute is first applied to a function. The allocation function sub_5E52F0 returns a zero-initialized structure with maxnreg = -1 and local_maxnreg = -1 (sentinel for "unset").

Attribute Processing Global State

GlobalAddressPurpose
qword_E7FB600xE7FB60Attribute name hash table (created by init_attr_name_map)
qword_E7F0380xE7F038Attribute token hash table (created by init_attr_token_map)
byte_E7FB800xE7FB80204-byte static buffer for formatted attribute display names
off_D468200xD46820Master attribute descriptor table (32 bytes per entry, extends to 0xD47A60)
qword_E7F0700xE7F070Visibility stack (for __attribute__((visibility(...))) nesting)
qword_E7F0480xE7F048Alias/ifunc free list head
qword_E7F058/E7F0500xE7F058/0xE7F050Alias chain list head/tail
dword_E7F0800xE7F080Attribute processing flags
dword_E7F0780xE7F078Extended attribute config flag

The function reset_attribute_processing_state (sub_4190B0) zeroes all of these at the start of each translation unit.

Function Map

AddressIdentitySourceConfidence
sub_40A250strip_double_underscores_and_lookupattribute.cHIGH
sub_40A310attribute_display_nameattribute.c:1307HIGH
sub_40C4C0cond_matches_attr_modeattribute.cHIGH
sub_40C6C0get_balanced_tokenattribute.cHIGH
sub_40C8B0parse_attribute_argument_clauseattribute.cHIGH
sub_40D620in_attr_cond_rangeattribute.cHIGH
sub_40E0D0apply_nv_managed_attrattribute.c:10523HIGH
sub_40E1F0apply_nv_global_attr (variant 1)attribute.cHIGH
sub_40E7F0apply_nv_global_attr (variant 2)attribute.cHIGH
sub_40EB80apply_nv_device_attrattribute.cHIGH
sub_40FDB0get_attr_descr_for_attributeattribute.c:1902HIGH
sub_4108E0apply_nv_host_attrattribute.cHIGH
sub_4109E0apply_nv_block_size_attrattribute.cHIGH
sub_410F70apply_nv_maxnreg_attrattribute.cHIGH
sub_411090apply_nv_local_maxnreg_attrattribute.cHIGH
sub_4115F0apply_nv_cluster_dims_attrattribute.cHIGH
sub_411C80apply_nv_launch_bounds_attrattribute.cHIGH
sub_412650scan_std_attribute_groupattribute.c:2914HIGH
sub_413240apply_one_attributeattribute.cHIGH
sub_413ED0apply_attributes_to_entityattribute.cHIGH
sub_418F80init_attr_name_mapattribute.c:1524HIGH
sub_419070init_attr_token_mapattribute.cHIGH
sub_4190B0reset_attribute_processing_stateattribute.cHIGH
sub_6B5E50process_nv_register_params / attribute registrationnv_transforms.cHIGH
sub_6BC890nv_validate_cuda_attributesnv_transforms.cVERY HIGH

Per-Attribute IL Emission Matrix

Every CUDA attribute is parsed into a 72-byte attribute IL node (entry kind 0x48). What happens after application is what determines downstream visibility. The matrix below classifies each attribute by its emission path: whether the attribute survives as a discrete IL node, collapses into entity bitfield mutations, allocates the side-band launch-config struct, or is preserved on the entity's attribute chain for the .int.c writer to re-emit.

AttributeParse-time IL nodePost-apply form.int.c re-emissionPipeline consumer
__host__kind 0x48, byte +8 = 'V'entity+182 |= 0x15-- (entity bits only)Device/host splitter (mark_to_keep_in_il)
__device__kind 0x48, byte +8 = 'W'functions: entity+182 |= 0x23; variables: entity+148 |= 0x01-- (entity bits only)Device/host splitter, cross-space checker
__global__kind 0x48, byte +8 = 'X'entity+182 |= 0x61, then |= 0x80host stub generator emits launch wrapperKernel stub emitter (sub_489000)
__tile_global__kind 0x48, byte +8 = 'Y'(no handler) attribute node kept on attr chainpreserved through chain walkcicc (downstream)
__shared__kind 0x48, byte +8 = 'Z'entity+148 |= 0x02-- (entity bits only)Memory-space declarator
__constant__kind 0x48, byte +8 = '['entity+148 |= 0x04-- (entity bits only)Memory-space declarator
__launch_bounds__kind 0x48, byte +8 = '\\'entity+256 -> launch_config_t (+0, +8, +16)re-emitted as IL kind 25 via sub_540560cicc (NVVM IR generator)
__maxnreg__kind 0x48, byte +8 = ']'launch_config+32-- (struct field)cicc, ptxas
__local_maxnreg__kind 0x48, byte +8 = '^'launch_config+36-- (struct field)cicc, ptxas
__tile_builtin__kind 0x48, byte +8 = '_'(no handler) attribute node kept on attr chainpreserved through chain walkcicc (downstream)
__managed__kind 0x48, byte +8 = 'f'entity+148 |= 0x01 + entity+149 |= 0x01comma-op host wrapper + RT boilerplate via sub_489000CUDA runtime (__nv_init_managed_rt)
__cluster_dims__kind 0x48, byte +8 = 'k'launch_config+20/+24/+28 + flag bit 0; zero-arg sets entity+183 |= 0x40-- (struct fields)cicc
__block_size__kind 0x48, byte +8 = 'l'launch_config+40/+44/+48 + flag bit 1; optional +20/+24/+28-- (struct fields)cicc
__nv_pure__kind 0x48, byte +8 = 'n'(no entity mutation) attribute node kept on attr chainre-emitted as IL kind 25 via sub_540560 (shared path with \\)cicc (applies LLVM readonly/willreturn)
__nv_register_params__kind 0x48 via GNU path (no CUDA kind byte)entity+183 |= 0x08-- (entity bits only)cicc (ABI selector)
__forceinline__(no CUDA kind byte; processed via inline-control path)entity+177 |= 0x10emitted as __attribute__((always_inline))Host compiler + cicc
__noinline__ (EDG form)(no CUDA kind byte)entity+179 |= 0x20 (+ ABI node on prototype in C mode)emitted as __attribute__((noinline))Host compiler + cicc
__noinline__ (GNU form)kind 0x48 via GNU __attribute__entity+180 |= 0x80emitted as __attribute__((noinline))Host compiler + cicc
__inline_hint__(no CUDA kind byte)entity+179 |= 0x10emitted as suggestion (non-binding)cicc inlining heuristics
__grid_constant__parameter-level attributeentity+164 |= 0x04, type+133 |= 0x20, param+32 |= 0x02emitted into kernel parameter declarationKernel parameter ABI (cicc + driver)
__restrict__C99/GNU restrict path (not a CUDA kind)Type-qualifier on parameter type chainpreserved as __restrict__ on parameterHost compiler + cicc

Three Emission Categories

Reading the matrix vertically, every CUDA attribute lands in exactly one of three categories:

  1. Collapse to entity bits -- __host__, __device__, __shared__, __constant__, __managed__, __nv_register_params__, __forceinline__, __noinline__, __inline_hint__. After application, no attribute IL node survives. The downstream consumer reads entity bytes (+148, +149, +177, +179, +180, +182, +183) instead. The original kind-0x48 IL node is freed back to the arena when the attribute chain is torn down.

  2. Side-band launch-config struct -- __launch_bounds__, __maxnreg__, __local_maxnreg__, __cluster_dims__, __block_size__. The attribute IL node is consumed; the values are extracted into the 56-byte launch_config_t pointed to by entity+256. __launch_bounds__ additionally walks back through the writer with kind_field = 25 so cicc sees it in .int.c. The other four are consumed entirely by cudafe++ and the launch-config struct is read by later passes (kernel stub generator, ptxas argument formatter).

  3. Preserved on attribute chain -- __nv_pure__, __tile_global__, __tile_builtin__, __grid_constant__. These have no entity-bit collapse (or only a flag bit alongside the chain). The attribute IL node remains attached to the entity through code generation. The .int.c writer (sub_5565E0 family, sub_540560) walks the chain and re-emits the attribute textually so cicc can apply the corresponding LLVM-level semantics. For the __tile_* pair, no cudafe++ consumer exists -- the attribute is pure pass-through.

Why There Is No Per-Attribute IL Node Type

A CUDA attribute is never lowered into a discrete IL node of its own kind. Every attribute reuses the generic attribute kind (0x48), with its arguments stored as attribute_argument nodes (kind 0x49) and the whole bundle wrapped in an attribute_group node (kind 0x4A). The discriminator is the byte at +8 of the 0x48 node ('V'..'n' for CUDA attributes; standard descriptors for GNU/C++11/MSVC attributes). This is a deliberate EDG design choice: the IL graph stays small and the dispatch logic centralizes in apply_one_attribute (sub_413240) rather than fanning out into dozens of node kinds.

The practical consequence is that "which IL node is emitted for attribute X" is the wrong question. The correct question is "which entity-byte mutation does attribute X cause, and which downstream pass reads that byte." The matrix above answers both.

Cross-References


  1. The "25 reserved slots" figure comes from the upper and lower bounds of the CUDA kind range as observed in sub_40A310 (attribute_display_name) and the cross-validating switch dispatchers in sub_40FDB0, sub_410A20, and sub_4115F0. Of the 25, only kinds 86-95, 102, 107, 108, and 110 appear in any user-facing diagnostic, source-level keyword table, or descriptor entry; the others were either never wired up at this binary revision, are reserved internal markers used briefly during parsing, or were retired between EDG/CUDA revisions.