Attribute System Overview

cudafe++ processes CUDA attributes through NVIDIA's customization of the EDG 6.6 attribute subsystem. EDG provides a general-purpose attribute infrastructure in attribute.c (approximately 11,500 lines of source, spanning addresses 0x409350--0x418F80 in the binary) that handles C++11 [[...]] attributes, GNU __attribute__((...)), MSVC __declspec, and alignas. NVIDIA extends this infrastructure by injecting CUDA-specific attribute kinds into EDG's attribute kind enumeration, registering CUDA-specific handler callbacks, and adding a post-declaration validation pass that enforces cross-attribute consistency rules (e.g., __launch_bounds__ requires __global__). The kind enum reserves a contiguous block of 25 values in the ASCII printable range (86..110, with gaps at 96-101, 103-106, 109); the 14 commonly-encountered kinds are tabulated below, with the remaining 11 slots documented in the Reserved Kind Slots section as internal/unallocated values.¹

The attribute system operates in four phases: scanning (lexer recognizes attribute syntax and builds attribute node lists), lookup (maps attribute names to descriptors via a hash table), application (dispatches to per-attribute handler functions that modify entity nodes), and validation (post-declaration consistency checks). CUDA attributes participate in all four phases, using the same node structures and dispatch mechanisms as standard C++/GNU attributes.

CUDA Attribute Kind Enum

Every attribute node carries a kind byte at offset +8. For standard C++/GNU attributes, EDG assigns kinds from its built-in descriptor table (byte_82C0E0 in the .rodata segment). For CUDA attributes, NVIDIA reserves a block of kind values in the ASCII printable range. The function attribute_display_name (sub_40A310, from attribute.c:1307) contains the authoritative switch table that maps kind values to human-readable names:

Kind	Hex	ASCII	Display Name	Category	Handler
86	0x56	`'V'`	`__host__`	Execution space	`sub_4108E0`
87	0x57	`'W'`	`__device__`	Execution space	`sub_40EB80`
88	0x58	`'X'`	`__global__`	Execution space	`sub_40E1F0` / `sub_40E7F0`
89	0x59	`'Y'`	`__tile_global__`	Execution space	(internal)
90	0x5A	`'Z'`	`__shared__`	Memory space	`sub_40E0D0` (shared path)
91	0x5B	`'['`	`__constant__`	Memory space	`sub_40E0D0` (constant path)
92	0x5C	`'\'`	`__launch_bounds__`	Launch config	`sub_411C80`
93	0x5D	`']'`	`__maxnreg__`	Launch config	`sub_410F70`
94	0x5E	`'^'`	`__local_maxnreg__`	Launch config	`sub_411090`
95	0x5F	`'_'`	`__tile_builtin__`	Internal	(internal)
102	0x66	`'f'`	`__managed__`	Memory space	`sub_40E0D0` (managed path)
107	0x6B	`'k'`	`__cluster_dims__`	Launch config	`sub_4115F0`
108	0x6C	`'l'`	`__block_size__`	Launch config	`sub_4109E0`
110	0x6E	`'n'`	`__nv_pure__`	Optimization	(internal)

The kind values are not contiguous. Kinds 86--95 form a dense block for the original CUDA attributes. Kinds 102, 107, 108, and 110 were added later (managed memory in CUDA 6.0, cluster dimensions in CUDA 11.8, block size and nv_pure more recently), occupying gaps in the ASCII range.

Reserved Kind Slots

The kind dispatch logic in sub_40A310, sub_40FDB0, and the apply-handler switches collectively reserves 25 enum slots in the range 86..110. Eleven of these have no user-visible mapping in this binary revision: they appear as fall-through targets in the dispatcher (returning the raw name pointer rather than a CUDA display string), but no descriptor entry in off_D46820 writes them into an attribute node, and no diagnostic references them by name. They are listed for completeness:

Kind range	Hex	ASCII	Status
96	`0x60`	'`'	Reserved / unallocated (no descriptor)
97	`0x61`	`'a'`	Reserved / unallocated
98	`0x62`	`'b'`	Reserved / unallocated
99	`0x63`	`'c'`	Reserved / unallocated
100	`0x64`	`'d'`	Reserved / unallocated
101	`0x65`	`'e'`	Reserved / unallocated
103	`0x67`	`'g'`	Reserved / unallocated
104	`0x68`	`'h'`	Reserved / unallocated
105	`0x69`	`'i'`	Reserved / unallocated
106	`0x6A`	`'j'`	Reserved / unallocated
109	`0x6D`	`'m'`	Reserved / unallocated

The eleven gaps interleave the 14 documented kinds in a pattern consistent with revisional history: blocks of slots were reserved en bloc when a new CUDA feature was on the roadmap, but only a subset of the slots was ever wired up. Reverse-engineering the binary at hand cannot distinguish "reserved for a future feature" from "retired after a feature was renamed or merged"; both produce identical evidence (kind value handled in dispatch but absent from descriptor table). Consumers should not rely on any particular semantics for these eleven values when re-implementing the attribute subsystem.

attribute_display_name (`sub_40A310`)

This function serves dual duty: it formats the display name for diagnostic messages, and its switch table is the canonical enumeration of all CUDA attribute kinds. The logic:

// sub_40A310 -- attribute_display_name (attribute.c:1307)
// a1: pointer to attribute node
const char* attribute_display_name(attr_node_t* a1) {
    const char* name = a1->name;           // +16
    const char* ns   = a1->namespace_str;  // +24

    // If scoped (namespace::name), format "namespace::name"
    if (ns) {
        size_t ns_len = strlen(ns);
        assert(ns_len + strlen(name) + 3 <= 204);  // buffer byte_E7FB80
        sprintf(byte_E7FB80, "%s::%s", ns, name);
        name = intern_string(byte_E7FB80);  // sub_5E0700
    }

    // Override with CUDA display name based on kind byte
    switch (a1->kind) {  // byte at +8
        case 'V': return "__host__";
        case 'W': return "__device__";
        case 'X': return "__global__";
        case 'Y': return "__tile_global__";
        case 'Z': return "__shared__";
        case '[': return "__constant__";
        case '\\': return "__launch_bounds__";
        case ']': return "__maxnreg__";
        case '^': return "__local_maxnreg__";
        case '_': return "__tile_builtin__";
        case 'f': return "__managed__";
        case 'k': return "__cluster_dims__";
        case 'l': return "__block_size__";
        case 'n': return "__nv_pure__";
        default:  return name ? name : "";
    }
}

The 204-byte static buffer byte_E7FB80 is shared across calls (not thread-safe, but cudafe++ is single-threaded per translation unit). The intern_string call (sub_5E0700) ensures the formatted "namespace::name" string is deduplicated into EDG's permanent string pool.

Attribute Node Structure

Every attribute is represented by a 72-byte IL node (entry kind 0x48 = attribute). The node layout:

struct attr_node_t {               // 72 bytes, IL entry kind 0x48
    attr_node_t*  next;            // +0   next attribute in list
    uint8_t       kind;            // +8   attribute kind byte (CUDA: 'V'..'n')
    uint8_t       source_mode;     // +9   1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang
    uint8_t       target_kind;     // +10  what entity type this targets
    uint8_t       flags;           // +11  bit 0=applies_to_params
                                   //      bit 1=skip_arg_check
                                   //      bit 4=scoped attribute
                                   //      bit 7=unknown/unrecognized
    uint32_t      _pad;            // +12  (alignment)
    const char*   name;            // +16  attribute name string
    const char*   namespace_str;   // +24  namespace (NULL for unscoped)
    arg_node_t*   arguments;       // +32  argument list head
    void*         source_pos;      // +40  source position info
    void*         decl_context;    // +48  declaration context / scope
    void*         src_loc_1;       // +56  source location
    void*         src_loc_2;       // +64  secondary source location
};

For CUDA attributes, the kind byte at offset +8 is the discriminator. When get_attr_descr_for_attribute (sub_40FDB0) resolves an attribute name, it writes the corresponding kind value from the descriptor table (byte_82C0E0) into this field. All subsequent dispatch operates on this byte alone.

The source_mode byte at +9 indicates the syntactic form the user wrote. CUDA attributes like __host__ are parsed as GNU-style attributes (source_mode = 2), because cudafe++ defines them via __attribute__((...)) internally.

Attribute Descriptor Table and Name Lookup

Master Descriptor Table (`off_D46820`)

The attribute descriptor table is a static array in .rodata at off_D46820, extending to unk_D47A60. Each entry is 32 bytes and encodes:

Attribute name string
Kind byte (written to attr_node_t.kind on match)
Handler function pointer (the apply_* callback)
Mode/version condition string (e.g., 'g' for GCC-only, 'l' for Clang-only)
Target applicability mask

Initialization: `init_attr_name_map` (`sub_418F80`)

At startup, init_attr_name_map iterates the descriptor table, validates each name is at most 100 characters, and inserts it into the hash table qword_E7FB60 (created via sub_7425C0). This hash table enables O(1) lookup of attribute names during parsing.

// sub_418F80 -- init_attr_name_map (attribute.c:1524)
void init_attr_name_map(void) {
    attr_name_map = create_hash_table();  // qword_E7FB60
    for (attr_descr* d = off_D46820; d < unk_D47A60; d++) {
        assert(strlen(d->name) <= 100);
        insert_into_hash_table(attr_name_map, d->name, d);
    }
    // Also initializes dword_E7F078 and processes config if dword_106BF18 set
}

A companion function init_attr_token_map (sub_419070) creates a second hash table qword_E7F038 that maps attribute tokens to their descriptors, used during lexer-level attribute recognition.

Name Normalization: `sub_40A250`

Before looking up an attribute name, EDG strips __ prefixes and suffixes. The function at sub_40A250 checks whether the name starts with "__" and ends with "__", strips them, and looks up the bare name in qword_E7FB60. This means __host__, __attribute__((host)), and host all resolve to the same descriptor. The stripping respects the current language standard (dword_126EFB4) and C++ version (dword_126EF68).

Central Dispatch: `get_attr_descr_for_attribute` (`sub_40FDB0`)

This 227-line function is the central attribute resolution path. Given an attribute node with a name, it:

Looks up the name in the hash table
Checks mode compatibility (GCC mode via dword_126EFA8, Clang mode via dword_126EFA4, MSVC mode via dword_106BF68/dword_106BF58)
Checks namespace match ("gnu", "__gnu__", "clang") via cond_matches_attr_mode (sub_40C4C0)
Evaluates version-conditional availability via in_attr_cond_range (sub_40D620)
Writes the kind byte from the matched descriptor into attr_node_t.kind
Returns the descriptor entry (which carries the handler function pointer)

The mode condition strings use a compact encoding: 'g'=GCC, 'l'=Clang, 's'=Sun, 'c'=C++, 'm'=MSVC; 'x'=extension, '+'=positive match, '!'=boundary marker.

Attribute Application Pipeline

Phase 1: Scanning

The lexer recognizes attribute syntax and calls into the scanning functions:

Function	Address	Role
`scan_std_attribute_group`	`sub_412650`	Parses `[[...]]` C++11 and `__attribute__((...))` GNU attributes
`scan_gnu_attribute_groups`	`sub_412F20`	Handles `__attribute__((...))` specifically
`scan_attributes_list`	`sub_4124A0`	Iterates token stream building attribute node lists
`parse_attribute_argument_clause`	`sub_40C8B0`	Parses attribute argument expressions
`get_balanced_token`	`sub_40C6C0`	Handles balanced parentheses/brackets in arguments

Scanning produces a linked list of attr_node_t nodes. At this stage, the kind byte is unset; only the name and namespace_str fields are populated.

Phase 2: Lookup and Kind Assignment

When the parser reaches a declaration, get_attr_descr_for_attribute resolves each attribute name to a descriptor and writes the kind byte. For CUDA attributes, this assigns values in the 'V'--'n' range.

Phase 3: Application -- `apply_one_attribute` (`sub_413240`)

The central dispatcher is a 585-line function containing a switch on the kind byte. For each CUDA kind, it calls the corresponding handler:

// sub_413240 -- apply_one_attribute (attribute.c, main dispatch)
// 585 lines, giant switch on attribute kind
void apply_one_attribute(attr_node_t* attr, entity_t* entity, int target_kind) {
    switch (attr->kind) {
        case 'V':  apply_nv_host_attr(attr, entity, target_kind);     break;
        case 'W':  apply_nv_device_attr(attr, entity, target_kind);   break;
        case 'X':  apply_nv_global_attr(attr, entity, target_kind);   break;
        case 'Z':  apply_nv_shared_attr(attr, entity, target_kind);   break;
        case '[':  apply_nv_constant_attr(attr, entity, target_kind); break;
        case '\\': apply_nv_launch_bounds(attr, entity, target_kind); break;
        case ']':  apply_nv_maxnreg_attr(attr, entity, target_kind);  break;
        case '^':  apply_nv_local_maxnreg(attr, entity, target_kind); break;
        case 'f':  apply_nv_managed_attr(attr, entity, target_kind);  break;
        case 'k':  apply_nv_cluster_dims(attr, entity, target_kind);  break;
        case 'l':  apply_nv_block_size(attr, entity, target_kind);    break;
        // ... standard attributes handled similarly ...
    }
}

The outer iteration is apply_attributes_to_entity (sub_413ED0, 492 lines), which walks the attribute list, calls apply_one_attribute for each, and handles deferred attributes, attribute merging, and ordering constraints.

Phase 4: Post-Declaration Validation -- `sub_6BC890`

After all attributes on a declaration are applied, sub_6BC890 (nv_validate_cuda_attributes, from nv_transforms.c) performs cross-attribute consistency checking. This function validates that combinations of CUDA attributes are legal:

// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c)
// a1: entity (function), a2: diagnostic location
void nv_validate_cuda_attributes(entity_t* fn, source_loc_t* loc) {
    if (!fn || (fn->byte_177 & 0x10))  // skip if null or already validated
        return;

    uint8_t exec_space = fn->byte_182;  // CUDA execution space bits
    launch_config_t* lc = fn->launch_config;  // entity+256

    // Check 1: parameters with rvalue-reference in __global__ functions
    // Walks parameter list, emits error 3702 for ref-qualified params

    // Check 2: __nv_register_params__ on __host__-only or __global__
    if (fn->byte_183 & 0x08) {
        if (exec_space & 0x40)       // __global__
            emit_error(3661, "__global__");
        else if ((exec_space & 0x30) == 0x20)  // __host__ only (no __device__)
            emit_error(3661, "__host__");
    }

    // Check 3: __launch_bounds__ without __global__
    if (lc && !(exec_space & 0x40)) {
        if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
            emit_error(3534, "__launch_bounds__");
    }

    // Check 4: __cluster_dims__ / __block_size__ without __global__
    if (lc && (fn->byte_183 & 0x40 || lc->cluster_dim_x > 0)) {
        const char* name = (lc->block_size_x > 0) ? "__block_size__" : "__cluster_dims__";
        emit_error(3534, name);
    }

    // Check 5: maxBlocksPerClusterSize exceeds cluster product
    if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
        if (lc->maxBlocksPerClusterSize <
            lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z) {
            emit_error(3707, ...);
        }
    }

    // Check 6: __maxnreg__ without __global__
    if (lc && lc->maxnreg >= 0 && !(exec_space & 0x40))
        emit_error(3715, "__maxnreg__");

    // Check 7: __launch_bounds__ + __maxnreg__ conflict
    if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
        emit_error(3719, "__launch_bounds__ and __maxnreg__");

    // Check 8: __global__ without __launch_bounds__
    if ((exec_space & 0x40) && (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
        emit_warning(3695);  // "no __launch_bounds__ specified for __global__ function"
}

Error Codes in Validation

Error	Severity	Message
3534	7 (error)	`"%s" attribute is not allowed on a non-__global__ function`
3661	7 (error)	`__nv_register_params__ is not allowed on a %s function`
3695	4 (warning)	`no __launch_bounds__ specified for __global__ function`
3702	7 (error)	Parameter with rvalue reference in `__global__` function
3707	7 (error)	`total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit`
3715	7 (error)	`__maxnreg__ is not allowed on a non-__global__ function`
3719	7 (error)	`__launch_bounds__ and __maxnreg__ may not be used on the same declaration`

Per-Attribute Handler Function Table

Each CUDA attribute has a dedicated apply_* function registered in the descriptor table. These functions modify entity node fields (execution space bits, memory space bits, launch configuration) and emit diagnostics for invalid usage.

Attribute	Handler	Address	Lines	Entity Fields Modified
`__host__`	`apply_nv_host_attr`	`sub_4108E0`	31	`entity+182 \|= 0x15`
`__device__`	`apply_nv_device_attr`	`sub_40EB80`	100	Functions: `entity+182 \|= 0x23`; Variables: `entity+148 \|= 0x01`
`__global__`	`apply_nv_global_attr`	`sub_40E1F0`	89	`entity+182 \|= 0x61`
`__global__` (variant 2)	`apply_nv_global_attr`	`sub_40E7F0`	86	Same as above (alternate entry point)
`__shared__`	(via device attr path)	--	--	`entity+148 \|= 0x02`
`__constant__`	(via device attr path)	--	--	`entity+148 \|= 0x04`
`__managed__`	`apply_nv_managed_attr`	`sub_40E0D0`	47	`entity+148 \|= 0x01`, `entity+149 \|= 0x01`
`__launch_bounds__`	`apply_nv_launch_bounds_attr`	`sub_411C80`	98	`entity+256` -> launch config `+0`, `+8`, `+16`
`__maxnreg__`	`apply_nv_maxnreg_attr`	`sub_410F70`	67	`entity+256` -> launch config `+32`
`__local_maxnreg__`	`apply_nv_local_maxnreg_attr`	`sub_411090`	67	`entity+256` -> launch config `+36`
`__cluster_dims__`	`apply_nv_cluster_dims_attr`	`sub_4115F0`	145	`entity+256` -> launch config `+20`, `+24`, `+28`
`__block_size__`	`apply_nv_block_size_attr`	`sub_4109E0`	265	`entity+256` -> launch config `+40`..`+52`
`__nv_register_params__`	`apply_nv_register_params_attr`	`sub_40B0A0`	38	`entity+183 \|= 0x08`

Attribute Registration (`sub_6B5E50`)

The function sub_6B5E50 (160 lines, in the nv_transforms.c / mem_manage.c area) registers NVIDIA-specific pseudo-attributes into EDG's keyword and macro systems at startup. It operates after EDG's standard keyword initialization but before parsing begins.

The registration creates macro-like definitions that the lexer expands before attribute processing. The function:

Allocates attribute definition nodes via sub_6BA0D0 (EDG's node allocator)
Looks up existing definitions via sub_734430 (hash table search) -- if a definition already exists, it chains the new handler onto it via sub_6AC190
Creates new keyword entries via sub_749600 if no prior definition exists
Registers __nv_register_params__ as a 40-byte attribute definition node (kind marker 8961) with chain linkage
Registers __noinline__ as a 30-byte attribute definition node (kind marker 6401), including the "oinline))" suffix for __attribute__((__noinline__)) expansion
Conditionally registers ARM SME attributes (__arm_in, __arm_inout, __arm_out, __arm_preserves, __arm_streaming, __arm_streaming_compatible) via sub_6ACCB0 when Clang version >= 180000 and ARM target flags are set
Registers _Pragma as an operator-like keyword for _Pragma("...") processing

If any registration fails (the existing entry cannot be extended), it emits internal error 1338 with the attribute name and calls sub_6B6280 (fatal error handler).

Entity Node: CUDA Attribute Fields

CUDA attributes modify specific byte fields in entity nodes. The key fields for a reimplementation:

Execution Space (`entity+182`)

Bit 0 (0x01): __device__           set by apply_nv_device_attr
Bit 2 (0x04): __host__             set by apply_nv_host_attr
Bit 4 (0x10): (reserved)
Bit 5 (0x20): __host__ explicit    set by apply_nv_host_attr
Bit 6 (0x40): __global__           set by apply_nv_global_attr
Bit 7 (0x80): __host__ __device__  set when both specified

Handlers use OR-masks: __host__ sets 0x15 (bits 0+2+4), __device__ sets 0x23 (bits 0+1+5), __global__ sets 0x61 (bits 0+5+6). The overlap at bit 0 means all execution-space-annotated functions have bit 0 set, which serves as a quick "has CUDA annotation" predicate.

Memory Space (`entity+148`)

Bit 0 (0x01): __device__           device memory
Bit 1 (0x02): __shared__           shared memory
Bit 2 (0x04): __constant__         constant memory

Extended Memory Space (`entity+149`)

Bit 0 (0x01): __managed__          managed (unified) memory

Launch Configuration (`entity+256`)

A pointer to a separately allocated launch_config_t structure (created by sub_5E52F0):

struct launch_config_t {
    uint64_t  maxThreadsPerBlock;          // +0   from __launch_bounds__(N, ...)
    uint64_t  minBlocksPerMultiprocessor;  // +8   from __launch_bounds__(N, M, ...)
    int32_t   maxBlocksPerClusterSize;     // +16  from __launch_bounds__(N, M, K)
    int32_t   cluster_dim_x;              // +20  from __cluster_dims__(X, ...)
    int32_t   cluster_dim_y;              // +24  from __cluster_dims__(X, Y, ...)
    int32_t   cluster_dim_z;              // +28  from __cluster_dims__(X, Y, Z)
    int32_t   maxnreg;                    // +32  from __maxnreg__(N)
    int32_t   local_maxnreg;              // +36  from __local_maxnreg__(N)
    int32_t   block_size_x;              // +40  from __block_size__(X, ...)
    int32_t   block_size_y;              // +44  from __block_size__(X, Y, ...)
    int32_t   block_size_z;              // +48  from __block_size__(X, Y, Z, ...)
    uint8_t   flags;                      // +52  bit 0=cluster_dims_set
                                          //      bit 1=block_size_set
};

This structure is allocated lazily -- only created when a launch configuration attribute is first applied to a function. The allocation function sub_5E52F0 returns a zero-initialized structure with maxnreg = -1 and local_maxnreg = -1 (sentinel for "unset").

Attribute Processing Global State

Global	Address	Purpose
`qword_E7FB60`	`0xE7FB60`	Attribute name hash table (created by `init_attr_name_map`)
`qword_E7F038`	`0xE7F038`	Attribute token hash table (created by `init_attr_token_map`)
`byte_E7FB80`	`0xE7FB80`	204-byte static buffer for formatted attribute display names
`off_D46820`	`0xD46820`	Master attribute descriptor table (32 bytes per entry, extends to `0xD47A60`)
`qword_E7F070`	`0xE7F070`	Visibility stack (for `__attribute__((visibility(...)))` nesting)
`qword_E7F048`	`0xE7F048`	Alias/ifunc free list head
`qword_E7F058`/`E7F050`	`0xE7F058`/`0xE7F050`	Alias chain list head/tail
`dword_E7F080`	`0xE7F080`	Attribute processing flags
`dword_E7F078`	`0xE7F078`	Extended attribute config flag

The function reset_attribute_processing_state (sub_4190B0) zeroes all of these at the start of each translation unit.

Function Map

Address	Identity	Source	Confidence
`sub_40A250`	`strip_double_underscores_and_lookup`	`attribute.c`	HIGH
`sub_40A310`	`attribute_display_name`	`attribute.c:1307`	HIGH
`sub_40C4C0`	`cond_matches_attr_mode`	`attribute.c`	HIGH
`sub_40C6C0`	`get_balanced_token`	`attribute.c`	HIGH
`sub_40C8B0`	`parse_attribute_argument_clause`	`attribute.c`	HIGH
`sub_40D620`	`in_attr_cond_range`	`attribute.c`	HIGH
`sub_40E0D0`	`apply_nv_managed_attr`	`attribute.c:10523`	HIGH
`sub_40E1F0`	`apply_nv_global_attr` (variant 1)	`attribute.c`	HIGH
`sub_40E7F0`	`apply_nv_global_attr` (variant 2)	`attribute.c`	HIGH
`sub_40EB80`	`apply_nv_device_attr`	`attribute.c`	HIGH
`sub_40FDB0`	`get_attr_descr_for_attribute`	`attribute.c:1902`	HIGH
`sub_4108E0`	`apply_nv_host_attr`	`attribute.c`	HIGH
`sub_4109E0`	`apply_nv_block_size_attr`	`attribute.c`	HIGH
`sub_410F70`	`apply_nv_maxnreg_attr`	`attribute.c`	HIGH
`sub_411090`	`apply_nv_local_maxnreg_attr`	`attribute.c`	HIGH
`sub_4115F0`	`apply_nv_cluster_dims_attr`	`attribute.c`	HIGH
`sub_411C80`	`apply_nv_launch_bounds_attr`	`attribute.c`	HIGH
`sub_412650`	`scan_std_attribute_group`	`attribute.c:2914`	HIGH
`sub_413240`	`apply_one_attribute`	`attribute.c`	HIGH
`sub_413ED0`	`apply_attributes_to_entity`	`attribute.c`	HIGH
`sub_418F80`	`init_attr_name_map`	`attribute.c:1524`	HIGH
`sub_419070`	`init_attr_token_map`	`attribute.c`	HIGH
`sub_4190B0`	`reset_attribute_processing_state`	`attribute.c`	HIGH
`sub_6B5E50`	`process_nv_register_params` / attribute registration	`nv_transforms.c`	HIGH
`sub_6BC890`	`nv_validate_cuda_attributes`	`nv_transforms.c`	VERY HIGH

Per-Attribute IL Emission Matrix

Every CUDA attribute is parsed into a 72-byte attribute IL node (entry kind 0x48). What happens after application is what determines downstream visibility. The matrix below classifies each attribute by its emission path: whether the attribute survives as a discrete IL node, collapses into entity bitfield mutations, allocates the side-band launch-config struct, or is preserved on the entity's attribute chain for the .int.c writer to re-emit.

Attribute	Parse-time IL node	Post-apply form	`.int.c` re-emission	Pipeline consumer
`__host__`	kind `0x48`, byte `+8 = 'V'`	`entity+182 \|= 0x15`	-- (entity bits only)	Device/host splitter (`mark_to_keep_in_il`)
`__device__`	kind `0x48`, byte `+8 = 'W'`	functions: `entity+182 \|= 0x23`; variables: `entity+148 \|= 0x01`	-- (entity bits only)	Device/host splitter, cross-space checker
`__global__`	kind `0x48`, byte `+8 = 'X'`	`entity+182 \|= 0x61`, then `\|= 0x80`	host stub generator emits launch wrapper	Kernel stub emitter (`sub_489000`)
`__tile_global__`	kind `0x48`, byte `+8 = 'Y'`	(no handler) attribute node kept on attr chain	preserved through chain walk	cicc (downstream)
`__shared__`	kind `0x48`, byte `+8 = 'Z'`	`entity+148 \|= 0x02`	-- (entity bits only)	Memory-space declarator
`__constant__`	kind `0x48`, byte `+8 = '['`	`entity+148 \|= 0x04`	-- (entity bits only)	Memory-space declarator
`__launch_bounds__`	kind `0x48`, byte `+8 = '\\'`	`entity+256 -> launch_config_t` (+0, +8, +16)	re-emitted as IL kind `25` via `sub_540560`	cicc (NVVM IR generator)
`__maxnreg__`	kind `0x48`, byte `+8 = ']'`	`launch_config+32`	-- (struct field)	cicc, ptxas
`__local_maxnreg__`	kind `0x48`, byte `+8 = '^'`	`launch_config+36`	-- (struct field)	cicc, ptxas
`__tile_builtin__`	kind `0x48`, byte `+8 = '_'`	(no handler) attribute node kept on attr chain	preserved through chain walk	cicc (downstream)
`__managed__`	kind `0x48`, byte `+8 = 'f'`	`entity+148 \|= 0x01` + `entity+149 \|= 0x01`	comma-op host wrapper + RT boilerplate via `sub_489000`	CUDA runtime (`__nv_init_managed_rt`)
`__cluster_dims__`	kind `0x48`, byte `+8 = 'k'`	`launch_config+20/+24/+28` + flag bit 0; zero-arg sets `entity+183 \|= 0x40`	-- (struct fields)	cicc
`__block_size__`	kind `0x48`, byte `+8 = 'l'`	`launch_config+40/+44/+48` + flag bit 1; optional `+20/+24/+28`	-- (struct fields)	cicc
`__nv_pure__`	kind `0x48`, byte `+8 = 'n'`	(no entity mutation) attribute node kept on attr chain	re-emitted as IL kind `25` via `sub_540560` (shared path with `\\`)	cicc (applies LLVM `readonly`/`willreturn`)
`__nv_register_params__`	kind `0x48` via GNU path (no CUDA kind byte)	`entity+183 \|= 0x08`	-- (entity bits only)	cicc (ABI selector)
`__forceinline__`	(no CUDA kind byte; processed via inline-control path)	`entity+177 \|= 0x10`	emitted as `__attribute__((always_inline))`	Host compiler + cicc
`__noinline__` (EDG form)	(no CUDA kind byte)	`entity+179 \|= 0x20` (+ ABI node on prototype in C mode)	emitted as `__attribute__((noinline))`	Host compiler + cicc
`__noinline__` (GNU form)	kind `0x48` via GNU `__attribute__`	`entity+180 \|= 0x80`	emitted as `__attribute__((noinline))`	Host compiler + cicc
`__inline_hint__`	(no CUDA kind byte)	`entity+179 \|= 0x10`	emitted as suggestion (non-binding)	cicc inlining heuristics
`__grid_constant__`	parameter-level attribute	`entity+164 \|= 0x04`, `type+133 \|= 0x20`, `param+32 \|= 0x02`	emitted into kernel parameter declaration	Kernel parameter ABI (cicc + driver)
`__restrict__`	C99/GNU restrict path (not a CUDA kind)	Type-qualifier on parameter type chain	preserved as `__restrict__` on parameter	Host compiler + cicc

Three Emission Categories

Reading the matrix vertically, every CUDA attribute lands in exactly one of three categories:

Collapse to entity bits -- __host__, __device__, __shared__, __constant__, __managed__, __nv_register_params__, __forceinline__, __noinline__, __inline_hint__. After application, no attribute IL node survives. The downstream consumer reads entity bytes (+148, +149, +177, +179, +180, +182, +183) instead. The original kind-0x48 IL node is freed back to the arena when the attribute chain is torn down.
Side-band launch-config struct -- __launch_bounds__, __maxnreg__, __local_maxnreg__, __cluster_dims__, __block_size__. The attribute IL node is consumed; the values are extracted into the 56-byte launch_config_t pointed to by entity+256. __launch_bounds__ additionally walks back through the writer with kind_field = 25 so cicc sees it in .int.c. The other four are consumed entirely by cudafe++ and the launch-config struct is read by later passes (kernel stub generator, ptxas argument formatter).
Preserved on attribute chain -- __nv_pure__, __tile_global__, __tile_builtin__, __grid_constant__. These have no entity-bit collapse (or only a flag bit alongside the chain). The attribute IL node remains attached to the entity through code generation. The .int.c writer (sub_5565E0 family, sub_540560) walks the chain and re-emits the attribute textually so cicc can apply the corresponding LLVM-level semantics. For the __tile_* pair, no cudafe++ consumer exists -- the attribute is pure pass-through.

Why There Is No Per-Attribute IL Node Type

A CUDA attribute is never lowered into a discrete IL node of its own kind. Every attribute reuses the generic attribute kind (0x48), with its arguments stored as attribute_argument nodes (kind 0x49) and the whole bundle wrapped in an attribute_group node (kind 0x4A). The discriminator is the byte at +8 of the 0x48 node ('V'..'n' for CUDA attributes; standard descriptors for GNU/C++11/MSVC attributes). This is a deliberate EDG design choice: the IL graph stays small and the dispatch logic centralizes in apply_one_attribute (sub_413240) rather than fanning out into dozens of node kinds.

The practical consequence is that "which IL node is emitted for attribute X" is the wrong question. The correct question is "which entity-byte mutation does attribute X cause, and which downstream pass reads that byte." The matrix above answers both.

Cross-References

global Function Constraints -- detailed validation rules for __global__
Launch Configuration Attributes -- __launch_bounds__, __cluster_dims__, __block_size__
grid_constant -- grid-constant parameter attribute
managed Variables -- managed memory attribute
Minor CUDA Attributes -- __noinline__, __forceinline__, __nv_register_params__, __nv_pure__
_nv* Builtin Intrinsic Names -- 110 __nv_* identifiers that look like attributes but are intrinsics / lambda machinery
Entity Node Layout -- full entity structure with CUDA field offsets
CUDA Execution Spaces -- how execution space bits drive code generation
CUDA Memory Spaces -- memory space bitfield semantics

The "25 reserved slots" figure comes from the upper and lower bounds of the CUDA kind range as observed in sub_40A310 (attribute_display_name) and the cross-validating switch dispatchers in sub_40FDB0, sub_410A20, and sub_4115F0. Of the 25, only kinds 86-95, 102, 107, 108, and 110 appear in any user-facing diagnostic, source-level keyword table, or descriptor entry; the others were either never wired up at this binary revision, are reserved internal markers used briefly during parsing, or were retired between EDG/CUDA revisions. ↩

Keyboard shortcuts

cudafe++ Reverse Engineering Reference