Attribute System Overview
cudafe++ processes CUDA attributes through NVIDIA's customization of the EDG 6.6 attribute subsystem. EDG provides a general-purpose attribute infrastructure in attribute.c (approximately 11,500 lines of source, spanning addresses 0x409350--0x418F80 in the binary) that handles C++11 [[...]] attributes, GNU __attribute__((...)), MSVC __declspec, and alignas. NVIDIA extends this infrastructure by injecting CUDA-specific attribute kinds into EDG's attribute kind enumeration, registering CUDA-specific handler callbacks, and adding a post-declaration validation pass that enforces cross-attribute consistency rules (e.g., __launch_bounds__ requires __global__). The kind enum reserves a contiguous block of 25 values in the ASCII printable range (86..110, with gaps at 96-101, 103-106, 109); the 14 commonly-encountered kinds are tabulated below, with the remaining 11 slots documented in the Reserved Kind Slots section as internal/unallocated values.1
The attribute system operates in four phases: scanning (lexer recognizes attribute syntax and builds attribute node lists), lookup (maps attribute names to descriptors via a hash table), application (dispatches to per-attribute handler functions that modify entity nodes), and validation (post-declaration consistency checks). CUDA attributes participate in all four phases, using the same node structures and dispatch mechanisms as standard C++/GNU attributes.
CUDA Attribute Kind Enum
Every attribute node carries a kind byte at offset +8. For standard C++/GNU attributes, EDG assigns kinds from its built-in descriptor table (byte_82C0E0 in the .rodata segment). For CUDA attributes, NVIDIA reserves a block of kind values in the ASCII printable range. The function attribute_display_name (sub_40A310, from attribute.c:1307) contains the authoritative switch table that maps kind values to human-readable names:
| Kind | Hex | ASCII | Display Name | Category | Handler |
|---|---|---|---|---|---|
| 86 | 0x56 | 'V' | __host__ | Execution space | sub_4108E0 |
| 87 | 0x57 | 'W' | __device__ | Execution space | sub_40EB80 |
| 88 | 0x58 | 'X' | __global__ | Execution space | sub_40E1F0 / sub_40E7F0 |
| 89 | 0x59 | 'Y' | __tile_global__ | Execution space | (internal) |
| 90 | 0x5A | 'Z' | __shared__ | Memory space | sub_40E0D0 (shared path) |
| 91 | 0x5B | '[' | __constant__ | Memory space | sub_40E0D0 (constant path) |
| 92 | 0x5C | '\' | __launch_bounds__ | Launch config | sub_411C80 |
| 93 | 0x5D | ']' | __maxnreg__ | Launch config | sub_410F70 |
| 94 | 0x5E | '^' | __local_maxnreg__ | Launch config | sub_411090 |
| 95 | 0x5F | '_' | __tile_builtin__ | Internal | (internal) |
| 102 | 0x66 | 'f' | __managed__ | Memory space | sub_40E0D0 (managed path) |
| 107 | 0x6B | 'k' | __cluster_dims__ | Launch config | sub_4115F0 |
| 108 | 0x6C | 'l' | __block_size__ | Launch config | sub_4109E0 |
| 110 | 0x6E | 'n' | __nv_pure__ | Optimization | (internal) |
The kind values are not contiguous. Kinds 86--95 form a dense block for the original CUDA attributes. Kinds 102, 107, 108, and 110 were added later (managed memory in CUDA 6.0, cluster dimensions in CUDA 11.8, block size and nv_pure more recently), occupying gaps in the ASCII range.
Reserved Kind Slots
The kind dispatch logic in sub_40A310, sub_40FDB0, and the apply-handler switches collectively reserves 25 enum slots in the range 86..110. Eleven of these have no user-visible mapping in this binary revision: they appear as fall-through targets in the dispatcher (returning the raw name pointer rather than a CUDA display string), but no descriptor entry in off_D46820 writes them into an attribute node, and no diagnostic references them by name. They are listed for completeness:
| Kind range | Hex | ASCII | Status |
|---|---|---|---|
| 96 | 0x60 | '`' | Reserved / unallocated (no descriptor) |
| 97 | 0x61 | 'a' | Reserved / unallocated |
| 98 | 0x62 | 'b' | Reserved / unallocated |
| 99 | 0x63 | 'c' | Reserved / unallocated |
| 100 | 0x64 | 'd' | Reserved / unallocated |
| 101 | 0x65 | 'e' | Reserved / unallocated |
| 103 | 0x67 | 'g' | Reserved / unallocated |
| 104 | 0x68 | 'h' | Reserved / unallocated |
| 105 | 0x69 | 'i' | Reserved / unallocated |
| 106 | 0x6A | 'j' | Reserved / unallocated |
| 109 | 0x6D | 'm' | Reserved / unallocated |
The eleven gaps interleave the 14 documented kinds in a pattern consistent with revisional history: blocks of slots were reserved en bloc when a new CUDA feature was on the roadmap, but only a subset of the slots was ever wired up. Reverse-engineering the binary at hand cannot distinguish "reserved for a future feature" from "retired after a feature was renamed or merged"; both produce identical evidence (kind value handled in dispatch but absent from descriptor table). Consumers should not rely on any particular semantics for these eleven values when re-implementing the attribute subsystem.
attribute_display_name (sub_40A310)
This function serves dual duty: it formats the display name for diagnostic messages, and its switch table is the canonical enumeration of all CUDA attribute kinds. The logic:
// sub_40A310 -- attribute_display_name (attribute.c:1307)
// a1: pointer to attribute node
const char* attribute_display_name(attr_node_t* a1) {
const char* name = a1->name; // +16
const char* ns = a1->namespace_str; // +24
// If scoped (namespace::name), format "namespace::name"
if (ns) {
size_t ns_len = strlen(ns);
assert(ns_len + strlen(name) + 3 <= 204); // buffer byte_E7FB80
sprintf(byte_E7FB80, "%s::%s", ns, name);
name = intern_string(byte_E7FB80); // sub_5E0700
}
// Override with CUDA display name based on kind byte
switch (a1->kind) { // byte at +8
case 'V': return "__host__";
case 'W': return "__device__";
case 'X': return "__global__";
case 'Y': return "__tile_global__";
case 'Z': return "__shared__";
case '[': return "__constant__";
case '\\': return "__launch_bounds__";
case ']': return "__maxnreg__";
case '^': return "__local_maxnreg__";
case '_': return "__tile_builtin__";
case 'f': return "__managed__";
case 'k': return "__cluster_dims__";
case 'l': return "__block_size__";
case 'n': return "__nv_pure__";
default: return name ? name : "";
}
}
The 204-byte static buffer byte_E7FB80 is shared across calls (not thread-safe, but cudafe++ is single-threaded per translation unit). The intern_string call (sub_5E0700) ensures the formatted "namespace::name" string is deduplicated into EDG's permanent string pool.
Attribute Node Structure
Every attribute is represented by a 72-byte IL node (entry kind 0x48 = attribute). The node layout:
struct attr_node_t { // 72 bytes, IL entry kind 0x48
attr_node_t* next; // +0 next attribute in list
uint8_t kind; // +8 attribute kind byte (CUDA: 'V'..'n')
uint8_t source_mode; // +9 1=C++11, 2=GNU, 3=MSVC, 4=alignas, 5=clang
uint8_t target_kind; // +10 what entity type this targets
uint8_t flags; // +11 bit 0=applies_to_params
// bit 1=skip_arg_check
// bit 4=scoped attribute
// bit 7=unknown/unrecognized
uint32_t _pad; // +12 (alignment)
const char* name; // +16 attribute name string
const char* namespace_str; // +24 namespace (NULL for unscoped)
arg_node_t* arguments; // +32 argument list head
void* source_pos; // +40 source position info
void* decl_context; // +48 declaration context / scope
void* src_loc_1; // +56 source location
void* src_loc_2; // +64 secondary source location
};
For CUDA attributes, the kind byte at offset +8 is the discriminator. When get_attr_descr_for_attribute (sub_40FDB0) resolves an attribute name, it writes the corresponding kind value from the descriptor table (byte_82C0E0) into this field. All subsequent dispatch operates on this byte alone.
The source_mode byte at +9 indicates the syntactic form the user wrote. CUDA attributes like __host__ are parsed as GNU-style attributes (source_mode = 2), because cudafe++ defines them via __attribute__((...)) internally.
Attribute Descriptor Table and Name Lookup
Master Descriptor Table (off_D46820)
The attribute descriptor table is a static array in .rodata at off_D46820, extending to unk_D47A60. Each entry is 32 bytes and encodes:
- Attribute name string
- Kind byte (written to
attr_node_t.kindon match) - Handler function pointer (the
apply_*callback) - Mode/version condition string (e.g.,
'g'for GCC-only,'l'for Clang-only) - Target applicability mask
Initialization: init_attr_name_map (sub_418F80)
At startup, init_attr_name_map iterates the descriptor table, validates each name is at most 100 characters, and inserts it into the hash table qword_E7FB60 (created via sub_7425C0). This hash table enables O(1) lookup of attribute names during parsing.
// sub_418F80 -- init_attr_name_map (attribute.c:1524)
void init_attr_name_map(void) {
attr_name_map = create_hash_table(); // qword_E7FB60
for (attr_descr* d = off_D46820; d < unk_D47A60; d++) {
assert(strlen(d->name) <= 100);
insert_into_hash_table(attr_name_map, d->name, d);
}
// Also initializes dword_E7F078 and processes config if dword_106BF18 set
}
A companion function init_attr_token_map (sub_419070) creates a second hash table qword_E7F038 that maps attribute tokens to their descriptors, used during lexer-level attribute recognition.
Name Normalization: sub_40A250
Before looking up an attribute name, EDG strips __ prefixes and suffixes. The function at sub_40A250 checks whether the name starts with "__" and ends with "__", strips them, and looks up the bare name in qword_E7FB60. This means __host__, __attribute__((host)), and host all resolve to the same descriptor. The stripping respects the current language standard (dword_126EFB4) and C++ version (dword_126EF68).
Central Dispatch: get_attr_descr_for_attribute (sub_40FDB0)
This 227-line function is the central attribute resolution path. Given an attribute node with a name, it:
- Looks up the name in the hash table
- Checks mode compatibility (GCC mode via
dword_126EFA8, Clang mode viadword_126EFA4, MSVC mode viadword_106BF68/dword_106BF58) - Checks namespace match (
"gnu","__gnu__","clang") viacond_matches_attr_mode(sub_40C4C0) - Evaluates version-conditional availability via
in_attr_cond_range(sub_40D620) - Writes the kind byte from the matched descriptor into
attr_node_t.kind - Returns the descriptor entry (which carries the handler function pointer)
The mode condition strings use a compact encoding: 'g'=GCC, 'l'=Clang, 's'=Sun, 'c'=C++, 'm'=MSVC; 'x'=extension, '+'=positive match, '!'=boundary marker.
Attribute Application Pipeline
Phase 1: Scanning
The lexer recognizes attribute syntax and calls into the scanning functions:
| Function | Address | Role |
|---|---|---|
scan_std_attribute_group | sub_412650 | Parses [[...]] C++11 and __attribute__((...)) GNU attributes |
scan_gnu_attribute_groups | sub_412F20 | Handles __attribute__((...)) specifically |
scan_attributes_list | sub_4124A0 | Iterates token stream building attribute node lists |
parse_attribute_argument_clause | sub_40C8B0 | Parses attribute argument expressions |
get_balanced_token | sub_40C6C0 | Handles balanced parentheses/brackets in arguments |
Scanning produces a linked list of attr_node_t nodes. At this stage, the kind byte is unset; only the name and namespace_str fields are populated.
Phase 2: Lookup and Kind Assignment
When the parser reaches a declaration, get_attr_descr_for_attribute resolves each attribute name to a descriptor and writes the kind byte. For CUDA attributes, this assigns values in the 'V'--'n' range.
Phase 3: Application -- apply_one_attribute (sub_413240)
The central dispatcher is a 585-line function containing a switch on the kind byte. For each CUDA kind, it calls the corresponding handler:
// sub_413240 -- apply_one_attribute (attribute.c, main dispatch)
// 585 lines, giant switch on attribute kind
void apply_one_attribute(attr_node_t* attr, entity_t* entity, int target_kind) {
switch (attr->kind) {
case 'V': apply_nv_host_attr(attr, entity, target_kind); break;
case 'W': apply_nv_device_attr(attr, entity, target_kind); break;
case 'X': apply_nv_global_attr(attr, entity, target_kind); break;
case 'Z': apply_nv_shared_attr(attr, entity, target_kind); break;
case '[': apply_nv_constant_attr(attr, entity, target_kind); break;
case '\\': apply_nv_launch_bounds(attr, entity, target_kind); break;
case ']': apply_nv_maxnreg_attr(attr, entity, target_kind); break;
case '^': apply_nv_local_maxnreg(attr, entity, target_kind); break;
case 'f': apply_nv_managed_attr(attr, entity, target_kind); break;
case 'k': apply_nv_cluster_dims(attr, entity, target_kind); break;
case 'l': apply_nv_block_size(attr, entity, target_kind); break;
// ... standard attributes handled similarly ...
}
}
The outer iteration is apply_attributes_to_entity (sub_413ED0, 492 lines), which walks the attribute list, calls apply_one_attribute for each, and handles deferred attributes, attribute merging, and ordering constraints.
Phase 4: Post-Declaration Validation -- sub_6BC890
After all attributes on a declaration are applied, sub_6BC890 (nv_validate_cuda_attributes, from nv_transforms.c) performs cross-attribute consistency checking. This function validates that combinations of CUDA attributes are legal:
// sub_6BC890 -- nv_validate_cuda_attributes (nv_transforms.c)
// a1: entity (function), a2: diagnostic location
void nv_validate_cuda_attributes(entity_t* fn, source_loc_t* loc) {
if (!fn || (fn->byte_177 & 0x10)) // skip if null or already validated
return;
uint8_t exec_space = fn->byte_182; // CUDA execution space bits
launch_config_t* lc = fn->launch_config; // entity+256
// Check 1: parameters with rvalue-reference in __global__ functions
// Walks parameter list, emits error 3702 for ref-qualified params
// Check 2: __nv_register_params__ on __host__-only or __global__
if (fn->byte_183 & 0x08) {
if (exec_space & 0x40) // __global__
emit_error(3661, "__global__");
else if ((exec_space & 0x30) == 0x20) // __host__ only (no __device__)
emit_error(3661, "__host__");
}
// Check 3: __launch_bounds__ without __global__
if (lc && !(exec_space & 0x40)) {
if (lc->maxThreadsPerBlock || lc->minBlocksPerMultiprocessor)
emit_error(3534, "__launch_bounds__");
}
// Check 4: __cluster_dims__ / __block_size__ without __global__
if (lc && (fn->byte_183 & 0x40 || lc->cluster_dim_x > 0)) {
const char* name = (lc->block_size_x > 0) ? "__block_size__" : "__cluster_dims__";
emit_error(3534, name);
}
// Check 5: maxBlocksPerClusterSize exceeds cluster product
if (lc && lc->cluster_dim_x > 0 && lc->maxBlocksPerClusterSize > 0) {
if (lc->maxBlocksPerClusterSize <
lc->cluster_dim_x * lc->cluster_dim_y * lc->cluster_dim_z) {
emit_error(3707, ...);
}
}
// Check 6: __maxnreg__ without __global__
if (lc && lc->maxnreg >= 0 && !(exec_space & 0x40))
emit_error(3715, "__maxnreg__");
// Check 7: __launch_bounds__ + __maxnreg__ conflict
if (lc && lc->maxThreadsPerBlock && lc->maxnreg >= 0)
emit_error(3719, "__launch_bounds__ and __maxnreg__");
// Check 8: __global__ without __launch_bounds__
if ((exec_space & 0x40) && (!lc || (!lc->maxThreadsPerBlock && !lc->minBlocksPerMultiprocessor)))
emit_warning(3695); // "no __launch_bounds__ specified for __global__ function"
}
Error Codes in Validation
| Error | Severity | Message |
|---|---|---|
| 3534 | 7 (error) | "%s" attribute is not allowed on a non-__global__ function |
| 3661 | 7 (error) | __nv_register_params__ is not allowed on a %s function |
| 3695 | 4 (warning) | no __launch_bounds__ specified for __global__ function |
| 3702 | 7 (error) | Parameter with rvalue reference in __global__ function |
| 3707 | 7 (error) | total number of blocks in cluster computed from %s exceeds __launch_bounds__ specified limit |
| 3715 | 7 (error) | __maxnreg__ is not allowed on a non-__global__ function |
| 3719 | 7 (error) | __launch_bounds__ and __maxnreg__ may not be used on the same declaration |
Per-Attribute Handler Function Table
Each CUDA attribute has a dedicated apply_* function registered in the descriptor table. These functions modify entity node fields (execution space bits, memory space bits, launch configuration) and emit diagnostics for invalid usage.
| Attribute | Handler | Address | Lines | Entity Fields Modified |
|---|---|---|---|---|
__host__ | apply_nv_host_attr | sub_4108E0 | 31 | entity+182 |= 0x15 |
__device__ | apply_nv_device_attr | sub_40EB80 | 100 | Functions: entity+182 |= 0x23; Variables: entity+148 |= 0x01 |
__global__ | apply_nv_global_attr | sub_40E1F0 | 89 | entity+182 |= 0x61 |
__global__ (variant 2) | apply_nv_global_attr | sub_40E7F0 | 86 | Same as above (alternate entry point) |
__shared__ | (via device attr path) | -- | -- | entity+148 |= 0x02 |
__constant__ | (via device attr path) | -- | -- | entity+148 |= 0x04 |
__managed__ | apply_nv_managed_attr | sub_40E0D0 | 47 | entity+148 |= 0x01, entity+149 |= 0x01 |
__launch_bounds__ | apply_nv_launch_bounds_attr | sub_411C80 | 98 | entity+256 -> launch config +0, +8, +16 |
__maxnreg__ | apply_nv_maxnreg_attr | sub_410F70 | 67 | entity+256 -> launch config +32 |
__local_maxnreg__ | apply_nv_local_maxnreg_attr | sub_411090 | 67 | entity+256 -> launch config +36 |
__cluster_dims__ | apply_nv_cluster_dims_attr | sub_4115F0 | 145 | entity+256 -> launch config +20, +24, +28 |
__block_size__ | apply_nv_block_size_attr | sub_4109E0 | 265 | entity+256 -> launch config +40..+52 |
__nv_register_params__ | apply_nv_register_params_attr | sub_40B0A0 | 38 | entity+183 |= 0x08 |
Attribute Registration (sub_6B5E50)
The function sub_6B5E50 (160 lines, in the nv_transforms.c / mem_manage.c area) registers NVIDIA-specific pseudo-attributes into EDG's keyword and macro systems at startup. It operates after EDG's standard keyword initialization but before parsing begins.
The registration creates macro-like definitions that the lexer expands before attribute processing. The function:
- Allocates attribute definition nodes via
sub_6BA0D0(EDG's node allocator) - Looks up existing definitions via
sub_734430(hash table search) -- if a definition already exists, it chains the new handler onto it viasub_6AC190 - Creates new keyword entries via
sub_749600if no prior definition exists - Registers
__nv_register_params__as a 40-byte attribute definition node (kind marker 8961) with chain linkage - Registers
__noinline__as a 30-byte attribute definition node (kind marker 6401), including the"oinline))"suffix for__attribute__((__noinline__))expansion - Conditionally registers ARM SME attributes (
__arm_in,__arm_inout,__arm_out,__arm_preserves,__arm_streaming,__arm_streaming_compatible) viasub_6ACCB0when Clang version >= 180000 and ARM target flags are set - Registers
_Pragmaas an operator-like keyword for_Pragma("...")processing
If any registration fails (the existing entry cannot be extended), it emits internal error 1338 with the attribute name and calls sub_6B6280 (fatal error handler).
Entity Node: CUDA Attribute Fields
CUDA attributes modify specific byte fields in entity nodes. The key fields for a reimplementation:
Execution Space (entity+182)
Bit 0 (0x01): __device__ set by apply_nv_device_attr
Bit 2 (0x04): __host__ set by apply_nv_host_attr
Bit 4 (0x10): (reserved)
Bit 5 (0x20): __host__ explicit set by apply_nv_host_attr
Bit 6 (0x40): __global__ set by apply_nv_global_attr
Bit 7 (0x80): __host__ __device__ set when both specified
Handlers use OR-masks: __host__ sets 0x15 (bits 0+2+4), __device__ sets 0x23 (bits 0+1+5), __global__ sets 0x61 (bits 0+5+6). The overlap at bit 0 means all execution-space-annotated functions have bit 0 set, which serves as a quick "has CUDA annotation" predicate.
Memory Space (entity+148)
Bit 0 (0x01): __device__ device memory
Bit 1 (0x02): __shared__ shared memory
Bit 2 (0x04): __constant__ constant memory
Extended Memory Space (entity+149)
Bit 0 (0x01): __managed__ managed (unified) memory
Launch Configuration (entity+256)
A pointer to a separately allocated launch_config_t structure (created by sub_5E52F0):
struct launch_config_t {
uint64_t maxThreadsPerBlock; // +0 from __launch_bounds__(N, ...)
uint64_t minBlocksPerMultiprocessor; // +8 from __launch_bounds__(N, M, ...)
int32_t maxBlocksPerClusterSize; // +16 from __launch_bounds__(N, M, K)
int32_t cluster_dim_x; // +20 from __cluster_dims__(X, ...)
int32_t cluster_dim_y; // +24 from __cluster_dims__(X, Y, ...)
int32_t cluster_dim_z; // +28 from __cluster_dims__(X, Y, Z)
int32_t maxnreg; // +32 from __maxnreg__(N)
int32_t local_maxnreg; // +36 from __local_maxnreg__(N)
int32_t block_size_x; // +40 from __block_size__(X, ...)
int32_t block_size_y; // +44 from __block_size__(X, Y, ...)
int32_t block_size_z; // +48 from __block_size__(X, Y, Z, ...)
uint8_t flags; // +52 bit 0=cluster_dims_set
// bit 1=block_size_set
};
This structure is allocated lazily -- only created when a launch configuration attribute is first applied to a function. The allocation function sub_5E52F0 returns a zero-initialized structure with maxnreg = -1 and local_maxnreg = -1 (sentinel for "unset").
Attribute Processing Global State
| Global | Address | Purpose |
|---|---|---|
qword_E7FB60 | 0xE7FB60 | Attribute name hash table (created by init_attr_name_map) |
qword_E7F038 | 0xE7F038 | Attribute token hash table (created by init_attr_token_map) |
byte_E7FB80 | 0xE7FB80 | 204-byte static buffer for formatted attribute display names |
off_D46820 | 0xD46820 | Master attribute descriptor table (32 bytes per entry, extends to 0xD47A60) |
qword_E7F070 | 0xE7F070 | Visibility stack (for __attribute__((visibility(...))) nesting) |
qword_E7F048 | 0xE7F048 | Alias/ifunc free list head |
qword_E7F058/E7F050 | 0xE7F058/0xE7F050 | Alias chain list head/tail |
dword_E7F080 | 0xE7F080 | Attribute processing flags |
dword_E7F078 | 0xE7F078 | Extended attribute config flag |
The function reset_attribute_processing_state (sub_4190B0) zeroes all of these at the start of each translation unit.
Function Map
| Address | Identity | Source | Confidence |
|---|---|---|---|
sub_40A250 | strip_double_underscores_and_lookup | attribute.c | HIGH |
sub_40A310 | attribute_display_name | attribute.c:1307 | HIGH |
sub_40C4C0 | cond_matches_attr_mode | attribute.c | HIGH |
sub_40C6C0 | get_balanced_token | attribute.c | HIGH |
sub_40C8B0 | parse_attribute_argument_clause | attribute.c | HIGH |
sub_40D620 | in_attr_cond_range | attribute.c | HIGH |
sub_40E0D0 | apply_nv_managed_attr | attribute.c:10523 | HIGH |
sub_40E1F0 | apply_nv_global_attr (variant 1) | attribute.c | HIGH |
sub_40E7F0 | apply_nv_global_attr (variant 2) | attribute.c | HIGH |
sub_40EB80 | apply_nv_device_attr | attribute.c | HIGH |
sub_40FDB0 | get_attr_descr_for_attribute | attribute.c:1902 | HIGH |
sub_4108E0 | apply_nv_host_attr | attribute.c | HIGH |
sub_4109E0 | apply_nv_block_size_attr | attribute.c | HIGH |
sub_410F70 | apply_nv_maxnreg_attr | attribute.c | HIGH |
sub_411090 | apply_nv_local_maxnreg_attr | attribute.c | HIGH |
sub_4115F0 | apply_nv_cluster_dims_attr | attribute.c | HIGH |
sub_411C80 | apply_nv_launch_bounds_attr | attribute.c | HIGH |
sub_412650 | scan_std_attribute_group | attribute.c:2914 | HIGH |
sub_413240 | apply_one_attribute | attribute.c | HIGH |
sub_413ED0 | apply_attributes_to_entity | attribute.c | HIGH |
sub_418F80 | init_attr_name_map | attribute.c:1524 | HIGH |
sub_419070 | init_attr_token_map | attribute.c | HIGH |
sub_4190B0 | reset_attribute_processing_state | attribute.c | HIGH |
sub_6B5E50 | process_nv_register_params / attribute registration | nv_transforms.c | HIGH |
sub_6BC890 | nv_validate_cuda_attributes | nv_transforms.c | VERY HIGH |
Per-Attribute IL Emission Matrix
Every CUDA attribute is parsed into a 72-byte attribute IL node (entry kind 0x48). What happens after application is what determines downstream visibility. The matrix below classifies each attribute by its emission path: whether the attribute survives as a discrete IL node, collapses into entity bitfield mutations, allocates the side-band launch-config struct, or is preserved on the entity's attribute chain for the .int.c writer to re-emit.
| Attribute | Parse-time IL node | Post-apply form | .int.c re-emission | Pipeline consumer |
|---|---|---|---|---|
__host__ | kind 0x48, byte +8 = 'V' | entity+182 |= 0x15 | -- (entity bits only) | Device/host splitter (mark_to_keep_in_il) |
__device__ | kind 0x48, byte +8 = 'W' | functions: entity+182 |= 0x23; variables: entity+148 |= 0x01 | -- (entity bits only) | Device/host splitter, cross-space checker |
__global__ | kind 0x48, byte +8 = 'X' | entity+182 |= 0x61, then |= 0x80 | host stub generator emits launch wrapper | Kernel stub emitter (sub_489000) |
__tile_global__ | kind 0x48, byte +8 = 'Y' | (no handler) attribute node kept on attr chain | preserved through chain walk | cicc (downstream) |
__shared__ | kind 0x48, byte +8 = 'Z' | entity+148 |= 0x02 | -- (entity bits only) | Memory-space declarator |
__constant__ | kind 0x48, byte +8 = '[' | entity+148 |= 0x04 | -- (entity bits only) | Memory-space declarator |
__launch_bounds__ | kind 0x48, byte +8 = '\\' | entity+256 -> launch_config_t (+0, +8, +16) | re-emitted as IL kind 25 via sub_540560 | cicc (NVVM IR generator) |
__maxnreg__ | kind 0x48, byte +8 = ']' | launch_config+32 | -- (struct field) | cicc, ptxas |
__local_maxnreg__ | kind 0x48, byte +8 = '^' | launch_config+36 | -- (struct field) | cicc, ptxas |
__tile_builtin__ | kind 0x48, byte +8 = '_' | (no handler) attribute node kept on attr chain | preserved through chain walk | cicc (downstream) |
__managed__ | kind 0x48, byte +8 = 'f' | entity+148 |= 0x01 + entity+149 |= 0x01 | comma-op host wrapper + RT boilerplate via sub_489000 | CUDA runtime (__nv_init_managed_rt) |
__cluster_dims__ | kind 0x48, byte +8 = 'k' | launch_config+20/+24/+28 + flag bit 0; zero-arg sets entity+183 |= 0x40 | -- (struct fields) | cicc |
__block_size__ | kind 0x48, byte +8 = 'l' | launch_config+40/+44/+48 + flag bit 1; optional +20/+24/+28 | -- (struct fields) | cicc |
__nv_pure__ | kind 0x48, byte +8 = 'n' | (no entity mutation) attribute node kept on attr chain | re-emitted as IL kind 25 via sub_540560 (shared path with \\) | cicc (applies LLVM readonly/willreturn) |
__nv_register_params__ | kind 0x48 via GNU path (no CUDA kind byte) | entity+183 |= 0x08 | -- (entity bits only) | cicc (ABI selector) |
__forceinline__ | (no CUDA kind byte; processed via inline-control path) | entity+177 |= 0x10 | emitted as __attribute__((always_inline)) | Host compiler + cicc |
__noinline__ (EDG form) | (no CUDA kind byte) | entity+179 |= 0x20 (+ ABI node on prototype in C mode) | emitted as __attribute__((noinline)) | Host compiler + cicc |
__noinline__ (GNU form) | kind 0x48 via GNU __attribute__ | entity+180 |= 0x80 | emitted as __attribute__((noinline)) | Host compiler + cicc |
__inline_hint__ | (no CUDA kind byte) | entity+179 |= 0x10 | emitted as suggestion (non-binding) | cicc inlining heuristics |
__grid_constant__ | parameter-level attribute | entity+164 |= 0x04, type+133 |= 0x20, param+32 |= 0x02 | emitted into kernel parameter declaration | Kernel parameter ABI (cicc + driver) |
__restrict__ | C99/GNU restrict path (not a CUDA kind) | Type-qualifier on parameter type chain | preserved as __restrict__ on parameter | Host compiler + cicc |
Three Emission Categories
Reading the matrix vertically, every CUDA attribute lands in exactly one of three categories:
-
Collapse to entity bits --
__host__,__device__,__shared__,__constant__,__managed__,__nv_register_params__,__forceinline__,__noinline__,__inline_hint__. After application, no attribute IL node survives. The downstream consumer reads entity bytes (+148,+149,+177,+179,+180,+182,+183) instead. The original kind-0x48IL node is freed back to the arena when the attribute chain is torn down. -
Side-band launch-config struct --
__launch_bounds__,__maxnreg__,__local_maxnreg__,__cluster_dims__,__block_size__. The attribute IL node is consumed; the values are extracted into the 56-bytelaunch_config_tpointed to byentity+256.__launch_bounds__additionally walks back through the writer withkind_field = 25so cicc sees it in.int.c. The other four are consumed entirely by cudafe++ and the launch-config struct is read by later passes (kernel stub generator, ptxas argument formatter). -
Preserved on attribute chain --
__nv_pure__,__tile_global__,__tile_builtin__,__grid_constant__. These have no entity-bit collapse (or only a flag bit alongside the chain). The attribute IL node remains attached to the entity through code generation. The.int.cwriter (sub_5565E0family,sub_540560) walks the chain and re-emits the attribute textually so cicc can apply the corresponding LLVM-level semantics. For the__tile_*pair, no cudafe++ consumer exists -- the attribute is pure pass-through.
Why There Is No Per-Attribute IL Node Type
A CUDA attribute is never lowered into a discrete IL node of its own kind. Every attribute reuses the generic attribute kind (0x48), with its arguments stored as attribute_argument nodes (kind 0x49) and the whole bundle wrapped in an attribute_group node (kind 0x4A). The discriminator is the byte at +8 of the 0x48 node ('V'..'n' for CUDA attributes; standard descriptors for GNU/C++11/MSVC attributes). This is a deliberate EDG design choice: the IL graph stays small and the dispatch logic centralizes in apply_one_attribute (sub_413240) rather than fanning out into dozens of node kinds.
The practical consequence is that "which IL node is emitted for attribute X" is the wrong question. The correct question is "which entity-byte mutation does attribute X cause, and which downstream pass reads that byte." The matrix above answers both.
Cross-References
- global Function Constraints -- detailed validation rules for
__global__ - Launch Configuration Attributes --
__launch_bounds__,__cluster_dims__,__block_size__ - grid_constant -- grid-constant parameter attribute
- managed Variables -- managed memory attribute
- Minor CUDA Attributes --
__noinline__,__forceinline__,__nv_register_params__,__nv_pure__ - _nv* Builtin Intrinsic Names -- 110
__nv_*identifiers that look like attributes but are intrinsics / lambda machinery - Entity Node Layout -- full entity structure with CUDA field offsets
- CUDA Execution Spaces -- how execution space bits drive code generation
- CUDA Memory Spaces -- memory space bitfield semantics
-
The "25 reserved slots" figure comes from the upper and lower bounds of the CUDA kind range as observed in
sub_40A310(attribute_display_name) and the cross-validating switch dispatchers insub_40FDB0,sub_410A20, andsub_4115F0. Of the 25, only kinds 86-95, 102, 107, 108, and 110 appear in any user-facing diagnostic, source-level keyword table, or descriptor entry; the others were either never wired up at this binary revision, are reserved internal markers used briefly during parsing, or were retired between EDG/CUDA revisions. ↩