Attribute System and Lowering
Abstract
Compiler attributes carry the semantic context that IR-graph structure alone cannot express: kernel launch shape, fast-math flags, pipeline staging, scheduling hints, memory ordering, layout descriptors. Every lowering stage in tileiras has a deliberate policy for each attribute family: preserve it under a renamed key for the new dialect, consume it and drop the carrier once the analysis that needed it has finished, or synthesize a fresh attribute from inferred facts. A reimplementation that drops the wrong attribute at the wrong stage emits PTX that compiles and runs but produces the wrong answer; the bytes survive ptxas because nothing it sees is malformed.
This page is the canonical reference for the attribute system as a whole. The per-stage lowering pages document where individual attributes flow; the dialect type-and-attribute pages document the per-attribute parse contract and verifier. This page documents the cross-cutting policy that ties them together: what each lowering stage does to each attribute family, which transitions are intentionally lossy, and which silent drops are wrong-output bugs waiting to be introduced.
Attribute carriers
MLIR exposes five places an attribute can live. Tileiras uses all five, and the decision of which carrier to use is part of the attribute's contract — moving an attribute from one carrier to another changes who reads it, when it is read, and what happens when the carrier disappears under a rewrite.
| Carrier | Storage | Lifetime | Primary readers |
|---|---|---|---|
| Op attribute dictionary | DictionaryAttr on the op header, or inherent Properties storage for ops that declared an inherent attribute slot | Bound to the op; survives clones unless the op is rewritten away | Verifiers, fold rules, conversion patterns, the AsmPrinter |
| Type-storage parameters | Fields inside the type's TypeStorage derivative, uniqued through the context StorageUniquer | Bound to the type identity; outlives every op that uses the type | Type-equality checks, type converters, every walker that keys on type |
| Function-level named-attribute dictionary | func.func (or llvm.func) operation header | Bound to the function symbol; survives function-level clones | Function-boundary lowering, LLVM function-attribute emission, the PTX directive emitter |
| Module-level dictionary | builtin.module operation header | Bound to the module; survives across passes that do not rewrite the module shell | Pipeline driver, target-attachment pass, options-mapping pass |
| NVVM properties blob | Per-op compact slot table at Operation*+64, slots stride 8 bytes | Bound to the op like an inherent attribute, but the slots are positional, not keyed | The NVVM-to-LLVM dispatcher arms documented in Properties Blob and Attr Parsers |
The carrier decision matters because the rules for who can read a carrier differ. An op-attribute dictionary entry is keyed by string; a pass that consumes the op can fetch it through getAttr("name"). A type-storage parameter is positional; only code that knows the type's storage class can read it. A function-level attribute is read by a different set of passes than an op-level attribute carrying the same name. Moving an attribute from op-level to function-level (for example, when a kernel-spec entry on the function summarises a per-op annotation) changes the answer to "which pass owns this attribute now?".
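The effect of moving an attribute between carriers can be sketched with a toy model. The `Op` and `Func` classes, the helper functions, and the `hint.num_warps` key below are all hypothetical and stand in for the real MLIR structures; the sketch only illustrates that a pass reading the op-level carrier stops seeing a value once it has been hoisted to the function.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    attrs: dict = field(default_factory=dict)  # keyed op-attribute dictionary


@dataclass
class Func:
    name: str
    attrs: dict = field(default_factory=dict)  # function-level dictionary
    body: list = field(default_factory=list)   # ops inside the function


def read_op_level(func, key):
    """What an op-level pass sees: one value per op that carries the key."""
    return [op.attrs[key] for op in func.body if key in op.attrs]


def read_func_level(func, key):
    """What a function-boundary pass sees: a single value, or None."""
    return func.attrs.get(key)


def hoist_to_function(func, key, combine=max):
    """Summarise a per-op annotation on the function and drop the op-level copies.

    The choice of `max` as the summary is an assumption made for illustration.
    """
    values = read_op_level(func, key)
    if values:
        func.attrs[key] = combine(values)
        for op in func.body:
            op.attrs.pop(key, None)
    return func
```

After `hoist_to_function` runs, `read_op_level` returns an empty list and `read_func_level` returns the summary, which is the ownership change the paragraph above describes.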
Lifecycle of a kernel attribute
Tracing one attribute end to end makes the policy concrete. The frontend hint tt.num_warps = 4, a Triton-style annotation requesting four warps per CTA, flows through every lowering stage in tileiras, changing carrier and key as it travels. The end result is the PTX directive .reqntid 128, 1, 1 in the kernel's .entry header.
Stage 0 (frontend input). A Triton-style producer emits cuda_tile bytecode with the kernel-spec hint attached to the module:
```mlir
module attributes { tt.num_warps = 4 : i32 } {
  cuda_tile.entry @gemm(...) { ... }
}
```
Stage 1 (ConvertCudaTileToTileAA). The pass walks cuda_tile.module operations and lowers their bodies, but the module-level dictionary entry passes through verbatim. The nv_tileaa dialect declares the same string key as a legal attribute on its enclosing module, so the conversion target does not reject it. The lifecycle here is "preserve, do not rename".
Stage 2 (ConvertTileAAToTileAS, kernel-spec attach). The attachKernelSpecAttributes step folds the frontend hint into the function-level nv_tileaa.kernel_spec attribute. The value num_warps = 4 becomes one field of a structured kernel-spec record on the function. The lifecycle is "consume and synthesize": the module-level tt.num_warps is read once and a function-level nv_tileaa.kernel_spec is written.
Stage 3 (TileAS scheduling and layout). The scheduler reads kernel_spec.num_warps = 4 to size the warp partitioning and resource pools. The agent-switch pass reads nv_tileas.num_warps = 4 (a per-agent mirror written by OptimizeExecutionUnitMapping) to round each agent's starting warp to its group size. The lifecycle is "read to act, do not rewrite".
Stage 4 (ConvertTileFuncToLLVM). The function-boundary lowering reads nv_tileaa.kernel_spec and writes nvvm.reqntid = 128 : i32 onto the rewritten func.func, derived from 32 * num_warps = 32 * 4 = 128. The lifecycle is "consume and synthesize"; the kernel-spec attribute remains for downstream readers, but the nvvm.reqntid carrier is what the PTX emitter consumes next.
Stage 5 (PTX directive emission). The kernel-directive emitter walks the LLVM function's nvvm.* attribute set in the fixed order documented in Host Launch and ptxas Knobs and emits .reqntid 128, 1, 1 into the .entry header. The lifecycle is "read and project to PTX".
The full trace is six carriers linked by five transitions: module dictionary → function-level kernel-spec → scheduler-internal pool sizing → function-level nvvm.* → PTX directive → cubin metadata consumed by the CUDA driver at launch time. Each transition has a different rule, and each transition is owned by exactly one pass.
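The carrier changes in Stages 2, 4, and 5 can be sketched as three small functions over plain dictionaries. This is a minimal model, not the tileiras implementation: the function names are hypothetical, the kernel-spec record is reduced to a single field, and the sketch does not model whether the module-level key is removed after it is read.

```python
WARP_SIZE = 32  # Stage 4 derives reqntid as 32 * num_warps


def attach_kernel_spec(module_attrs, func_attrs):
    """Stage 2: read the module-level hint, synthesize the function-level kernel spec."""
    out = dict(func_attrs)
    out["nv_tileaa.kernel_spec"] = {"num_warps": module_attrs["tt.num_warps"]}
    return out


def lower_function_boundary(func_attrs):
    """Stage 4: read the kernel spec, write nvvm.reqntid; the kernel spec stays."""
    out = dict(func_attrs)
    num_warps = func_attrs["nv_tileaa.kernel_spec"]["num_warps"]
    out["nvvm.reqntid"] = WARP_SIZE * num_warps
    return out


def emit_reqntid(func_attrs):
    """Stage 5: project the function-level carrier to a PTX directive string."""
    return f".reqntid {func_attrs['nvvm.reqntid']}, 1, 1"


module_attrs = {"tt.num_warps": 4}
func_attrs = attach_kernel_spec(module_attrs, {})
func_attrs = lower_function_boundary(func_attrs)
directive = emit_reqntid(func_attrs)  # ".reqntid 128, 1, 1"
```

Each function reads exactly one carrier and writes exactly one, which mirrors the rule that each transition is owned by a single pass.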
Attribute categories
The attribute system breaks into nine functional families. Each family has its own carrier policy, its own set of readers, and its own per-stage rewrite rules.
| Family | Representative attributes | Carrier | Primary readers |
|---|---|---|---|
| Launch shape | nvvm.reqntid, nvvm.maxntid, nvvm.cluster_dim, nvvm.maxclusterrank, nvvm.minctasm, nvvm.maxnreg, nvvm.blocksareclusters, nvvm.explicitcluster, nvvm.grid_constant | Function-level dictionary | Kernel-directive emitter, NVVM IR verifier |
| Compute capability | nv_tileaa.compute_capability, nv_tileaa.target_spec, nv_tileas.compute_capability, nvvm.target | Module-level dictionary; nvvm.target is a type-storage parameter on a type-encoded target attribute | Target-attachment pass, SM-gated rewriter guards |
| Kernel spec | nv_tileaa.kernel_spec, nv_tileas.num_warps, nv_tileas.workspace_global_offset | Function-level dictionary | Scheduler, agent-switch builder, function-boundary lowering |
| Fast-math | fastmath = "contract", fastmath = "nnan", fastmath = "ninf", fastmath = "nsz", fastmath = "arcp", fastmath = "afn", fastmath = "reassoc" | Op-level inherent attribute on arithmetic and MMA ops | Arith folder, instruction selector, intrinsic-rewrite pattern |
| Memory ordering | mem_semantic (relaxed / acquire / release / acq_rel / sc), mem_scope (cta / cluster / gpu / sys), mbar_scope, mbar_space | Op-level inherent attribute on memory ops; later an NVVM properties slot | Memory-op verifier, NVVM dispatcher arm A, LLVM atomic emitter |
| Cache policy | cache_modifier (.ca/.cg/.cs/.cv), eviction_policy, l2_prefetch, cache_eviction_priority | Op-level inherent attribute | Memory-op selectors, ptxas directive emitter |
| Layout / shape | cuTe layout descriptors, DenseI32ArrayAttr tile shape on partition_view, mma_layout, wgmma_layout | Type-storage parameter (layout-on-view), op-level attribute (layout-on-MMA) | Layout-assignment pass, atom builders, MMA intrinsic selector |
| Pipeline staging | pipeline_stage, num_stages, nv_tileas.persistent, tileas.schedule.constraint.* | Op-level discardable attribute on async-pipeline ops; some live in inherent properties when the op definition reserved a slot | Modulo scheduler, MaterializeAsync, schedule-constraint parser |
| Assumption / debug | div_by, bounded, same_elements (assumption predicates on cuda_tile.assume); di_loc, di_compile_unit, di_file, di_lexical_block, di_subprogram (debug info) | Op-level attribute | Optimizer, debug-info translator |
The kernel-spec family is the central pivot. Frontend hints land as kernel-spec fields, the scheduler reads kernel-spec fields, function-boundary lowering reads kernel-spec fields and writes nvvm.* attributes from them, the PTX emitter walks the nvvm.* set in a fixed order — every interesting per-kernel decision passes through the kernel-spec at least once.
Per-stage attribute rules
Each lowering stage has a published rule for each attribute family. The table below is the policy matrix every conversion pattern must respect: which attributes the stage is allowed to drop, which it must preserve (renamed under the new dialect's prefix), which it must synthesize from inferred or read facts, and which it reads to drive its own rewriting decisions without modifying them.
| Stage | Drops | Preserves and renames | Synthesizes | Reads to act |
|---|---|---|---|---|
| Frontend → cuda_tile (bytecode input) | (none) | (none) | (none) | (none; this is the input contract) |
| ConvertCudaTileToTileAA | (none) | All op-attribute dictionaries flow through the TypeConverter and emerge on the rewritten nv_tileaa ops; fastmath carries verbatim on arithmetic ops | (none) | Compute capability from the pass option to specialise type conversion |
| ConvertTileAAToTileAS | (none at this stage; downstream passes drop intermediate analysis attrs) | Per-op CopyAtom and ReduceAtom witnesses ride verbatim onto the new nv_tileas ops; layout attributes carry through | nv_tileaa.kernel_spec on the function from frontend hints; SM-gated rewrites consult it through the attached attribute | nv_tileaa.compute_capability for SM100 block-scaled MMA admission |
| TileAS scheduling and layout (D07-D22) | Scheduler-internal intermediate attrs after MaterializeSchedule consumes them | pipeline_stage, nv_tileas.num_warps, schedule-constraint attrs survive into materialization | pipeline_stage integer on each producer/consumer region, nv_tileas.num_warps mirror on agent-switch ops, agent_strides array on agent_switch | kernel_spec.num_warps, kernel_spec.num_ctas, schedule-constraint attrs |
| ConvertTileFuncToLLVM | nv_tileaa.kernel_spec field-by-field (the function-level dictionary entry stays; its readers move) | nv_tileaa.compute_capability, nv_tileaa.target_spec; nv_tileaa.grid_constant argument attributes are renamed and migrated onto the LLVM-typed arguments by the downstream CuteKernelToNvvmRewrite pass | nvvm.reqntid from 32 * numWarps; nvvm.cluster_dim when targetSM > 89 && clusterProduct > 1; nvvm.blocksareclusters under the same predicate; nvvm.minctasm = 1; nvvm.maxnreg from per-SM occupancy table when nv_tileaa.occupancy is set; cute.kernel unit marker (renamed to nvvm.kernel only in the downstream pass) | nv_tileaa.kernel_spec field accessors |
| ConvertTileASToLLVM body conversion | Async-token operand types collapse to i32 carriers; some carrier-only attrs disappear with their ops | nvvm.* properties attributes on lowered ops survive into the NVVM dispatcher slots described in Properties Blob and Attr Parsers | NVVM properties slots from the lowered op's MLIR attribute dictionary; mem_semantic becomes a Pattern-A enum slot at +64, mem_scope becomes a Pattern-A enum slot at +72 | cute.kernel marker, CopyAtom and ReduceAtom witnesses |
| Companion cute*-to-LLVM lowering | CuTe-internal layout-algebra attributes after descriptor materialization | Tile-shape attributes survive into the emitted descriptor constants | TMA descriptor constants from cuTe layout attributes | Layout attributes, compute_capability for atom selection |
| ConvertNVGPUAndGPUToNVVM | gpu.kernel after rewriting to nvvm.kernel | nvvm.* family unchanged | (none beyond what the rewrite emits) | gpu.kernel, gpu.module target attribute |
| AttachNVVMTarget | (none) | Compute-capability and target-spec data folded into #nvvm.target | #nvvm.target attribute on the gpu.module with chip, features, link-files, flags | nv_tileaa.compute_capability, nv_tileaa.target_spec |
| MLIR-to-LLVM translation | The nvvm.* markers that have no LLVM-IR counterpart (e.g. nvvm.kernel is emitted as a calling-convention attribute, not as a metadata node) | All function-level nvvm.* directive carriers become LLVM function attributes named nvvm-reqntid, nvvm-cluster-dim, etc., or NVVM annotation tuples on the legacy path | LLVM function attributes; debug-info intrinsics | All carrier-only nvvm.* attributes |
| NVPTX MIR | Most function-level attributes outside the directive-bearing ones | nvvm-reqntid, nvvm-cluster-dim, nvvm-maxnreg, nvvm-minctasm, nvvm-grid-constant, nvvm-maxclusterrank, nvvm-blocksareclusters carry through as function attributes the AsmPrinter reads | NVPTXISD pseudo-opcodes for grid-constant arguments and TMA descriptor materialization | nvvm.kernel (entry vs func split), per-arg nvvm.grid_constant |
| AsmPrinter (MIR → PTX) | (none at emission time) | (none — this is the projection step) | PTX directives: .entry, .maxntid, .reqntid, .minnctapersm, .maxnreg, .explicitcluster, .reqnctapercluster, .maxclusterrank, .blocksareclusters | Every directive-bearing function attribute |
The two stages that synthesize the most are ConvertTileFuncToLLVM and AttachNVVMTarget. Function-boundary conversion is where frontend hints, scheduler analysis, and kernel-spec fields collapse into the small set of nvvm.* attributes the AsmPrinter will eventually project to PTX. Target attachment is where the per-module compute_capability and target_spec strings become the single resolved #nvvm.target attribute that drives every SM-gated decision downstream.
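The synthesis column for ConvertTileFuncToLLVM can be read as a small set of rules. The sketch below encodes the ones the table states explicitly; the function name, the parameter names, the integer encoding of the target SM (90 for sm_90), and the use of `True` for the unit-style nvvm.blocksareclusters marker are assumptions. The nvvm.maxnreg rule is left out because the per-SM occupancy table is not reproduced on this page.

```python
from math import prod


def synthesize_launch_attrs(num_warps, target_sm, cluster_dim=(1, 1, 1)):
    """Model the nvvm.* attributes written at the function boundary.

    num_warps   -- kernel-spec field (hypothetical parameter name)
    target_sm   -- compute capability as an integer, e.g. 90 (assumed encoding)
    cluster_dim -- cluster shape; its product is the clusterProduct in the table
    """
    attrs = {
        "nvvm.reqntid": 32 * num_warps,  # 32 * numWarps
        "nvvm.minctasm": 1,              # written unconditionally per the table
    }
    # Cluster attributes share one predicate: targetSM > 89 && clusterProduct > 1.
    if target_sm > 89 and prod(cluster_dim) > 1:
        attrs["nvvm.cluster_dim"] = tuple(cluster_dim)
        attrs["nvvm.blocksareclusters"] = True
    return attrs
```

Because both cluster attributes hang off the same predicate, a reimplementation that evaluates it once and writes both keys together cannot emit one without the other.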
Intentional drops and silent miscompiles
Not every attribute drop is a bug. The pipeline deliberately drops attributes once their consumer has read them, and the carrier serves no purpose after that point. Distinguishing intentional drops from accidental drops is the central correctness concern for any reimplementation.
Intentional drops:
- Scheduler-internal intermediate attributes are dropped after MaterializeSchedule consumes them. They exist only to communicate analysis state from one scheduler subpass to the next, and they would clutter the IR if left behind. The drop is correct because no downstream pass reads them.
- fastmath attributes on an op's output value disappear when the op is rewritten as an intrinsic that re-encodes the same flags. The intrinsic's argument list carries the flags forward (typically as a fastmathflags LLVM operand bundle), so the original attribute carrier is redundant.
- cute.kernel is renamed to nvvm.kernel by the downstream CuteKernelToNvvmRewrite pass; the original marker disappears once the rename runs. The two-step rename exists because the rewriter also lifts cute_nvgpu.grid_constant argument attributes to nvvm.grid_constant, and that lift needs the LLVM-typed function arguments the function-boundary pass has just produced.
- Per-op mem_semantic and mem_scope op-attribute entries fold into NVVM properties slots during ConvertTileASToLLVM. The op-attribute carrier vanishes, but the value survives at a positional slot the NVVM dispatcher reads.
Silent-miscompile drops to avoid:
- Dropping mem_semantic on a memory op during lowering produces a load or store with weaker ordering than the source requested. The NVVM dispatcher picks the relaxed-ordering arm by default, and the resulting PTX validates cleanly under ptxas; there is no diagnostic to surface the missing fact.
- Dropping mem_scope on a cluster-scope atomic produces a CTA-scope atomic on Hopper hardware. The two opcodes both exist and both pass the NVVM IR verifier; the cluster invariant is not checked.
- Dropping nv_tileaa.compute_capability before AttachNVVMTarget runs produces a #nvvm.target attribute with the default chip, not the requested one. The NVVM IR verifier accepts the target because the chip string is legal; the cubin compiles for the wrong SM and runs in degraded mode (or crashes on unsupported instructions).
- Dropping nv_tileaa.kernel_spec before function-boundary conversion produces a kernel without launch-bound directives. The function compiles as a .func instead of a .entry, and the resulting cubin exposes no kernel for the driver to launch.
- Dropping nvvm.grid_constant on a TMA descriptor argument produces a kernel that copies the descriptor through parameter memory on every launch instead of materializing it once. ptxas accepts the result; the kernel runs but at degraded performance.
- Dropping fastmath on an mmaf op that the frontend marked contract produces an MMA emission that refuses fused-multiply-add formation. The PTX is correct under IEEE-754, slower than the user requested, and the diagnostic surface is empty.
A reimplementation should treat any attribute drop that is not on the intentional list as a candidate bug. The pass-level verifier catches structural mismatches but does not see semantic drops; the NVVM IR verifier sees structural target violations but does not see semantic miscompiles. The attribute-drop policy is the producer-side discipline that fills that gap.
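One way to enforce that discipline is to compare attribute keys before and after a pass against an explicit allowlist. The following is a minimal sketch under stated assumptions: `audit_drops` is a hypothetical helper, attributes are modelled as plain dictionaries, and the key `sched.tmp` is an invented stand-in for a scheduler-internal attribute. It checks key presence only, so a value that survives under the right key but with the wrong contents is not detected.

```python
def audit_drops(before, after, intentional_drops, renames=None):
    """Return the keys that vanished without being allowlisted or renamed.

    before, after     -- attribute dictionaries around one pass
    intentional_drops -- keys the pass is allowed to remove
    renames           -- mapping old key -> new key the pass is expected to write
    """
    renames = renames or {}
    suspicious = []
    for key in before:
        if key in after or key in intentional_drops:
            continue
        new_key = renames.get(key)
        if new_key is not None and new_key in after:
            continue  # carrier changed name, value still has a home
        suspicious.append(key)
    return sorted(suspicious)


before = {"cute.kernel": True, "mem_semantic": "acquire", "sched.tmp": 3}
after = {"nvvm.kernel": True}
report = audit_drops(
    before,
    after,
    intentional_drops={"sched.tmp"},
    renames={"cute.kernel": "nvvm.kernel"},
)
# report lists mem_semantic: it was neither allowlisted nor renamed.
```

Running a check like this after each conversion pass turns a silent drop into a test failure at the stage that caused it, rather than a wrong answer at run time.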
NVVM properties blob
The NVVM properties blob is the dialect-specific compact carrier that sits below the standard MLIR attribute-dictionary surface. Every nvvm.* op that carries inline-data attributes gets a uniform Properties record bump-allocated next to its Operation* header, with attribute slots starting at byte +64 and striding 8 bytes apart. The five access patterns (A through E) cover every per-op attribute family in the dialect — enum payloads, optional enums, unit attributes, integer attributes, and array attributes.
The blob is positional, not keyed. A slot's meaning is fixed by the dispatcher arm for that op mnemonic; a reimplementation that gets the slot ordering wrong reads the wrong attribute even when the data is present. The full slot tables for each op family, plus the 67-element enum-attr registrar chain that backs the parsers, are documented in Properties Blob and Attr Parsers.
Three properties of the blob matter for the cross-stage attribute system. First, the blob is the terminal carrier for memory-ordering and cache-modifier attributes; once an op reaches the NVVM dispatcher, its op-attribute dictionary has been collapsed into the slot table. Second, the blob is inherent storage, not discardable, so cloning an op preserves the slot values verbatim. Third, the slot ordering is the canonical reference for how the lowering pass maps op-attribute keys to NVVM Properties positions — getting the ordering right is exactly the constraint that a hand-written pattern set must satisfy.
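The consequence of positional storage can be illustrated with a byte-level toy model. Only the +64 base, the 8-byte stride, and the placement of mem_semantic at +64 and mem_scope at +72 come from this page; the integer enum encodings, the 64-bit little-endian slot format, the zero-filled header, and the helper names are assumptions made for the sketch.

```python
import struct

SLOT_BASE = 64    # first slot offset from the op header, per the text
SLOT_STRIDE = 8   # bytes between consecutive slots, per the text

# Hypothetical enum encodings; the real numeric values are not given here.
MEM_SEMANTIC = {"relaxed": 0, "acquire": 1, "release": 2, "acq_rel": 3, "sc": 4}
MEM_SCOPE = {"cta": 0, "cluster": 1, "gpu": 2, "sys": 3}

# Slot order for one op family: mem_semantic at +64, mem_scope at +72.
SLOT_ORDER = ("mem_semantic", "mem_scope")


def slot_offset(index):
    """Byte offset of the slot at a given position."""
    return SLOT_BASE + index * SLOT_STRIDE


def pack_blob(values, order=SLOT_ORDER):
    """Write each value at its positional slot; the header is modelled as zeros."""
    blob = bytearray(SLOT_BASE + len(order) * SLOT_STRIDE)
    for i, key in enumerate(order):
        struct.pack_into("<q", blob, slot_offset(i), values[key])
    return bytes(blob)


def read_slot(blob, key, order=SLOT_ORDER):
    """Read a slot by looking up the key's position in the given order."""
    (value,) = struct.unpack_from("<q", blob, slot_offset(order.index(key)))
    return value


blob = pack_blob({"mem_semantic": MEM_SEMANTIC["acquire"], "mem_scope": MEM_SCOPE["gpu"]})
right = read_slot(blob, "mem_semantic")                                   # acquire
wrong = read_slot(blob, "mem_semantic", order=("mem_scope", "mem_semantic"))  # reads the scope slot
```

A reader that assumes the reversed order gets a well-formed integer back, just not the one it asked for, which is why a slot-ordering mistake produces no diagnostic.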
Cross-references
Per-stage attribute movement is documented in Lowering Overview, cuda_tile to nv_tileaa, nv_tileaa to nv_tileas, and nv_tileas to LLVM. Host Launch and ptxas Knobs documents the launch-shape directive emitter and the per-directive policy. Properties Blob and Attr Parsers documents the NVVM properties carrier in detail. cuda_tile Types and Attrs, nv_tileaa Types, Attrs, Verifiers, and nv_tileas Types document the per-dialect parse contract and verifier for each attribute. Schedule Constraint Attributes covers the nine scheduler-constraint attribute strings the modulo scheduler reads. GPU Execution Model is the canonical reference for the launch-shape directives at runtime. DSL to PTX End-to-End walks a representative kernel through every stage and shows the attribute movement in context.