Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM Properties Blob and Attr Parsers

Abstract

Every nvvm.* op that carries inline-data attributes gets a uniform Properties record bump-allocated next to its Operation*. The NVVM-to-LLVM lowering dispatcher shares one blob layout across the whole dialect, and five access patterns (A..E) cover every family: tcgen05.mma, ldmatrix / stmatrix, wgmma / mma.sync, cp.async.bulk / TMA, atomicrmw / red, the prefetch / fence / elect.sync triad, and block-scaled MMA.

NVVMDialect::initialize installs a 67-element enum-attr registrar chain that registers the enum namespaces those patterns consume. The slot tables below are normative — they pin each op mnemonic to its enum / unit / int / array slot positions, so a reimplementer can wire getAttrOfType<EnumAttr> (or OpAdaptor) to the offsets the upstream dispatcher already reads.

Properties record layout

Every dispatcher arm opens with the same property fetch. The Operation* header's discriminator byte at op[46] carries one high bit that selects inline Properties storage versus out-of-line bump-allocator storage. The bit is set for every NVVM op in this binary, so the effective offset is 16 and the fetch collapses to a single pointer load:

props_ptr = *(qword*)(op + 16 * (op[46] >> 7) + 0)        // = +0

The first 16 bytes hold an OperandSegmentSizes inline buffer for ops with variadic operand groups; the bytes are zero otherwise. +16..+47 is reserved padding. Attribute slots start at +64 and march on an 8-byte stride. Each slot is an Attribute* — either null (optional attribute absent) or a pointer to an AttrStorage header whose i32 payload sits at +8:

+0..+15   OperandSegmentSizes (16 B, inline; zero when op has no segments)
+16..+47  zero-pad / reserved inline storage
+64       slot 0  Attribute*    (8 B)
+72       slot 1  Attribute*
+80       slot 2  Attribute*
+88       slot 3  Attribute*
+96       slot 4  Attribute*
+104      slot 5  Attribute*
+112      slot 6  Attribute*
+120      slot 7  Attribute*
+128      slot 8  Attribute*
+136      slot 9  Attribute*
+144      slot 10 Attribute*
+152      slot 11 Attribute*
+160      slot 12 Attribute*  (rare; only block-scaled MMA reaches here)

The biggest record observed reads 13 slots: nvvm.mma.block_scale touches +64..+136 plus +144. Every per-arm offset lands on 64 + 8*k for k ∈ [0,12] — no half-pointer storage, no odd alignment. Other observed slot counts: nvvm.ldmatrix=4, tcgen05.mma=9, wgmma.mma_async=12. The out-of-line bump-allocator path goes unused for this NVVM op set; lowering rejects any op whose discriminator says its properties are out-of-line.

The five access patterns

All 199 dispatcher arms reach into their Properties record through one of five inline-templated helpers. They share the slot-fetch arithmetic from the layout above and differ only in what they do with the slot's Attribute* and which payload they pull out.

PatternAttribute kindSlot readStored resultUsed for
AEnumAttrload slot pointer, then read the padded i32 enum payloaduint32_t enumshape, typeA/B/C, layout, trans, eltype, scale_in/out, kind, sparsity, cta_group, collectorA/B, cp_size, cache_modifier, red_op, red_type, mem_order
BOptional<EnumAttr>Pattern A, but null-toleranttagged uint64_t with present flag plus valueoptional layout, trans, sparsity
CUnitAttr / BoolAttrtest whether the slot pointer is non-nullboolsatfinite, transA/B, has_write_disable, tcgen05.fence direction, prefetch L2 marker
DIntegerAttrread the APInt valueu32, or u64 when active bits exceed 32mask on elect.sync, cache_level on prefetch, num on ldmatrix
EArrayAttrread the first element of the arrayfirst i32 of arraykernel bool, maxntid first element

Pattern A is the workhorse: more than half of every per-arm slot read follows it, because NVVM EnumAttrs uniformly pad their payload to a full 32-bit word at slot+8 regardless of cardinality. Pattern B's tagged-int return is what feeds the present-flag inspections scattered through the dispatcher. Pattern C never touches the attribute payload at all. Pattern D bottoms out in APInt::getValue. Pattern E is the rarest — only the nvvm.kernel / nvvm.maxntid function-attribute decoders use it.

Per-op-family Properties slot maps

The 199 dispatcher arms divide into the eight families below. Access patterns reuse the A..E labels from the table above.

tcgen05.mma family (Blackwell sm_100a / sm_100f, 16 arms)

Op mnemonicSlotPatternField
nvvm.tcgen05.mma+64 / +72 / +80 / +88A / A / A / AtypeA/cType, collectorA, scale_d, layout-bits
nvvm.tcgen05.mma.block_scale+64 / +72 / +80 / +88A / A / A / AcType, collectorA, scale_d, layout, kindA, kindB
nvvm.tcgen05.mma.sp+64 / +72 / +80 / +88A / A / A / Asame as tcgen05.mma plus the metadata operand slot
nvvm.tcgen05.mma.wsoperand-only
nvvm.tcgen05.mma.ws.spoperand-only
nvvm.tcgen05.mma.sp.block_scale+64 / +72 / +80 / +88A / A / A / Amerge of sp and block_scale fields
nvvm.tcgen05.shiftoperand-only
nvvm.tcgen05.commit+64Acta_group
nvvm.tcgen05.commit.arrive+64Acta_group
nvvm.tcgen05.cp+64 / +72 / +80A / A / Amulticast, shape, src_fmt
nvvm.tcgen05.alloc+64Acta_group
nvvm.tcgen05.dealloc+64Acta_group
nvvm.tcgen05.relinquish_alloc_permit+64Acta_group
nvvm.tcgen05.wait+64Await_kind (load or store)
nvvm.tcgen05.fence+64Cfence-kind marker
nvvm.tcgen05.{ld,st}matrixindexed-operand walker only

tcgen05.mma is the only family where the first 16 Properties bytes aren't idle. The op carries a variable-arity operand list, so the dispatcher reserves +0..+15 for a packed OperandSegmentSizes buffer plus a second 16-byte running-offset buffer at +96..+111.

ldmatrix / stmatrix (Volta+ tensor-core fragment ops, 3 arms)

Op mnemonicSlotPatternField
nvvm.ldmatrix+64 / +72 / +80 / +88A / D / A / Aeltype/size, num, shape, trans
nvvm.stmatrix+64 / +72A / Ashape, trans; num is the SSA-vector cardinality, not a property
nvvm.stmatrix alternate selector+64Atrans encoded as a 0/1 enum

The alternate stmatrix selector disambiguates intrinsic variants from the trans enum alone. It fires when the operand vector matches the narrower selector shape.

wgmma / mma.sync (Hopper sm_90a, 4 arms)

Op mnemonicSlotPatternField
nvvm.wgmma.mma_async+64 / +72 / +80 / +88 / +96 / +112 / +120 / +128 / +136A × 8, D × 1typeA, b1Op, typeB, shape, typeC, scaleIn, scaleOut, layoutA, layoutB
nvvm.wgmma.commit_group_sync_aligned+64 / +72A / Awgmma_type, wgmma_layout
nvvm.wgmma.wait_group_sync_aligned+64 / +72 / +88A / A / Atype, layout, shape-N selector
nvvm.mma.sync+64 / +72 / +80 / +88 / +96 / +104A × 6b1Op, multiplicandAPtxType, layoutA, layoutB, multiplicandBPtxType, intOverflowBehavior
nvvm.wmma familyoperand-only; eltype/k/m/n/layout are baked into the resolved intrinsic name at build time

cp.async.bulk / TMA (Hopper+ sm_90a / Blackwell sm_100, 8 arms)

Op mnemonicSlotPatternField
nvvm.cp.async.bulk.tensor.reduce+64 / +72 + rank-dependent slotC / C / Amulticast presence, cache-hint, reduce_kind
nvvm.cp.async.bulk.tensor.prefetch+64 / +72A / Cim2col-type, multicast
nvvm.cp.async.bulk.tensor.shared.cta.to.globaloperand-only
nvvm.cp.async.bulk.tensor.shared.cta.to.global.extim2col / cache-hint operands
nvvm.cp.async.bulk.tensor.shared.cluster.to.globaloperand-only
nvvm.cp.async.bulk.tensor.base+64 / +72 / +80C / C / Chas_im2col, has_multicast, has_cache_hint
nvvm.cp.async.shared.*.global+64 / +80 / +72A / A / Ccp_size, ca/cg cache modifier, L2-hint presence
nvvm.cp.async.commit_group+64Aca/cg modifier

atomicrmw / red (sm_60+, 5 arms)

Op mnemonicSlotPatternField
nvvm.atomicrmw+64 / +72A / Amem_order, atomic_op
nvvm.red variant 1+64 / +72A / Ared_op, red_type
nvvm.red variant 2+64 / +72A / Ared_op, red_type
nvvm.red variant 3+64 / +72A / Ared_op, red_type
nvvm.atomic.cas / nvvm.red.b128 (parser-arm shorthand; neither string appears in the binary — both arms are reached by TypeID dispatch from the nvvm.cmpxchg / 128-bit reduction lowerings)+64 / +72 / +80 / +88A / A / A / Afour enum slots

prefetch / fence / elect.sync (5 arms)

Op mnemonicSlotPatternField
nvvm.tcgen05.fence before+64Cbefore unit marker
nvvm.tcgen05.fence after+64Cafter unit marker
nvvm.elect.sync+64Dmask
nvvm.prefetch / nvvm.prefetch.tensormap+64 / +72 / +80 / +88 / +96D / C / A / A / Acache_level, L2 marker, to-tensormap flag, evict-priority, prefetch-mode
nvvm.cvt.packfloat.f32helper-decodedAproperty-decoded and emitted through a helper

Block-scaled MMA (nvvm.mma.block_scale, sm_100a, 1 arm)

Op mnemonicSlotPatternField
nvvm.mma.block_scale+64 / +72 / +80 / +88 / +96 / +112 / +120 / +128 / +136A × 9typeA, b1Op, typeB, shape, typeC, scaleAFmt, scaleBFmt, scale_vec, layoutA

Block-scaled MMA reuses the wgmma.mma_async prologue shape but rewires the slots: +128 swaps layoutA for scale_vec, and +112/+120 swap scaleIn/scaleOut for scaleAFmt/scaleBFmt. The slot index, not the byte offset, is the canonical identifier.

The 67-element enum-attr registrar chain

NVVMDialect::initialize installs 68 attribute registrars. Sixty-seven are single-namespace EnumAttr registrars; the sixty-eighth is the NVVMTargetAttr registrar carrying chip, features, link-files, and flags. Every enum registrar has the same shape — assemble an attribute-class definition tuple, add it to the dialect, attach the printer/parser pair to the attribute-name table.

The 67 namespaces cover every enum-typed Properties slot read by the dispatcher. Grouped by family, the chain registers cache / memory hints (cache_eviction_priority, load_cache_modifier, load_cache_modifier_ext, store_cache_modifier, l2_prefetch, evict_kind, prefetch_cache_level); address spaces and scopes (state_space, shared_space, mem_scope, mbar_scope, mbar_space); memory ordering and fences (mem_order, proxy_kind, action, tcgen05_fence, tcgen05_wait); warp-level collectives (shfl_kind, vote_sync_kind, match_sync_kind, redux_kind, barrier_redux_kind); mbarrier / FP / cvt (mbar_txn_kind, mbar_wait, fp_rnd_mode, sat_mode, rnd, sat, convert_fp4_type, convert_fp6_type, convert_fp8_type, packfloat_type); MMA / WMMA / WGMMA (shape, mma_layout, mma_type, mma_frag, mma_b1op, mma_int_overflow, mma_cta_count, sparsity_format, load_shape, store_shape, load_src_format, wgmma_scale_in, wgmma_scale_out, wgmma_type); block-scaled and tcgen05 (scale_vec_size, block_scale_format, tcgen05_mma_kind, tcgen05_mma_collectorop, tcgen05_mma_scale_vec, tcgen05_mma_collectorb, TmemLayout, TCBarParam, tcgen05_group, tcgen05_cp_shape, tcgen05_cp_multicast, tcgen05_cp_src_fmt, tcgen05_ldst_shape, load_mode); TMA / atomic / reduction (tma_store_mode, tma_redux_kind, red_op, red_type, mul_mode, atomic_op, dot_accumulate_type).

These namespaces are exactly the enums whose i32 payloads Pattern A pulls from the slot trailers above. The chain only registers parse-side machinery; constant materialization is a later lowering concern. During the NVVM-to-LLVM rewrite, any enum payload that needs to become an SSA constant materializes as llvm.mlir.constant %c : i32. For inline-asm slots that bypass the intrinsic table, see NVVM Overview — Inline-PTX Templates and Constraint Strings.

Reimplementation Notes

A clean implementation drives off a generated slot schema, not a hand-written switch on every op:

for op in nvvm_ops:
    props = read_inline_properties(op)
    schema = schema_for(op.name)

    for field in schema.fields:
        slot = props.slots[field.index]
        value = decode(slot, field.pattern)
        emit_lowering_operand_or_intrinsic_selector(field.name, value)

The invariants are small. Properties are inline for this op set. Slots start at byte 64 and advance by one pointer. Enum attributes decode through padded 32-bit payloads. Optional enum attributes carry a presence bit separate from the value.

Position in the cross-stage attribute system

The Properties blob is the terminal carrier for the memory-ordering, cache-modifier, and MMA-shape attributes that ride down from the higher dialects. Earlier stages keep these facts in the op-attribute dictionary; by the time the NVVM dispatcher sees the op, the attribute has folded into a positional slot in the blob. Attribute System and Lowering documents that journey across the full pipeline — which carrier each fact lives in at each stage, which transitions are intentionally lossy, and which silent drops are wrong-output bugs that ptxas will not catch.