NVVM Properties Blob and Attr Parsers

Abstract

Every nvvm.* op that carries inline-data attributes gets a uniform Properties record bump-allocated next to its Operation*. The NVVM-to-LLVM lowering dispatcher shares one blob layout across the whole dialect, and five access patterns (A..E) cover every family: tcgen05.mma, ldmatrix / stmatrix, wgmma / mma.sync, cp.async.bulk / TMA, atomicrmw / red, the prefetch / fence / elect.sync triad, and block-scaled MMA.

NVVMDialect::initialize installs a 67-element enum-attr registrar chain that registers the enum namespaces those patterns consume. The slot tables below are normative — they pin each op mnemonic to its enum / unit / int / array slot positions, so a reimplementer can wire getAttrOfType<EnumAttr> (or OpAdaptor) to the offsets the upstream dispatcher already reads.

Properties record layout

Every dispatcher arm opens with the same property fetch. The Operation* header's discriminator byte at op[46] carries one high bit that selects inline Properties storage versus out-of-line bump-allocator storage. The bit is set for every NVVM op in this binary, so the effective offset is 16 and the fetch collapses to a single pointer load:

props_ptr = *(qword*)(op + 16 * (op[46] >> 7) + 0)        // = +0
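
The arithmetic in that expression can be modelled in a few lines. This is a minimal sketch, assuming a little-endian 64-bit target and treating the Operation header as a flat byte buffer. The helper names `props_load_offset` and `fetch_props_ptr`, the fake header, and the pointer value are illustrative assumptions, not symbols or data from the binary.

```python
import struct

DISCRIMINATOR_INDEX = 46  # header byte the text calls op[46]


def props_load_offset(op_header: bytes) -> int:
    """Evaluate 16 * (op[46] >> 7) + 0 for a header given as raw bytes."""
    high_bit = op_header[DISCRIMINATOR_INDEX] >> 7  # 0 or 1 for a single byte
    return 16 * high_bit + 0


def fetch_props_ptr(memory: bytes, op_base: int) -> int:
    """Load the qword the expression dereferences (little-endian assumed)."""
    header = memory[op_base:op_base + DISCRIMINATOR_INDEX + 1]
    offset = props_load_offset(header)
    (ptr,) = struct.unpack_from("<Q", memory, op_base + offset)
    return ptr


# Hypothetical header: high bit of byte 46 set, a made-up pointer stored at +16.
fake = bytearray(64)
fake[DISCRIMINATOR_INDEX] = 0x80
struct.pack_into("<Q", fake, 16, 0x7F00_0000_1000)
props_ptr = fetch_props_ptr(bytes(fake), 0)
```

With the high bit set the multiplier is 1, so the load lands 16 bytes past the op base; with the bit clear it would land at the base itself.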

The first 16 bytes hold an OperandSegmentSizes inline buffer for ops with variadic operand groups; the bytes are zero otherwise. +16..+47 is reserved padding. Attribute slots start at +64 and march on an 8-byte stride. Each slot is an Attribute* — either null (optional attribute absent) or a pointer to an AttrStorage header whose i32 payload sits at +8:

+0..+15   OperandSegmentSizes (16 B, inline; zero when op has no segments)
+16..+47  zero-pad / reserved inline storage
+64       slot 0  Attribute*    (8 B)
+72       slot 1  Attribute*
+80       slot 2  Attribute*
+88       slot 3  Attribute*
+96       slot 4  Attribute*
+104      slot 5  Attribute*
+112      slot 6  Attribute*
+120      slot 7  Attribute*
+128      slot 8  Attribute*
+136      slot 9  Attribute*
+144      slot 10 Attribute*
+152      slot 11 Attribute*
+160      slot 12 Attribute*  (rare; only block-scaled MMA reaches here)

The biggest record observed reads 13 slots: nvvm.mma.block_scale touches +64..+136 plus +144. Every per-arm offset lands on 64 + 8*k for k ∈ [0,12] — no half-pointer storage, no odd alignment. Other observed slot counts: nvvm.ldmatrix=4, tcgen05.mma=9, wgmma.mma_async=12. The out-of-line bump-allocator path goes unused for this NVVM op set; lowering rejects any op whose discriminator says its properties are out-of-line.
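
The slot arithmetic is simple enough to pin down as a pair of inverse helpers. The sketch below encodes only what the layout above states (base 64, stride 8, slot indices 0 through 12). The function names are illustrative, and rejecting unaligned or out-of-range offsets is a design choice of the sketch, not behavior observed in the dispatcher.

```python
SLOT_BASE = 64    # byte offset of slot 0 in the Properties record
SLOT_STRIDE = 8   # one Attribute* per slot
MAX_SLOT = 12     # highest slot index listed in the layout


def slot_offset(index: int) -> int:
    """Byte offset of slot `index` inside the Properties record."""
    if not 0 <= index <= MAX_SLOT:
        raise ValueError(f"slot index {index} outside 0..{MAX_SLOT}")
    return SLOT_BASE + SLOT_STRIDE * index


def slot_index(offset: int) -> int:
    """Inverse of slot_offset; rejects offsets that are not on a slot boundary."""
    delta = offset - SLOT_BASE
    if delta < 0 or delta % SLOT_STRIDE != 0:
        raise ValueError(f"offset +{offset} is not a slot boundary")
    index = delta // SLOT_STRIDE
    if index > MAX_SLOT:
        raise ValueError(f"offset +{offset} is past slot {MAX_SLOT}")
    return index
```

A reimplementation that stores slot maps by index can use `slot_index` to cross-check any byte offset it recovers from the binary.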

The five access patterns

All 199 dispatcher arms reach into their Properties record through one of five inline-templated helpers. They share the slot-fetch arithmetic from the layout above and differ only in what they do with the slot's Attribute* and which payload they pull out.

Pattern	Attribute kind	Slot read	Stored result	Used for
A	`EnumAttr`	load slot pointer, then read the padded `i32` enum payload	`uint32_t` enum	`shape`, `typeA/B/C`, `layout`, `trans`, `eltype`, `scale_in/out`, `kind`, `sparsity`, `cta_group`, `collectorA/B`, `cp_size`, `cache_modifier`, `red_op`, `red_type`, `mem_order`
B	`Optional<EnumAttr>`	Pattern A, but null-tolerant	tagged `uint64_t` with present flag plus value	optional `layout`, `trans`, `sparsity`
C	`UnitAttr` / `BoolAttr`	test whether the slot pointer is non-null	`bool`	`satfinite`, `transA/B`, `has_write_disable`, `tcgen05.fence` direction, prefetch L2 marker
D	`IntegerAttr`	read the `APInt` value	`u32`, or `u64` when active bits exceed 32	`mask` on `elect.sync`, `cache_level` on `prefetch`, `num` on `ldmatrix`
E	`ArrayAttr`	read the first element of the array	first `i32` of array	`kernel` bool, `maxntid` first element

Pattern A is the workhorse: more than half of all per-arm slot reads follow it, because NVVM EnumAttrs uniformly pad their payload to a full 32-bit word at +8 in the AttrStorage the slot points to, regardless of cardinality. Pattern B's tagged-int return is what feeds the present-flag inspections scattered through the dispatcher. Pattern C never touches the attribute payload at all. Pattern D bottoms out in APInt::getZExtValue. Pattern E is the rarest: only the nvvm.kernel / nvvm.maxntid function-attribute decoders use it.
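
The five patterns can be sketched as one decoder per label. This is a minimal model under stated assumptions: attributes are plain Python objects instead of AttrStorage pointers, a null slot is `None`, integers are treated as unsigned, and the present flag of Pattern B is placed at bit 32 of the tagged result. The text does not say where the flag lives, so `PRESENT_BIT` is a guess, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class EnumAttr:
    value: int  # stands in for the padded 32-bit enum payload


@dataclass(frozen=True)
class UnitAttr:
    pass


@dataclass(frozen=True)
class IntegerAttr:
    value: int  # stands in for the APInt, treated as unsigned


@dataclass(frozen=True)
class ArrayAttr:
    elements: Sequence[int]


PRESENT_BIT = 1 << 32  # assumed flag position for Pattern B


def decode_a(slot: EnumAttr) -> int:
    """Pattern A: mandatory enum, return the 32-bit payload."""
    if not isinstance(slot, EnumAttr):
        raise TypeError("Pattern A needs a non-null EnumAttr")
    return slot.value & 0xFFFFFFFF


def decode_b(slot: Optional[EnumAttr]) -> int:
    """Pattern B: null-tolerant enum, return flag plus value in one integer."""
    return 0 if slot is None else PRESENT_BIT | decode_a(slot)


def decode_c(slot: Optional[object]) -> bool:
    """Pattern C: presence test only; the payload is never read."""
    return slot is not None


def decode_d(slot: IntegerAttr) -> Tuple[int, int]:
    """Pattern D: return (value, width), widening to 64 past 32 active bits."""
    width = 64 if slot.value.bit_length() > 32 else 32
    return slot.value, width


def decode_e(slot: ArrayAttr) -> int:
    """Pattern E: first element of the array."""
    return slot.elements[0]


DECODERS = {"A": decode_a, "B": decode_b, "C": decode_c, "D": decode_d, "E": decode_e}


def decode(slot, pattern: str):
    return DECODERS[pattern](slot)
```

Keeping the decoders in a table keyed by pattern label lets a per-op schema name the pattern as data instead of hard-coding a call site per arm.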

Per-op-family Properties slot maps

The 199 dispatcher arms divide into the seven families below. Access patterns reuse the A..E labels from the table above.

tcgen05.mma family (Blackwell sm_100a / sm_100f, 16 arms)

Op mnemonic	Slot	Pattern	Field
`nvvm.tcgen05.mma`	+64 / +72 / +80 / +88	A / A / A / A	typeA/cType, collectorA, scale_d, layout-bits
`nvvm.tcgen05.mma.block_scale`	+64 / +72 / +80 / +88	A / A / A / A	cType, collectorA, scale_d, layout, kindA, kindB
`nvvm.tcgen05.mma.sp`	+64 / +72 / +80 / +88	A / A / A / A	same as `tcgen05.mma` plus the metadata operand slot
`nvvm.tcgen05.mma.ws`	—	—	operand-only
`nvvm.tcgen05.mma.ws.sp`	—	—	operand-only
`nvvm.tcgen05.mma.sp.block_scale`	+64 / +72 / +80 / +88	A / A / A / A	merge of `sp` and `block_scale` fields
`nvvm.tcgen05.shift`	—	—	operand-only
`nvvm.tcgen05.commit`	+64	A	cta_group
`nvvm.tcgen05.commit.arrive`	+64	A	cta_group
`nvvm.tcgen05.cp`	+64 / +72 / +80	A / A / A	multicast, shape, src_fmt
`nvvm.tcgen05.alloc`	+64	A	cta_group
`nvvm.tcgen05.dealloc`	+64	A	cta_group
`nvvm.tcgen05.relinquish_alloc_permit`	+64	A	cta_group
`nvvm.tcgen05.wait`	+64	A	wait_kind (`load` or `store`)
`nvvm.tcgen05.fence`	+64	C	fence-kind marker
`nvvm.tcgen05.{ld,st}matrix`	—	—	indexed-operand walker only

tcgen05.mma is the only family where the first 16 Properties bytes aren't idle. The op carries a variable-arity operand list, so the dispatcher reserves +0..+15 for a packed OperandSegmentSizes buffer plus a second 16-byte running-offset buffer at +96..+111.

ldmatrix / stmatrix (Turing+ tensor-core fragment ops, 3 arms)

Op mnemonic	Slot	Pattern	Field
`nvvm.ldmatrix`	+64 / +72 / +80 / +88	A / D / A / A	eltype/size, `num`, shape, trans
`nvvm.stmatrix`	+64 / +72	A / A	shape, trans; `num` is the SSA-vector cardinality, not a property
`nvvm.stmatrix` alternate selector	+64	A	trans encoded as a 0/1 enum

The alternate stmatrix selector disambiguates intrinsic variants from the trans enum alone. It fires when the operand vector matches the narrower selector shape.

wgmma / mma.sync (Hopper sm_90a, 4 arms)

Op mnemonic	Slot	Pattern	Field
`nvvm.wgmma.mma_async`	+64 / +72 / +80 / +88 / +96 / +112 / +120 / +128 / +136	A × 8, D × 1	typeA, b1Op, typeB, shape, typeC, scaleIn, scaleOut, layoutA, layoutB
`nvvm.wgmma.commit_group_sync_aligned`	+64 / +72	A / A	wgmma_type, wgmma_layout
`nvvm.wgmma.wait_group_sync_aligned`	+64 / +72 / +88	A / A / A	type, layout, shape-N selector
`nvvm.mma.sync`	+64 / +72 / +80 / +88 / +96 / +104	A × 6	b1Op, multiplicandAPtxType, layoutA, layoutB, multiplicandBPtxType, intOverflowBehavior
`nvvm.wmma` family	—	—	operand-only; eltype/k/m/n/layout are baked into the resolved intrinsic name at build time

cp.async.bulk / TMA (Hopper+ sm_90a / Blackwell sm_100, 8 arms)

Op mnemonic	Slot	Pattern	Field
`nvvm.cp.async.bulk.tensor.reduce`	+64 / +72 + rank-dependent slot	C / C / A	multicast presence, cache-hint, reduce_kind
`nvvm.cp.async.bulk.tensor.prefetch`	+64 / +72	A / C	im2col-type, multicast
`nvvm.cp.async.bulk.tensor.shared.cta.to.global`	—	—	operand-only
`nvvm.cp.async.bulk.tensor.shared.cta.to.global.ext`	—	—	im2col / cache-hint operands
`nvvm.cp.async.bulk.tensor.shared.cluster.to.global`	—	—	operand-only
`nvvm.cp.async.bulk.tensor.base`	+64 / +72 / +80	C / C / C	has_im2col, has_multicast, has_cache_hint
`nvvm.cp.async.shared.*.global`	+64 / +80 / +72	A / A / C	cp_size, ca/cg cache modifier, L2-hint presence
`nvvm.cp.async.commit_group`	+64	A	ca/cg modifier

atomicrmw / red (sm_60+, 5 arms)

Op mnemonic	Slot	Pattern	Field
`nvvm.atomicrmw`	+64 / +72	A / A	mem_order, atomic_op
`nvvm.red` variant 1	+64 / +72	A / A	red_op, red_type
`nvvm.red` variant 2	+64 / +72	A / A	red_op, red_type
`nvvm.red` variant 3	+64 / +72	A / A	red_op, red_type
`nvvm.atomic.cas` / `nvvm.red.b128` (parser-arm shorthand; neither string appears in the binary — both arms are reached by TypeID dispatch from the `nvvm.cmpxchg` / 128-bit reduction lowerings)	+64 / +72 / +80 / +88	A / A / A / A	four enum slots

prefetch / fence / elect.sync (5 arms)

Op mnemonic	Slot	Pattern	Field
`nvvm.tcgen05.fence` before	+64	C	`before` unit marker
`nvvm.tcgen05.fence` after	+64	C	`after` unit marker
`nvvm.elect.sync`	+64	D	mask
`nvvm.prefetch` / `nvvm.prefetch.tensormap`	+64 / +72 / +80 / +88 / +96	D / C / A / A / A	cache_level, L2 marker, to-tensormap flag, evict-priority, prefetch-mode
`nvvm.cvt.packfloat.f32`	helper-decoded	A	property-decoded and emitted through a helper

Block-scaled MMA (`nvvm.mma.block_scale`, sm_100a, 1 arm)

Op mnemonic	Slot	Pattern	Field
`nvvm.mma.block_scale`	+64 / +72 / +80 / +88 / +96 / +112 / +120 / +128 / +136	A × 9	typeA, b1Op, typeB, shape, typeC, scaleAFmt, scaleBFmt, scale_vec, layoutA

Block-scaled MMA reuses the wgmma.mma_async prologue shape but rewires the slots: +128 swaps layoutA for scale_vec, and +112/+120 swap scaleIn/scaleOut for scaleAFmt/scaleBFmt. The slot index, not the byte offset, is the canonical identifier.
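
The rewiring can be checked mechanically by diffing the two slot maps. The sketch below copies the offset-to-field assignments from the wgmma.mma_async and nvvm.mma.block_scale rows above; the dictionary and function names are illustrative.

```python
# Offset -> field, transcribed from the two table rows above.
WGMMA_MMA_ASYNC = {
    64: "typeA", 72: "b1Op", 80: "typeB", 88: "shape", 96: "typeC",
    112: "scaleIn", 120: "scaleOut", 128: "layoutA", 136: "layoutB",
}
MMA_BLOCK_SCALE = {
    64: "typeA", 72: "b1Op", 80: "typeB", 88: "shape", 96: "typeC",
    112: "scaleAFmt", 120: "scaleBFmt", 128: "scale_vec", 136: "layoutA",
}


def rewired_slots(base: dict, other: dict) -> dict:
    """Return {offset: (base_field, other_field)} where the two maps disagree."""
    return {
        offset: (base[offset], other[offset])
        for offset in sorted(base)
        if offset in other and base[offset] != other[offset]
    }


diff = rewired_slots(WGMMA_MMA_ASYNC, MMA_BLOCK_SCALE)
```

On the table data the diff reports +112, +120, +128, and +136. The last entry reflects layoutA occupying the offset that wgmma.mma_async uses for layoutB.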

The 67-element enum-attr registrar chain

NVVMDialect::initialize installs 68 attribute registrars. Sixty-seven are single-namespace EnumAttr registrars; the sixty-eighth is the NVVMTargetAttr registrar carrying chip, features, link-files, and flags. Every enum registrar has the same shape — assemble an attribute-class definition tuple, add it to the dialect, attach the printer/parser pair to the attribute-name table.

The 67 namespaces cover every enum-typed Properties slot read by the dispatcher. Grouped by family, the chain registers:

Family	Namespaces
cache / memory hints	`cache_eviction_priority`, `load_cache_modifier`, `load_cache_modifier_ext`, `store_cache_modifier`, `l2_prefetch`, `evict_kind`, `prefetch_cache_level`
address spaces and scopes	`state_space`, `shared_space`, `mem_scope`, `mbar_scope`, `mbar_space`
memory ordering and fences	`mem_order`, `proxy_kind`, `action`, `tcgen05_fence`, `tcgen05_wait`
warp-level collectives	`shfl_kind`, `vote_sync_kind`, `match_sync_kind`, `redux_kind`, `barrier_redux_kind`
mbarrier / FP / cvt	`mbar_txn_kind`, `mbar_wait`, `fp_rnd_mode`, `sat_mode`, `rnd`, `sat`, `convert_fp4_type`, `convert_fp6_type`, `convert_fp8_type`, `packfloat_type`
MMA / WMMA / WGMMA	`shape`, `mma_layout`, `mma_type`, `mma_frag`, `mma_b1op`, `mma_int_overflow`, `mma_cta_count`, `sparsity_format`, `load_shape`, `store_shape`, `load_src_format`, `wgmma_scale_in`, `wgmma_scale_out`, `wgmma_type`
block-scaled and tcgen05	`scale_vec_size`, `block_scale_format`, `tcgen05_mma_kind`, `tcgen05_mma_collectorop`, `tcgen05_mma_scale_vec`, `tcgen05_mma_collectorb`, `TmemLayout`, `TCBarParam`, `tcgen05_group`, `tcgen05_cp_shape`, `tcgen05_cp_multicast`, `tcgen05_cp_src_fmt`, `tcgen05_ldst_shape`, `load_mode`
TMA / atomic / reduction	`tma_store_mode`, `tma_redux_kind`, `red_op`, `red_type`, `mul_mode`, `atomic_op`, `dot_accumulate_type`

These namespaces are exactly the enums whose i32 payloads Pattern A pulls from the slots above. The chain only registers parse-side machinery; constant materialization is a later lowering concern. During the NVVM-to-LLVM rewrite, any enum payload that needs to become an SSA constant materializes as an i32 llvm.mlir.constant (%c = llvm.mlir.constant(<payload> : i32) : i32). For inline-asm slots that bypass the intrinsic table, see NVVM Overview — Inline-PTX Templates and Constraint Strings.
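
The registrar chain has a regular enough shape to model as a loop over the namespace list. The sketch below is a toy model, not the dialect's code: `DialectModel`, the placeholder parser and printer, and the `"target"` key used for the NVVMTargetAttr registrar are all hypothetical. Only the namespace names and their family grouping are taken from the text.

```python
ENUM_NAMESPACES = {
    "cache / memory hints": [
        "cache_eviction_priority", "load_cache_modifier", "load_cache_modifier_ext",
        "store_cache_modifier", "l2_prefetch", "evict_kind", "prefetch_cache_level"],
    "address spaces and scopes": [
        "state_space", "shared_space", "mem_scope", "mbar_scope", "mbar_space"],
    "memory ordering and fences": [
        "mem_order", "proxy_kind", "action", "tcgen05_fence", "tcgen05_wait"],
    "warp-level collectives": [
        "shfl_kind", "vote_sync_kind", "match_sync_kind", "redux_kind",
        "barrier_redux_kind"],
    "mbarrier / FP / cvt": [
        "mbar_txn_kind", "mbar_wait", "fp_rnd_mode", "sat_mode", "rnd", "sat",
        "convert_fp4_type", "convert_fp6_type", "convert_fp8_type", "packfloat_type"],
    "MMA / WMMA / WGMMA": [
        "shape", "mma_layout", "mma_type", "mma_frag", "mma_b1op", "mma_int_overflow",
        "mma_cta_count", "sparsity_format", "load_shape", "store_shape",
        "load_src_format", "wgmma_scale_in", "wgmma_scale_out", "wgmma_type"],
    "block-scaled and tcgen05": [
        "scale_vec_size", "block_scale_format", "tcgen05_mma_kind",
        "tcgen05_mma_collectorop", "tcgen05_mma_scale_vec", "tcgen05_mma_collectorb",
        "TmemLayout", "TCBarParam", "tcgen05_group", "tcgen05_cp_shape",
        "tcgen05_cp_multicast", "tcgen05_cp_src_fmt", "tcgen05_ldst_shape", "load_mode"],
    "TMA / atomic / reduction": [
        "tma_store_mode", "tma_redux_kind", "red_op", "red_type", "mul_mode",
        "atomic_op", "dot_accumulate_type"],
}


class DialectModel:
    """Toy attribute-name table: name -> (parser, printer)."""

    def __init__(self):
        self.attr_table = {}

    def register(self, name, parser, printer):
        if name in self.attr_table:
            raise ValueError(f"duplicate registrar for {name!r}")
        self.attr_table[name] = (parser, printer)


def placeholder_parser(text):   # stands in for the real enum keyword parser
    return text


def placeholder_printer(value):  # stands in for the real enum keyword printer
    return str(value)


def initialize(dialect: DialectModel) -> None:
    for names in ENUM_NAMESPACES.values():
        for name in names:
            dialect.register(name, placeholder_parser, placeholder_printer)
    # Hypothetical key for the one non-enum registrar (NVVMTargetAttr).
    dialect.register("target", placeholder_parser, placeholder_printer)


dialect = DialectModel()
initialize(dialect)
```

Rejecting duplicate names in `register` doubles as a check that the grouped list really contains 67 distinct namespaces.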

Reimplementation Notes

A clean implementation drives off a generated slot schema, not a hand-written switch on every op:

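# Drive every op through its generated slot schema.
for op in nvvm_ops:
    props = read_inline_properties(op)   # inline Properties record for this op
    schema = schema_for(op.name)         # per-mnemonic list of (name, slot index, pattern)

    for field in schema.fields:
        slot = props.slots[field.index]          # Attribute* at byte 64 + 8 * index
        value = decode(slot, field.pattern)      # pattern is one of A..E
        emit_lowering_operand_or_intrinsic_selector(field.name, value)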

The invariants are small. Properties are inline for this op set. Slots start at byte 64 and advance by one pointer. Enum attributes decode through padded 32-bit payloads. Optional enum attributes carry a presence bit separate from the value.
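
Those invariants are enough to write a byte-level version of the loop. The following is a sketch under stated assumptions: memory is a flat little-endian buffer, address 0 means null, and only Patterns A and C are modelled because the text gives no byte layout for IntegerAttr or ArrayAttr storage. `FakeMemory`, `make_enum_attr`, `read_field`, and `lower` are hypothetical names, and the enum payload values in the usage lines are made up. The two schemas are transcribed from the tcgen05.cp and cp.async.bulk.tensor.base rows of the slot maps.

```python
import struct

SLOT_BASE, SLOT_STRIDE, PAYLOAD_OFFSET = 64, 8, 8
RECORD_SIZE = SLOT_BASE + SLOT_STRIDE * 13  # room for slots 0..12


class FakeMemory:
    """Flat little-endian address space; address 0 is reserved as null."""

    def __init__(self):
        self.buf = bytearray(16)  # padding so no allocation gets address 0

    def alloc(self, size: int) -> int:
        addr = len(self.buf)
        self.buf.extend(bytes(size))
        return addr

    def write_u32(self, addr: int, value: int) -> None:
        struct.pack_into("<I", self.buf, addr, value)

    def write_u64(self, addr: int, value: int) -> None:
        struct.pack_into("<Q", self.buf, addr, value)

    def read_u32(self, addr: int) -> int:
        return struct.unpack_from("<I", self.buf, addr)[0]

    def read_u64(self, addr: int) -> int:
        return struct.unpack_from("<Q", self.buf, addr)[0]


def make_enum_attr(mem: FakeMemory, value: int) -> int:
    """Allocate a stand-in AttrStorage: 8 unmodelled header bytes, then the i32."""
    addr = mem.alloc(16)
    mem.write_u32(addr + PAYLOAD_OFFSET, value)
    return addr


def read_field(mem: FakeMemory, props: int, index: int, pattern: str):
    slot_ptr = mem.read_u64(props + SLOT_BASE + SLOT_STRIDE * index)
    if pattern == "C":                      # presence test only
        return slot_ptr != 0
    if pattern == "A":                      # mandatory enum payload
        if slot_ptr == 0:
            raise ValueError(f"slot {index} is null but Pattern A requires a value")
        return mem.read_u32(slot_ptr + PAYLOAD_OFFSET)
    raise NotImplementedError(f"pattern {pattern} is not modelled in this sketch")


# (field name, slot index, pattern), taken from the slot maps above.
SCHEMAS = {
    "nvvm.tcgen05.cp": [("multicast", 0, "A"), ("shape", 1, "A"), ("src_fmt", 2, "A")],
    "nvvm.cp.async.bulk.tensor.base": [
        ("has_im2col", 0, "C"), ("has_multicast", 1, "C"), ("has_cache_hint", 2, "C")],
}


def lower(mem: FakeMemory, op_name: str, props: int) -> dict:
    return {name: read_field(mem, props, idx, pat) for name, idx, pat in SCHEMAS[op_name]}


# Usage with made-up enum payloads 1, 2, 0.
mem = FakeMemory()
cp_props = mem.alloc(RECORD_SIZE)
for i, payload in enumerate((1, 2, 0)):
    mem.write_u64(cp_props + SLOT_BASE + SLOT_STRIDE * i, make_enum_attr(mem, payload))
cp_fields = lower(mem, "nvvm.tcgen05.cp", cp_props)
```

Raising on a null Pattern A slot is a choice of this sketch; it makes a schema mismatch fail loudly instead of reading through a null pointer.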

Position in the cross-stage attribute system

The Properties blob is the terminal carrier for the memory-ordering, cache-modifier, and MMA-shape attributes that ride down from the higher dialects. Earlier stages keep these facts in the op-attribute dictionary; by the time the NVVM dispatcher sees the op, the attribute has folded into a positional slot in the blob. Attribute System and Lowering documents that journey across the full pipeline — which carrier each fact lives in at each stage, which transitions are intentionally lossy, and which silent drops are wrong-output bugs that ptxas will not catch.
