NVVM TMA Ops
Abstract
nvvm.cp.async.bulk.* covers the Hopper Tensor Memory Accelerator (TMA) surface: tile loads from global to shared, tile stores from shared to global, prefetches, reductions, group commit / wait, and the descriptor-fence helper that pairs with an in-SMEM CUtensorMap. Rather than enumerating one op per (mode, direction, cache-hint, multicast, rank) combination, this dialect carries a small set of canonical mnemonics; mode (tile / im2col / im2col_w / im2col_w_128), rank, cache-hint presence, and multicast presence are all encoded as attributes the printer reads at PTX emit time. See TMA Tensormap and cp.async.bulk Codegen for the per-template emission catalog and Lowering: nvgpu / gpu to NVVM — TMA Async Load for the operand-slot mapping.
TMA descriptors live in global memory as 128-byte CUtensorMap structs encoded by the CUDA driver. The device-side ops in this family consume the descriptor as an opaque global pointer, with cache-hint and multicast attributes wiring into optional intrinsic operand slots.
Op Roster
| Sub-family | Count | Mnemonic stem |
|---|---|---|
| Tile load (global → shared via cluster) | 1 op, rank/mode in attributes | nvvm.cp.async.bulk.tensor.shared.cluster.global |
| Tile load (cta-direct) | 1 op, rank/mode in attributes | nvvm.cp.async.bulk.tensor.shared.cta.global |
| Tile store (shared → global) | 2 (base + ext) | nvvm.cp.async.bulk.tensor.global.shared.cta, …ext |
| Reduce | 1 op, redop in attribute | nvvm.cp.async.bulk.tensor.reduce |
| Prefetch | 1 op | nvvm.cp.async.bulk.tensor.prefetch |
| Group control | 2 | nvvm.cp.async.bulk.commit.group, nvvm.cp.async.bulk.wait_group |
| Descriptor copy / fence | 1 | nvvm.tensormap.cp_fenceproxy |
Each canonical mnemonic above parameterises rank, mode, cache-hint presence, and multicast presence through op attributes. The NVVM-to-LLVM printer expands a single dialect op into one of the family of llvm.nvvm.cp.async.bulk.tensor.{1..5}d.<dir>.<mode> intrinsics at lowering time; the IR layer stays compact.
Operand Tables
nvvm.cp.async.bulk.tensor.shared.cluster.global (rank in rank attribute, mode = #tile)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dstAddr | ptr addrspace(3) | SMEM destination tile origin |
| operand 1 | tensorMap | ptr (global, opaque) | 128-byte CUtensorMap pointer |
| operands 2..N+1 | coords | variadic i32, rank N | tile origin in tensor space |
| operand N+2 | barrier | ptr addrspace(3) | mbarrier slot for expect-tx completion |
| operand N+3 | multicastMask | optional i16 | cluster multicast bitmap (positional slot in the intrinsic call) |
| operand N+4 | cacheHint | optional i64 | L2 cache hint (positional slot in the intrinsic call) |
| attribute | cacheHintEnable | UnitAttr | gates the .L2::cache_hint modifier |
| attribute | multicastEnable | UnitAttr | gates the .multicast modifier |
| attribute | mode | enum tma_load_mode | tile / im2col / im2col_w / im2col_w_128 |
The two UnitAttrs gate the corresponding optional operand. When cacheHintEnable is absent the cacheHint operand position is left empty in the LLVM intrinsic call; when present the operand must be supplied. The same pattern applies to multicastEnable and multicastMask. See Lowering: TMA Async Load — Operand Mapping (rank N) for the operand-slot mapping the nvgpu-to-nvvm rewriter performs.
nvvm.cp.async.bulk.tensor.{N}d.global.shared.cta.tile
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMap | ptr (global, opaque) | 128-byte CUtensorMap |
| operands 1..N | coords | variadic i32, rank N | tile origin |
| operand N+1 | srcAddr | ptr addrspace(3) | SMEM source tile |
| operand N+2 | cacheHint | optional i64 | L2 cache hint |
| attribute | cacheHintEnable | UnitAttr | gates the .L2::cache_hint modifier |
No barrier — the producer issues the store and continues; the consumer side observes completion via cp.async.bulk.wait.group.
nvvm.cp.async.bulk.tensor.{N}d.global.shared.cta.tile.reduce
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMap | ptr (global) | CUtensorMap |
| operands 1..N | coords | variadic i32, rank N | tile origin |
| operand N+1 | srcAddr | ptr addrspace(3) | SMEM source |
| operand N+2 | cacheHint | optional i64 | L2 cache hint |
| attribute | redop | enum tma_redux_kind | add / min / max / inc / dec / and / or / xor |
nvvm.cp.async.bulk.tensor.{N}d.tile.prefetch
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | tensorMap | ptr (global) | CUtensorMap |
| operands 1..N | coords | variadic i32, rank N | tile origin |
| attribute | mode | enum load_mode | tile / im2col (matches the load form) |
| attribute | cacheHintEnable | UnitAttr (optional) | gates a cache-hint operand |
nvvm.cp.async.bulk.commit.group / nvvm.cp.async.bulk.wait.group
| Position | Name | Type | Notes |
|---|---|---|---|
(wait.group) operand 0 | groupDepth | i32 | number of older groups to keep in flight |
(commit.group) | — | — | no operands |
nvvm.tensormap.cp.async.shared
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | dst | ptr addrspace(3) | SMEM destination descriptor mailbox |
| operand 1 | src | ptr (global) | source descriptor |
nvvm.tensormap.replace.tile.global_address (and .box_dim, .element_stride, .box_corner, .elem_type, .swizzle, .fill)
| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | descriptor | ptr addrspace(3) | SMEM-resident descriptor being mutated |
| operand 1 | value | i64 / i32 / enum | replacement value for the named field |
| attribute | ord | i32 | rank index for box_dim / element_stride |
LLVM Intrinsic Mapping
| Op | LLVM intrinsic |
|---|---|
nvvm.cp.async.bulk.tensor.shared.cluster.global | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.shared.cluster.global.{tile,im2col,im2col_w,im2col_w_128} (rank/mode in attrs) |
nvvm.cp.async.bulk.tensor.shared.cta.global | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.shared.cta.global.tile |
nvvm.cp.async.bulk.tensor.global.shared.cta | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.global.shared.cta.tile |
nvvm.cp.async.bulk.tensor.reduce | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.global.shared.cta.tile.reduce.{redop} |
nvvm.cp.async.bulk.tensor.prefetch | llvm.nvvm.cp.async.bulk.tensor.{1..5}d.tile.prefetch |
nvvm.cp.async.bulk.commit.group | llvm.nvvm.cp.async.bulk.commit.group |
nvvm.cp.async.bulk.wait_group | llvm.nvvm.cp.async.bulk.wait_group |
nvvm.tensormap.cp_fenceproxy | llvm.nvvm.cp.async.bulk.tensor.shared.cluster.tensormap.cta paired with llvm.nvvm.fence.proxy.tensormap.generic.release.cta |
The reduction intrinsic concatenates the redop name into the intrinsic ID; eight distinct intrinsics exist per rank.
PTX Templates
cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes
[%dst], [%tmap, {%c0, %c1, ..., %c{N-1}}], [%mbar];
cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.multicast::cluster
[%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar], %multicastMask;
cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.L2::cache_hint
[%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar], %hint;
cp.async.bulk.tensor.{N}d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes
[%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar];
cp.async.bulk.tensor.{N}d.global.shared::cta.tile
[%tmap, {%c0, ..., %c{N-1}}], [%src];
cp.async.bulk.tensor.{N}d.global.shared::cta.tile.{redop}.bulk_group
[%tmap, {%c0, ..., %c{N-1}}], [%src];
cp.async.bulk.prefetch.tensor.{N}d.global.tile
[%tmap, {%c0, ..., %c{N-1}}];
cp.async.bulk.commit_group;
cp.async.bulk.wait_group N;
tensormap.cp.async.shared::cta.bulk_group [%dst], [%src];
tensormap.replace.tile.global_address [%tmap], %addr;
tensormap.replace.tile.box_dim.k [%tmap], %dim;
tensormap.replace.tile.element_stride.k [%tmap], %stride;
tensormap.replace.tile.elemtype [%tmap], %elt_id;
tensormap.replace.tile.swizzle [%tmap], %mode;
The multicast and L2::cache_hint suffix variants are picked per template by the presence flags. The reductions all flow through a single dialect op (nvvm.cp.async.bulk.tensor.reduce — only this mnemonic is interned) whose redop enum attribute selects between {add, min, max, inc, dec, and, or, xor}; the LLVM-intrinsic name baked at lowering time enumerates eight distinct intrinsics, one per PTX modifier.
Inline-PTX Variants
A few TMA paths reach PTX through llvm.inline_asm because no LLVM intrinsic exists at the snapshot revision Tileiras tracks. The most common is the im2col cache-hint store variant:
asm template: "cp.async.bulk.tensor.{N}d.global.shared::cta.im2col.bulk_group.L2::cache_hint
[%tmap, {%c0, ..., %c{N-1}}], [%src], %hint;"
constraints : "l,l,r,r,r,...,l"
l is the 64-bit descriptor pointer and the cache-hint operand; r is each 32-bit coordinate; the source SMEM pointer is also l (an opaque pointer). Tileiras retains the upstream constraint string verbatim; reimplementers must not rearrange operand order, because the NVPTX backend matches positional registers against the template's % slots.
Per-Arch Availability
| Op family | SM floor | ptx_min |
|---|---|---|
| Tile load / store / reduce | sm_90 | 8.0 |
| Im2col forms | sm_90 | 8.0 |
| Multicast / cluster forms | sm_90 (sm_90a for cluster mode) | 8.0 |
| Prefetch | sm_90 | 8.0 |
| Group commit / wait | sm_90 | 8.0 |
tensormap.cp.async.shared | sm_90 | 8.3 |
tensormap.replace.* | sm_90 | 8.3 |
Blackwell extends the cache-hint and OOB-fill modes but keeps the same op surface and the same intrinsic shape; verification accepts sm_100+ for every op in the family. See TMA Descriptor Shape for the CUtensorMap layout and cp.async.bulk Template Catalog for the per-rank PTX templates.
Verifier Invariants
tensorMapis a global-memory pointer; the descriptor itself is opaque.dstAddr(loads) is in addr-space 3;srcAddr(stores) is in addr-space 3.- Coordinate operand count equals the rank in the op mnemonic.
multicastEnableandcacheHintEnableagree with the operand list: presence of the attribute requires the operand to be supplied.- For reductions,
redopis one of the eight legal values. - For im2col forms, the rank is 3, 4, or 5; lower ranks are rejected.