Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

NVVM TMA Ops

Abstract

nvvm.cp.async.bulk.* covers the Hopper Tensor Memory Accelerator (TMA) surface: tile loads from global to shared, tile stores from shared to global, prefetches, reductions, group commit / wait, and the descriptor-fence helper that pairs with an in-SMEM CUtensorMap. Rather than enumerating one op per (mode, direction, cache-hint, multicast, rank) combination, this dialect carries a small set of canonical mnemonics; mode (tile / im2col / im2col_w / im2col_w_128), rank, cache-hint presence, and multicast presence are all encoded as attributes the printer reads at PTX emit time. See TMA Tensormap and cp.async.bulk Codegen for the per-template emission catalog and Lowering: nvgpu / gpu to NVVM — TMA Async Load for the operand-slot mapping.

TMA descriptors live in global memory as 128-byte CUtensorMap structs encoded by the CUDA driver. The device-side ops in this family consume the descriptor as an opaque global pointer, with cache-hint and multicast attributes wiring into optional intrinsic operand slots.

Op Roster

Sub-familyCountMnemonic stem
Tile load (global → shared via cluster)1 op, rank/mode in attributesnvvm.cp.async.bulk.tensor.shared.cluster.global
Tile load (cta-direct)1 op, rank/mode in attributesnvvm.cp.async.bulk.tensor.shared.cta.global
Tile store (shared → global)2 (base + ext)nvvm.cp.async.bulk.tensor.global.shared.cta, …ext
Reduce1 op, redop in attributenvvm.cp.async.bulk.tensor.reduce
Prefetch1 opnvvm.cp.async.bulk.tensor.prefetch
Group control2nvvm.cp.async.bulk.commit.group, nvvm.cp.async.bulk.wait_group
Descriptor copy / fence1nvvm.tensormap.cp_fenceproxy

Each canonical mnemonic above parameterises rank, mode, cache-hint presence, and multicast presence through op attributes. The NVVM-to-LLVM printer expands a single dialect op into one of the family of llvm.nvvm.cp.async.bulk.tensor.{1..5}d.<dir>.<mode> intrinsics at lowering time; the IR layer stays compact.

Operand Tables

nvvm.cp.async.bulk.tensor.shared.cluster.global (rank in rank attribute, mode = #tile)

PositionNameTypeNotes
operand 0dstAddrptr addrspace(3)SMEM destination tile origin
operand 1tensorMapptr (global, opaque)128-byte CUtensorMap pointer
operands 2..N+1coordsvariadic i32, rank Ntile origin in tensor space
operand N+2barrierptr addrspace(3)mbarrier slot for expect-tx completion
operand N+3multicastMaskoptional i16cluster multicast bitmap (positional slot in the intrinsic call)
operand N+4cacheHintoptional i64L2 cache hint (positional slot in the intrinsic call)
attributecacheHintEnableUnitAttrgates the .L2::cache_hint modifier
attributemulticastEnableUnitAttrgates the .multicast modifier
attributemodeenum tma_load_modetile / im2col / im2col_w / im2col_w_128

The two UnitAttrs gate the corresponding optional operand. When cacheHintEnable is absent the cacheHint operand position is left empty in the LLVM intrinsic call; when present the operand must be supplied. The same pattern applies to multicastEnable and multicastMask. See Lowering: TMA Async Load — Operand Mapping (rank N) for the operand-slot mapping the nvgpu-to-nvvm rewriter performs.

nvvm.cp.async.bulk.tensor.{N}d.global.shared.cta.tile

PositionNameTypeNotes
operand 0tensorMapptr (global, opaque)128-byte CUtensorMap
operands 1..Ncoordsvariadic i32, rank Ntile origin
operand N+1srcAddrptr addrspace(3)SMEM source tile
operand N+2cacheHintoptional i64L2 cache hint
attributecacheHintEnableUnitAttrgates the .L2::cache_hint modifier

No barrier — the producer issues the store and continues; the consumer side observes completion via cp.async.bulk.wait.group.

nvvm.cp.async.bulk.tensor.{N}d.global.shared.cta.tile.reduce

PositionNameTypeNotes
operand 0tensorMapptr (global)CUtensorMap
operands 1..Ncoordsvariadic i32, rank Ntile origin
operand N+1srcAddrptr addrspace(3)SMEM source
operand N+2cacheHintoptional i64L2 cache hint
attributeredopenum tma_redux_kindadd / min / max / inc / dec / and / or / xor

nvvm.cp.async.bulk.tensor.{N}d.tile.prefetch

PositionNameTypeNotes
operand 0tensorMapptr (global)CUtensorMap
operands 1..Ncoordsvariadic i32, rank Ntile origin
attributemodeenum load_modetile / im2col (matches the load form)
attributecacheHintEnableUnitAttr (optional)gates a cache-hint operand

nvvm.cp.async.bulk.commit.group / nvvm.cp.async.bulk.wait.group

PositionNameTypeNotes
(wait.group) operand 0groupDepthi32number of older groups to keep in flight
(commit.group)no operands

nvvm.tensormap.cp.async.shared

PositionNameTypeNotes
operand 0dstptr addrspace(3)SMEM destination descriptor mailbox
operand 1srcptr (global)source descriptor

nvvm.tensormap.replace.tile.global_address (and .box_dim, .element_stride, .box_corner, .elem_type, .swizzle, .fill)

PositionNameTypeNotes
operand 0descriptorptr addrspace(3)SMEM-resident descriptor being mutated
operand 1valuei64 / i32 / enumreplacement value for the named field
attributeordi32rank index for box_dim / element_stride

LLVM Intrinsic Mapping

OpLLVM intrinsic
nvvm.cp.async.bulk.tensor.shared.cluster.globalllvm.nvvm.cp.async.bulk.tensor.{1..5}d.shared.cluster.global.{tile,im2col,im2col_w,im2col_w_128} (rank/mode in attrs)
nvvm.cp.async.bulk.tensor.shared.cta.globalllvm.nvvm.cp.async.bulk.tensor.{1..5}d.shared.cta.global.tile
nvvm.cp.async.bulk.tensor.global.shared.ctallvm.nvvm.cp.async.bulk.tensor.{1..5}d.global.shared.cta.tile
nvvm.cp.async.bulk.tensor.reducellvm.nvvm.cp.async.bulk.tensor.{1..5}d.global.shared.cta.tile.reduce.{redop}
nvvm.cp.async.bulk.tensor.prefetchllvm.nvvm.cp.async.bulk.tensor.{1..5}d.tile.prefetch
nvvm.cp.async.bulk.commit.groupllvm.nvvm.cp.async.bulk.commit.group
nvvm.cp.async.bulk.wait_groupllvm.nvvm.cp.async.bulk.wait_group
nvvm.tensormap.cp_fenceproxyllvm.nvvm.cp.async.bulk.tensor.shared.cluster.tensormap.cta paired with llvm.nvvm.fence.proxy.tensormap.generic.release.cta

The reduction intrinsic concatenates the redop name into the intrinsic ID; eight distinct intrinsics exist per rank.

PTX Templates

cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes
    [%dst], [%tmap, {%c0, %c1, ..., %c{N-1}}], [%mbar];

cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.multicast::cluster
    [%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar], %multicastMask;

cp.async.bulk.tensor.{N}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes.L2::cache_hint
    [%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar], %hint;

cp.async.bulk.tensor.{N}d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes
    [%dst], [%tmap, {%c0, ..., %c{N-1}}], [%mbar];

cp.async.bulk.tensor.{N}d.global.shared::cta.tile
    [%tmap, {%c0, ..., %c{N-1}}], [%src];

cp.async.bulk.tensor.{N}d.global.shared::cta.tile.{redop}.bulk_group
    [%tmap, {%c0, ..., %c{N-1}}], [%src];

cp.async.bulk.prefetch.tensor.{N}d.global.tile
    [%tmap, {%c0, ..., %c{N-1}}];

cp.async.bulk.commit_group;
cp.async.bulk.wait_group N;

tensormap.cp.async.shared::cta.bulk_group [%dst], [%src];
tensormap.replace.tile.global_address [%tmap], %addr;
tensormap.replace.tile.box_dim.k       [%tmap], %dim;
tensormap.replace.tile.element_stride.k [%tmap], %stride;
tensormap.replace.tile.elemtype        [%tmap], %elt_id;
tensormap.replace.tile.swizzle         [%tmap], %mode;

The multicast and L2::cache_hint suffix variants are picked per template by the presence flags. The reductions all flow through a single dialect op (nvvm.cp.async.bulk.tensor.reduce — only this mnemonic is interned) whose redop enum attribute selects between {add, min, max, inc, dec, and, or, xor}; the LLVM-intrinsic name baked at lowering time enumerates eight distinct intrinsics, one per PTX modifier.

Inline-PTX Variants

A few TMA paths reach PTX through llvm.inline_asm because no LLVM intrinsic exists at the snapshot revision Tileiras tracks. The most common is the im2col cache-hint store variant:

asm template: "cp.async.bulk.tensor.{N}d.global.shared::cta.im2col.bulk_group.L2::cache_hint
                  [%tmap, {%c0, ..., %c{N-1}}], [%src], %hint;"
constraints : "l,l,r,r,r,...,l"

l is the 64-bit descriptor pointer and the cache-hint operand; r is each 32-bit coordinate; the source SMEM pointer is also l (an opaque pointer). Tileiras retains the upstream constraint string verbatim; reimplementers must not rearrange operand order, because the NVPTX backend matches positional registers against the template's % slots.

Per-Arch Availability

Op familySM floorptx_min
Tile load / store / reducesm_908.0
Im2col formssm_908.0
Multicast / cluster formssm_90 (sm_90a for cluster mode)8.0
Prefetchsm_908.0
Group commit / waitsm_908.0
tensormap.cp.async.sharedsm_908.3
tensormap.replace.*sm_908.3

Blackwell extends the cache-hint and OOB-fill modes but keeps the same op surface and the same intrinsic shape; verification accepts sm_100+ for every op in the family. See TMA Descriptor Shape for the CUtensorMap layout and cp.async.bulk Template Catalog for the per-rank PTX templates.

Verifier Invariants

  • tensorMap is a global-memory pointer; the descriptor itself is opaque.
  • dstAddr (loads) is in addr-space 3; srcAddr (stores) is in addr-space 3.
  • Coordinate operand count equals the rank in the op mnemonic.
  • multicastEnable and cacheHintEnable agree with the operand list: presence of the attribute requires the operand to be supplied.
  • For reductions, redop is one of the eight legal values.
  • For im2col forms, the rank is 3, 4, or 5; lower ranks are rejected.