TMA Load Walkthrough
Abstract
A single TMA load — the asynchronous bulk-tensor copy that moves a tile from global memory into shared memory on sm_90a and later — touches every layer of the tileiras cascade. It begins life as a tile-shaped cuda_tile.load_view_tko, picks up an alias-aware token in nv_tileaa, acquires a TMA descriptor handle and an mbarrier slot in nv_tileas, expands into an nvvm.cp.async.bulk.tensor.shared.cluster.global.2d intrinsic in LLVM, becomes a CP_ASYNC_BULK_TENSOR_2D_* machine instruction in NVPTX MIR, and surfaces as cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes in PTX text. The transaction-byte count — the number the consumer's mbarrier.try_wait.parity checks against — flows through each layer under a different name and at a different level of abstraction.
This page traces one load end-to-end. The kernel-wide walkthrough in DSL to PTX End-to-End shows the same kernel at every stage with all operations in place; this page narrows the focus to a single operation so the descriptor lifecycle, the mbarrier handshake, and the transaction-byte accounting are visible without GEMM scaffolding. Cross-reference targets remain the per-stage canonical pages: cuda_tile to nv_tileaa, nv_tileaa to nv_tileas, nv_tileas to LLVM, TileAS TMA and Memops Family, mbarrier State Machine, WGMMA Emission Protocol, and TMA, Tensormap, and cp.async.bulk.
Confidence: HIGH for IR shapes, mnemonic spellings, and the transaction-byte arithmetic; MED for the SSA value naming used in the worked example (the binary-derived examples in the source pages use slightly different temp names).
The Operation
The walkthrough operation is one TMA bulk-tensor load of a 128-row × 128-column BF16 tile from a global tensor into shared memory, on sm_90a, with one mbarrier slot acting as the completion barrier. The element type is bf16 (2 bytes), so the tile carries 128 × 128 × 2 = 32 768 bytes — that integer is the transaction-byte count every layer eventually publishes against the barrier. The CTA hosts a single warp group (128 threads) doing the load on behalf of a consumer that will read the shared-memory tile.
The frontend constructed:
a_view = make_partition_view(A, [M, K], tile=(128, 128), dim_map=[0, 1])
a_tile = load(a_view, (block_m, block_k)) # tile<128x128xbf16>
A is a global-memory bf16 tensor of shape [M, K] with row-major strides [K, 1]. block_m and block_k are CTA-supplied tile coordinates. The load completes asynchronously — the consumer side waits on an mbarrier before reading a_tile, but that wait is a separate operation. This page traces the load itself.
The transaction-byte arithmetic that runs through every stage:
tile_bytes = tile_rows * tile_cols * sizeof(bf16)
= 128 * 128 * 2
= 32768 bytes
That single integer is the value nv_tileas.async.tiled_tma_load carries as its tx_count attribute, nvvm.mbarrier.arrive.expect_tx publishes against the barrier, and the consumer side sees as expected_txn in the mbarrier state machine documented in mbarrier State Machine — State Machine.
Stage 1: cuda_tile IR
The first IR the compiler sees comes out of the frontend's bytecode. The load is a cuda_tile.load_view_tko — token-ordered tile load from a partition view — and the verifier contract on the operation is the standard cuda_tile contract: power-of-two tile dimensions, a 16-million-element ceiling per tile, a token operand for ordering, and an explicit tile-typed SSA result.
%a_view = cuda_tile.tensor_view %A, shape = [%M, %K], stride = [%K, 1]
: !cuda_tile.tensor_view<128x128xbf16>
%a_part = cuda_tile.partition_view %a_view, tile = [128, 128], dim_map = [0, 1]
: !cuda_tile.partition_view<128x128xbf16>
%tok0 = cuda_tile.make_token : !cuda_tile.token
%a_tile, %tok_a = cuda_tile.load_view_tko %a_part, [%bm, %bk], %tok0
: !cuda_tile.tile<128x128xbf16>, !cuda_tile.token
There is no descriptor, no mbarrier, no transaction-byte count, and no TMA mention. cuda_tile is the public surface and deliberately stays target-agnostic: a partition view plus tile coordinates plus an ordering token is all the frontend has to publish. The _tko suffix denotes the token-ordered shape — the load consumes an input token and produces an output token, so subsequent loads and stores can be scheduled against the ordering edge rather than against explicit fences. The optional allow_tma attribute (defaulting to true on sm_90a and later) is what the next pass reads to decide whether the load becomes a TMA copy or falls back to plain ldg. See cuda_tile to nv_tileaa for the conversion target this op is illegal against.
⚡ QUIRK —
cuda_tilecarries no descriptor, no mbarrier, no tx-count The public dialect has no syntax for a TMA descriptor, no syntax for an mbarrier slot, and no transaction-byte attribute. Every TMA-specific noun first appears innv_tileaa(the copy-atom witness) ornv_tileas(the descriptor handle, the mbarrier slot, thetx_count). A reimplementer who tries to express any of those on the public surface has misread the contract —cuda_tileis a tile-algebra dialect, not a TMA-shaped dialect. The promotion to TMA is a downstream decision driven by the copy-atom registry, not a frontend gesture.
Stage 2: nv_tileaa IR
ConvertCudaTileToTileAA rewrites the load through the three-populator structure documented in cuda_tile to nv_tileaa. Part B of that structure owns the memory and view families, so the rewrite for this load lives in the Part-B tiled_load pattern. The partition view dissolves into an nv_tileaa.make_memref plus an nv_tileaa.addptr, the tile type becomes an MLIR tensor<128x128xbf16>, and — the key change for this walkthrough — the load picks up a CopyAtom witness. The witness is an attribute that names the hardware copy primitive selected by the layout-assignment pre-pass; for an sm_90a load that meets the TMA eligibility rules (the box-dim invariants documented in TileAS TMA and Memops Family — Descriptor Builders and Verifiers) the witness is sm90_tma_load_2d_bf16.
%a_ref = nv_tileaa.make_memref %A, shape = [%M, %K], stride = [%K, 1],
space = #nv_tileaa.global
: !nv_tileaa.memref<?x?xbf16>
%off_a = nv_tileaa.addptr %a_ref, [%bm, %bk]
: !nv_tileaa.memref<?x?xbf16>
%a_tile, %tok_a = nv_tileaa.tiled_load %off_a, %tok0
{ atom = #cute.copy_atom<sm90_tma_load_2d_bf16>,
in_bounds = array<i1: true, true>,
mem_semantic = #nv_tileaa<mem_semantic relaxed>,
mem_scope = #nv_tileaa<mem_scope cluster> }
: !nv_tileaa.memref<?x?xbf16> -> tensor<128x128xbf16>,
!nv_tileaa.mem_token
The CopyAtom witness is the single most consequential change. nv_tileaa.tiled_load does not commit to a particular hardware primitive — that decision rides on the attribute. A different atom in the same slot (sm80_cp_async_4_bf16, sm70_ldg_128_bf16, sm90_tma_load_2d_bf16_mcast) would steer the next stage's rewrite into a different lowering path. Layout assignment runs before this pass and is what consults the copy-atom registry documented in SM-Tier Roster and Copy Atom Registry; after this pass the witness travels verbatim down to the LLVM lowering.
Token ordering also takes its alias-aware form. The output token %tok_a is now an !nv_tileaa.mem_token, threaded into the next memory op so the scheduler can reason about read-after-write and write-after-read edges without inserting fences. The mem_semantic = relaxed and mem_scope = cluster operands together declare that the load is observably ordered against other cluster-scope traffic but does not impose an acquire fence.
The transaction-byte count is still implicit. The element type (bf16) and the tile shape (128 × 128) determine it, but no attribute yet records the integer.
CopyAtom Witness Selection
The witness attached to the tiled_load is not free-form text. The dialect interface CopyAtomAttrInterface constrains every valid witness to one of the entries in the SM-tier copy atom registry, and the layout-assignment pre-pass picks among them by reading three pieces of information from the operand context: the compute capability (the --compute-capability driver option, parsed once and threaded onto the function as nv_tileaa.compute_capability), the source-memref's address space (global versus shared), and the destination tile's shape and element type.
For this walkthrough's load, layout assignment sees sm_90a, address-space-1 (global) source, and a 128 × 128 × bf16 destination. The matching atom in the SM-tier registry is sm90_tma_load_2d_bf16 with swizzle_128B — the swizzle mode is selected so that consecutive K-axis vectors in the destination shared-memory tile fall in different SMEM banks, avoiding bank conflicts when the consumer's WGMMA reads through 128-bit ldmatrix.sync.aligned-style fragments. On Ada Lovelace (sm_89) the same source cuda_tile.load_view_tko would resolve to sm80_cp_async_4_bf16 because TMA hardware is not available; on Ampere (sm_80) the same atom; on Volta or earlier, a plain sm70_ldg_128_bf16.
The witness is also the gate LowerTMALoadStoreToAsync reads in phase 2 of its eight-phase walk. Atoms that do not implement TmaAtomTypeInterface (the plain ldg, stg, ldgsts family) are skipped in phase 2 without rewrite, so the TMA expansion documented in Stage 3 of this page never runs for them. The atom-to-interface dispatch is the central decision that pins the rest of the lowering — picking a non-TMA atom in the witness slot keeps the load as a synchronous tiled_load all the way down to the LLVM stage.
Stage 3: nv_tileas IR
ConvertTileAAToTileAS keeps the same operand shape but renames the op and updates dialect namespaces. The TileAS rewrite documented in TileAA to TileAS — tiled_load Witness Hand-Off preserves the CopyAtom witness verbatim, swaps the mnemonic prefix from nv_tileaa to nv_tileas, and leaves the rest of the operand vector unchanged.
%a_tile, %tok_a = nv_tileas.tiled_load %off_a, %tok0
{ atom = #cute.copy_atom<sm90_tma_load_2d_bf16>,
in_bounds = array<i1: true, true>,
mem_semantic = #nv_tileas<mem_semantic relaxed>,
mem_scope = #nv_tileas<mem_scope cluster> }
: !nv_tileaa.memref<?x?xbf16> -> tensor<128x128xbf16>,
!nv_tileaa.mem_token
Then the TileAS TMA and Memops Family pipeline runs, with LowerTMALoadStoreToAsync doing the heavy work. The eight-phase walk (KernelSpec gate, TMA-eligibility scan, tmaIdx assignment, descriptor bind, async op materialization, mbarrier emission, wait sinking, diagnostic finalization) is documented in TileAS TMA and Memops Family — LowerTMALoadStoreToAsync. The output is the four-op sequence the downstream lowering expects: descriptor build, async TMA op, mbarrier expect-tx, mbarrier wait.
// ---- descriptor materialized by phase 4
%desc_a = nv_tileas.make_tiled_tma_desc %a_ref, box = [128, 128],
atom = #cute_nvgpu.atom_copy_field_tmaload<load_2d_bf16, swizzle_128B>,
tmaIdx = 0 : i32
: !nv_tileas.tma_desc<128x128xbf16>
// ---- mbarrier reserved by phase 6 (one slot, count=1, expecting one arrival)
%mbar_a = nv_tileas.alloc_mbarrier { count = 1 : i32 }
: !nv_tileas.mbarrier
// ---- async TMA load: phase 5 emission
%tok_a = nv_tileas.async.tiled_tma_load
%desc_a, %a_smem[%bm, %bk], %mbar_a
{ atom = #cute_nvgpu.atom_copy_field_tmaload<load_2d_bf16, swizzle_128B>,
tx_count = 32768 : i32,
tmaIdx = 0 : i32 }
: !nv_tileas.tma_desc<128x128xbf16>,
!nv_tileas.smem<128x128xbf16>,
index, index,
!nv_tileas.mbarrier
-> !nv_tileas.async
Three new entities appear at this stage. First, the TMA descriptor is a first-class SSA value — %desc_a is the result of make_tiled_tma_desc, which captures tensor shape, stride, padding mode, descriptor mode (tiled), element type, and the cute_nvgpu.tma_atom witness with its swizzle. The descriptor's tmaIdx attribute is the per-function counter assigned in phase 3 of LowerTMALoadStoreToAsync; downstream the AttachTMADescriptorArgs pass (documented in TileAS TMA and Memops Family — Descriptor ABI) reads it to wire host descriptor preparation back to device descriptor consumption. Second, the mbarrier slot is an explicit !nv_tileas.mbarrier SSA value with an arrive_count of 1 — one producer agent will publish completion. Third, the transaction-byte count is a concrete tx_count = 32768 : i32 attribute on the async load. That integer is the byte count the consumer's mbarrier.try_wait.parity will check against.
⚡ QUIRK — transaction-byte count is per-atom, not per-mbarrier A single mbarrier slot can receive transaction-byte updates from multiple TMA loads (one per operand in a multi-input WGMMA, for instance). The
tx_countattribute stamped on eachnv_tileas.async.tiled_tma_loadis the byte count that load will contribute; the consumer'sexpected_txnfield on the mbarrier is the sum across all loads that arrive on the same barrier. A reimplementation that publishes a per-mbarrier total at descriptor-build time and ignores per-atom byte counts produces a barrier whoseexpected_txnnever matches the actual transaction total, andtry_wait.parityeither fires early (if the published total is too small) or hangs forever (if too large). The per-atom accounting is what makes the multi-operand WGMMA producer/consumer handshake work.
⚡ QUIRK — CopyAtom witness vs concrete PTX form The CopyAtom witness
sm90_tma_load_2d_bf16does not name the PTX mnemonic the load eventually becomes. It names a family: the basic 2D tile load iscp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes, but the multicast variant of the same atom prints…multicast::cluster, the L2-cache-hint variant prints…L2::cache_hint, and the im2col atom variant prints…im2col. The witness names the legal shape; per-load options the rewriter discovers (an attached multicast mask, an attached cache hint, an im2col mode flag) select among the printable variants. A naive name-to-mnemonic mapping in a reimplementation skips the variant gates the codegen page documents in TMA, Tensormap, and cp.async.bulk.
Descriptor Lifecycle and the Host/Device Split
The nv_tileas.make_tiled_tma_desc op materialized above is one of two possible descriptor origins. AttachTMADescriptorArgs and SeparateHostTMA (documented in TileAS TMA and Memops Family — Descriptor ABI) split the descriptor population between host and device. When the descriptor depends only on values the runtime can supply through the launch ABI (the kernel's global pointer arguments, the tensor shape and stride parameters, the box-dim attribute that came from the partition_view), the pass hoists descriptor construction into a host-side companion module that builds a 64-byte CUDA tensormap once per launch. When the descriptor depends on values only the device knows (a runtime-computed shape, a divergent stride), it stays on the device and the kernel emits a cp.async.bulk.tensor.encode sequence inline before the load.
For this walkthrough, the descriptor depends only on the kernel's %A, %M, and %K arguments, so the host-side path wins. The kernel ABI grows a .param slot holding a pointer to the descriptor, the slot carries the cute_nvgpu.grid_constant argument attribute (later lifted to nvvm.grid_constant), and the device-side make_tiled_tma_desc op survives only as a marker the next stage's pattern looks up by tmaIdx = 0. The runtime's per-launch descriptor-emit callback runs once per launch and writes 64 bytes per descriptor into a scratch buffer the kernel reads through its .param slot.
The descriptor's 64-byte payload encodes the eleven fields the TileAS TMA and Memops Family — Tensormap Mutators page documents: global base address, per-axis dimension sizes, per-axis strides, element type, rank, format (tiled / im2col / im2col_at / tiled_at), box shape, swizzle mode, fill mode, OOB fill value, and interleave layout. Eight of the eleven are immutable on device (set at construction, never replaced); only the global base pointer, per-axis sizes, and non-leading strides can be replaced via tensormap.replace.{global_address,dim_size,stride_size} if the kernel needs to vary them across iterations.
Stage 4: NVVM Intrinsic in LLVM IR
ConvertTileASToLLVM is the terminal MLIR-side lowering, and its nine-phase body conversion documented in tileas to LLVM carries the load to LLVM. The TMA-load rewrite specifically follows the five-step pattern in tileas-to-llvm — async.tiled_tma_load: the descriptor becomes an llvm.ptr<1> to its global-memory home, the destination view becomes an llvm.ptr<3> to the shared-memory base address, the per-axis coordinates flow through unchanged as i32 values, and the mbarrier slot becomes an llvm.ptr<3> to the completion barrier.
; ---- descriptor pointer (taken from the kernel's grid-constant .param slot)
%desc_a_ptr = call ptr addrspace(1) @llvm.nvvm.tma.get.descriptor.address(i32 0)
; ---- mbarrier publishes expected transaction-byte count
call void @llvm.nvvm.mbarrier.arrive.expect_tx.shared(
ptr addrspace(3) %mbar_a, i32 32768)
; ---- TMA bulk-tensor load: shared <- global, through the descriptor,
; coordinate order reversed to inner-axis-first (column-major)
call void @llvm.nvvm.cp.async.bulk.tensor.shared.cluster.global.2d(
ptr addrspace(3) %a_smem_dst, ; destination: shared base address
ptr addrspace(1) %desc_a_ptr, ; source: TMA descriptor in global
i32 %bk_coord, ; coord[0]: inner-axis (was %bm)
i32 %bm_coord, ; coord[1]: outer-axis (was %bk)
ptr addrspace(3) %mbar_a) ; completion barrier
Four things change at the LLVM boundary. The CopyAtom witness is consumed — the intrinsic name llvm.nvvm.cp.async.bulk.tensor.shared.cluster.global.2d encodes everything the witness named (2D tile, shared destination, global source, mbarrier completion), so no attribute is needed. The transaction-byte count is published in its own instruction: llvm.nvvm.mbarrier.arrive.expect_tx.shared writes 32768 into the barrier's expected_txn field. Coordinate order reverses — TileAS lists coordinates outer-axis-first ([%bm, %bk]) to match layout-assignment's row-major convention, but the LLVM intrinsic expects inner-axis-first ([%bk, %bm]) to match the PTX instruction. And the AsyncToken SSA result becomes an i32 zero constant: the token does not carry hardware state, only an IR-level data-dependence edge, so the lowering replaces it with a placeholder whose only purpose is keeping the SSA dataflow connected.
The mbarrier's role is decoupled here. The mbarrier.arrive.expect_tx.shared call is what publishes the byte count; the cp.async.bulk.tensor call is what issues the asynchronous transfer that will update the barrier's txn_count field asynchronously as bytes arrive in shared memory; a downstream mbarrier.try_wait.parity.shared (emitted by the consumer-side pattern, not by the load itself) is what gates the consumer on both arrival and transaction-byte completion. The three are independent instructions tied together only by the shared barrier pointer. See mbarrier State Machine — Kinds: Ordinary, Transaction, Cluster for the TMA-transaction kind's full state-machine view.
⚡ QUIRK — failed transactions are silent UB If a TMA load fails mid-flight — out-of-bounds coordinate, malformed descriptor, multicast mask referencing a CTA outside the cluster — the hardware does not raise an exception, does not signal the barrier, and does not abort the kernel. The transaction-byte count simply never reaches
expected_txn, and the consumer'smbarrier.try_wait.parityspins forever (or, with a non-zeronstimeout, returns failure). No diagnostic surfaces from MLIR, LLVM, orptxas; the failure mode is a hang. Reimplementations that assume any kind of error reporting from the load itself will misdiagnose this class of bug as a barrier mis-init. The only defense is the descriptor verifier — see TileAS TMA and Memops Family — Descriptor Builders and Verifiers for the catalog.
Stage 5: NVPTX MIR
The NVPTX backend's instruction selector (ISelDAG and MatcherTable) consumes the LLVM intrinsic and produces a MachineFunction instruction. The TMA family of opcodes is a set of CP_ASYNC_BULK_TENSOR_*_* machine instructions, one per (rank, mode, destination, options) tuple. For the 2D tile load with mbarrier completion, the opcode is CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER.
bb.loop:
; --- mbarrier arrives with expected transaction byte count
MBARRIER_ARRIVE_EXPECT_TX_SHARED %mbar_a:b64, 32768
; --- TMA load: shared destination, global source via descriptor, mbarrier completion
CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER
%a_smem_dst:b64, ; destination shared address (b64 SMEM ptr)
%desc_a:b64, ; descriptor address (b64 global ptr)
%bk:b32, %bm:b32, ; coordinates, inner-axis first
%mbar_a ; completion barrier (b64 SMEM ptr)
Three observations matter at MIR level. First, the opcode encodes the address spaces in its name — _SHARED_CLUSTER_GLOBAL_ selects the variant whose destination is SMEM, whose source is global via descriptor, and whose completion handshake is the cluster-scope mbarrier transaction. A 1D variant would have _1D_ in the slot; an im2col variant would have _IM2COL_; a multicast variant would have _MULTICAST_. Each is a distinct opcode in the NVPTX .td files, picked up by the AsmPrinter table to render the corresponding PTX modifier set. Second, the MBARRIER_ARRIVE_EXPECT_TX_SHARED instruction is the in-flight expect_tx publish: it writes 32768 into the barrier's expected_txn slot. The number is a literal immediate at this level, not a register; the rewriter has folded it from the tx_count = 32768 attribute that originated in nv_tileas.async.tiled_tma_load. Third, the AsyncToken is gone — the LLVM-side i32 zero placeholder dies during instruction selection because no MIR opcode consumes it.
The transaction-byte count has now flowed through four levels of representation: implicit shape-and-element-type in cuda_tile, still implicit in nv_tileaa, explicit as tx_count = 32768 : i32 attribute in nv_tileas, explicit as i32 32768 operand to llvm.nvvm.mbarrier.arrive.expect_tx.shared in LLVM IR, and explicit as immediate operand 32768 to MBARRIER_ARRIVE_EXPECT_TX_SHARED in MIR.
Stage 6: PTX Text
The AsmPrinter (AsmPrinter and Per-SM Windows) walks the MachineFunction and renders each instruction. The 2D TMA load with mbarrier completion prints as cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes, the canonical Hopper TMA tile-load mnemonic.
// ---- mbarrier publishes the 32768-byte expectation
mbarrier.arrive.expect_tx.shared.b64 _, [%rd_mbar], 32768;
// ---- TMA bulk-tensor 2D load
cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes
[%rd_smem_dst], // destination: SMEM address of tile
[%rd_desc_a, {%r_bk, %r_bm}], // source: descriptor + coords (inner-axis first)
[%rd_mbar]; // completion barrier address
The mnemonic encodes seven independent decisions. cp.async.bulk.tensor is the family. 2d is the descriptor rank. shared::cluster.global is the address-space pair — destination in CTA-scope shared memory visible to the cluster, source in global memory via descriptor. tile is the descriptor mode (versus im2col, gather4, etc.). mbarrier::complete_tx::bytes is the completion mechanism — the bytes-based transaction barrier that pairs with mbarrier.arrive.expect_tx and mbarrier.try_wait.parity. Each modifier maps back to a specific attribute that traveled from nv_tileas.async.tiled_tma_load through the LLVM intrinsic name into the MIR opcode suffix and finally into the printed mnemonic.
The coordinate operands {%r_bk, %r_bm} are inner-axis-first; the original nv_tileas form was outer-axis-first [%bm, %bk]. The reversal happened in the LLVM lowering, propagated through MIR, and surfaces here at print time.
The transaction-byte count 32768 is the literal immediate to mbarrier.arrive.expect_tx. The load itself does not carry a byte-count operand — the load's job is to issue the transfer and have the hardware update the barrier's txn_count field; the byte expectation was set out-of-band by the expect_tx instruction.
⚡ QUIRK —
mbarrier::complete_tx::bytesis a load modifier, not an mbarrier modifier Thembarrier::complete_tx::bytesqualifier appears in thecp.async.bulk.tensor.*mnemonic, not in thembarrier.*mnemonic. It selects the transaction-completion behaviour of the bulk-tensor instruction — the load updates an mbarrier'stxn_countfield as bytes arrive — and does not describe how the barrier itself is built or armed. The barrier'sexpected_txnis established by a separatembarrier.arrive.expect_txinstruction issued before the load. Reimplementations that emit only the bulk-tensor instruction and rely on the mnemonic to "publish" the byte count produce a barrier whoseexpected_txnstays at zero, sotry_wait.parityreturns success immediately (thetxn_count >= expected_txncheck is vacuous at zero), and the consumer reads garbage from a still-uncopied destination.
Transaction-Byte Count: Cross-Stage Flow
The single integer 32768 is the canonical thread tying the load to the consumer. It is computed exactly once — 128 rows × 128 cols × 2 bytes/element = 32 768 bytes — but lives under different names and at different levels of abstraction at every stage. Its journey:
| Stage | Form | Carrier | Source |
|---|---|---|---|
1 — cuda_tile | implicit | tile shape <128x128xbf16> | derived at lower time |
2 — nv_tileaa | implicit | tensor shape <128x128xbf16> plus CopyAtom witness | derived at lower time |
3 — nv_tileas | explicit attr | tx_count = 32768 : i32 on async.tiled_tma_load | computed by LowerTMALoadStoreToAsync phase 5 |
| 4 — LLVM IR | explicit i32 operand | argument to llvm.nvvm.mbarrier.arrive.expect_tx.shared | folded from tx_count attribute |
| 5 — NVPTX MIR | explicit immediate | operand to MBARRIER_ARRIVE_EXPECT_TX_SHARED | selected through ISelDAG |
| 6 — PTX text | explicit literal | last operand to mbarrier.arrive.expect_tx.shared.b64 | rendered by AsmPrinter |
The transition from implicit (stages 1–2) to explicit (stages 3–6) happens in phase 5 of LowerTMALoadStoreToAsync, the same phase that materializes the async TMA op itself. Until that phase runs, the byte count exists only as a derivable consequence of the tile shape and the element type; after that phase runs, it is a first-class attribute that travels verbatim through every subsequent lowering. The consumer side reads it through the mbarrier's expected_txn field, exactly as the mbarrier State Machine try_wait.parity predicate documents.
⚡ QUIRK — tx_count diverges from naive byte count under swizzle The byte count published to
expect_txis the number of bytes the TMA hardware will actually deposit into shared memory, which equals the tile size in bytes when the descriptor's swizzle mode isnone. When the swizzle mode is128B,64B, or32B, the hardware deposits the unswizzled byte count — the data after swizzle reordering still occupies the same number of bytes, even though their layout in SMEM is permuted. A reimplementer who computestx_count = tile_rows * tile_cols * sizeof(elem)is correct for tiled mode regardless of swizzle, but the same formula does not generalize to im2col mode (where padded rows expand the byte count beyondrows × cols × sizeof) or to multicast (where the byte count is per receiving CTA, not aggregated). The phase-5 computation inLowerTMALoadStoreToAsyncreads the atom's descriptor metadata to get the right answer for each mode.
Stage 7: SASS
Past the PTX text, the path leaves tileiras and enters ptxas's territory through the boundary documented in ptxas Handoff Protocol. The assembler renders the cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes mnemonic into the SASS instruction stream — instruction encodings, register allocation, and scheduling are entirely ptxas's decision. The byte count 32768 becomes part of the encoded immediate operand of the SASS MBARRIER instruction; the load itself decomposes into a sequence of UTMALDG and related SASS instructions that issue the transfer and arm the L1 cache fill paths.
That layer is out of scope for tileiras's documentation. The wiki covers the path up to PTX text; everything below the handoff is ptxas territory, including the SASS opcode encoding for the bulk-tensor family and the SM scheduling decisions that interleave the asynchronous TMA issue against the warp group's compute.
Coordinate Reversal in Detail
The coordinate operands flip between outer-axis-first and inner-axis-first ordering exactly once in the cascade — between Stage 3 (nv_tileas) and Stage 4 (LLVM IR). Tracking which axis ordering each stage uses is necessary for any reimplementation that wants to walk an IR dump and validate the coordinate operand order without re-reading the source patterns.
The convention at each stage:
| Stage | Coordinate order | Why |
|---|---|---|
1 — cuda_tile.load_view_tko | outer-axis-first [%bm, %bk] | matches Python-style tile indexing in the frontend bytecode |
2 — nv_tileaa.tiled_load | outer-axis-first [%bm, %bk] | preserved verbatim through Part-B rewrite for SSA-edge stability |
3 — nv_tileas.async.tiled_tma_load | outer-axis-first [%bm, %bk] | layout-assignment writes coordinates in row-major order |
4 — nvvm.cp.async.bulk.tensor.shared.cluster.global.2d | inner-axis-first [%bk, %bm] | matches the PTX instruction operand order |
5 — CP_ASYNC_BULK_TENSOR_2D_SHARED_CLUSTER_GLOBAL_MBARRIER | inner-axis-first [%bk, %bm] | survives ISel verbatim |
6 — cp.async.bulk.tensor.2d.shared::cluster.global.tile… | inner-axis-first [%bk, %bm] | PTX consumes inner-axis-first |
The reversal happens inside the async.tiled_tma_load → nvvm.cp.async.bulk.tensor.* pattern in tileas-to-llvm — async.tiled_tma_load. The pattern walks the operand list and emits the operands in reverse coordinate order as part of the intrinsic call construction. Im2col variants get the same reversal applied to their coordinate vector, with the K-offset prefix preserved at the head of the operand list.
Verifier Surface at Each Stage
Each stage's verifier catches a different class of malformed load. The same operation must satisfy every verifier on its path; a load that survives Stage 2 because Stage 1's verifier didn't notice an issue still fails at Stage 3 once the descriptor builder runs its 128-byte alignment check. The catalog by stage:
| Stage | Verifier | Sample diagnostics |
|---|---|---|
1 — cuda_tile | cuda_tile.load_view_tko verifier | "all dimensions must be positive constants, got …", "all dimensions must be powers of two, got …", "tile would exceed the maximum of …" |
2 — nv_tileaa | nv_tileaa.tiled_load verifier (inherited tile-dim invariants) | "expects N coordinates, but got M", "expects CoordType is same as memref index type", "view elementType not equal with tensor element type: …" |
3a — nv_tileas.tiled_load | inherited from TileAS shared verify_tiled_memop | "unsupported mem_semantic: acquire" (loads forbid acquire), "incorrect number of in_bounds elements: expected …" |
3b — nv_tileas.make_tiled_tma_desc | descriptor builder verifier | "expected tma descriptor pointer to have alignment at least 128", "tma boxDims[0] * elemTypeBitWidth is not a multiple of 16 bytes", "smem layout is not TMA compatible", "TmaLoad only support zero padding now" |
3c — nv_tileas.async.tiled_tma_load | post-rewrite atom verifier | "expect a tma_load atom type", "tmaBoxDim and atomBoxDim length mismatch", "mcast is not supported for TMA load with less than 128bytes per atom" |
| 4 — LLVM IR | shared TypeConverter + intrinsic-arity check | (catch-all for arity mismatches against llvm.nvvm.cp.async.bulk.tensor.* declarations) |
| 5 — NVPTX MIR | LLT-typed operand check at ISel | (rejects malformed b64/b32 operand bundles before they reach the AsmPrinter) |
| 6 — PTX text | ptxas directive verifier | (out of scope; documented under ptxas Handoff Protocol) |
The 128-byte descriptor alignment check at Stage 3b is the one most worth flagging: TMA descriptors must be 128-byte aligned, which is why the kernel ABI marks descriptor-pointer arguments as grid_constant (placed in .param, naturally 128-byte aligned) rather than .global (only 16-byte aligned in the general case). A reimplementation that drops the grid_constant attribute keeps the IR well-formed all the way through verifier, but ptxas rejects the resulting cp.async.bulk.tensor load with an alignment diagnostic at SASS-generation time.
Address-Space Trail
Every operand of the load lives in a specific GPU address space, and the address-space attribution flows through the cascade in a different shape at each level. The trail for this walkthrough's operands:
| Operand | Address space at each stage |
|---|---|
Source tensor %A | space-1 (global) at every stage |
TMA descriptor %desc_a | space-1 (global), with grid_constant mark routing it to .param |
| Destination tile (SMEM) | space-3 (shared) starting at Stage 3, allocated from global_smem |
Mbarrier slot %mbar_a | space-3 (shared), 64-bit aligned, lives in global_smem |
| Loop iterator / coordinates | space-0 (generic register) |
The descriptor's address-space story is the subtlest of the five. make_tiled_tma_desc produces a value typed !nv_tileas.tma_desc<…>, an opaque type that the Shared LLVM Type Converter lowers to !llvm.ptr<1> — an LLVM global-space pointer. The descriptor itself lives in global memory at launch time, but the pointer to it is passed through the kernel ABI in .param space. The cute_nvgpu.grid_constant argument attribute is what tells the codegen "this argument is a .param slot containing a pointer that, when dereferenced, lands in global memory." The downstream cute-to-llvm lowering at sub_1698C20 lifts that attribute to nvvm.grid_constant, which is the form ptxas reads for the .param placement decision.
The mbarrier slot's address-space story is the simplest: it is always .shared. nvvm.mbarrier.init.shared, nvvm.mbarrier.arrive.expect_tx.shared, nvvm.mbarrier.try_wait.parity.shared — every member of the 21-op NVVM family for the cases this walkthrough exercises takes the .shared variant, because mbarrier hardware lives in CTA-scope SMEM. The non-.shared variants exist for cluster-scope mbarriers reached through nvvm.mapa (peer-CTA address translation), but those are out of scope for a single-CTA tile load.
Mbarrier Slot Allocation and Reuse
The nv_tileas.alloc_mbarrier op produced in Stage 3 carves a 64-bit barrier out of the kernel's SMEM arena. The buffer-assignment pass documented in Buffer Assignment and Named-Barrier Binding is what decides the offset; for this walkthrough the slot lives at a fixed offset past the tile-storage region of global_smem. The barrier is initialized once at kernel prologue with nvvm.mbarrier.init.shared — the arrive_count matches the number of producer arrivals the load contributes (one, for a single-issue load), and the expected_txn field starts at zero and is set by the first mbarrier.arrive.expect_tx that publishes against it.
In a pipelined kernel — the steady-state shape documented in DSL to PTX End-to-End — Stage 3: nv_tileas IR — the same mbarrier slot is reused across multiple iterations under different phase parities. The phase bit on the barrier flips on every completion, so iteration i and iteration i+1 see opposite parities on the same slot; the consumer's try_wait.parity reads the iteration-derived parity from the loop's pipeline iterator state. For a depth-D pipeline the slot at stage index s carries phase (i / D) & 1 on iteration i, and the producer's expect_tx flips the phase implicitly through the arrive-with-expect-tx machinery the mbarrier State Machine — State Machine page documents.
This is why the load alone does not need to specify a phase: the load increments the in-flight transaction byte count, and mbarrier_arrive flips the phase when pending reaches zero — exactly one arrival is enough for the single-producer load. The consumer reads the iteration phase from its pipeline iterator and asks try_wait.parity for that phase. The phase invariant is what makes a single barrier slot reusable across iterations without ABA hazards.
Capability Cross-Check
The walkthrough above targets sm_90a. The same cuda_tile.load_view_tko would produce a different cascade on every other supported architecture; the table below summarises the divergence so a reimplementer can predict what to expect under a different --compute-capability value.
| Compute capability | CopyAtom witness | Stage-3 op | Stage-6 PTX mnemonic |
|---|---|---|---|
sm_70 (Volta) | sm70_ldg_128_bf16 | nv_tileas.tiled_load (no rewrite) | ld.global.nc.v4.b32 (per-thread vectorised) |
sm_75 (Turing) | sm70_ldg_128_bf16 | same | ld.global.nc.v4.b32 |
sm_80 (Ampere) | sm80_cp_async_4_bf16 | nv_tileas.async.cp_async | cp.async.ca.shared.global.4 |
sm_89 (Ada) | sm80_cp_async_4_bf16 | same | same |
sm_90a (Hopper) | sm90_tma_load_2d_bf16 | nv_tileas.async.tiled_tma_load | cp.async.bulk.tensor.2d.shared::cluster.global.tile.mbarrier::complete_tx::bytes |
sm_100 (Blackwell) | sm90_tma_load_2d_bf16 (inherited) | nv_tileas.async.tiled_tma_load | same TMA mnemonic; consumer is tcgen05.mma not WGMMA |
sm_120 (consumer Blackwell) | sm90_tma_load_2d_bf16 | same | same |
The transition between sm_89 and sm_90a is where the TMA descriptor, the mbarrier transaction-byte machinery, and the asynchronous cp.async.bulk.tensor instruction first enter the cascade. Below that boundary the load is synchronous (ldg) or coroutine-style async with per-thread cp.async; above that boundary it is the bulk-tensor instruction the rest of this page traces. See Matmul Progression by SM for the parallel progression on the WGMMA / tcgen05 consumer side.
Consumer-Side Pairing
The load alone is half the story. Once the producer issues cp.async.bulk.tensor and mbarrier.arrive.expect_tx, the consumer needs to wait until the destination tile is ready. That wait is a separate operation — nv_tileas.async.pipeline.consumer_wait at the TileAS level, lowering to nvvm.mbarrier.try_wait.parity.shared at the LLVM level, and surfacing as mbarrier.try_wait.parity.shared.b64 in PTX. The consumer wait is documented in detail in DSL to PTX End-to-End — Stage 4: LLVM IR with NVVM intrinsics and in mbarrier State Machine — Phase Parity.
The wait's release predicate is exactly what the state-machine table in mbarrier State Machine — State Machine shows: phase == want_phase && pending == arrive_count && txn_count >= expected_txn. The third clause is what the transaction-byte machinery in this walkthrough sets up — the load updates txn_count asynchronously, and the wait releases only once that field crosses the expected_txn = 32768 threshold the expect_tx instruction published.
The WGMMA consumer that typically reads the loaded tile is documented in WGMMA Emission Protocol. The end-to-end producer/consumer pipeline that wires the TMA load to the WGMMA consumer through a multi-stage mbarrier ring is documented in DSL to PTX End-to-End — Stage 3: nv_tileas IR.
Reimplementation Checklist
Anyone reproducing a one-shot TMA load from a higher-level IR should walk the same six gates this page traces, in order. The checklist mirrors the cascade:
- Pick a CopyAtom whose interface tag (
TmaAtomTypeInterface) marks it as a TMA candidate. Anything else stays synchronous. - Verify the box-dim invariants: leading dim's bit-width must be a 16-byte multiple, descriptor pointer 128-byte aligned, smem layout TMA-compatible. The descriptor builder verifier catches every violation.
- Compute
tx_countfrom atom metadata, not from the naiverows × cols × sizeof(elem)shortcut. Im2col and multicast change the formula. - Reserve exactly one mbarrier slot per pipeline stage, not per producer or per atom. Multiple atoms publishing to the same slot is normal; the slot's
expected_txnis the sum. - Reverse coordinate order at the TileAS-to-LLVM boundary and only there. Earlier reversals corrupt scheduling decisions; later reversals corrupt the PTX operand order.
- Pair every load with a downstream consumer wait on the same mbarrier and the same parity. The wait is not a property of the load — it is a separate operation, emitted by a separate pattern, against a separately-tracked phase.
Skipping any of these six steps yields a kernel that either fails verifier mid-pipeline, fails ptxas at SASS time, or hangs forever at runtime. The QUIRK callouts above flag the most error-prone of the six.
Two further constraints are worth flagging because they are easy to miss when working backward from a PTX dump. First, the nv_tileaa.kernel_spec attribute must be present on the function before LowerTMALoadStoreToAsync runs — its absence fires "LowerTMALoadStoreToAsync: missing or invalid KernelSpecAttr on function" and skips every TMA rewrite, leaving the IR with tiled_load ops that the downstream verifier rejects. Second, the function-level nv_tileas.num-host-tmas and nv_tileas.num-device-tmas counters must agree with the actual tmaIdx range; "tmaIdx exceed tmaHostNum." and "tmaIdx exceed tmaDeviceNum." are emitted by the ABI verifier when they don't.
Cross-References
DSL to PTX End-to-End is the kernel-wide walkthrough this page narrows; it shows the same kernel at every stage with all operations in place and traces the full producer/consumer pipeline through scheduling.
mbarrier State Machine is the canonical reference for the barrier object's state machine, the 21-op NVVM family that touches it, and the three barrier kinds (ordinary, TMA-transaction, cluster-transaction); this walkthrough exercises the TMA-transaction kind end-to-end.
WGMMA Emission Protocol is the consumer-side companion: the four-op wgmma.fence / mma_async / commit_group / wait_group sequence runs once the try_wait.parity on the load's mbarrier releases.
TileAS TMA and Memops Family covers the eight-phase LowerTMALoadStoreToAsync pass, descriptor builders, ABI separation between host and device descriptors, and the full diagnostic catalog from the verifier.
cuda_tile to nv_tileaa documents the first lowering stage (Stage 1 → Stage 2 in this walkthrough), the three-populator structure, and the CopyAtom-attaching layout-assignment pre-pass.
nv_tileaa to nv_tileas covers the alias-aware-to-assembler-near rewrite (Stage 2 → Stage 3) and the witness hand-off shape that preserves the CopyAtom attribute verbatim.
nv_tileas to LLVM is the terminal MLIR-side lowering (Stage 3 → Stage 4), with the five-step TMA-load rewrite, the coordinate reversal, and the nine-phase body conversion.
TMA, Tensormap, and cp.async.bulk is the codegen-side reference for the full cp.async.bulk.tensor.* mnemonic family, descriptor mutator ABI, and proxy-fence rules.
TMA Atoms catalogues the cute_nvgpu.tma_atom witness shapes for every supported (rank, mode, element-type, swizzle, multicast) tuple, and the make_exec_tma binding step that pairs each atom with its mbarrier.
SM-Tier Roster and Copy Atom Registry is the registry the layout-assignment pre-pass consults to pick sm90_tma_load_2d_bf16 over the alternatives.
Matmul Progression by SM and Capability Matrix explain why the lowering chose the sm_90a TMA path; on Ampere or earlier the same cuda_tile.load_view_tko would lower to cp.async or plain ldg with no TMA descriptor, no mbarrier transaction kind, and no tx_count attribute at all.