Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lowering: nvgpu / gpu to NVVM

Abstract

This lowering family is the final MLIR-side step. It strips the standard gpu and nvgpu dialects from a Tileiras kernel module: portable GPU concepts (thread indices, barriers, dynamic shared memory, subgroup operations, printf) and NVIDIA-specific operations (async-copy, tensor-memory, mbarrier, WGMMA, sparse MMA, packed arithmetic) all become NVVM and LLVM operations the NVPTX backend can consume.

The contract is semantic, not archaeological: once these conversions run, no executable gpu.* or nvgpu.* operation should remain. The resulting module contains llvm.*, nvvm.*, and a small set of explicitly legal container or bridge operations that later serialization already understands.

Boundary Contract

Two related but distinct jobs share this pass.

gpu -> nvvm lowers the standard MLIR GPU dialect: thread and block index queries, cluster index queries, barriers, GPU function boundaries, GPU returns, dynamic shared memory, shuffle/reduce operations, printf, and math operations that need libdevice calls.

nvgpu -> nvvm lowers NVIDIA architectural operations: mbarrier operations, TMA tensor copy operations, descriptor construction and prefetching, WGMMA descriptor and accumulator operations, synchronous MMA, ldmatrix, SM80-style cp.async, sparse MMA, reciprocal approximation, packed float conversion, and packed f32x2 arithmetic.

The conversion target is strict:

Input conceptOutput form
gpu.thread_id, gpu.block_id, dimension queriesnvvm.read.ptx.sreg.* and integer arithmetic
gpu.barriernvvm.barrier0
cf.assert in GPU codeguarded call to CUDA-compatible __assertfail
gpu.printfvprintf call with lowered format and argument buffer
math.* operations that require device helpersscalarized libdevice __nv_* calls
nvgpu.mbarrier.*nvvm.mbarrier.*, usually with shared-memory variants
nvgpu.tma.*nvvm.cp.async.bulk.tensor.*, tensor-map helpers, and proxy fences
nvgpu.warpgroup.*WGMMA NVVM operations plus LLVM value packing
nvgpu.mma.sync, nvgpu.ldmatrixmatching NVVM matrix intrinsics plus LLVM repacking
nvgpu.device_async_*SM80 nvvm.cp.async.* group operations
nvgpu.mma.sp.syncllvm.inline_asm carrying the PTX sparse-MMA instruction
SM100 packed arithmetic and conversion opsdedicated nvvm.* packed operations

Violation behavior is uniform across the two halves of the pass: any executable gpu.* or nvgpu.* op remaining after the partial conversion is a hard failure — applyPartialConversion reports the unconverted op and the pass fails. An nvgpu.mbarrier.* op whose operand does not resolve to a shared-memory pointer is rejected by the typed-operand trait check (which surfaces as the verbatim " must be mbarrier barrier type, but got " diagnostic, prefixed by the operand label and suffixed with the printed offending type) rather than implicitly inserting an address-space cast, because the cast would change the semantic memory space tcgen05 lowering relies on. A vector-typed math.* operation that reaches libdevice dispatch without prior scalarisation is rejected by the conversion target rather than dispatched lane-by-lane silently. A cf.assert whose message globals cannot be materialised falls through to the upstream LLVM diagnostic. The gpu.module container itself is the only legal gpu.* surface on output; any other surviving gpu.* op signals a missing pattern in this bank.

GPU Dialect Lowering

The standard GPU pass builds a conversion target that legalises LLVM and NVVM, keeps gpu.module and gpu.yield legal so kernel bodies can be rewritten in place, marks the rest of the GPU dialect illegal, and adds libdevice-backed math operations and cf.assert to the illegal set. A surviving gpu.* op after this pass means either no pattern was registered or the pattern rejected the operation; the strict target makes the failure mode visible.

Index Queries

Thread, block, cluster, and grid index queries each rewrite to one NVVM special-register read plus an i32-to-index cast. The shape is uniform across the family — only the special-register name varies.

%i = gpu.thread_id x : index
   ↓
%r = nvvm.read.ptx.sreg.tid.x : i32
%i = arith.index_cast %r : i32 to index

The full mapping covers nine source operations:

SourceSpecial register
gpu.thread_id {x,y,z}nvvm.read.ptx.sreg.tid.{x,y,z}
gpu.block_id {x,y,z}nvvm.read.ptx.sreg.ctaid.{x,y,z}
gpu.block_dim {x,y,z}nvvm.read.ptx.sreg.ntid.{x,y,z}
gpu.grid_dim {x,y,z}nvvm.read.ptx.sreg.nctaid.{x,y,z}
gpu.cluster_id {x,y,z}nvvm.read.ptx.sreg.clusterid.{x,y,z}
gpu.cluster_dim {x,y,z}nvvm.read.ptx.sreg.nclusterid.{x,y,z}
gpu.cluster_block_id {x,y,z}nvvm.read.ptx.sreg.cluster.ctaid.{x,y,z}
gpu.subgroup_sizenvvm.read.ptx.sreg.warpsize
gpu.lane_idnvvm.read.ptx.sreg.laneid

Barrier

The CTA-wide barrier rewrite is one-to-one and must not introduce control flow — schedulers downstream rely on a barrier appearing exactly where the source op did.

gpu.barrier
   ↓
nvvm.bar.sync.aligned %c0 : i32

The aligned variant is mandatory: tileiras kernels always launch with warp-aligned thread counts, and the non-aligned barrier would force a fallback path the scheduler has not budgeted for.

Assert

cf.assert preserves CUDA's runtime contract. Message, source file, and function name become module-level global strings; the original predicate controls a conditional branch where the failing edge calls __assertfail and the passing edge falls through.

cf.assert %cond, "message" : i1
   ↓
llvm.cond_br %cond, ^cont, ^fail
^fail:
  %msg  = llvm.mlir.addressof @.assert_msg  : !llvm.ptr
  %file = llvm.mlir.addressof @.assert_file : !llvm.ptr
  %func = llvm.mlir.addressof @.assert_func : !llvm.ptr
  llvm.call @__assertfail(%msg, %file, %line, %func, %c0_i64) : ...
  llvm.br ^cont
^cont:
  ...

__assertfail is the CUDA runtime symbol the linker resolves; the signature (char*, char*, i32 line, char*, i64 charSize) is fixed by the runtime ABI and any reimplementation must call it with exactly those argument types in that order.

Libdevice Math

Vector lanes are scalarised before libdevice dispatch because libdevice functions are scalar and downstream cleanup folds scalar LLVM operations far more reliably than dialect-vector calls. The rewriter walks vector results, emits a per-lane libdevice call selected by element type, and reconstructs the vector via insertelement.

%r = math.sqrt %v : vector<4xf32>
   ↓
%v0 = vector.extract %v[0] : f32 from vector<4xf32>
%v1 = vector.extract %v[1] : f32 from vector<4xf32>
%v2 = vector.extract %v[2] : f32 from vector<4xf32>
%v3 = vector.extract %v[3] : f32 from vector<4xf32>
%r0 = llvm.call @__nv_sqrtf(%v0) : (f32) -> f32
%r1 = llvm.call @__nv_sqrtf(%v1) : (f32) -> f32
%r2 = llvm.call @__nv_sqrtf(%v2) : (f32) -> f32
%r3 = llvm.call @__nv_sqrtf(%v3) : (f32) -> f32
%r  = vector.from_elements %r0, %r1, %r2, %r3 : vector<4xf32>

The callee name comes from a (MathOpKind, ElementType) table: math.sqrt of f32 selects __nv_sqrtf, of f64 selects __nv_sqrt, of f16 selects __nv_sqrtf with operand promotion. Reflection-resolved variants for fast-math and unsafe-math intrinsics (__nv_fast_sinf, __nv_unsafe_divf) attach via the fastmath attribute on the source op.

NVGPU Dialect Lowering

The NVGPU conversion is a table-driven pattern set. Each pattern has one root operation and a typed matchAndRewrite body. Most emit a single NVVM operation. A handful are structural: tensor-map descriptor construction writes an LLVM stack object, WGMMA store decomposes an accumulator into per-thread stores, and sparse MMA emits inline assembly because the dialect snapshot doesn't model that instruction as a first-class NVVM op.

Source familyLowering behavior
nvgpu.mbarrier.createcreates or references a private shared-memory barrier object
nvgpu.mbarrier.initinitializes the barrier with the requested participant count
nvgpu.mbarrier.arrive*emits arrival, no-complete, and expect-transaction NVVM intrinsics
nvgpu.mbarrier.test.waittests and waits on a phase or token
nvgpu.mbarrier.try_wait.parityemits the parity-sensitive wait primitive
nvgpu.tma.async.loademits tensor bulk copy from global tensor memory into shared memory
nvgpu.tma.async.storeemits tensor bulk copy from shared memory back to global tensor memory
nvgpu.tma.create.descriptorbuilds the tensor-map descriptor that the CUDA driver can encode
nvgpu.tma.prefetch.descriptoremits tensor-map prefetch
nvgpu.tma.fence.descriptoremits proxy acquire fence for descriptor visibility
nvgpu.warpgroup.generate.descriptorpacks the GMMA shared-memory descriptor bitfields
nvgpu.warpgroup.mmaemits WGMMA fence, async MMA, commit, and wait operations
nvgpu.warpgroup.mma.storemaps accumulator fragments to per-thread stores
nvgpu.warpgroup.mma.init.accumulatorbuilds the zero or poison accumulator aggregate
nvgpu.mma.syncemits synchronous MMA NVVM intrinsic
nvgpu.ldmatrixemits ldmatrix and repacks the returned fragments
nvgpu.device_async_copyemits SM80 cp.async.shared.global
nvgpu.device_async_create_groupemits cp.async.commit.group
nvgpu.device_async_waitemits cp.async.wait.group
nvgpu.mma.sp.syncemits sparse MMA inline assembly
nvgpu.rcpemits reciprocal approximation
nvgpu.cvt_fptrunc, nvgpu.cvt_fpextemits packed float conversion
nvgpu.fma.packed.f32x2, nvgpu.mul.packed.f32x2emits packed f32x2 arithmetic

Each entry above is a distinct OpConversionPattern subclass registered against its root op. The conversion engine selects among them by op kind; there is no shared dispatcher inside a single rewriter.

Pattern Shapes

Every NVGPU pattern in this stage shares one outer shape: match on a root NVGPU op, convert its operands through the shared LLVM type converter, emit one or more NVVM ops plus any packing arithmetic, and replace the root. The four shapes below cover the families that need more than a single emission step; the remaining one-to-one patterns reduce to generic_remap from the 43-instantiation arith bank.

Mbarrier

The mbarrier family rewrites the five nvgpu.mbarrier.* operations into matching nvvm.mbarrier.* intrinsics. Shared-memory variants take a !llvm.ptr<3> barrier address; non-shared variants take a generic pointer the rewriter must address-space-cast to shared or reject.

%bar = nvgpu.mbarrier.create : !nvgpu.mbarrier
   ↓
%bar = llvm.mlir.addressof @mbar_storage : !llvm.ptr<3>

nvgpu.mbarrier.init %bar, %count : !nvgpu.mbarrier, i32
   ↓
nvvm.mbarrier.init.shared %bar, %count : !llvm.ptr<3>, i32

nvgpu.mbarrier.arrive %bar : !nvgpu.mbarrier -> !nvgpu.token
   ↓
%tok = nvvm.mbarrier.arrive.shared %bar : !llvm.ptr<3> -> i64

nvgpu.mbarrier.arrive.expect_tx %bar, %tx_count : !nvgpu.mbarrier, i32
   ↓
nvvm.mbarrier.arrive.expect_tx.shared %bar, %tx_count : !llvm.ptr<3>, i32

%t = nvgpu.mbarrier.try_wait.parity %bar, %phase, %ticks
   ↓
%t = nvvm.mbarrier.try_wait.parity.shared %bar, %phase, %ticks : !llvm.ptr<3>, i1, i32 -> i1

// `nvgpu.mbarrier.inval` is not interned in this binary; the lower-level
// `nvvm.mbarrier.inval.shared` intrinsic is still emitted directly by
// callers (e.g. CTAExit cleanup) without an `nvgpu` wrapper.
nvvm.mbarrier.inval.shared %bar : !llvm.ptr<3>

If the source operand does not resolve to a shared-memory pointer, the rewriter fails via the typed-operand trait check, surfacing the verbatim " must be mbarrier barrier type, but got " diagnostic (prefixed by the operand label and followed by the printed offending type). The pattern rejects rather than inserts an implicit cast because the cast would change the semantic memory space and downstream tcgen05 lowering depends on shared-memory residence.

TMA Async Load

nvgpu.tma.async.load rewrites to nvvm.cp.async.bulk.tensor.shared.cluster.global with the descriptor pointer, coordinate operands, and barrier. Optional attributes — multicastMask and l2CacheHint — wire into the intrinsic's optional argument slots when present.

nvgpu.tma.async.load %desc, %smem, %coords[%c0, %c1], %barrier
   { multicastMask = 0x000F : i16, l2CacheHint = 0xCAFE : i64 }
   ↓
nvvm.cp.async.bulk.tensor.shared.cluster.global.5d
   %smem, %desc, %barrier, %c0, %c1, %c2, %c3, %c4,
   multicast_mask = %mask, l2_cache_hint = %hint
   : !llvm.ptr<3>, !llvm.ptr<1>, !llvm.ptr<3>, i32 x 5, i16, i64

Operand mapping (rank N)

The intrinsic signature for nvvm.cp.async.bulk.tensor.{N}d.shared.cluster.global.tile is rank-parameterised; the multicastMask and l2CacheHint operands are optional. The rewriter maps the flat nvgpu operand list onto positional intrinsic slots and sets two Unit-typed enable attributes that gate the optional slots.

nvgpu.tma.async.load operandNVVM intrinsic slot
dst (SMEM memref, addr-space 3)slot 0 — dstAddr : ptr addrspace(3)
tensorMapDescriptorslot 1 — tensorMap : ptr to the 128-byte CUtensorMap
coordinates[0..N-1]slots 2..N+1 — coords : i32, one per rank
barrierslot N+2 — barrier : ptr addrspace(3)
multicastMask (optional)slot N+3 — multicastMask : i16
l2CacheHint (optional)slot N+4 — cacheHint : i64

The two Unit attributes (multicastEnable, cacheHintEnable) are not nvgpu attributes — they are produced by the rewriter from operand presence. When multicastMask is supplied, the rewriter sets multicastEnable = unit on the new nvvm.* op; otherwise it leaves both operand and enable absent. The same rule applies to l2CacheHint / cacheHintEnable.

Worked example, 3-D TMA load with both optional operands:

%smem        : memref<128x128xf16, 3>
%bar         : !nvgpu.mbarrier.group
%tmap        : !nvgpu.tensormap.descriptor
%c0,%c1,%c2  : i32
%mask        : i16
%hint        : i64

input :
  nvgpu.tma.async.load %smem[%c0,%c1,%c2], %bar, %tmap,
                       multicastMask = %mask,
                       l2CacheHint   = %hint

output :
  %smem_ptr = unrealized_conversion_cast %smem : memref<128x128xf16, 3> to !llvm.ptr<3>
  %bar_ptr  = ...                              : !llvm.ptr<3>
  %tmap_ptr = ...                              : !llvm.ptr
  nvvm.cp.async.bulk.tensor.3d.shared.cluster.global.tile
      %smem_ptr,                  // slot 0
      %tmap_ptr,                  // slot 1
      %c0, %c1, %c2,              // slots 2..4
      %bar_ptr,                   // slot 5
      %mask,                      // slot 6 (multicast)
      %hint                       // slot 7 (cache hint)
      { multicastEnable, cacheHintEnable, mode = #nvvm.load_mode<tile> }

If %mask is absent, slot 6 is dropped and multicastEnable is not set; slot 7 (if %hint is present) shifts left into slot 6 of the actually-emitted call. The intrinsic ID stays the same; only the operand bag changes width. Absent operands leave slots unset rather than emitting zero constants — a zero cacheHint would force a non-default code path in the backend.

TMA Async Store

nvgpu.tma.async.store is the symmetric reverse direction. The descriptor and coordinates appear in the same operand positions; the source becomes shared memory and the destination becomes global memory. There is no barrier — the producer issues the store and continues.

nvgpu.tma.async.store %smem, %desc, %coords[%c0, %c1]
   ↓
nvvm.cp.async.bulk.tensor.global.shared.cta.5d
   %desc, %smem, %c0, %c1, %c2, %c3, %c4
   : !llvm.ptr<1>, !llvm.ptr<3>, i32 x 5

Operand mapping (rank N):

nvgpu.tma.async.store operandNVVM intrinsic slot
tensorMapDescriptorslot 0 — tensorMap : ptr
coordinates[0..N-1]slots 1..N — coords : i32
src (SMEM memref, addr-space 3)slot N+1 — srcAddr : ptr addrspace(3)
l2CacheHint (optional)slot N+2 — cacheHint : i64, gated by cacheHintEnable

An nvgpu.tma.async.reduce wrapper is not interned in this binary. Reduce-variant lowerings are reached through cute_nvgpu straight into nvvm.cp.async.bulk.tensor.reduce, where the red_op enum selects the intrinsic ID at registration time — eight distinct intrinsics per rank, one per reduction kind. Operand layout mirrors the store form; the upstream wrapper, when present, would copy redop into the red_op slot verbatim.

The fence pattern nvgpu.tma.fence.descriptor rewrites to nvvm.fence.proxy.acquire.sync so descriptor updates from the CUDA host become visible to the device proxy before the next async load.

WGMMA Pipeline

nvgpu.warpgroup.mma expands into the four-op WGMMA protocol the hardware expects: fence, async issue, commit, wait. The accumulator is an aggregate the pattern emits as register-file values; the matching nvgpu.warpgroup.generate.descriptor pattern pre-packs the GMMA descriptors.

%acc' = nvgpu.warpgroup.mma %desc_a, %desc_b, %acc
   ↓
nvvm.wgmma.fence.aligned
%acc' = nvvm.wgmma.mma_async %desc_a, %desc_b, %acc : i64, i64, !llvm.struct<(f32, f32, ...)>
nvvm.wgmma.commit.group.sync.aligned
nvvm.wgmma.wait.group.sync.aligned 0

The four ops must appear in order: the fence ensures prior shared-memory stores are visible to the WGMMA pipeline; mma_async issues the operation; commit.group packages it into a group the warpgroup tracks; wait.group 0 blocks until the in-flight group count reaches zero. Reordering any pair changes the semantics — a missing fence loses input-dependence guarantees, and a missing wait races the accumulator into downstream reads.

Ldmatrix and Repack

nvgpu.ldmatrix rewrites to nvvm.ldmatrix.sync and repacks the returned register fragments into the LLVM-typed vector the consumer expects. The shape and transpose attributes pass through verbatim onto the intrinsic.

%v = nvgpu.ldmatrix %smem, num=4, transpose=false : memref<*xi32, 3>, vector<4xi32>
   ↓
%p = nvvm.ldmatrix.sync %smem, num=4, trans=false
   : !llvm.ptr<3> -> !llvm.struct<(i32, i32, i32, i32)>
%v0 = llvm.extractvalue %p[0] : !llvm.struct<(i32, i32, i32, i32)>
%v1 = llvm.extractvalue %p[1] : !llvm.struct<(i32, i32, i32, i32)>
%v2 = llvm.extractvalue %p[2] : !llvm.struct<(i32, i32, i32, i32)>
%v3 = llvm.extractvalue %p[3] : !llvm.struct<(i32, i32, i32, i32)>
%v  = vector.from_elements %v0, %v1, %v2, %v3 : vector<4xi32>

The fragment count (1, 2, or 4) selects the struct shape: num=1 returns a single i32, num=2 returns !llvm.struct<(i32, i32)>, num=4 returns !llvm.struct<(i32, i32, i32, i32)>. The repack always uses extractvalue + vector.from_elements so the consumer sees a uniform vector regardless of fragment count.

Device Async Copy (SM80)

nvgpu.device_async_copy rewrites to SM80-era cp.async. The associated group and wait operations rewrite one-to-one.

%tok = nvgpu.device_async_copy %gmem, %smem, %size : memref<*xf32, 1>, memref<*xf32, 3>
   ↓
nvvm.cp.async.shared.global %smem, %gmem, %size : !llvm.ptr<3>, !llvm.ptr<1>, i32

nvgpu.device_async_create_group [%tok0, %tok1, ...] : !nvgpu.token
   ↓
nvvm.cp.async.commit.group

nvgpu.device_async_wait %group { numGroups = 0 : i32 }
   ↓
nvvm.cp.async.wait.group 0

Async tokens lower to i32 integer values; the create-group operation discards its token operands because cp.async.commit.group operates on the implicit in-flight group rather than on explicit token list.

Sparse MMA Inline Assembly

nvgpu.mma.sp.sync has no first-class NVVM op in the current dialect snapshot, so the rewriter emits the PTX sparse-MMA instruction through llvm.inline_asm. This is the only operation in the bank that uses inline assembly; prefer NVVM intrinsics for everything else.

Descriptor and Barrier Rules

Mbarrier lowering is address-space-sensitive. Shared-memory barriers use the .shared NVVM variants; non-shared barrier values are rejected with a diagnostic rather than silently cast, because the cast would change the semantic memory space and downstream tcgen05 lowering depends on shared-memory residence. Token parity stays as a small integer value so wait operations can consume it directly without unpacking.

TMA lowering separates descriptor construction from descriptor use. nvgpu.tma.create.descriptor materialises a 128-byte tensor-map object on the function's stack and populates it with the static shape, stride, element-type, swizzle, rank, and interleave fields the CUDA-side encoder reads. Load, store, prefetch, and fence operations consume that descriptor pointer — they never reconstruct the descriptor from its fields, so descriptor canonicalisation can hoist construction freely.

For device-side descriptor rebind, this pass emits the inline-asm tensormap.replace.tile.* calls — global_address once, global_dim once per rank, global_stride once per non-leading rank — wrapped in the fence.proxy.tensormap::generic acquire/release pair so the generic-proxy write becomes visible to the tensormap proxy that cp.async.bulk.tensor.* reads from. The mutator templates, descriptor field layout, and fence-scope selection (.cta vs .gpu vs .sys) are documented in TMA Descriptor Mutators. The rewrite contract here is that the rewriter emits exactly that fixed sequence — any deviation (writing strides before dims, omitting the acquire fence, scoping to .cta across a cluster) leaves the descriptor partially coherent and the consumer reads stale lanes.

WGMMA descriptor packing is a pure integer operation over five inputs: the shared-memory base pointer, leading-byte offset, matrix stride, swizzle base, and swizzle mode. The 64-bit layout is fixed by the Hopper GMMA ISA — bit positions and field widths are documented in MMA Atoms sm70-120 — SM90 WGMMA. The packer is deterministic and side-effect-free, so schedulers and common-subexpression elimination can hoist redundant descriptor construction across loop iterations.

%desc = nvgpu.warpgroup.generate.descriptor %smem_base
   { leading_byte_offset = 16 : i64, matrix_stride = 64 : i64,
     swizzle_base = 128 : i64, swizzle_mode = #nvgpu<swizzle 128B> }
   ↓
%bits   = arith.constant 0x... : i64           // pre-folded bit pattern from attribute fields
%base_i = llvm.ptrtoint %smem_base : !llvm.ptr<3> to i64
%desc   = llvm.or %bits, %base_i : i64

The runtime base pointer is the only operand that varies per instance; everything else folds at compile time from the GMMA-descriptor attribute, so the generated LLVM is typically two instructions (ptrtoint plus or) per descriptor.

Conversion Invariants

  • The pass must leave no executable gpu.* or nvgpu.* operation behind.
  • gpu.module may survive only as the module container consumed by GPU-to-binary serialization.
  • Vector math is scalarized before libdevice calls are introduced.
  • CUDA assertion lowering must preserve the original predicate and source metadata.
  • Mbarrier variants must agree with the operand address space.
  • TMA descriptor construction must be kept separate from TMA copy and prefetch operations.
  • Sparse MMA uses inline assembly only for the missing dialect intrinsic; other operations should prefer first-class NVVM ops.
  • WGMMA lowering must emit the fence, MMA, commit, and wait sequence in the order expected by the hardware pipeline.

Cross-References

Conversion / Lowering Overview places this pass at the companion-lowering stage that runs alongside CuTe lowering. CuTe and CuTe-NVGPU to LLVM — Architecture-Specialized Atoms covers the CuTe atom rewrites whose outputs this pass consumes through cute_nvgpu.atom. Shared LLVM Type Converter describes the shared LLVM type converter every pattern in this bank threads through. MMA Atoms sm70-120 — SM90 WGMMA is the canonical reference for the WGMMA descriptor bit layout the packer above emits.