nvgpu Dialect Overview

Abstract

nvgpu is the bridge dialect between MLIR's generic gpu dialect and NVPTX-specific nvvm. It names the NVIDIA kernel patterns that gpu cannot express — warp shuffle, MMA and WGMMA, cp.async, mbarrier, TMA — without committing yet to a concrete NVVM intrinsic. Tileiras links the upstream dialect unchanged. cute_nvgpu feeds it from above; convert-nvgpu-to-nvvm drains it from below.

About thirty ops live here. The conversion pass installs one OpConversionPattern per op and rewrites the module in a single sweep. Each pattern emits a small fixed body of nvvm.* ops or, in a few exceptional cases, an expanded memref / llvm / llvm.inline_asm sequence. The pass mnemonic is convert-nvgpu-to-nvvm. In the O3 pipeline it runs immediately after the broad convert-to-llvm step and before convert-vector-to-llvm, arith-expand, and convert-memref-to-llvm (see Pass List by Optimization Level — O3), so by the time it fires every operand is already in LLVM-dialect or memref form.

Position in the Cascade

cute_nvgpu
    |
    | lower architecture atoms into stock GPU operations
    v
nvgpu
    |
    | convert-nvgpu-to-nvvm: ~30 patterns, one sweep
    v
nvvm
    |
    | translate to LLVM IR and the NVPTX backend
    v
PTX

cute_nvgpu ops still speak SM-tier vocabulary — TMA atoms, WGMMA atoms, Blackwell tensor-memory operations. nvgpu strips the source-level atom naming and re-presents the same behaviour over MLIR memrefs, vectors, descriptors, barrier groups, and async tokens. That makes the NVVM conversion mechanical: every nvgpu op below has a fixed nvvm (or llvm.inline_asm) lowering.

Op Roster

populateNVGPUToNVVMConversionPatterns installs one OpConversionPattern per op. Tileiras links the upstream populator unchanged. The rewriter callbacks branch on source memory space to pick the generic or .shared form of the mbarrier and cp.async intrinsics — address space 3 always selects .shared.
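The address-space branch described above can be modelled in a few lines. This is a minimal Python sketch of the selection rule only, not the upstream C++ rewriter; the names `select_intrinsic` and `SHARED_ADDRESS_SPACE` are hypothetical and exist purely for illustration.

```python
SHARED_ADDRESS_SPACE = 3  # address space 3 always selects the .shared form


def select_intrinsic(base: str, address_space: int) -> str:
    """Return the op name a pattern would pick for the given memory space."""
    if address_space == SHARED_ADDRESS_SPACE:
        return base + ".shared"
    return base  # any other space keeps the generic form


print(select_intrinsic("nvvm.mbarrier.arrive", 3))
print(select_intrinsic("nvvm.mbarrier.arrive", 0))
```

The rule is deliberately a pure function of the operand's memory space, which matches the claim that this branch is the only policy the patterns carry.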

The "Status" column distinguishes ops whose mnemonic string appears verbatim in this binary's string table from upstream-MLIR patterns that the linked populator carries but whose mnemonic was never interned (because the op was renamed, dropped, or reached only through gpu-dialect routing).

| nvgpu op | NVVM op(s) emitted | Status |
| --- | --- | --- |
| nvgpu.device_async_copy | nvvm.cp.async.shared.global | interned |
| nvgpu.device_async_create_group | nvvm.cp.async.commit.group | interned |
| nvgpu.device_async_wait | nvvm.cp.async.wait.group | interned |
| nvgpu.mbarrier.create | (no nvvm.*; memref.global + memref.get_global) | interned |
| nvgpu.mbarrier.init | nvvm.mbarrier.init / nvvm.mbarrier.init.shared (address-space-driven) | interned |
| nvgpu.mbarrier.arrive | nvvm.mbarrier.arrive / .shared | interned |
| nvgpu.mbarrier.arrive.nocomplete | nvvm.mbarrier.arrive.nocomplete / .shared | interned |
| nvgpu.mbarrier.arrive.expect_tx | nvvm.mbarrier.arrive.expect_tx / .shared | interned |
| nvgpu.mbarrier.test.wait | nvvm.mbarrier.test.wait / .shared | interned |
| nvgpu.mbarrier.try_wait.parity | nvvm.mbarrier.try_wait.parity.shared | interned |
| nvgpu.mbarrier.inval | nvvm.mbarrier.inval[.shared] | absent from this binary; NVVM mbarrier.inval.shared is reached through the lower-level pattern |
| nvgpu.tma.async.load | nvvm.cp.async.bulk.tensor.shared.cluster.global | interned |
| nvgpu.tma.async.store | nvvm.cp.async.bulk.tensor.global.shared.cta | interned |
| nvgpu.tma.async.reduce | nvvm.cp.async.bulk.tensor.reduce | absent from this binary; NVVM reduce intrinsic still ships, but no nvgpu wrapper interns the mnemonic |
| nvgpu.tma.prefetch.descriptor | nvvm.prefetch.tensormap | interned |
| nvgpu.tma.fence.descriptor | nvvm.fence.proxy.acquire | interned |
| nvgpu.tma.create.descriptor | llvm.alloca + GEP/store sequence + llvm.call @cuTensorMapEncodeTiled | interned |
| nvgpu.tensormap.create.descriptor (device-side replace path) | nvvm.tensormap.cp.async.shared + nvvm.tensormap.replace.* | absent from this binary; described here for completeness against upstream MLIR |
| nvgpu.tensormap.update.{global_address,box_dim,element_stride} | nvvm.tensormap.replace.* per field | absent from this binary; same upstream-only status |
| gpu.warp_execute_on_lane_0 (consumed at this stage) | nvvm.shfl.sync + conditional region | routed through the upstream gpu dialect; no nvgpu mnemonic interned |
| nvgpu.warpgroup.descriptor (generate) | integer shl/or chain producing a 64-bit value (see WGMMA descriptor bit layout) | interned (both nvgpu.warpgroup.descriptor and nvgpu.warpgroup.generate.descriptor) |
| nvgpu.warpgroup.mma | nvvm.wgmma.fence.aligned + N× nvvm.wgmma.mma_async + nvvm.wgmma.commit.group.sync.aligned + nvvm.wgmma.wait.group.sync.aligned | interned |
| nvgpu.warpgroup.mma.store | per-thread llvm.store decomposition of the accumulator | interned |
| nvgpu.warpgroup.mma.init.accumulator | llvm.mlir.undef (or zero) accumulator aggregate | interned |
| nvgpu.mma.sync | nvvm.wmma.mma.sync.aligned (sm_70..sm_89) or nvvm.wgmma.mma_async (sm_90+) | interned |
| nvgpu.mma.sp.sync | llvm.inline_asm with mma.sp.sync.aligned.m... template | interned |
| nvgpu.ldmatrix | nvvm.ldmatrix + register repack | interned |
| (no nvgpu.stmatrix mnemonic in this binary; the upstream stmatrix lowering targets nvvm.stmatrix directly) | nvvm.stmatrix (when available) or llvm.inline_asm | NVVM op present, nvgpu wrapper absent |
| nvgpu.rcp | nvvm.rcp.approx.ftz.f family or libdevice call | interned |
| nvgpu.cvt_fpext / nvgpu.cvt_fptrunc | nvvm.cvt.packfloat.f32 family | interned |
| nvgpu.fma.packed.f32x2 / nvgpu.mul.packed.f32x2 | packed nvvm.fma.packed.f32x2 / nvvm.mul.packed.f32x2 | interned |

Patterns are registered at benefit = 1. The 64-bit values consumed by nvgpu.warpgroup.mma's descriptorA / descriptorB operands are the same SMEM descriptor words that flow through cute_nvgpu MMA atoms — the canonical bitfield decode is on the cute_nvgpu MMA atoms page.

Operand and Attribute Tables

The tables below pin every op family in the dialect to its operand list, attribute list, result list, and the NVVM rewrite the conversion pattern emits. SM gating lives in the per-arch availability table further down.

nvgpu.device_async_copy

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | dst | memref<...> in addr-space 3 (shared) | minor dim must be unit-stride |
| operand 1 | src | memref<...> in addr-space 1 (global) | minor dim must be unit-stride |
| operand 2 | dstIndices | variadic index | rank == dst rank |
| operand 3 | srcIndices | variadic index | rank == src rank |
| operand 4 | srcElements | optional index | runtime element count for predicated case |
| attribute | dstElements | i64 (IntegerAttr) | element count per lane; 4, 8, 16 |
| attribute | bypassL1 | optional UnitAttr | selects .cg cache modifier |
| result 0 | token | !nvgpu.device.async.token | passed to commit / wait |

Rewriter emits nvvm.cp.async.shared.global with cp_size = dstElements * eltbytes and cp_modifier = bypassL1 ? cg : ca.
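The two derived attributes can be computed as follows. This is an illustrative Python sketch of the arithmetic stated above, assuming the element width is supplied in bits and is a whole number of bytes; `cp_async_params` is a hypothetical helper name, not an upstream function.

```python
def cp_async_params(dst_elements: int, element_bits: int, bypass_l1: bool):
    """Derive (cp_size, cp_modifier) the way the rewrite rule above describes."""
    if element_bits % 8 != 0:
        # Assumption for this sketch: sub-byte element types are out of scope.
        raise ValueError("element width must be a whole number of bytes")
    cp_size = dst_elements * (element_bits // 8)  # dstElements * eltbytes
    cp_modifier = "cg" if bypass_l1 else "ca"      # bypassL1 ? cg : ca
    return cp_size, cp_modifier


print(cp_async_params(4, 32, True))
print(cp_async_params(8, 16, False))
```

Both calls describe a 16-byte copy; only the cache modifier differs.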

nvgpu.device_async_create_group

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0..N | inputTokens | variadic !nvgpu.device.async.token | tokens to commit as a group |
| result 0 | groupToken | !nvgpu.device.async.token | feeds device_async_wait |

Rewriter emits a single nvvm.cp.async.commit.group. Input tokens are erased; the SSA edge survives only as a happens-before constraint.

nvgpu.device_async_wait

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | asyncDependencies | !nvgpu.device.async.token | group token to wait on |
| attribute | numGroups | optional i32 | passed verbatim as the wait_group immediate |

Rewriter emits nvvm.cp.async.wait.group N where N = numGroups (default 0).

nvgpu.mbarrier.create

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| attribute | numBarriers | i64 | requested mbarrier count in the group |
| result 0 | barriers | !nvgpu.mbarrier.group | wraps the shared-memory slot |

Rewriter emits no nvvm.* op. It generates a memref.global "private" @__mbarrier ... : memref<numBarriers x i64, 3> and a memref.get_global returning a base pointer.

nvgpu.mbarrier.init.shared (alias of mbarrier.init over addr-space 3)

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | barriers | !nvgpu.mbarrier.group | the group from mbarrier.create |
| operand 1 | mbarId | index | barrier index within the group |
| operand 2 | count | index | participant count |
| attribute | (none) | | the address space drives the .shared selector |

Rewriter emits nvvm.mbarrier.init.shared against the GEP-resolved slot address.
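The "GEP-resolved slot address" is an index into the i64 array that mbarrier.create allocates. A minimal sketch of that address arithmetic follows, assuming one 8-byte i64 slot per barrier; the helper name and the bounds check are illustrative additions, not something the pattern itself is documented to do.

```python
MBARRIER_SLOT_BYTES = 8  # each barrier occupies one i64 in shared memory


def mbarrier_slot_offset(mbar_id: int, num_barriers: int) -> int:
    """Byte offset of barrier mbar_id from the group's base pointer."""
    if not 0 <= mbar_id < num_barriers:
        # Illustrative check only; this sketch does not model a runtime guard.
        raise IndexError("mbarId outside the barrier group")
    return mbar_id * MBARRIER_SLOT_BYTES


print(mbarrier_slot_offset(3, 4))
```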

nvgpu.mbarrier.arrive

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | barriers | !nvgpu.mbarrier.group | wraps the shared-memory slot |
| operand 1 | mbarId | index | barrier index within the group |
| result 0 | token | !nvgpu.mbarrier.token | feeds mbarrier.test.wait |

Rewriter emits nvvm.mbarrier.arrive or nvvm.mbarrier.arrive.shared based on the slot's address space.

nvgpu.mbarrier.arrive.expect_tx

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | barriers | !nvgpu.mbarrier.group | mbarrier slot |
| operand 1 | mbarId | index | barrier index within the group |
| operand 2 | txCount | index | expect-tx byte count |

Rewriter emits nvvm.mbarrier.arrive.expect_tx[.shared]. No SSA result; the side effect is on the shared-memory mbarrier slot.

nvgpu.mbarrier.try_wait.parity

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | barriers | !nvgpu.mbarrier.group | mbarrier slot |
| operand 1 | mbarId | index | barrier index within the group |
| operand 2 | phase | index | phase parity (0 or 1) |
| operand 3 | ticks | index | retry budget |

Rewriter emits nvvm.mbarrier.try_wait.parity.shared, which returns an i1 that is polled in a loop.

nvgpu.mbarrier.inval (absent in this binary)

The mnemonic nvgpu.mbarrier.inval is not interned in this tileiras string table; the inval-side intrinsic (nvvm.mbarrier.inval.shared) is still present and is reached through the lower-level NVVM lowering. The operand list below is documented for upstream-MLIR parity and as a reference for reimplementers that choose to surface the wrapper.

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | barriers | !nvgpu.mbarrier.group | mbarrier slot |
| operand 1 | mbarId | index | barrier index within the group |

Upstream rewriter shape: emit nvvm.mbarrier.inval[.shared].

nvgpu.tma.async.load

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | dst | memref<...> in addr-space 3 | TMA destination |
| operand 1 | barrier | !nvgpu.mbarrier.group | arrives expect-tx on completion |
| operand 2 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | from tma.create.descriptor |
| operand 3..7 | coordinates | variadic i32, rank 1..5 | tile origin in tensor space |
| operand 8 | multicastMask | optional i16 | cluster multicast bitmap |
| operand 9 | l2CacheHint | optional i64 | maps to .L2::cache_hint |
| attribute | predicate | optional i1 | gated TMA issue |

Rewriter emits a single nvvm.cp.async.bulk.tensor.shared.cluster.global, the op named in the roster above. See Lowering: TMA Async Load — Operand Mapping for the operand-slot mapping.

nvgpu.tma.async.store

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | src | memref<...> in addr-space 3 | TMA source tile in SMEM |
| operand 1 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | global tensor map |
| operand 2..6 | coordinates | variadic i32, rank 1..5 | tile origin in tensor space |
| operand 7 | l2CacheHint | optional i64 | maps to .L2::cache_hint |
| attribute | predicate | optional i1 | gated TMA issue |

Rewriter emits nvvm.cp.async.bulk.tensor.global.shared.cta, the op named in the roster above. The op takes no barrier operand: the producer side does not wait.

nvgpu.tma.prefetch.descriptor

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | descriptor to prefetch |

Rewriter emits nvvm.prefetch.tensormap [%tmap].

nvgpu.tma.fence.descriptor

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | descriptor being made visible |

Rewriter emits nvvm.fence.proxy.acquire.sync.cluster — the proxy-acquire fence that the WGMMA descriptor consumer needs.

nvgpu.tma.async.reduce (absent in this binary)

The nvgpu.tma.async.reduce mnemonic is not interned in this tileiras build. The underlying NVVM op (nvvm.cp.async.bulk.tensor.reduce) is present and consumed by cute_nvgpu lowerings directly; no nvgpu wrapper surfaces the reduce variant. The operand layout below documents the upstream wrapper for parity.

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | src | memref<...> in addr-space 3 | SMEM source tile |
| operand 1 | tensorMapDescriptor | !nvgpu.tensormap.descriptor | global tensor map |
| operand 2..6 | coordinates | variadic i32, rank 1..5 | tile origin in tensor space |
| operand 7 | l2CacheHint | optional i64 | L2 hint |
| attribute | redop | enum tma_redux_kind | add / min / max / inc / dec / and / or / xor |

Upstream rewriter shape: emit nvvm.cp.async.bulk.tensor.reduce with red_op decoded from the attribute.

nvgpu.tma.create.descriptor

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | tensor | memref<...> | global tensor whose layout the descriptor encodes |
| operand 1..N | boxDimensions | variadic index | TMA tile shape per rank |
| attribute | swizzle | enum tma_swizzle | none / 32B / 64B / 128B |
| attribute | l2Promotion | enum tma_l2_promotion | none / 64B / 128B / 256B |
| attribute | oobFill | enum tma_oob_fill | none / nan |
| attribute | interleave | enum tma_interleave | none / 16B / 32B |
| result 0 | descriptor | !nvgpu.tensormap.descriptor | global-memory pointer to the 128-byte CUtensorMap |

Rewriter emits no nvvm.* op. It allocates a 128-byte CUtensorMap on the host stack via llvm.alloca, fills it through llvm.getelementptr + llvm.store, and calls cuTensorMapEncodeTiled.

nvgpu.tensormap.create.descriptor (device-side replace path; absent in this binary)

This op family is not interned in this tileiras build. The device-side descriptor replace path is reached directly through cute_nvgpu -> nvvm.tensormap.* without going through an nvgpu.tensormap.create.descriptor wrapper. Operand layout documented below for upstream-MLIR parity.

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | dst | !nvgpu.tensormap.descriptor in shared | destination mailbox |
| operand 1 | src | !nvgpu.tensormap.descriptor in global | source descriptor |

Upstream rewriter shape: emit nvvm.tensormap.cp.async.shared followed by a sequence of nvvm.tensormap.replace.* ops.

nvgpu.tensormap.update.global_address / box_dim / element_stride (absent in this binary)

These per-field update wrappers are also not interned in this build. Field-level descriptor updates lower directly through the nvvm.tensormap.replace.* family.

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | descriptor | !nvgpu.tensormap.descriptor in shared | descriptor being edited |
| operand 1 | value | i64 or i32 | new field value |
| attribute | ord | i32 | rank index for box_dim / element_stride |

Upstream rewriter shape: each maps to the matching nvvm.tensormap.replace.* op against the SMEM-resident descriptor.

nvgpu.warpgroup.descriptor (also spelled warpgroup.generate.descriptor)

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | tensor | memref<...> in addr-space 3 | SMEM tile origin |
| attribute | layout | enum (row / col) | matrix layout |
| attribute | swizzle | enum (none / 32B / 64B / 128B) | SMEM swizzle pattern |
| result 0 | descriptor | !nvgpu.warpgroup.descriptor | 64-bit SMEM descriptor |

Rewriter packs the descriptor bits inline. The result is an i64 LLVM value built by an shl/or chain; the bit layout (start_addr[14] | lbo[16] | sbo[16] | base_offset[3] | reserved[3] | swizzle_mode[2] | pad[10]) is documented on the cute_nvgpu MMA atoms page.
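An shl/or chain of this kind can be sketched as below. The field names and widths are taken from the layout string above, but the placement is an assumption: the sketch packs the fields in the listed order starting at the least-significant bit, which this page does not state. Treat the cute_nvgpu MMA atoms page as the authority on actual bit positions; `pack_descriptor` is a hypothetical name.

```python
# (name, width in bits), in the order the layout string lists them.
# ASSUMPTION: the first field sits at bit 0 and later fields stack above it.
FIELDS = [
    ("start_addr", 14),
    ("lbo", 16),
    ("sbo", 16),
    ("base_offset", 3),
    ("reserved", 3),
    ("swizzle_mode", 2),
    ("pad", 10),
]


def pack_descriptor(**values: int) -> int:
    """OR each field into a 64-bit word after shifting it to its slot."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if value < 0 or value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift  # one shl + or step of the chain
        shift += width
    assert shift == 64  # the listed widths sum to exactly 64 bits
    return word


print(hex(pack_descriptor(start_addr=0x40, swizzle_mode=3)))
```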

nvgpu.warpgroup.mma

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | descriptorA | !nvgpu.warpgroup.descriptor | SMEM descriptor for A |
| operand 1 | descriptorB | !nvgpu.warpgroup.descriptor | SMEM descriptor for B |
| operand 2 | matrixC | !nvgpu.warpgroup.accumulator | input accumulator tile |
| attribute | transposeA | optional UnitAttr | wired into the WGMMA layout enum |
| attribute | transposeB | optional UnitAttr | wired into the WGMMA layout enum |
| attribute | waitGroup | optional i32 | controls the wait-group depth |
| result 0 | matrixD | !nvgpu.warpgroup.accumulator | output accumulator tile |

Rewriter expands to the canonical four-op WGMMA sequence: nvvm.wgmma.fence.aligned, one nvvm.wgmma.mma_async per accumulator tile, nvvm.wgmma.commit.group.sync.aligned, then nvvm.wgmma.wait.group.sync.aligned waitGroup. See WGMMA Emission Protocol — The Four-Op Sequence for the timing rules and accumulator lifetime. It validates GMMA layout up front with the canonical "Not a canonical GMMA_MN Layout" wording lifted from CUTLASS's gmma.hpp.
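The shape of that expansion can be written down as a small generator. This Python sketch only lists op names in emission order; it does not model operands, timing, or layout validation. The function name and the default of 0 for `wait_group` are assumptions made for the example.

```python
def wgmma_sequence(num_tiles: int, wait_group: int = 0) -> list:
    """List the ops emitted for one nvgpu.warpgroup.mma, in order."""
    if num_tiles < 1:
        raise ValueError("at least one accumulator tile is required")
    ops = ["nvvm.wgmma.fence.aligned"]                # 1. fence
    ops += ["nvvm.wgmma.mma_async"] * num_tiles       # 2. one MMA per tile
    ops.append("nvvm.wgmma.commit.group.sync.aligned")  # 3. commit
    ops.append(f"nvvm.wgmma.wait.group.sync.aligned {wait_group}")  # 4. wait
    return ops


for op in wgmma_sequence(num_tiles=2, wait_group=1):
    print(op)
```

Only the second stage scales with the tile count; the fence, commit, and wait appear exactly once per source op.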

nvgpu.warpgroup.mma.store

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | matrixD | !nvgpu.warpgroup.accumulator | accumulator to drain |
| operand 1 | dst | memref<...> in addr-space 3 | SMEM destination tile |

Rewriter decomposes the accumulator into per-thread llvm.store operations against the destination memref. No nvvm.* op is emitted.

nvgpu.warpgroup.mma.init.accumulator

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| result 0 | accumulator | !nvgpu.warpgroup.accumulator | zero-valued accumulator |

Rewriter emits llvm.mlir.zero (or llvm.mlir.undef followed by per-field zero stores) producing the accumulator aggregate.

nvgpu.mma.sync

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | matrixA | vector<...> register fragment | A operand fragment |
| operand 1 | matrixB | vector<...> register fragment | B operand fragment |
| operand 2 | matrixC | vector<...> register fragment | accumulator fragment |
| attribute | mmaShape | ArrayAttr<i64> of length 3 | [m, n, k] |
| attribute | tf32Enabled | optional UnitAttr | enables tf32 element-type lowering |
| result 0 | matrixD | vector<...> register fragment | D = A * B + C |

Rewriter emits nvvm.wmma.mma.sync.aligned (Ampere/Ada) or routes through nvvm.wgmma.mma_async (Hopper) based on the active SM.

nvgpu.mma.sp.sync

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | matrixA | vector<...> register fragment | structurally sparse A operand |
| operand 1 | matrixB | vector<...> register fragment | dense B operand |
| operand 2 | matrixC | vector<...> register fragment | accumulator fragment |
| operand 3 | sparseMetadata | vector<2xi16> | sparse selector word |
| operand 4 | sparsitySelector | i32 | 0 or 1; selects which packed pair |
| attribute | mmaShape | ArrayAttr<i64> of length 3 | [m, n, k] |
| result 0 | matrixD | vector<...> register fragment | sparse MMA result |

Rewriter emits llvm.inline_asm with the mma.sp.sync.aligned.m... template; upstream NVVM exposes no sparse-MMA op in the snapshot tileiras tracks.

nvgpu.ldmatrix

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | src | memref<...> in addr-space 3 | SMEM tile origin |
| operand 1..N | indices | variadic index | rank-matched indices |
| attribute | numTiles | i32 | 1, 2, or 4 |
| attribute | transpose | UnitAttr (optional) | selects .trans form |
| result 0 | res | vector<NxNxi32> | repacked register fragment |

Rewriter emits nvvm.ldmatrix.sync.aligned returning an llvm.struct<(i32, i32, ...)>, then a pack-struct-into-vector repack to match the result type.

nvgpu.stmatrix (absent in this binary)

There is no nvgpu.stmatrix mnemonic in this tileiras build's string table. The stmatrix store path is reached from the upstream MLIR vector / nvvm populators directly into nvvm.stmatrix. The operand layout below mirrors the upstream wrapper.

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | dst | memref<...> in addr-space 3 | SMEM destination |
| operand 1..N | indices | variadic index | rank-matched indices |
| operand N+1 | src | vector<NxNxi32> | per-thread fragment |
| attribute | transpose | UnitAttr (optional) | selects .trans form |

Upstream rewriter shape: emit nvvm.stmatrix.sync.aligned on sm_90+ targets, or llvm.inline_asm with the matching stmatrix... template otherwise.

gpu.warp_execute_on_lane_0 (routed through upstream gpu dialect)

There is no nvgpu.warp.execute_on_lane_0 mnemonic; the corresponding op is the upstream gpu dialect's gpu.warp_execute_on_lane_0, which convert-nvgpu-to-nvvm rewrites in passing.

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| region | body | one block | runs on lane 0 only |
| result 0..N | results | any LLVM-typed values | shuffled to every lane after the region |

Rewriter emits a region predicate against lane == 0, runs the body, and broadcasts each result with nvvm.shfl.sync (idx, 0, 0xffffffff).
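The predicate-then-broadcast behaviour can be simulated on the host. The sketch below models a 32-lane warp as a Python list; it illustrates the semantics described above and is not the emitted IR. `warp_execute_on_lane_0` here is a hypothetical Python function that borrows the op's name.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def warp_execute_on_lane_0(body):
    """Run body on lane 0 only, then give every lane lane 0's result."""
    per_lane = [None] * WARP_SIZE
    for lane in range(WARP_SIZE):
        if lane == 0:  # the region predicate: lane == 0
            per_lane[lane] = body()
    # Model of a shuffle in idx mode with source lane 0 and a full mask:
    # every lane reads the value held by lane 0.
    return [per_lane[0] for _ in range(WARP_SIZE)]


calls = []
result = warp_execute_on_lane_0(lambda: calls.append("ran") or 42)
print(len(calls), result[0], result[31])
```

The body runs once, yet lane 31 observes the same value as lane 0, which is the contract the broadcast provides.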

Packed conversion and arithmetic helpers

| Op | Operands | Result | NVVM emission |
| --- | --- | --- | --- |
| nvgpu.rcp | f32 | f32 | nvvm.rcp.approx.ftz.f or libdevice call |
| nvgpu.cvt_fpext | packed i32 of FP4/FP8 | vector<2xf16> / vector<2xf32> | nvvm.cvt.packfloat.f32 family |
| nvgpu.cvt_fptrunc | vector<2xf16> / vector<2xf32> | packed i32 | nvvm.cvt.packfloat.f32 family |
| nvgpu.fma.packed.f32x2 | three vector<2xf32> | vector<2xf32> | nvvm.fma.rn.f32x2 |
| nvgpu.mul.packed.f32x2 | two vector<2xf32> | vector<2xf32> | nvvm.mul.f32x2 |

Each packed op carries a rnd enum (rn, rz, rm, rp) and, where applicable, a packed_kind enum that selects between MXFP / NVFP packing modes.

Lowering-Target Table

What each rewriter emits. The middle column gives the concrete NVVM op (or the expanded form when the pattern bypasses NVVM on purpose); the right column is what the NVPTX backend ultimately prints, not anything nvgpu itself emits.

| nvgpu op | NVVM op (or expansion) | Final PTX (after NVVM lowering) |
| --- | --- | --- |
| nvgpu.device_async_copy | nvvm.cp.async.shared.global | cp.async.{ca,cg}.shared.global [%dst], [%src], N; |
| nvgpu.device_async_create_group | nvvm.cp.async.commit.group | cp.async.commit_group; |
| nvgpu.device_async_wait | nvvm.cp.async.wait.group | cp.async.wait_group N; |
| nvgpu.mbarrier.create | memref.global "private" + memref.get_global | (no PTX; allocates SMEM slot) |
| nvgpu.mbarrier.init.shared | nvvm.mbarrier.init.shared | mbarrier.init.shared.b64 [%mbar], %count; |
| nvgpu.mbarrier.arrive | nvvm.mbarrier.arrive[.shared] | mbarrier.arrive.shared.b64 %tok, [%mbar]; |
| nvgpu.mbarrier.arrive.expect_tx | nvvm.mbarrier.arrive.expect_tx[.shared] | mbarrier.arrive.expect_tx.shared.b64 %tok, [%mbar], %tx; |
| nvgpu.mbarrier.try_wait.parity | nvvm.mbarrier.try_wait.parity.shared | mbarrier.try_wait.parity.shared.b64 %p, [%mbar], %ph, %ns; |
| nvgpu.mbarrier.inval (absent in this binary) | nvvm.mbarrier.inval[.shared] | mbarrier.inval.shared.b64 [%mbar]; |
| nvgpu.tma.async.load | nvvm.cp.async.bulk.tensor.shared.cluster.global | cp.async.bulk.tensor.{1..5}d.shared::cluster.global.tile.mbarrier::complete_tx::bytes [%dst], [%tmap, {%c0,..}], [%mbar]; |
| nvgpu.tma.async.store | nvvm.cp.async.bulk.tensor.global.shared.cta | cp.async.bulk.tensor.{1..5}d.global.shared::cta.tile [%tmap, {%c0,..}], [%src]; |
| nvgpu.tma.async.reduce (absent in this binary) | nvvm.cp.async.bulk.tensor.reduce | cp.async.bulk.tensor.{1..5}d.global.shared::cta.tile.{redop}.bulk_group [%tmap, {%c0,..}], [%src]; |
| nvgpu.tma.prefetch.descriptor | nvvm.prefetch.tensormap | prefetch.tensormap [%tmap]; |
| nvgpu.tma.fence.descriptor | nvvm.fence.proxy.acquire.sync.cluster | fence.proxy.async.shared::cluster; |
| nvgpu.tma.create.descriptor | llvm.alloca + GEP/store sequence + llvm.call @cuTensorMapEncodeTiled | (no PTX; host-side encode of a 128-byte CUtensorMap) |
| nvgpu.tensormap.create.descriptor (absent in this binary) | nvvm.tensormap.cp.async.shared + tensormap.replace.* | tensormap.cp.async.shared::cta.bulk_group [%dst], [%src]; then tensormap.replace.tile.{global_address,box_dim,elem_stride,...}.[%tmap], ...; |
| gpu.warp_execute_on_lane_0 (upstream gpu dialect; no nvgpu.warp.execute_on_lane_0 mnemonic) | nvvm.shfl.sync + conditional region | shfl.sync.idx.b32 %r, %v, 0, 0x1f, 0xffffffff; |
| nvgpu.warpgroup.descriptor | integer shl/or chain; no NVVM op | (no PTX; the 64-bit SMEM descriptor is built by ordinary integer ops; the PTX side sees the materialised b64 value) |
| nvgpu.warpgroup.mma | nvvm.wgmma.fence.aligned → N× nvvm.wgmma.mma_async → nvvm.wgmma.commit.group.sync.aligned → nvvm.wgmma.wait.group.sync.aligned | wgmma.fence.sync.aligned; then wgmma.mma_async.sync.aligned.m64nXkY.f32.{f16,bf16,e4m3,e5m2}.{f16,bf16,e4m3,e5m2} {...}, %da, %db, p, 1, 1, %la, %lb; then wgmma.commit_group.sync.aligned; then wgmma.wait_group.sync.aligned N; |
| nvgpu.warpgroup.mma.store | per-thread llvm.store decomposition | st.shared.b32 [%dst+off], %r; per fragment lane |
| nvgpu.mma.sync | nvvm.wmma.mma.sync.aligned (sm_70..sm_89) or nvvm.wgmma.mma_async (sm_90+) | mma.sync.aligned.m16n8kK.{row,col}.{row,col}.{...} {...}, %a, %b, %c; |
| nvgpu.mma.sp.sync | llvm.inline_asm with mma.sp.sync.aligned.m... template | mma.sp.sync.aligned.m16n8k{16,32}.row.col.{f16,bf16,...} {...}, %a, %b, %c, %meta, 0x0; |
| nvgpu.ldmatrix | nvvm.ldmatrix.sync.aligned + repack | ldmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.b16 {...}, [%addr]; |
| nvgpu.stmatrix (absent in this binary; upstream wrapper shape) | nvvm.stmatrix.sync.aligned or inline asm | stmatrix.sync.aligned.m8n8.x{1,2,4}{.trans,}.shared::cta.b16 [%addr], {...}; |
| nvgpu.cvt_fpext / nvgpu.cvt_fptrunc | nvvm.cvt.packfloat.f32.* | cvt.{rn,rz,...}.{f16,bf16,e4m3,e5m2}.f32 %r, %f; (per lane) |
| nvgpu.fma.packed.f32x2 | nvvm.fma.rn.f32x2 | fma.rn.f32x2 %r, %a, %b, %c; |

The sparse-MMA path reaches PTX through llvm.inline_asm because the snapshot's upstream NVVM does not yet expose a sparse-MMA op. The template, constraint string, and result type live in the pattern body and drop verbatim into the LLVM module — see Inline-PTX templates and constraint strings on the NVVM overview for the constraint-string form.

Per-Arch Availability

convert-nvgpu-to-nvvm runs unconditionally on every target — the gates live inside the patterns and in NVVM verification, not in pass scheduling. The first column gives the lowest SM that accepts each pattern, the second the form it emits at that floor, the third the lowest PTX ISA version that defines the resulting instruction.

| nvgpu op | SM floor | Emits at floor | ptx_min |
| --- | --- | --- | --- |
| nvgpu.device_async_copy | sm_80 | cp.async.{ca,cg}.shared.global | 7.0 |
| nvgpu.device_async_create_group | sm_80 | cp.async.commit_group | 7.0 |
| nvgpu.device_async_wait | sm_80 | cp.async.wait_group | 7.0 |
| nvgpu.mbarrier.{create,init,arrive,try_wait.parity,inval} | sm_80 | shared-memory mbarrier | 7.0 (base set on 7.0; cluster-aware forms 7.8) |
| nvgpu.mbarrier.arrive.expect_tx | sm_90 | mbarrier.arrive.expect_tx.shared.b64 | 7.8 |
| nvgpu.tma.async.{load,store} | sm_90 | cp.async.bulk.tensor.{Nd,shared,global} | 8.0 |
| nvgpu.tma.async.reduce (absent in this binary) | sm_90 | cp.async.bulk.tensor.reduce | 8.0 |
| nvgpu.tma.prefetch.descriptor | sm_90 | prefetch.tensormap | 8.0 |
| nvgpu.tma.fence.descriptor | sm_90 | fence.proxy.async.shared::cluster | 8.0 |
| nvgpu.tma.create.descriptor | sm_90 | runtime call to cuTensorMapEncodeTiled | (host) |
| nvgpu.tensormap.create.descriptor (absent in this binary) | sm_90 | tensormap.cp.async.shared + tensormap.replace.* | 8.3 |
| nvgpu.tensormap.update.* (absent in this binary) | sm_90 | tensormap.replace.* | 8.3 |
| nvgpu.warpgroup.mma | sm_90a | wgmma.mma_async.sync.aligned.m64nXkY.* | 8.0 |
| nvgpu.warpgroup.mma.store | sm_90a | per-thread st.shared.* | 8.0 |
| nvgpu.warpgroup.mma.init.accumulator | sm_90a | llvm.mlir.zero (no PTX) | 8.0 |
| nvgpu.warpgroup.descriptor | sm_90a | (no PTX; SMEM descriptor synthesis) | n/a |
| nvgpu.mma.sync (Ampere/Ada path) | sm_80 | mma.sync.aligned.m16n8k{16,32}.* | 7.0 |
| nvgpu.mma.sync (Hopper path) | sm_90 | redirects through nvvm.wgmma.mma_async.* | 8.0 |
| nvgpu.mma.sp.sync | sm_80 | inline mma.sp.sync.aligned.m16n8k{16,32}.* | 7.1 |
| nvgpu.ldmatrix | sm_75 | ldmatrix.sync.aligned.m8n8.x{1,2,4} | 6.5 |
| nvgpu.stmatrix (absent in this binary) | sm_90 | stmatrix.sync.aligned.m8n8.x{1,2,4} | 8.0 |
| gpu.warp_execute_on_lane_0 (no nvgpu.warp.execute_on_lane_0 mnemonic) | sm_70 | shfl.sync.idx.b32 + region predicate | 6.0 |
| nvgpu.cvt_fpext / nvgpu.cvt_fptrunc (FP4 / FP8) | sm_89 (FP8) / sm_100a (FP4) | cvt.{rn,rz,...}.{e4m3,e5m2}.f32 | 7.8 / 8.6 |
| nvgpu.fma.packed.f32x2 / nvgpu.mul.packed.f32x2 | sm_100a | fma.rn.f32x2 / mul.rn.f32x2 | 8.6 |
| nvgpu.rcp | sm_70 | rcp.approx.ftz.f32 | 6.0 |

sm_90a is the architecture-qualified variant that the wgmma rows require; plain sm_90 rejects them at NVVM verification. The TMA rows are listed at plain sm_90. Apart from the FP4 conversion and packed f32x2 helpers listed at sm_100a, the dialect has no Blackwell-specific op of its own: the tcgen05 surface lives entirely in nvvm, accessed through cute_nvgpu atoms that lower past nvgpu. See Per-SM Emission Templates for the per-tier capability matrix.
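A gate over the SM floor column can be sketched as a lookup. The rows below are copied from the availability table; the matching rule is an assumption of this sketch, not something the page specifies: a plain floor such as sm_80 admits any target at or above that number, while an architecture-qualified floor such as sm_90a admits only that exact qualified target. `SM_FLOOR`, `parse_sm`, and `is_available` are hypothetical names.

```python
import re

# A few rows from the SM floor column of the table above.
SM_FLOOR = {
    "nvgpu.ldmatrix": "sm_75",
    "nvgpu.device_async_copy": "sm_80",
    "nvgpu.mma.sp.sync": "sm_80",
    "nvgpu.warpgroup.mma": "sm_90a",
}


def parse_sm(name: str):
    """Split 'sm_90a' into (90, True) and 'sm_80' into (80, False)."""
    match = re.fullmatch(r"sm_(\d+)(a?)", name)
    if match is None:
        raise ValueError(f"unrecognised target {name!r}")
    return int(match.group(1)), match.group(2) == "a"


def is_available(op: str, target: str) -> bool:
    floor_num, floor_qualified = parse_sm(SM_FLOOR[op])
    target_num, target_qualified = parse_sm(target)
    if floor_qualified:
        # ASSUMPTION: arch-qualified features need the same qualified target.
        return target_qualified and target_num == floor_num
    return target_num >= floor_num


print(is_available("nvgpu.warpgroup.mma", "sm_90"))
print(is_available("nvgpu.warpgroup.mma", "sm_90a"))
```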

Pattern-Set Construction

populateNVGPUToNVVMConversionPatterns is a flat populator: one OpConversionPattern per nvgpu.* op, each registered with benefit = 1. The patterns are stateless — they read operands and attributes through their OpAdaptor, emit a fixed sequence of nvvm.* (or llvm.* / memref.*) ops, and replace the root.
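The flat, stateless shape of the populator can be modelled with a dictionary of callbacks. This is a Python analogy, not the MLIR C++ API: `populate_patterns` and `convert` are hypothetical stand-ins, ops are represented by their names only, and the three registered rewrites are taken from the roster above.

```python
def populate_patterns(patterns: dict) -> None:
    """Register one stateless rewrite per op name."""
    patterns["nvgpu.device_async_create_group"] = (
        lambda op: ["nvvm.cp.async.commit.group"])
    patterns["nvgpu.tma.prefetch.descriptor"] = (
        lambda op: ["nvvm.prefetch.tensormap"])
    # A pattern that emits no nvvm.* op at all.
    patterns["nvgpu.mbarrier.create"] = (
        lambda op: ["memref.global", "memref.get_global"])


def convert(ops: list, patterns: dict) -> list:
    """Single sweep: replace each matched op with its fixed expansion."""
    lowered = []
    for op in ops:
        rewrite = patterns.get(op)
        lowered.extend(rewrite(op) if rewrite else [op])  # unmatched ops pass through
    return lowered


patterns = {}
populate_patterns(patterns)
print(convert(["nvgpu.mbarrier.create", "arith.addi"], patterns))
```

Because each callback depends only on the op it receives, registration order does not matter, which is what makes a single benefit value workable in this model.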

Tileiras consumes this populator unchanged from upstream MLIR. Reimplementations should match the same one-pattern-per-op shape; the rewriter's branch on source memory space is the only piece of policy the patterns carry. See Lowering: nvgpu / gpu to NVVM — Pattern Shapes for the rewrite primitives the patterns share.

Lowering Contract

The conversion never reinfers layout intent. By the time IR reaches nvgpu, descriptor shape, memory space, vector shape, MMA tile shape, sparse metadata, and barrier identity already live in operands and attributes. Pattern bodies stay small as a result.

The mbarrier family branches on memory space and emits one nvvm.mbarrier.*[.shared] intrinsic per op. See mbarrier State Machine for the slot transitions and NVVM mbarrier Ops for the per-op intrinsic mapping. TMA load and store each emit a single nvvm.cp.async.bulk.tensor.* intrinsic, threading the variadic coordinates, multicast mask, and L2 cache hint through unchanged. The largest pattern is nvgpu.warpgroup.mma: it emits the four-stage Hopper WGMMA sequence — fence, async MMA, commit, wait — and validates GMMA layout up front with the canonical "Not a canonical GMMA_MN Layout" wording lifted from CUTLASS's gmma.hpp.

A handful of patterns emit no nvvm.* op at all. nvgpu.mbarrier.create emits a memref.global with "private" visibility plus a memref.get_global, allocating the __mbarrier slot in shared memory. nvgpu.tma.create.descriptor emits an llvm.alloca for a 128-byte CUtensorMap, fills it via llvm.getelementptr+llvm.store sequences, then calls the CUDA driver's cuTensorMapEncodeTiled. nvgpu.warpgroup.descriptor is a pure shl/or chain over the WGMMA descriptor bitfield. nvgpu.mma.sp.sync emits an llvm.inline_asm with the verbatim "mma.sp.sync.aligned.m..." PTX template; at the snapshot revision tileiras tracks, upstream NVVM has no sparse-MMA op yet, and inline-asm is the upstream design.

Verification Invariants

The interesting nvgpu verifier checks are semantic, not lexical. TMA ops demand valid descriptor types, compatible source or destination memrefs, supported tensor-map ranks, and a legal shared-memory layout. WGMMA demands rank-2 matrix fragments, compatible M/N/K, a supported tile shape, matching accumulator and result types, and legal transpose flags. MMA and sparse MMA add element-type checks, sparse-selector bounds, and a guard that tf32 only pairs with valid floating-point operands. Device async copy requires matching element types, unit-stride minor dimensions, supported transfer sizes, and correct alignment when L1 bypass is requested.
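As one concrete instance, the device async copy checks can be sketched as a function that collects diagnostics. This is an illustration of the listed invariants, not the upstream verifier: the function name and message strings are invented, element types are compared as plain strings, the alignment rule for L1 bypass is omitted, and the set of legal sizes assumes the 4, 8, and 16 byte transfers that PTX cp.async accepts.

```python
SUPPORTED_COPY_BYTES = {4, 8, 16}  # assumed legal cp.async transfer sizes


def verify_device_async_copy(src_type: str, dst_type: str, element_bits: int,
                             dst_elements: int, src_minor_stride: int = 1,
                             dst_minor_stride: int = 1) -> list:
    """Return a list of diagnostics; an empty list means the op is legal."""
    errors = []
    if src_type != dst_type:
        errors.append("source and destination element types differ")
    if src_minor_stride != 1 or dst_minor_stride != 1:
        errors.append("minor dimension must be unit-stride")
    total_bits = dst_elements * element_bits
    if total_bits % 8 != 0 or total_bits // 8 not in SUPPORTED_COPY_BYTES:
        errors.append("unsupported transfer size")
    return errors


print(verify_device_async_copy("f16", "f16", 16, 8))
print(verify_device_async_copy("f32", "f32", 32, 3))
```

Rejecting the 12-byte copy here, before conversion, is the kind of early diagnostic the boundary is meant to provide.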

The boundary matters because NVVM conversion assumes the op is already legal for the selected target. Invalid shapes slipping through here resurface later as much less useful intrinsic-selection or backend diagnostics.

Reimplementation Checklist

A practical reimplementation needs the operation families above, typed descriptor and barrier values, shape-aware verifiers, and a deterministic conversion table to NVVM. Keep the layer transient. Independent scheduling, high-level layout algebra, and CUDA Tile semantics all belong above nvgpu. The dialect's job is to normalise hardware operations, verify their low-level shape contracts, and hand them to NVVM with as little policy as possible.

The minimum useful surface: tensor-map descriptor creation and async TMA load/store/reduce; shared-memory barrier groups and barrier tokens with expect_tx and try_wait.parity; WGMMA accumulator init / mma / store; the WGMMA descriptor packer; MMA ops with explicit shape attributes; sparse MMA metadata and selector validation; ldmatrix and stmatrix; SM80 cp.async device-async copy; packed conversion and arithmetic helpers; a complete nvgpu-to-nvvm conversion table; and target-aware verification before conversion.