NVVM Cluster Ops

Abstract

nvvm.cluster.* and the adjacent cluster-aware helpers cover Hopper's thread-block-cluster surface: a small group of CTAs running on neighbouring SMs that share a logical cluster-wide barrier and a mapa-addressable view of their peer CTAs' shared memory. The ops in this family handle cluster-wide arrival, wait, and rank queries; they pair with mbarrier ops in nvvm.mbarrier.* for the data-side handshake. See Cluster Sync and DSMEM Handshake for the cross-CTA protocol and Cluster Sync Emission for the codegen side.

Blackwell (sm_100+) keeps the cluster surface; the same op set is the access path on every sm_90+ target.

Op Roster

| Op | Role |
| --- | --- |
| nvvm.cluster.arrive | arrive at the cluster-wide barrier (release semantics; the matching cluster.wait acquires) |
| nvvm.cluster.arrive.relaxed | relaxed-memory variant of cluster.arrive |
| nvvm.cluster.wait | wait for every CTA in the cluster to arrive |
| nvvm.mapa | translate a peer-CTA SMEM pointer to a cluster-mapped address |
| nvvm.read.ptx.sreg.clusterid.x / .y / .z | read this cluster's index within the grid, per dimension |
| nvvm.read.ptx.sreg.nclusterid.x / .y / .z | read the number of clusters in the grid, per dimension |
| nvvm.read.ptx.sreg.cluster.ctarank | per-CTA rank within the cluster |
| nvvm.read.ptx.sreg.cluster.nctarank | total CTAs in the cluster |
| nvvm.barrier.cluster.arrive / .wait (alias spellings used by gpu.barrier lowering) | same ops, different mnemonic |

The cluster rank reads sit alongside the special-register family; the dialect exposes them under both nvvm.read.ptx.sreg.* and the cluster-specific names so kernels written against either spelling round-trip.

Operand Tables

nvvm.cluster.arrive / nvvm.cluster.arrive.relaxed / nvvm.cluster.wait

No operands and no result. Each lowers to a single PTX barrier.cluster.* instruction.
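The split between arrive and wait is easiest to see in a host-side toy model. The sketch below is plain Python, not NVVM or PTX: `ClusterBarrier`, `run_cluster`, and the phase token returned by `arrive()` are hypothetical names introduced for illustration (the PTX `barrier.cluster.wait` takes no token). It assumes each simulated CTA arrives exactly once per phase.

```python
import threading


class ClusterBarrier:
    """Toy split-phase barrier: arrive() signals, wait() blocks until all CTAs arrived."""

    def __init__(self, num_ctas):
        self.num_ctas = num_ctas
        self._cond = threading.Condition()
        self._arrived = 0
        self._phase = 0

    def arrive(self):
        # Models cluster.arrive: record the arrival and return immediately.
        with self._cond:
            phase = self._phase
            self._arrived += 1
            if self._arrived == self.num_ctas:
                # Last arrival completes the phase and releases all waiters.
                self._arrived = 0
                self._phase += 1
                self._cond.notify_all()
            return phase

    def wait(self, phase):
        # Models cluster.wait: block until the phase we arrived in has completed.
        with self._cond:
            self._cond.wait_for(lambda: self._phase > phase)


def run_cluster(num_ctas=4):
    barrier = ClusterBarrier(num_ctas)
    slots = [None] * num_ctas   # one published value per simulated CTA
    seen = [None] * num_ctas    # what each CTA observes after the wait

    def cta(rank):
        slots[rank] = rank * 10      # publish before arriving
        phase = barrier.arrive()     # signal arrival without blocking
        # Independent work could overlap here, between arrive and wait.
        barrier.wait(phase)          # block until every CTA has arrived
        seen[rank] = list(slots)     # all peers' writes are now visible

    threads = [threading.Thread(target=cta, args=(r,)) for r in range(num_ctas)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Calling `run_cluster(4)` returns one snapshot per simulated CTA. Because every write precedes its own arrival and every read follows the wait, each snapshot contains all four published values.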

nvvm.mapa

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| operand 0 | addr | ptr addrspace(3) | local-CTA SMEM pointer |
| operand 1 | ctaRank | i32 | peer CTA index within the cluster |
| result 0 | mapped | ptr addrspace(3) | cluster-mapped address that aliases peer-CTA SMEM |

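The mapped pointer is dereferenceable by shared-memory loads and stores and behaves as a view into the peer CTA's slot. At the PTX level these accesses need the ld.shared::cluster / st.shared::cluster forms, because an unqualified ld.shared / st.shared defaults to the CTA-local shared::cta window.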
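The aliasing behaviour of the mapped pointer can be pictured with a small host-side model. This is a sketch under stated assumptions, not the hardware address layout: `ToyCluster`, `ClusterAddr`, and `ring_exchange` are hypothetical names, and a cluster-mapped address is modelled as a (rank, offset) pair rather than a real 64-bit value. Where the hardware gives undefined behaviour for an out-of-range `ctaRank`, the model raises instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterAddr:
    """Hypothetical stand-in for a cluster-mapped address."""
    cta_rank: int
    offset: int


class ToyCluster:
    """One list of words per CTA, standing in for each CTA's shared memory."""

    def __init__(self, num_ctas, smem_words):
        self.smem = [[0] * smem_words for _ in range(num_ctas)]

    def mapa(self, offset, cta_rank):
        # Translate a local SMEM offset into a view of the same offset in a peer CTA.
        if not 0 <= cta_rank < len(self.smem):
            raise ValueError(f"ctaRank {cta_rank} outside [0, {len(self.smem)})")
        return ClusterAddr(cta_rank, offset)

    def store(self, addr, value):
        self.smem[addr.cta_rank][addr.offset] = value

    def load(self, addr):
        return self.smem[addr.cta_rank][addr.offset]


def ring_exchange(num_ctas=4):
    """Each CTA writes its own rank into slot 0 of its right-hand neighbour."""
    cluster = ToyCluster(num_ctas, smem_words=1)
    for rank in range(num_ctas):
        peer = (rank + 1) % num_ctas
        remote = cluster.mapa(0, peer)   # same offset, peer CTA's memory
        cluster.store(remote, rank)
    # Read back every CTA's slot 0 through its own mapped address.
    return [cluster.load(cluster.mapa(0, r)) for r in range(num_ctas)]
```

In a real kernel the stores would have to be ordered against the peers' reads with the cluster barrier or an mbarrier handshake; the model runs sequentially, so it omits that step.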

nvvm.read.ptx.sreg.clusterid.{x,y,z} and family

| Position | Name | Type | Notes |
| --- | --- | --- | --- |
| result 0 | r | i32 | the requested cluster coordinate |

LLVM Intrinsic Mapping

| Op | LLVM intrinsic |
| --- | --- |
| nvvm.cluster.arrive | llvm.nvvm.barrier.cluster.arrive |
| nvvm.cluster.arrive.relaxed | llvm.nvvm.barrier.cluster.arrive.relaxed |
| nvvm.cluster.wait | llvm.nvvm.barrier.cluster.wait |
| nvvm.mapa | llvm.nvvm.mapa.shared.cluster.i64 |
| nvvm.read.ptx.sreg.clusterid.x | llvm.nvvm.read.ptx.sreg.clusterid.x |
| nvvm.read.ptx.sreg.cluster.ctarank | llvm.nvvm.read.ptx.sreg.cluster.ctarank |
| nvvm.read.ptx.sreg.cluster.nctarank | llvm.nvvm.read.ptx.sreg.cluster.nctarank |

PTX Templates

barrier.cluster.arrive;
barrier.cluster.arrive.relaxed;
barrier.cluster.wait;

mapa.shared::cluster.u64 %r, %addr, %cta_rank;

mov.u32 %r, %clusterid.x;
mov.u32 %r, %clusterid.y;
mov.u32 %r, %clusterid.z;
mov.u32 %r, %nclusterid.x;
mov.u32 %r, %nclusterid.y;
mov.u32 %r, %nclusterid.z;
mov.u32 %r, %cluster_ctarank;
mov.u32 %r, %cluster_nctarank;

mapa accepts a 64-bit shared-cluster address. The u64 variant is the only one the dialect emits, even when the result is a 32-bit pointer in source code; LLVM widens at type-conversion time.
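The templates above follow a regular pattern that a short lookup can reproduce. The sketch below is illustrative only and is not the dialect's emitter: `ptx_for`, `BARRIER_PTX`, and `SREG_PREFIX` are hypothetical names, and the only rule it encodes beyond the templates themselves is that the `cluster.ctarank` / `cluster.nctarank` registers are spelled with an underscore in PTX.

```python
SREG_PREFIX = "nvvm.read.ptx.sreg."

# Barrier ops have fixed, operand-free PTX text.
BARRIER_PTX = {
    "nvvm.cluster.arrive": "barrier.cluster.arrive;",
    "nvvm.cluster.arrive.relaxed": "barrier.cluster.arrive.relaxed;",
    "nvvm.cluster.wait": "barrier.cluster.wait;",
}


def ptx_for(op, dst="%r", args=()):
    """Return the PTX template line for an op name, using the templates listed above."""
    if op in BARRIER_PTX:
        return BARRIER_PTX[op]
    if op == "nvvm.mapa":
        addr, cta_rank = args
        # Only the u64 form is emitted.
        return f"mapa.shared::cluster.u64 {dst}, {addr}, {cta_rank};"
    if op.startswith(SREG_PREFIX):
        sreg = op[len(SREG_PREFIX):]
        # cluster.ctarank -> cluster_ctarank; clusterid.x keeps its dot.
        if sreg.startswith("cluster."):
            sreg = "cluster_" + sreg[len("cluster."):]
        return f"mov.u32 {dst}, %{sreg};"
    raise KeyError(f"no PTX template for {op}")
```

For example, `ptx_for("nvvm.mapa", args=("%addr", "%cta_rank"))` reproduces the mapa line in the list above.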

Per-Arch Availability

| Op family | SM floor | ptx_min |
| --- | --- | --- |
| cluster.arrive / wait | sm_90 | 8.0 |
| cluster.arrive.relaxed | sm_90 | 8.1 |
| mapa | sm_90 | 8.0 |
| clusterid / nclusterid reads | sm_90 | 8.0 |
| cluster.ctarank / nctarank | sm_90 | 8.0 |

The relaxed-memory variant of cluster.arrive is the only op in the family that requires ptx 8.1; everything else is legal on 8.0.
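The availability table translates directly into a target-legality check. The following is a minimal sketch, assuming the floors listed above; `FLOORS` and `is_legal` are hypothetical names, and PTX versions are held as (major, minor) tuples so that a version such as 8.10 would not compare lower than 8.9, as it would as a float.

```python
# op family -> (minimum SM, minimum PTX version as (major, minor))
FLOORS = {
    "cluster.arrive": (90, (8, 0)),
    "cluster.wait": (90, (8, 0)),
    "cluster.arrive.relaxed": (90, (8, 1)),
    "mapa": (90, (8, 0)),
    "clusterid": (90, (8, 0)),
    "nclusterid": (90, (8, 0)),
    "cluster.ctarank": (90, (8, 0)),
    "cluster.nctarank": (90, (8, 0)),
}


def is_legal(op, sm, ptx):
    """True if `op` is available on target sm_<sm> with PTX version `ptx`."""
    sm_floor, ptx_floor = FLOORS[op]
    return sm >= sm_floor and ptx >= ptx_floor
```

With these entries, `is_legal("cluster.arrive.relaxed", 90, (8, 0))` is false while the same query for `cluster.arrive` is true, which matches the one exception called out above.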

Verifier Invariants

  • mapa requires the operand pointer in addr-space 3; generic pointers are rejected.
  • ctaRank is a 32-bit unsigned value; values outside [0, nctarank) cause undefined behaviour at runtime, but the verifier does not reject them.
  • The cluster barrier ops (cluster.arrive, cluster.arrive.relaxed, cluster.wait) carry no operands and no result; verification rejects any attempt to attach attributes other than location info.
  • cluster.arrive and cluster.wait must appear in pairs across cooperating CTAs; the verifier cannot prove pairing, but it rejects clearly unpaired uses inside non-cluster kernels (no cluster attribute on the parent gpu.module).
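The invariants above can be sketched as a small checker over a simplified op record. This is a toy model rather than the dialect's verifier: `Op`, `verify`, and every field name are hypothetical, location info is assumed to live outside `attrs`, and applying the non-cluster-kernel rule to `cluster.arrive.relaxed` as well is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

BARRIER_OPS = {
    "nvvm.cluster.arrive",
    "nvvm.cluster.arrive.relaxed",
    "nvvm.cluster.wait",
}


@dataclass
class Op:
    name: str
    num_operands: int = 0
    num_results: int = 0
    ptr_addrspace: Optional[int] = None      # address space of operand 0, if a pointer
    attrs: dict = field(default_factory=dict)  # excludes location info
    parent_has_cluster_attr: bool = True     # cluster attribute on the parent gpu.module


def verify(op):
    """Return a list of error strings; an empty list means the op passes."""
    errors = []
    if op.name == "nvvm.mapa":
        if op.ptr_addrspace != 3:
            errors.append("mapa: operand 0 must be a pointer in addrspace(3)")
        # The ctaRank range is deliberately not checked: out-of-range is runtime UB.
    elif op.name in BARRIER_OPS:
        if op.num_operands or op.num_results:
            errors.append(f"{op.name}: takes no operands and produces no result")
        if op.attrs:
            errors.append(f"{op.name}: only location info may be attached")
        # Assumption: the relaxed variant is gated the same way as arrive/wait.
        if not op.parent_has_cluster_attr:
            errors.append(f"{op.name}: used inside a non-cluster kernel")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every violated invariant for an op at once.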