Atomic, Warp, Sreg, Fence Emission
Abstract
Four PTX synchronization and communication families share the NVPTX backend's final printer: atomic read-modify-write and reductions, warp-level collectives, special-register readers, and the fence/mbarrier/proxy-fence family. They enter code generation through different IR layers and selector dispatch arms, then converge on the same emitter.
The contract is modifier construction in a fixed order. Atomics and fences carry memory ordering and scope. Warp collectives carry a small kind enum that picks a PTX template. Special-register readers map a typed NVVM op to a registered PTX special register and route through a fast path for thread and CTA coordinates. Mbarriers and proxy fences carry operation-specific operands but reuse the same scope vocabulary, so one ordering/scope packing function services every family.
Atomic and Reduction Family
Atomic RMW lowering builds a modifier record. The printer emits modifiers in a fixed order: cluster-tail, scope, ordering, operation, cache hint, type. Reductions reuse the same scope and ordering vocabulary but support a smaller ordering set at the PTX level.
| Family / op | Orderings | Scope set | Type set |
|---|---|---|---|
red.cta / red.gpu / red.sys / red.cluster | relaxed default, release | cta/gpu/sys/cluster | b32/b64/u32/u64/s32/s64/f32/f64/f16/f16x2/bf16/bf16x2 |
atom.cas{.b16,.b32,.b64,.b128} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster/cta::cluster | b16, b32, b64, b128 |
atom.exch.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster/cta::cluster | b32, b64 |
atom.add | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster/cta::cluster | u32, u64, f32, f64, f16/bf16 packed forms |
atom.and.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | b32, b64 |
atom.or.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | b32, b64 |
atom.xor.{b32,b64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | b32, b64 |
atom.min.{s32,s64,u32,u64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | s32, s64, u32, u64 |
atom.max.{s32,s64,u32,u64} | relaxed/acquire/release/acq_rel/seq_cst | cta/gpu/sys/cluster | s32, s64, u32, u64 |
Some red.gpu.global.add.* forms emit as inline PTX templates rather than
through the generic modifier printer. Invalid combinations get specific
diagnostics: unsupported ordering for nvvm.atomic.rmw,
Invalid memory model ordering for nvvm.red,
Invalid reduction op for nvvm.red, Invalid reduction type for nvvm.red.
The printer concatenates tokens in a fixed order so a reimplementation can read tokens off a modifier word without re-sorting. The order, from left to right, is: opcode stem, memory ordering, scope, operation suffix, address-space suffix, optional cache hint, type suffix, then the operand list. Each token comes from a small enum table; an absent enum value (default order or implicit scope) emits nothing rather than a placeholder dash.
For an atomic add on shared memory with relaxed memory order and CTA scope, the printer reads op = ADD, order = RELAXED, scope = CTA, addrspace = SHARED, type = U32 from the operand record and emits:
atom.relaxed.cta.add.u32.shared %r0, [%r1], %r2;
The token order is atom (stem) → .relaxed (order) → .cta (scope) → .add (operation) → .u32 (type) → .shared (address space) → operands. A few token slots accept compound forms: scope can be .cta::cluster when the cluster-tail bit is set, the cache-hint slot expands to .L2::cache_hint and adds a cache-policy operand, and the type slot can take packed widths like .f16x2 or .bf16x2.
Reductions reuse the same order without a return register:
red.gpu.add.f32.global [%rd0], %f1;
Atomic compare-and-swap doubles the operand count but keeps the same token order:
atom.acquire.gpu.cas.b64.global %rd0, [%rd1], %rd2, %rd3;
Lowering rejects illegal order/scope pairs before the printer fires, so the token-emission step never has to recover from an invalid modifier word. The invariant a reimplementation must preserve: every order/scope/space combination that lowering accepts is also accepted by ptxas on the current target. The verifier in ISelDAG and MatcherTable — Subtarget Feature Model shares this contract through the same subtarget feature bitmap.
Warp-Level Collectives
Four MLIR NVVM ops model warp-level collectives: nvvm.redux.sync,
nvvm.shfl.sync, nvvm.vote.sync, nvvm.match.sync. Each carries a
compact kind enum that selects the PTX template.
| NVVM op | Kind enum | PTX template family | Verifier / constraint |
|---|---|---|---|
nvvm.redux.sync | add, umin, umax, min, max, and, or, xor, fmin, fmax, fminabsnan, fmaxabsnan | redux.sync.* on 32-bit values | must run uniformly over the entire subgroup |
nvvm.shfl.sync | bfly, up, down, idx | shfl.sync.{bfly,up,down,idx}.b32 | optional validity predicate result |
nvvm.vote.sync | any, all, uni, ballot | vote.sync.{any,all,uni}.pred or vote.sync.ballot.b32 | ballot returns i32; others return i1 |
nvvm.match.sync | any, all | match.sync.{any,all}.{b32,b64} | any returns i32; all returns {i32, i1} |
redux.sync is feature-gated. Integer reductions require the redux-capable
path; floating redux appears only on newer targets. bar.warp.sync belongs
to the same warp-level family and emits bar.warp.sync mask.
The selector dispatches each warp collective by intrinsic-ID plus operand types. The intrinsic ID picks the family (redux, shfl, vote, match); the kind enum on the SDNode picks the operation within the family; and the operand element type picks the PTX type suffix. Four representative emissions:
redux.sync.add.s32 %r0, %r1, 0xFFFFFFFF; // signed-int reduction over the full warp
shfl.sync.bfly.b32 %r0|%p0, %r1, 0x10, 0x1F, 0xFFFFFFFF;
vote.sync.ballot.b32 %r0, %p1, 0xFFFFFFFF; // ballot returns i32
match.sync.any.b32 %r0, %r1, 0xFFFFFFFF; // match.any returns i32
The vote.sync.{any, all, uni} variants return a pred rather than b32; match.sync.all.b32 returns the pair {i32, i1} and the printer emits the i1 destination as the second operand slot. The last 32-bit operand on each form is the membership mask the issuing thread passes in. redux.sync is feature-gated and requires the subtarget bitmap's has_redux bit; floating redux adds has_redux_float. The verifier rejects non-uniform redux.sync callers before the selector fires, so the emitter can treat the subgroup as uniform without re-checking.
The bar.warp.sync mask; instruction belongs to the same family and emits a warp-level barrier. Selection routes it through the same dispatcher arm as vote.sync, with bar.warp.sync as the stem and the mask as the single operand.
Special-Register Readers
Tileiras registers the nvvm.read.ptx.sreg.* family for PTX
special-register reads. A compact fast path prints the base thread and CTA
coordinate registers; the rest go through the ordinary instruction printer.
| Sreg family | PTX name(s) | Intrinsic | Width |
|---|---|---|---|
| Thread index | %tid.x / %tid.y / %tid.z | nvvm.read.ptx.sreg.tid.{x,y,z} | u32 |
| Thread-block dim | %ntid.x / %ntid.y / %ntid.z | nvvm.read.ptx.sreg.ntid.{x,y,z} | u32 |
| CTA index | %ctaid.x / %ctaid.y / %ctaid.z | nvvm.read.ptx.sreg.ctaid.{x,y,z} | u32 |
| Grid dim | %nctaid.x / %nctaid.y / %nctaid.z | nvvm.read.ptx.sreg.nctaid.{x,y,z} | u32 |
| Cluster geometry | %clusterid.*, %nclusterid.*, %cluster_ctaid.*, %cluster_nctaid.* | nvvm.read.ptx.sreg.cluster* | u32 |
| Cluster rank | %cluster_ctarank, %cluster_nctarank | nvvm.read.ptx.sreg.cluster.ctarank and sibling | u32 |
| SM / warp identity | %smid, %nsmid, %warpid, %nwarpid, %laneid, %warpsize, %gridid | matching nvvm.read.ptx.sreg.* | u32 |
| Lane-mask predicates | %lanemask_{eq,ge,gt,le,lt} | nvvm.read.ptx.sreg.lanemask.{eq,ge,gt,le,lt} | u32 |
| Clocks / timer | %clock, %clock64, %globaltimer | matching nvvm.read.ptx.sreg.* | u32 / u64 / u64 |
| Environment regs | %envreg0 .. %envreg31 | nvvm.read.ptx.sreg.envreg{0..31} | u32 |
%dynamic_smem_size reads through inline assembly. Tileiras exposes only
the combined u64 %globaltimer form, not separate high/low 32-bit halves.
Performance counters %pm0 through %pm7 go unregistered.
nvvm.breakpoint uses %globaltimer for a short busy-wait before
trapping.
The reader path is a one-line mov template keyed by the registered sreg
name. The fast path for thread and CTA coordinates skips the generic
instruction printer entirely:
void print_sreg_read(Printer *p, SregReadInst *inst) {
if (is_thread_or_cta_sreg(inst->sreg)) {
/* Fast path: compact "mov.u32 %rN, %sreg;" emission. */
write(p, "mov.u32 ");
print_dest_reg(p, inst->dest);
write(p, ", ");
write_sreg_token(p, inst->sreg);
write(p, ";");
return;
}
/* Slow path: ordinary instruction printer handles cluster geometry, lane masks,
clocks, environment regs, and any sreg that needs u64 typing. */
print_inst_generic(p, inst);
}
Fence and Mbarrier Family
Fence scope encodes two ways. acq_rel and sc fences use
scope-suffixed operation names; acquire and release fences carry scope
as an attribute. Mbarriers model initialization, arrival, expected
transactions, invalidation, and waits. Proxy fences model synchronization
between generic, async, cluster, and tensormap proxies.
| Op family | Operands / attrs | PTX lowering |
|---|---|---|
mbarrier.init / .shared | smemPtr, count | mbarrier.init[.shared].b64 [$p], $n; |
mbarrier.arrive / .shared | smemPtr | mbarrier.arrive[.shared].b64 $r, [$p]; |
mbarrier.arrive.nocomplete | smemPtr, count | mbarrier.arrive.noComplete[.shared].b64 $r, [$p], $cnt; |
mbarrier.arrive.expect_tx | smemPtr, txCount | mbarrier.arrive.expect_tx[.shared].b64 $r, [$p], $tx; |
mbarrier.txn | smemPtr, txCount, relaxed, noComplete, shared-space kind, scope, peer rank | `mbarrier.expect_tx{.relaxed}.{cta |
mbarrier.test.wait | smemPtr, token | mbarrier.test_wait[.shared].b64 $r, [$p], $token; |
mbarrier.try_wait | smemPtr, parity, suspendNs or timelimit | mbarrier.try_wait*.b64 ...; |
mbarrier.wait | smemPtr, optional parity | mbarrier.wait[.parity].b64 $r, [$p][, $par]; |
fence.acq_rel.{cta,cluster,gpu,sys} | none | fence.acq_rel.{cta,cluster,gpu,sys}; |
fence.sc.{cta,cluster,gpu,sys} | none | fence.sc.{cta,cluster,gpu,sys}; |
fence.acquire / fence.release | scope, space | fence.{acquire,release}.<scope>; |
fence.mbarrier.init | useIntrinsic | fence.mbarrier_init.release.cluster; |
fence.proxy | kind, space, useIntrinsic | fence.proxy.<kind>; |
fence.proxy.acquire / fence.proxy.release | fromProxy, scope, toProxy | fence.proxy.<from>::<to>.{acquire,release}.<scope>.sync.aligned [addr], sz; |
tensormap.cp_fenceproxy | srcTmapPtr, dstTmapPtr, sizeBytes, scope | tensormap.cp_fenceproxy.<scope>.tensormap::generic.release.<scope>.sync.aligned [dst], [src], sz; |
Legacy membar.{cta,gpu,sys} remains as a fallback. Cluster-scope fences
have no pre-Hopper fallback and diagnose unsupported ordering/scope
combinations on older targets.
Fence emission resolves the scope-as-name-vs-scope-as-attribute split at print time:
void print_fence(Printer *p, FenceInst *inst) {
if (inst->order == ORDER_ACQ_REL || inst->order == ORDER_SEQ_CST) {
/* Scope is folded into the operation name: fence.acq_rel.cta */
write(p, "fence.");
write_order_token(p, inst->order);
write(p, ".");
write_scope_token(p, inst->scope);
write(p, ";");
return;
}
/* Acquire and release fences carry scope as a separate token. */
require(inst->order == ORDER_ACQUIRE || inst->order == ORDER_RELEASE,
"fence ordering must be acquire, release, acq_rel, or seq_cst");
write(p, "fence.");
write_order_token(p, inst->order);
write(p, ".");
write_scope_token(p, inst->scope);
write(p, ";");
}
Proxy and tensormap fences extend this with <from>::<to> proxy tokens
and an aligned-sync address payload; the underlying decision tree is the
same.
SyncScope Mapping
LLVM SyncScope names get normalized before atomic and fence printing. Tileiras keeps a small map from LLVM scope names to the NVPTX scope vocabulary, then packs ordering and scope into the modifier word the final printer consumes.
| LLVM SyncScope name | Backend scope | PTX token |
|---|---|---|
singlethread | thread | no token |
| empty default scope | system | .sys |
block | CTA | .cta |
cluster | cluster | .cluster |
device | device / GPU | no explicit token |
The default device scope deliberately prints no .gpu token for ordinary
atomics because PTX treats GPU scope as the default spelling. CTA scope
becomes the composite .cta::cluster spelling when the lowering path asks
for cluster-tail semantics. Cache-hint atomics accept only CTA and system
scope; cluster cache hints get rejected outright rather than silently
downgraded.
typedef enum {
NVPTX_SCOPE_THREAD,
NVPTX_SCOPE_CTA,
NVPTX_SCOPE_CLUSTER,
NVPTX_SCOPE_DEVICE,
NVPTX_SCOPE_SYSTEM,
} NvptxScope;
typedef enum {
ORDER_RELAXED = 1,
ORDER_ACQUIRE = 2,
ORDER_RELEASE = 3,
ORDER_ACQ_REL = 4,
ORDER_SEQ_CST = 5,
} AtomicOrder;
NvptxScope map_sync_scope(const char *scope_name) {
if (scope_name == NULL || scope_name[0] == '\0') {
return NVPTX_SCOPE_SYSTEM;
}
if (strcmp(scope_name, "singlethread") == 0) {
return NVPTX_SCOPE_THREAD;
}
if (strcmp(scope_name, "block") == 0) {
return NVPTX_SCOPE_CTA;
}
if (strcmp(scope_name, "cluster") == 0) {
return NVPTX_SCOPE_CLUSTER;
}
if (strcmp(scope_name, "device") == 0) {
return NVPTX_SCOPE_DEVICE;
}
fail("unsupported NVPTX synchronization scope");
}
unsigned pack_atomic_modifier(AtomicOrder order, NvptxScope scope, bool cta_cluster_tail) {
unsigned modifier = (unsigned)order;
if (scope == NVPTX_SCOPE_CTA) {
modifier |= 1u << 4;
} else if (scope == NVPTX_SCOPE_SYSTEM) {
modifier |= 2u << 4;
} else if (scope == NVPTX_SCOPE_CLUSTER) {
modifier |= 3u << 4;
}
if (cta_cluster_tail) {
modifier |= 1u << 9;
}
return modifier;
}
The reimplementation rule is simple: preserve the high-level LLVM ordering first, map the scope name to the closest PTX scope second, then pick the printed suffix. Diagnose unsupported order/scope pairs at lowering time so the printer never recovers from an invalid modifier word.
Cross-References
AsmPrinter — MC Switch Shape Population Table documents the dispatcher that selects the print shape for these atomic, warp-collective, sreg, and fence opcodes. ISelDAG and MatcherTable covers the selector that consumes the same subtarget feature bitmap before any of these instructions reach the printer. tcgen05 mbarrier Emission and the mbarrier State Machine cover the mbarrier and proxy-fence families that share scope and ordering vocabulary with this page.