NVVM mbarrier Ops

Abstract

nvvm.mbarrier.* covers the sm_80+ mbarrier (memory barrier) state machine — a 64-bit shared-memory slot that counts arrivals, tracks an expected-transaction byte count, advances a phase parity, and lets warps wait for the slot to flip. The ops in this family each implement one transition of that state machine and emit the matching mbarrier.* PTX instruction. See mbarrier State Machine for the cross-op semantics and mbarrier Emission for the codegen side.

Two slot variants exist for almost every op: a generic-pointer form for completeness and a .shared form for the common case where the slot lives in shared memory. Lowering picks the .shared form whenever the operand address space is 3; the generic form remains so that kernels which explicitly cast through __cvta_to_shared still round-trip.

State Machine

Each mbarrier slot carries four fields packed into a 64-bit word:

| Field | Bits | Role |
|---|---|---|
| participant_count | low 20 | total arrivals that complete one phase |
| pending_count | mid 20 | arrivals remaining before the phase completes |
| tx_count | next 20 | bytes still expected (for TMA expect-tx variant) |
| phase | high 1 | toggles each time the phase completes |
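
The field table can be read as a packing scheme. The sketch below is illustrative only: the shift values (0, 20, 40, 63) are an assumption derived from the order of the rows, and the helper names `pack` and `unpack` are hypothetical. Software should treat the real slot as opaque.

```python
# Assumed bit layout, following the row order of the field table.
# The offsets are illustrative; the hardware encoding is opaque to software.
FIELD_BITS = 20
FIELD_MASK = (1 << FIELD_BITS) - 1
PARTICIPANT_SHIFT = 0   # "low 20"
PENDING_SHIFT = 20      # "mid 20"
TX_SHIFT = 40           # "next 20"
PHASE_SHIFT = 63        # "high 1"


def pack(participant_count, pending_count, tx_count, phase):
    """Pack the four fields into one 64-bit integer."""
    for name, value in (("participant_count", participant_count),
                        ("pending_count", pending_count),
                        ("tx_count", tx_count)):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError(f"{name} does not fit in {FIELD_BITS} bits")
    if phase not in (0, 1):
        raise ValueError("phase must be 0 or 1")
    return ((participant_count << PARTICIPANT_SHIFT)
            | (pending_count << PENDING_SHIFT)
            | (tx_count << TX_SHIFT)
            | (phase << PHASE_SHIFT))


def unpack(word):
    """Split a 64-bit word into (participant, pending, tx, phase)."""
    return ((word >> PARTICIPANT_SHIFT) & FIELD_MASK,
            (word >> PENDING_SHIFT) & FIELD_MASK,
            (word >> TX_SHIFT) & FIELD_MASK,
            (word >> PHASE_SHIFT) & 1)
```

Under these assumptions the four fields occupy 61 of the 64 bits, which is also why counts wider than 20 bits cannot be represented.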

The state transitions are:

| Op | Transition |
|---|---|
| init | participant_count := N, pending_count := N, tx_count := 0, phase := 0 |
| arrive | pending_count -= 1; if pending_count reaches zero (and tx_count is zero), complete the phase: flip phase, reset pending_count := participant_count |
| arrive.nocomplete | pending_count -= count, but completion is suppressed |
| arrive.expect_tx | tx_count += k, then arrive (for the TMA producer side) |
| try_wait.parity | potentially blocking: waits up to a bounded time and returns true if the phase with the given parity has completed |
| test.wait | non-blocking: returns true if the phase identified by the token has completed |
| inval | mark the slot uninitialised |

The expect_tx op is the producer-side handshake for TMA tile loads: the consumer initialises the slot with the participant count, the producer issues arrive.expect_tx to announce how many bytes the TMA load will deliver, the load decrements tx_count as those bytes land, and the consumer waits on the phase flip, which happens only once both pending_count and tx_count have reached zero. See mbarrier State Machine: Phase Parity for the parity bit semantics and Kinds: Ordinary, Transaction, Cluster for the per-kind transitions.
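
The transitions can be exercised with a small single-threaded model. This is a sketch under stated assumptions, not the hardware behaviour: the class `MBarrier`, the method `complete_tx` (standing in for the asynchronous copy reporting landed bytes), and the helper `handshake` are hypothetical names. The token is modelled as the parity the slot shows once the current phase has completed, and a phase is assumed to complete only when both pending_count and tx_count are zero.

```python
class MBarrier:
    """Toy model of one mbarrier slot (not thread-safe, illustration only)."""

    def __init__(self, participants):
        # init: participant_count := N, pending_count := N, tx_count := 0, phase := 0
        self.participant_count = participants
        self.pending_count = participants
        self.tx_count = 0
        self.phase = 0

    def _maybe_complete(self):
        # Assumption: completion needs both counters at zero.
        if self.pending_count == 0 and self.tx_count == 0:
            self.phase ^= 1
            self.pending_count = self.participant_count

    def arrive(self, count=1):
        # Assumption: the token is the parity visible after this phase completes.
        token = self.phase ^ 1
        self.pending_count -= count
        self._maybe_complete()
        return token

    def arrive_expect_tx(self, tx_bytes):
        # Register the expected bytes first, then count the arrival.
        self.tx_count += tx_bytes
        return self.arrive()

    def complete_tx(self, tx_bytes):
        # Hypothetical stand-in for the copy engine reporting landed bytes.
        self.tx_count -= tx_bytes
        self._maybe_complete()

    def test_wait(self, token):
        # Poll: True once the phase the token refers to has completed.
        return self.phase == token


def handshake(tx_bytes, chunks):
    """One producer arrives with expect-tx; the bytes land in equal chunks."""
    if tx_bytes % chunks:
        raise ValueError("tx_bytes must be divisible by chunks")
    bar = MBarrier(participants=1)
    token = bar.arrive_expect_tx(tx_bytes)
    polls = [bar.test_wait(token)]
    for _ in range(chunks):
        bar.complete_tx(tx_bytes // chunks)
        polls.append(bar.test_wait(token))
    return polls
```

In this model the poll stays false after the producer's arrival and turns true only when the last chunk lands, which is the property the consumer relies on. Because the parity is a single bit, the model cannot distinguish phases that are two completions apart.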

Op Roster

| Op | Variants |
|---|---|
| nvvm.mbarrier.init | generic + .shared |
| nvvm.mbarrier.inval | generic + .shared |
| nvvm.mbarrier.arrive | generic + .shared |
| nvvm.mbarrier.arrive.nocomplete | generic + .shared |
| nvvm.mbarrier.arrive.expect_tx | generic + .shared |
| nvvm.mbarrier.test.wait | generic + .shared |
| nvvm.mbarrier.wait | one op; blocking phase wait |
| nvvm.mbarrier.wait.parity | one op; phase-parity blocking wait |
| nvvm.mbarrier.try_wait.parity | generic + .shared + .timelimit variant |
| nvvm.mbarrier.try_wait.timelimit | one op; try-wait with deadline |
| nvvm.fence.mbarrier.init | one op; fence that orders mbarrier.init |
| nvvm.mbarrier.txn / nvvm.mbarrier.txn.cta | tx-count transaction handles |

Ops listed as generic + .shared come in both address-space forms. nvvm.mbarrier.wait, nvvm.mbarrier.wait.parity, nvvm.mbarrier.try_wait.timelimit, nvvm.fence.mbarrier.init, and the transaction-handle ops are single-variant.
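
As a rough illustration of how the roster expands and how the address-space rule picks a variant, the sketch below encodes the table as two lists. The grouping is one reading of the roster, the name nvvm.mbarrier.try_wait.parity.timelimit is inferred from the availability table, and `all_op_names` and `select_variant` are hypothetical helpers rather than part of any real API.

```python
SHARED_ADDRSPACE = 3  # NVPTX shared memory

# Ops the roster lists as "generic + .shared".
DUAL_VARIANT = [
    "nvvm.mbarrier.init",
    "nvvm.mbarrier.inval",
    "nvvm.mbarrier.arrive",
    "nvvm.mbarrier.arrive.nocomplete",
    "nvvm.mbarrier.arrive.expect_tx",
    "nvvm.mbarrier.test.wait",
    "nvvm.mbarrier.try_wait.parity",
]

# Ops assumed to exist in a single form only.
SINGLE_VARIANT = [
    "nvvm.mbarrier.wait",
    "nvvm.mbarrier.wait.parity",
    "nvvm.mbarrier.try_wait.parity.timelimit",  # name inferred, see lead-in
    "nvvm.mbarrier.try_wait.timelimit",
    "nvvm.fence.mbarrier.init",
    "nvvm.mbarrier.txn",
    "nvvm.mbarrier.txn.cta",
]


def all_op_names():
    """Expand the roster into the full list of op names."""
    names = []
    for op in DUAL_VARIANT:
        names.extend([op, op + ".shared"])
    names.extend(SINGLE_VARIANT)
    return names


def select_variant(base_op, addrspace):
    """Pick the op name for a slot pointer in the given address space."""
    if base_op in DUAL_VARIANT:
        return base_op + ".shared" if addrspace == SHARED_ADDRSPACE else base_op
    if base_op in SINGLE_VARIANT:
        return base_op
    raise ValueError(f"unknown op: {base_op}")
```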

Operand Tables

nvvm.mbarrier.init[.shared]

| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) (.shared) or generic ptr | mbarrier slot |
| operand 1 | count | i32 | participant count |

nvvm.mbarrier.inval[.shared]

| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |

nvvm.mbarrier.arrive[.shared]

| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | count | optional i32 | arrival weight (default 1) |
| result 0 | token | i64 | phase token consumed by test.wait |

The "drop participant" variant of arrive (mbarrier.arrive.drop in PTX) is not surfaced as a separate dialect op in this binary's string table; the upstream way to reach it is through nvvm.mbarrier.arrive with a count attribute encoding the drop semantics, or through inline asm.

nvvm.mbarrier.arrive.expect_tx[.shared]

| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | txCount | i32 | expect-tx byte count |
| result 0 | token | i64 | phase token |

nvvm.mbarrier.test.wait[.shared]

| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | token | i64 | from arrive |
| result 0 | complete | i1 | phase-match outcome |

nvvm.mbarrier.try_wait.parity[.shared]

| Position | Name | Type | Notes |
|---|---|---|---|
| operand 0 | addr | ptr addrspace(3) or generic | mbarrier slot |
| operand 1 | phase | i32 | parity (0 or 1) |
| operand 2 | ticks | i32 | retry budget |
| result 0 | complete | i1 | phase-match outcome |

nvvm.fence.mbarrier.init

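| Position | Name | Type | Notes |
|---|---|---|---|
| (no operands) | | | release fence at cluster scope, issued after mbarrier.init so the initialised slot is visible across the cluster |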

LLVM Intrinsic Mapping

| Op | LLVM intrinsic |
|---|---|
| nvvm.mbarrier.init.shared | llvm.nvvm.mbarrier.init.shared.b64 |
| nvvm.mbarrier.init | llvm.nvvm.mbarrier.init.b64 |
| nvvm.mbarrier.inval.shared | llvm.nvvm.mbarrier.inval.shared.b64 |
| nvvm.mbarrier.arrive.shared | llvm.nvvm.mbarrier.arrive.shared.b64 |
| nvvm.mbarrier.arrive | llvm.nvvm.mbarrier.arrive.b64 |
| nvvm.mbarrier.arrive.nocomplete.shared | llvm.nvvm.mbarrier.arrive.noComplete.shared.b64 |
| nvvm.mbarrier.arrive.expect_tx.shared | llvm.nvvm.mbarrier.arrive.expect_tx.shared.b64 |
| nvvm.mbarrier.test.wait.shared | llvm.nvvm.mbarrier.test.wait.shared.b64 |
| nvvm.mbarrier.try_wait.parity.shared | llvm.nvvm.mbarrier.try.wait.parity.shared.b64 |
| nvvm.fence.mbarrier.init | llvm.nvvm.fence.mbarrier.init.release.cluster |

The intrinsic ID is selected at TableGen registration time; lowering does not re-derive it from operand types.
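
A fixed mapping of this kind amounts to a lookup keyed on the op name alone. The sketch below transcribes the table into a dictionary; `INTRINSICS` and `intrinsic_for` are hypothetical names used for illustration, and the intrinsic strings are copied from the table rather than checked against any LLVM release.

```python
# Op name -> intrinsic name, transcribed from the mapping table.
INTRINSICS = {
    "nvvm.mbarrier.init.shared": "llvm.nvvm.mbarrier.init.shared.b64",
    "nvvm.mbarrier.init": "llvm.nvvm.mbarrier.init.b64",
    "nvvm.mbarrier.inval.shared": "llvm.nvvm.mbarrier.inval.shared.b64",
    "nvvm.mbarrier.arrive.shared": "llvm.nvvm.mbarrier.arrive.shared.b64",
    "nvvm.mbarrier.arrive": "llvm.nvvm.mbarrier.arrive.b64",
    "nvvm.mbarrier.arrive.nocomplete.shared":
        "llvm.nvvm.mbarrier.arrive.noComplete.shared.b64",
    "nvvm.mbarrier.arrive.expect_tx.shared":
        "llvm.nvvm.mbarrier.arrive.expect_tx.shared.b64",
    "nvvm.mbarrier.test.wait.shared": "llvm.nvvm.mbarrier.test.wait.shared.b64",
    "nvvm.mbarrier.try_wait.parity.shared":
        "llvm.nvvm.mbarrier.try.wait.parity.shared.b64",
    "nvvm.fence.mbarrier.init": "llvm.nvvm.fence.mbarrier.init.release.cluster",
}


def intrinsic_for(op_name):
    """Look up the intrinsic by op name only; operand types are never consulted."""
    try:
        return INTRINSICS[op_name]
    except KeyError:
        raise ValueError(f"no intrinsic registered for {op_name}") from None
```

Note that the names are not a mechanical transformation of the op name: nocomplete becomes noComplete and try_wait becomes try.wait, which is one reason a table is simpler than deriving the string.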

PTX Templates

mbarrier.init.shared.b64 [%mbar], %count;
mbarrier.inval.shared.b64 [%mbar];
mbarrier.arrive.shared.b64 %tok, [%mbar];
mbarrier.arrive.noComplete.shared.b64 %tok, [%mbar], %count;
mbarrier.arrive.expect_tx.shared.b64 %tok, [%mbar], %tx;
mbarrier.test_wait.shared.b64 %p, [%mbar], %tok;
mbarrier.try_wait.parity.shared.b64 %p, [%mbar], %ph, %ns;
fence.mbarrier_init.release.cluster;

The non-.shared forms drop the address-space token: mbarrier.init.b64 [%mbar], %count; and so on. The verifier rejects mixing — a .shared op with a generic pointer operand, or a generic op with a shared-pointer operand.
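
The relationship between the two forms can be sketched as a template with an optional address-space token. The helper `render` and its parameters are hypothetical; the template strings follow the listing above, and the mixing check mirrors the verifier rule just described.

```python
SHARED_ADDRSPACE = 3

# "{space}" expands to ".shared" or to nothing for the generic form.
PTX_TEMPLATES = {
    "init": "mbarrier.init{space}.b64 [{mbar}], {count};",
    "inval": "mbarrier.inval{space}.b64 [{mbar}];",
    "arrive": "mbarrier.arrive{space}.b64 {tok}, [{mbar}];",
    "arrive.nocomplete":
        "mbarrier.arrive.noComplete{space}.b64 {tok}, [{mbar}], {count};",
    "arrive.expect_tx":
        "mbarrier.arrive.expect_tx{space}.b64 {tok}, [{mbar}], {tx};",
    "test.wait": "mbarrier.test_wait{space}.b64 {p}, [{mbar}], {tok};",
    "try_wait.parity":
        "mbarrier.try_wait.parity{space}.b64 {p}, [{mbar}], {ph}, {ns};",
}


def render(op, addrspace, shared_form, **regs):
    """Render one PTX line, rejecting a variant/address-space mismatch."""
    if shared_form != (addrspace == SHARED_ADDRSPACE):
        raise ValueError(
            f"{op}: variant and pointer address space {addrspace} do not match")
    space = ".shared" if shared_form else ""
    return PTX_TEMPLATES[op].format(space=space, **regs)
```

For example, `render("init", 3, True, mbar="%mbar", count="%count")` reproduces the first template line, and the same call with `addrspace=0, shared_form=False` yields the generic spelling.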

Per-Arch Availability

| Op | SM floor | ptx_min |
|---|---|---|
| init, arrive, arrive.nocomplete, inval, test.wait, try_wait | sm_80 | 7.0 |
| try_wait.parity | sm_80 | 7.8 |
| try_wait.parity.timelimit / try_wait.timelimit | sm_90 | 8.0 |
| wait / wait.parity | sm_90 | 8.0 |
| txn / txn.cta (tx-count transaction handles) | sm_90 | 8.0 |
| arrive.expect_tx | sm_90 | 7.8 |
| fence.mbarrier.init | sm_90 | 8.0 |
| Cluster-aware variants (.cluster, .release.cluster) | sm_90 | 8.0 |

The expect_tx form is the TMA producer-side handshake and requires sm_90, as do the timelimit, wait, txn, fence, and cluster-aware entries in the table. See TMA Ops for the producer side and Cluster Sync and DSMEM Handshake: DSMEM Transaction Handshake for the cluster-aware transaction protocol.
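
An availability gate over this table reduces to two comparisons. In the sketch below the floors are transcribed from the table as given, PTX versions are modelled as (major, minor) tuples, and `is_available` and `unavailable_ops` are hypothetical helper names.

```python
# op -> (sm_floor, ptx_min), transcribed from the availability table.
AVAILABILITY = {
    "init": (80, (7, 0)),
    "arrive": (80, (7, 0)),
    "arrive.nocomplete": (80, (7, 0)),
    "inval": (80, (7, 0)),
    "test.wait": (80, (7, 0)),
    "try_wait": (80, (7, 0)),
    "try_wait.parity": (80, (7, 8)),
    "try_wait.parity.timelimit": (90, (8, 0)),
    "try_wait.timelimit": (90, (8, 0)),
    "wait": (90, (8, 0)),
    "wait.parity": (90, (8, 0)),
    "txn": (90, (8, 0)),
    "txn.cta": (90, (8, 0)),
    "arrive.expect_tx": (90, (7, 8)),
    "fence.mbarrier.init": (90, (8, 0)),
}


def is_available(op, sm, ptx):
    """True if the target meets both the SM floor and the minimum PTX version."""
    sm_floor, ptx_min = AVAILABILITY[op]
    return sm >= sm_floor and ptx >= ptx_min


def unavailable_ops(sm, ptx):
    """List the ops a given target cannot use, sorted by name."""
    return sorted(op for op in AVAILABILITY if not is_available(op, sm, ptx))
```

Tuple comparison keeps version ordering correct for cases such as (7, 8) versus (8, 0), which a float comparison would also handle but a string comparison would not in general.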

Verifier Invariants

  • .shared ops require operand 0 in addr-space 3.
  • count and txCount are 32-bit unsigned; values larger than 20 bits are rejected.
  • test.wait and try_wait.parity require an i1 result type.
  • arrive.expect_tx is rejected on sm_80; it requires sm_90 or later.
  • fence.mbarrier.init carries a release.cluster scope; rewriting it to acquire is rejected.
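
The first four invariants can be sketched as a checker that returns a list of diagnostics. Everything here is a hypothetical model for illustration: the `MBarrierOp` record, its field names, and the message strings are not taken from the real verifier, and the fence scope rule is omitted because it depends on how the scope attribute is represented.

```python
from dataclasses import dataclass
from typing import List, Optional

SHARED_ADDRSPACE = 3
MAX_COUNT = (1 << 20) - 1          # largest value that fits in 20 bits
I1_RESULT_OPS = {"test.wait", "try_wait.parity"}


@dataclass
class MBarrierOp:
    name: str                          # e.g. "arrive.expect_tx.shared"
    addrspace: int                     # address space of operand 0
    count: Optional[int] = None        # count or txCount, if present
    result_type: Optional[str] = None  # e.g. "i1" or "i64"
    target_sm: int = 90


def verify(op: MBarrierOp) -> List[str]:
    """Return one message per violated invariant (empty list means OK)."""
    errors = []
    is_shared = op.name.endswith(".shared")
    base = op.name[:-len(".shared")] if is_shared else op.name

    if is_shared and op.addrspace != SHARED_ADDRSPACE:
        errors.append(".shared op requires operand 0 in addr-space 3")
    if op.count is not None and not 0 <= op.count <= MAX_COUNT:
        errors.append("count does not fit in 20 bits")
    if base in I1_RESULT_OPS and op.result_type != "i1":
        errors.append(f"{base} requires an i1 result")
    if base == "arrive.expect_tx" and op.target_sm < 90:
        errors.append("arrive.expect_tx requires sm_90 or later")
    return errors
```

Collecting messages instead of raising on the first failure makes it easier to report every problem with an op in one pass.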