nvvm.mbarrier.* covers the sm_80+ mbarrier state machine. An mbarrier is a barrier object that lives in shared memory: a 64-bit slot that counts arrivals, tracks an expected-transaction byte count, advances a phase parity, and lets threads wait for the phase to flip. Each op in this family implements one transition of that state machine and emits the matching mbarrier.* PTX instruction. See mbarrier State Machine for the cross-op semantics and mbarrier Emission for the codegen side.
Two slot variants exist for almost every op: a generic-pointer form for completeness and a .shared form for the common case where the slot lives in shared memory. Lowering picks the .shared form whenever the operand address space is 3; the generic form remains so that kernels which explicitly cast through __cvta_to_shared still round-trip.
Each mbarrier slot carries four fields packed into a 64-bit word:
| Field | Bits | Role |
|---|---|---|
| participant_count | low 20 | total arrivals that complete one phase |
| pending_count | mid 20 | arrivals remaining before the phase completes |
| tx_count | next 20 | bytes still expected (for the TMA expect-tx variant) |
| phase | high 1 | toggles each time the phase completes |
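The layout in the table can be sketched as a pack/unpack pair. This is only an illustrative model: the exact bit offsets used here (0, 20, 40, and bit 63 for the phase) are an assumption read off the "low / mid / next / high" wording, and the helper names `pack` and `unpack` are hypothetical.

```python
# Illustrative model of the slot layout described above.
# Assumption: fields sit at bit offsets 0, 20, 40 and the phase at bit 63.
MASK20 = (1 << 20) - 1


def pack(participant_count, pending_count, tx_count, phase):
    """Pack the four fields into one 64-bit integer."""
    for value in (participant_count, pending_count, tx_count):
        if not 0 <= value <= MASK20:
            raise ValueError("count fields must fit in 20 bits")
    if phase not in (0, 1):
        raise ValueError("phase is a single bit")
    return (participant_count
            | (pending_count << 20)
            | (tx_count << 40)
            | (phase << 63))


def unpack(word):
    """Split a 64-bit word into (participant, pending, tx, phase)."""
    return (word & MASK20,
            (word >> 20) & MASK20,
            (word >> 40) & MASK20,
            (word >> 63) & 1)


# A freshly initialised slot for 128 participants.
word = pack(128, 128, 0, 0)
print(hex(word), unpack(word))
```

Three 20-bit counts plus one phase bit use 61 of the 64 bits, so the sketch leaves bits 60 to 62 unused.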
The state transitions are:
| Op | Transition |
|---|---|
| init | participant_count := N, pending_count := N, tx_count := 0, phase := 0 |
| arrive | pending_count -= 1; if pending_count and tx_count are both zero, complete the phase: flip phase, reset pending_count := participant_count |
| arrive.nocomplete | pending_count -= count, without completing the phase (the decrement must leave pending_count above zero) |
| arrive.expect_tx | tx_count += k, then arrive (for the TMA producer side) |
| try_wait.parity | potentially blocking: waits until the phase with the given parity has completed |
| test.wait | non-blocking: returns true if the phase identified by the token has completed |
| inval | mark the slot uninitialised |
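The arrival side of these transitions can be sketched as a small Python model. It is a toy, not a hardware-accurate emulator: the class name `MBarrier`, the choice to represent the wait token as the phase bit captured at arrival, and modelling `test_wait` as a single poll that a caller would repeat are all assumptions made for illustration. The transaction-count side is left out of this sketch.

```python
class MBarrier:
    """Toy model of one mbarrier slot (illustrative only)."""

    def __init__(self):
        self.valid = False  # the slot starts uninitialised

    def init(self, n):
        self.participant_count = n
        self.pending_count = n
        self.tx_count = 0
        self.phase = 0
        self.valid = True

    def _complete_if_ready(self):
        # A phase completes once no arrivals and no bytes are outstanding.
        if self.pending_count == 0 and self.tx_count == 0:
            self.phase ^= 1
            self.pending_count = self.participant_count

    def arrive(self):
        token = self.phase  # assumption: token = phase bit at arrival time
        self.pending_count -= 1
        self._complete_if_ready()
        return token

    def arrive_nocomplete(self, count=1):
        # This arrival is not allowed to be the one that finishes the phase.
        if count >= self.pending_count:
            raise ValueError("nocomplete arrival would complete the phase")
        token = self.phase
        self.pending_count -= count
        return token

    def test_wait(self, token):
        # Single poll: has the phase that produced `token` completed?
        return self.phase != token

    def inval(self):
        self.valid = False


bar = MBarrier()
bar.init(2)
t = bar.arrive()
print(bar.test_wait(t))  # one of two arrivals so far
bar.arrive()
print(bar.test_wait(t))  # both arrivals are in
```

After the second arrival the model flips `phase` and reloads `pending_count` from `participant_count`, so the same slot is ready for the next phase without a second `init`.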
The expect_tx op is the producer-side handshake for TMA tile loads: the slot is initialised with the participant count, the producer thread issues arrive.expect_tx with the number of bytes the TMA load will deliver, the TMA hardware decrements tx_count as those bytes land, and the consumer waits on the phase flip, which happens only once pending_count and tx_count have both reached zero. See the Phase Parity section of mbarrier State Machine for the parity bit semantics and Kinds: Ordinary, Transaction, Cluster for the per-kind transitions.
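A minimal sketch of the transaction handshake, assuming the PTX convention that the expected byte count is posted up front and the copy engine then reports bytes as they land. The `complete_tx` step is named after the PTX mbarrier.complete_tx operation and is not one of the ops in the table above; the class `TxBarrier` and its field names are hypothetical.

```python
class TxBarrier:
    """Toy model of a transaction-counting barrier (illustrative only)."""

    def __init__(self, participants):
        self.participants = participants
        self.pending = participants
        self.tx = 0
        self.phase = 0

    def _complete_if_ready(self):
        # The phase flips only when arrivals AND bytes have both drained.
        if self.pending == 0 and self.tx == 0:
            self.phase ^= 1
            self.pending = self.participants

    def arrive_expect_tx(self, nbytes):
        # Producer: post the expected byte count, then arrive.
        self.tx += nbytes
        self.pending -= 1
        self._complete_if_ready()

    def complete_tx(self, nbytes):
        # Copy engine: report that nbytes have landed in shared memory.
        self.tx -= nbytes
        self._complete_if_ready()


bar = TxBarrier(participants=1)
start = bar.phase
bar.arrive_expect_tx(4096)   # producer expects a 4096-byte tile
print(bar.phase != start)    # arrival alone does not flip the phase
bar.complete_tx(2048)        # first half of the tile lands
bar.complete_tx(2048)        # second half lands
print(bar.phase != start)    # consumer can now observe the flip
```

The point of the sketch is the ordering: the single arrival drains `pending`, but the phase stays put until the last byte is accounted for.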
Most of these ops carry a .shared variant (the address-space split adds matching .shared entries); the wait family and the transaction-handle ops are single-variant.
The "drop participant" variant of arrive (mbarrier.arrive_drop in PTX) is not surfaced as a separate dialect op in this binary's string table; the upstream way to reach it is through nvvm.mbarrier.arrive with a count attribute encoding the drop semantics, or through inline asm.
The non-.shared forms drop the address-space token: mbarrier.init.shared.b64 [%mbar], %count; becomes mbarrier.init.b64 [%mbar], %count; and so on. The verifier rejects mixing: a .shared op with a generic pointer operand, or a generic op with a shared-pointer operand.
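The variant selection and the mixing check can be sketched as two small functions. The names `select_op` and `verify` are hypothetical, the sketch ignores the single-variant ops mentioned above, and treating every address space other than 3 as "not shared" is an assumption.

```python
SHARED_ADDRESS_SPACE = 3  # from the text: address space 3 selects .shared


def select_op(base, address_space):
    """Return the op name lowering would pick for this pointer operand."""
    if address_space == SHARED_ADDRESS_SPACE:
        return f"nvvm.mbarrier.{base}.shared"
    return f"nvvm.mbarrier.{base}"


def verify(op_name, address_space):
    """Reject a .shared op on a non-shared pointer, and the reverse."""
    is_shared_op = op_name.endswith(".shared")
    is_shared_ptr = address_space == SHARED_ADDRESS_SPACE
    if is_shared_op != is_shared_ptr:
        raise ValueError(
            f"{op_name} does not match pointer address space {address_space}")


print(select_op("init", 3))
print(select_op("init", 0))
verify(select_op("arrive", 3), 3)  # consistent pairing passes silently
```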