cutlass Seq-Bar and Block-Striped Operations

Abstract

cutlass.seq_bar.* models CUTLASS ordered sequence barriers — a ring of mbarrier slots plus a phase/index state cursor. cutlass.block_striped.* models per-thread striped memory movement and partial reduction across a CTA. Together they form the synchronisation and cooperative-movement substrate for warp-specialized GEMM mainloops and StreamK/split-K epilogues. The seq-bar family contributes five of the seventy ops the dialect ctor registers at sub_1761D90; the block-striped family contributes four.

Sequential Barrier Model

A sequence barrier is a circular array of barrier slots. Each participant carries a state cursor:

typedef struct {
    int phase;
    int index;
    int count;
} SeqBarState;

The slot is index % depth. Arrival advances the cursor; waiting uses the current phase.

BarrierSlot seq_bar_slot(SeqBar bar, SeqBarState state) {
    int slot_index = state.index % bar.depth;
    return bar.base[slot_index];
}

SeqBarState seq_bar_advance(SeqBar bar, SeqBarState state) {
    state.index += 1;
    state.count += 1;

    if ((state.index % bar.depth) == 0) {
        state.phase ^= 1;
    }

    return state;
}
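To make the wraparound rule concrete, here is a small host-side sketch that replays the cursor over a ring and records which slot and phase each arrival sees. It is a model under stated assumptions, not the dialect's lowering: `Visit` and `trace_cursor` are hypothetical names, and the state record simply mirrors the struct above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical mirror of the SeqBarState record above; not the real layout.
struct SeqBarState {
    int phase = 0;
    int index = 0;
    int count = 0;
};

// One (slot, phase) pair as seen by a participant before it arrives.
struct Visit {
    int slot;
    int phase;
};

// Advance one arrival on a ring of `depth` slots, flipping phase on wraparound.
SeqBarState seq_bar_advance(int depth, SeqBarState state) {
    assert(depth > 0);
    state.index += 1;
    state.count += 1;
    if (state.index % depth == 0) {
        state.phase ^= 1;
    }
    return state;
}

// Record the slot and phase observed before each of `steps` arrivals.
std::vector<Visit> trace_cursor(int depth, int steps) {
    std::vector<Visit> visits;
    SeqBarState state;
    for (int i = 0; i < steps; ++i) {
        visits.push_back({state.index % depth, state.phase});
        state = seq_bar_advance(depth, state);
    }
    return visits;
}
```

With a depth of three, the first three arrivals see slots 0, 1, 2 in phase 0, and the fourth sees slot 0 again in phase 1.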

Seq-Bar Operations

The five seq-bar ops are cutlass.seq_bar.create, cutlass.seq_bar.init, cutlass.seq_bar.arrive, cutlass.seq_bar.wait, and the per-state-machine cutlass.seq_bar.state.create. The init op claims a slot range from the per-CTA NamedBarrier pool via the same barrier-id allocator at sub_1771850 that PipelineInitOp::verify uses. The thirty-two-slot pool serves both cutlass.pipeline and cutlass.seq_bar, so pipeline-init and seq-bar-init compete for the same physical resource and must agree on allocation order.

Operation	Contract
`cutlass.seq_bar.create`	Allocate the ring's typed handle; lowering attaches the barrier-id range claimed by `init`.
`cutlass.seq_bar.init`	Initialize every barrier slot in the ring; allocates slot IDs from the per-CTA NamedBarrier pool.
`cutlass.seq_bar.arrive`	Arrive on the current slot and advance state.
`cutlass.seq_bar.wait`	Wait for the current slot and phase.
`cutlass.seq_bar.state.create`	Materialize the per-thread state record (slot index, phase) consumed by `arrive`/`wait`.
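The shared-pool constraint can be pictured with a toy allocator. The sketch below assumes a simple bump policy, which the text does not confirm for sub_1771850; only the pool size of thirty-two comes from this page, and `BarrierIdPool` is a hypothetical name.

```cpp
#include <cassert>

// Hypothetical bump allocator over the shared per-CTA barrier-id pool.
// Only the pool size (32) comes from the text; the policy is an assumption.
class BarrierIdPool {
public:
    static constexpr int kSlots = 32;

    // Claim `count` consecutive ids. Returns the base id, or -1 if they do not fit.
    int claim(int count) {
        if (count <= 0 || next_ + count > kSlots) {
            return -1;
        }
        const int base = next_;
        next_ += count;
        return base;
    }

    int remaining() const { return kSlots - next_; }

private:
    int next_ = 0;
};
```

Under this policy a pipeline init that claims eight ids first gets base 0 and a later seq-bar init claiming four gets base 8; swapping the order swaps the bases, which is why the two init ops must agree on allocation order.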

void lower_seq_bar_wait(SeqBar bar, SeqBarState state) {
    BarrierSlot slot = seq_bar_slot(bar, state);
    emit_mbarrier_try_wait_parity(slot, state.phase);
}

SeqBarState lower_seq_bar_arrive(SeqBar bar, SeqBarState state) {
    BarrierSlot slot = seq_bar_slot(bar, state);
    emit_mbarrier_arrive(slot, bar.participant_count);
    return seq_bar_advance(bar, state);
}

The verifier checks that wait and arrive use a state whose depth matches the sequence-barrier type. Mismatched depths usually mean producer and consumer cursors were constructed for different rings.

Relationship to `cutlass.pipeline`

seq_bar is the simple ordered-ring path. cutlass.pipeline carries richer producer/consumer participant masks and transaction-byte accounting; its init op is verified by the 3 406-byte PipelineInitOp::verify at sub_1771F40, which checks numStages > 0, that the participants list length (read via sub_172E930) matches numProducers, that the consumer list length (read via sub_172E940) matches numConsumers, that barrier_id_base falls within [0, 32), and that producer_group_id and consumer_group_id are distinct. Seq-bar init skips all of these checks because its participant model is a single flat ring. Reach for seq_bar when ordered arrival and wait are enough; reach for pipeline when TMA, WGMMA, executor masks, or participant roles must be explicit.
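The checks listed for PipelineInitOp::verify can be sketched as a plain function over an attribute bundle. This is an illustration only: `PipelineInitAttrs` and its field names are hypothetical and follow the prose, and the order of the checks is an assumption rather than something recovered from sub_1771F40.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical attribute bundle. Field names follow the prose above,
// not the real op definition.
struct PipelineInitAttrs {
    int num_stages;
    int num_producers;
    int num_consumers;
    std::vector<int> participants;  // list compared against num_producers
    std::vector<int> consumers;     // list compared against num_consumers
    int barrier_id_base;
    int producer_group_id;
    int consumer_group_id;
};

// Returns an empty string on success, otherwise a message for the first failed check.
std::string verify_pipeline_init(const PipelineInitAttrs& attrs) {
    if (attrs.num_stages <= 0)
        return "numStages must be positive";
    if (static_cast<int>(attrs.participants.size()) != attrs.num_producers)
        return "participants length does not match numProducers";
    if (static_cast<int>(attrs.consumers.size()) != attrs.num_consumers)
        return "consumer length does not match numConsumers";
    if (attrs.barrier_id_base < 0 || attrs.barrier_id_base >= 32)
        return "barrier_id_base outside [0, 32)";
    if (attrs.producer_group_id == attrs.consumer_group_id)
        return "producer_group_id and consumer_group_id must differ";
    return "";
}
```

A seq-bar init has none of these fields to validate, which is the sense in which it skips the checks.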

Once PipelineInitOp::verify passes, the post-verify builder at sub_1772C90 stamps a derived arrive-count attribute on the pipeline-init op. Seq-bar init has no equivalent post-verify step — its arrive-count is the static participant count from the op attribute and needs no recomputation.

Block-Striped Model

Block-striped movement partitions a contiguous tile across T threads. Thread i touches indices:

i, T + i, 2T + i, ...

The result: accesses that coalesce across neighbouring threads at every step, with a dead-simple mapping from thread id to element index.

SmallVector<int> block_striped_indices(int thread_id, int threads, int elements) {
    SmallVector<int> result;

    for (int index = thread_id; index < elements; index += threads) {
        result.push_back(index);
    }

    return result;
}
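A standalone version of the same mapping, using std::vector in place of SmallVector so it builds without LLVM, makes the partition property easy to check on the host. `stripes_partition_tile` is a hypothetical helper added for illustration; it confirms that the stripes of all threads cover the tile exactly once.

```cpp
#include <cassert>
#include <vector>

// Indices touched by `thread_id` when `threads` threads stripe over `elements` items.
std::vector<int> block_striped_indices(int thread_id, int threads, int elements) {
    assert(threads > 0 && thread_id >= 0 && thread_id < threads);
    std::vector<int> result;
    for (int index = thread_id; index < elements; index += threads) {
        result.push_back(index);
    }
    return result;
}

// True when every index in [0, elements) is touched by exactly one thread.
bool stripes_partition_tile(int threads, int elements) {
    std::vector<int> hits(elements, 0);
    for (int thread_id = 0; thread_id < threads; ++thread_id) {
        for (int index : block_striped_indices(thread_id, threads, elements)) {
            hits[index] += 1;
        }
    }
    for (int hit : hits) {
        if (hit != 1) {
            return false;
        }
    }
    return true;
}
```

For example, thread 1 of 4 over a ten-element tile touches indices 1, 5, and 9.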

Block-Striped Operations

The four block-striped ops are cutlass.block_striped.load, cutlass.block_striped.load_add (fused load-then-add into the register stripe), cutlass.block_striped.store, and cutlass.block_striped.reduce. Type specialisation (half, bfloat, packed, integer, float) is carried as an attribute on each op rather than as separate op-name entries: each variant needs distinct atom selection at lowering time, but the op registration is shared.

Operation	Contract
`cutlass.block_striped.load`	Load a per-thread stripe from global memory into register memory.
`cutlass.block_striped.load_add`	Load a per-thread stripe from global memory and add it into the register stripe.
`cutlass.block_striped.store`	Store a per-thread stripe from register memory to global memory.
`cutlass.block_striped.reduce`	Atomically add or max each register stripe into a global workspace.

Every variant requires one register-memory memref, one global-memory pointer or memref, an element width of at least sixteen bits, and a static shape. Static shape lets the lowering pick a vector width and atom shape.

Four Operand-Layout Checkers

The block-striped family uses four operand-layout checkers — one per variant: sub_176E670 for load, sub_176EE10 for store, sub_176F5B0 for reduce-add, sub_176FD50 for reduce-max. Each one checks register-memory operand width, global-memory pointer or memref shape, element-type width (at least sixteen bits), and static stripe shape. They reject malformed operand combinations before lowering picks a vector width and copy atom.

LogicalResult verify_block_striped(BlockStripedOp op) {
    require(has_rmem_operand(op));
    require(has_gmem_operand(op));
    require(bit_width(op.element_type) >= 16);
    require(shape_is_static(op.stripe_shape));
    require(op.n_threads_in_block > 0);
    return success();
}

Reduce Lowering

block_striped.reduce is the side-effecting atomic path. It emits a global atomic add or max appropriate for the element type and packed width.

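void lower_block_striped_reduce(BlockStripedReduceOp op) {
    // Each thread walks its own stripe of the tile.
    for (int element_index : block_striped_indices(thread_id(), op.threads, op.elements)) {
        // Byte offset of this element in the global workspace.
        Pointer dst = op.workspace + element_index * sizeof(op.element_type);
        Value value = load_register_fragment(op.rmem, element_index);
        // Add form shown; the max form emits the matching global atomic max instead.
        emit_global_atomic_add(dst, value, GPU_SCOPE);
    }
}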

Half and bfloat packed reductions use the target's packed no-flush path when available. Integer and floating scalar reductions select the matching global-add op. The five type-specialised forms exist so the lowering can route straight to the correct atom without rederiving the element type from a generic op.

Cutlass-Bar Warp-Cooperative Diagnostic

BarOpLowering at sub_15FC250 is a ~5.5 KB routine handling cutlass.bar lowering plus the warp-cooperative diagnostic. The diagnostic fires when a cutlass.bar op carries an arrive-count that is not a multiple of warp size, or when the op sits outside warp-cooperative scope. Block-striped and seq-bar lowering both depend on the warp-cooperative scope guarantee the diagnostic enforces — partial-warp arrivals on a cutlass.bar would otherwise corrupt the per-stage arrive count the pipeline-init post-verify builder stamped.
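The two trigger conditions reduce to a short predicate. The sketch below is a hypothetical restatement of the prose, not the logic recovered from sub_15FC250; it assumes a warp size of 32 and takes the scope check as a precomputed flag.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp size

// Hypothetical predicate: true when the diagnostic described above would fire.
bool bar_needs_diagnostic(int arrive_count, bool in_warp_cooperative_scope) {
    const bool partial_warp = (arrive_count % kWarpSize) != 0;
    return partial_warp || !in_warp_cooperative_scope;
}
```

An arrive-count of 128 inside warp-cooperative scope passes; an arrive-count of 100, or any count outside that scope, does not.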

Load/Add/Store Lowering

The non-atomic variants lower through cute copy atoms and vector load/store helpers.

void lower_block_striped_load_add(BlockStripedLoadAddOp op) {
    CopyAtom atom = make_block_striped_copy_atom(op.element_type, op.stripe_shape);

    RegisterFragment old_value = cute_load_vec(op.gmem, atom);
    RegisterFragment partial = cute_load_vec(op.rmem, atom);
    RegisterFragment merged = add_fragments(old_value, partial, op.element_type);

    cute_store_vec(op.rmem, merged, atom);
}

Use floating addition for floating element types and integer addition for integer element types. The verifier nails that choice down by requiring a concrete element type.
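The merge step can be modelled on the host with a template, where the element type picks integer or floating addition at compile time. This is a sketch under assumptions: `load_add_stripe` is a hypothetical name, the register fragment is modelled as a dense per-thread vector, and fragment slot k is assumed to hold tile element thread_id + k * threads.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Host model of load_add for one thread: rmem[k] += gmem[thread_id + k * threads].
// T selects integer or floating addition, so it must be a concrete arithmetic type.
template <typename T>
void load_add_stripe(const std::vector<T>& gmem, std::vector<T>& rmem,
                     int thread_id, int threads) {
    static_assert(std::is_arithmetic<T>::value, "element type must be arithmetic");
    assert(threads > 0 && thread_id >= 0 && thread_id < threads);
    std::size_t slot = 0;
    for (std::size_t index = static_cast<std::size_t>(thread_id);
         index < gmem.size() && slot < rmem.size();
         index += static_cast<std::size_t>(threads), ++slot) {
        rmem[slot] = gmem[index] + rmem[slot];
    }
}
```

The result lands in the register fragment, matching the pseudocode above, and global memory is left untouched.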

StreamK and Split-K Use

StreamK and split-K reach for block-striped ops in their epilogues. Partial CTAs write accumulator fragments into a workspace. The final reducer CTA loads the partials, accumulates them, and stores or atomically reduces the result. The block-striped mapping pins every thread to a deterministic stripe of the accumulator tile.
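The reducer side of that flow can be simulated on the host. The sketch assumes a workspace layout of one full-tile slab per partial CTA, which this page does not specify, and `reduce_partials` is a hypothetical name; the outer loop stands in for the threads of the reducer CTA.

```cpp
#include <cassert>
#include <vector>

// Host model of the final reducer: each simulated thread walks its block-striped
// indices and sums the matching element from every partial slab in the workspace.
std::vector<float> reduce_partials(const std::vector<std::vector<float>>& workspace,
                                   int threads) {
    assert(threads > 0 && !workspace.empty());
    const int elements = static_cast<int>(workspace.front().size());
    std::vector<float> tile(elements, 0.0f);
    for (int thread_id = 0; thread_id < threads; ++thread_id) {
        for (int index = thread_id; index < elements; index += threads) {
            float acc = 0.0f;
            for (const auto& partial : workspace) {
                acc += partial[index];
            }
            tile[index] = acc;  // each index is written by exactly one thread
        }
    }
    return tile;
}
```

Because each index belongs to exactly one thread, the accumulation order per element is fixed by the slab order, independent of the thread count.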

Invariants

Sequence-barrier state depth matches the sequence-barrier value.
Sequence-barrier arrive advances index and flips phase on wraparound.
Block-striped operands include one register-memory and one global-memory object.
Block-striped element widths are at least sixteen bits.
Block-striped shapes are static.
Reduce is side-effecting and must not be commoned or removed.
Load/add/store variants choose integer or floating addition from element type.
cutlass.bar arrive-count is a multiple of warp size and the op is in warp-cooperative scope.

Cross-References

cutlass Pipeline and Tile Scheduler — Pipeline Model covers the richer producer/consumer protocol the simpler seq-bar ring sits next to, and Producer Acquire 4-Phase Lowering covers the NamedBarrier-pool consumer that seq-bar init shares slots with. mbarrier State Machine covers the per-stage mbarrier protocol the seq-bar slots reuse. cutlass Dialect Overview — Block-Striped Operand Checkers lists the four block-striped operand-layout checkers this page's verifier walks.

Tileiras Internals