nv_tileas Operation Roster and Builders

Abstract

nv_tileas is the operational surface for async scheduling, tiled memory movement, layout conversion, TMA descriptor use, and scheduled tile compute. This page lists the operation families, explains which attributes belong to the public contract, and describes the builder helpers used by scheduling and materialization passes.

The useful reference is semantic. The binary holds plenty of generated registration thunks, but a reimplementation only needs the operation names, operand/result contracts, attributes, and builder behavior described here.

Operation Families

| Family | Operations | Purpose |
|---|---|---|
| async pipeline | async.pipeline.create_pipeline, create_iterator, inc_iter, produce_one, produce_one_async, consume_one, consume_one_async, producer_acquire, producer_write, producer_commit, consumer_wait, consumer_read, consumer_release, agent_switch, async.pipeline.yield | producer/consumer pipeline regions, stage iteration, ownership handshakes, and agent partitioning |
| async tokens | async.wait, async.future_wait, async.to_async, async.token_to_async, create_none | async completion, token bridging, and placeholder values |
| tiled memops | tiled_load, tiled_store, tiled_atomic_rmw, async.tiled_load, async.copy, async.load, async.store, copy, load, store, gather_load, scatter_store | token-ordered and async memory movement |
| tensor slices | alloc_tensor, extract_slice, insert_slice, async.extract_slice, async.insert_slice | local tile storage and shape manipulation |
| layout | convert_layout, view, expand_dims, reinterpret, shuffle, generate | layout conversion, value views, and generated tile bodies |
| TMA | make_tiled_tma_desc, async.tiled_tma_load, async.tiled_tma_store, async.gather_tma_load, async.scatter_tma_store | TMA descriptor construction and async tensor bulk copies |
| compute | dot, async.dot, reduce, scan | MMA and region-bearing reduction operations |
| control and metadata | yield, pragma, cancel_next_program_id, async.cancel_next_program_id | region termination, optimizer directives, and scheduling control |

Attribute Roster

| Attribute | Owner concepts | Meaning |
|---|---|---|
| atom | copy, dot, tiled memory, TMA, gather/scatter | selects copy, MMA, TMA, or reduce atom |
| padding_value | gather/load/store variants | value used when an access is out of bounds |
| consumer_idx | consumer wait/read paths | selects a consumer inside a consumer group |
| ocgEnterDirectives | pragma | optimizer-control directives active on entry |
| ocgLeaveDirectives | pragma | optimizer-control directives active on exit |
| operandSegmentSizes | segmented memops and descriptor ops | separates view, coordinate, offset, token, and metadata operands |
| memory semantic/scope attrs | tiled memory operations | ordering and visibility contract |
| in-bounds attrs | loads and stores | per-dimension bounds information |

Attributes belong to the operation contract. Pattern rewrites may remove stale caches, but they must preserve semantic attributes unless they replace the operation with a semantically equivalent form.

PipelineOp Enum

The nv_tileas.async.pipeline.* op family is a closed 16-entry enumeration. Each entry pairs with a single builder helper and a fixed OperationState shape, so a reimplementation can drive the entire family from one indexed dispatch instead of per-op registration code. Entries 0..14 are active; entry 15 is reserved.

| # | Mnemonic | OperationState |
|---|---|---|
| 0 | nv_tileas.async.pipeline.create_pipeline | 6 named operands: numStages (i32), bufferView, producerGroupId (u8), consumerGroupId (u8), sharedMem (bool), dynamic (bool) |
| 1 | nv_tileas.async.pipeline.produce_one | 1 region op |
| 2 | nv_tileas.async.pipeline.produce_one_async | 1 region op |
| 3 | nv_tileas.async.pipeline.consume_one | 1 region op + consumer_idx i32 attr |
| 4 | nv_tileas.async.pipeline.consume_one_async | 1 region op |
| 5 | nv_tileas.async.pipeline.consumer_read | scalar op + consumer_idx i32 attr |
| 6 | nv_tileas.async.pipeline.producer_write | scalar op |
| 7 | nv_tileas.async.pipeline.producer_acquire | scalar op |
| 8 | nv_tileas.async.pipeline.producer_commit | scalar op |
| 9 | nv_tileas.async.pipeline.consumer_wait | scalar op |
| 10 | nv_tileas.async.pipeline.consumer_release | scalar op |
| 11 | nv_tileas.async.pipeline.yield | variadic terminator |
| 12 | nv_tileas.async.pipeline.inc_iter | scalar op |
| 13 | nv_tileas.async.pipeline.create_iterator | scalar op |
| 14 | nv_tileas.async.pipeline.agent_switch | variadic body builder: num_agents_per_group i32, max_regs per-agent list, isolated bool |
| 15 | (reserved) | |
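
The indexed dispatch mentioned above can be pictured as one constant table keyed by the enum index. The following is a minimal sketch, not the binary's actual layout: the names PipelineOpInfo, StateShape, kPipelineOps, and lookup_pipeline_op are hypothetical, and only the indices, mnemonics, and consumer_idx markings are taken from the table.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical classification of the OperationState shapes listed in the table.
enum class StateShape { NamedOperands, Region, Scalar, VariadicTerminator, VariadicBody, Reserved };

// Hypothetical per-entry record; not a structure recovered from the binary.
struct PipelineOpInfo {
    std::string_view mnemonic;  // empty for the reserved slot
    StateShape shape;
    bool has_consumer_idx;      // true where the table lists a consumer_idx attr
};

// One row per enum index, in table order (0..15).
inline constexpr std::array<PipelineOpInfo, 16> kPipelineOps = {{
    {"nv_tileas.async.pipeline.create_pipeline", StateShape::NamedOperands, false},
    {"nv_tileas.async.pipeline.produce_one", StateShape::Region, false},
    {"nv_tileas.async.pipeline.produce_one_async", StateShape::Region, false},
    {"nv_tileas.async.pipeline.consume_one", StateShape::Region, true},
    {"nv_tileas.async.pipeline.consume_one_async", StateShape::Region, false},
    {"nv_tileas.async.pipeline.consumer_read", StateShape::Scalar, true},
    {"nv_tileas.async.pipeline.producer_write", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.producer_acquire", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.producer_commit", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.consumer_wait", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.consumer_release", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.yield", StateShape::VariadicTerminator, false},
    {"nv_tileas.async.pipeline.inc_iter", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.create_iterator", StateShape::Scalar, false},
    {"nv_tileas.async.pipeline.agent_switch", StateShape::VariadicBody, false},
    {"", StateShape::Reserved, false},
}};

// Returns the entry for an active index; the reserved slot and
// out-of-range indices both yield std::nullopt.
inline std::optional<PipelineOpInfo> lookup_pipeline_op(std::size_t index) {
    if (index >= kPipelineOps.size()) return std::nullopt;
    const PipelineOpInfo &info = kPipelineOps[index];
    if (info.shape == StateShape::Reserved) return std::nullopt;
    return info;
}
```

A builder front end could switch on the returned shape and forward to the matching helper, which keeps the reserved entry from ever reaching a builder.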

Two builders deserve individual notes. create_pipeline is the largest builder because each of its six named operands runs through the named-operand helper before the state populates; the names ride along with the operation so they reappear in IR-printed form rather than as positional %0..%5 references. agent_switch is variadic in agent-body count: the emitted operation state carries an arbitrary number of regions, one per agent, plus the num_agents_per_group count, a DenseI32ArrayAttr of per-agent max_regs budgets, and an isolated boolean that controls whether an agent's region sees the surrounding SSA scope.

The region-op verifiers attached to the produce/consume variants and the yield are documented in Verifiers — Region-Op Verifier Template. The operation-state trailing-objects layout each builder fills in is documented in Operation Layout — TrailingObjects Decoder.

Worked Example: Producer/Consumer Pipeline Region

A representative two-stage pipeline loads a tile through TMA in the producer region, then waits for that tile and feeds it to a dot in the consumer region:

// Build the pipeline. numStages=2, one producer, one consumer.
%prod_tok, %cons_tok = nv_tileas.async.pipeline.create_pipeline %buf_view
    { numStages       = 2 : i32,
      producerGroupId = 0 : i8,
      consumerGroupId = 1 : i8,
      sharedMem       = true,
      dynamic         = false }
    : !nv_tileaa.tiled_view<2x128x128xf16>
    -> !nv_tileas.async.pipeline.producer_token, !nv_tileas.async.pipeline.consumer_token

// Stage iterator
%iter = nv_tileas.async.pipeline.create_iterator %prod_tok
    : !nv_tileas.async.pipeline.producer_token -> !nv_tileas.async.pipeline.iterator<tile<128x128xf16>>

// Producer region — TMA loads, one per stage
%prod_tok2 = nv_tileas.async.pipeline.produce_one %prod_tok, %iter
    { producer_types = [tile<128x128xf16>] } : (
    !nv_tileas.async.pipeline.producer_token,
    !nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
) -> !nv_tileas.async.pipeline.producer_token {
^bb0(%stage_buf : tile<128x128xf16>):
    %async_tok = nv_tileas.async.tiled_tma_load
        %tma_desc, %stage_buf[%k_outer]
        { atom = #nv_tileas<atom tma_load_2d>,
          operandSegmentSizes = array<i32: 1, 1, 1, 1> }
        : !cute_nvgpu.tma_descriptor_tiled, !nv_tileaa.tiled_view<128x128xf16>,
          index, !nv_tileaa.mem_token
        -> !async.value<tile<128x128xf16>>
    nv_tileas.async.pipeline.yield %stage_buf : tile<128x128xf16>
}

// Consumer region — wait for stage, dot, release
%cons_tok2 = nv_tileas.async.pipeline.consume_one %cons_tok, %iter
    { consumer_idx   = 0 : i32,
      consumer_types = [tile<128x128xf16>] } : (
    !nv_tileas.async.pipeline.consumer_token,
    !nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
) -> !nv_tileas.async.pipeline.consumer_token {
^bb0(%a_tile : tile<128x128xf16>):
    %waited = nv_tileas.async.pipeline.consumer_wait %cons_tok, %iter
        { consumer_idx = 0 : i32 }
        : !nv_tileas.async.pipeline.consumer_token,
          !nv_tileas.async.pipeline.iterator<tile<128x128xf16>>
        -> !nv_tileas.async.pipeline.consumer_token
    %d = nv_tileas.dot %a_tile, %b_tile, %acc
        { atom = #nv_tileas<atom mma_f16_f16_f32> }
        : tile<128x128xf16>, tile<128x128xf16>, tile<128x128xf32>
        -> tile<128x128xf32>
    %released = nv_tileas.async.pipeline.consumer_release %waited
        : !nv_tileas.async.pipeline.consumer_token
        -> !nv_tileas.async.pipeline.consumer_token
    nv_tileas.async.pipeline.yield %a_tile : tile<128x128xf16>
}

The pipeline state attribute on create_pipeline records the stage count, the producer/consumer agent group ids, the buffer view, and the sharedMem flag that pins per-stage storage to shared memory. The producer_types and consumer_types attributes on the region ops match the producer token's payload type list, which is what the region-op verifier checks before lowering. The mbarrier slot the TMA load deposits into is the consumer's stage barrier; consumer_wait observes the same barrier and consumer_release returns the stage to the producer pool. The iterator rotates through numStages stages and is incremented per outer loop iteration through nv_tileas.async.pipeline.inc_iter.
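
The stage rotation described above behaves like a ring counter. A minimal sketch of that behavior, assuming a plain struct stands in for the iterator value: StageIterator, make_iterator, and the wraps field are hypothetical names, and the wrap counter is an assumption added for illustration rather than a documented field of the iterator type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the pipeline iterator; field names are illustrative.
struct StageIterator {
    std::uint32_t num_stages;  // numStages from create_pipeline
    std::uint32_t stage = 0;   // current stage slot, always < num_stages
    std::uint32_t wraps = 0;   // completed passes over the ring (assumption)
};

// Models create_iterator: start at stage 0.
inline StageIterator make_iterator(std::uint32_t num_stages) {
    assert(num_stages > 0 && "a pipeline needs at least one stage");
    return StageIterator{num_stages};
}

// Models inc_iter: advance one stage and wrap at num_stages.
inline StageIterator inc_iter(StageIterator it) {
    it.stage += 1;
    if (it.stage == it.num_stages) {
        it.stage = 0;
        it.wraps += 1;
    }
    return it;
}
```

With numStages = 2, two calls to inc_iter return the iterator to stage 0, which is the slot the producer reuses once the consumer has released it.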

TMA Op Operand/Result Tables

nv_tileas.make_tiled_tma_desc

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | global view | tiled_view with GMEM residency tag | yes | residency is read from the view's address-space attribute, not the SSA type; element stride must equal 1 |
| operand 1..R | box dims | index | yes (R = atom box rank) | per-axis box size |
| result 0 | descriptor | nv_tileas.tma_desc | yes | consumed by async.tiled_tma_load/_store |
| attr atom | atom | TMA load or store atom | yes | drives kind selection |
| attr swizzle_mode | enum | none\|32B\|64B\|128B | optional | shared-memory swizzle |
| attr oob_mode | enum | zero\|nan\|constant | optional | out-of-bounds behavior |

nv_tileas.async.tiled_tma_load

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | descriptor | tma_desc | yes | from make_tiled_tma_desc |
| operand 1 | shared destination | tiled_view with SMEM residency tag | yes | residency read from the view's address-space attribute; TMA-compatible swizzled layout |
| operand 2..R+1 | coords | index | yes | per-axis source coordinate |
| operand R+2 | barrier | mem_token | yes | mbarrier for completion |
| result 0 | async token | AsyncTokenType | yes | observed by async.wait |
| attr atom | atom | TMA load atom | yes | matches descriptor atom kind |
| attr padding_value | typed attr | element-typed scalar | optional | floating-point only |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {desc, dst, coords, barrier} |

nv_tileas.async.tiled_tma_store

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | descriptor | tma_desc | yes | TMA store kind |
| operand 1 | shared source | tiled_view (shared) | yes | TMA-compatible swizzled layout |
| operand 2..R+1 | coords | index | yes | per-axis destination coordinate |
| result 0 | async token | AsyncTokenType | yes | |
| attr atom | atom | TMA store atom | yes | |
| attr operandSegmentSizes | dense i32 | length 3 | yes | {desc, src, coords} |

nv_tileas.async.gather_tma_load / scatter_tma_store

The discontiguous TMA variants take a per-lane coordinate tile (gather) or per-lane address tile (scatter) on top of the contiguous operands, and reject modes the descriptor doesn't support. Their attribute sets mirror the contiguous variants — gather_tma_load accepts padding_value, scatter_tma_store rejects it.

LogicalResult verify_make_tiled_tma_desc(MakeTmaDescOp op) {
    require(op.atom().is_tma());
    require(op.box_dims().size() == op.atom().box_rank());
    require(op.global_view().element_stride() == 1);
    require_descriptor_alignment(op.global_view().base());
    require_captures_are_descriptor_abi_compatible(op);
    return success();
}
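
The structural checks in that verifier can be exercised on plain data. This is a minimal sketch under stated assumptions: TmaAtom, GlobalView, and MakeTmaDescRequest are hypothetical stand-ins for the op accessors, the 16-byte alignment constant is an assumption (the text above only says an alignment check exists), and the descriptor-ABI capture check is left out because its rule is not described here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain-data stand-ins for the op's accessors.
struct TmaAtom {
    bool is_tma;
    std::size_t box_rank;
};

struct GlobalView {
    std::int64_t element_stride;
    std::uint64_t base_address;
};

struct MakeTmaDescRequest {
    TmaAtom atom;
    GlobalView view;
    std::vector<std::int64_t> box_dims;
};

// Assumption: 16-byte base alignment; the required value is not stated above.
constexpr std::uint64_t kAssumedDescriptorAlignment = 16;

// Returns an empty string on success, otherwise the first failed check.
inline std::string verify_make_tiled_tma_desc(const MakeTmaDescRequest &req) {
    if (!req.atom.is_tma) return "atom is not a TMA atom";
    if (req.box_dims.size() != req.atom.box_rank)
        return "box dim count does not match atom box rank";
    if (req.view.element_stride != 1) return "element stride must equal 1";
    if (req.view.base_address % kAssumedDescriptorAlignment != 0)
        return "global base is misaligned";
    return {};
}
```

Returning the first failure as text keeps the check order visible, which matters when a reimplementation tries to reproduce diagnostics in the same order as the original verifier.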

Pipeline Op Operand/Result Tables

nv_tileas.async.pipeline.create_pipeline

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | buffer view | tiled_view | yes | stage-local storage view |
| result 0 | producer token | PipelineProducerTokenType | yes | feeds producer_acquire |
| result 1 | consumer token | PipelineConsumerTokenType | yes | feeds consumer_wait |
| attr numStages | | i32 | yes | stage count |
| attr producerGroupId | | u8 | yes | agent group emitting producers |
| attr consumerGroupId | | u8 | yes | agent group emitting consumers |
| attr sharedMem | | bool | optional | stage storage lives in shared memory |
| attr dynamic | | bool | optional | dynamic stage indexing |

nv_tileas.async.pipeline.produce_one / produce_one_async

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | producer token | PipelineProducerTokenType | yes | input ownership |
| operand 1 | iterator | PipelineIteratorType | yes | stage indexing |
| region 0 | body | producer | yes | terminated by async.pipeline.yield |
| result 0 | producer token | PipelineProducerTokenType | yes | returned to caller |
| result 1 | async token | AsyncTokenType | async variant only | completion of async producer work |
| attr producer_types | typed array | | yes | element-type list yielded by body |

nv_tileas.async.pipeline.consume_one / consume_one_async

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | consumer token | PipelineConsumerTokenType | yes | input ownership |
| operand 1 | iterator | PipelineIteratorType | yes | |
| region 0 | body | consumer | yes | terminated by async.pipeline.yield |
| result 0 | consumer token | PipelineConsumerTokenType | yes | |
| result 1 | async token | AsyncTokenType | async variant only | |
| attr consumer_idx | | i32 | yes | selects a consumer in consumer group |
| attr consumer_types | typed array | | yes | element-type list yielded by body |

nv_tileas.async.pipeline.producer_acquire / producer_commit / consumer_wait / consumer_release

| Op | Operands | Result 0 | Notes |
|---|---|---|---|
| producer_acquire | producer token + iterator | producer token | grants stage ownership |
| producer_commit | producer token | producer token | publishes stage |
| consumer_wait | consumer token + iterator | consumer token | observes commit |
| consumer_release | consumer token | consumer token | returns stage to pool |

consumer_wait and consumer_read additionally carry the consumer_idx i32 attribute that maps the wait to a specific consumer inside the consumer group.
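
The four handshake ops form a cycle over each stage: acquire, commit, wait, release. A minimal sketch of that ownership cycle as a per-stage state machine follows. PipelineModel and the StageState names are hypothetical, and a real consumer_wait blocks until the commit is visible, whereas this model simply reports whether the transition is legal.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ownership states for one pipeline stage.
enum class StageState { Free, Acquired, Committed, Held };

// Toy model of the handshake; each call returns false when the op would be
// out of order for that stage (a real consumer_wait would block instead).
class PipelineModel {
public:
    explicit PipelineModel(std::size_t num_stages)
        : stages_(num_stages, StageState::Free) {}

    bool producer_acquire(std::size_t s) { return step(s, StageState::Free, StageState::Acquired); }
    bool producer_commit(std::size_t s) { return step(s, StageState::Acquired, StageState::Committed); }
    bool consumer_wait(std::size_t s) { return step(s, StageState::Committed, StageState::Held); }
    bool consumer_release(std::size_t s) { return step(s, StageState::Held, StageState::Free); }

    StageState state(std::size_t s) const { return stages_.at(s); }

private:
    // Moves stage s from `from` to `to`, or leaves it unchanged and fails.
    bool step(std::size_t s, StageState from, StageState to) {
        if (s >= stages_.size() || stages_[s] != from) return false;
        stages_[s] = to;
        return true;
    }

    std::vector<StageState> stages_;
};
```

A model like this is handy as a test oracle: replaying the op sequence a pass emits against it flags a release without a prior wait, or a second acquire on a stage that was never released.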

nv_tileas.async.pipeline.yield

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0.. | yielded values | variadic | yes | operand types match enclosing region's result types |

nv_tileas.async.pipeline.create_iterator / inc_iter

| Op | Operand 0 | Result 0 | Notes |
|---|---|---|---|
| create_iterator | pipeline value | PipelineIteratorType | rotates through numStages stages |
| inc_iter | iterator | iterator | advances to next stage |

LogicalResult verify_pipeline_handshake(Operation op) {
    require_token_kind(op, op.operand(0));
    require_iterator_type_payload_matches(op.region(0), op.producer_types_attr());
    require_region_terminator_is(op.region(0), "nv_tileas.async.pipeline.yield");
    require_yield_operand_types_match_results(op.region(0), op.result_types());
    return success();
}

Pipeline Builders

Pipeline builders create region operations and token handshakes. A good implementation exposes small helper functions instead of forcing every pass to build raw operation states.

ProduceOneOp build_produce_one(Rewriter *rw,
                               Location loc,
                               ProducerToken token,
                               PipelineIterator iter,
                               TypeRange result_types,
                               RegionBuilder body) {
    ProduceOneOp op = rw->create<ProduceOneOp>(loc, result_types, token, iter);
    body(op.body(), op.region_arguments());
    ensure_pipeline_yield(op.body());
    return op;
}

ConsumeOneOp build_consume_one(Rewriter *rw,
                               Location loc,
                               ConsumerToken token,
                               PipelineIterator iter,
                               uint32_t consumer_idx,
                               TypeRange result_types,
                               RegionBuilder body) {
    ConsumeOneOp op = rw->create<ConsumeOneOp>(loc, result_types, token, iter);
    op.set_consumer_idx(consumer_idx);
    body(op.body(), op.region_arguments());
    ensure_pipeline_yield(op.body());
    return op;
}

agent_switch is variadic in agent body count and carries per-agent register-budget data. The builder keeps body regions, group counts, and max-register lists together so execution-unit propagation can reason about them.
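
One way to keep those pieces together is a single state record with a consistency check. This is a sketch only: AgentSwitchState and check_agent_switch are hypothetical, and every rule in the check (one max_regs entry per agent region, positive counts and budgets) is an assumption inferred from the "one per agent" wording, not a documented verifier rule.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bundle of the data the agent_switch builder keeps together.
struct AgentSwitchState {
    std::int32_t num_agents_per_group;
    std::vector<std::int32_t> max_regs;  // per-agent register budgets
    bool isolated;                       // whether regions see the enclosing SSA scope
    std::size_t num_regions;             // one body region per agent
};

// Returns an empty string when the record is self-consistent.
// All rules below are assumptions made for illustration.
inline std::string check_agent_switch(const AgentSwitchState &s) {
    if (s.num_regions == 0) return "agent_switch needs at least one agent region";
    if (s.max_regs.size() != s.num_regions)
        return "max_regs must list one budget per agent region";
    if (s.num_agents_per_group <= 0) return "num_agents_per_group must be positive";
    for (std::int32_t budget : s.max_regs) {
        if (budget <= 0) return "register budgets must be positive";
    }
    return {};
}
```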

Tiled Memop Operand/Result Tables

The tiled memory family shares one segmented operand layout. operandSegmentSizes separates view, coordinate, offset, token, and optional padding/mask operands so the verifier walks each slice without re-parsing the op.
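
Walking the segmented operand list amounts to taking prefix sums over operandSegmentSizes. The following minimal sketch covers the four-segment {view, coords, offsets, token} form; Segment, TiledLoadSegments, and split_tiled_load_operands are hypothetical names, and the optional fifth mask segment is not modeled.

```cpp
#include <array>
#include <cstddef>
#include <optional>

// Half-open operand range [begin, begin + size).
struct Segment {
    std::size_t begin;
    std::size_t size;
};

struct TiledLoadSegments {
    Segment view, coords, offsets, token;
};

// Splits a flat operand list according to {view, coords, offsets, token}.
// Returns std::nullopt when the sizes are inconsistent with the operand
// count, the view rank, or the "zero or one token" rule.
inline std::optional<TiledLoadSegments> split_tiled_load_operands(
    const std::array<int, 4> &sizes, std::size_t operand_count, std::size_t view_rank) {
    std::size_t total = 0;
    for (int s : sizes) {
        if (s < 0) return std::nullopt;
        total += static_cast<std::size_t>(s);
    }
    if (total != operand_count) return std::nullopt;
    if (sizes[0] != 1) return std::nullopt;                                   // exactly one view
    if (static_cast<std::size_t>(sizes[1]) != view_rank) return std::nullopt; // one coord per axis
    if (sizes[3] > 1) return std::nullopt;                                    // zero or one token

    TiledLoadSegments out{};
    Segment *slots[4] = {&out.view, &out.coords, &out.offsets, &out.token};
    std::size_t cursor = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        *slots[i] = Segment{cursor, static_cast<std::size_t>(sizes[i])};
        cursor += slots[i]->size;
    }
    return out;
}
```

An empty segment keeps its position in the list: with sizes {1, 2, 0, 1}, the offsets segment starts at operand 3 with size 0 and the token occupies operand 3.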

Throughout the tables below, the SSA operand type is tiled_view<…> (a TileAS dialect type, not the MLIR built-in memref). Residency — RMEM, SMEM, TMEM, or GMEM — is an attribute on the tiled_view type, not encoded in the SSA type name. Verifier rules that say "shared" or "global" inspect that address-space tag, not the SSA type; two operands that both type-print as tiled_view<128x128xf16> can disagree on residency and be rejected by the memory-space-pair check.

nv_tileas.tiled_load

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | nv_tileaa.tiled_view or nv_tileas.tiled_view | yes | source tile view |
| operand 1..R | coords | index | yes (R = view rank) | per-axis coordinate |
| operand R+1.. | offsets | index | optional | per-axis offset; segment may be empty |
| token slot | token | mem_token or async_token | optional | one or zero |
| result 0 | tile | tile<S × element> | yes | shape S = atom box shape |
| result 1 | token | mem_token or async_token | optional | present when token slot was supplied |
| attr atom | atom | AtomAttr | yes | selects copy/TMA atom |
| attr mem_semantic | enum | weak\|relaxed\|acquire | optional | acquire_release rejected |
| attr mem_scope | enum | tl_blk\|cluster\|gpu\|sys | required when semantic > weak | rejected when semantic = weak |
| attr in_bounds | dense bool | per-axis | optional | defaults to false |
| attr padding_value | typed attr | element-typed scalar | optional | only with in_bounds=false |
| attr operandSegmentSizes | dense i32 | length 4 or 5 | yes | {view, coords, offsets, token[, mask]} |

LogicalResult verify_tiled_load(TiledLoadOp op) {
    require_operand_segments(op, {1, op.view().rank(), -1, /*token*/ -1});
    require_optional_token(op);
    require_coordinate_types_match_index(op);
    require_tile_shape_matches_atom_box(op.atom(), op.result(0));
    require_tile_dimensions_power_of_two(op.result(0).shape());

    if (op.mem_semantic() == ACQUIRE_RELEASE) {
        return op.emit_error("tiled_load rejects acquire_release semantic");
    }
    require_scope_iff_non_weak(op.mem_semantic(), op.mem_scope());
    require_padding_only_when_not_in_bounds(op);
    return success();
}
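
The semantic and scope rules for loads reduce to a small predicate. This is a minimal sketch with hypothetical names (MemSemantic, MemScope, check_load_ordering). Rejecting release on a load is an assumption drawn from the table listing only weak, relaxed, and acquire; the pseudocode above checks acquire_release explicitly and nothing else.

```cpp
#include <optional>
#include <string>

enum class MemSemantic { Weak, Relaxed, Acquire, Release, AcquireRelease };
enum class MemScope { TlBlk, Cluster, Gpu, Sys };

// Returns an empty string when the (semantic, scope) pair is valid for a
// tiled load, otherwise a description of the violated rule.
inline std::string check_load_ordering(MemSemantic sem, std::optional<MemScope> scope) {
    if (sem == MemSemantic::AcquireRelease)
        return "tiled_load rejects acquire_release semantic";
    // Assumption: release is also invalid on a load, since the table lists
    // only weak|relaxed|acquire for this op.
    if (sem == MemSemantic::Release)
        return "tiled_load rejects release semantic";
    // Scope must be present exactly when the semantic is stronger than weak.
    const bool non_weak = (sem != MemSemantic::Weak);
    if (non_weak && !scope.has_value())
        return "mem_scope is required when semantic is stronger than weak";
    if (!non_weak && scope.has_value())
        return "mem_scope is rejected when semantic is weak";
    return {};
}
```

The store variant would follow the same shape with release allowed and acquire rejected, per the tiled_store table.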

nv_tileas.tiled_store

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | tiled_view | yes | destination tile view |
| operand 1 | value | tile<S × element> | yes | element type matches view element type |
| operand 2..R+1 | coords | index | yes | per-axis coordinate |
| operand R+2.. | offsets | index | optional | per-axis offset |
| token slot | token | mem_token or async_token | optional | |
| result 0 | token | mem_token or async_token | optional | mirrors input token slot |
| attr atom | atom | AtomAttr | yes | TMA store, register-to-global, etc. |
| attr mem_semantic | enum | weak\|relaxed\|release | optional | acquire and acquire_release rejected |
| attr mem_scope | enum | as above | required when semantic > weak | |
| attr in_bounds | dense bool | per-axis | optional | |
| attr padding_value | typed attr | element-typed scalar | optional | only with in_bounds=false |
| attr operandSegmentSizes | dense i32 | length 4 or 5 | yes | |

nv_tileas.tiled_atomic_rmw

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | tiled_view | yes | atomic destination |
| operand 1 | value | tile<S × element> | yes | RMW operand |
| operand 2..R+1 | coords | index | yes | per-axis coordinate |
| token slot | token | mem_token | optional | |
| result 0 | tile | tile<S × element> | yes | old value tile |
| result 1 | token | mem_token | optional | |
| attr atom | atom | AtomAttr | yes | |
| attr rmw_mode | enum | add\|and\|or\|xor\|xchg\|min\|max\|umin\|umax\|cmpxchg\|addf | yes | |
| attr mem_semantic | enum | full set | optional | matches CAS semantics |
| attr mem_scope | enum | as above | required when semantic > weak | |
| attr operandSegmentSizes | dense i32 | length 4 | yes | |

The atomic verifier also rejects 8-bit element types across all modes and rejects 16-bit integer atomics; 16-bit floating atomics restrict the mode set to add, max, min. The shared invariants for memory semantics, scope, and tile-shape validation appear in Verifiers.
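
The element-width rules above can be written as one predicate. This is a minimal sketch with hypothetical names (RmwMode, ElementType, atomic_rmw_type_allowed). It follows the text literally, so a 16-bit float accepts only add, max, and min; whether addf counts as add for that case is not stated, and the sketch treats it as not allowed. Restrictions on wider types are not modeled.

```cpp
// Mirrors the rmw_mode enum listed in the tiled_atomic_rmw table.
enum class RmwMode { Add, And, Or, Xor, Xchg, Min, Max, UMin, UMax, CmpXchg, AddF };

// Hypothetical element-type summary: width in bits plus a float flag.
struct ElementType {
    unsigned bit_width;
    bool is_float;
};

// Returns true when the (element type, mode) pair passes the width rules
// quoted above. Wider types are accepted without further checks here.
inline bool atomic_rmw_type_allowed(ElementType elem, RmwMode mode) {
    if (elem.bit_width == 8) return false;  // 8-bit rejected in every mode
    if (elem.bit_width == 16) {
        if (!elem.is_float) return false;   // 16-bit integer atomics rejected
        return mode == RmwMode::Add || mode == RmwMode::Max || mode == RmwMode::Min;
    }
    return true;
}
```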

Tiled Load and Store Builders

The most common composite builders emit a view followed by a tiled memory operation. They normalize rank and coordinate widths, attach operand segment sizes, and carry memory-ordering attributes through.

TiledLoadOp build_view_then_tiled_load(Rewriter *rw,
                                      Location loc,
                                      Value source,
                                      TileViewSpec view,
                                      TiledLoadAttrs attrs) {
    Value tile_view = rw->create<ViewOp>(loc, view.type, source, view.indices);
    return rw->create<TiledLoadOp>(
        loc,
        attrs.result_types,
        tile_view,
        attrs.coords,
        attrs.offsets,
        attrs.token,
        attrs.semantic_attrs());
}

TiledStoreOp build_view_then_tiled_store(Rewriter *rw,
                                        Location loc,
                                        Value value,
                                        Value destination,
                                        TileViewSpec view,
                                        TiledStoreAttrs attrs) {
    Value tile_view = rw->create<ViewOp>(loc, view.type, destination, view.indices);
    return rw->create<TiledStoreOp>(
        loc,
        tile_view,
        value,
        attrs.coords,
        attrs.offsets,
        attrs.token,
        attrs.semantic_attrs());
}

Scheduling preparation and materialization passes lean on these builders because they repeatedly need the same view-plus-memory-operation shape.

Dot and Mask Builders

Dot builders cover several recurring patterns:

  • allocate a zero accumulator and emit a dot;
  • wrap dot emission in scf.for and scf.if when a predicate or stage guard is needed;
  • synthesize a predicate mask, convert layout, and emit dot;
  • install dot simplification patterns for select-constant cases.

Value build_zero_accumulator_dot(Rewriter *rw,
                                 Location loc,
                                 DotInputs inputs,
                                 Type acc_type,
                                 AtomAttr atom) {
    Value acc = rw->create<AllocTensorOp>(loc, acc_type);
    Value zero = rw->create<arith::ConstantOp>(loc, zero_attr(acc_type));
    initialize_accumulator(rw, acc, zero);
    return rw->create<DotOp>(loc, inputs.a, inputs.b, acc, atom).result();
}

Dot builders preserve the atom and signedness attributes — later NVGPU/NVVM lowering uses them to pick the actual instruction.

Arithmetic Helper Builders

The builder library also ships thin wrappers for common arith operations: constants, add, multiply, subtract, signed division, signed max, and select. These helpers let composite TileAS builders materialize index math without depending on caller-specific boilerplate.

Value build_index_expr(Rewriter *rw, Value base, Value lane, Value stride) {
    Value scaled = rw->create<arith::MulIOp>(lane.get_loc(), lane, stride);
    return rw->create<arith::AddIOp>(base.get_loc(), base, scaled);
}

Wrappers must not add overflow or fast-math attributes unless the caller explicitly asks for them. Defaults belong to the arith dialect operation itself.

Schedule Infrastructure Builders

After schedule generation, three helper algorithms convert analysis into concrete IR:

| Helper | Purpose |
|---|---|
| materialize schedule | partitions resident and pending loads/stores/async roots from schedule analysis |
| build stages | turns union constraints into stage-ordered producer/consumer pairs |
| expand single tiled op | clones a tiled operation for each scheduled stage and rewires operands |

ScheduleMaterialization materialize_schedule(ScheduleAnalysis analysis, MaterializeOptions options) {
    ScheduleMaterialization out = {};
    out.resident_loads = compute_resident_loads(analysis, options);
    out.resident_stores = compute_resident_stores(analysis, options);
    out.pending_loads = expand_iteration_arguments(analysis, Side::Read);
    out.pending_stores = expand_iteration_arguments(analysis, Side::Write);
    out.resident_async = filter_async_eligible(out.resident_loads, options);
    out.pending_async = filter_async_eligible(out.pending_loads, options);
    return out;
}

Stage expansion needs two maps: one from original operands to their source operation, and one from each source operation to the per-stage replica. Those two maps are what let a single scheduled tiled operation become several stage-specific SSA operations without mixing operands from different stages.

void expand_single_tiled_op(TiledOp op, StageMap stages, Rewriter *rw) {
    OperandSourceMap sources = collect_operand_sources(op);
    ReplicaMap replicas = clone_op_per_stage(op, stages, rw);

    for (Operation *replica : replicas.values()) {
        for (OpOperand &operand : replica->get_op_operands()) {
            if (Value repl = lookup_stage_replacement(operand, sources, replicas)) {
                operand.set(repl);
            }
        }
    }
}
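
The rewiring step is easier to follow on a toy IR where an operand is named by the id of the op that produces it. This is a minimal sketch under that assumption: OpId, Op, ReplicaMap, and expand_per_stage are hypothetical, and the operand-to-source map collapses into the operand ids themselves, so only the replica map is explicit.

```cpp
#include <map>
#include <utility>
#include <vector>

using OpId = int;

// Toy op: an id plus operands, each named by the id of its producing op.
struct Op {
    OpId id;
    std::vector<OpId> operands;
};

// (original producer op, stage) -> replica of that producer for the stage.
using ReplicaMap = std::map<std::pair<OpId, int>, OpId>;

// Clones `op` once per stage and rewires each operand to the replica of its
// producer for the same stage. Operands whose producer has no replica for
// that stage are treated as stage-invariant and left untouched.
inline std::vector<Op> expand_per_stage(const Op &op, int num_stages,
                                        const ReplicaMap &producer_replicas,
                                        OpId first_new_id) {
    std::vector<Op> out;
    for (int stage = 0; stage < num_stages; ++stage) {
        Op replica{first_new_id + stage, op.operands};
        for (OpId &operand : replica.operands) {
            auto it = producer_replicas.find({operand, stage});
            if (it != producer_replicas.end()) operand = it->second;
        }
        out.push_back(replica);
    }
    return out;
}
```

Keying the replica map on the (producer, stage) pair is what prevents a stage-1 clone from picking up a stage-0 operand: the lookup can only ever return a replica from the clone's own stage.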

Cross-References

Verifiers lists the invariants that the operations defined here must satisfy, along with their verbatim diagnostics. Types describes the pipeline-token, iterator, and agent types that ride on these ops. Folds and Memory Consistency describes the rewrite shapes applied to the slice and structured-control scaffolding. The TileAA-side counterpart in nv_tileaa Operation Roster feeds these scheduling operations through the alias-aware lowering boundary.