nv_tileaa Operation Roster

Abstract

nv_tileaa is the alias-aware tile dialect between public cuda_tile IR and the lower nv_tileas scheduling dialect. Its operations keep the mathematical shape of the original program while making pointer provenance, memory ordering, queue flow, and plugin boundaries explicit enough for later passes to schedule and materialize. The operation surface is a reimplementation contract — what each family represents, which operands and attributes matter, and which invariants a verifier or lowering must preserve.

The roster groups operations by behavior, not by binary registration order. A reimplementation should track the family contracts and textual mnemonics, not the incidental internal layout used by one compiler build.

Semantic Families

| Family | Operations | Contract |
|---|---|---|
| Floating and integer tile math | addf, subf, mulf, divf, fma, sqrt, rsqrt, exp2, clampf, mulhiui | Elementwise arithmetic over scalar or tile-shaped values. Floating ops accept rounding or NaN propagation attributes where applicable. |
| Dot, convolution, and collectives | dot, conv_dot, reduce, scan, histogram | Preserve high-level collective math until TileAS can choose MMA, tensor-memory, or reduction pipelines. |
| Shape and tile construction | splat, broadcast, expand_dims, extract, extract_slice, view, cat, permute, make_range, generate, block_tile, conv_tile, get_dim_size | Express rank changes, indexing, view reinterpretation, generated tiles, and convolution blocking without hiding shape dependencies. |
| Pointer, memref, and type conversion | addptr, int_to_ptr, ptr_to_int, make_memref, bitcast, fp_to_fp, extern_elementwise, elementwise_inline_asm, call_elementwise_intrinsic | Convert public pointer and element-type concepts into explicit addressable objects and per-element escape hatches. |
| Memory effects | get_global, load, store, tiled_load, tiled_store, gather_load, scatter_store, atomic_cas, atomic_rmw, tiled_atomic_rmw | Represent memory traffic with visible masks, volatility, TMA eligibility, bounds facts, and token dependencies. |
| Tokens, assumptions, and lifetime hints | create_mem_token, join_mem_token, mark_for_reuse, assert, assume, optimization_barrier, pragma, message, print | Carry ordering, alias, reuse, diagnostics, and optimizer constraints as SSA-visible IR. |
| Functions, plugins, and launch structure | func, call, return, yield, global, launch_func, execute, plugin, inject_ir | Provide the internal symbol, function, launch, and extension shell that survives until queue and plugin lowering. |
| Grid and queue flow | get_program_id, get_num_programs, is_valid_program_id, cancel_next_program_id, create_queue, queue.get, queue.put, queue.yield | Model program-grid queries and queue dataflow before they become TileAS async pipeline regions. |

Core Operation Contracts

Arithmetic

Elementwise arithmetic is shape-preserving. The verifier rejects operand sets that cannot be broadcast or matched by the dialect's normal shape rules. For floating operations, the result element type is the operand element type; fp_to_fp is the explicit conversion boundary and must be the only operation that changes floating-point width.

TileValue verify_elementwise_arithmetic(Op op) {
    Shape result_shape = infer_common_shape(op.operands);
    ElementType type = infer_common_element_type(op.operands);

    require_all_operands_compatible(op.operands, result_shape, type);
    require_rounding_mode_if_needed(op);

    return TileValue(type, result_shape);
}

mulhiui is an unsigned high-half multiply: for each lane, multiply the zero-extended operands at double width and return the upper half. Never model it as ordinary signed multiplication followed by a shift.

uintN mulhiui(uintN a, uintN b) {
    uint2N wide = zero_extend(a) * zero_extend(b);
    return truncate_to_N_bits(wide >> N);
}
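As a concrete instance of the rule above for 32-bit lanes, the following C++ sketch contrasts the unsigned high-half multiply with the signed model the text warns against. The names mulhi_u32 and mulhi_s32_wrong are illustrative only and are not part of the dialect.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: unsigned high-half multiply for one 32-bit lane.
inline std::uint32_t mulhi_u32(std::uint32_t a, std::uint32_t b) {
    // Zero-extend both operands to 64 bits, multiply, keep the upper half.
    std::uint64_t wide = static_cast<std::uint64_t>(a) * static_cast<std::uint64_t>(b);
    return static_cast<std::uint32_t>(wide >> 32);
}

// Counter-example: sign-extending first gives a different upper half.
inline std::uint32_t mulhi_s32_wrong(std::uint32_t a, std::uint32_t b) {
    std::int64_t wide = static_cast<std::int64_t>(static_cast<std::int32_t>(a)) *
                        static_cast<std::int64_t>(static_cast<std::int32_t>(b));
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(wide) >> 32);
}
```

For a = b = 0xFFFFFFFF the unsigned form yields 0xFFFFFFFE, while the signed model treats both operands as -1 and yields 0.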

Dot and Convolution

nv_tileaa.dot

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | A | tile<M × K × elem_a> | yes | rank-2 or rank-3 (batched) |
| operand 1 | B | tile<K × N × elem_b> | yes | K dimension agrees with A |
| operand 2 | C | tile<M × N × elem_c> | yes | accumulator |
| operand 3 | sfa | tile<scale × E8M0> | block-scaled only | scale factors for A |
| operand 4 | sfb | tile<scale × E8M0> | block-scaled only | scale factors for B |
| result 0 | D | tile<M × N × elem_c> | yes | same shape as C |
| attr signedness_a | enum | signed, unsigned | integer MMA | |
| attr signedness_b | enum | signed, unsigned | integer MMA | |
| attr propagate_nan | bool | | optional | floating-point only |
| attr operandSegmentSizes | dense i32 | length 5 | yes | {A, B, C, sfa, sfb} |

dot abstracts matrix multiply-accumulate. It accepts ordinary float and integer MMA shapes and carries the scale-factor operands needed for Blackwell-style block-scaled MMA. The verifier owns four properties:

  • A and B use a legal paired element type.
  • The accumulator/result type is legal for that pair.
  • Integer operands agree on bit width and signedness rules.
  • Block-scaled forms are gated on a target that supports them.

LogicalResult verify_dot(DotOp op, Target target) {
    MmaShape shape = infer_mma_shape(op.a, op.b, op.c);
    require_compatible_contracting_dims(shape);

    if (op.has_scale_factors) {
        require(target.supports_block_scaled_mma);
        require_legal_scale_factor_types(op.sfa, op.sfb, shape);
    }

    require_legal_mma_element_tuple(op.a.type, op.b.type, op.c.type, op.result.type);
    require_signedness_attrs_for_integer_mma(op);
    return success();
}
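The arithmetic that dot abstracts can be stated as a scalar reference. The sketch below is only a model of D = A × B + C for the rank-2, non-block-scaled float case; the Matrix type and reference_dot function are hypothetical names introduced for illustration, and the real operation works on tile types rather than host vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix used only for this illustration.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Reference D = A * B + C. Returns false, leaving d untouched, when the
// shapes violate the M x K, K x N, M x N contract from the table above.
inline bool reference_dot(const Matrix& a, const Matrix& b, const Matrix& c, Matrix& d) {
    if (a.cols != b.rows) return false;                      // K must agree
    if (c.rows != a.rows || c.cols != b.cols) return false;  // accumulator is M x N
    d = c;                                                   // result has the shape of C
    for (std::size_t i = 0; i < a.rows; ++i) {
        for (std::size_t j = 0; j < b.cols; ++j) {
            float acc = c.at(i, j);
            for (std::size_t k = 0; k < a.cols; ++k) {
                acc += a.at(i, k) * b.at(k, j);
            }
            d.at(i, j) = acc;
        }
    }
    return true;
}
```

A lowering is free to pick any MMA instruction shape as long as the observable result matches this contract up to the precision rules of the chosen element types.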

conv_dot, conv_tile, and block_tile keep convolution lowering structured. The key behavior is not a special address calculation — it is preserving padding, activation layout, and tile-blocking facts until the memory layout pass can pick the right producer and consumer layouts.

The element-type rules that govern legal (A, B, C) tuples — FP8 e4m3 and e5m2, block-scaled MX-FP and NV-FP4, and the f32 accumulator requirement on narrow-precision inputs — are documented in Fast-Math and Numerical Precision.

Shape Operations

Shape operations stay cheap, explicit, and canonicalizable. view changes interpretation without changing elements, expand_dims inserts size-one axes, broadcast repeats values across larger axes, and extract or extract_slice projects subshapes. splat is the scalar-to-tile constructor and the canonical sink for many reshape folds.

Shape infer_shape(Op op) {
    switch (op.kind) {
    case SPLAT:
        return op.result_shape;
    case EXPAND_DIMS:
        return insert_axes(op.input.shape, op.axes, size_one_axes());
    case BROADCAST:
        require_can_broadcast(op.input.shape, op.result_shape);
        return op.result_shape;
    case VIEW:
        require_same_element_count(op.input.shape, op.result_shape);
        return op.result_shape;
    case EXTRACT:
        return remove_indexed_axes(op.input.shape, op.indices);
    case EXTRACT_SLICE:
        return op.slice_shape;
    default:
        return infer_from_traits(op);
    }
}
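The helper predicates named in the pseudocode above can be made concrete. The following C++ sketch assumes a same-rank broadcast rule (each input axis either matches the result axis or has size one) and assumes expand_dims axes refer to positions in the result shape, listed in ascending order; neither assumption is confirmed by the dialect definition, and all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Assumed rule: same rank, and each input axis matches or has size one.
inline bool can_broadcast(const Shape& in, const Shape& out) {
    if (in.size() != out.size()) return false;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != out[i] && in[i] != 1) return false;
    }
    return true;
}

// Insert a size-one axis at each listed position. Positions are assumed
// valid, ascending, and expressed in result-shape coordinates.
inline Shape expand_dims(Shape shape, const std::vector<std::size_t>& axes) {
    for (std::size_t axis : axes) {
        shape.insert(shape.begin() + static_cast<std::ptrdiff_t>(axis), std::int64_t{1});
    }
    return shape;
}

inline std::int64_t element_count(const Shape& shape) {
    std::int64_t n = 1;
    for (std::int64_t dim : shape) n *= dim;
    return n;
}

// view reinterprets elements, so only the total element count must match.
inline bool can_view(const Shape& in, const Shape& out) {
    return element_count(in) == element_count(out);
}
```

Under these assumptions a 4x8 tile can be viewed as a 32-element tile, and a 1x128 tile can be broadcast to 64x128 but a 2x128 tile cannot.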

Pointer and Memref Construction

nv_tileaa.addptr

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or tile<ptr> | yes | preserves address space |
| operand 1 | offset | index or tile<index> | yes | shape must broadcast against base |
| result 0 | ptr | same as base | yes | element type and address space inherited |

nv_tileaa.make_memref

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr | yes | base pointer of the memref |
| operand 1 | offset | index | optional | byte offset added to base |
| operand 2..R+1 | sizes | index | yes per-dynamic-dim | matches dynamic slots in result shape |
| operand R+2.. | strides | index | yes per-dynamic-stride | matches dynamic stride slots |
| result 0 | memref | memref | yes | element type, shape, stride packed into result type |
| attr alias_scope | scope id | u32 | optional | provenance tag consumed by alias analysis |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {base, offset, sizes, strides} |

addptr is the primary pointer-arithmetic operation. It accepts scalar or tile-shaped offsets and preserves the base pointer's address-space and provenance. Canonicalization collapses chained additions into a single addition with a folded offset expression.

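PointerValue addptr(PointerValue base, IndexValue offset) {
    // Offsets count elements; scale to bytes before adjusting the address.
    ByteOffset bytes = scale_offset_by_element_size(offset, base.element_type);
    // Element type, address space, and provenance all come from the base pointer.
    return PointerValue(base.address + bytes, base.element_type, base.space, base.provenance);
}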

A representative scalar addptr:

%p1 = nv_tileaa.addptr %p0, %off
    : !nv_tileaa.ptr<f32, 1>, index -> !nv_tileaa.ptr<f32, 1>

A tile-shaped addptr produces per-lane addresses for a gather:

%pp = nv_tileaa.addptr %pbase, %lanes
    : !nv_tileaa.ptr<f16, 1>, tile<128xindex>
    -> tile<128x!nv_tileaa.ptr<f16, 1>>
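The canonicalization that collapses chained additions relies on addptr being additive in its offset. The C++ sketch below models that property with a hypothetical Ptr struct; the field names and the integer encoding of address space and provenance are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pointer model; field names are illustrative only.
struct Ptr {
    std::uint64_t address;
    std::size_t element_size;  // bytes per element
    int address_space;
    int provenance;            // stands in for the alias-scope tag
};

inline bool operator==(const Ptr& a, const Ptr& b) {
    return a.address == b.address && a.element_size == b.element_size &&
           a.address_space == b.address_space && a.provenance == b.provenance;
}

// Offsets count elements; the byte delta is offset * element_size.
// Unsigned wraparound makes negative offsets subtract as expected.
inline Ptr addptr(Ptr base, std::int64_t offset) {
    base.address += static_cast<std::uint64_t>(offset) * base.element_size;
    return base;  // element size, address space, and provenance are inherited
}
```

Because only the address changes, addptr(addptr(p, a), b) and addptr(p, a + b) produce the same pointer, which is what lets a fold replace the chain with a single operation.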

make_memref packages a base pointer with offset, sizes, strides, element type, memory space, and alias provenance. Later TMA descriptor generation depends on this object being structurally complete — never hide strides or bounds behind opaque pointer arithmetic.

%mr = nv_tileaa.make_memref %pbase, %off, %sz0, %sz1, %st0, %st1
    { alias_scope = 7,
      operandSegmentSizes = array<i32: 1, 1, 2, 2> }
    : (!nv_tileaa.ptr<f32, 1>, index, index, index, index, index)
    -> !nv_tileaa.memref<?x?xf32, 1>
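To show why strides and bounds must stay visible, the sketch below computes an element address from a structurally complete memref. It assumes strides are expressed in elements and the offset in bytes (the table above states only the latter); the MemRef struct and element_address function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical host-side model of the fields make_memref packages.
struct MemRef {
    std::uint64_t base;
    std::uint64_t byte_offset;
    std::size_t element_size;
    std::vector<std::int64_t> sizes;
    std::vector<std::int64_t> strides;  // assumed to be in elements
};

// Returns the element address, or nullopt when the index rank is wrong
// or any index falls outside the declared sizes.
inline std::optional<std::uint64_t> element_address(const MemRef& m,
                                                    const std::vector<std::int64_t>& idx) {
    if (idx.size() != m.sizes.size() || m.sizes.size() != m.strides.size()) {
        return std::nullopt;
    }
    std::int64_t linear = 0;
    for (std::size_t d = 0; d < idx.size(); ++d) {
        if (idx[d] < 0 || idx[d] >= m.sizes[d]) return std::nullopt;
        linear += idx[d] * m.strides[d];
    }
    return m.base + m.byte_offset + static_cast<std::uint64_t>(linear) * m.element_size;
}
```

A descriptor generator needs every one of these fields; if strides were folded into opaque pointer arithmetic, neither the bounds check nor the address formula could be reconstructed.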

Memory Effects

Scalar memory ops and tiled memory ops share one discipline: every memory effect consumes the incoming token and returns a token representing the effect after the access. Loads return a value too; stores and atomics still return the updated token even when their data result is unused.

nv_tileaa.load / nv_tileaa.tiled_load

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or memref | yes | source address |
| operand 1..R | indices | index | yes (R = base rank) | per-axis index |
| operand R+1 | mask | tile<S × i1> | optional | per-lane predicate |
| operand R+2 | other | tile<S × element> | optional | fallback value when masked off |
| token slot | token | mem_token | optional | drives ordering |
| result 0 | value | tile<S × element> or scalar | yes | element type matches base pointee |
| result 1 | token | mem_token | optional | mirrors token operand |
| attr cache_modifier | enum | none, ca, cg, cs, lu, cv | optional | |
| attr eviction_policy | enum | none, first, last, normal | optional | |
| attr mem_semantic | enum | weak, relaxed, acquire | optional | |
| attr mem_scope | enum | tl_blk, cluster, gpu, sys | required when semantic > weak | |
| attr in_bounds | dense bool | per-axis | optional | |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {base, indices, mask, other} |

nv_tileaa.store / nv_tileaa.tiled_store

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or memref | yes | destination |
| operand 1 | value | tile<S × element> or scalar | yes | element type matches base pointee |
| operand 2..R+1 | indices | index | yes | per-axis index |
| operand R+2 | mask | tile<S × i1> | optional | per-lane predicate |
| token slot | token | mem_token | optional | |
| result 0 | token | mem_token | optional | mirrors token operand |
| attr mem_semantic | enum | weak, relaxed, release | optional | acquire/acq_rel rejected |
| attr mem_scope | enum | as above | required when semantic > weak | |
| attr cache_modifier | enum | as above | optional | |
| attr operandSegmentSizes | dense i32 | length 4 | yes | {base, value, indices, mask} |

nv_tileaa.atomic_cas / nv_tileaa.atomic_rmw / nv_tileaa.tiled_atomic_rmw

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | base | ptr or memref | yes | atomic target |
| operand 1 | value or compare | scalar/tile | yes | RMW operand or CAS compare |
| operand 2 | replacement | scalar/tile | CAS only | |
| operand 3.. | indices, mask | index, tile<i1> | optional | per-axis index; predicate |
| token slot | token | mem_token | optional | |
| result 0 | old value | scalar/tile | yes | |
| result 1 | token | mem_token | optional | |
| attr rmw_mode | enum | full set | RMW only | add, and, or, xor, xchg, min, max, umin, umax, addf |
| attr mem_semantic | enum | full set | optional | |
| attr mem_scope | enum | full set | required when semantic > weak | |

LogicalResult verify_memory_op_common(MemoryOp op) {
    require_operand_segments(op);
    require_indices_match_base_rank(op.base(), op.indices());
    require_mask_shape_matches_result(op.mask(), op.result(0).shape());
    require_token_arity_matches_segment(op);

    if (op.mem_semantic() != WEAK) {
        require(op.mem_scope().has_value(),
                "non-weak memory ordering requires explicit scope");
    } else {
        require(!op.mem_scope().has_value(),
                "weak memory ordering must not carry a scope");
    }
    return success();
}
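The semantic and scope rules from the three tables can be combined into one predicate. The C++ sketch below is a minimal model: the enum and function names are hypothetical, and it assumes that "full set" for atomics means all five orderings, which the tables do not spell out.

```cpp
#include <cassert>
#include <optional>

enum class MemSemantic { Weak, Relaxed, Acquire, Release, AcqRel };
enum class MemScope { TlBlk, Cluster, Gpu, Sys };
enum class OpFamily { Load, Store, Atomic };

// True when the (semantic, scope) pair is acceptable for the op family.
inline bool semantic_scope_is_legal(OpFamily family, MemSemantic sem,
                                    std::optional<MemScope> scope) {
    // Loads accept weak, relaxed, acquire; stores accept weak, relaxed, release.
    if (family == OpFamily::Load &&
        (sem == MemSemantic::Release || sem == MemSemantic::AcqRel)) {
        return false;
    }
    if (family == OpFamily::Store &&
        (sem == MemSemantic::Acquire || sem == MemSemantic::AcqRel)) {
        return false;
    }
    // Mirrors verify_memory_op_common: a scope is present exactly when
    // the ordering is stronger than weak.
    return (sem != MemSemantic::Weak) == scope.has_value();
}
```

Keeping the per-family ordering check separate from the scope-presence check matches the split between the per-op tables and the common verifier.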

The mem_semantic / mem_scope pair on atomic_cas, atomic_rmw, and tiled_atomic_rmw is the user-facing entry point into the layered memory model documented in Concurrency and Sync Semantics. That page enumerates which (semantic, scope) combinations each op family accepts, how the pair survives every lowering stage down to the PTX .sem / .scope modifiers, and how the implicit release/acquire pair on mbarrier.arrive.expect_tx and mbarrier.try_wait.parity fits into a producer/consumer pipeline.

LoadResult lower_memory_read(MemRef ref, Indices indices, Token token, Mask mask) {
    require_indices_in_rank(ref, indices);

    if (mask.is_constant_false()) {
        return LoadResult(mask.other_value_or_undef(), token);
    }

    Value value = masked_or_unmasked_load(ref, indices, mask);
    Token next = sequence_after(token, value.memory_effect);
    return LoadResult(value, next);
}
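The per-lane meaning of mask and other can be sketched on the host. The function below is a hypothetical model, not the lowering itself: lanes whose mask bit is false take the fallback value and never touch memory, which is why an out-of-range index on a masked-off lane is harmless. It assumes mask and indices have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-lane masked gather. Lanes with mask == false receive `other`
// and perform no memory access at all.
inline std::vector<float> masked_load(const std::vector<float>& memory,
                                      const std::vector<std::size_t>& indices,
                                      const std::vector<bool>& mask,
                                      float other) {
    std::vector<float> result(indices.size(), other);
    for (std::size_t lane = 0; lane < indices.size(); ++lane) {
        if (mask[lane]) {
            result[lane] = memory[indices[lane]];
        }
    }
    return result;
}
```

With memory {10, 20, 30, 40}, indices {0, 3, 99, 1}, mask {true, true, false, false}, and other = -1, the result is {10, 40, -1, -1}; index 99 is never dereferenced.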

tiled_load, tiled_store, and tiled_atomic_rmw are the TMA-aware forms. Their attributes record whether TMA is allowed and whether each dimension is known in bounds. The TileAA verifier validates the structural facts; the TileAS lowering decides whether a concrete TMA instruction is profitable and legal for the selected layout.

Worked Example: addptr → tiled_load → dot → tiled_store

A representative GEMM-style fragment threads its memory operations through a single memory token chain. Each memory operation consumes the incoming token and produces a new one; later passes may reorder operations only when the token graph stays intact.

// Initial token at the function entry
%t0 = nv_tileaa.create_mem_token : !nv_tileaa.mem_token

// Compute the per-stage base pointers
%pa = nv_tileaa.addptr %a_base, %off_a
    : !nv_tileaa.ptr<f16, 1>, index -> !nv_tileaa.ptr<f16, 1>
%pb = nv_tileaa.addptr %b_base, %off_b
    : !nv_tileaa.ptr<f16, 1>, index -> !nv_tileaa.ptr<f16, 1>

// Wrap each pointer in a memref describing shape and stride
%mr_a = nv_tileaa.make_memref %pa, %off_a, %M, %K, %s_a_row, %s_a_col
    : (!nv_tileaa.ptr<f16, 1>, index, index, index, index, index)
    -> !nv_tileaa.memref<?x?xf16, 1>
%mr_b = nv_tileaa.make_memref %pb, %off_b, %K, %N, %s_b_row, %s_b_col
    : (!nv_tileaa.ptr<f16, 1>, index, index, index, index, index)
    -> !nv_tileaa.memref<?x?xf16, 1>

// Token-ordered tile loads
%av, %t1 = nv_tileaa.tiled_load %mr_a[%i, %k], %t0
    { in_bounds = array<i1: true, true>,
      operandSegmentSizes = array<i32: 1, 2, 0, 0> }
    : !nv_tileaa.memref<?x?xf16, 1>, index, index, !nv_tileaa.mem_token
    -> tile<128x32xf16>, !nv_tileaa.mem_token

%bv, %t2 = nv_tileaa.tiled_load %mr_b[%k, %j], %t1
    { in_bounds = array<i1: true, true>,
      operandSegmentSizes = array<i32: 1, 2, 0, 0> }
    : !nv_tileaa.memref<?x?xf16, 1>, index, index, !nv_tileaa.mem_token
    -> tile<32x128xf16>, !nv_tileaa.mem_token

// Plain f16 dot (no scale factors: sfa and sfb segments are 0) accumulating into an f32 accumulator
%d = nv_tileaa.dot %av, %bv, %c_in
    { operandSegmentSizes = array<i32: 1, 1, 1, 0, 0> }
    : tile<128x32xf16>, tile<32x128xf16>, tile<128x128xf32>
    -> tile<128x128xf32>

// Token-ordered tile store; %t3 succeeds the store in the token chain
%mr_c = nv_tileaa.make_memref %c_base, %off_c, %M, %N, %s_c_row, %s_c_col
    : (!nv_tileaa.ptr<f32, 1>, index, index, index, index, index)
    -> !nv_tileaa.memref<?x?xf32, 1>
%t3 = nv_tileaa.tiled_store %mr_c[%i, %j], %d, %t2
    { in_bounds = array<i1: true, true>,
      operandSegmentSizes = array<i32: 1, 1, 2, 0> }
    : !nv_tileaa.memref<?x?xf32, 1>, tile<128x128xf32>, index, index,
      !nv_tileaa.mem_token
    -> !nv_tileaa.mem_token

The four token-carrying operations form one continuous chain %t0 → %t1 → %t2 → %t3. The dot consumes no token because it is a pure tile-on-tile computation; the operations on either side of it commit their memory effects through the chain. A reordering pass may swap the two tiled_loads only if it also rewires their token edges, because the verifier rejects any operation whose token input is not defined by an operation that dominates it. This discipline lets later TileAS scheduling reorder TMA loads aggressively without losing the producer/consumer ordering between memory and compute.
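The reordering constraint can be checked mechanically for straight-line code. The C++ sketch below is a simplified model in which dominance reduces to "defined earlier in the list"; TokenOp and token_chain_is_well_formed are hypothetical names, and a real verifier works on SSA values and regions rather than strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one operation's token edges.
struct TokenOp {
    std::string name;
    std::string token_in;   // empty when the op consumes no token
    std::string token_out;  // empty when the op produces no token
};

// In straight-line code, a token use is valid only if an earlier
// operation in the list produced that token.
inline bool token_chain_is_well_formed(const std::vector<TokenOp>& ops) {
    std::set<std::string> defined;
    for (const TokenOp& op : ops) {
        if (!op.token_in.empty() && defined.count(op.token_in) == 0) {
            return false;
        }
        if (!op.token_out.empty()) {
            defined.insert(op.token_out);
        }
    }
    return true;
}
```

Swapping the two loads in the fragment while leaving their token operands unchanged makes the second load consume %t1 before it is defined, so the check fails; rewiring the edges so the first load consumes %t0 restores a valid chain.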

Tokens and Lifetime

create_mem_token creates an initial memory-order value. join_mem_token merges several order edges into one. mark_for_reuse tells buffer allocation that a value's lifetime extends beyond naive SSA liveness. The token value carries no user-visible data — it is an ordering edge that later lowering can map to barrier phase state.

Token join_mem_token(ArrayRef<Token> inputs) {
    if (inputs.empty()) {
        return create_mem_token();
    }

    Token result = inputs[0];
    for (Token token : inputs.drop_front()) {
        result = merge_order_edges(result, token);
    }
    return result;
}
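One way to reason about the merge is to model a token as the set of memory effects it is ordered after, so that merging order edges becomes a set union. This is an assumed model for illustration only; the document does not say how tokens are represented, and the Token alias below is hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model: a token is the set of effect ids it is ordered after.
using Token = std::set<int>;

// Joining tokens yields a token ordered after every input's effects.
// An empty input list yields a fresh token ordered after nothing.
inline Token join_mem_token(const std::vector<Token>& inputs) {
    Token result;
    for (const Token& token : inputs) {
        result.insert(token.begin(), token.end());
    }
    return result;
}
```

In this model the join is commutative and idempotent, which is consistent with treating a token purely as an ordering edge that carries no data.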

Plugins and Queues

plugin and execute give opaque kernel fragments a structured extension point. They carry function-like operands, layout metadata, and resource requirements such as registers, shared memory, tensor memory, and named barriers. queue.get, queue.put, and queue.yield form a dataflow shell that TileAS later collapses into producer and consumer pipeline regions.

void lower_queue_region(QueueRegion region, PipelineBuilder builder) {
    for (QueueOp op : region.ops) {
        switch (op.kind) {
        case QUEUE_GET:
            builder.emit_consumer_wait(op.queue, op.consumer_index);
            bind_queue_results(op);
            break;
        case QUEUE_PUT:
            builder.emit_producer_commit(op.queue, op.values);
            break;
        case QUEUE_YIELD:
            builder.close_region(op.yielded_values);
            break;
        }
    }
}
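The dataflow that queue.put and queue.get express can be modeled as a bounded FIFO whose depth stands in for the number of pipeline stages. The single-threaded C++ sketch below is an assumption-laden illustration: BoundedQueue, try_put, and try_get are hypothetical names, and a failed call represents the point where a real producer or consumer would wait on a barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical bounded FIFO; depth models the number of pipeline stages.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t depth) : slots_(depth), head_(0), count_(0) {}

    // Models queue.put: returns false when every stage is occupied,
    // which is where a producer would have to wait.
    bool try_put(int value) {
        if (count_ == slots_.size()) return false;
        slots_[(head_ + count_) % slots_.size()] = value;
        ++count_;
        return true;
    }

    // Models queue.get: returns nullopt when nothing has been committed,
    // which is where a consumer would have to wait.
    std::optional<int> try_get() {
        if (count_ == 0) return std::nullopt;
        int value = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return value;
    }

private:
    std::vector<int> slots_;
    std::size_t head_;
    std::size_t count_;
};
```

Values leave the queue in the order they were committed, which is the ordering property the producer and consumer pipeline regions must preserve after lowering.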

Verification Invariants

  • A TileAA module should contain no remaining cuda_tile operations.
  • Every memory op with effects participates in the token protocol.
  • Pointer and memref operations preserve address space, element type, and alias provenance.
  • Shape-changing ops preserve element count unless their semantics explicitly create or remove repetition.
  • yield, queue.yield, and return operands match the parent region or function contract.
  • func, call, plugin, and execute symbol references resolve inside the containing module.
  • Block-scaled MMA and tensor-memory features require a target that supports the needed Blackwell instruction family.
  • TMA eligibility attributes are promises to later lowering, not proof that TMA must be emitted.

Cross-References

Types, Attributes, Verifiers catalogues the type and attribute surface these operations use and the verbatim verifier diagnostics they emit. Folds, Canonicalizers, Tokens describes the rewrites applied after verification succeeds. The TileAS-side counterpart in nv_tileas Operation Roster and Builders extends this surface with async pipeline and TMA operations.