cuda_tile Operation Roster

Abstract

A frontend emitting cuda_tile is writing tile values, structured kernel control flow, token-ordered memory effects, tensor views, matrix multiply-accumulate intent, and source-level numeric attributes — everything the compiler will subsequently lower into private implementation dialects. This page is the producer and reimplementation reference: operation families, the behavior each family promises, and how a compiler should lower the surface without leaning on internal registration details.

In this build, the token-ordered print operation is spelled cuda_tile.print. The newer cuda_tile.atan2 is rejected outright, so a frontend that supports multiple TileIR revisions should gate it behind explicit version logic.
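One way a multi-revision frontend might organize that gating is sketched below. The `BuildProfile` type, its field names, and `resolve_mnemonic` are hypothetical helpers invented for illustration; only the two mnemonic facts (this build spells the print op `cuda_tile.print` and rejects `cuda_tile.atan2`) come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildProfile:
    """Hypothetical description of what one TileIR build accepts."""
    print_mnemonic: str    # spelling of the token-ordered print op
    supports_atan2: bool   # whether cuda_tile.atan2 is registered


# Profile matching the build described on this page.
THIS_BUILD = BuildProfile(print_mnemonic="cuda_tile.print", supports_atan2=False)


def resolve_mnemonic(logical_op: str, profile: BuildProfile) -> str:
    """Map a frontend-level op name to the mnemonic the target build accepts."""
    if logical_op == "print":
        return profile.print_mnemonic
    if logical_op == "atan2":
        if not profile.supports_atan2:
            raise ValueError("cuda_tile.atan2 is not accepted by this build")
        return "cuda_tile.atan2"
    return f"cuda_tile.{logical_op}"
```

Keeping the decision in one table-like object means the emitter never hard-codes a spelling, so targeting a different revision only requires a different profile.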

Operation Families

| Family | Operations | Contract |
|---|---|---|
| Floating and integer arithmetic | absf, absi, addf, addi, ceil, cmpf, cmpi, cos, cosh, divf, divi, exp, exp2, floor, fma, log, log2, maxf, maxi, minf, mini, mulf, mulhii, muli, negf, negi, pow, remf, remi, rsqrt, sin, sinh, sqrt, subf, subi, tan, tanh | Operate elementwise on scalar or tile values while preserving rounding, signedness, overflow, comparison, and fast-math choices. |
| Integer logic | andi, ori, shli, shri, xori | Bitwise and shift operations over integer scalar or tile values. |
| Token-ordered memory | load_ptr_tko, load_view_tko, store_ptr_tko, store_view_tko, atomic_cas_tko, atomic_rmw_tko, make_token, join_tokens, offset, global, get_global, make_tensor_view, make_partition_view | Express pointer, view, global, token, and atomic memory behavior without committing to backend layout or scheduling. |
| Structured control flow | module, entry, if, for, loop, yield, break, continue, return, assert, assume | Keep kernel structure in the source dialect and verify region arity, yielded values, and early-exit ancestry. |
| Shape algebra | broadcast, cat, extract, get_index_space_shape, get_num_tile_blocks, get_tensor_shape, get_tile_block_id, iota, permute, reshape | Transform tile rank, tile extents, launch geometry, and indexing without choosing hardware layout. |
| Reductions and scans | reduce, scan | Carry the reduction dimension, identities, input/result types, and pure combiner body. |
| Matrix multiply-accumulate | mmaf, mmai | Preserve floating and integer MMA intent until atom selection and scheduler lowering. |
| Type conversion | bitcast, exti, ftof, ftoi, int_to_ptr, itof, ptr_to_int, ptr_to_ptr, trunci | Make widening, narrowing, bit reinterpretation, float/int conversion, and pointer casts explicit. |
| Constants, selection, diagnostics | constant, select, print | Materialize literal values, value selection, and token-ordered runtime diagnostics. |

The family boundaries are semantic, not syntactic. fma is arithmetic because it is elementwise; mmaf and mmai are MMA because they contract matrix dimensions. assert and assume live with control flow because regions and dominance scope their meaning, even though their payload is an attribute or predicate.

Producer Contract

A valid producer should build modules with this shape:

cuda_tile.module {
    cuda_tile.entry @kernel(%arg0 : !cuda_tile.tensor_view<...>) {
        %tok0 = cuda_tile.make_token : !cuda_tile.token
        %tile, %tok1 = cuda_tile.load_view_tko %view[%i, %j] token=%tok0
        %acc = cuda_tile.mmaf %a, %b, %c : ...
        %tok2 = cuda_tile.store_view_tko %view[%i, %j], %acc token=%tok1
        cuda_tile.return
    }
}

The exact textual syntax is described in Assembly Printer, but the contract is independent of formatting:

  • memory effects are threaded through cuda_tile.token;
  • tile values have static rank and element type;
  • view values carry shape and stride metadata;
  • structured control flow yields values rather than branching through cf;
  • numeric choices such as rounding and signedness are attributes, not implicit frontend assumptions;
  • debug info and optimization hints may be present but must not be required for semantic correctness.
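The first bullet, token threading, can be checked mechanically on the producer side before the module is handed to the compiler. The sketch below is illustrative only: `MemOp` and `check_token_chain` are hypothetical names, and the model is a flat list of ops rather than real IR with regions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemOp:
    """Hypothetical flat record of one op and the tokens it touches."""
    name: str
    token_in: Optional[str]
    token_out: Optional[str]


def check_token_chain(ops: List[MemOp]) -> List[str]:
    """Return a list of problems; an empty list means the chain is well formed."""
    defined = set()
    problems = []
    for op in ops:
        # Every *_tko op must consume a token.
        if op.name.endswith("_tko") and op.token_in is None:
            problems.append(f"{op.name}: missing input token")
        # A consumed token must have been produced earlier.
        if op.token_in is not None and op.token_in not in defined:
            problems.append(f"{op.name}: uses undefined token {op.token_in}")
        if op.token_out is not None:
            if op.token_out in defined:
                problems.append(f"{op.name}: redefines token {op.token_out}")
            defined.add(op.token_out)
    return problems


# Mirrors the module shape shown above.
kernel = [
    MemOp("make_token", None, "%tok0"),
    MemOp("load_view_tko", "%tok0", "%tok1"),
    MemOp("store_view_tko", "%tok1", "%tok2"),
]
```

Calling `check_token_chain(kernel)` on the example returns an empty list; dropping the `make_token` entry makes the load report an undefined token.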

Lowering Sketch

The first lowering stage converts public cuda_tile into alias-aware TileAA. Arithmetic and shape operations keep their mathematical meaning intact. Memory operations gain explicit memref and token structure. Control flow is rewritten only once region and token legality are already proven.

Module lower_cuda_tile_to_tileaa(Module module, Target target) {
    require(module.only_uses_dialect("cuda_tile", "builtin", "arith"));
    verify_cuda_tile_module(module, target);

    TypeConverter types;
    types.add(convert_scalar_type);
    types.add(convert_tile_type);
    types.add(convert_pointer_type);
    types.add(convert_view_type);
    types.add(convert_token_type);

    RewritePatternSet patterns;
    add_arithmetic_patterns(patterns, types);
    add_shape_patterns(patterns, types);
    add_memory_patterns(patterns, types);
    add_control_flow_patterns(patterns, types);
    add_mma_patterns(patterns, types);

    apply_conversion(module, patterns);
    require(!module.contains_dialect("cuda_tile"));
    return module;
}

Lowering must not erase source-level facts prematurely. A load_view_tko becomes an operation with explicit view, index, mask, fallback, memory ordering, memory scope, and token dependencies. It is not reduced to an unstructured pointer load until the alias and layout passes have the context to handle it safely.

Numeric Operations

Arithmetic ops accept scalar or tile-shaped operands. Tile operands must agree on shape and element type unless the op has an explicit shape-changing contract. Floating operations carry rounding mode and flush-to-zero policy forward until a lower dialect decides whether the target instruction can encode those choices directly.

Value lower_elementwise_arith(ArithOp op) {
    require_same_shape(op.operands);
    require_legal_element_type(op);

    NumericPolicy policy = {
        .rounding = op.rounding_mode,
        .flush_to_zero = op.flush_to_zero,
        .signedness = op.signedness,
        .overflow = op.overflow,
    };

    return tileaa_elementwise(op.kind, op.operands, policy);
}

mulhii returns the high half of a signed integer product. Implement it as a wide multiply followed by a high-half extract — never as ordinary multiplication that relies on target-width overflow.
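The difference between the two strategies is easy to see numerically. The following is a minimal sketch, assuming a 32-bit element width by default; `to_signed`, `mulhi_signed`, and `mul_truncating` are illustrative helper names, not part of any cuda_tile API.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mulhi_signed(a: int, b: int, bits: int = 32) -> int:
    """High half of the signed 2N-bit product of two N-bit integers."""
    a, b = to_signed(a, bits), to_signed(b, bits)
    wide = a * b  # Python integers are unbounded, so this is the exact wide product
    return to_signed(wide >> bits, bits)  # arithmetic shift keeps the sign


def mul_truncating(a: int, b: int, bits: int = 32) -> int:
    """Ordinary N-bit multiply: keeps only the low half of the product."""
    return to_signed(a * b, bits)


largest = 2**31 - 1
high = mulhi_signed(largest, largest)   # 2**30 - 1
low = mul_truncating(largest, largest)  # 1
```

For the largest positive 32-bit operand the high half is 2**30 - 1 while the truncating multiply yields 1, which is why the high half cannot be recovered from a target-width multiply.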

Operand and Result Tables

The most heavily emitted ops carry the following operand/attribute/result shape. The _tko family threads a cuda_tile.token through every memory effect.

cuda_tile.load_view_tko

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | partition_view | yes | source tile view |
| operand 1..R | indices | index | yes (R = tile rank) | per-axis tile coordinate |
| operand R+1 | mask | tile<S × i1> | optional | per-lane predicate |
| operand R+2 | other | tile<S × element> | optional | fallback value when masked off |
| operand R+3 | token | cuda_tile.token | yes | input ordering edge |
| result 0 | value | tile<S × element> | yes | matches view element type |
| result 1 | token | cuda_tile.token | yes | successor ordering edge |
| attr mem_semantic | enum | weak, relaxed, acquire | optional | acquire requires scope |
| attr mem_scope | enum | tl_blk, cluster, gpu, sys | conditional | required for non-weak |
| attr optimization_hints | dict | architecture-keyed | optional | |
| attr operandSegmentSizes | dense i32 | length 5 | yes | {view, indices, mask, other, token} |

A representative two-dimensional tile load with a predicate mask:

%tile, %t1 = cuda_tile.load_view_tko %view[%i, %j], %mask, %fallback, %t0
    { mem_semantic = #cuda_tile<mem_semantic relaxed>,
      mem_scope    = #cuda_tile<mem_scope gpu>,
      operandSegmentSizes = array<i32: 1, 2, 1, 1, 1> }
    : !cuda_tile.partition_view<128x64xf32>, index, index,
      tile<128x64xi1>, tile<128x64xf32>, !cuda_tile.token
    -> tile<128x64xf32>, !cuda_tile.token

The mask and fallback shapes equal the result tile shape; the view element type matches the result element type. The token chain threads %t0 in and %t1 out, ordering the load against any preceding or following memory effect that consumes the same chain.
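A producer can derive the segment-size array and pre-check the ordering attributes from the table alone. This is a sketch under stated assumptions: the function names are hypothetical, and only the rules written in the table are encoded (allowed semantics for loads, allowed scopes, and a scope being required for any non-weak semantic).

```python
from typing import List, Optional

LOAD_SEMANTICS = {"weak", "relaxed", "acquire"}
MEM_SCOPES = {"tl_blk", "cluster", "gpu", "sys"}


def load_view_segment_sizes(rank: int, has_mask: bool, has_other: bool) -> List[int]:
    """operandSegmentSizes in the order {view, indices, mask, other, token}."""
    return [1, rank, int(has_mask), int(has_other), 1]


def check_load_memory_attrs(mem_semantic: Optional[str],
                            mem_scope: Optional[str]) -> List[str]:
    """Return the table rules violated by this attribute pair (empty if none)."""
    errors = []
    if mem_semantic is not None and mem_semantic not in LOAD_SEMANTICS:
        errors.append(f"mem_semantic {mem_semantic!r} is not legal on a load")
    if mem_scope is not None and mem_scope not in MEM_SCOPES:
        errors.append(f"unknown mem_scope {mem_scope!r}")
    # The table marks mem_scope as required for every non-weak semantic.
    if mem_semantic in {"relaxed", "acquire"} and mem_scope is None:
        errors.append(f"mem_semantic {mem_semantic!r} requires a mem_scope")
    return errors


# The two-dimensional masked load shown above.
sizes = load_view_segment_sizes(rank=2, has_mask=True, has_other=True)
```

For the example load, `sizes` is `[1, 2, 1, 1, 1]`, matching the array in the textual form.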

cuda_tile.store_view_tko

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | view | partition_view | yes | destination view |
| operand 1 | value | tile<S × element> | yes | element type matches view |
| operand 2..R+1 | indices | index | yes | per-axis tile coordinate |
| operand R+2 | mask | tile<S × i1> | optional | |
| operand R+3 | token | cuda_tile.token | yes | input ordering edge |
| result 0 | token | cuda_tile.token | yes | successor ordering edge |
| attr mem_semantic | enum | weak, relaxed, release | optional | acquire variants rejected |
| attr mem_scope | enum | as above | conditional | |
| attr operandSegmentSizes | dense i32 | length 5 | yes | |

cuda_tile.atomic_rmw_tko

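| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | pointers | tile<S × ptr> | yes | per-lane address |
| operand 1 | value | tile<S × element> | yes | RMW operand |
| operand 2 | mask | tile<S × i1> | optional | |
| operand 3 | token | cuda_tile.token | yes | |
| result 0 | old | tile<S × element> | yes | |
| result 1 | token | cuda_tile.token | yes | |
| attr kind | enum | add, addf, and, or, xor, xchg, min, max, umin, umax | yes | |
| attr ordering | enum | full | yes | |
| attr scope | enum | full | conditional | |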

cuda_tile.mmaf / cuda_tile.mmai

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | A | tile<[B ×] M × K × elem_a> | yes | rank 2 or 3 (batched) |
| operand 1 | B | tile<[B ×] K × N × elem_b> | yes | K agrees with A |
| operand 2 | C | tile<[B ×] M × N × elem_c> | yes | accumulator |
| result 0 | D | tile<[B ×] M × N × elem_c> | yes | shape equals C shape |
| attr signedness_a | enum | signed, unsigned | integer MMA | required for mmai |
| attr signedness_b | enum | signed, unsigned | integer MMA | required for mmai |
| attr rounding | enum | IEEE basic | optional | mmaf only |

A 16×16×16 floating MMA with an f32 accumulator and f16 inputs:

%d = cuda_tile.mmaf %a, %b, %c
    : tile<16x16xf16>, tile<16x16xf16>, tile<16x16xf32>
    -> tile<16x16xf32>

A batched integer MMA with explicit signedness attributes:

%d = cuda_tile.mmai %a, %b, %c
    { signedness_a = #cuda_tile<signedness signed>,
      signedness_b = #cuda_tile<signedness unsigned> }
    : tile<4x16x32xi8>, tile<4x32x16xi8>, tile<4x16x16xi32>
    -> tile<4x16x16xi32>

The M dimension of A and the N dimension of B agree with the corresponding dimensions of C; the K dimension shared by A and B is contracted. The verifier rejects rank mismatch, K disagreement, accumulator/result type mismatch, missing signedness on mmai, and any input/accumulator pair that lies outside the target's legal MMA element-type tuple.

cuda_tile.if

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | condition | i1 | yes | scalar predicate |
| region 0 | then | terminated by yield | yes | yields result_types |
| region 1 | else | terminated by yield | required when results non-empty | yields result_types |
| result 0.. | values | any non-view type | optional | view-typed results rejected |

cuda_tile.for

| Slot | Kind | Type | Required | Notes |
|---|---|---|---|---|
| operand 0 | lower | integer | yes | |
| operand 1 | upper | integer | yes | same width as lower |
| operand 2 | step | integer | yes | same width as lower |
| operand 3.. | iter args | any non-view | optional | types equal result_types |
| region 0 | body | terminated by yield | yes | block arg 0 = induction var |
| result 0.. | yielded iter args | any non-view | optional | |

Memory and Token Operations

The _tko suffix means token ordered. Every token-ordered memory op consumes an input token and produces a successor. Loads and atomics also produce data; stores produce only the successor token. That discipline is the public memory model — later passes refine it into barriers, async copies, and backend memory instructions.

LoadResult lower_load_ptr_tko(LoadPtrTkoOp op) {
    MemRef ref = make_memref_from_pointer(op.pointer, op.indices);
    MemoryPolicy policy = memory_policy(op.ordering, op.scope, op.hints);

    Value data = tileaa_load(ref, op.mask, op.padding, policy, op.input_token);
    Token next = token_after(data.memory_effect, op.input_token);
    return (LoadResult){ .value = data, .token = next };
}

Atomics check both memory ordering and element type. Integer bitwise modes are integer-only; floating add is floating-only; compare-and-swap is restricted to element widths the backend can update atomically.
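The element-type half of that check might look like the sketch below. It encodes only the two rules stated above (bitwise kinds are integer-only, `addf` is floating-only) and leaves every other kind unconstrained. The element-type spellings and helper names are assumptions made for illustration.

```python
from typing import Optional

INTEGER_ONLY_KINDS = {"and", "or", "xor"}  # integer bitwise modes
FLOAT_ONLY_KINDS = {"addf"}                # floating add

# Assumed element-type spellings; a real verifier inspects type objects.
FLOAT_TYPES = {"f16", "bf16", "f32", "f64"}


def is_integer_type(element_type: str) -> bool:
    return element_type.startswith("i") and element_type[1:].isdigit()


def is_float_type(element_type: str) -> bool:
    return element_type in FLOAT_TYPES


def check_atomic_rmw(kind: str, element_type: str) -> Optional[str]:
    """Return an error message, or None when the pairing passes both rules."""
    if kind in INTEGER_ONLY_KINDS and not is_integer_type(element_type):
        return f"atomic kind '{kind}' requires an integer element type"
    if kind in FLOAT_ONLY_KINDS and not is_float_type(element_type):
        return f"atomic kind '{kind}' requires a floating element type"
    return None
```

A fuller check would also cover the ordering/scope pairing and the compare-and-swap width restriction, which depend on target details not given here.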

Structured Control Flow

cuda_tile ships its own region operations because frontends need a stable kernel-level API. Later lowering may translate these regions into SCF, CFG, or private control-flow dialects, but the verifier enforces these rules first:

  • if result types match every non-empty yielding branch;
  • for induction, bounds, step, iter args, and results are type-consistent;
  • loop iter args and results are type-consistent;
  • break exits the nearest compatible loop;
  • continue advances to the next iteration of a compatible for or loop;
  • return appears in an entry context and matches the entry function type;
  • yield appears only in a parent op that expects region yields.
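The early-exit ancestry rules for break and continue can be sketched as a walk up the parent chain. Everything here is hypothetical: the `Op` record, the `EXIT_TARGETS` table, and in particular the assumption that the walk skips incompatible ancestors rather than stopping at the first loop-like op. The real verifier may be stricter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    """Hypothetical op node; only the name and enclosing op are modelled."""
    name: str
    parent: Optional["Op"] = None


# Assumption drawn from the bullets above: break targets `loop`,
# continue targets `for` or `loop`.
EXIT_TARGETS = {"break": {"loop"}, "continue": {"for", "loop"}}


def find_exit_target(op: Op) -> Optional[Op]:
    """Return the nearest compatible enclosing op, or None if there is none."""
    targets = EXIT_TARGETS[op.name]
    ancestor = op.parent
    while ancestor is not None:
        if ancestor.name in targets:
            return ancestor
        ancestor = ancestor.parent
    return None


# entry { loop { if { break } } }
entry = Op("entry")
loop = Op("loop", parent=entry)
branch = Op("if", parent=loop)
brk = Op("break", parent=branch)
```

In the example, the break nested inside an `if` resolves to the enclosing `loop`; a break placed directly under `entry` resolves to nothing and would be rejected.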

MMA Operations

mmaf and mmai are deliberately narrow public abstractions: they describe matrix multiply-accumulate intent, not final tensor-core instruction selection. The verifier checks shape compatibility and element-type legality. Choosing WGMMA, smaller MMA atoms, tensor-memory paths, or emulation is left to the lowering pipeline.

LogicalResult verify_mma_shape(Tile lhs, Tile rhs, Tile acc, Tile result) {
    require(lhs.rank == 2 || lhs.rank == 3);
    require(rhs.rank == lhs.rank);
    require(acc.rank == lhs.rank);
    require(result.rank == lhs.rank);

    if (lhs.rank == 3) {
        require(lhs.dim(0) == rhs.dim(0));
        require(lhs.dim(0) == acc.dim(0));
        require(lhs.dim(0) == result.dim(0));
    }

    require(lhs.k_dim == rhs.k_dim);
    require(lhs.m_dim == acc.m_dim);
    require(rhs.n_dim == acc.n_dim);
    require(acc.shape == result.shape);
    return success();
}
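For readers who want to experiment with the shape rules, here is a runnable restatement of the pseudocode above operating on plain shape tuples. It is a sketch only: it covers shapes, not element-type legality, and the error strings are invented.

```python
from typing import Optional, Sequence


def verify_mma_shape(a: Sequence[int], b: Sequence[int],
                     c: Sequence[int], d: Sequence[int]) -> Optional[str]:
    """Return None when the shapes are compatible, else an error message."""
    rank = len(a)
    if rank not in (2, 3):
        return "operand rank must be 2 or 3"
    if not (len(b) == len(c) == len(d) == rank):
        return "rank mismatch"
    if rank == 3 and not (a[0] == b[0] == c[0] == d[0]):
        return "batch dimension mismatch"
    m, k_a = a[-2], a[-1]   # A is [B x] M x K
    k_b, n = b[-2], b[-1]   # B is [B x] K x N
    if k_a != k_b:
        return "K dimension mismatch"
    if (c[-2], c[-1]) != (m, n):
        return "accumulator is not M x N"
    if tuple(c) != tuple(d):
        return "result shape differs from accumulator"
    return None


# The two examples from the operand tables.
plain = verify_mma_shape((16, 16), (16, 16), (16, 16), (16, 16))
batched = verify_mma_shape((4, 16, 32), (4, 32, 16), (4, 16, 16), (4, 16, 16))
```

Both examples from the operand tables pass, so `plain` and `batched` are `None`.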

Version Notes

  • Emit cuda_tile.print for runtime diagnostic printing in this build.
  • Do not emit cuda_tile.print_tko unless targeting a source tree that uses that mnemonic.
  • Do not emit cuda_tile.atan2 for this build; guard it behind a newer TileIR version check.
  • Treat cuda_tile.string as implementation-specific unless the target contract explicitly documents it.

Op Method Surface

Every cuda_tile.* op exposes four registered functions to the framework: a builder (or textual parse entry), a registration thunk that interns the mnemonic and installs the op vtable, a verifier hook, and a lowering pattern that rewrites the op during the first conversion stage. The functions follow predictable shapes by op family.

| Family | Builder shape | Verifier shape | Lowering arm |
|---|---|---|---|
| Trivial unary (absf, absi, ceil, floor, negf, negi, sqrt, cos, sin, transcendentals) | Default trampoline; constructs result from one operand and forwards rounding/flush-to-zero attributes. | Generic trait-only verification with element-type and rank checks. | Arithmetic-group conversion pattern. |
| Floating binary (addf, subf, mulf, divf, maxf, minf, remf) | Forwards rounding mode and flush-to-zero. | Type-equality, shape-equality, and rounding-mode legality (see Type-Compatibility Diagnostics in Verifiers). | Arithmetic-group conversion pattern. |
| Integer binary (addi, subi, muli, divi, maxi, mini, mulhii, remi, andi, ori, xori, shli, shri) | Forwards signedness and overflow attributes. | Type-equality, shape-equality, signedness-presence. | Arithmetic-group conversion pattern (integer max routes through a dedicated arm). |
| Conversion (exti, trunci, ftof, ftoi, itof, bitcast, int_to_ptr, ptr_to_int, ptr_to_ptr) | Builds result from operand element type and target element type. | Width-direction and rounding-mode checks; identity conversions are rejected. | Pointer-cast specialty arm for the four pointer-family ops; arithmetic-group arm for the rest. |
| Shape (broadcast, cat, extract, permute, reshape, iota) | Builds result from result shape, source shape, and axis attributes. | Rank, element-count, and axis legality. | Arithmetic-group conversion pattern. |
| Token-ordered memory (load_*_tko, store_*_tko, atomic_*_tko, make_token, join_tokens) | Builds the result tile plus the successor token; threads the input token through. | Token presence, pointer/value element-type match, mask shape match, ordering/scope pairing. | Arithmetic-group arm; the lowering produces a TileAA tiled_load/tiled_store/atomic_rmw and threads the new mem-token chain. |
| View construction (make_tensor_view, make_partition_view, offset, global, get_global) | Builds the view type from element type, shape, stride, and dynamic operands. | Dynamic-operand count match, element-type compatibility, partition dim_map injectivity. | Arithmetic-group arm; lowers to TileAA make_memref plus address-space metadata. |
| Structured control flow (module, entry, if, for, loop, yield, break, continue, return) | Builds region(s) plus block argument types from result types and iter-arg types. | Region structure, terminator arity, yield-type match, view-result rejection. | Routes through the control-flow conversion arm that produces SCF/CF dialect output. |
| Aggregate (reduce, scan) | Builds the result-type list plus the combiner body region. | Body purity, rank-zero block argument types, identity-vs-input element-type match. | Arithmetic-group arm; produces a TileAA reduce with the same body region. |
| MMA (mmaf, mmai) | Builds result from A/B/C tile types; signedness attributes preserved. | Rank, K/M/N dimension agreement, accumulator/result type match, signedness presence for integer MMA. | Arithmetic-group arm; lowers to TileAA dot with optional scale-factor operands. |
| Constants and diagnostics (constant, select, assert, assume, print) | Constants carry a typed attribute; select carries condition plus two values; diagnostics carry a message and operand list. | Constant-attribute type match; select arm-type match; assume-predicate interface checks. | Constant-and-select arm (constants are constant-folded into the TileAA constant pool). |

The default builder for trivial unary ops shares one trampoline that constructs the op from a single operand and forwards rounding/flush-to-zero attributes; the default verifier hook installs a no-op stub when the op's contract is fully covered by trait-level checks. The control-flow lowering routes through one driver that owns cuda_tile.if, cuda_tile.for, cuda_tile.loop, cuda_tile.continue, and cuda_tile.return together so it can preserve region nesting and the structured-exit ancestry contract.

One count discrepancy is worth flagging. The roster in this build is 92 mnemonics. The two names that differ from the open-source documentation are cuda_tile.atan2 (excluded entirely from this binary) and cuda_tile.print_tko, which this build spells cuda_tile.print. Producers should follow the version notes above and emit only the 92 mnemonics this dialect accepts.
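The count can be cross-checked against the Operation Families table. The sketch below copies the mnemonics from that table into a dictionary (the short family keys are illustrative labels) and sums them; a producer could reuse the resulting set as an allow-list.

```python
ROSTER = {
    "arithmetic": """absf absi addf addi ceil cmpf cmpi cos cosh divf divi exp
        exp2 floor fma log log2 maxf maxi minf mini mulf mulhii muli negf negi
        pow remf remi rsqrt sin sinh sqrt subf subi tan tanh""".split(),
    "integer_logic": "andi ori shli shri xori".split(),
    "memory": """load_ptr_tko load_view_tko store_ptr_tko store_view_tko
        atomic_cas_tko atomic_rmw_tko make_token join_tokens offset global
        get_global make_tensor_view make_partition_view""".split(),
    "control_flow": """module entry if for loop yield break continue return
        assert assume""".split(),
    "shape": """broadcast cat extract get_index_space_shape get_num_tile_blocks
        get_tensor_shape get_tile_block_id iota permute reshape""".split(),
    "reductions": "reduce scan".split(),
    "mma": "mmaf mmai".split(),
    "conversion": """bitcast exti ftof ftoi int_to_ptr itof ptr_to_int
        ptr_to_ptr trunci""".split(),
    "constants": "constant select print".split(),
}

# Flatten into one allow-list; set size equals the sum if no name repeats.
ALLOWED = {name for ops in ROSTER.values() for name in ops}
TOTAL = sum(len(ops) for ops in ROSTER.values())


def is_accepted(mnemonic: str) -> bool:
    """True when `cuda_tile.<mnemonic>` is in the roster listed on this page."""
    return mnemonic in ALLOWED
```

The per-family counts are 37, 5, 13, 11, 10, 2, 2, 9, and 3, which sum to 92 with no duplicates.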

The dialect constructor walks the registration thunks in roster order; each thunk interns the mnemonic into the dialect's OperationName table and installs the op's vtable, fold callback, and verifier hook through the slots described in overview — AbstractOperation Record and Operation Layout — Pointer-Identity Dispatch. Lowering patterns are matched as conversion patterns by the arithmetic-group and pointer-cast dispatchers during the first lowering stage; the conversion is documented in Cuda Tile to TileAA.

Cross-References

Overview describes the dialect's role as the public producer-facing API and the AbstractOperation record structure. Verifiers details the verbatim verifier diagnostics each family emits. Canonicalizers and Folds describes the rewrites applied after verification. Bytecode Reader and Writer documents the on-wire encoding the opcode dispatcher consumes.