Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Frontend Contract and Tile IR Emission

Abstract

Tileiras consumes Tile IR bytecode; producing that bytecode is a frontend's responsibility. A conformant frontend follows three conventions: emit operations in the cuda_tile dialect with the documented operand and attribute structure, attach module-level kernel-launch metadata that survives every subsequent lowering, and serialize using MLIR bytecode under the tileiras-flavored attribute-tag wire format. This page is the producer-facing contract: kernel signature rules, op-construction conventions, attribute requirements, bytecode-format constraints, and the common emission mistakes that produce modules tileiras rejects.

The contract is documented from the consumer's perspective. Every rule below corresponds to a check that fires somewhere in the bytecode reader, the cuda_tile verifier, the ConvertCudaTileToTileAA conversion target, or the downstream kernel-spec lookup. A frontend that satisfies all four boundaries produces bytecode that flows through the entire 53-pass pipeline without producer-visible diagnostics.

Who Produces Tile IR

Tile IR is an open input format. Any compiler that wants to target the tile pipeline can build a frontend; tileiras only inspects the bytecode buffer it receives, not the producer that wrote it.

ProducerSurfaceStatus
NVIDIA's Triton-style frontendHigh-level Python DSL with tt.* kernel attributesPrimary producer; sets the de facto attribute conventions
CUTLASS DSL frontendPython DSL that emits cuda_tile directly through MLIR Python bindingsTargets the same bytecode container with the same attribute names
mlir-translate with a tileiras-aware bytecode writerTextual cuda_tile IR plus the tileiras AttrTag numberingPractical for hermetic tests and small reproducers; requires the tileiras-flavored writer rather than stock upstream
Hand-rolled bytecode emittersDirect LEB128 record construction against the wire format documented in MLIR Bytecode FormatUsed for differential testing and bug reduction; only viable when the producer freezes the tileiras tag table

The producer set is open in both directions: tileiras has no allowlist of signing frontends, and the public bytecode contract has no producer-identity field. The only invariants it checks are the bytecode envelope, the dialect list, and the per-op verifier rules.

The Kernel Signature Contract

A frontend's first job is producing a cuda_tile.entry operation whose signature, attributes, and body region match the public dialect contract. The verifier checks the structure; the kernel-attribute attachment step in ConvertTileAAToTileAS reads the attributes; the function-boundary lowering in ConvertTileFuncToLLVM projects the attributes onto nvvm.* directives.

Required Module Shell

Every conformant module looks like this at the top level:

module attributes {
    nv_tileaa.compute_capability = 90 : i32,    // sm_90
    tt.num_warps = 4 : i32,                     // four warps per CTA
    tt.num_ctas = 1 : i32                       // single-CTA cluster
} {
  cuda_tile.module {
    cuda_tile.entry @gemm(%A: !cuda_tile.ptr<f16>, ...) {
      ...
      cuda_tile.return
    }
  }
}

The outer builtin.module carries the kernel-launch attributes; the inner cuda_tile.module holds one or more cuda_tile.entry operations whose bodies contain the kernel logic.

Kernel-Function Requirements

An entry function must satisfy four structural rules. The verifier in cuda_tile Verifiers catches all four before the conversion target rejects the operation.

RuleVerifier checkProducer responsibility
Operation is cuda_tile.entry (not func.func)The dialect declares entry as the kernel constructor; arbitrary func.func ops are not recognized as kernels by the downstream loweringFrontend must construct cuda_tile.entry, not lift to func.func
Body terminates with cuda_tile.returnRegion terminator must be the matching return op for the entryNo raw func.return allowed in the entry body
Arguments use cuda_tile.ptr, cuda_tile.tensor_view, cuda_tile.partition_view, or scalar typesThe type converter only knows how to lift theseA frontend that passes a raw !llvm.ptr argument fails at type conversion
No view-typed return valuesThe verifier rejects view results across structured-control-flow boundariesViews are produced from arguments inside the kernel, not returned out

The cuda_tile.entry op is distinct from func.func by design. It carries the region in which the kernel-private structured control flow lives, and the downstream lowering can identify the entry without a separate annotation walk. A frontend that tries to lift the body into func.func and tag it with a custom unit attribute will not produce a kernel; the resulting function emits .func rather than .entry and is invisible to the CUDA driver at launch time.

Compute-Capability Attribute

nv_tileaa.compute_capability is the single attribute the frontend must attach to choose a target. Its absence is fatal at ConvertTileFuncToLLVM: the pass emits "Failed to get ComputeCapability" through severity 259/0x103 and aborts the module. The same encoding rule applies everywhere — the integer is the two-digit form major * 10 + minor.

Compute capabilityEncoded integerSM string
sm_7070"sm_70"
sm_8080"sm_80"
sm_8989"sm_89"
sm_90 (Hopper)90"sm_90"
sm_100 (Blackwell datacenter)100"sm_100"
sm_103 (Blackwell Ultra)103"sm_103"
sm_120 (consumer Blackwell)120"sm_120"

The driver passes --gpu-name=sm_<NN> on the command line; the conversion pass reads the --compute-capability option (major * 10 + minor integer) and writes the attribute onto the module. A frontend that emits bytecode without this attribute must rely on the driver to inject it from --gpu-name; a frontend that emits the attribute itself short-circuits the option lookup.

The fallback nv_tileaa.target_spec (a StringAttr of the form "sm_XX") is read when the integer attribute is absent. The two spellings convert to one logical concept; new IR should prefer the canonical underscore form compute_capability.

Kernel-Launch Attributes

The tt.* namespace is the de facto convention for kernel-launch attributes attached at the module level. They flow through ConvertCudaTileToTileAA verbatim (no rename) and are folded into nv_tileaa.kernel_spec during attachKernelSpecAttributes. The kernel-spec record is then read by the scheduler, the agent-switch builder, and the function-boundary lowering, which projects it onto nvvm.* function attributes the AsmPrinter emits as PTX directives.

QUIRK — Triton's tt.* prefix is the project-neutral compiler's de-facto schema Tileiras is a project-neutral CUDA tile compiler, but its kernel-launch contract reads attributes under the tt.* namespace — the prefix Triton uses for its own frontend. tt.num_warps, tt.num_ctas, tt.cluster_dim, and tt.num_stages flow through ConvertCudaTileToTileAA with no rename and land in nv_tileaa.kernel_spec unchanged. A frontend that uses a clean per-project namespace (say myfrontend.num_warps) gets the silent defaults instead — 4 warps, 1 CTA, cluster [1,1,1] — and the scheduler emits a kernel sized for a single warp group with no warning that the producer's intent was ignored.

AttributeDefault if absentProjected PTX directive
tt.num_warps = N : i324.reqntid (32*N), 1, 1
tt.num_ctas = N : i321(drives cluster directive emission)
tt.cluster_dim (3-element i32 array)[1, 1, 1].reqnctapercluster X, Y, Z on SM90+
tt.num_stages = N : i32scheduler default(consumed by modulo scheduler, no direct PTX)

A frontend may attach additional implementation-specific attributes under its own namespace; they survive every lowering stage that does not actively rewrite them and are dropped if no consumer reads them. The recommended practice is to keep producer-internal attributes prefix-namespaced so they cannot collide with the consumer-visible ones.

Optional NVVM Directive Hints

A frontend can ask for a tighter register cap or a per-SM occupancy floor by attaching nvvm.* attributes directly to the kernel function. These bypass the kernel-spec mirror and reach the AsmPrinter unchanged.

AttributeTypePTX directiveUse case
nvvm.maxnreg = N : i32IntegerAttr<i32>.maxnreg NBound per-thread register usage so ptxas can trade registers for occupancy
nvvm.minctasm = N : i32IntegerAttr<i32>.minnctapersm NRequest a minimum occupancy floor
nvvm.maxclusterrank = N : i32IntegerAttr<i32>.maxclusterrank NPortability cap on cluster size
nvvm.blocksareclustersUnitAttr.blocksareclustersTreat every CTA as its own cluster (legal only with cluster_dim = (1, 1, 1))

These attributes are optional. The kernel-spec path is the primary mechanism; direct nvvm.* attributes are for cases where the frontend already knows the exact directive value and wants to skip the mirror step. Mixing the two is legal — the function-boundary lowering writes nvvm.* attributes derived from the kernel-spec only when they are not already present.

Op-Construction Conventions

The 92-op cuda_tile roster (see Operation Roster) divides into families with consistent operand-order and attribute conventions. A frontend that follows the family conventions produces IR that satisfies the verifier on first construction; a frontend that improvises operand orders triggers verbatim diagnostics keyed off the operandSegmentSizes arrays the verifier consults.

Token-Ordered Memory Operations

Every memory-side-effect op carries a token chain. The convention for constructing a load is:

%value, %tok_out = cuda_tile.load_view_tko %view[%i, %j], %mask, %fallback, %tok_in
    { mem_semantic = #cuda_tile<mem_semantic relaxed>,
      mem_scope    = #cuda_tile<mem_scope gpu>,
      operandSegmentSizes = array<i32: 1, 2, 1, 1, 1> }
    : !cuda_tile.partition_view<128x64xf32>, index, index,
      tile<128x64xi1>, tile<128x64xf32>, !cuda_tile.token
    -> tile<128x64xf32>, !cuda_tile.token

The operand order is fixed by the family: (view, indices..., mask, fallback, token_in) for views; (ptr, indices..., mask, fallback, token_in) for raw pointers. The operandSegmentSizes array partitions the operand list into the five slots {view_or_ptr, indices, mask, fallback, token} and is the verifier's primary structural check.

Three structural rules govern token construction:

  • Every memory op consumes one input token and produces one output token. A frontend that leaves the token unthreaded breaks the dataflow representation of memory ordering. Stores produce a token but no data; loads and atomics produce both.
  • A make_token op at function entry seeds the chain. Use it once per independent ordering chain; multiple independent loads that can reorder use separate chains, multiple ordered loads thread the same chain.
  • join_tokens merges two chains. When two independent chains both need to feed into a later store, join them and pass the result through.

The _tko suffix marks the family, but it is also the verifier's keyword: omitting the suffix produces an unknown-op diagnostic during bytecode read.

MMA Operations

Matrix multiply-accumulate operations follow the (A, B, C) -> D convention where C is the accumulator-in and D is the accumulator-out. The same SSA value is permitted to flow through both — that is the common pattern when an MMA runs inside a K-loop. The shape contract is enforced at construction:

OpA shapeB shapeC/D shapeRequired attributes
mmaf (floating)tile<[B ×] M × K × elem_a>tile<[B ×] K × N × elem_b>tile<[B ×] M × N × elem_c>optional rounding
mmai (integer)samesamesamerequired signedness_a, signedness_b

The K dimension is contracted (must agree between A and B); the M and N dimensions are the output extents (must agree between A/B and C/D). The batched form takes rank-3 tile types with a shared leading batch dimension; the verifier rejects any rank disagreement, K-dimension mismatch, or accumulator/ result type divergence.

The MMA atom (WGMMA / tcgen05.mma / mma.sync) is not selected by the frontend. It is the lowering pipeline's job to pick the right atom for the target. A frontend that tries to pick a specific atom must do so through an optimization hint (op-level attribute under optimization_hints), not by constructing a different op.

Reductions and Scans

The reduce and scan ops carry a region with a pure combiner body. The convention is (input, identity) -> result with the combiner taking two block arguments of the input element type.

%sum = cuda_tile.reduce %a, %identity { axis = 1 : i32 } : tile<8x16xf32>
    -> tile<8xf32> {
  ^bb0(%lhs: f32, %rhs: f32):
    %r = cuda_tile.addf %lhs, %rhs : f32
    cuda_tile.yield %r : f32
}

The body must be a pure region — no side-effecting ops, no token-ordered memory ops, no view operations. Element-type identity in the combiner must match the input element type; rank-zero block arguments are mandatory. The verifier rejects each violation with a verbatim diagnostic that names the rule that fired.

Async Pipeline Ops

Async-pipeline ops are emitted by the scheduler in nv_tileas, not by the frontend. A frontend that wants explicit pipeline staging communicates the intent through the module-level tt.num_stages hint and lets the modulo scheduler in Modulo Scheduler and Rau turn it into producer/consumer agents during lowering. A frontend that emits nv_tileas.async.pipeline.* ops directly bypasses the verifier — nv_tileas is not legal at the bytecode boundary.

Required vs Optional Attributes

A single table summarises every attribute the frontend must, may, or must not attach to a conformant cuda_tile module. "Required" means the lowering fails without it; "optional" means a sensible default applies; "advisory" means the attribute is read if present but has no failure path when absent.

AttributeCarrierStatusDefault if absent
nv_tileaa.compute_capabilityModuleRequired"Failed to get ComputeCapability" (O3); driver-supplied from --gpu-name
nv_tileaa.target_specModuleFallbackUsed when compute_capability is absent
tt.num_warpsModuleOptional4 (1 warp group)
tt.num_ctasModuleOptional1 (single-CTA cluster)
tt.cluster_dimModuleOptional[1, 1, 1]
tt.num_stagesModuleAdvisoryScheduler default
nvvm.maxnregFunctionOptionalNo cap; ptxas chooses
nvvm.minctasmFunctionOptionalNo occupancy floor
nvvm.maxclusterrankFunctionOptionalNo portability cap
nvvm.blocksareclustersFunctionOptionalOff
nvvm.kernelFunctionSynthesizedAttached by downstream rewrite; frontend should not emit
nv_tileaa.kernel_specFunctionSynthesizedAttached by attachKernelSpecAttributes from tt.* hints
nv_tileaa.occupancyFunctionOptionalNo nvvm.maxnreg synthesized
Per-op mem_semantic, mem_scopeOpConditionalweak/CTA when absent; required for non-weak orderings
Per-op fastmathOpOptionalNo fast-math flags
Per-op optimization_hintsOpAdvisoryNo hint applied
Per-op operandSegmentSizesOpRequiredVerifier emits structural error when absent on multi-operand-family ops

The cross-cutting policy for every attribute family — which stages drop, preserve, synthesize, or read each one — is documented in Attribute System and Lowering.

Bytecode-Format Constraints

The wire format is not stock MLIR bytecode. A frontend that constructs a valid cuda_tile module in memory still has to clear the bytecode envelope and the attribute-tag numbering before tileiras can read the buffer.

Magic and Version

Every conformant container opens with the eight-byte magic and a three-VarInt Tile-version triple. The magic and version constants are documented at MLIR Bytecode Format — Header Parser. The accepted version range is 13.1.x only; the parser emits an "unsupported Tile version ..." diagnostic for everything else.

7f 54 69 6c 65 49 52 00    // "\x7fTileIR\0"
0d 01 00                   // VarInt 13, VarInt 1, VarInt 0 (Tile 13.1.0)

Upstream MLIR fills the eighth magic byte with the start of "\nMLIR". A producer that uses an unmodified upstream writer emits 0x0A in that slot, and the tileiras reader rejects the buffer with "invalid magic number at position 7". The driver also surfaces this case with the diagnostic "input does not correspond to Tile IR bytecode (it looks like MLIR bytecode instead)".

Dialect List

The envelope's dialect list must include cuda_tile. A builtin entry is synthesized automatically by the MLIR infrastructure. Other dialects are legal only if they appear in the registered set and the frontend actually uses them: arith for constants whose representation is not a cuda_tile.constant, func for symbol references, and the optional debug-info dialects.

A module that lists nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass, nvgpu, nvvm, or llvm in its dialect list is rejected by the conversion-target legality check: those are internal lowering dialects, not public input. The diagnostic spelling is "unregistered dialect: <name>" from the dialect-list walker.

AttrTag Wire Format

The 13-entry attribute-tag table inside the bytecode reader is the single-largest wire-format-breaking divergence from upstream. The shipped numbering is documented in Self-Contained Attribute Dispatch; the key differences are:

Tagtileiras meaningUpstream MLIR meaning
1StringAttrIntegerAttr
4DenseElementsAttrTypeAttr
5DenseElementsAttr<string>StringAttr
13AssumePredicateAttr(undefined)

A producer that writes attributes through stock mlir-translate --serialize-bytecode emits the upstream numbering and the tileiras reader decodes every attribute incorrectly — usually surfacing as garbled type mismatches mid-IR rather than envelope errors. The practical implication is that a frontend cannot use upstream mlir-translate directly; it must either link the tileiras-aware writer or fork the upstream AttrTag table.

Canonical VarInt Encoding

Every multi-byte integer in the container uses the LEB128 variant documented in VarInt Encoding. Producers must emit the canonical (shortest) encoding for every integer; an overlong encoding decodes to the same value but is rejected with "non-canonical VarInt" and the section fails. The writer-side rule is straightforward: count leading zero bytes in the integer's two's-complement form, pick the shortest encoding that fits, never zero-pad for alignment.

Section Ordering

Sections must be present in dependency order. The reader's walker assumes that later sections can index into earlier ones, so a producer that writes the sections out-of-order fails the cross-section index validation, not the section-walker. The required order is documented in Section Walker Algorithm. The minimum set for a cuda_tile module is string, type, attribute/constant, IR (func and global), and the end marker. Resource and debug sections are optional.

Common Pitfalls

Most frontend bugs are well-formed bytecode that tileiras refuses for one of a small set of repeatable reasons. The diagnostics are verbatim from the reader and the verifier; the root causes are producer-side.

Missing Kernel Marker

Symptom. The kernel compiles, ptxas accepts the PTX, but the resulting cubin exposes no entry symbol for the CUDA driver to launch.

Cause. The frontend wrote a func.func instead of cuda_tile.entry, or the downstream cute.kernel-to-nvvm.kernel rewrite did not fire because the function never picked up the cute.kernel marker. The directive emitter wrote .func rather than .entry because no kernel-spec attached and no nvvm.kernel was present.

Fix. Emit cuda_tile.entry for every kernel. Do not lift to func.func before the bytecode boundary; the dialect's structured-control-flow surface covers everything a kernel body needs.

Wrong AttrTag Numbering

Symptom. The bytecode parser emits "unknown attribute tag <N>" mid-IR, or — more confusingly — successfully decodes the file but produces a module whose attribute types are systematically off by one slot.

Cause. The writer used stock upstream AttrTag numbering. Tag 1 wrote an IntegerAttr (upstream) where the tileiras reader expected a StringAttr; tag 5 wrote a StringAttr where the reader expected a DenseElementsAttr<string>.

Fix. Use a tileiras-aware writer. The producer-side AttrTag table is frozen to the values the reader uses; encoding through any other table produces an unreadable buffer regardless of in-memory correctness.

Missing Compute Capability

Symptom. "Failed to get ComputeCapability" (O3) or "failed to get compute capability." (O2) at lowering time, depending on which pass first observes the missing attribute.

Cause. Neither nv_tileaa.compute_capability nor nv_tileaa.target_spec attached to the module. The driver's --gpu-name=sm_<NN> option is the intended injection point; the frontend may skip the attribute and rely on the driver, but a module produced without the attribute is not portable across drivers that do not inject one.

Fix. Attach nv_tileaa.compute_capability = N : i32 at the module level when the frontend knows the target, or document the requirement that the caller pass --gpu-name.

Wrong Operand Order on Token-Ordered Ops

Symptom. The verifier emits "expected token operand" or a structural error keyed on operandSegmentSizes.

Cause. The frontend placed the token operand at a non-canonical position, or omitted operandSegmentSizes. The verifier reconstructs the operand partition from the array; without it, the multi-operand families fail structural validation.

Fix. Follow the operand-order convention in Operation Roster: view/ptr first, indices, optional mask, optional fallback, token last. Always emit operandSegmentSizes as a five-element array<i32> for the load/store/atomic families.

tile<...> vs tensor<...> Type Confusion

Symptom. The first-stage type converter fails with an unexpected type diagnostic, or a downstream pattern fails to match.

Cause. A frontend that ported from a tensor-typed IR may have lifted shape operations to tensor<> rather than cuda_tile.tile<>. The two types have different verifier contracts: cuda_tile.tile enforces power-of-two dimensions and a 16-million-element ceiling, while tensor<> has neither check.

Fix. Construct cuda_tile.tile<...> for every shaped value in the kernel body. Tensor types appear in the IR only after ConvertCudaTileToTileAA lifts tiles to tensors during the alias-aware stage.

Returning a View

Symptom. Verifier emits "view-typed result rejected" from cuda_tile.if or cuda_tile.for.

Cause. View types are not first-class results of structured-control-flow operations. The intended pattern is to construct the view inside the region and consume it directly, not to return it across the region boundary.

Fix. Construct views close to where they are consumed; if conditional view construction is necessary, branch around the consuming load rather than yielding a view from cuda_tile.if.

Power-of-Two Tile Dimensions

Symptom. "tile dimensions must be powers of two" at type construction.

Cause. A tile shape includes a non-power-of-two dimension. The shape verifier walks each tile type and rejects any dimension that is not 2^k for some non-negative k. The element-count ceiling fires later: products above 16 million elements are rejected with "tile would exceed the maximum element count".

Fix. Round tile shapes up to the next power of two and use masking for the ragged region. Frontends that target non-power-of-two problem sizes typically tile around an oversized power-of-two block and predicate the tail.

Unregistered Dialect

Symptom. "unregistered dialect: <name>" from the dialect-list walker.

Cause. The frontend declared an internal lowering dialect (nv_tileaa, nv_tileas, cute, etc.) in its bytecode envelope. These dialects are not public input — they are produced inside tileiras and are illegal at the bytecode boundary.

Fix. Restrict the dialect list to cuda_tile, builtin, arith, func, and the debug-info dialects. Construct everything else through cuda_tile's own operation surface.

Minimal Hand-Rolled Kernel

For testing, hermetic builds, or differential reduction, a frontend can hand-construct a tiny kernel as textual MLIR and run it through a tileiras-aware writer. The minimum is one entry function with one return:

module attributes {
    nv_tileaa.compute_capability = 90 : i32,
    tt.num_warps = 4 : i32
} {
  cuda_tile.module {
    cuda_tile.entry @noop() {
      cuda_tile.return
    }
  }
}

Serialized through a tileiras-aware writer, this produces a 256-byte buffer that flows through the entire pipeline and emits a .entry noop PTX function with the four expected directive lines (.entry, .reqntid 128, 1, 1, .maxnreg if set, and the parameter block). It is the smallest input that exercises every stage of the cascade and is the canonical reduction target for producer-side bugs.

Triton-Frontend Extensions

NVIDIA's Triton-style frontend extends the contract with a handful of domain-specific module attributes. They follow the same convention as the documented tt.* attributes — module-level, integer-or-array values, read once by attachKernelSpecAttributes and folded into the nv_tileaa.kernel_spec record on each entry function.

Triton attributeEffectLowering site
tt.num_stages = N : i32Hint to the modulo scheduler about pipeline depthAsync/Pipeline Family
tt.cluster_size = [X, Y, Z]Shorthand for tt.cluster_dim plus tt.num_ctasCTA Cluster Family
tt.is_persistentMark the kernel as persistent for the StaticPersistent schedulerPipeline and Tile Scheduler
tt.dump_intermediateProducer-side debugging hint (informational only)(no consumer)

These are not part of the canonical contract — a non-Triton frontend can ignore them entirely — but they are stable enough that downstream consumers can rely on them when they are present. See cuda_tile Overview for the public dialect surface they map onto and Attribute System and Lowering for the lifecycle that each one follows from the module dictionary to the PTX directive emitter.

Cross-References

cuda_tile Overview documents the public dialect surface this contract targets. Operation Roster catalogues every legal mnemonic and operand-order convention. Types and Attributes covers the type-storage parameters and attribute parse contract. Verifiers documents the verifier diagnostics this page references by spelling. MLIR Bytecode Format is the wire-format reference; Dialect Bytecode Reader/Writer Status explains why only cuda_tile has a linked reader. Position in nvcc 13.1 Toolchain shows where the frontend's bytecode artifact lands in the larger build. Attribute System and Lowering is the cross-stage policy reference for every attribute discussed above; GPU Execution Model walks the same kernel-attribute story from the perspective of PTX directive emission. DSL to PTX End-to-End traces a single kernel through every stage of the pipeline from this contract down to PTX text.