Lowering: cuda_tile to nv_tileaa
Abstract
ConvertCudaTileToTileAA is the first lowering pass in the tileiras pipeline and the only one that translates from a publicly-defined dialect. It rewrites cuda_tile — the bytecode-input form users author against — into the internal nv_tileaa dialect every subsequent pass operates on. No cuda_tile.* operation may survive this pass.
The conversion is partial. The pass loads six legal dialects, marks cuda_tile illegal, attaches a dynamic-legality predicate to ub.poison, registers three type-conversion functor pairs, and applies a pattern bank assembled by three independent populators in a fixed order.
Boundary Contract
| Dimension | Specification |
|---|---|
| Allowed input ops | every cuda_tile.* executable op (the whole dialect is marked illegal via addIllegalDialect<cuda_tile>, so each op must be rewritten away); ub.poison accepted under a dynamic-legality predicate that requires an nv_tileaa-primitive result type |
| Allowed input types / attributes | cuda_tile::TileType, cuda_tile::PointerType, cuda_tile::TokenType; the arith.fastmath-shaped fastmath property carried on cuda_tile arithmetic ops, plus axis, inclusive, and per-op attribute dictionaries; module-level --compute-capability option must parse |
| Guaranteed output ops | only ops from arith, nv_tileaa, func, gpu, scf, math (the six legal dialects); no cuda_tile.* survives; bridge builtin.unrealized_conversion_cast may remain pending downstream reconciliation |
| Guaranteed output types / attributes | tile → llvm.struct<...> descriptor, pointer → llvm.ptr, token → llvm.token through the materialiser triple; region block-arg types rewritten through the same TypeConverter; the fastmath property is propagated unchanged onto the matching nv_tileaa arithmetic op |
| Violation behavior | residual cuda_tile.* op → applyPartialConversion fails with "failed to convert cuda_tile to nv_tileaa"; malformed compute capability → "invalid or missing --compute-capability option"; mismatched region block-arg types → next-stage verifier rejects the parent op (no localised diagnostic from this pass) |
Pass Driver
runOnOperation reads the stored --compute-capability option, builds the conversion target, populates three pattern groups in order, runs the PDL fallback, and invokes applyPartialConversion. Two user-facing diagnostics escape: "invalid or missing --compute-capability option" when the option is missing or malformed, and "failed to convert cuda_tile to nv_tileaa" when partial conversion fails to legalise every cuda_tile.* op.
LogicalResult convertCudaTileToTileAA(ModuleOp mod, ComputeCapability cc) {
  // Reject a missing or malformed --compute-capability before touching the IR.
  if (!cc.valid()) {
    return emit("invalid or missing --compute-capability option");
  }
  // A RewritePatternSet is always bound to an MLIRContext.
  RewritePatternSet patterns(mod.getContext());
  populatePartA(patterns);  // arithmetic, comparison, conversion, indexing, control flow
  populatePartB(patterns);  // memory, pointer, token, view, partition
  populatePartC(patterns);  // mma, reduce, scan, transcendental specialists
  ConversionTarget target = buildConversionTarget(mod);
  FrozenRewritePatternSet frozen;
  compilePDLPatterns(patterns, &frozen);
  if (failed(applyPartialConversion(mod, target, frozen))) {
    return emit("failed to convert cuda_tile to nv_tileaa");
  }
  return success();
}
The pass walks all cuda_tile.module operations nested in the input module before conversion. The walker is a recursive op-tree walk filtered by TypeID; collected modules land in a small inline-allocated vector sized for the common case of one nested module per bytecode input.
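The module walk described above can be pictured with a small standalone model. This is a sketch under stated assumptions, not the pass's code: `ToyOp` and `collectByName` are hypothetical names, ops are matched by string name rather than by TypeID, and a plain `std::vector` stands in for the inline-allocated small vector.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an operation: a name plus nested child operations.
struct ToyOp {
    std::string name;
    std::vector<ToyOp> children;
};

// Pre-order recursive walk that collects every op whose name matches `wanted`.
// A real pass would compare TypeIDs; strings keep this sketch self-contained.
void collectByName(const ToyOp &root, const std::string &wanted,
                   std::vector<const ToyOp *> &out) {
    if (root.name == wanted)
        out.push_back(&root);
    for (const ToyOp &child : root.children)
        collectByName(child, wanted, out);
}
```

Calling `collectByName(root, "cuda_tile.module", found)` on a tree with one nested module leaves exactly one pointer in `found`, which is the common case the text mentions.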
Conversion Target
The conversion target builder marks six dialects legal, declares cuda_tile fully illegal, and attaches dynamic legality to ub.poison. The same target object is reused across all three populators.
ConversionTarget buildConversionTarget(ModuleOp mod) {
  ConversionTarget target(*mod.getContext());
  // Fully legal: accept any op of these dialects without further checks
  target.addLegalDialect<arith::ArithDialect,
                         nv_tileaa::TileAADialect,
                         func::FuncDialect,
                         gpu::GPUDialect,
                         scf::SCFDialect,
                         math::MathDialect>();
  // Fully illegal: every cuda_tile op must be rewritten away
  target.addIllegalDialect<cuda_tile::CudaTileDialect>();
  // Dynamic legality: ub.poison is legal once its result type is an nv_tileaa primitive
  target.addDynamicallyLegalOp<ub::PoisonOp>([](ub::PoisonOp op) {
    return isLegalTileAAType(op.getResult().getType());
  });
  return target;
}
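To make the three legality tiers concrete, here is a stdlib-only toy model of how such a target might classify ops. It is an illustrative sketch: `ToyTarget`, `ToyOp`, and `buildToyTarget` are hypothetical, and the `ub.poison` predicate ("result type is not a `!cuda_tile.` type") is an assumed stand-in for `isLegalTileAAType`, whose real definition is not given here.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

struct ToyOp {
    std::string dialect;     // e.g. "cuda_tile"
    std::string name;        // e.g. "addi"
    std::string resultType;  // textual result type
};

enum class Legality { Legal, Illegal, Unknown };

class ToyTarget {
public:
    void addLegalDialect(const std::string &d) { legal_.insert(d); }
    void addIllegalDialect(const std::string &d) { illegal_.insert(d); }
    void addDynamicallyLegalOp(const std::string &fullName,
                               std::function<bool(const ToyOp &)> pred) {
        dynamic_[fullName] = std::move(pred);
    }
    // Per-op dynamic predicates win over dialect-wide settings.
    Legality classify(const ToyOp &op) const {
        auto it = dynamic_.find(op.dialect + "." + op.name);
        if (it != dynamic_.end())
            return it->second(op) ? Legality::Legal : Legality::Illegal;
        if (illegal_.count(op.dialect)) return Legality::Illegal;
        if (legal_.count(op.dialect)) return Legality::Legal;
        return Legality::Unknown;  // a partial conversion leaves such ops alone
    }
private:
    std::set<std::string> legal_, illegal_;
    std::map<std::string, std::function<bool(const ToyOp &)>> dynamic_;
};

ToyTarget buildToyTarget() {
    ToyTarget t;
    for (const char *d : {"arith", "nv_tileaa", "func", "gpu", "scf", "math"})
        t.addLegalDialect(d);
    t.addIllegalDialect("cuda_tile");
    // Assumed predicate: legal once the result is no longer a cuda_tile type.
    t.addDynamicallyLegalOp("ub.poison", [](const ToyOp &op) {
        return op.resultType.rfind("!cuda_tile.", 0) != 0;
    });
    return t;
}
```

In this model an `llvm` op classifies as `Unknown`, which mirrors why a partial conversion tolerates ops it was never told about, while any `cuda_tile` op is always `Illegal`.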
The type-converter materialisers handle the residual cases where partial conversion needs a bridge value while the IR is mid-rewrite. Source materialisers run when an nv_tileaa-typed value is needed but only the original cuda_tile-typed value exists; target materialisers run for the reverse direction. Both produce builtin.unrealized_conversion_cast operations that the next pass's reconciliation phase erases.
Input and Output Dialects
| Direction | Surface |
|---|---|
| input ops | cuda_tile.* (all executable ops), ub.poison (dyn-legal) |
| input types | cuda_tile::TileType, cuda_tile::PointerType, cuda_tile::TokenType |
| output ops (legal after this pass) | arith, nv_tileaa, func, gpu, scf, math, plus already-legal llvm.struct and llvm.ptr shapes produced by type materialisation |
| output types | tile types become llvm.struct<...>, pointer types become llvm.ptr, token types become llvm.token (via the materialiser triple) |
The canonical rewrite shape for a one-to-one Part-A pattern is:
input : %r = cuda_tile.addi %a, %b : <tile shape>
output : %r = nv_tileaa.addi %a, %b : <tile shape>
Region-bearing ops (cuda_tile.reduce, cuda_tile.scan) keep their region intact; only block-argument types and yielded values flow through the TypeConverter.
Three-Populator Structure
Three populators build the pattern set in fixed order. Parts A and B are mutually independent at the source level; they run sequentially so the resulting pattern-set composition stays reproducible. Part C runs after both because its patterns depend on the type-conversion and layout decisions A and B have already published.
| Part | Patterns | Role |
|---|---|---|
| A | ~45 | Arithmetic, comparison, conversion, indexing, structured control flow |
| B | ~34 | Memory, pointer, token, view, partition |
| C | 4 | mmaf, mmai, reduce, scan — specialists whose lowering depends on layout choices A and B locked in |
Part A registers hand-written OpConversion patterns (AddIOpConversion, ReduceOpConversion, and so on). Part B mixes template-generated GenericConversion<cuda_tile::XOp, nv_tileaa::YOp> patterns with custom view/token/entry patterns. Part C is four specialists for operations whose rewrite shape varies with the parent op's element type, accumulator location, or combiner-region structure.
Singleton Pattern Adders
Eight pattern classes register through dedicated singleton adders rather than through the main populator bodies, because downstream callers (the CudaTileOptimizer test driver and the rsqrt/fma fusion pass) need to install them into private pattern sets without pulling in the full Part-A/B/C registration. Each adder is a single-purpose helper that allocates one OpConversionPattern and pushes it onto the supplied RewritePatternSet.
| cuda_tile op | Pattern class | Role |
|---|---|---|
| cuda_tile.trunci | TruncIOpConversion | Integer truncation, lowered through arith.trunci retyped over nv_tileaa operand shapes |
| cuda_tile.rsqrt | RsqrtOpConversion | Reciprocal square root, rewrites to nv_tileaa.rsqrt |
| cuda_tile.maxi | MaxIOpConversion | Integer max, lowered through arith.maxsi (signed) or arith.maxui (unsigned) over nv_tileaa operand shapes |
| cuda_tile.itof | IToFOpConversion | Integer-to-float conversion, lowered through arith.sitofp / arith.uitofp over nv_tileaa operand shapes |
| cuda_tile.global | GlobalOpConversion | Global symbol declaration, rewrites to nv_tileaa.global |
| cuda_tile.fma | FmaOpConversion | Fused multiply-add, rewrites to nv_tileaa.fma |
| cuda_tile.constant | ConstantOpConversion | Tile constant, rewrites to nv_tileaa.splat (with a constant scalar) or to arith.constant carrying a dense tensor for static aggregates |
| cuda_tile.assume | AssumeOpConversion | Assumption hint, rewrites to nv_tileaa.assume |
Each rewrite has the same one-to-one shape:
%r = cuda_tile.rsqrt %x : !cuda_tile.tile<8x64xf32>
↓
%r = nv_tileaa.rsqrt %x : tensor<8x64xf32>
The eight ops never appear in the main populator rosters; the singleton adders are the only registration path that brings them into a pattern set.
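The point of a singleton adder is that a caller can assemble a private pattern set from just the pieces it needs. The sketch below models that idea with the standard library only; `ToyPatternSet`, the `add*Pattern` helpers, and `buildFusionPatterns` are hypothetical names chosen for illustration, and a pattern is reduced to a source-name to target-name mapping.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy pattern set: maps a source op name to the op name it rewrites to.
using ToyPatternSet = std::map<std::string, std::string>;

// Hypothetical single-purpose adders: each installs exactly one pattern.
void addRsqrtPattern(ToyPatternSet &p)  { p["cuda_tile.rsqrt"] = "nv_tileaa.rsqrt"; }
void addFmaPattern(ToyPatternSet &p)    { p["cuda_tile.fma"] = "nv_tileaa.fma"; }
void addAssumePattern(ToyPatternSet &p) { p["cuda_tile.assume"] = "nv_tileaa.assume"; }

// Look up the rewrite for one op; returns nullptr when no pattern matches.
const std::string *rewrite(const ToyPatternSet &p, const std::string &op) {
    auto it = p.find(op);
    return it == p.end() ? nullptr : &it->second;
}

// A private set as a fusion pass might build it: only the two patterns it needs,
// without pulling in any bulk populator.
ToyPatternSet buildFusionPatterns() {
    ToyPatternSet p;
    addRsqrtPattern(p);
    addFmaPattern(p);
    return p;
}
```

With this set, `rewrite(p, "cuda_tile.rsqrt")` finds a match while `rewrite(p, "cuda_tile.trunci")` returns `nullptr`, because the caller never installed the truncation adder.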
cuda_tile.trunci Walk
TruncIOpConversion is the canonical type-narrowing rewrite. The operand is an integer tile and the result is a narrower integer tile of the same shape. The rewrite keeps the operand SSA value verbatim, swaps the op mnemonic, and asks the TypeConverter for the result type:
// Before
%narrow = cuda_tile.trunci %wide : !cuda_tile.tile<128xi32> to !cuda_tile.tile<128xi8>
// After
%narrow = nv_tileaa.trunci %wide : tensor<128xi32> to tensor<128xi8>
The operand %wide flows through the source materialiser when its definition has not yet been rewritten: applyPartialConversion inserts a builtin.unrealized_conversion_cast %wide : !cuda_tile.tile<128xi32> to tensor<128xi32> that the downstream cast-reconciliation phase erases once both ends are TileAA-typed. No attribute hand-off is needed, because trunci carries only its result type.
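The textual shape of that bridge cast can be reproduced with a small string-level model. This is only a sketch: `convertTileType` and `bridgeCast` are hypothetical helpers, and the mapping (shaped tile to `tensor<...>`, rank-0 tile to its bare element type) is an assumption inferred from the worked examples on this page rather than the converter's actual implementation.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Assumed textual mapping, inferred from the examples in this document:
//   !cuda_tile.tile<128xi32> -> tensor<128xi32>  (shaped tile)
//   !cuda_tile.tile<f32>     -> f32              (rank-0 tile unwraps)
// Any other type is returned unchanged.
std::string convertTileType(const std::string &type) {
    const std::string prefix = "!cuda_tile.tile<";
    if (type.size() <= prefix.size() + 1 ||
        type.compare(0, prefix.size(), prefix) != 0 || type.back() != '>')
        return type;
    const std::string body =
        type.substr(prefix.size(), type.size() - prefix.size() - 1);
    // A shaped tile starts with a dimension, i.e. a digit.
    const bool shaped =
        std::isdigit(static_cast<unsigned char>(body.front())) != 0;
    return shaped ? "tensor<" + body + ">" : body;
}

// Render the bridge cast for a value whose type has not been converted yet.
std::string bridgeCast(const std::string &value, const std::string &srcType) {
    return "builtin.unrealized_conversion_cast " + value + " : " + srcType +
           " to " + convertTileType(srcType);
}
```

For `%wide` of type `!cuda_tile.tile<128xi32>`, `bridgeCast` yields the same cast text quoted in the paragraph above.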
cuda_tile.fma Walk
FmaOpConversion is a three-operand floating-multiply-add. All three operands share the source tile type and the result has the same shape:
// Before
%r = cuda_tile.fma %a, %b, %c { fastmath = #arith.fastmath<contract> }
: !cuda_tile.tile<8x64xf32>
// After
%r = nv_tileaa.fma %a, %b, %c { fastmath = #arith.fastmath<contract> }
: tensor<8x64xf32>
The fastmath attribute carries verbatim through the rewrite. Both dialects accept the shared arith.fastmath enum (the same one the MLIR arith dialect publishes), so no attribute kind translation is required — the same typed attribute object is re-attached to the rewritten op. A later lowering past nv_tileaa translates it to llvm.fastmath when descending into the LLVM dialect, but at this stage the attribute is dialect-shared rather than dialect-private.
Type-Converter Materialisers
Three type-converter functor pairs register before the populators run. Each pair combines an addConversion callback (called when the converter sees the source type) with an addMaterialization callback (called when partial conversion needs a bridge value while the IR is mid-rewrite). Materialisations should not survive later canonicalisation — the reconciliation phase in the next pass erases them.
| Source type | Target type | Materialiser direction |
|---|---|---|
| cuda_tile::TileType | llvm.struct<...> (descriptor shape) | source — produces an nv_tileaa value when only a cuda_tile value exists |
| cuda_tile::PointerType | llvm.ptr | target — produces a cuda_tile value when only an nv_tileaa value exists |
| cuda_tile::TokenType | llvm.token | source — produces an nv_tileaa value when only a cuda_tile value exists |
Splitting source from target materialisers preserves token ordering and view identity for the scheduler, which still needs to reason about memory dependences before later NVVM lowering flattens tokens into integers. A purely-symmetric materialiser pair would lose the directional information the dialect-conversion engine uses to pick the right cast.
Block-Argument Type Flow
Region-bearing operations (cuda_tile.reduce, cuda_tile.scan, structured control flow that carries cuda_tile-typed iteration arguments) need block-argument types converted in the same step as their parent op. The standard inline-region helper does not see the pass's type converter, so the region-rewriting patterns construct their replacement operations explicitly:
LogicalResult lowerRegionOp(Operation *src, OperationName dst,
                            ConversionPatternRewriter &rw,
                            const TypeConverter &types) {
  SmallVector<Value> operands;
  if (failed(types.convertOperands(src->getOperands(), operands)))
    return failure();
  SmallVector<Type> resultTypes;
  if (failed(types.convertTypes(src->getResultTypes(), resultTypes)))
    return failure();
  OperationState state(src->getLoc(), dst);
  state.addOperands(operands);
  state.addTypes(resultTypes);
  state.addAttributes(src->getAttrs());
  for (Region &region : src->getRegions()) {
    Region *newRegion = state.addRegion();
    rw.inlineRegionBefore(region, *newRegion, newRegion->begin());
    if (failed(rw.convertRegionTypes(newRegion, types)))
      return failure();
  }
  Operation *replacement = rw.create(state);
  rw.replaceOp(src, replacement->getResults());
  return success();
}
convertRegionTypes walks the block-argument list of every block in the region and rewrites types through the same converter the parent op uses. Without this step, the parent op verifies against post-conversion operand types but its region terminator yields pre-conversion types — a signature mismatch the next-stage verifier reports without enough context to diagnose properly.
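The mismatch described above can be demonstrated with a toy verifier. The following is a minimal sketch, assuming a simplified model in which a region op records one expected accumulator type and each block lists only its argument types; `ToyRegionOp`, `convertBlockArgTypes`, and `verifyRegionOp` are hypothetical names, not MLIR APIs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct ToyBlock {
    std::vector<std::string> argTypes;  // block-argument types
};

struct ToyRegionOp {
    std::string accumulatorType;   // type the parent op expects inside its region
    std::vector<ToyBlock> blocks;  // blocks of the (single) region
};

using TypeFn = std::function<std::string(const std::string &)>;

// Rewrite every block-argument type through the same converter the parent used.
void convertBlockArgTypes(ToyRegionOp &op, const TypeFn &convert) {
    for (ToyBlock &block : op.blocks)
        for (std::string &type : block.argTypes)
            type = convert(type);
}

// Toy verifier: every block argument must match the parent's accumulator type.
bool verifyRegionOp(const ToyRegionOp &op) {
    for (const ToyBlock &block : op.blocks)
        for (const std::string &type : block.argTypes)
            if (type != op.accumulatorType)
                return false;
    return true;
}
```

An op whose parent signature is already converted fails `verifyRegionOp` until `convertBlockArgTypes` has run, which is the failure the region-rewriting patterns are written to avoid.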
Part C Specialists
Part C registers four specialists that depend on layout decisions made by Parts A and B. Each takes a cuda_tile op whose lowering shape is parameterised by element type, layout intent, or combiner-region structure, and emits the matching nv_tileaa form.
cuda_tile.mmaf and cuda_tile.mmai
The float and integer matrix-multiply-accumulate ops rewrite to nv_tileaa.dot with the element-type-specific attribute set. The rewriter selects FP rounding mode and accumulator precision from the source op's attributes.
%d = cuda_tile.mmaf %a, %b, %c { fastmath = #arith.fastmath<contract> }
: !cuda_tile.tile<128x64xf16>, !cuda_tile.tile<64x128xf16>, !cuda_tile.tile<128x128xf32>
↓
%d = nv_tileaa.dot %a, %b, %c { input_precision = "tf32", fastmath = #arith.fastmath<contract> }
: tensor<128x64xf16>, tensor<64x128xf16>, tensor<128x128xf32>
cuda_tile.reduce Worked Example
cuda_tile.reduce carries a combiner region whose block arguments are accumulator-typed and whose terminator yields the next accumulator value. The rewriter walks the region, converts block-argument types through the shared TypeConverter, and rebuilds the op as nv_tileaa.reduce with the converted region body.
Input:
%sum = cuda_tile.reduce %values { axis = 1 : i32 } : !cuda_tile.tile<8x64xf32> -> !cuda_tile.tile<8xf32> {
^bb0(%acc: !cuda_tile.tile<f32>, %val: !cuda_tile.tile<f32>):
%s = cuda_tile.addf %acc, %val : !cuda_tile.tile<f32>
cuda_tile.yield %s : !cuda_tile.tile<f32>
}
The pattern converts the parent op's operand and result types, inlines the region, then walks the new region's blocks to convert each block-argument type:
%sum = nv_tileaa.reduce %values { axis = 1 : i32 } : tensor<8x64xf32> -> tensor<8xf32> {
^bb0(%acc: f32, %val: f32):
%s = nv_tileaa.addf %acc, %val : f32
nv_tileaa.yield %s : f32
}
Block-argument types !cuda_tile.tile<f32> become f32 because the TileType conversion strips the dialect wrapper; the terminator and combiner body rewrite recursively under the same partial-conversion driver, since cuda_tile.addf and cuda_tile.yield are in the illegal dialect and match Part A patterns.
If the rewriter forgot to convert block-argument types, the parent nv_tileaa.reduce would carry converted types at its outer signature (tensor<8x64xf32> -> tensor<8xf32>) while the inner region's ^bb0 would still bind !cuda_tile.tile<f32>. The verifier would then reject the operation with a signature mismatch that the next-stage diagnostics cannot localise back to this pass.
cuda_tile.scan Worked Example
cuda_tile.scan follows the same shape as reduce but produces a tensor of the same rank as the input — every output element is the cumulative reduction of the prefix of input elements along the scan axis. The rewriter applies identical region-conversion logic, only changing the parent op's mnemonic and keeping the result rank equal to the input rank.
Input:
%prefix = cuda_tile.scan %values { axis = 1 : i32, inclusive = true }
: !cuda_tile.tile<8x64xf32> -> !cuda_tile.tile<8x64xf32> {
^bb0(%acc: !cuda_tile.tile<f32>, %elem: !cuda_tile.tile<f32>):
%sum = cuda_tile.addf %acc, %elem : !cuda_tile.tile<f32>
cuda_tile.yield %sum : !cuda_tile.tile<f32>
}
Output:
%prefix = nv_tileaa.scan %values { axis = 1 : i32, inclusive = true }
: tensor<8x64xf32> -> tensor<8x64xf32> {
^bb0(%acc: f32, %elem: f32):
%sum = nv_tileaa.addf %acc, %elem : f32
nv_tileaa.yield %sum : f32
}
The axis and inclusive attributes carry verbatim; block-argument types unwrap from !cuda_tile.tile<f32> to f32 via the TileType converter, and the inner cuda_tile.addf / cuda_tile.yield rewrite recursively under Part A patterns matched by the same partial-conversion driver.
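As a reference for what the scan computes, here is a scalar C++ model of a scan along axis 1 with a user-supplied combiner, playing the role of the combiner region. It is a sketch, not the lowering itself: `scanAxis1` is a hypothetical name, and the exclusive variant (each output is the combination of the strictly preceding elements, seeded with an identity value) is an assumed semantics, since only `inclusive = true` appears in the example.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Row = std::vector<float>;
using Combiner = std::function<float(float, float)>;

// Scan each row (axis 1) of a 2-D tile with a binary combiner.
//   inclusive: out[i] = in[0] (+) ... (+) in[i]
//   exclusive (assumed): out[i] = identity (+) in[0] (+) ... (+) in[i-1]
// The output always has the same shape as the input.
std::vector<Row> scanAxis1(const std::vector<Row> &in, const Combiner &combine,
                           float identity, bool inclusive) {
    std::vector<Row> out;
    out.reserve(in.size());
    for (const Row &row : in) {
        Row result;
        result.reserve(row.size());
        float acc = identity;
        for (float value : row) {
            if (!inclusive) result.push_back(acc);  // emit before combining
            acc = combine(acc, value);
            if (inclusive) result.push_back(acc);   // emit after combining
        }
        out.push_back(result);
    }
    return out;
}
```

With addition as the combiner and `0` as the identity, the row `1, 2, 3` scans to `1, 3, 6` inclusively and to `0, 1, 3` under the assumed exclusive semantics; in both cases the result keeps the input's rank and extent.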
Transcendental Specialists
The transcendental specialists (cuda_tile.exp2, cuda_tile.log2, cuda_tile.sin, cuda_tile.cos, cuda_tile.tanh) rewrite to nv_tileaa counterparts but additionally attach the fastmath flag derived from the source op's attribute dictionary. The flag controls whether downstream lowering selects the __nv_* precise libdevice variant or the __nv_fast_* approximate variant.
Tokens and Atomics
Token-aware operations stay explicit in the IR rather than collapsing immediately to NVVM. Loads, stores, atomic compare-and-swap, atomic read-modify-write, token creation, and token join all become nv_tileaa operations that still expose memory dependences. The downstream scheduler and async-pipeline passes reason about those dependences before LLVM/NVVM lowering flattens tokens into integers.
%t = cuda_tile.token.join [%t0, %t1, %t2] : !cuda_tile.token
↓
%t = nv_tileaa.join_mem_token [%t0, %t1, %t2] : !nv_tileaa.mem_token
Singleton joins skip the join_mem_token op and pass the single token through unchanged; empty joins lower to nv_tileaa.create_mem_token (with empty operand list), the same producer the downstream nv_tileas.async.pipeline.create_null_token later consumes, so every downstream op still has a token operand to consume.
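The three arity cases can be summarised as a small decision function. This is a hedged sketch of the rule stated above, not the pattern's source: `JoinLowering` and `lowerTokenJoin` are hypothetical, SSA values are plain strings, and `freshName` stands for whatever result name the rewriter would assign.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Outcome of lowering one token join in this toy model.
struct JoinLowering {
    std::string op;                     // op to emit, or "" when none is needed
    std::vector<std::string> operands;  // token operands of the emitted op
    std::string result;                 // SSA name that carries the joined token
};

JoinLowering lowerTokenJoin(const std::vector<std::string> &tokens,
                            const std::string &freshName) {
    if (tokens.empty())  // empty join: materialise a fresh token
        return {"nv_tileaa.create_mem_token", {}, freshName};
    if (tokens.size() == 1)  // singleton join: forward the only token
        return {"", {}, tokens.front()};
    return {"nv_tileaa.join_mem_token", tokens, freshName};  // general case
}
```

Every branch returns a usable `result`, which reflects the stated goal that each downstream op still has a token operand to consume.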
Pipeline Handoff
The pass establishes the alias and view shapes that warp-specialized producer/consumer rewriting relies on later, but assigns no final layouts. It keeps enough structure around load/store views, atomic-token operations, and tensor partitions for TileAS layout assignment to insert nv_tileas.view and nv_tileas.convert_layout at producer and consumer boundaries. The invariant: a view produced here must still identify the same memory object, shape, layout intent, and token ordering when it reaches TileAS layout assignment.
Failure Modes
The pass fails with a user-facing diagnostic when:
- compute capability is missing or malformed ("invalid or missing --compute-capability option");
- partial conversion leaves a residual cuda_tile.* op ("failed to convert cuda_tile to nv_tileaa");
- a type materialisation cannot bridge a value across the boundary;
- a region rewrite would produce mismatched block arguments or terminators.
Cross-References
Conversion / Lowering Overview describes this pass's position in the four-stage cascade. Shared LLVM Type Converter documents the shared LLVM type converter that the materialiser triple here registers into. TileAA to TileAS is the next lowering stage; the CopyAtom and ReduceAtom witnesses attached there preserve information this pass made explicit. cuda_tile Op Roster lists the lowering-arm classification per family that the populator order in this pass reflects; nv_tileaa Op Roster — Memory Effects gives the operand and attribute tables for the token-aware operations the singleton adders produce. DSL to PTX End-to-End — Stage 1: cuda_tile IR and Stage 2: nv_tileaa IR trace a single GEMM kernel across the boundary this pass enforces, showing the IR shape on either side for a representative cuda_tile.mmaf op.