Pattern Sets and Type Conversion

Abstract

Tileiras lowering rides on ordinary MLIR dialect conversion: a conversion target declares what is legal, a type converter defines ABI shape, and a rewrite pattern set rewrites illegal operations until the target accepts the module. The public contract is short — every lowering stage must build a complete pattern set, and one coherent LLVM type converter must span TileAA, TileAS, CuTe, NVGPU, and kernel function boundaries. Two passes that disagree about descriptor shape or address-space numbering produce a module that verifies but generates wrong PTX.

This page is the deep reference for that contract: the kinds of pattern objects each stage installs, the type-conversion rules every stage shares, the runtime descriptor shapes later passes assume, and the two anchor patterns — the 43-instantiation arith bank and the kernel-attribute-aware shared LLVM type converter — that the per-stage pages refer back to instead of re-documenting in place.

Pattern Categories

Every pattern a lowering stage installs belongs to one of four categories. The choice depends on whether the source op rewrites to a single destination op, expands to a small fixed sequence, or requires a region-rewriting transform.

Category	When to use
Generic one-to-one (`GenericOpPattern<SourceOp>`)	Source op and target op are semantically equivalent; converter handles operand types and the rewrite is a same-name, different-dialect copy. The arith bank below is the canonical example.
Dedicated `OpConversionPattern<SourceOp>`	Region surgery, witness-attribute reads (CopyAtom, ReduceAtom), compute-capability gates, descriptor packing, inline assembly, or any rewrite that emits more than one target op.
One-to-N pattern	Async-pipeline operations that produce multiple SSA values with distinct LLVM types and need `applyPartialOneToNConversion` rather than the standard partial-conversion engine.
Cleanup pattern	Runs after the dialect boundary has moved. Removes `builtin.unrealized_conversion_cast`, folds residual `arith` and `math` ops the dialect-conversion target has now made legal.

The dialect-conversion engine treats all four categories identically — they differ only in what they do inside matchAndRewrite.

Shared LLVM Type Converter

The Tileiras LLVM type converter is an ABI object. Its job is to map every type the lowering stages produce — TileAA memref, TileAS tiled view, async/memory/pipeline tokens, CuTe layout and atom types, function signatures, and the upstream LLVM types — to a single canonical LLVM dialect representation. The dispatcher walks an ordered list of addConversion callbacks and returns the first match.

Source concept	Converted representation
integer, index, float	LLVM scalar with target width and element semantics preserved
vector	LLVM vector of converted element type
function type	function type with converted arguments and results
ranked memref	LLVM memref descriptor unless a bare-pointer ABI rule applies
unranked memref	`{rank, erased_descriptor_pointer}`
TileAA or TileAS memref	same descriptor family as ranked memref, with Tileiras address space
CuTe memref	descriptor compatible with CuTe layout lowering
TileAA and TileAS tiled view	small struct containing base pointer and packed layout metadata
async, memory, producer, and consumer tokens	`i32`
CuTe layout, shape, stride, swizzle, and atom types	LLVM structs or integers consumed by CuTe lowering
tuple and none	LLVM struct or empty marker as required by the operation
LLVM pointer or LLVM struct	identity conversion
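
The first-match dispatch described above can be sketched without any MLIR dependency. This is a minimal model, not the Tileiras implementation: `OrderedConverter`, `makeSketchConverter`, and the string-based `TypeName` are hypothetical names introduced for illustration, and only two rows of the table are modelled. Upstream MLIR's `TypeConverter` consults the most recently registered conversion first; the sketch uses the front-to-back order this page describes.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a type: just its printed form.
using TypeName = std::string;
// A conversion callback returns the converted type, or nullopt to decline.
using Callback = std::function<std::optional<TypeName>(const TypeName &)>;

class OrderedConverter {
 public:
  void addConversion(Callback cb) { callbacks_.push_back(std::move(cb)); }

  // Walk the callbacks in registration order; the first one that answers wins.
  std::optional<TypeName> convert(const TypeName &t) const {
    for (const Callback &cb : callbacks_)
      if (auto converted = cb(t)) return converted;
    return std::nullopt;  // no callback recognised the type
  }

 private:
  std::vector<Callback> callbacks_;
};

// Builds a converter covering two rows of the table: tokens and LLVM pointers.
inline OrderedConverter makeSketchConverter() {
  OrderedConverter c;
  c.addConversion([](const TypeName &t) -> std::optional<TypeName> {
    if (t == "async_token" || t == "memory_token") return TypeName("i32");
    return std::nullopt;
  });
  c.addConversion([](const TypeName &t) -> std::optional<TypeName> {
    if (t == "!llvm.ptr") return t;  // identity conversion
    return std::nullopt;
  });
  return c;
}
```

A type that no callback recognises yields an empty result, which is the condition the signature-conversion diagnostic reports.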

Function-signature conversion is the one place the converter must do more than per-type translation: the bare-pointer kernel ABI demands that ranked memref arguments lower to a single aligned pointer plus separately carried launch metadata, not the full descriptor. When convertFunctionSignature fails, the converter emits "failed to convert function signature type for: " (the trailing space is preserved verbatim) followed by the printed form of the offending type. Downstream regression suites grep for this string; the wording is fixed.
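
The failure path can be illustrated with a small string-based model. Everything here except the diagnostic text is an assumption made for the sketch: `convertArg`, `convertSignature`, and `SignatureResult` are hypothetical names, the set of accepted scalar types is arbitrary, and the separately carried launch metadata is left out.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical result of converting a kernel signature.
struct SignatureResult {
  std::vector<std::string> args;  // converted argument types
  std::string error;              // empty on success
};

// Per-argument rule under the bare-pointer ABI: a ranked memref collapses to
// one pointer, a few scalars pass through, anything else is unrecognised.
inline std::optional<std::string> convertArg(const std::string &t) {
  if (t.rfind("memref<", 0) == 0) return std::string("!llvm.ptr");
  if (t == "i32" || t == "i64" || t == "f32") return t;
  return std::nullopt;
}

inline SignatureResult convertSignature(const std::vector<std::string> &in) {
  SignatureResult r;
  for (const std::string &t : in) {
    auto converted = convertArg(t);
    if (!converted) {
      // Diagnostic text quoted from this page; note the trailing space
      // before the printed type.
      r.error = "failed to convert function signature type for: " + t;
      r.args.clear();  // never hand back a half-converted signature
      return r;
    }
    r.args.push_back(*converted);
  }
  return r;
}
```

Calling `convertSignature({"memref<8x64xf32>", "i32"})` yields `!llvm.ptr` and `i32`; an unrecognised argument produces the fixed message with the offending type appended.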

Descriptor Layouts

The ranked memref descriptor follows the standard LLVM dialect shape:

struct RankedMemRefDescriptor<T, int Rank, int AddressSpace> {
    T addrspace(AddressSpace) *allocated;
    T addrspace(AddressSpace) *aligned;
    int64_t offset;
    int64_t sizes[Rank];
    int64_t strides[Rank];
};
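
As an illustration of how this descriptor is consumed, the following host-side sketch computes an element address the way the standard strided-memref convention does: through the aligned pointer, at `offset` plus the dot product of indices and strides. The names `linearIndex` and `elementAt` are hypothetical, and address spaces are dropped because plain C++ pointers cannot express them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the descriptor above, without address spaces.
template <typename T, std::size_t Rank>
struct RankedMemRefDescriptor {
  T *allocated;  // pointer returned by the allocator
  T *aligned;    // pointer used for element access
  std::int64_t offset;
  std::array<std::int64_t, Rank> sizes;
  std::array<std::int64_t, Rank> strides;
};

// Linear element index: offset + sum(index[i] * strides[i]).
template <typename T, std::size_t Rank>
std::int64_t linearIndex(const RankedMemRefDescriptor<T, Rank> &d,
                         const std::array<std::int64_t, Rank> &index) {
  std::int64_t linear = d.offset;
  for (std::size_t i = 0; i < Rank; ++i) {
    assert(index[i] >= 0 && index[i] < d.sizes[i] && "index out of bounds");
    linear += index[i] * d.strides[i];
  }
  return linear;
}

// Element access always goes through the aligned pointer.
template <typename T, std::size_t Rank>
T &elementAt(const RankedMemRefDescriptor<T, Rank> &d,
             const std::array<std::int64_t, Rank> &index) {
  return d.aligned[linearIndex(d, index)];
}
```

For a row-major 8x64 buffer the strides are `{64, 1}`, so index `{3, 5}` resolves to element `3 * 64 + 5`.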

The tiled-view descriptor is compact because tiled load/store patterns and descriptor builders both consume it.

struct TiledViewDescriptor<T, int AddressSpace> {
    T addrspace(AddressSpace) *base;
    uint32_t swizzle_encoding;
    uint32_t tile_dim0;
    uint32_t tile_dims1_to3[3];
};

Tokens are deliberately narrow. A producer/consumer token is not a pointer to runtime storage — it is an integer phase value. The low bit carries the parity consumed by wait operations; higher bits may carry a pipeline slot index.

uint32_t make_pipeline_token(uint32_t slot, bool phase) {
    return (slot << 1) | (phase ? 1u : 0u);
}

uint32_t token_slot(uint32_t token) {
    return token >> 1;
}

bool token_phase(uint32_t token) {
    return (token & 1u) != 0;
}

Address Spaces

Tileiras keeps memory spaces distinct all the way to LLVM pointers.

Tileiras memory space	LLVM address space	PTX meaning
register memory	0	virtual register values
global memory	1	`.global`
internal memory	2	compiler-internal storage
shared memory	3	`.shared`
constant memory	4	`.const`
local memory	5	`.local`
tensor memory	6	Blackwell tensor memory
generic pointer	101	NVVM generic pointer

Address-space casts must be explicit. The converter rejects implicit transitions that would hide a semantic memory-space change — especially around TMA descriptors, shared-memory barriers, and tensor-memory operations.
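
The numbering and the explicit-cast rule can be captured in a few lines. The enum values are copied from the table above; `requiresExplicitCast`, `acceptImplicit`, and `llvmPointerType` are hypothetical helpers, and the rule shown (any change of space needs an explicit cast) is a simplifying assumption rather than the converter's full policy.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Numbering copied from the address-space table on this page.
enum class MemorySpace : std::uint32_t {
  Register = 0,
  Global = 1,
  Internal = 2,
  Shared = 3,
  Constant = 4,
  Local = 5,
  Tensor = 6,
  Generic = 101,
};

// Prints the LLVM dialect pointer type for a space; address space 0 is the
// default and is printed without a suffix.
inline std::string llvmPointerType(MemorySpace space) {
  const auto n = static_cast<std::uint32_t>(space);
  return n == 0 ? std::string("!llvm.ptr")
                : "!llvm.ptr<" + std::to_string(n) + ">";
}

// Assumed rule for the sketch: every change of memory space must be spelled
// out as an explicit address-space cast.
inline bool requiresExplicitCast(MemorySpace from, MemorySpace to) {
  return from != to;
}

// Models the converter refusing an implicit transition between spaces.
inline bool acceptImplicit(MemorySpace from, MemorySpace to) {
  return !requiresExplicitCast(from, to);
}
```

Under this model a shared-memory pointer is never silently reinterpreted as a generic one; the transition has to appear in the IR as its own cast operation.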

Materialization Hooks

Partial conversion sometimes needs a bridge value while only part of the IR has been lowered. The converter installs source and target materialization hooks that both produce builtin.unrealized_conversion_cast. The cast-reconciliation phase (phase 8 of ConvertTileASToLLVM body conversion) erases them once every participating operation has converted. Bridges that survive the final cleanup pass must fail the module rather than reach LLVM translation.
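
A toy model of the reconciliation step may help. It assumes a bridge is fully described by its input value, result value, and the two types, and that a pair of bridges cancels when the second undoes the first on the same value. `BridgeCast`, `reconcile`, and `moduleIsClean` are hypothetical names; use tracking is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature of a bridge cast: value `input` of type `from`
// is viewed as value `result` of type `to`.
struct BridgeCast {
  int input;
  int result;
  std::string from;
  std::string to;
};

// Repeatedly removes pairs where cast b consumes the result of cast a and
// converts it straight back to a's source type. Returns the survivors.
inline std::vector<BridgeCast> reconcile(std::vector<BridgeCast> casts) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t i = 0; i < casts.size() && !changed; ++i) {
      for (std::size_t j = 0; j < casts.size() && !changed; ++j) {
        if (i == j) continue;
        const BridgeCast &a = casts[i];
        const BridgeCast &b = casts[j];
        if (b.input == a.result && b.from == a.to && b.to == a.from) {
          // Erase the higher index first so the lower one stays valid.
          const std::size_t hi = i > j ? i : j;
          const std::size_t lo = i > j ? j : i;
          casts.erase(casts.begin() + static_cast<std::ptrdiff_t>(hi));
          casts.erase(casts.begin() + static_cast<std::ptrdiff_t>(lo));
          changed = true;
        }
      }
    }
  }
  return casts;
}

// A surviving bridge means the module must be rejected.
inline bool moduleIsClean(const std::vector<BridgeCast> &survivors) {
  return survivors.empty();
}
```

A bridge with no inverse partner survives `reconcile`, and `moduleIsClean` then reports the module as unfit for LLVM translation.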

Pattern Discipline

Generic one-to-one patterns must not inspect target hardware: they are benefit-1 fallbacks that the engine reaches only when no specialist matches. Dedicated patterns own every target-feature check they introduce, so the legality of an SM100-only rewrite is visible at the point where the rewrite happens. Region-rewriting patterns convert block-argument types and terminators in the same step; otherwise the parent op fails verification because its signature no longer matches its region. One-to-N async-pipeline patterns run only after scheduling and layout assignment have made the pipeline structure explicit; running them earlier would split tokens before the scheduler can reason about them. Cleanup patterns leave memory-ordering operations alone unless the op is outside the memory-consistency interface.
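
The fallback relationship between generic and specialist patterns comes down to benefit ordering. The sketch below assumes the driver tries higher-benefit patterns first and takes the first that matches; `Pattern`, `selectPattern`, and `makeSketchPatterns` are hypothetical, and the catch-all generic matcher is deliberately broader than the real bank so that the pre-emption is visible.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct Pattern {
  std::string name;
  unsigned benefit;  // higher benefit is tried first
  std::function<bool(const std::string &)> matches;
};

// Returns the name of the pattern that would fire for `op`, or "" if none.
inline std::string selectPattern(std::vector<Pattern> patterns,
                                 const std::string &op) {
  // Stable sort keeps registration order among equal-benefit patterns.
  std::stable_sort(patterns.begin(), patterns.end(),
                   [](const Pattern &a, const Pattern &b) {
                     return a.benefit > b.benefit;
                   });
  for (const Pattern &p : patterns)
    if (p.matches(op)) return p.name;
  return "";
}

// Hypothetical two-pattern set: a benefit-1 generic and a benefit-20 specialist.
inline std::vector<Pattern> makeSketchPatterns() {
  return {
      {"GenericOpPattern", 1,
       [](const std::string &op) { return op.rfind("arith.", 0) == 0; }},
      {"ConstantTensorOpConversion", 20,
       [](const std::string &op) { return op == "arith.constant"; }},
  };
}
```

With this set, `arith.constant` resolves to the specialist even though the generic pattern was registered first, while `arith.addf` falls through to the generic.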

The 43-Instantiation Arith Bank

The TileAA-to-TileAS arith populator installs 43 instantiations of a single class template, one per arith op that has a same-named TileAS counterpart. Every instantiation derives from OpConversionPattern<SourceOp> and overrides matchAndRewrite to do the generic same-name remap.

template <typename SourceOp>
class GenericOpPattern : public mlir::OpConversionPattern<SourceOp> {
 public:
  // Inherit the base-class constructors.
  using mlir::OpConversionPattern<SourceOp>::OpConversionPattern;
  using OpAdaptor = typename SourceOp::Adaptor;
  // Builds the same-named TileAS op from the converted operands.
  mlir::LogicalResult matchAndRewrite(
      SourceOp op, OpAdaptor adaptor,
      mlir::ConversionPatternRewriter &rw) const override;
};

Each rewrite has the shape:

%r = arith.addf %a, %b : tensor<8x64xf32>
   ↓
%r = nv_tileas.addf %a, %b : tensor<8x64xf32>

The pattern reads operand types through the shared type converter, builds the destination op with the converted operands, copies semantic attributes (rounding mode, fast-math flags, predicate kind, overflow flags), and replaces the source. Any failure in operand or result-type conversion bubbles up as a pattern failure rather than producing a partial replacement.
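
The remap steps just listed can be modelled on a plain struct. This is a sketch under stated assumptions: `Op`, `convertType`, and `remapToTileAS` are hypothetical, types and attributes are strings, and the toy converter accepts only tensor types. It shows the all-or-nothing behaviour: a failed type conversion returns no op at all.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical flattened view of an operation.
struct Op {
  std::string name;  // e.g. "arith.addf"
  std::vector<std::string> operandTypes;
  std::string resultType;
  std::map<std::string, std::string> attrs;  // e.g. fastmath flags
};

// Toy converter: tensor types are returned unchanged, everything else fails.
inline std::optional<std::string> convertType(const std::string &t) {
  if (t.rfind("tensor<", 0) == 0) return t;
  return std::nullopt;
}

// Same-name remap: swap the dialect prefix, convert every type, copy the
// attributes. Any conversion failure aborts before a replacement exists.
inline std::optional<Op> remapToTileAS(const Op &src) {
  const std::string prefix = "arith.";
  if (src.name.compare(0, prefix.size(), prefix) != 0) return std::nullopt;

  Op dst;
  dst.name = "nv_tileas." + src.name.substr(prefix.size());
  for (const std::string &t : src.operandTypes) {
    auto converted = convertType(t);
    if (!converted) return std::nullopt;
    dst.operandTypes.push_back(*converted);
  }
  auto result = convertType(src.resultType);
  if (!result) return std::nullopt;
  dst.resultType = *result;
  dst.attrs = src.attrs;  // rounding mode, fast-math flags, and so on
  return dst;
}
```

Returning an empty optional corresponds to a pattern failure: the source op stays in place and the driver is free to try another pattern.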

Op-Mnemonic Roster

The table lists the 43 mnemonics in registration order. The order matches the source-level patterns.add<...>(typeConverter, context) argument list and is reproducible by walking the populator's body in linker-emission order.

#	Mnemonic	Op class	#	Mnemonic	Op class
1	`arith.cmpf`	`CmpFOp`	23	`arith.minnumf`	`MinNumFOp`
2	`arith.cmpi`	`CmpIOp`	24	`arith.minsi`	`MinSIOp`
3	`arith.addf`	`AddFOp`	25	`arith.minui`	`MinUIOp`
4	`arith.addi`	`AddIOp`	26	`arith.mulf`	`MulFOp`
5	`arith.andi`	`AndIOp`	27	`arith.muli`	`MulIOp`
6	`arith.bitcast`	`BitcastOp`	28	`arith.negf`	`NegFOp`
7	`arith.ceildivsi`	`CeilDivSIOp`	29	`arith.ori`	`OrIOp`
8	`arith.ceildivui`	`CeilDivUIOp`	30	`arith.remf`	`RemFOp`
9	`arith.divf`	`DivFOp`	31	`arith.remsi`	`RemSIOp`
10	`arith.divsi`	`DivSIOp`	32	`arith.remui`	`RemUIOp`
11	`arith.divui`	`DivUIOp`	33	`arith.select`	`SelectOp`
12	`arith.extf`	`ExtFOp`	34	`arith.shli`	`ShLIOp`
13	`arith.extsi`	`ExtSIOp`	35	`arith.shrsi`	`ShRSIOp`
14	`arith.extui`	`ExtUIOp`	36	`arith.shrui`	`ShRUIOp`
15	`arith.floordivsi`	`FloorDivSIOp`	37	`arith.sitofp`	`SIToFPOp`
16	`arith.fptosi`	`FPToSIOp`	38	`arith.subf`	`SubFOp`
17	`arith.fptoui`	`FPToUIOp`	39	`arith.subi`	`SubIOp`
18	`arith.maximumf`	`MaximumFOp`	40	`arith.truncf`	`TruncFOp`
19	`arith.maxnumf`	`MaxNumFOp`	41	`arith.trunci`	`TruncIOp`
20	`arith.maxsi`	`MaxSIOp`	42	`arith.uitofp`	`UIToFPOp`
21	`arith.maxui`	`MaxUIOp`	43	`arith.xori`	`XOrIOp`
22	`arith.minimumf`	`MinimumFOp`

The Benefit-20 Specialist for `arith.constant`

arith.constant is absent from the bank because it does not route through the generic same-name remap. A hand-written ConstantTensorOpConversion registers separately with PatternBenefit(20), which pre-empts any default-benefit pattern that might otherwise match a constant op. The specialist inspects the constant's Attribute payload and synthesises one of three destinations:

arith.constant dense<1.0> : tensor<8x64xf32>          // SplatElementsAttr branch
   ↓
%r = nv_tileaa.splat 1.0 : tensor<8x64xf32>

arith.constant dense<[1.0, 2.0, ...]> : tensor<32xf32>   // DenseElementsAttr branch
   ↓
%r = nv_tileaa.constant_tensor dense<[1.0, 2.0, ...]> : tensor<32xf32>

Scalar IntegerAttr and FloatAttr payloads keep the same arith.constant form, since arith constants are dynamically legal in the target dialect at scalar types. The decision is structural, not numeric — the specialist branches on attribute kind, not on attribute value.
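
The structural decision can be sketched with a variant over attribute kinds. The three payload structs and `destinationOp` are hypothetical stand-ins; only the destination op names come from this page. In upstream MLIR `SplatElementsAttr` is a subclass of `DenseElementsAttr`, so a real implementation would need to test the splat case first; the variant sidesteps that by making the kinds disjoint.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the three attribute kinds.
struct SplatAttr { double value; };                 // dense<1.0>
struct DenseAttr { std::vector<double> values; };   // dense<[1.0, 2.0, ...]>
struct ScalarAttr { double value; };                // IntegerAttr / FloatAttr

using ConstantPayload = std::variant<SplatAttr, DenseAttr, ScalarAttr>;

// Chooses the destination op from the attribute kind alone; the numeric
// payload is never inspected.
inline std::string destinationOp(const ConstantPayload &payload) {
  struct Visitor {
    std::string operator()(const SplatAttr &) const {
      return "nv_tileaa.splat";
    }
    std::string operator()(const DenseAttr &) const {
      return "nv_tileaa.constant_tensor";
    }
    std::string operator()(const ScalarAttr &) const {
      return "arith.constant";  // scalar constants stay as they are
    }
  };
  return std::visit(Visitor{}, payload);
}
```

Two splats with different values take the same branch, which is the sense in which the decision is structural rather than numeric.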

Parent Driver

The TileAA-to-TileAS pattern bank composes four sub-populators plus six hand-written specialists inlined into the parent body. Sub-populator order is fixed.

Sub-populator	Patterns
`nv_tileaa` structure	`nv_tileaa.block_tile`, `nv_tileaa.make_memref`, `nv_tileaa.get_dim_size`
`nv_tileaa` bitcast/ptr	`nv_tileaa.bitcast`, `nv_tileaa.ptr_to_int`, `nv_tileaa.int_to_ptr`
`func` dialect	`func.func`, `func.call`, `func.return`
arith generic bank	the 43 `GenericOpPattern<arith::*>` instantiations above
inline specialists	`MakeTiledTMADescOpHostConversion`, `ConstantTensorOpConversion`, `AddPtrOpConversion`, `SplatOpHostConversion`, `AssumeOpConversion`, `ExtractOpHostConversion`

The parent driver builds the ConversionTarget marking llvm, cute, cute_nvgpu, builtin, and vector as fully legal; registers arith with a dynamic legality predicate that returns true once an op already has TileAS-form operands; and marks nv_tileaa and nv_tileas as legal. A failed partial conversion emits the diagnostic "expect lower MakeTiledTMADescOp".

PDL Fallback

Every Convert*ToLLVM pass runs a PDL-to-PDLInterp fallback immediately before applyPartialConversion. The fallback walks the PDL pattern modules registered with the active RewritePatternSet, compiles each module down to PDL Interpreter operations, and hands the resulting interpreter bodies to the conversion driver alongside the C++ patterns. The compiled PDL Interpreter bodies live as fixed bytecode in the binary's read-only data; the fallback's job is to materialize them as runnable patterns at pass time, so the on-disk PDL pattern is essentially an interpreter program ready to be wired into the match-and-rewrite loop.

When PDL compilation fails — typically because a registered pattern references an op or attribute the interpreter cannot resolve in the current dialect registry — the fallback emits "failed to lower PDL pattern module to the PDL Interpreter" and returns failure. The parent driver treats this as a hard pass failure rather than a recoverable miss: surviving without the PDL-side patterns would silently change which ops the conversion target deems illegal, producing a module that compiles but mishandles the pattern's intended rewrite.
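
The hard-failure policy can be shown with a simplified control-flow model. It assumes compilation fails exactly when a module references an op missing from the registry; `PdlModule`, `FallbackResult`, and `runPdlFallback` are hypothetical, and only the diagnostic string is taken from this page.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of a registered PDL pattern module.
struct PdlModule {
  std::string name;
  std::vector<std::string> referencedOps;
};

struct FallbackResult {
  bool succeeded;
  std::string diagnostic;             // empty on success
  std::vector<std::string> compiled;  // modules ready for the driver
};

// Compiles every module or none: one unresolved op fails the whole pass.
inline FallbackResult runPdlFallback(const std::vector<PdlModule> &modules,
                                     const std::set<std::string> &registry) {
  FallbackResult r{true, "", {}};
  for (const PdlModule &m : modules) {
    for (const std::string &op : m.referencedOps) {
      if (registry.count(op) == 0) {
        // Diagnostic text quoted from this page. Nothing partially compiled
        // is handed on to the conversion driver.
        return {false,
                "failed to lower PDL pattern module to the PDL Interpreter",
                {}};
      }
    }
    r.compiled.push_back(m.name);
  }
  return r;
}
```

Discarding the modules that did compile is deliberate in this model: continuing with a partial set would change which rewrites are available without any visible error.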

Shared LLVMTypeConverter Contract

The shared LLVMTypeConverter extends the upstream MLIR LLVMTypeConverter with the five tile-extension hooks Tileiras needs and five overridden methods that enforce the kernel-ABI and materialization rules above.

Method	Status	Role
`convertType`	overridden	Dispatches to the tile-extension callbacks first, then falls through to the upstream `LLVMTypeConverter` base for ordinary LLVM types.
`convertCallSignature`	overridden	Enforces the bare-pointer ABI for kernel call sites: ranked memref arguments become aligned pointers plus separately-carried launch metadata.
`convertFunctionSignature`	overridden	Lifts kernel-spec fields onto the converted function and emits `"failed to convert function signature type for: "` (trailing space preserved) when a type is unrecognised.
`materializeSourceConversion`	overridden	Emits `builtin.unrealized_conversion_cast` as the source-side bridge during partial conversion.
`materializeTargetConversion`	overridden	Emits the inverse `builtin.unrealized_conversion_cast` for the target side.
`convertTileType`	tile-extension	Maps `TileType` to a `llvm.struct` payload.
`convertTokenType`	tile-extension	Maps `TokenType` to `i32` for memory/async/pipeline tokens.
`convertPipelineIteratorType`	tile-extension	Maps `PipelineIteratorType` to a small `llvm.struct` carrying stage and phase.
`convertTensorViewType`	tile-extension	Maps `TensorViewType` to the tiled-view descriptor (base pointer plus packed metadata).
`convertPartitionViewType`	tile-extension	Maps `PartitionViewType` to a partition descriptor.

convertType walks an ordered list of addConversion callbacks; the first callback that recognises an incoming type wins. Tile-extension callbacks register before the base LLVM callbacks so the partial-conversion driver sees a single uniform converter rather than fanning out across two different machines for tile-only versus upstream-LLVM types.

Type LLVMTypeConverter::convertType(Type t) {
    if (auto tile     = dyn_cast<TileType>(t))              return convertTileType(tile);
    if (auto tok      = dyn_cast<TokenType>(t))             return convertTokenType(tok);
    if (auto pipeIt   = dyn_cast<PipelineIteratorType>(t))  return convertPipelineIteratorType(pipeIt);
    if (auto view     = dyn_cast<TensorViewType>(t))        return convertTensorViewType(view);
    if (auto part     = dyn_cast<PartitionViewType>(t))     return convertPartitionViewType(part);
    /* ... callbacks for cute / cute_nvgpu / nv_tileaa / arith / scf types ...        */
    return baseLLVMConvertType(t);
}

A Convert*ToLLVM pass must construct exactly one LLVMTypeConverter and thread it through every pattern, every ConversionTarget legality predicate, and every PDL pattern module the fallback later compiles. Constructing two converters in the same pass — for example, one for body patterns and one for cleanup — is the most common way to silently break the bare-pointer ABI on function boundaries, because the two converters disagree on whether a ranked memref kernel argument is a pointer or a descriptor.

Cross-References

Conversion / Lowering Overview shows where each pass that uses this infrastructure sits in the cascade. cuda_tile to TileAA, TileAA to TileAS, TileAS to LLVM, CuTe and CuTe-NVGPU to LLVM, and nvgpu / gpu to NVVM are the five passes that install patterns through the bank and thread the shared type converter described here.
