Pattern Sets and Type Conversion

Abstract

Tileiras lowering rides on ordinary MLIR dialect conversion: a conversion target declares what is legal, a type converter defines ABI shape, and a rewrite pattern set rewrites illegal operations until the target accepts the module. The public contract is short — every lowering stage must build a complete pattern set, and one coherent LLVM type converter must span TileAA, TileAS, CuTe, NVGPU, and kernel function boundaries. Two passes that disagree about descriptor shape or address-space numbering produce a module that verifies but generates wrong PTX.

This page is the deep reference for that contract: the kinds of pattern objects each stage installs, the type-conversion rules every stage shares, the runtime descriptor shapes later passes assume, and the two anchor patterns — the 43-instantiation arith bank and the kernel-attribute-aware shared LLVM type converter — that the per-stage pages refer back to instead of re-documenting in place.

Pattern Categories

Every pattern a lowering stage installs falls into one of four categories. Which category applies depends on whether the source op rewrites to a single destination op, expands to a small fixed sequence, or needs a region-rewriting transform.

| Category | When to use |
|---|---|
| Generic one-to-one (GenericOpPattern<SourceOp>) | Source op and target op are semantically equivalent; converter handles operand types and the rewrite is a same-name, different-dialect copy. The arith bank below is the canonical example. |
| Dedicated OpConversionPattern<SourceOp> | Region surgery, witness-attribute reads (CopyAtom, ReduceAtom), compute-capability gates, descriptor packing, inline assembly, or any rewrite that emits more than one target op. |
| One-to-N pattern | Async-pipeline operations that produce multiple SSA values with distinct LLVM types and need applyPartialOneToNConversion rather than the standard partial-conversion engine. |
| Cleanup pattern | Runs after the dialect boundary has moved. Removes builtin.unrealized_conversion_cast, folds residual arith and math ops the dialect-conversion target has now made legal. |

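To the dialect-conversion engine a pattern is a pattern: the categories differ mainly in what they do inside matchAndRewrite. The one exception is the One-to-N category, which runs under applyPartialOneToNConversion rather than the standard partial-conversion driver.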

Shared LLVM Type Converter

The Tileiras LLVM type converter is an ABI object. Its job is to map every type the lowering stages produce — TileAA memref, TileAS tiled view, async/memory/pipeline tokens, CuTe layout and atom types, function signatures, and the upstream LLVM types — to a single canonical LLVM dialect representation. The dispatcher walks an ordered list of addConversion callbacks and returns the first match.

| Source concept | Converted representation |
|---|---|
| integer, index, float | LLVM scalar with target width and element semantics preserved |
| vector | LLVM vector of converted element type |
| function type | function type with converted arguments and results |
| ranked memref | LLVM memref descriptor unless a bare-pointer ABI rule applies |
| unranked memref | {rank, erased_descriptor_pointer} |
| TileAA or TileAS memref | same descriptor family as ranked memref, with Tileiras address space |
| CuTe memref | descriptor compatible with CuTe layout lowering |
| TileAA and TileAS tiled view | small struct containing base pointer and packed layout metadata |
| async, memory, producer, and consumer tokens | i32 |
| CuTe layout, shape, stride, swizzle, and atom types | LLVM structs or integers consumed by CuTe lowering |
| tuple and none | LLVM struct or empty marker as required by the operation |
| LLVM pointer or LLVM struct | identity conversion |
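
The first-match dispatch described above can be sketched with a toy converter in which types are plain strings. Everything here (ToyTypeConverter, makeToyConverter, the string type names) is hypothetical and is not the Tileiras or MLIR API. The sketch consults callbacks in insertion order because that is how this page describes the dispatcher; upstream MLIR's TypeConverter consults the most recently added callback first, so treat the ordering as an assumption.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a type converter: types are plain strings (hypothetical).
class ToyTypeConverter {
 public:
  using Callback =
      std::function<std::optional<std::string>(const std::string &)>;

  // Assumption of this sketch: callbacks are consulted in insertion order.
  void addConversion(Callback cb) { callbacks_.push_back(std::move(cb)); }

  // Returns the first successful conversion, or nullopt if nothing matches.
  std::optional<std::string> convertType(const std::string &type) const {
    for (const auto &cb : callbacks_) {
      if (auto converted = cb(type)) return converted;
    }
    return std::nullopt;
  }

 private:
  std::vector<Callback> callbacks_;
};

// Builds a converter with two of the rules from the table above.
ToyTypeConverter makeToyConverter() {
  ToyTypeConverter tc;
  // All token kinds collapse to i32.
  tc.addConversion([](const std::string &t) -> std::optional<std::string> {
    if (t == "async_token" || t == "memory_token" ||
        t == "producer_token" || t == "consumer_token")
      return std::string("i32");
    return std::nullopt;
  });
  // LLVM pointers are already in target form: identity conversion.
  tc.addConversion([](const std::string &t) -> std::optional<std::string> {
    if (t == "!llvm.ptr") return t;
    return std::nullopt;
  });
  return tc;
}
```

A type that no callback recognises yields an empty optional, which a real pass would surface as a conversion failure rather than pass through silently.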

Function-signature conversion is the one place the converter must do more than per-type translation: the bare-pointer kernel ABI demands that ranked memref arguments lower to a single aligned pointer plus separately carried launch metadata, not the full descriptor. When convertFunctionSignature fails, the converter emits "failed to convert function signature type for: " (the trailing space is preserved verbatim) followed by the printed form of the offending type. Downstream regression suites grep for this string; the wording is fixed.
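
A minimal sketch of that failure path, assuming a toy model in which argument types are strings and any type starting with "memref<" is a ranked memref. The helper names (convertKernelArg, convertKernelSignature, SignatureResult) are hypothetical; only the diagnostic text is taken from this page.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Result of converting a kernel signature in this toy model (hypothetical).
struct SignatureResult {
  bool ok;
  std::vector<std::string> args;  // converted argument types when ok
  std::string diagnostic;         // error text when !ok
};

// Bare-pointer rule: a ranked memref argument becomes a single pointer.
std::optional<std::string> convertKernelArg(const std::string &type) {
  if (type.rfind("memref<", 0) == 0) return std::string("!llvm.ptr");
  if (type == "i32" || type == "i64" || type == "f32") return type;
  return std::nullopt;  // unrecognised type
}

SignatureResult convertKernelSignature(const std::vector<std::string> &args) {
  SignatureResult result{true, {}, ""};
  for (const auto &arg : args) {
    auto converted = convertKernelArg(arg);
    if (!converted) {
      // The prefix, including its trailing space, matches the fixed wording.
      return {false, {},
              "failed to convert function signature type for: " + arg};
    }
    result.args.push_back(*converted);
  }
  return result;
}
```

The whole signature fails on the first unconvertible argument, so no partially converted argument list escapes.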

Descriptor Layouts

The ranked memref descriptor follows the standard LLVM dialect shape:

struct RankedMemRefDescriptor<T, int Rank, int AddressSpace> {
    T addrspace(AddressSpace) *allocated;
    T addrspace(AddressSpace) *aligned;
    int64_t offset;
    int64_t sizes[Rank];
    int64_t strides[Rank];
};
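
As an illustration of how such a descriptor is consumed, the sketch below computes an element address using the standard strided-memref rule: aligned base plus offset plus the sum of index times stride. ToyRankedDescriptor and elementAddress are hypothetical names, and the address-space qualifier is dropped because host C++ has no equivalent.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirror of the descriptor fields (hypothetical, no address space).
template <typename T, std::size_t Rank>
struct ToyRankedDescriptor {
  T *allocated;
  T *aligned;
  std::int64_t offset;
  std::array<std::int64_t, Rank> sizes;
  std::array<std::int64_t, Rank> strides;
};

// Strided addressing: aligned + offset + sum(idx[i] * strides[i]).
template <typename T, std::size_t Rank>
T *elementAddress(const ToyRankedDescriptor<T, Rank> &desc,
                  const std::array<std::int64_t, Rank> &idx) {
  std::int64_t linear = desc.offset;
  for (std::size_t i = 0; i < Rank; ++i) {
    assert(idx[i] >= 0 && idx[i] < desc.sizes[i]);  // bounds check
    linear += idx[i] * desc.strides[i];
  }
  return desc.aligned + linear;
}
```

For a row-major 8x64 buffer the strides are {64, 1}, so index (3, 5) lands 3 * 64 + 5 elements past the aligned pointer.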

The tiled-view descriptor is compact because tiled load/store patterns and descriptor builders both consume it.

struct TiledViewDescriptor<T, int AddressSpace> {
    T addrspace(AddressSpace) *base;
    uint32_t swizzle_encoding;
    uint32_t tile_dim0;
    uint32_t tile_dims1_to3[3];
};

Tokens are deliberately narrow. A producer/consumer token is not a pointer to runtime storage — it is an integer phase value. The low bit carries the parity consumed by wait operations; higher bits may carry a pipeline slot index.

uint32_t make_pipeline_token(uint32_t slot, bool phase) {
    return (slot << 1) | (phase ? 1u : 0u);
}

uint32_t token_slot(uint32_t token) {
    return token >> 1;
}

bool token_phase(uint32_t token) {
    return (token & 1u) != 0;
}
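
One way such a token might be advanced around a ring of pipeline stages is sketched below. The three helpers are repeated from above so the block compiles on its own; advance_token and its flip-phase-on-wrap rule are assumptions for illustration, since this page only says the higher bits may carry a slot index.

```cpp
#include <cassert>
#include <cstdint>

// Repeated from the page so the block is self-contained.
std::uint32_t make_pipeline_token(std::uint32_t slot, bool phase) {
  return (slot << 1) | (phase ? 1u : 0u);
}
std::uint32_t token_slot(std::uint32_t token) { return token >> 1; }
bool token_phase(std::uint32_t token) { return (token & 1u) != 0; }

// Hypothetical helper: step to the next slot of a ring with num_slots stages,
// flipping the phase parity each time the slot index wraps to zero.
std::uint32_t advance_token(std::uint32_t token, std::uint32_t num_slots) {
  assert(num_slots > 0);
  std::uint32_t slot = token_slot(token) + 1;
  bool phase = token_phase(token);
  if (slot >= num_slots) {
    slot = 0;
    phase = !phase;
  }
  return make_pipeline_token(slot, phase);
}
```

With three slots, advancing the token for slot 2 at phase 0 wraps to slot 0 at phase 1, so a waiter can tell the new pass over the ring from the previous one by parity alone.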

Address Spaces

Tileiras keeps memory spaces distinct all the way to LLVM pointers.

| Tileiras memory space | LLVM address space | PTX meaning |
|---|---|---|
| register memory | 0 | virtual register values |
| global memory | 1 | .global |
| internal memory | 2 | compiler-internal storage |
| shared memory | 3 | .shared |
| constant memory | 4 | .const |
| local memory | 5 | .local |
| tensor memory | 6 | Blackwell tensor memory |
| generic pointer | 101 | NVVM generic pointer |

Address-space casts must be explicit. The converter rejects implicit transitions that would hide a semantic memory-space change — especially around TMA descriptors, shared-memory barriers, and tensor-memory operations.
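
The explicit-cast rule can be sketched with a toy pointer that carries only its address space. The enum values are copied from the table above; ToyPointer, useAs, and addrSpaceCast are hypothetical names, not converter API.

```cpp
#include <cassert>
#include <optional>

// Numbering taken from the address-space table on this page.
enum class AddrSpace : unsigned {
  Register = 0,
  Global = 1,
  Internal = 2,
  Shared = 3,
  Constant = 4,
  Local = 5,
  Tensor = 6,
  Generic = 101,
};

// Toy pointer: only the address space matters for this sketch.
struct ToyPointer {
  AddrSpace space;
};

// Implicit use: accepted only when the spaces already agree.
std::optional<ToyPointer> useAs(ToyPointer p, AddrSpace expected) {
  if (p.space != expected) return std::nullopt;  // reject hidden transition
  return p;
}

// Explicit cast: the only way to change the space in this model.
ToyPointer addrSpaceCast(ToyPointer p, AddrSpace to) {
  p.space = to;
  return p;
}
```

A shared-memory pointer handed to a consumer that expects a generic pointer is rejected until the caller inserts the cast, which keeps the transition visible in the IR.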

Materialization Hooks

Partial conversion sometimes needs a bridge value while only part of the IR has been lowered. The converter installs source and target materialization hooks that both produce builtin.unrealized_conversion_cast. The cast-reconciliation phase (phase 8 of ConvertTileASToLLVM body conversion) erases them once every participating operation has converted. Bridges that survive the final cleanup pass must fail the module rather than reach LLVM translation.
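
A toy version of the reconciliation idea: along one value chain, a cast from A to B followed by a cast from B back to A cancels, and anything left over marks the module as unclean. ToyCast, reconcileCasts, and moduleIsClean are hypothetical names, and the real phase works on SSA use-def chains rather than a flat list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One bridge cast along a value chain, with types modelled as strings.
struct ToyCast {
  std::string from;
  std::string to;
};

// Cancels each cast that is immediately undone by its inverse and returns
// the casts that survive.
std::vector<ToyCast> reconcileCasts(const std::vector<ToyCast> &chain) {
  std::vector<ToyCast> survivors;
  for (const auto &cast : chain) {
    if (!survivors.empty() && survivors.back().from == cast.to &&
        survivors.back().to == cast.from) {
      survivors.pop_back();  // A->B followed by B->A folds away
    } else {
      survivors.push_back(cast);
    }
  }
  return survivors;
}

// A module is clean only if no bridge cast survives reconciliation.
bool moduleIsClean(const std::vector<ToyCast> &chain) {
  return reconcileCasts(chain).empty();
}
```

An unpaired cast survives, and in this model that is the condition under which the module must fail instead of reaching LLVM translation.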

Pattern Discipline

Generic one-to-one patterns must not inspect target hardware — they exist to be benefit-1 fallbacks the engine prefers only when no specialist matches. Dedicated patterns own every target-feature check they introduce, so the legality of an SM100-only rewrite is visible at the point where the rewrite happens. Region-rewriting patterns convert block-argument types and terminators in the same step, or the resulting parent op fails verification with a signature that no longer matches its region. One-to-N async-pipeline patterns run only after scheduling and layout assignment have made the pipeline structure explicit; running them earlier would split tokens before the scheduler can reason about them. Cleanup patterns leave memory-ordering operations alone unless the op is outside the memory-consistency interface.
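
The benefit rule in the first sentence can be sketched as a selection loop that tries patterns from highest to lowest benefit. ToyPattern, selectPattern, and makeToyPatterns are hypothetical; the toy generic pattern deliberately matches every arith op so the pre-emption is visible, whereas the real bank has no arith.constant instantiation.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A pattern reduced to a name, a benefit, and a match predicate.
struct ToyPattern {
  std::string name;
  unsigned benefit;
  std::function<bool(const std::string &)> matches;
};

// Returns the name of the highest-benefit matching pattern, or "" if none.
std::string selectPattern(std::vector<ToyPattern> patterns,
                          const std::string &op) {
  std::stable_sort(patterns.begin(), patterns.end(),
                   [](const ToyPattern &a, const ToyPattern &b) {
                     return a.benefit > b.benefit;
                   });
  for (const auto &pattern : patterns) {
    if (pattern.matches(op)) return pattern.name;
  }
  return "";
}

// One benefit-1 generic fallback and one benefit-20 specialist.
std::vector<ToyPattern> makeToyPatterns() {
  return {
      {"GenericOpPattern", 1,
       [](const std::string &op) { return op.rfind("arith.", 0) == 0; }},
      {"ConstantTensorOpConversion", 20,
       [](const std::string &op) { return op == "arith.constant"; }},
  };
}
```

Registration order does not matter here: the specialist is listed second but is still tried first because of its higher benefit.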

The 43-Instantiation Arith Bank

The TileAA-to-TileAS arith populator installs 43 instantiations of a single class template, one per arith op that has a same-named TileAS counterpart. Every instantiation derives from OpConversionPattern<SourceOp> and overrides matchAndRewrite to perform the generic same-name remap.

template <typename SourceOp>
class GenericOpPattern : public mlir::OpConversionPattern<SourceOp> {
 public:
  using mlir::OpConversionPattern<SourceOp>::OpConversionPattern;
  using OpAdaptor = typename SourceOp::Adaptor;
  LogicalResult matchAndRewrite(SourceOp op, OpAdaptor adaptor,
                                ConversionPatternRewriter &rw) const override;
};

Each rewrite has the shape:

%r = arith.addf %a, %b : tensor<8x64xf32>
   ↓
%r = nv_tileas.addf %a, %b : tensor<8x64xf32>

The pattern reads operand types through the shared type converter, builds the destination op with the converted operands, copies semantic attributes (rounding mode, fast-math flags, predicate kind, overflow flags), and replaces the source. Any failure in operand or result-type conversion bubbles up as a pattern failure rather than producing a partial replacement.
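
At the level of op names, the same-name remap amounts to swapping the dialect prefix. The sketch below shows only that step; remapArithMnemonic is a hypothetical helper, it does not check membership in the 43-op roster, and it leaves out operand conversion and attribute copying.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Maps "arith.<name>" to "nv_tileas.<name>"; returns nullopt for ops the
// generic remap does not handle in this toy model.
std::optional<std::string> remapArithMnemonic(const std::string &name) {
  const std::string prefix = "arith.";
  if (name.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
  // arith.constant goes through its own specialist, not the generic bank.
  if (name == "arith.constant") return std::nullopt;
  return "nv_tileas." + name.substr(prefix.size());
}
```

Returning an empty optional models the pattern-failure path: the caller gets no replacement at all, never a half-built one.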

Op-Mnemonic Roster

The 43 mnemonics, in registration order. The order matches the source-level patterns.add<...>(typeConverter, context) argument list and is reproducible by walking the populator's body in linker-emission order.

| # | Mnemonic | Op class | # | Mnemonic | Op class |
|---|---|---|---|---|---|
| 1 | arith.cmpf | CmpFOp | 23 | arith.minnumf | MinNumFOp |
| 2 | arith.cmpi | CmpIOp | 24 | arith.minsi | MinSIOp |
| 3 | arith.addf | AddFOp | 25 | arith.minui | MinUIOp |
| 4 | arith.addi | AddIOp | 26 | arith.mulf | MulFOp |
| 5 | arith.andi | AndIOp | 27 | arith.muli | MulIOp |
| 6 | arith.bitcast | BitcastOp | 28 | arith.negf | NegFOp |
| 7 | arith.ceildivsi | CeilDivSIOp | 29 | arith.ori | OrIOp |
| 8 | arith.ceildivui | CeilDivUIOp | 30 | arith.remf | RemFOp |
| 9 | arith.divf | DivFOp | 31 | arith.remsi | RemSIOp |
| 10 | arith.divsi | DivSIOp | 32 | arith.remui | RemUIOp |
| 11 | arith.divui | DivUIOp | 33 | arith.select | SelectOp |
| 12 | arith.extf | ExtFOp | 34 | arith.shli | ShLIOp |
| 13 | arith.extsi | ExtSIOp | 35 | arith.shrsi | ShRSIOp |
| 14 | arith.extui | ExtUIOp | 36 | arith.shrui | ShRUIOp |
| 15 | arith.floordivsi | FloorDivSIOp | 37 | arith.sitofp | SIToFPOp |
| 16 | arith.fptosi | FPToSIOp | 38 | arith.subf | SubFOp |
| 17 | arith.fptoui | FPToUIOp | 39 | arith.subi | SubIOp |
| 18 | arith.maximumf | MaximumFOp | 40 | arith.truncf | TruncFOp |
| 19 | arith.maxnumf | MaxNumFOp | 41 | arith.trunci | TruncIOp |
| 20 | arith.maxsi | MaxSIOp | 42 | arith.uitofp | UIToFPOp |
| 21 | arith.maxui | MaxUIOp | 43 | arith.xori | XOrIOp |
| 22 | arith.minimumf | MinimumFOp | | | |

The Benefit-20 Specialist for arith.constant

arith.constant is absent from the bank because it does not route through the generic same-name remap. A hand-written ConstantTensorOpConversion registers separately with PatternBenefit(20), which pre-empts any default-benefit pattern that might otherwise match a constant op. The specialist inspects the constant's Attribute payload and synthesises one of three destinations:

%r = arith.constant dense<1.0> : tensor<8x64xf32>              // SplatElementsAttr branch
   ↓
%r = nv_tileaa.splat 1.0 : tensor<8x64xf32>

%r = arith.constant dense<[1.0, 2.0, ...]> : tensor<32xf32>    // DenseElementsAttr branch
   ↓
%r = nv_tileaa.constant_tensor dense<[1.0, 2.0, ...]> : tensor<32xf32>

Scalar IntegerAttr and FloatAttr payloads keep the same arith.constant form, since arith constants are dynamically legal in the target dialect at scalar types. The decision is structural, not numeric — the specialist branches on attribute kind, not on attribute value.
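
The structural branch can be sketched with a variant that stands in for the attribute payload. The three payload structs and chooseConstantDestination are hypothetical stand-ins for the MLIR attribute classes; the destination op names are the ones given above.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Toy payloads standing in for splat, dense, and scalar attributes.
struct SplatPayload { double value; };
struct DensePayload { std::vector<double> values; };
struct ScalarPayload { double value; };
using ConstantPayload = std::variant<SplatPayload, DensePayload, ScalarPayload>;

// Branches on which alternative is held, never on the numeric value.
std::string chooseConstantDestination(const ConstantPayload &payload) {
  if (std::holds_alternative<SplatPayload>(payload)) return "nv_tileaa.splat";
  if (std::holds_alternative<DensePayload>(payload))
    return "nv_tileaa.constant_tensor";
  return "arith.constant";  // scalar payloads keep their arith form
}
```

A splat of 0.0 and a splat of 1.0 take the same branch, which is the sense in which the decision is structural.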

Parent Driver

The TileAA-to-TileAS pattern bank composes four sub-populators plus six hand-written specialists inlined into the parent body. Sub-populator order is fixed.

| Sub-populator | Patterns |
|---|---|
| nv_tileaa structure | nv_tileaa.block_tile, nv_tileaa.make_memref, nv_tileaa.get_dim_size |
| nv_tileaa bitcast/ptr | nv_tileaa.bitcast, nv_tileaa.ptr_to_int, nv_tileaa.int_to_ptr |
| func dialect | func.func, func.call, func.return |
| arith generic bank | the 43 GenericOpPattern<arith::*> instantiations above |
| inline specialists | MakeTiledTMADescOpHostConversion, ConstantTensorOpConversion, AddPtrOpConversion, SplatOpHostConversion, AssumeOpConversion, ExtractOpHostConversion |

The parent driver builds the ConversionTarget marking llvm, cute, cute_nvgpu, builtin, and vector as fully legal; registers arith with a dynamic legality predicate that returns true once an op already has TileAS-form operands; and marks nv_tileaa and nv_tileas as legal. A failed partial conversion emits the diagnostic "expect lower MakeTiledTMADescOp".
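
The legality setup can be approximated by a predicate over a toy op that records its dialect and whether each operand is already in TileAS form. ToyOp and isLegal are hypothetical; returning an empty optional for dialects the target says nothing about is a simplification of how partial conversion treats ops with no declared legality.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Toy op: dialect name plus one flag per operand (hypothetical model).
struct ToyOp {
  std::string dialect;
  std::vector<bool> operandIsTileASForm;
};

// true/false when the target declares legality, nullopt when it is silent.
std::optional<bool> isLegal(const ToyOp &op) {
  static const std::set<std::string> legalDialects = {
      "llvm", "cute", "cute_nvgpu", "builtin",
      "vector", "nv_tileaa", "nv_tileas"};
  if (legalDialects.count(op.dialect) != 0) return true;
  if (op.dialect == "arith") {
    // Dynamic rule: legal once every operand is already in TileAS form.
    return std::all_of(op.operandIsTileASForm.begin(),
                       op.operandIsTileASForm.end(),
                       [](bool converted) { return converted; });
  }
  return std::nullopt;
}
```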

PDL Fallback

Every Convert*ToLLVM pass runs a PDL-to-PDLInterp fallback immediately before applyPartialConversion. The fallback walks the PDL pattern modules registered with the active RewritePatternSet, compiles each module down to PDL Interpreter operations, and hands the resulting interpreter bodies to the conversion driver alongside the C++ patterns. The compiled PDL Interpreter bodies live as fixed bytecode in the binary's read-only data; the fallback's job is to materialize them as runnable patterns at pass time, so the on-disk PDL pattern is essentially an interpreter program ready to be wired into the match-and-rewrite loop.

When PDL compilation fails — typically because a registered pattern references an op or attribute the interpreter cannot resolve in the current dialect registry — the fallback emits "failed to lower PDL pattern module to the PDL Interpreter" and returns failure. The parent driver treats this as a hard pass failure rather than a recoverable miss: surviving without the PDL-side patterns would silently change which ops the conversion target deems illegal, producing a module that compiles but mishandles the pattern's intended rewrite.

Shared LLVMTypeConverter Contract

The shared LLVMTypeConverter extends the upstream MLIR TypeConverter with the five tile-extension hooks Tileiras needs and the five overridden methods that enforce the kernel-ABI rules above.

| Method | Status | Role |
|---|---|---|
| convertType | overridden | Dispatches to the tile-extension callbacks first, then falls through to the upstream LLVMTypeConverter base for ordinary LLVM types. |
| convertCallSignature | overridden | Enforces the bare-pointer ABI for kernel call sites: ranked memref arguments become aligned pointers plus separately-carried launch metadata. |
| convertFunctionSignature | overridden | Lifts kernel-spec fields onto the converted function and emits "failed to convert function signature type for: " (trailing space preserved) when a type is unrecognised. |
| materializeSourceConversion | overridden | Emits builtin.unrealized_conversion_cast as the source-side bridge during partial conversion. |
| materializeTargetConversion | overridden | Emits the inverse builtin.unrealized_conversion_cast for the target side. |
| convertTileType | tile-extension | Maps TileType to a llvm.struct payload. |
| convertTokenType | tile-extension | Maps TokenType to i32 for memory/async/pipeline tokens. |
| convertPipelineIteratorType | tile-extension | Maps PipelineIteratorType to a small llvm.struct carrying stage and phase. |
| convertTensorViewType | tile-extension | Maps TensorViewType to the tiled-view descriptor (base pointer plus packed metadata). |
| convertPartitionViewType | tile-extension | Maps PartitionViewType to a partition descriptor. |

convertType walks an ordered list of addConversion callbacks; the first callback that recognises an incoming type wins. Tile-extension callbacks register before the base LLVM callbacks so the partial-conversion driver sees a single uniform converter rather than fanning out across two different machines for tile-only versus upstream-LLVM types.

Type LLVMTypeConverter::convertType(Type t) {
    if (auto tile     = dyn_cast<TileType>(t))              return convertTileType(tile);
    if (auto tok      = dyn_cast<TokenType>(t))             return convertTokenType(tok);
    if (auto pipeIt   = dyn_cast<PipelineIteratorType>(t))  return convertPipelineIteratorType(pipeIt);
    if (auto view     = dyn_cast<TensorViewType>(t))        return convertTensorViewType(view);
    if (auto part     = dyn_cast<PartitionViewType>(t))     return convertPartitionViewType(part);
    /* ... callbacks for cute / cute_nvgpu / nv_tileaa / arith / scf types ... */
    return baseLLVMConvertType(t);
}

A Convert*ToLLVM pass must construct exactly one LLVMTypeConverter and thread it through every pattern, every ConversionTarget legality predicate, and every PDL pattern module the fallback later compiles. Constructing two converters in the same pass — for example, one for body patterns and one for cleanup — is the most common way to silently break the bare-pointer ABI on function boundaries, because the two converters disagree on whether a ranked memref kernel argument is a pointer or a descriptor.
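
The disagreement is easy to reproduce in a toy model where the only converter state is a bare-pointer flag. ToyAbiConverter and boundaryTypesAgree are hypothetical names; the descriptor string follows the ranked memref layout shown earlier on this page.

```cpp
#include <cassert>
#include <string>

// Toy converter whose only state is the bare-pointer ABI choice.
struct ToyAbiConverter {
  bool useBarePointer;

  // Type a ranked memref kernel argument converts to under this converter.
  std::string convertRankedMemRefArg(unsigned rank) const {
    if (useBarePointer) return "!llvm.ptr";
    const std::string dims = "array<" + std::to_string(rank) + " x i64>";
    return "!llvm.struct<(ptr, ptr, i64, " + dims + ", " + dims + ")>";
  }
};

// True when both sides of a function boundary produce the same argument type.
bool boundaryTypesAgree(const ToyAbiConverter &bodySide,
                        const ToyAbiConverter &cleanupSide, unsigned rank) {
  return bodySide.convertRankedMemRefArg(rank) ==
         cleanupSide.convertRankedMemRefArg(rank);
}
```

Threading one converter through both phases makes the comparison trivially true; a second converter built with different options yields a pointer on one side and a descriptor struct on the other.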

Cross-References

Conversion / Lowering Overview shows where each pass that uses this infrastructure sits in the cascade. cuda_tile to TileAA, TileAA to TileAS, TileAS to LLVM, CuTe and CuTe-NVGPU to LLVM, and nvgpu / gpu to NVVM are the five passes that install patterns through the bank and thread the shared type converter described here.