cute Dialect Overview

Provenance vs Upstream MLIR

cute is NVIDIA-introduced and has no upstream MLIR equivalent. Upstream MLIR has no dialect that models CUTLASS cuTe layout algebra as first-class IR — the open-source CUTLASS library expresses the same algebra in C++ templates, not in MLIR. Tileiras lifts those templates into an MLIR dialect so passes can inspect, compose, verify, and lower layout values rather than expand them at C++ compile time. Without this dialect the pipeline would have no in-IR carrier for shape/stride/swizzle/atom data between layout assignment and the architecture-specific cute_nvgpu binding step.

Abstract

cute is tileiras's MLIR form of CUTLASS cuTe layout algebra. It encodes shapes, strides, layouts, swizzles, coordinates, tiles, pointer views, copy atoms, and MMA atoms — together with the operations that compose, divide, complement, coalesce, and filter them — and stops short of binding any of it to NVIDIA hardware. That binding is the job of cute_nvgpu. Every later GPU-specific dialect (cute_nvgpu, nvgpu, nvvm) reads layout values produced here.

cute is not a code-generation dialect. Its values describe structure: how a logical tile maps to physical coordinates, how coordinates become offsets, how one layout composes with another, how a tiled copy or tiled MMA partitions work across lanes, warps, and memory spaces. That makes it the common language shared by CUTLASS pipeline modeling, TileAS layout assignment, TMA descriptor construction, and MMA lowering.

Role in the Cascade

cuda_tile / nv_tileaa / nv_tileas
    |
    | choose tile shapes, views, and partitioning
    v
cute
    |
    | attach target-specific atoms and SM-tier constraints
    v
cute_nvgpu
    |
    | normalize to nvgpu and nvvm
    v
PTX

cute is a compact typed form of the same algebra that CUTLASS C++ expresses with templates. The templates become values and attributes that passes inspect, compose, verify, and lower.

Core Concepts

| Concept | Meaning | Typical use |
| --- | --- | --- |
| Shape | Extents of a logical tile or nested coordinate tuple | Describes the iteration space of a tile. |
| Stride | Offset step for each coordinate dimension | Converts coordinates into linear offsets. |
| Layout | Shape plus stride, optionally decorated with swizzle | Maps logical coordinates to storage locations. |
| Tile | A grouped shape/layout fragment | Represents a fragment moved or computed as a unit. |
| Coord | A point in a shape or tile | Indexes layouts, views, and partitioned fragments. |
| Swizzle | Bit permutation applied to low address bits | Avoids bank conflicts or matches hardware layout rules. |
| View | Pointer or memref plus layout metadata | Describes an addressed object without losing its layout. |
| Tiled copy / MMA | Layout plus atom-level partitioning | Feeds target-specific copy or matrix-multiply lowering. |

The key invariant is that cute values remain algebraic. A layout should be composable and queryable without knowing whether it will eventually become a TMA descriptor, an ldmatrix load, a WGMMA operand, or a Blackwell tensor-memory operation.
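To make "queryable" concrete, here is a minimal sketch of two target-independent queries, size and cosize, as open-source CuTe defines them. The `FlatLayout` struct and both function names are hypothetical, chosen for illustration; the sketch assumes a flat (non-nested) layout with non-negative strides and no swizzle, and it is not the dialect's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout used only for illustration: one extent and one
// non-negative stride per mode, no nesting and no swizzle.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

// Number of logical coordinates: the product of the extents.
int64_t layout_size(const FlatLayout& layout) {
    int64_t n = 1;
    for (int64_t extent : layout.shape) n *= extent;
    return n;
}

// One past the largest offset the layout can produce.
int64_t layout_cosize(const FlatLayout& layout) {
    assert(layout.shape.size() == layout.stride.size());
    int64_t last = 0;
    for (std::size_t d = 0; d < layout.shape.size(); ++d) {
        assert(layout.shape[d] >= 1 && layout.stride[d] >= 0);
        last += (layout.shape[d] - 1) * layout.stride[d];
    }
    return last + 1;
}
```

For the dense layout (4,8):(1,4) both queries return 32. For the padded layout (4,8):(2,16) the size is still 32 but the cosize is 119, which shows that the layout leaves gaps in storage. Neither answer depends on what the layout is eventually lowered to.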

Layout Semantics in One Line

A layout maps a coordinate to an offset. The simplest model is (shape, stride); the real dialect adds nested tuples, composition, complement, divide, product, and swizzles on top of that single primitive. The algebraic rules and the concrete compose/complement/divide/product definitions live on the algebra page below; this overview only states the kernel.

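// Simplest (shape, stride) model: c carries one entry per stride of L.
int64_t layout_offset(Layout L, Coord c) {
    int64_t offset = 0;
    // Inner product of the coordinate with the strides.
    for (int d = 0; d < rank(c); ++d) offset += c[d] * L.stride[d];
    // The swizzle, if any, is applied to the linear offset last.
    return apply_swizzle(L.swizzle, offset);
}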
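As a worked instance of the kernel above, the following sketch evaluates the same inner product for two concrete layouts. `FlatLayout` and `flat_offset` are hypothetical names introduced here; the sketch assumes flat coordinates and an identity swizzle, so it covers only the simplest (shape, stride) model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: one extent and one stride per mode, no swizzle.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

// Inner product of coordinate and stride, i.e. the loop in layout_offset
// with an identity swizzle.
int64_t flat_offset(const FlatLayout& layout, const std::vector<int64_t>& coord) {
    assert(coord.size() == layout.stride.size());
    assert(coord.size() == layout.shape.size());
    int64_t offset = 0;
    for (std::size_t d = 0; d < coord.size(); ++d) {
        // Each coordinate entry must lie inside its mode's extent.
        assert(coord[d] >= 0 && coord[d] < layout.shape[d]);
        offset += coord[d] * layout.stride[d];
    }
    return offset;
}
```

With the column-major layout (4,8):(1,4), coordinate (2,3) lands at offset 2*1 + 3*4 = 14. With the row-major layout (4,8):(8,1), the same coordinate lands at 2*8 + 3*1 = 19. The shape is identical in both cases; only the strides decide where an element lives.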

For a reimplementation, the storage class the original compiler picks does not matter. What does matter: equivalent layouts canonicalize consistently, nested tuple layouts preserve rank and dimension identity, and swizzle composition stays explicit until a target-specific lowering consumes it.
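One way to read "equivalent layouts canonicalize consistently" is through coalescing, which merges adjacent modes that describe one contiguous run. The sketch below follows the flat-layout rule used by open-source CuTe's coalesce: drop size-1 modes, and merge a mode into its predecessor when its stride equals the predecessor's extent times the predecessor's stride. `FlatLayout` and this `coalesce` function are hypothetical illustrations; whether the dialect canonicalizes in exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: one extent and one stride per mode, no swizzle.
struct FlatLayout {
    std::vector<int64_t> shape;
    std::vector<int64_t> stride;
};

// Returns a layout with the same index-to-offset function and the fewest
// modes reachable by dropping size-1 modes and merging contiguous neighbors.
FlatLayout coalesce(const FlatLayout& in) {
    assert(in.shape.size() == in.stride.size());
    FlatLayout out;
    for (std::size_t d = 0; d < in.shape.size(); ++d) {
        const int64_t extent = in.shape[d];
        const int64_t step = in.stride[d];
        if (extent == 1) continue;  // size-1 modes never contribute to the offset
        if (!out.shape.empty() && out.shape.back() * out.stride.back() == step) {
            out.shape.back() *= extent;  // this mode continues the previous run
        } else {
            out.shape.push_back(extent);
            out.stride.push_back(step);
        }
    }
    if (out.shape.empty()) {  // every mode had extent 1
        out.shape.push_back(1);
        out.stride.push_back(0);
    }
    return out;
}
```

Under this rule (2,1,6):(1,6,2) and (4,3):(1,4) both reduce to 12:1, so two differently written descriptions of the same contiguous run compare equal after canonicalization. A row-major (4,8):(8,1) is left unchanged because its second stride does not continue the first mode.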

Where to Find What

The dialect is split across four pages by concern. Use this map to find the exact place a topic is documented; the overview does not duplicate any of these.

| Topic | Page | Section |
| --- | --- | --- |
| Layout algebra rules (composition, complement, divide, product, coalesce, filter) | Layout Algebra and Descriptor Grammar | Algebra Rules on Shape and Stride Tuples |
| Tuple-shape grammar, swizzle composition, descriptor round-trip | Layout Algebra and Descriptor Grammar | Descriptor Grammar |
| Tile partitioning ops (local_tile, local_partition, group_modes, dice, slice) | Tile and Divide Ops | Builder Operations |
| Atom builders (make_atom, make_tiled_copy, make_tiled_mma) and desugar rewrites | Atom Builders and Desugar | Atom Builder Contract |
| cute.make_int_tuple hub, make_layout desugaring shape | Atom Builders and Desugar | make_int_tuple Hub |
| Kernel-entry ABI (cute.kernel → nvvm.kernel, grid-constant arg-attrs) | Atom Builders and Desugar | Kernel-entry ABI |
| Verbatim verifier diagnostics (every error string the dialect emits) | Verifiers | Verbatim Diagnostics |
| Mode-range, divide, product, tuple-arithmetic verifier algorithms | Verifiers | Mode and Rank Checks |
| crd2idx weak-congruence walk, worked diagnostic example | Verifiers | Worked Example: crd2idx Weak Congruence Violation |
| LayoutTypeInterface kind discriminator and per-kind dispatch tables | Verifiers | LayoutTypeInterface Kind Discriminator |

In-Memory IR Tier

Treat cute as an in-memory compiler tier. It exists so passes can exchange rich layout objects without serializing every intermediate shape into the public input format. Textual rendering helps with debugging and documentation; production input normally enters through cuda_tile, nv_tileaa, cutlass, or another higher-level dialect, and the pipeline constructs cute objects internally.

Practical consequence: do not build tooling that depends on cute bytecode as a stable interchange format unless a serializer is explicitly provided. Textual dumps exist for inspecting the compiler; they are not a user-facing artifact.

If You Know CUTLASS (open source) — cross-walk

The open-source cute/ C++ headers map almost directly onto this dialect:

| CUTLASS C++ (cute namespace) | tileiras cute IR |
| --- | --- |
| cute::Shape<...> and cute::Stride<...> | hierarchical (shape, stride) tuples in a !cute.layout |
| cute::Layout<Shape, Stride> | !cute.layout type, kind-discriminated through the seven-entry sentinel table |
| cute::Swizzle<B, M, S> | !cute.swizzle value composed into a layout via make_composed_layout |
| cute::make_tile, cute::make_layout | cute.make_tile, cute.make_layout ops |
| cute::Tensor<Engine, Layout> | cute.make_view ties a pointer/memref to a layout |
| composition, complement, logical_divide, logical_product | identically-named cute.* ops |
| cute::make_tiled_copy, cute::make_tiled_mma | cute.make_tiled_copy, cute.make_tiled_mma (target binding deferred to cute_nvgpu) |
| Compile-time integer arithmetic in C++ templates | cute.make_int_tuple + tuple_div/mod/mul/sub ops |
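For readers coming from the C++ side, the `cute::Swizzle<B, M, S>` row can be illustrated with a runtime sketch of what that functor computes in open-source CUTLASS for a non-negative shift: it takes a B-bit field located S bits above bit M and XORs it into the B-bit field starting at bit M. The `Swizzle` struct below is a hypothetical runtime stand-in for the template parameters, and the assumption that `!cute.swizzle` carries the same semantics is not confirmed by this page.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime stand-in for cute::Swizzle<B, M, S>, restricted to S >= 0.
struct Swizzle {
    int bits;   // B: width of the bit field
    int base;   // M: number of low bits left untouched
    int shift;  // S: distance between the source and destination fields
};

// XOR the source field (bits [M+S, M+S+B)) into the destination field
// (bits [M, M+B)). All other bits pass through unchanged.
int64_t apply_swizzle(const Swizzle& sw, int64_t offset) {
    assert(sw.bits >= 0 && sw.base >= 0 && sw.shift >= 0);
    const int64_t bit_mask = (int64_t{1} << sw.bits) - 1;
    const int64_t src_mask = bit_mask << (sw.base + sw.shift);
    return offset ^ ((offset & src_mask) >> sw.shift);
}
```

With B = 3, M = 3, S = 3, offset 64 (only bit 6 set) maps to 72, and 72 maps back to 64, while offsets below 64 are unchanged because their source field is zero. Keeping the three parameters as data, rather than folding the XOR into strides, is what lets the swizzle stay explicit until a target-specific lowering consumes it.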

The main difference is where the target boundary sits. The open-source cute/ library compiles SM-specific MMA_Atom and Copy_Atom traits straight into the same headers; tileiras keeps the SM-neutral atoms in cute and pushes every target-specific atom into cute_nvgpu. A pass running inside cute should never need to ask which SM tier is in use. If it does, the layout choice belongs on the cute_nvgpu side.