cute Dialect Overview
Provenance vs Upstream MLIR
cute is an NVIDIA-introduced dialect with no upstream MLIR equivalent: upstream MLIR has no dialect that models CUTLASS cuTe layout algebra as first-class IR. The open-source CUTLASS library expresses the same algebra in C++ templates, not in MLIR. Tileiras lifts those templates into an MLIR dialect so passes can inspect, compose, verify, and lower layout values rather than expand them at C++ compile time. Without this dialect, the pipeline would have no in-IR carrier for shape, stride, swizzle, and atom data between layout assignment and the architecture-specific cute_nvgpu binding step.
Abstract
cute is tileiras's MLIR form of CUTLASS cuTe layout algebra. It encodes shapes, strides, layouts, swizzles, coordinates, tiles, pointer views, copy atoms, and MMA atoms — together with the operations that compose, divide, complement, coalesce, and filter them — and stops short of binding any of it to NVIDIA hardware. That binding is the job of cute_nvgpu. Every later GPU-specific dialect (cute_nvgpu, nvgpu, nvvm) reads layout values produced here.
cute is not a code-generation dialect. Its values describe structure: how a logical tile maps to physical coordinates, how coordinates become offsets, how one layout composes with another, how a tiled copy or tiled MMA partitions work across lanes, warps, and memory spaces. That makes it the common language shared by CUTLASS pipeline modeling, TileAS layout assignment, TMA descriptor construction, and MMA lowering.
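To make "how one layout composes with another" concrete, here is a minimal C++ sketch that treats a layout as a function from a 1-D index to an offset and checks that a candidate result behaves as plain function composition. `FlatLayout`, `eval`, and `is_composition` are hypothetical names chosen for illustration, not dialect APIs. The sketch assumes flat (non-nested) layouts and unpacks the 1-D index with the leftmost mode varying fastest, as open-source CuTe does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: one extent and one stride per mode.
struct FlatLayout {
  std::vector<int64_t> shape;
  std::vector<int64_t> stride;
};

// Evaluate a layout on a 1-D index: unpack the index into per-mode
// coordinates (leftmost mode fastest), then dot them with the strides.
int64_t eval(const FlatLayout& l, int64_t idx) {
  int64_t offset = 0;
  for (std::size_t d = 0; d < l.shape.size(); ++d) {
    offset += (idx % l.shape[d]) * l.stride[d];
    idx /= l.shape[d];
  }
  return offset;
}

// True when r(i) == a(b(i)) for every index i in [0, n).
// Assumes every offset produced by b is a valid index into a.
bool is_composition(const FlatLayout& r, const FlatLayout& a,
                    const FlatLayout& b, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    if (eval(r, i) != eval(a, eval(b, i))) return false;
  }
  return true;
}
```

With a = (6,2):(8,2) and b = (4,3):(3,1), the flat layout (2,2,3):(24,2,8) passes this check on all 12 indices. A real compose operation derives that result symbolically instead of testing it point by point; the pointwise check only states what the result has to satisfy.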
Role in the Cascade
cuda_tile / nv_tileaa / nv_tileas
|
| choose tile shapes, views, and partitioning
v
cute
|
| attach target-specific atoms and SM-tier constraints
v
cute_nvgpu
|
| normalize to nvgpu and nvvm
v
PTX
cute is a compact typed form of the same algebra that CUTLASS C++ expresses with templates. The templates become values and attributes that passes inspect, compose, verify, and lower.
Core Concepts
| Concept | Meaning | Typical use |
|---|---|---|
| Shape | Extents of a logical tile or nested coordinate tuple | Describes the iteration space of a tile. |
| Stride | Offset step for each coordinate dimension | Converts coordinates into linear offsets. |
| Layout | Shape plus stride, optionally decorated with swizzle | Maps logical coordinates to storage locations. |
| Tile | A grouped shape/layout fragment | Represents a fragment moved or computed as a unit. |
| Coord | A point in a shape or tile | Indexes layouts, views, and partitioned fragments. |
| Swizzle | Bijective remapping of low address bits (XOR-based in CuTe) | Avoids bank conflicts or matches hardware layout rules. |
| View | Pointer or memref plus layout metadata | Describes an addressed object without losing its layout. |
| Tiled copy / MMA | Layout plus atom-level partitioning | Feeds target-specific copy or matrix-multiply lowering. |
The key invariant is that cute values remain algebraic. A layout should be composable and queryable without knowing whether it will eventually become a TMA descriptor, an ldmatrix load, a WGMMA operand, or a Blackwell tensor-memory operation.
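As an illustration of "queryable without knowing the target", the sketch below answers two structural questions from shape and stride alone: how many logical coordinates a layout covers, and how much storage its offsets span. The names `FlatLayout`, `layout_size`, and `layout_cosize` are hypothetical, and the sketch assumes flat layouts with extents of at least 1 and non-negative strides.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: one extent and one stride per mode.
struct FlatLayout {
  std::vector<int64_t> shape;
  std::vector<int64_t> stride;
};

// Number of logical coordinates the layout covers (product of extents).
int64_t layout_size(const FlatLayout& l) {
  int64_t n = 1;
  for (int64_t s : l.shape) n *= s;
  return n;
}

// One past the largest offset the layout can produce.
// Assumes extents >= 1 and non-negative strides.
int64_t layout_cosize(const FlatLayout& l) {
  int64_t last = 0;
  for (std::size_t d = 0; d < l.shape.size(); ++d) {
    last += (l.shape[d] - 1) * l.stride[d];
  }
  return last + 1;
}
```

For example, (4,8):(8,1) has size 32 and cosize 32, the padded layout (4,8):(1,5) has size 32 but cosize 39, and the broadcast layout (4,8):(0,1) has cosize 8. None of these answers depends on which copy or MMA instruction eventually consumes the layout.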
Layout Semantics in One Line
A layout maps a coordinate to an offset. The simplest model is a (shape, stride) pair; the real dialect adds nested tuples, composition, complement, divide, product, and swizzles on top of that single primitive. The algebraic rules and the concrete compose/complement/divide/product definitions live on the Layout Algebra and Descriptor Grammar page (see Where to Find What below); this overview states only the core mapping, for the flat, non-nested case.
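// Flat case only: one integer coordinate and one stride per mode.
int64_t layout_offset(Layout L, Coord c) {
  int64_t offset = 0;
  // Dot product of the coordinate with the strides.
  for (int d = 0; d < rank(c); ++d)
    offset += c[d] * L.stride[d];
  // The swizzle, if any, is applied to the linear offset last.
  return apply_swizzle(L.swizzle, offset);
}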
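A self-contained version of the same mapping, with a concrete swizzle, can be sketched as follows. The `Swizzle` and `FlatLayout` structs are hypothetical stand-ins for the dialect's types. The swizzle follows the open-source CuTe `Swizzle<B, M, S>` convention for a non-negative shift (XOR `B` bits located `S` positions above bit `M` into the `B` bits starting at bit `M`); whether the dialect encodes swizzles exactly this way is an assumption here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical swizzle parameters, named after CuTe's Swizzle<B, M, S>.
struct Swizzle {
  int bits;   // B: number of bits to XOR
  int base;   // M: lowest bit position that gets modified
  int shift;  // S: distance from the source bits down to the target bits
};

// Hypothetical flat layout with an attached swizzle.
struct FlatLayout {
  std::vector<int64_t> shape;
  std::vector<int64_t> stride;
  Swizzle swizzle;
};

// XOR the `bits` source bits starting at (base + shift) into the bits
// starting at `base`. Assumes shift >= 0; bits == 0 is the identity.
int64_t apply_swizzle(const Swizzle& s, int64_t offset) {
  const int64_t mask = ((int64_t{1} << s.bits) - 1) << (s.base + s.shift);
  return offset ^ ((offset & mask) >> s.shift);
}

// Flat coordinate-to-offset mapping: dot product, then swizzle.
int64_t layout_offset(const FlatLayout& l, const std::vector<int64_t>& coord) {
  int64_t offset = 0;
  for (std::size_t d = 0; d < coord.size(); ++d) {
    offset += coord[d] * l.stride[d];
  }
  return apply_swizzle(l.swizzle, offset);
}
```

For the layout (4,8):(8,1) with swizzle parameters (2, 0, 3), coordinate (3,5) has the plain offset 29; bits 3 and 4 of 29 are both set, so the low two bits are XORed with 3 and the swizzled offset is 30. Coordinates in row 0 are unchanged because their offsets have no source bits set.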
For a reimplementation, the storage class the original compiler picks does not matter. What does matter: equivalent layouts canonicalize consistently, nested tuple layouts preserve rank and dimension identity, and swizzle composition stays explicit until a target-specific lowering consumes it.
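One way to picture "equivalent layouts canonicalize consistently" is a coalesce step on flat layouts. The sketch below follows the coalescing rule used by open-source CuTe for the flat case: drop modes of extent 1, and merge a mode into its left neighbor when its stride equals the neighbor's extent times the neighbor's stride. `FlatLayout` and this `coalesce` function are hypothetical illustration code; the dialect's actual rules, including nested tuples, are on the algebra page.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout: one extent and one stride per mode.
struct FlatLayout {
  std::vector<int64_t> shape;
  std::vector<int64_t> stride;
};

// Canonicalize a flat layout without changing its index-to-offset function.
FlatLayout coalesce(const FlatLayout& in) {
  FlatLayout out;
  for (std::size_t d = 0; d < in.shape.size(); ++d) {
    const int64_t s = in.shape[d];
    const int64_t t = in.stride[d];
    if (s == 1) continue;  // extent-1 modes contribute nothing
    if (!out.shape.empty() && t == out.shape.back() * out.stride.back()) {
      out.shape.back() *= s;  // contiguous with the previous mode: merge
    } else {
      out.shape.push_back(s);
      out.stride.push_back(t);
    }
  }
  if (out.shape.empty()) {  // everything had extent 1: the trivial layout 1:0
    out.shape.push_back(1);
    out.stride.push_back(0);
  }
  return out;
}
```

Under this rule (2,4):(1,2) and (2,1,6):(1,7,2) canonicalize to 8:1 and 12:1 respectively, while (4,8):(8,1) is left alone because its second stride does not continue the first mode. Two layouts that describe the same mapping then compare equal after coalescing, which is the property a reimplementation needs regardless of storage class.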
Where to Find What
The dialect is split across four pages by concern. Use this map to find the exact place a topic is documented; the overview does not duplicate any of these.
| Topic | Page |
|---|---|
| Layout algebra rules (composition, complement, divide, product, coalesce, filter) | Layout Algebra and Descriptor Grammar — Algebra Rules on Shape and Stride Tuples |
| Tuple-shape grammar, swizzle composition, descriptor round-trip | Layout Algebra and Descriptor Grammar — Descriptor Grammar |
| Tile partitioning ops (local_tile, local_partition, group_modes, dice, slice) | Tile and Divide Ops — Builder Operations |
| Atom builders (make_atom, make_tiled_copy, make_tiled_mma) and desugar rewrites | Atom Builders and Desugar — Atom Builder Contract |
| cute.make_int_tuple hub, make_layout desugaring shape | Atom Builders and Desugar — make_int_tuple Hub |
| Kernel-entry ABI (cute.kernel → nvvm.kernel, grid-constant arg-attrs) | Atom Builders and Desugar — Kernel-entry ABI |
| Verbatim verifier diagnostics (every error string the dialect emits) | Verifiers — Verbatim Diagnostics |
| Mode-range, divide, product, tuple-arithmetic verifier algorithms | Verifiers — Mode and Rank Checks |
| crd2idx weak-congruence walk, worked diagnostic example | Verifiers — Worked Example: crd2idx Weak Congruence Violation |
| LayoutTypeInterface kind discriminator and per-kind dispatch tables | Verifiers — LayoutTypeInterface Kind Discriminator |
In-Memory IR Tier
Treat cute as an in-memory compiler tier. It exists so passes can exchange rich layout objects without serializing every intermediate shape into the public input format. Textual rendering helps with debugging and documentation; production input normally enters through cuda_tile, nv_tileaa, cutlass, or another higher-level dialect, and the pipeline constructs cute objects internally.
Practical consequence: do not build tooling that depends on cute bytecode as a stable interchange format unless a serializer is explicitly provided. Textual dumps exist for inspecting the compiler; they are not a user-facing artifact.
If You Know CUTLASS (open source) — cross-walk
The open-source cute/ C++ headers map almost directly onto this dialect:
| CUTLASS C++ (cute namespace) | tileiras cute IR |
|---|---|
| cute::Shape<...> and cute::Stride<...> | hierarchical (shape, stride) tuples in a !cute.layout |
| cute::Layout<Shape, Stride> | !cute.layout type, kind-discriminated through the seven-entry sentinel table |
| cute::Swizzle<B, M, S> | !cute.swizzle value composed into a layout via make_composed_layout |
| cute::make_tile, cute::make_layout | cute.make_tile, cute.make_layout ops |
| cute::Tensor<Engine, Layout> | cute.make_view ties a pointer/memref to a layout |
| composition, complement, logical_divide, logical_product | identically-named cute.* ops |
| cute::make_tiled_copy, cute::make_tiled_mma | cute.make_tiled_copy, cute.make_tiled_mma (target binding deferred to cute_nvgpu) |
| Compile-time integer arithmetic in C++ templates | cute.make_int_tuple + tuple_div/mod/mul/sub ops |
The main difference is where the target boundary sits. The open-source cute/ library defines SM-specific MMA_Atom and Copy_Atom traits alongside the algebra, in the same header tree; tileiras keeps the SM-neutral atoms in cute and pushes every target-specific atom into cute_nvgpu. A pass running inside cute should never need to ask which SM tier is in use. If it does, that layout choice belongs on the cute_nvgpu side.
Cross-links
- Layout Algebra and Descriptor Grammar — Descriptor Grammar covers the concrete grammar and Round Trip rules.
- Tile and Divide Ops — Divide Variants covers tile partitioning operations.
- Atom Builders and Desugar — Per-Atom Desugar Rewrites covers construction of copy and MMA atoms.
- Verifiers — Verbatim Diagnostics covers layout and atom verifier behavior.