Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Glossary

This glossary defines the public terms used throughout the tileiras wiki. It focuses on behavior, data models, dialects, passes, and target concepts. Detailed operation rosters live in their dialect pages.

Core Tools

TermMeaning
tileirasCUDA TileIR optimizing assembler. It consumes TileIR MLIR bytecode and produces a host object containing compiled GPU code. See Driver Overview.
ptxasNVIDIA assembler invoked after PTX emission to produce the final GPU binary payload. See ptxas Handoff Protocol.
nvdisasmNVIDIA disassembler optionally invoked to produce annotated SASS output.
ciccCUDA C++ device compiler. It shares the LLVM/NVPTX backend family with tileiras but starts from CUDA frontend output, not TileIR bytecode. See cicc Comparison.
TileIRNVIDIA's MLIR-based tile program representation consumed by tileiras. The serialized bytecode form carries builtin.module containers whose gpu.module payloads are expressed in the cuda_tile dialect; passing through the full lowering cascade it becomes nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass, nvgpu, nvvm, and finally llvm. See Pipeline Overview.
TileASThe pass family and dialect-family name covering scheduling, layout, async pipeline, CTA cluster, and buffer-management work over nv_tileas IR. The CLI prefix and option names use the lowercase form (tileas-*); prose uses TileAS. See TileAS Pass Families.

Dialects

Each dialect occupies one layer of the lowering pipeline. The early dialects preserve tile semantics, the middle dialects make layout and scheduling explicit, and the late dialects bridge to NVVM and LLVM.

TermMeaning
cuda_tilePublic input dialect for tile programs. It describes tile arithmetic, memory, control flow, tokens, tensor views, and kernel entries. The dialect is the only public surface — the rest of the cascade is NVIDIA-private. See cuda_tile Overview.
nv_tileaaAlias-aware internal dialect below cuda_tile. It introduces explicit memory references, pointer provenance, tokens, queues, and reuse markers so later passes can reason about aliasing without re-deriving it. See nv_tileaa Overview.
nv_tileasOperational async-scheduling dialect. It represents producer/consumer pipelines, TMA-ready memory operations, layout conversion, and scheduled regions. The TileAS pass family runs on this dialect. See nv_tileas Overview.
cuteTarget-neutral layout algebra dialect derived from CuTe concepts: shape, stride, layout, tile, coord, swizzle, and tiled atom descriptors. Used to express layout transformations and tile partitioning. See cute Overview.
cute_nvgpuNVIDIA architecture atom dialect for MMA, WGMMA, TMA, TMEM, ldmatrix, stmatrix, and target-specific copy operations. Each atom is parameterised by SM tier (SM70..SM120). See cute_nvgpu Overview.
cutlassCUTLASS pipeline dialect for tile schedulers, sequence barriers, pipeline roles, block-striped operations, and persistent kernel structure. Models the CUTLASS programming-model abstractions as MLIR ops. See cutlass Overview.
nvgpuStock MLIR NVIDIA GPU bridge dialect used before NVVM conversion. Acts as an intermediate between high-level GPU intent and concrete NVVM intrinsics. See nvgpu Overview.
nvvmMLIR dialect representing NVVM/PTX-facing intrinsics and target operations before LLVM IR materialization. See NVVM Overview.
llvmMLIR LLVM dialect used as the last MLIR form before creating an LLVM module.

Tile and Layout Terms

TermMeaning
TileA logical block of tensor data operated on as a unit.
ShapeExtents of a tile, tensor view, or coordinate tuple.
StrideOffset step associated with each coordinate dimension.
LayoutMapping from logical coordinates to physical offsets, usually shape plus stride and optional swizzle.
SwizzleBit permutation used to match hardware layout requirements or avoid memory-bank conflicts.
CoordCoordinate value used to index a shape or layout.
ViewPointer or memref plus shape, stride, element type, and memory-space metadata.
Tensor viewHigh-level view of a tensor region with shape and stride semantics.
Partition viewView that partitions a tensor or tile among program dimensions, lanes, warps, or agents.
AtomA hardware-sized operation descriptor such as a copy atom, MMA atom, or TMA atom.

Scheduling Terms

TermMeaning
StageLogical software-pipeline stage assigned by the TileAS scheduler. Operations in stage k start k iterations of the prologue ahead of the steady-state.
OrderDeterministic tie-breaker within a stage. Together with stage, it forms the (stage, order) pair downstream materialization consumes.
Initiation interval (II)Number of cycles between starts of successive software-pipeline iterations. The minimum II respects both data-dependence and resource constraints.
RRTResource Reservation Table. A bitset table with one row per cycle modulo the candidate II, where each row is a bitset of resource classes. Used to test whether an operation can occupy a candidate modulo cycle. See Resource Constraint Builder and RRT.
Resource footprintPer-operation resource occupancy over one or more cycles. The scheduler reads it before probing an RRT slot.
ScheduleAnalysisPreserved MLIR analysis carrying the fixed schedule from TileASGenerateSchedule to TileASMaterializeSchedule. The two-pass split is what lets the scheduler decide once and the materializer apply once.
MaterializeScheduleThe TileAS pass that consumes the cached ScheduleAnalysis and emits Pipe_ / Mutex_ SSA values along with the cute_nvgpu.arch.agent_switch partitioning at warp-specialized boundaries. See Async/Pipeline Family.
Pipe_Concrete producer/consumer coordination value emitted after schedule placement. Models a depth-d ring buffer with bounded slack between producer and consumer stages. See Pipe_ and Mutex_ Value-Header Layout.
Mutex_Concrete mutual-exclusion coordination value emitted after schedule placement. Models a zero-slack serialization edge — iteration i of the protected region must complete before iteration i+1 starts.
Schedule::solveMaterialization algorithm that groups producers and consumers into Pipe_ values after placement is fixed. See Schedule::solve and Cost Evaluators.
VLIWVery Long Instruction Word. Used in the scheduler context to describe how multiple operations get bundled into a single issue slot — the modulo scheduler emits VLIW-style packed schedules when the target pipeline has multiple parallel function units.

Async Pipeline Terms

TermMeaning
ProducerAgent or region that fills a pipeline stage.
ConsumerAgent or region that waits for and reads a produced pipeline stage.
Pipeline stageRotating buffer slot shared by producer and consumer agents.
Producer acquireOperation that grants producer ownership of a stage.
Producer commitOperation that publishes a filled stage to consumers.
Consumer waitOperation that waits for a committed stage.
Consumer releaseOperation that returns a consumed stage to the pipeline.
Pipeline iteratorSSA value identifying the current rotating stage.
Agent switchOperation that selects producer or consumer agent regions under warp specialization. The nv_tileas.async.pipeline.agent_switch op is the IR-visible form.
AWSAgent-Warp-Specialized. The dispatch mode MaterializeSchedule selects when distinct producer and consumer agents are partitioned across warps; the alternative AUS (Agent-Unspecialized) has a single SIMT agent owning both. The nv_tile.aws.* attribute family threads scheduling keys back into the AsyncValue headers.

GPU Architecture Terms

TermMeaning
SM (Streaming Multiprocessor)The basic GPU compute unit. Each SM owns a register file, a shared-memory bank, warp schedulers, and one or more tensor-core pipelines. Targets are named by SM tier: tileiras emits for the Blackwell family (sm_100, sm_103, sm_110, sm_120, sm_121). See GPU Execution Model.
CTA (Cooperative Thread Array)The PTX-level name for a thread block. A CTA contains 1 to 1024 threads grouped into warps; threads in the same CTA share an SMEM allocation and can synchronize through CTA-local barriers. See GPU Execution Model.
Warp32 threads executing in SIMT lockstep on the same SM. The warp is the unit of instruction issue, divergence, and most synchronization primitives. See GPU Execution Model.
Warp-groupFour contiguous warps, 128 threads. The unit of cooperation for WGMMA on Hopper and for several Blackwell tensor-memory operations. See WGMMA Emission Protocol.
ClusterA SM90-introduced grouping of 1-8 CTAs that share distributed shared memory and can use cluster-scope barriers. Hopper introduced 2-CTA clusters; Blackwell extends to 4-CTA. See Cluster Sync and DSMEM Handshake.
GridThe whole kernel launch — a 1D/2D/3D array of CTAs scheduled together by the driver.
Register fileThe per-SM bank of 32-bit registers, partitioned among resident warps. Tileiras's register-pressure heuristics and the modulo scheduler both reason about this resource.
SMEM (Shared Memory)Per-CTA on-chip memory. Around 228 KB usable per SM on H100-class parts; bandwidth on the order of tens of TB/s. Used for tiles, mbarriers, and TMA staging.
GMEM (Global Memory)Device-wide off-chip DRAM. Tens to hundreds of GB on data-center parts. Accessed through ld.global, cp.async, or TMA.
DSMEM (Distributed Shared Memory)Cross-CTA shared memory inside a cluster: each cluster member can address shared memory of every peer through nvvm.mapa plus llvm.addrspacecast. The handshake pairs nvvm.cluster.arrive and nvvm.cluster.wait with optional fences. See Cluster Sync and DSMEM Handshake.
TMEM (Tensor Memory)SM100+ on-chip memory used as the operand and accumulator store for tcgen05.mma. A separate address space (addrspace 4) with its own load/store and copy primitives. See tcgen05 Tensor Memory Model.
TMA (Tensor Memory Accelerator)SM90+ async bulk tensor-copy engine. Driven by 128-byte tensormap descriptors and the cp.async.bulk.tensor family. See TMA TensorMap and cp.async.bulk.
S2T copyShared-to-tensor-memory copy. Blackwell-specific transfer from SMEM to TMEM, used to stage tcgen05.mma operands. The cute_nvgpu.atom.copy_make_s2t_copy_op family models it.
WGMMAWarp-group matrix multiply-accumulate, introduced for Hopper tensor cores. Issued by a 128-thread warp group cooperatively against an SMEM-resident B descriptor and a register or SMEM A descriptor. See WGMMA Emission Protocol.
UMMAUnified MMA family used by Blackwell tensor-memory operations. Issued through tcgen05.mma with accumulator and operands in TMEM.
IMMAInteger matrix multiply-accumulate. The PTX instruction family for integer MMA tiles; appears in mixed-precision MMA paths alongside the floating-point families.
GMMA descriptorSynonym for SMEM descriptor in the WGMMA context. The 64-bit shared-memory descriptor that encodes the SMEM base address (low 14 bits, in 16-byte units) plus leading and stride byte offsets pinning the 2D tile shape into shared memory. WGMMA operand B is always an SMEM descriptor; operand A is either a register fragment or an SMEM descriptor.
SMEM descriptorSee GMMA descriptor.
f8E8M0FNU8-bit floating-point variant used as the scale-factor type in block-scaled MMA. Encodes a pure exponent (no mantissa, no sign), giving microscale factors a wide dynamic range from a single byte. See also e8m0 under Math and Precision.
MicroscaleBlock-scaled MMA where each tile of operand data carries a small shared scale factor (typically f8E8M0FNU). Allows narrow operand types (FP4 and FP8 mantissa) to express a wide effective dynamic range. See Fast-Math and Numerical Precision.
collector::aThe tcgen05.mma accumulator-mode parameter selecting how the accumulator participates: use reads and writes, fill writes only (zero-init equivalent), discard writes only with no read dependency. The kind-word verifier at sub_1AD26A0 packs this into the same bitfield as cta_group.
tcgen05Blackwell tensor-memory instruction family exposed through NVVM/NVPTX lowering. Covers tcgen05.mma, tcgen05.cp, tcgen05.commit, and the synchronizing primitives. See tcgen05 Tensor Memory Model.
mma.syncWarp-cooperative matrix multiply-accumulate on SM70 through SM89. Operands and accumulator live in registers; the whole warp issues the operation together. Superseded by WGMMA on Hopper and tcgen05.mma on Blackwell, but still emitted for older targets.
ldmatrixSynchronous instruction family that loads matrix fragments from shared memory into per-thread registers shaped for mma.sync/WGMMA consumption. The SMEM-to-RF companion to cp.async/cp.async.bulk.
stmatrixSynchronous matrix-fragment store from registers back to shared memory. The store-side counterpart to ldmatrix.
cp.asyncAmpere (SM80+) asynchronous global-to-shared copy family. Decouples the load issue from the data-ready point through commit-and-wait groups.
cp.async.bulkSM90+ bulk async copy family covering both tensor and non-tensor variants. The tensor variant is the TMA path; the non-tensor variant carries plain byte ranges.
cp.async.bulk.tensorHopper/Blackwell bulk tensor-memory copy family used by TMA, driven by tensormap descriptors.
mbarrierTransactional barrier object held in shared memory. Used by TMA, async copy, and the producer/consumer handshake to coordinate arrivals and byte-count transactions across warps. See mbarrier State Machine.
NamedBarrier (bar.sync N)The CTA-local barrier pool indexed by a small integer (0-15). Distinct from mbarriers: bar.sync is a hardware-implemented synchronous barrier with no transactional state, used for sub-CTA synchronization at warp-specialized boundaries.

PTX and SASS

TermMeaning
PTXNVIDIA's virtual ISA and target-independent intermediate representation. Tileiras emits PTX text that ptxas then translates to a concrete SM's SASS. See ptxas Handoff Protocol.
SASSNVIDIA's hardware ISA, generated by ptxas from PTX and specific to one SM tier. Tileiras itself does not emit SASS; it relies on ptxas for instruction selection at that level. See PTX Version and Target Selection.
State spacePTX's address-space designation on a load/store or pointer: global, shared, local, constant, param, or the unspecified generic. State spaces map to MLIR memory spaces and to LLVM address spaces in the NVVM target. See AddrSpace Vote Lattice.
Inline PTXLLVM inline assembly carrying PTX text and operand constraints. Tileiras emits inline PTX for primitives the NVVM intrinsics layer does not cover directly.

Backend Terms

TermMeaning
NVVM intrinsicLLVM intrinsic in the llvm.nvvm.* family.
LLVM moduleLLVM IR representation produced after MLIR lowering.
MachineIRLLVM target-specific machine representation after instruction selection.
Parameter spacePTX address space used for kernel parameters.
Address spaceMemory-space classification such as generic, global, shared, constant, local, or parameter.
libdeviceNVIDIA device math bitcode library linked into modules that call __nv_* math functions. See libdevice Overview.
__nvvm_reflectCompile-time configuration query used by libdevice and NVVM support code. The reflect pass replaces __nvvm_reflect("name") calls with the resolved integer value at compile time. See NVVMReflect Mechanism.
__grid_constant__Kernel-parameter attribute indicating a value that is constant per grid launch. The TMA descriptor pass uses it to mark TMA descriptors passed by kernel parameter, so codegen can place the descriptor into a read-only constant slot without proving constancy from scratch.
DescriptorGeneric name for a structured operand passed to a hardware primitive. Each architecture family has its own descriptor type: TMA descriptors are 128-byte records for cp.async.bulk.tensor; GMMA/SMEM descriptors are 64-bit records for WGMMA.
IntrinsicA function-like name that lowers to one or a few target instructions rather than a regular call. PTX intrinsics surface in MLIR as nvvm.* ops.
PassAn MLIR transformation that runs on an operation kind (builtin.module, gpu.module, nv_tileaa.func, etc.). Tileiras runs about fifty passes per device module at -O3. See Full Pass List by Opt Level.
DialectAn MLIR namespace owning a set of operations, types, attributes, and interfaces. Tileiras registers nine dialects across the lowering cascade plus upstream MLIR dialects (arith, math, scf, builtin, etc.).
NCLNVPTX Common Library — the family of nv-* and nvptx-* helper passes that perform common-base elimination, dead-sync elimination, kernel attribute stamping, and other NVPTX-specific cleanups in the backend.

MLIR Infrastructure

TermMeaning
MLIR (Multi-Level IR)The LLVM-project IR-of-IRs framework that hosts tileiras's whole lowering cascade. Dialects, operations, types, attributes, and passes are all MLIR concepts. See Architecture Evolution and Design Decisions.
OperationAn instruction-level IR node in MLIR. Carries operands, results, attributes, regions, a source location, and an OperationName. The whole MLIR program is a tree of operations. See Operation Layout.
AttributeCompile-time-known data attached to an operation: integers, strings, types, dictionaries, dialect-defined records, etc. Attributes are uniqued in the MLIRContext. See Attribute System and Lowering.
TypeAn MLIR value's type. Types are uniqued through the StorageUniquer per context and carry a TypeID plus optional dialect-defined storage. See Storage Uniquer and ContextImpl.
RegionA container of basic blocks living inside an operation. Functions, loops, branches, and structured constructs each own one or more regions.
OperationNameThe per-op-kind runtime identity that every concrete operation refers to. Holds the dialect pointer, the operation's TypeID, its interface table, and folding/verification hooks. See Operation Layout.
TypeIDPer-class runtime identity assigned by MLIR's TypeID machinery. Used to key attribute storage, type storage, interface dispatch, and pass IDs. RTTI is disabled in LLVM/MLIR, so TypeID plays the role that typeid would in standard C++. See TypeID Sentinels and Anchors.
TableGenLLVM's declarative DSL (extension .td) for describing instructions, registers, intrinsics, and other compiler tables. A backend reads the .td files and emits C++ headers and tables at build time.
ODS (Operation Definition Specification)The MLIR-specific use of TableGen. Each dialect's operations, types, attributes, and interfaces are declared in .td files; mlir-tblgen emits the C++ classes and definitions consumed by the dialect implementation.

Math and Precision

TermMeaning
FP32 / f32IEEE 754 single-precision binary32. 1 sign + 8 exponent + 23 mantissa bits. The reference precision for tile arithmetic that is not explicitly narrowed.
FP16 / f16 / halfIEEE 754 half-precision binary16. 1 + 5 + 10 bits. Common as MMA operand and accumulator on pre-Hopper tensor cores.
BF16 / bf16brain-float-16. 1 + 8 + 7 bits. Same exponent range as FP32 but only 7-bit mantissa; the standard low-precision training format on Hopper and Blackwell tensor cores.
FP8 (e4m3, e5m2)8-bit floating-point types from the OFP8 family. e4m3 has 4 exponent + 3 mantissa bits (used for forward activations and weights), e5m2 has 5 + 2 (wider range, used for gradients). MMA operand type on SM89+.
FP4 (e2m1)4-bit floating-point type with 2 exponent + 1 mantissa bit. Used as MMA operand in Blackwell block-scaled MMA.
e8m08-bit exponent, 0-bit mantissa, no sign. Used as the per-block scale factor in MX-FP block formats. In MLIR this is f8E8M0FNU.
Block-scaled FPAn MX-FP-style format: a block of N narrow values (FP4 or FP8 mantissa) plus a shared e8m0 scale factor. Lets narrow operands cover a wide effective dynamic range. See Fast-Math and Numerical Precision.
FTZ (Flush to Zero)Hardware option that flushes subnormal inputs and results to signed zero. Controlled per-module through NVVM-Reflect, per-call through libdevice fast variants, and per-instruction through PTX rounding modifiers.
Denormal / SubnormalAn IEEE 754 number with the implicit leading 1 absent, allowing magnitudes below the smallest normal at the cost of reduced relative precision. GPU pipelines often FTZ them for throughput.
FMA (Fused Multiply-Add)The operation a*b + c computed with a single rounding step. Lower error and higher throughput than separate multiply and add. See Fast-Math and Numerical Precision.
Fast-math flagsThe LLVM IR flag set carried on floating-point ops: nnan (no NaNs), ninf (no Infs), nsz (no signed zero), arcp (allow reciprocal), contract (allow FMA contraction), afn (approximate function), reassoc (allow reassociation). Tileiras propagates these through NVVM lowering.

Reverse Engineering and Binary

TermMeaning
ELF (Executable and Linkable Format)The standard Linux binary format. Both the tileiras driver shared object and ptxas's input/output use ELF containers.
StrippedA binary with its symbol table removed. Tileiras ships stripped, which is why the wiki refers to internal routines by sub_ADDR instead of source names. See Binary Anatomy and RE Methodology.
sub_ADDRIDA Pro's auto-generated name for an unnamed function at virtual address ADDR (hex). The wiki uses this convention to cite specific routines in the stripped binary.
IDA ProThe commercial disassembler and decompiler used to recover tileiras's behavior from its stripped shared object. See Binary Anatomy and RE Methodology.
vtableThe per-class table of virtual-function pointers a C++ object carries when it has virtual methods. The wiki cites vtable layouts when discussing dialect interfaces, pass classes, and pattern rewriters.
RTTI (Run-Time Type Information)The standard-C++ mechanism for runtime type identification via typeid/dynamic_cast. LLVM and MLIR disable RTTI for code size; tileiras uses MLIR's TypeID machinery instead. See TypeID Sentinels and Anchors.

CUDA Toolchain

TermMeaning
nvccThe top-level CUDA compiler driver. Invokes the host compiler, cudafe++, cicc/tileiras, ptxas, fatbinary, and the host linker. See nvcc 13.1 Position.
ptxasThe PTX → SASS assembler. Receives PTX text from tileiras and emits a cubin for one SM target. See ptxas Handoff Protocol.
cudafe++NVIDIA's CUDA C++ frontend. Splits a CUDA source file into host and device translation units before either side is compiled. See cudafe Non-Relationship.
ciccThe older LLVM-based device compiler that lowers CUDA C++ device IR to PTX. Shares the NVPTX backend family with tileiras but starts from cudafe++ output rather than TileIR bytecode. See cicc Comparison.
libdeviceNVIDIA's device-side math bitcode library, linked into device modules that call __nv_* math functions. Configured through NVVM-Reflect at link time. See libdevice Overview.
NVVMNVIDIA's variant of LLVM IR for device code. Tileiras's final MLIR form lowers into NVVM-flavored LLVM IR, which is then translated to PTX.
NVVM-ReflectThe mechanism that resolves environment-style integer queries (__CUDA_FTZ, __CUDA_PREC_SQRT, SM version, etc.) into compile-time constants, controlling which libdevice variants survive optimization. See NVVMReflect Mechanism.
FatbinA container format holding multiple cubin and/or PTX images for different SM targets in one file. Produced by fatbinary and consumed by the CUDA runtime for JIT or load-time selection.
CubinA compiled CUDA binary for one SM target, produced by ptxas. The unit packaged into a fatbin.

Scheduler Coordination Values

TermMeaning
AsyncValueThe umbrella value family the TileAS scheduler emits to model async coordination resources after placement. Pipe_ and Mutex_ are the two concrete shapes; both are interned and fingerprinted (BLAKE3) so identical synchronization patterns share storage. See AsyncValue and BLAKE3 Interning.
Pipe_A depth-d producer/consumer ring buffer with bounded slack between producer and consumer stages. Emitted by Schedule::solve after placement. See Pipe_ and Mutex_ Value-Header Layout.
Mutex_A zero-slack mutual-exclusion edge between successive iterations of a protected region. Iteration i must complete before iteration i+1 starts. See Pipe_ and Mutex_ Value-Header Layout.
Rau schedulingThe Rau 1994 modulo-scheduling algorithm: search an initiation interval, place each operation into a cycle modulo II, and respect both recurrence and resource constraints. Tileiras's TileASGenerateSchedule is a Rau-style placement engine. See Modulo Scheduler and Rau.
RRT (Resource Reservation Table)A per-cycle bitset table indexed modulo the candidate II, where each row records which resource classes are occupied. The scheduler probes the RRT before committing an operation to a cycle. See Resource Constraint Builder and RRT.
Modulo Initiation Interval (II)The number of cycles between starts of successive software-pipeline iterations under a modulo schedule. Smaller II raises throughput; the scheduler searches upward from the maximum of the resource, recurrence, and dependence lower bounds. See Modulo Scheduler and Rau.

Common Options and Environment

TermMeaning
--gpu-nameDriver target GPU option.
--host-archHost architecture option used when producing the host object.
--host-osHost operating-system option used when producing the host object.
--opt-level / -OOptimization level controlling the pass pipeline.
--lineinfoRequests line-number information when input debug information exists.
--device-debug / -gRequests device debug information when input debug information exists.
--sanitizeEnables supported sanitizer mode.
CUDA_ROOT, CUDA_HOME, CUDA_PATHEnvironment variables used to locate CUDA tools when needed.

Reading Notes

Operation names are written in backticks, for example nv_tileas.async.pipeline.produce_one. Dialect names are also written in backticks. Pseudocode uses C-like syntax but is descriptive rather than ABI-exact unless the page explicitly says otherwise.