Reading Map

This page collects curated reader paths. Each path is an ordered sequence of pages, and each entry carries a one-sentence rationale for why it sits where it does. Use these paths when your question is "I want to understand X: what do I read, and in what order?" rather than browsing the SUMMARY.

Driver and Integration Path

For running tileiras, embedding it, or diagnosing a driver failure:

Driver Overview — what the binary does and which public entry points exist.
Main Entry — how main() builds the configuration and dispatches the four phases.
Program Handle — the 104-byte handle threaded through create / compile / get-output / release.
CLI Options — the option surface, separating user-facing flags from internal cl::opt plumbing.
Env Vars and Runtime Gates — environment-driven knobs that bypass the CLI.
Host Launch and ptxas Knobs — how the driver shells out to ptxas.
ptxas Handoff Protocol — the exact PTX surface ptxas accepts.
Position in nvcc 13.1 — where tileiras fits in the larger CUDA toolchain.

Bytecode Producer Path

For producing valid TileIR bytecode that tileiras will accept:

MLIR Bytecode Format — the container grammar and section layout.
Dialect Reader/Writer Status — which dialects have custom bytecode readers and what coverage looks like.
AsmPrinter Status — printer-side companion (the textual round-trip is partial).
cuda_tile Overview — the public input dialect.
cuda_tile Op Roster — every op the public surface accepts.
cuda_tile Types and Attrs — types and attributes those ops use.
cuda_tile Verifiers — what gets checked at parse time.
TypeID Sentinel Table — lookup table for when you need the exact identity of a sentinel.

Dialect Lowering Chain

For understanding how the IR cascades from public input to LLVM:

cuda_tile — public tile-compute surface.
cuda_tile to tileaa — first conversion: introduce alias awareness.
nv_tileaa — alias-aware memory, tokens, queues.
tileaa to tileas — second conversion: make scheduling explicit.
nv_tileas — operational async-scheduling dialect.
cute — target-neutral layout algebra.
cute_nvgpu — NVIDIA architecture atoms (MMA, TMA, tcgen05).
cutlass — pipeline scheduler, sequence barriers, persistent kernels.
tileas to LLVM — final MLIR-side conversion.
cute and cute_nvgpu to LLVM — atom lowering to LLVM intrinsics.
nvgpu and gpu to NVVM — bridge to PTX-facing dialect.
Lowering Overview — top-down summary tying these conversions together.

Scheduler Deep-Dive

For understanding how TileAS turns dependence graphs into placed schedules:

Scheduler Overview — the two-pass GenerateSchedule / MaterializeSchedule split.
Schedule Constraint Attributes — the nine tileas.schedule.constraint.* attributes that drive placement.
Resource Constraint Builder and RRT — how per-op footprints become RRT bits.
Modulo Scheduler and Rau — the modulo-scheduling exemplar (read this one carefully).
Modulo Driver and 4-Arm OR-Chain — the four placement arms (PERMUTE / FUSE / RETRY / CBS).
Serial vs Cost-Based Generators — the two generator implementations and when each fires.
Schedule::solve and Cost Evaluators — the materialization algorithm.
Pipe and Mutex Value Layout — the IR-visible coordination values.
Buffer Assignment and Named Barriers — the 32-slot named-barrier pool and how Mutex_ values consume it.
Blackwell Pipeline 15-Slot Model — the target pipeline model the scheduler reasons against.

TileAS Pass Families

For the roster of passes, grouped by family, that run on nv_tileas IR:

Async/Pipeline Family — MaterializeSchedule, AUS vs AWS, agent materialization.
Layout and Buffer Family — layout assignment, slicing, and shared-memory handoffs.
TMA and Memops Family — TMA-descriptor and bulk-copy lowering.
CTA Cluster Family — cluster geometry, DynamicPersistent, PlanCTA, PrepareForScheduling, ResolveAgentBoundary.
Scheduling Glue — the small passes wiring schedule data into surrounding IR.

Codegen Deep-Dive

For the NVPTX backend that consumes the lowered LLVM IR:

Codegen Overview — pipeline shape from LLVM IR to PTX.
NVPTX Bring-up and Target Init — how the target gets registered and initialized.
NVPTX Subtarget and Feature Matrix — per-SM feature gating.
NVPTX Target Lowering, Call and Args — calling convention, parameter space, byval handling.
ISelDAG and MatcherTable — DAG-to-DAG instruction selection.
Per-SM Emission Templates — emission templates parameterised by SM tier.
AsmPrinter Monster and Windows — final PTX text emission.
tcgen05, WGMMA, mbarrier, Cluster — emission of the Blackwell-era instruction families.
TMA, Tensormap and cp.async.bulk — TMA-descriptor emission.
ldmatrix, stmatrix and Register Class Vtables — matrix-fragment movement.

NVPTX Custom Pass Family

For the NVIDIA-private passes layered onto the NVPTX backend:

NVPTX Backend Passes Overview — pipeline position and shared state.
Kernel, CDP, Inline, Pretreat — entry-side stamping and inline forcing.
Lower-Args, Aggr, Struct — byval lowering and parameter-space pointer materialization.
MemorySpaceOpt and process-restrict — concrete address-space inference and noalias scope generation.
Printf Lowering and vprintf — printf-to-vprintf rewrite.
DeadSyncElim and CommonBaseElim — barrier removal and SCEV-keyed GEP CSE.
Peephole MIR and Image Handles — post-ISel MIR rewriting.
NVVMIRVerifier — kernel-ABI invariants enforced before backend handoff.

libdevice and NVVM Reflect

For modules that link against libdevice math functions:

libdevice Overview — the bitcode library and what it covers.
NVVMReflect Mechanism — how compile-time reflect calls get resolved.
Intrinsic ID Switch and Name Table — __nv_* name to intrinsic ID mapping.
Math Pass Pipeline and Crosswalk — pass ordering around the math expansion.

MLIR Infrastructure Tour

For the MLIR-side mechanics referenced by dialect and lowering pages:

MLIR Infra Overview — what the infra layer covers.
Operation Layout — the 48+ byte Operation record and its slots.
StorageUniquer and Context Impl — type and attribute uniquing.
Pattern Vtables and Shapes — rewrite-pattern shapes and dispatch.
Interface Vtables — op and type interface mechanics.
TypeID Sentinels and Anchors — how TypeIDs are interned and addressed.
Container Fingerprints — recognizing MLIR container shapes in the binary.
Diagnostic ABI and Helpers — diagnostic emission, severity packing.
AsyncValue and BLAKE3 Interning — the 808-byte AsyncValue record backing Pipe_ / Mutex_.

OSS Comparison Tour

For comparing tileiras against the public cuda-tile repository:

OSS Comparison Overview — what the public tree covers vs what tileiras adds.
cuda_tile Tree Mapping — file-by-file mapping between public source and tileiras behavior.
.td Files Delta — TableGen differences.
Transforms, FuseFMA, SynthDbg — public transform passes and where they live in tileiras.

Cross-cutting Infra

For low-level mechanics referenced from multiple pages:

Topic	Page
Data section decryption	Data Section Decryption
Vtable banks	Binary Vtable Banks and Static Ctors
Threading	Threading and Synchronization
Allocators	Allocator BumpPtr and Slab Sizes
String mechanics	Twine, StringRef, format
Diagnostic helpers	Diagnostic Helpers
GlobalValue flags	GlobalValue Flag Bits

End-to-End Reimplementation Path

For a single linear read through every contract you must reproduce:

index
  -> binary-layout
  -> boundaries/nvcc-13-1-position
  -> pipeline/overview
  -> bytecode/mlir-bc-format
  -> dialects/cuda_tile/overview
  -> lowering/cuda-tile-to-tileaa
  -> dialects/nv_tileaa/overview
  -> lowering/tileaa-to-tileas
  -> dialects/nv_tileas/overview
  -> passes/tileas/scheduling-glue
  -> scheduler/overview
  -> scheduler/modulo-scheduler-and-rau
  -> lowering/tileas-to-llvm
  -> codegen/overview
  -> nvptx-passes/overview
  -> libdevice/overview

Then return to the detailed operation, verifier, and pass-family pages for the subsystem you are implementing.

