Reading Map
This page collects curated reader paths. Each path is an ordered sequence of pages, and each entry carries a one-sentence rationale for its place in the sequence. Use these paths when the question is "I want to understand X: what do I read, and in what order?" rather than browsing the SUMMARY.
Driver and Integration Path
For running tileiras, embedding it, or diagnosing a driver failure:
- Driver Overview — what the binary does and which public entry points exist.
- Main Entry — how `main()` builds the configuration and dispatches the four phases.
- Program Handle — the 104-byte handle threaded through create / compile / get-output / release.
- CLI Options — the option surface, separating user-facing flags from internal `cl::opt` plumbing.
- Env Vars and Runtime Gates — environment-driven knobs that bypass the CLI.
- Host Launch and ptxas Knobs — how the driver shells out to `ptxas`.
- ptxas Handoff Protocol — the exact PTX surface `ptxas` accepts.
- Position in nvcc 13.1 — where tileiras fits in the larger CUDA toolchain.
Bytecode Producer Path
For producing valid TileIR bytecode that tileiras will accept:
- MLIR Bytecode Format — the container grammar and section layout.
- Dialect Reader/Writer Status — which dialects have custom bytecode readers and what coverage looks like.
- AsmPrinter Status — printer-side companion (the textual round-trip is partial).
- cuda_tile Overview — the public input dialect.
- cuda_tile Op Roster — every op the public surface accepts.
- cuda_tile Types and Attrs — types and attributes those ops use.
- cuda_tile Verifiers — what gets checked at parse time.
- TypeID Sentinel Table — the lookup table for when you need the exact identity of a sentinel.
Dialect Lowering Chain
For understanding how the IR cascades from public input to LLVM:
- cuda_tile — public tile-compute surface.
- cuda_tile to tileaa — first conversion: introduce alias awareness.
- nv_tileaa — alias-aware memory, tokens, queues.
- tileaa to tileas — second conversion: make scheduling explicit.
- nv_tileas — operational async-scheduling dialect.
- cute — target-neutral layout algebra.
- cute_nvgpu — NVIDIA architecture atoms (MMA, TMA, tcgen05).
- cutlass — pipeline scheduler, sequence barriers, persistent kernels.
- tileas to LLVM — final MLIR-side conversion.
- cute and cute_nvgpu to LLVM — atom lowering to LLVM intrinsics.
- nvgpu and gpu to NVVM — bridge to PTX-facing dialect.
- Lowering Overview — top-down summary tying these conversions together.
Scheduler Deep-Dive
For understanding how TileAS turns dependence graphs into placed schedules:
- Scheduler Overview — the two-pass GenerateSchedule / MaterializeSchedule split.
- Schedule Constraint Attributes — the nine `tileas.schedule.constraint.*` attributes that drive placement.
- Resource Constraint Builder and RRT — how per-op footprints become RRT bits.
- Modulo Scheduler and Rau — the modulo-scheduling exemplar (read this one carefully).
- Modulo Driver and 4-Arm OR-Chain — the four placement arms (PERMUTE / FUSE / RETRY / CBS).
- Serial vs Cost-Based Generators — the two generator implementations and when each fires.
- Schedule::solve and Cost Evaluators — the materialization algorithm.
- Pipe and Mutex Value Layout — the IR-visible coordination values.
- Buffer Assignment and Named Barriers — the 32-slot named-barrier pool and how Mutex_ values consume it.
- Blackwell Pipeline 15-Slot Model — the target pipeline model the scheduler reasons against.
TileAS Pass Families
For the per-family pass roster running on nv_tileas IR:
- Async/Pipeline Family — MaterializeSchedule, AUS vs AWS, agent materialization.
- Layout and Buffer Family — layout assignment, slicing, and shared-memory handoffs.
- TMA and Memops Family — TMA-descriptor and bulk-copy lowering.
- CTA Cluster Family — cluster geometry, DynamicPersistent, PlanCTA, PrepareForScheduling, ResolveAgentBoundary.
- Scheduling Glue — the small passes wiring schedule data into surrounding IR.
Codegen Deep-Dive
For the NVPTX backend that consumes the lowered LLVM IR:
- Codegen Overview — pipeline shape from LLVM IR to PTX.
- NVPTX Bring-up and Target Init — how the target gets registered and initialized.
- NVPTX Subtarget and Feature Matrix — per-SM feature gating.
- NVPTX Target Lowering, Call and Args — calling convention, parameter space, byval handling.
- ISelDAG and MatcherTable — DAG-to-DAG instruction selection.
- Per-SM Emission Templates — emission templates parameterised by SM tier.
- AsmPrinter Monster and Windows — final PTX text emission.
- tcgen05, WGMMA, mbarrier, Cluster — emission of the Blackwell-era instruction families.
- TMA, Tensormap and cp.async.bulk — TMA-descriptor emission.
- ldmatrix, stmatrix and Register Class Vtables — matrix-fragment movement.
NVPTX Custom Pass Family
For the NVIDIA-private passes layered onto the NVPTX backend:
- NVPTX Backend Passes Overview — pipeline position and shared state.
- Kernel, CDP, Inline, Pretreat — entry-side stamping and inline forcing.
- Lower-Args, Aggr, Struct — byval lowering and parameter-space pointer materialization.
- MemorySpaceOpt and process-restrict — concrete address-space inference and noalias scope generation.
- Printf Lowering and vprintf — printf-to-vprintf rewrite.
- DeadSyncElim and CommonBaseElim — barrier removal and SCEV-keyed GEP CSE.
- Peephole MIR and Image Handles — post-ISel MIR rewriting.
- NVVMIRVerifier — kernel-ABI invariants enforced before backend handoff.
libdevice and NVVM Reflect
For modules that link against libdevice math functions:
- libdevice Overview — the bitcode library and what it covers.
- NVVMReflect Mechanism — how compile-time reflect calls get resolved.
- Intrinsic ID Switch and Name Table — `__nv_*` name to intrinsic ID mapping.
- Math Pass Pipeline and Crosswalk — pass ordering around the math expansion.
MLIR Infrastructure Tour
For the MLIR-side mechanics referenced by dialect and lowering pages:
- MLIR Infra Overview — what the infra layer covers.
- Operation Layout — the 48+ byte `Operation` record and its slots.
- StorageUniquer and Context Impl — type and attribute uniquing.
- Pattern Vtables and Shapes — rewrite-pattern shapes and dispatch.
- Interface Vtables — op and type interface mechanics.
- TypeID Sentinels and Anchors — how TypeIDs are interned and addressed.
- Container Fingerprints — recognizing MLIR container shapes in the binary.
- Diagnostic ABI and Helpers — diagnostic emission, severity packing.
- AsyncValue and BLAKE3 Interning — the 808-byte AsyncValue record backing `Pipe_`/`Mutex_`.
OSS Comparison Tour
For comparing tileiras against the public cuda-tile repository:
- OSS Comparison Overview — what the public tree covers vs what tileiras adds.
- cuda_tile Tree Mapping — file-by-file mapping between public source and tileiras behavior.
- .td Files Delta — TableGen differences.
- Transforms, FuseFMA, SynthDbg — public transform passes and where they live in tileiras.
Cross-cutting Infra
For low-level mechanics referenced from multiple pages:
| Topic | Page |
|---|---|
| Data section decryption | Data Section Decryption |
| Vtable banks | Binary Vtable Banks and Static Ctors |
| Threading | Threading and Synchronization |
| Allocators | Allocator BumpPtr and Slab Sizes |
| String mechanics | Twine, StringRef, format |
| Diagnostic helpers | Diagnostic Helpers |
| GlobalValue flags | GlobalValue Flag Bits |
End-to-End Reimplementation Path
For a single linear read through every contract you must reproduce:
index
-> binary-layout
-> boundaries/nvcc-13-1-position
-> pipeline/overview
-> bytecode/mlir-bc-format
-> dialects/cuda_tile/overview
-> lowering/cuda-tile-to-tileaa
-> dialects/nv_tileaa/overview
-> lowering/tileaa-to-tileas
-> dialects/nv_tileas/overview
-> passes/tileas/scheduling-glue
-> scheduler/overview
-> scheduler/modulo-scheduler-and-rau
-> lowering/tileas-to-llvm
-> codegen/overview
-> nvptx-passes/overview
-> libdevice/overview
Then return to the detailed operation, verifier, and pass-family pages for the subsystem you are implementing.
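
If you track your progress through the linear path programmatically, the order above can be kept as plain data. The following is a minimal sketch, not part of the documentation tooling: the page identifiers are copied from the path above, while the names `READING_ORDER`, `next_page`, and `remaining_after` are hypothetical helpers introduced only for illustration.

```python
# Reading order copied verbatim from the End-to-End Reimplementation Path.
# READING_ORDER, next_page, and remaining_after are hypothetical names.
READING_ORDER = [
    "index",
    "binary-layout",
    "boundaries/nvcc-13-1-position",
    "pipeline/overview",
    "bytecode/mlir-bc-format",
    "dialects/cuda_tile/overview",
    "lowering/cuda-tile-to-tileaa",
    "dialects/nv_tileaa/overview",
    "lowering/tileaa-to-tileas",
    "dialects/nv_tileas/overview",
    "passes/tileas/scheduling-glue",
    "scheduler/overview",
    "scheduler/modulo-scheduler-and-rau",
    "lowering/tileas-to-llvm",
    "codegen/overview",
    "nvptx-passes/overview",
    "libdevice/overview",
]


def next_page(current):
    """Return the page that follows `current`, or None at the end of the path."""
    i = READING_ORDER.index(current)  # raises ValueError for an unknown page
    return READING_ORDER[i + 1] if i + 1 < len(READING_ORDER) else None


def remaining_after(current):
    """Return every page still to be read after `current`, in order."""
    i = READING_ORDER.index(current)
    return READING_ORDER[i + 1:]
```

For example, `next_page("scheduler/overview")` returns `"scheduler/modulo-scheduler-and-rau"`, and `next_page("libdevice/overview")` returns `None` because that page ends the path.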