Program Layout

Abstract

Tileiras is a single ELF executable with the entire MLIR-and-LLVM compiler statically linked inside it. Its segments are conventional - .text, .rodata, .data, .bss, plus the usual support sections - but the content of each segment is structured by subsystem boundary. The .text segment is partitioned into bands that correspond to the driver, the bytecode reader, dialect implementations, conversion patterns, the scheduler, codegen, and the LLVM and NVPTX libraries. The .rodata segment carries dialect string pools, pass and diagnostic strings, the bitcode writer's tag tables, and the XOR-3 obfuscated NVPTX mnemonic pool used by the asm printer. The .data segment carries cl::opt globals, dialect registration tables, and the encoded mnemonic pool that the printer walks at runtime. The .bss segment carries the singletons every subsystem accumulates: the StorageUniquer hash tables, the TypeID Meyers-cache slots, the registered dialect instances, and the per-thread caches reached through TLS. This page describes what lives in each band and why each band has the shape it does. Addresses are deliberately omitted; reimplementers care about the structure, not the offsets.

Identity

| Property | Value |
| --- | --- |
| Tool role | CUDA TileIR optimizing assembler |
| CUDA release | 13.1 |
| Toolkit banner | Cuda compilation tools, release 13.1, V13.1.80 |
| LLVM lineage | Internal LLVM mainline snapshot identifying as LLVM21.0.0git |
| Input format | TileIR MLIR bytecode |
| Primary output | Host relocatable object containing compiled GPU code |
| Default output name | elf.o |
| Default GPU family | Blackwell-family target, normally sm_100 |

Tileiras is not a C++ frontend. It does not parse CUDA C++, instantiate templates, or generate host stubs. It consumes serialized MLIR, lowers it through the TileIR dialect stack, and produces PTX text that is then assembled by ptxas.

.text Bands

Code is grouped by subsystem; each band is a contiguous region containing the functions of one cohesive responsibility. The order, top to bottom, roughly tracks the runtime path through the compiler.

| Band | Contents |
| --- | --- |
| Driver text | Command-line parsing, target validation, CUDA-toolkit discovery, subprocess harness, file emission. |
| Bytecode reader text | Container header parsing, section walkers, varint and string-table reconstruction, operation/region rebuilder, post-load verifier driver. |
| Dialect-registration text | The register_operations, register_types, register_attributes, and register_interfaces bodies for each of the nine dialects, plus the per-operation verify, fold, print, and parse hooks. |
| Lowering text | The populate_*_patterns functions and pattern bodies for every conversion edge in the TileIR stack; ConversionTarget configuration; type converters; full and partial conversion drivers. |
| Scheduler text | RRT construction and probing, MII-bound computation, group placement, modulo-schedule solve, pipe/mutex materialization. |
| Codegen text | MLIR-to-LLVM translation, libdevice link logic, LLVM IR pipeline driver, NVPTX selection-DAG driver, MachineIR verifier hooks, asm printer. |
| LLVM/NVPTX library text | The statically linked LLVM core, LLVM IR analysis and transform passes, the NVPTX backend, the SelectionDAG infrastructure, and the asm-printer support. |
| C++ runtime and libstdc++ text | The standard-library bodies pulled in by the static link (the std::sort introsort, hash-table primitives, allocator support, the exception runtime). |

Each band has a stable internal shape: a small number of public entry points called by neighbouring bands, surrounded by a much larger population of pattern bodies and helper functions called only within the band. The lowering band is the heaviest; the scheduler band is the densest in algorithmic content per byte.
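
The MII-bound computation named in the scheduler band can be illustrated with the textbook modulo-scheduling bound: the larger of a resource-constrained bound and a recurrence-constrained bound. The sketch below is that generic formulation, not a reconstruction of the Tileiras scheduler; the types ResourceDemand and Recurrence and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical inputs; the real scheduler's data structures are not described here.
struct ResourceDemand {
  std::uint32_t uses;   // slots of this resource consumed by one loop iteration
  std::uint32_t units;  // identical copies of the resource available per cycle
};

struct Recurrence {
  std::uint32_t latency;   // total latency around one dependence cycle
  std::uint32_t distance;  // number of iterations the cycle spans (>= 1)
};

inline std::uint32_t ceilDiv(std::uint32_t a, std::uint32_t b) {
  return (a + b - 1) / b;
}

// Resource bound: the busiest resource limits how often an iteration can start.
inline std::uint32_t resourceMII(const std::vector<ResourceDemand> &demands) {
  std::uint32_t bound = 1;
  for (const ResourceDemand &d : demands) {
    assert(d.units > 0 && "a used resource needs at least one unit");
    bound = std::max(bound, ceilDiv(d.uses, d.units));
  }
  return bound;
}

// Recurrence bound: each dependence cycle must fit in distance * II cycles.
inline std::uint32_t recurrenceMII(const std::vector<Recurrence> &cycles) {
  std::uint32_t bound = 1;
  for (const Recurrence &c : cycles) {
    assert(c.distance > 0 && "a recurrence spans at least one iteration");
    bound = std::max(bound, ceilDiv(c.latency, c.distance));
  }
  return bound;
}

// The minimum initiation interval is the larger of the two lower bounds.
inline std::uint32_t minimumII(const std::vector<ResourceDemand> &demands,
                               const std::vector<Recurrence> &cycles) {
  return std::max(resourceMII(demands), recurrenceMII(cycles));
}
```

With demands of 6 uses on 2 units and 5 uses on 1 unit, the resource bound is 5; a single recurrence of latency 7 over distance 2 gives a recurrence bound of 4, so the sketch reports a minimum initiation interval of 5.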

.rodata Bands

Read-only data is structured by purpose. The largest bands carry strings; the smaller bands carry the constant tables that drive the dispatch machinery.

| Band | Contents |
| --- | --- |
| Per-dialect mnemonic pools | The operation, type, and attribute mnemonics for each dialect, kept in registration order. The bytecode reader and the printer both index into these pools through OperationName. |
| Per-pass diagnostic strings | The text of every pass-emitted diagnostic, plus the verifier strings used by verify hooks. |
| Conversion-pattern descriptors | The static descriptor structures (OpRewritePattern / OpConversionPattern instances) for the lowering patterns, including their root mnemonics, benefits, and source-dialect tags. |
| NVPTX printer string table | The PTX mnemonic and operand-format strings consumed by the asm printer. These are stored XOR-3 encoded (each byte XORed with three) and decoded on first use; the encoded form keeps the readable PTX vocabulary out of the output of the strings utility at negligible runtime cost. |
| Bitcode writer string blob | The fixed blob of attribute, type, and metadata tag names that LLVM's bitcode writer would otherwise embed in every output; here it is interned into one rodata region and referenced by offset. |
| cl::opt description text | The help text and default-value descriptions for every command-line option, including the LLVM-inherited options and the Tileiras-specific knobs. |
| C++ typeinfo and vtables | The __cxxabiv1 typeinfo nodes and the vtables for every polymorphic class - dialects, passes, patterns, conversion-target objects, the diagnostic engine, and so on. |
| libdevice bitcode blob | The bundled libdevice bitcode for the supported target families, embedded as a rodata blob and parsed once during each compilation that Tileiras performs. |

Each rodata band has a single dominant access pattern: mnemonic pools are read by hash lookup, descriptor arrays are walked sequentially during pass registration, vtables are indexed by slot, and the libdevice blob is read end-to-end exactly once per compilation.
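
The XOR-3 encoding of the NVPTX printer string table can be sketched in a few lines. This is a minimal illustration assuming only what the table states (each byte XORed with three, decoded on first use); the names xor3 and LazyMnemonic are hypothetical, and the real pool's entry layout is not described on this page.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// XOR every byte with 3. The operation is its own inverse, so the same
// routine both encodes and decodes.
inline std::string xor3(std::string_view bytes) {
  std::string out(bytes);
  for (char &c : out)
    c = static_cast<char>(static_cast<unsigned char>(c) ^ 0x03u);
  return out;
}

// Hypothetical pool entry: keeps a view of the encoded bytes and decodes
// them the first time the text is requested.
class LazyMnemonic {
 public:
  explicit LazyMnemonic(std::string_view encoded) : encoded_(encoded) {}

  const std::string &get() {
    if (!decoded_) {
      text_ = xor3(encoded_);
      decoded_ = true;
    }
    return text_;
  }

 private:
  std::string_view encoded_;  // must outlive this object
  std::string text_;          // decoded working copy
  bool decoded_ = false;
};
```

Because the transform is an involution, a build step could produce the encoded pool with the same function the printer uses to decode it. The decoded copy is cached, so the XOR pass runs at most once per entry.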

.data Bands

Writable but statically initialised data carries the global state that registration and option handling populate at startup.

| Band | Contents |
| --- | --- |
| cl::opt globals | The mutable storage for every command-line option (boolean flags, integers, enums, paths). Parsed values land here. |
| Dialect-registration tables | The static arrays the dialect registrar walks during register_operations / register_types. Their entries point into the .rodata mnemonic pools and into the .text hook functions. |
| Pass-registration tables | The list of static PassRegistration and PassPipelineRegistration entries that the pipeline builder consults to assemble each opt-level pipeline. |
| XOR-3 mnemonic walking-cipher pool | The runtime working copy of the NVPTX mnemonic table the printer decodes on first reference. The pool is initialised from rodata and walked entry-by-entry as PTX is emitted. |
| Global constant initialisers | Initialised static objects for the LLVM core, including the LLVM context manager, the target registry, and the intrinsic table. |

The data segment is small relative to text and rodata. Most truly mutable state lives in .bss; the data segment exists primarily to give registration tables and cl::opt storage a stable load-time address.
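
The shape of a dialect-registration table, with entries that pair a mnemonic string with a hook function and a registrar that walks them, can be sketched as follows. Everything here is hypothetical (OpTableEntry, VerifyHook, OpRegistry, and the demo.* mnemonics); it shows the table-plus-walker pattern, not the actual MLIR registration API. Where a real toolchain places such a table also depends on its const-ness and the relocation model.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <unordered_map>

// Hypothetical hook signature: checks an operand count.
using VerifyHook = bool (*)(int operandCount);

// One table entry: a pointer to a mnemonic string (read-only data) and a
// pointer to a hook function (code).
struct OpTableEntry {
  const char *mnemonic;
  VerifyHook verify;
};

inline bool verifyBinary(int operandCount) { return operandCount == 2; }
inline bool verifyUnary(int operandCount) { return operandCount == 1; }

// Statically initialised table that the registrar walks at startup.
static const OpTableEntry kDemoOps[] = {
    {"demo.add", &verifyBinary},
    {"demo.neg", &verifyUnary},
};

// Hypothetical registrar: builds a name-to-entry lookup from a static table.
class OpRegistry {
 public:
  template <std::size_t N>
  void registerOperations(const OpTableEntry (&table)[N]) {
    for (const OpTableEntry &entry : table)
      byName_.emplace(entry.mnemonic, &entry);
  }

  const OpTableEntry *lookup(std::string_view name) const {
    auto it = byName_.find(name);
    return it == byName_.end() ? nullptr : it->second;
  }

 private:
  std::unordered_map<std::string_view, const OpTableEntry *> byName_;
};
```

The registry stores pointers into the table rather than copies, which is why the table needs a stable load-time address for the life of the process.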

.bss Bands

Zero-initialised state is where every subsystem keeps its singletons and lazy caches. These are the structures that grow during a compilation and are not meant to outlive the process.

| Band | Contents |
| --- | --- |
| StorageUniquer hash tables | The hash-set storage that uniques types and attributes. One bucket array per uniquer kind; populated lazily on first lookup. |
| TypeID Meyers-cache singletons | The static local slots used by mlir::TypeID::get<T>() to give each concrete type a stable identity. Each instantiation lands in its own zero-initialised slot guarded by a one-time-init flag. |
| Dialect-singleton storage | The single Dialect instance per dialect kind, owned by the MLIRContext once registration runs. |
| Operation-name registry | The mnemonic-to-AbstractOperation lookup table, populated by dialect registration and read on every operation construction. |
| Pass and pattern statistics | The counter slots that pass and pattern implementations bump for Statistic reporting. |
| Per-thread caches | The TLS slots used by the diagnostic engine, the pass timer, the threaded pass manager, and the pattern applicator's local rewrite buffers. |
| LLVM context state | The LLVMContext singletons and intrinsic caches the codegen path relies on. |

Every entry in the .bss bands is logically owned by exactly one subsystem and is reset (or simply discarded) between compilations of independent modules.
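
The Meyers-cache idea behind the TypeID slots can be shown with a function-local static inside a template: each instantiation gets its own slot, and the slot's address serves as the identity. This is a simplified sketch with a hypothetical TypeId class and nextSerial helper, not the mlir::TypeID implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper that hands out a fresh serial number on each call.
inline unsigned nextSerial() {
  static std::atomic<unsigned> counter{0};
  return counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical stand-in for a type-identity handle.
class TypeId {
 public:
  template <typename T>
  static TypeId get() {
    // One slot per instantiation of get<T>(). The initialiser is a function
    // call, so the compiler guards it to run once, on the first call for
    // this T (thread-safe since C++11).
    static const unsigned slot = nextSerial();
    return TypeId(&slot);
  }

  friend bool operator==(TypeId a, TypeId b) { return a.slot_ == b.slot_; }
  friend bool operator!=(TypeId a, TypeId b) { return a.slot_ != b.slot_; }

 private:
  explicit TypeId(const unsigned *slot) : slot_(slot) {}
  const unsigned *slot_;  // the address of the slot is the identity
};
```

Comparing identities is a pointer comparison, so TypeId::get<int>() equals itself on every call and differs from TypeId::get<float>(). No central registry is consulted after the first call for a given type.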

Data Lifetimes

Three lifetime classes span the bands above. Knowing which lifetime a piece of data belongs to is what keeps the layout coherent across a long compilation.

| Lifetime | Data | Ends when |
| --- | --- | --- |
| Static-init lifetime | .rodata pools, .data registration tables, vtables, libdevice blob, cl::opt defaults | Never; constructed during process startup. |
| Compile-unit lifetime | Bytecode buffer, MLIR module, dialect operations, scheduler analyses, LLVM module, libdevice working copy, PTX text | The host relocatable for the unit has been written. |
| Per-pass lifetime | Pattern applicator state, rewrite buffers, conversion target snapshots, diagnostic temporaries | The pass manager finishes the pass. |

Keeping these lifetimes separate matters in practice. cuda_tile operations are not meaningful after their conversion edge has run; scheduler analyses are not meaningful after lowering reaches LLVM; LLVM MachineIR is not meaningful before instruction selection; the XOR-3 mnemonic working copy is not meaningful before the asm printer is invoked.

Reimplementation Notes

A compatible implementation does not have to reproduce the executable's segment layout. It does have to reproduce the contracts that the layout exists to support: a stable static-init order for dialect, type, attribute, and pass registration; a uniquer that returns the same value for equal operands across the process; a .rodata-style discipline that keeps mnemonic pools, diagnostic strings, and printer tables immutable; and a .bss-style discipline that scopes all growth-only structures to the compilation that allocated them. The detailed pages under each subsystem describe these contracts as algorithms and invariants rather than as offsets.
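
The uniquer contract, returning the same value for equal operands, can be illustrated with a small hash-keyed cache. The sketch below is hypothetical (IntTypeStorage and IntTypeUniquer are illustrative names) and covers one storage kind only; it is not the StorageUniquer interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical storage for an integer type, keyed by width and signedness.
struct IntTypeStorage {
  unsigned width;
  bool isSigned;
};

// Hypothetical uniquer: equal keys always yield the same storage pointer.
class IntTypeUniquer {
 public:
  const IntTypeStorage *get(unsigned width, bool isSigned) {
    // Pack both fields into one integer so distinct operands never collide.
    const std::uint64_t key =
        (static_cast<std::uint64_t>(width) << 1) | (isSigned ? 1u : 0u);
    auto it = storage_.find(key);
    if (it == storage_.end()) {
      // First request for this key: allocate once and remember the pointer.
      it = storage_
               .emplace(key, std::make_unique<IntTypeStorage>(
                                 IntTypeStorage{width, isSigned}))
               .first;
    }
    return it->second.get();
  }

  std::size_t size() const { return storage_.size(); }

 private:
  // Heap-allocated values keep their addresses when the table rehashes.
  std::unordered_map<std::uint64_t, std::unique_ptr<IntTypeStorage>> storage_;
};
```

Because callers receive a stable pointer, equality of two uniqued values reduces to pointer equality, which is the property the rest of the compiler relies on when comparing types and attributes.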

For the reader-side view — file identity, the tools the wiki was produced with, and a recipe for verifying any individual wiki claim against the binary — see Binary Anatomy and RE Methodology.