Subsystem Function Map

Abstract

Tileiras is a single executable, but it behaves as a stack of cooperating subsystems with a fairly rigid call-graph shape. This page is the function map for that stack at the role level: for each subsystem, the conceptual entry points it exposes, the callers it answers to, the callees it dispatches into, and the canonical wiki page where the mechanism is described in depth. The page is meant to answer the question "if I want to find where X happens in this compiler, where do I look?" without requiring a reader to know any internal address.

Call-Graph Shape at the Subsystem Level

The driver is the only entry. It builds a compile configuration, opens the input file, and calls into the bytecode reader. The reader returns an MLIR module rooted in cuda_tile. The driver hands that module, together with the configuration, to the pipeline builder. The pipeline builder asks the dialect registry for the operations and types it will see, asks the lowering subsystem for the conversion patterns it will use, and asks the scheduler for the analysis passes it must register. Pass execution then walks the registered passes in order: dialect-to-dialect lowering rewrites operations in place; the scheduler runs as an embedded phase that produces a ScheduleAnalysis without changing the IR; subsequent materialization passes consume that analysis to emit pipe and mutex operations. After the IR reaches the LLVM/NVVM dialects, the codegen subsystem translates it to an llvm::Module, links libdevice bitcode, runs the LLVM IR pipeline, and hands the result to the NVPTX backend. The backend produces PTX text. The driver then invokes ptxas as a subprocess, captures its object output, and writes the final host relocatable.

In short: driver -> bytecode -> dialects (via the registry) -> lowering -> scheduler (inside lowering) -> codegen -> libdevice link -> NVPTX backend -> ptxas. The MLIR infrastructure subsystem is orthogonal to that chain; every layer above calls into it for operations, types, attributes, regions, contexts, diagnostics, pattern rewriting, and pass management.
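
The chain above can be modeled as an ordered list of stages, each consuming the previous stage's output. The sketch below is illustrative only: the stage order follows the prose, while `make_stage`, `run_chain`, `CHAIN`, and the descriptive artifact strings are hypothetical names that do not correspond to tileiras symbols.

```python
from typing import Callable, List, Tuple

# Hypothetical model: a stage is (name, function from artifact label to artifact label).
Stage = Tuple[str, Callable[[str], str]]


def make_stage(name: str, produces: str) -> Stage:
    """Build a toy stage that ignores its input and returns a fixed artifact label."""
    def run(_artifact: str) -> str:
        return produces
    return (name, run)


# Order taken from the prose; the artifact labels are descriptive strings only.
CHAIN: List[Stage] = [
    make_stage("bytecode reader", "cuda_tile MLIR module"),
    make_stage("lowering (scheduler embedded)", "LLVM/NVVM dialect module"),
    make_stage("codegen", "llvm::Module"),
    make_stage("libdevice link", "linked llvm::Module"),
    make_stage("NVPTX backend", "PTX text"),
    make_stage("ptxas", "object output"),
]


def run_chain(initial: str, chain: List[Stage]) -> List[Tuple[str, str, str]]:
    """Run the stages in order and return (stage, input, output) for each step."""
    trace = []
    artifact = initial
    for name, fn in chain:
        produced = fn(artifact)
        trace.append((name, artifact, produced))
        artifact = produced
    return trace


if __name__ == "__main__":
    for name, src, dst in run_chain("TileIR bytecode file", CHAIN):
        print(f"{name}: {src} -> {dst}")
```

Each stage's input is the previous stage's output, which is the property that makes the call graph "rigid": no stage reaches backwards in the chain.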

Driver

The driver owns process-level behavior: command-line parsing, target validation, CUDA tool discovery (ptxas, fatbinary), output naming, subprocess invocation via posix_spawn, and error reporting. libdevice is not discovered here; it is embedded in the tileiras binary as _mlir_embedded_libdevice.

| Conceptual entry point | Role |
| --- | --- |
| parse_command_line | Read argv into a structured compile configuration; reject malformed flags. |
| resolve_cuda_installation | Resolve CUDA_HOME / CUDA_PATH to find the ptxas and fatbinary binaries; the libdevice bitcode is embedded in the tileiras binary itself, not loaded from the toolkit. |
| validate_target | Check that the requested GPU architecture is supported, emitting invalid GPU architecture: <name> on rejection. |
| open_input_module | Map the input file and hand its buffer to the bytecode reader. |
| run_compilation | Drive the pipeline against the parsed module and the configuration. |
| invoke_ptxas | Spawn ptxas via posix_spawn with the derived option set and capture its output. |
| invoke_fatbinary | Spawn fatbinary to package one or more .cubin outputs into a fat binary container. |
| report_error | Format a subprocess or pipeline failure as a user-facing diagnostic. |

Callers: process entry. Callees: bytecode reader, pipeline builder, codegen, subprocess harness (ptxas, fatbinary). See Driver Overview, CLI Options, Subprocess Harness.
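
As a rough illustration of the tool-discovery and target-validation roles, the sketch below resolves a toolkit root from the environment and rejects an unsupported architecture with the diagnostic text quoted above. Several details are assumptions rather than documented behavior: the lookup order (CUDA_HOME before CUDA_PATH), the `bin/` subdirectory layout, and the contents of `SUPPORTED_ARCHS`. The function names `resolve_cuda_root` and `tool_path` are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumption: an illustrative set, not the real list of supported architectures.
SUPPORTED_ARCHS = frozenset({"sm_90", "sm_100"})


def resolve_cuda_root(env: Mapping[str, str]) -> Optional[Path]:
    """Return the toolkit root from CUDA_HOME or CUDA_PATH, or None if unset.

    Assumption: CUDA_HOME is consulted first; the page does not state the order.
    """
    for var in ("CUDA_HOME", "CUDA_PATH"):
        value = env.get(var)
        if value:
            return Path(value)
    return None


def tool_path(root: Path, tool: str) -> Path:
    """Assumed layout: toolkit executables live under <root>/bin."""
    return root / "bin" / tool


def validate_target(arch: str) -> None:
    """Raise with the diagnostic text from the table if arch is unsupported."""
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(f"invalid GPU architecture: {arch}")


if __name__ == "__main__":
    root = resolve_cuda_root(os.environ)
    print("ptxas:", tool_path(root, "ptxas") if root else "CUDA toolkit not found")
```

Passing the environment as a mapping rather than reading `os.environ` inside the function keeps the lookup testable without touching process state.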

Bytecode Reader

The reader owns the on-disk TileIR format. It validates the container, reconstructs the string and attribute tables, and rebuilds MLIR operations and regions in memory.

| Conceptual entry point | Role |
| --- | --- |
| read_container_header | Check the bytecode magic and version, locate the section table. |
| read_string_section | Restore the interned string pool from its compressed form. |
| read_dialect_section | Resolve each referenced dialect against the dialect registry. |
| read_type_and_attribute_sections | Reconstruct uniqued type and attribute values. |
| read_operation_section | Walk the operation stream, building regions and blocks. |
| verify_module | Invoke the MLIR verifier on the reconstructed module. |

Callers: driver open_input_module. Callees: dialect registry, MLIR infrastructure (storage uniquer, operation builder). See MLIR Bytecode Format.
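
The header check is the simplest of these steps to illustrate. The sketch below validates a magic number and version and returns the section-table offset. The byte layout, the `MAGIC` value, and the accepted version range are all hypothetical placeholders; the actual TileIR container format is described on the MLIR Bytecode Format page, not here.

```python
import struct
from typing import NamedTuple

# Hypothetical values: the real magic and version range are not given on this page.
MAGIC = b"TIR\x00"
SUPPORTED_VERSIONS = range(1, 3)  # accepts versions 1 and 2


class Header(NamedTuple):
    version: int
    section_table_offset: int


def read_container_header(buf: bytes) -> Header:
    """Validate magic and version, then report where the section table starts.

    Assumed layout: 4-byte magic, u32 version, u32 section-table offset,
    all little-endian. This is a stand-in, not the TileIR layout.
    """
    if len(buf) < 12:
        raise ValueError("truncated header")
    if buf[:4] != MAGIC:
        raise ValueError("bad magic")
    version, offset = struct.unpack_from("<II", buf, 4)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version {version}")
    if offset > len(buf):
        raise ValueError("section table offset out of bounds")
    return Header(version, offset)


if __name__ == "__main__":
    sample = MAGIC + struct.pack("<II", 2, 12) + b"payload"
    print(read_container_header(sample))
```

Rejecting the container before any section is parsed means later readers can assume a well-formed section table.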

Dialect Stack

The dialect subsystem registers operations, types, attributes, interfaces, verifiers, folders, and printers for every dialect the compiler understands. Lookup goes through OperationName and RegisteredOperationName, which together form the bridge between a mnemonic and its implementation.

| Dialect | Conceptual role | Page |
| --- | --- | --- |
| cuda_tile | Public tile input dialect; what the bytecode actually contains. | cuda_tile Overview |
| nv_tileaa | Alias-aware memory, token, queue, and pointer layer. | nv_tileaa Overview |
| nv_tileas | Operationally-scheduled async memory and TMA layer. | nv_tileas Overview |
| cute | Target-neutral layout algebra. | cute Overview |
| cute_nvgpu | NVIDIA-architecture atom registry. | cute_nvgpu Overview |
| cutlass | CUTLASS pipeline and tile-scheduler abstractions. | cutlass Overview |
| nvgpu | Stock MLIR GPU bridge layer. | nvgpu Overview |
| nvvm | PTX-facing intrinsic dialect. | NVVM Overview |

Each dialect exposes the same role-level entry points: register_operations, register_types, register_attributes, register_interfaces, and per-operation verify / fold / print / parse hooks. Callers: bytecode reader (to resolve mnemonics on load), lowering (to identify source-dialect operations), MLIR infrastructure (to build operations and unique types).
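
The mnemonic-to-implementation bridge can be pictured as a table keyed by the fully qualified operation name. The following is a minimal sketch under that assumption; `OpHooks`, `DialectRegistry`, and the operation name `cuda_tile.example_op` are hypothetical and only model the idea of a per-operation hook table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class OpHooks:
    """Hypothetical per-operation hook table (only verify, for brevity)."""
    verify: Callable[[dict], bool]


@dataclass
class DialectRegistry:
    """Toy registry mapping 'dialect.mnemonic' to its hooks."""
    ops: Dict[str, OpHooks] = field(default_factory=dict)

    def register_operation(self, dialect: str, mnemonic: str, hooks: OpHooks) -> None:
        name = f"{dialect}.{mnemonic}"
        if name in self.ops:
            raise KeyError(f"duplicate operation {name}")
        self.ops[name] = hooks

    def lookup(self, name: str) -> Optional[OpHooks]:
        """Return the hook table, or None for an unregistered mnemonic."""
        return self.ops.get(name)


if __name__ == "__main__":
    registry = DialectRegistry()
    # Hypothetical operation: verifies that exactly two operands are present.
    registry.register_operation(
        "cuda_tile", "example_op", OpHooks(verify=lambda op: len(op["operands"]) == 2)
    )
    hooks = registry.lookup("cuda_tile.example_op")
    print(hooks.verify({"operands": ["a", "b"]}))
```

An unregistered name returns no hooks, which is the point where a loader would report an unknown operation.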

Lowering

The lowering subsystem owns dialect-to-dialect conversion. It is a set of pattern populations driven by ConversionTarget and DialectConversion. Each lowering page describes the patterns for one source-to-target hop.

| Conceptual entry point | Role |
| --- | --- |
| populate_<src>_to_<tgt>_patterns | Register the rewrite patterns for one conversion edge. |
| configure_conversion_target | Mark legal and illegal operations for the current hop. |
| build_type_converter | Wire source-dialect types to target-dialect types and addr-space rules. |
| apply_full_conversion | Run the pattern set to fixpoint or fail with a diagnostic. |
| apply_partial_conversion | Run a target-bounded conversion where some operations remain. |

Conversion edges, in order along the main path: cuda_tile -> nv_tileaa -> nv_tileas -> LLVM/NVGPU -> NVVM. See Lowering Overview for the conversion cascade.
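
The difference between full and partial conversion can be shown with a toy model in which operations are plain strings of the form `dialect.op`. This is a simplified sketch, not the MLIR dialect-conversion driver: legality is decided per dialect only, patterns are a dictionary, and the operation names are hypothetical. In MLIR itself a partial conversion still fails on operations explicitly marked illegal; the toy omits that distinction.

```python
from typing import Callable, Dict, List, Set

# A pattern rewrites one operation name into a list of replacement names.
Pattern = Callable[[str], List[str]]


def dialect_of(op: str) -> str:
    return op.split(".", 1)[0]


def apply_conversion(
    ops: List[str],
    patterns: Dict[str, Pattern],
    legal_dialects: Set[str],
    full: bool,
    max_iters: int = 16,
) -> List[str]:
    """Rewrite illegal ops until nothing changes; in full mode, reject leftovers."""
    work = list(ops)
    for _ in range(max_iters):
        out: List[str] = []
        changed = False
        for op in work:
            if dialect_of(op) not in legal_dialects and op in patterns:
                out.extend(patterns[op](op))
                changed = True
            else:
                out.append(op)
        work = out
        if not changed:
            break
    else:
        raise RuntimeError("conversion did not reach a fixpoint")
    illegal = [op for op in work if dialect_of(op) not in legal_dialects]
    if full and illegal:
        raise RuntimeError(f"failed to legalize operation '{illegal[0]}'")
    return work


if __name__ == "__main__":
    # Hypothetical op names for one cuda_tile -> nv_tileaa hop.
    patterns = {"cuda_tile.load": lambda op: ["nv_tileaa.load"]}
    ops = ["cuda_tile.load", "cuda_tile.unhandled"]
    print(apply_conversion(ops, patterns, {"nv_tileaa"}, full=False))
```

With `full=True` the same input raises, because `cuda_tile.unhandled` has no pattern and its dialect is not legal for the hop.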

Scheduler

The scheduler is a phase inside lowering. It does not modify operations; it computes a ScheduleAnalysis (stage, order, and resource placement per operation) that later materialization passes consume.

| Conceptual entry point | Role |
| --- | --- |
| build_constraints | Walk the loop region, build dependence and resource constraints. |
| compute_mii_bounds | Compute resource, recurrence, fine-density, and dependence lower bounds on II. |
| place_groups | Choose an II and place operation groups into the resource reservation table. |
| solve | Run the linear schedule solve over a fixed placement. |
| materialize_pipes_and_mutexes | Emit Pipe_ and Mutex_ values once placement is fixed. |

Callers: the TileAA-to-TileAS conversion edge, plus the warp-specialize pipelines. Callees: MLIR infrastructure (analysis manager, attribute storage). See Scheduler Overview and Modulo Scheduler.
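
For readers unfamiliar with lower bounds on the initiation interval (II), the sketch below computes the two textbook modulo-scheduling bounds: the resource bound (ResMII) and the recurrence bound (RecMII). It is a generic illustration, not the tileiras implementation; the fine-density and dependence bounds named in the table are not modeled, and the resource names `tma` and `mma` and their counts are made up for the example.

```python
from math import ceil
from typing import Dict, List, Tuple


def resource_mii(uses: Dict[str, int], units: Dict[str, int]) -> int:
    """ResMII: max over resources of ceil(uses per iteration / available units)."""
    return max((ceil(uses[r] / units[r]) for r in uses), default=1)


def recurrence_mii(cycles: List[Tuple[int, int]]) -> int:
    """RecMII: max over dependence cycles of ceil(total latency / total distance).

    Each cycle is (sum of latencies, sum of iteration distances), distance >= 1.
    """
    return max((ceil(latency / distance) for latency, distance in cycles), default=1)


def minimum_ii(
    uses: Dict[str, int], units: Dict[str, int], cycles: List[Tuple[int, int]]
) -> int:
    """The schedule's II can be no smaller than the larger of the two bounds."""
    return max(resource_mii(uses, units), recurrence_mii(cycles))


if __name__ == "__main__":
    uses = {"tma": 3, "mma": 4}       # hypothetical uses per loop iteration
    units = {"tma": 1, "mma": 2}      # hypothetical available units
    cycles = [(8, 2), (5, 1)]         # (latency, distance) per dependence cycle
    print(resource_mii(uses, units), recurrence_mii(cycles), minimum_ii(uses, units, cycles))
```

In this example the resource bound is 3 (three `tma` uses on one unit), the recurrence bound is 5 (latency 5 over distance 1), so the search for a feasible II would start at 5.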

Codegen and NVPTX Backend

Codegen is the boundary between MLIR and the LLVM toolchain. It produces an llvm::Module, links libdevice, runs the LLVM IR pipeline, then hands control to the NVPTX backend, which selects instructions and prints PTX.

| Conceptual entry point | Role |
| --- | --- |
| translate_module_to_llvm | Convert the LLVM/NVVM-dialect module into an in-memory llvm::Module. |
| link_libdevice | Resolve libdevice math calls against the bundled bitcode. |
| run_llvm_passes | Run the LLVM IR pipeline (NVVM reflection, address-space opt, arg lowering, etc.). |
| select_nvptx_instructions | Match LLVM IR against the NVPTX matcher table to produce MachineIR. |
| verify_nvptx_machine_ir | Run LLVM's MachineVerifierPass over the NVPTX MachineIR before printing. |
| emit_ptx_text | Print the final PTX module. |

Callers: pipeline driver, after lowering reaches LLVM dialect. Callees: libdevice subsystem, MLIR translation infrastructure. See Codegen Overview, NVPTX Backend Passes.

libdevice

libdevice is the bitcode that supplies math intrinsics. It is embedded as the symbol _mlir_embedded_libdevice inside the tileiras binary (not loaded from the CUDA toolkit), parsed once per compilation, and integrated by the LLVM IR pipeline.

| Conceptual entry point | Role |
| --- | --- |
| load_libdevice_bitcode | Parse the embedded libdevice bitcode (symbol _mlir_embedded_libdevice linked into the tileiras binary) into an llvm::Module. |
| resolve_nvvm_reflect | Replace __nvvm_reflect calls with the compile-time reflection answer. |
| inline_selected_math | Inline the math functions whose bodies are pulled into the kernel module. |
| prune_unused_bodies | Drop libdevice functions that survived as unused declarations. |

Callers: codegen during the LLVM IR pipeline. See libdevice Overview, NVVM Reflect Mechanism.
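
Pruning unused library functions is, at its core, a reachability computation over the call graph starting from the kernel entry points. The sketch below shows that idea on a dictionary-based call graph. It is a generic model, not the LLVM pass that tileiras runs; `helper_reduce` and `kernel` are hypothetical names, while `__nv_sinf`, `__nv_cosf`, and `__nv_expf` are used only as familiar libdevice-style labels.

```python
from typing import Dict, Iterable, List, Set


def reachable(call_graph: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Return every function reachable from the roots via call edges."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue
        seen.add(fn)
        stack.extend(call_graph.get(fn, []))
    return seen


def prune_unused(
    call_graph: Dict[str, List[str]], roots: Iterable[str]
) -> Dict[str, List[str]]:
    """Keep only functions that some root can reach."""
    keep = reachable(call_graph, roots)
    return {fn: callees for fn, callees in call_graph.items() if fn in keep}


if __name__ == "__main__":
    graph = {
        "kernel": ["__nv_sinf"],
        "__nv_sinf": ["helper_reduce"],   # helper_reduce is a hypothetical internal helper
        "helper_reduce": [],
        "__nv_cosf": ["helper_reduce"],
        "__nv_expf": [],
    }
    print(sorted(prune_unused(graph, ["kernel"])))
```

Only the functions the kernel transitively calls survive; `__nv_cosf` and `__nv_expf` are dropped even though one of them shares a helper with a live function.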

MLIR Infrastructure

The infrastructure is the shared runtime model every other subsystem depends on.

| Conceptual entry point | Role |
| --- | --- |
| MLIRContext::getOrLoadDialect | Look up or create a dialect in the current context. |
| StorageUniquer::getOrCreate | Intern a type or attribute storage object. |
| OperationName::resolve | Bind a mnemonic to its registered operation hook table. |
| OpBuilder::create | Build a new operation with operands, results, attributes, and regions. |
| PatternApplicator::matchAndRewrite | Drive pattern matching against an operation. |
| PassManager::run | Execute the configured pass pipeline against a module. |
| Diagnostic::emit | Surface an error or warning attached to a location. |

Callers: every other subsystem in this map. See MLIR Infrastructure Overview, Storage Uniquer, Operation Layout.
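
Interning is the idea behind the storage uniquer: two requests for the same kind and parameters return the same object, so equality reduces to an identity check. The sketch below models that with a dictionary. The class names `TypeStorage` and `Uniquer`, the method `get_or_create`, and the `tile` kind are hypothetical; this does not reproduce the C++ interface.

```python
from typing import Dict, Hashable, Tuple


class TypeStorage:
    """Toy storage object holding a kind tag and its parameters."""

    def __init__(self, kind: str, params: Tuple[Hashable, ...]) -> None:
        self.kind = kind
        self.params = params


class Uniquer:
    """Return one shared storage object per distinct (kind, params) key."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, Tuple[Hashable, ...]], TypeStorage] = {}

    def get_or_create(self, kind: str, *params: Hashable) -> TypeStorage:
        key = (kind, params)
        storage = self._table.get(key)
        if storage is None:
            storage = TypeStorage(kind, params)
            self._table[key] = storage
        return storage


if __name__ == "__main__":
    uniquer = Uniquer()
    a = uniquer.get_or_create("tile", 128, 64, "f16")
    b = uniquer.get_or_create("tile", 128, 64, "f16")
    c = uniquer.get_or_create("tile", 128, 64, "f32")
    print(a is b, a is c)
```

Because identical requests share one object, comparing two types is a pointer comparison rather than a structural walk.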

How to Use This Map

Locate the behavior you want to understand by role: input parsing belongs to the driver and bytecode reader; semantic transformation belongs to lowering; placement decisions belong to the scheduler; libdevice math belongs to libdevice; PTX selection and emission belong to codegen and the NVPTX backend. The pages named after "See" in each subsystem section, and in the Page column of the dialect table, are the canonical entries; each child page drills into one specific mechanism.