Driver Overview

Abstract

tileiras is NVIDIA's TileIR optimizing assembler. It takes a TileIR bytecode module, lowers it through the TileIR and NVVM pipeline, emits PTX, invokes ptxas, and writes a host relocatable object. It is not a CUDA C++ front-end — no EDG, no cudafe, no host stub synthesis, no CUDA source parser lives in this tool. Those stages must already have produced the TileIR bytecode this driver consumes.

From the command line the driver behaves like a compact LLVM-style compiler:

tileiras [driver options] <tileir-bytecode>
    -> parse TileIR bytecode as an MLIR builtin.module
    -> run TileIR, NVVM, and NVPTX lowering
    -> serialize PTX
    -> assemble PTX with ptxas
    -> optionally dump SASS through nvdisasm -c
    -> write a host relocatable object, default elf.o

The public contract stays deliberately small. Users select the GPU architecture, host architecture, host OS, optimization/debug mode, optional memcheck instrumentation, CUDA toolkit root, and output file. The large pass inventory hiding behind that surface is catalogued in the Pipeline Overview and the Full Pass List by Opt Level.
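As a rough illustration of that small surface, the sketch below assembles an argument list using only long flags named in the Supported Targets table. The helper name build_tileiras_argv, the file name kernel.tileirbc, and the --flag=value spelling are assumptions made for this example; the driver's cl::opt registry may accept other spellings.

```python
import shlex


def build_tileiras_argv(bytecode_path, gpu_name="sm_100", opt_level=3,
                        output_file="elf.o", lineinfo=False):
    """Build an argv list for a tileiras invocation (hypothetical helper).

    Defaults mirror the documented driver defaults: sm_100, -O3, elf.o.
    """
    argv = [
        "tileiras",
        f"--gpu-name={gpu_name}",
        f"--opt-level={opt_level}",
        f"--output-file={output_file}",
    ]
    if lineinfo:
        # Boolean flag: present or absent, no value attached.
        argv.append("--lineinfo")
    # The TileIR bytecode file is the single positional argument.
    argv.append(bytecode_path)
    return argv


if __name__ == "__main__":
    # kernel.tileirbc is a placeholder input name, not a documented extension.
    print(shlex.join(build_tileiras_argv("kernel.tileirbc",
                                         gpu_name="sm_120",
                                         output_file="kernel.o")))
```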

What the driver does

One translation unit per process invocation. The input is a TileIR bytecode buffer (magic 7f 54 69 6c 65 49 52 00, version 13.1.x); the output is a host relocatable object the driver writes to --output-file or, by default, elf.o. Exit status is 0 on success or one of the five error codes documented in Driver Program Handle; no partial output is ever written.

The driver distinguishes TileIR bytecode from generic upstream MLIR bytecode at the magic-number level. A stream that opens with the MLIR framing prefix 06 03 80 0a 4d 4c 49 52 and the "\nMLIR" payload tag — rather than the TileIR "Tile\0" tag in the same slot — is rejected with a separate diagnostic that names MLIR bytecode explicitly, so the user can route the input to the right tool instead of guessing whether a parser failure means a corrupt file.
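The magic-number check can be sketched as a prefix comparison. The byte values below are the ones quoted in this section; the function name classify_bytecode and the returned labels are hypothetical, and treating each signature as a plain leading prefix is a simplifying assumption about the real framing.

```python
# Byte signatures as quoted in the text above.
TILEIR_MAGIC = bytes.fromhex("7f 54 69 6c 65 49 52 00")
MLIR_PREFIX = bytes.fromhex("06 03 80 0a 4d 4c 49 52")


def classify_bytecode(buf):
    """Return a coarse label for an input buffer (hypothetical helper).

    "tileir"  -> accepted by the driver
    "mlir"    -> generic MLIR bytecode, deserves its own diagnostic
    "empty"   -> null or zero-length buffer
    "unknown" -> anything else
    """
    if not buf:
        return "empty"
    if buf.startswith(TILEIR_MAGIC):
        return "tileir"
    if buf.startswith(MLIR_PREFIX):
        return "mlir"
    return "unknown"
```

Keeping "mlir" separate from "unknown" is what lets a caller print a diagnostic that names MLIR bytecode instead of a generic parse failure.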

Validation runs before any pipeline construction. It rejects null buffers, non-TileIR bytecode, unsupported GPU names, optimization levels above 3, and --device-debug paired with any nonzero optimization level. The verbatim diagnostic strings and their error codes live in Driver CLI Options.
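A minimal sketch of the option-level checks, assuming they run as a single function before any pipeline state exists. The names ValidationError and validate_request and the message texts are invented for illustration; they are not the verbatim diagnostics, and the bytecode and target checks are left out here.

```python
class ValidationError(Exception):
    """Raised when a request is rejected before pipeline construction."""


def validate_request(buffer, opt_level, device_debug):
    """Apply the buffer and option rules described above (hypothetical helper)."""
    if buffer is None:
        raise ValidationError("null input buffer")
    if opt_level > 3:
        raise ValidationError(f"optimization level {opt_level} is above 3")
    if device_debug and opt_level != 0:
        # --device-debug is only valid together with -O0.
        raise ValidationError("device debug requires optimization level 0")
```

Because nothing is constructed before these checks pass, a rejected request has no pipeline state to tear down.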

Supported Targets

| Surface | Accepted values | Default / effect |
|---|---|---|
| --gpu-name | sm_100, sm_103, sm_110, sm_120, sm_121 | Defaults to sm_100. |
| --host-arch | x86_64, aarch64, arm64ec | Selects the host triple fragment. |
| --host-os | linux, windows | Selects the object and triple OS fragment. |
| --sanitize | memcheck | Adds TileIR memcheck instrumentation when present. |
| --opt-level / -O | 0, 1, 2, 3 | Driver default is 3. |
| --lineinfo | boolean | Emits line information without full device debug. |
| --device-debug / -g | boolean | Requires -O0; enables full device debug mode. |
| --output-file / -o | path | Defaults to elf.o. |

The target set is Blackwell-oriented. A clean-room implementation should treat unsupported SM names as hard validation errors rather than silently remap them to the closest known architecture.
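One way a clean-room implementation might enforce this is a strict membership test with no fallback. The accepted values come from the table above; the function name check_target and the use of ValueError are assumptions made for the sketch.

```python
SUPPORTED_GPUS = ("sm_100", "sm_103", "sm_110", "sm_120", "sm_121")
SUPPORTED_HOST_ARCHES = ("x86_64", "aarch64", "arm64ec")
SUPPORTED_HOST_OSES = ("linux", "windows")


def check_target(gpu_name, host_arch, host_os):
    """Return the validated target tuple or raise ValueError (hypothetical helper).

    Unknown names are rejected outright; nothing is remapped to a
    "closest" architecture.
    """
    if gpu_name not in SUPPORTED_GPUS:
        raise ValueError(f"unsupported GPU name: {gpu_name}")
    if host_arch not in SUPPORTED_HOST_ARCHES:
        raise ValueError(f"unsupported host architecture: {host_arch}")
    if host_os not in SUPPORTED_HOST_OSES:
        raise ValueError(f"unsupported host OS: {host_os}")
    return (gpu_name, host_arch, host_os)
```

For example, check_target("sm_90", "x86_64", "linux") raises rather than quietly compiling for sm_100.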

Driver Flow

The compile path is linear and has no user-visible subcommands:

main
  parse argv against the cl::opt registry
  read positional TileIR bytecode file
  resolve CUDA toolkit root
  validate buffer, target, optimization level
  create an MLIRContext and register dialects
  parse bytecode into builtin.module
  attach host/GPU target tuple
  build the TileIR pass pipeline for the requested optimization level
  lower to NVVM and LLVM
  serialize PTX text
  invoke ptxas with PTX passed as --input-as-string
  optionally write cubin to a temporary file and run nvdisasm -c
  write the relocatable object bytes to disk

The only external tools on the default path are CUDA toolkit binaries. ptxas receives PTX through --input-as-string and returns assembled cubin bytes on stdout. The SASS dump path writes that cubin to a temporary file, runs the configured disassembler command, and removes the temporary file when the driver created it.
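The SASS dump half of that description can be sketched as follows. This is an illustration under stated assumptions, not the driver's code: the function name dump_sass, the .cubin suffix, and the choice to return the tool's stdout are all hypothetical. The ptxas call is omitted because its full argument list is not given here.

```python
import os
import subprocess
import sys
import tempfile


def dump_sass(cubin, disasm_cmd=("nvdisasm", "-c")):
    """Write cubin bytes to a temp file, run the disassembler, clean up.

    disasm_cmd is the configured command; the temp file path is appended
    as its last argument (an assumption about the calling convention).
    """
    fd, path = tempfile.mkstemp(suffix=".cubin")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(cubin)
        result = subprocess.run([*disasm_cmd, path],
                                capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"SASS dump command failed: {result.stderr}")
        return result.stdout
    finally:
        # The sketch created the file, so it always removes it.
        os.remove(path)


if __name__ == "__main__":
    # Stand-in "disassembler" so the sketch runs without a CUDA toolkit:
    # it just reports the size of the file it was handed.
    fake = (sys.executable, "-c",
            "import os, sys; print(os.path.getsize(sys.argv[1]))")
    print(dump_sass(b"\x00" * 16, disasm_cmd=fake))
```

Putting the cleanup in a finally block means the temporary file is removed whether the disassembler succeeds, fails, or cannot be launched at all.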

Failure Model

Every failure prints a diagnostic and returns a nonzero exit status; the driver never writes a partial output file. The user-visible categories are:

| Category | Typical trigger |
|---|---|
| Input missing | No positional TileIR bytecode file was provided. |
| Read failure | The input file cannot be opened or mapped. |
| Bytecode mismatch | The buffer is not TileIR bytecode. |
| Unsupported target | --gpu-name, --host-arch, or --host-os is outside the accepted set. |
| Invalid options | --opt-level > 3 or --device-debug with nonzero optimization. |
| Toolkit failure | CUDA root cannot be resolved for an operation that requires the toolkit. |
| Compile failure | MLIR parsing, pass execution, PTX emission, or ptxas failed. |
| Dump failure | The configured SASS dump command failed or could not be executed. |

Errors are terminal for the current invocation by design. The driver makes no attempt at partial output recovery after a pipeline or assembler failure — a fresh invocation with corrected input is always cheaper than guessing how much of a half-finished artifact is trustworthy.
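This section does not say how the driver guarantees that no partial file appears. One common way to get that property, shown here purely as an assumption, is to produce all bytes first, write them to a temporary file in the destination directory, and rename it into place only on success. The helper name write_object_atomically and the produce_bytes callback are hypothetical.

```python
import os
import tempfile


def write_object_atomically(output_path, produce_bytes):
    """Write the output only if the whole compile succeeded (hypothetical helper).

    produce_bytes stands in for the full pipeline; if it raises, the
    destination path is never touched.
    """
    data = produce_bytes()
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # Rename within one directory, so readers see the old state or the new one.
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this shape, a pipeline or assembler failure leaves the output path exactly as it was before the invocation.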

Driver main() Entry walks the entry-point code path in detail; Driver CLI Options catalogues every option and its validator; Driver Program Handle defines the public error-code numbering; Host Launch ABI and ptxas Knobs covers the kernel-launch metadata the driver emits into the produced object.