Boundaries: tileiras vs cicc

Abstract

The tileiras and cicc binaries shipped inside CUDA Toolkit 13.x are siblings. They live in the same bin/ directory, are both invoked by nvcc, and both emit PTX that is handed to the same ptxas. What differs is the front edge of the pipeline: cicc accepts CUDA C++ source and rides an EDG-driven NVVM bridge into the NVPTX backend; tileiras accepts MLIR bytecode and rides a 53-pass MLIR pipeline driver into the same NVPTX backend. This page assumes the reader already knows cicc and documents what is shared, what is reinvented, and what cicc carries that tileiras jettisoned.

Premise

Tileiras and cicc are sibling tools in CUDA 13.1's device-compilation toolchain. They link the same NVIDIA-internal LLVM 21.0.0git fork, expose the same MC subsystem identity, and carry the same NVVM/NVPTX pass family names. Cicc 13.0 carries the same family one minor revision earlier; cicc 13.1 tracks tileiras's LLVM snapshot.

cicc is a CUDA-C++-to-PTX compiler. Its three major subsystems are an EDG 6.6 frontend, an NVVM bridge, and an LLVM NVPTX backend. Together they implement a full source-to-PTX flow with standalone and libNVVM-shaped dispatch. The compiler parses C++, lowers through EDG IL, emits the .int.c/.device.c/.stub.c split artifacts, optimizes through NVIDIA's NVVM pass family, and runs the NVPTX backend.

tileiras is an optimizing assembler in the literal MLIR sense: it consumes a serialized representation of an already-lowered tile program, finishes lowering to a hardware-near IR, and emits a deployable artifact. Input is MLIR bytecode — the on-disk encoding of a builtin.module containing a cuda_tile payload — not source. There is no C++ parser, no EDG frontend, no .int.c emission, no constexpr evaluator. Tileiras is also explicitly not a cudafe++ replacement: cudafe++ does C++ source-to-source rewrite (kernel-launch lowering, host/device split), while tileiras only consumes bytecode and emits a host ELF (elf.o by default).

Pass-by-pass overlap matrix

The clean way to read the shared surface is to split it into two layers.

Layer A — MLIR / IR-frontend. No equivalent in cicc. Cicc has no MLIR; its frontend is EDG 6.6 emitting C, then a hand-written EDG-IL-to-LLVM-IR translator inside the NVVM bridge.

Layer B — NVVM-IR / NVPTX-backend. Shared. Pass names, command-line keys, diagnostic strings, and pass-info constructor shapes match byte-for-byte across the two binaries.

| Layer | Subsystem | tileiras | cicc | Status |
|---|---|---|---|---|
| A | C++ parser | absent | EDG 6.6 frontend | cicc-only |
| A | constexpr evaluator | absent | EDG tree-walker | cicc-only |
| A | .int.c/.device.c/.stub.c triple | absent | EDG backend output | cicc-only |
| A | EDG IL → LLVM IR | absent | source-language IR generation | cicc-only |
| A | MLIR bytecode reader | present | absent | tileiras-only |
| A | 9-dialect cascade + dialect registration | present | absent | tileiras-only |
| A | TileAS pass family | present | absent | tileiras-only |
| A | MLIR PassManager constructor | 53-pass pipeline | absent | tileiras-only |
| A | MODSBuilder | cost-based modulo scheduler | absent | tileiras-only |
| A | TileIR pipeline driver | register, configure, run MLIR lowering | absent | tileiras-only |
| A | Pipeline option registrar | compact typed option table | broad cl::opt surface | different shape |
| A | OptiX IR generation | absent | --emit-optix-ir path | cicc-only |
| A | Wizard mode / fast-compile tier | absent | present | cicc-only |
| B | NVVMReflect family | present | present | shared |
| B | NVVM Peephole Optimizer | present | present | shared |
| B | BaseAddressStrengthReduce | present | present | shared |
| B | MemorySpaceOpt | present | present | shared |
| B | DeadSyncElim | present | present | shared |
| B | CommonBaseElim | present | present | shared |
| B | NVVMIRVerifier | present | present | shared |
| B | IPMSPPass | present | present | shared |
| B | NVPTXSetFunctionLinkagesPass | present | present | shared |
| B | SelectKernelsPass | present | present | shared |
| B | KernelInfoPrinter | present | present | shared |
| B | NVVMAA | present | present | shared |
| B | nvvm-reflect-pp | present | present | shared |
| B | NVPTX SelectionDAG | present | present | shared |
| B | NVPTX instruction printer | present | present | shared |
| B | PassBuilder::registerAllPasses | present | present | shared |
| B | libdevice bitcode | embedded once | embedded twice in the two cicc paths | shared content |
| B | ptxas subprocess | launched by tileiras | launched by the nvcc/cicc path | both shell out |

The pattern is simple: above NVVM-IR everything is rewritten; below NVVM-IR almost everything is shared.

Shared NVPTX backend evidence

When the cuda_tile MLIR module finishes its descent through the 9-dialect cascade and reaches the llvm/nvvm dialect, tileiras hands the resulting LLVM module to an NVPTX backend from the same NVIDIA-internal fork that cicc links. The pass roster, command-line keys, diagnostics, and analysis names line up across the two tools.

| Pass | Public key or surface | Role |
|---|---|---|
| NVVM Peephole Optimizer | nvvm-peephole-optimizer | Performs NVVM-specific instruction and intrinsic cleanups before codegen. |
| BaseAddressStrengthReduce | internal debug type | Rewrites address arithmetic into forms that are cheaper for NVPTX selection. |
| MemorySpaceOpt | -mllvm knob family | Normalizes memory-space casts and address-space information. |
| DeadSyncElim | -nvvm-dead-sync-elim | Removes synchronization operations proven unnecessary. |
| CommonBaseElim | SCEV-driven transform | Deduplicates related GEP/base-address computations. |
| NVVMIRVerifier | verifier diagnostics | Rejects invalid NVVM IR shapes before NVPTX lowering. |
| IPMSPPass | ipmsp | Interprocedural module-specialization support. |
| NVPTXSetFunctionLinkagesPass | check-kernel-functions | Sets and validates kernel linkage state. |
| SelectKernelsPass | select-kernels | Restricts compilation to selected kernel sets or ranges. |
| KernelInfoPrinter | kernel-info | Emits kernel metadata for downstream consumers. |
| NVVMAA | nvvm-aa | NVIDIA alias analysis for NVVM/NVPTX transforms. |
| NVVMReflect | nvvm-reflect, nvvm-reflect-pp | Resolves __nvvm_reflect queries from reflection metadata. |

Two CLI knob families confirm the shared backend contract at the user-visible layer. The nvvm-reflect- option family installs the same enable and key/value override behavior in both tools, and the kernel-selection family accepts the same kernel-list, kernel-range, IPMSP dump, and clone-control options.
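The key/value override behavior of the reflect family can be illustrated with a small model. This is a minimal sketch, not the pass itself: the helper names `parse_reflect_overrides` and `resolve_reflect` are hypothetical, the `name=value` string shape is an assumption, and the fallback of 0 for unknown keys mirrors what upstream LLVM's NVVMReflect is understood to do rather than anything confirmed for the NVIDIA fork.

```python
def parse_reflect_overrides(args):
    """Parse 'name=value' override strings into a dict of ints.

    The 'name=value' shape is an assumption made for this sketch.
    """
    table = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {arg!r}")
        table[name] = int(value)
    return table


def resolve_reflect(query, table, default=0):
    """Resolve a reflect query against the override table.

    Unknown keys fall back to `default` (assumed to be 0 here).
    """
    return table.get(query, default)


# Example key names are illustrative only.
overrides = parse_reflect_overrides(["__CUDA_FTZ=1", "__CUDA_ARCH=1000"])
ftz = resolve_reflect("__CUDA_FTZ", overrides)
unknown = resolve_reflect("__SOME_UNSET_KEY", overrides)
```

Because both tools install the same option family, a reimplementation could feed one such table to either producer before the shared backend runs.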

Crucial scoping note: these passes are not invoked by tileiras's own MLIR PassManager. They run one level down, after tileiras's LLVM-dialect output is materialized as an llvm::Module and handed to the embedded NVPTX backend. The MLIR layer produces valid-shape NVVM-dialect IR; the LLVM layer applies the shared NVPTX pass family unchanged.

Tileiras-only inventions

Above the NVVM-IR boundary, tileiras introduces an MLIR-shaped front-end with no analogue in cicc. None of the following symbols, dialects, or pass mnemonics appear in the cicc binary.

| Subsystem | Description |
|---|---|
| MLIR bytecode reader | Project-private MLIR bytecode I/O with Tile versioning, frozen op/type/attribute tags, and cuda_tile schema support. |
| TileIR top-level driver | Compile-and-serialize path that registers dialects, registers pipeline options, and runs lowering. |
| 9-dialect cascade | cuda_tile → nv_tileaa → nv_tileas (+ cute, cute_nvgpu, cutlass) → nvgpu → nvvm → llvm. |
| MLIR-pipeline driver | Builds the mlir::PassManager for O0/O1/O2/O3; the tier is decoded from bytecode attributes such as "nvopt<O2>". |
| TileAS family | Removes dead args, resolves agent boundaries, schedules async work, materializes layouts, plans CTA mapping, and inserts OCG knobs. |
| MODSBuilder | Cost-based modulo scheduler used at O2 and O3 (inherited from O2) after schedule generation and after GPU-op conversion. |
| cute dialect | CuTe layout algebra: local tiling, partitioning, shape arithmetic, size/cosize, and divide helpers. |
| cute_nvgpu dialect | SM70-SM120 atoms for TMA, tensor memory, GMMA/UMMA descriptors, warp-uniform values, and WGMMA. |
| cutlass dialect | Pipeline acquire/commit/wait, tile-scheduler records, block-striped operations, and sequence barriers. |
| cuda_tile dialect | Public control, entry, tensor-view, atomic, selection, constant, and optimization-hint surface. |
| nv_tileaa / nv_tileas | Alias-aware typed-pointer/token/view layer plus assembler-near schedules, layouts, execution units, tiled loads/stores, and dot operations. |
| Pipeline option registrar | Compact typed table for integer, unsigned, boolean, enum, and string options. |
| nvdisasm -c shell-out | Optional SASS disassembly pass that appends a disassembly section to the emitted host object. |

Three pieces deserve a closer look. First, dialect registration has no analogue in cicc, which builds its IR directly in LLVM-IR shape. Second, the MLIR PassManager uses nested operation pass managers, function adapters, and the canonicalizer/CSE/SymbolDCE cleanup trio; cicc's pass manager is a conventional LLVM function/module pipeline. Third, the optimization tier comes from an attribute embedded in the TileIR bytecode, while cicc uses the conventional -O0/-O1/-O2/-O3 driver flag family.
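The third point, decoding the tier from an embedded attribute rather than a driver flag, can be sketched as a small parser. This is an illustration under assumptions: only the example spelling "nvopt<O2>" comes from the text above, so the exact attribute grammar, the helper name `decode_opt_tier`, and the choice to reject anything outside O0..O3 are all hypothetical.

```python
import re

# Assumed grammar: the literal "nvopt<O" + one digit 0..3 + ">".
_NVOPT_RE = re.compile(r"^nvopt<O([0-3])>$")


def decode_opt_tier(attr: str) -> int:
    """Return the optimization tier (0..3) encoded in an nvopt attribute string."""
    match = _NVOPT_RE.match(attr.strip())
    if match is None:
        raise ValueError(f"unrecognized optimization attribute: {attr!r}")
    return int(match.group(1))


tier = decode_opt_tier("nvopt<O2>")
```

A driver built this way would select its pass list from `tier` after reading the bytecode, which is why no -O flag needs to reach the MLIR layer.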

cicc-only baggage tileiras dropped

Cicc's bulk comes from features tileiras explicitly does not need. The following are visible in the cicc binary and entirely absent from tileiras.

| Dropped subsystem | cicc responsibility | Why tileiras drops it |
|---|---|---|
| EDG 6.6 frontend | C++ parsing, type checking, templates, constexpr, and CUDA source diagnostics. | input is MLIR bytecode, not C++ |
| .int.c / .device.c / .stub.c emission | EDG backend source splitting and host/device artifact generation. | emits host ELF directly |
| OptiX IR generation | Optional OptiX IR output stage. | no OptiX path |
| Wizard mode | cicc-internal experimental mode. | absent |
| Fast-compile tiers | Multiple compile-tier knobs. | only the TileIR optimization tier applies |
| NVVMPassOptions struct | Large shared knob block for the cicc NVVM pipeline. | consolidated into a compact typed option table |
| Dual Path A / Path B dispatch | Two frontend/IR-generation paths for standalone and libNVVM-shaped usage. | one bytecode-to-object path |
| Broad cl::opt registry | Large standalone compiler option surface. | small driver surface plus TileIR pipeline options |
| NVVM builtin resolution table | Source-level builtin name and overload resolution. | resolution happens upstream |
| constexpr evaluator | EDG tree-walking interpreter. | C++ template/constexpr evaluation happens upstream |
| C++ template cleanup | Synthesized source-language runtime cleanup. | no synthesized C++ runtime |
| -nvvm-version=nvvm-latest/nvvm70 switch | Path selector for older cicc modes. | absent |
| LibNVVM API entry points | Library-facing API surface. | not a libNVVM client |

Tileiras is 88 MB. It carries a full MLIR runtime, a 9-dialect cascade, the CuTe/CUTLASS pipeline op surface, a cost-based modulo scheduler, and the TileAS pass family, but it leaves behind the 3.2 MB EDG, the dual-path duplication, the 1,689-option registry, the 4 KB NVVMPassOptions struct, and the OptiX path. Cicc 13.0's 60 MB skews toward EDG and dual-path overhead; tileiras's 88 MB skews toward the MLIR/dialect surface and the TileAS family.

Architectural sketch (side-by-side)

                cicc                                          tileiras
                ────                                          ────────
  CUDA C++ source (.cu / .ci / .i)                  MLIR bytecode (.ctir / .ctb)
              │                                                    │
              ▼                                                    ▼
   ┌─────────────────────┐                            ┌──────────────────────┐
   │  EDG 6.6 frontend   │                            │  MLIR bytecode       │
   │  parser, constexpr  │                            │  reader              │
   │  evaluator          │                            └──────────┬───────────┘
   │                     │                                       │
   └──────────┬──────────┘                                       ▼
              │ .int.c / .device.c / .stub.c       ┌────────────────────────┐
              ▼                                    │   cuda_tile dialect    │
   ┌─────────────────────┐                         └──────────┬─────────────┘
   │  IRGEN: EDG IL →    │                                    ▼
   │  LLVM IR translator │                         ┌────────────────────────┐
   │  standalone/libNVVM │                         │   nv_tileaa dialect    │
   │  shaped paths       │                         └──────────┬─────────────┘
   └──────────┬──────────┘                                    ▼
              │                                    ┌────────────────────────┐
              ▼                                    │   nv_tileas dialect    │
   ┌─────────────────────┐                         │   + cute               │
   │  LNK + libdevice    │                         │   + cute_nvgpu         │
   │  (456 KB embedded)  │                         │   + cutlass            │
   └──────────┬──────────┘                         │  TileAS 16 passes      │
              │                                    │  MODSBuilder           │
              ▼                                    │  53-pass MLIR pipeline │
   ┌─────────────────────┐                         └──────────┬─────────────┘
   │  OPT: NVVM passes   │                                    ▼
   │  35 NVIDIA-custom + │                         ┌────────────────────────┐
   │  standard LLVM      │                         │   mlir::nvgpu          │
   │  NVVM pipeline      │                         └──────────┬─────────────┘
   └──────────┬──────────┘                                    ▼
              │                                    ┌────────────────────────┐
              │ (no MLIR layer)                    │   nvvm dialect         │
              │                                    └──────────┬─────────────┘
              │                                               ▼
              │                                    ┌────────────────────────┐
              │                                    │   llvm dialect         │
              │                                    └──────────┬─────────────┘
              │                                               │
              └───────────────────┬───────────────────────────┘
                                  │
                                  ▼  (CONVERGENCE — same NVPTX backend)
              ┌────────────────────────────────────────────────────────┐
              │  NVPTX backend (LLVM 21.0.0git internal fork)         │
              │  ─ nvvm-peephole-optimizer / BaseAddressStrengthReduce│
              │  ─ MemorySpaceOpt / DeadSyncElim / CommonBaseElim     │
              │  ─ NVVMIRVerifier / IPMSP / NVVMAA                    │
              │  ─ NVPTXSetFunctionLinkagesPass / SelectKernelsPass   │
              │  ─ KernelInfoPrinter / NVVMReflect / nvvm-reflect-pp  │
              │  ─ NVPTX SelectionDAG ISel / NVPTXInstPrinter         │
              └────────────────────────────┬───────────────────────────┘
                                           │
                                           ▼
                                       PTX text
                                           │
                                           ▼
                              ┌──────────────────────────┐
                              │  ptxas (subprocess)      │
                              │  PTX → SASS              │
                              └────────────┬─────────────┘
                                           │
                                           ▼
                                    cicc: .ptx       tileiras: elf.o
                                                     (with optional
                                                      nvdisasm -c
                                                      SASS section)

The two pipelines converge at the moment the LLVM module is materialized for the NVPTX backend, and from that point forward they share the same code — passes, ISel, register allocation, scheduling, asm-printer.

Decision matrix: which compiler does nvcc run?

The two compilers see disjoint inputs, so the routing decision is structural rather than policy-driven. nvcc classifies each input artifact and dispatches once; neither compiler probes the input format the other expects.

| Input artifact | Debug mode | SM target | Compiler chosen | Why |
|---|---|---|---|---|
| .cu CUDA C++ source | release | any supported | cudafe++ → cicc | only cicc has a C++ frontend |
| .cu CUDA C++ source | -G device debug | any supported | cudafe++ → cicc at -O0 | only cicc accepts source-language debug info |
| Preprocessed .cpp1.ii / .cudafe1.cpp | any | any supported | cicc | EDG IL re-entry is a cicc-only path |
| .tileir / .ctir / .ctb bytecode | release | sm_100, sm_103, sm_110, sm_120, sm_121 | tileiras | only tileiras parses TileIR bytecode |
| .tileir bytecode | --device-debug requested | any supported | tileiras at -O0 | tileiras rejects -G above -O0 |
| .tileir bytecode | release | sm_70 .. sm_90a | (no valid path) | tileiras's GPU whitelist excludes pre-Blackwell SMs |
| .ptx precompiled | n/a | any | neither (ptxas only) | neither device compiler runs on PTX input |
| .cubin precompiled | n/a | any | neither (nvlink/fatbinary only) | both device compilers are upstream of cubin |
Three rows deserve commentary. The pre-Blackwell row is the hard constraint: tileiras's --gpu-name enum accepts only sm_100, sm_103, sm_110, sm_120, and sm_121, so a CUDA build targeting sm_80 or sm_90 cannot use the tileiras path even if the upstream MLIR emitter exists. The cicc path remains the only compile route for those targets. The debug row is a softer constraint: both compilers reject the combination of optimization above -O0 with full device debug, but the wording of the diagnostic and the downstream NVVM options differ. The bytecode rows depend on the upstream emitter — without a CUTLASS-on-MLIR, CuTe-DSL, or Triton-for-CUDA frontend in the build, no .tileir ever appears and the tileiras path stays unused.
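The decision matrix reads naturally as a pure function of the input artifact, the SM target, and the debug flag. The following is a hedged sketch of that reading, not nvcc's actual dispatch logic: the function name `route` and the returned labels are hypothetical, and the extension sets and SM whitelist are copied from the table and commentary above.

```python
import os

# Values taken from the decision matrix; names are illustrative.
TILEIRAS_SMS = {"sm_100", "sm_103", "sm_110", "sm_120", "sm_121"}
BYTECODE_EXTS = {".tileir", ".ctir", ".ctb"}
PREPROCESSED_SUFFIXES = (".cpp1.ii", ".cudafe1.cpp")


def route(filename: str, sm: str, device_debug: bool = False) -> str:
    """Return a label for the device-compile route implied by the matrix."""
    name = filename.lower()
    # Check compound suffixes first; splitext would only see ".ii" / ".cpp".
    if name.endswith(PREPROCESSED_SUFFIXES):
        return "cicc"
    ext = os.path.splitext(name)[1]
    if ext == ".cu":
        return "cudafe++ -> cicc at -O0" if device_debug else "cudafe++ -> cicc"
    if ext in BYTECODE_EXTS:
        if sm not in TILEIRAS_SMS:
            return "no valid path"  # pre-Blackwell targets are excluded
        return "tileiras at -O0" if device_debug else "tileiras"
    if ext == ".ptx":
        return "ptxas only"
    if ext == ".cubin":
        return "nvlink/fatbinary only"
    raise ValueError(f"unclassified input artifact: {filename!r}")
```

Note that the SM target only matters on the bytecode branch; for source inputs the sketch never consults it, which matches the "any supported" entries in the table.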

Capability split

The clean rule is that tileiras and cicc consume disjoint inputs. CUDA C++ source, with all of its template-instantiation, constexpr-evaluation, lambda-capture, and host/device-split machinery, is cicc's territory; TileIR bytecode, with its already-resolved tile-program structure expressed in the cuda_tile dialect family, is tileiras's territory. Neither tool has a backdoor that consumes the other's input.

What they share is the NVPTX backend below the LLVM-dialect/NVVM-IR handoff. Both compilers materialise an llvm::Module and hand it to the same NVPTX backend from the same LLVM 21 fork. Below that handoff, the two compilers are byte-for-byte equivalent: same SelectionDAG, same NVVM custom passes, same instruction printer, same libdevice payload. Above the handoff they share almost nothing.

The capability split has a practical consequence for emitters and integrators. Upstream tooling that wants the convenience of CUDA C++ source — including templates, constexpr, lambdas, and the standard CUDA runtime API — must target cicc through cudafe++. Upstream tooling that wants the precision of a tile-shaped program, hand-managed pipelines, explicit CTA mapping, and the cuda_tile/cute/cutlass op surfaces must target tileiras through TileIR bytecode. There is no overlap; the question of "which compiler should this kernel use" reduces to "which input format is the emitter willing to produce".

Migration trajectory

cicc is the longer-standing compiler and the only path that accepts CUDA C++ source. tileiras is the newer compiler, introduced in CUDA 13.1, that accepts bytecode produced by MLIR-rooted frontends. The two are sibling tools in the same toolkit, not staged replacements.

Three signals shape the trajectory. First, the shared NVPTX backend means new SM targets, new MMA shapes, and new fence semantics arrive in both compilers simultaneously through the LLVM fork. Neither compiler is locked to a particular hardware generation. Second, the tileiras-specific dialect cascade (cuda_tile, nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass) carries operations that have no analogue in cicc's LLVM-IR-only input; those operations encode tile-program structure that source-level CUDA cannot express directly. Third, cicc still ships in CUDA 13.1, with a copy of the same LLVM fork that tileiras links, one minor revision newer than the copy in cicc 13.0; both tools track upstream NVPTX changes through the same vendor backport pipeline.

A reimplementation does not have to choose between the two tools. The honest model is "two device-code compilers, one shared backend": dispatch by input format, share the backend by linking the same NVPTX library, and treat the dialect cascade and the EDG frontend as independent front-ends that meet at the LLVM-module level.

Everything tileiras inherits unchanged from the LLVM 21 fork is documented in the cicc wiki, and those pages are reusable verbatim for the tileiras NVPTX backend.

  • NVPTX backend internals — see cicc pipeline/codegen.md and pipeline/emission.md. Same SelectionDAG, same NVPTXTargetLowering, same 19 MMA shapes x 11 data types, and same instruction-printer surface.
  • NVVMReflect mechanism — see cicc reflect docs. Same __nvvm_reflect/__nvvm_reflect_ocl rewrite, same nvvm.reflection module-flag table, same nvvm-reflect-add parser.
  • libdevice — same ~456 KB bitcode payload. Tileiras embeds it once (no Path A / Path B duplication).
  • NVVM Peephole / BaseAddressStrengthReduce — same pre-codegen cleanup and address-strength-reduction roles.
  • MemorySpaceOpt — same address-space normalization and memory-space cleanup behavior.
  • DeadSyncElim — same synchronization-elimination pass.
  • NVVMIRVerifier — same verifier role before backend lowering.
  • IPMSP / SelectKernels / KernelInfo / NVPTXSetFunctionLinkages / NVVMAA / nvvm-reflect-pp — same backend registration family.

For everything above the NVVM-IR boundary, the cicc wiki has nothing to offer; refer to the tileiras-internal pages: cuda_tile Overview, cute Overview, cute_nvgpu Overview, cutlass Overview, nv_tileaa Overview, nv_tileas Overview, the TileAS Pass Families series, Full Pass List by Opt Level, Modulo Scheduler and Rau, CLI Options, and MLIR Bytecode Format. The intent behind the cicc-vs-tileiras split — why an MLIR substrate at all, why a four-stage cascade, why a Rau scheduler — is documented in Architecture Evolution and Design Decisions.

Reimplementation Notes

Model the two tools as two different producers for the same downstream backend shape:

cicc:
    input: CUDA C++ source or preprocessed CUDA source
    frontend: EDG and NVVM bridge
    handoff: LLVM/NVVM module
    backend: shared NVPTX backend
    output: PTX for ptxas

tileiras:
    input: TileIR MLIR bytecode
    frontend: MLIR dialect cascade and TileAS passes
    handoff: LLVM/NVVM module
    backend: shared NVPTX backend
    output: host object that carries ptxas output

This split is the key design constraint. Above the LLVM/NVVM handoff, reuse between the two tools is mostly conceptual. Below that handoff, the pass names, reflection behavior, libdevice payload, and PTX emission semantics should be treated as one shared backend contract.
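One way to keep that constraint visible in a reimplementation is to model each tool as a producer whose frontend stages differ but whose post-handoff stages come from one shared definition. The sketch below is only a structural illustration: the `Producer` type, the stage labels, and the three-entry `SHARED_BACKEND` tuple are hypothetical simplifications of the stages described on this page.

```python
from dataclasses import dataclass

HANDOFF = "LLVM/NVVM module"
# Simplified stand-in for the shared NVPTX backend stages.
SHARED_BACKEND = ("NVVM pass family", "NVPTX instruction selection", "PTX emission")


@dataclass(frozen=True)
class Producer:
    name: str
    input_kind: str
    frontend: tuple  # tool-specific stages above the handoff
    output: str      # tool-specific final artifact


def pipeline(p: Producer) -> tuple:
    """Full stage list: private frontend, shared handoff and backend, private output."""
    return (*p.frontend, HANDOFF, *SHARED_BACKEND, p.output)


def shared_tail(p: Producer) -> tuple:
    """Stages between the frontend and the final artifact."""
    return pipeline(p)[len(p.frontend):-1]


CICC = Producer("cicc", "CUDA C++ source",
                ("EDG frontend", "NVVM bridge"), "PTX for ptxas")
TILEIRAS = Producer("tileiras", "TileIR MLIR bytecode",
                    ("MLIR dialect cascade", "TileAS passes"),
                    "host object that carries ptxas output")
```

Defining the backend stages once and referencing them from both producers makes any accidental divergence below the handoff a code change in a single place rather than in two.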