Boundaries: tileiras vs cicc

Abstract

The tileiras and cicc binaries shipped inside CUDA Toolkit 13.x are siblings. They live in the same bin/ directory, are both invoked by nvcc, and both emit PTX that is handed to the same ptxas. What differs is the front edge of the pipeline: cicc accepts CUDA C++ source and rides an EDG-driven NVVM bridge into the NVPTX backend; tileiras accepts MLIR bytecode and rides a 53-pass MLIR pipeline driver into the same NVPTX backend. This page assumes the reader already knows cicc and documents what is shared, what is reinvented, and what cicc carries that tileiras jettisoned.

Premise

Tileiras and cicc are sibling tools in CUDA 13.1's device-compilation toolchain. They link the same NVIDIA-internal LLVM 21.0.0git fork, expose the same MC subsystem identity, and carry the same NVVM/NVPTX pass family names. Cicc 13.0 carries the same family one minor revision earlier; cicc 13.1 tracks tileiras's LLVM snapshot.

cicc is a CUDA-C++-to-PTX compiler. Its three major subsystems are an EDG 6.6 frontend, an NVVM bridge, and an LLVM NVPTX backend. Together they implement a full source-to-PTX flow with standalone and libNVVM-shaped dispatch. The compiler parses C++, lowers through EDG IL, emits the .int.c/.device.c/.stub.c split artifacts, optimizes through NVIDIA's NVVM pass family, and runs the NVPTX backend.

tileiras is an optimizing assembler in the literal MLIR sense: it consumes a serialized representation of an already-lowered tile program, finishes lowering to a hardware-near IR, and emits a deployable artifact. Input is MLIR bytecode — the on-disk encoding of a builtin.module containing a cuda_tile payload — not source. There is no C++ parser, no EDG frontend, no .int.c emission, no constexpr evaluator. Tileiras is also explicitly not a cudafe++ replacement: cudafe++ does C++ source-to-source rewrite (kernel-launch lowering, host/device split), while tileiras only consumes bytecode and emits a host ELF (elf.o by default).

Pass-by-pass overlap matrix

The clean way to read the shared surface is to split it into two layers.

Layer A — MLIR / IR-frontend. No equivalent in cicc. Cicc has no MLIR; its frontend is EDG 6.6 emitting C, then a hand-written EDG-IL-to-LLVM-IR translator inside the NVVM bridge.

Layer B — NVVM-IR / NVPTX-backend. Shared. Pass names, command-line keys, diagnostic strings, and pass-info constructor shapes match byte-for-byte across the two binaries.

Layer	Subsystem	tileiras	cicc	Status
A	C++ parser	absent	EDG 6.6 frontend	cicc-only
A	constexpr evaluator	absent	EDG tree-walker	cicc-only
A	`.int.c`/`.device.c`/`.stub.c` triple	absent	EDG backend output	cicc-only
A	EDG IL → LLVM IR	absent	source-language IR generation	cicc-only
A	MLIR bytecode reader	present	absent	tileiras-only
A	9-dialect cascade + dialect registration	present	absent	tileiras-only
A	TileAS pass family	present	absent	tileiras-only
A	MLIR `PassManager` constructor	53-pass pipeline	absent	tileiras-only
A	MODSBuilder	cost-based modulo scheduler	absent	tileiras-only
A	TileIR pipeline driver	register, configure, run MLIR lowering	absent	tileiras-only
A	Pipeline option registrar	compact typed option table	broad `cl::opt` surface	different shape
A	OptiX IR generation	absent	`--emit-optix-ir` path	cicc-only
A	Wizard mode / fast-compile tier	absent	present	cicc-only
B	NVVMReflect family	present	present	shared
B	NVVM Peephole Optimizer	present	present	shared
B	BaseAddressStrengthReduce	present	present	shared
B	MemorySpaceOpt	present	present	shared
B	DeadSyncElim	present	present	shared
B	CommonBaseElim	present	present	shared
B	NVVMIRVerifier	present	present	shared
B	IPMSPPass	present	present	shared
B	NVPTXSetFunctionLinkagesPass	present	present	shared
B	SelectKernelsPass	present	present	shared
B	KernelInfoPrinter	present	present	shared
B	NVVMAA	present	present	shared
B	nvvm-reflect-pp	present	present	shared
B	NVPTX SelectionDAG	present	present	shared
B	NVPTX instruction printer	present	present	shared
B	`PassBuilder::registerAllPasses`	present	present	shared
B	libdevice bitcode	embedded once	embedded twice in the two cicc paths	shared content
B	ptxas subprocess	launched by tileiras	launched by the nvcc/cicc path	both shell out

The pattern is simple: above NVVM-IR everything is rewritten; below NVVM-IR almost everything is shared.

Shared NVPTX backend evidence

When the cuda_tile MLIR module finishes its descent through the 9-dialect cascade and reaches the llvm/nvvm dialect, tileiras hands the resulting LLVM module to an NVPTX backend from the same NVIDIA-internal fork that cicc links. The pass roster, command-line keys, diagnostics, and analysis names line up across the two tools.

Pass	Public key or surface	Role
NVVM Peephole Optimizer	`nvvm-peephole-optimizer`	Performs NVVM-specific instruction and intrinsic cleanups before codegen.
BaseAddressStrengthReduce	internal debug type	Rewrites address arithmetic into forms that are cheaper for NVPTX selection.
MemorySpaceOpt	`-mllvm` knob family	Normalizes memory-space casts and address-space information.
DeadSyncElim	`-nvvm-dead-sync-elim`	Removes synchronization operations proven unnecessary.
CommonBaseElim	SCEV-driven transform	Deduplicates related GEP/base-address computations.
NVVMIRVerifier	verifier diagnostics	Rejects invalid NVVM IR shapes before NVPTX lowering.
IPMSPPass	`ipmsp`	Interprocedural module-specialization support.
NVPTXSetFunctionLinkagesPass	`check-kernel-functions`	Sets and validates kernel linkage state.
SelectKernelsPass	`select-kernels`	Restricts compilation to selected kernel sets or ranges.
KernelInfoPrinter	`kernel-info`	Emits kernel metadata for downstream consumers.
NVVMAA	`nvvm-aa`	NVIDIA alias analysis for NVVM/NVPTX transforms.
NVVMReflect	`nvvm-reflect`, `nvvm-reflect-pp`	Resolves `__nvvm_reflect` queries from reflection metadata.

Two CLI knob families confirm the shared backend contract at the user-visible layer. The nvvm-reflect- option family installs the same enable and key/value override behavior in both tools, and the kernel-selection family accepts the same kernel-list, kernel-range, IPMSP dump, and clone-control options.
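
As a rough illustration of the key/value override behavior described above, the following Python sketch models how a reflect query might be folded to a constant. The `name=value` override syntax and the fall-back to 0 for unknown keys are assumptions borrowed from upstream LLVM's NVVMReflect behavior; the function names are hypothetical and are not taken from either binary.

```python
def parse_reflect_overrides(args):
    """Parse hypothetical 'name=value' override strings into a lookup table."""
    table = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep or not name:
            raise ValueError(f"malformed reflect override: {arg!r}")
        table[name] = int(value)
    return table


def resolve_reflect(query, table, enabled=True):
    """Model folding a __nvvm_reflect(query) call to an integer constant.

    Returns None when reflection is disabled (the call would be left in place).
    Unknown keys fold to 0, which is an assumption based on upstream LLVM.
    """
    if not enabled:
        return None
    return table.get(query, 0)


overrides = parse_reflect_overrides(["__CUDA_FTZ=1"])
ftz = resolve_reflect("__CUDA_FTZ", overrides)
unknown = resolve_reflect("__SOME_UNKNOWN_KEY", overrides)
```

Because both tools are described as installing the same option family, a model like this would apply equally to either binary; only the way the overrides reach the backend differs.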

Crucial scoping note: these passes are not invoked by tileiras's own MLIR PassManager. They run one level down, after tileiras's LLVM-dialect output is materialized as an llvm::Module and handed to the embedded NVPTX backend. The MLIR layer produces valid-shape NVVM-dialect IR; the LLVM layer applies the shared NVPTX pass family unchanged.

Tileiras-only inventions

Above the NVVM-IR boundary, tileiras introduces an MLIR-shaped front-end with no analogue in cicc. None of the following symbols, dialects, or pass mnemonics appear in the cicc binary.

Subsystem	Description
MLIR bytecode reader	Project-private MLIR bytecode I/O with Tile versioning, frozen op/type/attribute tags, and `cuda_tile` schema support.
TileIR top-level driver	Compile-and-serialize path that registers dialects, registers pipeline options, and runs lowering.
9-dialect cascade	`cuda_tile` → `nv_tileaa` → `nv_tileas` (+ `cute`, `cute_nvgpu`, `cutlass`) → `nvgpu` → `nvvm` → `llvm`.
MLIR-pipeline driver	Builds the `mlir::PassManager` for O0/O1/O2/O3; the tier is decoded from bytecode attributes such as `"nvopt<O2>"`.
TileAS family	Removes dead args, resolves agent boundaries, schedules async work, materializes layouts, plans CTA mapping, and inserts OCG knobs.
MODSBuilder	Cost-based modulo scheduler used at O2 and O3 (inherited from O2) after schedule generation and after GPU-op conversion.
`cute` dialect	CuTe layout algebra: local tiling, partitioning, shape arithmetic, size/cosize, and divide helpers.
`cute_nvgpu` dialect	SM70-SM120 atoms for TMA, tensor memory, GMMA/UMMA descriptors, warp-uniform values, and WGMMA.
`cutlass` dialect	Pipeline acquire/commit/wait, tile-scheduler records, block-striped operations, and sequence barriers.
`cuda_tile` dialect	Public control, entry, tensor-view, atomic, selection, constant, and optimization-hint surface.
`nv_tileaa` / `nv_tileas`	Alias-aware typed-pointer/token/view layer plus assembler-near schedules, layouts, execution units, tiled loads/stores, and dot operations.
Pipeline option registrar	Compact typed table for integer, unsigned, boolean, enum, and string options.
`nvdisasm -c` shell-out	Optional SASS disassembly pass that appends a disassembly section to the emitted host object.

Three pieces deserve a closer look. First, dialect registration has no analogue in cicc, which builds its IR directly in LLVM-IR shape. Second, the MLIR PassManager uses nested operation pass managers, function adapters, and the canonicalizer/CSE/SymbolDCE cleanup trio; cicc's pass manager is a conventional LLVM function/module pipeline. Third, the optimization tier comes from an attribute embedded in the TileIR bytecode, while cicc uses the conventional -O0/-O1/-O2/-O3 driver flag family.
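
The third point can be sketched as a small decoder. Only the `"nvopt<O2>"` spelling comes from the table above; the exact attribute grammar, the accepted range O0 to O3, and the helper names are assumptions made for illustration.

```python
import re

# Assumed grammar: the literal text nvopt<O{digit}> with a digit from 0 to 3.
_TIER_RE = re.compile(r"^nvopt<O([0-3])>$")


def decode_opt_tier(attr):
    """Return the optimization tier (0..3) encoded in a bytecode attribute string."""
    match = _TIER_RE.match(attr)
    if match is None:
        raise ValueError(f"unrecognized optimization attribute: {attr!r}")
    return int(match.group(1))


def runs_modulo_scheduler(tier):
    """MODSBuilder is listed as running at O2 and O3 (O3 inherits it from O2)."""
    return tier >= 2


tier = decode_opt_tier("nvopt<O2>")
```

The design consequence is that the tier travels with the artifact rather than with the command line, so two bytecode files compiled by the same invocation may be lowered at different tiers.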

cicc-only baggage tileiras dropped

Cicc's bulk comes from features tileiras explicitly does not need. The following are visible in the cicc binary and entirely absent from tileiras.

Dropped subsystem	cicc responsibility	Why tileiras drops it
EDG 6.6 frontend	C++ parsing, type checking, templates, constexpr, and CUDA source diagnostics.	input is MLIR bytecode, not C++
`.int.c` / `.device.c` / `.stub.c` emission	EDG backend source splitting and host/device artifact generation.	emits host ELF directly
OptiX IR generation	Optional OptiX IR output stage.	no OptiX path
Wizard mode	cicc-internal experimental mode.	absent
Fast-compile tiers	Multiple compile-tier knobs.	only the TileIR optimization tier applies
NVVMPassOptions struct	Large shared knob block for the cicc NVVM pipeline.	consolidated into a compact typed option table
Dual Path A / Path B dispatch	Two frontend/IR-generation paths for standalone and libNVVM-shaped usage.	one bytecode-to-object path
Broad `cl::opt` registry	Large standalone compiler option surface.	small driver surface plus TileIR pipeline options
NVVM builtin resolution table	Source-level builtin name and overload resolution.	resolution happens upstream
constexpr evaluator	EDG tree-walking interpreter.	C++ template/constexpr evaluation happens upstream
C++ template cleanup	Synthesized source-language runtime cleanup.	no synthesized C++ runtime
`-nvvm-version=nvvm-latest`/`nvvm70` switch	Path selector for older cicc modes.	absent
LibNVVM API entry points	Library-facing API surface.	not a libNVVM client

Tileiras stays at 88 MB despite carrying a full MLIR runtime, a 9-dialect cascade, the CuTe/CUTLASS pipeline op surface, a cost-based modulo scheduler, and the TileAS pass family, because it leaves behind the 3.2 MB EDG frontend, the dual-path duplication, the 1,689-option registry, the 4 KB NVVMPassOptions struct, and the OptiX path. Cicc 13.0's 60 MB skews toward EDG and dual-path overhead; tileiras's 88 MB skews toward the MLIR/dialect surface and the TileAS family.

Architectural sketch (side-by-side)

                cicc                                          tileiras
                ────                                          ────────
  CUDA C++ source (.cu / .ci / .i)                  MLIR bytecode (.ctir / .ctb)
              │                                                    │
              ▼                                                    ▼
   ┌─────────────────────┐                            ┌──────────────────────┐
   │  EDG 6.6 frontend   │                            │  MLIR bytecode       │
   │  parser, constexpr  │                            │  reader              │
   │  evaluator          │                            └──────────┬───────────┘
   │                     │                                       │
   └──────────┬──────────┘                                       ▼
              │ .int.c / .device.c / .stub.c       ┌────────────────────────┐
              ▼                                    │   cuda_tile dialect    │
   ┌─────────────────────┐                         └──────────┬─────────────┘
   │  IRGEN: EDG IL →    │                                    ▼
   │  LLVM IR translator │                         ┌────────────────────────┐
   │  standalone/libNVVM │                         │   nv_tileaa dialect    │
   │  shaped paths       │                         └──────────┬─────────────┘
   └──────────┬──────────┘                                    ▼
              │                                    ┌────────────────────────┐
              ▼                                    │   nv_tileas dialect    │
   ┌─────────────────────┐                         │   + cute               │
   │  LNK + libdevice    │                         │   + cute_nvgpu         │
   │  (456 KB embedded)  │                         │   + cutlass            │
   └──────────┬──────────┘                         │  TileAS 16 passes      │
              │                                    │  MODSBuilder           │
              ▼                                    │  53-pass MLIR pipeline │
   ┌─────────────────────┐                         └──────────┬─────────────┘
   │  OPT: NVVM passes   │                                    ▼
   │  35 NVIDIA-custom + │                         ┌────────────────────────┐
   │  standard LLVM      │                         │   mlir::nvgpu          │
   │  NVVM pipeline      │                         └──────────┬─────────────┘
   └──────────┬──────────┘                                    ▼
              │                                    ┌────────────────────────┐
              │ (no MLIR layer)                    │   nvvm dialect         │
              │                                    └──────────┬─────────────┘
              │                                               ▼
              │                                    ┌────────────────────────┐
              │                                    │   llvm dialect         │
              │                                    └──────────┬─────────────┘
              │                                               │
              └───────────────────┬───────────────────────────┘
                                  │
                                  ▼  (CONVERGENCE — same NVPTX backend)
              ┌────────────────────────────────────────────────────────┐
              │  NVPTX backend (LLVM 21.0.0git internal fork)         │
              │  ─ nvvm-peephole-optimizer / BaseAddressStrengthReduce│
              │  ─ MemorySpaceOpt / DeadSyncElim / CommonBaseElim     │
              │  ─ NVVMIRVerifier / IPMSP / NVVMAA                    │
              │  ─ NVPTXSetFunctionLinkagesPass / SelectKernelsPass   │
              │  ─ KernelInfoPrinter / NVVMReflect / nvvm-reflect-pp  │
              │  ─ NVPTX SelectionDAG ISel / NVPTXInstPrinter         │
              └────────────────────────────┬───────────────────────────┘
                                           │
                                           ▼
                                       PTX text
                                           │
                                           ▼
                              ┌──────────────────────────┐
                              │  ptxas (subprocess)      │
                              │  PTX → SASS              │
                              └────────────┬─────────────┘
                                           │
                                           ▼
                                    cicc: .ptx       tileiras: elf.o
                                                     (with optional
                                                      nvdisasm -c
                                                      SASS section)

The two pipelines converge at the moment the LLVM module is materialized for the NVPTX backend, and from that point forward they share the same code — passes, ISel, register allocation, scheduling, asm-printer.

Decision matrix: which compiler does nvcc run?

The two compilers see disjoint inputs, so the routing decision is structural rather than policy-driven. nvcc classifies each input artifact and dispatches once; neither compiler probes the input format the other expects.

Input artifact	Debug mode	SM target	Compiler chosen	Why
`.cu` CUDA C++ source	release	any supported	cudafe++ → cicc	only cicc has a C++ frontend
`.cu` CUDA C++ source	`-G` device debug	any supported	cudafe++ → cicc at `-O0`	only cicc accepts source-language debug info
Preprocessed `.cpp1.ii` / `.cudafe1.cpp`	any	any supported	cicc	EDG IL re-entry is a cicc-only path
`.tileir` / `.ctir` / `.ctb` bytecode	release	sm_100, sm_103, sm_110, sm_120, sm_121	tileiras	only tileiras parses TileIR bytecode
`.tileir` bytecode	`--device-debug` requested	any supported	tileiras at `-O0`	tileiras rejects `-G` above `-O0`
`.tileir` bytecode	release	sm_70 .. sm_90a	(no valid path)	tileiras's GPU whitelist excludes pre-Blackwell SMs
`.ptx` precompiled	n/a	any	neither (ptxas only)	neither device compiler runs on PTX input
`.cubin` precompiled	n/a	any	neither (nvlink/fatbinary only)	both device compilers are upstream of cubin

Three rows deserve commentary. The pre-Blackwell row is the hard constraint: tileiras's --gpu-name enum accepts only sm_100, sm_103, sm_110, sm_120, and sm_121, so a CUDA build targeting sm_80 or sm_90 cannot use the tileiras path even if the upstream MLIR emitter exists. The cicc path remains the only compile route for those targets. The debug row is a softer constraint: both compilers reject the combination of optimization above -O0 with full device debug, but the wording of the diagnostic and the downstream NVVM options differ. The bytecode rows depend on the upstream emitter — without a CUTLASS-on-MLIR, CuTe-DSL, or Triton-for-CUDA frontend in the build, no .tileir ever appears and the tileiras path stays unused.
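
The matrix above can be read as a single dispatch function. The sketch below is a hypothetical model of that routing, not nvcc's actual logic: the function name and return shape are invented for illustration, while the extensions and the SM whitelist are copied from the table.

```python
# SM targets that the tileiras --gpu-name enum is described as accepting.
TILEIRAS_SMS = {"sm_100", "sm_103", "sm_110", "sm_120", "sm_121"}

PREPROCESSED_EXTS = (".cpp1.ii", ".cudafe1.cpp")
BYTECODE_EXTS = (".tileir", ".ctir", ".ctb")


def choose_compiler(filename, sm, device_debug=False):
    """Return (route, forced_opt); route is None when no valid path exists."""
    forced_opt = "-O0" if device_debug else None
    if filename.endswith(PREPROCESSED_EXTS):
        # The table lists debug mode as "any" here and names no forced level.
        return ("cicc", None)
    if filename.endswith(".cu"):
        return ("cudafe++ -> cicc", forced_opt)
    if filename.endswith(BYTECODE_EXTS):
        if sm not in TILEIRAS_SMS:
            return (None, None)  # pre-Blackwell target: no valid path
        return ("tileiras", forced_opt)
    if filename.endswith(".ptx"):
        return ("ptxas", None)
    if filename.endswith(".cubin"):
        return ("nvlink/fatbinary", None)
    raise ValueError(f"unclassified input artifact: {filename!r}")


release_source = choose_compiler("kernel.cu", "sm_90")
debug_source = choose_compiler("kernel.cu", "sm_90", device_debug=True)
blackwell_tile = choose_compiler("kernel.tileir", "sm_100")
hopper_tile = choose_compiler("kernel.ctb", "sm_90a")
```

Note that the function never inspects file contents: classification by artifact type alone is enough, which is what makes the decision structural rather than policy-driven.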

Capability split

The clean rule is that tileiras and cicc consume disjoint inputs. CUDA C++ source, with all of its template-instantiation, constexpr-evaluation, lambda-capture, and host/device-split machinery, is cicc's territory; TileIR bytecode, with its already-resolved tile-program structure expressed in the cuda_tile dialect family, is tileiras's territory. Neither tool has a backdoor that consumes the other's input.

What they share is the NVPTX backend below the LLVM-dialect/NVVM-IR handoff. Both compilers materialize an llvm::Module and hand it to the same NVPTX backend from the same LLVM 21 fork. Below that handoff, the two compilers run the same code: same SelectionDAG, same NVVM custom passes, same instruction printer, same libdevice payload. Above the handoff they share almost nothing.

The capability split has a practical consequence for emitters and integrators. Upstream tooling that wants the convenience of CUDA C++ source — including templates, constexpr, lambdas, and the standard CUDA runtime API — must target cicc through cudafe++. Upstream tooling that wants the precision of a tile-shaped program, hand-managed pipelines, explicit CTA mapping, and the cuda_tile/cute/cutlass op surfaces must target tileiras through TileIR bytecode. There is no overlap; the question of "which compiler should this kernel use" reduces to "which input format is the emitter willing to produce".

Migration trajectory

cicc is the longer-standing compiler and the only path that accepts CUDA C++ source. tileiras is the newer compiler, introduced in CUDA 13.1, that accepts bytecode produced by MLIR-rooted frontends. The two are sibling tools in the same toolkit, not staged replacements.

Three signals shape the trajectory. First, the shared NVPTX backend means new SM targets, new MMA shapes, and new fence semantics arrive in both compilers simultaneously through the LLVM fork. The backend itself does not lock either compiler to a particular hardware generation, although tileiras's --gpu-name whitelist currently accepts only sm_100, sm_103, sm_110, sm_120, and sm_121. Second, the tileiras-specific dialect cascade (cuda_tile, nv_tileaa, nv_tileas, cute, cute_nvgpu, cutlass) carries operations that have no analogue in cicc's LLVM-IR-only pipeline; those operations encode tile-program structure that source-level CUDA cannot express directly. Third, cicc still ships in CUDA 13.1 and, as noted in the Premise, tracks the same LLVM fork snapshot that tileiras links; both tools follow upstream NVPTX changes through the same vendor backport pipeline.

A reimplementation does not have to choose between the two tools. The honest model is "two device-code compilers, one shared backend": dispatch by input format, share the backend by linking the same NVPTX library, and treat the dialect cascade and the EDG frontend as independent front-ends that meet at the LLVM-module level.

Cross-link recommendations

Everything tileiras inherits unchanged from the LLVM 21 fork is documented in the cicc wiki, and those pages are reusable verbatim for the tileiras NVPTX backend.

NVPTX backend internals — see cicc pipeline/codegen.md and pipeline/emission.md. Same SelectionDAG, same NVPTXTargetLowering, same 19 MMA shapes x 11 data types, and same instruction-printer surface.
NVVMReflect mechanism — see cicc reflect docs. Same __nvvm_reflect/__nvvm_reflect_ocl rewrite, same nvvm.reflection module-flag table, same nvvm-reflect-add parser.
libdevice — same ~456 KB bitcode payload. Tileiras embeds it once (no Path A / Path B duplication).
NVVM Peephole / BaseAddressStrengthReduce — same pre-codegen cleanup and address-strength-reduction roles.
MemorySpaceOpt — same address-space normalization and memory-space cleanup behavior.
DeadSyncElim — same synchronization-elimination pass.
NVVMIRVerifier — same verifier role before backend lowering.
IPMSP / SelectKernels / KernelInfo / NVPTXSetFunctionLinkages / NVVMAA / nvvm-reflect-pp — same backend registration family.

For everything above the NVVM-IR boundary, the cicc wiki has nothing to offer; refer to the tileiras-internal pages: cuda_tile Overview, cute Overview, cute_nvgpu Overview, cutlass Overview, nv_tileaa Overview, nv_tileas Overview, the TileAS Pass Families series, Full Pass List by Opt Level, Modulo Scheduler and Rau, CLI Options, and MLIR Bytecode Format. The intent behind the cicc-vs-tileiras split — why an MLIR substrate at all, why a four-stage cascade, why a Rau scheduler — is documented in Architecture Evolution and Design Decisions.

Reimplementation Notes

Model the two tools as two different producers for the same downstream backend shape:

cicc:
    input: CUDA C++ source or preprocessed CUDA source
    frontend: EDG and NVVM bridge
    handoff: LLVM/NVVM module
    backend: shared NVPTX backend
    output: PTX for ptxas

tileiras:
    input: TileIR MLIR bytecode
    frontend: MLIR dialect cascade and TileAS passes
    handoff: LLVM/NVVM module
    backend: shared NVPTX backend
    output: host object that carries ptxas output

This split is the key design constraint. Above the LLVM/NVVM handoff, reuse between the two tools is mostly conceptual. Below that handoff, the pass names, reflection behavior, libdevice payload, and PTX emission semantics should be treated as one shared backend contract.
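
A minimal sketch of that model, assuming a reimplementation chooses to describe each tool as a record with a shared handoff and backend. The class and field names are hypothetical; the field values are taken from the two listings above.

```python
from dataclasses import dataclass

SHARED_HANDOFF = "LLVM/NVVM module"
SHARED_BACKEND = "shared NVPTX backend"


@dataclass(frozen=True)
class Producer:
    """One device-code compiler, described by what enters and what leaves it."""
    name: str
    input_kind: str
    frontend: str
    output: str
    handoff: str = SHARED_HANDOFF
    backend: str = SHARED_BACKEND


CICC = Producer(
    name="cicc",
    input_kind="CUDA C++ source or preprocessed CUDA source",
    frontend="EDG and NVVM bridge",
    output="PTX for ptxas",
)

TILEIRAS = Producer(
    name="tileiras",
    input_kind="TileIR MLIR bytecode",
    frontend="MLIR dialect cascade and TileAS passes",
    output="host object that carries ptxas output",
)


def shares_backend_contract(a, b):
    """True when two producers meet at the same handoff and backend."""
    return a.handoff == b.handoff and a.backend == b.backend
```

Keeping the handoff and backend as defaults makes the constraint explicit: a new producer only has to supply its own input, frontend, and output.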
