tileiras vs cudafe++ (Non-Relationship)
Abstract
A common misconception about CUDA Toolkit 13.1 is that the new tileiras binary is a successor to, or a replacement for, cudafe++. It is not. The two tools share a parent driver (nvcc), a vendor (NVIDIA), and a problem domain (CUDA), but their inputs, outputs, internal architectures, and roles in the build graph do not overlap. This page documents that non-relationship explicitly.
What tileiras does NOT have
The cleanest way to state the boundary is to enumerate, point by point, each cudafe++ subsystem that is absent from tileiras. The absence is architectural, not just cosmetic: tileiras starts from serialized MLIR bytecode, so every source-language responsibility that belongs to cudafe++ has already happened upstream or does not apply.
- No EDG frontend. cudafe++ is built around the Edison Design Group C++ Front End v6.6: lexer, parser, type system, template instantiation engine, overload resolver, and constexpr interpreter. tileiras has none of that machinery. Its bulk comes from the MLIR runtime, TileIR dialect libraries, and the LLVM 21 NVPTX backend.
- No C++ parser. tileiras has no recursive-descent C++ parser, token kind table, operator-precedence engine, or Itanium ABI name mangler. Its inbound surface is the MLIR bytecode reader, which decodes a serialized builtin.module whose ops, types, and attributes have already been resolved upstream. tileiras enters at bytecode, not source text.
- No .int.c emission. cudafe++ is a C++ source-to-source translator; one of its jobs is writing the transformed host-side .int.c output. tileiras emits no C source. Its terminal output is a host ELF object, with PTX as the intermediate textual artifact handed to ptxas.
- No host stubs. cudafe++ generates __wrapper__device_stub_<kernel>() host-side forwarding functions, the .nvHRKI/.nvHRDE/.nvHRCE ELF host-reference arrays, the __cudaRegisterFatBinary/__cudaRegisterFunction registration table, and the CRC32-derived module ID. tileiras is device-only. No kernel-launch lowering, no host-side stub synthesis, no fat-binary registration boilerplate.
- No lambda machinery. cudafe++ injects template wrappers (__nv_dl_wrapper_t, __nv_hdl_wrapper_t, __nv_hdl_create_wrapper_t) to carry extended __device__ and __host__ __device__ lambdas across the host/device boundary, driven by 1024-bit capture-count bitmasks. tileiras has no concept of a lambda or a capture. Whatever upstream tool produces the bytecode has already lowered any C++ lambda away by the time tileiras sees it.
- No template instantiation. cudafe++ runs a full C++ template instantiation worklist with deduction, partial specialization, SFINAE, and constexpr evaluation. tileiras has no template engine: no instantiation queue, no template parameter binding table, no constexpr tree-walker. Template specialization is a source-language concept that does not exist in MLIR bytecode.
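The "bytecode, not source text" boundary in the list above can be illustrated with a byte-level check. This is a minimal sketch, not tileiras's actual reader. It assumes the four-byte header used by upstream MLIR bytecode ('M', 'L', 0xEF, 'R'); whether tileiras expects exactly this header is not established on this page, so the magic is a parameter. The helper names are hypothetical.

```python
# Upstream MLIR bytecode files begin with these four bytes: 'M', 'L', 0xEF, 'R'.
# ASSUMPTION: tileiras may use a different header; pass `magic` to override.
UPSTREAM_MLIR_MAGIC = b"ML\xefR"


def looks_like_bytecode(data: bytes, magic: bytes = UPSTREAM_MLIR_MAGIC) -> bool:
    """Return True if the buffer starts with the given bytecode magic."""
    return data.startswith(magic)


def describe_input(data: bytes) -> str:
    """Hypothetical helper: label a buffer as bytecode or as something else."""
    if looks_like_bytecode(data):
        return "bytecode"
    return "not bytecode"
```

Under these assumptions, a CUDA C++ source buffer such as `b"__global__ void k() {}"` is labeled "not bytecode". A tool whose only inbound surface is a bytecode reader has nothing to do with such a buffer, which is why no parser is needed.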
Why people might confuse them
The confusion is structural rather than semantic. Both binaries live in the same bin/ directory of a CUDA Toolkit installation. Both are stripped, statically linked NVIDIA-internal ELF binaries. Both are invoked transparently by nvcc. Both are presented as parts of the CUDA toolchain. Both touch device-side code. None of those surface similarities reflects any internal overlap. The two tools operate at different layers of the pipeline (cudafe++ at source translation, tileiras at device-IR-to-PTX lowering) and never see each other's outputs.
What cudafe++ actually does
cudafe++ is the CUDA C++ source-to-source translator. It accepts a .cu translation unit, runs the EDG 6.6 C++ frontend over it, separates device code from host code via execution-space attributes (__device__, __host__, __global__), and produces two outputs: an EDG IL stream consumed by cicc, and a transformed .int.c file consumed by the system C++ compiler (gcc, clang, or MSVC). cudafe++ is not a compiler in the conventional sense — it never emits PTX, never emits cubin, and never emits machine code. It is a frontend that splits a CUDA translation unit and hands the two tracks to different downstream tools.
Redirect
This wiki documents tileiras only. For cudafe++ documentation — its EDG frontend internals, the 5-pass IL finalization, the 85-entry-kind IL graph, the .int.c emission format, the CUDA execution-space bitfield, lambda wrapper template injection, the 276-flag CLI surface, and the 3,795-entry diagnostic table — see the separate cudafe++ wiki at nvopen-tools/cudafe++/wiki/.
Boundary table
The four NVIDIA device-toolchain binaries, their inputs, outputs, and roles:
| tool | input | output | role |
|---|---|---|---|
| cudafe++ | .cu source (CUDA C++) | .int.c (transformed C/C++ host source) + EDG IL stream | C++ source-to-source translator; host/device split |
| cicc | .cu / .i / EDG IL | PTX text | CUDA-to-PTX compiler (EDG 6.6 + NVVM bridge + LLVM NVPTX backend) |
| tileiras | MLIR bytecode (cuda_tile dialect) | host ELF (elf.o) wrapping PTX (and optional SASS section) | MLIR-to-PTX optimizing assembler (53-pass MLIR pipeline + shared NVPTX backend) |
| ptxas | PTX text | SASS / cubin | PTX-to-SASS assembler |
cudafe++ is the gate at the source boundary; cicc is the conventional source-language compile path; tileiras is the optimizing-assembler path for tile-shaped kernels expressed in MLIR; ptxas is the final SASS encoder. tileiras and cudafe++ sit on separate paths through this toolchain and never interact.
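The boundary table can also be read as data. The sketch below encodes each row as a set of input and output artifact kinds and checks whether one tool's declared outputs match another's declared inputs. The artifact labels (cuda_source, edg_il, and so on) are illustrative names invented for this example, not real file-type identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset
    outputs: frozenset


# One entry per table row; labels are hypothetical shorthand for the cells.
TOOLS = {
    t.name: t
    for t in (
        Tool("cudafe++", frozenset({"cuda_source"}),
             frozenset({"int_c", "edg_il"})),
        Tool("cicc", frozenset({"cuda_source", "preprocessed_source", "edg_il"}),
             frozenset({"ptx"})),
        Tool("tileiras", frozenset({"mlir_bytecode"}),
             frozenset({"host_elf"})),
        Tool("ptxas", frozenset({"ptx"}),
             frozenset({"sass_cubin"})),
    )
}


def feeds(producer: str, consumer: str) -> bool:
    """True if any declared output of `producer` is a declared input of `consumer`."""
    return bool(TOOLS[producer].outputs & TOOLS[consumer].inputs)
```

With this encoding, `feeds("cudafe++", "cicc")` and `feeds("cicc", "ptxas")` hold, while `feeds("cudafe++", "tileiras")` and `feeds("tileiras", "cudafe++")` do not. The model follows the table's output column only, so the PTX that tileiras hands to ptxas internally is not represented.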
Reimplementation Notes
Do not model tileiras as a cudafe++ mode. A clean driver should keep the responsibilities separate:
cudafe++:
  input: CUDA C++ source
  work: split host and device code, lower launches, emit host-side transformed source
  output: host-side source plus device-side compiler input
tileiras:
  input: TileIR MLIR bytecode
  work: verify bytecode schema, run MLIR/NVVM/NVPTX lowering, invoke ptxas
  output: host ELF object carrying the generated device code
The only shared orchestration point is nvcc, which chooses which downstream compiler to run. The tools themselves should remain independent in any faithful reconstruction.
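One way a reconstruction might keep the two tracks apart is to select a plan from the input kind and never mix tools across plans. The sketch below is a simplified illustration, not nvcc's real phase list: the InputKind enum, the plan contents, and the "host-compiler" placeholder are all assumptions made for the example.

```python
from enum import Enum


class InputKind(Enum):
    CUDA_SOURCE = "cuda_source"
    TILEIR_BYTECODE = "tileir_bytecode"


# Hypothetical, simplified plans: the tools a driver launches directly.
# tileiras invokes ptxas itself, so ptxas is not listed on that track.
DRIVER_PLANS = {
    InputKind.CUDA_SOURCE: ("cudafe++", "cicc", "ptxas", "host-compiler"),
    InputKind.TILEIR_BYTECODE: ("tileiras",),
}


def plan_for(kind: InputKind) -> tuple:
    """Return the tool sequence for one input kind; the tracks share no tool."""
    return DRIVER_PLANS[kind]
```

The design point is that dispatch happens once, in the driver. Neither plan contains a tool from the other, so tileiras never needs a "cudafe++ mode" and cudafe++ never needs to know that bytecode inputs exist.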