Troubleshooting and Known Issues
Abstract
A symptom-to-root-cause catalog. Each entry pairs the user-visible text — the literal stderr line, the exit code, or the silent behavior — with the layer that produced it and the change that resolves it. The catalog also covers diagnostic typos that are part of the public contract (and therefore preserved), wire-format incompatibilities between tileiras's bytecode and upstream MLIR tooling, and a set of behaviors that are not bugs but frequently confuse first-time users.
The page is organized by where in the pipeline the error originates rather
than by what the user typed. Driver-level rejections fire before any pass
runs and produce exit codes 2 or 3 from
Driver Program Handle.
Verifier failures fire inside the pass manager and produce exit code 5.
Codegen failures surface as either an LLVM report_fatal_error (abort) or
an exit code 5 carrying a ptxas-shaped tail. The verifier-ladder positioning
of each layer is documented in
Correctness Layers, and the exit-code contract is
documented in Error Handling and Diagnostics.
Symptom-driven index
| If you see | Read |
|---|---|
failed to parse IR bytecode (it looks like MLIR bytecode instead) | Bytecode parse failures |
unknown attribute tag from the bytecode reader | Bytecode parse failures |
unsupported GPU target, invalid optimization level | Driver-level rejections |
optimized debugging is not supported | Driver-level rejections |
could not find libdevice | Driver-level rejections |
op expects ... arguement types to match ... (note the typo) | Verifier failures |
Formal parameter space overflowed (X bytes required, max Y bytes allowed) | Verifier failures |
Not a canonical UMMA_MN Layout: No flat offset mode | Verifier failures |
colletor::a (note the typo) in tcgen05 diagnostics | Verifier failures, Known typos |
| Cannot find WGMMA in selection, sm_90 target | Codegen failures |
Function uses too much shared data, ptxas stderr | Runtime and ptxas failures |
| Cluster launch silently fails or aborts at runtime | Gotchas |
| TMA descriptor produces garbage output | Gotchas |
Bytecode parse failures
The bytecode reader is the first stage every invocation passes through. Three failure modes surface here, all returned as driver exit code 3.
Symptom. input does not correspond to Tile IR bytecode. Exit code 3.
Cause. The magic word at the head of the file is not the Tile IR magic.
The driver checked looks_like_mlir_bytecode first and that probe also
failed; the file is neither Tile IR bytecode nor upstream MLIR bytecode.
Fix. Verify the producer. Tile IR bytecode comes from the frontend's
serializer; a file with a .bc extension is almost certainly LLVM bitcode,
which belongs to a different stage of the toolchain (ptxas), not tileiras.
Symptom. failed to parse IR bytecode (it looks like MLIR bytecode instead).
Exit code 3.
Cause. The magic-word probe matched upstream MLIR bytecode. This is
almost always a wire-format incompatibility (see
Wire-format incompatibilities below): the
caller used mlir-translate --serialize-bytecode or
mlir-opt --emit-bytecode on a module that happens to import Tile IR
dialects, and the resulting file uses upstream AttrTag numbering rather
than tileiras's.
Fix. Re-serialize through the tileiras-aware bytecode writer. The
frontend's emission path is described in
Frontend Contract and Tile IR Emission;
the bytecode envelope itself is in
MLIR Bytecode Format.
Symptom. unknown attribute tag N from the bytecode reader, with N an
integer. Exit code 5 (parser failures surface after the magic check
succeeded, so the driver classes them as compile failures rather than
configuration failures).
Cause. The bytecode envelope passed the magic check but a per-dialect
reader hit an AttrTag value it does not recognize. Tileiras's
Dialect Reader/Writer Status
table shows which dialect-specific readers know which tag ranges; a tag
outside the recorded ranges typically means the producer is from a
different tileiras snapshot.
Fix. Pin the producer and the consumer to the same tileiras revision.
The AttrTag numbering is not stable across snapshots.
Driver-level rejections
The driver validator runs before any pass begins and rejects ill-formed inputs with exit code 2 (configuration) or 3 (bytecode shape). The verbatim text appears in Driver CLI Options — Validation Algorithm.
Symptom. unsupported GPU target. Exit code 2.
Cause. --gpu-name was set to a string the driver's accept table does
not list. The accept table covers sm_100, sm_103, sm_110, sm_120,
sm_121. Note in particular that sm_90 is not in the driver's accept
table — Hopper is reachable only through a frontend that writes the
nv_tileaa.compute_capability module attribute directly.
Fix. Pick a supported spelling. The full list is in
Driver CLI Options — Enum-valued Options;
the subtarget mechanism that turns the spelling into .target sm_NNa
is documented in
PTX Version and Target Selection.
Symptom. invalid optimization level. Exit code 2.
Cause. --opt-level was outside 0..3. The driver checks (uint32_t)opt > 3
so negative values appear as huge unsigned values and also fail.
Fix. Use 0, 1, 2, or 3. The driver and pipeline use different
defaults (3 vs 2); both are valid on the CLI.
Symptom. optimized debugging is not supported, change optimization level to 0 or disable full debug info.
Exit code 2.
Cause. --device-debug (or -g) was combined with --opt-level != 0.
Full device debug injects NVVM debug options that disable several
code-motion and block-merge transforms; the driver rejects the
combination rather than silently degrading the build.
Fix. Either drop -g or set -O0. For lighter source-line context
without disabling optimization, use --lineinfo instead.
Symptom. could not find libdevice.bc, or a missing-symbol error from
the math pipeline after the NVVM-Reflect pass.
Cause. Tileiras resolves the libdevice path from the environment
variables CUDA_HOME, CUDA_PATH, CUDA_ROOT, and the install-relative
fallback; when none of them point at a directory containing
libdevice.10.bc, the math pipeline cannot link the device math
intrinsics and ends up with unresolved symbols at the NVVM verifier
layer.
Fix. Export one of CUDA_HOME, CUDA_PATH, or CUDA_ROOT to the
CUDA install root that ships nvvm/libdevice/libdevice.10.bc. The full
env-var contract is in
Env Vars and Runtime Gates;
the link order is covered in Libdevice Overview.
Symptom. unsupported host operating system, unsupported host architecture. Exit code 2.
Cause. --host-os or --host-arch was a string outside the accept
table (linux, windows and x86_64, aarch64, arm64ec respectively).
Fix. Pick a supported spelling. There is no autodetection on the CLI
surface; the defaults are platform-derived but the CLI parser rejects
arbitrary strings.
Verifier failures
Verifier diagnostics fire inside the pass manager and propagate to the driver as exit code 5. The text is part of the public contract — frontends and tests key off the exact spelling. The verifier ladder (per-op, between-pass, named, NVVM IR) is described in Correctness Layers.
Symptom. op expects ... arguement types to match with producer types ....
Note the verbatim typo arguement.
Cause. A region-bearing op (typically nv_tileas.async.pipeline.consume_one)
was left with a region-argument list that does not match its paired producer's
result list. This is almost always a partial-rewrite leftover from a pass that
aborted between the producer and consumer rewrites.
Fix. Identify the pass that touched the op last; the between-pass
verifier names it. The verbatim diagnostic is preserved with the typo;
see Known typos and
nv_tileas Verifiers — Region-Op Verifier Template.
Symptom. Formal parameter space overflowed (X bytes required, max Y bytes allowed) in function Z.
Cause. The kernel's by-value parameter struct, summed with target
alignment rules, exceeds the SM's parameter-space ceiling. For sm_75 the
ceiling is 1024 bytes; for sm_80 it is 32760 bytes; the per-SM table is
in NVVM IR Verifier.
Fix. Restructure the kernel signature. Move the bulk of the data
behind a global-memory pointer, split into two kernels, or pack with
explicit alignment. The diagnostic is the actionable form; the ptxas
analogue is far less specific.
Symptom. Not a canonical UMMA_MN Layout: No flat offset mode.
Cause. A cute_nvgpu layout passed to a UMMA verifier does not match
the verifier's expected canonical shape. The verifier walks the layout's
mode tree and requires a flat offset mode at a specific position.
Fix. Have the frontend emit the canonical layout shape, or run a
layout-canonicalization pass before the consumer. The mode-pattern
contract is in
cute_nvgpu Mode-Pattern Verifiers.
Symptom. expects #C element type to be f32, but got <type>.
Cause. A WGMMA or tcgen05.mma op was built with a non-f32
accumulator type. The verifier rejects accumulator types its emission
template does not have a code path for.
Fix. Use f32. The accumulator-type matrix is in
WGMMA Emission Protocol and
tcgen05 Tensor Memory Model.
Symptom. expected TileType for block arguments but got types: ....
Cause. The frontend built a block whose entry argument types do not
match the surrounding tile op's contract. This typically means the
emitter passed an LLVM-tier type into a Tile-tier region.
Fix. Run the frontend's tile-type fixup before the tileiras entry
point; the tile-type discipline is described in
Frontend Contract and Tile IR Emission.
Symptom. 'tcgen05.alloc' op expects colletor::a layout (note the
typo colletor::a).
Cause. A tcgen05 allocation was passed a layout that does not match
the expected collector shape.
Fix. Search source by the literal typo, not by the corrected
spelling. The full diagnostic catalog for the dialect is in
tcgen05 Ops.
Codegen and backend failures
These fire during MLIR-to-LLVM conversion or NVPTX instruction selection.
Symptom. WGMMA requires sm_90a, or an unhelpful "Cannot select"
message from the NVPTX selector.
Cause. WGMMA is arch-conditional. Plain sm_90 does not enable the
WGMMA instruction set; the a suffix on the target string is required.
Tileiras's driver does not accept sm_90a as a --gpu-name value — the
suffix is selected by the frontend through the
nv_tileaa.compute_capability module attribute.
Fix. Have the frontend write the arch-conditional attribute. The
mechanism is in PTX Version and Target Selection.
Symptom. Cannot select tcgen05.* intrinsic, or a similar selector
error.
Cause. tcgen05 instructions are introduced at sm_100. A target below
sm_100 cannot legalize them.
Fix. Use --gpu-name=sm_100 (and ensure the frontend selects
sm_100a if the program uses arch-conditional variants). The per-SM
intrinsic matrix is in
NVPTX Subtarget and Feature Matrix.
Symptom. unsupported tma load mode '<mode>', where the mode name
contains im2col.
Cause. The frontend emitted an im2col TMA descriptor against a target
that does not implement that descriptor variant. Im2col was added later
than the basic tiled mode and is not available on every SM that has TMA.
Fix. Pick the basic tiled mode or pick a target that supports im2col.
The atom registry is in TMA Atoms.
Symptom. Compilation succeeds but dsmem operations produce zero or
garbage at runtime on sm_80.
Cause. Distributed shared memory (dsmem) requires sm_90 or higher.
Lowering does not reject the op on sm_80 at the dialect level; the lowered
PTX exists but the hardware path is absent.
Fix. Target sm_90 or higher. The DSMEM handshake is documented in
Cluster Sync and DSMEM Handshake.
Runtime and ptxas failures
Tileiras invokes ptxas as a subprocess. ptxas diagnostics surface through the harness's stderr capture; the driver returns exit code 5 carrying the ptxas text.
Symptom. ptxas: error: Function '<name>' uses too much shared data.
Cause. The kernel's static + dynamic shared-memory footprint exceeds
the per-SM shared-memory ceiling. Tileiras's smem-accounting does not
re-check the ceiling after pipeline buffer assignment, so the limit
surfaces only at ptxas.
Fix. Reduce SMEM pressure — fewer pipeline stages, smaller tile
sizes, or split into two kernels. See
Buffer Assignment and Named-Barrier Binding.
Symptom. ptxas: error: Multiple kernel definitions for the same
function name.
Cause. The frontend emitted the same nvvm.kernel function twice,
typically because two upstream modules with the same kernel name were
merged before tileiras saw them.
Fix. Deduplicate at the frontend. Tileiras does not rename to break
collisions.
Symptom. ptxas: error: Address out of range, or any internal ptxas
assertion.
Cause. Almost always a tileiras codegen bug. ptxas does not normally
diagnose the producer; an Address-out-of-range from ptxas usually means
the AsmPrinter emitted an out-of-range immediate.
Fix. Report. Capture the full invocation (see
Reporting a bug).
Symptom. ptxas exits non-zero with no stderr text. Cause. ptxas crashed (signal exit) rather than failing cleanly. The subprocess harness reports the non-zero status but cannot reconstruct a diagnostic from a signal-killed child. Fix. Re-run ptxas directly on the PTX text tileiras emitted; the harness logs the argv on debug builds. The handoff protocol is documented in ptxas Handoff Protocol.
Known typos in diagnostic strings
The following typos are present in the binary's diagnostic strings and are preserved across snapshots. Downstream tooling (log scrapers, test-failure classifiers, frontend translation tables) keys on the verbatim text, so a corrected spelling would be a wire-format-style break. When grepping the binary, the build directory, or production logs, use the typo'd form; the corrected form will not match.
| Verbatim string in binary | Corrected English | Where it fires |
|---|---|---|
colletor::a | collector::a | tcgen05 layout verifier in tcgen05 Ops |
arguement (e.g. region arguement types to match) | argument | region-op verifiers in nv_tileas Verifiers |
types to be match | types to match | several pattern-side rewrite diagnostics |
paramater (occasional) | parameter | a small number of parser callouts |
succeded | succeeded | a single info-class message from the pass instrumentation |
A reimplementer who silently corrects the spelling produces a binary whose diagnostics no longer line up with the recovered binary's downstream consumers. The corrections must be a coordinated change at the consumer side first.
Wire-format incompatibilities
Tileiras's bytecode envelope reuses upstream MLIR's container format but the per-dialect AttrTag numbering is tileiras-specific. Mixing upstream MLIR tooling with tileiras bytecode produces files that pass the magic check but fail at the per-dialect reader.
Pitfall. mlir-translate --serialize-bytecode on a module that
imports cuda_tile, nv_tileaa, nv_tileas, cute, or cute_nvgpu
emits the upstream AttrTag numbering. When tileiras reads it, the
per-dialect reader sees a tag value outside its accept range and emits
unknown attribute tag N. Use the tileiras-aware serializer that
ships with the frontend.
Pitfall. mlir-opt --emit-bytecode produces the same shape and
fails the same way. The --emit-bytecode flag is in upstream
mlir-opt; the tileiras driver does not expose an --emit-bytecode
mode because it consumes bytecode rather than producing it.
Pitfall. Bytecode produced by an older tileiras snapshot may fail
the version word check at the envelope level (unsupported version),
even if every dialect tag would otherwise be readable. The envelope
version is independent of the dialect AttrTag versioning, and both must
match.
Pitfall. Mixing two tileiras snapshots' bytecode in a single multi-module compile (for example, by linking a prebuilt library bytecode with a newly emitted module) is not supported. The dialect tag numbering can shift between snapshots without an envelope-level version bump.
The full bytecode envelope is documented in MLIR Bytecode Format; per-dialect tag range support is in Dialect Reader/Writer Status.
Gotchas
These are not bugs. They are mechanism details that routinely trip up first-time users.
Cluster dim must be a power of 2. The cluster verifier accepts only
1, 2, 4, or 8 for each cluster-dim component, with the product
bounded by 16. A frontend that emits cluster_dim = [3, 1, 1] fails
the verifier. The hardware does not implement non-power-of-2 cluster
shapes; the verifier is enforcing a hardware constraint, not a
tileiras convention. See
Cluster Sync and DSMEM Handshake.
TMA descriptor alignment is 128 bytes. The TMA descriptor passed to
cp.async.bulk.tensor must be aligned to 128 bytes in host memory.
Tileiras's host-side descriptor mutator assumes the alignment; a
misaligned descriptor produces incorrect copies at runtime rather than
a clean error. The frontend's descriptor builder is responsible for the
alignment; if the descriptor lives in a host allocator that does not
honor the alignment, the descriptor builder must over-allocate and
align manually.
WGMMA accumulator reads inside the body are silent UB. The WGMMA
accumulator can only be read after wgmma.wait_group N completes for
the relevant group. Reading the accumulator earlier — inside the
warp-group body before the matching wait — produces no diagnostic and
no compile-time rejection; the read silently returns stale data. The
read-after-wait discipline is documented in
WGMMA Emission Protocol.
⚡ QUIRK — accumulator-before-
fence_asyncis silent UB, not a verifier error The natural assumption is that any verifier that knows about WGMMA also knows that the accumulator is asynchronously written and would diagnose a too-early read. It doesn't. The mbarrier/fence_async ordering is enforced only at runtime by the hardware's async-proxy ordering rules; the MLIR verifier accepts a use of the accumulator SSA value at any point after the WGMMA op, including betweenwgmma.commit_groupandwgmma.wait_group N. The read compiles, runs, and returns whatever bits the accumulator register held before the WGMMA retired — usually the previous iteration's result, occasionally garbage from a sibling warp-group's register reuse. There is no--Werrorflag that catches it; the discipline is a frontend obligation.
--use-fast-math enables FTZ even when no op carries fast-math
flags. The driver's fast-math flag is a global on/off; it does not
key off per-op MLIR fast-math metadata. A program that writes
fast-math-free MLIR but compiles with --use-fast-math (or the
pipeline option ftz=true) still emits FTZ-mode arithmetic. To get
non-FTZ arithmetic, leave the global flag off.
⚡ QUIRK —
--use-fast-mathis a module-wide FTZ master switch with no per-function escape hatch Upstream LLVM exposes FTZ as a function attribute (denormal-fp-math) plus per-opFastMathFlags, so a single hot kernel can opt in while the rest of the module stays IEEE. Tileiras's driver flag short-circuits that: enabling--use-fast-mathwrites theunsafe-fp-mathFunction attribute onto every function it lowers, and the case-0x66FMA selector reads that attribute and picks the FTZ opcode unconditionally. There is no__attribute__((noflush))or per-function override that reverses the global flag — a function that needs IEEE-denormal arithmetic must be compiled in a separate invocation with the flag off, then linked in. The same applies to the pipeline optionftz=true.
Libdevice link order matters. NVVMReflect must run before
always-inline. If a pipeline rearrangement moves always-inline above
NVVMReflect, libdevice's __nvvm_reflect calls get inlined with
unresolved arguments and the math intrinsics emit fallback bodies
rather than fast paths. The pass order is in
NVVMReflect Mechanism.
--gpu-name does not accept the a or f suffix. The driver's
parser only matches the bare sm_NN form. Architecture-conditional
selection (sm_90a, sm_100a, etc.) is decided by the frontend's
module attribute and combined with --gpu-name to pick the final
.target line. A user trying to "force" sm_100a from the CLI cannot
do so; the frontend must write the attribute.
⚡ QUIRK —
--gpu-name=sm_90ais silently rejected, not diagnosed The driver's--gpu-nameaccept table contains only the bare numeric forms (sm_100,sm_103,sm_110,sm_120,sm_121); the family- and arch-conditional suffixes (a,f) that downstream tools likeptxasaccept are not in this table. A user who writes--gpu-name=sm_90adoes not get an "unknown target" diagnostic — the parser either ignores the unrecognised string and falls back to the default or rejects it with a generic "unrecognised --gpu-name" message that does not mention the suffix. The architecture-conditional.target sm_90aline is written by the frontend's module attribute, not the CLI, so the only way to compile forsm_90ais to set that attribute and pass--gpu-name=sm_90— butsm_90is also missing from the accept table on the current driver (Blackwell-first), so Hopper builds bypass--gpu-nameentirely.
The driver --opt-level default differs from the pipeline
opt-level default. Driver default is 3; pipeline default is 2.
A user invoking the pipeline directly through --pass-pipeline
without specifying opt-level gets 2; a user invoking the driver
without --opt-level gets 3. The mismatch is documented but a
common source of "I changed nothing and the output changed" reports.
⚡ QUIRK — driver and pipeline disagree on the default
opt-levelTwo entry points to the same compiler read different defaults from different option tables:--opt-levelon the driver defaults to3, whileopt-level=inside--pass-pipelinedefaults to2. Identical IR therefore produces different PTX depending on whether the user invoked the driver or hand-rolled the pipeline, with no warning either way. This is the canonical "I changed nothing and the output changed" failure mode — pinopt-level=explicitly in both invocations to make builds reproducible across entry points.
sm_90 is not in the driver's accept table. The driver targets
Blackwell as its primary deployment surface. Hopper builds go through
a frontend that writes compute_capability directly and bypass the
--gpu-name table.
Reporting a bug
A useful bug report contains five artifacts. Capture them at the time of the failure; reconstructing them after the fact is significantly harder.
- The exact tileiras invocation. The full argv vector and the
environment variables
CUDA_HOME,CUDA_PATH,CUDA_ROOT,LD_LIBRARY_PATH, and anyTILE*variable. The env-var catalog is in Env Var and Runtime Gate Catalog. - The input Tile IR. Either the original bytecode buffer or the
textual form produced by
--mlir-print-ir-before-all(which dumps the IR before every pass; the last entry before the crash is the one that mattered). See Debugging and Introspection. - The full stderr output. Tileiras's diagnostics arrive in emission order; the first diagnostic is usually the actionable one even when later diagnostics look more dramatic.
- The scheduler trace, if the failure is scheduling-related. Pass
schedule-trace-file=<path>through--pass-pipelineto emit the Chrome-timeline JSON. The format is described in Modulo Driver and 4-Arm OR-Chain. - The SM target, CUDA version, and host platform. SM target is in the
argv; CUDA version comes from
${CUDA_HOME}/version.txtornvcc --version; host platform isuname -srmon Linux,cmd /c veron Windows.
A report that omits any of these forces the maintainer to ask for it before any diagnosis can begin; one with all five usually permits an immediate root-cause hypothesis.
Cross-references
Error Handling and Diagnostics is
the canonical reference for the diagnostic engine, the pass-failure
handshake bit, and the five driver-level exit codes; the symptoms in
this page are organized by the layer that produces them in that page's
architecture.
Correctness Layers places each verifier-level
failure mode in its layer (per-op, between-pass, named, NVVM IR,
ptxas), explaining why the same overflow can appear with two different
diagnostic texts at two different layers.
Driver CLI Options is the source for every
driver-level rejection string; the validation algorithm there matches
the Driver-level rejections section
line-for-line.
Debugging and Introspection is the
primary reference for the introspection flags
(--mlir-print-ir-before-all, --mlir-print-ir-after-failure,
schedule-trace-file, dump-host) used to assemble a bug report.
MLIR Bytecode Format and
Dialect Reader/Writer Status
document the envelope and AttrTag contracts that the
Wire-format incompatibilities section
describes from the user's side.
Frontend Contract and Tile IR Emission
documents the producer side of the same wire-format contract and the
tile-type discipline a frontend must observe to avoid the verifier
failures in this page's catalog.
Env Vars and Runtime Gates
covers the CUDA_HOME / CUDA_PATH / CUDA_ROOT resolution that the
libdevice gotcha turns on.
Testing and Observability takes the
verbatim diagnostic strings catalogued here — including the preserved
typos in Known typos — and shows
how to pin them as golden-test assertions so a downstream regression
suite detects diagnostic-catalog drift between snapshots.