Testing and Observability
Abstract
Tileiras ships no public test harness. The binary is stripped, its source is closed, and there are no exposed unit-test entry points. What the compiler does expose is an observation surface wide enough that an integrator can build a complete validation suite around it from outside: stderr is the diagnostic catalog, exit codes are the pass/fail oracle, --mlir-print-ir-after-all is a per-pass IR snapshot, --schedule-trace-file is a per-decision scheduler log, --mlir-pass-timing is a per-pass walltime breakdown, and the emitted PTX is the final golden artifact. Every layer that produces a diagnostic emits a verbatim string the user can pin tests against; every pass that mutates IR emits a snapshot the user can diff.
This page enumerates each observation surface, the test pattern it supports, and the failure modes a regression suite built on those surfaces will and will not catch. The principle running through all of it is that tileiras is a black box with six honest windows, and a tester who knows where each window faces can build robust validation without source.
Observable surfaces
The six surfaces below are the only mechanisms by which tileiras's behavior reaches a test harness. Everything else is inaccessible from outside the binary: cost evaluations the scheduler trace does not record, the verifier's branch-by-branch decisions, and the seed behind the tie-break order.
| Surface | Output | Test pattern it enables | Cost when enabled |
|---|---|---|---|
| Stderr diagnostics | Verbatim error and warning strings | Diagnostic golden tests | None on success, per-message on failure |
| Exit code | Integer 0..5 from Driver Program Handle | Pass/fail classification, error-class bucketing | None |
| Pipeline IR snapshots | Post-pass MLIR text from --mlir-print-ir-after-all | Per-pass invariant validation, regression bisection | AsmPrinter throughput per snapshot point |
| Pass timing | Per-pass wall-clock from --mlir-pass-timing | Performance regression detection | Negligible |
| Scheduler trace | Per-decision JSON log from --schedule-trace-file=PATH | Scheduling-determinism validation, gate-rejection bucketing | Tens of MB per heavily pipelined kernel |
| Generated PTX | Final stdout text | Output-level golden tests | Always emitted |
| Symbol dumps | None (binary is stripped) | — | — |
The stripped-binary row is not cosmetic. The static symbol table is gone: plain nm tileiras and objdump --syms find no symbols, and nm -D lists only the minimal dynamic-symbol table. A tester cannot identify which internal function emitted a diagnostic by symbol, only by the diagnostic's verbatim text and, when --mlir-print-stacktrace-on-diagnostic is enabled, by a backtrace whose frames resolve to address-only lines. The diagnostic text is the only stable identifier.
Differential testing patterns
Each pattern below uses one or more of the surfaces above. The patterns compose — a regression suite typically chains several of them — but each has a single dominant failure mode it is designed to catch.
Pattern 1: Output golden tests
Compile a small fixed input with a fixed --gpu-name, --opt-level, and pipeline option set. Capture stdout. Diff against a previously recorded golden PTX file. Any pipeline change that affects emission reveals itself as a non-empty diff.
The diff itself is not the verdict. PTX text has elements that legitimately vary across builds: comment headers carrying timestamps, virtual-to-physical register naming inside .reg declarations, label numbering inside basic-block tails. A robust harness normalizes those before comparison. The structural diff — the instruction sequence, the launch-bound directives, the parameter declarations — is what matters.
Golden updates are reviewable code-review artifacts. A diff that changes a single .maxntid directive is a one-line review; a diff that rewrites every WGMMA shape is a structural-change review. Treating goldens as VCS-tracked source rather than as autogenerated noise keeps the review loop honest.
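The normalization step described above can be sketched as follows. This is a minimal illustration, not a complete PTX parser: the helper name `normalize_ptx` is hypothetical, and the regular expressions assume registers of the form `%r12` / `%rd3`, labels of the form `BB0_42` (optionally prefixed with `$L__`), and `//` line comments only.

```python
import re

COMMENT = re.compile(r"//.*")                      # line comments (timestamps, headers)
REG_DECL_COUNT = re.compile(r"<\d+>")              # the count in `.reg .b32 %r<128>;`
REGISTER = re.compile(r"%([A-Za-z_]+)(\d+)\b")     # %r12, %rd3, %p1, ...
LABEL = re.compile(r"(?:\$L__)?BB\d+_\d+")         # BB0_42 or $L__BB0_42


def normalize_ptx(text: str) -> str:
    """Drop comments and renumber registers and labels by first appearance."""
    regs: dict[str, str] = {}
    labels: dict[str, str] = {}
    counters: dict[str, int] = {}

    def rename_reg(match: re.Match) -> str:
        name = match.group(0)
        if name not in regs:
            prefix = match.group(1)
            regs[name] = f"%{prefix}#{counters.get(prefix, 0)}"
            counters[prefix] = counters.get(prefix, 0) + 1
        return regs[name]

    def rename_label(match: re.Match) -> str:
        return labels.setdefault(match.group(0), f"L#{len(labels)}")

    out = []
    for raw in text.splitlines():
        line = COMMENT.sub("", raw).strip()
        if not line:
            continue
        if line.startswith(".reg"):
            # Declaration counts shift with register count; keep only the shape.
            line = REG_DECL_COUNT.sub("<N>", line)
        else:
            line = REGISTER.sub(rename_reg, line)
            line = LABEL.sub(rename_label, line)
        out.append(line)
    return "\n".join(out)
```

Renumbering by first appearance, rather than replacing every register with one fixed token, keeps the data flow visible: two outputs compare equal only if the same registers are reused in the same positions.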
Pattern 2: Diagnostic golden tests
For each diagnostic in the catalog from Troubleshooting and Known Issues — Symptom-driven index, construct a small input designed to trigger it. Compile. Assert that the captured stderr contains the verbatim diagnostic string and that the exit code is the documented one.
The pattern verifies that the diagnostic catalog stays stable across snapshots. Tileiras's diagnostic strings are part of the public contract — downstream log scrapers, frontend translation tables, and CI failure classifiers key off the verbatim text, including the preserved typos (arguement, colletor::a, succeded) enumerated in Troubleshooting — Known typos. A test that silently passes after a diagnostic text changes is a false negative; matching the literal string with the typo is correct.
A useful refinement: pin the exit code along with the text. The diagnostic "optimized debugging is not supported" always exits with code 2; "unknown attribute tag N" always exits with code 5. A test that asserts both catches the case where the text stays but the classification drifts.
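A diagnostic golden that pins both text and exit code might look like the sketch below. The `DiagnosticGolden` type and `check_diagnostic` helper are hypothetical names introduced for illustration; the harness is assumed to have already captured stderr and the return code from a compiler invocation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticGolden:
    text: str       # verbatim diagnostic substring, typos included
    exit_code: int  # documented exit code for this diagnostic


def check_diagnostic(golden: DiagnosticGolden, stderr: str, returncode: int) -> list[str]:
    """Return a list of failure messages; an empty list means the test passed."""
    failures = []
    if golden.text not in stderr:
        failures.append(f"missing diagnostic text: {golden.text!r}")
    if returncode != golden.exit_code:
        failures.append(f"exit code {returncode}, expected {golden.exit_code}")
    return failures
```

Reporting the two checks separately distinguishes a text drift from a classification drift in the CI log.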
Pattern 3: Cross-version diff testing
Run two snapshots of tileiras (for example, the CUDA 13.0 binary and the CUDA 13.1 binary) on the same fixed input with the same flags. Diff stdout. Any non-empty diff is a behavioral change between the two releases.
The pattern is the canonical way to learn what changed in a snapshot when no changelog is published. A non-empty diff narrows the question to "which pass produced this output difference"; that question is then answerable with pattern 5 below. The two patterns together reduce "the new binary is different" to "this pass at this stage produces different IR".
Cross-version testing is the only way to identify silent behavior changes. A snapshot that adds a new optimization without a new diagnostic produces no exit-code change, no stderr change, and no timing red flag — only an output diff at the PTX level reveals it.
Pattern 4: Cross-target diff testing
Compile the same input for sm_90a, sm_100a, sm_103a, and sm_120a, then diff the PTX outputs pairwise. The diff reveals the per-architecture dispatch: which atoms differ, which intrinsics differ, which scheduler decisions differ. The cross-target matrix is documented in PTX Version and Target Selection and Matmul Progression by SM.
The pattern catches arch-conditional bugs. An emission template that should produce different instructions for sm_100a and sm_120a but produces identical output for both is a regression; the diff is empty when it should not be. Conversely, a template that should produce identical instructions for both but produces different output is also a regression; the diff is non-empty when it should be empty. The expected-shape matrix has to be encoded into the harness — bare diffing only tells the tester that something changed, not whether the change was intentional.
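One way to encode the expected-shape matrix is as a set of target pairs whose outputs are expected to be identical, with every other pair expected to differ. The sketch below assumes the outputs have already been normalized; `check_target_matrix` is a hypothetical helper, and which pairs belong in `expect_same` is for the integrator to decide per kernel.

```python
from itertools import combinations


def check_target_matrix(
    outputs: dict[str, str],
    expect_same: set[frozenset[str]],
) -> list[str]:
    """Compare every pair of targets against the expected same/different matrix."""
    violations = []
    for a, b in combinations(sorted(outputs), 2):
        same = outputs[a] == outputs[b]
        should_match = frozenset((a, b)) in expect_same
        if same and not should_match:
            violations.append(f"{a} vs {b}: identical output, expected a difference")
        elif not same and should_match:
            violations.append(f"{a} vs {b}: outputs differ, expected identical")
    return violations
```

Both regression directions from the paragraph above surface as violations: an empty diff where a difference was expected, and a non-empty diff where identity was expected.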
Pattern 5: Snapshot diff testing
Enable --mlir-print-ir-after-all --mlir-print-ir-module-scope. Capture the per-pass IR stream. When an output regression appears, diff the snapshot stream pass-by-pass between the working and broken configurations. The first pass at which the snapshots diverge is the pass that introduced the regression.
The pattern depends on the snapshot stream being deterministic between identical runs. Tileiras's pipeline is deterministic at fixed opt-level and fixed flags — the modulo scheduler uses stable sort, the cost-based arm uses lexicographic ranking, and the random-tie-break path documented in Modulo Scheduler and Rau is seeded from input hash rather than from wall-clock. Two identical invocations produce byte-identical snapshot streams.
The IR stream is large. An O3 compile on a multi-kernel module can produce hundreds of megabytes of snapshot text. A harness that retains snapshots for every passing test will run out of disk; the right design is to retain snapshots only when an output diff appears, which means the snapshot capture has to be deferred to the second run, not the first.
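The pass-by-pass bisection can be sketched as below. The header pattern is an assumption: it matches the banner upstream MLIR prints (`// -----// IR Dump After <pass> //----- //`), and tileiras's stream may use a different banner. `split_snapshots` and `first_divergence` are hypothetical helper names.

```python
import re

# Assumed banner format, taken from upstream MLIR's IR-printing instrumentation.
HEADER = re.compile(r"^// -----// IR Dump After (.+?) //----- //\s*$")


def split_snapshots(stream: str) -> list[tuple[str, str]]:
    """Split a captured stream into (pass name, IR text) pairs in emission order."""
    snapshots = []
    name, body = None, []
    for line in stream.splitlines():
        match = HEADER.match(line)
        if match:
            if name is not None:
                snapshots.append((name, "\n".join(body)))
            name, body = match.group(1), []
        elif name is not None:
            body.append(line)
    if name is not None:
        snapshots.append((name, "\n".join(body)))
    return snapshots


def first_divergence(good: str, bad: str):
    """Return (index, good pass, bad pass) for the first differing snapshot, or None."""
    g, b = split_snapshots(good), split_snapshots(bad)
    n = min(len(g), len(b))
    for i in range(n):
        if g[i] != b[i]:
            return i, g[i][0], b[i][0]
    if len(g) != len(b):
        # One stream ran more passes than the other.
        return n, (g[n][0] if n < len(g) else None), (b[n][0] if n < len(b) else None)
    return None
```

For streams in the hundreds of megabytes, a streaming variant that compares snapshots as they are read would avoid holding both streams in memory; the version here favors clarity.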
Performance regression patterns
Performance regressions are a separate test class because they fire no diagnostic. The compiler does not know its scheduler picked a worse II; the cost model judged the chosen seat as best. Only an external test can rank "best from the cost model's view" against "best from the user's view".
The two performance-class observation surfaces are pass timing and the scheduler trace.
--mlir-pass-timing produces a per-pass wall-clock breakdown at compile end. A regression that doubles a single pass's runtime is visible as a 2x entry in the timing report. The expected baseline has to be recorded; a percentage threshold of ±5–10% is a sensible default to absorb microsecond-level noise without missing structural regressions. The flag is documented in Debugging and Introspection — MlirAction-based instrumentation.
--schedule-trace-file=PATH produces per-decision JSON for the modulo scheduler. The trace records every (op, cycle) placement attempt, the cost vector it produced, and which gate (G1–G4) accepted or rejected it. A regression that changes the chosen II from 4 to 5 is visible as a different commit decision in the trace; a regression that increases the number of gate-G3 rejections is visible as a different rejection count. The trace format is documented in Debugging and Introspection — Surface 5.
Neither surface catches the case where the compiler picks a structurally similar but slightly worse placement that produces equal-shape PTX with worse runtime SASS performance. That case is invisible to source-side observation and requires runtime profiling against compiled cubin.
Observability gotchas
Several mechanisms produce output that looks deterministic but is not, or output that looks comparable across runs but is not.
PTX register names depend on virtual-register numbering. Two compiles of the same input typically produce identical register naming, but a compile in which an earlier pass produced more or fewer virtual registers can renumber every register downstream of that point. A diff harness that compares %r0..%r127 literally fails on any pipeline change that touches register count; comparing instruction structure with register names normalized to opaque tokens is the robust form.
Label numbering inside basic-block tails is also allocator-dependent. A diff that detects only a BB0_42 becoming BB0_43 is almost always cosmetic. Normalizing label numbers before structural comparison eliminates the noise.
Diagnostic order is not always deterministic when multiple passes can emit independently. The pipeline's verify-each path emits in pass order, which is stable, but a single verify pass walking a multi-function module can visit functions in hash-table order. A test that asserts diagnostic A appears before diagnostic B may pass on one platform and fail on another; assert the set of diagnostics, not their order.
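An order-insensitive comparison is a small change in the harness. The sketch below assumes one diagnostic per stderr line, which may not hold for diagnostics that carry attached notes; `diagnostic_set` and `compare_diagnostics` are hypothetical helper names.

```python
def diagnostic_set(stderr: str) -> frozenset[str]:
    """Collect the non-empty stderr lines as an unordered set."""
    return frozenset(line.strip() for line in stderr.splitlines() if line.strip())


def compare_diagnostics(expected: str, actual: str) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) diagnostics, ignoring emission order."""
    want, got = diagnostic_set(expected), diagnostic_set(actual)
    return sorted(want - got), sorted(got - want)
```

A test passes when both returned lists are empty, regardless of the order in which the verifier visited functions.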
Pass timing has microsecond-level jitter. The MLIR pass-timing harness reports wall-clock, which includes kernel-side OS scheduling jitter, page-fault costs from cold caches, and TLB-fill overhead on the first invocation. Single-run timing comparisons are unreliable; aggregate over a minimum of three runs and compare medians. The --mlir-pass-timing walltime also includes time spent printing IR if any of the print-IR flags are enabled simultaneously; clean timing reports require the print flags to be off.
The diagnostic-stack-trace mechanism has to resolve frames against a stripped binary. Frames within the tileiras binary resolve to address-only entries (tileiras+0x12345); frames within libdevice or libLLVM may resolve to names if the host's libraries are not stripped. A diagnostic-source identification routine that assumes all frames resolve to names will silently skip the tileiras-internal frames and report only the host library frames.
Continuous integration patterns
For a project shipping tileiras-generated PTX as part of its deliverable, the natural CI shape is a four-stage pipeline.
Stage one regenerates every PTX golden from its source on every commit. Failures here mean tileiras refuses the input; capture the exit code and the stderr, and bucket the failure by exit code class — 2 is configuration, 3 is wire-format, 5 is verifier or codegen. The exit-code contract is in Driver Program Handle — Public error codes.
Stage two diffs the regenerated PTX against the VCS-tracked golden. Failures here mean the output changed without a corresponding golden update. The diff is the actionable artifact; treat it as a review-required change rather than as a hard break. A snapshot-stream capture (pattern 5 above) on the failing input localizes which pass introduced the change.
Stage three runs the diagnostic golden tests (pattern 2 above). Each test pins a verbatim diagnostic and an exit code. Failures here mean the diagnostic catalog drifted; the new text becomes a new golden, or the old behavior is restored.
Stage four runs pass-timing measurements against a recorded baseline. Failures here are advisory rather than blocking — a 5% per-pass slowdown is worth investigating but not worth refusing a commit. The blocking threshold should be higher (2x or more) to absorb measurement jitter without false alarms.
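The advisory and blocking thresholds can be applied to per-pass medians as in the sketch below. It assumes the timing reports have already been parsed into per-pass lists of seconds, one entry per run (the report format itself is not modeled here), and `classify_timing` is a hypothetical helper. The default thresholds mirror the 5% and 2x figures above.

```python
from statistics import median


def classify_timing(
    baseline: dict[str, list[float]],
    current: dict[str, list[float]],
    advisory: float = 1.05,
    blocking: float = 2.0,
) -> dict[str, str]:
    """Classify each baseline pass by the ratio of current to baseline median."""
    verdicts = {}
    for name, base_runs in baseline.items():
        runs = current.get(name, [])
        if not runs or not base_runs:
            verdicts[name] = "missing"
            continue
        base = median(base_runs)
        if base <= 0:
            verdicts[name] = "unmeasurable"
            continue
        ratio = median(runs) / base
        if ratio >= blocking:
            verdicts[name] = "blocking"
        elif ratio > advisory:
            verdicts[name] = "advisory"
        else:
            verdicts[name] = "ok"
    return verdicts
```

The check is one-sided by design: a pass that got faster is reported as "ok", and a single outlier run does not move the median.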
The four stages compose into a regression net that catches most observable-behavior changes. Stage one catches outright build breaks; stage two catches output-shape changes; stage three catches diagnostic-catalog changes; stage four catches gross performance regressions. What slips through all four is the runtime-correctness class — wrong-output bugs that compile, link, and run without a diagnostic. Those require a separate runtime test layer that loads the cubin and validates against a numerical reference, which lives outside the compile-time observation surface entirely.
Regression scenarios
The typical regression-suite shape covers six scenarios. Each one is testable from the observation surfaces above; each one corresponds to a different failure mode in tileiras.
All kernels compile. The exit code from every entry in the fixture set is 0. Any non-zero exit is a regression. The fixture has to be small enough that a full sweep runs in seconds; a per-SM matrix of one kernel per major mechanism (matmul, convolution, attention, reduction, transpose) is a reasonable starting shape.
No new warnings. Stderr from every entry is matched against a known baseline. Any new diagnostic, whether warning or error, surfaces as a stderr diff. The baseline grows over time: a new diagnostic is added to the allowlist once it is intentionally expected.
All kernels produce identical PTX to their goldens. The diff is empty after register and label normalization. Diff-non-empty entries become review-required.
Pass timing within ±10% of baseline. The --mlir-pass-timing report is captured and compared against a recorded baseline. The 10% threshold absorbs measurement noise; larger deviations are flagged.
Generated PTX register count within budget. The .reg declarations at the head of each kernel are parsed and summed. PTX registers are virtual and ptxas performs the final allocation, so the sum is a proxy for register pressure rather than a physical register count. A kernel that grew from 64 declared registers to 96 may now spill, which has runtime cost the harness cannot directly observe but which is predictable from the per-SM register-pressure budgets documented in NVPTX Subtarget and Feature Matrix.
SMEM usage within budget. The kernel's static + dynamic SMEM footprint is parsed from the PTX (.shared declarations) and the launch directives. A kernel that exceeds the per-SM SMEM ceiling fails at ptxas with Function uses too much shared data, but the harness can preempt that failure by checking before invocation. The mechanism is documented in Troubleshooting — Runtime and ptxas failures.
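The last two scenarios can be sketched with two regular expressions over a single kernel's PTX text. The sketch handles only parameterized declarations such as `.reg .b32 %r<128>;` and one-dimensional static arrays such as `.shared .align 16 .b8 tile[4096];`. Dynamic shared memory (`extern .shared ... name[];`) carries no size in the PTX and is not counted. The helper names are hypothetical, and the budgets are caller-supplied rather than taken from any per-SM table.

```python
import re

# `.reg .b32 %r<128>;` -> 128 declared virtual registers
REG_DECL = re.compile(r"\.reg\s+\.\w+\s+%\w+<(\d+)>\s*;")
# `.shared .align 16 .b8 tile[4096];` -> element width in bits, element count
SHARED_DECL = re.compile(
    r"\.shared\s+(?:\.align\s+\d+\s+)?\.[a-z]+(\d+)\s+[\w$]+\[(\d+)\]\s*;"
)


def declared_registers(ptx: str) -> int:
    """Sum the counts of all parameterized .reg declarations."""
    return sum(int(count) for count in REG_DECL.findall(ptx))


def static_shared_bytes(ptx: str) -> int:
    """Sum the byte sizes of all sized, one-dimensional .shared arrays."""
    return sum(int(bits) // 8 * int(count) for bits, count in SHARED_DECL.findall(ptx))


def check_budget(ptx: str, max_registers: int, max_shared_bytes: int) -> list[str]:
    """Return a message for each budget the kernel exceeds."""
    problems = []
    regs = declared_registers(ptx)
    smem = static_shared_bytes(ptx)
    if regs > max_registers:
        problems.append(f"declared registers {regs} exceed budget {max_registers}")
    if smem > max_shared_bytes:
        problems.append(f"static shared memory {smem} B exceeds budget {max_shared_bytes} B")
    return problems
```

Because the register sum counts virtual registers across all types, it is best used as a trend signal against a recorded per-kernel baseline, not as an absolute spill predictor.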
What you cannot test from outside
Some compiler decisions never reach the observation surface. The scheduler's internal cost evaluations are exposed through the trace, but only the chosen decisions; the rejected ones are aggregated into a count rather than itemized. The verifier's branch-by-branch logic produces a single pass/fail bit at the diagnostic; the chain of predicates that led to the verdict is not exposed. The random-tie-break order is seeded from input hash, which makes it reproducible from the test side, but the seed itself is not exposed and re-seeding for fault injection is not possible.
The unverifiable bug classes from Correctness Layers — The Unverifiable (data races in user-written warp-cooperative algorithms, numerical-precision mismatches, performance regressions below the noise threshold, decision bugs in the compiler's own cost model) all compile cleanly, produce no diagnostic, and pass every observation-based regression test. Catching them requires runtime testing against a numerical reference, not source-side observation.
A reasonable position for an integrator: observable-behavior regression tests cover the compile-time contract; runtime-correctness tests cover the execution-time contract; performance benchmarks cover the runtime-performance contract. The three layers are not substitutes for each other, and a CI pipeline that runs only the first will pass every test on a build that silently produces wrong-output numerics.
Cross-references
Debugging and Introspection is the primary reference for each observation flag enumerated above; this page assumes its surface catalog and applies the surfaces to test design rather than re-explaining what each flag prints. Troubleshooting and Known Issues provides the symptom-driven index of diagnostic strings that diagnostic-golden tests (pattern 2) pin against, including the verbatim typos that must be preserved in test assertions. Correctness Layers documents the verifier ladder whose layer-by-layer diagnostics the regression suite is keying off, and identifies the bug classes that observable-behavior testing cannot catch. Error Handling and Diagnostics covers the diagnostic engine and the five exit-code classes that exit-code-based test bucketing relies on. Driver CLI Options enumerates the flags that activate each observation surface, and Driver Program Handle — Public error codes is the canonical exit-code reference. Pipeline Instrumentation and Action Handler documents the scope tree that --mlir-pass-timing exposes and the MlirAction mechanism that lets external tooling instrument any pass without modifying the pass list. OSS Comparison Overview is the cross-validation reference: the open-source cuda-tile preview is the only point where parts of tileiras's IR surface are visible in original source form, and a tester who wants to validate tileiras's behavior against a reference implementation has only that subset to compare against.