Instrumentation and Action Handling
Abstract
Tileiras exposes four orthogonal tracing surfaces wired into one shared PassManager. Pass instrumentation records named scopes around pipeline stages and forms the spine of pass-timing and IR-printing. The IRPrinter instrumentation layered on top emits *** IR Dump Before/After *** banners for --mlir-print-ir-before-all / --mlir-print-ir-after-all. PassTiming consumes the same scope tree and renders either a flat list or a nested tree, in text or JSON, under --mlir-timing. MLIR actions are a lower-level mechanism for tracing individual rewrites and pattern applications; they are visible through mlir::ApplyPatternAction / GreedyPatternRewriteIteration records and through the context's action-handler callback. All four surfaces interact with a single PassInstrumentation chain that the pipeline builder owns.
This page covers the chain itself: the algorithm an instrumentation hook follows, the canonical scope tree the binary actually emits, the MLIR action surface and its printable records, the pass-timing report grammar, the IR-printing banner protocol, the crash reproducer, opt-bisect, and debug-counter — plus the four NVIDIA-private hook points where the chain ties into the scheduler.
Confidence on the existence and ordering of the inner scopes is HIGH (every scope name is a verbatim .rodata string referenced from the scheduler band of .text). Confidence on the exact algorithmic decomposition of runOnOperation between the instrumentation chain and the action handler is MED — the upstream MLIR shape is preserved unchanged in tileiras, and the reverse-engineered call graph matches it bucket-for-bucket, but specific function offsets are not the public contract.
The PassInstrumentation chain
PassManager owns a PassInstrumentor whose only payload is an ordered list of PassInstrumentation* hooks. Each hook is a polymorphic object with seven virtual entry points the pass manager calls at well-defined moments around every pass invocation, every pipeline run, and every analysis computation:
struct PassInstrumentation {
  // Pass-execution boundary callbacks.
  virtual void runBeforePass(Pass *, Operation *) = 0;
  virtual void runAfterPass(Pass *, Operation *) = 0;
  virtual void runAfterPassFailed(Pass *, Operation *) = 0;
  // Pipeline-nesting callbacks (one per nested OpPassManager).
  virtual void runBeforePipeline(OperationName, const PipelineParentInfo &) = 0;
  virtual void runAfterPipeline(OperationName, const PipelineParentInfo &) = 0;
  // Analysis-cache callbacks.
  virtual void runBeforeAnalysis(StringRef, TypeID, Operation *) = 0;
  virtual void runAfterAnalysis (StringRef, TypeID, Operation *) = 0;
};
The pass-execution callbacks bracket every Pass::runOnOperation call. The pipeline callbacks fire once at the entry and exit of each nested OpPassManager invocation, regardless of how many passes run inside it. The analysis callbacks fire around any AnalysisManager::getAnalysis<T> cache miss; cache hits fire neither.
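The cache-miss rule for the analysis callbacks can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `AnalysisCache`, its string key, and the `int` result are illustration-only stand-ins, whereas the callbacks in the source also carry a TypeID and an Operation*.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical analysis cache keyed by analysis name (an assumption made
// to keep the sketch short). Events are appended to a caller-owned log.
class AnalysisCache {
public:
  explicit AnalysisCache(std::vector<std::string> *log) : log_(log) {}

  int get(const std::string &name, const std::function<int()> &compute) {
    auto it = cache_.find(name);
    if (it != cache_.end()) return it->second;  // hit: no callbacks fire
    log_->push_back("runBeforeAnalysis:" + name);
    int value = compute();                      // miss: compute once
    log_->push_back("runAfterAnalysis:" + name);
    cache_.emplace(name, value);
    return value;
  }

private:
  std::vector<std::string> *log_;
  std::unordered_map<std::string, int> cache_;
};
```

Requesting the same analysis twice should log exactly one before/after pair, because the second request is served from the cache.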
The instrumentor walks its hook list in registration order for the Before* callbacks and in reverse order for the After* callbacks. The reverse traversal is what gives nested timing the right semantics — an outer timer's runAfterPass fires after every inner timer has already closed.
LogicalResult run_pass_with_instrumentation(
    PassManager &pm, Pass &pass, Operation *anchor) {
  PassInstrumentor *instr = pm.getInstrumentor();
  for (auto &h : instr->hooks_forward()) h->runBeforePass(&pass, anchor);
  LogicalResult result = pass.runOnOperation(anchor);
  if (failed(result)) {
    for (auto &h : instr->hooks_reverse()) h->runAfterPassFailed(&pass, anchor);
  } else {
    for (auto &h : instr->hooks_reverse()) h->runAfterPass(&pass, anchor);
  }
  return result;
}
When no instrumentation is installed, the instrumentor's hook list is empty, so the bracketed region costs two empty-list checks per pass invocation (one before the pass, one after). That is far below the cost of any non-trivial pass.
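The forward-then-reverse traversal is easiest to see with hooks that only log. The following is a minimal, self-contained sketch; `LoggingHook`, `runBracketed`, and `demoOrdering` are hypothetical names that do not come from the binary.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a PassInstrumentation hook: it only logs.
struct LoggingHook {
  std::string name;
  std::vector<std::string> *log;
  void before() const { log->push_back("before:" + name); }
  void after() const { log->push_back("after:" + name); }
};

// Runs `body` bracketed by the hooks: registration order on entry,
// reverse registration order on exit.
template <typename Body>
void runBracketed(const std::vector<LoggingHook> &hooks, Body body) {
  for (auto it = hooks.begin(); it != hooks.end(); ++it) it->before();
  body();
  for (auto it = hooks.rbegin(); it != hooks.rend(); ++it) it->after();
}

// Convenience driver: returns the event log for a two-hook chain.
std::vector<std::string> demoOrdering() {
  std::vector<std::string> log;
  std::vector<LoggingHook> hooks = {{"timer", &log}, {"printer", &log}};
  runBracketed(hooks, [&] { log.push_back("pass"); });
  return log;
}
```

With the timer registered first, its after-event is the last entry in the log, so its interval encloses the printer's work as well as the pass body.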
Scope tree the binary emits
Every named scope the binary actually emits as a .rodata string is enumerated below. The scopes form a strict tree: outer scopes cover whole compilation phases; inner scopes cover scheduling and TileAS preparation substages. The order in which they fire on a typical compile is top-to-bottom in the table — CompileNVVM opens first, SerializeGPUModule closes last, and every scheduler scope nests inside TileASGenerateSchedule or TileASPrepareForScheduling.
| Scope | Layer | Purpose |
|---|---|---|
| CompileNVVM | Outer | Entire MLIR-to-NVVM/NVPTX compile run. |
| SerializeGPUModule | Outer | GPU module serialization and downstream assembler handoff. |
| IRWalk::findTargetForLoops | Schedule prep | Search the IR for loops eligible for schedule materialization. |
| Schedule::unrollStaticForLoop | Schedule prep | Emit static loop unrolling during schedule materialization. |
| TileASGenerateSchedule | Schedule | Schedule constraint generation. |
| TileASPrepareForScheduling | Schedule | TileAS preparation before schedule solving. |
| legalizeLoopScheduleForMaterialization | Schedule | Loop-shape cleanup before materializing a schedule. |
| DumpTraceImpl::run | Schedule | Write a scheduler trace when schedule-trace-file is set. |
| unrollSmallLoopsForScheduling | Schedule prep | Unroll small loops before schedule construction. |
| decomposeSingleOp | Schedule prep | Decompose a single complex op for schedule-friendly IR. |
| loopUnrollByFactor | Schedule prep | Apply an explicit unroll factor. |
| loopUnrollByHeuristic | Schedule prep | Apply heuristic loop unrolling. |
| decomposeTiledLoadStoreView | Schedule prep | Split tiled view loads/stores into scheduler-friendly forms. |
| refineVecSizeOfAtoms | Schedule prep | Refine vector sizes for atom operations. |
| sliceAndFuse | Schedule prep | Slice and fuse loops or regions for scheduling. |
| TileASSliceAndFuse | Schedule prep | Wrapper scope around the TileAS-specific slice-and-fuse stage. |
| runCanonicalizer | Schedule prep | Run canonicalization inside a scheduler preparation stage. |
| compactMemLayout | Schedule prep | Compact memory layout metadata. |
| refreshBoxDim | Schedule prep | Refresh box dimensions after layout changes. |
| ResourceConstraintBuilder::tryAddConstraintToAvoidRegSpilling | Schedule constraint | Add scheduling constraints to avoid spills (see Resource Constraint Builder and RRT). |
These names are part of the stable surface — external timing reports and callback integrations match on them verbatim. TileASSliceAndFuse and TileASGenerateSchedule are the two most expensive scopes on a typical O2 compile; together they dominate the wall-clock budget for any schedule-intensive kernel. The DumpTraceImpl::run scope only opens when schedule-trace-file is set in TileIRPipelineOptions.
⚡ QUIRK — typo-stable scope names Several scopes carry their upstream typos verbatim. tryAddConstraintToAvoidRegSpilling is the canonical example: the spelling is the cross-build contract, not a guideline. Downstream timing-report ingestion matches on this exact string. A reimplementation that silently corrects the spelling breaks every log-scraping pipeline that consumes the timing tree.
Scope Algorithm
Instrumentation scopes are exception-safe and nest correctly through an RAII helper. The exit always fires, even when the body throws — the chain's runAfterPassFailed hooks are the failure path.
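class ScopeGuard {
public:
  ScopeGuard(PassInstrumentor *p, StringRef name)
      : p_(p), token_(p ? p->enter(name) : INVALID_TOKEN),
        exceptions_on_entry_(std::uncaught_exceptions()) {}
  ~ScopeGuard() {
    // Clean exit only if no new exception started unwinding inside the scope.
    if (p_) p_->exit(token_, std::uncaught_exceptions() == exceptions_on_entry_);
  }
private:
  PassInstrumentor *p_;
  ScopeToken token_;
  int exceptions_on_entry_;
};
// Caller side:
{
  ScopeGuard g(instr, "TileASGenerateSchedule");
  if (failed(generate_schedule_constraints(...))) return failure();
  // The ScopeGuard destructor closes the scope on every exit path. As written,
  // it reports the failure path only when an exception is unwinding.
}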
When no instrumentation handler is installed enter returns INVALID_TOKEN and exit is a single integer compare against that sentinel — entering and exiting a scope costs roughly the same as a function call's prologue and epilogue.
⚡ QUIRK — scope tokens are not pointer-comparable Scope tokens are opaque integers, not pointers into the hook chain. Two independent scopes can return the same token after one closes — the chain reuses retired slots. Code that tries to memoize the token across multiple scope-aware components must use the scope name, not the token, as the key.
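One way slot reuse can make two unrelated scopes share a token is a free list over a slot table. The sketch below is an assumption about the mechanism, not a description of the binary; `ScopeTable`, the `-1` sentinel value, and the `int32_t` token width are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using ScopeToken = std::int32_t;
constexpr ScopeToken INVALID_TOKEN = -1;  // sentinel value is an assumption

// Hypothetical slot table: a token is an index into `slots_`, and closed
// slots go onto a free list so later scopes can reuse them.
class ScopeTable {
public:
  ScopeToken enter(const std::string &name) {
    if (!free_.empty()) {
      ScopeToken t = free_.back();
      free_.pop_back();
      slots_[static_cast<std::size_t>(t)] = name;
      return t;
    }
    slots_.push_back(name);
    return static_cast<ScopeToken>(slots_.size() - 1);
  }
  void exit(ScopeToken t) {
    if (t == INVALID_TOKEN) return;  // nothing was opened
    slots_[static_cast<std::size_t>(t)].clear();
    free_.push_back(t);
  }
  const std::string &nameOf(ScopeToken t) const {
    return slots_[static_cast<std::size_t>(t)];
  }

private:
  std::vector<std::string> slots_;
  std::vector<ScopeToken> free_;
};
```

Opening a scope, closing it, and opening a differently named scope yields the same token value twice, which is why a cache keyed on the token alone would conflate the two scopes.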
Pass Timing Report Format
--mlir-timing enables a TimingInstrumentation hook that accumulates per-scope wall time and user+system time. The hook prints its report at process exit through an llvm::Timer-shaped grammar. Two display modes and two output formats are available:
| mlir-timing-display | mlir-output-format | Report grammar |
|---|---|---|
| list (default) | text (default) | Flat list of rows, one per scope, with Wall time, User+System, and Name columns. |
| tree | text | Nested tree, indentation per scope depth; preserves pass-manager nesting. |
| list | json | JSON array of {wall, user_system, name} triples. |
| tree | json | JSON tree of {wall, user_system, name, children} nodes. |
The header strings are stable across builds and are matched by downstream log-scrapers: ---User Time---, ----User Time----, ---Wall Time---, ----Wall Time---- ----Name----, and ---User+System--. The leading-whitespace count is part of the contract.
===-------------------------------------------------------------------------===
Pass execution timing report
===-------------------------------------------------------------------------===
Total Execution Time: 4.812 seconds
---Wall Time--- ---User+System-- Name
2.013 (41.8%) 2.009 (41.7%) TileASGenerateSchedule
1.144 (23.8%) 1.143 (23.8%) TileASSliceAndFuse
0.401 ( 8.3%) 0.401 ( 8.3%) Canonicalizer
...
4.812 (100.0%) 4.811 (100.0%) Total
The list-mode report's percentage column is computed against the total, which is the root scope's wall time, not the sum of children — Canonicalizer invocations nested under TileASGenerateSchedule contribute to both rows. The tree mode disambiguates by showing the percentages relative to each parent.
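The difference between the two percentage bases can be worked through with a small sketch. `TimingNode` and both helper functions are hypothetical and only illustrate the arithmetic described above.

```cpp
#include <string>
#include <vector>

// Hypothetical timing node; field names are illustrative only.
struct TimingNode {
  std::string name;
  double wall;  // seconds, inclusive of children
  std::vector<TimingNode> children;
};

// List mode: every row is a share of the root scope's wall time.
double percentOfRoot(const TimingNode &root, const TimingNode &node) {
  return root.wall > 0.0 ? 100.0 * node.wall / root.wall : 0.0;
}

// Tree mode: every row is a share of its direct parent's wall time.
double percentOfParent(const TimingNode &parent, const TimingNode &child) {
  return parent.wall > 0.0 ? 100.0 * child.wall / parent.wall : 0.0;
}
```

For a 0.5 s Canonicalizer nested under a 2.0 s TileASGenerateSchedule in a 4.0 s run, the list view reports 12.5% for the Canonicalizer while the tree view reports 25% of its parent. The same 0.5 s is also counted inside the parent's 50% list-mode row.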
⚡ QUIRK — typo-preserved enum value The --mlir-timing-display=tree description string in the binary reads `display the results ina with a nested tree view`; the `ina` typo is preserved verbatim from upstream MLIR. The cl::opt help text matches the upstream Pass.cpp byte-for-byte. Do not "fix" this when reimplementing; downstream regression suites match the typo.
--enable-statistics and --stats-json enable a parallel statistics pass that walks every registered llvm::Statistic after compilation and prints counters and frequencies. This surface is independent of the timing instrumentation — a build can have timing on while statistics are off, or vice versa.
IR-Printing Instrumentation
The IR-printer hook intercepts the same pass-execution callbacks the timing hook uses. Where the timing hook accumulates a duration, the IR-printer emits a textual MLIR snapshot of the anchor operation either before or after the pass body runs. The hook gates its emission on a per-pass filter, a per-anchor scope, and a change detector. The full surface partitions into eight knobs the user enables through MLIRContext::setPrintIR… or through the corresponding driver flags.
| Knob | Default | Effect |
|---|---|---|
| --mlir-print-ir-before-all | off | Snapshot before every pass that runs. |
| --mlir-print-ir-after-all | off | Snapshot after every pass that runs. |
| --mlir-print-ir-after-change | off | Suppress the after-snapshot unless the pass actually mutated the anchor. |
| --mlir-print-ir-after-failure | off | Snapshot only when a pass returns failure. |
| --mlir-print-ir-module-scope | off | Use the enclosing builtin.module as the snapshot root rather than the anchor. |
| filter-print-funcs=<name> | empty | Limit the per-function snapshot to named functions. |
| --mlir-elide-elementsattrs-if-larger=<n> | per-MLIR | Truncate large DenseElementsAttr payloads. |
| --mlir-print-skip-regions | off | Skip every region body — print operations alone. |
Each snapshot is bracketed by a verbatim banner: *** IR Dump Before <pass-name> *** or *** IR Dump After <pass-name> ***. The pass name is the registered argument name (e.g., cse, tileas-generate-schedule-constraints). When --mlir-print-ir-module-scope is enabled, the banner is followed by the full module text; without module-scope, only the anchor operation is printed, which on a multi-kernel module can suppress context that the user actually needs to see (see Debugging and Introspection — IR Printing for the user-facing recipe).
void IRPrinterInstrumentation::runAfterPass(Pass *pass, Operation *anchor) {
  if (suppressed_by_filter(pass)) return;
  if (require_change_ && !pass_changed_anchor(anchor)) return;
  Operation *root = use_module_scope_ ? containing_module(anchor) : anchor;
  raw_ostream &os = printer_stream();
  os << "*** IR Dump After " << pass->getArgument() << " ***\n";
  OpPrintingFlags flags;
  if (print_debug_info_) flags.enableDebugInfo();
  if (print_op_generic_) flags.printGenericOpForm();
  if (skip_regions_) flags.skipRegions();
  if (elide_threshold_ >= 0) flags.elideLargeElementsAttrs(elide_threshold_);
  root->print(os, flags);
  os << "\n";
}
The pass_changed_anchor predicate compares a hash captured at runBeforePass against a hash captured at runAfterPass. Hash collisions are not handled — an IR change that produces an identical hash will appear as "no change" to the change detector, but the failure mode is conservative (snapshots are suppressed, never duplicated).
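The before/after fingerprint comparison can be sketched as follows. This is a simplified model under stated assumptions: the IR is represented by its printed text, the fingerprint is `std::hash<std::string>`, and `ChangeDetector` with its method names is hypothetical. The source does not specify which hash the real hook uses.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical change detector keyed by an opaque pass identity.
class ChangeDetector {
public:
  // Called from the before-pass hook: stash a fingerprint as the baseline.
  void recordBefore(const void *pass, const std::string &irText) {
    baseline_[pass] = std::hash<std::string>{}(irText);
  }

  // Called from the after-pass hook. Returns nullopt when no baseline was
  // recorded, leaving the fallback policy to the caller.
  std::optional<bool> changedAfter(const void *pass,
                                   const std::string &irText) {
    auto it = baseline_.find(pass);
    if (it == baseline_.end()) return std::nullopt;
    bool changed = it->second != std::hash<std::string>{}(irText);
    baseline_.erase(it);  // one baseline per pass invocation
    return changed;
  }

private:
  std::unordered_map<const void *, std::size_t> baseline_;
};
```

Returning an empty optional for a missing baseline keeps the sketch neutral about what the printer does in that case. Two different texts that happen to hash equally would be reported as unchanged, matching the collision behavior described above.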
The AsmPrinter knobs that govern the textual form of the printed IR are documented under Dialect Asm-Printer Status — Alias Resolution; they include mlir-print-debuginfo, mlir-print-local-scope, mlir-print-op-generic, mlir-print-skip-regions, mlir-print-unique-ssa-ids, mlir-print-value-users, and mlir-use-nameloc-as-prefix. Each one is a process-wide cl::opt that the IRPrinter hook reads to construct its OpPrintingFlags per snapshot.
⚡ QUIRK — print-changed requires a baseline The IRPrinter only emits an "after" snapshot under --mlir-print-ir-after-change if runBeforePass ran first and stashed a baseline. If both --mlir-print-ir-after-change and --mlir-print-ir-after-failure are passed without --mlir-print-ir-before-all, the change detector has no baseline to compare against and falls back to "always emit after a failure." This is invisible in the help text, but it is the actual behavior any reproducer must match.
MLIR Actions
Actions are a finer-grained surface than pass instrumentation. Where instrumentation brackets pass-level work, an action brackets a single transformation event — typically one greedy-rewrite iteration or one pattern application — and gives a registered handler the option to inspect, skip, log, or replace the work. Three action types ship in the binary:
| Action type | Class | Fires per |
|---|---|---|
| mlir::ApplyPatternAction | RTTI string in .rodata | Single rewrite-pattern invocation inside the greedy or dialect-conversion driver. |
| mlir::GreedyPatternRewriteIteration | RTTI string in .rodata | One sweep of the greedy worklist. |
| mlir::SetMaxRegisterActionAttr | RTTI string in .rodata | NVVM setmaxnreg rewrite at NVVM lowering time. |
An action has an identity, a tag, and an optional payload pointer that the handler interprets. A context-level handler installed through MLIRContext::registerActionHandler intercepts every action that runs inside that context:
LogicalResult MLIRContext::executeAction(
    const Action &action, function_ref<void()> work) {
  if (!action_handler_) {
    work();
    return success();
  }
  return action_handler_(action, [&] { work(); return success(); });
}
The handler receives the action's IRPrinting state and may print, defer, drop, or stack-trace the work before calling the supplied work closure. Three call sites in the binary thread actions through this dispatch: the greedy-pattern-rewrite driver (one action per iteration plus one per pattern attempt), the dialect-conversion driver (one action per attempted conversion), and the NVVM lowering pass (one action per setmaxnreg rewrite).
The action-record dump format begins with the verbatim banner >> Action Record . The handler then prints Action: <name> followed by the action's payload formatted by the action class itself. Greedy-rewrite actions print their root op and the matched pattern's RTTI name; dialect-conversion actions print the source-and-target op pair. The handler can also emit Action: cleanup for the deferred-rewrites cleanup pass.
⚡ QUIRK — actions are independent of pass instrumentation A pass with no instrumentation hooks installed still emits actions when a handler is registered. Conversely, a pass with instrumentation but no action handler emits scopes but no action records. The two surfaces share neither a thread nor a buffer; they are wired in parallel through MLIRContext and PassManager respectively. This is what lets a build run pass timing without the per-pattern noise of action tracing, or trace patterns without timing overhead.
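The handler dispatch described in this section can be modelled without any MLIR types. The sketch below is a simplified stand-in: `Action`, `ActionContext`, `Work`, and `Handler` are hypothetical, and the handler returns nothing rather than a LogicalResult.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical action record: just a tag string for this sketch.
struct Action {
  std::string tag;
};

using Work = std::function<void()>;
using Handler = std::function<void(const Action &, const Work &)>;

// Minimal context: with no handler the work runs directly; with a handler
// installed, the handler decides whether and when to invoke the work.
class ActionContext {
public:
  void registerHandler(Handler h) { handler_ = std::move(h); }

  void execute(const Action &action, const Work &work) {
    if (!handler_) {
      work();
      return;
    }
    handler_(action, work);
  }

private:
  Handler handler_;
};
```

A handler that logs the tag and then calls the work behaves like a tracer; one that returns without calling it skips the transformation. Neither case touches the pass-instrumentation chain.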
Opt-Bisect
opt-bisect-limit is LLVM's bisect-by-pass-index harness. Set to a positive integer, it caps the number of passes that run; passes above the cap are skipped, and the IR captured at the cap is the bisect output. Tileiras inherits the LLVM-side instrumentation hook unchanged. Four cl::opts control its behavior:
| Knob | Type | Default | Effect |
|---|---|---|---|
| opt-bisect-limit | int | INT_MAX | Skip every pass whose index exceeds the limit. |
| opt-bisect-funcs | string list | empty | Restrict bisection to a comma-separated function list. |
| opt-bisect-verbose | bool | false | Print each pass index and decision to stderr. |
| opt-bisect-print-ir-path | string | empty | Write the IR to this path when the limit is reached. |
Opt-bisect runs in the same instrumentation chain as the IR printer and the timer. Its runBeforePass hook increments a counter, compares against the limit, and either lets the pass through or replaces it with a no-op. The reduced search space — pass-by-pass binary search instead of textual diff — is what makes opt-bisect viable on the 200+-pass tileiras pipeline.
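The counter-and-limit gate, together with the binary search a user runs over the limit, can be sketched as follows. `BisectGate` and `firstFailingLimit` are hypothetical helpers; 1-based pass numbering and monotone failure (every limit at or above the culprit fails) are assumptions of the sketch.

```cpp
#include <climits>
#include <string>
#include <vector>

// Hypothetical bisect gate: passes are numbered from 1 in the order they
// are offered, and any pass whose index exceeds `limit` is skipped.
class BisectGate {
public:
  explicit BisectGate(int limit = INT_MAX) : limit_(limit) {}

  bool shouldRun(const std::string &passName) {
    ++index_;
    bool run = index_ <= limit_;
    decisions_.push_back((run ? "running " : "skipping ") + passName);
    return run;
  }
  const std::vector<std::string> &decisions() const { return decisions_; }

private:
  int limit_;
  int index_ = 0;
  std::vector<std::string> decisions_;
};

// Binary search for the smallest limit at which `fails(limit)` is true.
// Assumes fails(passCount) is true and failure is monotone in the limit.
template <typename Pred>
int firstFailingLimit(int passCount, Pred fails) {
  int lo = 1, hi = passCount;
  while (lo < hi) {
    int mid = lo + (hi - lo) / 2;
    if (fails(mid)) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}
```

On a 200-pass pipeline the search needs about eight compiles to isolate the first failing pass index, which is the practical payoff of the pass-by-pass approach.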
A related but independent surface is debug-counter, which gates individual transformations within a pass by name (e.g., --debug-counter=dce-counter=10 lets DCE run for ten elimination attempts before short-circuiting). Two knobs apply:
| Knob | Type | Default | Effect |
|---|---|---|---|
| print-debug-counter | bool | false | Print the post-run counter summary. |
| debug-counter-break-on-last | bool | false | Trap into the debugger on the final allowed attempt. |
The warning `Requested --debug-counter in LLVM build without assertions. This is a no-op.` is emitted at startup when the build does not carry the assertion-time counter table. The tileiras release build is one such build, so debug-counter has no effect there unless tileiras is rebuilt with assertions.
Crash Reproducer
When a pass aborts or fails late in the pipeline, the pass manager can emit a self-contained reproducer — the IR captured immediately before the failing pass plus a --pass-pipeline= string that names every pass run up to the failure. The reproducer is written when the pipeline builder calls enableCrashReproducerGeneration with a path; the file path then surfaces as the verbatim diagnostic reproducer generated at \
Crash-reproducer output has two flavors:
| Flavor | When | What it captures |
|---|---|---|
| Local | Default | The single pass that failed plus the IR seen by that pass; minimal context. |
| Pipeline | genLocalReproducer=false | The full pipeline from the failure backward to the last cse or canonicalize checkpoint; replays the path that produced the failing IR. |
Neither flavor captures host-side cl::opt state — a reproducer built against the binary needs to be re-run with the same CLI flags. The "Reproducer" banner in the captured file is emitted before the IR payload as a textual delimiter.
⚡ QUIRK — reproducer paths are per-PassManager, not per-context Each PassManager instance carries its own reproducer-path setting. Two pass managers running in the same MLIRContext can have different reproducer paths, or one can have it enabled and the other not. This is intentional: it lets the outer driver capture a reproducer for the device-module pipeline without bloating the host-wrapper pipeline's reproducer with kilobytes of unrelated MLIR.
NVIDIA-Private Hook Points
Four hook points on the instrumentation chain are NVIDIA-private:
- DumpTraceImpl::run: Activated by schedule-trace-file. The hook installs itself on the inner gpu.func (TileAS-stage) adaptor and writes a per-pass scheduler trace in the Chrome chrome://tracing format. Each pass invocation produces one event with name, timestamp, and duration; each scheduler decision produces a child event under its enclosing pass. The trace is closed at process exit through __cxa_atexit.
- schedule-trace-file consumer: Reads the option at pipeline-construction time and refuses to install the hook if the file cannot be opened. The diagnostic `failed to legalizeLoopScheduleForMaterialization` fires when the legalisation phase rejects a loop the scheduler expected to consume; it is the most common visible scheduler-side failure and rides in the same instrumentation tree.
- TileIR Callbacks: Five __CUDA_TILEIR_* env vars (see Env Vars and Runtime Gates and the TILEIR_CALLBACKS ABI) modify the instrumentation surface at module load time. __CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD registers a host-side pre-load hook that the instrumentation chain notifies before any compilation begins; __CUDA_TILEIR_FUNC_CALLBACKS and __CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE register per-function callbacks that fire on argument-buffer mutation.
- Action attributes: mlir::NVVM::SetMaxRegisterActionAttr is the only NVIDIA-private Action class. It fires when the NVVM lowering rewrites a gpu.func to add the setmaxnreg PTX-level directive; the action wraps the rewrite so an outside observer can log the per-function maxnreg decision without modifying the pass itself.
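For readers unfamiliar with the Chrome trace format mentioned for DumpTraceImpl::run, the sketch below emits one "complete" event (phase `X`) with a name, a start timestamp, and a duration in microseconds. Which fields the binary actually writes is not confirmed by the source; the field set here is the generic trace-event shape, and `completeEvent` is a hypothetical helper that assumes the name needs no JSON escaping.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One "complete" event (ph = "X") in the Chrome trace-event JSON format.
// `tsMicros` and `durMicros` are in microseconds. The name is written
// as-is, so it must not contain quotes or backslashes (sketch assumption).
std::string completeEvent(const std::string &name, std::uint64_t tsMicros,
                          std::uint64_t durMicros, int pid = 0, int tid = 0) {
  std::ostringstream os;
  os << "{\"name\":\"" << name << "\",\"ph\":\"X\",\"ts\":" << tsMicros
     << ",\"dur\":" << durMicros << ",\"pid\":" << pid
     << ",\"tid\":" << tid << "}";
  return os.str();
}
```

Events like this are typically collected into a JSON array; a viewer nests a child under its parent when the child's interval lies inside the parent's on the same thread.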
Verify-Each and Action Composition
The verify-each knob (off by default for release builds, on by default for assert builds) runs the full verify() on the anchor operation between every pair of passes. Verify-each is implemented as a PassInstrumentation hook that fires from runAfterPass; it is in the same chain as the IR printer and the timer, so its cost is roughly proportional to the size of the anchor operation. See Pipeline Invariants and Verifiers — Verifier Layers for the layered verifier model and why between-pass verification catches a class of bugs the explicit verifier passes cannot.
Mixing all five hooks — timer, IR printer, verify-each, opt-bisect, and a custom action handler — is supported by the chain. The interleaving order matters for one specific case: if both --mlir-print-ir-after-all and verify-each are enabled, the IR printer's snapshot is taken before verify-each runs, so a failed verification produces both the post-pass IR snapshot (from the printer hook) and the failure diagnostic (from verify-each). The snapshot reflects the IR that triggered the verification failure, which is exactly what the user wants when bisecting a between-pass invariant violation.
Callback Integration
The same compile instrumentation surface feeds the TileIR callback emission path. Callback emission materialises well-known module symbols and launch-site hooks so a runtime can patch instrumentation at module load time. The host-side symbols __CUDA_TILEIR_CALLBACKS, __CUDA_TILEIR_CALLBACKS_ON_PRE_LOAD, __CUDA_TILEIR_FUNC_CALLBACKS, __CUDA_TILEIR_FUNC_ON_ARGUMENTS_CHANGE, and __CUDA_TILEIR_ON_PRE_LOAD are wired through the same pipeline-builder logic that installs DumpTraceImpl; the driver-level ABI is documented in TILEIR_CALLBACKS ABI, and the format strings the callbacks emit ([TileIR Callback] Argument %d: offset = %ld, size = %ld, [TileIR Callback] CUdeviceptr: %p, [TileIR Callback] DESC_TMA512: ...) are part of the public log format.
Reimplementation Notes
build_instrumentation_chain(pm, opts):
    chain = PassInstrumentor()
    if opts.timing_enabled:
        chain.push(TimingInstrumentation(opts.timing_display, opts.timing_format))
    if opts.ir_printing_enabled:
        chain.push(IRPrinterInstrumentation(opts.print_filter, opts.print_flags))
    if pm.verify_each:
        chain.push(VerifyEachInstrumentation(pm))
    if opts.opt_bisect_limit < INT_MAX:
        chain.push(OptBisectInstrumentation(opts.opt_bisect_limit, opts.opt_bisect_funcs))
    if opts.schedule_trace_file:
        chain.push(DumpTraceImpl(opts.schedule_trace_file))
    pm.set_instrumentor(chain)
run_pass(pm, pass, anchor):
    for h in chain.forward(): h.run_before(pass, anchor)
    result = pass.run_on_operation(anchor)
    for h in chain.reverse():
        if failed(result): h.run_after_failed(pass, anchor)
        else: h.run_after(pass, anchor)
    return result
The ordering rule: the timer goes first because it must envelop every other hook's cost; the IR printer goes second so its snapshot is taken before verify-each potentially rejects; verify-each goes third; opt-bisect goes fourth so it can short-circuit before the scheduler-trace hook runs; the trace hook goes last so it sees only the passes opt-bisect allows through. Reversing this order silently changes the meaning of every reported number.
Cross-References
Driver Entry and Optimization Levels — Serialization Scopes names the two outer scopes (CompileNVVM, SerializeGPUModule) that this page enumerates. Modulo Scheduler and Rau-Style Placement is the scheduler whose stages drive most of the inner scopes. Resource Constraint Builder and RRT is what ResourceConstraintBuilder::tryAddConstraintToAvoidRegSpilling is part of. Debugging and Introspection is the user-facing guide that turns the scope tree and action surface documented here into a workflow for diagnosing pipeline issues. Testing and Observability covers how external test suites consume the timing and snapshot streams. Pass Manager Internals — Threading Model covers the threading contract this chain runs inside. cl::opt Full Catalog — Layer 6 is the canonical catalog for every --mlir-* knob mentioned here.