Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Driver Env Vars + Runtime Gates

Abstract

tileiras carries two configuration surfaces beyond ordinary command-line options. First, a small set of environment variables drives toolkit discovery, ptxas knob forwarding, TMA policy, swizzle selection, and a TileAS shared-memory debug escape hatch. Second, a family of pass-level runtime gates lives as LLVM command-line options — not process environment variables, but worth documenting here because they shape the same compile-time behavior.

Environment-variable parsing is uneven by design. Some variables are presence-only, some require an exact value, and one parses a base-10 integer. A faithful reimplementation should preserve those differences rather than collapse every variable into a generic boolean parser.

Per-Feature Runtime Gates

Each variable has a known consumer sub_ADDR and a known parse mode. The parse modes vary on purpose: some are presence-only, one demands the literal byte "1", one parses a base-10 integer through strtol, and one is a path forwarded verbatim into an argv slot. A faithful reimplementation must preserve those differences rather than collapse them into a single boolean parser.

VariableConsumerParse modeDefault behaviourHazard
TILEIR_DELAY_TMA_STORE_WAIT(varies)strtol0 (no delay)strtol failure on non-numeric input throws a std::stoi-shaped exception that aborts the process
TILEIR_DEBUG_DUMP_BC(varies)presencefalseSetting to anything (including "0") enables BC dump
TILEIR_DEBUG_DUMP_LLVM(varies)presencefalseSetting to anything enables LLVM-IR dump
TILEIR_PREFER_TMA_FOR_LOAD_STORE(varies)string boolean ("true" / "false")falsePrefers TMA lowering over the default cp.async / vector heuristic
TILEIR_ALWAYS_SWIZZLE(varies)presencenormal swizzle heuristicSetting to anything (including "0") forces the swizzled-layout path
TILE_AS_DEBUG_UNLIMITED_SMEMRCBpresenceSMEM cap = 232448 BSetting to anything overrides the cap to INT_MAX; on hardware with less than 232448 B SMEM this miscompiles
TILE_AS_DEBUG_VERBOSE(varies)string-eq "1"falseCompared to "1" only; other truthy strings like "true" are ignored
MLIR_ENABLE_EVO(varies)presencefalseInternal experimental switch; behaviour is undocumented and subject to silent change
PTX_KNOBS_PATHptxas subprocesspath string(none)Forwarded verbatim as --knobs-file=<path> into the ptxas argv

Three operational details matter. First, MLIR_ENABLE_EVO and PTX_KNOBS_PATH form an AND gate inside the ptxas-argv builder — setting only one does not forward a knob file. Second, the presence-only gates do not strip the value: assigning "0" or "false" still enables them, opposite of what a deployment-script reader usually expects. Third, TILE_AS_DEBUG_VERBOSE is a string-equality compare against "1", so "true" and "yes" are silently ignored.

The TILEIR_DELAY_TMA_STORE_WAIT hazard deserves its own paragraph because the failure mode is brutal. The consumer parses the value with strtol, but the calling path treats parse failure as an exception (a std::stoi-shaped std::invalid_argument) that percolates up unhandled. The result is SIGABRT on terminate. Setting TILEIR_DELAY_TMA_STORE_WAIT=foo instead of TILEIR_DELAY_TMA_STORE_WAIT=0 produces no warning and no fallback to default — it aborts the compile. Deployment scripts should either omit the variable entirely or set a decimal integer.

CUDA-Root Resolution

CUDA-root resolution is trickier than it looks: two separate resolvers live in the binary and they disagree on the miss path. The driver-side resolver sub_5773C0 fires early during command-line processing; the NVVM-side resolver sub_1A41D30 fires later from inside libdevice and libnvvm lookup. Both walk the same env-var chain — CUDA_ROOT, then CUDA_HOME, then CUDA_PATH — but they part ways when every variable is unset.

Resolversub_ADDRChainMiss behaviour
Driversub_5773C0CUDA_ROOTCUDA_HOMECUDA_PATHWalks /proc/self/exe up two directories; aborts with "cannot find CUDA installation" if that path does not exist
NVVMsub_1A41D30CUDA_ROOTCUDA_HOMECUDA_PATHReturns the empty string; no /proc/self/exe fallback, no abort

The divergence is the hazard. A deployment that leaves CUDA_ROOT unset but keeps /proc/self/exe resolvable inside the expected toolkit layout sees the driver succeed silently. Later, when the NVVM path tries to locate libdevice.10.bc, its resolver returns "", and the libdevice loader joins that empty string against nvvm/libdevice/libdevice.10.bc. The user gets a confusing "libdevice.10.bc not found in $CUDA_ROOT/nvvm/libdevice" error even though the driver itself reported no problem. The workaround is mechanical: always export CUDA_ROOT explicitly in production deployments, even when /proc/self/exe would have been enough for the driver alone.

Runtime-gate globals (static-ctor populated)

The runtime-gate layer is ordinary LLVM option storage for individual passes. These flags help when debugging pass behavior, but getenv never reads them.

Gate familyRepresentative flagsDefault behavior
CDP inline pretreatCDP launch-name tableRecognizes the known CDP launch helper names.
Unsafe algebra-opt-unsafe-algebraEnabled.
Dead barrier elimination-basic-dbeDisabled unless requested.
SCEV-CGP-scev-cgp-*, -do-scev-cgp, -do-function-scev-cgpEnabled with bounded search budgets.
Base-address strength reduction-do-base-address-strength-reduceEnabled at level 4.
DOT and FFMA fusion-enable-dot, -enable-fma-to-ffma2, -balance-dot-chainDOT is on; FFMA2 fusion is off.
IPMSP-do-clone-for-ip-msp, -dump-ip-mspClone budget is automatic; dump is off.
LSA-lsa-optEnabled.
Memory-space optimization-track-indir-load, -dump-ir-before-memory-space-optTracking is on; dumps are off.
ProcessRestrict-process-restrict, -apply-multi-level-restrictBase restrict processing is on.
NVPTX printf lowering-nvvm-lower-printfEnabled.
Kernel selection-select-kernel-range, -select-kernel-listEmpty selection means no narrowing.

Consumers

Each environment variable is consumed close to the operation it affects:

PtxasArgs build_ptxas_args(const DriverState *state) {
    PtxasArgs args = default_ptxas_args(state);

    if (getenv("MLIR_ENABLE_EVO") != NULL) {
        const char *knobs = getenv("PTX_KNOBS_PATH");
        if (knobs != NULL)
            args_append(&args, concat("--knobs-file=", knobs));
    }

    return args;
}

bool unlimited_smem_debug(void) {
    return getenv("TILE_AS_DEBUG_UNLIMITED_SMEM") != NULL;
}

bool verbose_debug(void) {
    const char *value = getenv("TILE_AS_DEBUG_VERBOSE");
    return value != NULL && strcmp(value, "1") == 0;
}

int delay_tma_store_wait(void) {
    const char *value = getenv("TILEIR_DELAY_TMA_STORE_WAIT");
    if (value == NULL)
        return 0;
    // Parse path raises std::invalid_argument on bad input;
    // calling frame does not catch, so the process aborts.
    return std::stoi(value);
}

The two CUDA-root resolvers look structurally similar, but their miss paths are not equivalent. The pseudocode below is the contract a reimplementation must preserve byte-for-byte — in particular, the NVVM resolver must keep its empty-string return, because downstream libdevice loading relies on that sentinel to defer the actual lookup to a layered fallback that lives outside this function.

const char *resolveCudaRoot_driver(void) {              // sub_5773C0
    if (const char *p = getenv("CUDA_ROOT"))   return p;
    if (const char *p = getenv("CUDA_HOME"))   return p;
    if (const char *p = getenv("CUDA_PATH"))   return p;
    return walkSelfExeUpTwo();                          // /proc/self/exe -> ../../
}

const char *resolveCudaRoot_nvvm(void) {                // sub_1A41D30
    if (const char *p = getenv("CUDA_ROOT"))   return p;
    if (const char *p = getenv("CUDA_HOME"))   return p;
    if (const char *p = getenv("CUDA_PATH"))   return p;
    return "";                                          // hazard: no fallback
}

Host Launch ABI and ptxas Knobs covers how the PTX_KNOBS_PATH value lands in the ptxas argv; Subprocess Harness shows the ptxas-launcher argv shape that consumes the forwarded knob path; Driver CLI Options documents the companion option flags whose semantics overlap with these gates; Resource Constraint Builder and RRT is the RCB consumer of TILE_AS_DEBUG_UNLIMITED_SMEM.