TileAS Scheduling Glue Passes

Abstract

The scheduling glue passes prepare IR for schedule generation, publish schedule analysis, and tag loops that LLVM must unroll later. They sit between layout/buffer preparation and async-pipeline materialization. Their job: make the IR schedulable, run the scheduler, preserve the result for materialization, and keep register-tensor loops from surviving as runtime-indexed loops in PTX.

This chapter also records the LLVM/NVVM cleanup passes that run after TileIR has been lowered. They schedule no TileAS IR, but they protect the backend handoff — selecting kernels, printing kernel metrics, propagating address spaces, and demoting non-kernel functions before PTX emission.

Pass Roster

| Pass | Layer | Purpose |
|------|-------|---------|
| TileASPrepareForScheduling | TileAS MLIR | canonicalizes tiled views, atom widths, loop hints, memory layout, and TMA box dimensions |
| TileASGenerateSchedule | TileAS MLIR | runs serial or cost-based scheduling and publishes ScheduleAnalysis |
| TileASUnrollRegisterLoops | TileAS/LLVM bridge | tags register-tensor loops for full LLVM unrolling |
| PretreatPass | LLVM/NVVM | canonicalizes NVVM LLVM IR after inlining and before verification |
| CheckGepIndexPass | LLVM/NVVM | validates constant GEP indices after canonicalization |
| SelectKernelsPass | LLVM/NVVM | filters the module to selected kernels |
| KernelInfoPrinter | LLVM/NVVM | prints kernel metrics for diagnostics and tuning |
| IPMSPPass | LLVM/NVVM | propagates concrete NVPTX address-space facts interprocedurally |
| NVVMAA | LLVM analysis | provides alias facts from NVPTX address spaces |
| NVPTXSetFunctionLinkagesPass | LLVM/NVVM | demotes non-kernel functions and marks kernel visibility |

Prepare for Scheduling

TileASPrepareForScheduling is the last TileAS cleanup before schedule generation. Six stages run in fixed order.

| Stage | Purpose |
|-------|---------|
| tiled load/store view decomposition | rewrites reshaped tiled views into direct views of the underlying allocation |
| atom vector-size refinement | chooses target-aware vector widths for copy, TMA, and MMA atoms |
| slice/fuse marking | annotates loops with slicing and fusion hints already discovered upstream |
| scheduling canonicalizer | unrolls tiny loops and decomposes single-op wrappers around atom operations |
| memory layout compaction | assigns shared-memory offsets to non-overlapping live ranges |
| TMA box-dimension refresh | recomputes runtime box dimensions used by Blackwell tensor-copy operations |

The stage order is part of the contract. TMA box dimensions, for instance, must refresh after view decomposition and memory layout compaction — the descriptor dimensions have to reflect the final view and allocation shape.

LogicalResult prepare_for_scheduling(FuncOp func, TargetInfo target) {
    if (failed(decompose_tiled_load_store_views(func))) {
        return failure();
    }
    if (failed(refine_atom_vector_sizes(func, target))) {
        return failure();
    }
    mark_slice_and_fuse_candidates(func);

    if (failed(run_scheduling_canonicalizer(func))) {
        return failure();
    }
    if (failed(compact_memory_layout(func, target))) {
        return failure();
    }

    return refresh_tma_box_dimensions(func, target);
}

Generate Schedule

TileASGenerateSchedule reads compute capability, builds the machine model, picks a scheduler strategy, runs it on each schedulable region, legalizes the result for materialization, and publishes ScheduleAnalysis.

The scheduler option accepts the serial baseline and cost-based variants. Unknown values fall back to the cost scheduler. The cost path also reads the maximum constraint-iteration count and the RRT-size threshold — it can refine constraints repeatedly before convergence.
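
The fallback rule can be sketched as a small option parser. This is a minimal illustration, not the pass's actual code: the enum, the function name, and the "serial" option spelling are assumptions, since the source only states that a serial baseline and cost-based variants exist and that unknown values select the cost scheduler.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical strategy set; the real pass may expose several cost-based variants.
enum class SchedulerKind { Serial, Cost };

// Maps an option string to a strategy. Only the (assumed) "serial" spelling
// selects the serial baseline; every other value, including typos, falls
// back to the cost scheduler.
constexpr SchedulerKind parse_scheduler_kind(std::string_view name) {
    if (name == "serial") {
        return SchedulerKind::Serial;
    }
    return SchedulerKind::Cost;
}
```

Under this rule a misspelled option never produces an error; it silently selects the cost path, which is worth keeping in mind when comparing schedules between runs.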

LogicalResult generate_schedule(FuncOp func, SchedulerOptions options) {
    ComputeCapability cc = read_compute_capability(func);
    if (!cc.valid()) {
        return func.emit_error("failed to get compute capability");
    }

    MachineModel model = build_machine_model(cc);
    Scheduler scheduler = make_scheduler(options.scheduler, model, options);

    for (Region *region : find_schedulable_regions(func)) {
        ScheduleAnalysis analysis;
        if (failed(scheduler.schedule(region, &analysis))) {
            return region->emit_error("failed to find a schedule");
        }
        if (failed(legalize_loop_schedule_for_materialization(region, analysis))) {
            return region->emit_error("failed to legalize schedule");
        }
        preserve_schedule_analysis(region, analysis);
    }

    return success();
}

With trace output enabled, the pass emits a Chrome tracing JSON stream. The useful events: issue timing, atom attributes, dependency edges, minimal latency, actual latency, and thread order. Tracing is diagnostic only — it must not affect scheduling.
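
For orientation, one trace record could be serialized as shown here. This is a sketch under stated assumptions: the struct, its field names, and the mapping of issue time to "ts", actual latency to "dur", and thread order to "tid" are all hypothetical. Only the JSON keys and the "X" (complete event) phase come from the Chrome Trace Event Format itself.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record for one scheduled atom; field names are illustrative.
struct IssueEvent {
    std::string name;        // atom or op name (assumed to need no JSON escaping)
    std::uint64_t issue_ts;  // issue time, used here as the trace timestamp
    std::uint64_t latency;   // actual latency, used here as the event duration
    int thread_order;        // mapped to the trace "tid" lane
};

// Formats one complete event (ph = "X") in the Chrome Trace Event Format.
std::string to_trace_json(const IssueEvent &e) {
    std::ostringstream os;
    os << "{\"name\":\"" << e.name << "\",\"ph\":\"X\",\"ts\":" << e.issue_ts
       << ",\"dur\":" << e.latency << ",\"pid\":0,\"tid\":" << e.thread_order << "}";
    return os.str();
}
```

Because the formatter only reads the record, emitting it cannot feed back into scheduling decisions, which matches the diagnostic-only requirement.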

Register-Loop Unroll Tagging

TileASUnrollRegisterLoops (D10) is the smallest pass in this family. It attaches a single #llvm.loop_annotation<unroll = <enable>> attribute to each op that D08 (MaterializeConvertLayout) has already marked as the root of a reg-to-reg copy sequence. The pass unrolls nothing itself. It only writes metadata so that LLVM's LoopUnrollPass fires on the resulting loop later in the NVPTX backend, after convert-scf-to-cf and LowerLoopAnnotation turn the scf.for body into an llvm.br/llvm.phi cycle carrying the !llvm.loop MD node.

The pass triple lives at 0x825590 (getName returning "TileASUnrollRegisterLoops"), 0x8255A0 (getArgument returning "tileas-unroll-register-loops"), and 0x8255B0 (getDescription returning "Unroll loops that access slices of register tensors"). getDependentDialects at 0x8256B0 loads nv_tileaa and nv_tileas. The pass exposes no options: no Option<...> registrations appear in the range, and no command-line wrappers reference it beyond the argument string.

The core rewriter is sub_825BF0 (1 272 B). It walks the function with runOnOperation at sub_8261E0, and for each op carrying D08's "nv_tileas.root_reg_to_reg_copy_op" UnitAttr marker it interns the "loop_annotation" key via sub_440A060(ctx, "loop_annotation", 15) and attaches a LoopAnnotationAttr whose only populated field is unroll = <enable>. The cached global Identifier slot for the llvm.loop_annotation attribute name is qword_591E3B8; the rewriter loads it once and reuses the interned key across every tagged op in the function.

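LogicalResult unrollRegisterLoops(FunctionOpInterface fn) {
    MLIRContext *ctx = fn.getContext();
    // Intern the attribute key once and reuse it for every tagged op.
    StringAttr key = StringAttr::get(ctx, "loop_annotation");        // sub_440A060
    fn.walk([&](Operation *op) {
        // Only ops that D08 marked as reg-to-reg copy roots receive the hint.
        if (!op->hasAttr("nv_tileas.root_reg_to_reg_copy_op")) {
            return;
        }
        op->setAttr(key, LoopAnnotationAttr::get(ctx, /*unroll=*/UnrollAttr::enable()));
    });
    return success();
}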

The resulting attribute is the single-element LoopAnnotationAttr form #llvm.loop_annotation<unroll = <enable>>. LLVM's LoopUnrollPass reads this through the !llvm.loop metadata convention and picks the unroll factor from its own cost model (computeUnrollCount, PartialThreshold, register pressure). D10 deliberately writes neither llvm.loop.unroll.count nor llvm.loop.unroll.full: the policy is "enable, and let LLVM decide", keeping the factor decision in one place rather than splitting it between the TileAS scheduler and the NVPTX backend.

Interaction with D08

D10 pairs with D08 (MaterializeConvertLayout). With reg2reg-vec-size at or below 15, D08 cannot vectorise the layout conversion and falls back to a per-element copy sequence. The root op of that sequence carries the "nv_tileas.root_reg_to_reg_copy_op" UnitAttr; downstream lowerings turn that root into a scf.for whose body shuffles register slices one element at a time. D10 recognises the marker and adds the unroll hint so the loop fully unrolls in PTX, where register variables have no runtime addressing — a rolled loop would either spill to local memory or be miscompiled by the tile-to-vector lowering.

The two passes stay split rather than fused because their context requirements pull in opposite directions. D08 needs layout-decomposition state — the cute atom catalogue, the per-target vector-size table, the layout-conversion analysis — to decide which ops are reg-to-reg copy roots. D10 needs the final IR shape: every layout-affecting pass between D08 and PTX emission must already have run so the marker still sits on the op that becomes the actual loop. Fold D10 into D08 and the marker decision has to win against every subsequent rewrite; fold D08 into D10 and D10 has to redo the layout analysis. Splitting them lets each pass run at its natural point in the pipeline and reduces the marker to a one-bit handoff.

LLVM/NVVM Backend Glue

The LLVM/NVVM passes run after TileAS has lowered to LLVM and NVVM dialects.

PretreatPass performs NVIDIA-specific LLVM IR cleanup after inlining: normalising pointer casts, byval aggregates, lifetime intrinsics, intrinsic placeholders, and NVVM metadata before verification and lowering.

CheckGepIndexPass validates that constant GEP indices are in bounds once canonicalization has exposed the final aggregate shapes.
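
The bounds rule can be illustrated on a simplified model. This sketch is an assumption about the shape of the check, not the pass's code: it covers only nested array extents (outermost first) and ignores struct fields and the leading pointer index of a real GEP. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when every constant index lies inside the matching array
// extent. `extents` lists the nested array sizes, outermost first; supplying
// more indices than there are array levels is treated as out of bounds.
bool gep_indices_in_bounds(const std::vector<std::int64_t> &extents,
                           const std::vector<std::int64_t> &indices) {
    if (indices.size() > extents.size()) {
        return false;
    }
    for (std::size_t i = 0; i < indices.size(); ++i) {
        if (indices[i] < 0 || indices[i] >= extents[i]) {
            return false;
        }
    }
    return true;
}
```

Running such a check only after canonicalization matters because the extents are not meaningful until the final aggregate shapes are known.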

SelectKernelsPass filters the module by a configured list or ordinal range. Non-selected kernels disappear before code generation.

KernelInfoPrinter emits tuning metrics — stack allocation count, direct and indirect calls, inline assembly calls, invokes, and flat-address-space accesses. Flat address-space use matters: generic pointers block tensor-memory and bulk-copy lowering.

IPMSPPass propagates concrete pointer address spaces through calls. When a helper receives a generic pointer but every caller supplies the same concrete space, the pass rewrites or clones the helper so later NVPTX code skips the generic address-space cast.
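
The "every caller supplies the same concrete space" condition can be sketched as a simple agreement test over call sites. The enum and function name are hypothetical, and the real pass works on LLVM IR call graphs rather than a flat list. The numeric values are the standard NVPTX address-space numbers.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// NVPTX address-space numbers: generic = 0, global = 1, shared = 3,
// const = 4, local = 5.
enum AddrSpace : unsigned { Generic = 0, Global = 1, Shared = 3, Const = 4, Local = 5 };

// Returns the concrete space when every call site passes a pointer in the
// same non-generic space. Returns nullopt when there are no callers, when
// any caller passes a generic pointer, or when callers disagree; in those
// cases the helper must keep its generic parameter.
std::optional<AddrSpace> common_concrete_space(const std::vector<AddrSpace> &caller_spaces) {
    if (caller_spaces.empty()) {
        return std::nullopt;
    }
    const AddrSpace first = caller_spaces.front();
    if (first == Generic) {
        return std::nullopt;
    }
    for (AddrSpace space : caller_spaces) {
        if (space != first) {
            return std::nullopt;
        }
    }
    return first;
}
```

When callers disagree, cloning the helper once per concrete space is the alternative the text mentions; this sketch only covers the case where a single rewrite suffices.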

NVVMAA is the companion alias analysis: generic address space may alias, distinct non-generic spaces do not alias, and equal non-generic spaces may alias.
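
The three rules translate directly into a small decision function. This is a sketch of the rule as stated above, not the analysis itself: the names are hypothetical, and a real alias analysis also consults pointer provenance and any target-specific exceptions, which are omitted here.

```cpp
#include <cassert>

enum class AliasResult { NoAlias, MayAlias };

// NVPTX uses address space 0 for generic pointers.
constexpr unsigned kGenericSpace = 0;

// Address-space-only alias query:
//   generic vs anything         -> MayAlias
//   distinct non-generic spaces -> NoAlias
//   equal non-generic spaces    -> MayAlias
AliasResult alias_by_address_space(unsigned lhs_space, unsigned rhs_space) {
    if (lhs_space == kGenericSpace || rhs_space == kGenericSpace) {
        return AliasResult::MayAlias;
    }
    if (lhs_space != rhs_space) {
        return AliasResult::NoAlias;
    }
    return AliasResult::MayAlias;
}
```

The only precise answer this rule can give is NoAlias, so its value depends on how many pointers have already been moved out of the generic space before it is queried.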

NVPTXSetFunctionLinkagesPass identifies kernels by calling convention, nvvm.kernel, or kernel metadata, then demotes non-kernel definitions to internal linkage.

Schedule Support Bodies

Three support algorithms serve every schedule consumer (MaterializeSchedule, UnspecializedPipeline, and the downstream stage expander). They live as free functions in the schedule-utils namespace rather than methods on ScheduleAnalysis — the same code then serves both AUS and AWS drivers without dragging analysis state into per-driver vtables.

| Helper | Role |
|--------|------|
| materialize schedule | partitions loads, stores, pending async operations, and iteration data into materialization maps keyed by the producing op |
| build stages | converts union constraints into stage-ordered producer/consumer pairs that drive prologue/body/epilogue construction |
| expand single tiled op | replicates a tiled operation per stage and rewires operands to stage-local values |

StageMap materialize_schedule(Region *region, ScheduleAnalysis &an) {
    StageMap m;
    for (Operation *op : region_ops(region)) {
        Stage s = an.stage_of(op);
        if      (is_load(op))  m.loads[s].push_back(op);
        else if (is_store(op)) m.stores[s].push_back(op);
        else if (is_async(op)) m.pending_async[s].push_back(op);
        else                    m.iter_data[s].push_back(op);
    }
    return m;
}

void build_stages(ScheduleAnalysis &an, SmallVector<Stage> *out) {
    for (Constraint &c : an.union_constraints()) {
        Stage producer = c.producer_stage();
        Stage consumer = c.consumer_stage();
        emit_producer_consumer_pair(*out, producer, consumer);
    }
    stable_sort_by_stage(*out);
}

void expand_single_tiled_op(TiledOp op, ScheduleAnalysis &an, Rewriter *rw) {
    Map<Value, Value> operand_source       = collect_operand_sources(op);
    Map<Stage, Operation *> stage_anchor   = collect_stage_anchors(an);

    for (Stage stage : an.stages_for(op)) {
        Operation *replica = clone_for_stage(op, stage, rw);
        for (Operand &operand : replica->operands()) {
            if (Value replacement = lookup_stage_value(operand, stage, operand_source, stage_anchor)) {
                operand.replace_with(replacement);
            }
        }
    }
}

Ordering Invariants

  • PrepareForScheduling must run before GenerateSchedule.
  • GenerateSchedule must preserve ScheduleAnalysis for materialization.
  • Register-tensor unroll tagging must run before LLVM loop-unroll passes.
  • LLVM/NVVM pretreatment and verification must run before PTX emission.
  • Kernel selection and linkage demotion must not run before kernel attributes are finalized.
  • Address-space propagation should run before alias-sensitive optimization and final NVPTX lowering.
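
Several of these invariants can be checked mechanically against a pipeline description. The sketch below is hypothetical tooling, not part of the compiler: it models the pipeline as an ordered list of pass names taken from the roster, encodes three of the invariants, and treats a missing pass as trivially satisfying its constraint (an assumption).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// True when `before` appears earlier than `after` in the pipeline.
// If either pass is absent the constraint is treated as satisfied.
bool runs_before(const std::vector<std::string> &pipeline,
                 const std::string &before, const std::string &after) {
    auto b = std::find(pipeline.begin(), pipeline.end(), before);
    auto a = std::find(pipeline.begin(), pipeline.end(), after);
    if (b == pipeline.end() || a == pipeline.end()) {
        return true;
    }
    return b < a;
}

// Checks a subset of the ordering invariants listed above.
bool check_ordering(const std::vector<std::string> &pipeline) {
    return runs_before(pipeline, "TileASPrepareForScheduling", "TileASGenerateSchedule") &&
           runs_before(pipeline, "TileASUnrollRegisterLoops", "LoopUnrollPass") &&
           runs_before(pipeline, "PretreatPass", "CheckGepIndexPass");
}
```

A check like this is cheap enough to run whenever the pipeline is rearranged, and further pairs can be added as more invariants are pinned to specific pass names.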

Cross-References

Layout and Buffer Family — Assign Load/Store Layouts documents D08 (MaterializeConvertLayout), which emits the "nv_tileas.root_reg_to_reg_copy_op" marker that D10 consumes. nv_tileas Op Roster and Builders — Attribute Roster lists the attribute as part of the dialect's UnitAttr inventory. The upstream LLVM dialect loop_annotation attribute documents the metadata shape that D10 attaches and that the NVPTX backend's LoopUnrollPass later reads. The schedule analysis this family publishes is consumed by Materialize Schedule; the resource model behind it is documented in Modulo Scheduler and Rau-Style Placement and Resource Constraint Builder and RRT. Backend address-space alias facts that NVVMAA exposes are described in NVPTX Bring-up and Target Init.