Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Schedule Solve and Cost Evaluators

Abstract

Schedule::solve is not the modulo scheduler and not an optimization solver. It is the materialization step that consumes an already-computed schedule analysis and emits the SSA pipe values the TileAS program needs. Resource search, cost ranking, and initiation-interval work all happen earlier during TileASGenerateSchedule.

The cost evaluators on this page belong to that earlier generation pass. They use paired RRT probes and structural distance constraints to rank candidate placements. Schedule::solve runs none of that machinery — it performs classification, closure, disjoint-set merging, and pipe emission.

Pass Boundary

PhasePassMain job
schedule generationTileASGenerateSchedulecompute stage/order, resource placement, and schedule analysis
schedule materializationMaterializeScheduleemit Pipe_ and Mutex_ values from the preserved analysis

The split matters for reimplementation. Treating Schedule::solve as the place where II search happens tangles the compiler architecture and ties the materialization pass to mutable generation state. In Tileiras, generation publishes analysis; materialization reads it.

Schedule::solve

Schedule::solve fires once for each relevant outer operation and candidate consumer. Its output is a set of pipe values that connect producer groups to consumer positions. The body is a greedy disjoint-set propagation pass over six phases — never choosing a new II, never probing the RRT. It sorts candidate operations by (stage, order), builds maps from operation to schedule info, filter state, and owning outer operation, closes the producer set under the "live at this consumer" relation, links producers that share the same original value through a disjoint-set forest, sweeps each disjoint-set root to pick the canonical owner by schedule order, and emits Pipe_ values for producer-to-consumer groups — plus a consumer-only fallback when the live set is empty.

void solve_schedule(Schedule *schedule,
                    RawValue raw_value,
                    Operation *consumer) {
    Worklist work = collect_candidate_producers(schedule, raw_value, consumer);
    stable_sort(work, compare_stage_then_order);

    Map<Operation *, ScheduleInfo> info = build_schedule_info(work);
    Map<Operation *, Operation *> owner = build_owner_map(work);
    DisjointSet dsu = {};

    ProducerSet live = close_under_live_at_consumer(work, consumer, info);

    for (Operation *producer : live) {
        for (Value operand : producer->operands()) {
            Operation *def = operand.defining_op();
            if (origin(schedule, operand) == raw_value) {
                dsu.union_nodes(producer, def);
            }
        }
    }

    for (DsuRoot root : dsu.roots()) {
        SmallVector<Operation *> members = dsu.members(root);
        Operation *canonical = choose_earliest_owner(members, info);
        emit_pipe_for_group(schedule, canonical, members, consumer);
    }

    if (live.empty()) {
        emit_consumer_only_pipe(schedule, consumer);
    }
}

The comparator is lexicographic — lower stage first, then lower order. This solver never consults a scalar cost function, a resource row, or an initiation interval.

Pipe and Mutex Emission

Before Schedule::solve emits final pipes, materialization builds auxiliary maps from the preserved schedule analysis. One path emits mutex values for exclusion relationships, another emits preliminary pipe placeholders, and the solver reconciles those placeholders into final pipe SSA values from the disjoint-set groups. The naming convention is deliberately visible in the IR: Pipe_ values model dataflow between scheduled producer and consumer regions, while Mutex_ values model exclusion or ordering constraints that ordinary value dependencies cannot express.

void emit_pipe_and_mutex(Schedule *S, const ScheduleAnalysis *analysis) {
    build_orig_map(S, analysis);                   // sub_8E2790 probe target
    build_second_map(S, analysis);                 // sub_8E2F00 probe target

    seed_mutex_placeholders(S);                    // exclusion edges first
    seed_pipe_placeholders(S);                     // dataflow edges next

    for (CandidatePair p : S->consumer_worklist) {
        solve_schedule(S, p.raw_value, p.consumer);
    }

    collapse_skeleton_pipes(S);                    // dedup producer groups
    rebuild_scheduled_region(S);                   // splice Pipe_/Mutex_ ops in
    verify_scheduled_region(S);                    // hard postcondition
}

Mutex placeholders go first because their exclusion semantics are stricter than the pipe placeholders. A missing mutex is a correctness bug; a missing pipe edge is a missed-optimization bug. The materializer commits the harder constraint before relaxing into the softer one.

Cost Evaluators

Cost evaluators run during generation. They answer one question: can this candidate fit at this interval or cycle, and how expensive is that choice? Tileiras uses two paired evaluators.

EvaluatorPair being modeledRole
bank-pressure evaluatorcurrent-iteration and next-iteration resource shadowschecks bank and carry-over resource conflicts
pipe-slot evaluatorresource RRT and structural distance matrixchecks resource slots and dependence distance together

A generic feasible-search driver calls both evaluators. They return a boolean success flag and, on success, a candidate placement state the caller can commit.

bool evaluate_candidate(SearchOutput *out,
                        Candidate candidate,
                        uint32_t ii,
                        ResourceModel *resources,
                        DistanceMatrix *distances) {
    if (!resource_rows_are_free(resources, candidate, ii)) {
        return false;
    }

    if (!distance_window_allows(distances, candidate, ii)) {
        return false;
    }

    out->placement = candidate.placement;
    out->cost = compute_lexicographic_cost(candidate, resources, distances);
    return true;
}

The cost is lexicographic. Resource feasibility is the hard gate; pipeline-slot utilization and structural distance rank only the candidates that clear it.

Structural Distance Matrix

The pipe-slot evaluator builds an all-pairs distance matrix for the candidate interval. Each edge encodes how far apart two operations must sit after dependence latency, iteration distance, and skew are accounted for. A transitive closure then lets the evaluator query a legal placement window in constant time.

void build_distance_closure(DistanceMatrix *matrix, Graph graph, uint32_t ii) {
    matrix->fill(INFINITE_DISTANCE);

    for (Edge edge : graph.edges()) {
        int32_t distance = edge.latency - (int32_t)(ii * edge.iteration_distance);
        matrix->set(edge.src, edge.dst, distance);
    }

    for (Node k : graph.nodes()) {
        for (Node i : graph.nodes()) {
            for (Node j : graph.nodes()) {
                int32_t through = matrix->get(i, k) + matrix->get(k, j);
                if (through > matrix->get(i, j)) {
                    matrix->set(i, j, through);
                }
            }
        }
    }
}

The closure is generation-only state. It is not carried into Schedule::solve.

Search Driver

The evaluators share the same galloping-plus-binary search shape the resource builder uses. The outer search chooses II; the inner search may pick a cycle threshold or candidate row. That nesting lets generation find a feasible schedule without linearly scanning every interval and cycle.

bool search_with_probe(SearchOutput *out,
                       ProbeFn probe,
                       uint32_t lower,
                       uint32_t upper) {
    uint32_t hi = lower;

    while (hi < upper && !probe(hi, out)) {
        hi = min(upper, hi * 2);
    }

    if (!probe(hi, out)) {
        return false;
    }

    uint32_t lo = lower;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        SearchOutput candidate = {};

        if (probe(mid, &candidate)) {
            hi = mid;
            *out = candidate;
        } else {
            lo = mid + 1;
        }
    }

    return true;
}

Shared Helpers

Several helpers run alongside Schedule::solve on the scheduling-adjacent lowering path. Persistent-loop construction emits the canonical widened-index scf.for used by persistent kernels; shape verification checks that problem shape, tile shape, and cluster-group agree; result-type verification keeps work-tile info results in one consistent representation; generic work-tile info construction adapts async values and CUTLASS work-tile descriptors. None of them participate in the solve itself, but each enforces a contract that scheduling and materialization assume — failing any of these checks marks the schedule invalid before the solver runs.

Schedule::solve Body (sub_8EEE70)

The previous section described Schedule::solve as five abstract steps. The actual implementation in sub_8EEE70 is a 2 269-line, six-phase state machine that materialises five distinct hashtables and a Union-Find forest on the stack, runs a generator dispatcher, falls back to a cost-based generator on failure, and emits one trivial-schedule fallback when every generator path fails. The trampoline sub_8F19D0 invokes the body, and that trampoline reaches the body from Schedule::buildAndSolve (sub_8F1AA0) inside the materialization pass. The constraint graph and schedule analysis arrive from ResourceConstraintBuilder upstream; sub_8EEE70 consumes that analysis and produces the final op-to-stage and op-to-order mapping.

Stack-Local State

Five hashtables and one Union-Find bucket array live on the solver stack frame. Bucket strides vary with the payload size — schedInfo carries a 32-byte payload of stage/order/RRT-row metadata, while the three pointer-typed maps store a single 8-byte payload after the key.

TableBucket strideKeyValue storedProducer
schedInfo40 BOp*current stage, order, RRT row, classification flag bytesfilled by phase 2 topo walk
filterMap16 BOp*"dead in retry" sentinel byte at +8written by retry-arm failures
parentMap16 BOp*DSU group leader pointerseeded from constraint-graph slot
opToOwner16 BOp*Pipe_ / Mutex_ owner referencewritten by phase 3 generator
UF buckets72 Bu64{list_ptr, tail_cap32, tail_size32, Op* inline[3]}seeded from Schedule+0x70

All four Op*-keyed maps use the standard llvm::DenseMap open-addressing layout with hash = (op>>9) ^ (op>>4), empty sentinel 0xFFFFFFFFFFFFE000, tombstone 0xFFFFFFFFFFFFF000, and the 4*(size+1) >= 3*capacity grow rule. The Union-Find bucket layout is custom: the inline three-slot tail amortises the common case where a DSU group has at most three members, and overflow spills to the heap via list_ptr. The shape matches the empirical group sizes ResourceConstraintBuilder produces, which rarely group more than three ops together.

Six-Phase State Machine

void schedule_solve_body(Schedule *schedule,
                         uint64_t  raw_value,
                         Operation *consumer) {
    // Phase 1: init. Construct five tables and the ready-queue staging vector.
    DenseMap sched_info = dense_map_create(/*bucket_stride=*/40);
    DenseMap filter_map = dense_map_create(/*bucket_stride=*/16);
    DenseMap parent_map = dense_map_create(/*bucket_stride=*/16);
    DenseMap op_to_owner = dense_map_create(/*bucket_stride=*/16);
    UfBuckets uf = uf_buckets_create(/*bucket_stride=*/72);
    SmallVector<Operation *, 0> ready_queue = {};

    uf_seed_from_constraint_dsu(&uf, /*src=*/schedule->dsu_at_0x70);

    // Phase 2: topo prep. Stamp parentMap[op] = leader for every op.
    for (Operation *op : schedule_ops_topological(schedule)) {
        ConstraintSlot *slot = sub_94A550(schedule, op);
        parent_map_set(&parent_map, op, slot->leader);
    }

    // Phase 3: generator dispatch. Cheap placement driver runs first.
    if (sub_981D50(schedule, &sched_info, &parent_map, &op_to_owner)) {
        goto finalize;
    }

    // Phase 4: cost-based fallback. CostBasedScheduleGenerator with pre-warmed UF.
    if (cost_based_schedule_generator(schedule,
                                      &sched_info,
                                      &parent_map,
                                      &op_to_owner,
                                      &uf,
                                      /*cmp=*/sub_8F7900,
                                      /*sort=*/sub_8F7EF0)) {
        goto finalize;
    }

    // Phase 5: zero-producers fallback. Trivial Pipe_ flavour-A schedule.
    sub_8E9450(schedule, raw_value, /*producers=*/NULL, /*n_producers=*/0,
                                     /*consumers=*/&consumer, /*n_consumers=*/1);
    schedule->flags |= 4u;                       // mark trivial-fallback exit

finalize:
    // Phase 6: materialise Op.start and Op.stage per op, then free everything.
    for (Operation *op : schedule_ops(schedule)) {
        SchedInfoRow *row = dense_map_find(&sched_info, op);
        op->stage = row->stage;
        op->start = row->order;
    }

    dense_map_destroy(&sched_info);    // sub_4560420 — bucket slab
    dense_map_destroy(&filter_map);    // sub_4560420
    dense_map_destroy(&parent_map);    // sub_4560420
    dense_map_destroy(&op_to_owner);   // sub_4560420
    uf_buckets_destroy(&uf);           // sub_4560420 per overflow tail + slab
}

The eight distinct sub_4560420 aligned-free sites in the finalize phase match the four DenseMap bucket slabs, the UF bucket slab, the UF per-bucket overflow tails, and two scratch SmallVector tails. An inline-vs-heap check on the SSO capacity word guards every free so the deallocator never runs on stack-resident storage.

Comparator and Heap-Sort

sub_8F7900 is the lexicographic (stage, order) comparator. It reads schedInfo[op].stage first and tie-breaks on schedInfo[op].order. No resource row enters the comparison — ResourceConstraintBuilder already consulted the RRT upstream. sub_8F7EF0 is a textbook libc++ __push_heap / __pop_heap pair operating on the 24-strided Op* ready-queue vector. Both functions appear only in phase 4; phase 3's generator dispatch uses its own internal ordering driven by the constraint-graph topology.

int sub_8F7900(Operation *a, Operation *b, DenseMap *sched_info) {
    SchedInfoRow *ra = dense_map_find(sched_info, a);
    SchedInfoRow *rb = dense_map_find(sched_info, b);
    if (ra->stage != rb->stage) {
        return (ra->stage > rb->stage) ? +1 : -1;
    }
    return (ra->order > rb->order) ? +1 : -1;
}

The comparator never returns zero. Ties on (stage, order) are impossible at this point because phase 2's topo walk assigns a unique order to every op — which is what lets the heap-sort stay stable without an explicit tie-breaker.

Callee Inventory (Sampled)

sub_8EEE70 calls 41 distinct functions. The ten most relevant beyond the dispatchers and the eight free sites appear below.

CalleeRole
sub_94A550Constraint-slot lookup — returns the parentMap leader for an op
sub_8E4510Constraint propagator — walks the DSU forest to finalise group leaders
sub_8E2790origMap probe at Schedule+80..96 — used by phase 3 to find raw-value producers
sub_8E2F00Second-table fmix64 probe at Schedule+104..120 — depth-keyed lookup
sub_8F19D0Per-pair solve trampoline — caller of sub_8EEE70
sub_8EC560Union-Find coalesce — merges groups that share a producer post-generator
sub_8E1900DSU snapshot copy-out — preserves the final group leaders for the materializer
sub_8E4F10Alias materialisation, 10 430 bytes — rewrites operand references through Pipe_ SSA
sub_8FB180parseFromAttrs reading nv_tile.aws.stage and nv_tile.aws.order
sub_981D50Placement-driver entry — generator dispatched in phase 3

sub_8E4F10 is the heaviest callee — it materialises the Pipe_ and Mutex_ operand rewrites that bring the IR into final form. The 10 430-byte size reflects every operand shape the constraint graph can produce, including the asymmetric cases where a Pipe_ consumer sits at a different stage than its producer.

Zero-Producers Fallback Semantics

The phase 5 fallback fires only when both the placement driver and the cost-based generator report failure. The body emits a single Pipe_ flavour-A value (the scalar-shaped pipe constructor — see Pipe and Mutex Value Layout for the flavour-A/flavour-B split) with zero producers and exactly one consumer — the consumer argument — then sets the trivial-schedule flag on Schedule.flags. Neither the Mutex_ constructor nor the Pipe_ flavour-B constructor runs from this branch; those primitives are materialised earlier by the walker that precedes the per-pair solve trampoline. The trivial schedule means "ship the consumer with no producers and let later passes diagnose the missing dataflow", not a recovery attempt.

Dual-RRT Cost Evaluators

Once Schedule::solve invokes the placement driver, the cost-based fallback ranks candidate placements through two dual-RRT cost evaluators. Both wrap in std::function<bool(int)>-shaped lambda thunks and route through a shared exponential-then-binary search driver. The two evaluators answer disjoint feasibility questions: the first measures actual hardware-resource occupancy under the current II, the second measures structural distance between a candidate op and its data dependencies. The cost-based scheduler runs both and combines them lexicographically — pipe-slot first as the legality gate, then bank-pressure as the preference signal.

Exponential-Then-Binary Search Driver sub_988080

The driver accepts a candidate-cost predicate lambda. It expands the cost threshold exponentially (1, 2, 4, 8, ...) until the predicate flips from false to true, then binary-searches inside the bracketing range to pin down the smallest threshold at which the candidate becomes feasible. Two lambda thunks ride the driver: sub_987E70 for the bank-pressure evaluator and sub_987EE0 for the pipe-slot evaluator. Each captures a pointer to the surrounding Schedule state, the candidate Op*, and the cycle index being probed; the threshold value flows through the lambda's single integer argument.

uint32_t sub_988080(std::function<bool(int)> probe, uint32_t lower, uint32_t upper) {
    uint32_t hi = (lower > 0) ? lower : 1;

    // Exponential expansion: double hi until probe flips true or we hit the cap.
    while (hi < upper && !probe((int)hi)) {
        hi = (hi * 2u <= upper) ? hi * 2u : upper;
    }

    if (!probe((int)hi)) {
        return UINT32_MAX;                       // candidate never becomes feasible
    }

    // Binary search inside [lower, hi] for the smallest threshold that satisfies probe.
    uint32_t lo = lower;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2u;
        if (probe((int)mid)) {
            hi = mid;
        } else {
            lo = mid + 1u;
        }
    }

    return lo;                                   // smallest feasible threshold
}

The return value is the cost-threshold cap at which the candidate first becomes feasible. The cost-based generator treats that value as the candidate's score and picks the minimum across all candidates at the current cycle.

Bank-Pressure Evaluator sub_98C440

The bank-pressure evaluator probes a dual-RRT pair: rrt0 holds in-iteration occupancy (resources used by ops placed at the current iteration of the kernel), rrt1 holds cross-iteration carry (resources still occupied from the previous iteration's tail at the same modulo cycle). The evaluator reads pool caps 4 (in-iteration) and 3 (cross-iteration) from the 9-element pool-capacity vector at indices 1 and 6 — TMEM and named-barrier pools respectively (see Blackwell Pipeline 15-Slot Model — Pool Capacity Vector). The trampoline sub_98E6A0 wires the evaluator into the driver, and the thunk sub_987E70 adapts the call site to the std::function<bool(int)> shape sub_988080 expects.

bool sub_98C440(const Schedule *S, const Op *op, uint32_t t, uint32_t cost_cap) {
    // rrt0 = in-iteration occupancy; rrt1 = cross-iteration carry.
    uint32_t in_iter  = countRrtBits(S->rrt0, op->node_rrt, t);
    uint32_t cross_it = countRrtBits(S->rrt1, op->node_rrt, t);

    // Pool caps from the 9-element capacity vector: index 1 = TMEM, index 6 = named-barrier.
    // Hard cap per pool, then a combined budget that the cost driver tightens via cost_cap.
    if (in_iter  > 4u)  return false;            // TMEM bank pressure
    if (cross_it > 3u)  return false;            // named-barrier carry
    if (in_iter + cross_it > cost_cap) return false;

    return true;
}

The combined in_iter + cross_it <= cost_cap term is what the exponential-then-binary search inside sub_988080 walks. The two per-pool caps (4 and 3) are hard gates no amount of cost relaxation can lift — they reflect physical hardware limits on TMEM banks and named-barrier slots, baked into the binary as immediate constants.

Pipe-Slot Evaluator sub_98E6C0

The pipe-slot evaluator probes a single RRT against an N×N all-pairs distance matrix produced upstream by sub_98BEE0 — an SSE2-unrolled Floyd-Warshall over the dependence graph. sub_12D0EA0 initialises the matrix, filling every cell with the sentinel 0x7FFFFFFF before the relaxation loops run. The trampoline sub_990C20 wires the evaluator into the driver, and the thunk sub_987EE0 adapts the call site to the std::function<bool(int)> shape.

bool sub_98E6C0(const Schedule *S, const Op *op, uint32_t t, uint32_t slot_cap) {
    // Single-RRT probe: does the candidate's footprint fit in the current pipe slot?
    if (!rrt_probe(&S->rrt_pipe, &op->footprint, t)) {
        return false;
    }

    // All-pairs structural distance gate: every predecessor must reach `op` within slot_cap.
    const int32_t *D = S->dist_matrix;           // n_ops x n_ops i32, row-major
    uint32_t n = S->n_ops;
    for (uint32_t p = 0; p < n; ++p) {
        if (!is_predecessor(S, p, op)) continue;
        int32_t d = D[p * n + op->index];
        if (d == 0x7FFFFFFF) continue;           // no path between these ops — vacuous
        if ((uint32_t)d > slot_cap) return false;
    }

    return true;
}

All-Pairs Distance Matrix sub_98BEE0

The matrix is n_ops × n_ops i32 cells stored row-major, allocated by sub_44A8C20(4 * n_ops * n_ops). The inner kernel processes four cells at a time with SSE2 unrolling. The surrounding outer-k / middle-i / inner-j triple loop is the canonical Floyd-Warshall shape — recognisable in the disassembly by the three-deep nest with a cmp + cmov (or pminsd after vectorisation) sequence on each inner iteration. The initialiser sub_12D0EA0 is a memset-shaped fill that writes the 0x7FFFFFFF infinity sentinel into every cell before the relaxation loops run; the loop then relaxes only edges that exist in the dependence graph to finite distances.

void sub_98BEE0(int32_t *D, const Graph *g, uint32_t ii) {
    uint32_t n = g->n_nodes;
    sub_12D0EA0(D, /*value=*/0x7FFFFFFF, /*n_cells=*/n * n);

    for (Edge e : g->edges) {
        int32_t d = e.latency - (int32_t)(ii * e.iter_distance);
        D[e.src * n + e.dst] = d;
    }

    // Floyd-Warshall with SSE2-unrolled inner loop (4 cells per iteration).
    for (uint32_t k = 0; k < n; ++k) {
        for (uint32_t i = 0; i < n; ++i) {
            int32_t dik = D[i * n + k];
            if (dik == 0x7FFFFFFF) continue;     // skip rows with no path through k
            uint32_t j = 0;
            for (; j + 4u <= n; j += 4u) {
                // SSE2 block: load 4 D[k*n+j..j+3], add dik, pminsd against D[i*n+j..j+3].
                __m128i dkj = _mm_loadu_si128((const __m128i *)&D[k * n + j]);
                __m128i dij = _mm_loadu_si128((const __m128i *)&D[i * n + j]);
                __m128i thru = _mm_add_epi32(dkj, _mm_set1_epi32(dik));
                _mm_storeu_si128((__m128i *)&D[i * n + j], _mm_min_epi32(dij, thru));
            }
            for (; j < n; ++j) {                 // scalar tail
                int32_t thru = D[k * n + j] + dik;
                if (thru < D[i * n + j]) D[i * n + j] = thru;
            }
        }
    }
}

The unrolled block uses pminsd (SSE4.1) where available and falls back to a scalar cmp + cmov pair on the SSE2-only path; the binary carries both code paths under a CPU-feature dispatch handled higher in the resource builder. The infinity sentinel survives the relaxation untouched for unreachable pairs because adding any finite dik to 0x7FFFFFFF overflows, and the dik != 0x7FFFFFFF guard at the top of the middle loop masks it off.

Why Two Evaluators

The two evaluators answer different questions and feed different terms of the lexicographic cost. Bank-pressure ranks placements by actual hardware-resource occupancy under the current II — the signal that stops the scheduler from overcommitting TMEM banks or named-barrier slots when several candidates are otherwise tied. Pipe-slot ranks by structural distance between the candidate op and its data dependencies, using the all-pairs matrix as a constant-time legality oracle — the signal that rejects placements which would force a dependence edge to span more than the available pipeline depth. The cost-based scheduler runs both and combines them lexicographically: pipe-slot is the hard legality gate, bank-pressure is the preference between the candidates that survive the gate.

Worked Scoring Example

The clearest way to see how the lexicographic cost vector ranks candidate schedules is to walk two concrete placements of the four-op loop body from Blackwell Pipeline 15-Slot Model — Worked Example. Both candidates seat the same four ops; both target II = 8; they differ only in whether the SMEM write seats at the same cycle as the TMA load or one cycle later. The cost vector has four lexicographic components, ordered from hardest to softest:

PositionComponentSource
1resource feasibilityRRT row-OR test, capacity-pool caps
2pipe-slot legalitystructural distance matrix, pre-deps inside the slot
3bank-pressure pressureSMEM bank-conflict count
4structural distancedependence-shape preference vs original order

Candidate A seats every op at II = 8 with a single-stage pipeline:

op             stage  order  cycle  slots claimed at cycle 0..7
tiled_tma_load  0      0      0     tma + tp_smem_wr
smem_write      0      1      0     tp_smem_wr             ← collision
wgmma           0      2      0     tc_and_mma + tp_mma
smem_read       0      3      0     tp_smem_rd

The RRT probe at cycle 0 finds tp_smem_wr already claimed by tiled_tma_load. Component 1 fails: cost vector is (∞, *, *, *). The candidate is rejected before any later component matters.

Candidate B spreads the SMEM write into stage 1 by seating it at order 1, cycle 0 of the next iteration's modulo window:

op             stage  order  cycle  slots claimed
tiled_tma_load  0      0      0     tma + tp_smem_wr   [cycles 0..7]
smem_write      1      0      8     tp_smem_wr         [cycles 0..6 of next iter, modulo 8]
wgmma           0      1      0     tc_and_mma + tp_mma [cycles 0..7]
smem_read       0      2      0     tp_smem_rd          [cycles 0..6]

Component 1 passes — every slot has at most one claimant per modulo cycle. The cost reducer moves to component 2.

Component 2 walks the all-pairs distance matrix produced by sub_98BEE0. The tiled_tma_load → smem_read edge has latency 8 and iteration distance 0; the distance matrix reads D[load, read] = 8. The pipe-slot threshold for tp_smem_rd at II = 8 is also 8, so the gate passes with zero slack. Cost contribution from this component is 0 — equal to the threshold means no preference penalty.

Component 3 counts SMEM bank pressure. The bank-pressure evaluator sub_98C440 reads rrt0 and rrt1 at the modulo cycle, sums them, and compares against pool caps 4 (TMEM) and 3 (named-barrier). For Candidate B the in-iteration occupancy is 2 (load + read on different rows of the SMEM bank) and the cross-iteration carry is 1 (the SMEM write spilling from iteration n−1). The sum 2 + 1 = 3 is below the TMEM cap and equal to the named-barrier cap; the gate passes with zero slack. Contribution to the cost vector is the raw sum 3.

Component 4 computes structural distance from the original program order. The original order is (load, write, mma, read) and Candidate B emits (load, mma, read, write) after the modulo wrap — the SMEM write moved past two later ops. The distance penalty is the Kendall-tau metric 2, the number of inversions.

Candidate B's full cost vector is therefore (0, 0, 3, 2). Compare against any alternative that pulls the SMEM write back into stage 0 by raising II to 9: that alternative would have cost vector (0, 0, 2, 0) on its own resources but pays a +1 in the outer II search; the outer driver penalises larger II directly and rejects it before this inner cost reducer ever runs. Among candidates that share the same outer II, Candidate B wins because every alternative either fails component 1 (like Candidate A) or accumulates a larger component-3 or component-4 cost.

The lexicographic comparison is strict: a candidate that improves component 4 at the price of component 3 always loses. This is what keeps the cost-based generator deterministic — the order in which the components rank is fixed at the binary level, and the cost reducer never sums or normalises across components.

Sentinel 0x7FFFFFFF

The constant 0x7FFFFFFF plays two distinct roles inside the scheduler, and both stay correct because the value never appears as a real cost in either context. Inside the distance matrix it means "no path between these two ops" and survives the Floyd-Warshall relaxation; elsewhere in the scheduler the same value marks snapshot-dead retry-arm entries. Reimplementations must keep the two roles separate — some downstream cost combiners would otherwise treat the snapshot-dead value as a finite cost and mis-rank candidates.

Usage and Contract

Schedule::solve runs inside MaterializeSchedule once per producer-consumer candidate pair. It consumes the cached ScheduleAnalysis produced by TileASGenerateSchedule, the constraint-attribute DSU at Schedule+0x70 seeded by the parser, and the per-pair raw-value handle plus consumer pointer the materializer walker discovered. It produces the per-op stage and order fields on the schedule record and the Pipe_ and Mutex_ SSA values emitted by the alias materialiser sub_8E4F10. Callers must invoke the cost-based fallback only after the cheap placement driver sub_981D50 has declined — inverting that order makes every schedule pay the cost-evaluator price even when the cheap path would have succeeded.

The dual-RRT cost evaluators live inside the cost-based fallback. They consume the candidate op, the cycle index t, the current II, the per-op footprint at op+96, the global RRT pair rrt0 / rrt1 from the schedule state, the all-pairs distance matrix built by sub_98BEE0, and the per-pool caps 4 and 3 at indices 1 and 6 of the 9-element pool-capacity vector. Their output is a smallest-feasible cost threshold per candidate that the cost reducer then ranks lexicographically. The evaluators never mutate the RRT — they only probe.

Cross-References

Modulo Scheduler and Rau-Style Placement documents the surrounding modulo scheduler and the placement-arm sequence that invokes these evaluators. Resource Constraint Builder and RRT documents the RRT bit-counting primitives consumed by the bank-pressure evaluator. Blackwell Pipeline 15-Slot Model — Pool Capacity Vector documents the 9-element pool-capacity vector and the TMEM / named-barrier slots referenced by the per-pool caps 4 and 3. Pipe_ and Mutex_ Value-Header Layout documents the 808-byte header that the alias materialiser writes into the IR.