Modulo Scheduler and Rau-Style Placement

Abstract

Tileiras software-pipelines TileAS loops with a Rau-style modulo scheduler. The scheduler searches for an initiation interval, places operations into cycles modulo that interval, and respects both data dependences and target-machine resource capacity. This is the throughput-critical scheduler for loops that use Blackwell tensor memory, shared memory, WGMMA, TMA, named barriers, and async-copy queues.

The output of this pass is schedule analysis: stage, order, resource placement, and depth information. A later materialization pass consumes that analysis to emit Pipe_ and Mutex_ IR.

Scheduling Model

A software-pipelined loop overlaps multiple logical iterations. If the initiation interval is II, the steady-state loop starts one new iteration every II cycles. Smaller II means higher throughput, but only if every recurrence and resource constraint can be satisfied.

Tileiras starts from the usual lower bound:

II >= max(resource_mii, recurrence_mii, fine_density_mii, dependency_mii)

Bound	Meaning
resource MII	lower bound from total resource demand and per-cycle capacity
recurrence MII	lower bound from dependence cycles crossing loop iterations
fine-density MII	lower bound from resource groups with fractional or pooled capacity
dependency MII	lower bound from known longest dependence depth

Resource Reservation Table

The scheduler represents resource occupancy with a Resource Reservation Table. The global table has one row per cycle modulo II; each row is a bitset of resources. Each operation has a footprint table with one row per occupied cycle.

bool rrt_probe(const RRT *global, const NodeRRT *node, uint32_t start) {
    for (uint32_t k = 0; k < node->duration; ++k) {
        uint32_t row = (start + k) % global->ii;
        if ((global->rows[row] & node->rows[k]) != 0) {
            return false;
        }
    }

    return true;
}

void rrt_commit(RRT *global, const NodeRRT *node, uint32_t start) {
    for (uint32_t k = 0; k < node->duration; ++k) {
        uint32_t row = (start + k) % global->ii;
        global->rows[row] |= node->rows[k];
    }
}

Probe-before-commit is mandatory. Committing a partially probed footprint can make retry behavior nondeterministic.

Candidate-II Search

The scheduler tries the lower bound first, then increases or searches until it finds a feasible interval or reaches the configured cap.

bool schedule_loop(Schedule *sched, RauBounds bounds) {
    uint32_t lower = max4(bounds.resource_mii,
                         bounds.recurrence_mii,
                         bounds.fine_density_mii,
                         bounds.dependency_mii);

    for (sched->ii = lower; sched->ii <= bounds.ii_cap; ++sched->ii) {
        clear_resource_table(&sched->global_rrt, sched->ii);

        if (place_all_groups_at_current_ii(sched)) {
            sched->stage_count = compute_stage_count(sched);
            return true;
        }
    }

    sched->failed = true;
    return false;
}

Some subproblems use galloping plus binary search rather than a linear increment. Both forms obey the same semantic contract: find the smallest feasible value according to the probe.

Placement Arms

For a fixed II, Tileiras tries a deterministic sequence of placement arms:

Arm	Role
permute	cheap priority order and earliest legal seat
fuse	merge compatible groups to improve packing
retry	use a snapshot overlay to skip or reconsider failed operations
cost-based	choose the lowest-cost legal placement with RRT-backed scoring

The first successful arm wins for the current group. If the full group pass fails, the driver clears temporary state and reruns heavier retry/fallback arms.

bool place_group(Schedule *sched, Group group, RetrySnapshot *snapshot) {
    if (try_permute(sched, group)) {
        return true;
    }
    if (try_fuse(sched, group)) {
        return true;
    }
    if (try_retry(sched, group, snapshot)) {
        return true;
    }
    return try_cost_based(sched, group, snapshot);
}

Permute Arm

The permute arm sorts the current group by a predecessor-priority map and seats each operation at the earliest legal cycle.

bool try_permute(Schedule *sched, Group group) {
    PriorityMap priority = build_predecessor_priority(group);
    stable_sort(group.nodes, [&](Node *a, Node *b) {
        return priority.less(a, b);
    });

    for (Node *node : group.nodes) {
        if (!seat_earliest_legal_cycle(sched, node)) {
            return false;
        }
    }

    return true;
}

Fuse Arm

The fuse arm merges compatible scheduling groups. It must reject fusions that violate dependence order or MLIR nesting relationships.

bool groups_can_fuse(OperationSet a, OperationSet b) {
    for (Operation *left : a) {
        for (Operation *right : b) {
            if (has_required_order(left, right)) {
                return false;
            }
            if (is_proper_ancestor(left, right) || is_proper_ancestor(right, left)) {
                return false;
            }
        }
    }

    return true;
}

Stable sorting is part of the contract. It keeps tied fusions deterministic across builds and platforms.

Retry and Snapshot Overlay

Retry uses a snapshot map instead of mutating the canonical group state. A failed attempt can mark a candidate dead in the overlay, and later strategies can clear or rebuild the overlay without corrupting the schedule's operation list.

bool try_retry(Schedule *sched, Group group, RetrySnapshot *snapshot) {
    for (Node *node : group.nodes) {
        if (snapshot_is_dead(snapshot, node)) {
            continue;
        }

        if (!seat_earliest_legal_cycle(sched, node)) {
            snapshot_mark_dead(snapshot, node);
            return false;
        }
    }

    return true;
}

Cost-Based Arm

The cost-based arm ranks legal placements instead of accepting the first legal seat.

bool try_cost_based(Schedule *sched, Group group, RetrySnapshot *snapshot) {
    for (Node *node : group.nodes) {
        Placement best = find_lowest_cost_legal_placement(sched, node, snapshot);
        if (!best.valid) {
            snapshot_mark_dead(snapshot, node);
            return false;
        }

        rrt_commit(&sched->global_rrt, &node->footprint, best.start);
        node->start = best.start;
    }

    return true;
}

Cost ranking is lexicographic. Resource legality is a hard gate; pressure and structural distance rank candidates that already pass.

Prologue and Stage Count

After successful placement, the scheduler computes the latest occupied cycle. This determines the number of overlapped stages in the steady-state loop.

uint32_t compute_stage_count(const Schedule *sched) {
    uint32_t end = 0;

    for (Node *node : sched->nodes) {
        if (node->start < 0) {
            return UINT32_MAX;
        }

        uint32_t finish = (uint32_t)node->start + node->latency;
        end = max(end, finish);
    }

    return ceil_div(end + 1, sched->ii);
}

Diagnostics

When resource placement fails, useful diagnostics should include:

candidate II;
operation duration and requested slot footprint;
global RRT rows at the conflicting modulo cycles;
dependency window for the operation;
active capacity pool that rejected the candidate, if any;
group and stage/order information for the failed operation.

These diagnostics let users distinguish an impossible loop body from a heuristic failure.

Cross-References

Resource Constraint Builder and RRT documents footprint construction and feasible-II probing. Blackwell Pipeline 15-Slot Model documents the resource slots. Schedule Solve and Cost Evaluators explains the materialization boundary.

Keyboard shortcuts

Tileiras Internals