Blackwell Pipeline 15-Slot Model
Abstract
The Blackwell pipeline model is the target-machine resource vocabulary that drives Tileiras scheduling. It maps scheduled operations onto issue, transport, tensor-memory, shared-memory, and MMA resource slots. Those slots become row bits in the modulo scheduler's Resource Reservation Table, so a candidate operation can be seated only when its footprint does not overlap resources already claimed in the same modulo cycle.
The model defines 24 slot identifiers. Eight are coarse families used for classification and grouping. Fifteen are fine slots that feed the optimized scheduler's fine-grained pressure model; four of those (unknown, omitted_simt, test_simt, test_mma) are catch-all and self-test rows. The one remaining identifier, test_dma, is a test-only row.
Slot Identifiers
Slot identifiers are one-based. The RRT row bit for a slot is slot_id - 1.
| Slot | Name | Kind | Role |
|---|---|---|---|
| 1 | issue | coarse | generic issue family |
| 2 | xu | coarse | transcendental or special-function unit family |
| 3 | xu64 | coarse | 64-bit special-function variant |
| 4 | fp32x2_fp16ultra | fine | paired FP32x2 and packed FP16 issue |
| 5 | alu | coarse | scalar ALU family |
| 6 | alu_or_fmaheavy | fine | ALU or FMA-heavy issue group |
| 7 | dual_alu | fine | dual-ALU datapath |
| 8 | lsu | coarse | load/store family |
| 9 | tmem | coarse | tensor-memory family |
| 10 | mma | coarse | MMA family |
| 11 | tc_and_mma | fine | TensorCore and legacy MMA issue stage |
| 12 | tma | coarse | tensor memory accelerator family |
| 13 | tp_gnic_rd | fine | generic interconnect transport read |
| 14 | tp_gnic_wr | fine | generic interconnect transport write |
| 15 | tp_smem_rd | fine | shared-memory transport read |
| 16 | tp_smem_wr | fine | shared-memory transport write |
| 17 | tp_tmem_rd | fine | tensor-memory transport read |
| 18 | tp_tmem_wr | fine | tensor-memory transport write |
| 19 | tp_mma | fine | MMA transport |
| 20 | unknown | fine | unclassified fallback |
| 21 | omitted_simt | fine | deliberately omitted SIMT operation |
| 22 | test_simt | fine | scheduler self-test SIMT row |
| 23 | test_mma | fine | scheduler self-test MMA row |
| 24 | test_dma | test | scheduler self-test DMA row |
The fifteen primary fine slots are:
- fp32x2_fp16ultra
- alu_or_fmaheavy
- dual_alu
- tc_and_mma
- tp_gnic_rd
- tp_gnic_wr
- tp_smem_rd
- tp_smem_wr
- tp_tmem_rd
- tp_tmem_wr
- tp_mma
- unknown
- omitted_simt
- test_simt
- test_mma
Tensor-memory read and write slots are the clearest Blackwell markers — tcgen05-style tensor-memory load, store, copy, and MMA paths all hinge on them.
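The slot table above can be written down as a C enumeration with one mask per kind. This is a minimal sketch: the `SLOT_*` names, `slot_bit`, `coarse_mask`, `fine_mask`, and `popcount64` are illustrative and do not come from the binary. Only the numbering and the coarse/fine/test split are taken from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Slot identifiers as listed in the table above (one-based). */
enum Slot {
    SLOT_ISSUE = 1, SLOT_XU, SLOT_XU64, SLOT_FP32X2_FP16ULTRA,
    SLOT_ALU, SLOT_ALU_OR_FMAHEAVY, SLOT_DUAL_ALU, SLOT_LSU,
    SLOT_TMEM, SLOT_MMA, SLOT_TC_AND_MMA, SLOT_TMA,
    SLOT_TP_GNIC_RD, SLOT_TP_GNIC_WR, SLOT_TP_SMEM_RD, SLOT_TP_SMEM_WR,
    SLOT_TP_TMEM_RD, SLOT_TP_TMEM_WR, SLOT_TP_MMA, SLOT_UNKNOWN,
    SLOT_OMITTED_SIMT, SLOT_TEST_SIMT, SLOT_TEST_MMA, SLOT_TEST_DMA,
    SLOT_COUNT = SLOT_TEST_DMA
};

/* RRT row bit for a slot: the bit index is slot_id - 1. */
uint64_t slot_bit(uint32_t slot_id) {
    assert(slot_id >= 1 && slot_id <= SLOT_COUNT);
    return UINT64_C(1) << (slot_id - 1);
}

/* Mask of the eight coarse family slots. */
uint64_t coarse_mask(void) {
    return slot_bit(SLOT_ISSUE) | slot_bit(SLOT_XU) | slot_bit(SLOT_XU64) |
           slot_bit(SLOT_ALU) | slot_bit(SLOT_LSU) | slot_bit(SLOT_TMEM) |
           slot_bit(SLOT_MMA) | slot_bit(SLOT_TMA);
}

/* Mask of the fine slots: everything except the coarse slots and test_dma. */
uint64_t fine_mask(void) {
    uint64_t all = (UINT64_C(1) << SLOT_COUNT) - 1;
    return all & ~coarse_mask() & ~slot_bit(SLOT_TEST_DMA);
}

/* Portable population count (clears the lowest set bit each step). */
unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    for (; x != 0; x &= x - 1) ++n;
    return n;
}
```

Counting the bits in `coarse_mask()` and `fine_mask()` is a quick way to check a transcription of the table against the 8 / 15 / 1 split.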
Operation Footprints
Each scheduled operation has:
- a slot identifier or slot group;
- a duration in cycles;
- a row footprint describing which resources it occupies at each cycle;
- optional capacity pressure in a pool that cannot be represented by one bit.
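#include <stdint.h>

/* Resource footprint of one scheduled operation.
 * MAX_DURATION and MAX_POOLS are assumed to be defined elsewhere. */
typedef struct ScheduledResourceUse {
    uint32_t slot_id;                 /* one-based slot identifier */
    uint32_t duration;                /* cycles the footprint spans */
    uint64_t rows[MAX_DURATION];      /* row bits occupied at each cycle */
    uint32_t pool_counts[MAX_POOLS];  /* capacity-pool demand; 0 = unused */
} ScheduledResourceUse;

/* Slot ids are one-based, so slot 1 maps to bit 0. */
uint64_t resource_row_bit(uint32_t slot_id) {
    return 1ull << (slot_id - 1);
}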
The RRT probes the footprint. Dependency and cost calculation read the latency. The two concepts are related but distinct: a long-latency value can occupy an issue slot for only a moment, while a transport operation can hold transport resources across several cycles.
Latency Families
The model groups operations into latency families that match the scheduler's rough machine model.
| Family | Typical latency | Typical slots |
|---|---|---|
| dual ALU | 2 cycles | dual_alu |
| ordinary ALU or FMA-heavy | 4 cycles | alu_or_fmaheavy, fp32x2_fp16ultra |
| transport (shared-memory, tensor-memory, interconnect, MMA) | 7 cycles | tp_smem_*, tp_tmem_*, tp_gnic_*, tp_mma |
| TensorCore or MMA issue | 8 cycles | tc_and_mma |
| far memory or cross-cluster anchors | thousands of cycles | modeled as scheduling anchors, not ordinary row duration |
Treat these values as scheduling weights — not a complete microarchitectural latency table, and never source-language semantics.
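The family table can be read as a slot-to-weight lookup. The sketch below assumes a hypothetical helper `family_latency` keyed by the one-based slot id; it returns 0 for slots the table does not cover, which is a convention chosen here and not something the source describes.

```c
#include <assert.h>
#include <stdint.h>

/* Typical scheduling weight, in cycles, for a slot.
 * Returns 0 for slots the latency-family table does not cover. */
uint32_t family_latency(uint32_t slot_id) {
    assert(slot_id >= 1 && slot_id <= 24);
    switch (slot_id) {
    case 7:                                 /* dual_alu */
        return 2;
    case 4: case 6:                         /* fp32x2_fp16ultra, alu_or_fmaheavy */
        return 4;
    case 13: case 14: case 15: case 16:     /* tp_gnic_*, tp_smem_* */
    case 17: case 18: case 19:              /* tp_tmem_*, tp_mma */
        return 7;
    case 11:                                /* tc_and_mma */
        return 8;
    default:                                /* coarse, fallback, and test slots */
        return 0;
    }
}
```

Far-memory and cross-cluster anchors are left out on purpose, since the table models them as scheduling anchors and not as a row duration.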
Capacity Pools
Some resources are modeled by capacities in addition to row bits. The important capacity pools are:
| Pool | Meaning |
|---|---|
| ALU-or-FMA-heavy issue width | limits how many FMA-heavy operations can be admitted in one cycle |
| dual-ALU issue width | limits dual-ALU pressure |
| shared-memory byte budget | constrains shared-memory allocation and spill pressure |
| tensor-memory budget | constrains tensor-memory-backed operations |
| register-bank pairing | models paired register-bank pressure |
| transport singleton slots | keep shared, tensor, and interconnect transports from overbooking |
A debug configuration treats shared memory as effectively unbounded. It isolates whether a scheduling failure comes from shared-memory pressure or from a different resource.
The capacity-pool probe mirrors the row-bit check: count current usage at the modulo cycle, add the candidate's requested count, compare against the cap. Pools with cap 1 behave as singleton resources — the second op claiming the same pool in the same cycle is rejected outright.
bool capacity_pools_allow(const ResourceTable *table,
                          const ScheduledResourceUse *use,
                          uint32_t t) {
    for (uint32_t k = 0; k < use->duration; ++k) {
        uint32_t row = (t + k) % table->ii;
        for (uint32_t p = 0; p < table->n_pools; ++p) {
            uint32_t requested = use->pool_counts[p];
            if (requested == 0) continue;
            if (table->pool_usage[row][p] + requested > table->pool_caps[p]) {
                return false;
            }
        }
    }
    return true;
}
Rau Cost Weight Tables
Six lookup tables anchor the cost model. The placement driver sub_981D50 and its four arms consult them before reserving a slot in the global RRT. The tables do not live in a single rodata blob; they are split across the forward-walk seeder sub_12C8DF0, the backward-walk variant sub_12CBDD0, the dispatcher sub_12CEBF0, and the slot-id hashmap builder sub_12CF910. Three tables hold per-op latencies, one holds per-slot cycle anchors, and two hold capacities: the per-pool capacity vector and the Tier-2 global capacity struct. The placement arms read these tables through stable offsets into the 444-byte SchedulerResourcePool and through XMM-word loads from rodata at 0x4CC9980..0x4CC9D70.
Per-Op Latency Table
A contiguous array with a 12-byte stride, seeded by sub_12C8DF0 starting at offset 0 of the resource pool, holds the per-op latency table. Each entry is laid out as {u32 latency; u32 op_tag; u32 reserved}. The table below lists 23 entries within offsets 0..395, a span of 33 strides; it shows no entries between +96 and +204. Each entry pairs a Blackwell op tag (the dialect's internal opcode classifier) with the cycles the cost model charges for a single issue of that tag. The per-op latency assigner reads this table to fill in Op.latency for every candidate before the cost-sort runs.
| Offset | Latency | Op Tag | Slot | Family |
|---|---|---|---|---|
| +0 | 4 | 0x0B | 6 | FMA-heavy |
| +12 | 2 | 0x0C | 7 | dual ALU |
| +24 | 4 | 0x09 | 6 | FMA-heavy |
| +36 | 4 | 0x0A | 6 | FMA-heavy |
| +48 | 4 | 0x0D | 4 | paired FP32x2 / FP16 |
| +60 | 4 | 0x0E | 4 | paired FP32x2 / FP16 |
| +72 | 4 | 0x0F | 4 | paired FP32x2 / FP16 |
| +84 | 4 | 0x10 | 4 | paired FP32x2 / FP16 |
| +216 | 2 | 0x03 | 7 | dual ALU |
| +228 | 2 | 0x01 | 7 | dual ALU |
| +240 | 2 | 0x02 | 7 | dual ALU |
| +252 | 4 | 0x15 | 6 | FMA-heavy |
| +264 | 4 | 0x16 | 6 | FMA-heavy |
| +276 | 4 | 0x17 | 6 | FMA-heavy |
| +288 | 7 | 0x1C | 15 / 16 | SMEM transport |
| +300 | 7 | 0x1E | 17 | TMEM read transport |
| +312 | 7 | 0x1D | 16 | SMEM write transport |
| +324 | 7 | 0x1F | 18 | TMEM write transport |
| +336 | 8 | 0x18 | 11 | TC+MMA issue |
| +348 | 8 | 0x19 | 11 | TC+MMA issue |
| +360 | 7 | 0x20 | 19 | MMA transport |
| +372 | 7 | 0x21 | 13 | gnic read transport |
| +384 | 7 | 0x22 | 14 | gnic write transport |
High tag ids that the linear stride cannot reach (0x11..0x1B plus a handful of secondary tags) live in XMM-word continuations at rodata 0x4CC9980..0x4CC9A10 for the forward walk and 0x4CC9A20..0x4CC9AE0 for the backward walk. Each XMM word packs two {lat, op_tag} pairs into four i32 lanes; the backward-walk table mirrors the forward-walk encoding but carries reverse-dataflow weights consumed by sub_12CBDD0.
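The 12-byte entry layout described for the linear table can be sketched as a plain C struct. The field order follows the text; the names `LatencyEntry`, `kSampleEntries`, `entry_offset`, and `lookup_latency` are hypothetical, and the linear scan is an assumption about how a reader might search the table, not a description of what sub_12C8DF0 does. The XMM-word continuations are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 12-byte entry, in the field order given in the text. */
typedef struct LatencyEntry {
    uint32_t latency;
    uint32_t op_tag;
    uint32_t reserved;
} LatencyEntry;

static_assert(sizeof(LatencyEntry) == 12, "entries use a 12-byte stride");

/* First four rows of the table, for illustration only. */
const LatencyEntry kSampleEntries[] = {
    {4, 0x0B, 0}, {2, 0x0C, 0}, {4, 0x09, 0}, {4, 0x0A, 0},
};

/* Byte offset of entry `index` within the table. */
size_t entry_offset(size_t index) {
    return index * sizeof(LatencyEntry);
}

/* Linear scan by op tag; returns 0 when the tag is absent. */
uint32_t lookup_latency(const LatencyEntry *entries, size_t n,
                        uint32_t op_tag) {
    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (entries[i].op_tag == op_tag) return entries[i].latency;
    }
    return 0;
}
```

With this layout, `entry_offset(32)` is 384, the offset of the last row listed above, and every valid offset is a multiple of 12.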
Cycle Anchor Table
Rodata 0x4CC9D10..0x4CC9D70 holds the cycle anchor table — per-slot cycle weights that fix the stage-start cycle every candidate seat time must clear before its slot stays admissible. The slot-cycle-weight reader sub_12CBDD0 consults this table during the dispatcher pass and enforces two architectural ceilings: 5000 cycles for HBM3e bandwidth saturation, 7000 cycles for cross-cluster transfers. Both ceilings land as inline 3-word vectors in self[16] and self[20] of the resource pool. Secondary fence anchors at 1600 and 2000 cycles cover near-SM SMEM spill and intra-cluster fences.
| Rodata | Slot Range | Cycle Weights |
|---|---|---|
| 0x4CC9D10 | 1..4 (issue, xu, xu64, fp32x2_fp16ultra) | 100, 100, 110, 101 |
| 0x4CC9D20 | 5..8 (alu, alu_or_fmaheavy, dual_alu, lsu) | 102, 102, 103, 103 |
| 0x4CC9D30 | 9..12 (tmem, mma, tc_and_mma, tma) | 120, 104, 121, 104 |
| 0x4CC9D40 | 13..16 (gnic rd/wr, smem rd/wr) | 200, 400, 800, 900 |
| 0x4CC9D50 | 17..20 (tmem rd/wr, mma transport, unknown) | 1500, 2000, 2400, 3000 |
| 0x4CC9D60 | misc (test_* and omitted_simt scratch) | 50, 100, 200, 360 |
| 0x4CC9D70 | secondary anchors | 600, 800, 1000, 1200 |
The 5000-cycle HBM3e ceiling is the absolute round-trip budget the scheduler attributes to a worst-case far-memory dependence; the 7000-cycle ceiling is the same budget for TMA traffic that crosses the cluster boundary. Both serve as Big-M terms — every candidate's accumulated latency must stay below them or the placement is rejected outright before any RRT probe runs.
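The two ceilings act as a gate that runs before any RRT probe. A minimal sketch of that gate follows; the function name `within_ceiling` and the boolean `crosses_cluster` are illustrative, and the strict comparison reflects the wording "must stay below" above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    FAR_MEMORY_CEILING    = 5000,  /* HBM3e round-trip budget, in cycles */
    CROSS_CLUSTER_CEILING = 7000   /* cross-cluster TMA budget, in cycles */
};

static_assert(FAR_MEMORY_CEILING < CROSS_CLUSTER_CEILING,
              "the far-memory ceiling is the tighter of the two");

/* True when the accumulated latency stays strictly below its ceiling. */
bool within_ceiling(uint32_t accumulated_latency, bool crosses_cluster) {
    uint32_t ceiling = crosses_cluster ? CROSS_CLUSTER_CEILING
                                       : FAR_MEMORY_CEILING;
    return accumulated_latency < ceiling;
}
```

A candidate that fails this check would be rejected without touching the RRT, which keeps the cheap test ahead of the row-by-row probe.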
Pool Capacity Vector
A 9-element capacity vector {37, 4, 7, 37, 5, 1, 3, 6, 2} tells the per-cycle pressure summariser sub_12CEBF0 how much of each resource is available in a single modulo cycle. Pool construction installs the vector through nine explicit calls to the capacity rounder sub_12BB050.
| Pool | Capacity | Role |
|---|---|---|
| 0 | 37 | op-tag → latency table, first 37 distinct op tags |
| 1 | 4 | ALU-or-FMA-heavy issue cap |
| 2 | 7 | dual_alu slot fan-out |
| 3 | 37 | shadow of pool 0 for backward-walk |
| 4 | 5 | per-slot issue-width for coarse families |
| 5 | 1 | singleton global lock for SMEM capacity |
| 6 | 3 | dual-issue cap |
| 7 | 6 | alu_or_fmaheavy slot fan-out |
| 8 | 2 | register-bank pairing |
Caps of 1 on transport pools and the SMEM byte budget are what make the tp_smem_*, tp_tmem_*, tp_gnic_*, and tp_mma slots behave as singleton resources — the modulo scheduler rejects any second op claiming the same transport row in the same RRT cycle. The capacity rounder rounds each request up to the next power of two times four-thirds before allocation, so the rodata values are the live counts before rounding.
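The capacity vector and the singleton behaviour can be illustrated with a small check against the listed values. This sketch compares requests directly against the nine capacities and deliberately ignores the capacity rounder; `kPoolCaps`, `pool_allows`, and `pool_is_singleton` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { N_POOLS = 9 };

/* Per-cycle capacities, in the pool order of the table above. */
const uint32_t kPoolCaps[N_POOLS] = {37, 4, 7, 37, 5, 1, 3, 6, 2};

/* True when `requested` more units fit in `pool` on top of `used`. */
bool pool_allows(uint32_t pool, uint32_t used, uint32_t requested) {
    assert(pool < N_POOLS);
    return used + requested <= kPoolCaps[pool];
}

/* A pool with cap 1 admits at most one claimant per modulo cycle. */
bool pool_is_singleton(uint32_t pool) {
    assert(pool < N_POOLS);
    return kPoolCaps[pool] == 1;
}
```

For example, pool 1 (cap 4) accepts a fourth unit but not a fifth, while pool 5 (cap 1) rejects any second claimant in the same cycle.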
Tier-2 Global Capacity Struct
sub_12C8DF0 also writes a small struct into the same resource pool; it holds five hard inequalities that every committed schedule must respect. The struct lives at the start of the pool's secondary section and encodes per-tag caps as packed u64 words.
| Offset | Op Tag | Cap | Interpretation |
|---|---|---|---|
| ptr[0] | 2 | 262144 | TMEM / register-file byte budget |
| ptr[8] | 1 | 3 | max concurrent ALU issue per warp slot |
| ptr[16] | — | 232448 or INT_MAX | SMEM byte budget per SM |
| ptr[20] | 1 | 4 | max concurrent ALU-or-FMA-heavy issue |
| ptr[28] | 8 | 2048 | register-bank width across 8 banks |
The SMEM byte budget at ptr[16] is the 227 KiB Blackwell floor (232448 bytes). Setting TILE_AS_DEBUG_UNLIMITED_SMEM="1" toggles this cell to INT_MAX, letting the developer isolate whether a scheduling failure comes from SMEM pressure or from another resource. The check runs once at pass-init time inside sub_12C8DF0; later admission attempts read the rewritten cell directly.
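The one-time budget selection can be sketched as follows. The variable name and the two values come from the text above; the helper names `smem_budget_from` and `smem_budget`, and the assumption that only the exact string "1" enables the override, are illustrative.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

enum { SMEM_BUDGET_BYTES = 227 * 1024 };
static_assert(SMEM_BUDGET_BYTES == 232448, "227 KiB expressed in bytes");

/* Pick the SMEM byte budget from the value of the debug variable.
 * `value` is what getenv returned for it, or NULL when it is unset.
 * Assumption: only the exact string "1" lifts the limit. */
int smem_budget_from(const char *value) {
    if (value != NULL && strcmp(value, "1") == 0) return INT_MAX;
    return SMEM_BUDGET_BYTES;
}

/* Convenience wrapper that reads the process environment once. */
int smem_budget(void) {
    return smem_budget_from(getenv("TILE_AS_DEBUG_UNLIMITED_SMEM"));
}
```

Splitting the decision from the `getenv` call keeps the policy testable without changing the process environment.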
Cost Table Consumers
Three readers share the six tables, as listed below. The split keeps the per-op latency view, the per-slot cycle anchor view, and the per-class capacity view independently addressable from the placement arms and the cost reducer.
| Cost lookup table | Rodata / Offset | Consumer | Role |
|---|---|---|---|
| Per-op latency, 23 packed entries | SchedulerResourcePool +0..+395 | sub_12C8DF0 | per-op latency assigner; fills Op.latency |
| Forward-walk XMM continuations | 0x4CC9980..0x4CC9A10 | sub_12C8DF0 | high-tag latency lookups (tags 0x11..0x1B) |
| Backward-walk XMM continuations | 0x4CC9A20..0x4CC9AE0 | sub_12CBDD0 | reverse-dataflow latency view |
| Per-slot cycle anchor weights | 0x4CC9D10..0x4CC9D70 | sub_12CBDD0 | slot-cycle-weight reader; applies 5000/7000 ceilings |
| 9-element pool capacity vector | inline arguments to sub_12BB050 | sub_12CEBF0 | per-cycle pressure summariser |
| Tier-2 global capacity struct | SchedulerResourcePool +0..+28 | sub_12C8DF0 | installs hard inequalities (TMEM, ALU, SMEM, regbank) |
The cost reducer that drives CostBasedScheduleGenerator::generateOrRefineScheduleWithConstraint (sub_980290) reaches all three views through the same resource-pool pointer, so the lexicographic cost vector it produces — latency, slot pressure, structural distance — comes from a single coherent snapshot of the tables.
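A lexicographic comparison over the three components named above can be sketched like this. The `CostVector` struct, its unsigned field types, and the convention that smaller values rank first are all assumptions made for illustration; the source only names the components and their order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cost vector; the field order follows the text. */
typedef struct CostVector {
    uint32_t latency;
    uint32_t slot_pressure;
    uint32_t structural_distance;
} CostVector;

/* Returns <0, 0, or >0 when a ranks before, equal to, or after b.
 * Later fields only matter when all earlier fields tie. */
int cost_compare(const CostVector *a, const CostVector *b) {
    assert(a != NULL && b != NULL);
    if (a->latency != b->latency)
        return a->latency < b->latency ? -1 : 1;
    if (a->slot_pressure != b->slot_pressure)
        return a->slot_pressure < b->slot_pressure ? -1 : 1;
    if (a->structural_distance != b->structural_distance)
        return a->structural_distance < b->structural_distance ? -1 : 1;
    return 0;
}
```

Under this ordering a schedule with lower latency always wins, regardless of how it compares on slot pressure or structural distance.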
Axis and Buffer Inputs
Names alone do not classify operations. The scheduler consumes analysis facts:
- contiguity, divisibility, and constancy from axis analysis;
- scalar bounds and memory ranges for index expressions;
- buffer lifetime records for shared memory, tensor memory, and auxiliary scratch;
- leader groups and pipeline identifiers from buffer assignment;
- allocation sizes and live ranges from the layout and buffer passes.
Axis analysis decides whether a vector load, TMA coordinate, or pointer expression is aligned and compact enough for a particular resource class. Buffer lifetime decides whether two memory operations share a live resource and must be coupled or separated.
Worked Example: Four-Op Loop Body
The clearest way to read the slot model is to walk a loop body small enough to fit in one RRT and rich enough to touch the transport, MMA, and SMEM rows simultaneously. The body below is the steady-state shape of a software-pipelined matmul inner loop:
%0 = nv_tileas.async.tiled_tma_load %desc, %coord : !smem_ref
%1 = nv_tileas.async.smem_write %src : !smem_ref
%2 = nv_tileas.async.wgmma %a, %b, %c : !tmem_ref
%3 = nv_tileas.async.smem_read %0 : !reg
Each op's resource vector is the triple (slot_id, duration, occupancy) produced by the constraint builder. The classifier reads the op's MLIR opcode plus its operand types, picks the slot from the table at the top of this page, and reads the duration from the latency family.
| Op | Slot | Duration | Occupancy | Family |
|---|---|---|---|---|
| tiled_tma_load %0 | 12 (tma) + 16 (tp_smem_wr) | 8 cycles | 1 each | TMA + SMEM write transport |
| smem_write %1 | 16 (tp_smem_wr) | 7 cycles | 1 | SMEM write transport |
| wgmma %2 | 11 (tc_and_mma) + 19 (tp_mma) | 8 cycles | 1 each | MMA issue + transport |
| smem_read %3 | 15 (tp_smem_rd) | 7 cycles | 1 | SMEM read transport |
The TMA load claims two slots simultaneously: the descriptor stays parked on the tma row while the tensor payload flows through the SMEM write transport. (The wgmma is the other two-slot op, pairing tc_and_mma with tp_mma.) The cost reducer sees two row contributions for one op, which is why the per-op latency table charges both the 0x1C (SMEM transport, offset +288) and 0x1D (SMEM write transport, offset +312) variants for the same source-level operation.
Suppose the candidate II is 8. The scheduler probes the four ops in dataflow order and seats each at the earliest legal cycle. The resulting RRT — one 24-bit row per modulo cycle, drawn here only over the slots the example touches — is:
cycle  tc_and_mma  tma  tp_smem_rd  tp_smem_wr  tp_mma
0      .           X    .           X           .       ← tiled_tma_load occupies tma + smem_wr
1      .           X    .           X           .
2      .           X    .           X           .
3      .           X    .           X           .
4      .           X    .           X           .
5      .           X    .           X           .
6      .           X    .           X           .
7      .           X    .           X           .
// smem_write seats at cycle 0 of next iteration; in the modulo
// view it overlays the same RRT, claiming tp_smem_wr at cycles
// [0..6] mod 8. The probe fails — tp_smem_wr is already busy.
//
// The placement driver bumps smem_write forward; the only legal
// start is cycle 8 mod 8 = 0 of the iteration *after* the TMA
// tail drains, which the modulo scheduler models as a stage-1
// seat with order 0.
The example shows two things at once: (i) singleton transports (tp_smem_wr pool cap = 1) force the modulo scheduler to spread overlapping iterations across stages rather than packing them onto the same cycle, and (ii) the per-op latency table's split between 0x1C and 0x1D exists precisely so the TMA load and the loose SMEM write can be charged at different per-cycle weights — the TMA load's 8-cycle hold is what makes it the structural bottleneck, while the SMEM write's 7-cycle hold lets it slip into the gap one cycle later.
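The conflict in this example can be reproduced with a row-bit-only model of the RRT. The sketch below uses hypothetical helpers (`bit`, `probe`, `commit`, `legal_smem_write_starts`) and fixes II at 8; it models neither capacity pools nor stages, so it shows only why the same-window probe fails, not how the placement driver resolves it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { II = 8 };  /* candidate initiation interval from the example */

/* One row-bit mask per modulo cycle. */
typedef struct Rrt { uint64_t rows[II]; } Rrt;

/* Row bit for a one-based slot id. */
uint64_t bit(uint32_t slot_id) {
    assert(slot_id >= 1 && slot_id <= 24);
    return UINT64_C(1) << (slot_id - 1);
}

/* True when `mask` is free for `duration` cycles starting at cycle t. */
bool probe(const Rrt *rrt, uint64_t mask, uint32_t duration, uint32_t t) {
    for (uint32_t k = 0; k < duration; ++k)
        if (rrt->rows[(t + k) % II] & mask) return false;
    return true;
}

/* Claim `mask` for `duration` cycles starting at cycle t. */
void commit(Rrt *rrt, uint64_t mask, uint32_t duration, uint32_t t) {
    for (uint32_t k = 0; k < duration; ++k)
        rrt->rows[(t + k) % II] |= mask;
}

/* Seat the TMA load at cycle 0, then count the start cycles at which
 * the loose smem_write would still pass a row-bit probe. */
uint32_t legal_smem_write_starts(void) {
    Rrt rrt = {{0}};
    commit(&rrt, bit(12) | bit(16), 8, 0);   /* tma + tp_smem_wr, 8 cycles */
    uint32_t legal = 0;
    for (uint32_t t = 0; t < II; ++t)
        if (probe(&rrt, bit(16), 7, t)) ++legal;  /* tp_smem_wr, 7 cycles */
    return legal;
}
```

In this simplified model the count is zero, because the TMA load holds tp_smem_wr in all eight rows. The smem_read and wgmma footprints still pass their probes at cycle 0, since they touch rows the TMA load leaves free.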
The cost reducer ranks this schedule against any alternative by reading the per-slot cycle weights from rodata 0x4CC9D40 for slots 13..16: 200 for gnic-rd, 400 for gnic-wr, 800 for smem-rd, 900 for smem-wr. A schedule that doubled-up on tp_smem_wr would multiply that 900 by the second user's surcharge; a schedule that kept the SMEM transports balanced pays the base weight once and clears the gate.
Admission Rule
An operation is legal at cycle t when every occupied row is conflict-free.
bool resource_admit(ResourceTable *table,
                    const ScheduledResourceUse *use,
                    uint32_t t) {
    for (uint32_t k = 0; k < use->duration; ++k) {
        uint32_t row = (t + k) % table->ii;
        if ((table->rows[row] & use->rows[k]) != 0) {
            return false;
        }
    }
    return capacity_pools_allow(table, use, t);
}
Commit is the same loop with OR assignment after all probes pass.
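A probe-then-commit pair for the row bits might look like the sketch below. The struct definitions are cut down to what the loop needs, `MAX_DURATION` and `MAX_II` are illustrative bounds, and capacity-pool bookkeeping is omitted; `rows_admit` and `resource_commit` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MAX_DURATION = 16, MAX_II = 64 };  /* illustrative bounds */

typedef struct ResourceTable {
    uint32_t ii;               /* initiation interval, 1..MAX_II */
    uint64_t rows[MAX_II];     /* claimed row bits per modulo cycle */
} ResourceTable;

typedef struct ScheduledResourceUse {
    uint32_t duration;             /* cycles the footprint spans */
    uint64_t rows[MAX_DURATION];   /* row bits occupied at each cycle */
} ScheduledResourceUse;

/* Probe every occupied row; claim nothing. */
bool rows_admit(const ResourceTable *table,
                const ScheduledResourceUse *use, uint32_t t) {
    for (uint32_t k = 0; k < use->duration; ++k)
        if (table->rows[(t + k) % table->ii] & use->rows[k]) return false;
    return true;
}

/* Probe first, then OR the footprint in.
 * The table is left untouched when the probe fails. */
bool resource_commit(ResourceTable *table,
                     const ScheduledResourceUse *use, uint32_t t) {
    assert(table->ii >= 1 && table->ii <= MAX_II);
    assert(use->duration <= MAX_DURATION);
    if (!rows_admit(table, use, t)) return false;
    for (uint32_t k = 0; k < use->duration; ++k)
        table->rows[(t + k) % table->ii] |= use->rows[k];
    return true;
}
```

Running all probes before the first OR means a rejected candidate never leaves a partial footprint behind, so no rollback path is needed.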
Cross-References
- Resource Constraint Builder and RRT consumes the slot identifiers documented here as row bits in its qword footprint stack.
- Modulo Scheduler and Rau drives the RRT probe and commit against these slots.
- Modulo Driver and 4-Arm OR-Chain consults the per-op latency table and the 9-element pool-capacity vector during cost ranking.
- Schedule Solve and Cost Evaluators reads the per-pool caps 4 (TMEM) and 3 (named-barrier) from indices 1 and 6 of the pool-capacity vector.
- Performance and Cost Model walks the roofline calculation that turns these slot costs into a stage count.