TPU Scheduling Pipeline
Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id
89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
TPU code is scheduled twice, at two different IR levels, by two different algorithms, and the second one depends on a separate resource-assignment pass that has no analogue in a CPU or GPU backend. This page is the map of that stack. The first scheduler is the HLO-level LatencyHidingScheduler (LHS) — a greedy, bottom-up, critical-path list scheduler that reorders an already-memory-minimized HLO sequence so long-latency async work (collective starts, host/ICI DMA, async copies) is issued early and hidden under independent compute. The second is the LLO-level bundle packer — a forward greedy "earliest-legal-bundle" list scheduler that assigns each low-level op to a slot inside a fixed-width VLIW bundle, plus a separate modulo scheduler that software-pipelines inner hardware loops. Between the two sits the MXU-sequence assignment and MRB (matrix-result-buffer) allocation pass, which decides matmul accumulation chains and result-FIFO placement that the bundle packer's latch ordering then depends on. Pricing every one of these decisions is the ResourceType model and AsyncTracker from Part VII's cost model.
The reader who knows LLVM should hold the analogy stage by stage. The LHS is a classic MachineScheduler-style ready-set list scheduler, but it builds the sequence in reverse (a current_time clock that walks back from the roots) and its priority is dominated by async critical-path depth/height rather than plain dependency height. The bundle packer is the TPU's equivalent of LLVM's VLIWPacketizer (DFAPacketizer), except the resource model is a per-generation BundleRequirement bitmap rather than a deterministic finite automaton, and there is no spill-to-next-bundle search — when a slot is full, a new empty bundle is appended and the op retries. The modulo scheduler is textbook iterative modulo scheduling with RecMII/ResMII initiation-interval search, riding on LLVM's ScheduleDAGMI over MachineInstr. The MXU/MRB pass has no LLVM analogue at all: it is a domain-specific bin-packer over the systolic array's accumulation buffers.
This is an orientation page. It frames the four stages, fixes the stage ordering, and links each sub-page that documents the detail. It does not duplicate the algorithms — the LHS comparator's 22 keys, the per-gen bundle slot matrix, the MRB reservation timeline, the modulo II search — each lives on its own page, cross-referenced below.
For reimplementation, the stack contract is:
- The two schedulers operate on disjoint IRs and never share state. The LHS reorders HloInstructions and writes a module->schedule; the bundle packer reorders/packs LloInstructions (or MachineInstrs) into Bundle records. The HLO scheduling annotations do not flow into bundle packing; they only steer async-overlap of HLO computations.
- The LHS requires a base schedule as a precondition. It does not build a schedule from scratch: HloMemorySchedulerWithBrkgaFallback lays down a memory-pressure-minimizing order first, and the LHS replaces it with one that overlaps async edges. The four LHS pages are variants of one shared body; they differ only in pipeline placement and SchedulerConfig/AsyncTracker inputs.
- MXU-sequence assignment runs between the schedulers and feeds the packer. AssignMxusForSequenceGroup decides matmul accumulation chains, latch placement, and MRB FIFO entries; the bundle packer's per-op latch ordering and the per-gen encoder's latch serialization read that assignment downstream.
- Every scheduling decision is priced by the bundle cost model. Node cost (MaxResourceCycles over a 23-slot bundle vector) and dependency latency (LatencyBetween) feed the LHS priority; the ResourceType/AsyncTracker model gates which async ops may overlap by physical resource.
| Item | Symbols / detail |
|---|---|
| Stage 1 — HLO scheduler | xla::LatencyHidingScheduler::RunImpl @ 0x136321a0; per-comp drain DefaultSchedulerCore::ScheduleComputation @ 0x1362eb60; comparator FindAndExtractBestNodeAvailable @ 0x13618880 |
| Stage 1 driver | (anon)::RunHloScheduler @ 0x1096fac0 — two pipelines: final_scheduler (base) → async_scheduling (LHS) |
| Stage 2 — MXU/MRB assignment | AssignMxusForSequenceGroup @ 0x10f753c0 (worker 0x10f77ca0); MrbChainAllocator::ExtendMrbReservation @ 0x10f58800; MxuAssigner::AllocateMrbEntriesAsFifo @ 0x10f3ef80, BounceBetweenMsrs @ 0x10f3fae0 |
| Stage 3 — LLO bundle packer | PackBundles @ 0x10a30a20; GlobalBundlePacker::Pack @ 0x10a86420; BundlePacker::Feed @ 0x14021f20 |
| Stage 3b — modulo (loops) | llvm::TPUScheduleDAGModulo::findSchedule @ 0x13b1d7c0; II search calculateResourceMII @ 0x13c0bee0, calculateLargestLatencyMII @ 0x13c0b840 |
| Pricing — cost model | MaxResourceCycles @ 0x1c89b9e0; LatencyBetween @ 0x1c89f820; TpuAsyncTracker::GetResourceHazardType @ 0x110015e0 |
| IR levels | Stage 1: HLO (HloInstruction) · Stages 2–3: LLO (LloInstruction) / LLVM MachineInstr |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
The Stage Ordering
The stages run strictly in pipeline order, each consuming the previous stage's output; Stage 3b is the inner-loop path within Stage 3. The two list schedulers bookend the matrix-resource assignment, and each stage is priced at its own granularity (see How the Stages Are Priced).
HLO module (post layout-assignment, post main-fusion)
│
▼
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 1 — HLO LATENCY-HIDING SCHEDULER RunHloScheduler │
│ final_scheduler : HloMemorySchedulerWithBrkgaFallback (base order) │ 0x1096fac0
│ async_scheduling : LatencyHidingScheduler::RunImpl (overlap rewrite)│ 0x136321a0
│ └─ DefaultSchedulerCore::ScheduleComputation (reverse list drain) │ 0x1362eb60
│ └─ FindAndExtractBestNodeAvailable (22-key ReadySetLt) │ 0x13618880
│ priced by GetLatencyBetween → MaxResourceCycles / LatencyBetween │
│ gated by TpuAsyncTracker (ResourceType hazard classes) │
└──────────────────────────────────────────────────────────────────────┘
│ (HLO → MHLO/TLP → tpu dialect → LLO IR descent; not on this page)
▼
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 2 — MXU SEQUENCE + MRB ALLOCATION │
│ AssignMxusForSequenceGroup (bin-pack matmuls into MXU sequences) │ 0x10f753c0
│ └─ MrbChainAllocator (accumulation-chain reservation timeline) │ 0x10f58800
│ └─ AllocateMrbEntriesAsFifo / BounceBetweenMsrs (result FIFO/MSR) │ 0x10f3ef80
│ └─ SetLatchIndices (commit latch ids onto each MxuSequence) │ 0x10f3b4c0
└──────────────────────────────────────────────────────────────────────┘
│ (latch indices + MRB entries are inputs to slot legality below)
▼
┌──────────────────────────────────────────────────────────────────────┐
│ STAGE 3 — LLO BUNDLE PACKING PackBundles │ 0x10a30a20
│ GlobalBundlePacker::Pack (forward greedy, topological) │ 0x10a86420
│ └─ BundlePacker::Feed (earliest-legal-bundle + SlotTracker) │ 0x14021f20
│ └─ per-gen TpuBundleRestrictions (slot legality, latch ordering) │
│ STAGE 3b (inner loops only): TPUScheduleDAGModulo (II-search SWP) │ 0x13b1d7c0
└──────────────────────────────────────────────────────────────────────┘
│
▼
vector<Bundle> ─→ per-gen Encoder*::EncodeBundle (latch serialization) ─→ raw bytes
The ordering is not arbitrary. Stage 1 runs on HLO precisely because async overlap is a macro decision — which collective starts before which fusion — that must be made while the program is still a high-level dataflow graph with whole-tensor dependency edges. Stage 3 runs on LLO because slot legality is a micro decision — which scalar/vector/MXU lane each individual op occupies in one VLIW word — that only exists after the program has been lowered to hardware ops. Stage 2 sits between them because the MXU's accumulation buffers (MRB) and result latches are a physical resource whose assignment is finer-grained than HLO but is a precondition for slot legality, not a consequence of it: the bundle packer needs to know which latch each matmul result lands in before it can decide whether two matmul ops conflict in a bundle.
GOTCHA: there are two unrelated "schedulers" and conflating them mis-prices everything. Stage 1 (LatencyHidingScheduler) and Stage 3 (BundlePacker) are both list schedulers, but they share no code, no IR, and no cost units. Stage 1 prices in bundle cycles via MaxResourceCycles to overlap async transfers; Stage 3 prices in slot occupancy via per-gen BundleRequirement bitmaps to fit a VLIW word. A reimplementation that runs one and skips the other produces either an un-overlapped async schedule (skip Stage 1) or an unencodable instruction stream with slot conflicts (skip Stage 3).
Stage 1 — HLO Latency-Hiding Scheduler
Purpose
The LHS is the production HLO scheduler. It is invoked from RunHloScheduler (0x1096fac0), the final-scheduling phase of DeepseaCompilerBase::RunHloPasses, after layout assignment and main fusion. It does not build a schedule from nothing: RunHloScheduler first runs a final_scheduler pipeline whose HloMemorySchedulerWithBrkgaFallback lays down a memory-pressure-minimizing base order (the LHS has_schedule() precondition), then runs an async_scheduling pipeline whose LatencyHidingScheduler replaces that order with one that overlaps async start/done pairs against compute under a TPU resource model.
Structure
RunHloScheduler 0x1096fac0
├─ pipeline "final_scheduler"
│ └─ HloMemorySchedulerWithBrkgaFallback ── base memory-minimized order
└─ pipeline "async_scheduling"
└─ AddPass<LatencyHidingScheduler> ── shared RunImpl 0x136321a0
├─ SchedulerConfig (per-variant POD)
├─ SchedulingContext (LatencyEstimator + TpuAsyncTracker)
└─ DefaultSchedulerCore::ScheduleComputation 0x1362eb60
└─ FindAndExtractBestNodeAvailable 0x13618880
The single most important structural fact is that the RunImpl body (0x136321a0) is shared across all variants. There is exactly one compiled RunImpl — the unmodified upstream latency_hiding_scheduler.cc body. The four LHS pages below document variants that differ only in two inputs handed to AddPass<LatencyHidingScheduler>: the SchedulerConfig POD and the SchedulingContext (estimator + async tracker). The core algorithm — the reverse-direction drain loop, the 22-key ReadySetLt comparator, the precomputed async depth/height, the memory-pressure retry loop — is documented once on the core page.
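The reverse-direction drain can be illustrated with a minimal sketch. It assumes a single priority key (cost-weighted depth from the entry nodes) as a stand-in for the 22-key comparator, and the function name, graph encoding, and cycle costs are hypothetical, not taken from the binary. A node becomes ready once all of its users are placed, the list is built back to front, and it is reversed at the end to give program order.

```python
import heapq
from collections import defaultdict


def reverse_list_schedule(operands, cost):
    """Bottom-up greedy list scheduling over a DAG (illustrative only).

    operands: dict mapping each node to the list of nodes it consumes.
    cost:     dict mapping each node to a hypothetical cycle cost.
    Returns the nodes in forward (program) order.
    """
    # Invert the operand edges so we know who consumes each node.
    users = defaultdict(list)
    for node, ops in operands.items():
        for op in ops:
            users[op].append(node)

    # depth[n] = longest cost-weighted path from any entry node through n.
    depth = {}

    def compute_depth(node):
        if node not in depth:
            below = max((compute_depth(op) for op in operands[node]), default=0)
            depth[node] = cost[node] + below
        return depth[node]

    for node in operands:
        compute_depth(node)

    # In reverse scheduling the roots (nodes with no users) are ready first.
    pending_users = {node: len(users[node]) for node in operands}
    ready = [(-depth[n], n) for n in operands if pending_users[n] == 0]
    heapq.heapify(ready)

    reversed_order = []
    while ready:
        _, node = heapq.heappop(ready)  # highest-priority ready node
        reversed_order.append(node)
        for op in operands[node]:
            pending_users[op] -= 1
            if pending_users[op] == 0:  # all users placed: operand is ready
                heapq.heappush(ready, (-depth[op], op))

    return reversed_order[::-1]


graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cycles = {"a": 1, "b": 5, "c": 1, "d": 1}
order = reverse_list_schedule(graph, cycles)
```

The sketch only shows the ready-set mechanics in the reverse direction; async start/done handling, the memory-pressure retry, and the real key ordering are documented on the core page.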
Variant Map
| Variant | What it changes | Page |
|---|---|---|
| Core | the shared RunImpl body: drain loop, comparator, async tracker, memory retry | LatencyHidingScheduler Core |
| post_layout | the live placement in 0.0.40 — two-pipeline RunHloScheduler, SchedulerConfig/estimator/tracker wiring | LHS post_layout |
| post_layout_pre_fusion | the same body wired at a dead "Pre main fusion" slot; config delta only | LHS post_layout_pre_fusion |
| ILP | the EnableIlpLatencyHidingScheduler gate swaps the async classifier only — the comparator and RunImpl are unchanged. The named ILPMemoryScheduler MIP is a separate, opt-in memory scheduler (reachable via GetMemorySchedulerAlgorithm case 6), not invoked by this flag | LHS ILP variant |
NOTE: the "ILP-LHS" flag does not introduce an ILP scheduler. The flag family xla_tpu_enable_ilp_latency_hiding_scheduler reads as if an integer-linear-programming scheduler replaces the greedy list scheduler. It does not: the EnableIlpLatencyHidingScheduler path inside RunHloScheduler only swaps which async-op kinds are exposed as overlap candidates to the same greedy LHS (the comparator and RunImpl are unchanged). This is distinct from the genuine xla::ILPMemoryScheduler CP-SAT memory scheduler, which is a separate, reachable opt-in: GetMemorySchedulerAlgorithm (0x10abd6a0) selects it at case 6 (MemorySchedulerProto.Value::ILP) of its 0..9 switch (Default=0 / List=1 / DFS=2 / PostOrder=3 / BRKGA=4 / BFS=5 / ILP=6 / Backtracking=7 / BruteForce=8 / LocalOrder=9), keyed off TpuCompilationEnvironment+0x10c. The MIP is on-demand (not the default), not dead. See the ILP variant page.
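The independence of the two knobs can be made concrete with a small sketch. The numeric values follow the 0..9 switch arms listed above; the Python enum, the function name, and the returned field names are hypothetical and exist only for illustration.

```python
from enum import IntEnum


class MemorySchedulerAlgorithm(IntEnum):
    # Values mirror the switch arms named on this page; the Python
    # spellings are illustrative, not symbols from the binary.
    DEFAULT = 0
    LIST = 1
    DFS = 2
    POST_ORDER = 3
    BRKGA = 4
    BFS = 5
    ILP = 6
    BACKTRACKING = 7
    BRUTE_FORCE = 8
    LOCAL_ORDER = 9


def describe_selection(algorithm_value, ilp_lhs_flag):
    """Summarize which scheduler pieces the two independent knobs select."""
    algo = MemorySchedulerAlgorithm(algorithm_value)  # ValueError if not 0..9
    return {
        # Chosen only by the memory-scheduler algorithm value.
        "memory_scheduler": algo.name,
        "uses_ilp_memory_scheduler": algo is MemorySchedulerAlgorithm.ILP,
        # The ILP-LHS flag never replaces the greedy LHS core; per the
        # note above it only changes the async-op classification.
        "lhs_core": "greedy list scheduler",
        "async_classifier": "ilp-variant" if ilp_lhs_flag else "default",
    }


flag_only = describe_selection(0, ilp_lhs_flag=True)
ilp_memory = describe_selection(6, ilp_lhs_flag=False)
```

Setting the flag alone leaves uses_ilp_memory_scheduler false, and selecting value 6 alone leaves the LHS core and classifier untouched, which is the distinction the note draws.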
What It Consumes
The comparator's per-node cost and per-edge latency come from the cost model: CostModelLatencyEstimator::GetLatencyBetween (0x10ff8f00) routes compute edges through CostModel::GetCycles → MaxResourceCycles (bundle throughput) and dependency edges through LatencyBetween (RAW stall). Whether two async ops may overlap is decided by the TpuAsyncTracker resource model — a 0..46 ResourceType enum with per-resource hazard classes. Both the cost feed and the resource model have their own pages: Bundle-Aware Cost and the ResourceType Taxonomy.
Stage 2 — MXU Sequence Assignment and MRB Allocation
Purpose
Between the HLO scheduler and the bundle packer, a TPU-specific pass assigns matmul ops to the systolic array's physical resources. Each matmul accumulation chain becomes an MxuSequence; the chains are bin-packed onto the available MXU passes (AssignMxusForSequenceGroup, 0x10f753c0, worker 0x10f77ca0); the matrix-result buffers (MRB) that hold partial sums are reserved on a timeline (MrbChainAllocator, 0x10f58800); and the result FIFOs / matrix-shift-registers (MSR) that drain those buffers are placed (AllocateMrbEntriesAsFifo 0x10f3ef80, BounceBetweenMsrs 0x10f3fae0). The assignment is committed by writing latch indices onto each sequence (SetLatchIndices, 0x10f3b4c0).
This stage has no analogue in a scalar backend. It exists because the TPU MXU is a reservation-scheduled unit: a matmul does not complete in one cycle but pushes weights and operands into a systolic pipeline whose result appears in a specific buffer many cycles later, and the compiler must statically reserve that buffer and the latch that reads it. The bundle packer downstream treats those latch indices as a slot-legality input — two ops that would collide on the same latch cannot share a bundle.
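As a rough illustration of what "statically reserve that buffer" means, the following sketch assigns each accumulation chain a buffer for a fixed cycle window using a first-free policy. Everything here is an assumption made for illustration: the function name, the (name, start, end) encoding, the first-free choice, and the decision to fail when no buffer is free. The real MrbChainAllocator policy, including chain splitting and jitter, is documented on its own page.

```python
def reserve_buffers(chains, num_buffers):
    """Statically assign a result buffer to each accumulation chain.

    chains: list of (name, start_cycle, end_cycle), end exclusive.
    Returns a dict name -> buffer index.
    Raises RuntimeError when no buffer is free (a simplification; this
    sketch does not attempt to split or delay a chain).
    """
    free_at = [0] * num_buffers  # cycle at which each buffer becomes free
    assignment = {}
    # Walk chains in order of start time so the timeline only moves forward.
    for name, start, end in sorted(chains, key=lambda c: c[1]):
        for buf in range(num_buffers):
            if free_at[buf] <= start:  # buffer is idle when the chain begins
                free_at[buf] = end     # hold it until the chain drains
                assignment[name] = buf
                break
        else:
            raise RuntimeError(f"no free buffer for {name} at cycle {start}")
    return assignment


chains = [("c0", 0, 10), ("c1", 4, 12), ("c2", 10, 20)]
assignment = reserve_buffers(chains, num_buffers=2)
```

With two buffers, c0 and c1 overlap in time and so land in different buffers, while c2 reuses the buffer c0 released at cycle 10.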
Structure
AssignMxusForSequenceGroup 0x10f753c0 ── bin-pack matmul sequences onto MXU passes
├─ (worker) AssignMxusForSequenceGroupInternal 0x10f77ca0
├─ MrbChainAllocator ── reservation timeline for accumulation buffers
│ ├─ ExtendMrbReservation 0x10f58800
│ ├─ SplitAccumulationChain 0x10f598e0
│ ├─ ReleaseMrbReservation 0x10f5f9e0
│ └─ AdvanceTimeTo 0x10f5e9e0
├─ AllocateMrbEntriesAsFifo 0x10f3ef80 ── result FIFO placement
├─ BounceBetweenMsrs 0x10f3fae0 ── matrix-shift-register ping-pong
└─ SetLatchIndices 0x10f3b4c0 ── commit latch ids onto each MxuSequence
Sub-Pages
| Component | What it owns | Page |
|---|---|---|
| MxuSequence / SequenceInfo | the per-sequence record and the set_mxu commit | MxuSequence / SequenceInfo |
| AssignMxusForSequenceGroup | the bin-packing algorithm that places sequences on MXU passes | MXU Assignment Bin-Packer |
| MrbChainAllocator | the accumulation-chain reservation timeline and jitter model | MRB Chain Allocator |
| AllocateMrbEntriesAsFifo / BounceBetweenMsrs | result-FIFO and MSR placement | MRB FIFO / MSR Placement |
| SetLatchIndices / overrun handshake | latch-index assignment and the per-gen overrun handshake | Latch Assignment & Overrun |
NOTE: Stage 2 is a dependency of Stage 3, not a sub-pass of it. The bundle packer (BundlePacker::Feed) reads each op's already-assigned latch when it asks whether the op's BundleRequirement fits a candidate bundle. A reimplementation that folds MXU assignment into the packer's per-op loop will deadlock the dependency: latch assignment needs the full accumulation chain in view, which the packer's one-op-at-a-time forward walk does not have.
Stage 3 — LLO Bundle Packing
Purpose
The bundle packer takes the LLO instruction list — the last IR before raw bundle-byte emission — and assigns each instruction to a slot inside a fixed-width VLIW bundle, respecting per-generation resource constraints (functional-unit count, vector source ports, immediate slots, predicate fields). The output is a vector<Bundle> consumed by the per-gen Encoder*::EncodeBundle. The algorithm is forward greedy earliest-legal-bundle list scheduling: walk the region topologically, and for each op compute the earliest bundle its RAW dependencies allow, then ask SlotTracker for the first bundle at or after that point with room for the op's BundleRequirement; if none fits, append a new empty bundle and retry. There is no spill-to-next-bundle search and no backtracking.
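The earliest-legal-bundle loop described above can be sketched as follows. This is a minimal model under stated assumptions: slot requirements are plain per-slot counts rather than the per-gen BundleRequirement bitmaps, the slot names and limits are invented, and pack_bundles is a hypothetical name, not a symbol from the binary.

```python
def pack_bundles(ops, limits):
    """Forward greedy earliest-legal-bundle packing (illustrative only).

    ops:    list of dicts in topological order, each with
            'name', 'deps' (list of (producer_name, latency)) and
            'req' (dict slot -> units needed).
    limits: dict slot -> units available per bundle (hypothetical values).
    Returns (bundles, placement): per-bundle slot usage and op -> bundle index.
    """
    bundles = []     # bundles[i] maps slot -> units already used
    placement = {}   # op name -> index of the bundle it landed in

    def fits(usage, req):
        return all(usage.get(slot, 0) + n <= limits.get(slot, 0)
                   for slot, n in req.items())

    for op in ops:
        # An op that cannot fit an empty bundle can never be placed.
        if not fits({}, op["req"]):
            raise ValueError(f"{op['name']} exceeds the per-bundle limit on its own")
        # Earliest bundle allowed by RAW dependencies.
        earliest = max((placement[p] + lat for p, lat in op["deps"]), default=0)
        idx = earliest
        while True:
            while len(bundles) <= idx:
                bundles.append({})     # grow the vector with an empty bundle
            if fits(bundles[idx], op["req"]):
                break
            idx += 1                   # no backtracking: only move forward
        for slot, n in op["req"].items():
            bundles[idx][slot] = bundles[idx].get(slot, 0) + n
        placement[op["name"]] = idx
    return bundles, placement


limits = {"vector": 2, "scalar": 1}
ops = [
    {"name": "a", "deps": [], "req": {"vector": 1}},
    {"name": "b", "deps": [], "req": {"vector": 1}},
    {"name": "c", "deps": [], "req": {"vector": 1}},
    {"name": "d", "deps": [("a", 2)], "req": {"scalar": 1}},
]
bundles, placement = pack_bundles(ops, limits)
```

Here c is pushed to bundle 1 because bundle 0 already holds two vector ops, and d lands in bundle 2 because its producer a sits in bundle 0 with a latency of 2. Earlier placements are never revisited.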
Two Packers, One Algorithm
libtpu ships two structurally identical packers at different IR levels: the canonical xla::jellyfish::BundlePacker operating on LloInstruction* (the path straight to the per-gen encoder), and an llvm::(anonymous)::BundlePacker MachineFunctionPass operating on MachineInstr* (the path through LLVM MC). Both produce the same bundle byte layout; they are two implementations of the same earliest-legal-bundle algorithm against two IRs. The per-gen slot legality is encapsulated in four TpuBundleRestrictions subclasses — JellyfishBundleRestrictions, PufferfishBundleRestrictions, ViperfishBundleRestrictions, and GhostliteBundleRestrictions (the only *BundleRestrictions subclasses present in the binary; there is no dragonfish/6acc60406 restriction table) — each providing SetLimits, AddXluRequirements, AddMxuRequirements, and MatchScalar virtuals.
Structure
PackBundles 0x10a30a20 ── top-level entry (timer, numbering, options)
└─ GlobalBundlePacker::Pack 0x10a86420 ── orchestrator, topological region walk
├─ PackPhis (region) 0x10a87160 ── SSA-resolution moves into predecessor tails
├─ PackInstruction (instr) 0x10a875a0 ── regular ops
│ └─ BundlePacker::Feed (instr, lat) 0x14021f20 ── earliest-legal-bundle + SlotTracker
│ └─ SlotTracker::FindFeasibleBundleAfter 0x140340a0
└─ PackBranch (instr) 0x10a877a0 ── opcodes 0x87/0x88/0xef + delay slots
→ ConsumeBundles 0x14026c40 ── hand off vector<Bundle>; ValidatePacking 0x14026ca0
Modulo Scheduling for Inner Loops (Stage 3b)
Hardware-loop inner bodies take a separate path: LLVM's TPUScheduleDAGModulo performs iterative modulo scheduling — pick an initiation interval (II) from the max of resource and recurrence bounds, then place the loop body across II-many bundles. The II search is calculateResourceMII (0x13c0bee0, the ResMII bound) and calculateLargestLatencyMII (0x13c0b840, the recurrence/RecMII bound); findSchedule (0x13b1d7c0) is the placement driver. This is the textbook software-pipelining path and is invoked only for loops marked hardware_loop; the straight-line list packer handles everything else.
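The lower bound on II follows the textbook formulation, sketched below. The formulas are the standard ResMII and RecMII definitions from iterative modulo scheduling; they are not decompiled from calculateResourceMII or calculateLargestLatencyMII, and the resource names, capacities, and the try_schedule callback are hypothetical.

```python
def res_mii(uses, capacity):
    """Resource bound: for each resource, ceil(uses per iteration / units per bundle)."""
    return max((-(-n // capacity[r]) for r, n in uses.items()), default=1)


def rec_mii(recurrences):
    """Recurrence bound: for each dependence cycle, ceil(total latency / iteration distance)."""
    return max((-(-lat // dist) for lat, dist in recurrences), default=1)


def min_initiation_interval(uses, capacity, recurrences):
    """MII is the larger of the two bounds, and never below 1."""
    return max(1, res_mii(uses, capacity), rec_mii(recurrences))


def search_ii(mii, try_schedule, max_ii):
    """Try II = MII, MII+1, ... until try_schedule(ii) succeeds or max_ii is passed."""
    for ii in range(mii, max_ii + 1):
        if try_schedule(ii):
            return ii
    return None


uses = {"mxu": 3, "vector": 5}          # uses per loop iteration (made up)
capacity = {"mxu": 1, "vector": 2}      # units per bundle (made up)
recurrences = [(8, 2), (5, 1)]          # (latency, distance) per cycle (made up)
mii = min_initiation_interval(uses, capacity, recurrences)
found = search_ii(mii, lambda ii: ii >= 7, max_ii=10)
```

In this example the resource bound is 3 and the recurrence bound is 5, so the search starts at II = 5; the stand-in placement callback first succeeds at 7.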
Sub-Pages
| Component | What it owns | Page |
|---|---|---|
| GlobalBundlePacker / BundlePacker::Feed / SlotTracker / TpuBundleRestrictions | the forward greedy algorithm, the per-gen slot matrix, branch/barrier/phi handling | LLO → Bundle Packing |
| TPUScheduleDAGModulo | the II-search and software-pipelining path for hardware loops | Bundle Modulo Scheduling |
| per-gen Encoder*::EncodeBundle | how the assigned latch fields serialize into the per-gen bundle word | Per-Gen Encoder Latch Serialization |
QUIRK: the packer never spills; it grows the bundle vector. When SlotTracker::FindFeasibleBundleAfter returns "no feasible bundle," the algorithm does not search alternative slots or reorder earlier ops: it emplace_backs a fresh empty bundle and retries the same op. A compile failure only occurs when a single op's BundleRequirement exceeds the gen's per-bundle limit even alone (a miscoded restriction table), not when the schedule is merely dense. A reimplementation that adds a spill/backtrack heuristic will diverge from the binary on every bundle-boundary decision.
How the Stages Are Priced
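Each stage prices its decisions at a different granularity, and not all of them query the Part VII cost model directly: Stage 1 does, Stage 2 uses per-gen CycleTable latencies, and Stage 3 checks slot legality against per-gen BundleRequirement bitmaps.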
| Stage | Decision priced | Cost query | Unit |
|---|---|---|---|
| Stage 1 (LHS) | which async op to issue, how much compute hides its latency | GetLatencyBetween → MaxResourceCycles (node bundle cost) + LatencyBetween (edge RAW stall) | bundle cycles → seconds via TC clock |
| Stage 1 (LHS) | whether two async ops may overlap | TpuAsyncTracker::GetResourceHazardType over the 0..46 ResourceType enum | hazard class (shareable / serial / unsharable / nonextendable) |
| Stage 2 (MXU) | MRB reservation windows, accumulation jitter | per-gen CycleTable matmul/matprep latencies | MXU pass cycles |
| Stage 3 (packer) | whether an op fits a bundle | per-gen BundleRequirement slot bitmaps (not the cost model) | slot occupancy |
The cost model itself — the 23-slot ResourceVector, the MaxResourceCycles max-over-lanes reduction, the throughput-vs-latency split — is documented in Part VII; this page only names where each scheduling stage taps it. The two most relevant pages are the Cost Model Overview (the three cost-class families and the Target clock wiring) and Bundle-Aware Cost (the MaxResourceCycles bundle cost and LatencyBetween dependency latency that feed the LHS comparator).
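The max-over-lanes reduction can be illustrated in a few lines. The sketch assumes that per-op resource vectors are summed lane by lane before the maximum is taken, and it uses an invented clock rate; the function names and the example vectors are hypothetical, and only the 23-slot width comes from this page.

```python
NUM_SLOTS = 23  # width of the resource vector as described on this page


def accumulate(op_vectors):
    """Sum per-op resource vectors lane by lane (an assumed accumulation rule)."""
    total = [0] * NUM_SLOTS
    for vec in op_vectors:
        if len(vec) != NUM_SLOTS:
            raise ValueError("resource vector must have 23 lanes")
        for lane, cycles in enumerate(vec):
            total[lane] += cycles
    return total


def max_resource_cycles(resource_vector):
    """Node cost is bounded by the busiest lane: max over all lanes."""
    return max(resource_vector)


def cycles_to_seconds(cycles, clock_hz):
    """Convert bundle cycles to seconds for a given (hypothetical) clock."""
    return cycles / clock_hz


op1 = [0] * NUM_SLOTS
op1[0] = 4                      # 4 cycles on lane 0
op2 = [0] * NUM_SLOTS
op2[0] = 3                      # 3 more cycles on lane 0
op2[5] = 6                      # 6 cycles on lane 5
node_cost = max_resource_cycles(accumulate([op1, op2]))
seconds = cycles_to_seconds(node_cost, clock_hz=1_000_000_000)
```

Lane 0 accumulates 7 cycles and lane 5 accumulates 6, so the busiest lane sets the node cost at 7 cycles; the lanes are assumed to run in parallel, which is why the reduction is a maximum rather than a sum.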
NOTE: Stage 3's "cost" is legality, not latency. Unlike Stage 1, the bundle packer does not minimize cycles; it satisfies a constraint (fit the slots). Its only latency input is the per-op RAW latency it uses to compute the earliest legal bundle in BundlePacker::Feed. The latency hiding (overlapping long edges under compute) has already been decided at the HLO level by Stage 1. A reimplementer must not re-run latency-hiding heuristics in the packer; that double-counts the optimization.
Confidence Summary
| Claim | Evidence |
|---|---|
| Stage 1 LHS is invoked from RunHloScheduler as two sequential pipelines (base then overlap) | RunHloScheduler @ 0x1096fac0 (Brkga / final_scheduler / async_scheduling strings) |
| LHS RunImpl body is shared across all variants | single compiled RunImpl @ 0x136321a0 |
| LHS comparator is the 22-key ReadySetLt chain, reverse-direction drain | FindAndExtractBestNodeAvailable @ 0x13618880; ScheduleComputation @ 0x1362eb60 |
| Stage 2 bin-packs matmuls into MxuSequences and reserves MRB | AssignMxusForSequenceGroup @ 0x10f753c0; MrbChainAllocator::ExtendMrbReservation @ 0x10f58800 |
| Stage 2 commits latch indices read downstream by Stage 3 | SetLatchIndices @ 0x10f3b4c0; AllocateMrbEntriesAsFifo @ 0x10f3ef80 |
| Stage 3 is forward greedy earliest-legal-bundle list scheduling | PackBundles @ 0x10a30a20; GlobalBundlePacker::Pack @ 0x10a86420; BundlePacker::Feed @ 0x14021f20 |
| Stage 3b modulo scheduler for hardware loops with RecMII/ResMII II search | TPUScheduleDAGModulo::findSchedule @ 0x13b1d7c0; calculateResourceMII @ 0x13c0bee0; calculateLargestLatencyMII @ 0x13c0b840 |
| Stage 1 prices via MaxResourceCycles / LatencyBetween; gates via TpuAsyncTracker | MaxResourceCycles @ 0x1c89b9e0; LatencyBetween @ 0x1c89f820; GetResourceHazardType @ 0x110015e0 |
| Per-gen slot legality is four TpuBundleRestrictions subclasses (Jellyfish/Pufferfish/Viperfish/Ghostlite) | only four *BundleRestrictions::SetLimits symbols exist (0x1c457a40 / 0x1c457d80 / 0x1c458360 / 0x1c458860); zero Trillium/dragonfish restriction symbols |
| "ILP-LHS" flag swaps the async classifier only; ILP MIP is a separate opt-in memory scheduler (not dead) | EnableIlpLatencyHidingScheduler gate only re-classifies async ops; ILPMemoryScheduler::Run @ 0x10acd020 is reached via GetMemorySchedulerAlgorithm @ 0x10abd6a0 case 6 (switch arms Default0..LocalOrder9, ILP=6, constructs xla::ILPMemoryScheduler) |
| HLO scheduling annotations do not flow into bundle packing | LLO packer reads no HLO annotation; the two operate on disjoint IRs |
Cross-References
- LatencyHidingScheduler Core: the Stage 1 core (reverse-direction drain loop, 22-key comparator, precomputed async depth/height, memory-pressure retry loop).
- LHS post_layout: the live Stage 1 placement in 0.0.40, covering the two-pipeline RunHloScheduler and its config/estimator/tracker wiring.
- LHS post_layout_pre_fusion: the same LHS body wired at the dead "Pre main fusion" slot; documented by its config delta.
- LHS ILP variant: the EnableIlpLatencyHidingScheduler async-classifier swap and the separate, opt-in ILPMemoryScheduler MIP.
- ResourceType Taxonomy: the 0..46 ResourceType enum, hazard classes, and AsyncTracker → core registry that gate Stage 1 overlap.
- MxuSequence / SequenceInfo: Stage 2 per-sequence record and the set_mxu commit.
- MXU Assignment Bin-Packer: Stage 2 AssignMxusForSequenceGroup bin-packing algorithm.
- MRB Chain Allocator: Stage 2 accumulation-chain reservation timeline and jitter model.
- MRB FIFO / MSR Placement: Stage 2 result-FIFO and matrix-shift-register placement.
- Latch Assignment & Overrun: Stage 2 latch-index assignment and the per-gen overrun handshake.
- LLO → Bundle Packing: Stage 3 forward greedy packer, per-gen slot matrix, branch/barrier/phi handling.
- Bundle Modulo Scheduling: Stage 3b II-search software pipelining for hardware loops.
- Per-Gen Encoder Latch Serialization: how Stage 2's latch fields serialize into the per-gen bundle word.
- Cost Model Overview: the three cost-class families and the Target clock wiring behind every cycle number.
- Bundle-Aware Cost: the MaxResourceCycles bundle cost and LatencyBetween dependency latency that feed Stage 1's comparator.
- Bundle Model Overview: the VLIW bundle word that Stage 3 packs and Stage 2's latches serialize into.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part VIII, Instruction Scheduling & Bundle Packing (back to index)