LHS post_layout_pre_fusion Variant
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Reconstructed by static analysis of the unstripped, full-symbol ELF (nm -C resolves every method); other versions will differ.
Abstract
The post_layout_pre_fusion variant is the LatencyHidingScheduler (LHS) as it would sit at the "Pre main fusion" slot of HloOptimizeAfterLayoutAssignment (0x10962660) — the scheduler that runs after layout assignment but before the dedicated "Main fusion" pipeline, so that it schedules still-unfused ops and the resulting overlap profile can inform the fusion pass. The defining finding of this page is negative and decisive for any reimplementer: in v0.0.40 this slot is not wired. The "Pre main fusion" HloPassPipeline (built at line 1169 of HloOptimizeAfterLayoutAssignment) contains no AddPass<LatencyHidingScheduler>, and HloOptimizeAfterLayoutAssignment never calls RunHloScheduler. Every reachable AddPass<LatencyHidingScheduler> lives in RunHloScheduler (0x1096fac0), which is Phase 7 — after main fusion — and is the canonical post_layout pass.
The reason the slot is dead is structural, not accidental. The shared RunImpl (0x136321a0) opens with a LogMessageFatal on !module->has_schedule() — "LatencyHidingScheduler expects a base schedule that minimizes memory pressure." That base schedule is produced by HloMemorySchedulerWithBrkgaFallback, which is added only inside RunHloScheduler (line 567). At the pre-fusion slot no prior memory schedule exists, so a live LHS there would abort the compile. The "Pre main fusion" pipeline therefore cannot host the canonical OSS LHS, and the only collective-/schedule-adjacent work it does run is CollectivePipeliner and its cleanups — none of which add the LHS.
This page documents the variant the way the wiki documents the other two: by its delta from the live post_layout pass and from the shared scheduler core. Because the pass body (RunImpl), the config builder (GetSchedulerConfig), the estimator selector (GetLatencyEstimator), and the tracker builder (GetTpuAsyncTracker) are all single compiled instances shared with post_layout, the difference is entirely in the inputs a pre-fusion callsite would feed — which, in this build, no callsite supplies. A reimplementer porting a future libtpu that does wire this slot needs exactly three deltas: the pre-fusion memory_limit_bytes, the cleared v45 post-SC-assignment rerun byte, and the empty reserved-SparseCore set.
For reimplementation, the contract is:
- The slot is dead in v0.0.40. The "Pre main fusion" pipeline exists (0x10962660 line 1169) but adds no LHS; the only two AddPass<LatencyHidingScheduler> sites (RunHloScheduler lines 1137 / 1411) are both Phase 7. A grep of the whole decompile for the canonical AddPass<LatencyHidingScheduler> template finds references only inside RunHloScheduler.
- The has_schedule() FATAL is why. RunImpl line 280 is a LogMessageFatal; without a prior HloMemorySchedulerWithBrkgaFallback base schedule (added at RunHloScheduler line 567, Phase 7), the LHS aborts rather than no-ops. The pre-fusion slot has no such base schedule.
- The pass body is the shared RunImpl (0x136321a0), identical to post_layout: the has_schedule() precondition, the computation-list filter, the pre/post LatencyHidingStatistics VLOG(1) passes, and the fragmentation-aware memory-pressure retry loop with a 0.9 relax factor.
- The config delta vs post_layout is three callsite values, not a different builder: v45 == 0 (no SC-assignment rerun), a pre-fusion memory_limit_bytes, and an empty reserved-SparseCore set. The 137-byte SchedulerConfig schema, the estimator branch, and the resource model are otherwise byte-for-byte the same as post_layout.
| Intended phase | Phase 6 — "Pre main fusion" sub-pipeline of HloOptimizeAfterLayoutAssignment (0x10962660, string at line 1169) |
| Wired in v0.0.40? | NO — no AddPass<LatencyHidingScheduler> in the pre-fusion pipeline; the slot is dead |
| Why dead | RunImpl has_schedule() LogMessageFatal (0x136321a0 line 280); no base schedule exists pre-fusion |
| Live LHS instead | post_layout in RunHloScheduler (0x1096fac0), Phase 7, after main fusion |
| Shared pass body | xla::LatencyHidingScheduler::RunImpl — 0x136321a0 (1217-line decompile; identical to post_layout) |
| Shared config builder | xla::jellyfish::(anon)::GetSchedulerConfig — 0x10974aa0 (137-line decompile, 137-byte POD) |
| Config delta vs post_layout | v45 == 0; pre-fusion memory_limit_bytes; empty reserved-SparseCore set |
| Canonical AddPass template | 0x10975c40 — referenced only from RunHloScheduler |
| Confidence | HIGH (the negative result is byte-anchored across two functions) |
Where the slot lives
Purpose
HloOptimizeAfterLayoutAssignment (0x10962660, 2651-line decompile) is the Phase-6 driver. It builds a sequence of named HloPassPipeline objects between LayoutAssignment (Phase 5) and final scheduling (Phase 7). The pre-fusion LHS would be added to its "Pre main fusion" pipeline — the one whose name string is stored at line 1169.
Phase layout (recovered)
DeepseaCompilerBase::RunHloPasses (0x1093a420)
  ...
  RunHloPassesAfterLayoutAssignment (0x21395d20)
    └─ HloOptimizeAfterLayoutAssignment (0x10962660) [Phase 6]
         "Post layout assignment" HloPassPipeline (SSO-built; tail "signment" at line 707)
         "Pre main fusion" HloPassPipeline (line 1169) ← THIS SLOT (no LHS added)
           CreateNestedHloPipeline "layout fixup" (line 1176)
           CreateNestedHloPipeline "host offloading" (line 1396)
           CollectivePipeliner + PipeliningCleanups (collective-adjacent; no AddPass<LHS>)
           (60+ inline AddPass<...> sites; none is LatencyHidingScheduler)
         "Main fusion" HloPassPipeline (line 2279)
         PostMainFusionHloOptimize (0x10966560) [Phase 6 tail; called at line 2614]
  PostOptimizationPipeline (0x1093fd40) [Phase 7 driver; called from RunHloPasses line 1987]
    └─ RunHloScheduler (0x1096fac0) [Phase 7] ← the live LHS
NOTE: the "Post layout assignment" pipeline name is not a single .rodata C-string; it is assembled by an inline 16-byte SSO store whose tail "signment" is the literal at 0x10962660 line 707 (strcpy(&v342[14], "signment")). The same stack-built-string idiom hides the pipeline names in RunHloScheduler (final_scheduler / async_scheduling); see post_layout. A reimplementer grepping the binary for "Post layout assignment" will not find it.
The negative result, byte-anchored
// HloOptimizeAfterLayoutAssignment @ 0x10962660
// "Pre main fusion" pipeline built at line 1169.
// rg over this function for LatencyHidingScheduler / AddPass<...Latency...>
// / RunHloScheduler → NO MATCHES.
// It builds the pipeline names and 60+ pass adds but never the LHS,
// and never invokes RunHloScheduler (Phase 7 is a sibling call, not nested here).
RunHloScheduler is not reached from inside HloOptimizeAfterLayoutAssignment; it is a separate Phase-7 call, driven by PostOptimizationPipeline (0x1093fd40), which RunHloPasses invokes at line 1987 — after the Phase-6 HloOptimizeAfterLayoutAssignment call returns. The canonical AddPass<LatencyHidingScheduler> template body (0x10975c40) is referenced from exactly one function in the entire decompile — RunHloScheduler — at two mutually-exclusive sites (lines 1137 and 1411), both after "Main fusion". There is no pre-fusion instantiation to configure.
Why the slot is dead — the has_schedule() FATAL
Purpose
The structural reason a live LHS cannot sit pre-fusion is the very first thing RunImpl does after logging: it asserts a base schedule exists. This is a hard LogMessageFatal, not a recoverable status.
Algorithm
// xla::LatencyHidingScheduler::RunImpl(module, exec_threads) // 0x136321a0
// line 280 / source latency_hiding_scheduler.cc:4110
if (!module->has_schedule()) {
  LogMessageFatal(".../latency_hiding_scheduler.cc", 4110, "module->has_schedule()")
      << "LatencyHidingScheduler expects a base schedule that minimizes memory pressure.";
  // ^ aborts the compile
}
The base schedule the precondition demands is built by HloMemorySchedulerWithBrkgaFallback, constructed at RunHloScheduler line 567 — inside Phase 7's final_scheduler pipeline. No equivalent runs in the "Pre main fusion" pipeline. Adding the LHS pre-fusion would therefore FATAL on the first computation. This is the same precondition documented for the live post_layout pass; here it is the reason the variant is only a configuration, never a live pass.
GOTCHA: the FATAL is exactly why "schedule unfused ops to inform fusion" is not achievable by simply re-adding the LHS earlier in the pipeline. A future libtpu that wants a pre-fusion LHS must also add a memory scheduler ahead of it (the BRKGA-fallback base scheduler or equivalent) in the same pre-fusion pipeline; otherwise the has_schedule() CHECK aborts. A reimplementer cannot satisfy the intent by moving the AddPass alone.
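The ordering constraint described above can be sketched with a toy pipeline checker. This is a minimal illustration, not XLA code: ToyPass, its two flags, and FirstPreconditionFailure are hypothetical names introduced here, and the pass names used in the usage note are only labels.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a pass; these are not XLA types.
struct ToyPass {
  std::string name;
  bool requires_schedule;  // pass aborts if the module has no schedule yet
  bool produces_schedule;  // pass leaves a schedule on the module
};

// Walks the pipeline in order and returns the index of the first pass whose
// "module has a schedule" precondition would fail, or -1 if every pass is safe.
int FirstPreconditionFailure(const std::vector<ToyPass>& pipeline) {
  bool has_schedule = false;  // a freshly laid-out module carries no schedule
  for (std::size_t i = 0; i < pipeline.size(); ++i) {
    if (pipeline[i].requires_schedule && !has_schedule) {
      return static_cast<int>(i);
    }
    if (pipeline[i].produces_schedule) {
      has_schedule = true;
    }
  }
  return -1;
}
```

Under these assumptions, a pipeline of {"collective-pipeliner", "latency-hiding-scheduler"} fails at index 1, while {"memory-scheduler", "latency-hiding-scheduler"} passes, which mirrors why moving the AddPass alone is not enough.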
Pass body — shared RunImpl
Purpose
When (in a future build) a pre-fusion callsite does add the LHS, it invokes the same compiled RunImpl (0x136321a0) that post_layout uses. There is exactly one RunImpl in the binary — the unmodified upstream latency_hiding_scheduler.cc body. The variant changes the config and context handed to AddPass, never the body. The full algorithm (drain loop, 22-key comparator, async resource model) is documented on the scheduler core page; this section pins the RunImpl envelope that surrounds it.
Algorithm — RunImpl envelope
absl::StatusOr<bool> RunImpl(HloModule *module,
                             const flat_hash_set<string_view>& exec_threads) { // 0x136321a0
  VLOG(5) << "Original module:" << module->ToString();            // line ~270
  // (1) precondition: FATAL if no base schedule (the dead-slot reason above)
  CHECK(module->has_schedule());                                  // line 280, src 4110
  // (2) computations to schedule: post-order, keep top-level OR force_delay group
  to_schedule = MakeComputationPostOrder(exec_threads)
      .filter(c -> c->caller_instructions().empty()
                   || c->frontend_attribute("scheduling_group") == "force_delay");
  VLOG(2) << "Computations to schedule " << to_schedule.size()    // line 578
          << " size HloGraphNode: " << ... << " size HloEdge " << ...;
  if (to_schedule.empty()) return false;
  // (3) core init (+ real-cost-model init if config.use_real_cost_model)
  scheduler_core_->SchedulerInitialize(module);
  // (4) pre-schedule statistics (VLOG(1) only)
  if (VLOG_IS_ON(1))
    for (c : to_schedule)
      LOG(INFO) << "] Statistics before scheduling:"              // line 673
                << LatencyHidingStatistics(c, ...).ToString();
  // (5) memory-pressure retry loop (see below)
  // (6) final-limit log + (7) post-schedule statistics (VLOG(1))
  LOG(INFO) << " LatencyHidingScheduler current memory usage: "   // line 1088
            << (frag + computed_use()) << " bytes. Current limit: " << memory_limit();
  if (VLOG_IS_ON(1))
    for (c : to_schedule)
      LOG(INFO) << "] Statistics after scheduling:"               // line 1146
                << LatencyHidingStatistics(c, ...).ToString();
  // (8) optional ScheduleProto dump
  return true;  // always reports "changed"
}
The pre/post LatencyHidingStatistics passes (lines 673 / 1146) are what make the LHS observable: the pass emitted before scheduling reports the wasted cycles of the input base schedule, the pass emitted after scheduling reports the same statistics for the new schedule, and the difference between the two is the latency hidden. Both build a fresh BufferInfoTracker + ModulePressureState. They are VLOG(1)-gated, so they cost nothing in production unless --v=1 is set.
NOTE: the force_delay filter (step 2) keeps a non-top-level computation in the schedule set only if its caller carries the scheduling_group frontend attribute "force_delay". This is the same string the core's per-computation entry RE2-matches; it is the single mechanism by which a callee computation is scheduled independently rather than inline with its caller.
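The step (2) filter can be sketched as a plain predicate over a post-order list. This is a toy model under stated assumptions: ToyComputation and SelectComputationsToSchedule are hypothetical names, the caller count stands in for caller_instructions(), and the attribute is read from the computation record itself, as in the envelope pseudocode (the exact place the real attribute lives is not re-derived here).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an HLO computation; not an XLA type.
struct ToyComputation {
  std::string name;
  int caller_count;  // number of caller instructions (0 means top-level)
  std::map<std::string, std::string> frontend_attributes;
};

// Keeps a computation if it is top-level, or if its "scheduling_group"
// frontend attribute is exactly "force_delay". Input order is preserved.
std::vector<std::string> SelectComputationsToSchedule(
    const std::vector<ToyComputation>& post_order) {
  std::vector<std::string> selected;
  for (const ToyComputation& c : post_order) {
    const auto it = c.frontend_attributes.find("scheduling_group");
    const bool force_delay =
        it != c.frontend_attributes.end() && it->second == "force_delay";
    if (c.caller_count == 0 || force_delay) {
      selected.push_back(c.name);
    }
  }
  return selected;
}
```

In this sketch an ordinary callee is dropped from the set (it is scheduled inline with its caller), while a callee tagged force_delay is kept alongside the entry computation.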
Memory-pressure retry loop
// RunImpl memory retry @ 0x136321a0 (lines ~840-1054)
int64_t frag = config.estimate_fragmentation_size
    ? EstimateFragmentationSize(module, alias_analysis, alias_info)   // line 840
    : 0;
int64_t mem_limit = scheduler_core_->memory_limit();
for (int attempt = 0; attempt < scheduler_core_->memory_limit_retries(); ++attempt) {
  if (frag + scheduler_core_->computed_use() <= mem_limit) break;
  LOG(INFO) << "LatencyHidingScheduler current memory usage: "        // line 874
            << (frag + computed_use())
            << " bytes, does not fit in initial limit: " << mem_limit // line 878
            << ". Setting the new limit to " << next_memory_limit();  // line 881
  scheduler_core_->SchedulerInitialize(module);
  mem_limit = (int64_t)scheduler_core_->next_memory_limit();          // relax-by-factor
  scheduler_core_->set_memory_limit(mem_limit);
  reschedule_all_computations();
  if (config.estimate_fragmentation_size)
    frag = EstimateFragmentationSize(...);                            // line 1054
}
next_memory_limit() multiplies the current limit by the AVX constant qword_A2DFD10 (read at lines 892 / 918: vmovsd xmm1, cs:qword_A2DFD10 / vmulsd), which the core page resolves to 0.9 — each retry shrinks the budget by 10%, pushing the comparator's memory-pressure keys (6-8) over its overlap keys. In the pre-fusion configuration this loop would be inert for the same reason it is inert on a first post_layout run: memory_limit == -1 unless EnableSchedulerMemoryPressureTracking is true. The pre-fusion config can never set the v45 rerun byte (it is, by construction, not a post-SC-assignment rerun), so the only way the loop fires pre-fusion is the pressure-tracking flag.
GOTCHA (LOW): the exact qword_A2DFD10 = 0.9 decimal is resolved on the scheduler core page from the IEEE bit pattern; EstimateFragmentationSize (TPU-private; a GlobalDecreasingSizeBestFitHeap heap simulation, not a closed-form alignment formula) is documented there too. Treat both as inherited facts, not re-derived on this page.
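The relax-by-factor arithmetic can be illustrated in isolation. The sketch below only reproduces the sequence of limits that successive retries would try, assuming the 0.9 factor and the truncating (int64_t) cast shown in the pseudocode; RelaxedLimits, kRelaxFactor, and kNoLimit are hypothetical names, and treating -1 as "no limit, loop inert" is an assumption taken from the description above rather than from the binary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr double kRelaxFactor = 0.9;   // value the page attributes to qword_A2DFD10
constexpr std::int64_t kNoLimit = -1;  // assumed "no memory limit" sentinel

// Returns the limit used on each retry: limit_k = trunc(limit_{k-1} * 0.9).
// With the sentinel limit nothing is relaxed, so the result is empty.
std::vector<std::int64_t> RelaxedLimits(std::int64_t initial_limit, int retries) {
  std::vector<std::int64_t> limits;
  if (initial_limit == kNoLimit) {
    return limits;
  }
  std::int64_t limit = initial_limit;
  for (int attempt = 0; attempt < retries; ++attempt) {
    limit = static_cast<std::int64_t>(static_cast<double>(limit) * kRelaxFactor);
    limits.push_back(limit);
  }
  return limits;
}
```

For example, an initial limit of 1000 bytes with three retries yields 900, 810, and 729, so the budget shrinks geometrically rather than by a fixed step.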
Config delta vs post_layout
Purpose
GetSchedulerConfig (0x10974aa0) is a single compiled function. It builds the same 137-byte SchedulerConfig POD regardless of phase; what differs between a pre-fusion and a post_layout invocation is the arguments at the callsite. Three of the POD fields are driven by inputs a pre-fusion site supplies differently.
The three deltas
// GetSchedulerConfig(out, mem_limit_bytes=a2, env=a3, target=a4, is_lem=a5) // 0x10974aa0
char v45 = *(BYTE*)(env + 4097); // xla_tpu_rerun_lhs_post_sc_assignment (line 29)
// (A) memory_limit @ POD off 80 (lines 66-71)
out.memory_limit = -1;
if (v45) out.memory_limit = mem_limit_bytes;
if (EnableSchedulerMemoryPressureTracking(...)) out.memory_limit = mem_limit_bytes;
// (B) schedule_send_recvs @ off 120 (line 76)
out.schedule_send_recvs = v45 | ScheduleSendRecvs(...);
// (C) enable_send_recv_aggregation @ off 127 (line 90)
out.enable_send_recv_aggregation = env[4789] | v45;
| POD field (offset) | post_layout (live) | post_layout_pre_fusion (this variant) |
|---|---|---|
| v45 source (env+4097) | may be 1 on the SC-assignment rerun | always 0 (not a rerun) |
| memory_limit (80) | -1, or mem_limit_bytes on pressure-track or rerun | -1 unless pressure-track (rerun path unreachable) |
| schedule_send_recvs (120) | v45 \| ScheduleSendRecvs(...) (forced on a rerun) | ScheduleSendRecvs(...) only |
| enable_send_recv_aggregation (127) | env[4789] \| v45 (forced on a rerun) | env[4789] only |
| mem_limit_bytes argument | the final post-fusion budget | a pre-fusion budget (fewer fused buffers) |
| reserved-SparseCore set | populated from GetReservedSparseCores | empty (SC assignment has not run pre-fusion) |
Everything else in the POD is identical: the per-kind overlap limits (offsets 0-72, e.g. collective_permute_overlap_limit = env[3964], all_reduce_overlap_limit = env[3960] or INT32_MAX under EnableSparseCoreOffloadQueuingInLhs), the use_real_cost_model byte at offset 123, the packed bool quad at offsets 122-125, and the disable_loop_consume_resource / enable_idle_count_extension bytes at 134-135. The complete 137-byte schema is tabulated on the post_layout page; this variant does not change a single offset.
GOTCHA: the v45 byte and the reserved-SparseCore set are the only things that structurally distinguish a pre-fusion config from a post_layout config, and both are forced to their "first run / nothing reserved" state pre-fusion. A reimplementer who copies the post_layout config verbatim into a pre-fusion slot would wrongly carry a populated reserved-SC set (information that does not exist before SC assignment) and a possibly-set v45 (a rerun semantic that makes no sense pre-fusion). Clear both.
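A minimal sketch of the three-input delta, assuming the field logic (A), (B), (C) listed above. ToySchedulerInputs, ToyResolved, Resolve, and ToPreFusion are hypothetical names; the real builder reads these values from the environment object rather than from a struct like this one.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical bundle of the callsite inputs discussed above.
struct ToySchedulerInputs {
  bool rerun_post_sc_assignment;  // the v45 byte
  bool pressure_tracking;         // EnableSchedulerMemoryPressureTracking(...)
  bool schedule_send_recvs_flag;  // ScheduleSendRecvs(...)
  bool aggregation_flag;          // env[4789]
  std::int64_t mem_limit_bytes;
  std::set<std::int64_t> reserved_sparse_cores;
};

struct ToyResolved {
  std::int64_t memory_limit;
  bool schedule_send_recvs;
  bool enable_send_recv_aggregation;
};

// Mirrors deltas (A), (B), (C): the rerun byte forces all three fields.
ToyResolved Resolve(const ToySchedulerInputs& in) {
  ToyResolved out;
  out.memory_limit = -1;
  if (in.rerun_post_sc_assignment || in.pressure_tracking) {
    out.memory_limit = in.mem_limit_bytes;
  }
  out.schedule_send_recvs = in.rerun_post_sc_assignment || in.schedule_send_recvs_flag;
  out.enable_send_recv_aggregation = in.aggregation_flag || in.rerun_post_sc_assignment;
  return out;
}

// Derives pre-fusion inputs from post_layout inputs: clear the rerun byte,
// empty the reserved-SparseCore set, and substitute the pre-fusion budget.
ToySchedulerInputs ToPreFusion(ToySchedulerInputs in, std::int64_t pre_fusion_budget) {
  in.rerun_post_sc_assignment = false;
  in.reserved_sparse_cores.clear();
  in.mem_limit_bytes = pre_fusion_budget;
  return in;
}
```

In this model, a post_layout rerun config with a 1000-byte budget resolves memory_limit to 1000, while the derived pre-fusion config resolves it to -1 unless pressure tracking is on, matching the table rows.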
Context delta — estimator and tracker
Purpose
AddPass<LatencyHidingScheduler> takes a SchedulingContext (estimator + tracker) and a DefaultSchedulerCore. Both are built by shared helpers; the pre-fusion delta is again in inputs, chiefly the empty reserved-SparseCore set passed to the tracker.
LatencyEstimator — identical selection
GetLatencyEstimator (0x10974e00) branches on SchedulerConfig offset 123 (use_real_cost_model):
function GetLatencyEstimator(target, config, env):              // 0x10974e00
  if config.use_real_cost_model != 1:                           // off 123, line 55
    if PGLE_profile_path set:
      FATAL "Attempted to use profile-guided latency estimator without the real "
            "cost model. Either enable --xla_tpu_scheduler_using_real_cost_model "
            "or unset --xla_tpu_impure_latency_hiding_scheduler_profile_path."  // line 70
    return ApproximateLatencyEstimator()
  est = CostModelLatencyEstimator(target, config, env)          // 0x10ff8a60, line 91
  if PGLE_profile_path non-empty:
    proto = tsl::ReadBinaryProto(path)                          // line 99
    est = ProfileGuidedLatencyEstimator(est, proto)             // line 104
  return est
This is the same GetLatencyEstimator the live post_layout pass uses; the pre-fusion variant chooses the estimator identically (the choice is a function of use_real_cost_model, a config byte, and the PGLE flag — none phase-dependent). See cost/overview for the cost-model family and learned-cost-model-client for the CostModelLatencyEstimator path.
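The branch structure can be restated as a small decision function. This is a sketch only: EstimatorKind, ToySelection, and SelectEstimator are hypothetical names, and the FATAL path is modelled as a returned flag instead of aborting so the logic can be exercised.

```cpp
#include <cassert>
#include <string>

enum class EstimatorKind { kApproximate, kCostModel, kProfileGuided };

// Result of the selection; `fatal` models the LogMessageFatal path.
struct ToySelection {
  bool fatal;
  EstimatorKind kind;  // meaningful only when fatal is false
};

// A profile path without the real cost model is a configuration error;
// otherwise the profile-guided estimator wraps the cost-model estimator.
ToySelection SelectEstimator(bool use_real_cost_model,
                             const std::string& pgle_profile_path) {
  const bool has_profile = !pgle_profile_path.empty();
  if (!use_real_cost_model) {
    if (has_profile) {
      return {true, EstimatorKind::kApproximate};
    }
    return {false, EstimatorKind::kApproximate};
  }
  if (has_profile) {
    return {false, EstimatorKind::kProfileGuided};
  }
  return {false, EstimatorKind::kCostModel};
}
```

Neither input depends on the scheduling phase, which is why the same four outcomes apply to a pre-fusion callsite as to the live one.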
AsyncTracker — empty reserved-SC set
GetTpuAsyncTracker (0x10975520) wraps TpuAsyncTracker::Create (0x10ffb3e0, called at line 179) and takes an absl::flat_hash_set<int64_t> reserved-SparseCore set as its final argument. In the live post_layout path that set is filled from GetReservedSparseCores (RunHloScheduler line 873) — the SparseCore IDs withheld from the overlap pool, which exist only after SparseCore assignment. A pre-fusion invocation runs before SC assignment, so this set is empty. The base resource taxonomy (ICI links, HBM banks, megacore queues, host-DMA taps, the SparseCore offload queue under EnableSparseCoreOffloadQueuingInLhs) is unchanged; the scheduler-resourcetype-model page documents it. The empty reserved-SC set is the single AsyncTracker delta of this variant.
DefaultSchedulerCore — identical wiring
The core ctor (0x10976ce0) copies the 137-byte config POD with vmovups blocks from [rcx]…[rcx+0x69] into [rdi+0x10]…[rdi+0x79] (lines 23-32) and installs four std::function slots. In the live build candidate_compare_ and post_step_mutator_ are empty (RunHloScheduler lines 1000 / 1002), so the core falls back to the built-in ReadySetLt heuristic; should_schedule_ is the $_1 (ILP) or $_2 (regular) classifier selected by EnableIlpLatencyHidingScheduler (lines 1084-1107), and target_overlap_limit is RunTensorCoreAsyncOpScheduler::$_0 (line 1040). A pre-fusion build would wire these identically — the classifier swap is the ILP variant, orthogonal to phase.
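The "empty hook falls back to the built-in heuristic" behaviour is the usual std::function idiom, sketched below. ToyCandidate, ToyCore, and BuiltInLess are hypothetical names, and the toy comparison on a single priority field is only a placeholder for the real ReadySetLt comparator.

```cpp
#include <cassert>
#include <functional>

// Hypothetical scheduling candidate; the real comparator looks at many keys.
struct ToyCandidate {
  int priority;
};

using ToyCompare = std::function<bool(const ToyCandidate&, const ToyCandidate&)>;

// Placeholder for the built-in heuristic used when no hook is installed.
bool BuiltInLess(const ToyCandidate& a, const ToyCandidate& b) {
  return a.priority < b.priority;
}

struct ToyCore {
  ToyCompare candidate_compare;  // left empty in the configuration described above

  // Uses the installed hook if there is one, otherwise the built-in heuristic.
  bool Less(const ToyCandidate& a, const ToyCandidate& b) const {
    return candidate_compare ? candidate_compare(a, b) : BuiltInLess(a, b);
  }
};
```

A core constructed with an empty hook orders candidates with the built-in rule; installing a hook replaces that rule without touching the rest of the core.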
post_layout vs post_layout_pre_fusion at a glance
| Aspect | post_layout (live) | post_layout_pre_fusion (this page) |
|---|---|---|
| Intended phase | Phase 7, async_scheduling in RunHloScheduler | Phase 6, "Pre main fusion" pipeline |
| Wired in v0.0.40? | YES (RunHloScheduler lines 1137 / 1411) | NO — no AddPass<LHS> in pre-fusion pipeline |
| Sees fusion result? | yes (runs after "Main fusion") | no (would run before fusion — its intent) |
| Base schedule available? | yes (HloMemorySchedulerWithBrkgaFallback, line 567) | no → has_schedule() FATAL if added |
| RunImpl | 0x136321a0 (shared) | 0x136321a0 (shared, identical body) |
| SchedulerConfig builder | GetSchedulerConfig (0x10974aa0) | GetSchedulerConfig (same) |
| v45 rerun byte (env+4097) | may be 1 (SC-assignment rerun) | always 0 |
| memory_limit (off 80) | -1 unless pressure-track or v45 | -1 unless pressure-track |
| mem_limit_bytes arg | final post-fusion budget | pre-fusion budget |
| reserved-SparseCore set | from GetReservedSparseCores | empty (SC assignment not yet run) |
| LatencyEstimator | GetLatencyEstimator (same) | GetLatencyEstimator (same) |
| candidate_compare_ / mutator | empty (heuristic) | empty (heuristic) |
| Async classifier | $_2, or $_1 under ILP gate | $_2, or $_1 under ILP gate |
The single live LHS in v0.0.40 is post_layout. This variant is the same RunImpl configured for the dead pre-fusion slot; documenting it is documenting (a) why the slot is empty and (b) the exact three-input delta a future build would need.
Function Map
| Function | Address | Role |
|---|---|---|
| HloOptimizeAfterLayoutAssignment | 0x10962660 | Phase-6 driver; builds "Pre main fusion" pipeline (line 1169); no LHS added |
| PostOptimizationPipeline | 0x1093fd40 | Phase-7 driver; calls RunHloScheduler (reached from RunHloPasses line 1987) |
| RunHloScheduler | 0x1096fac0 | The only LHS AddPass sites (1137 / 1411); called by PostOptimizationPipeline |
| LatencyHidingScheduler::RunImpl | 0x136321a0 | Shared pass body; has_schedule() FATAL at line 280, retry loop |
| GetSchedulerConfig | 0x10974aa0 | 137-byte SchedulerConfig; v45 = env[4097] at line 29 |
| GetLatencyEstimator | 0x10974e00 | Approximate / CostModel / PGLE selection (byte 123) |
| GetTpuAsyncTracker | 0x10975520 | Wraps TpuAsyncTracker::Create; takes reserved-SC flat_hash_set |
| TpuAsyncTracker::Create | 0x10ffb3e0 | TPU resource model + reserved-SC set (line 179 callsite) |
| DefaultSchedulerCore ctor | 0x10976ce0 | Copies config POD (vmovups); installs four std::function hooks |
| AddPass<LatencyHidingScheduler, …> | 0x10975c40 | Templated pass adder; referenced only from RunHloScheduler |
| HloMemorySchedulerWithBrkgaFallback ctor | (line 567 of 0x1096fac0) | Phase-7 base memory schedule (the has_schedule() precondition) |
| IsPassDisabled | 0x12fd8340 | "latency-hiding-scheduler" / "async-op-scheduler" gates |
| EnableSchedulerMemoryPressureTracking | 0x1d6b66e0 | memory_limit tri-state |
| EnableIlpLatencyHidingScheduler | 0x1d6b7e00 | $_1/$_2 classifier gate (ILP variant) |
Confidence Summary
| Claim | Evidence |
|---|---|
| Pre-fusion slot has no AddPass<LatencyHidingScheduler> | 0x10962660: only "Pre main fusion" (1169) / "Main fusion" (2279); no LHS / no RunHloScheduler ref |
| Canonical AddPass<LHS> template referenced only from RunHloScheduler | rg --no-ignore for AddPassINS_22LatencyHidingScheduler → 1 file (its own body) |
| Both live LHS sites are Phase 7 (post-fusion) | RunHloScheduler lines 1137 / 1411 |
| Slot is dead because RunImpl FATALs on !has_schedule() | 0x136321a0 line 280 LogMessageFatal + precondition string |
| Base schedule only exists in Phase 7 | HloMemorySchedulerWithBrkgaFallback at RunHloScheduler line 567 |
| v45 = env[4097]; drives memory_limit/schedule_send_recvs/aggregation | GetSchedulerConfig lines 29 / 66-71 / 76 / 90 |
| Pre-fusion config delta = v45==0, pre-fusion budget, empty reserved-SC | derived from GetSchedulerConfig + GetReservedSparseCores (line 873) phase-ordering |
| next_memory_limit relax factor qword_A2DFD10 | RunImpl lines 892 / 918 vmulsd (decimal 0.9 resolved on core page) |
| Empty reserved-SC set is the AsyncTracker delta | GetTpuAsyncTracker final flat_hash_set<long> arg; SC assignment post-dates pre-fusion |
| GetLatencyEstimator branch + PGLE FATAL identical | 0x10974e00 lines 55 / 70 / 91 / 104 |
Cross-References
- LatencyHidingScheduler Core: the shared ScheduleComputation drain loop, 22-key comparator, and TpuAsyncTracker resource model that RunImpl drives.
- LHS post_layout: the live Phase-7 LHS; this page is its delta. The full 137-byte SchedulerConfig schema, the two-pipeline RunHloScheduler structure, and the reserved-SparseCore set are documented there.
- LHS ILP Variant: the EnableIlpLatencyHidingScheduler $_1/$_2 classifier swap, orthogonal to scheduling phase.
- Scheduler ResourceType Model: the ICI/HBM/megacore/host-DMA/SparseCore resource taxonomy the AsyncTracker enforces.
- Scheduling Overview: where scheduling sits between lowering and encoding.
- Cost Model Overview: the latency/cost model family feeding GetLatencyEstimator.
- Learned Cost Model Client: the CostModelLatencyEstimator real-cost-model path.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part VIII, Instruction Scheduling & Bundle Packing (back to index)