MSA AllocateSegment
Addresses, struct offsets, and bitmask values apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
MsaAlgorithm::AllocateSegment (0x1dc73ca0) is the inner loop body of memory-space assignment. The driver — the buffer-interval sort, the colocation grouping, and the autotuner "IOR" replay surface — lives on msa-overview.md; this page documents the per-segment placement attempt that the driver calls once for every (AllocationValue, use) pair, in sorted order. A "segment" is the live range of one value between its definition (or a prior allocation's end) and a single use; AllocateSegment decides where that segment's buffer physically sits — in alternate memory (VMEM) with or without a copy, or back in default memory (HBM) — and emits the Allocation objects (and any async CopyStart/CopyDone pair) that realize the decision.
The reader who knows the upstream XLA MsaAlgorithm::AllocateSegment will recognise the shape: reconcile required assignments, then run a fixed cascade of placement primitives that each return an AllocationResult bitmask, short-circuiting on the first kSuccess. What this build pins down byte-exactly is (1) the six-stage cascade order and its short-circuit/uncommit semantics, (2) the ~0x210-byte AllocationRequest struct the cascade reads, recovered from the writer CreateAllocationRequest (0x1dc72820) cross-checked against every field read in the body, and (3) the AllocationResult failure bitmask — 10 named failure bits spanning 0x001..0x200 — recovered from SingleFailureResultToString (0x1dc7bc80). The greedy character is concrete here: there is no backtracking inside AllocateSegment — a failed prefetch or evict that leaves pending chunks ORs in FailRequiresUncommit (0x40) so the driver rolls back, and the segment is retried only on a later driver retry with a widened window.
This page documents three things a reimplementer must reproduce: the cascade body and its stage gating, the alternate-memory fit check (AllocateInAlternateMemoryNoCopy and the prefetch chunk-candidate search), and the prefetch/eviction insertion (Prefetch/Evict and the find_if that supplies them their predecessor allocation). The interval-picker sweep that prefetch consults, the async-copy resource model, and the per-version numeric defaults are summarized here and detailed on sibling pages.
For reimplementation, the contract is:
- The cascade. Required-assignment reconcile → direct pin → free coloring-reserved + alternate no-copy → force-min-time (when +0x208 == 1) → prefetch → evict → default fallback, each returning an AllocationResult and short-circuiting on kSuccess.
- The request. The AllocationRequest field layout the cascade reads; in particular the four trailing state bytes at +0x208..+0x20b that gate which stages run.
- The result. The failure bitmask (10 named bits, 0x001..0x200), its byte-exact bit→name map, and the |0x40 (FailRequiresUncommit) rule that hands rollback to the driver.
- The fit check + insertion. How the no-copy path and the prefetch path each query the best-fit heap for a chunk, and how prefetch/evict find their predecessor allocation.
| Method | xla::memory_space_assignment::MsaAlgorithm::AllocateSegment(AllocationRequest&) |
| Entry | 0x1dc73ca0 (~5.5 KB; 1301-line decompile) |
| Request writer | MsaAlgorithm::CreateAllocationRequest @ 0x1dc72820 |
| Result stringizer | MsaAlgorithm::SingleFailureResultToString @ 0x1dc7bc80; ResultToString @ 0x1dc68c80 |
| No-copy fit | AllocateInAlternateMemoryNoCopy @ 0x1dc7ca80 |
| Prefetch / evict | Prefetch @ 0x1dc7e4a0 · Evict @ 0x1dc7d660 |
| Predecessor find_if | AllocateSegment::$_2 @ 0x1dc7e420 |
| IR level | XLA HLO (HloValue / HloUse), heap-simulator timeline (logical times) |
| Memory spaces | kDefault = HBM (0), kAlternate = VMEM (1); VMEM byte budget and word size are per-version — see msa-reservation-hbm-policy.md |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
The AllocationRequest
CreateAllocationRequest (0x1dc72820) builds one AllocationRequest per (value, use) and hands it to AllocateSegment by reference. The struct is ~0x210 bytes; the layout below was recovered from the writer's field-store region (0x1dc730d8..0x1dc731e5) cross-checked against every [rbx+N] / [a2+N] read in the cascade body. The decompile confirms the gating-byte offsets directly: Evict is called with *((unsigned __int8 *)a2 + 520) (= +0x208) as its force argument, and Prefetch with *((unsigned __int8 *)a2 + 521) (= +0x209) as its window argument.
| Offset | Field | Meaning | Confidence |
|---|---|---|---|
| +0x00 | inclusive_start_time | max(use_time, earliest_prefetch) or preferred-override time | CONFIRMED |
| +0x08 | end_time | use latest time bound | CONFIRMED |
| +0x10 | latest_prefetch_time | corrected use time bound | CONFIRMED |
| +0x18 | require_start_time | required def time | CONFIRMED |
| +0x20 | size | BufferInterval size, bytes | CONFIRMED |
| +0x28 | allow_no_copy_alternate_mem | bool | CONFIRMED |
| +0x29 | require_no_copy_alternate_mem | bool | CONFIRMED |
| +0x2a | require_copy_allocation | bool (arg) | CONFIRMED |
| +0x2b | no_copy_chunk_required | bool | CONFIRMED |
| +0x2c | allow_prefetch | bool | CONFIRMED |
| +0x30 | earliest_prefetch_time | packed: low-byte flag, time recovered with >>8 | HIGH |
| +0x38 | has_preferred_prefetch_time | bool | CONFIRMED |
| +0x40 | preferred_prefetch_time | packed: low-byte flag OR'd with time<<8 | HIGH |
| +0x48 | prefer_no_copy | bool | CONFIRMED |
| +0x50 | preferred_offset | AliasedOffset* (arg) | CONFIRMED |
| +0x58 | use | const HloUse*, the use being placed | CONFIRMED |
| +0x60 | allocation_value | AllocationValue* | CONFIRMED |
| +0x68 | all_use_times | 16-B span head/span | HIGH |
| +0x88 | required_copy_done_time | int64 | HIGH |
| +0x90 | allow_sync_conversion | bool (arg) | CONFIRMED |
| +0xc0 | split_shape | Shape, constructed iff +0x200 == 1 | HIGH |
| +0x200 | has_split_shape | bool | HIGH |
| +0x208 | force-evict flag | passed as Evict's bool force | CONFIRMED |
| +0x209 | allow-prefetch-window flag | passed as Prefetch's bool window | CONFIRMED |
| +0x20a, +0x20b | uncommit / retry state bits | consulted to gate retry/evict/bail | HIGH |
The four trailing bytes at +0x208..+0x20b are the "already-attempted / requires-uncommit" state. +0x208 (cmp 1) gates the ForceAlternateMemoryAllocationForMinTime path and is the force argument to Evict; +0x209 is the window argument to Prefetch; +0x20a|+0x20b OR'd together gate the colored-reserved free path. The sync-copy candidate test that CreateAllocationRequest reaches (through IsAsyncConversionSliceCandidate → IsAsyncConversion{Copy,Slice}Candidate) matches HLO opcode 44 (kCopy) for copies and 119/54 (kSlice/kDynamicSlice) for slices, letting MSA convert an already-scheduled synchronous kCopy/kSlice into an MSA-managed async copy (see Async-copy resource and sync replacement).
NOTE: ~0x210 not exact. The struct size is bounded below (≥0x210) by the highest observed store; padding past +0x20b is not enumerated. Offsets +0x68/+0xc0/+0x200 and the two packed-time fields are decoded from store widths and the conditional Shape ctor at 0x20cd6660, hence HIGH rather than CONFIRMED.
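To make the layout concrete, here is a minimal Python sketch that reads a few of the tabulated fields out of a raw byte image of the struct. The offsets come from the table above. Everything else is an assumption made for illustration: the function and constant names are hypothetical, the 8-byte little-endian signed field width is inferred from the offset spacing on an x86-64 build, and a non-zero low byte is taken to mean "time present" in the packed field.

```python
import struct

# Offsets copied from the table above; every Python name here is hypothetical.
OFF_INCLUSIVE_START = 0x00
OFF_END_TIME = 0x08
OFF_SIZE = 0x20
OFF_ALLOW_PREFETCH = 0x2C
OFF_EARLIEST_PREFETCH = 0x30
OFF_FORCE_EVICT = 0x208       # passed as Evict's force argument
OFF_PREFETCH_WINDOW = 0x209   # passed as Prefetch's window argument
MIN_REQUEST_SIZE = 0x210      # lower bound only; trailing padding is unknown


def read_i64(buf, offset):
    # Assumption: 8-byte little-endian signed fields.
    return struct.unpack_from("<q", buf, offset)[0]


def unpack_time(word):
    # Assumption: a non-zero low byte means "present"; the time sits above it.
    return bool(word & 0xFF), word >> 8


def decode_request(buf):
    """Decode a subset of AllocationRequest fields from a raw byte image."""
    if len(buf) < MIN_REQUEST_SIZE:
        raise ValueError("buffer shorter than the observed struct extent")
    has_earliest, earliest = unpack_time(read_i64(buf, OFF_EARLIEST_PREFETCH))
    return {
        "inclusive_start_time": read_i64(buf, OFF_INCLUSIVE_START),
        "end_time": read_i64(buf, OFF_END_TIME),
        "size": read_i64(buf, OFF_SIZE),
        "allow_prefetch": bool(buf[OFF_ALLOW_PREFETCH]),
        "earliest_prefetch_time": earliest if has_earliest else None,
        "force_evict": buf[OFF_FORCE_EVICT] == 1,
        "prefetch_window": bool(buf[OFF_PREFETCH_WINDOW]),
    }
```

The decoder deliberately skips the HIGH-confidence span and Shape fields, whose internal layout is not enumerated on this page.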
The AllocationResult bitmask
The return type is not an enum — it is a failure bitset (10 named bits, 0x001..0x200) where 0 is success and each set bit is one failure reason. SingleFailureResultToString (0x1dc7bc80) range-checks the value against 0x7f (cmp 0x7f), routing values ≤0x7f through a 65-entry jump table at 0xb564d94 (indexed by the single-bit value 0..0x40) and the three high bits (0x80/0x100/0x200) through explicit compares; each arm stores its result string. Success, FailOutOfMemory, and UnknownResult are inline strcpy string literals in the decompile; the remaining names are loaded from rodata (so they do not surface as inline string args), consistent with the jump-table-plus-rodata structure.
| Bit | Name |
|---|---|
| 0x000 | Success (no bits set) |
| 0x001 | FailOutOfMemory |
| 0x002 | FailPrevAllocationNotInAlternateMem |
| 0x004 | FailLiveRangeTooLong |
| 0x008 | FailLiveRangeTooShort |
| 0x010 | FailOutOfAsyncCopies |
| 0x020 | FailViolatesAsyncCopyResource |
| 0x040 | FailRequiresUncommit |
| 0x080 | AllSlicesHaveTheSameStartTime |
| 0x100 | FailConflictingPreferredOffsets |
| 0x200 | FailSyncDataMoveReplacement |
| else | UnknownResult |
A result with multiple bits set is decomposed and joined name-by-name by ResultToString (0x1dc68c80). Two predicate helpers read the mask: result_requires_uncommit() tests bit 0x40; result_failed_because_of_async_copy() tests 0x10|0x20. The semantics that matter for the cascade:
- FailRequiresUncommit (0x40) is not produced by a primitive: AllocateSegment ORs it in after a failed evict (and on the residual-fail fallback path) to tell the driver that pending chunks for this value must be rolled back before the next attempt. The decompile shows this OR at three sites (v69 | 0x40, v240 | 0x40, v83 | 0x40).
- FailOutOfAsyncCopies (0x10) vs FailViolatesAsyncCopyResource (0x20) distinguish "the configured outstanding-copy cap is hit" from "the copy cannot fit any free time-bucket window"; both come out of the async-copy resource model, not from a counter.
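The bit→name map and the two predicates can be sketched in a few lines of Python. The bit values and names are the ones tabulated above. The snake_case function names are hypothetical, and the " | " separator used when joining multiple names is an assumption, since this page does not record the separator ResultToString emits.

```python
# Bit -> name map from the table above.
FAILURE_NAMES = {
    0x001: "FailOutOfMemory",
    0x002: "FailPrevAllocationNotInAlternateMem",
    0x004: "FailLiveRangeTooLong",
    0x008: "FailLiveRangeTooShort",
    0x010: "FailOutOfAsyncCopies",
    0x020: "FailViolatesAsyncCopyResource",
    0x040: "FailRequiresUncommit",
    0x080: "AllSlicesHaveTheSameStartTime",
    0x100: "FailConflictingPreferredOffsets",
    0x200: "FailSyncDataMoveReplacement",
}


def single_failure_to_string(value):
    """Name one value; anything that is not 0 or a known single bit is unknown."""
    if value == 0:
        return "Success"
    return FAILURE_NAMES.get(value, "UnknownResult")


def result_to_string(mask, sep=" | "):
    """Decompose a multi-bit mask name by name (separator is an assumption)."""
    if mask == 0:
        return "Success"
    bits = [1 << i for i in range(mask.bit_length()) if mask & (1 << i)]
    return sep.join(single_failure_to_string(bit) for bit in bits)


def result_requires_uncommit(mask):
    return bool(mask & 0x040)


def result_failed_because_of_async_copy(mask):
    return bool(mask & (0x010 | 0x020))
```

For example, a mask of 0x21 decomposes into FailOutOfMemory and FailViolatesAsyncCopyResource, and only the async-copy predicate fires for it.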
The cascade
AllocateSegment runs a fixed sequence of placement attempts, each returning an AllocationResult, and returns at the first kSuccess (== 0). The stages, in body order:
// MsaAlgorithm::AllocateSegment(AllocationRequest& req)               // 0x1dc73ca0
function AllocateSegment(req):
    // ---- STAGE 0: REQUIRED-ASSIGNMENT RECONCILIATION ----
    def_req   = RequiredMemoryAssignmentAt(value, def_time)            // 0x1dc48840
    use_req   = RequiredMemoryAssignmentAt(value, use_time)            // 0x1dc48840
    alias_req = AliasedRequiredAssignmentForUse(use)                   // 0x1dc769c0
    if conflict(def_req, use_req):                                     // (memory_space, offset, padding)
        FATAL(def_req.ToString(), use_req.ToString(), use)             // 0x1dc47680
    required = reconcile(def_req, use_req, alias_req)
    // ---- STAGE 1: DIRECT PIN (colored / required alternate-memory) ----
    if required == kAlternate:
        a = new PinnedAllocation(position, kAlternate, chunk,          // 0x1dcdc400
                                 def_time, use_time)
        a->AddUse(use)                                                 // 0x1dcda700
        value.allocation_sequence.push_back(a)                         // 0x1dc55740
        CreateOrAddToAliasedOffset(*a, req.preferred_offset)           // 0x1dc79160
        return kSuccess
    result = kSuccess                                                  // sentinel cleared
    // ---- STAGE 2: free coloring-reserved, then dual-live + no-copy fit ----
    FreeAlternateMemoryColoringReservedAllocations(req)                // 0x1dc7c640 (line 717)
    if alternate_mem_allowed(req):             // gate over +0x29/+0x2a/+0x2b/+0x20a/+0x20b
        CheckAndUpdateForDualLiveAllocationValues(prev, req)           // 0x1dc7bf20 (line 722)
        // ---- STAGE 3: ALTERNATE NO-COPY (the fit check) ----
        result = AllocateInAlternateMemoryNoCopy(req)                  // 0x1dc7ca80 (line 723)
        if result == kSuccess: return kSuccess
    // ---- STAGE 3b: FORCE-MIN-TIME (only when +0x208 == 1) ----
    if req[+0x208] == 1:                                               // line 738
        if ForceAlternateMemoryAllocationForMinTime(req):              // 0x1dc7d460 (line 749)
            return result                                              // non-zero -> propagate
    // ---- STAGE 4: PREFETCH (insert async copy) ----
    prev_alloc = *find_if(value.allocation_sequence, $_2)              // 0x1dc7e420
    if req[+0x209] or req.allow_prefetch:                              // +0x2c
        prefetch_result = Prefetch(req, *prev_alloc, /*window=*/req[+0x209])  // 0x1dc7e4a0
        // test Prefetch's own result: an earlier no-copy fail bit must not mask a success
        if prefetch_result == kSuccess: return kSuccess
        result |= prefetch_result
    // ---- STAGE 5: EVICT ----
    if has_prior_alternate_alloc:
        result = Evict(req, /*force=*/req[+0x208])                     // 0x1dc7d660
        if result == kSuccess: return kSuccess
        result |= 0x40                         // FailRequiresUncommit -> driver rolls back pending
        return result
    // ---- STAGE 6: DEFAULT-MEMORY FALLBACK ----
    d = new PinnedAllocation(position, kDefault/*0*/, none, def, use)  // 0x1dcdc400
    d->AddUse(use)                                                     // 0x1dcda700
    value.allocation_sequence.push_back(d)                             // 0x1dc55740
    return result              // may carry residual fail bits (also OR'd 0x40 on a fail path)
The decompile confirms the call order directly: FreeAlternateMemoryColoringReservedAllocations at line 717, AllocateInAlternateMemoryNoCopy at 723, ForceAlternateMemoryAllocationForMinTime at 749, Evict(..., a2[+0x208]) at 829 (with | 0x40 immediately after at 832), and Prefetch(..., a2[+0x209]) at 988. The PinnedAllocation ctor (0x1dcdc400) appears twice in the body — once for Stage 1's direct alternate pin, once for the Stage 6 default fallback — while AddUse (0x1dcda700) appears four times across the segment's allocation-emitting and use-attachment paths.
NOTE: no in-function backtracking. Stages 3–6 do not undo each other's side effects. Once a primitive commits chunks to the heap, only the driver's uncommit path (triggered by FailRequiresUncommit) reclaims them. This is the difference between "greedy with retry" (what MSA is) and a true backtracking search.
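The control flow of the cascade can be exercised in isolation with a small Python model in which the four primitives are injected as callables. This is a sketch of the stage gating only, not of the binary: the class and field names are hypothetical, Stage 0 reconciliation is collapsed into a single boolean, and allocation emission is reduced to a stage label. It tests Prefetch's own return value before merging fail bits, so that a failed no-copy attempt does not mask a successful prefetch.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

K_SUCCESS = 0
FAIL_REQUIRES_UNCOMMIT = 0x40


@dataclass
class Request:
    required_alternate: bool = False          # Stage 0 outcome (simplified)
    alternate_mem_allowed: bool = True        # Stage 2 gate
    force_min_time: bool = False              # byte at +0x208 == 1
    prefetch_window: bool = False             # byte at +0x209
    allow_prefetch: bool = True               # byte at +0x2c
    has_prior_alternate_alloc: bool = False   # Stage 5 gate


@dataclass
class Primitives:
    no_copy: Callable[[Request], int]
    force_min_time: Callable[[Request], int]
    prefetch: Callable[[Request], int]
    evict: Callable[[Request], int]


def allocate_segment(req: Request, prims: Primitives) -> Tuple[int, str]:
    """Return (result mask, label of the stage that ended the cascade)."""
    if req.required_alternate:                      # Stage 1: direct pin
        return K_SUCCESS, "direct-pin"
    result = K_SUCCESS
    if req.alternate_mem_allowed:                   # Stages 2-3: no-copy fit
        result = prims.no_copy(req)
        if result == K_SUCCESS:
            return K_SUCCESS, "no-copy"
    if req.force_min_time and prims.force_min_time(req):   # Stage 3b
        return result, "force-min-time"
    if req.prefetch_window or req.allow_prefetch:   # Stage 4: prefetch
        prefetch_result = prims.prefetch(req)
        if prefetch_result == K_SUCCESS:
            return K_SUCCESS, "prefetch"
        result |= prefetch_result
    if req.has_prior_alternate_alloc:               # Stage 5: evict
        result = prims.evict(req)
        if result == K_SUCCESS:
            return K_SUCCESS, "evict"
        return result | FAIL_REQUIRES_UNCOMMIT, "evict-failed"
    return result, "default-fallback"               # Stage 6
```

Injecting the primitives keeps the model free of heap state, which makes it easy to check that a failed evict is the path that adds 0x40 while a failed prefetch with no prior VMEM allocation falls through to the default-memory fallback carrying its residual bits.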
Stage 0 — reconcile required assignments
Three required-assignment lookups run first. RequiredMemoryAssignmentAt(value, t) (0x1dc48840) returns any hard memory-space pin the value carries at logical time t — one for the def, one for the use. AliasedRequiredAssignmentForUse(use) (0x1dc769c0) folds in any pin inherited through a colocation/alias group (e.g. a while-loop carried value, or a buffer the layout pass colored). If the def and use disagree on (memory_space, offset, padding), the pass is in an unsatisfiable state and emits a LogMessageFatal printing all three ToString()s — a genuine conflict is a compiler bug, not a fallback. Otherwise the consistent assignment becomes required.
Stage 1 — direct pin
If required == kAlternate, the value must live in VMEM (it was colored or carries a required alternate pin), so there is nothing to decide: build a PinnedAllocation over [def_time, use_time] in kAlternate, attach the use, push it onto the value's allocation sequence, and thread the AliasedOffset through CreateOrAddToAliasedOffset so every member of the colocation group shares one VMEM offset. Returns kSuccess unconditionally — the chunk reservation happened during coloring.
Stage 2 — free reserved, then dual-live + no-copy
FreeAlternateMemoryColoringReservedAllocations runs unconditionally at the top of this region (line 717), releasing VMEM that coloring had reserved but that this segment may now claim. A gate computed from the trailing state bytes (+0x20a | +0x20b | …) together with the require_no_copy_alternate_mem flag (+0x29) then decides whether to attempt alternate placement. When the gate passes, CheckAndUpdateForDualLiveAllocationValues handles the case where the value is simultaneously live in two places, then Stage 3's AllocateInAlternateMemoryNoCopy runs and short-circuits to kSuccess if it places the buffer.
The ForceAlternateMemoryAllocationForMinTime path is separate and later (decompile line 749, after the no-copy attempt at 723): it is reached only when +0x208 == 1, and forces the buffer into VMEM for at least its minimum live interval; if it returns non-zero, that propagates straight out (no further stages).
The fit check — AllocateInAlternateMemoryNoCopy
Stage 3 is the cheapest placement: keep the buffer in VMEM across [def, use] with no copy at all, which is possible only if a free VMEM chunk of the requested size persists for the whole interval. AllocateInAlternateMemoryNoCopy (0x1dc7ca80) queries the best-fit heap for that chunk via a single FindBestChunkCandidate (0x1dc7fb00) call, which in turn wraps GlobalDecreasingSizeBestFitHeap::FindChunkCandidate (0x1e48f960). The heap is the GlobalDecreasingSizeBestFitHeap that MSA maintains over the VMEM byte budget, minus any scoped reservation, with chunk offsets/sizes aligned to the VMEM word; the per-version budget and word size are documented on msa-reservation-hbm-policy.md and the allocator detail on ../memory/vmem-allocator.md.
// AllocateInAlternateMemoryNoCopy(req)                          // 0x1dc7ca80
result = kSuccess
chunk = FindBestChunkCandidate(req, /*interval=*/[def, use])     // 0x1dc7fb00 -> best-fit
                                                                 // over VMEM byte budget
if not chunk:
    return FailOutOfMemory                                       // 0x01 (no persistent free chunk)
heap.Commit(chunk)
a = new PinnedAllocation(position, kAlternate, chunk, def, use)
a->AddUse(use)
value.allocation_sequence.push_back(a)
return kSuccess
The fit check is temporal, not just spatial: the chunk must be free for the entire [def, use] interval on the heap-simulator timeline, because a no-copy allocation occupies VMEM continuously. If no single chunk spans the whole interval the result is FailOutOfMemory (0x01) and the cascade falls through to prefetch.
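The temporal nature of the fit check can be illustrated with a deliberately simplified best-fit search in Python. This is not GlobalDecreasingSizeBestFitHeap: it is a sketch that scans committed chunks, keeps only those whose live range overlaps the requested interval, and returns the offset of the smallest free gap that fits. The names Committed and find_chunk, the inclusive time bounds, and the align parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Committed:
    offset: int   # byte offset in the alternate-memory budget
    size: int     # bytes
    start: int    # first logical time the chunk is live (inclusive, assumed)
    end: int      # last logical time the chunk is live (inclusive, assumed)


def find_chunk(committed: List[Committed], budget: int, size: int,
               start: int, end: int, align: int = 1) -> Optional[int]:
    """Return the offset of the smallest gap free over [start, end], or None."""
    # Only chunks whose live range overlaps the request block any space.
    busy = sorted((c.offset, c.offset + c.size) for c in committed
                  if c.start <= end and start <= c.end)
    best = None     # (gap size, gap offset)
    cursor = 0      # highest occupied byte seen so far
    for lo, hi in busy + [(budget, budget)]:   # sentinel closes the last gap
        gap_start = (cursor + align - 1) // align * align   # round up to alignment
        gap = lo - gap_start
        if gap >= size and (best is None or gap < best[0]):
            best = (gap, gap_start)
        cursor = max(cursor, hi)
    return None if best is None else best[1]
```

With a 100-byte budget and two committed chunks, one at offset 0 live over [0, 50] and one at offset 60 live over [0, 200], a 30-byte request over [10, 60] finds no gap, while the same request over [51, 60] fits at offset 0 because the first chunk has already died. That is the spatial-plus-temporal distinction that lets a later-starting interval succeed where the full [def, use] interval fails.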
Prefetch and eviction insertion
If the buffer cannot live in VMEM for free across the whole interval, MSA tries to copy it into VMEM just in time (prefetch) or, if the value already has an older VMEM allocation, copy it back out to HBM (evict). Both primitives need the value's predecessor allocation — the one whose interval abuts the current use — which AllocateSegment locates with a find_if over value.allocation_sequence using the lambda predicate $_2 (0x1dc7e420).
Prefetch (Stage 4)
Prefetch (0x1dc7e4a0) decides whether copying the buffer from HBM into VMEM ahead of the use is profitable and feasible. The body sketched below is the prefetch path, not a literal transcription of the 122-line Prefetch function: the alternate-memory-benefit weighting is computed up in AllocateSegment itself (it calls both the HloPosition and HloUse overloads of GetAlternateMemoryBenefit), the picker sweep and fit/resource loop live in Prefetch + CheckPrefetchFit (0x1dc820e0), and the sliced emission is in AddAsyncSlicesForPrefetch (0x1dc7ea40).
// prefetch path: AllocateSegment + Prefetch (0x1dc7e4a0) + CheckPrefetchFit (0x1dc820e0)
benefit = CostAnalysis::GetAlternateMemoryBenefit(use)               // 0x1dcec480 (in AllocateSegment)
          // weighed against OperandBytesAccessed / OutputBytesAccessed
          // (0x1dceb340 / 0x1dceb360)
picker.Begin(use, earliest_prefetch, latest_prefetch, preferred)     // 0x1dcd7a00
while not picker.Done():
    t_start = picker.Next()                                          // 0x1dcd7e20
    chunk = FindBestChunkCandidate(req, [t_start, use_time])         // 0x1dc7fb00
            // -> GlobalDecreasingSizeBestFitHeap::FindChunkCandidate (0x1e48f960)
    if not chunk: continue
    if not AsyncCopyResource.HasEnoughResource(t_start, use_time, cost):
        result |= FailViolatesAsyncCopyResource                      // 0x20
        continue
    emit CopyAllocation(kCopyStart @ t_start, kCopyDone @ use_time)
         // or SlicedCopyAllocation (make_unique @ 0x1dc7f7e0, via AddAsyncSlicesForPrefetch
         // @ 0x1dc7ea40) on the sliced path
    return kSuccess
return result                                                        // accumulated fail bits
The interval picker sweeps candidate copy-start times outward from a preferred time using two cursors (an increasing one seeded from earliest_prefetch_time, a decreasing one from latest_prefetch_time) and a direction flag, integrating bandwidth-idle time so the copy is placed where it best hides behind compute. For each candidate start time, FindBestChunkCandidate (0x1dc7fb00) asks the VMEM heap for a chunk over [t_start, use_time] (the chunk only needs to be free from the copy's arrival, not from the def, which is why prefetch can succeed where no-copy failed), and the async-copy resource model verifies the copy fits the DMA-bandwidth time budget. The picker, its three overlap ratios, and the /100.0 bandwidth-idle integration step are detailed on msa-overview.md; see the per-version ratio defaults on msa-per-version-defaults.md.
CONFIDENCE: picker integration constant. The /100.0 (float at 0x84a2fb8) integration granularity in the picker Next was traced at the disassembly level; the C decompile renders the float arithmetic opaquely (the constant does not surface as a literal in pseudocode), so the constant value is HIGH, while the picker functions' existence and signatures are CONFIRMED by their demangled symbols.
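The per-candidate loop of the prefetch path can be sketched in Python with the picker, the heap query, and the resource check all supplied by the caller. The function name try_prefetch and the shape of the returned record are hypothetical. The order of candidate_starts stands in for the picker's sweep and is not modelled here. One detail goes beyond the pseudocode above and is an assumption: the sketch records FailOutOfMemory (0x01) when a candidate finds no chunk, so that an exhausted search never returns a zero mask.

```python
FAIL_OUT_OF_MEMORY = 0x01
FAIL_VIOLATES_ASYNC_COPY_RESOURCE = 0x20


def try_prefetch(candidate_starts, use_time, find_chunk, has_enough_resource):
    """Try each candidate copy-start time; return (result mask, copy or None).

    find_chunk(t_start, use_time) -> offset or None
    has_enough_resource(t_start, use_time) -> bool
    """
    result = 0
    for t_start in candidate_starts:
        offset = find_chunk(t_start, use_time)
        if offset is None:
            # Assumption: not shown in the pseudocode; keeps failure non-zero.
            result |= FAIL_OUT_OF_MEMORY
            continue
        if not has_enough_resource(t_start, use_time):
            result |= FAIL_VIOLATES_ASYNC_COPY_RESOURCE
            continue
        # First candidate that passes both checks wins; no backtracking.
        copy = {"copy_start": t_start, "copy_done": use_time, "offset": offset}
        return 0, copy
    return result, None
```

Because the first passing candidate wins, the quality of the placement depends entirely on the order in which the picker yields start times, which is why the sweep order matters more than the loop body.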
Evict (Stage 5)
Eviction is the inverse: a value that already has a VMEM (kAlternate) allocation may need to be copied back to HBM to make room. Evict (0x1dc7d660) is reached only when has_prior_alternate_alloc holds, and takes the force flag from AllocationRequest +0x208:
// Evict(req, force = req[+0x208])                 // 0x1dc7d660
prev = the prior kAlternate allocation for this value
emit CopyAllocation(kCopyStart in VMEM @ prev.end,
                    kCopyDone in HBM)              // copy buffer OUT to default memory
if not AsyncCopyResource.HasEnoughResource(...):
    return FailViolatesAsyncCopyResource           // 0x20 (or FailOutOfAsyncCopies 0x10)
return kSuccess
On evict failure the cascade does not fall through silently — it ORs FailRequiresUncommit (0x40) into the result and returns, signalling the driver to roll back any chunks this segment pended. This is the only place a primitive's failure is escalated to a forced driver uncommit rather than a fall-through to the next stage.
Stage 6 — default-memory fallback
If the value never had a prior VMEM allocation (so evict is not applicable) and prefetch failed, the buffer simply stays in HBM: build a PinnedAllocation in kDefault (0) over the use, attach the use, push it. The returned result may still carry residual fail bits from the prefetch attempt (and, on the fail path, an OR'd 0x40), which the driver interprets — a non-success result here means MSA could not improve on HBM placement for this segment, not that compilation failed.
Async-copy resource and sync replacement
Both prefetch and evict gate their copies through AsynchronousCopyResource, a per-logical-time free-resource bucket model (not a simple counter). HasEnoughResource(start, end, resource) (0x1dc78d40) walks the buckets over [start, end) checking the copy's cost can be drawn from the available DMA-bandwidth windows; ConsumeResource (0x1dc77540) debits them; AddCopy/RemoveCopy (0x1dc78580 / 0x1dc787c0) commit/undo. A copy that fits no bucket window yields FailViolatesAsyncCopyResource (0x20); exceeding the configured outstanding-copy cap yields FailOutOfAsyncCopies (0x10).
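A bucket model of this kind can be sketched in Python as one free-resource value per logical time. This is a heavily simplified illustration, not the decoded implementation: the class and method names are hypothetical, the greedy earliest-bucket-first debit is an assumption, and the spill/delay behaviour of ConsumeResource (which is not enumerated on this page) is left out entirely.

```python
class AsyncCopyResourceModel:
    """One free-resource bucket per logical time (simplified sketch)."""

    def __init__(self, initial_resources):
        self.free = list(initial_resources)

    def has_enough_resource(self, start, end, cost):
        # Walk the buckets over the half-open interval [start, end).
        return sum(self.free[start:end]) >= cost

    def consume_resource(self, start, end, cost):
        """Debit cost from [start, end); return False and change nothing if it cannot fit."""
        if not self.has_enough_resource(start, end, cost):
            return False
        for t in range(start, end):
            take = min(self.free[t], cost)   # greedy: drain earliest buckets first
            self.free[t] -= take
            cost -= take
            if cost == 0:
                break
        return True
```

The point of the sketch is that two copies can each be small enough on their own and still collide: once the first copy drains the buckets in a window, a second copy over the same window fails even though no copy-count limit is involved.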
The kCopy opcode (44, 0x2c) match reached from CreateAllocationRequest feeds the sync-copy-replacement path: an already-scheduled synchronous kCopy (or kSlice/kDynamicSlice, opcodes 119/54) can be converted into an MSA-managed async copy when alternate-memory benefit justifies it, gated by xla_msa_enable_sync_copy_replacement / _sync_slice_replacement and bounded by copies_limit_for_sync_mem_op_conversion. A failed conversion surfaces as FailSyncDataMoveReplacement (0x200). The resource-model loop body and the sync-replacement state machine are not fully decoded at the basic-block level (see Limits).
Worked example — one while-body activation buffer
Trace one while-body activation %act (size 2 MiB, defined by a matmul at logical t=100, consumed at t=160) through the cascade (the VMEM byte budget and word size are per-version — see msa-reservation-hbm-policy.md):
- Driver sort. The buffer-interval comparator ranks %act by memory_boundedness_score × size; the producing matmul is bandwidth-bound, so the score is high and %act is visited early (see msa-overview.md).
- Request. CreateAllocationRequest builds the AllocationRequest: +0x00 start = earliest_prefetch, +0x08 end = 160, +0x20 size = 2 MiB, +0x2c allow_prefetch = 1, +0x58 use = %act@160.
- Stage 3 fit check. AllocateInAlternateMemoryNoCopy asks the heap for a 2-MiB chunk free across [100, 160]. If one persists, a kAlternate PinnedAllocation is placed → kSuccess, done.
- Stage 4 prefetch. Otherwise picker.Begin(use=%act@160, earliest, latest); Next() sweeps candidate copy-start times; for each, FindBestChunkCandidate asks for a 2-MiB chunk over [t_start, 160] (only needs to be free from the copy's arrival), and AsyncCopyResource.HasEnoughResource(t_start, 160, cost) verifies the copy's DMA window. On success a CopyAllocation (kCopyStart @ t_start, kCopyDone @ 160) is emitted → kSuccess. Both chunks are aligned to the VMEM word.
- Stage 5 evict / Stage 6 fallback. If prefetch fails and an older kAlternate allocation exists, Evict copies an older buffer out; if that fails, result |= 0x40 and the driver rolls back pending chunks. If no prior VMEM allocation exists, %act stays in HBM (kDefault PinnedAllocation).
- Handoff. The chosen Chunk{offset, size} rides in the Allocation; after the driver's Process() it becomes a static program-memory offset that the on-device allocator pins and never relocates (see ../memory/on-device-compaction.md).
Limits
What this page does not pin down byte-exactly:
- Numeric ratio/cap defaults. The literal min/preferred/max_overlap_to_async_copy_ratio and max_outstanding_prefetches/evictions per TPU version live in the per-version TpuCompilationEnvironment proto field-default overlays, not in code; the flags are all wired (see msa-per-version-defaults.md). Same gap for inefficient_use_to_copy_ratio / memory_budget_ratio. (LOW for the numeric values; the flag names are CONFIRMED.)
- ConsumeResource spill/delay loop. Decoded to the call level; the inner loop that spreads a copy across later buckets and records the induced delay is not enumerated. (HIGH.)
- Sync-replacement state machine. The AsyncConversionResult value layout and the exact copies_limit_for_sync_mem_op_conversion enforcement point are not fully decoded. (HIGH.)
- Picker integration constant. See the CONFIDENCE callout in Prefetch. (HIGH for the 100.0 value.)
Cross-References
- msa-overview.md: the MSA driver (buffer-interval sort, colocation grouping, the autotuner IOR replay surface, and the prefetch interval picker / async-copy resource model in full).
- msa-per-version-defaults.md: overlap ratios and outstanding-copy caps per TPU generation; the 5-variant per-family flag override scheme.
- msa-reservation-hbm-policy.md: the VMEM byte budget, scoped reservations, and the HBM fallback policy this cascade feeds.
- layout-assignment.md: the pass that stamps memory_space and colors buffers, producing the required-assignment pins Stage 0 reconciles.
- compile-phases.md: where MSA sits in the compilation pipeline.
- ../memory/vmem-allocator.md: the VMEM best-fit allocator the no-copy and prefetch fit checks query.
- ../memory/on-device-compaction.md: how the static chunk offsets MSA assigns are pinned at runtime.