On-Device Compaction
All addresses, struct offsets, vtable slots, and magic constants on this page apply to
libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id md5 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00, clang trunk). The image is not stripped; demangled C++ symbol names are quoted verbatim. Other versions will differ.
Abstract
This page answers one question precisely, with binary evidence: does the libtpu runtime perform HBM defragmentation by relocating live buffers, and how? The answer is yes — and that is the non-obvious result, because the device allocator (tpu::BestFitAllocator, see hbm-allocator.md) is a classic non-moving boundary-tag allocator whose Allocate/Deallocate paths only coalesce adjacent free blocks. Live-buffer relocation lives in a single, separate, OOM-triggered method: tpu::BestFitAllocator::Compact (0x1e81c360, vtable slot vt+0x90).
The clean way to state the architecture is a three-way split of "where does each buffer sit in HBM":
- Compile-time placement — XLA's MSA + jellyfish ProgramMemoryAllocator freeze static offsets into the program (../compiler/msa-overview.md). The bulk of HBM occupancy is laid out once, at compile time, which is why the runtime so rarely needs to defragment — packing is solved before the program ever runs.
- Runtime free-list math (non-moving) — Allocate/Deallocate service the dynamic surface (user input/output PJRT buffers, transfer staging, async-copy scratch, the dynamic program stack) with best-fit search + eager bidirectional coalescing on free. This is documented on hbm-allocator.md; it never moves a live buffer.
- Runtime compaction (moving) — when even fully-coalesced free space cannot satisfy a request, Compact relocates movable live buffers toward the high end of the region to consolidate free space at the bottom, emitting a relocation plan (std::vector<tpu::TpuMemmoves::Memmove>) that a separate codegen path turns into on-device DMA programs.
So the page is a present/absent inventory. PRESENT: a full live-buffer relocation path (Compact → TpuMemmoves → DMA codegen → enqueue), an OOM-driven retry chain, a pinned-set that keeps MSA-static and aliased buffers immovable, and donation-driven reuse (a zero-copy alternative to allocation). ABSENT: any relocation inside Allocate/Deallocate (those coalesce only); any periodic/background defragmenter; any device-side arena/slab compaction; any growth-by-doubling. The distinction this page owns is precisely coalescing (in-place merge of free neighbours, every free) vs. compaction (relocation of live buffers, only on OOM).
| Property | Value |
|---|---|
| Compaction performed? | YES — live movable buffers are relocated to consolidate free space |
| Trigger | OOM only (allocation failure → retry); never periodic/background |
| Relocation engine | tpu::BestFitAllocator::Compact(flat_hash_set<long> pinned) @ 0x1e81c360 (vt+0x90) |
| Output | std::vector<tpu::TpuMemmoves::Memmove> — a relocation plan; bytes moved by a later codegen step |
| Byte movement | TpuCompactionIsaEmitterCodegen::Generate @ 0x1090ece0 (VMEM-staged DMA) → EnqueueCompactionImpl @ 0x1d12ed00 → CompactionRunner |
| Device driver entry | tpu::System::CompactMemory @ 0x1d0b6000 (shrink program stacks → EnqueueCompaction) |
| PJRT entry | TpuClient::DefragmentMemory @ 0xf7fd660 → EnqueueDefragmentMemory @ 0xf7fd180 |
| Retry policy | TpuClient::ShouldRetryOnOom @ 0xf8141a0 — ≤ 2 attempts; evict programs + defragment between tries |
| Immovability bridge | the pinned absl::flat_hash_set<long> — MSA-static and DMA/aliasing-held addresses never move |
| Relocation in Allocate/Deallocate? | NO — those paths only coalesce free blocks (verified absent) |
| Algorithm | one-pass, reverse (high→low) greedy bin-pack against a gtl::IntervalSet<long> occupancy oracle |
| Source label | learning/45eac/tpu/runtime/hal/internal/best_fit_allocator.cc (from ResourceExhaustedErrorBuilder site-strings) |
| Confidence | CONFIRMED (byte-anchored, full Compact body decompiled) unless a row or callout says otherwise |
Scope and Boundaries
This page owns the present/absent compaction inventory, the Compact relocation method, and the coalescing-vs-compaction distinction. Three adjacent concerns live on their own pages; this page links them rather than duplicating:
| Concern | Owner page |
|---|---|
| The non-moving free-list: best-fit search, the dual data structure, coalescing on free, split policy, alignment quantum | hbm-allocator.md |
| Compile-time HBM/VMEM placement (the static layout that obviates most runtime compaction) | ../compiler/msa-overview.md |
| Donation/aliasing as a reuse path (a donated input buffer becomes the output, avoiding an allocation entirely) | buffer-donation-aliasing.md |
NOTE — the coalescing/compaction split is the whole point. Coalescing (on hbm-allocator.md) is in-place: on every
Deallocate, the freed block is merged with its adjacent free neighbours by extending one map entry and erasing the other — no bytes move, no live buffer is touched. Compaction (this page) is the only path that moves a live buffer's bytes to a new HBM offset. They are complementary: coalescing keeps free space maximally contiguous cheaply; compaction is the heavyweight last resort when coalesced contiguity is still insufficient.
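To make the in-place nature of coalescing concrete, the following is a minimal sketch and not the libtpu code. It assumes a toy free list held in a `std::map` keyed by begin offset; `ToyFreeList` and `Free` are hypothetical names, and the real allocator's dual data structure is described on hbm-allocator.md. The point it illustrates is that a free only edits map entries: nothing is copied and no live block is touched.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Toy model (hypothetical, not libtpu's layout): free blocks keyed by begin
// offset, value = end offset. Intervals are half-open [begin, end).
struct ToyFreeList {
  std::map<int64_t, int64_t> free_blocks;

  // Return [begin, end) to the free list, merging with adjacent free
  // neighbours. Only map entries change; no bytes move.
  void Free(int64_t begin, int64_t end) {
    assert(begin < end);
    auto next = free_blocks.lower_bound(begin);
    // Merge with the lower neighbour if it ends exactly where we begin.
    if (next != free_blocks.begin()) {
      auto prev = std::prev(next);
      if (prev->second == begin) {
        begin = prev->first;
        free_blocks.erase(prev);  // erasing prev leaves `next` valid
      }
    }
    // Merge with the upper neighbour if it begins exactly where we end.
    if (next != free_blocks.end() && next->first == end) {
      end = next->second;
      free_blocks.erase(next);
    }
    free_blocks[begin] = end;
  }
};
```

This toy erases and reinserts entries for simplicity, where the text above describes extending one entry and erasing the other; the observable result (one merged free block) is the same.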
What Is Present vs. Absent
The headline inventory, each line byte-verified in the decompile (see Evidence):
| Mechanism | Present? | Evidence |
|---|---|---|
| Live-buffer relocation (Compact) | PRESENT | BestFitAllocator::Compact @ 0x1e81c360 emits vector<TpuMemmoves::Memmove> and rewrites the block map + free tree |
| Relocation plan decoupled from byte movement | PRESENT | Compact returns the move list; bytes moved later by Generate codegen |
| OOM-triggered defrag-and-retry | PRESENT | System::CompactMemory 0x1d0b6000, TpuClient::DefragmentMemory 0xf7fd660, ShouldRetryOnOom 0xf8141a0 |
| Pinned set keeping static/aliased buffers immovable | PRESENT | Compact takes flat_hash_set<long> pinned; CRC32 SwissTable probe per block |
| Program-stack shrink before compaction | PRESENT | System::CompactMemory calls TpuProgramStack::MaybeShrink 0x1db0c100 |
| Program eviction to free HBM between retries | PRESENT | EvictLoadedPrograms 0xf80d2c0, UnloadAllProgramsForCore (in retry lambda) |
| Donation-driven reuse (allocation avoidance) | PRESENT | AllocateOutputBuffersWithInputReuse 0xf7ba9a0 (see buffer-donation-aliasing.md) |
| Relocation inside Allocate | ABSENT | Allocate 0x1e817820 has no Compact/Memmove reference; only SplitBlock |
| Relocation inside Deallocate (free-time defrag) | ABSENT | Deallocate 0x1e819dc0 only calls MergeBlock/free-tree ops — coalescing, not moving |
| Periodic / background defragmenter | ABSENT | no timer/thread driving Compact; sole driver is the OOM retry path |
| Multi-pass fixpoint compaction within one Compact call | ABSENT | single reverse sweep; a block failing IsDisjoint is left in place |
| Device-side arena/slab compaction | ABSENT | one fixed-region BestFitAllocator per (core, tier); no arenas — see hbm-allocator.md |
| Growth-by-doubling on the device | ABSENT | region is a fixed [base, end); Expand/Shrink only adjust stack bounds |
WARNING — this is not the negative result one might expect. A naive read of "BestFitAllocator is a non-moving boundary-tag allocator" would conclude libtpu cannot defragment. That read is wrong. The allocator's free-list methods never relocate, but the allocator class also exposes a separate Compact method that does — and the runtime calls it on OOM. The honest summary is: non-moving for the common case, moving-as-last-resort on exhaustion.
The Relocation Method: Compact
BestFitAllocator::Compact (0x1e81c360, vt+0x90) is the only method in the allocator that produces buffer relocations. Its signature (demangled from the binary symbol) is:
// 0x1e81c360 — returns the relocation plan by value (NRVO into a1).
std::vector<tpu::TpuMemmoves::Memmove>
tpu::BestFitAllocator::Compact(const absl::flat_hash_set<long>& pinned);
It is a one-pass, reverse (high-address-to-low) greedy bin-pack with a pinned exclusion set. The decompiled body (~3.9 kB) confirms five phases:
// Reconstructed from the 0x1e81c360 decompile (byte-confirmed structure).
std::vector<TpuMemmoves::Memmove>
Compact(const flat_hash_set<int64_t>& pinned) {
  std::vector<TpuMemmoves::Memmove> moves; // RETURNED plan (begins empty)
  // (1) COLLECT every allocated block from the boundary-tag map.
  std::vector<LiveBlock> live; // LiveBlock = {begin,end,state} (48 B)
  for (HashTableEntry& e : blocks_by_offset_) // walk SwissTable slots
    if (e.state != kFree) // *((_DWORD*)slot+2) != 0
      live.push_back({e.offset, e.offset + e.size, e.state});
  // (2) SORT the live blocks (introsort, by offset).
  __introsort(live.begin(), live.end()); // 0x1e81e260
  // (3) REVERSE SWEEP: pack movable blocks against the top, against an occupancy oracle.
  gtl::IntervalSet<int64_t> occupied; // absl btree of intervals
  int64_t top = allocatable_range_end_; // this+0x70
  std::vector<LiveBlock> kept;
  for (LiveBlock& b : reverse(live)) { // high -> low
    if (b.state == kReserved) { kept.push_back(b); continue; } // reserved -> immovable
    int64_t addr = b.begin + base_offset_in_bytes_; // this+0x58
    if (pinned.contains(addr)) { // CRC32 H2 SwissTable probe
      kept.push_back(b);
      occupied.Add({b.begin, b.end}); // 0x1e824ae0 (AddImpl)
      continue; // PINNED -> never relocated
    }
    int64_t size = b.end - b.begin;
    int64_t cand_lo = top - size, cand_hi = top; // candidate placement against the top
    if (occupied.IsDisjoint({cand_lo, cand_hi})) { // 0x1cc99740 — placement is free
      moves.push_back(Memmove{ b.begin / granule_, // +0 src, +8 dst, +0x10 size (GRANULE units)
                               cand_lo / granule_,
                               size / granule_ });
      kept.push_back({cand_lo, cand_hi, b.state});
      occupied.Add({cand_lo, cand_hi});
      top = cand_lo;
    } else {
      kept.push_back(b); // overlap -> leave in place (single pass)
      occupied.Add({b.begin, b.end});
    }
  }
  // (4) REWRITE bookkeeping to the post-compaction layout, atomically in-call.
  ClearBackingArray(blocks_by_offset_); // wipe the SwissTable
  free_tree_.clear(); // __tree_deleter over the RB-tree
  __introsort(kept.begin(), kept.end());
  int64_t cursor = 0;
  for (LiveBlock& b : kept) {
    if (b.begin > cursor) { // gap below -> synthesize a free block
      blocks_by_offset_.try_emplace(cursor, {cursor, kFree, b.begin - cursor});
      free_tree_.insert(FreeBlock{cursor, b.begin}); // find_equal + balance_after_insert
    }
    blocks_by_offset_.try_emplace(b.begin, {b.begin, b.state, b.end - b.begin});
    cursor = b.end;
  }
  if (cursor < allocatable_range_end_) { // trailing free block
    blocks_by_offset_.try_emplace(cursor, {cursor, kFree, allocatable_range_end_ - cursor});
    free_tree_.insert(FreeBlock{cursor, allocatable_range_end_});
  }
  // (5) RECOMPUTE the derived watermarks via a GetBlockIf probe at the reserved offset.
  capacity_in_bytes_ = ...; // this+0xB8 updated from GetBlockIf result
  return moves; // caller turns the plan into DMA programs
}
Why the plan is returned, not executed
Compact does not move a single byte. It returns a std::vector<TpuMemmoves::Memmove> — a list of {src, dst, size} triples (the field order of the struct below), all three fields in granule units (the byte values divided by granule_in_bytes_, this+0x78). The byte movement is performed by a separate pipeline (next section). This decoupling matters: the allocator's bookkeeping is updated synchronously inside Compact (so live tpu::TpuBuffer handles resolve to the new offsets immediately, because they resolve their device address through the allocator's current map), while the physical DMA is enqueued asynchronously on the device.
struct tpu::TpuMemmoves::Memmove { // 24 B, GRANULE units
  int64_t src; // +0 source offset / granule (the block's current begin)
  int64_t dst; // +8 destination offset / granule (the new top placement)
  int64_t size; // +0x10 size / granule
};
struct Compact::LiveBlock { // 48 B local helper
  int64_t begin; // +0
  int64_t end; // +8
  int32_t state; // +0x10
  // (trailing bytes used as a small-string/scratch slot in the decompile)
};
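The byte-to-granule conversion can be sketched as follows. This is an illustration under stated assumptions, not the recovered code: `MakeMemmove` and `SizeInBytes` are hypothetical helpers, the granule is passed in as a parameter rather than read from the allocator, and the value 1024 used in the usage line below is only an example figure.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the 24-byte record described above; field order src, dst, size.
struct Memmove {
  int64_t src;   // source offset, in granules
  int64_t dst;   // destination offset, in granules
  int64_t size;  // length, in granules
};
static_assert(sizeof(Memmove) == 24, "three packed 64-bit fields");

// Hypothetical helper: build a plan entry from byte quantities. Assumes all
// three byte values are already multiples of the granule.
inline Memmove MakeMemmove(int64_t src_bytes, int64_t dst_bytes,
                           int64_t size_bytes, int64_t granule_bytes) {
  assert(granule_bytes > 0);
  assert(src_bytes % granule_bytes == 0);
  assert(dst_bytes % granule_bytes == 0);
  assert(size_bytes % granule_bytes == 0);
  return Memmove{src_bytes / granule_bytes, dst_bytes / granule_bytes,
                 size_bytes / granule_bytes};
}

// Inverse direction, as a consumer of the plan would need it.
inline int64_t SizeInBytes(const Memmove& m, int64_t granule_bytes) {
  return m.size * granule_bytes;
}
```

For example, `MakeMemmove(4096, 1048576, 8192, 1024)` yields a record with `src == 4`, `dst == 1024`, `size == 8`. Keeping the plan in granule units means a consumer never has to re-check alignment of individual entries.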
The pinned set is the MSA / aliasing immovability bridge
The single argument to Compact is const absl::flat_hash_set<long>& pinned — buffer addresses that must not move. In the decompiled reverse sweep, each candidate block's absolute address (b.begin + base_offset_in_bytes_) is looked up in this set via a CRC32-keyed Abseil SwissTable probe (_mm_crc32_u64 + vpcmpeqb group scan, lines confirming the H2 metadata match). A hit means the block is appended to the kept list and its interval marked occupied — it is never assigned a Memmove.
This is the precise mechanism that enforces "MSA owns the static layout; the runtime owns the dynamic layout in the remaining space":
- MSA-static buffers (placed at compile time, replayed at load by ProgramMemoryAllocator::CreateFromProto — see ../compiler/msa-overview.md) are passed in pinned and stay put.
- Buffers with an outstanding DMA or aliasing hold (donated inputs reused as outputs, buffers mid-transfer — see buffer-donation-aliasing.md) are also pinned, because relocating bytes underneath an in-flight DMA would corrupt them.
- Only the remaining dynamic, movable PJRT buffers receive Memmove records.
NOTE — kReserved is a second, state-level immovability gate. Independently of the pinned set, any block whose state == kReserved (e.g. the bottom-of-memory reservation set by ReserveBottomOfMemory, see hbm-allocator.md) is skipped by the sweep entirely. So a buffer is immovable if it is either reserved-state or address-pinned.
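The two immovability gates combine into a single predicate, sketched below. This is an illustration only: `BlockState`, `Block`, and `IsMovable` are hypothetical names, `std::unordered_set` stands in for the Abseil hash set, and the numeric values of the non-free states are assumptions (only "free is zero" is suggested by the decompile comment above).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Only kFree == 0 is suggested by the decompile; the other values are
// placeholders chosen for this sketch.
enum class BlockState : int32_t { kFree = 0, kAllocated = 1, kReserved = 2 };

struct Block {
  int64_t begin;  // region-relative byte offset
  int64_t end;
  BlockState state;
};

// A block may be relocated only if it passes both gates: it is not in the
// reserved state, and its absolute address is not in the pinned set.
inline bool IsMovable(const Block& b, int64_t base_offset_bytes,
                      const std::unordered_set<int64_t>& pinned) {
  if (b.state == BlockState::kFree) return false;      // nothing to move
  if (b.state == BlockState::kReserved) return false;  // state-level gate
  return pinned.count(b.begin + base_offset_bytes) == 0;  // address-level gate
}
```

Note that the pinned lookup uses the absolute address (`begin + base offset`), matching the description above, while the block itself is tracked by region-relative offset.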
The reverse-sweep packing, in words
Allocations are placed at the top of free blocks by Allocate (top-down placement, see hbm-allocator.md), so live buffers naturally cluster against the high end. Compact mirrors that: it walks live blocks from high address to low, and for each movable block tries to place it as high as possible (top - size) against the running IntervalSet of already-placed/pinned intervals. If that candidate window is disjoint from everything occupied, the block relocates there and top drops; otherwise the block is left where it is. The net effect is that movable buffers slide upward, toward the high end of the region, and free space consolidates into a contiguous run at the bottom — which is exactly the run a subsequent best-fit Allocate can then satisfy.
GOTCHA — single pass, not a fixpoint. Each Compact call is one reverse sweep. A block that fails the IsDisjoint test (because a pinned block sits in its candidate window) is left in place and not retried within the same call. Multi-pass behaviour, if it occurs, comes from the OOM retry loop invoking the whole defrag chain again (bounded at 2 attempts). This was observed but not exhaustively traced — marked HIGH.
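The reverse sweep described in this section can be exercised with a small runnable toy. It is a sketch under simplifying assumptions, not the recovered implementation: offsets are plain bytes (no granule division), the base offset is taken as zero so the pinned set is keyed by a block's own begin, a linear scan replaces the interval set, and `ToyCompact`, `Live`, `Move`, and `Interval` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Interval { int64_t lo, hi; };  // half-open [lo, hi)
struct Live { int64_t begin, end; bool reserved; };
struct Move { int64_t src, dst, size; };  // byte units in this toy

// Linear-scan stand-in for the interval-set occupancy oracle.
static bool IsDisjoint(const std::vector<Interval>& occ, Interval c) {
  for (const Interval& o : occ)
    if (c.lo < o.hi && o.lo < c.hi) return false;
  return true;
}

// One reverse sweep. Returns the move plan and rewrites `live` in place to
// the post-compaction layout, sorted by begin.
std::vector<Move> ToyCompact(std::vector<Live>& live, int64_t range_end,
                             const std::unordered_set<int64_t>& pinned) {
  auto by_begin = [](const Live& a, const Live& b) { return a.begin < b.begin; };
  std::sort(live.begin(), live.end(), by_begin);
  std::vector<Move> moves;
  std::vector<Interval> occupied;
  std::vector<Live> kept;
  int64_t top = range_end;
  for (auto it = live.rbegin(); it != live.rend(); ++it) {  // high -> low
    const Live& b = *it;
    if (b.reserved) { kept.push_back(b); continue; }  // state-level gate
    const int64_t size = b.end - b.begin;
    const Interval cand{top - size, top};
    if (pinned.count(b.begin) == 0 && IsDisjoint(occupied, cand)) {
      moves.push_back({b.begin, cand.lo, size});
      kept.push_back({cand.lo, cand.hi, false});
      occupied.push_back(cand);
      top = cand.lo;
    } else {  // pinned, or candidate window blocked: stays where it is
      kept.push_back(b);
      occupied.push_back({b.begin, b.end});
    }
  }
  std::sort(kept.begin(), kept.end(), by_begin);
  live = kept;
  return moves;
}
```

With live blocks at [0,10), [20,30) (pinned), and [40,50) in a 100-byte region, the sweep moves [40,50) to [90,100) and [0,10) to [80,90) while the pinned block stays. With a pinned block at [90,100) and a movable block at [0,20), the candidate window [80,100) is blocked and the plan is empty, which is the single-pass limitation noted above. A block that already sits at its candidate position produces a move with `src == dst` in this toy; whether the real code filters such entries is not shown in the reconstruction.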
Byte Movement: From Plan to On-Device DMA
Compact produces the plan; three downstream functions turn it into executed DMA:
- TpuSharedMemoryCommonImpl::GenerateAndValidateCompactionPrograms(const TpuMemmoves&) (0x1d130ec0) validates the move set (its CHECK strings confirm compaction_buffer_base_ != nullptr, compaction_codegen() != nullptr, and memmoves.type() == compaction_buffer().location().type()) and drives codegen.
- TpuCompactionIsaEmitterCodegen::Generate(TpuSharedMemoryType, Span<TpuMemmoves::Memmove const>) (0x1090ece0) codegens the actual DMA. The decompile shows it reading Target::VmemSizeBytes / Target::VmemWordSizeBytes and building a vector<Transaction> — i.e. it stages moves through VMEM (HBM→VMEM→HBM) in chunks bounded by VMEM capacity, merging adjacent moves into batched transactions. This is why compaction is genuinely a physical relocation, not a remapping trick: there is no HBM address-translation layer to retarget, so the bytes are copied.
- TpuSharedMemoryDriverCommonImpl::EnqueueCompactionImpl(const TpuMemmoves&, AnyInvocable<void(Status const&)>) (0x1d12ed00) constructs a CompactionRunner (CompactionRunner::Create) and enqueues the generated programs on the core's command stream, invoking the completion callback when the DMA finishes.
The stream-executor surface exposes the same capability via TpuExecutor::EnqueueCompactionOnStreamForHbm (0xe997400 / interface 0x1d0eff00), and TpuNodeContext::CompactionSupported (0xeaca440) gates whether a given chip supports it at all.
BestFitAllocator::Compact(pinned) // build the plan + rewrite map/tree
│ std::vector<TpuMemmoves::Memmove>
▼
TpuSharedMemory::EnqueueCompaction (0x1d4bcde0) // per-core entry
▼
GenerateAndValidateCompactionPrograms (0x1d130ec0) // validate the move set
▼
TpuCompactionIsaEmitterCodegen::Generate (0x1090ece0) // codegen VMEM-staged DMA transactions
▼
EnqueueCompactionImpl (0x1d12ed00) -> CompactionRunner // enqueue on the device command stream
| Function | Address | Role |
|---|---|---|
| BestFitAllocator::Compact | 0x1e81c360 | collect → sort → IntervalSet pack → emit TpuMemmoves → rewrite map/tree |
| TpuSharedMemory::EnqueueCompaction | 0x1d4bcde0 | per-core compaction entry (TpuCompactionConfig + callback) |
| GenerateAndValidateCompactionPrograms | 0x1d130ec0 | validate the move set, drive codegen |
| TpuCompactionIsaEmitterCodegen::Generate | 0x1090ece0 | codegen merged VMEM↔HBM staged DMA transactions |
| EnqueueCompactionImpl | 0x1d12ed00 | build CompactionRunner, enqueue on the command stream |
| gtl::IntervalSet<long>::AddImpl | 0x1e824ae0 | mark an interval occupied during the pack |
| gtl::IntervalSet<long>::IsDisjoint | 0x1cc99740 | test a candidate placement window |
| __introsort (Compact LiveBlock) | 0x1e81e260 | sort live blocks by offset |
| TpuExecutor::EnqueueCompactionOnStreamForHbm | 0xe997400 | stream-executor compaction surface |
| TpuNodeContext::CompactionSupported | 0xeaca440 | per-chip capability gate |
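The staging idea attributed to Generate above (merge adjacent moves, then copy in pieces bounded by a staging capacity) can be sketched in isolation. Everything here is an assumption made for illustration: `MergeAdjacent`, `StageThroughBuffer`, and `Chunk` are hypothetical names, the capacity is an arbitrary parameter rather than a real VMEM size, and the sketch ignores the ordering constraints that arise when a move's source and destination ranges overlap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Move { int64_t src, dst, size; };
struct Chunk { int64_t src, dst, size; };  // one source -> staging -> dest trip

// Merge consecutive moves that are contiguous in both source and destination.
std::vector<Move> MergeAdjacent(const std::vector<Move>& in) {
  std::vector<Move> out;
  for (const Move& m : in) {
    if (!out.empty() && out.back().src + out.back().size == m.src &&
        out.back().dst + out.back().size == m.dst) {
      out.back().size += m.size;  // extend the previous move
    } else {
      out.push_back(m);
    }
  }
  return out;
}

// Split each merged move into pieces that fit the staging buffer.
std::vector<Chunk> StageThroughBuffer(const std::vector<Move>& moves,
                                      int64_t staging_capacity) {
  assert(staging_capacity > 0);
  std::vector<Chunk> chunks;
  for (const Move& m : MergeAdjacent(moves)) {
    for (int64_t done = 0; done < m.size; done += staging_capacity) {
      const int64_t n = std::min(staging_capacity, m.size - done);
      chunks.push_back({m.src + done, m.dst + done, n});
    }
  }
  return chunks;
}
```

Merging before splitting keeps the chunk count low: two 10-unit moves that are contiguous on both sides become one 20-unit move, which a capacity of 8 splits into three chunks (8, 8, 4) instead of four (8, 2, 8, 2).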
What Triggers Compaction: the OOM Retry Chain
Nothing runs Compact proactively. The sole driver is an allocation failure. When BestFitAllocator::Allocate (0x1e817820) cannot find a fitting free block, it returns the ResourceExhausted diagnostic "Attempting to allocate <size>. That was not possible. There are <free> free. The largest contiguous region of free memory is <largest> due to fragmentation." (site string at best_fit_allocator.cc:129). (The superficially similar "…at the bottom of memory…" wording is a different string emitted by ReserveBottomOfMemory @ 0x1e81b0c0, not by this Allocate OOM path — see hbm-allocator.md.) That error propagates up the four-layer allocator stack and lands in the retry chain:
// tpu::System::CompactMemory (0x1d0b6000): the device-side defrag driver.
TpuEvent CompactMemory(const TpuSharedMemoryLocation& loc) {
  auto* core = ResolveCore(loc);
  if (!core) // best_fit_allocator path needs a core
    return MakeError<Internal>( // MakeErrorImpl<13> (status code 13 = INTERNAL), system.cc:2410
        "No attached TPU to compact.");
  int seg = TpuSharedMemoryTypeToTpuSegmentMemoryType(loc.type()); // 0..2
  for (TpuProgramStack* stack : core->program_stacks())
    stack->MaybeShrink(stack->segment_limit(seg)); // 0x1db0c100 — reclaim dynamic-stack HBM first
  return shm(loc)->EnqueueCompaction(/* -> BestFitAllocator::Compact(pinned) */);
}
// xla::TpuClient::ShouldRetryOnOom (0xf8141a0): the bound + recovery actions.
bool ShouldRetryOnOom(int attempt, PjRtDevice* dev, PjRtLoadedExecutable* exe, Status s) {
  if (attempt > 1) return attempt < 2; // <= 2 total attempts
  // tpu_pjrt_client.cc:4441 — verbatim LOG prefix
  LOG(INFO) << "TpuLoadedExecutable::ExecutePrepareWithOomRetries "
               "attempting to defragment and retry after seeing error: " << s;
  if (this->evict_programs_on_oom_ /*+0x67*/)
    for (auto& [id, ls] : loaded_executables_)
      if (ls != exe) ls->EvictLoadedPrograms(); // 0xf80d2c0 — free program HBM
  if (this->defragment_on_oom_ /*+0x69*/)
    DefragmentMemory(dev->core()->LocalSharedMemory(kHbm)); // 0xf7fd660 -> CompactMemory
  return attempt < 2;
}
The async leaf, tfrt::tpu::AllocateTpuBufferWithRetry (0xf7ec6a0), is dependency-gated rather than spin-looping: if the System AsyncValueRef is still pending it enqueues a waiter carrying a retry continuation ($_0 @ 0xf7ed620) that also calls TpuCompilationCache::UnloadAllProgramsForCore to free program HBM, then re-runs System::Allocate when the dependency resolves. So the recovery sequence on OOM is, in order: (a) shrink dynamic program stacks; (b) evict/unload loaded programs to reclaim their static HBM; (c) run Compact to relocate movable buffers and consolidate the remainder; (d) retry the allocation — at most twice.
NOTE — two config bools gate recovery. TpuClient+0x67 toggles program eviction and TpuClient+0x69 toggles defragmentation on OOM; either can be disabled independently. Their exact PjRtTpuClientConfig key names were not back-traced (marked in the open items). The megascale-aware variant CommonPjRtClient::ShouldRetryOnOom (0xe6edc80) additionally consults the DeviceAssignment so a pod slice can coordinate retry across devices.
| Function | Address | Role |
|---|---|---|
| BestFitAllocator::Allocate (OOM site) | 0x1e817820 | emits the ResourceExhausted "…due to fragmentation" leaf error |
| tpu::System::CompactMemory | 0x1d0b6000 | shrink program stacks → EnqueueCompaction; MakeErrorImpl<13> on bad core |
| TpuProgramStack::MaybeShrink | 0x1db0c100 | reclaim dynamic-stack HBM before compacting |
| TpuClient::DefragmentMemory | 0xf7fd660 | PJRT defrag entry → EnqueueDefragmentMemory → System::CompactMemory |
| TpuClient::EnqueueDefragmentMemory | 0xf7fd180 | enqueue the defrag work item |
| TpuClient::ShouldRetryOnOom | 0xf8141a0 | ≤ 2 attempts; evict programs + defragment |
| CommonPjRtClient::ShouldRetryOnOom | 0xe6edc80 | megascale/pod-aware retry coordination |
| tfrt::tpu::AllocateTpuBufferWithRetry | 0xf7ec6a0 | async dependency-gated retry; UnloadAllProgramsForCore |
| TpuExecutableLoadState::EvictLoadedPrograms | 0xf80d2c0 | free program HBM between retries |
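The shape of the bounded retry policy can be sketched as a small synchronous loop. This is a simplification for illustration: the real path is asynchronous and dependency-gated as described above, and `OomPolicy`, `AllocateWithOomRetries`, and the callback parameters are hypothetical names. The two booleans model the independent eviction and defragmentation toggles, and the bound of two attempts follows the table.

```cpp
#include <cassert>
#include <functional>

// Hypothetical knobs; the field names are illustrative only.
struct OomPolicy {
  bool evict_programs_on_oom = true;  // models the eviction toggle
  bool defragment_on_oom = true;      // models the defragmentation toggle
  int max_attempts = 2;               // bound on total attempts
};

// Runs `try_allocate` up to policy.max_attempts times. Between a failed
// attempt and the next one, runs the enabled recovery actions.
// Returns the 1-based attempt that succeeded, or 0 if every attempt failed.
int AllocateWithOomRetries(const OomPolicy& policy,
                           const std::function<bool()>& try_allocate,
                           const std::function<void()>& evict_programs,
                           const std::function<void()>& defragment) {
  for (int attempt = 1; attempt <= policy.max_attempts; ++attempt) {
    if (try_allocate()) return attempt;
    if (attempt == policy.max_attempts) break;  // no recovery after last try
    if (policy.evict_programs_on_oom) evict_programs();
    if (policy.defragment_on_oom) defragment();
  }
  return 0;
}
```

Bounding the loop matters because both recovery actions are expensive: if eviction and one compaction pass do not free a large enough run, repeating them is unlikely to help and the error is surfaced to the caller instead.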
Why Runtime Compaction Is Rarely Needed: the MSA Story
The reason a TPU program can run a 745 MB-binary's worth of kernels and almost never hit Compact is that the hard packing problem is solved before the program runs. XLA's MemorySpaceAssignment (MsaAlgorithm) plus jellyfish ProgramMemoryAllocator assign every static intra-program buffer a memory space and a frozen byte offset at compile time, serialized into ProgramMemoryMetadata inside the compiled program. At load time, ProgramMemoryAllocator::CreateFromProto (0x1c631f20) replays those offsets, and the runtime BestFitAllocator is told where each static buffer goes — it does not run a best-fit search for them, and it never needs to relocate them (they are pinned). See ../compiler/msa-overview.md.
This is the key architectural insight for a reimplementer: compile-time packing is the primary anti-fragmentation strategy; runtime compaction is the safety net. The runtime free-list math (and any compaction) is exercised only against the dynamic surface in the HBM left over after MSA's static layout — user input/output PJRT buffers, transfer staging, async-copy scratch, the dynamic program stack. Because those are typically large, few, and short-lived relative to the static program footprint, the runtime mostly gets by on best-fit + coalescing alone, and Compact fires only under genuine pressure.
NOTE — donation is the other allocation-avoidance mechanism. Before any dynamic allocation, AllocateOutputBuffersWithInputReuse (0xf7ba9a0) consults the compiler-emitted HloInputOutputAliasConfig and, for aliased outputs, reuses the donated input buffer in place — no new HBM, no relocation. This is reuse, not compaction, but it has the same effect of reducing the dynamic allocation pressure that would otherwise drive fragmentation. See buffer-donation-aliasing.md.
Coalescing vs. Compaction: the Precise Distinction
To close the inventory, the two free-space-recovery mechanisms side by side. This is the central reason these are two pages, not one.
| Property | Coalescing (hbm-allocator.md) | Compaction (this page) |
|---|---|---|
| Method | inside Deallocate 0x1e819dc0 (via MergeBlock 0x1e819700) | Compact 0x1e81c360 |
| When | every free, eager, immediate | only on OOM, after coalescing already failed to help |
| Moves live bytes? | No — merges adjacent free blocks only | Yes — relocates movable live buffers via DMA |
| Mechanism | extend one map entry, erase the neighbour entry | emit TpuMemmoves plan → VMEM-staged DMA |
| Touches a live TpuBuffer? | never | yes (movable, unpinned ones) |
| Cost | O(1) amortised neighbour lookup + O(log n) tree edit | O(n log n) collect+sort + interval-set pack + DMA + full map/tree rebuild |
| Recovers | adjacency-induced free fragments | non-adjacent free fragments (external fragmentation) |
| Failure mode it addresses | small free gaps next to a freed block | "largest contiguous region … due to fragmentation" OOM |
Coalescing guarantees the free tree never holds two physically adjacent free blocks (see hbm-allocator.md); that keeps best-fit honest and recovers all adjacency-recoverable space for free. What coalescing cannot do is merge free space separated by a live buffer — that is precisely the gap compaction closes, by moving the intervening live buffer out of the way. The OOM diagnostic's "largest contiguous region of free memory is <X> due to fragmentation" is the symptom of exactly this situation: total free bytes are sufficient, but no single contiguous run is, because live buffers sit between the gaps. Compact slides the movable ones aside to coalesce those gaps into one run.
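The "enough total free, not enough contiguous" condition behind that diagnostic is easy to compute. The sketch below is illustrative only: `FreeStats` and `ComputeFreeStats` are hypothetical names, and the layout is modelled as a sorted list of live blocks in a region starting at offset 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Block { int64_t begin, end; };  // live block, half-open [begin, end)

struct FreeStats {
  int64_t total_free = 0;          // sum of all gaps
  int64_t largest_contiguous = 0;  // biggest single gap
};

// `live` must be sorted by begin and non-overlapping within [0, range_end).
FreeStats ComputeFreeStats(const std::vector<Block>& live, int64_t range_end) {
  FreeStats s;
  int64_t cursor = 0;
  auto account = [&](int64_t gap) {
    if (gap <= 0) return;
    s.total_free += gap;
    s.largest_contiguous = std::max(s.largest_contiguous, gap);
  };
  for (const Block& b : live) {
    account(b.begin - cursor);  // gap below this live block
    cursor = b.end;
  }
  account(range_end - cursor);  // trailing gap
  return s;
}
```

With live blocks at [20,30), [50,60), and [80,90) in a 100-unit region, total free space is 70 but the largest run is 20, so a request for 40 fails even though coalescing has nothing left to merge. After the same three blocks are packed against the top at [70,100), the total is still 70 and the largest run is also 70.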
Cross-References
- hbm-allocator.md — the non-moving free-list: best-fit search, the dual data structure, coalescing on free (the in-place counterpart to this page's relocation), split policy, alignment quantum
- ../compiler/msa-overview.md — the compile-time placement pass that freezes static offsets and is the primary reason runtime compaction is rarely needed
- buffer-donation-aliasing.md — donation/aliasing as a zero-copy reuse path, and the source of the DMA/aliasing holds that make a buffer pinned during compaction
- tpu-buffer-layout.md — how a device buffer's HBM offset maps to its on-device tile layout (the bytes that Compact actually relocates)
- overview.md — the five on-chip memory tiers and where the HBM allocator sits in the map
- hbm-dma-alignment.md — the 1024 B DMA quantum that keeps every relocated offset DMA-legal
- vmem-allocator.md, cmem-pool.md — sibling tiers using the same BestFitAllocator class (and therefore the same Compact)