Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

On-Device Compaction

All addresses, struct offsets, vtable slots, and magic constants on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build-id md5 89edbbe81c5b328a958fe628a9f2207d, build libtpu_lts_20260413_b_RC00, clang trunk). The image is not stripped; demangled C++ symbol names are quoted verbatim. Other versions will differ.

Abstract

This page answers one question precisely, with binary evidence: does the libtpu runtime perform HBM defragmentation by relocating live buffers, and how? The answer is yes — and that is the non-obvious result, because the device allocator (tpu::BestFitAllocator, see hbm-allocator.md) is a classic non-moving boundary-tag allocator whose Allocate/Deallocate paths only coalesce adjacent free blocks. Live-buffer relocation lives in a single, separate, OOM-triggered method: tpu::BestFitAllocator::Compact (0x1e81c360, vtable slot vt+0x90).

The clean way to state the architecture is a three-way split of "where does each buffer sit in HBM":

  1. Compile-time placement — XLA's MSA + jellyfish ProgramMemoryAllocator freeze static offsets into the program (../compiler/msa-overview.md). The bulk of HBM occupancy is laid out once, at compile time, which is why the runtime so rarely needs to defragment — packing is solved before the program ever runs.
  2. Runtime free-list math (non-moving)Allocate/Deallocate service the dynamic surface (user input/output PJRT buffers, transfer staging, async-copy scratch, the dynamic program stack) with best-fit search + eager bidirectional coalescing on free. This is documented on hbm-allocator.md; it never moves a live buffer.
  3. Runtime compaction (moving) — when even fully-coalesced free space cannot satisfy a request, Compact relocates movable live buffers downward to consolidate free space, emitting a relocation plan (std::vector<tpu::TpuMemmoves::Memmove>) that a separate codegen path turns into on-device DMA programs.

So the page is a present/absent inventory. PRESENT: a full live-buffer relocation path (CompactTpuMemmoves → DMA codegen → enqueue), an OOM-driven retry chain, a pinned-set that keeps MSA-static and aliased buffers immovable, and donation-driven reuse (a zero-copy alternative to allocation). ABSENT: any relocation inside Allocate/Deallocate (those coalesce only); any periodic/background defragmenter; any device-side arena/slab compaction; any growth-by-doubling. The distinction this page owns is precisely coalescing (in-place merge of free neighbours, every free) vs. compaction (relocation of live buffers, only on OOM).

Compaction performed?YES — live movable buffers are relocated to consolidate free space
TriggerOOM only (allocation failure → retry); never periodic/background
Relocation enginetpu::BestFitAllocator::Compact(flat_hash_set<long> pinned) @ 0x1e81c360 (vt+0x90)
Outputstd::vector<tpu::TpuMemmoves::Memmove> — a relocation plan; bytes moved by a later codegen step
Byte movementTpuCompactionIsaEmitterCodegen::Generate @ 0x1090ece0 (VMEM-staged DMA) → EnqueueCompactionImpl @ 0x1d12ed00CompactionRunner
Device driver entrytpu::System::CompactMemory @ 0x1d0b6000 (shrink program stacks → EnqueueCompaction)
PJRT entryTpuClient::DefragmentMemory @ 0xf7fd660EnqueueDefragmentMemory @ 0xf7fd180
Retry policyTpuClient::ShouldRetryOnOom @ 0xf8141a0 — ≤ 2 attempts; evict programs + defragment between tries
Immovability bridgethe pinned absl::flat_hash_set<long> — MSA-static and DMA/aliasing-held addresses never move
Relocation in Allocate/Deallocate?NO — those paths only coalesce free blocks (verified absent)
Algorithmone-pass, reverse (high→low) greedy bin-pack against a gtl::IntervalSet<long> occupancy oracle
Source labellearning/45eac/tpu/runtime/hal/internal/best_fit_allocator.cc (from ResourceExhaustedErrorBuilder site-strings)
ConfidenceCONFIRMED (byte-anchored, full Compact body decompiled) unless a row or callout says otherwise

Scope and Boundaries

This page owns the present/absent compaction inventory, the Compact relocation method, and the coalescing-vs-compaction distinction. Three adjacent concerns live on their own pages; this page links them rather than duplicating:

ConcernOwner page
The non-moving free-list: best-fit search, the dual data structure, coalescing on free, split policy, alignment quantumhbm-allocator.md
Compile-time HBM/VMEM placement (the static layout that obviates most runtime compaction)../compiler/msa-overview.md
Donation/aliasing as a reuse path (a donated input buffer becomes the output, avoiding an allocation entirely)buffer-donation-aliasing.md

NOTE — the coalescing/compaction split is the whole point. Coalescing (on hbm-allocator.md) is in-place: on every Deallocate, the freed block is merged with its adjacent free neighbours by extending one map entry and erasing the other — no bytes move, no live buffer is touched. Compaction (this page) is the only path that moves a live buffer's bytes to a new HBM offset. They are complementary: coalescing keeps free space maximally contiguous cheaply; compaction is the heavyweight last resort when coalesced contiguity is still insufficient.


What Is Present vs. Absent

The headline inventory, each line byte-verified in the decompile (see Evidence):

MechanismPresent?Evidence
Live-buffer relocation (Compact)PRESENTBestFitAllocator::Compact @ 0x1e81c360 emits vector<TpuMemmoves::Memmove> and rewrites the block map + free tree
Relocation plan decoupled from byte movementPRESENTCompact returns the move list; bytes moved later by Generate codegen
OOM-triggered defrag-and-retryPRESENTSystem::CompactMemory 0x1d0b6000, TpuClient::DefragmentMemory 0xf7fd660, ShouldRetryOnOom 0xf8141a0
Pinned set keeping static/aliased buffers immovablePRESENTCompact takes flat_hash_set<long> pinned; CRC32 SwissTable probe per block
Program-stack shrink before compactionPRESENTSystem::CompactMemory calls TpuProgramStack::MaybeShrink 0x1db0c100
Program eviction to free HBM between retriesPRESENTEvictLoadedPrograms 0xf80d2c0, UnloadAllProgramsForCore (in retry lambda)
Donation-driven reuse (allocation avoidance)PRESENTAllocateOutputBuffersWithInputReuse 0xf7ba9a0 (see buffer-donation-aliasing.md)
Relocation inside AllocateABSENTAllocate 0x1e817820 has no Compact/Memmove reference; only SplitBlock
Relocation inside Deallocate (free-time defrag)ABSENTDeallocate 0x1e819dc0 only calls MergeBlock/free-tree ops — coalescing, not moving
Periodic / background defragmenterABSENTno timer/thread driving Compact; sole driver is the OOM retry path
Multi-pass fixpoint compaction within one Compact callABSENTsingle reverse sweep; a block failing IsDisjoint is left in place
Device-side arena/slab compactionABSENTone fixed-region BestFitAllocator per (core, tier); no arenas — see hbm-allocator.md
Growth-by-doubling on the deviceABSENTregion is a fixed [base, end); Expand/Shrink only adjust stack bounds

WARNING — this is not the negative result one might expect. A naive read of "BestFitAllocator is a non-moving boundary-tag allocator" would conclude libtpu cannot defragment. That read is wrong. The allocator's free-list methods never relocate, but the allocator class also exposes a separate Compact method that does — and the runtime calls it on OOM. The honest summary is: non-moving for the common case, moving-as-last-resort on exhaustion.


The Relocation Method: Compact

BestFitAllocator::Compact (0x1e81c360, vt+0x90) is the only method in the allocator that produces buffer relocations. Its signature (demangled from the binary symbol) is:

// 0x1e81c360 — returns the relocation plan by value (NRVO into a1).
std::vector<tpu::TpuMemmoves::Memmove>
tpu::BestFitAllocator::Compact(const absl::flat_hash_set<long>& pinned);

It is a one-pass, reverse (high-address-to-low) greedy bin-pack with a pinned exclusion set. The decompiled body (~3.9 kB) confirms five phases:

// Reconstructed from the 0x1e81c360 decompile (byte-confirmed structure).
std::vector<TpuMemmoves::Memmove>
Compact(const flat_hash_set<int64_t>& pinned) {
    std::vector<TpuMemmoves::Memmove> moves;          // RETURNED plan (begins empty)

    // (1) COLLECT every allocated block from the boundary-tag map.
    std::vector<LiveBlock> live;                      // LiveBlock = {begin,end,state} (48 B)
    for (HashTableEntry& e : blocks_by_offset_)       // walk SwissTable slots
        if (e.state != kFree)                         // *((_DWORD*)slot+2) != 0
            live.push_back({e.offset, e.offset + e.size, e.state});

    // (2) SORT the live blocks (introsort, by offset).
    __introsort(live.begin(), live.end());            // 0x1e81e260

    // (3) REVERSE SWEEP: pack movable blocks against the top, against an occupancy oracle.
    gtl::IntervalSet<int64_t> occupied;               // absl btree of intervals
    int64_t top = allocatable_range_end_;             // this+0x70
    std::vector<LiveBlock> kept;
    for (LiveBlock& b : reverse(live)) {              // high -> low
        if (b.state == kReserved) { kept.push_back(b); continue; }   // reserved -> immovable
        int64_t addr = b.begin + base_offset_in_bytes_;              // this+0x58
        if (pinned.contains(addr)) {                  // CRC32 H2 SwissTable probe
            kept.push_back(b);
            occupied.Add({b.begin, b.end});           // 0x1e824ae0 (AddImpl)
            continue;                                 // PINNED -> never relocated
        }
        int64_t size = b.end - b.begin;
        int64_t cand_lo = top - size, cand_hi = top;  // candidate placement against the top
        if (occupied.IsDisjoint({cand_lo, cand_hi})) { // 0x1cc99740 — placement is free
            moves.push_back(Memmove{ b.begin  / granule_,  // +0 src, +8 dst, +0x10 size (GRANULE units)
                                     cand_lo  / granule_,
                                     size     / granule_ });
            kept.push_back({cand_lo, cand_hi, b.state});
            occupied.Add({cand_lo, cand_hi});
            top = cand_lo;
        } else {
            kept.push_back(b);                        // overlap -> leave in place (single pass)
            occupied.Add({b.begin, b.end});
        }
    }

    // (4) REWRITE bookkeeping to the post-compaction layout, atomically in-call.
    ClearBackingArray(blocks_by_offset_);             // wipe the SwissTable
    free_tree_.clear();                               // __tree_deleter over the RB-tree
    __introsort(kept.begin(), kept.end());
    int64_t cursor = 0;
    for (LiveBlock& b : kept) {
        if (b.begin > cursor) {                       // gap below -> synthesize a free block
            blocks_by_offset_.try_emplace(cursor, {cursor, kFree, b.begin - cursor});
            free_tree_.insert(FreeBlock{cursor, b.begin});   // find_equal + balance_after_insert
        }
        blocks_by_offset_.try_emplace(b.begin, {b.begin, b.state, b.end - b.begin});
        cursor = b.end;
    }
    if (cursor < allocatable_range_end_) {            // trailing free block
        blocks_by_offset_.try_emplace(cursor, {cursor, kFree, allocatable_range_end_ - cursor});
        free_tree_.insert(FreeBlock{cursor, allocatable_range_end_});
    }

    // (5) RECOMPUTE the derived watermarks via a GetBlockIf probe at the reserved offset.
    capacity_in_bytes_ = ...;                         // this+0xB8 updated from GetBlockIf result

    return moves;                                     // caller turns the plan into DMA programs
}

Why the plan is returned, not executed

Compact does not move a single byte. It returns a std::vector<TpuMemmoves::Memmove> — a list of {dst, src, size} triples, all three fields in granule units (the byte values divided by granule_in_bytes_, this+0x78). The byte movement is performed by a separate pipeline (next section). This decoupling matters: the allocator's bookkeeping is updated synchronously inside Compact (so live tpu::TpuBuffer handles resolve to the new offsets immediately, because they resolve their device address through the allocator's current map), while the physical DMA is enqueued asynchronously on the device.

struct tpu::TpuMemmoves::Memmove {     // 24 B, GRANULE units
    int64_t src;     // +0   source offset / granule  (the block's current begin)
    int64_t dst;     // +8   destination offset / granule  (the new top placement)
    int64_t size;    // +0x10  size / granule
};
struct Compact::LiveBlock {            // 48 B local helper
    int64_t begin;   // +0
    int64_t end;     // +8
    int32_t state;   // +0x10
    // (trailing bytes used as a small-string/scratch slot in the decompile)
};

The pinned set is the MSA / aliasing immovability bridge

The single argument to Compact is const absl::flat_hash_set<long>& pinned — buffer addresses that must not move. In the decompiled reverse sweep, each candidate block's absolute address (b.begin + base_offset_in_bytes_) is looked up in this set via a CRC32-keyed Abseil SwissTable probe (_mm_crc32_u64 + vpcmpeqb group scan, lines confirming the H2 metadata match). A hit means the block is appended to the kept list and its interval marked occupied — it is never assigned a Memmove.

This is the precise mechanism that enforces "MSA owns the static layout; the runtime owns the dynamic layout in the remaining space":

  • MSA-static buffers (placed at compile time, replayed at load by ProgramMemoryAllocator::CreateFromProto — see ../compiler/msa-overview.md) are passed in pinned and stay put.
  • Buffers with an outstanding DMA or aliasing hold (donated inputs reused as outputs, buffers mid-transfer — see buffer-donation-aliasing.md) are also pinned, because relocating bytes underneath an in-flight DMA would corrupt them.
  • Only the remaining dynamic, movable PJRT buffers receive Memmove records.

NOTE — kReserved is a second, state-level immovability gate. Independently of the pinned set, any block whose state == kReserved (e.g. the bottom-of-memory reservation set by ReserveBottomOfMemory, see hbm-allocator.md) is skipped by the sweep entirely. So a buffer is immovable if it is either reserved-state or address-pinned.

The reverse-sweep packing, in words

Allocations are placed at the top of free blocks by Allocate (top-down placement, see hbm-allocator.md), so live buffers naturally cluster against the high end. Compact mirrors that: it walks live blocks from high address to low, and for each movable block tries to place it as high as possible (top - size) against the running IntervalSet of already-placed/pinned intervals. If that candidate window is disjoint from everything occupied, the block relocates there and top drops; otherwise the block is left where it is. The net effect is that movable buffers slide downward to fill gaps left by pinned buffers, consolidating free space into a single contiguous run at the bottom — which is exactly the run a subsequent best-fit Allocate can then satisfy.

GOTCHA — single pass, not a fixpoint. Each Compact call is one reverse sweep. A block that fails the IsDisjoint test (because a pinned block sits in its candidate window) is left in place and not retried within the same call. Multi-pass behaviour, if it occurs, comes from the OOM retry loop invoking the whole defrag chain again (bounded at 2 attempts). This was observed but not exhaustively traced — marked HIGH.


Byte Movement: From Plan to On-Device DMA

Compact produces the plan; three downstream functions turn it into executed DMA:

  1. TpuSharedMemoryCommonImpl::GenerateAndValidateCompactionPrograms(const TpuMemmoves&) (0x1d130ec0) validates the move set (its CHECK strings confirm compaction_buffer_base_ != nullptr, compaction_codegen() != nullptr, and memmoves.type() == compaction_buffer().location().type()) and drives codegen.
  2. TpuCompactionIsaEmitterCodegen::Generate(TpuSharedMemoryType, Span<TpuMemmoves::Memmove const>) (0x1090ece0) codegens the actual DMA. The decompile shows it reading Target::VmemSizeBytes / Target::VmemWordSizeBytes and building a vector<Transaction> — i.e. it stages moves through VMEM (HBM→VMEM→HBM) in chunks bounded by VMEM capacity, merging adjacent moves into batched transactions. This is why compaction is genuinely a physical relocation, not a remapping trick: there is no HBM address-translation layer to retarget, so the bytes are copied.
  3. TpuSharedMemoryDriverCommonImpl::EnqueueCompactionImpl(const TpuMemmoves&, AnyInvocable<void(Status const&)>) (0x1d12ed00) constructs a CompactionRunner (CompactionRunner::Create) and enqueues the generated programs on the core's command stream, invoking the completion callback when the DMA finishes.

The stream-executor surface exposes the same capability via TpuExecutor::EnqueueCompactionOnStreamForHbm (0xe997400 / interface 0x1d0eff00), and TpuNodeContext::CompactionSupported (0xeaca440) gates whether a given chip supports it at all.

BestFitAllocator::Compact(pinned)                 // build the plan + rewrite map/tree
        │  std::vector<TpuMemmoves::Memmove>
        ▼
TpuSharedMemory::EnqueueCompaction (0x1d4bcde0)    // per-core entry
        ▼
GenerateAndValidateCompactionPrograms (0x1d130ec0) // validate the move set
        ▼
TpuCompactionIsaEmitterCodegen::Generate (0x1090ece0)  // codegen VMEM-staged DMA transactions
        ▼
EnqueueCompactionImpl (0x1d12ed00) -> CompactionRunner // enqueue on the device command stream
FunctionAddressRole
BestFitAllocator::Compact0x1e81c360collect → sort → IntervalSet pack → emit TpuMemmoves → rewrite map/tree
TpuSharedMemory::EnqueueCompaction0x1d4bcde0per-core compaction entry (TpuCompactionConfig + callback)
GenerateAndValidateCompactionPrograms0x1d130ec0validate the move set, drive codegen
TpuCompactionIsaEmitterCodegen::Generate0x1090ece0codegen merged VMEM↔HBM staged DMA transactions
EnqueueCompactionImpl0x1d12ed00build CompactionRunner, enqueue on the command stream
gtl::IntervalSet<long>::AddImpl0x1e824ae0mark an interval occupied during the pack
gtl::IntervalSet<long>::IsDisjoint0x1cc99740test a candidate placement window
__introsort (Compact LiveBlock)0x1e81e260sort live blocks by offset
TpuExecutor::EnqueueCompactionOnStreamForHbm0xe997400stream-executor compaction surface
TpuNodeContext::CompactionSupported0xeaca440per-chip capability gate

What Triggers Compaction: the OOM Retry Chain

Nothing runs Compact proactively. The sole driver is an allocation failure. When BestFitAllocator::Allocate (0x1e817820) cannot find a fitting free block, it returns the ResourceExhausted diagnostic "Attempting to allocate <size>. That was not possible. There are <free> free. The largest contiguous region of free memory is <largest> due to fragmentation." (site string at best_fit_allocator.cc:129). (The superficially similar "…at the bottom of memory…" wording is a different string emitted by ReserveBottomOfMemory @ 0x1e81b0c0, not by this Allocate OOM path — see hbm-allocator.md.) That error propagates up the four-layer allocator stack and lands in the retry chain:

// tpu::System::CompactMemory (0x1d0b6000): the device-side defrag driver.
TpuEvent CompactMemory(const TpuSharedMemoryLocation& loc) {
    auto* core = ResolveCore(loc);
    if (!core)                                        // best_fit_allocator path needs a core
        return MakeError<ResourceExhausted>(          // MakeErrorImpl<13>, system.cc:2410
            "No attached TPU to compact.");
    int seg = TpuSharedMemoryTypeToTpuSegmentMemoryType(loc.type());  // 0..2
    for (TpuProgramStack* stack : core->program_stacks())
        stack->MaybeShrink(stack->segment_limit(seg));  // 0x1db0c100 — reclaim dynamic-stack HBM first
    return shm(loc)->EnqueueCompaction(/* -> BestFitAllocator::Compact(pinned) */);
}

// xla::TpuClient::ShouldRetryOnOom (0xf8141a0): the bound + recovery actions.
bool ShouldRetryOnOom(int attempt, PjRtDevice* dev, PjRtLoadedExecutable* exe, Status s) {
    if (attempt > 1) return attempt < 2;              // <= 2 total attempts
    // tpu_pjrt_client.cc:4441 — verbatim LOG prefix
    LOG(INFO) << "TpuLoadedExecutable::ExecutePrepareWithOomRetries "
                 "attempting to defragment and retry after seeing error: " << s;
    if (this->evict_programs_on_oom_ /*+0x67*/)
        for (auto& [id, ls] : loaded_executables_)
            if (ls != exe) ls->EvictLoadedPrograms(); // 0xf80d2c0 — free program HBM
    if (this->defragment_on_oom_ /*+0x69*/)
        DefragmentMemory(dev->core()->LocalSharedMemory(kHbm));  // 0xf7fd660 -> CompactMemory
    return attempt < 2;
}

The async leaf, tfrt::tpu::AllocateTpuBufferWithRetry (0xf7ec6a0), is dependency-gated rather than spin-looping: if the System AsyncValueRef is still pending it enqueues a waiter carrying a retry continuation ($_0 @ 0xf7ed620) that also calls TpuCompilationCache::UnloadAllProgramsForCore to free program HBM, then re-runs System::Allocate when the dependency resolves. So the recovery sequence on OOM is, in order: (a) shrink dynamic program stacks; (b) evict/unload loaded programs to reclaim their static HBM; (c) run Compact to relocate movable buffers and consolidate the remainder; (d) retry the allocation — at most twice.

NOTE — two config bools gate recovery. TpuClient+0x67 toggles program eviction and TpuClient+0x69 toggles defragmentation on OOM; either can be disabled independently. Their exact PjRtTpuClientConfig key names were not back-traced (marked in the open items). The megascale-aware variant CommonPjRtClient::ShouldRetryOnOom (0xe6edc80) additionally consults the DeviceAssignment so a pod slice can coordinate retry across devices.

FunctionAddressRole
BestFitAllocator::Allocate (OOM site)0x1e817820emits the ResourceExhausted "…due to fragmentation" leaf error
tpu::System::CompactMemory0x1d0b6000shrink program stacks → EnqueueCompaction; MakeErrorImpl<13> on bad core
TpuProgramStack::MaybeShrink0x1db0c100reclaim dynamic-stack HBM before compacting
TpuClient::DefragmentMemory0xf7fd660PJRT defrag entry → EnqueueDefragmentMemorySystem::CompactMemory
TpuClient::EnqueueDefragmentMemory0xf7fd180enqueue the defrag work item
TpuClient::ShouldRetryOnOom0xf8141a0≤ 2 attempts; evict programs + defragment
CommonPjRtClient::ShouldRetryOnOom0xe6edc80megascale/pod-aware retry coordination
tfrt::tpu::AllocateTpuBufferWithRetry0xf7ec6a0async dependency-gated retry; UnloadAllProgramsForCore
TpuExecutableLoadState::EvictLoadedPrograms0xf80d2c0free program HBM between retries

Why Runtime Compaction Is Rarely Needed: the MSA Story

The reason a TPU program can run a 745 MB-binary's worth of kernels and almost never hit Compact is that the hard packing problem is solved before the program runs. XLA's MemorySpaceAssignment (MsaAlgorithm) plus jellyfish ProgramMemoryAllocator assign every static intra-program buffer a memory space and a frozen byte offset at compile time, serialized into ProgramMemoryMetadata inside the compiled program. At load time, ProgramMemoryAllocator::CreateFromProto (0x1c631f20) replays those offsets, and the runtime BestFitAllocator is told where each static buffer goes — it does not run a best-fit search for them, and it never needs to relocate them (they are pinned). See ../compiler/msa-overview.md.

This is the key architectural insight for a reimplementer: compile-time packing is the primary anti-fragmentation strategy; runtime compaction is the safety net. The runtime free-list math (and any compaction) is exercised only against the dynamic surface in the HBM left over after MSA's static layout — user input/output PJRT buffers, transfer staging, async-copy scratch, the dynamic program stack. Because those are typically large, few, and short-lived relative to the static program footprint, the runtime mostly gets by on best-fit + coalescing alone, and Compact fires only under genuine pressure.

NOTE — donation is the other allocation-avoidance mechanism. Before any dynamic allocation, AllocateOutputBuffersWithInputReuse (0xf7ba9a0) consults the compiler-emitted HloInputOutputAliasConfig and, for aliased outputs, reuses the donated input buffer in place — no new HBM, no relocation. This is reuse, not compaction, but it has the same effect of reducing the dynamic allocation pressure that would otherwise drive fragmentation. See buffer-donation-aliasing.md.


Coalescing vs. Compaction: the Precise Distinction

To close the inventory, the two free-space-recovery mechanisms side by side. This is the central reason these are two pages, not one.

PropertyCoalescing (hbm-allocator.md)Compaction (this page)
Methodinside Deallocate 0x1e819dc0 (via MergeBlock 0x1e819700)Compact 0x1e81c360
Whenevery free, eager, immediateonly on OOM, after coalescing already failed to help
Moves live bytes?No — merges adjacent free blocks onlyYes — relocates movable live buffers via DMA
Mechanismextend one map entry, erase the neighbour entryemit TpuMemmoves plan → VMEM-staged DMA
Touches a live TpuBuffer?neveryes (movable, unpinned ones)
CostO(1) amortised neighbour lookup + O(log n) tree editO(n log n) collect+sort + interval-set pack + DMA + full map/tree rebuild
Recoversadjacency-induced free fragmentsnon-adjacent free fragments (external fragmentation)
Failure mode it addressessmall free gaps next to a freed block"largest contiguous region … due to fragmentation" OOM

Coalescing guarantees the free tree never holds two physically adjacent free blocks (see hbm-allocator.md); that keeps best-fit honest and recovers all adjacency-recoverable space for free. What coalescing cannot do is merge free space separated by a live buffer — that is precisely the gap compaction closes, by moving the intervening live buffer out of the way. The OOM diagnostic's "largest contiguous region of free memory is <X> due to fragmentation" is the symptom of exactly this situation: total free bytes are sufficient, but no single contiguous run is, because live buffers sit between the gaps. Compact slides the movable ones aside to coalesce those gaps into one run.


Cross-References

  • hbm-allocator.md — the non-moving free-list: best-fit search, the dual data structure, coalescing on free (the in-place counterpart to this page's relocation), split policy, alignment quantum
  • ../compiler/msa-overview.md — the compile-time placement pass that freezes static offsets and is the primary reason runtime compaction is rarely needed
  • buffer-donation-aliasing.md — donation/aliasing as a zero-copy reuse path, and the source of the DMA/aliasing holds that make a buffer pinned during compaction
  • tpu-buffer-layout.md — how a device buffer's HBM offset maps to its on-device tile layout (the bytes that Compact actually relocates)
  • overview.md — the five on-chip memory tiers and where the HBM allocator sits in the map
  • hbm-dma-alignment.md — the 1024 B DMA quantum that keeps every relocated offset DMA-legal
  • vmem-allocator.md, cmem-pool.md — sibling tiers using the same BestFitAllocator class (and therefore the same Compact)
  • back to index