Memory Hierarchy Overview

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d). The image is not stripped; demangled C++ symbol names are quoted verbatim. Other versions will differ.

Abstract

A TPU program touches six addressable memory regions, and libtpu.so names every one of them with a single C++ enumeration (xla::jellyfish::MemorySpace) and services every one of them — except the host heap — with a single allocator class (tpu::BestFitAllocator). The hierarchy is, from abundant-and-far to scarce-and-near: HBM (off-chip DRAM, tens of GiB, the kDefault tier), then four on-chip SRAM tiers per TensorCore — VMEM (vector memory, the kAlternate fast-staging tier the MXU/VPU read operands from), CMEM (constant memory, a Pufferfish-only read-mostly operand pool), SMEM (scalar memory, the SPU's private spill/parameter store), and SFLAG (sync-flag memory, a word-granular atomic register file used for cross-engine handshakes) — and finally the host tcmalloc-class heap, which libtpu does not embed (no jemalloc/tcmalloc is linked); host allocations go through posix_memalign wrapped by either tpu::PremappedMemoryManager (DMA staging) or tsl::BFCAllocator (HBM-spill offload).

The reader who knows LLVM and GPU programming should hold one analogy and immediately complicate it. The HBM↔VMEM relationship is XLA's analogue of register allocation: Memory-Space Assignment (MSA) is the compile-time pass that "colors" each HloValue kDefault (HBM) or kAlternate (VMEM), and the "spills" are HBM↔VMEM DMAs. But the analogy breaks four ways. First, the "registers" are tens of MiB, not 64-bit slots. Second, the allocator that realizes the coloring at runtime is the same best-fit class for every tier — there is no HbmAllocator, VmemAllocator, SmemAllocator, or CmemAllocator class; each tier is one tpu::BestFitAllocator instance distinguished only by a 32-byte MemoryAllocator::Config{base_offset, end, alignment, granule}. Third, only VMEM (and CMEM on Pufferfish) is MSA-managed; SMEM and SFLAG are not part of the kAlternate/kDefault tug-of-war — they are placed by opcode semantics and a fixed number-space partition respectively. Fourth, the runtime allocator almost never decides anything: MSA freezes every offset into the compiled program as a ProgramMemoryMetadata_Allocation proto, and the runtime allocator merely replays those offsets.

This page is the section map for the memory subsystem. It fixes the memory-space taxonomy, names the enum that labels them, gives the at-a-glance allocator/alignment/management facts for each tier, and points at the per-tier pages that own the detail. It does not reproduce the best-fit allocate/deallocate algorithm (that is hbm-allocator.md), the per-generation VMEM bank/bandwidth tables (vmem-allocator.md), the SFLAG atomic protocol (sflag-protocol.md), or the MSA placement cascade (msa-overview.md).

For reimplementation, the orientation contract is:

  • The six-region taxonomy — what each space physically is, who reads/writes it, and which engine owns it.
  • The MemorySpace enum — the single label space (kNone … kAlternate) shared by the compile-time placer and the wire/profiler layer, plus the second numbering the DMA-driver-resource path uses (and why they disagree).
  • The per-space allocator/alignment matrix — one BestFitAllocator per tier, the four-field Config per tier, the 1024-B HBM DMA floor vs. the 16-KiB compile-time HBM alignment, and the word-granular on-chip alignments.
  • The compile-time → runtime hand-off — MSA/ProgramMemoryAllocator freezes offsets into a proto; CreateFromProto rehydrates one BestFitAllocator per tier and replays.

| Memory-space enum | xla::jellyfish::MemorySpace (17 values, kNone=0 … kPinnedHbm=16); decoder MemorySpaceToString @ 0x1d6ffae0, string-pointer table @ 0x21ce6b08 |
| Managed space-id array | ProgramMemoryAllocator::kAllocatedMemorySpaces @ 0xb42ff10 (.rodata) |
| Universal runtime allocator | tpu::BestFitAllocator (200-byte instance, operator new(0xC8); ctor 0x1e817500); one per tier; typeinfo 0x21d346e8, vtable 0x21d34630 |
| Allocator base class | tpu::MemoryAllocator (abstract; typeinfo 0x21d34700) |
| Compile-time placer | xla::jellyfish::ProgramMemoryAllocator::AllocateBytes @ 0x1c629e40 (one entry, branches on MemorySpace) |
| MSA (HBM↔VMEM coloring) | xla::memory_space_assignment::MsaAlgorithm::Finish @ 0x1dc5b560; see msa-overview.md |
| Hand-off proto | platforms_deepsea::jellyfish::xdb::ProgramMemoryMetadata_Allocation |
| Rehydrator | ProgramMemoryAllocator::CreateFromProto @ 0x1c631f20 |
| Endpoint render (DMA) | xla::jellyfish::MemorySpaceToDriverResource @ 0x1d6223e0 (its own numbering; see §2) |
| HBM DMA alignment floor | jf_driver::kHbmMinimumDmaAlignment = 1024 B (mask & 0x3FF, WritePremappedHbm @ 0xe73db80) |
| Compile-time HBM alignment | FLAGS_xla_jf_program_hbm_alignment_in_kib = 16 ⇒ 16 KiB (@ 0x223b4888) |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |

1. The Six-Region Taxonomy

Purpose

A TPU program names six addressable regions. Four are on-chip SRAM tiers (per TensorCore, with per-BarnaCore / per-SparseCore variants), one is off-chip DRAM (HBM), and the sixth is the host heap, which the driver (not the program) allocates from. The table below is the whole map at a glance; each row is owned by a dedicated page.

At-a-glance

| Tier | Physical | Scope | Allocator | Alignment / granule | MSA-managed? | Owner page |
|---|---|---|---|---|---|---|
| HBM | Off-chip DRAM, tens of GiB | Per-chip, host-visible | tpu::BestFitAllocator (runtime); ProgramMemoryAllocator (compile) | 1024 B DMA floor / 16 KiB compile-time | yes (the kDefault tier) | hbm-allocator.md · hbm-dma-alignment.md |
| VMEM | On-chip SRAM, ~16–64 MiB/TensorCore | Per-TensorCore | BestFitAllocator (runtime); MSA + ProgramMemoryAllocator (compile) | VmemAlignmentBoundaryInBytes(), per-gen (ChunkBytes on JF; max(Granule, VmemWord) on PF/VF/GL) | yes (the kAlternate fast tier) | vmem-allocator.md |
| CMEM | On-chip SRAM, read-mostly operand pool | Per-TensorCore (Pufferfish only) | BestFitAllocator (runtime); MSA (xla_tpu_cmem_*) | CmemWordSizeBytes() (~16 B on PF) | yes on Pufferfish only; MemBanks(kCmem) is LogFatal elsewhere | cmem-pool.md |
| SMEM | On-chip SRAM, scalar/word-flat | Per-SPU (per-core scalar engine) | BestFitAllocator (runtime); ProgramMemoryAllocator (compile) | SmemWordSizeBytes() (word = alignment = granule) | no (placed by scalar-load/store opcode semantics) | smem-scalar-memory.md · smem-register-window.md |
| SFLAG | On-chip atomic register file, word-granular | Per-engine banks (TC / SCS / TEC / TAC); global sub-space cross-core | BestFitAllocator (size) + fixed number-space partition (compile) | SflagWordSizeBytes() (log2 cached @ Target+0x4c8) | no (placed by a reserved number-space partition) | sflag-protocol.md |
| Host heap | Host DRAM | Process-wide | PremappedMemoryManager (DMA staging) / tsl::BFCAllocator (HBM offload), both → posix_memalign | 4 KiB or 2 MiB page (PickPageAlignment); 16 B (BFC) | n/a (host-offload via custom calls) | embedded-tcmalloc.md |

NOTE — "register window" is a misnomer for every on-chip tier here. SMEM, CMEM, and SFLAG are all flat byte/word arrays; a search of the binary for SmemRegisterWindow / SregWindow / a CMEM register file returns zero hits. Scalar register windowing lives on the SREG file (allocated by LSRA-v2), and SMEM is merely its spill backing store. See smem-register-window.md for why the window concept does not apply.

Considerations

Three facts cut across all tiers and a reimplementer must internalize them before reading any per-tier page:

  1. There is no per-tier allocator class. tpu::BestFitAllocator (200-byte instance — every factory does operator new(0xC8) then the ctor 0x1e817500) is the single concrete tpu::MemoryAllocator subclass in libtpu. The TpuHal binds one instance per tier through an AllocatorFactory (5 callbacks $_0..$_4 at 0x1e815600 … 0x1e815700, all default to a kBestFit policy). The only thing distinguishing the HBM allocator from the VMEM allocator is the 32-byte, four-field Config each is constructed with.

  2. There is no per-TpuVersion branch in the allocator. Every per-codename divergence (HBM byte size, VMEM word size, alignment, granule) is data, carried in the embedded *chip_parts.binarypb resource and surfaced at boot as the per-tier Config. The allocator code is family-agnostic.

  3. The runtime allocator replays, it does not decide. MSA and ProgramMemoryAllocator choose every offset at compile time and freeze them into ProgramMemoryMetadata_Allocation proto entries. At load time CreateFromProto (0x1c631f20) instantiates one BestFitAllocator per tier and replays the frozen offsets. The free-list / red-black-tree machinery is exercised at runtime only for dynamic allocations (scoped scratch, async-copy staging) that MSA marked run-time-allocated. See §4.


2. The MemorySpace Enum

Purpose

One C++ enumeration, xla::jellyfish::MemorySpace, labels every region throughout the compiler and runtime. A reimplementer must reproduce exactly this enum because it is the operand-space tag on every LLO load/store, the ProgramMemoryAllocator::AllocateBytes selector, and the key the kAllocatedMemorySpaces array (0xb42ff10) iterates. A second, unrelated numbering governs how a memory space renders into a DMA descriptor's address word; the two disagree on almost every value, and the boundary between them is the subject of §2.3.

Encoding — the compile-time MemorySpace enum

Recovered byte-exactly from the MemorySpaceToString string-pointer table at rodata 0x21ce6b08. MemorySpaceToString (0x1d6ffae0) is a one-instruction lookup, mov rax, [0x21ce6b08 + ms*8], so the integer is the table index and each slot's C-string is the canonical lowercase region name (resolved through its R_X86_64_RELATIVE reloc). The region enum is 17 values (0..16). The two MSA aliases kDefault/kAlternate belong to a separate two-value enum, xla::memory_space_assignment::MemorySpace (decoded by 0x1dcda1c0, values kDefault=0 / kAlternate=1); they are the colors MSA assigns ("abundant"=HBM vs. "scarce" on-chip tier), not members of xla::jellyfish::MemorySpace.

| MemorySpace | Value | String | Physical tier | Owner |
|---|---|---|---|---|
| kNone | 0 | <no memory space> | none (no space) | - |
| kHbm | 1 | hbm | HBM (off-chip) | per-chip |
| kHib | 2 | hib | HBM↔host interface-buffer staging tier | per-chip |
| kVmem | 3 | vmem | VMEM | per-TensorCore |
| kCmem | 4 | cmem | CMEM | per-TensorCore (PF) |
| kSmem | 5 | smem | SMEM | per-SPU |
| kSflag | 6 | sflag | SFLAG (chip sync-flag tier) | per-engine banks |
| kImem | 7 | imem | instruction memory | per-core |
| kBarnaCoreBmem | 8 | barna_core_bmem | BarnaCore buffer memory | BarnaCore |
| kBarnaCoreSmem | 9 | barna_core_smem | BarnaCore scalar memory | BarnaCore |
| kBarnaCoreSflag | 10 | barna_core_sflag | BarnaCore sync-flag tier | BarnaCore |
| kBarnaCoreImem | 11 | barna_core_imem | BarnaCore instruction memory | BarnaCore |
| kSparseCoreSequencerSflag | 12 | sparse_core_sequencer_sflag | SC sequencer sync-flag region | SparseCore |
| kHost | 13 | host | Host RAM (offload spill target) | host |
| kSparseCoreSequencerSmem | 14 | sparse_core_sequencer_smem | SC sequencer scalar memory | SparseCore |
| kSparseCorePrivateStackHbm | 15 | sparse_core_private_stack_hbm | SC private-stack HBM region | SparseCore |
| kPinnedHbm | 16 | pinned_hbm | HBM, runtime-locked (peer-DMA inputs; repacker may not relocate) | per-chip |

QUIRK — the string-pointer table at 0x21ce6b08 is longer than the 17-value region enum: slots 17/18/19 resolve to absolute (0x868144c), heap_relative (0x8678cad), and stack_relative (0x8678cbb). Those three are pointer-relativity tags of the LloAddress relocation model that share storage with the region-name array — they are not memory pools. A reimplementation that sizes the MemorySpace enum by the string-table length, or treats absolute/heap_relative/stack_relative as tiers, is wrong: the region enum is exactly 17 values (0..16). The ordering is also not a clean physical-tier ordering and is wider than the spaces any one generation uses (CMEM is alive only on Pufferfish). Drive tier tables off the named constants, never off contiguous integers. The per-codename byte sizes that populate each tier's Config are absent from the C++ (they live in chip_parts.binarypb); the enum is the label, not the size. The full enum↔MemorySpaceProto field-number remap lives on memory-space-enum.md.
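A reimplementation can pin the enum width at compile time rather than inferring it from a string table. The following is a minimal C++17 sketch, assuming the values and lowercase names from the table above; the identifiers kNumMemorySpaces, kMemorySpaceNames, and MemorySpaceName are illustrative names, not symbols recovered from the binary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Illustrative mirror of the 17-value region enum (values 0..16).
enum class MemorySpace : std::uint8_t {
  kNone = 0, kHbm, kHib, kVmem, kCmem, kSmem, kSflag, kImem,
  kBarnaCoreBmem, kBarnaCoreSmem, kBarnaCoreSflag, kBarnaCoreImem,
  kSparseCoreSequencerSflag, kHost, kSparseCoreSequencerSmem,
  kSparseCorePrivateStackHbm, kPinnedHbm,
};

// The enum is exactly 17 wide; the three relocation tags that follow it in
// the binary's string table are deliberately NOT represented here.
inline constexpr std::size_t kNumMemorySpaces = 17;
static_assert(static_cast<std::size_t>(MemorySpace::kPinnedHbm) + 1 ==
              kNumMemorySpaces);

inline constexpr std::array<std::string_view, kNumMemorySpaces>
    kMemorySpaceNames = {
        "<no memory space>", "hbm", "hib", "vmem", "cmem", "smem", "sflag",
        "imem", "barna_core_bmem", "barna_core_smem", "barna_core_sflag",
        "barna_core_imem", "sparse_core_sequencer_sflag", "host",
        "sparse_core_sequencer_smem", "sparse_core_private_stack_hbm",
        "pinned_hbm"};

// Table lookup by enum value, bounds-checked in debug builds.
inline std::string_view MemorySpaceName(MemorySpace ms) {
  const auto index = static_cast<std::size_t>(ms);
  assert(index < kNumMemorySpaces);
  return kMemorySpaceNames[index];
}
```

Sizing the array by the named constant, and guarding it with a static_assert, keeps the pointer-relativity tags out of the tier set even if a string table is later extended.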

The DMA-driver-resource numbering is a different integer space

xla::jellyfish::MemorySpaceToDriverResource(MemorySpace) (0x1d6223e0) maps the LLO MemorySpace enum to a hardware driver-resource id stamped into a DMA descriptor's address word. Its switch (verified arm-by-arm in the decompile) does not return the enum value; it returns a permuted, non-monotone id, and it traps on cmem and on every value from 12 to 16 (the SparseCore spaces, host, and pinned_hbm):

// xla::jellyfish::MemorySpaceToDriverResource(MemorySpace ms)   sub_1D6223E0
function MemorySpaceToDriverResource(ms):
    switch ms:                       // ms = the 17-value LLO MemorySpace enum
        case 0 (<no space>): return 10
        case 1 (hbm):        return 2
        case 2 (hib):        return 3
        case 3 (vmem):       return 4
        case 4 (cmem):       FATAL("Unsupported memory space")   // not DMA-addressable here
        case 5 (smem):       return 6
        case 6 (sflag):      return 0
        case 7 (imem):       return 5
        case 8 (barna_core_bmem):  return 7
        case 9 (barna_core_smem):  return 9
        case 10 (barna_core_sflag): return 1
        case 11 (barna_core_imem):  return 8
        case 12..16 (sparse_core_*, host, pinned_hbm): FATAL("Unsupported memory space")

The switch consumes the MemorySpace enum of the §2 table: its case labels (hbm=1, hib=2, vmem=3, cmem=4, smem=5, sflag=6, imem=7, …) are identical to the placer constants. What it returns is a different integer space, the permuted, non-monotone driver-resource id tabulated above. The returned id differs from the MemorySpace integer for every mapped space except barna_core_smem, where the two numberings happen to agree (9 → 9).

GOTCHA — carry the MemorySpace enum end-to-end and convert to a driver-resource id only at the descriptor boundary, via this explicit switch. Deriving the id by reusing the enum integer is wrong for every mapped space except barna_core_smem (9), and cmem (4) plus every value in 12..16 (the SparseCore spaces, host, and pinned_hbm) is not DMA-addressable here at all (LogMessageFatal("Unsupported memory space")). The full resource-id table lives on intra-chip-descriptor.md.
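The conversion at the descriptor boundary can be sketched as a table lookup. This is a minimal C++17 sketch that copies the return values from the switch above; the names kDriverResourceTable, DriverResourceFor, and DriverResourceOrDie are hypothetical, and the sketch reports unsupported spaces as std::nullopt where the original function logs fatally.

```cpp
#include <cassert>
#include <optional>

// Driver-resource ids indexed by MemorySpace integer 0..11, copied from the
// switch above. -1 marks cmem (4), which the original treats as fatal.
inline constexpr int kDriverResourceTable[] = {10, 2, 3, 4, -1, 6,
                                               0,  5, 7, 9, 1,  8};

// Hypothetical helper: returns the driver-resource id, or std::nullopt where
// the original would LogMessageFatal("Unsupported memory space").
inline std::optional<int> DriverResourceFor(int memory_space) {
  if (memory_space < 0 || memory_space > 11) {
    return std::nullopt;  // values 12..16 (and anything out of range)
  }
  const int id = kDriverResourceTable[memory_space];
  if (id < 0) return std::nullopt;  // cmem
  return id;
}

// Wrapper for call sites that have already ruled out unsupported spaces;
// the assert stands in for the original's fatal log.
inline int DriverResourceOrDie(int memory_space) {
  const std::optional<int> id = DriverResourceFor(memory_space);
  assert(id.has_value() && "Unsupported memory space");
  return *id;
}
```

Returning an optional lets a caller decide whether an unsupported space is a programming error or an expected condition; a faithful port of the original behaviour would abort instead.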


3. The Per-Space Allocator / Alignment Matrix

Purpose

Every tier is one tpu::BestFitAllocator constructed from a 32-byte MemoryAllocator::Config. The only per-tier differences are the four Config fields and (for HBM) a stricter compile-time alignment than the hardware DMA floor. This section gives the Config values and the alignment rule per tier; the allocate/deallocate algorithm itself is identical across tiers and is documented once on hbm-allocator.md.

The Config struct (one per tier)

struct tpu::MemoryAllocator::Config {   // 32 B, passed by const&
    int64_t base_offset_in_bytes_;      // +0   ≥ 0   (0 for every on-chip tier)
    int64_t allocatable_range_end_;     // +8   > 0   (capacity = end − base)
    int64_t alignment_in_bytes_;        // +16  > 0, power of two, multiple of granule
    int64_t granule_in_bytes_;          // +24  hardware granule (page / word)
};

The ctor (0x1e817500) asserts, as LogMessageFatal checks: base_offset_in_bytes_ >= 0, allocatable_range_end_ > 0, alignment_in_bytes_ > 0, alignment_in_bytes_ % granule_in_bytes_ == 0, and alignment_in_bytes_ is a power of two. These invariants hold for every tier — they are what let the allocator's round-up arithmetic ((size + align − (size!=0)) & −align, confirmed at the head of Allocate 0x1e817820) be a single AND.
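The constructor checks and the round-up expression can be sketched together. This is a minimal C++17 sketch, assuming the five invariants listed above; the struct and function names are illustrative, field names are shortened, and the granule > 0 guard is an added assumption to keep the modulo well defined.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for the 32-byte Config (field names shortened).
struct Config {
  std::int64_t base_offset;
  std::int64_t range_end;
  std::int64_t alignment;
  std::int64_t granule;
};

constexpr bool IsPowerOfTwo(std::int64_t v) {
  return v > 0 && (v & (v - 1)) == 0;
}

// Mirrors the listed checks. The granule > 0 test is an assumption added
// here so that `alignment % granule` cannot divide by zero.
constexpr bool IsValid(const Config& c) {
  return c.base_offset >= 0 && c.range_end > 0 && c.alignment > 0 &&
         c.granule > 0 && c.alignment % c.granule == 0 &&
         IsPowerOfTwo(c.alignment);
}

// (size + align - (size != 0)) & -align, valid only for power-of-two align.
// As quoted, a zero-byte request rounds up to one full alignment unit.
constexpr std::int64_t RoundUpToAlignment(std::int64_t size,
                                          std::int64_t align) {
  assert(IsPowerOfTwo(align));
  return (size + align - (size != 0 ? 1 : 0)) & -align;
}
```

With align = 16, sizes 1 through 16 round to 16 and size 17 rounds to 32, which is the usual round-up; the (size != 0) term only changes the result for a zero-byte request.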

Per-tier Config and alignment

| Tier | base_offset | end (capacity) | alignment | granule |
|---|---|---|---|---|
| HBM | 0 | chip_parts.binarypb HBM bytes (− xla_tpu_user_reserved_hbm_bytes) | 16 KiB compile-time (xla_jf_program_hbm_alignment_in_kib=16); 1024 B runtime DMA floor | chip_parts HBM granule |
| VMEM | 0 | Target::VmemSizeBytes() (Target+0x458) or xla_tpu_override_vmem_size_kib | VmemAlignmentBoundaryInBytes(): ChunkBytes (JF) / max(Granule, VmemWord) (PF/VF/GL) | VmemWordSizeBytes() (Target+0x50C) |
| CMEM | 0 | Target::CmemSizeBytes() (Target+0x460) | CmemWordSizeBytes() | CmemWordSizeBytes() (Target+0x510, ~16 B PF) |
| SMEM | 0 | Target::SmemSizeBytes() (Target+0x470) | SmemWordSizeBytes() | SmemWordSizeBytes() (Target+0x508) |
| SFLAG | 0 | Target::SflagSizeBytes() (Target+0x468) | SflagWordSizeBytes() (Target+0x504) | SflagWordSizeBytes() |
| Host (premapped) | per-partition partition_size * i | partition_size | 4 KiB if size ≤ 2 MiB, else 2 MiB (PickPageAlignment) | = alignment |
| Host (BFC offload) | 0 | 256 GiB cap (0x40'0000'0000, the tsl::BFCAllocator ctor size arg) | ≥ 16 B (posix_memalign) | 2 MiB region growth |

GOTCHA — HBM has two alignment numbers, and confusing them silently corrupts a DMA. kHbmMinimumDmaAlignment = 1024 B is the hardware floor: every DMA issue site masks size and address with & 0x3FF and LogMessageFatals on a non-zero remainder (byte_offset % jf_driver::kHbmMinimumDmaAlignment == 0, size % … == 0, in WritePremappedHbm @ 0xe73db80). The 16 KiB compile-time figure (xla_jf_program_hbm_alignment_in_kib) is stricter — it rounds every program-level HBM tensor up to 16 KiB before MSA places it, to accommodate XLA's stride/sub-tile addressing and slice-prefetch boundaries. A reimplementer who aligns HBM allocations to 1024 B at compile time will produce a layout MSA's slice machinery cannot address; one who enforces 16 KiB at DMA-issue time wastes nothing but is needlessly strict. The 1024-B floor is the wire contract; the 16-KiB rule is the placement contract. See hbm-dma-alignment.md.
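The two HBM alignment rules can be kept apart as two separate helpers. This is a minimal C++17 sketch using the 1024-B floor and the 16-KiB default quoted above; the function names and the kProgramHbmAlignment constant are hypothetical, and the real flag value may be overridden at run time.

```cpp
#include <cassert>
#include <cstdint>

// Wire contract: the hardware DMA floor quoted in the text.
constexpr std::int64_t kHbmMinimumDmaAlignment = 1024;
// Placement contract: the 16 KiB default of the compile-time flag
// (hypothetical constant name; the binary reads a flag instead).
constexpr std::int64_t kProgramHbmAlignment = 16 * 1024;

// Every 16-KiB-aligned value is also 1024-B aligned, never the reverse.
static_assert(kProgramHbmAlignment % kHbmMinimumDmaAlignment == 0);

// DMA-issue check: both offset and size must clear the & 0x3FF mask.
constexpr bool IsDmaAligned(std::int64_t byte_offset, std::int64_t size) {
  return (byte_offset & 0x3FF) == 0 && (size & 0x3FF) == 0;
}

// Compile-time placement: round a tensor size up to the program alignment.
constexpr std::int64_t RoundUpToProgramAlignment(std::int64_t size) {
  assert(size >= 0);
  return (size + kProgramHbmAlignment - 1) & -kProgramHbmAlignment;
}
```

Anything that passes through RoundUpToProgramAlignment automatically satisfies IsDmaAligned for its size, which is why enforcing the stricter rule at placement time never conflicts with the wire check.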

NOTE — the on-chip tiers (VMEM/CMEM/SMEM/SFLAG) all set base_offset == 0, so every on-chip tier starts at sub-tile address 0 and a single allocation is always one word-aligned run. CMEM, SMEM, and SFLAG also set alignment == granule == <tier>WordSizeBytes(). VMEM is the exception in the table above: its granule is VmemWordSizeBytes(), but its alignment is VmemAlignmentBoundaryInBytes(), which is per-generation. HBM separates alignment from granule as well (16 KiB alignment over a smaller hardware granule), and only the host premapped manager uses a non-zero base_offset (the per-partition slot base). The numeric word sizes per codename live in chip_parts.binarypb and are not in the C++; the formulas above are exact.

The host heap is not a tcmalloc

libtpu embeds no jemalloc and no tcmalloc (despite the page family name). The only OS-level allocation primitive reached is posix_memalign, wrapped two ways: tpu::PremappedMemoryManager partitions a single posix_memalign region into power-of-two partitions, each wrapping a per-partition BestFitAllocator under an absl::Mutex, round-robined for DMA staging; and tsl::BFCAllocator (the TF best-fit-with-coalescing allocator, ~1.2 KiB/instance, 21 size-class bins) backs only the HostOffloadingTpuAllocator (256 GiB cap) that receives HBM buffers MSA elected to spill to host RAM. Neither is the on-device allocator. See embedded-tcmalloc.md.
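The page-alignment rule for the premapped region can be sketched directly on top of posix_memalign. This is a minimal sketch for a POSIX host, assuming the "4 KiB if size ≤ 2 MiB, else 2 MiB" rule from the Config table; PickPageAlignmentSketch and AllocateHostRegion are hypothetical names, and the real PickPageAlignment signature is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

constexpr std::size_t kSmallPageBytes = 4 * 1024;
constexpr std::size_t kHugePageBytes = 2 * 1024 * 1024;

// Hypothetical restatement of the rule: small regions get 4 KiB alignment,
// regions larger than 2 MiB get 2 MiB alignment.
constexpr std::size_t PickPageAlignmentSketch(std::size_t size) {
  return size <= kHugePageBytes ? kSmallPageBytes : kHugePageBytes;
}

// Allocates one host region with the alignment chosen above.
// Returns nullptr on failure; the caller releases the region with free().
inline void* AllocateHostRegion(std::size_t size) {
  void* region = nullptr;
  const std::size_t alignment = PickPageAlignmentSketch(size);
  if (posix_memalign(&region, alignment, size) != 0) return nullptr;
  assert(reinterpret_cast<std::uintptr_t>(region) % alignment == 0);
  return region;
}
```

The sketch stops at the raw region; partitioning it and wrapping each partition in its own allocator under a mutex is the part embedded-tcmalloc.md covers.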


4. The Compile-Time → Runtime Hand-off

Purpose

The same offset that MSA chooses at compile time is the offset the runtime allocator hands back at load time. This is the spine that ties the per-tier pages together: every tier flows through the same seven-stage hand-off, differing only in which compile-time placer chose the offset (MSA for VMEM/CMEM, opcode semantics for SMEM, number-space partition for SFLAG).

Stages

Compile time (XLA core):
  HeapSimulator::Run(GlobalDecreasingSizeBestFitHeap<HloValue>, …)   0x1e49dae0
      └─ produces per-buffer Chunk{offset, size}
Compile time (XLA TPU layer):
  MsaAlgorithm::Finish()                                             0x1dc5b560   ── HBM↔VMEM(↔CMEM) coloring only
      └─ Allocation objects {Pinned / Copy / Prefetch / Scoped / …}
Compile time (jellyfish):
  ProgramMemoryAllocator::AllocateBytes(MemorySpace, …)              0x1c629e40   ── one entry, branches on MS
      └─ emits ProgramMemoryMetadata_Allocation{memory_space, offset, size, block_type, name}
Codegen:
  the compiled XDB/LLO program embeds the proto (offsets symbolic until link)
────── load time ──────
  ProgramMemoryAllocator::CreateFromProto(LloModule*, …, proto)      0x1c631f20
      └─ per tier: TpuHal::GetAllocatorFactory() (this+0x48)         0x1e8139a0
            └─ tpu::BestFitAllocator(Config{base=0, end=<tier>Size, align, granule})
Execution:
  each compile-time Allocation → BestFitAllocator::Allocate(size) at the FROZEN offset
Deallocation:
  BestFitAllocator::Deallocate(offset) — eager coalescing on every free

QUIRK — MSA only colors the HBM↔VMEM (and, on Pufferfish, HBM↔CMEM) axis. SMEM and SFLAG flow through the same ProgramMemoryAllocator → proto → BestFitAllocator spine but are never seen by MSA's kAlternate/kDefault decision: SMEM is committed wherever a scalar-load/store opcode declares MemorySpace=kSmem, and SFLAG is allocated out of a fixed number-space partition (GetStartReservedSyncFlagNumber 0x1d6178e0 reads Target+0x530; the overlay-reserved ceiling GetOverlayReservedSyncFlagNumber 0x1d617900 reads Target+0x534), not the byte heap. A reimplementer who routes SMEM/SFLAG through the MSA cost model will mis-place them; MSA's tier-balancing is a VMEM/CMEM-only concern. See smem-scalar-memory.md and sflag-protocol.md.
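Because the runtime replays offsets rather than choosing them, a loader can validate each frozen entry against its tier before handing it to the allocator. The following is a minimal C++17 sketch of such a check, not the CreateFromProto logic: FrozenAllocation and TierConfig are hypothetical stand-ins for the proto entry and the Config, and the sketch deliberately ignores lifetimes, so two entries may legitimately share an offset.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one ProgramMemoryMetadata_Allocation entry.
struct FrozenAllocation {
  int memory_space;     // MemorySpace integer, e.g. 3 = vmem
  std::int64_t offset;  // byte offset frozen at compile time
  std::int64_t size;    // byte size
};

// Hypothetical stand-in for a tier's Config (field names shortened).
struct TierConfig {
  std::int64_t base_offset;
  std::int64_t range_end;
  std::int64_t alignment;
};

// True if the entry can be replayed verbatim: aligned and inside the range.
inline bool FitsTier(const FrozenAllocation& a, const TierConfig& cfg) {
  assert(cfg.alignment > 0);
  return a.size >= 0 && a.offset >= cfg.base_offset &&
         a.offset % cfg.alignment == 0 &&
         a.offset + a.size <= cfg.range_end;
}

// Counts the entries tagged with `space` that could not be replayed.
inline int CountUnreplayable(const std::vector<FrozenAllocation>& allocs,
                             int space, const TierConfig& cfg) {
  int bad = 0;
  for (const FrozenAllocation& a : allocs) {
    if (a.memory_space == space && !FitsTier(a, cfg)) ++bad;
  }
  return bad;
}
```

A non-zero count means the compiled program and the chip's Config disagree (for example a program compiled for a larger VMEM), which is worth rejecting at load time rather than at the first DMA.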


| Component | Relationship |
|---|---|
| msa-overview.md | The compile-time pass that colors HBM (kDefault) vs. VMEM/CMEM (kAlternate) and inserts the async copies |
| intra-chip-descriptor.md | The DMA descriptor whose (mem_id, core_id) endpoints render the MemorySpace enum via MemorySpaceToDriverResource |
| memory-space-enum.md | The 17-value LLO MemorySpace enum as it appears at the ISA / operand-tag level |

Cross-References

  • hbm-allocator.md — the universal tpu::BestFitAllocator algorithm (best-fit + eager coalescing); the HBM tier and the two-stack (compile-time ProgramMemoryAllocator + runtime BestFitAllocator) model
  • hbm-dma-alignment.md — the 1024-B kHbmMinimumDmaAlignment floor vs. the 16-KiB compile-time program alignment
  • vmem-allocator.md — the kAlternate fast tier; per-generation VMEM size/word/bank/bandwidth Config and scoped-VMEM machinery
  • cmem-pool.md — the Pufferfish-only read-mostly operand pool; xla_tpu_cmem_* MSA knobs; MemBanks(kCmem) LogFatal elsewhere
  • smem-scalar-memory.md — the SPU's scalar memory; opcode-driven placement (not MSA); reserved top/bottom blocks
  • smem-register-window.md — why no SMEM register window exists; SMEM as the SREG-file spill backing store (LSRA-v2)
  • sflag-protocol.md — the sync-flag atomic tier; number-space partition, three SC sub-spaces, fence/ordering model
  • embedded-tcmalloc.md — the host heap: no tcmalloc/jemalloc; PremappedMemoryManager and tsl::BFCAllocator over posix_memalign
  • on-device-compaction.md — BestFitAllocator::Compact relocation; the repacker that reduces fragmentation
  • buffer-donation-aliasing.md — kPinnedHbm and input/output aliasing that the repacker may not relocate
  • tpu-buffer-layout.md — how a logical XLA buffer maps to physical offsets in these tiers
  • msa-overview.md — Phase 7; the consumer of this taxonomy on the compile side
  • intra-chip-descriptor.md — the wire view of the MemorySpace enum at the DMA boundary
  • back to index — Part X — On-Chip Memory & DMA