Memory Hierarchy Overview
All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d). The image is not stripped; demangled C++ symbol names are quoted verbatim. Other versions will differ.
Abstract
A TPU program touches six addressable memory regions, and libtpu.so names every one of them with a single C++ enumeration (xla::jellyfish::MemorySpace) and services every one of them — except the host heap — with a single allocator class (tpu::BestFitAllocator). The hierarchy is, from abundant-and-far to scarce-and-near: HBM (off-chip DRAM, tens of GiB, the kDefault tier), then four on-chip SRAM tiers per TensorCore — VMEM (vector memory, the kAlternate fast-staging tier the MXU/VPU read operands from), CMEM (constant memory, a Pufferfish-only read-mostly operand pool), SMEM (scalar memory, the SPU's private spill/parameter store), and SFLAG (sync-flag memory, a word-granular atomic register file used for cross-engine handshakes) — and finally the host tcmalloc-class heap, which libtpu does not embed (no jemalloc/tcmalloc is linked); host allocations go through posix_memalign wrapped by either tpu::PremappedMemoryManager (DMA staging) or tsl::BFCAllocator (HBM-spill offload).
The reader who knows LLVM and GPU programming should hold one analogy and immediately complicate it. The HBM↔VMEM relationship is XLA's analogue of register allocation: Memory-Space Assignment (MSA) is the compile-time pass that "colors" each HloValue kDefault (HBM) or kAlternate (VMEM), and the "spills" are HBM↔VMEM DMAs. But the analogy breaks four ways. First, the "registers" are tens of MiB, not 64-bit slots. Second, the allocator that realizes the coloring at runtime is the same best-fit class for every tier — there is no HbmAllocator, VmemAllocator, SmemAllocator, or CmemAllocator class; each tier is one tpu::BestFitAllocator instance distinguished only by a 32-byte MemoryAllocator::Config{base_offset, end, alignment, granule}. Third, only VMEM (and CMEM on Pufferfish) is MSA-managed; SMEM and SFLAG are not part of the kAlternate/kDefault tug-of-war — they are placed by opcode semantics and a fixed number-space partition respectively. Fourth, the runtime allocator almost never decides anything: MSA freezes every offset into the compiled program as a ProgramMemoryMetadata_Allocation proto, and the runtime allocator merely replays those offsets.
This page is the section map for the memory subsystem. It fixes the memory-space taxonomy, names the enum that labels them, gives the at-a-glance allocator/alignment/management facts for each tier, and points at the per-tier pages that own the detail. It does not reproduce the best-fit allocate/deallocate algorithm (that is hbm-allocator.md), the per-generation VMEM bank/bandwidth tables (vmem-allocator.md), the SFLAG atomic protocol (sflag-protocol.md), or the MSA placement cascade (msa-overview.md).
For reimplementation, the orientation contract is:
- The six-region taxonomy: what each space physically is, who reads/writes it, and which engine owns it.
- The MemorySpace enum: the single label space (kNone … kPinnedHbm) shared by the compile-time placer and the wire/profiler layer, plus the second numbering the DMA-driver-resource path uses (and why they disagree).
- The per-space allocator/alignment matrix: one BestFitAllocator per tier, the four-field Config per tier, the 1024-B HBM DMA floor vs. the 16-KiB compile-time HBM alignment, and the word-granular on-chip alignments.
- The compile-time → runtime hand-off: MSA/ProgramMemoryAllocator freezes offsets into a proto; CreateFromProto rehydrates one BestFitAllocator per tier and replays.
| Item | Detail |
|---|---|
| Memory-space enum | xla::jellyfish::MemorySpace (17 values, kNone=0 … kPinnedHbm=16); decoder MemorySpaceToString @ 0x1d6ffae0, string-pointer table @ 0x21ce6b08 |
| Managed space-id array | ProgramMemoryAllocator::kAllocatedMemorySpaces @ 0xb42ff10 (.rodata) |
| Universal runtime allocator | tpu::BestFitAllocator (200-byte instance, operator new(0xC8); ctor 0x1e817500); one per tier; typeinfo 0x21d346e8, vtable 0x21d34630 |
| Allocator base class | tpu::MemoryAllocator (abstract; typeinfo 0x21d34700) |
| Compile-time placer | xla::jellyfish::ProgramMemoryAllocator::AllocateBytes @ 0x1c629e40 (one entry, branches on MemorySpace) |
| MSA (HBM↔VMEM coloring) | xla::memory_space_assignment::MsaAlgorithm::Finish @ 0x1dc5b560 — see msa-overview.md |
| Hand-off proto | platforms_deepsea::jellyfish::xdb::ProgramMemoryMetadata_Allocation |
| Rehydrator | ProgramMemoryAllocator::CreateFromProto @ 0x1c631f20 |
| Endpoint render (DMA) | xla::jellyfish::MemorySpaceToDriverResource @ 0x1d6223e0 (its own numbering — see §2) |
| HBM DMA alignment floor | jf_driver::kHbmMinimumDmaAlignment = 1024 B (mask & 0x3FF, WritePremappedHbm @ 0xe73db80) |
| Compile-time HBM alignment | FLAGS_xla_jf_program_hbm_alignment_in_kib = 16 ⇒ 16 KiB (@ 0x223b4888) |
| Confidence | CONFIRMED (byte-anchored) unless a row or callout says otherwise |
1. The Six-Region Taxonomy
Purpose
This taxonomy covers six addressable regions. Four are on the chip (per TensorCore, plus per-BarnaCore / per-SparseCore variants); one is off-chip DRAM (HBM). The sixth, the host heap, is allocated by the driver rather than by the program. The table below is the whole map at a glance; each row is owned by a dedicated page.
At-a-glance
| Tier | Physical | Scope | Allocator | Alignment / granule | MSA-managed? | Owner page |
|---|---|---|---|---|---|---|
| HBM | Off-chip DRAM, tens of GiB | Per-chip, host-visible | tpu::BestFitAllocator (runtime); ProgramMemoryAllocator (compile) | 1024 B DMA floor / 16 KiB compile-time | yes — the kDefault tier | hbm-allocator.md · hbm-dma-alignment.md |
| VMEM | On-chip SRAM, ~16–64 MiB/TensorCore | Per-TensorCore | BestFitAllocator (runtime); MSA + ProgramMemoryAllocator (compile) | VmemAlignmentBoundaryInBytes() — per-gen (ChunkBytes on JF; max(Granule, VmemWord) on PF/VF/GL) | yes — the kAlternate fast tier | vmem-allocator.md |
| CMEM | On-chip SRAM, read-mostly operand pool | Per-TensorCore (Pufferfish only) | BestFitAllocator (runtime); MSA (xla_tpu_cmem_*) | CmemWordSizeBytes() (~16 B on PF) | yes on Pufferfish only; MemBanks(kCmem) is LogFatal elsewhere | cmem-pool.md |
| SMEM | On-chip SRAM, scalar/word-flat | Per-SPU (per-core scalar engine) | BestFitAllocator (runtime); ProgramMemoryAllocator (compile) | SmemWordSizeBytes() (word = alignment = granule) | no — placed by scalar-load/store opcode semantics | smem-scalar-memory.md · smem-register-window.md |
| SFLAG | On-chip atomic register file, word-granular | Per-engine banks (TC / SCS / TEC / TAC); global sub-space cross-core | BestFitAllocator (size) + fixed number-space partition (compile) | SflagWordSizeBytes() (log2 cached @ Target+0x4c8) | no — placed by a reserved number-space partition | sflag-protocol.md |
| Host heap | Host DRAM | Process-wide | PremappedMemoryManager (DMA staging) / tsl::BFCAllocator (HBM offload) — both → posix_memalign | 4 KiB or 2 MiB page (PickPageAlignment); 16 B (BFC) | n/a (host-offload via custom calls) | embedded-tcmalloc.md |
NOTE: "register window" is a misnomer for every on-chip tier here. SMEM, CMEM, and SFLAG are all flat byte/word arrays; a search of the binary for SmemRegisterWindow, SregWindow, or a CMEM register file returns zero hits. Scalar register windowing lives on the SREG file (allocated by LSRA-v2), and SMEM is merely its spill backing store. See smem-register-window.md for why the window concept does not apply.
Considerations
Three facts cut across all tiers and a reimplementer must internalize them before reading any per-tier page:
- There is no per-tier allocator class. tpu::BestFitAllocator (200-byte instance; every factory does operator new(0xC8) then the ctor 0x1e817500) is the single concrete tpu::MemoryAllocator subclass in libtpu. The TpuHal binds one instance per tier through an AllocatorFactory (5 callbacks $_0..$_4 at 0x1e815600 … 0x1e815700, all default to a kBestFit policy). The only thing distinguishing the HBM allocator from the VMEM allocator is the 32-byte, four-field Config each is constructed with.
- There is no per-TpuVersion branch in the allocator. Every per-codename divergence (HBM byte size, VMEM word size, alignment, granule) is data, carried in the embedded *chip_parts.binarypb resource and surfaced at boot as the Config. The allocator code is family-agnostic.
- The runtime allocator replays, it does not decide. MSA and ProgramMemoryAllocator choose every offset at compile time and freeze them into ProgramMemoryMetadata_Allocation proto entries. At load time CreateFromProto (0x1c631f20) instantiates one BestFitAllocator per tier and replays the frozen offsets. The free-list / red-black-tree machinery is exercised at runtime only for dynamic allocations (scoped scratch, async-copy staging) that MSA marked run-time-allocated. See §4.
2. The MemorySpace Enum
Purpose
One C++ enumeration, xla::jellyfish::MemorySpace, labels every region throughout the compiler and runtime. A reimplementer must reproduce exactly this enum because it is the operand-space tag on every LLO load/store, the ProgramMemoryAllocator::AllocateBytes selector, and the key the kAllocatedMemorySpaces array (0xb42ff10) iterates. A second, unrelated numbering governs how a memory space renders into a DMA descriptor's address word; the two disagree on almost every value, and the boundary between them is the subject of §2.3.
Encoding — the compile-time MemorySpace enum
Recovered byte-exactly from the MemorySpaceToString string-pointer table at rodata 0x21ce6b08 — MemorySpaceToString (0x1d6ffae0) is a one-instruction lookup mov rax, [0x21ce6b08 + ms*8], so the integer is the table index and each slot's C-string is the canonical lowercase region name (resolved through its R_X86_64_RELATIVE reloc). The region enum is 17 values (0..16). The two MSA aliases kDefault/kAlternate belong to a separate two-value enum, xla::memory_space_assignment::MemorySpace (decoded by 0x1dcda1c0, values kDefault=0 / kAlternate=1) — they are the colors MSA assigns ("abundant"=HBM vs. "scarce" on-chip tier), not members of xla::jellyfish::MemorySpace.
MemorySpace | Value | String | Physical tier | Owner |
|---|---|---|---|---|
kNone | 0 | <no memory space> | — (no space) | — |
kHbm | 1 | hbm | HBM (off-chip) | per-chip |
kHib | 2 | hib | HBM↔host interface-buffer staging tier | per-chip |
kVmem | 3 | vmem | VMEM | per-TensorCore |
kCmem | 4 | cmem | CMEM | per-TensorCore (PF) |
kSmem | 5 | smem | SMEM | per-SPU |
kSflag | 6 | sflag | SFLAG (chip sync-flag tier) | per-engine banks |
kImem | 7 | imem | instruction memory | per-core |
kBarnaCoreBmem | 8 | barna_core_bmem | BarnaCore buffer memory | BarnaCore |
kBarnaCoreSmem | 9 | barna_core_smem | BarnaCore scalar memory | BarnaCore |
kBarnaCoreSflag | 10 | barna_core_sflag | BarnaCore sync-flag tier | BarnaCore |
kBarnaCoreImem | 11 | barna_core_imem | BarnaCore instruction memory | BarnaCore |
kSparseCoreSequencerSflag | 12 | sparse_core_sequencer_sflag | SC sequencer sync-flag region | SparseCore |
kHost | 13 | host | Host RAM (offload spill target) | host |
kSparseCoreSequencerSmem | 14 | sparse_core_sequencer_smem | SC sequencer scalar memory | SparseCore |
kSparseCorePrivateStackHbm | 15 | sparse_core_private_stack_hbm | SC private-stack HBM region | SparseCore |
kPinnedHbm | 16 | pinned_hbm | HBM, runtime-locked (peer-DMA inputs; repacker may not relocate) | per-chip |
QUIRK: the string-pointer table at 0x21ce6b08 is longer than the 17-value region enum: slots 17/18/19 resolve to absolute (0x868144c), heap_relative (0x8678cad), and stack_relative (0x8678cbb). Those three are pointer-relativity tags of the LloAddress relocation model that share storage with the region-name array; they are not memory pools. A reimplementation that sizes the MemorySpace enum by the string-table length, or treats absolute/heap_relative/stack_relative as tiers, is wrong: the region enum is exactly 17 values (0..16). The ordering is also not a clean physical-tier ordering and is wider than the spaces any one generation uses (CMEM is alive only on Pufferfish). Drive tier tables off the named constants, never off contiguous integers. The per-codename byte sizes that populate each tier's Config are absent from the C++ (they live in chip_parts.binarypb); the enum is the label, not the size. The full enum↔MemorySpaceProto field-number remap lives on memory-space-enum.md.
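The enum table above can be transcribed as a minimal C++ sketch. The enumerator names and strings are taken from the table; the helper name MemorySpaceName, the kNumMemorySpaces constant, and the bounds check returning "<invalid>" are assumptions added for illustration and are not claimed to match the binary, which the text describes as a plain table lookup.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Region enum: integer values follow the table above (0..16).
enum class MemorySpace : int {
  kNone = 0, kHbm = 1, kHib = 2, kVmem = 3, kCmem = 4, kSmem = 5, kSflag = 6,
  kImem = 7, kBarnaCoreBmem = 8, kBarnaCoreSmem = 9, kBarnaCoreSflag = 10,
  kBarnaCoreImem = 11, kSparseCoreSequencerSflag = 12, kHost = 13,
  kSparseCoreSequencerSmem = 14, kSparseCorePrivateStackHbm = 15,
  kPinnedHbm = 16,
};

// Hypothetical constant: the enum is sized by its named values, not by the
// length of the string table (slots 17..19 are not memory regions).
inline constexpr std::size_t kNumMemorySpaces = 17;

// Canonical lowercase names, indexed by enum value.
inline constexpr std::array<std::string_view, kNumMemorySpaces> kMemorySpaceNames = {
    "<no memory space>", "hbm", "hib", "vmem", "cmem", "smem", "sflag", "imem",
    "barna_core_bmem", "barna_core_smem", "barna_core_sflag", "barna_core_imem",
    "sparse_core_sequencer_sflag", "host", "sparse_core_sequencer_smem",
    "sparse_core_private_stack_hbm", "pinned_hbm"};

// Hypothetical helper: bounds-checked name lookup by integer value.
constexpr std::string_view MemorySpaceName(int value) {
  if (value < 0 || value >= static_cast<int>(kNumMemorySpaces)) {
    return "<invalid>";  // assumption: reject out-of-range values instead of reading past the enum
  }
  return kMemorySpaceNames[static_cast<std::size_t>(value)];
}
```

Sizing the name array by the enum rather than by the recovered string table keeps the three relocation tags out of any tier table built from it.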
The DMA-driver-resource numbering is a different integer space
xla::jellyfish::MemorySpaceToDriverResource(MemorySpace) (0x1d6223e0) maps the LLO MemorySpace enum to a hardware driver-resource id stamped into a DMA descriptor's address word. Its switch (verified arm-by-arm in the decompile) does not return the enum value. It returns a permuted, non-monotone id, and it traps on cmem and on every value from 12 to 16 (the SparseCore spaces, host, and pinned_hbm):
// xla::jellyfish::MemorySpaceToDriverResource(MemorySpace ms)   sub_1D6223E0
function MemorySpaceToDriverResource(ms):
    switch ms:                               // ms = the 17-value LLO MemorySpace enum
        case 0  (<no space>):         return 10
        case 1  (hbm):                return 2
        case 2  (hib):                return 3
        case 3  (vmem):               return 4
        case 4  (cmem):               FATAL("Unsupported memory space")   // not DMA-addressable here
        case 5  (smem):               return 6
        case 6  (sflag):              return 0
        case 7  (imem):               return 5
        case 8  (barna_core_bmem):    return 7
        case 9  (barna_core_smem):    return 9
        case 10 (barna_core_sflag):   return 1
        case 11 (barna_core_imem):    return 8
        case 12..16 (sparse_core_*, host, pinned_hbm): FATAL("Unsupported memory space")
The switch consumes the MemorySpace enum of the §2 table: its case labels (hbm=1, hib=2, vmem=3, cmem=4, smem=5, sflag=6, imem=7, …) are identical to the placer constants. What it returns is a different integer space, the permuted, non-monotone driver-resource id tabulated above. The eleven supported inputs map one-to-one onto the ids 0..10, and the returned id equals the MemorySpace integer only for barna_core_smem (9 → 9), which is a coincidence of the permutation rather than a rule.
GOTCHA: carry the MemorySpace enum end-to-end and convert to a driver-resource id only at the descriptor boundary, via this explicit switch. Deriving the id by reusing the enum integer is wrong for every tier except barna_core_smem (9), and cmem (4) plus every value in 12..16 (the SparseCore spaces, host, and pinned_hbm) is not DMA-addressable here at all (LogMessageFatal("Unsupported memory space")). The full resource-id table lives on intra-chip-descriptor.md.
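The conversion at the descriptor boundary can be sketched as a table lookup. The ids below are copied from the switch above; the function name ToDriverResource, the array name, and the choice to return std::nullopt where the original aborts are assumptions made so the sketch can be exercised without terminating the process.

```cpp
#include <array>
#include <cstddef>
#include <optional>

// Driver-resource id per MemorySpace value 0..16, copied from the switch above.
// -1 marks the arms that report "Unsupported memory space".
inline constexpr std::array<int, 17> kDriverResourceByMemorySpace = {
    10, 2, 3, 4, -1, 6, 0, 5, 7, 9, 1, 8, -1, -1, -1, -1, -1};

// Hypothetical non-fatal variant of the conversion: returns std::nullopt for
// out-of-range values and for the unsupported arms instead of aborting.
constexpr std::optional<int> ToDriverResource(int memory_space) {
  if (memory_space < 0 ||
      memory_space >= static_cast<int>(kDriverResourceByMemorySpace.size())) {
    return std::nullopt;
  }
  const int id = kDriverResourceByMemorySpace[static_cast<std::size_t>(memory_space)];
  if (id < 0) {
    return std::nullopt;
  }
  return id;
}
```

A caller would keep the MemorySpace value in every intermediate structure and call a conversion like this only while building the descriptor's address word.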
3. The Per-Space Allocator / Alignment Matrix
Purpose
Every tier is one tpu::BestFitAllocator constructed from a 32-byte MemoryAllocator::Config. The only per-tier differences are the four Config fields and (for HBM) a stricter compile-time alignment than the hardware DMA floor. This section gives the Config fields and the alignment rule per tier; the allocate/deallocate algorithm itself is identical across tiers and is documented once on hbm-allocator.md.
The Config struct (one per tier)
struct tpu::MemoryAllocator::Config { // 32 B, passed by const&
int64_t base_offset_in_bytes_; // +0 ≥ 0 (0 for every on-chip tier)
int64_t allocatable_range_end_; // +8 > 0 (capacity = end − base)
int64_t alignment_in_bytes_; // +16 > 0, power of two, multiple of granule
int64_t granule_in_bytes_; // +24 hardware granule (page / word)
};
The ctor (0x1e817500) asserts, as LogMessageFatal checks: base_offset_in_bytes_ >= 0, allocatable_range_end_ > 0, alignment_in_bytes_ > 0, alignment_in_bytes_ % granule_in_bytes_ == 0, and alignment_in_bytes_ is a power of two. These invariants hold for every tier — they are what let the allocator's round-up arithmetic ((size + align − (size!=0)) & −align, confirmed at the head of Allocate 0x1e817820) be a single AND.
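The constructor checks and the round-up expression quoted above can be written out as a short sketch. Field names follow the Config struct; IsValidConfig, IsPowerOfTwo, and RoundUpSize are hypothetical names, the validator returns false where the real constructor is described as aborting, and the extra granule > 0 check is an assumption added only to keep the modulo well-defined.

```cpp
#include <cstdint>

// Same four fields as the Config struct above, without the trailing underscores.
struct Config {
  std::int64_t base_offset_in_bytes;
  std::int64_t allocatable_range_end;
  std::int64_t alignment_in_bytes;
  std::int64_t granule_in_bytes;
};

constexpr bool IsPowerOfTwo(std::int64_t v) {
  return v > 0 && (v & (v - 1)) == 0;
}

// Mirrors the listed constructor checks, reporting failure as false.
constexpr bool IsValidConfig(const Config& c) {
  return c.base_offset_in_bytes >= 0 &&
         c.allocatable_range_end > 0 &&
         c.alignment_in_bytes > 0 &&
         c.granule_in_bytes > 0 &&  // assumption: guards the modulo below
         c.alignment_in_bytes % c.granule_in_bytes == 0 &&
         IsPowerOfTwo(c.alignment_in_bytes);
}

// The quoted expression: (size + align - (size != 0)) & -align.
// Valid only when align is a power of two, which the checks above guarantee.
constexpr std::int64_t RoundUpSize(std::int64_t size, std::int64_t align) {
  return (size + align - (size != 0 ? 1 : 0)) & -align;
}
```

Evaluated as written, the expression rounds a zero-byte request up to one full alignment unit rather than to zero, because the subtracted term vanishes when size is 0.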
Per-tier Config and alignment
| Tier | base_offset | end (capacity) | alignment | granule |
|---|---|---|---|---|
| HBM | 0 | chip_parts.binarypb HBM bytes (− xla_tpu_user_reserved_hbm_bytes) | 16 KiB compile-time (xla_jf_program_hbm_alignment_in_kib=16); 1024 B runtime DMA floor | chip_parts HBM granule |
| VMEM | 0 | Target::VmemSizeBytes() (Target+0x458) or xla_tpu_override_vmem_size_kib | VmemAlignmentBoundaryInBytes() — ChunkBytes (JF) / max(Granule, VmemWord) (PF/VF/GL) | VmemWordSizeBytes() (Target+0x50C) |
| CMEM | 0 | Target::CmemSizeBytes() (Target+0x460) | CmemWordSizeBytes() | CmemWordSizeBytes() (Target+0x510, ~16 B PF) |
| SMEM | 0 | Target::SmemSizeBytes() (Target+0x470) | SmemWordSizeBytes() | SmemWordSizeBytes() (Target+0x508) |
| SFLAG | 0 | Target::SflagSizeBytes() (Target+0x468) | SflagWordSizeBytes() (Target+0x504) | SflagWordSizeBytes() |
| Host (premapped) | per-partition partition_size * i | partition_size | 4 KiB if size ≤ 2 MiB, else 2 MiB (PickPageAlignment) | = alignment |
| Host (BFC offload) | 0 | 256 GiB cap (0x40'0000'0000, the tsl::BFCAllocator ctor size arg) | ≥ 16 B (posix_memalign) | 2 MiB region growth |
GOTCHA: HBM has two alignment numbers, and confusing them silently corrupts a DMA. kHbmMinimumDmaAlignment = 1024 B is the hardware floor: every DMA issue site masks size and address with & 0x3FF and raises a LogMessageFatal on a non-zero remainder (byte_offset % jf_driver::kHbmMinimumDmaAlignment == 0, size % … == 0, in WritePremappedHbm @ 0xe73db80). The 16 KiB compile-time figure (xla_jf_program_hbm_alignment_in_kib) is stricter: it rounds every program-level HBM tensor up to 16 KiB before MSA places it, to accommodate XLA's stride/sub-tile addressing and slice-prefetch boundaries. A reimplementer who aligns HBM allocations to 1024 B at compile time will produce a layout MSA's slice machinery cannot address; one who enforces 16 KiB at DMA-issue time wastes nothing but is needlessly strict. The 1024-B floor is the wire contract; the 16-KiB rule is the placement contract. See hbm-dma-alignment.md.
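The two HBM alignment rules can be kept apart in code by giving each its own constant and its own check. This is a minimal sketch: kHbmMinimumDmaAlignment is the name quoted above, while kProgramHbmAlignment, IsDmaAligned, and RoundUpToProgramAlignment are hypothetical names, and the 16 KiB value assumes the flag is left at its default of 16.

```cpp
#include <cstdint>

// Wire contract: the hardware DMA floor.
inline constexpr std::uint64_t kHbmMinimumDmaAlignment = 1024;
// Placement contract: assumes xla_jf_program_hbm_alignment_in_kib == 16.
inline constexpr std::uint64_t kProgramHbmAlignment = 16 * 1024;

// DMA-issue-time check: both offset and size must be multiples of 1024 B,
// tested with the same & 0x3FF mask the text describes.
constexpr bool IsDmaAligned(std::uint64_t byte_offset, std::uint64_t size) {
  constexpr std::uint64_t mask = kHbmMinimumDmaAlignment - 1;  // 0x3FF
  return (byte_offset & mask) == 0 && (size & mask) == 0;
}

// Compile-time placement: round a tensor size up to the next 16 KiB multiple.
constexpr std::uint64_t RoundUpToProgramAlignment(std::uint64_t size) {
  return (size + kProgramHbmAlignment - 1) & ~(kProgramHbmAlignment - 1);
}
```

Because 16 KiB is a multiple of 1024 B, anything placed under the compile-time rule passes the DMA check automatically; the reverse does not hold.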
NOTE: CMEM, SMEM, and SFLAG set alignment == granule == <tier>WordSizeBytes(), and every on-chip tier, VMEM included, sets base_offset == 0: each starts at sub-tile address 0, and a single allocation is always one aligned run. VMEM takes its alignment from VmemAlignmentBoundaryInBytes(), which per the matrix above can exceed its word-size granule. HBM likewise separates alignment from granule (16 KiB alignment over a smaller hardware granule), and only the host premapped manager uses a non-zero base_offset (the per-partition slot base). The numeric word sizes per codename live in chip_parts.binarypb and are not in the C++; the formulas above are exact.
The host heap is not a tcmalloc
libtpu embeds no jemalloc and no tcmalloc (despite the page family name). The only OS-level allocation primitive reached is posix_memalign, wrapped two ways: tpu::PremappedMemoryManager partitions a single posix_memalign region into power-of-two partitions, each wrapping a per-partition BestFitAllocator under an absl::Mutex, round-robined for DMA staging; and tsl::BFCAllocator (the TF best-fit-with-coalescing allocator, ~1.2 KiB/instance, 21 size-class bins) backs only the HostOffloadingTpuAllocator (256 GiB cap) that receives HBM buffers MSA elected to spill to host RAM. Neither is the on-device allocator. See embedded-tcmalloc.md.
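The host-side rules from the matrix (page alignment chosen by size, partition bases at partition_size * i, everything backed by posix_memalign) can be sketched as follows. PickPageAlignment is the name quoted in the matrix, but its signature here is an assumption, and PartitionBase and AllocateHostRegion are hypothetical helpers that do not reproduce the locking or round-robin logic of PremappedMemoryManager.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)

#include <cstddef>
#include <cstdint>

inline constexpr std::size_t kSmallPageBytes = 4 * 1024;         // 4 KiB
inline constexpr std::size_t kLargePageBytes = 2 * 1024 * 1024;  // 2 MiB

// Rule from the matrix: 4 KiB if size <= 2 MiB, else 2 MiB.
constexpr std::size_t PickPageAlignment(std::size_t size) {
  return size <= kLargePageBytes ? kSmallPageBytes : kLargePageBytes;
}

// Base offset of partition i inside one contiguous region.
constexpr std::size_t PartitionBase(std::size_t partition_size, std::size_t i) {
  return partition_size * i;
}

// Hypothetical wrapper: allocates `size` bytes at the alignment picked above.
// Returns nullptr on failure; the caller releases the region with free().
inline void* AllocateHostRegion(std::size_t size) {
  void* ptr = nullptr;
  if (posix_memalign(&ptr, PickPageAlignment(size), size) != 0) {
    return nullptr;
  }
  return ptr;
}
```

Each partition would then get its own allocator whose Config carries PartitionBase(partition_size, i) as base_offset, which is the one place in the matrix where base_offset is non-zero.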
4. The Compile-Time → Runtime Hand-off
Purpose
The same offset that MSA chooses at compile time is the offset the runtime allocator hands back at load time. This is the spine that ties the per-tier pages together: every tier flows through the same seven-stage hand-off, differing only in which compile-time placer chose the offset (MSA for VMEM/CMEM, opcode semantics for SMEM, number-space partition for SFLAG).
Stages
Compile time (XLA core):
HeapSimulator::Run(GlobalDecreasingSizeBestFitHeap<HloValue>, …) 0x1e49dae0
└─ produces per-buffer Chunk{offset, size}
Compile time (XLA TPU layer):
MsaAlgorithm::Finish() 0x1dc5b560 ── HBM↔VMEM(↔CMEM) coloring only
└─ Allocation objects {Pinned / Copy / Prefetch / Scoped / …}
Compile time (jellyfish):
ProgramMemoryAllocator::AllocateBytes(MemorySpace, …) 0x1c629e40 ── one entry, branches on MS
└─ emits ProgramMemoryMetadata_Allocation{memory_space, offset, size, block_type, name}
Codegen:
the compiled XDB/LLO program embeds the proto (offsets symbolic until link)
────── load time ──────
ProgramMemoryAllocator::CreateFromProto(LloModule*, …, proto) 0x1c631f20
└─ per tier: TpuHal::GetAllocatorFactory() (this+0x48) 0x1e8139a0
└─ tpu::BestFitAllocator(Config{base=0, end=<tier>Size, align, granule})
Execution:
each compile-time Allocation → BestFitAllocator::Allocate(size) at the FROZEN offset
Deallocation:
BestFitAllocator::Deallocate(offset) — eager coalescing on every free
QUIRK: MSA only colors the HBM↔VMEM (and, on Pufferfish, HBM↔CMEM) axis. SMEM and SFLAG flow through the same ProgramMemoryAllocator → proto → BestFitAllocator spine but are never seen by MSA's kAlternate/kDefault decision: SMEM is committed wherever a scalar-load/store opcode declares MemorySpace=kSmem, and SFLAG is allocated out of a fixed number-space partition (GetStartReservedSyncFlagNumber 0x1d6178e0 reads Target+0x530; the overlay-reserved ceiling GetOverlayReservedSyncFlagNumber 0x1d617900 reads Target+0x534), not the byte heap. A reimplementer who routes SMEM/SFLAG through the MSA cost model will mis-place them; MSA's tier-balancing is a VMEM/CMEM-only concern. See smem-scalar-memory.md and sflag-protocol.md.
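Since the runtime side replays frozen offsets rather than choosing them, a reimplementation may want to check a tier's frozen entries against that tier's Config before replaying. The sketch below is one such check under stated assumptions: FrozenAllocation and TierConfig are simplified, hypothetical stand-ins for the proto entry and the Config, and CanReplay is not a function in the binary.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, reduced view of one frozen proto entry for a single tier.
struct FrozenAllocation {
  std::int64_t offset;
  std::int64_t size;
};

// Hypothetical, reduced view of one tier's Config.
struct TierConfig {
  std::int64_t base;       // base_offset_in_bytes
  std::int64_t end;        // allocatable_range_end
  std::int64_t alignment;  // alignment_in_bytes (must be > 0)
};

// Returns true when every frozen entry is aligned and lies inside [base, end].
// Overlap between entries is deliberately not rejected: buffers with disjoint
// lifetimes may legitimately share an offset.
inline bool CanReplay(const TierConfig& tier,
                      const std::vector<FrozenAllocation>& allocations) {
  for (const FrozenAllocation& a : allocations) {
    if (a.size < 0 || a.offset < tier.base) return false;
    if (a.offset % tier.alignment != 0) return false;
    if (a.offset + a.size > tier.end) return false;
  }
  return true;
}
```

A check of this kind catches a proto compiled for a different tier capacity or alignment before any allocation is replayed.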
Related Components
| Component | Relationship |
|---|---|
| msa-overview.md | The compile-time pass that colors HBM (kDefault) vs. VMEM/CMEM (kAlternate) and inserts the async copies |
| intra-chip-descriptor.md | The DMA descriptor whose (mem_id, core_id) endpoints render the MemorySpace enum via MemorySpaceToDriverResource |
| memory-space-enum.md | The 17-value LLO MemorySpace enum as it appears at the ISA / operand-tag level |
Cross-References
- hbm-allocator.md: the universal tpu::BestFitAllocator algorithm (best-fit + eager coalescing); the HBM tier and the two-stack (compile-time ProgramMemoryAllocator + runtime BestFitAllocator) model
- hbm-dma-alignment.md: the 1024-B kHbmMinimumDmaAlignment floor vs. the 16-KiB compile-time program alignment
- vmem-allocator.md: the kAlternate fast tier; per-generation VMEM size/word/bank/bandwidth Config and scoped-VMEM machinery
- cmem-pool.md: the Pufferfish-only read-mostly operand pool; xla_tpu_cmem_* MSA knobs; MemBanks(kCmem) LogFatal elsewhere
- smem-scalar-memory.md: the SPU's scalar memory; opcode-driven placement (not MSA); reserved top/bottom blocks
- smem-register-window.md: why no SMEM register window exists; SMEM as the SREG-file spill backing store (LSRA-v2)
- sflag-protocol.md: the sync-flag atomic tier; number-space partition, three SC sub-spaces, fence/ordering model
- embedded-tcmalloc.md: the host heap: no tcmalloc/jemalloc; PremappedMemoryManager and tsl::BFCAllocator over posix_memalign
- on-device-compaction.md: BestFitAllocator::Compact relocation; the repacker that reduces fragmentation
- buffer-donation-aliasing.md: kPinnedHbm and input/output aliasing that the repacker may not relocate
- tpu-buffer-layout.md: how a logical XLA buffer maps to physical offsets in these tiers
- msa-overview.md: Phase 7; the consumer of this taxonomy on the compile side
- intra-chip-descriptor.md: the wire view of the MemorySpace enum at the DMA boundary
- back to index: Part X, On-Chip Memory & DMA