Memory Hierarchy
Every size, offset, and enum value on this page was decoded byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
A TPU TensorCore sees a flat, software-managed scratchpad hierarchy, not a transparent cache. Five tiers make up the hierarchy, counting the off-chip store and the four on-chip scratchpads between it and the compute units: HBM (the device-global backing store), VMEM (the vector working set, the analogue of an L1/shared-memory scratchpad), SMEM (the scalar/sequencer scratchpad), SFLAG (the dedicated sync-flag tier the DMA engines and the barrier primitives poll), and CMEM (a Pufferfish-only second large scratchpad). There is no hardware coherence and no automatic eviction: a kernel explicitly DMAs tiles between tiers, and the compiler's allocators (Part X) place every buffer in exactly one space. This page is the orientation map for those tiers: what they are, how big each one is on each generation, and how the runtime represents them. It hands off to Part X for the allocator internals and to Part VI for the ISA-level addressing.
Every tier's runtime size and word geometry is one field of the single per-device xla::jellyfish::Target object. Target::Init fills those fields at boot from the embedded <codename>_chip_parts.binarypb proto (see Per-Codename Constants), and the rest of the compiler reads them through trivial accessors — Target::HbmSizeBytes() is literally return *(int64_t*)(this + 0x450). The same Target carries the bank counts (a separate set of C++ MemBanks literals, not proto fields) and the per-tier word size used to convert byte offsets to hardware word addresses. So the hierarchy is not an abstraction layered over the silicon constants — it is those constants, indexed by tier.
The tier set is named twice in the binary, by two distinct enumerations that a reimplementer must keep separate. The compiler-side xla::jellyfish::MemorySpace enum (decoded by MemorySpaceToString reading the pointer table at 0x21ce6b08, 17 region values, indices 0..16) names every space a buffer can live in, including the BarnaCore and SparseCore sub-spaces and host/pinned_hbm. The 0x21ce6b08 string array is physically longer than 17 entries: indices 17/18/19 resolve to absolute/heap_relative/stack_relative. Those are LloAddress pointer-relativity tags, not region-enum members, so the canonical enum has 17 values (see MemorySpace Enum). The LLVM-level address-space IDs (Address-Space IDs) are the numbers the LLO/LLVM IR carries in addrspace(N). The two are related but not identical. The cost-model DMA dispatcher adds a third, narrower numbering (Hbm=1, Vmem=3, Cmem=4, Smem=5) for the same physical tiers; its integers coincide with the MemorySpace values but cover only those four data tiers. This page anchors the physical tiers to all three.
For orientation, the contract is:
- The five physical tiers, their role, and which generations have each.
- The per-generation size / word / bank table, kept consistent with Per-Codename Constants, which is the authoritative source.
- The Target field map: which struct offset and accessor backs each tier, so the runtime view is reconstructable.
- The enum relationships: the MemorySpace int, the DMA-dispatcher int, and the address-space ID for each tier.
| Runtime carrier | xla::jellyfish::Target object (one per device); tiers at Target+0x450..+0x510 |
| Boot population | Target::Init (0x1d60fc20) from TpuChipParts::FromProto (0x20b1b400) |
| Tier accessors | HbmSizeBytes 0x1d615320, VmemSizeBytes 0x1d615e00, SmemSizeBytes 0x1d615e40, CmemSizeBytes 0x1d615e20, SflagSizeBytes 0x1d615e60 |
| Space enum | MemorySpaceToString (0x1d6ffae0) → table 0x21ce6b08 (17 region values, indices 0..16; table over-long with relativity tags at 17+) |
| DMA bandwidth dispatcher | Target::LocalDmaBandwidth(MemorySpace,MemorySpace) 0x1d6168e0 |
| Tiers | HBM · VMEM · SMEM · SFLAG · CMEM (v4-only) |
| Generations | jellyfish v2 / dragonfish v3 / pufferfish v4 / viperfish v5p·v5e / ghostlite v6e / 6acc60406 v7x |
The Five Tiers
Each tier is a physically distinct on-chip (or, for HBM, on-package) memory with its own width, capacity, and access cost. The diagram sketches the data paths a TensorCore sees; every arrow is an explicit DMA priced by LocalDmaBandwidth, never an implicit cache fill. The table below is the orientation summary; the per-generation sizes follow in the next section, and the cross-tier bandwidth matrix lives in the Memory Bandwidth & Latency Model.
off-package / per-die
+-------------------------------------------+
| HBM | Target+0x450 (int64)
| 8..190 GiB · word 32..1024 B · MS=1 | HbmSizeBytes 0x1d615320
+----+--------------------+-----------------+
| DMA | DMA (v4: staged via VMEM)
v v
per-TensorCore scratchpads +---------------------+
+-----------------------+ | CMEM (v4 only) | Target+0x460 (int64)
| VMEM |<----->| 128 MiB · word 512 | CmemSizeBytes 0x1d615e20
| 16..128 MiB | DMA | MS=4 | (==0 ⇒ absent)
| word 512 B · MS=3 | +---------------------+
+-----------+-----------+
| DMA
v
+-----------------------+ +-----------------------+
| SMEM | | SFLAG | Target+0x468 (int32)
| 16 KiB..1 MiB | | 1..16 KiB | SflagSizeBytes 0x1d615e60
| word 4 B · MS=5 | | word 4 B · MS=6 | (sync-flag words;
| Target+0x470 (int32) | | polled by DMA/barrier) BarnaCore SFLAG @+0x478)
+-----------------------+ +-----------------------+
scalar operands completion / sync
| Tier | Role | Scope | Word | First gen |
|---|---|---|---|---|
| HBM | Device-global backing store; all program inputs/outputs, spill, embeddings | whole chip / per-die | 32 B – 1024 B (per gen) | v2 (all) |
| VMEM | Vector working set — the tiled scratchpad MXU/VPU operands stage through | per TensorCore | 512 B (all gens) | v2 (all) |
| SMEM | Scalar / sequencer scratchpad — addresses, loop bounds, scalar operands | per TensorCore | 4 B (all gens) | v2 (all) |
| SFLAG | Sync-flag tier — the DMA/barrier completion words polled for cross-engine sync | per TensorCore | 4 B (all gens) | v2 (all) |
| CMEM | Second large scratchpad (SharedMemory[CMEM]) — staging buffer above VMEM | whole chip | 512 B | v4 only |
NOTE: the tiers are software-managed scratchpads, not caches. A reimplementation that models VMEM/SMEM as a coherent cache backed by HBM is wrong: the only data path between tiers is an explicit DMA, priced by LocalDmaBandwidth. The compiler chooses each buffer's home tier; nothing migrates a buffer at runtime.
HBM — device-global store
HBM is the only tier large enough to hold a whole program's tensors. Its size, word, and clock are proto fields (SharedMemory[HBM]), materialized into Target+0x450 (int64, HbmSizeBytes), with the per-stack clock at Target+0x910 and full-chip bandwidth at Target+0x4f0 (HbmFullChipBytesPerSecond). The HBM word narrows across generations: 1024 B on v2/v3, 512 B on v4, then 32 B from v5p on, with v5e the exception at 512 B (see the per-generation table). The HBM word is the same value the DMA path uses as its granule (DmaRequirements.granule_bytes); see HBM DMA Alignment. The allocator that carves HBM is HBM Allocator; the embedded tcmalloc that sub-allocates host-visible regions is Embedded tcmalloc.
VMEM — vector working set
VMEM is the per-TensorCore scratchpad the MXU and vector units read their operands from; it is the TPU's analogue of GPU shared memory, but explicitly allocated by the compiler rather than declared per-kernel. VmemSizeBytes (0x1d615e00) reads Target+0x458 as an int32 (a movslq sign-extends it), and the VMEM word is a fixed 512 B on every generation (Target+0x50c, VmemWordSizeBytes). VMEM capacity is 16 MiB through v4, 64 MiB on v5p, 128 MiB on v5e and v6e, and 64 MiB again on v7x, so it does not grow monotonically. The placement/coloring engine is VMEM Allocator, and the per-buffer tile shape it consumes is TPU Buffer Layout.
SMEM — scalar scratchpad
SMEM holds scalars: loop induction variables, computed addresses, sequencer constants. It is tiny relative to VMEM (16 KiB on v2/v3, 1 MiB from v4) and addressed in 4-byte words (Target+0x508, SmemWordSizeBytes). SmemSizeBytes is at Target+0x470 (int32). The register-window view of SMEM and its allocator are SMEM Register Window and SMEM Scalar Memory.
SFLAG — sync-flag tier
SFLAG is a dedicated small memory of 4-byte sync-flag words that the DMA engines increment on completion and that barrier/wait primitives poll. It is first-class — it has its own Target size field (Target+0x468, int32, SflagSizeBytes 0x1d615e60), its own word size (Target+0x504) and log2 (Target+0x4c8), and its own bank ladder — exactly so the sync protocol can address it without colliding with data tiers. SFLAG is 1 KiB on v2/v3, 2 KiB on v4–v6e, and jumps to 16 KiB on v7x. The wait/clear protocol that consumes it is SFLAG Protocol, and the barrier binding is Barrier-to-SFLAG Binding.
QUIRK: there are two physical SFLAG tiers in Target, not one. The main chip SFLAG (size +0x468, word +0x504) coexists with a separate BarnaCore SFLAG (size Target+0x478, accessor BarnaCoreSflagSizeBytes 0x1d615f80) that only exists when the HasBarnaCore vtable predicate (vtable+0x258) is true, i.e. on v2/v3/v4. BarnaCore also owns its own SMEM (BarnaCoreSmemSizeBytes, base BarnaCoreSmemBaseBytes). A reimplementer targeting the v2–v4 embedding path must allocate both SFLAG regions; targeting v5p+ there is only the one.
CMEM — the Pufferfish second scratchpad
CMEM is a 128 MiB chip-level SharedMemory[CMEM] present only on Pufferfish (v4). CmemSizeBytes (0x1d615e20) reads Target+0x460 (int64); on every other generation that field is 0, and the canonical CMEM-presence test in the binary is exactly target().CmemSizeBytes() > 0. The CMEM pool allocator and its VMEM-staged transfer model are CMEM Pool.
GOTCHA: "CMEM is v4-only" is encoded three independent ways, and all three must agree in a reimplementation: (1) CmemSizeBytes is 0 except on v4; (2) only PufferfishTarget::MemBanks has a CMEM (space-4) entry, and every other gen LOG(FATAL)s on it; (3) only the v4 chip_parts blob carries a SharedMemory[CMEM]. There is also no LocalDmaBandwidthHbmToCmem virtual: the cost-model dispatcher routes (Hbm,Cmem) into the VmemToVmem slot, because HBM→CMEM is modeled as HBM→VMEM→CMEM staging.
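The three-way agreement described in the GOTCHA can be sketched as a consistency check. This is a minimal illustration, not code from the binary: the function names, the dict-based bank table, and the set of shared-memory names are hypothetical stand-ins for `CmemSizeBytes`, the `MemBanks` override, and the `chip_parts` proto.

```python
CMEM_SPACE = 4  # MemorySpace value for CMEM, as tabulated on this page


def cmem_presence_votes(cmem_size_bytes, bank_table, shared_memory_names):
    """Return the three independent 'has CMEM' signals as booleans."""
    return (
        cmem_size_bytes > 0,            # (1) CmemSizeBytes() > 0
        CMEM_SPACE in bank_table,       # (2) MemBanks has a space-4 entry
        "CMEM" in shared_memory_names,  # (3) chip_parts has SharedMemory[CMEM]
    )


def has_cmem(cmem_size_bytes, bank_table, shared_memory_names):
    """Return CMEM presence, refusing configurations where the signals disagree."""
    votes = cmem_presence_votes(cmem_size_bytes, bank_table, shared_memory_names)
    if len(set(votes)) != 1:
        raise ValueError(f"inconsistent CMEM encoding: {votes}")
    return votes[0]


# v4-like configuration: all three signals say "present".
V4_LIKE = (128 << 20, {3: 16, 4: 32, 5: 8}, {"HBM", "CMEM"})
# v2-like configuration: all three signals say "absent".
V2_LIKE = (0, {3: 8, 5: 2}, {"HBM"})
```

Calling `has_cmem(*V4_LIKE)` returns `True` and `has_cmem(*V2_LIKE)` returns `False`; a model that sets a nonzero CMEM size without a space-4 bank entry is rejected rather than silently accepted.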
Per-Generation Sizes
This is the orientation copy of the size/word/bank rows; the authoritative, fully-sourced master table (with bandwidth, clocks, MXU geometry, register files, SparseCore tiers) is Per-Codename Constants. The values here are the proto-decoded sizes (bytes_per_word × word_count) for the data tiers and the C++ MemBanks literals for the bank counts. All CONFIRMED unless flagged.
| Tier / field | v2 JF | v3 DF | v4 PF (std / lite) | v5p / v5e VF | v6e GL | v7x |
|---|---|---|---|---|---|---|
| HBM size | 16 GiB | 32 GiB | 32 / 8 GiB | 96 / 16 GiB | 31.5 GiB | 95 / 190 GiB |
| HBM word | 1024 B | 1024 B | 512 B | 32 / 512 B | 32 B | 32 B |
| VMEM / TensorCore | 16 MiB | 16 MiB | 16 MiB | 64 / 128 MiB | 128 MiB | 64 MiB |
| VMEM word | 512 B | 512 B | 512 B | 512 B | 512 B | 512 B |
| SMEM / TensorCore | 16 KiB | 16 KiB | 1 MiB | 1 MiB | 1 MiB | 1 MiB |
| SMEM word | 4 B | 4 B | 4 B | 4 B | 4 B | 4 B |
| SFLAG / TensorCore | 1 KiB | 1 KiB | 2 KiB | 2 KiB | 2 KiB | 16 KiB |
| SFLAG word | 4 B | 4 B | 4 B | 4 B | 4 B | 4 B |
| CMEM (chip) | absent | absent | 128 MiB | absent | absent | absent |
| VMEM banks | 8 | 8 | 16 | 32 | 32 | 32 |
| SMEM banks | 2 | 2 | 8 | 8 | 8 | 8 |
| CMEM banks | FATAL | FATAL | 32 | FATAL | FATAL | FATAL |
Three discontinuities are worth memorizing because they break a "scale everything linearly" reimplementation:
- SMEM jumps 64× at v4 (16 KiB → 1 MiB) and then holds flat — the scalar scratchpad stopped being a bottleneck after Pufferfish.
- SFLAG jumps 8× at v7x (2 KiB → 16 KiB), reflecting the much larger sync-flag fan-out of the v7 collective fabric.
- VMEM is non-monotonic — it peaks at 128 MiB on the single-TensorCore lite/v6e dies (which pack more VMEM per core) and is back down to 64 MiB on v7x. VMEM-per-TensorCore is not a proxy for generation.
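The three discontinuities can be checked arithmetically against the table. The sketch below transcribes the SMEM, SFLAG, and VMEM rows; the generation ordering (v5p before v5e) simply follows the table's column order and is an assumption of this example, as are the helper names.

```python
KIB, MIB = 1 << 10, 1 << 20

# Column order of the per-generation table (v4 std/lite share these rows).
GENS = ("v2", "v3", "v4", "v5p", "v5e", "v6e", "v7x")

SMEM_BYTES = dict(zip(GENS, (16 * KIB, 16 * KIB, MIB, MIB, MIB, MIB, MIB)))
SFLAG_BYTES = dict(zip(GENS, (KIB, KIB, 2 * KIB, 2 * KIB, 2 * KIB, 2 * KIB, 16 * KIB)))
VMEM_BYTES = dict(
    zip(GENS, (16 * MIB, 16 * MIB, 16 * MIB, 64 * MIB, 128 * MIB, 128 * MIB, 64 * MIB))
)


def jumps(sizes):
    """Map each generation whose size differs from its predecessor to the ratio."""
    out = {}
    for prev, cur in zip(GENS, GENS[1:]):
        if sizes[cur] != sizes[prev]:
            out[cur] = sizes[cur] / sizes[prev]
    return out


def is_monotonic(sizes):
    """True if the size never shrinks from one table column to the next."""
    values = [sizes[g] for g in GENS]
    return all(a <= b for a, b in zip(values, values[1:]))


print(jumps(SMEM_BYTES))         # single jump at v4
print(jumps(SFLAG_BYTES))        # small step at v4, large step at v7x
print(is_monotonic(VMEM_BYTES))  # VMEM shrinks again at v7x
```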
NOTE: the bank counts are not proto fields. They are C++ literals in the per-codename *Target::MemBanks overrides: JellyfishTarget (0x1d48fc80) returns 8/–/2 for spaces 3/–/5; PufferfishTarget (0x1d493900) indexes the .rodata table 0xB5305C8 = {16, 32, 8} for spaces 3/4/5; Dragonfish overrides none and inherits Jellyfish. A MemBanks call on a space the generation lacks is a hard LOG(FATAL), not a zero, which is how CMEM-on-v2 is made unrepresentable rather than merely empty.
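A minimal sketch of the "fatal, not zero" behavior, using the bank literals quoted in the note. The `FatalError` exception stands in for `LOG(FATAL)` (which aborts the process in the real binary), and the dict tables and `mem_banks` helper are hypothetical names for illustration only.

```python
class FatalError(RuntimeError):
    """Stand-in for LOG(FATAL); the real call terminates the process."""


VMEM, CMEM, SMEM = 3, 4, 5  # MemorySpace values for the banked tiers

JELLYFISH_BANKS = {VMEM: 8, SMEM: 2}              # no CMEM entry at all
DRAGONFISH_BANKS = JELLYFISH_BANKS                # inherits Jellyfish unchanged
PUFFERFISH_BANKS = {VMEM: 16, CMEM: 32, SMEM: 8}  # the {16, 32, 8} table


def mem_banks(bank_table, space):
    """Return the bank count, failing hard on a space the generation lacks."""
    if space not in bank_table:
        raise FatalError(f"no banks defined for memory space {space}")
    return bank_table[space]
```

The design point is that a missing tier is an error at lookup time rather than a zero that later arithmetic could divide by or silently propagate.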
The Target Field Map
The runtime view of the whole hierarchy is one contiguous block of Target struct fields filled by Target::Init from the chip-parts proto. Every tier accessor is a one-instruction getter, verified against the decompiled bodies: HbmSizeBytes is return *((int64_t*)this + 138) (= +0x450), VmemSizeBytes is return *((int*)this + 278) (= +0x458), SflagSizeBytes is return *((uint*)this + 282) (= +0x468). The map below is the field layout a reimplementation must reproduce so the same accessors resolve to the same offsets.
| Target off | Accessor (VA) | Type | Tier datum |
|---|---|---|---|
| +0x438 / +0x448 | user-alloc shared-mem limit clamp | int64 | HBM / scoped (CMEM) user-alloc cap |
| +0x450 | HbmSizeBytes (0x1d615320) | int64 | HBM size |
| +0x458 | VmemSizeBytes (0x1d615e00, movslq) | int32 | VMEM size |
| +0x460 | CmemSizeBytes (0x1d615e20) | int64 | CMEM size (0 ⇒ absent) |
| +0x468 | SflagSizeBytes (0x1d615e60) | int32 | SFLAG size |
| +0x470 | SmemSizeBytes (0x1d615e40) | int32 | SMEM size |
| +0x478 | BarnaCoreSflagSizeBytes (0x1d615f80) | int32 | BarnaCore SFLAG (v2–v4, gated by HasBarnaCore) |
| +0x4c8 / +0x4cc / +0x4d0 | Sflag/Smem/Vmem WordSizeLog2 | int32 | per-tier word log2 (byte→word shift) |
| +0x4f0 | HbmFullChipBytesPerSecond (0x1d6172a0) | int64 | HBM bandwidth |
| +0x4f8 | CmemFullChipBytesPerSecond (0x1d6172c0) | int64 | CMEM bandwidth (0 off-v4) |
| +0x504 / +0x508 / +0x50c / +0x510 | Sflag/Smem/Vmem/Cmem WordSizeBytes | int32 | per-tier word size |
| +0x90c / +0x910 | TC freq / HBM freq MHz | int32 | clocks (boot-filled, 0xFFFFFFFF sentinel pre-init) |
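As an illustration of the field map, the sketch below reads the tier sizes out of a raw byte image of a Target object. The offsets and widths come from the table; the `TargetImage` class, the little-endian layout, and the synthetic v4 image are assumptions made for the example and do not correspond to names in the binary.

```python
import struct

# accessor -> (offset, struct format). "<q" is a little-endian int64 and
# "<i" a little-endian int32 (sign-extended, like the movslq in
# VmemSizeBytes). Little-endian is assumed here (x86-64 host build).
FIELDS = {
    "HbmSizeBytes": (0x450, "<q"),
    "VmemSizeBytes": (0x458, "<i"),
    "CmemSizeBytes": (0x460, "<q"),
    "SflagSizeBytes": (0x468, "<i"),
    "SmemSizeBytes": (0x470, "<i"),
    "BarnaCoreSflagSizeBytes": (0x478, "<i"),
}


class TargetImage:
    """Hypothetical read-only view over a dumped Target object."""

    def __init__(self, raw: bytes):
        self.raw = raw

    def read(self, accessor: str) -> int:
        offset, fmt = FIELDS[accessor]
        return struct.unpack_from(fmt, self.raw, offset)[0]


def make_v4_image() -> bytes:
    """Build a synthetic image holding the v4 (std) sizes from this page."""
    buf = bytearray(0x520)
    struct.pack_into("<q", buf, 0x450, 32 << 30)   # HBM 32 GiB
    struct.pack_into("<i", buf, 0x458, 16 << 20)   # VMEM 16 MiB
    struct.pack_into("<q", buf, 0x460, 128 << 20)  # CMEM 128 MiB
    struct.pack_into("<i", buf, 0x468, 2 << 10)    # SFLAG 2 KiB
    struct.pack_into("<i", buf, 0x470, 1 << 20)    # SMEM 1 MiB
    return bytes(buf)


image = TargetImage(make_v4_image())
print(hex(image.read("HbmSizeBytes")), image.read("CmemSizeBytes") > 0)
```

The decompiled index forms agree with these byte offsets: element 138 of an int64 array is byte 0x450, and elements 278 and 282 of an int32 array are bytes 0x458 and 0x468.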
The word-size pairs (WordSizeBytes + WordSizeLog2) exist because every tier address the ISA produces is a word address, not a byte address: a buffer's byte size is converted to a hardware word count by >> WordSizeLog2, and the allocator asserts each buffer is a multiple of the tier word. The bounds-check assertions that gate addressing are visible verbatim in the binary as byte_address < target().HbmSizeBytes(), < target().VmemSizeBytes(), < target().SmemSizeBytes(), < target().SflagSizeBytes() — one per data tier — which is the cleanest evidence that each tier's Target size field is the addressing ceiling, not merely a capacity hint.
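The byte-to-word conversion and the bounds check can be sketched as follows. The log2 values are derived here from the 512 B and 4 B word sizes rather than read from the binary, the function names are hypothetical, and the alignment check on the address is an assumption of this sketch (the source only states that buffer sizes must be word multiples).

```python
VMEM_WORD_LOG2 = 9  # 512-byte words (1 << 9)
SMEM_WORD_LOG2 = 2  # 4-byte words (1 << 2); SFLAG uses the same width


def word_count(size_bytes: int, word_log2: int) -> int:
    """Convert a buffer's byte size to a hardware word count."""
    if size_bytes % (1 << word_log2):
        raise ValueError("buffer size is not a multiple of the tier word")
    return size_bytes >> word_log2


def to_word_address(byte_address: int, tier_size_bytes: int, word_log2: int) -> int:
    """Bounds-check a byte address against the tier size, then shift to words."""
    if not 0 <= byte_address < tier_size_bytes:
        raise ValueError("byte_address outside the tier")  # mirrors the assertion
    if byte_address % (1 << word_log2):
        raise ValueError("byte_address is not word-aligned")  # assumed check
    return byte_address >> word_log2


# A 4 KiB VMEM buffer occupies 8 words of 512 B each.
print(word_count(4096, VMEM_WORD_LOG2))
# Byte 16 of a 16 KiB SMEM is scalar word 4.
print(to_word_address(16, 16 << 10, SMEM_WORD_LOG2))
```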
Tiers, MemorySpace, and Address Spaces
A reimplementer hits three distinct numberings for the same physical tiers. Keeping them straight is the single most error-prone part of the memory model, so the tiers (the five TensorCore data tiers plus the BarnaCore / SparseCore sub-spaces that share the enum) are tabulated together here against the MemorySpace int and the narrower DMA-dispatcher int; the full 17-value enum and the LLVM address-space band live on their own pages.
The three numberings
The compiler-side xla::jellyfish::MemorySpace enum is the string table at 0x21ce6b08, indexed directly by the enum value (MemorySpaceToString(e) is literally off_21CE6B08[e]). The cost-model DMA dispatcher Target::LocalDmaBandwidth uses a narrower set of the same integers — its decompiled body XORs the argument against 1 (Hbm), 3 (Vmem), 4 (Cmem), 5 (Smem) to pick a vtable offset, confirming those four enum values for the data tiers. The LLVM-level address-space IDs are a separate numbering carried in addrspace(N) and detailed on Address-Space IDs.
| Physical tier | MemorySpace name | enum int | DMA-dispatcher int |
|---|---|---|---|
| HBM | hbm (also kDefault) | 1 | 1 |
| HIB (host-interface) | hib | 2 | — |
| VMEM | vmem | 3 | 3 |
| CMEM | cmem | 4 | 4 |
| SMEM | smem | 5 | 5 |
| SFLAG | sflag | 6 | — |
| IMEM (instr.) | imem | 7 | — |
| BarnaCore SMEM | barna_core_smem | 9 | — |
| BarnaCore SFLAG | barna_core_sflag | 10 | — |
| SC sequencer SFLAG | sparse_core_sequencer_sflag | 12 | — |
| host | host | 13 | — |
| SC sequencer SMEM | sparse_core_sequencer_smem | 14 | — |
| pinned HBM | pinned_hbm | 16 | — |
NOTE: the canonical enum assignment is hib = 2, sflag = 6, imem = 7, all CONFIRMED; there is no extra slot near hib. Four independent byte-exact probes pin it: the MemorySpaceToString flat lookup (string table 0x21ce6b08), the MemorySpaceToDriverResource (0x1d6223e0) per-case switch, the MakeCmemConstant / MakeSparseCoreSequencerSmemConstant ctors, and the MemBanks overrides (see MemorySpace Enum and Memory-Space Master Table). The data-tier values (Hbm 1, Vmem 3, Cmem 4, Smem 5) are independently confirmed by the LocalDmaBandwidth XOR constants.
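The numbering in the table can be captured as a small enum plus the dispatcher subset. This sketch lists only the values tabulated on this page; indices not shown in the table (0, 8, 11, 15) are deliberately left out rather than guessed, and the Python member names and helper functions are hypothetical.

```python
from enum import IntEnum


class MemorySpace(IntEnum):
    """Region values tabulated on this page (a partial view of indices 0..16)."""

    HBM = 1
    HIB = 2
    VMEM = 3
    CMEM = 4
    SMEM = 5
    SFLAG = 6
    IMEM = 7
    BARNA_CORE_SMEM = 9
    BARNA_CORE_SFLAG = 10
    SPARSE_CORE_SEQUENCER_SFLAG = 12
    HOST = 13
    SPARSE_CORE_SEQUENCER_SMEM = 14
    PINNED_HBM = 16


# The four data tiers the cost-model DMA dispatcher distinguishes.
DMA_DISPATCHER_SPACES = frozenset(
    {MemorySpace.HBM, MemorySpace.VMEM, MemorySpace.CMEM, MemorySpace.SMEM}
)

# String-table entries past the enum: pointer-relativity tags, not regions.
RELATIVITY_TAGS = {17: "absolute", 18: "heap_relative", 19: "stack_relative"}


def is_region_index(index: int) -> bool:
    """True for indices inside the canonical 17-value region enum."""
    return 0 <= index <= 16


def dma_dispatcher_int(space: MemorySpace):
    """Return the dispatcher integer, or None for spaces it does not handle."""
    return int(space) if space in DMA_DISPATCHER_SPACES else None
```

For example, `dma_dispatcher_int(MemorySpace.VMEM)` yields 3, while `dma_dispatcher_int(MemorySpace.SFLAG)` yields `None`, matching the dashes in the table's last column.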
SFLAG's sub-spaces
SFLAG is the one tier that splits into scoped sub-spaces at the MLIR level, because synchronization happens at three granularities: the global cross-engine sflag (the only one reachable cross-CORE — TC↔SC, SC↔SC, TC↔TC, and remote), the per-tile sflag_tile (TEC sub-engine scope), and the per-SCS sflag_scs. These are distinct mlir::sparse_core::MemorySpace values, asserted by the verifier; they are detailed in SFLAG Protocol, not duplicated here. The point for the hierarchy map is that the single physical SFLAG tier backs all three scopes — the sub-spaces are addressing scopes, not separate memories.
SparseCore tiers
From v5p the SparseCore brings its own memory family — SPMEM, TILESPMEM, and per-SCS SMEM/SFLAG — that is parallel to, not part of, the TensorCore hierarchy described here. Those sizes are tabulated in Per-Codename Constants (the SC rows) and the BarnaCore↔SparseCore pivot is BarnaCore overview. They are out of scope for this TensorCore-tier orientation page beyond noting that SPMEM→HBM is a separate DMA-bandwidth virtual not reachable through the TensorCore LocalDmaBandwidth dispatcher.
Related Components
| Name | Relationship |
|---|---|
| Target::Init | fills Target+0x450..+0x510 from chip_parts; the boot source of every tier size |
| TpuChipParts::FromProto (0x20b1b400) | decodes the chip_parts proto whose fields become the tier sizes |
| *Target::MemBanks | C++ source of the per-tier bank counts (the only non-proto integers here) |
| Target::LocalDmaBandwidth (0x1d6168e0) | the (src,dst) tier → bandwidth dispatcher; the consumer of the tier enum ints |
| MemorySpaceToString (0x1d6ffae0) | the enum→name table that names every tier and sub-space |
Cross-References
- Per-Codename Constants: the authoritative, fully-sourced master table these sizes are drawn from (bandwidth, clocks, MXU geometry, SparseCore tiers)
- Address-Space IDs: the LLVM addrspace(N) numbering for these tiers, incl. the SparseCore fat-pointer bands
- MemorySpace Enum: the full 17-value MemorySpace region enum (0..16), its string table, and the proto↔enum remap
- MemorySpace Table: the same enum as a reference appendix
- Memory Subsystem Overview: Part X entry point, covering how the tiers are allocated and managed
- HBM Allocator: the device-global store allocator
- VMEM Allocator: the vector-scratchpad placement/coloring engine
- SMEM Scalar Memory: the scalar-tier allocator and register window
- SFLAG Protocol: the sync-flag wait/clear protocol that drives SFLAG
- CMEM Pool: the Pufferfish-only second-scratchpad pool
- Memory Bandwidth & Latency Model: the cross-tier LocalDmaBandwidth matrix and DMA latencies