Memory Hierarchy

Every size, offset, and enum value on this page was decoded byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

A TPU TensorCore sees a flat, software-managed scratchpad hierarchy, not a transparent cache. It has five tiers: the off-chip HBM (the device-global backing store) plus four on-chip memories that sit between HBM and the compute units. These are VMEM (the vector working set, the analogue of an L1/shared-memory scratchpad), SMEM (the scalar/sequencer scratchpad), SFLAG (the dedicated sync-flag tier that the DMA engines signal and the barrier primitives poll), and CMEM (a Pufferfish-only second large scratchpad). There is no hardware coherence and no automatic eviction: a kernel explicitly DMAs tiles between tiers, and the compiler's allocators (Part X) place every buffer in exactly one space. This page is the orientation map for those tiers: what they are, how big each one is on each generation, and how the runtime represents them. It hands off to Part X for the allocator internals and to Part VI for the ISA-level addressing.

Every tier's runtime size and word geometry is one field of the single per-device xla::jellyfish::Target object. Target::Init fills those fields at boot from the embedded <codename>_chip_parts.binarypb proto (see Per-Codename Constants), and the rest of the compiler reads them through trivial accessors: Target::HbmSizeBytes() is literally an 8-byte load at byte offset 0x450, i.e. return *(int64_t*)((char*)this + 0x450). The same Target carries the bank counts (a separate set of C++ MemBanks literals, not proto fields) and the per-tier word size used to convert byte offsets to hardware word addresses. So the hierarchy is not an abstraction layered over the silicon constants; it is those constants, indexed by tier.

The tier set is named twice in the binary, by two distinct enumerations that a reimplementer must keep separate. The compiler-side xla::jellyfish::MemorySpace enum (decoded by MemorySpaceToString reading the pointer table at 0x21ce6b08, 17 region values, indices 0..16) names every space a buffer can live in, including the BarnaCore and SparseCore sub-spaces and host/pinned_hbm. The 0x21ce6b08 string array is physically longer than 17 — indices 17/18/19 resolve to absolute/heap_relative/stack_relative — but those are LloAddress pointer-relativity tags, not region-enum members, so the canonical enum is 17 (see MemorySpace Enum). The LLVM-level address-space IDs (Address-Space IDs) are the numbers the LLO/LLVM IR carries in addrspace(N). The two are related but not identical, and the cost-model DMA dispatcher uses yet a third, narrower numbering (Hbm=1, Vmem=3, Cmem=4, Smem=5) for the same physical tiers. This page anchors the physical tiers to all three.

For orientation, the contract is:

  • The five physical tiers, their role, and which generations have each.
  • The per-generation size / word / bank table — kept consistent with Per-Codename Constants, which is the authoritative source.
  • The Target field map — which struct offset and accessor backs each tier, so the runtime view is reconstructable.
  • The enum relationships: the MemorySpace int, the DMA-dispatcher int, and the address-space ID for each tier.

| Item | Value |
|---|---|
| Runtime carrier | xla::jellyfish::Target object (one per device); tiers at Target+0x450..+0x510 |
| Boot population | Target::Init (0x1d60fc20) from TpuChipParts::FromProto (0x20b1b400) |
| Tier accessors | HbmSizeBytes 0x1d615320, VmemSizeBytes 0x1d615e00, SmemSizeBytes 0x1d615e40, CmemSizeBytes 0x1d615e20, SflagSizeBytes 0x1d615e60 |
| Space enum | MemorySpaceToString (0x1d6ffae0) → table 0x21ce6b08 (17 region values, indices 0..16; table over-long with relativity tags at 17+) |
| DMA bandwidth dispatcher | Target::LocalDmaBandwidth(MemorySpace,MemorySpace) 0x1d6168e0 |
| Tiers | HBM · VMEM · SMEM · SFLAG · CMEM (v4-only) |
| Generations | jellyfish v2 / dragonfish v3 / pufferfish v4 / viperfish v5p·v5e / ghostlite v6e / 6acc60406 v7x |

The Five Tiers

Each tier is a physically distinct on-chip (or, for HBM, on-package) memory with its own width, capacity, and access cost. The diagram sketches the data paths a TensorCore sees; every arrow is an explicit DMA priced by LocalDmaBandwidth, never an implicit cache fill. The table below is the orientation summary; the per-generation sizes follow in the next section, and the cross-tier bandwidth matrix lives in the Memory Bandwidth & Latency Model.

                  off-chip, on-package / per-die
            +-------------------------------------------+
            |                  HBM                      |  Target+0x450 (int64)
            |   8..190 GiB · word 32..1024 B · MS=1     |  HbmSizeBytes 0x1d615320
            +----+--------------------+-----------------+
                 | DMA                | DMA (v4: staged via VMEM)
                 v                    v
   per-TensorCore scratchpads     +---------------------+
  +-----------------------+       |   CMEM  (v4 only)   |  Target+0x460 (int64)
  |   VMEM                |<----->|  128 MiB · word 512 |  CmemSizeBytes 0x1d615e20
  |  16..128 MiB          |  DMA  |  MS=4               |  (==0 ⇒ absent)
  |  word 512 B · MS=3    |       +---------------------+
  +-----------+-----------+
              | DMA
              v
  +-----------------------+   +-----------------------+
  |   SMEM                |   |   SFLAG               |  Target+0x468 (int32)
  |  16 KiB..1 MiB        |   |  1..16 KiB            |  SflagSizeBytes 0x1d615e60
  |  word 4 B · MS=5      |   |  word 4 B · MS=6      |  (sync-flag words;
  |  Target+0x470 (int32) |   |  polled by DMA/barrier|  BarnaCore SFLAG @+0x478)
  +-----------------------+   +-----------------------+
        scalar operands           completion / sync

| Tier | Role | Scope | Word | First gen |
|---|---|---|---|---|
| HBM | Device-global backing store; all program inputs/outputs, spill, embeddings | whole chip / per-die | 32 B – 1024 B (per gen) | v2 (all) |
| VMEM | Vector working set: the tiled scratchpad MXU/VPU operands stage through | per TensorCore | 512 B (all gens) | v2 (all) |
| SMEM | Scalar / sequencer scratchpad: addresses, loop bounds, scalar operands | per TensorCore | 4 B (all gens) | v2 (all) |
| SFLAG | Sync-flag tier: the DMA/barrier completion words polled for cross-engine sync | per TensorCore | 4 B (all gens) | v2 (all) |
| CMEM | Second large scratchpad (SharedMemory[CMEM]): staging buffer above VMEM | whole chip | 512 B | v4 only |

NOTE — the tiers are software-managed scratchpads, not caches. A reimplementation that models VMEM/SMEM as a coherent cache backed by HBM is wrong: the only data path between tiers is an explicit DMA, priced by LocalDmaBandwidth. The compiler chooses each buffer's home tier; nothing migrates a buffer at runtime.

HBM — device-global store

HBM is the only tier large enough to hold a whole program's tensors. Its size, word, and clock are proto fields (SharedMemory[HBM]), materialized into Target+0x450 (int64, HbmSizeBytes), with the per-stack clock at Target+0x910 and full-chip bandwidth at Target+0x4f0 (HbmFullChipBytesPerSecond). The HBM word narrows across generations: 1024 B on v2/v3, 512 B on v4 and v5e, and 32 B on v5p, v6e, and v7x. This is the same value the DMA path uses as its granule (DmaRequirements.granule_bytes); see HBM DMA Alignment. The allocator that carves HBM is HBM Allocator; the embedded tcmalloc that sub-allocates host-visible regions is Embedded tcmalloc.

VMEM — vector working set

VMEM is the per-TensorCore scratchpad the MXU and vector units read their operands from; it is the TPU's analogue of GPU shared memory, but explicitly allocated by the compiler rather than declared per-kernel. VmemSizeBytes (0x1d615e00) reads Target+0x458 as an int32 (a movslq sign-extends it), and the VMEM word is a fixed 512 B on every generation (Target+0x50c, VmemWordSizeBytes). VMEM capacity is 16 MiB through v4, 64 MiB on v5p, 128 MiB on v5e and v6e, and 64 MiB again on v7x. The placement/coloring engine is VMEM Allocator, and the per-buffer tile shape it consumes is TPU Buffer Layout.

SMEM — scalar scratchpad

SMEM holds scalars: loop induction variables, computed addresses, sequencer constants. It is tiny relative to VMEM (16 KiB on v2/v3, 1 MiB from v4) and addressed in 4-byte words (Target+0x508, SmemWordSizeBytes). SmemSizeBytes is at Target+0x470 (int32). The register-window view of SMEM and its allocator are SMEM Register Window and SMEM Scalar Memory.

SFLAG — sync-flag tier

SFLAG is a dedicated small memory of 4-byte sync-flag words that the DMA engines increment on completion and that barrier/wait primitives poll. It is first-class — it has its own Target size field (Target+0x468, int32, SflagSizeBytes 0x1d615e60), its own word size (Target+0x504) and log2 (Target+0x4c8), and its own bank ladder — exactly so the sync protocol can address it without colliding with data tiers. SFLAG is 1 KiB on v2/v3, 2 KiB on v4–v6e, and jumps to 16 KiB on v7x. The wait/clear protocol that consumes it is SFLAG Protocol, and the barrier binding is Barrier-to-SFLAG Binding.

QUIRK — there are two physical SFLAG tiers in Target, not one. The main chip SFLAG (size +0x468, word +0x504) coexists with a separate BarnaCore SFLAG (size Target+0x478, accessor BarnaCoreSflagSizeBytes 0x1d615f80) that only exists when the HasBarnaCore vtable predicate (vtable+0x258) is true — i.e. on v2/v3/v4. BarnaCore also owns its own SMEM (BarnaCoreSmemSizeBytes, base BarnaCoreSmemBaseBytes). A reimplementer targeting the v2–v4 embedding path must allocate both SFLAG regions; targeting v5p+ there is only the one.

CMEM — the Pufferfish second scratchpad

CMEM is a 128 MiB chip-level SharedMemory[CMEM] present only on Pufferfish (v4). CmemSizeBytes (0x1d615e20) reads Target+0x460 (int64); on every other generation that field is 0, and the canonical CMEM-presence test in the binary is exactly target().CmemSizeBytes() > 0. The CMEM pool allocator and its VMEM-staged transfer model are CMEM Pool.

GOTCHA — "CMEM is v4-only" is encoded three independent ways, and all three must agree in a reimplementation: (1) CmemSizeBytes is 0 except on v4; (2) only PufferfishTarget::MemBanks has a CMEM (space-4) entry, every other gen LOG(FATAL)s on it; (3) only the v4 chip_parts blob carries a SharedMemory[CMEM]. There is also no LocalDmaBandwidthHbmToCmem virtual — the cost-model dispatcher routes (Hbm,Cmem) into the VmemToVmem slot, because HBM→CMEM is modeled as HBM→VMEM→CMEM staging.
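
The three-way agreement can be checked mechanically in a reimplementation. The following is a minimal sketch, assuming a hypothetical `GenModel` record (an illustration, not a structure found in the binary) that holds the three encodings for one generation; the v4 values are the ones given on this page.

```python
from dataclasses import dataclass
from typing import Optional

MIB = 1 << 20


@dataclass(frozen=True)
class GenModel:
    """Hypothetical per-generation record; field names are illustrative."""
    name: str
    cmem_size_bytes: int        # models Target+0x460; 0 means absent
    cmem_banks: Optional[int]   # None models the LOG(FATAL) path in MemBanks
    proto_has_cmem: bool        # chip_parts carries a SharedMemory[CMEM]


def has_cmem(gen: GenModel) -> bool:
    # Mirrors the presence test quoted above: CmemSizeBytes() > 0.
    return gen.cmem_size_bytes > 0


def cmem_encodings_agree(gen: GenModel) -> bool:
    # Size field, bank entry, and proto entry must all say the same thing.
    by_size = has_cmem(gen)
    by_banks = gen.cmem_banks is not None
    return by_size == by_banks and by_banks == gen.proto_has_cmem


V4 = GenModel("pufferfish v4", 128 * MIB, 32, True)
V2 = GenModel("jellyfish v2", 0, None, False)
# Deliberately inconsistent model: a bank entry with a zero size field.
BROKEN = GenModel("inconsistent", 0, 32, False)
```

Running `cmem_encodings_agree` over every modeled generation at start-up catches the case where one encoding was updated and the other two were not.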


Per-Generation Sizes

This is the orientation copy of the size/word/bank rows; the authoritative, fully-sourced master table (with bandwidth, clocks, MXU geometry, register files, SparseCore tiers) is Per-Codename Constants. The values here are the proto-decoded sizes (bytes_per_word × word_count) for the data tiers and the C++ MemBanks literals for the bank counts. All CONFIRMED unless flagged.

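| Tier / field | v2 JF | v3 DF | v4 PF (std / lite) | v5p / v5e VF | v6e GL | v7x |
|---|---|---|---|---|---|---|
| HBM size | 16 GiB | 32 GiB | 32 / 8 GiB | 96 / 16 GiB | 31.5 GiB | 95 / 190 GiB |
| HBM word | 1024 B | 1024 B | 512 B | 32 / 512 B | 32 B | 32 B |
| VMEM / TensorCore | 16 MiB | 16 MiB | 16 MiB | 64 / 128 MiB | 128 MiB | 64 MiB |
| VMEM word | 512 B | 512 B | 512 B | 512 B | 512 B | 512 B |
| SMEM / TensorCore | 16 KiB | 16 KiB | 1 MiB | 1 MiB | 1 MiB | 1 MiB |
| SMEM word | 4 B | 4 B | 4 B | 4 B | 4 B | 4 B |
| SFLAG / TensorCore | 1 KiB | 1 KiB | 2 KiB | 2 KiB | 2 KiB | 16 KiB |
| SFLAG word | 4 B | 4 B | 4 B | 4 B | 4 B | 4 B |
| CMEM (chip) | absent | absent | 128 MiB | absent | absent | absent |
| VMEM banks | 8 | 8 | 16 | 32 | 32 | 32 |
| SMEM banks | 2 | 2 | 8 | 8 | 8 | 8 |
| CMEM banks | FATAL | FATAL | 32 | FATAL | FATAL | FATAL |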

Three discontinuities are worth memorizing because they break a "scale everything linearly" reimplementation:

  • SMEM jumps 64× at v4 (16 KiB → 1 MiB) and then holds flat — the scalar scratchpad stopped being a bottleneck after Pufferfish.
  • SFLAG jumps 8× at v7x (2 KiB → 16 KiB), reflecting the much larger sync-flag fan-out of the v7 collective fabric.
  • VMEM is non-monotonic — it peaks at 128 MiB on the single-TensorCore lite/v6e dies (which pack more VMEM per core) and is back down to 64 MiB on v7x. VMEM-per-TensorCore is not a proxy for generation.
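
The three discontinuities can be recomputed from the per-TensorCore rows of the size table. This is a small sketch; the dictionary layout and the `jumps` helper are illustrative, and the v5p/v5e pair is listed as two consecutive entries even though it is a variant split rather than a generation step.

```python
KIB, MIB = 1 << 10, 1 << 20

# Column order of the size table, with v5p and v5e split out.
GENS = ["v2", "v3", "v4", "v5p", "v5e", "v6e", "v7x"]

# Per-TensorCore sizes in bytes, copied from the table.
SMEM = dict(zip(GENS, [16 * KIB, 16 * KIB, MIB, MIB, MIB, MIB, MIB]))
SFLAG = dict(zip(GENS, [KIB, KIB, 2 * KIB, 2 * KIB, 2 * KIB, 2 * KIB, 16 * KIB]))
VMEM = dict(zip(GENS, [16 * MIB, 16 * MIB, 16 * MIB, 64 * MIB,
                       128 * MIB, 128 * MIB, 64 * MIB]))


def jumps(sizes):
    """Return (generation, ratio) for every step where the size changes."""
    out = []
    for prev, cur in zip(GENS, GENS[1:]):
        if sizes[cur] != sizes[prev]:
            out.append((cur, sizes[cur] / sizes[prev]))
    return out


def is_monotonic(sizes):
    """True if the size never shrinks from one entry to the next."""
    return all(ratio > 1 for _, ratio in jumps(sizes))


print(jumps(SMEM))          # the single 64x step at v4
print(jumps(SFLAG))         # a 2x step at v4, then the 8x step at v7x
print(is_monotonic(VMEM))   # False: v7x drops back to 64 MiB
```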

NOTE — the bank counts are not proto fields. They are C++ literals in the per-codename *Target::MemBanks overrides: JellyfishTarget (0x1d48fc80) returns 8/–/2 for spaces 3/–/5; PufferfishTarget (0x1d493900) indexes the .rodata table 0xB5305C8 = {16, 32, 8} for spaces 3/4/5; Dragonfish overrides none and inherits Jellyfish. A MemBanks call on a space the generation lacks is a hard LOG(FATAL), not a zero — which is how CMEM-on-v2 is made unrepresentable rather than merely empty.
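
A reimplementation can preserve the "absent means fatal, not zero" behaviour with a lookup that refuses to return a default. This is a sketch under stated assumptions: `FatalError` and `mem_banks` are hypothetical names standing in for LOG(FATAL) and the per-codename MemBanks overrides, and a Python exception only approximates a process abort.

```python
VMEM, CMEM, SMEM = 3, 4, 5  # MemorySpace ints of the banked data tiers


class FatalError(RuntimeError):
    """Stands in for LOG(FATAL); the real binary aborts the process."""


# Bank counts from this page, keyed by MemorySpace int.
JELLYFISH_BANKS = {VMEM: 8, SMEM: 2}
DRAGONFISH_BANKS = JELLYFISH_BANKS            # v3 overrides nothing
PUFFERFISH_BANKS = {VMEM: 16, CMEM: 32, SMEM: 8}


def mem_banks(banks, space):
    """Return the bank count, or fail hard if the space does not exist."""
    try:
        return banks[space]
    except KeyError:
        # No silent zero: a missing space is a modeling error.
        raise FatalError(f"no bank count for memory space {space}") from None
```

Returning 0 for a missing space would let CMEM-on-v2 flow into later size and stride arithmetic; raising keeps the error at the point where the bad space was requested.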


The Target Field Map

The runtime view of the whole hierarchy is one contiguous block of Target struct fields filled by Target::Init from the chip-parts proto. Every tier accessor is a one-instruction getter, verified against the decompiled bodies: HbmSizeBytes is return *((int64_t*)this + 138) (= +0x450), VmemSizeBytes is return *((int*)this + 278) (= +0x458), SflagSizeBytes is return *((uint*)this + 282) (= +0x468). The map below is the field layout a reimplementation must reproduce so the same accessors resolve to the same offsets.

| Target off | Accessor (VA) | Type | Tier datum |
|---|---|---|---|
| +0x438 / +0x448 | user-alloc shared-mem limit clamp | int64 | HBM / scoped (CMEM) user-alloc cap |
| +0x450 | HbmSizeBytes (0x1d615320) | int64 | HBM size |
| +0x458 | VmemSizeBytes (0x1d615e00, movslq) | int32 | VMEM size |
| +0x460 | CmemSizeBytes (0x1d615e20) | int64 | CMEM size (0 ⇒ absent) |
| +0x468 | SflagSizeBytes (0x1d615e60) | int32 | SFLAG size |
| +0x470 | SmemSizeBytes (0x1d615e40) | int32 | SMEM size |
| +0x478 | BarnaCoreSflagSizeBytes (0x1d615f80) | int32 | BarnaCore SFLAG (v2–v4, gated by HasBarnaCore) |
| +0x4c8 / +0x4cc / +0x4d0 | Sflag/Smem/Vmem WordSizeLog2 | int32 | per-tier word log2 (byte→word shift) |
| +0x4f0 | HbmFullChipBytesPerSecond (0x1d6172a0) | int64 | HBM bandwidth |
| +0x4f8 | CmemFullChipBytesPerSecond (0x1d6172c0) | int64 | CMEM bandwidth (0 off-v4) |
| +0x504 / +0x508 / +0x50c / +0x510 | Sflag/Smem/Vmem/Cmem WordSizeBytes | int32 | per-tier word size |
| +0x90c / +0x910 | TC freq / HBM freq MHz | int32 | clocks (boot-filled, 0xFFFFFFFF sentinel pre-init) |
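
The size rows of the field map can be exercised against a raw byte image of a Target object. The sketch below assumes a little-endian layout (the movslq in VmemSizeBytes suggests x86-64, but that is an inference), and `FIELDS`, `read_field`, and `element_index` are hypothetical helper names. `element_index` reproduces the typed-pointer indices the decompiler prints, so they can be cross-checked against the byte offsets.

```python
import struct

# (byte offset, struct format) per accessor, from the field map.
# "<q" is a little-endian int64, "<i" a little-endian int32 (assumed order).
FIELDS = {
    "HbmSizeBytes":   (0x450, "<q"),
    "VmemSizeBytes":  (0x458, "<i"),
    "CmemSizeBytes":  (0x460, "<q"),
    "SflagSizeBytes": (0x468, "<i"),
    "SmemSizeBytes":  (0x470, "<i"),
}


def read_field(blob, name):
    """Read one tier-size field out of a raw Target image."""
    offset, fmt = FIELDS[name]
    return struct.unpack_from(fmt, blob, offset)[0]


def element_index(name):
    """Index as seen through a typed pointer: byte offset / element size."""
    offset, fmt = FIELDS[name]
    return offset // struct.calcsize(fmt)


# Build a zeroed stand-in image and plant the v2 HBM and SMEM sizes in it.
image = bytearray(0x520)
struct.pack_into("<q", image, 0x450, 16 << 30)   # 16 GiB
struct.pack_into("<i", image, 0x470, 16 << 10)   # 16 KiB

print(read_field(image, "HbmSizeBytes"))    # 17179869184
print(element_index("HbmSizeBytes"))        # 138
print(element_index("VmemSizeBytes"))       # 278
```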

The word-size pairs (WordSizeBytes + WordSizeLog2) exist because every tier address the ISA produces is a word address, not a byte address: a buffer's byte size is converted to a hardware word count by >> WordSizeLog2, and the allocator asserts each buffer is a multiple of the tier word. The bounds-check assertions that gate addressing are visible verbatim in the binary as byte_address < target().HbmSizeBytes(), < target().VmemSizeBytes(), < target().SmemSizeBytes(), < target().SflagSizeBytes() — one per data tier — which is the cleanest evidence that each tier's Target size field is the addressing ceiling, not merely a capacity hint.
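
The byte-to-word conversion and the bounds assertion can be sketched as two small helpers. The function names are hypothetical, and raising `ValueError` is a stand-in for the binary's assertion failures; the shift and the `byte_address < size` comparison are the parts taken from the text.

```python
def size_in_words(size_bytes, word_size_log2):
    """Convert a buffer's byte size to a hardware word count."""
    word_size = 1 << word_size_log2
    if size_bytes % word_size != 0:
        # Models the allocator's "multiple of the tier word" assertion.
        raise ValueError(f"{size_bytes} B is not a multiple of {word_size} B")
    return size_bytes >> word_size_log2


def check_byte_address(byte_address, tier_size_bytes):
    """Model of the byte_address < target().XSizeBytes() bounds check."""
    if not 0 <= byte_address < tier_size_bytes:
        raise ValueError("byte address outside the tier")
    return byte_address


VMEM_WORD_LOG2 = 9   # 512 B word
SMEM_WORD_LOG2 = 2   # 4 B word

print(size_in_words(1 << 20, VMEM_WORD_LOG2))    # 2048 words in 1 MiB of VMEM
print(size_in_words(16 << 10, SMEM_WORD_LOG2))   # 4096 words in 16 KiB of SMEM
```

Keeping the log2 alongside the byte size avoids a division on every address computation, which is presumably why Target stores both.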


Tiers, MemorySpace, and Address Spaces

A reimplementer hits three distinct numberings for the same physical tiers. Keeping them straight is the single most error-prone part of the memory model, so the tiers (the five TensorCore data tiers plus the BarnaCore / SparseCore sub-spaces that share the enum) are tabulated together here against the MemorySpace int and the narrower DMA-dispatcher int; the full 17-value enum and the LLVM address-space band live on their own pages.

The three numberings

The compiler-side xla::jellyfish::MemorySpace enum is the string table at 0x21ce6b08, indexed directly by the enum value (MemorySpaceToString(e) is literally off_21CE6B08[e]). The cost-model DMA dispatcher Target::LocalDmaBandwidth uses a narrower set of the same integers — its decompiled body XORs the argument against 1 (Hbm), 3 (Vmem), 4 (Cmem), 5 (Smem) to pick a vtable offset, confirming those four enum values for the data tiers. The LLVM-level address-space IDs are a separate numbering carried in addrspace(N) and detailed on Address-Space IDs.

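| Physical tier | MemorySpace name | enum int | DMA-dispatcher int |
|---|---|---|---|
| HBM | hbm (also kDefault) | 1 | 1 |
| HIB (host-interface) | hib | 2 | |
| VMEM | vmem | 3 | 3 |
| CMEM | cmem | 4 | 4 |
| SMEM | smem | 5 | 5 |
| SFLAG | sflag | 6 | |
| IMEM (instr.) | imem | 7 | |
| BarnaCore SMEM | barna_core_smem | 9 | |
| BarnaCore SFLAG | barna_core_sflag | 10 | |
| SC sequencer SFLAG | sparse_core_sequencer_sflag | 12 | |
| host | host | 13 | |
| SC sequencer SMEM | sparse_core_sequencer_smem | 14 | |
| pinned HBM | pinned_hbm | 16 | |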

NOTE — the canonical enum assignment is hib = 2, sflag = 6, imem = 7, all CONFIRMED — there is no extra slot near hib. Four independent byte-exact probes pin it: the MemorySpaceToString flat lookup (string table 0x21ce6b08), the MemorySpaceToDriverResource (0x1d6223e0) per-case switch, the MakeCmemConstant/MakeSparseCoreSequencerSmemConstant ctors, and the MemBanks overrides (see MemorySpace Enum and Memory-Space Master Table). The data-tier values (Hbm 1, Vmem 3, Cmem 4, Smem 5) are independently confirmed by the LocalDmaBandwidth XOR constants.
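
For a reimplementation, the tabulated values can be pinned down as an enum so the numbering cannot drift. This sketch covers only the members listed on this page; the remaining indices of the 17-value enum are not given here and are deliberately left out rather than guessed. The Python member names and the two helper functions are illustrative.

```python
from enum import IntEnum


class MemorySpace(IntEnum):
    """Subset of xla::jellyfish::MemorySpace tabulated on this page."""
    HBM = 1
    HIB = 2
    VMEM = 3
    CMEM = 4
    SMEM = 5
    SFLAG = 6
    IMEM = 7
    BARNA_CORE_SMEM = 9
    BARNA_CORE_SFLAG = 10
    SPARSE_CORE_SEQUENCER_SFLAG = 12
    HOST = 13
    SPARSE_CORE_SEQUENCER_SMEM = 14
    PINNED_HBM = 16


# The four values the LocalDmaBandwidth body compares its arguments against.
DMA_DISPATCH_SPACES = frozenset(
    {MemorySpace.HBM, MemorySpace.VMEM, MemorySpace.CMEM, MemorySpace.SMEM}
)


def memory_space_to_string(value):
    """Name lookup in the spirit of MemorySpaceToString (listed members only)."""
    return MemorySpace(value).name.lower()


def in_dispatcher_numbering(space):
    """True if the space is one of the four the DMA dispatcher recognizes."""
    return space in DMA_DISPATCH_SPACES


print(memory_space_to_string(6))                    # sflag
print(in_dispatcher_numbering(MemorySpace.SFLAG))   # False
```

Passing an index that is not listed (for example 8) raises `ValueError`, which keeps the unlisted slots from being silently aliased to a neighbour.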

SFLAG's sub-spaces

SFLAG is the one tier that splits into scoped sub-spaces at the MLIR level, because synchronization happens at three granularities: the global cross-engine sflag (the only one reachable cross-CORE — TC↔SC, SC↔SC, TC↔TC, and remote), the per-tile sflag_tile (TEC sub-engine scope), and the per-SCS sflag_scs. These are distinct mlir::sparse_core::MemorySpace values, asserted by the verifier; they are detailed in SFLAG Protocol, not duplicated here. The point for the hierarchy map is that the single physical SFLAG tier backs all three scopes — the sub-spaces are addressing scopes, not separate memories.

SparseCore tiers

From v5p the SparseCore brings its own memory family — SPMEM, TILESPMEM, and per-SCS SMEM/SFLAG — that is parallel to, not part of, the TensorCore hierarchy described here. Those sizes are tabulated in Per-Codename Constants (the SC rows) and the BarnaCore↔SparseCore pivot is BarnaCore overview. They are out of scope for this TensorCore-tier orientation page beyond noting that SPMEM→HBM is a separate DMA-bandwidth virtual not reachable through the TensorCore LocalDmaBandwidth dispatcher.


| Name | Relationship |
|---|---|
| Target::Init | fills Target+0x450..+0x510 from chip_parts; the boot source of every tier size |
| TpuChipParts::FromProto (0x20b1b400) | decodes the chip_parts proto whose fields become the tier sizes |
| *Target::MemBanks | C++ source of the per-tier bank counts (the only non-proto integers here) |
| Target::LocalDmaBandwidth (0x1d6168e0) | the (src,dst) tier → bandwidth dispatcher; the consumer of the tier enum ints |
| MemorySpaceToString (0x1d6ffae0) | the enum→name table that names every tier and sub-space |

Cross-References