CMEM Constant-Memory Pool

All addresses on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, build-id md5 89edbbe81c5b328a958fe628a9f2207d). The image is not stripped; demangled C++ symbol names are quoted verbatim. CMEM byte sizes per codename live in the embedded *chip_parts.binarypb resource and are not in the C++ — every numeric size below is sourced as noted. Other versions will differ.

Abstract

CMEM ("constant memory", MemorySpace::kCmem = 4) is an on-chip SRAM operand pool that exists as a first-class memory tier on exactly one production codename — Pufferfish (PXC, TPU v4). It is the one tier the memory-space taxonomy calls "a Pufferfish-only read-mostly operand pool", and it is the reason the Pufferfish TensorCore can fetch two vector operands per bundle cycle: a dedicated TensorCoreCmemLoad slot issues a CMEM read in the same 51-byte bundle as the regular VMEM VectorLoad, doubling MXU operand-fetch bandwidth.

This page owns three things and nothing else: (1) the CMEM pool layout — the allocator Config, the byte/word/bank addressing model, and where the per-program image and stack sit; (2) the per-generation sizing — which codenames have CMEM at all, and how the size, word, and bank count are surfaced; and (3) the constant packing — what the compiler elects to place in CMEM and how it gets there. The on-chip SMEM model is smem-scalar-memory.md; the tier taxonomy and the shared BestFitAllocator are overview.md; the bit-exact cmem_load bundle slot is on the ISA side (slot-cmem-load-pf.md, bundle-pf-51b.md). This page links those and does not duplicate them.

A reimplementer needs to internalize the following contract:

  • CMEM is not a separate allocator class. There is no CmemAllocator. CMEM uses the same two-stack model as every other tier (overview.md §1): a compile-time ProgramMemoryAllocator::AllocateBytes(MemorySpace::kCmem, …) and a runtime tpu::BestFitAllocator instance, distinguished only by a 32-byte Config{base_offset, end, alignment, granule}.
  • CMEM sizing is two scalar fields on Target. Target::CmemSizeBytes() (Target+0x460, uint64) is total bytes; Target::CmemWordSizeBytes() (Target+0x510, uint32) is the word = the allocator's alignment and granule. Both are filled at boot from chip_parts.binarypb; neither is overridden by any codename's Target subclass.
  • CMEM is real only where MemBanks(kCmem) returns a value. Only PufferfishTarget::MemBanks(kCmem) returns a number (32); on every other codename MemBanks(kCmem) is a LogMessageFatal. The whole per-gen CMEM difference reduces to data: the MemBanks(kCmem) return, the presence of the PufferfishTarget::LocalDmaBandwidth*Cmem overrides, the presence of a pxc::isa::*Cmem* opcode family, and the chip_parts.binarypb-supplied size triple.
  • Constants are packed by MSA and filled by program-prologue DMA. The placement decision is the MSA pass gated by the xla_tpu_cmem_* flag family; the bytes arrive via kDmaHbmToCmem / kDmaVmemToCmem issued in the compiled program's prologue, not at NEFF load time.

Memory space | xla::jellyfish::MemorySpace::kCmem = 4 (name table @ 0x21ce6b08)
Total-size accessor | Target::CmemSizeBytes() @ 0x1d615e20; reads *(uint64*)(this+0x460)
Word/granule accessor | Target::CmemWordSizeBytes() @ 0x1d617320; reads *(uint32*)(this+0x510)
Banks (Pufferfish only) | PufferfishTarget::MemBanks(kCmem) @ 0x1d493900 = 32 ({16,32,8}[MS-3] @ rodata 0xb5305c8)
Compile-time placer | ProgramMemoryAllocator::AllocateBytes(MemorySpace::kCmem, …) @ 0x1c629e40
Runtime allocator | tpu::BestFitAllocator (Config{base=0, end=CmemSizeBytes, align=granule=CmemWordSizeBytes}); rehydrated by CreateFromProto @ 0x1c631f20
Address scale | LloRegionBuilder::CmemAddrScaled @ 0x1d539980 (byte→word by CmemWordSizeBytes; kCmem guard)
Constant form | LloAddress::MakeCmemConstant(long) @ 0x1d60ba20; builds LloAddress(MemorySpace=4, off)
Placement gate | parent->module()->target().CmemSizeBytes() > 0 (asserted in CreateVectorCmemResult @ 0x1d4d99a0)
Master MSA switch | FLAGS_xla_tpu_cmem_memory_space_assignment @ 0x223c1750
Per-gen availability | Pufferfish (PXC, TPU v4) only; MemBanks(kCmem) is LogFatal on DF/JF/VF/GLC/GFC
Confidence | CONFIRMED (byte-anchored, decompile-verified) unless a row or callout says otherwise

1. The CMEM Pool Layout

Purpose

CMEM is a flat per-TensorCore SRAM region. At the allocator level it is byte-flat from offset 0; at the LLO bundle level it is word-indexed; at issue time it is indirect-state-driven. A reimplementer must reproduce all three views because the byte→word conversion happens in the address builder, not at issue, and the bank coordinate is derived from the word index.

The allocator Config

CMEM is one tpu::BestFitAllocator constructed from the universal 32-byte MemoryAllocator::Config (overview.md §3). For the CMEM tier the four fields are:

tpu::MemoryAllocator::Config {
    base_offset_in_bytes_  = 0,                       // CMEM starts at sub-tile address 0
    allocatable_range_end_ = Target::CmemSizeBytes(), // *(uint64*)(Target+0x460)
    alignment_in_bytes_    = Target::CmemWordSizeBytes(), // *(uint32*)(Target+0x510)
    granule_in_bytes_      = Target::CmemWordSizeBytes(), // alignment == granule
};

The ctor (0x1e817500) asserts alignment % granule == 0 and alignment is a power of two — for CMEM both are the same value, so both invariants hold trivially. Capacity is end − base = CmemSizeBytes(). There is no base_offset reservation and no ReserveBottomOfMemory call for CMEM — unlike HBM there is no DMA-floor / compile-time-alignment split.
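
The constructor invariants above can be restated as a small self-contained sketch. The struct and function names below (CmemConfigSketch, MakeCmemConfig, ConfigIsValid, CapacityBytes) are hypothetical stand-ins, not symbols from the binary, and the sizes passed in are placeholders rather than values read from chip_parts.binarypb.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the four Config fields described in the text; the
// real tpu::MemoryAllocator::Config layout is not reproduced here.
struct CmemConfigSketch {
    std::uint64_t base_offset_in_bytes;
    std::uint64_t allocatable_range_end;
    std::uint64_t alignment_in_bytes;
    std::uint64_t granule_in_bytes;
};

inline bool IsPowerOfTwo(std::uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

// CMEM tier: base 0, end = total bytes, alignment == granule == word size.
inline CmemConfigSketch MakeCmemConfig(std::uint64_t cmem_size_bytes,
                                       std::uint32_t cmem_word_size_bytes) {
    return {0, cmem_size_bytes, cmem_word_size_bytes, cmem_word_size_bytes};
}

// The two invariants the text attributes to the constructor:
// alignment % granule == 0 and alignment is a power of two.
inline bool ConfigIsValid(const CmemConfigSketch& c) {
    return c.granule_in_bytes != 0 &&
           c.alignment_in_bytes % c.granule_in_bytes == 0 &&
           IsPowerOfTwo(c.alignment_in_bytes);
}

// Capacity is end - base; with base 0 this is just the total CMEM size.
inline std::uint64_t CapacityBytes(const CmemConfigSketch& c) {
    return c.allocatable_range_end - c.base_offset_in_bytes;
}
```

Because alignment and granule are the same value for CMEM, the modulo check can only fail on a zero granule; the power-of-two check is the one that constrains the word size.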

NOTE — CMEM has no scoped-limit machinery analogous to scoped-VMEM. There is no Target::OverlayReservedCmemBytes, no DefaultPlatformScopedCmemBytes, no RequiredPostModuleScopedCmemBytes. LloRegionBuilder::AllocateScopedCmem (0x1d5181a0) is a thin trampoline into the shared AllocateScopedMemory(shape, MemorySpace::kCmem, name) (0x1d517c20). Scoped-CMEM allocations are bounded only by CmemSizeBytes() − (used chunks × chunk_bytes).

The addressing model

Level | Unit | Mechanism
Allocator | byte offset, base 0 | BestFitAllocator hands out byte offsets relative to base_offset = 0
Bundle / LLO | word index | CmemAddrScaled (0x1d539980) divides the byte address by CmemWordSizeBytes (the Granule arg to AddrScaled 0x1d538880); the bundle Offset field carries the word index
Bank (PF) | (word_index) mod 32 | low bits of the word index pick one of MemBanks(kCmem)=32 banks
Issue | indirect-state slot | CmemIndirectState pre-builds 16-byte slots; the bundle base_address field carries a slot index
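
The first three rows of the table reduce to two integer operations. The sketch below is an illustration only: CmemAddressViews and ViewsForByteOffset are hypothetical names, the word size is passed in rather than hard-wired (the 16-byte Pufferfish word is inferred, not read from the proto), and the indirect-state slot lookup at issue time is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Three views of one CMEM location, following the table rows above.
struct CmemAddressViews {
    std::uint64_t byte_offset;  // allocator view, relative to base 0
    std::uint64_t word_index;   // bundle / LLO view
    std::uint32_t bank;         // Pufferfish bank coordinate
};

// word_size_bytes stands in for Target::CmemWordSizeBytes(); num_banks stands
// in for MemBanks(kCmem). Both are parameters here, not binary-derived values.
inline CmemAddressViews ViewsForByteOffset(std::uint64_t byte_offset,
                                           std::uint32_t word_size_bytes,
                                           std::uint32_t num_banks) {
    assert(word_size_bytes != 0 && num_banks != 0);
    // The allocator's alignment equals the word size, so offsets are word-aligned.
    assert(byte_offset % word_size_bytes == 0);
    const std::uint64_t word_index = byte_offset / word_size_bytes;
    return {byte_offset, word_index,
            static_cast<std::uint32_t>(word_index % num_banks)};
}
```

With a 16-byte word and 32 banks, byte offset 0x210 maps to word 33 and bank 1, and consecutive words walk the banks round-robin.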

CmemAddrScaled is the linkage between the byte allocator and the word-indexed bundle. Its decompiled body asserts the address operand is kCmem before scaling:

// LloRegionBuilder::CmemAddrScaled  @0x1d539980  (verbatim-derived)
LloCheckForFailure<MemorySpace, MemorySpace, kCmem>(...);   // v13[0] = 4 == kCmem
// on failure: UpdateStatus("base->memory_space() == MemorySpace::kCmem",
//                          "platforms/xla/service/jellyfish/llo_region_builder.cc", 4965)
return AddrScaled(this, byte_addr, base, off, /*Granule=*/0 → CmemWordSizeBytes, ...);

The constant form LloAddress::MakeCmemConstant(long off) (0x1d60ba20) builds an LloAddress(MemorySpace=4, off) directly (constructor 0x1d60b980, esi=0x4). There is no CMEM register file: the destination of a CMEM load is a Vreg in the VS / VEX register file; CMEM is purely an operand source pool plus a small mutable stack.

The per-program image and stack

The CMEM contents are a per-program image, not a boot-time constant pool. Two regions sit inside it:

  • The indirect-state table at the head of the image, anchored by Target::CmemStackBaseTableWordOffset() (0x1d6185c0). Each entry is a 16-byte indirect-state record built at program start by a CmemIndirectStateFactory (pfc::b0::Cmq16BIndirectStateFactory on full Pufferfish, plc::Cmq16BIndirectStateFactory on Pufferfish-Lite — the Cmq16B name is the source of the inferred 16-byte word).
  • A small per-program stack at Target::CmemStackStartWordOffset() (0x1d618660) holding runtime-mutable scalars (loop counters consumed by kVectorCmemLoadAndPop, the load-and-post-increment streaming-read opcode). The rodata anchors are "cmem stack start" and "cmem stack base table".

GOTCHA — CMEM is per-program scratch, overwritten on every program switch, not a pinned chip-wide constant store. The CMEM allocator is instantiated per program by CreateFromProto (0x1c631f20); the image is filled from HBM in the program's prologue; the next program's prologue overwrites it. There is no dedicated "CMEM load engine" outside the compiled program — every CMEM byte is written by an LLO-emitted DMA descriptor or a VPU store opcode. This is why CMEM holds a working set of weights/LUTs for one program, not the whole model.


2. Per-Generation CMEM Sizing

Purpose

Whether CMEM exists at all is a per-codename decision, and it is decided by data plus three small Target overrides, not by any TpuVersion branch in the allocator. A reimplementer who drives CMEM off the compile-time placer must replicate the gate exactly: CMEM placement is attempted only when CmemSizeBytes() > 0 and MemBanks(kCmem) does not LogFatal — which, in the published 0.0.40 runtime, is Pufferfish alone.

The sizing accessors (decompile-verified)

Both accessors are direct field reads on the base Target, with no per-codename override:

// Target::CmemSizeBytes()      @0x1d615e20
return *((uint64_t*)this + 140);   // 140 * 8  = 0x460   — total CMEM bytes / core

// Target::CmemWordSizeBytes()  @0x1d617320
return *((uint32_t*)this + 324);   // 324 * 4  = 0x510   — word = alignment = granule

There is no Target::CmemWordSizeLog2 field (SMEM has one at +0x4CC because LLO bundle packing emits a byte→word shift for SMEM; CMEM's byte→word conversion is done inside CmemAddrScaled / the indirect-state classes, which hard-wire the granule).
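
As a cross-check of the index arithmetic in the two accessors, the following sketch reads the same byte offsets out of an opaque buffer. It assumes nothing about the real Target layout beyond the two offsets quoted above; ReadCmemSizeBytes and ReadCmemWordSizeBytes are hypothetical helpers, and the buffer is a stand-in for a Target object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte offsets quoted in the text for the two sizing fields.
constexpr std::size_t kCmemSizeBytesOffset = 0x460;
constexpr std::size_t kCmemWordSizeBytesOffset = 0x510;

// The decompiled bodies index typed pointers; these checks confirm that
// element index * element size lands on the quoted byte offsets.
static_assert(140 * sizeof(std::uint64_t) == kCmemSizeBytesOffset,
              "140th uint64 slot is byte offset 0x460");
static_assert(324 * sizeof(std::uint32_t) == kCmemWordSizeBytesOffset,
              "324th uint32 slot is byte offset 0x510");

// memcpy avoids alignment and aliasing problems when reading from raw bytes.
inline std::uint64_t ReadCmemSizeBytes(const unsigned char* target_blob) {
    assert(target_blob != nullptr);
    std::uint64_t value;
    std::memcpy(&value, target_blob + kCmemSizeBytesOffset, sizeof value);
    return value;
}

inline std::uint32_t ReadCmemWordSizeBytes(const unsigned char* target_blob) {
    assert(target_blob != nullptr);
    std::uint32_t value;
    std::memcpy(&value, target_blob + kCmemWordSizeBytesOffset, sizeof value);
    return value;
}
```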

PufferfishTarget::MemBanks(MemorySpace) is the one override that gates everything (decompiled @ 0x1d493900):

// PufferfishTarget::MemBanks(MemorySpace ms)   @0x1d493900
unsigned v4 = ms - 3;                 // kVmem=3 → 0, kCmem=4 → 1, kSmem=5 → 2
if (v4 >= 3)                          // unsigned compare: ms < 3 wraps around, ms >= 6 (sflag, …) also lands here
    LogMessageFatal("target_pufferfish.h", 228,
                    "Unsupported memory space specified for MemBanks()");
return qword_B5305C8[v4];             // {16, 32, 8}  → kCmem ⇒ 32

The base Target::MemBanks and every non-Pufferfish override land kCmem in a LogMessageFatal branch — MemBanks(kCmem) simply has no valid return on JF/VF/GL/GF.
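
A minimal model of the Pufferfish bank lookup and the resulting placement gate is sketched below. It is an assumption-laden illustration: the function names are hypothetical, and the fatal path is represented by a 0 return so the sketch can be exercised without aborting, whereas the real code calls LogMessageFatal.

```cpp
#include <cassert>
#include <cstdint>

// MemorySpace values quoted in the text.
constexpr int kVmemSpace = 3;
constexpr int kCmemSpace = 4;
constexpr int kSmemSpace = 5;

// Sentinel for "no valid return"; the real override is fatal on this path.
constexpr std::uint64_t kUnsupportedBanks = 0;

inline std::uint64_t PufferfishMemBanksSketch(int memory_space) {
    static const std::uint64_t kBanks[3] = {16, 32, 8};  // indexed by ms - 3
    // Unsigned subtraction makes values below 3 wrap to a large index, so a
    // single compare rejects both ms < 3 and ms >= 6.
    const unsigned index = static_cast<unsigned>(memory_space) - 3u;
    if (index >= 3u) return kUnsupportedBanks;
    return kBanks[index];
}

// Gate from the section's Purpose paragraph: attempt CMEM placement only when
// the target has non-zero CMEM and a valid bank count.
inline bool CmemPlacementAllowed(std::uint64_t cmem_size_bytes,
                                 std::uint64_t cmem_banks) {
    return cmem_size_bytes > 0 && cmem_banks != kUnsupportedBanks;
}
```

A non-Pufferfish target would be modelled here by passing kUnsupportedBanks for cmem_banks, which closes the gate regardless of the size field.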

Per-codename CMEM matrix

Sizes come from chip_parts.binarypb at boot; alignment = granule = CmemWordSizeBytes() for every codename. The Pufferfish bandwidth/latency immediates were decoded from the PufferfishTarget::LocalDmaBandwidth* / InitialDmaLatencyInNs bodies and re-verified here against the decompile.

TpuVersion / family | Target class | CMEM size (B) | Word / granule (B) | MemBanks(kCmem) | CMEM ISA opcodes | BW VMEM→CMEM (GB/s) | InitialDmaLatency (kCmem, ns)
Dragonfish (DF) | DragonfishTarget | chip_parts (≈ 0) | chip_parts | n/a (LogFatal) | none | 0 (base, unmodelled) | 240 (inherited)
Jellyfish (JF) | JellyfishTarget | chip_parts (≈ 0) | chip_parts | n/a (LogFatal) | none | 0 (base) | 240 (constant)
Pufferfish (PXC, TPU v4) | PufferfishTarget | chip_parts (non-zero) | ~16 (Cmq16B) | 32 | dedicated PXC slot + VS + Misc | 1121 | 50
Viperfish (VFC) | ViperfishTarget | chip_parts (= 0) | chip_parts | n/a (LogFatal) | emitters only, lower to VL/VS | 0 (base) | 1200 / 0 (lite)
Ghostlite (GLC / glc, v6e) | GhostliteTarget | chip_parts (= 0) | chip_parts | n/a (LogFatal) | none (no gxc::glc::isa::*Cmem*) | 0 (base) | 1200
6acc60406 (GFC / gfc, TPU7x) | GhostliteTarget | chip_parts (= 0) | chip_parts | n/a (LogFatal) | none (no gxc::gfc::isa::*Cmem*) | 0 (base) | 1200

GOTCHA (literal byte size is LOW) — the literal per-codename CMEM byte size lives in the embedded chip_parts.binarypb proto's TpuMemoryParts record and has not been extracted from the binary; the C++ reads it blindly from Target+0x460. The word/granule "16 B" for Pufferfish is inferred from the Cmq16BIndirectStateFactory template-parameter name, not read from the proto. Public Pufferfish materials describe CMEM at the per-TensorCore-MiB scale, but a reimplementer should treat the literal size and word as chip_parts-supplied data, not as binary-derived constants.

NOTE — Pufferfish ships two sub-variants that both carry the full CMEM story behind the same PufferfishTarget C++ class: pfc::b0 (canonical Pufferfish-A) and plc (Pufferfish-Lite), distinguished only by their CmemIndirectStateFactory<…Cmq16BIndirectStateFactory> specialization and (potentially) by a different chip_parts.binarypb CMEM size. The C++ Target class does not branch between them.

Why Pufferfish alone

The dedicated CMEM bundle slot exists because Pufferfish alone models CMEM. The structural reason (decompile-verified): only PufferfishTarget overrides MemBanks(kCmem) to a real value, and only PufferfishTarget overrides the LocalDmaBandwidth*Cmem virtuals away from the base 0.0. The dedicated cmem_load slot is disjoint from the vector_load slot in the bundle, so a CMEM read and a VMEM read co-issue in the same cycle — the v4 double-operand-fetch datapath. The bit-exact slot layout is on slot-cmem-load-pf.md.

The Pufferfish CMEM cost-model immediates, re-decoded from the decompiled movabs/return bodies (IEEE-754 doubles):

Pair (PufferfishTarget::…) | Addr | Raw immediate | Value (GB/s)
LocalDmaBandwidthVmemToCmem | 0x1d4943e0 | 0x4091840000000000 | 1121.0
LocalDmaBandwidthCmemToVmem | 0x1d494440 | 0x40A2460000000000 | 2339.0
LocalDmaBandwidthCmemToSmem | 0x1d494480 | 0x4041000000000000 | 34.0
LocalDmaBandwidthSmemToCmem | 0x1d4944e0 | 0x4041000000000000 | 34.0
LocalDmaBandwidthCmemToHbm | 0x1d494420 | 0x4090E00000000000 | 1080.0
LocalDmaBandwidthCmemToCmem | 0x1d494460 | 0x4092A40000000000 | 1193.0
InitialDmaLatencyInNs(kCmem) | 0x1d493d00 | table [555.0, 50.0][ms==4] | 50 ns

The asymmetry (read side 2339 GB/s vs. write side 1121 GB/s, a ratio of about 2.1) is consistent with a CMEM physical bus whose read path has roughly twice the wire count; the 50 ns startup is an order of magnitude below the 555 ns VMEM/HBM startup, reflecting CMEM's per-tile-resident short bus. There is no LocalDmaBandwidthHbmToCmem accessor at all: the CMEM→HBM direction has its own 1080 GB/s immediate, but HBM→CMEM traffic is cost-modelled as the VMEM-bridged formula (HBM→VMEM latency + VMEM→CMEM bandwidth).
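
The raw immediates in the table can be re-decoded independently. The sketch below assumes the host uses 64-bit IEEE-754 doubles (true on the usual x86-64 and AArch64 toolchains); DoubleFromBits is a hypothetical helper name, not a symbol from the binary.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterprets a 64-bit pattern (as loaded by a movabs) as an IEEE-754 double.
// memcpy is the portable way to do this without violating aliasing rules.
inline double DoubleFromBits(std::uint64_t bits) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "sketch assumes a 64-bit double");
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For example, 0x4091840000000000 has exponent field 0x409 (2^10) and a mantissa fraction of 0x184/0x1000, giving 1024 × (1 + 388/4096) = 1121.0, which matches the VMEM→CMEM row.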


3. Constant Packing — What Goes in CMEM and How

Purpose

CMEM is the "tile-resident operand store" for compile-time-known, read-mostly data. This section is the placement criteria (what MSA elects to pack) and the fill path (how the bytes get there). The bit-level encoding of the load that reads them is on the ISA pages.

What gets packed

Recovered from the PXC ISA opcode family, the xla_tpu_cmem_* flag family, and the runtime error strings:

Packed content | Why CMEM | Source evidence
Convolution-filter / matmul weights for fixed-shape models | streamed in once at program start, then read many times to feed the MXU operand pipe | dominant kDmaHbmToCmem → TensorCoreCmemLoad flow
Look-up tables (activation, saturation, quantisation) | materialise-once, read-many-per-element | LUT placement; high CMEM read BW
Indirect-state descriptor blocks | drive indirect-address resolution for vectorised reads | CmemIndirectState (16-B Cmq16B records)
All-reduce staging buffers | frees VMEM for operand staging during collectives | FLAGS_xla_tpu_scoped_cmem_for_all_reduce @ 0x223b8de8
MXU result direct-to-CMEM | bypasses the VMEM round-trip on matmul output | kVectorCmemResult (CreateVectorCmemResult @ 0x1d4d99a0)
HLO output buffers (experimental) | a configured fraction of CMEM for top-level outputs | FLAGS_xla_tpu_experimental_cmem_fraction_for_hlo_outputs @ 0x223b8cc8
CMEM stack scalars (loop counters) | runtime-mutable; streamed by load-and-pop | kCmemStackOffset, kVectorCmemLoadAndPop
In-CMEM copy spans (CmemSpan) | sliding-window sequential reads | FLAGS_xla_tpu_allow_in_cmem_copy @ 0x223b30d0 (default off)

The MSA placement decision

Packing is decided by the Memory-Space Assignment pass — the same kAlternate-tier coloring that places VMEM (overview.md §4) — gated by a per-tier xla_tpu_cmem_* flag family that mirrors xla_jf_vmem_* one-to-one. The master switch is FLAGS_xla_tpu_cmem_memory_space_assignment (0x223c1750); the byte cap is FLAGS_xla_tpu_max_cmem_used_by_memory_space_assignment (0x223ae230); prefetch/eviction caps, repack/retry counts, and overlap ratios round out the family (0x223c0250 … 0x223c1e78).

The two-tier tug-of-war (decompile-gated by CmemSizeBytes() > 0, see below):

HloValue characteristic | Wins
MXU primary operand (matmul A/B), needs VS-slot ports | the codename's FastMemorySpace(): VMEM on VF/GL, HBM on JF, CMEM on PF (see per-codename breakdown below)
Weight tile for fixed-shape conv; LUT; quant table (read-mostly, small) | CMEM (high read BW, dedicated bundle slot); Pufferfish only
Activation tile (changes per batch) | VMEM
All-reduce staging | CMEM if xla_tpu_scoped_cmem_for_all_reduce else VMEM
HLO top-level output | HBM default; CMEM if …cmem_fraction_for_hlo_outputs > 0
No live-range headroom anywhere | HBM (spill, fetch on demand)

FastMemorySpace() is per-codename and decompile-verified: JellyfishTarget::FastMemorySpace() (0x1d491a20) returns kHbm(=1) (Jellyfish has no on-chip fast operand tier); ViperfishTarget (0x1d49c3c0) and GhostliteTarget (0x1d499000) both return kVmem(=3); and — the one that makes the whole CMEM datapath exist — PufferfishTarget::FastMemorySpace() (0x1d495f00) returns kCmem(=4). On Pufferfish, CMEM is the fast tier: a single mov $0x4,%al; ret. MSA therefore biases read-mostly tiles toward CMEM specifically on Pufferfish, where the active Target advertises both the kCmem fast space and non-zero LocalDmaBandwidthCmemToVmem.
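
The per-codename return values above can be collected into one lookup. This is a sketch under stated assumptions: the Codename enum and both function names are hypothetical (the real code dispatches through virtual Target overrides), and only the four codenames whose return value is quoted above are modelled.

```cpp
#include <cassert>

// MemorySpace values quoted in the text: kHbm = 1, kVmem = 3, kCmem = 4.
enum MemorySpaceValue { kHbm = 1, kVmem = 3, kCmem = 4 };

// Hypothetical codename tag standing in for the Target subclass.
enum class Codename { kJellyfish, kPufferfish, kViperfish, kGhostlite };

inline int FastMemorySpaceSketch(Codename codename) {
    switch (codename) {
        case Codename::kJellyfish:  return kHbm;   // no on-chip fast operand tier
        case Codename::kPufferfish: return kCmem;  // CMEM is the fast tier
        case Codename::kViperfish:  return kVmem;
        case Codename::kGhostlite:  return kVmem;
    }
    assert(false && "unknown codename");
    return kHbm;  // not reached for the four enumerators above
}

// MSA biases read-mostly tiles toward CMEM only when the target advertises
// kCmem as its fast space and a non-zero CMEM-to-VMEM bandwidth.
inline bool BiasesReadMostlyTilesToCmem(Codename codename,
                                        double cmem_to_vmem_gbps) {
    return FastMemorySpaceSketch(codename) == kCmem && cmem_to_vmem_gbps > 0.0;
}
```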

NOTE (CONFIRMED gate) — the entire CMEM placement path is short-circuited on any codename that advertises zero CMEM. The decompiled LloInstruction::CreateVectorCmemResult (0x1d4d99a0) asserts parent->module()->target().CmemSizeBytes() > 0 and LogMessageFatals otherwise — so a CMEM-result write is unreachable unless the active Target has non-zero CMEM. The same invariant guards MSA, alongside the rodata diagnostic "CMEM is not supported.". A reimplementer that omits the CmemSizeBytes() > 0 precondition will mis-route CMEM placement onto a codename that cannot back it.

The fill path

CMEM is populated by program-prologue DMA, not at NEFF / image load. The compiled XDB program embeds the CMEM image declarations as ProgramMemoryMetadata_Allocation entries with memory_space = kCmem, exactly like VMEM/HBM/SMEM; CreateFromProto (0x1c631f20) turns them into runtime allocator offsets. The bytes then arrive via the compiled prologue:

Fill mechanism | Emitter | Cost-model BW (PF)
HBM→CMEM bulk fill (weights, LUTs) | LloRegionBuilder::DmaHbmToCmemInBytes @ 0x1d576e60 (kDmaHbmToCmem) | VMEM-bridged (no direct HbmToCmem BW)
VMEM→CMEM staged fill | LloRegionBuilder::DmaVmemToCmemInBytes @ 0x1d576c80 | 1121 GB/s
In-program VPU store | TensorCoreVectorStore_CmemStore / …NoOffset (folded into the VS slot) | (VS write port)
MXU result direct write | kVectorCmemResult (CreateVectorCmemResult @ 0x1d4d99a0) | (result-buffer drain)
SMEM→CMEM scalar staging | DmaSmemToCmemInBytes @ 0x1d577220 (only this direction has an emitter; no DmaCmemToSmemInBytes) | 34 GB/s

After the bulk fill, the CmemIndirectState ctor builds the indirect-state vector at the head of the image so the VPU bundle slot can issue TensorCoreCmemLoad against the resident tiles. Eviction back to HBM (the MSA spill path) uses DmaCmemToHbmInBytes (0x1d577040).

GOTCHA — read-only is a convention, not a hardware write-protect. CMEM is "read-mostly" by compiler classification only: the ISA has explicit CMEM write opcodes (VectorCmemStore, VectorCmemStoreNoOffset), seven DMA write paths target CMEM, and the MXU result drain writes it. LSRA-v2 treats compile-time-constant CMEM loads as rematerialisable, and MSA biases read-mostly tiles to CMEM, but a buggy program that stores over a constant region will silently corrupt subsequent reads — there is no hardware fault. The protection is purely SSA / dataflow-level.

Exhaustion handling

Compile-time CMEM exhaustion modes (from runtime error strings):

  • Requested size exceeds capacity: the check is result.memory_space_assignment_size_in_bytes <= target.UserAllocationSharedMemoryLimitBytes(MemorySpace::kCmem) (0x1d616680); violation is LogMessageFatal ("Requested Cmem size for memory space assignment (…)").
  • Target advertises zero CMEM: the CmemSizeBytes() > 0 gate above; emits "CMEM is not supported." and skips CMEM (or fails the compile when a custom-call required CMEM).
  • Out-of-range byte address: byte_address < target().CmemSizeBytes() is asserted at every CMEM operand emission; violation is LogMessageFatal.
  • MSA cannot place: retries up to xla_tpu_cmem_max_retries, then falls back to HBM and logs the soft diagnostics "Ignoring cmem allocation: …" / "Ignoring cmem allocation storage".
  • CmemSpan disabled: use_cmem_span_ == true / !use_cmem_span_ invariants in cmem_indirect_state_factory.h; a span requested while xla_tpu_allow_in_cmem_copy is off LogFatals with "Disabled cmem span".
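
Three of the checks in the list are plain comparisons and can be sketched as a classifier. Everything here is illustrative: CmemCheck and CheckCmemRequest are hypothetical names, the order in which the checks run is an assumption, and the sketch returns an enum where the real code would LogMessageFatal or skip CMEM. The MSA retry loop and the CmemSpan invariants are not modelled.

```cpp
#include <cassert>
#include <cstdint>

enum class CmemCheck {
    kOk,
    kNotSupported,         // target advertises zero CMEM
    kRequestExceedsLimit,  // MSA request above the user allocation limit
    kAddressOutOfRange,    // byte address not below CmemSizeBytes()
};

// cmem_size_bytes stands in for Target::CmemSizeBytes(); user_limit_bytes for
// UserAllocationSharedMemoryLimitBytes(MemorySpace::kCmem).
inline CmemCheck CheckCmemRequest(std::uint64_t cmem_size_bytes,
                                  std::uint64_t user_limit_bytes,
                                  std::uint64_t requested_bytes,
                                  std::uint64_t byte_address) {
    if (cmem_size_bytes == 0) return CmemCheck::kNotSupported;
    if (requested_bytes > user_limit_bytes) return CmemCheck::kRequestExceedsLimit;
    if (byte_address >= cmem_size_bytes) return CmemCheck::kAddressOutOfRange;
    return CmemCheck::kOk;
}
```

Checking the zero-size gate first mirrors the way the text describes it as short-circuiting the whole placement path; the other two checks are independent of each other.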

Runtime exhaustion goes through the standard BestFitAllocator::Allocate ResourceExhaustedError path (hbm-allocator.md) — rare, because the CMEM image is fully materialised at compile time and replayed at load.


Cross-References

  • overview.md — the six-region memory-space taxonomy, the MemorySpace enum (kCmem=4), the shared BestFitAllocator and the compile-time→runtime hand-off this page specializes for CMEM
  • smem-scalar-memory.md — the sibling on-chip scalar tier (kSmem=5); the SMEM↔CMEM 34 GB/s staging pair and the CmemWordSizeLog2-absent contrast
  • vmem-allocator.md — the kAlternate fast tier CMEM competes with in MSA; the per-gen VMEM word/bank Config
  • hbm-allocator.md — the universal best-fit allocate/deallocate algorithm CMEM shares; the runtime ResourceExhaustedError path
  • tpu-buffer-layout.md — how a logical XLA buffer maps to physical offsets across these tiers
  • slot-cmem-load-pf.md — the bit-exact Pufferfish cmem_load bundle slot (the read access path this page references)
  • bundle-pf-51b.md — the 51-byte Pufferfish TensorCore bundle the dedicated CMEM-load slot lives in
  • memory-space-enum.md — the LLO MemorySpace enum at the ISA / operand-tag level
  • msa-overview.md — the placement pass that colors CMEM via the xla_tpu_cmem_* flag family