cmem_load Slot (Pufferfish)

Addresses apply to libtpu.so from the libtpu-0.0.40-cp314 wheel. Other versions differ.

Abstract

The cmem_load slot is the one Pufferfish-only slot in the 51-byte TensorCore bundle: a dedicated 16-bit region (absolute bits 103..118) that issues a read from the constant-memory (CMEM) pool, co-resident in the same bundle cycle as the regular VMEM vector_load slot. It exists on Pufferfish (TPU v4) and nowhere else, because Pufferfish is the only production codename that models CMEM at all — it is the only Target whose MemBanks(kCmem) returns a real bank count (32) and the only one with non-zero CMEM DMA bandwidth. On Jellyfish / Viperfish / Ghostlite / Ghostfish, MemBanks(kCmem) is a LogFatal, and no *::isa::TensorCoreCmemLoad class exists in those codecs. CMEM is a v4 feature, so the bundle slot that reads it is a v4 slot.

Functionally the slot is a second, independent memory read port. Its addressing fields — sublane mask, base-address selector, offset, stride — mirror the regular vector_load slot exactly; the only difference is that the address targets MemorySpace = kCmem (= 4) and is scaled by CmemWordSizeBytes rather than VmemWordSizeBytes. Because the cmem_load dedicated region (103..118) is disjoint from the vector_load region (119..140), a single bundle can issue one CMEM read and one VMEM read together — the v4 double-operand-fetch datapath that feeds the MXU operand pipe at twice the single-port rate. That is the reason CMEM, and this slot, exist.

This page documents the slot as a reimplementation target: its on-wire field map at absolute-bit precision (decoded byte-exact from TensorCoreCmemLoadEncoder::Encode), its three-way oneof tag (empty / Noop / CmemLoad) and the kNeverExecute = 31 idle encoding, its draw from the bundle's shared Y-register and immediate operand pool, and the emit → proto → SlotMap → address-scale bridge that turns an EmitVectorCmemLoad call into bundle bits and a CMEM byte address. The CMEM pool, allocator, banking, and per-generation availability are owned by the CMEM Pool page; this page owns the bundle slot.

For reimplementation, the contract is:

The dedicated region (103..118): SublaneMask @103/3, BaseAddress @106/2, Offset @108/2, Stride @110/3, has-bit @113/1, Predication @114/5 — each written by one BitCopy(buf, abs_bit, &field, 0, width).
The shared-pool draw: three 5-bit Y-register selectors (@251/246/241 = Vs2/Vs1/Vs0) and up to four 16-bit immediates (@304/288/272/256), allocated from the same pool every other slot uses, gated by the submessage has-bits.
The oneof tag at proto+0x50: 0 → empty (predication only), 5 → Noop (predication forced to 31 = kNeverExecute), 6 → CmemLoad (the issue variant, writes all addressing + operand fields).
The CMEM-address bridge: EmitVectorCmemLoad → EmitVectorLoadCommon<…CmemLoad> → proto_utils::Place* SlotMap binders; the byte address scaled by CmemAddrScaled and constructed with MakeCmemConstant(MemorySpace = kCmem = 4).
Why v4-only: only PufferfishTarget::MemBanks(kCmem) returns a value (32); the dedicated slot makes CMEM a read port co-issuable with VMEM in one bundle.


Encoder	`pxc::isa::TensorCoreCmemLoadEncoder::Encode(TensorCoreCmemLoad const&, Span<uint8>)` @ `0x1ecf89a0`
Bit primitive	`BitCopy(void*, int dst_bit, void const*, int src_bit, int nbits)` @ `0x1fa0a900` — `dst_bit` == absolute bundle bit
Dedicated region	abs bits 103..118 (16 bits) — disjoint from `vector_load` @119..140
Oneof tag	proto `+0x50`: `0`=empty, `5`=Noop (pred=31), `6`=CmemLoad (issue)
Submessage defaults	`TensorCoreCmemLoad_globals_` @ `0x223fb410` (outer), `TensorCoreCmemLoad_CmemLoad_globals_` @ `0x223fb3c8` (inner)
Emit bridge	`PufferfishTensorCoreEmitter::EmitVectorCmemLoad` @ `0x14120a40` → `EmitVectorLoadCommon<…CmemLoad>` @ `0x14120f40`
One-per-bundle gate	`EmitVectorCmemLoad` rejects on `bundle_proto[+0x10] & 0x40` (cmem_load has-bit already set); `PopulatedSlotsInBundle` @ `0x1d2ea840` builds the slot set
Address scale	`LloRegionBuilder::CmemAddrScaled` @ `0x1d539980` (byte→word by `CmemWordSizeBytes`)
Constant address	`LloAddress::MakeCmemConstant` @ `0x1d60ba20` → `LloAddress(MemorySpace=4=kCmem, off)`
Why v4-only	`PufferfishTarget::MemBanks(kCmem)` @ `0x1d493900` = 32; JF/VF/GL/GF `LogFatal`
Has-bit (bundle dispatch)	`0x040` in the 12-bit slot has-mask at `TensorCoreBundle` proto `+0x10` (7th slot); the slot submessage pointer is at proto `+0x48`

On-Wire Field Map

The slot is encoded by TensorCoreCmemLoadEncoder::Encode (0x1ecf89a0), which writes each field with one call to the shared bit-packing primitive BitCopy(buf, abs_bit, &field, 0, width). As across the whole Pufferfish bundle, the absolute bundle bit of a field is the literal dst_bit argument — there is no shift arithmetic to invert and no header offset to subtract (see Pufferfish 51B Bundle §The Direct-BitCopy Model). The dedicated region is sixteen bits, 103..118; the operand registers and immediates ride in the bundle's shared pool, not in the dedicated region.

// pxc::isa::TensorCoreCmemLoadEncoder::Encode(proto, buf)  @ 0x1ecf89a0  (decoded byte-exactly)
pred = proto[+0x20];
BitCopy(buf, 114, &pred, 0, 5);              // Predication @114/5 — written first, unconditionally
tag = proto[+0x50];                           // oneof discriminator
if (tag == 0) return OK;                      // empty slot: only predication written
if (tag == 5) { pred = 31; BitCopy(buf, 114, &pred, 0, 5); return OK; }   // Noop = kNeverExecute
// tag == 6  (CmemLoad, the issue variant):
one = 1;       BitCopy(buf, 113, &one,    0, 1);   // has-bit (slot-present marker) = const 1
inner = proto[+0x48];                          // the CmemLoad submessage (or _globals_ default if clear)
if (inner.has[0x10] & 0x01) BitCopy(buf, 110, &inner.Stride,      0, 3);   // Stride
if (inner.has[0x10] & 0x02) BitCopy(buf, 108, &inner.Offset,      0, 2);   // Offset
if (inner.has[0x10] & 0x04) BitCopy(buf, 106, &inner.BaseAddress, 0, 2);   // BaseAddress
if (inner.has[0x10] & 0x08) BitCopy(buf, 103, &inner.SublaneMask, 0, 3);   // SublaneMask
// ----- shared operand pool (co-allocated with the other slots) -----
if (inner.has[0x10] & 0x10) BitCopy(buf, 251, &inner.Vs2, 0, 5);   // Vs2 register selector
if (inner.has[0x10] & 0x20) BitCopy(buf, 246, &inner.Vs1, 0, 5);   // Vs1
if (inner.has[0x10] & 0x40) BitCopy(buf, 241, &inner.Vs0, 0, 5);   // Vs0  (dest VREG / base)
// four distinct 16-bit immediate fields, at inner proto offsets +0x34 / +0x38 / +0x3c / +0x40
if (inner.has[0x10] & 0x80) BitCopy(buf, 304, &inner.Imm /* +0x34 */, 0, 16);  // shared immediate word
if (inner.has[0x11] & 0x01) BitCopy(buf, 288, &inner.Imm /* +0x38 */, 0, 16);  // shared immediate word
if (inner.has[0x11] & 0x02) BitCopy(buf, 272, &inner.Imm /* +0x3c */, 0, 16);  // shared immediate word
if (inner.has[0x11] & 0x04) BitCopy(buf, 256, &inner.Imm /* +0x40 */, 0, 16);  // shared immediate word

Field	Abs bit	Width	proto off	has-bit	Role
SublaneMask	103	3	`+0x24`	`[0x10]&0x08`	sublane predicate mask
BaseAddress	106	2	`+0x20`	`[0x10]&0x04`	`BaseAddressEncoding`: ZERO / vs0 / vs1 / vs2
Offset	108	2	`+0x1c`	`[0x10]&0x02`	`OffsetEncoding` mode selector
Stride	110	3	`+0x18`	`[0x10]&0x01`	stride / feature-length mode
has-bit	113	1	(const 1)	—	slot-present marker
Predication	114	5	`+0x20`(outer)	—	`0..14` pred reg / `15` always / `31` never (default `0x1f`)
Vs0	241	5	`+0x30`	`[0x10]&0x40`	Y-register selector — dest VREG / base (shared pool)
Vs1	246	5	`+0x2c`	`[0x10]&0x20`	Y-register selector (shared pool)
Vs2	251	5	`+0x28`	`[0x10]&0x10`	Y-register selector (shared pool)
Imm	304	16	`+0x34`	`[0x10]&0x80`	shared immediate word (offset / index)
Imm	288	16	`+0x38`	`[0x11]&0x01`	shared immediate word
Imm	272	16	`+0x3c`	`[0x11]&0x02`	shared immediate word
Imm	256	16	`+0x40`	`[0x11]&0x04`	shared immediate word

The decode-side accessors confirm the widths independently: TensorCoreCmemLoad{Field}::GetConcatenatedValue (0x1ecf8820..0x1ecf8980) read the proto-internal struct with shr+and masks whose pop-counts match the on-wire BitCopy widths exactly — 5-bit predication, 3-bit sublane/stride, 2-bit base/offset, 5-bit Y-regs, 16-bit immediates.
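
The field map above can be exercised with a small packing sketch. It assumes LSB-first bit numbering inside each byte (absolute bit n lives in byte n // 8 at position n % 8), which this page does not state, so treat the byte-level layout as illustrative only. The helpers `bit_copy`, `bit_read`, and `encode_fields` are hypothetical names, not libtpu symbols.

```python
BUNDLE_BYTES = 51  # 51-byte TensorCore bundle = 408 bits

# (absolute bit, width) pairs copied from the field table above.
FIELDS = {
    "SublaneMask": (103, 3),
    "BaseAddress": (106, 2),
    "Offset": (108, 2),
    "Stride": (110, 3),
    "HasBit": (113, 1),
    "Predication": (114, 5),
    "Vs0": (241, 5),
    "Vs1": (246, 5),
    "Vs2": (251, 5),
}


def bit_copy(buf: bytearray, dst_bit: int, value: int, nbits: int) -> None:
    """Write the low `nbits` of `value` at absolute bit `dst_bit` (ASSUMED LSB-first)."""
    for i in range(nbits):
        byte, pos = divmod(dst_bit + i, 8)
        if (value >> i) & 1:
            buf[byte] |= 1 << pos
        else:
            buf[byte] &= ~(1 << pos) & 0xFF


def bit_read(buf: bytes, src_bit: int, nbits: int) -> int:
    """Inverse of bit_copy under the same bit-order assumption."""
    out = 0
    for i in range(nbits):
        byte, pos = divmod(src_bit + i, 8)
        out |= ((buf[byte] >> pos) & 1) << i
    return out


def encode_fields(values: dict) -> bytearray:
    """Pack the named fields into a zeroed bundle buffer, checking each width."""
    buf = bytearray(BUNDLE_BYTES)
    for name, value in values.items():
        abs_bit, width = FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bit_copy(buf, abs_bit, value, width)
    return buf


buf = encode_fields({"Predication": 15, "HasBit": 1, "Stride": 5,
                     "Offset": 2, "BaseAddress": 1, "SublaneMask": 7, "Vs0": 9})
dedicated = bit_read(buf, 103, 16)  # the whole 103..118 region as one integer
```

Reading the sixteen dedicated bits back as one integer is a convenient way to compare a reimplementation against a captured bundle, provided the bit-order assumption is first checked against real encoder output.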

NOTE — the Pufferfish 51B Bundle page lists this same @103..118 region (sublane-mask @103/3, stride @110/3, offset @108/2, base @106/2, has @113/1, pred @114/5) and the two pages agree bit-for-bit. The per-field roles are named identically on both pages from the dedicated TensorCoreCmemLoad{Field} accessor symbols; this page additionally resolves the shared-pool Y-register and immediate placements.

The Oneof Tag and the Idle Encoding

The TensorCoreCmemLoad submessage carries a oneof discriminator at proto +0x50. The encoder reads it immediately after the unconditional Predication write, and it selects one of three behaviours:

Tag	Variant	Encoder behaviour
`0`	empty	only Predication @114 is written; encoder returns
`5`	Noop	Predication forced to `31` (`kNeverExecute`); return
`6`	CmemLoad	the issue variant — has-bit @113 set, all addressing + operand fields written

An idle slot therefore carries Predication = 31 (kNeverExecute), the same empty-slot value the rest of the Pufferfish bundle uses. When the bundle codec finds the slot's has-bit clear in the twelve-slot dispatch, it substitutes the TensorCoreCmemLoad_globals_ default instance (0x223fb410); within the issue variant, a field whose per-field has-bit is clear takes the TensorCoreCmemLoad_CmemLoad_globals_ default (0x223fb3c8). This two-level default (one for an absent slot, one for an absent field inside a present slot) means the encoder never needs a null check on a partially-populated submessage: it always dereferences a valid instance and tests only the has-bits to decide which fields to write.

GOTCHA — predication 0 is a live op, not "skip". The bundle buffer is memset to zero before encode, but predicate 0 is a valid predicate-register reference, not the empty marker. A reimplementation that leaves an unused cmem_load slot at the zeroed default issues a CMEM read gated on predicate register 0. The empty value is the explicit 31 written by tag-0/tag-5 (or the _globals_ default's stamped 31), never the zeroed buffer.
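
The tag dispatch and the idle-encoding pitfall can be sketched as below. The bundle is modelled as a plain Python integer in which absolute bit n is 2**n, which sidesteps byte order entirely; the default `pred` of 31 stands in for the proto default 0x1f, addressing fields are left out, and `put`, `get`, `encode_slot`, and `is_idle` are hypothetical helper names.

```python
K_ALWAYS_EXECUTE = 15
K_NEVER_EXECUTE = 31
TAG_EMPTY, TAG_NOOP, TAG_CMEM_LOAD = 0, 5, 6
PRED_BIT, PRED_WIDTH, HAS_BIT = 114, 5, 113


def put(bundle: int, bit: int, width: int, value: int) -> int:
    """Overwrite `width` bits at absolute bit `bit` of an integer-modelled bundle."""
    mask = ((1 << width) - 1) << bit
    return (bundle & ~mask) | ((value << bit) & mask)


def get(bundle: int, bit: int, width: int) -> int:
    """Read `width` bits starting at absolute bit `bit`."""
    return (bundle >> bit) & ((1 << width) - 1)


def encode_slot(tag: int, pred: int = K_NEVER_EXECUTE) -> int:
    """Mirror the three-way tag dispatch; addressing fields are omitted."""
    bundle = 0                                        # zeroed buffer, as before Encode
    bundle = put(bundle, PRED_BIT, PRED_WIDTH, pred)  # predication is written unconditionally
    if tag == TAG_EMPTY:
        return bundle
    if tag == TAG_NOOP:
        return put(bundle, PRED_BIT, PRED_WIDTH, K_NEVER_EXECUTE)
    if tag == TAG_CMEM_LOAD:
        return put(bundle, HAS_BIT, 1, 1)
    raise ValueError(f"unknown oneof tag {tag}")


def is_idle(bundle: int) -> bool:
    """Idle means predication == 31; an all-zero region is NOT idle."""
    return get(bundle, PRED_BIT, PRED_WIDTH) == K_NEVER_EXECUTE


zeroed = 0
issue = encode_slot(TAG_CMEM_LOAD, pred=K_ALWAYS_EXECUTE)
```

Here `is_idle(zeroed)` is false because the zeroed region reads back as predicate register 0, which is exactly the case the GOTCHA warns about.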

The Shared Operand Pool

The cmem_load slot owns its addressing fields (sublane / base / offset / stride) in the dedicated region, but its register and immediate operands come from the bundle's shared pool — the same three 5-bit Y-register selectors (abs 241/246/251) and six 16-bit immediate words (abs 256/272/288/304/320/338) that the two VALU lanes and the other two memory slots draw from. The encoder writes into the pool slots (Vs2 @251, Vs1 @246, Vs0 @241, immediates @304/288/272/256) but does not decide which pool entry each operand uses; that binding happens earlier, at the proto-build layer, through xla::pufferfish::proto_utils::Place* template functions parameterized on TensorCoreCmemLoad_CmemLoad:

// proto_utils binders for cmem_load (proto-build layer) — same SlotMap pool as VALU / vector_load / vector_store
PlaceSublaneMask<CmemLoad>(optional<SregnoOrImm>, SlotMap<ulong,3>&, SlotMap<ImmValue,6>&, *);  // @0x1c492080
PlaceBaseAddressRegister<CmemLoad>(uint, SlotMap<ulong,3>&, *);                                 // @0x1c492b60
PlaceVsRegister<BaseAddress,CmemLoad>(uint, SlotMap<ulong,3>&, *);                              // @0x1c492be0
PlaceOffsetImmediate<CmemLoad>(ImmValue, SlotMap<ImmValue,6>&, *);                              // @0x1c4937a0
PlaceStride<CmemLoad>(SregnoOrImm, SlotMap<ulong,3>&, …);                                       // @0x1c4a9420

The Vs0 selector (abs 241) names the destination VREG that receives the loaded data, while BaseAddress (2-bit) picks ZERO / vs0 / vs1 / vs2 as the address base from the same Y-register pool, and the offset/index immediate rides in one of the 16-bit shared slots. This is the same SlotMap<ulong,3> (three Y-register selectors) and SlotMap<ImmValue,6> (six immediates) the rest of the bundle shares.

GOTCHA — the shared pool, not the dedicated region, bounds co-issue. The cmem_load dedicated region (103..118) is disjoint from vector_load (119..140), so the two slots co-exist freely on the wire. But both draw their registers and immediates from the same three-Y-register / six-immediate pool. A bundle that issues a cmem_load plus a vector_load plus two VALU lanes can name at most three distinct Y-registers and six distinct immediates across all of them combined. The SlotMap allocator enforces this at proto-build time and rejects an over-subscribed bundle before Encode ever runs.
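
A minimal sketch of the pool constraint follows. It is not the real SlotMap: the class below only models a fixed-capacity pool, and it assumes that equal values share one entry, which is how the "distinct" wording above is read here. `SlotMap.place` and `bind_bundle` are hypothetical names.

```python
class SlotMap:
    """Fixed-capacity pool; equal values share one entry (ASSUMED dedup policy)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []

    def place(self, value) -> int:
        """Return the pool index holding `value`, allocating a new entry if needed."""
        if value in self.entries:
            return self.entries.index(value)
        if len(self.entries) >= self.capacity:
            raise OverflowError(f"pool of {self.capacity} is over-subscribed by {value!r}")
        self.entries.append(value)
        return len(self.entries) - 1


def bind_bundle(slot_operands: dict) -> dict:
    """Bind every slot's (registers, immediates) against one shared pair of pools."""
    yregs, imms = SlotMap(3), SlotMap(6)  # three Y-register selectors, six immediates
    binding = {}
    for slot, (regs, immediates) in slot_operands.items():
        binding[slot] = ([yregs.place(r) for r in regs],
                         [imms.place(i) for i in immediates])
    return binding


# cmem_load and vector_load share register 7 and immediate 0x100, so they fit together.
ok = bind_bundle({
    "cmem_load": ([4, 7], [0x100]),
    "vector_load": ([7, 9], [0x100, 0x20]),
})
```

Adding a third slot that names a fourth distinct register makes `bind_bundle` raise before any encoding would happen, mirroring the proto-build-time rejection described above.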

The Emit → Proto → SlotMap → Address Bridge

A CMEM load reaches the slot through PufferfishTensorCoreEmitter::EmitVectorCmemLoad (0x14120a40), whose signature is (SregnoOrImm base, SregnoOrImm, ImmValue offset, optional<SregnoOrImm> sublane_mask). The emit path is:

// PufferfishTensorCoreEmitter::EmitVectorCmemLoad  @ 0x14120a40  (decoded)
bundle = CurrentBundle();                              // @0x140fea80 — the bundle under construction
slots  = PopulatedSlotsInBundle(bundle);               // @0x1d2ea840 — one-cmem-load-per-bundle legality
// (rejects a second cmem_load in the same bundle — the "bundle has cmem load instruction already" diagnostic)
sub = DefaultConstruct<TensorCoreCmemLoad>();          // build the slot submessage; oneof tag (proto+0x50) := 6
// predicate value chosen by querying the emitter's Predication state, then stamped into sub[+0x20]:
//   is_always_execute (0x1d5b22e0) -> 15 ; is_never_execute (0x1d5b2300) -> 31 ; else 16*reg | sub-index
EmitVectorLoadCommon<TensorCoreCmemLoad_CmemLoad>(sub, base, offset, sublane_mask);   // @0x14120f40

EmitVectorLoadCommon<…CmemLoad> (0x14120f40) is the shared field-placement routine — the same one the regular vector_load slot uses, templated on the CmemLoad submessage — which calls the Place* SlotMap binders above. The slot's addressing submessage is the same shape as vector_load's; the only divergence is the address space.
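
The one-per-bundle gate at the top of the emit path reduces to a single mask test. The sketch below uses the 0x040 has-bit quoted on this page; the exception type and the function name are hypothetical, and the other bit used in the usage line (0x001) is an arbitrary stand-in for some other populated slot.

```python
CMEM_LOAD_HAS_BIT = 0x040  # 7th slot in the 12-bit slot has-mask


class BundleAlreadyHasCmemLoad(Exception):
    """Raised when a second cmem_load is emitted into the same bundle."""


def claim_cmem_load_slot(has_mask: int) -> int:
    """Return the updated has-mask, or reject a second cmem_load in the bundle."""
    if has_mask & CMEM_LOAD_HAS_BIT:
        raise BundleAlreadyHasCmemLoad("bundle has cmem load instruction already")
    return has_mask | CMEM_LOAD_HAS_BIT


mask = claim_cmem_load_slot(0x001)  # a bundle that already holds one unrelated slot
```

A second call with the returned mask raises, which is the behaviour a reimplementation needs before it builds the slot submessage.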

The CMEM byte address itself is produced separately and enters the bundle as the offset immediate plus the base Y-register:

// LloAddress::MakeCmemConstant(long off)  @ 0x1d60ba20  (decoded)
return LloAddress(/*MemorySpace=*/4 /*=kCmem*/, off);   // calls LloAddress(MS, long) with esi = 0x4

// LloRegionBuilder::CmemAddrScaled(LloValue* base, LloValue* idx, long k)  @ 0x1d539980
return AddrScaled(base, idx, k, /*Granule=*/CmemWordSizeBytes);   // @0x1d538880, with a kCmem failure guard

MakeCmemConstant constructs an LloAddress with MemorySpace = 4 = kCmem — confirmed byte-exact (the constructor is called with esi = 4). CmemAddrScaled performs the byte→word scale by CmemWordSizeBytes so the bundle carries a word offset (the byte→word conversion happens in the bundle packer, not at issue time). The scaled word offset becomes the Offset immediate (via PlaceOffsetImmediate), and the CMEM base SREG (from llo.saddr.cmem / ScalarAddressCmem) becomes the BaseAddress Y-register (via PlaceBaseAddressRegister). There is no separate "CmemAllocator" — the address derivation shares the ProgramMemoryAllocator(MS = kCmem) / BestFitAllocator path described on the CMEM Pool page.
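
The address bridge can be illustrated with a short sketch. It models only the two facts stated above: the address is tagged with MemorySpace = 4, and the byte offset is divided by the CMEM word size. The word size is passed in as a parameter because its literal value is not pinned on this page; the 16 used in the usage lines is an assumption taken from the Cmq16B naming. The alignment check and all names below are hypothetical.

```python
from dataclasses import dataclass

K_CMEM = 4  # MemorySpace value this page reports for CMEM


@dataclass(frozen=True)
class Address:
    memory_space: int
    offset: int


def make_cmem_constant(offset: int) -> Address:
    """Analogue of MakeCmemConstant: tag the offset with MemorySpace = kCmem."""
    return Address(K_CMEM, offset)


def cmem_word_offset(byte_offset: int, word_size_bytes: int) -> int:
    """Byte -> word scale; rejects unaligned offsets (an assumed policy)."""
    if word_size_bytes <= 0:
        raise ValueError("word size must be positive")
    if byte_offset % word_size_bytes:
        raise ValueError(f"offset {byte_offset} is not a multiple of {word_size_bytes}")
    return byte_offset // word_size_bytes


ASSUMED_WORD_SIZE = 16  # NOT confirmed; see "What Is Not Yet Pinned"
addr = make_cmem_constant(4096)
word = cmem_word_offset(addr.offset, ASSUMED_WORD_SIZE)
```

Whether the resulting word offset fits a single 16-bit shared immediate depends on the real granule and pool size, so a reimplementation should range-check it at the point where the immediate is placed.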

NOTE — there is no cmem_load → sparsity-DMA edge; the two never co-occur. The cmem_load emit chain calls only SlotMap field binders, never a DMA-descriptor or sparsity emitter. The entire callee set of EmitVectorCmemLoad (0x14120a40) is CurrentBundle (0x140fea80), PopulatedSlotsInBundle (0x1d2ea840, the one-per-bundle legality check), Arena::DefaultConstruct<TensorCoreCmemLoad> / <…CmemLoad>, PlaceStride (0x1c4a9420), and EmitVectorLoadCommon<…CmemLoad> (0x14120f40). EmitVectorLoadCommon's callee set is exactly PlaceSublaneMask (0x1c492080), PlaceBaseAddressRegister (0x1c492b60), and PlaceOffsetImmediate (0x1c4937a0), plus SregnoOrImm/ImmValue accessors: no DMA, no sparsity.

Two structurally disjoint things sit nearby:

(1) Sparsity is a v5+ SparseCore ISA. Every SparseCore*Dma{Encoder,Decoder} symbol lives under asic_sw::deepsea::gxc::{gfc,glc}::isa inside SparseCoreScsCodecBase / SparseCoreTecCodecBase / SparseCoreTacCodecBase, and the v4 pxc::isa namespace has no sparsity codec at all (its only "sparse" symbols are pxc::profiler::BcFsmSparseReduce trace events, not ISA slots).

(2) The only Dma* near cmem_load is the codec-level program-image chunker pxc::isa::TensorCoreCodecBase<…TensorCoreCmemLoadEncoder…>::Dma{Encode,Decode}, where GetBundlesPerDmaChunk (0x1d224f20) = 0xa (10 bundles) and GetBytesPerDmaChunk (0x1d224f40) = 0x200 (512 bytes), i.e. ten 51-byte bundles packed into a 512-byte instruction-stream DMA chunk. TensorCoreCmemLoadEncoder is just one of ~14 slot encoders that codec template is parameterized on; the chunker streams the whole bundle image and is slot-agnostic.

CMEM (v4-only read port) and SparseCore DMA (v5+) never share a target, a namespace, or a code path.
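
The chunk figures in the note above work out as follows. The constants are the ones quoted on this page; the function name is hypothetical, and where the leftover bytes sit inside a chunk (or whether a partial final chunk is padded) is not established here.

```python
BUNDLE_BYTES = 51
BUNDLES_PER_DMA_CHUNK = 0xA   # GetBundlesPerDmaChunk
BYTES_PER_DMA_CHUNK = 0x200   # GetBytesPerDmaChunk


def chunk_layout(num_bundles: int) -> tuple:
    """Return (chunks needed, payload bytes per full chunk, slack bytes per full chunk)."""
    payload = BUNDLES_PER_DMA_CHUNK * BUNDLE_BYTES
    slack = BYTES_PER_DMA_CHUNK - payload
    chunks = -(-num_bundles // BUNDLES_PER_DMA_CHUNK)  # ceiling division
    return chunks, payload, slack


layout = chunk_layout(25)  # a 25-bundle program image
```

Ten bundles occupy 510 of the 512 bytes, so each full chunk has two bytes that carry no bundle data.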

Why the Slot Is Pufferfish-Only

The cmem_load slot exists on Pufferfish and on no other codename because Pufferfish is the only Target that models the CMEM tier:

PufferfishTarget::MemBanks(kCmem) (0x1d493900) returns 32. It indexes a {16, 32, 8} rodata table at 0xb5305c8 by (MemorySpace − 3), so kCmem (= 4) → index 1 → 32 banks. The decompiled body is idx = MS − 3; if (idx >= 3) LogFatal("Unsupported memory space specified for MemBanks()"); return table[idx]; (the decompiler calls the index variable v4, which is unrelated to TPU v4). JellyfishTarget, ViperfishTarget, and GhostliteTarget all LogFatal on MemBanks(kCmem), because they handle only kVmem (= 3) and kSmem (= 5).
Only PufferfishTarget overrides the LocalDmaBandwidth*Cmem virtuals (1121 GB/s VMEM→CMEM, 2339 GB/s CMEM→VMEM, 50 ns initial DMA latency). Every other codename inherits the base zero, which the cost model reads as "not modelled" and which gates MSA from placing anything in CMEM.
No pxc-analog TensorCoreCmemLoad class exists on any other codename. Viperfish has EmitVectorCmemLoad emitter symbols but lowers them into the regular VectorLoad / VectorStore slots — there is no vxc::vfc::isa::TensorCoreCmemLoad. Ghostlite (gxc::glc) and Ghostfish (gxc::gfc) have neither the slot nor the emitters.
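
The MemBanks behaviour in the first item can be sketched as a table lookup. The Pufferfish table and the memory-space values come from this page. The second function is a hypothetical stand-in for the other targets: their real bank counts are not given here, so they are passed in as parameters. A Python exception stands in for LogFatal, and the explicit lower-bound check replaces what is presumably an unsigned comparison in the binary.

```python
K_VMEM, K_CMEM, K_SMEM = 3, 4, 5
PUFFERFISH_BANKS = (16, 32, 8)  # rodata table reported at 0xb5305c8


class UnsupportedMemorySpace(Exception):
    """Stand-in for LogFatal."""


def pufferfish_mem_banks(memory_space: int) -> int:
    """Index the bank table by (MemorySpace - 3), failing outside 0..2."""
    index = memory_space - 3
    if not 0 <= index < len(PUFFERFISH_BANKS):
        raise UnsupportedMemorySpace("Unsupported memory space specified for MemBanks()")
    return PUFFERFISH_BANKS[index]


def other_target_mem_banks(memory_space: int, vmem_banks: int, smem_banks: int) -> int:
    """Targets without CMEM: only kVmem and kSmem are handled (bank counts are inputs)."""
    if memory_space == K_VMEM:
        return vmem_banks
    if memory_space == K_SMEM:
        return smem_banks
    raise UnsupportedMemorySpace("Unsupported memory space specified for MemBanks()")


cmem_banks = pufferfish_mem_banks(K_CMEM)
```

A reimplementation can use the same probe as a capability check: a target is treated as having a cmem_load slot only if the kCmem query succeeds.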

Because the dedicated cmem_load region (103..118) is disjoint from the vector_load region (119..140), a Pufferfish bundle can co-issue a CMEM read and a VMEM read in the same cycle. That doubling of operand-fetch bandwidth into the MXU pipe is the reason CMEM exists on v4, and the dedicated slot is its bundle-level expression.

What Is Not Yet Pinned

The cmem_load addressing-mode subset. The slot reuses the vector_load addressing submessage; whether it supports all four vector_load addressing modes (Vmem / Shuffled / Indexed-Iar0 / Indexed-Iar1) or a CMEM-restricted subset is not separated branch-by-branch here.
The immediate-slot partition. The encoder writes four of the six shared immediate slots (@256/272/288/304) for cmem_load; whether it can ever claim the @320/338 slots is decided at the SlotMap-allocation layer (PlaceOffsetImmediate over SlotMap<ImmValue,6>), not visible in the encoder.
The decode-side gap behaviour. The encoder's field positions agree with the GetConcatenatedValue width accessors; the TensorCoreCmemLoadDecoder::Decode body (the on-wire→proto inverse) is not disassembled here.
The literal CMEM word granule. The Cmq16BIndirectStateFactory naming points at 16 bytes; the exact value is supplied by the embedded chip_parts.binarypb and read at boot into Target +0x510.

Cross-References

Pufferfish 51B Bundle — the full 51-byte slot map this slot sits in (region 103..118), the twelve-slot has-bit dispatch (has-bit 0x040 in the slot has-mask at proto +0x10; submessage pointer at proto +0x48), and the direct-BitCopy absolute-bit model.
Memory-Load Slot — the regular VMEM vector_load slot (abs 119..140) whose addressing submessage cmem_load mirrors, and the shared EmitVectorLoadCommon placement routine.
CMEM Pool — the constant-memory pool, the ProgramMemoryAllocator(MS=kCmem) / BestFitAllocator path, banking, per-generation availability, and why Pufferfish alone models CMEM.

libtpu Internals — Reverse-Engineering Reference