Per-Codename Constant Table

All constant values on this page were decoded byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Values in other libtpu versions may differ.

Abstract

This page is the consolidated per-generation hardware-constant table for every TPU codename the binary knows: Jellyfish (v2), Dragonfish (v3), Pufferfish (v4), Viperfish (v5p/v5e), Ghostlite (v6e), and 6acc60406 (v7x). The master table is the point of the page; the surrounding prose only names the source of each row and flags its confidence.

Two source classes feed the table. The first, and dominant, is the embedded <codename>_chip_parts.binarypb proto blob, decoded directly from .rodata (see chip_parts.binarypb Decode for the schema and resolution path). Every memory size, core count, clock, register count, DMA constant, and MXU lane/sublane/count integer below comes from those bytes, either materialized as bytes_per_word × word_count or read as a scalar field. The second is the small set of constants the proto does not carry: the VMEM/SMEM/CMEM bank counts, which are C++ literals in the per-codename *Target::MemBanks overrides, and the 256×256 MXU systolic dimension, which is a C++ literal in the GhostliteTarget MXU-size overrides.

All nine blobs were carved from .rodata, md5-verified against their FileWrapper descriptor fingerprints, and walked field-by-field against the schema recovered from protodesc_cold. The decoded values reproduce the relationships a reimplementer would expect (e.g. per-TensorCore peak BF16 FLOPS = 2 × mxu_count × 128² × frequency_mhz × 10⁶ for the 128×128 generations). Every cell is CONFIRMED unless annotated otherwise, and the prose below names the source of each row.


Source (capability)	nine `*_chip_parts.binarypb` blobs, `.rodata` `0x0BDF29A0..0x0BDF3AB8`
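Source (bank counts)	`*Target::MemBanks` C++ overrides (addresses below)
Source (MXU systolic dim)	`GhostliteTarget::MxuContractingSize` / `MxuNoncontractingSize` C++ overrides (`0x1d497840` / `0x1d497860`)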
Decode method	md5-verified carve + field-walk against `protodesc_cold` schema
Generations	jellyfish/dragonfish/pufferfish/viperfish/ghostlite/6acc60406 (TpuVersionProto 1..6)
Confidence	CONFIRMED unless a cell is annotated otherwise

Master Hardware-Constant Table

Rows describing TensorCore resources (VMEM, SMEM, SFLAG, MXU, XLU, IAR, clock, register file) are per TensorCore; HBM rows cover the part's full HBM, and rows labeled "/ chip" are per chip. "std" is the full part; "lite" is the viperfish_lite (v5e) or pufferfish_lite (v4 lite) blob, distinguished at resolution by the variant_name field. 6acc60406 is shown as die / full-chip (the tensornode blob vs the full two-die package). HBM size cells show the exact byte product; the GiB figure is bytes / 2^30.

Constant	v2 Jellyfish	v3 Dragonfish	v4 Pufferfish (std / lite)	v5p/v5e Viperfish (std / lite)	v6e Ghostlite	v7x 6acc60406 (die / full)
TpuVersionProto	1	2	3	4	5	6
`driver_abi_version`	1	1	1	1	1	1
HBM size	16 GiB	32 GiB	32 GiB / 8 GiB	96 GiB / 16 GiB	31.5 GiB	95 GiB / 190 GiB
HBM stacks × per-stack	2 × 8 GiB	2 × 16 GiB	1 × 32 GiB / 1 × 8 GiB	1 × 96 GiB / 1 × 16 GiB	1 × 31.5 GiB	1 / 2 × 95 GiB
HBM word (`bytes_per_word`)	1024 B	1024 B	512 B	32 B / 512 B	32 B	32 B
HBM clock	1400 MHz	1800 MHz	2400 MHz	3600 / 3200 MHz	6400 MHz	7200 MHz
HBM bandwidth / stack	0.317 TB/s	0.430 TB/s	0.982 / 0.492 TB/s	2.350 / 0.738 TB/s	1.638 TB/s	3.686 TB/s
VMEM / TensorCore	16 MiB	16 MiB	16 MiB	64 / 128 MiB	128 MiB	64 MiB
VMEM word	512 B	512 B	512 B	512 B	512 B	512 B
SMEM (TensorCore)	16 KiB	16 KiB	1 MiB	1 MiB	1 MiB	1 MiB
SMEM word	4 B	4 B	4 B	4 B	4 B	4 B
SFLAG (TensorCore)	1 KiB	1 KiB	2 KiB	2 KiB	2 KiB	16 KiB
SFLAG word	4 B	4 B	4 B	4 B	4 B	4 B
CMEM (SharedMemory)	absent	absent	128 MiB / 128 MiB	absent	absent	absent
CMEM word / clock / bw	—	—	512 B / 1050 MHz / 2.151 TB/s	—	—	—
MXU lane × sublane	128 × 8	128 × 8	128 × 8	128 × 8	128 × 8	128 × 8
MXU count / TensorCore	1	2	4	4	2	2
XLU count / TensorCore	1	1	2	3	2	2
IAR count / TensorCore	2	2	2	2	2	2
MXU systolic dim	128×128	128×128	128×128	128×128	256×256	256×256
TensorCore freq	700 MHz	940 MHz	1050 MHz	1750 / 1500 MHz	1750 MHz	1900 MHz
TensorCores / chip (std/lite)	2	2	2 / 1	2 / 1	1	1 / 2
Reg file SREG/VREG/PREG/VMREG	32/32/15/8	32/32/15/8	32/32/15/8	32/64/14/16	32/64/14/16	32/64/14/16
Accelerator core type	BARNA_CORE	BARNA_CORE	BARNA_CORE	SPARSE_CORE	SPARSE_CORE	SPARSE_CORE
Accelerator count / chip (std/lite)	2	2	4 / 0	4 / 0	2	2 / 4
SparseCore freq	—	—	—	1475 MHz	1350 MHz	1750 MHz
SC sequencers (SEQ/TAC/TEC)	—	—	—	1 / 16 / 16	1 / 16 / 16	1 / 0 / 16
SC TEC VectorIsa lane × sublane	—	—	—	8 × 1	8 × 1	16 × 1
SC SPMEM / TILESPMEM	—	—	—	8 MiB / 512 KiB	4 MiB / 256 KiB	8 MiB / 512 KiB
SC SMEM (SCS) / SFLAG (SCS)	—	—	—	64 KiB / 28 KiB	64 KiB / 28 KiB	64 KiB / 28 KiB
SC `tile_hbm_bw` / `stream_granule`	—	—	—	32 B/cyc / 4 B	32 B/cyc / 4 B	64 B/cyc / 4 B
DMA granule bytes	1024 B	1024 B	512 B	32 B / 512 B	32 B	32 B
DMA host / device align (B)	16 / 1024	16 / 1024	32 / 512	32 / 32 (lite 32 / 512)	32 / 32	32 / 32
`sync_flag_granule`	1024 B	1024 B	512 B	32 B	32 B	32 B
`max_single_host_dma`	8 MiB	16 MiB	2 GiB	128 GiB	64 GiB	32 GiB
misc: extra_done / host_async / count_dones	n/n/n	n/n/n	y/n/n	y/y/y (lite y/y/n)	y/y/y	y/y/y
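
The DMA rows are easiest to read as constraints on a host-to-device transfer. The sketch below is hypothetical: `split_host_dma`, `DMA_LIMITS`, and the chunking policy are illustrative assumptions, not libtpu's driver logic. The numbers are the std-part cells of the table (lite overrides omitted).

```python
MiB = 1 << 20
GiB = 1 << 30

# Std parts only: (host_align, device_align, max_single_host_dma), all in bytes.
DMA_LIMITS = {
    "jellyfish": (16, 1024, 8 * MiB),
    "dragonfish": (16, 1024, 16 * MiB),
    "pufferfish": (32, 512, 2 * GiB),
    "viperfish": (32, 32, 128 * GiB),
    "ghostlite": (32, 32, 64 * GiB),
    "6acc60406": (32, 32, 32 * GiB),
}


def split_host_dma(codename, host_addr, device_addr, nbytes):
    """Check endpoint alignment, then cut the transfer at max_single_host_dma.

    Every max_single_host_dma value above is a multiple of both alignments,
    so each chunk after the first also starts on an aligned address.
    """
    host_align, device_align, max_chunk = DMA_LIMITS[codename]
    if host_addr % host_align or device_addr % device_align:
        raise ValueError(
            f"{codename}: DMA endpoints violate {host_align}/{device_align} alignment")
    chunks = []
    offset = 0
    while offset < nbytes:
        length = min(max_chunk, nbytes - offset)
        chunks.append((host_addr + offset, device_addr + offset, length))
        offset += length
    return chunks
```

For example, a 20 MiB copy on Jellyfish splits into 8 + 8 + 4 MiB chunks, while the same copy on Viperfish stays a single transfer.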

The exact byte products behind the headline HBM and CMEM cells (each size is bytes_per_word × word_count; GiB = bytes / 2^30):

Part	bytes_per_word × word_count	Bytes	GiB
Jellyfish HBM (per stack, ×2 stacks)	1024 × 8,388,608	8,589,934,592	8
Pufferfish HBM	512 × 67,108,864	34,359,738,368	32
Pufferfish CMEM	512 × 262,144	134,217,728	0.125 (128 MiB)
Viperfish HBM	32 × 3,221,225,472	103,079,215,104	96 (exact)
Ghostlite HBM	32 × 1,056,964,608	33,822,867,456	31.5
6acc60406 HBM (per die)	32 × 3,187,671,040	102,005,473,280	95

Ghostlite's 31.5 GiB (2^35 − 2^29 bytes) is 32 GiB nominal less ECC.
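
The byte products can be re-checked mechanically. A minimal sketch, assuming only the (bytes_per_word, word_count) pairs listed above; the `SIZES` dict and `total_bytes` helper are illustrative names, not fields or functions from libtpu.

```python
GIB = 1 << 30

# (bytes_per_word, word_count) pairs copied from this page.
SIZES = {
    "jellyfish HBM (per stack)": (1024, 8_388_608),
    "pufferfish HBM": (512, 67_108_864),
    "pufferfish CMEM": (512, 262_144),
    "viperfish HBM": (32, 3_221_225_472),
    "ghostlite HBM": (32, 1_056_964_608),
    "6acc60406 HBM (per die)": (32, 3_187_671_040),
}


def total_bytes(bytes_per_word: int, word_count: int) -> int:
    # A chip_parts memory size is materialized as bytes_per_word x word_count.
    return bytes_per_word * word_count


for name, (bpw, words) in SIZES.items():
    n = total_bytes(bpw, words)
    print(f"{name}: {n:,} B = {n / GIB:g} GiB")
```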

Bank counts (not a proto field — `*Target::MemBanks` C++ literals)

MemBanks(space)	v2 JF	v3 DF	v4 PF	v5p VF	v6e GL	v7x
VMEM (space 3)	8	8	16	32	32	32
CMEM (space 4)	FATAL	FATAL	32	FATAL	FATAL	FATAL
SMEM (space 5)	2	2	8	8	8	8

JellyfishTarget::MemBanks @ 0x1d48fc80 returns 8 for space 3, 2 for space 5, and LOG(FATAL)s otherwise (target_jellyfish.h:215). PufferfishTarget::MemBanks @ 0x1d493900 indexes the table at .rodata 0xb5305c8 = {16, 32, 8} for spaces 3/4/5 (target_pufferfish.h:228). ViperfishTarget::MemBanks @ 0x1d4999c0 and GhostliteTarget::MemBanks @ 0x1d4969c0 both return 32 for space 3 and 8 for space 5, and LOG(FATAL) on CMEM (space 4). DragonfishTarget does not override MemBanks and inherits Jellyfish's 8 / 2.
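
A reimplementer can mirror these overrides directly. The sketch below is a hypothetical Python model: the space numbers and bank counts come from this page, `MemBanksFatal` stands in for LOG(FATAL), and the class hierarchy is an assumption used only to show Dragonfish inheriting Jellyfish's answer; it is not the actual C++ hierarchy.

```python
VMEM, CMEM, SMEM = 3, 4, 5  # memory-space numbers used by MemBanks


class MemBanksFatal(RuntimeError):
    """Stands in for the LOG(FATAL) branch of the C++ overrides."""


def _vmem_smem_only(space, vmem_banks, smem_banks):
    # Shape shared by the overrides that answer only for VMEM and SMEM.
    if space == VMEM:
        return vmem_banks
    if space == SMEM:
        return smem_banks
    raise MemBanksFatal(f"MemBanks: unsupported memory space {space}")


class JellyfishTarget:
    def mem_banks(self, space):
        return _vmem_smem_only(space, vmem_banks=8, smem_banks=2)


class DragonfishTarget(JellyfishTarget):
    pass  # no MemBanks override: inherits Jellyfish's 8 / 2


class PufferfishTarget(JellyfishTarget):
    # Mirrors the .rodata table {16, 32, 8}, indexed by space - VMEM.
    _BANKS = (16, 32, 8)

    def mem_banks(self, space):
        if VMEM <= space <= SMEM:
            return self._BANKS[space - VMEM]
        # Behavior outside spaces 3..5 is not described here; assumed fatal.
        raise MemBanksFatal(f"MemBanks: unsupported memory space {space}")


class ViperfishTarget(JellyfishTarget):
    def mem_banks(self, space):
        return _vmem_smem_only(space, vmem_banks=32, smem_banks=8)


class GhostliteTarget(JellyfishTarget):
    def mem_banks(self, space):
        return _vmem_smem_only(space, vmem_banks=32, smem_banks=8)
```

In this model PufferfishTarget is the only class whose `mem_banks(CMEM)` returns a value instead of raising, matching the CMEM row of the bank table.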

Note: Pufferfish is the only generation whose MemBanks ladder has a CMEM (space 4) entry, and it is the only generation whose chip_parts has a SharedMemory[CMEM] (128 MiB). Every other generation LOG(FATAL)s on the CMEM space and has no CMEM shared memory — two independent encodings of "CMEM is first-class only on v4." See Memory Hierarchy.

Reading the Table

The BarnaCore → SparseCore pivot

The accelerator-core row is the generational hinge. v2/v3/v4 carry BARNA_CORE cores (the pre-SparseCore embedding/dedup engine); v5p/v6e/v7 carry SPARSE_CORE cores with SC_SEQ + (16× SC_TAC on v5p/v6e only) + 16× SC_TEC sequencers, plus the SPMEM/TILESPMEM/TEC memory family. 6acc60406 drops the separate SC_TAC sequencer (its SparseCore has SC_SEQ + SC_TEC×16 only) and widens the SC_TEC VectorIsa lane from 8 to 16. The lite parts (pufferfish_lite, viperfish_lite) carry neither BarnaCore nor SparseCore — they are TensorCore-only single-core dies.

The BarnaCore sequencer composition is not uniform across v2/v3/v4: Jellyfish's BarnaCore carries BC_ADDR ×16 only (no BC_SEQ entry in its blob), while Dragonfish and Pufferfish add a BC_SEQ ×1 master sequencer alongside the 16 BC_ADDR handlers. A reimplementer enumerating Jellyfish BarnaCore sequencers must not assume a BC_SEQ that is absent from the v2 proto.
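
The per-core sequencer composition from the two paragraphs above fits in a small table. A sketch, assuming a hypothetical `SEQUENCERS` map and an illustrative `KIND[i]` naming scheme; only the kinds and counts come from the decoded blobs.

```python
# Sequencers per accelerator core (BarnaCore or SparseCore), by codename.
SEQUENCERS = {
    "jellyfish": [("BC_ADDR", 16)],                 # no BC_SEQ in the v2 blob
    "dragonfish": [("BC_SEQ", 1), ("BC_ADDR", 16)],
    "pufferfish": [("BC_SEQ", 1), ("BC_ADDR", 16)],
    "viperfish": [("SC_SEQ", 1), ("SC_TAC", 16), ("SC_TEC", 16)],
    "ghostlite": [("SC_SEQ", 1), ("SC_TAC", 16), ("SC_TEC", 16)],
    "6acc60406": [("SC_SEQ", 1), ("SC_TEC", 16)],   # SC_TAC dropped
    "pufferfish_lite": [],                          # TensorCore-only die
    "viperfish_lite": [],                           # TensorCore-only die
}


def enumerate_sequencers(codename: str) -> list:
    """Expand (kind, count) pairs into individually indexed sequencer names."""
    return [f"{kind}[{i}]"
            for kind, count in SEQUENCERS[codename]
            for i in range(count)]
```

Driving enumeration from the decoded table, rather than from a hardcoded "BC_SEQ + 16 × BC_ADDR" template, keeps Jellyfish from acquiring a phantom master sequencer.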

The register-file widening at v5p

v2/v3/v4 TensorCore sequencers report SREG 32, VREG 32, PREG 15, VMREG 8. From Viperfish (v5p) onward the file is SREG 32, VREG 64, PREG 14, VMREG 16 — VREG doubled, VMREG doubled, PREG dropped by one. This is a clean proto-visible discontinuity at the v4→v5p boundary that a reimplementer must respect when allocating the per-generation register sets.
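
Keyed by TpuVersionProto, the discontinuity is a single branch. A minimal sketch: the counts are from the table, while `reg_file`, `register_names`, and the `s0`/`v0`/`p0`/`vm0` naming are hypothetical.

```python
from typing import NamedTuple


class RegFile(NamedTuple):
    """TensorCore sequencer register-file sizes as reported by chip_parts."""
    sreg: int
    vreg: int
    preg: int
    vmreg: int


PRE_V5P = RegFile(sreg=32, vreg=32, preg=15, vmreg=8)    # Jellyfish, Dragonfish, Pufferfish
FROM_V5P = RegFile(sreg=32, vreg=64, preg=14, vmreg=16)  # Viperfish, Ghostlite, 6acc60406


def reg_file(tpu_version_proto: int) -> RegFile:
    """Select the register file by TpuVersionProto (1..6 on this page)."""
    if not 1 <= tpu_version_proto <= 6:
        raise ValueError(f"unknown TpuVersionProto {tpu_version_proto}")
    return PRE_V5P if tpu_version_proto <= 3 else FROM_V5P


def register_names(tpu_version_proto: int) -> dict:
    """Allocate one name per register, e.g. v0..v63 on Viperfish and later."""
    rf = reg_file(tpu_version_proto)
    prefixes = {"sreg": "s", "vreg": "v", "preg": "p", "vmreg": "vm"}
    return {kind: [f"{prefixes[kind]}{i}" for i in range(getattr(rf, kind))]
            for kind in RegFile._fields}
```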

MXU count vs systolic dimension

mxu_count rises 1→2→4→4 across v2..v5p, then drops to 2 for v6e/v7. The drop is compensated by the systolic-array dimension: v6e/v7 use 2 × 256×256 arrays where v2..v5p use up to 4 × 128×128. The 256 comes from the GhostliteTarget C++ overrides MxuContractingSize/MxuNoncontractingSize (byte-confirmed at 0x1d497840/0x1d497860; the base Target returns 128). It is the one MXU geometry value that is a C++ literal rather than a proto field, since the proto carries only lane_count=128 and mxu_count. Because the literal itself is byte-exact, the systolic-dim row is CONFIRMED, flagged as a C++-override source rather than a proto field.

The 128×128 generations cross-validate. Per-TensorCore peak BF16 = 2 × mxu_count × 128² × frequency_mhz × 10⁶ gives 22.9 TFLOPS for v2 (1 MXU × 700 MHz), 61.6 T for v3 (2 × 940 MHz), and 137.6 T for v4 (4 × 1050 MHz). Doubled for the two TensorCores per std chip, these closely match the published per-chip figures.
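
The cross-check is one line of arithmetic. A sketch under the formula stated above (2 FLOPs per multiply-accumulate, mxu_dim² MACs per MXU per cycle); `peak_bf16_flops` is an illustrative name, and only the 128×128 generations are evaluated, as in the text.

```python
def peak_bf16_flops(mxu_count: int, mxu_dim: int, freq_mhz: int) -> int:
    """Per-TensorCore peak: each MXU does mxu_dim^2 MACs/cycle, 2 FLOPs per MAC."""
    return 2 * mxu_count * mxu_dim**2 * freq_mhz * 1_000_000


# 128x128 generations from the master table: (name, mxu_count, TensorCore MHz)
for name, mxus, mhz in [("v2", 1, 700), ("v3", 2, 940), ("v4", 4, 1050)]:
    per_core = peak_bf16_flops(mxus, 128, mhz)
    per_chip = 2 * per_core  # two TensorCores per std chip on v2..v4
    print(f"{name}: {per_core / 1e12:.1f} T per TensorCore, "
          f"{per_chip / 1e12:.1f} T per chip")
```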

Note: the proto's sublane_count is 8 for every generation, including v4 — the chip_parts VectorIsa.sublane_count field is unambiguously 8, not 16. The tile dimension a tiling pass consumes is Tile(SublaneCount, LaneCount) = (8, 128) on every gen in this build (the Target::SublaneCount accessor reads exactly this proto value). A reimplementation that hardcodes a 16-sublane v4 tile diverges from the loaded geometry.
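
To see why the sublane count matters, consider how a tile is applied to an array shape. A minimal sketch, assuming the tiling pass pads the two minor dimensions up to tile multiples (the usual behavior of XLA's (8, 128) TPU tiling for 32-bit element types); `tiled_shape` and `round_up` are hypothetical helpers.

```python
TILE = (8, 128)  # (SublaneCount, LaneCount) on every generation in this build


def round_up(x: int, multiple: int) -> int:
    return -(-x // multiple) * multiple


def tiled_shape(shape: tuple, tile: tuple = TILE) -> tuple:
    """Pad the two minor dimensions up to multiples of (sublanes, lanes)."""
    if len(shape) < 2:
        raise ValueError("expected at least two dimensions")
    *major, rows, cols = shape
    return (*major, round_up(rows, tile[0]), round_up(cols, tile[1]))
```

With the loaded (8, 128) tile a 20 × 128 array pads to 24 × 128; a hardcoded 16-sublane v4 tile would pad it to 32 × 128, which is exactly the divergence the note warns about.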

Name	Relationship
`TpuChipParts::FromProto`	parses each blob; the proto fields above become `Target` capability fields
`*Target::MemBanks`	the C++ source of the bank-count rows; together with the GhostliteTarget MXU-size overrides, the only non-proto integers on this page
`TpuChipConfig::Create`	parallel resolver for the mode configs; not the source of any row on this page

Cross-References

chip_parts.binarypb Decode — the proto schema, blob locations, and resolution path these constants are decoded from
TpuChipConfig — how these constants assemble into the runtime Target config and who reads it
Codename Matrix — TpuVersion ↔ codename ↔ marketing-name mapping
Per-Gen Comparison Matrix — cross-page consolidated per-generation comparison
Memory Hierarchy — the HBM/VMEM/SMEM/SFLAG/CMEM tier model these sizes populate
Cost Model Overview — consumer of the frequency and bandwidth rows
ISA Overview — consumer of the MXU lane/sublane and register-file rows

libtpu Internals — Reverse-Engineering Reference