Per-Codename Constant Table

All constant values on this page were decoded byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

This page is the consolidated per-generation hardware-constant table for every TPU codename the binary knows: Jellyfish (v2), Dragonfish (v3), Pufferfish (v4), Viperfish (v5p/v5e), Ghostlite (v6e), and 6acc60406 (v7x). It is reference-table-centric: the master table is the point of the page, and the prose around it exists only to name the source of each row and flag the confidence.

Two source classes feed the table. The first and dominant is the embedded <codename>_chip_parts.binarypb proto blob, decoded directly from .rodata (see chip_parts.binarypb Decode for the schema and resolution path). Every memory size, core count, MXU geometry integer, clock, register count, and DMA constant below comes from those bytes, materialized as bytes_per_word × word_count or read as a scalar field. The second is the small set of constants the proto does not carry — the VMEM/SMEM/CMEM bank counts — which are C++ literals in the per-codename *Target::MemBanks overrides.

All nine blobs were carved from .rodata, md5-verified against their FileWrapper descriptor fingerprints, and walked field-by-field against the schema recovered from protodesc_cold. The decode reproduces, byte-for-byte, the relationships a reimplementer would expect (e.g. peak BF16 = 2 × mxu_count × 128² × frequency_mhz for the 128×128 generations), and every row carries a Confidence column with its source.

Source (capability): nine *_chip_parts.binarypb blobs, .rodata 0x0BDF29A0..0x0BDF3AB8
Source (bank counts): *Target::MemBanks C++ overrides (addresses below)
Decode method: md5-verified carve + field-walk against protodesc_cold schema
Generations: jellyfish/dragonfish/pufferfish/viperfish/ghostlite/6acc60406 (TpuVersionProto 1..6)
Confidence: CONFIRMED unless a cell is annotated otherwise

Master Hardware-Constant Table

All values are per TensorCore unless noted. "std" is the full part; "lite" is the viperfish_lite (v5e) or pufferfish_lite (v4 lite) blob, distinguished at resolution by the variant_name field. 6acc60406 is shown as die / full-chip (the tensornode blob vs the full two-die package). HBM size cells show the exact byte product; the GiB figure is bytes / 2^30.

| Constant | v2 Jellyfish | v3 Dragonfish | v4 Pufferfish (std / lite) | v5p/v5e Viperfish (std / lite) | v6e Ghostlite | v7x 6acc60406 (die / full) |
|---|---|---|---|---|---|---|
| TpuVersionProto | 1 | 2 | 3 | 4 | 5 | 6 |
| driver_abi_version | 1 | 1 | 1 | 1 | 1 | 1 |
| HBM size | 16 GiB | 32 GiB | 32 GiB / 8 GiB | 96 GiB / 16 GiB | 31.5 GiB | 95 GiB / 190 GiB |
| HBM stacks × per-stack | 2 × 8 GiB | 2 × 16 GiB | 1 × 32 GiB / 1 × 8 GiB | 1 × 96 GiB / 1 × 16 GiB | 1 × 31.5 GiB | 1 / 2 × 95 GiB |
| HBM word (bytes_per_word) | 1024 B | 1024 B | 512 B | 32 B / 512 B | 32 B | 32 B |
| HBM clock | 1400 MHz | 1800 MHz | 2400 MHz | 3600 / 3200 MHz | 6400 MHz | 7200 MHz |
| HBM bandwidth / stack | 0.317 TB/s | 0.430 TB/s | 0.982 / 0.492 TB/s | 2.350 / 0.738 TB/s | 1.638 TB/s | 3.686 TB/s |
| VMEM / TensorCore | 16 MiB | 16 MiB | 16 MiB | 64 / 128 MiB | 128 MiB | 64 MiB |
| VMEM word | 512 B | 512 B | 512 B | 512 B | 512 B | 512 B |
| SMEM (TensorCore) | 16 KiB | 16 KiB | 1 MiB | 1 MiB | 1 MiB | 1 MiB |
| SMEM word | 4 B | 4 B | 4 B | 4 B | 4 B | 4 B |
| SFLAG (TensorCore) | 1 KiB | 1 KiB | 2 KiB | 2 KiB | 2 KiB | 16 KiB |
| SFLAG word | 4 B | 4 B | 4 B | 4 B | 4 B | 4 B |
| CMEM (SharedMemory) | absent | absent | 128 MiB / 128 MiB | absent | absent | absent |
| CMEM word / clock / bw | | | 512 B / 1050 MHz / 2.151 TB/s | | | |
| MXU lane × sublane | 128 × 8 | 128 × 8 | 128 × 8 | 128 × 8 | 128 × 8 | 128 × 8 |
| MXU count / TensorCore | 1 | 2 | 4 | 4 | 2 | 2 |
| XLU count / TensorCore | 1 | 1 | 2 | 3 | 2 | 2 |
| IAR count / TensorCore | 2 | 2 | 2 | 2 | 2 | 2 |
| MXU systolic dim | 128×128 | 128×128 | 128×128 | 128×128 | 256×256 | 256×256 |
| TensorCore freq | 700 MHz | 940 MHz | 1050 MHz | 1750 / 1500 MHz | 1750 MHz | 1900 MHz |
| TensorCores / chip (std/lite) | 2 | 2 | 2 / 1 | 2 / 1 | 1 | 1 / 2 |
| Reg file SREG/VREG/PREG/VMREG | 32/32/15/8 | 32/32/15/8 | 32/32/15/8 | 32/64/14/16 | 32/64/14/16 | 32/64/14/16 |
| Accelerator core type | BARNA_CORE | BARNA_CORE | BARNA_CORE | SPARSE_CORE | SPARSE_CORE | SPARSE_CORE |
| accelerator count / chip (std/lite) | 2 | 2 | 4 / 0 | 4 / 0 | 2 | 2 / 4 |
| SparseCore freq | | | | 1475 MHz | 1350 MHz | 1750 MHz |
| SC sequencers (SEQ/TAC/TEC) | | | | 1 / 16 / 16 | 1 / 16 / 16 | 1 / 0 / 16 |
| SC TEC VectorIsa lane × sublane | | | | 8 × 1 | 8 × 1 | 16 × 1 |
| SC SPMEM / TILESPMEM | | | | 8 MiB / 512 KiB | 4 MiB / 256 KiB | 8 MiB / 512 KiB |
| SC SMEM (SCS) / SFLAG (SCS) | | | | 64 KiB / 28 KiB | 64 KiB / 28 KiB | 64 KiB / 28 KiB |
| SC tile_hbm_bw / stream_granule | | | | 32 B/cyc / 4 B | 32 B/cyc / 4 B | 64 B/cyc / 4 B |
| DMA granule bytes | 1024 B | 1024 B | 512 B | 32 B / 512 B | 32 B | 32 B |
| DMA host / device align | 16 / 1024 | 16 / 1024 | 32 / 512 | 32 / 32 (lite 32 / 512) | 32 / 32 | 32 / 32 |
| sync_flag_granule | 1024 B | 1024 B | 512 B | 32 B | 32 B | 32 B |
| max_single_host_dma | 8 MiB | 16 MiB | 2 GiB | 128 GiB | 64 GiB | 32 GiB |
| misc: extra_done / host_async / count_dones | n/n/n | n/n/n | y/n/n | y/y/y (lite y/y/n) | y/y/y | y/y/y |

The exact byte products behind the headline HBM and CMEM cells: Jellyfish HBM 1024 × 8,388,608 = 8,589,934,592 B (8 GiB) per stack × 2 stacks = 16 GiB; Pufferfish HBM 512 × 67,108,864 = 34,359,738,368 B (32 GiB) plus CMEM 512 × 262,144 = 134,217,728 B (128 MiB); Viperfish HBM 32 × 3,221,225,472 = 103,079,215,104 B (exactly 96 GiB); Ghostlite HBM 32 × 1,056,964,608 = 33,822,867,456 B (31.5 GiB, 32 GiB nominal less ECC); 6acc60406 HBM 32 × 3,187,671,040 = 102,005,473,280 B (95 GiB per die).
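The products above can be re-derived mechanically. The following is a minimal sketch in Python, assuming only the bytes_per_word and word_count values quoted in the preceding paragraph; the dictionary keys and the helper name size_bytes are illustrative and do not come from libtpu.

```python
# Re-derive the headline memory sizes as bytes_per_word * word_count.
# The (bytes_per_word, word_count) pairs are the ones quoted on this page;
# the key names and helper are hypothetical, chosen for illustration only.
GIB = 2**30

MEMORIES = {
    "jellyfish_hbm_per_stack": (1024, 8_388_608),
    "pufferfish_hbm": (512, 67_108_864),
    "pufferfish_cmem": (512, 262_144),
    "viperfish_hbm": (32, 3_221_225_472),
    "ghostlite_hbm": (32, 1_056_964_608),
    "6acc60406_hbm_per_die": (32, 3_187_671_040),
}


def size_bytes(bytes_per_word: int, word_count: int) -> int:
    """Materialize a memory size the way the page describes it."""
    return bytes_per_word * word_count


if __name__ == "__main__":
    for name, (bpw, words) in MEMORIES.items():
        total = size_bytes(bpw, words)
        print(f"{name}: {total:,} B = {total / GIB:g} GiB")
```

Integer arithmetic keeps the products exact, so the GiB conversion is the only place a fraction (31.5 GiB for Ghostlite, 0.125 GiB for the Pufferfish CMEM) can appear.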

Bank counts (not a proto field — *Target::MemBanks C++ literals)

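| MemBanks(space) | v2 JF | v3 DF | v4 PF | v5p VF | v6e GL | v7x |
|---|---|---|---|---|---|---|
| VMEM (space 3) | 8 | 8 | 16 | 32 | 32 | 32 |
| CMEM (space 4) | FATAL | FATAL | 32 | FATAL | FATAL | FATAL |
| SMEM (space 5) | 2 | 2 | 8 | 8 | 8 | 8 |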

JellyfishTarget::MemBanks @ 0x1d48fc80 returns 8 for space 3, 2 for space 5, and LOG(FATAL) otherwise (target_jellyfish.h:215). PufferfishTarget::MemBanks @ 0x1d493900 indexes the table at .rodata 0xb5305c8 = {16, 32, 8} for spaces 3/4/5 (target_pufferfish.h:228). ViperfishTarget::MemBanks @ 0x1d4999c0 and GhostliteTarget::MemBanks @ 0x1d4969c0 each return 32 for space 3, 8 for space 5, and LOG(FATAL) otherwise. Dragonfish does not override MemBanks and inherits Jellyfish's 8 / 2.
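For a reimplementation, the bank-count ladder can be modeled as a per-codename lookup that fails loudly on an unsupported space. This is a sketch under assumptions: the values are copied from the bank-count table above, the exception class stands in for LOG(FATAL), and the names MEM_BANKS, mem_banks, and UnsupportedMemorySpace are hypothetical rather than libtpu symbols.

```python
# Model of the per-codename MemBanks overrides as a lookup table.
# Space indices 3/4/5 follow the table on this page; everything else
# (names, exception type) is an illustrative assumption.
VMEM, CMEM, SMEM = 3, 4, 5

MEM_BANKS = {
    "jellyfish": {VMEM: 8, SMEM: 2},
    "dragonfish": {VMEM: 8, SMEM: 2},  # inherits the Jellyfish values
    "pufferfish": {VMEM: 16, CMEM: 32, SMEM: 8},
    "viperfish": {VMEM: 32, SMEM: 8},
    "ghostlite": {VMEM: 32, SMEM: 8},
    "6acc60406": {VMEM: 32, SMEM: 8},  # values taken from the table row
}


class UnsupportedMemorySpace(Exception):
    """Stand-in for the LOG(FATAL) branch of the C++ override."""


def mem_banks(codename: str, space: int) -> int:
    """Return the bank count, or raise where the original would abort."""
    try:
        return MEM_BANKS[codename][space]
    except KeyError:
        raise UnsupportedMemorySpace(
            f"no bank count for codename={codename!r}, space={space}"
        ) from None


if __name__ == "__main__":
    print(mem_banks("pufferfish", CMEM))
    try:
        mem_banks("viperfish", CMEM)
    except UnsupportedMemorySpace as err:
        print("fatal:", err)
```

Raising instead of returning a default mirrors the fatal behaviour: a caller that asks for CMEM banks on a non-v4 part has a logic error that should not be silently papered over.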

Note: Pufferfish is the only generation whose MemBanks ladder has a CMEM (space 4) entry, and it is the only generation whose chip_parts has a SharedMemory[CMEM] (128 MiB). Every other generation LOG(FATAL)s on the CMEM space and has no CMEM shared memory — two independent encodings of "CMEM is first-class only on v4." See Memory Hierarchy.


Reading the Table

The BarnaCore → SparseCore pivot

The accelerator-core row is the generational hinge. v2/v3/v4 carry BARNA_CORE cores (the pre-SparseCore embedding/dedup engine); v5p/v6e/v7x carry SPARSE_CORE cores, each with one SC_SEQ sequencer, 16 SC_TAC sequencers (v5p/v6e only), and 16 SC_TEC sequencers, plus the SPMEM/TILESPMEM/TEC memory family. 6acc60406 drops the separate SC_TAC sequencer (its SparseCore has SC_SEQ + SC_TEC×16 only) and widens the SC_TEC VectorIsa lane from 8 to 16. The lite parts (pufferfish_lite, viperfish_lite) carry neither BarnaCore nor SparseCore: they are TensorCore-only single-core dies.

The BarnaCore sequencer composition is not uniform across v2/v3/v4: Jellyfish's BarnaCore carries BC_ADDR ×16 only (no BC_SEQ entry in its blob), while Dragonfish and Pufferfish add a BC_SEQ ×1 master sequencer alongside the 16 BC_ADDR handlers. A reimplementer enumerating Jellyfish BarnaCore sequencers must not assume a BC_SEQ that is absent from the v2 proto.
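The sequencer compositions described in this section can be written down as data so that an enumerator never assumes an entry that a given blob lacks. A minimal sketch, assuming the per-accelerator-core counts stated above; the table name and the helper sequencer_count are hypothetical.

```python
# Sequencers per accelerator core, as described on this page.
# An absent key means the blob has no such sequencer entry; the lite parts
# carry no accelerator core at all. Names here are illustrative only.
ACCELERATOR_SEQUENCERS = {
    "jellyfish": {"BC_ADDR": 16},  # no BC_SEQ entry in the v2 blob
    "dragonfish": {"BC_SEQ": 1, "BC_ADDR": 16},
    "pufferfish": {"BC_SEQ": 1, "BC_ADDR": 16},
    "pufferfish_lite": {},
    "viperfish": {"SC_SEQ": 1, "SC_TAC": 16, "SC_TEC": 16},
    "viperfish_lite": {},
    "ghostlite": {"SC_SEQ": 1, "SC_TAC": 16, "SC_TEC": 16},
    "6acc60406": {"SC_SEQ": 1, "SC_TEC": 16},  # SC_TAC dropped
}


def sequencer_count(codename: str, kind: str) -> int:
    """Number of sequencers of `kind`; 0 when the blob has no such entry."""
    return ACCELERATOR_SEQUENCERS[codename].get(kind, 0)


if __name__ == "__main__":
    for codename, seqs in ACCELERATOR_SEQUENCERS.items():
        print(codename, seqs or "no accelerator core")
```

Returning 0 for a missing kind, rather than raising, lets callers iterate over a fixed list of sequencer kinds while still distinguishing Jellyfish (no BC_SEQ) from Dragonfish and Pufferfish.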

The register-file widening at v5p

v2/v3/v4 TensorCore sequencers report SREG 32, VREG 32, PREG 15, VMREG 8. From Viperfish (v5p) onward the file is SREG 32, VREG 64, PREG 14, VMREG 16 — VREG doubled, VMREG doubled, PREG dropped by one. This is a clean proto-visible discontinuity at the v4→v5p boundary that a reimplementer must respect when allocating the per-generation register sets.
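One way to respect the discontinuity is to key the register-file shape on TpuVersionProto rather than hardcoding a single layout. The sketch below assumes the version numbering 1..6 from the master table; RegFile, REG_FILES, and reg_file are illustrative names, not identifiers recovered from the binary.

```python
# Per-generation TensorCore register-file sizes, keyed by TpuVersionProto.
# Counts are the ones quoted above; the type and names are hypothetical.
from typing import NamedTuple


class RegFile(NamedTuple):
    sreg: int
    vreg: int
    preg: int
    vmreg: int


PRE_V5P = RegFile(sreg=32, vreg=32, preg=15, vmreg=8)
V5P_ONWARD = RegFile(sreg=32, vreg=64, preg=14, vmreg=16)

REG_FILES = {
    1: PRE_V5P,      # Jellyfish (v2)
    2: PRE_V5P,      # Dragonfish (v3)
    3: PRE_V5P,      # Pufferfish (v4)
    4: V5P_ONWARD,   # Viperfish (v5p/v5e)
    5: V5P_ONWARD,   # Ghostlite (v6e)
    6: V5P_ONWARD,   # 6acc60406 (v7x)
}


def reg_file(tpu_version: int) -> RegFile:
    """Look up the register-file shape for a TpuVersionProto value."""
    return REG_FILES[tpu_version]


if __name__ == "__main__":
    old, new = reg_file(3), reg_file(4)
    print("VREG", old.vreg, "->", new.vreg)
    print("VMREG", old.vmreg, "->", new.vmreg)
    print("PREG", old.preg, "->", new.preg)
```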

MXU count vs systolic dimension

mxu_count rises 1→2→4→4 across v2..v5p, then drops to 2 for v6e/v7x. The drop is compensated by the systolic-array dimension: v6e/v7x use 2 × 256×256 arrays, where v2..v5p use up to 4 × 128×128. The 256 comes from the GhostliteTarget C++ overrides MxuContractingSize/MxuNoncontractingSize = 256 (byte-confirmed at 0x1d497840/0x1d497860; the base Target returns 128). It is the one MXU geometry value that is a C++ literal rather than a proto field (the proto carries only lane_count=128 and mxu_count), but the literal itself is byte-exact, so the systolic-dim row is CONFIRMED and flagged as a C++-override source. The 128×128 generations cross-validate: peak BF16 = 2 × mxu_count × 128² × frequency_mhz gives 22.9 T for v2 (1 MXU × 700 MHz), 61.6 T for v3 (2 × 940), and 137.6 T for v4 (4 × 1050) per TensorCore; doubled for the two TensorCores per chip, these land within about 2% of the published per-chip FLOPS.
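The cross-check arithmetic is short enough to run directly. This is a sketch assuming the formula exactly as stated above (2 operations per multiply-accumulate, one 128×128 array pass per cycle) and the mxu_count and frequency values from the master table; the function name peak_bf16_flops is hypothetical, and the sketch deliberately covers only the 128×128 generations the formula is stated for.

```python
# Peak BF16 throughput for the 128x128 generations:
#   2 * mxu_count * 128^2 * frequency
# Inputs come from the master table; the helper name is illustrative.

def peak_bf16_flops(mxu_count: int, freq_mhz: float, dim: int = 128) -> float:
    """Peak FLOP/s for `mxu_count` systolic arrays of size dim x dim."""
    macs_per_cycle = mxu_count * dim * dim
    return 2 * macs_per_cycle * freq_mhz * 1e6


GENERATIONS = [
    ("v2 Jellyfish", 1, 700),
    ("v3 Dragonfish", 2, 940),
    ("v4 Pufferfish", 4, 1050),
]

if __name__ == "__main__":
    for name, mxu_count, freq_mhz in GENERATIONS:
        tflops = peak_bf16_flops(mxu_count, freq_mhz) / 1e12
        print(f"{name}: {tflops:.1f} TFLOP/s")
```

The three results round to the 22.9 T, 61.6 T, and 137.6 T figures quoted in the paragraph above.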

Note: the proto's sublane_count is 8 for every generation, including v4 — the chip_parts VectorIsa.sublane_count field is unambiguously 8, not 16. The tile dimension a tiling pass consumes is Tile(SublaneCount, LaneCount) = (8, 128) on every gen in this build (the Target::SublaneCount accessor reads exactly this proto value). A reimplementation that hardcodes a 16-sublane v4 tile diverges from the loaded geometry.


| Name | Relationship |
|---|---|
| TpuChipParts::FromProto | parses each blob; the proto fields above become Target capability fields |
| *Target::MemBanks | the C++ source of the bank-count rows (the only non-proto integers here) |
| TpuChipConfig::Create | parallel resolver for the mode configs; not the source of any row on this page |

Cross-References