Per-Gen Comparison Matrix
Every per-generation constant on this page was decoded byte-exactly from
libtpu.so in the libtpu-0.0.40-cp314 wheel (build libtpu_lts_20260413_b_RC00, BuildID md5 89edbbe81c5b328a958fe628a9f2207d). The binary is not stripped — every accessor is a demangled C++ name; .text/.rodata/.lrodata map VMA == file offset and .data.rel.ro maps VMA − 0x200000 == file offset. Other builds will differ.
Abstract
This appendix is the single master table that lines up every TPU generation libtpu.so knows about — jellyfish (v2), dragonfish (v3), pufferfish (v4, + v4 lite), viperfish (v5p, + v5e lite), ghostlite (v6e), 6acc60406 (v7) — side by side across every recovered hardware constant a reimplementer needs to model a generation: bundle byte size, the lane/sublane/chunk geometry, the MXU / XLU / IAR execution-unit counts and dimensions, the per-tier memory capacities (VMEM, SMEM, SFLAG, CMEM, HBM), the cost-model class trio (LatencyTable / CycleTable / Performance subclass), local DMA bandwidth, ICI/PCIe bandwidth, the accelerator-core type (BarnaCore vs SparseCore, with the SCS/TAC/TEC sequencer split), and the three independent version ordinals. Each column's deep derivation lives on a dedicated cost / isa / targets page; this page is the cross-gen index those pages point back to.
Three facts make the consolidation worth doing on one page. First, almost every per-die geometry constant flows from one source — the embedded <codename>_chip_parts.binarypb proto, selected by TpuChipParts::DefaultsForVersion (0x20b1b040) and materialized into the runtime Target / TpuTopology objects at boot — so the whole matrix is anchored to one load path rather than to scattered C++ literals. Second, three constants that look per-gen are actually gen-stable (lane_count = 128, sublane_count = 8, iar_count = 2 for every generation); the genuinely per-gen ISA-geometry knobs are mxu_count (1→2→4→4→2→2) and xlu_count (1→1→2→3→2→2). Third, the cost model collapses 6 versions onto 5 class boundaries at the silicon-architecture seams: v2 and v3 share one LatencyTable class (split only by a DeviceIdentifiers compare into PerformanceJf/PerformanceDf), and v6e and v7 share GhostlitePerformance behind distinct wrapper classes. The matrix below makes those merges explicit.
For navigation, the contract is:
- The master matrix is one row per generation, the headline constants as columns, each row carrying a Confidence. It is the canonical lookup; the rest of the wiki links here.
- Grouped detail tables then split the matrix by subsystem — geometry, compute units, memory tiers, cost-model classes, interconnect, and the BarnaCore↔SparseCore pivot — because no single 30-column table is readable.
- Callouts flag the traps: the Viperfish std/lite variant split decided at runtime by a string compare, the v7 6acc60406 dropping the TAC sequencer, the v4-only CMEM tier, and the cells this build genuinely cannot confirm.
| Item | Anchor |
|---|---|
| Version axis (internal) | tpu::TpuVersion 0..5 — TpuVersionToString @ 0x20b3a480, table off_22011BF0 (6 ptrs) |
| Per-die geometry source | <codename>_chip_parts.binarypb → TpuChipParts::DefaultsForVersion @ 0x20b1b040 → FromProto |
| Geometry cache | tpu::TpuTopology ctor @ 0x20acee60 (lane→+0x198, sublane→+0x1a0); read via Target[+0x3b8] |
| ISA-geometry source | VectorIsa sub-message (mxu/xlu/iar) → Target+0x4ac/+0x4b0/+0x4a8 |
| Cost-model factories | LatencyTable::Create @ 0x1c89fba0 (version-indexed); CycleTable::Create @ 0x1c89cc00 (Target-keyed) |
| Confidence | CONFIRMED (byte-anchored) unless a cell or callout says otherwise |
The Master Per-Gen Matrix
One row per generation, in tpu::TpuVersion order. The cells are the headline per-die (single-tensornode) values; lite-variant and full-chip deltas live in the grouped tables below. The Confidence column applies to the whole row's binary-anchored cells.
| Generation | TpuVer | Codec | Bundle | Lane×Sub | Chunks/Tile | MXU (count×geom) | XLU | IAR | VMEM | SMEM | SFLAG | CMEM | Accel core | Cost trio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jellyfish v2 | 0 | jxc | 41 B | 128×8 | 16 | 1 × 128² | 1 | 2 | 16 MiB | 16 KiB | 1 KiB | — | BarnaCore ×2 | JF / Jf / PerformanceJf |
| Dragonfish v3 | 1 | jxc | 41 B | 128×8 | 16 | 2 × 128² | 1 | 2 | 16 MiB | 16 KiB | 1 KiB | — | BarnaCore ×2 | JF / Jf / PerformanceDf |
| Pufferfish v4 | 2 | pxc / pfc | 51 B | 128×8 | 16 | 4 × 128² | 2 | 2 | 16 MiB | 1 MiB | 2 KiB | 128 MiB | BarnaCore ×4 | PF / Pf / PufferfishPerformance |
| Viperfish v5p | 3 | vxc / vfc | 64 B | 128×8 | 16 | 4 × 128² | 3 | 2 | 64 MiB | 1 MiB | 2 KiB | — | SparseCore ×4 | VF / Vf / ViperfishPerformance |
| Ghostlite v6e | 4 | gxc / glc | 64 B | 128×8 | 16 | 2 × 256² | 2 | 2 | 128 MiB | 1 MiB | 2 KiB | — | SparseCore ×2 | GL / Glc / GhostlitePerformance |
| 6acc60406 v7 | 5 | gxc / gfc | 64 B | 128×8 | 16 | 2 × 256² | 2 | 2 | 64 MiB | 1 MiB | 16 KiB | — | SparseCore ×2 | Gf / Gfc / GhostlitePerformance |
NOTE — the v7 6acc60406 blob ships in this build (both the single-die tensornode and the 2-die full-chip variant), so its memory/ISA constants are decoded straight from the proto wire bytes. The v2–v6e blobs are also embedded in this build (nine chip_parts.binarypb blobs total, contiguous in .rodata at 0xbdf29a0..), so the older-gen lane/sublane/MXU/memory values are likewise proto-sourced, not inferred — the chip_parts HBM/VMEM/SMEM/SFLAG bytes, the MXU VectorIsa, and the per-gen frequencies are all materializable. Cost-model class selection (LatencyTable / CycleTable / Performance) is decided purely by the TpuVersion ordinal via the two factories named above.
GOTCHA — lane_count (128), sublane_count (8), and iar_count (2) do not grow with the generation. A reimplementer who expects the lane width to widen at v5/v7 (as the HBM clock and VMEM do) will mis-model every tile loop. The 128/8 lane/sublane pair is also the constant fallback the TpuTopology ctor (0x20acee60) installs when the proto omits the field; every blob also carries 128/8 explicitly, so both paths agree. The only per-gen VectorIsa deltas are mxu_count and xlu_count.
Geometry — Lane / Sublane / Chunk
The transpose / tiling geometry is a chip_parts property cached on the TpuTopology, read through the Target accessors. The whole matrix is numerically identical across generations because every blob's TensorCore VectorIsa reports lane_count = 128, sublane_count = 8.
| Constant | Accessor | Value (all gens) | Source |
|---|---|---|---|
| LaneCount | Target::LaneCount @ 0x1d60f400 | 128 | *(Target[+0x3b8] + 0x198) ← VectorIsa.lane_count |
| SublaneCount | Target::SublaneCount @ 0x1d60f300 | 8 | *(Target[+0x3b8] + 0x1a0) ← VectorIsa.sublane_count |
| ChunksPerTile | Target::ChunksPerTile @ 0x1d60f2c0 | 16 | LaneCount / SublaneCount (idiv, 32-bit fast path) |
| Tile element count | TpuTopology[+0x1a8] | 1024 | LaneCount × SublaneCount (ctor @0x20acf2e8) |
The write path is unambiguous: the TpuTopology ctor (0x20acee60) gates on the platform-has-TensorCore byte (topo[+0x80]), reads CoreParts(TENSOR_CORE).SequencerParts(0).vector_isa(), checks the matrix-unit has-bit, then stores LaneCount ← vector_isa[+0x0] (@0x20acf2bc) and SublaneCount ← vector_isa[+0x4] (@0x20acf2c6), with the hard-coded 128/8 fallback at @0x20acf2cc/@0x20acf2dc. Target::Init (0x1d60fc20) then installs the TpuTopology* at Target[+0x3b8] (mov [rcx+0x3b8],rdi).
NOTE — the lane/sublane geometry is not written by tpu::TpuChipConfig::Create (0x20ae98e0). TpuChipConfig::Create resolves a config alias via the kChipConfigAliases flat-map (0x2200b8b0) and builds the driver-side memory/queue layout; it does not touch LaneCount/SublaneCount. The geometry is a TpuChipParts/TpuTopology property, written by the TpuTopology ctor from the MXU VectorIsa. The cfg in MaximumNumberOfChunks is the TpuTopology object (Target[+0x3b8]), not a separate config struct.
NOTE — which generation reads which kChipConfigAliases entry. kChipConfigAliases (0x2200b8b0) is not keyed by a runtime-probed device string; it is a static gtl::flat_map<tpu::TpuVersionAndVariant, MapView<string_view, string_view>> with exactly four inline entries, keyed by TpuVersion ordinal {2, 3, 4, 5} — each with variant "default" (the version field is the literal byte at each 48-byte entry's +0x0: 02/03/04/05, byte-confirmed in .data.rel.ro). TpuChipConfig::Create(TpuVersion, string_view variant) (0x20ae98e0) looks the pair up with gtl::flat_map::find (0x20afd7c0) and returns that version's alias sub-map. The alias-name vocabulary across the four sub-maps is default, legacy, megacore, megachip, megachip_tccontrol (.rodata string_views at 0x84f7d8c/0x84be65c/0x86a44a9/0x85c5f87/0x861b61d). The TpuVersion 4 and 5 entries (Ghostlite v6e and 6acc60406 v7 in the master matrix, not the marketing v4/v5) share one MapView backing (both entries' +0x28 pointers relocate to 0x2200b9b0), so those two generations resolve aliases through an identical sub-map. The consumer split is therefore fully static and per-TpuVersion; the exact key→value pairing inside each type-erased MapView is the only residual. [Confidence: HIGH for the 4-entry version-keyed structure, the variant="default" keys, the alias vocabulary, and the TpuVersion 4/5 shared sub-map — all byte-read from .data.rel.ro/.rodata + the find call site; MEDIUM for the internal pair direction. See the TpuVersion ↔ codename matrix for the ordinal→codename binding.]
MaximumNumberOfChunks(VxposeMode, n) (0x1d60f200) is therefore fully numeric and identical across generations (ElementCount table 0xb53c830 = {1,2,4,1,2}): B32 → 16, CompressedB16 → 8, CompressedB8 → 4, SegmentedB32 → n/8, SegmentedB16 → n/16.
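The five results above each reduce to one integer division. The following is a minimal Python sketch of that arithmetic, not the recovered code: the VxposeMode ordinals, the assumption that the ElementCount table is indexed in the listed mode order, and the exact form of the division are all inferred from the values quoted on this page.

```python
from enum import IntEnum

LANE_COUNT = 128     # gen-stable value from the geometry table
SUBLANE_COUNT = 8    # gen-stable value from the geometry table


class VxposeMode(IntEnum):
    # ASSUMPTION: ordinals follow the order in which the modes are listed;
    # the real enum values are not stated on this page.
    B32 = 0
    COMPRESSED_B16 = 1
    COMPRESSED_B8 = 2
    SEGMENTED_B32 = 3
    SEGMENTED_B16 = 4


# ElementCount table quoted from 0xb53c830.
ELEMENT_COUNT = (1, 2, 4, 1, 2)


def chunks_per_tile() -> int:
    """LaneCount / SublaneCount, i.e. 128 / 8."""
    return LANE_COUNT // SUBLANE_COUNT


def maximum_number_of_chunks(mode: VxposeMode, n: int = 0) -> int:
    """Model of MaximumNumberOfChunks(VxposeMode, n).

    ASSUMPTION: segmented modes divide n by (sublane_count * element_count);
    this reproduces the quoted n/8 and n/16 but is inferred, not decoded.
    """
    elems = ELEMENT_COUNT[mode]
    if mode in (VxposeMode.SEGMENTED_B32, VxposeMode.SEGMENTED_B16):
        return n // (SUBLANE_COUNT * elems)
    return chunks_per_tile() // elems


if __name__ == "__main__":
    for mode in VxposeMode:
        print(mode.name, maximum_number_of_chunks(mode, n=256))
```

Because every input is gen-stable, the same function serves all six generations without a version argument.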
Compute Units — MXU / XLU / IAR
These three counts come from the TensorCore VectorIsa sub-message (TpuSequencerPartsProto field 5), rendered field-by-field from each blob's wire bytes. Only mxu_count and xlu_count vary per gen; iar_count is 2 everywhere.
| TpuVer | Gen | mxu_count (f5) | xlu_count (f6) | iar_count (f7) | MXU systolic dim | FLOPS-derivation MXU |
|---|---|---|---|---|---|---|
| 0 | Jellyfish v2 | 1 | 1 | 2 | 128×128 | 1 × 128² |
| 1 | Dragonfish v3 | 2 | 1 | 2 | 128×128 | 2 × 128² |
| 2 | Pufferfish v4 | 4 | 2 | 2 | 128×128 | 4 × 128² |
| 3 | Viperfish v5p | 4 | 3 | 2 | 128×128 | 4 × 128² |
| 4 | Ghostlite v6e | 2 | 2 | 2 | 256×256 | 2 × 256² (override) |
| 5 | 6acc60406 v7 | 2 | 2 | 2 | 256×256 | 2 × 256² (override) |
The VectorIsa wire stream per gen is an 11-byte varint block 10 80 01 (lane=128) · 18 08 (sublane=8) · 28 <m> (mxu) · 30 <x> (xlu) · 38 02 (iar); tag 0x20 (issue_latency_cycle_count, field 4) is never present (see GOTCHA below). All three counts reach the runtime Target: Target::Init (0x1d60fc20) reads vector_isa()+0xc as one QWORD packing {mxu_count, xlu_count} → Target+0x4ac/+0x4b0, and vector_isa()+0x14 = iar_count → Target+0x4a8. Target+0x4ac (mxu_count) is consumed by the SDC-checker MXU-sequence injector (mov eax,[rax+0x4ac] @ 0x144fca07); it is not a dead field.
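The 11-byte block can be checked by hand with a plain protobuf varint walk. The sketch below is illustrative only: it assumes the block contains nothing but wire-type-0 fields, takes the field names from the tags quoted above (field 2 = lane_count, 3 = sublane_count, 5/6/7 = mxu/xlu/iar), and uses the Pufferfish v4 counts (mxu = 4, xlu = 2) as the sample input. The helper names are hypothetical.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7


# Field numbers implied by the tag bytes 0x10/0x18/0x20/0x28/0x30/0x38.
FIELD_NAMES = {
    2: "lane_count",
    3: "sublane_count",
    4: "issue_latency_cycle_count",
    5: "mxu_count",
    6: "xlu_count",
    7: "iar_count",
}


def decode_vector_isa(buf: bytes) -> dict[str, int]:
    """Walk a VectorIsa block made only of varint (wire type 0) fields."""
    fields: dict[str, int] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 0x7
        if wire_type != 0:
            raise ValueError(f"unexpected wire type {wire_type} at field {field_no}")
        value, pos = read_varint(buf, pos)
        fields[FIELD_NAMES.get(field_no, f"field_{field_no}")] = value
    return fields


# Sample: the v4 block, built from the pattern with <m> = 04 and <x> = 02.
V4_BLOCK = bytes.fromhex("10 80 01 18 08 28 04 30 02 38 02")

if __name__ == "__main__":
    print(len(V4_BLOCK), decode_vector_isa(V4_BLOCK))
```

A decoder built this way reports issue_latency_cycle_count as simply missing from the result, which matches the observation that tag 0x20 never appears.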
The v6e/v7 drop to mxu_count = 2 (from 4 on v4/v5p) is more than offset by the GhostliteTarget C++ override of the contracting/non-contracting matrix dimension to 256 (vs the base 128): v6e/v7 use 2 × 256×256 systolic arrays where v4/v5p use 4 × 128×128, so peak per-cycle MACs double (2 × 256² = 131072 vs 4 × 128² = 65536) rather than halve. The 256×256 override is a C++ accessor literal, hence HIGH rather than CERTAIN — the proto lane_count stays 128.
NOTE — the FLOPS cross-validation pins both mxu_count and the TensorCore clock simultaneously for v2/v3/v4: peak BF16 = 2 × mxu_count × 128² × freq_MHz·1e6 reproduces the C++ FlopsPerSecond literal within 1 % (v2 1·700 MHz → 22.9 T vs 22.8 T; v3 2·940 MHz → 61.6 T vs 61.4 T; v4 4·1050 MHz → 137.6 T vs 137 T). The v5p/v6e/v7 peaks add a per-dtype reduced-precision ladder, and v6e/v7 also use the 256² override, so the simple 128² formula no longer applies.
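The cross-validation is easy to reproduce. The sketch below recomputes the three peaks from the quoted mxu_count and clock and compares each against the quoted FlopsPerSecond literal; the dictionary layout and function names are illustrative, and only the v2/v3/v4 rows are included because the simple formula is stated to hold only there.

```python
# (mxu_count, TensorCore clock in MHz, quoted FlopsPerSecond literal in TFLOPS)
GENS = {
    "v2": (1, 700, 22.8),
    "v3": (2, 940, 61.4),
    "v4": (4, 1050, 137.0),
}


def peak_bf16_tflops(mxu_count: int, freq_mhz: float, dim: int = 128) -> float:
    """2 FLOPs per MAC x mxu_count x dim^2 MACs per cycle x cycles per second."""
    return 2 * mxu_count * dim * dim * freq_mhz * 1e6 / 1e12


def relative_error(computed: float, literal: float) -> float:
    return abs(computed - literal) / literal


if __name__ == "__main__":
    for gen, (mxu, mhz, literal) in GENS.items():
        computed = peak_bf16_tflops(mxu, mhz)
        print(f"{gen}: {computed:.1f} T vs {literal} T "
              f"({relative_error(computed, literal):.2%} off)")
```

All three rows land under the 1 % bound, which is what lets one equation pin two unknowns: a wrong mxu_count or a wrong clock would each miss by far more than that.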
GOTCHA — issue_latency_cycle_count (VectorIsa field 4) is absent in every embedded blob (proto default 0). A reimplementer must not read MXU/VPU issue latency from chip_parts; the real per-gen issue latency lives in the cost-model Performance grids, queried through CycleTable::GetCyclesForThroughput (per-gen vtable slot +0x10). The chip_parts field exists in the schema but is never populated in this build.
Memory Tiers
Per-die (single-tensornode) capacities. VMEM, SMEM, SFLAG, and HBM are chip_parts Memory/SharedMemory sub-messages; CMEM is a SharedMemory[CMEM] present on exactly one generation. MemBanks columns are C++ ladder accessors (the bank count per memory space).
| Tier | v2 JF | v3 DF | v4 PF (std / lite) | v5p VF (std / lite) | v6e GL | v7 6acc (die / full) |
|---|---|---|---|---|---|---|
| HBM (total) | 16 GiB | 32 GiB | 32 GiB / 8 GiB | 96 GiB / 16 GiB | 31.5 GiB | 95 GiB / 190 GiB |
| HBM word | 1024 B | 1024 B | 512 B | 32 B / 512 B | 32 B | 32 B |
| HBM clock | 1400 MHz | 1800 MHz | 2400 MHz | 3600 / 3200 MHz | 6400 MHz | 7200 MHz |
| HBM bw / stack | 0.317 TB/s | 0.430 TB/s | 0.982 / 0.492 TB/s | 2.350 / 0.738 TB/s | 1.638 TB/s | 3.686 TB/s |
| VMEM / TC | 16 MiB | 16 MiB | 16 MiB | 64 / 128 MiB | 128 MiB | 64 MiB |
| VMEM word | 512 B | 512 B | 512 B | 512 B | 512 B | 512 B |
| SMEM (TC) | 16 KiB | 16 KiB | 1 MiB | 1 MiB | 1 MiB | 1 MiB |
| SFLAG (TC) | 1 KiB | 1 KiB | 2 KiB | 2 KiB | 2 KiB | 16 KiB |
| CMEM | — | — | 128 MiB | — | — | — |
| IMEM bundle_count | 65536 | 65536 | 65536 | 65536 | 65536 | 65536 |
| Reg SREG/VREG/PREG/VMREG | 32/32/15/8 | 32/32/15/8 | 32/32/15/8 | 32/64/14/16 | 32/64/14/16 | 32/64/14/16 |
| DMA granule | 1024 B | 1024 B | 512 B | 32 B / 512 B | 32 B | 32 B |
| max_single_host_dma | 8 MiB | 16 MiB | 2 GiB | 128 GiB | 64 GiB | 32 GiB |
MemBanks ladders (C++ accessors; the bank count per MemorySpace, decoded from the FATAL-vs-return structure):
| MemBanks(space) | v2/v3 (JellyfishTarget @ 0x1d48fc80) | v4 (PufferfishTarget @ 0x1d493900) | v5p (ViperfishTarget @ 0x1d4999c0) | v6e/v7 (GhostliteTarget @ 0x1d4969c0) |
|---|---|---|---|---|
| VMEM (space 3) | 8 | 16 | 32 | 32 |
| CMEM (space 4) | FATAL | 32 | FATAL | FATAL |
| SMEM (space 5) | 2 | 8 | 8 | 8 |
QUIRK — CMEM is first-class on Pufferfish (v4) only. PufferfishTarget::MemBanks is the single ladder that accepts memory space 4 — it computes idx = space − 3 and indexes a 3-entry table {16, 32, 8} (0xb5305c8) for spaces 3/4/5, so CMEM (4) returns 32 banks. Every other generation's MemBanks either tests space == 3 || space == 5 and FATALs otherwise (Jellyfish/Dragonfish, @0x1d48fc80) or excludes space 4 from its valid range. This is double-encoded: v4 is also the only generation whose chip_parts carries a SharedMemory[CMEM] (128 MiB, 1050 MHz, 2.151 TB/s) and the only one with a CMEM local-DMA bandwidth row. A reimplementer must FATAL on CMEM for every gen except v4.
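A reimplementation of the four ladders can be as small as the sketch below. It is a model of the table and callout above, not the decompiled accessors: the family keys, the FatalError type, and the use of a Python exception to stand in for the C++ FATAL are all illustrative choices.

```python
VMEM, CMEM, SMEM = 3, 4, 5   # MemorySpace ordinals quoted in the MemBanks table


class FatalError(RuntimeError):
    """Stand-in for the FATAL path in the C++ accessors (hypothetical type)."""


# Pufferfish indexes a 3-entry table with idx = space - 3.
_PUFFERFISH_TABLE = (16, 32, 8)

# The other families accept only VMEM and SMEM.
_TWO_SPACE_LADDERS = {
    "jellyfish": {VMEM: 8, SMEM: 2},     # serves v2 and v3
    "viperfish": {VMEM: 32, SMEM: 8},    # v5p
    "ghostlite": {VMEM: 32, SMEM: 8},    # serves v6e and v7
}


def mem_banks(family: str, space: int) -> int:
    """Return the bank count for a memory space, or raise where the C++ FATALs."""
    if family == "pufferfish":
        idx = space - VMEM
        if 0 <= idx < len(_PUFFERFISH_TABLE):
            return _PUFFERFISH_TABLE[idx]
        raise FatalError(f"pufferfish: unsupported memory space {space}")
    ladder = _TWO_SPACE_LADDERS[family]
    if space not in ladder:
        raise FatalError(f"{family}: unsupported memory space {space}")
    return ladder[space]


if __name__ == "__main__":
    print(mem_banks("pufferfish", CMEM))
    for family in _TWO_SPACE_LADDERS:
        try:
            mem_banks(family, CMEM)
        except FatalError as err:
            print("FATAL:", err)
```

Keeping the CMEM rejection as a hard error rather than returning 0 mirrors the source behaviour: a caller that asks for CMEM banks on any generation other than v4 has a logic bug, not a zero-bank memory.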
NOTE — the register file widens at v5p, not gradually: v2/v3/v4 TensorCores carry SREG 32 / VREG 32 / PREG 15 / VMREG 8; v5p/v6e/v7 carry SREG 32 / VREG 64 / PREG 14 / VMREG 16 (VREG doubled, VMREG doubled, PREG dropped by one). VMEM also does not grow monotonically — the single-TensorCore lite/v6e dies pack more VMEM per core (128 MiB on v5e/v6e) than the dual-core v7 (64 MiB), because VMEM is per-TensorCore and the lite dies have fewer cores to share the die area.
Cost-Model Class Trio
Each generation selects a LatencyTable subclass (edge/latency model), a CycleTable subclass (throughput model), and a Performance backend (the numeric grids). The selection is by TpuVersion ordinal through two distinct factory mechanisms; 6 versions collapse onto 5 class boundaries.
| TpuVer | Gen | LatencyTable subclass | LT size | LT vtable | CycleTable subclass | Performance backend |
|---|---|---|---|---|---|---|
| 0 | Jellyfish v2 | LatencyTableJellyfish | 0x58 | 0x21c202d0 | JfCycleTable | PerformanceJf (isa) |
| 1 | Dragonfish v3 | LatencyTableJellyfish | 0x58 | 0x21c202d0 | JfCycleTable | PerformanceDf (isa) |
| 2 | Pufferfish v4 | LatencyTablePufferfish | 0x1e0 | 0x21c20320 | PfCycleTable | PufferfishPerformance |
| 3 | Viperfish v5p | LatencyTableViperfish | 0x1e0 | 0x21c203f0 | VfCycleTable | ViperfishPerformance |
| 4 | Ghostlite v6e | LatencyTableGhostlite | 0x1e0 | 0x21c20698 | GlcCycleTable | GhostlitePerformance |
| 5 | 6acc60406 v7 | (anon-ns gf LatencyTable) | 0x1e0 | 0x21c20920 | GfcCycleTable | GhostlitePerformance (shared) |
The two factories:
- LatencyTable::Create(TpuVersion) (0x1c89fba0) is a direct version-indexed InlinedVector registry (registry @ 0x225799f8): sign-check version >= 0, bounds-check version < registry->size(), SOO buffer select, entry stride 0x20, factory pointer at [entry+0x18], then call. The registry is populated by 6 Register() calls at static-init from 5 TUs (latency_table_{jf,pf,vf,gl,gf}.cc) — jf.cc registers both v0 and v1 onto LatencyTableJellyfish.
- CycleTable::Create(Target const&) (0x1c89cc00) is the parallel-but-distinct mechanism: a FunctionRegistry (Mutex-guarded FlatHashMap, 0x225799e8) keyed on Target[+0x398] = TpuVersion. Six registrations, JfCycleTable serving both v2 and v3.
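The difference between the two mechanisms is easiest to see side by side. The sketch below models the version-indexed registry as a Python list and the keyed registry as a dict; it is a shape-level illustration only. The factories return class-name strings rather than objects, the Target dataclass models just the version field, and locking, SOO storage, and entry layout are deliberately omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Stand-in for the version-indexed InlinedVector registry.
_latency_registry: list[Optional[Callable[[], str]]] = []
# Stand-in for the hash-map-backed FunctionRegistry.
_cycle_registry: dict[int, Callable[[], str]] = {}


def register_latency(version: int, factory: Callable[[], str]) -> None:
    while len(_latency_registry) <= version:
        _latency_registry.append(None)
    _latency_registry[version] = factory


def create_latency_table(version: int) -> str:
    # Same two guards the text describes: sign check, then bounds check.
    if version < 0 or version >= len(_latency_registry):
        raise ValueError(f"no LatencyTable registered for version {version}")
    factory = _latency_registry[version]
    if factory is None:
        raise ValueError(f"empty registry slot for version {version}")
    return factory()


@dataclass
class Target:
    version: int   # models the TpuVersion stored at Target[+0x398]


def create_cycle_table(target: Target) -> str:
    return _cycle_registry[target.version]()


LATENCY = ["LatencyTableJellyfish", "LatencyTableJellyfish", "LatencyTablePufferfish",
           "LatencyTableViperfish", "LatencyTableGhostlite", "(anon-ns gf LatencyTable)"]
CYCLE = ["JfCycleTable", "JfCycleTable", "PfCycleTable",
         "VfCycleTable", "GlcCycleTable", "GfcCycleTable"]

# Six registrations per registry; versions 0 and 1 share one class in each.
for _v, (_lt, _ct) in enumerate(zip(LATENCY, CYCLE)):
    register_latency(_v, lambda name=_lt: name)
    _cycle_registry[_v] = lambda name=_ct: name

if __name__ == "__main__":
    for v in range(6):
        print(v, create_latency_table(v), create_cycle_table(Target(v)))
```

Counting the distinct names each registry can return gives five, not six, which is the collapse the section describes.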
QUIRK — the v2-vs-v3 split is not in the LatencyTable or CycleTable class — both versions instantiate LatencyTableJellyfish (slots 0 and 1) and JfCycleTable. The actual JF→DF cost delta comes from Performance::CreateTensorCore (0x1d4927e0), which compares the DeviceIdentifiers against kJellyfishIdentifiers (0xbdf3c0c) → PerformanceJf, else kDragonfishIdentifiers (0xbdf3c18) → PerformanceDf. Likewise v6e and v7 share GhostlitePerformance behind distinct LatencyTable/CycleTable wrapper classes — which is exactly why their VectorIsa is byte-identical (mxu=2/xlu=2/iar=2) and their trig estimates match. A reimplementer modeling per-gen cost must split on the silicon-architecture seams, not on the 6-way version enum.
NOTE — the v7 (Gfc) LatencyTable is an anonymous-namespace type in latency_table_gf.cc — no named typeinfo symbol, so its precise C++ class name is unrecoverable (hence HIGH, not CERTAIN). Its vtable (0x21c20920), ctor (0x1c8b9520), VectorRawHazardCycles = 7 (0x1c8b9c80), and GhostlitePerformance delegation are all confirmed; only the source class name is unknown.
Interconnect — ICI / PCIe / Local DMA Bandwidth
ICI and PCIe bandwidths are C++ Target accessor literals (each returns an immediate IEEE-754 double). The local on-chip DMA bandwidth matrix is per-codename LocalDmaBandwidth* overrides; the base Target accessors return 0 (use the default cost model).
| Metric | v2 JF | v3 DF | v4 PF | v5p VF | v6e GL |
|---|---|---|---|---|---|
| IciGigabytesPerSecond | 123.8 | 164.0 | 89.6 | 186.7 | 186.7 |
| PcieUnidirectionalBytesPerSecond | 16 GB/s | 16 GB/s | 16 GB/s | 16 GB/s | 32 GB/s |
| FLOPS BF16 (per chip) | 22.8 T | 61.4 T | 137 T | 197 T | 918 T |
| Transcendentals/s (per TC) | 717 G | 963 G | 537.75 G | 1.536 T | 1.792 T |
Local DMA bandwidth (GB/s, C++ literals; selected rows — the full matrix lives on the deep page). The base Target::LocalDmaBandwidth* returns 0; each codename subclass overrides the live cells with immediate doubles.
| src → dst | v3 DF | v4 PF | v5p VF (std) | v5e VF-lite | v6e GL |
|---|---|---|---|---|---|
| HBM → VMEM | 423 | 481 | 1198 | 822 | 1285 |
| VMEM → HBM | 423 | 1111 | 1224 | 828 | 1432 |
| HBM → SMEM | — | 34 | 55 | 56 | 55 |
| VMEM → CMEM | — | 1121 | — | — | — |
| CMEM → VMEM | — | 2339 | — | — | — |
| SPMEM → HBM | — | — | 587.4 | (587.4) | 588 |
GOTCHA — Viperfish has a runtime std/lite branch inside the bandwidth accessors. ViperfishTarget::LocalDmaBandwidthHbmToVmem (0x1d49a380) loads the variant byte, checks variant != 4 → returns the std value 0x4092B80000000000 (1198.0), then compares the variant string dword against 0x6574696C (the four bytes of "lite" loaded little-endian) and returns 0x4089B00000000000 (822.0) for the lite die. So one C++ method serves both v5p (std, returns 1198) and v5e (viperfish_lite, returns 822); a reimplementer cannot treat v5p and v5e as separate Target subclasses — they are one class with a string-compare fork. Jellyfish (v2) overrides none of these, so all its LocalDmaBandwidth* return base 0.
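The two immediates and the string fork can be checked with the standard struct module. The sketch below decodes the two IEEE-754 bit patterns and models the branch; it is an approximation under stated assumptions. Treating the "variant byte" as the variant string's length, and returning the std value for a 4-character variant that is not "lite", are both assumptions, and the dword for "lite" is computed from its ASCII bytes rather than copied from the binary.

```python
import struct


def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit immediate as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


STD_GBPS = bits_to_double(0x4092B80000000000)    # std (v5p) immediate
LITE_GBPS = bits_to_double(0x4089B00000000000)   # lite (v5e) immediate

# The dword a 4-byte little-endian load of the ASCII string "lite" produces.
LITE_DWORD = struct.unpack("<I", b"lite")[0]


def hbm_to_vmem_gbps(variant: str) -> float:
    """Model of the std/lite fork in LocalDmaBandwidthHbmToVmem.

    ASSUMPTION: the length test (!= 4) is applied to the variant string,
    and a 4-character variant other than "lite" falls back to the std value.
    """
    if len(variant) != 4:
        return STD_GBPS
    if struct.unpack("<I", variant.encode("ascii"))[0] == LITE_DWORD:
        return LITE_GBPS
    return STD_GBPS


if __name__ == "__main__":
    print(STD_GBPS, LITE_GBPS, hex(LITE_DWORD))
    print(hbm_to_vmem_gbps("default"), hbm_to_vmem_gbps("lite"))
```

Modelling the fork as one function with a variant argument, rather than as two classes, keeps a reimplementation aligned with the single-class structure of the source.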
Accelerator Cores — BarnaCore ↔ SparseCore Pivot
The embedding/dedup accelerator changes type at v5p. v2/v3/v4 carry BARNA_CORE cores (the pre-SparseCore engine); v5p/v6e/v7 carry SPARSE_CORE cores. The lite dies (pufferfish_lite, viperfish_lite) ship neither — they are TensorCore-only single-core dies.
| TpuVer | Gen | Accel core type | Count / chip (std) | Sequencers | TEC VectorIsa lane×sub |
|---|---|---|---|---|---|
| 0 | Jellyfish v2 | BARNA_CORE | 2 | BC_SEQ + 16× BC_ADDR | 1×8 (BC_ADDR) |
| 1 | Dragonfish v3 | BARNA_CORE | 2 | BC_SEQ + 16× BC_ADDR | 1×8 (BC_ADDR) |
| 2 | Pufferfish v4 | BARNA_CORE | 4 | BC_SEQ + 16× BC_ADDR | 1×8 (BC_ADDR) |
| 3 | Viperfish v5p | SPARSE_CORE | 4 | SC_SEQ + 16× SC_TAC + 16× SC_TEC | 8×1 (SC_TEC) |
| 4 | Ghostlite v6e | SPARSE_CORE | 2 | SC_SEQ + 16× SC_TAC + 16× SC_TEC | 8×1 (SC_TEC) |
| 5 | 6acc60406 v7 | SPARSE_CORE | 2 (die) / 4 (full) | SC_SEQ + 16× SC_TEC (no TAC) | 16×1 (SC_TEC) |
The runtime tpu::TpuSequencerType enum is SC_SEQ = 4, SC_TILE_ACCESS_CORE (TAC) = 5, SC_TILE_EXECUTE_CORE (TEC) = 6 (and on the BarnaCore side, BC_SEQ = 2, BC_ADDR = 3) — the TpuSequencerTypeToString-table numbering used to size per-engine resource arrays. The codec template-parameter numbering used by the codec/ISA pages is off by one (the codec form omits the INVALID slot): {SCS=3, TAC=4, TEC=5} (and BARNA=1/BARNA_ADDR=2). Cross the two numberings only with the +1; see getSequencerType. The non-TensorCore sequencers carry no matrix unit, so their VectorIsa is lane/sublane only — the SC_TEC vector width is the transposed analogue of the BC_ADDR address-walker geometry.
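Since the two numberings differ only by the offset, a pair of checked converters is enough to keep them apart. The sketch below encodes the values quoted above; the dictionary names and the pairing of codec names to runtime names are illustrative, and treating the INVALID slot as runtime ordinal 0 is an assumption drawn from the off-by-one description rather than a decoded fact.

```python
# Runtime tpu::TpuSequencerType ordinals quoted above.
RUNTIME = {
    "BC_SEQ": 2,
    "BC_ADDR": 3,
    "SC_SEQ": 4,
    "SC_TILE_ACCESS_CORE": 5,
    "SC_TILE_EXECUTE_CORE": 6,
}

# Codec template-parameter ordinals quoted above.
CODEC = {
    "BARNA": 1,
    "BARNA_ADDR": 2,
    "SCS": 3,
    "TAC": 4,
    "TEC": 5,
}

# Illustrative pairing of the two vocabularies.
CODEC_TO_RUNTIME_NAME = {
    "BARNA": "BC_SEQ",
    "BARNA_ADDR": "BC_ADDR",
    "SCS": "SC_SEQ",
    "TAC": "SC_TILE_ACCESS_CORE",
    "TEC": "SC_TILE_EXECUTE_CORE",
}


def codec_to_runtime(codec_ordinal: int) -> int:
    """Codec numbering omits the INVALID slot, so runtime = codec + 1."""
    if codec_ordinal < 0:
        raise ValueError("codec ordinal must be non-negative")
    return codec_ordinal + 1


def runtime_to_codec(runtime_ordinal: int) -> int:
    # ASSUMPTION: runtime ordinal 0 is the INVALID slot and has no codec form.
    if runtime_ordinal <= 0:
        raise ValueError("runtime ordinal 0 (INVALID) has no codec equivalent")
    return runtime_ordinal - 1


if __name__ == "__main__":
    for codec_name, runtime_name in CODEC_TO_RUNTIME_NAME.items():
        print(codec_name, CODEC[codec_name], "->", runtime_name, RUNTIME[runtime_name])
```

Routing every crossing through one of these two functions makes an accidental raw comparison between the axes easy to spot in review.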
QUIRK — Trillium (v6e / ghostlite) ships SCS + TAC + TEC, but the newest generation drops the TAC sequencer: the v7 (6acc60406) chip_parts carries SC_SEQ + 16× SC_TEC and no SC_TILE_ACCESS_CORE_SEQ. The SC_TEC vector width also doubles from 8 lanes (v5p/v6e) to 16 lanes (v7) — the tile-execute width grew while the tile-access sequencer was folded away. A reimplementer enumerating SparseCore sequencer types must not assume the v5p/v6e {SEQ, TAC, TEC} triple holds for v7.
NOTE — (missing-data cells) the BarnaCore detailed sub-message geometry (BarnaCoreFsm freg/smem offsets, the dedup/address-map sizes) is not walked here; the table reports the per-BarnaCore sequencer/memory presence and the TEC/BC vector geometry, but the BarnaCoreCore sub-message internals remain undecoded. Likewise the exact MatmulDataFormat enum → dtype-name binding for the v5p/v6e/v7 reduced-precision FLOPS ladder (1×/2×/4×) is inferred from the doubling pattern, not byte-pinned per index. Cells marked "—" are genuinely absent for that generation (e.g. SparseCore rows for v2–v4, CMEM rows for every gen but v4), not unknown.
Version Ordinals — Three Independent Axes
A generation wears three integer ordinals on three axes that do not share a numbering. The matrix above is keyed on the internal tpu::TpuVersion; this table binds it to the other two so the page never indexes one axis with another's ordinal.
| Generation | tpu::TpuVersion (internal) | TpuVersionProto (wire) | xprof::DeviceType (profiler) |
|---|---|---|---|
| Jellyfish v2 | 0 | 1 | 3 |
| Dragonfish v3 | 1 | 2 | 5 |
| Pufferfish v4 | 2 | 3 | 7 |
| Viperfish v5p | 3 | 4 | 10 |
| Ghostlite v6e | 4 | 5 | 13 |
| 6acc60406 v7 | 5 | 6 | 12 |
TpuVersionToString (0x20b3a480) FATALs on version >= 6 and indexes the 6-pointer table off_22011BF0, whose relocations target the codename literals jellyfish … 6acc60406. The proto ordinal is always internal + 1 (the blob's field-1 version). DeviceType is a sparse profiler axis (DeviceTypeFromDeviceIdentifiers @ 0xf6993a0) that includes lite variants as their own ordinals (Puffylite = 8, Viperlite = 11) — it is not arithmetically derivable from TpuVersion.
GOTCHA — the DeviceType ordinals are out of TpuVersion order at the top end: Ghostlite (v6e) is DeviceType 13 but 6acc60406 (v7) is DeviceType 12. A reimplementer who assumes DeviceType increases with generation will swap v6e and v7. Always translate through this table, never by adding a constant to one ordinal.
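One way to enforce table-only translation is to make the table the sole source of the profiler ordinal. The sketch below encodes the six rows above; the function names are illustrative, and the lite-variant DeviceType values (8 and 11) are left out because they have no TpuVersion of their own.

```python
# codename: (tpu::TpuVersion, TpuVersionProto, xprof::DeviceType)
ORDINALS = {
    "jellyfish": (0, 1, 3),
    "dragonfish": (1, 2, 5),
    "pufferfish": (2, 3, 7),
    "viperfish": (3, 4, 10),
    "ghostlite": (4, 5, 13),
    "6acc60406": (5, 6, 12),
}

_BY_INTERNAL = {row[0]: (name, row) for name, row in ORDINALS.items()}


def codename(internal: int) -> str:
    """Models the TpuVersionToString bounds check: version >= 6 is fatal."""
    if internal not in _BY_INTERNAL:
        raise ValueError(f"unknown tpu::TpuVersion {internal}")
    return _BY_INTERNAL[internal][0]


def proto_ordinal(internal: int) -> int:
    """The wire ordinal is always internal + 1, so arithmetic is safe here."""
    codename(internal)            # reuse the range check
    return internal + 1


def device_type(internal: int) -> int:
    """Profiler ordinal: table lookup only, never arithmetic."""
    codename(internal)
    return _BY_INTERNAL[internal][1][2]


if __name__ == "__main__":
    for v in range(6):
        print(v, codename(v), proto_ordinal(v), device_type(v))
```

The lookup returns 13 for internal version 4 and 12 for internal version 5, so any code that sorts by DeviceType to recover generation order will get the last two rows backwards.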
Cross-References
Each column of the master matrix has a deep page that owns its derivation:
- Codename Cheat-Sheet — the full three-axis version map (codec / fish / TpuVersion / DeviceType / proto / PCI DID), the source of the ordinal table here
- Chip-Parts .binarypb — the embedded blob catalog, the DefaultsForVersion embed:// load path, the TpuChipPartsProto schema
- Per-Codename HW Constants — the full per-gen chip_parts decode (HBM/VMEM/SMEM/SFLAG/MXU/register file) the memory columns summarize
- TPU-Topology Struct — the lane/sublane cache and Target[+0x3b8] geometry chain
- TPU-Version / Codename Matrix — the TpuVersionToString table and ordinal axes
- Bundle Model Overview — the per-gen bundle byte sizes (JF 41 B, PF 51 B, VF 64 B, GL, GF)
- IARs per TensorCore — the iar_count = 2 derivation and the VectorIsa → Target+0x4a8 store
- CycleTable Family and Performance Overview — the per-gen CycleTable / Performance subclass dispatch
- Local DMA Bandwidth — the full per-codename LocalDmaBandwidth* matrix and the Viperfish variant split
- Matmul-Mode Modifiers — the MatmulDataFormat set the reduced-precision FLOPS ladder prices
- LLO Opcode Table — the per-gen opcode roster the bundle/sequencer geometry encodes
- Memory-Space Master Table — the MemorySpace enum the MemBanks ladders and per-tier capacities key on