SparseCore Overview
Every codename, engine name, and per-generation presence claim on this page was read from libtpu.so in the libtpu-0.0.40-cp314 wheel (build-id 89edbbe81c5b328a958fe628a9f2207d) — from the demangled C++ symbol table and the embedded proto-descriptor strings. Other versions differ.
Abstract
SparseCore (SC) is the TPU's on-die embedding / sparse-gather coprocessor. It sits next to the matrix-heavy TensorCore (TC) on the same die, shares the chip's HBM physical interface, and exists to do the one thing the TensorCore's systolic MXU cannot: stream indirect (index-driven) reads and writes against gigabyte-scale tables in HBM, including atomic floating-point scatter-add. The canonical workload is the embedding lookup of a recommender / DLRM-class model — gather millions of variable-length rows from an embedding table, reduce each sample's rows to one dense vector, hand the dense result to the TensorCore for the matmul, and on the backward pass scatter the gradient back into HBM by atomic add. The TensorCore's HBM controller is tuned for long contiguous bursts; SparseCore's is tuned for many short random transactions. That access-pattern split is the entire architectural justification for two engines instead of one.
SparseCore is not a TensorCore extension. It is a separate ISA family with its own opcode sets, its own register files, its own VLIW bundle formats, and its own per-generation encoder/decoder codecs. Inside SparseCore the compute fabric is itself split into three sub-engine classes, each driven by an independent VLIW bundle stream and selected by a tpu::TpuSequencerType enum value:
- SCS — SparseCore Scalar sequencer. The control / addressing CPU. 32-byte bundle. TpuSequencerType = 3.
- TAC — Tile Access Core. A specialised address-handler + DMA issuer; no vector compute. 64-byte bundle. TpuSequencerType = 4.
- TEC — Tile Execute Core. The wide vector engine that runs the per-tile reductions. 64-byte bundle. TpuSequencerType = 5.
The most consequential structural fact this binary records is that the roster is not constant across silicon. SparseCore appears first on the Viperfish generation as a three-engine SCS+TAC+TEC split; Ghostlite keeps all three; 6acc60406 drops TAC entirely, folding tile-fetch issuance into the SCS+TEC pair. There is also a legacy monolithic predecessor, SCv0, which survives in this build only as two profiler-label constants — no SCv0 codec ships.
This page is navigational. It fixes what SparseCore is, names the three engine classes and their binary-evidenced per-generation presence, sketches the host→HBM→SparseCore→TensorCore data path, and routes to the page that owns each piece. The deep mechanics — opcode rosters, slot bit-layouts, the embedding scan datapath, the back-end pass pipeline — live on the sibling pages cross-referenced below.
For reimplementation, the contract is:
- Three sub-engine classes, three independent bundle streams. SCS, TAC, and TEC are distinct codecs (SparseCore{Scs,Tac,Tec}CodecBase per gen), each selected by its TpuSequencerType non-type template parameter (3 / 4 / 5). A reimplementer must treat them as three separate VLIW machines that coordinate through sync flags and shared memory, not one machine with three modes.
- Per-gen presence is part of the contract. Viperfish and Ghostlite ship all three engines; 6acc60406 ships SCS + TEC only. Emitting a TAC program for a 6acc60406 target is a codec error — the gfc namespace contains zero SparseCoreTac* symbols.
- The data path is HBM-mediated and indirect. Embedding tables live in HBM; SC gathers tile rows into a per-tile SRAM (TILE_SPMEM), the TEC reduces them, and the dense result is DMA'd to the TensorCore's VMEM (or scattered back to HBM). SC's STREAM_OPCODE_SCATTER_FLOAT_ADD runs atomic FP-add directly in HBM with no TC round-trip — the primitive that justifies SC as a separate engine.
- SCv0 is enum-only. The legacy single-personality SparseCore is preserved only as TpuSequencerTypeProto values 7 / 8 (the proto enum) and two profiler labels; it has no C++ TpuSequencerType value (TpuSequencerTypeFromProto rejects it) and no encoder, decoder, codec, or descriptor ships for it.
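The first two contract points can be sketched as a small lookup. This is a minimal Python illustration, not code from the binary: `SequencerType`, `ENGINES_BY_FAMILY`, `CodecError`, and `select_codec` are hypothetical names, and only the enum values 3/4/5, the bundle widths, and the per-family engine presence are taken from this page.

```python
from enum import IntEnum


class SequencerType(IntEnum):
    """Mirrors the C++ tpu::TpuSequencerType values quoted on this page."""
    SCS = 3
    TAC = 4
    TEC = 5


# Bundle width in bytes per engine, as listed on this page.
BUNDLE_BYTES = {SequencerType.SCS: 32, SequencerType.TAC: 64, SequencerType.TEC: 64}

# Engines present per codec family namespace (hypothetical table name).
ENGINES_BY_FAMILY = {
    "vxc.vfc": {SequencerType.SCS, SequencerType.TAC, SequencerType.TEC},  # Viperfish
    "gxc.glc": {SequencerType.SCS, SequencerType.TAC, SequencerType.TEC},  # Ghostlite
    "gxc.gfc": {SequencerType.SCS, SequencerType.TEC},                     # 6acc60406
}


class CodecError(Exception):
    """Raised when a target family has no codec for the requested engine."""


def select_codec(family: str, engine: SequencerType) -> tuple[str, int]:
    """Return (codec class name, bundle bytes) or raise CodecError."""
    engines = ENGINES_BY_FAMILY.get(family)
    if engines is None:
        raise CodecError(f"{family}: no SparseCore codecs for this family")
    if engine not in engines:
        raise CodecError(f"{family}: engine {engine.name} is not present")
    # "SCS".capitalize() -> "Scs", matching the SparseCore{Scs,Tac,Tec}CodecBase pattern.
    return f"SparseCore{engine.name.capitalize()}CodecBase", BUNDLE_BYTES[engine]
```

Under these assumptions, `select_codec("vxc.vfc", SequencerType.TAC)` resolves to a 64-byte TAC codec, while the same request against `"gxc.gfc"` raises `CodecError`, which is the behaviour the contract asks a reimplementation to preserve.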
| Property | Value |
|---|---|
| What it is | On-die embedding / sparse-gather coprocessor, co-located with the TensorCore, sharing HBM |
| Engine classes | SCS (scalar sequencer) · TAC (tile-access / DMA) · TEC (vector compute) |
| Sequencer enum | C++ tpu::TpuSequencerType — SCS=3, TAC=4, TEC=5 (same as the codec template). SCv0 has no C++ value — see getSequencerType for the proto-enum +1 peer |
| Codec class roots | SparseCore{Scs,Tac,Tec}CodecBase (per-gen, in asic_sw::deepsea::<fam>::isa) |
| Gens with SC | Viperfish (all 3) · Ghostlite (all 3) · 6acc60406 (SCS+TEC, no TAC) |
| Gens without SC | Jellyfish / Dragonfish / Pufferfish (BarnaCore era — see BarnaCore Overview) |
| Memory | HBM (chip-wide) · SPMEM (per-chip SC SRAM) · TILE_SPMEM (per-tile) · TIMEM (per-tile instr) |
| Confidence | CONFIRMED (symbol-table / descriptor-anchored) unless a row or callout says otherwise |
What SparseCore Is — and Why It Is Separate
A TensorCore is a statically-scheduled VLIW machine built around a systolic matrix unit; it is at its best when data arrives as dense tiles streamed contiguously out of HBM. Embedding-heavy models break that assumption: the dominant cost is not matmul FLOPs but pointer-chasing — reading a handful of rows out of a table with millions of rows, where which rows are touched is data-dependent and changes every minibatch. Forcing that traffic through the TensorCore's contiguous-burst HBM controller wastes the burst, and the systolic array's accumulation path has no way to do an atomic floating-point add into an arbitrary HBM address (the operation embedding-gradient accumulation needs on every backward step).
SparseCore exists to absorb exactly that traffic. It is a peer engine on the same die, with read/write access to the same HBM, but reaching HBM through a different MMU translation surface and prefetcher policy tuned for many short random transactions rather than long bursts. Its instruction set is built around gather, scatter, scan, sort, uniquify, and pack/unpack — the operations that turn a stream of variable-length index lists into dense embedding vectors. The SparseCore and TensorCore run in parallel and hand off through shared HBM, through programmed VMEM↔SPMEM DMA, and through a shared sync-flag pool.
NOTE — SparseCore is its own ISA family, not a TensorCore mode. The binary carries fully independent SparseCore{Scs,Tac,Tec}CodecBase template hierarchies per generation, scoped under asic_sw::deepsea::{vxc.vfc, gxc.glc, gxc.gfc}::isa. These have their own bundle widths, opcode enums, and operand-field protos, distinct from the TensorCore's LloOpcodeProto / per-gen bundle encoders documented in ISA Overview. A reimplementer who models SparseCore as extra TensorCore slots will not produce encodable SC programs.
The Three Engine Classes
Inside SparseCore the compute fabric is partitioned into three sub-engine classes, each a separate VLIW machine with its own bundle stream, its own program type, and its own codec. They are distinguished in the binary by the tpu::TpuSequencerType enum that the encoder template carries as a non-type parameter, and by the engine-specific name strings.
The value columns split the two numberings: C++ is the tpu::TpuSequencerType enum the codec templates and TpuSequencerTypeToString both use (the wiki-canonical numbering); proto is the TpuSequencerTypeProto peer, which reserves an INVALID=0 slot and so runs one higher (see getSequencerType for the FromProto bridge).
| C++ | proto | TpuSequencerTypeProto literal | Short | Bundle | Role |
|---|---|---|---|---|---|
| 3 | 4 | TPU_SEQUENCER_TYPE_SPARSE_CORE_SEQUENCER | SCS | 32 B | Scalar control / addressing sequencer |
| 4 | 5 | TPU_SEQUENCER_TYPE_SPARSE_CORE_TILE_ACCESS_CORE_SEQUENCER | TAC | 64 B | Address generation + tile-fetch DMA issuer |
| 5 | 6 | TPU_SEQUENCER_TYPE_SPARSE_CORE_TILE_EXECUTE_CORE_SEQUENCER | TEC | 64 B | Wide vector compute over loaded tiles |
| — | 7 | TPU_SEQUENCER_TYPE_SPARSE_CORE_V0_SEQUENCER | SCv0 | (legacy) | Monolithic predecessor — proto-only, no C++ value, no codec |
| — | 8 | TPU_SEQUENCER_TYPE_SPARSE_CORE_V0_ADDRESS_HANDLER | SCv0-AH | (legacy) | SCv0 address handler — proto-only, no C++ value, no codec |
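The two numberings in the table differ by a constant offset for the live engines and have no mapping at all for SCv0. A minimal Python sketch of that bridge follows; `sc_sequencer_from_proto` is a hypothetical helper that models only the SparseCore rows above and does not claim to reproduce the real TpuSequencerTypeFromProto signature or its handling of non-SparseCore values.

```python
# C++ tpu::TpuSequencerType values for the SparseCore engines (from the table above).
CPP_SCS, CPP_TAC, CPP_TEC = 3, 4, 5

# Proto values that name the legacy SCv0 personalities; they have no C++ peer.
PROTO_SCV0_VALUES = {7, 8}


def sc_sequencer_from_proto(proto_value: int) -> int:
    """Map a TpuSequencerTypeProto value to the C++ value (SparseCore rows only)."""
    if proto_value in PROTO_SCV0_VALUES:
        raise ValueError(f"proto value {proto_value} is SCv0: no C++ sequencer type")
    cpp_value = proto_value - 1  # proto reserves INVALID=0, so it runs one higher
    if cpp_value not in (CPP_SCS, CPP_TAC, CPP_TEC):
        raise ValueError(f"proto value {proto_value} is not a SparseCore sequencer")
    return cpp_value
```

With this sketch, proto values 4, 5, and 6 map to 3, 4, and 5, and proto values 7 and 8 are rejected instead of being silently mapped to 6 and 7.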
SCS — the scalar sequencer. SCS is the control plane: it runs the program counter, computes addresses, manages circular buffers, reads chip registers (GTC clocks, tile id, sparse-core id, DMA credits), and issues the sync-flag and atomic operations that coordinate the SC tiles with each other and with the TensorCore. Its 32-byte bundle is the narrowest of the three. Its instruction set is dominated by a scalar ALU slot (integer + F32 arithmetic, compares, shifts, branches) and a "scalar misc" slot that holds the atomic-and-sync family. See SCS (Scalar) Engine and the Scalar Opcode Enum.
TAC — the tile-access core. TAC is a specialised address-handler + DMA issuer that sits between SCS and TEC. It has its own sequencer thread (branches, halt, delay), integer address arithmetic, compares, SMEM load/store for address-buffer staging, and a stream slot — but no FPU, no vector ALU, no vector load/store. Its sole job is to take a stream of indices and emit the tile-fetch DMAs that pull embedding rows from HBM/SPMEM into TILE_SPMEM. Its bundle is 64 bytes despite carrying no vector compute, because the width goes to many scalar-style address ops. TAC exists only on Viperfish and Ghostlite. See TAC Engine.
TEC — the tile-execute core. TEC is the wide compute engine. A single 64-byte TEC bundle can issue three vector ALU operations, a vector load, a vector store, a "vector extended" op (scan / sort / uniquify), a "vector result" pop, two scalar ALU slots, immediates, a DMA, and a stream slot — all in one cycle. This is where the per-row reductions, the precision pack/unpack (including FP8 E4m3/E5m2 and sub-byte formats on the newest gen), and the embedding-optimizer math live. The vector ALU slot is the widest opcode set in all of SparseCore. See TEC (Vector) Engine, the Vector Opcode Enum, and the Stream Gather/Scatter datapath.
GOTCHA — TAC reuses SCS's proto-enum opcode space but constrains it. The TAC sub-bundle does not declare a separate SparseCoreTacScalarAlu proto-enum namespace; it reuses the generic SparseCoreScalarAlu / SparseCoreScalarMisc enums and gates legality in the SparseCoreTacScalarAlu{0,1}Encoder classes during emit. A reimplementer cannot enumerate "TAC's opcodes" from a dedicated enum — TAC legality is the SCS scalar set minus the FPU and vector ops, enforced at encode time.
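One way to model encode-time gating is to keep a single shared opcode list and filter it in the TAC encoder. The Python sketch below assumes that shape; the opcode names (`INT_ADD`, `F32_ADD`, and so on) and the `uses_fpu` / `is_vector` flags are invented for illustration and are not the real enum members, which live on the opcode pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalarOp:
    name: str        # hypothetical opcode name, for illustration only
    uses_fpu: bool   # True for floating-point arithmetic
    is_vector: bool  # True for vector ALU / vector load-store ops


# A made-up slice of a shared scalar opcode enum.
SHARED_SCALAR_OPS = [
    ScalarOp("INT_ADD", uses_fpu=False, is_vector=False),
    ScalarOp("INT_COMPARE", uses_fpu=False, is_vector=False),
    ScalarOp("F32_ADD", uses_fpu=True, is_vector=False),
    ScalarOp("VECTOR_LOAD", uses_fpu=False, is_vector=True),
]


def tac_legal(op: ScalarOp) -> bool:
    """TAC accepts the shared scalar set minus FPU and vector ops."""
    return not (op.uses_fpu or op.is_vector)


def encode_for_tac(op: ScalarOp) -> str:
    """Stand-in for an encoder: reject illegal ops at emit time."""
    if not tac_legal(op):
        raise ValueError(f"{op.name} is not encodable on TAC")
    return f"tac:{op.name}"
```

The design point is that there is no separate TAC opcode table to consult: the legal set is derived from the shared one, so a reimplementation has to carry the filter itself.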
Per-Generation Presence
SparseCore is a v5+ feature. The generations that ship SC silicon are Viperfish, Ghostlite, and 6acc60406; the earlier BarnaCore-era generations (Jellyfish / Dragonfish / Pufferfish) carry no SparseCore. Which sub-engines are present is itself per-generation, and is the single most important fact a reimplementer must encode.
The discriminator is the codec class family. SparseCore codecs are scoped under the per-generation asic_sw::deepsea family namespace — vxc.vfc (Viperfish), gxc.glc (Ghostlite), gxc.gfc (6acc60406) — and the presence or absence of a SparseCore<Engine>CodecBase class in that namespace is a direct binary readout of whether the engine ships.
| Gen | Codename | Family ns | SCS | TAC | TEC | SCv0 | Notes |
|---|---|---|---|---|---|---|---|
| TPU v2 | Jellyfish | jxc | – | – | – | – | No SparseCore (BarnaCore era) |
| TPU v3 | Dragonfish | jxc | – | – | – | – | No SparseCore (shares Jellyfish codec) |
| TPU v4 | Pufferfish | pxc.pfc | – | – | – | – | No SparseCore (BarnaCore VLIW) |
| TPU v5p | Viperfish | vxc.vfc | Y | Y | Y | – | First gen with the three-engine split |
| TPU v6e | Ghostlite | gxc.glc | Y | Y | Y | – | Full SCS+TAC+TEC; widened vector set |
| TPU7x | 6acc60406 | gxc.gfc | Y | – | Y | – | TAC removed; TEC widened (FP8/FP4 pack/unpack) |
NOTE — codename-to-marketing-name mapping. SparseCore appears in the binary under the silicon codenames Viperfish, Ghostlite, and 6acc60406 — never under any marketing string. The marketing-name column above follows the SparseCore-specific sibling SparseCore Target Descriptor (nSparseCoreTarget = v5p; GhostLiteSparseCoreTarget = v6e Ghostlite + v7x 6acc60406); the binary itself keys everything on the codename family namespace and the TpuVersion ordinal carried inside the SC codec templates, not on the marketing name. (Some sibling parts label Viperfish v5/v5e; the codename + family namespace is the only binary-authoritative discriminator. For the record, Google's Trillium marketing name is TPU v6e, i.e. codename Ghostlite — not 6acc60406, which is TPU7x.) Treat the codename + family namespace as authoritative; the marketing column is MEDIUM confidence.
Decompile cross-check — engine roster and TAC absence
The roster and 6acc60406's missing TAC were confirmed directly against the decompiled function set. Counting decompiled SparseCore<Engine>* functions per family namespace:
| Family ns | Gen | SparseCoreScs* fns | SparseCoreTac* fns | SparseCoreTec* fns |
|---|---|---|---|---|
| vxc.vfc | Viperfish | 509 | 770 | 4218 |
| gxc.glc | Ghostlite | 498 | 762 | 5961 |
| gxc.gfc | 6acc60406 | 493 | 0 | 6526 |
The gfc (6acc60406) namespace has zero SparseCoreTac* decompiled functions, against ~760–770 for each of Viperfish and Ghostlite — TAC is entirely absent from 6acc60406 silicon. (The single gfc-tagged SparseCoreTac symbol that survives, llvm::TPUGfcSubtarget::isSparseCoreTac, is a subtarget query that returns false — not a codec leaf.) The corresponding SparseCoreTacCodecBase class exists only under vxc.vfc and gxc.glc, never gxc.gfc. Conversely the TEC function count grows gen over gen (4218 → 5961 → 6526), consistent with TAC's tile-fetch role being folded into the SCS+TEC pair on 6acc60406. SCS is present and roughly constant in all three.
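The per-namespace count can be reproduced from any list of demangled symbols. The Python sketch below assumes the family token (`vfc`, `glc`, `gfc`) appears as a namespace component directly before `::isa::SparseCore<Engine>`; that spelling and the sample symbols are assumptions for illustration, not strings copied from the binary.

```python
import re
from collections import Counter

# Assumed shape of a demangled symbol: "...<family>::isa::SparseCore<Engine>...".
SYMBOL_RE = re.compile(r"\b(vfc|glc|gfc)::isa::SparseCore(Scs|Tac|Tec)")

# Hypothetical sample input; real input would be the demangled symbol table.
SAMPLE_SYMBOLS = [
    "asic_sw::deepsea::vxc::vfc::isa::SparseCoreTacCodecBase<...>::Encode",
    "asic_sw::deepsea::gxc::gfc::isa::SparseCoreTecVectorExtendedAddScanF32V0XField::Get",
    "asic_sw::deepsea::gxc::gfc::isa::SparseCoreScsCodecBase<...>::Encode",
    "llvm::TPUGfcSubtarget::isSparseCoreTac",  # subtarget query, not a codec symbol
]


def count_engine_symbols(symbols):
    """Count SparseCore<Engine>* symbols per (family, engine) pair."""
    counts = Counter()
    for sym in symbols:
        m = SYMBOL_RE.search(sym)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


def engines_present(counts, family):
    """An engine counts as shipped if at least one codec symbol exists for it."""
    return sorted(e for (f, e), n in counts.items() if f == family and n > 0)
```

Anchoring the pattern on the `::isa::` namespace is what keeps the `isSparseCoreTac` subtarget query from being miscounted as TAC codec evidence for gfc.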
CONFIRMED — 6acc60406 collapses the 3-engine pipeline to 2. On Viperfish/Ghostlite the path is SCS → TAC (tile-fetch DMA) → TEC (compute). On 6acc60406 TAC is gone and TEC absorbs the address-generation + DMA-issue duties through its own stream slot, leaving a single SCS↔TEC boundary. This is the "2-sequencer (SCS+TEC) model." See TAC Engine for the absorbed-role detail.
SCv0 — the legacy monolithic predecessor
The proto enum TpuSequencerTypeProto reserves two values (7, 8) for a single-personality SparseCore that predates the SCS/TAC/TEC split (the C++ TpuSequencerType enum has no SCv0 value — TpuSequencerTypeFromProto rejects proto 7/8), plus a TPU_CORE_TYPE_SPARSE_CORE_V0 core-type. In this build SCv0 survives only as two profiler-label constants — kHloSparseCoreV0Infeed and kHloSparseCoreV0Outfeed, referenced from the XLA HLO profiler's DisambiguateInfeedOutfeed. No SCv0 encoder, decoder, codec metadata, scheduling table, or descriptor proto ships. A user proto naming SCv0 will fail at codec lookup. The enum values are retained for schema back-compat only.
GOTCHA — the V0 suffix on 6acc60406 TEC field symbols is not SCv0. Decompiled symbols such as gfc::isa::SparseCoreTecVectorExtendedAddScanF32V0XField carry a V0 that is an operand-field version tag on a live 6acc60406 TEC vector-extended op — unrelated to the SCv0 core type. The only genuine SCv0 references are the two profiler labels above.
The Data Path
SparseCore's reason to exist is the embedding lookup. The high-level flow moves host-resident embedding tables into HBM, gathers tile rows on demand, reduces them on TEC, and hands the dense result to the TensorCore — with the gradient flowing back on the reverse path.
    HOST                       HBM (shared TC/SC)              SPARSECORE                      TENSORCORE
    ────                       ─────────────────               ──────────                      ──────────
    embedding tables ──load──▶ embedding rows                                                  matmul / MLP
                               (GB-scale, indirect)
                                     │
    index stream ────────────────────┤  STREAM_OPCODE_GATHER   SCS schedules
                                     ▼  ─────────────────────▶ tile-fetch program
                               [HBM row r_i]                           │
                                                               TAC* / TEC stream slot
                                                               issues tile-fetch DMA
                                                                       ▼
                                                               TILE_SPMEM ◀── row tiles
                                                                       │
                                                               TEC vector reduce
                                                               (sum / max / weighted)
                                                                       │
                               DMA VMEM ◀──────────────────────────────┘ dense vector
                               (DMA_CORE_ID_TENSOR_CORE_0,
                                DMA_MEMORY_ID_VMEM) ────────────────sync flag────────────────▶ consume
                                                                                                  │
                               ── backward pass ──                                                ▼
                                                               TEC loads grad slice ◀───────── DMA SPMEM
                                                               applies optimizer math
                                                                       │
                               [HBM row r_i] ◀── STREAM_OPCODE_SCATTER_FLOAT_ADD (atomic, in-HBM)
The forward pass: the TensorCore requests the next minibatch's embeddings; SCS receives and schedules a tile-fetch program; on Viperfish/Ghostlite TAC issues the stream-gather DMA from the HBM table into TILE_SPMEM, while on 6acc60406 the TEC's own stream slot does it; TEC vector-loads the tiles, runs the per-sample reduction, and DMAs the dense result into the TensorCore's VMEM, then raises a sync flag the TC waits on. The backward pass runs in reverse: the TC writes gradient slices to SPMEM, TEC loads them and applies the embedding-optimizer math (the newest gen's TEC supports stochastic round-to-bf16 and packed FP8 formats specifically for optimizer-state quantisation), and SC's stream engine issues STREAM_OPCODE_SCATTER_FLOAT_ADD to the embedding-row addresses in HBM — an atomic floating-point add that lands directly in HBM with no TensorCore round-trip.
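Functionally, the forward pass is a gather followed by a per-sample reduction. The pure-Python sketch below shows only that arithmetic, using a sum combiner; `embedding_forward` is a hypothetical name, plain lists stand in for the HBM table and TILE_SPMEM, and nothing here models the DMA, sync-flag, or tiling behaviour described above.

```python
def embedding_forward(table, sample_indices):
    """Gather rows per sample and sum-reduce them to one dense vector each.

    table: list of rows (each a list of floats), standing in for the HBM table.
    sample_indices: one variable-length index list per sample.
    """
    dim = len(table[0])
    dense = []
    for indices in sample_indices:
        gathered = [table[i] for i in indices]  # indirect, index-driven reads
        acc = [0.0] * dim
        for row in gathered:                    # per-sample reduction (sum)
            for d in range(dim):
                acc[d] += row[d]
        dense.append(acc)
    return dense


table = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
dense = embedding_forward(table, [[0, 2], [1], []])
# dense == [[101.0, 202.0], [10.0, 20.0], [0.0, 0.0]]
```

Each sample contributes a different number of rows, which is why the index lists are variable-length and why the output has exactly one dense vector per sample regardless of how many rows were gathered.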
SparseCore owns four address spaces, of decreasing scope and increasing speed: HBM (chip-wide, GB-scale, embedding tables + gradient buffers), SPMEM (all SC cores on chip, MB-scale, cross-SC communication and large buffers), TILE_SPMEM (per-tile, KB-scale, the local working set the TEC computes over), and TIMEM (per-tile instruction memory). The handoff with the TensorCore uses three mechanisms — HBM (slow, global), programmed VMEM↔SPMEM DMA (fast), and a shared sync-flag pool (control plane). The detail of the gather/scatter descriptor format is on Stream Gather/Scatter; the SC↔MXU integration handshake is on SC ↔ MXU Handshake.
NOTE — the in-HBM atomic add is the keystone primitive. STREAM_OPCODE_SCATTER_FLOAT_ADD (paired with the DMA destination opcode DMA_DEST_OPCODE_READ_AND_ADD) accumulates floating-point gradients directly into embedding rows in HBM. The TensorCore's MXU cannot do this. Without it, embedding gradients would require either shrinking the table to fit VMEM or a multi-pass scatter-gather-add. In DLRM-class models this single op carries the gradient-accumulation traffic for every embedding table on every backward step — the architectural reason SparseCore is worth its silicon.
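The reason accumulation has to be a read-and-add, not a plain write, shows up as soon as a minibatch touches the same row twice. The pure-Python sketch below contrasts the two semantics; `scatter_add` and `scatter_overwrite` are hypothetical helpers, and the sequential loop only illustrates the arithmetic, not the atomicity or concurrency of the hardware operation.

```python
def scatter_add(table, indices, grads):
    """Accumulate each gradient row into table[indices[k]] (read-and-add)."""
    for i, g in zip(indices, grads):
        row = table[i]
        for d in range(len(row)):
            row[d] += g[d]


def scatter_overwrite(table, indices, grads):
    """Plain scatter for contrast: the last write to a row wins."""
    for i, g in zip(indices, grads):
        table[i] = list(g)


indices = [1, 1, 2]                  # row 1 is hit twice in the same step
grads = [[1.0], [2.0], [4.0]]

added = [[0.0], [0.0], [0.0]]
scatter_add(added, indices, grads)
# added == [[0.0], [3.0], [4.0]]

overwritten = [[0.0], [0.0], [0.0]]
scatter_overwrite(overwritten, indices, grads)
# overwritten == [[0.0], [2.0], [4.0]]  (the first gradient for row 1 is lost)
```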
How Part IX Is Organized
Part IX keeps SparseCore whole rather than slicing it across the ISA / cost / scheduling seam used by the TensorCore parts, because it is a self-contained engine a reader wants in one place. The pages group into bands:
- Engines (this band) — this overview, plus Architecture (engine roles + the embedding datapath in depth), SCS (Scalar) Engine, TAC Engine, TEC (Vector) Engine, the per-engine Bundle Slot-Base Map, the Region → Sequencer Outliner that partitions an SC computation into per-engine bundles, and getSequencerType (the SCS/TAC/TEC selection function).
- ISA — the opcode enums and slot encodings: Scalar Opcode Enum, Vector Opcode Enum, the OneSlot Scalar Router, the VectorLoad / VectorStore / VectorExtended / VEX slot pages, VEX Operand-Port Binding, VEX Mask / Dest-Port / Sub-Opcode, the M-Register Predicate Word, and the CBREG Circular-Buffer Register.
- Embedding datapath — the sparse compute: Scan Datapath, Segmented Scan, Segmented-Add-Scan, Embedding Minibatching Decomposition, SampleCombiner Emitter, EmitValencyLoop, RankAndPermute / RadixSort, and Dedup Multiplicity.
- Pointers & DMA — the addressing model: Fat Pointers (AS7/8/9), addrspacecast ISel, Tile-ID Cast, Stream Gather/Scatter, and IndirectVregStream.
- Back-end — the compiler offload path: SC Backend Pipeline (RunPasses and the MEGACORE barrier), SC EmitX Dispatcher (the seq3/seq4/seq5 → EmitX jump tables), SC Core Selection, SC Queue Assignment & Reservation, and GetSparseCoreConfig.
- Cross-cutting — SC ↔ MXU Handshake and the cross-vendor SparseCore vs Neuron MatmultSparse.
The retired predecessor, BarnaCore (the embedding accelerator on Jellyfish / Dragonfish / Pufferfish, replaced by SparseCore from Viperfish onward), has its own band starting at BarnaCore Overview. The TensorCore ISA that SparseCore hands off to is ISA Overview; the collective-offload story lives in Part XIII.
Confidence Summary
| Claim | Evidence |
|---|---|
| Three engine classes SCS/TAC/TEC, selected by TpuSequencerType 3/4/5 | TpuSequencerType enum (TpuSequencerTypeToString @ 0x20b362e0 jump table over off_22010DE0); EncoderBase<…, TpuSequencerType=3/5> template instantiations |
| Per-gen codec roots SparseCore{Scs,Tac,Tec}CodecBase scoped vxc.vfc / gxc.glc / gxc.gfc | demangled codec-base class symbols in the decompiled set |
| Viperfish + Ghostlite ship all three engines | vxc.vfc and gxc.glc each have Scs/Tac/Tec codec bases; fn counts Scs≈500, Tac≈760, Tec 4218/5961 |
| 6acc60406 drops TAC (SCS+TEC only) | zero gfc::…SparseCoreTac* symbols / functions; no gxc.gfc SparseCoreTacCodecBase; TEC fn count 6526 |
| Jellyfish / Dragonfish / Pufferfish carry no SparseCore | no SparseCore*CodecBase under jxc / pxc.pfc; these are BarnaCore-era gens |
| SCv0 is enum-only (no codec) | TpuSequencerTypeProto 7/8 (proto-only; no C++ value, FromProto rejects) + kHloSparseCoreV0{Infeed,Outfeed} profiler labels; no SCv0 encoder/decoder/descriptor ships |
| SCS=32 B, TAC=64 B, TEC=64 B bundles; no check byte | BundleSizeBytes reads codec-metadata vtable slot returning 32/64; SC bundles carry no 0x55 trailer |
| In-HBM atomic FP scatter-add is the keystone primitive | STREAM_OPCODE_SCATTER_FLOAT_ADD + DMA_DEST_OPCODE_READ_AND_ADD enum strings |
| V0 suffix on 6acc60406 TEC field symbols is an operand-version tag, not SCv0 | gfc::isa::SparseCoreTecVectorExtended*V0XField are live 6acc60406 TEC ops |
| Marketing names v5p/v6e/v7x for codenames Viperfish/Ghostlite/6acc60406 | follows sibling SparseCore Target Descriptor; binary keys on codename family ns, not marketing name |
Cross-References
- Architecture — engine roles and the embedding datapath in full depth.
- SCS (Scalar) Engine — the scalar control / addressing sequencer.
- TAC Engine — the tile-access / DMA-issuer engine and its 6acc60406 removal.
- TEC (Vector) Engine — the wide vector compute engine.
- getSequencerType — the SCS/TAC/TEC engine-selection function.
- Region → Sequencer Outliner — partitioning an SC computation into per-engine bundle streams.
- Scalar Opcode Enum — the SCS / TAC scalar ALU and scalar-misc opcode roster.
- Vector Opcode Enum — the TEC vector ALU opcode roster (per-gen widths).
- Stream Gather/Scatter — the indirect-DMA descriptor format and the STREAM_OPCODE_* set.
- SC Backend Pipeline — the SparseCore compiler offload pass pipeline.
- SC ↔ MXU Handshake — the SparseCore ↔ TensorCore integration handshake.
- GetSparseCoreConfig — the offload op-type configuration source.
- SparseCore vs Neuron MatmultSparse — cross-vendor comparison.
- ISA Overview — the TensorCore VLIW ISA SparseCore hands off to.
- BarnaCore Overview — the retired v2–v4 embedding accelerator SparseCore replaced.
- Binary: extracted/libtpu-0.0.40-cp314-cp314-manylinux_2_31_x86_64/libtpu/libtpu.so (build-id 89edbbe81c5b328a958fe628a9f2207d)
- Index entry: Part IX — SparseCore & BarnaCore / SparseCore engines — back to index