SparseCoreTarget (Target+0x948)

Every offset, value, and address on this page was read byte-exactly from libtpu.so in the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

xla::jellyfish::Target is the per-generation hardware descriptor that the XLA TPU backend queries for every codegen and cost decision. The SparseCore half of that descriptor does not live inline in Target; it is a separately allocated std::unique_ptr<xla::jellyfish::SparseCoreTarget> parked at Target+0x948. This page is the full field map of that sub-object, the accessor surface that reaches into it, and — because it is the cleanest per-codename MXU differentiator recoverable from this wheel — the per-generation MXU-contracting-depth table that lives alongside it in the per-codename Target subclasses.

Three concrete facts drive the page, and a reimplementer needs all three:

  • SparseCore presence is gated, not assumed. Every Target::SparseCore* accessor first calls a virtual predicate (SupportsSparseCore, vtable slot +0x260) and LOG(FATAL)s if it is false. The Target+0x948 pointer is only populated on the SparseCore-bearing generations; on the BarnaCore generations (Jellyfish/Dragonfish/Pufferfish) and the TensorCore-only lite dies, the guard short-circuits before the pointer is ever dereferenced.
  • Three SparseCoreTarget classes ship. A base SparseCoreTarget plus two concrete subclasses — ViperfishSparseCoreTarget (v5p) and GhostLiteSparseCoreTarget (v6e Ghostlite and v7x 6acc60406). The per-gen capability bits, latencies, and FLOPS are C++ literals in those subclass virtuals.
  • The MXU contracting depth is a compiled-in literal. Base Target::MxuContractingSize returns 128; only GhostliteTarget overrides it to 256. This 128-vs-256 split is the single hard per-codename MXU-geometry value in this build, and it is not a chip_parts proto field.
Sub-object: std::unique_ptr<xla::jellyfish::SparseCoreTarget> at Target+0x948
Built by: SparseCoreTarget::Init @ 0x1D612B20, installed by Target::Init
Presence gate: Target::SupportsSparseCore @ 0x1D48FD40 (vtable slot +0x260)
Concrete classes: base / ViperfishSparseCoreTarget / GhostLiteSparseCoreTarget
Has SparseCore: Viperfish (v5p), Ghostlite (v6e), 6acc60406 (v7x)
No SparseCore: Jellyfish / Dragonfish / Pufferfish (BarnaCore); lite dies (none)
Confidence: CONFIRMED (byte-anchored) unless a row says otherwise

The Presence Gate

Target+0x948 is meaningful only when the target has SparseCore silicon. The predicate that decides this is Target::SupportsSparseCore @ 0x1D48FD40, and it does not test the +0x948 pointer — it reads a count out of the TpuTopology sub-object at Target+0x3B8:

// Target::SupportsSparseCore @ 0x1D48FD40
bool Target::SupportsSparseCore(Target *this) {
  // Decompiler form: *((_QWORD *)this + 119); 119 × 8 = 0x3B8 == TpuTopology*. Field +0x98 is an SC count.
  return *(uint32_t *)(*(uintptr_t *)((char *)this + 0x3B8) + 0x98) > 0;   // byte offsets, hence the char* cast
}

It is wired into vtable slot +0x260. Every scalar accessor in the next section dispatches through that slot before touching Target+0x948, so on a BarnaCore generation the accessor LOG(FATAL)s with "SparseCore is not supported by this target" rather than dereferencing a null pointer. This is the structural reason the older generations carry no SparseCoreTarget: their TpuTopology[+0x98] SparseCore count is zero, the gate returns false, and Target::Init never builds the sub-object.

NOTE — the predicate is a runtime topology test, not a compile-time class test. A reimplementation must populate the SparseCore-count field of its topology descriptor before any SparseCore accessor is reachable; otherwise the entire SC accessor surface traps.
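
For a reimplementation, the gate-then-dereference pattern described above can be sketched as follows. This is a minimal illustration, not the libtpu code: the type names TopologySketch, SparseCoreSketch and TargetSketch are hypothetical, and a thrown std::logic_error stands in for LOG(FATAL) so the failure path can be exercised.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins for illustration; none of these type names exist in libtpu.
struct TopologySketch {
  std::uint32_t sparse_core_count;  // plays the role of TpuTopology[+0x98]
};

struct SparseCoreSketch {
  std::uint32_t tiles;       // plays the role of field +0x90
  std::uint32_t lane_count;  // plays the role of field +0x94
};

class TargetSketch {
 public:
  TargetSketch(TopologySketch topo, std::uint32_t tiles, std::uint32_t lanes)
      : topo_(topo) {
    // Build the sub-object only when the topology reports SparseCores,
    // mirroring the behaviour described for Target::Init.
    if (topo_.sparse_core_count > 0) {
      sc_.reset(new SparseCoreSketch{tiles, lanes});
    }
  }
  virtual ~TargetSketch() = default;

  // Runtime topology test, not a compile-time class test.
  virtual bool SupportsSparseCore() const { return topo_.sparse_core_count > 0; }

  std::uint32_t SparseCoreTiles() const { return Checked().tiles; }
  std::uint32_t SparseCoreLaneCount() const { return Checked().lane_count; }

 private:
  // Gate first, dereference second. The real code LOG(FATAL)s here; throwing
  // is a substitution made only so the sketch can be tested.
  const SparseCoreSketch& Checked() const {
    if (!SupportsSparseCore()) {
      throw std::logic_error("SparseCore is not supported by this target");
    }
    assert(sc_ != nullptr);  // invariant: gate true implies sub-object was built
    return *sc_;
  }

  TopologySketch topo_;
  std::unique_ptr<SparseCoreSketch> sc_;
};
```

With a zero SparseCore count, every accessor fails at the gate and the null sub-object pointer is never touched, which is the property the NOTE asks a reimplementation to preserve.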


The Scalar Accessor Surface

Five Target accessors reach into Target+0x948 and read one scalar each. They share an identical shape: dispatch the +0x260 gate, LOG(FATAL) if it returns false, then load *(*(this+0x948) + field). The decompiled bodies confirm the base and every field offset (the decompiler renders Target+0x948 as *((_QWORD *)this + 297); 297 × 8 = 0x948). The table also lists the gate itself and SparseCoresPerLogicalDevice, neither of which reads the sub-object.

| Accessor | Address | Reads | v7x value |
|---|---|---|---|
| SparseCoreTiles | 0xFAAFA40 | [0x948] + 0x90 | 16 |
| SparseCoreLaneCount | 0xF7906E0 | [0x948] + 0x94 | 16 |
| SparseCoreHbm4bWordSizeBytes | 0x1320C220 | [0x948] + 0x58 | 4 |
| SparseCoreStreamGranuleSizeBytes | 0x13886EE0 | [0x948] + 0xA4 | 4 |
| GetSparseCoreBarrierSyncFlagCount | 0x10972FA0 | [0x948] + 0x1D4 | |
| SupportsSparseCore | 0x1D48FD40 | [0x3B8 TpuTopology] + 0x98 > 0 | true |
| SparseCoresPerLogicalDevice | 0x135159C0 | CoresPerChip / LogicalDevicesPerChip | 2 |

A representative body (SparseCoreTiles, 0xFAAFA40):

__int64 Target::SparseCoreTiles(Target *this) {
  // vtable +0x260 == *(*this + 608)
  if (!(*(uint8_t (**)(Target *))(*(uintptr_t *)this + 608))(this))
    LOG(FATAL) << "SparseCore is not supported by this target";  // target.h:1704
  return *(uint32_t *)(*(uintptr_t *)((char *)this + 0x948) + 0x90);   // SparseCoreTiles (byte offsets)
}

SparseCoresPerLogicalDevice (0x135159C0) is the odd one out: it does not read [0x948] at all, computing CoresPerChip(SPARSE_CORE) / LogicalDevicesPerChip instead. It is included here because it is the entry point the runtime calls (via tensorflow::GetAndSetSparseCoresPerLogicalDevice) to size per-device SparseCore counts.


Full Field Layout of Target+0x948

The sub-object is one struct with two consumer views: the SC-MLO allocator view (the +0x10..+0x54 word/capacity block) and the HAL/cost-model view (+0x58..+0x1D4). Offsets are read from the SparseCoreTarget::Init (0x1D612B20) store sites or from the matching accessor deref. The struct spans at least 0x240 bytes; its constructor is inlined, so the destructor (0x1D499060) is a bare ret.

| Off | Type | Field / meaning |
|---|---|---|
| +0x08 | ptr | back-pointer (Target* / core ptr) |
| +0x10 | i32 | TIMEM/SCS-sflag capacity (group 0) |
| +0x14 | i32 | TAC-sflag capacity (group 1) |
| +0x18 | i32 | TEC-sflag / per-tile sflag pool (group 2) |
| +0x1C | i32 | SCS-SMEM capacity (group 0) |
| +0x20 | i32 | TAC-SMEM capacity (group 1) |
| +0x24 | i32 | TEC-SMEM capacity (group 2) |
| +0x28 | i32 | TILE_SPMEM capacity (bytes) |
| +0x2C | i32 | SPMEM capacity (bytes) |
| +0x30 | i32 | sparse-core SFLAG capacity |
| +0x38 | i64 | HBM 4b-word count = (Target[0x450]+3) >> 2 |
| +0x40 | i32 | TIMEM/sflag word size (group 0) |
| +0x44 | i32 | SMEM word size |
| +0x48 | i32 | TILE_SPMEM word size |
| +0x4C | i32 | SPMEM word size |
| +0x50 | i32 | TILE-instr (TIMEM) word size |
| +0x54 | i32 | per-group word-size slot |
| +0x58 | i32 | SparseCoreHbm4bWordSizeBytes = 4 |
| +0x5C..+0x70 | i32×N | group-mirrored sflag/smem word sizes + capacities |
| +0x74 | i32 | literal 2 (SCS sequencer/group count) |
| +0x78..+0x8C | i32×N | per-coretype memory counts (TENSOR/SC group fanout) |
| +0x90 | i32 | SparseCoreTiles = SequencerCount(5) |
| +0x94 | i32 | SparseCoreLaneCount = vector_isa(5).lane_count |
| +0x98 | i32 | SC lane bytes = lane_count << 2 |
| +0x9C | i32 | SC freq / host-irq block (TpuCoreParts[0x138]) |
| +0xA0 | i32 | tile_hbm_bandwidth_bytes_per_cycle (cp[0x1E0], gated cp[0x1E8]) |
| +0xA4 | i32 | stream_granule_size (cp[0x1E4]) |
| +0xA8..+0x238 | ptr×N | per-tile reserved-region vectors (SMEM/SPMEM/SFLAG windows) |
| +0x188 | u8 | bool flag |
| +0x190 | i64 | embedding param-region base word offset |
| +0x198 | u8 | bool flag |
| +0x1D0 | i32 | SC barrier sync-flag base (SpecialPurposeSyncFlags[0]) |
| +0x1D4 | i32 | GetSparseCoreBarrierSyncFlagCount (SpecialPurposeSyncFlags[+0x10]) |
| +0x1E8 | u8 | SparseCore-proto-present flag |
| +0x1EC..+0x22C | i32×N | compiler-reserved SFLAG watermarks + per-tile reserved-SFLAG block |
| +0x230/+0x238 | ptr×2 | tail region vectors |
| ~0x240 | | sizeof(SparseCoreTarget) >= 0x240 |

GOTCHA — the +0xA8..+0x238 block is a run of libc++ vectors holding per-tile reserved-region descriptors. The store offsets are byte-confirmed, but the element layout of each vector (which reserved tile region each holds) was not individually walked — those rows are HIGH, not CONFIRMED.
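
A reimplementer who wants to validate a dumped sub-object against the table can pin the scalar offsets at compile time. The sketch below is an assumption-laden illustration: the sc_layout namespace, the GeometryWindow struct and the ReadU32 helper are hypothetical names, and only the offsets themselves come from the table above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace sc_layout {  // hypothetical namespace; offsets copied from the field table
constexpr std::size_t kHbm4bWordSizeBytes = 0x58;
constexpr std::size_t kTiles = 0x90;
constexpr std::size_t kLaneCount = 0x94;
constexpr std::size_t kLaneBytes = 0x98;
constexpr std::size_t kStreamGranuleSize = 0xA4;
constexpr std::size_t kBarrierSyncFlagCount = 0x1D4;
constexpr std::size_t kMinObjectSize = 0x240;
}  // namespace sc_layout

// Hypothetical mirror of the contiguous i32 window +0x90..+0xA4.
struct GeometryWindow {
  std::uint32_t tiles;               // +0x90
  std::uint32_t lane_count;          // +0x94
  std::uint32_t lane_bytes;          // +0x98
  std::uint32_t freq_or_irq_block;   // +0x9C
  std::uint32_t tile_hbm_bandwidth;  // +0xA0
  std::uint32_t stream_granule;      // +0xA4
};

static_assert(sc_layout::kTiles + offsetof(GeometryWindow, lane_count) ==
                  sc_layout::kLaneCount,
              "lane_count must land at +0x94");
static_assert(sc_layout::kTiles + offsetof(GeometryWindow, stream_granule) ==
                  sc_layout::kStreamGranuleSize,
              "stream_granule must land at +0xA4");
static_assert(sc_layout::kTiles + sizeof(GeometryWindow) == 0xA8,
              "window must end where the reserved-region vectors begin");

// Reads one 32-bit field out of a raw byte image of the sub-object.
// memcpy avoids alignment and strict-aliasing problems.
inline std::uint32_t ReadU32(const unsigned char* image, std::size_t image_size,
                             std::size_t offset) {
  assert(offset + sizeof(std::uint32_t) <= image_size);
  std::uint32_t value;
  std::memcpy(&value, image + offset, sizeof(value));
  return value;
}
```

Reading through named offsets rather than a full struct mirror is deliberate: the +0xA8..+0x238 vector block is not fully walked, so a complete struct definition would overstate what is known.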


Init Geometry Block

SparseCoreTarget::Init (0x1D612B20) populates the geometry from the SPARSE_CORE TpuCoreParts. The block below is the decompiled store sequence, byte-confirmed (decompiler offsets in decimal; hex in comments):

// SparseCoreTarget::Init @ 0x1D612B20 — geometry block
*(uint32_t *)(sc + 0x58) = 4;                              // SparseCoreHbm4bWordSizeBytes (literal)
*(uint32_t *)(sc + 0x74) = 2;                              // SCS sequencer/group count (literal)
*(uint32_t *)(sc + 0x90) = TpuCoreParts::SequencerCount(cp, 5);   // SparseCoreTiles (seq type 5)
vi = TpuSequencerParts::vector_isa(cp.SequencerParts(5));
*(uint32_t *)(sc + 0x94) = vi.lane_count;                  // SparseCoreLaneCount
*(uint32_t *)(sc + 0x98) = 4 * vi.lane_count;              // SC lane bytes
if (cp[0x1E8] /* SparseCore submsg present */) {
  *(uint32_t *)(sc + 0xA0) = cp[0x1E0];                    // tile_hbm_bandwidth_bytes_per_cycle
  *(uint32_t *)(sc + 0xA4) = cp[0x1E4];                    // stream_granule_size
}
// ... later: barrier sync-flag base + count from a SpecialPurposeSyncFlags sub-object
*(uint32_t *)(sc + 0x1D0) = SpecialPurposeSyncFlags[0];
*(uint32_t *)(sc + 0x1D4) = SpecialPurposeSyncFlags[+0x10];   // GetSparseCoreBarrierSyncFlagCount

The tile/lane geometry is sourced from TpuCoreParts::SequencerCount(cp, 5) and SequencerParts(5).vector_isa(). Both indices use the internal (codec-template) TpuSequencerType numbering {SCS=3, TAC=4, TEC=5}, so index 5 is SC_TEC — the SparseCore tile-execute pool, the 16-instance, lane_count-16 VectorIsa on v7x. So SparseCoreTiles = 16, SparseCoreLaneCount = 16 for v7x, matching the chip_parts decode of the 6acc60406 SC sequencer geometry.

NOTE — Init's SequencerCount(cp, 5) / SequencerParts(cp, 5) indices are internal TpuSequencerType values, the same codec-template numbering the Part IX SparseCore pages use: {TC=0, BARNA=1, BARNA_ADDR=2, SCS=3, TAC=4, TEC=5}. In that scheme index 5 = SC_TEC (the tile-execute pool), which is exactly what the geometry block reads — so SparseCoreTiles / SparseCoreLaneCount come from the TEC sequencer, not from TAC. The SparseCore block of this enum is {SCS=3, TAC=4, TEC=5}, the canonical wiki numbering; see getSequencerType and SparseCore Overview.

QUIRK — there are two legitimate numbering schemes, off by one, and Init uses the internal one. The proto enum TpuSequencerTypeProto prepends an INVALID=0 slot and renumbers the SparseCore block one higher — {INVALID=0, TC=1, BARNA=2, BARNA_ADDR=3, SCS=4, TAC=5, TEC=6} — so a reader who applies proto numbering to Init's index would mis-read 5 as TAC. tpu::TpuSequencerTypeToProto (0x20B36460) returns internal + 1 and tpu::TpuSequencerTypeFromProto (0x20B36300) maps proto case N to internal = N − 1, so internal = proto − 1 exactly. Code paths that index TpuCoreParts (Init here) take the internal value; only serialized proto fields and the TpuSequencerTypeToString label table use the proto value. (6acc60406 / gfc carries internal codec params 3 and 5 only — SCS + TEC, no TAC — so on v7x there is no TAC sequencer to confuse with TEC in the first place.)
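
The off-by-one relationship between the two numbering schemes can be written down directly. This is a sketch of the mapping as described above, not the bodies of tpu::TpuSequencerTypeToProto or tpu::TpuSequencerTypeFromProto; the enum and function names are hypothetical, and rejecting INVALID with an assert is an assumption about error handling that the source does not document.

```cpp
#include <cassert>

// Internal (codec-template) numbering, the one Init uses to index TpuCoreParts.
enum class SeqTypeInternal : int {
  TC = 0, BARNA = 1, BARNA_ADDR = 2, SCS = 3, TAC = 4, TEC = 5
};

// Proto numbering: an INVALID=0 slot is prepended, shifting everything up by one.
enum class SeqTypeProto : int {
  INVALID = 0, TC = 1, BARNA = 2, BARNA_ADDR = 3, SCS = 4, TAC = 5, TEC = 6
};

constexpr SeqTypeProto ToProto(SeqTypeInternal t) {
  return static_cast<SeqTypeProto>(static_cast<int>(t) + 1);
}

inline SeqTypeInternal FromProto(SeqTypeProto p) {
  // Assumption: INVALID has no internal counterpart, so it is rejected here.
  assert(p != SeqTypeProto::INVALID);
  return static_cast<SeqTypeInternal>(static_cast<int>(p) - 1);
}

// Index 5 means TEC internally but TAC in proto numbering: the misreading
// the QUIRK warns about.
static_assert(static_cast<int>(SeqTypeInternal::TEC) == 5, "internal 5 is TEC");
static_assert(static_cast<int>(SeqTypeProto::TAC) == 5, "proto 5 is TAC");
static_assert(ToProto(SeqTypeInternal::TEC) == SeqTypeProto::TEC,
              "same label, value shifted by one");
```

Using two distinct enum class types means a proto value cannot be passed where an internal index is expected without an explicit conversion.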


Non-Virtual Helpers

The base SparseCoreTarget defines three non-virtual helpers the SC-MLO allocator reads. All byte-confirmed:

| Helper | Address | Body |
|---|---|---|
| SparseCoreSpmemStripeGranularityBytes | 0x1D499440 | return 0x100000020LL; which is optional<32> (value 0x20=32, engaged-flag in high dword) |
| SparseCoreParamPtrLocationWordOffset(i) | 0x1D618080 | return [0x190] - i; with CHECK(result >= 0) |
| SparseCoreStartReservedSmemWordOffset(i) | 0x1D618060 | tail-jumps to the above |

These pin field +0x190 as the embedding param-region base word offset: the allocator places parameter pointers at descending offsets [0x190] - i and fatals on underflow ("Parameter number N causes underflow").
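
The two helper bodies are small enough to sketch in portable C++. The names below are hypothetical. The packed layout (value in the low dword, engaged flag in the high dword) is taken from the table above, and the exception replaces the CHECK failure purely for testability.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical unpacked view of an optional<uint32_t> returned in one register.
struct PackedOptionalU32 {
  bool engaged;
  std::uint32_t value;
};

// Assumes value in bits 0..31 and the engaged flag in the high dword.
constexpr PackedOptionalU32 DecodePackedOptional(std::uint64_t raw) {
  return PackedOptionalU32{(raw >> 32) != 0,
                           static_cast<std::uint32_t>(raw & 0xFFFFFFFFu)};
}

static_assert(DecodePackedOptional(0x100000020ULL).engaged, "flag is set");
static_assert(DecodePackedOptional(0x100000020ULL).value == 32,
              "stripe granularity decodes to 32 bytes");

// Descending parameter-pointer placement: base - i, fatal on underflow.
// base_word_offset plays the role of field +0x190.
inline std::int64_t ParamPtrWordOffset(std::int64_t base_word_offset,
                                       std::int64_t param_number) {
  assert(param_number >= 0);
  const std::int64_t result = base_word_offset - param_number;
  if (result < 0) {
    throw std::out_of_range("Parameter number " + std::to_string(param_number) +
                            " causes underflow");
  }
  return result;
}
```

For example, with an illustrative base of 10 words, parameter 3 lands at word offset 7, while parameter 11 would underflow and be rejected.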


Per-Codename MXU Contracting-Depth Table

The MXU contracting dimension is a per-codename C++ literal in the Target subclass, not a chip_parts field. The base Target returns 128; only GhostliteTarget overrides to 256. "inherit" means the subclass does not override and uses the base value. Every numeric cell is a byte-exact literal read from the named method.

| MXU constant (CODE) | v2 Jelly | v3 Dragon | v4 Puffer | v5p Viperfish | v6e Ghostlite | v7x 6acc60406 |
|---|---|---|---|---|---|---|
| MxuContractingSize | 128 | 128 | 128 | 128 | 256 | 256 |
| MxuNoncontractingSize | 128 | 128 | 128 | 128 | 256 | 256 |
| MxuSparseContractingSize | 0 | 0 | 0 | 0 | 0 (inherit) | 0 (inherit) |
| MxuContractingSizeIsDoubled(mode) | false | false | false | predicate | predicate | predicate |
| └ doubled-mode set (raw GainLatchMode) | | | | {22,23,24,25} | {22,23,24,25} | {22,23,24,25} |
| MinLmrWidthInColumns | FATAL | FATAL | FATAL | 8 | 16 | 16 |
| MaxLmrWidthInColumns | FATAL | FATAL | FATAL | 128 | 128 | 128 |

NOTE — v7x 6acc60406 reuses the GhostliteTarget TensorCore subclass; no separate Tpu7xTarget/TpuV7xTarget class exists in this build. The v6e/v7 distinction is data-driven via chip_parts, so both get MxuContractingSize = 256.

How a reimplementer should read this:

  • Contracting depth. The systolic MXU reduction depth is 128 on Jellyfish through Viperfish and 256 on the Ghostlite class (v6e/v7x). This is independent of the chip_parts lane_count=128: the proto carries the 128-lane width and mxu_count=2 for v7x, whereas 256 is the systolic row depth, a constant compiled into the code. The 128 value is cross-validated by the FLOPS relation on the 128×128 generations: peak BF16 = 2 × mxu_count × 128² × frequency_mhz reproduces v2/v3/v4 to within 1%.
  • The doubling predicate. Both ViperfishTarget and GhostliteTarget override MxuContractingSizeIsDoubled with the identical predicate (mode - 22) < 4u, i.e. true for raw GainLatchMode ∈ {22,23,24,25}. Those indices are the 4-bit packed-nibble matmul modes (S4/U4): two nibbles pack into one physical systolic row, doubling effective contracting depth (256 on the 128-row gens, 512 on Ghostlite's 256-row). The base returns always-false — the older gens have no 4-bit MXU mode.
  • LMR width. Min/MaxLmrWidthInColumns is the latch-matrix-register column window (the staging-register granularity), distinct from the contracting depth. It is defined only on the SparseCore-bearing gens; the base LOG(FATAL)s "Unimplemented". Jellyfish/Pufferfish instead expose the older MxuResultEntries{Pushed,Popped} result-FIFO model. MxuSparseContractingSize is 0 on every gen — no subclass overrides it; sparse-MXU contracting is unused in this build.
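
The doubling predicate and its effect on contracting depth can be sketched as below. The function names are hypothetical; the range test mirrors the (mode - 22) < 4u comparison quoted above, with the subtraction done in unsigned arithmetic so that the wrap-around is well defined in portable C++.

```cpp
#include <cassert>
#include <cstdint>

// True for raw GainLatchMode in {22, 23, 24, 25}. The unsigned subtraction
// wraps for mode < 22, so a single compare covers both bounds.
constexpr bool IsDoubledMode(std::int32_t raw_gain_latch_mode) {
  return static_cast<std::uint32_t>(raw_gain_latch_mode) - 22u < 4u;
}

// has_predicate is false for the base Target (always-false override absent)
// and true for the Viperfish/Ghostlite subclasses.
constexpr std::uint32_t EffectiveContractingDepth(std::uint32_t base_depth,
                                                  bool has_predicate,
                                                  std::int32_t mode) {
  return (has_predicate && IsDoubledMode(mode)) ? base_depth * 2u : base_depth;
}

static_assert(IsDoubledMode(22) && IsDoubledMode(25), "range endpoints");
static_assert(!IsDoubledMode(21) && !IsDoubledMode(26) && !IsDoubledMode(-1),
              "values outside the range");

// Runtime self-check of the depths quoted in the text.
inline void CheckDepthExamples() {
  assert(EffectiveContractingDepth(128, true, 23) == 256);   // 128-row gens
  assert(EffectiveContractingDepth(256, true, 23) == 512);   // Ghostlite
  assert(EffectiveContractingDepth(128, false, 23) == 128);  // base Target
}
```

A negative or out-of-range mode falls through as "not doubled" without a separate lower-bound check, which is why compilers emit this single-compare form.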

SparseCore Presence and Geometry Per Generation

| Generation | TpuVersionProto | SparseCore? | SparseCoreTarget class | Tiles × Lane | MXU contracting |
|---|---|---|---|---|---|
| v2 Jellyfish | 1 | no (BarnaCore) | | | 128 |
| v3 Dragonfish | 2 | no (BarnaCore) | | | 128 |
| v4 Pufferfish | 3 | no (BarnaCore) | | | 128 |
| v5p Viperfish | 4 | yes | ViperfishSparseCoreTarget | — × 8 (SC_TEC lane) | 128 |
| v6e Ghostlite | 5 | yes | GhostLiteSparseCoreTarget | — × 8 (SC_TEC lane) | 256 |
| v7x 6acc60406 | 6 | yes | GhostLiteSparseCoreTarget | 16 × 16 | 256 |
| v4 lite / v5e lite | | no (TC-only) | | | 128 |

The Tiles × Lane column shows the chip_parts SC_TEC VectorIsa lane width; v7x widens that to 16 (and SparseCoreTiles/SparseCoreLaneCount both report 16 after Init). v5p/v6e carry the narrower 8-lane TEC. The lite dies (pufferfish_lite, viperfish_lite) carry neither BarnaCore nor SparseCore — their TpuTopology[+0x98] SC count is zero, so SupportsSparseCore is false and no SparseCoreTarget is built.


Per-Gen SparseCore Capability / Geometry Table

The 24 SparseCoreTarget virtual accessors are present in both concrete subclasses (ViperfishSparseCoreTarget for v5p, GhostLiteSparseCoreTarget for v6e Ghostlite + v7x 6acc60406). Values are byte-exact, matched by method name (the Viperfish vtable order shifts by one vs GhostLite).

| SparseCoreTarget vfn | Viperfish (v5p) | Ghostlite / 6acc60406 (v6e/v7x) | Source (VF / GL) · Confidence |
|---|---|---|---|
| FlopsPerSparseCore(fmt∈{1,2}) | 1.0e12 (1 TFLOP/s) | 3.595e13 (35.95 TFLOP/s) | 0x1D49C540 / 0x1D4990A0 |
| └ other fmt | Target vtable +0x718 | Target vtable +0x718 | tail-call |
| SparseCoreTileCrossbarBandwidthRandomAccess | 29 B/cyc | 29 B/cyc | vtable |
| SparseCoreTileVectorAluSlotCount(bool) | 3 | 3 | vtable |
| SparseCoreHbmAccessLatency | 418 | 418 | 0x1D49C5C0 / 0x1D499120 |
| SparseCoreSpmemAccessLatency | 30 | 30 | 0x1D49C5E0 / 0x1D499140 |
| TaskRequestStartAccessFunctionArgWordOffset | 1 | 1 | vtable |
| TaskRequestStartExecuteFunctionArgWordOffset | 1 | 1 | vtable |
| TaskRequestEndExecuteFunctionArgWordOffset | 0 | 0 | vtable |
| TraceEnBitOffsetInStreamControl | 15 | 15 | vtable |
| SetDoneBitOffsetInStreamControl | 2 | 2 | vtable |
| TileLocalStrideOffsetInStreamControl | 3 | 3 | vtable |
| IndirectListTypeBitOffsetInStreamControl | 7 | 7 | vtable |
| IndirectFilterEnBitOffsetInStreamControl | 14 | 14 | vtable |
| SupportsScVdupcntVuniqueWithLaneIds | 0 | 1 | 0x1D49C7A0 / 0x1D499300 |
| SupportsScVldVstIdxAdd | 0 | 1 | 0x1D49C7C0 / 0x1D499320 |
| SupportsScVar | 0 | 0 | 0x1D49C7E0 / 0x1D499340 |
| SupportsScFp8VectorCmp | 0 | 0 | vtable |
| SupportsScVmemStream | 0 | 0 | vtable |
| SupportsScHbm4bStream | 1 | 1 | 0x1D49C840 / 0x1D4993A0 |
| SupportsScLocalSpmemDma | 1 | 1 | 0x1D49C860 / 0x1D4993C0 |
| SupportsScBundleCompression | 0 | 0 | vtable |
| SupportsScB8VectorMaskPopulationCount | 0 | 0 | vtable |
| SupportsScEupOps | 1 | 1 | 0x1D49C8C0 / 0x1D499420 |
| SupportsTileSmemDma | 0 | 0 | vtable |

The genuine Viperfish → Ghostlite/6acc60406 capability gains are exactly two bits plus the FLOPS jump:

  • SupportsScVdupcntVuniqueWithLaneIds (0 → 1) — vdupcnt / vunique-with-lane-ids ops.
  • SupportsScVldVstIdxAdd (0 → 1) — indexed vector load/store with add.
  • FlopsPerSparseCore 1 TFLOP/s → 35.95 TFLOP/s — the widened Ghostlite/6acc60406 TEC vector engine.

GOTCHA — SupportsScEupOps returns 1 on both Viperfish and Ghostlite/6acc60406 (verified at 0x1D49C8C0 and 0x1D499420 — both return 1). It is not a per-generation delta: the Extended-Unit-Pipeline ops are present on Viperfish too, so do not count it among the Ghostlite gains. The only capability-bit deltas are the two listed above.

The FlopsPerSparseCore body is a small dispatcher, identical in shape across both subclasses:

// GhostLiteSparseCoreTarget::FlopsPerSparseCore @ 0x1D4990A0
double FlopsPerSparseCore(this, int fmt) {            // FLOPS value comes back in xmm0
  if ((uint8_t)(fmt - 1) >= 2u)                       // fmt not in {1,2}
    return (*Target_vtable[+0x718])(this->target, fmt);  // delegate to Target flops
  return /* xmm0 = */ qword_A2DE920;                  // 3.595e13 for fmt in {1,2}
}

For matmul data formats 1 and 2 it returns the SparseCore-local FLOPS constant; for everything else it tail-delegates to the Target per-format FLOPS at vtable slot +0x718. Viperfish has the identical structure with the constant at qword_A2DFD18 (1.0e12).
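
A standalone sketch of that dispatcher shape follows. It is illustrative only: SparseCoreFlopsSketch and the constant names are hypothetical, and a std::function stands in for the Target per-format FLOPS virtual at slot +0x718. The two constants are the values quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

constexpr double kViperfishLocalFlops = 1.0e12;    // value at qword_A2DFD18
constexpr double kGhostliteLocalFlops = 3.595e13;  // value at qword_A2DE920

class SparseCoreFlopsSketch {
 public:
  // target_flops stands in for the Target per-format FLOPS virtual.
  SparseCoreFlopsSketch(double local_flops,
                        std::function<double(int)> target_flops)
      : local_flops_(local_flops), target_flops_(std::move(target_flops)) {
    assert(static_cast<bool>(target_flops_));
  }

  double FlopsPerSparseCore(int fmt) const {
    // Same byte-wide range test as the decompiled body: after subtracting 1,
    // formats 1 and 2 map to 0 and 1; everything else is >= 2.
    // Assumes fmt is a small enum value that fits in one byte.
    if (static_cast<std::uint8_t>(fmt - 1) >= 2u) {
      return target_flops_(fmt);  // delegate to the Target-level figure
    }
    return local_flops_;  // SparseCore-local constant for fmt in {1, 2}
  }

 private:
  double local_flops_;
  std::function<double(int)> target_flops_;
};
```

Constructing one instance with kViperfishLocalFlops and another with kGhostliteLocalFlops reproduces the only per-generation difference in this method, since the dispatch logic is shared.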


| Name | Relationship |
|---|---|
| Target::Init | builds unique_ptr<SparseCoreTarget> via SparseCoreTarget::Init, installs it at Target+0x948 |
| SparseCoreTarget::Init (0x1D612B20) | populates every field from the SPARSE_CORE TpuCoreParts |
| xla_mlo_util::CapacityInBytes / WordSizeInBytes | SC-MLO allocator consumers of the +0x10..+0x54 capacity/word block |
| GhostliteTarget / ViperfishTarget | the TensorCore Target subclasses carrying the MXU-256 / LMR-width literals |

Cross-References

  • Per-Codename Constants — the chip_parts-sourced per-generation memory/MXU/core-count table this descriptor's geometry is decoded against
  • TpuChipConfig — how the per-codename constants assemble into the runtime Target configuration
  • SparseCore Overview — the SCS/TAC/TEC engine model these capability bits and stream-control offsets feed
  • SC ↔ MXU Handshake — the consumer of MxuContractingSize, the doubling predicate, and the LMR-width window