TpuChipConfig
All addresses and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.
Abstract
There are two embedded-proto descriptions of a TPU generation, and they are easy to confuse. chip_parts (see chip_parts.binarypb Decode) describes the silicon's capability geometry — memory sizes, MXU dimensions, clocks. TpuChipConfig describes a runtime mode: bounce buffers, sync-flag resources, and on-device transfer windows for a particular (version, variant) pair. TpuChipConfig::Create resolves an embed://tpu_chip_config/<version>_chip_configs_<alias>.binarypb resource, parses it into a TpuChipConfigProto, and materializes a TpuChipConfig object the driver layer uses to provision per-core queues and DMA buffers.
This page documents that resolution-and-materialization path and, separately, the consumption surface that the decoded capability constants reach through the xla::jellyfish::Target object. The Target is the runtime hardware-abstraction object; chip_parts fills its field block at boot, and a family of tiny accessors (LaneCount, SublaneCount, ChunksPerTile, HbmSizeBytes, TensorCoreFrequencyInMegaHertz, …) read individual fields out of it. Those accessors are the API the cost model, ISA emitter, and tiling passes call; the offset map below is what makes a struct dump from a Target instance readable.
For reimplementation, the contract is:
- The TpuChipConfig::Create path: version+variant → kChipConfigAliases lookup → embed://tpu_chip_config/… → ReadBinaryProto → FromProto.
- The TpuChipConfig object header: the three mode booleans at +8/+9/+10 (Megacore/Megachip/TcControl) the HAL branches on, and the per-core EnumMap<TpuCoreType, …, 3> tables keyed by the runtime {kTensorCore=0, kBarnaCore=1, kSparseCore=2} enum.
- The Target field block: which struct offset holds each decoded chip_parts constant, and the accessor that reads it.
- The consumers: the geometry accessors (LaneCount/SublaneCount/ChunksPerTile/VexMatrixWidthToSyU32) and who calls them (cost model, ISA emitter, topology).
| Item | Value |
|---|---|
| Config resolver | tpu::TpuChipConfig::Create @ 0x20AE98E0 |
| Config parser | tpu::TpuChipConfig::FromProto @ 0x20AEA100 |
| Config constructor | tpu::TpuChipConfig::TpuChipConfig @ 0x20AF6300 |
| Source file | learning/45eac/tpu/runtime/topology/tpu_chip_config.cc (Create alias-log line 368) |
| Alias map | kChipConfigAliases (gtl::flat_map<TpuVersionAndVariant, …>, 4 entries) |
| Mode flags | Megacore +8 (0x20AFCA00), Megachip +9 (0x20AFCC00), TcControl +10 (0x20AFCC20) |
| Geometry descriptor | Target+0x3B8 → tpu::TpuTopology* |
| Capability field block | Target+0x438..+0x510 (filled by TpuChipParts::FromProto) |
TpuChipConfig::Create
Purpose
Create obtains the mode configuration for a (version, variant) pair: the TpuChipConfigProto enumerates SharedMemoryRegion records and SyncFlag records (parsed by FromProto) that the driver uses to lay out bounce buffers and sync-flag resources per core. It is the mode-config sibling of chip_parts's capability-geometry resolver TpuChipParts::DefaultsForVersion (0x20B1B040). The two are deliberate parallels — same shape, same embed:// mechanism — but they live in different translation units: Create is in tpu_chip_config.cc, DefaultsForVersion is in tpu_chip_parts.cc (its fatal-log strings cite tpu_chip_parts.cc:341,343,345).
Algorithm
    StatusOr<TpuChipConfig> TpuChipConfig::Create(TpuVersion v, string_view variant,   // sub_20AE98E0
                                                  string_view config_variant):
        // 1. Resolve a (version, variant) alias to a config-variant name.
        key = TpuVersionAndVariant{v, variant}
        hit = kChipConfigAliases.find(key)                  // 4-entry flat_map of TpuVersionAndVariant
        if hit and bcmp(hit->value, config_variant) != 0:   // alias differs from requested
            LOG(INFO) << "Resolved chip config alias " << config_variant << "->" << hit->value
                      << " for " << v << ", " << variant    // tpu_chip_config.cc:368
            alias = hit->value                              // (logged only when the alias renames the request)
        else:
            alias = config_variant                          // no alias / identity: use config_variant verbatim

        // 2. Build the embed:// resource path and read it.
        name = AsciiStrToLower(TpuVersionToString(v))
        if variant.non_empty(): StrAppend(&name, "_", variant)
        filename = CatPieces("embed://tpu_chip_config/", name, "_chip_configs_", alias, ".binarypb")
        status = tsl::ReadBinaryProto(Env::Default(), filename, &proto)   // tpu_chip_config.cc:379-383
        if status != Ok:
            return StatusBuilder(...) << "Failed to obtain chip config \"" << alias
                                      << "\" for version " << v << ", variant \"" << variant << "\""
        return TpuChipConfig::FromProto(proto)              // sub_20AEA100
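The two resolution steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the library's code: the function names, the dictionary standing in for kChipConfigAliases, and the sample alias entry are all hypothetical. Only the shape of the resource path is taken from the pseudocode.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for kChipConfigAliases. The real map has 4 entries
# whose contents are not listed on this page; this entry is invented.
EXAMPLE_ALIASES: Dict[Tuple[str, str], str] = {
    ("viperfish", "example_variant"): "example_alias",
}


def resolve_alias(version: str, variant: str, config_variant: str,
                  aliases: Dict[Tuple[str, str], str]) -> str:
    """Step 1: return the aliased config-variant name, or the request verbatim."""
    return aliases.get((version, variant), config_variant)


def chip_config_path(version: str, variant: str, alias: str) -> str:
    """Step 2: build the embed:// resource path for a chip config blob."""
    name = version.lower()          # AsciiStrToLower(TpuVersionToString(v))
    if variant:                     # variant is appended only when non-empty
        name += "_" + variant
    return f"embed://tpu_chip_config/{name}_chip_configs_{alias}.binarypb"
```

With an empty variant the name collapses to the lowercase version string alone, so the path has exactly one underscore-separated prefix before `_chip_configs_`.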
QUIRK: the embed:// path prefix is tpu_chip_config/ here but tpu_chip_parts/ in DefaultsForVersion, and the suffix is _chip_configs_<alias>.binarypb (note the plural and the alias slot) versus _chip_parts.binarypb. The two resolvers target different resource families. chip_config blobs are the *_chip_configs_*.binarypb entries in the embedded TOC; do not conflate them with the nine *_chip_parts.binarypb capability blobs. Unlike DefaultsForVersion, Create returns a Status on failure rather than CHECK-failing: a missing config is recoverable, a missing capability description is fatal.
What FromProto Materializes
TpuChipConfig::FromProto (0x20AEA100) walks the TpuChipConfigProto, iterating its SharedMemoryRegion repeated field and a SyncFlag repeated field, inserting each into the TpuChipConfig's flat_hash_map<TpuSharedMemoryOnChip, BounceBuffer> and sync-flag-resource tables. The TpuChipConfig constructor (0x20AF6300) takes that map plus three leading booleans and the per-core layout. The result is the object TpuPxcDriver::InitializeCores (0xE806500) and TpuChipCommonImpl::RegisterContinuationQueueConfigs (0xE72B340) consume to provision hardware queues — it is a resource object, not the capability Target.
The TpuChipConfig Object Header
The first 11 bytes of the constructed object are a fixed header the constructor writes from its leading scalar arguments (0x20AF6300 stores a2→+0, a3→+8, a4→+9, a5→+10), and three one-line const accessors expose the three mode flags. These are the per-config booleans a reimplementation must reproduce verbatim, because the HAL branches on them at boot:
| Off | Accessor (VA) | Type | Holds |
|---|---|---|---|
| +0x00 | (constructor a2) | int64 | leading scalar word (per-config count) |
| +0x08 | Megacore (0x20AFCA00) | bool | megacore mode (name CONFIRMED; semantics inferred) |
| +0x09 | Megachip (0x20AFCC00) | bool | megachip mode (name CONFIRMED; semantics inferred) |
| +0x0A | TcControl (0x20AFCC20) | bool | TensorCore-control mode (name CONFIRMED; semantics inferred) |
Each accessor is literally return *((uint8_t*)this + N) — Megacore reads +8 (0x20AFCA00), Megachip reads +9 (0x20AFCC00), TcControl reads +10 (0x20AFCC20). The byte positions and accessor symbols are byte-exact from the disassembly; the meaning of each mode (megacore as a two-TensorCore fused topology, megachip as a multi-die-as-one-device topology) is the standard TPU interpretation, inferred rather than proven from this binary. xla::jellyfish::Target::IsMegachip (0x10914F60) consumes the +9 byte through the config pointer the Target holds at Target+0x3B8→+0x18 (*(TpuChipConfig**)([Target+0x3B8]+24)), and gates its result on a positive [Target+0x3B8]+148 count — so megachip-parallel compilation requires both the Megachip byte set and a non-zero descriptor count.
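To make the header layout concrete, here is a small Python sketch that decodes the first 11 bytes of a raw TpuChipConfig dump and mirrors the IsMegachip gate described above. It assumes a little-endian dump (not stated on this page), and the names ChipConfigHeader, parse_header, and is_megachip are hypothetical. The descriptor count is passed in as a plain integer because the width of the field at [Target+0x3B8]+148 is not given here.

```python
import struct
from typing import NamedTuple


class ChipConfigHeader(NamedTuple):
    leading_word: int   # +0x00, int64 (per-config count)
    megacore: bool      # +0x08
    megachip: bool      # +0x09
    tc_control: bool    # +0x0A


def parse_header(raw: bytes) -> ChipConfigHeader:
    """Decode the 11-byte fixed header from a raw object dump."""
    if len(raw) < 11:
        raise ValueError("need at least 11 bytes of header")
    # "<" disables padding, so q + B + B + B is exactly 8 + 1 + 1 + 1 bytes.
    word, b8, b9, b10 = struct.unpack_from("<qBBB", raw, 0)
    return ChipConfigHeader(word, b8 != 0, b9 != 0, b10 != 0)


def is_megachip(header: ChipConfigHeader, descriptor_count: int) -> bool:
    """Both conditions must hold: Megachip byte set and a positive count."""
    return header.megachip and descriptor_count > 0
```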
QUIRK: these three bytes are mode state on the TpuChipConfig resource object, not capability geometry. They do not live in the Target field block this page also maps; do not look for Megachip at a Target offset. The selector that picks which config blob was loaded is the variant string fed to Create, and the megacore/megachip booleans inside that blob are what Megacore()/Megachip() then surface.
Per-Core Layout Keying
After the header the constructor stores a long tail of per-core-type tables — EnumMap<TpuCoreType, vector<InfeedQueue>, 3>, EnumMap<TpuCoreType, vector<ContinuationQueue>, 3>, EnumMap<TpuCoreType, vector<OutfeedQueue>, 3>, EnumMap<TpuCoreType, UserInterrupts, 3>, and so on (visible in the constructor's mangled signature). Every one is an EnumMap of arity 3, keyed by the runtime tpu::TpuCoreType enum. That enum is {kTensorCore=0, kBarnaCore=1, kSparseCore=2} — TpuCoreTypeFromProto (0x20B36840) maps wire TENSOR_CORE=1→0, BARNA_CORE=2→1, SPARSE_CORE=3→2 (proto value minus one, with a fatal "Invalid core type" at tpu_chip_enums.cc:301 for anything else). A reimplementation that wants to index the per-core queue tables must use this 0-based runtime enum, not the 1-based proto enum.
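The proto-to-runtime conversion is easy to get off by one, so a short Python sketch may help. The class and function names are hypothetical; the mapping (wire value minus one, error for anything outside 1..3) follows the description above, with a Python exception standing in for the fatal log.

```python
from enum import IntEnum


class TpuCoreType(IntEnum):
    """0-based runtime enum used to index the arity-3 EnumMap tables."""
    TENSOR_CORE = 0
    BARNA_CORE = 1
    SPARSE_CORE = 2


def core_type_from_proto(wire_value: int) -> TpuCoreType:
    """Map the 1-based wire value (1..3) to the 0-based runtime enum."""
    if wire_value not in (1, 2, 3):
        raise ValueError("Invalid core type")
    return TpuCoreType(wire_value - 1)
```

A per-core table can then be modelled as a three-element list indexed by the returned value, for example `queues[core_type_from_proto(3)]` for the SparseCore slot.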
The Target Field Block
The capability constants from chip_parts do not live in TpuChipConfig; they live in the xla::jellyfish::Target object, filled at boot by TpuChipParts::FromProto (0x20B1B400) and TpuMemoryParts::FromProto (0x20B333A0). Each constant has a fixed struct offset, read by a one-line accessor. The offset map below was read byte-exact from the accessor disassembly (e.g. HbmSizeBytes is literally return *((int64_t*)this + 138), i.e. Target+0x450).
| Target off | Accessor (VA) | Type | Holds |
|---|---|---|---|
| +0x398 | Target::TpuVersionToString (0x12772CC0) | int32 | tpu_version (0..5 → jellyfish/dragonfish/pufferfish/viperfish/ghostlite/6acc60406) |
| +0x3B8 | LaneCount/SublaneCount/… (0x1D60F400) | TpuTopology* | geometry descriptor pointer |
| +0x3B8→+0x198 | LaneCount (0x1D60F400) | int64 | lane_count (=128 all gens) |
| +0x3B8→+0x1A0 | SublaneCount (0x1D60F300) | int64 | sublane_count (=8 all gens) |
| +0x450 | HbmSizeBytes (0x1D615320) | int64 | HBM size (v7x: 102,005,473,280) |
| +0x458 | VmemSizeBytes (0x1D615E00) | int32 | VMEM size (v7x: 67,108,864) |
| +0x460 | CmemSizeBytes (0x1D615E20) | int64 | CMEM size (v7x: 0) |
| +0x468 | SflagSizeBytes (0x1D615E60) | int32 | SFLAG size (v7x: 16,384) |
| +0x470 | SmemSizeBytes (0x1D615E40) | int32 | SMEM size (v7x: 1,048,576) |
| +0x50C | VmemWordSizeBytes (0x1D617300) | int32 | VMEM word (v7x: 512) |
| +0x90C | TensorCoreFrequencyInMegaHertz (0x1D615B60) | int32 | TC freq MHz (v7x: 1900) |
| +0x910 | HbmFrequencyInMegaHertz (0x1D615BA0) | int32 | HBM freq MHz (v7x: 7200) |
The offsets resolve directly from the accessor bodies: LaneCount returns *(int64_t*)(*((void**)this + 119) + 408) — this+119*8 = Target+0x3B8 is the TpuTopology*, and +408 = +0x198 is lane_count inside it. SublaneCount reads the same descriptor at +416 = +0x1A0. HbmSizeBytes is *((int64_t*)this + 138) = Target+0x450; VmemSizeBytes is *((int32_t*)this + 278) = Target+0x458; SmemSizeBytes is +284 = +0x470; SflagSizeBytes is +282 = +0x468; VmemWordSizeBytes is +323 = +0x50C; TensorCoreFrequencyInMegaHertz is +579 = +0x90C; HbmFrequencyInMegaHertz is +580 = +0x910. See TpuTopology Struct for the +0x3B8 descriptor's full layout.
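The index-to-offset arithmetic above is simply element index times element size. The following Python sketch recomputes the offsets from the accessor indices and reads the fields out of a raw Target dump. It assumes a little-endian dump, and the names ACCESSORS, byte_offset, and read_fields are hypothetical helpers, not symbols from the binary.

```python
import struct

# (accessor, element index in the accessor body, element size, struct format)
ACCESSORS = [
    ("HbmSizeBytes", 138, 8, "<q"),
    ("VmemSizeBytes", 278, 4, "<i"),
    ("SflagSizeBytes", 282, 4, "<i"),
    ("SmemSizeBytes", 284, 4, "<i"),
    ("VmemWordSizeBytes", 323, 4, "<i"),
    ("TensorCoreFrequencyInMegaHertz", 579, 4, "<i"),
    ("HbmFrequencyInMegaHertz", 580, 4, "<i"),
]


def byte_offset(index: int, elem_size: int) -> int:
    """*((T*)this + index) addresses byte index * sizeof(T)."""
    return index * elem_size


def read_fields(target_dump: bytes) -> dict:
    """Read every mapped field from a raw dump of a Target instance."""
    return {
        name: struct.unpack_from(fmt, target_dump, byte_offset(idx, size))[0]
        for name, idx, size, fmt in ACCESSORS
    }
```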
NOTE: tpu_version at Target+0x398 is the single per-version switch that drives every dispatch: MSA family selection, loop-unroll policy, and the per-codename *Target vtable. The capability constants at +0x438..+0x510 and the geometry at +0x3B8 are the data; +0x398 is the selector that chose which chip_parts blob filled them.
Geometry Accessors and Their Consumers
ChunksPerTile
Target::ChunksPerTile (0x1D60F2C0) is the canonical example of a derived geometry value: it divides lane_count by sublane_count.
    int64 Target::ChunksPerTile():              // sub_1D60F2C0
        topo    = *((void**)this + 119)         // Target+0x3B8
        lane    = *(int64_t*)(topo + 0x198)     // lane_count
        sublane = *(int64_t*)(topo + 0x1A0)     // sublane_count
        return lane / sublane                   // 128 / 8 = 16 on every gen in this build
A 32-bit fast path is taken when both operands fit in 32 bits (they do), so the division is an unsigned 32-bit divide. With lane=128, sublane=8 on every generation, ChunksPerTile() == 16 — sixteen 8×128 lane-chunks per HBM tile. The HardwareLayout pass stamps Tile(SublaneCount(), LaneCount()) = Tile(8, 128) from the same two fields.
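As a sketch of the same derivation against a raw dump of the TpuTopology descriptor, the Python below reads the two int64 fields and divides them. It assumes a little-endian dump; the function name and the offset constants' names are hypothetical, while the offsets themselves come from the accessor bodies above.

```python
import struct

LANE_COUNT_OFF = 0x198      # lane_count inside the TpuTopology descriptor
SUBLANE_COUNT_OFF = 0x1A0   # sublane_count inside the same descriptor


def chunks_per_tile(topology_dump: bytes) -> int:
    """Derive ChunksPerTile as lane_count // sublane_count."""
    (lane,) = struct.unpack_from("<q", topology_dump, LANE_COUNT_OFF)
    (sublane,) = struct.unpack_from("<q", topology_dump, SUBLANE_COUNT_OFF)
    # Integer division; a zero sublane_count raises ZeroDivisionError here.
    return lane // sublane
```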
VexMatrixWidthToSyU32
pxc::mnemonics::VexMatrixWidthToSyU32 (one specialization per VEX instruction family, e.g. 0x1D2C20A0) is an ISA-emitter-side consumer of the lane geometry. It converts a VexMatrixWidth enum on a TensorCore-vector instruction into the concrete SyU32 matrix width to encode.
    uint32 VexMatrixWidthToSyU32(const Instr* a1):      // sub_1D2C20A0 (one of several specializations)
        switch (a1->vex_matrix_width):                  // a1[7]
            case 0: return 128                          // the default full MXU lane width
            case 1: return a1->width_field[0]           // explicit per-operand widths
            case 2: return a1->width_field[1]
            ...
            case 7: return a1->width_field[6]
            default: LOG(FATAL) << "Invalid VexMatrixWidth value"   // pf_proto_to_env_utils.h:1550
QUIRK: case 0 returns the literal 128, not a value read from Target. The MXU lane width is hardcoded in the ISA-emitter helper because it is invariant: lane_count is 128 on every generation in this build. The non-zero cases read explicit width fields off the instruction for segmented/transpose VEX variants. A reimplementation that drives matrix width purely off Target::LaneCount would still get 128 here, but would miss the per-operand override cases.
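The dispatch can be sketched in Python as follows. The listing elides cases 3 through 6; this sketch assumes they follow the same pattern as the cases that are shown (enum value n reads width field n-1), which is an inference and not confirmed here. The function and constant names are hypothetical, and a Python exception stands in for the fatal log.

```python
from typing import Sequence

DEFAULT_MATRIX_WIDTH = 128  # the literal returned by case 0


def vex_matrix_width_to_u32(width_enum: int, width_fields: Sequence[int]) -> int:
    """Resolve a VexMatrixWidth enum value to a concrete matrix width."""
    if width_enum == 0:
        return DEFAULT_MATRIX_WIDTH
    if 1 <= width_enum <= 7:
        # Assumed pattern: case n reads width_field[n - 1].
        return width_fields[width_enum - 1]
    raise ValueError("Invalid VexMatrixWidth value")
```

Keeping the default as a named constant instead of reading it from a Target model preserves the behaviour noted in the quirk above: the helper does not consult lane_count.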
Who Reads the Config
The decoded constants fan out to three consumer layers:
- Cost model. NodeCost computes wall-clock time as cycles × trip / (TensorCoreFrequencyInMegaHertz() × 1e6), reading Target+0x90C. Memory cost reads HbmSizeBytes/HBM bandwidth and the granule. See Cost Model Overview.
- ISA emitter. VexMatrixWidthToSyU32 and the bundle encoders read the lane/sublane geometry and register counts to pick instruction encodings. See ISA Overview.
- Topology / tiling. LaneCount/SublaneCount/ChunksPerTile feed HardwareLayout tile stamping and the layout-assignment tiling pass. See TpuTopology Struct.
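The cost-model formula in the first bullet is a plain unit conversion from cycles to seconds. A minimal Python sketch, with a hypothetical function name and parameter names, assuming the frequency is the integer MHz value stored at Target+0x90C:

```python
def wall_clock_seconds(cycles: int, trip_count: int, tc_freq_mhz: int) -> float:
    """cycles × trip / (frequency in MHz × 1e6), giving seconds."""
    if tc_freq_mhz <= 0:
        raise ValueError("TensorCore frequency must be positive")
    return cycles * trip_count / (tc_freq_mhz * 1e6)
```

At the v7x value of 1900 MHz, a body of 1,900,000 cycles executed once comes out to about one millisecond.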
Related Components
| Name | Relationship |
|---|---|
| TpuChipParts::DefaultsForVersion | the parallel capability resolver in tpu_chip_parts.cc; embed://tpu_chip_parts/…_chip_parts.binarypb, fatal on miss |
| TpuChipParts::FromProto | fills the Target+0x438..+0x510 capability block this page's accessors read |
| TpuChipConfig::FromProto | parses the mode config (SharedMemoryRegion / SyncFlag) into the resource object |
| TpuChipConfig::Megacore / ::Megachip / ::TcControl | the +8/+9/+10 mode-byte accessors on the config object |
| Target::IsMegachip | reads the config's Megachip byte (via Target+0x3B8→+0x18) to gate megachip-parallel compilation |
| TpuPxcDriver::InitializeCores | consumes the TpuChipConfig resource object to provision per-core queues |
| xla::jellyfish::Target | the runtime HAL object whose field block and accessors are mapped here |
Cross-References
- chip_parts.binarypb Decode: the capability blob whose decoded fields fill the Target block this page maps
- Per-Codename Constant Table: the per-generation values that land in these Target offsets
- Codename Matrix: TpuVersion ↔ codename mapping (the Target+0x398 selector)
- Per-Gen Comparison Matrix: cross-page consolidated per-generation comparison
- TpuTopology Struct: the Target+0x3B8 geometry descriptor read by the lane/sublane accessors
- Cost Model Overview: reads TensorCoreFrequencyInMegaHertz, HBM size/bandwidth, granule
- ISA Overview: reads lane/sublane geometry and register counts via the emitter