TpuChipConfig

All addresses and offsets on this page apply to libtpu.so from the libtpu-0.0.40-cp314 wheel (BuildID md5 89edbbe81c5b328a958fe628a9f2207d). Other versions differ.

Abstract

There are two embedded-proto descriptions of a TPU generation, and they are easy to confuse. chip_parts (see chip_parts.binarypb Decode) describes the silicon's capability geometry — memory sizes, MXU dimensions, clocks. TpuChipConfig describes a runtime mode: bounce buffers, sync-flag resources, and on-device transfer windows for a particular (version, variant) pair. TpuChipConfig::Create resolves an embed://tpu_chip_config/<version>_chip_configs_<alias>.binarypb resource, parses it into a TpuChipConfigProto, and materializes a TpuChipConfig object the driver layer uses to provision per-core queues and DMA buffers.

This page documents that resolution-and-materialization path and, separately, the surface through which the decoded capability constants are consumed: the xla::jellyfish::Target object. The Target is the runtime hardware-abstraction object; chip_parts fills its field block at boot, and a family of tiny accessors (LaneCount, SublaneCount, ChunksPerTile, HbmSizeBytes, TensorCoreFrequencyInMegaHertz, …) each read one field out of it. Those accessors are the API the cost model, ISA emitter, and tiling passes call; the offset map below is what makes a struct dump from a Target instance readable.

For reimplementation, the contract is:

  • The TpuChipConfig::Create path: version+variant → kChipConfigAliases lookup → embed://tpu_chip_config/… → ReadBinaryProto → FromProto.
  • The TpuChipConfig object header: the three mode booleans at +8/+9/+10 (Megacore/Megachip/TcControl) the HAL branches on, and the per-core EnumMap<TpuCoreType, …, 3> tables keyed by the runtime {kTensorCore=0, kBarnaCore=1, kSparseCore=2} enum.
  • The Target field block: which struct offset holds each decoded chip_parts constant, and the accessor that reads it.
  • The consumers: the geometry accessors (LaneCount/SublaneCount/ChunksPerTile/VexMatrixWidthToSyU32) and who calls them (cost model, ISA emitter, topology).

Config resolver: tpu::TpuChipConfig::Create @ 0x20AE98E0
Config parser: tpu::TpuChipConfig::FromProto @ 0x20AEA100
Config constructor: tpu::TpuChipConfig::TpuChipConfig @ 0x20AF6300
Source file: learning/45eac/tpu/runtime/topology/tpu_chip_config.cc (Create alias-log line 368)
Alias map: kChipConfigAliases (gtl::flat_map<TpuVersionAndVariant, …>, 4 entries)
Mode flags: Megacore +8 (0x20AFCA00), Megachip +9 (0x20AFCC00), TcControl +10 (0x20AFCC20)
Geometry descriptor: Target+0x3B8 → tpu::TpuTopology*
Capability field block: Target+0x438..+0x510 (filled by TpuChipParts::FromProto)

TpuChipConfig::Create

Purpose

Create obtains the mode configuration for a (version, variant) pair: the TpuChipConfigProto enumerates SharedMemoryRegion records and SyncFlag records (parsed by FromProto) that the driver uses to lay out bounce buffers and sync-flag resources per core. It is the mode-config sibling of chip_parts's capability-geometry resolver TpuChipParts::DefaultsForVersion (0x20B1B040). The two are deliberate parallels — same shape, same embed:// mechanism — but they live in different translation units: Create is in tpu_chip_config.cc, DefaultsForVersion is in tpu_chip_parts.cc (its fatal-log strings cite tpu_chip_parts.cc:341,343,345).

Algorithm

StatusOr<TpuChipConfig> TpuChipConfig::Create(TpuVersion v, string_view variant,   // sub_20AE98E0
                                              string_view config_variant):
    // 1. Resolve a (version, variant) alias to a config-variant name.
    key = TpuVersionAndVariant{v, variant}
    hit = kChipConfigAliases.find(key)        // 4-entry flat_map of TpuVersionAndVariant
    if hit and bcmp(hit->value, config_variant) != 0:   // alias differs from requested
        LOG(INFO) << "Resolved chip config alias " << config_variant << "->" << hit->value
                  << " for " << v << ", " << variant    // tpu_chip_config.cc:368
        alias = hit->value                              // (logged only when the alias renames the request)
    else:
        alias = config_variant                // no alias / identity: use config_variant verbatim

    // 2. Build the embed:// resource path and read it.
    name     = AsciiStrToLower(TpuVersionToString(v))
    if variant.non_empty(): StrAppend(&name, "_", variant)
    filename = CatPieces("embed://tpu_chip_config/", name, "_chip_configs_", alias, ".binarypb")
    status   = tsl::ReadBinaryProto(Env::Default(), filename, &proto)   // tpu_chip_config.cc:379-383
    if status != Ok:
        return StatusBuilder(...) << "Failed to obtain chip config \"" << alias
                                  << "\" for version " << v << ", variant \"" << variant << "\""
    return TpuChipConfig::FromProto(proto)    // sub_20AEA100
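
The two steps above can be sketched in Python as a pure path-building function. This is a minimal illustration, not the binary's code: the function name, the `EXAMPLE_ALIASES` contents, and the version and variant strings are all hypothetical, since this page does not list the four real alias entries or the exact output of TpuVersionToString.

```python
def resolve_chip_config_path(version_name, variant, config_variant, aliases):
    """Return the embed:// resource name that Create would try to read."""
    # Step 1: alias lookup keyed by (version, variant). When there is no hit,
    # or the hit equals the request, the requested name is used verbatim.
    alias = aliases.get((version_name, variant), config_variant)

    # Step 2: lower-cased version name, plus "_<variant>" when variant is set.
    name = version_name.lower()
    if variant:
        name += "_" + variant
    return f"embed://tpu_chip_config/{name}_chip_configs_{alias}.binarypb"


# Hypothetical alias table; the real kChipConfigAliases entries are not shown here.
EXAMPLE_ALIASES = {("Pufferfish", "lite"): "default"}

aliased = resolve_chip_config_path("Pufferfish", "lite", "legacy", EXAMPLE_ALIASES)
verbatim = resolve_chip_config_path("Pufferfish", "", "legacy", EXAMPLE_ALIASES)
```

The sketch stops at the path: reading the blob and handing it to FromProto is where the real function can return a non-OK Status.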

QUIRK — both resolvers use the embed:// scheme, but the resource directory is tpu_chip_config/ here and tpu_chip_parts/ in DefaultsForVersion, and the suffix is _chip_configs_<alias>.binarypb (note the plural and the alias slot) versus _chip_parts.binarypb. The two resolvers target different resource families. chip_config blobs are the *_chip_configs_*.binarypb entries in the embedded TOC; do not conflate them with the nine *_chip_parts.binarypb capability blobs. Unlike DefaultsForVersion, Create returns a Status on failure rather than CHECK-failing: a missing config is recoverable, a missing capability description is fatal.

What FromProto Materializes

TpuChipConfig::FromProto (0x20AEA100) walks the TpuChipConfigProto, iterating its SharedMemoryRegion repeated field and a SyncFlag repeated field, inserting each into the TpuChipConfig's flat_hash_map<TpuSharedMemoryOnChip, BounceBuffer> and sync-flag-resource tables. The TpuChipConfig constructor (0x20AF6300) takes that map plus three leading booleans and the per-core layout. The result is the object TpuPxcDriver::InitializeCores (0xE806500) and TpuChipCommonImpl::RegisterContinuationQueueConfigs (0xE72B340) consume to provision hardware queues — it is a resource object, not the capability Target.

The TpuChipConfig Object Header

The first 11 bytes of the constructed object are a fixed header the constructor writes from its leading scalar arguments (0x20AF6300 stores a2 at +0, a3 at +8, a4 at +9, a5 at +10), and three one-line const accessors expose the three mode flags. These are the per-config booleans a reimplementation must reproduce verbatim, because the HAL branches on them at boot:

Off | Accessor (VA) | Type | Holds
+0x00 | (constructor a2) | int64 | leading scalar word (per-config count)
+0x08 | Megacore (0x20AFCA00) | bool | megacore mode (name CONFIRMED; semantics inferred)
+0x09 | Megachip (0x20AFCC00) | bool | megachip mode (name CONFIRMED; semantics inferred)
+0x0A | TcControl (0x20AFCC20) | bool | TensorCore-control mode (name CONFIRMED; semantics inferred)

Each accessor is literally return *((uint8_t*)this + N): Megacore reads +8 (0x20AFCA00), Megachip reads +9 (0x20AFCC00), TcControl reads +10 (0x20AFCC20). The byte positions and accessor symbols are byte-exact from the disassembly; the meaning of each mode (megacore as a two-TensorCore fused topology, megachip as a multi-die-as-one-device topology) is the standard TPU interpretation, inferred rather than proven from this binary. xla::jellyfish::Target::IsMegachip (0x10914F60) consumes the +9 byte through the config pointer the Target holds at Target+0x3B8→+0x18 (*(TpuChipConfig**)([Target+0x3B8]+24)), and gates its result on a positive [Target+0x3B8]+148 count, so megachip-parallel compilation requires both the Megachip byte set and a non-zero descriptor count.

QUIRK — these three bytes are mode state on the TpuChipConfig resource object, not capability geometry. They do not live in the Target field block this page also maps; do not look for Megachip at a Target offset. The selector that picks which config blob was loaded is the variant string fed to Create, and the megacore/megachip booleans inside that blob are what Megacore()/Megachip() then surface.
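
As a rough illustration of the header layout, the sketch below decodes the first 11 bytes of a raw TpuChipConfig dump with Python's `struct` module. It assumes a little-endian host; the function names, dictionary keys, and sample bytes are hypothetical, and `is_megachip` only mirrors the two conditions described above (the width of the descriptor count is not stated on this page, so it is passed in as a plain integer).

```python
import struct

# Little-endian, no padding: int64 at +0, then one bool byte at +8, +9, +10.
HEADER = struct.Struct("<q???")


def parse_config_header(buf):
    """Decode the fixed 11-byte header from a raw object dump."""
    count, megacore, megachip, tc_control = HEADER.unpack_from(buf, 0)
    return {
        "count": count,            # +0x00 leading scalar word
        "megacore": megacore,      # +0x08
        "megachip": megachip,      # +0x09
        "tc_control": tc_control,  # +0x0A
    }


def is_megachip(header, descriptor_count):
    """Both conditions must hold: Megachip byte set and a positive count."""
    return bool(header["megachip"]) and descriptor_count > 0


# Synthetic bytes for illustration only, not taken from a real config blob.
sample = struct.pack("<q", 4) + bytes([1, 0, 1])
header = parse_config_header(sample)
```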

Per-Core Layout Keying

After the header the constructor stores a long tail of per-core-type tables: EnumMap<TpuCoreType, vector<InfeedQueue>, 3>, EnumMap<TpuCoreType, vector<ContinuationQueue>, 3>, EnumMap<TpuCoreType, vector<OutfeedQueue>, 3>, EnumMap<TpuCoreType, UserInterrupts, 3>, and so on (visible in the constructor's mangled signature). Every one is an EnumMap of arity 3, keyed by the runtime tpu::TpuCoreType enum. That enum is {kTensorCore=0, kBarnaCore=1, kSparseCore=2}; TpuCoreTypeFromProto (0x20B36840) maps wire TENSOR_CORE=1→0, BARNA_CORE=2→1, SPARSE_CORE=3→2 (proto value minus one, with a fatal "Invalid core type" at tpu_chip_enums.cc:301 for anything else). A reimplementation that wants to index the per-core queue tables must use this 0-based runtime enum, not the 1-based proto enum.
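
The wire-to-runtime mapping is small enough to sketch directly. In the Python below, the class and function names and the sample queue table are hypothetical; the sketch raises ValueError where the binary logs a fatal error.

```python
from enum import IntEnum


class TpuCoreType(IntEnum):
    """0-based runtime enum used to index the arity-3 per-core tables."""
    TENSOR_CORE = 0
    BARNA_CORE = 1
    SPARSE_CORE = 2


def core_type_from_proto(wire_value):
    """Map the 1-based proto enum to the runtime enum (wire value minus one)."""
    if wire_value not in (1, 2, 3):
        raise ValueError("Invalid core type")
    return TpuCoreType(wire_value - 1)


# Hypothetical per-core queue table, indexed by the runtime enum.
queues = [["tc-q0"], [], ["sc-q0", "sc-q1"]]
sparse_queues = queues[core_type_from_proto(3)]
```

Indexing with the raw wire value instead would read the wrong slot for the first two core types and run off the end for SPARSE_CORE.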


The Target Field Block

The capability constants from chip_parts do not live in TpuChipConfig; they live in the xla::jellyfish::Target object, filled at boot by TpuChipParts::FromProto (0x20B1B400) and TpuMemoryParts::FromProto (0x20B333A0). Each constant has a fixed struct offset, read by a one-line accessor. The offset map below was read byte-exact from the accessor disassembly (e.g. HbmSizeBytes is literally return *((int64_t*)this + 138), i.e. Target+0x450).

Target off | Accessor (VA) | Type | Holds
+0x398 | Target::TpuVersionToString (0x12772CC0) | int32 | tpu_version (0..5 → jellyfish/dragonfish/pufferfish/viperfish/ghostlite/6acc60406)
+0x3B8 | LaneCount/SublaneCount/… (0x1D60F400) | TpuTopology* | geometry descriptor pointer
+0x3B8→+0x198 | LaneCount (0x1D60F400) | int64 | lane_count (=128 all gens)
+0x3B8→+0x1A0 | SublaneCount (0x1D60F300) | int64 | sublane_count (=8 all gens)
+0x450 | HbmSizeBytes (0x1D615320) | int64 | HBM size (v7x: 102,005,473,280)
+0x458 | VmemSizeBytes (0x1D615E00) | int32 | VMEM size (v7x: 67,108,864)
+0x460 | CmemSizeBytes (0x1D615E20) | int64 | CMEM size (v7x: 0)
+0x468 | SflagSizeBytes (0x1D615E60) | int32 | SFLAG size (v7x: 16,384)
+0x470 | SmemSizeBytes (0x1D615E40) | int32 | SMEM size (v7x: 1,048,576)
+0x50C | VmemWordSizeBytes (0x1D617300) | int32 | VMEM word (v7x: 512)
+0x90C | TensorCoreFrequencyInMegaHertz (0x1D615B60) | int32 | TC freq MHz (v7x: 1900)
+0x910 | HbmFrequencyInMegaHertz (0x1D615BA0) | int32 | HBM freq MHz (v7x: 7200)

The offsets resolve directly from the accessor bodies. LaneCount returns *(int64_t*)(*((void**)this + 119) + 408): this + 119*8 = Target+0x3B8 is the TpuTopology*, and +408 = +0x198 is lane_count inside it. SublaneCount reads the same descriptor at +416 = +0x1A0. HbmSizeBytes is *((int64_t*)this + 138) = Target+0x450 (138 × 8 bytes); VmemSizeBytes is *((int32_t*)this + 278) = Target+0x458 (278 × 4 bytes). The remaining accessors index int32 words the same way: SmemSizeBytes is index 284 = +0x470; SflagSizeBytes is 282 = +0x468; VmemWordSizeBytes is 323 = +0x50C; TensorCoreFrequencyInMegaHertz is 579 = +0x90C; HbmFrequencyInMegaHertz is 580 = +0x910. See TpuTopology Struct for the +0x3B8 descriptor's full layout.
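
To show how the offset map turns a raw Target dump into named values, here is a minimal Python sketch. It assumes a little-endian dump; `TARGET_FIELDS`, `read_target_fields`, and the synthetic buffer are hypothetical helpers for illustration, populated with the v7x values quoted in the table rather than with bytes from a real process.

```python
import struct

# (offset, struct format) per accessor, copied from the offset map.
TARGET_FIELDS = {
    "HbmSizeBytes": (0x450, "<q"),
    "VmemSizeBytes": (0x458, "<i"),
    "CmemSizeBytes": (0x460, "<q"),
    "SflagSizeBytes": (0x468, "<i"),
    "SmemSizeBytes": (0x470, "<i"),
    "VmemWordSizeBytes": (0x50C, "<i"),
    "TensorCoreFrequencyInMegaHertz": (0x90C, "<i"),
    "HbmFrequencyInMegaHertz": (0x910, "<i"),
}


def read_target_fields(dump):
    """Decode every mapped field from a raw Target struct dump."""
    return {
        name: struct.unpack_from(fmt, dump, off)[0]
        for name, (off, fmt) in TARGET_FIELDS.items()
    }


# Synthetic dump: zero-filled, then stamped with the v7x values from the table.
dump = bytearray(0x918)
struct.pack_into("<q", dump, 0x450, 102_005_473_280)
struct.pack_into("<i", dump, 0x458, 67_108_864)
struct.pack_into("<i", dump, 0x468, 16_384)
struct.pack_into("<i", dump, 0x470, 1_048_576)
struct.pack_into("<i", dump, 0x50C, 512)
struct.pack_into("<i", dump, 0x90C, 1900)
struct.pack_into("<i", dump, 0x910, 7200)

fields = read_target_fields(dump)
```

The lane and sublane counts are not in this flat map because they sit behind the TpuTopology pointer at +0x3B8 and need a second read.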

NOTE — tpu_version at Target+0x398 is the single per-version switch that drives every dispatch — MSA family selection, loop-unroll policy, and the per-codename *Target vtable. The capability constants at +0x438..+0x510 and the geometry at +0x3B8 are the data; +0x398 is the selector that chose which chip_parts blob filled them.


Geometry Accessors and Their Consumers

ChunksPerTile

Target::ChunksPerTile (0x1D60F2C0) is the canonical example of a derived geometry value: it divides lane_count by sublane_count.

int64 Target::ChunksPerTile():                 // sub_1D60F2C0
    topo    = *((void**)this + 119)            // Target+0x3B8
    lane    = *(int64_t*)(topo + 0x198)        // lane_count
    sublane = *(int64_t*)(topo + 0x1A0)        // sublane_count
    return lane / sublane                       // 128 / 8 = 16 on every gen in this build

A 32-bit fast path is taken when both operands fit in 32 bits (they do), so the division is an unsigned 32-bit divide. With lane=128, sublane=8 on every generation, ChunksPerTile() == 16 — sixteen 8×128 lane-chunks per HBM tile. The HardwareLayout pass stamps Tile(SublaneCount(), LaneCount()) = Tile(8, 128) from the same two fields.
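
The same derivation can be reproduced offline against a dump of the topology descriptor. The Python below is a sketch assuming little-endian storage; the function name and the synthetic descriptor buffer are hypothetical, and only the two offsets come from the accessor bodies.

```python
import struct

LANE_COUNT_OFF = 0x198     # lane_count inside the TpuTopology descriptor
SUBLANE_COUNT_OFF = 0x1A0  # sublane_count inside the same descriptor


def chunks_per_tile(topology):
    """Derive chunks-per-tile as lane_count // sublane_count."""
    lane = struct.unpack_from("<q", topology, LANE_COUNT_OFF)[0]
    sublane = struct.unpack_from("<q", topology, SUBLANE_COUNT_OFF)[0]
    return lane // sublane


# Synthetic descriptor holding only the two geometry fields.
topology = bytearray(0x1A8)
struct.pack_into("<q", topology, LANE_COUNT_OFF, 128)
struct.pack_into("<q", topology, SUBLANE_COUNT_OFF, 8)

chunks = chunks_per_tile(topology)
```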

VexMatrixWidthToSyU32

pxc::mnemonics::VexMatrixWidthToSyU32 (one specialization per VEX instruction family, e.g. 0x1D2C20A0) is an ISA-emitter-side consumer of the lane geometry. It converts a VexMatrixWidth enum on a TensorCore-vector instruction into the concrete SyU32 matrix width to encode.

uint32 VexMatrixWidthToSyU32(const Instr* a1):  // sub_1D2C20A0 (one of several specializations)
    switch (a1->vex_matrix_width):              // a1[7]
        case 0: return 128                       // the default full MXU lane width
        case 1: return a1->width_field[0]        // explicit per-operand widths
        case 2: return a1->width_field[1]
        ...
        case 7: return a1->width_field[6]
        default: LOG(FATAL) << "Invalid VexMatrixWidth value"  // pf_proto_to_env_utils.h:1550

QUIRK — case 0 returns the literal 128, not a value read from Target. The MXU lane width is hardcoded in the ISA-emitter helper because it is invariant — lane_count is 128 on every generation in this build. The non-zero cases read explicit width fields off the instruction for segmented/transpose VEX variants. A reimplementation that drives matrix width purely off Target::LaneCount would still get 128 here, but would miss the per-operand override cases.
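
A reimplementation of the switch might look like the Python sketch below. The function and parameter names are hypothetical, `width_fields` stands in for the seven per-operand width slots on the instruction, and ValueError stands in for the fatal log.

```python
DEFAULT_MATRIX_WIDTH = 128  # hardcoded in the helper, not read from Target


def vex_matrix_width_to_u32(vex_matrix_width, width_fields):
    """Resolve a VexMatrixWidth enum value to the concrete width to encode."""
    if vex_matrix_width == 0:
        return DEFAULT_MATRIX_WIDTH
    if 1 <= vex_matrix_width <= 7:
        # Cases 1..7 read the explicit per-operand width slots 0..6.
        return width_fields[vex_matrix_width - 1]
    raise ValueError("Invalid VexMatrixWidth value")


# Hypothetical width slots for one instruction.
example_widths = [8, 16, 32, 64, 96, 112, 120]
full_width = vex_matrix_width_to_u32(0, example_widths)
override = vex_matrix_width_to_u32(3, example_widths)
```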

Who Reads the Config

The decoded constants fan out to three consumer layers:

  • Cost model. NodeCost computes wall-clock time as cycles × trip / (TensorCoreFrequencyInMegaHertz() × 1e6) — reading Target+0x90C. Memory cost reads HbmSizeBytes/HBM bandwidth and the granule. See Cost Model Overview.
  • ISA emitter. VexMatrixWidthToSyU32 and the bundle encoders read the lane/sublane geometry and register counts to pick instruction encodings. See ISA Overview.
  • Topology / tiling. LaneCount/SublaneCount/ChunksPerTile feed HardwareLayout tile stamping and the layout-assignment tiling pass. See TpuTopology Struct.
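
The cost-model formula in the first bullet is simple arithmetic, sketched here in Python. The function and parameter names are hypothetical; only the formula and the 1900 MHz v7x clock come from this page.

```python
def node_seconds(cycles, trip_count, tc_freq_mhz):
    """Wall-clock seconds: cycles x trip / (frequency in MHz x 1e6)."""
    return cycles * trip_count / (tc_freq_mhz * 1e6)


# At the v7x TensorCore clock of 1900 MHz, 1.9 million cycles take about 1 ms.
one_ms = node_seconds(1_900_000, 1, 1900)
```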

Name | Relationship
TpuChipParts::DefaultsForVersion | the parallel capability resolver in tpu_chip_parts.cc; embed://tpu_chip_parts/…_chip_parts.binarypb, fatal on miss
TpuChipParts::FromProto | fills the Target+0x438..+0x510 capability block this page's accessors read
TpuChipConfig::FromProto | parses the mode config (SharedMemoryRegion / SyncFlag) into the resource object
TpuChipConfig::Megacore / ::Megachip / ::TcControl | the +8/+9/+10 mode-byte accessors on the config object
Target::IsMegachip | reads the config's Megachip byte (via Target+0x3B8→+0x18) to gate megachip-parallel compilation
TpuPxcDriver::InitializeCores | consumes the TpuChipConfig resource object to provision per-core queues
xla::jellyfish::Target | the runtime HAL object whose field block and accessors are mapped here

Cross-References